Bayesian Word Sense Induction

Samuel Brody and Mirella Lapata

EACL '09

Presenter: Kyoungrok Jang

Introduction

What is Sense Induction?

  • Sense induction is the task of automatically discovering all possible senses of an ambiguous word
  • Typically treated as an unsupervised clustering problem
    • Input
      • An ambiguous word & its accompanying context
    • Output
      • Groups of context words, each of which represents a specific word sense (use)

Their Approach

  • The contexts of an ambiguous word are modeled as samples from a multinomial distribution over senses
    • Cast in a Bayesian framework

Example of Inferred Senses

Background

Previous Methods

  • Features used
    • Co-occurrences, part-of-speech tags, grammatical relations, ...
  • Size of the context window
    • 2 words, whole sentence, 20 surrounding words, ...
  • Clustering algorithms
    • k-means, agglomerative clustering, graph-based methods, ...
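
As a toy illustration of the clustering formulation above (a simple k-means baseline, not this paper's Bayesian model), here is a minimal sketch; the contexts, vectorizer, and cluster count are made-up assumptions:

```python
# Hypothetical illustration: cluster the contexts of an ambiguous word
# with k-means, treating each cluster as one induced sense.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Toy contexts of the ambiguous word "bank" (made-up examples).
contexts = [
    "deposit money in the bank account",
    "the bank raised interest rates",
    "fishing on the river bank",
    "the muddy bank of the stream",
]

# Represent each context as a bag-of-words co-occurrence vector.
vectors = CountVectorizer().fit_transform(contexts)

# Cluster the contexts; each cluster stands for one induced sense.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0, 0, 1, 1] -> two induced senses
```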

Sense Induction Compared with WSD

  • Assumption
    • Sense induction: we don't know all the possible senses of a word
    • WSD: all the possible senses are known beforehand
  • Goal
    • Sense induction: discover all the possible senses
    • WSD: identify the intended sense of a word in context

Limitation of Traditional WSD

  1. Requires dictionaries or other lexical resources (e.g., WordNet)
    • Makes it hard to adapt to new domains
    • Not all languages have such resources
  2. The sense granularity is fixed (by the dictionary)
    • Hard to tune to meet an application's needs

Example: "Great" in WordNet

The Benefit of Sense Induction

  • By inducing a word's senses directly from the data, we can reflect the empirical uses of the word in the task and domain at hand

Their Approach

They Took a Bayesian Approach

  • Assumption
    • Different senses are signaled by different lexical distributions
  • Formulation
    • Each context word is sampled from a multinomial distribution over senses

Modeling

Each word in the context window is generated by:

  1. First sampling a sense from the sense distribution,
  2. Then choosing a word from that sense's word distribution (as sketched below)
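
A minimal sketch of this two-step generative process (the sense and word distributions below are made-up toy parameters, not values learned by the paper's model):

```python
import random

random.seed(0)

# Toy parameters for an ambiguous word with two senses.
theta = {"finance": 0.6, "river": 0.4}  # sense distribution p(s)
phi = {                                 # per-sense word distributions p(w|s)
    "finance": {"money": 0.5, "loan": 0.3, "rate": 0.2},
    "river":   {"water": 0.5, "shore": 0.3, "fish": 0.2},
}

def sample(dist):
    """Draw one item from a {item: probability} distribution."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate_context_word():
    sense = sample(theta)      # 1. sample a sense from p(s)
    word = sample(phi[sense])  # 2. sample a word from p(w|s)
    return sense, word

print([generate_context_word() for _ in range(5)])
```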

"Formal" Modeling

p(w_i) = \sum_j p(w_i \mid s_i = j)\, p(s_i = j)

where:

  • p(w_i): the probability of context word w_i
  • p(s_i = j): the probability that the sense s_i is j
  • \phi^{(j)} = p(w_i \mid s_i = j): the sense-specific word distribution
  • \theta = p(s): the sense distribution
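
A toy numeric check of the mixture formula (made-up probabilities, not values from the paper): with two senses, \theta = (0.7, 0.3), and \phi^{(1)}(w) = 0.01, \phi^{(2)}(w) = 0.20 for some context word w:

p(w) = 0.7 \times 0.01 + 0.3 \times 0.20 = 0.007 + 0.060 = 0.067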

Limitation of This Modeling

  • The model counts only words
  • In contrast, traditional supervised WSD uses a variety of information sources
    (e.g., POS tags, dependency relations)

Solution

  • Treat each information source (or feature type) as a separate layer and then combine the layers

Feature layers: word, POS, dependency (dep.)

For each layer:

  • the sense distribution (\theta) is the same (shared across all layers)
  • the sense-feature distribution (\phi) is different (one per layer)
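
A minimal sketch of the layered setup, assuming three feature layers that share a single sense distribution \theta but each carry their own \phi (toy parameters, not the paper's learned values):

```python
import random

random.seed(1)

theta = {"finance": 0.6, "river": 0.4}  # shared sense distribution (theta)

# One sense-feature distribution (phi) per layer.
phi_layers = {
    "word": {"finance": {"money": 0.7, "loan": 0.3},
             "river":   {"water": 0.7, "fish": 0.3}},
    "pos":  {"finance": {"NN": 0.6, "VB": 0.4},
             "river":   {"NN": 0.8, "JJ": 0.2}},
    "dep":  {"finance": {"nsubj": 0.5, "dobj": 0.5},
             "river":   {"prep": 0.6, "amod": 0.4}},
}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

# Every layer draws its sense from the SAME theta,
# then draws a feature from that layer's OWN phi.
for layer, phi in phi_layers.items():
    sense = sample(theta)
    print(layer, sense, sample(phi[sense]))
```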

Comparison with LDA

  • LDA
    • Global topics of the whole document
  • This paper
    • Local topics of the context window surrounding the ambiguous word

Evaluation Setup

Modeling Target

  • Focused on modeling nouns
    • Rationale: nouns constitute the largest portion of content words
      (e.g., 45% of the British National Corpus)

Features

  • Used features that are widely adopted in various WSD algorithms
    1. ±10-word window (10w)
    2. ±5-word window (5w)
    3. collocation (1w)
    4. n-grams (ng)
    5. POS n-grams (pg)
    6. dependency relations (dp)

** Lemmatized words are used
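
An illustrative sketch of extracting the window-based features (the tokenization and window logic here are simplified assumptions, not the paper's exact preprocessing):

```python
def window_features(tokens, target_index, size):
    """Context tokens within +/-size positions of the target word."""
    lo = max(0, target_index - size)
    hi = min(len(tokens), target_index + size + 1)
    return [t for i, t in enumerate(tokens[lo:hi], start=lo) if i != target_index]

tokens = "the central bank raised its interest rate again".split()
target = tokens.index("bank")

print(window_features(tokens, target, 5))  # +/-5-word window (5w)
print(window_features(tokens, target, 1))  # immediate neighbors, i.e. collocations (1w)
```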

Testing

  • Semeval-2007 benchmark dataset
    • Consists of articles from the first half of the 1989 WSJ (the same text as the Penn Treebank II corpus)
    • 35 nouns are hand-annotated with OntoNotes senses

Training

  • Used two corpora (in-domain & out-of-domain)
    • Wall Street Journal (WSJ)
      • All articles from 1987-89 and 1994, excluding those in the test set
      • Serves as the in-domain corpus
    • British National Corpus (BNC)
      • A 100-million-word collection drawn from newspapers, magazines, letters, etc.
      • Contains 730,000 instances of the 35 target nouns
      • Serves as the out-of-domain corpus

Evaluation Methodology

  • Adopted the scheme of Agirre and Soroa (2007)
    1. Split the corpus into a train set and a test set
    2. Using the train set with hand-annotated sense information, compute a mapping from each system-generated cluster to a gold-standard sense (see the sketch below)
      • This is done by counting how often each sense is assigned to a specific cluster
    3. Measure performance on the test set
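
A minimal sketch of the cluster-to-gold-sense mapping step (majority mapping over the train set; the data below is made up):

```python
from collections import Counter

# (cluster_id, gold_sense) pairs from the hand-annotated train set (toy data).
train = [(0, "finance"), (0, "finance"), (0, "river"),
         (1, "river"), (1, "river"), (1, "finance")]

# Map each cluster to the gold sense it co-occurs with most often.
counts = {}
for cluster, sense in train:
    counts.setdefault(cluster, Counter())[sense] += 1
mapping = {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}
print(mapping)  # {0: 'finance', 1: 'river'}

# At test time, cluster assignments are scored through this mapping.
test_clusters = [0, 1, 1]
print([mapping[c] for c in test_clusters])
```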

Semeval-2007 Task 02: Evaluating Word Sense Induction and Discrimination Systems

Experiments

Model Performance (by # of senses)

  • The model worked best with 4 senses on WSJ and with 8 senses on BNC
  • Why does BNC require more senses than WSJ?
    • BNC has a broader focus and is from a different domain than the test set
    • Finer granularity may have helped the model capture all the relevant distinctions

Example: "drug" senses

Model Performance (by # of features)

  • Examined which individual feature categories, and which combinations of them, are most informative
  • The results above are for WSJ; BNC showed a similar trend

Comparison to state-of-the-art (2009)

  • Systems in comparison
    • IR2 - Information Bottleneck algorithm
    • UMND2 - k-means clustering of co-occurrence vectors
    • MFS - most-frequent-sense baseline
  • Analysis
    • The proposed system is significantly better than UMND2 and numerically better than IR2 (though that difference is not statistically significant)

Discussion

The Model's Applicability

  • The model is applicable to any task that needs to take multiple types of information into account
    • e.g. document classification
      (text + image + caption)

Room for Improvement

  • More rigorous parameter estimation techniques could help boost performance
    • e.g., deriving the optimal number of senses with an infinite Dirichlet model (Teh et al., 2006)
      • In this work, the number of senses per word is fixed
