Bayesian Word Sense Induction

Samuel Brody and Mirella Lapata

EACL '09

Presenter: Kyoungrok Jang

Introduction

What is Sense Induction?

  • Sense induction is the task of automatically discovering all possible senses of an ambiguous word
  • Typically treated as an unsupervised clustering problem
    • Input
      • An ambiguous word & its accompanying context
    • Output
      • Groups of context words, each of which represents a specific word sense (use)

Their Approach

  • The contexts of an ambiguous word are modeled as samples from a multinomial distribution over senses
    • Cast in a Bayesian framework

Example of Inferred Senses

Background

Previous Methods

  • Features used
    • Co-occurrences, part-of-speech tags, grammatical relations, ...
  • Size of the context window
    • 2 words, whole sentence, 20 surrounding words, ...
  • Clustering algorithms
    • k-means, agglomerative clustering, graph-based methods, ...
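
As a toy illustration of the clustering formulation above (a simple k-means baseline, not this paper's Bayesian model), here is a minimal sketch; the contexts, vectorizer, and cluster count are made-up assumptions:

```python
# Hypothetical illustration: cluster the contexts of an ambiguous word
# with k-means, treating each cluster as one induced sense.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Toy contexts of the ambiguous word "bank" (made-up examples).
contexts = [
    "deposit money in the bank account",
    "the bank raised interest rates",
    "fishing on the river bank",
    "the muddy bank of the stream",
]

# Represent each context as a bag-of-words co-occurrence vector.
vectors = CountVectorizer().fit_transform(contexts)

# Cluster the contexts; each cluster stands for one induced sense.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0, 0, 1, 1] -> two induced senses
```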

Sense Induction Compared with WSD

  • Assumption
    • Sense induction: we don't know all the possible senses of a word
    • WSD: all the possible senses are known beforehand
  • Goal
    • Sense induction: discover all the possible senses
    • WSD: identify the intended sense of a word in context

Limitation of Traditional WSD

  1. Requires dictionaries or other lexical resources (e.g., WordNet)
    • Makes it hard to adapt to new domains
    • Not all languages have such resources
  2. The sense granularity is fixed (by the dictionary)
    • Hard to tune to meet an application's needs

Example: "Great" in WordNet

The Benefit of Sense Induction

  • By inducing a word's senses directly from the data, we can reflect the empirical uses of the word in the task and domain at hand

Their Approach

They Took a Bayesian Approach

  • Assumption
    • Different senses are signaled by different lexical distributions
  • Formulation
    • Each context word is sampled from a multinomial distribution over senses

Modeling

Each word in the context window is generated by:

  1. First sampling a sense from the sense distribution,
  2. Then choosing a word from that sense's word distribution (as sketched below)
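
A minimal sketch of this two-step generative process (the sense and word distributions below are made-up toy parameters, not values learned by the paper's model):

```python
import random

random.seed(0)

# Toy parameters for an ambiguous word with two senses.
theta = {"finance": 0.6, "river": 0.4}  # sense distribution p(s)
phi = {                                 # per-sense word distributions p(w|s)
    "finance": {"money": 0.5, "loan": 0.3, "rate": 0.2},
    "river":   {"water": 0.5, "shore": 0.3, "fish": 0.2},
}

def sample(dist):
    """Draw one item from a {item: probability} distribution."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate_context_word():
    sense = sample(theta)      # 1. sample a sense from p(s)
    word = sample(phi[sense])  # 2. sample a word from p(w|s)
    return sense, word

print([generate_context_word() for _ in range(5)])
```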

"Formal" Modeling

p(w_i) = \sum_j p(w_i \mid s_i = j)\, p(s_i = j)

where:

  • p(w_i): the probability of context word w_i
  • p(s_i = j): the probability that the sense s_i is j
  • \phi^{(j)} = p(w_i \mid s_i = j): the sense-specific word distribution
  • \theta = p(s): the sense distribution
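
A toy numeric check of the mixture formula (made-up probabilities, not values from the paper): with two senses, \theta = (0.7, 0.3), and \phi^{(1)}(w) = 0.01, \phi^{(2)}(w) = 0.20 for some context word w:

p(w) = 0.7 \times 0.01 + 0.3 \times 0.20 = 0.007 + 0.060 = 0.067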

Limitation of This Modeling

  • The model counts only words
  • In contrast, traditional supervised WSD uses a variety of information sources
    (e.g., POS tags, dependency relations)

Solution

  • Treat each information source (or feature type) as a separate layer and then combine the layers

Feature layers: word, POS, dependency (dep.)

For each layer:

  • the sense distribution (\theta) is the same (shared across all layers)
  • the sense-feature distribution (\phi) is different (one per layer)
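
A minimal sketch of the layered setup, assuming three feature layers that share a single sense distribution \theta but each carry their own \phi (toy parameters, not the paper's learned values):

```python
import random

random.seed(1)

theta = {"finance": 0.6, "river": 0.4}  # shared sense distribution (theta)

# One sense-feature distribution (phi) per layer.
phi_layers = {
    "word": {"finance": {"money": 0.7, "loan": 0.3},
             "river":   {"water": 0.7, "fish": 0.3}},
    "pos":  {"finance": {"NN": 0.6, "VB": 0.4},
             "river":   {"NN": 0.8, "JJ": 0.2}},
    "dep":  {"finance": {"nsubj": 0.5, "dobj": 0.5},
             "river":   {"prep": 0.6, "amod": 0.4}},
}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

# Every layer draws its sense from the SAME theta,
# then draws a feature from that layer's OWN phi.
for layer, phi in phi_layers.items():
    sense = sample(theta)
    print(layer, sense, sample(phi[sense]))
```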

Comparison with LDA

  • LDA
    • Global topics of the whole document
  • This paper
    • Local topics of the context window surrounding the ambiguous word

Evaluation Setup

Modeling Target

  • Focused on modeling nouns
    • Rationale: nouns constitute the largest portion of content words
      (e.g., 45% of the British National Corpus)

Features

  • Used features that are widely adopted in various WSD algorithms
    1. ±10-word window (10w)
    2. ±5-word window (5w)
    3. collocation (1w)
    4. n-grams (ng)
    5. POS n-grams (pg)
    6. dependency relations (dp)

** Lemmatized words are used
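
An illustrative sketch of extracting the window-based features (the tokenization and window logic here are simplified assumptions, not the paper's exact preprocessing):

```python
def window_features(tokens, target_index, size):
    """Context tokens within +/-size positions of the target word."""
    lo = max(0, target_index - size)
    hi = min(len(tokens), target_index + size + 1)
    return [t for i, t in enumerate(tokens[lo:hi], start=lo) if i != target_index]

tokens = "the central bank raised its interest rate again".split()
target = tokens.index("bank")

print(window_features(tokens, target, 5))  # +/-5-word window (5w)
print(window_features(tokens, target, 1))  # immediate neighbors, i.e. collocations (1w)
```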

Testing

  • Semeval-2007 benchmark dataset
    • Consists of articles from the first half of the 1989 WSJ (the same text as the Penn Treebank II corpus)
    • 35 nouns are hand-annotated with OntoNotes senses

Training

  • Used two corpora (in-domain & out-of-domain)
    • Wall Street Journal (WSJ)
      • All articles from 1987-89 and 1994, excluding those in the test set
      • Serves as the in-domain corpus
    • British National Corpus (BNC)
      • A 100-million-word collection drawn from newspapers, magazines, letters, etc.
      • Contains 730,000 instances of the 35 target nouns
      • Serves as the out-of-domain corpus

Evaluation Methodology

  • Adopted the scheme of Agirre and Soroa (2007)
    1. Split the corpus into a train set and a test set
    2. Using the train set with hand-annotated sense information, compute a mapping from each system-generated cluster to a gold-standard sense (see the sketch below)
      • This is done by counting how often each sense is assigned to a specific cluster
    3. Measure performance on the test set
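
A minimal sketch of the cluster-to-gold-sense mapping step (majority mapping over the train set; the data below is made up):

```python
from collections import Counter

# (cluster_id, gold_sense) pairs from the hand-annotated train set (toy data).
train = [(0, "finance"), (0, "finance"), (0, "river"),
         (1, "river"), (1, "river"), (1, "finance")]

# Map each cluster to the gold sense it co-occurs with most often.
counts = {}
for cluster, sense in train:
    counts.setdefault(cluster, Counter())[sense] += 1
mapping = {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}
print(mapping)  # {0: 'finance', 1: 'river'}

# At test time, cluster assignments are scored through this mapping.
test_clusters = [0, 1, 1]
print([mapping[c] for c in test_clusters])
```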

Semeval-2007 Task 02: Evaluating Word Sense Induction and Discrimination Systems

Experiments

Model Performance (by # of senses)

  • The model worked best with 4 senses on WSJ and with 8 senses on BNC
  • Why does BNC require more senses than WSJ?
    • BNC has a broader focus and is from a different domain than the test set
    • Finer granularity may have helped the model capture all the relevant distinctions

Example: "drug" senses

Model Performance (by # of features)

  • Examined which individual feature categories, and which combinations of them, are most informative
  • The results above are for WSJ; BNC showed a similar trend

Comparison to state-of-the-art (2009)

  • Systems in comparison
    • IR2 - Information Bottleneck algorithm
    • UMND2 - k-means clustering of co-occurrence vectors
    • MFS - most-frequent-sense baseline
  • Analysis
    • The proposed system is significantly better than UMND2 and numerically better than IR2 (though that difference is not statistically significant)

Discussion

The Model's Applicability

  • The model is applicable to any task that needs to take multiple types of information into account
    • e.g. document classification
      (text + image + caption)

Room for Improvement

  • More rigorous parameter estimation techniques could help boost performance
    • e.g., deriving the optimal number of senses with an infinite Dirichlet model (Teh et al., 2006)
      • In this work, the number of senses per word is fixed
