GloVe

Global Vectors for Word Representation

 

Jeffrey Pennington

Richard Socher

Christopher D. Manning

 

Computer Science Department, Stanford University

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543

October 25-29, 2014, Doha, Qatar. ©2014 Association for Computational Linguistics

Introduction

Word Representation

  • Semantic vector space models of language represent each word with a real-valued vector.
  • Word vectors can be used as features in tasks such as:
    • Information Retrieval
    • Document Classification
    • Question Answering
    • Named Entity Recognition

Evaluation

  • Calculating the distance between pairs of word vectors (e.g., cosine similarity, as sketched below).
    • happy is similar to joyful.
    • angry is similar to mad.
  • Calculating differences between word vectors.
    • king - queen = man - woman
    • go - went = eat - ate
    • child - children = person - people
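As a concrete illustration of the distance-based evaluation, here is a minimal sketch using cosine similarity; the 3-dimensional vectors are invented for illustration only (trained GloVe vectors have 50 to 300 dimensions):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for illustration only; they are not real GloVe embeddings.
vectors = {
    "happy":  np.array([0.9, 0.1, 0.3]),
    "joyful": np.array([0.8, 0.2, 0.35]),
    "angry":  np.array([-0.7, 0.9, 0.1]),
}

print(cosine_similarity(vectors["happy"], vectors["joyful"]))  # close to 1.0
print(cosine_similarity(vectors["happy"], vectors["angry"]))   # noticeably lower
```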

Vector Learning Methods

  • Global matrix factorization methods.
    • Ex: Latent Semantic Analysis (LSA).
  • Local context window methods.
    • Ex: Skip-gram model.

Global Matrix Factorization

  • These methods efficiently leverage statistical information.
  • They perform relatively poorly on the word analogy task.

Local Context Window

  • These methods train on separate local context windows.
  • They poorly utilize the statistics of the corpus.
  • They perform better on the analogy task.

In This Work

  • Proposing a global log-bilinear regression model that trains on global word-word co-occurrence counts.
  • This method outperforms other models on several word similarity tasks.

The GloVe Model

The GloVe Model

  • Let the matrix of word-word co-occurrence counts be denoted by \(X\).
  • \(X_{ij}\) tabulates the number of times word \(j\) occurs in the context of word \(i\).
  • \(X_i=\sum_kX_{ik}\) is the number of times any word appears in the context of word \(i\).
  • \(P_{ij}=P(j|i)=X_{ij}/X_i\) is the probability that word \(j\) appears in the context of word \(i\).

Co-occurrence Table

  • The ratio \(P_{ik}/P_{jk}\) is better able to distinguish relevant words from irrelevant words than the raw probabilities (a toy sketch follows).
    • For example, with \(i=\) ice and \(j=\) steam, a probe word like solid yields a large ratio, gas a small one, and a neutral word like water a ratio near 1.
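A minimal sketch of how \(X\), \(X_i\), \(P_{ij}\), and the ratio can be computed; the five-word toy corpus and the window size are contrived for illustration and are not the paper's setup:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=1):
    """X[i][j]: number of times word j occurs within `window` tokens of word i."""
    X = defaultdict(lambda: defaultdict(float))
    for pos, word in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                X[word][tokens[ctx]] += 1.0
    return X

tokens = "solid ice water steam gas".split()   # contrived toy corpus
X = cooccurrence_counts(tokens, window=1)

def P(j, i):
    """P(j | i) = X_ij / X_i, the probability that j appears in the context of i."""
    X_i = sum(X[i].values())
    return X[i][j] / X_i if X_i else 0.0

# The ratio P(k|ice) / P(k|steam) is large for words related to ice only,
# small for words related to steam only, and close to 1 for shared context words.
for k in ["solid", "gas", "water"]:
    denom = P(k, "steam")
    print(k, P(k, "ice") / denom if denom else float("inf"))
```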

The Most General Model

\(F(w_i,w_j,\tilde w_k)=\frac{P_{ik}}{P_{jk}}\)   (1)

  • \(w \in \mathbb{R}^d\) are word vectors and \(\tilde w \in \mathbb{R}^d\) are context word vectors.
  • \(F\) is going to encode the information present in the ratio \(P_{ik}/P_{jk}\).
  • Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences.

\(F(w_i-w_j,\tilde w_k)=\frac{P_{ik}}{P_{jk}}\)   (2)

  • The arguments of \(F\) are vectors while the right-hand side is a scalar, so the dot product is used to prevent \(F\) from mixing the vector dimensions.

\(F((w_i-w_j)^T\tilde w_k)=\frac{P_{ik}}{P_{jk}}\)   (3)

  • Next, Eqn. (3) can be rewritten as,

\(F((w_i-w_j)^T\tilde w_k)=\frac{F(w^T_i\tilde w_k)}{F(w^T_j\tilde w_k)}\)   (4)

  • which is solved by,

\(F(w_i^T\tilde w_k)=P_{ik}=\frac{X_{ik}}{X_i}\)   (5)

  • Eqn. (4) requires \(F\) to turn subtraction into division, \(F(a-b)=F(a)/F(b)\), so its solution is \(F=\exp\), or,

\(w_i^T\tilde w_k=\log(P_{ik})=\log(X_{ik})-\log(X_i)\)   (6)

  • Finally, absorbing \(\log(X_i)\) into a bias \(b_i\) and adding \(\tilde b_k\) to restore symmetry gives,

\(w_i^T\tilde w_k+b_i+\tilde b_k=\log(X_{ik})\)   (7)

  • A main drawback to this model is that it weights all co-occurrences equally.
  • The authors propose a weighting function \(f(X_{ij})\) to address this problem, leading to the cost function

\(J=\sum_{i,j=1}^{V}f(X_{ij})\left(w_i^T\tilde w_j+b_i+\tilde b_j-\log X_{ij}\right)^2\)   (8)

  • where \(V\) is the size of the vocabulary.
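A minimal numpy sketch of the cost in Eqn. (8), assuming a small dense co-occurrence matrix; the weighting function is passed in as a callable because its concrete form is only given on the next slide:

```python
import numpy as np

def glove_cost(W, W_ctx, b, b_ctx, X, f):
    """Weighted least-squares cost J of Eqn. (8).

    W, W_ctx : (V, d) word vectors and context word vectors
    b, b_ctx : (V,) word biases and context biases
    X        : (V, V) co-occurrence counts
    f        : weighting function applied to each X_ij
    """
    cost = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:  # since f(0) = 0, zero entries contribute nothing
                diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
                cost += f(X[i, j]) * diff ** 2
    return cost

# Example with random parameters and a trivial placeholder weighting:
rng = np.random.default_rng(0)
V, d = 5, 10
X = rng.integers(0, 20, size=(V, V)).astype(float)
W, Wc = 0.1 * rng.normal(size=(V, d)), 0.1 * rng.normal(size=(V, d))
b, bc = np.zeros(V), np.zeros(V)
print(glove_cost(W, Wc, b, bc, X, f=lambda x: 1.0))
```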

Cost Function

  • The weighting function should obey the following properties:
    1. \(f(0)=0\). It should vanish as \(x\to 0\) fast enough that \(\lim_{x\to 0}f(x)\log^2 x\) is finite.
    2. \(f(x)\) should be non-decreasing.
    3. \(f(x)\) should be relatively small for large values of \(x\).
  • The authors found the following function to work well:
    • \(f(x)=\begin{cases}(x/x_{max})^a,& \text{if }x < x_{max}\\1,& \text{otherwise}\end{cases}\)
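A minimal sketch of this weighting function with the paper's settings \(x_{max}=100\) and \(a=3/4\):

```python
def weighting(x, x_max=100.0, a=0.75):
    """f(x) = (x / x_max)^a below the cutoff, 1 otherwise."""
    return (x / x_max) ** a if x < x_max else 1.0

# Rare co-occurrences get small weights; frequent ones are capped at 1.
for x in [1, 10, 100, 1000]:
    print(x, round(weighting(x), 3))   # 0.032, 0.178, 1.0, 1.0
```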

Weighting Function

Weighting Function

  • The performance of the model depends weakly on the cutoff.
    • It is fixed to \(x_{max}=100\) for all experiments.
  • The authors found that \(a=3/4\) gives a modest improvement over a linear version with \(a=1\).

Experiments

Corpora

  • The model is trained on 5 corpora of varying sizes (token counts in parentheses):
    • 2010 Wikipedia dump (1B)
    • 2014 Wikipedia dump (1.6B)
    • Gigaword 5 (4.3B)
    • Gigaword 5 + Wikipedia 2014 (6B)
    • Common Crawl web data (42B)

Preprocessing

  • Each corpus is tokenized and lowercased with the Stanford tokenizer.
  • A vocabulary of the 400,000 most frequent words is built.

Training Details

  • \(x_{max}=100\) and \(a=3/4\).
  • Using AdaGrad as the optimizer.
  • Initial learning rate is 0.05.
  • 50 iterations for vectors with dimension smaller than 300.
  • 100 iterations otherwise.
  • Using a context window of 10 words to the left and 10 words to the right.
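A minimal sketch of this training setup (AdaGrad with initial learning rate 0.05), assuming a small dense co-occurrence matrix held in memory; the released implementation iterates over shuffled nonzero records from disk, which this sketch omits. The final vectors are returned as \(W+\tilde W\), as reported in the paper:

```python
import numpy as np

def train_glove(X, dim=50, iterations=50, lr=0.05, x_max=100.0, a=0.75, seed=0):
    """Fit GloVe vectors to a dense co-occurrence matrix X with AdaGrad."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    W = (rng.random((V, dim)) - 0.5) / dim    # word vectors
    Wc = (rng.random((V, dim)) - 0.5) / dim   # context word vectors
    b, bc = np.zeros(V), np.zeros(V)          # word and context biases
    # Per-parameter AdaGrad accumulators (sums of squared gradients), initialised to 1
    gW, gWc = np.ones_like(W), np.ones_like(Wc)
    gb, gbc = np.ones(V), np.ones(V)

    pairs = list(zip(*np.nonzero(X)))
    for _ in range(iterations):
        for i, j in pairs:
            f = (X[i, j] / x_max) ** a if X[i, j] < x_max else 1.0
            diff = W[i] @ Wc[j] + b[i] + bc[j] - np.log(X[i, j])
            grad_wi = f * diff * Wc[j]   # gradient w.r.t. w_i (constant factor absorbed)
            grad_wj = f * diff * W[i]    # gradient w.r.t. the context vector
            grad_b = f * diff            # gradient w.r.t. both biases
            W[i] -= lr * grad_wi / np.sqrt(gW[i])
            Wc[j] -= lr * grad_wj / np.sqrt(gWc[j])
            b[i] -= lr * grad_b / np.sqrt(gb[i])
            bc[j] -= lr * grad_b / np.sqrt(gbc[j])
            gW[i] += grad_wi ** 2
            gWc[j] += grad_wj ** 2
            gb[i] += grad_b ** 2
            gbc[j] += grad_b ** 2
    return W + Wc   # sum of word and context vectors, as used in the paper
```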

Experiments

  • The experiments consist of 3 parts:
    1. Word analogy task. (Mikolov et al., 2013a)
    2. A variety of similarity tasks. (Luong et al., 2013)
    3. CoNLL-2003 NER. (Tjong Kim Sang and De Meulder, 2003)

Word Analogies

  • Consisting of questions like, "\(a\) is to \(b\) as \(c\) is to __?".
  • The dataset contains 19,544 questions.
  • The answer is the word \(d\) whose vector \(w_d\) is closest to \(w_b-w_a+w_c\) according to cosine similarity (sketched below).
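A minimal sketch of this retrieval rule, assuming `vectors` is a dict mapping words to numpy arrays (e.g. loaded from a pretrained embedding file); following the common convention for this benchmark, the three question words are excluded from the candidates:

```python
import numpy as np

def solve_analogy(vectors, a, b, c):
    """Return the word d whose vector is most cosine-similar to w_b - w_a + w_c."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the question words themselves are excluded
        score = np.dot(vec, target) / np.linalg.norm(vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Usage with pretrained vectors:
# solve_analogy(vectors, "man", "king", "woman")  # ideally returns "queen"
```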

Word Similarity

  • This part includes 5 tasks:
    1. WordSim-353 (Finkelstein et al., 2001)
    2. MC (Miller and Charles, 1991)
    3. RG (Rubenstein and Goodenough, 1965)
    4. SCWS (Huang et al., 2012)
    5. RW (Luong et al., 2013)

Named Entity Recognition

  • The CoNLL-2003 dataset is annotated with 4 entity types:
    1. Person
    2. Location
    3. Organization
    4. Miscellaneous
  • The model is trained on the CoNLL-03 training data.
  • Testing with 3 datasets:
    1. CoNLL-03 testing data.
    2. ACE Phase 2 and ACE-2003 data.
    3. MUC7 Formal Run test set.

Comparison

  • Comparing with the published results of state-of-the-art models.
  • Using the word2vec tool to produce the word2vec results.
    • Both skip-gram (SG) and CBOW models are trained.
  • Using SVDs as baselines.
    • SVD-S takes the SVD of \(\sqrt{X_{trunc}}\).
    • SVD-L takes the SVD of \(\log(1+X_{trunc})\).
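A minimal sketch of these SVD baselines, assuming `X_trunc` is a small dense numpy array holding the co-occurrence counts of the most frequent words; whether the left singular vectors are additionally scaled by the singular values is a design choice not specified on this slide:

```python
import numpy as np

def svd_baseline(X_trunc, dim, variant="L"):
    """SVD-S factorizes sqrt(X_trunc); SVD-L factorizes log(1 + X_trunc)."""
    if variant == "S":
        M = np.sqrt(X_trunc)
    elif variant == "L":
        M = np.log1p(X_trunc)
    else:                      # plain SVD baseline on the raw counts
        M = X_trunc.astype(float)
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :dim]          # optionally scale columns by s[:dim]
```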

Results

Results

  • Results using the word2vec tool are better than most of the previously published results.
  • Increasing the corpus size does not guarantee improved results.

Word Similarity Tasks

  Model   Size  WS353  MC    RG    SCWS  RW
  SVD     6B    35.3   35.1  42.5  38.3  25.6
  SVD-S   6B    56.5   71.5  71.0  53.6  34.7
  SVD-L   6B    65.7   72.7  75.1  56.5  37.0
  CBOW    6B    57.2   65.6  68.2  57.0  32.5
  SG      6B    62.8   65.2  69.7  58.1  37.2
  GloVe   6B    65.8   72.7  77.8  53.9  38.1
  SVD-L   42B   74.0   76.4  74.1  58.3  39.9
  GloVe   42B   75.9   83.6  82.9  59.6  47.8
  CBOW    100B  68.4   79.6  75.4  59.4  45.5
  • A similarity score is obtained from the word vectors by first normalizing each feature across the vocabulary and then calculating the cosine similarity.
  • Computing Spearman's rank correlation coefficient between this score and the human judgements.
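A minimal sketch of this evaluation protocol; `vectors` (a word-to-array dict whose features have already been normalized across the vocabulary, as described above) and the (word1, word2, human_score) triples are assumed inputs:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, pairs_with_scores):
    """Spearman rank correlation between cosine scores and human judgements."""
    model_scores, human_scores = [], []
    for w1, w2, human in pairs_with_scores:
        u, v = vectors[w1], vectors[w2]
        model_scores.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        human_scores.append(human)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```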

NER Task

  • F1 score on NER task with 50d vectors.
  • HPCA, HSMN, and CW are publicly-available vectors.
  Model     Dev   Test  ACE   MUC7
  Discrete  91.0  85.4  77.4  73.4
  SVD       90.8  85.7  77.3  73.7
  SVD-S     91.0  85.5  77.6  74.3
  SVD-L     90.5  84.8  73.6  71.5
  HPCA      92.6  88.7  81.7  80.7
  HSMN      90.5  85.7  78.7  74.7
  CW        92.2  87.4  81.7  80.2
  CBOW      93.1  88.2  82.2  81.1
  GloVe     93.2  88.3  82.9  82.2

Analogy Task

  • All models are trained on the 6B-token corpus.
  • In (a), the window size is 10 and the vector dimension varies.
  • In (b) and (c), the vector dimension is 100 and the window size varies (symmetric and asymmetric, respectively).

Analogy Task

  • This figure shows performance on the word analogy task for 300d vectors trained on the different corpora.

Comparison with Word2Vec

  • Overall accuracy on the word analogy task.
  • 300d vectors are trained on the same 6B-token corpus.
  • Using a symmetric context window of size 10.

Conclusion

Conclusion

  • This paper argues that the two classes of methods are not dramatically different at a fundamental level, since they both probe the underlying co-occurrence statistics of the corpus.
  • The ability of count-based methods to capture global statistics can be advantageous.
  • GloVe is a new global log-bilinear regression model for the unsupervised learning of word representations.

GloVe

By Penut Chen (PenutChen)