GloVe

Global Vectors for Word Representation

 

Jeffrey Pennington

Richard Socher

Christopher D. Manning

 

Computer Science Department, Stanford University

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543

October 25-29, 2014, Doha, Qatar. ©2014 Association for Computational Linguistics

Introduction

Word Representation

  • Semantic vector space models of language represent each word with a real-valued vector.
  • Word vectors can be used as features in tasks such as:
    • Information Retrieval
    • Document Classification
    • Question Answering
    • Named Entity Recognition

Evaluation

  • Calculating the distance between pairs of word vectors (e.g., cosine similarity, as sketched below).
    • happy is similar to joyful.
    • angry is similar to mad.
  • Calculating differences between word vectors.
    • king - queen = man - woman
    • go - went = eat - ate
    • child - children = person - people
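As a concrete illustration of the distance-based evaluation, here is a minimal sketch using cosine similarity; the 3-dimensional vectors are invented for illustration only (trained GloVe vectors have 50 to 300 dimensions):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for illustration only; they are not real GloVe embeddings.
vectors = {
    "happy":  np.array([0.9, 0.1, 0.3]),
    "joyful": np.array([0.8, 0.2, 0.35]),
    "angry":  np.array([-0.7, 0.9, 0.1]),
}

print(cosine_similarity(vectors["happy"], vectors["joyful"]))  # close to 1.0
print(cosine_similarity(vectors["happy"], vectors["angry"]))   # noticeably lower
```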

Vector Learning Methods

  • Global matrix factorization methods.
    • Ex: Latent Semantic Analysis (LSA).
  • Local context window methods.
    • Ex: Skip-gram model.

Global Matrix Factorization

  • These methods efficiently leverage statistical information.
  • They perform relatively poorly on the word analogy task.

Local Context Window

  • These methods train on separate local context windows.
  • They poorly utilize the statistics of the corpus.
  • They perform better on the analogy task.

In This Work

  • Proposing a global log-bilinear regression model that trains on global word-word co-occurrence counts.
  • This method outperforms other models on several word similarity tasks.

The GloVe Model

The GloVe Model

  • Let the matrix of word-word co-occurrence counts be denoted by \(X\).
  • \(X_{ij}\) tabulates the number of times word \(j\) occurs in the context of word \(i\).
  • \(X_i=\sum_kX_{ik}\) is the number of times any word appears in the context of word \(i\).
  • \(P_{ij}=P(j|i)=X_{ij}/X_i\) is the probability that word \(j\) appears in the context of word \(i\).

Co-occurrence Table

  • The ratio \(P_{ik}/P_{jk}\) is better able to distinguish relevant words from irrelevant words than the raw probabilities (a toy sketch follows).
    • For example, with \(i=\) ice and \(j=\) steam, a probe word like solid yields a large ratio, gas a small one, and a neutral word like water a ratio near 1.
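A minimal sketch of how \(X\), \(X_i\), \(P_{ij}\), and the ratio can be computed; the five-word toy corpus and the window size are contrived for illustration and are not the paper's setup:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=1):
    """X[i][j]: number of times word j occurs within `window` tokens of word i."""
    X = defaultdict(lambda: defaultdict(float))
    for pos, word in enumerate(tokens):
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                X[word][tokens[ctx]] += 1.0
    return X

tokens = "solid ice water steam gas".split()   # contrived toy corpus
X = cooccurrence_counts(tokens, window=1)

def P(j, i):
    """P(j | i) = X_ij / X_i, the probability that j appears in the context of i."""
    X_i = sum(X[i].values())
    return X[i][j] / X_i if X_i else 0.0

# The ratio P(k|ice) / P(k|steam) is large for words related to ice only,
# small for words related to steam only, and close to 1 for shared context words.
for k in ["solid", "gas", "water"]:
    denom = P(k, "steam")
    print(k, P(k, "ice") / denom if denom else float("inf"))
```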

The Most General Model

\(F(w_i,w_j,\tilde w_k)=\frac{P_{ik}}{P_{jk}}\)   (1)

  • \(w \in \mathbb{R}^d\) are word vectors and \(\tilde w \in \mathbb{R}^d\) are context word vectors.
  • \(F\) is going to encode the information present in the ratio \(P_{ik}/P_{jk}\).
  • Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences.

\(F(w_i-w_j,\tilde w_k)=\frac{P_{ik}}{P_{jk}}\)   (2)

  • The arguments of \(F\) are vectors while the right-hand side is a scalar, so the dot product is used to prevent \(F\) from mixing the vector dimensions.

\(F((w_i-w_j)^T\tilde w_k)=\frac{P_{ik}}{P_{jk}}\)   (3)

  • Next, Eqn. (3) can be rewritten as,

\(F((w_i-w_j)^T\tilde w_k)=\frac{F(w^T_i\tilde w_k)}{F(w^T_j\tilde w_k)}\)   (4)

  • which is solved by,

\(F(w_i^T\tilde w_k)=P_{ik}=\frac{X_{ik}}{X_i}\)   (5)

  • Eqn. (4) requires \(F\) to turn subtraction into division, \(F(a-b)=F(a)/F(b)\), so its solution is \(F=\exp\), or,

\(w_i^T\tilde w_k=\log(P_{ik})=\log(X_{ik})-\log(X_i)\)   (6)

  • Finally, absorbing \(\log(X_i)\) into a bias \(b_i\) and adding \(\tilde b_k\) to restore symmetry gives,

\(w_i^T\tilde w_k+b_i+\tilde b_k=\log(X_{ik})\)   (7)

  • A main drawback to this model is that it weights all co-occurrences equally.
  • The authors propose a weighting function \(f(X_{ij})\) to address this problem, leading to the cost function

\(J=\sum_{i,j=1}^{V}f(X_{ij})\left(w_i^T\tilde w_j+b_i+\tilde b_j-\log X_{ij}\right)^2\)   (8)

  • where \(V\) is the size of the vocabulary.
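A minimal numpy sketch of the cost in Eqn. (8), assuming a small dense co-occurrence matrix; the weighting function is passed in as a callable because its concrete form is only given on the next slide:

```python
import numpy as np

def glove_cost(W, W_ctx, b, b_ctx, X, f):
    """Weighted least-squares cost J of Eqn. (8).

    W, W_ctx : (V, d) word vectors and context word vectors
    b, b_ctx : (V,) word biases and context biases
    X        : (V, V) co-occurrence counts
    f        : weighting function applied to each X_ij
    """
    cost = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:  # since f(0) = 0, zero entries contribute nothing
                diff = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(X[i, j])
                cost += f(X[i, j]) * diff ** 2
    return cost

# Example with random parameters and a trivial placeholder weighting:
rng = np.random.default_rng(0)
V, d = 5, 10
X = rng.integers(0, 20, size=(V, V)).astype(float)
W, Wc = 0.1 * rng.normal(size=(V, d)), 0.1 * rng.normal(size=(V, d))
b, bc = np.zeros(V), np.zeros(V)
print(glove_cost(W, Wc, b, bc, X, f=lambda x: 1.0))
```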

Cost Function

  • The weighting function should obey the following properties:
    1. \(f(0)=0\). It should vanish as \(x\to 0\) fast enough that \(\lim_{x\to 0}f(x)\log^2 x\) is finite.
    2. \(f(x)\) should be non-decreasing.
    3. \(f(x)\) should be relatively small for large values of \(x\).
  • The authors found the following function to work well:
    • \(f(x)=\begin{cases}(x/x_{max})^a,& \text{if }x < x_{max}\\1,& \text{otherwise}\end{cases}\)
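A minimal sketch of this weighting function with the paper's settings \(x_{max}=100\) and \(a=3/4\):

```python
def weighting(x, x_max=100.0, a=0.75):
    """f(x) = (x / x_max)^a below the cutoff, 1 otherwise."""
    return (x / x_max) ** a if x < x_max else 1.0

# Rare co-occurrences get small weights; frequent ones are capped at 1.
for x in [1, 10, 100, 1000]:
    print(x, round(weighting(x), 3))   # 0.032, 0.178, 1.0, 1.0
```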

Weighting Function

Weighting Function

  • The performance of the model depends weakly on the cutoff.
    • It is fixed to \(x_{max}=100\) for all experiments.
  • The authors found that \(a=3/4\) gives a modest improvement over a linear version with \(a=1\).

Experiments

Corpora

  • The model is trained on 5 corpora of varying sizes (token counts in parentheses):
    • 2010 Wikipedia dump (1B)
    • 2014 Wikipedia dump (1.6B)
    • Gigaword 5 (4.3B)
    • Gigaword 5 + Wikipedia 2014 (6B)
    • Common Crawl web data (42B)

Preprocessing

  • Each corpus is tokenized and lowercased with the Stanford tokenizer.
  • A vocabulary of the 400,000 most frequent words is built.

Training Details

  • \(x_{max}=100\) and \(a=3/4\).
  • Using AdaGrad as the optimizer.
  • Initial learning rate is 0.05.
  • 50 iterations for vectors with dimension smaller than 300.
  • 100 iterations otherwise.
  • Using a context window of 10 words to the left and 10 words to the right.
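A minimal sketch of this training setup (AdaGrad with initial learning rate 0.05), assuming a small dense co-occurrence matrix held in memory; the released implementation iterates over shuffled nonzero records from disk, which this sketch omits. The final vectors are returned as \(W+\tilde W\), as reported in the paper:

```python
import numpy as np

def train_glove(X, dim=50, iterations=50, lr=0.05, x_max=100.0, a=0.75, seed=0):
    """Fit GloVe vectors to a dense co-occurrence matrix X with AdaGrad."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    W = (rng.random((V, dim)) - 0.5) / dim    # word vectors
    Wc = (rng.random((V, dim)) - 0.5) / dim   # context word vectors
    b, bc = np.zeros(V), np.zeros(V)          # word and context biases
    # Per-parameter AdaGrad accumulators (sums of squared gradients), initialised to 1
    gW, gWc = np.ones_like(W), np.ones_like(Wc)
    gb, gbc = np.ones(V), np.ones(V)

    pairs = list(zip(*np.nonzero(X)))
    for _ in range(iterations):
        for i, j in pairs:
            f = (X[i, j] / x_max) ** a if X[i, j] < x_max else 1.0
            diff = W[i] @ Wc[j] + b[i] + bc[j] - np.log(X[i, j])
            grad_wi = f * diff * Wc[j]   # gradient w.r.t. w_i (constant factor absorbed)
            grad_wj = f * diff * W[i]    # gradient w.r.t. the context vector
            grad_b = f * diff            # gradient w.r.t. both biases
            W[i] -= lr * grad_wi / np.sqrt(gW[i])
            Wc[j] -= lr * grad_wj / np.sqrt(gWc[j])
            b[i] -= lr * grad_b / np.sqrt(gb[i])
            bc[j] -= lr * grad_b / np.sqrt(gbc[j])
            gW[i] += grad_wi ** 2
            gWc[j] += grad_wj ** 2
            gb[i] += grad_b ** 2
            gbc[j] += grad_b ** 2
    return W + Wc   # sum of word and context vectors, as used in the paper
```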

Experiments

  • The experiments consist of 3 parts:
    1. Word analogy task. (Mikolov et al., 2013a)
    2. A variety of similarity tasks. (Luong et al., 2013)
    3. CoNLL-2003 NER. (Tjong Kim Sang and De Meulder, 2003)

Word Analogies

  • Consisting of questions like, "\(a\) is to \(b\) as \(c\) is to __?".
  • The dataset contains 19,544 questions.
  • The answer is the word \(d\) whose vector \(w_d\) is closest to \(w_b-w_a+w_c\) according to cosine similarity (sketched below).
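A minimal sketch of this retrieval rule, assuming `vectors` is a dict mapping words to numpy arrays (e.g. loaded from a pretrained embedding file); following the common convention for this benchmark, the three question words are excluded from the candidates:

```python
import numpy as np

def solve_analogy(vectors, a, b, c):
    """Return the word d whose vector is most cosine-similar to w_b - w_a + w_c."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the question words themselves are excluded
        score = np.dot(vec, target) / np.linalg.norm(vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Usage with pretrained vectors:
# solve_analogy(vectors, "man", "king", "woman")  # ideally returns "queen"
```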

Word Similarity

  • This part includes 5 tasks:
    1. WordSim-353 (Finkelstein et al., 2001)
    2. MC (Miller and Charles, 1991)
    3. RG (Rubenstein and Goodenough, 1965)
    4. SCWS (Huang et al., 2012)
    5. RW (Luong et al., 2013)

Named Entity Recognition

  • The CoNLL-2003 dataset is annotated with 4 entity types:
    1. Person
    2. Location
    3. Organization
    4. Miscellaneous
  • The model is trained on the CoNLL-03 training data.
  • Testing with 3 datasets:
    1. CoNLL-03 testing data.
    2. ACE Phase 2 and ACE-2003 data.
    3. MUC7 Formal Run test set.

Comparison

  • Comparing with the published results of state-of-the-art models.
  • Using the word2vec tool to produce the word2vec results.
    • Both skip-gram (SG) and CBOW models are trained.
  • Using SVDs as baselines.
    • SVD-S takes the SVD of \(\sqrt{X_{trunc}}\).
    • SVD-L takes the SVD of \(\log(1+X_{trunc})\).
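A minimal sketch of these SVD baselines, assuming `X_trunc` is a small dense numpy array holding the co-occurrence counts of the most frequent words; whether the left singular vectors are additionally scaled by the singular values is a design choice not specified on this slide:

```python
import numpy as np

def svd_baseline(X_trunc, dim, variant="L"):
    """SVD-S factorizes sqrt(X_trunc); SVD-L factorizes log(1 + X_trunc)."""
    if variant == "S":
        M = np.sqrt(X_trunc)
    elif variant == "L":
        M = np.log1p(X_trunc)
    else:                      # plain SVD baseline on the raw counts
        M = X_trunc.astype(float)
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :dim]          # optionally scale columns by s[:dim]
```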

Results

Results

  • Results using the word2vec tool are better than most of the previously published results.
  • Increasing the corpus size does not guarantee improved results.

Word Similarity Tasks

  Model   Size  WS353  MC    RG    SCWS  RW
  SVD     6B    35.3   35.1  42.5  38.3  25.6
  SVD-S   6B    56.5   71.5  71.0  53.6  34.7
  SVD-L   6B    65.7   72.7  75.1  56.5  37.0
  CBOW    6B    57.2   65.6  68.2  57.0  32.5
  SG      6B    62.8   65.2  69.7  58.1  37.2
  GloVe   6B    65.8   72.7  77.8  53.9  38.1
  SVD-L   42B   74.0   76.4  74.1  58.3  39.9
  GloVe   42B   75.9   83.6  82.9  59.6  47.8
  CBOW    100B  68.4   79.6  75.4  59.4  45.5
  • A similarity score is obtained from the word vectors by first normalizing each feature across the vocabulary and then calculating the cosine similarity.
  • Computing Spearman's rank correlation coefficient between this score and the human judgements.
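A minimal sketch of this evaluation protocol; `vectors` (a word-to-array dict whose features have already been normalized across the vocabulary, as described above) and the (word1, word2, human_score) triples are assumed inputs:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, pairs_with_scores):
    """Spearman rank correlation between cosine scores and human judgements."""
    model_scores, human_scores = [], []
    for w1, w2, human in pairs_with_scores:
        u, v = vectors[w1], vectors[w2]
        model_scores.append(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        human_scores.append(human)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```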

NER Task

  • F1 score on NER task with 50d vectors.
  • HPCA, HSMN, and CW are publicly-available vectors.
  Model     Dev   Test  ACE   MUC7
  Discrete  91.0  85.4  77.4  73.4
  SVD       90.8  85.7  77.3  73.7
  SVD-S     91.0  85.5  77.6  74.3
  SVD-L     90.5  84.8  73.6  71.5
  HPCA      92.6  88.7  81.7  80.7
  HSMN      90.5  85.7  78.7  74.7
  CW        92.2  87.4  81.7  80.2
  CBOW      93.1  88.2  82.2  81.1
  GloVe     93.2  88.3  82.9  82.2

Analogy Task

  • All models are trained on the 6B-token corpus.
  • In (a), the window size is 10 and the vector dimension varies.
  • In (b) and (c), the vector dimension is 100 and the window size varies (symmetric and asymmetric, respectively).

Analogy Task

  • This figure shows performance on the word analogy task for 300d vectors trained on the different corpora.

Comparison with Word2Vec

  • Overall accuracy on the word analogy task.
  • 300d vectors are trained on the same 6B-token corpus.
  • Using a symmetric context window of size 10.

Conclusion

Conclusion

  • This paper argues that the two classes of methods are not dramatically different at a fundamental level, since they both probe the underlying co-occurrence statistics of the corpus.
  • The ability of count-based methods to capture global statistics can be advantageous.
  • GloVe is a new global log-bilinear regression model for the unsupervised learning of word representations.

GloVe

By Penut Chen (PenutChen)