Papers We Love San Diego
July 5th, 2018
Presented by Michael Jalkio
We start with a corpus of documents.
$$ C = \{ d_1, d_2, d_3 \} $$
We examine the words in the corpus to build a vocabulary.
$$ V = \{ w_1, w_2, w_3, w_4, w_5 \} $$
Documents are represented as vectors of term weights.
$$d_1 = (w_{1,1}, w_{2,1}, w_{3,1}, w_{4,1}, w_{5,1})$$
$$d_2 = (w_{1,2}, w_{2,2}, w_{3,2}, w_{4,2}, w_{5,2})$$
$$d_3 = (w_{1,3}, w_{2,3}, w_{3,3}, w_{4,3}, w_{5,3})$$
A query is similarly represented by a term weight vector.
$$ q = (w_{1,q}, w_{2,q}, w_{3,q}, w_{4,q}, w_{5,q})$$
Cosine similarity is used to compare the query to documents.
$$\cos(d_i, q) = \frac{d_i^T q}{\| d_i \| \| q \|}$$
Because vectors are nonnegative:
$$\cos(d_i, q) \in [0,1]$$
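As a sketch in plain Python (the document and query vectors below are illustrative, not from the slides):

```python
import math

def cosine(d, q):
    """Cosine similarity between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_d * norm_q)

d1 = [1, 1, 1, 0, 0]   # a document's term weights
q = [1, 0, 1, 0, 0]    # a query's term weights
print(round(cosine(d1, q), 3))  # 0.816
```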
A first approach is a bag of words.
Each document's vector holds the counts of the terms it contains.
$$d_1 = [1, 1, 1, 0, 0]$$
$$d_2 = [1, 0, 1, 1, 0]$$
$$d_3 = [1, 1, 0, 0, 1]$$
$$d_4 = [1, 2, 0, 0, 1]$$
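The count vectors above can be built with a sketch like this (the corpus and vocabulary here are hypothetical, chosen to reproduce the counts):

```python
from collections import Counter

# Hypothetical corpus chosen to reproduce the count vectors above.
docs = [
    "the dog runs".split(),
    "the horse runs".split(),
    "the dog eats".split(),
    "the dog eats dog".split(),
]
vocab = ["the", "dog", "runs", "horse", "eats"]

# One count vector per document, in vocabulary order.
vectors = [[Counter(doc)[w] for w in vocab] for doc in docs]
print(vectors)
# [[1, 1, 1, 0, 0], [1, 0, 1, 1, 0], [1, 1, 0, 0, 1], [1, 2, 0, 0, 1]]
```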
BoW overemphasizes common words.
An improvement is term frequency-inverse document frequency.
$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log_{10}{\frac{|D|}{| \{ d \in D : t \in d \} |}}$$
From the example before we get:
$$idf(\text{the}, D) = \log{\frac{4}{4}} = 0$$
$$idf(\text{dog}, D) = \log{\frac{4}{3}} \approx 0.125$$
$$idf(\text{runs/eats}, D) = \log{\frac{4}{2}} \approx 0.301$$
$$idf(\text{horse}, D) = \log{\frac{4}{1}} \approx 0.602$$
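The idf values above can be reproduced with a short sketch (the corpus is hypothetical, built to match the document frequencies; note the slide's numbers use a base-10 log):

```python
import math

# Hypothetical corpus matching the example's document frequencies:
# "the" in 4 docs, "dog" in 3, "runs"/"eats" in 2, "horse" in 1.
docs = [
    {"the", "dog", "runs"},
    {"the", "horse", "runs"},
    {"the", "dog", "eats"},
    {"the", "dog", "eats"},
]

def idf(term, docs):
    """Inverse document frequency, base-10 log as in the slide's numbers."""
    df = sum(term in doc for doc in docs)
    return math.log10(len(docs) / df)

print(round(idf("the", docs), 3))    # 0.0
print(round(idf("dog", docs), 3))    # 0.125
print(round(idf("runs", docs), 3))   # 0.301
print(round(idf("horse", docs), 3))  # 0.602
```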
Source: Andrew Ng / deeplearning.ai Sequence Models course
One-Hot Encodings: each word is represented by a vector that is all zeros except for a 1 at the word's vocabulary index, e.g. Man (5391), Woman (9853), King (4914), Queen (7157), Apple (456), Orange (6257).
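A minimal sketch (the vocabulary size of 10,000 is an assumption; the indices come from the slide). The dot product of any two distinct one-hot vectors is 0, so they say nothing about word similarity, which is what motivates learned embeddings:

```python
def one_hot(index, vocab_size):
    """All zeros except a 1 at the word's vocabulary index."""
    v = [0] * vocab_size
    v[index] = 1
    return v

# Indices from the slide; a vocabulary size of 10,000 is assumed here.
o_man = one_hot(5391, 10000)
o_woman = one_hot(9853, 10000)
print(o_man[5391], sum(o_man))  # 1 1
print(sum(a * b for a, b in zip(o_man, o_woman)))  # 0: no similarity signal
```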
|        | Man  | Woman | King  | Queen | Apple | Orange |
|--------|------|-------|-------|-------|-------|--------|
| Gender | -1   | 1     | -0.95 | 0.97  | 0.00  | 0.01   |
| Royal  | 0.01 | 0.02  | 0.93  | 0.95  | -0.01 | 0.00   |
| Food   | 0.04 | 0.01  | 0.02  | 0.01  | 0.95  | 0.97   |
One potential application is named entity recognition.
If the training set contains:
And the test set contains:
You can use word embeddings to figure out that these sentences have very similar structures.
Nearest neighbors to frog:
Source: https://nlp.stanford.edu/projects/glove
You can pre-train word embeddings in an unsupervised manner on a large corpus of text data.
Or even use someone else's pre-trained model!
You can then use embeddings to squeeze out more information from a smaller training set for your task.
Generally, this technique is called transfer learning.
|        | Man  | Woman | King  | Queen | Apple | Orange |
|--------|------|-------|-------|-------|-------|--------|
| Gender | -1   | 1     | -0.95 | 0.97  | 0.00  | 0.01   |
| Royal  | 0.01 | 0.02  | 0.93  | 0.95  | -0.01 | 0.00   |
| Food   | 0.04 | 0.01  | 0.02  | 0.01  | 0.95  | 0.97   |
Woman is to man as queen is to _ ?
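Analogies can be answered with vector arithmetic: find the word whose embedding is closest to queen - woman + man. A sketch using the hand-made embeddings from the table above:

```python
import math

# Hand-made embeddings from the table above: [gender, royal, food].
E = {
    "man":    [-1.00,  0.01, 0.04],
    "woman":  [ 1.00,  0.02, 0.01],
    "king":   [-0.95,  0.93, 0.02],
    "queen":  [ 0.97,  0.95, 0.01],
    "apple":  [ 0.00, -0.01, 0.95],
    "orange": [ 0.01,  0.00, 0.97],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# woman : man :: queen : ?  ->  word closest to queen - woman + man
target = [q - w + m for q, w, m in zip(E["queen"], E["woman"], E["man"])]
best = max((w for w in E if w not in ("queen", "woman", "man")),
           key=lambda w: cosine(E[w], target))
print(best)  # king
```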
Source: https://nlp.stanford.edu/projects/glove
Task: given a context, predict the word that fits (the continuous bag-of-words, or CBOW, model).
The quick brown fox jumps over the lazy dog.
The quick ___ fox jumps
quick brown ___ jumps over
brown fox ___ over the
Task: given a word, predict the context (the skip-gram model).
The quick brown fox jumps over the lazy dog.
brown => {The, quick, fox, jumps}
fox => {quick, brown, jumps, over}
jumps => {brown, fox, over, the}
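Generating these (word, context) training pairs is straightforward; a sketch with a window size of 2, matching the examples above:

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within the given window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "The quick brown fox jumps over the lazy dog".split()
ctx = {c for w, c in skipgram_pairs(sentence) if w == "fox"}
print(sorted(ctx))  # ['brown', 'jumps', 'over', 'quick']
```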
Source: McCormick, C. (2016, April 19). Word2Vec Tutorial - The Skip-Gram Model. Retrieved from http://www.mccormickml.com
\( X_{ij} \): the number of times word \( j \) occurs in the context of word \( i \).
\( X_i = \sum_k X_{ik} \): the number of times any word appears in the context of word \( i \).
\( P_{ij} = X_{ij} / X_i \): the probability that word \( j \) appears in the context of word \( i \).
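A sketch of computing these counts and probabilities from a toy corpus (the sentence and the window size of 2 are illustrative):

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """X[i][j] counts how often word j occurs within `window` words of word i."""
    X = defaultdict(lambda: defaultdict(int))
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                X[center][tokens[j]] += 1
    return X

tokens = "the quick brown fox jumps over the lazy dog".split()
X = cooccurrence(tokens)
X_the = sum(X["the"].values())            # times any word appears near "the"
P_the_quick = X["the"]["quick"] / X_the   # P("quick" | context of "the")
print(round(P_the_quick, 3))  # 0.167
```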
Ratios of these co-occurrence probabilities are better at surfacing relevant words than the raw probabilities; this observation is the starting point of the GloVe algorithm.
We want relationships between words to be embedded in a vector space. The most general model relates three words, two word vectors and one context word vector:
$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}$$
The right-hand side is calculated directly from our corpus. On the left, \( w_i \) and \( w_j \) are word vectors, \( \tilde{w}_k \) is a context word vector, and \( F \) is a function that will tie everything together.
The right-hand side is a scalar, while the arguments of \( F \) are vectors, and we want to work in simple, linear structures. A vector difference encodes a relationship between two words, and a dot product encodes whether two words are related, so a natural choice is:
$$F\left( (w_i - w_j)^T \tilde{w}_k \right) = \frac{P_{ik}}{P_{jk}}$$
The distinction between a word and a context word should be arbitrary, so we should be able to swap the two roles:
$$w \leftrightarrow \tilde{w}$$
For this to be possible, our co-occurrence matrix needs to be symmetric:
$$X \leftrightarrow X^T$$
$$F: \mathbb{R} \to \mathbb{R}_{>0}$$
$$\forall a,b \in \mathbb{R}, F(a+b) = F(a) \times F(b)$$
If \( F \) obeys this, it is a homomorphism between the groups \( (\mathbb{R}, +) \) and \( (\mathbb{R}_{>0}, \times) \).
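The exponential function is such a homomorphism, since it turns sums into products; a quick numeric check:

```python
import math

# exp maps (R, +) to (R_{>0}, x): exp(a + b) = exp(a) * exp(b)
a, b = 1.5, -0.7
lhs = math.exp(a + b)
rhs = math.exp(a) * math.exp(b)
print(math.isclose(lhs, rhs))  # True
```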
What about \( \log(0) \)?
A very common solution to \( \log(0) \) is to add 1:
$$\log(X_{ij}) \to \log(1 + X_{ij})$$
But we have another issue. We treat all co-occurrences equally. Most are zero, and the ones that exist are often huge.
Let's fix two problems at once with a weighting function.
Our weighting function \( f(x) \) should have these properties: \( f(0) = 0 \), so pairs that never co-occur contribute nothing; it should be non-decreasing, so rare co-occurrences are not overweighted; and it should be relatively small for large \( x \), so frequent co-occurrences are not overweighted.
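The function used in the GloVe paper satisfies these properties; a sketch with the paper's default values \( x_{max} = 100 \) and \( \alpha = 3/4 \):

```python
def f(x, x_max=100, alpha=0.75):
    """GloVe weighting: small for rare pairs, capped at 1 for frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0

print(f(0))             # 0.0  (zero co-occurrences contribute nothing)
print(round(f(10), 3))  # 0.178
print(f(1000))          # 1.0  (very frequent pairs are capped)
```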
We cast the problem as a weighted least squares regression model, which is a standard machine learning problem.
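Concretely, the objective minimized in the GloVe paper is:
$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where \( |V| \) is the vocabulary size and \( b_i, \tilde{b}_j \) are bias terms.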
The word vectors (along with the bias terms, which we can discard) returned by minimizing this function are our embeddings!