Word2vec
Yini Shi
Grace Hopper Academy
July 2016

Word2vec is a machine-learning model that embeds words as vectors in a high-dimensional space.

Image source: http://deeplearning4j.org/word2vec

Image source: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
What does this tell us?
By encoding words as vectors, Word2vec makes it possible to process word meanings and relationships using mathematical operations.
Word2vec can capture multiple degrees of similarity between words.
Input:
'san_francisco'
Most similar words:
???
Source: https://code.google.com/archive/p/word2vec/
Input:
'san_francisco'
Most similar words:
| Word | Cosine similarity to input |
|---|---|
| los_angeles | 0.666175 |
| golden_gate | 0.571522 |
| oakland | 0.557521 |
| california | 0.554623 |
| san_diego | 0.534939 |
| pasadena | 0.519115 |
| seattle | 0.512098 |
| taiko | 0.507570 |
| houston | 0.499762 |
| chicago_illinois | 0.491598 |
Source: https://code.google.com/archive/p/word2vec/
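A ranking like the one for 'san_francisco' can be sketched as a nearest-neighbour search: score every other word by the cosine of the angle between its vector and the input's vector, then sort from highest to lowest. The three-dimensional vectors below are invented for illustration and are not taken from any trained model; `most_similar` is a hypothetical helper name, not the API of the word2vec tool.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
# Real Word2vec vectors are learned from text and usually have
# hundreds of dimensions.
EMBEDDINGS = {
    "san_francisco": [0.9, 0.8, 0.1],
    "los_angeles":   [0.8, 0.9, 0.2],
    "oakland":       [0.9, 0.6, 0.3],
    "seattle":       [0.6, 0.7, 0.5],
    "banana":        [0.1, 0.0, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, embeddings, top_n=3):
    """Return the top_n (word, score) pairs closest to `word`, best first."""
    query = embeddings[word]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word  # never report the input word itself
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

for neighbour, score in most_similar("san_francisco", EMBEDDINGS):
    print(f"{neighbour}\t{score:.6f}")
```

With these toy vectors the other cities rank above 'banana', which mirrors how the real table is dominated by place names.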
Moscow - Russia + Italy = ???

Image source: http://deeplearning4j.org/word2vec

walked : walking :: swam : ???
walked : walking :: swam : swimming

Image source: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
king - man + woman = ???
king - man + woman = queen

Image source: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
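The analogy arithmetic can be sketched in a few lines: subtract and add the word vectors, then return the vocabulary word whose vector points in the direction closest to the result. The two-dimensional vectors here are hand-picked so that the arithmetic works out; they are an assumption for illustration, not output from a trained model, and `analogy` is a hypothetical helper name.

```python
import math

# Hypothetical 2-D vectors, hand-picked for illustration (not learned):
# axis 0 loosely encodes "royalty", axis 1 loosely encodes "maleness".
VECTORS = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.3, 0.5],
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? by finding the word nearest to b - a + c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    # Exclude the three query words so the answer is a new word.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

# king - man + woman = ?
print(analogy("man", "king", "woman", VECTORS))  # prints "queen"
```

Excluding the query words from the candidates is a deliberate choice in this sketch: the result of the arithmetic often lies closest to one of the inputs, which would make the answer uninformative.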
Implementing Word2vec
Two architectures: continuous bag-of-words (CBOW) and skip-gram

Image source: Mikolov et al., "Efficient Estimation of Word Representations in Vector Space"
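The cited paper by Mikolov et al. names its two architectures continuous bag-of-words (CBOW) and skip-gram. CBOW predicts the current word from its surrounding context words, while skip-gram predicts the surrounding words from the current word. The sketch below only shows how training examples for each could be drawn from a sentence, assuming a fixed context window; the function names are hypothetical, and the network, its training loop, and the sampling tricks used by real implementations are all omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram examples: (centre word, one context word) pairs."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW examples: (list of context words, centre word to predict)."""
    examples = []
    for i, centre in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + window + 1]
        examples.append((left + right, centre))
    return examples

sentence = "the quick brown fox jumps".split()

print("skip-gram:")
for centre, context in skipgram_pairs(sentence, window=1):
    print(f"  {centre} -> {context}")

print("CBOW:")
for context, centre in cbow_examples(sentence, window=1):
    print(f"  {context} -> {centre}")
```

The same sentence yields many small skip-gram pairs but only one CBOW example per position, since CBOW treats the whole window as a single input.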
Quality depends on...
- the size of the vectors
- the amount of training data
- the quality of the training data
- the training algorithm used
Real Examples from Deeplearning4J's Google News Corpus Model
Source: http://deeplearning4j.org/word2vec
house:roof::castle:[dome, bell_tower, spire, crenellations, turrets]
Donald Trump:Republican::Barack Obama:[Democratic, GOP, Democrats, McCain]
monkey:human::dinosaur:[fossil, fossilized, Ice_Age_mammals, fossilization]
Sources & Further Reading
Efficient Estimation of Word Representations in Vector Space
Mikolov et al.
http://arxiv.org/pdf/1301.3781.pdf
Distributed Representations of Words and Phrases and their Compositionality
Mikolov et al.
https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
Word2vec: Neural Word Embeddings in Java
Deeplearning4j
http://deeplearning4j.org/word2vec
Vector Representations of Words
TensorFlow
https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
Google Code Archive: word2vec
https://code.google.com/archive/p/word2vec/
