Word2vec

 

Yini Shi

Grace Hopper Academy

July 2016

Word2vec is a machine-learning model that embeds words as vectors in a high-dimensional space.

Image source: http://deeplearning4j.org/word2vec

Image source: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html

What does this tell us?

By encoding words as vectors, Word2vec makes it possible to process word meanings and relationships using mathematical operations.
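As a rough illustration of what "mathematical operations" on word meanings can look like, the sketch below compares words by the cosine of the angle between their vectors. The three-dimensional vectors and the `embeddings` dictionary are made up for this example; a trained Word2vec model learns its own vectors, usually with far more dimensions.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
# They are NOT output from a trained Word2vec model.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

# With these toy vectors, "cat" scores closer to "dog" than to "car".
print(f"cat vs dog: {cat_dog:.3f}")
print(f"cat vs car: {cat_car:.3f}")
```

A value near 1 means the two vectors point in almost the same direction; a value near 0 means they are close to orthogonal.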

Word2vec is capable of detecting multiple degrees of similarity between words

Input:

'san_francisco'

Most similar words:

???

Source: https://code.google.com/archive/p/word2vec/

Input:

'san_francisco'

Most similar words:

Word                Cosine distance from input
los_angeles         0.666175
golden_gate         0.571522
oakland             0.557521
california          0.554623
san_diego           0.534939
pasadena            0.519115
seattle             0.512098
taiko               0.507570
houston             0.499762
chicago_illinois    0.491598

Source: https://code.google.com/archive/p/word2vec/
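A "most similar words" query like the one above can be sketched as a nearest-neighbor search: score every other word against the input and sort. Note that although the column is labeled "Cosine distance", the scores appear to behave like cosine similarities, with higher values meaning closer words. The two-dimensional `vectors` and the `most_similar` helper below are hypothetical and only show the ranking step.

```python
import math

# Hypothetical 2-d vectors, invented for illustration; not real model output.
vectors = {
    "san_francisco": [0.9, 0.4],
    "los_angeles": [0.8, 0.5],
    "oakland": [0.7, 0.6],
    "banana": [-0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, top_n=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


for neighbor, score in most_similar("san_francisco", vectors):
    print(f"{neighbor:<15} {score:.6f}")
```

A real vocabulary has many thousands of words, so practical implementations normalize the vectors once and use matrix operations instead of a Python loop.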

Moscow - Russia + Italy = ???

Image source: http://deeplearning4j.org/word2vec


walked : walking :: swam : ???

walked : walking :: swam : swimming

Image source: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html

king - man + woman = ???

king - man + woman = queen

Image source: https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html
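The analogy arithmetic can be sketched in a few lines: subtract and add the word vectors, then return the vocabulary word whose vector is closest to the result. The two-dimensional vectors below are hand-built so that one axis loosely stands for "royalty" and the other for "gender"; they are an assumption for illustration, not values from a trained model, and `analogy` is a hypothetical helper.

```python
import math

# Hand-made 2-d vectors (axis 0 ~ "royalty", axis 1 ~ "gender").
# Invented for illustration only.
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [-0.5, 0.3],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# king - man + woman
print(analogy("king", "man", "woman", vectors))
```

Excluding the three input words from the candidates matters: without that step the nearest vector to the result is often one of the inputs themselves.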

Implementing Word2vec

Two architectures: continuous bag-of-words (CBOW) and skip-gram

Image source: Mikolov et al., "Efficient Estimation of Word Representations in Vector Space"
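The two architectures described by Mikolov et al. are continuous bag-of-words (CBOW), which predicts a word from its surrounding context, and skip-gram, which predicts the surrounding context from a word. The sketch below shows only how training examples could be drawn from a sentence under each scheme. The function names and the fixed `window` parameter are assumptions for illustration, and the neural network that consumes these examples is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples: the neighbors predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = "the quick brown fox".split()

for center, context in skipgram_pairs(tokens, window=1):
    print(f"skip-gram: {center} -> {context}")

for context, center in cbow_examples(tokens, window=1):
    print(f"CBOW: {context} -> {center}")
```

This sketch keeps the window size fixed for simplicity.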

Quality depends on...

  • the size of the vectors
  • the amount of training data
  • the quality of the training data
  • the training algorithm used

Real Examples from Deeplearning4J's Google News Corpus Model

Source: http://deeplearning4j.org/word2vec


house:roof::castle:[dome, bell_tower, spire, crenellations, turrets]

 

 


 

Donald Trump:Republican::Barack Obama:[Democratic, GOP, Democrats, McCain]

 

 


 

monkey:human::dinosaur:[fossil, fossilized, Ice_Age_mammals, fossilization]

Sources & Further Reading

Efficient Estimation of Word Representations in Vector Space

Mikolov et al.

http://arxiv.org/pdf/1301.3781.pdf

 

Distributed Representations of Words and Phrases and their Compositionality

Mikolov et al.

https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

 

Word2vec: Neural Word Embeddings in Java

Deeplearning4j

http://deeplearning4j.org/word2vec

 

Vector Representations of Words

TensorFlow

https://www.tensorflow.org/versions/r0.7/tutorials/word2vec/index.html

 

Google Code Archive: word2vec

https://code.google.com/archive/p/word2vec/
