Spooky Introduction to Word Embeddings
NLP "pipeline"
"The quick brown fox..."
"the quick brown fox..."
["the", "quick", "brown", ...]
[3, 6732, 1199, ...]
???
✨
negative
raw input
cleaning
tokenization
word representation
model
output
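a toy sketch of the first few stages in Python (the vocabulary and ids below are made up, not from the talk):

# toy version of the pipeline above
raw = "The quick brown fox..."

cleaned = raw.lower()                          # cleaning
tokens = cleaned.replace("...", "").split()    # tokenization (very crude)

vocab = {"the": 3, "quick": 6732, "brown": 1199, "fox": 42}   # made-up ids
ids = [vocab.get(tok, 0) for tok in tokens]    # word representation, so far just integer ids

print(ids)   # [3, 6732, 1199, 42]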
"naive" approach
- split the sentence up into words
- use a one-hot encoding as the input for our model to learn against (sketch below)
"the quick brown fox..." → ["the", "quick", "brown", "fox",...]
"the" → [1, 0, 0, 0, 0, ...]
"quick" → [0, 1, 0, 0, 0, ...]
this is the approach used in Naive Bayes
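a sketch of what that one-hot encoding looks like (toy vocabulary, purely illustrative):

import numpy as np

vocab = ["the", "quick", "brown", "fox"]            # toy vocabulary
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # a single 1 at the word's index, 0 everywhere else
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

print(one_hot("the"))     # [1. 0. 0. 0.]
print(one_hot("quick"))   # [0. 1. 0. 0.]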
let's take a step back
what are we actually asking our model to learn?
- meanings of words
- how those words fit together to form meaning
- how these fit together to do the task at hand
this feels like we're asking our model to do quite a lot
a better approach
- still split the sentence up into words
- this time use a vector of real numbers to represent each word
"the quick brown fox..." → ["the", "quick", "brown", "fox",...]
"the" → [-0.13, 1.67, 3.96, -2.22, -0.01, ...]
"quick" → [3.23, 1.89, -2.66, 0.12, -3.01, ...]
intuition: close vectors represent similar words
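the "close vectors" intuition in code; the numbers below are made up, real vectors would come from GloVe or word2vec:

import numpy as np

# made-up 5-d vectors; real embeddings are usually 50-300 dimensions
embedding = {
    "quick": np.array([3.23, 1.89, -2.66, 0.12, -3.01]),
    "fast":  np.array([3.10, 2.01, -2.50, 0.30, -2.80]),
    "fox":   np.array([-0.13, 1.67, 3.96, -2.22, -0.01]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embedding["quick"], embedding["fast"]))   # close to 1: similar words
print(cosine(embedding["quick"], embedding["fox"]))    # much lower: less related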
is this actually better? 🤔
- haven't we just created yet another thing we have to learn?
- a word means roughly the same thing in most contexts
- so we can independently build a very accurate set of embeddings against a huge dataset once and reuse it for lots of tasks
- e.g. GloVe 840B is trained on 840 billion tokens of Common Crawl
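a sketch of the "train once, reuse everywhere" idea using gensim's downloader; the model name below is one of the small pretrained sets gensim knows about, the full 840B Common Crawl vectors are distributed as a text file from the GloVe site:

import gensim.downloader as api

# downloads a small pretrained GloVe model on first use
vectors = api.load("glove-wiki-gigaword-100")

print(vectors["quick"][:5])                     # first few dimensions of the vector
print(vectors.most_similar("quick", topn=3))    # nearest neighbours in the embedding space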
word embeddings
two common approaches to training
- GloVe
- word2vec
GloVe: built from a global word co-occurrence matrix; good, fast results at both small and huge scales
w2v: a shallow neural network that predicts nearby words (skip-gram / CBOW); very fast to train, nice Python support
both result in something close to a Hilbert space: a vector space where distances and angles between words are meaningful
here have code
(The only scary slide)
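a minimal word2vec training sketch with gensim; the corpus here is a placeholder, real training needs vastly more text:

from gensim.models import Word2Vec

# each "sentence" is a list of tokens; a real corpus has millions of these
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "quick", "brown", "fox", "is", "quick"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the embeddings
    window=5,          # context words either side of the target
    min_count=1,       # keep every word (raise this on a real corpus)
    sg=1,              # 1 = skip-gram, 0 = CBOW
    workers=4,
)

print(model.wv["quick"][:5])
print(model.wv.most_similar("quick", topn=3))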
tips and tricks
- training: if a word appears a lot in your data but isn't in the embedding, give it its own random vector. you're asking your model to learn something a bit weird, but as long as you don't do it too often it works (see the sketch after this list)
- inference: map unknown words to an UNK token, typically a single vector chosen at random. this tells your model "there was a word here, I just don't know how to represent it", rather than leaving it unsure whether a word was there at all (and potentially learning a non-pattern)
- you can combine embeddings
- FastText for unknown words
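a sketch of the first two tricks, assuming a dict-like pretrained embedding (the words and scales below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
dim = 50

# stand-in for a pretrained embedding loaded from GloVe/word2vec
pretrained = {"the": rng.normal(size=dim), "fox": rng.normal(size=dim)}

# training time: a frequent word missing from the pretrained set gets its own random vector
pretrained["kwh"] = rng.normal(scale=0.1, size=dim)

# inference time: everything else maps to a single UNK vector
unk = rng.normal(scale=0.1, size=dim)

def embed(token):
    return pretrained.get(token, unk)

print(embed("fox")[:3])       # known word: its own vector
print(embed("blorple")[:3])   # unknown word: the shared UNK vector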
serious problems
- embeddings are learned off a large (read: huge) corpus
- that corpus comes with its own biases
- so embeddings can introduce biases into your model even if your training data isn't biased
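one common way to probe for this is analogy queries against a pretrained model; results vary between models, so run it yourself rather than trust any single example:

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# "man is to doctor as woman is to ...?" - gendered associations often show up in queries like these
print(vectors.most_similar(positive=["doctor", "woman"], negative=["man"], topn=5))
print(vectors.most_similar(positive=["engineer", "woman"], negative=["man"], topn=5))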
OVO embeddings
- "smart" means something very specific in energy
- the whole point is to make things easy for our model
- Natalie and I have been training some word embeddings
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo
end-to-end learning
Introduction to Word Embeddings
By Hamish Dickson