Spooky Introduction to Word Embeddings

hamish - DS COP 31/10/19

code:  https://colab.research.google.com/drive/1bYa3eJ85s5fvDrwlsumgMBFaoTcLGZII

NLP "pipeline"

"The quick brown fox..."

"the quick brown fox..."

["the", "quick", "brown", ...]

[3, 6732, 1199, ...]

???

negative

raw input

cleaning

tokenization

word representation

model

output
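roughly what those stages look like in Python (a toy sketch, not the Colab's code; the vocabulary and ids are made up):

```python
# toy pipeline: raw text -> cleaned -> tokens -> integer ids
raw = "The quick brown fox jumps over the lazy dog"

# cleaning: lowercase (real pipelines also strip punctuation, handle unicode, ...)
cleaned = raw.lower()

# tokenization: naive whitespace split
tokens = cleaned.split()

# word representation: map each token to an id from a (tiny, made-up) vocabulary
vocab = {"the": 3, "quick": 6732, "brown": 1199}
UNK_ID = 0
ids = [vocab.get(tok, UNK_ID) for tok in tokens]

print(tokens)  # ['the', 'quick', 'brown', 'fox', ...]
print(ids)     # [3, 6732, 1199, 0, ...]
```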

"naive" approach

  • split sentence up into words
  • use a one-hot-encoding as something for our model to learn against

"the quick brown fox..."  →  ["the", "quick", "brown", "fox",...]

"the"  →  [1, 0, 0, 0, 0, ...]

"quick"  →  [0, 1, 0, 0, 0, ...]

 

this is essentially the bag-of-words approach used in Naive Bayes
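what that one-hot encoding looks like in code (a tiny sketch with a made-up four-word vocabulary):

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """A vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("the"))    # [1. 0. 0. 0.]
print(one_hot("quick"))  # [0. 1. 0. 0.]
```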

let's take a step back

what are we actually asking our model to learn?

  • meanings of words
  • how those words fit together to form meaning
  • how these fit together to do the task at hand

this feels like we're asking our model to do quite a lot

a better approach

  • still split sentence up into words
  • this time use a vector to represent our word

"the quick brown fox..."  →  ["the", "quick", "brown", "fox",...]

"the"  →  [-0.13, 1.67, 3.96, -2.22, -0.01, ...]

"quick"  →  [3.23, 1.89, -2.66, 0.12, -3.01, ...]

 

intuition: close vectors represent similar words
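one way to make "close" concrete is cosine similarity - a sketch with made-up vectors (real embeddings are usually 50-300 dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    """1.0 = pointing the same way, 0.0 = orthogonal, -1.0 = opposite."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# made-up 4-d vectors for illustration
quick = np.array([3.2, 1.9, -2.7, 0.1])
fast = np.array([3.0, 2.1, -2.5, 0.3])
fox = np.array([-1.4, 0.2, 3.8, -2.9])

print(cosine_similarity(quick, fast))  # high: similar words sit close together
print(cosine_similarity(quick, fox))   # low: unrelated words point elsewhere
```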

is this actually better? 🤔

  • haven't we just created yet another thing we have to learn?
  • a given word means roughly the same thing in most contexts
  • so we can independently build a very accurate embedding model against a huge dataset once and reuse it for lots of tasks
  • eg GloVe 840B is trained against 840 billion tokens (Common Crawl)
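reusing a pre-trained embedding is a one-liner with gensim's downloader (this sketch uses the smaller wiki-gigaword GloVe rather than the 840B Common Crawl one, to keep the download sensible):

```python
import gensim.downloader as api

# downloads the pre-trained vectors on first use, then caches them locally
glove = api.load("glove-wiki-gigaword-50")

print(glove["quick"][:5])                    # first few dimensions of the vector
print(glove.most_similar("quick", topn=3))   # nearest words in the embedding space
print(glove.similarity("quick", "fast"))     # cosine similarity between two words
```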

word embeddings

two common approaches to training

  • GloVe
  • word2vec

 

GloVe: co-occurrence matrix, good fast results at both small and huge scales

w2v: shallow NN trained to predict a word from its context (or vice versa), very fast to train, nice python support

 

both result in something close to a Hilbert space (a vector space where distances and angles between words are meaningful)
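a minimal word2vec training sketch with gensim (toy corpus, gensim ≥ 4 argument names, not the code from the Colab):

```python
from gensim.models import Word2Vec

# toy corpus: in practice you want millions of sentences
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "fast", "red", "fox", "runs", "past", "the", "sleepy", "dog"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the embedding space
    window=5,        # context words considered either side of the target word
    min_count=1,     # keep every word (raise this on a real corpus)
    sg=1,            # 1 = skip-gram, 0 = CBOW
    workers=4,
)

print(model.wv["fox"])               # the learned vector for "fox"
print(model.wv.most_similar("fox"))  # neighbours (noisy on a corpus this small)
```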

here have code

(The only scary slide)

tips and tricks

  • training: a word that's used a lot in your corpus but isn't in the embedding - give it a random vector. you're asking your model to learn something weird, but as long as you don't do it too much it works (see the sketch after this list)
  • inference: map unknown words to a single shared UNK token (its vector often just picked at random). here you're telling your model "there was something here but I don't know how to represent it" rather than leaving it unsure whether there was a word at all (and potentially learning a non-pattern)
  • you can combine embeddings
  • FastText for unknown words
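a sketch of the first two tricks together - known words get the pre-trained vector, frequent-but-missing words get a random vector, and everything else falls back to a shared UNK index (names here are illustrative, not from the Colab):

```python
import numpy as np

EMBED_DIM = 50
rng = np.random.default_rng(42)

# stand-in for vectors loaded from GloVe / word2vec
pretrained = {
    "quick": rng.normal(size=EMBED_DIM),
    "fox": rng.normal(size=EMBED_DIM),
}
# words that appear a lot in *our* corpus ("smartmeter" is missing from the embedding)
frequent_words = ["quick", "fox", "smartmeter"]

word_to_index = {"<UNK>": 0}
vectors = [rng.normal(size=EMBED_DIM)]  # a single shared UNK vector

for word in frequent_words:
    word_to_index[word] = len(vectors)
    # in the embedding: use it; frequent but missing: random vector the model can tune
    vectors.append(pretrained.get(word, rng.normal(size=EMBED_DIM)))

embedding_matrix = np.stack(vectors)  # rows get looked up by index inside the model

def encode(tokens):
    """Unseen words at inference time map to the shared <UNK> index."""
    return [word_to_index.get(tok, word_to_index["<UNK>"]) for tok in tokens]

print(encode(["quick", "smartmeter", "never-seen-before"]))  # [1, 3, 0]
```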

serious problems

  • embeddings are learned off a large (read: huge) corpus
  • those corpora come with human biases baked in
  • so they can introduce biases to your model even if your own training data isn't biased
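one way to see this for yourself is analogy-style probes against a pre-trained model (a sketch - outputs vary by model and aren't quoted here, but stereotyped completions are the classic finding in the embedding-bias literature):

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")

# "man is to programmer as woman is to ?" - biased corpora tend to surface
# stereotyped occupations here
print(glove.most_similar(positive=["woman", "programmer"], negative=["man"], topn=5))

# the same vector arithmetic is what makes analogies (king - man + woman ≈ queen) work,
# so the bias lives in the very structure that makes embeddings useful
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=5))
```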

OVO embeddings

  • "smart" means something very specific in energy
  • the whole point is to make things easy for our model
  • Natalie and I have been training some word embeddings

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo

end-to-end learning
