Word2Bit-Quantized Word Vectors

What is a Word2Vec?

In simple words, word2vec is a two-layer neural network which processes text
Its main job is to recognize patterns in a text.
Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances.
And since the words are being treated as vectors their output can be used in deep learning model to perform mathematical computations on it.

Memory Issues-Takes up to 3-6 GB of memory/storage – a massive amount relative to other portions of a deep learning model. These requirements pose issues to memory/storage limited devices like mobile phones and GPUs
Overfitting-Standard Word2Vec may be prone to overfitting; the quantization function acts as a regularizer against it.
Problems with current networks: Adding an extra layer of compression on top of pretrained word vectors which may be computationally expensive and degrade accuracy.

Compression of neural networks has been the fascination of many researchers since the 90s.
But previously many methods were used including network pruning, deep compression etc. but as discussed before they are computationally expensive.
But researchers instead found an easier way of including arithmetic operations and using low precision floating points on hardware devices called quantization.

There are two types of word2vec algorithm popularly used-one is Skip Gram and other is CBOW.
In typical word embeddings, we try to find a relationship between words.
But in a CBOW we try to find a relationship between several input words to a single output word

So what is basically doing is trying to me understand the relationship between the centre and context words

Hilton's straight through estimator- It is based on a certain threshold value for calculating the gradient in a typical backprop network. So instead of using the derivative of the gradient as 0, they are using it with respect to a specific value to an outgoing gradient as compared to the incoming gradient.
Similarly, they defined the Qbitlevel function which is a discrete function; they used the derivative of the function as an identity function

The final vector for each word is Qbitlevel(ui + vi); thus each parameter is one of 2 bit-level values and takes bit-level bits to represent.
While we are still training with full precision 32-bit arithmetic operations and 32-bit floating point values, the final word vectors we save to disk are quantized.

Dataset used: 2017 English Wikipedia dump (24G of text).
25 epochs.
hyperparameters: window size = 10,
negative sample size = 12,
min count = 5,
subsampling = 1e-4,
learning rate = .05 (which is linearly decayed to 0.0001).
Our final vocabulary size is 3.7 million after filtering words that appear less than min count = 5 times.

By archana iyer