AI By The Bay
* I like deep learning but I spend a lot of time ranting against the hype ... ^_^
"New ____ city"
"York"
If you're not aware of the GRU or LSTM, you can consider them as improved variants of the RNN
(do read up on the differences though!)
Both of these recurrent dropout techniques are easy to implement, and they're already part of many frameworks
Example: a one line change in Keras for variational dropout
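A minimal sketch of what that looks like, assuming a Keras LSTM layer (the hidden size and dropout rates here are placeholders):

    from tensorflow.keras.layers import LSTM

    # recurrent_dropout reuses the same dropout mask at every timestep,
    # which is the variational dropout behaviour described above
    layer = LSTM(256, dropout=0.3, recurrent_dropout=0.3)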
Stunningly, you can supply English as input, German as expected output, and the model learns to translate
After each step, the hidden state contains an encoding of the sentence up until that point, with S attempting to encode the entire sentence
The encoder and decoder are the RNNs
The key issue comes in the quality of translation for long sentences - the entire input sentence must be compressed to a single hidden state ...
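A rough sketch of that encoder-decoder setup in Keras (hypothetical vocabulary and hidden sizes; a real NMT system also needs teacher forcing, beam search, and more):

    from tensorflow.keras import layers, Model

    src = layers.Input(shape=(None,), dtype="int32")
    tgt = layers.Input(shape=(None,), dtype="int32")

    # Encoder: read the source sentence, keeping only the final hidden state "S"
    src_emb = layers.Embedding(10000, 256)(src)
    _, h, c = layers.LSTM(256, return_state=True)(src_emb)

    # Decoder: generate the target sentence, initialized with that single state
    tgt_emb = layers.Embedding(10000, 256)(tgt)
    dec_out = layers.LSTM(256, return_sequences=True)(tgt_emb, initial_state=[h, c])
    probs = layers.Dense(10000, activation="softmax")(dec_out)

    model = Model([src, tgt], probs)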
Figure from Bahdanau et al's
Neural Machine Translation by Jointly Learning to Align and Translate
38 words
Human beings translate a part at a time, referring back to the original source sentence when required
How can we simulate that using neural networks?
By providing an attention mechanism
As we process each word on the decoder side, we query the source encoding for relevant information
For long sentences, this allows a "shortcut" for information - the path is shorter and we're not constrained to the information from a single hidden state
For each hidden state we produce an attention score
We ensure that the attention scores sum up to one
We can then produce a context vector, or a weighted summation of the hidden states:
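In symbols (my notation, not the slide's), with hidden states h_i and attention scores a_i:

    c = \sum_i a_i h_i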
How do we ensure that our attention scores sum to 1?
(also known as being normalized)
We use our friendly neighborhood softmax function
on our unnormalized raw attention scores r
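Written out (again my notation), the raw scores r_i become normalized scores a_i via:

    a_i = \frac{\exp(r_i)}{\sum_j \exp(r_j)}, \qquad \sum_i a_i = 1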
Finally, to produce the raw attention scores, we have a number of options, but the two most popular are:
Inner product between the query and the hidden state
Feed forward neural network using query and hidden state
(this may have one or many layers)
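A small NumPy sketch of the full attention step, showing both scoring options (all sizes and weights below are random placeholders, not any paper's exact formulation):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    hidden_states = np.random.randn(10, 64)   # one encoder hidden state per source word
    query = np.random.randn(64)               # current decoder state

    # Option 1: inner product between the query and each hidden state
    raw_scores = hidden_states @ query

    # Option 2: a small feed forward network over [hidden state; query]
    W1, w2 = np.random.randn(128, 32), np.random.randn(32)
    pairs = np.concatenate([hidden_states, np.tile(query, (10, 1))], axis=1)
    raw_scores_ffn = np.tanh(pairs @ W1) @ w2

    # Normalize with softmax, then take the weighted sum as the context vector
    attention = softmax(raw_scores)
    context = attention @ hidden_states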
Context vector in green
Attention score calculations in red
European Economic Area <=> zone économique européenne
Results from Bahdanau et al's
Neural Machine Translation by Jointly Learning to Align and Translate
Our simple model
More depth and forward + backward
Residual connections
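As a tiny sketch, a residual connection around a recurrent layer just adds the layer's input back to its output (Keras, placeholder sizes):

    from tensorflow.keras import layers

    x = layers.Input(shape=(None, 256))
    h = layers.LSTM(256, return_sequences=True)(x)
    out = layers.Add()([x, h])   # the layer only has to learn a correction to its input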
If you're interested in what production NMT looks like,
"Peeking into the architecture used for Google's NMT"
(Smerity.com)
From my colleagues Kumar et al. (2015) and Xiong, Merity, Socher (2016)
Rather than each hidden state representing a word as in translation,
it represents either a sentence (for text) or a section of an image
Some tasks require multiple passes over memory for a solution
Episodic memory allows us to do this
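A toy sketch of those multiple passes (the real episodic memory module uses a learned attention function and a GRU-style memory update; this only shows the repeated read-and-update loop):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    facts = np.random.randn(5, 64)    # one encoding per input sentence (not per word)
    question = np.random.randn(64)

    memory = question.copy()
    for episode in range(3):                  # multiple passes over the facts
        weights = softmax(facts @ memory)     # attend over sentences given the current memory
        context = weights @ facts
        memory = memory + context             # simplified update (the DMN uses a GRU here)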
Convex Hull
Delaunay Triangulation
DMN and other attention mechanisms show the potential for multiple passes to perform complex reasoning
Particularly useful for tasks where transitive reasoning is required or where answers can be progressively refined
Can this be extended to full documents?
Note: work from my amazing colleagues -
Caiming Xiong, Victor Zhong, and Richard Socher
Stanford Question Answering Dataset (SQuAD) uses Wikipedia articles for question answering over textual spans
The overarching concept is relatively simple:
[Figure: Encoder for the Dynamic Coattention Network]
It's the specific implementation that kills you ;)
Explaining the architecture fully is complicated but intuitively:
Iteratively improve the start and end points of the answer
as we perform more passes over the data
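A toy illustration of that loop (the actual DCN decoder is an LSTM with highway maxout networks scoring every position; everything below is a placeholder):

    import numpy as np

    U = np.random.randn(30, 64)        # coattention encoding, one vector per passage word
    w_s = np.random.randn(64)          # toy scoring weights for the start position
    w_e = np.random.randn(64)          # toy scoring weights for the end position

    start, end = 0, U.shape[0] - 1
    for _ in range(4):                               # each pass may revise the previous estimate
        span = (U[start] + U[end]) / 2               # summary of the current guess
        start = int(np.argmax((U * span) @ w_s))     # re-score every position given that guess
        end = int(np.argmax((U * span) @ w_e))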
Two strong advantages come out of the DCN model:
The core idea: decide whether to use the RNN or the pointer network depending on how much attention a sentinel receives
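A sketch of that mixing step (random stand-ins for the model's actual outputs; the gate is simply the attention mass the sentinel receives):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    vocab_size, ctx_len = 1000, 20
    p_vocab = softmax(np.random.randn(vocab_size))        # RNN softmax over the vocabulary
    ptr_attention = softmax(np.random.randn(ctx_len + 1)) # attention over context words + sentinel

    g = ptr_attention[-1]                                 # attention mass on the sentinel
    context_word_ids = np.random.randint(0, vocab_size, ctx_len)
    p_ptr = np.zeros(vocab_size)
    np.add.at(p_ptr, context_word_ids, ptr_attention[:-1])  # scatter pointer attention onto word ids

    p_final = g * p_vocab + p_ptr   # sentinel gate decides how much to trust the RNN vs the pointer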
[Figure: frequent vs rare words]
For image captioning tasks,
many words don't come from the image at all
How can we indicate
(a) what parts of the image are relevant, and
(b) when the model doesn't need to look at the image?
Can the model do better by not distracting itself with the image?
From colleagues Lu, Xiong, Parikh*, Socher
* Parikh from Georgia Institute of Technology
The visual QA work was extended to producing full sentences, again using a sentinel for the steps where the model isn't looking at the image while generating
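A rough sketch of that adaptive step, where a "visual sentinel" competes with the image regions for attention (random placeholders throughout):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    regions = np.random.randn(49, 64)    # one feature vector per image region
    sentinel = np.random.randn(64)       # visual sentinel built from the decoder's own state
    query = np.random.randn(64)          # current language model state

    scores = np.concatenate([regions @ query, [sentinel @ query]])
    attention = softmax(scores)          # attends over the image regions *and* the sentinel

    beta = attention[-1]                 # how much the model relies on the sentinel vs the image
    context = attention[:-1] @ regions + beta * sentinel   # mixed context for the next word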
Using the sentinel we can tell when and where the model looks