## Frontiers of Attention and Memory in Deep Learning

### Stephen Merity @smerity

AI By The Bay

* I like deep learning but I spend a lot of time ranting against the hype ... ^_^

"New ____ city"

"York"

"New ____ city"

"York"

# The limits of restricted memory

### Example from the Visual Genome dataset

• An RNN consumes and generates a sequence
• Characters, words, ...
• The RNN updates an internal state h according to the existing hidden state h and the current input x

If you're not aware of the GRU or LSTM, you can consider them as improved variants of the RNN
(do read up on the differences though!)

$h_t = \text{RNN}(x_t, h_{t-1})$
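
As a rough illustration (my own toy sketch, not from the talk), this is what the update looks like for the simplest, vanilla RNN cell in NumPy; GRUs and LSTMs replace this single tanh update with gated versions of it:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """Vanilla RNN update: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Toy dimensions: 4-dim inputs, 3-dim hidden state (all weights random)
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):    # consume a sequence of 5 inputs
    h = rnn_step(x_t, h, W_x, W_h, b)  # h carries information forward
```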

# Regularizing RNN

### Zoneout (Krueger et al. 2016)

Stochastically forces some of the recurrent units in h to maintain their previous values

Imagine a faulty update mechanism, where $\delta$ is the update and $m$ the dropout mask:

$h_t = h_{t-1} + m \odot \delta_t$
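
As a sketch (my own toy code, not the authors'), the "faulty update" above can be written directly in NumPy; `update_prob` is a hypothetical hyperparameter for how often a unit actually takes its new value:

```python
import numpy as np

def zoneout_step(h_prev, h_candidate, update_prob=0.85, training=True):
    """Zoneout as a faulty update: h_t = h_{t-1} + m * delta_t, where
    delta_t = h_candidate - h_{t-1} and m is a random binary mask.
    Units where m == 0 keep their previous value for this timestep."""
    delta = h_candidate - h_prev
    if training:
        m = (np.random.rand(*h_prev.shape) < update_prob).astype(h_prev.dtype)
    else:
        m = update_prob   # at test time, use the expected value of the mask
    return h_prev + m * delta
```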

# Regularizing RNN

Both these recurrent dropout techniques are easy to implement and they're already a part of many frameworks

Example: a one line change in Keras for variational dropout
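
For instance, with a Keras recurrent layer it is just an extra argument (a sketch, assuming a recent Keras version where `recurrent_dropout` reuses the same mask at every timestep):

```python
from keras.layers import LSTM

# dropout acts on the inputs; recurrent_dropout acts on the recurrent
# connections, using the same mask at every timestep (variational-style)
layer = LSTM(128, dropout=0.2, recurrent_dropout=0.2)
```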

# Translation as an example

The encoder and decoder are RNNs

Stunningly, you can supply English as input, German as expected output, and the model learns to translate

After each step, the hidden state contains an encoding of the sentence up until that point, with the final state S attempting to encode the entire sentence

The key issue comes in the quality of translation for long sentences - the entire input sentence must be compressed into a single hidden state ...

# Translation as an example

(Figure: an example source sentence of 38 words, all of which must be squeezed into that single hidden state)

# Neural Machine Translation

Human beings translate a part at a time, referring back to the original source sentence when required

How can we simulate that using neural networks?
By providing an attention mechanism

# Translation as an example

As we process each word on the decoder side, we query the source encoding for relevant information

For long sentences, this allows a "shortcut" for information - the path is shorter and we're not constrained to the information from a single hidden state


# Attention in detail

For each hidden state $h_i$ we produce an attention score $a_i$

We ensure that $\sum_i a_i = 1$
(the attention scores sum up to one)

We can then produce a context vector, a weighted summation of the hidden states:

$c = \sum_i a_i h_i$
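
As a toy NumPy sketch (values made up), the context vector is just that weighted sum:

```python
import numpy as np

H = np.random.randn(6, 4)                       # 6 encoder hidden states, size 4
a = np.array([0.05, 0.1, 0.5, 0.2, 0.1, 0.05])  # attention scores, sum to 1

c = (a[:, None] * H).sum(axis=0)                # context vector c = sum_i a_i h_i
assert np.isclose(a.sum(), 1.0)
```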


# Attention in detail

How do we ensure that our attention scores sum to 1?
(also known as being normalized)

We use our friendly neighborhood softmax function on our unnormalized raw attention scores $r$:

$\sum_i a_i = 1, \qquad a_i = \frac{e^{r_i}}{\sum_j e^{r_j}}$
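
In code, the normalization is one small function (a sketch using the usual max-subtraction trick for numerical stability):

```python
import numpy as np

def softmax(r):
    """Turn raw attention scores r into positive weights that sum to 1."""
    e = np.exp(r - r.max())        # subtract the max for numerical stability
    return e / e.sum()

a = softmax(np.array([2.0, 0.5, -1.0]))
print(a, a.sum())                  # a is now a valid distribution: sums to 1.0
```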

# Attention in detail

Finally, to produce the raw attention scores, we have a number of options, but the two most popular are:

Inner product between the query and the hidden state:

$r_i = \mathbf{q} \cdot \mathbf{h}_i = \sum_j q_j h_{ij}$

Feed forward neural network using the query and hidden state (this may have one or many layers):

$r_i = \tanh(W[\mathbf{h}_i; \mathbf{q}] + b)$
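
A toy NumPy sketch of both scoring choices, followed by the softmax and context vector from the previous slides (all shapes and parameter names here are made up for illustration):

```python
import numpy as np

def dot_scores(q, H):
    """Inner product scores: r_i = q . h_i for every hidden state h_i."""
    return H @ q

def ff_scores(q, H, W, b):
    """Single-layer feed-forward scores: r_i = tanh(W [h_i; q] + b)."""
    hq = np.concatenate([H, np.tile(q, (H.shape[0], 1))], axis=1)  # [h_i; q]
    return np.tanh(hq @ W + b)

H = np.random.randn(6, 4)               # 6 encoder hidden states of size 4
q = np.random.randn(4)                  # decoder-side query vector
W, b = np.random.randn(8), 0.0          # maps the size-8 concat [h_i; q] to a scalar

r = dot_scores(q, H)                    # raw scores (either choice works)
a = np.exp(r - r.max()); a /= a.sum()   # softmax: normalized attention weights
c = a @ H                               # context vector c = sum_i a_i h_i
```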

# Visualizing the attention

(Figure: context vector in green, attention score calculations in red)

European Economic Area <=> zone économique européenne

# Neural Machine Translation

Our simple model

More depth and forward + backward

Residual connections

# Neural Machine Translation

If you're interested in what production NMT looks like,
"Peeking into the architecture used for Google's NMT"
(Smerity.com)

# QA for the DMN

Rather than each hidden state corresponding to a word as in translation, it corresponds to either a sentence (for text) or a region of an image

# Episodic Memory

Some tasks require multiple passes over memory for a solution
Episodic memory allows us to do this

# Pointer Networks

(Figure: example tasks - Convex Hull and Delaunay Triangulation)

# Pointer Networks

## Extending QA attention passes w/ Dynamic Coattention Network

DMN and other attention mechanisms show the potential for multiple passes to perform complex reasoning

Particularly useful for tasks where transitive reasoning is required or where answers can be progressively refined

Can this be extended to full documents?

Note: work from my amazing colleagues -
Caiming Xiong, Victor Zhong, and Richard Socher

## Dynamic Coattention Network

The overarching concept is relatively simple:

## Dynamic Coattention Network

(Figure: encoder for the Dynamic Coattention Network)

It's the specific implementation that kills you ;)

## Dynamic Coattention Network

Explaining the architecture fully is complicated but intuitively:

Iteratively improve the start and end points of the answer
as we perform more passes over the data

## Dynamic Coattention Network

Two strong advantages come out of the DCN model:

• SQuAD provides an underlying dataset that generalizes well
• Like the pointer network, out-of-vocabulary (OoV) terms are not a major issue

# Pointer Networks


# Pointer Sentinel (Merity et al. 2016)

The core idea: decide whether to use the RNN or the pointer network depending on how much attention a sentinel receives

(Figure: behaviour on frequent vs rare words)
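
As a toy sketch of that idea (my own illustration, not the paper's exact formulation): treat the final attention slot as the sentinel, give its mass to the RNN's softmax over the vocabulary, and add the remaining attention mass back onto the word ids seen in the recent window:

```python
import numpy as np

def pointer_sentinel_mix(p_vocab, attn_with_sentinel, window_word_ids):
    """Mix the RNN softmax and the pointer distribution.
    The last attention slot is the sentinel: its mass g is the weight given
    to the RNN; the rest of the attention is copied onto recent word ids."""
    g = attn_with_sentinel[-1]
    p = g * p_vocab
    for score, word_id in zip(attn_with_sentinel[:-1], window_word_ids):
        p[word_id] += score
    return p                              # still sums to 1

p_vocab = np.full(10, 0.1)                # toy uniform RNN softmax over 10 words
attn = np.array([0.1, 0.3, 0.2, 0.4])     # attention over 3 recent words + sentinel
p = pointer_sentinel_mix(p_vocab, attn, window_word_ids=[7, 2, 7])
assert np.isclose(p.sum(), 1.0)           # recently repeated words (id 7) get boosted
```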

# Neural Cache Model (Grave et al. 2016)

Many words don't come from the image at all

How can we indicate (a) what parts of the image are relevant and (b) when the model doesn't need to look at the image?

Can the model do better by not distracting itself with the image?

From colleagues Lu, Xiong, Parikh*, Socher
* Parikh from Georgia Institute of Technology

# Image Captions (Lu et al. 2016)

The visual QA work was extended to producing sentences and also utilized a sentinel for words generated without looking at the image
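
A toy sketch of that gating (my own illustration, not the authors' exact equations): blend the attended image context with a learned "visual sentinel" vector, where beta is the attention the sentinel receives:

```python
import numpy as np

def adaptive_context(image_regions, attn, sentinel, beta):
    """Blend image attention with a visual sentinel: when beta is near 1,
    the word is generated without really looking at the image."""
    c_img = attn @ image_regions           # weighted sum over image regions
    return beta * sentinel + (1.0 - beta) * c_img

regions = np.random.randn(5, 8)            # 5 image regions, 8-dim features (toy)
attn = np.full(5, 0.2)                     # attention over the regions, sums to 1
sentinel = np.random.randn(8)              # hypothetical learned sentinel vector
c = adaptive_context(regions, attn, sentinel, beta=0.9)   # mostly ignores the image
```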

# Image Captions (Lu et al. 2016)

Using the sentinel we can tell when and where the model looks
