## Deep learning has saved us from feature engineering*

This is true - to some degree ...

Word vectors have replaced much of the manual work that WordNet was intended for

Machine translation no longer needs the training data to have manual alignments

Speech recognition no longer uses manual features / constructions like phonemes

## Deep learning has saved us from feature engineering*

European Economic Area
aligned with
zone economique europeenne

## Model Architectures

The building blocks may be shared but
very little beyond that

For each task, there's a newly formed variation of the architecture just for it

This is especially true when hunting for the mythical state of the art numbers
(small percentage point gains can justify
very odd and very short term modifications)

## Model Architectures (Vision)

Image from Joseph Paul Cohen

All of these are variations for the task of image classification - not even more tailored tasks such as visual question answering

## Model Architectures (Text)

Encoder for Char-level Neural MT
(Lee et al. 2016):
convolutions and pooling for speed,
highway network for more processing

Google's Neural Machine Translation architecture (GNMT):
close to standard encoder-decoder but eight (!) layers + residual connections

## Model Architectures

Specialized architectures aren't wrong by themselves
but they do pose some strong limitations

+ Transfer learning between tasks is difficult if the architectures for each task are different

+ Improvements found for one architecture may not be applicable (or tested) on another

+ This thinking encourages going back to the drawing board every time we get a new task

## Model Architectures

Our largest interest is in transfer learning:
two related tasks should help each other

We already have primitive transfer learning:

Pretrained word vectors
(word2vec, GloVe, ...)

Pretrained ImageNet weights
(AlexNet, VGG, Inception, ResNet, ...)

Both leverage large datasets to provide an aspect of world knowledge to the model
Both are quite limited in scope however

## Potential paths for improvement

Improve the shared building blocks that all architectures use:
better methods and components (regularization for RNNs / LSTM),
introducing new concepts (residual connections),
etc ...

More heretical: try to solve multiple tasks using
a single shared architecture
(though this is surprisingly difficult, especially while keeping SotA!)

## The path to shared architectures

Modularity
Tasks do have different requirements, so construct the architectures with that in mind (i.e. swappable input module)
[vectors and joint training give us our "shared language"]

Remove information bottlenecks
Different tasks require processing different amounts of information and potentially different amounts of computation

Make reasoning mechanisms more generic
If the underlying reasoning mechanism can't solve certain subproblems it can't be used on tasks involving that subproblem

# Information Bottlenecks

Where is your model forced to use a compressed representation?

Most importantly,
is that a good thing?

# Compression

1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary?        bathroom        1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel?      hallway         4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel?      hallway         4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel?     office          11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra?     bathroom        8
1 Sandra travelled to the office.
2 Sandra went to the bathroom.
3 Where is Sandra?      bathroom        2

Extract from the Facebook bAbI Dataset

Visual Genome: http://visualgenome.org/

## Beyond classification ...

VQA dataset: http://visualqa.org/

## Beyond classification ...

* TIL Lassi = popular, traditional, yogurt based drink from the Indian Subcontinent

Imagine I gave you an article or an image, asked you to memorize it, took it away, then asked you various questions.

Even as intelligent as you are,
you're going to get a failing grade :(

Why?

• You can't store everything in working memory
• Without a question to direct your attention,
you waste focus on unimportant details

Optimal: give you the input data, give you the question, allow as many glances as possible

Visual Genome: http://visualgenome.org/

## Neural Machine Translation

Figure from Chris Olah's Visualizing Representations

## Neural Machine Translation

European Economic Area <=> zone economique europeenne

## QA for Dynamic Memory Networks

• A modular and flexible DL framework for question answering
• Capable of tackling wide range of tasks and input formats
• Can even been used for general NLP tasks (i.e. non QA)
(PoS, NER, sentiment, translation, ...)

### Dynamic Memory Networks for Visual and Textual Question Answering (Xiong et al., 2016)

For full details:

## Related Attention/Memory Work

• Sequence to Sequence (Sutskever et al. 2014)
• Neural Turing Machines (Graves et al. 2014)
• Teaching Machines to Read and Comprehend
(Hermann et al. 2015)
• Learning to Transduce with Unbounded Memory (Grefenstette 2015)
• Structured Memory for Neural Turing Machines
(Wei Zhang 2015)

• Memory Networks (Weston et al. 2015)
• End to end memory networks (Sukhbaatar et al. 2015)

## DMN Overview

Original input module:
a simple uni-directional GRU

## QA for Dynamic Memory Networks

• A modular and flexible DL framework for question answering
• Capable of tackling wide range of tasks and input formats
• Can even been used for general NLP tasks (i.e. non QA)
(PoS, NER, sentiment, translation, ...)

## Input Modules

+ The module produces an ordered list of facts from the input
+ We can increase the number or dimensionality of these facts
+ Input fusion layer (bidirectional GRU) injects positional information and allows interactions between facts

## Episodic Memory Module

Composed of three parts with potentially multiple passes:

• Computing attention gates
• Attention mechanism
• Memory update

## Computing Attention Gates

Each fact receives an attention gate value from [0, 1]

The value is produced by analyzing [fact, query, episode memory]

Optionally enforce sparsity by using softmax over attention values

## Soft Attention Mechanism

c = \sum^N_{i=1} g_i f_i
$c = \sum^N_{i=1} g_i f_i$

If the gate values were passed through softmax,
the context vector is a weighted  summation of the input facts

Given the attention gates, we now want to extract a context vector from the input facts

Issue: summation loses positional and ordering information

## Attention GRU Mechanism

If we modify the GRU, we can inject information from the attention gates.

## Attention GRU Mechanism

If we modify the GRU, we can inject information from the attention gates.

By replacing the update gate u with the activation gate g,
the update gate can make use of the question and memory

## Results

Focus on three experimental domains:

## Accuracy: Text QA (bAbI 10k)

bAbI is a set of 20 different tasks by Facebook that represent near "unit tests" of logical reasoning

## Accuracy: Sentiment Analysis

Experiments over the Stanford Sentiment Treebank

Test accuracies:

• MV-RNN and RNTN:
Socher et al. (2013)

• DCNN:
Kalchbrenner et al. (2014)

• PVec: Le & Mikolov. (2014)

• CNN-MC: Kim (2014)

• DRNN: Irsoy & Cardie (2015)

• CT-LSTM: Tai et al. (2015)

## Accuracy related to attention passes

The results of the model generally improve with more passes,

especially for tasks requiring transitive reasoning

bAbi tasks (three-facts in particular) are constructed to require transitive reasoning

## Accuracy related to attention passes

For sentiment analysis, two passes is shown to provide the best results. Both examples are incorrect with one pass.

In its ragged, cheap and unassuming way, the move works.

## Accuracy related to attention passes

For sentiment analysis, two passes is shown to provide the best results. Both examples are incorrect with one pass.

The best way to hope for any chance of enjoying this film is by lowering your expectations.

We noted earlier primitive transfer learning:

Pretrained word vectors
(word2vec, GloVe, ...)

Pretrained ImageNet weights
(AlexNet, VGG, Inception, ResNet, ...)

Once we have these weights, we don't touch the original data anymore ...
In the optimal world we should continue consulting these datasets if they're useful!

For each word, we want both
word level and char level knowledge

Cat = [word(Cat); char(Cat)]

word(Cat) is standard Skipgram word vector

char(Cat) trains n-grams with Skipgram,
C, a, t
^C Ca at t$^Ca Cat at$

then averages the resulting vectors of the unique character n-grams

## Levels of our Joint Many Task model

The JMT model is composed of:

(Word level) POS
(Word level) Chunking
(Syntactic level) Dependency
(Semantic level) Relatedness
(Semantic level) Entailment

Each layer feeds in to the next layer, building a progressively enriched representation

Work by Kazuma Hashimoto, Caiming Xiong,
Yoshimasa Tsuruoka & Richard Socher

(Hashimoto (intern) and Tsuruoka from University of Tokyo)

(Semantic level) Entailment

(Semantic level) Relatedness

Each layer feeds in to the next layer, building a progressively enriched representation

(Syntactic level) Dependency

(Word level) Chunking

(Word level) POS

## Levels of our Joint Many Task model

The model is trained jointly.

To prevent the potential for
catastrophic interference
we penalize modifications to the lower level weights.
(Uses L2 regularization)

Training moves from the lowest dataset to the highest.
(i.e. POS
⇒ chunk ⇒ dep ⇒ ...)

## Levels of our Joint Many Task model

Training moves from the lowest dataset to the highest.
(i.e. POS ⇒ chunk ⇒ dep ⇒ ...)

## Training the Joint Many Task model

Loss = Cross Entropy + L2 Regularization + Successive Regularization

The training regime and successive regularization are important:

Achieves state of the art on
4 out of 5
(everything except POS)

Joint training substantially helps the majority of tasks

Higher is better for all tasks
except relatedness

State of the art on 4 of the 5 highly competitive tasks
we experimented over

## Why is JMT important?

Now we can leverage the pretrained set of weights and knowledge

(Word level) POS
(Word level) Chunking
(Syntactic level) Dependency
(Semantic level) Relatedness
(Semantic level) ​Entailment

New tasks can be slotted in to the existing architecture and take advantage of the
progressively enriched representation

## Improving single architecture and building blocks

Shared architectures and joint many task models have many advantages but are slow and temperamental (for now)

Pushing state of the art with specific architectures is important - helps lay the groundwork for later joint models

We're also highly interested in extending the building blocks

## Extending QA attention passes w/ Dynamic Coattention Network

DMN and other attention mechanisms show the potential for multiple passes to perform complex reasoning

Particularly useful for tasks where transitive reasoning is required or where answers can be progressively refined

Can this be extended to full documents?

Note: work from my amazing colleague -
Caiming Xiong, Victor Zhong, and Richard Socher

## Dynamic Coattention Network

The overarching concept is relatively simple:

## Dynamic Coattention Network

Encoder for the Dynamic Coattention Network

It's the specific implementation that kills you ;)

## Dynamic Coattention Network

Explaining the architecture fully is complicated but intuitively:

Iteratively improve the start and end points of the answer
as we perform more passes on the data

## Improving building blocks: RNNs

For our work, recurrent neural networks are a core tool
though they do have fundamental limitations

Overfitting is a major problem
Slow for both training and prediction

## Regularizing RNNs

RNNs can overfit strongly on the recurrent connections

Dropout on recurrent connections does not work by default -
the dropout is applied too many times, killing RNN's memory

## Regularizing RNNs

### Zoneout (Krueger et al. 2016) Stochastically forces some of the recurrent units in h to maintain their previous values Imagine a faulty update mechanism: where delta is the update and m the dropout mask

h_t = h_{t-1} + m \odot \delta_t
$h_t = h_{t-1} + m \odot \delta_t$

## Improving building blocks: RNNs

For our work, recurrent neural networks are a core tool
though they do have fundamental limitations

Overfitting is a major problem
Slow for both training and prediction

## Improving RNN speed via QRNNs

James Bradbury and I  created quasi-recurrent neural networks (QRNNs) to maximize RNN speed without losing accuracy

For standard RNNs, matrix multiplications at each timestep depend on the output of the previous timestep
This forces us into a sequential process that doesn't use the GPU well

Red signifies convolutions or matrix multiplications
Blue signifies parametersless functions

## Improving RNN speed via QRNNs

Key ideas:

take inspiration from CNNs in only allowing fully parallel operations,

make the recurrent function (which is sequential by necessity)
as minimal and efficient as possible

Red signifies convolutions or matrix multiplications
Blue signifies parametersless functions

## Improving RNN speed via QRNNs

Looking at the equations for a minimal LSTM (no input gate),
the use of the previous hidden state is the bottleneck

By replacing h with a convolution over the input x,
we may lose some computational capacity but we can be far more parallel

## Improving RNN speed via QRNNs

This results in the output of c being "dynamic average pooling",
where the average pooling is controlled by the gates f

Here we show an example for a convolutional filter width of 2,
though there is no limitation over the filter width

The only recurrent connection is in blue and the computation is efficient on GPUs

## QRNN results: IMDb

The IMDb dataset is near worst case for LSTMs due to long documents
(a) very slow for LSTMs and (b) gradient disappears quickly

Achieves better than previous LSTM approaches and is over 3x faster
(note: compared to highly optimized Nvidia cuDNN library)

## QRNN results: language modeling

Language modeling is a standard task for RNNs and frequently used to test recurrent regularization techniques

The QRNN achieves similar results as strongly regularized LSTMs
(How? The recurrent capacity is limited - minimal need to regularize)

## QRNN results: character level MT

Character level machine translation holds great promise for morphologically rich languages like German
(only recently have they achieved similar accuracy to word level)

QRNNs with convolutional window of 6 (i.e. look at the last six characters) achieves better results and are over 4x faster than comparable LSTMs

## QRNN speed breakdown

Forward + backward times for a single batch in language modeling
(note: RNN used to be the dominant component, for QRNN it's softmax)

QRNN compared to cuDNN LSTM
(note: longer sequences is better!)

## Summary

• Attention and memory can improve logical reasoning and avoid the information bottlenecks prevalent in architectures
• While complex, joint many task models can leverage multiple datasets to achieve better results via transfer learning
• If you're using LSTMs, make sure you perform recurrent dropout! QRNN is also an interesting avenue of exploration ;)
• This is still early days!

## Extra content!

(if reading online, the majority of the following slides require you to navigate downwards, not just sideways)

## Improving building blocks: RNNs

For our work, recurrent neural networks are a fundamental tool

Hidden state is limited in capacity
Encoding / decoding rare words is problematic

## Pointer Networks

Convex Hull

Delaunay  Triangulation

The challenge: we don't have a "vocabulary" to refer to our points (i.e. P1 = [42, 19.5])
We need to reproduce the exact point

## Pointer Networks

Notice: pointing to input!

## Pointer Sentinel-LSTM

State of the art results on language modeling
(generating language and/or autocompleting a sentence)

By smerity

• 2,326