With good data, deep learning can give high accuracy in image and text classification
It's trivially easy to train your own classifier
with near-zero ML knowledge
6th and 7th grade students created a custom vision classifier for TrashCam
[Trash, Recycle, Compost] with 90% accuracy
Work by MM colleagues: Caiming Xiong, Kai Sheng Tai, Ivo Mihov, ...
In 2012, AlexNet hit 16.4% top-5 error
The second best model (non-CNN) was at 26.2%
Special note: what's the error of a single human annotator?
5.1%
AlexNet training throughput based on 20 iterations
Slide from Julie Bernauer's NVIDIA presentation
* I rant against deep learning a lot ... ^_^
VQA dataset: http://visualqa.org/
* TIL Lassi = a popular, traditional, yogurt-based drink from the Indian subcontinent
Visual Genome: http://visualgenome.org/
1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary? bathroom 1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel? hallway 4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel? hallway 4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel? office 11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra? bathroom 8
1 Sandra travelled to the office.
2 Sandra went to the bathroom.
3 Where is Sandra? bathroom 2
Extract from the Facebook bAbI Dataset
Imagine I gave you an article or an image, asked you to memorize it, took it away, then asked you various questions.
As intelligent as you are,
you're going to get a failing grade :(
Why?
Optimal: give you the input data, give you the question, allow as many glances as possible
Where is your model forced to use a compressed representation?
Most importantly,
is that a good thing?
0.11008 -0.38781 -0.57615 -0.27714 0.70521 0.53994 -1.0786 -0.40146 1.1504 -0.5678 0.0038977 0.52878 0.64561 0.47262 0.48549 -0.18407 0.1801 0.91397 -1.1979 -0.5778 -0.37985 0.33606 0.772 0.75555 0.45506 -1.7671 -1.0503 0.42566 0.41893 -0.68327 1.5673 0.27685 -0.61708 0.64638 -0.076996 0.37118 0.1308 -0.45137 0.25398 -0.74392 -0.086199 0.24068 -0.64819 0.83549 1.2502 -0.51379 0.04224 -0.88118 0.7158 0.38519
... with the dog running behind ...
... the black dog barked at ...
... most dangerous dog breeds focuses ...
... when a dog licks your ...
... the massive dog scared the postman ...
... with the ___ running behind ...
... the black ___ barked at ...
... most dangerous ___ breeds focuses ...
... when a ___ licks your ...
... the massive ___ scared the postman ...
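This "fill in the blank" setup is roughly how word2vec-style models are trained: slide a window over text and predict the missing centre word from its context. A tiny sketch of generating those (context, target) training pairs; the tokenisation and window size below are simplified assumptions, not the real pipeline:

```python
def context_target_pairs(tokens, window=2):
    """Yield (context words, centre word) pairs, CBOW-style."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

sentence = "the massive dog scared the postman".split()
for context, target in context_target_pairs(sentence):
    print(context, "->", target)
# ['massive', 'dog'] -> the
# ['the', 'dog', 'scared'] -> massive
# ...
```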
king - man + woman
~= queen
(example image from GloVe - not word2vec but conceptually similar)
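A minimal sketch of that analogy arithmetic, assuming we already have a dictionary of pretrained word vectors; the toy 3-dimensional vectors here are made up purely for illustration (real ones are 50-300 dimensional):

```python
import numpy as np

# Hypothetical pretrained word vectors (e.g. loaded from word2vec or GloVe)
vecs = {
    "king":  np.array([0.8, 0.9, 0.10]),
    "man":   np.array([0.7, 0.1, 0.00]),
    "woman": np.array([0.6, 0.1, 0.90]),
    "queen": np.array([0.7, 0.9, 0.95]),
    "dog":   np.array([0.1, 0.4, 0.30]),
}

def nearest(query, exclude):
    # Cosine similarity against every known word, skipping the query words
    return max(
        (w for w in vecs if w not in exclude),
        key=lambda w: np.dot(vecs[w], query)
        / (np.linalg.norm(vecs[w]) * np.linalg.norm(query)),
    )

query = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(query, exclude={"king", "man", "woman"}))  # hopefully "queen"
```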
Figure from Chris Olah's Visualizing Representations
Figure (input → hidden state) from Chris Olah's Visualizing Representations
If you hear
Gated Recurrent Unit (GRU)
or
Long Short-Term Memory (LSTM)
just think RNN
Figure from Chris Olah's Visualizing Representations
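For intuition, a minimal (Elman-style) RNN step in numpy, with made-up sizes and random weights; GRUs and LSTMs add gating on top of this same pattern of mixing the current input with the previous hidden state:

```python
import numpy as np

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(hidden_size, input_size))
W_hh = rng.normal(size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_step(x, h_prev):
    # New hidden state mixes the current input with the previous hidden state
    return np.tanh(W_xh @ x + W_hh @ h_prev + b)

h = np.zeros(hidden_size)
hidden_states = []
for x in rng.normal(size=(5, input_size)):  # a sequence of 5 toy inputs
    h = rnn_step(x, h)
    hidden_states.append(h)                 # store every timestep for later attention
```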
Figure from Bahdanau et al's
Neural Machine Translation by Jointly Learning to Align and Translate
"Quora requires users to register with their real names rather than an Internet pseudonym ( screen name ) , and visitors unwilling to log in or use cookies have had to resort to workarounds to use the site ."
39 words
Figure from Chris Olah's Visualizing Representations
We store the hidden state at each timestep
(i.e. after reading "I", "think", "...")
We can later query this memory with attention. Attention allows us to sum up a region of interest when we get to the appropriate section!
Memory: h1 h2 h3 h4
Attention: a1 a2 a3 a4, s.t. sum(a) = 1
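A minimal numpy sketch of that attention step, assuming the hidden states h1..h4 are already stored as memory and the query is whatever state is asking the question (all values here are toy numbers):

```python
import numpy as np

# Memory: one hidden state per timestep (toy 3-dimensional states)
H = np.array([
    [0.1, 0.3, 0.5],   # h1
    [0.2, 0.1, 0.0],   # h2
    [0.9, 0.7, 0.2],   # h3
    [0.4, 0.4, 0.4],   # h4
])

query = np.array([1.0, 0.5, 0.0])            # e.g. decoder state asking "what matters now?"

scores = H @ query                            # one score per stored hidden state
a = np.exp(scores) / np.exp(scores).sum()     # softmax => attention weights, sum(a) == 1
context = a @ H                               # weighted sum over the memory

print(a, context)
```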
Results from Bahdanau et al's
Neural Machine Translation by Jointly Learning to Align and Translate
European Economic Area <=> zone économique européenne
For full details:
+ The module produces an ordered list of facts from the input
+ We can increase the number or dimensionality of these facts
+ Input fusion layer (bidirectional GRU) injects positional information and allows interactions between facts
Composed of three parts with potentially multiple passes:
Each fact receives an attention gate value from [0, 1]
The value is produced by analyzing [fact, query, episode memory]
Optionally enforce sparsity by using softmax over the attention values
Given the attention gates, we now want to extract a context vector from the input facts
If the gate values were passed through softmax,
the context vector is a weighted summation of the input facts
Issue: summation loses positional and ordering information
If we modify the GRU, we can inject information from the attention gates.
By replacing the update gate u with the attention gate g,
the update step can make use of the question and memory
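A rough numpy sketch of such an attention-based GRU step, where the scalar attention gate g stands in for the update gate; the weight matrices and sizes below are placeholder assumptions:

```python
import numpy as np

hidden_size, input_size = 4, 4
rng = np.random.default_rng(1)
W_r, U_r = rng.normal(size=(hidden_size, input_size)), rng.normal(size=(hidden_size, hidden_size))
W_h, U_h = rng.normal(size=(hidden_size, input_size)), rng.normal(size=(hidden_size, hidden_size))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gru_step(fact, h_prev, g):
    """One step over a fact; g in [0, 1] comes from analyzing [fact, query, episode memory]."""
    r = sigmoid(W_r @ fact + U_r @ h_prev)               # reset gate, as in a normal GRU
    h_tilde = np.tanh(W_h @ fact + U_h @ (r * h_prev))   # candidate hidden state
    # The attention gate g plays the role of the update gate:
    # if g == 0 the fact is ignored and the hidden state passes through unchanged.
    return g * h_tilde + (1.0 - g) * h_prev
```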
Focus on three experiments:
Convex Hull
Delaunay Triangulation
Travelling Salesman Problem (TSP)
The challenge: we don't have a "vocabulary" to refer to our points (e.g. P1 = [42, 19.5])
We need to reproduce the exact point
Notice: pointing to input!
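A toy sketch of what "pointing to input" means: the output distribution is a softmax over the encoder states of the input points themselves, so the prediction is an index into the input rather than a vocabulary token (the encoder states and decoder query below are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
encoder_states = rng.normal(size=(6, 8))   # one vector per input point P1..P6
decoder_query = rng.normal(size=8)         # current decoder state

scores = encoder_states @ decoder_query
p_point = np.exp(scores) / np.exp(scores).sum()   # distribution over *input positions*

chosen = int(np.argmax(p_point))
print(f"Point to input position {chosen}")  # i.e. reproduce that exact input point
```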
tl;dr: We decide whether to use the RNN or the pointer network depending on what the pointer "memory" contains
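A hedged sketch of that decision as a mixture: a gate g (a made-up constant here; in the real model it is learned, e.g. via a sentinel in the pointer softmax) blends the RNN's vocabulary softmax with the pointer distribution mapped back onto the vocabulary:

```python
import numpy as np

vocab = ["the", "dog", "barked", "postman", "<unk>"]
p_vocab = np.array([0.40, 0.20, 0.15, 0.05, 0.20])   # RNN softmax over the vocabulary

# Pointer distribution over recent context words, mapped back to vocabulary ids
context = ["the", "massive", "dog", "scared", "the", "postman"]
p_ptr_positions = np.array([0.05, 0.05, 0.10, 0.05, 0.05, 0.70])  # attention over context

p_ptr = np.zeros(len(vocab))
for word, p in zip(context, p_ptr_positions):
    idx = vocab.index(word) if word in vocab else vocab.index("<unk>")
    p_ptr[idx] += p

g = 0.3   # how much to trust the RNN vs the pointer (learned in the real model)
p_final = g * p_vocab + (1.0 - g) * p_ptr
print(vocab[int(np.argmax(p_final))])   # "postman" -- copied from the context
```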
State-of-the-art results on language modeling
(generating language and/or autocompleting a sentence)
(Deep learning) moves pretty fast.
If you don't stop and look around once in a while, you could miss it.
Premise:
A black race car starts up in front of
a crowd of people.
Hypothesis:
A man is driving down a lonely road.
Is it (a) Entailment, (b) Neutral, or (c) Contradiction?