1) Positive sum
2) Minimize entropy
3) Maximize (useful) entropy
We as humans define the logic
We hence decide what information flows from input to output
Extract links => regex (for HTML?!?!? 😡)
Extract text => recursive descent parser
Extract specific text => ... spaghetti if ..?
Extract text and render in Markdown => ...
Oh, and here's a bug - not sure which level
Oh, and instead of HTML how about JSON?
Extract links => self evident from the LM
Extract text => self evident from the LM
Extract specific text => small output head
Extract text and render in Markdown => ...
Bugs are actually training data
Want JSON rather than HTML? Swap the data
"This bit is slow so why don't we try a
less powerful but faster part?"
"Wait ... it works just as well? O_o"
Better results for classification, language modeling, and character-level translation
Used by Baidu for their "Deep Voice" projects
Relatively simple code
(a few hundred lines of CUDA)
yet we see few of these types of optimizations
Language modeling is compression
Understanding the data is the objective
Learn which signals predict the future state
Learn to extract or maintain relevant contextual state
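A minimal sketch of "language modeling is compression", assuming a toy character bigram model with add-one smoothing (the corpus, the message, and every function name here are made up for illustration). The ideal code length of a message under a model is the sum of -log2 p over its symbols, so a model that predicts the next character better needs fewer bits than a flat code over the same alphabet.

```python
import math
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def code_length_bits(text, counts, alphabet):
    """Ideal code length of text[1:] given text[0], with add-one smoothing."""
    bits = 0.0
    for prev, nxt in zip(text, text[1:]):
        total = sum(counts[prev].values()) + len(alphabet)
        p = (counts[prev][nxt] + 1) / total
        bits += -math.log2(p)
    return bits


# Hypothetical toy corpus: the "language" the model gets to understand.
corpus = "the cat sat on the mat and the cat ate the rat " * 20
alphabet = sorted(set(corpus))
model = train_bigram(corpus)

message = "the rat sat on the cat"
model_bits = code_length_bits(message, model, alphabet)
# Baseline: every character costs log2(|alphabet|) bits, no context used.
uniform_bits = (len(message) - 1) * math.log2(len(alphabet))
```

Comparing `model_bits` with `uniform_bits` shows the saving that comes purely from having learned which signals predict the next state.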
Important to remember:
language is far more than just text
When passing data from function to function, what information do we maintain?
"Predict the next word" means a ghost of the context remains, scaled by how relevant it is to the objective above
Ask it to learn language modeling?
Your model learns counting as a sub-task
Held SotA on two LM datasets
(yay!)
Google then released ...
Boss: Your objective is to collect links for a web crawler
Huzzah! I can do that!
How about I use ...
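import re
import requests
# Fetch the page as text
data = requests.get('http://smerity.com/articles/2018/limited_compute.html').text
# Naive on purpose: only matches <a href="...">text</a> written exactly this way
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', data)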
Now is this wrong?
Not exactly.
What it does catch is correct.
It just misses oh so many edge cases ...
(= so much missed or lost context)
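To make the missed edge cases concrete, here is a small sketch that runs the same regex next to a link collector built on Python's standard `html.parser`. The `LinkCollector` class, the `extract_links` helper, and the sample markup are all illustrative assumptions, not part of the original crawler.

```python
import re
from html.parser import HTMLParser

NAIVE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the anchor we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up markup: four valid links, written four different ways.
SAMPLE = '''
<a href="/plain">Plain</a>
<a class="nav" href='/single-quoted'>Single quotes</a>
<A HREF="/upper">Upper case</A>
<a href="/nested"><b>Bold</b> text</a>
'''

naive_links = NAIVE.findall(SAMPLE)
parsed_links = extract_links(SAMPLE)
```

On this sample the regex finds only the first link, while the parser finds all four. The parser is still hand-written logic though: each new requirement (relative URLs, `<base>` tags, links added by scripts) means more code.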
Now your boss, excited with your progress, asks you to extract text from the same webpages you just processed.
It should be easy, right..?
Answer: 😭
You go all in and write a recursive descent parser (RDP)
Wait, boss, what text do you want? All text, including navigation? Only article text as if it were a news article? Sidebar text?
!?!?!??!?!
Now your boss, excited with your progress, asks you to convert that text to a Markdown equivalent.
Your answer: 😭
At least a butcher, baker, or candlestick maker has clear objectives
You've likely had to deal with some horrific code in your lifetime.
Now imagine having to deal with an entire
web's worth of silly people...
The architecture of the Web has several languages in it - there's HTTP, there's HTML, URLs are a language, there's CSS, and there's the scripting language. They're all in there and they can all be embedded in each other and they all have different quoting and escaping and commenting conventions. And they are not consistently implemented in all of the browsers. Some of them are not specified anywhere.
- Douglas Crockford (of JavaScript and JSON)
Forget a semicolon/bracket/closing tag/.../?
The LM will become uncertain
(we can measure the entropy)
and can even intelligently suggest
where you went wrong
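A rough sketch of the idea, assuming a toy character bigram model trained on a handful of made-up well-formed statements (a real LM would condition on far more context). Two quantities are measured: the entropy of the model's next-character prediction, which is how uncertain it is, and the surprisal of the character that actually appears, which points at where the input went wrong.

```python
import math
from collections import Counter, defaultdict


def train(lines):
    """Count character bigrams over a list of training strings."""
    counts = defaultdict(Counter)
    for line in lines:
        for prev, nxt in zip(line, line[1:]):
            counts[prev][nxt] += 1
    return counts


def next_char_probs(counts, prev, alphabet):
    """Add-one smoothed distribution over the next character."""
    total = sum(counts[prev].values()) + len(alphabet)
    return {ch: (counts[prev][ch] + 1) / total for ch in alphabet}


def entropy_bits(probs):
    """Shannon entropy of a predictive distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs.values())


def surprisals(counts, text, alphabet):
    """(index, -log2 p) for each character given the one before it."""
    out = []
    for i in range(1, len(text)):
        p = next_char_probs(counts, text[i - 1], alphabet)[text[i]]
        out.append((i, -math.log2(p)))
    return out


# Hypothetical training data: only well-formed statements.
training = ["foo(bar);\n", "bar(foo);\n", "baz(foo);\n", "foo(baz);\n"] * 25
alphabet = sorted({ch for line in training for ch in line})
model = train(training)

# How uncertain is the model after 'f' (always "fo") versus after 'o'?
entropy_after_f = entropy_bits(next_char_probs(model, "f", alphabet))
entropy_after_o = entropy_bits(next_char_probs(model, "o", alphabet))

# The closing bracket is missing; find the most surprising character.
broken = "foo(bar;\n"
worst_index, worst_bits = max(surprisals(model, broken, alphabet), key=lambda s: s[1])
```

Here `worst_index` lands on the `;` that arrives where the model expected `)`, which is the position a "did you forget a bracket?" suggestion would point at.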
Held SotA on two LM datasets
(yay!)
Google then released ...
(╯°□°)╯︵ ┻━┻
Neural Architecture Search:
"32,400-43,200 GPU hours"
I wasted months of my life -_-
As a swansong I went to improve PyTorch's language model example
Had to be fast and simple with minimal tweaks
for educational purposes
Small change...
Small change...
BIG IMPROVEMENT
???
Small change...
BIG IMPROVEMENT
Small change...
BIG IMPROVEMENT
Small change...
BIG IMPROVEMENT
Small change...
BIG IMPROVEMENT
Small change...
BIG IMPROVEMENT
NVIDIA cuDNN LSTM is fast - but a black box
You can't add layer norm
You can't add regularization
You can't do anything ...
Well, you can "bit twiddle" over the weights
Add recurrent dropout via weight modification
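A minimal sketch of recurrent dropout via weight modification, assuming a plain tanh RNN in pure Python as a stand-in for the black-box LSTM kernel. The trick is that the hidden-to-hidden weights are masked once, before the forward pass, so the recurrent kernel itself never has to change. The function names, sizes, drop probability, and the inverted-dropout rescaling are choices made for this illustration.

```python
import math
import random


def weight_drop(weights, p, rng):
    """DropConnect-style mask: zero each weight with probability p, rescale the rest."""
    keep = 1.0 - p
    return [[w / keep if rng.random() < keep else 0.0 for w in row] for row in weights]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_forward(inputs, w_ih, w_hh):
    """Stand-in for the untouchable recurrent kernel: it just takes weights."""
    h = [0.0] * len(w_hh)
    for x in inputs:
        a = matvec(w_ih, x)
        b = matvec(w_hh, h)
        h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
    return h


rng = random.Random(0)
hidden, features = 4, 3
w_ih = [[rng.uniform(-0.5, 0.5) for _ in range(features)] for _ in range(hidden)]
w_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(hidden)]
inputs = [[rng.uniform(-1, 1) for _ in range(features)] for _ in range(5)]

# Training: modify the recurrent weights once, then call the kernel as usual.
dropped_w_hh = weight_drop(w_hh, p=0.5, rng=rng)
h_train = rnn_forward(inputs, w_ih, dropped_w_hh)

# Evaluation: call the same kernel with the original weights.
h_eval = rnn_forward(inputs, w_ih, w_hh)
```

Because the same masked matrix is reused at every timestep, the regularization applies to the recurrent connections without any per-step hook inside the kernel.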
Result: a language model (AWD-LSTM) that was fast and SotA on standard hardware (12-24 hours on an old-school GPU), released open source.
It has been trained on dozens of other languages, serves as the basis of fast.ai's language model, and has been used in GitHub's Semantic Code Search, audio processing, bioinformatics, ...
New York Times (2012):
"How Many Computers to Identify a Cat?
16,000 (CPU cores)"
One year later: "three servers each with two quad-core CPUs and four Nvidia GeForce GTX 680 GPUs"
Neural Architecture Search:
"32,400-43,200 GPU hours"
Just over a year later:
"single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours"
AWD-LSTM can train to better perplexity on one GPU in under a day
- Large models fake it with soft memorization
- Large slow models mean few ablations
- Monocultures (attention, code, ...)
"Attention Is All You Need (And All We Tried)"
- Optimization (ML and HW) goes to the "winner"
- Much of the progress relies on free compute
- Models are not reproducible
(i.e. LM is compression, large models "cheat")
- We need non-parallel processing / backtracking
(hidden state is an implicit beam search)
Graph processing on a single-threaded laptop
Wait ... Did anyone actually check to see if we need that many attention heads..?
Can we use a single attention head?
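For reference, a single attention head is a small computation. The sketch below is standard scaled dot-product attention for one query in pure Python; the vectors are made up, and it is not claimed to be the attention variant used in any particular model.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(query, keys, values):
    """Scaled dot-product attention for one query over key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Hypothetical memory of three positions.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]

attended, weights = single_head_attention(query, keys, values)
```

The query aligns with the first and third keys, so those two positions receive equal, larger weights than the second.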
Did anyone update the old models
to new techniques?
Take AWD-LSTM (my 2017 SotA)
Add layer normalization
Add pointer-based attention (my 2016 SotA)
If you believe ensembles and individual communication are humanity's key to greater intelligence,
we need small and efficient independent language models
we need to enable better communication amongst the existing bajillion bioFLOPS
Language models should be
the next fundamental computing structure
- LMs are storage (task specific compression)
- LMs are compute (NFA/DFA/virtual machine)
For social:
- Personal optimization: "the language in your head"
- Inter-process optimization
- The long tail of language (French to English is boring)
- Fundamentally different programming paradigms and optimizations
Actors have a mailbox:
- receive messages and change internal state
- send messages to other actors
- create new actors
Each actor aims to minimize entropy of the data payload it contains.
They have a unique and/or shared language model.
Actors have a mailbox:
- LM filters messages
- LM composes messages
- Actor can spawn new data actors
- LM can dispatch messages based upon embeddings
The actor has both traditional and vector-based representations
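One way to picture such an actor, as a sketch under heavy assumptions: the embeddings below are hand-written two-dimensional vectors standing in for what a language model would produce, and `Message`, `Actor`, `dispatch`, and the similarity threshold are all hypothetical names and choices.

```python
import math
from collections import deque
from dataclasses import dataclass, field


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Message:
    text: str        # traditional (discrete) representation
    embedding: list  # vector representation (stand-in for LM output)


@dataclass
class Actor:
    name: str
    embedding: list
    threshold: float = 0.5
    mailbox: deque = field(default_factory=deque)
    state: list = field(default_factory=list)

    def receive(self, msg):
        """Filter step: only accept messages close to what this actor cares about."""
        if cosine(self.embedding, msg.embedding) >= self.threshold:
            self.mailbox.append(msg)
            return True
        return False

    def process(self):
        """Drain the mailbox and update internal state."""
        while self.mailbox:
            self.state.append(self.mailbox.popleft().text)


def dispatch(msg, actors):
    """Route a message to the actor whose embedding is most similar."""
    return max(actors, key=lambda a: cosine(a.embedding, msg.embedding))


links = Actor("links", [1.0, 0.0])
text = Actor("text", [0.0, 1.0])

msg = Message("collect hrefs from this page", [0.9, 0.1])
target = dispatch(msg, [links, text])
accepted = target.receive(msg)
rejected = text.receive(msg)  # too far from what the text actor handles
target.process()
```

Each message carries both representations: the discrete text is what gets stored, the vector is what gets used for routing and filtering.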
Branch prediction has such limited context
What if ~data blocks had an embedding vector?
(data blocks ~= data actors)
I strongly believe blocks of data will have a dual representation: discrete and neural embedding
If we communicated with aliens, especially over any substantial latency (light years), we'd be using language models.
- Leverage "world" knowledge (see translation)
- Compression of messages
- LMs can carry a query as payload, interrogate a dataset, and retrieve only what's relevant
For reading and sending messages we would use language models.
What if we were communicating with actors (aliens) light years away?
If we had a shared language model, we could compress messages (+) and pre-filter messages (+) based upon the "query" and likely "execution path"
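A loose analogy in code, clearly not a language model: `zlib` lets both ends of a channel agree on a preset dictionary (`zdict`), which plays the role of the shared knowledge here. The context string, the message, and the helper names are made up for illustration.

```python
import zlib

# Hypothetical context both sides already hold (the "shared model").
SHARED_CONTEXT = (
    b"request: extract links from the article text; reply in markdown; "
    b"reply in json; extract text from the page"
)


def compress_with_context(message, context):
    compressor = zlib.compressobj(zdict=context)
    return compressor.compress(message) + compressor.flush()


def decompress_with_context(blob, context):
    decompressor = zlib.decompressobj(zdict=context)
    return decompressor.decompress(blob) + decompressor.flush()


message = b"request: extract links from the article text; reply in markdown"

plain = zlib.compress(message)                            # no shared knowledge
shared = compress_with_context(message, SHARED_CONTEXT)   # shared knowledge
restored = decompress_with_context(shared, SHARED_CONTEXT)
```

The message that overlaps with the shared context costs fewer bytes on the wire, and the receiver can only reconstruct it because it holds the same context. A shared language model would generalize this from exact substrings to anything it can predict.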
CacheNN:
Fit a reversible residual sparse NN into L1 cache
Modify weights in place
Better compilation for experimentation
Common minimal hardware for ML deployment
Massive data access preferably via attention
Backtracking + sparse/dense MM
No one knows how efficient our work could be
or what knowledge we could extract
A single GPU can beat a cluster
Our theory lags behind our practice meaning
we have no freaking clue
Language models are pre-"The Mother of All Demos"