"Language is humanity's longest running program"
Hi, I'm @Smerity ^_^
My focus is language models.
The tagline I live by:
1) Positive sum
2) Minimize entropy
3) Maximize (useful) entropy
"Language is humanity's longest running program"
"Language as a technology"
Not AGI focused but language would be on the path to AGI
Past life / side hobby: Independent Researcher
Generally only use:
- A single GPU
- A day or so of training
This strategy ~reliably hits SotA
Focus:
- How ML can "solve" Blow's curse
- ML monoculture and hardware
- Language models of the future
"Evolutionary bottlenecks"
"Software is a gas"
Blow's inverse Moore's Law
Issues:
If software is a gas,
constrain the container
Issue in Software 1.0:
scaling up is bad ...
If ML is a gas,
it optimizes for the container
Scaling compute is ~trivial
Improving objective is ~possible
Training is JIT pre-compilation for the observed paths (dataset) and desired objectives
Our programs are
flexible, re-usable, and produce expected output
Reality:
Unsupervised modeling and
deep learning may allow all:
flexible, re-usable, and
expected output
Heretical claim
As programmers, functions are
our fundamental building blocks
We as humans define the logic
We hence decide what information flows from input to output
Take input, apply logic, produce output
Functions define our
level of abstraction
We can't influence
what came before
We can't be influenced by
what happens after
Software 1.0
Our tasks are defined by
static functions and data
Those functions are written by humans based upon "hidden" objectives
Objectives are lost
past abstraction boundaries
Software 1.0
Imagine writing a web scraper:
Extract links => regex (for HTML?!?!? 😡)
Extract text => recursive descent parser
Extract specific text => ... spaghetti if ..?
Extract text and render in Markdown => ...
Oh, and here's a bug - not sure which level
Oh, and instead of HTML how about JSON?
With gradients however ...
Our tasks are defined by
objectives and data
Objectives
cross
abstraction boundaries
Software 2.0
With gradients however ...
Our tasks are defined by
objectives and data
Storage and compute
adapt
to objectives
Software 2.0
Imagine a neural web scraper:
Extract links => self evident from the LM
Extract text => self evident from the LM
Extract specific text => small output head
Extract text and render in Markdown => ...
Bugs are actually training data
Want JSON rather than HTML? Swap the data
A data guided declarative
programming language
with an optimizing compiler
tracking observed paths of execution
and your desired objectives
(Compiler hints: gradients/supervision)
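A minimal sketch of the "objectives and data" framing above, assuming a toy linear model trained by plain gradient descent. The `fit` helper, the learning rate, and both toy datasets are hypothetical, chosen only to show that swapping the data changes the resulting "program" while the code stays the same.

```python
def fit(data, steps=500, lr=0.1):
    """Fit y ~ w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # The gradients of the objective play the role of "compiler hints".
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
task_a = [(x, 2 * x + 1) for x in xs]      # one set of observed paths
task_b = [(x, -3 * x + 0.5) for x in xs]   # swap the data, keep the code

w_a, b_a = fit(task_a)
w_b, b_b = fit(task_b)
```

No line of `fit` mentions either task; the behaviour comes entirely from the objective and the data it was given.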
Software 2.0
If ML is a gas,
it optimizes for the container
"Demoscene for AI"
as you can always scale later
Example: QRNN
Quasi-Recurrent Neural Network
(Bradbury*, Merity*, Xiong, Socher)
"This bit is slow so why don't we try a
less powerful but faster part?"
"Wait ... it works just as well? O_o"
Example: QRNN
Better results for classification, language modeling, and character level translation
Used by Baidu for their "Deep Voice" projects
Example: QRNN
Relatively simple code
(a few hundred lines of CUDA)
yet we see few of these types of optimizations
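A rough sketch of the "less powerful but faster part", assuming it refers to the elementwise f-pooling step described in the QRNN paper: h_t = f_t * h_{t-1} + (1 - f_t) * z_t. The gate values below are made up; in a real QRNN they would come from convolutions over the input, computed for all timesteps in parallel. This is plain Python for illustration, not the CUDA kernel.

```python
def f_pool(z, f, h0=None):
    """Sequential part of a QRNN layer: h_t = f_t * h_{t-1} + (1 - f_t) * z_t."""
    hidden = len(z[0])
    h = list(h0) if h0 is not None else [0.0] * hidden
    outputs = []
    for z_t, f_t in zip(z, f):
        # Elementwise only: no hidden-to-hidden matrix multiply per step.
        h = [ft * hp + (1.0 - ft) * zt for zt, ft, hp in zip(z_t, f_t, h)]
        outputs.append(h)
    return outputs

# Hypothetical candidate (z) and forget-gate (f) values for three timesteps.
z = [[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]]
f = [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
states = f_pool(z, f)
```

The only work left inside the time loop is a handful of multiplies and adds per hidden unit, which is why the sequential part is cheap.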
Story time and example: Language modeling
Unsupervised learning and
language modeling
Language modeling is compression
Understanding the data is the objective
Learn which signals predict the future state
Learn to extract or maintain relevant contextual state
Important to remember:
language is far more than just text
ML objectives provide an
information bottleneck
When passing data from function to function, what information do we maintain?
Unsupervised learning provides
flexibility and data independence
"Predict the next word" means a ghost of the context remains, scaled by how relevant it is to the objective above
DL is declarative
Ask it to learn language modeling?
Your model learns counting as a sub-task
What about my work?
Similar results with minimal compute
Held SotA on two LM datasets
(yay!)
Google then released ...
Remember that language modeling
has only one broad objective:
guess what comes next
The only goal is minimizing entropy
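To make "language modeling is compression" concrete, here is a small sketch assuming a character bigram model with add-one smoothing as a stand-in for a real LM. The corpus and the helper names are hypothetical. The average number of bits needed per character is the cross-entropy, so a lower number means the model could compress that text better.

```python
import math
from collections import Counter

def train_bigram(text):
    """Count character bigrams and the contexts they follow."""
    pairs = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return pairs, contexts, sorted(set(text))

def bits_per_char(model, text):
    """Average -log2 p(next char | previous char), with add-one smoothing."""
    pairs, contexts, vocab = model
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        p = (pairs[(prev, nxt)] + 1) / (contexts[prev] + len(vocab))
        total -= math.log2(p)
    return total / (len(text) - 1)

corpus = "the cat sat on the mat. the dog sat on the log. " * 20
model = train_bigram(corpus)
familiar = bits_per_char(model, "the cat sat on the log.")
scrambled = bits_per_char(model, "eht tac tas no eht gol.")
uniform = math.log2(len(model[2]))  # cost per character with no model at all
```

Text that follows the patterns the model has seen costs fewer bits than the same characters scrambled, and fewer than a uniform code over the alphabet.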
Our programs are
flexible, re-usable, and produce expected output
Potential aside:
Let's do HTML parsing
Let's say we want to
extract content from the web
Boss: Your objective is to collect links for a web crawler
Huzzah! I can do that!
How about I use ...
Regex for HTML 😡
Are you MAD?!?!?
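import re
import requests
# Fetch the page, then pull (href, link text) pairs out with a regex.
url = 'http://smerity.com/articles/2018/limited_compute.html'
data = requests.get(url).text
# Only matches <a href="..."> with double quotes, no other attributes, and plain-text contents.
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', data)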
Now is this wrong?
Not exactly.
What it does catch is correct.
It just misses oh so many edge cases ...
(= so much missed or lost context)
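A small illustration of those edge cases, using a hand-written HTML string (hypothetical, not taken from any real page) and the standard library's `html.parser` for comparison. The regex is the one from the snippet above.

```python
import re
from html.parser import HTMLParser

HTML = '''
<a href="/plain">caught</a>
<a class="nav" href="/attrs">extra attribute</a>
<a href='/single'>single quotes</a>
<A HREF="/upper"><b>nested</b> markup</A>
'''

pattern = r'<a href="([^"]+)">([^<]+)</a>'
regex_links = [href for href, _ in re.findall(pattern, HTML)]

class LinkCollector(HTMLParser):
    """Collect every href on an <a> tag, however it is written."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

collector = LinkCollector()
collector.feed(HTML)
parser_links = collector.links
```

The regex finds one of the four links; the parser finds all four. What the regex does catch is still correct.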
Success!
It isn't perfect, but it does
work for the task at hand ^_^
Now your boss, excited with your progress, asks you to extract text from the same webpages you just processed.
It should be easy, right..?
Answer: 😭
"Proper" parser for HTML
Recursive descent parser (RDP)
You go all in and write an RDP
Wait, boss, what text do you want? All text, including navigation? Only article text as if it were a news article? Sidebar text?
!?!?!??!?!
This is a problem
Our tasks are defined by
objectives and data
Our objective is vague yet
those specifics are key to success
Success!
It isn't perfect, but it does
work for the task at hand ^_^
Now your boss, excited with your progress, asks you to convert that text to a Markdown equivalent.
Your answer: 😭
At least a butcher, baker, or candlestick maker has clear objectives
Worse, what about errors?
Constructing programs resilient to
bad input is hard
You've likely had to deal with some horrific code in your lifetime.
Now imagine having to deal with an entire
web worth of silly people...
The architecture of the Web has several languages in it - there's HTTP, there's HTML, URLs are a language, there's CSS, and there's the scripting language. They're all in there and they can all be embedded in each other and they all have different quoting and escaping and commenting conventions. And they are not consistently implemented in all of the browsers. Some of them are not specified anywhere.
- Douglas Crockford (of JavaScript and JSON)
What happens with errors?
The LM gets progressively more upset
Forget a semicolon/bracket/closing tag/.../?
The LM will become uncertain
(we can measure the entropy)
and can even intelligently suggest
where you went wrong
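A toy version of this, assuming a character n-gram model stands in for the LM. The training snippets, the `train` and `surprisal` helpers, and the broken input are all hypothetical. Per-character surprisal (in bits) spikes where the input departs from what the model has seen, and the most likely continuation of that context doubles as a suggested fix.

```python
import math
from collections import Counter

def train(text, order=2):
    """Count (context, next char) pairs for a character n-gram model."""
    pairs, contexts = Counter(), Counter()
    for i in range(order, len(text)):
        ctx = text[i - order:i]
        pairs[(ctx, text[i])] += 1
        contexts[ctx] += 1
    return pairs, contexts, len(set(text)), order

def surprisal(model, text):
    """Return (index, bits) for each character, given its preceding context."""
    pairs, contexts, vocab_size, order = model
    scores = []
    for i in range(order, len(text)):
        ctx = text[i - order:i]
        p = (pairs[(ctx, text[i])] + 1) / (contexts[ctx] + vocab_size)
        scores.append((i, -math.log2(p)))
    return scores

good = "x = f(a); y = g(b); z = h(c); " * 50
model = train(good)
broken = "x = f(a) y = g(b); "   # semicolon missing after f(a)

worst_index, worst_bits = max(surprisal(model, broken), key=lambda s: s[1])

# Suggest the character the model expected most in that context.
pairs = model[0]
ctx = broken[worst_index - 2:worst_index]
candidates = [c for (k, c) in pairs if k == ctx]
suggestion = max(candidates, key=lambda c: pairs[(ctx, c)])
```

The spike lands on the character right after `f(a)`, which is exactly where the semicolon should have been.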
What about my work?
Similar results with minimal compute
Held SotA on two LM datasets
(yay!)
Google then released ...
(╯°□°)╯︵ ┻━┻
Neural Architecture Search:
"32,400-43,200 GPU hours"
What about my work?
Similar results with minimal compute
I wasted months of my life -_-
As a swansong I went to improve PyTorch's language model example
Had to be fast and simple with minimal tweaks
for educational purposes
What about my work?
Similar results with minimal compute
Small change...
BIG IMPROVEMENT
???
What about my work?
Similar results with minimal compute
Small change...
BIG IMPROVEMENT
Small change...
BIG IMPROVEMENT
Small change...
BIG IMPROVEMENT
Small change...
BIG IMPROVEMENT
Small change...
BIG IMPROVEMENT
What about my work?
Similar results with minimal compute
NVIDIA cuDNN LSTM is fast - but a black box
You can't add layer norm
You can't add regularization
You can't do anything ...
Well, you can "bit twiddle" over the weights
What about my work?
Similar results with minimal compute
Add recurrent dropout via weight modification
Optimization handicap
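A minimal sketch of recurrent dropout via weight modification, assuming it means DropConnect-style masking of the hidden-to-hidden matrix before the sequence is run. A plain tanh RNN in pure Python stands in for the cuDNN LSTM; the weights, inputs, and function names are hypothetical.

```python
import math
import random

def drop_connect(weights, p, rng):
    """Zero each weight with probability p; rescale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [[w / keep if rng.random() < keep else 0.0 for w in row]
            for row in weights]

def rnn_forward(inputs, w_ih, w_hh, p=0.0, rng=None):
    """Run a tanh RNN over a sequence, dropping recurrent weights when p > 0."""
    if p > 0.0:
        # One mask per sequence: the same dropped matrix is reused at every
        # step, so the recurrent kernel itself never has to change.
        w_hh = drop_connect(w_hh, p, rng)
    h = [0.0] * len(w_hh)
    for x in inputs:
        h = [math.tanh(sum(a * b for a, b in zip(w_ih[i], x)) +
                       sum(a * b for a, b in zip(w_hh[i], h)))
             for i in range(len(w_hh))]
    return h

w_ih = [[0.5, -0.2], [0.1, 0.3]]
w_hh = [[0.4, -0.6], [0.2, 0.7]]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

h_eval = rnn_forward(inputs, w_ih, w_hh)                               # no dropout
h_train = rnn_forward(inputs, w_ih, w_hh, p=0.5, rng=random.Random(0))
dropped = drop_connect(w_hh, 0.5, random.Random(1))
```

Because only the weights are touched, the black-box recurrent routine can be called unchanged, which is the point of the trick.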
What about my work?
Similar results with minimal compute
Result: a language model (AWD-LSTM) that was fast and SotA on standard hardware (12-24 hours on old school GPU), released open source.
It has been trained on dozens of other languages, serves as the basis of Fast.AI's language model, and has been used in GitHub's Semantic Code Search, audio processing, bio-informatics, ...
THEORY
trails behind and/or never describes
PRACTICE
In deep learning,
Most of our assumptions are
BROKEN
You can't constrain
your thinking by them
A belief so far holding for ML:
What takes a cluster to compute one year takes a consumer machine the next.
ML's pseudo Moore's Law
Irrational Optimism
is not necessarily
Irrational
New York Times (2012):
"How Many Computers to Identify a Cat?
16,000 (CPU cores)"
One year later: "three servers each with two quad-core CPUs and four Nvidia GeForce GTX 680 GPUs"
Neural Architecture Search:
"32,400-43,200 GPU hours"
Just over a year later:
"single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours"
AWD-LSTM can train to better perplexity on one GPU in under a day
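The arithmetic behind those quotes, as a quick sketch. It uses only the numbers quoted above and treats a GPU hour as comparable across the two setups, which is an assumption since the hardware differs.

```python
nas_gpu_hours = (32_400, 43_200)   # quoted range for the original search
later_gpu_hours = 16               # quoted upper bound on one GTX 1080Ti

# "Less than 16 hours" makes these lower bounds on the reduction.
low, high = (hours / later_gpu_hours for hours in nas_gpu_hours)

cat_cores_2012 = 16_000
cat_cores_2013 = 3 * 2 * 4         # three servers, two quad-core CPUs each
cat_gpus_2013 = 3 * 4              # four GTX 680s per server
```

That is a reduction of at least roughly 2,000x to 2,700x in GPU hours for the architecture search, and a move from 16,000 CPU cores to 24 cores plus 12 GPUs for the cat detector.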
Adding ever more engines may
help get the plane off the ground...
but that's not the design that
planes are destined for.
The issues in language models (LMs)
- Large models fake with soft memorization
- Large slow models mean few ablations
- Monocultures (attention, code, ...)
"Attention Is All You Need (And All We Tried)"
- Optimization (ML and HW) goes to the "winner"
- Much of the progress relies on free compute
- Models are not reproducible
(i.e. LM is compression, large models "cheat")
- We need non-parallel processing / backtracking
(hidden state is an implicit beam search)
Aside: Frank McSherry's COST
Graph processing on a single threaded laptop
"When we lose accurate baselines, we lose our ability to accurately measure our progress over time."
My fear, returning to the 1970s/1980s:
Mainframe vs Minicomputer
Tilting at windmills and
alternate history research
Wait ... Did anyone actually check to see if we need that many attention heads..?
Can we use a single attention head?
Did anyone update the old models
to new techniques?
Tilting at windmills and
alternate history research
Take AWD-LSTM (my 2017 SotA)
Add layer normalization
Add pointer based attention (my 2016 SotA)
"Language is humanity's longest running program"
If you believe ensembles and individual communication are humanity's key to greater intelligence,
we need small and efficient independent language models
we need to enable better communication amongst the existing bajillion bioFLOPS
"Language is humanity's longest running program"
Language models should be
the next fundamental computing structure
- LMs are storage (task specific compression)
- LMs are compute (NFA/DFA/virtual machine)
"Language is humanity's longest running program"
For social:
- Personal optimization: "the language in your head"
- Inter-process optimization
- The long tail of language (French to English is boring)
- Fundamentally different programming paradigms and optimizations
"Language is humanity's longest running program"
Language and language models
can be thrown at anything
We are still waiting for LM's
"Mother of All Demos"
Actor methodology
Actors have a mailbox:
- receive messages and change internal state
- send messages to other actors
- create new actors
LM based "Data Actors"
Each actor aims to minimize entropy of the data payload it contains.
They have a unique and/or shared language model.
Actors have a mailbox:
- LM filters messages
- LM composes messages
- Actor can spawn new data actors
- LM can dispatch messages based upon embeddings
The actor has both traditional and vector based reprs
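One way the data actor idea could look in code, as a loose sketch. The `DataActor` class, the `dispatch` helper, the threshold, and the hand-written two-dimensional embeddings are all hypothetical; cosine similarity stands in for whatever a language model would do to filter and route messages.

```python
import math
from collections import deque

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class DataActor:
    """An actor holding a payload plus a vector representation of it."""
    def __init__(self, name, embedding):
        self.name = name
        self.embedding = embedding   # vector based repr
        self.payload = []            # traditional repr
        self.mailbox = deque()

    def receive(self, message, embedding, threshold=0.5):
        # Stand-in for "LM filters messages": keep only relevant ones.
        if cosine(self.embedding, embedding) >= threshold:
            self.mailbox.append(message)
            self.payload.append(message)
            return True
        return False

def dispatch(actors, message, embedding):
    """Route a message to the actor whose embedding is closest."""
    target = max(actors, key=lambda a: cosine(a.embedding, embedding))
    return target.name if target.receive(message, embedding) else None

actors = [DataActor("links", [1.0, 0.0]), DataActor("text", [0.0, 1.0])]
routed = dispatch(actors, "href=/about", [0.9, 0.1])
ignored = dispatch(actors, "noise", [-1.0, -1.0])
```

Messages close to an actor's embedding land in its mailbox; messages that resemble nothing any actor holds are dropped.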
LM based Actors + Silicon
Branch prediction has such limited context
What if ~data blocks had an embedding vector?
(data blocks ~= data actors)
I strongly believe blocks of data will have a dual representation: discrete and neural embedding
Language Models + Aliens
If we communicate with aliens, especially if they have any substantial latency (light years), we'd be using language models.
- Leverage "world" knowledge (see translation)
- Compression of messages
- LMs can carry a query as payload, interrogate a dataset, and retrieve only what's relevant
Language Models + Aliens
For reading and sending messages we would use language models.
What if we were communicating with actors (aliens) light years away?
If we had a shared language model, we could compress messages (+) and pre-filter messages (+) based upon the "query" and likely "execution path"
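A rough analogy for the compression benefit, using `zlib` with a preset dictionary. This is not a language model: the shared dictionary is only a stand-in for knowledge both parties already hold, and the message text is invented for the example.

```python
import zlib

# Stand-in for the shared language model: text both parties already hold.
shared = b"request status report for sector; signal strength nominal; awaiting reply; "
message = b"status report for sector 7: signal strength nominal, awaiting reply"

def compress(data, zdict=None):
    """Deflate data, optionally priming the compressor with shared context."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()

def decompress(blob, zdict=None):
    """Inverse of compress; the receiver needs the same shared context."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()

alone = compress(message)
with_shared = compress(message, shared)
restored = decompress(with_shared, shared)
```

The message sent with shared context is smaller than the one sent without it, and it only decodes for a receiver holding the same context. A shared language model would play the same role with far richer predictions.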
CacheNN:
Fit a reversible residual sparse NN into L1 cache
Modify weights in place
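A sketch of the "reversible residual" part, assuming it refers to the additive coupling used in reversible residual networks. The residual functions `f` and `g` below are arbitrary placeholders. Because the inputs can be recomputed from the outputs, activations need not be stored, which is what makes a tiny memory budget such as L1 cache plausible.

```python
def f(x):
    """Hypothetical residual function; any function of x works here."""
    return [0.5 * v + 1.0 for v in x]

def g(x):
    """Second hypothetical residual function."""
    return [v * v for v in x]

def forward(x1, x2):
    """Additive coupling: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    """Recover the inputs from the outputs without having stored them."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [1.0, 2.0], [2.0, 4.0]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
```

Neither `f` nor `g` has to be invertible; the coupling structure alone makes the block reversible.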
Next mad idea:
Better compilation for experimentation
Common minimal hardware for ML deployment
Massive data access preferably via attention
Backtracking + sparse/dense MM
Needs from HW+ML
No-one knows how efficient our work could be
or what knowledge we could extract
A single GPU can beat a cluster
Our theory lags behind our practice meaning
we have no freaking clue
Language models are pre-"The Mother of All Demos"
The potential
Compute 001
By smerity