"Language is humanity's longest running program"

Hi, I'm @Smerity ^_^
My focus is language models.
The tagline I live by:

1) Positive sum
2) Minimize entropy
3) Maximize (useful) entropy

"Language is humanity's longest running program"


"Language as a technology"

Not AGI focused but language would be on the path to AGI

Past life / side hobby: Independent Researcher

Generally only use:
- A single GPU
- A day or so of training

This strategy ~reliably hits SotA

Focus:

- How ML can "solve" Blow's curse
- ML monoculture and hardware
- Language models of the future

"Evolutionary bottlenecks"
"Software is a gas"
Blow's inverse Moore's Law

Issues:

If software is a gas,
constrain the container

Issue in Software 1.0:
scaling up is bad ...

If ML is a gas,
it optimizes for the container

Scaling compute is ~trivial
Improving objective is ~possible

If ML is a gas,
it optimizes for the container

Training is JIT pre-compilation for the observed paths (dataset) and desired objectives

Our programs are
flexible, re-usable, and produce expected output

Reality:

Unsupervised modeling and
deep learning may allow all:
flexible, re-usable, and
expected output

Heretical claim

As programmers, functions are
our fundamental building blocks


 

We as humans define the logic

We hence decide what information flows from input to output

Take input, apply logic, produce output

Functions define our
level of abstraction

We can't influence
what came before

We can't be influenced by
what happens after

Software 1.0

Our tasks are defined by
static functions and data

Those functions are written by humans based upon "hidden" objectives


Objectives are lost
past abstraction boundaries

Software 1.0

Imagine writing a web scraper:

Extract links => regex (for HTML?!?!? 😡)
Extract text => recursive descent parser
Extract specific text => ... spaghetti if ..?
Extract text and render in Markdown => ...

Oh, and here's a bug - not sure which level
Oh, and instead of HTML how about JSON?

With gradients however ...

Our tasks are defined by
objectives and data

Objectives
cross
abstraction boundaries

Software 2.0

With gradients however ...

Our tasks are defined by
objectives and data

Storage and compute
adapt
to objectives

Software 2.0

Imagine a neural web scraper:

Extract links => self evident from the LM
Extract text => self evident from the LM
Extract specific text => small output head
Extract text and render in Markdown => ...

Bugs are actually training data
Want JSON rather than HTML? Swap the data

A data guided declarative
programming language

with an optimizing compiler
tracking observed paths of execution
and your desired objectives

(Compiler hints: gradients/supervision)

Software 2.0

If ML is a gas,
it optimizes for the container

"Demoscene for AI"
as you can always scale later

Example: QRNN

Quasi-Recurrent Neural Network
(Bradbury*, Merity*, Xiong, Socher)

"This bit is slow so why don't we try a
less powerful but faster part?"
"Wait ... it works just as well? O_o"

Example: QRNN

Better results for classification, language modeling, and character level translation

Used by Baidu for their "Deep Voice" projects

Example: QRNN

Relatively simple code
(a few hundred lines of CUDA)
yet we see few of these types of optimizations

Story time and example: Language modeling

Unsupervised learning and
language modeling

Language modeling is compression
Understanding the data is the objective

Learn which signals predict the future state
Learn to extract or maintain relevant contextual state
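One way to make "language modeling is compression" concrete: a model's average negative log-probability is the number of bits per symbol an ideal coder would need. The sketch below is an illustration only, not anything from the talk; the sample text and both function names are made up, and a character unigram model stands in for a real LM.

```python
import math
from collections import Counter

def unigram_bits_per_char(text):
    """Average ideal code length (bits/char) when each character is coded
    with -log2 p(char), with p estimated from the text (add-one smoothing)."""
    counts = Counter(text)
    vocab = len(counts)
    total = len(text) + vocab
    bits = sum(-math.log2((counts[ch] + 1) / total) for ch in text)
    return bits / len(text)

def uniform_bits_per_char(text):
    """Code length when every distinct character is treated as equally likely."""
    return math.log2(len(set(text)))

# Made-up sample text, purely for illustration.
sample = "the cat sat on the mat and the rat sat on the hat"

uniform_bits = uniform_bits_per_char(sample)
unigram_bits = unigram_bits_per_char(sample)
print(f"uniform model: {uniform_bits:.2f} bits/char")
print(f"unigram model: {unigram_bits:.2f} bits/char")
```

Even this trivial model needs fewer bits than the uniform baseline because it has learned which characters are common. A better predictor of the next symbol means a shorter encoding.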

Important to remember:
language is far more than just text

ML objectives provide an
information bottleneck

When passing data from function to function, what information do we maintain?

Unsupervised learning provides
flexibility and data independence

"Predict the next word" means a ghost of the context remains, scaled by how relevant it is to the objective above

DL is declarative



Ask it to learn language modeling?
Your model learns counting as a sub-task

What about my work?

Similar results with minimal compute

Held SotA on two LM datasets
(yay!)

Google then released ...

Remember that language modeling
has only one broad objective:

guess what comes next

The only goal is minimizing entropy

Our programs are
flexible, re-usable, and produce expected output

Potential aside:

Let's do HTML parsing

Let's say we want to
extract content from the web

Boss: Your objective is to collect links for a web crawler

Huzzah! I can do that!
How about I use ...

Regex for HTML 😡

Are you MAD?!?!?

import re
import requests

# Fetch the page, then pull (href, link text) pairs with a deliberately naive regex.
data = requests.get('http://smerity.com/articles/2018/limited_compute.html').text
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', data)

Now is this wrong?


Not exactly.
What it does catch is correct.
It just misses oh so many edge cases ...
(= so much missed or lost context)
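To see which edge cases slip through, here is a small comparison between the regex above and Python's standard-library `html.parser`. The sample HTML and the `LinkCollector` class are made up for illustration; this is a sketch of the failure modes, not a recommended crawler.

```python
import re
from html.parser import HTMLParser

# The regex from the slide: it only matches <a href="..."> with double quotes,
# no other attributes, and plain text (no nested tags) inside the anchor.
LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, whatever the attribute order or quoting."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical sample page, made up for illustration.
SAMPLE = """
<a href="/plain">Plain link</a>
<a class="nav" href="/with-class">Extra attribute</a>
<a href='/single-quotes'>Single quotes</a>
<a href="/nested"><b>Bold</b> text</a>
<A HREF="/upper">Upper case tag</A>
"""

regex_links = [href for href, _text in LINK_RE.findall(SAMPLE)]

collector = LinkCollector()
collector.feed(SAMPLE)
parser_links = collector.links

print("regex :", regex_links)
print("parser:", parser_links)
```

The regex finds only the first link. Everything it returns is correct, but extra attributes, single quotes, nested tags, and upper-case markup all fall outside the pattern.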

Success!

It isn't perfect, but it does
work for the task at hand ^_^

Now your boss, excited with your progress, asks you to extract text from the same webpages you just processed.

It should be easy, right..?

Answer: 😭

"Proper" parser for HTML

Recursive descent parser (RDP)

You go all in and write an RDP

 

Wait, boss, what text do you want? All text, including navigation? Only article text as if it were a news article? Sidebar text?
!?!?!??!?!

This is a problem

Our tasks are defined by
objectives and data

Our objective is vague yet
those specifics are key to success

Success!

It isn't perfect, but it does
work for the task at hand ^_^

Now your boss, excited with your progress, asks you to convert that text to a Markdown equivalent.

Your answer: 😭

At least a butcher, baker, or candlestick maker has clear objectives

Worse, what about errors?

Constructing programs resilient to
bad input is hard

You've likely had to deal with some horrific code in your lifetime.

Now imagine having to deal with an entire
web worth of silly people...

The architecture of the Web has several languages in it - there's HTTP, there's HTML, URLs are a language, there's CSS, and there's the scripting language. They're all in there and they can all be embedded in each other and they all have different quoting and escaping and commenting conventions. And they are not consistently implemented in all of the browsers. Some of them are not specified anywhere.

- Douglas Crockford (of JavaScript and JSON)

What happens with errors?

The LM gets progressively more upset

Forget a semicolon/bracket/closing tag/.../?

The LM will become uncertain
(we can measure the entropy)
and can even intelligently suggest
where you went wrong
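A toy version of this idea, assuming a character bigram model in place of a real LM and per-character surprisal (negative log-probability) as a stand-in for the entropy measurement. The training string, the broken string, and the function names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, ch in zip(text, text[1:]):
        counts[prev][ch] += 1
    return counts

def surprisal(counts, vocab, text):
    """Bits of surprise for each character given the previous one
    (add-one smoothing over the vocabulary)."""
    bits = []
    for prev, ch in zip(text, text[1:]):
        total = sum(counts[prev].values()) + len(vocab)
        p = (counts[prev][ch] + 1) / total
        bits.append(-math.log2(p))
    return bits

# Made-up "well formed" training data.
train = "<p>hello</p><p>world</p>" * 20
vocab = set(train)
model = train_bigram(train)

good = "<p>hello</p>"
bad = "<p>hello/p>"  # the '<' of the closing tag is missing

bits_good = surprisal(model, vocab, good)
bits_bad = surprisal(model, vocab, bad)

# bits[i] scores character i + 1, because the first character has no context.
worst_index = max(range(len(bits_bad)), key=bits_bad.__getitem__) + 1
print("max surprisal (good):", round(max(bits_good), 2))
print("max surprisal (bad): ", round(max(bits_bad), 2))
print("most surprising character in bad input:", repr(bad[worst_index]), "at index", worst_index)
```

The spike lands on the `/` that appears where the model expected `<`, which is the sense in which a model can point at where the input went wrong.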

What about my work?

Similar results with minimal compute

Held SotA on two LM datasets
(yay!)

Google then released ...

 

 

(╯°□°)╯︵ ┻━┻

Neural Architecture Search:
"32,400-43,200 GPU hours"

What about my work?

Similar results with minimal compute

I wasted months of my life -_-

As a swansong I went to improve PyTorch's language model example

Had to be fast and simple with minimal tweaks
for educational purposes

What about my work?

Similar results with minimal compute

Small change...
BIG IMPROVEMENT

???

What about my work?

Similar results with minimal compute

Small change...
BIG IMPROVEMENT
Small change...
BIG IMPROVEMENT
Small change...

BIG IMPROVEMENT
Small change...

BIG IMPROVEMENT
Small change...

BIG IMPROVEMENT
 

What about my work?

Similar results with minimal compute

NVIDIA cuDNN LSTM is fast - but a black box
You can't add layer norm
You can't add regularization
You can't do anything ...

Well, you can "bit twiddle" over the weights

What about my work?

Similar results with minimal compute

Add recurrent dropout via weight modification
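A minimal sketch of the weight-modification trick, under loud assumptions: a plain tanh RNN in pure Python stands in for the cuDNN LSTM, and `weight_drop` and `rnn_forward` are hypothetical names. The point it tries to show is that the dropout mask is applied once to the hidden-to-hidden weight matrix before the forward pass, so the recurrent kernel itself can stay a black box.

```python
import math
import random

def weight_drop(w_hh, p, rng):
    """Return a copy of w_hh with each entry zeroed with probability p.
    Survivors are scaled by 1 / (1 - p) so the expected value is unchanged."""
    keep = 1.0 - p
    return [[w / keep if rng.random() < keep else 0.0 for w in row] for row in w_hh]

def rnn_forward(w_hh, inputs, h0):
    """Stand-in for the black-box recurrent kernel: h_t = tanh(x_t + W_hh h_{t-1})."""
    h = h0
    for x in inputs:
        h = [
            math.tanh(x_i + sum(w * h_j for w, h_j in zip(row, h)))
            for x_i, row in zip(x, w_hh)
        ]
    return h

rng = random.Random(0)
w_hh = [[0.5, -0.3], [0.2, 0.8]]               # made-up recurrent weights
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up input sequence

# Training: sample ONE mask, then reuse the masked matrix for every timestep.
dropped = weight_drop(w_hh, p=0.5, rng=rng)
h_train = rnn_forward(dropped, inputs, h0=[0.0, 0.0])

# Evaluation: use the untouched weights.
h_eval = rnn_forward(w_hh, inputs, h0=[0.0, 0.0])
print("train hidden state:", h_train)
print("eval hidden state: ", h_eval)
```

Because only the weights change, nothing inside the per-timestep loop needs to know dropout exists.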

Optimization handicap

What about my work?

Similar results with minimal compute

Result: a language model (AWD-LSTM) that was fast and SotA on standard hardware (12-24 hours on an old-school GPU), released open source.

It has been trained on dozens of other languages, serves as the basis of Fast.AI's language model, and has been used in GitHub's Semantic Code Search, audio processing, bio-informatics, ...

THEORY

trails behind and/or never describes

PRACTICE

In deep learning,

Most of our assumptions are
BROKEN

 

You can't constrain
your thinking by them

A belief so far holding for ML:

What takes a cluster to compute one year takes a consumer machine the next.

ML's pseudo Moore's Law

Irrational Optimism

is not necessarily

Irrational

New York Times (2012):
"How Many Computers to Identify a Cat?
16,000 (CPU cores)"

One year later: "three servers each with two quad-core CPUs and four Nvidia GeForce GTX 680 GPUs"

Neural Architecture Search:
"32,400-43,200 GPU hours"

Just over a year later:
"single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours"

AWD-LSTM can train to better perplexity on one GPU in under a day

Adding ever more engines may
help get the plane off the ground...

but that's not the design that
planes are destined for.

The issues in language models (LMs)

- Large models fake it with soft memorization
- Large slow models mean few ablations
- Monocultures (attention, code, ...)
"Attention Is All You Need (And All We Tried)"
- Optimization (ML and HW) goes to the "winner"
- Much of the progress relies on free compute
- Models are not reproducible
(i.e. LM is compression, large models "cheat")
- We need non-parallel processing / backtracking
(hidden state is an implicit beam search)

Aside: Frank McSherry's COST

Graph processing on a single threaded laptop

"When we lose accurate baselines, we lose our ability to accurately measure our progress over time."

My fear, returning to the 1970s/1980s:
Mainframe vs Minicomputer

Tilting at windmills and
alternate history research


Wait ... Did anyone actually check to see if we need that many attention heads..?
Can we use a single attention head?

Did anyone update the old models
to new techniques?
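For reference on the single-head question above, this is roughly what one head of scaled dot-product attention computes. It is a pure-Python sketch with made-up keys, values, and query, not the model from the talk.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def single_head_attention(query, keys, values):
    """One attention head: score every key against the query,
    normalize the scores, and return the weighted average of the values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Made-up toy memory: three positions with 2-d keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]

output, weights = single_head_attention(query, keys, values)
print("weights:", [round(w, 3) for w in weights])
print("output: ", [round(o, 3) for o in output])
```

A multi-head layer runs several of these in parallel over learned projections. The question on the slide is whether one is enough.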

Tilting at windmills and
alternate history research

Take AWD-LSTM (my 2017 SotA)
Add layer normalization
Add pointer based attention (my 2016 SotA)

"Language is humanity's longest running program"

If you believe ensembles and individual communication are humanity's key to greater intelligence,

we need small and efficient independent language models

we need to enable better communication amongst the existing bajillion bioFLOPS

"Language is humanity's longest running program"

Language models should be
the next fundamental computing structure

- LMs are storage (task specific compression)
- LMs are compute (NFA/DFA/virtual machine)

"Language is humanity's longest running program"

For social:
- Personal optimization: "the language in your head"

- Inter-process optimization
- The long tail of language (French to English is boring)
- Fundamentally different programming paradigms and optimizations

"Language is humanity's longest running program"

Language and language models
can be thrown at anything

We are still waiting for LM's
"Mother of All Demos"

Actor methodology

Actors have a mailbox:
- receive messages and change internal state
- send messages to other actors
- create new actors
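The three capabilities above fit in a few lines. This is a single-threaded sketch with a hypothetical `Actor` class whose state is just a running total. Real actor systems add concurrency and supervision, which are left out here.

```python
from collections import deque

class Actor:
    """Minimal actor: a mailbox, some private state, and the ability to spawn."""

    def __init__(self, name):
        self.name = name
        self.mailbox = deque()
        self.state = 0          # internal state, only changed by receive()
        self.children = []

    def send(self, other, message):
        """Send a message to another actor by appending to its mailbox."""
        other.mailbox.append((self.name, message))

    def spawn(self, name):
        """Create a new actor."""
        child = Actor(name)
        self.children.append(child)
        return child

    def receive(self, sender, message):
        """Handle one message by updating internal state."""
        self.state += message

    def step(self):
        """Process one queued message. Returns False when the mailbox is empty."""
        if not self.mailbox:
            return False
        sender, message = self.mailbox.popleft()
        self.receive(sender, message)
        return True

root = Actor("root")
worker = root.spawn("worker")
root.send(worker, 2)
root.send(worker, 3)

while worker.step():
    pass

print(worker.name, "state:", worker.state)
```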

LM based "Data Actors"

Each actor aims to minimize entropy of the data payload it contains.
They have a unique and/or shared language model.

Actors have a mailbox:
- LM filters messages
- LM composes messages
- Actor can spawn new data actors
- LM can dispatch messages based upon embeddings
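A sketch of the last bullet, dispatch by embedding, assuming cosine similarity as the routing rule. The actor names, the three-dimensional vectors, and the threshold are all made up, and a real system would get the vectors from the language model rather than by hand.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical data actors, each summarized by a hand-written embedding.
actor_embeddings = {
    "weather": [1.0, 0.0, 0.1],
    "sports": [0.0, 1.0, 0.1],
    "finance": [0.1, 0.0, 1.0],
}

def dispatch(message_embedding, actors, threshold=0.5):
    """Route a message to the most similar actor, or filter it out (None)
    when nothing is similar enough."""
    best = max(actors, key=lambda name: cosine(message_embedding, actors[name]))
    if cosine(message_embedding, actors[best]) < threshold:
        return None
    return best

routed_weather = dispatch([0.9, 0.1, 0.0], actor_embeddings)
routed_finance = dispatch([0.0, 0.2, 0.9], actor_embeddings)
routed_nowhere = dispatch([-1.0, -1.0, -1.0], actor_embeddings)
print(routed_weather, routed_finance, routed_nowhere)
```

The threshold plays the role of the "LM filters messages" bullet: a message that resembles no actor's payload is dropped instead of delivered.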

The actor has both traditional and vector based reprs

 

LM based Actors + Silicon

Branch prediction has such limited context

What if ~data blocks had an embedding vector?
(data blocks ~= data actors)

I strongly believe blocks of data will have a dual representation: discrete and neural embedding

Language Models + Aliens

If we communicate with aliens, especially if they have any substantial latency (light years), we'd be using language models.

- Leverage "world" knowledge (see translation)
- Compression of messages
- LMs can carry a query as payload, interrogate a dataset, and retrieve only what's relevant

Language Models + Aliens

For reading and sending messages we would use language models.

What if we were communicating with actors (aliens) light years away?

If we had a shared language model then we could compress messages (+) and pre-filter messages (+) based upon the "query" and likely "execution path"

CacheNN:
Fit a reversible residual sparse NN into L1 cache
Modify weights in place

Next mad idea:

Better compilation for experimentation

Common minimal hardware for ML deployment

Massive data access preferably via attention

Backtracking + sparse/dense MM

Needs from HW+ML

No-one knows how efficient our work could be
or what knowledge we could extract

A single GPU can beat a cluster

Our theory lags behind our practice meaning
we have no freaking clue

Language models are pre-
"The Mother of All Demos"

The potential

Compute 001

By smerity
