Why compute and data moats may well be dead*
---
* Offer not valid for all moats
We as humans define the logic
We hence decide what information flows from input to output
The computation itself can be whatever we want
We don't care, as long as our desired "program" is a subset of the computation on offer
Typically a matrix multiplication
followed by an "activation function"
(allows for decisions to be made)
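A minimal sketch of that building block in Python (the shapes follow the 400-number example used later; the weights are random stand-ins for what training would learn):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(400, 400))   # learned weights (random stand-ins here)
b = np.zeros(400)                 # learned bias

def layer(x):
    # The matrix multiplication mixes information across all dimensions...
    z = W @ x + b
    # ...and the activation function (tanh here) is what allows decisions:
    # without it, stacked layers collapse into one big linear map.
    return np.tanh(z)

word = rng.normal(size=400)       # a 400-number word vector
out = layer(word)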
You don't need to understand the specifics
State what you want in terms of
input, output, and the type of compute the model may use
The flight from Sydney to New ____
We analyze a massive set of data and follow the patterns we've already seen
Bob ate the ____
Zomicron ate the ____
Bob ate the ____
张伟 (Zhang Wei) ate the ____
<name> ate the ____
... but now we have a bajillion edge cases to try to capture ...
<name:male> was <verb:run> through the <city:Sydney> <street:plural>
A bajillion edge cases isn't sane for a human
... yet it's what we likely need to do well
So what does this look like for a language model (LM)?
First, let's think of our objective:
given the previous word,
we want to predict the next word,
on repeat
We want a function akin to:
memory, next_word = f(current_word, memory)
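A minimal sketch of that loop (f, the starting word, and the memory are hypothetical stand-ins; in a real model f is the trained network):

# Run the language model "on repeat": feed each predicted word back in.
def generate(f, current_word, memory, n_words):
    words = [current_word]
    for _ in range(n_words):
        memory, next_word = f(current_word, memory)
        words.append(next_word)
        current_word = next_word
    return words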
[Diagram: input at the bottom, the learned logic (blue) in the middle, output at the top]
Embed:
Each word is represented by 400 floating point numbers
words['The'] = [0.123, 0.621, ..., -0.9]
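In code this is just a lookup table of learned vectors; a sketch using PyTorch's nn.Embedding (the toy vocabulary is made up):

import torch
import torch.nn as nn

# A lookup table mapping each word ID to a learned 400-number vector.
vocab = {'The': 0, 'flight': 1, 'from': 2, 'Sydney': 3}
embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=400)

the_vector = embed(torch.tensor(vocab['The']))  # shape: (400,)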
Recurrent Neural Network (RNN):
A function that takes two inputs,
a word (400 numbers) and a memory (400 numbers),
and produces two outputs: a new word and an updated memory
(h = hidden state, or our memory)
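A minimal sketch of such an RNN cell (a vanilla formulation; the weight matrices are random stand-ins for what training would learn):

import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(400, 400))  # incoming word -> hidden
W_hh = rng.normal(size=(400, 400))  # old memory -> hidden
W_ho = rng.normal(size=(400, 400))  # hidden -> output word

def rnn_cell(word, h):
    # Mix the incoming word with the running memory, then squash
    # with an activation so decisions can be made.
    h_new = np.tanh(W_xh @ word + W_hh @ h)
    output = W_ho @ h_new
    return output, h_new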
How do you start out the weights?
Random.
(Maybe pre-trained weights but that's later...)
Why is the RNN hidden state important?
It's how we pass along context
(i.e. you said "flew" a few words back and "New" right before this word)
As each word is added,
our hidden state (memory) changes
We define the architecture
(or the equations the function may use)
We want each word to be represented by a vector, let's say 400 floating point numbers
Our "running memory" will also be
400 floating point numbers
Our model will learn the best value for each of those 400 numbers for every word
Our model will learn what logic the function should run to create and manipulate the hidden state (memory) in order to guess the
next word
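A minimal sketch of how that learning might look in PyTorch (a toy setup, not the talk's actual model): the word vectors and the logic are optimized jointly, graded only on how well the next word is guessed.

import torch
import torch.nn as nn

vocab_size, dim = 10_000, 400
embed = nn.Embedding(vocab_size, dim)      # the learned word vectors
rnn = nn.RNN(dim, dim, batch_first=True)   # the learned "logic" + memory
decode = nn.Linear(dim, vocab_size)        # hidden state -> next-word guess

params = list(embed.parameters()) + list(rnn.parameters()) + list(decode.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One training step on a toy batch of word IDs (random here).
words = torch.randint(0, vocab_size, (1, 21))
inputs, targets = words[:, :-1], words[:, 1:]      # predict word t+1 from word t
hidden_states, _ = rnn(embed(inputs))              # memory at every step
logits = decode(hidden_states)                     # a guess for each next word
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()    # blame flows back through logic and word vectors alike
optimizer.step()   # nudge every learned number to guess better next time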
"... but now we have a bajillion edge cases to try to capture ..."
<name:male> was <verb:run> through the <city:Sydney> <street:plural>
is implicitly caught in our vectors and the learned logic of our "program"
The computer learned how to do those bajillion edge cases
from random numbers and context
Ask it to learn language modeling?
Your model learns counting as a sub-task
Boss: Your objective is to collect links for a web crawler
Huzzah! I can do that!
How about I use ...
import requests
import re

# Grab the raw HTML, then pull out anything shaped like a simple link.
data = requests.get('http://smerity.com/articles/2018/limited_compute.html').text
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', data)
Now is this wrong?
Not exactly.
What it does catch is correct.
It just misses oh so many edge cases ...
(= so much missed or lost context)
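A few hypothetical examples of perfectly valid links that pattern silently skips:

# Valid links the regex above misses:
# <a href='page.html'>single quotes</a>
# <a class="nav" href="page.html">any attribute before href</a>
# <a href="page.html"><img src="icon.png"></a>   (no plain text inside)
# <A HREF="page.html">uppercase tags</A>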
Now your boss, excited by your progress, asks you to extract text from the same webpages you just processed.
It should be easy, right..?
Answer: ...
You go all in and write an RDP (recursive descent parser)
(If you don't know what that is: you keep track of the opening and closing HTML tags)
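A toy sketch of that tag-tracking idea using only Python's built-in html.parser (nowhere near robust, which is exactly the point):

from html.parser import HTMLParser

# Keep track of opening and closing tags, collecting the text between them.
class TextGrabber(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.text = []
    def handle_starttag(self, tag, attrs):
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1
    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

parser = TextGrabber()
parser.feed('<div><p>Hello <b>world</b></p></div>')
print(parser.text)  # ['Hello', 'world']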
Wait, boss, what text do you want? All text, including navigation? Only article text as if it were a news article? Sidebar text?
!?!?!??!?!
Now your boss, excited by your progress, asks you to convert that text to a Markdown equivalent.
Your answer: ...
At least a butcher, baker, or candlestick maker has a clear objective
You've likely had to deal with some horrific code in your lifetime.
Now imagine having to deal with an entire
web's worth of silly people...
The architecture of the Web has several languages in it - there's HTTP, there's HTML, URLs are a language, there's CSS, and there's the scripting language. They're all in there and they can all be embedded in each other and they all have different quoting and escaping and commenting conventions. And they are not consistently implemented in all of the browsers. Some of them are not specified anywhere.
- Douglas Crockford (of JavaScript and JSON)
"I've seen things you people wouldn't believe.
DDoSed servers on fire off the shoulder of Tumblr."
We can introspect the RNN's hidden state
to guess the function of a given memory cell
This is the same "program" as the one trained on English - but this model was trained on C.
The model learns to capture the depth of an expression purely by performing language modeling on C code.
Depth is exactly what we need for HTML.
How does hidden state change exactly?
Depends on everything.
The data, the input, the architecture, ...
Active area of research as we don't really know.
Forget a semicolon / bracket / closing tag / ...?
The LM will become uncertain
(we can measure the entropy)
and can even intelligently suggest
where you went wrong
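A minimal sketch of that entropy measurement (the distributions are made up; a real one would come from the LM's softmax over next tokens):

import math

# Entropy is low when the model is confident about the next token
# and spikes when something (a missing bracket?) surprises it.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = [0.97, 0.01, 0.01, 0.01]   # model is sure what comes next
confused = [0.25, 0.25, 0.25, 0.25]    # model has no idea
print(entropy(confident))  # ~0.24 bits
print(entropy(confused))   # 2.0 bits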
The team at OpenAI performed character-level language modeling on Amazon reviews.
A single neuron, with no "supervision", emerged to track sentiment.
The Transformer network (i.e. pull information from other words based on my word) learns a form of anaphora resolution as part of translation
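That "pull information from other words" operation is attention; a minimal sketch of the scaled dot-product form (toy shapes, single head, no learned projections):

import numpy as np

# Each word scores every word against itself, then pulls in a weighted mix.
def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # relevance of each word
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over words
    return weights @ V                              # weighted mix of words

rng = np.random.default_rng(0)
words = rng.normal(size=(5, 64))        # 5 words, 64-number vectors (made up)
mixed = attention(words, words, words)  # self-attention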
Translate between language A and B
without a single shared sentence
How?
Convert a sentence from A => B => A'
Ensure A == A'
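A sketch of that round-trip (back-translation) signal, with hypothetical translate functions standing in for the learned models:

# Translate A => B => A' and penalize any drift between A and A'.
def round_trip_loss(sentence_a, translate_ab, translate_ba, distance):
    b = translate_ab(sentence_a)           # A => B (no reference translation!)
    a_prime = translate_ba(b)              # B => A'
    return distance(sentence_a, a_prime)   # train so that A == A'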
Language models are implicit compression
The model had no explicit English knowledge injected and few constraints making it work better on English.
Hence, the approach is
re-usable across entire data domains.
In deep learning you re-train the model and see what trade-offs have been made
Held State of the Art (SotA) on two datasets
(yay!)
Google then released ...
(╯°□°）╯︵ ┻━┻
Neural Architecture Search:
"32,400-43,200 GPU hours"
I wasted months trying to get something similar and almost gave up.
Went back to improve the PyTorch language model as a swansong for those braver than me.
Had to be fast and simple with minimal tweaks
for educational purposes
Small change...
Small change...
BIG IMPROVEMENT
???
Small change...
BIG IMPROVEMENT
Small change...
BIG IMPROVEMENT
(... and again, and again ...)
I wrote a language model (AWD-LSTM) that was fast on standard hardware and achieved state of the art results, releasing it open source.
It has been trained on dozens of other languages, serves as the basis of Fast.AI's language model, and has been used in GitHub's Semantic Code Search, audio processing, bio-informatics, ...
New York Times (2012):
"How Many Computers to Identify a Cat?
16,000 (CPU cores)"
One year later: "three servers each with two quad-core CPUs and four Nvidia GeForce GTX 680 GPUs"
Neural Architecture Search:
"32,400-43,200 GPU hours"
Just over a year later:
"single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours"
The result depends on the substrate (data) and
what you seed it with (type/structure of compute)
"This bit is slow so why don't we try a
less powerful but faster part?"
"Wait ... it works just as well? O_o"
No-one knows how efficient our work could be
or what knowledge we could extract
A single GPU can beat a cluster
Our theory lags behind our practice, meaning
we have no freaking clue