Jumping moats

Stephen Merity
(@smerity)

Why compute and data moats may well be dead*

---

Q:

Deep learning...
Up: Overhyped?
Down: Underhyped?

Q:

Deep learning...
Goldilocks?

About me

NCSS Challenge / NCSS at Usyd
University of Sydney (NLP)
Freelancer.com (IPO'ed ASX), Grok Learning
Harvard University (Masters)
Common Crawl (lone engineer)
MetaMind (acquired by Salesforce)
???

AU

US

AU/US

Before I begin,

let's take a step back...

What do we want
in computing?

We want our programs to be flexible, re-usable, and produce expected output

Desire:

Our programs are
flexible, re-usable, and produce expected output

Reality:

Sequence modeling and
deep learning may allow all:
flexible, re-usable, and
expected output

Heretical claim

This fundamentally breaks many of the traditional moats,
specifically compute and data*

Heretical claim++

* Offer not valid for all moats 😅

As programmers, functions are
our fundamental building blocks



We as humans define the logic

We hence decide what information flows from input to output

Take input, apply logic, produce output

Functions define our
level of abstraction

We can't influence
what came before

We can't be influenced by
what happens after

This is a problem

Our tasks are defined by
objectives and data

Objectives are lossy
past abstraction boundaries

DL is just functions

Define the input, the output,
and the "architecture"
(i.e. equations the fn computes with)

The logic is decided by the
input and expected output
given to the program

Stack these blocks

The computation is whatever we want

We don't care as long as our desired "program" is a subset of the given computation


Typically a matrix multiplication
followed by an "activation function"
(allows for decisions to be made)
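
A minimal sketch of one such block in plain Python, assuming a tanh activation and hand-picked weights (both are illustrative choices, not something the slides specify). A real model would learn the weights rather than have them typed in.

```python
import math

def dense_layer(weights, bias, inputs):
    """One block: a matrix multiplication followed by a tanh activation."""
    outputs = []
    for row, b in zip(weights, bias):
        total = sum(w * x for w, x in zip(row, inputs)) + b
        outputs.append(math.tanh(total))
    return outputs

# Hypothetical hand-picked weights; a real model would learn these.
weights = [[1.0, -1.0], [0.5, 0.5]]
bias = [0.0, 0.0]

hidden = dense_layer(weights, bias, [2.0, 2.0])
stacked = dense_layer(weights, bias, hidden)  # feed one block into the next
```

Feeding the output of one call into the next is all that "stacking" means here.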

Stack these blocks

Confused? Uncertain?

You don't need to understand specifics

Deep learning is a
declarative programming language

State what you want in terms of
input, output, and the type of compute the model may use

Suddenly ...

Our tasks are defined by
objectives and data

Objectives
cross
abstraction boundaries

In brief: deep learning

Learn to trust the abstraction
(just as you trust your database*)

In brief: deep learning

Learn to trust the abstraction
(just as you trust your compiler*)

In brief: deep learning

Learn to trust the abstraction
(just as you trust your CPU*)

In brief: deep learning

Learn how, and how much, to trust an abstraction - and then trust it

The only scary thing with DL'ing is
a human didn't write the logic ...
🤔

The Data Moat

Language Modeling

Given a sequence of tokens (context),
predict the next N tokens

The flight from Sydney to New ____


We analyze a massive set of data and follow the patterns we've already seen

N-grams

Have we seen this sequence before?
If so, how many times?

Bob ate the ____
Zomicron ate the ____
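
The counting idea can be sketched in a few lines of Python. The corpus below is invented for illustration, and `train_ngrams` is a hypothetical helper name. The point is the last line: a context that never appeared in the data gives the model nothing to go on.

```python
from collections import Counter, defaultdict

def train_ngrams(sentences, n=3):
    """Count which word followed each (n-1)-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    return counts

# Toy corpus, invented for illustration.
corpus = [
    "Bob ate the apple",
    "Bob ate the pizza",
    "Alice ate the apple",
]
counts = train_ngrams(corpus, n=4)

seen = counts[("Bob", "ate", "the")]         # two continuations, one count each
unseen = counts[("Zomicron", "ate", "the")]  # empty: never seen this context
```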

N-grams

Have we seen this sequence before?
If so, how many times?

Bob ate the ____
张伟 (Zhang Wei) ate the ____

Contextual N-grams

<name> ate the ____


... but now we have a bajillion edge cases to try to capture ...

<name:male> was <verb:run> through the <city:Sydney> <street:plural>

A bajillion edge cases isn't sane for a human
... yet it's what we likely need to do well
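
A rough sketch of the placeholder trick, assuming a hand-maintained name list (`KNOWN_NAMES` is hypothetical). It works for "Zomicron" only because someone added that name to the list by hand, which is exactly where the edge cases start piling up.

```python
from collections import Counter

# Hypothetical name list; a real system would need a far better name detector.
KNOWN_NAMES = {"Bob", "Alice", "Zomicron"}

def normalise(sentence):
    """Replace any known name with the <name> placeholder."""
    return ["<name>" if tok in KNOWN_NAMES else tok for tok in sentence.split()]

corpus = ["Bob ate the apple", "Alice ate the pizza"]
counts = Counter()
for sentence in corpus:
    tokens = normalise(sentence)
    counts[(tuple(tokens[:3]), tokens[3])] += 1

# "Zomicron ate the" now shares its context with Bob and Alice.
query = tuple(normalise("Zomicron ate the"))
candidates = {word: c for (ctx, word), c in counts.items() if ctx == query}
```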

Neural Language Modeling

So what does this look like for LM?

First, let's think of our objective:
given previous word,
we want to predict the next word,
on repeat

We want a function akin to:
memory, next_word = f(current_word, memory)

h_t, y_t = \text{RNN}(x_t, h_{t-1})

Neural Language Modeling

Top: Output
Middle: Logic (Blue)
Bottom: Input

Neural Language Modeling

Embed:
Each word has a representation of 400 floating point numbers
words['The'] = [0.123, 0.621, ..., -0.9]
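
As a sketch, the embedding table is just a dictionary from word to a list of 400 floats. The vocabulary and the uniform random initialisation below are assumptions for illustration. Training is what turns these random numbers into useful representations.

```python
import random

random.seed(0)
EMBED_SIZE = 400  # from the slide: 400 floating point numbers per word

vocab = ["The", "flight", "from", "Sydney", "to", "New"]  # toy vocabulary
words = {w: [random.uniform(-1, 1) for _ in range(EMBED_SIZE)] for w in vocab}

vector = words["The"]  # the 400 numbers currently standing in for "The"
```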

Neural Language Modeling

Recurrent Neural Network (RNN):
A function that takes two inputs,
word (400 numbers) and memory (400 numbers),
and produces two outputs (word and memory)

Neural Language Modeling

Recurrent Neural Network (RNN):


(h = hidden state, or our memory)

h_t, y_t = \text{RNN}(x_t, h_{t-1})
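
A minimal sketch of that function in plain Python. It assumes the simplest (Elman-style) recurrence with a tanh activation. The slides do not say which RNN variant is meant, and the weight names `W_x`, `W_h`, and `W_y` are hypothetical. The size is shrunk from 400 to 4 to keep it readable.

```python
import math
import random

random.seed(0)
SIZE = 4  # the slides use 400; kept tiny here for readability

def random_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

# Weights start out random; training would adjust them.
W_x = random_matrix(SIZE, SIZE)  # applied to the current word x_t
W_h = random_matrix(SIZE, SIZE)  # applied to the previous memory h_{t-1}
W_y = random_matrix(SIZE, SIZE)  # turns the new memory into the output y_t

def rnn_step(x_t, h_prev):
    """h_t, y_t = RNN(x_t, h_{t-1}) for a plain (Elman-style) RNN."""
    pre = [a + b for a, b in zip(matvec(W_x, x_t), matvec(W_h, h_prev))]
    h_t = [math.tanh(p) for p in pre]
    y_t = matvec(W_y, h_t)
    return h_t, y_t

# Run over a sequence of random stand-in word vectors, carrying memory along.
sentence = [[random.uniform(-1, 1) for _ in range(SIZE)] for _ in range(3)]
h = [0.0] * SIZE
for x in sentence:
    h, y = rnn_step(x, h)
```

The loop is the "on repeat" part: each step takes the current word and the previous memory, and hands the updated memory to the next step.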

Neural Language Modeling

How do you start out the weights?
Random.
(Maybe pre-trained weights but that's later...)

Neural Language Modeling

Why is the RNN hidden state important?
It's how we pass along context
(i.e. you said "flew" a few words back and "New" right before this word)

As each word is added,
our hidden state (memory) changes

Visualizing word vectors

Neural Language Modeling

We define the architecture
(or equations the function may use)


We want each word to be represented by a vector, let's say 400 floating point numbers

Our "running memory" will also be
400 floating point numbers

Neural Language Modeling

Our model will learn the best value for each of those 400 numbers for all our words

Our model will learn what type of logic the functions should run to create and manipulate the hidden state (memory) to guess the
next word

Neural LM

"... but now we have a bajillion edge cases to try to capture ..."


<name:male> was <verb:run> through the <city:Sydney> <street:plural>

is implicitly caught in our vectors and the learned logic of our "program"

The computer learned how to do those bajillion edge cases

from random numbers and context

DL is declarative



Ask it to learn language modeling?
Your model learns counting as a sub-task

Our programs are
flexible, re-usable, and produce expected output

Potential aside:

Let's do HTML parsing

Let's say we want to
extract content from the web

Boss: Your objective is to collect links for a web crawler

Huzzah! I can do that!
How about I use ...

Regex for HTML 😑

Are you MAD?!?!?

import re
import requests

# Fetch the page, then grab (href, anchor text) pairs with a deliberately naive regex
data = requests.get('http://smerity.com/articles/2018/limited_compute.html').text
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', data)

Now is this wrong?


Not exactly.
What it does catch is correct.
It just misses oh so many edge cases ...
(= so much missed or lost context)
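
To make the missed edge cases concrete, here is the same pattern run over a handful of hypothetical snippets, all of which a browser would treat as perfectly good links. Only the first one matches.

```python
import re

PATTERN = r'<a href="([^"]+)">([^<]+)</a>'

# Hypothetical snippets, all valid links a browser would follow.
html = '''
<a href="/plain">Plain link</a>
<a href='/single-quotes'>Single quotes</a>
<a class="nav" href="/extra-attribute">Extra attribute</a>
<a href="/nested"><b>Bold</b> text</a>
<A HREF="/uppercase">Uppercase</A>
'''

found = re.findall(PATTERN, html)
```

Single quotes, an extra attribute, a nested tag, and uppercase markup each slip past the regex.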

Success!

It isn't perfect, but it does
work for the task at hand ^_^

Now your boss, excited with your progress, asks you to extract text from the same webpages you just processed.

It should be easy, right..?

Answer: 😭

"Proper" parser for HTML

Recursive descent parser (RDP)

You go all in and write an RDP
(If you don't know what that is: you keep track of the opening and closing HTML tags as you go)
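
A sketch of the "keep track of opening and closing tags" idea. Note the swap: rather than a hand-written recursive descent parser, this leans on the standard library's event-based `html.parser.HTMLParser` and keeps an explicit stack of open tags. The class name `LinkAndTextParser` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class LinkAndTextParser(HTMLParser):
    """Tracks open tags on a stack while collecting links and text."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.links = []  # href values from <a> tags
        self.text = []   # (enclosing tag path, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        # Tolerate sloppy HTML: pop back to the matching open tag if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.text.append(("/".join(self.stack), data.strip()))

parser = LinkAndTextParser()
parser.feed('<div><A class="nav" href=\'/x\'><b>Bold</b> text</A></div>')
```

Because the parser lower-cases tag and attribute names and handles quoting itself, the uppercase, single-quoted, extra-attribute link is picked up, and every piece of text arrives with the path of tags that enclose it.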


Wait, boss, what text do you want? All text, including navigation? Only article text as if it were a news article? Sidebar text?
!?!?!??!?!

This is a problem

Our tasks are defined by
objectives and data

Our objective is vague yet
those specifics are key to success

Success!

It isn't perfect, but it does
work for the task at hand ^_^

Now your boss, excited with your progress, asks you to convert that text to a Markdown equivalent.

Your answer: 😭

At least a butcher, baker, or candlestick maker has clear objectives

Worse, what about errors?

Constructing programs resilient to
bad input is hard

You've likely had to deal with some horrific code in your lifetime.

Now imagine having to deal with an entire
web worth of silly people...

The architecture of the Web has several languages in it - there's HTTP, there's HTML, URLs are a language, there's CSS, and there's the scripting language. They're all in there and they can all be embedded in each other and they all have different quoting and escaping and commenting conventions. And they are not consistently implemented in all of the browsers. Some of them are not specified anywhere.

- Douglas Crockford (of JavaScript and JSON)

My time @ Common Crawl

Crawling ~35 billion pages (~2.5 PB)
as a lone engineer:

"I've seen things you people wouldn't believe.
DDoSed servers on fire off the shoulder of Tumblr."

ಠ_ಠ

LMing for HTML 🤔

(Heretical claim reminder)
Sequence modeling and
deep learning may allow all:
flexible, re-usable, and
expected output

Neural Language Modeling

Why is the RNN hidden state important?
It's how we pass along context
(i.e. you said "flew" a few words back and "New" right before this word)

As each word is added,
our hidden state (memory) changes

LMing for HTML 🤔

We know LMs learn useful context

We can introspect the RNN's hidden state
to guess the function of a given memory cell

LMing for HTML 🤔

Let's look at what it does to C code

This is the same "program" that was trained on English, except this model was trained on C code.

LMing for HTML 🤔

The model learns to capture the depth of an expression by performing LM'ing on C code.
Depth is exactly what we need for HTML.
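
For reference, this is the hand-written version of the quantity such a memory cell appears to track. It is not how the network computes it. `bracket_depths` is a hypothetical helper and the snippet is invented.

```python
def bracket_depths(code):
    """Return the nesting depth after each character of `code`."""
    depth = 0
    depths = []
    for ch in code:
        if ch in "({[":
            depth += 1
        elif ch in ")}]":
            depth -= 1
        depths.append(depth)
    return depths

snippet = "if (a) { f(b[0]); }"
depths = bracket_depths(snippet)  # rises on each opener, falls on each closer
```

Swap brackets for opening and closing tags and the same running counter describes how deep you are inside an HTML document.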

Neural Language Modeling

How does hidden state change exactly?
Depends on everything.
The data, the input, the architecture, ...
Active area of research as we don't really know.

As each word is added,
our hidden state (memory) changes

What happens with errors?

The LM gets progressively more upset

Forget a semicolon/bracket/closing tag/.../?

The LM will become uncertain
(we can measure the entropy)
and can even intelligently suggest
where you went wrong
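
Measuring that uncertainty is a one-liner once you have the model's next-token probabilities. The two distributions below are hypothetical, not output from a real model.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distributions, not output from a real model.
confident = {";": 0.97, ")": 0.02, "x": 0.01}
confused = {";": 0.25, ")": 0.25, "x": 0.25, "{": 0.25}

h_confident = entropy(confident.values())  # low: the model knows what comes next
h_confused = entropy(confused.values())    # high: four equally likely guesses
```

A sudden jump in entropy partway through a file is a hint that something just before that point is off.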

Remember that at this stage
the model has only one broad objective:

guess what comes next

How far can we take this?

Is it only surface level features?

The team at OpenAI performed character level language modeling on Amazon reviews.
This is a single neuron with no "supervision".

How far can we take this?

With different mechanisms:

The Transformer Network (i.e. pull information from words based on my word) learns a form of anaphora resolution as part of translation

What happens when we add
additional objectives and constraints..?

How far can we take this?

Translation with no parallel corpus

Translate between language A and B
without a single shared sentence


How?
Convert a sentence from A => B => A'
Ensure A == A'
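
A toy illustration of the round-trip check, assuming word-for-word lookup tables in place of learned translation models (the tables and helper names are invented). Real systems train neural translators and use the mismatch between A and A' as a signal to reduce. This sketch only measures it.

```python
# Toy word-for-word "translators"; real systems are learned neural models.
A_TO_B = {"the": "le", "cat": "chat", "sleeps": "dort"}
B_TO_A = {"le": "the", "chat": "cat", "dort": "sleeps"}

def translate(sentence, table):
    return [table.get(word, "<unk>") for word in sentence]

def round_trip_errors(sentence):
    """Count positions where A => B => A' fails to reproduce A."""
    back = translate(translate(sentence, A_TO_B), B_TO_A)
    return sum(a != a2 for a, a2 in zip(sentence, back))

errors_ok = round_trip_errors(["the", "cat", "sleeps"])   # round trip succeeds
errors_bad = round_trip_errors(["the", "dog", "sleeps"])  # "dog" is lost on the way
```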

Language models are implicit compression

Our tasks are defined by
objectives and data

Deep learning models
define their operation based on both

It had no explicit English knowledge injected and few constraints to make it better work on English.

Hence, the model is
re-usable across entire data domains.

The language model is trained based upon the data it sees.

Re-usable and flexible
knowledge understanding
defined by the
objective and task

So what knowledge is left
unextracted
from the data we already have..?

The Compute Moat

What if tomorrow your program had to work in only 100MB of RAM? A 100 MHz CPU? Could only use adds but no mults?

In deep learning you re-train the model and see what trade-offs have been made

What about my work?

Similar results with minimal compute

Held State of the Art (SotA) on two datasets
(yay!)

Google then released ...

(╯°□°）╯︵ ┻━┻

Neural Architecture Search:
"32,400-43,200 GPU hours"

What about my work?

Similar results with minimal compute

I wasted months trying to get something similar and almost gave up.

Went back to improve the PyTorch language model as a swansong for those braver than me.

Had to be fast and simple with minimal tweaks
for educational purposes

What about my work?

Similar results with minimal compute

Small change...

What about my work?

Similar results with minimal compute

Small change...
BIG IMPROVEMENT

???

What about my work?

Similar results with minimal compute

Small change...
BIG IMPROVEMENT
Small change...
BIG IMPROVEMENT
Small change...

BIG IMPROVEMENT
Small change...

BIG IMPROVEMENT
Small change...

BIG IMPROVEMENT

What about my work?

Similar results with minimal compute

I wrote a language model (AWD-LSTM) that was fast on standard hardware and achieved state of the art results, releasing it open source.

It has been trained on dozens of other languages, serves as the basis of fast.ai's language model, and has been used in GitHub's Semantic Code Search, audio processing, bio-informatics, ...

Most of our assumptions are
BROKEN


Don't constrain your thinking by them

New York Times (2012):
"How Many Computers to Identify a Cat?
16,000 (CPU cores)"

One year later: "three servers each with two quad-core CPUs and four Nvidia GeForce GTX 680 GPUs"

Neural Architecture Search:
"32,400-43,200 GPU hours"

Just over a year later:
"single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours"

Adding ever more engines may
help get the plane off the ground...

but that's not the design that
planes are destined for.

Deep learning is closer to
growing a garden
than
enumerating logic


The result depends on the substrate (data) and
what you seed it with (type/structure of compute)

My work

Quasi-Recurrent Neural Network
(Bradbury*, Merity*, Xiong, Socher)

"This bit is slow so why don't we try a
less powerful but faster part?"
"Wait ... it works just as well? O_o"

Sequence modeling and
deep learning may allow all:
flexible, re-usable, and
expected output

Heretical claim:

This fundamentally breaks many of the traditional moats,
specifically compute and data

Heretical claim++

No-one knows how efficient our work could be
or what knowledge we could extract

A single GPU can beat a cluster

Our theory lags behind our practice, meaning
we have no freaking clue

The potential

Irrational Optimism

is not necessarily

Irrational

Jumping moats @ Canva

By smerity
