Language Modeling

Stephen Merity (@smerity)

as a key to broader knowledge understanding

---

Why listen?

I have held multiple state-of-the-art (SotA) results in language modeling with minimal resources
(compared to Goog/FB/MSFT/...)

I genuinely want to help :)

Main take-away

Deep learning should feel intuitive

If it doesn't feel like that
- someone has explained it wrong (me!),
- a library or paper has overcomplicated it
- someone misunderstands it
(which is understandable!)

Heretical claim:

As programmers, functions are our fundamental building blocks

We as humans define the logic

We also hence decide what information flows from input to output

Language Modeling

Given a sequence of tokens (context), predict the next N tokens

The flight from Sydney to New ____

We analyze a massive set of data and follow the patterns we've already seen

Machine Translation
p(strong tea) > p(powerful tea)

Speech Recognition
p(speech recognition) > p(speech wreck ignition)

p(President X attended ...) is higher for X=Obama

Query Completion
p(Michael Jordan Berkeley) > p(Michael Jordan basketball)

N-grams

Have we seen this sequence before? If so, how many times?

Bob ate the ____
Zomicron ate the ____

Bob ate the ____
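
The "have we seen this before, and how many times?" question can be sketched as plain counting. This is a minimal illustration, not a production n-gram model: the toy corpus and the helper names `ngram_counts` and `next_word_counts` are invented here, and there is no smoothing for unseen contexts.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def next_word_counts(tokens, context):
    """How often each word followed `context` in the training tokens."""
    n = len(context) + 1
    follow = Counter()
    for gram, count in ngram_counts(tokens, n).items():
        if gram[:-1] == tuple(context):
            follow[gram[-1]] += count
    return follow

# Toy corpus, invented for illustration only.
corpus = "bob ate the pizza . bob ate the pizza . alice ate the salad .".split()

seen = next_word_counts(corpus, ["bob", "ate", "the"])         # context we have seen
unseen = next_word_counts(corpus, ["zomicron", "ate", "the"])  # context we never saw
```

With this corpus, `seen` has something to say about what Bob ate, while `unseen` comes back empty: the counts give no help at all for Zomicron, which is the gap the contextual n-gram slides try to patch.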

Contextual N-grams

<name> ate the ____

... but now we have a bajillion edge cases to try to capture ...

<name:male> was <verb:run> through the <city:Sydney> <street:plural>

A bajillion edge cases isn't sane for a human
... yet it's what we likely need to do well

Linear regression

Input: beds, baths, square feet
Logic:
N = X * bed + Y * bath + Z * footage
Output: N (approximate house price)

We now have weights {X, Y, Z}

"Learn" {X, Y, Z} to minimize the error

Linear regression

Logic:
X * bed + Y * bath + Z * footage

We don't set {X, Y, Z} ourselves

We use backpropagation to nudge them
(= a fancy way of asking the equation how best to change a parameter to reduce our error)
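
A rough sketch of "learning" {X, Y, Z} by nudging them, under some loud assumptions: the six houses and their prices are made up, the learning rate and step count are arbitrary, and the gradient is written out by hand (backpropagation automates exactly this derivative for larger models).

```python
# Made-up training data: (beds, baths, thousands of square feet) -> price in $1000s.
houses = [(2, 1, 0.9), (3, 2, 1.5), (4, 2, 2.0), (3, 1, 1.2), (5, 3, 2.8), (1, 1, 0.6)]
prices = [220.0, 360.0, 460.0, 300.0, 620.0, 140.0]

def predict(weights, house):
    """N = X * bed + Y * bath + Z * footage."""
    return sum(w * feature for w, feature in zip(weights, house))

def mean_squared_error(weights):
    return sum((predict(weights, h) - p) ** 2 for h, p in zip(houses, prices)) / len(houses)

def gradient(weights):
    """Derivative of the mean squared error with respect to each weight."""
    grads = [0.0] * len(weights)
    for house, price in zip(houses, prices):
        error = predict(weights, house) - price
        for i, feature in enumerate(house):
            grads[i] += 2 * error * feature / len(houses)
    return grads

weights = [0.0, 0.0, 0.0]   # X, Y, Z start out knowing nothing
initial_error = mean_squared_error(weights)
learning_rate = 0.01        # size of each nudge (assumed value)

for step in range(5000):
    grads = gradient(weights)
    # Move each weight a little in the direction that reduces the error.
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]

final_error = mean_squared_error(weights)
```

Nobody typed in values for X, Y, or Z; the loop only ever asks "which way should each weight move to reduce the error?" and `final_error` ends up far below `initial_error`.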

Stack these blocks

The computation is whatever we want

We don't care as long as our desired program is a subset of the possible computation

Typically a matrix multiplication
followed by an "activation function"
(allows for decisions to be made)
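
To make "matrix multiplication followed by an activation function" concrete, here is a small sketch with two stacked blocks. The weights `w1` and `w2` are hand-picked for illustration (a trained model would learn them), and ReLU is just one common choice of activation.

```python
def matmul_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """Activation function: keep positive values, zero out the rest."""
    return [max(0.0, v) for v in vector]

def block(matrix, vector):
    """One building block: matrix multiplication, then activation."""
    return relu(matmul_vec(matrix, vector))

# Hand-picked weights (an assumption for this example, not learned).
w1 = [[1.0, -1.0],
      [-1.0, 1.0]]
w2 = [[1.0, 1.0]]

def stacked(x):
    """Two blocks stacked: the output of the first feeds the second."""
    return block(w2, block(w1, x))
```

With these weights `stacked([a, b])` returns the absolute difference of `a` and `b`. A single matrix multiplication cannot compute that; the activation in the middle is what lets the stack "decide" which of the two inputs is larger.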

Confused? Uncertain?

You don't need to understand specifics

Deep learning is a declarative programming language

State what you want in terms of
input, output, and the type of compute the model may use

Response from parents

"This seems ... overly simple?"

"Indeed it is - the scary thing is that the principle scales up. The same general tactics work for images, text, you name it..! Instead of three parameters though, I’m doing this over MILLIONS or BILLIONS of parameters. Backpropagation still works!"

A billion edge cases isn't sane for a human
... yet it's what we likely need to do well

So let's get the computer to do it :)

Neural Language Modeling

So what does this look like for LM?

First, let's think of our objective:
given previous word,
we want to predict the next word,
on repeat

We want a function akin to:
memory, next_word = f(current_word, memory)

$h_t, y_t = \text{RNN}(x_t, h_{t-1})$

Neural Language Modeling

We define the architecture
(or equations the function may use)

We want each word to be represented by a vector, let's say 400 floating point numbers

Our "running memory" will also be
400 floating point numbers

Neural Language Modeling

Our model will learn the best value for each of those 400 numbers for all our words

Our model will learn what type of logic the functions should run to create and manipulate the hidden state (memory) to guess the
next word

Neural Language Modeling

Top: Output
Middle: Logic (Blue)
Bottom: Input

Neural Language Modeling

Embed:
Each word has a representation of 400 floating point numbers
words['The'] = [0.123, 0.621, ..., -0.9]
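
A minimal sketch of the embedding step, assuming a tiny made-up vocabulary and random starting values drawn from an arbitrary range. Training would later adjust every one of these numbers.

```python
import random

EMBEDDING_SIZE = 400     # matches the 400 numbers per word on the slide
rng = random.Random(0)   # fixed seed so the example is repeatable

# Hypothetical toy vocabulary; a real model has tens of thousands of words.
vocabulary = ["The", "flight", "from", "Sydney", "to", "New", "York"]

# One vector of 400 floats per word, random to begin with.
words = {
    word: [rng.uniform(-0.1, 0.1) for _ in range(EMBEDDING_SIZE)]
    for word in vocabulary
}

# Embedding a sentence is just a lookup per token.
sentence = "The flight from Sydney to New".split()
embedded = [words[token] for token in sentence]
```

The embedding is nothing more than a lookup table: `embedded` holds six vectors of 400 numbers, one per input word.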

Neural Language Modeling

Recurrent Neural Network (RNN):
A function that takes two inputs,
word (400 numbers) and memory (400 numbers),
and produces two outputs (word and memory)

Neural Language Modeling

Recurrent Neural Network (RNN):

(h = hidden state, or our memory)

$h_t, y_t = \text{RNN}(x_t, h_{t-1})$
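
One way to fill in that function, as a sketch only: this is a plain Elman-style cell with random, untrained weights, a hidden size of 4 instead of 400 to keep it readable, and an output that is a probability per vocabulary word. The names `W_x`, `W_h`, `W_y`, and `embed` are assumptions for the example, not the architecture used in the talk.

```python
import math
import random

rng = random.Random(0)
SIZE = 4  # the slides use 400; kept tiny here
vocabulary = ["The", "flight", "from", "Sydney", "to", "New", "York"]

def random_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Random starting weights: nothing has been learned yet.
embed = {w: [rng.uniform(-0.5, 0.5) for _ in range(SIZE)] for w in vocabulary}
W_x = random_matrix(SIZE, SIZE)             # how the current word updates memory
W_h = random_matrix(SIZE, SIZE)             # how the old memory carries over
W_y = random_matrix(len(vocabulary), SIZE)  # how memory becomes a word guess

def rnn(x_t, h_prev):
    """h_t, y_t = RNN(x_t, h_{t-1})"""
    pre = [a + b for a, b in zip(matvec(W_x, x_t), matvec(W_h, h_prev))]
    h_t = [math.tanh(v) for v in pre]  # new memory
    y_t = softmax(matvec(W_y, h_t))    # guess for the next word
    return h_t, y_t

h = [0.0] * SIZE  # empty memory before the first word
for word in "The flight from Sydney to New".split():
    h, y = rnn(embed[word], h)
```

After the loop, `y` is the (still random) guess for the word after "New", and `h` is the memory that carried "flight" and "Sydney" forward. Training nudges the three matrices and the embeddings so that "York" gets the highest probability.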

Neural Language Modeling

How do you start out the weights?
Random.
(Maybe pre-trained weights but that's later...)

Neural Language Modeling

Why is the RNN hidden state important?
It's how we pass along context
(i.e. you said "flew" a few words back and "New" right before this word)

our hidden state (memory) changes

Contextual N-grams

"... but now we have a bajillion edge cases to try to capture ..."

<name:male> was <verb:run> through the <city:Sydney> <street:plural>

The computer learned how to do those bajillion edge cases
from random numbers and context

An aside:

Let's do HTML parsing

Let's say we want to extract content from the web

Huzzah! I can do that!

Regex for HTML 😡

import re
import requests

# Fetch the page, then grab every simple <a href="...">text</a> pair.
data = requests.get('http://smerity.com/articles/2018/limited_compute.html').text
links = re.findall(r'<a href="([^"]+)">([^<]+)</a>', data)

Now is this wrong?

Not exactly.
What it does catch is correct.
It just misses oh so many edge cases ...
(= so much missed or lost context)

Success!

It isn't perfect, but it does work for the task at hand ^_^

Now your boss, excited with your progress, asks you to extract text from the same webpages you just processed.

It should be easy, right..?

"Proper" parser for HTML

Recursive descent parser (RDP)

You go all in and write an RDP
(If you don't know what it is, you keep track of the opening and closing HTML tags)
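
The "keep track of the opening and closing tags" idea can be sketched with a stack. To be clear about the substitution: this is not a hand-written recursive descent parser. It leans on the tokenizer in Python's standard `html.parser` module and only tracks nesting depth; `DepthTracker` and the sample HTML are invented for this example.

```python
from html.parser import HTMLParser

class DepthTracker(HTMLParser):
    """Record each piece of text along with how deeply nested it is."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags, outermost first
        self.text_with_depth = []  # (depth, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Forgive sloppy HTML: close any unclosed children along the way.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.text_with_depth.append((len(self.stack), data.strip()))

tracker = DepthTracker()
tracker.feed('<div><p>Hello <b>bold</b> world</p><a href="/x">link</a></div>')
tracker.close()
```

`tracker.text_with_depth` now pairs every text fragment with its depth, and `tracker.stack` is empty once all tags have closed. Even this small version needed a special case for unclosed tags, and it still mishandles void elements written as a bare `<br>`.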

Wait, boss, what text do you want? All text, including navigation? Only article text as if it were a news article? Sidebar text?
!?!?!??!?!

Success!

It isn't perfect, but it does work for the task at hand ^_^

Now your boss, excited with your progress, asks you to convert that text to a Markdown equivalent.

At least a butcher, baker, or candlestick maker has clear objectives

Constructing programs resilient to bad input is hard

Now imagine having to deal with an entire
web worth of silly people...

The architecture of the Web has several languages in it - there's HTTP, there's HTML, URLs are a language, there's CSS, and there's the scripting language. They're all in there and they can all be embedded in each other and they all have different quoting and escaping and commenting conventions. And they are not consistently implemented in all of the browsers. Some of them are not specified anywhere.

- Douglas Crockford (of JavaScript and JSON)

My time @ Common Crawl

Crawling ~35 billion pages (~2.5 PB) as a lone engineer:

"I've seen things you people wouldn't believe.
DDoSed servers on fire off the shoulder of Tumblr."

ಠ_ಠ

Neural Language Modeling

Why is the RNN hidden state important?
It's how we pass along context
(i.e. you said "flew" a few words back and "New" right before this word)

our hidden state (memory) changes

LMing for HTML 🤔

We know LMs learn useful context

We can introspect the RNN's hidden state
to guess the function of a given memory cell

LMing for HTML 🤔

Let's look at what it does to C code

This is the same "program" that was trained on English - but this model was trained on C.

LMing for HTML 🤔

The model learns to capture the depth of an expression by performing LM'ing on C code.
Depth is exactly what we need for HTML.

Neural Language Modeling

How does hidden state change exactly?
Depends on everything.
The data, the input, the architecture, ...
Active area of research as we don't really know.

our hidden state (memory) changes

Neural Language Modeling

Example: a few days ago a paper came out trying to work out how different RNNs count with their memory

Reminder: DL is declarative

Ask it to learn language modeling?

What happens with errors?

The LM gets progressively more upset

Forget a semicolon/bracket/closing tag/.../?

The LM will become uncertain
(we can measure the entropy)
and can even intelligently suggest
where you went wrong
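
Measuring that uncertainty is a one-liner once you have the model's probabilities for the next token. A small sketch, where both distributions are made-up numbers rather than the output of a real model:

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Made-up next-token probabilities for illustration.
confident = {";": 0.97, ")": 0.01, "}": 0.01, "x": 0.01}  # code looks normal
uncertain = {";": 0.25, ")": 0.25, "}": 0.25, "x": 0.25}  # something went wrong earlier

confident_bits = entropy(confident.values())
uncertain_bits = entropy(uncertain.values())
```

`uncertain_bits` is much larger than `confident_bits`. Watching for the position where the entropy jumps gives a rough pointer to where the input stopped making sense to the model.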

Remember that at this stage the model has only one broad objective:
guess what comes next

Now add more objectives as you want
(mark <a href>, get text, remember if in <b> tag)

The model will learn how to balance objectives given the resources available and data seen

How far can we take this?

Is it only surface level features?

The team at OpenAI performed character-level language modeling on Amazon reviews.
This is a single neuron (it learned to track the sentiment of the review).

How far can we take this?

With different mechanisms

An attention-based model (i.e. pull information from other words based on my word) learns anaphora resolution as part of translation

How far can we take this?

Translation with no parallel corpus

Translate between language A and B
without a single shared sentence

How?
Convert a sentence from A => B => A'
Ensure A == A'
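
The round-trip check itself is simple to write down. In this toy sketch the two "translators" are hypothetical word-for-word dictionaries standing in for learned models; in the real setting both directions are neural networks, and the mismatch between A and A' is the training signal.

```python
# Hypothetical word-for-word "translators" standing in for learned models.
a_to_b = {"the": "le", "cat": "chat", "sleeps": "dort"}
b_to_a = {"le": "the", "chat": "cat", "dort": "sleeps"}

def translate(sentence, table):
    """Translate word by word; unknown words become <unk>."""
    return [table.get(word, "<unk>") for word in sentence]

def round_trip_error(sentence):
    """Fraction of words that differ between A and A' after A => B => A'."""
    back = translate(translate(sentence, a_to_b), b_to_a)
    mismatches = sum(1 for original, restored in zip(sentence, back) if original != restored)
    return mismatches / len(sentence)

perfect = round_trip_error(["the", "cat", "sleeps"])
lossy = round_trip_error(["the", "dog", "sleeps"])
```

`perfect` is zero because every word survives the trip, while `lossy` is not, because "dog" gets lost on the way. No sentence pair between A and B was needed to compute either number.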

Language models are implicit compression

My research has been used in areas I never could have imagined

I wrote a language model that was fast
and which achieved state of the art results.
I released the code open source.

It has been trained on dozens of other languages, serves as the basis of GitHub's Semantic Code Search, audio processing, bio-informatics, ...

Deep learning is closer to growing a garden than enumerating logic

The result depends on the substrate (data) and
what you put in (type/structure of compute)

It had no explicit English knowledge injected and few constraints to make it better work on English.

Hence, it is re-usable across data.

What if tomorrow your program had to work in only 100MB of RAM? A 100 MHz CPU? Could only use adds but no mults?

In deep learning you re-train the model and see what trade-offs have been made

My work

Quasi-Recurrent Neural Network (Bradbury*, Merity*, Xiong, Socher)

"This bit is slow so why don't we try a
less powerful but faster part?"
"Wait ... it works just as well? O_o"

Heretical claim:

No-one knows how efficient it could be
or what knowledge we could extract

No-one has tried this in field / task X

A single GPU can beat a cluster

You don't need a deep theory background

The potential

New York Times (2012):
"How Many Computers to Identify a Cat?
16,000 (CPU cores)"

One year later: "three servers each with two quad-core CPUs and four Nvidia GeForce GTX 680 GPUs"

The potential

Neural Architecture Search:
"32,400-43,200 GPU hours"

Just over a year later:
"single Nvidia GTX 1080Ti GPU, the search for architectures takes less than 16 hours"

The potential

If this type of programming isn't introduced
in first year alongside functions and DBs,
we're doing students a disservice

The implementation is hard, the use is easy

I am not saying this due to the hype,
I am saying this due to how easy it can be
and what can be delivered with it
