The ML
Happy (Half) Hour

¯\_(ツ)_/¯

---

1) explain what we think is now possible

2) pull back the hood to give some intuition of how this is possible

3) cover the ways this isn't magical
(i.e. where it doesn't work)

4) brainstorm about implications

Aim:

"The team to come away with
2-3 'things we’ve learned'
that help them ask better questions and evaluate ML companies in a way they couldn't before"

Aim:

a) OpenAI paper released today
b) Predictions for AI
c) Open discussion =]

Loose plan

Exploring the intersections of:

Unsupervised language modeling

+

Web crawling
+
Easier-to-use "programming" via deep learning

For my work

LMs as core primitives

Language models (LMs) are a supercharged version of your phone's predictive text

Prediction = LMs leverage every piece of info they can to answer "what's next given past history"

Storage = LMs act as a form of compression

Communication = LMs can change their "language" depending on context

Step 1: Tokenization

Tokenization is fast and learns predictable repeated surface structure (~compression)

by-product of which was increased tourism to the town.

by█-█produc█t █of█ which█ was█ increase█d to█uri█s█m █to█ the█ to█wn█. █

Step 1: Tokenization




{"title": "█I█ █fo█un█d █a█ s█ec█re█t █spot█!█", "subreddit": "█FortNiteBR", █"is_self": true, █"url": "https://www█

Step 2: Predict next

This is where the large and complex language model comes in: given the history, guess the next token


{"title": "█I█ █fo█un█d █a█ s█ec█re█t █spot█!█", "subreddit": "█FortNiteBR", █"is_self": true, █"url": "https://www█

epic█games█.com█/█f█ornite█/█
epicgames.com█/█f█ornite█/█
ep█ic█ga█me█s█.com
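The generation loop itself is tiny; all the complexity hides inside the model. A minimal sketch, where next_token_probs is a stand-in for whatever trained LM you have (the uniform dummy model exists only so the sketch runs):

    import numpy as np

    def generate(next_token_probs, history, steps=8, temperature=1.0, seed=0):
        # Repeatedly ask the LM "what's next given past history?" and sample a token.
        rng = np.random.default_rng(seed)
        tokens = list(history)
        for _ in range(steps):
            probs = next_token_probs(tokens)          # one probability per vocab entry
            probs = probs ** (1.0 / temperature)      # temperature reshapes confidence
            probs = probs / probs.sum()
            tokens.append(int(rng.choice(len(probs), p=probs)))
        return tokens

    # Stand-in model: uniform over a 100-token vocabulary. A real LM (e.g. GPT-2)
    # replaces this with a network conditioned on the full token history.
    dummy_lm = lambda history: np.ones(100) / 100
    print(generate(dummy_lm, history=[17, 3, 42]))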

OpenAI's new work

Released at 9am this morning ^_^

---

What OpenAI did was...

"OpenAI has basically shown that if the predictive text in your mobile had a supercomputer behind it, you could tab complete real work after it read enough of the web."
- Smerity's tl;dr

An extension of ...

The team at OpenAI previously performed character-level language modeling on Amazon reviews.
A single neuron, with no "supervision", learned to track sentiment.

What OpenAI did was...

Crawled the outgoing links from Reddit and then extracted the text => 20GB

Trained a very large language model (on 256 of Google's TPUs) over the dataset

Tested the language model with zero tuning on:
- reading comprehension
- translation
- summarization
- question answering

OpenAI's results

All of this is from picking up
naturally occurring patterns in language

Translation (ish)

Question answering

Summarization

"To induce summarization behavior we add the text TL;DR: after the article"

(though results aren't great ...)
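The mechanics of that "zero tuning" are worth spelling out: the prompt is the only "programming". A minimal sketch, assuming some sample_from_lm(prompt, max_tokens) wrapper around a trained model. The wrapper name is made up; the TL;DR: suffix is the trick quoted above, and the example-pair prompt for translation follows the same spirit (the pairs themselves are illustrative).

    def summarize(article, sample_from_lm, max_tokens=100):
        # Zero-shot summarization: no fine-tuning, just append the magic suffix.
        return sample_from_lm(article.strip() + "\nTL;DR:", max_tokens)

    def translate_en_fr(sentence, sample_from_lm, max_tokens=40):
        # Zero-shot translation (ish): show a couple of "en = fr" pairs and let
        # the LM continue the naturally occurring pattern.
        prompt = ("good morning = bonjour\n"
                  "thank you = merci\n"
                  + sentence + " = ")
        return sample_from_lm(prompt, max_tokens)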

My reaction to OpenAI

Similar to much of my own stack (reassuring)

Puts more weight behind my core thesis:
mining knowledge from the web using LMs

Humans can teach language models by writing normally as they would on the web

Reworded: You can likely bring value to the long tail of online communities by simply reading their shared data

Limits of OpenAI's work

Mostly what we've seen in the past but bigger

Models are too large for sane production

Requires huge resources (256 TPUs) for training

Not able to precisely control output

(IMHO) hyped with a danger narrative

LMs as core primitives

Language models (LMs) are a supercharged version of your phone's predictive text

Prediction = LMs leverage every piece of info they can to answer "what's next given past history"

Storage = LMs act as a form of compression

Communication = LMs can optimize their "language" depending on task and shared info

Predictions on ML + LM

> Language models extract + compress knowledge

 

> File formats will be deprecated

> Declarative programming for non-programmers
 

A new ML-first programming language
(A Turing-complete LM could serve as the building block of a programming language)

Predictions on ML + LM

LMs aren't limited to text - "guess the next X" works in {vision, audio, physics, ...}
 

"So what knowledge is left
unextracted
from the data we already have..?"

"How many proprietary are actually worthless as you can recreate that knowledge from an unsupervised method and open dataset?"

Predictions on ML + LM

As an example I went to use Magic Pony (previously acquired by Twitter) but instead found WaveOne:

Predictions on ML + LM

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa = 379 bytes
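The wall of 'a's is making a point about predictable data: a good predictor is a good compressor. Even plain zlib makes the point, and an LM-based compressor pushes the same idea much further onto real text:

    import zlib

    raw = b"a" * 379                    # the slide's 379 bytes of 'a'
    packed = zlib.compress(raw)
    print(len(raw), "->", len(packed))  # 379 raw bytes shrink to roughly a dozen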

Predictions on ML + LM

File formats will be deprecated:
If we can translate between Fr<>En,
why bother with JSON<>CSV?

by█-█produc█t █of█ which█ was█ increase█d to█uri█s█m █to█ the█ to█wn█. █

{"title": "█I█ █fo█un█d █a█ s█ec█re█t █spot█!█", "subreddit": "█FortNiteBR", █"is_self": true, █"url": "https://www█

Predictions on ML + LM

File formats will be deprecated:
If we can translate between Fr<>En,
why bother with JSON<>CSV?


Humans have internalized many menial chores, which is a huge opportunity (= reduce pain)
(keep your data in the best human format!)

Bonus: language models are a form of compression so your data is smaller :)
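A sketch of what that interface could look like: format conversion treated as translation, with a hypothetical lm_convert standing in for a model prompted (or trained) to map between formats. The record echoes the Reddit example from earlier.

    def json_to_csv(record, lm_convert):
        # One worked example in the prompt, then the record we actually care about.
        # `lm_convert` is hypothetical - any LM sampler that continues the prompt.
        prompt = (
            'JSON: {"title": "I found a secret spot!", "subreddit": "FortNiteBR"}\n'
            'CSV: title,subreddit\n'
            '"I found a secret spot!",FortNiteBR\n'
            '\n'
            'JSON: ' + record + '\n'
            'CSV:'
        )
        return lm_convert(prompt)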

Predictions on ML + LM

Declarative programming for non-programmers

Predictions on ML + LM

Declarative programming for non-programmers

 

SQL and databases allowed complex queries over structured data in a declarative way ...

What about complex queries on unstructured data? Do we (as humans) need to read the original data to understand or organize it?

"Find all articles on the Hacker News homepage about machine learning"

Predictions on ML + LM

Declarative programming for non-programmers

 

Bonus A: The representation of the data improves according to how it's queried and used

Bonus B: As the underlying model is a language model, we can use it to suggest to the user what to do next or indicate potential errors

"Find all articles on the Hacker News homepage about machine learning<TAB> and save to CSV?"

Predictions on ML + LM

> Language models extract + compress knowledge

 

> File formats will be deprecated

> Declarative programming for non-programmers
 

A new ML-first programming language
(A Turing-complete LM could serve as the building block of a programming language)

Minimums

Do they have a way of making and testing hypotheses about the company?

Even analytics (an SQL query run and turned into a graph) provides a heartbeat on the business
 

Extreme example: Freelancer.com's graphs
(inspired by Harrah's Casino)

Even enough data to ask:
"Is today the same as yesterday?"

Key take-aways

If ML is core:
How are they architecting for the long term?

If ML isn't core:
Do they have someone who can keep up with the off-the-shelf open source components?
{Google, NVIDIA, Facebook, Microsoft} are giving away their latest work

Startup Potential

The existing incumbents have optimized for the last war and are just as confused as anyone

The location and depth of moats is shifting


Low hanging fruit is everywhere
(many times more valuable than PageRank)

To existing startups

The open source ecosystem can be leveraged to help and is essentially R&D for you sponsored by {Google, NVIDIA, Facebook, ...}

AI is really just a tool for listening to your customers. If they're not listening (i.e. basic analytics) this won't magically help them.

Facebook's LASER

Sentence encoding (=understanding how sentences are similar) for 93 languages

"[N]o need to specify the input language ... According to our experience, the sentence encoder also supports code-switching, i.e. the same sentences can contain words in several different languages."


ML Happy Half Hour

By smerity
