The ML
Happy (Half) Hour

¯\_(ツ)_/¯

---

1) explain what we think is now possible

2) pull back the hood to give some intuition of how this is possible

3) cover the ways this isn't magical
(i.e. where it doesn't work)

4) brainstorm about implications

Aim:

"The team to come away with
2-3 'things we’ve learned'
that help them ask better questions and evaluate ML companies in a way they couldn't before"

Aim:

a) OpenAI paper released today
b) Predictions for AI
c) Open discussion =]

Loose plan

Exploring the intersections of:

Unsupervised language modeling

+

Web crawling
+
Easier-to-use "programming" via deep learning

For my work

LMs as core primitives

Language models (LMs) are a supercharged version of your phone's predictive text

Prediction = LMs leverage every piece of info they can to answer "what's next given past history"

Storage = LMs act as a form of compression

Communication = LMs can change their "language" depending on context

Step 1: Tokenization

Tokenization is fast and learns predictable repeated surface structure (~compression)

by-product of which was increased tourism to the town.

by█-█produc█t █of█ which█ was█ increase█d to█uri█s█m █to█ the█ to█wn█. █

Step 1: Tokenization




{"title": "█I█ █fo█un█d █a█ s█ec█re█t █spot█!█", "subreddit": "█FortNiteBR", █"is_self": true, █"url": "https://www█

Step 2: Predict next

This is where the large and complex language model comes in: given the history, guess the next token


{"title": "█I█ █fo█un█d █a█ s█ec█re█t █spot█!█", "subreddit": "█FortNiteBR", █"is_self": true, █"url": "https://www█

epic█games█.com█/█f█ornite█/█
epicgames.com█/█f█ornite█/█
ep█ic█ga█me█s█.com
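The generation loop itself is tiny; all the complexity hides inside the model. A minimal sketch, where next_token_probs is a stand-in for whatever trained LM you have (the uniform dummy model exists only so the sketch runs):

    import numpy as np

    def generate(next_token_probs, history, steps=8, temperature=1.0, seed=0):
        # Repeatedly ask the LM "what's next given past history?" and sample a token.
        rng = np.random.default_rng(seed)
        tokens = list(history)
        for _ in range(steps):
            probs = next_token_probs(tokens)          # one probability per vocab entry
            probs = probs ** (1.0 / temperature)      # temperature reshapes confidence
            probs = probs / probs.sum()
            tokens.append(int(rng.choice(len(probs), p=probs)))
        return tokens

    # Stand-in model: uniform over a 100-token vocabulary. A real LM (e.g. GPT-2)
    # replaces this with a network conditioned on the full token history.
    dummy_lm = lambda history: np.ones(100) / 100
    print(generate(dummy_lm, history=[17, 3, 42]))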

OpenAI's new work

Released at 9am this morning ^_^

---

What OpenAI did was...

"OpenAI has basically shown that if the predictive text in your mobile had a supercomputer behind it, you could tab complete real work after it read enough of the web."
- Smerity's tl;dr

An extension of ...

The team at OpenAI previously performed character-level language modeling on Amazon reviews.
A single neuron, with no "supervision", learned to track sentiment.

What OpenAI did was...

Crawled the outgoing links from Reddit and then extracted the text => 20GB

Trained a very large language model (on 256 of Google's TPUs) over the dataset

Tested the language model with zero tuning on:
- reading comprehension
- translation
- summarization
- question answering

OpenAI's results

All of this is from picking up
naturally occurring patterns in language

Translation (ish)

Question answering

Summarization

"To induce summarization behavior we add the text TL;DR: after the article"

(though results aren't great ...)
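The mechanics of that "zero tuning" are worth spelling out: the prompt is the only "programming". A minimal sketch, assuming some sample_from_lm(prompt, max_tokens) wrapper around a trained model. The wrapper name is made up; the TL;DR: suffix is the trick quoted above, and the example-pair prompt for translation follows the same spirit (the pairs themselves are illustrative).

    def summarize(article, sample_from_lm, max_tokens=100):
        # Zero-shot summarization: no fine-tuning, just append the magic suffix.
        return sample_from_lm(article.strip() + "\nTL;DR:", max_tokens)

    def translate_en_fr(sentence, sample_from_lm, max_tokens=40):
        # Zero-shot translation (ish): show a couple of "en = fr" pairs and let
        # the LM continue the naturally occurring pattern.
        prompt = ("good morning = bonjour\n"
                  "thank you = merci\n"
                  + sentence + " = ")
        return sample_from_lm(prompt, max_tokens)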

My reaction to OpenAI

Similar to much of my own stack (reassuring)

Puts more weight behind my core thesis:
mining knowledge from the web using LMs

Humans can teach language models by writing normally as they would on the web

Reworded: You can likely bring value to the long tail of online communities by simply reading their shared data

Limits of OpenAI's work

Mostly what we've seen in the past but bigger

Models are too large for sane production

Requires huge resources (256 TPUs) for training

Not able to precisely control output

(IMHO) hyped with a danger narrative

LMs as core primitives

Language models (LMs) are a supercharged version of your phone's predictive text

Prediction = LMs leverage every piece of info they can to answer "what's next given past history"

Storage = LMs act as a form of compression

Communication = LMs can optimize their "language" depending on task and shared info

Predictions on ML + LM

> Language models extract + compress knowledge

 

> File formats will be deprecated

> Declarative programming for non-programmers
 

A new ML-first programming language
(A Turing-complete LM could serve as the building block of a programming language)

Predictions on ML + LM

LMs aren't limited to text - "guess the next X" works in {vision, audio, physics, ...}
 

"So what knowledge is left
unextracted
from the data we already have..?"

"How many proprietary are actually worthless as you can recreate that knowledge from an unsupervised method and open dataset?"

Predictions on ML + LM

As an example I went to use Magic Pony (previously acquired by Twitter) but instead found WaveOne:

Predictions on ML + LM

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa = 379 bytes
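The wall of 'a's is making a point about predictable data: a good predictor is a good compressor. Even plain zlib makes the point, and an LM-based compressor pushes the same idea much further onto real text:

    import zlib

    raw = b"a" * 379                    # the slide's 379 bytes of 'a'
    packed = zlib.compress(raw)
    print(len(raw), "->", len(packed))  # 379 raw bytes shrink to roughly a dozen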

Predictions on ML + LM

File formats will be deprecated:
If we can translate between Fr<>En,
why bother with JSON<>CSV?

by█-█produc█t █of█ which█ was█ increase█d to█uri█s█m █to█ the█ to█wn█. █

{"title": "█I█ █fo█un█d █a█ s█ec█re█t █spot█!█", "subreddit": "█FortNiteBR", █"is_self": true, █"url": "https://www█

Predictions on ML + LM

File formats will be deprecated:
If we can translate between Fr<>En,
why bother with JSON<>CSV?


Humans have internalized many menial chores, which is a huge opportunity (= reduce pain)
(keep your data in the best human format!)

Bonus: language models are a form of compression so your data is smaller :)
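A sketch of what that interface could look like: format conversion treated as translation, with a hypothetical lm_convert standing in for a model prompted (or trained) to map between formats. The record echoes the Reddit example from earlier.

    def json_to_csv(record, lm_convert):
        # One worked example in the prompt, then the record we actually care about.
        # `lm_convert` is hypothetical - any LM sampler that continues the prompt.
        prompt = (
            'JSON: {"title": "I found a secret spot!", "subreddit": "FortNiteBR"}\n'
            'CSV: title,subreddit\n'
            '"I found a secret spot!",FortNiteBR\n'
            '\n'
            'JSON: ' + record + '\n'
            'CSV:'
        )
        return lm_convert(prompt)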

Predictions on ML + LM

Declarative programming for non-programmers

Predictions on ML + LM

Declarative programming for non-programmers

 

SQL and databases allowed complex queries over structured data in a declarative way ...

What about complex queries on unstructured data? Do we (as humans) need to read the original data to understand or organize it?

"Find all articles on the Hacker News homepage about machine learning"

Predictions on ML + LM

Declarative programming for non-programmers

 

Bonus A: The representation of the data improves according to how it's queried and used

Bonus B: As the underlying model is a language model, we can use it to suggest to the user what to do next or indicate potential errors

"Find all articles on the Hacker News homepage about machine learning<TAB> and save to CSV?"

Predictions on ML + LM

> Language models extract + compress knowledge

 

> File formats will be deprecated

> Declarative programming for non-programmers
 

A new ML-first programming language
(A Turing-complete LM could serve as the building block of a programming language)

Minimums

Do they have a way of making and testing hypotheses about the company?

Even analytics (an SQL query run and turned into a graph) provides a heartbeat on the business
 

Extreme example: Freelancer.com's graphs
(inspired by Harrah's Casino)

Even enough data to ask:
"Is today the same as yesterday?"

Key take-aways

If ML is core:
How are they architecting for the long term?

If ML isn't core:
Do they have someone who can keep up with the off-the-shelf open source components?
{Google, NVIDIA, Facebook, Microsoft} are giving away their latest work

Startup Potential

The existing incumbents have optimized for the last war and are just as confused as anyone

The location and depth of moats is shifting


Low hanging fruit is everywhere
(many times more valuable than PageRank)

To existing startups

The open source ecosystem can be leveraged to help and is essentially R&D for you sponsored by {Google, NVIDIA, Facebook, ...}

AI is really just a tool for listening to your customers. If they're not listening (i.e. basic analytics) this won't magically help them.

Facebook's LASER

Sentence encoding (=understanding how sentences are similar) for 93 languages

"[N]o need to specify the input language ... According to our experience, the sentence encoder also supports code-switching, i.e. the same sentences can contain words in several different languages."


ML Happy Half Hour

By smerity
