A Web Worth of Data: Common Crawl for NLP

Text By The Bay

April 24, 2015

Common Crawl is a non-profit that makes

web data

freely accessible to anyone

Each crawl archive is billions of pages:

February crawl archive is

1.9 billion web pages

~154 terabytes uncompressed

Released

totally free
without additional
intellectual property restrictions
(lives on Amazon Public Data Sets)

Common Crawl File Formats

WARC
+ Raw HTTP response headers
+ Raw HTTP responses

WAT
+ HTML head data
+ HTTP header fields
+ Extracted links / script tags

WET
+ Extracted text

Common Crawl WET

WET (Web Extracted Text) is released in
the crawl archive each month

Data attempts to cover widest range of use cases

No distinction between header / navigation / content:

Does not remove boilerplate

Does not re-format text as appears in browser
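A WET file is a sequence of WARC-style records: a version line, header lines, a blank line, then Content-Length bytes of extracted text. The sketch below reads such records from an uncompressed byte stream. The helper name iter_wet_records and the sample record are made up for illustration; real crawl files are gzip-compressed and are probably better read with a dedicated WARC library.

```python
import io

def iter_wet_records(stream):
    """Yield (headers, text) for each WARC-style record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers["Content-Length"])
        text = stream.read(length).decode("utf-8", errors="replace")
        yield headers, text

# Hypothetical sample record, built by hand for illustration.
body = "Home | About | Contact\nWelcome to the example page.\n".encode("utf-8")
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(body)).encode("ascii") + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

records = list(iter_wet_records(io.BytesIO(sample)))
for headers, text in records:
    print(headers["WARC-Target-URI"], len(text.splitlines()), "lines")
```

Note that the navigation line ("Home | About | Contact") arrives mixed in with the page content, which is the "no distinction between header / navigation / content" point above.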

Origins of Common Crawl

Common Crawl founded in 2007
by Gil Elbaz (Applied Semantics / Factual)

Google and Microsoft were the powerhouses

Goal: Democratize and simplify access to
"the web as a dataset"

Open Data and Open Source

Data powers the algorithms in our field

How can we have an even playing field for innovation without access to such data?
(Can you replicate work without the data..?)

More data can beat better algorithms
(Banko and Brill, 2001)

Common Crawl for NLP

The web is largely unannotated,
so how are people using it for NLP?

(a) Use extracted text for unsupervised algorithms

(b) Filter it into being semi-annotated or annotated
(big data ⇒ filter ⇒ curated smaller dataset)

Examples of Previous Work

Unsupervised Algorithms

+ N-gram & language models

+ GloVe: Global Vectors for Word Representation

Filtering

+ Web tables for gazetteers

+ Dirt Cheap Web-Scale Parallel Text
+ Extracting US phone numbers

N-gram Counts & Language Models from the Common Crawl
Christian Buck, Kenneth Heafield, Bas van Ooyen
(Edinburgh, Stanford, Owlin BV)

Processed all the text of Common Crawl to produce 975 billion deduplicated tokens

Google N-gram Dataset (Web 1T) consists of
1 trillion tokens
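For readers new to n-gram counts, here is a minimal sketch of what is being counted. The function name ngram_counts and the toy sentence are illustrative assumptions; this is not the pipeline Buck et al. used, which had to work at a scale where in-memory counting is not an option.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the cat".split()
bigrams = ngram_counts(tokens, 2)
print(bigrams.most_common(3))
```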

N-gram Counts & Language Models from the Common Crawl

Improvement over Google N-grams (2006):

Inclusion of low count entries

Deduplication to reduce boilerplate

"Google has shared a deduplicated version ...

in limited contexts, but it was never publicly released."
-- N-gram Counts & Language Models from the Common Crawl (Buck et al.)

N-gram Counts & Language Models from the Common Crawl

"The advantages of structured text do not outweigh the extra computing power needed to process them."
-- N-gram Counts & Language Models from the Common Crawl (Buck et al.)

N-gram Counts & Language Models from the Common Crawl

English (23TB), German (1.02TB), Spanish (986GB), French (912GB), Japanese (577GB), Russian (537GB), Polish (334GB), Italian (325GB) ...

Only 0.14% of the corpus was Finnish, yet yielded a useful corpus of 47GB.

42 languages with >10GB

73 languages with >1GB

N-gram & Language Models

Sentence level deduplication led to a removal of 80% of the English corpus, lower for other languages
(in line with Bergsma et al. (2010))

Before preprocessing (English): 23.62 TB

After preprocessing (English): 5.14 TB

(59 billion lines, 975 billion tokens)
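Sentence-level deduplication can be sketched as keeping the first occurrence of each line and remembering only a short hash of every line seen so far. The function name dedupe_lines, the choice of an 8-byte BLAKE2b digest, and the toy corpus are assumptions for illustration, not the authors' implementation.

```python
import hashlib

def dedupe_lines(lines):
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        # Store a fixed-size hash instead of the line itself to bound memory.
        key = hashlib.blake2b(line.strip().encode("utf-8"), digest_size=8).digest()
        if key not in seen:
            seen.add(key)
            yield line

corpus = [
    "Home | About | Contact",
    "Common Crawl releases a new archive.",
    "Home | About | Contact",
    "All rights reserved.",
    "All rights reserved.",
]
unique = list(dedupe_lines(corpus))
print(len(corpus), "lines in,", len(unique), "lines out")
```

For scale, the English figures above (23.62 TB down to 5.14 TB) correspond to roughly 78% of the bytes being removed by preprocessing.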

N-gram & Language Models

Substantial improvement in perplexity

N-gram & Language Models

"...even though the web data is quite noisy even limited amounts give improvements."

N-gram & Language Models

Project data was released at
http://statmt.org/ngrams

Raw text split by language
Deduped text split by language
Resulting language models

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, Richard Socher, Christopher D. Manning

Word vector representations:

king - queen = man - woman

king - man + woman = queen

(produces dimensions of meaning)
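The analogy arithmetic can be tried on toy vectors. In this sketch the three dimensions (royalty, gender, fruit) and all vector values are hand-picked assumptions so that the arithmetic works exactly; trained vectors only satisfy it approximately, which is why the nearest neighbour by cosine similarity is used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [y - x + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors: (royalty, gender, fruit).
vectors = {
    "king":  [1.0,  1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0,  1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0,  0.0, 1.0],
}

# king - man + woman
print(analogy(vectors, "man", "king", "woman"))
```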

GloVe: Global Vectors for Word Representation

Trained on non-zero entries of a
global word-word co-occurrence matrix

Populating matrix requires a single pass
Subsequent training is far faster

GloVe = O(|C|^0.8)

On-line window-based (e.g. word2vec) = O(|C|)
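The single pass that populates the co-occurrence matrix can be sketched as follows. Only non-zero entries are stored, and a pair of words d positions apart contributes 1/d, as described in the GloVe paper. The function name cooccurrence, the window size, and the toy sentence are assumptions for illustration; this is not the released GloVe code.

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Build a sparse, symmetric word-word co-occurrence table in one pass."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # Look back at most `window` positions; each pair is visited once.
        for j in range(max(0, i - window), i):
            weight = 1.0 / (i - j)  # closer words count more
            counts[(word, tokens[j])] += weight
            counts[(tokens[j], word)] += weight
    return dict(counts)

tokens = "the cat sat on the mat".split()
matrix = cooccurrence(tokens, window=2)
print(len(matrix), "non-zero entries")
print(matrix[("cat", "sat")], matrix[("cat", "on")])
```

Training then iterates over these stored entries rather than over the corpus, which is where the sub-linear cost in corpus size |C| comes from.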

GloVe On Various Corpora

Semantic: "Athens is to Greece as Berlin is to _?"

Syntactic: "Dance is to dancing as fly is to _?"

GloVe over Big Data

GloVe using 42 billion tokens from Common Crawl outperformed word2vec w/ 100 billion tokens (Google News)

Largest GloVe model to prove scalability uses
840 billion tokens from Common Crawl

Source code and pre-trained models at
http://www-nlp.stanford.edu/projects/glove/

Mix and Match: Word Vectors

More data, less fine tuning needed

Best model: mix of all excl. word2vec

Automatic Noun Compound Interpretation using Deep Neural Networks and Word Embeddings
(Dima and Hinrichs, 2015)

Examples of Previous Work

Unsupervised Algorithms

+ N-gram & language models

+ GloVe: Global Vectors for Word Representation

Filtering

+ Web tables for gazetteers

+ Dirt Cheap Web-Scale Parallel Text
+ Extracting US phone numbers

Gazetteers for NER

+ Want the widest variety of topics possible

+ Aim to keep them modern / up to date

+ Capture relationships between similar words
(disambiguation)

Google Sets

Web tables as a source of gazetteers + relations

Querying ["cat"],

returns ["dog", "bird", "horse", "rabbit", ...]

Querying ["cat", "ls"],

returns ["cd", "head", "cut", "vim", ...]
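One simple way to get this behaviour from web tables is to treat each table column as a set and rank items by how often they share a column with every query item. The sketch below assumes that approach; the function name expand_set and the tiny column list are invented for illustration, and this is not a description of how Google Sets worked.

```python
from collections import Counter

def expand_set(columns, query, top_n=2):
    """Rank items that co-occur in a column with all query items."""
    query = set(query)
    scores = Counter()
    for column in columns:
        items = set(column)
        if query <= items:  # the column must contain every query item
            scores.update(items - query)
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

# Hypothetical table columns.
columns = [
    ["cat", "dog", "bird", "horse"],
    ["cat", "dog", "bird", "rabbit"],
    ["cat", "dog", "horse", "bird"],
    ["cat", "ls", "cd", "head"],
    ["ls", "cat", "cd", "vim"],
]

print(expand_set(columns, ["cat"]))
print(expand_set(columns, ["cat", "ls"]))
```

Adding "ls" to the query restricts the evidence to columns of shell commands, which is the disambiguation effect shown above.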

Web Data Commons Web Tables

Extracted 11.2 billion tables from WARC files,
filtered to keep relational tables via trained classifier

Only 1.3% of the original data was kept,
yet it still remains hugely valuable

Resulting dataset:
11.2 billion tables ⇒ 147 million relational web tables

Web Data Commons Web Tables

Popular column headers: name, title, artist, location, model, manufacturer, country ...

Released at webdatacommons.org/webtables/

Web-Scale Parallel Text

Dirt Cheap Web-Scale Parallel Text from the Common Crawl (Smith et al.)

Processed all text from URLs of the style:
website.com/[langcode]/

"...nothing more than a set of common two-letter language codes ... [we] mined 32 terabytes ... in just under a day"
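The URL-matching idea can be sketched by replacing the language-code path segment with a placeholder and grouping URLs that share the resulting key. The names LANG_CODES, language_key and candidate_pairs, the four-code subset, and the example URLs are all assumptions for illustration; the sketch covers only candidate discovery, not the alignment and filtering steps of Smith et al.

```python
import re
from collections import defaultdict

LANG_CODES = {"en", "fr", "de", "es"}  # small illustrative subset
SEGMENT = re.compile(r"/(" + "|".join(sorted(LANG_CODES)) + r")(?=/|$)")

def language_key(url):
    """Return (key, langcode) if the URL has a language-code segment, else None."""
    match = SEGMENT.search(url)
    if match is None:
        return None
    key = url[:match.start()] + "/*" + url[match.end():]
    return key, match.group(1)

def candidate_pairs(urls):
    """Group URLs that differ only in their language-code segment."""
    groups = defaultdict(dict)
    for url in urls:
        result = language_key(url)
        if result:
            key, lang = result
            groups[key][lang] = url
    # Keep only keys seen in more than one language.
    return {key: langs for key, langs in groups.items() if len(langs) > 1}

urls = [
    "http://website.com/en/about",
    "http://website.com/fr/about",
    "http://website.com/en/contact",
    "http://other.org/news",
]
pairs = candidate_pairs(urls)
print(pairs)
```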

Web-Scale Parallel Text

(source = foreign language, target = English)

Web-Scale Parallel Text

Both EuroParl & United Nations are large and well curated parallel texts,

but both have very specific domains & genres.

Web-Scale Parallel Text

"...resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain."

Minimal cleaning & filtering still resulted in a substantial improvement in SMT performance

Manual inspection across three languages:
80% of the data contained good translations

Extracting US Phone Numbers

"Let's use Common Crawl to help match businesses from Yelp's database to the possible web pages for those businesses on the Internet."

Yelp extracted ~748 million US phone numbers from the Common Crawl December 2014 dataset

Regular expression over extracted text (WET files)
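A regular expression of this kind might look like the sketch below. The pattern, the name extract_phone_numbers, and the sample text are assumptions for illustration; Yelp's actual expression is in their blog post and may differ.

```python
import re

# Optional +1 / 1 prefix, then area code, exchange and line number,
# with optional space, dot or hyphen separators. Area codes and
# exchanges in the North American plan do not start with 0 or 1.
PHONE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?([2-9]\d{2})\)?[\s.-]?([2-9]\d{2})[\s.-]?(\d{4})(?!\d)"
)

def extract_phone_numbers(text):
    """Return each matched number normalised to ten digits."""
    return ["".join(match.groups()) for match in PHONE.finditer(text)]

text = "Call (415) 555-0199 or 1-212-555-0123. Order #1234567890123 is not a phone."
numbers = extract_phone_numbers(text)
print(numbers)
```

The digit lookarounds keep the pattern from firing inside longer digit runs such as order numbers, which matters when the input is unfiltered web text.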

Extracting US Phone Numbers

Total complexity: 134 lines of Python
Total time: 1 hour (20 × c3.8xlarge)
Total cost: $10.60 (Python using EMR)

Matched against Yelp's database:

48% had exact URL matches

61% had matching domains

More details (and full code) on Yelp's blog post:
Analyzing the Web For the Price of a Sandwich

WikiReverse

Created by volunteer Ross Fairbanks for fun

Task: Find hyperlinks to Wikipedia from the web

Result: Dataset of over 36 million links

Code and data released online at wikireverse.org

Similar work by UMass and Google Research:
Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia

Common Crawl's Derived Datasets

Natural language processing:

Parallel text for machine translation

N-gram & language models (975 bln tokens)

WDC "Collect ALL the web tables" (3.5 bln)

Large scale web analysis:

WDC Hyperlink Graphs (128 bln edges)

WikiReverse.org - Wikipedia in-links analysis

and a million more use cases!

Why am I so excited..?

Open data is catching on!

Even playing field for academia and industry

Baidu used Common Crawl for Deep Speech

Google Web 1T ⇒ Buck et al.'s N-grams

Google's Wikilinks ⇒ WikiReverse

Google's Sets ⇒ WDC Web Tables

Common Crawl releases their dataset
and brilliant people build on top of it

Challenge: Parser training data

Automatic Acquisition of Training Data for Statistical Parsers (Howlett and Curran, 2008)

Use knowledge base of facts or simple sentences:
"Mozart was born in 1756."

Parse more complex sentences with dep constraints:
"Wolfgang Amadeus Mozart (baptized Johannes Chrysostomus Wolfgangus Theophilus) was born in Salzburg in 1756, the second survivor out of six children."

Stephen Merity
stephen@commoncrawl.org
commoncrawl.org
