Common Crawl for NLP

Web-Scale Natural Language Processing in Northern Europe

November 24, 2014

Common Crawl is a non-profit that makes web data freely accessible to anyone

Each crawl archive is billions of pages:

The October crawl archive is 3.72 billion web pages (~254 terabytes uncompressed)

Released totally free, without additional intellectual property restrictions

Common Crawl File Formats

WARC
+ Raw HTTP response headers
+ Raw HTTP responses

WAT
+ HTML head data
+ HTTP header fields
+ Extracted links / script tags

WET
+ Extracted text

Common Crawl WET

WET (Web Extracted Text) is released in the crawl archive each month

The data attempts to cover the widest range of use cases

No distinction between header / navigation / content:

+ Does not remove boilerplate

+ Does not re-format text as it appears in a browser
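
To make the WET format concrete, here is a minimal sketch of reading extracted text record by record. It assumes the usual WARC record layout (a version line, header lines, a blank line, then a payload of Content-Length bytes); the helper names and the sample records are made up for illustration and are not part of any Common Crawl tooling.

```python
import io

def iter_wet_records(stream):
    """Yield (headers, text) for each record in an uncompressed WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if not line.strip():
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length counts bytes, so read before decoding.
        payload = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, payload.decode("utf-8", "replace")

def make_record(uri, text):
    """Build one synthetic WET-style record (test data only)."""
    body = text.encode("utf-8")
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: conversion\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"

SAMPLE = (
    make_record("http://example.com/", "Home | About\nHello world")
    + make_record("http://example.org/fi", "Hyvää päivää")
)
records = list(iter_wet_records(io.BytesIO(SAMPLE)))
for headers, text in records:
    print(headers["WARC-Target-URI"], len(text))
```

Note that the navigation text ("Home | About") arrives mixed in with the content, which is exactly the lack of boilerplate removal described above. For a real, gzip-compressed WET file the same function should work on a stream opened with `gzip.open(path, "rb")`, though that is untested here.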

Common Crawl for NLP

The web is largely unannotated,
so how are people using it for NLP?

(a) Use extracted text for unsupervised algorithms

(b) Filter it into being semi-annotated or annotated
(big data ⇒ filter ⇒ curated smaller dataset)

Examples of Previous Work

Unsupervised Algorithms

+ N-gram & language models

+ GloVe: Global Vectors for Word Representation

Filtering

+ Web tables for gazetteers

+ Dirt Cheap Web-Scale Parallel Text

N-gram Counts & Language Models from the Common Crawl
Christian Buck, Kenneth Heafield, Bas van Ooyen
(Edinburgh, Stanford, Owlin BV)

975 billion deduplicated tokens

Improvement over Google N-grams:

Inclusion of low count entries

Deduplication to reduce boilerplate
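
As a rough illustration of why keeping low count entries matters, the sketch below counts bigrams over a toy corpus and then prunes with a count threshold. The corpus, the whitespace tokenizer, and the threshold of 2 are all assumptions for the example; they are not the settings used by Buck et al. or by Google.

```python
from collections import Counter

def count_ngrams(lines, n):
    """Count n-grams (as token tuples) over whitespace-tokenized lines."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_count):
    """Drop every n-gram seen fewer than min_count times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

corpus = [
    "the cat sat on the mat",
    "the cat ate",
    "a dog sat on the mat",
]
bigrams = count_ngrams(corpus, 2)
pruned = prune(bigrams, 2)  # hypothetical threshold

print(len(bigrams), "distinct bigrams before pruning")
print(len(pruned), "distinct bigrams after pruning")
```

In this toy corpus, half of the distinct bigrams occur only once and disappear under pruning, so a release that keeps them retains far more of the long tail.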

N-gram Counts & Language Models from the Common Crawl

"The advantages of structured text do not outweigh the extra computing power needed to process them."
-- N-gram Counts & Language Models from the Common Crawl (Buck et al.)

N-gram Counts & Language Models from the Common Crawl

English (23TB), German (1.02TB), Spanish (986GB), French (912GB), Japanese (577GB), Russian (537GB), Polish (334GB), Italian (325GB) ...

Only 0.14% of the corpus was Finnish, yet it yielded a useful corpus of 47GB.

42 languages with >10GB

73 languages with >1GB

N-gram & Language Models

Sentence-level deduplication removed about 80% of the English corpus, and less for other languages
(in line with Bergsma et al. (2010))

Before preprocessing (English): 23.62 TB

After preprocessing (English): 5.14 TB

(59 billion lines, 975 billion tokens)
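
The deduplication step can be sketched as keeping only the first occurrence of each line, tracked by a hash. This is a minimal in-memory version under stated assumptions: the truncated SHA-1 key and the sample lines are illustrative choices, not the hashing scheme or data from the paper. The last lines check the size reduction implied by the figures above.

```python
import hashlib

def dedupe_lines(lines):
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        # Store a short fixed-size hash instead of the full line to save memory.
        key = hashlib.sha1(line.strip().encode("utf-8")).digest()[:8]
        if key in seen:
            continue
        seen.add(key)
        yield line

lines = [
    "Home | About | Contact",
    "Common Crawl releases web data.",
    "Home | About | Contact",
    "Copyright 2014",
    "A second article sentence.",
    "Copyright 2014",
]
unique = list(dedupe_lines(lines))
print(len(lines), "->", len(unique), "lines")

# Reduction implied by the English sizes quoted above (23.62 TB -> 5.14 TB).
removed_fraction = 1 - 5.14 / 23.62
print(f"removed: {removed_fraction:.1%}")
```

The repeated navigation and copyright lines are the kind of boilerplate that deduplication removes, and the size arithmetic comes out at a little under 80%.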

N-gram & Language Models

"...even though the web data is quite noisy even limited amounts give improvements."

N-gram & Language Models

Project data was released at
http://statmt.org/ngrams

Raw text split by language
Deduped text split by language
Resulting language models

GloVe: Global Vectors for Word Representation

Jeffrey Pennington, Richard Socher, Christopher D. Manning

Word vector representations:

king - queen = man - woman

king - man + woman = queen

(produces dimensions of meaning)

Unsupervised methods: perfect for Common Crawl
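
The vector arithmetic above can be tried on hand-made toy vectors. The three dimensions (royalty, gender, food) and every number below are invented for this sketch; real GloVe vectors are learned and have far more dimensions.

```python
import math

# Hypothetical 3-d vectors: (royalty, gender, food).
VECTORS = {
    "king":   (0.9,  0.8, 0.0),
    "queen":  (0.9, -0.8, 0.0),
    "man":    (0.1,  0.8, 0.0),
    "woman":  (0.1, -0.8, 0.0),
    "prince": (0.7,  0.8, 0.1),
    "apple":  (0.0,  0.0, 1.0),
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' by the nearest vector to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words themselves, as is usual for this task.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("man", "king", "woman", VECTORS)
print(answer)
```

Here `king - man + woman` lands on the toy vector for "queen" because the gender dimension flips while the royalty dimension is preserved.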

GloVe: Global Vectors for Word Representation

Trained on non-zero entries of a
global word-word co-occurrence matrix

Populating matrix requires single pass
Subsequent training is far faster

GloVe = O(|C|^0.8)

On-line window-based methods = O(|C|)
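
A minimal sketch of the single pass that populates the co-occurrence matrix, stored sparsely so that only non-zero entries exist. The window size, the 1/distance weighting, and the two-sentence corpus are assumptions made for this example.

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Build a sparse, symmetric word-word co-occurrence table in one pass."""
    counts = defaultdict(float)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            # Look back at up to `window` preceding tokens.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)  # closer words count more
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return dict(counts)

corpus = ["ice is cold", "steam is hot"]
X = cooccurrence(corpus, window=2)

vocab = {w for pair in X for w in pair}
print(len(X), "non-zero entries out of", len(vocab) ** 2, "possible cells")
```

Training then iterates only over the stored entries, so its cost depends on the number of non-zero cells and not on the number of tokens in the corpus.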

GloVe On Various Corpora

Semantic: "Athens is to Greece as Berlin is to _?"

Syntactic: "Dance is to dancing as fly is to _?"

GloVe over Big Data

GloVe using 42 billion tokens from Common Crawl beat word2vec w/ 100 billion tokens (Google News)

Largest GloVe model to prove scalability uses
840 billion tokens from Common Crawl

Source code and pre-trained models at
http://www-nlp.stanford.edu/projects/glove/

Examples of Previous Work

Unsupervised Algorithms

+ N-gram & language models

+ GloVe: Global Vectors for Word Representation

Filtering

+ Web tables for gazetteers

+ Dirt Cheap Web-Scale Parallel Text

Gazetteers for NER

+ Want the widest variety of topics possible

+ Aim to keep them modern / up to date

+ Capture relationships between similar words
(disambiguation)

Google Sets

Web tables as a source of gazetteers + relations

Querying ["cat"],

returns ["dog", "bird", "horse", "rabbit", ...]

Querying ["cat", "ls"],

returns ["cd", "head", "cut", "vim", ...]
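
The behaviour above can be approximated by a simple set expansion over table columns: keep the columns that contain every seed, then rank the other cells by how many of those columns they appear in. This is a sketch of the general idea only, not how Google Sets worked, and the columns below are invented.

```python
from collections import Counter

# Hypothetical table columns, as might be extracted from web tables.
COLUMNS = [
    ["cat", "dog", "bird", "horse"],
    ["cat", "dog", "rabbit", "bird"],
    ["dog", "cat", "horse"],
    ["cat", "ls", "cd", "head"],
    ["ls", "cat", "cut", "vim", "cd"],
    ["paris", "berlin", "oslo"],
]

def expand(seeds, columns, top_n=5):
    """Return up to top_n items that share columns with all of the seeds."""
    seeds = set(seeds)
    scores = Counter()
    for column in columns:
        cells = set(column)
        if seeds <= cells:  # column must contain every seed
            scores.update(cells - seeds)
    # Sort by descending column count, then alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]

animals = expand(["cat"], COLUMNS)
commands = expand(["cat", "ls"], COLUMNS)
print(animals)
print(commands)
```

Adding "ls" as a second seed restricts the match to the columns of shell commands, which is the disambiguation effect the slide illustrates.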

Web Data Commons Web Tables

Extracted 11.2 billion tables from WARC files,
filtered to keep relational tables via trained classifier

Only 1.3% of the original data was kept,
yet it still remains hugely valuable

Web Data Commons Web Tables

Resulting dataset:
11.2 billion tables ⇒ 147 million relational web tables

Popular column headers: name, title, artist, location, model, manufacturer, country ...

Released at webdatacommons.org/webtables/

Web-Scale Parallel Text

Dirt Cheap Web-Scale Parallel Text from the Common Crawl (Smith et al.)

"...nothing more than a set of common two-letter language codes ... [we] mined 32 terabytes ... in just under a day"

Processed all text from URLs of the style:
website.com/[langcode]/
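
A minimal sketch of the URL matching idea: replace the language code segment with a placeholder, group URLs that become identical, and pair each foreign-language page with its English counterpart. The language code set, the function names, and the URLs are assumptions for illustration; the actual pipeline of Smith et al. is not reproduced here.

```python
from collections import defaultdict
from urllib.parse import urlsplit

LANG_CODES = {"en", "fr", "de", "es", "fi"}  # hypothetical subset

def url_key(url):
    """Return ((host, path with the language code masked), lang), or (None, None)."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    found = [s for s in segments if s.lower() in LANG_CODES]
    if len(found) != 1:
        return None, None  # no language code, or an ambiguous path
    lang = found[0].lower()
    masked = "/".join("*" if s.lower() == lang else s for s in segments)
    return (parts.netloc.lower(), masked), lang

def candidate_pairs(urls, target="en"):
    """Pair each non-target page with the target-language page at the same key."""
    groups = defaultdict(dict)
    for url in urls:
        key, lang = url_key(url)
        if key is not None:
            groups[key][lang] = url
    pairs = []
    for by_lang in groups.values():
        if target in by_lang:
            for lang, url in sorted(by_lang.items()):
                if lang != target:
                    pairs.append((url, by_lang[target]))  # (source, target)
    return pairs

URLS = [
    "http://website.com/en/about.html",
    "http://website.com/fr/about.html",
    "http://website.com/fi/about.html",
    "http://website.com/en/contact.html",
    "http://other.org/blog/post",
]
pairs = candidate_pairs(URLS)
for source, target in pairs:
    print(source, "<->", target)
```

The pairs are only candidates: whether the two pages are actually translations of each other still has to be checked on the text itself.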

Web-Scale Parallel Text

(source = foreign language, target = English)

Web-Scale Parallel Text

Both EuroParl & United Nations are large and well curated parallel texts,

but both have very specific domains & genres.

Web-Scale Parallel Text

"...resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain."

Minimal cleaning & filtering still resulted in a substantial improvement in SMT performance

Manual inspection across three languages:

80% of the data contained good translations

Common Crawl and the NeIC

We're excited about a potential
collaboration with NeIC

Tackling big data is challenging yet fundamentally important for advances in scientific computing:

NeIC is a forerunner in enabling large-scale computational experimentation in NLP

Thank you

Stephen Merity
stephen@commoncrawl.org
commoncrawl.org
