A Web Worth of Data: Common Crawl for NLP
Common Crawl is a non-profit that makes web data freely accessible to anyone
Each crawl archive is billions of pages:
February crawl archive is
Released
Common Crawl File Formats
- WARC
+ Raw HTTP response headers
+ Raw HTTP responses
- WAT
+ HTML head data
+ HTTP header fields
+ Extracted links / script tags
- WET
+ Extracted text
Common Crawl WET
WET (WARC Encapsulated Text) is released with each monthly crawl archive
The data aims to cover the widest possible range of use cases
No distinction between header / navigation / content:
- Does not remove boilerplate
- Does not re-format text as it appears in a browser
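To make the WET layout concrete, here is a minimal sketch of reading records from WET-style data. It assumes the usual WARC record layout (a `WARC/1.0` version line, header lines, a blank line, then `Content-Length` bytes of payload). The sample record and the helper name `iter_wet_records` are made up for illustration; real archives are gzip-compressed and are better read with a dedicated WARC library.

```python
import io

def iter_wet_records(stream):
    """Yield (headers, text) for each record in a WET-style byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank separator lines between records
        headers = {}
        while True:
            line = stream.readline().rstrip(b"\r\n")
            if not line:
                break  # a blank line ends the header block
            name, _, value = line.partition(b":")
            headers[name.decode().strip()] = value.decode().strip()
        length = int(headers.get("Content-Length", 0))
        payload = stream.read(length)
        yield headers, payload.decode("utf-8", errors="replace")

# Hypothetical record: note the navigation line is kept as-is (no boilerplate removal).
body = "Home | About | Contact\nWelcome to our site.\n".encode("utf-8")
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: conversion\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: " + str(len(body)).encode() + b"\r\n"
    b"\r\n" + body + b"\r\n\r\n"
)

records = list(iter_wet_records(io.BytesIO(SAMPLE)))
for headers, text in records:
    print(headers["WARC-Target-URI"], len(text))
```

Reading the payload by `Content-Length` rather than scanning for the next `WARC/` line avoids mis-splitting pages whose text happens to contain that string.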
Origins of Common Crawl
Common Crawl was founded in 2007 by Gil Elbaz (Applied Semantics / Factual)
Google and Microsoft were the powerhouses
Goal: Democratize and simplify access to
"the web as a dataset"
Open Data and Open Source
Data powers the algorithms in our field
How can we have an even playing field for innovation without access to such data?
(Can you replicate work without the data..?)
More data can beat better algorithms
(Banko and Brill, 2001)
Common Crawl for NLP
The web is largely unannotated,
so how are people using it for NLP?
(a) Use extracted text for unsupervised algorithms
(b) Filter it into being semi-annotated or annotated
(big data ⇒ filter ⇒ curated smaller dataset)
Examples of Previous Work
Unsupervised Algorithms
+ N-gram & language models
+ GloVe: Global Vectors for Word Representation
Filtering
+ Web tables for gazetteers
+ Dirt Cheap Web-Scale Parallel Text
+ Extracting US phone numbers
N-gram Counts & Language Models from the Common Crawl
Christian Buck, Kenneth Heafield, Bas van Ooyen
(Edinburgh, Stanford, Owlin BV)
Processed all the text of Common Crawl to produce 975 billion deduplicated tokens
For comparison, the Google N-gram Dataset (Web 1T) consists of 1 trillion tokens
N-gram Counts & Language Models from the Common Crawl
Improvements over Google N-grams (2006):
- Inclusion of low count entries
- Deduplication to reduce boilerplate
"Google has shared a deduplicated version ...
in limited contexts, but it was never publicly released."
-- N-gram Counts & Language Models from the Common Crawl (Buck et al.)
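The point about low count entries can be illustrated with a small sketch: count n-grams with no cutoff, then apply a minimum-count threshold to see what a pruned release would lose. The toy corpus, the function names, and the `min_count` value are assumptions for illustration, not the settings used by either dataset.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count all n-grams, padding each sentence with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, min_count):
    """Drop every n-gram seen fewer than min_count times."""
    return Counter({gram: c for gram, c in counts.items() if c >= min_count})

corpus = ["the cat sat", "the cat ran", "a dog sat"]
bigrams = ngram_counts(corpus, 2)
pruned = prune(bigrams, min_count=2)  # hypothetical threshold

print(len(bigrams), "bigrams without a cutoff")
print(len(pruned), "bigrams after pruning")
```

Rare n-grams dominate the long tail, so even a modest threshold discards most distinct entries.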
N-gram Counts & Language Models from the Common Crawl
"The advantages of structured text do not outweigh the extra computing power needed to process them."
-- N-gram Counts & Language Models from the Common Crawl (Buck et al.)
N-gram Counts & Language Models from the Common Crawl
English (23TB), German (1.02TB), Spanish (986GB), French (912GB), Japanese (577GB), Russian (537GB), Polish (334GB), Italian (325GB) ...
Finnish made up only 0.14% of the corpus, yet still yielded a useful 47GB corpus.
42 languages with >10GB
73 languages with >1GB
N-gram & Language Models
Sentence-level deduplication removed about 80% of the English corpus (less for other languages),
in line with Bergsma et al. (2010)
Before preprocessing (English): 23.62 TB
After preprocessing (English): 5.14 TB
(59 billion lines, 975 billion tokens)
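A minimal sketch of exact-match line deduplication, plus the arithmetic behind the "about 80%" figure. Keeping a set of truncated hashes rather than the lines themselves is an assumption of this sketch (SHA-1 is a stand-in hash, and `dedupe_lines` is a hypothetical helper), chosen because storing every line would not fit in memory at web scale.

```python
import hashlib

def dedupe_lines(lines):
    """Yield each distinct line once, in first-seen order."""
    seen = set()
    for line in lines:
        # Store a 64-bit digest instead of the full line to save memory.
        key = hashlib.sha1(line.strip().encode("utf-8")).digest()[:8]
        if key in seen:
            continue
        seen.add(key)
        yield line

lines = [
    "Home | About | Contact",
    "Common Crawl is a non-profit.",
    "Home | About | Contact",
    "Copyright 2015",
    "Copyright 2015",
]
kept = list(dedupe_lines(lines))
toy_removed = 1 - len(kept) / len(lines)

# Fraction removed according to the sizes reported on the slide.
reported_removed = 1 - 5.14 / 23.62

print(f"toy corpus: {toy_removed:.0%} removed")
print(f"reported English corpus: {reported_removed:.1%} removed")
```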
N-gram & Language Models
Substantial improvement in perplexity
N-gram & Language Models
"...even though the web data is quite noisy even limited amounts give improvements."
N-gram & Language Models
- Raw text split by language
- Deduped text split by language
- Resulting language models
GloVe: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
Word vector representations:
king - queen = man - woman
king - man + woman = queen
(produces dimensions of meaning)
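The vector arithmetic can be tried on a toy example. The three-dimensional vectors below are hand-made for illustration (roughly "royalty", "gender", "fruit") and are not real GloVe vectors; `analogy` is a hypothetical helper that picks the nearest remaining word by cosine similarity.

```python
import math

# Hand-made toy vectors: (royalty, gender, fruit). Not trained embeddings.
VECTORS = {
    "king":  (1.0,  1.0, 0.0),
    "queen": (1.0, -1.0, 0.0),
    "man":   (0.0,  1.0, 0.0),
    "woman": (0.0, -1.0, 0.0),
    "apple": (0.0,  0.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c, vectors=VECTORS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("king", "man", "woman")
print(answer)
```

Excluding the three query words from the candidates is the usual convention in analogy evaluation; without it the nearest neighbour is often one of the inputs.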
GloVe: Global Vectors for Word Representation
Trained on non-zero entries of a
global word-word co-occurrence matrix
Populating matrix requires a single pass
Subsequent training is far faster
Training complexity, where |C| is the corpus size:
GloVe = O(|C|^0.8)
On-line window-based (e.g. word2vec) = O(|C|)
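A minimal sketch of the single pass that populates the co-occurrence matrix. Storing only non-zero entries in a dictionary, the window size, and weighting each pair by 1/distance are assumptions of this sketch; `cooccurrence` is a hypothetical helper, not the released GloVe code.

```python
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Build a sparse, symmetric word-word co-occurrence table in one pass."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for distance in range(1, window + 1):
            j = i + distance
            if j >= len(tokens):
                break
            # Nearer context words contribute more than distant ones.
            counts[(word, tokens[j])] += 1.0 / distance
            counts[(tokens[j], word)] += 1.0 / distance
    return dict(counts)

tokens = "the cat sat on the mat".split()
X = cooccurrence(tokens, window=2)

vocab = set(tokens)
print(len(X), "non-zero entries out of", len(vocab) ** 2, "possible")
```

Training then iterates over the non-zero entries only, which is why its cost depends on how many such entries there are rather than on the number of tokens in the corpus.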
GloVe On Various Corpora
- Semantic: "Athens is to Greece as Berlin is to _?"
- Syntactic: "Dance is to dancing as fly is to _?"
GloVe over Big Data
GloVe using 42 billion tokens from Common Crawl outperformed word2vec w/ 100 billion tokens (Google News)
Mix and Match: Word Vectors
- More data, less fine tuning needed
- Best model: mix of all excl. word2vec
Examples of Previous Work
Unsupervised Algorithms
+ N-gram & language models
+ GloVe: Global Vectors for Word Representation
Filtering
+ Web tables for gazetteers
+ Dirt Cheap Web-Scale Parallel Text
+ Extracting US phone numbers
Gazetteers for NER
+ Want the widest variety of topics possible
+ Aim to keep them modern / up to date
+ Capture relationships between similar words
(disambiguation)
Google Sets
Web tables as a source of gazetteers + relations
Querying ["cat"],
returns ["dog", "bird", "horse", "rabbit", ...]
Querying ["cat", "ls"],
returns ["cd", "head", "cut", "vim", ...]
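A rough sketch of how table columns could support this kind of set expansion: find every column that contains all the seed words and rank the other cells by how often they co-occur. The columns below are invented for illustration, and `expand` is a hypothetical helper; this is not how Google Sets was implemented.

```python
from collections import Counter

# Invented table columns standing in for columns extracted from web tables.
COLUMNS = [
    ["cat", "dog", "bird", "horse"],
    ["cat", "dog", "bird", "horse", "rabbit"],
    ["dog", "cat", "horse", "bird"],
    ["cat", "ls", "cd", "head"],
    ["ls", "cat", "cd", "vim"],
    ["paris", "london", "berlin"],
]

def expand(seeds, columns=COLUMNS, top_k=3):
    """Return the top_k words that most often share a column with all seeds."""
    seeds = set(seeds)
    votes = Counter()
    for column in columns:
        cells = set(column)
        if seeds <= cells:  # the column must contain every seed
            votes.update(cells - seeds)
    # Sort by count, then alphabetically, so ties are deterministic.
    return sorted(votes, key=lambda w: (-votes[w], w))[:top_k]

print(expand(["cat"]))
print(expand(["cat", "ls"]))
```

Adding a second seed narrows the matching columns, which is what disambiguates "cat" the animal from `cat` the shell command.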
Web Data Commons Web Tables
Extracted 11.2 billion tables from WARC files,
filtered to keep relational tables via trained classifier
Only 1.3% of the original data was kept,
yet it still remains hugely valuable
Resulting dataset:
11.2 billion tables ⇒ 147 million relational web tables
Web Data Commons Web Tables
Popular column headers: name, title, artist, location, model, manufacturer, country ...
Web-Scale Parallel Text
Dirt Cheap Web-Scale Parallel Text from the Common Crawl (Smith et al.)
Processed all text from URLs of the style:
website.com/[langcode]/
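The URL heuristic can be sketched as follows: mask the language-code path segment, group URLs that become identical, and pair each foreign-language page with its English counterpart. The code set, the regular expression, and the helper names are assumptions for illustration, not the pipeline from Smith et al.

```python
import re
from collections import defaultdict

LANG_CODES = {"en", "fr", "de", "es"}  # illustrative subset of two-letter codes
SEGMENT = re.compile(r"/([a-z]{2})(?=/|$)")  # a two-letter path segment

def language_key(url):
    """Return (masked_url, lang) for the first language segment, else None."""
    for match in SEGMENT.finditer(url):
        lang = match.group(1)
        if lang in LANG_CODES:
            key = url[:match.start(1)] + "*" + url[match.end(1):]
            return key, lang
    return None

def candidate_pairs(urls, target="en"):
    """Pair each non-target page with the target-language page sharing its key."""
    groups = defaultdict(dict)
    for url in urls:
        result = language_key(url)
        if result:
            key, lang = result
            groups[key][lang] = url
    pairs = []
    for versions in groups.values():
        if target in versions:
            for lang, url in sorted(versions.items()):
                if lang != target:
                    pairs.append((url, versions[target]))
    return pairs

urls = [
    "http://website.com/en/about",
    "http://website.com/fr/about",
    "http://website.com/de/about",
    "http://website.com/fr/contact",
    "http://other.org/en/",
]
pairs = candidate_pairs(urls)
for source, target in pairs:
    print(source, "->", target)
```

Matching URLs only yields candidate page pairs; the text on those pages would still need sentence alignment and filtering before being used as parallel data.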
"...nothing more than a set of common two-letter language codes ... [we] mined 32 terabytes ... in just under a day"
Web-Scale Parallel Text
(source = foreign language, target = English)
Web-Scale Parallel Text
Both EuroParl & United Nations are large and well curated parallel texts,
but both have very specific domains & genres.
Web-Scale Parallel Text
"...resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain."
Minimal cleaning & filtering still resulted in a substantial improvement in SMT performance
Manual inspection across three languages:
80% of the data contained good translations
Extracting US Phone Numbers
"Let's use Common Crawl to help match businesses from Yelp's database to the possible web pages for those businesses on the Internet."
Yelp extracted ~748 million US phone numbers from the Common Crawl December 2014 dataset
Regular expression over extracted text (WET files)
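A minimal sketch of the regular-expression step. The pattern below is an illustrative guess at matching common US formats and is not Yelp's actual expression; the sample text uses fictional 555-01xx numbers.

```python
import re

# Optional +1 country code, then a 3-3-4 digit grouping with optional separators.
# Area code and exchange must start with 2-9; no digits may touch either end.
US_PHONE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?([2-9]\d{2})\)?[\s.-]?([2-9]\d{2})[\s.-]?(\d{4})(?!\d)"
)

def extract_phone_numbers(text):
    """Return every match normalised to a bare 10-digit string."""
    return ["".join(m.groups()) for m in US_PHONE.finditer(text)]

sample = (
    "Call us at (415) 555-0134 or +1 212.555.0199. "
    "Order #1234567890123 is not a phone."
)
numbers = extract_phone_numbers(sample)
print(numbers)
```

Normalising to ten digits makes the extracted numbers directly comparable with a business database, regardless of how each page formatted them.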
Extracting US Phone Numbers
Total complexity: 134 lines of Python
Total time: 1 hour (20 × c3.8xlarge)
Total cost: $10.60 (Python using EMR)
Matched against Yelp's database:
- 48% had exact URL matches
- 61% had matching domains
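The difference between an exact URL match and a domain match can be sketched as below. How Yelp normalised URLs is not stated, so the rules here (ignore scheme, a leading `www.`, and a trailing slash) and the helper names are assumptions.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Return (host, path) with scheme, leading 'www.' and trailing '/' ignored."""
    parts = urlsplit(url if "//" in url else "//" + url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host, parts.path.rstrip("/")

def match_level(found_url, known_url):
    """Classify a pair of URLs as 'url', 'domain', or 'none'."""
    found_host, found_path = normalize(found_url)
    known_host, known_path = normalize(known_url)
    if found_host != known_host:
        return "none"
    return "url" if found_path == known_path else "domain"

print(match_level("http://www.example.com/menu/", "https://example.com/menu"))
print(match_level("http://example.com/", "http://example.com/contact"))
print(match_level("http://example.com", "http://other.com"))
```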
WikiReverse
Created by volunteer Ross Fairbanks for fun
Task: Find hyperlinks to Wikipedia from the web
Result: Dataset of over 36 million links
Code and data released online at wikireverse.org
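The core of the task can be sketched with the standard library: parse a page and keep the links whose host is wikipedia.org or one of its subdomains. Working on raw HTML is an assumption of this sketch, and the class and function names are hypothetical; this is not the WikiReverse code.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit

class WikipediaLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at Wikipedia."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        try:
            host = (urlsplit(href).hostname or "").lower()
        except ValueError:
            return  # skip malformed URLs
        if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
            self.links.append(href)

def wikipedia_links(html):
    """Return all Wikipedia links found in an HTML string."""
    parser = WikipediaLinkParser()
    parser.feed(html)
    return parser.links

page = (
    '<p>See <a href="https://en.wikipedia.org/wiki/Web_crawler">crawlers</a>, '
    '<a href="http://example.com/">this site</a>, or '
    '<a href="https://de.wikipedia.org/wiki/Webcrawler">auf Deutsch</a>.</p>'
)
links = wikipedia_links(page)
print(links)
```

Checking the parsed hostname rather than searching for the substring "wikipedia.org" avoids false hits such as `notwikipedia.org` or a query string that merely mentions Wikipedia.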
Common Crawl's Derived Datasets
Large scale web analysis:
and a million more use cases!
Why am I so excited..?
Open data is catching on!
Even playing field for academia and industry
Common Crawl releases their dataset
and brilliant people build on top of it
Challenge: Parser training data
Automatic Acquisition of Training Data for Statistical Parsers (Howlett and Curran, 2008)
Use knowledge base of facts or simple sentences:
"Mozart was born in 1756."
Parse more complex sentences with dependency constraints:
"Wolfgang Amadeus Mozart (baptized Johannes Chrysostomus Wolfgangus Theophilus) was born in Salzburg in 1756, the second survivor out of six children."