Internet Scale Analytics
With Common Crawl
Common Crawl is a non-profit that makes web data freely accessible to anyone
Each crawl archive is billions of pages:
the February crawl archive is 1.9 billion web pages, ~154 terabytes uncompressed
Released totally free, without additional intellectual property restrictions
(hosted on Amazon Public Data Sets)
Common Crawl File Formats
- WARC
+ Raw HTTP response headers
+ Raw HTTP responses
- WAT
+ HTML head data
+ HTTP header fields
+ Extracted links / script tags
- WET
+ Extracted text
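As a rough sketch of what working with these formats looks like, here is how the extracted text in a WET file could be read with the open-source warcio library (warcio and the local file name are assumptions for illustration, not part of the talk):

```python
# Sketch: iterate over the plain-text records of a locally downloaded WET file.
# Assumes the `warcio` package is installed and `example.wet.gz` is a
# hypothetical file fetched from the Common Crawl public dataset.
from warcio.archiveiterator import ArchiveIterator

with open('example.wet.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        # WET files store the extracted text as 'conversion' records
        if record.rec_type == 'conversion':
            url = record.rec_headers.get_header('WARC-Target-URI')
            text = record.content_stream().read().decode('utf-8', errors='replace')
            print(url, len(text))
```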
Origins of Common Crawl
Common Crawl was founded in 2007
by Gil Elbaz (Applied Semantics / Factual)
Google and Microsoft were the powerhouses
Data powers the algorithms in our field
Goal: Democratize and simplify access to
"the web as a dataset"
Tackling the Web as a Dataset
The web is largely unannotated,
so how are people using it?
(a) Use data for unsupervised algorithms / analysis
(b) Filter it into being semi-annotated or annotated
(big data ⇒ filter ⇒ curated smaller dataset)
Web Data at Scale
Analytics
+ Usage of servers, libraries, and metadata
Machine Learning
+ Language models based upon billions of tokens
Filtering and Aggregation
+ Analyzing tables, Wikipedia, and phone numbers
Analytics at Scale
If you have an afternoon and are interested in ...
+ JavaScript library usage
+ HTML / HTML5 usage
+ Web server types and age
You can immediately get started over billions of pages!
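For instance, a minimal sketch of tallying web server types, assuming a locally downloaded WARC file and the warcio library (both assumptions for illustration):

```python
# Sketch: count the 'Server' response header across a WARC file to get a
# rough picture of web server usage. `example.warc.gz` is a hypothetical path.
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

server_counts = Counter()
with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response' and record.http_headers is not None:
            server = record.http_headers.get_header('Server')
            if server:
                # Keep just the product name, e.g. 'nginx/1.9.3' -> 'nginx'
                server_counts[server.split('/')[0].lower()] += 1

print(server_counts.most_common(10))
```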
Analyzing Web Domain Vulns
1) How many websites use Google Analytics (GA)?
2) How much of a user's browsing history is captured by Google Analytics?
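A simplified sketch of how question 1 could be approached, scanning response bodies for references to the Google Analytics tracker; this is illustrative only and is not the methodology behind the numbers below (warcio, the marker strings, and the file path are assumptions):

```python
# Sketch: flag domains whose HTML references the Google Analytics tracker.
# Simplified stand-in for the actual study; paths and markers are assumptions.
from urllib.parse import urlparse

from warcio.archiveiterator import ArchiveIterator

GA_MARKERS = (b'google-analytics.com/ga.js', b'google-analytics.com/analytics.js')

domains_with_ga = set()
with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response':
            continue
        url = record.rec_headers.get_header('WARC-Target-URI')
        body = record.content_stream().read()
        if any(marker in body for marker in GA_MARKERS):
            domains_with_ga.add(urlparse(url).netloc)

print(len(domains_with_ga), 'domains referencing Google Analytics')
```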
Impact of Google Analytics
Top 10k domains: 65.7%
Top 100k domains: 64.2%
Top million domains: 50.8%
WDC Hyperlink Graph
Largest freely available real-world graph dataset:
3.6 billion pages, 128 billion links
Fast and easy analysis using Dato GraphLab on a single EC2 r3.8xlarge instance
(under 10 minutes per PageRank iteration)
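The analysis above used GraphLab; purely as an illustration of the underlying computation, a toy PageRank power iteration over a small sparse link graph might look like this (a numpy/scipy sketch, not the 128-billion-edge pipeline):

```python
# Sketch: PageRank power iteration over a toy link graph, just to illustrate
# the computation that GraphLab runs at 128-billion-edge scale.
import numpy as np
from scipy import sparse

edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]  # (source page, target page)
n = 4
sources, targets = zip(*edges)
adj = sparse.csr_matrix((np.ones(len(edges)), (targets, sources)), shape=(n, n))

# Scale each column by the source page's out-degree so rank is split evenly
out_degree = np.asarray(adj.sum(axis=0)).ravel()
transition = adj.dot(sparse.diags(1.0 / out_degree))

damping = 0.85
rank = np.full(n, 1.0 / n)
for _ in range(20):  # a handful of iterations converges on a toy graph
    rank = damping * transition.dot(rank) + (1 - damping) / n

print(rank)  # page 2 collects the most rank in this toy graph
```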
Web Data at Scale
Analytics
+ Usage of servers, libraries, and metadata
Machine Learning
+ Language models based upon billions of tokens
Filtering and Aggregation
+ Analyzing tables, Wikipedia, and phone numbers
N-gram Counts & Language Models from the Common Crawl
Christian Buck, Kenneth Heafield, Bas van Ooyen
N-grams = How many times did a phrase appear?
Processed all the text of Common Crawl to produce 975 billion deduplicated tokens
Google N-gram Dataset (Web 1T) consists of
1 trillion tokens
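To make "how many times did a phrase appear?" concrete, here is a toy n-gram counter (nothing like the paper's distributed pipeline):

```python
# Sketch: count word trigrams in a stream of text lines.
from collections import Counter

def ngrams(tokens, n=3):
    """Yield consecutive n-token phrases."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

counts = Counter()
for line in ["the cat sat on the mat", "the cat sat down"]:
    counts.update(ngrams(line.split()))

print(counts.most_common(3))  # ('the', 'cat', 'sat') appears twice
```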
N-gram Counts & Language Models from the Common Crawl
Improvement over Google N-grams (2006):
- Inclusion of low-count entries
- Deduplication to reduce boilerplate
"Google has shared a deduplicated version ...
in limited contexts, but it was never publicly released."
-- N-gram Counts & Language Models from the Common Crawl (Buck et al.)
N-gram Counts & Language Models from the Common Crawl
English (23TB), German (1.02TB), Spanish (986GB), French (912GB), Japanese (577GB), Russian (537GB), Polish (334GB), Italian (325GB) ...
Only 0.14% of the corpus was Finnish, yet it still yielded a useful corpus of 47GB.
42 languages with >10GB
73 languages with >1GB
N-gram & Language Models
"...even though the web data is quite noisy even limited amounts give improvements."
N-gram & Language Models
Project data was released at
http://statmt.org/ngrams
- Raw text split by language
- Deduped text split by language
- Resulting language models
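The released language models can be queried with KenLM; a minimal sketch, assuming the kenlm Python bindings and a hypothetical local copy of one of the models:

```python
# Sketch: score sentences with a downloaded Common Crawl language model.
# Assumes the `kenlm` bindings are installed and `en.binary` is a hypothetical
# local copy of one of the released models.
import kenlm

model = kenlm.Model('en.binary')

for sentence in ["this is a perfectly ordinary sentence",
                 "sentence ordinary perfectly a is this"]:
    # score() returns log10 probability; higher (less negative) is more fluent
    print(model.score(sentence, bos=True, eos=True), sentence)
```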
GloVe: Global Vectors for Word Representation
Jeffrey Pennington, Richard Socher, Christopher D. Manning
Word vector representations:
king - queen = man - woman
king - man + woman = queen
(produces dimensions of meaning)
GloVe On Various Corpora
- Semantic: "Athens is to Greece as Berlin is to _?"
- Syntactic: "Dance is to dancing as fly is to _?"
GloVe over Big Data
GloVe and word2vec (a competing algorithm) can scale to hundreds of billions of tokens
Best of all: the performance keeps improving
Source code and pre-trained models at
http://www-nlp.stanford.edu/projects/glove/
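A minimal sketch of the analogy arithmetic over the pre-trained vectors, assuming a local copy of one of the published GloVe text files (one word followed by its vector per line; the filename below is an assumption):

```python
# Sketch: 'king - man + woman ~= queen' with pre-trained GloVe vectors.
# Assumes a local copy of a released vector file such as glove.6B.300d.txt.
import numpy as np

vectors = {}
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        word, *values = line.rstrip().split(' ')
        vectors[word] = np.array(values, dtype=np.float32)

def nearest(query, exclude, topn=3):
    """Return the words whose vectors are most cosine-similar to `query`."""
    sims = {
        w: float(np.dot(v, query) / (np.linalg.norm(v) * np.linalg.norm(query)))
        for w, v in vectors.items() if w not in exclude
    }
    return sorted(sims, key=sims.get, reverse=True)[:topn]

query = vectors['king'] - vectors['man'] + vectors['woman']
print(nearest(query, exclude={'king', 'man', 'woman'}))  # 'queen' should rank highly
```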
Web-Scale Parallel Text
Dirt Cheap Web-Scale Parallel Text from the Common Crawl (Smith et al.)
Processed all text from URLs of the style:
website.com/[langcode]/
[w.com/en/tesla | w.com/fr/tesla]
"...nothing more than a set of common two-letter language codes ... [we] mined 32 terabytes ... in just under a day"
Web-Scale Parallel Text
(source = foreign language, target = English)
Web-Scale Parallel Text
Both EuroParl & United Nations are large and well curated parallel texts,
but both have very specific domains & genres.
Web-Scale Parallel Text
"...resulting in improvements of up to 1.5 BLEU on standard test sets, and 5 BLEU on test sets outside of the news domain."
Minimal cleaning & filtering still resulted in a substantial improvement in SMT performance
Manual inspection across three languages:
80% of the data contained good translations
Web Data at Scale
Analytics
+ Usage of servers, libraries, and metadata
Machine Learning
+ Language models based upon billions of tokens
Filtering and Aggregation
+ Analyzing tables, Wikipedia, and phone numbers
Gazetteers via "Google Sets"
Idea: Web tables for gazetteers + relations
Querying ["cat"],
returns ["dog", "bird", "horse", "rabbit", ...]
Querying ["cat", "ls"],
returns ["cd", "head", "cut", "vim", ...]
Web Data Commons Web Tables
Extracted 11.2 billion tables from WARC files, filtered to keep relational tables via a trained classifier
Only 1.3% of the original data was kept, yet it remains hugely valuable
Resulting dataset:
11.2 billion tables ⇒ 147 million relational web tables
Web Data Commons Web Tables
Popular column headers: name, title, artist, location, model, manufacturer, country ...
Released at webdatacommons.org/webtables/
Extracting US Phone Numbers
"Let's use Common Crawl to help match businesses from Yelp's database to the possible web pages for those businesses on the Internet."
Yelp extracted ~748 million US phone numbers from the Common Crawl December 2014 dataset
Regular expression over extracted text (WET files)
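Not Yelp's actual code, just the flavour of it: a simple US phone number regex of the kind one might run over WET-extracted text (the regex itself is an assumption):

```python
# Sketch: the kind of regex one might run over WET-extracted text to pull
# US-style phone numbers. Simplified stand-in for Yelp's actual pipeline.
import re

PHONE = re.compile(r'\(?\b[2-9]\d{2}\)?[ .-]?\d{3}[ .-]?\d{4}\b')

sample_text = "Call us at (415) 555-0198 or 212-555-0147 for reservations."
print(PHONE.findall(sample_text))  # ['(415) 555-0198', '212-555-0147']
```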
Extracting US Phone Numbers
Total complexity: 134 lines of Python
Total time: 1 hour (20 × c3.8xlarge)
Total cost: $10.60 USD (Python on EMR)
Matched against Yelp's database:
- 48% had exact URL matches
- 61% had matching domains
More details (and full code) on Yelp's blog post:
Analyzing the Web For the Price of a Sandwich
Common Crawl's Derived Datasets
Natural language processing:
- Parallel text for machine translation
- N-gram & language models (975 bln tokens)
- WDC Web tables (3.5 bln)
Large-scale web analysis:
- WDC Hyperlink Graphs (128 billion edges)
- WikiReverse.org: Wikipedia in-links analysis
and a million more use cases!
Why am I so excited..?
Open data is catching on!
Even playing field for academia and industry
- Baidu used Common Crawl for Deep Speech
- Google Web 1T ⇒ Buck et al.'s N-grams
- Google's Wikilinks ⇒ WikiReverse
- Google's Sets ⇒ WDC Web Tables
Common Crawl releases their dataset
and brilliant people build on top of it
Read more at
commoncrawl.org