Experiments in
web scale data

Big Data Beers
November 17, 2014

We're a non-profit that makes

web data

accessible to anyone

Each crawl archive is billions of pages...

September crawl archive was

2.98 billion web pages

~220 terabytes uncompressed

Released

totally free

Lives on AWS S3 (Public Data Sets)

We're the largest public user of

What makes us most excited?

Experiments!

The Formats of the Crawl Archive

WARC
+ raw HTTP response headers
+ raw HTTP responses

WAT (1/3rd size of WARC)
+ HTML head data
+ HTTP header fields
+ Extracted links / script tags

WET (1/10th size of WARC)
+ Extracted text

The WARC in Detail

WARC = Web ARChive

Wrappers exist for many languages
[Python, Java, Go, ...]

If they don't, it's still just gzip!

Random Access Archives w/ WARC

gzip spec allows for gzip files to be stuck together

gzip + gzip + ... + gzip + gzip

Why are we interested in this?

☑ Re-uses existing data format + tools

☑ Advantages of per object compression (10% larger than full gzip)

☑ Partition into optimal sized collections (Hadoop / S3)

☑ Allow random access to individual objects

Random Access Archives w/ WARC

In an optimal world,
everything would be in RAM

For one dollar an hour...

EC2 RAM = 87 GB

EC2 Disk = 10,685 GB

S3 (standard) = 24,000 GB

S3 (reduced) = 31,000 GB

We can get most random access advantages w/ S3
(see: byte range requests)

Derived Datasets

Web Data Commons (WDC)
Hyperlink graph
Web tables
RDFa, Microdata, & Microformat

N-gram counts and language models from the CommonCrawl

GloVe: Global Vectors for Word Representation

Example Experiments

Nature of the web with a single machine:
+ Analysing the hyperlink graph

Using a full cluster to perform larger analysis:
+ Recreating Google Sets using web tables
+ The impact of Google Analytics

Hyperlink Graph Experiment

Imagine you wanted to analyze how pages link to each other across the web

Generate the hyperlink graph

Compute PageRank over it

Generating the Graph

Which data format? WAT [metadata]

All links [link text, URL, type (a href, img, CSS, JS)])

Map: output each (a → b) hyperlink

Reduce: remove / count duplicates

Generating the Graph

Key question: how long and how much?

Approx $30 (spot instances) for 1000 instance hours
(100 m2.xlarge machines for 10 hours)

Resulting dataset size:

Host level (x.com): ~10GB
101 million nodes + 2 billion edges

Page level (x.com/y/z.html): ~500GB
3.6 billion nodes + 128 billion edges

Or avoid all of this..!

Web Data Commons Hyperlink Graph

2012: 3.6 billion pages, 128 billion links
April 2014: 1.7 billion pages, 64 billion links

They've performed in-depth analysis as well:
http://webdatacommons.org/hyperlinkgraph/

Calculating PageRank

PageRank can require lots of resources to compute when processing non-trivial graphs

Hadoop is bad for iterative algorithms
(Spark offers a potential solution, but ...)

Running many compute nodes is expensive and computationally complex

FlashGraph

Optimized graph processing engine using SSDs

Comes with various algorithms:
PageRank, weakly connected components, etc...

Performs as well as a cluster using a single machine

(full details: "FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs")

Performance of FlashGraph

Sometimes scaling up >> scaling out

PageRank on the 3.6 billion page graph takes

under 75 minutes

on a powerful single machine*

* Very powerful -- larger than any EC2

Processing with FlashGraph

When running domain level PageRank,

it takes longer to load than to process!

PageRank on 100 million nodes + 2 billion edges:
3 minutes

Total cost:
Well under a Euro
(EC2 spot instances!)

Example Experiments

Nature of the web with a single machine:
+ Analysing the hyperlink graph

Using a full cluster to perform larger analysis:
+ Recreating Google Sets using web tables
+ The impact of Google Analytics

Recreating Google Sets

Task: Given a word or set of words, return words related to them

Querying ["cat"],

returns ["dog", "bird", "horse", "rabbit", ...]

Querying ["cat", "ls"],

returns ["cd", "head", "cut", "vim", ...]

Extracting the Tables

Data Format: WARC (need raw HTML)

Map:
Extract tables [column names + contents] from the HTML pages

Reduce:
Remove duplicates / tables with too few entries

Filter:
Only keep those that appear "relational"

Example: WDC Web Tables

Filtered from 11.2 billion tables via trained classifier

Only 1.3% of the original data was kept,
yet it still remains hugely valuable!

147 million relational web tables from 2012 corpus

Popular column headers: name, price, date, title, artist, size, location, model, rating, manufacturer, country ...

Example: WDC Web Tables

Example Experiments

Nature of the web with a single machine:
+ Analysing the hyperlink graph

Using a full cluster to perform larger analysis:
+ Recreating Google Sets using web tables
+ The impact of Google Analytics

Measuring the Impact:

1) How many websites is Google Analytics (GA) on?

2) How much of a user's browsing history does GA capture?

Insight for GA Analysis

Referrers allow for easy web tracking when done at Google's scale!

No information
!GA → !GA
Full information
!GA → GA
GA → !GA → GA

Google only needs one in every two links to have GA in order to have your full browsing path

Method for GA Analysis

Google Analytics count: ".google-analytics.com/ga.js"

www.winradio.net.au NoGA	1
www.winrar.com.cn GA	6
www.winratzart.com GA	1
www.winrenner.ch GA	244

Generate link graph

domainA.com -> domainB.com	<total times>

cnet-cnec-driver.softutopia.com -> www.softutopia.com	24

Merge link graph & GA count

Method for GA Analysis

Use WARC format to find specific HTML script fragments for identifying if a site has GA*

Use WAT format for generating the link graph and total link count*

* For most JS analysis, the WAT file contains the name of the JS file linked to
** WDC Hyperlink Graph does not contain counts

Results of GA Analysis

29.96% of 48 million domains have GA
(top million domains was 50.8%)

48.96% of 42 billion hyperlinks leaked info to GA

That means that:
1 in every 2 hyperlinks leak information to Google

Wider Impact of GA Analysis

Wide Reaching Possibilities

These experiments have been done before:
they give you an idea of what's possible

Find your own insight or question, then

follow it!

Using Wikipedia to annotate the web

Wikipedia: "small" but well annotated resource
How can we use that for the rest of the web?

Use Wikipedia annotated data
[Brahms, German composer, born on 7 May 1833]
to train NLP parsers using text from the web

Connect pages or domains that have similar linking patterns (implies they cover the same topics)

Who Links to Wikipedia?

Find all pages that link to Wikipedia across the web
including the text used to link to Wikipedia

Task: Aggregate the link text and URL

Even with Wikipedia's referrer log,
this is a complex task...

Who Links to Wikipedia?

Data Format: WAT [links + link text]

Map:
For each link, check for *.wikipedia.org
If found, export
Key=(wiki URL), Value=(page URL, link text)

Reduce:
Key=(wiki URL), Value=[(URL, text), ...]

Challenge Yourself

There's an amazing dataset at your fingertips,

and getting started has never been simpler!

Get started with the:

Java starter kit for Hadoop

Python starter kit using mrjob

Check out commoncrawl.org

Download the Python or Java starter kits

Stephen Merity
stephen@commoncrawl.org

smerity.com / @smerity

If You Know a Lecturer...

Universities should teach big data courses with ...

big data

English Wikipedia

= 9.85 GB compressed
= 44 GB uncompressed
If "big data" isn't bigger than your phone's storage ...

Experiments inweb scale data

Big Data BeersNovember 17, 2014

We're a non-profit that makes

web data

accessible to anyone

Each crawl archive is billions of pages...

September crawl archive was

2.98 billion web pages

~220 terabytes uncompressed

Released

totally free

Lives on AWS S3 (Public Data Sets)

We're the largest public user of

What makes us most excited?

Experiments!

The Formats of the Crawl Archive

WARC+ raw HTTP response headers+ raw HTTP responses WAT (1/3rd size of WARC)+ HTML head data+ HTTP header fields+ Extracted links / script tags WET (1/10th size of WARC)+ Extracted text

The WARC in Detail

WARC = Web ARChiveWrappers exist for many languages[Python, Java, Go, ...]

If they don't, it's still just gzip!

Random Access Archives w/ WARC

gzip spec allows for gzip files to be stuck together

gzip + gzip + ... + gzip + gzip

Why are we interested in this?

☑ Re-uses existing data format + tools

☑ Advantages of per object compression (10% larger than full gzip)

☑ Partition into optimal sized collections (Hadoop / S3)

☑ Allow random access to individual objects

Random Access Archives w/ WARC

In an optimal world,everything would be in RAM

For one dollar an hour... EC2 RAM = 87 GB EC2 Disk = 10,685 GB S3 (standard) = 24,000 GB S3 (reduced) = 31,000 GB

We can get most random access advantages w/ S3(see: byte range requests)

Derived Datasets

Web Data Commons (WDC)Hyperlink graphWeb tablesRDFa, Microdata, & Microformat N-gram counts and language models from the CommonCrawl GloVe: Global Vectors for Word Representation

Example Experiments

Nature of the web with a single machine:+ Analysing the hyperlink graph Using a full cluster to perform larger analysis:+ Recreating Google Sets using web tables+ The impact of Google Analytics

Hyperlink Graph Experiment

Imagine you wanted to analyze how pages link to each other across the web

Generate the hyperlink graph Compute PageRank over it

Generating the Graph

Which data format? WAT [metadata]

All links [link text, URL, type (a href, img, CSS, JS)]) Map: output each (a → b) hyperlink

Reduce: remove / count duplicates

Generating the Graph

Key question: how long and how much?Approx $30 (spot instances) for 1000 instance hours(100 m2.xlarge machines for 10 hours)Resulting dataset size: Host level (x.com): ~10GB101 million nodes + 2 billion edges Page level (x.com/y/z.html): ~500GB3.6 billion nodes + 128 billion edges

Or avoid all of this..!

Web Data Commons Hyperlink Graph

2012: 3.6 billion pages, 128 billion linksApril 2014: 1.7 billion pages, 64 billion links

They've performed in-depth analysis as well:http://webdatacommons.org/hyperlinkgraph/

Calculating PageRank

PageRank can require lots of resources to compute when processing non-trivial graphs Hadoop is bad for iterative algorithms(Spark offers a potential solution, but ...) Running many compute nodes is expensive and computationally complex

FlashGraph

Optimized graph processing engine using SSDs

Comes with various algorithms:PageRank, weakly connected components, etc...Performs as well as a cluster using a single machine

(full details: "FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs")

Performance of FlashGraph

Sometimes scaling up >> scaling out

PageRank on the 3.6 billion page graph takes

under 75 minutes

on a powerful single machine** Very powerful -- larger than any EC2

Processing with FlashGraph

When running domain level PageRank,

it takes longer to load than to process!PageRank on 100 million nodes + 2 billion edges:3 minutes

Total cost:Well under a Euro(EC2 spot instances!)

Example Experiments

Nature of the web with a single machine:+ Analysing the hyperlink graph Using a full cluster to perform larger analysis:+ Recreating Google Sets using web tables+ The impact of Google Analytics

Recreating Google Sets

Task: Given a word or set of words, return words related to them

Querying ["cat"],

returns ["dog", "bird", "horse", "rabbit", ...]

Querying ["cat", "ls"],

returns ["cd", "head", "cut", "vim", ...]

Extracting the Tables

Data Format: WARC (need raw HTML)Map:Extract tables [column names + contents] from the HTML pages

Reduce:Remove duplicates / tables with too few entries

Filter:Only keep those that appear "relational"

Example: WDC Web Tables

Filtered from 11.2 billion tables via trained classifier

Only 1.3% of the original data was kept,yet it still remains hugely valuable!

147 million relational web tables from 2012 corpusPopular column headers: name, price, date, title, artist, size, location, model, rating, manufacturer, country ...

Experiments in
web scale data

Big Data Beers
November 17, 2014

WARC
+ raw HTTP response headers
+ raw HTTP responses

WAT (1/3rd size of WARC)
+ HTML head data
+ HTTP header fields
+ Extracted links / script tags

WET (1/10th size of WARC)
+ Extracted text

WARC = Web ARChive

Wrappers exist for many languages
[Python, Java, Go, ...]

In an optimal world,
everything would be in RAM

For one dollar an hour...

EC2 RAM = 87 GB

EC2 Disk = 10,685 GB

S3 (standard) = 24,000 GB

S3 (reduced) = 31,000 GB

We can get most random access advantages w/ S3
(see: byte range requests)

Web Data Commons (WDC)
Hyperlink graph
Web tables
RDFa, Microdata, & Microformat

N-gram counts and language models from the CommonCrawl

GloVe: Global Vectors for Word Representation

Nature of the web with a single machine:
+ Analysing the hyperlink graph

Using a full cluster to perform larger analysis:
+ Recreating Google Sets using web tables
+ The impact of Google Analytics

Generate the hyperlink graph

Compute PageRank over it

All links [link text, URL, type (a href, img, CSS, JS)])

Map: output each (a → b) hyperlink

Key question: how long and how much?

Approx $30 (spot instances) for 1000 instance hours
(100 m2.xlarge machines for 10 hours)

Resulting dataset size:

Host level (x.com): ~10GB
101 million nodes + 2 billion edges

Page level (x.com/y/z.html): ~500GB
3.6 billion nodes + 128 billion edges

2012: 3.6 billion pages, 128 billion links
April 2014: 1.7 billion pages, 64 billion links

They've performed in-depth analysis as well:
http://webdatacommons.org/hyperlinkgraph/

PageRank can require lots of resources to compute when processing non-trivial graphs

Hadoop is bad for iterative algorithms
(Spark offers a potential solution, but ...)

Running many compute nodes is expensive and computationally complex

Comes with various algorithms:
PageRank, weakly connected components, etc...

Performs as well as a cluster using a single machine

on a powerful single machine*

* Very powerful -- larger than any EC2

it takes longer to load than to process!

PageRank on 100 million nodes + 2 billion edges:
3 minutes

Total cost:
Well under a Euro
(EC2 spot instances!)

Nature of the web with a single machine:
+ Analysing the hyperlink graph

Using a full cluster to perform larger analysis:
+ Recreating Google Sets using web tables
+ The impact of Google Analytics

Data Format: WARC (need raw HTML)

Map:
Extract tables [column names + contents] from the HTML pages

Reduce:
Remove duplicates / tables with too few entries

Filter:
Only keep those that appear "relational"

Only 1.3% of the original data was kept,
yet it still remains hugely valuable!

147 million relational web tables from 2012 corpus

Popular column headers: name, price, date, title, artist, size, location, model, rating, manufacturer, country ...

Nature of the web with a single machine:
+ Analysing the hyperlink graph

Using a full cluster to perform larger analysis:
+ Recreating Google Sets using web tables
+ The impact of Google Analytics

Referrers allow for easy web tracking when done at Google's scale!

No information
!GA → !GA
Full information
!GA → GA
GA → !GA → GA

domainA.com -> domainB.com <total times>

cnet-cnec-driver.softutopia.com -> www.softutopia.com 24

* For most JS analysis, the WAT file contains the name of the JS file linked to
** WDC Hyperlink Graph does not contain counts

29.96% of 48 million domains have GA
(top million domains was 50.8%)

That means that:
1 in every 2 hyperlinks leak information to Google

These experiments have been done before:
they give you an idea of what's possible

Find all pages that link to Wikipedia across the web
including the text used to link to Wikipedia

Task: Aggregate the link text and URL

Even with Wikipedia's referrer log,
this is a complex task...