Experiments in
web scale data

Big Data Beers
November 17, 2014

We're a non-profit that makes

web data

accessible to anyone

Each crawl archive is billions of pages...

September crawl archive was

2.98 billion web pages

~220 terabytes uncompressed


totally free

Lives on AWS S3 (Public Data Sets)

We're the largest public user of

What makes us most excited?


The Formats of the Crawl Archive

  • WARC
    + raw HTTP response headers
    + raw HTTP responses

  • WAT (1/3rd size of WARC)
    + HTML head data
    + HTTP header fields
    + Extracted links / script tags

  • WET (1/10th size of WARC)
    + Extracted text

The WARC in Detail

WARC = Web ARChive

Wrappers exist for many languages
[Python, Java, Go, ...]

If they don't, it's still just gzip!

Random Access Archives w/ WARC

gzip spec allows for gzip files to be stuck together

gzip + gzip + ... + gzip + gzip

Why are we interested in this?

 Re-uses existing data format + tools

 Advantages of per-object compression (10% larger than a single full-file gzip)

 Partition into optimal sized collections (Hadoop / S3)

 Allow random access to individual objects
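
The gzip + gzip + ... + gzip idea can be tried with nothing but the Python standard library. This is a minimal sketch, not how the crawl archives are actually written: the record contents and the in-memory (offset, length) index are made up for illustration.

```python
import gzip
import io


def build_archive(records):
    """Compress each record as its own gzip member and concatenate them.

    Returns the archive bytes plus an index of (offset, length) per record.
    """
    archive = io.BytesIO()
    index = []
    for record in records:
        member = gzip.compress(record)
        index.append((archive.tell(), len(member)))
        archive.write(member)
    return archive.getvalue(), index


def read_record(archive, offset, length):
    """Decompress a single record without touching the rest of the archive."""
    return gzip.decompress(archive[offset:offset + length])


# Illustrative records only; real WARC records carry headers and payloads.
records = [b"page one", b"page two", b"page three"]
archive, index = build_archive(records)

# Existing tools still see one valid gzip stream containing every record.
assert gzip.decompress(archive) == b"".join(records)

# Random access: jump straight to the second record via its offset.
offset, length = index[1]
print(read_record(archive, offset, length))
```

The trade-off is the one listed above: each member is compressed separately, so the archive is somewhat larger, but any record can be read on its own.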

Random Access Archives w/ WARC

In an optimal world,
everything would be in RAM

For one dollar an hour...
  • EC2 RAM      = 87 GB
  • EC2 Disk        = 10,685 GB
  • S3 (standard) = 24,000 GB
  • S3 (reduced)  = 31,000 GB

We can get most random access advantages w/ S3
(see: byte range requests)
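
A byte range request is an ordinary HTTP request with a Range header. The sketch below only builds the request and does not send it; the URL, offset, and length are hypothetical placeholders, and in practice the offset and length would come from an index of record positions.

```python
import urllib.request


def range_request(url, offset, length):
    """Build an HTTP request for `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive at both ends, hence the -1.
    """
    last = offset + length - 1
    request = urllib.request.Request(url)
    request.add_header("Range", f"bytes={offset}-{last}")
    return request


# Hypothetical URL and record position, for illustration only.
req = range_request("https://example.com/hypothetical/archive.warc.gz", 1024, 512)
print(req.get_header("Range"))  # bytes=1024-1535
```

Because each record is its own gzip member, the bytes returned for such a range could be decompressed on their own.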

Derived Datasets

Example Experiments

  • Nature of the web with a single machine:
    + Analysing the hyperlink graph

  • Using a full cluster to perform larger analysis:
    + Recreating Google Sets using web tables
    + The impact of Google Analytics

Hyperlink Graph Experiment

Imagine you wanted to analyze how pages link to each other across the web

  • Generate the hyperlink graph

  • Compute PageRank over it

Generating the Graph

Which data format? WAT [metadata]

All links
[link text, URL, type (a href, img, CSS, JS)]

Map: output each (a → b) hyperlink

Reduce: remove / count duplicates
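
The map and reduce steps can be sketched in plain Python, without Hadoop. This is an illustration under assumptions: the input is simplified to a page URL plus a list of link URLs (a real job would first pull these out of the WAT metadata), and the example domains are invented. It produces host-level edges.

```python
from collections import Counter
from urllib.parse import urlsplit


def map_links(page_url, link_urls):
    """Map: emit one (source host, target host) pair per hyperlink."""
    source = urlsplit(page_url).hostname
    for link in link_urls:
        target = urlsplit(link).hostname
        if source and target:
            yield (source, target)


def reduce_links(pairs):
    """Reduce: collapse duplicate edges into a single edge with a count."""
    return Counter(pairs)


# Invented pages and links, standing in for parsed WAT records.
pages = {
    "http://a.example/index.html": ["http://b.example/x", "http://b.example/y"],
    "http://b.example/x": ["http://a.example/index.html"],
}
edges = reduce_links(
    edge for url, links in pages.items() for edge in map_links(url, links)
)
for (source, target), count in sorted(edges.items()):
    print(f"{source} -> {target}\t{count}")
```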

Generating the Graph

Key question: how long and how much?

Approx $30 (spot instances) for 1000 instance hours
(100 m2.xlarge machines for 10 hours)

Resulting dataset size:
  • Host level (x.com): ~10GB
    101 million nodes + 2 billion edges
  • Page level (x.com/y/z.html): ~500GB
    3.6 billion nodes + 128 billion edges

Or avoid all of this..!

Web Data Commons Hyperlink Graph

2012: 3.6 billion pages, 128 billion links
April 2014: 1.7 billion pages, 64 billion links

They've performed in-depth analysis as well:

Calculating PageRank

PageRank can require lots of resources to compute when processing non-trivial graphs

  • Hadoop is bad for iterative algorithms
    (Spark offers a potential solution, but ...)

  • Running many compute nodes is expensive and computationally complex
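
To see why PageRank is iterative, here is a toy in-memory version using power iteration. It is a sketch for intuition only and not how any engine mentioned in this talk implements it; the damping factor of 0.85, the iteration count, and the four-node graph are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Toy PageRank by power iteration over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Invented graph: a -> b -> c -> a, plus d -> c.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

Every iteration touches every edge, which is why repeated passes over a graph with billions of edges suit engines built for iteration better than plain Hadoop jobs.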


Optimized graph processing engine using SSDs

Comes with various algorithms:
PageRank, weakly connected components, etc...

Performs as well as a cluster using a single machine

(full details: "FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs")

Performance of FlashGraph

Sometimes scaling up >> scaling out

PageRank on the 3.6 billion page graph takes

under 75 minutes

on a powerful single machine*

* Very powerful -- larger than any EC2 instance

Processing with FlashGraph

When running domain level PageRank,

it takes longer to load than to process!

PageRank on 100 million nodes + 2 billion edges:
3 minutes

Total cost:
Well under a Euro
(EC2 spot instances!)

Example Experiments

  • Nature of the web with a single machine:
    + Analysing the hyperlink graph

  • Using a full cluster to perform larger analysis:
    + Recreating Google Sets using web tables
    + The impact of Google Analytics

Recreating Google Sets

Task: Given a word or set of words, return words related to them

Querying ["cat"],

returns ["dog", "bird", "horse", "rabbit", ...]

Querying ["cat", "ls"],

returns ["cd", "head", "cut", "vim", ...]

Extracting the Tables

Data Format: WARC (need raw HTML)

Extract tables [column names + contents] from the HTML pages

Remove duplicates / tables with too few entries

Only keep those that appear "relational"
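
Once the tables are extracted, one plausible way to answer a Google Sets style query is column co-occurrence: words that often share a column with the query words are likely related. The talk does not spell out the scoring, so the approach, the function name, and the sample columns below are all assumptions.

```python
from collections import Counter


def related_words(columns, query, top_n=5):
    """Rank words by how often they share a column with every query word."""
    query = set(query)
    scores = Counter()
    for column in columns:
        cells = set(column)
        if query <= cells:  # the column contains all query words
            scores.update(cells - query)
    return [word for word, _ in scores.most_common(top_n)]


# Invented table columns, standing in for extracted web tables.
columns = [
    ["cat", "dog", "bird"],
    ["cat", "dog", "horse"],
    ["dog", "cat", "rabbit"],
    ["cat", "ls", "cd", "head"],
    ["ls", "cat", "cd", "vim"],
]
print(related_words(columns, ["cat"], top_n=1))
print(related_words(columns, ["cat", "ls"], top_n=1))
```

Adding "ls" to the query narrows the matching columns to those about shell commands, which is the disambiguation effect shown in the example queries.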

Example: WDC Web Tables

Filtered from 11.2 billion tables via trained classifier 

Only 1.3% of the original data was kept,
yet it still remains hugely valuable!

147 million relational web tables from 2012 corpus

Popular column headers: name, price, date, title, artist, size, location, model, rating, manufacturer, country ...

Example Experiments

  • Nature of the web with a single machine:
    + Analysing the hyperlink graph

  • Using a full cluster to perform larger analysis:
    + Recreating Google Sets using web tables
    + The impact of Google Analytics

Measuring the Impact:

1) How many websites is Google Analytics (GA) on?

2) How much of a user's browsing history does GA capture?

Insight for GA Analysis

Referrers allow for easy web tracking when done at Google's scale!

No information:
!GA → !GA

Full information:
!GA → GA
GA → !GA → GA

Google only needs one in every two links to have GA in order to have your full browsing path

Method for GA Analysis

Google Analytics count: ".google-analytics.com/ga.js"

www.winradio.net.au NoGA	1
www.winrar.com.cn GA	6
www.winratzart.com GA	1
www.winrenner.ch GA	244
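
A count like the one above could come from a simple substring check on each page's raw HTML. This is a hedged sketch: the key format mirrors the sample output, but the function names, domains, and page contents are invented for illustration.

```python
from collections import Counter

# The script fragment searched for, as given above.
GA_MARKER = ".google-analytics.com/ga.js"


def map_page(domain, html):
    """Map: label one page as GA or NoGA and emit a count of one."""
    label = "GA" if GA_MARKER in html else "NoGA"
    return (f"{domain} {label}", 1)


def count_pages(pages):
    """Reduce: sum the counts for each 'domain label' key."""
    counts = Counter()
    for domain, html in pages:
        key, one = map_page(domain, html)
        counts[key] += one
    return counts


# Invented pages; the first two embed the classic ga.js loader fragment.
ga_html = "<script>ga.src = 'http://www' + '.google-analytics.com/ga.js';</script>"
pages = [
    ("shop.example", ga_html),
    ("shop.example", ga_html),
    ("blog.example", "<html><body>No tracking here</body></html>"),
]
for key, count in sorted(count_pages(pages).items()):
    print(f"{key}\t{count}")
```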

Generate link graph

domainA.com -> domainB.com	<total times>
cnet-cnec-driver.softutopia.com -> www.softutopia.com	24

Merge link graph & GA count
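
The merge step can be sketched as a join between the link counts and the set of GA domains. The leak rule here is one reading of the insight slide (a link is visible to GA if either end runs GA); the function names and the sample data are assumptions.

```python
def leaks_to_ga(source_has_ga, target_has_ga):
    """A hop is visible to GA if either end runs GA.

    GA on the target sees the referrer; GA on the source sees the visit,
    and the next GA site along the path sees where the user came from.
    """
    return source_has_ga or target_has_ga


def leaked_fraction(link_counts, ga_domains):
    """Fraction of hyperlinks (weighted by count) that leak to GA."""
    total = leaked = 0
    for (source, target), count in link_counts.items():
        total += count
        if leaks_to_ga(source in ga_domains, target in ga_domains):
            leaked += count
    return leaked / total if total else 0.0


# Invented link graph and GA set, for illustration only.
link_counts = {
    ("a.example", "b.example"): 24,
    ("b.example", "c.example"): 6,
    ("c.example", "d.example"): 10,
}
ga_domains = {"a.example"}
print(leaked_fraction(link_counts, ga_domains))  # 0.6
```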

Method for GA Analysis

Use WARC format to find specific HTML script fragments for identifying if a site has GA*

Use WAT format for generating the link graph and total link count**

* For most JS analysis, the WAT file contains the name of the JS file linked to
** WDC Hyperlink Graph does not contain counts

Results of GA Analysis

29.96% of 48 million domains have GA
(top million domains was 50.8%)

48.96% of 42 billion hyperlinks leaked info to GA

That means that:
1 in every 2 hyperlinks leaks information to Google

Wider Impact of GA Analysis


Wide Reaching Possibilities

These experiments have been done before:
they give you an idea of what's possible

Find your own insight or question, then

follow it!

Using Wikipedia to annotate the web

Wikipedia: "small" but well annotated resource
How can we use that for the rest of the web?

Use Wikipedia annotated data
[Brahms, German composer, born on 7 May 1833]
to train NLP parsers using text from the web

Connect pages or domains that have similar linking patterns (implies they cover the same topics)

Who Links to Wikipedia?

Find all pages that link to Wikipedia across the web
including the text used to link to Wikipedia

Task: Aggregate the link text and URL

Even with Wikipedia's referrer log,
this is a complex task...

Who Links to Wikipedia?

Data Format: WAT [links + link text]

For each link, check for *.wikipedia.org
If found, export
Key=(wiki URL), Value=(page URL, link text)

Key=(wiki URL), Value=[(URL, text), ...]
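
The key/value flow above can be sketched in plain Python. The input is simplified to a page URL plus (link URL, link text) pairs, standing in for what would be read from the WAT link metadata; the function names and sample pages are invented.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def is_wikipedia(url):
    """True for wikipedia.org and any of its subdomains."""
    host = urlsplit(url).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


def map_links(page_url, links):
    """Map: emit Key=(wiki URL), Value=(page URL, link text)."""
    for link_url, link_text in links:
        if is_wikipedia(link_url):
            yield link_url, (page_url, link_text)


def reduce_links(pairs):
    """Reduce: group values into Key=(wiki URL), Value=[(URL, text), ...]."""
    grouped = defaultdict(list)
    for wiki_url, value in pairs:
        grouped[wiki_url].append(value)
    return dict(grouped)


# Invented pages, for illustration only.
brahms = "https://en.wikipedia.org/wiki/Johannes_Brahms"
pages = {
    "http://music.example/romantics": [(brahms, "Brahms"), ("http://other.example/", "other")],
    "http://blog.example/post": [(brahms, "the German composer")],
}
result = reduce_links(
    pair for url, links in pages.items() for pair in map_links(url, links)
)
print(result[brahms])
```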

Challenge Yourself

There's an amazing dataset at your fingertips,

and getting started has never been simpler!

Get started with the:

Java starter kit for Hadoop

Python starter kit using mrjob

Check out commoncrawl.org

Download the Python or Java starter kits

Stephen Merity

If You Know a Lecturer...

Universities should teach big data courses with ...

big data

English Wikipedia

= 9.85 GB compressed
= 44 GB uncompressed
If "big data" isn't bigger than your phone's storage ...
