Measuring the impact: Google Analytics

Stephen Merity

Smerity @ Common Crawl

  • Continuing the crawl
  • Documenting best practices
    • Guides for newcomers to Common Crawl + big data
    • Reference for seasoned veterans
  • Spending many hours blessing and/or cursing Hadoop

University of Sydney '11, Harvard '14
Google Sydney, Grok Learning

I was hoping on creating a tool that will automatically extract some of the most common memes ("But does it run Linux?" and "In Soviet Russia..." style jokes etc) and I needed a corpus - I wrote a primitive (threaded :S) web crawler and started it before I considered robots.txt. I do intensely apologise.

-- Past Smerity (16/12/2007)

Where did all the HTTP referrers go?

Referrers: leaking browsing history

If you click a link on Reddit that leads to SFBike, then SFBike knows you came from Reddit

1) How many websites is Google Analytics (GA) on?

2) How much of a user's browsing history does GA capture?

Top 10k domains:

Top 100k domains:

Top million domains:

It keeps dropping off, but by how much..?

Estimate of captured browsing history...


Referrers allow easy web tracking when done at Google's scale!

No information
!GA → !GA

Full information
!GA → GA
GA → !GA → GA
GA → !GA → GA → !GA → GA → !GA → GA → !GA → GA

Key insight: leaked browsing history

Google only needs one in every two links to have GA in order to have your full browsing path*

*possibly less if link graph + click timing + machine learning used
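The path diagrams above can be sketched in a few lines. This is a minimal illustration, not the talk's code: it assumes a page is known to GA either because it runs GA itself or because the next page in the path runs GA and receives it as the referrer. The function name `pages_known_to_ga` and the site names are hypothetical.

```python
def pages_known_to_ga(path, ga_sites):
    """For each page in a browsing path, report whether GA learns of the visit.

    Assumption: a visit is known if the page runs GA, or if the *next* page
    runs GA and therefore sees this page as the HTTP referrer.
    """
    known = []
    for i, site in enumerate(path):
        runs_ga = site in ga_sites
        next_runs_ga = i + 1 < len(path) and path[i + 1] in ga_sites
        known.append(runs_ga or next_runs_ga)
    return known


# GA -> !GA -> GA: every page in the path is recoverable.
alternating = pages_known_to_ga(["a.example", "b.example", "c.example"],
                                {"a.example", "c.example"})

# !GA -> !GA: nothing is recoverable.
untracked = pages_known_to_ga(["x.example", "y.example"], set())
```

With GA on only every second page, `alternating` is true for all three pages, which is the "one in every two links" observation.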

Estimating leaked browser history

for each link {page A} → {page B}:
    total_links += 1
    if {page A} or {page B} has GA:
        total_leaked += 1

Estimate of leaked browser history is simply:
total_leaked / total_links
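The counting rule above can be made runnable as follows. This is a sketch under the assumption that links arrive as `(source, target)` pairs and that GA usage is available as a set; the names `estimate_leak`, `links` and `ga_pages` are illustrative only.

```python
def estimate_leak(links, ga_pages):
    """Fraction of links where the source or the target page has GA."""
    total_links = 0
    total_leaked = 0
    for source, target in links:
        total_links += 1
        if source in ga_pages or target in ga_pages:
            total_leaked += 1
    # Guard against an empty link list rather than dividing by zero.
    return total_leaked / total_links if total_links else 0.0


# Hypothetical four-link chain where only the first page has GA.
links = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
leak = estimate_leak(links, ga_pages={"a"})
```

Here only the first link touches a GA page, so `leak` is one in four.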

Joint project with Chad Hornbaker* at Harvard IACS

*Best full name ever: Captain Charles Lafforest Hornbaker II

The task

Google Analytics count (per domain: GA or NoGA, then count):
  ""  NoGA  1
      GA    6
      GA    1
      GA    244

Generate link graph ->	<total times> ->	24

Merge link graph & GA count
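The three steps of the task can be sketched on in-memory data. This is not the Hadoop job from the talk, only an illustration of the data flow. Assumptions: GA is detected by looking for the `ga.js` or `analytics.js` script URL in the page HTML, a host counts as GA if any of its pages has GA, and hosts missing from the GA count are treated as NoGA. All function names and example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed detection heuristic: presence of a GA script URL in the HTML.
GA_MARKERS = ("google-analytics.com/ga.js", "google-analytics.com/analytics.js")


def host(url):
    return urlsplit(url).hostname or ""


def ga_count(pages):
    """Step 1: count pages per (host, 'GA' | 'NoGA')."""
    counts = Counter()
    for url, html in pages:
        label = "GA" if any(marker in html for marker in GA_MARKERS) else "NoGA"
        counts[(host(url), label)] += 1
    return counts


def link_graph(links):
    """Step 2: count how many times each source host links to each target host."""
    return Counter((host(src), host(dst)) for src, dst in links)


def merge(graph, counts):
    """Step 3: join the two, returning (leaked links, total links)."""
    ga_hosts = {h for (h, label) in counts if label == "GA"}
    total = leaked = 0
    for (src, dst), times in graph.items():
        total += times
        if src in ga_hosts or dst in ga_hosts:
            leaked += times
    return leaked, total


pages = [
    ("http://a.example/", '<script src="http://www.google-analytics.com/ga.js"></script>'),
    ("http://b.example/", "<p>no tracking</p>"),
    ("http://c.example/x", "<p>plain</p>"),
]
links = [
    ("http://a.example/", "http://b.example/p"),
    ("http://a.example/2", "http://b.example/q"),
    ("http://b.example/", "http://c.example/x"),
]
leaked, total = merge(link_graph(links), ga_count(pages))
```

Keeping the steps separate mirrors the slide: the GA count and the link graph come from different inputs and are only joined at the end.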

Exciting age of open data

Open data
Open tools
Cloud computing

WARC: raw web data

WAT: metadata (links, title, ...) for each page

WET: extracted text

WARC = GA usage
raw web data

WAT = hyperlink graph
metadata (links, title, ...) for each page
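As a rough illustration of pulling hyperlinks out of WAT metadata, the sketch below parses one JSON payload. The nesting (`Envelope`, `Payload-Metadata`, `HTTP-Response-Metadata`, `HTML-Metadata`, `Links`) and the `A@/href` path value reflect my understanding of the WAT layout and should be checked against real files; the sample record and the function name `hyperlinks` are made up for the example.

```python
import json
from urllib.parse import urljoin


def hyperlinks(wat_payload):
    """Return (page_url, absolute_target_url) pairs for <a href> links."""
    envelope = json.loads(wat_payload)["Envelope"]
    page_url = envelope["WARC-Header-Metadata"]["WARC-Target-URI"]
    html_meta = (envelope.get("Payload-Metadata", {})
                 .get("HTTP-Response-Metadata", {})
                 .get("HTML-Metadata", {}))
    found = []
    for link in html_meta.get("Links", []):
        # Keep anchor links only; skip images, scripts and so on.
        if link.get("path") == "A@/href" and "url" in link:
            found.append((page_url, urljoin(page_url, link["url"])))
    return found


# Hypothetical record, trimmed to the fields used above.
sample = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://a.example/index.html"},
        "Payload-Metadata": {"HTTP-Response-Metadata": {"HTML-Metadata": {"Links": [
            {"path": "A@/href", "url": "/about"},
            {"path": "IMG@/src", "url": "logo.png"},
            {"path": "A@/href", "url": "http://b.example/"},
        ]}}},
    }
})
edges = hyperlinks(sample)
```

Relative links are resolved against the page URL so that every edge has a host that can later be matched against the GA count.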

Estimating the task's size

Web Data Commons Hyperlink Graph

Page level: 3.5 billion nodes, 128 billion edges, 331GB compressed

Subdomain level: 101 million nodes, 2 billion edges, 9.2GB compressed

Decided on using subdomains instead of page level

Engineering for scale

  • Use the framework that matches best
  • Debug locally
  • Standard Hadoop optimizations
    (combiner, compression, re-use JVMs...)
  • Many small jobs ≫ one big job
  • Ganglia for metrics & monitoring

Hadoop :'(

Monitoring & metrics with Ganglia

Engineering for cost

  • Avoid Hadoop if it's simple enough
  • Use spot instances everywhere*
  • Avoid EMR if highly cost sensitive
    (Elastic MapReduce = hosted Hadoop)

*Everywhere but the master node!

Juggling spot instances

c1.xlarge goes from $0.58 p/h to $0.064 p/h

EMR: The good, the bad, the ugly

Good: significantly easier, one click setup

Bad: price is insane when using spot instances
(spot = $0.075 with EMR = $0.12)

Guess how many log files for a 100 node cluster?

584,764+ log files.


Cost projection

Best optimized small Hadoop job:
  • 1/177th the dataset in 23 minutes
    (12 c1.xlarge machines + Hadoop master)

Estimated full dataset job:
  • ~210TB for web data + ~90TB for link data
  • ~$60 in EC2 costs (177 hours of spot instances)
  • ~$100 in EMR costs (avoid EMR for cost!)
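The projection can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below makes several assumptions that are not stated on the slides: all 13 machines are billed at the $0.064 per hour spot price (the master should really be on-demand, so this is an underestimate), the $0.12 EMR figure is a fee per machine-hour, and billing is not rounded up to whole hours.

```python
SLICES = 177            # the small job covered 1/177th of the dataset
MINUTES_PER_SLICE = 23  # measured runtime of the small job
MACHINES = 13           # 12 c1.xlarge workers + 1 Hadoop master
SPOT_PRICE = 0.064      # $ per machine-hour (assumed to apply to every machine)
EMR_FEE = 0.12          # $ per machine-hour (assumed meaning of "EMR = $0.12")

cluster_hours = SLICES * MINUTES_PER_SLICE / 60
machine_hours = cluster_hours * MACHINES

ec2_cost = machine_hours * SPOT_PRICE
emr_cost = machine_hours * EMR_FEE

print(f"{machine_hours:.0f} machine-hours, EC2 ${ec2_cost:.2f}, EMR fee ${emr_cost:.2f}")
```

Under these assumptions the totals come out at roughly $56 for EC2 and $106 for the EMR fee, in line with the ~$60 and ~$100 figures above.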

Final results

29.96% of 48 million domains have GA
(top million domains was 50.8%)

48.96% of 42 billion hyperlinks leaked info to GA

That means that

one in every two hyperlinks will leak information to Google

The wider impact

Want Big Open Data?

Web Data

Covers everything at scale!

Processing the web is feasible

Downloading it is a pain!
Common Crawl does that for you

Processing it is scary!
Big data frameworks exist and are (relatively) painless

These experiments are too expensive!
Cloud computing means experiments can be just a few dollars

Get started now..!

Want raw web data?

Want hyperlink graph / web tables / RDFa?

Want example code to get you started?
