AWS at
Common Crawl
We're a non-profit that makes web data accessible to anyone
Each crawl is billions of pages...
August crawl was released
Lives on S3 (Public Data Sets)
We're the largest public user of Nutch
Trivia Time
What major OSS project grew out of Nutch?
Trivia Time
What major OSS project grew out of Nutch? Hadoop!
Crawling Architecture Overview
Standard Hadoop Cluster:
On-demand: One master
Spot Instances: ~60 m2.2xlarge or ~110 m1.xlarge
Generate, Crawl, Export (raw + processed)
2-3 billion pages in 15-20 days
Hadoop is our hammer, handles all the nails
(even if that nail is more like a screw)
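A minimal boto3 sketch of how such a spot worker pool might be requested; the AMI ID, key name, bid price and instance count are placeholders, not Common Crawl's actual configuration.

import boto3

ec2 = boto3.client("ec2")

# Bid for a pool of spot workers to sit behind the single on-demand master.
# The AMI ID, key name, bid price and count are placeholders.
response = ec2.request_spot_instances(
    SpotPrice="0.10",                    # max hourly bid in USD
    InstanceCount=110,                   # e.g. ~110 m1.xlarge workers
    LaunchSpecification={
        "ImageId": "ami-00000000",       # hypothetical Hadoop worker AMI
        "InstanceType": "m1.xlarge",
        "KeyName": "example-key",
    },
)

requests = response["SpotInstanceRequests"]
print("Opened {} spot requests".format(len(requests)))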
Crawling Architecture Overview
Why those instances?
m1.xlarge
- Best all-round deal + usually cheap
(1Gbps network, 1.6TB disk, 15GB RAM, ...)
m2.2xlarge
- Similar specs to m1.xlarge, less bang for the buck
Topics for Advanced AWS
- Random access archives using S3
- Surviving & thriving w/ spot instances
- EMR - the good, the bad, the ugly
Random Access Archives
In an optimal world,
everything would be in RAM
For one dollar an hour...
- EC2 RAM = 87 GB
- EC2 Disk = 10,685 GB
- S3 (standard) = 24,000 GB
- S3 (reduced) = 31,000 GB
How do we get advantages of RAM w/ S3?
Random Access Archives
The gzip spec allows gzip files to be concatenated
gzip + gzip + ... + gzip + gzip
Why are we interested?
☑ Re-uses existing data format + tools
☑ Keeps the advantages of per-object compression (only ~10% larger than one big gzip)
☑ Partitions into optimally sized collections (Hadoop / S3)
☑ Allows random access to individual objects (sketched below)
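A minimal local sketch of the trick using only Python's standard gzip module; the record contents and the in-memory archive are invented purely for illustration.

import gzip

# Each record is compressed as its own gzip member, then the members are
# simply concatenated into one archive.
records = [b"record one", b"record two", b"record three"]
archive = bytearray()
index = []  # (offset, length) pointer per record

for data in records:
    member = gzip.compress(data)
    index.append((len(archive), len(member)))
    archive.extend(member)

# Random access: slice out one member by its pointer and decompress it on
# its own -- no need to read or decompress the rest of the archive.
offset, length = index[1]
print(gzip.decompress(bytes(archive[offset:offset + length])))  # b'record two'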
URL index
5 billion unique URLs x [ len(URL) + len(pointer) ]
= 437 GB
= $13 per month
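Combining an index pointer with an HTTP Range request gives single-record access straight from S3. The sketch below assumes boto3; the bucket, key, offset and length are hypothetical stand-ins for a real index lookup.

import gzip
import boto3

s3 = boto3.client("s3")

# Hypothetical pointer from the URL index: which archive holds the record,
# plus the byte offset and length of its gzip member inside that file.
bucket, key = "example-crawl-bucket", "archive/segment-0001.gz"
offset, length = 1234567, 4096

# Fetch only that member's bytes with an HTTP Range request...
resp = s3.get_object(
    Bucket=bucket,
    Key=key,
    Range="bytes={}-{}".format(offset, offset + length - 1),
)

# ...and decompress it independently of the rest of the archive.
record = gzip.decompress(resp["Body"].read())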
Topics for Advanced AWS
- Random access archives using S3
- Surviving & thriving w/ spot instances
- EMR - the good, the bad, the ugly
Spot Instances
Spot instances are vital but problematic...
- Availability issues
- Cost fluctuations
- Transient failures
Spot Instances
To survive transient failures trivially:
rely on S3 and not HDFS
Solution:
Small jobs + push data to S3 on completion
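A sketch of that pattern, assuming boto3 and a hypothetical results bucket: the work is cut into small segments and each segment's output is pushed to S3 the moment it completes, so a terminated spot instance only loses the segment in flight.

import boto3

s3 = boto3.client("s3")

def process_segment(segment_id):
    """Stand-in for one small unit of work (e.g. crawling one segment)."""
    return "results for segment {}".format(segment_id).encode()

# Keep jobs small and push each result to S3 as soon as it is done, so a
# reclaimed spot instance only costs us the segment currently in flight.
for segment_id in range(10):                      # illustrative segment list
    s3.put_object(
        Bucket="example-results-bucket",          # hypothetical bucket
        Key="output/segment-{:04d}.dat".format(segment_id),
        Body=process_segment(segment_id),
    )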
Spot Instances
If you do need to keep state on HDFS,
rack awareness is vital
Segment your cluster's machines (via a topology script, sketched after this list) according to:
- Region (minimize transfer costs)
- On-demand versus spot instance
- Machine type (prices mostly independent)
- Spot prices
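One way to express that segmentation is Hadoop's topology script hook (topology.script.file.name in Hadoop 1.x): the sketch below maps each host to a "rack" path built from region, market and instance type, read from a hypothetical hosts.tsv file.

#!/usr/bin/env python
"""Topology script sketch: Hadoop calls it with hostnames/IPs as arguments
and expects one "rack" path per argument on stdout."""
import sys

DEFAULT_RACK = "/default-rack"

def load_host_map(path="hosts.tsv"):
    # Hypothetical file format: <host> <region> <market> <instance_type>
    host_map = {}
    with open(path) as f:
        for line in f:
            host, region, market, instance_type = line.split()
            host_map[host] = "/{}/{}/{}".format(region, market, instance_type)
    return host_map

if __name__ == "__main__":
    host_map = load_host_map()
    for host in sys.argv[1:]:
        print(host_map.get(host, DEFAULT_RACK))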
Topics for Advanced AWS
- Random access archives using S3
- Surviving & thriving w/ spot instances
- EMR - the good, the bad, the ugly
What good is data if it's not used?
Accessing & using it needs to be easy
Elastic MapReduce
= Trading money for simplicity
It's far cheaper to spin up your own cluster...
(✖ 100-200% overhead on spot instances)
m1.xlarge = $0.088 for EMR + $0.040 for spot
For contributors, however, we're happy to pay:
their time & energy goes to the task, not the cluster
EMR in Education
Universities should teach big data courses with ...
big data
English Wikipedia
= 9.85 GB compressed
= 44 GB uncompressed
If "big data" isn't bigger than your phone's storage ...
Python + mrjob
☑ Test and develop locally with minimal setup
(no Hadoop installation required)
☑ Uses only Python, with no other languages mixed in
(benefits of Hadoop ecosystem without touching it)
☑ Spin up a cluster in one click
(EMR integration = single-click cluster; see the sketch below)
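A minimal mrjob sketch of that workflow: the same word-count job runs locally with plain Python (python wordcount.py input.txt) and on EMR by adding -r emr, assuming mrjob is installed and AWS credentials are configured.

"""wordcount.py -- run locally: python wordcount.py input.txt
                   run on EMR:  python wordcount.py -r emr input.txt"""
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Emit each word with a count of one.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum counts per word.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()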
Summary
- Random access archives using S3
Allows effective storage of enormous datasets
- Surviving & thriving w/ spot instances
S3 defends us against transient failures
- EMR - the good, the bad, the ugly
Simplifies and lowers barriers to entry
Interested in helping out on something amazing?
Send me a message and/or visit
commoncrawl.org
AWS at
Common Crawl