Common Crawl: Making Python work for big web data
We're a non-profit that makes web data accessible to anyone
Each crawl is billions of pages...
July crawl was...
4 billion web pages
266 terabytes uncompressed
Lives on Amazon S3 (Public Datasets)
What good is data if it's not used?
Accessing & using it needs to be easy
Sexy & simple science
NumPy & SciPy
...but it's less simple when handling big data
Big data = Java ecosystem
(Hadoop, HDFS, ...)
public static void main(String[] args)
Creating a starter kit for Python
Simple-to-use framework for experiments
A Python library for accessing files on S3
mrjob to the rescue (thx Yelp!)
mrjob = trivial Hadoop streaming with Python
☑ Test and develop locally with minimal setup
(no Hadoop installation required)
☑ Use only Python, no other languages mixed in
(benefits of Hadoop ecosystem without touching it)
☑ Spin up a cluster in one click
(EMR integration = single click cluster)
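What does a job look like? A minimal sketch (a hypothetical word count, not the starter kit's mrcc.py):

from mrjob.job import MRJob

class MRWordCount(MRJob):
    # Map: emit (word, 1) for each word on an input line
    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), 1

    # Reduce: sum the counts emitted for each word
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

The same class runs unchanged locally and on EMR; only the command line changes...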
To run locally...
python mrcc.py --conf-path mrjob.conf <in>
To run on Elastic MapReduce...
python mrcc.py -r emr --conf-path mrjob.conf <in>
EMR images all run with Python 2.6
(✖ from collections import Counter, which needs Python 2.7)
Luckily that can be fixed! Python 2.7 installation...
# Many Bothans died to bring us this information
ami_version: 3.0.4
interpreter: python2.7
bootstrap:
- sudo yum install -y python27 python27-devel gcc-c++
- sudo python2.7 get-pip.py
- sudo pip2.7 install boto mrjob simplejson
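With these lines in mrjob.conf, the same -r emr command above bootstraps every EMR node with Python 2.7 before the job starts (so Counter and friends just work)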
Big issue: how do we grab the data from S3?
Python's standard library can't stream-decompress gzip
Pulling files to local storage and then processing them
= terrible performance
We made gzipstream for streaming gzip decompression
gzipstream handles multi-part gzip files
(our web data is stored like this for random access)
gzip + gzip + ... + gzip + gzip
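Usage is a thin wrapper around a boto S3 key. A sketch, assuming the boto and warc packages (the crawl path is a placeholder, not a real key):

import boto
import warc
from gzipstream import GzipStreamFile

# Anonymous connection to the AWS Public Data Sets bucket
conn = boto.connect_s3(anon=True)
bucket = conn.get_bucket('aws-publicdatasets')
key = bucket.get_key('common-crawl/crawl-data/...')  # placeholder path

# Decompress and parse records as bytes arrive from S3: no local copy
for record in warc.WARCFile(fileobj=GzipStreamFile(key)):
    print record.url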
Beautiful but slow :'(
Python is generally slower than Java and C++...
but programmer productivity can be more important
Heavy lifting with faster languages
If I/O is slow,
Python will make you happy,
else Java or C*
* C++ didn't fit in the haiku ...
Process once, store forever
Re-processing all the raw data for every experiment is really inefficient
Create derived datasets that have a single focus
For example, Common Crawl releases processed datasets:
- metadata (30% of the original)
- extracted text (15% of the original)
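At July-crawl scale that's roughly 80 TB of metadata, or 40 TB of extracted text, instead of 266 TB of raw uncompressed pages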
Friends of Common Crawl make derived datasets
Common Crawl mrjob kit
☑ Spin up a cluster in one click
Two lines to get a cluster on EMR! Thx mrjob!
pip install -r requirements.txt
# Add your AWS credentials
python mrcc.py -r emr --conf-path mrjob.conf input/test-100.txt
Help us make it trivial to use Python to access the world's greatest collection of human knowledge!
Check out and use the Common Crawl mrjob kit!
Universities should teach big data courses with ... y'know ... big data
= 9.85 GB compressed
= 44 GB uncompressed
If "big data" isn't bigger than your phone's storage ...
Common Crawl: Making Python work for big web data