Making Python
work for big web data
Stephen Merity
We're a non-profit that makes
web data
accessible to anyone
Each crawl is billions of pages...
July crawl was...
4 billion web pages
266 terabytes uncompressed
Released
totally free
Lives on Amazon S3 (Public Datasets)
What good is data if it's not used?
Accessing & using it needs to be easy
Sexy & simple science
IPython Notebook
NumPy & SciPy
pandas
scikit-learn
...but it's less simple when handling
big data
Big data = Java ecosystem
(Hadoop, HDFS, ...)
public static void main(String[] args)
Creating a starter kit for Python
A simple-to-use framework for experiments
+
A Python library for accessing files on S3
=
mrjob to the rescue (thx Yelp!)
mrjob = trivial Hadoop streaming with Python
☑ Test and develop locally with minimal setup
(no Hadoop installation required)
☑ Only use Python, with no other languages mixed in
(the benefits of the Hadoop ecosystem without touching it)
☑ Spin up a cluster in one click
(EMR integration = single click cluster)
mrjob conveniences
To run locally...
python mrcc.py --conf-path mrjob.conf <in>
To run on Elastic MapReduce...
python mrcc.py -r emr --conf-path mrjob.conf <in>
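The job file itself is just Python. A minimal sketch of what an mrjob job looks like (illustrative only, not the actual mrcc.py; the class and logic here are a toy word count):

from mrjob.job import MRJob

class WordCounter(MRJob):
    # Toy example: count words across all input lines

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the line
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # Sum the counts emitted by the mappers
        yield word, sum(counts)

if __name__ == '__main__':
    WordCounter.run()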
mrjob annoyances
EMR images all run with Python 2.6
(✖ from collections import Counter)
Luckily that can be fixed! A Python 2.7 installation via mrjob.conf (under runners: emr:)...
# Many Bothans died to bring us this information
ami_version: 3.0.4
interpreter: python2.7
bootstrap:
- sudo yum install -y python27 python27-devel gcc-c++
- sudo python2.7 get-pip.py#
- sudo pip2.7 install boto mrjob simplejson
Big issue: grabbing the data from S3
Python's standard library has no streaming gzip decompression
Pulling to local storage and then processing
= terrible performance
We made gzipstream for
streaming gzip decompression
gzipstream = streaming gzip decompression
Anything that has read()
(S3, HTTP, Gopher, IPoAC, ...)
gzipstream handles multi-part gzip files
(our web data is stored like this for random access)
gzip + gzip + ... + gzip + gzip
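A rough sketch of how this fits together, assuming the boto and warc packages are installed (the bucket and key paths below are only illustrative):

import boto
import warc
from gzipstream import GzipStreamFile

# Stream a gzipped WARC file straight out of S3, decompressing on the fly
conn = boto.connect_s3(anon=True)
bucket = conn.get_bucket('aws-publicdatasets')
# Illustrative path -- point this at a real crawl segment
key = bucket.get_key('common-crawl/crawl-data/.../example.warc.gz')

# GzipStreamFile only needs an object with read(), and it copes with
# multi-member files (gzip + gzip + ... + gzip)
for record in warc.WARCFile(fileobj=GzipStreamFile(key)):
    url = record.header.get('WARC-Target-URI')
    if url:
        print(url)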
Beautiful but slow :'(
Python is generally slower than Java and C++...
but programmer productivity can be more important
Saving Python's performance
Heavy lifting with faster languages
<haiku>
If I/O is slow,
Python will make you happy,
else Java or C*
</haiku>
* C++ didn't fit in the haiku ...
Saving Python's performance
Process once, store forever
Processing all the raw data each time is really inefficient
Create derived datasets that have a single focus
For example, Common Crawl releases processed datasets:
- metadata (30% of the original)
- extracted text (15% of the original)
Friends of Common Crawl make derived datasets
- hyperlink graph (64 billion edges)
- extracted table text, microdata, and more!
Common Crawl mrjob kit
☑ Test and develop locally with just Python
☑ Access the data from S3 efficiently using gzipstream
☑ Spin up a cluster in one click
Two lines to get a cluster on EMR! Thx mrjob!
pip install -r requirements.txt
# Add your AWS credentials
python mrcc.py -r emr --conf-path mrjob.conf input/test-100.txt
https://github.com/commoncrawl/cc-mrjob
Help us make it trivial to use Python to access the world's greatest collection of human knowledge!
Check out and use:
https://github.com/commoncrawl/cc-mrjob
Stephen Merity
Attributions
Universities should teach big data courses with ... y'know ...
big data
English Wikipedia
= 9.85 GB compressed
= 44 GB uncompressed
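For scale: 266 terabytes uncompressed ≈ 6,000x the size of uncompressed English Wikipedia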
If "big data" isn't bigger than your phone's storage ...