Making Python
work for big web data
We're a non-profit that makes
web data
accessible to anyone
Each crawl is billions of pages...
July crawl was...
Released
Lives on Amazon S3 (Public Datasets)
What good is data if it's not used?
Accessing & using it needs to be easy
Sexy & simple science
IPython Notebook
NumPy & SciPy
pandas
scikit-learn
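How simple? A few lines cover load, summarize, and model. This is only an illustrative sketch: the CSV file and its columns are invented for the example.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical per-domain crawl stats (file and column names are made up)
df = pd.read_csv('domain_stats.csv')
print(df.describe())

# Toy model: predict page count from the number of outgoing links
model = LinearRegression()
model.fit(df[['outgoing_links']], df['page_count'])
print(model.coef_, model.intercept_)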
...but it's less simple when handling
big data
Big data = Java ecosystem
(Hadoop, HDFS, ...)
public static void main(String[] args)
Creating a starter kit for Python
Simple-to-use framework for experiments
+
A Python library for accessing files on S3
=
mrjob to the rescue (thx Yelp!)
mrjob = trivial Hadoop streaming with Python
☑ Test and develop locally with minimal setup
(no Hadoop installation required)
☑ Use only Python, without mixing in other languages
(the benefits of the Hadoop ecosystem without touching it)
☑ Spin up a cluster in one click
(EMR integration = a single-click cluster)
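To show how little code a job takes, here is a minimal word-count sketch in mrjob's style; it is not the kit's actual mrcc.py, just the shape of any mrjob job.

from mrjob.job import MRJob


class MRWordCount(MRJob):
    """Count words across the input lines: the 'hello world' of Hadoop streaming."""

    def mapper(self, _, line):
        # Each input line arrives as the value; the key is unused (None)
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # counts is an iterator over the 1s emitted for this word
        yield word, sum(counts)


if __name__ == '__main__':
    MRWordCount.run()

Running python wordcount.py input.txt executes it locally; adding -r emr sends the same file to a cluster.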
mrjob conveniences
To run locally...
python mrcc.py --conf-path mrjob.conf <in>
To run on Elastic MapReduce...
python mrcc.py -r emr --conf-path mrjob.conf <in>
mrjob annoyances
EMR images all run with Python 2.6
(✖ from collections import Counter)
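A taste of the kind of shim Python 2.6 forces on you before the proper fix below (a hypothetical workaround, not part of the kit):

try:
    from collections import Counter  # Python 2.7+
except ImportError:
    # Python 2.6 fallback: a minimal counting helper (no most_common, etc.)
    from collections import defaultdict

    def Counter(iterable=()):
        counts = defaultdict(int)
        for item in iterable:
            counts[item] += 1
        return counts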
Luckily that can be fixed! Python 2.7 installation...
# Many Bothans died to bring us this information
ami_version: 3.0.4
interpreter: python2.7
bootstrap:
- sudo yum install -y python27 python27-devel gcc-c++
- sudo python2.7 get-pip.py
- sudo pip2.7 install boto mrjob simplejson
Big issue: grabbing the data from S3
Python's gzip module can't decompress a stream (it needs seek())
Pulling to local storage and then processing
= terrible performance
We made gzipstream for
streaming gzip decompression
gzipstream = streaming gzip decompression
Anything that has read()
(S3, HTTP, Gopher, IPoAC, ...)
gzipstream handles multi-part gzip files
(our web data is stored like this for random access)
gzip + gzip + ... + gzip + gzip
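The underlying trick is small: feed zlib one chunk at a time and start a fresh decompressor whenever a gzip member ends. The sketch below is not gzipstream's actual code, only the idea, and it works on anything that exposes read().

import zlib

def stream_gunzip(fileobj, chunk_size=16 * 1024):
    # Yield decompressed chunks from a (possibly multi-member) gzip stream.
    # Never calls seek(), so S3 keys, HTTP responses, etc. all work.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 = expect a gzip wrapper
    while True:
        compressed = fileobj.read(chunk_size)
        if not compressed:
            break
        data = decomp.decompress(compressed)
        if data:
            yield data
        # unused_data fills up once a gzip member ends mid-chunk:
        # restart on the leftover bytes, which begin the next member
        while decomp.unused_data:
            leftover = decomp.unused_data
            decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
            data = decomp.decompress(leftover)
            if data:
                yield data
    tail = decomp.flush()
    if tail:
        yield tail

Used as for chunk in stream_gunzip(open('pages.warc.gz', 'rb')), or with a boto S3 key or an HTTP response body in place of the file.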
Beautiful but slow :'(
Python is generally slower than Java and C++...
but programmer productivity can be more important
Heavy lifting with faster languages
<haiku>
If I/O is slow,
Python will make you happy,
else Java or C*
</haiku>
* C++ didn't fit in the haiku ...
Process once, store forever
Processing all the raw data each time is really inefficient
Create derived datasets that have a single focus
For example, Common Crawl releases processed datasets:
- metadata (30% of the original)
- extracted text (15% of the original)
Common Crawl mrjob kit
☑ Test and develop locally with just Python
☑ Access the data from S3 efficiently using gzipstream
☑ Spin up a cluster in one click
Two lines to get a cluster on EMR! Thx mrjob!
pip install -r requirements.txt
# Add your AWS credentials
python mrcc.py -r emr --conf-path mrjob.conf input/test-100.txt
Universities should teach big data courses with ... y'know ...
big data
English Wikipedia
= 9.85 GB compressed
= 44 GB uncompressed
If "big data" isn't bigger than your phone's storage ...