Making Python
work for big web data

Stephen Merity

We're a non-profit that makes

web data

accessible to anyone

Each crawl is billions of pages...


July crawl was...

4 billion web pages

266 terabytes uncompressed

Released

totally free


Lives on Amazon S3 (Public Datasets)

What good is data if it's not used?

 

Accessing & using it needs to be easy

Sexy & simple science


IPython Notebook

NumPy & SciPy

Pandas

scikit-learn

...but it's less simple when handling
big data


Big data = Java ecosystem

(Hadoop, HDFS, ...)











public static void main(String[] args)

Creating a starter kit for Python



A simple-to-use framework for experiments
+

A Python library for accessing files on S3
=

https://github.com/commoncrawl/cc-mrjob

mrjob to the rescue (thx Yelp!)

mrjob = trivial Hadoop streaming with Python


• Test and develop locally with minimal setup
  (no Hadoop installation required)

• Use only Python, with no other languages mixed in
  (the benefits of the Hadoop ecosystem without touching it)

• Spin up a cluster in one click
  (EMR integration = single-click cluster)

mrjob conveniences



To run locally...

python mrcc.py --conf-path mrjob.conf <in>


To run on Elastic MapReduce...

python mrcc.py -r emr --conf-path mrjob.conf <in>
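
For reference, an entire Hadoop streaming job in mrjob is a single class. A minimal word-count sketch (illustrative only, not part of cc-mrjob; nothing beyond MRJob's standard mapper/reducer API is assumed):

    # word_count.py - minimal mrjob example (illustrative)
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # Hadoop streaming hands the mapper one line of input at a time
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # counts is an iterator over the 1s emitted by the mappers
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

The same file runs unchanged locally or on EMR; only the -r flag changes.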

mrjob annoyances



EMR images all run with Python 2.6

(✖ from collections import Counter, since Counter only arrived in 2.7)

Luckily that can be fixed! Python 2.7 installation...

    # Many Bothans died to bring us this information
    ami_version: 3.0.4
    interpreter: python2.7
    bootstrap:
    - sudo yum install -y python27 python27-devel gcc-c++
    - wget https://bootstrap.pypa.io/get-pip.py  # assumed step: fetch pip's installer
    - sudo python2.7 get-pip.py
    - sudo pip2.7 install boto mrjob simplejson

Big issue: grabbing the data from S3


Python's gzip module has no streaming decompression (it expects a seekable file object)


Pulling to local storage and then processing
= terrible performance


We made gzipstream for
streaming gzip decompression

gzipstream = streaming gzip decompression


Anything that has read()
(S3, HTTP, GopherIPoAC, ...)


gzipstream handles multi-part gzip files
(our web data is stored like this for random access)
gzip + gzip + ... + gzip + gzip
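
A sketch of what that looks like in practice (the S3 path below is made up, and it assumes boto 2.x, anonymous access to the public dataset bucket, and gzipstream's GzipStreamFile file-like wrapper):

    # Stream one gzipped WARC file straight from S3 - nothing touches local disk
    import boto
    from boto.s3.key import Key
    from gzipstream import GzipStreamFile

    bucket = boto.connect_s3(anon=True).get_bucket('aws-publicdatasets')
    key = Key(bucket, 'common-crawl/path/to/segment.warc.gz')  # hypothetical path

    records = 0
    for line in GzipStreamFile(key):      # decompressed on the fly, member by member
        if line.rstrip() == 'WARC/1.0':   # every WARC record starts with this marker
            records += 1
    print(records)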




Beautiful but slow :'(


Python is generally slower than Java and C++...
but programmer productivity can be more important

Saving Python's performance

Heavy lifting with faster languages

<haiku>
If I/O is slow,
Python will make you happy,
else Java or C*
</haiku>


* C++ didn't fit in the haiku ...

Saving Python's performance

Process once, store forever

Processing all the raw data each time is really inefficient
Create derived datasets that have a single focus


For example, Common Crawl releases processed datasets:
  • metadata (30% of the original)
  • extracted text (15% of the original)

Friends of Common Crawl make derived datasets
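
As a sketch of what a single-focus job could look like (hypothetical: it combines mrjob, boto 2.x and gzipstream as above, and assumes each input line names one gzipped WARC file in the public dataset bucket):

    # content_types.py - hypothetical single-focus derived-data job
    import boto
    from boto.s3.key import Key
    from gzipstream import GzipStreamFile
    from mrjob.job import MRJob

    class MRContentTypes(MRJob):
        def mapper(self, _, s3_path):
            # one input line = one gzipped WARC file on S3
            bucket = boto.connect_s3(anon=True).get_bucket('aws-publicdatasets')
            for line in GzipStreamFile(Key(bucket, s3_path.strip())):
                # counts every Content-Type header line (WARC + HTTP) - fine for a sketch
                if line.startswith('Content-Type:'):
                    yield line.split(':', 1)[1].strip(), 1

        def reducer(self, content_type, counts):
            yield content_type, sum(counts)

    if __name__ == '__main__':
        MRContentTypes.run()

The output is a tiny, single-purpose dataset that answers one question, which is exactly the "process once, store forever" idea.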

Common Crawl mrjob kit


• Test and develop locally with just Python

• Access the data from S3 efficiently using gzipstream

• Spin up a cluster in one click


Two lines to get a cluster on EMR! Thx mrjob!

pip install -r requirements.txt
# Add your AWS credentials
python mrcc.py -r emr --conf-path mrjob.conf input/test-100.txt

https://github.com/commoncrawl/cc-mrjob

Help us make it trivial to use Python to access the world's greatest collection of human knowledge!

Check out and use:
https://github.com/commoncrawl/cc-mrjob

Stephen Merity
smerity.com / @smerity
@commoncrawl

Attributions


http://thenounproject.com/term/broken-computer/20293/

Universities should teach big data courses with ... y'know ...


big data


English Wikipedia
= 9.85 GB compressed
= 44 GB uncompressed
If "big data" isn't bigger than your phone's storage ...
