Making Python
work for big web data

Stephen Merity

We're a non-profit that makes

web data

accessible to anyone

Each crawl is billions of pages...

July crawl was...

4 billion web pages

266 terabytes uncompressed


totally free

Lives on Amazon S3 (Public Datasets)

What good is data if it's not used?


Accessing & using it needs to be easy

Sexy & simple science

IPython Notebook

NumPy & SciPy


scikit-learn

...but it's less simple when handling
big data

Big data = Java ecosystem

(Hadoop, HDFS, ...)

public static void main(String[] args)

Creating a starter kit for Python

Simple to use framework for experiments

A Python library for accessing files on S3

mrjob to the rescue (thx Yelp!)

mrjob = trivial Hadoop streaming with Python

 Test and develop locally with minimal setup
(no Hadoop installation required)

  Only use Python, not mixing other languages
(benefits of Hadoop ecosystem without touching it)

 Spin up a cluster in one click
(EMR integration = single click cluster)
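
For anyone new to the model that Hadoop streaming (and so mrjob) builds on, here is a minimal sketch of map, sort, and reduce in plain Python. It does not import mrjob; `run_local` is a hypothetical helper written for this illustration. In a real mrjob job, `mapper` and `reducer` would be methods on an `MRJob` subclass, and the framework would handle the sort/shuffle step.

```python
from itertools import groupby
from operator import itemgetter


def mapper(_, line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


def run_local(lines):
    """Simulate map -> sort/shuffle -> reduce in a single process.

    Hypothetical helper for illustration; not part of mrjob.
    """
    mapped = [pair for line in lines for pair in mapper(None, line)]
    mapped.sort(key=itemgetter(0))  # stands in for Hadoop's shuffle/sort
    results = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for out_key, out_value in reducer(key, (v for _, v in group)):
            results[out_key] = out_value
    return results
```

Calling `run_local(["big data", "Big web data"])` groups the pairs by word before reducing, which is the same contract a cluster gives the reducer.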

mrjob conveniences

To run locally...

python <your_job.py> --conf-path mrjob.conf <in>

To run on Elastic MapReduce...

python <your_job.py> -r emr --conf-path mrjob.conf <in>

mrjob annoyances

EMR images all run with Python 2.6

(✖ from collections import Counter) 
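
If you are stuck on 2.6, one possible workaround is a `defaultdict`-based counter, since `collections.Counter` only arrived in Python 2.7. A minimal sketch (`count_items` is a hypothetical name, not something from mrjob):

```python
from collections import defaultdict


def count_items(items):
    """Count hashable items without collections.Counter (added in Python 2.7)."""
    counts = defaultdict(int)  # missing keys start at 0
    for item in items:
        counts[item] += 1
    return dict(counts)
```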

Luckily that can be fixed! Python 2.7 installation...

    runners:
      emr:
        # Many Bothans died to bring us this information
        ami_version: 3.0.4
        interpreter: python2.7
        bootstrap:
        - sudo yum install -y python27 python27-devel gcc-c++
        - sudo python2.7
        - sudo pip2.7 install boto mrjob simplejson

Big issue: grabbing the data from S3?

Python 2's gzip module has no streaming decompression (it needs a seekable file)

Pulling to local storage and then processing
= terrible performance

We made gzipstream for
streaming gzip decompression

gzipstream = streaming gzip decompression

Anything that has read()
(S3, HTTP, Gopher, IPoAC, ...)

gzipstream handles multi-part gzip files
(our web data is stored like this for random access)
gzip + gzip + ... + gzip + gzip
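
To make the idea concrete, here is a minimal sketch of streaming, multi-member gzip decompression using only the standard library's `zlib`. This is not gzipstream's actual implementation; `stream_gunzip` and its `chunk_size` parameter are names made up for this illustration, and it assumes Python 3.3+ (for `decompressobj.eof`).

```python
import zlib


def stream_gunzip(fileobj, chunk_size=64 * 1024):
    """Yield decompressed bytes from a (possibly multi-member) gzip stream.

    Only fileobj.read(n) is called, so the source never has to be seekable.
    Hypothetical helper; not the gzipstream API.
    """
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        while chunk:
            data = decoder.decompress(chunk)
            if data:
                yield data
            if decoder.eof:
                # This gzip member is finished; whatever follows it in the
                # chunk belongs to the next member, so start a fresh decoder.
                chunk = decoder.unused_data
                decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
            else:
                # The whole chunk was consumed; go read more input.
                chunk = b""
```

Because the function is a generator, memory use stays around one chunk regardless of how large the remote file is, and the member-by-member restart is what lets it read files stored as gzip + gzip + ... + gzip.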

Beautiful but slow :'(

Python is generally slower than Java and C++...
but programmer productivity can be more important

Saving Python's performance

Heavy lifting with faster languages

If I/O is slow,
Python will make you happy,
else Java or C*

* C++ didn't fit in the haiku ...

Saving Python's performance

Process once, store forever

Processing all the raw data each time is really inefficient
Create derived datasets that have a single focus

For example, Common Crawl releases processed datasets:
  • metadata (30% of the original)
  • extracted text (15% of the original)
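
As a rough sketch of "process once, store forever": keep only the fields one task needs and store them compressed, so later jobs never touch the raw data. Everything here is an assumption for illustration (the `url` and `title` field names, the JSON-lines layout, and the helper names); it is not how Common Crawl builds its own derived datasets.

```python
import gzip
import json


def write_derived(records, path, fields=("url", "title")):
    """Store only the chosen fields of each record as gzipped JSON lines."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for record in records:
            slim = {name: record.get(name) for name in fields}
            out.write(json.dumps(slim) + "\n")


def read_derived(path):
    """Yield the stored records back, one dict per line."""
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            yield json.loads(line)
```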

Friends of Common Crawl make derived datasets

Common Crawl mrjob kit

 Test and develop locally with just Python

   Access the data from S3 efficiently using gzipstream

   Spin up a cluster in one click

Two lines to get a cluster on EMR! Thx mrjob!

pip install -r requirements.txt
# Add your AWS credentials
python <your_job.py> -r emr --conf-path mrjob.conf input/test-100.txt

Help us make it trivial to use Python to access the world's greatest collection of human knowledge!

Check out and use:

Stephen Merity


Universities should teach big data courses with ... y'know ...

big data

English Wikipedia
= 9.85 GB compressed
= 44 GB uncompressed
If "big data" isn't bigger than your phone's storage ...
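
For scale, a quick back-of-the-envelope comparison of the two uncompressed sizes quoted in these slides, assuming decimal units (1 TB = 1000 GB):

```python
GB = 10 ** 9
TB = 10 ** 12

wikipedia_bytes = 44 * GB   # English Wikipedia, uncompressed
crawl_bytes = 266 * TB      # July crawl, uncompressed

# How many uncompressed English Wikipedias fit into one crawl?
ratio = crawl_bytes / wikipedia_bytes
print(round(ratio))  # on the order of 6,000
```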
