Making Python
work for big web data

Stephen Merity

We're a non-profit that makes

web data

accessible to anyone

Each crawl is billions of pages...


July crawl was...

4 billion web pages

266 terabytes uncompressed

Released

totally free


Lives on Amazon S3 (Public Datasets)

What good is data if it's not used?

 

Accessing & using it needs to be easy

Sexy & simple science


IPython Notebook

NumPy & SciPy

Pandas

scikit-learn

...but it's less simple when handling
big data


Big data = Java ecosystem

(Hadoop, HDFS, ...)











public static void main(String[] args)

Creating a starter kit for Python



A simple-to-use framework for experiments
+

A Python library for accessing files on S3
=

https://github.com/commoncrawl/cc-mrjob

mrjob to the rescue (thx Yelp!)

mrjob = trivial Hadoop streaming with Python


• Test and develop locally with minimal setup
  (no Hadoop installation required)

• Use only Python, with no other languages mixed in
  (the benefits of the Hadoop ecosystem without touching it)

• Spin up a cluster in one click
  (EMR integration = single-click cluster)

mrjob conveniences



To run locally...

python mrcc.py --conf-path mrjob.conf <in>


To run on Elastic MapReduce...

python mrcc.py -r emr --conf-path mrjob.conf <in>
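
For reference, an entire Hadoop streaming job in mrjob is a single class. A minimal word-count sketch (illustrative only, not part of cc-mrjob; nothing beyond MRJob's standard mapper/reducer API is assumed):

    # word_count.py - minimal mrjob example (illustrative)
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # Hadoop streaming hands the mapper one line of input at a time
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # counts is an iterator over the 1s emitted by the mappers
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

The same file runs unchanged locally or on EMR; only the -r flag changes.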

mrjob annoyances



EMR images all run with Python 2.6

(✖ from collections import Counter, since Counter only arrived in 2.7)

Luckily that can be fixed! Python 2.7 installation...

    # Many Bothans died to bring us this information
    ami_version: 3.0.4
    interpreter: python2.7
    bootstrap:
    - sudo yum install -y python27 python27-devel gcc-c++
    - wget https://bootstrap.pypa.io/get-pip.py  # assumed step: fetch pip's installer
    - sudo python2.7 get-pip.py
    - sudo pip2.7 install boto mrjob simplejson

Big issue: grabbing the data from S3


Python's gzip module has no streaming decompression (it expects a seekable file object)


Pulling to local storage and then processing
= terrible performance


We made gzipstream for
streaming gzip decompression

gzipstream = streaming gzip decompression


Anything that has read()
(S3, HTTP, GopherIPoAC, ...)


gzipstream handles multi-part gzip files
(our web data is stored like this for random access)
gzip + gzip + ... + gzip + gzip
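
A sketch of what that looks like in practice (the S3 path below is made up, and it assumes boto 2.x, anonymous access to the public dataset bucket, and gzipstream's GzipStreamFile file-like wrapper):

    # Stream one gzipped WARC file straight from S3 - nothing touches local disk
    import boto
    from boto.s3.key import Key
    from gzipstream import GzipStreamFile

    bucket = boto.connect_s3(anon=True).get_bucket('aws-publicdatasets')
    key = Key(bucket, 'common-crawl/path/to/segment.warc.gz')  # hypothetical path

    records = 0
    for line in GzipStreamFile(key):      # decompressed on the fly, member by member
        if line.rstrip() == 'WARC/1.0':   # every WARC record starts with this marker
            records += 1
    print(records)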




Beautiful but slow :'(


Python is generally slower than Java and C++...
but programmer productivity can be more important

Saving Python's performance

Heavy lifting with faster languages

<haiku>
If I/O is slow,
Python will make you happy,
else Java or C*
</haiku>


* C++ didn't fit in the haiku ...

Saving Python's performance

Process once, store forever

Processing all the raw data each time is really inefficient
Create derived datasets that have a single focus


For example, Common Crawl releases processed datasets:
  • metadata (30% of the original)
  • extracted text (15% of the original)

Friends of Common Crawl make derived datasets
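
As a sketch of what a single-focus job could look like (hypothetical: it combines mrjob, boto 2.x and gzipstream as above, and assumes each input line names one gzipped WARC file in the public dataset bucket):

    # content_types.py - hypothetical single-focus derived-data job
    import boto
    from boto.s3.key import Key
    from gzipstream import GzipStreamFile
    from mrjob.job import MRJob

    class MRContentTypes(MRJob):
        def mapper(self, _, s3_path):
            # one input line = one gzipped WARC file on S3
            bucket = boto.connect_s3(anon=True).get_bucket('aws-publicdatasets')
            for line in GzipStreamFile(Key(bucket, s3_path.strip())):
                # counts every Content-Type header line (WARC + HTTP) - fine for a sketch
                if line.startswith('Content-Type:'):
                    yield line.split(':', 1)[1].strip(), 1

        def reducer(self, content_type, counts):
            yield content_type, sum(counts)

    if __name__ == '__main__':
        MRContentTypes.run()

The output is a tiny, single-purpose dataset that answers one question, which is exactly the "process once, store forever" idea.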

Common Crawl mrjob kit


• Test and develop locally with just Python

• Access the data from S3 efficiently using gzipstream

• Spin up a cluster in one click


Two lines to get a cluster on EMR! Thx mrjob!

pip install -r requirements.txt
# Add your AWS credentials
python mrcc.py -r emr --conf-path mrjob.conf input/test-100.txt

https://github.com/commoncrawl/cc-mrjob

Help us make it trivial to use Python to access the world's greatest collection of human knowledge!

Check out and use:
https://github.com/commoncrawl/cc-mrjob

Stephen Merity
smerity.com / @smerity
@commoncrawl

Attributions


http://thenounproject.com/term/broken-computer/20293/

Universities should teach big data courses with ... y'know ...


big data


English Wikipedia
= 9.85 GB compressed
= 44 GB uncompressed
If "big data" isn't bigger than your phone's storage ...
