Plumbing for Data Science

Pradeep Gowda

Proofpoint Inc


Data Science

Data Science Venn Diagram — Drew Conway, IQT Quarterly, 2011

Courtesy: Oracle 




Machine Learning, EDA, etc.

"Big Data"

Data Plumbing

Quick turnaround?



Working with data is messy

Be language and tool agnostic

Get familiar with the command line

UNIX metaphors - pipe, filter, composable programs

The data landscape

Data sources

  • Server logs
  • Sensors
  • Human generated (twitter, geolocation)
  • APIs
  • Databases
    • RDBMS
    • NoSQL ("not only SQL") and NewSQL databases

Data Formats

  • API
    • JSON
    • XML
  • HTML - web scraping
  • Delimited files
  • Binary
    • XLS
    • PDF
  • Legacy
  • "unstructured" data

Tools of the trade

Skill sets

  • Basic programming ability
    • procedural
    • object based
    • scripting
  • Text processing
  • Network / HTTP 
  • Databases (and some SQL)

What's in your toolkit?

  • Command line tools
  • Programming language
    • Python. Duh!
  • Data analysis Tools
    • PyData stack, R, etc.
  • Language Libraries 
  • Database query tools
  • Spreadsheet (like) programs

Command line tools

  • wget
  • curl
  • head/tail
  • grep
  • sort
  • cut
  • sed
  • awk
  • uniq

Of the UNIX heritage

Third-party tools: jq, csvkit, etc.

Demo time

at the end... time permitting

Why Python

  • Community
  • Availability
  • Approachability
  • Multi-faceted
  • Well understood (limitations and strengths)
  • A fair balance of expressiveness, power, permissiveness
  • Team player

Batteries included

  • string
  • subprocess
  • os
  • csv
  • glob
  • argparse / optparse
  • ConfigParser
  • json
  • xml.etree.ElementTree
  • collections
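As a sketch of how these batteries combine, here is a hypothetical status-code tally pairing the csv module with collections.Counter:

```python
import csv
import io
from collections import Counter

# A hypothetical server log, already in delimited form.
log = io.StringIO(
    "ip,status\n"
    ",200\n"
    ",404\n"
    ",200\n"
)

# csv.DictReader parses each row into a dict keyed by the header;
# Counter tallies how often each status code appears.
statuses = Counter(row['status'] for row in csv.DictReader(log))
print(statuses.most_common())   # [('200', 2), ('404', 1)]
```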

Third party Libraries

  • Requests - HTTP client
  • BeautifulSoup - HTML scraping
  • Arrow - date and time
  • SQLAlchemy - database access
# Installing third party packages:

$ pip install arrow
$ pip install requests

# PROTIP: look up virtualenv and virtualenvwrapper
# Alternative: conda (part of the Anaconda distribution)

Text Encoding

Text: Regular Expression

For parsing multi-line structured records: see pyparsing


import re

s = 'ACME-1245:V2'
match = re.match(r'\w{4}-\d{4}:\w+', s)

if match:
    print 'found',
else:
    print 'did not find',

The pattern: a four-letter identifier, followed by a dash, followed by a 4-digit number, followed by a colon, followed by another string
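The same pattern scales from matching one record to extracting every record in a larger blob. A small sketch on a hypothetical log line, using re.findall:

```python
import re

# Hypothetical log line containing several product identifiers.
line = 'shipped ACME-1245:V2 and ACME-1246:V3 to the warehouse'

# findall returns every non-overlapping match as a list of strings.
ids = re.findall(r'\w{4}-\d{4}:\w+', line)
print(ids)   # ['ACME-1245:V2', 'ACME-1246:V3']
```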




# From string to time
from time import strptime

print strptime('2015-05-01', '%Y-%m-%d')
time.struct_time(tm_year=2015, tm_mon=5, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=4, tm_yday=121, tm_isdst=-1)

# From time to string
from time import strftime, localtime

print strftime('%Y-%m-%dT%H:%M:%S', localtime())

Time: Arrow

>>> import arrow
>>> utc = arrow.utcnow()
>>> utc
<Arrow [2013-05-11T21:23:58.970460+00:00]>

>>> utc = utc.replace(hours=-1)
>>> utc
<Arrow [2013-05-11T20:23:58.970460+00:00]>

>>> local ='US/Pacific')
>>> local
<Arrow [2013-05-11T13:23:58.970460-07:00]>

>>> arrow.get('2013-05-11T21:23:58.970460+00:00')
<Arrow [2013-05-11T21:23:58.970460+00:00]>

>>> local.timestamp
1368303838

>>> local.format('YYYY-MM-DD HH:mm:ss ZZ')
'2013-05-11 13:23:58 -07:00'

>>> local.humanize()
'an hour ago'

Fetching Data off the web

import requests
import json

r = requests.get('')
if r.status_code == 200:
    data = json.loads(r.text)
    print "Your IP is: ", data['origin']
    print "FYI, the text response: \n", r.text
    print "prettified json: \n",  json.dumps(data, sort_keys=True,
                  indent=4, separators=(',', ': '))


Your IP is:
FYI, the text response: 
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "", 
    "User-Agent": "python-requests/2.5.1 CPython/2.7.9 Darwin/14.3.0"
  }, 
  "origin": "", 
  "url": ""
}

prettified json: 
{
    "args": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "",
        "User-Agent": "python-requests/2.5.1 CPython/2.7.9 Darwin/14.3.0"
    },
    "origin": "",
    "url": ""
}
import requests

data = {'name': 'Who?', 'age': 55}
# For a GET request, params= encodes the dict into the query string
r = requests.get('', params=data)


Interacting with APIs


 comes in handy for:

  • testing / learning new APIs
  • data downloads
  • automating periodic data fetches

# You can install httpbin as a library from PyPI and run it as a WSGI app.

$ pip install httpbin
$ gunicorn httpbin:app


"Data comes in, Data goes out..."

"You can't explain that!"

SQL. Learn it.


Database recommendation? One word...

SQL Skills 

  • SELECT statement
  • WHERE clause
  • JOINs
  • Basics of indexing 

Command line tools in Python

Let the program do one small thing well 

Make it configurable via command-line switches

Make it configurable via a configuration file
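The standard library covers the configuration-file half, too. A minimal sketch with a hypothetical [fetch] section; note the module is spelled configparser in Python 3 and ConfigParser in Python 2:

```python
import configparser

# A hypothetical config for a periodic fetch script. read_string()
# is used here for brevity; read() works the same way on a file path.
config = configparser.ConfigParser()
config.read_string("""
[fetch]
url =
interval = 30
""")

print(config['fetch']['url'])               #
print(config['fetch'].getint('interval'))   # typed access: 30
```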

Python libraries for building command line tools

Built-in libraries

  • fileinput
  • argparse

Third-party libraries

  • docopt


#!/usr/bin/env python

# Program to print words that start with an uppercase letter
# example usage: cat test.txt | ./filterr

import fileinput

for line in fileinput.input():
    for word in line.split():
        if word[0].isupper():
            print word

$ cat ~/tmp/habits.txt | ./filterr


import argparse

parser = argparse.ArgumentParser()

parser.add_argument('-s', action='store', dest='simple_value',
                    help='Store a simple value')

parser.add_argument('-c', action='store_const', dest='constant_value',
                    const='value-to-store',
                    help='Store a constant value')

parser.add_argument('-t', action='store_true', default=False,
                    dest='boolean_switch',
                    help='Set a switch to true')
parser.add_argument('-f', action='store_false', default=False,
                    dest='boolean_switch',
                    help='Set a switch to false')

parser.add_argument('-a', action='append', dest='collection',
                    default=[],
                    help='Add repeated values to a list')

parser.add_argument('-A', action='append_const', dest='const_collection',
                    const='value-1-to-append', default=[],
                    help='Add different values to list')
parser.add_argument('-B', action='append_const', dest='const_collection',
                    const='value-2-to-append',
                    help='Add different values to list')

parser.add_argument('--version', action='version', version='%(prog)s 1.0')

results = parser.parse_args()
print 'simple_value     =', results.simple_value
print 'constant_value   =', results.constant_value
print 'boolean_switch   =', results.boolean_switch
print 'collection       =', results.collection
print 'const_collection =', results.const_collection

argparse ..

$ python -h

usage: [-h] [-s SIMPLE_VALUE] [-c] [-t] [-f]
                          [-a COLLECTION] [-A] [-B] [--version]

optional arguments:
  -h, --help       show this help message and exit
  -s SIMPLE_VALUE  Store a simple value
  -c               Store a constant value
  -t               Set a switch to true
  -f               Set a switch to false
  -a COLLECTION    Add repeated values to a list
  -A               Add different values to list
  -B               Add different values to list
  --version        show program's version number and exit

but then... who has time to write all those switches..


"""Naval Fate.

Usage: ship new <name>... ship <name> move <x> <y> [--speed=<kn>] ship shoot <x> <y> mine (set|remove) <x> <y> [--moored | --drifting] (-h | --help) --version

  -h --help     Show this screen.
  --version     Show version.
  --speed=<kn>  Speed in knots [default: 10].
  --moored      Moored (anchored) mine.
  --drifting    Drifting mine.

from docopt import docopt

if __name__ == '__main__':
    arguments = docopt(__doc__, version='Naval Fate 2.0')

Where to go from here


  • It is still programming
  • Embrace the chaos
  • Plumbing vs Architecting vs Building
  • Document your steps ("Repeatable")
  • Python is an expressive language. Learn it. Use it.




Slides will be on IndyPy's GitHub repo

Pythology Lecture Series - May 1, 2015