Pythology

CC: https://flic.kr/p/8kEcVA

Plumbing for Data Science

Pradeep Gowda

Proofpoint Inc

@btbytes

Data Science

Drew Conway, IQT Quarterly, 2011

Courtesy: Oracle

Source: http://columbiadatascience.com/

Exploration

Engineering

Machine Learning, EDA etc.,

"Big Data"

Data Plumbing

Quick turnaround?

Automation

Framework

Working with data is messy

Be language and tool agnostic

Get familiar with the command line

In the beginning was the command line

UNIX metaphors - pipe, filter, composable programs

The data landscape

Data sources

Server logs
Sensors
Human generated (twitter, geolocation)
APIs
Databases
- RDBMS
- No/NotOnly/New SQL databases

Data Formats

API
- JSON
- XML
HTML - web scraping
Delimited files
Binary
- XLS
- PDF
Legacy
"unstructured" data

Tools of the trade

Skill sets

Basic programming ability
- procedural
- object based
- scripting
Text processing
Network / HTTP
Databases (and some SQL)

What's in your toolkit?

Command line tools
Programming language
- Python. Duh!
Data analysis Tools
- PyData, R etc.,
Language Libraries
Database query tools
Spreadsheet (like) programs

Command line tools

wget
curl
head/tail
grep
sort
cut
sed
awk
uniq

textproc

Unix text proc

Of the UNIX heritage

Third party tools: jq, CSVKit etc.,

Demo time

at the end... time permitting

The CSV Challenge

https://gist.github.com/jorin-vogel/2e43ffa981a97bc17259

Why Python

Community
Availability
Approachability
Multi-faceted
Well understood (limitations and strengths)
A fair balance of expressiveness, power, permissibility
Team player

Batteries included

String
Process
OS
CSV
Glob
Arg/Opt parse
ConfigParse
Json
XML/ElementTree
Collections

Third party Libraries

Requests - HTTP client
BeautifulSoup - HTML scraping
Arrow - date and time
SQLAlchemy - database access

# Installing third party packages:

pip install arrow
pip install requests

# PROTIP: look up virtualenv and virtualenvwrapper
# Alternate: conda [part of Continuum.io / anaconda package]

Text Encoding

Text: Regular Expression

For parsing multi-line structured records: see pyparsing

ACME-1245:V2

import re
str = 'ACME-1245:V2'
match = re.search(r'\w{4}-\d{4}:\w+', str)

if match:                      
    print 'found', match.group()
else:
    print 'did not find'

Four-letter identifier followed by a "dash" followed by a 4 digit number followed by a "colon" followed by another string

Time

strptime

PROTIP: strftime.org

# From string to time
from time import strptime

print strptime('2015-05-01', '%y-%m-%d')
time.struct_time(tm_year=2015, tm_mon=5, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=4, tm_yday=121, tm_isdst=-1)


#From time to string
from time import strftime
print strftime('%Y-%m-%dT%H:%M:%S', datetime.timetuple(datetime.now()))

Time: Arrow

>>> import arrow
>>> utc = arrow.utcnow()
>>> utc
<Arrow [2013-05-11T21:23:58.970460+00:00]>

>>> utc = utc.replace(hours=-1)
>>> utc
<Arrow [2013-05-11T20:23:58.970460+00:00]>

>>> local = utc.to('US/Pacific')
>>> local
<Arrow [2013-05-11T13:23:58.970460-07:00]>

>>> arrow.get('2013-05-11T21:23:58.970460+00:00')
<Arrow [2013-05-11T21:23:58.970460+00:00]>

>>> local.timestamp
1368303838

>>> local.format('YYYY-MM-DD HH:mm:ss ZZ')
'2013-05-11 13:23:58 -07:00'

>>> local.humanize()
'an hour ago'

Fetching Data off the web

import requests
import json

r = requests.get('http://httpbin.org/get')
if r.status_code == 200:
    data = json.loads(r.text)
    print "Your IP is: ", data['origin']
    print "FYI, the text response: \n", r.text
    print "prettified json: \n",  json.dumps(data, sort_keys=True,
                  indent=4, separators=(',', ': '))

Requests/GET

Your IP is:  108.223.55.55
FYI, the text response: 
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.5.1 CPython/2.7.9 Darwin/14.3.0"
  }, 
  "origin": "108.223.55.55", 
  "url": "http://httpbin.org/get"
}

prettified json: 
{
    "args": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.5.1 CPython/2.7.9 Darwin/14.3.0"
    },
    "origin": "108.223.55.55",
    "url": "http://httpbin.org/get"
}

import requests
data = {'name': 'Who?', 'age': 55}
r = requests.get('http://example.com', data=data)
r.text

Requests/POST

Interacting with APIs

httpie

comes in handy for automating

testing / learning new APIs
data downloads
automating periodic data fetches

http://httpbin.org/

# You can install httpbin as a library from PyPI and run it as a WSGI app.

$ pip install httpbin
$ gunicorn httpbin:app

Databases

"Data comes in, Data goes out..."

"You can't explain that!"

SQL. Learn it.

PostgreSQL

Database recommendation? one word...

SQL Skills

SELECT statement
WHERE clause
JOINs
Basics of indexing

Command line tools in Python

Let the program do one small thing well

Make it configurable via the command line switches

Make it configurable via a configuration file

Python libraries for building

Command line tools

Inbuilt Libraries

fileinput
argparse

Thirdparty Libraries

docopt

fileinput

#!/usr/bin/env python

# Program to print words that start with an uppercase letter
# example usage: cat test.txt | ./filterr

import fileinput
import string


for line in fileinput.input():
    words = line.split()
    for word in words:
        first_letter = ord(word[0])
        if first_letter >=65 and first_letter <=90:
            print word

cat ~/tmp/habits.txt| ./filterr 
Habits
John
Doe
March

argparse

import argparse

parser = argparse.ArgumentParser()

parser.add_argument('-s', action='store', dest='simple_value',
                    help='Store a simple value')

parser.add_argument('-c', action='store_const', dest='constant_value',
                    const='value-to-store',
                    help='Store a constant value')

parser.add_argument('-t', action='store_true', default=False,
                    dest='boolean_switch',
                    help='Set a switch to true')
parser.add_argument('-f', action='store_false', default=False,
                    dest='boolean_switch',
                    help='Set a switch to false')

parser.add_argument('-a', action='append', dest='collection',
                    default=[],
                    help='Add repeated values to a list',
                    )

parser.add_argument('-A', action='append_const', dest='const_collection',
                    const='value-1-to-append',
                    default=[],
                    help='Add different values to list')
parser.add_argument('-B', action='append_const', dest='const_collection',
                    const='value-2-to-append',
                    help='Add different values to list')

parser.add_argument('--version', action='version', version='%(prog)s 1.0')
results = parser.parse_args()
print 'simple_value     =', results.simple_value
print 'constant_value   =', results.constant_value
print 'boolean_switch   =', results.boolean_switch
print 'collection       =', results.collection
print 'const_collection =', results.const_collection

argparse ..

$ python argparse_action.py -h

usage: argparse_action.py [-h] [-s SIMPLE_VALUE] [-c] [-t] [-f]
                          [-a COLLECTION] [-A] [-B] [--version]

optional arguments:
  -h, --help       show this help message and exit
  -s SIMPLE_VALUE  Store a simple value
  -c               Store a constant value
  -t               Set a switch to true
  -f               Set a switch to false
  -a COLLECTION    Add repeated values to a list
  -A               Add different values to list
  -B               Add different values to list
  --version        show program's version number and exit

but then... who has time to write all those switches..

docopt

"""Naval Fate.

Usage:
  naval_fate.py ship new <name>...
  naval_fate.py ship <name> move <x> <y> [--speed=<kn>]
  naval_fate.py ship shoot <x> <y>
  naval_fate.py mine (set|remove) <x> <y> [--moored | --drifting]
  naval_fate.py (-h | --help)
  naval_fate.py --version

Options:
  -h --help     Show this screen.
  --version     Show version.
  --speed=<kn>  Speed in knots [default: 10].
  --moored      Moored (anchored) mine.
  --drifting    Drifting mine.

"""
from docopt import docopt


if __name__ == '__main__':
    arguments = docopt(__doc__, version='Naval Fate 2.0')
    print(arguments)

Where to go from here

http://datascienceatthecommandline.com/

Summary

It is still programming
Embrace the chaos
Plumbing vs Architecting vs Building
Document your steps ("Repeatable")
Python is an expressive language. Learn it. Use it.

Thanks!

Contact:

@btbytes

http://github.com/btbytes

pradeep@btbytes.com

Slides will be on indypy's github repo

Plumbing for data science

By Pradeep Gowda

Plumbing for data science

Pythology Lecture Series - May 1, 2015

2,958

Pythology

Plumbing for Data Science

Pradeep Gowda

Proofpoint Inc

Working with data is messy

Be language and tool agnostic

Get familiar with the command line

The data landscape

Data sources

Data Formats

Tools of the trade

What's in your toolkit?

Command line tools

Why Python

Batteries included

Third party Libraries

Text Encoding

Text: Regular Expression

Time

Time: Arrow

Databases

SQL Skills

Command line tools in Python

Where to go from here

Summary

Thanks!

Plumbing for data science

More from Pradeep Gowda