Pythology
CC: https://flic.kr/p/8kEcVA
Plumbing for Data Science
Pradeep Gowda
Proofpoint Inc
@btbytes
Data Science
Drew Conway, IQT Quarterly, 2011
Courtesy: Oracle
Source: http://columbiadatascience.com/
Exploration
Engineering
Machine Learning, EDA etc.,
"Big Data"
Data Plumbing
Quick turnaround?
Automation
Framework
Working with data is messy
Be language and tool agnostic
Get familiar with the command line
UNIX metaphors - pipe, filter, composable programs
The data landscape
Data sources
- Server logs
- Sensors
- Human generated (twitter, geolocation)
- APIs
- Databases
- RDBMS
- No/NotOnly/New SQL databases
Data Formats
- API
- JSON
- XML
- HTML - web scraping
- Delimited files
- Binary
- XLS
- Legacy
- "unstructured" data
Tools of the trade
Skill sets
- Basic programming ability
- procedural
- object based
- scripting
- Text processing
- Network / HTTP
- Databases (and some SQL)
What's in your toolkit?
- Command line tools
- Programming language
- Python. Duh!
- Data analysis Tools
- PyData, R etc.,
- Language Libraries
- Database query tools
- Spreadsheet (like) programs
Command line tools
- wget
- curl
- head/tail
- grep
- sort
- cut
- sed
- awk
- uniq
Of the UNIX heritage
Third party tools: jq, CSVKit etc.,
Demo time
at the end... time permitting
Why Python
- Community
- Availability
- Approachability
- Multi-faceted
- Well understood (limitations and strengths)
- A fair balance of expressiveness, power, permissibility
- Team player
Batteries included
- String
- Process
- OS
- CSV
- Glob
- Arg/Opt parse
- ConfigParse
- Json
- XML/ElementTree
- Collections
Third party Libraries
- Requests - HTTP client
- BeautifulSoup - HTML scraping
- Arrow - date and time
- SQLAlchemy - database access
# Installing third party packages:
pip install arrow
pip install requests
# PROTIP: look up virtualenv and virtualenvwrapper
# Alternate: conda [part of Continuum.io / anaconda package]
Text Encoding
Text: Regular Expression
For parsing multi-line structured records: see pyparsing
ACME-1245:V2
import re
str = 'ACME-1245:V2'
match = re.search(r'\w{4}-\d{4}:\w+', str)
if match:
print 'found', match.group()
else:
print 'did not find'
Four-letter identifier followed by a "dash" followed by a 4 digit number followed by a "colon" followed by another string
Time
strptime
PROTIP: strftime.org
# From string to time
from time import strptime
print strptime('2015-05-01', '%y-%m-%d')
time.struct_time(tm_year=2015, tm_mon=5, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=4, tm_yday=121, tm_isdst=-1)
#From time to string
from time import strftime
print strftime('%Y-%m-%dT%H:%M:%S', datetime.timetuple(datetime.now()))
Time: Arrow
>>> import arrow
>>> utc = arrow.utcnow()
>>> utc
<Arrow [2013-05-11T21:23:58.970460+00:00]>
>>> utc = utc.replace(hours=-1)
>>> utc
<Arrow [2013-05-11T20:23:58.970460+00:00]>
>>> local = utc.to('US/Pacific')
>>> local
<Arrow [2013-05-11T13:23:58.970460-07:00]>
>>> arrow.get('2013-05-11T21:23:58.970460+00:00')
<Arrow [2013-05-11T21:23:58.970460+00:00]>
>>> local.timestamp
1368303838
>>> local.format('YYYY-MM-DD HH:mm:ss ZZ')
'2013-05-11 13:23:58 -07:00'
>>> local.humanize()
'an hour ago'
Fetching Data off the web
import requests
import json
r = requests.get('http://httpbin.org/get')
if r.status_code == 200:
data = json.loads(r.text)
print "Your IP is: ", data['origin']
print "FYI, the text response: \n", r.text
print "prettified json: \n", json.dumps(data, sort_keys=True,
indent=4, separators=(',', ': '))
Requests/GET
Your IP is: 108.223.55.55
FYI, the text response:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.5.1 CPython/2.7.9 Darwin/14.3.0"
},
"origin": "108.223.55.55",
"url": "http://httpbin.org/get"
}
prettified json:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.5.1 CPython/2.7.9 Darwin/14.3.0"
},
"origin": "108.223.55.55",
"url": "http://httpbin.org/get"
}
import requests
data = {'name': 'Who?', 'age': 55}
r = requests.get('http://example.com', data=data)
r.text
Requests/POST
Interacting with APIs
httpie
comes in handy for automating
- testing / learning new APIs
- data downloads
- automating periodic data fetches
http://httpbin.org/
# You can install httpbin as a library from PyPI and run it as a WSGI app.
$ pip install httpbin
$ gunicorn httpbin:app
Databases
"Data comes in, Data goes out..."
"You can't explain that!"
SQL. Learn it.
PostgreSQL
Database recommendation? one word...
SQL Skills
- SELECT statement
- WHERE clause
- JOINs
- Basics of indexing
Command line tools in Python
Let the program do one small thing well
Make it configurable via the command line switches
Make it configurable via a configuration file
Python libraries for building
Command line tools
Inbuilt Libraries
- fileinput
- argparse
Thirdparty Libraries
- docopt
fileinput
#!/usr/bin/env python
# Program to print words that start with an uppercase letter
# example usage: cat test.txt | ./filterr
import fileinput
import string
for line in fileinput.input():
words = line.split()
for word in words:
first_letter = ord(word[0])
if first_letter >=65 and first_letter <=90:
print word
cat ~/tmp/habits.txt| ./filterr
Habits
John
Doe
March
argparse
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('-s', action='store', dest='simple_value',
help='Store a simple value')
parser.add_argument('-c', action='store_const', dest='constant_value',
const='value-to-store',
help='Store a constant value')
parser.add_argument('-t', action='store_true', default=False,
dest='boolean_switch',
help='Set a switch to true')
parser.add_argument('-f', action='store_false', default=False,
dest='boolean_switch',
help='Set a switch to false')
parser.add_argument('-a', action='append', dest='collection',
default=[],
help='Add repeated values to a list',
)
parser.add_argument('-A', action='append_const', dest='const_collection',
const='value-1-to-append',
default=[],
help='Add different values to list')
parser.add_argument('-B', action='append_const', dest='const_collection',
const='value-2-to-append',
help='Add different values to list')
parser.add_argument('--version', action='version', version='%(prog)s 1.0')
results = parser.parse_args()
print 'simple_value =', results.simple_value
print 'constant_value =', results.constant_value
print 'boolean_switch =', results.boolean_switch
print 'collection =', results.collection
print 'const_collection =', results.const_collection
argparse ..
$ python argparse_action.py -h
usage: argparse_action.py [-h] [-s SIMPLE_VALUE] [-c] [-t] [-f]
[-a COLLECTION] [-A] [-B] [--version]
optional arguments:
-h, --help show this help message and exit
-s SIMPLE_VALUE Store a simple value
-c Store a constant value
-t Set a switch to true
-f Set a switch to false
-a COLLECTION Add repeated values to a list
-A Add different values to list
-B Add different values to list
--version show program's version number and exit
but then... who has time to write all those switches..
docopt
"""Naval Fate.
Usage:
naval_fate.py ship new <name>...
naval_fate.py ship <name> move <x> <y> [--speed=<kn>]
naval_fate.py ship shoot <x> <y>
naval_fate.py mine (set|remove) <x> <y> [--moored | --drifting]
naval_fate.py (-h | --help)
naval_fate.py --version
Options:
-h --help Show this screen.
--version Show version.
--speed=<kn> Speed in knots [default: 10].
--moored Moored (anchored) mine.
--drifting Drifting mine.
"""
from docopt import docopt
if __name__ == '__main__':
arguments = docopt(__doc__, version='Naval Fate 2.0')
print(arguments)
Where to go from here
Summary
- It is still programming
- Embrace the chaos
- Plumbing vs Architecting vs Building
- Document your steps ("Repeatable")
- Python is an expressive language. Learn it. Use it.
Thanks!
Contact:
@btbytes
http://github.com/btbytes
pradeep@btbytes.com
Slides will be on indypy's github repo
Plumbing for data science
By Pradeep Gowda
Plumbing for data science
Pythology Lecture Series - May 1, 2015
- 2,197