CC: https://flic.kr/p/8kEcVA
@btbytes
Data Science
Drew Conway, IQT Quarterly, 2011
Courtesy: Oracle
Source: http://columbiadatascience.com/
Exploration
Engineering
Machine Learning, EDA etc.,
"Big Data"
Data Plumbing
Quick turnaround?
Automation
Framework
UNIX metaphors - pipe, filter, composable programs
Skill sets
Of the UNIX heritage
Third party tools: jq, CSVKit etc.,
Demo time
at the end... time permitting
# Installing third party packages:
pip install arrow
pip install requests
# PROTIP: look up virtualenv and virtualenvwrapper
# Alternate: conda [part of Continuum.io / anaconda package]
For parsing multi-line structured records: see pyparsing
ACME-1245:V2
import re
str = 'ACME-1245:V2'
match = re.search(r'\w{4}-\d{4}:\w+', str)
if match:
print 'found', match.group()
else:
print 'did not find'
Four-letter identifier followed by a "dash" followed by a 4 digit number followed by a "colon" followed by another string
strptime
PROTIP: strftime.org
# From string to time
from time import strptime
print strptime('2015-05-01', '%y-%m-%d')
time.struct_time(tm_year=2015, tm_mon=5, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=4, tm_yday=121, tm_isdst=-1)
#From time to string
from time import strftime
print strftime('%Y-%m-%dT%H:%M:%S', datetime.timetuple(datetime.now()))
>>> import arrow
>>> utc = arrow.utcnow()
>>> utc
<Arrow [2013-05-11T21:23:58.970460+00:00]>
>>> utc = utc.replace(hours=-1)
>>> utc
<Arrow [2013-05-11T20:23:58.970460+00:00]>
>>> local = utc.to('US/Pacific')
>>> local
<Arrow [2013-05-11T13:23:58.970460-07:00]>
>>> arrow.get('2013-05-11T21:23:58.970460+00:00')
<Arrow [2013-05-11T21:23:58.970460+00:00]>
>>> local.timestamp
1368303838
>>> local.format('YYYY-MM-DD HH:mm:ss ZZ')
'2013-05-11 13:23:58 -07:00'
>>> local.humanize()
'an hour ago'
Fetching Data off the web
import requests
import json
r = requests.get('http://httpbin.org/get')
if r.status_code == 200:
data = json.loads(r.text)
print "Your IP is: ", data['origin']
print "FYI, the text response: \n", r.text
print "prettified json: \n", json.dumps(data, sort_keys=True,
indent=4, separators=(',', ': '))
Requests/GET
Your IP is: 108.223.55.55
FYI, the text response:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.5.1 CPython/2.7.9 Darwin/14.3.0"
},
"origin": "108.223.55.55",
"url": "http://httpbin.org/get"
}
prettified json:
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.5.1 CPython/2.7.9 Darwin/14.3.0"
},
"origin": "108.223.55.55",
"url": "http://httpbin.org/get"
}
import requests
data = {'name': 'Who?', 'age': 55}
r = requests.get('http://example.com', data=data)
r.text
Requests/POST
Interacting with APIs
httpie
comes in handy for automating
http://httpbin.org/
# You can install httpbin as a library from PyPI and run it as a WSGI app.
$ pip install httpbin
$ gunicorn httpbin:app
"Data comes in, Data goes out..."
"You can't explain that!"
SQL. Learn it.
PostgreSQL
Database recommendation? one word...
Let the program do one small thing well
Make it configurable via the command line switches
Make it configurable via a configuration file
Python libraries for building
Command line tools
Inbuilt Libraries
Thirdparty Libraries
fileinput
#!/usr/bin/env python
# Program to print words that start with an uppercase letter
# example usage: cat test.txt | ./filterr
import fileinput
import string
for line in fileinput.input():
words = line.split()
for word in words:
first_letter = ord(word[0])
if first_letter >=65 and first_letter <=90:
print word
cat ~/tmp/habits.txt| ./filterr
Habits
John
Doe
March
argparse
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('-s', action='store', dest='simple_value',
help='Store a simple value')
parser.add_argument('-c', action='store_const', dest='constant_value',
const='value-to-store',
help='Store a constant value')
parser.add_argument('-t', action='store_true', default=False,
dest='boolean_switch',
help='Set a switch to true')
parser.add_argument('-f', action='store_false', default=False,
dest='boolean_switch',
help='Set a switch to false')
parser.add_argument('-a', action='append', dest='collection',
default=[],
help='Add repeated values to a list',
)
parser.add_argument('-A', action='append_const', dest='const_collection',
const='value-1-to-append',
default=[],
help='Add different values to list')
parser.add_argument('-B', action='append_const', dest='const_collection',
const='value-2-to-append',
help='Add different values to list')
parser.add_argument('--version', action='version', version='%(prog)s 1.0')
results = parser.parse_args()
print 'simple_value =', results.simple_value
print 'constant_value =', results.constant_value
print 'boolean_switch =', results.boolean_switch
print 'collection =', results.collection
print 'const_collection =', results.const_collection
argparse ..
$ python argparse_action.py -h
usage: argparse_action.py [-h] [-s SIMPLE_VALUE] [-c] [-t] [-f]
[-a COLLECTION] [-A] [-B] [--version]
optional arguments:
-h, --help show this help message and exit
-s SIMPLE_VALUE Store a simple value
-c Store a constant value
-t Set a switch to true
-f Set a switch to false
-a COLLECTION Add repeated values to a list
-A Add different values to list
-B Add different values to list
--version show program's version number and exit
but then... who has time to write all those switches..
docopt
"""Naval Fate.
Usage:
naval_fate.py ship new <name>...
naval_fate.py ship <name> move <x> <y> [--speed=<kn>]
naval_fate.py ship shoot <x> <y>
naval_fate.py mine (set|remove) <x> <y> [--moored | --drifting]
naval_fate.py (-h | --help)
naval_fate.py --version
Options:
-h --help Show this screen.
--version Show version.
--speed=<kn> Speed in knots [default: 10].
--moored Moored (anchored) mine.
--drifting Drifting mine.
"""
from docopt import docopt
if __name__ == '__main__':
arguments = docopt(__doc__, version='Naval Fate 2.0')
print(arguments)
Contact:
@btbytes
http://github.com/btbytes
pradeep@btbytes.com
Slides will be on indypy's github repo