(Pronounced PAH-DER - I'm Irish)!!
All views are my own and do not represent any future, current or past employers.
Contributor to PyMC3 and other open source software
Author and Speaker at PyData and EuroSciPy
Check out 'Interviews with Data Scientists' - 24 data scientists interviewed - proceeds go to NumFOCUS
I joined Channel 4 in early April as a Senior Data Scientist to work on customer segmentation and recommendation engines
Channel 4 is an award-winning not-for-profit TV and digital channel, famous for Father Ted, The IT Crowd and many other shows.
Version 3 is the way forward!
And many others.
Open Source can't thrive without industrial and academic support
Thanks to these guys and girls...
And many many more...
Improvements throughout the stack
New Matplotlib colours, a new SymPy release, improvements in NumPy
New @ operator in NumPy
Assign and pipe in Pandas (quick sketches of both below)
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
df.assign(ln_A_plus_1=lambda x: np.log(x.A) + 1)
This creates a copy of the dataframe with a nice new column.
Really useful for percentages, logarithms etc. - standard financial analysis and data analysis stuff.
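Quick sketches of the @ operator and pipe mentioned above - the zscore helper here is hypothetical, just to show the shape of the API:

import numpy as np
import pandas as pd

# @ is infix matrix multiplication (PEP 465, Python 3.5+)
A = np.arange(6).reshape(2, 3)
B = np.arange(6).reshape(3, 2)
A @ B  # same as A.dot(B)

# pipe threads a DataFrame through ordinary functions
def zscore(df, col):  # hypothetical helper
    return df.assign(**{col + '_z': (df[col] - df[col].mean()) / df[col].std()})

df = pd.DataFrame({'A': range(1, 11)})
df.pipe(zscore, 'A')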
Adult data set
data
age workclass fnlwgt education-categorical educ marital-status
0 39 State-gov 77516 Bachelors 13 Never-married
2 38 Private 215646 HS-grad 9 Divorced
3 53 Private 234721 11th 7 Married-civ-spouse
4 28 Private 338409 Bachelors 13 Married-civ-spouse
5 37 Private 284582 Masters 14 Married-civ-spouse
6 49 Private 160187 9th 5 Married-spouse-absent
Source: UCI Adult data set; CSV version here: http://pymc-devs.github.io/pymc3/Bayesian_LogReg/
I'm stuck on a restricted machine and I only have Python 2.6
(Example shamelessly stolen from Rob Story and adapted for my data set)
import csv

conversion_map = {
    'age': int,
    'workclass': str,
    'fnlwgt': int,
    'education-categorical': str,
    'educ': int,
    'occupation': str,
    'sex': str,
    'capital-gain': float,
    'capital-loss': float,
    'hours': int,
    'native-country': str,
    'income': str
}
Write a conversion map, then use the csv module to load the CSV data source.
def converter(type_map, row):
    """Yep, we need to roll our own type conversions."""
    converted_row = {}
    for col, val in row.items():
        convert = type_map.get(col)  # renamed so it doesn't shadow converter()
        if convert:
            converted_row[col] = convert(val)
        else:
            converted_row[col] = val
    return converted_row

with open('adult.csv', 'r') as f:
    reader = csv.DictReader(f)
    adult2 = [converter(conversion_map, r) for r in reader]
How does it look?
>>> adult2[:2]
[{'': '0',
'age': 39,
'capital-loss': 0.0,
'captial-gain': '2174',
'educ': 13,
'education-categorical': ' Bachelors',
'fnlwgt': 77516,
'hours': 40,
'income': ' <=50K',
'marital-status': ' Never-married',
'native-country': ' United-States',
'occupation': ' Adm-clerical',
'relationship': ' Not-in-family',
'sex': ' Male',
'workclass': ' State-gov'},
Note that 'captial-gain' stayed a string: the raw column header is misspelled, so it missed the conversion map.
I want to get the maximum age in my dataset
def get_max_age():
    max_age = 0
    for row in adult2:
        if row['age'] > 1 and row['age'] > max_age:
            max_age = row['age']
    return max_age
>>> get_max_age()
90
# Or you could do it with a generator expression
>>> max(row['age'] for row in adult2 if row['age'] > 1)
90
Let's say you wanted to group things
# defaultdict is awesome. defaultdict is awesome.
from collections import defaultdict
def grouper(grouping_col, seq):
    """People have definitely written a faster version than what I'm about to write.
    Thanks to Rob Story for this one."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in seq:
        group = groups[row[grouping_col]]
        for k, v in row.items():
            if k != grouping_col:
                group[k].append(v)
    return groups
>>> groups = grouper('occupation', adult2)
A natural question: what is the mean number of hours worked by occupation?
summary = {}
for group, values in groups.items():
    summary[group] = sum(values['hours']) / len(values['hours'])
>>> summary
{' ?': 31.90613130765057,
' Adm-clerical': 37.55835543766578,
' Armed-Forces': 40.666666666666664,
' Craft-repair': 42.30422054159551,
' Exec-managerial': 44.9877029021151,
' Farming-fishing': 46.989939637826964,
' Handlers-cleaners': 37.947445255474456,
' Machine-op-inspct': 40.755744255744254,
' Other-service': 34.70166919575114,
' Priv-house-serv': 32.88590604026846,
' Prof-specialty': 42.38671497584541,
' Protective-serv': 42.87057010785824,
' Sales': 40.78109589041096,
' Tech-support': 39.432112068965516,
' Transport-moving': 44.65623043206011}
It's common advice, but it's worth being aware of itertools if you find yourself writing code like this.
PSA: PyToolz is awesome - it lets you use functional programming techniques in Python.
http://toolz.readthedocs.org/en/latest/index.html
I want to make it faster - I'll use CyToolz
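CyToolz reimplements the Toolz API in Cython, so - assuming it's installed - the swap is a single import:

import cytoolz as tz  # drop-in replacement for `import toolz as tz`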
# I want to see the frequencies of ages in the dataset
>>> tz.frequencies([r['age'] for r in adult2])
# Toolz has currying!
# I want to count occupations among people with more than 15 years of education
import toolz.curried as tzc
>>> tzc.pipe(adult2,
             tzc.filter(lambda r: r['educ'] > 15),
             tzc.map(lambda r: (r['occupation'],)),
             tzc.countby(lambda r: r[0]),
             dict)
{' ?': 15,
' Adm-clerical': 5,
' Craft-repair': 2,
' Exec-managerial': 55,
' Farming-fishing': 1,
' Machine-op-inspct': 1,
' Other-service': 1,
' Prof-specialty': 321,
' Sales': 8,
' Tech-support': 3,
' Transport-moving': 1}
Toolz has some great virtues
Not going to talk too much about Pandas in this talk.
It is fast becoming a stable and core member of the PyData stack
Really useful for indexed data like time series data or csv file data
Statsmodels and seaborn already consider it a core member of the stack
# One little example of the power of the Pandas API
# (the 'captial-gain' spelling comes straight from the raw data)
>>> adult.groupby('educ').mean()
            age         fnlwgt  captial-gain  capital-loss      hours
educ
1 42.764706 235889.372549 898.392157 66.490196 36.647059
2 46.142857 239303.000000 125.875000 48.327381 38.255952
3 42.885886 232448.333333 176.021021 68.252252 38.897898
4 48.445820 188079.171827 233.939628 65.668731 39.366873
5 41.060311 202485.066148 342.089494 28.998054 38.044747
Labelled heterogeneous data
NumPy arrays plus labels - excellent for 'scientific data' :) or multi-indexed data
I have weather forecasting data in NetCDF - this is what you use (a quick reading sketch follows the examples below)
import numpy as np
import xray  # now developed under the name xarray

arr = np.array([[1, 2, 3, 4],
                [10, 20, 30, 40],
                [100, 200, 300, 400]])
dim0_coords = ['a', 'b', 'c']
dim1_coords = ['foo', 'bar', 'baz', 'qux']
da = xray.DataArray(arr, [('x', dim0_coords), ('y', dim1_coords)])
da
da.loc['b']
There are plenty of examples in the notebooks
>>> da[0:3]
<xarray.DataArray (x: 3, y: 4)>
array([[ 1, 2, 3, 4],
[ 10, 20, 30, 40],
[100, 200, 300, 400]])
Coordinates:
* x (x) <U1 'a' 'b' 'c'
* y (y) <U3 'foo' 'bar' 'baz' 'qux'
>>> da.dims
('x', 'y')
>>> da.coords
Coordinates:
* x (x) <U1 'a' 'b' 'c'
* y (y) <U3 'foo' 'bar' 'baz' 'qux'
# Get a mean by label
>>> da.mean(dim='y')
<xarray.DataArray (x: 3)>
array([ 2.5, 25. , 250. ])
Coordinates:
* x (x) <U1 'a' 'b' 'c'
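Since I mentioned NetCDF: a minimal reading sketch, where the filename and the 'time' dimension are assumptions for illustration:

import xray

ds = xray.open_dataset('forecast.nc')  # hypothetical NetCDF file
ds.mean(dim='time')                    # label-based reduction, as above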
import blaze as bz
bz_adult = bz.symbol('adult2', bz.discover(adult))
>>> type(bz_adult)
blaze.expr.expressions.Symbol
>>> mean_age = bz.by(bz_adult.occupation,
                     price=bz_adult.age.mean())
>>> hours_count = bz.by(bz_adult[bz_adult.hours > 35].educ,
                        count=bz_adult.workclass.count())
# We haven't actually computed anything yet!
# Let's make Pandas compute it.
bz.compute(mean_age, adult)
# This counts people by years of education,
# restricted to those working more than 35 hours per week.
>>> bz.compute(hours_count, adult)
educ count
0 1 51
1 2 168
2 3 333
3 4 646
4 5 514
5 6 933
6 7 1175
7 8 433
# Blaze/Odo make it easy to move data between containers
# Note: the empty adult2 table already exists in Postgres
pg_datasource = bz.odo(adult,
                       "postgresql://peadarcoyle@localhost/pydata::adult2")
# Now we're going to use Postgres as our computation engine
result = bz.compute(hours_count, pg_datasource)
result
<sqlalchemy.sql.selectable.Select at 0x113ae4390; Select object>
# I don't want a selectable. I want a DataFrame
# odo again
bz.odo(bz.compute(hours_count, pg_datasource), pd.DataFrame)
educ count
0 8 433
1 16 413
2 15 576
3 4 646
4 1 51
Let's store it in Bcolz (we'll see Bcolz and ctable, the storage format, shortly)
import bcolz
>>> %time bz.odo(adult, 'adult.bcolz')
CPU times: user 10.3 s, sys: 18.1 s, total: 28.4 s
Wall time: 28.8 s
Out[55]:
ctable((32561,), [('age', '<i8'), ('workclass', 'O'), ('fnlwgt', '<i8'),
('educationcategorical', 'O'), ('educ', '<i8'), ('maritalstatus', 'O'),
('occupation', 'O'), ('relationship', 'O'), ('sex', 'O'), ('captialgain', '<i8'),
('capitalloss', '<i8'), ('hours', '<i8'), ('nativecountry', 'O'), ('income', 'O')])
nbytes: 7.76 MB; cbytes: 43.54 MB; ratio: 0.18
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
rootdir := 'adult.bcolz'
[ (39, ' State-gov', 77516, ' Bachelors', 13, ' Never-married', ' Adm-clerical',
' Not-in-family', ' Male', 2174, 0, 40, ' United-States', ' <=50K')
(50, ' Self-emp-not-inc', 83311, ' Bachelors', 13, ' Married-civ-spouse',
' Exec-managerial', ' Husband', ' Male', 0, 0, 13, ' United-States', ' <=50K')
(38, ' Private', 215646, ' HS-grad', 9, ' Divorced', ' Handlers-cleaners',
' Not-in-family', ' Male', 0, 0, 40, ' United-States', ' <=50K')
...,
(58, ' Private', 151910, ' HS-grad', 9, ' Widowed', ' Adm-clerical',
' Unmarried', ' Female', 0, 0, 40, ' United-States', ' <=50K')
(22, ' Private', 201490, ' HS-grad', 9, ' Never-married', ' Adm-clerical',
' Own-child', ' Male', 0, 0, 20, ' United-States', ' <=50K')
(52, ' Self-emp-inc', 287927, ' HS-grad', 9, ' Married-civ-spouse',
' Exec-managerial', ' Wife', ' Female', 15024, 0, 40, ' United-States', ' >50K')]
You can use any SQL supported by SQLAlchemy as your computation. It also supports Python lists, Spark DataFrames, MongoDB, Numpy arrays...
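For instance, a minimal odo sketch (the JSON filename is just illustrative):

from odo import odo
import pandas as pd

df = odo('adult.csv', pd.DataFrame)  # CSV on disk -> DataFrame
odo(df, 'adult.json')                # DataFrame -> JSON on disk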
bcolz is a columnar data store for fast data storage and retrieval with built-in high performance compression. It supports both in-memory and out-of-memory storage and operations. Cf. http://bcolz.blosc.org/.
# the four columns we need from the OpenStreetMap points-of-interest dump
columns = ['name', 'amenity', 'Longitude', 'Latitude']
df_poiworld = pd.read_csv('POIWorld.csv', usecols=columns)
dc = bcolz.ctable.fromdataframe(df_poiworld)
dc
ctable((9140052,), [('name', 'O'), ('amenity', 'O'),
('Longitude', '<f8'), ('Latitude', '<f8')])
nbytes: 575.61 MB; cbytes: 3.00 GB; ratio: 0.19
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[(nan, 'post_box', -0.20698000000000003, 51.9458753)
(nan, 'post_box', -0.268633, 51.938183)
(nan, 'post_box', -0.274278, 51.930209999999995) ...,
(nan, nan, -77.2697855, 39.24023820000001)
(nan, nan, -77.2777191, 39.237238399999995)
(nan, 'drinking_water', -5.8, nan)]
# (dc here is the adult ctable from earlier, not POIWorld)
>>> dc.cols
age : carray((32561,), int64)
nbytes: 254.38 KB; cbytes: 256.00 KB; ratio: 0.99
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[39 50 38 ..., 58 22 52]
workclass : carray((32561,), |S17)
nbytes: 540.56 KB; cbytes: 303.83 KB; ratio: 1.78
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[b' State-gov' b' Self-emp-not-inc' b' Private' ..., b' Private'
b' Private' b' Self-emp-inc']
educ : carray((32561,), int64)
nbytes: 254.38 KB; cbytes: 256.00 KB; ratio: 0.99
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[13 13 9 ..., 9 9 9]
occupation : carray((32561,), |S18)
nbytes: 572.36 KB; cbytes: 338.49 KB; ratio: 1.69
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[b' Adm-clerical' b' Exec-managerial' b' Handlers-cleaners' ...,
b' Adm-clerical' b' Adm-clerical' b' Exec-managerial']
sex : carray((32561,), |S7)
nbytes: 222.58 KB; cbytes: 256.00 KB; ratio: 0.87
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[b' Male' b' Male' b' Male' ..., b' Female' b' Male' b' Female']
hours : carray((32561,), int64)
nbytes: 254.38 KB; cbytes: 256.00 KB; ratio: 0.99
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[40 13 40 ..., 40 20 40]
# Generate 1 GB of data
>>> import bcolz
>>> N = 100000 * 1000
>>> %time ct = bcolz.fromiter(((i, i ** 2) for i in range(N)),
                              dtype="i4, i8",
                              count=N,
                              cparams=bcolz.cparams(clevel=9))
CPU times: user 59.6 s, sys: 1.08 s, total: 1min
Wall time: 59.1 s
>>> ct
ctable((100000000,), [('f0', '<i4'), ('f1', '<i8')])
nbytes: 1.12 GB; cbytes: 151.84 MB; ratio: 7.54
cparams := cparams(clevel=9, shuffle=True, cname='blosclz')
[(0, 0) (1, 1) (2, 4) ..., (99999997, 9999999400000009)
 (99999998, 9999999600000004) (99999999, 9999999800000001)]
That's roughly 7x compression, in memory.
You can also store on disk and read it back fast - a quick sketch:
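Assuming write access to the working directory:

# copy the in-memory ctable to disk, then reopen it later
ct_disk = ct.copy(rootdir='ct_disk')
ct2 = bcolz.open('ct_disk')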
>>> %time ct.eval('f0 ** 2 + sqrt(f1)')
CPU times: user 4.38 s, sys: 1.96 s, total: 6.34 s
Wall time: 1.26 s
Out[36]:
carray((100000000,), float64)
nbytes: 762.94 MB; cbytes: 347.33 MB; ratio: 2.20
cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[ 0.00000000e+00 2.00000000e+00 6.00000000e+00 ..., 1.37491943e+09
1.57491943e+09 1.77491942e+09]
Fast numerical calculations
Integration with Numexpr to handle expressions
Intelligent use of caching and multithreading to optimize numerical calculations
# You can do DataFrame-like stuff
# (the query is a string expression, evaluated by numexpr)
>>> dc["workclass == ' State-gov'"][0]
Out[117]:
(39, b' State-gov', 13, b' Adm-clerical', b' Male', 40)
PSA: Bcolz version 1 release candidate is out
There are some challenges integrating it with the rest of PyData, but this should stabilize.
Quantopian Inc, a crowd-sourced hedge fund, uses Bcolz.
import numpy as np
import dask.array as da

# assume `a` is a NumPy array from an earlier slide, e.g.:
a = np.random.random(1000)

# create a dask array from the above array
a2 = da.from_array(a, chunks=200)
# multiply this array by a factor
b2 = a2 * 4
# find the minimum value - this is lazy, so call compute() for the value
b2_min = b2.min()
print(b2_min.compute())
import dask.dataframe as dd

# assume `with_amenity` is a dask DataFrame built earlier
# from POIWorld.csv, holding the rows that have an amenity tag
# I want to tell if each point is a school or not, then plot it on a map
>>> is_school = with_amenity.amenity.str.contains('[Ss]chool')
>>> school = with_amenity[is_school]
# Very similar to pandas, but you need to call compute on the dask objects
>>> dd.compute(school.amenity.count())
(342025,)
# So we have about 342k schools in
# UK and Ireland in the OpenStreetMap project
>>> lon, lat = dd.compute(school.Longitude,
                          school.Latitude)
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

fig, ax = plt.subplots(figsize=(10, 15))
m = Basemap(projection='mill',
            lon_0=-5.23636, lat_0=53.866772,
            llcrnrlon=-10.65073, llcrnrlat=49.16209,
            urcrnrlon=1.76334, urcrnrlat=60.860699)
m.drawmapboundary(fill_color='#ffffff', linewidth=.0)
x, y = m(lon.values, lat.values)
m.scatter(x, y, s=1, marker=',', color="steelblue", alpha=0.6);
Compute in Dask and plot in Matplotlib
Notice how similar to Pandas and NumPy the API is.
UK and Irish schools in Open Street Map
Very exciting technology for the JVM community
Improvements in PySpark and interoperability
Improvements in Machine Learning libraries
Comes into its own with lots of JSON blobs on many nodes
Dramatic speed improvements for the 'easy to distribute' problems
Source: Wes McKinney
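A minimal PySpark sketch of that 'JSON blobs on many nodes' case - the path, the column name, and the session setup (Spark 2.0+) are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('json-blobs').getOrCreate()
events = spark.read.json('hdfs:///data/events/*.json')  # hypothetical path
events.groupBy('occupation').count().show()             # hypothetical column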
import ibis
# assuming a SQLite connection (hypothetical path):
con = ibis.sqlite.connect('pokemon.db')

# This is a Pokemon table in SQLite
rounds = con.table('pokemon_types')
rounds.info()
rounds.slot.value_counts()
slot count
0 1 784
1 2 395
SQLite in the background but could be Impala - all with a pandas like API
Wouldn't it be great to have a map of the stack?
I had a go.
import numpy as np
import pymc3 as pm

# restrict to US records (note the assignment, so the filter sticks)
data = data[data['native-country'] == " United-States"]
income = 1 * (data['income'] == " >50K")
age2 = np.square(data['age'])
data = data[['age', 'educ', 'hours']]
data['age2'] = age2
data['income'] = income

with pm.Model() as logistic_model:
    pm.glm.glm('income ~ age + age2 + educ + hours',
               data, family=pm.glm.families.Binomial())
    trace_logistic_model = pm.sample(2000,
                                     pm.NUTS(), progressbar=True)
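Once sampling finishes, a quick sketch of inspecting the result with PyMC3's built-ins:

pm.summary(trace_logistic_model)     # posterior means and intervals
pm.traceplot(trace_logistic_model)   # visual check of the chains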
Production-ready NLP toolkits, all open source
Lasagne
So cite, send pull requests and/or help NumFOCUS!