Pythonic Data Science

from __future__ import machine_learning

Trey Causey

@treycausey

Who am I?

  • Data Scientist at Facebook Seattle
     
  • Former Senior Data Scientist at zulily
     
  • Statistical consultant for an NFL team

Get data

Build model

Predict

From NumPy...

... comes PyData

pandas

scikit-learn

pandas

  • DataFrames

  • Fast ETL operations

  • Descriptive statistics

pd.read_csv()
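A minimal sketch of loading data with pd.read_csv(). The column names and rows are made up for illustration and are not the dataset from the talk; io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical play-by-play rows; every column name here is an assumption.
csv_text = """game_id,quarter,seconds_left,score_diff,down,yards_to_go,offense_won
1,1,3600,0,1,10,1
1,2,1800,7,2,5,1
2,1,3500,0,1,10,0
2,4,120,-3,3,8,0
"""

# read_csv accepts a path, a URL, or a file-like object.
plays = pd.read_csv(io.StringIO(csv_text))

print(plays.shape)                       # rows, columns
print(plays.describe())                  # descriptive statistics per numeric column
print(plays.groupby("offense_won")["score_diff"].mean())
```

With a real file, the StringIO object would simply be replaced by the file path.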

NFL data!

(with pre-completed feature engineering)

scikit-learn

fast ML

algorithms += n

consistent API

open source

Classification

Regression

Clustering

Dimensionality reduction

Preprocessing

dat API tho!

  • Estimators extend BaseEstimator

  • Pretty much everything's an estimator

  • Estimators have fit(X, y=None); transformers add transform(), predictors add predict()

  • Some have predict_proba() methods
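The consistent API described above can be sketched with two stock estimators. The data is synthetic, and the choice of StandardScaler and LogisticRegression is only an example, not necessarily what the talk used.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 200 rows, 3 features, binary label (illustrative only).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

scaler = StandardScaler()      # a transformer: fit() + transform()
clf = LogisticRegression()     # a predictor: fit() + predict() + predict_proba()

# Both derive from BaseEstimator.
print(isinstance(scaler, BaseEstimator), isinstance(clf, BaseEstimator))

X_scaled = scaler.fit(X).transform(X)   # fit() returns self, so calls chain
clf.fit(X_scaled, y)

labels = clf.predict(X_scaled)          # hard class labels
probs = clf.predict_proba(X_scaled)     # one column of probabilities per class
print(labels[:5], probs[:5])
```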

NFL win probability

Classification problem

Output a probability per-play

Pretend the plays are independent (shhh...)
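A rough sketch of the setup above: treat each play as an independent row, fit a classifier on whether the team went on to win, and read off a probability per play. The features (score_diff, minutes_left), the synthetic outcome, and the use of logistic regression are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)
n_plays = 500

# Hypothetical per-play features; not the real NFL data.
plays = pd.DataFrame({
    "score_diff": rng.randint(-21, 22, size=n_plays),
    "minutes_left": rng.randint(0, 61, size=n_plays),
})

# Made-up outcome: teams that lead are more likely to win.
p_true = 1 / (1 + np.exp(-0.2 * plays["score_diff"]))
plays["won"] = (rng.uniform(size=n_plays) < p_true).astype(int)

features = ["score_diff", "minutes_left"]
model = LogisticRegression(max_iter=1000)
model.fit(plays[features], plays["won"])

# predict_proba columns follow model.classes_; pick the column for class 1 (win).
win_col = list(model.classes_).index(1)
plays["win_prob"] = model.predict_proba(plays[features])[:, win_col]

print(plays.head())
```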

Look at your data!

Class imbalance

Centering & scaling

Train-test splits

janitorial work
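The janitorial steps above might look like the following sketch: check the class balance, make a stratified train-test split, and fit the scaler on the training rows only. The data is synthetic and the 20% positive rate is an arbitrary assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
y = (rng.uniform(size=1000) < 0.2).astype(int)   # roughly 20% positives

# Look at your data: class proportions.
print(np.bincount(y) / len(y))

# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Centering and scaling: learn mean and std from the training set only,
# then apply the same transformation to the test set.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(3), X_train_s.std(axis=0).round(3))
```

Fitting the scaler on the training rows alone keeps information about the test set out of the model.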

Random forests!

Ensembles of bootstrap-aggregated (bagged) decision trees!
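A minimal random forest sketch with scikit-learn's RandomForestClassifier. The dataset comes from make_classification and the hyperparameters are illustrative, not the ones from the talk.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# bootstrap=True is the default: each tree is grown on a bootstrap sample
# of the training rows, and the forest averages the trees' predictions.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)        # mean accuracy on held-out data
probs = forest.predict_proba(X_test)[:, 1]     # probability of class 1
print(accuracy, probs[:5])
```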

How did we do?

Other considerations:

Calibration

Per-class performance

Performance associated with predictors
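One possible way to check the first two considerations above, calibration and per-class performance, with scikit-learn's metrics. The data and model are synthetic stand-ins, and the bin count is an arbitrary choice.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, classification_report
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced data (an assumption, not the NFL plays).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]
labels = model.predict(X_test)

# Calibration: within each probability bin, compare the mean predicted
# probability with the observed fraction of positives.
frac_positive, mean_predicted = calibration_curve(y_test, probs, n_bins=5)
for observed, predicted in zip(frac_positive, mean_predicted):
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")

# Brier score: mean squared error of the predicted probabilities (lower is better).
brier = brier_score_loss(y_test, probs)
print(brier)

# Per-class precision, recall, and F1.
report = classification_report(y_test, labels)
print(report)
```

A well-calibrated model prints predicted and observed values that track each other closely in every bin.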

Important omissions

Cross-validation

Validation set vs. test set

Feature engineering

Model checking

UNCERTAINTY

Serialization
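Two of the omissions above, cross-validation and serialization, can be sketched briefly. This uses cross_val_score with five folds and a pickle round trip; the data, the model, and the fold count are illustrative assumptions.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Cross-validation: cross_val_score fits a fresh clone of the estimator on
# each of the 5 folds and returns one score (accuracy by default) per fold.
scores = cross_val_score(forest, X, y, cv=5)
print(scores.mean(), scores.std())

# Serialization: fit once, pickle the fitted model, and load it back.
forest.fit(X, y)
blob = pickle.dumps(forest)
restored = pickle.loads(blob)

# The restored model reproduces the original's predictions.
print(np.array_equal(restored.predict(X), forest.predict(X)))
```

Pickled models are only safe to load from trusted sources, and should be loaded with the same scikit-learn version that saved them.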

Read more!

@treycausey

trey.causey@gmail.com

thespread.us
