PyData Seattle 2015
Cam Davidson-Pilon
Cam Davidson-Pilon
- Lead on the Data Team at Shopify
- Open source contributer
- Author of Bayesian Methods for Hackers
(in print soon!)
Ottawa
Ottawa
Ottawa?
We needed to predict mail return rates based on census data.
Sample Data (simplified):
Well I'm predicting the rate, so I build that:
Don't need margin of errors...
...then do "data science"
Outcome: failure
What went wrong? At the time, ¯\_(ツ)_/¯
(highly, highly recommended!)
"The std. deviation of the sample mean is equal to the std. deviation of the population over square-root n"
What I learned
A intra-day time series of S&P, Dow, Nasdaq and FTSE (UK index)
Suppose you are interested in doing some day trading. Your target: UK stocks.
Futures on the FTSE in particular.
Post Backtesting Results
Push to Production - investing really money
What happened?
Data Leakage happened
What I learned
What I learned
It was my first A/B test at Shopify...
Control group: 4%
Experiment group: 5%
Bayesian A/B testing told me there was a significant statistical difference between the groups...
Upper management wanted to know the relative increase...
(5% - 4%) / 4% = 25%
No.
We forgot sample size again.
What I learned
Sparse-ing the solution naively
Coefficients after linear regression*:
*Assume data has been normalized too, i.e. mean 0 and standard deviation 1
Decide to drop a variable:
Suppose this is the true model...
Okay, out regression got the coefficients right, but...
So actually, together, these variables have very little contribution to Y!
Solution:
Any form of regularization will solve this. For example, using ridge regression with with even the slightest penalizer gives:
PCA before Regression
PCA is great at many things, but it can actually significantly hurt regression if used as a preprocessing step. How?
Suppose we wish to regress Y onto X and W. The true model of Y is Y = X - W. We don't know this yet.
Suppose further there is a positive correlation between X and W, say 0.5.
Apply PCA to [X W], we get a new matrix:
Textbook analysis tells you to drop the second dimension from this new PCA.
So now we are regressing Y onto:
i.e., find values to fit the model:
But there are no good values for these unknowns!
Quick IPython Demo
Solution:
Don't use naive PCA before regression, you are losing information - try something like supervised PCA, or just don't do it.
Thanks for listening :)
@cmrn_dp