PyData Seattle 2015

Cam Davidson-Pilon

- Lead on the Data Team at Shopify

- Open source contributor

- Author of Bayesian Methods for Hackers

(in print soon!)

Ottawa

Ottawa?

We needed to predict mail return rates based on census data.

Sample Data (simplified):

Well I'm predicting the rate, so I build that:

Don't need margins of error...

...then do "data science"

Outcome: failure

What went wrong? At the time, ¯\_(ツ)_/¯

(highly, highly recommended!)

$\sigma_{\bar X} = \frac{\sigma}{\sqrt{n}}$

"The std. deviation of the sample mean is equal to the std. deviation of the population over square-root n"

What I learned

- Sample sizes are so important when dealing with aggregate-level data.
- It was only an issue because the sample sizes were different, too.
- Use the Margin of Error, don't ignore it - it's there for a reason.
- I got burned so bad here, I became a Bayesian soon after.
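
A minimal sketch of why sample size matters, checking $\sigma_{\bar X} = \frac{\sigma}{\sqrt{n}}$ by simulation. The population (normal, standard deviation 2) and the sample sizes are assumptions chosen for illustration, not the census data from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0  # assumed population standard deviation


def empirical_se(n, n_trials=5000):
    """Standard deviation of the sample mean across many simulated samples of size n."""
    samples = rng.normal(loc=0.0, scale=sigma, size=(n_trials, n))
    return samples.mean(axis=1).std()


# Compare the simulated spread of the sample mean with sigma / sqrt(n).
for n in (10, 100, 1000):
    print(f"n={n:5d}  simulated={empirical_se(n):.4f}  formula={sigma / np.sqrt(n):.4f}")
```

An average computed from 10 observations is roughly ten times noisier than one computed from 1000, so rows built from groups of different sizes should not be treated as equally reliable.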

An intra-day time series of S&P, Dow, Nasdaq and FTSE (UK index)

Suppose you are interested in doing some day trading. Your target: UK stocks.

Futures on the FTSE in particular.

Post Backtesting Results

Push to Production - investing real money

What happened?

Data Leakage happened

What I learned

- Your backtesting / cross validation results will always be equal to or more optimistic than live performance - plan for that.
- Understand where your data comes from, from start to finish.
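
The slides do not spell out how the leak occurred. As one hypothetical illustration, the sketch below flags any feature that is observed at or after the moment a trade decision is made. The function name, feature names, and date are assumptions; the check relies only on London closing at 16:30 local time and New York at 16:00 local time.

```python
from datetime import datetime, timezone


def leaked_features(decision_time, feature_times):
    """Names of features observed at or after the moment the trade decision is made."""
    return [name for name, seen_at in feature_times.items() if seen_at >= decision_time]


# Hypothetical winter trading day; all times are in UTC.
decision_time = datetime(2015, 1, 15, 16, 30, tzinfo=timezone.utc)  # London close, 16:30 local
feature_times = {
    "ftse_open": datetime(2015, 1, 15, 8, 0, tzinfo=timezone.utc),     # London open, 08:00 local
    "sp500_close": datetime(2015, 1, 15, 21, 0, tzinfo=timezone.utc),  # New York close, 16:00 local
}

print(leaked_features(decision_time, feature_times))  # ['sp500_close']
```

A daily table that puts same-day closes of several markets in one row hides this ordering, which is why tracing each column back to its timestamp is worth the effort.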

What I learned

- When developing statistical software that already exists in the wild, write tests against the output of that software.
- Be responsible for your software:

It was my first A/B test at Shopify...

Control group: 4%

Experiment group: 5%

Bayesian A/B testing told me there was a statistically significant difference between the groups...

Upper management wanted to know the relative increase...

(5% - 4%) / 4% = 25%

No.

We forgot sample size again.
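
A rough sketch of how much uncertainty sits behind that 25%. The group sizes (5000 visitors each, with 200 and 250 conversions) and the flat Beta(1, 1) prior are assumptions for illustration; the talk does not give the real counts.

```python
import numpy as np

rng = np.random.default_rng(42)


def relative_increase_samples(conv_a, n_a, conv_b, n_b, size=100_000):
    """Posterior samples of (p_B - p_A) / p_A under independent Beta(1, 1) priors."""
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=size)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=size)
    return (post_b - post_a) / post_a


# Hypothetical counts matching 4% vs 5% conversion rates.
samples = relative_increase_samples(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
p5, p50, p95 = np.percentile(samples, [5, 50, 95])

print(f"P(experiment beats control) = {(samples > 0).mean():.3f}")
print(f"relative increase: 5th pct = {p5:.1%}, median = {p50:.1%}, 95th pct = {p95:.1%}")
```

With these assumed counts the experiment group is very likely better, yet the plausible relative increase spans a wide range around 25%. Reporting a lower percentile rather than the point estimate is one way to err on the side of underestimating.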

What I learned

- Don't naively compute stats on top of stats - this only compounds the uncertainty.
- Better to underestimate than overestimate.
- Visualizing uncertainty is the role of a statistician.

**Sparse-ing the solution naively**

Coefficients after linear regression*:

*Assume data has been normalized too, i.e. mean 0 and standard deviation 1

Decide to drop a variable:

Suppose this is the **true** model...

Okay, our regression got the coefficients right, but...

So actually, together, these variables have very little contribution to Y!

Solution:

Any form of regularization will solve this. For example, using ridge regression with even the slightest penalizer gives:
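
The original coefficient tables are not reproduced in this text. As a stand-in, here is a small sketch on synthetic data: two nearly identical variables with large opposite coefficients, fit with ordinary least squares and with a lightly penalized ridge regression. The data, the coefficients (50 and -50), and the penalty of 1.0 are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# x2 is almost a copy of x1, so 50*x1 - 50*x2 nearly cancels out.
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 50 * x1 - 50 * x2 + x3 + 0.1 * rng.normal(size=n)


def ridge_fit(X, y, penalty):
    """Closed-form ridge without an intercept; penalty=0 is ordinary least squares."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(k), X.T @ y)


ols = ridge_fit(X, y, penalty=0.0)
ridge = ridge_fit(X, y, penalty=1.0)

print("OLS coefficients:  ", np.round(ols, 2))
print("ridge coefficients:", np.round(ridge, 2))
print("std of 50*x1 - 50*x2:", round(float(np.std(50 * x1 - 50 * x2)), 3))
```

Least squares recovers the large opposite coefficients, while the ridge fit shrinks both toward zero and leaves the coefficient on x3 close to 1, which better reflects how little the correlated pair contributes jointly.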

**PCA before Regression**

PCA is great at many things, but it can actually **significantly hurt** regression if used as a preprocessing step. How?

Suppose we wish to regress Y onto X and W. The *true* model of Y is Y = X - W. We don't know this yet.

Suppose further there is a positive correlation between X and W, say 0.5.

Apply PCA to [X W], we get a new matrix:

$[ \frac{1}{\sqrt{2}}X + \frac{1}{\sqrt{2}}W, \frac{1}{\sqrt{2}}X - \frac{1}{\sqrt{2}}W ]$

Textbook analysis tells you to drop the second dimension from this new PCA.

$[ \frac{1}{\sqrt{2}}X + \frac{1}{\sqrt{2}}W]$

So now we are regressing Y onto:

i.e., find values to fit the model:

$Y = \alpha + \beta(X + W)$

But there are no good values for these unknowns!

Quick IPython Demo
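
The demo itself is not captured in this text. The following is a minimal reconstruction of the setup described above, not the author's notebook: X and W are simulated with correlation 0.5, Y = X - W, and PCA is done by eigendecomposition of the sample covariance. The sample size and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000

# X and W: unit variance, correlation 0.5. The true model is Y = X - W.
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
XW = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
X, W = XW[:, 0], XW[:, 1]
Y = X - W

# PCA via eigendecomposition of the sample covariance, largest variance first.
centered = XW - XW.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
components = centered @ eigvecs[:, order]
pc1, pc2 = components[:, 0], components[:, 1]


def r_squared(feature, target):
    """R^2 of a simple linear regression of target on a single feature."""
    return np.corrcoef(feature, target)[0, 1] ** 2


print("variance explained by PC1:", round(float(eigvals[order][0] / eigvals.sum()), 3))
print("R^2 of Y on PC1 (kept):   ", round(float(r_squared(pc1, Y)), 4))
print("R^2 of Y on PC2 (dropped):", round(float(r_squared(pc2, Y)), 4))
```

The first component carries most of the variance in [X W] but is essentially uncorrelated with Y, while the discarded second component is proportional to X - W and explains Y almost perfectly.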

Solution:

Don't use naive PCA before regression: you are losing information. Try something like supervised PCA, or just don't do it.

Thanks for listening :)

@cmrn_dp