Machine Learning Process
by: Preston Parry
Today's Tour
Data
Pull data
Filter data
Feature engineering
Train/test split
Train
Model types
Interpreting results
Test!
Error Metrics
Overfitting
Pulling Data
ML algorithms need tabular data
SQL Queries
Process data in SQL, filter it in Python
Iterating on filters is important
Recency
Sample data for training speed
X: features (data we think will be useful for making our prediction)
y: our output values (the value we are trying to predict)
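A minimal sketch of this step, assuming a SQL database reachable from Python (the connection string, table, and column names below are all hypothetical):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection and table; swap in your own
engine = create_engine('postgresql://user:password@localhost:5432/analytics')

# Do the heavy processing in SQL; pull back one flat, tabular result
df = pd.read_sql("SELECT * FROM user_purchases WHERE created_at > NOW() - INTERVAL '90 days'", engine)

# Sample down so training iterations stay fast
df = df.sample(n=min(len(df), 100000), random_state=42)

# X: features, y: the value we're trying to predict
y = df['purchase_amount']
X = df.drop(columns=['purchase_amount'])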
Nerd-snipe
Equation, if you're into that kind of thing (and you definitely don't have to be)
Trying to solve for W * X = y
W is Weights (for each feature)
Thanks Raghav!
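A toy numpy illustration of solving for W (the numbers are made up, chosen so the solution is exact):

import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0]])   # 3 rows, 2 features
y = np.array([5.0, 4.0, 11.0])

# Least-squares solution for the weights (one per feature)
W, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(W)  # array([1., 2.]) -> y = 1*x1 + 2*x2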
Filtering Data
Crazy important
Make sure you're predicting the thing you think you are
Algos learn patterns most effectively when you've removed noise
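A sketch of what those filters might look like in pandas (column names and thresholds are invented for illustration):

import pandas as pd

df = pd.read_csv('user_purchases.csv')

# Keep only the population you actually want to predict on
df = df[df['country'] == 'US']
df = df[df['account_status'] == 'active']

# Drop rows where the label itself is missing or clearly bad
df = df[df['purchase_amount'].notna()]
df = df[df['purchase_amount'] >= 0]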
Feature Engineering
Quite possibly the most fun part!
Finding the best data to feed into the model
Dates
Historical aggregates
Other examples
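Hypothetical pandas sketches of both ideas (column names are invented):

import pandas as pd

df = pd.read_csv('orders.csv', parse_dates=['created_at'])

# Dates: split a timestamp into parts a model can learn from
df['day_of_week'] = df['created_at'].dt.dayofweek
df['hour_of_day'] = df['created_at'].dt.hour
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Historical aggregates: summarize each user's past behavior
# (beware leakage: ideally compute these only from rows before each row's date)
user_stats = df.groupby('user_id')['order_total'].agg(['mean', 'count']).reset_index()
user_stats.columns = ['user_id', 'user_avg_order', 'user_num_orders']
df = df.merge(user_stats, on='user_id', how='left')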
Feature Engineering: Empathy
The best features come from understanding the behavior in your dataset
For most standard ML business problems, empathy leads to better accuracy than advanced ML knowledge
Train/Test Split
Is the model any good?
You need to know how well your model will generalize to new data
Random, or time-based
Sometimes you'll see people using two test sets
Second is often called a holdout dataset
You can ignore this for now until you start doing more advanced modeling
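Both split styles, sketched with the hypothetical df from earlier:

from sklearn.model_selection import train_test_split

# Random split: fine when rows don't depend on time
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# Time-based split: train on the past, test on the most recent rows
df = df.sort_values('created_at')
cutoff = int(len(df) * 0.8)
df_train, df_test = df.iloc[:cutoff], df.iloc[cutoff:]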
Train the model
Feed data to auto_ml (sketched below)
Grab a coffee and check your Slack while auto_ml does all the heavy lifting for you!
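Following the pattern from auto_ml's docs (the dataframe and column names are the hypothetical ones from earlier):

from auto_ml import Predictor

# Tell auto_ml which column is the output and which need special handling
column_descriptions = {
    'purchase_amount': 'output',
    'user_id': 'categorical'
}

ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)
ml_predictor.train(df_train)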
But I wanna know more!! What happens in there?
Model Types
Super basic: linear model
Model Types
Tree-based model
The usual suspects
Random Forests
Gradient Boosted Trees
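The same model types, sketched in scikit-learn terms (assumes numeric features and the df_train from earlier):

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

X_train = df_train.drop(columns=['purchase_amount'])
y_train = df_train['purchase_amount']

linear_model = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
gbt = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)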
Interpret the model results
Linear model: coefficients
Tree-based model: feature importances
auto_ml: feature_responses
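For the scikit-learn models above, the analogous inspection looks like this:

# Linear model: one learned coefficient per feature
for name, coef in zip(X_train.columns, linear_model.coef_):
    print(name, coef)

# Tree-based model: feature importances (they sum to 1.0)
for name, importance in zip(X_train.columns, forest.feature_importances_):
    print(name, importance)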
Score the model
Error metrics!
All just different ways of saying "how far off are my predictions from the actual values?"
Get a prediction for every row in your test dataset
Compare the prediction to the actual value for that row (the error for that row)
Aggregate these individual errors together in some way
Regression
Median Absolute Error
Mean Absolute Error
Mean Squared Error
Root Mean Squared Error (RMSE)
Classification
Accuracy
Precision/Recall
Probability Estimates
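Computing these with sklearn.metrics, continuing the earlier regression sketch (the classification metrics assume a classifier and class labels, so they're only noted in comments):

import numpy as np
from sklearn.metrics import (median_absolute_error, mean_absolute_error,
                             mean_squared_error)

X_test = df_test.drop(columns=['purchase_amount'])
y_test = df_test['purchase_amount']
predictions = forest.predict(X_test)

print('Median AE:', median_absolute_error(y_test, predictions))
print('MAE:', mean_absolute_error(y_test, predictions))
mse = mean_squared_error(y_test, predictions)
print('MSE:', mse)
print('RMSE:', np.sqrt(mse))

# For classifiers: accuracy_score, precision_score, recall_score from sklearn.metrics,
# and probability estimates via classifier.predict_proba(X_test)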
Overfitting
There are better ways to store a dataset than in a random forest
source: https://commons.wikimedia.org/wiki/File:Overfitting.svg
Overfitting: Why it's a problem
Generalization
Misleading analytics
Overfitting: Symptoms
Test score is too good to be true
Train score is much better than test score
Training score is much better than production score
One feature is too useful
Overfitting: Causes
Telling the model the future
Too small a dataset
Too complex a model
Data leakage (beware time-series averages!)
Overfitting: Solutions
Do the inverse of each cause :)
Simplify model
Feature selection
Regularization
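Sketches of each fix in scikit-learn terms (hyperparameter values are illustrative, not recommendations):

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

# Simplify the model: a shallower, smaller forest
simpler_forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)

# Feature selection: keep only the k most predictive features
selector = SelectKBest(score_func=f_regression, k=10)  # assumes >= 10 features
X_train_selected = selector.fit_transform(X_train, y_train)

# Regularization: penalize large coefficients in a linear model
regularized_model = Ridge(alpha=1.0).fit(X_train, y_train)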
Underfitting
Not scoring very well, even on the training data
Add in more features
More data filtering to remove noise
This is a really powerful step, especially if your data collection itself is messy and relies on human actions
More complex models
Next Steps
Set up your dev environment
Write a SQL query to get some data
Toss it into auto_ml and see what happens!