Machine Learning Process
by: Preston Parry
Today's Tour
Data
Pull data
Filter data
Feature engineering
Train/test split
Train
Model types
Interpreting results
Test!
Error Metrics
Overfitting
Pulling Data
ML algorithms need tabular data
SQL Queries
Process data in SQL, filter it in Python
Iterating on filters is important
Recency
Sample data for training speed
X: features (data we think will be useful for making our prediction)
y: our output values (the value we are trying to predict)
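A minimal sketch of this step, assuming a SQL database reachable from Python (the connection string, table, and column names below are all hypothetical):

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection and table; swap in your own
engine = create_engine('postgresql://user:password@localhost:5432/analytics')

# Do the heavy processing in SQL; pull back one flat, tabular result
df = pd.read_sql("SELECT * FROM user_purchases WHERE created_at > NOW() - INTERVAL '90 days'", engine)

# Sample down so training iterations stay fast
df = df.sample(n=min(len(df), 100000), random_state=42)

# X: features, y: the value we're trying to predict
y = df['purchase_amount']
X = df.drop(columns=['purchase_amount'])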
Nerd-snipe
Equation, if you're into that kind of thing (and you definitely don't have to be)
Trying to solve for W * X = y
W is Weights (for each feature)
Thanks Raghav!
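A toy numpy illustration of solving for W (the numbers are made up, chosen so the solution is exact):

import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0]])   # 3 rows, 2 features
y = np.array([5.0, 4.0, 11.0])

# Least-squares solution for the weights (one per feature)
W, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(W)  # array([1., 2.]) -> y = 1*x1 + 2*x2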
Filtering Data
Crazy important
Make sure you're predicting the thing you think you are
Algos learn patterns most effectively when you've removed noise
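A sketch of what those filters might look like in pandas (column names and thresholds are invented for illustration):

import pandas as pd

df = pd.read_csv('user_purchases.csv')

# Keep only the population you actually want to predict on
df = df[df['country'] == 'US']
df = df[df['account_status'] == 'active']

# Drop rows where the label itself is missing or clearly bad
df = df[df['purchase_amount'].notna()]
df = df[df['purchase_amount'] >= 0]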
Feature Engineering
Quite possibly the most fun part!
Finding the best data to feed into the model
Dates
Historical aggregates
Other examples
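Hypothetical pandas sketches of both ideas (column names are invented):

import pandas as pd

df = pd.read_csv('orders.csv', parse_dates=['created_at'])

# Dates: split a timestamp into parts a model can learn from
df['day_of_week'] = df['created_at'].dt.dayofweek
df['hour_of_day'] = df['created_at'].dt.hour
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)

# Historical aggregates: summarize each user's past behavior
# (beware leakage: ideally compute these only from rows before each row's date)
user_stats = df.groupby('user_id')['order_total'].agg(['mean', 'count']).reset_index()
user_stats.columns = ['user_id', 'user_avg_order', 'user_num_orders']
df = df.merge(user_stats, on='user_id', how='left')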
Feature Engineering: Empathy
The best features come from understanding the behavior in your dataset
For most standard ML business problems, empathy leads to better accuracy than advanced ML knowledge
Train/Test Split
Is the model any good?
You need to know how well your model will generalize to new data
Random, or time-based
Sometimes you'll see people using two test sets
Second is often called a holdout dataset
You can ignore this for now until you start doing more advanced modeling
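Both split styles, sketched with the hypothetical df from earlier:

from sklearn.model_selection import train_test_split

# Random split: fine when rows don't depend on time
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

# Time-based split: train on the past, test on the most recent rows
df = df.sort_values('created_at')
cutoff = int(len(df) * 0.8)
df_train, df_test = df.iloc[:cutoff], df.iloc[cutoff:]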
Train the model
Feed data to auto_ml (sketched below)
Grab a coffee and check your Slack while auto_ml does all the heavy lifting for you!
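Following the pattern from auto_ml's docs (the dataframe and column names are the hypothetical ones from earlier):

from auto_ml import Predictor

# Tell auto_ml which column is the output and which need special handling
column_descriptions = {
    'purchase_amount': 'output',
    'user_id': 'categorical'
}

ml_predictor = Predictor(type_of_estimator='regressor', column_descriptions=column_descriptions)
ml_predictor.train(df_train)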
But I wanna know more!! What happens in there?
Model Types
Super basic: linear model
Model Types
Tree-based model
The usual suspects
Random Forests
Gradient Boosted Trees
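The same model types, sketched in scikit-learn terms (assumes numeric features and the df_train from earlier):

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

X_train = df_train.drop(columns=['purchase_amount'])
y_train = df_train['purchase_amount']

linear_model = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
gbt = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)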
Interpret the model results
Linear model: coefficients
Tree-based model: feature importances
auto_ml: feature_responses
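For the scikit-learn models above, the analogous inspection looks like this:

# Linear model: one learned coefficient per feature
for name, coef in zip(X_train.columns, linear_model.coef_):
    print(name, coef)

# Tree-based model: feature importances (they sum to 1.0)
for name, importance in zip(X_train.columns, forest.feature_importances_):
    print(name, importance)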
Score the model
Error metrics!
All just different ways of saying "how far off are my predictions from the actual values?"
Get a prediction for every row in your test dataset
Compare the prediction to the actual value for that row (the error for that row)
Aggregate these individual errors together in some way
Regression
Median Absolute Error
Mean Absolute Error
Mean Squared Error
Root Mean Squared Error (RMSE)
Classification
Accuracy
Precision/Recall
Probability Estimates
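Computing these with sklearn.metrics, continuing the earlier regression sketch (the classification metrics assume a classifier and class labels, so they're only noted in comments):

import numpy as np
from sklearn.metrics import (median_absolute_error, mean_absolute_error,
                             mean_squared_error)

X_test = df_test.drop(columns=['purchase_amount'])
y_test = df_test['purchase_amount']
predictions = forest.predict(X_test)

print('Median AE:', median_absolute_error(y_test, predictions))
print('MAE:', mean_absolute_error(y_test, predictions))
mse = mean_squared_error(y_test, predictions)
print('MSE:', mse)
print('RMSE:', np.sqrt(mse))

# For classifiers: accuracy_score, precision_score, recall_score from sklearn.metrics,
# and probability estimates via classifier.predict_proba(X_test)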
Overfitting
There are better ways to store a dataset than in a random forest
source: https://commons.wikimedia.org/wiki/File:Overfitting.svg
Overfitting: Why it's a problem
Generalization
Misleading analytics
Overfitting: Symptoms
Test score is too good to be true
Train score is much better than test score
Training score is much better than production score
One feature is too useful
Overfitting: Causes
Telling the model the future
Too small a dataset
Too complex a model
Data leakage (beware time-series averages!)
Overfitting: Solutions
Do the inverse of each cause :)
Simplify model
Feature selection
Regularization
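Sketches of each fix in scikit-learn terms (hyperparameter values are illustrative, not recommendations):

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge

# Simplify the model: a shallower, smaller forest
simpler_forest = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)

# Feature selection: keep only the k most predictive features
selector = SelectKBest(score_func=f_regression, k=10)  # assumes >= 10 features
X_train_selected = selector.fit_transform(X_train, y_train)

# Regularization: penalize large coefficients in a linear model
regularized_model = Ridge(alpha=1.0).fit(X_train, y_train)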
Underfitting
Not scoring very well, even on the training data
Add in more features
More data filtering to remove noise
This is a really powerful step, especially if your data collection itself is messy and relies on human actions
More complex models
Next Steps
Set up your dev environment
Write a SQL query to get some data
Toss it into auto_ml and see what happens!