Introduction to predictive modeling
Måns Magnusson
Machine learning
Machine learning?
Automatically detect patterns in data
Predict future observations
Decision making under uncertainty
Types of Machine learning
Supervised learning
Unsupervised learning
Reinforcement learning
Supervised learning
(also called predictive learning)
response variable
covariates/features
training set
Supervised learning types
If the response variable is categorical: classification
If the response variable is real-valued: regression
Unsupervised learning
(also called knowledge discovery)
dimensionality reduction
clustering, PCA, discovery of graph structures
latent variable modeling
Curse of dimensionality
The more variables, the larger the distance between data points
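A small simulation can make this concrete. The sketch below is not from the slides: it assumes points drawn uniformly from the unit hypercube and uses Euclidean distance, and the sample size and dimensions are arbitrary choices.

```python
import math
import random

def mean_pairwise_distance(n_points, n_dims, rng):
    """Average Euclidean distance between all pairs of uniform random points."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total, pairs = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            pairs += 1
    return total / pairs

rng = random.Random(1)  # fixed seed so the run is repeatable
distances = {d: mean_pairwise_distance(100, d, rng) for d in (1, 2, 10, 100, 1000)}
for d, dist in distances.items():
    print(f"{d:>5} variables: mean distance {dist:.2f}")
```

With the same number of points, the average distance keeps growing as variables are added, so the data become sparser.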
Bias and variance in ML
Underfit = high bias, low variance
Overfit = low bias, high variance
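One way to see both regimes is k-nearest-neighbour regression, which is swapped in here purely for illustration. The sketch assumes noisy samples from sin(x); the data-generating function, noise level, and values of k are assumptions, not part of the slides.

```python
import math
import random

def make_data(n, rng):
    """Noisy samples from sin(x) on [0, 2*pi] (an assumed toy problem)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 2.0 * math.pi)
        data.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))
    return data

def knn_predict(train, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error of k-NN predictions on data."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)

rng = random.Random(42)
train, test = make_data(200, rng), make_data(500, rng)

errors = {}
for k in (1, 10, len(train)):
    errors[k] = (mse(train, train, k), mse(train, test, k))
    print(f"k={k:>3}: train MSE {errors[k][0]:.3f}, test MSE {errors[k][1]:.3f}")
```

With k = 1 the model reproduces the training data exactly but does worse on new data (overfit). With k equal to the training set size it predicts the overall mean everywhere (underfit). A moderate k sits in between.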
Model selection
hyperparameters
bias-variance tradeoff
generalization error
validation set/cross validation
Predictive modeling pipeline
1. Set aside data for test (estimate generalization error)
2. Set aside data for validation (if the model has hyperparameters)
3. Run algorithms
4. Find best/optimal hyperparameters (on validation set)
5. Choose final model
6. Estimate generalization error on test set
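The six steps can be sketched end to end. The model here is a Gaussian-kernel smoother with bandwidth `h` as its only hyperparameter; the model, the toy data, the split sizes, and the candidate bandwidths are all assumptions chosen to keep the example short.

```python
import math
import random

def make_data(n, rng):
    """Noisy samples from sin(x) on [0, 2*pi] (an assumed toy problem)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 2.0 * math.pi)
        data.append((x, math.sin(x) + rng.gauss(0.0, 0.3)))
    return data

def kernel_predict(train, x0, h):
    """Gaussian-kernel weighted average of training responses (bandwidth h)."""
    weights = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x, _ in train]
    total = sum(weights)
    if total == 0.0:  # all weights underflowed: fall back to the plain mean
        return sum(y for _, y in train) / len(train)
    return sum(w * y for w, (_, y) in zip(weights, train)) / total

def mse(train, data, h):
    """Mean squared error of the smoother's predictions on data."""
    return sum((y - kernel_predict(train, x, h)) ** 2 for x, y in data) / len(data)

rng = random.Random(7)
data = make_data(300, rng)
rng.shuffle(data)

# Steps 1-2: set aside test data and validation data.
test, validation, train = data[:60], data[60:120], data[120:]

# Steps 3-4: run the algorithm for each candidate and score it on the validation set.
candidates = (0.05, 0.2, 0.5, 2.0, 10.0)
validation_mse = {h: mse(train, validation, h) for h in candidates}

# Step 5: choose the final model.
best_h = min(validation_mse, key=validation_mse.get)

# Step 6: estimate the generalization error once, on the untouched test set.
test_mse = mse(train, test, best_h)
print(f"best h = {best_h}, test MSE = {test_mse:.3f}")
```

The test set is used only in the last line. Reusing it to pick `h` would make the reported error too optimistic.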
No free lunch theorem
accuracy-complexity-interpretability tradeoff
different models work in different domains
...but more data always wins
Approaches
Linear models
Non-linear models
Supervised learning in R
the caret package
package for supervised learning
does not contain methods itself, just a framework
compare methods on hold-out data
"Big data"
Big data
"Big data is like teenage sex:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it..."
- Dan Ariely
Big data is relative...
... to computational complexity
We need algorithms that scale!
Examples
Linear regression
Gaussian processes
Support vector machines
Random forests
Topic models
Quicksort
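To see why "big" depends on the algorithm, the sketch below compares rough operation counts for three of the examples above. The growth rates are textbook approximations brought in as assumptions (least squares with p covariates via the normal equations, average-case quicksort, exact Gaussian process inference); they are not taken from the slides and ignore constants.

```python
import math

# Rough operation counts as a function of the number of observations n.
# These are order-of-growth approximations, not measurements.
def linear_regression_ops(n, p=10):
    return n * p**2 + p**3      # form X'X, then solve a p x p system

def quicksort_ops(n):
    return n * math.log2(n)     # average case

def gaussian_process_ops(n):
    return n**3                 # exact inference factorizes an n x n matrix

for n in (1_000, 1_000_000):
    print(f"n = {n:>9,}: "
          f"linear regression ~{linear_regression_ops(n):.1e}, "
          f"quicksort ~{quicksort_ops(n):.1e}, "
          f"Gaussian process ~{gaussian_process_ops(n):.1e}")
```

Going from a thousand to a million observations multiplies the cubic cost by a factor of a billion, while the near-linear costs grow by roughly a factor of a thousand.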
Big data in R
R stores data in RAM
integers: 4 bytes
numerics: 8 bytes
A matrix of numerics with 100 million rows and 5 columns
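The size of such a matrix follows directly from the byte counts. The sketch below only counts the values themselves and ignores R's small per-object overhead, which is an assumption; the helper name is hypothetical.

```python
BYTES_PER_NUMERIC = 8   # R numerics (doubles)
BYTES_PER_INTEGER = 4   # R integers

def matrix_bytes(rows, cols, bytes_per_value):
    """Payload size only; per-object overhead is ignored."""
    return rows * cols * bytes_per_value

numeric_bytes = matrix_bytes(100_000_000, 5, BYTES_PER_NUMERIC)
integer_bytes = matrix_bytes(100_000_000, 5, BYTES_PER_INTEGER)
print(f"numerics: {numeric_bytes / 1e9:.1f} GB")
print(f"integers: {integer_bytes / 1e9:.1f} GB")
```

The numeric matrix needs about 4 GB of RAM before any copy is made; storing the same values as integers would halve that.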
How to handle
Handle chunkwise
Subsampling
More hardware
C++/Java backend (dplyr)
Reduce data in memory
Database backend
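Chunkwise handling, the first option above, can be sketched as a running computation over pieces of a file. The example computes a mean without ever holding all rows at once; the in-memory toy file, the column name `value`, and the helper names are hypothetical.

```python
import csv
import io
from itertools import islice

def read_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size rows, so the whole file is never in memory."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

def chunked_mean(handle, column, chunk_size=1000):
    """Mean of one column, accumulated chunk by chunk."""
    total, count = 0.0, 0
    for chunk in read_chunks(handle, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count

# Toy stand-in for a large file on disk.
toy_file = io.StringIO("value\n" + "\n".join(str(i) for i in range(1, 10_001)))
result = chunked_mean(toy_file, "value", chunk_size=1000)
print(result)
```

This works for statistics that can be accumulated, such as sums and counts. Quantities that need all the data at once call for one of the other options in the list.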
If not enough
Spark and SparkPlyr
Fast cluster computations for ML and statistics
Data munging
using dplyr and tidyr
Tidy data
Theoretical approach to data handling
Similar to Codd's third normal form
Tidy data and messy data
Tidy data
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
iris
faithful
Why tidy?
80 % of work is data munging
Analysis and visualization are based on tidy data
Performant code
Data analysis pipeline
messy data -> tidy data -> analysis
Messy data
1. Column headers are values, not variable names.
Example data: AirPassengers
2. Multiple variables are stored in one column.
Example data: mtcars
3. Variables are stored in both rows and columns.
Example data: crimtab
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.
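The first kind of mess, column headers that are really values, is fixed by turning each header into a value in a new column. The sketch below uses made-up numbers and a hypothetical helper named `gather`, loosely modeled on the reshaping verbs found in packages such as tidyr; it is not that package's API.

```python
# Toy, made-up numbers in a "wide" layout: the month columns are values, not variables.
messy = [
    {"year": 1949, "Jan": 10, "Feb": 12, "Mar": 15},
    {"year": 1950, "Jan": 11, "Feb": 13, "Mar": 17},
]

def gather(rows, id_column, key, value):
    """Turn every non-id column into one (key, value) row per observation."""
    tidy = []
    for row in rows:
        for column, cell in row.items():
            if column != id_column:
                tidy.append({id_column: row[id_column], key: column, value: cell})
    return tidy

tidy = gather(messy, id_column="year", key="month", value="passengers")
for row in tidy:
    print(row)
```

After reshaping, each variable (year, month, passengers) forms a column and each observation forms a row.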
dplyr
Verbs for handling data
Highly optimized C++ code (backend)
Handling larger datasets in R
(no copy-on-modify)
dplyr + tidyr
Intro to Predictive Modeling
By monsmagn
A fast introduction to basic predictive modeling