Introduction to predictive modeling

Måns Magnusson

Machine learning

Machine learning?

Automatically detect patterns in data

Predict future observations

Decision making under uncertainty

Types of Machine learning

Supervised learning

Unsupervised learning

Reinforcement learning

Supervised learning

(also called predictive learning)

training set:

\mathcal{D}=({\mathbf{x}_i,y_i})^{N}_{i=1}

\mathbf{x}_i: covariates/features

y_i: response variable

Supervised learning types

If y_i is categorical: classification

If y_i is real: regression

Unsupervised learning

(also called knowledge discovery)

dimensionality reduction

\mathcal{D}=({\mathbf{x}_i})^{N}_{i=1}

clustering, PCA, discovery of graph structures

latent variable modeling

Curse of dimensionality

The more variables, the larger the distance between data points
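
A minimal simulation sketch of this effect, assuming points drawn uniformly from the unit hypercube. The sample size of 200 and the dimensions tried are arbitrary illustration choices, and `mean_pairwise_dist` is a hypothetical helper name.

```r
set.seed(1)

# Mean pairwise Euclidean distance between n uniform points in p dimensions
mean_pairwise_dist <- function(p, n = 200) {
  x <- matrix(runif(n * p), nrow = n, ncol = p)
  mean(dist(x))
}

dims <- c(1, 2, 10, 100, 1000)
avg_dist <- sapply(dims, mean_pairwise_dist)

# The average distance grows as dimensions are added
data.frame(dimensions = dims, mean_distance = round(avg_dist, 2))
```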

Bias and variance in ML

Underfit = high bias, low variance

Overfit = low bias, high variance
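
A small sketch of both regimes on simulated data, assuming a sine-shaped signal with Gaussian noise and polynomial regression of increasing degree (all of these are illustration choices, not part of the slides). Training error can only fall as the degree grows; test error typically falls and then rises again.

```r
set.seed(1)

# Simulated data: a smooth nonlinear signal plus noise (illustration only)
n <- 60
x <- runif(n, 0, 1)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
train <- data.frame(x = x[1:30], y = y[1:30])
test  <- data.frame(x = x[31:60], y = y[31:60])

# Mean squared error of a degree-d polynomial fit on training and test data
mse_for_degree <- function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d,
    train_mse = mean((train$y - predict(fit, train))^2),
    test_mse  = mean((test$y - predict(fit, test))^2))
}

# Degree 1 is a candidate for underfitting, degree 15 for overfitting
results <- t(sapply(c(1, 3, 15), mse_for_degree))
results
```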

Model selection

hyperparameters

bias-variance tradeoff

generalization error

validation set / cross-validation

Predictive modeling pipeline

1. Set aside data for test (estimate generalization error)

2. Set aside data for validation (if there are hyperparameters to tune)

3. Run algorithms

4. Find best/optimal hyperparameters (on validation set)

5. Choose final model

6. Estimate generalization error on test set
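
The six steps can be sketched in base R as follows. This is one possible reading, assuming simulated data, a 180/60/60 split, and the polynomial degree as the only hyperparameter; refitting on training plus validation data in step 5 is a common choice, not something the slides prescribe.

```r
set.seed(1)

# Simulated regression data (illustration only)
n <- 300
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n, sd = 0.3)

# Steps 1-2: set aside test and validation data
idx <- sample(rep(c("train", "valid", "test"), times = c(180, 60, 60)))
train <- dat[idx == "train", ]
valid <- dat[idx == "valid", ]
test  <- dat[idx == "test", ]

mse <- function(fit, newdata) mean((newdata$y - predict(fit, newdata))^2)

# Steps 3-4: fit each candidate and score it on the validation set
degrees <- 1:8
valid_mse <- sapply(degrees, function(d) {
  mse(lm(y ~ poly(x, d), data = train), valid)
})
best_degree <- degrees[which.min(valid_mse)]

# Step 5: choose the final model (here refit on train + validation data)
final_fit <- lm(y ~ poly(x, best_degree), data = rbind(train, valid))

# Step 6: estimate the generalization error once, on the untouched test set
test_mse <- mse(final_fit, test)
test_mse
```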

No free lunch theorem

accuracy-complexity-interpretability tradeoff

different models work in different domains

...but more data always wins

Approaches

Linear models

Non-linear models

Supervised learning in R

the caret package

package for supervised learning

does not implement the methods itself, only a common framework around them

compare methods on hold-out data
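
A hedged sketch of that workflow, assuming the `caret` package is installed. The `iris` data, the two methods (`"knn"` and `"lda"`), the 80/20 split and the 5-fold cross-validation are illustration choices, not taken from the slides.

```r
library(caret)

set.seed(1)

# Hold out 20% of iris as test data (stratified on the response)
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[in_train, ]
testing  <- iris[-in_train, ]

# 5-fold cross-validation on the training data
ctrl <- trainControl(method = "cv", number = 5)

# Two methods fitted through the same interface
fit_knn <- train(Species ~ ., data = training, method = "knn", trControl = ctrl)
fit_lda <- train(Species ~ ., data = training, method = "lda", trControl = ctrl)

# Compare them on the hold-out data
acc <- c(
  knn = mean(predict(fit_knn, testing) == testing$Species),
  lda = mean(predict(fit_lda, testing) == testing$Species)
)
acc
```

The `"lda"` method relies on the MASS package, which ships with standard R installations.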

"Big data"

Big data

"Big data is like teenage sex:

everyone talks about it,

nobody really knows how to do it,

everyone thinks everyone else is doing it,

so everyone claims they are doing it..."

- Dan Ariely

Big data is relative...

... to computational complexity

O(N): 10^{12}
O(N^2): 10^{6}
O(N^3): 10^{4}
O(2^N): 50

We need algorithms that scale!
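
The orders of magnitude above can be reproduced with a little arithmetic. This sketch assumes a fixed budget of about 10^12 elementary operations (an assumption read off the O(N) row) and asks how large N can be for each complexity class; the exponential case allows only a few dozen observations.

```r
# Assumed budget: about 10^12 elementary operations (illustration only)
budget <- 1e12

# Largest N whose cost stays within the budget, per complexity class
max_n <- c(
  "O(N)"   = budget,
  "O(N^2)" = budget^(1 / 2),
  "O(N^3)" = budget^(1 / 3),
  "O(2^N)" = log2(budget)
)
max_n
```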

Examples

Linear regression: O(P^2 \cdot N)

Gaussian processes: O(N^3)

Support vector machines: O(N^3) | O(N^2)

Random forests: O(T \cdot P \cdot N \cdot \log(N))

Topic models: \approx O(I \cdot N)

Quicksort: O(N \cdot \log(N))

Big data in R

R stores data in RAM

integers: 4 bytes

numerics: 8 bytes

A numeric matrix with 100 million rows and 5 columns:

100000000 \cdot 5 \cdot 8 / 1024^3 \approx 3.7 \text{ GiB}
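
The same estimate can be computed in R, and the per-value cost checked on a matrix small enough to allocate. `matrix_gib` is a hypothetical helper written for this sketch; `object.size` reports a small fixed overhead on top of the raw values.

```r
# Estimated size of an n x p matrix in GiB (8 bytes per double by default)
matrix_gib <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^3
}
big <- matrix_gib(1e8, 5)
big

# Check the per-value cost on small matrices that fit comfortably in RAM
small_num <- matrix(0, nrow = 1000, ncol = 5)   # doubles, 8 bytes each
small_int <- matrix(0L, nrow = 1000, ncol = 5)  # integers, 4 bytes each
sizes <- c(
  numeric = as.numeric(object.size(small_num)),
  integer = as.numeric(object.size(small_int))
)
sizes
```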

How to handle

Handle chunkwise

Subsampling

More hardware

C++/Java backend (dplyr)

Reduce data in memory

Database backend
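
As one illustration of chunkwise handling, the sketch below computes a mean without ever holding the full file in memory. It assumes a CSV with a single numeric column `x`; the temporary file and the chunk size of 100 lines are made up for the example.

```r
# Create a small CSV to stand in for a file too large for RAM
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000), path, row.names = FALSE)

con <- file(path, open = "r")
header <- readLines(con, n = 1)  # skip the header line

total <- 0
n_rows <- 0
repeat {
  lines <- readLines(con, n = 100)  # read the next chunk of lines
  if (length(lines) == 0) break     # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = "x")
  # Keep only running summaries, not the chunk itself
  total <- total + sum(chunk$x)
  n_rows <- n_rows + nrow(chunk)
}
close(con)

chunk_mean <- total / n_rows
chunk_mean
```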

If not enough

Spark and SparkPlyr

Fast cluster computations for ML/statistics

Data munging

using dplyr and tidyr

Tidy data

Theoretical approach to data handling

Similar to Codd's normal forms (3rd normal form)

Tidy data and messy data

Tidy data

1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.

Example data: iris, faithful

Why tidy?

80% of the work is data munging

Analysis and visualization are based on tidy data

Performant code

Data analysis pipeline

messy data → tidy data → analysis

Messy data

1. Column headers are values, not variable names.

Example data: AirPassengers

2. Multiple variables are stored in one column.

Example data: mtcars

3. Variables are stored in both rows and columns.

Example data: crimetab

4. Multiple types of observational units are stored in the same table.

5. A single observational unit is stored in multiple tables.
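
A sketch of fixing the first kind of mess (column headers that are values) with `tidyr`, assuming the package is installed. The small `messy` table and its numbers are toy values for illustration, and `pivot_longer` is the current tidyr function for this reshaping; the original slides may have used an older equivalent.

```r
library(tidyr)

# Toy messy table: the month columns are values, not variables
messy <- data.frame(
  year = c(1949, 1950),
  Jan  = c(112, 115),
  Feb  = c(118, 126),
  Mar  = c(132, 141)
)

# One row per (year, month) observation
tidy <- pivot_longer(messy, cols = -year,
                     names_to = "month", values_to = "passengers")
tidy
```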

dplyr

Verbs for handling data

Highly optimized C++ code (backend)

Handling larger datasets in R

(no copy-on-modify)
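
A short sketch of the verbs chained together, assuming `dplyr` is installed. The question asked of `mtcars` (fuel economy by cylinder count for cars with a manual gearbox) is only an illustration.

```r
library(dplyr)

# Average miles per gallon by cylinder count, manual-gearbox cars only
by_cyl <- mtcars %>%
  filter(am == 1) %>%                               # keep rows
  group_by(cyl) %>%                                 # define groups
  summarise(n = n(), mean_mpg = mean(mpg)) %>%      # one row per group
  arrange(desc(mean_mpg))                           # sort the result
by_cyl
```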

dplyr + tidyr

Intro to Predictive Modeling

By monsmagn

A fast introduction to basic predictive modeling
