Advanced R programming

Lecture 7

Måns Magnusson

Statistics and Machine learning

Department of computer and information science

Since last time?

Machine learning

Machine learning?

Advanced R Programming

Automatically detect patterns in data

Predict future observations

Decision making under uncertainty

Types of Machine learning

Supervised learning

Unsupervised learning

Reinforcement learning

Supervised learning

(also called predictive learning)

Training set: $\mathcal{D}=\{(\mathbf{x}_i,y_i)\}_{i=1}^{N}$

$\mathbf{x}_i$: covariates/features

$y_i$: response variable

Supervised learning types

If $y_i$ is categorical: classification

If $y_i$ is real-valued: regression
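
The two cases can be illustrated with base R. This is a minimal sketch: the choice of `mtcars`, the predictors, and the use of `lm()` and logistic regression via `glm()` are assumptions made for illustration, not methods prescribed by the lecture.

```r
# Regression: the response mpg is real-valued
reg_fit  <- lm(mpg ~ wt + hp, data = mtcars)
reg_pred <- predict(reg_fit, newdata = mtcars[1:3, ])

# Classification: the response am is categorical (coded 0 = automatic, 1 = manual)
clf_fit  <- glm(am ~ wt, data = mtcars, family = binomial)
clf_prob <- predict(clf_fit, newdata = mtcars[1:3, ], type = "response")
clf_pred <- ifelse(clf_prob > 0.5, 1, 0)  # threshold the predicted probability

print(reg_pred)
print(clf_pred)
```

The regression model returns numbers on the scale of the response, while the classifier returns a probability that is then mapped to a class label.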

Unsupervised learning

also called knowledge discovery

dimensionality reduction

$\mathcal{D}=\{\mathbf{x}_i\}_{i=1}^{N}$

clustering, PCA, discovery of graph structures

latent variable modeling
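
As a small illustration of learning from covariates only, the sketch below runs k-means clustering and PCA on the four numeric columns of `iris`. The dataset, the number of clusters, and the use of `kmeans()` and `prcomp()` are assumptions chosen for the example; the species labels are used only afterwards, to inspect the result.

```r
set.seed(1)

# Only the covariates x_i are used; there is no response variable
x <- scale(iris[, 1:4])

# Clustering: partition the observations into 3 groups
km <- kmeans(x, centers = 3, nstart = 20)

# Dimensionality reduction: principal component analysis
pca <- prcomp(x)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Compare the discovered clusters with the held-back labels
print(table(cluster = km$cluster, species = iris$Species))
print(round(var_explained, 3))
```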

Curse of dimensionality

The more variables, the larger the distance between data points
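
A quick simulation makes this concrete. The sketch below assumes points drawn uniformly from the unit hypercube (an illustrative choice) and computes the mean pairwise Euclidean distance as the number of variables grows.

```r
set.seed(1)

dims <- c(1, 10, 100, 1000)

# For each dimension p, draw 100 points in [0, 1]^p and average all pairwise distances
mean_dist <- sapply(dims, function(p) {
  x <- matrix(runif(100 * p), nrow = 100)
  mean(dist(x))
})

print(data.frame(dims, mean_dist))
```

The average distance keeps growing with the dimension, so a fixed number of observations covers the space more and more sparsely.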

Bias and variance in ML

Underfit = high bias, low variance

Overfit = low bias, high variance
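
The sketch below illustrates both regimes with polynomial regression on simulated data. The true function, noise level, and polynomial degrees are assumptions picked for the example. Training error can only decrease as the degree grows, while test error typically rises again once the model starts fitting noise.

```r
set.seed(7)

f <- function(x) sin(2 * x)  # assumed "true" function for the simulation
simulate <- function(n) {
  d <- data.frame(x = runif(n, 0, 3))
  d$y <- f(d$x) + rnorm(n, sd = 0.3)
  d
}
train <- simulate(50)
test  <- simulate(1000)

mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

# Degree 1 is likely too rigid (underfit), degree 15 likely too flexible (overfit)
degrees <- c(1, 4, 15)
errors <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train = mse(fit, train), test = mse(fit, test))
}))

print(errors)
```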

Model selection

hyperparameters

bias-variance tradeoff

generalization error

validation set/cross validation

Predictive modeling pipeline

1. Set aside data for test (estimate generalization error)

2. Set aside data for validation (if there are hyperparameters to tune)

3. Run algorithms

4. Find best/optimal hyperparameters (on validation set)

5. Choose final model

6. Estimate generalization error on test set
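
The six steps above can be sketched in base R. Everything concrete here is an assumption for illustration: the simulated data, the 60/20/20 split, and the polynomial degree as the single hyperparameter.

```r
set.seed(42)

n <- 300
dat <- data.frame(x = runif(n, -3, 3))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

# Steps 1-2: set aside test and validation data
part <- sample(rep(c("train", "validation", "test"), times = c(180, 60, 60)))
train      <- dat[part == "train", ]
validation <- dat[part == "validation", ]
test       <- dat[part == "test", ]

mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)

# Steps 3-4: run the algorithm for each hyperparameter value, score on validation data
degrees <- 1:8
val_mse <- sapply(degrees, function(d) {
  mse(lm(y ~ poly(x, d), data = train), validation)
})

# Step 5: choose the final model
best_degree <- degrees[which.min(val_mse)]
final_fit <- lm(y ~ poly(x, best_degree), data = train)

# Step 6: estimate the generalization error on the untouched test set
test_mse <- mse(final_fit, test)

print(best_degree)
print(test_mse)
```

The test set is used exactly once, after all modeling choices are fixed, so that the reported error is not biased by the selection step.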

No free lunch theorem

accuracy-complexity-interpretability tradeoff

different models work in different domains

...but more data always wins

Supervised learning in R

the caret package

package for supervised learning

does not contain the methods itself, just a framework

compare methods on hold-out data

specific algorithms are part of other courses
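
A minimal sketch of such a comparison with caret follows. It assumes the caret package is installed; the dataset (`iris`), the two methods (`"knn"` and `"lda"`), and the tuning grid are illustrative choices and not taken from the lecture.

```r
library(caret)

set.seed(2024)

# Hold out 20% of the data, stratified by class
in_train  <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[in_train, ]
holdout   <- iris[-in_train, ]

# Same resampling scheme for every method: 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

fit_knn <- train(Species ~ ., data = train_set, method = "knn",
                 trControl = ctrl, tuneGrid = data.frame(k = c(3, 5, 7)))
fit_lda <- train(Species ~ ., data = train_set, method = "lda",
                 trControl = ctrl)

# Compare the methods on the hold-out data
acc <- sapply(list(knn = fit_knn, lda = fit_lda), function(fit) {
  mean(predict(fit, newdata = holdout) == holdout$Species)
})

print(acc)
```

The same `train()` interface is used for both models; only the `method` string changes, which is what makes caret a framework and not a collection of algorithms.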

Probability in R

Prefix  Description        Example
r       Random draw        rnorm
d       Density function   dbinom
q       Quantile function  qbeta
p       CDF                pgamma
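
Each of the four example functions can be called as follows. The parameter values are arbitrary and chosen only for illustration.

```r
set.seed(123)

draws  <- rnorm(5, mean = 0, sd = 1)             # r: five random draws from N(0, 1)
dens   <- dbinom(3, size = 10, prob = 0.5)       # d: P(X = 3) for X ~ Binomial(10, 0.5)
median <- qbeta(0.5, shape1 = 2, shape2 = 2)     # q: median of Beta(2, 2)
prob   <- pgamma(1, shape = 1, rate = 1)         # p: P(X <= 1) for X ~ Gamma(1, 1)

# q and p are inverses of each other
round_trip <- pnorm(qnorm(0.975))

print(c(dens = dens, median = median, prob = prob, round_trip = round_trip))
```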

Big data

"Big data is like teenage sex:

everyone talks about it,

nobody really knows how to do it,

everyone thinks everyone else is doing it,

so everyone claims they are doing it..."

- Dan Ariely

Big data is relative...

... to computational complexity

$O(N)$: $10^{12}$
$O(N^2)$: $10^{6}$
$O(N^3)$: $10^{4}$
$O(2^N)$: $50$

We need algorithms that scale!
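
One possible reading of these figures, which is an interpretation and not stated on the slide, is that each number is a problem size N at which the corresponding complexity class needs a comparable number of basic operations. The arithmetic can be checked directly:

```r
# Problem sizes as listed for each complexity class
sizes <- c("O(N)" = 1e12, "O(N^2)" = 1e6, "O(N^3)" = 1e4, "O(2^N)" = 50)

# Operation count as a function of N for each class
cost <- list(
  "O(N)"   = function(n) n,
  "O(N^2)" = function(n) n^2,
  "O(N^3)" = function(n) n^3,
  "O(2^N)" = function(n) 2^n
)

ops <- mapply(function(f, n) f(n), cost, sizes)
print(ops)
```

The polynomial classes all land on 10^12 operations, while 2^50 is about 10^15, so an exponential algorithm is already out of reach for inputs that would be trivial for a linear one.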

Examples

Linear regression: $O(P^2 \cdot N)$

Gaussian processes: $O(N^3)$

Support vector machines: $O(N^3)$ or $O(N^2)$

Random forests: $O(T \cdot P \cdot N \cdot \log(N))$

Topic models: $\approx O(I \cdot N)$

Quicksort: $O(N \cdot \log(N))$

Big data in R

R stores data in RAM

integers: 4 bytes

numerics: 8 bytes

A matrix with 100 million rows and 5 columns of numerics (size in GiB):

$100000000 \cdot 5 \cdot 8 / 1024^3 \approx 3.7$
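
The estimate can be reproduced in R, and the per-element sizes can be checked on smaller vectors with `object.size()`. The vector length of one million is an arbitrary choice for the check.

```r
# Memory needed for 100 million rows x 5 columns of 8-byte numerics
n_rows <- 1e8
n_cols <- 5
bytes  <- n_rows * n_cols * 8
gib    <- bytes / 1024^3
print(gib)  # roughly 3.7

# Check the per-element sizes on vectors that fit comfortably in RAM
int_bytes <- as.numeric(object.size(integer(1e6)))
dbl_bytes <- as.numeric(object.size(numeric(1e6)))
print(c(integer = int_bytes, numeric = dbl_bytes))
```

Apart from a small fixed header per object, the numeric vector takes twice the memory of the integer vector.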

How to handle

Handle chunkwise

Subsampling

More hardware

C++/Java backend (dplyr)

Reduce data in memory

Database backend
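
Chunkwise handling can be sketched in base R by reading a file through an open connection a fixed number of lines at a time, so that only one chunk is in memory at once. The temporary file, the column names, and the chunk size below are assumptions made for the example.

```r
set.seed(1)

# Create a small headerless CSV file to stand in for a large one
path <- tempfile(fileext = ".csv")
write.table(data.frame(x = 1:1000, y = rnorm(1000)), path,
            sep = ",", row.names = FALSE, col.names = FALSE)

con <- file(path, open = "r")
chunk_size <- 250
total <- 0
n <- 0

repeat {
  lines <- readLines(con, n = chunk_size)  # continues where the last read stopped
  if (length(lines) == 0) break            # end of file
  chunk <- read.csv(text = lines, header = FALSE, col.names = c("x", "y"))
  total <- total + sum(chunk$x)            # update running statistics
  n <- n + nrow(chunk)
}
close(con)

chunk_mean <- total / n
print(chunk_mean)
```

This works for any statistic that can be updated incrementally, such as sums, counts, and means.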

If not enough

Spark and SparkR

Fast cluster computations for ML/statistics

Data munging

using dplyr and tidyr

Tidy data

Theoretical approach to data handling

Similar to Codd's third normal form

Tidy data and messy data

Tidy data

1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.

Example data: iris, faithful

Why tidy?

80% of the work is data munging

Analysis and visualization are based on tidy data

Performant code

Data analysis pipeline

messy data → tidy data → analysis

Messy data

1. Column headers are values, not variable names. Example data: AirPassengers

2. Multiple variables are stored in one column. Example data: mtcars

3. Variables are stored in both rows and columns. Example data: crimtab

4. Multiple types of observational units are stored in the same table.

5. A single observational unit is stored in multiple tables.
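
The first kind of messiness can be shown with AirPassengers, which prints as a year-by-month table whose column headers are month values. The sketch below assumes the tidyr package is installed and uses `pivot_longer()`; the wide table is built explicitly here, and the column names `year`, `month`, and `passengers` are chosen for the example.

```r
library(tidyr)

# Wide layout: one row per year, one column per month (column headers are values)
wide <- data.frame(
  year = 1949:1960,
  matrix(as.numeric(AirPassengers), ncol = 12, byrow = TRUE,
         dimnames = list(NULL, month.abb))
)

# Tidy layout: one row per observation (year, month), one column per variable
long <- pivot_longer(wide, cols = -year,
                     names_to = "month", values_to = "passengers")

print(head(long))
```

After tidying, month is an ordinary variable that can be filtered, grouped, or plotted like any other column.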

dplyr

Verbs for handling data

Highly optimized C++ code (backend)

Handling larger datasets in R

(no copy-on-modify)
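
A short example of the main verbs chained together, assuming the dplyr package is installed. The dataset, the filter threshold, and the derived column `kpl` are illustrative choices.

```r
library(dplyr)

by_cyl <- mtcars %>%
  filter(hp > 100) %>%                       # keep rows matching a condition
  mutate(kpl = mpg * 0.425) %>%              # add a column (approx. km per litre)
  group_by(cyl) %>%                          # define groups
  summarise(n = n(), mean_kpl = mean(kpl)) %>%  # one row per group
  arrange(desc(mean_kpl))                    # sort the result

print(by_cyl)
```

Each verb takes a data frame and returns a data frame, which is what allows the steps to be chained with `%>%`.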

dplyr + tidyr

Advanced R - Lecture 7

By monsmagn

Lecture 7 in the course Advanced R programming at Linköping University.
