Måns Magnusson
Automatically detect patterns in data
Predict future observations
Decision making under uncertainty
Supervised learning
Unsupervised learning
Reinforcement learning
(also called predictive learning)
response variable
covariates/features
training set
If the response variable is categorical: classification
If the response variable is real-valued: regression
(also called knowledge discovery)
dimensionality reduction
clustering, PCA, discovery of graph structures
latent variable modeling
The more variables, the larger the distance between data points
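The effect can be checked numerically. The following is a minimal sketch, assuming points drawn uniformly from the unit cube (the slide does not specify a distribution); the function name `mean_pairwise_distance` is made up for this illustration.

```python
import math
import random


def mean_pairwise_distance(n_points, n_dims, seed=0):
    """Average Euclidean distance between points drawn uniformly from the unit cube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    total, count = 0.0, 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            total += math.dist(points[i], points[j])
            count += 1
    return total / count


# Same number of data points, increasing number of variables (dimensions).
distances = {d: mean_pairwise_distance(100, d) for d in (1, 2, 10, 100)}
for d, dist in distances.items():
    print(f"{d:>3} variables: mean distance {dist:.2f}")
```

With a fixed number of observations the average distance keeps growing as variables are added, so every point ends up far from its neighbours.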
Underfit = high bias, low variance
Overfit = low bias, high variance
hyperparameters
bias-variance tradeoff
generalization error
validation set / cross-validation
1. Set aside data for test (estimate generalization error)
2. Set aside data for validation (if the model has hyperparameters)
3. Run algorithms
4. Find best/optimal hyperparameters (on validation set)
5. Choose final model
6. Estimate generalization error on test set
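The six steps above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: the data are synthetic (y = x^2 plus noise), the model is a hand-written k-nearest-neighbour regressor, and its hyperparameter is k. The helper names `knn_predict` and `mse` are made up for this example.

```python
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean response of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, data, k):
    """Mean squared error of the k-NN model (fitted on train) evaluated on data."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in data) / len(data)


rng = random.Random(1)
# Synthetic data (assumption): y = x^2 + Gaussian noise.
xs = [rng.uniform(-1, 1) for _ in range(300)]
data = [(x, x * x + rng.gauss(0, 0.1)) for x in xs]
rng.shuffle(data)

# Steps 1-2: set aside test and validation data, keep the rest for training.
test, validation, train = data[:60], data[60:120], data[120:]

# Steps 3-4: run the algorithm for each candidate k and compare on the validation set.
candidates = [1, 3, 5, 10, 25, 50]
validation_error = {k: mse(train, validation, k) for k in candidates}
best_k = min(validation_error, key=validation_error.get)

# Steps 5-6: choose the final model, then estimate generalization error on the test set.
test_error = mse(train, test, best_k)
print(f"best k = {best_k}, test MSE = {test_error:.4f}")
```

The test set is touched exactly once, after k has been fixed, so the reported error is not biased by the hyperparameter search.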
accuracy-complexity-interpretability tradeoff
different models work in different domains
...but more data always wins
Linear models
Non-linear models
package for supervised learning
does not contain methods, just a framework
compare methods on hold-out data
"Big data is like teenage sex:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it..."
- Dan Ariely
... to computational complexity
We need algorithms that scale!
Linear regression
Gaussian processes
Support vector machines
Random forests
Topic models
Quicksort
R stores data in RAM
integers: 4 bytes
numerics: 8 bytes
A numeric matrix with 100 million rows and 5 columns
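The memory needed for such a matrix follows directly from the byte sizes above. A rough sketch of the arithmetic, counting only the raw values and ignoring any per-object overhead (the helper `matrix_bytes` is made up for this example):

```python
BYTES_PER_INTEGER = 4
BYTES_PER_NUMERIC = 8


def matrix_bytes(n_rows, n_cols, bytes_per_value):
    """Raw storage for a dense matrix, ignoring any object overhead."""
    return n_rows * n_cols * bytes_per_value


size = matrix_bytes(100_000_000, 5, BYTES_PER_NUMERIC)
print(f"{size / 1e9:.1f} GB ({size / 2**30:.2f} GiB)")
```

That is about 4 GB for a single copy of the data, before any intermediate copies made during analysis.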
Handle chunkwise
Subsampling
More hardware
C++/Java backend (dplyr)
Reduce data in memory
Database backend
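As an illustration of the first strategy, chunkwise handling, the sketch below computes a mean while holding only one chunk of rows in memory at a time. It assumes a CSV source with a column named `value`; the helpers `read_chunks` and `chunked_mean` and the small in-memory file are made up for this example.

```python
import csv
import io
from itertools import islice


def read_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_mean(rows, column, chunk_size=1000):
    """Mean of one column, keeping only running totals between chunks."""
    total, count = 0.0, 0
    for chunk in read_chunks(rows, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count


# A small in-memory CSV stands in for a file too large to load at once.
csv_text = "value\n" + "\n".join(str(i) for i in range(1, 10_001))
reader = csv.DictReader(io.StringIO(csv_text))
result = chunked_mean(reader, "value", chunk_size=1000)
print(result)
```

This works for any statistic that can be updated from running totals (sums, counts, minima, maxima); statistics such as the median need a different approach.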
Spark and sparklyr
Fast cluster computations for ML and statistics
Theoretical approach to data handling
Similar to Codd's normal forms (third normal form)
Tidy data and messy data
Tidy data
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
iris
faithful
80% of the work is data munging
Analysis and visualization are based on tidy data
Performant code
Data analysis pipeline
messy data -> tidy data -> analysis
1. Column headers are values, not variable names.
Example data: AirPassengers
2. Multiple variables are stored in one column.
Example data: mtcars
3. Variables are stored in both rows and columns.
Example data: crimtab
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.
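The first kind of messiness, column headers that are values, is fixed by turning those headers into a column of their own. A minimal sketch in plain Python, assuming a small made-up table with month names as headers (the numbers are toy values, not taken from any dataset, and the helper `melt` is written for this illustration):

```python
def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column header into a value of a new variable column."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column != id_column:
                tidy.append({
                    id_column: row[id_column],
                    variable_name: column,
                    value_name: value,
                })
    return tidy


# Toy data (made up): one row per year, months as column headers.
messy = [
    {"year": 2001, "Jan": 10, "Feb": 12},
    {"year": 2002, "Jan": 11, "Feb": 15},
]

tidy = melt(messy, "year", "month", "count")
for row in tidy:
    print(row)
```

After the reshaping each variable (year, month, count) forms a column and each observation forms a row, which matches the tidy data rules.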
Verbs for handling data
Highly optimized C++ code (backend)
Handling larger datasets in R
(no copy-on-modify)