Advanced R programming
Lecture 7
Måns Magnusson
Statistics and Machine learning
Department of computer and information science
Since last time?
Machine learning
Machine learning?
Advanced R Programming
Måns Magnusson
Automatically detect patterns in data
Predict future observation
Decision making under uncertainty
Types of Machine learning
Supervised learning
Unsupervised learning
Reinforcement learning
Supervised learning
(also called predictive learning)
response variable
covariates/features
training set
Supervised learning types
If the response is categorical: classification
If the response is real-valued: regression
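A minimal sketch of the two cases, using the built-in `faithful` and `mtcars` data. The choice of `lm` and logistic regression via `glm` is an assumption made for illustration; the slides do not prescribe a method.

```r
# Regression: the response (waiting time) is real-valued.
reg_fit  <- lm(waiting ~ eruptions, data = faithful)
reg_pred <- predict(reg_fit, newdata = data.frame(eruptions = 3.5))

# Classification: the response (am: 0 = automatic, 1 = manual) is categorical.
clf_fit   <- glm(am ~ wt, data = mtcars, family = binomial)
clf_prob  <- predict(clf_fit, newdata = data.frame(wt = 2.5), type = "response")
clf_class <- ifelse(clf_prob > 0.5, "manual", "automatic")

print(reg_pred)   # a number: predicted waiting time
print(clf_class)  # a label: predicted class
```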
Unsupervised learning
also called knowledge discovery,
dimensionality reduction
clustering, PCA, discovery of graph structures
latent variable modeling
Curse of dimensionality
The more variables (dimensions), the larger the distance between data points
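A small simulation can illustrate this. The sketch below assumes points drawn uniformly from the unit hypercube; the sample size and the list of dimensions are arbitrary choices.

```r
set.seed(1)
n    <- 200                       # number of points (arbitrary)
dims <- c(1, 2, 10, 100, 1000)    # numbers of variables to try

mean_dist <- sapply(dims, function(p) {
  x <- matrix(runif(n * p), nrow = n, ncol = p)  # n points in p dimensions
  mean(dist(x))                                   # mean pairwise Euclidean distance
})

print(data.frame(dimensions = dims, mean_distance = mean_dist))
```

The mean distance grows with the number of dimensions even though every coordinate stays in [0, 1].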
Bias and variance in ML
Underfit = high bias, low variance
Overfit = low bias, high variance
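One way to see under- and overfitting is to fit polynomials of increasing degree to noisy data. This is a sketch under assumptions: the true function `f`, the noise level, and the degrees 1, 3 and 15 are all made up for illustration.

```r
set.seed(1)
f   <- function(x) sin(2 * pi * x)   # assumed true function
sim <- function(n) {
  x <- runif(n)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 0.3))
}
train <- sim(30)
test  <- sim(1000)

# Mean squared prediction error of a fitted model on a data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

degrees <- c(1, 3, 15)
errors <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d, train_mse = mse(fit, train), test_mse = mse(fit, test))
}))
print(errors)
```

Training error can only go down as the degree increases, so it says little about fit quality; the error on new data is what reveals an underfit (degree too low) or an overfit (degree too high).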
Model selection
hyperparameters
bias-variance tradeoff
generalization error
validation set/cross validation
Predictive modeling pipeline
1. Set aside data for test (estimate generalization error)
2. Set aside data for validation (if the model has hyperparameters)
3. Run algorithms
4. Find best/optimal hyperparameters (on validation set)
5. Choose final model
6. Estimate generalization error on test set
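The six steps can be sketched in base R. The 60/20/20 split, the `faithful` data, and the polynomial degree as the only hyperparameter are assumptions made for this example.

```r
set.seed(1)
n   <- nrow(faithful)
idx <- sample(n)                       # random permutation of the row indices

# Steps 1-2: set aside test and validation data
n_test <- floor(0.2 * n)
n_val  <- floor(0.2 * n)
test   <- faithful[idx[seq_len(n_test)], ]
val    <- faithful[idx[n_test + seq_len(n_val)], ]
train  <- faithful[idx[-seq_len(n_test + n_val)], ]

mse <- function(fit, data) mean((data$waiting - predict(fit, newdata = data))^2)

# Steps 3-4: fit one model per hyperparameter value, score on the validation set
degrees <- 1:6
val_mse <- sapply(degrees, function(d) {
  mse(lm(waiting ~ poly(eruptions, d), data = train), val)
})
best_degree <- degrees[which.min(val_mse)]

# Step 5: choose the final model (here refit on training + validation data)
final_fit <- lm(waiting ~ poly(eruptions, best_degree), data = rbind(train, val))

# Step 6: estimate the generalization error once, on the untouched test set
test_mse <- mse(final_fit, test)
print(c(best_degree = best_degree, test_mse = test_mse))
```

The test set is used only in the last step; using it to pick the degree would make the error estimate too optimistic.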
No free lunch theorem
accuracy-complexity-interpretability tradeoff
different models work in different domains
...but more data always wins
Supervised learning in R
the caret package
package for supervised learning
does not contain the methods itself, just a framework
compare methods on hold-out-data
specific algorithms are part of other courses
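A sketch of comparing two methods on hold-out data with caret, assuming the `caret` and `rpart` packages are installed. The `iris` data, the 80/20 split, and the methods `"knn"` and `"rpart"` are choices made for illustration only.

```r
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("rpart", quietly = TRUE)) {
  library(caret)
  set.seed(1)

  # Hold out 20% of the rows for the final comparison
  in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
  training <- iris[in_train, ]
  holdout  <- iris[-in_train, ]

  # 5-fold cross-validation on the training data tunes each method
  ctrl    <- trainControl(method = "cv", number = 5)
  methods <- c("knn", "rpart")
  fits <- lapply(methods, function(m) {
    train(Species ~ ., data = training, method = m, trControl = ctrl)
  })

  # Accuracy of each tuned model on the hold-out data
  holdout_acc <- sapply(fits, function(fit) {
    mean(predict(fit, newdata = holdout) == holdout$Species)
  })
  names(holdout_acc) <- methods
  print(holdout_acc)
}
```

The same `train()` call works for both methods; only the `method` string changes.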
Probability in R
Prefix | Description | Example |
---|---|---|
r | Random draw | rnorm |
d | Density function | dbinom |
q | Quantile function | qbeta |
p | CDF | pgamma |
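The four prefixes in use, one call each. The distribution parameters are arbitrary example values.

```r
set.seed(1)
x <- rnorm(5, mean = 0, sd = 1)           # r: five random draws from N(0, 1)
d <- dbinom(3, size = 10, prob = 0.5)     # d: P(X = 3) for X ~ Binomial(10, 0.5)
q <- qbeta(0.5, shape1 = 2, shape2 = 2)   # q: median of Beta(2, 2)
p <- pgamma(1, shape = 1, rate = 1)       # p: P(X <= 1) for X ~ Gamma(1, 1)

print(d)   # 120 / 1024 = 0.1171875
print(q)   # 0.5, since Beta(2, 2) is symmetric around 0.5
print(p)   # 1 - exp(-1), about 0.632

# The q and p functions are inverses of each other
print(all.equal(qnorm(pnorm(1.5)), 1.5))
```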
Big data
"Big data is like teenage sex:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it..."
- Dan Ariely
Big data is relative...
... to computational complexity
We need algorithms that scale!
Examples
Linear regression
Gaussian processes
Support vector machines
Random forests
Topic models
Quicksort
Big data in R
R stores data in RAM
integers: 4 bytes
numerics: 8 bytes
Example: a matrix with 100 million rows and 5 columns of numerics
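The memory needed for such a matrix follows directly from the 8 bytes per numeric. The check with `object.size` on a small matrix is an added illustration.

```r
rows <- 100e6
cols <- 5
bytes_per_numeric <- 8

bytes <- rows * cols * bytes_per_numeric
print(bytes / 1e9)    # 4 gigabytes (GB)
print(bytes / 2^30)   # about 3.73 gibibytes (GiB)

# Check on a small matrix: the data take nrow * ncol * 8 bytes,
# plus a small fixed overhead for the object header.
small <- matrix(0, nrow = 1000, ncol = cols)
print(object.size(small))
```

Any operation that copies the matrix needs the same amount of memory again.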
How to handle
Handle chunkwise
Subsampling
More hardware
C++/Java backend (dplyr)
Reduce data in memory
Database backend
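Chunkwise handling can be sketched in base R by reading a file through an open connection a fixed number of lines at a time. The temporary CSV file stands in for a file too large for RAM; the chunk size and the computed statistic (a mean) are arbitrary.

```r
# Write an example file (stand-in for a file too large for RAM)
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000, y = sqrt(1:1000)), path, row.names = FALSE)

chunk_size <- 100                      # rows per chunk (arbitrary)
con    <- file(path, open = "r")
header <- readLines(con, n = 1)        # keep the header line for parsing each chunk

total  <- 0
n_rows <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break        # end of file
  chunk  <- read.csv(text = c(header, lines))
  total  <- total + sum(chunk$y)       # update running sums only
  n_rows <- n_rows + nrow(chunk)
}
close(con)

chunk_mean <- total / n_rows
print(chunk_mean)
```

Only one chunk is in memory at a time. This works for statistics that can be updated incrementally, such as sums, counts and means.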
If not enough
Spark and SparkR
Fast cluster computations for ML/statistics
Data munging
using dplyr and tidyr
Tidy data
Theoretical approach to data handling
Similar to Codd's normal forms (3rd)
Tidy data and messy data
Tidy data
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
iris
faithful
Why tidy?
80 % of work is data munging
Analysis and visualization are based on tidy data
Performant code
Data analysis pipeline
messy data
tidy data
analysis
Messy data
1. Column headers are values, not variable names.
Example data: AirPassengers
2. Multiple variables are stored in one column.
Example data: mtcars
3. Variables are stored in both rows and columns.
Example data: crimtab
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.
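A sketch of fixing the first kind of messiness (column headers are values) with tidyr, assuming the `tidyr` package (version 1.0 or later, for `pivot_longer`) is installed. The wide year-by-month layout of `AirPassengers` is built by hand here for illustration.

```r
if (requireNamespace("tidyr", quietly = TRUE)) {
  library(tidyr)

  # Messy: one row per year, one column per month (the headers Jan..Dec are values)
  wide <- data.frame(
    year = 1949:1960,
    matrix(as.numeric(AirPassengers), ncol = 12, byrow = TRUE,
           dimnames = list(NULL, month.abb))
  )

  # Tidy: one row per (year, month) observation, one column per variable
  tidy <- pivot_longer(wide, cols = -year,
                       names_to = "month", values_to = "passengers")
  print(head(tidy, 3))
}
```

The tidy version has three columns (`year`, `month`, `passengers`) and one row for each of the 144 monthly observations.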
dplyr
Verbs for handling data
Highly optimized C++ code (backend)
Handling larger datasets in R
(no copy-on-modify)
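The main verbs chained on `mtcars`, assuming the `dplyr` package is installed. The filter threshold and the unit conversion are arbitrary choices for the example.

```r
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)

  kpl_by_cyl <- mtcars %>%
    filter(hp > 100) %>%                            # filter: keep rows
    select(mpg, cyl, wt) %>%                        # select: keep columns
    mutate(kpl = mpg * 0.4251) %>%                  # mutate: add a column (km per litre)
    group_by(cyl) %>%                               # group_by: split by a variable
    summarise(mean_kpl = mean(kpl), n = n()) %>%    # summarise: one row per group
    arrange(desc(mean_kpl))                         # arrange: sort rows
  print(kpl_by_cyl)
}
```

Each verb takes a data frame and returns a data frame, which is what makes the chaining with `%>%` possible.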
dplyr + tidyr
Advanced R - Lecture 7
By monsmagn
Lecture 7 in the course Advanced R programming at Linköping University.