Machine Learning Algorithms

Symbolism
Statistical
Connectionism

History of ML

Structure of modeling

  • fetch data
  • explore
  • preprocessing
  • models
  • techniques

fetch data

explore data

Preprocessing

  • categorical features
  • scaling
  • missing data (see the sketch below)
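
A minimal preprocessing sketch along these lines with scikit-learn; the tiny DataFrame and its column names are invented here for illustration:

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # toy frame with one categorical and one numeric column, both with missing values
    df = pd.DataFrame({"city": ["A", "B", np.nan, "A"],
                       "income": [40.0, np.nan, 55.0, 62.0]})

    preprocess = ColumnTransformer([
        # numeric: fill missing values, then scale to zero mean / unit variance
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), ["income"]),
        # categorical: fill missing values, then one-hot encode
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
    ])
    print(preprocess.fit_transform(df))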

model selection

No free lunch theorem

model selection

Error = Bias(h(\theta))^2 + Var(h(\theta)) + \epsilon^2
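
A tiny illustration of this decomposition on synthetic data (invented here for illustration): a too-simple model is dominated by the bias term, a too-flexible one by the variance term, and the noise ε² caps how well either can do on held-out data.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=80).reshape(-1, 1)
    y = np.sin(x).ravel() + rng.normal(scale=0.3, size=80)   # irreducible noise ~ epsilon^2

    x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)
    for degree in [1, 4, 15]:   # underfit (high bias) -> balanced -> overfit (high variance)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
        print(degree, "train R2 %.2f" % model.score(x_tr, y_tr),
              "test R2 %.2f" % model.score(x_te, y_te))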

model selection

Interpretability

Flexibility

SKLEARN

estimator = SomeEstimator(**params)   # placeholder: any scikit-learn estimator

estimator.fit(Xtrain, ytrain)

estimator.predict(Xtest)
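
A minimal end-to-end sketch of this fit/predict pattern; the dataset and the choice of estimator here are just for illustration:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=0)

    estimator = LogisticRegression(max_iter=1000)   # any sklearn estimator exposes the same API
    estimator.fit(Xtrain, ytrain)                   # learn parameters from the training set
    preds = estimator.predict(Xtest)                # predict labels for unseen data
    print("accuracy:", estimator.score(Xtest, ytest))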

what are we doing?

R_{D}(h(x)) = \frac{1}{N} \sum_{y_i \in D} (y_i - h(x_i))^2

g(x) = \arg\min_{h(x) \in H} R_{D}(h(x))
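
Read concretely: R_D is the mean squared error on the sample D, and learning returns the hypothesis in H with the smallest empirical risk. A toy sketch with an invented finite hypothesis class of lines h(x) = a·x + b:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=50)
    y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=50)     # toy sample D

    def empirical_risk(h, x, y):
        """Mean squared error of hypothesis h on the sample D = (x, y)."""
        return np.mean((y - h(x)) ** 2)

    # H: a small grid of candidate lines h(x) = a*x + b
    H = [(a, b) for a in np.linspace(-3, 3, 13) for b in np.linspace(-1, 1, 9)]
    a, b = min(H, key=lambda ab: empirical_risk(lambda xs: ab[0] * xs + ab[1], x, y))
    print("g(x) = %.2f x + %.2f" % (a, b))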

estimation risk

linear regression

Y = \beta_0 + \beta_1 X + \epsilon

Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon

\hat\beta = (X^T X)^{-1} X^T Y

R = \sum_{i=1}^N r_i = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2
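
A quick numpy check of the closed-form solution above, on invented data; the intercept is handled by prepending a column of ones to the design matrix:

    import numpy as np

    rng = np.random.default_rng(1)
    N = 100
    x = rng.uniform(0, 10, size=N)
    y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=N)   # true beta_0 = 3, beta_1 = 2

    X = np.column_stack([np.ones(N), x])                # design matrix with intercept column
    beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y         # (X^T X)^{-1} X^T Y
    print(beta_hat)                                     # close to [3, 2]

In practice np.linalg.lstsq (or sklearn's LinearRegression) is preferred over forming the inverse explicitly, for numerical stability.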

Ridge regression

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \sum_{i=0}^j \beta_i^2

LASSO

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \sum_{i=0}^j ||\beta_i||
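
A small scikit-learn sketch of the two penalties; alpha plays the role of α above and the data is synthetic:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 10))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)  # 2 informative features

    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients toward zero
    lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives many coefficients exactly to zero

    print("ridge:", np.round(ridge.coef_, 2))
    print("lasso:", np.round(lasso.coef_, 2))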

Regularization

Elastic Net

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \lambda \sum_{i=0}^j ||\beta_i|| + \alpha (1 - \lambda) \sum_{i=0}^j \beta_i^2 / 2

Regularization

= \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \left( \lambda \sum_{i=0}^j ||\beta_i|| + (1 - \lambda) \sum_{i=0}^j \beta_i^2 / 2 \right)
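
scikit-learn exposes this mixture as ElasticNet, with alpha matching α and l1_ratio matching λ above (up to a 1/(2N) scaling on the squared loss); synthetic data again:

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 10))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

    # alpha ~ overall penalty strength, l1_ratio ~ lambda (balance between L1 and L2)
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    print(np.round(enet.coef_, 2))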

Logistic regression

h(z) = \frac{1}{1 + e^{-z}}
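
The sigmoid squashes a linear score into (0, 1), read as P(y = 1 | x). A short sketch on toy data showing that predict_proba is just the sigmoid of the learned linear score:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0.5], [1.5], [2.5], [3.5], [4.5], [5.5]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[3.0]])[0, 1])                     # P(y=1 | x=3) from sklearn
    print(sigmoid(clf.coef_[0][0] * 3.0 + clf.intercept_[0]))   # same value, computed by hand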

knn

SVM

\mathbf{w}^T x + b = 0

SVM

\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^N \xi_i
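
A minimal SVC sketch; C is the same slack penalty as in the objective above, so a small C tolerates more margin violations and a large C penalizes them heavily. The blob data is synthetic:

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, random_state=0)

    soft = SVC(kernel="linear", C=0.1).fit(X, y)     # softer margin, more slack allowed
    hard = SVC(kernel="linear", C=100.0).fit(X, y)   # harder margin, slack is expensive

    print("support vectors (C=0.1):", len(soft.support_))
    print("support vectors (C=100):", len(hard.support_))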

SVM

SVM

Bayes

p(h|D) = \frac{p(D|h)\,p(h)}{p(D)}

P(y = c_1 | X) = \frac{P(x_1|c_1)\, P(x_2|c_1) \cdots P(y=c_1)}{P(x_1|c_1)\, P(x_2|c_1) \cdots P(y=c_1) + P(x_1|c_0)\, P(x_2|c_0) \cdots P(y=c_0)}
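
A quick GaussianNB sketch; the "naive" step is exactly the factorization into per-feature likelihoods P(x_1|c) P(x_2|c) ... used above:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

    nb = GaussianNB().fit(Xtrain, ytrain)        # per-class, per-feature Gaussian likelihoods
    print("accuracy:", nb.score(Xtest, ytest))
    print("posteriors for one sample:", nb.predict_proba(Xtest[:1]))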

decision tree

decision tree

Ensemble method

Bagging

RandomForest

bootstrap

Ensemble method

Bagging

Ensemble method

RandomForest

Builds on the idea of bagging
Each tree is built from a bootstrap sample
Node splits are calculated from random feature subsets

Ensemble method

RandomForest

All trees are fully grown
No pruning
Two parameters (see the sketch below):
           – Number of trees
           – Number of features
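
A minimal sketch of those two parameters in scikit-learn: n_estimators is the number of trees and max_features is the size of the random feature subset tried at each split.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    # bootstrap=True: each tree is grown on a bootstrap sample of the training data
    rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, random_state=0)
    print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())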

Ensemble

Boosting

Also an ensemble method, like bagging
But:
          – weak learners evolve over time
          – votes are weighted
Often better than bagging in practice

Boosting

Boosting

number of trees
number of splits in each tree (stumps often work well)
parameters controlling how weights evolve
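
A sketch of these knobs with AdaBoost in scikit-learn; the default weak learner is a depth-1 stump, n_estimators is the number of trees, and learning_rate controls how fast the sample and vote weights evolve. The dataset is just for illustration:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    ada = AdaBoostClassifier(n_estimators=300,    # number of weak learners (trees)
                             learning_rate=0.5,   # how aggressively the weights are updated
                             random_state=0)
    print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())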

How to choose?

knn

SVM

RF

deep learning

features

samples

How to choose?

Clustering

k-means

Clustering

mean-shift

1. Put a window around each point
2. Compute the mean of the points in the window
3. Shift the window to the mean
4. Repeat until convergence (see the sketch below)
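
A minimal scikit-learn sketch of this procedure; bandwidth plays the role of the window size, and the blob data is synthetic:

    from sklearn.cluster import MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

    bw = estimate_bandwidth(X, quantile=0.2)   # window size
    ms = MeanShift(bandwidth=bw).fit(X)        # repeatedly shift each window to its local mean
    print("clusters found:", len(ms.cluster_centers_))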
 

single-linkage

 

complete-linkage


average linkage

Clustering

Hierarchical Clustering 
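
The three linkage rules above map directly onto the linkage parameter of AgglomerativeClustering; a short sketch on synthetic blobs:

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

    for linkage in ["single", "complete", "average"]:
        hc = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X)
        sizes = [list(hc.labels_).count(k) for k in range(3)]
        print(linkage, "cluster sizes:", sizes)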

techniques


Principal Component Analysis

Principal Component Analysis

Latent Dirichlet Allocation 

Cross Validation

decision risk

What else besides accuracy / R^2?

Regression

MSE/MAE etc.

Cross Validation

Classification

ROC P-R etc.

ROC

P-R

Cross Validation

TPR = Recall = \frac{TP}{TP+FN}

FPR = \frac{FP}{FP+TN}

Precision = \frac{TP}{TP+FP}

F_1 = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}

F_\beta = \frac{(1 + \beta^2) \cdot Recall \cdot Precision}{\beta^2 \cdot Precision + Recall}
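
A short sketch computing these metrics from predictions; the labels and scores are toy values, and fbeta_score with beta=2 weights recall more heavily than precision:

    from sklearn.metrics import (f1_score, fbeta_score, precision_score,
                                 recall_score, roc_auc_score)

    y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
    y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.35]  # scores for ROC

    print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
    print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
    print("F1       :", f1_score(y_true, y_pred))
    print("F2       :", fbeta_score(y_true, y_pred, beta=2))
    print("ROC AUC  :", roc_auc_score(y_true, y_score))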

Cross Validation

            predict 1    predict 0
fact 1      cost 1       cost 2
fact 0      cost 3       cost 4

cost matrix
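
A sketch of applying such a cost matrix to a confusion matrix; the cost values are placeholders:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    # rows = fact, columns = prediction, ordered [1, 0] to match the table above
    cm = confusion_matrix(y_true, y_pred, labels=[1, 0])

    cost = np.array([[0.0, 5.0],    # fact 1: predicting 1 costs 0, missing it costs 5
                     [1.0, 0.0]])   # fact 0: a false alarm costs 1, predicting 0 costs 0
    print("total cost:", (cm * cost).sum())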

MDS

MDS

MDS

SVD

Learning Path

  • CS229/CS109 – ML/DS
  • CS231n – DL for CV
  • CS224n – DL for NLP
  • CS294 – DL for RL
  • CS246 – DM

ML Algos

By orashi
