Machine Learning Algorithms

Symbolism
Statistical
Connectionism

History of ML

Structure of modeling

  • fetch data
  • explore
  • preprocessing
  • models
  • techniques

fetch data

explore data

Preprocessing

  • categorical features
  • scaling
  • missing data (see the sketch below)
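
A minimal preprocessing sketch along these lines with scikit-learn; the tiny DataFrame and its column names are invented here for illustration:

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # toy frame with one categorical and one numeric column, both with missing values
    df = pd.DataFrame({"city": ["A", "B", np.nan, "A"],
                       "income": [40.0, np.nan, 55.0, 62.0]})

    preprocess = ColumnTransformer([
        # numeric: fill missing values, then scale to zero mean / unit variance
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), ["income"]),
        # categorical: fill missing values, then one-hot encode
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
    ])
    print(preprocess.fit_transform(df))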

model selection

No free lunch theorem

model selection

Error = Bias(h(\theta))^2 + Var(h(\theta)) + \epsilon^2
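
A tiny illustration of this decomposition on synthetic data (invented here for illustration): a too-simple model is dominated by the bias term, a too-flexible one by the variance term, and the noise ε² caps how well either can do on held-out data.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=80).reshape(-1, 1)
    y = np.sin(x).ravel() + rng.normal(scale=0.3, size=80)   # irreducible noise ~ epsilon^2

    x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=0)
    for degree in [1, 4, 15]:   # underfit (high bias) -> balanced -> overfit (high variance)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_tr, y_tr)
        print(degree, "train R2 %.2f" % model.score(x_tr, y_tr),
              "test R2 %.2f" % model.score(x_te, y_te))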

model selection

Interpretability

Flexibility

SKLEARN

estimator = SomeEstimator(**params)   # placeholder: any scikit-learn estimator

estimator.fit(Xtrain, ytrain)

estimator.predict(Xtest)
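
A minimal end-to-end sketch of this fit/predict pattern; the dataset and the choice of estimator here are just for illustration:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=0)

    estimator = LogisticRegression(max_iter=1000)   # any sklearn estimator exposes the same API
    estimator.fit(Xtrain, ytrain)                   # learn parameters from the training set
    preds = estimator.predict(Xtest)                # predict labels for unseen data
    print("accuracy:", estimator.score(Xtest, ytest))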

what are we doing?

R_{D}(h(x)) = \frac{1}{N} \sum_{y_i \in D} (y_i - h(x_i))^2

g(x) = \arg\min_{h(x) \in H} R_{D}(h(x))
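
Read concretely: R_D is the mean squared error on the sample D, and learning returns the hypothesis in H with the smallest empirical risk. A toy sketch with an invented finite hypothesis class of lines h(x) = a·x + b:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=50)
    y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=50)     # toy sample D

    def empirical_risk(h, x, y):
        """Mean squared error of hypothesis h on the sample D = (x, y)."""
        return np.mean((y - h(x)) ** 2)

    # H: a small grid of candidate lines h(x) = a*x + b
    H = [(a, b) for a in np.linspace(-3, 3, 13) for b in np.linspace(-1, 1, 9)]
    a, b = min(H, key=lambda ab: empirical_risk(lambda xs: ab[0] * xs + ab[1], x, y))
    print("g(x) = %.2f x + %.2f" % (a, b))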

estimation risk

linear regression

Y = \beta_0 + \beta_1 X + \epsilon

Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon

\hat\beta = (X^T X)^{-1} X^T Y

R = \sum_{i=1}^N r_i = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2
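
A quick numpy check of the closed-form solution above, on invented data; the intercept is handled by prepending a column of ones to the design matrix:

    import numpy as np

    rng = np.random.default_rng(1)
    N = 100
    x = rng.uniform(0, 10, size=N)
    y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=N)   # true beta_0 = 3, beta_1 = 2

    X = np.column_stack([np.ones(N), x])                # design matrix with intercept column
    beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y         # (X^T X)^{-1} X^T Y
    print(beta_hat)                                     # close to [3, 2]

In practice np.linalg.lstsq (or sklearn's LinearRegression) is preferred over forming the inverse explicitly, for numerical stability.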

Ridge regression

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \sum_{i=0}^j \beta_i^2

LASSO

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \sum_{i=0}^j ||\beta_i||
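
A small scikit-learn sketch of the two penalties; alpha plays the role of α above and the data is synthetic:

    import numpy as np
    from sklearn.linear_model import Lasso, Ridge

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 10))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)  # 2 informative features

    ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients toward zero
    lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: drives many coefficients exactly to zero

    print("ridge:", np.round(ridge.coef_, 2))
    print("lasso:", np.round(lasso.coef_, 2))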

Regularization

Elastic Net

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \lambda \sum_{i=0}^j ||\beta_i|| + \alpha (1 - \lambda) \sum_{i=0}^j \beta_i^2 / 2

Regularization

= \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \left( \lambda \sum_{i=0}^j ||\beta_i|| + (1 - \lambda) \sum_{i=0}^j \beta_i^2 / 2 \right)
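
scikit-learn exposes this mixture as ElasticNet, with alpha matching α and l1_ratio matching λ above (up to a 1/(2N) scaling on the squared loss); synthetic data again:

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 10))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

    # alpha ~ overall penalty strength, l1_ratio ~ lambda (balance between L1 and L2)
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    print(np.round(enet.coef_, 2))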

Logistic regression

h(z) = \frac{1}{1 + e^{-z}}
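
The sigmoid squashes a linear score into (0, 1), read as P(y = 1 | x). A short sketch on toy data showing that predict_proba is just the sigmoid of the learned linear score:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    X = np.array([[0.5], [1.5], [2.5], [3.5], [4.5], [5.5]])
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([[3.0]])[0, 1])                     # P(y=1 | x=3) from sklearn
    print(sigmoid(clf.coef_[0][0] * 3.0 + clf.intercept_[0]))   # same value, computed by hand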

knn

SVM

\mathbf{w}^T x + b = 0

SVM

\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^N \xi_i
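
A minimal SVC sketch; C is the same slack penalty as in the objective above, so a small C tolerates more margin violations and a large C penalizes them heavily. The blob data is synthetic:

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, random_state=0)

    soft = SVC(kernel="linear", C=0.1).fit(X, y)     # softer margin, more slack allowed
    hard = SVC(kernel="linear", C=100.0).fit(X, y)   # harder margin, slack is expensive

    print("support vectors (C=0.1):", len(soft.support_))
    print("support vectors (C=100):", len(hard.support_))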

SVM

SVM

Bayes

p(h|D) = \frac{p(D|h)\,p(h)}{p(D)}

P(y = c_1 | X) = \frac{P(x_1|c_1)\, P(x_2|c_1) \cdots P(y=c_1)}{P(x_1|c_1)\, P(x_2|c_1) \cdots P(y=c_1) + P(x_1|c_0)\, P(x_2|c_0) \cdots P(y=c_0)}
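
A quick GaussianNB sketch; the "naive" step is exactly the factorization into per-feature likelihoods P(x_1|c) P(x_2|c) ... used above:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

    nb = GaussianNB().fit(Xtrain, ytrain)        # per-class, per-feature Gaussian likelihoods
    print("accuracy:", nb.score(Xtest, ytest))
    print("posteriors for one sample:", nb.predict_proba(Xtest[:1]))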

decision tree

decision tree

Ensemble method

Bagging

RandomForest

bootstrap

Ensemble method

Bagging

Ensemble method

RandomForest

Builds on the idea of bagging
Each tree is built from a bootstrap sample
Node splits are calculated from random feature subsets

Ensemble method

RandomForest

All trees are fully grown
No pruning
Two parameters (see the sketch below):
           – Number of trees
           – Number of features
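
A minimal sketch of those two parameters in scikit-learn: n_estimators is the number of trees and max_features is the size of the random feature subset tried at each split.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)

    # bootstrap=True: each tree is grown on a bootstrap sample of the training data
    rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                bootstrap=True, random_state=0)
    print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())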

Ensemble

Boosting

Also an ensemble method, like bagging
But:
          – weak learners evolve over time
          – votes are weighted
Often better than bagging in practice

Boosting

Boosting

number of trees
number of splits in each tree (stumps often work well)
parameters controlling how weights evolve
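
A sketch of these knobs with AdaBoost in scikit-learn; the default weak learner is a depth-1 stump, n_estimators is the number of trees, and learning_rate controls how fast the sample and vote weights evolve. The dataset is just for illustration:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    ada = AdaBoostClassifier(n_estimators=300,    # number of weak learners (trees)
                             learning_rate=0.5,   # how aggressively the weights are updated
                             random_state=0)
    print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())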

How to choose?

knn

SVM

RF

deep learning

features

samples

How to choose?

Clustering

k-means

Clustering

mean-shift

1. Put a window around each point
2. Compute the mean of the points in the window
3. Shift the window to the mean
4. Repeat until convergence (see the sketch below)
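
A minimal scikit-learn sketch of this procedure; bandwidth plays the role of the window size, and the blob data is synthetic:

    from sklearn.cluster import MeanShift, estimate_bandwidth
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

    bw = estimate_bandwidth(X, quantile=0.2)   # window size
    ms = MeanShift(bandwidth=bw).fit(X)        # repeatedly shift each window to its local mean
    print("clusters found:", len(ms.cluster_centers_))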
 

single-linkage

 

complete-linkage


average linkage

Clustering

Hierarchical Clustering 
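
The three linkage rules above map directly onto the linkage parameter of AgglomerativeClustering; a short sketch on synthetic blobs:

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

    for linkage in ["single", "complete", "average"]:
        hc = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X)
        sizes = [list(hc.labels_).count(k) for k in range(3)]
        print(linkage, "cluster sizes:", sizes)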

techniques


Principal Component Analysis

Principal Component Analysis

Latent Dirichlet Allocation 

Cross Validation

decision risk

What else besides accuracy / R^2?

Regression

MSE/MAE etc.

Cross Validation

Classification

ROC P-R etc.

ROC

P-R

Cross Validation

TPR = Recall = \frac{TP}{TP+FN}

FPR = \frac{FP}{FP+TN}

Precision = \frac{TP}{TP+FP}

F_1 = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}

F_\beta = \frac{(1 + \beta^2) \cdot Recall \cdot Precision}{\beta^2 \cdot Precision + Recall}
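
A short sketch computing these metrics from predictions; the labels and scores are toy values, and fbeta_score with beta=2 weights recall more heavily than precision:

    from sklearn.metrics import (f1_score, fbeta_score, precision_score,
                                 recall_score, roc_auc_score)

    y_true  = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred  = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
    y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1, 0.95, 0.35]  # scores for ROC

    print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
    print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
    print("F1       :", f1_score(y_true, y_pred))
    print("F2       :", fbeta_score(y_true, y_pred, beta=2))
    print("ROC AUC  :", roc_auc_score(y_true, y_score))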

Cross Validation

            predict 1    predict 0
fact 1      cost 1       cost 2
fact 0      cost 3       cost 4

cost matrix
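
A sketch of applying such a cost matrix to a confusion matrix; the cost values are placeholders:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

    # rows = fact, columns = prediction, ordered [1, 0] to match the table above
    cm = confusion_matrix(y_true, y_pred, labels=[1, 0])

    cost = np.array([[0.0, 5.0],    # fact 1: predicting 1 costs 0, missing it costs 5
                     [1.0, 0.0]])   # fact 0: a false alarm costs 1, predicting 0 costs 0
    print("total cost:", (cm * cost).sum())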

MDS

MDS

MDS

SVD

Learning Path

  • CS229/CS109 – ML/DS
  • CS231n – DL for CV
  • CS224n – DL for NLP
  • CS294 – DL for RL
  • CS246 – DM

ML Algos

By orashi
