Machine Learning / Data Science

motivations!

motivations!

  • the “Big Data Technology and Application” (大数据技术与应用) program
  • Digital Media Technology / Network Engineering  
  • NLP
  • MCM

Symbolism
Statistical
Connectionism

History of ML

Structure of modeling

  • fetch data
  • explore
  • preprocessing
  • models
  • techniques

fetch data

you already knew this

explore data

Preprocessing


preprocessing

  • categorical features
  • feature scaling
  • missing data (see the sketch below)
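A minimal sketch of how these three steps might look in scikit-learn; the toy DataFrame and its column names are made up for illustration.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data: one categorical column, two numeric columns, with missing values
df = pd.DataFrame({
    "city":   ["Beijing", "Shanghai", np.nan, "Beijing"],
    "income": [3.2, 5.1, 4.0, np.nan],
    "age":    [21, 35, 28, 40],
})

preprocess = ColumnTransformer([
    # numeric columns: fill missing values, then scale
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["income", "age"]),
    # categorical column: fill missing values, then one-hot encode
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
])

X = preprocess.fit_transform(df)
print(X.shape)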

model selection

No free lunch theorem

model selection

Error = Bias(h(\theta))^2 + Var(h(\theta)) + \epsilon^2

model selection

Interpretability

Flexibility

SKLEARN

estimator = SomeEstimator(param)   # any scikit-learn estimator, with its hyper-parameters

estimator.fit(Xtrain, ytrain)

estimator.predict(Xtest)           # predict takes features only, not labels
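The same pattern end to end, as a sketch; KNeighborsClassifier and the iris data are arbitrary choices here, and any estimator follows the same construct / fit / predict interface.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=0)

estimator = KNeighborsClassifier(n_neighbors=5)   # construct with hyper-parameters
estimator.fit(Xtrain, ytrain)                     # learn from the training split
ypred = estimator.predict(Xtest)                  # predict on unseen data
print((ypred == ytest).mean())                    # test accuracy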

what are we doing?

R_{D}(h(x)) = \frac{1}{N} \sum_{y_i \in D} (y_i - h(x_i))^2
g(x) = \arg\min_{h(x) \in H} R_{D}(h(x))

estimation risk
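A toy sketch of these two lines: compute R_D for a small candidate set of models h and keep the minimizer g. The hypothesis space and the data below are made up.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + rng.normal(scale=0.1, size=50)

# a tiny hypothesis space H: lines y = a * x for a few slopes a
H = {a: (lambda x, a=a: a * x) for a in (0.0, 1.0, 2.0, 3.0)}

def empirical_risk(h):
    return np.mean((y - h(x)) ** 2)          # R_D(h) on the data D

g = min(H, key=lambda a: empirical_risk(H[a]))
print(g, empirical_risk(H[g]))               # the argmin over H, as in g(x) above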

linear regression

Y = \beta_0 + \beta_1 X + \epsilon
Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon
\hat\beta = (X^T X)^{-1} X^T Y
R = \sum_{i=1}^N r_i = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2
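A sketch on synthetic data: the closed-form \hat\beta = (X^T X)^{-1} X^T Y from above, checked against sklearn's LinearRegression.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# normal equation; prepend a column of ones so beta_0 is the intercept
Xb = np.hstack([np.ones((100, 1)), X])
beta_hat = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
print(beta_hat)                        # roughly [1, 2, -3]

lr = LinearRegression().fit(X, y)
print(lr.intercept_, lr.coef_)         # should agree with beta_hat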

Ridge regression

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \sum_{i=0}^j \beta_i^2

LASSO

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \sum_{i=0}^j |\beta_i|

Regularization

Elastic Net

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \lambda \sum_{i=0}^j |\beta_i| + \alpha (1 - \lambda) \sum_{i=0}^j \beta_i^2 / 2

Regularization

= \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \left( \lambda \sum_{i=0}^j |\beta_i| + (1 - \lambda) \sum_{i=0}^j \beta_i^2 / 2 \right)
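A sketch of the three penalties in scikit-learn; alpha plays the role of \alpha and l1_ratio the role of \lambda (sklearn's penalty matches the factored form above, with the squared-error term additionally averaged over N). The synthetic data is only for illustration.

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)   # only 2 informative features

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# the L1 penalty (Lasso, Elastic Net) tends to push the 8 irrelevant coefficients to exactly 0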

Logistic regression

h(z) = \frac{1}{1 + e^{-z}}
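A sketch showing that LogisticRegression's probabilities are just the sigmoid above applied to a linear score w^T x + b; the breast-cancer dataset is an arbitrary choice.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000).fit(X, y)

# probability of the positive class for the first sample, by hand and via sklearn
z = X[0] @ clf.coef_.ravel() + clf.intercept_[0]
print(sigmoid(z), clf.predict_proba(X[:1])[0, 1])   # the two numbers match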

knn

SVM

\mathbf{w}^T x + b = 0

SVM

\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^N \xi_i
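A soft-margin sketch: C in SVC is the same C that weights the slack variables \xi_i in the objective above; the blob data is synthetic.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.coef_, clf.intercept_)      # the hyperplane w^T x + b = 0
print(len(clf.support_))              # how many points ended up as support vectors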

SVM

SVM

Bayes

p(h|D) = \frac{p(D|h)\,p(h)}{p(D)}
P(y = c_1 | X) = \frac{P(x_1|c_1)\, P(x_2|c_1) \cdots P(y=c_1)}{P(x_1|c_1)\, P(x_2|c_1) \cdots P(y=c_1) + P(x_1|c_0)\, P(x_2|c_0) \cdots P(y=c_0)}
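A sketch of the naive-Bayes factorization in practice; GaussianNB and the iris data are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(Xtr, ytr)        # estimates P(x_j | c) per feature and class
print(nb.score(Xte, yte))              # accuracy
print(nb.predict_proba(Xte[:1]))       # posterior P(y = c | x), as in the formula above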

decision tree

decision tree

Ensemble method

Bagging

RandomForest

bootstrap

Ensemble method

Bagging

Ensemble method

RandomForest

Builds upon the idea of bagging
Each tree is built from a bootstrap sample
Node splits are calculated from random feature subsets

Ensemble method

RandomForest

All trees are fully grown
No pruning
Two parameters (see the sketch below)
           – Number of trees
           – Number of features
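A sketch wiring up the two parameters from this slide; the dataset is an arbitrary built-in one.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,        # number of trees
    max_features="sqrt",     # size of the random feature subset at each split
    random_state=0,
)
rf.fit(Xtr, ytr)
print(rf.score(Xte, yte))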

Ensemble

Boosting

Also an ensemble method, like Bagging
But:
          – weak learners evolve over time
          – votes are weighted
Better than Bagging for many applications 

Boosting

Boosting

number of trees
number of splits in each tree (often stumps work well)
parameters controlling how the weights evolve (see the sketch below)
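A sketch of these knobs with AdaBoost (one boosting algorithm among several): its default weak learner is a depth-1 tree, i.e. a stump, and learning_rate controls how fast the sample and vote weights evolve. The dataset is an arbitrary built-in one.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(
    n_estimators=200,        # number of weak learners (stumps by default)
    learning_rate=0.5,       # how aggressively the weights evolve
    random_state=0,
)
boost.fit(Xtr, ytr)
print(boost.score(Xte, yte))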

How to choose?

[figure: rough guide for choosing among kNN, SVM, RF, and deep learning, as a function of the number of features vs. the number of samples]

How to choose?

Clustering

k-means

Clustering

mean-shift

1. Put a window around each point
2. Compute the mean of the points inside the window
3. Shift the window to the mean
4. Repeat until convergence (see the sketch below)
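A direct NumPy sketch of the four steps, using a flat window of fixed radius (scikit-learn's MeanShift does the same job with a kernel and bandwidth estimation); the two-blob data is synthetic.

import numpy as np

def mean_shift_point(x, X, radius=1.0, max_iter=100, tol=1e-4):
    """Shift one starting point until the mean of its window stops moving."""
    for _ in range(max_iter):
        in_window = X[np.linalg.norm(X - x, axis=1) < radius]   # 1. window around the point
        new_x = in_window.mean(axis=0)                          # 2. mean of points in the window
        if np.linalg.norm(new_x - x) < tol:                     # 4. stop at convergence
            return new_x
        x = new_x                                               # 3. shift the window to the mean
    return x

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
modes = np.array([mean_shift_point(x, X) for x in X])
print(np.unique(modes.round(1), axis=0))   # roughly the two cluster centres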
 

single-linkage

 

complete-linkage


average linkage

Clustering

Hierarchical Clustering 
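A sketch: AgglomerativeClustering's linkage argument selects between the single-, complete-, and average-linkage rules listed above; the blob data is synthetic.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

for linkage in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, labels[:10])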

techniques


Principal Component Analysis

Principal Component Analysis

Latent Dirichlet Allocation 

Cross Validation

decision risk

what else besides accuracy / R^2?

Regression

MSE/MAE etc.

Cross Validation

Classification

ROC P-R etc.

ROC

P-R

Cross Validation

TPR = Recall = \frac{TP}{TP+FN}
FPR = \frac{FP}{FP+TN}
Precision = \frac{TP}{TP+FP}
F_1 = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}
F_\beta = \frac{(1 + \beta^2) \cdot Recall \cdot Precision}{\beta^2 \cdot Precision + Recall}
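A sketch computing these quantities with scikit-learn; the classifier and dataset are placeholders, the point is the metric calls.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (auc, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(Xtr, ytr)

ypred = clf.predict(Xte)
tn, fp, fn, tp = confusion_matrix(yte, ypred).ravel()
print("precision", precision_score(yte, ypred), tp / (tp + fp))   # same number twice
print("recall   ", recall_score(yte, ypred), tp / (tp + fn))
print("F1       ", f1_score(yte, ypred))

fpr, tpr, _ = roc_curve(yte, clf.predict_proba(Xte)[:, 1])         # points on the ROC curve
print("AUC      ", auc(fpr, tpr))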

Cross Validation

             predict 1    predict 0
fact 1       cost 1       cost 2
fact 0       cost 3       cost 4

cost matrix
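A sketch of turning the cost matrix into a single number: weight each confusion-matrix cell by its cost and average. The labels and cost values below are made-up placeholders.

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# rows = fact (1 then 0), columns = predict (1 then 0), mirroring the table above
counts = confusion_matrix(y_true, y_pred, labels=[1, 0])
costs = np.array([[0.0, 5.0],     # correct positive costs nothing, a miss costs 5
                  [1.0, 0.0]])    # a false alarm costs 1, correct negative costs nothing
print((counts * costs).sum() / counts.sum())   # expected cost per prediction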

MDS

MDS

MDS

SVD

prerequisites

prerequisites

  • linear algebra (LA)
  • calculus
  • statistics
  • probability
  • English

Big Picture

The End

ML an overview

By orashi
