Machine Learning / Data Science

motivations!

motivations!

  • the “Big Data Technology and Application” (大数据技术与应用) program
  • Digital Media Technology / Network Engineering  
  • NLP
  • MCM

Symbolism
Statistical
Connectionism

History of ML

Structure of modeling

  • fetch data
  • explore
  • preprocessing
  • models
  • techniques

fetch data

you already knew this

explore data

Preprocessing


preprocessing

  • categorical features
  • feature scaling
  • missing data (see the sketch below)
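A minimal sketch of how these three steps might look in scikit-learn; the toy DataFrame and its column names are made up for illustration.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# toy data: one categorical column, two numeric columns, with missing values
df = pd.DataFrame({
    "city":   ["Beijing", "Shanghai", np.nan, "Beijing"],
    "income": [3.2, 5.1, 4.0, np.nan],
    "age":    [21, 35, 28, 40],
})

preprocess = ColumnTransformer([
    # numeric columns: fill missing values, then scale
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["income", "age"]),
    # categorical column: fill missing values, then one-hot encode
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["city"]),
])

X = preprocess.fit_transform(df)
print(X.shape)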

model selection

No free lunch theorem

model selection

Error = Bias(h(\theta))^2 + Var(h(\theta)) + \epsilon^2

model selection

Interpretability

Flexibility

SKLEARN

estimator = SomeEstimator(param)   # any scikit-learn estimator, with its hyper-parameters

estimator.fit(Xtrain, ytrain)

estimator.predict(Xtest)           # predict takes features only, not labels
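The same pattern end to end, as a sketch; KNeighborsClassifier and the iris data are arbitrary choices here, and any estimator follows the same construct / fit / predict interface.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.3, random_state=0)

estimator = KNeighborsClassifier(n_neighbors=5)   # construct with hyper-parameters
estimator.fit(Xtrain, ytrain)                     # learn from the training split
ypred = estimator.predict(Xtest)                  # predict on unseen data
print((ypred == ytest).mean())                    # test accuracy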

what are we doing?

R_{D}(h(x)) = \frac{1}{N} \sum_{y_i \in D} (y_i - h(x_i))^2
g(x) = \arg\min_{h(x) \in H} R_{D}(h(x))

estimation risk
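A toy sketch of these two lines: compute R_D for a small candidate set of models h and keep the minimizer g. The hypothesis space and the data below are made up.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x + rng.normal(scale=0.1, size=50)

# a tiny hypothesis space H: lines y = a * x for a few slopes a
H = {a: (lambda x, a=a: a * x) for a in (0.0, 1.0, 2.0, 3.0)}

def empirical_risk(h):
    return np.mean((y - h(x)) ** 2)          # R_D(h) on the data D

g = min(H, key=lambda a: empirical_risk(H[a]))
print(g, empirical_risk(H[g]))               # the argmin over H, as in g(x) above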

linear regression

Y = \beta_0 + \beta_1 X + \epsilon
Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \epsilon
\hat\beta = (X^T X)^{-1} X^T Y
R = \sum_{i=1}^N r_i = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2
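A sketch on synthetic data: the closed-form \hat\beta = (X^T X)^{-1} X^T Y from above, checked against sklearn's LinearRegression.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# normal equation; prepend a column of ones so beta_0 is the intercept
Xb = np.hstack([np.ones((100, 1)), X])
beta_hat = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ y
print(beta_hat)                        # roughly [1, 2, -3]

lr = LinearRegression().fit(X, y)
print(lr.intercept_, lr.coef_)         # should agree with beta_hat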

Ridge regression

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \sum_{i=0}^j \beta_i^2

LASSO

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \sum_{i=0}^j |\beta_i|

Regularization

Elastic Net

R(h_j) = \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \lambda \sum_{i=0}^j |\beta_i| + \alpha (1 - \lambda) \sum_{i=0}^j \beta_i^2 / 2

Regularization

= \sum_{i=1}^N (y_i - (\beta_0 + \beta_1 x_i))^2 + \alpha \left( \lambda \sum_{i=0}^j |\beta_i| + (1 - \lambda) \sum_{i=0}^j \beta_i^2 / 2 \right)
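A sketch of the three penalties in scikit-learn; alpha plays the role of \alpha and l1_ratio the role of \lambda (sklearn's penalty matches the factored form above, with the squared-error term additionally averaged over N). The synthetic data is only for illustration.

import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)   # only 2 informative features

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
# the L1 penalty (Lasso, Elastic Net) tends to push the 8 irrelevant coefficients to exactly 0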

Logistic regression

h(z) = \frac{1}{1 + e^{-z}}
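A sketch showing that LogisticRegression's probabilities are just the sigmoid above applied to a linear score w^T x + b; the breast-cancer dataset is an arbitrary choice.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000).fit(X, y)

# probability of the positive class for the first sample, by hand and via sklearn
z = X[0] @ clf.coef_.ravel() + clf.intercept_[0]
print(sigmoid(z), clf.predict_proba(X[:1])[0, 1])   # the two numbers match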

knn

SVM

\mathbf{w}^T x + b = 0

SVM

\min_{\mathbf{w}} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^N \xi_i
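A soft-margin sketch: C in SVC is the same C that weights the slack variables \xi_i in the objective above; the blob data is synthetic.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.coef_, clf.intercept_)      # the hyperplane w^T x + b = 0
print(len(clf.support_))              # how many points ended up as support vectors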

SVM

SVM

Bayes

p(h|D) = \frac{p(D|h)\,p(h)}{p(D)}
P(y = c_1 | X) = \frac{P(x_1|c_1)\, P(x_2|c_1) \cdots P(y=c_1)}{P(x_1|c_1)\, P(x_2|c_1) \cdots P(y=c_1) + P(x_1|c_0)\, P(x_2|c_0) \cdots P(y=c_0)}
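A sketch of the naive-Bayes factorization in practice; GaussianNB and the iris data are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(Xtr, ytr)        # estimates P(x_j | c) per feature and class
print(nb.score(Xte, yte))              # accuracy
print(nb.predict_proba(Xte[:1]))       # posterior P(y = c | x), as in the formula above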

decision tree

decision tree

Ensemble method

Bagging

RandomForest

bootstrap

Ensemble method

Bagging

Ensemble method

RandomForest

Builds upon the idea of bagging
Each tree is built from a bootstrap sample
Node splits are calculated from random feature subsets

Ensemble method

RandomForest

All trees are fully grown
No pruning
Two parameters (see the sketch below)
           – Number of trees
           – Number of features
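A sketch wiring up the two parameters from this slide; the dataset is an arbitrary built-in one.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=200,        # number of trees
    max_features="sqrt",     # size of the random feature subset at each split
    random_state=0,
)
rf.fit(Xtr, ytr)
print(rf.score(Xte, yte))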

Ensemble

Boosting

Also an ensemble method, like Bagging
But:
          – weak learners evolve over time
          – votes are weighted
Better than Bagging for many applications 

Boosting

Boosting

number of trees
number of splits in each tree (often stumps work well)
parameters controlling how the weights evolve (see the sketch below)
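A sketch of these knobs with AdaBoost (one boosting algorithm among several): its default weak learner is a depth-1 tree, i.e. a stump, and learning_rate controls how fast the sample and vote weights evolve. The dataset is an arbitrary built-in one.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(
    n_estimators=200,        # number of weak learners (stumps by default)
    learning_rate=0.5,       # how aggressively the weights evolve
    random_state=0,
)
boost.fit(Xtr, ytr)
print(boost.score(Xte, yte))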

How to choose?

[figure: rough guide for choosing among kNN, SVM, RF, and deep learning, as a function of the number of features vs. the number of samples]

How to choose?

Clustering

k-means

Clustering

mean-shift

1. Put a window around each point
2. Compute the mean of the points inside the window
3. Shift the window to the mean
4. Repeat until convergence (see the sketch below)
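A direct NumPy sketch of the four steps, using a flat window of fixed radius (scikit-learn's MeanShift does the same job with a kernel and bandwidth estimation); the two-blob data is synthetic.

import numpy as np

def mean_shift_point(x, X, radius=1.0, max_iter=100, tol=1e-4):
    """Shift one starting point until the mean of its window stops moving."""
    for _ in range(max_iter):
        in_window = X[np.linalg.norm(X - x, axis=1) < radius]   # 1. window around the point
        new_x = in_window.mean(axis=0)                          # 2. mean of points in the window
        if np.linalg.norm(new_x - x) < tol:                     # 4. stop at convergence
            return new_x
        x = new_x                                               # 3. shift the window to the mean
    return x

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
modes = np.array([mean_shift_point(x, X) for x in X])
print(np.unique(modes.round(1), axis=0))   # roughly the two cluster centres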
 

single-linkage

 

complete-linkage


average linkage

Clustering

Hierarchical Clustering 
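A sketch: AgglomerativeClustering's linkage argument selects between the single-, complete-, and average-linkage rules listed above; the blob data is synthetic.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

for linkage in ("single", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, labels[:10])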

techniques


Principal Component Analysis

Principal Component Analysis

Latent Dirichlet Allocation 

Cross Validation

decision risk

what else besides accuracy / R^2?

Regression

MSE/MAE etc.

Cross Validation

Classification

ROC P-R etc.

ROC

P-R

Cross Validation

TPR = Recall = \frac{TP}{TP+FN}
FPR = \frac{FP}{FP+TN}
Precision = \frac{TP}{TP+FP}
F_1 = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}
F_\beta = \frac{(1 + \beta^2) \cdot Recall \cdot Precision}{\beta^2 \cdot Precision + Recall}
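A sketch computing these quantities with scikit-learn; the classifier and dataset are placeholders, the point is the metric calls.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (auc, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_curve)
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(Xtr, ytr)

ypred = clf.predict(Xte)
tn, fp, fn, tp = confusion_matrix(yte, ypred).ravel()
print("precision", precision_score(yte, ypred), tp / (tp + fp))   # same number twice
print("recall   ", recall_score(yte, ypred), tp / (tp + fn))
print("F1       ", f1_score(yte, ypred))

fpr, tpr, _ = roc_curve(yte, clf.predict_proba(Xte)[:, 1])         # points on the ROC curve
print("AUC      ", auc(fpr, tpr))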

Cross Validation

             predict 1    predict 0
fact 1       cost 1       cost 2
fact 0       cost 3       cost 4

cost matrix
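A sketch of turning the cost matrix into a single number: weight each confusion-matrix cell by its cost and average. The labels and cost values below are made-up placeholders.

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 1])

# rows = fact (1 then 0), columns = predict (1 then 0), mirroring the table above
counts = confusion_matrix(y_true, y_pred, labels=[1, 0])
costs = np.array([[0.0, 5.0],     # correct positive costs nothing, a miss costs 5
                  [1.0, 0.0]])    # a false alarm costs 1, correct negative costs nothing
print((counts * costs).sum() / counts.sum())   # expected cost per prediction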

MDS

MDS

MDS

SVD

prerequisites

prerequisites

  • linear algebra (LA)
  • calculus
  • statistics
  • probability
  • English

Big Picture

The End

ML an overview

By orashi
