Ensemble Learning
and Random Forests

Tzu-Li Tai @ NCKU TechOrange

Learning Path to Data Science - Machine Learning Session #4

Outline

What is ensemble learning?
Why is ensemble useful?
Random Forests
Hands-on Orange practice

First things first...

Who can give me a brief recap of what we have learnt already?

Logistic Regression
K-Nearest Neighbour
Classification Tree (Decision Tree)
SVM

Use 1 sentence to explain what a classifier is!

Given a set of training examples, a classification algorithm outputs a classifier.

The classifier is an hypothesis about the true function

S

f

The Hypothesis space, of a training set

H

f

h_1

h ​ 1 ​ ​

h_2

h ​ 2 ​ ​

h_3

h ​ 3 ​ ​

f

: the true function

(optimal classifier)

h

: different solutions

of the classifier

Ensemble Classification

A popular technique to increase the accuracy of classifiers

An ensemble classifier is a set of classifiers whose individual decision are combined in some way (typically by voting)

whole training data ( instances)

n

Classification learning algorithm

L

Classifier

h

manipulate

training data

n

n ​' ​ ​

n ​' ​ ​

n ​' ​ ​

L

L

L

h_1

h ​ 1 ​ ​

h_2

h ​ 2 ​ ​

h_m

h ​ m ​ ​

What makes a good ensemble?

The overall ensemble classifier will be more accurate than a single individual classifier if ...

The base models are diverse
The base models are accurate

Why Diverse?

h_{ensemble}=\{h_1,h_2,h_3\}

h ​ e n s e m b l e ​ ​ = {h ​ 1 ​ ​, h ​ 2 ​ ​, h ​ 3 ​ ​}

=> Ideal if

h_1,h_2,h_3

h ​ 1 ​ ​, h ​ 2 ​ ​, h ​ 3 ​ ​

make different predictions

Why Accurate?

Suppose m=20 base classifiers

Each classifier has error rate

If majority voting is used, > 10 wrong predictions will result in overall wrong classification result

p

=> Binomial distribution!

--- : p = 0.3, m = 20

--- : p = 0.4, m = 20

--- : p = 0.6, m = 20

--- : Cumulative prob. P(x>10) = 0.01714

--- : Cumulative prob. P(x>10) = 0.12752

--- : Cumulative prob. P(x>10) = 0.75534

Why is ensemble better?

This is going to be kinda vague...

but we can put it into two categories:

Statistical
Computational

Statistical property

f

h_1

h ​ 1 ​ ​

h_2

h ​ 2 ​ ​

h_3

h ​ 3 ​ ​

h_4

h ​ 4 ​ ​

"Average"

Computational property

Different local search

starting points

Methods to constructing ensembles

=> manipulate training data to form m subsets to

generate m hypotheses

n

n ​' ​ ​

n ​' ​ ​

n ​' ​ ​

Bagging
Boosting

Bagging

"Bootstrap Aggregating"

n

training instances =>

randomly select

n

instances with replacement for each subset

=> each subset is expected to have the fraction

1/(1-e)=63.2\%

1 / (1 - e) = 63.2 %

of unique instances in the original training set

n

0.632*n

0.632 * n

0.632*n

0.632 * n

0.632*n

0.632 * n

L

L

L

h_1

h ​ 1 ​ ​

h_2

h ​ 2 ​ ​

h_m

h ​ m ​ ​

Boosting

=> iteratively learn from the constructed base classifiers

AdaBoost

the most famous boosting algorithm...

"Adaptive Boosting"

=> weak to strong

Adaboost

Maintains a weight for all instances in the training data

In each iteration , hypothesis is calculated to minimize the weighted error on the training set.

The weighted error of is applied to the training set to update all weights of all instances in the training data

Overall effect: Place more weight on training examples that were previously misclassified

l

h_l

h ​ l ​ ​

h_l

h ​ l ​ ​

Adaboost

Input:

- Training data

- Initial weights

- Error function

\{(X_1,y_1),(X_2,y_2),...,(X_n,y_n)\}, y \in \{-1,1\}

{(X ​ 1 ​ ​, y ​ 1 ​ ​), (X ​ 2 ​ ​, y ​ 2 ​ ​), . . ., (X ​ n ​ ​, y ​ n ​ ​)}, y \in {- 1, 1}

w_{1,1},w_{2,1},...,w_{n,1} = 1/n

w ​ 1, 1 ​ ​, w ​ 2, 1 ​ ​, . . ., w ​ n, 1 ​ ​ = 1 / n

E(f(x),y,i)=e^{-y_if(x_i)}

E (f (x), y, i) = e ​ - y ​ i ​ ​ f (x ​ i ​ ​) ​ ​

Procedure:

1. for t in (1 to m iterations):

2. Apply learning algorithm to find such that

3. is minimized.

4. Calculate weighted error of ,

5. Update

6. Final ensemble classifier

h_t

h ​ t ​ ​

\epsilon _t=\sum{w_{i,t}}

ϵ ​ t ​ ​ = \sum w ​ i, t ​ ​

\alpha _t=1/2((1-\epsilon_t)/\epsilon_t)

α ​ t ​ ​ = 1 / 2 ((1 - ϵ ​ t ​ ​) / ϵ ​ t ​ ​)

h_t

h ​ t ​ ​

w_{i,t+1} = w_{i,t}*E(\alpha_th_t,y,i) = w_{i,t}e^{-y_i\alpha_th_t}

w ​ i, t + 1 ​ ​ = w ​ i, t ​ ​ * E (α ​ t ​ ​ h ​ t ​ ​, y, i) = w ​ i, t ​ ​ e ​ - y ​ i ​ ​ α ​ t ​ ​ h ​ t ​ ​ ​ ​

h_{ensemble}=\sum\alpha_th_t

h ​ e n s e m b l e ​ ​ = \sum α ​ t ​ ​ h ​ t ​ ​

Adaboost

n~~(weights =w_{i,1})

n (w e i g h t s = w ​ i, 1 ​ ​)

n~~(weights =w_{i,2})

n (w e i g h t s = w ​ i, 2 ​ ​)

n~~(weights =w_{i,3})

n (w e i g h t s = w ​ i, 3 ​ ​)

n~~(weights =w_{i,m})

n (w e i g h t s = w ​ i, m ​ ​)

L

h_1,\alpha_1

h ​ 1 ​ ​, α ​ 1 ​ ​

L

h_2,\alpha_2

h ​ 2 ​ ​, α ​ 2 ​ ​

L

h_m,\alpha_m

h ​ m ​ ​, α ​ m ​ ​

Random Forests

- A Random Forest is an ensemble classifier that uses decision trees build on bagged training data as base classifiers

- The only difference between a Random Forest and a typical bagged tree classifier is that the base trees are randomly split at each node by randomly selecting features.

Random Forests

Procedure RandomForest():

1. Sample bootstraps from the training data

2. for each bootstrap sample in :

3. BuildRandomizedTree()

m

i

1 ~to~ m

1 t o m

h_i =

h ​ i ​ ​ =

h_{ensemble}=\sum{h_i}

h ​ e n s e m b l e ​ ​ = \sum h ​ i ​ ​

Procedure BuildRandomizedTree():

1. Initialize root node of tree to be the whole space

2. At each node newly split:

3. Select random feature space of features

4. From , find a feature that best splits current node

R

r

R

Ensemble Learning and Random Forests

Outline

First things first...

Use 1 sentence to explain what a classifier is!

Ensemble Classification

What makes a good ensemble?

Why Diverse?

Why Accurate?

Why is ensemble better?

Statistical property

Computational property

Methods to constructing ensembles

Bagging

Boosting

AdaBoost

Adaboost

Adaboost

Adaboost

Random Forests

Random Forests

Ensemble Learning
and Random Forests