Tzu-Li Tai @ NCKU TechOrange
Learning Path to Data Science - Machine Learning Session #4
Who can give me a brief recap of what we have learnt already?
Given a set of training examples, a classification algorithm outputs a classifier.
The classifier is a hypothesis about the true function.
The hypothesis space H of a training set:
- f : the true function (optimal classifier)
- h1, h2, ... : different solutions (hypotheses) of the classifier
[Diagram: manipulate the whole training data (N instances) => classification learning algorithm => classifier]
The overall ensemble classifier will be more accurate than a single individual classifier if the base classifiers do better than random guessing and are diverse.
=> Ideal if the base classifiers make different (independent) errors
=> The number of base classifiers that err then follows a Binomial distribution!

--- p = 0.3, m = 20 : Cumulative prob. P(X > 10) = 0.01714
--- p = 0.4, m = 20 : Cumulative prob. P(X > 10) = 0.12752
--- p = 0.6, m = 20 : Cumulative prob. P(X > 10) = 0.75534
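These cumulative probabilities follow directly from the Binomial distribution: with m = 20 independent base classifiers that each err with probability p, a majority vote fails when more than 10 of them err. A minimal stdlib-only check (the function name is mine) that reproduces the three numbers:

```python
from math import comb

def ensemble_error(p, m, k):
    """P(X > k) for X ~ Binomial(m, p): probability that more than k
    of m independent base classifiers (each with error rate p) err."""
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k + 1, m + 1))

for p in (0.3, 0.4, 0.6):
    print(f"p = {p}: P(X > 10) = {ensemble_error(p, 20, 10):.5f}")
```

Note how the ensemble only helps when p < 0.5: at p = 0.6 the majority vote errs about 75% of the time, worse than any single base classifier.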
This is going to be kinda vague...
but we can put it into two categories:
"Averaging" (e.g. Bagging) and "Boosting"

Different local-search starting points
=> manipulate the training data to form m subsets and
generate m hypotheses
"Bootstrap Aggregating" (Bagging)
n training instances =>
randomly select n instances with replacement for each subset
=> each subset is expected to have the fraction 1 - 1/e ≈ 63.2%
of unique instances in the original training set
=> iteratively learn the base classifiers from the constructed subsets and aggregate their predictions
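The "fraction of unique instances" claim is easy to verify empirically: drawing n items with replacement leaves each instance out with probability (1 - 1/n)^n → 1/e, so about 63.2% of the instances appear at least once. A quick stdlib-only check (the variable names are mine):

```python
import random

random.seed(0)
n = 10_000
# one bootstrap subset: draw n instances with replacement
sample = [random.randrange(n) for _ in range(n)]
unique_fraction = len(set(sample)) / n
print(f"unique fraction = {unique_fraction:.3f}")  # close to 1 - 1/e = 0.632
```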
the most famous boosting algorithm...
"Adaptive Boosting" (AdaBoost)
=> turns weak learners into a strong learner
Input:
- Training data (x_1, y_1), ..., (x_n, y_n) with y_i in {-1, +1}
- Initial weights w_i^(1) = 1/n
- Error function E(h(x), y)
Procedure:
1. for t in (1 to m iterations):
2. Apply learning algorithm to find h_t such that
3. the weighted error sum_i w_i^(t) [h_t(x_i) != y_i] is minimized.
4. Calculate weighted error e_t of h_t, alpha_t = (1/2) ln((1 - e_t) / e_t)
5. Update w_i^(t+1) ∝ w_i^(t) exp(-alpha_t y_i h_t(x_i)), normalized to sum to 1
6. Final ensemble classifier H(x) = sign( sum_t alpha_t h_t(x) )
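The procedure above can be sketched end-to-end using 1-D decision stumps as the weak learners. This is a minimal illustration under my own naming and toy data, not the lecture's code or any library API:

```python
import math

def stump_predict(thr, pol, x):
    """A decision stump: predict `pol` if x >= thr, else -pol."""
    return pol if x >= thr else -pol

def adaboost(points, labels, rounds=10):
    n = len(points)
    w = [1.0 / n] * n                      # initial uniform weights
    ensemble = []                          # list of (alpha_t, thr, pol)
    for _ in range(rounds):
        # steps 2-3: exhaustively pick the stump minimizing the weighted error
        best = None
        for thr in sorted(set(points)):
            for pol in (1, -1):
                err = sum(wi for wi, x, y in zip(w, points, labels)
                          if stump_predict(thr, pol, x) != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # step 4: classifier weight
        ensemble.append((alpha, thr, pol))
        # step 5: re-weight, emphasizing misclassified instances, then normalize
        w = [wi * math.exp(-alpha * y * stump_predict(thr, pol, x))
             for wi, x, y in zip(w, points, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    # step 6: H(x) = sign(sum_t alpha_t * h_t(x))
    s = sum(a * stump_predict(thr, pol, x) for a, thr, pol in ensemble)
    return 1 if s >= 0 else -1

points = [1, 2, 3, 4, 5, 6]
labels = [1, 1, 1, -1, -1, -1]
ens = adaboost(points, labels, rounds=5)
print([predict(ens, x) for x in points])
```

On this tiny separable set a single stump already suffices; the sketch is only meant to make the weighting and re-weighting mechanics of the procedure concrete.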
- A Random Forest is an ensemble classifier that uses decision trees built on bagged training data as its base classifiers.
- The only difference between a Random Forest and a typical bagged tree classifier is that at each node the base trees consider only a randomly selected subset of the features when choosing the split.
Procedure RandomForest():
1. Sample B bootstraps from the training data
2. for each bootstrap sample b in B:
3. BuildRandomizedTree(b)
4. Output the ensemble of B trees; classify by majority vote
Procedure BuildRandomizedTree(b):
1. Initialize root node of tree to be the whole space
2. At each newly split node:
3. Select a random feature subspace F of k features (a common default is k ≈ sqrt(d) of the d features)
4. From F, find the feature that best splits the current node
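Putting the two procedures together: the sketch below shrinks each randomized tree to a single split (a stump) so the whole bootstrap-plus-random-feature-plus-majority-vote pipeline fits in a few lines. Everything here (names, toy dataset, k = 1 feature subspace) is illustrative, not the lecture's code:

```python
import random
from collections import Counter

def train_stump(sample, feature):
    """Find the threshold/orientation on `feature` that best fits `sample`."""
    best = None
    for thr in sorted({x[feature] for x, _ in sample}):
        for left, right in ((1, -1), (-1, 1)):
            err = sum(1 for x, y in sample
                      if (left if x[feature] < thr else right) != y)
            if best is None or err < best[0]:
                best = (err, thr, left, right)
    _, thr, left, right = best
    return (feature, thr, left, right)

def train_forest(data, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, d = len(data), len(data[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in range(n)]  # 1. bootstrap sample
        feature = rng.randrange(d)                     # 3. random feature subspace (k = 1)
        forest.append(train_stump(sample, feature))    # 4. best split within it
    return forest

def forest_predict(forest, x):
    votes = Counter(left if x[f] < thr else right
                    for f, thr, left, right in forest)
    return votes.most_common(1)[0][0]                  # majority vote over the trees

# toy data: two redundant features, labels separable on either one
data = [([0, 9], -1), ([1, 8], -1), ([2, 7], -1),
        ([7, 2], 1), ([8, 1], 1), ([9, 0], 1)]
forest = train_forest(data)
print(forest_predict(forest, [0, 9]), forest_predict(forest, [9, 0]))
```

Restricting each tree to a random feature decorrelates the base classifiers, which is exactly the "diverse predictions" condition the Binomial argument earlier relies on.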