David Taylor
Senior Data Scientist, R&D, Aviva Canada
prooffreader.com
@prooffreader
Don't worry, not this.
Machine learning is the use of algorithms to create knowledge from data.
What's an algorithm?
How many of you have done machine learning before?
How many of you have made a best-fit linear regression line in Excel before?
"But wait! All I did was push a button in Excel!"
The Black Box...
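To peek inside the black box: a best-fit line is itself a small algorithm. The following is a minimal sketch of ordinary least squares for one variable in plain Python. The function name `fit_line` and the sample points are made up for illustration, and Excel's own implementation may differ in its details.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical points lying exactly on y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The "knowledge" created here is just two numbers, a slope and an intercept, learned from the data.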
An offshoot of the field of Artificial Intelligence
Autocorrect
Google page ranking
Netflix suggestions
Credit card fraud detection
Stock trading systems
Climate modeling and weather forecasting
Facial recognition
Self-driving cars
Simple data analysis
'FRUIT'
from my webcomic, prooffreaderswhimsy
Let's do some
of our fruit dataset
First, choose number of clusters.
We'll go with 3.
1) Centroid-based
2) Hierarchical
3) Neighborhood growers
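As an illustration of the centroid-based family with 3 clusters, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The fruit measurements (width and height in cm) and the hand-picked starting centroids are assumptions invented for this example; real implementations usually pick the starting centroids at random or with a smarter seeding scheme.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:     # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical fruit as (width_cm, height_cm); no labels are used.
points = [
    (2.0, 2.0), (2.5, 2.2), (1.8, 2.4),      # small and round
    (7.0, 7.5), (7.4, 7.0), (6.8, 7.2),      # medium and round
    (3.5, 18.0), (3.8, 19.0), (3.2, 17.5),   # long and thin
]
start = [points[0], points[3], points[6]]    # hand-picked starting centroids
centroids, clusters = kmeans(points, start)
print(centroids)
```

Note that the algorithm only ever sees coordinates; it returns groups, not fruit names.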
We're done clustering;
Let's do some
of our fruit dataset
... right after we find out what that means!
There is no "label" associated with clusters.
Data starts with labels
Labels of our fruit and two features
K-Nearest Neighbor
k=1
k=3
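To make the difference between k=1 and k=3 concrete, here is a minimal sketch of a k-nearest-neighbor classifier in plain Python. The function name `knn_predict`, the two features, and every data point are hypothetical; one "banana" has deliberately been placed among the apples so that the two settings of k disagree.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train is a list of ((x, y), label) pairs.
    Returns the majority label among the k training points nearest to query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled fruit as ((width_cm, height_cm), label)
train = [
    ((7.0, 7.5), "apple"), ((7.4, 7.0), "apple"), ((6.8, 7.2), "apple"),
    ((3.5, 18.0), "banana"), ((3.8, 19.0), "banana"),
    ((6.0, 8.0), "banana"),   # an odd one out, sitting among the apples
]
query = (6.2, 7.9)

print(knn_predict(train, query, k=1))   # follows the single closest point
print(knn_predict(train, query, k=3))   # majority vote of the three closest
```

With k=1 the odd point decides the answer on its own; with k=3 it is outvoted by its two apple neighbors.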
Decision surfaces
"the bias-variance tradeoff"
underfitting = high bias
overfitting = high variance
Balance bias and variance by randomly sequestering part of our data as a testing set; the remaining data becomes our training set.
data: 70% training, 30% testing
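The 70/30 split can be sketched in a few lines of plain Python. The function name `split_data`, the fixed seed, and the stand-in rows are assumptions made for this example.

```python
import random

def split_data(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows, then cut it into (training, testing)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))   # stand-in for ten labeled fruit
training, testing = split_data(rows)
print(len(training), len(testing))
```

Shuffling before cutting matters: if the data happens to be sorted by label, an unshuffled split could leave a whole class out of the training set.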
Don't cheat!
[Figure: k=99, k=15 and k=3, each shown for the training set, the test set, and a new random test set]
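Here is a rough sketch of why scoring on the training set is cheating. It compares accuracy on the training set and on the held-out test set for several values of k. The two-class data is synthetic and generated on the spot, and the helper names are made up, so the exact numbers mean nothing beyond this example.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training points nearest to query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def accuracy(train, rows, k):
    """Fraction of rows whose label is predicted correctly from train."""
    hits = sum(knn_predict(train, point, k) == label for point, label in rows)
    return hits / len(rows)

# Synthetic, overlapping two-class data (invented for this sketch).
rng = random.Random(0)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(100)]
data += [((rng.gauss(1.5, 1), rng.gauss(1.5, 1)), "b") for _ in range(100)]
rng.shuffle(data)
training, testing = data[:140], data[140:]   # 70% training, 30% testing

for k in (99, 15, 3, 1):
    print(k, accuracy(training, training, k), accuracy(training, testing, k))
```

With k=1, every training point is its own nearest neighbor, so the training score is perfect no matter how noisy the data is. Only the test score says anything about new fruit.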
Learning curve
Comparison of classifier algorithms
How to fit an algorithm to correct underfitting:
adjust parameters
change to more complex algorithm
use more features
use an ensemble of low complexity algorithms
How to fit an algorithm to correct overfitting:
adjust parameters
change to less complex algorithm
use fewer features or perform dimensionality reduction
use an ensemble of high complexity algorithms
use more data
We've done k-Nearest Neighbors;
now let's try a different supervised learning algorithm.
An ensemble of decision trees
RELATIVE FEATURE IMPORTANCES
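To give a feel for how an ensemble of trees votes, here is a deliberately simplified sketch: each "tree" is a one-split stump trained on a bootstrap sample, and the ensemble takes a majority vote. This is not a full random forest. The fruit data, feature names, and function names are all invented, and "importance" here is just how often each feature gets chosen for a split, a crude stand-in for the impurity-based measures that libraries typically report.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows):
    """One-split tree: returns (feature, threshold, left_label, right_label),
    or None if the rows offer no usable split."""
    best, best_errors = None, None
    for f in range(len(rows[0][0])):
        # Skip the largest value so the right-hand side is never empty.
        for t in sorted({p[f] for p, _ in rows})[:-1]:
            left = [lab for p, lab in rows if p[f] <= t]
            right = [lab for p, lab in rows if p[f] > t]
            left_label, right_label = majority(left), majority(right)
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best_errors is None or errors < best_errors:
                best, best_errors = (f, t, left_label, right_label), errors
    return best

def fit_ensemble(rows, n_trees=25, seed=0):
    rng = random.Random(seed)
    stumps = []
    while len(stumps) < n_trees:
        sample = [rng.choice(rows) for _ in rows]   # bootstrap: sample with replacement
        stump = fit_stump(sample)
        if stump is not None:
            stumps.append(stump)
    return stumps

def predict(stumps, point):
    votes = [left_label if point[f] <= t else right_label
             for f, t, left_label, right_label in stumps]
    return majority(votes)

def feature_importances(stumps, n_features):
    """Share of stumps that split on each feature (a crude proxy)."""
    counts = Counter(f for f, _, _, _ in stumps)
    return [counts[f] / len(stumps) for f in range(n_features)]

# Hypothetical fruit as ((length_cm, weight_g), label).
# Length separates the classes cleanly; weight overlaps.
fruit = [
    ((6.5, 150), "apple"), ((7.0, 170), "apple"), ((7.4, 160), "apple"),
    ((7.8, 190), "apple"), ((8.0, 140), "apple"),
    ((17.0, 130), "banana"), ((17.5, 165), "banana"), ((18.0, 150), "banana"),
    ((18.6, 180), "banana"), ((19.2, 120), "banana"),
]
stumps = fit_ensemble(fruit)
print(predict(stumps, (6.0, 160)), predict(stumps, (19.5, 150)))
print(feature_importances(stumps, 2))
```

In this toy data every stump ends up splitting on length, so length takes all of the importance and weight takes none.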
In what kind of classifier would it be more important to minimize false positives or false negatives?
An adult-content filter for school computers
A book recommendation classifier for Amazon
A genetic risk classifier for cancer
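Before weighing one kind of error against the other, it helps to count them. Here is a minimal sketch that tallies true and false positives and negatives for a two-class filter. The labels and predictions are invented, and treating "adult" as the positive class is an assumption made for the example.

```python
def confusion_counts(actual, predicted, positive):
    """Return (true_pos, false_pos, true_neg, false_neg)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:      # flagged, but actually fine
            fp += 1
        elif a == positive:      # missed: should have been flagged
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

# Hypothetical content-filter results
actual    = ["adult", "safe", "safe", "adult", "safe", "adult"]
predicted = ["adult", "adult", "safe", "safe", "safe", "adult"]

tp, fp, tn, fn = confusion_counts(actual, predicted, positive="adult")
precision = tp / (tp + fp)   # of everything flagged, how much was right
recall = tp / (tp + fn)      # of everything that should be flagged, how much was caught
print(tp, fp, tn, fn, precision, recall)
```

Which of the two error counts matters more depends on the classifier's job, which is exactly the question posed for the three examples above.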
www.prooffreader.com
"Prooffreader" is misspelled:
that's the joke!