David Taylor
Senior Data Scientist, R&D, Aviva Canada
prooffreader.com
@prooffreader
Goals of this presentation:
To demystify the field of machine learning (ML)
To introduce you to the vocabulary and fundamental concepts ("idea buckets") of ML
To serve as a foundation for further learning about ML
To provide links to further resources for self-learning about ML (including some IPython notebooks I made)
http://www.dtdata.io/introml
There are two versions:
All of the examples were made with Python using Scikit-Learn and Matplotlib
Don't worry, not this.
It is the use of algorithms to create knowledge from data.
What's an algorithm?
How many of you have done machine learning before?
How many of you have made a best-fit linear regression line in Excel before?
"But wait! All I did was push a button in Excel!"
The Black Box...
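What sits inside the black box behind that Excel button is a least-squares fit. A minimal sketch of the same idea in Python, assuming NumPy is available; the data points are made up for illustration and the `predict` helper is hypothetical:

```python
import numpy as np

# Made-up data lying exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(x, y, deg=1)


def predict(new_x):
    """Use the fitted line to estimate y for an x we have not seen."""
    return slope * new_x + intercept


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"prediction for x=10: {predict(10.0):.2f}")
```

The algorithm turned five data points into two numbers (a slope and an intercept) that can be reused on new inputs, which is the "knowledge from data" idea in miniature.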
An offshoot of the field of Artificial Intelligence
Autocorrect
Google page ranking
Netflix suggestions
Credit card fraud detection
Stock trading systems
Climate modeling and weather forecasting
Facial recognition
Self-driving cars
Simple data analysis
'FRUIT'
from my webcomic, prooffreaderswhimsy
Let's do some clustering of our fruit dataset
First, choose number of clusters.
We'll go with 3.
1) Centroid-based
2) Hierarchical
3) Neighborhood growers
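The three families above can be sketched with Scikit-Learn. This is only an illustration: the synthetic blobs stand in for the fruit data, and the choice of KMeans, AgglomerativeClustering and DBSCAN as representatives of each family is an assumption, as are all parameter values.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the fruit data: three well-separated blobs
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

# 1) Centroid-based: the number of clusters is chosen up front
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 2) Hierarchical: keeps merging the closest groups until 3 remain
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# 3) Neighborhood grower: no cluster count is given; label -1 marks noise
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)


def n_clusters(labels):
    """Count distinct clusters, ignoring the DBSCAN noise label -1."""
    return len(set(labels) - {-1})


for name, labels in [("k-means", kmeans_labels),
                     ("hierarchical", hier_labels),
                     ("DBSCAN", dbscan_labels)]:
    print(name, n_clusters(labels))
```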
increasing minimum distance
decreasing minimum samples
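A rough sketch of how those two knobs behave, assuming the neighborhood grower is DBSCAN and that "minimum distance" maps to its `eps` radius and "minimum samples" to `min_samples`. The two-moons data and the parameter values are made up for illustration.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic data: two interleaved half-circles with some jitter
X, _ = make_moons(n_samples=200, noise=0.1, random_state=0)


def count_noise(eps, min_samples):
    """Number of points DBSCAN leaves unclustered (label -1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return int((labels == -1).sum())


# Growing the neighborhood radius with min_samples fixed at 5
noise_by_eps = [count_noise(eps, 5) for eps in (0.05, 0.1, 0.2, 0.4)]

# Lowering the required neighbor count with eps fixed at 0.1
noise_by_min_samples = [count_noise(0.1, m) for m in (20, 10, 5, 2)]

print("noise points as eps grows:", noise_by_eps)
print("noise points as min_samples shrinks:", noise_by_min_samples)
```

Both changes make it easier for a point to join a cluster, so the number of points left as noise can only stay the same or go down.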
We're done clustering;
Let's do some classification of our fruit dataset
... right after we find out what that means!
There is no "label" associated with clusters.
Data starts with labels
We have three fruits: orange, apple and pear.
Unsupervised:
Data → "Labels" (clusters)
Supervised:
Data + Labels + New Data → New Labels
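In Scikit-Learn the difference shows up in what you pass to `fit`. A minimal sketch, assuming two hypothetical features (width and height in cm) and made-up measurements for the three fruits:

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features per fruit: [width_cm, height_cm]
X = [[7.5, 7.3], [7.7, 7.4],   # oranges
     [7.0, 7.6], [7.2, 7.8],   # apples
     [6.0, 9.0], [6.2, 9.3]]   # pears
y = ["orange", "orange", "apple", "apple", "pear", "pear"]

# Unsupervised: only the data goes in; we get anonymous cluster ids back
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised: data AND labels go in; new data gets a real label back
classifier = KNeighborsClassifier(n_neighbors=1).fit(X, y)
new_fruit = [[6.1, 9.1]]
predicted = classifier.predict(new_fruit)

print("cluster ids:", list(cluster_ids))
print("predicted label:", predicted[0])
```

The cluster ids are arbitrary integers with no meaning attached, while the classifier answers in the vocabulary of the labels it was given.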
Labels of our fruit and two features
K-Nearest Neighbor
k=1
k=3
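To see why k matters, here is a from-scratch sketch of the voting step using NumPy. The function name `knn_predict` and the four training points are made up for illustration; this is not the implementation used for the slides.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k):
    """Label x_new by majority vote among its k closest training points."""
    distances = np.linalg.norm(np.asarray(X_train) - np.asarray(x_new), axis=1)
    nearest = np.argsort(distances)[:k]           # indices of the k closest
    votes = Counter(y_train[i] for i in nearest)  # tally their labels
    return votes.most_common(1)[0][0]


# One lone pear sitting among apples
X_train = [[2.0, 2.0], [2.5, 2.0], [2.0, 2.6], [3.0, 3.0]]
y_train = ["pear", "apple", "apple", "apple"]
x_new = [2.1, 2.1]

print("k=1:", knn_predict(X_train, y_train, x_new, k=1))
print("k=3:", knn_predict(X_train, y_train, x_new, k=3))
```

With k=1 the single closest point (the pear) decides; with k=3 the two nearby apples outvote it.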
Decision surfaces
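A decision surface is what you get by asking the classifier for a prediction at every point of a fine grid. A minimal sketch, assuming synthetic blobs in place of the fruit features and arbitrary grid resolutions:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the fruit data: 3 classes, 2 features
X, y = make_blobs(n_samples=90, centers=3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Build a grid that covers the data with a small margin
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 50),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 40),
)

# Predict a class for every grid point, then fold back into grid shape
grid_points = np.c_[xx.ravel(), yy.ravel()]
surface = clf.predict(grid_points).reshape(xx.shape)

print(surface.shape)
```

Passing `xx`, `yy` and `surface` to Matplotlib's `contourf` would color each region by its predicted class.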
"the bias-variance tradeoff"
underfitting = bias
overfitting = variance
Balance bias and variance by randomly sequestering part of our data as a testing set; the remaining data becomes our training set.
Figure: the data is split into a training set (70%) and a testing set (30%)
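The 70/30 split can be sketched with Scikit-Learn's `train_test_split`. The arrays below are placeholder data, and the fixed `random_state` is only there to make the sketch repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 2 features, 3 classes
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 3

# Randomly set aside 30% of the rows as the testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(len(X_train), "training samples,", len(X_test), "testing samples")
```

The testing rows are only used for scoring; fitting on them would defeat the purpose of holding them back.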
Don't cheat!
Figure: decision surfaces for k=99, k=15 and k=3, shown on the training set, the test set and a new random test set
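One way to reproduce this kind of comparison in numbers is to score each k on both sets. A sketch under stated assumptions: the noisy two-moons data is synthetic, not the fruit dataset, and the exact accuracies will differ from anything shown in the slides.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data with enough noise to make overfitting possible
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

scores = {}
for k in (99, 15, 3):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Accuracy on data the model has seen vs. data it has not
    scores[k] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for k, (train_acc, test_acc) in scores.items():
    print(f"k={k:>2}: training accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

A large gap between the two accuracies hints at overfitting, while low accuracy on both hints at underfitting.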
Learning curve
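Scikit-Learn has a `learning_curve` helper that scores a model at increasing training-set sizes. This sketch assumes that is the kind of curve meant here; the data, the classifier and its k=5 are placeholders.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

# Score the classifier at 5 training-set sizes, with 5-fold cross-validation
train_sizes, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)

# Average over the folds to get one point per training-set size
for size, train_acc, test_acc in zip(
    train_sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)
):
    print(f"{size:>3} samples: training {train_acc:.2f}, validation {test_acc:.2f}")
```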
Comparison of classifier algorithms
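A comparison like this can be approximated by cross-validating several classifiers on the same data. The four classifiers and the synthetic data below are arbitrary picks for illustration, not necessarily the ones compared in the slide.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

classifiers = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=15),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "logistic regression": LogisticRegression(),
}

# Mean 5-fold cross-validated accuracy for each classifier
mean_accuracy = {
    name: cross_val_score(clf, X, y, cv=5).mean()
    for name, clf in classifiers.items()
}

for name, acc in mean_accuracy.items():
    print(f"{name}: {acc:.2f}")
```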
How to fit an algorithm to correct underfitting
adjust parameters
change to a more complex algorithm
use more features
use an ensemble of low-complexity algorithms
How to fit an algorithm to correct overfitting
adjust parameters
change to a less complex algorithm
use fewer features or perform dimensionality reduction
use an ensemble of high-complexity algorithms
use more data
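As one example of the dimensionality-reduction option, here is a sketch using principal component analysis (PCA). PCA is just one possible technique, and the iris dataset bundled with Scikit-Learn stands in for the fruit data.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Iris has 4 features per sample; squeeze them down to 2
X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("before:", X.shape, "after:", X_reduced.shape)
print("share of variance kept:", round(pca.explained_variance_ratio_.sum(), 3))
```

A classifier trained on `X_reduced` has fewer dimensions in which to memorize noise, at the cost of discarding some information.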
We've done k-Nearest Neighbors;
now let's try a different supervised learning algorithm.
An ensemble of decision trees
RELATIVE FEATURE IMPORTANCES
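Assuming the ensemble of decision trees is a random forest, Scikit-Learn exposes the relative feature importances as an attribute after fitting. The iris dataset is used here only as a stand-in, and 100 trees is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()

# Train 100 decision trees, each on a random resample of the data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# One importance score per feature; the scores add up to 1
importances = dict(zip(data.feature_names, forest.feature_importances_))

for name, score in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.2f}")
```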
Negative: "I really hate this product."
Negative: "This thing sucks!"
Positive: "This product is so awesome!"
Positive: "I'm happy with my purchase."
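Labeled sentences like these can train a text classifier once the words are turned into counts. A minimal sketch, assuming a bag-of-words model feeding a multinomial Naive Bayes classifier (the classifier is not named in the text above); the two new reviews are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The four labeled reviews from the slide
reviews = [
    "I really hate this product.",
    "This thing sucks!",
    "This product is so awesome!",
    "I'm happy with my purchase.",
]
labels = ["negative", "negative", "positive", "positive"]

# CountVectorizer turns each review into word counts;
# MultinomialNB learns which words go with which label
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(reviews, labels)

# Hypothetical new reviews the model has never seen
new_reviews = ["I hate this thing", "So happy, awesome purchase"]
predictions = model.predict(new_reviews)

for review, label in zip(new_reviews, predictions):
    print(f"{label}: {review!r}")
```

Four sentences are far too few for a real sentiment classifier; the point is only the shape of the workflow.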
Another very popular classifier that, confusingly, does classification, not regression
We won't go into it here, but one of its most popular features is that when you add new training data, you don't have to recalculate the entire classifier from scratch; you can just update it incrementally.
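To give a feel for what incremental updating means, here is a from-scratch sketch that assumes the classifier in question is logistic regression trained by gradient descent (its name does not appear in the text above). The class name, the `partial_fit` method, the learning rate and the toy data are all made up for illustration.

```python
import numpy as np


class IncrementalLogisticRegression:
    """Toy logistic regression that keeps its weights between updates."""

    def __init__(self, n_features, learning_rate=0.5):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.learning_rate = learning_rate

    def _probability(self, X):
        # Sigmoid of the linear score: probability of class 1
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def partial_fit(self, X, y, n_steps=50):
        """Nudge the existing weights using only the new batch."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        for _ in range(n_steps):
            error = y - self._probability(X)
            self.w += self.learning_rate * X.T @ error / len(y)
            self.b += self.learning_rate * error.mean()
        return self

    def predict(self, X):
        return (self._probability(np.asarray(X, dtype=float)) >= 0.5).astype(int)


model = IncrementalLogisticRegression(n_features=1)

# First batch of training data
model.partial_fit([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])
weight_after_first_batch = model.w.copy()

# New data arrives later: update in place instead of starting over
model.partial_fit([[-3.0], [3.0]], [0, 1])

print(model.predict([[-4.0], [4.0]]))
```

The second call starts from the weights learned in the first call, so the earlier batch never has to be revisited.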
For each of these classifiers, is it more important to minimize false positives or false negatives?
An adult-content filter for school computers
A book recommendation classifier for Amazon
A genetic risk classifier for cancer
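The question becomes concrete once the two kinds of error are counted separately. A small sketch in plain Python; the function name and the example labels are made up, with 1 meaning "flagged" and 0 meaning "not flagged".

```python
def confusion_counts(y_true, y_pred):
    """Count the four outcomes of a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    return {
        "true_positive": sum(1 for t, p in pairs if t == 1 and p == 1),
        "false_positive": sum(1 for t, p in pairs if t == 0 and p == 1),
        "false_negative": sum(1 for t, p in pairs if t == 1 and p == 0),
        "true_negative": sum(1 for t, p in pairs if t == 0 and p == 0),
    }


# Made-up ground truth and predictions for 8 items
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)

# Precision drops with false positives; recall drops with false negatives
precision = counts["true_positive"] / (
    counts["true_positive"] + counts["false_positive"]
)
recall = counts["true_positive"] / (
    counts["true_positive"] + counts["false_negative"]
)

print(counts)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Which number to favor depends on the cost of each mistake in the application, which is the point of the three examples above.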
www.prooffreader.com
"Prooffreader" is misspelled:
that's the joke!