Introduction to data analysis using machine learning
David Taylor
Senior Data Scientist, R&D, Aviva Canada
prooffreader.com
@prooffreader
Goals of this presentation:
- To demystify the field of machine learning (ML)
- To introduce you to the vocabulary and fundamental concepts ("idea buckets") of ML
- To serve as a foundation for further learning about ML
- To provide links to further resources for self-learning about ML (including some IPython notebooks I made)
To see this presentation and handy links, go to:
http://www.dtdata.io/introml
There are two versions:
- "lite" (for live presentation)
- "complete" (for self-study)
All of the examples were made with Python using Scikit-Learn and Matplotlib
What is machine learning?
Don't worry, not this.

What is machine learning?
It is the use of algorithms to create knowledge from data.
What's an algorithm?
What is machine learning?
How many of you have done machine learning before?
How many of you have made a best-fit linear regression line in Excel before?

What is machine learning?
"But wait! All I did was push a button in Excel!"
The Black Box...
History of machine learning
An offshoot of the field of Artificial Intelligence
Examples of machine learning
Autocorrect
Google page ranking
Netflix suggestions
Credit card fraud detection
Stock trading systems
Climate modeling and weather forecasting
Facial recognition
Self-driving cars
Simple data analysis
A note on machine learning nomenclature
The Dataset
'FRUIT'

Types of Features
- Numeric
- Interval, e.g. date or time
- Ordinal
- Categorical
Two kinds of machine learning: Unsupervised and Supervised
from my webcomic, prooffreaderswhimsy

Unsupervised = exploratory
Supervised = predictive
Let's do some Unsupervised Machine Learning of our fruit dataset

Clusters

Density & Distance

Euclidean distance

Manhattan distance
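The two distance measures can be sketched in a few lines of NumPy (the fruit coordinates here are made up):

```python
import numpy as np

# Two fruits described as (width_cm, height_cm) points -- made-up values
a = np.array([7.0, 7.5])
b = np.array([6.0, 9.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of per-axis differences
```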
Standardization
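Standardization puts every feature on the same scale so that, say, mass in grams doesn't swamp width in centimetres. A minimal sketch with scikit-learn's StandardScaler (the feature values are invented):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical fruit features on very different scales: mass (g), width (cm)
X = np.array([[150.0, 7.0],
              [170.0, 7.5],
              [300.0, 9.0]])

X_std = StandardScaler().fit_transform(X)
# each column now has mean 0 and standard deviation 1,
# so no single feature dominates the distance calculation
```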


The K-Means algorithm
First, choose the number of clusters. We'll go with 3.
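In scikit-learn the whole algorithm is one call; here is a sketch on synthetic stand-in data (three made-up blobs in width/height space, since the real fruit dataset isn't reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the fruit data: three blobs in (width, height) space
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.3, size=(30, 2))
               for center in ([7, 7], [6, 9], [9, 9])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster assignment for each point
centers = km.cluster_centers_  # the three learned centroids
```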

Dependence on initial conditions

Comparing different values of k




Determining best value for k
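One common heuristic is the "elbow" method: compute the inertia (sum of squared distances to the nearest centroid) for a range of k and look for the bend where the drop levels off. A sketch on the same kind of synthetic blob data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.3, size=(30, 2))
               for center in ([7, 7], [6, 9], [9, 9])])

# inertia always drops as k grows, but the drop levels off past the "true" k
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 7)]
```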

Some other clustering algorithms
1) Centroid-based
2) Hierarchical
3) Neighborhood growers

2) Hierarchical Clustering

An example of bottom-up hierarchical clustering

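A sketch of bottom-up (agglomerative) clustering using SciPy's linkage/fcluster, again on synthetic stand-in data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(center, 0.3, size=(20, 2))
               for center in ([7, 7], [6, 9], [9, 9])])

# Ward linkage builds the merge tree bottom-up: each point starts as its
# own cluster, and the two closest clusters are merged at every step
Z = linkage(X, method='ward')

# cut the tree so that exactly 3 clusters remain
labels = fcluster(Z, t=3, criterion='maxclust')
```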
3) Neighborhood Grower example: DBSCAN





increasing minimum distance
decreasing minimum samples
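A DBSCAN sketch with scikit-learn, using an invented mix of one tight cluster plus scattered points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
dense = rng.normal([7, 7], 0.2, size=(40, 2))   # one tight cluster
stragglers = rng.uniform(0, 14, size=(5, 2))    # scattered points
X = np.vstack([dense, stragglers])

# a point is a "core" point if at least min_samples neighbours lie within
# eps of it; clusters grow outward from core points, the rest is noise (-1)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_
```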
We're done clustering; let's do some Supervised Machine Learning of our fruit dataset
... right after we find out what that means!
Unsupervised
There is no "label" associated with clusters.
EXPLORATORY data analysis
Supervised
Data starts with labels
PREDICTIVE data analysis
Our fruit dataset


Our fruit dataset + labels

We have three fruits: orange, apple and pear.


Unsupervised:
Data → "Labels" (clusters)
Supervised:
Data + Labels + New Data → New Labels
Labels of our fruit and two features



K-Nearest Neighbor

k=1

k=3
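In scikit-learn this is KNeighborsClassifier; a sketch on invented fruit measurements:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented (width, height) measurements for the three fruits
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(center, 0.4, size=(25, 2))
               for center in ([7, 7], [6, 9], [9, 9])])
y = np.array(['orange'] * 25 + ['apple'] * 25 + ['pear'] * 25)

# k=3: a new point gets the majority label of its 3 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
prediction = knn.predict([[6.9, 7.1]])
```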


Decision surfaces





Fitting an algorithm





"the bias-variance tradeoff"
underfitting = bias
overfitting = variance
Balance bias and variance by randomly sequestering part of our data as a testing set; the remaining data becomes our training set.
data → training set (70%) + testing set (30%)
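scikit-learn's train_test_split does the random 70/30 sequestering in one call; a sketch on invented fruit data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(center, 0.4, size=(25, 2))
               for center in ([7, 7], [6, 9], [9, 9])])
y = np.array(['orange'] * 25 + ['apple'] * 25 + ['pear'] * 25)

# hold out 30% of the rows, chosen at random, as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```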
Don't cheat!






k=99, k=15, k=3: training set vs. test set







k=99, k=15, k=3: training set, test set, and a new random test set



Learning curve
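A learning curve can be computed with scikit-learn's learning_curve helper, which scores the model on training and validation folds at increasing training-set sizes; a sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(center, 0.4, size=(25, 2))
               for center in ([7, 7], [6, 9], [9, 9])])
y = np.array(['orange'] * 25 + ['apple'] * 25 + ['pear'] * 25)

# score the model at 5 increasing training sizes, with 5-fold cross-validation
sizes, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=3), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)
```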



Comparison of classifier algorithms

How to fit an algorithm to correct underfitting:
- adjust parameters
- change to a more complex algorithm
- use more features
- use an ensemble of low-complexity algorithms

How to fit an algorithm to correct overfitting:
- adjust parameters
- change to a less complex algorithm
- use fewer features or perform dimensionality reduction
- use an ensemble of high-complexity algorithms
- use more data
We've done K-Nearest Neighbors;
now let's try a different supervised learning algorithm.
Decision trees
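A decision tree learns a cascade of if/else threshold tests on the features. A scikit-learn sketch on invented fruit data, with export_text to print the learned rules:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(center, 0.4, size=(25, 2))
               for center in ([7, 7], [6, 9], [9, 9])])
y = np.array(['orange'] * 25 + ['apple'] * 25 + ['pear'] * 25)

# each internal node is a threshold test on one feature
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
rules = export_text(tree, feature_names=['width', 'height'])
```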


Ensemble methods
Random Forest
An ensemble of decision trees
RELATIVE FEATURE IMPORTANCES
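A sketch of a random forest and its feature importances in scikit-learn; the data is invented, with a deliberately useless noise column that should land at the bottom of the importance ranking:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
# third column is pure noise, so it should matter least
X = np.hstack([
    np.vstack([rng.normal(center, 0.4, size=(25, 2))
               for center in ([7, 7], [6, 9], [9, 9])]),
    rng.normal(0, 1, size=(75, 1))])
y = np.array(['orange'] * 25 + ['apple'] * 25 + ['pear'] * 25)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = rf.feature_importances_  # non-negative, sums to 1
```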

The Curse of Dimensionality
Dimensionality Reduction
PCA
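A PCA sketch with scikit-learn, on invented data where two of the five features are near-copies of others, so two components capture most of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# 5 features, but two of them are near-copies of the first two
base = rng.normal(size=(100, 3))
X = np.hstack([base,
               base[:, :1] * 2 + rng.normal(0, 0.05, size=(100, 1)),
               base[:, 1:2] * -1 + rng.normal(0, 0.05, size=(100, 1))])

# project onto the 2 directions of greatest variance
pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)
ratios = pca.explained_variance_ratio_
```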

Dimensionality reduction is very handy when you have:
- 1,000 features
- 10,000 features
- 100,000 features
- 1,000,000 features!?!?
  - e.g. NLP with n-grams
Sparse matrices

Negative: "I really hate this product."
Negative: "This thing sucks!"
Positive: "This product is so awesome!"
Positive: "I'm happy with my purchase."
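Counting n-grams over even a few sentences produces a matrix that is mostly zeros, which is why scikit-learn's CountVectorizer returns a sparse matrix. A sketch using the sentences above:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I really hate this product.",
        "This thing sucks!",
        "This product is so awesome!",
        "I'm happy with my purchase."]

# unigrams + bigrams; most counts are zero, so the result is a
# scipy sparse matrix that only stores the non-zero entries
vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)
```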
Latent features
Logistic Regression
Another very popular classifier that, confusingly, does classification, not regression.
We won't go into it here, but one of its most popular features is that when you add new training data, you don't have to recalculate the entire classifier from scratch; you can simply update it incrementally.
Cost Function

For each of the following classifiers, would it be more important to minimize false positives or false negatives?
An adult-content filter for school computers
A book recommendation classifier for Amazon
A genetic risk classifier for cancer
Precision & recall
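Both metrics are one call each in scikit-learn; a tiny worked example with hand-picked labels:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # actual labels
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]  # classifier output

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 2 / 3
```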

www.prooffreader.com
"Prooffreader" is misspelled:
that's the joke!

Intro to data analysis using machine learning (presentation version)
By David Taylor