Introduction to machine learning

David Taylor

Senior Data Scientist, R&D, Aviva Canada

prooffreader.com

@prooffreader

What is machine learning?


Don't worry, not this.


It is the use of algorithms to create knowledge from data.


 

What's an algorithm?


How many of you have done machine learning before?


 

How many of you have made a best-fit linear regression line in Excel before?


"But wait! All I did was push a button in Excel!"

 

The Black Box...
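That Excel button is doing ordinary least-squares fitting behind the scenes. A minimal sketch of the same thing in Python, assuming NumPy (the x/y values are invented stand-ins for two spreadsheet columns):

    import numpy as np

    # Invented x/y values standing in for two spreadsheet columns
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

    # Ordinary least-squares fit of a straight line: y ≈ slope * x + intercept
    slope, intercept = np.polyfit(x, y, deg=1)
    print(slope, intercept)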

History of machine learning

An offshoot of the field of Artificial Intelligence

Examples of machine learning

Autocorrect

Google page ranking

Netflix suggestions

Credit card fraud detection

Stock trading systems

Climate modeling and weather forecasting

Facial recognition

Self-driving cars

 

Simple data analysis

 

A note on machine learning nomenclature

The Dataset

'FRUIT'

Types of Features

 

  • Numeric
  • Interval, e.g. date or time
  • Ordinal
  • Categorical
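A hypothetical slice of the fruit dataset showing the four feature types side by side (pandas assumed; the column names and values are invented for illustration):

    import pandas as pd

    fruit = pd.DataFrame({
        "mass_g": [150, 120, 300],                      # numeric
        "picked": pd.to_datetime(["2014-08-01",         # interval (date/time)
                                  "2014-08-15",
                                  "2014-09-02"]),
        "ripeness": ["unripe", "ripe", "overripe"],     # ordinal (ordered categories)
        "color": ["red", "yellow", "green"],            # categorical (no inherent order)
    })
    print(fruit.dtypes)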

Two kinds of machine learning:

Unsupervised and Supervised

from my webcomic, prooffreaderswhimsy

Unsupervised = exploratory

Supervised = predictive

Let's do some

Unsupervised Machine Learning

of our fruit dataset

Clusters

Density & Distance


Euclidean distance


Manhattan distance
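A minimal sketch of the two distance measures between two fruits described by two features, assuming NumPy (the numbers are invented):

    import numpy as np

    a = np.array([150.0, 7.0])   # fruit A: mass (g), diameter (cm)
    b = np.array([120.0, 6.0])   # fruit B

    euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line ("as the crow flies") distance
    manhattan = np.sum(np.abs(a - b))           # sum of the per-feature differences
    print(euclidean, manhattan)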

Standardization

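Standardization rescales each feature to zero mean and unit variance, so that mass in grams doesn't swamp diameter in centimetres when distances are computed. A sketch assuming scikit-learn (the talk doesn't name a library):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[150.0, 7.0],
                  [120.0, 6.0],
                  [300.0, 9.5]])                # invented fruit features: mass (g), diameter (cm)

    X_std = StandardScaler().fit_transform(X)   # per column: (value - mean) / standard deviation
    print(X_std)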

The K-Means algorithm

 

 


First, choose the number of clusters.

We'll go with 3.

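A sketch of k-means with k=3, assuming scikit-learn and the standardized feature matrix X_std from the sketch above:

    from sklearn.cluster import KMeans

    # k-means alternates between assigning each point to its nearest centroid
    # and moving each centroid to the mean of its assigned points, until stable.
    # n_init restarts from several random centroid choices and keeps the best run,
    # which softens the dependence on initial conditions shown next.
    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    labels = km.fit_predict(X_std)   # a cluster index (0, 1 or 2) for each fruit
    print(labels)
    print(km.cluster_centers_)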

Dependence on initial conditions

Comparing different values of k

Determining best value for k
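The deck doesn't pin down a method here; one common approach is the "elbow": compute the within-cluster sum of squares (inertia) for a range of k and look for the bend in the curve. A sketch assuming scikit-learn:

    from sklearn.cluster import KMeans

    inertias = []
    for k in range(1, 10):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_std)
        inertias.append(km.inertia_)   # within-cluster sum of squared distances
    # Plot k against inertias and pick the k where the curve stops dropping sharply.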

Some other clustering algorithms

1) Centroid-based

2) Hierarchical

3) Neighborhood growers

Neighborhood Grower example: DBSCAN
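A sketch of DBSCAN, assuming scikit-learn; eps and min_samples are illustrative values, not ones from the talk:

    from sklearn.cluster import DBSCAN

    # Points with at least min_samples neighbours within eps seed a dense region,
    # which grows outward neighbourhood by neighbourhood; points that never join
    # a dense region are labeled -1 (noise).
    db = DBSCAN(eps=0.5, min_samples=5)
    labels = db.fit_predict(X_std)
    print(labels)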

We're done clustering;
 

Let's do some

Supervised Machine Learning

of our fruit dataset

 

... right after we find out what that means!

Unsupervised

There is no "label" associated with clusters.

EXPLORATORY Data Analysis

Supervised

Data starts with labels 

 

PREDICTIVE data analysis 

 

Our fruit dataset

Our fruit dataset + labels

Labels of our fruit and two features

K-Nearest Neighbor

 


 

k=1


 

k=3
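A sketch of k-nearest-neighbors classification, assuming scikit-learn; X and y are placeholders for the fruit features and their labels:

    from sklearn.neighbors import KNeighborsClassifier

    # k=3: a new fruit gets the majority label of its 3 closest training fruits.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X, y)
    print(knn.predict([[140.0, 6.5]]))   # invented new fruit: mass (g), diameter (cm)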

Decision surfaces

Fitting an algorithm


"the bias-variance tradeoff"

 

underfitting = bias

 

overfitting = variance

 

Balance bias and variance by randomly sequestering part of our data as a testing set; the remaining data becomes our training set.

 

data → training set (70%) + testing set (30%)
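A sketch of that 70/30 split, assuming scikit-learn; X and y are the features and labels as before:

    from sklearn.model_selection import train_test_split

    # Randomly sequester 30% of the rows as a test set; fit only on the rest.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=0)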

Don't cheat!

Decision surfaces at k=99, k=15, and k=3: training set vs. test set

Decision surfaces at k=99, k=15, and k=3: training set vs. a new random test set

Learning curve

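A sketch of computing a learning curve (training score vs. cross-validated score as the training set grows), assuming scikit-learn and the k-NN classifier from earlier:

    import numpy as np
    from sklearn.model_selection import learning_curve
    from sklearn.neighbors import KNeighborsClassifier

    sizes, train_scores, test_scores = learning_curve(
        KNeighborsClassifier(n_neighbors=3), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
    # Plot the mean of train_scores and test_scores against sizes: a persistent gap
    # suggests overfitting (variance); two low, flat curves suggest underfitting (bias).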

Comparison of classifier algorithms
 

How to fit an algorithm to correct underfitting

  • adjust parameters

  • change to more complex algorithm

  • use more features

  • use an ensemble of low complexity algorithms

How to fit an algorithm to correct overfitting

  • adjust parameters

  • change to less complex algorithm

  • use fewer features or perform dimensionality reduction

  • use an ensemble of high complexity algorithms

  • use more data

We've done k-Nearest Neighbors;

now let's try a different supervised learning algorithm.

Decision trees
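A sketch of a single decision tree on the labeled fruit, assuming scikit-learn and the train/test split from earlier; max_depth is an illustrative knob, not a value from the talk:

    from sklearn.tree import DecisionTreeClassifier, export_text

    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_train, y_train)
    print(export_text(tree))            # the learned if/else splits, printed as text
    print(tree.score(X_test, y_test))   # accuracy on the held-out test set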

Ensemble methods

Random Forest

An ensemble of decision trees

 

 


RELATIVE FEATURE IMPORTANCES
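A sketch of a random forest and its relative feature importances, assuming scikit-learn and the same train/test split:

    from sklearn.ensemble import RandomForestClassifier

    # An ensemble of decision trees, each grown on a bootstrap sample of the rows
    # with a random subset of features considered at each split; their votes are combined.
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_train, y_train)
    print(rf.score(X_test, y_test))
    print(rf.feature_importances_)   # relative importance of each feature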

The Curse of Dimensionality

Dimensionality Reduction

PCA


Dimensionality reduction is very handy when you have:

 

  • 1,000 features

  • 10,000 features

  • 100,000 features

  • 1,000,000 features!?!?

    • e.g. NLP with n-grams
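A sketch of PCA squeezing a wide feature matrix down to a couple of components, assuming scikit-learn; X_std is a standardized feature matrix as before:

    from sklearn.decomposition import PCA

    # Project the features onto the directions of greatest variance.
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X_std)
    print(pca.explained_variance_ratio_)   # share of the variance each component keeps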

Cost Function

For each of these classifiers, is it more important to minimize false positives or false negatives?

 

  • An adult-content filter for school computers

  • A book recommendation classifier for Amazon

  • A genetic risk classifier for cancer

Precision & recall
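A sketch of computing precision and recall from test-set predictions, assuming scikit-learn; clf, X_test and y_test are placeholders for any fitted classifier and its held-out data, with labels coded 0/1:

    from sklearn.metrics import confusion_matrix, precision_score, recall_score

    y_pred = clf.predict(X_test)

    # Precision: of everything flagged positive, how much really was (penalizes false positives).
    # Recall: of everything truly positive, how much got flagged (penalizes false negatives).
    print(precision_score(y_test, y_pred))
    print(recall_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))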

www.prooffreader.com

 

"Prooffreader" is misspelled:

that's the joke!
