federica bianco
astro | data science | data for good
Fall 2025 - UDel PHYS 641
dr. federica bianco
@fedhere
this slide deck:
1
Minkowski family of distances
Minkowski family of distances
Minkowski family of distances
N features (dimensions)
L1 is the Minkowski distance with p=1
L2 is the Minkowski distance with p=2 (Euclidean distance)
if p → ∞: D(i,j) → max_k |x_ik - x_jk| (the Chebyshev distance) - see the sketch below
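A minimal scipy check of these special cases (the two feature vectors are made up for illustration; scipy.spatial.distance provides minkowski and chebyshev):
import numpy as np
from scipy.spatial import distance

xi = np.array([1.0, 4.0, 2.0])   # object i, N = 3 features
xj = np.array([3.0, 1.0, 2.5])   # object j

d1 = distance.minkowski(xi, xj, p=1)   # L1 = Manhattan distance
d2 = distance.minkowski(xi, xj, p=2)   # L2 = Euclidean distance
dinf = distance.chebyshev(xi, xj)      # p → ∞ limit: max_k |x_ik - x_jk|
print(d1, d2, dinf)                    # 5.5, 3.64..., 3.0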
Minkowski family of distances
N features (dimensions)
properties:
Minkowski family of distances
Manhattan: p=1
features: x, y
Minkowski family of distances
Manhattan: p=1
features: x, y
Minkowski family of distances
Euclidean: p=2
features: x, y
2
import scipy as sp
import scipy.spatial.distance  # scipy submodules must be imported explicitly
sp.spatial.distance.pdist(X) # the pairwise distance: returns (N**2 - N )/2 values for N objects
sp.spatial.distance.squareform(sp.spatial.distance.pdist(wines[["Alcohol", "Magnesium"]]))
#returns the NXN matrix of distances
plt.imshow(sp.spatial.distance.squareform(sp.spatial.distance.pdist(wines[["Alcohol", "Magnesium"]])))
#you can visualize the NXN matrix
plt.xlabel("wine")
plt.ylabel("wine");
plt.colorbar(label="distance");
import scipy as sp
import scipy.spatial.distance  # scipy submodules must be imported explicitly
sp.spatial.distance.pdist(X) # the pairwise distance: returns (N**2 - N )/2 values for N objects
sp.spatial.distance.squareform(sp.spatial.distance.pdist(wines[["Alcohol", "Magnesium"]],
metric='jaccard'))
#returns the NXN matrix of distances
plt.imshow(sp.spatial.distance.squareform(sp.spatial.distance.pdist(wines[["Alcohol", "Magnesium"]])))
#you can visualize the NXN matrix
plt.xlabel("wine")
plt.ylabel("wine");
plt.colorbar(label="distance");
Siddarth Chiaini, UDelaware
3
classification
prediction
feature selection
supervised learning
understanding structure
organizing/compressing data
anomaly detection dimensionality reduction
unsupervised learning
classification
supervised learning methods
(nearly all other methods you heard of)
learns by example
used to:
classify, predict (regression)
x
y
observed features:
(x, y)
models typically return a partition of the space
goal is to partition the space so that the unobserved variables are separated in groups consistently with an observed subset
target features:
(color)
x
y
observed features:
(x, y)
if x**2 + y**2 <= (x-a)**2 + (y-b)**2 :
return blue
else:
return orange
target features:
(color)
A subset of variables has class labels. Guess the label for the other variables
SVM
finds a hyperplane that optimally separates observations
x
y
observed features:
(x, y)
Tree Methods
split spaces along each axis separately
A subset of variables has class labels. Guess the label for the other variables
split along x
if x <= a :
if y <= b:
return blue
return orange
then
along y
target features:
(color)
x
y
observed features:
(x, y)
K-Nearest Neighbors
Assigns the class of closest neighbors
A subset of variables has class labels. Guess the label for the other variables
if (labels[argsort(distance((x,y), trainingset))][:4] == blue).sum() > (labels[argsort(distance((x,y), trainingset))][:4] == orange).sum():
return blue
return orange
target features:
(color)
Calculate the distance d to all known objects. Select the k closest objects. Assign the most common among the k classes:
# k = 1
d = distance(x, trainingset)
C(x) = C(trainingset[argmin(d)])
Calculate the distance d to all known objects. Select the k closest objects.
Classification:
Assign the most common among the k classes
Regression: Predict the average (median) of the k target values
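A minimal numpy sketch of this procedure (illustrative only; the names knn_predict, X_train, y_train are made up, not from the slides):
import numpy as np
from scipy.spatial.distance import cdist

def knn_predict(x, X_train, y_train, k=3):
    # distance d from the query point x to all known objects
    d = cdist(np.atleast_2d(x), X_train)[0]
    # select the k closest objects
    nearest = np.argsort(d)[:k]
    # classification: assign the most common class among the k neighbors
    # (for regression, return np.mean(y_train[nearest]) instead)
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]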
Good
non parametric
very good with large training sets
Cover and Hart 1967: As n→∞, the 1-NN error is no more than twice the error of the Bayes Optimal classifier.
Good
non parametric
very good with large training sets
Cover and Hart 1967: As n→∞, the 1-NN error is no more than twice the error of the Bayes Optimal classifier.
Let x_NN be the nearest neighbor of the test point x_t.
For n → ∞, x_NN → x_t, so dist(x_NN, x_t) → 0
Theorem: ε_NN ≤ 2 ε_BayesOpt (asymptotically)
The Bayes-optimal classifier predicts y* = argmax_y P(y|x); its error is ε_BayesOpt = 1 − P(y*|x_t)
Proof sketch: assume P(y|x_t) = P(y|x_NN)
(smoothness: always assumed in ML)
ε_NN = P(y*|x_t) (1−P(y*|x_NN)) + P(y*|x_NN) (1−P(y*|x_t)) ≤
(1−P(y*|x_NN)) + (1−P(y*|x_t)) =
2 (1−P(y*|x_t)) = 2 ε_BayesOpt
Good
non parametric
very good with large training sets
Not so good
it is only as good as the distance metric
If the similarity in feature space reflects similarity in label then it is perfect!
poor if training sample is sparse
poor with outliers
dynamic programming
4
Breaking down a problem into subproblems
Solve the subproblem once and store the solution
Instead of recomputing look up the solution as needed
Trade-off: decreases computational complexity (commonly exponential problems O(exp(n)) become solvable in polynomial time, e.g. O(n^2) or O(n^3))
Example: Fibonacci sequence
# function recursively called
def fib(n):
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

# dynamic programming approach
import numpy as np

def fibDP(n):
    if n < 2:
        return n
    fibresult = np.zeros(n + 1, int)   # store each subproblem solution once
    fibresult[1] = 1
    for i in np.arange(2, n + 1):
        fibresult[i] = fibresult[i - 1] + fibresult[i - 2]
    return fibresult[-1]
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ...
Dynamic time warping
5
Sakoe & Chiba 1978
Raw-feature methods are very sensitive to "irrelevant" differences
1D
2D
x
y
2D
| | X | Y |
---|---|---|
A | x1 | y1 |
B | x2 | y2 |
x1
x2
x3
x4
x5
x6
scipy.spatial.distance_matrix(np.atleast_2d(A).T, np.atleast_2d(B).T)
scipy.spatial.distance_matrix(np.atleast_2d(A).T, np.atleast_2d(B).T)
np.sqrt((np.diag(scipy.spatial.distance_matrix(np.atleast_2d(A).T, np.atleast_2d(B).T))**2).sum())
scipy.spatial.distance_matrix(np.atleast_2d(A).T, np.atleast_2d(B).T)
np.sqrt((np.diag(scipy.spatial.distance_matrix(np.atleast_2d(A).T, np.atleast_2d(B).T))**2).sum())
[figure: distance values D: 121, 91, 64, 121, 126, 121]
Data that is not correlated appear as a sphere in the N-dimensional feature space
Data can have covariance (and it almost always does!)
ORIGINAL DATA
STANDARDIZED DATA
Generic preprocessing
When standardizing a dataset we take every feature and force it to have mean=0 and standard deviation=1
Generic preprocessing
for each feature: subtract the mean and divide by the standard deviation
mean of each feature should be 0, standard deviation of each feature should be 1
Generic preprocessing
for each feature: subtract the mean and divide by the standard deviation
mean of each feature should be 0, standard deviation of each feature should be 1
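A minimal sklearn sketch of this per-feature standardization (the data array X is made up; StandardScaler subtracts each column's mean and divides by its standard deviation):
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.random((100, 3)) * [10., 0.3, 1000.]  # made-up features on very different scales
Xs = StandardScaler().fit_transform(X)              # per feature: subtract mean, divide by std
print(Xs.mean(axis=0))  # ~0 for every feature
print(Xs.std(axis=0))   # ~1 for every feature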
Time Series Preprocessing
what happens if I standardize a dataset by time stamp??
mean of each feature should be 0, standard deviation of each ROW (time series) should be 1
That way we compare shapes of time series, i.e. trends!
| | t1 | t2 | t3 | t4 | t5 | t6 |
---|---|---|---|---|---|---|
obs1 | 6.5 | 8.5 | 6.2 | 9.9 | 5.5 | 8.5 |
obs2 | 0.0 | 1.0 | -0.1 | 1.9 | -0.5 | 1.0 |
PREPROCESSING TIME SERIES
timeSeriesScaled = sklearn.preprocessing.scale(timeSeries, axis=1, with_std=False)
Generally I am interested in a similar shape. If I were interested in the absolute distance I would be better off with feature extraction!
| | t1 | t2 | t3 | t4 | t5 | t6 |
---|---|---|---|---|---|---|
obs1 | 6.5 | 8.5 | 6.2 | 9.9 | 5.5 | 8.5 |
obs2 | -0.6 | 0.6 | -0.8 | 1.5 | -1.2 | 0.6 |
PREPROCESSING TIME SERIES
timeSeriesScaled = sklearn.preprocessing.scale(timeSeries, axis=1, with_std=False)
Generally I am interested in a similar shape. If I were interested in the absolute distance I would be better off with feature extraction!
| | t1 | t2 | t3 | t4 | t5 | t6 |
---|---|---|---|---|---|---|
obs1 | -1.0 | 1.0 | 1.3 | 2.4 | 2.0 | 1.0 |
obs2 | -0.6 | 0.6 | -0.8 | 1.5 | -1.2 | 0.6 |
PREPROCESSING TIME SERIES
timeSeriesScaled = sklearn.preprocessing.scale(timeSeries, axis=1)
Generally I am interested in a similar shape. If I were interested in the absolute distance I would be better off with feature extraction!
| | t1 | t2 | t3 | t4 | t5 | t6 |
---|---|---|---|---|---|---|
obs1 | -0.6 | 0.6 | 0.8 | 1.5 | -1.2 | 0.6 |
obs2 | -0.6 | 0.6 | -0.8 | 1.5 | -1.2 | 0.6 |
Standardizing (scaling)
assures we are measuring similarity in shape, regardless of absolute numbers
Time Warped time series
A warping of the time axis would also suppress similarity
The first plot (left) corresponds to measuring the distance along each axis and combining those distances (e.g. Euclidean)
The right plot is how we deal with "time warping"
Modify the distance matrix by considering the points surrounding the diagonal
[figure: distance matrix between samples x1 … x6 of the two series, highlighting the cells d(Qi-1,Qj) and d(Qi,Qj-1) adjacent to the diagonal]
Modify the distance matrix by considering the points surrounding the diagonal
[figure: distance matrix between samples x1 … x6 of the two series, highlighting the off-diagonal cells d(Qi,Qi+1) and d(Qi+1,Qi)]
Modify the distance matrix by considering the points surrounding the diagonal
Modify the distance matrix by considering the points surrounding the diagonal
[figure: accumulated cost matrix; the path starts at d(Q0,C0), each cell (Qi,Ci) is reached from d(Qi-1,Ci-1), d(Qi-1,Ci), or d(Qi,Ci-1), and ends at d(Qn,Cn)]
- first step is always d(Q0,C0)
- if the accumulated distance is smaller, the path may deviate off the diagonal
DTW(Qi, Cj) = d(Qi, Cj) + min(DTW(Qi-1, Cj-1), DTW(Qi-1, Cj), DTW(Qi, Cj-1))
To align two sequences using DTW, an n-by-n matrix is constructed, with the (ith, jth) element of the matrix being the Euclidean distance
d(qi, cj) between the points qi and cj .
DTW(Qi, Cj) = d(Qi, Cj) + min(DTW(Qi-1, Cj-1), DTW(Qi-1, Cj), DTW(Qi, Cj-1))
To align two sequences using DTW, an n-by-n matrix is constructed, with the (ith, jth) element of the matrix being the Euclidean distance
d(qi, cj) between the points qi and cj .
A warping path P is a contiguous set of matrix elements that defines a mapping between Q and C. The tth element of P is defined as pt=(i, j)t :
P = p1, p2, …, pt, …, pT, with n ≤ T ≤ 2n-1
import numpy as np
import pylab as pl
import scipy as sp
import scipy.spatial  # the spatial submodule must be imported explicitly
x = np.array([-2, 10, -10, 15, 13, 20, 5, 14, 2])
y = np.array([3, -13, 14, -7, 9, 20, -2, 14, 2])
distm = sp.spatial.distance_matrix(x.reshape(-1,1),
y.reshape(-1,1), p=1)
pl.imshow(distm)
r is the search parameter (the Sakoe-Chiba warping-window width: how far the path may deviate from the diagonal)
DTW(Qi, Cj) = d(Qi, Cj) + min(DTW(Qi-1, Cj-1), DTW(Qi-1, Cj), DTW(Qi, Cj-1))
Sakoe & Chiba 78
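A minimal dynamic-programming sketch of this recurrence (illustrative, not the reference implementation): it reuses the distm matrix computed above, and the optional r implements a Sakoe-Chiba band by leaving cells farther than r from the diagonal at infinity.
import numpy as np

def dtw_cost(distm, r=None):
    n, m = distm.shape
    D = np.full((n, m), np.inf)   # accumulated cost matrix
    D[0, 0] = distm[0, 0]         # first step is always d(Q0, C0)
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            if r is not None and abs(i - j) > r:
                continue          # outside the Sakoe-Chiba band
            # DTW(Qi, Cj) = d(Qi, Cj) + min of the three predecessors
            prev = min(D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,
                       D[i - 1, j] if i > 0 else np.inf,
                       D[i, j - 1] if j > 0 else np.inf)
            D[i, j] = distm[i, j] + prev
    return D[-1, -1], D           # alignment cost and the full matrix (used for the path)

cost, D = dtw_cost(distm, r=2)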
Indexing time series
6
“…we have about a million samples per minute coming in from 1000 gas turbines around the world… we need to be able to do similarity search for...” Lane Desborough, GE. • “…an archival rate of 3.6 billion points a day, how can we (do similarity search) in this data?” Josh Patterson, TVA.
iSAX 2.0: Indexing and Mining One Billion Time Series
We show that the main bottleneck in mining such massive datasets is the time taken to build the index
indexing refers to organizing observations (here a time series is an observation) so that they can be searched.
Indexing is at the root of databasing
iSAX 2.0: Indexing and Mining One Billion Time Series
We show that the main bottleneck in mining such massive datasets is the time taken to build the index
t = (t1, . . . , tn), ti ∈ R,
where time point f(i) is before f(i + 1):
f(i) < f(i + 1),
and f : N → R is a function mapping indices to time points.
Long time series data which is treated as point data, corresponds to very high dimensional feature spaces.
t = (t1, . . . , tn), ti ∈ R,
where time point f(i) is before f(i + 1):
f(i) < f(i + 1),
and f : N → R is a function mapping indices to time points.
Long time series data which is treated as point data, corresponds to very high dimensional feature spaces.
Indexing creates a compact yet detailed index structure for time series similarity search and retrieval.
t = (t1, . . . , tn), ti ∈ R,
where time point f(i) is before f(i + 1):
f(i) < f(i + 1),
and f : N → R is a function mapping indices to time points.
Long time series data which is treated as point data, corresponds to very high dimensional feature spaces.
Indexing creates a compact yet detailed index structure for time series similarity search and retrieval.
storage efficient
indexing is about finding compact representations that simplify the search problem: so that you do not have to measure the distance between all time series
Effective DTW models
6
A time series of length one trillion is a very large data object. In
fact, it is more than all of the time series data considered in all
papers ever published in all data mining conferences combined.
1 Time Series Subsequences must be Normalized
2 Arbitrary Query Lengths cannot be Indexed
If we know the length of queries ahead of time we can mitigate at least some of the intractability of search by indexing the data. Although to our knowledge no one has built an index for a trillion real-valued objects (Google only indexed a trillion webpages as recently as 2008), perhaps this could be done.
4.1.1 Using the Squared Distance
4.1.2 Lower Bounding (LB_Keogh)
4.1.3 Early Abandoning of ED and LB_Keogh
4.1.4 Early Abandoning of DTW
4.1.1 Using the Squared Distance
4.1.2 Lower Bounding (LB_Keogh)
4.1.3 Early Abandoning of ED and LB_Keogh
4.1.4 Early Abandoning of DTW
4.1.1 Using the Squared Distance
4.1.2 Lower Bounding (LB_Keogh)
4.1.3 Early Abandoning of ED and LB_Keogh
4.1.4 Early Abandoning of DTW
if we note that the accumulated sum of differences between corresponding datapoints already exceeds the best-so-far distance, we abandon the computation
4.2.1 Early Abandoning Z-Normalization
this step relies heavily on dynamic programming
4.2.1 Early Abandoning Z-Normalization
4.2.2 Reordering Early Abandoning
4.2.3 Reversing the Query/Data Role in LB
4.2.4 Cascading Lower Bounds
there is a universal optimal ordering that we can compute in advance: the sections of the query that are farthest from the mean, zero, will on average have the largest contributions to the distance measure.
4.1.1 Using the Squared Distance
4.1.2 Lower Bounding (LB_Keogh)
4.1.3 Early Abandoning of ED and LB_Keogh
4.1.4 Early Abandoning of DTW
Python DTW implementation
7
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn import metrics
scaler = TimeSeriesScalerMeanVariance(mu=0., std=1.) # Rescale
dataset_scaled = scaler.fit_transform(dataset)
path, sim = metrics.dtw_path(dataset_scaled[0], dataset_scaled[1])
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from tslearn.piecewise import PiecewiseAggregateApproximation
scaler = TimeSeriesScalerMeanVariance(mu=0., std=1.) # Rescale
dataset = scaler.fit_transform(dataset)
# PAA transform (and inverse transform) of the data
n_paa_segments = 10
paa = PiecewiseAggregateApproximation(n_segments=n_paa_segments)
paa_dataset_inv = paa.inverse_transform(paa.fit_transform(dataset))
use for feature engineering!
visualization tips: color maps
Color maps that are perceptually uniform respect relationships between the data
data-ink ratio (Edward Tufte)
Dynamic Programming Algorithm Optimization for
Spoken Word Recognition Sakoe & Chiba 1978 (original DTW paper)
Exact indexing of dynamic time warping
Eamonn Keogh, Chotirat Ann Ratanamahatana 2002 (slides)
https://www.cs.ucr.edu/~eamonn/KAIS_2004_warping.pdf
Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping
Rakthanmanon et al 2012 (computational aspect of DTW)
Querying and Mining of Time Series Data: Experimental Comparison of Representations and Distance Measures
Hui Ding, Goce Trajcevski, Peter Scheuermann, Xiaoyue Wang & Eamonn Keogh
https://www.cs.ucr.edu/~eamonn/vldb_08_Experimental_comparison_time_series.pdf
you must be able to describe in general all of figures 7-13 and in detail one of those figures.
code up the DTW algorithm and apply it to sound bites, find the notebook in https://github.com/fedhere/MLTSA_FBianco/tree/master/HW7
import numpy as np

def path(matrix):
    # the path can be calculated backward or forward
    # I find backward more intuitive
    # start at the one-to-last cell:
    i, j = np.array(matrix.shape) - 2
    # since I do not know how long the path is I will use lists
    # p and q will be the lists of indices of the path elements along the 2 array axes
    p, q = [i], [j]
    # go all the way to cell 0,0
    while (i > 0) or (j > 0):
        # pick the minimum of the 3 surrounding elements:
        # the diagonal and the 2 cells surrounding it
        tb = np.argmin((matrix[i, j], matrix[i, j + 1], matrix[i + 1, j]))
        # stay on the diagonal
        if tb == 0:
            i -= 1
            j -= 1
        # off-diagonal choices: move only up or sideways
        elif tb == 1:
            i -= 1
        else:  # tb == 2
            j -= 1
        # put the i and j indices into p and q, pushing existing entries forward
        p.insert(0, i)
        q.insert(0, j)
    return np.array(p), np.array(q)
code to calculate the path along the DTW array, for plotting
classification
prediction
feature selection
supervised learning
understanding structure
organizing/compressing data
anomaly detection dimensionality reduction
unsupervised learning
understand structure of feature space
prediction based on examples (inferential AI)
generate new instances (generative AI)
=> second order purpose | feature importance
[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed. Arthur Samuel, 1959
model
parameters:
slope, intercept
data
ML: any model with parameters learnt from the data
[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed. Arthur Samuel, 1959
model
parameters:
slope, intercept
data
ML: any model with parameters learnt from the data
Via minimization of a loss function
Loss function is "distance" between known and predicted values of the target variable
supervised learning
????
unsupervised learning
why
understand structure of feature space
- dimensionality reduction
- anomaly detection
(e.g. image compression)
ML models have parameters and hyperparameters
parameters: the model optimizes based on the data
hyperparameters: chosen by the model author, could be based on domain knowledge, other data, guessed (?!).
e.g. the shape of the polynomial
observed features:
(x, y)
GOAL: partitioning data into maximally homogeneous,
maximally distinguished subsets.
x
y
all features are observed for all objects in the sample
(x, y)
how should I group the observations in this feature space?
e.g.: how many groups should I make?
x
y
internal criterion:
members of the cluster should be similar to each other (intra cluster compactness)
external criterion:
objects outside the cluster should be dissimilar from the objects inside the cluster
tigers
whales
raptors
zoologist's clusters
orange/green
black/white/blue
internal criterion:
members of the cluster should be similar to each other (intra cluster compactness)
external criterion:
objects outside the cluster should be dissimilar from the objects inside the cluster
photographer's clusters
how you define similarity/distance
internal criterion:
members of the cluster should be similar to each other (intra cluster compactness)
external criterion:
objects outside the cluster should be dissimilar from the objects inside the cluster
Scalability (naive algorithms are NP-hard)
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shapes
Minimal requirement for domain knowledge
Deals with noise and outliers
Insensitive to order
Allows incorporation of constraints
Interpretable
A Spatial Clustering Technique for the Identification of Customizable Ecoregions
William W. Hargrove and Robert J. Luxmoore
50-year mean monthly temperature, 50-year mean monthly precipitation, elevation, total plant-available water content of soil, total organic matter in soil, and total Kjeldahl soil nitrogen
THIS CLUSTERING IS NOT BASED ON LOCATION! but on properties of the location that have a spatial coherence
We already covered and reviewed distances for continuous variables. Briefly, let's look at categorical distances:
Uses presence/absence of features in data
M00: number of features in neither
M11: number of features in both
M10: number of features in i but not j
M01: number of features in j but not i
What is the distance between a leopard and a lizard?
- they both have tails
- only lizards have scales
- neither have wings
Uses presence/absence of features in data
M00: number of features in neither
M11: number of features in both
M10: number of features in i but not j
M01: number of features in j but not i
What is the distance between a leopard and a lizard?
- they both have tails
- only lizards have scales
- neither have wings
| | 1 | 0 | sum |
---|---|---|---|
1 | M11 | M10 | M11+M10 |
0 | M01 | M00 | M01+M00 |
sum | M11+M01 | M10+M00 | M11+M00+M01+M10 |
observation i
observation j
Uses presence/absence of features in data
M00: number of features in neither
M11: number of features in both
M10: number of features in i but not j
M01: number of features in j but not i
What is the distance between a leopard and a lizard?
- they both have tails
- only lizards have scales
- neither have wings
| | 1 | 0 | sum |
---|---|---|---|
1 | M11 | M10 | M11+M10 |
0 | M01 | M00 | M01+M00 |
sum | M11+M01 | M10+M00 | M11+M00+M01+M10 |
observation i
observation j
Simple Matching Distance
Uses presence/absence of features in data
Simple Matching Coefficient
or Rand similarity
| | 1 | 0 | sum |
---|---|---|---|
1 | M11 | M10 | M11+M10 |
0 | M01 | M00 | M01+M00 |
sum | M11+M01 | M10+M00 | M11+M00+M01+M10 |
observation i
observation j
lizard/leopard
Jaccard similarity
Jaccard distance
| | 1 | 0 | sum |
---|---|---|---|
1 | M11 | M10 | M11+M10 |
0 | M01 | M00 | M01+M00 |
sum | M11+M01 | M10+M00 | M11+M00+M01+M10 |
observation i
observation j
lizard/leopard
Jaccard similarity
Jaccard distance
Jaccard similarity
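A small worked sketch of these similarities for the leopard/lizard example (binary features: tail, scales, wings; the simple matching coefficient is (M11+M00)/(M11+M10+M01+M00), the Jaccard similarity is M11/(M11+M10+M01), and scipy's jaccard returns the Jaccard distance):
import numpy as np
from scipy.spatial.distance import jaccard

leopard = np.array([1, 0, 0])  # tail, no scales, no wings
lizard  = np.array([1, 1, 0])  # tail, scales, no wings

m11 = np.sum((leopard == 1) & (lizard == 1))  # features in both
m00 = np.sum((leopard == 0) & (lizard == 0))  # features in neither

smc = (m11 + m00) / len(leopard)   # simple matching coefficient = 2/3
smd = 1 - smc                      # simple matching distance = 1/3
jd  = jaccard(leopard, lizard)     # Jaccard distance = 1 - M11/(M11+M10+M01) = 1/2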
Application to Deep Learning for image recognition
Convolutional Neural Nets
K-means (MacQueen ’67)
K-medoids (Kaufman & Rousseeuw ’87)
Expectation Maximization (Dempster,Laird,Rubin ’77)
Hard partitioning cluster method
Choose K “center” guesses: random points in the feature space
repeat:
    calculate the distance between each point and each center
    assign each point to the closest center
    calculate the new cluster centers
until (convergence): when clusters no longer change
Objective: minimizing the aggregate distance within the cluster.
Order: #clusters #dimensions #iterations #datapoints O(KdN)
CONs:
It's non-deterministic: the result depends on the (random) starting point
It only works where the mean is defined: alternative is K-medoids which represents the cluster by its central member (median), rather than by the mean
Must declare the number of clusters upfront (how would you know it?)
PROs:
Scales linearly with d and N
Objective: minimizing the aggregate distance within the cluster.
Order: #clusters #dimensions #iterations #datapoints O(KdN)
O(KdN):
complexity scales linearly with
-d number of dimensions
-N number of datapoints
-K number of clusters
either you know it because of domain knowledge
or
you choose it after the fact: "elbow method"
total intra-cluster variance
Objective: minimizing the aggregate distance within the cluster.
Order: #clusters #dimensions #iterations #datapoints O(KdN)
Must declare the number of clusters
Objective: minimizing the aggregate distance within the cluster.
Order: #clusters #dimensions #iterations #datapoints O(KdN)
Must declare the number of clusters upfront (how would you know it?)
either domain knowledge or
after the fact: "elbow method"
total intra-cluster variance
‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
‘random’: choose k observations (rows) at random from data for the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
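A minimal sklearn sketch of K-means with the k-means++ initialization and the "elbow method" (the data array X is made up; inertia_ is sklearn's total intra-cluster variance):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.random((500, 2))  # made-up (N objects x d features) array

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, init='k-means++', n_init=10).fit(X)
    inertias.append(km.inertia_)  # total intra-cluster variance for this K

plt.plot(range(1, 10), inertias, 'o-')
plt.xlabel("number of clusters K")
plt.ylabel("total intra-cluster variance");  # look for the "elbow"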
Convergence Criteria
General
Any time you have an objective function (or loss function) you need to set up a tolerance : if your objective function did not change by more than ε since the last step you have reached convergence (i.e. you are satisfied)
ε is your tolerance
For clustering:
convergence can be reached if
no more than n data point changed cluster
n is your tolerance
Soft partitioning cluster method
Hard clustering : each object in the sample belongs to only 1 cluster
Soft clustering : to each object in the sample we assign a degree of belief that it belongs to a cluster
Soft = probabilistic
SKIP IN CLASS BUT YOU CAN LOOK AT THE SLIDES ON YOUR OWN!!
these points come from 2 Gaussian distributions.
which point comes from which Gaussian?
CASE 1:
if I know which point comes from which Gaussian
I can solve for the parameters of the Gaussians
(e.g. maximizing the likelihood)
CASE 2:
if I know the parameters (μ,σ) of the Gaussians
I can figure out which Gaussian each point is most likely to come from (calculate the probability)
Guess parameters g= (μ,σ) for 2 Gaussian distributions A and B
calculate the probability of each point to belong to A and B
Guess parameters g= (μ,σ) for 2 Gaussian distributions A and B
calculate the probability of each point to belong to A and B
Guess parameters g= (μ,σ) for 2 Gaussian distributions A and B
1- calculate the probability p_ji of each point to belong to gaussian j
Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)
Guess parameters g= (μ,σ) for 2 Gaussian distributions A and B
1- calculate the probability p_ji of each point to belong to gaussian j
2a - calculate the weighted mean of the cluster, weighted by the p_ji
Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)
Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)
Guess parameters g= (μ,σ) for 2 Gaussian distributions A and B
1- calculate the probability p_ji of each point to belong to gaussian j
2a - calculate the weighted mean of the cluster, weighted by the p_ji
2b - calculate the weighted sigma of the cluster, weighted by the p_ji
Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)
Alternate expectation and maximization step till convergence
1- calculate the probability p_ji of each point to belong to gaussian j
2a - calculate the weighted mean of the cluster, weighted by the p_ji
2b - calculate the weighted sigma of the cluster, weighted by the p_ji
expectation step
maximization step
}
Last iteration: convergence
Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)
Alternate expectation and maximization step till convergence
1- calculate the probability p_ji of each point to belong to gaussian j
2a - calculate the weighted mean of the cluster, weighted by the p_ji
2b - calculate the weighted sigma of the cluster, weighted by the p_ji
expectation step
maximization step
}
Choose K “center” guesses (like in K-means)
repeat:
    Expectation step: calculate the probability of each distribution given the points
    Maximization step: calculate the new centers and variances as weighted averages of the datapoints, weighted by the probabilities
until (convergence): e.g. when the Gaussian parameters no longer change
Order: #clusters #dimensions #iterations #datapoints #parameters O(KdNp) (>K-means)
based on Bayes theorem
It's non-deterministic: the result depends on the (random) starting point (like K-means)
It only works where a probability distribution for the data points can be defined (or equivalently a likelihood) (like K-means)
Must declare the number of clusters and the shape of the pdf upfront (like K-means)
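A minimal sklearn sketch of soft clustering via Expectation-Maximization (GaussianMixture; the data array X is made up):
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.random((500, 2))               # made-up (N x d) data
gm = GaussianMixture(n_components=2).fit(X)  # EM fit of a 2-component Gaussian mixture

hard = gm.predict(X)         # hard assignment: most probable component for each point
soft = gm.predict_proba(X)   # soft assignment: probability of each component for each point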
We already saw why we need to scale and how to scale time series when we use the time series values as features.
What happens if we don't scale/wrong scale in clustering??
assume you have data that looks like this
can you identify grouping?
net pay in $1,000
training time fraction
made up dataset of a company where employees have group-raises evaluated based on their current income and their commitment to self improvement measured by time in training
assume you have data that looks like this
can you identify grouping?
training time fraction
net pay in $1,000
is this what you were thinking??
assume you have data that looks like this
can you identify grouping?
training time fraction
net pay in $1,000
is this what you were thinking??
range ~100 dominates distance
range ~0.3 becomes insignificant!
assume you have data that looks like this
can you identify grouping?
training time fraction
is this what you were thinking??
normalized pay
normalized training time fraction
Data that is not correlated appear as a sphere in the N-dimensional feature space
Data can have covariance (and it almost always does!)
ORIGINAL DATA
STANDARDIZED DATA
Generic preprocessing
When standardizing a dataset we take every feature and force it to have mean=0 and standard deviation=1
Generic preprocessing
for each feature: subtract the mean and divide by the standard deviation
mean of each feature should be 0, standard deviation of each feature should be 1
Time Series Preprocessing
what happens if I standardize a dataset by time stamp??
mean of each feature should be 0, standard deviation of each ROW (time series) should be 1
That way we compare shapes of time series, i.e. trends!
Buildings dataset: cluster buildings by time built and electricity used, for policy implementation
(add a gap feature to make the dataset more interesting... note the gap around 1940: construction slowed during WWII). Our eyes can see three distinct groups.
Can we recover them?
K-means clustering without normalization:
years (order of magnitude 1e3) dominate - horizontal split
Even on normalized data k-means is not good at finding non-spherical patterns and density changes!
Better option: Density based clustering can recognize density changes and outliers
Full On Whitening
: remove covariance by transforming the data with a matrix that diagonalizes the covariance matrix
axis 1 -> features
axis 0 -> observations
Data can have covariance (and it almost always does!)
Full On Whitening
: remove covariance by transforming the data with a matrix that diagonalizes the covariance matrix
Full On Whitening
find the matrix W that diagonalized Σ
from zca import ZCA
import numpy as np
X = np.random.random((10000, 15))  # data array
trf = ZCA().fit(X)
X_whitened = trf.transform(X)
X_reconstructed = trf.inverse_transform(X_whitened)
assert(np.allclose(X, X_reconstructed))
: remove covariance by transforming the data with a matrix that diagonalizes the covariance matrix
this is at best hard, in some cases impossible even numerically on large datasets
A covariance matrix is diagonal if the data has no correlation
Full On Whitening
: remove covariance by transforming the data with a matrix that diagonalizes the covariance matrix
Data can have covariance (and it almost always does!)
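A minimal numpy sketch of the same idea without the zca package (illustrative only): whiten by diagonalizing the empirical covariance matrix.
import numpy as np

X = np.random.random((10000, 15))          # made-up data array
Xc = X - X.mean(axis=0)                    # center each feature
cov = np.cov(Xc, rowvar=False)             # feature covariance matrix
evals, evecs = np.linalg.eigh(cov)         # eigendecomposition (cov is symmetric)
W = evecs @ np.diag(1. / np.sqrt(evals)) @ evecs.T  # ZCA whitening matrix
Xw = Xc @ W                                # whitened data: covariance ~ identity

assert np.allclose(np.cov(Xw, rowvar=False), np.eye(X.shape[1]), atol=1e-6)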
DBSCAN
DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature
A point p is a core point if at least minPts points are within distance ε (including p).
A point q is directly reachable from p if point q is within distance ε from core point p. Reachable from p if there is a path p1, ..., pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi.
All points not reachable from any other point are outliers or noise points.
Density-based spatial clustering of applications with noise
minPts
minimum number of points to form a dense region
maximum distance for points to be considered part of a cluster
ε
Key Hyperparameters:
Density-based spatial clustering of applications with noise
Key Hyperparameters:
minPts
minimum number of points to form a dense region
ε
maximum distance for points to be considered part of a cluster
2 points are considered neighbors if distance between them <= ε
Density-based spatial clustering of applications with noise
minPts
ε
maximum distance for points to be considered part of a cluster
minimum number of points to form a dense region
2 points are considered neighbors if distance between them <= ε
regions with number of points >= minPts are considered dense
Key Hyperparameters:
ε
minPts = 3
slides: Farid Qmar
ε
minPts = 3
slides: Farid Qmar
ε
minPts = 3
ε
slides: Farid Qmar
ε
minPts = 3
directly reachable
slides: Farid Qmar
ε
minPts = 3
core
dense region
slides: Farid Qmar
ε
minPts = 3
slides: Farid Qmar
ε
minPts = 3
directly reachable to
slides: Farid Qmar
ε
minPts = 3
reachable to
slides: Farid Qmar
ε
minPts = 3
slides: Farid Qmar
ε
minPts = 3
reachable
slides: Farid Qmar
ε
minPts = 3
slides: Farid Qmar
ε
minPts = 3
slides: Farid Qmar
ε
minPts = 3
ε
slides: Farid Qmar
ε
minPts = 3
directly reachable
slides: Farid Qmar
ε
minPts = 3
core
dense region
slides: Farid Qmar
ε
minPts = 3
reachable
slides: Farid Qmar
ε
minPts = 3
slides: Farid Qmar
ε
minPts = 3
noise/outliers
slides: Farid Qmar
PROs:
CONs:
ε : maximum distance for two points to be considered neighbors (joined)
min_samples : minimum number of points in a cluster, otherwise they are labeled outliers.
metric : the distance metric
p : float, optional The power of the Minkowski metric
ε : maximum distance for two points to be considered neighbors (joined)
min_samples : minimum number of points in a cluster, otherwise they are labeled outliers.
metric : the distance metric
p : float, optional The power of the Minkowski metric
it is extremely sensitive to these parameters!
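A minimal sklearn sketch (illustrative; the X array and parameter values are made up): points labeled -1 are the noise/outliers.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.random((300, 2))  # made-up feature matrix
db = DBSCAN(eps=0.05, min_samples=3, metric='euclidean').fit(X)

labels = db.labels_  # cluster index for each point, -1 for outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)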
for each point P: count neighbours within minD; if >= minPts: label = C (core)
for each point P != C: measure distance d to all Cs; if d <= minD: label = DR (directly reachable)
for each point P not C and not DR:
    if distance d to all C or DR > minD: label = outlier
    if distance d to some C or DR <= minD: find the path to the closest C and join that cluster
Order:
PROs
Deterministic.
Deals with noise and outliers
Can be used with any definition of distance or similarity
CONs
Not entirely deterministic.
Only works in a constant density field
a really good blog post on DBScan
https://www.analyticsvidhya.com/blog/2020/09/how-dbscan-clustering-works/
dataset
Cluster Visualization "dendrogram"
distance
it's deterministic!
it's deterministic!
computationally intense because every cluster pair distance has to be calculated
it's deterministic!
computationally intense because every cluster pair distance has to be calculated
it is slow, though it can be optimized:
complexity
compute the distance matrix
each data point is a singleton cluster
repeat
merge the 2 clusters with minimum distance
update the distance matrix
until
only a single cluster (or n clusters) remains
Order:
PROs
It's deterministic
CONs
It's greedy (optimization is done step by step and agglomeration decisions cannot be undone)
It's computationally expensive
distance between two clusters:
single link distance
D(c1,c2) = min(D(xc1, xc2))
distance between two clusters:
single link distance
D(c1,c2) = min(D(xc1, xc2))
complete link distance
D(c1,c2) = max(D(xc1, xc2))
distance between two clusters:
single link distance
D(c1,c2) = min(D(xc1, xc2))
complete link distance
D(c1,c2) = max(D(xc1, xc2))
centroid link distance
D(c1,c2) = mean(D(xc1, xc2))
distance between two clusters:
single link distance
D(c1,c2) = min(D(xc1, xc2))
complete link distance
D(c1,c2) = max(D(xc1, xc2))
centroid link distance
D(c1,c2) = mean(D(xc1, xc2))
Ward distance (global measure)
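A minimal scipy sketch of agglomerative clustering with these linkage choices and the dendrogram visualization (illustrative; the X array is made up):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.random((30, 2))     # made-up data

Z = linkage(X, method='ward')     # also: 'single', 'complete', 'centroid'
dendrogram(Z)                     # cluster visualization ("dendrogram")
plt.ylabel("distance");

labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters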
it is
non-deterministic
(like k-means)
it is
non-deterministic
(like k-means)
it is greedy -
just as k-means
two nearby points
may end up in
separate clusters
it is
non-deterministic
(like k-means)
it is greedy -
just as k-means
two nearby points
may end up in
separate clusters
it is high complexity for
exhaustive search
But can be reduced (~k-means)
or
Calculate clustering criterion for all subgroups, e.g. min intracluster variance
repeat: split the best cluster based on the criterion above, until each data point is in its own singleton cluster
Order: (w K-means procedure)
It's non-deterministic: the result depends on the (random) starting point (like K-means), unless it is exhaustive (but that is computationally prohibitive)
or
It's greedy (optimization is done step by step)
Clustering : unsupervised learning where all features are observed for all datapoints. The goal is to partition the space into maximally homogeneous maximally distinguished groups
clustering is easy, but interpreting results is tricky
Distance : A definition of distance is required to group observations/ partition the space.
Common distances over continuous variables
Common distances over categorical variables:
Whitening
Models assume that the data is not correlated. If your data is correlated the model results may be invalid. And your data always has correlations.
- whiten the data by using the matrix that diagonalizes the covariance matrix. This is ideal but computationally expensive if possible at all
- scale your data so that each feature has mean=0 and stdev=1.
Solution:
Partition clustering:
Hard: K-means O(KdN), needs to decide the number of clusters, non-deterministic
simple, efficient implementation, but the need to select the number of clusters is a significant flaw
Soft: Expectation Maximization O(KdNp), needs to decide the number of clusters, needs a likelihood function (parametric), non-deterministic
Hierarchical:
Divisive: exhaustive ; at least non-deterministic
Agglomerative: , deterministic, greedy. Can be run all the way through to explore the best stopping point. Does not require choosing the number of clusters a priori
Density based
DBSCAN: Density based clustering method that can identify outliers, which means it can be used in the presence of noise. Complexity . Most common (cited) clustering method in the natural sciences.
encoding categorical variables:
variables have to be encoded as numbers for computers to understand them. You can encode categorical variables with integers or floating point numbers, but that implicitly imparts an order. The standard is to one-hot-encode, which means creating a binary (True/False) feature (column) for each category of a categorical variable, but this increases the feature space and generates covariance.
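A minimal pandas sketch of one-hot encoding (the column and category names are made up):
import pandas as pd

df = pd.DataFrame({"animal": ["leopard", "lizard", "leopard", "whale"]})
onehot = pd.get_dummies(df, columns=["animal"])
# each category becomes a binary column: animal_leopard, animal_lizard, animal_whale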
model diagnostics for classifiers: fractions of True Positives and False Positives are the metrics used to evaluate classifiers. Combinations of those numbers include Accuracy ((TP+TN)/(TP+TN+FP+FN)), Precision (TP/(TP+FP)), Recall (TP/(TP+FN)).
ROC curve: (TP vs FP) is a holistic metric of a model. It can be used to guide the choice of hyperparameters to find the "sweet spot" for your problem
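A minimal sklearn sketch of these diagnostics (the labels and scores are made up):
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, roc_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # made-up true labels
y_score = np.array([0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1])  # made-up classifier scores
y_pred = (y_score > 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)    # (TP+TN)/(TP+TN+FP+FN)
prec = precision_score(y_true, y_pred)  # TP/(TP+FP)
rec = recall_score(y_true, y_pred)      # TP/(TP+FN)
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # ROC: TP rate vs FP rate across thresholds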
a comprehensive review of clustering methods
Data Clustering: A Review, Jain, Murty, Flynn 1999
https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf
a blog post on how to generate and interpret a scipy dendrogram by Jörn Hees
https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
https://arxiv.org/html/2412.20582v1
mandatory reading :
Section 1,2,3 all of it
Section 4 and 5.2: pick one method that was not covered in class
Section 5.3-4: pick one method that was not covered in class
By federica bianco
kNN, and DTW