Foundations of Data Science for Everyone

IX: Clustering
Farid Qamar 


clustering

Machine Learning

unsupervised learning

identify features and create models that allow us to understand structure in the data

supervised learning

extract features and create models that allow prediction where the correct answer is known for a subset of the data

  • clustering
  • Principal Component Analysis
  • Apriori (association rule)
  • k-Nearest Neighbors
  • regression
  • Support Vector Machines
  • Classification/Regression Trees
  • Neural Networks

Machine Learning

Supervised ML: objects, features, target

the data is modeled as a function of another quantity (the target) characterizing the system

Unsupervised ML: objects, features

the data is represented by objects, each of which has associated features

Unsupervised ML

example data object:

Flatiron Building, NYC

wikipedia.org/wiki/Flatiron_Building

example features (and their data types):

  • height: float
  • energy use: float
  • number of floors: integer
  • number of occupants: integer
  • age in years: integer
  • zipcode: integer/string
  • footprint: array (lat/lon)
  • owner: string

Unsupervised ML

Nf = number of features

No = number of objects

2D dataset (Nf x No)

https://www.netimpact.org/chapters/new-york-city-professional
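To make the objects-by-features layout concrete, here is a minimal sketch (assuming pandas is installed) of such a 2D dataset; the second building and the numeric values are made-up placeholders, not real measurements.

import pandas as pd

# hypothetical objects-by-features table: one row per object (building),
# one column per feature; the values are illustrative placeholders
buildings = pd.DataFrame(
    {
        "height_m": [87.0, 45.5],
        "energy_use_kwh": [1.2e6, 4.0e5],
        "n_floors": [22, 10],
        "n_occupants": [600, 150],
        "age_years": [120, 60],
        "zipcode": ["10010", "10003"],
        "owner": ["Owner A", "Owner B"],
    },
    index=["Flatiron Building", "Other building"],
)

print(buildings.shape)   # (number of objects, number of features)
print(buildings.dtypes)  # mixed types: floats, integers, strings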

Clustering

Goal:

Find a pattern by dividing the objects into groups such that the objects within a group are more similar to each other than objects outside the group


Internal Criterion:

members of the cluster should be similar to each other (intra-cluster compactness)

External Criterion:

objects outside the cluster should be dissimilar from the objects inside the cluster


zoologist's clusters: mammals | birds | fish


mobility clusters: walk | fly | swim


photographer's clusters: orange/red/green | black/white/blue


The optimal clustering depends on

  • how you define similarity/distance
  • the purpose of the clustering

Clustering

Goal:

Find a pattern by dividing the objects into groups such that the objects within a group are more similar to each other than objects outside the group

How:

  • Define a similarity/dissimilarity function
  • Figure out the grouping of objects based on the chosen similarity/dissimilarity function
    • Objects within a cluster are similar
    • Objects across clusters are not so similar

distance metrics

continuous variables

Minkowski family of distances

D(i,j) = \sqrt[p]{\sum_{k=1}^{N}|x_{ik}-x_{jk}|^p}

Euclidean: p=2

D_{Euc}(i,j) = \sqrt{\sum_{k=1}^{N}|x_{ik}-x_{jk}|^2}

(figure: two points in a 2D feature space with features x, y)

Manhattan: p=1

D_{Man}(i,j) = \sum_{k=1}^{N}|x_{ik}-x_{jk}|


L1 is the Minkowski distance with p=1

L2 is the Minkowski distance with p=2

Example: for residuals of 2, 2, and 3:

L1 = 2 + 2 + 3 = 7

L2 (sum of squared residuals) = 4 + 4 + 9 = 17
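A minimal sketch of these distances in Python, assuming numpy and scipy are available; the two example points are arbitrary.

import numpy as np
from scipy.spatial import distance

# two arbitrary points in a 2D feature space
xi = np.array([1.0, 4.0])
xj = np.array([3.0, 1.0])

# Minkowski family: p=1 (Manhattan) and p=2 (Euclidean)
d_man = distance.minkowski(xi, xj, p=1)
d_euc = distance.minkowski(xi, xj, p=2)

# the same results written out explicitly
assert np.isclose(d_man, np.abs(xi - xj).sum())
assert np.isclose(d_euc, np.sqrt(((xi - xj) ** 2).sum()))

print(d_man, d_euc)  # 5.0 and ~3.61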

distance metrics

continuous variables: latitude and longitude

Great Circle distance

D(i,j) = R \arccos\left( \sin\phi_i \cdot \sin\phi_j + \cos\phi_i \cdot \cos\phi_j \cdot \cos(\Delta\lambda) \right)

where (\phi_i, \lambda_i) and (\phi_j, \lambda_j) are the latitude and longitude of points i and j, \Delta\lambda = \lambda_i - \lambda_j, and R is the radius of the sphere (e.g., the Earth)
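A small sketch of the formula above, assuming numpy; the two coordinate pairs are approximate, illustrative values, and the Earth radius is taken as roughly 6371 km.

import numpy as np

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, lam1, phi2, lam2 = map(np.radians, (lat1, lon1, lat2, lon2))
    central_angle = np.arccos(
        np.sin(phi1) * np.sin(phi2)
        + np.cos(phi1) * np.cos(phi2) * np.cos(lam1 - lam2)
    )
    return radius_km * central_angle

# example: two points in Manhattan, roughly 1 km apart (approximate coordinates)
print(great_circle_km(40.7411, -73.9897, 40.7484, -73.9857))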

distance metrics

categorical variables: binary

Simple Matching Coefficient (or Rand similarity)

SMC(i,j) = \frac{M_{i=0,j=0} + M_{i=1,j=1}}{M_{i=0,j=0} + M_{i=1,j=0} + M_{i=0,j=1} + M_{i=1,j=1}}

Simple Matching Distance

SMD(i,j) = 1 - SMC(i,j)

Uses presence/absence of features in the data:

  • M_{i=1,j=1}: number of features present in both i and j
  • M_{i=0,j=0}: number of features present in neither
  • M_{i=1,j=0}: number of features present in i but not j
  • M_{i=0,j=1}: number of features present in j but not i

Contingency table (rows: observation i, columns: observation j):

          j=1         j=0         sum
i=1       M11         M10         M11+M10
i=0       M01         M00         M01+M00
sum       M11+M01     M10+M00     M11+M10+M01+M00
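A minimal sketch of SMC and SMD for two binary feature vectors, assuming numpy; the example vectors are arbitrary.

import numpy as np

def simple_matching(i, j):
    """Simple Matching Coefficient and Distance for two binary vectors."""
    i, j = np.asarray(i), np.asarray(j)
    m11 = np.sum((i == 1) & (j == 1))  # features present in both
    m00 = np.sum((i == 0) & (j == 0))  # features present in neither
    smc = (m11 + m00) / i.size         # matches over total number of features
    return smc, 1.0 - smc              # (SMC, SMD)

print(simple_matching([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # (0.6, 0.4)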

whitening

Data can have covariance (and it almost always does!)

(figure: PLUTO Manhattan data, 42,000 x 15; axis 0 -> observations, axis 1 -> features)

Pearson's correlation (linear correlation):

r_{xy} = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\sqrt{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}}

correlation = covariance normalized by the standard deviations of the two variables

(figure: PLUTO Manhattan data, 42,000 x 15, correlation matrix; axis 0 -> observations, axis 1 -> features)

A covariance matrix is diagonal if the data has no correlation
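A minimal sketch of computing covariance and correlation matrices, assuming numpy; the data are random placeholders, not the PLUTO dataset.

import numpy as np

rng = np.random.default_rng(0)

# placeholder data: 1000 observations (axis 0) x 3 features (axis 1),
# with the second feature deliberately correlated with the first
x = rng.normal(size=(1000, 3))
x[:, 1] += 0.8 * x[:, 0]

cov = np.cov(x, rowvar=False)        # covariance matrix (features x features)
corr = np.corrcoef(x, rowvar=False)  # Pearson correlation matrix

print(np.round(cov, 2))
print(np.round(corr, 2))  # off-diagonal entry between features 0 and 1 is ~0.6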

Generic preprocessing... WHY??

Skewed data distribution: for example, std(x) ~ 2 * range(y)

Clustering without scaling: only the variable with more spread matters

Clustering with scaling: both variables matter equally

Data that is not correlated appears as a sphere in the N-dimensional feature space

Data can have covariance (and it almost always does!)

(figure: scatter plots of the ORIGINAL DATA vs. the STANDARDIZED DATA)

Generic preprocessing

for each feature: subtract the mean and divide by the standard deviation

after this, the mean of each feature should be 0 and the standard deviation of each feature should be 1

Generic preprocessing: most commonly, we will just correct for the spread and centroid
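A minimal sketch of this standardization, both by hand and with scikit-learn's StandardScaler (assuming scikit-learn is installed); the data are random placeholders.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=5.0, size=(500, 2))  # placeholder data, 2 features

# by hand: subtract the mean and divide by the standard deviation, per feature
x_scaled = (x - x.mean(axis=0)) / x.std(axis=0)

# equivalent with scikit-learn
x_scaled_skl = StandardScaler().fit_transform(x)

print(np.allclose(x_scaled, x_scaled_skl))           # True
print(x_scaled.mean(axis=0), x_scaled.std(axis=0))   # ~0 and ~1 per feature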


k-Means clustering

Clustering

we can define similarity in terms of distance

a common measure of distance is the squared Euclidean distance
(a.k.a. the squared L2-norm, or sum of squared differences)

\text{distance}_{i,j}^2 = \sum_{k=1}^N(x_{ik} - x_{jk})^2

also called the inertia (when summed over a cluster)

Clustering

example:

if the data is on buildings with the features year built (Y) and energy use (E), the distance between two buildings (1 and 2) is:

\text{distance}^2_{1,2} = (Y_{1} - Y_{2})^2 + (E_{1} - E_{2})^2

Types of Clustering

  • Density-based Clustering
  • Distribution-based Clustering
  • Hierarchical Clustering
  • Centroid-based Clustering

k-Means: the objective function

objective: minimize the aggregate distance within each cluster

total intra-cluster variance = \sum_k\sum_{i\in k}(\vec{x}_i - \vec{\mu}_k)^2

hyperparameter: the number of clusters k must be declared prior to clustering

k-Means Clustering

  1. choose k initial centers
  2. calculate the inertia
  3. assign each object to the nearest cluster center
  4. update the cluster centers to be the average of their assigned population
  5. calculate the inertia
  6. IF the inertia has not changed, stop

      Else, go back to step 3
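A minimal sketch of this loop using scikit-learn's KMeans, which implements the same initialize/assign/update iteration; the blob data and the choice of k=3 are placeholders.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# placeholder data: 3 well-separated blobs in 2D
x, _ = make_blobs(n_samples=300, centers=3, random_state=0)
x = StandardScaler().fit_transform(x)   # standardize before clustering

# k must be chosen ahead of time; the assign/update loop runs until convergence
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)

print(km.labels_[:10])       # cluster assignment of the first 10 objects
print(km.cluster_centers_)   # final cluster centers
print(km.inertia_)           # total within-cluster sum of squared distances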


Choosing the number of clusters k

if the number of variables is low:

visualize the data and pick k manually

or... use the Elbow Method:

  • Calculate the distance of all points to their nearest cluster center
  • Calculate the inertia (the sum of those distances)
  • Find the "Elbow"
    • the point after which the inertia starts decreasing linearly

(figure: inertia vs. number of clusters, with the "Elbow" inflection point at the optimal k = 4 centers)

But... this doesn't always work!
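A minimal sketch of the elbow method, assuming scikit-learn and matplotlib are installed; the placeholder data and the range of k values are arbitrary.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

x, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # placeholder data

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(x).inertia_
            for k in ks]

# look for the "elbow": the k after which the inertia decreases roughly linearly
plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()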

Problems with k-Means

  • highly dependent on the initial location of the k centers
  • clusters are assumed to all be the same size
    (example: 2 clusters, one large and one small)
  • clusters are assumed to have the same extent in every direction
    (example: 2 'squashed' clusters with different widths in different directions)
  • clusters must be linearly separable (convex sets)
    (example: 2 non-convex sets)

DBSCAN

Density-Based Spatial Clustering of Applications with Noise

One of the most common clustering algorithms, and among the most cited in the scientific literature

Defines cluster membership based on local density, using a Nearest Neighbors algorithm

DBSCAN requires 2 parameters:

minPts: the minimum number of points to form a dense region

ε: the maximum distance for points to be considered part of a cluster

2 points are considered neighbors if the distance between them is <= ε

regions with a number of points >= minPts are considered dense

DBSCAN

Algorithm:

  • A point p is a core point if at least minPts points are within distance ε of it (including p itself)
  • A point q is directly reachable from p if q is within distance ε of core point p
  • A point q is reachable from p if there is a path p1, ..., pn with p1 = p and pn = q, where each p(i+1) is directly reachable from p(i)
  • All points not reachable from any other point are outliers or noise points

(figures: step-by-step DBSCAN walkthrough with minPts = 3, showing the ε-neighborhood of each point, core points and dense regions, directly reachable and reachable points, and finally the noise/outlier points)

DBSCAN

PROs:

  • Does not require knowledge of the number of clusters
  • Deals with (and identifies) noise and outliers
  • Capable of finding arbitrarily shaped and sized clusters

CONs:

  • Highly sensitive to the choice of ε and minPts
  • Struggles with clusters of different densities
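A minimal sketch with scikit-learn's DBSCAN, whose eps and min_samples arguments correspond to ε and minPts; the two-moons data and the parameter values are placeholders.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# placeholder data: two non-convex, interleaved half-moons (hard for k-Means)
x, _ = make_moons(n_samples=300, noise=0.08, random_state=0)
x = StandardScaler().fit_transform(x)

db = DBSCAN(eps=0.3, min_samples=5).fit(x)  # eps = ε, min_samples = minPts

labels = db.labels_   # label -1 marks noise/outlier points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, np.sum(labels == -1))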

Hierarchical clustering

removes the issue of deciding k (the number of clusters)

it calculates the distance between clusters and single points: the linkage

Agglomerative hierarchical clustering

Hierarchical clustering: agglomerative (bottom up)

(figure: the dendrogram is built step by step by merging the closest points/clusters; vertical axis: distance)

it's deterministic!

computationally intensive, because every cluster-pair distance has to be calculated

it is slow, though it can be optimized; complexity: O(N^2 d + N^3)

Agglomerative clustering:

the algorithm

compute the distance matrix
each data point starts as a singleton cluster
repeat
    merge the 2 clusters with minimum distance
    update the distance matrix
until
    only a single cluster (or n clusters) remains
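A minimal sketch of this procedure with scipy, assuming scipy is installed; the placeholder data and the 'ward' linkage choice are arbitrary.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# placeholder data: two loose groups of 2D points
x = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

# linkage records the full merge history (the dendrogram structure)
z = linkage(x, method="ward")  # 'ward' merges the pair minimizing the variance increase

# cut the merge history to obtain a flat clustering with 2 clusters
labels = fcluster(z, t=2, criterion="maxclust")
print(labels)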

Agglomerative clustering:

PROs:

  • It's deterministic

CONs:

  • It's greedy (optimization is done step by step and agglomeration decisions cannot be undone)
  • It's computationally expensive: O(N^2 d + N^3)

Agglomerative clustering: hyperparameters

  • n_clusters : number of clusters
  • affinity : the distance/similarity definition
  • linkage : the scheme to measure distance to a cluster
  • random_state : for reproducibility

Agglomerative clustering: visualizing the dendrogram
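A minimal sketch of plotting a dendrogram, assuming scipy and matplotlib are installed; the data are the same kind of placeholder as in the previous sketch.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(3, 0.5, (10, 2))])  # placeholder data

# plot the full merge history; the height of each merge is the linkage distance
dendrogram(linkage(x, method="ward"))
plt.ylabel("distance")
plt.show()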

Foundations of Data Science for Everyone - IX: Clustering, by federica bianco