federica bianco
astro | data science | data for good
IX: Clustering
Farid Qamar
this slide deck: https://slides.com/federicabianco/fdsfe_8
1
clustering
Machine Learning
unsupervised learning
identify features and create models that allow us to understand structure in the data
supervised learning
extract features and create models that allow prediction where the correct answer is known for a subset of the data
Machine Learning
objects
features
target
the target is the quantity we want to predict, as a function of the other quantities (features) characterizing the system
objects
features
data is represented by objects, each of which has associated features
example data object:
Flatiron Building, NYC
wikipedia.org/wiki/Flatiron_Building
example feature data types:
float, float, integer, integer, integer, integer/string, array (lat/lon), string
Nf = number of features
No = number of objects
2D dataset (Nf x No)
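A minimal sketch of this representation (the feature names are hypothetical and some values are placeholders, chosen only to match the data types listed above):

```python
import pandas as pd

# hypothetical objects-by-features table: one row per object, one column per feature
buildings = pd.DataFrame([
    {"height_m": 86.9,                  # float
     "lot_area_sqft": 10000.0,          # float (placeholder value)
     "year_built": 1902,                # integer
     "n_floors": 22,                    # integer
     "n_units": 50,                     # integer (placeholder value)
     "zipcode": "10010",                # integer/string
     "lat_lon": (40.7411, -73.9897),    # array (lat/lon)
     "name": "Flatiron Building"},      # string
])
print(buildings.shape)  # (number of objects, number of features) = (1, 8)
```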
Goal:
Find a pattern by dividing the objects into groups such that the objects within a group are more similar to each other than objects outside the group
Internal Criterion:
members of the cluster should be similar to each other (intra-cluster compactness)
External Criterion:
objects outside the cluster should be dissimilar from the objects inside the cluster
mammals
birds
fish
zoologist's clusters
walk
fly
swim
mobility clusters
orange/red/green
black/white/blue
photographer's clusters
The optimal clustering depends on which features you use and how you define similarity between objects
Goal:
Find a pattern by dividing the objects into groups such that the objects within a group are more similar to each other than objects outside the group
How:
we need to quantify similarity: a distance metric between objects
Minkowski family of distances
Euclidean: p=2
features: x, y
Minkowski family of distances
Manhattan: p=1
features: x, y
Minkowski family of distances
L1 is the Minkowski distance with p=1
L2 is the Minkowski distance with p=2
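For reference, the general Minkowski distance between two objects with feature vectors $x$ and $y$, consistent with the L1/L2 definitions above:

```latex
d_p(x, y) = \left( \sum_{i=1}^{N_f} |x_i - y_i|^p \right)^{1/p}
% p = 1: Manhattan / L1 distance (sum of absolute differences)
% p = 2: Euclidean / L2 distance
```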
Example: residuals = (2, 2, 3)
L1 = |2| + |2| + |3| = 7
L2 = 2² + 2² + 3² = 17 (the sum-of-squared-residuals convention used above)
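A quick check of this example with numpy (using the deck's convention that L2 is the sum of squared residuals):

```python
import numpy as np

residuals = np.array([2, 2, 3])
l1 = np.abs(residuals).sum()   # |2| + |2| + |3| = 7
l2 = (residuals ** 2).sum()    # 2**2 + 2**2 + 3**2 = 17
print(l1, l2)                  # 7 17
```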
Great Circle distance
features: latitude and longitude
(the distance between two points along the surface of a sphere)
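One common way to compute the great-circle distance is the haversine formula; a sketch (the function name, example points, and Earth radius value are my own choices, not from the slides):

```python
import numpy as np

def great_circle_km(lat1, lon1, lat2, lon2, r_earth_km=6371.0):
    """Great-circle distance between two (lat, lon) points in degrees, via the haversine formula."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * r_earth_km * np.arcsin(np.sqrt(a))

# distance between two points in Manhattan (roughly the Flatiron Building and Union Square)
print(great_circle_km(40.7411, -73.9897, 40.7359, -73.9911))
```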
Simple Matching Distance
Uses presence/absence of features in data
M11: number of features present in both i and j
M10: number of features present in i but not j
M01: number of features present in j but not i
M00: number of features present in neither

Simple Matching Coefficient (or Rand similarity):
SMC = (M11 + M00) / (M11 + M10 + M01 + M00)

| observation i \ observation j | 1 | 0 | sum |
|---|---|---|---|
| 1 | M11 | M10 | M11+M10 |
| 0 | M01 | M00 | M01+M00 |
| sum | M11+M01 | M10+M00 | M11+M10+M01+M00 |
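A minimal sketch of the simple matching coefficient for two binary feature vectors (the function and example vectors are illustrative, not from the slides):

```python
import numpy as np

def simple_matching_coefficient(i, j):
    """(M11 + M00) / (M11 + M10 + M01 + M00) for two binary feature vectors."""
    i, j = np.asarray(i, dtype=bool), np.asarray(j, dtype=bool)
    m11 = np.sum(i & j)            # features present in both observations
    m00 = np.sum(~i & ~j)          # features present in neither observation
    return (m11 + m00) / i.size    # matches over all features

print(simple_matching_coefficient([1, 0, 1, 1], [1, 1, 0, 1]))  # 0.5
```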
Data can have covariance (and it almost always does!)
PLUTO Manhattan data (42,000 x 15)
axis 1 -> features
axis 0 -> observations
Pearson's correlation (linear correlation):
correlation = covariance / (product of the standard deviations of the two features)
PLUTO Manhattan data (42,000 x 15) correlation matrix
axis 1 -> features
axis 0 -> observations
A covariance matrix is diagonal if the data has no correlation
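A small sketch of how covariance and Pearson correlation matrices are typically computed with pandas (synthetic data standing in for PLUTO, whose real columns are not reproduced here):

```python
import numpy as np
import pandas as pd

# synthetic stand-in table: axis 0 -> observations, axis 1 -> features
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
df = pd.DataFrame({
    "feature_a": x,
    "feature_b": 2 * x + rng.normal(size=1000),   # correlated with feature_a
    "feature_c": rng.normal(size=1000),           # uncorrelated
})

print(df.cov())    # covariance matrix: large off-diagonal terms where features co-vary
print(df.corr())   # Pearson correlation: covariance scaled by the standard deviations, in [-1, 1]
```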
Generic preprocessing... WHY??
Worldbank Happiness Dataset https://github.com/fedhere/MLPNS_FBianco/blob/main/clustering/happiness_solution.ipynb
Skewed data distribution:
std(x) ~ range(y)
Clustering without scaling:
only the variable with more spread matters
Clustering after scaling:
both variables matter equally
Data that is not correlated appears as a sphere in the N-dimensional feature space
Data can have covariance (and it almost always does!)
ORIGINAL DATA
STANDARDIZED DATA
Generic preprocessing
for each feature: subtract the mean and divide by the standard deviation
the mean of each feature should be 0 and its standard deviation should be 1
Generic preprocessing: most commonly, we will just correct for the spread and centroid
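As a sketch, standardization (correcting for centroid and spread) with scikit-learn; the data here is synthetic, with two features of very different scales:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x = np.column_stack([
    rng.normal(2000, 500, size=300),   # feature with a large spread
    rng.normal(5, 0.5, size=300),      # feature with a small spread
])

x_std = StandardScaler().fit_transform(x)   # per feature: subtract mean, divide by std
print(x_std.mean(axis=0).round(2))          # ~ [0, 0]
print(x_std.std(axis=0).round(2))           # ~ [1, 1]
```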
2
k-Means clustering
we can define similarity in terms of distance
a common measure of distance is the squared Euclidean distance
(the squared L2 norm: the sum of squared differences)
also called the inertia
example:
if the data is on buildings with the features: year built (Y) and energy use (E), the distance between two objects (1 and 2):
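Written out with the squared-Euclidean convention defined above, this distance would be:

```latex
d(1, 2) = (Y_1 - Y_2)^2 + (E_1 - E_2)^2
```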
Density-based Clustering
Distribution-based Clustering
Hierarchical Clustering
Centroid-based Clustering
objective: minimizing the aggregate distance within the cluster
total intra-cluster variance = $\sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$, where $\mu_k$ is the centroid of cluster $C_k$
hyperparameters: must declare the number of clusters prior to clustering
k-Means Clustering
1. choose the number of clusters k
2. place the k cluster centers (e.g., at random)
3. assign each object to the nearest cluster center
4. recompute each center as the mean of the objects assigned to it
5. if the assignments did not change, stop. Else, go back to step 3
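A minimal scikit-learn sketch of these steps on synthetic data (the data and the choice k = 3 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic data: three blobs in a 2-feature space
rng = np.random.default_rng(0)
x = np.vstack([
    rng.normal((0, 0), 1, size=(100, 2)),
    rng.normal((5, 0), 1, size=(100, 2)),
    rng.normal((0, 5), 1, size=(100, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(x)
print(km.cluster_centers_)   # the k centroids found by the iteration above
print(km.inertia_)           # total intra-cluster squared distance
labels = km.labels_          # cluster assignment for each object
```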
choosing k clusters
2.1
if the number of variables is low:
visualize and pick k manually
or... use the Elbow Method:
run k-means for a range of k values and plot the total intra-cluster variance (inertia) against k
look for the "elbow" inflection point where the curve flattens (in this example, optimal k = 4 centers)
But... this doesn't always work!
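A sketch of the elbow method: run k-means for a range of k and plot the inertia (synthetic data with 4 blobs, so the elbow should appear near k = 4):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = [(0, 0), (4, 0), (0, 4), (4, 4)]
x = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in centers])

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(x).inertia_ for k in ks]

plt.plot(ks, inertias, "o-")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (total intra-cluster variance)")
plt.show()   # the curve flattens after the 'elbow', here around k = 4
```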
problems with k-Means
2.2
highly dependent on initial location of k centers
clusters are assumed to all be the same size
example: 2 clusters, one large and one small
clusters are assumed to have the same extent in every direction
example: 2 'squashed' clusters with different widths in different directions
clusters must be linearly separable (convex sets)
example: 2 non-convex sets
3
DBSCAN
Density-based spatial clustering of applications with noise
One of the most common clustering algorithms and most cited in scientific literature
Defines cluster membership based on local density, estimated with a Nearest Neighbors approach
Density-based spatial clustering of applications with noise
Requires 2 parameters:
minPts: minimum number of points to form a dense region
ε: maximum distance for points to be considered part of a cluster
2 points are considered neighbors if the distance between them is ≤ ε
regions containing at least minPts points are considered dense
Algorithm (illustrated with minPts = 3):
for each point, count the points within a distance ε of it
a point with at least minPts points within ε is a core point, and its ε-neighborhood is a dense region
a point within ε of a core point is directly reachable from that core point
a point is reachable from a core point if it can be connected to it through a chain of directly reachable core points
a cluster is made of the core points and all the points reachable from them
points not reachable from any core point are labeled noise/outliers
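A minimal scikit-learn sketch (DBSCAN's min_samples plays the role of minPts and eps the role of ε; the data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
x = np.vstack([
    rng.normal((0, 0), 0.3, size=(100, 2)),   # dense blob 1
    rng.normal((3, 3), 0.3, size=(100, 2)),   # dense blob 2
    rng.uniform(-2, 5, size=(10, 2)),         # sparse points, mostly noise
])

db = DBSCAN(eps=0.5, min_samples=3).fit(x)
labels = db.labels_                           # cluster index per point; -1 marks noise/outliers
print(np.unique(labels, return_counts=True))
```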
4
Hierarchical (agglomerative) clustering
clusters are built bottom-up: every point starts as its own cluster and the closest clusters are iteratively merged, according to a chosen distance
Algorithm:
compute the distance matrix
each data point starts as a singleton cluster
repeat:
merge the 2 clusters with minimum distance
update the distance matrix
until only a single cluster remains (or stop at the desired number n of clusters)
Order: O(n³) for the naive implementation
PROs
It's deterministic
CONs
It's greedy (optimization is done step by step and agglomeration decisions cannot be undone)
It's computationally expensive: every cluster-pair distance has to be calculated, so it is slow, though it can be optimized
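A sketch of agglomerative clustering with scipy (synthetic data; 'average' linkage is just one possible choice of cluster-to-cluster distance):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
x = np.vstack([
    rng.normal((0, 0), 0.5, size=(50, 2)),
    rng.normal((4, 4), 0.5, size=(50, 2)),
])

# bottom-up merging: every point starts as its own cluster,
# then the two closest clusters are merged until one cluster remains
z = linkage(x, method="average", metric="euclidean")   # full merge history (dendrogram)
labels = fcluster(z, t=2, criterion="maxclust")        # cut the tree into 2 flat clusters
print(np.unique(labels, return_counts=True))
```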