federica bianco
astro | data science | data for good
IX: Clustering
Farid Qamar
this slide deck: https://slides.com/federicabianco/fdsfe_8
1
clustering
Machine Learning
unsupervised learning
identify features and create models that allow us to understand the structure in the data
supervised learning
extract features and create models that allow prediction where the correct answer is known for a subset of the data
objects
features
target
data as a function of another number characterizing the system
data is represented by objects, each of which has associated features
example data object:
Flatiron Building, NYC
wikipedia.org/wiki/Flatiron_Building
example features:
float
float
integer
integer
integer
integer/string
array (lat/lon)
string
Nf = number of features
No = number of objects
2D dataset (Nf x No)
https://www.netimpact.org/chapters/new-york-city-professional
Goal:
Find a pattern by dividing the objects into groups such that the objects within a group are more similar to each other than objects outside the group
Internal Criterion:
members of the cluster should be similar to each other (intra-cluster compactness)
External Criterion:
objects outside the cluster should be dissimilar from the objects inside the cluster
mammals
birds
fish
zoologist's clusters
walk
fly
swim
mobility clusters
orange/red/green
black/white/blue
photographer's clusters
The optimal clustering depends on
Goal:
Find a pattern by dividing the objects into groups such that the objects within a group are more similar to each other than objects outside the group
How:
we can define similarity in terms of distance
a common measure of distance is the squared Euclidean distance
(a.k.a. the squared L2-norm, the sum of squared differences)
summed over all objects and their cluster centers, it is also called the inertia
example:
if the data is on buildings with the features: year built (Y) and energy use (E), the distance between two objects (1 and 2):
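The formula for this example is not preserved in this text. Assuming the squared Euclidean distance named above, it would read d_12 = (Y_1 - Y_2)^2 + (E_1 - E_2)^2. A minimal sketch follows; the function name `squared_euclidean` and the two buildings' values are made up for illustration.

```python
# Squared Euclidean distance between two objects, each described by
# year built (Y) and energy use (E). All numbers are hypothetical.

def squared_euclidean(a, b):
    """Sum of squared feature-wise differences between two objects."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

building_1 = (1902, 120.0)  # (Y, E), hypothetical values
building_2 = (1931, 95.0)   # (Y, E), hypothetical values

d12 = squared_euclidean(building_1, building_2)
print(d12)  # (1902 - 1931)**2 + (120.0 - 95.0)**2 = 841 + 625.0 = 1466.0
```

Features with a larger numeric range dominate this sum, so in practice features are often rescaled before distances are computed.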
Density-based Clustering
Distribution-based Clustering
Hierarchical Clustering
Centroid-based Clustering
2
k-Means clustering
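objective: minimize the aggregate distance within each cluster
total intra-cluster variance = $\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$, where $\mu_j$ is the center (mean) of cluster $C_j$
hyperparameter: the number of clusters k must be declared prior to clustering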
k-Means Clustering
Else, go back to step 3
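The numbered steps on these slides are not preserved in this text. A common formulation (Lloyd's algorithm), which may differ in detail from the slides, is: (1) choose k, (2) pick k initial centers, (3) assign each object to its nearest center, (4) move each center to the mean of its members, (5) stop if the centers did not move, else go back to step 3. The sketch below assumes that numbering; the names `k_means`, `sq_dist`, and `mean_point` and the six sample points are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_point(points):
    """Feature-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Steps 1-2: k is given; pick k distinct objects as initial centers.
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign every object to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Step 4: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean_point(members) if members else centers[j])
        # Step 5: stop when the centers no longer move; else repeat step 3.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

# Two tight, well-separated groups (hypothetical data).
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers, labels = k_means(points, k=2)
print(centers)
print(labels)
```

Which group receives label 0 or 1 depends on the random initial centers, so only the grouping itself is meaningful.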
2.1
choosing k clusters
if low number of variables:
Visualize and pick k manually
or...use the Elbow Method
"Elbow" inflection point
Optimal k = 4 centers
But...this doesn't always work!
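The elbow method can be sketched as follows: run k-means for a range of k, record the total intra-cluster variance (inertia) for each, and look for the k after which adding clusters stops paying off. The sketch below is an illustration under stated assumptions, not the deck's code. It uses a deterministic farthest-first start so the result is reproducible, and it picks the bend with a crude largest-second-difference rule. Both are choices made here; the data and all function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic start: each new center is the point farthest
    from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def k_means_inertia(points, k, max_iter=100):
    """Run k-means and return the final sum of squared distances
    between each point and its assigned center."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))

# Three compact groups of four points each (hypothetical data).
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1), (11, 1),
          (5, 8), (6, 8), (5, 9), (6, 9)]

inertia = {k: k_means_inertia(points, k) for k in range(1, 7)}
for k, value in inertia.items():
    print(k, round(value, 2))

# Crude elbow pick (an assumption, not a standard rule): the k where
# the curve bends most, i.e. the largest second difference.
bend = {k: (inertia[k - 1] - inertia[k]) - (inertia[k] - inertia[k + 1])
        for k in range(2, 6)}
elbow_k = max(bend, key=bend.get)
print("elbow at k =", elbow_k)  # elbow at k = 3
```

On data without clearly separated groups the curve bends gradually and no single k stands out, which is one way the method fails.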
2.2
problems with k-Means
highly dependent on initial location of k centers
clusters are assumed to all be the same size
example: 2 clusters, one large and one small
clusters are assumed to have the same extent in every direction
example: 2 'squashed' clusters with different widths in different directions
clusters must be linearly separable (convex sets)
example: 2 non-convex sets
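The last limitation can be illustrated without running the full algorithm. k-means assigns each object to its nearest center, so the boundary between two clusters is always a straight line (the perpendicular bisector between the two centers). The sketch below uses two concentric rings as a hypothetical non-convex example. Their centroids coincide, and a pair of centers (chosen arbitrarily here) cuts both rings in half instead of separating inner from outer.

```python
import math

def ring(radius, n=12):
    """n evenly spaced points on a circle around the origin
    (half-step angular offset so no point sits on the y axis)."""
    return [(radius * math.cos(2 * math.pi * (i + 0.5) / n),
             radius * math.sin(2 * math.pi * (i + 0.5) / n))
            for i in range(n)]

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def nearest(p, centers):
    """Index of the center closest to p (the k-means assignment rule)."""
    return min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))

inner, outer = ring(1.0), ring(5.0)

# The two "true" clusters have the same centroid, so no choice of
# centers can represent them as separate k-means clusters.
print(centroid(inner), centroid(outer))  # both at the origin up to rounding

centers = [(-3.0, 0.0), (3.0, 0.0)]  # hypothetical pair of centers
inner_labels = [nearest(p, centers) for p in inner]
outer_labels = [nearest(p, centers) for p in outer]
print(inner_labels.count(0), inner_labels.count(1))  # 6 6
print(outer_labels.count(0), outer_labels.count(1))  # 6 6
```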
3
DBSCAN
Density-based spatial clustering of applications with noise
One of the most common clustering algorithms and most cited in scientific literature
Defines cluster membership based on local density:
Nearest Neighbors algorithm
Requires 2 parameters:
minPts: minimum number of points to form a dense region
ε: maximum distance for points to be considered part of a cluster
2 points are considered neighbors if the distance between them <= ε
regions with number of points >= minPts are considered dense
Algorithm:
[Figure sequence with ε and minPts = 3, labeling points as directly reachable, core (dense region), reachable, and noise/outliers]
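A minimal sketch of the procedure, under stated assumptions: a point is core if at least minPts points (counting itself) lie within ε of it, clusters grow outward through core points, non-core points within ε of a core point join its cluster, and everything else is labeled noise (-1). Counting the point itself toward minPts is an assumption the slides do not settle. The function names and sample points are hypothetical.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)   # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE       # may later be claimed by a cluster
            continue
        cluster += 1                # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)     # points directly reachable from i
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # reachable but not core (border point)
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is core: expand through it
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),   # dense region
          (2.2, 1.0),                        # reachable from (1, 1), not core
          (10, 10), (10, 11), (11, 10),      # second dense region
          (5, 5)]                            # isolated point
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

Unlike k-means, the number of clusters is not specified in advance. It follows from ε and minPts.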
PROs:
CONs: