foundations of data science for everyone

IX: Clustering
Farid Qamar 



Machine Learning

unsupervised learning

identify features and create models that allow us to understand structure in the data


supervised learning

extract features and create models that allow prediction where the correct answer is known for a subset of the data


unsupervised learning:

  • clustering
  • Principal Component Analysis
  • Apriori (association rule)

supervised learning:

  • k-Nearest Neighbors
  • regression
  • Support Vector Machines
  • Classification/Regression Trees
  • Neural Networks





Supervised ML

the data is treated as a function of another quantity that characterizes the system

Unsupervised ML



data is represented by objects, each of which has associated features


example data object:

Flatiron Building, NYC

example features:

  • height
  • energy use
  • number of floors
  • number of occupants
  • age in years
  • zipcode
  • footprint
  • owner







array (lat/lon)



Nf = number of features

No = number of objects

2D dataset (Nf x No)



Find a pattern by dividing the objects into groups such that the objects within a group are more similar to each other than objects outside the group




Internal Criterion:

members of the cluster should be similar to each other (intra-cluster compactness)

External Criterion:

objects outside the cluster should be dissimilar from the objects inside the cluster





zoologist's clusters





mobility clusters




photographer's clusters


The optimal clustering depends on

  • how you define similarity/distance
  • the purpose of the clustering


Find a pattern by dividing the objects into groups such that the objects within a group are more similar to each other than objects outside the group


  • Define similarity/dissimilarity function
  • Figure out the grouping of objects based on the chosen similarity/dissimilarity function
    • Objects within a cluster are similar
    • Objects across clusters are not so similar



we can define similarity in terms of distance

a common measure of distance is the squared Euclidean distance

(a.k.a. the squared L2-norm, or sum of squared differences)

\text{distance}_{i,j}^2 = \sum_{k=1}^{N_f}(x_{ik} - x_{jk})^2

when summed over all objects and their cluster centers, this is also called the inertia



if the data is on buildings with the features year built (Y) and energy use (E), the squared distance between two buildings (1 and 2) is:

\text{distance}^2_{bldg 1,2} = (Y_{bldg,1} - Y_{bldg,2})^2 + (E_{bldg,1} - E_{bldg,2})^2
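The building example above can be sketched in a few lines of Python. The function name `squared_euclidean` and the two buildings' values are assumptions made up for illustration, not data from the slides.

```python
def squared_euclidean(a, b):
    """Sum of squared differences between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of features")
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical buildings: (year built, energy use in arbitrary units).
bldg_1 = (1902, 50.0)
bldg_2 = (1990, 80.0)

d2 = squared_euclidean(bldg_1, bldg_2)
print(d2)  # (1902 - 1990)^2 + (50 - 80)^2 = 7744 + 900 = 8644.0
```

Note that features measured in different units contribute unequally to this sum, so in practice features are often rescaled before distances are computed.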

Types of Clustering 

Density-based Clustering

Distribution-based Clustering

Hierarchical Clustering

Centroid-based Clustering


k-Means clustering

k-Means: the objective function

objective: minimize the aggregate squared distance within each cluster

total intra-cluster variance =

\sum_k\sum_{i\in k}\|\vec{x}_i - \vec{\mu}_k\|^2

where \vec{\mu}_k is the center (mean) of cluster k

hyperparameter: the number of clusters k must be declared prior to clustering
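To make the objective concrete, here is a minimal sketch that evaluates the total intra-cluster variance for a given assignment of points to clusters. The points, labels, and centers are made-up values chosen so the result is easy to check by hand.

```python
def inertia(points, labels, centers):
    """Total intra-cluster variance: sum over all points of the squared
    distance to the center of the cluster each point is assigned to."""
    total = 0.0
    for point, k in zip(points, labels):
        total += sum((x - m) ** 2 for x, m in zip(point, centers[k]))
    return total

# Made-up 2D points with a hand-picked assignment to k = 2 clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 1.0), (10.0, 1.0)]

print(inertia(points, labels, centers))  # each point is 1 unit from its center: 4.0
```

A poor grouping of the same points (for example, pairing the two bottom points and the two top points) gives a much larger value, which is why minimizing this quantity favors compact clusters.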

k-Means Clustering

  1. choose k initial centers
  2. calculate the inertia
  3. assign each object to the nearest cluster center
  4. update the cluster centers to be the average of their assigned population
  5. calculate the inertia
  6. IF the inertia has not changed, stop

      Else, go back to step 3
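The six steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the initial centers are drawn at random from the data (one possible choice for step 1), an empty cluster simply keeps its old center, and the `seed` and `max_iter` parameters are assumptions added so the example is repeatable and guaranteed to stop.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Step 3: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            for p in points]

def update(points, labels, centers):
    """Step 4: move each center to the mean of its assigned points."""
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old_center)  # assumption: leave an empty cluster in place
    return new_centers

def inertia(points, centers):
    """Steps 2 and 5: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # step 1
    current = inertia(points, centers)              # step 2
    for _ in range(max_iter):
        previous = current
        labels = assign(points, centers)            # step 3
        centers = update(points, labels, centers)   # step 4
        current = inertia(points, centers)          # step 5
        if current == previous:                     # step 6: no change, stop
            break
    return centers, assign(points, centers), current

# Made-up data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels, final_inertia = k_means(points, k=2)
print(labels)
print(final_inertia)
```

Which group receives label 0 and which receives label 1 depends on the random initial centers; only the grouping itself is meaningful.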


choosing k clusters




if the number of variables is low:

visualize the data and pick k manually

Choosing number of clusters k

or...use the Elbow Method

  • For each candidate k, calculate the distance of all points to their nearest cluster center
  • Calculate the inertia (sum of the squared distances)
  • Plot the inertia against k and find the "Elbow"
    • point after which the inertia starts decreasing linearly
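Finding the "Elbow" is usually done by eye, but it can be roughly automated. The sketch below assumes the inertia has already been computed for a range of consecutive k values and picks the k where the curve bends most sharply (largest second difference). Both this heuristic and the inertia values are assumptions made up for illustration; other elbow criteria exist and may disagree.

```python
def find_elbow(inertia_by_k):
    """Return the k with the largest second difference of the inertia curve.

    Assumption: the sharpest bend is taken as the elbow. This needs the
    inertia at consecutive integer values of k.
    """
    ks = sorted(inertia_by_k)
    if len(ks) < 3:
        raise ValueError("need inertia for at least three values of k")
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        bend = inertia_by_k[prev_k] - 2 * inertia_by_k[k] + inertia_by_k[next_k]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Made-up inertia values: a steep drop up to k = 4, then a slow decline.
inertia_by_k = {1: 1000.0, 2: 600.0, 3: 300.0, 4: 100.0, 5: 80.0, 6: 65.0, 7: 52.0}
print(find_elbow(inertia_by_k))  # 4
```

When the inertia curve has no clear bend, this heuristic still returns a value, so the plot is worth inspecting before trusting the result.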

example: the "Elbow" inflection point gives an optimal k = 4 centers


But...this doesn't always work!


problems with k-Means





highly dependent on initial location of k centers


clusters are assumed to all be the same size

example: 2 clusters, one large and one small


clusters are assumed to have the same extent in every direction

example: 2 'squashed' clusters with different widths in different directions


clusters must be linearly separable (convex sets)

example: 2 non-convex sets





DBSCAN: Density-based spatial clustering of applications with noise

One of the most common clustering algorithms and most cited in scientific literature



Defines cluster membership based on local density:

Nearest Neighbors algorithm



Requires 2 parameters:

  • ε: maximum distance for points to be considered part of a cluster
    • 2 points are considered neighbors if the distance between them <= ε
  • minPts: minimum number of points to form a dense region
    • regions with number of points >= minPts are considered dense


  • A point p is a core point if at least minPts points are within distance ε of it (including p)
  • A point q is directly reachable from p if point q is within distance ε from core point p
  • A point q is reachable from p if there is a path p_1, ..., p_n with p_1 = p and p_n = q, where each p_{i+1} is directly reachable from p_i
  • All points not reachable from any other point are outliers or noise points
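The definitions above translate fairly directly into code. The following is a minimal plain-Python sketch, assuming Euclidean distance and a brute-force neighbor search; the names `region_query`, `dbscan`, and `NOISE` are chosen here for illustration, and the example points are made up.

```python
NOISE = -1

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if sq_dist(points[i], q) <= eps ** 2]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)        # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:     # not a core point
            labels[i] = NOISE            # may still become a border point later
            continue
        cluster += 1                     # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:       # directly reachable from a core point
                labels[j] = cluster
            if labels[j] is not None:    # already assigned to a cluster
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also a core point: expand
                queue.extend(j_neighbors)
    return labels

# Made-up data: two dense groups and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (50.0, 50.0)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The isolated point receives the label -1, marking it as noise, and the number of clusters is never specified in advance.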




example: minPts = 3, showing directly reachable points, reachable points, and dense regions





  • Does not require knowledge of the number of clusters
  • Deals with (and identifies) noise and outliers
  • Capable of finding arbitrarily shaped and sized clusters



  • Highly sensitive to choice of ε and minPts
  • Cannot handle clusters with widely different densities (a single ε and minPts apply to the whole dataset)
