How to choose your clustering method?

Or

What should I do if I have a bunch of data in an n-dimensional space and I need to separate it into different groups, but I'm completely lost in the sea of possibilities offered by the available algorithms?

Felipe Delestro
delestro@biologie.ens.fr

 

MACHINE LEARNING JOURNAL CLUB

"Sometimes we ask ourselves, why don't we just group it all together?" (Fernando Anitelli, translated from the Portuguese)

Linear separability

TOY DATA-SETS

DATASET 01: n = 10,000, std = 0.3, k = 20

DATASET 02: n = 10,000 & 20,000, std = 0.8 & 1.6, k = 3

DATASET 03: n = 10,000, std = 1.0, k = 1

DATASET 04: n = 10,000, k = 2

DATASET 05: n = 10,000, k = 2
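
The generation code is not shown on the slides; below is a minimal sketch of plausible generators using scikit-learn's toy-data helpers. The choice of make_moons/make_circles for DATASETS 04 and 05, the per-blob mapping for DATASET 02, the noise levels, and the random seeds are all assumptions.

from sklearn.datasets import make_blobs, make_circles, make_moons

# DATASET 01: 20 Gaussian blobs, n = 10,000, std = 0.3
X1, y1 = make_blobs(n_samples=10_000, centers=20, cluster_std=0.3,
                    random_state=0)

# DATASET 02: 3 blobs mixing sizes (10,000 & 20,000) and spreads (0.8 & 1.6);
# the per-blob mapping below is a guess
X2, y2 = make_blobs(n_samples=[10_000, 10_000, 20_000], centers=None,
                    cluster_std=[0.8, 0.8, 1.6], random_state=0)

# DATASET 03: a single blob, n = 10,000, std = 1.0
X3, y3 = make_blobs(n_samples=10_000, centers=1, cluster_std=1.0,
                    random_state=0)

# DATASETS 04 and 05: two non-linearly separable groups each; the classic
# two-moons and concentric-circles shapes are assumed here
X4, y4 = make_moons(n_samples=10_000, noise=0.05, random_state=0)
X5, y5 = make_circles(n_samples=10_000, noise=0.05, factor=0.5, random_state=0)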

Clustering methods

Methods

  • K-Means
  • Affinity propagation
  • Spectral clustering
  • Ward hierarchical clustering
  • DBSCAN
  • Gaussian mixtures
  • Birch
  • Mean shift

K-Means

The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.
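
A minimal scikit-learn usage sketch (not from the slides; parameter values are illustrative, with n_clusters=20 matching DATASET 01, and X standing for the (n_samples, n_features) data array):

from sklearn.cluster import KMeans

# init can be "random" or "k-means++" (smarter seeding); the slides
# compare both on DATASET 01
km = KMeans(n_clusters=20, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)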

[Result figures: K-means on DATASETS 01-05. The DATASET 01 slides compare random initialisation with k-means++.]

AFFINITY PROPAGATION

AffinityPropagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen, and hence the final clustering is given.
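
A minimal scikit-learn sketch using the damping value annotated on the slides (other parameters left at their defaults):

from sklearn.cluster import AffinityPropagation

# damping = 0.9 as on the slides; affinity propagation does not need the
# number of clusters up front
ap = AffinityPropagation(damping=0.9)
labels = ap.fit_predict(X)
n_found = len(ap.cluster_centers_indices_)  # number of exemplars chosen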

[Result figures: Affinity propagation on DATASETS 01-05, with damping = 0.9 on DATASET 01.]

SPECTRAL CLUSTERING

SpectralClustering performs a low-dimension embedding of the affinity matrix between samples, followed by a KMeans in the low-dimensional space. It is especially efficient if the affinity matrix is sparse and the pyamg module is installed. SpectralClustering requires the number of clusters to be specified. It works well for a small number of clusters but is not advised when using many clusters.
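
A minimal scikit-learn sketch; the slides compare the two affinity options on DATASETS 04 and 05 (n_clusters=2 here is illustrative):

from sklearn.cluster import SpectralClustering

# affinity="rbf" is the default; "nearest_neighbors" builds a sparse
# k-NN affinity graph instead
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        random_state=0)
labels = sc.fit_predict(X)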

[Result figures: Spectral clustering on DATASETS 01-05. The DATASET 04 and 05 slides compare affinity: rbf with affinity: nearest neighbors.]

WARD

Hierarchical clustering is a general family of clustering algorithms that build nested clusters by merging or splitting them successively. This hierarchy of clusters is represented as a tree (or dendrogram). The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample. Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.
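
A minimal scikit-learn sketch of unstructured Ward via AgglomerativeClustering (n_clusters=2 is illustrative):

from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges, at each step, the pair of clusters that least
# increases the total within-cluster variance
ward = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = ward.fit_predict(X)

# a connectivity matrix (built below) can be passed via the `connectivity`
# argument to constrain merges to neighbouring samples (structured Ward)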

[Result figures: Ward hierarchical clustering on the toy data-sets, including DATASET 04.]

For structured Ward on DATASET 04, the slides build a k-nearest-neighbours connectivity graph:

from sklearn.neighbors import kneighbors_graph

# connectivity matrix for structured Ward
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)

[Result figures: Ward hierarchical clustering on DATASETS 04 and 05.]

DBSCAN

The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be any shape, as opposed to k-means, which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to the others (as measured by some distance measure), and a set of non-core samples that are close to a core sample (but are not themselves core samples).
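
A minimal scikit-learn sketch (the eps and min_samples values are illustrative; the slides do not show them):

from sklearn.cluster import DBSCAN

# eps: neighbourhood radius; min_samples: density threshold for a sample
# to count as a core sample
db = DBSCAN(eps=0.3, min_samples=10)
labels = db.fit_predict(X)  # noise points are labelled -1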

[Result figures: DBSCAN on DATASETS 01-05.]

GAUSSIAN MIXTURES

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.
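
A minimal scikit-learn sketch (n_components=3 matches DATASET 02; covariance_type is an illustrative choice):

from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gm.fit_predict(X)   # hard assignments
probs = gm.predict_proba(X)  # soft (probabilistic) assignments, unlike k-means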

[Result figures: Gaussian mixtures on DATASETS 01-05.]

BIRCH

Balanced Iterative Reducing and Clustering using Hierarchies is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources (memory and time constraints). 
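
A minimal scikit-learn sketch using the threshold annotated on the slides (n_clusters=3 is illustrative):

from sklearn.cluster import Birch

# threshold = 0.10 as on the DATASET 02 slides; it bounds the radius of the
# subclusters held in the CF-tree leaves
brc = Birch(threshold=0.10, n_clusters=3)
labels = brc.fit_predict(X)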

[Result figures: Birch on DATASETS 01-05, with threshold = 0.10 on DATASET 02.]

Mean Shift

Mean shift clustering aims to discover “blobs” in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids.
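
A minimal scikit-learn sketch combining the settings annotated on the slides (the quantile value varies per dataset):

from sklearn.cluster import MeanShift, estimate_bandwidth

# estimate the kernel bandwidth from the data; the slides use quantile
# values between 0.04 and 0.15 depending on the dataset
bandwidth = estimate_bandwidth(X, quantile=0.04)
ms = MeanShift(bandwidth=bandwidth, cluster_all=False)
labels = ms.fit_predict(X)  # with cluster_all=False, orphan points get -1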

[Result figures: Mean shift on DATASETS 01-05. Bandwidth is set with estimate_bandwidth(X, quantile=...): quantile = 0.04 on DATASET 01 (also shown with cluster_all=False), 0.15 on DATASET 02, 0.10 and 0.05 on DATASET 04, and 0.10 on DATASET 05.]

Summary
