From supervised learning... to unsupervised learning

In supervised learning the dataset contains both inputs and labels:

\mathcal{D} = \{(x_i, y_i)\}_{1\leq{i} \leq{N}}

In unsupervised learning only the inputs are available:

\mathcal{D} = \{x_i\}_{1\leq{i} \leq{N}}

With unsupervised learning you can find...

  • efficient representations (embeddings, interpretation)
  • estimations of your data distribution (to generate new samples)
  • groups of similar samples (free labels, fill in the blanks)
  • outliers (anomaly detection, de-noising)

How to?

  • Dimension reduction
  • Clustering

Dimension reduction

"the curse of dimensionality"

To avoid redundancy and unnecessary computational load


To visualize the data


To improve data representation
(supervised task pre-processing: semi-supervised learning)

Two families of methods:

  • Feature selection
  • Feature extraction

Feature selection

Consider a dataset \mathcal{D} = \{x_i\}_{1\leq{i} \leq{N}}, stored as a data matrix D with N samples (rows) and M features (columns):

         feat1  feat2  feat3  feat4  feat5
    x1     1      2      2      6      3
    x2     2      4      4     12      7
    x3     3      6      8     24      9
    ...
    xn     4      8     16     48     11

Here feat2 = 2 \times feat1 and feat4 = 3 \times feat3: two columns are redundant and carry no extra information. Feature selection keeps only an informative subset of the original features:

         feat1  feat3  feat5
    x1     1      2      3
    x2     2      4      7
    x3     3      8      9
    ...
    xn     4     16     11

Feature selection example with the Breast Cancer dataset

The samples are malignant breast fine needle aspirates. The data matrix D has N = 569 samples and M = 10 features (only 3 shown here).
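A minimal sketch of correlation-based feature selection on this dataset (an assumption: scikit-learn ships a 30-feature version of it; the 0.95 threshold is an arbitrary illustrative choice):

    import numpy as np
    from sklearn.datasets import load_breast_cancer

    X, y = load_breast_cancer(return_X_y=True)   # X has shape (569, 30)
    corr = np.corrcoef(X, rowvar=False)          # (30, 30) feature correlation matrix

    # Greedily keep a feature only if it is not highly correlated
    # with any feature already kept
    kept = []
    for j in range(X.shape[1]):
        if all(abs(corr[j, k]) < 0.95 for k in kept):
            kept.append(j)

    X_selected = X[:, kept]
    print(f"kept {len(kept)} of {X.shape[1]} features")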

Feature extraction

e.g. Principal Component Analysis (PCA):

  • linearly combine features to find mutually orthogonal components
  • the (principal) components are ranked from the most "significant" to the least "significant"
  • projecting the data on the first components maximizes its spread (variance)
  • dimension reduction: keep the d first components
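A minimal PCA sketch with scikit-learn (keeping d = 2 components and standardizing first are illustrative choices):

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_breast_cancer(return_X_y=True)
    X = StandardScaler().fit_transform(X)     # PCA is sensitive to feature scales

    pca = PCA(n_components=2)                 # keep the d = 2 first components
    X_projected = pca.fit_transform(X)        # shape (569, 2)
    print(pca.explained_variance_ratio_)      # variance captured by each component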

demo with PCA (feature extraction)

PCA (feature extraction), with M = 7 features here

Diagonalize the M \times M (covariance) matrix D of the data:

D = V \,\mathrm{diag}(\lambda)\, V^{-1}, \qquad V^{-1} = V^T

The columns of V are the eigenvectors v_1, \dots, v_7 (the principal components) and \lambda_1, \dots, \lambda_7 are the corresponding eigenvalues, ranked by decreasing variance.

A sample x is projected onto the components by

x_{\mathrm{projected}} = V^{-1} x

With d = M = 7 this is just a change of basis. For dimension reduction, keep only the first d < M rows of V^{-1} (e.g. d = 4, or d = 3): x_{\mathrm{projected}} then lives in a d-dimensional space.
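The same idea from scratch with NumPy, a sketch under the slide's setting (M = 7, d = 3; the data here are random placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(569, 7))          # N = 569 samples, M = 7 features
    Xc = X - X.mean(axis=0)                # center the data

    C = np.cov(Xc, rowvar=False)           # (7, 7) covariance matrix
    lam, V = np.linalg.eigh(C)             # eigenvalues / eigenvectors (ascending)
    order = np.argsort(lam)[::-1]          # rank by decreasing variance
    lam, V = lam[order], V[:, order]

    d = 3
    X_projected = Xc @ V[:, :d]            # first d rows of V^T applied to each sample
    print(X_projected.shape)               # (569, 3)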

[figure: the data projected on the 1st dimension of PCA, versus the selection of the "mean texture" feature (normalized)]

Dimension reduction (linear): PCA, ICA (Independent Component Analysis), ...

Non-linear dimension reduction:

  • ISOMAP
  • Locally Linear Embedding
  • Hessian Eigenmapping
  • Local Tangent Space Alignment
  • t-distributed Stochastic Neighbor Embedding (t-SNE)
  • UMAP
  • (deep) auto-encoders...

Dimension reduction (non-linear): t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE computes an N \times N similarity matrix between samples in the input space (e.g. 0.9 for two close samples, 0.2 for two distant ones) and an N \times N similarity matrix in the lower-dimensional space (e.g. 0.8 and 0.3). The embedding is then optimized so that the low-dimensional similarities match the input-space similarities.

[figure adapted from L J Frasinski]
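A minimal t-SNE sketch with scikit-learn (perplexity = 30 is the library default, roughly the number of effective neighbors):

    from sklearn.datasets import load_breast_cancer
    from sklearn.manifold import TSNE

    X, y = load_breast_cancer(return_X_y=True)
    X_embedded = TSNE(n_components=2, perplexity=30,
                      init="pca", random_state=0).fit_transform(X)
    print(X_embedded.shape)                # (569, 2), ready for a scatter plot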

Example of a t-SNE application: accelerating the annotation of a Transcranial Doppler ultrasound micro-embolic dataset (Vindas et al. 2021, IEEE IUS 2021, submitted).

Clustering

Find groups of similar examples (clusters). But what is a cluster? Several definitions coexist:

  • distance-based definition (samples closer than some distance d belong together)
  • density-based definition (a high-density region \rho_a surrounded by a lower-density region \rho_b, with \rho_a > \rho_b)
  • distribution-based definition
  • path-based definition (graphs)

K-means (distance-based method)

  1. initialize k samples as centers (*several initialization strategies exist)
  2. for each sample, associate the label of its closest center
  3. update the centers (mean position of each group)
  4. repeat steps 2. and 3. until the clusters no longer change
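A from-scratch sketch of the four steps above (k = 3 and the toy data are illustrative; no empty-cluster handling):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))          # toy data
    k = 3

    centers = X[rng.choice(len(X), size=k, replace=False)]      # step 1
    while True:
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                           # step 2
        new_centers = np.array([X[labels == j].mean(axis=0)     # step 3
                                for j in range(k)])
        if np.allclose(new_centers, centers):                   # step 4: stable -> stop
            break
        centers = new_centers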

K-means (distance-based method)

+ fast (O(n) in the number of samples)

- need to know / find k (the number of clusters)
- can only detect "circular" (convex) clusters

Alternative: k-medians (more computation, because a sort is needed to find each median).

Hierarchical clustering (distance-based method)

  • agglomerative (bottom-up) or divisive (top-down)
  • uses an appropriate metric d (between samples a and b)
  • and a linkage criterion (dissimilarity between sets A and B)

Example: single-linkage clustering, where the dissimilarity between sets is

\min \{d(a,b):a\in A,b\in B\}
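A minimal single-linkage sketch with SciPy (the toy data and the cut at 2 clusters are illustrative choices):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),    # two toy groups
                   rng.normal(3, 0.3, size=(50, 2))])

    Z = linkage(X, method="single")                     # agglomerative merges, bottom-up
    labels = fcluster(Z, t=2, criterion="maxclust")     # cut the dendrogram at 2 clusters
    print(np.unique(labels))                            # [1 2]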

hierarchical clustering (distance-based method)

+ no need to know the number of clusters beforehand
+ works with any distance metric
+ sub-group discovery

- lower efficiency: O(n^3)

Gaussian Mixture Model with Expectation-Maximization (distribution-based method)

Intuition: k-means with probabilistic assignment (instead of closest-center assignment).

Initialize the k = 2 distributions (*several strategies exist).

Expectation (E) step: find the probability, for each point, of having been generated by each mixture component, and form the expected complete-data log-likelihood

Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X,\theta^{(t)}} [\log p(X, Z \mid \theta)]

Maximization (M) step: fit the mixture to the samples,

\theta^{(t+1)} = \operatorname{argmax}_{\theta} Q(\theta \mid \theta^{(t)})

Repeat E and M steps until the distributions no longer move. Then assign the labels (=> clusters), or keep the multiple (soft) labels...
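A minimal sketch with scikit-learn's GaussianMixture, which runs EM internally (k = 2 and the toy blobs are illustrative):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, size=(200, 2)),     # two toy blobs
                   rng.normal(5, 1, size=(200, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM under the hood
    hard_labels = gmm.predict(X)          # assign the labels -> clusters
    soft_labels = gmm.predict_proba(X)    # or keep the multiple (soft) labels
    X_new, _ = gmm.sample(10)             # generate new samples (probabilistic model)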

Gaussian Mixture Model with Expectation-Maximization (distribution-based method)

+ not restricted to circular clusters... ellipsoids are possible!
+ supports mixed-membership labeling
+ new samples can be generated (probabilistic model)

- the number of Gaussians (the expected number of clusters) must be fixed, as in k-means

DBSCAN (density-based method)

  • all points within a cluster are mutually density-connected
  • if a point is "density-reachable" from some point of the cluster, it is also part of the cluster

Two parameters:
  • \epsilon: neighborhood radius
  • minPts: minimum number of neighbors to be a core point

DBSCAN illustrated with minPts = 2:

  • a point with at least minPts neighbors within the neighborhood radius \epsilon is a core point
  • points that are not core points but fall within \epsilon of a core point are still reachable, and join the cluster
  • the rest is "noise"
  • the result differs with a smaller or a greater \epsilon...

DBSCAN (density-based method)

+ does not assume any predefined shape for the data clusters

- data must be defined by a set of coordinates (not capable of handling arbitrary feature spaces)
- computationally costly...
- not robust to clusters of varying density

=> OPTICS (density-based method) addresses the varying-density limitation
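A minimal DBSCAN sketch with scikit-learn (eps and min_samples play the roles of \epsilon and minPts; the values are illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.2, size=(100, 2)),   # two dense groups
                   rng.normal(3, 0.2, size=(100, 2)),
                   rng.uniform(-2, 5, size=(10, 2))])   # a few sparse outliers

    labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
    print(set(labels))                    # cluster ids; -1 marks the "noise" points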

Clustering performance metrics?

Without ground-truth labels:
  • Silhouette coefficient
  • Calinski-Harabasz index
  • Davies-Bouldin index

With ground-truth labels:
  • Rand index
  • Mutual Information based scores
  • Homogeneity, completeness and V-measure
  • Fowlkes-Mallows scores
  • Contingency matrix
  • Pair confusion matrix

Silhouette coefficient (between -1 and 1), computed for each sample:

\frac{b-a}{\max(a,b)}

with a the mean distance between the sample and all other points in the same cluster, and b the mean distance between the sample and all other points in the next nearest cluster.

The higher the value, the more similar the sample is to its own cluster (and not to neighboring clusters). If most samples have a low or negative value, the clustering configuration is not appropriate.
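A minimal sketch of the silhouette score for a k-means clustering (k = 2 on the Breast Cancer data is an illustrative choice):

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import silhouette_score

    X, _ = load_breast_cancer(return_X_y=True)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(silhouette_score(X, labels))    # mean silhouette over all samples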

Calinski-Harabasz index:

s(k) = \frac{\mathrm{Tr}(B_k)}{\mathrm{Tr}(W_k)} \times \frac{N-k}{k-1}

with B_k the inter-cluster dispersion matrix and W_k the intra-cluster dispersion matrix.

The higher the Calinski-Harabasz index s(k), the more dense and well separated the k clusters are.
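The same clustering scored with the Calinski-Harabasz index (same illustrative setup as the silhouette sketch):

    from sklearn.cluster import KMeans
    from sklearn.datasets import load_breast_cancer
    from sklearn.metrics import calinski_harabasz_score

    X, _ = load_breast_cancer(return_X_y=True)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(calinski_harabasz_score(X, labels))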

Summary: unsupervised learning

  • Dimension reduction
  • Clustering

Unsupervised Learning (part II of Introduction to machine learning, Deep Learning for Medical Imaging School 2021)

By emmanuelrouxfr