principles of Urban Science 8

dr.federica bianco | fbb.space |    fedhere |    fedhere 

Machine Learning &

unsupervised learning

 

this slide deck:

 
  • Machine Learning basic concepts
    • interpretability
    • parameters vs hyperparameters
    • supervised/unsupervised

 

  • Clustering : how it works
  • Partitioning
    • Hard clustering
    • Soft Clustering

 

  • Hirarchical
    • agglomerative
    • devisive

 

  • also: 
    • Density based
    • Grid based
    • Model based

recap

what is machine learning

 

a model is a low dimensional representation of a higher dimensionality datase

the best way to think about it    in the ML context:

 

what is a model?

what is machine learning?

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959

 

what is machine learning?

model

parameters: slope, intercept

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959

 

what is machine learning?

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959

 

model

parameters: slope, intercept

data

 

what is machine learning?

ML: any model with parameters learnt from the data

Machine Learning models are parametrized representation of "reality"  where the parameters are learned from finite sets of realizations of that reality

(note: learning by instance, e.g. nearest neighbours, may not comply to this definition)

Machine Learning is the disciplines that conceptualizes, studies, and applies those models.

Key Concept

what is  machine learning?

 

used to:

  • understand structure of feature space
  • classify based on examples,
  • regression (classification with infinitely small classes)
  • understand which features are important in prediction (to get close to causality)

General ML points

unsupervised vs supervised learning

Clustering

partitioning the feature space so that the existing data is grouped (according to some target function!)

Clustering

partitioning the feature space so that the existing data is grouped (according to some target function!)

Unsupervised learning

  • understanding structure  
  • anomaly detection
  • dimensionality reduction

All features are observed for all datapoints

unsupervised vs supervised learning

Clustering

partitioning the feature space so that the existing data is grouped (according to some target function!)

Classifying & regression

 

finding functions of the variables that allow to predict unobserved properties of new observations

Unsupervised learning

  • understanding structure  
  • anomaly detection
  • dimensionality reduction

All features are observed for all datapoints

unsupervised vs supervised learning

Clustering

partitioning the feature space so that the existing data is grouped (according to some target function!)

Classifying & regression

 

finding functions of the variables that allow to predict unobserved properties of new observations

Unsupervised learning

  • understanding structure  
  • anomaly detection
  • dimensionality reduction

Supervised learning

  • classification
  • prediction
  • feature selection

All features are observed for all datapoints

Some features are not observed for some data points we want to predict them.

unsupervised vs supervised learning

Unsupervised learning

Supervised learning

All features are observed for all datapoints

and we are looking for structure in the feature space

Some features are not observed for some data points we want to predict them.

The datapoints for which the target feature is observed are said to be "labeled"

Semi-supervised learning

Active learning

A small amount of labeled data is available. Data is cluster and clusters inherit labels

The code can interact with the user to update labels.

also...

unsupervised vs supervised learning

what is machine learning?

extract features and create models that allow prediction where the correct answer is known for a subset of the data

supervised learning

identify features and create models that allow to understand structure in the data

unsupervised learning

  • k-Nearest Neighbors

  • Regression

  • Support Vector Machines

  • Neural networks

  • Classification/Regression Trees

  • clustering

  • Principle Component Analsysis

  • Apriori (association rule)

train, test, and validate

2

MLtsa:

validating a model

How do we measure if a model is good?

Accuracy

Precision

Recall

ROC

AOC

 

We will talk more about this later...

but for now focus on

regression performance metrics

MLtsa:

validating a model

How do we measure if a model is good?

Accuracy

Precision

Recall

ROC

AOC

 

Absolute error

 

 

Squared error

 

 

Mean squared error

 

 

Root mean

squared error

 

Relative mean

squared error

 

R squared

SE = \sum_i\epsilon_i^2
MSE = \frac{1}{N}SE
RMSE=\sqrt{MSE}
rMSE = \frac{MSE}{\sigma^2}
R^2 = 1 - rMSE
AE = \sum_i|\epsilon_i|
\epsilon_i=y_𝑖 - f(t_i)

We will talk more about this later...

but for now focus on

regression performance metrics

MLtsa:

validating a model

How do we measure if a model is good?

Accuracy

Precision

Recall

ROC

AOC

 

Absolute error

 

 

Squared error

 

 

Mean squared error

 

 

Root mean

squared error

 

Relative mean

squared error

 

R squared

SE = \sum_i\epsilon_i^2
MSE = \frac{1}{N}SE
RMSE=\sqrt{MSE}
rMSE = \frac{MSE}{\sigma^2}
R^2 = 1 - rMSE
AE = \sum_i|\epsilon_i|

do you recognize these??

\epsilon_i=y_𝑖 - f(t_i)

We will talk more about this later...

but for now focus on

regression performance metrics

MLtsa:

validating a model

How do we measure if a model is good?

Accuracy

Precision

Recall

ROC

AOC

 

We will talk more about this later...

but for now focus on

regression performance metrics

\epsilon_i=y_𝑖 - f(t_i)

Split the sample in test and training sets 

Train on the training set

Test (measure accuracy) on the test set

R^2 = 1 - rMSE

MLtsa:

validating a model

from sklearn.model_selection import train_test_split

def line(x, intercept, slope):
    return slope * x + intercept

def chi2(args, x, y, s):
    a, b = args
    return sum((y - line(x, a, b))**2 / s)

x_train, x_test, y_train, y_test, s_train, s_test = train_test_split(
     x, y, s, test_size=0.25, random_state=42)

initialGuess = (10, 1)

chi2Solution_goodsplit = minimize(chi2, initialGuess, 
	args=(x_train, y_train, s_train))

print("best fit parameters from the minimization of the chi squared: " + 
       "slope {:.2f}, intercept {:.2f}".format(*chi2Solution_goodsplit.x))

print("R square on training set: ", Rsquare(chi2Solution_goodsplit.x, x_train, y_train))
print("R square on test set: ", Rsquare(chi2Solution_goodsplit.x, x_test, y_test))

ML standard

In ML models need to be "validated":

  1. split the data into a training and a test set (typical split 70/30). 
  2. learn the model parameters by "training" the model on the training set
  3. "test" the model on the test set: measure the accuracy of the prediction (e.g. as the distance between the prediction and the test data).

The performance on the model is the performance achieved on the test set.  

An upgrade on this workflow is to create a training, a test, and a validation test. Iterate between training and test to achieve optimal performance, then measure accuracy on the validation set.This is because you can use the test set performance to tune the model hyperparameters (model selection) but then you would report a performance that is tuned on the test set.

a significance performance degradation on the test compared to training set indicates that the model is "overtrained" and does not generalize well.

supervised and unsupervised learning

1

what is machine learning?

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.       Arthur Samuel, 1959

 

model

parameters:

 

 

slope, intercept

data

 

ML: any model with parameters learnt from the data

what is machine learning?

classification

prediction

feature selection

supervised learning

understanding structure

organizing/compressing data

anomaly detection dimensionality reduction

unsupervised learning

what is machine learning?

classification

prediction

feature selection

supervised learning

understanding structure

organizing/compressing data

anomaly detection dimensionality reduction

unsupervised learning

clustering

PCA

Apriori

k-Nearest Neighbors

Regression

Support Vector Machines

Classification/Regression Trees

Neural networks

 

general          ML            points

used to:

understand structure of feature space

classify based on examples,

regression (classification with infinitely small classes)

general          ML            points

should be interpretable:

 

why?

predictive policing,

selection of conference participants.

ethical implications

general          ML            points

should be interpretable:

 

prective policing,

selection of conference participants.

why the model made a choice?

which feature mattered?

why?

ethical implications

causal connection

general          ML            points

should be interpretable:

 

ethical implications

general          ML            points

ML model have parameters and hyperparameters

parameters:  the model optimizes based on the data

hyperparameters:  chosen by the model author, could be based on domain knowledge, other data, guessed (?!).

e.g. the shape of the polynomial

 

1

clustering

 

observed features:

(x, y)

GOAL: partitioning data in  maximally homogeneous,

maximally distinguished  subsets.

clustering is an unsupervised learning method

x

y

all features are observed for all objects in the sample

(x, y)

how should I group the observations in this feature space?

e.g.: how many groups should I make?

clustering is an unsupervised learning method

x

y

optimal 

clustering

 

internal criterion:

members of the cluster should be similar to each other (intra cluster compactness)

 

external criterion:

objects outside the cluster should be dissimilar from the objects inside the cluster

tigers

wales

raptors

zoologist's clusters

orange/green

black/white/blue

internal criterion:

members of the cluster should be similar to each other (intra cluster compactness)

 

external criterion:

objects outside the cluster should be dissimilar from the objects inside the cluster

photographer's clusters

the optimal clustering depends on:

  • how you define similarity/distance

  • the purpose of the clustering

internal criterion:

members of the cluster should be similar to each other (intra cluster compactness)

 

external criterion:

objects outside the cluster should be dissimilar from the objects inside the cluster

the ideal clustering algorithm should have:

 

  • Scalability (naive algorithms are Np hard)

  • Ability to deal with different types of attributes

  • Discovery of clusters with arbitrary shapes

  • Minimal requirement for domain knowledge

  • Deals with noise and outliers

  • Insensitive to order

  • Allows incorporation of constraints

  • Interpretable

A Spatial Clustering Technique for the Identification of Customizable Ecoregions

William W. Hargrove and Robert J. Luxmoore

thinking about distances spatially helps intuition

but exercise in abstracting these distances to non-spatial features

A Spatial Clustering Technique for the Identification of Customizable Ecoregions

William W. Hargrove and Robert J. Luxmoore

50-year mean monthly temperature, 50-year mean monthly precipitation, elevation, total plant-available water content of soil, total organic matter in soil, and total Kjeldahl soil nitrogen

2

distance metrics

distance metrics

continuous variables

Minkowski family of distances

D(i,j)~=~^{p}\sqrt{|x_{i1}-x_{j1}|^p~+~|x_{i2}-x_{j2}|^p~+~...~+~|x_{iN}-x_{jN}|^p}

distance metrics

continuous variables

Minkowski family of distances

D(i,j)~=~^{p}\sqrt{\sum_{k=1}^{N}|x_{ik}-x_{jk}|^p}

N features (dimensions)

distance metrics

continuous variables

Minkowski family of distances

D(i,j)~=~^{p}\sqrt{\sum_{k=1}^{N}|x_{ik}-x_{jk}|^p}

N features (dimensions)

D(i,j) > 0\\ D(i,i) = 0\\ D(i,j)~=~D(j,i)\\ D(i,j)~<=~D(i,k)~+~D(k,j)

properties:

distance metrics

continuous variables

Minkowski family of distances

D(i,j)~=~^{p}\sqrt{\sum_{k=1}^{N}|x_{ik}-x_{jk}|^p}

Manhattan: p=1

D_{Man}(i,j)~=~\sum_{k=1}^{N}|x_{ik}-x_{jk}|

features: x, y

distance metrics

continuous variables

Minkowski family of distances

D(i,j)~=~^{p}\sqrt{\sum_{k=1}^{N}|x_{ik}-x_{jk}|^p}

Manhattan: p=1

D_{Man}(i,j)~=~\sum_{k=1}^{N}|x_{ik}-x_{jk}|

features: x, y

distance metrics

continuous variables

Minkowski family of distances

D(i,j)~=~^{p}\sqrt{\sum_{k=1}^{N}|x_{ik}-x_{jk}|^p}

Euclidean: p=2

D_{Euc}(i,j)~=~\sqrt{\sum_{k=1}^{N}|x_{ik}-x_{jk}|^2}

features: x, y

distance metrics

Minkowski family of distances

D(i,j)~=~^{p}\sqrt{\sum_{k=1}^{N}|x_{ik}-x_{jk}|^p}

Great Circle distance

D(i,j)~=~R~\arccos{\left( \sin{\phi}_i \cdot \sin{\phi_j} ~+~ \cos{\phi_i} \cdot \cos{\phi_j} \cdot \cos{\Delta\lambda} \right)}
\phi_i,\lambda_i,\phi_j,\lambda_j

features

latitude and longitude

continuous variables

distance metrics

Uses presence/absence of features in data

: number of features in neither

M_{i=0,j=0}

: number of features in both

M_{i=1,j=1}

: number of features in i but not j

M_{i=1,j=0}

: number of features in j but not i

M_{i=0,j=1}

categorical variables:

binary

1 0 sum
1 M11 M10 M11+M10
0 M01 M00 M01+M00
sum M11+M01 M10+M00 M11+M00+M01+ M10

observation i

observation j

}

}

distance metrics

Simple Matching Distance 

SMD(i,j)~=~1~- SMC(i,j)

Uses presence/absence of features in data

: number of features in neither

M_{i=0,j=0}

: number of features in both

M_{i=1,j=1}

: number of features in i but not j

M_{i=1,j=0}

: number of features in j but not i

M_{i=0,j=1}

Simple Matching Coefficient

or Rand similarity

SMC(i,j)~=~\frac{M_{i=0,j=0} + M_{i=1,j=1}}{M_{i=0,j=0} + M_{i=1,j=0} + M_{i=0,j=1} + M_{i=1,j=1}}
1 0 sum
1 M11 M10 M11+M10
0 M01 M00 M01+M00
sum M11+M01 M10+M00 M11+M00+M01+ M10

observation i

observation j

}

}

categorical variables:

binary

distance metrics

Simple Matching Distance 

SMD(i,j)~=~1~- SMC(i,j)

Uses presence/absence of features in data

: number of features in neither

M_{i=0,j=0}

: number of features in both

M_{i=1,j=1}

: number of features in i but not j

M_{i=1,j=0}

: number of features in j but not i

M_{i=0,j=1}

Simple Matching Coefficient

or Rand similarity

SMC(i,j)~=~\frac{M_{i=0,j=0} + M_{i=1,j=1}}{M_{i=0,j=0} + M_{i=1,j=0} + M_{i=0,j=1} + M_{i=1,j=1}}
1 0 sum
1 M11 M10 M11+M10
0 M01 M00 M01+M00
sum M11+M01 M10+M00 M11+M00+M01+ M10

observation i

observation j

}

}

categorical variables:

binary

distance metrics

Simple Matching Distance 

SMD(i,j)~=~1~- SMC(i,j)

Uses presence/absence of features in data

: number of features in neither

M_{i=0,j=0}

: number of features in both

M_{i=1,j=1}

: number of features in i but not j

M_{i=1,j=0}

: number of features in j but not i

M_{i=0,j=1}

Simple Matching Coefficient

or Rand similarity

SMC(i,j)~=~\frac{M_{i=0,j=0} + M_{i=1,j=1}}{M_{i=0,j=0} + M_{i=1,j=0} + M_{i=0,j=1} + M_{i=1,j=1}}
1 0 sum
1 M11 M10 M11+M10
0 M01 M00 M01+M00
sum M11+M01 M10+M00 M11+M00+M01+ M10

observation i

observation j

}

}

categorical variables:

binary

distance metrics

Jaccard similarity

Jaccard distance

D(i,j)~=~1 - J(i,j)
1 0 sum
1 M11 M10 M11+M10
0 M01 M00 M01+M00
sum M11+M01 M10+M00 M11+M00+M01+ M10

observation i

observation j

}

}

categorical variables:

binary

J(i,j)~=~\frac{M_{i=1,j=1}}{M_{i=0,j=1}+M_{i=1,j=0}+M_{i=1,j=1}}

distance metrics

Jaccard similarity

J(i,j)~=~

Jaccard distance

D(i,j)~=~1 - J(i,j)
1 0 sum
1 M11 M10 M11+M10
0 M01 M00 M01+M00
sum M11+M01 M10+M00 M11+M00+M01+ M10

observation i

observation j

}

}

B
{A\cap B}
A
{A\cup B}
\frac{A\cap B}{A\cup B}

categorical variables:

binary

distance metrics

Jaccard similarity

Application to Deep Learning for image recognition

Convolutional Neural Nets 

J(i,j)~=~
\frac{A\cap B}{A\cup B}

categorical variables:

binary

distance metrics

3

whitening

Data can have covariance (and it almost always does!)

PLUTO Manhattan data (42,000 x 15)

axis 1 -> features

axis 0  -> observations

Data can have covariance (and it almost always does!)

Data can have covariance (and it almost always does!)

Pearson's correlation (linear correlation)

{\displaystyle r_{xy}={\frac {\sum _{i=1}^{n}(x_{i}-{\bar {x}})(y_{i}-{\bar {y}})}{{\sqrt {\sum _{i=1}^{n}(x_{i}-{\bar {x}})^{2}}}{\sqrt {\sum _{i=1}^{n}(y_{i}-{\bar {y}})^{2}}}}}}

correlation = correlation / variance

PLUTO Manhattan data (42,000 x 15) correlation matrix

axis 1 -> features

axis 0  -> observations

Data can have covariance (and it almost always does!)

PLUTO Manhattan data (42,000 x 15) correlation matrix

A covariance matrix is diagonal if the data has no correlation

Data can have covariance (and it almost always does!)

Full On Whitening

find the matrix W that diagonalized Σ

from zca import ZCA
import numpy as np
X = np.random.random((10000, 15)) # data array
trf = ZCA().fit(X)
X_whitened = trf.transform(X)
X_reconstructed =
trf.inverse_transform(X_whitened)
assert(np.allclose(X, X_reconstructed))

: remove covariance by diagonalizing the transforming the data with a matrix that diagonalizes the covariance matrix

this is at best hard, in some cases impossible even numerically on large datasets

Data that is not correlated appear as a sphere in the Ndimensional feature space

Data can have covariance (and it almost always does!)

ORIGINAL DATA

STANDARDIZED DATA

 

Generic preprocessing

Generic preprocessing

for each feature: divide by standard deviation and subtract mean

mean of each feature should be 0, standard deviation of each feature should be 1

4

how clustring works

  • Partitioning
    • Hard clustering
    • Soft Clustering

 

  • Hirarchical
    • agglomerative
    • devisive

 

  • also: 
    • Density based
    • Grid based
    • Model based

K-means (McQueen ’67)

K-medoids (Kaufman & Rausseeuw ’87)

 

Expectation Maximization (Dempster,Laird,Rubin ’77)

5

clustering by

partitioning

5.1

k-means:

Hard partitioning cluster method

K-means: the algorithm

Choose N “centers” guesses: random points in the feature space
repeat:
  Calculate distance between each point and each center

  Assign each point to the closest center

  Calculate the new cluster centers

untill (convergence):
  when clusters no longer change

K-means: 

Objective: minimizing the aggregate distance within the cluster.

 

Order: #clusters #dimensions #iterations #datapoints O(KdN)

 

  CONs:

Its non-deterministic: the result depends on the (random) starting point

It only works where the mean is defined: alternative is K-medoids which represents the cluster by its central member (median), rather than by the mean

Must declare the number of clusters upfront (how would you know it?)

 PROs:

Scales linearly with d and N

 

K-means: the objective function

Objective: minimizing the aggregate distance within the cluster.  

 

Order: #clusters #dimensions #iterations #datapoints O(KdN)

 

O(KdN):

complexity scales linearly with

   -d number of dimensions

   -N number of datapoints

   -K number of clusters

K-means: the objective function

either you know it because of domain knowledge

or

you choose it after the fact: "elbow method"

\sum_{k}\sum_{i \in k} (\vec{x_i} - \vec{\mu_k})^2

total intra-cluster variance

Objective: minimizing the aggregate distance within the cluster.

 

 

Order: #clusters #dimensions #iterations #datapoints O(KdN)

 

 Must declare the number of clusters  

K-means: the objective function

Objective: minimizing the aggregate distance within the cluster.

 

 

Order: #clusters #dimensions #iterations #datapoints O(KdN)

 

 Must declare the number of clusters upfront (how would you know it?)

 

either domain knowledge or

after the fact: "elbow method"

\sum_{k}\sum_{i \in k} (\vec{x_i} - \vec{\mu_k})^2

total intra-cluster variance

K-means: hyperparameters

  • n_clusters : number of clusters
  • init : the initial centers or a scheme to choose the center

    ‘k-means++’ : selects initial cluster centers for k-mean clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

    ‘random’: choose k observations (rows) at random from data for the initial centroids.

    If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

  • n_init : if >1 it is effectively an ensamble method: runs n times with different initializations
  • random_state : for reproducibility

5.2

expectation-maximization

Soft partitioning cluster method

Hard clustering : each object in the sample belongs to only 1 cluster

Soft clustering : to each object in the sample we assign a degree of belief that it belongs to a cluster

Soft = probabilistic

Mixture models

these points come from 2 gaussian distribution.

which point comes from which gaussian?

1

2

3

4-6

7

8

9-12

13

Mixture models

CASE 1:

if i know which point comes from which gaussian

i can solve for the parameters of the gaussian

(e.g. maximizing likelihood)

1

2

3

4-6

7

8

9-12

13

Mixture models

CASE 2:

if i know which the parameters (μ,σ) of the gaussians

i can figure out which gaussian each point is most likely to come from (calculate probability)

1

2

3

4-6

7

8

9-12

13

Mixture models: 

Guess parameters g= (μ,σ) for 2 Gaussian distributions A and B

calculate the probability of each point to belong to A and B

Expectation maximization

Mixture models: 

Guess parameters g= (μ,σ) for 2 Gaussian distributions A and B

calculate the probability of each point to belong to A and B

Expectation maximization

high

P(x_i | \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi \sigma_j^2}}\exp\left(-\frac{x_i-\mu_j}{2\sigma_j^2}\right)

Mixture models: 

Expectation maximization

low

Guess parameters g= (μ,σ) for 2 Gaussian distributions A and B

calculate the probability of each point to belong to A and B

P(x_i | \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi \sigma_j^2}}\exp\left(-\frac{x_i-\mu_j}{2\sigma_j^2}\right)

Mixture models: 

Guess parameters g= (μ,σ) for 2 Gaussian distributions A and B

1- calculate the probability p_ji of each point to belong to gaussian j

Expectation maximization

P(x_i | \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi \sigma_j^2}}\exp\left(-\frac{x_i-\mu_j}{2\sigma_j^2}\right)

Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)

P(g_1 | x_i) = \frac{P(x_i | g_1) | P(g_1)}{P(x_i|g_1)P(g_1) + P(x_i|g_2) P(g_2)}
p_{ij}=

Mixture models: 

Guess parameters g= (μ,σ) for 2 Gaussian distributions A and B

1- calculate the probability p_ji of each point to belong to gaussian j

2a - calculate the weighted mean of the cluster,  weighted by the p_ji

 

Expectation maximization

P(x_i | \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi \sigma_j^2}}\exp\left(-\frac{x_i-\mu_j}{2\sigma_j^2}\right)

Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)

P(g_1 | x_i) = \frac{P(x_i | g_1) | P(g_1)}{P(x_i|g_1)P(g_1) + P(x_i|g_2) P(g_2)}
\mu_i=\frac{\sum_j P(g_i|x_j)x_j}{\sum_j P(g_i|x_j)}
\sigma_j=\frac{\sum_i P(g_j|x_i)(x_i-\mu_j)^2}{\sum_j P(g_j|x_i)}
p_{ij}=

Mixture models: 

Expectation maximization

P(x_i | \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi \sigma_j^2}}\exp\left(-\frac{x_i-\mu_j}{2\sigma_j^2}\right)

Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)

P(g_1 | x_i) = \frac{P(x_i | g_1) | P(g_1)}{P(x_i|g_1)P(g_1) + P(x_i|g_2) P(g_2)}
\mu_i=\frac{\sum_j P(g_i|x_j)x_j}{\sum_j P(g_i|x_j)}
\sigma_j=\frac{\sum_i P(g_j|x_i)(x_i-\mu_j)^2}{\sum_j P(g_j|x_i)}
p_{ij}=

Guess parameters g= (μ,σ) for 2 Gaussian distributions A and B

1- calculate the probability p_ji of each point to belong to gaussian j

2a - calculate the weighted mean of the cluster,  weighted by the p_ji

2b - calculate the weighted sigma of the cluster,  weighted by the p_ji

 

Mixture models: 

Expectation maximization

P(x_i | \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi \sigma_j^2}}\exp\left(-\frac{x_i-\mu_j}{2\sigma_j^2}\right)

Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)

P(g_1 | x_i) = \frac{P(x_i | g_1) | P(g_1)}{P(x_i|g_1)P(g_1) + P(x_i|g_2) P(g_2)}
\mu_i=\frac{\sum_j P(g_i|x_j)x_j}{\sum_j P(g_i|x_j)}
\sigma_j=\frac{\sum_i P(g_j|x_i)(x_i-\mu_j)^2}{\sum_j P(g_j|x_i)}

Alternate expectation and maximization step till convergence

1- calculate the probability p_ji of each point to belong to gaussian j

2a - calculate the weighted mean of the cluster,  weighted by the p_ji

2b - calculate the weighted sigma of the cluster,  weighted by the p_ji

 

expectation step

maximization step

}

p_{ij}=

Last iteration: convergence

Mixture models: 

Expectation maximization

P(x_i | \mu_j, \sigma_j) = \frac{1}{\sqrt{2\pi \sigma_j^2}}\exp\left(-\frac{x_i-\mu_j}{2\sigma_j^2}\right)

Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)

P(g_1 | x_i) = \frac{P(x_i | g_1) | P(g_1)}{P(x_i|g_1)P(g_1) + P(x_i|g_2) P(g_2)}
\mu_i=\frac{\sum_j P(g_i|x_j)x_j}{\sum_j P(g_i|x_j)}
\sigma_j=\frac{\sum_i P(g_j|x_i)(x_i-\mu_j)^2}{\sum_j P(g_j|x_i)}
p_{ij}=

Alternate expectation and maximization step till convergence

1- calculate the probability p_ji of each point to belong to gaussian j

2a - calculate the weighted mean of the cluster,  weighted by the p_ji

2b - calculate the weighted sigma of the cluster,  weighted by the p_ji

 

expectation step

maximization step

}

EM: the algorithm

Choose N “centers” guesses (like in K-means)
repeat
Expectation step: Calculate the probability of each distribution given the points

Maximization step: Calculate the new centers and variances as weighted averages of the datapoints, weighted by the probabilities
untill (convergence)
e.g. when gaussian parameters no longer change

Expectatin Maximization:  

Order: #clusters #dimensions #iterations #datapoints #parameters O(KdNp) (>K-means)

based on Bayes theorem

Its non-deterministic: the result depends on the (random) starting point  (like K-mean)

It only works where a probability distribution for the data points can be defines (or equivalently a likelihood)  (like K-mean)

Must declare the number of clusters and the shape of the pdf upfront (like K-mean)

Convergence Criteria

General

Any time you have an objective function (or loss function) you need to set up a tolerance : if your objective function did not change by more than ε since the last step you have reached convergence (i.e. you are satisfied)

 

ε is your tolerance

For clustering:

convergence can be reached if

no more than n data point changed cluster

n is your tolerance

6

Hierarchical clustering

Hierarchical clustering

removes the issue of
deciding K (number of
clusters)

Hierarchical clustering

it calculates distance between clusters and single points: linkage

6.1

Agglomerative

hierarchical clustering

Hierarchical clustering

agglomerative (bottom up)

Hierarchical clustering

agglomerative (bottom up)

Hierarchical clustering

agglomerative (bottom up)

Hierarchical clustering

agglomerative (bottom up)

distance

Hierarchical clustering

agglomerative (bottom up)

it's deterministic!

Hierarchical clustering

agglomerative (bottom up)

it's deterministic!

computationally intense because every cluster pair distance has to be calculate

Hierarchical clustering

agglomerative (bottom up)

it's deterministic!

computationally intense because every cluster pair distance has to be calculate

it is slow, though it can be optimize:

complexity

 

O(N^2d + N^3)

Agglomerative clustering:

the algorithm

compute the distance matrix
each data point is a singleton cluster
repeat
    merge the 2 cluster with minimum distance
    update the distance matrix
untill   
    only a single (n) cluster(s) remains

Order:             

PROs

It's deterministic

CONs

It's greedy (optimization is done step by step and agglomeration decisions cannot be undone)

It's computationally expensive

Agglomerative clustering:

 

O(N^2d + N^3)

Agglomerative clustering: hyperparameters

  • n_clusters : number of clusters
  • affinity : the distance/similarity definition
  • linkage : the scheme to measure distance to a cluster
  • random_state : for reproducibility

6.2

Divisive hierarchical clustering

Hierarchical clustering

divisive (top down)

Hierarchical clustering

divisive (top down)

Hierarchical clustering

divisive (top down)

 

it is

non-deterministic

(like k-mean)

Hierarchical clustering

divisive (top down)

 

it is

non-deterministic

(like k-mean)

 

it is greedy -
just as k-means
two nearby points
may end up in
separate clusters

 

Hierarchical clustering

divisive (top down)

 

it is

non-deterministic

(like k-mean)

 

it is greedy -
just as k-means
two nearby points
may end up in
separate clusters

 

it is high complexity for

exhaustive search

 

O(2^N)

But can be reduced (~k-means)

O(2Nk)

or 

O(N^2)

Divisive clustering:

the algorithm

Calculate clustering criterion for all subgroups, e.g. min intracluster variance 
repeat
split the best cluster based on criterion above
untill 
each data is in its own singleton cluster

Order:               (w K-means procedure)

It's non-deterministic: the result depends on the (random) starting point  (like K-mean) unless its exaustive (but that is                )

or

It's greedy (optimization is done step by step)

Divisive clustering:

 

O(N^2)
O(2^N)

distance between a point and a cluster:

single link distance

D(c1,c2) = min(D(xc1, xc2))

linkage:

distance between a point and a cluster:

single link distance

D(c1,c2) = min(D(xc1, xc2))

complete link distance

D(c1,c2) =  max(D(xc1, xc2))

linkage:

distance between a point and a cluster:

single link distance

D(c1,c2) = min(D(xc1, xc2))

complete link distance

D(c1,c2) =  max(D(xc1, xc2))

centroid link distance

D(c1,c2) =  mean(D(xc1, xc2))

linkage:

linkage:

distance between a point and a cluster:

single link distance

D(c1,c2) = min(D(xc1, xc2))

complete link distance

D(c1,c2) =  max(D(xc1, xc2))

centroid link distance

D(c1,c2) =  mean(D(xc1, xc2))

Ward distance (global measure)

 

D_{tot} = \sum_j \sum_{i,x \in C_j}(x_i-\mu_j)^2

7

Density Based

 

DBSCAN

DBSCAN

Density-based spatial clustering of applications with noise

 

DBSCAN is one of the most common clustering algorithms and also most cited in scientific literature

DBSCAN

defines cluster membership based on local density: based on Nearest Neighbors algorithm. 

 

DBSCAN:

the algorithm

  • A point p is a core point if at least minPts points are within distance ε (including p).

  • A point q is directly reachable from p if point q is within distance ε from core point p. Reachable from p if there is a path p1, ..., pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi

  • All points not reachable from any other point are outliers or noise points.

DBSCAN:

the algorithm

  • ε : minimum distance to join points

  • min_sample : minimum number of points in a cluster, otherwise they are labeled outliers.

  • metric : the distance metric

  • p : float, optional The power of the Minkowski metric 

DBSCAN:

the algorithm

  • ε : minimum distance to join points

  • min_sample : minimum number of points in a cluster, otherwise they are labeled outliers.

  • metric : the distance metric

  • p : float, optional The power of the Minkowski metric 

its extremely sensitive to these parameters!

DBSCAN:

the algorithm

for each point P count neighbours within minPts: label=C
for each point P ~= C measure distance d to all Cs
        if d<minD: label = DR
for each point P not C and not DR
        if distance d to C or DR > minD: label = outlier
        if distance d to C or DR <= minD: find path to closet C and cluster

 

Order:             

PROs

Deterministic.

Deals with noise and outliers

Can be used with any definition of distance or similarity

PROs

Not entirely deterministic.

Only works in a constant density field

DBSCAN clustering:

 

O(N^2)

key concepts

 

Clustering :  unsupervised learning where all features are observed for all datapoints. The goal is to partition the space into maximally homogeneous maximally distinguished groups

clustering is easy, but interpreting results is tricky

Distance : A definition of distance is required to group observations/ partition the space.

Common distances over continuous variables

  • Minkowski (inlcudes Euclidian = Minkowski(2)
  • Great Circle (for coordinated on a sphere, e.g. earth or sky) 

Common distances over categorical variables:

  • Simple Distance Matrix                                                                 
  • Jaccard Distance  

Whitening

Models assume that the data is not correlated. If your data is correlated the model results may be invalid. And your data always has correlations.

 - whiten the data by using the matrix that diagonalizes the covariance matrix. This is ideal but computationally expensive if possible at all

- scale your data so that each feature is mean=0 stdev=2. 

Solution:

key concepts

 

Partition clustering:

   Hard: K-means O(KdN) , needs to decide the number of clusters, non deterministic

simple efficient implementation but the need to select the number of clusters is a significant flaw

   Soft: Expectation Maximization O(KdNp) , needs to decide the number of clusters, need to decide a likelihood function (parametric), non deterministic

Hierarchical:

    Divisive:   Exhaustive           ;             at least non deterministic

    Agglomerative:                        , deterministic, greedy. Can be run through and explore the best stopping point. Does not require to choose the number of clusters a priori

Density based

    DBSCAN: Density based clustering method that can identify outliers, which means it can be used in the presence of noise. Complexity             . Most common (cited) clustering method in the natural sciences.

O(N^2d + N^3)
O(2^N)
O(N^2)
O(N^2)

key concepts

 

encoding categorical variables:

variables have to be encoded as numbers for computers to understand them. You can encode categorical variables with integers or floating point but you implicitly impart an order. The standard is to one-hot-encode which means creating a binary (True/False) feature (column) for each category of a categorical variables but this increases the feature space and generated covariance.

model diagnostics for classifiers: Fraction of True Positives and False Positives are the metrics to evaluate classifiers. Combinations of those numbers include Accuracy (TP/ (TP+FP)), Precision (TP/(TP+FN)), Recall ((TP+TN)/(TP+TN+FP+FN)).

ROC curve: (TP vs FP) is a holistic metric of a model. It can be used to guide the choice of hyperparameters to find the "sweet spot" for your problem

READING

 

https://towardsdatascience.com/unsupervised-learning-and-data-clustering-eeecb78b422a

resources

 


a comprehensive review of clustering methods

Data Clustering: A Review, Jain, Mutry, Flynn 1999


https://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf


required reading: a blog post on how to generate and interpret a scipy dendrogram by Jörn Hees


https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/

Principles of Urban Science 8

By federica bianco

Principles of Urban Science 8

clustering

  • 374