Gentle Introduction to Machine Learning

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Workshop prepared for International Society for Data Science and Analytics (ISDSA) Annual Meeting, Notre Dame University, June 2nd, 2022.

Unsupervised Learning

Unsupervised Learning

  • No response variable \(Y\)

  • Not interested in prediction

  • The goal is to discover interesting patterns about the measurements on \(X_{1}\), \(X_{2}\), . . . , \(X_{p}\)

  • Any subgroups among the variables or among the observations?

    • Principal components analysis

    • Clustering 

Supervised vs. Unsupervised Learning

  • Supervised Learning: both X and Y are known

  • Unsupervised Learning: only X

Challenge

  • Unsupervised learning is more subjective than supervised learning, as there is no simple goal for the analysis, such as prediction of a response.

  • But unsupervised learning is growing in popularity in fields such as:

    • Medicine: grouping cancer patients by gene expression measurements;

    • Marketing: grouping shoppers by browsing and purchase histories;

    • Entertainment: movies grouped by movie viewer ratings.

Principal Components Analysis

  • Principal Components Analysis (PCA) produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated.

  • Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.

 

Principal Components Analysis

  • The first principal component of a set of features
    \(X_1, X_2, \ldots, X_p\) is the normalized linear combination of the features

    \(Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \ldots + \phi_{p1}X_p\)

    that has the largest variance. By normalized, we mean that
    \(\sum_{j=1}^p\phi_{j1}^2 = 1\).

  • The elements \(\phi_{11}, \ldots, \phi_{p1}\) are the loadings of the first principal component; together, the loadings make up
    the principal component loading vector, \(\phi_1 = (\phi_{11}\ \phi_{21}\ \ldots\ \phi_{p1})^T\).

  • We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.
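A minimal R sketch of how the first principal component might be computed in practice, assuming X is a hypothetical \(n \times p\) numeric data matrix:

  pr   <- prcomp(X, center = TRUE, scale. = TRUE)  # scale. = TRUE if the units differ
  phi1 <- pr$rotation[, 1]   # loadings phi_11, ..., phi_p1 of the first component
  z1   <- pr$x[, 1]          # scores Z_1 for the n observations
  sum(phi1^2)                # the normalization constraint: equals 1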

Principal Components Analysis

  • The population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component direction, and the blue dashed line indicates the second principal component direction.

Computation of Principal Components

  • Suppose we have an \(n \times p\) data set \(X\). Since we are only interested in variance, we assume that each of the variables in \(X\) has been centered to have mean zero (that is, the column means of \(X\) are zero).

  • We then look for the linear combination of the sample feature values of the form

    \(z_{i1} = \phi_{11}x_{i1} + \phi_{21}x_{i2} + \ldots + \phi_{p1}x_{ip}\)   (1)

  • for \(i = 1, \ldots, n\) that has the largest sample variance, subject to
    the constraint that \(\sum^p_{j=1} \phi^2_{j1} = 1\).

  • Since each of the \(x_{ij}\) has mean zero, then so does \(z_{i1}\) (for any values of \(\phi_{j1}\)). Hence the sample variance of the \(z_{i1}\) can be written as \(\frac{1}{n}\sum^n_{i=1}z_{i1}^2\).

Computation of Principal Components

  • Plugging in (1), the first principal component loading vector solves the optimization problem

    \(\underset{\phi_{11},\ldots,\phi_{p1}}{\text{maximize}} \;\; \frac{1}{n}\sum_{i=1}^n \Big(\sum_{j=1}^p \phi_{j1}x_{ij}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \phi_{j1}^2 = 1.\)

  • This problem can be solved via a singular-value decomposition of the matrix \(X\), a standard technique in linear algebra.

  • We refer to \(Z_1\) as the first principal component, with realized values \(z_{11}, . . . , z_{n1}\)
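A sketch of the SVD route, assuming X is the same hypothetical data matrix; the first right singular vector of the centered matrix is \(\phi_1\) (up to sign) and reproduces the prcomp() loadings:

  Xc   <- scale(X, center = TRUE, scale = FALSE)  # center the columns to mean zero
  sv   <- svd(Xc)                                 # singular-value decomposition
  phi1 <- sv$v[, 1]            # first right singular vector = first loading vector
  z1   <- Xc %*% phi1          # realized scores z_11, ..., z_n1
  var1 <- sv$d[1]^2 / nrow(Xc) # variance of the first component (1/n scaling)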

Geometry of PCA

  • The loading vector \(\phi_1\) with elements \(\phi_{11}, \phi_{21},...,\phi_{p1}\) defines a direction in feature space along which the data vary the most.

  • If we project the \(n\) data points \(x_1,...,x_n\) onto this direction, the projected values are the principal component scores \(z_{11},...,z_{n1}\) themselves.

Further principal components

  • The second principal component is the linear combination of \(X_1, . . . , X_p\) that has maximal variance among all linear combinations that are uncorrelated with \(Z_1\).

  • The second principal component scores \(z_{12}, z_{22},..., z_{n2}\) take the form:

    \(z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \ldots + \phi_{p2}x_{ip}\),

    where \(\phi_2\) is the second principal component loading vector, with elements \(\phi_{12}, \phi_{22}, . . . , \phi_{p2}\).

Further principal components

  • It turns out that constraining \(Z_2\) to be uncorrelated with \(Z_1\) is equivalent to constraining the direction \(\phi_2\) to be orthogonal (perpendicular) to the direction \(\phi_1\). And so on.

  • The principal component directions \(\phi_1\),\(\phi_2\), \(\phi_3\), . . . are the ordered sequence of right singular vectors of the matrix \(X\), and the variances of the components are \(\frac{1}{n}\) times the squares of the singular values. There are at most
    \(min(n − 1, p)\) principal components.
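As a quick check, reusing the hypothetical prcomp() fit pr from the earlier sketch, the loading vectors are orthonormal and the component scores are mutually uncorrelated:

  round(crossprod(pr$rotation), 10)  # t(Phi) %*% Phi is (approximately) the identity
  round(cor(pr$x), 10)               # the scores Z_1, Z_2, ... are uncorrelated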

Illustration: US Arrests

  • The first two principal components for the USArrests data.

  • The black state names represent the scores for the first two principal components.

  • The red arrows indicate the first two principal component loading vectors (with axes on the top and right). For example, the loading for Rape on the first component is 0.54, and its loading on the second principal component 0.17 [the word Rape is centered at the point (0.54, 0.17)]. (Note scale difference in plot)

                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
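A sketch of how a loading table and biplot of this kind are typically produced in R (USArrests ships with base R; the signs of the loadings are arbitrary and may be flipped relative to the figure):

  pr_us <- prcomp(USArrests, scale. = TRUE)  # standardize the four variables first
  pr_us$rotation                             # loading vectors, as in the table above
  biplot(pr_us, scale = 0)                   # scores plus loading arrows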

PCA finds the hyperplane closest to the observations

  • The first principal component loading vector has a very special property: it defines the line in \(p\)-dimensional space that is closest to the \(n\) observations (using average squared Euclidean distance as a measure of closeness)

  • The notion of principal components as the dimensions that are closest to the \(n\) observations extends beyond just the first principal component.

  • For instance, the first two principal components of a data set span the plane that is closest to the \(n\) observations, in terms of average squared Euclidean distance.

Scaling of the variables matters

  • If the variables are in different units, scaling each to have standard deviation equal to one is recommended.

  • If they are in the same units, you might or might not scale the variables.
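A small illustration of why scaling matters for variables measured in different units, again using USArrests (a sketch; the unscaled first component is dominated by the variable with the largest variance):

  apply(USArrests, 2, var)                         # the variances differ enormously
  prcomp(USArrests, scale. = FALSE)$rotation[, 1]  # dominated by Assault
  prcomp(USArrests, scale. = TRUE)$rotation[, 1]   # each variable gets equal footing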

Elbow Method

  • The elbow method looks at the percentage of variance explained as a function of the number of clusters.

  • Choose a number of clusters so that adding another cluster doesn’t give much better modeling of the data (see the sketch below).
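A minimal sketch of the elbow plot for K-means, assuming x is a hypothetical numeric data matrix (the total within-cluster sum of squares is the complement of the variance explained):

  wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
  plot(1:10, wss, type = "b",
       xlab = "Number of clusters K",
       ylab = "Total within-cluster sum of squares")
  # look for the "elbow": the K beyond which the curve flattens out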

Clustering

  • Clustering refers to a set of techniques for finding subgroups, or clusters, in a data set.

  • A good clustering is one in which the observations within a group are similar to each other, but observations in different groups are quite different.

  • For example, suppose we collect \(p\) measurements on each of \(n\) breast cancer patients. There may be different, unknown types of cancer that we could discover by clustering the data.

 

Different Clustering Methods

  • There are many different types of clustering methods

  • Two most commonly used approaches:

    • K-Means Clustering

    • Hierarchical Clustering

 

K-Means Clustering

  • To perform K-means clustering, one must first specify the desired number of clusters K

  • Then the K-means algorithm will assign each observation to exactly one of the K clusters

How does K-Means work?

We would like to partition the data set into K clusters, \(C_1, \ldots, C_K\):

  • Each observation belongs to at least one of the K clusters.
  • The clusters are non-overlapping, i.e. no observation belongs to more than one cluster.

  • The objective is to have a minimal “within-cluster variation”, i.e. the elements within a cluster should be as similar as possible.

  • One way of achieving this is to minimize the sum of all the pair-wise squared Euclidean distances between the observations in each cluster.

K-Means Algorithm

  • Step 1: Randomly assign each observation to one of K clusters
  • Step 2: Iterate until the cluster assignments stop changing:
    • For each of the K clusters, compute the cluster centroid. The \(k^{th}\) cluster centroid is the mean of the observations assigned to the \(k^{th}\) cluster.
    • Assign each observation to the cluster whose centroid is closest (where “closest” is defined using Euclidean distance).
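A from-scratch sketch of these two steps in R, for illustration only (assumes x is a numeric matrix with observations in rows and does not handle clusters that happen to become empty):

  simple_kmeans <- function(x, K, max_iter = 100) {
    n <- nrow(x)
    # Step 1: randomly assign each observation to one of K clusters
    cl <- sample(1:K, n, replace = TRUE)
    for (iter in 1:max_iter) {
      # Step 2a: compute each cluster centroid (the mean of its observations)
      centroids <- sapply(1:K, function(k) colMeans(x[cl == k, , drop = FALSE]))
      # Step 2b: reassign each observation to the closest centroid (Euclidean distance)
      d2 <- sapply(1:K, function(k) colSums((t(x) - centroids[, k])^2))
      cl_new <- max.col(-d2)        # column index of the smallest distance
      if (all(cl_new == cl)) break  # stop when assignments no longer change
      cl <- cl_new
    }
    list(cluster = cl, centers = t(centroids))
  }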

 

K-Means Algorithm

  • Step 1: randomly assign each point to one of the k groups.

  • Iteration 1, Step 2a: compute the cluster centroids.

  • Iteration 1, Step 2b: assign each point to the closest cluster centroid.

  • Iteration 2, Step 2a: compute the new cluster centroids.

  • Stop at the k centroids for which the cluster assignments no longer change.

K-Means Clustering: Iris data
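A minimal sketch of a K-means fit on the four iris measurements, using the base R kmeans() function with K = 3:

  set.seed(1)
  km_iris <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
  table(km_iris$cluster, iris$Species)   # compare the clusters to the known species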

Different starting values

  • K-means clustering performed six times on the data with K = 3, each time with a different random assignment of the observations.
  • Three different local optima were obtained, one of which resulted in a smaller value of the objective and thus provides better separation between the clusters.
  • The runs labeled in red all achieved the same best solution, with an objective value of 235.8.
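In R, multiple random starts are requested with the nstart argument of kmeans(); a sketch, assuming x is a hypothetical data matrix:

  set.seed(4)
  km_single <- kmeans(x, centers = 3, nstart = 1)    # a single random start
  km_multi  <- kmeans(x, centers = 3, nstart = 20)   # best of 20 random starts
  km_single$tot.withinss   # objective value; may be a poor local optimum
  km_multi$tot.withinss    # the best of the 20 runs is reported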

Hierarchical Clustering

  • K-Means clustering requires choosing the number of clusters.

  • An alternative is Hierarchical Clustering which does not require that we commit to a particular choice of \(K\).

  • Hierarchical Clustering has the advantage that it produces a tree-based representation of the observations: a dendrogram.

  • A dendrogram is built starting from the leaves and combining clusters up to the trunk.

 

Dendrograms

  • First join closest points (5 and 7)

  • Height of fusing/merging  (on vertical axis) indicates how similar the points are

  • After the points are fused they are treated as a single observation and the algorithm continues

Interpretation

  • Each “leaf” of the dendrogram represents one of the 45 observations

  • At the bottom of the dendrogram, each observation is a distinct leaf. However, as we move up the tree, some leaves begin to fuse. These correspond to observations that are similar to each other.

  • As we move higher up the tree, an increasing number of observations have fused. The earlier (lower in the tree) two observations fuse, the more similar they are to each other.

  • Observations that fuse later are quite different

     

Choosing Clusters

  • To choose clusters we draw lines across the dendrogram

  • We can form any number of clusters depending on where we draw the break point.

One cluster

Two clusters

Three clusters

Agglomerative Approach

To build the dendrogram:

  • Start with each point as a separate cluster (\(n\) clusters)

  • Calculate a measure of dissimilarity between all points/clusters

  • Fuse two clusters that are most similar so that there are now \(n-1\) clusters

  • Fuse next two most similar clusters so there are now \(n-2\) clusters

  • Continue until there is only 1 cluster
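A minimal sketch of agglomerative clustering and dendrogram cutting in R, assuming x is a hypothetical numeric data matrix:

  d  <- dist(x)                          # pairwise Euclidean dissimilarities
  hc <- hclust(d, method = "complete")   # agglomerative (bottom-up) clustering
  plot(hc)                               # the dendrogram
  cutree(hc, k = 3)                      # cut the tree to obtain three clusters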

Illustration

  • Start with 9 clusters
  • Fuse 5 and 7
  • Fuse 6 and 1
  • Fuse the (5,7) cluster with 8.
  • Continue until all observations are fused.

Dissimilarity

  • How do we define the dissimilarity, or linkage, between the fused (5,7) cluster and 8?
  • There are four options:
    • Complete Linkage
    • Single Linkage
    • Average Linkage
    • Centroid Linkage

 Linkage Methods: Distance Between Clusters

  • Complete Linkage:
    • Largest distance between observations
  • Single Linkage:
    • Smallest distance between observations
  • Average Linkage:
    • Average distance between observations
  • Centroid:
    • Distance between centroids of the observations

 Linkage Methods: Distance Between Clusters

Complete

Maximal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single

Minimal inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities.

Average

Mean inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid

Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.

 Linkage Methods: Distance Between Clusters

  • These dendrograms show three clustering results for the same data
  • The only difference is the linkage method but the results are very different
  • Complete and average linkage tend to yield evenly sized clusters whereas single linkage tends to yield extended clusters to which single leaves are fused one by one.
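A sketch of how such a comparison is produced, refitting hclust() with each linkage on the same dissimilarities (x again a hypothetical data matrix):

  d <- dist(x)
  par(mfrow = c(1, 3))
  plot(hclust(d, method = "complete"), main = "Complete Linkage")
  plot(hclust(d, method = "average"),  main = "Average Linkage")
  plot(hclust(d, method = "single"),   main = "Single Linkage")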

Choice of Dissimilarity Measure

  • We have considered using Euclidean distance as the dissimilarity measure

  • An alternative is correlation-based distance which considers
    two observations to be similar if their features are highly
    correlated.

  • This is an unusual use of correlation, which is normally
    computed between variables; here it is computed between the observation profiles for each pair of observations.


In this example, we have 3 observations and p = 20 variables.
In terms of Euclidean distance, obs. 1 and 3 are similar. However, obs. 1 and 2 are highly correlated, so they would be considered similar in terms of a correlation-based measure.
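A sketch of correlation-based dissimilarity in R: correlate the observation profiles (the rows of a hypothetical data matrix x) and convert the result to a distance object:

  dd <- as.dist(1 - cor(t(x)))           # small distance = highly correlated profiles
  plot(hclust(dd, method = "complete"))  # cluster on the correlation-based distance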

Illustration: Online shopping

  • Suppose we record the number of purchases of each item (columns) for each customer (rows)

  • Using Euclidean distance, customers who have purchased very little will be clustered together.

  • Using a correlation-based measure, customers who tend to purchase the same types of products will be clustered together, even if the magnitudes of their purchases are quite different.

Illustration: Online shopping

  • Consider an online shop that sells two items: socks and computers

    • Left: In terms of quantity, socks have higher weight

    • Center: After standardizing, socks and computers have equal weight

    • Right: In terms of dollar sales, computers have higher weight

       


Key decisions in Clustering

  • Should the features first be standardized? I.e., should the variables be centered to have mean zero and scaled to have standard deviation one?

  • For K-means clustering,

    • How many clusters should be set?

  • For Hierarchical Clustering,

    • What dissimilarity measure should be used?

    • What type of linkage should be used?

    • Where should we cut the dendrogram in order to obtain clusters?

  • In real world research, try several different choices, and decide on one that is interpretable!

Conclusions

  • Unsupervised learning is important for understanding the variation and grouping structure of a set of unlabeled data, and can be a useful pre-processor for supervised learning

  • It is intrinsically more difficult than supervised learning because there is no gold standard (like an outcome variable) and no single objective (like test set accuracy).

  • It is an active field of research, with many recently developed tools such as self-organizing maps, independent components analysis and spectral clustering.

Deep Learning

Deep learning is a neural-network technique that organizes neurons into many more layers than the classical multilayer perceptron. It is in this sense that the networks are “deep,” and this is what gave the paradigm its name.

  • The top layers of these networks are trained by a supervised learning technique such as the back-propagation of error. By contrast, the task of the lower layers is to create a reasonable set of features. It is conceivable that self-organizing feature maps could be used to this end; in practice, more advanced techniques are usually preferred.
  • If the number of features is manageable, and if the size of the training set is limited, then classical approaches will do just as well as deep learning, or even better.