CS6015: Linear Algebra and Random Processes

Lecture 20: Principal Component Analysis (the wishlist)

Learning Objectives

A quick recap of mean, variance and covariance

What is the covariance matrix?

What is the motivation for PCA?

What is the wishlist for representing data using fewer dimensions?

The Eigenstory

How to compute eigenvalues?
det(A - \lambda I) = 0 (characteristic equation)

What are the possible values?
real, imaginary, distinct, repeating

What are the eigenvalues of some special matrices?
Identity, Projection, Reflection, Markov, Rotation, Singular, Orthogonal, Rank one, Symmetric, Permutation

What is the relation between the eigenvalues of related matrices?
\(A^\top\), \(A^{-1}\), \(AB\), \(A^\top A\), \(A+B\), \(U\), \(R\), \(A^2\), \(A + kI\)

What do eigenvalues reveal about a matrix?
trace, determinant, invertibility, rank, nullspace, columnspace, positive pivots (positive definite matrices)

distinct values \(\implies\) independent eigenvectors

What are some applications in which eigenvalues play an important role?
powers of A, steady state (Markov matrices), PCA (where are we?), optimisation, diagonalisation

Other labels on the map: (basis), (independent eigenvectors), (orthogonal eigenvectors), (symmetric), (desirable), HW5, ...

Detour: Mean

Data matrix \(X\) with \(m\) rows (Site 1, Site 2, Site 3, ..., Site m) and \(n\) columns (Salinity, Pressure, Density, Depth, Temp., ..., n var.):

X = \begin{bmatrix} a_{11}&a_{12}&a_{13}&a_{14}&a_{15}&\cdots&a_{1n}\\ a_{21}&a_{22}&a_{23}&a_{24}&a_{25}&\cdots&a_{2n}\\ a_{31}&a_{32}&a_{33}&a_{34}&a_{35}&\cdots&a_{3n}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ a_{m1}&a_{m2}&a_{m3}&a_{m4}&a_{m5}&\cdots&a_{mn}\\ \end{bmatrix}

\mu_1 = \frac{1}{m}\sum_{i=1}^m{a_{i1}}
(mean salinity across all locations)
\mu_2 = \frac{1}{m}\sum_{i=1}^m{a_{i2}}
(mean pressure across all locations)
\mu_j = \frac{1}{m}\sum_{i=1}^m{a_{ij}}
It is customary to subtract the mean from each column so that the data is zero-centred
X = \begin{bmatrix} a_{11}-\mu_1&a_{12}-\mu_2&a_{13}-\mu_3&a_{14}-\mu_4&a_{15}-\mu_5&\cdots&a_{1n}-\mu_n\\ a_{21}-\mu_1&a_{22}-\mu_2&a_{23}-\mu_3&a_{24}-\mu_4&a_{25}-\mu_5&\cdots&a_{2n}-\mu_n\\ a_{31}-\mu_1&a_{32}-\mu_2&a_{33}-\mu_3&a_{34}-\mu_4&a_{35}-\mu_5&\cdots&a_{3n}-\mu_n\\ \vdots&\vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ a_{m1}-\mu_1&a_{m2}-\mu_2&a_{m3}-\mu_3&a_{m4}-\mu_4&a_{m5}-\mu_5&\cdots&a_{mn}-\mu_n\\ \end{bmatrix}
(rows: Site 1, Site 2, Site 3, ..., Site m; columns: Salinity, Pressure, Density, Depth, Temp., ..., n var.)

New mean
\hat{\mu}_j = \frac{1}{m}\sum_{i=1}^m({a_{ij}} - \mu_j)
= \frac{1}{m}\sum_{i=1}^m a_{ij} - \frac{1}{m}\sum_{i=1}^m\mu_j
= \mu_j - \frac{1}{m}m \mu_j = 0
The data is now zero-centred (i.e. the mean is 0)
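The centring step can be sketched in NumPy. This is a minimal illustration rather than part of the lecture: the 4 x 3 array of measurements is made up, and the names `X`, `mu` and `X_centred` are chosen here for convenience.

```python
import numpy as np

# Hypothetical measurements: 4 sites (rows) x 3 variables (columns).
X = np.array([
    [35.0, 10.0, 1025.0],
    [34.0, 20.0, 1026.0],
    [36.0, 30.0, 1027.0],
    [35.0, 40.0, 1028.0],
])

# mu_j = (1/m) * sum_i a_ij : one mean per column.
mu = X.mean(axis=0)

# Broadcasting subtracts mu_j from every entry of column j.
X_centred = X - mu

# The new column means should all be 0 (up to floating-point rounding).
new_mu = X_centred.mean(axis=0)
print("column means before:", mu)
print("column means after: ", new_mu)
```

The printed means after centring are zero up to rounding, which matches the derivation of the new mean.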

Detour: Mean

For the rest of the discussion we will assume that the data is always zero-centred

If it is not, we can always make it zero-centred by subtracting the mean


Detour: Variance

\sigma^2_1 = \frac{1}{m}\sum_{i=1}^m{(a_{i1} - \mu_1)^2}
(variance in salinity across all locations)

\(\because\) the data is zero-centred,

\sigma^2_1 = \frac{1}{m}\sum_{i=1}^m{a_{i1}^2}
\sigma^2_2 = \frac{1}{m}\sum_{i=1}^m{a_{i2}^2}
(variance in pressure across all locations)
\sigma^2_j = \frac{1}{m}\sum_{i=1}^m{a_{ij}^2}

(here \(a_{ij}\) is the entry of \(X\) for site \(i\) and variable \(j\))

Detour: Variance

Let \(\mathbf{x_1}, \mathbf{x_2}, \mathbf{x_3}, \mathbf{x_4}, \mathbf{x_5}, \dots, \mathbf{x_n}\) denote the columns of \(X\). Then

\sigma^2_j = \frac{1}{m}\sum_{i=1}^m{a_{ij}^2} = \frac{1}{m}\mathbf{x}_j^\top \mathbf{x}_j

Detour: Covariance

Cov(\mathbf{x_1},\mathbf{x_2}) = \frac{1}{m}\sum_{k=1}^m(a_{k1} - \mu_1)(a_{k2} - \mu_2)

\(\because\) the data is zero-centred,

Cov(\mathbf{x_1},\mathbf{x_2}) = \frac{1}{m}\sum_{k=1}^m a_{k1}a_{k2}
Cov(\mathbf{x_i},\mathbf{x_j}) = \frac{1}{m}\sum_{k=1}^m a_{ki}a_{kj}

(here \(\mathbf{x_1}, \mathbf{x_2}, \dots, \mathbf{x_n}\) are the columns of \(X\) and the rows are Site 1, Site 2, Site 3, ..., Site m)

Detour: Covariance

Cov(\mathbf{x_i},\mathbf{x_j}) = \frac{1}{m}\sum_{k=1}^m a_{ki}a_{kj} = \frac{1}{m}\mathbf{x}_i^\top \mathbf{x}_j

(here \(\mathbf{x_i}\) and \(\mathbf{x_j}\) are columns \(i\) and \(j\) of \(X\))
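A small numerical check of the dot-product form, assuming NumPy and a made-up zero-centred 4 x 2 matrix (the values and variable names are illustrative only). The variance is the special case \(i = j\).

```python
import numpy as np

# Hypothetical zero-centred data: 4 sites x 2 variables.
X = np.array([
    [-3.0, -1.0],
    [-1.0, -1.0],
    [ 1.0,  0.0],
    [ 3.0,  2.0],
])
m = X.shape[0]
x1, x2 = X[:, 0], X[:, 1]   # the columns of X

# Variance of column 1 as (1/m) * x1^T x1.
var_1 = x1 @ x1 / m

# Covariance of columns 1 and 2 as (1/m) * x1^T x2.
cov_12 = x1 @ x2 / m

# Cross-check against the element-wise definitions.
var_1_sum = np.mean(x1 ** 2)
cov_12_sum = np.mean(x1 * x2)
print(var_1, cov_12)
```

Both columns here have mean 0, so no mean needs to be subtracted inside the sums.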

Puzzle: What is the matrix \(\frac{1}{m}X^\top X\)?

\frac{1}{m}X^\top X = \frac{1}{m}\begin{bmatrix} a_{11}&a_{21}&a_{31}&a_{41}&\cdots&a_{m1}\\ a_{12}&a_{22}&a_{32}&a_{42}&\cdots&a_{m2}\\ a_{13}&a_{23}&a_{33}&a_{43}&\cdots&a_{m3}\\ a_{14}&a_{24}&a_{34}&a_{44}&\cdots&a_{m4}\\ \cdots&\cdots&\cdots&\cdots&\cdots&\cdots\\ a_{1n}&a_{2n}&a_{3n}&a_{4n}&\cdots&a_{mn}\\ \end{bmatrix}\begin{bmatrix} a_{11}&a_{12}&a_{13}&a_{14}&\cdots&a_{1n}\\ a_{21}&a_{22}&a_{23}&a_{24}&\cdots&a_{2n}\\ a_{31}&a_{32}&a_{33}&a_{34}&\cdots&a_{3n}\\ \cdots&\cdots&\cdots&\cdots&\cdots&\cdots\\ \cdots&\cdots&\cdots&\cdots&\cdots&\cdots\\ a_{m1}&a_{m2}&a_{m3}&a_{m4}&\cdots&a_{mn}\\ \end{bmatrix} = \Sigma
(the rows of \(X^\top\) are \(\mathbf{x_1}^\top, \mathbf{x_2}^\top, \dots, \mathbf{x_n}^\top\) and the columns of \(X\) are \(\mathbf{x_1}, \mathbf{x_2}, \dots, \mathbf{x_n}\))

\Sigma_{ij} = \frac{1}{m}\mathbf{x_i}^\top\mathbf{x_j} = \begin{cases} Cov(\mathbf{x_i},\mathbf{x_j}) & \text{if } i\neq j\\ \sigma^2_i & \text{if } i= j \end{cases}

Covariance Matrix

(symmetric matrix)
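The covariance matrix can be computed in one line once the data is zero-centred. The sketch below assumes NumPy and uses a made-up 3 x 3 data matrix; `Sigma` is a name chosen here. `np.cov` with `rowvar=False` and `bias=True` is used only as a cross-check, since it treats columns as variables and divides by \(m\).

```python
import numpy as np

# Hypothetical zero-centred data: 3 sites x 3 variables (each column sums to 0).
X = np.array([
    [-2.0, -1.0,  1.0],
    [ 0.0,  1.0, -1.0],
    [ 2.0,  0.0,  0.0],
])
m = X.shape[0]

# Sigma = (1/m) X^T X : entry (i, j) is (1/m) x_i^T x_j.
Sigma = X.T @ X / m

# Diagonal entries are the variances, off-diagonal entries the covariances.
variances = np.diag(Sigma)
print(Sigma)
```

Because \((X^\top X)^\top = X^\top X\), `Sigma` equals its own transpose, which is the symmetry noted on the slide.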

We are now ready to start a discussion on PCA!

The standard basis

Data with two columns (Salinity, Pressure) and rows Site 1, Site 2, Site 3, ..., Site m, written \(\mathbf{x_1}^\top, \mathbf{x_2}^\top, \mathbf{x_3}^\top, \dots, \mathbf{x_m}^\top\):

\begin{bmatrix} x_{11}&x_{12}\\ x_{21}&x_{22}\\ x_{31}&x_{32}\\ \vdots&\vdots\\ x_{m1}&x_{m2}\\ \end{bmatrix}

\mathbf{u_1} = \begin{bmatrix}1\\0\end{bmatrix}
\mathbf{u_2} = \begin{bmatrix}0\\1\end{bmatrix}

\mathbf{x_1} = \begin{bmatrix}x_{11}\\x_{12}\end{bmatrix} = x_{11}\begin{bmatrix}1\\0\end{bmatrix}+x_{12}\begin{bmatrix}0\\1\end{bmatrix}

Note the change in notation on this slide: \(\mathbf{x_i}\) now refers to one row of the data (using the ML notation), not a column

What if we choose a different basis?

\begin{bmatrix}x_{11}\\x_{12}\end{bmatrix} = b_{11}\mathbf{v_1}+\underbrace{b_{12}\mathbf{v_2}}_{\approx 0}
\therefore\begin{bmatrix}x_{11}\\x_{12}\end{bmatrix} \approx b_{11}\mathbf{v_1}

It seems that the same data which was originally represented using 2 dimensions can now be represented using one dimension by making a smarter choice for the basis!

(Figure: the point \(\begin{bmatrix}x_{11}\\x_{12}\end{bmatrix}\) from the Salinity/Pressure data, shown with the standard basis \(\mathbf{u_1}, \mathbf{u_2}\) and the new basis \(\mathbf{v_1}, \mathbf{v_2}\))
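The toy example can be sketched with made-up numbers. Everything below is an assumption for illustration: the point `x`, and a basis `v1`, `v2` chosen so that the point lies close to the direction of `v1`. `np.linalg.solve` finds the coordinates `b` with `V @ b = x`, which works for any basis, orthonormal or not.

```python
import numpy as np

# Hypothetical point [x11, x12] in the standard basis u1, u2.
x = np.array([3.0, 3.2])

# A different basis: v1 along the diagonal, v2 perpendicular to it.
v1 = np.array([1.0, 1.0]) / np.sqrt(2)
v2 = np.array([-1.0, 1.0]) / np.sqrt(2)
V = np.column_stack([v1, v2])   # basis vectors as columns

# Coordinates in the new basis: solve V b = x, so x = b[0]*v1 + b[1]*v2.
b = np.linalg.solve(V, x)

# Keep only the first coordinate and measure what is lost.
x_approx = b[0] * v1
error = np.linalg.norm(x - x_approx)
print("coordinates in the new basis:", b)
print("error after dropping the v2 term:", error)
```

With these numbers the second coordinate is small compared to the first, so one number per point carries almost all of the information.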

The bigger question

Data matrix \(X\) with rows Site 1, Site 2, Site 3, ..., Site m and columns Salinity, Pressure, Density, Depth, Temp., ..., n var.:

X = \begin{bmatrix} x_{11}&x_{12}&x_{13}&x_{14}&x_{15}&\cdots&x_{1n}\\ x_{21}&x_{22}&x_{23}&x_{24}&x_{25}&\cdots&x_{2n}\\ x_{31}&x_{32}&x_{33}&x_{34}&x_{35}&\cdots&x_{3n}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\ddots&\vdots\\ x_{m1}&x_{m2}&x_{m3}&x_{m4}&x_{m5}&\cdots&x_{mn}\\ \end{bmatrix}

Can we represent the data using fewer dimensions by choosing a different basis?

OR

Can we project the data onto a smaller subspace?

Yes, we can!

We will see how!

Let us first dig a bit deeper into our toy example

Why do we not care about \(\mathbf{v_2}\)?
(or why do we think we can represent the data using fewer dimensions)

What is being projected, where is it being projected and how is it being projected?

Is there something else that we desire?

Why do we not care about \(\mathbf{v_2}\)?

\begin{bmatrix}x_{11}\\x_{12}\end{bmatrix} = b_{11}\mathbf{v_1}+\underbrace{b_{12}\mathbf{v_2}}_{\approx 0}
\therefore\begin{bmatrix}x_{11}\\x_{12}\end{bmatrix} \approx b_{11}\mathbf{v_1}

Because the data has very little variance along this dimension


Wishlist: Represent the data using fewer dimensions such that the data has high variance along these dimensions
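To make "variance along a dimension" concrete, the sketch below generates synthetic 2-D data that is spread out along one direction and tightly packed along the perpendicular one, then measures the variance of the coordinates along each. The data, the seed and the directions `v1`, `v2` are all assumptions made for this illustration; the lecture has not yet said how to find such directions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500

# Hypothetical directions: v1 along the diagonal, v2 perpendicular to it.
v1 = np.array([1.0, 1.0]) / np.sqrt(2)
v2 = np.array([-1.0, 1.0]) / np.sqrt(2)

# Synthetic data: large spread along v1, small spread along v2.
t = rng.normal(0.0, 3.0, size=m)
noise = rng.normal(0.0, 0.2, size=m)
X = np.outer(t, v1) + np.outer(noise, v2)
X = X - X.mean(axis=0)          # zero-centre each column

# Variance of the coordinates along each direction: (1/m) * sum (x_i^T v)^2.
var_v1 = np.mean((X @ v1) ** 2)
var_v2 = np.mean((X @ v2) ** 2)
print("variance along v1:", var_v1)
print("variance along v2:", var_v2)
```

Since `v1` and `v2` are orthonormal, the two variances add up to the total variance of the original columns, so dropping the low-variance direction discards only a small share of it.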

Projection: What, where and how?

\begin{bmatrix}x_{11}\\x_{12}\end{bmatrix} = b_{11}\mathbf{v_1}+\underbrace{b_{12}\mathbf{v_2}}_{\approx 0}
\therefore\begin{bmatrix}x_{11}\\x_{12}\end{bmatrix} \approx b_{11}\mathbf{v_1}

What is being projected?

\(\mathbf{x_1}\)

Where is it being projected?

on \(\mathbf{v_1}\) and \(\mathbf{v_2}\)

How is it being projected?

b_{11} = \mathbf{x_1}^\top\mathbf{v_1}
b_{12} = \mathbf{x_1}^\top\mathbf{v_2}
(since \(\mathbf{v_1}\) and \(\mathbf{v_2}\) are orthonormal)

Wishlist: Represent the data using fewer orthonormal basis vectors
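The role of orthonormality in the projection step can be checked numerically. In this sketch the point `x1` and both bases are made-up values: with the orthonormal pair `v1`, `v2` the dot products are exactly the coordinates, while with the non-orthonormal pair `w1`, `w2` (hypothetical, chosen only as a counterexample) the same recipe fails to rebuild the point.

```python
import numpy as np

x1 = np.array([3.0, 3.2])       # hypothetical data point

# Orthonormal basis: unit length and mutually perpendicular.
v1 = np.array([1.0, 1.0]) / np.sqrt(2)
v2 = np.array([-1.0, 1.0]) / np.sqrt(2)

# Coordinates are plain dot products: b11 = x1^T v1, b12 = x1^T v2.
b11 = x1 @ v1
b12 = x1 @ v2
reconstruction = b11 * v1 + b12 * v2    # recovers x1

# Non-orthonormal basis: dot products are no longer the coordinates.
w1 = np.array([1.0, 0.0])
w2 = np.array([1.0, 1.0])
bad_reconstruction = (x1 @ w1) * w1 + (x1 @ w2) * w2

print("orthonormal basis:    ", reconstruction)
print("non-orthonormal basis:", bad_reconstruction)
```

For a general basis the coordinates require solving a linear system, which is one reason an orthonormal basis is on the wishlist.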

Is there anything else that we desire?

Is \(\mathbf{z}\) adding any new information beyond what is already contained in \(\mathbf{x}\)?

The two columns have a high covariance (when one increases the other also increases)

Wishlist: The covariance between the columns in the new orthonormal basis should be low - ideally 0

(zero covariance means the zero-centred columns are orthogonal, and non-zero orthogonal columns are linearly independent)
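This wish can be illustrated on synthetic data. If the new coordinates along \(\mathbf{v_i}\) are collected in the column \(X\mathbf{v_i}\), their covariance is \(\frac{1}{m}(X\mathbf{v_i})^\top(X\mathbf{v_j}) = \mathbf{v_i}^\top \Sigma \mathbf{v_j}\), where \(\Sigma = \frac{1}{m}X^\top X\). The sketch below checks this identity; the data, the seed and the basis `v1`, `v2` are assumptions picked by hand for the example, not a method from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500

# Hand-picked orthonormal basis (an assumption for this example).
v1 = np.array([1.0, 1.0]) / np.sqrt(2)
v2 = np.array([-1.0, 1.0]) / np.sqrt(2)
V = np.column_stack([v1, v2])

# Synthetic zero-centred data whose two columns rise and fall together.
t = rng.normal(0.0, 3.0, size=m)
noise = rng.normal(0.0, 0.2, size=m)
X = np.outer(t, v1) + np.outer(noise, v2)
X = X - X.mean(axis=0)

# Covariance between the original columns (standard basis).
cov_old = X[:, 0] @ X[:, 1] / m

# Columns of Z are the coordinates of every point along v1 and v2.
Z = X @ V
cov_new = Z[:, 0] @ Z[:, 1] / m

# The same number via the covariance matrix: v1^T Sigma v2.
Sigma = X.T @ X / m
cov_via_sigma = v1 @ Sigma @ v2

print("covariance in the standard basis:", cov_old)
print("covariance in the new basis:     ", cov_new)
```

In the standard basis the two columns are strongly correlated, while in the hand-picked basis the covariance is close to 0, which is the behaviour the wishlist asks for.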

Summary of wishlist

Represent the data using fewer dimensions such that 

the data has high variance along these dimensions

the covariance between any two dimensions is low

the basis vectors are orthonormal

Learning Objectives

A quick recap of mean, variance and covariance

What is the covariance matrix?

What is the motivation for PCA?

What is the wishlist for representing data using fewer dimensions?

(achieved)
