CS6015: Linear Algebra and Random Processes

Lecture 21:  Principal Component Analysis (the math)

Learning Objectives 

What is PCA?

What are some applications of PCA?

Recap of Wishlist

Represent the data using fewer dimensions such that 

the data has high variance along these dimensions

the covariance between any two dimensions is low

the basis vectors are orthonormal

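(Figure: a data point \(\begin{bmatrix}x_{11}\\x_{12}\end{bmatrix}\) drawn in the standard basis \(\mathbf{u_1} = \begin{bmatrix}1\\0\end{bmatrix}\), \(\mathbf{u_2} = \begin{bmatrix}0\\1\end{bmatrix}\), together with a new basis \(\mathbf{v_1}\), \(\mathbf{v_2}\))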

We will keep the wishlist aside for now and just build some background first (mostly recap)

Projecting onto one dimension

\mathbf{X} = \begin{bmatrix} x_{11}&x_{12}&x_{13}&x_{14}& \dots &x_{1n} \\ x_{21}&x_{22}&x_{23}&x_{24}& \dots &x_{2n} \\ x_{31}&x_{32}&x_{33}&x_{34}& \dots &x_{3n} \\ \dots &\dots &\dots &\dots &\dots &\dots \\ x_{m1}&x_{m2}&x_{m3}&x_{m4}& \dots &x_{mn} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{x_1}^\top \\ \mathbf{x_2}^\top \\ \mathbf{x_3}^\top \\ \vdots \\ \mathbf{x_m}^\top \end{bmatrix}
\mathbf{x_1} = x_{11}\mathbf{u_1} + x_{12}\mathbf{u_2} + \dots + x_{1n}\mathbf{u_n}

Standard Basis Vectors

(unit norms)

\begin{bmatrix} \uparrow\\ \\ \mathbf{v_1} \\ \\ \downarrow \end{bmatrix}

New Basis Vector

(unit norm)

\mathbf{x_1} \approx \hat{x_{11}}\mathbf{v_1}
\hat{x_{11}} = \frac{\mathbf{x_1}^\top \mathbf{v_1}}{\mathbf{v_1}^\top \mathbf{v_1}} = \mathbf{x_1}^\top \mathbf{v_1}
\mathbf{x_2} \approx \hat{x_{21}}\mathbf{v_1}
\hat{x_{21}} = \frac{\mathbf{x_2}^\top \mathbf{v_1}}{\mathbf{v_1}^\top \mathbf{v_1}} = \mathbf{x_2}^\top \mathbf{v_1}
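
The projection coefficient above can be checked numerically. The following is a minimal NumPy sketch; the data point `x1` and the direction `v1` are hypothetical values chosen only for illustration. It shows that once the basis vector has unit norm, the denominator disappears and the coefficient is just a dot product.

```python
import numpy as np

# Hypothetical 2-D data point and (un-normalised) direction.
x1 = np.array([3.3, 3.0])
v1 = np.array([1.0, 1.0])

# General projection coefficient: (x^T v) / (v^T v).
coeff_general = (x1 @ v1) / (v1 @ v1)

# With a unit-norm direction, v^T v = 1 and the coefficient is x^T v.
v1_unit = v1 / np.linalg.norm(v1)
coeff_unit = x1 @ v1_unit

# Both coefficients describe the same projected point.
proj_general = coeff_general * v1
proj_unit = coeff_unit * v1_unit
print(proj_general, proj_unit)
```

The two coefficients differ (they are expressed relative to vectors of different length), but the projected point is the same in both cases.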

Projecting onto two dimensions

\mathbf{X} = \begin{bmatrix} \mathbf{x_1}^\top \\ \mathbf{x_2}^\top \\ \mathbf{x_3}^\top \\ \vdots \\ \mathbf{x_m}^\top \end{bmatrix}
\mathbf{V} = \begin{bmatrix} \uparrow & \uparrow \\ \\ \mathbf{v_1} & \mathbf{v_2} \\ \\ \downarrow & \downarrow \end{bmatrix}

new basis vectors

(unit norm)

\mathbf{x_1} = \hat{x_{11}}\mathbf{v_1}+\hat{x_{12}}\mathbf{v_2}
\hat{x_{11}} = \mathbf{x_1}^\top \mathbf{v_1}
\hat{x_{12}} = \mathbf{x_1}^\top \mathbf{v_2}
\mathbf{x_2} = \hat{x_{21}}\mathbf{v_1}+\hat{x_{22}}\mathbf{v_2}
\hat{x_{21}} = \mathbf{x_2}^\top \mathbf{v_1}
\hat{x_{22}} = \mathbf{x_2}^\top \mathbf{v_2}
\hat{X} = \begin{bmatrix} \hat{x_{11}}&\hat{x_{12}}\\ \hat{x_{21}}&\hat{x_{22}}\\ \hat{x_{31}}&\hat{x_{32}}\\ \dots &\dots\\ \hat{x_{m1}}&\hat{x_{m2}}\\ \end{bmatrix} = \begin{bmatrix} \mathbf{x_{1}}^\top \mathbf{v_{1}}&\mathbf{x_{1}}^\top \mathbf{v_{2}} \\ \mathbf{x_{2}}^\top \mathbf{v_{1}}&\mathbf{x_{2}}^\top \mathbf{v_{2}} \\ \mathbf{x_{3}}^\top \mathbf{v_{1}}&\mathbf{x_{3}}^\top \mathbf{v_{2}} \\ \dots &\dots \\ \mathbf{x_{m}}^\top \mathbf{v_{1}}&\mathbf{x_{m}}^\top \mathbf{v_{2}} \\ \end{bmatrix} = XV

Projecting onto k dimensions

\mathbf{X} = \begin{bmatrix} \mathbf{x_1}^\top \\ \mathbf{x_2}^\top \\ \mathbf{x_3}^\top \\ \vdots \\ \mathbf{x_m}^\top \end{bmatrix}
\mathbf{V} = \begin{bmatrix} \uparrow & \uparrow & \cdots & \uparrow \\ \\ \mathbf{v_1} & \mathbf{v_2} & \cdots & \mathbf{v_k} \\ \\ \downarrow & \downarrow & \cdots & \downarrow \end{bmatrix}

New Basis Vectors

(unit norm)

\hat{X} = \begin{bmatrix} \hat{x_{11}}&\hat{x_{12}}& \dots &\hat{x_{1k}} \\ \hat{x_{21}}&\hat{x_{22}}& \dots &\hat{x_{2k}} \\ \hat{x_{31}}&\hat{x_{32}}& \dots &\hat{x_{3k}} \\ \dots &\dots &\dots &\dots \\ \hat{x_{m1}}&\hat{x_{m2}}& \dots &\hat{x_{mk}} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{x_{1}}^\top \mathbf{v_{1}}&\mathbf{x_{1}}^\top \mathbf{v_{2}}&\dots&\mathbf{x_{1}}^\top \mathbf{v_{k}} \\ \mathbf{x_{2}}^\top \mathbf{v_{1}}&\mathbf{x_{2}}^\top \mathbf{v_{2}}&\dots&\mathbf{x_{2}}^\top \mathbf{v_{k}}\\ \mathbf{x_{3}}^\top \mathbf{v_{1}}&\mathbf{x_{3}}^\top \mathbf{v_{2}}&\dots&\mathbf{x_{3}}^\top \mathbf{v_{k}} \\ \dots &\dots &\dots &\dots \\ \mathbf{x_{m}}^\top \mathbf{v_{1}}&\mathbf{x_{m}}^\top \mathbf{v_{2}}&\dots&\mathbf{x_{m}}^\top \mathbf{v_{k}} \\ \end{bmatrix} = XV
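
As a rough illustration of the matrix form, the sketch below builds a small random data matrix and projects it onto k orthonormal columns. The sizes `m`, `n`, `k` are hypothetical, and the orthonormal `V` is generated with a QR factorisation purely for convenience (it is not yet the basis PCA will choose).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 4, 2                      # hypothetical sizes
X = rng.normal(size=(m, n))            # each row is one data point x_i^T

# Any matrix with k orthonormal columns will do for this illustration;
# the Q factor of a reduced QR factorisation has that property.
V, _ = np.linalg.qr(rng.normal(size=(n, k)))

X_hat = X @ V                          # projected data, shape (m, k)

# Entry (i, j) of X_hat is the dot product of row i of X with column j of V.
i, j = 2, 1
entry = X[i] @ V[:, j]
print(X_hat.shape, X_hat[i, j], entry)
```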

We want to find a V such that

columns of V are orthonormal

columns of \(\hat{X}\) have high variance

columns of \(\hat{X}\) have low covariance (ideally 0)

What is the new covariance matrix?

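(assuming each column of \(X\) has zero mean, so that \(\frac{1}{m}X^TX\) is the covariance matrix of the data)

\hat{X} = XV
\hat{\Sigma} = \frac{1}{m}\hat{X}^T\hat{X}
\hat{\Sigma} = \frac{1}{m}(XV)^T(XV)
\hat{\Sigma} = V^T(\frac{1}{m}X^TX)V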
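
The identity for the new covariance matrix can be sanity-checked numerically. This is a small sketch under the assumption that the data is centred first (column means subtracted); the data and the orthonormal `V` are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 3

# Hypothetical correlated data, centred so that (1/m) X^T X is its covariance.
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))
X = X - X.mean(axis=0)
Sigma = X.T @ X / m

# An arbitrary orthonormal basis (Q factor of a QR factorisation).
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
X_hat = X @ V

# Covariance of the projected data, computed two ways.
Sigma_hat_direct = X_hat.T @ X_hat / m
Sigma_hat_formula = V.T @ Sigma @ V
print(np.allclose(Sigma_hat_direct, Sigma_hat_formula))
```

For an arbitrary `V` the result is generally not diagonal, which is why a specific choice of basis is needed.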

What do we want?

\hat{\Sigma}_{ij} = Cov(i,j) = \begin{cases} 0 & \text{ if } i \neq j \quad \text{(low covariance)} \\ \sigma^2_i \neq 0 & \text{ if } i = j \quad \text{(high variance)} \end{cases}

We want \( \hat{\Sigma}\) to be diagonal

We are looking for orthogonal vectors which will diagonalise \( \frac{1}{m}X^TX\) :-)

These would be eigenvectors of \( X^TX\)

(Note that the eigenvectors of cA are the same as the eigenvectors of A)

The eigenbasis of \(X^TX\)

\hat{\Sigma} = V^T(\frac{1}{m}X^TX)V = D
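
A quick numerical check of this diagonalisation, as a sketch: it assumes centred data and uses `numpy.linalg.eigh`, which is intended for symmetric matrices and returns orthonormal eigenvectors as columns. The data is random and illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 500, 3

# Hypothetical correlated, centred data.
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))
X = X - X.mean(axis=0)

# Eigen-decomposition of the symmetric matrix X^T X.
eigvals, V = np.linalg.eigh(X.T @ X)

# Covariance of the data expressed in the eigenbasis.
Sigma_hat = V.T @ (X.T @ X / m) @ V

# Everything off the diagonal should vanish up to round-off.
off_diag = Sigma_hat - np.diag(np.diag(Sigma_hat))
print(np.abs(off_diag).max())
```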

We have found a \( V \) such that

columns of \( V\) are orthonormal \(\checkmark\) (eigenvectors of a symmetric matrix)

columns of \( \hat{X}\) have zero covariance \(\checkmark\) (\(\hat{\Sigma}\) is diagonal)

The right basis to use is the eigenbasis of \(X^TX\)

What about the variance of the columns of \( \hat{X}\) ?

What is the variance of the cols of \(\hat{X}\) ?

\hat{X} = XV = \begin{bmatrix} \hat{X}_{1} & \hat{X}_{2} & \dots & \hat{X}_{k} \end{bmatrix}

The i-th column of \(\hat{X}\) is

\(\hat{X}_i \) = \( Xv_{i} \)

The variance for the i-th column is 

\(\sigma_{i}^{2}\) = \(\frac{1}{m}\hat{X}_i^{T}\hat{X}_i\)

 = \(\frac{1}{m}{(Xv_i)}^{T}Xv_i\)

 = \(\frac{1}{m}{v_i}^T{X}^{T}Xv_i\)

 = \(\frac{1}{m}{v_i}^T\lambda _iv_i\)   \((\because X^TXv_i = \lambda_i v_i)\)

 = \(\frac{1}{m}\lambda _i\)   \((\because {v_i}^Tv_i = 1)\)
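
The result that the i-th projected column has variance \(\frac{1}{m}\lambda_i\) can be verified on synthetic data. The sketch below assumes centred data; the per-column scales are hypothetical and only serve to give the eigenvalues a clear spread.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 400, 4

# Hypothetical data with very different spreads per column, then centred.
X = rng.normal(size=(m, n)) * np.array([5.0, 2.0, 1.0, 0.1])
X = X - X.mean(axis=0)

# Eigenvalues and orthonormal eigenvectors of X^T X.
eigvals, V = np.linalg.eigh(X.T @ X)
X_hat = X @ V

# Variance of each projected column: (1/m) * X_hat_i^T X_hat_i.
col_var = (X_hat ** 2).sum(axis=0) / m
print(col_var)
print(eigvals / m)
```

The two printed vectors should agree up to round-off, so ranking eigenvectors by eigenvalue is the same as ranking projected columns by variance.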

Retain only those eigenvectors which have a high eigenvalue (high variance)

The full story

(How would you do this in practice?)

Compute the n eigenvectors of \(X^TX\)

Sort them according to the corresponding eigenvalues

Retain only those eigenvectors corresponding to the top-k eigenvalues

Project the data onto these k eigenvectors

We know that n such eigenvectors will exist since \(X^TX\) is a symmetric matrix
These are called the principal components
Heuristics: k=50,100 or choose k such that \(\frac{\lambda_k}{\lambda_{max}} > t\)
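
The four steps above can be sketched in a few lines of NumPy. This is one possible illustration, not a reference implementation: the function name `pca`, the centring step, the synthetic data, and the threshold `t = 0.1` are all assumptions, and the threshold rule is read here as keeping every eigenvalue whose ratio to the largest eigenvalue exceeds `t`.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k eigenvectors of X^T X."""
    X = X - X.mean(axis=0)                       # assumption: centre the data
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # step 1 (ascending order)
    order = np.argsort(eigvals)[::-1]            # step 2: sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V_k = eigvecs[:, :k]                         # step 3: keep the top k
    return X @ V_k, V_k, eigvals                 # step 4: project

rng = np.random.default_rng(4)
# Hypothetical data: 100 points in 5 dimensions with unequal spreads.
X = rng.normal(size=(100, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

X_hat, V_k, eigvals = pca(X, k=2)

# Threshold heuristic (assumed reading): count eigenvalues whose ratio
# to the largest eigenvalue exceeds t.
t = 0.1
k_auto = int(np.sum(eigvals / eigvals[0] > t))
print(X_hat.shape, k_auto)
```

For large n, an SVD of the centred data is a common alternative to forming \(X^TX\) explicitly, but the eigen-decomposition mirrors the steps as stated here.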

Reconstruction Error 

Suppose

\mathbf{x} = \begin{bmatrix}x_{11}\\x_{12}\end{bmatrix} = \begin{bmatrix}3.3\\3\end{bmatrix}

(one data point)

\mathbf{x} = 3.3\mathbf{u_{1}} + 3\mathbf{u_{2}}, \quad \mathbf{u_1} = \begin{bmatrix}1\\0\end{bmatrix}, \quad \mathbf{u_2} = \begin{bmatrix}0\\1\end{bmatrix}

Let

\mathbf{v_{1}} = \begin{bmatrix}1\\1\end{bmatrix} \quad \mathbf{v_{2}} = \begin{bmatrix}-1\\1\end{bmatrix}

(new basis vectors)

\mathbf{v_{1}} = \begin{bmatrix}\frac{1}{\sqrt{2}} \\ \\ \frac{1}{\sqrt{2}}\end{bmatrix} \quad \mathbf{v_{2}} = \begin{bmatrix}-\frac{1}{\sqrt{2}} \\ \\ \frac{1}{\sqrt{2}}\end{bmatrix}

(unit norm)

\mathbf{x} = b_{11}\mathbf{v_{1}} + b_{12}\mathbf{v_{2}}
b_{11} = \mathbf{x^{\top}v_{1}} = \frac{6.3}{\sqrt{2}}
b_{12} = \mathbf{x^{\top}v_{2}} = -\frac{0.3}{\sqrt{2}}
\frac{6.3}{\sqrt{2}} \mathbf{v_{1}} + \frac{-0.3}{\sqrt{2}}\mathbf{v_{2}} =\begin{bmatrix}3.3\\3\end{bmatrix} =\mathbf{x}

if we use all the n eigenvectors we will get an exact reconstruction of the data
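
The numbers in this example can be reproduced directly. The sketch below uses the same point and the same unit-norm basis vectors as the slide; only the variable names are introduced here.

```python
import numpy as np

x = np.array([3.3, 3.0])
v1 = np.array([1.0, 1.0]) / np.sqrt(2)    # unit norm
v2 = np.array([-1.0, 1.0]) / np.sqrt(2)   # unit norm, orthogonal to v1

# Coordinates of x in the new basis are plain dot products.
b11 = x @ v1    # 6.3 / sqrt(2)
b12 = x @ v2    # -0.3 / sqrt(2)

# Using both basis vectors recovers x exactly (up to round-off).
x_full = b11 * v1 + b12 * v2
print(b11, b12, x_full)
```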

Reconstruction Error 

but we are going to use fewer eigenvectors (we will throw away \(\mathbf{v_{2}}\))

b_{11} = \mathbf{x^{\top}v_{1}} = \frac{6.3}{\sqrt{2}}
\mathbf{\hat{x}} = b_{11}\mathbf{v_{1}} = \frac{6.3}{\sqrt{2}} \mathbf{v_{1}} =\begin{bmatrix}3.15\\3.15\end{bmatrix} \neq \mathbf{x}
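
Dropping \(\mathbf{v_2}\) can be mimicked in code as well. This sketch reuses the point and basis from the example; it also shows that, for this point, the squared error equals the square of the discarded coefficient.

```python
import numpy as np

x = np.array([3.3, 3.0])
v1 = np.array([1.0, 1.0]) / np.sqrt(2)
v2 = np.array([-1.0, 1.0]) / np.sqrt(2)

# Keep only the coordinate along v1.
b11 = x @ v1
x_hat = b11 * v1              # approximately [3.15, 3.15]

# What is lost by discarding v2.
error = x - x_hat             # approximately [0.15, -0.15]
sq_error = error @ error      # approximately 0.045

# The discarded coordinate accounts for the whole error.
b12 = x @ v2
print(x_hat, sq_error, b12 ** 2)
```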

Reconstruction Error 

original x

\mathbf{x} = \begin{bmatrix}3.3\\3\end{bmatrix}

x reconstructed from fewer eigenvectors

\mathbf{\hat{x}} = \begin{bmatrix}3.15\\3.15\end{bmatrix}

reconstruction error vector

\mathbf{x-\hat{x}}

squared length of the error

(\mathbf{x-\hat{x}})^{\top} (\mathbf{x-\hat{x}})

original x, reconstructed from all n eigenvectors

\mathbf{x_{i}} = \sum_{j=1}^{n} b_{ij}\mathbf{v_{j}}

reconstructed only from top k eigenvectors

\mathbf{\hat{x}_{i}} = \sum_{j=1}^{k} b_{ij}\mathbf{v_{j}}

Goal:

\min_{V} \sum_{i=1}^{m} (\mathbf{x_i-\hat{x}_i})^{\top} (\mathbf{x_i-\hat{x}_i})

solving the above optimization problem corresponds to choosing the eigenbasis while discarding the eigenvectors corresponding to the smallest eigenvalues

PCA thus minimizes reconstruction error
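
The claim can be illustrated on synthetic data. The sketch below is only a numerical spot check, not a proof: it assumes centred data, uses hypothetical sizes and scales, and compares the total squared reconstruction error for three choices of k orthonormal directions (top-k eigenvectors, bottom-k eigenvectors, and a random orthonormal set). The helper `total_error` is a name introduced here.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, k = 300, 5, 2

# Hypothetical data with unequal spreads, centred.
X = rng.normal(size=(m, n)) * np.array([4.0, 3.0, 1.0, 0.5, 0.2])
X = X - X.mean(axis=0)

# Eigenvectors of X^T X, sorted by decreasing eigenvalue.
eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

def total_error(V_k):
    """Sum over all points of the squared length of x_i - x_hat_i."""
    X_rec = (X @ V_k) @ V_k.T      # reconstruct from the kept directions
    return np.sum((X - X_rec) ** 2)

err_top = total_error(V[:, :k])        # keep the largest eigenvalues
err_bottom = total_error(V[:, -k:])    # keep the smallest eigenvalues
Q, _ = np.linalg.qr(rng.normal(size=(n, k)))
err_random = total_error(Q)            # keep a random orthonormal set

print(err_top, err_random, err_bottom)
print(eigvals[k:].sum())               # sum of the discarded eigenvalues
```

With the top-k eigenvectors the error equals the sum of the discarded eigenvalues, which is why discarding the smallest ones gives the smallest error.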

Learning Objectives (achieved)

What is PCA?

What are some applications of PCA?

Copy of CS6015: Lecture 21

By Mitesh Khapra
