CS6015: Linear Algebra and Random Processes
Lecture 21: Principal Component Analysis (the math)
Learning Objectives
What is PCA?
What are some applications of PCA?
Recap of Wishlist
Represent the data using fewer dimensions such that
the data has high variance along these dimensions
the covariance between any two dimensions is low
the basis vectors are orthonormal
\mathbf{u_1} = \begin{bmatrix}1\\0\end{bmatrix}
\begin{bmatrix}x_{11}\\x_{12}\end{bmatrix}
\mathbf{u_2} = \begin{bmatrix}0\\1\end{bmatrix}
\mathbf{v_2}
\mathbf{v_1}
We will keep the wishlist aside for now and just build some background first (mostly recap)
Projecting onto one dimension
\begin{bmatrix}
x_{11}&x_{12}&x_{13}&x_{14}& \dots &x_{1n} \\
x_{21}&x_{22}&x_{23}&x_{24}& \dots &x_{2n} \\
x_{31}&x_{32}&x_{33}&x_{34}& \dots &x_{3n} \\
\dots &\dots &\dots &\dots &\dots &\dots \\
\dots &\dots &\dots &\dots &\dots &\dots \\
x_{m1}&x_{m2}&x_{m3}&x_{m4}& \dots &x_{mn} \\
\end{bmatrix}
\mathbf{X}
\text{rows: } \mathbf{x_1}^\top, \mathbf{x_2}^\top, \mathbf{x_3}^\top, \mathbf{x_4}^\top, \dots, \mathbf{x_m}^\top
\mathbf{x_1} = x_{11}\mathbf{u_1} + x_{12}\mathbf{u_2} + \dots + x_{1n}\mathbf{u_n}
Standard Basis Vectors
(unit norms)
\begin{bmatrix}
\uparrow\\
\\
\mathbf{v_1}
\\
\\
\downarrow
\end{bmatrix}
New Basis Vector
(unit norm)
\mathbf{x_1} = \hat{x_{11}}\mathbf{v_1}
\hat{x_{11}} = \frac{\mathbf{x_1}^\top \mathbf{v_1}}{\mathbf{v_1}^\top \mathbf{v_1}} = \mathbf{x_1}^\top \mathbf{v_1}
\mathbf{x_2} = \hat{x_{21}}\mathbf{v_1}
\hat{x_{21}} = \frac{\mathbf{x_2}^\top \mathbf{v_1}}{\mathbf{v_1}^\top \mathbf{v_1}} = \mathbf{x_2}^\top \mathbf{v_1}
\mathbf{u_2} = \begin{bmatrix} 0\\1 \end{bmatrix}
\mathbf{u_1} = \begin{bmatrix} 1\\0 \end{bmatrix}
\begin {bmatrix}
x_{11}\\
x_{12}
\end {bmatrix}
Projecting onto two dimensions
\begin{bmatrix}
x_{11}&x_{12}&x_{13}&x_{14}& \dots &x_{1n} \\
x_{21}&x_{22}&x_{23}&x_{24}& \dots &x_{2n} \\
x_{31}&x_{32}&x_{33}&x_{34}& \dots &x_{3n} \\
\dots &\dots &\dots &\dots &\dots &\dots \\
\dots &\dots &\dots &\dots &\dots &\dots \\
x_{m1}&x_{m2}&x_{m3}&x_{m4}& \dots &x_{mn} \\
\end{bmatrix}
\mathbf{X}
\text{rows: } \mathbf{x_1}^\top, \mathbf{x_2}^\top, \mathbf{x_3}^\top, \mathbf{x_4}^\top, \dots, \mathbf{x_m}^\top
\begin{bmatrix}
\uparrow & \uparrow
\\
\\
\mathbf{\scriptsize v_1} & \mathbf{\scriptsize v_2}
\\
\\
\downarrow & \downarrow
\end{bmatrix}
new basis vectors
(unit norm)
\mathbf{x_1} = \hat{x_{11}}\mathbf{v_1}+\hat{x_{12}}\mathbf{v_2}
\hat{x_{11}} = \mathbf{x_1}^\top \mathbf{v_1}
\mathbf{x_2} = \hat{x_{21}}\mathbf{v_1}+\hat{x_{22}}\mathbf{v_2}
\hat{x_{21}} = \mathbf{x_2}^\top \mathbf{v_1}
\mathbf{u_2} = \begin{bmatrix} 0\\1 \end{bmatrix}
\mathbf{u_1} = \begin{bmatrix} 1\\0 \end{bmatrix}
\begin {bmatrix}
x_{11}\\
x_{12}
\end {bmatrix}
\mathbf{\hat{\textit{X}}} =
\begin{bmatrix}
\hat{x_{11}}&\hat{x_{12}}\\
\hat{x_{21}}&\hat{x_{22}}\\
\hat{x_{31}}&\hat{x_{32}}\\
\dots &\dots\\
\dots &\dots\\
\hat{x_{m1}}&\hat{x_{m2}}\\
\end{bmatrix}
\mathbf{=}
\begin{bmatrix}
\mathbf{x_{1}}^\top \mathbf{v_{1}}&\mathbf{x_{1}}^\top \mathbf{v_{2}} \\
\mathbf{x_{2}}^\top \mathbf{v_{1}}&\mathbf{x_{2}}^\top \mathbf{v_{2}} \\
\mathbf{x_{3}}^\top \mathbf{v_{1}}&\mathbf{x_{3}}^\top \mathbf{v_{2}} \\
\dots &\dots \\
\dots &\dots \\
\mathbf{x_{m}}^\top \mathbf{v_{1}}&\mathbf{x_{m}}^\top \mathbf{v_{2}} \\
\end{bmatrix}
\mathbf{= \textit{XV}}
\hat{x_{12}} = \mathbf{x_1}^\top \mathbf{v_2}
\hat{x_{22}} = \mathbf{x_2}^\top \mathbf{v_2}
\mathbf{v_1}
\mathbf{\textit{V}}
\mathbf{v_2}
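As a quick numerical check of \(\hat{X} = XV\), here is a minimal NumPy sketch. The data matrix and the particular choice of \(\mathbf{v_1}, \mathbf{v_2}\) (the standard basis rotated by 45 degrees) are illustrative assumptions, not part of the lecture.

```python
import numpy as np

# Hypothetical data: m = 3 points in n = 2 dimensions, one point per row.
X = np.array([[3.3, 3.0],
              [1.0, 2.0],
              [-2.0, 0.5]])

# Assumed orthonormal basis: the standard basis rotated by 45 degrees.
v1 = np.array([1.0, 1.0]) / np.sqrt(2)
v2 = np.array([-1.0, 1.0]) / np.sqrt(2)
V = np.column_stack([v1, v2])  # columns of V are the new basis vectors

# Entry (i, j) of X_hat is the dot product x_i^T v_j.
X_hat = X @ V

# The same coordinates, computed one dot product at a time.
X_hat_loop = np.array([[x @ v1, x @ v2] for x in X])
```

Both computations give the same matrix, which is why the whole projection can be written as a single matrix product.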
Projecting onto k dimensions
\mathbf{\hat{\textit{X}}} =
\begin{bmatrix}
\hat{x_{11}}&\hat{x_{12}}& \dots &\hat{x_{1k}} \\
\hat{x_{21}}&\hat{x_{22}}& \dots &\hat{x_{2k}} \\
\hat{x_{31}}&\hat{x_{32}}& \dots &\hat{x_{3k}} \\
\dots &\dots &\dots &\dots \\
\dots &\dots &\dots &\dots \\
\hat{x_{m1}}&\hat{x_{m2}}& \dots &\hat{x_{mk}} \\
\end{bmatrix}
\mathbf{=}
\begin{bmatrix}
\mathbf{x_{1}}^\top \mathbf{v_{1}}&\mathbf{x_{1}}^\top \mathbf{v_{2}}&\dots&\mathbf{x_{1}}^\top \mathbf{v_{k}} \\
\mathbf{x_{2}}^\top \mathbf{v_{1}}&\mathbf{x_{2}}^\top \mathbf{v_{2}}&\dots&\mathbf{x_{2}}^\top \mathbf{v_{k}}\\
\mathbf{x_{3}}^\top \mathbf{v_{1}}&\mathbf{x_{3}}^\top \mathbf{v_{2}}&\dots&\mathbf{x_{3}}^\top \mathbf{v_{k}} \\
\dots &\dots &\dots &\dots \\
\dots &\dots &\dots &\dots \\
\mathbf{x_{m}}^\top \mathbf{v_{1}}&\mathbf{x_{m}}^\top \mathbf{v_{2}}&\dots&\mathbf{x_{m}}^\top \mathbf{v_{k}} \\
\end{bmatrix}
\mathbf{= \textit{XV}}
\mathbf{\textit{X}} =
\begin{bmatrix}
x_{11}&x_{12}&x_{13}&x_{14}& \dots &x_{1n} \\
x_{21}&x_{22}&x_{23}&x_{24}& \dots &x_{2n} \\
x_{31}&x_{32}&x_{33}&x_{34}& \dots &x_{3n} \\
\dots &\dots &\dots &\dots &\dots &\dots \\
\dots &\dots &\dots &\dots &\dots &\dots \\
x_{m1}&x_{m2}&x_{m3}&x_{m4}& \dots &x_{mn} \\
\end{bmatrix}
\text{rows: } \mathbf{x_1}^\top, \mathbf{x_2}^\top, \mathbf{x_3}^\top, \mathbf{x_4}^\top, \dots, \mathbf{x_m}^\top
\mathbf{\textit{V}} =
\begin{bmatrix}
\uparrow & \uparrow & \cdots & \uparrow
\\
\\
\mathbf{\scriptsize v_1} & \mathbf{\scriptsize v_2} & \cdots & \mathbf{\scriptsize v_k}
\\
\\
\downarrow & \downarrow & \cdots & \downarrow
\end{bmatrix}
New Basis Vectors (unit norm)
We want to find a V such that
columns of V are orthonormal
columns of \(\mathbf{\hat{\textit{X}}}\) have high variance
columns of \(\mathbf{\hat{\textit{X}}}\) have low covariance (ideally 0)
What is the new covariance matrix?
\hat{X} = XV
\hat{\Sigma} = \frac{1}{m}\hat{X}^T\hat{X}
\hat{\Sigma} = \frac{1}{m}(XV)^T(XV)
\hat{\Sigma} = V^T(\frac{1}{m}X^TX)V
What do we want?
\hat{\Sigma}_{ij} = Cov(i,j) =
\begin{cases}
0 & \text{ if } i \neq j \quad \text{(low covariance)} \\
\sigma^2_i \neq 0 & \text{ if } i = j \quad \text{(high variance)}
\end{cases}
We want \( \hat{\Sigma}\) to be diagonal
We are looking for orthogonal vectors which will diagonalise \( \frac{1}{m}X^TX\) :-)
These would be eigenvectors of \( X^TX\)
(Note that the eigenvectors of cA are the same as the eigenvectors of A)
The eigenbasis of \(X^TX\)
\hat{\Sigma} = V^T(\frac{1}{m}X^TX)V = D
We have found a \( V \) such that
columns of \( V\) are orthonormal \(\checkmark\) (eigenvectors of a symmetric matrix)
columns of \( \hat{X}\) have zero covariance \(\checkmark\) (\( \hat{\Sigma}\) is diagonal)
The right basis to use is the eigenbasis of \(X^TX\)
What about the variance of the columns of \( \hat{X}\) ?
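The diagonalisation claim can be checked numerically. The sketch below uses synthetic data (an assumption for illustration) and centres the columns so that \(\frac{1}{m}X^TX\) is the covariance matrix. It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns orthonormal eigenvectors as columns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: m = 200 points with n = 3 correlated features.
m, n = 200, 3
A = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.5, 0.3, 0.2]])
X = rng.normal(size=(m, n)) @ A
X = X - X.mean(axis=0)  # centre so that (1/m) X^T X is the covariance matrix

Sigma = (X.T @ X) / m

# Eigenvectors of Sigma are the same as those of X^T X (scaling by 1/m
# only rescales the eigenvalues). eigh returns eigenvalues in ascending order.
eigvals, V = np.linalg.eigh(Sigma)

X_hat = X @ V
Sigma_hat = (X_hat.T @ X_hat) / m  # equals V^T Sigma V

# Everything off the diagonal should vanish up to floating-point error.
off_diagonal = Sigma_hat - np.diag(np.diag(Sigma_hat))
```

Here `off_diagonal` is zero up to rounding, and the diagonal of `Sigma_hat` matches `eigvals`.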
What is the variance of the cols of \(\hat{X}\) ?
\mathbf{\hat{X} =} \begin{bmatrix} \mathbf{\hat{X}_{1}} & \mathbf{\hat{X}_{2}} & \dots & \mathbf{\hat{X}_{k}} \end{bmatrix}
The i-th column of \(\hat{X}\) is
\(\hat{X}_i \) = \( Xv_{i} \)
The variance for the i-th column is
\(\sigma_{i}^{2}\) = \(\frac{1}{m}\hat{X}_i^{T}\hat{X}_i\)
= \(\frac{1}{m}{(Xv_i)}^{T}Xv_i\)
= \(\frac{1}{m}{v_i}^T{X}^{T}Xv_i\)
= \(\frac{1}{m}{v_i}^T\lambda _iv_i\) \((\because X^TXv_i = \lambda_iv_i)\)
= \(\frac{1}{m}\lambda _i\) \((\because {v_i}^Tv_i = 1)\)
Retain only those eigenvectors which have a high eigenvalue (high variance)
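The result \(\sigma_i^2 = \frac{1}{m}\lambda_i\) can be verified on synthetic data. The data and the per-column scales below are assumptions chosen only to give visibly different variances.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: m = 500 points, n = 4 features with different spreads.
m, n = 500, 4
X = rng.normal(size=(m, n)) * np.array([3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)  # centre each column

# Eigen-decomposition of X^T X (not divided by m), as in the derivation.
lam, V = np.linalg.eigh(X.T @ X)

X_hat = X @ V

# Variance of each projected column, computed as (1/m) X_hat_i^T X_hat_i.
variances = np.sum(X_hat ** 2, axis=0) / m
```

Each entry of `variances` agrees with the corresponding `lam / m`, so sorting by eigenvalue is the same as sorting by variance.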
The full story
(How would you do this in practice?)
Compute the n eigenvectors of \(X^TX\) (we know that n such vectors will exist since it is a symmetric matrix)
Sort them according to the corresponding eigenvalues
Retain only those eigenvectors corresponding to the top-k eigenvalues (these are called the principal components)
Project the data onto these k eigenvectors
Heuristics: k = 50, 100, or choose k such that \(\lambda_k/\lambda_{max} > t\)
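The four steps above can be sketched in NumPy as follows. The function names `pca` and `choose_k` are hypothetical helpers introduced here for illustration, and centring the data inside `pca` is an assumption (the covariance formula \(\frac{1}{m}X^TX\) presumes zero-mean columns).

```python
import numpy as np


def pca(X, k):
    """Project the rows of X onto the top-k eigenvectors of X^T X.

    Hypothetical helper, assuming X has shape (m, n) with one data point
    per row. Returns (X_hat, V_k, eigenvalues sorted in descending order).
    """
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)  # centre each column (assumption)

    # Step 1: eigenvectors of the symmetric matrix X^T X.
    eigvals, eigvecs = np.linalg.eigh(X.T @ X)

    # Step 2: sort by eigenvalue, largest first (eigh returns ascending).
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 3: keep the eigenvectors for the top-k eigenvalues.
    V_k = eigvecs[:, :k]

    # Step 4: project the data onto these k eigenvectors.
    X_hat = X @ V_k
    return X_hat, V_k, eigvals


def choose_k(eigvals, t):
    """Count eigenvalues with lambda_k / lambda_max > t (input sorted descending)."""
    return int(np.sum(eigvals / eigvals[0] > t))


# Usage on hypothetical data with two dominant directions.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.2, 0.1])
X_hat, V_k, eigvals = pca(X, k=2)
```

In practice a library routine (for example one based on the SVD) would usually be preferred for numerical stability; this version simply mirrors the steps listed on the slide.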
Reconstruction Error
\mathbf{x} =
\begin{bmatrix}x_{11}\\x_{12}\end{bmatrix} =
\begin{bmatrix}3.3\\3\end{bmatrix}
Suppose
\mathbf{x} = {3.3u_{1} + 3u_{2}}
Let
\mathbf{v_{1}} = \begin{bmatrix}1\\1\end{bmatrix}
\mathbf{v_{2}} = \begin{bmatrix}-1\\1\end{bmatrix}
\mathbf{v_{1}} = \begin{bmatrix}\frac{\mathbf{1}}{\sqrt{2}} \\ \\ \frac{\mathbf{1}}{\sqrt{2}}\end{bmatrix}
\mathbf{v_{2}} = \begin{bmatrix}-\frac{\mathbf{1}}{\sqrt{2}} \\ \\ \frac{\mathbf{1}}{\sqrt{2}}\end{bmatrix}
\frac{6.3}{\sqrt{2}} \mathbf{v_{1}} + \frac{-0.3}{\sqrt{2}}\mathbf{v_{2}} =\begin{bmatrix}3.3\\3\end{bmatrix} =\mathbf{x}
\mathbf{u_2} = \begin{bmatrix}0\\1\end{bmatrix}
\mathbf{u_1} = \begin{bmatrix}1\\0\end{bmatrix}
\begin{bmatrix}3.3\\3\end{bmatrix}
if we use all the n eigenvectors
we will get an exact reconstruction of the data
one data point
new basis vectors
unit norm
\mathbf{x} = b_{11}\mathbf{v_{1}} + b_{12}\mathbf{v_{2}} \\
b_{11} = \mathbf{x^{\top}v_{1}} = \frac{6.3}{\sqrt{2}} \\
b_{12} = \mathbf{x^{\top}v_{2}} = -\frac{0.3}{\sqrt{2}} \\
\frac{6.3}{\sqrt{2}} \mathbf{v_{1}} =\begin{bmatrix}3.15\\3.15\end{bmatrix} =\mathbf{\hat{x}}
but we are going to use fewer eigenvectors (we will throw away \(\mathbf{v_{2}}\))
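The numbers in this example can be reproduced directly. The sketch below uses only the point and basis vectors from the slides; the variable names are chosen here for illustration.

```python
import numpy as np

x = np.array([3.3, 3.0])
v1 = np.array([1.0, 1.0]) / np.sqrt(2)
v2 = np.array([-1.0, 1.0]) / np.sqrt(2)

# Coordinates of x in the new basis.
b11 = x @ v1  # 6.3 / sqrt(2)
b12 = x @ v2  # -0.3 / sqrt(2)

# Using both basis vectors recovers x exactly.
x_full = b11 * v1 + b12 * v2

# Throwing away v2 gives the approximation [3.15, 3.15].
x_hat = b11 * v1

# The squared length of the error equals the squared discarded coordinate.
error = x - x_hat
squared_error = error @ error
```

The error vector is `[0.15, -0.15]`, which is exactly the discarded component `b12 * v2`.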
Reconstruction Error
\mathbf{x} =
\begin{bmatrix}3.3\\3\end{bmatrix}
\mathbf{\hat{x}} =
\begin{bmatrix}3.15\\3.15\end{bmatrix}
\(\mathbf{x}\): original x
\(\mathbf{\hat{x}}\): x reconstructed from fewer eigenvectors
\(\mathbf{x-\hat{x}}\): reconstruction error vector
\((\mathbf{x-\hat{x}})^{\top} (\mathbf{x-\hat{x}})\): squared length of the error
\mathbf{x_{i}} = \sum_{j=1}^{n} b_{ij}\mathbf{v_{j}}
original x - reconstructed from all n eigenvectors
\mathbf{\hat{x_{i}}} = \sum_{j=1}^{k} b_{ij}\mathbf{v_{j}}
reconstructed only from top k eigenvectors
Goal:
\min_{V} \sum_{i=1}^{m} (\mathbf{x_i-\hat{x}_i})^{\top} (\mathbf{x_i-\hat{x}_i})
solving the above optimization problem corresponds to choosing the eigenbasis while discarding the eigenvectors corresponding to the smallest eigenvalues
PCA thus minimizes reconstruction error
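A small experiment illustrates this claim. On synthetic data (an assumption for illustration), the total squared reconstruction error from the top-k eigenvectors equals the sum of the discarded eigenvalues of \(X^TX\), and it is no larger than the error from a randomly chosen orthonormal set of k vectors. The helper `reconstruction_error` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: m = 300 points, n = 4 features, keep k = 2 directions.
m, n, k = 300, 4, 2
X = rng.normal(size=(m, n)) * np.array([4.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)

# Eigenvectors of X^T X, sorted by eigenvalue in descending order.
lam, V = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]


def reconstruction_error(X, basis):
    """Sum over data points of (x - x_hat)^T (x - x_hat).

    basis has orthonormal columns; x_hat is x projected onto their span.
    """
    X_rec = (X @ basis) @ basis.T
    return np.sum((X - X_rec) ** 2)


err_top = reconstruction_error(X, V[:, :k])      # top-k eigenvectors
err_bottom = reconstruction_error(X, V[:, -k:])  # bottom-k eigenvectors

# A random orthonormal set of k vectors, obtained from a QR factorisation.
Q, _ = np.linalg.qr(rng.normal(size=(n, k)))
err_random = reconstruction_error(X, Q)
```

The top-k eigenvectors give the smallest of the three errors and the bottom-k eigenvectors give the largest.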
Learning Objectives (achieved)
What is PCA?
What are some applications of PCA?
Copy of CS6015: Lecture 21
By Mitesh Khapra