\[A = U \Sigma V^T, \qquad A \in \mathbb{R}^{m \times n}\]

\(U\): left singular vectors, \(\Sigma\): singular values, \(V\): right singular vectors

\[A \approx A_k = \sigma_1 u_1 v^T_1 + \sigma_2 u_2 v^T_2 + \cdots + \sigma_k u_k v^T_k\]
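The rank-\(k\) sum above can be checked numerically. The following is a minimal NumPy sketch; the helper name `truncated_svd` and the random test matrix are assumptions made for illustration, not part of the deck.

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k truncation of A taken from its full (thin) SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the k largest singular triplets (sigma_i, u_i, v_i).
    return U[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))          # illustrative test matrix
U_k, s_k, Vt_k = truncated_svd(A, k=5)
A_k = (U_k * s_k) @ Vt_k                   # sum_i sigma_i u_i v_i^T

# Eckart-Young: the Frobenius error is the norm of the discarded singular values.
s_all = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k)
```

This route needs the full SVD of \(A\), which is the cost that randomized SVD tries to avoid.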

Randomized SVD

Input: matrix \(A\), rank \(k\)

Output: \(\sigma_1 u_1 v^T_1 + \cdots + \sigma_k u_k v^T_k\)

\( \textbf{Input: } A \in \mathbb{R}^{m \times n}\), target rank \(k\)

Randomly generate \(n \times k\) matrix \(\Omega\)
\(Y \leftarrow A\Omega\)
Perform QR decomposition on \(Y\), \(\ \ Y =: QR\)
\(B \leftarrow Q^\top A\)
Perform SVD on \(B\), \(\ \ B =: \tilde{U} \Sigma V^\top\)
\(U \leftarrow Q\tilde{U}\)

\(\textbf{Return: } A \approx \sigma_1 u_1 v_1^\top + \cdots + \sigma_k u_k v_k^\top\)
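The steps above map almost line by line onto NumPy. This is a sketch under assumptions: the function name `rsvd_basic` is hypothetical, and a Gaussian \(\Omega\) is assumed because this slide only says "randomly generate".

```python
import numpy as np

def rsvd_basic(A, k, seed=0):
    """Basic randomized SVD without oversampling (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, k))    # random n x k test matrix (assumed Gaussian)
    Y = A @ Omega                          # m x k sketch of the range of A
    Q, _ = np.linalg.qr(Y)                 # orthonormal basis, Y = QR
    B = Q.T @ A                            # small k x n matrix
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde                        # lift back to m dimensions
    return U, s, Vt

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 4)) @ rng.standard_normal((4, 40))   # exactly rank 4
U, s, Vt = rsvd_basic(A, k=4)
A_approx = (U * s) @ Vt
```

For a matrix of exact rank \(k\) the sketch recovers \(A\) up to rounding; for general matrices the approximation quality depends on how fast the singular values decay.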

\( \textbf{Input: } A \in \mathbb{R}^{m \times n}\), target rank \(k\), oversampling parameter \(p\)

Let \(l = k + p\)
Randomly generate \(n \times l\) matrix \(\Omega\)
\(Y \leftarrow A\Omega\)
Perform QR decomposition on \(Y\), \(\ \ Y =: QR\)
\(B \leftarrow Q^\top A\)
Perform SVD on \(B\), \(\ \ B =: \tilde{U} \Sigma V^\top\)
\(U \leftarrow Q\tilde{U}\)

\(\textbf{Return: } A \approx \sigma_1 u_1 v_1^\top + \cdots + \sigma_k u_k v_k^\top\)
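With oversampling, the sketch uses \(l = k + p\) columns and only the leading \(k\) triplets are returned. A minimal sketch, assuming a Gaussian \(\Omega\) and a default of \(p = 10\) (both choices are assumptions, as is the name `rsvd_oversampled`):

```python
import numpy as np

def rsvd_oversampled(A, k, p=10, seed=0):
    """Randomized SVD with oversampling parameter p (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    l = k + p                              # sketch width
    Omega = rng.standard_normal((n, l))    # assumed Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)         # m x l orthonormal basis
    B = Q.T @ A                            # l x n
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    # Truncate from l back to the requested rank k.
    return Q @ U_tilde[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(2)
A = rng.standard_normal((80, 50))          # illustrative test matrix
U, s, Vt = rsvd_oversampled(A, k=5, p=10)
```

The extra \(p\) columns cost little when \(l \ll n\), and they give the sketch more room to capture the leading singular subspace.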

\( \textbf{Input: } A \in \mathbb{R}^{m \times n}\), target rank \(k\), oversampling parameter \(p\)

Let \(l = k + p\)
Randomly generate \(n \times l\) matrix \(\Omega\)
\(Y \leftarrow A(A^\top A)\Omega\) (subspace iteration)
Perform QR decomposition on \(Y\), \(\ \ Y =: QR\)
\(B \leftarrow Q^\top A\)
Perform SVD on \(B\), \(\ \ B =: \tilde{U} \Sigma V^\top\)
\(U \leftarrow Q\tilde{U}\)

\(\textbf{Return: } A \approx \sigma_1 u_1 v_1^\top + \cdots + \sigma_k u_k v_k^\top\)
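The only change from the oversampled version is the sketch \(Y = A(A^\top A)\Omega\). A minimal sketch with one subspace iteration, as on the slide; the name `rsvd_subspace` and the Gaussian \(\Omega\) are assumptions. Practical implementations often re-orthogonalize between the multiplications for numerical stability, which this sketch omits.

```python
import numpy as np

def rsvd_subspace(A, k, p=5, seed=0):
    """Randomized SVD with one subspace iteration (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    l = k + p
    Omega = rng.standard_normal((n, l))    # assumed Gaussian test matrix
    # Y = A (A^T A) Omega, evaluated right to left so no n x n matrix is formed.
    Y = A @ (A.T @ (A @ Omega))
    Q, _ = np.linalg.qr(Y)
    B = Q.T @ A
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_tilde[:, :k], s[:k], Vt[:k, :]

rng = np.random.default_rng(4)
A = rng.standard_normal((80, 50))          # illustrative test matrix
U, s, Vt = rsvd_subspace(A, k=5, p=5)
```

Multiplying by \(A^\top A\) cubes the singular values seen by the sketch, so the leading directions stand out more against the tail.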

\( \textbf{Input: } A \in \mathbb{R}^{m \times n}\), target rank \(k\), oversampling parameter \(p\), \(G\) GPUs

Let \(l = k + p\). Partition \(A\) by rows: \(A = [A_0;\ A_1;\ \cdots;\ A_{G-1}]\), \(A_i\) is on GPU \(i\)
Randomly generate \(n \times l\) matrix \(\Omega \sim \mathcal{N}(0,1)\)
\(\textbf{On each GPU } i\), do \(Y_i \leftarrow A_i \Omega\) and \(Y_i =: \bar{Q}_i R_i\)
\(\textbf{TSQR:}\)
- Stack \([R_0;\ R_1;\ \cdots;\ R_{G-1}] =: T \cdot R\)
- Split \(T = [T_0;\ T_1;\ \cdots;\ T_{G-1}]\)
- \(\textbf{On each GPU } i\), do \(Q_i \leftarrow \bar{Q}_i T_i\)
\(\textbf{Subspace iteration:}\)
- \(\textbf{On each GPU } i\), do \(Z_i \leftarrow A_i^\top Q_i\) then reduce \(Z \leftarrow \sum_i Z_i\)
- \(Y_i \leftarrow A_i Z\), and re-orthogonalize \(Y_i\) using the same TSQR process to update \(Q_i\)
\(\textbf{On each GPU } i\), do \(B_i \leftarrow Q_i^\top A_i\)
\(\textbf{Reduce: }\) \(B \leftarrow \sum_i B_i\)
\(\textbf{On GPU 0}\), \(\ B =: \tilde{U} \Sigma V^\top\). \(\textbf{On each GPU } i\), \(\ U_i \leftarrow Q_i \tilde{U}\)

\(\textbf{Return: } A \approx \sigma_1 u_1 v_1^\top + \cdots + \sigma_k u_k v_k^\top\)
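The multi-GPU steps can be simulated on one machine by treating each row block as a "GPU". This is a single-process NumPy sketch under assumptions: the function names are hypothetical, the Python `sum` calls stand in for the reduce steps, and every block must have at least \(l\) rows so that each \(R_i\) is \(l \times l\).

```python
import numpy as np

def tsqr(Y_blocks):
    """One-level TSQR: local QRs, then one QR of the stacked R factors."""
    local = [np.linalg.qr(Y_i) for Y_i in Y_blocks]     # Y_i = Qbar_i R_i
    R_stack = np.vstack([R_i for _, R_i in local])      # gathered, (G*l) x l
    T, R = np.linalg.qr(R_stack)                        # stacked R = T R
    T_blocks = np.split(T, len(Y_blocks), axis=0)       # T_i is l x l
    Q_blocks = [Qbar_i @ T_i for (Qbar_i, _), T_i in zip(local, T_blocks)]
    return Q_blocks, R

def rsvd_multi_gpu(A_blocks, k, p=5, seed=0):
    """Simulated row-partitioned randomized SVD with one subspace iteration."""
    rng = np.random.default_rng(seed)
    n = A_blocks[0].shape[1]
    l = k + p
    Omega = rng.standard_normal((n, l))   # same Omega on every simulated GPU
    Q_blocks, _ = tsqr([A_i @ Omega for A_i in A_blocks])
    # Subspace iteration: Z = sum_i A_i^T Q_i (stands in for a reduce of n x l).
    Z = sum(A_i.T @ Q_i for A_i, Q_i in zip(A_blocks, Q_blocks))
    Q_blocks, _ = tsqr([A_i @ Z for A_i in A_blocks])
    # B = sum_i Q_i^T A_i (stands in for a reduce of l x n).
    B = sum(Q_i.T @ A_i for Q_i, A_i in zip(Q_blocks, A_blocks))
    U_tilde, s, Vt = np.linalg.svd(B, full_matrices=False)   # "on GPU 0"
    U_blocks = [Q_i @ U_tilde[:, :k] for Q_i in Q_blocks]
    return U_blocks, s[:k], Vt[:k, :]

rng = np.random.default_rng(3)
A = rng.standard_normal((80, 3)) @ rng.standard_normal((3, 30))   # exactly rank 3
A_blocks = np.split(A, 4, axis=0)         # 4 simulated GPUs, 20 rows each
U_blocks, s, Vt = rsvd_multi_gpu(A_blocks, k=3, p=5)
U = np.vstack(U_blocks)
```

Stacking the \(Q_i\) gives an orthonormal basis because \(Q = \mathrm{diag}(\bar{Q}_i)\,T\) and both factors have orthonormal columns.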

\( \textbf{Input: } A \in \mathbb{R}^{m \times n}\), target rank \(k\), oversampling parameter \(p\), \(G\) GPUs

Let \(l = k + p\). Partition \(A\) by rows: \(A = [A_0;\ A_1;\ \cdots;\ A_{G-1}]\), \(A_i\) is on GPU \(i\)
Randomly generate \(n \times l\) matrix \(\Omega \sim \mathcal{N}(0,1)\)
\(\textbf{On each GPU } i\), do \(Y_i \leftarrow A_i \Omega\) and \(Y_i =: \bar{Q}_i R_i\)
\(\textbf{TSQR Reduce:}\)
- Stack \([R_0;\ R_1;\ \cdots;\ R_{G-1}] =: T \cdot R\)
- Split \(T = [T_0;\ T_1;\ \cdots;\ T_{G-1}]\)
\(\textbf{On each GPU } i\), do \(Q_i \leftarrow \bar{Q}_i T_i\) and \(B_i \leftarrow Q_i^\top A_i\)
\(\textbf{Reduce: }\) \(B \leftarrow \sum_i B_i\)
\(\textbf{On GPU 0}\), \(\ B =: \tilde{U} \Sigma V^\top\)
\(\textbf{On each GPU } i\), \(\ U_i \leftarrow Q_i \tilde{U}\)

\(\textbf{Return: } A \approx \sigma_1 u_1 v_1^\top + \cdots + \sigma_k u_k v_k^\top\)

Cost of communication:
- TSQR reduce (gathering the \(R_i\), splitting \(T\)): \(O(l^2)\)
- Reducing \(B\): \(O(nl)\) (most crucial, since \(n \gg l \approx k\))
- Sending \(\tilde{U}\) from GPU 0: \(O(lk)\)
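To see why the \(O(nl)\) term dominates, the three costs can be evaluated for concrete sizes. This is a rough counting sketch: the function `comm_entries`, the example sizes, and the model (matrix entries exchanged per GPU per step, ignoring the number of GPUs and the collective algorithm used) are all assumptions.

```python
def comm_entries(n, k, p):
    """Matrix entries each GPU exchanges per communication step (assumed model)."""
    l = k + p
    return {
        "tsqr_R": l * l,          # one l x l factor R_i per GPU
        "reduce_B": l * n,        # one l x n partial product B_i per GPU
        "send_U_tilde": l * k,    # leading k columns of the l x l matrix U~
    }

costs = comm_entries(n=100_000, k=50, p=10)
dominant = max(costs, key=costs.get)
```

With these example sizes, reducing \(B\) moves 6,000,000 entries per GPU, against 3,600 for TSQR and 3,000 for \(\tilde{U}\).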

\(\textbf{Input: } A \in \mathbb{R}^{m \times n}\), target rank \(k\), oversampling \(p\), \(G\) GPUs

Set \(l = k+p\). Row-partition \(A = [A_0;\cdots;A_{G-1}]\), where \(A_i\) is on GPU \(i\)
Randomly generate \(n \times l\) matrix \(\Omega\), then \(Y_i \leftarrow A_i\Omega\)
Distributed QR by TSQR: \(Y_i \Rightarrow Q_i\)
Subspace iteration: \(Z_i \leftarrow A_i^\top Q_i\), then reduce \(Z \leftarrow \sum_i Z_i\)
Re-sketch and re-orthogonalize: \(Y_i \leftarrow A_iZ,\ Y_i \Rightarrow Q_i\)
Build small matrix: \(B_i \leftarrow Q_i^\top A_i\), then reduce \(B \leftarrow \sum_i B_i\)
Small SVD on GPU 0: \(B = \tilde{U}\Sigma V^\top\)
Left singular vectors: \(U_i \leftarrow Q_i\tilde{U}\)

\(\textbf{Return: } A \approx U_k\Sigma_k V_k^\top\)

Cost of communication:
- Reducing \(Z\): \(O(nl)\)
- Reducing \(B\): \(O(nl)\)
- TSQR: \(O(l^2)\)
- Sending \(\tilde{U}\) from GPU 0: \(O(lk)\)

The \(O(nl)\) reductions are the most crucial, since \(n \gg l \approx k\).

\( e^* = \sqrt{\frac{\sum_{i=k+1}^{n} (i+1)^{-2p}}{\sum_{i=1}^{n} (i+1)^{-2p}}} \)
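One reading of this formula, stated here as an assumption because the deck does not spell it out: if a test matrix has singular values \(\sigma_i = (i+1)^{-p}\), then \(e^*\) is the best possible relative Frobenius error of any rank-\(k\) approximation. Under that reading \(p\) is a decay exponent, not the oversampling parameter used earlier. The sketch below evaluates \(e^*\) and compares it with a truncated SVD of such a matrix; the name `optimal_relative_error` and the sizes are illustrative.

```python
import numpy as np

def optimal_relative_error(n, k, p):
    """e* from the formula above, with weights (i+1)^(-2p) for i = 1..n."""
    i = np.arange(1, n + 1)
    w = (i + 1.0) ** (-2 * p)
    return np.sqrt(w[k:].sum() / w.sum())   # w[k:] covers i = k+1..n

n, k, p = 20, 5, 1.0
e_star = optimal_relative_error(n, k, p)

# Assumed test matrix: diagonal, with singular values sigma_i = (i+1)^(-p).
sigma = (np.arange(1, n + 1) + 1.0) ** (-p)
A = np.diag(sigma)
U, s, Vt = np.linalg.svd(A)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
```

Under this assumption, a randomized SVD result can be judged by how close its relative error comes to \(e^*\).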

By Gino
