\[A = U \Sigma V^T, \qquad A \in \mathbb{R}^{m \times n}\]

  • \(U\): left singular vectors
  • \(V\): right singular vectors
  • \(\Sigma\): singular values

\[A \approx A_k = \sigma_1 u_1 v^T_1 + \sigma_2 u_2 v^T_2 + \cdots + \sigma_k u_k v^T_k\]
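The truncation above can be checked numerically. This is a minimal NumPy sketch; the helper name `truncated_svd` and the random test matrix are illustrative assumptions, not part of the deck.

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k approximation A_k built from the k leading singular triplets."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    # sigma_1 u_1 v_1^T + ... + sigma_k u_k v_k^T, written as one matrix product.
    A_k = (U[:, :k] * S[:k]) @ Vt[:k, :]
    return A_k, S

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))      # hypothetical test matrix
A_k, S = truncated_svd(A, k=5)

# By the Eckart-Young theorem, the spectral-norm error equals sigma_{k+1}.
err = np.linalg.norm(A - A_k, 2)
```

Here `S[5]` is \(\sigma_6\), so `err` and `S[5]` should agree up to rounding.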

Randomized SVD

Input: matrix \(A\), rank \(k\)

Output: \(\sigma_1 u_1 v^T_1 + \cdots + \sigma_k u_k v^T_k\)

\( \textbf{Input: } A \in \mathbb{R}^{m \times n}\), target rank \(k\)

  • Randomly generate \(n \times k\) matrix \(\Omega \sim \mathcal{N}(0,1)\)
  • \(Y \leftarrow A\Omega\)
  • Perform QR decomposition on \(Y\), \(\ \ Y =: QR\)
  • \(B \leftarrow Q^\top A\)
  • Perform SVD on \(B\), \(\ \ B =: \tilde{U} \Sigma V^\top\)
  • \(U \leftarrow Q\tilde{U}\)

\(\textbf{Return: } A \approx \sigma_1 u_1 v_1^\top + \cdots + \sigma_k u_k v_k^\top\)
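The steps above can be sketched in NumPy as follows. The function name `randomized_svd_basic`, the `seed` argument, and the test matrix are assumptions for illustration only.

```python
import numpy as np

def randomized_svd_basic(A, k, seed=0):
    """Rank-k randomized SVD without oversampling (sketch of the steps above)."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    Omega = rng.standard_normal((n, k))   # n x k Gaussian test matrix
    Y = A @ Omega                         # sample the range of A
    Q, _ = np.linalg.qr(Y)                # reduced QR: Q is m x k, orthonormal columns
    B = Q.T @ A                           # small k x n matrix
    U_tilde, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde                       # lift left singular vectors back to R^m
    return U, S, Vt

# Usage on a matrix of exact rank 5 (hypothetical test data).
rng = np.random.default_rng(1)
A = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))
U, S, Vt = randomized_svd_basic(A, k=5)
A_hat = (U * S) @ Vt
```

For a matrix whose rank is exactly \(k\), the sampled range \(Y\) spans the range of \(A\) almost surely, so `A_hat` reproduces `A` up to rounding.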

\( \textbf{Input: } A \in \mathbb{R}^{m \times n}\), target rank \(k\), oversampling parameter \(p\)

  • Let \(l = k + p\)
  • Randomly generate \(n \times l\) matrix \(\Omega \sim \mathcal{N}(0,1)\)
  • \(Y \leftarrow A\Omega\)
  • Perform QR decomposition on \(Y\), \(\ \ Y =: QR\)
  • \(B \leftarrow Q^\top A\)
  • Perform SVD on \(B\), \(\ \ B =: \tilde{U} \Sigma V^\top\)
  • \(U \leftarrow Q\tilde{U}\)

\(\textbf{Return: } A \approx \sigma_1 u_1 v_1^\top + \cdots + \sigma_k u_k v_k^\top\)
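A sketch of the oversampled variant, assuming NumPy. The decomposition is computed with \(l = k + p\) samples and then cut back to the leading \(k\) triplets. The default `p=10`, the function name, and the noisy low-rank test matrix are assumptions, not values from the deck.

```python
import numpy as np

def randomized_svd(A, k, p=10, seed=0):
    """Randomized SVD with oversampling: sample l = k + p columns, keep k triplets."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    l = k + p
    Omega = rng.standard_normal((n, l))   # n x l Gaussian test matrix
    Y = A @ Omega
    Q, _ = np.linalg.qr(Y)                # m x l orthonormal basis
    B = Q.T @ A                           # l x n
    U_tilde, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde
    return U[:, :k], S[:k], Vt[:k, :]     # truncate from l to k

# Hypothetical test data: rank-5 signal plus small noise.
rng = np.random.default_rng(1)
m, n, k = 100, 60, 5
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A = A + 1e-3 * rng.standard_normal((m, n))

U, S, Vt = randomized_svd(A, k)
err = np.linalg.norm(A - (U * S) @ Vt, 2)        # spectral-norm error
S_true = np.linalg.svd(A, compute_uv=False)      # S_true[k] is sigma_{k+1}
```

No rank-\(k\) approximation can beat \(\sigma_{k+1}\), so comparing `err` with `S_true[k]` shows how close the randomized result is to optimal.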

\( \textbf{Input: } A \in \mathbb{R}^{m \times n}\), target rank \(k\), oversampling parameter \(p\)

  • Let \(l = k + p\)
  • Randomly generate \(n \times l\) matrix \(\Omega \sim \mathcal{N}(0,1)\)
  • \(Y \leftarrow A (A^\top A)^2 \Omega\)  (subspace iteration)
  • Perform QR decomposition on \(Y\), \(\ \ Y =: QR\)
  • \(B \leftarrow Q^\top A\)
  • Perform SVD on \(B\), \(\ \ B =: \tilde{U} \Sigma V^\top\)
  • \(U \leftarrow Q\tilde{U}\)

\(\textbf{Return: } A \approx \sigma_1 u_1 v_1^\top + \cdots + \sigma_k u_k v_k^\top\)
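A sketch of the subspace-iteration variant, assuming NumPy. The parameter `q` (number of \(A^\top A\) applications) is an assumed generalization; `q=2` matches \(Y \leftarrow A (A^\top A)^2 \Omega\). The product is applied right to left, so the \(n \times n\) matrix \(A^\top A\) is never formed.

```python
import numpy as np

def randomized_svd_power(A, k, p=10, q=2, seed=0):
    """Randomized SVD with oversampling and q steps of subspace iteration."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    l = k + p
    Omega = rng.standard_normal((n, l))
    Y = A @ Omega
    for _ in range(q):
        # One application of A (A^T A): multiply by A^T, then by A.
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    B = Q.T @ A
    U_tilde, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_tilde
    return U[:, :k], S[:k], Vt[:k, :]

# Hypothetical test data: rank-5 signal plus small noise.
rng = np.random.default_rng(1)
m, n, k = 100, 60, 5
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))
A = A + 1e-3 * rng.standard_normal((m, n))

U, S, Vt = randomized_svd_power(A, k)
err = np.linalg.norm(A - (U * S) @ Vt, 2)
S_true = np.linalg.svd(A, compute_uv=False)
```

Practical implementations often re-orthonormalize \(Y\) between multiplications to limit rounding error; that step is omitted here to stay close to the formula above.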

\( \textbf{Input: } A \in \mathbb{R}^{m \times n}\), target rank \(k\), oversampling parameter \(p\), \(G\) GPUs

  • Let \(l = k + p\).  Partition \(A\) by rows: \(A = [A_0;\ A_1;\ \cdots;\ A_{G-1}]\), \(A_i\) is on GPU \(i\)
  • Randomly generate \(n \times l\) matrix \(\Omega \sim \mathcal{N}(0,1)\)
  • \(\textbf{On each GPU } i\), do \(Y_i \leftarrow A_i \Omega\) and  \(Y_i =: \bar{Q}_i R_i\)
  • \(\textbf{TSQR Reduce:}\)
    • Stack \([R_0;\ R_1;\ \cdots;\ R_{G-1}] =: T \cdot R\)
    • Split \(T = [T_0;\ T_1;\ \cdots;\ T_{G-1}]\)
  • \(\textbf{On each GPU } i\), do \(Q_i \leftarrow \bar{Q}_i T_i\) and \(B_i \leftarrow Q_i^\top A_i\)
  • \(\textbf{Reduce: }\) \(B \leftarrow \sum_i B_i\)
  • \(\textbf{On GPU 0}\), \(\ B =: \tilde{U} \Sigma V^\top\)
  • \(\textbf{On each GPU } i\), \(\ U_i \leftarrow Q_i \tilde{U}\)

\(\textbf{Return: } A \approx \sigma_1 u_1 v_1^\top + \cdots + \sigma_k u_k v_k^\top\)
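The multi-GPU steps can be simulated on one machine with NumPy, with each row block standing in for one GPU. This is only a sketch under that assumption: no real device placement or communication happens, and the function name, `seed`, and test sizes are illustrative.

```python
import numpy as np

def randomized_svd_tsqr(A, k, p=10, G=4, seed=0):
    """Randomized SVD with a TSQR-style reduction over G simulated devices."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    l = k + p
    A_blocks = np.array_split(A, G, axis=0)        # A_i "lives on GPU i"
    Omega = rng.standard_normal((n, l))

    # On each device: Y_i = A_i Omega, then a local QR Y_i = Qbar_i R_i.
    local = [np.linalg.qr(A_i @ Omega) for A_i in A_blocks]
    Qbar = [q for q, _ in local]
    Rs = [r for _, r in local]

    # TSQR reduce: QR of the stacked R_i, then split T back per device.
    T, _ = np.linalg.qr(np.vstack(Rs))
    offsets = np.cumsum([r.shape[0] for r in Rs])[:-1]
    T_blocks = np.split(T, offsets, axis=0)

    # On each device: Q_i = Qbar_i T_i and B_i = Q_i^T A_i; reduce B = sum_i B_i.
    Q_blocks = [Qb @ Ti for Qb, Ti in zip(Qbar, T_blocks)]
    B = sum(Qi.T @ Ai for Qi, Ai in zip(Q_blocks, A_blocks))

    # "On GPU 0": small SVD of B; then each device forms U_i = Q_i U_tilde.
    U_tilde, S, Vt = np.linalg.svd(B, full_matrices=False)
    U_blocks = [Qi @ U_tilde[:, :k] for Qi in Q_blocks]
    return np.vstack(U_blocks), S[:k], Vt[:k, :]

# Usage on a matrix of exact rank 5 (hypothetical test data).
rng = np.random.default_rng(1)
A = rng.standard_normal((80, 5)) @ rng.standard_normal((5, 30))
U, S, Vt = randomized_svd_tsqr(A, k=5, p=5, G=4)
A_hat = (U * S) @ Vt
```

Stacking the \(Q_i\) gives the same orthonormal basis role as \(Q\) in the single-device algorithm, because \(Y = \mathrm{diag}(\bar{Q}_i)\, T R\).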

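Cost of communication per step:

  • TSQR Reduce (gathering the \(l \times l\) factors \(R_i\)): \(O(l^2)\)
  • Reduce \(B \leftarrow \sum_i B_i\) (each \(B_i\) is \(l \times n\)): \(O(nl)\) (most crucial, \(\because n \gg l \approx k\))
  • Sending \(\tilde{U}\) from GPU 0 to each GPU: \(O(lk)\)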
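To see why the \(O(nl)\) term dominates, the entry counts can be compared for concrete sizes. The sizes below are hypothetical, and attributing each cost to a step (the \(R_i\) factors, the \(B_i\) reduction, the \(\tilde{U}\) transfer) is an assumption read off the matrix shapes in the algorithm.

```python
def comm_entries(n, k, p):
    """Matrix entries per message for each communication step (illustrative)."""
    l = k + p
    return {
        "tsqr_reduce_R_i": l * l,   # one l x l factor R_i
        "reduce_B_i": n * l,        # one l x n partial product B_i
        "send_U_tilde": l * k,      # leading k columns of the l x l matrix U_tilde
    }

# Hypothetical sizes: n = 1,000,000 columns, k = 100, p = 10.
costs = comm_entries(n=1_000_000, k=100, p=10)
ratio = costs["reduce_B_i"] / costs["tsqr_reduce_R_i"]   # equals n / l
```

With these sizes the \(B_i\) reduction moves \(n/l\) times more entries than the TSQR reduce.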

deck

By Gino
