Dimension Free Matrix Tail Bounds and Applications

Min-Hsiu Hsieh, UTS

with C. Zhang and D. Tao [arXiv: 1910.03718]

Tail Bounds

Let \(X_1,\cdots,X_n\) be independent, zero-mean r.v. so that \(|X_i|\leq L\) for all \(i\).

Then \(Y=\sum_{i=1}^n X_i\) satisfies

\mathbb{P}\left\{|Y|\geq t\right\}\leq 2 \exp\left\{\frac{-t^2/2}{\sigma(Y)+Lt/3}\right\}

Matrix Tail Bounds

Let \(\mathbf{X}_1,\cdots,\mathbf{X}_n\) be independent, zero-mean, \(d_1\times d_2\) random matrices so that \(\|\mathbf{X}_i\|\leq L\) for all \(i\).

Then \(\mathbf{Y}=\sum_{i=1}^n \mathbf{X}_i\) satisfies

\mathbb{P}\left\{\|\mathbf{Y}\|\geq t\right\}\leq (d_1+d_2)\, e^{\frac{-t^2/2}{\sigma(\mathbf{Y})+Lt/3}}

(*)

Tropp. User-friendly tail bounds for sums of random matrices, Found. Comput. Math., Aug 2011.
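A minimal Monte Carlo sketch (our own illustration, not from the paper; the ensemble and sample sizes are arbitrary choices) comparing the empirical tail with the bound (*) for a Hermitian example, where the prefactor is \(d\) rather than \(d_1+d_2\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 20, 50, 2000

# Fixed Hermitian matrices A_i; X_i = eps_i * A_i with independent signs
# eps_i = +/-1, so each X_i is zero-mean with ||X_i|| <= L.
As = [rng.normal(size=(d, d)) for _ in range(n)]
As = [(A + A.T) / (2 * np.sqrt(d)) for A in As]
L = max(np.linalg.norm(A, 2) for A in As)
sigma = np.linalg.norm(sum(A @ A for A in As), 2)  # sigma(Y) = ||sum_i E X_i^2||

norms = np.array([
    np.linalg.norm(sum(rng.choice([-1.0, 1.0]) * A for A in As), 2)
    for _ in range(trials)
])

for t in np.linspace(np.sqrt(sigma), 3 * np.sqrt(sigma), 5):
    empirical = (norms >= t).mean()
    bound = d * np.exp(-(t**2 / 2) / (sigma + L * t / 3))  # Hermitian prefactor d
    print(f"t={t:6.2f}  empirical={empirical:.4f}  bound={min(bound, 1.0):.4f}")
```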

Discussion

1. The bound contains the matrix dimensions \(d_1+d_2\).

2. The first proof was given by Ahlswede & Winter in 2002.

3. D. Gross proved another version for matrix completion in 2011.

4. Lieb's concavity theorem is used in the proof of (*).

Background

Standard Matrix Function

f(\mathbf{A}) = \sum_{i} f(\lambda_i) |\nu_i\rangle\langle \nu_i| \quad \text{if}\ \mathbf{A}=\sum_i \lambda_i|\nu_i\rangle\langle\nu_i|

Hermitian Dilation

\mathcal{H}(\mathbf{B}) = \left[\begin{array}{cc} 0 & \mathbf{B} \\ \mathbf{B}^\dagger & 0 \end{array}\right]
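Both constructions are easy to state in numpy; this sketch (our own illustration) applies \(f\) spectrally and checks that the Hermitian dilation preserves the spectral norm:

```python
import numpy as np

def matrix_function(A, f):
    """Apply a scalar function f to a Hermitian A via its eigendecomposition."""
    lam, V = np.linalg.eigh(A)            # A = sum_i lam_i |v_i><v_i|
    return (V * f(lam)) @ V.conj().T      # f(A) = sum_i f(lam_i) |v_i><v_i|

def hermitian_dilation(B):
    """H(B) = [[0, B], [B^dagger, 0]]."""
    m, n = B.shape
    return np.block([[np.zeros((m, m)), B],
                     [B.conj().T, np.zeros((n, n))]])

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4)); A = (A + A.T) / 2
B = rng.normal(size=(3, 5))

print(np.allclose(matrix_function(A, np.square), A @ A))    # f(A) matches A^2
print(np.isclose(np.linalg.norm(hermitian_dilation(B), 2),  # ||H(B)|| = ||B||
                 np.linalg.norm(B, 2)))
```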

Warm Up (Laplace Transform Method)

For all \(\theta>0\),

\mathbb{P}\{\lambda_{\max}({\bf Y})\geq t\}
= \mathbb{P}\{e^{ \lambda_{\max}({\theta \bf Y})}\geq e^{\theta t}\}
\leq {\rm e}^{-\theta t}\cdot \mathbb{E}\,{\rm e}^{ {\lambda_{\max}(\theta\bf Y})}
\leq {\rm e}^{-\theta t}\cdot {\mathbb{E}}\,\text{Tr}\,{ {\rm e}^{\theta {\bf Y}}}

Golden-Thompson

\text{GT}:\, \text{tr}\, e^{\mathbf{A}+\mathbf{B}} \leq \text{tr}\, e^{\mathbf{A}}e^{\mathbf{B}}

Applying GT and independence repeatedly,

{\mathbb{E}}\,\text{Tr}\, { {\rm e}^{\theta {\bf Y}}} ={\mathbb{E}}\,\text{Tr}\, { {\rm e}^{\theta \sum_{i=1}^n{\bf X}_i}}
\leq \text{Tr}\,\textcolor{red}{\left(\mathbb{E}\, {\rm e}^{\theta \sum_{i=1}^{n-1} {\bf X}_i}\right)\left( \lambda_{\max}(\mathbb{E}\,e^{\theta{\bf X}_n})\, I_d\right)}
\leq (\text{Tr}\,I_d)\,\left[ \prod_{i=1}^n \lambda_{\max}(\mathbb{E}\,e^{\theta{\bf X}_i}) \right]
= \text{dim}(\mathbf{Y})\,e^{ \sum_{i=1}^n \lambda_{\max}( \log\mathbb{E}\,e^{\theta{\bf X}_i}) }

Ahlswede and Winter. IEEE Trans. Inform. Theory, 48(3):569-579, 2002.
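A quick numerical sanity check of GT (our own illustration; requires scipy for the matrix exponential):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
for _ in range(5):
    A = rng.normal(size=(6, 6)); A = (A + A.T) / 2   # random Hermitian pair
    B = rng.normal(size=(6, 6)); B = (B + B.T) / 2
    lhs = np.trace(expm(A + B))
    rhs = np.trace(expm(A) @ expm(B))
    print(f"tr e^(A+B) = {lhs:12.3f}  <=  tr e^A e^B = {rhs:12.3f}: {lhs <= rhs}")
```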

Lieb's Concavity Thm

\(\mathbf{A} \mapsto \text{Tr} \,e^{\mathbf{H}+\log\mathbf{A}}\) is concave.

By Jensen's inequality, this yields

\(\mathbb{E}_X\text{Tr}\,e^{\textcolor{blue}{\mathbf{H}}+\textcolor{red}{\mathbf{X}}} \leq  \text{Tr} \,e^{\textcolor{blue}{\mathbf{H}}+\textcolor{red}{\log\mathbb{E}_X e^\mathbf{X}}}\)

Applying this bound iteratively,

{\mathbb{E}}\,\text{Tr}\, { {\rm e}^{\theta {\bf Y}}} ={\mathbb{E}}\,\text{Tr}\, { {\rm e}^{\theta \sum_{i=1}^n{\bf X}_i}}
\leq \mathbb{E}\, \text{Tr}\,e^{ \textcolor{blue}{\theta \sum_{i=1}^{n-1} {\bf X}_i } + \textcolor{red}{ \log \mathbb{E}e^{\theta\mathbf{X}_n}}}
\leq \text{Tr}\, {e}^{\textcolor{red}{\sum_{i=1}^{n} \log \mathbb{E} e^{\theta\mathbf{X}_i}}}
\leq \text{dim}(\mathbf{Y})\,\textcolor{red}{ {e}^{\lambda_{\max}(\sum_{i=1}^{n} \log \mathbb{E} e^{\theta\mathbf{X}_i})}}

Tropp. User-friendly tail bounds for sums of random matrices, Found. Comput. Math., Aug 2011.

To obtain (*), the last ingredient is

\log \mathbb{E}\, e^{\theta \mathbf{X}_i} \preceq g(\theta)\, \mathbb{E}\, \mathbf{X}_i^2, \qquad g(\theta)= \frac{\theta^2/2}{1-\theta L/3},

which gives

\mathbb{P}\{\lambda_{\max}({\bf Y})\geq t\} \leq e^{-\theta t}\,\text{tr}\, e^{g(\theta) \sum_{i} \mathbb{E}\,\mathbf{X}_i^2} \leq d\, e^{-\theta t} e^{g(\theta) \sigma (\mathbf{Y})}

(**)

Issues of (**)

\mathbb{P}\left\{\|\mathbf{Y}\|\geq t\right\}\leq \inf_{\theta>0} d\, e^{-\theta t + g(\theta)\sigma(\mathbf{Y})}

(**)

1. Not suitable for high-dimensional or infinite-dimensional matrices.

2. Only applicable to the spectral norm.

Our Improvement

\mathbb{P}\left\{ \textcolor{red}{\mu} \left( \sum_{k=1}^K{\bf X}_k\right)\geq t \right\} \leq \inf_{\theta>0}\left\{{\rm e}^{-\theta t+g(\theta,K)\cdot \textcolor{red}{\phi}} \right\}

1. \(\mu:\mathbb{M}\to \mathbb{R}\) satisfies (a quick numerical check follows this list)

(i) \(\mu(\mathbf{A})\geq 0\) (nonnegativity)

(ii) \(\mu(\theta\mathbf{A})=\theta\mu(\mathbf{A})\) (positive homogeneity)

(iii) \(\mu(\mathbf{A}+\mathbf{B})\leq\mu(\mathbf{A})+\mu(\mathbf{B})\) (subadditivity)

2. \(\phi=O(U^K)\), where \(\mathbb{E} \mu({\bf X}_k) \leq U\) for all \(k\).
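A quick randomized check (our own illustration) that matrix norms satisfy (i)-(iii); here \(\mu\) is the nuclear norm and the spectral norm:

```python
import numpy as np

rng = np.random.default_rng(3)
A, B, theta = rng.normal(size=(5, 5)), rng.normal(size=(5, 5)), 2.7

for name, mu in [("nuclear", lambda M: np.linalg.norm(M, 'nuc')),
                 ("spectral", lambda M: np.linalg.norm(M, 2))]:
    print(name,
          mu(A) >= 0,                                # (i)   nonnegativity
          np.isclose(mu(theta * A), theta * mu(A)),  # (ii)  positive homogeneity
          mu(A + B) <= mu(A) + mu(B) + 1e-12)        # (iii) subadditivity
```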

Discussion

1. Instead of \(d_1+d_2\), we have \(e^{O(U^K)}\), where \(\mathbb{E} \mu({\bf X}_k) \leq U\).

2. The matrix function \(\mu(\cdot)\) can be chosen to be any matrix norm, among other functions.

Compare our bound with (**):

\mathbb{P}\left\{ \textcolor{red}{\mu} \left( {\bf Y}\right)\geq t \right\} \leq \inf_{\theta>0}\left\{ {\rm e}^{-\theta t+g(\theta,K)\cdot \textcolor{red}{\phi}} \right\}

\mathbb{P}\left\{\|\mathbf{Y}\|\geq t\right\}\leq \inf_{\theta>0} d\, e^{-\theta t + g(\theta)\sigma(\mathbf{Y})}

(**)

Numerics

[Figure: empirical tails versus the bounds for K=10 and K=20. Random Hermitian matrices with \(d=200\) and each entry obeying \(\mathcal{N}(0,1)\).]
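A sketch of the experiment as described in the caption (our own reconstruction; the thresholds and trial counts are our own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
d, trials = 200, 100

for K in (10, 20):
    norms = []
    for _ in range(trials):
        Y = np.zeros((d, d))
        for _ in range(K):
            G = rng.normal(size=(d, d))
            Y += (G + G.T) / 2              # random Hermitian summand
        norms.append(np.linalg.norm(Y, 2))
    norms = np.array(norms)
    edge = np.sqrt(2 * K * d)               # rough spectral-edge scale
    for t in (edge, 1.05 * edge):
        print(f"K={K:2d}  P(||Y|| >= {t:6.1f}) ~ {np.mean(norms >= t):.2f}")
```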

Expectation Bound

Our bound:

\mathbb{E}\left\{ \mu\left({\bf Y}\right) \right\}\leq \phi \left(\sqrt{2 \alpha_2(c) } + \frac{c\, \alpha_2(c)}{3} \right)

where \(\phi=O(U^K)\), \(\max_k\mathbb{E}\, \mu({\bf X}_k) \leq U\), and

\alpha_2(c) = \frac{3[(c+3) - \sqrt{6c+9}] }{c^2}

Compare the classical expectation bound:

\mathbb{E}\left\{ \lambda_{\max}\left({\bf Y}\right) \right\}\leq \sqrt{2 \sigma(\mathbf{Y})\log d} + \frac{L\log d}{3}

Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning 8, 1-2 (2015), 1–230.

The expectation bound can be used to analyze

  1. Matrix Approximation

  2. Matrix Sparsification  

  3. Matrix Multiplication

Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning 8, 1-2 (2015), 1–230.

Matrix Random Series

\mathbf{Y} = \sum_{i=1}^K \xi_i \mathbf{A}_i

where \(\{\xi_i\}\) are independent random variables and \(\{\mathbf{A}_i\}\) are fixed matrices.

Examples: Gaussian Wigner matrices, matrix Rademacher series.

Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning 8, 1-2 (2015), 1–230.
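A minimal matrix Rademacher series in numpy (our own illustration), together with its natural variance scale \(\sigma=\|\sum_i \mathbf{A}_i^2\|^{1/2}\):

```python
import numpy as np

rng = np.random.default_rng(5)
d, K = 50, 30
A = [np.diag(rng.normal(size=d)) for _ in range(K)]  # fixed matrices A_i
xi = rng.choice([-1.0, 1.0], size=K)                 # Rademacher signs xi_i
Y = sum(x * a for x, a in zip(xi, A))

sigma = np.sqrt(np.linalg.norm(sum(a @ a for a in A), 2))
print(np.linalg.norm(Y, 2), sigma)                   # ||Y|| vs. its natural scale
```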

Applications of Matrix Random Series include

  1. Optimization

  2. Sample Complexity

Azuma–Hoeffding Inequality

A sequence \(\{X_1,X_2,\cdots\}\) is a martingale if

\mathbb{E}[|X_i|]< \infty
\mathbb{E}[X_{n+1}|X_1,\cdots,X_{n}] = X_n

A matrix martingale \(\{\mathbf{X}_1,\mathbf{X}_2,\cdots\}\) satisfies

\mathbb{E}\|\mathbf{X}_i\|< \infty
\mathbb{E}_n[\mathbf{X}_{n+1}] = \mathbf{X}_n

Define the difference sequence \(\{\mathbf{Z}_i\}\), where \(\mathbf{Z}_i=\mathbf{X}_i-\mathbf{X}_{i-1}\).
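A toy matrix martingale (our own illustration): partial sums of independent zero-mean Hermitian increments, for which \(\mathbb{E}_n[\mathbf{X}_{n+1}]=\mathbf{X}_n+\mathbb{E}[\mathbf{Z}_{n+1}]=\mathbf{X}_n\):

```python
import numpy as np

rng = np.random.default_rng(6)
d, N = 10, 8

Z = []
for _ in range(N):
    G = rng.normal(size=(d, d))
    Z.append((G + G.T) / 2)       # independent zero-mean Hermitian increment
X = np.cumsum(Z, axis=0)          # martingale path X_n = Z_1 + ... + Z_n

# The difference sequence recovers the increments: Z_i = X_i - X_{i-1}.
diffs = np.diff(np.concatenate([np.zeros((1, d, d)), X]), axis=0)
print(np.allclose(diffs, Z))
```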

Matrix Azuma–Hoeffding

For \(\mathbf{Y}=\sum_k \mathbf{Z}_k\) with a bounded difference sequence \(\{\mathbf{Z}_k\}\),

\mathbb{P}\{\lambda_{\max}(\mathbf{Y})\geq t\} \leq d\, \exp\left\{\frac{-t^2}{8\sigma^2}\right\}

Tropp. User-friendly tail bounds for sums of random matrices, Found. Comput. Math., Aug 2011.

Our dimension-free counterpart:

\mathbb{P}\left\{\mu\left({\bf Y} \right)\geq t \right\} \leq {\rm e}^{\frac{\phi_{\widetilde{\Omega}}}{4}}\cdot\exp\left\{ -\frac{t^2}{4\phi_{\widetilde{\Omega}}}\right\}

Applications

1. Matrix Approximation

2. Optimization

3. Matrix Expander Graph

4. Quantum Hypergraph

5. Compressed Sensing

6. Random Process

1. Matrix Approximation

{\bf B} = \sum_{l=1}^L {\bf B}_l \in \mathbb{R}^{m\times n}

Construct an unbiased random matrix \(\mathbf{R}\) so that \(\mathbb{E} \mathbf{R} = \mathbf{B}\), e.g.,

\mathbb{P}\{\mathbf{R} = p_\ell^{-1}\mathbf{B}_\ell \} = p_\ell,

and average \(K\) independent copies:

\widehat{\mathbf{R}}_K = \frac{1}{K}\sum_{k=1}^K {\bf R}_k

Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning 8, 1-2 (2015), 1–230.
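A sketch of the sampling estimator (our own illustration; probabilities proportional to the Frobenius norm are one common importance-sampling choice, not necessarily the talk's):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n, Lterms, K = 8, 6, 40, 500
Bs = [rng.normal(size=(m, n)) / Lterms for _ in range(Lterms)]
B = sum(Bs)                                    # B = sum_l B_l

w = np.array([np.linalg.norm(Bl) for Bl in Bs])
p = w / w.sum()                                # sampling probabilities p_l

def sample_R():
    l = rng.choice(Lterms, p=p)
    return Bs[l] / p[l]                        # E[R] = sum_l p_l (B_l / p_l) = B

R_hat = sum(sample_R() for _ in range(K)) / K  # unbiased average of K copies
print(np.linalg.norm(R_hat - B, 2), np.linalg.norm(B, 2))
```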

1. Matrix Approximation

Use (*) to show a dimension-dependent bound on \(\mathbb{P}\{\|\widehat{{\bf R}}_K-\mathbf{B} \|>t\}\).

In addition, \(\mathbb{E}  \| \widehat{{\bf R}}_K - {\bf B}\|\leq 2\epsilon\) if

K \geq \frac{2\sigma({\bf R}) \log (m+n)}{\epsilon^2} + \frac{2L \log (m+n)}{3\epsilon}.

Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning 8, 1-2 (2015), 1–230.
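A tiny helper (our own illustration; the inputs are example values) evaluating the displayed sufficient number of copies:

```python
import numpy as np

def copies_needed(sigma, L, m, n, eps):
    """Smallest integer K satisfying the displayed sufficient condition."""
    logd = np.log(m + n)
    return int(np.ceil(2 * sigma * logd / eps**2 + 2 * L * logd / (3 * eps)))

print(copies_needed(sigma=4.0, L=1.5, m=1000, n=500, eps=0.1))
```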

1. Matrix Approximation

In contrast, we show \(\mathbb{E} \mu( \widehat{{\bf R}}_K - {\bf B}) \leq \epsilon \) if

\(\max\limits_{k}   \mu( {\bf R}_k - {\bf B}) \leq  \sqrt{1+2\epsilon \mu( {\bf B} ) }-1\)

Our result emphasizes the importance of the approximation quality between \(\mathbf{B}\) and \(\mathbf{R}_k\) when the number of copies \(K\) is fixed.

2. Optimization

Chance Constrained Optimization:

\min_{{\bf x}\in\mathbb{R}^N} {\bf c}^T {\bf x}

subject to

{\bf F}({\bf x})\leq {\bf 0}
\mathbb{P}\left\{ {\mathcal A}_0({\bf x}) - \sum_{k=1}^K \xi_k {\mathcal A}_k({\bf x}) \succeq {\bf 0} \right\}\geq 1-\epsilon

where \({\mathcal A}_0({\bf x}) \succeq {\bf 0}\), \({\mathcal A}_k:\mathbb{R}^N \rightarrow \mathbb{S}^{M}\), and the \(\xi_k\) are i.i.d. r.v.

2. Optimization

\mathbb{P}\left\{ {\mathcal A}_0({\bf x}) - \sum_{k=1}^K \xi_k {\mathcal A}_k({\bf x}) \succeq {\bf 0} \right\}\geq 1-\epsilon
\Leftrightarrow \mathbb{P}\left\{\sum_{k=1}^K \xi_k {\mathcal A}'_k({\bf x}) \preceq {\bf I} \right\} \geq 1-\epsilon
\Leftrightarrow \mathbb{P}\left\{ \left\| \sum_{k} \xi_k \Big( \frac{1}{\gamma} {\mathcal A}'_k({\bf x}) \Big) \right\| \leq \frac{1}{\gamma}\right\}\geq 1- \epsilon

where

{\mathcal A}'_k ({\bf x})= ({\mathcal A}_0({\bf x}))^{-1/2}{\mathcal A}_k ({\bf x})({\mathcal A}_0({\bf x}))^{-1/2}

So. Moment inequalities for sums of random matrices and their applications in optimization. Mathematical Programming 130, 1 (2011), 125–151.
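A small numerical sketch of the rescaling (our own illustration; \(\mathbf{A}_0,\mathbf{A}_k\) are random stand-ins for \({\mathcal A}_0({\bf x}),{\mathcal A}_k({\bf x})\)), checking that the congruence preserves the semidefinite constraint:

```python
import numpy as np

rng = np.random.default_rng(8)
M = 6
A0 = rng.normal(size=(M, M)); A0 = A0 @ A0.T + np.eye(M)  # A_0(x) > 0
Ak = rng.normal(size=(M, M)); Ak = (Ak + Ak.T) / 2        # A_k(x) symmetric

lam, V = np.linalg.eigh(A0)
A0_inv_sqrt = (V / np.sqrt(lam)) @ V.T                    # A_0^{-1/2}
Ak_prime = A0_inv_sqrt @ Ak @ A0_inv_sqrt                 # A'_k

# A_0 - A_k >= 0 iff I - A'_k >= 0 (congruence preserves definiteness):
lhs = np.all(np.linalg.eigvalsh(A0 - Ak) >= -1e-10)
rhs = np.all(np.linalg.eigvalsh(np.eye(M) - Ak_prime) >= -1e-10)
print(lhs == rhs)
```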

2. Optimization

A. So showed that the relaxation

\sum_{k=1}^K ( {\mathcal A}'_k({\bf x}))^2 \preceq \gamma(\epsilon)^2 {\bf I}

is a good approximation if the \(\{\xi_i\}\) are Gaussian with unit variance or have distributions supported on \([-1,1]\).

So. Moment inequalities for sums of random matrices and their applications in optimization. Mathematical Programming 130, 1 (2011), 125–151.


2. Optimization

Applying our tail bound to

\mathbb{P}\left\{\sum_{k=1}^K \xi_k {\mathcal A}'_k({\bf x}) \preceq {\bf I} \right\} =\mathbb{P}\left\{ \left\| \sum_{k} \xi_k \Big( \frac{1}{\gamma} {\mathcal A}'_k({\bf x}) \Big) \right\| \leq \frac{1}{\gamma}\right\} \geq 1- \epsilon

removes the distributional assumption on \(\{\xi_i\}\) and yields a better \(\gamma\).

3. Matrix Expander Graph

An expander graph is a sparse graph with strong connectivity.

3. Matrix Expander Graph

A random walk on an expander graph is almost as good as independent sampling.

Let \((Y_1,\cdots,Y_K)\) be the vertices visited by a random walk on \(G\) with spectral gap \(\lambda\). Then, for \(f: V\to \mathbb{H}^{d\times d}\),

\mathbb{P}\left\{ \left\|\frac{1}{K}\sum_{k} f(Y_k) - \mathbb{E}[f]\right\| >t\right\} \leq d\, 2^{-\Omega(t^2\lambda K)}

Wigderson and Xiao. A randomness-efficient sampler for matrix-valued functions and applications. FOCS'05, pp. 397-406.

Garg, Lee, Song, and Srivastava. A matrix expander Chernoff bound. STOC'18, pp. 1102–1114.
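A toy walk-based estimator (our own illustration; the circulant graph is a crude stand-in for an expander, not the construction used in the cited works):

```python
import numpy as np

rng = np.random.default_rng(9)
V, d, K = 64, 4, 5000
steps = (1, -1, 7, -7)                               # 4-regular circulant graph
F = [np.diag(rng.normal(size=d)) for _ in range(V)]  # f(v), fixed Hermitian
mean_f = sum(F) / V                                  # E[f] under the uniform law

v, S = int(rng.integers(V)), np.zeros((d, d))
for _ in range(K):
    v = (v + steps[rng.integers(len(steps))]) % V    # one step of the walk
    S += F[v]

print(np.linalg.norm(S / K - mean_f, 2))             # error of the walk estimate
```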

3. Matrix Expander Graph

\mathbb{P} \left\{ \left\|\frac{1}{K} \sum_{k=1}^K f(y_k)\right\| >t \right\} \leq \mathbb{P} \left\{ \left\|\frac{1}{K} \sum_{k=1}^K {\bf Z}_k\right\| >\frac{t}{2} \right\},

for some matrix martingale difference sequence \(\{{\bf Z}_1,\cdots,{\bf Z}_K\}\), where \(\|\cdot\|\) is the spectral norm.

Garg, Lee, Song, and Srivastava. A matrix expander Chernoff bound. STOC'18, pp. 1102–1114.

3. Matrix Expander Graph

\mathbb{P} \left\{ \left\|\frac{1}{K} \sum_{k=1}^K f(y_k)\right\|_1 >t \right\} \leq \mathbb{P} \left\{ \left\|\frac{1}{K} \sum_{k=1}^K {\bf Z}_k\right\|_1 >\frac{t}{2} \right\},

for some matrix martingale difference sequence \(\{{\bf Z}_1,\cdots,{\bf Z}_K\}\), where \(\|A\|_1=\sum_{i,j}|A_{i,j}|\).

3. Matrix Expander Graph

Our matrix martingale bound then gives

\mathbb{P}\left\{\mu\left(\sum_{k=1}^K{\bf Z}_k \right)\geq t \right\} \leq {\rm e}^{\frac{\phi_{\widetilde{\Omega}}}{4}}\cdot\exp\left\{ -\frac{t^2}{4\phi_{\widetilde{\Omega}}}\right\},

where \(\phi_{\widetilde{\Omega}}:=\sum_{i=1}^{\widetilde{I}}( [\widetilde{U}_i+1]^{|\widetilde{\Omega}_i|}-1)\) with \(\widetilde{U}_i :=\max_{k\in\widetilde{\Omega}_i} \{u_k \}\).

Garg, Lee, Song, and Srivastava. A matrix expander Chernoff bound. STOC'18, pp. 1102–1114.

Proof Sketch

STEP 1: An Identity

{\rm e}^{\theta \mu({\bf B})+1} = {\rm tr}\Big(\bm{\Lambda}\Big[1,\ \theta \mu({\bf B})+1,\ \frac{(\theta \mu({\bf B})+1)^2}{2!},\ \cdots\Big] \Big) = {\rm tr}\big( {\rm e}^{{\bf D}_\mu[\theta; {\bf B}]+{\bf D}_0}\big),

where

{\bf D}_0:=\bm{\Lambda}\left[0,\ 0,\ \log\frac{1}{2!},\ \log\frac{1}{3!},\ \cdots\right]
{{\bf D}}_\mu[\theta; {\bf B}]:= \bm{\Lambda}\Big[0,\ \log(\theta \cdot \mu({\bf B})+1),\ 2\log(\theta \cdot \mu({\bf B})+1),\ \cdots\Big].
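A numerical check of the identity (our own illustration; the infinite diagonals are truncated at \(N\) terms):

```python
import numpy as np
from math import lgamma

theta, muB, N = 0.7, 2.3, 60
x = theta * muB + 1
D0 = np.array([-lgamma(k + 1) for k in range(N)])  # diag of D_0: log(1/k!)
Dmu = np.array([k * np.log(x) for k in range(N)])  # diag of D_mu: k log(theta mu(B)+1)
print(np.exp(Dmu + D0).sum(), np.exp(x))           # tr e^{D_mu+D_0} vs e^{theta mu(B)+1}
```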

STEP 2: Properties of \({{\bf D}}_\mu[\theta; {\bf B}]\)

(1):\ {\bf D}_\mu\left[\theta;\sum_{k=1}^K{\bf B}_k\right] \preceq \sum_{k=1}^K{\bf D}_\mu[\theta;{\bf B}_k];

(2):\ \sum_{k=1}^K {\bf D}_\mu\left[\theta;{\bf B}_k\right] \preceq K\cdot {\bf D}_\mu\left[\theta;\sum_{k=1}^K\frac{{\bf A}_k}{K}\right] \quad \text{whenever}\ \sum\limits_k\mu({\bf B}_k)\leq\mu\big(\sum\limits_k{\bf A}_k\big).
STEP 3: Putting It Together

With \(\mathbb{E}\mu(\mathbf{X}_k) \leq \mu(\mathbf{B}_k)\) and \(U:=\max_{k} \mu(\mathbf{B}_k)\),

\mathbb{P}\left\{ \mu\left({\bf Y}_K\right)\geq t \right\}
\leq{\rm e}^{-\theta t}\cdot \mathbb{E}\,\exp\left( \mu\left(\theta\cdot {\bf Y}_K\right)\right)
= {\rm e}^{-\theta t}\cdot{\rm e}^{-1} \cdot \mathbb{E}\,{\rm tr} \,\exp\left({\bf D}_0 +{\bf D}_\mu\left[\theta;\sum_{k=1}^K{\bf X}_k\right]\right) \quad \text{(Step 1)}
\leq {\rm e}^{-\theta t}\cdot{\rm e}^{-1} \cdot \mathbb{E}\,{\rm tr} \,\exp\left({\bf D}_0+\sum_{k=1}^K{\bf D}_\mu\left[\theta;{\bf X}_k\right]\right) \quad \text{(Step 2(1))}
\leq {\rm e}^{-\theta t}\cdot{\rm e}^{-1} \cdot{\rm tr}\,\exp\left({\bf D}_0+\sum_{k=1}^K \log\mathbb{E}\,{\rm e}^{{\bf D}_\mu[\theta;{\bf X}_k]}\right) \quad \text{(Lieb)}
\leq {\rm e}^{-\theta t}\cdot{\rm e}^{-1}\cdot {\rm tr}\,\exp\left({\bf D}_0+\sum_{k=1}^K {\bf D}_\mu[\theta;{\bf B}_k]\right) \quad (\mathbb{E}\mu(\mathbf{X}_k) \leq \mu(\mathbf{B}_k))
\leq {\rm e}^{-\theta t}\cdot{\rm e}^{-1}\cdot {\rm tr}\,\exp\left( K\cdot {\bf D}_\mu\left[\theta;{\bf U}\right] +{\bf D}_0\right) \quad \text{(Step 2(2))}
\leq \exp \big(-\theta t + g(\theta,K)\, \phi \big).

Thank you!
