Variance Reduction Study Group

SVRG

\(d\) - Dimension of our problem

\displaystyle \min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i = 1}^n \psi_i(w)

\(\psi_i\) - Loss with respect to data point \(i\)

\(n\) - Number of examples/data points

Assumptions:

\(P\) is \(\gamma\)-strongly-convex

\displaystyle P(w) \geq P(y) + \nabla P(y)^T (w - y) + \frac{\gamma}{2} \lVert w - y \rVert^2

\(\psi_i\) is \(L\)-smooth

\displaystyle \psi_i(w) \leq \psi_i(y) + \nabla \psi_i(y)^T (w - y) + \frac{L}{2} \lVert w - y \rVert^2

Condition number: \(\rho = \frac{L}{\gamma}\)


Epoch \(s\) of SVRG

Initialization: \(w_0 = \tilde{w} = \tilde{w}_{s-1}\) and

\tilde{\mu} = \nabla P(\tilde{w}) = \frac{1}{n} \sum_{i = 1}^n \nabla \psi_i(\tilde{w})

For \(t = 1,2,\dots, m\)

Pick \(i_t\) uniformly at random from \(\{1, \dotsc, n\}\)

\displaystyle w_t = w_{t-1} - \eta v_t, \quad v_t = \nabla\psi_{i_t}(w_{t-1}) - \nabla\psi_{i_t}(\tilde{w}) + \tilde{\mu}

End For

\(\tilde{w}_s = w_t\) for \(t\) picked uniformly at random from \(\{1, \dotsc, m\}\)
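As a concrete illustration, here is a minimal NumPy sketch of one run of SVRG (random-iterate snapshot) on a small ridge-regression instance; the data, regularizer \(\lambda\), and constants are made up for the example, with \(\eta = 0.1/L\) and \(m \approx 50\rho\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 5, 0.1          # toy problem (made up)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# psi_i(w) = 0.5*(x_i^T w - y_i)^2 + 0.5*lam*||w||^2,  P(w) = (1/n) sum_i psi_i(w)
def P(w):
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w

def grad_i(w, i):
    return X[i] * (X[i] @ w - y[i]) + lam * w

def full_grad(w):
    return X.T @ (X @ w - y) / n + lam * w

L = np.max(np.sum(X ** 2, axis=1)) + lam   # each psi_i is L-smooth
eta = 0.1 / L
m = int(50 * L / lam)                      # m ~ 50*rho

w_tilde = np.zeros(d)
for s in range(20):                        # 20 epochs
    mu = full_grad(w_tilde)                # full gradient at the snapshot
    w = w_tilde.copy()
    t_out = rng.integers(m)                # inner iterate that becomes the new snapshot
    for t in range(m):
        i = rng.integers(n)
        v = grad_i(w, i) - grad_i(w_tilde, i) + mu   # variance-reduced direction
        w = w - eta * v
        if t == t_out:
            w_next = w.copy()
    w_tilde = w_next

# compare against the exact ridge solution
w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
assert P(w_tilde) - P(w_star) < 1e-3
```

Note that each epoch costs one full gradient plus \(2m\) component gradients, but the update direction has vanishing variance as \(w, \tilde{w} \to w^*\).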

Theorem

\displaystyle \mathbb{E}[P(\tilde{w}_s) - P(w^*)] \leq \alpha^s(P(\tilde{w}_0) - P(w^*))
\displaystyle \text{where}~\alpha = \frac{1}{\gamma \eta (1 - 2 L \eta) m} + \frac{2 L \eta}{1 - 2 L \eta}
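For example, with \(\eta = 0.1/L\) and \(m = 50\rho\) (one admissible choice of constants), the rate works out to

\displaystyle \alpha = \frac{1}{\gamma \cdot \frac{0.1}{L} \cdot 0.8 \cdot 50 \frac{L}{\gamma}} + \frac{0.2}{0.8} = \frac{1}{4} + \frac{1}{4} = \frac{1}{2},

so each epoch roughly halves the expected suboptimality.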

Lemma (VR Property)

\displaystyle \mathbb{E}\Big[ \lVert v_t \rVert^2 ~\vert~ \mathcal{F}_{t-1}\Big] \leq 4 L [P(w_{t-1}) - P(w^*) + P(\tilde{w}) - P(w^*)]


From the VR property, we get

\displaystyle \mathbb{E}_{t-1}\Big[\lVert w_{t} - w^* \rVert^2\Big] \leq \lVert w_{t-1} - w^* \rVert^2 - 2 \eta(1 - 2 L \eta) (P(w_{t-1}) - P(w^*)) + 4 L \eta^2 [P(\tilde{w}) - P(w^*)]

SDCA

\(d\) - Dimension of our problem

\displaystyle \min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i = 1}^n \phi_i(w^T x_i) + \frac{\lambda}{2} \lVert w \rVert^2

\(n\) - Number of examples/data points

Assumptions:

\(P\) is \(\lambda\)-strongly-convex

\displaystyle P(w) \geq P(y) + \nabla P(y)^T (w - y) + \frac{\lambda}{2} \lVert w - y \rVert^2

\(\phi_i\) is \(1/\gamma\)-smooth

\displaystyle \phi_i(w) \leq \phi_i(y) + \nabla \phi_i(y)^T (w - y) + \frac{1}{2\gamma} \lVert w - y \rVert^2

Condition number: \(\rho = \frac{1}{\lambda \gamma}\)

\(\phi_i\) - i-th loss function

Note:

\displaystyle f^*(\hat{x}) = \sup_{x} \langle \hat{x}, x \rangle - f(x)
f^{**} = f \quad \text{(for closed convex } f)

Some Properties

\displaystyle \sup_{x} \langle \hat{x}, x \rangle - f(x)

attained at \(\nabla f^*(\hat{x})\) (more generally, at any \(g \in \partial f^*(\hat{x})\))

f(x)+ f^*(\hat{x}) \geq \langle \hat{x}, x \rangle
\hat{x} = \nabla f(x) \iff \nabla f^*(\hat{x}) = x
\hat{x} \in \partial f(x) \iff x \in \partial f^*(\hat{x})
\partial f(x) = \{g~\colon~ f(z) \geq f(x) + \langle g, z - x \rangle~\forall z\}
\partial f(x) = \{\nabla f(x)\}~\text{if \(f\) differentiable}
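A quick worked example of the conjugate: for the scalar squared loss \(f(a) = \tfrac{1}{2}(a - y)^2\),

\displaystyle f^*(\hat{a}) = \sup_a \, \hat{a} a - \tfrac{1}{2}(a - y)^2 = \hat{a}(y + \hat{a}) - \tfrac{1}{2}\hat{a}^2 = \tfrac{1}{2}\hat{a}^2 + \hat{a} y,

with the sup attained at \(a = y + \hat{a} = \nabla f^*(\hat{a})\), consistent with the properties above.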
\displaystyle P(w) = \frac{1}{n} \sum_{i = 1}^n \phi_i(w^Tx_i) + \frac{\lambda}{2} \lVert {w} \rVert^2
\displaystyle D(\alpha) = - \frac{1}{n} \sum_{i = 1}^n \phi_i^*(-\alpha_i) - \frac{\lambda}{2} \Big\lVert\frac{1}{\lambda n} \sum_i \alpha_i x_i \Big\rVert^2
\displaystyle G(w, \alpha) = \frac{1}{n} \sum_{i = 1}^n\Big( - \alpha_i (x_i^T w) - \phi_i^*(-\alpha_i) \Big) + \frac{\lambda}{2} \lVert {w} \rVert^2

Duality:

P(w) \geq D(\alpha) \quad \text{and} \quad P(w^*) = D(\alpha^*)

with

w^* = w(\alpha^*) \coloneqq \frac{1}{\lambda n} \sum_{i} \alpha_i^* x_i

Duality gap:

P(w(\alpha)) - D(\alpha)
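Weak duality \(P(w(\alpha)) \geq D(\alpha)\) is easy to check numerically. A minimal sketch for the squared loss \(\phi_i(a) = \tfrac{1}{2}(a - y_i)^2\), whose conjugate is \(\phi_i^*(u) = \tfrac{1}{2}u^2 + u y_i\) (the data and \(\lambda\) below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 5, 0.1             # toy data (made up)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def P(w):
    return 0.5 * np.mean((X @ w - y) ** 2) + 0.5 * lam * w @ w

def w_of(alpha):                   # w(alpha) = (1/(lam*n)) sum_i alpha_i x_i
    return X.T @ alpha / (lam * n)

def D(alpha):
    # phi_i^*(-alpha_i) = 0.5*alpha_i^2 - alpha_i*y_i for phi_i(a) = 0.5*(a - y_i)^2
    v = w_of(alpha)
    return -np.mean(0.5 * alpha ** 2 - alpha * y) - 0.5 * lam * v @ v

alpha = rng.standard_normal(n)     # an arbitrary dual point
gap = P(w_of(alpha)) - D(alpha)
assert gap >= 0                    # weak duality: the gap is never negative
```

The gap vanishes exactly at \((w^*, \alpha^*)\), which is what makes it a usable optimality certificate.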

SDCA

For \(t = 1,2,\dots, T\)

Pick \(i\) uniformly at random from \(\{1, \dotsc, n\}\)

\(\Delta \alpha_i\) maximizes \(\displaystyle D(\alpha^{(t -1)} + \Delta \alpha_i e_i)\)

\displaystyle \alpha^{(t)} = \alpha^{(t -1)} + \Delta \alpha_i e_i
\displaystyle w^{(t)} = w(\alpha^{(t)})

End For

\bar{w} = \frac{1}{T - T_0} \sum_{i = T_0 + 1}^T w^{(i)}
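For the squared loss \(\phi_i(a) = \tfrac{1}{2}(a - y_i)^2\), the coordinate maximization has the closed form \(\Delta \alpha_i = (y_i - x_i^T w - \alpha_i)/(1 + \lVert x_i \rVert^2 / (\lambda n))\). A minimal sketch under that choice (data and \(\lambda\) are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 100, 5, 0.1            # toy data (made up)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

alpha = np.zeros(n)
w = X.T @ alpha / (lam * n)        # w(alpha), here 0
sq = np.sum(X ** 2, axis=1)        # ||x_i||^2

for t in range(50 * n):            # ~50 passes over the data
    i = rng.integers(n)
    # closed-form maximizer of D(alpha + delta*e_i) for phi_i(a) = 0.5*(a - y_i)^2
    delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + sq[i] / (lam * n))
    alpha[i] += delta
    w += delta * X[i] / (lam * n)  # maintain w = w(alpha) incrementally

w_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
assert np.linalg.norm(w - w_star) < 1e-3
```

Maintaining \(w = w(\alpha)\) incrementally keeps each iteration at \(O(d)\), which is what makes the coordinate update practical.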

Lemma

For any \(s \in [0,1]\), we have

\displaystyle \mathbb{E}[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \geq \frac{s}{n} \mathbb{E}[P(w^{(t-1)}) - D(\alpha^{(t-1)})] - \Big(\frac{s}{n}\Big)^2 \frac{G^{(t)}}{2 \lambda}

where

\displaystyle G^{(t)} \coloneqq \frac{1}{n} \sum_{i = 1}^n \Big(\lVert x_i \rVert^2 - \frac{\gamma (1 - s) \lambda n}{s} \Big) \mathbb{E}[(u_i^{(t-1)} - \alpha_i^{(t-1)})^2]

with

\displaystyle - u_i^{(t-1)} \in \partial \phi_i(x_i^T w^{(t-1)})

Assumptions (for simplicity):

\displaystyle \lVert x_i \rVert \leq 1,
\displaystyle \phi_i(\cdot) \geq 0,
\displaystyle \phi_i(0) \leq 1

Theorem 2 (rephrased)

Assumptions (for simplicity):

\displaystyle \lVert x_i \rVert \leq 1,
\displaystyle \phi_i(\cdot) \geq 0,
\displaystyle \phi_i(0) \leq 1

\displaystyle \mathbb{E}[P(w^{(t)}) - D(\alpha^{(t)})] \leq \frac{n}{s} \Big( 1 - \frac{1}{\rho + n}\Big)^t

and

\displaystyle \mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \leq \frac{n}{s(T - T_0)} \Big( 1 - \frac{1}{\rho + n}\Big)^{T_0}


SAG & SAGA

\(d\) - Dimension of our problem

\displaystyle \min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n}\sum_{i = 1}^n f_i(x)

\(f_i\) - Loss with respect to data point \(i\)

\(n\) - Number of examples/data points

Assumptions:

\(f_i\) is \(\mu\)-strongly-convex

\displaystyle f_i(y) \geq f_i(x) + \nabla f_i(x)^T (y - x) + \frac{\mu}{2} \lVert y - x \rVert^2

\(f_i\) is \(L\)-smooth

\displaystyle f_i(y) \leq f_i(x) + \nabla f_i(x)^T (y - x) + \frac{L}{2} \lVert y - x \rVert^2

Condition number: \(\rho = \frac{L}{\mu}\)

SAGA

For \(k = 1,2,\dots, T\)

Pick \(j\) uniformly at random from \(\{1, \dotsc, n\}\)

\displaystyle x^{k+1} = x^{k} - \gamma \left[ \nabla f_j(x^k) - \nabla f_j(\phi_j) + \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\phi_i) \right]

\(\phi_j \gets x^k\) and store \(\nabla f_j(\phi_j)\)

End For

UNBIASED!
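A minimal NumPy sketch of SAGA on a toy ridge-regression problem, using the \(\gamma = 1/(3L)\) step size from the rates below (data and \(\lambda\) are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 100, 5, 0.1            # toy ridge problem (made up)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad_i(x, i):                  # f_i(x) = 0.5*(x_i^T x - y_i)^2 + 0.5*lam*||x||^2
    return X[i] * (X[i] @ x - y[i]) + lam * x

L = np.max(np.sum(X ** 2, axis=1)) + lam   # each f_i is L-smooth
gamma = 1.0 / (3.0 * L)

x = np.zeros(d)
table = np.array([grad_i(x, i) for i in range(n)])  # stored gradients at the phi_i
avg = table.mean(axis=0)

for k in range(100 * n):
    j = rng.integers(n)
    g_new = grad_i(x, j)           # gradient at the current iterate x^k
    # unbiased direction: its conditional expectation is the full gradient at x^k
    x = x - gamma * (g_new - table[j] + avg)
    avg = avg + (g_new - table[j]) / n      # keep the running average in sync
    table[j] = g_new                        # phi_j <- x^k

x_star = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
assert np.linalg.norm(x - x_star) < 1e-2
```

The price of avoiding SVRG's periodic full-gradient passes is the \(O(nd)\) gradient table (reducible to \(O(n)\) for linear models).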

Convergence Rates

\displaystyle \mathbb{E}\lVert x^k - x^* \rVert^2 \lesssim \left( 1 - \frac{1}{2(n + \rho)}\right)^k

with

\displaystyle \gamma = \frac{1}{2(\mu n + L)}

and

\displaystyle \mathbb{E}\lVert x^k - x^* \rVert^2 \lesssim \left( 1 - \min\left\{\frac{1}{4 n}, \frac{1}{3\rho} \right\}\right)^k

with

\displaystyle \gamma = \frac{1}{3 L}

Indep. of \(\mu\)

Can add prox step

Convergence of SAGA shows decrease of a Lyapunov function \(T^k\)

SAG

For \(k = 1,2,\dots, T\)

Pick \(j\) uniformly at random from \(\{1, \dotsc, n\}\)

\(\phi_j \gets x^k\) and store \(\nabla f_j(\phi_j)\)

\displaystyle x^{k+1} = x^{k} - \gamma \cdot \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\phi_i)

End For

BIASED!

Convergence Rate

\displaystyle \mathbb{E}\lVert x^k - x^* \rVert^2 \lesssim \left( 1 - \min\left\{\frac{1}{8 n}, \frac{1}{16\rho} \right\}\right)^k

with

\displaystyle \gamma = \frac{1}{16 L}

Indep. of \(\mu\)

Smaller Variance?

Comparison of updates (from SAGA paper)

Comparison of features (from SAGA paper)

... and beyond

Finito

For \(k = 1,2,\dots, T\)

Pick \(j\) uniformly at random from \(\{1, \dotsc, n\}\)

\(\phi_j \gets x^k\) and store \(\nabla f_j(\phi_j)\)

\displaystyle x^{k+1} = \frac{1}{n} \sum_{i = 1}^n \phi_i - \gamma \cdot \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\phi_i)

End For

Requires \(\displaystyle n \gtrsim \rho\)

Convergence Rate

\displaystyle \mathbb{E}\left[ f\left( \frac{1}{n}\sum_{i = 1}^n \phi_i^k \right) - f(x^*)\right] \lesssim \left( 1 - \frac{1}{2n}\right)^k

with

\displaystyle \gamma = \frac{1}{2 \mu}

Good speed-ups with permutation of data, but no theory (?)

Acceleration

So far, to get to \(\varepsilon\)-opt we required \(O((n + \rho)\log(1/\varepsilon))\) iterations

There are accelerated methods that require

\(O((n + \sqrt{n\rho})\log(1/\varepsilon))\) iterations

Good when \(\rho \gg n\)

Examples

Accelerated SDCA

Catalyst: Accelerate any VR method

Katyusha: Direct acceleration of VR with "negative momentum"

VR for non-convex problems

SPIDER

Variance reduction for \(f_i\) smooth and \(f\) lower bounded

\(O(n + \sqrt{n}/\varepsilon^2)\) for \(\varepsilon\)-opt

Matches the Lower Bound

Based on "online SVRG":

g^k = \nabla f_j(x^k) - \nabla f_j(x^{k-1}) + g^{k-1}

Normalized gradient descent steps (?!)

With resets
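A minimal sketch of this recursive estimator with periodic resets, on a toy least-squares problem (step size, reset period, and data are made-up choices; real SPIDER additionally uses normalized steps, which the sketch omits):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 100, 5                      # toy least-squares problem (made up)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad_i(x, i):                  # f_i(x) = 0.5*(x_i^T x - y_i)^2
    return X[i] * (X[i] @ x - y[i])

def full_grad(x):
    return X.T @ (X @ x - y) / n

eta, q = 0.002, n                  # step size and reset period (made up)
x_prev = x = np.zeros(d)
g = full_grad(x)                   # reset: start from the exact gradient

for k in range(1, 100 * n):
    if k % q == 0:
        g = full_grad(x)           # periodic reset keeps estimator error bounded
    else:
        j = rng.integers(n)
        # recursive estimator: g^k = grad_j(x^k) - grad_j(x^{k-1}) + g^{k-1}
        g = grad_i(x, j) - grad_i(x_prev, j) + g
    x_prev, x = x, x - eta * g     # plain (un-normalized) step for simplicity

assert np.linalg.norm(full_grad(x)) < 0.1
```

Unlike SVRG, the correction is taken at the previous iterate rather than a fixed snapshot, so the estimator drifts and the resets are what keep its error under control.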

Non-uniform sampling

Sparse gradients

Improve beginning of VR methods

Second order

Local/Federated VR
