# SVRG

$$d$$ - Dimension of our problem

\displaystyle \min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i = 1}^n \psi_i(w)

$$\psi_i$$ - Loss with respect to data point $$i$$

$$n$$ - Number of examples/data points

Assumptions:

$$P$$ is $$\gamma$$-strongly-convex

\displaystyle P(w) \geq P(y) + \nabla P(y)^T (w - y) + \frac{\gamma}{2} \lVert w - y \rVert^2

$$\psi_i$$ is $$L$$-smooth

\displaystyle \psi_i(w) \leq \psi_i(y) + \nabla \psi_i(y)^T (w - y) + \frac{L}{2} \lVert w - y \rVert^2

Condition number: $$\rho = \frac{L}{\gamma}$$


Epoch $$s$$ of SVRG:

\displaystyle w_0 = \tilde{w} = \tilde{w}_{s-1}
\displaystyle \tilde{\mu} = \nabla P(\tilde{w}) = \frac{1}{n} \sum_{i = 1}^n \nabla \psi_i(\tilde{w})

For $$t = 1,2,\dots, m$$

Pick $$i_t$$ uniformly at random from $$\{1, \dotsc, n\}$$

\displaystyle w_t = w_{t-1} - \eta (\nabla\psi_{i_t}(w_{t-1}) - \nabla\psi_{i_t}(\tilde{w}) + \tilde{\mu})

End For

$$\tilde{w}_s = w_t$$ for $$t$$ picked uniformly at random from $$\{1, \dotsc, m\}$$

The stochastic direction is

\displaystyle v_t = \nabla\psi_{i_t}(w_{t-1}) - \nabla\psi_{i_t}(\tilde{w}) + \tilde{\mu}
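As a sanity check, the epoch structure above can be sketched in a few lines of Python. This is an illustrative toy least-squares instance, not from the slides; the function name `svrg` and the constants (`eta`, `m`, `n_epochs`) are arbitrary choices, with the step size picked crudely from a smoothness estimate:

```python
import numpy as np

def svrg(grad_i, full_grad, w0, eta, m, n_epochs, n, rng):
    """Sketch of the SVRG loop: grad_i(w, i) = grad psi_i(w), full_grad(w) = grad P(w)."""
    w_tilde = w0.copy()
    for _ in range(n_epochs):
        mu = full_grad(w_tilde)                  # tilde{mu} = grad P(tilde{w}): one full pass
        w = w_tilde.copy()                       # w_0 = tilde{w}
        iterates = []
        for _ in range(m):
            i = rng.integers(n)                  # i_t uniform on {1, ..., n}
            v = grad_i(w, i) - grad_i(w_tilde, i) + mu   # variance-reduced direction v_t
            w = w - eta * v
            iterates.append(w)
        w_tilde = iterates[rng.integers(m)]      # tilde{w}_s = w_t for a random t
    return w_tilde

# Toy interpolating least squares: psi_i(w) = 0.5 * (x_i^T w - y_i)^2
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]
full_grad = lambda w: X.T @ (X @ w - y) / len(y)
L_max = np.max(np.sum(X**2, axis=1))             # crude bound on the smoothness constant L
w_out = svrg(grad_i, full_grad, np.zeros(3), eta=1 / (10 * L_max),
             m=500, n_epochs=40, n=50, rng=rng)
```

Because the toy problem interpolates, $$v_t \to 0$$ as both $$w_{t-1}$$ and $$\tilde{w}$$ approach $$w^*$$, so the iterates converge with a constant step size, unlike plain SGD.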

Theorem

\displaystyle \mathbb{E}[P(\tilde{w}_s) - P(w^*)] \leq \alpha^s(P(\tilde{w}_0) - P(w^*))
\displaystyle \text{where}~\alpha = \frac{1}{\gamma \eta (1 - 2 L \eta) m} + \frac{2 L \eta}{1 - 2 L \eta}

Lemma (VR Property)

\displaystyle \mathbb{E}\Big[ \lVert v_t \rVert^2 ~\vert~ \mathcal{F}_{t-1}\Big] \leq 4 L [P(w_{t-1}) - P(w^*) + P(\tilde{w}) - P(w^*)]


From the VR property, we get

\displaystyle \mathbb{E}_{t-1}\Big[\lVert w_{t} - w^* \rVert^2\Big] - \lVert w_{t-1} - w^* \rVert^2 + 2 \eta(1 - 2 L \eta) (P(w_{t-1}) - P(w^*)) \leq 4 L \eta^2 [P(\tilde{w}) - P(w^*)]

# SDCA

$$d$$ - Dimension of our problem

\displaystyle \min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i = 1}^n \phi_i(w^T x_i) + \frac{\lambda}{2} \lVert w \rVert^2

$$n$$ - Number of examples/data points

Assumptions:

$$P$$ is $$\lambda$$-strongly-convex

\displaystyle P(w) \geq P(y) + \nabla P(y)^T (w - y) + \frac{\lambda}{2} \lVert w - y \rVert^2

$$\phi_i$$ is $$1/\gamma$$-smooth

\displaystyle \phi_i(w) \leq \phi_i(y) + \nabla \phi_i(y)^T (w - y) + \frac{1}{2\gamma} \lVert w - y \rVert^2

Condition number: $$\rho = \frac{1}{\lambda \gamma}$$

$$\phi_i$$ - $$i$$-th loss function

Note:

\displaystyle f^*(\hat{x}) = \sup_{x} \langle \hat{x}, x \rangle - f(x)
f^{**} = f~\text{(for closed convex functions)}

Some Properties

\sup_{x} \langle \hat{x}, x \rangle - f(x)

attained at $$x = \nabla f^*(\hat{x})$$

f(x)+ f^*(\hat{x}) \geq \langle \hat{x}, x \rangle
\hat{x} = \nabla f(x) \iff \nabla f^*(\hat{x}) = x
\partial f(x) = \{g~\colon~ f(z) \geq f(x) + \langle g, z - x \rangle~\forall z\}

Some Properties

\sup_{x} \langle \hat{x}, x \rangle - f(x)

attained at any $$x \in \partial f^*(\hat{x})$$

\hat{x} \in \partial f(x) \iff \partial f^*(\hat{x}) \ni x
\partial f(x) = \{\nabla f(x)\}~\text{if differentiable}
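A standard worked example (not from the slides) that ties these properties together: the quadratic is its own conjugate.

```latex
f(x) = \tfrac{1}{2}\lVert x \rVert^2
\quad\Longrightarrow\quad
f^*(\hat{x}) = \sup_{x}\Big(\langle \hat{x}, x \rangle - \tfrac{1}{2}\lVert x \rVert^2\Big)
             = \tfrac{1}{2}\lVert \hat{x} \rVert^2
```

Here the supremum is attained at $$x = \hat{x} = \nabla f^*(\hat{x})$$, so $$\hat{x} = \nabla f(x) \iff x = \nabla f^*(\hat{x})$$ holds with $$\nabla f = \nabla f^* = \mathrm{id}$$, and indeed $$f^{**} = f$$.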
\displaystyle P(w) = \frac{1}{n} \sum_{i = 1}^n \phi_i(w^Tx_i) + \frac{\lambda}{2} \lVert {w} \rVert^2
\displaystyle D(\alpha) = - \frac{1}{n} \sum_{i = 1}^n \phi_i^*(-\alpha_i) - \frac{\lambda}{2} \Big\lVert\frac{1}{\lambda n} \sum_i \alpha_i x_i \Big\rVert^2
\displaystyle G(w, \alpha) = \frac{1}{n} \sum_{i = 1}^n\Big( - \alpha_i (x_i^T w) - \phi_i^*(-\alpha_i) \Big) + \frac{\lambda}{2} \lVert {w} \rVert^2
Duality: for all $$w$$ and $$\alpha$$,

P(w) \geq D(\alpha),

and

P(w^*) = D(\alpha^*)

with

w^* = w(\alpha^*) \coloneqq \frac{1}{\lambda n} \sum_{i} \alpha_i^* x_i

Duality gap:

P(w(\alpha)) - D(\alpha)

SDCA

For $$t = 1,2,\dots, T$$

Pick $$i$$ uniformly at random from $$\{1, \dotsc, n\}$$

$$\Delta \alpha_i$$ maximizes $$D(\alpha^{(t -1)} + \Delta \alpha_i e_i)$$

\displaystyle \alpha^{(t)} = \alpha^{(t -1)} + \Delta \alpha_i e_i
\displaystyle w^{(t)} = w(\alpha^{(t)})

End For

\displaystyle \bar{w} = \frac{1}{T - T_0} \sum_{t = T_0 + 1}^T w^{(t)}
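For the squared loss $$\phi_i(z) = \tfrac{1}{2}(z - y_i)^2$$, the coordinate maximization over $$\Delta \alpha_i$$ has a closed form, so the loop above can be sketched directly. This is a toy ridge instance; the function name `sdca_ridge` and the constants `lam`, `T` are illustrative choices:

```python
import numpy as np

def sdca_ridge(X, y, lam, T, rng):
    """SDCA sketch for squared loss phi_i(z) = 0.5*(z - y_i)^2.

    For this loss, maximizing D(alpha + delta * e_i) over delta gives the
    closed form delta = (y_i - x_i^T w - alpha_i) / (1 + ||x_i||^2 / (lam * n)).
    """
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                      # invariant: w = (1/(lam*n)) * sum_i alpha_i x_i
    for _ in range(T):
        i = rng.integers(n)
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + X[i] @ X[i] / (lam * n))
        alpha[i] += delta                # alpha^{(t)} = alpha^{(t-1)} + delta * e_i
        w += delta * X[i] / (lam * n)    # w^{(t)} = w(alpha^{(t)}), updated incrementally
    return w, alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = X @ np.array([1.0, -1.0, 2.0])
w, alpha = sdca_ridge(X, y, lam=0.01, T=6000, rng=rng)
```

Note that no step size is tuned: each coordinate step is an exact maximization, so $$D(\alpha^{(t)})$$ increases monotonically.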

Lemma

For any $$s \in [0,1]$$, we have

\displaystyle \mathbb{E}[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \geq \frac{s}{n} \mathbb{E}[P(w^{(t-1)}) - D(\alpha^{(t-1)})] - \Big(\frac{s}{n}\Big)^2 \frac{G^{(t)}}{2 \lambda}

where

\displaystyle G^{(t)} \coloneqq \frac{1}{n} \sum_{i = 1}^n \Big(\lVert x_i \rVert^2 - \frac{\gamma (1 - s) \lambda n}{s} \Big) \mathbb{E}[(u_i^{(t-1)} - \alpha_i^{(t-1)})^2]

with

\displaystyle - u_i^{(t-1)} \in \partial \phi_i(x_i^T w^{(t-1)})

Theorem 2 (rephrased)

Assumptions (for simplicity):

\displaystyle \lVert x_i \rVert \leq 1,
\displaystyle \phi_i(\cdot) \geq 0,
\displaystyle \phi_i(0) \leq 1

Then

\displaystyle \mathbb{E}[P(w^{(t)}) - D(\alpha^{(t)})] \leq \frac{n}{s} \Big( 1 - \frac{1}{\rho + n}\Big)^t

and

\displaystyle \mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \leq \frac{n}{s(T - T_0)} \Big( 1 - \frac{1}{\rho + n}\Big)^{T_0}


# SAG & SAGA

$$d$$ - Dimension of our problem

\displaystyle \min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n}\sum_{i = 1}^n f_i(x)

$$f_i$$ - Loss with respect to data point $$i$$

$$n$$ - Number of examples/data points

Assumptions:

$$f_i$$ is $$\mu$$-strongly-convex

\displaystyle f_i(x) \geq f_i(y) + \nabla f_i(y)^T (x - y) + \frac{\mu}{2} \lVert x - y \rVert^2

$$f_i$$ is $$L$$-smooth

\displaystyle f_i(x) \leq f_i(y) + \nabla f_i(y)^T (x - y) + \frac{L}{2} \lVert x - y \rVert^2

Condition number: $$\rho = \frac{L}{\mu}$$

SAGA

For $$k = 1,2,\dots, T$$

Pick $$j$$ uniformly at random from $$\{1, \dotsc, n\}$$

\displaystyle x^{k+1} = x^{k} - \gamma \left[ \nabla f_j(x^k) - \nabla f_j(\phi_j) + \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\phi_i) \right]

$$\phi_j \gets x^k$$ and store $$\nabla f_j(\phi_j)$$

End For

The update direction is UNBIASED!
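A minimal sketch of the SAGA loop on a toy least-squares problem. Keeping the gradient table plus a running average is the standard implementation trick (one fresh gradient per step); the instance and the step-size estimate are illustrative:

```python
import numpy as np

def saga(X, y, T, rng):
    """SAGA sketch for f_i(x) = 0.5*(x_i^T x - y_i)^2, with gamma = 1/(3L)."""
    n, d = X.shape
    gamma = 1.0 / (3.0 * np.max(np.sum(X**2, axis=1)))   # crude 1/(3L)
    x = np.zeros(d)
    table = -y[:, None] * X            # stored grad f_i(phi_i), with phi_i = 0 initially
    avg = table.mean(axis=0)           # running (1/n) sum_i grad f_i(phi_i)
    for _ in range(T):
        j = rng.integers(n)
        g_new = (X[j] @ x - y[j]) * X[j]            # grad f_j(x^k)
        x = x - gamma * (g_new - table[j] + avg)    # unbiased variance-reduced step
        avg = avg + (g_new - table[j]) / n          # phi_j <- x^k: refresh the average...
        table[j] = g_new                            # ...and the stored gradient
    return x

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
x_true = np.array([2.0, 0.0, -1.0])
y = X @ x_true
x_out = saga(X, y, T=10000, rng=rng)
```

The old table entry is used in the step and only then overwritten; conditioned on the past, the bracketed direction has expectation $$\nabla f(x^k)$$.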

Convergence Rates

\displaystyle \mathbb{E}\lVert x^k - x^* \rVert^2 \lesssim \left( 1 - \frac{1}{2(n + \rho)}\right)^k

with

\displaystyle \gamma = \frac{1}{2(\mu n + L)}

and

\displaystyle \mathbb{E}\lVert x^k - x^* \rVert^2 \lesssim \left( 1 - \min\left\{\frac{1}{4 n}, \frac{1}{3\rho} \right\}\right)^k

with

\displaystyle \gamma = \frac{1}{3 L}

The second step size is independent of $$\mu$$.

The convergence proof for SAGA shows the decrease of a Lyapunov function $$T^k$$.

SAG

For $$k = 1,2,\dots, T$$

Pick $$j$$ uniformly at random from $$\{1, \dotsc, n\}$$

$$\phi_j \gets x^k$$ and store $$\nabla f_j(\phi_j)$$

\displaystyle x^{k+1} = x^{k} - \gamma \cdot \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\phi_i)

End For

The update direction is BIASED!
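The SAG loop differs from SAGA only in which direction is used: the full averaged table, whose entries live at stale points, hence the bias. A sketch on the same kind of toy problem (illustrative instance and constants):

```python
import numpy as np

def sag(X, y, T, rng):
    """SAG sketch for f_i(x) = 0.5*(x_i^T x - y_i)^2, with gamma = 1/(16L)."""
    n, d = X.shape
    gamma = 1.0 / (16.0 * np.max(np.sum(X**2, axis=1)))  # crude 1/(16L)
    x = np.zeros(d)
    table = -y[:, None] * X            # stored grad f_i(phi_i), with phi_i = 0 initially
    avg = table.mean(axis=0)
    for _ in range(T):
        j = rng.integers(n)
        g_new = (X[j] @ x - y[j]) * X[j]     # phi_j <- x^k: recompute grad f_j
        avg = avg + (g_new - table[j]) / n
        table[j] = g_new
        x = x - gamma * avg                  # biased: averages gradients at stale points
    return x

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 3))
x_true = np.array([0.5, 1.5, -0.5])
y = X @ x_true
x_out = sag(X, y, T=40000, rng=rng)
```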

Convergence Rate

\displaystyle \mathbb{E}\lVert x^k - x^* \rVert^2 \lesssim \left( 1 - \min\left\{\frac{1}{8 n}, \frac{1}{16\rho} \right\}\right)^k

with

\displaystyle \gamma = \frac{1}{16 L}

This step size is independent of $$\mu$$.

Smaller Variance?

Comparison of updates (from SAGA paper)

Comparison of features (from SAGA paper)

# ... and beyond

Finito

For $$k = 1,2,\dots, T$$

Pick $$j$$ uniformly at random from $$\{1, \dotsc, n\}$$

$$\phi_j \gets x^k$$ and store $$\nabla f_j(\phi_j)$$

\displaystyle x^{k+1} = \frac{1}{n} \sum_{i = 1}^n \phi_i - \gamma \cdot \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\phi_i)

End For

Convergence Rate

\displaystyle \mathbb{E}\left[ f\left( \frac{1}{n}\sum_{i = 1}^n \phi_i^k \right) - f(x^*)\right] \lesssim \left( 1 - \frac{1}{2n}\right)^k

with

\displaystyle \gamma = \frac{1}{2 \mu} \quad \text{and} \quad n \gtrsim \rho

Good speed-ups with permutation of data, but no theory (?)
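A sketch of the Finito loop. Since each $$f_i$$ must be strongly convex, a $$\tfrac{\mu}{2}\lVert x \rVert^2$$ term is added to every component, and the rows are normalized so that $$n \gtrsim \rho$$ comfortably holds; everything here (names, constants) is illustrative:

```python
import numpy as np

def finito(X, y, mu, T, rng):
    """Finito sketch for f_i(x) = 0.5*(x_i^T x - y_i)^2 + (mu/2)||x||^2, gamma = 1/(2 mu)."""
    n, d = X.shape
    gamma = 1.0 / (2.0 * mu)
    phi = np.zeros((n, d))                                  # stored points phi_i
    grads = -y[:, None] * X + mu * phi                      # grad f_i(phi_i) at phi_i = 0
    for _ in range(T):
        x = phi.mean(axis=0) - gamma * grads.mean(axis=0)   # x^{k+1}: uses only the tables
        j = rng.integers(n)
        phi[j] = x                                          # phi_j <- x^{k+1}
        grads[j] = (X[j] @ x - y[j]) * X[j] + mu * x
    return phi.mean(axis=0) - gamma * grads.mean(axis=0)

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 2))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit rows: L = 1 + mu, so rho = 2 << n
b = A @ np.array([1.0, -1.0])
x_out = finito(A, b, mu=1.0, T=4000, rng=rng)
```

Unlike SAG/SAGA, the iterate is recomputed from the stored points themselves, so the method stores $$n$$ points in addition to $$n$$ gradients.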

Acceleration

So far, to get to $$\varepsilon$$-opt we required $$O((n + \rho)\log(1/\varepsilon))$$ iterations

There are accelerated methods that require

$$O((n + \sqrt{n\rho})\log(1/\varepsilon))$$ iterations

Good when $$\rho \gg n$$

Examples

Accelerated SDCA

Catalyst: Accelerate any VR method

Katyusha: Direct acceleration of VR with "negative momentum"

VR for non-convex problems

SPIDER

Variance reduction for $$f_i$$ smooth and $$f$$ lower bounded

$$O(n + \sqrt{n}/\varepsilon^2)$$ for $$\varepsilon$$-opt

Matches the Lower Bound

Based on "online SVRG":

g^k = \nabla f_j(x^k) - \nabla f_j(x^{k-1}) + g^{k-1}

With resets
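The recursive estimator with resets can be sketched as follows. The convex toy problem is only there to exercise the estimator; SPIDER proper targets non-convex $$f$$ and uses normalized steps, both of which this illustrative sketch omits:

```python
import numpy as np

def spider_style(X, y, eta, q, T, rng):
    """Recursive 'online SVRG' estimator with a full-gradient reset every q steps."""
    n, d = X.shape
    x_prev = np.zeros(d)
    x = np.zeros(d)
    g = np.zeros(d)
    for k in range(T):
        if k % q == 0:
            g = X.T @ (X @ x - y) / n        # reset: g^k = grad f(x^k)
        else:
            j = rng.integers(n)
            # g^k = grad f_j(x^k) - grad f_j(x^{k-1}) + g^{k-1}
            g = ((X[j] @ x - y[j]) - (X[j] @ x_prev - y[j])) * X[j] + g
        x_prev, x = x, x - eta * g
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 2))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit rows: well-conditioned toy
b = A @ np.array([1.0, 2.0])
x_out = spider_style(A, b, eta=0.1, q=5, T=5000, rng=rng)
```

Between resets the estimator drifts by accumulated sampling noise, which is why the reset period and step size must be kept small relative to the smoothness.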

Non-uniform sampling

Improve beginning of VR methods

Second order

Local/Federated VR

#### Variance Reduction Study Group

By Victor Sanches Portella
