# SVRG

$$d$$ - Dimension of our problem

\displaystyle \min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i = 1}^n \psi_i(w)

$$\psi_i$$ - Loss with respect to data point $$i$$

$$n$$ - Number of examples/data points

Assumptions:

$$P$$ is $$\gamma$$-strongly-convex

\displaystyle P(w) \geq P(y) + \nabla P(y)^T (w - y) + \frac{\gamma}{2} \lVert w - y \rVert^2

$$\psi_i$$ is $$L$$-smooth

\displaystyle \psi_i(w) \leq \psi_i(y) + \nabla \psi_i(y)^T (w - y) + \frac{L}{2} \lVert w - y \rVert^2

Condition number: $$\rho = \frac{L}{\gamma}$$


Epoch $$s$$ of SVRG:

\displaystyle w_0 = \tilde{w} = \tilde{w}_{s-1}
\displaystyle \tilde{\mu} = \nabla P(\tilde{w}) = \frac{1}{n} \sum_{i = 1}^n \nabla \psi_i(\tilde{w})

For $$t = 1,2,\dots, m$$

Pick $$i_t$$ uniformly at random from $$\{1, \dotsc, n\}$$

\displaystyle w_t = w_{t-1} - \eta (\nabla\psi_{i_t}(w_{t-1}) - \nabla\psi_{i_t}(\tilde{w}) + \tilde{\mu})

End For

$$\tilde{w}_s = w_t$$ for $$t$$ picked uniformly at random from $$\{1, \dotsc, m\}$$

The stochastic direction is

\displaystyle v_t = \nabla\psi_{i_t}(w_{t-1}) - \nabla\psi_{i_t}(\tilde{w}) + \tilde{\mu}
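As a sanity check, the epoch structure above can be sketched in a few lines of Python. This is an illustrative toy least-squares instance, not from the slides; the function name `svrg` and the constants (`eta`, `m`, `n_epochs`) are arbitrary choices, with the step size picked crudely from a smoothness estimate:

```python
import numpy as np

def svrg(grad_i, full_grad, w0, eta, m, n_epochs, n, rng):
    """Sketch of the SVRG loop: grad_i(w, i) = grad psi_i(w), full_grad(w) = grad P(w)."""
    w_tilde = w0.copy()
    for _ in range(n_epochs):
        mu = full_grad(w_tilde)                  # tilde{mu} = grad P(tilde{w}): one full pass
        w = w_tilde.copy()                       # w_0 = tilde{w}
        iterates = []
        for _ in range(m):
            i = rng.integers(n)                  # i_t uniform on {1, ..., n}
            v = grad_i(w, i) - grad_i(w_tilde, i) + mu   # variance-reduced direction v_t
            w = w - eta * v
            iterates.append(w)
        w_tilde = iterates[rng.integers(m)]      # tilde{w}_s = w_t for a random t
    return w_tilde

# Toy interpolating least squares: psi_i(w) = 0.5 * (x_i^T w - y_i)^2
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]
full_grad = lambda w: X.T @ (X @ w - y) / len(y)
L_max = np.max(np.sum(X**2, axis=1))             # crude bound on the smoothness constant L
w_out = svrg(grad_i, full_grad, np.zeros(3), eta=1 / (10 * L_max),
             m=500, n_epochs=40, n=50, rng=rng)
```

Because the toy problem interpolates, $$v_t \to 0$$ as both $$w_{t-1}$$ and $$\tilde{w}$$ approach $$w^*$$, so the iterates converge with a constant step size, unlike plain SGD.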

Theorem

\displaystyle \mathbb{E}[P(\tilde{w}_s) - P(w^*)] \leq \alpha^s(P(\tilde{w}_0) - P(w^*))
\displaystyle \text{where}~\alpha = \frac{1}{\gamma \eta (1 - 2 L \eta) m} + \frac{2 L \eta}{1 - 2 L \eta}

Lemma (VR Property)

\displaystyle \mathbb{E}\Big[ \lVert v_t \rVert^2 ~\vert~ \mathcal{F}_{t-1}\Big] \leq 4 L [P(w_{t-1}) - P(w^*) + P(\tilde{w}) - P(w^*)]


From the VR property, we get

\displaystyle \mathbb{E}_{t-1}\Big[\lVert w_{t} - w^* \rVert^2\Big] - \lVert w_{t-1} - w^* \rVert^2 + 2 \eta(1 - 2 L \eta) (P(w_{t-1}) - P(w^*)) \leq 4 L \eta^2 [P(\tilde{w}) - P(w^*)]

# SDCA

$$d$$ - Dimension of our problem

\displaystyle \min_{w \in \mathbb{R}^d} P(w) = \frac{1}{n}\sum_{i = 1}^n \phi_i(w^T x_i) + \frac{\lambda}{2} \lVert w \rVert^2

$$n$$ - Number of examples/data points

Assumptions:

$$P$$ is $$\lambda$$-strongly-convex

\displaystyle P(w) \geq P(y) + \nabla P(y)^T (w - y) + \frac{\lambda}{2} \lVert w - y \rVert^2

$$\phi_i$$ is $$1/\gamma$$-smooth

\displaystyle \phi_i(w) \leq \phi_i(y) + \nabla \phi_i(y)^T (w - y) + \frac{1}{2\gamma} \lVert w - y \rVert^2

Condition number: $$\rho = \frac{1}{\lambda \gamma}$$

$$\phi_i$$ - $$i$$-th loss function

Note:

\displaystyle f^*(\hat{x}) = \sup_{x} \langle \hat{x}, x \rangle - f(x)
f^{**} = f~\text{(for closed convex functions)}

Some Properties

\sup_{x} \langle \hat{x}, x \rangle - f(x)

attained at $$x = \nabla f^*(\hat{x})$$

f(x)+ f^*(\hat{x}) \geq \langle \hat{x}, x \rangle
\hat{x} = \nabla f(x) \iff \nabla f^*(\hat{x}) = x
\partial f(x) = \{g~\colon~ f(z) \geq f(x) + \langle g, z - x \rangle~\forall z\}

Some Properties

\sup_{x} \langle \hat{x}, x \rangle - f(x)

attained at any $$x \in \partial f^*(\hat{x})$$

\hat{x} \in \partial f(x) \iff \partial f^*(\hat{x}) \ni x
\partial f(x) = \{\nabla f(x)\}~\text{if differentiable}
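A standard worked example (not from the slides) that ties these properties together: the quadratic is its own conjugate.

```latex
f(x) = \tfrac{1}{2}\lVert x \rVert^2
\quad\Longrightarrow\quad
f^*(\hat{x}) = \sup_{x}\Big(\langle \hat{x}, x \rangle - \tfrac{1}{2}\lVert x \rVert^2\Big)
             = \tfrac{1}{2}\lVert \hat{x} \rVert^2
```

Here the supremum is attained at $$x = \hat{x} = \nabla f^*(\hat{x})$$, so $$\hat{x} = \nabla f(x) \iff x = \nabla f^*(\hat{x})$$ holds with $$\nabla f = \nabla f^* = \mathrm{id}$$, and indeed $$f^{**} = f$$.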
\displaystyle P(w) = \frac{1}{n} \sum_{i = 1}^n \phi_i(w^Tx_i) + \frac{\lambda}{2} \lVert {w} \rVert^2
\displaystyle D(\alpha) = - \frac{1}{n} \sum_{i = 1}^n \phi_i^*(-\alpha_i) - \frac{\lambda}{2} \Big\lVert\frac{1}{\lambda n} \sum_i \alpha_i x_i \Big\rVert^2
\displaystyle G(w, \alpha) = \frac{1}{n} \sum_{i = 1}^n\Big( - \alpha_i (x_i^T w) - \phi_i^*(-\alpha_i) \Big) + \frac{\lambda}{2} \lVert {w} \rVert^2
Duality: for all $$w$$ and $$\alpha$$,

P(w) \geq D(\alpha),

and

P(w^*) = D(\alpha^*)

with

w^* = w(\alpha^*) \coloneqq \frac{1}{\lambda n} \sum_{i} \alpha_i^* x_i

Duality gap:

P(w(\alpha)) - D(\alpha)

SDCA

For $$t = 1,2,\dots, T$$

Pick $$i$$ uniformly at random from $$\{1, \dotsc, n\}$$

$$\Delta \alpha_i$$ maximizes $$D(\alpha^{(t -1)} + \Delta \alpha_i e_i)$$

\displaystyle \alpha^{(t)} = \alpha^{(t -1)} + \Delta \alpha_i e_i
\displaystyle w^{(t)} = w(\alpha^{(t)})

End For

\displaystyle \bar{w} = \frac{1}{T - T_0} \sum_{t = T_0 + 1}^T w^{(t)}
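For the squared loss $$\phi_i(z) = \tfrac{1}{2}(z - y_i)^2$$, the coordinate maximization over $$\Delta \alpha_i$$ has a closed form, so the loop above can be sketched directly. This is a toy ridge instance; the function name `sdca_ridge` and the constants `lam`, `T` are illustrative choices:

```python
import numpy as np

def sdca_ridge(X, y, lam, T, rng):
    """SDCA sketch for squared loss phi_i(z) = 0.5*(z - y_i)^2.

    For this loss, maximizing D(alpha + delta * e_i) over delta gives the
    closed form delta = (y_i - x_i^T w - alpha_i) / (1 + ||x_i||^2 / (lam * n)).
    """
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                      # invariant: w = (1/(lam*n)) * sum_i alpha_i x_i
    for _ in range(T):
        i = rng.integers(n)
        delta = (y[i] - X[i] @ w - alpha[i]) / (1.0 + X[i] @ X[i] / (lam * n))
        alpha[i] += delta                # alpha^{(t)} = alpha^{(t-1)} + delta * e_i
        w += delta * X[i] / (lam * n)    # w^{(t)} = w(alpha^{(t)}), updated incrementally
    return w, alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
y = X @ np.array([1.0, -1.0, 2.0])
w, alpha = sdca_ridge(X, y, lam=0.01, T=6000, rng=rng)
```

Note that no step size is tuned: each coordinate step is an exact maximization, so $$D(\alpha^{(t)})$$ increases monotonically.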

Lemma

For any $$s \in [0,1]$$, we have

\displaystyle \mathbb{E}[D(\alpha^{(t)}) - D(\alpha^{(t-1)})] \geq \frac{s}{n} \mathbb{E}[P(w^{(t-1)}) - D(\alpha^{(t-1)})] - \Big(\frac{s}{n}\Big)^2 \frac{G^{(t)}}{2 \lambda}

where

\displaystyle G^{(t)} \coloneqq \frac{1}{n} \sum_{i = 1}^n \Big(\lVert x_i \rVert^2 - \frac{\gamma (1 - s) \lambda n}{s} \Big) \mathbb{E}[(u_i^{(t-1)} - \alpha_i^{(t-1)})^2]

with

\displaystyle - u_i^{(t-1)} \in \partial \phi_i(x_i^T w^{(t-1)})

Theorem 2 (rephrased)

Assumptions (for simplicity):

\displaystyle \lVert x_i \rVert \leq 1,
\displaystyle \phi_i(\cdot) \geq 0,
\displaystyle \phi_i(0) \leq 1

Then

\displaystyle \mathbb{E}[P(w^{(t)}) - D(\alpha^{(t)})] \leq \frac{n}{s} \Big( 1 - \frac{1}{\rho + n}\Big)^t

and

\displaystyle \mathbb{E}[P(\bar{w}) - D(\bar{\alpha})] \leq \frac{n}{s(T - T_0)} \Big( 1 - \frac{1}{\rho + n}\Big)^{T_0}


# SAG & SAGA

$$d$$ - Dimension of our problem

\displaystyle \min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n}\sum_{i = 1}^n f_i(x)

$$f_i$$ - Loss with respect to data point $$i$$

$$n$$ - Number of examples/data points

Assumptions:

$$f_i$$ is $$\mu$$-strongly-convex

\displaystyle f_i(x) \geq f_i(y) + \nabla f_i(y)^T (x - y) + \frac{\mu}{2} \lVert x - y \rVert^2

$$f_i$$ is $$L$$-smooth

\displaystyle f_i(x) \leq f_i(y) + \nabla f_i(y)^T (x - y) + \frac{L}{2} \lVert x - y \rVert^2

Condition number: $$\rho = \frac{L}{\mu}$$

SAGA

For $$k = 1,2,\dots, T$$

Pick $$j$$ uniformly at random from $$\{1, \dotsc, n\}$$

\displaystyle x^{k+1} = x^{k} - \gamma \left[ \nabla f_j(x^k) - \nabla f_j(\phi_j) + \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\phi_i) \right]

$$\phi_j \gets x^k$$ and store $$\nabla f_j(\phi_j)$$

End For

The update direction is UNBIASED!
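A minimal sketch of the SAGA loop on a toy least-squares problem. Keeping the gradient table plus a running average is the standard implementation trick (one fresh gradient per step); the instance and the step-size estimate are illustrative:

```python
import numpy as np

def saga(X, y, T, rng):
    """SAGA sketch for f_i(x) = 0.5*(x_i^T x - y_i)^2, with gamma = 1/(3L)."""
    n, d = X.shape
    gamma = 1.0 / (3.0 * np.max(np.sum(X**2, axis=1)))   # crude 1/(3L)
    x = np.zeros(d)
    table = -y[:, None] * X            # stored grad f_i(phi_i), with phi_i = 0 initially
    avg = table.mean(axis=0)           # running (1/n) sum_i grad f_i(phi_i)
    for _ in range(T):
        j = rng.integers(n)
        g_new = (X[j] @ x - y[j]) * X[j]            # grad f_j(x^k)
        x = x - gamma * (g_new - table[j] + avg)    # unbiased variance-reduced step
        avg = avg + (g_new - table[j]) / n          # phi_j <- x^k: refresh the average...
        table[j] = g_new                            # ...and the stored gradient
    return x

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
x_true = np.array([2.0, 0.0, -1.0])
y = X @ x_true
x_out = saga(X, y, T=10000, rng=rng)
```

The old table entry is used in the step and only then overwritten; conditioned on the past, the bracketed direction has expectation $$\nabla f(x^k)$$.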

Convergence Rates

\displaystyle \mathbb{E}\lVert x^k - x^* \rVert^2 \lesssim \left( 1 - \frac{1}{2(n + \rho)}\right)^k

with

\displaystyle \gamma = \frac{1}{2(\mu n + L)}

and

\displaystyle \mathbb{E}\lVert x^k - x^* \rVert^2 \lesssim \left( 1 - \min\left\{\frac{1}{4 n}, \frac{1}{3\rho} \right\}\right)^k

with

\displaystyle \gamma = \frac{1}{3 L}

The second step size is independent of $$\mu$$.

The convergence proof for SAGA shows the decrease of a Lyapunov function $$T^k$$.

SAG

For $$k = 1,2,\dots, T$$

Pick $$j$$ uniformly at random from $$\{1, \dotsc, n\}$$

$$\phi_j \gets x^k$$ and store $$\nabla f_j(\phi_j)$$

\displaystyle x^{k+1} = x^{k} - \gamma \cdot \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\phi_i)

End For

The update direction is BIASED!
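The SAG loop differs from SAGA only in which direction is used: the full averaged table, whose entries live at stale points, hence the bias. A sketch on the same kind of toy problem (illustrative instance and constants):

```python
import numpy as np

def sag(X, y, T, rng):
    """SAG sketch for f_i(x) = 0.5*(x_i^T x - y_i)^2, with gamma = 1/(16L)."""
    n, d = X.shape
    gamma = 1.0 / (16.0 * np.max(np.sum(X**2, axis=1)))  # crude 1/(16L)
    x = np.zeros(d)
    table = -y[:, None] * X            # stored grad f_i(phi_i), with phi_i = 0 initially
    avg = table.mean(axis=0)
    for _ in range(T):
        j = rng.integers(n)
        g_new = (X[j] @ x - y[j]) * X[j]     # phi_j <- x^k: recompute grad f_j
        avg = avg + (g_new - table[j]) / n
        table[j] = g_new
        x = x - gamma * avg                  # biased: averages gradients at stale points
    return x

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 3))
x_true = np.array([0.5, 1.5, -0.5])
y = X @ x_true
x_out = sag(X, y, T=40000, rng=rng)
```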

Convergence Rate

\displaystyle \mathbb{E}\lVert x^k - x^* \rVert^2 \lesssim \left( 1 - \min\left\{\frac{1}{8 n}, \frac{1}{16\rho} \right\}\right)^k

with

\displaystyle \gamma = \frac{1}{16 L}

This step size is independent of $$\mu$$.

Smaller Variance?

Comparison of updates (from SAGA paper)

Comparison of features (from SAGA paper)

# ... and beyond

Finito

For $$k = 1,2,\dots, T$$

Pick $$j$$ uniformly at random from $$\{1, \dotsc, n\}$$

$$\phi_j \gets x^k$$ and store $$\nabla f_j(\phi_j)$$

\displaystyle x^{k+1} = \frac{1}{n} \sum_{i = 1}^n \phi_i - \gamma \cdot \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\phi_i)

End For

Convergence Rate

\displaystyle \mathbb{E}\left[ f\left( \frac{1}{n}\sum_{i = 1}^n \phi_i^k \right) - f(x^*)\right] \lesssim \left( 1 - \frac{1}{2n}\right)^k

with

\displaystyle \gamma = \frac{1}{2 \mu} \quad \text{and} \quad n \gtrsim \rho

Good speed-ups with permutation of data, but no theory (?)
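A sketch of the Finito loop. Since each $$f_i$$ must be strongly convex, a $$\tfrac{\mu}{2}\lVert x \rVert^2$$ term is added to every component, and the rows are normalized so that $$n \gtrsim \rho$$ comfortably holds; everything here (names, constants) is illustrative:

```python
import numpy as np

def finito(X, y, mu, T, rng):
    """Finito sketch for f_i(x) = 0.5*(x_i^T x - y_i)^2 + (mu/2)||x||^2, gamma = 1/(2 mu)."""
    n, d = X.shape
    gamma = 1.0 / (2.0 * mu)
    phi = np.zeros((n, d))                                  # stored points phi_i
    grads = -y[:, None] * X + mu * phi                      # grad f_i(phi_i) at phi_i = 0
    for _ in range(T):
        x = phi.mean(axis=0) - gamma * grads.mean(axis=0)   # x^{k+1}: uses only the tables
        j = rng.integers(n)
        phi[j] = x                                          # phi_j <- x^{k+1}
        grads[j] = (X[j] @ x - y[j]) * X[j] + mu * x
    return phi.mean(axis=0) - gamma * grads.mean(axis=0)

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 2))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit rows: L = 1 + mu, so rho = 2 << n
b = A @ np.array([1.0, -1.0])
x_out = finito(A, b, mu=1.0, T=4000, rng=rng)
```

Unlike SAG/SAGA, the iterate is recomputed from the stored points themselves, so the method stores $$n$$ points in addition to $$n$$ gradients.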

Acceleration

So far, to get to $$\varepsilon$$-opt we required $$O((n + \rho)\log(1/\varepsilon))$$ iterations

There are accelerated methods that require

$$O((n + \sqrt{n\rho})\log(1/\varepsilon))$$ iterations

Good when $$\rho \gg n$$

Examples

Accelerated SDCA

Catalyst: Accelerate any VR method

Katyusha: Direct acceleration of VR with "negative momentum"

VR for non-convex problems

SPIDER

Variance reduction for $$f_i$$ smooth and $$f$$ lower bounded

$$O(n + \sqrt{n}/\varepsilon^2)$$ for $$\varepsilon$$-opt

Matches the Lower Bound

Based on "online SVRG":

g^k = \nabla f_j(x^k) - \nabla f_j(x^{k-1}) + g^{k-1}

With resets
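The recursive estimator with resets can be sketched as follows. The convex toy problem is only there to exercise the estimator; SPIDER proper targets non-convex $$f$$ and uses normalized steps, both of which this illustrative sketch omits:

```python
import numpy as np

def spider_style(X, y, eta, q, T, rng):
    """Recursive 'online SVRG' estimator with a full-gradient reset every q steps."""
    n, d = X.shape
    x_prev = np.zeros(d)
    x = np.zeros(d)
    g = np.zeros(d)
    for k in range(T):
        if k % q == 0:
            g = X.T @ (X @ x - y) / n        # reset: g^k = grad f(x^k)
        else:
            j = rng.integers(n)
            # g^k = grad f_j(x^k) - grad f_j(x^{k-1}) + g^{k-1}
            g = ((X[j] @ x - y[j]) - (X[j] @ x_prev - y[j])) * X[j] + g
        x_prev, x = x, x - eta * g
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 2))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit rows: well-conditioned toy
b = A @ np.array([1.0, 2.0])
x_out = spider_style(A, b, eta=0.1, q=5, T=5000, rng=rng)
```

Between resets the estimator drifts by accumulated sampling noise, which is why the reset period and step size must be kept small relative to the smoothness.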

Non-uniform sampling

Improve beginning of VR methods

Second order

Local/Federated VR

#### Variance Reduction Study Group

By Victor Sanches Portella
