Victor Sanches Portella
MLRG @ UBC
August 11, 2021
Interesting application of Riemannian optimization to an important problem in ML
The technical details are not crucial here
Still, we will see enough to get a flavor of the field
A GMM is a convex combination of a fixed number of Gaussians
Image source: https://angusturner.github.io/assets/images/mixture.png
Density of a \(K\)-mixture on \(\mathbb{R}^d\):
\[ p(x) = \sum_{i=1}^{K} \alpha_i \, p(x ; \mu_i, \Sigma_i) \]
where \(\mu_i, \Sigma_i\) are the mean and covariance of the \(i\)-th Gaussian, \(p(\,\cdot\, ; \mu_i, \Sigma_i)\) is the Gaussian density, and \(\alpha_i \geq 0\) is the weight of the \(i\)-th Gaussian (the weights must sum to 1)
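As a quick illustration (mine, not from the paper), a minimal NumPy/SciPy sketch of this density; the parameters below are made up:

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_density(x, alphas, means, covs):
        # p(x) = sum_i alpha_i * N(x; mu_i, Sigma_i)
        return sum(a * multivariate_normal.pdf(x, mean=m, cov=S)
                   for a, m, S in zip(alphas, means, covs))

    # Toy 2-component mixture on R^2 (illustrative parameters only)
    alphas = [0.3, 0.7]                                  # weights, sum to 1
    means  = [np.zeros(2), np.array([3.0, 0.0])]
    covs   = [np.eye(2), np.array([[2.0, 0.5], [0.5, 1.0]])]
    print(gmm_density(np.array([1.0, 0.0]), alphas, means, covs))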
Find GMM most likely to generate data points \(x_1, \dotsc, x_n \)
Image source: https://towardsdatascience.com/gaussian-mixture-models-d13a5e915c8e
Idea: use means \(\mu_i\) and covariances \(\Sigma_i\) that maximize log-likelihood
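Concretely, writing the weights \(\alpha_i\) explicitly, the objective is the standard GMM log-likelihood:
\[ \mathcal{L}(\alpha, \mu, \Sigma) = \sum_{j=1}^{n} \log \sum_{i=1}^{K} \alpha_i \, p(x_j ; \mu_i, \Sigma_i), \qquad \text{subject to } \Sigma_i \succ 0,\ \alpha_i \geq 0,\ \textstyle\sum_i \alpha_i = 1 \]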
The constraint set \(\Sigma_i \succ 0\) is hard to handle directly
It is an open set, so there is no clear projection onto it
Classical optimization methods (Newton, CG, IPM) struggle
They slow down when the iterates get close to the boundary
IPM works, but it is slow
Force positive definiteness via a Cholesky factorization \(\Sigma = LL^{\top}\) (see the sketch below)
Adds spurious maxima/minima
Slow
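A minimal sketch of the Cholesky idea (my own illustration, not the paper's code): parameterize \(\Sigma = LL^{\top}\) with \(L\) lower triangular, so every unconstrained parameter vector yields a valid covariance:

    import numpy as np

    def theta_to_cov(theta, d):
        # Fill a lower-triangular L from the unconstrained vector theta,
        # then return Sigma = L @ L.T, positive semidefinite by construction.
        L = np.zeros((d, d))
        L[np.tril_indices(d)] = theta
        return L @ L.T

    d = 3
    theta = np.random.randn(d * (d + 1) // 2)   # any unconstrained vector works
    print(theta_to_cov(theta, d))
    # Downside (as noted above): the reparameterized objective can pick up
    # spurious stationary points, and the approach is slow in practice.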
Expectation Maximization guarantees \(\succ 0\) "for free"
Easy for GMMs
Fast in practice
EM: an iterative method to find (locally) maximum-likelihood parameters
E-step: Fix the parameters and form the "weighted log-likelihood"
M-step: Find new parameters by maximizing the "weighted log-likelihood"
For GMMs:
E-step: Given \(\alpha_i, \mu_i, \Sigma_i\), compute the responsibilities \(P(x_j \in \mathcal{N}_i) \propto \alpha_i \, p(x_j ; \mu_i, \Sigma_i)\)
M-step: Maximize the weighted log-likelihood with weights \(P(x_j \in \mathcal{N}_i)\) (one full iteration is sketched below)
CLOSED FORM SOLUTION FOR GMMs!
Guarantees \(\succ 0\) for free
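A minimal NumPy sketch of one EM iteration for a GMM (my own illustration of the closed-form updates, not the paper's code):

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_step(X, weights, means, covs):
        # One EM iteration for a GMM; X has shape (n, d).
        n, d = X.shape
        K = len(weights)
        # E-step: responsibilities r[j, i] = P(x_j in component i | current parameters)
        r = np.column_stack([w * multivariate_normal.pdf(X, mean=m, cov=S)
                             for w, m, S in zip(weights, means, covs)])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizers of the weighted log-likelihood
        Nk = r.sum(axis=0)                       # effective counts per component
        new_weights = Nk / n
        new_means = (r.T @ X) / Nk[:, None]
        new_covs = []
        for i in range(K):
            Xc = X - new_means[i]
            new_covs.append((r[:, i, None] * Xc).T @ Xc / Nk[i])  # stays PSD "for free"
        return new_weights, new_means, new_covs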
Riemannian metric on \(\mathbb{P}^d\) at \(\Sigma\):
\[ \langle A, B \rangle_{\Sigma} = \operatorname{tr}\!\big( \Sigma^{-1} A \, \Sigma^{-1} B \big) \]
Tangent space at a point: all symmetric matrices
Remark: The metric blows up when \(\Sigma\) is close to the boundary (see the short sketch below)
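In code, this inner product is just two linear solves and a trace (my own sketch); near the boundary the solves become ill-conditioned, which is the blow-up above:

    import numpy as np

    def metric_inner(Sigma, A, B):
        # <A, B>_Sigma = tr(Sigma^{-1} A Sigma^{-1} B)
        SA = np.linalg.solve(Sigma, A)   # Sigma^{-1} A (ill-conditioned near the boundary)
        SB = np.linalg.solve(Sigma, B)   # Sigma^{-1} B
        return np.trace(SA @ SB)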
General steps:
1 - Find a descent direction
2 - Perform a line search and take the step with a retraction
Retraction: a way to move on a manifold along a tangent direction
It does not necessarily follow a geodesic!
In this paper, we will only use the Exponential Map:
\[ \operatorname{Exp}_{\Sigma}(D) = \gamma_D(1), \]
where \(\gamma_D(t)\) is the geodesic starting at \(\Sigma\) with direction \(D\)
Geodesic between \(\Sigma\) and \(\Gamma\) (with \(\Sigma\) at \(t = 0\) and \(\Gamma\) at \(t = 1\)):
\[ \gamma(t) = \Sigma^{1/2} \big( \Sigma^{-1/2} \Gamma \, \Sigma^{-1/2} \big)^{t} \Sigma^{1/2} \]
From this, we get the explicit form of the exponential map (a code sketch follows below):
\[ \operatorname{Exp}_{\Sigma}(D) = \Sigma^{1/2} \exp\!\big( \Sigma^{-1/2} D \, \Sigma^{-1/2} \big) \Sigma^{1/2} \]
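A minimal NumPy sketch of this exponential map (my own illustration, assuming the metric above):

    import numpy as np
    from scipy.linalg import expm, sqrtm

    def exp_map(Sigma, D):
        # Exp_Sigma(D) = Sigma^{1/2} expm(Sigma^{-1/2} D Sigma^{-1/2}) Sigma^{1/2},
        # where D is a symmetric matrix (tangent direction at Sigma).
        S_half = np.real(sqrtm(Sigma))
        S_half_inv = np.linalg.inv(S_half)
        out = S_half @ expm(S_half_inv @ D @ S_half_inv) @ S_half
        return (out + out.T) / 2   # symmetrize against round-off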
GD on the Riemannian manifold \(\mathbb{P}^d\):
\[ \Sigma_{t+1} = \operatorname{Exp}_{\Sigma_t}\!\big( -\eta_t \operatorname{grad} f(\Sigma_t) \big) \]
Here \(\operatorname{grad} f\) is the Riemannian gradient of \(f\); it depends on the metric!
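For the metric above, the Riemannian gradient has a simple expression in terms of the Euclidean gradient (a standard fact for this metric, stated here for concreteness):
\[ \operatorname{grad} f(\Sigma) = \Sigma \, \nabla f(\Sigma) \, \Sigma, \]
where \(\nabla f(\Sigma)\) is taken to be symmetric.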
If we have time, we shall see conditions for convergence of SGD
The methods the authors use are Conjugate Gradient and LBFGS.
Traditional gradient descent (GD), for comparison: \( x_{t+1} = x_t - \eta_t \nabla f(x_t) \)
Defining those on a manifold depends on the ideas of vector transport and of Riemannian Hessians; I'll skip these descriptions
We have a geometric structure on \(\mathbb{P}^d\); the remaining variables (e.g., the means \(\mu_i\)) can stay in the usual Euclidean space.
We can apply methods for Riemannian optimization!
... but they are way slower than traditional EM
Maximum Likelihood estimation for a single Gaussian: \(\max_{\mu,\, \Sigma \succ 0} \sum_{j=1}^{n} \log p(x_j ; \mu, \Sigma)\)
\(\mathcal{L}\) is concave!
In fact, the problem above has a closed-form solution
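For reference, the closed-form maximizer is the sample mean and sample covariance:
\[ \hat{\mu} = \frac{1}{n} \sum_{j=1}^{n} x_j, \qquad \hat{\Sigma} = \frac{1}{n} \sum_{j=1}^{n} (x_j - \hat{\mu})(x_j - \hat{\mu})^{\top}. \]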
In a Riemannian manifold, the meaning of convexity changes
Geodesic convexity (g-convexity): \(f\) is g-convex if, for all \(\Sigma_1, \Sigma_2\) and \(t \in [0, 1]\),
\[ f(\gamma(t)) \leq (1 - t)\, f(\Sigma_1) + t \, f(\Sigma_2), \]
where \(\gamma\) is the geodesic from \(\Sigma_1\) to \(\Sigma_2\)
Maximum Likelihood estimation for a single Gaussian, revisited: \(\mathcal{L}\) is concave... but it is not geodesically concave on \(\mathbb{P}^d\)!
Solution: lift the function to a higher dimension!
Write each component in terms of a single augmented matrix \(S_i \in \mathbb{P}^{d+1}\) acting on the lifted points \(\hat{x}_j = (x_j, 1)\) (see below)
For now, fix the weights \(\alpha_1, \dotsc, \alpha_K\)
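For concreteness, the lift used in the paper packs each pair \((\mu_i, \Sigma_i)\) (up to normalization constants) into one augmented PD matrix acting on the lifted data:
\[ \hat{x}_j = \begin{pmatrix} x_j \\ 1 \end{pmatrix} \in \mathbb{R}^{d+1}, \qquad S_i = \begin{pmatrix} \Sigma_i + \mu_i \mu_i^{\top} & \mu_i \\ \mu_i^{\top} & 1 \end{pmatrix} \in \mathbb{P}^{d+1}. \]
With this parameterization, the single-Gaussian log-likelihood becomes geodesically concave, which is the whole point of the lift.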
What about \(\alpha_1, \dotsc, \alpha_K\)? Reparameterize, e.g. \(\alpha_i = e^{\eta_i} / \sum_k e^{\eta_k}\), to use unconstrained optimization
I think there are not many guarantees on final \(\alpha_j\)'s
PS: Reformulation helps Cholesky as well, but this information disappeared from the final version
We want to minimize \(f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)\) over a Riemannian manifold
Riemannian Stochastic Gradient Descent:
\[ x_{t+1} = \operatorname{Exp}_{x_t}\!\big( -\eta_t \operatorname{grad} f_{i_t}(x_t) \big), \qquad i_t \text{ a random index} \]
Assumptions on \(f\):
Riemannian Lipschitz smooth
Unbiased gradients
Bounded variance
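A minimal sketch of one such stochastic step on \(\mathbb{P}^d\) under the metric from before (my own illustration; euclid_grad_fi is a placeholder for the Euclidean gradient of the sampled term \(f_{i_t}\), not specified here):

    import numpy as np
    from scipy.linalg import expm, sqrtm

    def rsgd_step(Sigma, euclid_grad_fi, eta):
        # Riemannian gradient under the affine-invariant metric: Sigma * sym(grad) * Sigma
        G = euclid_grad_fi(Sigma)
        rgrad = Sigma @ ((G + G.T) / 2) @ Sigma
        # Step along the exponential map: Exp_Sigma(-eta * rgrad)
        S_half = np.real(sqrtm(Sigma))
        S_half_inv = np.linalg.inv(S_half)
        out = S_half @ expm(S_half_inv @ (-eta * rgrad) @ S_half_inv) @ S_half
        return (out + out.T) / 2   # symmetrize against round-off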
Theorem. If one of the following holds:
i) Bounded gradients: \(\|\operatorname{grad} f_i(x)\| \leq G\) for all \(x\) and \(i\) (in this case there is no need for bounded variance), or
ii) the algorithm outputs a uniformly random iterate,
then, for suitable step sizes, \(\mathbb{E}\big[ \| \operatorname{grad} f(x_\tau) \|^2 \big] \to 0\), where \(x_\tau\) is the best iterate in case (i) and a uniformly random iterate in case (ii).
Do the conditions for SGD to converge hold for GMMs?
Theorem. If \(\eta_t \leq 1\), then the iterates stay in a compact set.
(For some reason they analyze SGD with a different retraction...)
Remark. The objective function is not Riemannian Lipschitz smooth outside of a compact set
Fact. If the iterates stay in a compact set, then the gradients are bounded and the objective is classically Lipschitz smooth on that set, which, by a theorem from Riemannian geometry, implies Riemannian Lipschitz smoothness.
With a different retraction and at a slower rate...?