An Alternative to EM for Gaussian Mixture Models
Victor Sanches Portella
MLRG @ UBC
August 11, 2021
Why this paper?
Interesting application of Riemannian optimization to an important problem in ML
The technical details are not important
Yet there is enough here to get a sense of the flavor of the field
Gaussian Mixture Models
Gaussian Mixture Models
A GMM is a convex combination of a fixed number of Gaussians
Image source: https://angusturner.github.io/assets/images/mixture.png
Density of a \(K\)-mixture on \(\mathbb{R}^d\):
\[ x \mapsto \sum_{i=1}^{K} \alpha_i \, p(x ; \mu_i, \Sigma_i) \]
\(\mu_i, \Sigma_i\): mean and covariance of the \(i\)-th Gaussian
\(p(\,\cdot\, ; \mu_i, \Sigma_i)\): Gaussian density
\(\alpha_i\): weight of the \(i\)-th Gaussian (the weights must sum to 1)
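To make this concrete, here is a minimal numerical sketch (my own illustration with numpy/scipy, not code from the paper) that evaluates the log-likelihood of a \(K\)-mixture on data points \(x_1, \dotsc, x_n\):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, alphas, mus, Sigmas):
    # X: (n, d) data; alphas: (K,) weights summing to 1;
    # mus: list of K mean vectors; Sigmas: list of K covariance matrices.
    densities = np.column_stack([
        multivariate_normal.pdf(X, mean=mu, cov=Sigma)   # p(x_j; mu_i, Sigma_i)
        for mu, Sigma in zip(mus, Sigmas)
    ])                                                    # shape (n, K)
    mixture = densities @ np.asarray(alphas)              # mixture density at each x_j
    return np.sum(np.log(mixture))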
Fitting a GMM
Find GMM most likely to generate data points \(x_1, \dotsc, x_n\)
Image source: https://towardsdatascience.com/gaussian-mixture-models-d13a5e915c8e
Idea: use means \(\mu_i\) and covariances \(\Sigma_i\) that maximize log-likelihood
Constraint: each \(\Sigma_i \succ 0\), a set that is hard to describe and open, with no clear projection onto it
Optimization vs "\(\succ 0\)"
Classical optimization methods (Newton, CG, IPM) struggle
Slowdown when iterates get close to the boundary
IPM works, but it is slow
Force positive definiteness via a "Cholesky decomposition" (see the note at the end of this slide)
Adds spurious maxima/minima
Slow
Expectation Maximization guarantees \(\succ 0\) "for free"
Easy for GMMs
Fast in practice
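For reference, the reparameterization in question, sketched under the standard construction (the slide does not spell it out): write
\[ \Sigma = L L^\top, \qquad L \text{ lower triangular with positive diagonal entries}, \]
and optimize over the unconstrained entries of \(L\) (e.g., the strictly lower-triangular entries plus the logarithms of the diagonal); any such \(L\) gives \(\Sigma \succ 0\).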
EM in one slide
EM: an iterative method to find (locally) maximum-likelihood parameters
E-step: Fix parameters, find "weighted log-likelihood"
M-step: Find parameters by maximizing the "weighted log-likelihood"
For GMMs:
E-step: Given \(\mu_i, \Sigma_i\), compute \(P(x_j \in \mathcal{N}_i) \propto \alpha_i \, p(x_j ; \mu_i, \Sigma_i)\)
M-step: Maximize the weighted log-likelihood with weights \(P(x_j \in \mathcal{N}_i)\)
CLOSED FORM SOLUTION FOR GMMs!
Guarantees \(\succ 0\) for free
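As a concrete reference, a minimal numpy/scipy sketch of one EM iteration for a GMM with the textbook closed-form M-step (my illustration, not the paper's code):

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, alphas, mus, Sigmas):
    # One EM iteration for a GMM (textbook updates; illustration only).
    n, _ = X.shape
    K = len(alphas)
    # E-step: responsibilities r[j, i] = P(x_j in N_i | current parameters)
    r = np.column_stack([
        alphas[i] * multivariate_normal.pdf(X, mean=mus[i], cov=Sigmas[i])
        for i in range(K)
    ])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizers of the weighted log-likelihood
    Nk = r.sum(axis=0)                                   # effective counts per component
    new_alphas = Nk / n
    new_mus = [r[:, i] @ X / Nk[i] for i in range(K)]
    new_Sigmas = [((r[:, i] * (X - new_mus[i]).T) @ (X - new_mus[i])) / Nk[i]
                  for i in range(K)]
    return new_alphas, new_mus, new_Sigmas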
Riemannian Optimization
Manifold of PD Matrices
Riemannian metric at \(\Sigma \succ 0\): \(\langle A, B \rangle_\Sigma = \mathrm{tr}\bigl(\Sigma^{-1} A \Sigma^{-1} B\bigr)\)
Tangent space at a point: all symmetric matrices
Remark: Blows up when \(\Sigma\) is close to the boundary
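A standard fact about this metric, stated here for reference: the induced geodesic distance is
\[ d(\Sigma, \Gamma) = \bigl\| \log\bigl(\Sigma^{-1/2}\, \Gamma\, \Sigma^{-1/2}\bigr) \bigr\|_F , \]
which tends to infinity as either matrix approaches a singular matrix, matching the remark above.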
Optimization Structure
General steps:
1 - Find descent direction
2 - Perform line-search and step with retraction
Retraction: a way to move on a manifold along a tangent direction
It does not necessarily follow a geodesic!
In this paper, we will only use the Exponential Map: \(\mathrm{Exp}_\Sigma(D) = \gamma_D(1)\),
where \(\gamma_D\) is the geodesic starting at \(\Sigma\) with direction \(D\)
The Exponential Map
\(\mathrm{Exp}_\Sigma(D) = \gamma_D(1)\), where \(\gamma_D(t)\) is the geodesic starting at \(\Sigma\) with direction \(D\)
Geodesic between \(\Sigma\) and \(\Gamma\): \(\gamma(t) = \Sigma^{1/2} \bigl(\Sigma^{-1/2} \Gamma \Sigma^{-1/2}\bigr)^{t}\, \Sigma^{1/2}\), so that \(\gamma(0) = \Sigma\) and \(\gamma(1) = \Gamma\)
From this, we get the closed form of the exponential map: \(\mathrm{Exp}_\Sigma(D) = \Sigma^{1/2} \exp\bigl(\Sigma^{-1/2} D\, \Sigma^{-1/2}\bigr)\, \Sigma^{1/2}\)
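A minimal numerical sketch of this exponential map (my own illustration with scipy, assuming the closed form above):

import numpy as np
from scipy.linalg import expm, sqrtm

def exp_map(Sigma, D):
    # Exp_Sigma(D) = Sigma^{1/2} expm(Sigma^{-1/2} D Sigma^{-1/2}) Sigma^{1/2}
    S_half = np.real(sqrtm(Sigma))        # Sigma^{1/2}
    S_half_inv = np.linalg.inv(S_half)    # Sigma^{-1/2}
    return S_half @ expm(S_half_inv @ D @ S_half_inv) @ S_half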
Gradient Descent
Traditional gradient descent (GD): \(x_{t+1} = x_t - \eta_t \nabla f(x_t)\)
GD on the Riemannian Manifold \(\mathbb{P}^d\): \(\Sigma_{t+1} = \mathrm{Exp}_{\Sigma_t}\bigl(-\eta_t\, \mathrm{grad} f(\Sigma_t)\bigr)\)
\(\mathrm{grad} f\) is the Riemannian gradient of \(f\). It depends on the metric!
The methods the authors use are Conjugate Gradient and L-BFGS. Defining those depends on the idea of vector transport and Hessians; I'll skip these descriptions.
If we have time, we shall see conditions for convergence of SGD
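Under the metric from a few slides back, the Riemannian gradient also has a simple closed form (a standard fact, stated here for reference):
\[ \mathrm{grad} f(\Sigma) = \Sigma\, \nabla f(\Sigma)\, \Sigma , \]
where \(\nabla f(\Sigma)\) is the (symmetrized) Euclidean gradient.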
Riemannian opt for GMMs
We have a geometric structure on \(\mathbb{P}^d\) for the covariances; the other variables (e.g., the means \(\mu_i\)) can stay in the traditional Euclidean space.
We can apply methods for Riemannian optimization!
... but they are way slower than traditional EM
GMMs and Geodesic Convexity
Single Gaussian Estimation
Maximum Likelihood estimation for a single Gaussian: maximize \(\mathcal{L}(\mu, \Sigma) = \sum_{j=1}^{n} \log p(x_j ; \mu, \Sigma)\) over \(\mu\) and \(\Sigma \succ 0\)
\(\mathcal{L}\) is concave!
In fact, the above has a closed-form solution: the sample mean and the sample covariance
In a Riemannian manifold, the meaning of convexity changes
Geodesic convexity (g-convexity): \(f(\gamma(t)) \le (1 - t)\, f(\Sigma_1) + t\, f(\Sigma_2)\) for all \(t \in [0, 1]\),
where \(\gamma\) is the geodesic from \(\Sigma_1\) to \(\Sigma_2\)
\(\mathcal{L}\) is not g-concave
Solution: lift the function to a higher dimension!
\(\hat{\mathcal{L}}\) is g-concave
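Concretely, one way to write the lift (my reconstruction of the construction; the paper may present the normalization differently): augment each data point to \(y_j = (x_j, 1) \in \mathbb{R}^{d+1}\) and optimize over a single matrix \(S \in \mathbb{P}^{d+1}\),
\[ \hat{\mathcal{L}}(S) = \sum_{j=1}^{n} \log q(y_j ; S), \qquad q(y ; S) = \sqrt{2\pi}\, e^{1/2}\, p(y ; 0, S) , \]
chosen so that \(q(y_j ; S) = p(x_j ; \mu, \Sigma)\) when \(S = \begin{pmatrix} \Sigma + \mu\mu^\top & \mu \\ \mu^\top & 1 \end{pmatrix}\).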
Single Gaussian Estimation
\(\mathcal{L}\) is not g-concave, but \(\hat{\mathcal{L}}\) is g-concave
Theorem If \((\mu^*, \Sigma^*)\) maximizes \(\mathcal{L}\) and \(S^*\) maximizes \(\hat{\mathcal{L}}\), then
\[ S^* = \begin{pmatrix} \Sigma^* + \mu^* (\mu^*)^\top & \mu^* \\ (\mu^*)^\top & 1 \end{pmatrix} \qquad \text{and} \qquad \hat{\mathcal{L}}(S^*) = \mathcal{L}(\mu^*, \Sigma^*) \]
Proof idea:
Write \(S = \begin{pmatrix} \Sigma + s\, \mu\mu^\top & s\, \mu \\ s\, \mu^\top & s \end{pmatrix}\) with \(s > 0\)
Optimal choice of \(s\) is 1 \(\implies\) both \(\mathcal{L}\) and \(\hat{\mathcal{L}}\) agree
PS: \(\Sigma^* \succ 0\) by the Schur complement Lemma
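For reference, the Schur complement fact being used:
\[ S = \begin{pmatrix} A & b \\ b^\top & s \end{pmatrix} \succ 0 \quad \Longleftrightarrow \quad s > 0 \ \text{ and } \ A - \tfrac{1}{s}\, b\, b^\top \succ 0 , \]
which, applied to the block form of \(S^*\), gives \(\Sigma^* \succ 0\).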
Back to Many Gaussians
For now, fix the weights \(\alpha_1, \dotsc, \alpha_K\)
Theorem The local maxima of \(\mathcal{L}\) and of \(\hat{\mathcal{L}}\) agree
What about \(\alpha_1, \dotsc, \alpha_K\)? Reparameterize to use unconstrained opt
I think there are not many guarantees on final \(\alpha_j\)'s
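One standard unconstrained reparameterization of the weights (an illustration; not necessarily the exact one used in the paper):
\[ \alpha_i = \frac{e^{\eta_i}}{\sum_{k=1}^{K} e^{\eta_k}}, \qquad \eta_1, \dotsc, \eta_K \in \mathbb{R} , \]
which keeps the \(\alpha_i\) positive and summing to 1 for any unconstrained \(\eta\).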
Plots!
The effect of the lifting on convergence
Plots!
EM vs Riemannian Opt
PS: The reformulation helps the Cholesky approach as well, but this information disappeared from the final version of the paper
Remarks on What I Skipped
Step-size Line Search
Lifting with Regularization
Details on vector transport and 2nd order methods
Riemannian SGD
Riemannian SGD
We want to minimize \(f(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)\) over a Riemannian manifold
Riemannian Stochastic Gradient Descent: \(x_{t+1} = \mathrm{Exp}_{x_t}\bigl(-\eta_t\, \mathrm{grad} f_{i_t}(x_t)\bigr)\), with \(i_t\) a random index
Assumptions on \(f\):
Riemannian Lipschitz smoothness
Unbiased gradients
Bounded variance
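For reference, the usual forms of these three assumptions (standard statements; my phrasing rather than a quote from the paper):
\[ f\bigl(\mathrm{Exp}_x(v)\bigr) \le f(x) + \langle \mathrm{grad} f(x), v \rangle_x + \tfrac{L}{2} \|v\|_x^2 , \qquad \mathbb{E}\bigl[\mathrm{grad} f_{i_t}(x)\bigr] = \mathrm{grad} f(x) , \qquad \mathbb{E}\bigl\| \mathrm{grad} f_{i_t}(x) - \mathrm{grad} f(x) \bigr\|_x^2 \le \sigma^2 . \]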
Riemannian SGD
Theorem If one of the following holds:
i) Bounded gradients: \(\|\mathrm{grad} f_i(x)\| \le G\) for all \(i\) and \(x\) (no need for the bounded-variance assumption)
ii) The algorithm outputs a uniformly random iterate
Then \(\mathbb{E}\bigl[\|\mathrm{grad} f(x_\tau)\|^2\bigr] \to 0\), where \(\tau\) is the best iterate under (i) and a uniformly random iterate under (ii)
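The guarantee has the usual nonconvex-SGD flavor; roughly (my paraphrase, with constants and the exact step-size conditions omitted):
\[ \mathbb{E}\bigl[\|\mathrm{grad} f(x_\tau)\|^2\bigr] = O\bigl(1/\sqrt{T}\bigr) \quad \text{for suitably chosen step sizes } \eta_t . \]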
Does SGD work for GMM?
Do the conditions for SGD to converge hold for GMMs?
Theorem If \(\eta_t \leq 1\), then iterates stay in a compact set.
(For some reason they analyze SGD with a different retraction...)
Remark The objective function is not Riemannian Lipschitz smooth outside of a compact set
Fact Iterates stay in a compact set \(\implies\) bounded gradients and classical Lipschitz smoothness
Classical Lipschitz smoothness (on a compact set) \(\implies\) Riemannian Lipschitz smoothness, by a theorem from Riemannian Geometry
It works!
With a different retraction and at a slower rate...?