MLRG - Riemannian opt

An Alternative to EM for Gaussian Mixture Models

Victor Sanches Portella
MLRG @ UBC

August 11, 2021

Gaussian Mixture Models

A GMM is a convex combination of a fixed number of Gaussians

Density of a $K$ -mixture on $\mathbb{R}^d$ :

\displaystyle p(x) = \sum_{i = 1}^K \alpha_i \; p_{\mathcal{N}}(x; \mu_i, \Sigma_i)

\displaystyle p(x) = \sum_{i = 1}^K \alpha_i \; p_{\mathcal{N}}(x; \mu_i, \Sigma_i)

Mean and Covariance of $i$ -th Gaussian

Gaussian density

Weight of $i$ -th Gaussian

(Must sum to 1)

Find GMM most likely to generate data points $x_1, \dotsc, x_n$

Fitting a GMM

Idea: use means $\mu_i$ and covariances $\Sigma_i$ that maximize log-likelihood

\displaystyle \max \sum_{i = 1}^n \log\Big(\sum_{j = 1}^K \alpha_i \; p_{\mathcal{N}}(x_i; \mu_j, \Sigma_j)\Big)

\displaystyle \max \sum_{i = 1}^n \log\Big(\sum_{j = 1}^K \alpha_i \; p_{\mathcal{N}}(x_i; \mu_j, \Sigma_j)\Big)

\displaystyle \sum_{i = 1}^n

\displaystyle \sum_{i = 1}^n

\displaystyle x_i

\displaystyle x_i

\{

\alpha \in \Delta_n

\alpha \in \Delta_n

\mu_j \in \mathbb{R}^d

\mu_j \in \mathbb{R}^d

\Sigma_j \succ 0

\Sigma_j \succ 0

\Sigma_j \succ 0

\Sigma_j \succ 0

Hard to describe

Open - No clear projection

Classical optimization methods (Newton, CG, IPM) struggle

Optimization vs " $\succ 0$ "

Slowdown when iterates get closer to boundary

IPM works, but it is slow

Force positive definiteness via "Cholesky decomposition"

Adds spurious maxima/minima

\Sigma = L L^T

\Sigma = L L^T

Slow

Expectation Maximization guarantees $\succ 0$ "for free"

Easy for GMM's

Fast in practice

EM: Iterative method to find (local) max-likely parameters

EM in one slide

E-step: Fix parameters, find "weighted log-likelihood"

M-step: Find parameters by maximizing the "weighted log-likelihood"

For GMM's:

E-step: Given $\mu_i, \Sigma_i$ , compute $P(x_j \in \mathcal{N}_i)$

M-step: Max weighted log-likelihood with weights $P(x_j \in \mathcal{N}_i)$

$= p(x_j ; \mu_i, \Sigma_i)$

CLOSED FORM SOLUTION FOR GMMs!

Guarantees $\succ 0$ for free

Manifold of PD Matrices

\mathbb{P}^d = \{\Sigma \in \mathbb{R}^{d \times d} \colon \Sigma \succ 0\}

\mathbb{P}^d = \{\Sigma \in \mathbb{R}^{d \times d} \colon \Sigma \succ 0\}

Rimennian metric at

\Sigma

\Sigma

\langle X,Y \rangle_{\Sigma} = \mathrm{Tr}(\Sigma^{-1} X \Sigma^{-1} Y)

\langle X,Y \rangle_{\Sigma} = \mathrm{Tr}(\Sigma^{-1} X \Sigma^{-1} Y)

\Sigma

\Sigma

Tangent space at a point

T_{\Sigma}

T_{\Sigma}

\equiv

\equiv

All symmetric matrices

Remark: Blows up when $\Sigma$ is close to the boundary

Optimization Structure

General steps:

1 - Find descent direction

2 - Perform line-search and step with retraction

Retraction: a way to move on a manifold along a tangent direction

\mathrm{Ret}_{\Sigma} \colon T_{\Sigma} \to \mathbb{P}^d

\mathrm{Ret}_{\Sigma} \colon T_{\Sigma} \to \mathbb{P}^d

\mathrm{Ret}_{\Sigma}(0) = \Sigma

\mathrm{Ret}_{\Sigma}(0) = \Sigma

\mathrm{D Ret}_{\Sigma}(0) = \mathrm{Id}

\mathrm{D Ret}_{\Sigma}(0) = \mathrm{Id}

\{

Not necessarily follows a geodesic!

\mathrm{D}

\mathrm{D}

In this paper, we will only use the Exponential Map:

\mathrm{Ret}_{\Sigma}(D) = \gamma_D(1)

\mathrm{Ret}_{\Sigma}(D) = \gamma_D(1)

where

\gamma_D(t)

\gamma_D(t)

is the geodesic starting at $\Sigma$ with direction $D$

The Exponential Map

\mathrm{Ret}_{\Sigma}(D) = \gamma_D(1)

\mathrm{Ret}_{\Sigma}(D) = \gamma_D(1)

where

$\gamma_D(t)$ is the geodesic starting at $\Sigma$ with direction $D$

Geodesic between $\Sigma$ and $\Gamma$

$\Sigma$ at $t = 0$ and $\Gamma$ at $t = 1$

\Sigma^{\frac{1}{2}} (\Sigma^{-\frac{1}{2}} \Gamma \Sigma^{-\frac{1}{2}})^t \Sigma^{\frac{1}{2}}

\Sigma^{\frac{1}{2}} (\Sigma^{-\frac{1}{2}} \Gamma \Sigma^{-\frac{1}{2}})^t \Sigma^{\frac{1}{2}}

\displaystyle (

\displaystyle (

\displaystyle )^{}

\displaystyle )^{}

{}^t

{}^t

From this (?), we get the form of the exponential map

\mathrm{Ret}_{\Sigma}(D) = \Sigma^{\frac{1}{2}} \exp(\Sigma^{-\frac{1}{2}} D \Sigma^{-\frac{1}{2}}) \Sigma^{\frac{1}{2}}

\mathrm{Ret}_{\Sigma}(D) = \Sigma^{\frac{1}{2}} \exp(\Sigma^{-\frac{1}{2}} D \Sigma^{-\frac{1}{2}}) \Sigma^{\frac{1}{2}}

= \Sigma \; \exp(\Sigma^{-1} D )

= \Sigma \; \exp(\Sigma^{-1} D )

Gradient Descent

GD on the Riemannian Manifold $\mathbb{P}^d$ :

\Sigma_{t+1} = \mathrm{Ret}_{\Sigma_t}( - \eta_t \; \mathrm{grad} f(\Sigma_t) )

\Sigma_{t+1} = \mathrm{Ret}_{\Sigma_t}( - \eta_t \; \mathrm{grad} f(\Sigma_t) )

\mathrm{grad} f(\Sigma_t)

\mathrm{grad} f(\Sigma_t)

\mathrm{grad} f(\Sigma_t)

\mathrm{grad} f(\Sigma_t)

\equiv

\equiv

Riemannian gradient of $f$ .

Depends on the metric!

\displaystyle = \frac{1}{2} \Sigma_t \Big(\nabla f(\Sigma_t) + \nabla f(\Sigma_t)^T \Big) \Sigma_t

\displaystyle = \frac{1}{2} \Sigma_t \Big(\nabla f(\Sigma_t) + \nabla f(\Sigma_t)^T \Big) \Sigma_t

If we have time, we shall see conditions for convergence of SGD

The methods the authors use are Conjugate Gradient and LBFGS.

x_{t+1} = x_t - \eta_t \nabla f(x_t)

x_{t+1} = x_t - \eta_t \nabla f(x_t)

Traditional gradient descent (GD):

Defining those depends on the idea of vector transport and Hessians. I'll skip these descriptions

Single Gaussian Estimation

Maximum Likelihood estimation for a single Gaussian:

\displaystyle \max_{\mu, \; \Sigma \succ 0} \mathcal{L}(\mu, \Sigma) \coloneqq \sum_{i = 1}^n \log p_{\mathcal{N}}(x_i; \mu, \Sigma)

\displaystyle \max_{\mu, \; \Sigma \succ 0} \mathcal{L}(\mu, \Sigma) \coloneqq \sum_{i = 1}^n \log p_{\mathcal{N}}(x_i; \mu, \Sigma)

$\mathcal{L}$ is concave!

In fact, the above has a closed form solution

In a Riemannian manifold, meaning of convexity changes

Geodesic convexity (g-convexity):

f(\gamma(t)) \leq (1 - t) f(\Sigma_1) + t f(\Sigma_2)

f(\gamma(t)) \leq (1 - t) f(\Sigma_1) + t f(\Sigma_2)

\gamma(t)

\gamma(t)

Geodesic from $\Sigma_1$ to $\Sigma_2$

$\mathcal{L}$ is not g-concave

Single Gaussian Estimation

Maximum Likelihood estimation for a single Gaussian:

\displaystyle \max_{\mu, \; \Sigma \succ 0} \mathcal{L}(\mu, \Sigma) \coloneqq \sum_{i = 1}^n \log p_{\mathcal{N}}(x_i; \mu, \Sigma)

\displaystyle \max_{\mu, \; \Sigma \succ 0} \mathcal{L}(\mu, \Sigma) \coloneqq \sum_{i = 1}^n \log p_{\mathcal{N}}(x_i; \mu, \Sigma)

$\mathcal{L}$ is concave!

$\mathcal{L}$ is not g-concave

Solution: lift the function to a higher dimension!

\displaystyle \max_{S \succ 0} \hat{\mathcal{L}}(S) \coloneqq \sum_{i = 1}^n \log q_{\mathcal{N}}(y_i; S)

\displaystyle \max_{S \succ 0} \hat{\mathcal{L}}(S) \coloneqq \sum_{i = 1}^n \log q_{\mathcal{N}}(y_i; S)

\displaystyle = \begin{pmatrix} x_i \\ 1 \end{pmatrix}

\displaystyle = \begin{pmatrix} x_i \\ 1 \end{pmatrix}

(d + 1) \times (d+1)

(d + 1) \times (d+1)

$\hat{\mathcal{L}}$ is g-concave

\displaystyle = c \cdot p_{\mathcal{N}}(y_i; 0, S)

\displaystyle = c \cdot p_{\mathcal{N}}(y_i; 0, S)

Single Gaussian Estimation

\displaystyle \max_{\mu, \; \Sigma \succ 0} \mathcal{L}(\mu, \Sigma) \coloneqq \sum_{i = 1}^n \log p_{\mathcal{N}}(x_i; \mu, \Sigma)

\displaystyle \max_{\mu, \; \Sigma \succ 0} \mathcal{L}(\mu, \Sigma) \coloneqq \sum_{i = 1}^n \log p_{\mathcal{N}}(x_i; \mu, \Sigma)

\displaystyle \max_{S \succ 0} \hat{\mathcal{L}}(S) \coloneqq \sum_{i = 1}^n \log q_{\mathcal{N}}(y_i; S)

\displaystyle \max_{S \succ 0} \hat{\mathcal{L}}(S) \coloneqq \sum_{i = 1}^n \log q_{\mathcal{N}}(y_i; S)

Theorem If $\mu^, \Sigma^$ max $\mathcal{L}$ and $S^*$ maxs $\hat{\mathcal{L}}$ , then

$\mathcal{L}$ is not g-concave

$\hat{\mathcal{L}}$ is g-concave

\displaystyle \mathcal{L}(\mu^*, \Sigma^*) = \hat{\mathcal{L}}(S^*)

\displaystyle \mathcal{L}(\mu^*, \Sigma^*) = \hat{\mathcal{L}}(S^*)

S^* = \begin{pmatrix} \Sigma^* + \mu^* {\mu^{*}}^T & \mu^* \\ \mu^* & 1 \end{pmatrix}

S^* = \begin{pmatrix} \Sigma^* + \mu^* {\mu^{*}}^T & \mu^* \\ \mu^* & 1 \end{pmatrix}

and

Theorem If $\mu^, \Sigma^$ max $\mathcal{L}$ and $S^*$ maxs $\hat{\mathcal{L}}$ , then

\displaystyle \mathcal{L}(\mu^*, \Sigma^*) = \hat{\mathcal{L}}(S^*)

\displaystyle \mathcal{L}(\mu^*, \Sigma^*) = \hat{\mathcal{L}}(S^*)

S^* = \begin{pmatrix} \Sigma^* + \mu^* {\mu^{*}}^T & \mu^* \\ \mu^* & 1 \end{pmatrix}

S^* = \begin{pmatrix} \Sigma^* + \mu^* {\mu^{*}}^T & \mu^* \\ \mu^* & 1 \end{pmatrix}

and

Proof idea:

S = \begin{pmatrix} \Sigma + s \mu {\mu}^T & s \mu \\ s \mu & s \end{pmatrix}

S = \begin{pmatrix} \Sigma + s \mu {\mu}^T & s \mu \\ s \mu & s \end{pmatrix}

s > 0

s > 0

Write

Optimal choice of $s$ is 1 $\implies$ both $\mathcal{L}$ and $\hat{\mathcal{L}}$ agree

PS: $\Sigma^* \succ 0$ by Shur's complement Lemma

Back to Many Gaussians

\displaystyle \max \mathcal{L}\big(\{\mu_j\}, \{\Sigma_j\} \big) = \sum_{i = 1}^n \log\Big(\sum_{j = 1}^K \alpha_i \; p_{\mathcal{N}}(x_i; \mu_j, \Sigma_j)\Big)

\displaystyle \max \mathcal{L}\big(\{\mu_j\}, \{\Sigma_j\} \big) = \sum_{i = 1}^n \log\Big(\sum_{j = 1}^K \alpha_i \; p_{\mathcal{N}}(x_i; \mu_j, \Sigma_j)\Big)

For now, fix

\alpha_1, \dotsc, \alpha_k

\alpha_1, \dotsc, \alpha_k

\displaystyle \max \hat{\mathcal{L}}\big(\{S_j\} \big) = \sum_{i = 1}^n \log\Big(\sum_{j = 1}^K \alpha_i \; q_{\mathcal{N}}(y_i; S_j)\Big)

\displaystyle \max \hat{\mathcal{L}}\big(\{S_j\} \big) = \sum_{i = 1}^n \log\Big(\sum_{j = 1}^K \alpha_i \; q_{\mathcal{N}}(y_i; S_j)\Big)

Theorem The local maxima of $\mathcal{L}$ and of $\hat{\mathcal{L}}$ agree

What about $\alpha_1, \dotsc, \alpha_K$ ? Reparameterize to use unconstrained opt

\displaystyle \alpha_j \coloneqq \frac{\exp(w_j)}{\sum_{k = 1}^K \exp(w_k)}

\displaystyle \alpha_j \coloneqq \frac{\exp(w_j)}{\sum_{k = 1}^K \exp(w_k)}

I think there are not many guarantees on final $\alpha_j$ 's

Riemannian SGD

x_{t+1} = \mathrm{Ret}_{x_t}\big(- \eta_t \; \mathrm{grad} f_{i_t}(x_t)\big),

x_{t+1} = \mathrm{Ret}_{x_t}\big(- \eta_t \; \mathrm{grad} f_{i_t}(x_t)\big),

$i_t$ a random index

Assumptions on $f$ :

Riemann Lip. Smooth

f\big(\mathrm{Ret}_{x}(d)\big) \leq f(x) + \langle \mathrm{grad} f(x), d \rangle + \frac{L}{2} ||d ||^2

f\big(\mathrm{Ret}_{x}(d)\big) \leq f(x) + \langle \mathrm{grad} f(x), d \rangle + \frac{L}{2} ||d ||^2

Unbiased Gradients

\mathbb{E}[ \mathrm{grad} f_{i_t}(x) - \mathrm{grad} f(x)] = 0

\mathbb{E}[ \mathrm{grad} f_{i_t}(x) - \mathrm{grad} f(x)] = 0

Bounded Variance

\mathbb{E}\big[ ||\mathrm{grad} f_{i_t}(x) - \mathrm{grad} f(x)||^2\big] \leq \sigma

\mathbb{E}\big[ ||\mathrm{grad} f_{i_t}(x) - \mathrm{grad} f(x)||^2\big] \leq \sigma

We want to minimize

f = \frac{1}{n} \sum_{i = 1}^n f_i

f = \frac{1}{n} \sum_{i = 1}^n f_i

Riemannian Stochastic Gradient Descent:

Riemannian SGD

x_{t+1} = \mathrm{Ret}_{x_t}\big(- \eta_t \; \mathrm{grad} f_{i_t}(x_t)\big)

x_{t+1} = \mathrm{Ret}_{x_t}\big(- \eta_t \; \mathrm{grad} f_{i_t}(x_t)\big)

Assumptions on $f$ :

Riemann Lip. Smooth

Unbiased Gradients

Bounded Variance

f = \frac{1}{n} \sum_{i = 1}^n f_i

f = \frac{1}{n} \sum_{i = 1}^n f_i

Riemannian SGD

Theorem If one of the following holds:

i) Bounded gradients:

|| \mathrm{grad} f(x)|| \leq G \quad \forall i

|| \mathrm{grad} f(x)|| \leq G \quad \forall i

ii) Outputs an uniformly random iterate

Then

\displaystyle \mathbb E \Big[ || \mathrm{grad} f(x_{\tau})||^2 \Big] \leq O\Bigg(\frac{1}{\sqrt{T}}\Bigg),

\displaystyle \mathbb E \Big[ || \mathrm{grad} f(x_{\tau})||^2 \Big] \leq O\Bigg(\frac{1}{\sqrt{T}}\Bigg),

$\tau$ is unif. random if (ii)

$\tau$ is best iter. if (i)

(No need for bdd var)

Does SGD work for GMM?

Do the conditions for SGD to converge hold for GMMs?

Theorem If $\eta_t \leq 1$ , then iterates stay in a compact set.

(For some reason they analyze SGD with a different retraction...)

Remark Objective function is not Riemannian Lip. smooth outside of a compact set

Fact Iterates stay in a compact set $\implies$

Bounded Gradients

$\implies$

Classical Lip. smoothness

$\implies$

Thm from Riemannian Geometry

Riemann Lip. smoothness

It works!

With a different retraction and at a slower rate...?

An Alternative to EM for Gaussian Mixture Models Victor Sanches Portella MLRG @ UBC August 11, 2021