Flavien Léger

Variational aspects of the Expectation-Maximization algorithm

 

1. Warm-up: no latent variables

\((X,\mathbb{P})\): sample space, space of observations \(x\in X\)

(mathematically: measurable space)

Goal: understand the probability measure \(\mathbb{P}\),

e.g. to make predictions

Model: \(p_\theta(dx)\in\mathcal{P}(X)\), parameter \(\theta\in\Theta\)

Goal (frequentist): find a \(\theta\) that best fits observed \(x_1,\cdots,x_n\)

Maximum likelihood

Given \(n\) i.i.d. observations \(x_1,\cdots,x_n\), model predicts probability \[p_\theta(x_1)\cdots p_\theta(x_n)\]

Likelihood function

Statistician: solves

\[\max_{\theta\in\Theta}\sum_{i=1}^n\log p_\theta(x_i)\]
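A worked special case (the Gaussian model here is an illustrative assumption, not part of the slides): for \(p_\theta=\mathcal{N}(\theta,1)\) on \(X=\mathbb{R}\),

\[\sum_{i=1}^n\log p_\theta(x_i)=-\frac12\sum_{i=1}^n(x_i-\theta)^2-\frac n2\log(2\pi),\]

which is maximized at the sample mean \(\theta^\star=\frac1n\sum_{i=1}^n x_i\).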

Maximum likelihood as minimum KL

\[\max_{\theta\in\Theta}\sum_{i=1}^n\log p_\theta(x_i)\]

Define the Kullback–Leibler divergence, or relative entropy,

\[\operatorname{KL}(\mu|\nu)=\begin{cases}\int_X\log\Big(\frac{d\mu}{d\nu}\Big)\,d\mu&\text{if }\mu\ll\nu\\+\infty&\text{otherwise}\end{cases}\]

Encode observations into the empirical measure \(\mu=\frac1n\sum_{i=1}^n\delta_{x_i}\), then

\begin{align*} \operatorname{KL}(\mu|p_\theta)&=\int_X\log(\mu)\,d\mu - \int_X\log(p_\theta)\,d\mu\\ &=\text{cst} - \frac1n\sum_{i=1}^n\log p_\theta(x_i) \end{align*}

Statistical inference problem reformulated as minimum entropy problem

\[\min_{\theta\in \Theta} \operatorname{KL}(\mu|p_\theta)\]

Optimization problem: minimize \(F(\theta)\coloneqq\operatorname{KL}(\mu|p_\theta)\)

Gradient descent \(\theta_{k+1}-\theta_k=-\sigma\nabla F(\theta_k)\)...
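In the Gaussian special case above (still an illustrative assumption), the gradient is explicit:

\[\nabla F(\theta)=\theta-\bar x,\qquad \bar x=\frac1n\sum_{i=1}^n x_i,\]

so gradient descent reads \(\theta_{k+1}=(1-\sigma)\theta_k+\sigma\bar x\) and converges to the maximum-likelihood estimate \(\bar x\) for any step \(\sigma\in(0,2)\).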

End of warm-up

With latent variables

Model \(p_\theta(dx,dz)\in\mathcal{P}(X\times Z)\)

\(x\in X\): observed variables

\(z\in Z\): latent (hidden) variables

\(\theta\in\Theta\): parameter

Under \(\theta\), model predicts probability of observing \(x\in X\) is

\[\sum_{z\in Z}p_\theta(x,z)\eqqcolon (P_Xp_\theta)(x)\]

\(P_Xp_\theta\): marginal of \(p_\theta\),

\((P_Xp_\theta)(dx)=\int_Zp_\theta(dx,dz)\,\in\mathcal{P}(X)\)
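A standard instance, spelled out as an illustration (the two-component Gaussian mixture and the notation \(w,m_0,m_1\) are assumptions, not from the slides): take \(Z=\{0,1\}\), \(\theta=(w,m_0,m_1)\) and \(p_\theta(x,z)=w_z\,\mathcal{N}(x;m_z,1)\) with \(w_1=w\), \(w_0=1-w\); then

\[(P_Xp_\theta)(x)=(1-w)\,\mathcal{N}(x;m_0,1)+w\,\mathcal{N}(x;m_1,1).\]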

Maximum likelihood

Given i.i.d. observations \(x_1,\cdots,x_n\), \[\max_{\theta\in\Theta}\sum_{i=1}^n\log P_Xp_\theta(x_i)\]

Rewrite as minimum entropy

\[\min_{\theta\in\Theta}\operatorname{KL}(\mu|P_Xp_\theta)\]

If \(F(\theta)\coloneqq \operatorname{KL}(\mu|P_Xp_\theta)\) easy to manipulate: done

In many cases: undesirable to handle  \(P_X\) directly

Optimization problem

\(X,Z\): two (measurable) sets,   \(\Theta\): a set

\((\theta\in\Theta)\mapsto p_\theta\in\mathcal{P}(X\times Z)\): given,   \(\mu\in\mathcal{P}(X)\): given

Solve:

\[\min_{\theta\in\Theta}\operatorname{KL}(\mu|P_Xp_\theta)\]

Main formula

\[\operatorname{KL}(\mu|P_Xp_\theta)=\min_{\pi\in\Pi(\mu,*)}\operatorname{KL}(\pi|p_\theta)\]

\(\Pi(\mu,*)=\{\pi\in\mathcal{P}(X\times Z) : P_X\pi = \mu\}\)

joint laws (couplings) with first marginal \(\mu\)

\[\operatorname{KL}(\mu|P_Xp_\theta)=\min_{\pi\in\Pi(\mu,*)}\operatorname{KL}(\pi|p_\theta)\]

two things:

①  Data processing inequality

\[\forall \pi,p\in \mathcal{P}(X\times Z),\quad\operatorname{KL}(P_X\pi|P_X p)\leq \operatorname{KL}(\pi|p)\]

②  Equality attained for 

\[\pi(dx,dz)=\frac{P_X\pi(dx)}{P_X p(dx)} p(dx,dz)\]
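Both ① and ② follow from the disintegration (chain rule) of relative entropy, recalled here for completeness: writing \(\pi(dx,dz)=P_X\pi(dx)\,\pi(dz|x)\) and \(p(dx,dz)=P_Xp(dx)\,p(dz|x)\),

\[\operatorname{KL}(\pi|p)=\operatorname{KL}(P_X\pi|P_Xp)+\int_X\operatorname{KL}\big(\pi(\cdot|x)\,\big|\,p(\cdot|x)\big)\,P_X\pi(dx)\;\geq\;\operatorname{KL}(P_X\pi|P_Xp),\]

with equality exactly when \(\pi(\cdot|x)=p(\cdot|x)\) for \(P_X\pi\)-a.e. \(x\), which is the coupling displayed in ②.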

EM algorithm

Alternating Minimization of \(\Phi(\theta,\pi) = \operatorname{KL}(\pi|p_\theta)\)

\(\theta\in\Theta\),   \(\pi\in\Pi(\mu,*)\)

\begin{align*} \pi_{n+1} &= \argmin_{\pi\in\Pi(\mu,*)} \operatorname{KL}(\pi|p_{\theta_n}) && \text{(“E-step”)}\\ \theta_{n+1} &= \argmin_{\theta\in \Theta} \operatorname{KL}(\pi_{n+1}|p_{\theta}) && \text{(“M-step”)} \end{align*}
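By ② applied with \(p=p_{\theta_n}\), the E-step has a closed form: the optimal coupling in \(\Pi(\mu,*)\) is

\[\pi_{n+1}(dx,dz)=\frac{\mu(dx)}{(P_Xp_{\theta_n})(dx)}\,p_{\theta_n}(dx,dz)=\mu(dx)\,p_{\theta_n}(dz|x),\]

i.e. compute the posterior of the latent variable given each observation, under the current parameter.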

Why is this a good idea? What is the underlying structure?

Recall: we want to minimize

\[F(\theta)=\min_{\pi\in\Pi(\mu,*)}\Phi(\theta,\pi)=\operatorname{KL}(\mu|P_Xp_\theta).\]

For every \(\pi\in\Pi(\mu,*)\): \(F\leq \Phi(\cdot,\pi)\), with equality at those \(\theta\) where \(\pi\) attains the minimum; in particular \(F(\theta_n)=\Phi(\theta_n,\pi_{n+1})\).

\begin{align*} \pi_{n+1} &= \argmin_{\pi} \Phi(\theta_n,\pi)\\ \theta_{n+1} &= \argmin_{\theta} \Phi(\theta,\pi_{n+1}) \end{align*}

[Figure: graphs of \(F\) and of the majorant \(\Phi(\cdot,\pi_{n+1})\) as functions of \(\theta\); they touch at \(\theta_n\), and \(\theta_{n+1}\) minimizes \(\Phi(\cdot,\pi_{n+1})\).]

Descent property for free: \(F(\theta_{n+1})\leq\Phi(\theta_{n+1},\pi_{n+1})\leq\Phi(\theta_n,\pi_{n+1})=F(\theta_n)\)

Nothing better (convergence of the iterates, rates, ...) can be guaranteed in general
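A minimal numerical sketch of the alternating minimization above, for a two-component Gaussian mixture with unit variances (the model, the synthetic data and all variable names are illustrative assumptions, not taken from the slides); the printed log-likelihood is nondecreasing, which is the descent property in action.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic observations x_1, ..., x_n drawn from a mixture of N(-2,1) and N(3,1)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

def log_likelihood(w, m0, m1):
    # sum_i log (P_X p_theta)(x_i) for theta = (w, m0, m1)
    p0 = (1.0 - w) * np.exp(-0.5 * (x - m0) ** 2) / np.sqrt(2.0 * np.pi)
    p1 = w * np.exp(-0.5 * (x - m1) ** 2) / np.sqrt(2.0 * np.pi)
    return np.sum(np.log(p0 + p1))

w, m0, m1 = 0.5, -1.0, 1.0  # initial parameter theta_0
for n in range(30):
    # E-step: pi_{n+1} couples mu with the posterior of z given x under p_{theta_n};
    # concretely, responsibilities r_i = P(z_i = 1 | x_i, theta_n)
    p0 = (1.0 - w) * np.exp(-0.5 * (x - m0) ** 2)
    p1 = w * np.exp(-0.5 * (x - m1) ** 2)
    r = p1 / (p0 + p1)
    # M-step: theta_{n+1} = argmin_theta KL(pi_{n+1} | p_theta), closed form here
    w = r.mean()
    m0 = np.sum((1.0 - r) * x) / np.sum(1.0 - r)
    m1 = np.sum(r * x) / np.sum(r)
    # Descent property: this value is nondecreasing along the iterations
    print(n, log_likelihood(w, m0, m1))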

Gradient descent with a general cost

\[F(\theta)=\min_{\pi\in\Pi(\mu,*)}\Phi(\theta,\pi)\]

\begin{align*} \pi_{n+1} &= \argmin_{\pi} \Phi(\theta_n,\pi)\\ \theta_{n+1} &= \argmin_{\theta} \Phi(\theta,\pi_{n+1}) \end{align*}

\[\left\{\begin{aligned} \nabla_\theta\Phi(\theta_n,\pi_{n+1})&=\nabla F(\theta_n)\\ \nabla_\theta\Phi(\theta_{n+1},\pi_{n+1})&=0 \end{aligned} \right.\]

(first line: envelope formula for \(F=\min_\pi\Phi(\cdot,\pi)\), evaluated at the optimal \(\pi_{n+1}\); second line: first-order optimality in the M-step)

F. Léger, P.-C. Aubin-Frankowski, Gradient descent with a general cost, 2023, arXiv:2305.04917

EM for exponential families

Model: \[p_\theta(dx,dz)=e^{s(x,z)\cdot\theta-A(\theta)}\,R(dx,dz),\]

\(\theta\in\Theta\subset\mathbb{R}^d\)

\begin{align*} \Phi(\theta,\pi)&=\operatorname{KL}(\pi|p_\theta)=\int\log\Big(\frac{\pi}{e^{s\cdot\theta-A(\theta)}R}\Big)\,d\pi\\ &=\operatorname{KL}(\pi|R)+A(\theta)-\theta\cdot\int sd\pi \end{align*}

GDGC (gradient descent with a general cost), using \(\nabla_\theta\Phi(\theta,\pi)=\nabla A(\theta)-\int s\,d\pi\):

\[\left\{\begin{aligned} \nabla A(\theta_n)-\int s\,d\pi_{n+1}&=\nabla F(\theta_n)\\ \nabla A(\theta_{n+1})-\int s\,d\pi_{n+1}&=0 \end{aligned} \right.\]

\[\nabla A(\theta_{n+1})-\nabla A(\theta_n)=-\nabla F(\theta_n)\]

Mirror Descent
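For comparison (standard definitions, not on the slides): mirror descent on \(F\) with mirror map \(h\) and step size \(\sigma\) is the update

\[\nabla h(\theta_{n+1})=\nabla h(\theta_n)-\sigma\nabla F(\theta_n),\]

so EM on an exponential family is mirror descent with \(h=A\) (the log-partition function) and \(\sigma=1\). Moreover \(\nabla A(\theta)=\int s\,dp_\theta\), so the M-step is the moment-matching condition \(\int s\,dp_{\theta_{n+1}}=\int s\,d\pi_{n+1}\).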

Thank you!

Refs

F. Léger, P.-C. Aubin-Frankowski, Gradient descent with a general cost, 2023, arXiv:2305.04917