Flavien Léger
Variational aspects of the Expectation-Maximization algorithm
1. Warm-up: no latent variables
\((X,\mathbb{P})\): sample space, space of observations \(x\in X\)
(mathematically: measurable space)
Goal: understand the probability measure \(\mathbb{P}\),
e.g. to make predictions
Model: \(p_\theta(dx)\in\mathcal{P}(X)\), parameter \(\theta\in\Theta\)
Goal (frequentist): find a \(\theta\) that best fits observed \(x_1,\cdots,x_n\)
Maximum likelihood
Given \(n\) i.i.d. observations \(x_1,\cdots,x_n\), model predicts probability \[p_\theta(x_1)\cdots p_\theta(x_n)\]
Likelihood function
Statistician: solves
\[\max_{\theta\in\Theta}\sum_{i=1}^n\log p_\theta(x_i)\]
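A small illustration, not from the talk: take the Bernoulli model \(p_\theta(1)=\theta\) on \(X=\{0,1\}\) with made-up observations. The helper names `log_likelihood` and `mle_grid` are hypothetical, and the grid search is only a sketch standing in for a real optimizer. For this model the maximizer should coincide with the sample mean.

```python
import math

def log_likelihood(theta, xs):
    """Sum of log p_theta(x_i) for the Bernoulli model p_theta(1) = theta."""
    return sum(math.log(theta if x == 1 else 1.0 - theta) for x in xs)

def mle_grid(xs, steps=1000):
    """Maximize the log-likelihood over a uniform grid in (0, 1)."""
    grid = [k / steps for k in range(1, steps)]
    return max(grid, key=lambda t: log_likelihood(t, xs))

xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # made-up observations, 7 ones out of 10
theta_hat = mle_grid(xs)             # grid point closest to the sample mean
```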
Maximum likelihood as minimum KL
\[\max_{\theta\in\Theta}\sum_{i=1}^n\log p_\theta(x_i)\]
Define the Kullback–Leibler divergence, or relative entropy,
\[\operatorname{KL}(\mu|\nu)=\begin{cases}\int_X\log\Big(\frac{d\mu}{d\nu}\Big)\,d\mu&\text{if }\mu\ll\nu\\+\infty&\text{otherwise}\end{cases}\]
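Encode observations into the empirical measure \(\mu=\frac1n\sum_{i=1}^n\delta_{x_i}\), then (for discrete \(X\))
\[\operatorname{KL}(\mu|p_\theta)=-\frac1n\sum_{i=1}^n\log p_\theta(x_i)+C,\qquad C=\sum_{x}\mu(x)\log\mu(x)\ \text{independent of }\theta\]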
Statistical inference problem reformulated as minimum entropy problem
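A quick numerical check of this reformulation, as a sketch on the Bernoulli model \(p_\theta(1)=\theta\) with made-up data (the helpers `kl`, `empirical`, `bernoulli` and `gap` are hypothetical names): the quantity \(\operatorname{KL}(\mu|p_\theta)+\frac1n\sum_i\log p_\theta(x_i)\) should not depend on \(\theta\), so minimizing the KL divergence and maximizing the likelihood select the same \(\theta\).

```python
import math
from collections import Counter

def kl(mu, p):
    """KL(mu | p) for finite distributions given as dicts; +inf if mu is not << p."""
    total = 0.0
    for x, m in mu.items():
        if m > 0.0:
            if p.get(x, 0.0) == 0.0:
                return math.inf
            total += m * math.log(m / p[x])
    return total

def empirical(xs):
    """Empirical measure (1/n) * sum of Dirac masses at the observations."""
    n = len(xs)
    return {x: c / n for x, c in Counter(xs).items()}

def bernoulli(theta):
    return {1: theta, 0: 1.0 - theta}

xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # made-up observations
mu = empirical(xs)

def gap(theta):
    """KL(mu | p_theta) plus the average log-likelihood."""
    p = bernoulli(theta)
    avg_loglik = sum(math.log(p[x]) for x in xs) / len(xs)
    return kl(mu, p) + avg_loglik

gaps = [gap(t) for t in (0.2, 0.5, 0.9)]  # same value for every theta
```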
Optimization problem: minimize \(F(\theta)\coloneqq\operatorname{KL}(\mu|p_\theta)\)
Gradient descent \(\theta_{k+1}-\theta_k=-\sigma\nabla F(\theta_k)\)...
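A minimal sketch of this iteration, under assumptions that are ours and not the talk's: Bernoulli model in logit parametrization, \(p_\theta(1)=1/(1+e^{-\theta})\) with \(\theta\in\mathbb{R}\), and \(\mu(1)=m\). In that case \(F'(\theta)=p_\theta(1)-m\). The names `sigmoid`, `grad_F` and `gradient_descent` are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_F(theta, m):
    """Derivative of F(theta) = KL(mu | p_theta), p_theta(1) = sigmoid(theta), mu(1) = m."""
    return sigmoid(theta) - m

def gradient_descent(m, theta0=0.0, sigma=1.0, iters=500):
    """Iterate theta_{k+1} = theta_k - sigma * F'(theta_k)."""
    theta = theta0
    for _ in range(iters):
        theta = theta - sigma * grad_F(theta, m)
    return theta

theta_star = gradient_descent(m=0.7)  # the fitted p_theta(1) should approach m
```

Here \(F''\leq 1/4\), so the step size \(\sigma=1\) is small enough for the iteration to converge.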
End of warm-up
With latent variables
Model \(p_\theta(dx,dz)\in\mathcal{P}(X\times Z)\)
\(x\in X\): observed variables
\(z\in Z\): latent (hidden) variables
Under \(\theta\), model predicts probability of observing \(x\in X\) is
\[\sum_{z\in Z}p_\theta(x,z)\eqqcolon (P_Xp_\theta)(x)\]
\(P_Xp_\theta\): marginal of \(p_\theta\) on \(X\)
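For finite \(X\) and \(Z\), the marginal is just a sum over the latent variable. A sketch with a made-up joint law stored as a dictionary (`marginal_x` is a hypothetical helper name):

```python
def marginal_x(p):
    """First marginal (P_X p)(x) = sum over z of p(x, z), for a joint {(x, z): mass}."""
    out = {}
    for (x, _z), mass in p.items():
        out[x] = out.get(x, 0.0) + mass
    return out

# Made-up joint law on X = {"a", "b"}, Z = {0, 1}
p = {("a", 0): 0.1, ("a", 1): 0.3, ("b", 0): 0.4, ("b", 1): 0.2}
px = marginal_x(p)
```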
Maximum likelihood
Given i.i.d. observations \(x_1,\cdots,x_n\), \[\max_{\theta\in\Theta}\sum_{i=1}^n\log P_Xp_\theta(x_i)\]
Rewrite as minimum entropy
\[\min_{\theta\in\Theta}\operatorname{KL}(\mu|P_Xp_\theta)\]
If \(F(\theta)\coloneqq \operatorname{KL}(\mu|P_Xp_\theta)\) is easy to manipulate: done
In many cases: undesirable to handle the marginalization \(P_X\) directly
Optimization problem
\(X,Z\): two (measurable) sets, \(\Theta\): a set
\((\theta\in\Theta)\mapsto p_\theta\in\mathcal{P}(X\times Z)\): given, \(\mu\in\mathcal{P}(X)\): given
Solve:
\[\min_{\theta\in\Theta}\operatorname{KL}(\mu|P_Xp_\theta)\]
Main formula
\(\Pi(\mu,*)=\{\pi\in\mathcal{P}(X\times Z) : P_X\pi = \mu\}\)
joint laws, couplings, with first marginal \(\mu\)
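\[\forall p\in\mathcal{P}(X\times Z),\quad\operatorname{KL}(\mu|P_Xp)=\min_{\pi\in\Pi(\mu,*)}\operatorname{KL}(\pi|p)\]
Follows from two things: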
① Data processing inequality
\[\forall \pi,p\in \mathcal{P}(X\times Z),\quad\operatorname{KL}(P_X\pi|P_X p)\leq \operatorname{KL}(\pi|p)\]
② Equality attained for
\[\pi(dx,dz)=\frac{P_X\pi(dx)}{P_X p(dx)} p(dx,dz)\]
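Both facts can be checked numerically on a finite example. The sketch below uses made-up numbers and hypothetical helper names (`kl`, `marginal_x`, `optimal_coupling`): it compares \(\operatorname{KL}(\mu|P_Xp)\) with \(\operatorname{KL}(\pi|p)\) for the coupling of ② and for another coupling with first marginal \(\mu\).

```python
import math

def kl(mu, p):
    """KL(mu | p) for finite distributions given as dicts; +inf if mu is not << p."""
    total = 0.0
    for x, m in mu.items():
        if m > 0.0:
            if p.get(x, 0.0) == 0.0:
                return math.inf
            total += m * math.log(m / p[x])
    return total

def marginal_x(p):
    """First marginal of a joint law {(x, z): mass}."""
    out = {}
    for (x, _z), mass in p.items():
        out[x] = out.get(x, 0.0) + mass
    return out

def optimal_coupling(mu, p):
    """pi(x, z) = mu(x) / (P_X p)(x) * p(x, z), which has first marginal mu."""
    px = marginal_x(p)
    return {(x, z): mu[x] / px[x] * mass for (x, z), mass in p.items()}

p = {("a", 0): 0.1, ("a", 1): 0.3, ("b", 0): 0.4, ("b", 1): 0.2}
mu = {"a": 0.7, "b": 0.3}
# Another coupling with first marginal mu (made up for comparison)
other = {("a", 0): 0.35, ("a", 1): 0.35, ("b", 0): 0.15, ("b", 1): 0.15}

lhs = kl(mu, marginal_x(p))                  # KL(mu | P_X p)
kl_opt = kl(optimal_coupling(mu, p), p)      # equality case
kl_other = kl(other, p)                      # strictly larger here
```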
EM algorithm
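Alternating minimization of \(\Phi(\theta,\pi) = \operatorname{KL}(\pi|p_\theta)\) over \(\theta\in\Theta\) and \(\pi\in\Pi(\mu,*)\)
“E-step”: \(\pi_{k+1}=\operatorname*{argmin}_{\pi\in\Pi(\mu,*)}\Phi(\theta_k,\pi)\)
“M-step”: \(\theta_{k+1}=\operatorname*{argmin}_{\theta\in\Theta}\Phi(\theta,\pi_{k+1})\)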
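A toy sketch of the alternating minimization, under assumptions that are ours: \(z\in\{0,1\}\) selects one of two coins, \(x\in\{0,\dots,M\}\) counts heads in \(M\) flips, and \(\theta=(w,a,b)\) collects the weight of coin 1 and the two biases. The E-step uses the coupling \(\pi(x,z)=\mu(x)\,p_\theta(x,z)/P_Xp_\theta(x)\), and for this model the M-step has a closed form. All names and numbers are hypothetical.

```python
import math

M = 5  # flips per observation (hypothetical toy setup)

def joint(theta):
    """p_theta(x, z): pick coin z with weights (1 - w, w), then count heads x in M flips."""
    w, a, b = theta
    p = {}
    for z, (wz, q) in enumerate([(1.0 - w, a), (w, b)]):
        for x in range(M + 1):
            p[(x, z)] = wz * math.comb(M, x) * q**x * (1.0 - q) ** (M - x)
    return p

def marginal_x(p):
    out = {}
    for (x, _z), mass in p.items():
        out[x] = out.get(x, 0.0) + mass
    return out

def kl(mu, p):
    """KL(mu | p) on a finite set; +inf if mu is not << p."""
    total = 0.0
    for x, m in mu.items():
        if m > 0.0:
            if p.get(x, 0.0) == 0.0:
                return math.inf
            total += m * math.log(m / p[x])
    return total

def e_step(mu, theta):
    """Minimize KL(pi | p_theta) over couplings pi with first marginal mu."""
    p = joint(theta)
    px = marginal_x(p)
    return {(x, z): mu.get(x, 0.0) / px[x] * mass for (x, z), mass in p.items()}

def m_step(pi):
    """Minimize KL(pi | p_theta) over theta (closed form for this model)."""
    mass = [sum(v for (x, z), v in pi.items() if z == k) for k in (0, 1)]
    heads = [sum(v * x for (x, z), v in pi.items() if z == k) for k in (0, 1)]
    return (mass[1], heads[0] / (M * mass[0]), heads[1] / (M * mass[1]))

def em(mu, theta, iters=50):
    """Run EM and record F(theta_k) = KL(mu | P_X p_theta_k) along the way."""
    values = [kl(mu, marginal_x(joint(theta)))]
    for _ in range(iters):
        pi = e_step(mu, theta)
        theta = m_step(pi)
        values.append(kl(mu, marginal_x(joint(theta))))
    return theta, values

mu = {0: 0.10, 1: 0.25, 2: 0.15, 3: 0.10, 4: 0.25, 5: 0.15}  # made-up empirical measure
theta, values = em(mu, theta=(0.5, 0.3, 0.6))
```

The list `values` makes it easy to inspect how \(F\) evolves along the iterations.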
Why is this a good idea? What is the underlying structure?
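Recall: want to minimize \(F(\theta)=\operatorname{KL}(\mu|P_Xp_\theta)\)
\(F(\theta)\leq \Phi(\theta,\pi)\) for all \(\pi\in\Pi(\mu,*)\)
with equality when \(\pi(dx,dz)=\frac{\mu(dx)}{P_Xp_\theta(dx)}\,p_\theta(dx,dz)\)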
Descent property for free
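Spelled out, assuming both steps are exact partial minimizations of \(\Phi\) (with \(\pi_{k+1}\) the minimizer of \(\Phi(\theta_k,\cdot)\) over \(\Pi(\mu,*)\) and \(\theta_{k+1}\) a minimizer of \(\Phi(\cdot,\pi_{k+1})\) over \(\Theta\)):
\[F(\theta_{k+1})\leq\Phi(\theta_{k+1},\pi_{k+1})\leq\Phi(\theta_k,\pi_{k+1})=F(\theta_k)\]
The first inequality is the data processing inequality, the second is minimality of \(\theta_{k+1}\), and the final equality is the equality case at \(\theta_k\).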
No stronger guarantee (convergence, rates...) in general
Gradient descent with a general cost
FL, PCA, Gradient descent with a general cost, 2023, arXiv:2305.04917
EM for exponential families
Model: \[p_\theta(dx,dz)=e^{s(x,z)\cdot\theta-A(\theta)}\,R(dx,dz),\]
\(\theta\in\Theta\subset\mathbb{R}^d\)
GDGC:
Mirror Descent
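A formal sketch of why mirror descent appears here. This is our computation, assuming \(A\) is differentiable and that \(F(\theta)=\min_{\pi\in\Pi(\mu,*)}\operatorname{KL}(\pi|p_\theta)\) can be differentiated through the minimum; \(\pi_{k+1}\) denotes the minimizing coupling at \(\theta_k\). For the exponential family model,
\[\operatorname{KL}(\pi|p_\theta)=\int\log\Big(\frac{d\pi}{dR}\Big)\,d\pi-\theta\cdot\int s\,d\pi+A(\theta)\]
Minimizing over \(\theta\) at fixed \(\pi_{k+1}\) gives moment matching, and the envelope formula gives the gradient of \(F\):
\[\nabla A(\theta_{k+1})=\int s\,d\pi_{k+1},\qquad\nabla F(\theta_k)=\nabla A(\theta_k)-\int s\,d\pi_{k+1}\]
Combining the two:
\[\nabla A(\theta_{k+1})-\nabla A(\theta_k)=-\nabla F(\theta_k)\]
i.e. a mirror descent step on \(F\) with mirror map \(A\) and step size 1.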
Thank you!
Refs
F. Léger, P.-C. Aubin-Frankowski, Gradient descent with a general cost, 2023, arXiv:2305.04917
(m+e+c 2024-10-03) The EM algorithm
By Flavien Léger