Lecture 11: K-means and Gaussian Mixture Models
Assoc. Prof. Yuan-Sen Ting
Astron 5550: Advanced Astronomical Data Analysis
Logistics
Homework Assignment 4 due next Friday, 11:59 pm
Group Project 2 - Project Selection - feedback found on Carmen
Group Project 2 - Presentation - April 18
Group Project 2 - Report - April 28 (no deferment)
Plan for Today:
K-means
Gaussian Mixture Models
Expectation-Maximization Algorithm
Information Criterion

From Dimension Reduction to Clustering
Clustering identifies natural groupings in unlabeled data
Dimension reduction: continuous representations preserving important variations
Clustering: discrete partitioning based on similarity measures
Both techniques uncover structure but with different approaches.
Clustering versus Classification
Classification requires labeled training data and finds decision boundaries between known categories
Clustering allows exploration without preconceived categories of what should exist

Naidu et al. 2020

Clustering versus Classification
Classification requires labeled training data and finds decision boundaries between known categories
Clustering allows exploration without preconceived categories of what should exist
However, without ground truth labels, evaluating clustering quality becomes more subjective
The appropriate number of clusters is often unclear and may depend on the scientific question
Approaches to Clustering
K-means partitions data by minimizing distance to cluster centers but assumes spherical clusters
Gaussian Mixture Models (GMMs) model data as a mixture of several Gaussian distributions
K-means is computationally efficient and scalable to large astronomical datasets
GMMs align with Bayesian approach to probabilistic inference and better quantify uncertainty in clustering
K-means: Core Concept
K-means partitions data into K clusters with optimal cluster centers
Data points are assigned to the cluster whose center is closest to them
We must optimize two parameter sets simultaneously: cluster centers and point assignments

Mathematical Notation
Dataset: \( \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\} \), where each \( \mathbf{x}_n \) is D-dimensional
Cluster centers: \( \{\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_K\} \)
Binary indicator variables: \( r_{nk} \in \{0, 1\} \).
Constraint: \( \sum_{k=1}^K r_{nk} = 1 \)
Each data point must belong to exactly one cluster
("hard assignment")
K-means Objective Function
We minimize:
Each point contributes exactly one non-zero term
(its assigned cluster)
Two parameter sets to optimize simultaneously: centers \( \boldsymbol{\mu}_k \) and assignments \( r_{nk} \)
J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} |\mathbf{x}_n - \boldsymbol{\mu}_k|^2

The Chicken-and-Egg Problem
If we knew cluster centers, assignments would be trivial (assign to nearest center)
If we knew assignments, optimal centers would be straightforward (mean of assigned points)
Binary constraint alone creates combinatorial complexity with \( K^N \) possible assignment combinations
Standard gradient descent methods are impractical
Expectation Maximization Algorithm: Key Insight
Optimizing both cluster centers and assignments simultaneously is difficult
Optimizing one parameter set while keeping the other fixed is straightforward
EM follows a "zig-zag" path through parameter space toward a minimum
Complex optimization problem decomposes into sequence of simpler analytical problems






EM Algorithm for K-means: E-step
Fix cluster centers \( \boldsymbol{\mu}_k \) and optimize assignments \( r_{nk} \)
Optimal assignment:
Assign each point to its nearest cluster center
J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} |\mathbf{x}_n - \boldsymbol{\mu}_k|^2
r_{nk} = \begin{cases}
1 & \text{if } k = \arg\min_{j} |\mathbf{x}_n - \boldsymbol{\mu}_j|^2 \\
0 & \text{otherwise}
\end{cases}


EM Algorithm for K-means: M-step
Fix assignments \( r_{nk} \) and optimize cluster centers \( \boldsymbol{\mu}_k \)
Take gradient with respect to \( \boldsymbol{\mu}_k \):
J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} |\mathbf{x}_n - \boldsymbol{\mu}_k|^2
\frac{\partial J}{\partial \boldsymbol{\mu}_k} = \sum_{n=1}^N r_{nk} \frac{\partial}{\partial \boldsymbol{\mu}_k} |\mathbf{x}_n - \boldsymbol{\mu}_k|^2
= \sum_{n=1}^N r_{nk} (-2)(\mathbf{x}_n - \boldsymbol{\mu}_k)
EM Algorithm for K-means: M-step
Take gradient with respect to \( \boldsymbol{\mu}_k \):
J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} |\mathbf{x}_n - \boldsymbol{\mu}_k|^2
\frac{\partial J}{\partial \boldsymbol{\mu}_k} = -2 \sum_{n=1}^N r_{nk} \mathbf{x}_n + 2 \boldsymbol{\mu}_k \sum_{n=1}^N r_{nk} = 0
\boldsymbol{\mu}_k = \frac{\sum_{n=1}^N r_{nk} \mathbf{x}_n}{\sum_{n=1}^N r_{nk}}
Optimal center is the mean of all points assigned to that cluster


EM Algorithm for K-means
Repeat E-step and M-step until convergence
Convergence occurs when assignments no longer change significantly




EM Algorithm for K-means
Repeat E-step and M-step until convergence
Convergence occurs when assignments no longer change significantly
Algorithm converges to a local minimum, not necessarily the global minimum
The final solution depends on the initial centers, so multiple runs with different initializations are needed.
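The alternating E-/M-steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name and the convergence check on the centers are my own choices:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means via alternating E- and M-steps (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize centers as K random data points
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    for _ in range(n_iter):
        # E-step: assign each point to its nearest center (r_nk)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        r = d2.argmin(axis=1)
        # M-step: move each center to the mean of its assigned points
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break  # centers (hence assignments) have converged
        mu = new_mu
    return mu, r
```

In practice one would wrap this in a loop over several random seeds and keep the run with the lowest objective \(J\).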

K-means++ Initialization Strategy
More systematic approach to choosing initial cluster centers
Select first center uniformly: \( P(\boldsymbol{\mu}_1 = \mathbf{x}_n) = \frac{1}{N} \) for all n
For each data point, compute the minimum distance to the existing centers \(\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_m\):
D(\mathbf{x}_n) = \min_{j \in \{1,2,\ldots,m\}} |\mathbf{x}_n - \boldsymbol{\mu}_j|
K-means++ Initialization Strategy
Select next center with probability proportional to \( D(\mathbf{x}_n)^2\):
Maximize minimum distance between centers while respecting data density
Repeat until all K centers are chosen, then proceed with standard K-means
P(\boldsymbol{\mu}_{m+1} = \mathbf{x}_n) = \frac{D(\mathbf{x}_n)^2}{\sum_{i=1}^N D(\mathbf{x}_i)^2}
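The seeding procedure above translates directly into code. A minimal NumPy sketch (the function name is mine), sampling each new center with probability proportional to \(D(\mathbf{x}_n)^2\):

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    """K-means++ seeding: first center uniform, later centers sampled
    with probability proportional to the squared distance D(x_n)^2."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first center: uniform over data
    for _ in range(K - 1):
        C = np.array(centers)
        # D(x_n)^2: squared distance to the nearest already-chosen center
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

The returned centers are then handed to standard K-means as the initialization.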




Inductive Biases of Clustering with K-Means
Spherical covariances (same for all clusters)
Hard assignments of points to clusters
Gaussian Mixture Models (GMMs) relax these restrictive assumptions for more flexible clustering
J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} |\mathbf{x}_n - \boldsymbol{\mu}_k|^2

Computational Complexity of K-means
Distance calculation complexity: \(\mathcal{O}(NKD)\) operations, \(N\) data points × \(K\) centers × \(D\) dimensions per distance calculation
Linear scaling with dataset size, number of clusters, and dimensionality.
Efficiency advantage for astronomical datasets
Determining the Number of Clusters
All clustering techniques discussed assume a fixed number of clusters K, which must be specified in advance
Determining the appropriate K value is a challenge
Methods like the inertia (elbow method) and silhouette analysis provide quantitative guidance
Information criteria (toward the end of this lecture) are another theoretical option
Inertia
Plot the within-cluster sum of squares against increasing values of K (number of clusters)
The inertia is defined as:
With \(K^*\) natural clusters, adding clusters while \(K < K^*\) sharply reduces inertia, while \(K > K^*\) yields diminishing returns
\text{SSE} = \sum_{k=1}^K \sum_{n \in C_k} |\mathbf{x}_n - \boldsymbol{\mu}_k|^2
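The SSE above is cheap to compute for a given clustering; scanning it over K produces the elbow plot. A minimal sketch (function name mine):

```python
import numpy as np

def inertia(X, mu, labels):
    """Within-cluster sum of squares: SSE = sum_k sum_{n in C_k} |x_n - mu_k|^2."""
    return sum(((X[labels == k] - mu[k]) ** 2).sum() for k in range(len(mu)))
```

Running K-means for K = 1, 2, 3, ... and plotting `inertia` against K reveals the elbow at the natural number of clusters.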


Silhouette Analysis
For each point i, the score is:
\( a(i) = \) average distance between point i and all other points in the same cluster
\( b(i) = \) minimum average distance between point i and the points in any other cluster
The optimal K maximizes the average silhouette score across all data points.
s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}
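The definition translates directly into a hand-rolled computation. A minimal NumPy sketch (function name mine; assumes every cluster has at least two points):

```python
import numpy as np

def silhouette_scores(X, labels):
    """Per-point silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself from a(i)
        a = D[i, same].mean()                # mean intra-cluster distance
        b = min(D[i, labels == k].mean()     # nearest other cluster's mean distance
                for k in np.unique(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

Averaging `silhouette_scores` over all points for each candidate K and taking the maximum gives the silhouette-optimal K.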


Gaussian Mixture Models
Data is modeled as generated from a weighted sum of multiple Gaussian distributions
p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
\(\pi_k\): Mixture weight representing prior probability of cluster k
Constraints: \(\pi_k \geq 0\) for all k and \(\sum_{k=1}^K \pi_k = 1\)
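The mixture density above can be evaluated directly. A minimal NumPy sketch for a single point (the function name is mine):

```python
import numpy as np

def gmm_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k) for a single D-vector x."""
    D = len(x)
    p = 0.0
    for pi_k, mu, Sigma in zip(pis, mus, Sigmas):
        diff = x - mu
        norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
        p += pi_k * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm
    return p
```

Each term is the multivariate Gaussian density weighted by its mixture weight \(\pi_k\).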


GMM Parameters to Optimize
Total free parameters: \((K-1) + KD + KD(D+1)/2\) (grows quadratically with dimension D)
Mixture weights \(\pi_k\): relative importance of each component, \( K - 1 \) free parameters (they sum to one)
Mean vectors \(\boldsymbol{\mu}_k\): \( K D \) parameters
Covariance matrices \(\boldsymbol{\Sigma}_k\): \( K D (D+1)/2 \) parameters
Simplifying Covariance Structures
Full covariance matrices: \(\boldsymbol{\Sigma}_k\) with \(K \cdot D(D+1)/2\) parameters
Diagonal covariance: \(\boldsymbol{\Sigma}_k = \text{diag}(\sigma_{k1}^2, \sigma_{k2}^2, \ldots, \sigma_{kD}^2)\) with \(K \cdot D\) parameters
Spherical covariance: \(\boldsymbol{\Sigma}_k = \sigma_k^2\mathbf{I}\) with just \(K\) parameters - essentially K-means with soft assignments
Trade-off between model flexibility and computational efficiency


GMM Likelihood Function
Likelihood function:
Log-likelihood:
p(\{\mathbf{x}_n\}|\{\pi_k\}, \{\boldsymbol{\mu}_k\}, \{\boldsymbol{\Sigma}_k\}) = \prod_{n=1}^N \left[\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right]
\ln p(\{\mathbf{x}_n\}|\{\pi_k\}, \{\boldsymbol{\mu}_k\}, \{\boldsymbol{\Sigma}_k\}) = \sum_{n=1}^N \ln\left[\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right]
Challenges in GMM Optimization
Highly multimodal likelihood surface with many local maxima
Interdependence of parameters: optimal values of means depend on covariances and weights
Label switching problem: \( K! \) equivalent modes due to component index permutation
Need a specialized approach to decompose this complex problem into simpler ones
The EM Algorithm for GMMs
Introduces latent variables to overcome optimization challenges
Responsibility \(\gamma_{nk}\) represents posterior probability that component k generated point n
Soft assignment allows partial membership in multiple clusters
\gamma_{nk} = \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
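Computing the responsibilities for a whole dataset is a vectorized version of the formula above. A minimal NumPy sketch (function name mine):

```python
import numpy as np

def responsibilities(X, pis, mus, Sigmas):
    """E-step: gamma_nk = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)."""
    N, D = X.shape
    dens = np.empty((N, len(pis)))
    for k, (pi_k, mu, Sigma) in enumerate(zip(pis, mus, Sigmas)):
        diff = X - mu
        # Mahalanobis term (x_n - mu_k)^T Sigma_k^{-1} (x_n - mu_k) for every n
        maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
        norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
        dens[:, k] = pi_k * np.exp(-0.5 * maha) / norm
    return dens / dens.sum(axis=1, keepdims=True)
```

Each row of the result sums to one: every point distributes its unit membership across the K components.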







Deriving the Log-Likelihood Gradient
Start with log-likelihood:
Take derivative with respect to \(\boldsymbol{\mu}_k\):
\ln p(\{\mathbf{x}_n\}) = \sum_{n=1}^N \ln\left[\sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right]
\frac{\partial}{\partial \boldsymbol{\mu}_k} \ln p = \sum_{n=1}^N \frac{\partial}{\partial \boldsymbol{\mu}_k} \ln\left[\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\right]
Deriving the Log-Likelihood Gradient
Only term with \(\boldsymbol{\mu}_k\) survives in summation:
\frac{\partial}{\partial \boldsymbol{\mu}_k} \ln p = \sum_{n=1}^N \frac{1}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \cdot \frac{\partial}{\partial \boldsymbol{\mu}_k}\left[\pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right]
Deriving the Log-Likelihood Gradient
Differentiating the Gaussian:
Substitute back into our expression:
\frac{\partial}{\partial \boldsymbol{\mu}_k}\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k)
\frac{\partial}{\partial \boldsymbol{\mu}_k} \ln p = \sum_{n=1}^N \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \cdot \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k)
Simplifying the Likelihood Gradient
Substituting responsibilities:
Responsibilities are treated as fixed during the M-step
\frac{\partial}{\partial \boldsymbol{\mu}_k} \ln p = \sum_{n=1}^N \gamma_{nk} \cdot \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k)
Deriving the Mean Update
Set derivative equal to zero and solve:
\sum_{n=1}^N \gamma_{nk} \cdot \boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k) = 0
\Rightarrow \boldsymbol{\mu}_k = \frac{\sum_{n=1}^N \gamma_{nk} \cdot \mathbf{x}_n}{\sum_{n=1}^N \gamma_{nk}}
Define effective number of points in cluster k: \(N_k = \sum_{n=1}^N \gamma_{nk}\)
\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^N \gamma_{nk}\mathbf{x}_n
Deriving the Mean Update
Compare with K-means update:
\boldsymbol{\mu}_k = \frac{\sum_{n=1}^N r_{nk} \cdot \mathbf{x}_n}{\sum_{n=1}^N r_{nk}}
GMMs generalize K-means by replacing hard assignments with soft responsibilities
\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{n=1}^N \gamma_{nk}\mathbf{x}_n


Deriving the Covariance Matrix Update
Start with multivariate Gaussian distribution:
\ln \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{D}{2}\ln(2\pi) - \frac{1}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})
We will differentiate with respect to precision matrix \(\boldsymbol{\Sigma}_k^{-1}\) instead of \(\boldsymbol{\Sigma}_k\)
Matrix calculus identity:
\frac{\partial}{\partial \mathbf{A}} \ln|\mathbf{A}| = \mathbf{A}^{-1}
Differentiating the Log-Likelihood
Start with log-likelihood derivative:
\frac{\partial \ln p}{\partial \boldsymbol{\Sigma}_k^{-1}} = \sum_{n=1}^N \frac{1}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \cdot \frac{\partial}{\partial \boldsymbol{\Sigma}_k^{-1}}\left[\pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right]
Apply chain rule in form \(\frac{d}{dx}f(x) = f(x) \cdot \frac{d}{dx}\ln f(x)\):
\frac{\partial \ln p}{\partial \boldsymbol{\Sigma}_k^{-1}} = \sum_{n=1}^N \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \cdot \frac{\partial}{\partial \boldsymbol{\Sigma}_k^{-1}} \ln \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
Differentiating the Log-Likelihood
First term is the responsibility \(\gamma_{nk}\)
Need to calculate derivative of log Gaussian with respect to precision matrix \( \boldsymbol{\Sigma}_k^{-1} \)
\frac{\partial \ln p}{\partial \boldsymbol{\Sigma}_k^{-1}} = \sum_{n=1}^N \gamma_{nk} \frac{\partial}{\partial \boldsymbol{\Sigma}_k^{-1}} \ln \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)
Derivative of Log Gaussian
Expand and rewrite terms involving \(\boldsymbol{\Sigma}_k\):
Apply matrix calculus identity \( \frac{\partial}{\partial \mathbf{A}} \ln|\mathbf{A}| = \mathbf{A}^{-1} \):
\frac{\partial}{\partial \boldsymbol{\Sigma}_k^{-1}} \left[-\frac{D}{2}\ln(2\pi) + \frac{1}{2}\ln|\boldsymbol{\Sigma}_k^{-1}| - \frac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}_k^{-1}(\mathbf{x}_n - \boldsymbol{\mu}_k)\right]
\frac{\partial}{\partial \boldsymbol{\Sigma}_k^{-1}} \ln \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) = \frac{1}{2}\boldsymbol{\Sigma}_k - \frac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T
Complete Log-Likelihood Derivative
Substitute derivative of log Gaussian back into log-likelihood and set equal to zero to find critical point:
Solve for \(\boldsymbol{\Sigma}_k\):
\sum_{n=1}^N \gamma_{nk} \left[\frac{1}{2}\boldsymbol{\Sigma}_k - \frac{1}{2}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T\right] = 0
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T
Interpreting the Covariance Update
Generalization of sample covariance matrix:
But with fractional point assignments through responsibilities \(\gamma_{nk}\)
\mathbf{S} = \frac{1}{N}\sum_{n=1}^N (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})^T
\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^N \gamma_{nk}(\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^T



Optimizing Mixture Weights
Need to optimize mixture weights \(\pi_k\) while respecting constraint \(\sum_{k=1}^K \pi_k = 1\)
Use Lagrange multipliers for constrained optimization
First term is log-likelihood, second term enforces constraint via multiplier \(\lambda\)
\mathcal{L} = \ln p(\{\mathbf{x}_n\}) + \lambda\left(1 - \sum_{k=1}^K \pi_k\right)
Deriving the Update Equation
Take derivative with respect to \(\pi_k\) and set to zero:
Need to determine value of Lagrange multiplier \(\lambda\)
\frac{\partial \mathcal{L}}{\partial \pi_k} = \frac{\partial}{\partial \pi_k}\sum_{n=1}^N \ln\left[\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)\right] - \lambda
= \sum_{n=1}^N \frac{1}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} \cdot \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) - \lambda = 0
Finding the Lagrange Multiplier
Multiply both sides by \(\pi_k\) and sum over all k, using the constraint \(\sum_{k=1}^K \pi_k = 1\):
Inner numerator sum equals denominator, so fraction equals 1 for each point
\sum_{k=1}^K \pi_k \sum_{n=1}^N \frac{\mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} - \lambda = 0
\lambda = N
Final Update Equation for Weights
Substitute \(\lambda = N\) back into our equation:
Recognize that the fraction is the responsibility \(\gamma_{nk}\):
\sum_{n=1}^N \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} = N \pi_k
\pi_k = \frac{1}{N}\sum_{n=1}^N \gamma_{nk} = \frac{N_k}{N}
Proportion of data effectively "owned" by component k
Completing the EM Algorithm
So far: derived update equations for \(\boldsymbol{\mu}_k\), \(\boldsymbol{\Sigma}_k\), and \(\pi_k\) (M-step)
Alternate with computing optimal responsibilities (E-step)
\gamma_{nk} = \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
When model parameters are fixed, responsibilities are given by
Completing the EM Algorithm
Initialization: Choose starting values for \({\pi_k}, {\boldsymbol{\mu}_k}, {\boldsymbol{\Sigma}_k}\)
Options:
- random data points as means,
- K-means results,
- random partitioning
Typically initialize weights uniformly: \(\pi_k = 1/K\)

Completing the EM Algorithm
E-step: Compute responsibilities using current parameters
\gamma_{nk} = \frac{\pi_k \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

Completing the EM Algorithm
M-step: Update parameters using current responsibilities
N_k = \sum_{n=1}^N \gamma_{nk}
\boldsymbol{\mu}_k^{new} = \frac{1}{N_k}\sum_{n=1}^N \gamma_{nk}\mathbf{x}_n
\boldsymbol{\Sigma}_k^{new} = \frac{1}{N_k}\sum_{n=1}^N \gamma_{nk}(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})(\mathbf{x}_n - \boldsymbol{\mu}_k^{new})^T
Completing the EM Algorithm
M-step: Update parameters using current responsibilities
\pi_k^{new} = \frac{N_k}{N}
Check log-likelihood change; stop if below threshold or maximum iterations reached
Common practice: run algorithm multiple times with different initializations. Select solution with highest likelihood
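Putting the initialization, E-step, M-step, and convergence check together gives the full algorithm. Below is a minimal NumPy sketch following the update equations derived above; the function name, the `init_means` convenience argument, and the small diagonal jitter added for numerical stability are my own choices, not part of the lecture:

```python
import numpy as np

def fit_gmm(X, K, init_means=None, n_iter=200, tol=1e-8, seed=0):
    """EM for a full-covariance GMM (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pis = np.full(K, 1.0 / K)                       # uniform initial weights
    mus = (np.asarray(init_means, dtype=float) if init_means is not None
           else X[rng.choice(N, K, replace=False)].astype(float))
    # Start every component at the global covariance (plus jitter for stability)
    Sigmas = np.array([np.cov(X.T).reshape(D, D) + 1e-6 * np.eye(D) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities gamma_nk
        dens = np.empty((N, K))
        for k in range(K):
            diff = X - mus[k]
            maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigmas[k]), diff)
            norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigmas[k]))
            dens[:, k] = pis[k] * np.exp(-0.5 * maha) / norm
        ll = np.log(dens.sum(axis=1)).sum()         # current log-likelihood
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: effective counts, then weights, means, covariances
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        if abs(ll - prev_ll) < tol:
            break                                   # log-likelihood change below threshold
        prev_ll = ll
    return pis, mus, Sigmas, ll
```

As noted above, one would run this several times with different initializations and keep the solution with the highest final log-likelihood.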





Connection to K-means and Key Advantages
K-means is a limiting case of GMMs (shared spherical covariances, equal weights, and hard assignments).
GMMs use soft assignments \((\gamma_{nk})\), providing natural measure of uncertainty.
GMMs model clusters with arbitrary elliptical shapes through covariance matrices
GMMs are generative models that capture full probability distribution of data
Probabilistic Foundation and Applications
Generative capabilities allow creation of synthetic data:
Select component \(k\) with probability \(\pi_k\)
Sample from Gaussian \(\mathcal{N}(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\)
Valuable for mock catalogs, testing survey selection functions, and forward modeling
Outlier Detection
Enables principled outlier detection using the probability density:
p(\mathbf{x}_*) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}_*|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \ll 1
Low-density points identify potentially interesting astronomical objects
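A minimal sketch of density-based flagging under a fitted GMM (the function name and the particular threshold are my own, purely illustrative choices; in practice the threshold is problem-dependent):

```python
import numpy as np

def log_density(X, pis, mus, Sigmas):
    """Mixture log-density ln p(x_n) for each row of X."""
    N, D = X.shape
    dens = np.zeros(N)
    for pi_k, mu, Sigma in zip(pis, mus, Sigmas):
        diff = X - mu
        maha = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
        norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
        dens += pi_k * np.exp(-0.5 * maha) / norm
    return np.log(dens)

# Points with log-density below a chosen threshold become outlier candidates,
# e.g. flags = log_density(X, pis, mus, Sigmas) < -10.0
```

Ranking objects by `log_density` and inspecting the lowest-density ones is a common way to surface anomalous sources.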

Model Selection with Information Criteria
Determining the number of clusters \(K\):
As \(K\) increases, model complexity and data fit improve, with log-likelihood monotonically increasing
Extreme case: \(K=N\) gives a perfect fit but learns nothing
Information criteria quantify the trade-off between fit and complexity
Parsimony Principle
Occam's razor: When multiple models explain data equally well, prefer the simplest
Einstein: "Everything should be made as simple as possible, but no simpler"
Like regularization in regression, we penalize excessive model complexity
Information-theoretic criteria provide principled statistical foundations for selection
Bayesian Information Criterion (BIC)
Information score (lower is better):
\(\hat{L}\) is maximum likelihood, \( P \) is parameter count, \( N\) is data point count
BIC approximates the Bayes factor, comparing posterior probabilities of different models
\text{BIC} = -2\ln(\hat{L}) + P\ln(N)
Bayesian Information Criterion (BIC)
For a full-covariance GMM with \(K\) components in \(D\) dimensions:
P = (K-1) + KD + K\frac{D(D+1)}{2}
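Both the parameter count and the BIC score are one-liners. A minimal sketch (function names mine):

```python
import math

def gmm_num_params(K, D):
    """P = (K-1) + K*D + K*D*(D+1)/2 for a full-covariance GMM."""
    return (K - 1) + K * D + K * D * (D + 1) // 2

def bic(loglik, P, N):
    """BIC = -2 ln(L_hat) + P ln(N); lower is better."""
    return -2.0 * loglik + P * math.log(N)
```

In practice one fits GMMs over a range of K, evaluates `bic` at the converged log-likelihood of each, and picks the K with the lowest score.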

Bayesian Model Selection Framework
For model selection, compare the "evidence" of different models \(\mathcal{M}_K\):
p(\mathbf{X}|\mathcal{M}_K) = \int p(\mathbf{X}|\boldsymbol{\theta}_K, \mathcal{M}_K)p(\boldsymbol{\theta}_K|\mathcal{M}_K)d\boldsymbol{\theta}_K
Bayesian approach:
p(\boldsymbol{\theta}|\mathbf{X}) = \frac{p(\mathbf{X}|\boldsymbol{\theta})p(\boldsymbol{\theta})}{p(\mathbf{X})}
= \frac{p(\mathbf{X}|\boldsymbol{\theta},\mathcal{M}_K)p(\boldsymbol{\theta}|\mathcal{M}_K)}{p(\mathbf{X}|\mathcal{M}_K)}
Laplace Approximation for BIC
For intractable integrals, approximate the log-posterior around its maximum:
\ln[p(\mathbf{X}|\boldsymbol{\theta}_K)p(\boldsymbol{\theta}_K)] \approx \ln[p(\mathbf{X}|\hat{\boldsymbol{\theta}}_K)p(\hat{\boldsymbol{\theta}}_K)]
- \frac{1}{2}(\boldsymbol{\theta}_K - \hat{\boldsymbol{\theta}}_K)^T \mathbf{H} (\boldsymbol{\theta}_K - \hat{\boldsymbol{\theta}}_K)
where the Hessian is
\mathbf{H} = -\nabla^2 \ln p(\mathbf{X}|\boldsymbol{\theta}_K) = -\sum_{n=1}^N \nabla^2 \ln p(\mathbf{x}_n|\boldsymbol{\theta}_K)
This gives \(\mathbf{H} = N \cdot \mathcal{J}(\hat{\boldsymbol{\theta}}_K)\), where \(\mathcal{J}\) is the average information per observation
BIC Derivation and Asymptotic Behavior
Integrating the Laplace approximation over \(\boldsymbol{\theta}_K\) gives:
p(\mathbf{X}|\mathcal{M}_K) \approx p(\mathbf{X}|\hat{\boldsymbol{\theta}}_K)p(\hat{\boldsymbol{\theta}}_K) \cdot (2\pi)^{P/2}|\mathbf{H}|^{-1/2}
The \((2\pi)^{P/2}|\mathbf{H}|^{-1/2}\) term is the normalization constant of the \(P\)-dimensional Gaussian
By the determinant property, for \(P\) parameters:
|\mathbf{H}| = |N \cdot \mathcal{J}(\hat{\boldsymbol{\theta}}_K)| = N^P \cdot |\mathcal{J}(\hat{\boldsymbol{\theta}}_K)|
This integration automatically implements Occam's razor by penalizing complex parameter spaces
Asymptotic Behavior and Final BIC Form
As \(N\) grows, log-likelihood \(\mathcal{O}(N)\) and penalty \(P\ln(N)\) dominate over \(\mathcal{O}(1)\) terms
Taking \(-2\) times the log:
-2\ln p(\mathbf{X}|\mathcal{M}_K) \approx -2\ln p(\mathbf{X}|\hat{\boldsymbol{\theta}}_K) - 2\ln p(\hat{\boldsymbol{\theta}}_K) + P\ln(N) + \mathcal{O}(1)
This yields BIC:
\text{BIC} = -2\ln(\hat{L}) + P\ln(N)
Practical BIC Application for GMMs
Select model with lowest BIC value
Compute log-likelihood at EM convergence:
\ln(\hat{L}) = \sum_{n=1}^N \ln\left(\sum_{k=1}^K \hat{\pi}_k \mathcal{N}(\mathbf{x}_n|\hat{\boldsymbol{\mu}}_k, \hat{\boldsymbol{\Sigma}}_k)\right)
BIC validity depends on large samples, well-behaved likelihoods, and diffuse priors

Akaike Information Criterion (AIC)
Smaller penalty than BIC: \(2P\) instead of \(P\ln(N)\)
AIC approaches model selection from information theory perspective
\text{AIC} = -2\ln(\hat{L}) + 2P
AIC tends to favor more complex models than BIC.
Information-Theoretic Intuition for AIC
Aims to maximize information between model and true data-generating process
AIC connects to entropy and mutual information concepts
Using the same data to fit and evaluate a model introduces bias, which is larger for complex models
This bias is asymptotically approximated by parameter count \(P\)


Small Sample Correction for AIC
For limited samples, the corrected AIC (AICc) is preferred:
\text{AIC}_c = \text{AIC} + \frac{2P(P+1)}{N-P-1}
The correction provides a more aggressive penalty when observations are limited
As \(N\) increases, AICc converges to standard AIC
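Both criteria are direct transcriptions of the formulas. A minimal sketch (function names mine; AICc requires \(N > P + 1\)):

```python
import math

def aic(loglik, P):
    """AIC = -2 ln(L_hat) + 2P; lower is better."""
    return -2.0 * loglik + 2.0 * P

def aicc(loglik, P, N):
    """Small-sample corrected AIC; requires N > P + 1, converges to AIC as N grows."""
    return aic(loglik, P) + 2.0 * P * (P + 1) / (N - P - 1)
```

For small N the correction term dominates, penalizing complex models more heavily; for large N it vanishes.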

BIC vs. AIC: Philosophical Differences
AIC aims to find the model that will make the best predictions on new data
BIC attempts to identify the "true" model
Both criteria provide guidance but require domain knowledge for interpretation
Final decisions should integrate statistical evidence with scientific insight
What We Have Learned:
K-means
Gaussian Mixture Models
Expectation-Maximization Algorithm
Information Criterion
Astron 5550 - Lecture 11
By Yuan-Sen Ting