Lecture 9 : Bayesian Logistic Regression

Assoc. Prof. Yuan-Sen Ting

Astron 5550 : Advanced Astronomical Data Analysis

Logistics

Homework Assignment 3 due this Friday,11:59pm

Group Project Report due tonight, 11:59pm

Select and submit your topic for group project 2 (on Carmen), by next Friday, 11:59pm

I will be away next week. Lecture will be recorded and posted on Carmen

Plan for Today :

Bayesian Logistic Regression

Soft Boundary

Laplace Approximation

Probit Approximation

Logistic Regression Recap

Logistic regression uses linear weights \({w_k}\) to convert features into probabilities via logits

Maximum likelihood estimation optimizes weights using cross-entropy loss

Binary Logistic Regression Review

Binary classification uses sigmoid function to model class probability: 

Likelihood function: 

p(C_1|\mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T \mathbf{x}}}
p(\mathbf{t} | \mathbf{w}, \mathbf{X}) = \prod^N_{n=1} y_n^{t_n} (1 - y_n)^{(1-t_n)}

Astronomical Classification Challenges

Astronomical data suffers from scarcity of well-labeled examples

Training samples typically come from biased sampling (easier-to-classify objects)

Example: Galaxy classification datasets contain clear spiral/elliptical examples but few transitional morphologies

The Bayesian Perspective

Core insight: Model parameters \( \{ \mathbf{w} \} \) themselves have uncertainty

Multiple decision boundaries could reasonably separate classes given finite data

Rather than seeking single "best" model, characterize posterior distribution over all possible models

Bayesian Approach to Logistic Regression

Instead of single weight vector, we derive posterior distribution \(p(\mathbf{w}|\mathcal{D})\)

Provides naturally higher uncertainty in regions with sparse/no training data

Technical challenges: non-Gaussian posterior distribution requires approximation techniques (Laplace approximation, probit approximation)

Bayesian Framework for Logistic Regression

Start with prior distribution over weights: 

Apply Bayes' theorem: 

p(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{m}_0, \mathbf{S}_0)
p(\mathbf{w} | \mathbf{t}, \mathbf{X}) \propto p(\mathbf{w}) \cdot p(\mathbf{t} | \mathbf{w}, \mathbf{X})

The Non-Gaussian Posterior Challenge

Posterior distribution:

Unlike Bayesian linear regression, posterior is not Gaussian

No simple analytical form exists for this posterior distribution

p(\mathbf{w} | \mathbf{t}, \mathbf{X}) \propto \mathcal{N}(\mathbf{w} | \mathbf{m}_0, \mathbf{S}_0) \cdot \prod^N_{n=1} y_n^{t_n} (1 - y_n)^{(1-t_n)}

Bernstein-von Mises Theorem

Posteriors tend toward Gaussian distributions as sample size increases

Related to Central Limit Theorem

Each observation contributes information; aggregate effect produces increasingly Gaussian-shaped distribution

Solution preview: Laplace approximation will approximate this complex posterior with a Gaussian distribution

Mechanics of Laplace Approximation

We approximate our complex posterior with a Gaussian 

Location of the peak (mode): becomes the mean of our approximating Gaussian

Curvature at the peak: determines covariance (measured by second derivatives of log posterior)

Dealing With Unnormalized Distribution

In Bayesian inference, we often know distributions only up to a normalization constant: \(p(z) = \frac{f(z)}{Z}\), where \(Z = \int f(z) dz\)

By Bayes' theorem: 

The denominator \(p(\text{data})\) (evidence) is often intractable

Laplace approximation allows us to work directly with unnormalized density \(f(z) = p(\text{data}|z)p(z) \)

p(z|\text{data}) = \frac{p(\text{data}|z)p(z)}{p(\text{data})}

Finding the Mode of the Target Distribution

Locate the mode \(z_0\) where probability density is maximum

At this point, the derivative equals zero: 

Equivalently, the derivative of log-density equals zero: 

\left.\frac{d}{dz}f(z)\right|_{z=z_0} = 0
\left.\frac{d}{dz}\ln f(z)\right|_{z=z_0} = 0

Taylor Expansion Around the Mode

Second-order Taylor expansion of log-density around mode \(z_0\):

Define curvature parameter: \(A = -\left.\frac{d^2}{dz^2}\ln f(z)\right|_{z=z_0}\)

\ln f(z) \approx \ln f(z_0) + \left.\frac{d}{dz}\ln f(z)\right|_{z=z_0}(z-z_0)
+ \frac{1}{2}\left.\frac{d^2}{dz^2}\ln f(z)\right|_{z=z_0}(z-z_0)^2

Constructing the Gaussian Approximation

Approximation becomes: 

Exponentiating both sides: \(f(z) \approx f(z_0)\exp\left(-\frac{1}{2}A(z-z_0)^2\right)\)

\ln f(z) \approx \ln f(z_0) - \frac{1}{2}A(z-z_0)^2

Curvature \(A\) determines the width and the normalization factor

After normalization: \(q(z) = \mathcal{N}(z|z_0, A^{-1}) \) 

The Complete Laplace Approximation

Find the mode \(z_0\) of the distribution (where first derivative = zero)

Calculate second derivative at mode to determine
\(A = -\left.\frac{d^2}{dz^2}\ln f(z)\right|_{z=z_0} \)

Construct Gaussian approximation with mean \(z_0\) and variance \(A^{-1}\). We have \(q(z) = \mathcal{N}(z|z_0, A^{-1}) \) 

Provides principled approach for approximating complex distributions, particularly valuable when exact posteriors are intractable

Extension to Multiple Dimensions

Step 1: Find the mode \(\mathbf{z}_0\) where the gradient of log density vanishes: 

Gradient operator produces vector of partial derivatives:

\nabla \ln f(\mathbf{z})\Big|_{\mathbf{z}=\mathbf{z}_0} = \mathbf{0}
\nabla \ln f(\mathbf{z}) = \left(\frac{\partial \ln f(\mathbf{z})}{\partial z_1}, \frac{\partial \ln f(\mathbf{z})}{\partial z_2}, \ldots, \frac{\partial \ln f(\mathbf{z})}{\partial z_d}\right)^T

Multivariate Taylor Expansion

Step 2: Second-order Taylor expansion around the mode:

Hessian matrix \(\mathbf{H}\) of second derivatives:

ln f(\mathbf{z}) \approx \ln f(\mathbf{z}_0) + (\mathbf{z}-\mathbf{z}_0)^T \nabla \ln f(\mathbf{z})\Big|_{\mathbf{z}=\mathbf{z}_0}
H_{ij} = \frac{\partial^2 \ln f(\mathbf{z})}{\partial z_i \partial z_j}\Big|_{\mathbf{z}=\mathbf{z}_0}
+ \frac{1}{2}(\mathbf{z}-\mathbf{z}_0)^T \mathbf{H} (\mathbf{z}-\mathbf{z}_0)

Laplace Approximation at Multiple Dimension

Define precision matrix \(\mathbf{A} = -\mathbf{H}\) (negative Hessian)

Our approximation becomes: 

\ln f(\mathbf{z}) \approx \ln f(\mathbf{z}_0) - \frac{1}{2}(\mathbf{z}-\mathbf{z}_0)^T \mathbf{A} (\mathbf{z}-\mathbf{z}_0)

Multivariate Laplace approximation is a \(\mathcal{N}(\mathbf{z}_0, \mathbf{A}^{-1})\) distribution.

Application to Bayesian Logistic Regression

MAP finds parameter values that maximize posterior probability: 

Extends Maximum Likelihood Estimation by incorporating prior 

\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} p(\mathbf{w}|\mathcal{D}) = \arg\max_{\mathbf{w}} {p(\mathcal{D}|\mathbf{w})p(\mathbf{w})}

When prior is uniform, MAP reduces to MLE

Then, calculate the Hessian at the log posterior evaluated at \( \mathbf{w}_{\text{MAP}} \)

The Log Posterior

Log posterior combines prior and likelihood terms: 

Second term is familiar cross-entropy from MLE

\ln p(\mathbf{w}|\mathbf{t},\mathbf{X}) = \ln p(\mathbf{w}) + \ln p(\mathbf{t}|\mathbf{w},\mathbf{X})
= -\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^T\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0)
+ \sum_{n=1}^N [t_n\ln\sigma(\mathbf{w}^T\mathbf{x}_n) + (1-t_n)\ln(1-\sigma(\mathbf{w}^T\mathbf{x}_n))]

Finding the MAP Estimate

Need to optimize posterior probability using stochastic gradient descent 

Computing the gradient:

\nabla \ln p(\mathbf{w}|\mathbf{t},\mathbf{X}) = \nabla \ln p(\mathbf{w}) + \nabla \ln p(\mathbf{t}|\mathbf{w},\mathbf{X})
= -\mathbf{S}_0^{-1}(\mathbf{w}-\mathbf{m}_0) + \sum_{n=1}^N (t_n - \sigma(\mathbf{w}^T\mathbf{x}_n))\mathbf{x}_n

Prior acts as regularizer that pulls weights toward \(\mathbf{m}_0\)

Finding the MAP Estimate

Unlike linear regression, no closed-form solution 

Maximize log posterior \( \ln p(\mathbf{w}|\mathbf{t},\mathbf{X}) \) using SGD: 

We climb up the posterior with the gradient:

\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta \nabla \ln p(\mathbf{w}^{(t)}|\mathbf{t},\mathbf{X})

After finding \(\mathbf{w}_{\text{MAP}}\), compute Hessian matrix for
Laplace approximation

Computing the Hessian Matrix

For Laplace approximation, we need negative Hessian at \(\mathbf{w}_{\text{MAP}}\)

Hessian is matrix of second derivatives of log posterior:

\mathbf{A} = \left. - \mathbf{H} \right|_{\mathbf{w}_\text{MAP}} = \left. - \nabla\nabla \ln p(\mathbf{w}) - \nabla\nabla \ln p(\mathbf{t}|\mathbf{w},\mathbf{X}) \right|_{\mathbf{w}_{\text{MAP}}}
=\mathbf{S}_0^{-1} + \sum_{n=1}^N \sigma(\mathbf{w}_{\text{MAP}}^T\mathbf{x}_n)(1-\sigma(\mathbf{w}_{\text{MAP}}^T\mathbf{x}_n))\mathbf{x}_n\mathbf{x}_n^T

Represents inverse covariance of posterior approximation

Computing the Hessian Matrix

Each data point's contribution weighted by
\(\sigma(\mathbf{w}_{\text{MAP}}^T\mathbf{x}_n)(1-\sigma(\mathbf{w}_{\text{MAP}}^T\mathbf{x}_n))\)

This term has maximum value of 0.25 when \(\sigma(\mathbf{w}_{\text{MAP}}^T\mathbf{x}_n) = 0.5\) (at decision boundary)

More data points near decision boundary → increased precision matrix magnitude → less parameter uncertainty

\mathbf{A} = \mathbf{S}_0^{-1} + \sum_{n=1}^N \sigma(\mathbf{w_{\text{MAP}}}^T\mathbf{x}_n)(1-\sigma(\mathbf{w}_{\text{MAP}}^T\mathbf{x}_n))\mathbf{x}_n\mathbf{x}_n^T

Logistics

Homework Assignment 3 due this Friday,11:59pm

Select and submit your topic for group project 2 (on Carmen), by next Friday, 11:59pm

I will be away next week. Lecture will be recorded and posted on Carmen

From Posterior to Prediction

We approximated the posterior as a Gaussian distribution 

Distribution over decision boundaries: Each sample \(\mathbf{w}\) corresponds to a different boundary, assigning different probabilities

q(\mathbf{w}) = \mathcal{N}(\mathbf{w} | \mathbf{w}_{\text{MAP}}, \mathbf{A}^{-1})
\mathbf{A} = \mathbf{S}_0^{-1} + \sum_{n=1}^N \sigma(\mathbf{w_{\text{MAP}}}^T\mathbf{x}_n)(1-\sigma(\mathbf{w}_{\text{MAP}}^T\mathbf{x}_n))\mathbf{x}_n\mathbf{x}_n^T

The Predictive Distribution

Probability of a class label for a new input that integrates over all model uncertainty

A weighted average where each model contributes according to its posterior probability

p(y^* = 1 | \mathbf{x}^*, \mathcal{D}) = \int p(y^* = 1 | \mathbf{x}^*, \mathbf{w}) p(\mathbf{w} | \mathcal{D}) d\mathbf{w}

For the other class

p(y^* = 0 | \mathbf{x}^*, \mathcal{D}) = 1 - p(y^* = 1 | \mathbf{x}^*, \mathcal{D})

Computational challenge of the integral

Substituting our Gaussian approximation:

The integral lacks a closed-form solution unlike in Bayesian Linear Regression

p(y^* = 1 | \mathbf{x}^*, \mathcal{D}) \approx \int \sigma(\mathbf{w}^T \mathbf{x}^*) \mathcal{N}(\mathbf{w} | \mathbf{w}_{\text{MAP}}, \mathbf{A}^{-1}) d\mathbf{w}

Approximate sigmoid with probit function to make the integral more manageable

Simplifying the High-Dimensional Integral

For classification, what truly matters is the logit value \( a = \mathbf{w}^T \mathbf{x}^* \)

This single value determines class probability through the sigmoid function \( \sigma(a) \)

\( a \) represents signed distance from test point to decision boundary 

Many different weight vectors \( \mathbf{w} \) can produce the same logit value \( a \) for a given input

Weight Space Redundancy

Weight vectors forming a hyperplane perpendicular to \( \mathbf{x}^* \) all map to the same logit \( a \)

This many-to-one mapping suggests we can simplify by focusing on distribution of \( a \) rather than full distribution of \( \mathbf{w} \)

Decompose weight vector into components parallel and perpendicular to \( \mathbf{x}^* \): \( \mathbf{w} = w_{\parallel}\frac{\mathbf{x}^*}{|\mathbf{x}^*|} + \mathbf{w}_{\perp} \)

Only the component of \( \mathbf{w} \) parallel to \( \mathbf{x}^* \) affects the prediction. From D dimension to 1D.

Transforming the Integral

Original predictive distribution:

Change of variables from \( \mathbf{w} \) to \( ( a, \mathbf{w}_{\perp} ) \) 

 

p(y^* = 1 | \mathbf{x}^*, \mathcal{D}) \approx \int \sigma(\mathbf{w}^T \mathbf{x}^*) \mathcal{N}(\mathbf{w} | \mathbf{w}_{\text{MAP}}, \mathbf{A}^{-1}) d\mathbf{w}
p(y^* = 1 | \mathbf{x}^*, \mathcal{D}) \approx \int \sigma(a) \mathcal{N}(a | \mu_a, \sigma_a^2) da

Transformation of Random Variables

Since \( \mathbf{w} \) follows Gaussian \( \mathcal{N}(\mathbf{w}_{\text{MAP}}, \mathbf{A}^{-1}) \), and \( a = \mathbf{w}^T \mathbf{x}^* \) is a linear combination

Linear transformations of Gaussian variables remain Gaussian

Mean and variance of logit distribution: \( \mu_a = \mathbf{w}_{\text{MAP}}^T\mathbf{x}^* \)
\( \sigma_a^2 = \mathbf{x}^{*T}\mathbf{A}^{-1}\mathbf{x}^* \)

Remarkable simplification: High-dimensional integral reduced to one-dimensional

The Probit Function

p(y^* = 1 | \mathbf{x}^*, \mathcal{D}) \approx \int \sigma(a) \mathcal{N}(a | \mu_a, \sigma_a^2) da

Mathematical forms of sigmoid and Gaussian don't allow direct integration

Need a creative approach: Replace sigmoid with a similar function

Use the probit function that allows analytical integration

Introducing the Probit Function

Probit function \( \Phi(x) \): Cumulative distribution function (CDF) of a standard unit Gaussian distribution

Mathematical definition: 

Monotonically increasing with "S" shape

\Phi(x) = \int_{-\infty}^{x} \mathcal{N}(t | 0, 1) dt

Approximating Sigmoid with Probit

Need to approximate: \( \sigma(a) \approx \Phi(\lambda a) \)

Find optimal \( \lambda \) by equating derivatives at \( a = 0 \):

Sigmoid derivative at origin: \( \frac{d}{da}\sigma(a)|_{a=0} = \frac{1}{4} \). Probit derivative at origin: \( \frac{d}{da}\Phi(\lambda a)|_{a=0} = \frac{\lambda}{\sqrt{2\pi}} \)

Equating these: \( \frac{1}{4} = \frac{\lambda}{\sqrt{2\pi}} \implies \lambda = \sqrt{\frac{\pi}{8}} \approx 0.63 \)

The Key Mathematical Property

Remarkable property of probit function when integrated with Gaussian:

This identity is exactly what we need to solve our predictive distribution integral

Can be proven elegantly using properties of random variables

\int \Phi(\lambda a) \mathcal{N}(a | \mu, \sigma^2) da = \Phi\left(\frac{\lambda\mu}{\sqrt{1 + \lambda^2\sigma^2}}\right)

Proving the Identity

Consider two independent random variables:

Linear combination of random variables \(  \)

\( X \sim \mathcal{N}(0, 1/\lambda^2) \) 

\( Y \sim \mathcal{N}(\mu, \sigma^2) \) 

X - Y \sim \mathcal{N}(-\mu, 1/\lambda^2 + \sigma^2)

Proving the Identity

Probability that \( P(X < Y) \) or \( P( X - Y < 0 ) \) is: 

Definition of the Probit function and cumulative distribution function

P(X - Y < 0) = \Phi\left(\frac{\mu}{\sqrt{1/\lambda^2 + \sigma^2}}\right) = \Phi\left(\frac{\lambda\mu}{\sqrt{1 + \lambda^2\sigma^2}}\right)
X - Y \sim \mathcal{N}(-\mu, 1/\lambda^2 + \sigma^2)
\int \Phi(\lambda a) \mathcal{N}(a | \mu, \sigma^2) da = \Phi\left(\frac{\lambda\mu}{\sqrt{1 + \lambda^2\sigma^2}}\right)

The RHS of

Proving the Identity

Calculate same probability by conditioning on value of \( Y \)

For each possible \( y \), probability that \( X < y \) is \( \Phi(\lambda y) \)

Integrate over all possible values

This gives us: \( P(X < Y) = \int_{-\infty}^{\infty} \Phi(\lambda y) \mathcal{N}(y | \mu, \sigma^2) dy \). Equals to the LHS of the identity

\int \Phi(\lambda a) \mathcal{N}(a | \mu, \sigma^2) da = \Phi\left(\frac{\lambda\mu}{\sqrt{1 + \lambda^2\sigma^2}}\right)

Derivation of the Predictive Distribution

Original predictive distribution integral:

Using probit approximation: 

p(y^* = 1 | \mathbf{x}^*, \mathcal{D}) = \int \sigma(\mathbf{w}^T\mathbf{x}^*) p(\mathbf{w}|\mathcal{D}) d\mathbf{w}
p(y^* = 1 | \mathbf{x}^*, \mathcal{D}) = \int \sigma(a) \mathcal{N}(a | \mu_a, \sigma_a^2) da
p(y^* = 1 | \mathbf{x}^*, \mathcal{D}) \approx \int \Phi\left(\sqrt{\frac{\pi}{8}} a\right) \mathcal{N}(a | \mu_a, \sigma_a^2) da

Applying the Integration Formula

Using our derived identity with \( \lambda = \sqrt{\pi/8} \):

Converting back to sigmoid function:

\int \Phi\left(\sqrt{\frac{\pi}{8}} a\right) \mathcal{N}(a | \mu_a, \sigma_a^2) da = \Phi\left(\frac{\sqrt{\pi/8} \cdot \mu_a}{\sqrt{1 + (\pi/8) \sigma_a^2}}\right)
\Phi\left(\frac{\sqrt{\pi/8} \cdot \mu_a}{\sqrt{1 + (\pi/8) \sigma_a^2}}\right) \approx \sigma\left(\frac{\mu_a}{\sqrt{1 + \pi\sigma_a^2/8}}\right)

Final expression for predictive distribution:

Where \( \mu_a = \mathbf{w}_{\text{MAP}}^T\mathbf{x}^* \) and \( \sigma_a^2 = \mathbf{x}^{*T}\mathbf{A}^{-1}\mathbf{x}^* \)

p(y^* = 1 | \mathbf{x}^*, \mathcal{D}) \approx \sigma\left(\frac{\mu_a}{\sqrt{1 + \pi\sigma_a^2/8}}\right)
p(y^* = 1 | \mathbf{x}^*, \mathcal{D}) \approx \sigma\left(\kappa(\mathbf{x}^{*T} \mathbf{A}^{-1} \mathbf{x}^*) \cdot \mathbf{w}_{\text{MAP}}^T \mathbf{x}^*\right)

Substituting original terms: 

Define uncertainty factor: \( \kappa(\sigma_a^2) = (1 + \pi\sigma_a^2/8)^{-1/2} \)

Comparison with Standard Logistic Regression

Dampens certainty of predictions by pushing toward 0.5

p(y^* = 1 | \mathbf{x}^*, \mathcal{D}) \approx \sigma\left(\kappa(\mathbf{x}^{*T} \mathbf{A}^{-1} \mathbf{x}^*) \cdot \mathbf{w}_{\text{MAP}}^T \mathbf{x}^*\right)

Standard logistic regression prediction: 

Uncertainty factor: \( \kappa(\sigma_a^2) = (1 + \pi\sigma_a^2/8)^{-1/2} \)

p(y^* = 1 | \mathbf{x}^*, \mathbf{w}_{\text{MAP}}) = \sigma(\mathbf{w}_{\text{MAP}}^T \mathbf{x}^*)

Geometric Interpretation of Uncertainty

When \( \mathbf{x}^* \) is far from training data \(\rightarrow\) Larger values in \( \mathbf{A}^{-1} \) in direction of \( \mathbf{x}^* \)

The variance \( \sigma_a^2 = \mathbf{x}^{*T} \mathbf{A}^{-1} \mathbf{x}^* \) depends on the location of test point relative to training data

Uncertainty factor: \( \kappa(\sigma_a^2) = (1 + \pi\sigma_a^2/8)^{-1/2} \)

Increased \( \sigma_a^2 \) reflecting greater uncertainty. Boundary becomes "softer" away from training data

Application to Astronomical Classification

Bayesian approach provides crucial uncertainty signal.

Naturally conservative with unusual objects

Predictions closer to 0.5 for objects unlike training examples

Identifies potential new or transitional types

What We Have Learned

Bayesian Logistic Regression

Soft Boundary

Laplace Approximation

Probit Approximation

Astron 5550 - Lecture 9

By Yuan-Sen Ting

Astron 5550 - Lecture 9

  • 91