CS6015: Linear Algebra and Random Processes
Lecture 37: Exponential families of distributions
Learning Objectives
What are exponential families?
Why do we care about them?
What are exponential families?
A set of probability distributions whose pmf (discrete case) or pdf (continuous case) can be expressed in the following form:
f_{X}(x|\theta) = h(x) \exp[\eta(\theta)\cdot T(x) - A(\theta)]
T, h: known functions of x
\eta, A: known functions of \theta
\theta: parameter
Note: the support of f_X(x|\theta) should not depend on \theta.
(For the Binomial, the support \{0,1,\dots,n\} depends on n, so n must be known/fixed.)

Recap: Parameters
Normal: \mu, \sigma
Bernoulli: p
Binomial: n, p
Questions of Interest
Are there any popular families that we care about? Many! (we will see some soon)
Why do we care about this form? It has some useful properties! (we will see some of them soon)
f_{X}(x|\theta) = h(x) \exp[\eta(\theta)\cdot T(x) - A(\theta)]
Example 1: Bernoulli Distribution
f_{X}(x|\theta) = h(x) \exp[\eta(\theta)\cdot T(x) - A(\theta)]
p_{X}(x) = p^x(1-p)^{1-x}
\theta = p
= \exp(\log(p^x(1-p)^{1-x})) \quad (\text{recall } \exp(k) = e^k)
= \exp(\log p^x + \log(1-p)^{1-x})
= \exp(x\log p + (1-x)\log(1-p))
= \exp(x\log p - x\log(1-p) + \log(1-p))
= \exp\left(\log\frac{p}{1-p}\cdot x + \log(1-p)\right)
h(x) = 1
\eta(p) = \log\frac{p}{1-p}
T(x) = x
A(p) = -\log(1-p)
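As a quick numerical sanity check (my addition, not from the slides; the variable names are illustrative), the exponential-family form with these components should reproduce the Bernoulli pmf:

import numpy as np

# Check: h(x) * exp(eta(p) * T(x) - A(p)) should equal p^x (1-p)^(1-x)
p = 0.3
eta = np.log(p / (1 - p))   # eta(p) = log(p / (1-p))
A = -np.log(1 - p)          # A(p) = -log(1-p)
for x in [0, 1]:
    family_form = 1.0 * np.exp(eta * x - A)   # h(x) = 1, T(x) = x
    pmf = p**x * (1 - p)**(1 - x)
    assert np.isclose(family_form, pmf)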
Example 2: Binomial Distribution
(a member of the exponential family only if n is known/fixed and hence not a parameter)
p_{X}(x) = {n \choose x} p^x(1-p)^{n-x}, \quad x \in \{0,1,2,\dots,n\}
= \exp\left(\log\left({n \choose x} p^x(1-p)^{n-x}\right)\right)
= \exp\left(\log{n \choose x} + x\log p + (n-x)\log(1-p)\right)
= \exp\left(\log{n \choose x} + \log\frac{p}{1-p}\cdot x + n\log(1-p)\right)
= {n \choose x}\exp\left(\log\frac{p}{1-p}\cdot x + n\log(1-p)\right)
h(x) = {n \choose x}
\eta(p) = \log\frac{p}{1-p}
T(x) = x
A(p) = -n\log(1-p)
f_{X}(x|\theta) = h(x) \exp[\eta(\theta)\cdot T(x) - A(\theta)]
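A similar check for the Binomial (again my addition, not from the slides), which also confirms that h(x) must carry the binomial coefficient:

import numpy as np
from scipy.special import comb
from scipy.stats import binom

# Check: C(n,x) * exp(eta * x - A) should equal the Binomial pmf
n, p = 10, 0.4
eta = np.log(p / (1 - p))       # eta(p) = log(p / (1-p))
A = -n * np.log(1 - p)          # A(p) = -n log(1-p)
x = np.arange(n + 1)
family_form = comb(n, x) * np.exp(eta * x - A)   # h(x) = C(n, x)
assert np.allclose(family_form, binom.pmf(x, n, p))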
Example 3: Normal Distribution
Two parameters: μ,σ
f_X(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)
= \frac{1}{\sqrt{2\pi}}\exp(-\log\sigma) \exp\left(-\frac{x^2 - 2x\mu + \mu^2}{2\sigma^2}\right)
= \frac{1}{\sqrt{2\pi}}\exp\left(\frac{x\mu}{\sigma^2} - \frac{x^2}{2\sigma^2} - \frac{\mu^2}{2\sigma^2} - \log\sigma\right)
= \frac{1}{\sqrt{2\pi}}\exp\left(\begin{bmatrix}\frac{\mu}{\sigma^2} & -\frac{1}{2\sigma^2}\end{bmatrix}\begin{bmatrix}x \\ x^2\end{bmatrix} - \left(\frac{\mu^2}{2\sigma^2} + \log\sigma\right)\right)
h(x) = \frac{1}{\sqrt{2\pi}}
\eta(\mu, \sigma^2) = \begin{bmatrix}\frac{\mu}{\sigma^2} & -\frac{1}{2\sigma^2}\end{bmatrix}^\top
T(x) = \begin{bmatrix}x & x^2\end{bmatrix}^\top
A(\mu, \sigma^2) = \frac{\mu^2}{2\sigma^2} + \log\sigma
(dot product; the dimension of \eta(\theta) equals the number of parameters)
f_{X}(x|\theta) = h(x) \exp\left[\sum_i \eta_i(\theta)\cdot T_i(x) - A(\theta)\right]
f_{X}(x|\theta) = h(x) \exp[\eta(\theta)\cdot T(x) - A(\theta)]
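The vector form can also be checked numerically (a sketch, not from the slides), comparing against scipy's Normal pdf:

import numpy as np
from scipy.stats import norm

# Check: (1/sqrt(2*pi)) * exp(eta . T(x) - A) should equal the Normal pdf
mu, sigma = 1.5, 2.0
eta = np.array([mu / sigma**2, -1 / (2 * sigma**2)])
A = mu**2 / (2 * sigma**2) + np.log(sigma)
x = np.linspace(-5, 5, 11)
T = np.stack([x, x**2])                                  # T(x) = [x, x^2]^T
family_form = np.exp(eta @ T - A) / np.sqrt(2 * np.pi)   # h(x) = 1/sqrt(2*pi)
assert np.allclose(family_form, norm.pdf(x, loc=mu, scale=sigma))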
The bigger picture
exponential families:
scalar parameter, scalar variable: Bernoulli, Binomial (n fixed), Poisson, Negative binomial (r fixed), Geometric, ...
vector parameter, scalar variable: Normal, Gamma, Beta, ...
vector parameter, vector variable: Multinomial, Bivariate normal, Multivariate normal, ...
f_{X}(x|\theta) = h(x) \exp[\eta(\theta)\cdot T(x) - A(\theta)]
The gamma distribution
Why do families matter?
f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}
\Gamma(\alpha) = (\alpha - 1)! \quad (\text{for integer } \alpha)
\alpha: shape parameter
\beta: rate parameter

import numpy as np
from scipy.stats import gamma
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 500)
alpha = 3   # shape parameter
beta = 1    # rate parameter; scipy parameterises by scale = 1/rate
rv = gamma(alpha, loc=0., scale=1/beta)

plt.plot(x, rv.pdf(x))
plt.xlabel('x')
plt.ylabel('f(x)')
plt.show()
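As a quick check of the pdf formula against the library implementation (my addition, reusing x, alpha, beta, and rv from the script above):

from scipy.special import gamma as gamma_fn   # the Gamma function

# beta^alpha / Gamma(alpha) * x^(alpha-1) * e^(-beta x) should match scipy
manual_pdf = beta**alpha / gamma_fn(alpha) * x**(alpha - 1) * np.exp(-beta * x)
assert np.allclose(manual_pdf, rv.pdf(x))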
Example 4: Bivariate normal
f_{X,Y}(x,y) = \frac{1}{2\pi |\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^\top\Sigma^{-1}(\mathbf{x} - \mu)\right), \quad \mathbf{x} = [x~~y]^\top
= \frac{1}{2\pi} |\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^\top\Sigma^{-1}(\mathbf{x} - \mu)\right)
= \frac{1}{2\pi} \exp\left(-\frac{1}{2}\left(\mathbf{x}^\top\Sigma^{-1}\mathbf{x} + \mu^\top\Sigma^{-1}\mu - 2\mu^\top\Sigma^{-1}\mathbf{x} + \ln|\Sigma|\right)\right)
= \frac{1}{2\pi} \exp\left(-\frac{1}{2}\mathbf{x}^\top\Sigma^{-1}\mathbf{x} + \mu^\top\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mu^\top\Sigma^{-1}\mu - \frac{1}{2}\ln|\Sigma|\right)
= \frac{1}{2\pi} \exp\left(-\frac{1}{2}vec(\Sigma^{-1})^\top vec(\mathbf{x}\mathbf{x}^\top) + \mu^\top\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mu^\top\Sigma^{-1}\mu - \frac{1}{2}\ln|\Sigma|\right)
f_{X}(x|\theta) = h(x) \exp[\eta(\theta)\cdot T(x) - A(\theta)]
h(\mathbf{x}) = \frac{1}{2\pi}
\eta(\mu, \Sigma) = \begin{bmatrix}\Sigma^{-1}\mu \\ -\frac{1}{2}vec(\Sigma^{-1})\end{bmatrix}
T(\mathbf{x}) = \begin{bmatrix}\mathbf{x} \\ vec(\mathbf{x}\mathbf{x}^\top)\end{bmatrix}
A(\mu, \Sigma) = \frac{1}{2}\left(\mu^\top\Sigma^{-1}\mu + \ln|\Sigma|\right)
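Both the vec(.) trick and the final components can be verified numerically (a sketch under assumed values for mu and Sigma; not part of the slides):

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
P = np.linalg.inv(Sigma)              # precision matrix Sigma^{-1}
x = np.array([0.7, 1.2])

# vec identity: x^T P x == vec(P) . vec(x x^T)
assert np.isclose(x @ P @ x, P.ravel() @ np.outer(x, x).ravel())

# exponential-family form vs scipy's bivariate normal pdf
eta = np.concatenate([P @ mu, -0.5 * P.ravel()])
T = np.concatenate([x, np.outer(x, x).ravel()])
A = 0.5 * (mu @ P @ mu + np.log(np.linalg.det(Sigma)))
family_form = np.exp(eta @ T - A) / (2 * np.pi)   # h(x) = 1/(2 pi)
assert np.isclose(family_form, multivariate_normal(mu, Sigma).pdf(x))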
Example 5: Multivariate normal
Try on your own
Natural parameterisation
f_{X}(x|\theta) = h(x) \exp[\eta(\theta)\cdot T(x) - A(\theta)]
Treating \eta itself as the parameter gives the natural parameterisation:
f_{X}(x|\eta) = h(x) \exp[\eta\cdot T(x) - A(\eta)]
T(x): sufficient statistic
\eta: natural parameter
A(\eta): log-partition function
Revisiting the examples
Bernoulli

p(x) = \exp\left(\log\frac{p}{1-p}\cdot x + \log(1-p)\right)
h(x) = 1
\eta(p) = \log\frac{p}{1-p}
T(x) = x
A(p) = -\log(1-p)
Let \eta = \log\frac{p}{1-p}. Then 1-p = \frac{1}{1+e^\eta}, so A(\eta) = -\log(1-p) = \log(1+e^\eta)
p(x) = \exp(\eta x - \log(1+e^\eta))
Revisiting the examples

Binomial
p(x) = {n \choose x}\exp\left(\log\frac{p}{1-p}\cdot x + n\log(1-p)\right)
h(x) = {n \choose x}
\eta(p) = \log\frac{p}{1-p}
T(x) = x
A(p) = -n\log(1-p)
Let \eta = \log\frac{p}{1-p}. Then A(\eta) = n\log(1+e^\eta)
p(x) = {n \choose x}\exp(\eta x - n\log(1+e^\eta))
Revisiting the examples

Normal
f(x) = \frac{1}{\sqrt{2\pi}}\exp\left(\begin{bmatrix}\frac{\mu}{\sigma^2} & -\frac{1}{2\sigma^2}\end{bmatrix}\begin{bmatrix}x \\ x^2\end{bmatrix} - \left(\frac{\mu^2}{2\sigma^2} + \log\sigma\right)\right)
h(x) = \frac{1}{\sqrt{2\pi}}
\eta(\mu, \sigma^2) = \begin{bmatrix}\frac{\mu}{\sigma^2} & -\frac{1}{2\sigma^2}\end{bmatrix}^\top
T(x) = \begin{bmatrix}x & x^2\end{bmatrix}^\top
A(\mu, \sigma^2) = \frac{\mu^2}{2\sigma^2} + \log\sigma
\eta = [\eta_1, \eta_2] = \left[\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right]
A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)
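The substitution behind A(\eta) is worth spelling out (a small added step, not on the slides): inverting \eta gives

\sigma^2 = -\frac{1}{2\eta_2}, \qquad \mu = \eta_1\sigma^2 = -\frac{\eta_1}{2\eta_2}

so that

A = \frac{\mu^2}{2\sigma^2} + \log\sigma = \frac{\eta_1^2\sigma^2}{2} + \frac{1}{2}\log\sigma^2 = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)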
Revisiting the examples

Bivariate normal
f(\mathbf{x}) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}vec(\Sigma^{-1})^\top vec(\mathbf{x}\mathbf{x}^\top) + \mu^\top\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mu^\top\Sigma^{-1}\mu - \frac{1}{2}\ln|\Sigma|\right)
h(\mathbf{x}) = \frac{1}{2\pi}
\eta(\mu, \Sigma) = \begin{bmatrix}\Sigma^{-1}\mu \\ -\frac{1}{2}vec(\Sigma^{-1})\end{bmatrix}
T(\mathbf{x}) = \begin{bmatrix}\mathbf{x} \\ vec(\mathbf{x}\mathbf{x}^\top)\end{bmatrix}
A(\mu, \Sigma) = \frac{1}{2}\left(\mu^\top\Sigma^{-1}\mu + \ln|\Sigma|\right)
Natural parameterisation: try on your own
Revisiting the examples

Multivariate normal
Try on your own
The log-partition function


f_{X}(x) = h(x) \exp[\eta\cdot T(x) - A(\eta)]
= h(x) \exp(\eta\cdot T(x)) \exp(-A(\eta))
= g(\eta) h(x) \exp(\eta\cdot T(x))
where g(\eta) = \exp(-A(\eta)), i.e., A(\eta) = -\log g(\eta)
p(x) = h(x) \exp(\eta\cdot T(x)) is the kernel, encoding all dependence on x
We can convert the kernel to a probability density function by normalising it
The log-partition function


f_{X}(x) = \frac{1}{Z} p(x), \quad Z = \int_x p(x)\, dx = \int_x h(x) \exp(\eta\cdot T(x))\, dx
Z is called the partition function
1 = \int_x f_X(x)\, dx = \int_x g(\eta) h(x) \exp(\eta\cdot T(x))\, dx = g(\eta) \int_x h(x) \exp(\eta\cdot T(x))\, dx = g(\eta) Z
The log-partition function


1 = g(\eta) Z \Rightarrow g(\eta) = \frac{1}{Z}
\log g(\eta) = \log\frac{1}{Z} \Rightarrow -A(\eta) = -\log Z
A(\eta) = \log Z \quad (hence the name: log-partition function)
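As a quick concrete check (not on the slides): for the Bernoulli, h(x) = 1 and T(x) = x, so

Z = \sum_{x \in \{0,1\}} \exp(\eta x) = 1 + e^\eta \quad\Rightarrow\quad A(\eta) = \log Z = \log(1 + e^\eta)

which matches the natural parameterisation derived earlier.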
Properties (why do we care?)
Easy to compute E[T(x)] and Var(T(x)): no complicated integrals or infinite summations are needed
Conjugate priors: important in Bayesian statistics (a statistics course topic); in this course we have seen Bayes' rule:
f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x) f_{X}(x)}{f_{Y}(y)}
Generalised linear models (an ML course topic): unifying various models such as linear regression and logistic regression
Properties (why do we care?)
Easy to compute E[T(x)] and Var(T(x)): no complicated integrals or infinite summations are needed
It can be shown that
E[T_i(x)] = \frac{\partial A(\eta)}{\partial \eta_i}
Var[T_i(x)] = \frac{\partial^2 A(\eta)}{\partial \eta_i^2}
Proof left as an exercise
Recap: Normal distribution
\eta = [\eta_1, \eta_2] = \left[\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right]
T(x) = [x, x^2]
A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)
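These two identities can be verified symbolically for the Normal (a sketch, not from the lecture; uses sympy):

import sympy as sp

# A(eta) for the Normal, written in natural parameters
e1, e2 = sp.symbols('eta1 eta2')
A = -e1**2 / (4 * e2) - sp.log(-2 * e2) / 2

# Substitute eta1 = mu/sigma^2, eta2 = -1/(2 sigma^2) back in
mu, sigma = sp.symbols('mu sigma', positive=True)
subs = {e1: mu / sigma**2, e2: -1 / (2 * sigma**2)}

E_x = sp.diff(A, e1).subs(subs)       # E[T_1(x)] = E[x]
Var_x = sp.diff(A, e1, 2).subs(subs)  # Var[T_1(x)] = Var(x)

assert sp.simplify(E_x - mu) == 0          # recovers E[x] = mu
assert sp.simplify(Var_x - sigma**2) == 0  # recovers Var(x) = sigma^2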
Alternative forms
f_{X}(x) = h(x) \exp[\eta(\theta)\cdot T(x) - A(\theta)]
f_{X}(x) = h(x) g(\theta) \exp[\eta(\theta)\cdot T(x)], \quad g(\theta) = \exp[-A(\theta)]
f_{X}(x) = \exp[\eta(\theta)\cdot T(x) - A(\theta) + B(x)], \quad B(x) = \log(h(x))
Summary



E[T_i(x)] = \frac{\partial A(\eta)}{\partial \eta_i}
Var[T_i(x)] = \frac{\partial^2 A(\eta)}{\partial \eta_i^2}
CS6015: Lecture 37 - The exponential family of distributions
By Mitesh Khapra