CS6015: Linear Algebra and Random Processes

Lecture 37: Exponential families of distributions

Learning Objectives

What are exponential families?

Why do we care about them?

What are exponential families?

A set of probability distributions whose pmf (discrete case) or pdf (continuous case) can be expressed in the following form:

f_{X}(x|\theta) = h(x) \exp[\eta(\theta) \cdot T(x) - A(\theta)]
T, h: known functions of x
\eta, A: known functions of \theta
\theta: parameter
Requirement: the support of f_X(x|\theta) should not depend on \theta.

Recap: Parameters

Normal: \mu, \sigma
Bernoulli: p
Binomial: n, p (the support \{0, 1, \dots, n\} depends on n, so n must be fixed)

Questions of Interest

Are there any popular families that we care about? Many! (we will see some soon)

Why do we care about this form? This form has some useful properties! (we will see some of these properties soon)

f_{X}(x|\theta) = h(x) \exp[\eta(\theta) \cdot T(x) - A(\theta)]

Example 1: Bernoulli Distribution

f_{X}(x|\theta) = h(x) \exp[\eta(\theta) \cdot T(x) - A(\theta)]
p_{X}(x) = p^x(1-p)^{1-x}
\theta = p
= \exp(\log(p^x(1-p)^{1-x}))

(where \exp(k) = e^k)
= \exp(\log p^x + \log(1-p)^{1-x})
= \exp(x\log p + (1-x)\log(1-p))
= \exp(x\log p - x\log(1-p) + \log(1-p))
= \exp\left(\left(\log \frac{p}{1-p}\right)x + \log(1-p)\right)
h(x) = 1

\eta(p) = \log \frac{p}{1-p}

T(x) = x

A(p) = -\log(1-p)
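As a quick sanity check (a sketch, not part of the lecture, with an assumed value p = 0.3), the following Python snippet verifies that the factored form h(x) exp(\eta(p) T(x) - A(p)) reproduces the Bernoulli pmf:

import numpy as np

p = 0.3
eta = np.log(p / (1 - p))   # natural parameter eta(p)
A = -np.log(1 - p)          # log-partition A(p)

for x in [0, 1]:
    direct = p**x * (1 - p)**(1 - x)
    expfam = 1.0 * np.exp(eta * x - A)   # h(x) = 1, T(x) = x
    assert np.isclose(direct, expfam)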

Example 2: Binomial Distribution

only if n is known/fixed and hence not a parameter

p_{X}(x) = {n \choose x} p^x(1-p)^{n-x}
x \in \{0, 1, 2, \dots, n\}
= \exp(\log({n \choose x} p^x(1-p)^{n-x}))
= \exp(\log{n \choose x} + x\log p + (n-x)\log(1-p))
= \exp\left(\log{n \choose x} + \left(\log \frac{p}{1-p}\right)x + n\log(1-p)\right)
= {n \choose x}\exp\left(\left(\log \frac{p}{1-p}\right)x + n\log(1-p)\right)
h(x) = {n \choose x}
\eta(p) = \log \frac{p}{1-p}

T(x) = x

A(p) = -n\log(1-p)
f_{X}(x|\theta) = h(x) \exp[\eta(\theta) \cdot T(x) - A(\theta)]
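A similar check for the binomial (again a sketch with assumed n = 10, p = 0.3, compared against scipy's pmf):

import numpy as np
from scipy.stats import binom
from scipy.special import comb

n, p = 10, 0.3
eta = np.log(p / (1 - p))   # same natural parameter as the Bernoulli
A = -n * np.log(1 - p)      # log-partition A(p)

x = np.arange(n + 1)
expfam = comb(n, x) * np.exp(eta * x - A)   # h(x) = C(n, x), T(x) = x
assert np.allclose(expfam, binom.pmf(x, n, p))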

Example 3: Normal Distribution

Two parameters: \mu, \sigma

f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
= \frac{1}{\sqrt{2\pi}}\exp(-\log\sigma) \exp\left(-\frac{x^2 - 2x\mu + \mu^2}{2\sigma^2}\right)
= \frac{1}{\sqrt{2\pi}}\exp\left(\frac{x\mu}{\sigma^2} - \frac{x^2}{2\sigma^2} - \frac{\mu^2}{2\sigma^2} - \log\sigma\right)
= \frac{1}{\sqrt{2\pi}}\exp\left(\begin{bmatrix}\frac{\mu}{\sigma^2} & -\frac{1}{2\sigma^2}\end{bmatrix}\begin{bmatrix}x \\ x^2\end{bmatrix} - \left(\frac{\mu^2}{2\sigma^2} + \log\sigma\right)\right)
h(x) = \frac{1}{\sqrt{2\pi}}
\eta(\mu, \sigma^2) = \begin{bmatrix}\frac{\mu}{\sigma^2} & -\frac{1}{2\sigma^2}\end{bmatrix}^\top
T(x) = \begin{bmatrix}x & x^2\end{bmatrix}^\top
A(\mu, \sigma^2) = \frac{\mu^2}{2\sigma^2} + \log\sigma

dot product

dim. of \eta(\theta) is equal to the number of parameters

f_{X}(x|\theta) = h(x) \exp\left[\sum_i \eta_i(\theta)\, T_i(x) - A(\theta)\right]
f_{X}(x|\theta) = h(x) \exp[\eta(\theta) \cdot T(x) - A(\theta)]
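As before, a short numeric check (a sketch with assumed \mu = 1.5, \sigma = 2.0) confirms that the vector form above matches the usual normal density:

import numpy as np
from scipy.stats import norm

mu, sigma = 1.5, 2.0
eta = np.array([mu / sigma**2, -1 / (2 * sigma**2)])   # eta(mu, sigma^2)
A = mu**2 / (2 * sigma**2) + np.log(sigma)             # A(mu, sigma^2)

x = np.linspace(-5, 8, 7)
T = np.stack([x, x**2])                                   # T(x) = [x, x^2]^T
expfam = (1 / np.sqrt(2 * np.pi)) * np.exp(eta @ T - A)   # h(x) = 1/sqrt(2 pi)
assert np.allclose(expfam, norm.pdf(x, loc=mu, scale=sigma))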

The bigger picture

Exponential families include:

Scalar parameter, scalar variable: Bernoulli, Binomial (n fixed), Poisson, Negative binomial (r fixed), Geometric, ...

Vector parameter, scalar variable: Normal, Gamma, Beta, ...

Vector parameter, vector variable: Multinomial, Bivariate normal, Multivariate normal, ...

f_{X}(x|\theta) = h(x) \exp[\eta(\theta) \cdot T(x) - A(\theta)]

The gamma distribution

Why do families matter?

f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\beta x}
\Gamma(\alpha) = (\alpha - 1)! (for integer \alpha)
\alpha: shape parameter
\beta: rate parameter
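A quick visualisation of the gamma pdf for \alpha = 3, \beta = 1 (note that scipy.stats.gamma takes a scale parameter, which is the reciprocal of the rate \beta):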
import matplotlib.pyplot as plt
from scipy.stats import gamma
import numpy as np

x = np.linspace(0, 10, 500)
alpha = 3  # shape parameter
beta = 1   # rate parameter

# scipy parameterises the gamma by scale = 1/rate
rv = gamma(alpha, loc=0., scale=1/beta)

plt.plot(x, rv.pdf(x))
plt.show()

Example 4: Bivariate normal

f_{X,Y}(x,y) = \frac{1}{2\pi|\Sigma|^{\frac{1}{2}}}\exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^\top\Sigma^{-1}(\mathbf{x} - \mu)\right)
= \frac{1}{2\pi}|\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}(\mathbf{x} - \mu)^\top\Sigma^{-1}(\mathbf{x} - \mu)\right)
= \frac{1}{2\pi}\exp\left(-\frac{1}{2}\left(\mathbf{x}^\top\Sigma^{-1}\mathbf{x} + \mu^\top\Sigma^{-1}\mu - 2\mu^\top\Sigma^{-1}\mathbf{x} + \ln|\Sigma|\right)\right)
\mathbf{x} = [x \;\; y]^\top
f_{X}(x|\theta) = h(x) \exp[\eta(\theta) \cdot T(x) - A(\theta)]
= \frac{1}{2\pi}\exp\left(-\frac{1}{2}\mathbf{x}^\top\Sigma^{-1}\mathbf{x} + \mu^\top\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mu^\top\Sigma^{-1}\mu - \frac{1}{2}\ln|\Sigma|\right)
= \frac{1}{2\pi}\exp\left(-\frac{1}{2}\mathrm{vec}(\Sigma^{-1})^\top \mathrm{vec}(\mathbf{x}\mathbf{x}^\top) + \mu^\top\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mu^\top\Sigma^{-1}\mu - \frac{1}{2}\ln|\Sigma|\right)
h(x) = \frac{1}{2\pi}
\eta(\mu, \Sigma) = \begin{bmatrix}\Sigma^{-1}\mu \\ -\frac{1}{2}\mathrm{vec}(\Sigma^{-1})\end{bmatrix}
T(x) = \begin{bmatrix}\mathbf{x} \\ \mathrm{vec}(\mathbf{x}\mathbf{x}^\top)\end{bmatrix}
A(\mu, \Sigma) = \frac{1}{2}\left(\mu^\top\Sigma^{-1}\mu + \ln|\Sigma|\right)
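A numeric sketch (with an assumed \mu and \Sigma) checking this vec-trick decomposition against scipy's bivariate normal density:

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
Sinv = np.linalg.inv(Sigma)

x = np.array([0.7, 0.2])
eta = np.concatenate([Sinv @ mu, -0.5 * Sinv.ravel()])      # eta(mu, Sigma)
T = np.concatenate([x, np.outer(x, x).ravel()])             # T(x), vec via ravel
A = 0.5 * (mu @ Sinv @ mu + np.log(np.linalg.det(Sigma)))   # A(mu, Sigma)

expfam = (1 / (2 * np.pi)) * np.exp(eta @ T - A)            # h(x) = 1/(2 pi)
assert np.isclose(expfam, multivariate_normal(mu, Sigma).pdf(x))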

Example 5: Multivariate normal

Try on your own

Natural parameterisation

f_{X}(x|\theta) = h(x) \exp[\eta(\theta) \cdot T(x) - A(\theta)]

Treating \eta itself as the parameter gives the natural parameterisation:

f_{X}(x|\eta) = h(x) \exp[\eta \cdot T(x) - A(\eta)]

T(x): sufficient statistic

\eta: natural parameter

A(\eta): log-partition function

Revisiting the examples

Bernoulli

p(x) = \exp\left(\left(\log \frac{p}{1-p}\right)x + \log(1-p)\right)
h(x) = 1

\eta(p) = \log \frac{p}{1-p}

T(x) = x

A(p) = -\log(1-p)
Let \eta = \log \frac{p}{1-p}; inverting, p = \frac{e^\eta}{1+e^\eta} and 1-p = \frac{1}{1+e^\eta}

then A(\eta) = -\log(1-p) = \log(1+e^\eta)

p(x) = \exp(\eta x - \log(1+e^\eta))

Revisiting the examples

Binomial

p(x) = {n \choose x}\exp\left(\left(\log \frac{p}{1-p}\right)x + n\log(1-p)\right)
h(x) = {n \choose x}
\eta(p) = \log \frac{p}{1-p}

T(x) = x

A(p) = -n\log(1-p)
Let \eta = \log \frac{p}{1-p}

then A(\eta) = n\log(1+e^\eta)

p(x) = {n \choose x}\exp(\eta x - n\log(1+e^\eta))

Revisiting the examples

Normal

f(x) = \frac{1}{\sqrt{2\pi}}\exp\left(\begin{bmatrix}\frac{\mu}{\sigma^2} & -\frac{1}{2\sigma^2}\end{bmatrix}\begin{bmatrix}x \\ x^2\end{bmatrix} - \left(\frac{\mu^2}{2\sigma^2} + \log\sigma\right)\right)
h(x) = \frac{1}{\sqrt{2\pi}}
\eta(\mu, \sigma^2) = \begin{bmatrix}\frac{\mu}{\sigma^2} & -\frac{1}{2\sigma^2}\end{bmatrix}^\top

T(x) = \begin{bmatrix}x & x^2\end{bmatrix}^\top

A(\mu, \sigma^2) = \frac{\mu^2}{2\sigma^2} + \log\sigma
\eta = [\eta_1, \eta_2] = \left[\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right]

Inverting, \sigma^2 = -\frac{1}{2\eta_2} and \mu = -\frac{\eta_1}{2\eta_2}; substituting into A(\mu, \sigma^2) gives

A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)

Revisiting the examples

Bivariate normal

f(x) = \frac{1}{2\pi}\exp\left(-\frac{1}{2}\mathrm{vec}(\Sigma^{-1})^\top \mathrm{vec}(\mathbf{x}\mathbf{x}^\top) + \mu^\top\Sigma^{-1}\mathbf{x} - \frac{1}{2}\mu^\top\Sigma^{-1}\mu - \frac{1}{2}\ln|\Sigma|\right)
h(x) = \frac{1}{2\pi}

\eta(\mu, \Sigma) = \begin{bmatrix}\Sigma^{-1}\mu \\ -\frac{1}{2}\mathrm{vec}(\Sigma^{-1})\end{bmatrix}

T(x) = \begin{bmatrix}\mathbf{x} \\ \mathrm{vec}(\mathbf{x}\mathbf{x}^\top)\end{bmatrix}

A(\mu, \Sigma) = \frac{1}{2}\left(\mu^\top\Sigma^{-1}\mu + \ln|\Sigma|\right)

Try on your own

Revisiting the examples

Multivariate normal

Try on your own

The log-partition function

f_{X}(x) = h(x) \exp[\eta \cdot T(x) - A(\eta)]
= h(x) \exp(\eta \cdot T(x)) \exp(-A(\eta))
= g(\eta)\, h(x) \exp(\eta \cdot T(x))
g(\eta) = \exp(-A(\eta))

A(\eta) = -\log g(\eta)
p(x) = h(x) \exp(\eta \cdot T(x))

the kernel: encoding all dependencies on x

We can convert the kernel to a probability density function by normalising it

The log-partition function

f_{X}(x) = \frac{1}{Z}p(x)
g(\eta) = \exp(-A(\eta))

A(\eta) = -\log g(\eta)
p(x) = h(x) \exp(\eta \cdot T(x))

the kernel: encoding all dependencies on x

Z = \int_x p(x)\, dx

Z is called the partition function

Z = \int_x h(x) \exp(\eta \cdot T(x))\, dx
1 = \int_x f_X(x)\, dx
= \int_x g(\eta)\, h(x) \exp(\eta \cdot T(x))\, dx

= g(\eta) \int_x h(x) \exp(\eta \cdot T(x))\, dx

= g(\eta)\, Z

The log-partition function

g(\eta) = \exp(-A(\eta))

A(\eta) = -\log g(\eta)

p(x) = h(x) \exp(\eta \cdot T(x))

the kernel: encoding all dependencies on x

1 = g(\eta)\, Z

g(\eta) = \frac{1}{Z}

\log g(\eta) = \log \frac{1}{Z}

-A(\eta) = -\log Z

A(\eta) = \log Z

(log partition)
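For instance (a sketch, not from the slides), for the Bernoulli kernel p(x) = exp(\eta x) with support \{0, 1\}, summing gives Z = 1 + e^\eta, so A(\eta) = \log Z = \log(1 + e^\eta), exactly the log-partition found earlier:

import numpy as np

p = 0.3
eta = np.log(p / (1 - p))
Z = sum(np.exp(eta * x) for x in [0, 1])   # h(x) = 1 on the support {0, 1}
assert np.isclose(np.log(Z), np.log(1 + np.exp(eta)))   # A(eta) = log Z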

Properties (why do we care?)

Easy to compute E[T(x)] and Var(T(x))

no complicated integrals or infinite summations

Conjugate priors: important in Bayesian statistics (taken further in a statistics course)

f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x)\, f_{X}(x)}{f_{Y}(y)}

Generalised Linear Models: unifying various models such as linear regression and logistic regression (taken further in an ML course)

Properties (why do we care?)

Easy to compute E[T(x)] and Var(T(x))

no complicated integrals or infinite summations

It can be shown that

E[T_i(x)] = \frac{\partial A(\eta)}{\partial \eta_i}

Var[T_i(x)] = \frac{\partial^2 A(\eta)}{\partial \eta_i^2}

Proof left as an exercise

Recap: Normal dist.

\eta = [\eta_1, \eta_2] = \left[\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right]
T(x) = [x, x^2]
A(\eta) = -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log(-2\eta_2)
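A numeric sketch (assumed \mu = 1.5, \sigma = 2.0) of the two properties for the normal, using central finite differences on A(\eta); here E[T_1(x)] = E[x] = \mu and Var[T_1(x)] = Var[x] = \sigma^2:

import numpy as np

mu, sigma = 1.5, 2.0
eta1, eta2 = mu / sigma**2, -1 / (2 * sigma**2)

def A(e1, e2):   # log-partition in natural parameters
    return -e1**2 / (4 * e2) - 0.5 * np.log(-2 * e2)

h = 1e-5
dA = (A(eta1 + h, eta2) - A(eta1 - h, eta2)) / (2 * h)                 # dA/d eta_1
d2A = (A(eta1 + h, eta2) - 2 * A(eta1, eta2) + A(eta1 - h, eta2)) / h**2

assert np.isclose(dA, mu, atol=1e-6)         # E[x] = mu
assert np.isclose(d2A, sigma**2, atol=1e-3)  # Var[x] = sigma^2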

Alternative forms

f_{X}(x) = h(x) \exp[\eta(\theta) \cdot T(x) - A(\theta)]
f_{X}(x) = h(x)\, g(\theta) \exp[\eta(\theta) \cdot T(x)]
g(\theta) = \exp[-A(\theta)]
f_{X}(x) = \exp[\eta(\theta) \cdot T(x) - A(\theta) + B(x)]
B(x) = \log(h(x))

Summary

E[T_i(x)] = \frac{\partial A(\eta)}{\partial \eta_i}

Var[T_i(x)] = \frac{\partial^2 A(\eta)}{\partial \eta_i^2}

By Mitesh Khapra