Bayesian Analyses

Hui Hu Ph.D.

Department of Epidemiology

College of Public Health and Health Professions & College of Medicine

April 4, 2018

Introduction to Bayesian Statistical Perspectives

Statistical Perspectives

• Statistics:
-  the use of analytical tools to uncover relationships in data
-  different foundations in philosophy of inference
-  different perspectives on probability: constant of nature vs. subjective degree of belief

• The most common statistical perspectives:
-  Frequentist: Pr(y|θ), probability of seeing this data, given the parameter
-  Bayesian: Pr(θ|y), probability of the parameter, given this data

[Diagram: Data, Prior Belief, and Decision under the Frequentist and Bayesian perspectives]

Frequentist vs. Bayesian

Frequentist
-  Data are a repeatable random sample: there is a frequency
-  Underlying parameters remain constant during this repeatable process
-  Parameters are fixed

Bayesian
-  Data are observed from the realized sample
-  Parameters are unknown and described probabilistically
-  Data are fixed

Bayes' Theorem

$Pr(A|B)={{Pr(B|A)Pr(A)}\over Pr(B)}$
• 2% of women at age 40 who participate in routine screening have breast cancer
• 80% of women with breast cancer will get positive mammographies
• 10% of women without breast cancer will also get positive mammographies
• Question:
-  A woman in this age group had a positive mammography in a routine screening.
What is the probability that she actually has breast cancer?
$Pr(cancer)=0.02$
$Pr(positive\ test|cancer)=0.8$
$Pr(positive\ test)=0.02\times0.8+0.98\times0.1=0.114$
$Pr(cancer|positive\ test)={{0.8\times0.02}\over 0.114}\approx0.14$
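
The mammography calculation can be checked with a few lines of Python. This is a minimal sketch; the function and argument names are illustrative and do not come from the slides.

```python
def posterior_probability(prior, sensitivity, false_positive_rate):
    """Return Pr(condition | positive test) using Bayes' theorem."""
    # Marginal probability of a positive test (law of total probability).
    p_positive = prior * sensitivity + (1 - prior) * false_positive_rate
    return sensitivity * prior / p_positive


# Numbers from the screening example: 2% prevalence, 80% sensitivity,
# 10% false-positive rate.
p_cancer_given_positive = posterior_probability(0.02, 0.8, 0.1)
print(round(p_cancer_given_positive, 2))  # 0.14
```

Even with a positive test, the posterior probability stays low because the prior (the 2% prevalence) is small.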

Bayes' Theorem

$Pr(A|B)={{Pr(B|A)Pr(A)}\over Pr(B)}$

Posterior:
The probability of the parameter given the collected data

$Pr(\theta|y)={{Pr(y|\theta)Pr(\theta)}\over Pr(y)}$

Likelihood:
The probability of collecting this data given the parameter

Prior:
The probability of the parameter before collecting data

Marginal Likelihood:
The probability of collecting this data under all possible parameters

• The prior represents our subjective beliefs, via a probability statement, about likely values of an unobserved parameter before we have observed data

• Types:
-  noninformative: minimal prior information (e.g. all values equally likely)
-  clinical: comes from interaction between the statistician and a knowledgeable scientist
-  skeptical: quantifies a large effect as unlikely
-  enthusiastic: gives a good chance of observing an effect

• Likelihood is used when describing a function of a parameter given an outcome
-  e.g. if a coin is flipped 10 times and it has landed heads-up 10 times, what is the likelihood that the coin is fair?

• The likelihood for data $\{y_i\}$, $i=1, 2, ..., m$, is defined as
$L(y|\theta)=\prod_{i=1}^mf(y_i|\theta)$
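
To make the coin example concrete, the product formula can be evaluated directly. This is a sketch that assumes each flip is an independent Bernoulli trial coded as 1 for heads and 0 for tails; the function name and the alternative value θ = 0.9 are illustrative.

```python
import math


def bernoulli_likelihood(flips, theta):
    """L(y | theta): product over flips of f(y_i | theta) for independent flips."""
    return math.prod(theta if y == 1 else 1 - theta for y in flips)


flips = [1] * 10  # ten heads in ten flips
fair = bernoulli_likelihood(flips, 0.5)    # likelihood that the coin is fair
biased = bernoulli_likelihood(flips, 0.9)  # likelihood under a heads-biased coin
print(fair)  # 0.0009765625
print(biased > fair)  # True
```

The fair coin is not impossible given ten heads, but a heads-biased coin makes the same data far more likely.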

• Given a set of independent identically distributed data points $y=(y_1,...,y_n)$, where $y_i \sim p(y_i|\theta)$ according to some probability distribution parameterized by θ, and θ itself is a random variable described by a distribution $\theta \sim p(\theta|\alpha)$

• The marginal likelihood in general asks what the probability $p(y|\alpha)$ is, where θ has been marginalized out:
$p(y|\alpha)=\int_\theta p(y|\theta)p(\theta |\alpha)d\theta$

• The probability of the data under all possible parameters
-  it is a constant value given data

$Pr(\theta|y)={{Pr(y|\theta)Pr(\theta)}\over Pr(y)}$
$Pr(\theta|y)\propto{{Pr(y|\theta)Pr(\theta)}}$

The posterior is proportional to the likelihood times the prior
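
As an illustration of why the marginal likelihood is just a normalizing constant, the sketch below approximates the integral numerically for a hypothetical case: 7 successes in 10 binomial trials with a uniform prior on θ. The data, the prior, and the midpoint-rule integration are assumptions made for this example only.

```python
import math


def binomial_likelihood(k, n, theta):
    """p(y | theta) for k successes in n trials."""
    return math.comb(n, k) * theta**k * (1 - theta) ** (n - k)


def uniform_prior(theta):
    """p(theta): flat density on (0, 1)."""
    return 1.0


def marginal_likelihood(k, n, prior_density, grid_size=10_000):
    """Approximate p(y) = integral of p(y | theta) p(theta) dtheta (midpoint rule)."""
    width = 1.0 / grid_size
    total = 0.0
    for j in range(grid_size):
        theta = (j + 0.5) * width
        total += binomial_likelihood(k, n, theta) * prior_density(theta) * width
    return total


p_y = marginal_likelihood(7, 10, uniform_prior)


def posterior_density(theta):
    """Likelihood times prior, divided by the constant p(y)."""
    return binomial_likelihood(7, 10, theta) * uniform_prior(theta) / p_y
```

For this setup the exact marginal likelihood is 1/11, so `p_y` should be close to 0.0909. Dividing by it rescales the curve but does not change its shape, which is what the proportionality statement expresses.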

The Sunrise Problem

• The sunrise problem was introduced by Price in the comments to the article that introduced Bayes' theorem

• It concerns the probability that the sun will rise tomorrow, and the evaluation and updating of a belief (prior)

• Imagine a newborn who observed the sun on his first day
-  after the sun sets, the newborn has uncertainty as to whether or not he will see the sun again
-  we can represent this uncertainty with a Beta distribution
-  in this example, the Beta distribution is used as a prior probability distribution, which is an expression of belief before seeing more data

The Sunrise Problem (cont'd)

• A Beta distribution has two parameters, α and β
-  a Beta distribution that is specified as B(α=1,β=1) is a uniform distribution between zero and one

• The α and β are easily understandable in terms of successes and failures of an event, where α=successes+1 and β=failures+1
-  therefore, zero successes and failures may be represented with a B(α=1,β=1) distribution
-  the expectation of a Beta distribution is α/(α+β)

• The Beta distribution is also convenient because it is conjugate to the Bernoulli/Binomial likelihood, meaning that when it is used as a prior distribution for a success probability, the posterior distribution is also a Beta distribution

The Sunrise Problem (cont'd)

• After observing one sunrise (one success):
-  the prior mean has been updated from 0.5 to the posterior mean of 0.67

The Sunrise Problem (cont'd)

• After observing two sunrises:
-  the prior mean has been updated from 0.5 to the posterior mean of 0.75

The Sunrise Problem (cont'd)

• After observing ten sunrises:
-  the prior mean has been updated from 0.5 to the posterior mean of 0.92

The Sunrise Problem (cont'd)

• If this example were calculated for the newborn on his 80th birthday:
-  the prior mean has been updated from 0.5 to the posterior mean of 0.9999658
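
The posterior means on the sunrise slides follow from α = successes + 1 and β = failures + 1 with the uniform B(1, 1) prior. This is a minimal sketch; the 80-year day count (80 × 365.25 = 29220 days) is an assumption, since the slides do not state how the days were counted.

```python
def beta_posterior_mean(successes, failures, prior_alpha=1.0, prior_beta=1.0):
    """Mean of the Beta posterior after updating a Beta prior with the counts."""
    alpha = prior_alpha + successes
    beta = prior_beta + failures
    return alpha / (alpha + beta)


for sunrises in (1, 2, 10):
    print(sunrises, round(beta_posterior_mean(sunrises, 0), 2))
# 1 0.67
# 2 0.75
# 10 0.92

days_in_80_years = int(80 * 365.25)  # assumed day count: 29220
print(round(beta_posterior_mean(days_in_80_years, 0), 7))  # 0.9999658
```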

The Sunrise Problem (cont'd)

• There are many things about this example that have been presented arbitrarily:
-  the newborn may have been optimistic or pessimistic in his uncertainty, and a different prior probability distribution may have been more appropriate
-  the subjective interpretation of probability allows for the prior probability distribution to differ for each person
-  although a different prior would affect the posterior, the difference decreases as more data is observed
-  characterizing the prior or posterior with the mean is also an arbitrary choice

• In Bayesian inference, only the simplest examples may be demonstrated analytically. Usually, numerical approximation such as Markov chain Monte Carlo (MCMC) must be used
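
The point that the influence of the prior shrinks as data accumulate can be illustrated with three hypothetical Beta priors. The prior parameters and the data (70% successes) below are invented for this sketch and are not from the slides.

```python
def posterior_mean(prior_alpha, prior_beta, successes, trials):
    """Posterior mean of a success probability under a Beta prior."""
    return (prior_alpha + successes) / (prior_alpha + prior_beta + trials)


# Hypothetical priors: (alpha, beta)
priors = {
    "noninformative": (1, 1),  # uniform
    "skeptical": (2, 8),       # prior mean 0.2
    "enthusiastic": (8, 2),    # prior mean 0.8
}


def spread_of_posterior_means(successes, trials):
    """Largest disagreement between the posterior means across the priors."""
    means = [posterior_mean(a, b, successes, trials) for a, b in priors.values()]
    return max(means) - min(means)


small = spread_of_posterior_means(7, 10)      # little data
large = spread_of_posterior_means(700, 1000)  # a lot of data
print(round(small, 3), round(large, 3))  # 0.3 0.006
```

With 10 observations the three analysts disagree by 0.30 in their posterior means; with 1000 observations the disagreement is below 0.01.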

Conjugate Pairs

• When the prior and the posterior are from the same family of distributions, we call the prior and the likelihood a conjugate pair

• Since the posterior is proportional to the prior times the likelihood, we need to pick the right combination of likelihood distribution and prior distribution to get a conjugate pair

• Conjugate pairs are important
-  for non-conjugate pairs, we generally cannot write out the posterior distribution in closed form

Conjugate Pairs (cont'd)

• An example of a conjugate pair: a Binomial likelihood with a Beta prior
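
The Beta-Binomial example can be worked out as below. This is a sketch of the standard derivation, writing y for the number of successes in n trials; the notation is assumed here and not taken from the original slide.

```latex
\begin{align*}
y \mid \theta &\sim \mathrm{Binomial}(n, \theta), \qquad
\theta \sim \mathrm{Beta}(\alpha, \beta) \\
p(\theta \mid y) &\propto
  \underbrace{\theta^{y}(1-\theta)^{n-y}}_{\text{likelihood}}
  \times
  \underbrace{\theta^{\alpha-1}(1-\theta)^{\beta-1}}_{\text{prior}} \\
&= \theta^{\alpha+y-1}(1-\theta)^{\beta+n-y-1} \\
\Rightarrow \quad \theta \mid y &\sim \mathrm{Beta}(\alpha + y,\ \beta + n - y)
\end{align*}
```

The posterior has the same functional form as the prior, so updating only requires adding the successes to α and the failures to β.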

Bayesian Hierarchical Model

Bayesian Hierarchical Model

• A simple example of a hierarchical model that is commonly found in spatial epidemiological studies is where the data likelihood is Poisson and there is a common relative risk parameter with a single gamma prior distribution:
$p(\theta |y)\propto L(y|\theta)g(\theta)$
where g(θ) is a Gamma distribution with parameters α and β.

• A compact notation for this model, with $e_i$ the expected count in area i, is:
$y_i|\theta \sim Pois(e_i\theta)$
$\theta \sim G(\alpha,\beta)$

• α and β control the form of the prior distribution g(θ)
-  these parameters can have assumed values, but more usually, we don't have a strong belief in the prior parameter values
-  alternatively, since parameters within models are regarded as stochastic, these parameters must also have distributions; these distributions are known as hyperprior distributions, and the parameters are known as hyperparameters
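
When α and β are fixed, this Poisson-Gamma model is itself a conjugate pair, so the posterior for θ is available in closed form. The sketch below assumes the Poisson mean is e_i × θ and that G(α, β) uses the shape/rate parameterization; the counts are hypothetical.

```python
def poisson_gamma_posterior(observed, expected, alpha, beta):
    """Return (shape, rate) of the Gamma posterior for the common relative risk.

    Assumes y_i | theta ~ Poisson(e_i * theta) and theta ~ Gamma(shape=alpha, rate=beta).
    """
    return alpha + sum(observed), beta + sum(expected)


observed = [5, 12, 8, 3]          # hypothetical observed case counts y_i
expected = [4.0, 10.0, 9.5, 2.5]  # hypothetical expected counts e_i
shape, rate = poisson_gamma_posterior(observed, expected, alpha=1.0, beta=1.0)
posterior_mean = shape / rate
print(shape, rate, round(posterior_mean, 3))  # 29.0 27.0 1.074
```

Once α and β receive hyperpriors, this closed form no longer gives the full posterior, which is one reason simulation methods are needed.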

Bayesian Hierarchical Model (cont'd)

• In this Poisson-Gamma example, there is a two-level hierarchy:
-  θ has a G(α, β) distribution at the first level of the hierarchy
-  at the second level, α can have a hyperprior distribution (hα), and β can have a hyperprior distribution (hβ)

• This can be written as:
$y_i|\theta \sim Pois(e_i\theta)$
$\theta \sim G(\alpha,\beta)$
$\alpha|\nu \sim h_\alpha(\nu)$
$\beta|\rho \sim h_\beta(\rho)$

Computational Tools

• In Bayesian analyses, integration is the principal inferential operation, as opposed to optimization in classical statistics

• Historically, the need to evaluate integrals was a major stumbling block for the application of Bayesian methods, severely restricting the types of models that could be implemented

• In 1990, Gelfand and Smith brought a numerical technique known as Markov chain Monte Carlo (MCMC) into mainstream Bayesian statistics

• MCMC is a general method that provides inference for all the major quantities of a Bayesian analysis at once
-  it allows us to concentrate on modelling: use models that you believe represent the true dependence structures in the data, rather than those that are simple to compute

MCMC

• All the information we need is contained in the posterior distribution. However, it may not be expressible as a standard distribution (unless we pick specific prior and likelihood combinations that form a conjugate pair)

• MCMC simulation approximates the true posterior density by using a bag of samples drawn from the density
-  the initial set of T samples is discarded because, when only a few simulations have been done, the samples drawn can often be far from the posterior distribution
-  the time (iteration number) T is known as the burn-in

• The two most general procedures for MCMC simulation from a target distribution:
-  the Gibbs sampler
-  the Metropolis-Hastings algorithm: allows us to draw samples from the posterior of non-conjugate pairs

Metropolis-Hastings Algorithm

• The Metropolis-Hastings algorithm can be embedded within the Gibbs sampler to generate samples from conditional distributions that do not have a known form

• Although we can use this algorithm to take samples from any distribution, we usually don't do that when the posterior distribution is known, since the samples generated by this algorithm may introduce autocorrelation in the Gibbs sampler

• We do not cover the technical details of this algorithm since it has been implemented in JAGS
-  but for more complex models that cannot be implemented in JAGS or that need further optimization, we do need to program the Gibbs sampler and implement the Metropolis-Hastings algorithm
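
For intuition only, here is a minimal sketch of a random-walk Metropolis sampler, which is the special case of Metropolis-Hastings with a symmetric proposal. It is not how JAGS works internally. The target is a hypothetical posterior for a coin's heads probability after 7 heads in 10 flips with a uniform prior; the step size, iteration counts, and seed are arbitrary choices.

```python
import math
import random


def log_unnormalized_posterior(theta, heads=7, flips=10):
    """Log of likelihood times a uniform prior; the constant Pr(y) is not needed."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero posterior density outside (0, 1)
    return heads * math.log(theta) + (flips - heads) * math.log(1.0 - theta)


def metropolis(n_iter=20_000, burn_in=2_000, step=0.2, seed=1):
    """Random-walk Metropolis; returns the samples kept after the burn-in."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_unnormalized_posterior(theta)
    samples = []
    for t in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)  # symmetric proposal
        candidate = log_unnormalized_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if t >= burn_in:  # discard the first burn_in iterations
            samples.append(theta)
    return samples


samples = metropolis()
estimate = sum(samples) / len(samples)
print(round(estimate, 2))
```

Because this example is conjugate, the exact posterior is Beta(8, 4) with mean 8/12 ≈ 0.667, so the sample mean can be compared with a known answer. Successive samples are correlated, which is the autocorrelation issue mentioned above.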

Lab: JAGS

BUGS and JAGS

• BUGS: Bayesian inference Using Gibbs Sampling

• JAGS: Just Another Gibbs Sampler

Syntax Differences in BUGS/JAGS Compared with R

• The order of statements is not important because the whole model is compiled simultaneously

• Indices [], [,], [,,] are needed when defining arrays

• Matrices are filled row-wise

• Defining a vector/matrix prior to filling its elements is not necessary

Model Construction

• Likelihood:
for (i in 1:n){
x[i]~dexp(l[i])
l[i]<-mu[i]*x[i]
}
• Prior:
for (j in 1:m){
mu[j]~dexp(3)
}

Slides for Lecture 11, Spring 2018, PHC6194 Spatial Epidemiology