## Bayesian Analyses

Hui Hu Ph.D.

Department of Epidemiology

College of Public Health and Health Professions & College of Medicine

April 3, 2019

# Introduction to Bayesian Statistical Perspectives

### Statistical Perspectives

• Statistics:
-  the use of analytical tools to uncover relationships in data
-  different foundations in philosophy of inference
-  different perspectives on probability: constant of nature vs. subjective degree of belief

• The most common statistical perspectives:
-  Frequentist: Pr(y|θ), probability of seeing this data, given the parameter
-  Bayesian: Pr(θ|y), probability of the parameter, given this data

[Figure: diagram relating Data, Prior Belief, and Decision under the Frequentist and Bayesian perspectives]

### Frequentist vs. Bayesian

• Frequentist:
-  data are a repeatable random sample: there is a frequency
-  underlying parameters remain constant during this repeatable process
-  parameters are fixed

• Bayesian:
-  data are observed from the realized sample
-  parameters are unknown and described probabilistically
-  data are fixed

### Bayes' Theorem

$$\Pr(A|B)=\frac{\Pr(B|A)\Pr(A)}{\Pr(B)}$$

• 2% of women at age 40 who participate in routine screening have breast cancer
• 80% of women with breast cancer will get positive mammographies
• 10% of women without breast cancer will also get positive mammographies
• Question:
-  A woman in this age group had a positive mammography in a routine screening.
What is the probability that she actually has breast cancer?

• Answer:

$$\Pr(\text{cancer})=0.02$$

$$\Pr(\text{positive test}|\text{cancer})=0.8$$

$$\Pr(\text{positive test})=0.02\times0.8+0.98\times0.1=0.114$$

$$\Pr(\text{cancer}|\text{positive test})=\frac{0.8\times0.02}{0.114}\approx0.14$$
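
The arithmetic of the screening example can be reproduced with a short script. This is only a sketch in Python (the slides themselves do not use Python), and the variable names are chosen here for illustration.

```python
# Bayes' theorem applied to the screening example.
prior = 0.02            # Pr(cancer)
sensitivity = 0.80      # Pr(positive test | cancer)
false_positive = 0.10   # Pr(positive test | no cancer)

# Law of total probability: Pr(positive test)
p_positive = sensitivity * prior + false_positive * (1 - prior)

# Bayes' theorem: Pr(cancer | positive test)
posterior = sensitivity * prior / p_positive

print(f"Pr(positive test)          = {p_positive:.3f}")
print(f"Pr(cancer | positive test) = {posterior:.3f}")
```

Even with a positive test, the posterior probability stays low because the prior (2%) is small relative to the false-positive rate.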

### Bayes' Theorem

$$\Pr(A|B)=\frac{\Pr(B|A)\Pr(A)}{\Pr(B)}$$

Written in terms of a parameter θ and data y:

$$\Pr(\theta|y)=\frac{\Pr(y|\theta)\Pr(\theta)}{\Pr(y)}$$

• Posterior, Pr(θ|y): the probability of the parameter given the data collected

• Likelihood, Pr(y|θ): the probability of collecting this data given the parameter

• Prior, Pr(θ): the probability of the parameter before collecting data

• Marginal likelihood, Pr(y): the probability of collecting this data under all possible parameters

• The prior represents our subjective beliefs, expressed as a probability statement, about likely values of the unobserved parameter before we have observed data

• Types:
-  noninformative: minimal prior information (e.g. all values equally likely)
-  clinical: elicited through interaction between the statistician and a knowledgeable scientist
-  skeptical: quantifies a large effect as unlikely
-  enthusiastic: gives a good chance of observing an effect

• Likelihood is used when describing a function of a parameter given an outcome
-  e.g. if a coin is flipped 10 times and it has landed heads-up 10 times, what is the likelihood that the coin is fair?

• The likelihood for data {y_i}, i=1, 2, ..., m, is defined as

$$L(y|\theta)=\prod_{i=1}^m f(y_i|\theta)$$
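
As an illustration of the product formula, the coin example can be evaluated directly. The sketch below assumes each flip is a Bernoulli trial with f(y_i|θ) = θ for heads and 1 − θ for tails; the function name and the comparison value θ = 0.9 are chosen here for illustration.

```python
def likelihood(theta, flips):
    """L(y|theta) as the product of Bernoulli terms f(y_i|theta)."""
    value = 1.0
    for y in flips:
        value *= theta if y == 1 else (1.0 - theta)
    return value

flips = [1] * 10  # ten flips, all heads (1 = heads, 0 = tails)

fair = likelihood(0.5, flips)     # likelihood of a fair coin
biased = likelihood(0.9, flips)   # likelihood of a coin with Pr(heads) = 0.9

print(f"L(y | theta=0.5) = {fair:.6f}")
print(f"L(y | theta=0.9) = {biased:.6f}")
```

The fair coin is not impossible under these data, but a coin biased towards heads makes the observed sequence far more likely.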

• Given a set of independent identically distributed data points

$$y=(y_1,\dots,y_n)$$

where y_i ~ p(y_i|θ) according to some probability distribution parameterized by θ, and θ itself is a random variable described by a distribution:

$$\theta \sim p(\theta|\alpha)$$

• The marginal likelihood in general asks what the probability p(y|α) is, where θ has been marginalized out:

$$p(y|\alpha)=\int_\theta p(y|\theta)p(\theta|\alpha)\,d\theta$$

• The probability of the data under all possible parameters
-  it is a constant value given data

$$\Pr(\theta|y)=\frac{\Pr(y|\theta)\Pr(\theta)}{\Pr(y)}$$

$$\Pr(\theta|y)\propto \Pr(y|\theta)\Pr(\theta)$$

The posterior is proportional to the likelihood times the prior
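
To make the role of the marginal likelihood concrete, the integral can be approximated on a grid. The sketch below assumes the coin data from earlier (10 heads in 10 flips) and a uniform prior on θ; the midpoint rule and the grid size are choices made here for illustration, not part of the original material.

```python
# Midpoint-rule approximation of p(y) = integral of p(y|theta) p(theta) dtheta.
n_grid = 10_000
width = 1.0 / n_grid
thetas = [(k + 0.5) * width for k in range(n_grid)]  # cell midpoints in (0, 1)

def likelihood(theta, heads=10, tails=0):
    return theta ** heads * (1.0 - theta) ** tails

def prior(theta):
    return 1.0  # uniform density on (0, 1)

# Marginal likelihood: a single number once the data are fixed.
marginal = sum(likelihood(t) * prior(t) for t in thetas) * width

# Dividing by the marginal likelihood turns likelihood x prior into a density.
posterior = [likelihood(t) * prior(t) / marginal for t in thetas]
total = sum(posterior) * width

print(f"marginal likelihood ~ {marginal:.5f}")
print(f"posterior integrates to ~ {total:.5f}")
```

For this particular case the integral is available exactly (it equals 1/11), which gives a check on the approximation.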

### The Sunrise Problem

• The sunrise problem was introduced by Price in the comments to the article that introduced Bayes' theorem

• It concerns the probability that the sun will rise tomorrow, and the evaluation and updating of a belief (prior)

• Imagine a newborn who observed the sun on his first day
-  after the sun sets, the newborn has uncertainty as to whether or not he will see the sun again
-  we can represent this uncertainty with a Beta distribution
-  in this example, the Beta distribution is used as a prior probability distribution, which is an expression of belief before seeing more data

### The Sunrise Problem (cont'd)

• A Beta distribution has two parameters, α and β
-  a Beta distribution that is specified as B(α=1,β=1) is a uniform distribution between zero and one

• The α and β are easily understandable in terms of successes and failures of an event, where α=successes+1 and β=failures+1
-  therefore, zero successes and failures may be represented with a B(α=1,β=1) distribution
-  the expectation of a Beta distribution is α/(α+β)

• The Beta distribution is also convenient because it is self-conjugate, meaning that when it is used as a prior distribution, the posterior distribution is also a Beta distribution

### The Sunrise Problem (cont'd)

• After observing one sunrise (one success), the posterior is B(α=2,β=1):
-  the prior mean of 0.5 has been updated to a posterior mean of 0.67

• After observing two sunrises, the posterior is B(α=3,β=1):
-  the prior mean of 0.5 has been updated to a posterior mean of 0.75

• After observing ten sunrises, the posterior is B(α=11,β=1):
-  the prior mean of 0.5 has been updated to a posterior mean of 0.92

• If this example was calculated for the newborn on his 80th birthday:
-  the prior mean of 0.5 has been updated to a posterior mean of 0.9999658
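
These posterior means follow from α=successes+1, β=failures+1 and the mean α/(α+β). A minimal sketch, assuming a B(1,1) prior, no failures, and roughly 80 × 365 observed sunrises by the 80th birthday (the day count is an assumption made here):

```python
def posterior_mean(successes, failures=0, alpha=1.0, beta=1.0):
    """Mean of the Beta posterior after updating a Beta(alpha, beta) prior."""
    a = alpha + successes
    b = beta + failures
    return a / (a + b)

# 0 sunrises gives the prior mean; 80 * 365 approximates 80 years of days.
for sunrises in (0, 1, 2, 10, 80 * 365):
    print(f"{sunrises:>6} sunrises -> posterior mean {posterior_mean(sunrises):.7f}")
```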

### The Sunrise Problem (cont'd)

-  the newborn may have been optimistic or pessimistic in his uncertainty, and a different prior probability distribution may have been more appropriate
-  the subjective interpretation of probability allows for the prior probability distribution to differ for each person
-  although a different prior would affect the posterior, the difference decreases as more data is observed
-  characterizing the prior or posterior with the mean is also an arbitrary choice

• In Bayesian inference, only the simplest examples may be demonstrated analytically. Usually, numerical approximation such as Markov chain Monte Carlo (MCMC) must be used

### Conjugate Pairs

• When the prior and the posterior are from the same family of distributions, the prior and the likelihood that produce them are called a conjugate pair

• Since the posterior is proportional to the prior times the likelihood, we need to pick the right combination of likelihood distribution and prior distribution to get a conjugate pair

• Conjugate pairs are important
-  for non-conjugate pairs we generally cannot write out the posterior distribution explicitly

### Conjugate Pairs (cont'd)

• An example of conjugate pair with a Binomial likelihood and a Beta prior
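
For a Binomial likelihood with k successes in n trials and a B(α,β) prior, the standard conjugacy result is that the posterior is B(α+k, β+n−k). The sketch below checks this numerically by comparing the closed form with prior × likelihood normalized on a grid; the prior parameters and data are hypothetical values chosen only for illustration.

```python
from math import comb, exp, lgamma, log

def beta_pdf(theta, a, b):
    """Density of Beta(a, b) at theta, computed on the log scale."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(theta) + (b - 1) * log(1 - theta))

def binomial_pmf(k, n, theta):
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

a, b = 2.0, 3.0   # hypothetical prior parameters
k, n = 7, 10      # hypothetical data: 7 successes in 10 trials

n_grid = 2_000
width = 1.0 / n_grid
thetas = [(i + 0.5) * width for i in range(n_grid)]  # midpoints in (0, 1)

# Posterior by brute force: likelihood x prior, normalized numerically.
unnormalized = [binomial_pmf(k, n, t) * beta_pdf(t, a, b) for t in thetas]
marginal = sum(unnormalized) * width
numeric = [u / marginal for u in unnormalized]

# Posterior by conjugacy: Beta(a + k, b + n - k).
closed_form = [beta_pdf(t, a + k, b + n - k) for t in thetas]

max_gap = max(abs(x - y) for x, y in zip(numeric, closed_form))
print(f"largest difference between the two posteriors: {max_gap:.2e}")
```

The two curves agree up to numerical error, which is the practical benefit of a conjugate pair: no integration is needed to obtain the posterior.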

# Bayesian Hierarchical Model

### Bayesian Hierarchical Model

• A simple example of a hierarchical model that is commonly found in spatial epidemiological studies is where the data likelihood is Poisson and there is a common relative risk parameter with a single gamma prior distribution:

$$p(\theta|y)\propto L(y|\theta)g(\theta)$$

where g(θ) is a Gamma distribution with parameters α and β.

• A compact notation for this model is:

$$y_i|\theta \sim Pois(e_i\theta)$$

$$\theta \sim G(\alpha,\beta)$$

where e_i θ is the Poisson mean for observation i (e_i is typically the expected count in area i)

• α and β control the form of the prior distribution g(θ)
-  these parameters can have assumed values, but more usually, we don't have a strong belief in the prior parameter values
-  alternatively, as parameters within models are regarded as stochastic, these parameters must also have distributions; these distributions are known as hyperprior distributions, and the parameters are known as hyperparameters
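
When α and β are held at assumed values, the Poisson-Gamma model is itself a conjugate pair. Assuming G(α,β) denotes the shape-rate parameterization (the slides do not say which is meant), the posterior is G(α+Σy_i, β+Σe_i). The counts below are hypothetical and serve only to illustrate the update.

```python
# Hypothetical observed counts y_i and expected counts e_i for four areas.
observed = [5, 8, 3, 12]
expected = [4.0, 6.5, 4.2, 9.1]

alpha, beta = 1.0, 1.0  # assumed prior values (shape, rate)

# Conjugate update for the common relative risk theta.
post_alpha = alpha + sum(observed)
post_beta = beta + sum(expected)

post_mean = post_alpha / post_beta        # posterior mean of theta
crude = sum(observed) / sum(expected)     # estimate ignoring the prior

print(f"posterior: G({post_alpha:.1f}, {post_beta:.1f})")
print(f"posterior mean relative risk: {post_mean:.3f}")
print(f"crude observed/expected ratio: {crude:.3f}")
```

The posterior mean sits between the prior mean α/β and the crude ratio, and moves towards the crude ratio as the expected counts grow.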

### Bayesian Hierarchical Model (cont'd)

• In this Poisson-Gamma example, there is a two-level hierarchy:
-  θ has a G(α, β) distribution at the first level of the hierarchy
-  at the second level, α can have a hyperprior distribution (hα), and β can have a hyperprior distribution (hβ)

• This can be written as:

$$y_i|\theta \sim Pois(e_i\theta)$$

$$\theta \sim G(\alpha,\beta)$$

$$\alpha|\nu \sim h_\alpha(\nu)$$

$$\beta|\rho \sim h_\beta(\rho)$$

### Computational Tools

• In Bayesian analyses, integration is the principal inferential operation, as opposed to optimization in classical statistics

• Historically, the need to evaluate integrals was a major stumbling block for the application of Bayesian methods, severely restricting the types of models that could be implemented

• Around 1990, a numerical technique known as Markov chain Monte Carlo (MCMC) was introduced to Bayesian computation by Gelfand and Smith (1990)

• MCMC is a general method that handles inference for all the major quantities in a Bayesian analysis at once
-  it allows us to concentrate on modelling: use models that you believe represent the true dependence structures in the data, rather than those that are simple to compute

### MCMC

• All the information we need is contained in the posterior distribution; however, it may not be expressible as a standard distribution (unless we pick a specific prior and likelihood combination that forms a conjugate pair)

• MCMC simulation approximates the true posterior density by using a bag of samples drawn from the density
-  the initial set of T samples is discarded, since samples drawn after only a few iterations can be far from the posterior distribution
-  the time (iteration number) T is known as the burn-in

• The two most general procedures for MCMC simulation from a target distribution:
-  the Gibbs sampler
-  the Metropolis-Hastings algorithm: allows us to draw samples from non-conjugate pairs

### Metropolis-Hastings Algorithm

• The Metropolis-Hastings algorithm can be embedded within the Gibbs sampler to generate samples from conditional distributions that have no standard (known) form

• Although we can use this algorithm to sample from any distribution, we usually avoid it when the posterior distribution has a known form, since the samples it generates may introduce additional autocorrelation into the Gibbs sampler

• We do not cover the technical details of this algorithm since it has been implemented in JAGS
-  but for more complex models that cannot be implemented in JAGS or that need further optimization, we do need to program the Gibbs sampler and implement the Metropolis-Hastings algorithm ourselves
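
For readers who want a feel for what such a sampler does, here is a minimal sketch of a random-walk Metropolis sampler (the special case of Metropolis-Hastings with a symmetric proposal). It is not how JAGS is implemented. The target is the sunrise posterior after ten sunrises with a uniform prior, which is proportional to θ^10 on (0, 1); the step size, seed, chain length, and burn-in are arbitrary choices made here for illustration.

```python
import math
import random

def log_target(theta):
    """Log of the unnormalized posterior: uniform prior x likelihood theta^10."""
    if theta <= 0.0 or theta >= 1.0:
        return -math.inf  # zero density outside (0, 1)
    return 10.0 * math.log(theta)

def metropolis(n_iter, step=0.1, start=0.5, seed=1):
    rng = random.Random(seed)
    theta = start
    chain = []
    for _ in range(n_iter):
        # Symmetric Gaussian proposal centred on the current value.
        proposal = theta + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(theta)
        # Accept with probability min(1, target(proposal) / target(theta)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            theta = proposal
        chain.append(theta)  # a rejected proposal repeats the current value
    return chain

chain = metropolis(20_000)
burn_in = 2_000                 # discard the first T samples
kept = chain[burn_in:]
estimate = sum(kept) / len(kept)

print(f"MCMC estimate of the posterior mean: {estimate:.3f}")
print(f"exact posterior mean (11/12):        {11 / 12:.3f}")
```

Only ratios of the target density are used, so the marginal likelihood never has to be computed; that is what makes the approach usable for non-conjugate pairs.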

# Lab: JAGS

### BUGS and JAGS

• BUGS: Bayesian inference Using Gibbs Sampling

• JAGS (Just Another Gibbs Sampler)

### Syntax Differences in BUGS/JAGS Compared with R

• The order of statements is not important because they are compiled simultaneously

• [], [,], [,,] are needed when defining arrays

• Matrix filled row-wise

• Defining vector/matrix prior to filling the elements is not necessary

### Model Construction

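• Likelihood:

```
for (i in 1:n){
  x[i] ~ dexp(l[i])     # x[i] is exponential with rate l[i]
  # l[i] is used above before it is defined; statement order does not matter.
  # z[i] is an assumed covariate supplied as data: the rate cannot depend on
  # x[i] itself, because the model graph must be acyclic.
  l[i] <- mu[i]*z[i]
}
```

• Prior:

```
for (j in 1:m){
  mu[j] ~ dexp(3)       # exponential prior with rate 3 on each mu[j]
}                       # the likelihood uses mu[1..n], so m should equal n
```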