Chapter 7: Evaluating, comparing, and expanding models
presentation for the INM-6 bookclub
30 Aug 2019
presenter: Alexandre René
Bayesian Data Analysis
“Inference is normal science.
Model-checking is revolutionary science.”
– Andrew Gelman
Why this chapter
- Not Bayesian 101
- Still among “basic Bayesian methods”
- Seemed more interesting than Chap 6 :-P
7 Evaluating, comparing, and expanding models
- 7.1 Measures of predictive accuracy
- 7.2 Information criteria and cross-validation
- 7.3 Model comparison based on predictive performance
- 7.4 Model comparison using Bayes factors
- 7.5 Continuous model expansion
- 7.6 Implicit assumptions and model expansion: an example
- 7.7 Bibliographic note
- 7.8 Exercises
7.1 Measures of predictive accuracy
observed data: \(y\)
new data: \(\tilde{y}_i\)
parameters: \(θ\)
point estimate: \(\hat{θ}\)
likelihood: \(p(y|θ) \)
true model: \(f(y)\)
prior: \(p(θ)\)
posterior: \(p_{\text{post}}(θ) = p(θ|y)\)
posterior predictive distribution: \(p_{\text{post}}(\tilde{y}) = \int p(\tilde{y}|θ)\, p_{\text{post}}(θ)\, dθ\)
7.1 Measures of predictive accuracy
Non-computable measures we want to use:
- lpd — log predictive density, aka log-likelihood: \(\log p(y|θ)\). Needs true \(θ\).
- elpd — expected log predictive density for a new data point: \(\mathrm{E}_f\bigl[\log p_{\text{post}}(\tilde{y}_i)\bigr] = \int \log p_{\text{post}}(\tilde{y}_i)\, f(\tilde{y}_i)\, d\tilde{y}_i\). Needs true \(f\).

Approximations — replace \(f\), \(θ\) by an estimate (e.g. \(p_{\text{post}}\), \(\hat{θ}\)):
- elpd|\(\hat{θ}\) — expected log predictive density given \(\hat{θ}\) (plug-in estimate): \(\mathrm{E}_f\bigl[\log p(\tilde{y}_i|\hat{θ})\bigr]\).
- lppd — log pointwise (posterior) predictive density: \(\sum_{i=1}^n \log \int p(y_i|θ)\, p_{\text{post}}(θ)\, dθ\). Uses sample data \(y\).
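In practice the lppd is computed from \(S\) posterior draws \(θ^s\) by replacing the integral with a Monte Carlo average: \(\text{computed lppd} = \sum_{i=1}^n \log\bigl(\tfrac{1}{S}\sum_{s=1}^S p(y_i|θ^s)\bigr)\). A minimal Python sketch of that bookkeeping (not from the book; the toy normal model and names are my own):

```python
import numpy as np
from scipy.special import logsumexp

def computed_lppd(log_lik):
    """Computed lppd from an (S, n) matrix of pointwise log-likelihoods
    log p(y_i | theta^s):  sum_i log( (1/S) sum_s p(y_i | theta^s) )."""
    S = log_lik.shape[0]
    return float(np.sum(logsumexp(log_lik, axis=0) - np.log(S)))

# Toy usage: y_i ~ N(theta, 1) with an (approximate) posterior for theta
rng = np.random.default_rng(0)
y = rng.normal(0.3, 1.0, size=20)                                 # "observed" data
theta_draws = rng.normal(y.mean(), 1/np.sqrt(len(y)), size=1000)  # posterior draws (flat prior)
log_lik = -0.5*np.log(2*np.pi) - 0.5*(y[None, :] - theta_draws[:, None])**2
print("computed lppd:", computed_lppd(log_lik))
```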
Log predictive density scoring rule (Log-likelihood)
Scoring rule: a measure of predictive accuracy for probabilistic prediction. Good scoring rules are proper and local:
- proper: the expected score is maximized when the reported predictive distribution is the true one (no incentive to hedge).
- local: the score depends on the predictive density only through its value at the observed \(\tilde{y}\).
It can be shown that the logarithmic score is the unique (up to an affine transformation) local and proper scoring rule, and it is commonly used for evaluating probabilistic predictions.
The advantage of using a pointwise measure, rather than working with the joint posterior predictive distribution, \(p_{\text{post}}(\tilde{y})\), is in the connection of the pointwise calculation to cross-validation.
Regarding the use of LPD as a model score:
Given that we are working with the log predictive density, the question may arise: why not use the log posterior? Why only use the data model and not the prior density in this calculation? The answer is that we are interested here in summarizing the fit of model to data, and for this purpose the prior is relevant in estimating the parameters but not in assessing a model’s accuracy.
7.2 Information criteria and cross-validation
- Measures of predictive accuracy are called information criteria.
- They are usually expressed in terms of the deviance, defined as \(-2 \log p(y|\hat{θ})\).
- Basic idea: we want to use lppd, but to compare different models we need to consider effective parameters.
Why effective parameters matter
Assume \(\dim θ = k\) and \(p(θ|y) \to \mathcal{N}(θ_0, V_0/n)\) as \(n \to \infty\). Then, under the posterior,
\[2\bigl[\log p(y|\hat{θ}) - \log p(y|θ)\bigr] \sim \sum_{j=1}^k \mathcal{N}(0,1)^2 = χ^2_k,\]
so
\[\mathrm{E}_{\text{post}}\bigl[\log p(y|θ)\bigr] \approx \log p(y|\hat{θ}) - k/2.\]
Fig 7.2 example:
- max lpd: \(-40.3\)
- mean lpd: \(-42.0\)
- no. params: \(k = 3\)
- \((-40.3) - (-42.0) = 1.7 \approx k/2\)
I.e. fitting the 3 parameters is expected to raise the within-sample lpd by about \(k/2\), equivalently to lower the deviance by about \(k\).
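A quick numerical check of the \(k/2\) gap (my own sketch, not the model behind Fig 7.2): take \(y_i \sim \mathcal{N}(θ, I_k)\) with a flat prior, so \(θ|y \sim \mathcal{N}(\bar{y}, I_k/n)\), and compare the lpd at the posterior mean with the posterior-averaged lpd.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, S = 100, 3, 4000
y = rng.normal(size=(n, k))        # data: n draws of a k-dim unit-variance normal
ybar = y.mean(axis=0)

def lpd(theta):
    """log p(y | theta) for y_i ~ N(theta, I_k)."""
    return np.sum(-0.5*np.log(2*np.pi) - 0.5*(y - theta)**2)

theta_draws = ybar + rng.normal(size=(S, k)) / np.sqrt(n)   # posterior: N(ybar, I_k/n)

max_lpd  = lpd(ybar)                                        # lpd at the posterior mean (= mode)
mean_lpd = np.mean([lpd(t) for t in theta_draws])           # posterior-averaged lpd
print("max lpd - mean lpd =", max_lpd - mean_lpd, " (expect about k/2 =", k/2, ")")
```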
Estimating out-of-sample predictive accuracy using available data
- Within-sample predictive accuracy: ignore the bias.
- Akaike information criterion (AIC): subtract \(k\) from deviance.
- Deviance information criterion (DIC): replace \(k\) with Bayesian estimate for effective number of parameters.
We know of no approximation that works in general, but predictive accuracy is important enough that it is still worth trying.
(Estimating lpd or elpd)
Basic idea: start from lppd and correct for bias due to parameters.
- Watanabe-Akaike (widely available) information criterion (WAIC): fully Bayesian information criterion; averages over the posterior.
- Leave-one-out cross-validation (LOO-CV): evaluate lppd on the left-out data. Bias usually neglected.
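For a model with a conjugate posterior, LOO-CV can be written out directly. A minimal sketch (my own toy example: \(y_i \sim \mathcal{N}(θ,1)\) with a flat prior, so the leave-one-out predictive for \(y_i\) is \(\mathcal{N}(\bar{y}_{-i},\, 1 + \tfrac{1}{n-1})\)):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 40
y = rng.normal(1.0, 1.0, size=n)   # toy data: y_i ~ N(theta, 1), flat prior on theta

loo = 0.0
for i in range(n):
    y_rest = np.delete(y, i)
    m = y_rest.mean()
    # theta | y_{-i} ~ N(m, 1/(n-1))  =>  y_i | y_{-i} ~ N(m, 1 + 1/(n-1))
    loo += norm.logpdf(y[i], loc=m, scale=np.sqrt(1 + 1/(n - 1)))

print("LOO-CV estimate of elpd:", loo)
```

In non-conjugate problems the held-out predictive density is instead estimated from posterior simulations of the refit model, or approximated (e.g. with Pareto-smoothed importance sampling) to avoid refitting \(n\) times.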
Example: WAIC
Effective number of parameters:
\[p_{\text{WAIC}_1} = 2 \sum_{i=1}^n \Bigl(\log \mathrm{E}_{\text{post}}\bigl[p(y_i|θ)\bigr] - \mathrm{E}_{\text{post}}\bigl[\log p(y_i|θ)\bigr]\Bigr)\]
Then
\[\widehat{\text{elppd}}_{\text{WAIC}} = \text{lppd} - p_{\text{WAIC}} \quad\text{(on the deviance scale, } \text{WAIC} = -2(\text{lppd} - p_{\text{WAIC}})\text{).}\]
The authors also define a variance-based alternative, \(p_{\text{WAIC}_2} = \sum_{i=1}^n \operatorname{Var}_{\text{post}}\bigl[\log p(y_i|θ)\bigr]\), which they recommend in practice.
the accuracy of a fitted model’s predictions of future data will generally be lower, in expectation, than the accuracy of the same model’s predictions for observed data
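A sketch of the WAIC bookkeeping, given an \((S, n)\) matrix of pointwise log-likelihoods from whatever sampler produced the posterior draws (function and variable names are my own):

```python
import numpy as np
from scipy.special import logsumexp

def waic(log_lik):
    """log_lik: (S, n) array of log p(y_i | theta^s).
    Returns (elppd_waic, p_waic1, p_waic2); elppd_waic uses p_waic2."""
    S = log_lik.shape[0]
    lppd_i  = logsumexp(log_lik, axis=0) - np.log(S)   # log (1/S) sum_s p(y_i | theta^s)
    p_waic1 = 2 * np.sum(lppd_i - log_lik.mean(axis=0))
    p_waic2 = np.sum(log_lik.var(axis=0, ddof=1))
    return np.sum(lppd_i) - p_waic2, p_waic1, p_waic2

# Toy usage with the conjugate normal model used in the sketches above:
rng = np.random.default_rng(3)
y = rng.normal(size=30)
theta_draws = y.mean() + rng.normal(size=2000) / np.sqrt(len(y))
log_lik = -0.5*np.log(2*np.pi) - 0.5*(y[None, :] - theta_draws[:, None])**2
print(waic(log_lik))
```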
Effective number of parameters as a random variable
[C]onsider the model \(y_1, \dotsc, y_n \sim \mathcal{N}(θ, 1)\), with \(n\) large and \(θ \sim \mathrm{U}(0, ∞)\). If the measurement \(y\) is close to zero, then \(p \approx 1/2\), since roughly half the information in the posterior distribution is coming from the prior constraint of positivity. However, if \(y > 0\) is large, the effective number of parameters is 1.
Informative prior distributions and hierarchical structures tend to reduce the amount of overfitting.
7.3 Model comparison based on predictive performance
For both nested and nonnested models, it is important to adjust for overfitting, especially when the models' complexities differ strongly.
Illustrated on 8-schools model:
School | \(y_j\) (estimated treatment effect) | \(σ_j\) (standard error) |
---|---|---|
A | 28 | 15 |
B | 8 | 10 |
C | -3 | 16 |
D | 7 | 11 |
E | -1 | 9 |
F | 1 | 11 |
G | 18 | 10 |
H | 12 | 18 |
No pooling model
\(p = 8\)
Complete pooling model
\(p = 1\)
Hierarchical model
\(1 \leq p \leq 8\)
(Data from Table 5.2.)
7.3 Model comparison based on predictive performance
[Table 7.1 of BDA: estimates of predictive accuracy (AIC, DIC, WAIC) and effective numbers of parameters for the no-pooling, complete-pooling, and hierarchical models.]
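As a rough illustration of the kind of comparison behind Table 7.1, here is a sketch for the two models with closed-form posteriors under flat priors (my own code; it is not meant to reproduce the book's table, which also reports AIC and DIC and includes the hierarchical model):

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# 8 schools data (Table 5.2): estimated effects y_j and standard errors sigma_j
y     = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])
S, rng = 5000, np.random.default_rng(4)

def lppd_and_pwaic(log_lik):
    lppd   = np.sum(logsumexp(log_lik, axis=0) - np.log(log_lik.shape[0]))
    p_waic = np.sum(log_lik.var(axis=0, ddof=1))      # p_WAIC2
    return lppd, p_waic, lppd - p_waic

# No pooling: flat priors on each theta_j  =>  theta_j | y ~ N(y_j, sigma_j^2)
theta = y + sigma * rng.normal(size=(S, 8))
ll_no_pool = norm.logpdf(y, loc=theta, scale=sigma)

# Complete pooling: flat prior on mu  =>  mu | y ~ N(mu_hat, V_mu), precision-weighted mean
w = 1 / sigma**2
mu_hat, V_mu = np.sum(w * y) / np.sum(w), 1 / np.sum(w)
mu = mu_hat + np.sqrt(V_mu) * rng.normal(size=(S, 1))
ll_pool = norm.logpdf(y, loc=mu, scale=sigma)

for name, ll in [("no pooling", ll_no_pool), ("complete pooling", ll_pool)]:
    lppd, p_waic, elppd = lppd_and_pwaic(ll)
    print(f"{name:16s} lppd={lppd:7.2f}  p_waic={p_waic:5.2f}  elppd_waic={elppd:7.2f}")
```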
Evaluating predictive error comparisons
Two issues: “statistical” and “practical” significance.
- Rule of thumb for nested models: \(Δ > 1 \Rightarrow\) statistically significant.
- For practical significance:
  - Use expert knowledge.
  - Calibrate with smaller models.

Out-of-sample prediction error also does not tell the whole story: substantial improvements may be small on an absolute scale.
Bias induced by model selection
If the number of compared models is small, the bias is small, but if the number of candidate models is very large (for example, if the number of models grows exponentially as the number of observations n grows) a model selection procedure can strongly overfit the data. […] This is one reason we view cross-validation and information criteria as an approach for understanding fitted models rather than for choosing among them.
Thus we see the value of the methods described here, for all their flaws. Right now our preferred choice is cross-validation, with WAIC as a fast and computationally convenient alternative.
Challenges
The current state of the art of measurement of predictive model fit remains unsatisfying. AIC does not work in settings with strong prior information, DIC gives nonsensical results when the posterior distribution is not well summarized by its mean, and WAIC relies on a data partition that would cause difficulties with structured models such as for spatial or network data. Cross-validation is appealing but can be computationally expensive and also is not always well defined in dependent data settings.
For these reasons, Bayesian statisticians do not always use predictive error comparisons in applied work, but we recognize that there are times when it can be useful to compare highly dissimilar models, and, for that purpose, predictive comparisons can make sense. In addition, measures of effective numbers of parameters are appealing tools for understanding statistical procedures, especially when considering models such as splines and Gaussian processes that have complicated dependence structures and thus no obvious formulas to summarize model complexity.
7.4 Model comparison using Bayes factors
This fully Bayesian approach has some appeal but we generally do not recommend it because, in practice, the marginal likelihood is highly sensitive to aspects of the model that are typically assigned arbitrarily and are untestable from data. Here we present the general idea and illustrate with two examples, one where it makes sense to assign prior and posterior probabilities to discrete models, and one example where it does not.
Idea: compare two models \(H_1\) and \(H_2\) by computing
\[\frac{p(H_2|y)}{p(H_1|y)} = \frac{p(H_2)}{p(H_1)} \times \frac{p(y|H_2)}{p(y|H_1)},\]
where the last factor, \(p(y|H_2)/p(y|H_1)\), is the Bayes factor.
A discrete example when Bayes factors are helpful
Two competing ‘models’ :
- \(H_1\): the woman is affected
- \(H_2\): the woman is unaffected
Uniform priors: \(p(H_1) = p(H_2) = 1/2\).
Sampling model: an affected mother has a 50% chance of passing on the affected gene to her son.
Data:
- \(y\) : woman has two unaffected sons
Bayes factor: \(\dfrac{p(y|H_1)}{p(y|H_2)} = \dfrac{0.5 \times 0.5}{1 \times 1} = 0.25\)
Posterior odds: \(\dfrac{p(H_1|y)}{p(H_2|y)} = \dfrac{p(H_1)}{p(H_2)} \times 0.25 = 0.25\), i.e. \(p(H_1|y) = 0.2\).
Why this works:
- Truly discrete alternatives – proper priors \(p(H_i)\).
- Proper marginal distributions \(p(y|H_i)\).
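The same arithmetic as a few lines of Python (a trivial sketch of the numbers implied by the slide's setup):

```python
# H1: the woman is affected (carries the gene); H2: she is not.
prior_H1, prior_H2 = 0.5, 0.5          # uniform prior over the two hypotheses
p_y_H1 = 0.5 * 0.5                     # two unaffected sons, each escaping with prob. 1/2
p_y_H2 = 1.0 * 1.0                     # unaffected sons are certain under H2

bayes_factor   = p_y_H1 / p_y_H2                        # 0.25
posterior_odds = (prior_H1 / prior_H2) * bayes_factor   # 0.25
posterior_H1   = posterior_odds / (1 + posterior_odds)  # 0.20
print(bayes_factor, posterior_odds, posterior_H1)
```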
A continuous example when Bayes factors are a distraction
The 8 schools model where
- \(H_1\): no pooling
- \(H_2\): complete pooling
Uniform (improper) priors: \(p(θ_1, \dotsc, θ_8) \propto 1\) under \(H_1\); \(p(μ) \propto 1\) under \(H_2\).
Bayes factor: \(p(y|H_2)/p(y|H_1)\) is not well defined; if the flat priors are approximated by proper ones (e.g. \(\mathcal{N}(0, A^2)\) with large \(A\)), the factor depends strongly on the arbitrary choice of \(A\).
Why this doesn't work:
Bayes factors depend on the arbitrary choices we make to fix the improper priors.
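A small sketch of the pathology (my own illustration): approximate the flat prior on the complete-pooling mean by \(μ \sim \mathcal{N}(0, A^2)\); marginally \(y \sim \mathcal{N}\bigl(0, \operatorname{diag}(σ^2) + A^2 \mathbf{1}\mathbf{1}^\top\bigr)\), and the marginal likelihood, hence any Bayes factor built from it, keeps drifting with the arbitrary width \(A\).

```python
import numpy as np
from scipy.stats import multivariate_normal

# 8 schools data
y     = np.array([28., 8., -3., 7., -1., 1., 18., 12.])
sigma = np.array([15., 10., 16., 11., 9., 11., 10., 18.])

def log_marginal_complete_pooling(A):
    """log p(y | H2) when the flat prior is approximated by mu ~ N(0, A^2)."""
    cov = np.diag(sigma**2) + A**2 * np.ones((8, 8))
    return multivariate_normal(mean=np.zeros(8), cov=cov).logpdf(y)

for A in [10, 100, 1000, 10000]:
    print(f"A = {A:6d}   log p(y|H2) = {log_marginal_complete_pooling(A):8.2f}")
# The marginal likelihood keeps decreasing (roughly by log A), so the
# "evidence" for or against H2 depends on an arbitrary choice.
```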
Preferred solution
Continuous model expansion
Thus, if we were to use a Bayes factor for this problem, we would find a problem in the model-checking stage (a discrepancy between posterior distribution and substantive knowledge), and we would be moved toward setting up a smoother, continuous family of models to bridge the gap between the two extremes. A reasonable continuous family of models is \(y_j \sim \mathcal{N}(θ_j, σ_j^2), θ_j \sim \mathcal{N}(μ, τ^2)\), with a flat prior distribution on \(μ\), and \(τ\) in the range \([0, ∞)\); this is the model we used in Section 5.5. Once the continuous expanded model is fitted, there is no reason to assign discrete positive probabilities to the values \(τ = 0\) and \(τ = \infty\), considering that neither makes scientific sense.
7.5 Continuous model expansion
Sensitivity analysis
The basic method of sensitivity analysis is to fit several probability models to the same problem. It is often possible to avoid surprises in sensitivity analyses by replacing improper prior distributions with proper distributions that represent substantive prior knowledge
This section is mostly a set of brief discussions about issues that arise when doing model expansion.
I've listed the sections below with quotes from each.
7.5 Continuous model expansion
Adding parameters to a model
Possible reasons:
- Model does not fit data or prior knowledge.
- Broadening class of models because an assumption is questionable.
- Two different models \(p_1(y,θ)\) and \(p_2(y,θ)\) can be combined into a larger model using a continuous parameterization.
- Expand model to include new data.
Broadly speaking, we will need to replace \(p(θ)\) with \(p(θ|φ)\).
7.5 Continuous model expansion
Accounting for model choice in data analysis
The basic method of sensitivity analysis is to fit several probability models to the same problem. It is often possible to avoid surprises in sensitivity analyses by replacing improper prior distributions with proper distributions that represent substantive prior knowledge.
7.5 Continuous model expansion
Alternative model formulations
We often find that adding a parameter to a model makes it much more flexible. For example, in a normal model, we prefer to estimate the variance parameter rather than set it to a prechosen value.
But why stop there? There is always a balance between accuracy and convenience. As discussed in Chapter 6, predictive model checks can reveal serious model misfit, but we do not yet have good general principles to justify our basic model choices.
7.5 Continuous model expansion
Practical advice for model checking and expansion
It is difficult to give appropriate general advice for model choice; as with model building, scientific judgment is required, and approaches must vary with context.
Our recommended approach, for both model checking and sensitivity analysis, is to examine posterior distributions of substantively important parameters and predicted quantities.
7.6 Implicit assumptions and model expansion: an example
Goal: estimate total population of New York State from samples of 100 municipalities
- Assume a normal distribution ⇒ 95% confidence bounds [2.0×10⁶, 37.0×10⁶].
- Lognormal distribution is more appropriate ⇒ [5.4×10⁶, 9.9×10⁶].
  This is good, right?
Posterior predictive check
Test statistic: \(T(y_{\text{obs}}) = \sum_{i=1}^n y_{\text{obs},i}\)
Produce \(S\) independent replicate datasets and see where the observed test statistic falls (a code sketch follows this list):
- Draw \(μ, σ^2\) from the posterior and produce \(y_{\text{obs}}^{\text{rep}}\).
- Compute \(T(y_{\text{obs}}^{\text{rep}})\) for each.
- Extend model to “power transformed normal family”.
Posterior predictive check now gives 15 out of 100 samples with larger sample total.
95% confidence bounds: [5.8×10⁶, 31.8×10⁶].
- New problem: this doesn't work on sample 2, where it predicts a median of 57×10⁷ for the total.
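A sketch of the posterior predictive check for the lognormal model (my own code; the actual municipality sample is not reproduced here, so `y_obs` below is a placeholder you would replace with the real 100 observations). With the standard noninformative prior, the posterior for \((μ, σ^2)\) of \(\log y\) is the usual normal–inverse-\(χ^2\) form, so replicated samples are cheap to draw.

```python
import numpy as np

rng = np.random.default_rng(5)
y_obs = rng.lognormal(mean=9.0, sigma=1.5, size=100)   # placeholder for the 100 sampled municipality populations
log_y = np.log(y_obs)
n, ybar, s2 = len(log_y), log_y.mean(), log_y.var(ddof=1)

S = 1000
T_obs = y_obs.sum()                 # test statistic: sample total
T_rep = np.empty(S)
for s in range(S):
    # Noninformative prior:  sigma^2 | y ~ (n-1) s^2 / chi2_{n-1},   mu | sigma^2, y ~ N(ybar, sigma^2/n)
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    y_rep = np.exp(rng.normal(mu, np.sqrt(sigma2), size=n))   # replicated sample of n municipalities
    T_rep[s] = y_rep.sum()

print("fraction of replicated totals exceeding the observed total:", np.mean(T_rep >= T_obs))
```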
How can the inferences for the population total in sample 2 be so much less realistic with a better-fitting model than with a worse-fitting model ?
The problem with the inferences in this example is not an inability of the models to fit the data, but an inherent inability of the data to distinguish between alternative models. The inference for \(y_{\text{total}}\) is actually critically dependent upon tail behavior beyond the quantile corresponding to the largest observed \(y_{\text{obs},i}\).
We were warned, in fact, by the specific values of the posterior simulations for the sample total from sample 2, where 10 of the 100 simulations for the replicated sample total were larger than 300 million!
The substantive knowledge that is used to criticize the power-transformed normal model can also be used to improve the model. Suppose we know that no single municipality has population greater than 5 × 10⁶. To include this information in the model, we simply draw posterior simulations in the same way as before but truncate municipality sizes to lie below that upper bound.
For the purpose of model evaluation, we can think of the inferential step of Bayesian data analysis as a sophisticated way to explore all the implications of a proposed model, in such a way that these implications can be compared with observed data and other knowledge not included in the model.
We should also know the limitations of automatic Bayesian inference. Even a model that fits observed data well can yield poor inferences about some quantities of interest. It is surprising and instructive to see the pitfalls that can arise when models are not subjected to model checks.
The Bayesian approach gives us more flexible means to incorporate substantive knowledge.
Selected Gelman quotes
- “In statistics it’s enough for our results to be cool. In psychology they’re supposed to be correct. In economics they’re supposed to be correct and consistent with your ideology.”
- “Sometimes classical statistics gives up. Bayes never gives up . . . so we’re under more responsibility to check our models.”
- “There are two types of models. Good models, if they don’t fit, you get a large standard error. Bad models, if the model doesn’t fit, it goes . . . [deep voice] No problem. I have a great compromise for you.’ ”
- “In cage 1, they all die, and then in cage 2 they all hear about it, and they’re like, ‘Don’t eat that shit, man.’” (violation of independence assumption in rats data)
- “If I’m doing an experiment to save the world, I better use my prior.”
- (on priors) “They don’t have to be weakly informative. They can just be shitty.”
- “Inference is normal science. Model-checking is revolutionary science.”
- (Someone asks how to compare models. Gelman writes in giant letters: OUT OF SAMPLE PREDICTION ERROR.) “So there’s that.”
- “A Bayesian version will usually make things better.”