Ecological Analyses and Mixed-Effect Model

Hui Hu Ph.D.

Department of Epidemiology

College of Public Health and Health Professions & College of Medicine

March 14, 2018



Linear Mixed-Effects Model


Generalized Linear Mixed-Effects Model


Ecological Studies

  • Based on grouped data, with the groups in a spatial context corresponding to geographical areas
  • Ecological studies have a long history in many disciplines in addition to epidemiology and public health:
    -  political science, geography, sociology
  • Due to aggregation, ecological studies are susceptible to unique challenges, in particular the potential for ecological bias
    -  the difference between estimated association based on ecological- and individual-level data
  • Ecological data can be used for a variety of purposes:
    -  mapping: ecological bias is not a big problem (with-in areas variations may be obscured by agggregation)
    -  cluster detection: small-area anomalies may be washed away when data are aggregated

Ecological Bias

  • The fundamental problem with ecological inference is that the process of aggregation reduces information
    -  this information loss usually prevents identification of parameters of interest in the underlying individual-level model
  • If there is no within-area variability in exposures and confounders, then there will be no ecological bias
    -  therefore, ecological bias occurs due to within-area variability in exposures and confounders
  • Distinct consequences of this variability:
    -  pure specification bias
    -  confounding

Pure Specification Bias

  • Also called model specification bias
  • This bias arises because a nonlinear risk model changes its form under aggregation
  • This type of bias has nothing to do with confounding

Pure Specification Bias (cont'd)

  • In an ecological setting, the individual-level data are unavailable, and rather, we observe the aggregate data that correspond to the average outcome and exposure

# individuals in area i (i=1,2,...,m)

Outcome for individual j in area i

Exposure for individual j in area i

E(Y_{ij}|x_{ij})=\alpha +\beta x_{ij}
\bar y_i = {1 \over n_i} \sum^{n_i}_{j=1}y_{ij}
\bar x_i = {1 \over n_i} \sum^{n_i}_{j=1}x_{ij}
  • On aggregation, we have
E(\bar Y_{i}|\bar x_{i})=\alpha +\beta \bar x_{i}

Pure Specification Bias (cont'd)

  • Aggregate to sum:
E(Y_{ij}|x_{ij})=\alpha +\beta x_{ij}
E(Y_{i}|x_{ij})=\sum^{n_i}_{j=1} ( \alpha +\beta x_{ij})
  • Dividing the left- and right-hand sides by ni
E(Y_{i}|\bar x_{i})=n_i ( \alpha +\beta \bar x_{i})
  • When there is no within-area variability in exposure:
x_{ij}=\bar x_i

There is no ecological bias

  • The pure specification bias is reduced if areas are smaller, since the heterogeneity of exposures within areas is decreased


  • It is challenging to characterize the within-area joint distribution of exposures and confounders with only aggregated data
  • Two scenarios when we can address the confounding issue with aggregated data
    -  the exposure and confounders are independent (no interaction between exposure and confounders)
    -  if we have the confounders that are constant within areas (e.g. county-level policy)
  • We usually assume the samples drawn from targeted population are independent and identically distributed (i.i.d.).


  • This assumption does not hold when we have data with multilevel structure:
    - clustered and nested data (i.e. individuals within areas)
    - longitudinal data (i.e. repeated measurements within individuals)
    - non-nested structures (i.e. individuals within areas and belonging to some subgroups such as occupations)

  • Samples within each group are dependent, while samples between groups stay independent
  • Two sources of variations:
    - variations within groups
    - variations between groups
  • A longitudinal study:
    - n = 3
    - t = 3
  • Complete pooling
    - poor performance
  • No pooling
    - infeasible for large n
  • Partial pooling
  • An alternative solution: include categorical individual indicators in the traditional linear regression model.
  • Why do we still need mixed-effects models?
  1. Account for both individual- and group-level variations when estimating group-level coefficients.
  2. Easily model variations among individual-level coefficients, especially when making predictions for new groups.
  3. Allow us to estimate coefficients for specific groups, even for groups with small n

Fixed and Random Effects

  • Random Effects: varying coefficients
  • Fixed Effects: varying coefficients that are not themselves modeled

How to decide whether to use fixed-effects or random-effects?

When do mixed-effects models make a difference?

Fixed and Random Effects

Two extreme cases:

  • when the group-level variation is very little
    - reduce to traditional regression models without group indicators (complete pooling)
  • when the group-level variation is very large
    - reduce to traditional regression models with group indicators (no-pooling)

Little risk to apply a mixed-effects model

What's the difference between no-pooling models and mixed-effects models only with varying intercepts?

  • In no-pooling models, the intercept is obtained by least squares estimates, which equals to the fitted intercepts in models that are run separately by group.
  • In mixed-effects models, we assign a probability distribution to the random intercept: 

Intraclass Correlation (ICC)

shows the variation between groups

ICC ranges from 0 to 1:

  • ICC -> 0: the groups give no information (complete-pooling)
  • ICC -> 1: all individuals of a group are identical (no-pooling)

Intraclass Correlation (ICC)

ICC ranges from 0 to 1:

  • ICC -> 0: "hard constraint" to 
  • ICC -> 1: "no constraint" to
  • Mixed-effects model: "soft constraint" to 

This constraint has different effects on different groups:

  • For group with small n, a strong pooling is usually seen, where the value of     is close to the mean      (towards complete-pooling)
  • For group with large n, the pooling will be weak, where the value of    is far away from the mean      (towards no-pooling)

Linear Mixed-Effects Model

git pull

Load the Packages and Data

1,000 participants

5 repeated measurements






race: 1=white, 2=black, 3=others

gender: 1=male, 2=female

edu: 1=<HS, 2=HS, 3=>HS


am: 1=measured in morning

ex: #days exercised in the past year

Varying-intercept Model with No Predictors

allows intercept to vary by individual

estimated intercept, averaging over the individuals

estimated variations

Varying-intercept Model with an individual-level predictor

Varying-intercept Model with both individual-level and group-level predictors

Varying Slopes Models

With only an individual-level predictor

Varying Slopes Models

Add a group-level predictor

Non-nested Models

Generalized Linear Mixed-Effects Model

Mixed-Effects Logistic Model

Empty model

Mixed-Effects Logistic Model

Add bmi and race

Mixed-Effects Poisson Model

Parameter Estimation Algorithms

  • ML: maximum likelihood
  • REML: restricted maximum likelihood
    - default in lmer()
  • PQL: pseudo- and penalized quasilikelihood
  • Laplace approximations
    - default in glmer()
  • GHQ: Gauss-Hermite quadrature
  • McMC: Markov chain Monte Carlo

Bolker BM, Brooks ME, Clark CJ, Geange SW, Poulsen JR, Stevens MHH, et al. 2009. Generalized linear mixed models: A practical guide for ecology and evolution. Trends in ecology & evolution 24:127-135.

Mixed-Effects Model vs. GEE

Mixed-Effects Model Marginal Model with GEE
Distributional assumptions Yes No
Population average estimates Yes Yes
Group-specific estimates Yes No
Estimate variance components Yes No
Perform good with small n Yes No


By Hui Hu


Slides for Lecture 9, Spring 2018, PHC6194 Spatial Epidemiology

  • 424
Loading comments...

More from Hui Hu