College of Public Health and Health Professions & College of Medicine
February 28, 2018
Introduction
Global Clustering
Local Clustering
Lab: Disease Clustering
Introduction
Spatial Patterns
Random
Cluster
Regular
Disease Cluster
The occurrence of a greater than expected number of cases of a particular disease within a group of people, a geographic area, or a period of time.
A collection of disease occurrences:
- of sufficient size and concentration to be unlikely to have occurred by chance, or
- related to each other through some social or biological mechanism, or having a common relationship with some other event or circumstance
Spatial aggregation of disease events may only be a function of the distribution of population
Disease cluster: residual spatial variation in risk after known influences have been accounted for
Purposes of Disease Cluster Detection
Confirmatory purpose
- verify if a perceived cluster exists:
e.g. excess risk reported by citizens
Exploratory purpose
- search for spatial patterns
Identification of clusters can lead to interventions
Methods of Disease Cluster Detection
Global clustering:
- non-specific methods
- detect only whether clustering exists, without identifying specific locations
Local clustering:
- specific methods
- shows the specific locations where clusters exist
- two methods: non-focused and focused
Global Clustering
Global Clustering (Non-specific Methods)
Evaluate whether clustering exists as a global phenomenon throughout the study region, without pinpointing the location of specific clusters
e.g. the analysis of overall clustering tendency of some disease incidence in a study region
Tests for Global Clustering
Over 100 different testing methods for global clustering in the field
Some widely-used methods:
- for aggregated data:
Moran's I
Geary's C
- for points data:
k-nearest neighbors (KNN) test
Moran's I
Moran's I is a global index of spatial auto-correlation
- quantifies the similarity of a variable among areas that are defined as spatially related
A Monte Carlo test under the random labeling hypothesis is used to test significance
The p-value is calculated from the rank of the observed test statistic among the values computed from the randomly labeled data
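The index and its Monte Carlo test can be sketched in a few lines. This is a minimal illustration, not the lab's implementation: the function names, the binary neighbor weights, and the six-area example data are all assumptions made for the sketch, and the p-value is one-sided (large positive I counts as evidence of clustering).

```python
import random

def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weight matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den

def monte_carlo_p(values, weights, n_sim=999, seed=0):
    """Rank the observed I among I values from randomly relabeled data."""
    rng = random.Random(seed)
    observed = morans_i(values, weights)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)  # random labeling: values reassigned to areas
        if morans_i(shuffled, weights) >= observed:
            at_least += 1
    # The observed data count as one of the n_sim + 1 labelings.
    return observed, (at_least + 1) / (n_sim + 1)

# Hypothetical data: 6 areas in a row; areas sharing a border are neighbors.
rates = [10.0, 9.0, 8.0, 2.0, 1.0, 0.0]
w = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
i_obs, p_value = monte_carlo_p(rates, w)
```

Because high rates sit next to high rates in this toy layout, the observed I is positive; the p-value is the share of labelings that give an I at least as large.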
Local Clustering
Local Clustering Test
Local tests additionally specify the location of clusters and can be extended to consider temporal patterns
Focused tests:
- investigate whether there is an increased risk of disease around a pre-determined point
- e.g. a Superfund site, a nuclear power plant, or a waste dumping site
Non-focused tests:
- identify the location of all potential clusters in the study region
Focused Tests
H0: there is no cluster of cases around the focus
The Lawson Waller test
- also called Berman's Z1 test
- H0: yi ~ Poisson(ni*r)
- H1: yi ~ Poisson(ni*r(1+εθi)), where θi represents the exposure to the focus experienced by the population in region i, and ε is a small, positive constant
- 1+εθi is the relative risk comparing people in region i with people with no exposure
The Lawson Waller score
- where θi is defined by the inverse distance of each region from the focus
- usually standardized to range from 0 to 1
T_{sc}=\sum_{i=1}^N\theta_i(y_i-rn_i)
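The score can be computed directly from region-level counts. The sketch below rests on several assumptions that the slides do not state: yi is read as the observed case count and ni as the population of region i, r is estimated as total cases over total population, θi is rescaled by dividing by the largest inverse distance, and the standardized value uses the plain Poisson variance with r treated as fixed. The data are hypothetical.

```python
import math

def lawson_waller_score(cases, population, distances):
    """Score statistic T_sc = sum_i theta_i * (y_i - r * n_i)."""
    r = sum(cases) / sum(population)       # overall rate (assumed estimator)
    inv = [1.0 / d for d in distances]     # exposure: inverse distance to the focus
    theta = [v / max(inv) for v in inv]    # rescale so the largest exposure is 1
    expected = [r * n for n in population]
    t_sc = sum(th * (y - e) for th, y, e in zip(theta, cases, expected))
    # Poisson variance under H0, treating r as known (a simplification).
    var = sum(th * th * e for th, e in zip(theta, expected))
    return t_sc, t_sc / math.sqrt(var)

# Hypothetical data: four regions at increasing distance from the focus.
cases = [12, 8, 5, 5]
population = [1000, 1000, 1000, 1000]
distances = [1.0, 2.0, 4.0, 8.0]
t_sc, z = lawson_waller_score(cases, population, distances)
```

A positive score means that regions with higher exposure tend to have more cases than expected, which is the pattern the alternative hypothesis describes.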
Non-focused Tests
Aggregated data:
- Local Indicators of Spatial Auto-correlation (LISA)
- Local Getis-Ord G statistics
- Spatial scan statistics
For each location, the LISA value measures its similarity with its neighbors and allows that similarity to be tested for significance
LISA divides the study region into 5 categories:
- high-high locations: also known as hot spots
- low-low locations: also known as cold spots
- high-low locations: potential spatial outliers
- low-high locations: potential spatial outliers
- locations with no significant local auto-correlation
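The five categories can be illustrated with a local Moran statistic, one common LISA. This is a sketch under stated assumptions: the slides do not say which LISA is meant, the function names and data are hypothetical, the diagonal of the weight matrix is taken to be zero, and significance comes from a simple two-sided conditional permutation test (the value at location i stays fixed while the other values are shuffled).

```python
import random

def quadrant(z_i, lag_i):
    """Label from the signs of a location's deviation and its neighbors' lag."""
    own = "high" if z_i > 0 else "low"
    nbr = "high" if lag_i > 0 else "low"
    return own + "-" + nbr

def lisa(values, weights, n_sim=499, alpha=0.05, seed=0):
    """Return (local I, pseudo p-value, category) for every location."""
    rng = random.Random(seed)
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(d * d for d in z) / n
    results = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        lag = sum(weights[i][j] * z[j] for j in others)
        local_i = z[i] * lag / m2
        pool = [z[j] for j in others]
        extreme = 0
        for _ in range(n_sim):
            rng.shuffle(pool)  # conditional permutation: z[i] is held fixed
            sim_lag = sum(weights[i][j] * v for j, v in zip(others, pool))
            if abs(sim_lag) >= abs(lag):
                extreme += 1
        p = (extreme + 1) / (n_sim + 1)
        label = quadrant(z[i], lag) if p <= alpha else "not significant"
        results.append((local_i, p, label))
    return results

# Hypothetical data: 6 areas in a row; areas sharing a border are neighbors.
rates = [10.0, 9.0, 8.0, 2.0, 1.0, 0.0]
w = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
results = lisa(rates, w)
```

With only six areas the permutation test has little power, so most or all locations here fall in the "not significant" category; the example is meant to show the mechanics, not a realistic result.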
Local Getis-Ord G Statistic
The proportion of all x values in the study area accounted for by the neighbors of location i
G will be high where high values cluster (hot spot)
G will be low where low values cluster (cold spot)
G_i(d)=\frac{\sum_j w_{ij}x_j}{\sum_j x_j}
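A small numerical sketch of the statistic, assuming binary distance-band weights (wij = 1 when location j lies within distance d of location i, and wii = 0). The denominator sums over all locations, as in the formula shown here; some formulations also exclude location i from the denominator. The function name, coordinates, and counts are hypothetical.

```python
import math

def getis_ord_g(points, values, d):
    """G_i(d): share of the total of x that lies within distance d of location i."""
    total = sum(values)
    g = []
    for i, (xi, yi) in enumerate(points):
        near = sum(values[j] for j, (xj, yj) in enumerate(points)
                   if j != i and math.hypot(xi - xj, yi - yj) <= d)
        g.append(near / total)
    return g

# Hypothetical data: three high-count locations close together, two low ones apart.
points = [(0, 0), (1, 0), (2, 0), (5, 0), (6, 0)]
counts = [9, 8, 7, 1, 1]
g = getis_ord_g(points, counts, d=1.5)
```

The middle location of the high-count group gets the largest G because both of its neighbors carry high values, while the two isolated low-count locations get the smallest.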
Spatial Scan Statistic
Steps:
- search over a given set of spatial regions
- find those regions which are most likely to be clusters
- correctly adjust for multiple hypothesis testing
Search Over a Given Set of Spatial Regions
Create a regular or irregular grid of centroids covering the whole study area
Create an infinite number of circles around each centroid, with the radius ranging from 0 to a maximum at which the circle includes at most 50% of the population
A circular scanning window is placed at different coordinates, with radii that vary from 0 to some set upper limit.
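Although the radius varies continuously, with aggregated data only a finite number of circles contain distinct sets of areas, so the search can grow each circle one nearest area at a time. The sketch below assumes that each area is represented by a single centroid and that an area counts as inside a circle when its centroid is; the function name and data are hypothetical, and equidistant centroids are not handled specially.

```python
import math

def candidate_windows(centroids, population, max_fraction=0.5):
    """Distinct sets of areas captured by circles centered on each centroid."""
    cap = max_fraction * sum(population)
    windows = set()
    for cx, cy in centroids:
        # Order areas by distance from this center; growing the radius
        # adds them to the circle in exactly this order.
        order = sorted(range(len(centroids)),
                       key=lambda j: math.hypot(centroids[j][0] - cx,
                                                centroids[j][1] - cy))
        members, pop = [], 0
        for j in order:
            pop += population[j]
            if pop > cap:  # stop once the circle would exceed the population cap
                break
            members.append(j)
            windows.add(frozenset(members))
    return windows

# Hypothetical data: four areas along a line, 100 people each.
centroids = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (4.5, 0.0)]
population = [100, 100, 100, 100]
windows = candidate_windows(centroids, population)
```

Here the cap is 200 people, so every window holds one or two areas: the four single areas plus the three adjacent pairs.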
Find Regions that are Most Likely to be Clusters
For each location and size of window
HA: elevated risk within the window compared with outside the window
Is there any region with disease rates significantly higher inside the circle than outside the circle?
For each circle, obtain the actual and expected number of cases inside and outside the circle, and calculate the likelihood function
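One commonly used form of the likelihood ratio for a circle, assuming a Poisson model for the case counts (the slides do not say which model is intended). The symbols are introduced here for illustration: c_Z is the actual number of cases inside circle Z, E_Z is the number expected there under the null hypothesis, and C is the total number of cases in the study region.

```latex
\lambda(Z)=
\begin{cases}
\left(\dfrac{c_Z}{E_Z}\right)^{c_Z}\left(\dfrac{C-c_Z}{C-E_Z}\right)^{C-c_Z} & \text{if } c_Z > E_Z\\[2ex]
1 & \text{otherwise}
\end{cases}
```

The circle with the largest value of λ(Z) is the most likely cluster. Circles with fewer cases than expected are assigned a ratio of 1 because the alternative hypothesis concerns elevated risk only.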
Find Regions that are Most Likely to be Clusters (cont'd)
Generate random replicas of the dataset under the null-hypothesis of no clusters (Monte Carlo sampling)
Compare the most likely clusters in the real and random datasets (likelihood ratio test)
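The whole procedure can be sketched end to end. Everything named below is an assumption made for illustration: a Poisson log likelihood ratio is used, candidate windows are supplied as tuples of area indices, and random replicas are generated by redistributing the total number of cases across areas in proportion to population.

```python
import math
import random

def log_lr(c, e, total):
    """Poisson log likelihood ratio for c observed and e expected cases in a window."""
    if c <= e:
        return 0.0  # only elevated risk inside the window counts
    inside = c * math.log(c / e)
    outside = (total - c) * math.log((total - c) / (total - e)) if c < total else 0.0
    return inside + outside

def best_window(cases, population, windows):
    """Return (largest log likelihood ratio, window achieving it)."""
    total = sum(cases)
    rate = total / sum(population)
    best = (0.0, None)
    for w in windows:
        c = sum(cases[i] for i in w)
        e = rate * sum(population[i] for i in w)
        llr = log_lr(c, e, total)
        if llr > best[0]:
            best = (llr, w)
    return best

def scan_test(cases, population, windows, n_sim=999, seed=0):
    """Monte Carlo p-value for the most likely cluster."""
    rng = random.Random(seed)
    obs_llr, obs_window = best_window(cases, population, windows)
    total = sum(cases)
    areas = list(range(len(cases)))
    at_least = 0
    for _ in range(n_sim):
        sim = [0] * len(cases)
        # Null hypothesis: each case falls in an area with probability
        # proportional to that area's population.
        for i in rng.choices(areas, weights=population, k=total):
            sim[i] += 1
        if best_window(sim, population, windows)[0] >= obs_llr:
            at_least += 1
    return obs_window, obs_llr, (at_least + 1) / (n_sim + 1)

# Hypothetical data: four areas, with an excess of cases in areas 0 and 1.
cases = [30, 28, 5, 7]
population = [1000, 1000, 1000, 1000]
windows = [(0,), (1,), (2,), (3,), (0, 1), (1, 2), (2, 3)]
cluster, llr, p_value = scan_test(cases, population, windows)
```

Each replica is compared through its own best window, whichever one that is, rather than through the window found in the real data. This is how the test accounts for having searched over many candidate windows.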
Properties of Spatial Scan Statistics
Adjusts for inhomogeneous population density
Simultaneously tests for clusters of any size and any location, by using circular windows with continuously variable radius