#math graduate. Postdoc fellow with Jason Moore.
Kim Lab journal club
- EHR data are often available only at irregular intervals that vary among patients
- machine learning algorithms cannot directly accommodate missing data
- "complete case" approach: biased, limited generalizability, few observations left
- existing imputation methods: cross-sectional data (same time point)
multiple imputation: multiple copies of a data set
Step 1: Naively impute missing data points of each variable (e.g., with mean value)
Step 2: Put NAs back in the age variable where it was missing.
Step 3: Regress age on income and gender (linear regression) using the available data
Step 4: Use the fitted model to predict the missing age values.
Step 5: Repeat Steps 2–4 separately for each variable that has missing data, namely income and gender.
age, gender, income
for each cycle:
- focus on one variable at a time
- utilizes the correlation among the features
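The cycle above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference MICE implementation: the names `mice_impute` and `n_cycles` are hypothetical, every variable (including a categorical one such as gender) is treated as numeric and fitted with ordinary least squares, and the sketch returns a single deterministic copy, whereas multiple imputation would add random draws and produce several copies.

```python
import numpy as np

def mice_impute(X, n_cycles=10):
    """Chained-equation imputation with linear regression (one imputed copy)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                      # remember where the NAs were
    filled = X.copy()
    # Step 1: naive imputation with the mean of the observed values per column
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_cycles):
        for j in range(X.shape[1]):            # focus on one variable at a time
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            # Steps 2-3: fit column j on the other columns, using only the
            # rows where column j was actually observed
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            coef, *_ = np.linalg.lstsq(A[~miss_j], filled[~miss_j, j], rcond=None)
            # Step 4: predict the entries of column j that were missing
            filled[miss_j, j] = A[miss_j] @ coef
    return filled
```

Observed entries are never overwritten; only the originally missing cells are updated on each pass, so later variables are fitted against the latest imputations of the earlier ones.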
\(f(t_i)\) have a joint Gaussian distribution
closer time points have more similar measurement values
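One common way to encode this is a squared-exponential covariance. The exact kernel is an assumption here, chosen because it has the two parameters \(\alpha\) and \(l\) that appear in the fitting step:

\[
\operatorname{Cov}\bigl(f(t_i), f(t_j)\bigr) = k(t_i, t_j) = \alpha^2 \exp\!\left(-\frac{(t_i - t_j)^2}{2 l^2}\right)
\]

As \(|t_i - t_j|\) grows relative to the length scale \(l\), the covariance decays toward zero; \(\alpha^2\) sets the variance at a single time point.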
Step 1: extract separate univariate time series for each patient and variable
Step 2: GPfit: MLE over \(\alpha\) and \(l\)
Step 3: infer values at unobserved time points
utilizes autocorrelation within each variable
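The three steps can be sketched for a single patient and variable as follows. This is a hedged illustration, not the GPfit package: it assumes a squared-exponential kernel, a fixed noise level, and a coarse grid search standing in for a proper likelihood optimizer. The names `sq_exp_kernel`, `neg_log_marginal_likelihood`, and `gp_impute` are hypothetical.

```python
import numpy as np

def sq_exp_kernel(t1, t2, alpha, length):
    """Squared-exponential covariance: nearby time points are highly correlated."""
    d = t1[:, None] - t2[None, :]
    return alpha**2 * np.exp(-0.5 * (d / length) ** 2)

def neg_log_marginal_likelihood(t, y, alpha, length, noise):
    """Negative log marginal likelihood of a zero-mean GP with Gaussian noise."""
    K = sq_exp_kernel(t, t, alpha, length) + noise**2 * np.eye(len(t))
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ a + np.log(np.diag(L)).sum() + 0.5 * len(t) * np.log(2 * np.pi)

def gp_impute(t_obs, y_obs, t_new, noise=0.1):
    """Step 1 is assumed done: t_obs, y_obs are one patient's series for one variable."""
    t_obs, y_obs, t_new = (np.asarray(v, dtype=float) for v in (t_obs, y_obs, t_new))
    mean = y_obs.mean()
    y = y_obs - mean                            # work with a zero-mean series
    # Step 2: crude MLE over (alpha, l) by grid search (assumed stand-in for an optimizer)
    span = t_obs.max() - t_obs.min()
    alphas = max(np.std(y), 1e-3) * np.array([0.5, 1.0, 2.0])
    lengths = span * np.array([0.05, 0.1, 0.2, 0.5, 1.0])
    grid = [(a, l) for a in alphas for l in lengths]
    alpha, length = min(
        grid, key=lambda p: neg_log_marginal_likelihood(t_obs, y, p[0], p[1], noise)
    )
    # Step 3: posterior mean at the unobserved time points
    K = sq_exp_kernel(t_obs, t_obs, alpha, length) + noise**2 * np.eye(len(t_obs))
    K_star = sq_exp_kernel(t_new, t_obs, alpha, length)
    return mean + K_star @ np.linalg.solve(K, y)
```

Calling `gp_impute(times, values, [t_missing])` returns the predicted value at `t_missing` using only that variable's own history, which is why this step captures autocorrelation but not correlation between variables.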
mask one result per analyte per patient
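A masking scheme of this kind might look like the sketch below. The array layout (patients × analytes × time, with NaN for results that were never measured), the function names `mask_one_per_series` and `rmse_on_hidden`, and the use of plain RMSE as the error measure are all assumptions made for illustration.

```python
import numpy as np

def mask_one_per_series(data, rng):
    """Hide one observed result per analyte per patient.

    data: array of shape (patients, analytes, time), NaN where not measured.
    Returns the masked copy and a boolean array marking the hidden entries.
    """
    data = np.asarray(data, dtype=float)
    masked = data.copy()
    hidden = np.zeros(data.shape, dtype=bool)
    for p in range(data.shape[0]):
        for a in range(data.shape[1]):
            observed = np.flatnonzero(~np.isnan(data[p, a]))
            if observed.size == 0:              # nothing to hide for this series
                continue
            i = rng.choice(observed)            # pick one measured time index
            hidden[p, a, i] = True
            masked[p, a, i] = np.nan
    return masked, hidden

def rmse_on_hidden(truth, imputed, hidden):
    """Root-mean-square error restricted to the entries that were hidden."""
    diff = np.asarray(imputed)[hidden] - np.asarray(truth)[hidden]
    return float(np.sqrt(np.mean(diff ** 2)))
```

An imputation method is then run on `masked`, and its output is compared against the original `data` only at the `hidden` positions, where the true values are known.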
time index \(i\)
correlation between analytes and between current and prior values for each analyte
- Amelia II
- more than 2/3 of patients were excluded
- interpolations in place of GP for cases lacking sufficient temporal data to use 3D-MICE
- improvement over MICE or GP is small
- missing at random assumption
- 3D-MICE is competitive in imputing missing data, especially when both inter-variable and within-variable correlations are present
By Trang Le