pseudotime TDA

4CE longitudinal lab values

https://github.com/covidclinical/Phase2.1TDAPseudotimeRPackage

CRP enrichment

Topological Data Analysis (TDA): mine “connected” disease states via data shape
Pseudotime: infer progression among states to reconstruct disease trajectories

Challenges

Missing data
- Not at random
- Slow imputation
Aggregate result from different sites
- Consistency?
- Parameter tuning?
- Train/test site
Assumptions
- Number of trajectory clusters
- All patients start out at a similar state

missing data

4CE longitudinal lab values

Does the missingness pattern reflect the healthcare dynamic? (doctor's worry)

workflow proposal

Step 0: Drop variables with a high missing rate (> 80%: troponin_high, procalcitonin, fibrinogen, troponin_normal)

Step 1: Naively impute missing data points of each variable using functional PCA {fdapace}

Step 2: Drop rows with the most (originally) missing values, record the proportion of rows dropped for each patient (pdrop)

Step 3: Put NAs back in the CRP variable where it was missing.

Step 4: Train a model of CRP on Leukocytes, Albumin and pdrop (mixed-effects model, XGBoost, Amelia II) using the available data

Step 5: Use the fitted model to predict the missing CRP values.

Step 6: Repeat Steps 3–5 separately for each variable that has missing data (Leukocytes and Albumin).
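Steps 0 and 3 to 5 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the workflow's actual implementation: ordinary least squares stands in for the mixed-effects model, XGBoost or Amelia II, the functional PCA pass of Step 1 and the pdrop column of Step 2 are not shown, and the function names and toy data are hypothetical.

```python
import numpy as np

def drop_sparse_columns(X, names, max_missing=0.8):
    """Step 0: keep only columns whose missing rate is at most max_missing."""
    missing_rate = np.isnan(X).mean(axis=0)
    keep = missing_rate <= max_missing
    return X[:, keep], [n for n, k in zip(names, keep) if k]

def regress_impute(X_filled, was_missing, target):
    """Steps 3-5 for one column (assumption: plain least squares as the model).

    X_filled    : matrix with no NaNs (output of the naive imputation pass)
    was_missing : boolean matrix marking the originally missing entries
    target      : index of the column to re-impute
    """
    n_rows, n_cols = X_filled.shape
    predictors = [j for j in range(n_cols) if j != target]
    A = np.column_stack([np.ones(n_rows), X_filled[:, predictors]])
    # Step 3: ignore the naive values wherever the target was originally missing.
    observed = ~was_missing[:, target]
    # Step 4: fit on rows where the target was actually observed.
    coef, *_ = np.linalg.lstsq(A[observed], X_filled[observed, target], rcond=None)
    # Step 5: predict the originally missing target values.
    out = X_filled.copy()
    out[~observed, target] = A[~observed] @ coef
    return out

# Toy data: columns are CRP, Leukocytes, Albumin (values are made up).
rng = np.random.default_rng(0)
leuk = rng.normal(size=20)
alb = rng.normal(size=20)
crp = 1.0 + 2.0 * leuk + 3.0 * alb
X_true = np.column_stack([crp, leuk, alb])
was_missing = np.zeros(X_true.shape, dtype=bool)
was_missing[[2, 5, 11], 0] = True
X_filled = X_true.copy()
X_filled[was_missing] = 0.0  # stand-in for the naive values from Step 1
X_imputed = regress_impute(X_filled, was_missing, target=0)
```

Step 6 would call `regress_impute` again with `target` set to the Leukocytes and Albumin columns.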

CRP, Albumin, Leukocytes

for each cycle:

nRMSD(a) = \sqrt{\frac{\sum_{p,i} I_{p,a,i}\left(\frac{X_{p,a,i} - Y_{p,a,i}}{\max(Y_{p,a}) - \min(Y_{p,a})}\right)^2}{\sum_{p,i} I_{p,a,i}}}

evaluation

patient \(p\)

lab \(a\)

time index \(i\)

mask one extra value per lab per patient
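The evaluation can be sketched as follows for a single lab \(a\). The sketch assumes that \(X\) holds the imputed values, \(Y\) the observed values, and \(I\) the indicator of the entries that were deliberately masked; the slides do not spell these roles out. The function names are hypothetical, and a patient whose observed values are all equal would give a zero range, which this sketch does not handle.

```python
import numpy as np

def mask_one_per_patient(values, rng):
    """Pick one observed (non-NaN) time index per patient to hide."""
    masked = np.zeros(values.shape, dtype=bool)
    for p, row in enumerate(values):
        observed = np.flatnonzero(~np.isnan(row))
        if observed.size > 0:
            masked[p, rng.choice(observed)] = True
    return masked

def nrmsd(imputed, truth, masked):
    """Normalized RMSD for one lab; arrays have shape (patients, time points)."""
    # Per-patient range of the true values, ignoring originally missing entries.
    value_range = np.nanmax(truth, axis=1) - np.nanmin(truth, axis=1)
    scaled_err = (imputed - truth) / value_range[:, None]
    return np.sqrt(np.sum(scaled_err[masked] ** 2) / np.sum(masked))

# Made-up values for two patients and three time points.
truth = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
masked = np.array([[False, True, False], [False, False, True]])
imputed = truth.copy()
imputed[0, 1] = 2.5   # error 0.5 over a range of 2
imputed[1, 2] = 20.0  # error -10 over a range of 20
score = nrmsd(imputed, truth, masked)
```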

Amelia II

MICE

Limitations

a proportion of patients would be excluded
missing at random assumption

pattern of missingness?

other details

x_n = \frac{x - \min(x)}{\max(x) - \min(x)}

normalization
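As a small worked example of the min-max formula, the sketch below rescales one variable to the interval [0, 1]. Ignoring missing values when taking the minimum and maximum is an assumption here, and the function name is hypothetical.

```python
import numpy as np

def min_max_normalize(x):
    """Rescale x to [0, 1]; NaNs are ignored for min/max and stay NaN."""
    x = np.asarray(x, dtype=float)
    lo, hi = np.nanmin(x), np.nanmax(x)
    return (x - lo) / (hi - lo)

normalized = min_max_normalize([2.0, 4.0, 6.0, np.nan])
```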

Gaussian processes

\(f(t_i)\) have a joint Gaussian distribution

P(f(t))

locality constraint

closer time points have more similar measurement values

cov(f(t_1), f(t_2)) = \alpha e^{-(t_1-t_2)^2/l}

Step 1: extract separate univariate time series for each patient and variable

Step 2: GPfit: MLE over \(\alpha\) and \(l\)

Step 3: infer values at unobserved time points
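The three steps can be sketched for one patient and one variable with the kernel given above. This is an illustration only, not the GPfit procedure: it assumes a constant mean equal to the sample mean, adds a small noise term to the kernel diagonal for numerical stability, and replaces a proper optimizer with a grid search over \(\alpha\) and \(l\). All function names and the toy series are hypothetical.

```python
import numpy as np

def sq_exp_kernel(t1, t2, alpha, length):
    """cov(f(t1), f(t2)) = alpha * exp(-(t1 - t2)^2 / length)."""
    d = t1[:, None] - t2[None, :]
    return alpha * np.exp(-d ** 2 / length)

def log_marginal_likelihood(t, y, alpha, length, noise=1e-2):
    """Gaussian log-likelihood of centered values y under the kernel."""
    K = sq_exp_kernel(t, t, alpha, length) + noise * np.eye(len(t))
    L = np.linalg.cholesky(K)
    w = np.linalg.solve(L, y)  # w @ w equals y^T K^{-1} y
    return -0.5 * w @ w - np.log(np.diag(L)).sum() - 0.5 * len(t) * np.log(2 * np.pi)

def fit_gp(t, y, alphas, lengths, noise=1e-2):
    """Step 2: maximize the likelihood over a grid of (alpha, length)."""
    yc = y - y.mean()
    grid = [(a, l) for a in alphas for l in lengths]
    return max(grid, key=lambda p: log_marginal_likelihood(t, yc, p[0], p[1], noise))

def predict_gp(t_obs, y_obs, t_new, alpha, length, noise=1e-2):
    """Step 3: posterior mean at the unobserved time points."""
    mean = y_obs.mean()
    K = sq_exp_kernel(t_obs, t_obs, alpha, length) + noise * np.eye(len(t_obs))
    K_star = sq_exp_kernel(t_new, t_obs, alpha, length)
    return mean + K_star @ np.linalg.solve(K, y_obs - mean)

# Step 1: one univariate series (made-up values; day 3 is unobserved).
t_obs = np.array([0.0, 1.0, 2.0, 4.0, 5.0])
y_obs = np.sin(t_obs)
alpha_hat, length_hat = fit_gp(t_obs, y_obs, alphas=[0.1, 1.0, 10.0], lengths=[0.5, 2.0, 8.0])
y_day3 = predict_gp(t_obs, y_obs, np.array([3.0]), alpha_hat, length_hat)
```

Far from any observation the kernel decays to zero, so the prediction falls back to the assumed mean; this is the locality constraint in action.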

4CE missing data analysis

By Trang Le

2020-12-15

#math graduate. Postdoc fellow with Jason Moore.
