Rebecca Barter, Prabhu Shankar, Parul Dayal,
Karl Kumbier, Hien Nguyen, Bin Yu
Dr Prabhu Shankar
Dr Parul Dayal
Dr Hien Nguyen
Prof Bin Yu
(almost Dr) Rebecca Barter
Dr Karl Kumbier
Zimlichman et al. 2013, Merkow et al. 2015, Klevens et al. 2002, Magill et al. 2014
Cases per year
Deaths associated with SSI per year
ICU deaths are associated with SSI
Attributable cost per year to hospitals
Additional hospitalization for the average SSI patient
Closer monitoring
Preventative treatment
Take quicker action
If we can predict which patients are at high risk of SSI...
Better patient outcomes
Lower costs for hospitals
To generate a prediction... we need data!
30 days
Infection*?
All hospitals have compulsory SSI reporting to the CDC
National Healthcare Safety Network (NHSN)
*How do you define an infection?
Surgery
Patient
SSI
Vitals
Labs
Medications
Data was split across
for multiple types of data
Total: 26 Excel files, each with multiple sheets
The different sheets had slightly different column names, so initially everything beyond the first sheet was missing
Filenames differed slightly by year:
2014: Prob_List.xlsx
2015: Prob_list.xlsx
2016: Problem_list.xlsx
2017: Problem_List.xlsx
In 2017 there were two vitals datasets: *Vitals.xlsx and *Vitals_2
Combined within data-types (across sheets and years)
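A minimal sketch of how the filename and column-name mismatches above could be reconciled before stacking sheets. The stem mapping, the column-cleaning rule, and the example column names are illustrative assumptions, not the pipeline actually used.

```python
import re
import pandas as pd

# Hypothetical mapping from lower-cased filename stems to one canonical name.
CANONICAL_STEMS = {
    "prob_list": "problem_list",
    "problem_list": "problem_list",
}

def canonical_type(filename):
    """Map a year-specific filename to a single data-type name."""
    stem = re.sub(r"\.xlsx$", "", filename, flags=re.IGNORECASE).lower()
    return CANONICAL_STEMS.get(stem, stem)

def normalise_columns(df):
    """Lower-case column names and collapse spaces/punctuation to underscores."""
    out = df.copy()
    out.columns = [
        re.sub(r"[^0-9a-z]+", "_", str(c).strip().lower()).strip("_")
        for c in df.columns
    ]
    return out

def combine_sheets(sheets):
    """Stack sheets after harmonising their column names."""
    return pd.concat([normalise_columns(s) for s in sheets], ignore_index=True)

# Two toy "sheets" whose column names differ slightly (made-up data).
sheet1 = pd.DataFrame({"Patient ID": [1, 2], "Lab Value": [2.3, 2.0]})
sheet2 = pd.DataFrame({"patient_id": [3], "lab_value": [1.6]})
combined = combine_sheets([sheet1, sheet2])
print(combined)
```

With real files, `pd.read_excel(path, sheet_name=None)` returns a dict of all sheets in a workbook, whose values could be passed to `combine_sheets`.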
Goal: combine into a single covariate matrix
NHSN denominator data
39,174 rows and 44 variables
Lab EHR data
12,927,273 rows across 30 lab variables
Vitals EHR data
8,666,375 rows across 9 vitals measurements (including pulse and temperature)
Medication EHR data
7,637,621 rows across 50 medication classes
Lab data (long-form)
Lab data (wide-form)
Want one row per surgery...
Take average value in the 30 days pre- and post-surgery
Remove labs that have missing values for more than 30% of patients
Each row is a single surgery
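The long-to-wide step can be sketched with pandas as below. Column names (`patient_id`, `surgery_id`, `lab_name`, `lab_date`, `value`) are assumptions, as is the choice to count the day of surgery as "post".

```python
import numpy as np
import pandas as pd

def labs_to_wide(labs, surgeries, window_days=30):
    """Average each lab in the pre- and post-surgery windows, one row per surgery."""
    merged = labs.merge(surgeries, on="patient_id")
    merged["offset_days"] = (merged["lab_date"] - merged["surgery_date"]).dt.days
    # Keep only measurements within the window around the surgery date.
    merged = merged[merged["offset_days"].abs() <= window_days].copy()
    # Assumption: day 0 (the surgery date itself) is treated as "post".
    merged["period"] = np.where(merged["offset_days"] < 0, "pre", "post")
    wide = merged.pivot_table(
        index="surgery_id", columns=["lab_name", "period"],
        values="value", aggfunc="mean",
    )
    wide.columns = [f"{lab}_{period}" for lab, period in wide.columns]
    # Surgeries with no labs in the window become all-missing rows.
    return wide.reindex(surgeries["surgery_id"].tolist())

# Toy data (made up): one patient with labs, one without.
surgeries = pd.DataFrame({
    "surgery_id": [10, 11],
    "patient_id": [1, 2],
    "surgery_date": pd.to_datetime(["2016-03-01", "2016-05-01"]),
})
labs = pd.DataFrame({
    "patient_id": [1, 1, 1, 1],
    "lab_name": ["wbc"] * 4,
    "lab_date": pd.to_datetime(["2016-02-20", "2016-02-25", "2016-03-05", "2015-01-01"]),
    "value": [6.0, 8.0, 11.0, 99.0],
})
wide = labs_to_wide(labs, surgeries)
print(wide)
```

The measurement from 2015 falls outside the 30-day window and is ignored; the second surgery has no labs, so its row is entirely missing and is left for the imputation step.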
Impute missing values
Remove variables with fewer than 70% non-missing values
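The 70% rule can be expressed in a couple of lines. This is a sketch on a made-up data frame; the function name and threshold argument are illustrative.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, min_non_missing=0.70):
    """Keep only columns whose share of non-missing values meets the threshold."""
    share_observed = df.notna().mean()
    return df.loc[:, share_observed >= min_non_missing]

# Toy data: "a" is 80% observed, "b" is 60% observed, "c" is complete.
toy = pd.DataFrame({
    "a": [1.0] * 8 + [np.nan] * 2,
    "b": [1.0] * 6 + [np.nan] * 4,
    "c": [1.0] * 10,
})
filtered = drop_sparse_columns(toy)
print(list(filtered.columns))
```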
Imputation via Random Forest (missRanger)
Stekhoven, D. J. & Bühlmann, P. (2011) MissForest: non-parametric missing value imputation for mixed-type data.
Fit RF model
Plug into RF model
Run through all variables
Iterate until matrix change between iterations is small
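The loop described above (fit a forest on observed rows, plug predictions into the missing entries, cycle through all variables, repeat until the matrix stops changing) can be sketched in Python with scikit-learn. This is not missRanger itself: it handles numeric columns only, and the starting fill (column means), the tolerance, and the tree count are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_impute(X, max_iter=5, tol=1e-3, n_trees=50, seed=0):
    """Iterative random-forest imputation for a numeric matrix with NaNs."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from a simple guess: column means.
    X_imp = np.where(missing, np.nanmean(X, axis=0), X)
    # Visit columns from fewest to most missing values.
    order = np.argsort(missing.sum(axis=0))
    for _ in range(max_iter):
        X_prev = X_imp.copy()
        for j in order:
            miss_j = missing[:, j]
            if not miss_j.any() or miss_j.all():
                continue
            others = np.delete(X_imp, j, axis=1)
            rf = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
            rf.fit(others[~miss_j], X_imp[~miss_j, j])      # fit RF model
            X_imp[miss_j, j] = rf.predict(others[miss_j])   # plug into RF model
        # Stop once the relative change between iterations is small.
        change = np.sum((X_imp - X_prev) ** 2) / np.sum(X_imp ** 2)
        if change < tol:
            break
    return X_imp

# Synthetic check: three correlated columns with about 10% of entries removed.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X_full = np.column_stack([
    x1,
    2 * x1 + 0.1 * rng.normal(size=200),
    -x1 + 0.1 * rng.normal(size=200),
])
mask = rng.random(X_full.shape) < 0.1
X_missing = np.where(mask, np.nan, X_full)

X_hat = rf_impute(X_missing)
X_mean = np.where(mask, np.nanmean(X_missing, axis=0), X_missing)
rmse_rf = np.sqrt(np.mean((X_hat[mask] - X_full[mask]) ** 2))
rmse_mean = np.sqrt(np.mean((X_mean[mask] - X_full[mask]) ** 2))
print(rmse_rf, rmse_mean)
```

Observed entries are never overwritten; only the originally missing cells are updated on each pass.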
Number of surgeries: 37,881
Number of SSI cases: 790 (~2%)
Number of variables: 263
Variable importance (top 15)
AUC: 0.79
Single-model AUC: 0.77
Upsampling
Downsampling
Comparing many downsampling results with our aggregate approach
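One way to aggregate many balanced models is sketched below: each model is trained on all SSI cases plus an equally sized random draw of non-cases, and the predicted probabilities are averaged. The base learner (a random forest), the number of models, the averaging rule, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def fit_balanced_ensemble(X, y, n_models=20, n_trees=50, seed=0):
    """Fit one classifier per downsampled, class-balanced subset."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    models = []
    for m in range(n_models):
        # Downsample the majority class to match the number of positives.
        neg_sample = rng.choice(neg, size=len(pos), replace=False)
        idx = np.concatenate([pos, neg_sample])
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed + m)
        models.append(clf.fit(X[idx], y[idx]))
    return models

def predict_ensemble(models, X):
    """Average the positive-class probability across all models."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# Synthetic imbalanced data (not the SSI data): only feature 0 carries signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
p = 1 / (1 + np.exp(-(-3.5 + 1.5 * X[:, 0])))
y = (rng.random(2000) < p).astype(int)
X_train, y_train = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

models = fit_balanced_ensemble(X_train, y_train)
scores = predict_ensemble(models, X_test)
auc = roc_auc_score(y_test, scores)
print(f"ensemble AUC on synthetic test set: {auc:.2f}")
```

Because every model sees all positives but a different slice of negatives, averaging uses far more of the majority class than any single downsampled model does.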
NHSN-only AUC: 0.73
SSI is really hard to predict.
We incorporated EHR data
Our model has an AUC of 0.79 on the test set
Aggregating many balanced models seems to work very well (at least for this problem!)
...
We will soon be testing on a cohort from the Davis VA hospital
... and would like to develop a GUI for the Davis surgeons