Predicting Surgical Site Infections (SSI) using Electronic Medical Records
Rebecca Barter, Prabhu Shankar, Parul Dayal,
Karl Kumbier, Hien Nguyen, Bin Yu


A collaborative journey







Dr Prabhu Shankar
Dr Parul Dayal
Dr Hien Nguyen

Prof Bin Yu
(almost Dr) Rebecca Barter

Dr Karl Kumbier
Zimlichman et al. 2013, Merkow et al. 2015, Klevens et al. 2002, Magill et al. 2014
Surgical Site Infections
~160,000
Cases per year
>8000
Deaths associated with SSI per year
11%
of ICU deaths are associated with SSI
$3.2 billion
Attributable cost per year to hospitals
11 days
Additional hospitalization for the average SSI patient
Surgical Site Infections
Closer monitoring
Preventative treatment
Take quicker action
If we can predict which patients are at high risk of SSI...
Better patient outcomes
Lower costs for hospitals
To generate a prediction... we need data!
SSI Surveillance by the CDC

30 days
Infection*?
All hospitals have compulsory SSI reporting to the CDC
National Healthcare Safety Network (NHSN)
*How do you define an infection?



Current approaches: NHSN surveillance data (surgery, patient, SSI)
Our idea: NHSN surveillance data + EHR data (vitals, labs, medications, ...)
Obtaining NHSN and EHR data
The data is large and messy




Data was split across
- one file per year (2014-2017)
- multiple sheets within each Excel file
for multiple types of data
- Labs
- Medications
- Previous diagnoses
- Problem list
- Vitals
- NHSN file (patient + surgery info)
Total: 26 Excel files, each with multiple sheets

Combining the data was tricky...
The different sheets had slightly different column names, so on the first attempt at combining them everything beyond the first sheet came through as missing
Filenames differed slightly by year:
- 2014: Prob_List.xlsx
- 2015: Prob_list.xlsx
- 2016: Problem_list.xlsx
- 2017: Problem_List.xlsx
In 2017 there were two vitals datasets: *Vitals.xlsx and *Vitals_2
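One way to guard against these naming differences is to normalise names before combining. The sketch below is illustrative only and is not the code used for this project: the helper names (`canonical_column`, `is_problem_list`, `combine_sheets`) and the example column names are hypothetical, and each sheet is assumed to have already been read into a list of row dictionaries.

```python
import re


def canonical_column(name):
    """Lowercase a column name and collapse spaces/punctuation to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def is_problem_list(filename):
    """Match Prob_List, Prob_list, Problem_list and Problem_List, ignoring case."""
    return re.search(r"prob(lem)?_list", filename, flags=re.IGNORECASE) is not None


def combine_sheets(sheets):
    """Stack sheets (lists of row dicts) after harmonising their column names."""
    combined = []
    for rows in sheets:
        for row in rows:
            combined.append({canonical_column(key): value for key, value in row.items()})
    return combined


# Hypothetical sheets whose headers differ only in case and spacing
sheet_1 = [{"Patient ID": 1, "Lab Value": 2.3}]
sheet_2 = [{"patient_id": 2, "lab_value ": 1.6}]
combined = combine_sheets([sheet_1, sheet_2])
```

With harmonised names, rows from later sheets land in the same columns as the first sheet instead of showing up as missing.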
Defining complete datasets
Combined within data-types (across sheets and years)
Goal: combine into a single covariate matrix
NHSN denominator data
39,174 rows and 44 variables
Lab EHR data
12,927,273 rows across 30 lab variables
Vitals EHR data
8,666,375 rows across 9 vitals measurements (including pulse and temperature)
Medication EHR data
7,637,621 rows across 50 medication classes
Converting from long to wide form

Lab data
(long-form)

Lab data
(wide-form)
Want one row per surgery...
Summarising irregularly measured variables



Take average value in the 30 days pre- and post-surgery
Remove labs that have missing values for more than 30% of patients
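A minimal sketch of the long-to-wide summary, under stated assumptions: separate pre- and post-surgery averages are kept as two columns per lab, the day of surgery counts as "post", and the tuple layouts and the name `summarise_labs` are hypothetical. The slides do not specify these details.

```python
from collections import defaultdict
from datetime import date


def summarise_labs(surgeries, lab_rows, window_days=30):
    """Return one dict per surgery with mean lab values before and after surgery.

    surgeries: list of (surgery_id, patient_id, surgery_date)
    lab_rows:  long-form list of (patient_id, lab_name, measured_on, value)
    """
    by_patient = defaultdict(list)
    for patient_id, lab, measured_on, value in lab_rows:
        by_patient[patient_id].append((lab, measured_on, value))

    wide = []
    for surgery_id, patient_id, surgery_date in surgeries:
        buckets = defaultdict(list)
        for lab, measured_on, value in by_patient[patient_id]:
            offset = (measured_on - surgery_date).days
            if -window_days <= offset < 0:
                buckets[f"{lab}_pre"].append(value)
            elif 0 <= offset <= window_days:  # assumption: surgery day counts as post
                buckets[f"{lab}_post"].append(value)
        row = {"surgery_id": surgery_id}
        row.update({name: sum(vals) / len(vals) for name, vals in buckets.items()})
        wide.append(row)
    return wide


# Hypothetical example: one surgery, four measurements of one lab
surgeries = [("s1", "p1", date(2015, 3, 10))]
labs = [
    ("p1", "wbc", date(2015, 3, 1), 8.0),
    ("p1", "wbc", date(2015, 3, 5), 10.0),
    ("p1", "wbc", date(2015, 3, 12), 12.0),
    ("p1", "wbc", date(2014, 1, 1), 99.0),  # outside the window, ignored
]
wide = summarise_labs(surgeries, labs)
```

Labs with no measurement in a window simply have no column in that row, which is what later shows up as a missing value.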
Combining by patient ID and date

Each row is a single surgery
Handling missing data

Impute missing values
Remove variables with fewer than 70% non-missing values
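The 70% rule can be sketched as below. This is an illustration rather than the project code: rows are assumed to be dictionaries with `None` (or an absent key) marking a missing value, and `drop_sparse_columns` is a hypothetical name.

```python
def drop_sparse_columns(rows, min_observed=0.7):
    """Keep only variables observed in at least `min_observed` of the rows."""
    columns = {name for row in rows for name in row}
    keep = set()
    for name in columns:
        observed = sum(1 for row in rows if row.get(name) is not None)
        if observed / len(rows) >= min_observed:
            keep.add(name)
    # Rebuild every row with the surviving variables only
    return [{name: row.get(name) for name in sorted(keep)} for row in rows]


# Hypothetical example: "a" is complete, "b" is 70% observed, "c" is 60% observed
rows = [
    {"a": 1, "b": 1 if i < 7 else None, "c": 1 if i < 6 else None}
    for i in range(10)
]
filtered = drop_sparse_columns(rows)
```

Here "c" is dropped because it has fewer than 70% non-missing values, while "b" sits exactly at the threshold and is kept.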
Data imputation
Imputation via random forest (missRanger)
Stekhoven, D. J. & Bühlmann, P. (2011). MissForest: non-parametric missing value imputation for mixed-type data.




Fit RF model
Plug into RF model
Run through all variables
Iterate until matrix change between iterations is small
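The loop itself (initial guess, fit a model per variable, plug predictions in, repeat until the matrix stops changing) can be sketched in a few lines. Important caveat: to stay self-contained, this sketch swaps the random forest for a simple k-nearest-neighbour stand-in, so it shows the iteration scheme only and is not missRanger or missForest. All names are hypothetical, and every column is assumed to be numeric with at least one observed value.

```python
def knn_predict(train_x, train_y, query, k=3):
    """Stand-in predictor (NOT a random forest): mean target of the k nearest rows."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


def without_column(row, j):
    """Return the row with column j removed (the predictors for column j)."""
    return row[:j] + row[j + 1:]


def iterative_impute(matrix, max_iter=10, tol=1e-6):
    """Impute None entries by cycling through the variables until changes are small."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    missing = [[value is None for value in row] for row in matrix]
    filled = [list(row) for row in matrix]

    # Initial guess: mean of the observed values in each column
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        mean = sum(observed) / len(observed)
        for i in range(n_rows):
            if missing[i][j]:
                filled[i][j] = mean

    for _ in range(max_iter):
        change = 0.0
        for j in range(n_cols):  # run through all variables
            train = [i for i in range(n_rows) if not missing[i][j]]
            train_x = [without_column(filled[i], j) for i in train]
            train_y = [filled[i][j] for i in train]
            for i in range(n_rows):
                if missing[i][j]:  # plug the model's prediction into the gap
                    new = knn_predict(train_x, train_y, without_column(filled[i], j))
                    change += (new - filled[i][j]) ** 2
                    filled[i][j] = new
        if change < tol:  # stop once the matrix barely changes
            break
    return filled


# Hypothetical example: the second column is missing for one row
data = [[1, 2], [2, 4], [3, 6], [4, None], [5, 10]]
imputed = iterative_impute(data)
```

Observed entries are never overwritten; only the originally missing cells are updated on each pass.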
Data oddities
- Four patients were reported with SSI, but their surgeries did not appear in the main NHSN dataset
  - Decided to remove them
- Surgery times were missing for 13% of patients
  - Discussion and exploration revealed that UCD switched to an Epic EHR system in July 2014
  - Decided to remove all patients before July 2014
- Diagnosis codes were a mixture of ICD9 and ICD10
  - Converted all diagnosis codes to ICD9
- Mistakes in definitions in the codebook
  - Asked collaborators for clarifications
- Some lab values were >10000 while all other values were in [0, 20]
  - Did some research to discover that these values are possible...
The final covariate matrix
Number of surgeries: 37,881
Number of SSI cases: 790 (~2%)
Number of variables: 263








Predicting SSI
SSI is a rare event
Repeated balanced subsampling
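A rough sketch of repeated balanced subsampling, under assumptions the slides do not spell out: every SSI case is kept in each subsample, and an equal number of non-SSI surgeries is drawn without replacement. The function name and the toy labels are hypothetical.

```python
import random


def balanced_subsamples(labels, n_subsamples, seed=0):
    """Yield index lists containing all positives plus equally many random negatives."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    for _ in range(n_subsamples):
        drawn = rng.sample(negatives, len(positives))  # sampled without replacement
        yield positives + drawn


# Hypothetical toy labels with a 5% event rate
labels = [1] * 5 + [0] * 95
subsamples = list(balanced_subsamples(labels, n_subsamples=3))
```

Each subsample is 50/50 by construction, and a separate model can then be fit on each one.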




Feature selection

Variable importance
Feature selection


Aggregated RF model
(top 15)
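Picking the top 15 variables from one model's importance scores could look like the sketch below. The importance values and variable names are made up for illustration, and `top_k_features` is a hypothetical helper.

```python
def top_k_features(importances, k=15):
    """Return the k variable names with the largest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical importance scores from one balanced-subsample model
importances = {"wbc_post": 0.31, "pulse_pre": 0.05, "duration": 0.22, "age": 0.12}
selected = top_k_features(importances, k=2)
```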

Predicting SSI

(top 15)
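The slides do not state how the per-subsample models are combined. One common choice, assumed here purely for illustration, is to average the predicted SSI probabilities across models; `aggregate_predictions` is a hypothetical name.

```python
def aggregate_predictions(per_model_probs):
    """Average predicted probabilities across models, one value per surgery.

    per_model_probs: list with one list of probabilities per model,
    all in the same surgery order.
    """
    n_models = len(per_model_probs)
    return [sum(column) / n_models for column in zip(*per_model_probs)]


# Hypothetical predictions from two models for two surgeries
aggregated = aggregate_predictions([[0.25, 0.5], [0.75, 1.0]])
```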
Performance Evaluation
Performance on test set


AUC 0.79
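For reference, AUC is the probability that a randomly chosen SSI case receives a higher predicted score than a randomly chosen non-case, with ties counted as one half. The sketch below computes it directly from that definition on made-up numbers; it is quadratic in the sample size, so it is meant for understanding rather than for large test sets.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores
example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because it only depends on the ranking of scores, AUC is not distorted by the ~2% SSI rate in the way raw accuracy would be.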

Comparing alternative approaches: single model



single-model
AUC: 0.77
Comparing alternative approaches: Up/Downsampling

Upsampling

Downsampling


Comparing many downsampling results with our aggregate approach
Comparing alternative approaches: NHSN variables only

NHSN-only
AUC: 0.73
Comparing alternative approaches: grouping procedures by risk

Comparing alternative approaches: SVM and logistic regression

Conclusion
Takeaways
SSI is really hard to predict.
We incorporated EHR data, which improved on NHSN variables alone (test AUC 0.79 vs 0.73)
Our model has an AUC of 0.79 on the test set
Aggregating many balanced models seems to work very well (at least for this problem!)
...
We will soon be testing on a cohort from the Davis VA hospital
... and would like to develop a GUI for the Davis surgeons
Yu Group presentation: Predicting SSI
By Rebecca Barter