Rebecca Barter
Making the raw data useable
Understanding the data
Obtaining more data
Modeling approaches
Data was split across
for multiple types of data
Total: 26 excel files, each with multiple sheets
Wrote an R script (01_combine_year_separated_data.R) that automatically combined the data across years
The different sheets had slightly different column names, so initially everything beyond the first sheet was missing
Filenames differed slightly by year:
2014: Prob_List.xlsx
2015: Prob_list.xlsx
2016: Problem_list.xlsx
2017: Problem_List.xlsx
In 2017 there were two vitals datasets: *Vitals.xlsx and *Vitals_2
PATNUM: patient
ADMISSION_ENCNUM:
Hospitalization + surgery
ENCNUM: hospitalization
PROCID:
procedure undertaken
So what is the unit of interest?
ADMISSION_ENCNUM
PATNUM
ADMISSION_ENCNUM
ENCNUM
ENCNUM
ENCNUM
PROCID
PROCID
What ID should we use to join data?
PATNUM
Surgeon characteristics (age & experience)
Surgical category
Elixhauser categories of diagnoses
Medical definitions of normal lab ranges by gender
Medication categories
(Hopefully) medication dispensing time
| Surgical encounter ID | Procedure | Average serum creatinine 1 week pre-surgery | Minimum white blood cell count 1 week pre-surgery | Maximum temperature 2 days pre-surgery | antibiotics prescribed 1 day pre-surgery | SSI |
|---|---|---|---|---|---|---|
| 1 | Cardiac surgery | 0.4 | 7.9 | 98.2 | TRUE | FALSE |
| 2 | Colon surgery | 3.3 | 12.1 | 99.1 | FALSE | FALSE |
| 3 | Spinal fusion | 2.1 | 6.8 | 102.3 | FALSE | TRUE |
| 4 | Cesarean section | 1.4 | 8.3 | 97.4 | FALSE | FALSE |
| 5 | Abdominal surgery | 2.0 | 8.1 | 98.9 | TRUE | FALSE |
| 6 | Rectal surgery | 2.9 | 10.0 | 100.2 | FALSE | FALSE |
| 7 | Colon surgery | 1.5 | 12.5 | 101.1 | FALSE | TRUE |
Logistic regression, random forest, etc
Need to have well-defined predictor variables
train/ssi/33098396_19385283.txt
Nguyen et al. (2016), Deepr