Predictability, computability, and stability for interpretable and reproducible data science
Karl Kumbier
Joint work with Bin Yu
Natural phenomena
Supervised learning
Domain insights
Prediction error
Interpretation error
Joint work with Sumanta Basu, James B. Brown, and Bin Yu
iterative Random Forests (iRF) build on PCS to identify genomic interactions in developing Drosophila embryos
Sampling features in proportion to their importance acts as soft regularization and maintains the predictive accuracy of RF
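A minimal sketch of importance-weighted feature sampling, to make the "soft regularization" idea concrete. The function name `sample_features`, the weight values, and the number of candidates per split are illustrative assumptions, not the iRF implementation. Features with low importance are rarely offered as split candidates, but only features with zero weight are excluded outright.

```python
import random

def sample_features(weights, k, rng):
    """Draw up to k distinct feature indices, each draw proportional to its weight."""
    # Features with zero weight can never be selected.
    candidates = [i for i, w in enumerate(weights) if w > 0]
    chosen = []
    while candidates and len(chosen) < k:
        pick = rng.choices(
            candidates, weights=[weights[i] for i in candidates], k=1
        )[0]
        chosen.append(pick)
        candidates.remove(pick)  # sample without replacement
    return chosen

rng = random.Random(0)
# Hypothetical importances from a previous forest (e.g. normalized Gini importance).
weights = [0.50, 0.30, 0.15, 0.05, 0.0]

# Count how often each feature is offered as a split candidate.
counts = [0] * len(weights)
for _ in range(2000):
    for j in sample_features(weights, 2, rng):
        counts[j] += 1

print(counts)
```

In this sketch the weights stay fixed. In an iterative scheme they would be replaced by the importances of the forest just fitted before the next forest is grown.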
Gini importance
Features selected on RF decision paths
Intersect features on randomly selected decision paths to identify frequently co-occurring combinations
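The intersection step can be sketched as follows. This is a simplified illustration, assuming each decision path has already been reduced to the set of features it splits on. The toy paths, the `depth` and `n_draws` parameters, and the function name are hypothetical and do not reproduce the authors' procedure.

```python
import random
from collections import Counter

def random_intersections(paths, depth, n_draws, rng):
    """Repeatedly intersect `depth` randomly chosen paths and count the survivors."""
    counts = Counter()
    for _ in range(n_draws):
        survivors = set(rng.choice(paths))
        for _ in range(depth - 1):
            survivors &= rng.choice(paths)
            if not survivors:
                break
        # Keep only combinations of two or more features as candidate interactions.
        if len(survivors) >= 2:
            counts[frozenset(survivors)] += 1
    return counts

# Toy decision paths: the set of features used along each root-to-leaf path.
paths = [
    {"A", "B", "C"},
    {"A", "B", "D"},
    {"A", "B"},
    {"A", "B", "E"},
    {"C", "F"},
]

rng = random.Random(1)
counts = random_intersections(paths, depth=2, n_draws=500, rng=rng)
top_interaction, top_count = counts.most_common(1)[0]
print(sorted(top_interaction), top_count)
```

Features that appear together on many paths survive repeated intersection, so frequently co-occurring combinations dominate the counts while incidental features drop out.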
Outer-layer bootstrap samples evaluate how consistently interactions are recovered across bootstrap perturbations of the data
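One way to express this stability check is to score each interaction by the fraction of bootstrap samples in which it is recovered. The sketch below assumes a stand-in recovery rule (`frequent_pairs`, which keeps feature pairs active together in at least half of the resampled observations). In practice the recovery step would be the full fit-and-intersect procedure. All names and data here are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations

def frequent_pairs(boot, min_frac=0.5):
    """Stand-in recovery rule: pairs co-occurring in at least min_frac of observations."""
    counts = Counter()
    for active in boot:
        for pair in combinations(sorted(active), 2):
            counts[pair] += 1
    return {p for p, c in counts.items() if c / len(boot) >= min_frac}

def stability_scores(samples, recover, n_boot, rng):
    """Fraction of bootstrap replicates in which each interaction is recovered."""
    hits = Counter()
    n = len(samples)
    for _ in range(n_boot):
        # Resample observations with replacement.
        boot = [samples[rng.randrange(n)] for _ in range(n)]
        for interaction in recover(boot):
            hits[interaction] += 1
    return {k: v / n_boot for k, v in hits.items()}

# Toy data: each observation is the set of features that are "active" for it.
samples = [{"A", "B"}] * 8 + [{"A", "B", "C", "D"}] * 10 + [{"E"}] * 2

rng = random.Random(2)
scores = stability_scores(samples, frequent_pairs, n_boot=200, rng=rng)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

An interaction that is recovered in nearly every replicate is robust to perturbation of the data. One that appears in only some replicates is a weaker candidate for follow-up.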
ovary, liver, stomach, pancreas, lung, breast, esophagus, and colorectum
61 genetic mutations and 39 protein biomarkers
Probability of cancer
Decision rule
iRF interaction
Rule-based models successfully diagnose an additional 123/1000 patients under domain-specific constraints
Additional 12 / 100 patients successfully diagnosed
Additional 11 / 100 patients successfully diagnosed
Interpretable machine learning: The extraction of relevant knowledge about domain relationships contained in data.
Relevant knowledge: provides insight for a particular audience into a chosen problem. These insights guide actions, discovery, and communication.
Encourage our model to learn interpretable relationships that can be [easily] read off by a practitioner.
Allow our model to learn complex relationships and develop methods to extract portions of these relationships.
Graphic for prediction accuracy vs interpretation accuracy
Iterative re-weighting encourages the model to use a sparse set of stable features
Recovered interactions represent predictive, simulatable "modules" (almost)
Biomarkers with known associations with cancer