Three principles of data science:

predictability, computability, and stability

Bin Yu and Karl Kumbier

 

The data science life cycle guides decision making

Stability assumptions initiate the data science life cycle

  • Formulating the domain question or problem
  • Collecting data
  • Cleaning and preprocessing data
  • Exploratory data analysis

Does an alternative "appropriate" analysis produce similar findings to the performed analysis?

The PCS framework to communicate and evaluate human judgment calls

  • Computability: Can I tractably build/train my model?
    • Computational constraints
  • Predictability: Does my model capture external reality?
    • Prediction & evaluation functions
    • Internal vs. external testing data
  • Stability: Are my results consistent with respect to "reasonable" perturbations?
    • Stability target
    • Data/model perturbations, generative models
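
The stability question above can be made concrete with a small sketch. It is not taken from the slides: it assumes the stability target is a regression slope, that the "reasonable" data perturbation is a bootstrap resample, and that the data are synthetic. The function names (ols_slope, bootstrap_targets) are hypothetical.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of y on x (the assumed stability target)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def bootstrap_targets(xs, ys, n_perturb=200, seed=0):
    """Recompute the target on bootstrap resamples (the assumed data perturbation)."""
    rng = random.Random(seed)
    n = len(xs)
    targets = []
    for _ in range(n_perturb):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        targets.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return targets

# Synthetic data for illustration only: y = 2x + Gaussian noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(300)]
ys = [2.0 * x + data_rng.gauss(0, 1) for x in xs]

slopes = bootstrap_targets(xs, ys)
spread = max(slopes) - min(slopes)  # small spread suggests a stable target
print(f"mean slope {statistics.fmean(slopes):.3f}, spread {spread:.3f}")
```

A narrow spread across perturbations suggests the target is stable under this particular perturbation; which perturbations count as "reasonable" remains a judgment call to document.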

PCS inference: evaluating uncertainty with justified perturbations

  1. Formulate problem (e.g. target of interest, perturbations)
  2. Screen out models with low prediction accuracy
  3. Generate target value perturbation distributions
  4. Summarize target value perturbation distributions
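
The four steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the target is a slope, the perturbations are bootstrap resamples crossed with a choice among three simple models, the screening rule (held-out error within 1.5 times the best) is a hypothetical threshold, and the data are synthetic.

```python
import random
import statistics

# Candidate models (assumed for illustration); each returns (slope, intercept).
def fit_ols(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def fit_origin(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs), 0.0

def fit_constant(xs, ys):
    return 0.0, statistics.fmean(ys)

def mse(params, xs, ys):
    slope, intercept = params
    return statistics.fmean([(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)])

# Step 1: formulate the problem. Target: slope. Perturbations: bootstrap x model choice.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(400)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]  # synthetic data, illustration only
train_x, test_x = xs[:300], xs[300:]
train_y, test_y = ys[:300], ys[300:]
models = {"ols": fit_ols, "origin": fit_origin, "constant": fit_constant}

# Step 2: screen out models with low held-out prediction accuracy.
errors = {name: mse(fit(train_x, train_y), test_x, test_y) for name, fit in models.items()}
best = min(errors.values())
kept = [name for name, err in errors.items() if err <= 1.5 * best]  # hypothetical rule

# Step 3: generate the target's perturbation distribution over the kept models.
targets = []
n_train = len(train_x)
for name in kept:
    for _ in range(100):
        idx = [rng.randrange(n_train) for _ in range(n_train)]
        bx = [train_x[i] for i in idx]
        by = [train_y[i] for i in idx]
        targets.append(models[name](bx, by)[0])

# Step 4: summarize the distribution, here with a rough 90% percentile interval.
targets.sort()
lower = targets[int(0.05 * len(targets))]
upper = targets[int(0.95 * len(targets)) - 1]
print(f"kept models: {kept}; slope interval: [{lower:.3f}, {upper:.3f}]")
```

Pooling the target over both data and model perturbations is what lets the interval reflect judgment calls about the model as well as sampling variability; the screening threshold and the number of resamples are choices that would themselves need justification.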

Feature selection in linear model setting: simulation setup

Feature selection in linear model setting: simulation results (n = 1000)

PCS documentation transparently reports human judgment calls

  1. Domain problem formulation (narrative)
  2. Data collection and storage (narrative)
  3. Data cleaning and visualization (narrative, code, visualizations)
  4. PCS inference (narrative, code, visualizations)
  5. Conclusions/recommendations (narrative, visualizations)

PCS discussion
