Health Data Science Meetup

September 26, 2016



Overview of Data Science Pipeline


Python Crash Course and Common Packages

A place where we can

  • share tips and code

  • explore new areas of the health data science landscape

  • take on challenges in online competitions such as Kaggle and the DREAM Challenges

What to expect from the meetup?

  1. Time: 9/26, 9:00-11:00 AM; Topic: Data Science Pipeline, Python Crash Course, Common Packages
  2. Time: 10/10, 9:00-11:00 AM; Topic: Shrinkage and Regularization: Lasso, Ridge, Elastic Net
  3. Time: 10/24, 9:00-11:00 AM; Topic: Tree-based Methods: CART, Random Forests, GBDT & Ensembles: Voting, Bagging, Boosting
  4. Time: 11/7, 9:00-11:00 AM; Topic: Support Vector Machines
  5. Time: 11/21, 9:00-11:00 AM; Topic: Neural Networks: MLP, CNN
  6. Time: 12/5, 9:00-11:00 AM; Topic: Neural Networks: RNN

Raw Data

Data Wrangling


Model Selection

Feature Engineering

Train Model

Evaluate Performance

Data Product

Data Science Pipeline

Inferential Model


Predictive Model

Decomposition of Error

Bias-Variance Trade-off
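The decomposition named on this slide is usually written as follows. This is a sketch under standard textbook assumptions rather than a formula taken from the slides. The symbols are all assumptions introduced here: the data follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², \hat{f} is the model fitted on a random training set, and (x_0, y_0) is a new test observation.

```latex
\mathbb{E}\!\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \underbrace{\operatorname{Var}\!\left(\hat{f}(x_0)\right)}_{\text{variance}}
  + \underbrace{\left(\mathbb{E}\!\left[\hat{f}(x_0)\right] - f(x_0)\right)^2}_{\text{bias}^2}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

More flexible models tend to lower the bias term and raise the variance term, which is the trade-off. The σ² term cannot be reduced by any choice of model.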


  • The test error can be estimated directly if a designated test dataset is available.
  • Unfortunately, this is usually not the case.
    Most of the time, we do not have a large designated test dataset from which to estimate the test error rate directly.
  • Cross-validation:
    A method that estimates the test error rate by holding out a subset of the training observations from the fitting process, and then applying the fitted model to those held-out observations.
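The hold-out idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the meetup's own code. The helper names (`k_fold_indices`, `fit_line`, `cv_mse`), the simple straight-line model, and the synthetic data are all assumptions made for the example. Each fold is held out once, the model is fitted on the remaining observations, and the held-out mean squared errors are averaged.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Closed-form least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cv_mse(xs, ys, k=5, seed=0):
    """Estimate the test MSE by k-fold cross-validation."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on the observations that are NOT held out.
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        # Score only on the held-out observations.
        squared = [(ys[i] - (intercept + slope * xs[i])) ** 2
                   for i in held_out]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


# Synthetic data (an assumption for illustration): y = 2 + 3x + unit noise.
rng = random.Random(42)
xs = [i / 10.0 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

print("5-fold CV estimate of test MSE:", cv_mse(xs, ys, k=5))
```

Because the noise here has variance 1, the estimate should land near 1. In practice a library routine such as scikit-learn's `cross_val_score` would normally replace the hand-written loop.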






k-fold CV

Tune Models

Evaluate Performance


  • Install Python 2
  • Install numpy
    pip install numpy
  • Install pandas
    pip install pandas
  • Install matplotlib
    pip install matplotlib
  • Install scikit-learn
    pip install scikit-learn
  • Install jupyter notebook
    pip install jupyter
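Once the installs above have finished, a quick check like the following can confirm that each package imports. It is a small sketch, and the `check_packages` helper is a hypothetical name introduced here. Note that scikit-learn is imported as `sklearn`. Jupyter is left out of the check because it is normally started from the command line with `jupyter notebook`.

```python
import importlib

# Import names for the packages listed above; scikit-learn's import
# name is "sklearn", not "scikit-learn".
PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn"]


def check_packages(names):
    """Return {name: version string, or None if the import fails}."""
    status = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            status[name] = None
        else:
            # Not every module defines __version__, so fall back gracefully.
            status[name] = getattr(module, "__version__", "unknown")
    return status


for name, version in sorted(check_packages(PACKAGES).items()):
    print("%-12s %s" % (name, version if version else "NOT INSTALLED"))
```

Any package reported as missing can be installed again with the matching `pip install` line.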

HDS Meetup 9/26/2016

By Hui Hu

Slides for the Health Data Science Meetup
