Health Data Science Meetup

September 26, 2016

Introduction

 

Overview of Data Science Pipeline

 

Python Crash Course and Common Packages

A place where we can

  • share tips and code

  • explore new areas of the health data science landscape

  • take on online competitions such as Kaggle and the DREAM Challenges

What to expect from the meetup?

  1. Time: 9/26   9:00-11:00 AM; Topic: Data Science Pipeline, Python crash course, common packages
     
  2. Time: 10/10 9:00-11:00 AM; Topic: Shrinkage and Regularization: Lasso, Ridge, Elastic Net
     
  3. Time: 10/24 9:00-11:00 AM; Topic: Tree-based Methods: CART, Random Forests, GBDT & Ensembles: Voting, Bagging, Boosting
     
  4. Time: 11/7   9:00-11:00 AM; Topic: Support Vector Machines
     
  5. Time: 11/21 9:00-11:00 AM; Topic: Neural Network: MLP, CNN
     
  6. Time: 12/5   9:00-11:00 AM; Topic: Neural Network: RNN

Raw Data

Data Wrangling

Explore

Model Selection

Feature Engineering

Train Model

Evaluate Performance

Data Product

Data Science Pipeline

Inferential Model

vs

Predictive Model

Decomposition of Error

Bias-Variance Trade-off
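
For reference, the usual textbook form of this decomposition is sketched below. The notation is an assumption and does not come from the slides: the data follow y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and f̂ is the model fitted on a random training set.

```latex
E\big[(y_0 - \hat{f}(x_0))^2\big]
  = \big(E[\hat{f}(x_0)] - f(x_0)\big)^2
  + \mathrm{Var}\big(\hat{f}(x_0)\big)
  + \sigma^2
```

The three terms are the squared bias, the variance, and the irreducible error. More flexible models tend to lower the first term and raise the second, which is the trade-off named above.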

Cross-Validation

  • The test error can be calculated if a test dataset is available.
  • Unfortunately, this is usually not the case.
    In most cases, we do not have a large designated test dataset that can be used to estimate the test error rate directly.
     
  • Cross-validation:
    A method that estimates the test error rate by holding out a subset of the training observations from the fitting process, then applying the fitted model to those held-out observations.
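
The hold-out idea can be sketched with the standard library alone. This is a minimal illustration, not code from the meetup: the helper names (`k_fold_indices`, `fit_line`, `cv_mse`), the one-feature least-squares model, and the synthetic data are all assumptions chosen to keep the example small.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single feature."""
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cv_mse(xs, ys, k=5, seed=0):
    """Average mean squared error on the held-out fold, over k folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on observations outside the held-out fold
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score only on the held-out observations
        sq_err = sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
        fold_errors.append(sq_err / float(len(held_out)))
    return sum(fold_errors) / float(k)

# Synthetic data, made up for illustration: y = 2x + 1 plus Gaussian noise
rng = random.Random(42)
xs = [i / 10.0 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

cv_estimate = cv_mse(xs, ys, k=5)
print("5-fold CV estimate of test MSE: %.3f" % cv_estimate)
```

Each observation is held out exactly once, so every data point contributes to both fitting and evaluation, but never within the same fold.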

 

Training (80%): k-fold CV, Tune Models

Testing (20%): Evaluate Performance
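
One way this workflow might look in scikit-learn is sketched below: an 80/20 split, 5-fold cross-validation on the training portion to tune a model, and a single evaluation on the held-out test portion. The synthetic dataset, the choice of logistic regression, and the grid of `C` values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; a real project would load its own dataset here
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% of the observations for the final test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Tune the regularization strength C with 5-fold CV on the training data only
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model once on the untouched test set
test_accuracy = search.score(X_test, y_test)
print("best C: %s" % search.best_params_["C"])
print("test accuracy: %.3f" % test_accuracy)
```

Keeping the test portion out of the tuning loop is the point of the split: the final score is then not biased by the choice of hyperparameters.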

Python

  • Install Python 2
  • Install numpy
    pip install numpy
  • Install pandas
    pip install pandas
  • Install matplotlib
    pip install matplotlib
  • Install scikit-learn
    pip install scikit-learn
  • Install jupyter notebook
    pip install jupyter
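
After installing, a short script along these lines can confirm that the libraries import. It is a sketch: the helper name `check_packages` is hypothetical, and Jupyter is left out because it is used as a command-line tool rather than imported.

```python
import importlib

# Import names for the libraries listed above (scikit-learn imports as "sklearn")
PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn"]

def check_packages(names):
    """Return {import name: version string, or None if the import fails}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            found[name] = None
    return found

status = check_packages(PACKAGES)
for name in PACKAGES:
    print("%-12s %s" % (name, status[name] or "NOT INSTALLED"))
```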

HDS Meetup 9/26/2016

By Hui Hu

Slides for the Health Data Science Meetup
