Health Data Science Meetup

Introduction

Overview of Data Science Pipeline

Python Crash Course and Common Packages

Time: 9/26 9:00-11:00 AM; Topic: Data Science Pipeline, Python crash course, common packages
Time: 10/10 9:00-11:00 AM; Topic: Shrinkage and Regularization: Lasso, Ridge, Elastic Net
Time: 10/24 9:00-11:00 AM: Topic: Tree-based Methods: CART, Random Forests, GBDT & Ensembles: Voting, Bagging, Boosting
Time: 11/7 9:00-11:00 AM; Topic: Support Vector Machines
Time: 11/21 9:00-11:00 AM; Topic: Neural Network: MLP, CNN
Time: 12/5 9:00-11:00 AM: Topic: Neural Network: RNN

Raw Data

Data Wrangling

Explore

Model Selection

Feature Engineering

Train Model

Evaluate Performance

Data Product

Decomposition of Error

The test error can be calculated if a test dataset is available.
Unfortunately, this is usually not the case.
We don’t have a very large designated test dataset that can be used to directly estimate the test error rate in most time.
Cross-validation:
A method that can estimate the test error rate by holding out a subset of the training observations from the fitting process, and then applying the trained method to those held out observations