Ishanu Chattopadhyay

University of Chicago

Machine Learning & Advanced Analytics for Biomedicine

CCTS 40500 / CCTS 20500 / BIOS 29208
Winter 2023

Contact

Room: BSLC 313

Monday 9.30 AM - 12.20 PM

Resources

https://github.com/zeroknowledgediscovery/course_notes

RCC Midway

Expectations

  1. You should be able to model complex data on your own
  2. Choose the right framework for the right parameters, for the right reasons
  3. Know the limitations and the strengths of your model

Grading

  • I grade on progress and effort
  • Innovative approaches get more credit
  • There will be homeworks, not weekly but periodically

Midterm and Final Project

  • Midterm and finals are not in-class exams
  • Again, effort and innovation get more credit

Class Time

MON 9.30 AM

(~3 hrs)

 

FRI  9.00 AM

(0.5-1 hr if you have questions)

Today's Take-Home Message

What is Machine Learning?

Why is it everywhere?

Why is it important to biomedicine?

Are we really solving the hard problems?

What is Machine Learning?

Learning from machines?

Learning with the help of computers?

Modeling data?

Regression?

data -> (intelligent) automated analysis -> actionable insights

How is Machine Learning different from...

Statistics

AI

Data Mining

Deep Learning

“Machine learning is essentially a form of applied statistics”

 

“Machine learning is statistics scaled up to big data”

 

“Machine learning is Statistics minus any checking of models and assumptions.”


“I don’t know what Machine Learning will look like in ten years, but whatever it is I’m sure Statisticians will be whining that they did it earlier and better.”

Approach to a problem differs between mathematicians, statisticians & ML-experts

  • Central Limit Theorem
  • Measure Theory
  • Stochastic Processes
  • Linear Regression
  • General Linear Models
  • What is the "correct" statistical model for a problem/process ?
  • Often interest is "describing" data already observed
  • No model is correct. 
  • The useful ones predict correctly more often than others
  • ONLY interested in how well a model works on unseen data
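
The last point, that only performance on unseen data matters, is usually checked by holding data out. The sketch below is illustrative only: the synthetic data, the 20% label noise, and the two toy models (`fit_memorizer`, `fit_threshold`) are assumptions made for this example, not part of the course material. It contrasts a model that memorizes its training set with a simple rule that generalizes.

```python
import random

def make_data(n, rng):
    """Synthetic 1-D data: true label is 1 when x > 0.5; 20% of labels are flipped as noise."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data

def accuracy(predict, data):
    """Fraction of (x, y) pairs for which predict(x) equals y."""
    return sum(predict(x) == y for x, y in data) / len(data)

def fit_memorizer(train):
    """Stores every training label; answers 0 for any x it has never seen."""
    table = dict(train)
    return lambda x: table.get(x, 0)

def fit_threshold(train):
    """Chooses the cut-off t (among training x values) that maximizes training accuracy of 'x > t'."""
    cuts = [x for x, _ in train]
    best = max(cuts, key=lambda t: accuracy(lambda x: int(x > t), train))
    return lambda x: int(x > best)

rng = random.Random(0)  # fixed seed so the run is repeatable
train, test = make_data(200, rng), make_data(200, rng)

models = {"memorizer": fit_memorizer(train), "threshold": fit_threshold(train)}
scores = {name: (accuracy(m, train), accuracy(m, test)) for name, m in models.items()}

for name, (train_acc, test_acc) in scores.items():
    print(f"{name:10s} train={train_acc:.2f} test={test_acc:.2f}")
```

The memorizer describes the observed data perfectly yet does no better than a constant guess on the held-out set, while the threshold rule scores lower on training data but carries over to new samples.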

Decision Surfaces with Different Classification Algorithms

Two features 

different models produce different solutions
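
A small sketch of how two classifiers trained on the same two-feature data can disagree. The toy points, the query locations, and the choice of nearest-centroid versus 1-nearest-neighbor are assumptions picked for illustration; the slide's own figure may use different algorithms.

```python
import math

# Hypothetical training set: ((feature_1, feature_2), label)
train = [
    ((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0), ((1.0, 1.0), 0),
    ((3.0, 3.0), 1), ((3.0, 4.0), 1), ((4.0, 3.0), 1), ((4.0, 4.0), 1),
    ((1.6, 1.6), 1),  # a class-1 point sitting close to the class-0 cluster
]

def fit_nearest_centroid(data):
    """Predicts the class whose mean feature vector is closest to the query."""
    sums = {}
    for (x1, x2), y in data:
        s = sums.setdefault(y, [0.0, 0.0, 0])
        s[0] += x1
        s[1] += x2
        s[2] += 1
    centroids = {y: (s[0] / s[2], s[1] / s[2]) for y, s in sums.items()}
    return lambda p: min(centroids, key=lambda y: math.dist(p, centroids[y]))

def fit_one_nearest_neighbor(data):
    """Predicts the label of the single closest training point."""
    return lambda p: min(data, key=lambda item: math.dist(p, item[0]))[1]

centroid_model = fit_nearest_centroid(train)
knn_model = fit_one_nearest_neighbor(train)

for query in [(0.2, 0.2), (3.5, 3.5), (1.5, 1.4)]:
    print(query, "centroid:", centroid_model(query), "1-NN:", knn_model(query))
```

Both models agree far from the class boundary, but near the isolated class-1 point the centroid rule draws a straight boundary while 1-NN carves out a local pocket, so their decision surfaces differ.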

Data Science

Big Data Analytics

Data Science = Automated Analytics

How Do We Teach Machines To..

Is there any good reason to assume that data that you have not seen yet will share any properties with data you have already seen?

Broad ML Categories

A Bird's Eye View

ML Applications in Bio-medicine

Uncharted Possibilities

  • Predicting future disease
  • Optimizing interventions
  • Discovering unknown mechanisms
  • A new paradigm of scientific discovery
  • At-scale pattern discovery impossible otherwise

Data

Knowledge

Towards a grand unified theory of data

lots of data!

Classical Science

The age of data

Pandemics

Emergent Pathogens

Social Dynamics

Complex Diseases

Data

Forecast case count

Predict future mutations

Predict crime

Diagnose Complex Diseases

Data

Data

Insight

scientific knowledge

Clinical Decisions

social theory

Designing Better Vaccines

Can we predict future variants?

Bio-NORAD

  • CDC has evaluated 23 influenza strains for pandemic risk in last 12 years
  • We have collected > 6000 strains in last two years

Are we prepared for the next pandemic?

Future:

NORAD

for biological threats

New Residues of Importance Emerge in Influenza Cellular Entry Proteins

Microbiome

Modeling complex ecosystems

e.g. human gut microbiome

Leverage Vast Patient Database

Truven MarketScan (IBM)
Commercial Claims & Encounters Database

2003-2018

87M patients visible > 1 year

>7B individual claims

>87K unique diagnostic codes

 

>7% Medicare data present

Zero-burden EHR Analytics

Diagnostic & Screening for complex disorders

*CoR : * Comorbid Risk Scores

ACoR

PCoR

ZCoR

Universality

Autism

Bipolar Disorder

Idiopathic Pulmonary Fibrosis

Alzheimer's Disease

Perioperative Cardiac Event

Chronic Kidney Disease

...                 

  • Complex, expensive, time-consuming diagnostic tests
  • Lack of universal screening at the point of care
  • Early diagnosis is difficult, late or missed diagnosis costs lives

Conventional Off-the-shelf ML will not do!

ASD: Ineffective screening causes delays and incurs costs

Current Prevalence: 1 in 59 

Children with ASD experience higher co-morbidities

Can we exploit these patterns to predict diagnosis?

Common Knowledge: Comorbidities Exist

Autism Co-morbid Risk (ACoR) Score

MCHAT/F

Head to head comparison with current practice

ACoR: Variation with Age

can track risk increase over time

Older children are easier to diagnose

Co-morbidity Spectra: Pattern Discovery amidst Heterogeneity

Top patterns come from:

 

Nervous disorders

Digestive disorders

Injury & Poisoning

Neoplasms

Endocrine

Immune

The Secret Sauce: Inferring Probabilistic Machines from Data

Deep Learning Without Neural Networks: Fractal-nets for Rare Event Modeling (Under Review Nature Machine Intelligence)

Yi Huang, James Evans, I. Chattopadhyay

Sequence Likelihood Divergence For Fast Time Series Comparison

Yi Huang, Victor Rotaru, I. Chattopadhyay

Under Review, IEEE Transactions on Knowledge and Data Engineering

Abductive learning of quantized stochastic processes with probabilistic finite automata 

Ishanu Chattopadhyay  and Hod Lipson

Phil. Trans. R. Soc. A 371: 20110543 (2013)

The Secret Sauce: Inferring Probabilistic Machines from Data

Immune female control

Immune female case

Secret Sauce: Leveraging Temporal Patterns

Specialized HMM models from code sequences

Model control and case cohorts separately

Given a new test case, compute the likelihood of the sample arising from case models vs. control models

sequence likelihood defect
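
The case-versus-control likelihood comparison can be sketched as follows. This is a deliberately simplified stand-in: it uses first-order Markov chains with add-one smoothing rather than the specialized HMM models named above, and the code categories (`"gi"`, `"neuro"`, `"resp"`), the cohorts, and the function names are all hypothetical.

```python
import math
from collections import defaultdict

def fit_markov(sequences, alphabet, alpha=1.0):
    """Estimates smoothed transition probabilities P(next | current) from a cohort."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    model = {}
    for a in alphabet:
        total = sum(counts[a].values()) + alpha * len(alphabet)
        model[a] = {b: (counts[a][b] + alpha) / total for b in alphabet}
    return model

def log_likelihood(model, seq):
    """Log-probability of the observed transitions under the given model."""
    return sum(math.log(model[a][b]) for a, b in zip(seq, seq[1:]))

def risk_score(case_model, control_model, seq):
    """Positive when the sequence is better explained by the case model."""
    return log_likelihood(case_model, seq) - log_likelihood(control_model, seq)

alphabet = ["gi", "neuro", "resp"]  # hypothetical diagnostic-code categories
case_cohort = [
    ["gi", "neuro", "gi", "neuro", "neuro"],
    ["neuro", "gi", "neuro", "neuro", "gi"],
]
control_cohort = [
    ["resp", "resp", "gi", "resp", "resp"],
    ["resp", "gi", "resp", "resp", "resp"],
]

# Model the two cohorts separately, then score new patients against both.
case_model = fit_markov(case_cohort, alphabet)
control_model = fit_markov(control_cohort, alphabet)

case_like = ["gi", "neuro", "neuro", "gi"]
control_like = ["resp", "resp", "gi", "resp"]
print(risk_score(case_model, control_model, case_like))
print(risk_score(case_model, control_model, control_like))
```

The score is a log-likelihood ratio, so its sign says which cohort model explains the new sequence better and its magnitude says by how much.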

Bipolar Disorder

Manic Episodes in Mood Disorders

No Blood-work

No questionnaire

 

Dx codes + Rx Codes

Idiopathic Pulmonary Fibrosis

  • No effective screening available

  • Pathobiology unclear

  • Post diagnostic survival: 3-5 years

Significant Boost in survival time

Alzheimer's Disease and Related Dementia

>5 Million in US. >13 Million in next 10 years

Alzheimer's Disease and Related Dementia

state of art with EHR:

~67% AUC*

 

ZCoR:  ~87%

Preempting ADRD accurately up to a decade into the future

Perioperative Cardiac Risk from Hip/Knee Surgeries

Impact on patient outcome

Prospective Validation

ASD

ADRD

Pediatrics

Neurology

(Memory Center)

Time Series Analysis

Deep learning without Neural Networks

Using Flu incidence data from the past to predict COVID-19 case counts

Predicting rare and extreme events in complex dynamical systems

rare weather events

 

earthquakes

 

crime

Fractal Net Architecture: Rethinking Deep Learning in Stochastic Rare/Extreme Event Scenario

Predicting crime and auditing enforcement biases

Datasets

ECG

EEG

Microbiome

EHR

Genomic

Epidemiology

Tissue Image

Sequence

Confusion Matrix with 2 classes

Performance Metrics

Receiver Operating Characteristic (ROC)
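
A minimal sketch of these quantities for a two-class problem. The labels, scores, and the 0.5 decision threshold are made-up values for illustration, and the code assumes both classes are present. The area under the ROC curve is computed through its rank interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative.

```python
def confusion_matrix(y_true, y_pred):
    """Returns (TP, FP, FN, TN) for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def performance_metrics(tp, fp, fn, tn):
    """Common metrics derived from the confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical ground truth and classifier scores
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05]
y_pred = [int(s >= 0.5) for s in scores]  # one operating point on the ROC curve

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
metrics = performance_metrics(tp, fp, fn, tn)
auc = roc_auc(y_true, scores)
print((tp, fp, fn, tn), metrics, round(auc, 3))
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes performance across all thresholds.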

The Fundamental Problem Setting

  • Naive Bayes Classifier

  • Nearest Neighbor Classifier

  • Support Vector Machines

  • Decision Trees

  • Random Forests

Neurons

  • The building blocks of neural networks are artificial neurons.
  • These are simple computational units that have weighted input signals and produce an output signal using an activation function.

Biological Neurons

  • Neuron Weights
  • Activation
  • Networks of Neurons
NN Computation

a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right)

short-hand:

a^{l} = \sigma(w^l a^{l-1}+b^l)
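
The layer update a^{l} = \sigma(w^l a^{l-1} + b^l) can be sketched in plain Python. The logistic sigmoid as the activation \sigma, the layer sizes, and the weight values below are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic activation, assumed here as the sigma in the formula."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(w, b, a_prev):
    """Computes a^l_j = sigma(sum_k w[j][k] * a_prev[k] + b[j]) for every neuron j."""
    return [
        sigmoid(sum(w_jk * a_k for w_jk, a_k in zip(w_j, a_prev)) + b_j)
        for w_j, b_j in zip(w, b)
    ]

def network_forward(layers, x):
    """Applies the layer update repeatedly; layers is a list of (weights, biases)."""
    a = x
    for w, b in layers:
        a = layer_forward(w, b, a)
    return a

# Hypothetical 2-2-1 network
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),  # hidden layer: 2 neurons, 2 inputs each
    ([[2.0, -2.0]], [0.0]),                    # output layer: 1 neuron, 2 inputs
]
output = network_forward(layers, [1.0, 1.0])
print(output)
```

Each row of a weight matrix holds the incoming weights of one neuron, which is why the per-neuron sum and the matrix short-hand describe the same computation.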

NN Learning: Backpropagation

optimizing weights and biases

by minimizing a loss-function
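
A minimal sketch of learning by loss minimization, assuming a single sigmoid neuron, a cross-entropy loss, and full-batch gradient descent on the logical AND function. With only one layer, backpropagation reduces to a single application of the chain rule; the learning rate, epoch count, and function names are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(data, lr=0.5, epochs=5000):
    """Gradient descent on the mean cross-entropy loss of one sigmoid neuron."""
    w, b = [0.0, 0.0], 0.0
    history = []  # mean loss measured before each update
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b, loss = [0.0, 0.0], 0.0, 0.0
        for x, y in data:
            a = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            loss += -(y * math.log(a) + (1 - y) * math.log(1 - a))
            delta = a - y  # chain rule: dLoss/dz for sigmoid with cross-entropy
            grad_w[0] += delta * x[0]
            grad_w[1] += delta * x[1]
            grad_b += delta
        # Step against the averaged gradient
        w = [w_i - lr * g / n for w_i, g in zip(w, grad_w)]
        b -= lr * grad_b / n
        history.append(loss / n)
    return w, b, history

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, history = train_neuron(and_data)

def predict(x):
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

for x, y in and_data:
    print(x, y, round(predict(x), 3))
```

In a multi-layer network the same `delta` term is propagated backwards through each layer's weights, which is what gives backpropagation its name.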

Convolutional Neural Nets

Convolution
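
A minimal sketch of the operation a convolutional layer applies: a small kernel slides over the image and a weighted sum is taken at every position. As in most deep learning libraries, the kernel is not flipped, so this is strictly a cross-correlation. The image, the edge-detecting kernel, and the function name are assumptions for illustration, with no padding and a stride of 1.

```python
def conv2d_valid(image, kernel):
    """Slides the kernel over the image (no padding, stride 1) and sums the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 4x4 image with a vertical edge between the second and third columns
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Kernel that responds to a left-to-right increase in intensity
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

The feature map is non-zero only where the kernel overlaps the edge, which is how a learned kernel acts as a local pattern detector.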

Long short-term memory

API has been trained on the COCO (Common Objects in Context) dataset.

Auto-encoder

So What is Machine Learning?

How to Spot a Fake Data Scientist

  • Lacking in deep understanding of ML theory
  • Never worked with substantial data
  • Portfolio/GitHub page has fewer than 5 listed projects
  • Cannot explain what an auto-encoder is

 

End of First Class

HW:

 

  1. Why are some applications "ML" and not "statistical" modeling?

CCTS 405000-23-01

By Ishanu Chattopadhyay

Machine Learning for Biomedicine
