Ishanu Chattopadhyay

University of Chicago

Machine Learning & Advanced Analytics for Biomedicine

CCTS 40500 / CCTS 20500 / BIOS 29208
Winter 2023

Lecture 2

Contact

Room: BSLC 313

Monday 9.30 AM - 12.20 PM

Resources

https://github.com/zeroknowledgediscovery/course_notes

RCC Midway

PLEASE CHECK that your RCC Midway usernames work

  • Expectations
  • Grading
  • Midterm
  • Final Project
  1. You should be able to model complex data on your own
  2. Choose the right framework for the right parameters, for the right reasons
  3. Know the limitations and the strengths of your model

I grade on progress and effort

Innovative approaches get more credit

There will be homework assignments, periodically rather than weekly

The midterm and the final are not in-class exams

Again, effort and innovation get more credit

Class Time

MON 9.30 AM (~3 hrs)

FRI  9.00 AM (0.5-1 hr)

Today's Take-Home Message

Performance Metrics

Diagnostic Tests

Bayesian Statistics

Diagnostic Tests for Diseases

  • Risk Factors
    • Past Diagnoses
  • Laboratory Tests
  • Questionnaire
  • Familial Risks
  • Life Events

Does the patient have the disorder?

Not Always Obvious

autism

dementia

Does the patient have a risk of the disorder?

How do we quantify risk?

How do we map risk to severity?

Diagnostic Tests

Sensitivity & Specificity

Confusion Matrix with 2 classes

Performance Metrics

Relationships between Performance Metrics

TPR = \frac{t_p}{P} = \frac{t_p}{t_p+f_n}\\ TNR = \frac{t_n}{N} = \frac{t_n}{t_n+f_p}\\ FPR =1-TNR\\ PPV =\frac{t_p}{t_p+f_p}\\ \rho =\frac{P}{N+P}
t_p : \textrm{ true positives }, t_n: \textrm{ true negatives }
f_p : \textrm{ false positives }, f_n: \textrm{ false negatives }
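
The definitions above can be checked numerically. A minimal Python sketch, assuming the four confusion-matrix counts are already known; the function name and the example counts are hypothetical and chosen only for illustration:

```python
def performance_metrics(tp, fp, tn, fn):
    """Compute TPR, TNR, FPR, PPV and prevalence from confusion-matrix counts."""
    P = tp + fn  # all actual positives
    N = tn + fp  # all actual negatives
    tpr = tp / P  # sensitivity
    tnr = tn / N  # specificity
    return {
        "TPR": tpr,
        "TNR": tnr,
        "FPR": 1 - tnr,
        "PPV": tp / (tp + fp),
        "prevalence": P / (N + P),
    }


# Hypothetical counts (not from the lecture)
m = performance_metrics(tp=90, fp=40, tn=860, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```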

Relationships between Performance Metrics

PPV = \frac{t_p/P}{t_p/P + (f_p/N)(N/P)} = \frac{TPR}{TPR + ((N-t_n)/N)(N/P)}
t_p : \textrm{ true positives }, t_n: \textrm{ true negatives }
f_p : \textrm{ false positives }, f_n: \textrm{ false negatives }
s : \textrm{ sensitivity }, c: \textrm{ specificity }
NPV = \frac{1}{1+ \frac{1-s}{c \left ( \frac{1}{\rho}-1\right )} }
PPV = \frac{s}{s + (1-c)(\frac{1}{\rho} -1)}

Prevalence is an intrinsic property of the disease, not of the test

Manic Episode with no Bipolar history

prevalence: ~10%

Idiopathic Pulmonary Fibrosis

prevalence: ~0.5%
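
To see how strongly prevalence drives PPV and NPV, here is a small Python sketch of the two formulas in terms of sensitivity s, specificity c and prevalence rho. The values s = c = 0.90 are an assumption made for illustration only; the two prevalences are the ones quoted above.

```python
def ppv(s, c, rho):
    """Positive predictive value from sensitivity s, specificity c, prevalence rho."""
    return s / (s + (1 - c) * (1 / rho - 1))


def npv(s, c, rho):
    """Negative predictive value from sensitivity s, specificity c, prevalence rho."""
    return 1 / (1 + (1 - s) / (c * (1 / rho - 1)))


s, c = 0.90, 0.90  # assumed test characteristics, not from the lecture
for name, rho in [("Manic episode, no bipolar history", 0.10),
                  ("Idiopathic pulmonary fibrosis", 0.005)]:
    print(f"{name}: PPV={ppv(s, c, rho):.3f}  NPV={npv(s, c, rho):.3f}")
```

With these assumed values the same test has a PPV of 0.5 at 10% prevalence and only about 0.04 at 0.5% prevalence.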

Relationships between Performance Metrics

The decision threshold is up to us to decide

It impacts sensitivity & specificity

Sensitivity Specificity Tradeoff

Each choice of a threshold produces a different test
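
A minimal Python sketch of this point, assuming a classifier that outputs a score and calls a sample positive when the score is at or above the threshold. The scores and labels are made up for illustration:

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when 'score >= threshold' is called positive."""
    tp = sum(1 for x, y in zip(scores, labels) if x >= threshold and y == 1)
    fn = sum(1 for x, y in zip(scores, labels) if x < threshold and y == 1)
    tn = sum(1 for x, y in zip(scores, labels) if x < threshold and y == 0)
    fp = sum(1 for x, y in zip(scores, labels) if x >= threshold and y == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores; label 1 = has the disorder, 0 = does not
scores = [0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

for t in (0.35, 0.5, 0.65):
    s, c = sens_spec(scores, labels, t)
    print(f"threshold={t:.2f}  sensitivity={s:.2f}  specificity={c:.2f}")
```

Lowering the threshold raises sensitivity at the expense of specificity, and raising it does the opposite.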

Comparing Tests

Why is a "diagonal ROC" useless?

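Let sensitivity be \(s\), specificity be \(c\), and prevalence P/(N+P) be \(\rho\).

On the diagonal of the ROC plot TPR = FPR, i.e. \(s = 1-c\). Then:

s=1-c \\ \Rightarrow \frac{t_p}{P} = \frac{f_p}{N} \\ \Rightarrow \frac{t_p}{f_p} = \frac{P}{N} = \frac{\rho}{1-\rho}

Hence the odds of disease after a positive test equal the pre-test odds: a diagonal ROC is NO BETTER than a coin toss!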

Comparing Tests

  • AUC only considers ranks, not actual values
  • Related to the Mann-Whitney U Test
  • Shows why AUC is immune to class imbalance

HW: Show that, for one randomly chosen positive sample and one randomly chosen negative sample, AUC is the probability that the positive sample is ranked higher than the negative one
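
The ranking interpretation can be illustrated by brute force. A minimal Python sketch, assuming ties count as half a win (the usual Mann-Whitney convention); the scores are hypothetical:

```python
def auc_by_ranking(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores for diseased (pos) and healthy (neg) samples
pos = [0.4, 0.7, 0.8, 0.9]
neg = [0.1, 0.3, 0.35, 0.6]

auc = auc_by_ranking(pos, neg)
# Ten times as many negatives with the same score distribution
auc_imbalanced = auc_by_ranking(pos, neg * 10)
print(auc, auc_imbalanced)
```

Only the ordering of the scores enters the computation, and replicating the negative class leaves the value unchanged, which is the sense in which AUC does not depend on class balance.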

Tests are tools to reduce uncertainty

Test Effectiveness

-LR=\frac{f_n}{t_n} \times \frac{1-\rho}{\rho} =\frac{1-s}{c}
+LR=\frac{t_p}{f_p} \times \frac{1-\rho}{\rho} =\frac{s}{(1-c) }

Prove this using Bayes' Theorem
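
Before attempting the proof, the two forms of each likelihood ratio can be compared numerically. A minimal Python sketch with hypothetical counts; it is a sanity check, not a substitute for the derivation:

```python
def likelihood_ratios(s, c):
    """Return (+LR, -LR) from sensitivity s and specificity c."""
    return s / (1 - c), (1 - s) / c


def post_test_probability(rho, lr):
    """Convert pre-test probability rho to post-test probability via the odds form."""
    pre_odds = rho / (1 - rho)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)


# Hypothetical confusion-matrix counts
tp, fp, tn, fn = 90, 40, 860, 10
P, N = tp + fn, tn + fp
rho = P / (N + P)
s, c = tp / P, tn / N

lr_pos, lr_neg = likelihood_ratios(s, c)
lr_pos_counts = (tp / fp) * (1 - rho) / rho  # count-based form of +LR
lr_neg_counts = (fn / tn) * (1 - rho) / rho  # count-based form of -LR

print(lr_pos, lr_pos_counts)
print(lr_neg, lr_neg_counts)
print(post_test_probability(rho, lr_pos), tp / (tp + fp))  # equals PPV
```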

Post-test odds after a positive result: $$t_p/f_p$$

Pre-test odds: $$\frac{\rho}{1-\rho}$$

Choosing Thresholds

Balancing False Positives & False Negatives

Cost             Positive   Negative
Test Positive    $0         $x
Test Negative    $y         $0

Cost Optimization to choose operating point

\textrm{minimize } \zeta = C(f_p)+C(f_n)

Criminal Justice: $$C(f_n) = 0$$

Healthcare (Covid test?): $$C(f_p) = 0$$

This is a naive dichotomy
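
One way to make the cost optimization concrete is sketched below in Python. It assumes the cost is linear in the error counts (a fixed cost per false positive and per false negative, matching $x and $y in the table) and simply scans the observed scores as candidate thresholds. The data and cost values are hypothetical.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """zeta = cost_fp * (#false positives) + cost_fn * (#false negatives)."""
    fp = sum(1 for x, y in zip(scores, labels) if x >= threshold and y == 0)
    fn = sum(1 for x, y in zip(scores, labels) if x < threshold and y == 1)
    return cost_fp * fp + cost_fn * fn


def best_threshold(scores, labels, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    # Every observed score, plus +inf for "never call anything positive"
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


# Hypothetical scores; label 1 = has the disorder, 0 = does not
scores = [0.1, 0.3, 0.35, 0.6, 0.4, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

print(best_threshold(scores, labels, cost_fp=1, cost_fn=10))  # misses are expensive
print(best_threshold(scores, labels, cost_fp=10, cost_fn=1))  # false alarms are expensive
```

When false negatives are expensive the chosen operating point moves to a lower threshold; when false positives are expensive it moves higher.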

Choosing Thresholds

Overlapping features are harder to classify

How do we formalize these trade-offs?

Covid tests are similar

What happens if we test again?

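[Figure: repeated testing as a Bayesian update. Values shown: 0.045 (posterior after the first test, reused as the prior for the second), 1-0.045, and 0.69 (posterior after the second test).]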

But confirmatory tests might not always be feasible
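
The effect of a second test can be sketched as two successive applications of Bayes' theorem. In the Python sketch below, the prevalence (0.1%), sensitivity (0.95) and specificity (0.98) are assumptions, picked because they give posteriors close to the 0.045 and 0.69 quoted above; the sketch also assumes the two test results are conditionally independent given disease status.

```python
def posterior_after_positive(prior, s, c):
    """P(disease | positive test) for sensitivity s and specificity c."""
    return s * prior / (s * prior + (1 - c) * (1 - prior))


prior = 0.001       # assumed prevalence
s, c = 0.95, 0.98   # assumed sensitivity and specificity

p1 = posterior_after_positive(prior, s, c)  # after the first positive test
p2 = posterior_after_positive(p1, s, c)     # first posterior becomes the new prior
print(f"after test 1: {p1:.3f}, after test 2: {p2:.2f}")
```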

Summary of Bayesian Inference

Maximum Likelihood Estimate

vs

Maximum a posteriori probability Estimate

\theta_{MLE} = \argmax_\theta Pr(X \vert \theta)
\theta_{MAP} = \argmax_\theta Pr(\theta \vert X) \\ = \argmax_\theta \bigg ( \log Pr(X \vert \theta) + \log Pr(\theta) \bigg )

HW: Show that the second expression is true

HW:
  1. Why do we choose the Beta distribution?
  2. Choose a different prior and compute the MAP estimate

Note on beta distribution:

E[X] = \frac{\alpha}{\alpha + \beta}
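
A minimal Python sketch comparing the MLE and MAP estimates, assuming the data are k successes in n Bernoulli trials with a Beta(alpha, beta) prior on the success probability (the Bernoulli likelihood is an assumption; only the Beta prior is named above). Under that assumption the posterior is Beta(alpha + k, beta + n - k); the MAP estimate is its mode and the posterior mean follows from the formula for E[X]. The example numbers are hypothetical.

```python
def bernoulli_estimates(k, n, alpha, beta):
    """Return (MLE, MAP, posterior mean) for k successes in n trials."""
    mle = k / n
    a_post = alpha + k        # posterior Beta parameters
    b_post = beta + n - k
    # Mode of Beta(a, b); this formula requires a > 1 and b > 1
    map_est = (a_post - 1) / (a_post + b_post - 2)
    post_mean = a_post / (a_post + b_post)
    return mle, map_est, post_mean


# Hypothetical data: 7 successes in 10 trials, Beta(2, 2) prior
mle, map_est, post_mean = bernoulli_estimates(k=7, n=10, alpha=2, beta=2)
print(f"MLE={mle:.3f}  MAP={map_est:.3f}  posterior mean={post_mean:.3f}")
```

With a uniform prior, Beta(1, 1), the MAP estimate coincides with the MLE; an informative prior pulls the estimate toward the prior mean.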

HW: Why choose conjugate priors?

Example of Computing a Bayes Estimator

Bayes' Error

 

The Universal Metric

Also not computable

HW will be posted on Canvas.

Extra Credit Problem: Derive the posterior distribution using this approach

CCTS 405000-02-2023

By Ishanu Chattopadhyay
