Ishanu Chattopadhyay
University of Chicago
CCTS 40500 / CCTS 20500 / BIOS 29208
Winter 2023
Lecture 3
Contact
ishanu@uchicago.edu
900 E57 ST
KCBD 10152
Room: BSLC 313
Monday 9:30 AM - 12:20 PM
Resources
https://github.com/zeroknowledgediscovery/course_notes
RCC Midway
# open terminal and login
ssh -X username@midway2.rcc.uchicago.edu
# get a compute node within screen
screen
sinteractive --exclusive --time=08:00:00
# start jupyter notebook without browser
module load python
a=`/sbin/ip route get 8.8.8.8 | awk '{print $NF;exit}'`
jupyter-notebook --no-browser --ip="$a"
# access the notebook from the local machine
# to simplify the above, use the provided script
/project2/ishanu/run_jupyter.sh
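One common way to do the "access the notebook from the local machine" step (a sketch, not the official recipe; the port and compute-node address come from the jupyter startup message, and run_jupyter.sh may already handle this for you):
# on the local machine: forward the notebook port (8888 assumed) through the login node
ssh -N -L 8888:<compute-node-ip>:8888 username@midway2.rcc.uchicago.edu
# then open http://localhost:8888 in a local browser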
Bayesian Statistics
Bayes' Error
Decision Trees
Prevalence is an intrinsic property of the disease
Idiopathic Pulmonary Fibrosis
prevalence: ~0.5%
The decision threshold is up to us to choose
This choice impacts sensitivity & specificity
| Cost          | Condition Positive | Condition Negative |
|---------------|--------------------|--------------------|
| Test Positive | $0                 | $x                 |
| Test Negative | $y                 | $0                 |
Cost Optimization to choose operating point
Criminal Justice: $$C(f_n) = 0 $$
Healthcare: $$C(f_p) = 0 $$
naive dichotomy
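To make "cost optimization to choose the operating point" concrete, here is a minimal sketch (the scores, prevalence, and costs are all made-up assumptions, not the lecture's numbers): sweep the threshold and keep the one with the lowest expected cost per subject.

import numpy as np

rng = np.random.default_rng(0)
# synthetic labels at ~0.5% prevalence and synthetic risk scores (illustrative)
y_true = rng.binomial(1, 0.005, size=100_000)
scores = np.clip(rng.normal(0.3 + 0.4 * y_true, 0.15), 0, 1)

C_fp, C_fn = 100.0, 5000.0   # assumed costs of a false positive / false negative

def expected_cost(threshold):
    pred = scores >= threshold            # call everyone above the threshold positive
    fp = np.sum(pred & (y_true == 0))     # false positives
    fn = np.sum(~pred & (y_true == 1))    # false negatives
    return (C_fp * fp + C_fn * fn) / y_true.size

thresholds = np.linspace(0, 1, 101)
best = min(thresholds, key=expected_cost)
print(best, expected_cost(best))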
Classification & Decision Theory
Mathematical definition of classifier:
Risk of a classifier:
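In standard notation (an assumption; the lecture's symbols may differ), a classifier is a map from features to labels, and its risk is its expected loss:
$$ h : \mathcal{X} \to \mathcal{Y} $$
$$ R(h) = \mathbb{E}_{(X,Y)}\big[\,\ell\big(h(X), Y\big)\,\big] $$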
Classification & Decision Theory
Bayes Risk
search over all possible classifiers
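In symbols (standard definition), the Bayes risk is the smallest risk attainable by any classifier:
$$ R^{*} = \inf_{h} R(h) $$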
A classifier achieving the Bayes risk is a Bayes Optimal Classifier
Bayesian Decision Theory
Minimizing the 0-1 loss is equivalent to minimizing the overall misclassification rate. The 0-1 loss is an example of a symmetric loss function: all errors are penalized equally. In certain applications, asymmetric loss functions are more appropriate.
Recall the cost of false negatives vs. that of false positives
The expected 0-1 loss is precisely the probability of making a mistake
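In symbols:
$$ \mathbb{E}\big[\,\mathbb{1}\{h(X) \neq Y\}\,\big] = \Pr\big(h(X) \neq Y\big) $$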
Defining the Bayes Optimal Classifier in terms of the loss function
Bayesian Decision Theory
Bayes Optimal Classifier
The above derivation is, of course, only for the 0-1 loss
But the result holds for general loss functions
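For reference (standard forms; notation assumed): under the 0-1 loss the Bayes optimal classifier picks the most probable class given the features, and under a general loss it minimizes the conditional expected loss:
$$ h^{*}(x) = \arg\max_{y} \; \Pr(Y = y \mid X = x) $$
$$ h^{*}(x) = \arg\min_{\hat y} \; \mathbb{E}\big[\,\ell(\hat y, Y) \mid X = x\,\big] $$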
Bayes Optimal Classifier: Example
Assume we want to classify fish based on length
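A minimal sketch of this example (the species names, length distributions, and priors are made-up illustrations, and the true densities are assumed known): compute the posterior for each class and pick the larger one.

import numpy as np
from scipy.stats import norm

# hypothetical class-conditional length distributions (cm) and priors
prior = {'A': 0.6, 'B': 0.4}
likelihood = {'A': norm(loc=30, scale=4), 'B': norm(loc=40, scale=5)}

def bayes_optimal(length):
    # posterior up to a constant: prior times class-conditional density
    post = {c: prior[c] * likelihood[c].pdf(length) for c in prior}
    return max(post, key=post.get)

for x in [28, 34, 36, 42]:
    print(x, '->', bayes_optimal(x))

Note that the sketch assumes the class-conditional densities and priors are known exactly; in practice they must be estimated from data.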
Why doesn't this solve all problems in ML?
Bayes Risk
Bayes Optimal Classifier
What happens if the loss function is more general?
HW
extra credit
if x == y:
    class = 0
else:
    class = 1
N1: is x == 1 ? (yes -> N2, no -> N3)
N2: is y == 1 ? (yes -> class=0, no -> class=1)
N3: is y == 1 ? (yes -> class=1, no -> class=0)
XOR
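As a quick sanity check (a sketch; the HW may ask for something different), a depth-2 tree suffices to represent XOR on the four binary inputs, and a learner can recover the tree sketched above:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# the four XOR points: class = 1 exactly when x != y
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(clf.predict(X))   # expected: [0 1 1 0]
print(clf.get_depth())  # expected: 2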
HW
We will look at Q-nets later that optimally solve this problem
Key Problem: Overfitting
Formal Description of Overfitting
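One common formalization (stated here as an assumption about what is intended): a hypothesis $h$ overfits the training sample when its empirical (training) risk is much smaller than its true risk,
$$ \hat{R}_{\mathrm{train}}(h) \ll R(h), $$
equivalently, when some alternative $h'$ has larger training error but smaller error on the underlying distribution.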
What Does Overfitting Look Like in Practice?
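A minimal way to see it (a sketch on synthetic data, not the lecture's example): an unconstrained tree typically scores near 100% on the training set but noticeably worse on held-out data, while a shallow tree shows a much smaller gap.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic noisy data (illustrative)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [None, 3]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    print(depth, clf.score(X_tr, y_tr), clf.score(X_te, y_te))  # train vs. test accuracy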
Are More Features Always Better? NO
Advantages of decision trees are:
The disadvantages of decision trees include:
Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.
The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristic algorithms such as the greedy algorithm where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.
There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.
How do we make decision trees better?
Reduce "bias"
Reduce "variance"
Cannot reduce "irreducible error"
Note: performance metrics relate to the sample population, not to individual samples
THE TABULAR DATA FORMAT
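Concretely (a toy sketch with made-up values): each row is a sample, each column a feature, plus one label column; X is the n_samples x n_features matrix and y is the target vector.

import pandas as pd

df = pd.DataFrame({
    'age':    [63, 47, 71, 55],     # made-up feature values
    'fvc':    [2.1, 3.4, 1.8, 2.9],
    'smoker': [1, 0, 1, 0],
    'label':  [1, 0, 1, 0],         # target to be predicted
})
X = df.drop(columns='label').values   # n_samples x n_features
y = df['label'].values                # labels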
Summing Up The ML Problem
Naive Bayes Assumption
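In symbols (standard statement): the features are assumed conditionally independent given the class,
$$ P(x_1, \dots, x_d \mid y) = \prod_{j=1}^{d} P(x_j \mid y) $$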
Expected Test Error:
bias
variance
irreducible error
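In symbols, for squared error (the standard decomposition; $h_D$ is the model trained on dataset $D$, $\bar h(x) = \mathbb{E}_D[h_D(x)]$, and $\bar y(x) = \mathbb{E}[y \mid x]$):
$$ \underbrace{\mathbb{E}_{x,y,D}\big[(h_D(x) - y)^2\big]}_{\text{expected test error}} = \underbrace{\mathbb{E}_{x}\big[(\bar h(x) - \bar y(x))^2\big]}_{\text{bias}^2} + \underbrace{\mathbb{E}_{x,D}\big[(h_D(x) - \bar h(x))^2\big]}_{\text{variance}} + \underbrace{\mathbb{E}_{x,y}\big[(y - \bar y(x))^2\big]}_{\text{irreducible error}} $$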
Show Proof
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# base learner: a shallow, class-balanced decision tree
clf_ = DecisionTreeClassifier(max_depth=4, class_weight='balanced')
# bag 10 such trees; oob_score gives an out-of-bag estimate of accuracy
clf = BaggingClassifier(base_estimator=clf_, n_estimators=10, oob_score=True)

# X, y: feature matrix and labels, assumed already loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)                 # predict takes only the features
print(metrics.accuracy_score(y_test, y_pred))
Average over the feature importance of base models
Median might be better
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = BaggingClassifier(DecisionTreeClassifier())
clf.fit(X, y)

# average the per-tree importances over the bagged base estimators
feature_importances = np.mean(
    [tree.feature_importances_ for tree in clf.estimators_], axis=0)
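Since the median may be more robust to a few atypical trees, a one-line variant (sketch):

feature_importances_med = np.median(
    [tree.feature_importances_ for tree in clf.estimators_], axis=0)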
Instead of one "rule", it is a distribution over rules, or a linear combination of rules
HW. Prove this
Proof here:
What happens when there is one strongly predictive feature?
How do we avoid this?
We can penalize the number of times a feature is used at a given depth
We can penalize the total number of times a feature can be used
We can allow only features chosen by some meta-analysis
No! Same bias in each case
from sklearn.ensemble import RandomForestClassifier

# 300 fully grown, class-balanced trees; X_train etc. as defined above
clf = RandomForestClassifier(max_depth=None, class_weight='balanced', n_estimators=300)
y_pred = clf.fit(X_train, y_train).predict(X_test)