Trading-off Model Accuracy Against Fairness:

The Ethical Conundrum of Standardized Screening

Ishanu Chattopadhyay

Assistant Professor, Medicine

04.19.2023

Ian Cero, Peter A. Wyman, I. Chattopadhyay, Robert D. Gibbons, Predictive equity in suicide risk screening, Journal of the Academy of Consultation-Liaison Psychiatry, 2023. https://doi.org/10.1016/j.jaclp.2023.03.005

Equity

Fairness

Ian Cero

Peter Wyman

Robert Gibbons

Suicide is a major public health concern

1 death by suicide every 40 seconds

According to CDC data, there were over 47,500 suicide deaths in the U.S. in 2019, an age-adjusted rate of 13.9 per 100,000 individuals.

10th leading cause of death in the United States

Screening Tests are Increasingly common

Columbia-Suicide Severity Rating Scale (C-SSRS)

Patient Health Questionnaire-9 (PHQ-9)

Ask Suicide-Screening Questions (ASQ)

These screening tools are not meant to be diagnostic but rather to help identify individuals who may need further evaluation or intervention to prevent suicide.

Primary Care

Emergency Dept

School & Community


Coley RY, Johnson E, Simon GE, Cruz M, Shortreed SM. Racial/Ethnic Disparities in the Performance of Prediction Models for Death by Suicide After Mental Health Visits. JAMA Psychiatry. 2021 Jul 1;78(7):726–34.

The increasing standardization of suicide risk screening suggests that predictive models should balance not only accuracy but also fairness across the different groups of people whose futures are being predicted.

Accuracy

Fairness

Group A

Group B

Ask Suicide-Screening Questions (ASQ) has high and equivalent sensitivity and specificity for suicide ideation across black and white youth in the emergency department.

Black

Sensitivity

Specificity

Non-Hispanic White

Equal across groups

ASQ

Different Base rates (prevalence)

6.11 per 100,000*

15.68 per 100,000*

Non-Hispanic White

Black

*CDC 2019 Data

Uneven base rates

Mathematically unavoidable trade-off between model accuracy and fairness
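One way to see why uneven base rates force this trade-off is sketched below. It uses the standard expression for positive predictive value in terms of sensitivity $s$, specificity $c$, and prevalence $\rho$; the group subscripts $A$ and $B$ are labels introduced only for this illustration.

```latex
PPV(\rho) = \frac{s}{s + (1-c)\left(\frac{1}{\rho}-1\right)}
          = \frac{s\rho}{s\rho + (1-c)(1-\rho)}
\\
\frac{d\,PPV}{d\rho} = \frac{s(1-c)}{\left[s\rho + (1-c)(1-\rho)\right]^{2}} > 0
\quad \textrm{whenever } s > 0 \textrm{ and } c < 1
\\
\rho_A < \rho_B \;\Rightarrow\; PPV_A < PPV_B
\quad \textrm{(same } s \textrm{ and } c \textrm{ in both groups)}
```

So a test with equal sensitivity and specificity across two groups cannot also have equal predictive value in both unless the base rates match or the test is perfect or uninformative.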

Another Example: criminal recidivism

ProPublica recently analyzed over 10,000 of the actual predictions from a popular recidivism prediction model (COMPAS)

Black defendants were twice as likely as white defendants to receive a false positive classification

Creators of COMPAS presented equally compelling findings

model’s overall classification accuracy (about 64%) was in fact equal for both black and white defendants

Larson J, Mattu S, Kirchner L, Angwin J. How We Analyzed the COMPAS Recidivism Algorithm [Internet]. ProPublica. 2016 [cited 2022 Dec 30]. Available from: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm

UNLIKELY due to "biased data", or model

*Kleinberg J, Mullainathan S, Raghavan M. Inherent Trade-Offs in the Fair Determination of Risk Scores. ArXiv160905807 Cs Stat [Internet]. 2016 Nov 17 [cited 2019 Nov 6]; Available from: http://arxiv.org/abs/1609.05807

Predictive disparity is likely caused by uneven base rates on the outcome being predicted*

Screening Test

Classification Problem

Screening Test

Target Condition


Confusion Matrix with 2 classes

Common Performance Metrics

Relationships between Performance Metrics

TPR = \frac{t_p}{P} = \frac{t_p}{t_p+f_n}\\ TNR = \frac{t_n}{N} = \frac{t_n}{t_n+f_p}\\ FPR =1-TNR\\ PPV =\frac{t_p}{t_p+f_p}\\ \rho =\frac{P}{N+P}

t_p : \textrm{ true positives }, t_n: \textrm{ true negatives }

f_p : \textrm{ false positives }, f_n: \textrm{ false negatives }

sensitivity

specificity

precision

prevalence
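As a minimal sketch, the metrics above can be computed directly from the four cells of a two-class confusion matrix. The function name `confusion_metrics` and the example counts are hypothetical, chosen only to illustrate the definitions.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the slide's metrics from confusion-matrix counts."""
    P = tp + fn  # all individuals who truly have the target condition
    N = tn + fp  # all individuals who truly do not
    return {
        "sensitivity": tp / P,        # TPR
        "specificity": tn / N,        # TNR
        "fpr": fp / N,                # equals 1 - TNR
        "ppv": tp / (tp + fp),        # precision
        "prevalence": P / (N + P),    # rho
    }

# Hypothetical counts, not taken from any dataset in this talk.
m = confusion_metrics(tp=90, fp=95, fn=10, tn=855)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Even with sensitivity and specificity both at 0.9 in this made-up example, fewer than half of the flagged individuals are true positives, because the condition is present in under 10% of the sample.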

Relationships between Performance Metrics

PPV = \frac{t_p/P}{t_p/P + (f_p/N)(N/P)} = \frac{TPR}{TPR + ((N-t_n)/N)(N/P)}

t_p : \textrm{ true positives }, t_n: \textrm{ true negatives }

f_p : \textrm{ false positives }, f_n: \textrm{ false negatives }

s : \textrm{ sensitivity }, c: \textrm{ specificity }

NPV = \frac{1}{1+ \frac{1-s}{c \left ( \frac{1}{\rho}-1\right )} }

PPV = \frac{s}{s + (1-c)(\frac{1}{\rho} -1)}
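The NPV expression follows by the same steps used for PPV. A short derivation, using only the symbols already defined on this slide together with the identity $P/N = \rho/(1-\rho)$:

```latex
NPV = \frac{t_n}{t_n + f_n}
    = \frac{t_n/N}{t_n/N + (f_n/P)(P/N)}
    = \frac{c}{c + (1-s)\frac{\rho}{1-\rho}}
\\
\frac{\rho}{1-\rho} = \frac{1}{\frac{1}{\rho}-1}
\;\Rightarrow\;
NPV = \frac{1}{1 + \frac{1-s}{c\left(\frac{1}{\rho}-1\right)}}
```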


NPV = \frac{1}{1+ \frac{1-s}{c \left ( \frac{1}{{\color{red}\rho}}-1\right )} }

PPV = \frac{s}{s + (1-c)(\frac{1}{{\color{red}\rho}} -1)}

Prevalence is an intrinsic property of the disease

Relationships between Performance Metrics

NPV = \frac{1}{1+ \frac{1-s}{c \left ( \frac{1}{{\color{red}\rho}}-1\right )} }

PPV = \frac{s}{s + (1-c)(\frac{1}{{\color{red}\rho}} -1)}

Manic Episode with no Bipolar history

prevalence: ~10%

Relationships between Performance Metrics

NPV = \frac{1}{1+ \frac{1-s}{c \left ( \frac{1}{{\color{red}\rho}}-1\right )} }

PPV = \frac{s}{s + (1-c)(\frac{1}{{\color{red}\rho}} -1)}

Idiopathic Pulmonary Fibrosis

prevalence: ~0.5%
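To make the effect of prevalence concrete, the sketch below evaluates the PPV and NPV formulas at the two prevalences quoted in these examples (about 10% and about 0.5%). The sensitivity and specificity of 0.90 are assumed values for illustration, not properties of any real test for these conditions.

```python
def ppv(s, c, rho):
    """Positive predictive value from sensitivity s, specificity c, prevalence rho."""
    return s / (s + (1 - c) * (1 / rho - 1))


def npv(s, c, rho):
    """Negative predictive value from sensitivity s, specificity c, prevalence rho."""
    return 1 / (1 + (1 - s) / (c * (1 / rho - 1)))


S, C = 0.90, 0.90  # assumed test characteristics (hypothetical)

for label, rho in [("prevalence ~10%", 0.10), ("prevalence ~0.5%", 0.005)]:
    print(f"{label}: PPV = {ppv(S, C, rho):.3f}, NPV = {npv(S, C, rho):.4f}")
```

Under these assumptions the same test has a PPV of 0.50 at 10% prevalence but only about 0.04 at 0.5% prevalence, while NPV stays above 0.98 in both cases.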

The decision threshold is up to us to choose

Impacts sensitivity & specificity

Sensitivity Specificity Tradeoff

Each choice of a threshold produces a different test
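A small sketch of this trade-off, using a made-up list of risk scores and true labels (both hypothetical): raising the threshold lowers sensitivity and raises specificity, so each threshold behaves as a different test.

```python
def sens_spec(scores, labels, threshold):
    """Flag a case when score >= threshold; return (sensitivity, specificity)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for x, y in pairs if y == 1 and x >= threshold)
    fn = sum(1 for x, y in pairs if y == 1 and x < threshold)
    tn = sum(1 for x, y in pairs if y == 0 and x < threshold)
    fp = sum(1 for x, y in pairs if y == 0 and x >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical scores and outcomes (1 = target condition present).
scores = [0.10, 0.20, 0.30, 0.35, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

for t in (0.25, 0.45, 0.65):
    s, c = sens_spec(scores, labels, t)
    print(f"threshold {t:.2f}: sensitivity = {s:.1f}, specificity = {c:.1f}")
```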

Tests are tools to reduce uncertainty

Test Effectiveness

-LR=\frac{f_n}{t_n} \times \frac{1-\rho}{\rho} =\frac{1-s}{c}

+LR=\frac{t_p}{f_p} \times \frac{1-\rho}{\rho} =\frac{s}{(1-c) }

Test Effectiveness

$$t_p/f_p$$

$$\frac{\rho}{1-\rho}$$
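Rearranging the positive likelihood ratio gives posterior odds = prior odds × +LR, that is, $t_p/f_p = \frac{\rho}{1-\rho} \times \frac{s}{1-c}$. The sketch below checks this numerically; the sensitivity, specificity, and prevalence are assumed values chosen for illustration.

```python
def likelihood_ratios(s, c):
    """Return (+LR, -LR) from sensitivity s and specificity c."""
    return s / (1 - c), (1 - s) / c


def odds_to_probability(odds):
    """Convert odds (a : 1) to a probability."""
    return odds / (1 + odds)


s, c, rho = 0.90, 0.90, 0.05  # hypothetical test and prevalence

pos_lr, neg_lr = likelihood_ratios(s, c)
prior_odds = rho / (1 - rho)
posterior_odds = prior_odds * pos_lr  # odds of the condition among flagged cases

print(f"+LR = {pos_lr:.2f}, -LR = {neg_lr:.3f}")
print(f"prior odds = {prior_odds:.4f}, posterior odds = {posterior_odds:.4f}")
print(f"probability given a positive flag = {odds_to_probability(posterior_odds):.3f}")
```

The final probability is the PPV, so the likelihood-ratio view and the PPV formula are two routes to the same number.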

Test Effectiveness

UCM Data

Blacks

Non-Hispanic Whites

AUC~90%

AUC~88%

Universal Screening for Suicidal Ideation / Attempts

$466,700

$135,700

Assume you have $1,000,000 to allocate to the post-screening followup service

67%

33%

Number of actual individuals helped

Demographic breakdown at UCM

=40

Assume you have $1,000,000 to allocate to the post-screening followup service

44%

66%

Number of actual individuals helped

Demographic breakdown at UCM

Differential base rate

=58

Race-blind followup

Assume you have $1,000,000 to allocate to the post-screening followup service

100%

Number of actual individuals helped

=21

Assume you have $1,000,000 to allocate to the post-screening followup service

77.5%

22.5%

Number of actual individuals helped

Equal outcome allocation

=34

No blood tests, no questionnaires, just diagnostic codes.

Instantaneous Universal Screening at Primary Care.

Works even for patients without history of mental disorders.

Screening

Posterior odds of SI/SA in flagged population:

13 in 20

Prior odds of SI/SA in general population:

1 in 20

3 out of 13 true flags have no prior history of mental disorders

The Screening Test is at its performance limit

The Ethics Question

Distribute resources race-blind

Distribute resources to make equal outcomes

Lives saved

The new frontier of predictive fairness in suicide prediction

Large-scale, prospectively designed studies are needed to investigate the full scope of the problem and the optimal alternatives, considering not only traditional cost measures but also screening mistakes and the preferences of community stakeholders.
Suicide prevention research can be informed by progress in algorithmic fairness, such as predictive models constrained by a fairness budget and survey methods to elicit desired fairness trade-offs from community members.
New best practices in predictive modeling of suicide risk should include optimization of both accuracy and fairness.
Practice guidelines for individual clinicians need to be developed based on prospective research studies, with caution against making ad hoc adjustments to screening and risk thresholds.

References

[1] Coley RY, et al. JAMA Psychiatry. 2021;78(7):726–34.
[2] Kearns M, Roth A. Oxford University Press; 2019.
[3] Wang X, et al. Manag Syst Eng. 2022;1(1):7.
[5] Kleinberg J, et al. ArXiv160905807 Cs Stat. 2016.
[8] Jung C, et al. arXiv. 2020.
[11] Zafar MB, et al. arXiv. 2017.
[12] Dwork C, et al. Proceedings of the 3rd Innovations in Theoretical Computer Science Conference. 2012.