Ishanu Chattopadhyay

Assistant Professor of Medicine

University of Chicago

Data Governance

Privacy in the age of AI

Friday the 20th, October 2023

Age of AI & Data

Vast databases of detailed personal information are everywhere

Data privacy and governance are crucial to protect individual rights and maintain trust.

Machine learning and AI have posed new challenges

name     age  occupation  address         diabetes
Alice    20   doctor      23 maple st PA  No
Bob      35   teacher     12 E av IL      Yes
Charlie  43   farmer      42 oak st IL    No

Classical Example

Truven MarketScan (IBM)
Commercial Claims & Encounters Database

2003-2021

150M patients

12B transactions

Current Example

  • Discover and leverage comorbidity patterns from large patient databases
  • Infer or predict medical conditions for individuals from sparse, noisy medical history

End of Privacy?

  • Universal screening for complex disorders
  • Prevent missed or late diagnosis
  • Accelerate scientific research by making it cheaper to do clinical trials

Governance must not limit AI capabilities

Example: 

Onishchenko, D., Marlowe, R.J., Ngufor, C.G. et al. Screening for idiopathic pulmonary fibrosis using comorbidity signatures in electronic health records. Nat Med 28, 2107–2116 (2022). https://doi.org/10.1038/s41591-022-02010-y

ZCoR: Zero-burden Co-morbid Risk Score

shortness of breath

dry cough

doctor can hear velcro crackles

Common Symptoms

>50 years old

more men than women

IPF

Rare disease

~5 in 10,000

Post-Dx

Survival

~4 years

At least one misdiagnosis

~55%

Two or more misdiagnoses

38%

Initially attributed to age-related symptoms:

72%

Cannot always be seen on CXR

Non-specific symptoms

PCP workflow demands

Initial misdiagnoses

[Timeline figure labels: ~4 yrs; current clinical Dx; post-Dx survival ~4 yrs]

ZCoR screening

n=~3M

AUC~90%

Likelihood ratio ~30
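To get a feel for what a likelihood ratio of ~30 means for a rare disease, the sketch below converts a pre-test probability into a post-test probability via odds. It assumes that the ~5 in 10,000 prevalence quoted earlier serves as the pre-test probability and that ~30 is a positive likelihood ratio; the function and variable names are illustrative, not part of ZCoR.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: post-test odds = pre-test odds * LR."""
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


prevalence = 5 / 10_000   # assumed pre-test probability (~5 in 10,000)
lr_positive = 30.0        # assumed positive likelihood ratio (~30)

p_flagged = post_test_probability(prevalence, lr_positive)
print(f"post-test probability for a flagged patient: {p_flagged:.4f}")
```

Under these assumptions a flagged patient's probability rises from 0.05% to roughly 1.5%, which is the sense in which a zero-burden flag can prioritize who gets confirmatory testing.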

Conventional AI/ML attempts to model the physician

AI in IPF Research

  • Discover and leverage comorbidity patterns
  • No data demands
  • Use whatever data is already in the patient file

Primary Care

Reliable screening and diagnosis

ZCoR Flag

  • No blood tests
  • No imaging
  • No pulmonary function tests

Sparse, noisy data with a priori unknown "risk factors"

Correct labels (diagnoses)

Ai

ZeD Lab: Predictive Screening from Comorbidity Footprints

Nature Medicine

JAHA

CELL Reports

Science Adv.

Predictive Screening from Comorbidity Footprints

Condition                      ZeD performance    Competition
Autism                         >80% AUC at 2 yrs  Double false positives
Alzheimer's Disease            ~90% AUC           60-70% AUC
Idiopathic Pulmonary Fibrosis  ~90% AUC           NA
MACE                           ~80% AUC           ~70% AUC
Bipolar Disorder               ~85% AUC           NA
CKD                            ~85% AUC           NA
Cancers                        ~75% AUC           NA

Complex systems with many variables

Cross-talk

Constrained "feasible space"

Reconstruction from noisy incomplete data

  • How can we maintain the statistical properties of data while ensuring individual privacy?
  • Can we 'corrupt' data to destroy identifiability without compromising its utility?

The Classical Challenge

Introduce 'noise' or alterations to the data in a way that individual records cannot be traced back, but the overall statistical patterns remain intact.

Corrupting Data

  • Identifiability refers to the ability to trace back data to an individual.
  • Destroying it ensures privacy but poses challenges in maintaining data integrity.

Destroying Identifiability

Mathematically, determining the right amount and type of 'noise' to add is non-trivial.

 

Too much alteration can render data useless, while too little can compromise privacy.

Mathematical Challenges

Mathematical framework to quantify and manage privacy risks

Differential Privacy

(Dwork 2006)

Removing or adding a single data point does not significantly impact the outcome of any analysis, thus ensuring the protection of individual privacy.

\mathcal{M} \textrm{ guarantees } \epsilon \textrm{ privacy budget if }
\forall S \subseteq \text{Range}(\mathcal{M}), \forall D, D' \text{ such that } ||D - D'||_1 \leq 1, \\ P(\mathcal{M}(D) \in S) \leq e^{\epsilon} P(\mathcal{M}(D') \in S)
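As a concrete instance of a mechanism that meets this definition, here is a minimal sketch of the classical Laplace mechanism applied to a counting query. This is a standard textbook construction, not something shown in the talk; the toy records and the names `laplace_noise` and `private_count` are assumptions made for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Counting query released with the Laplace mechanism.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so noise of scale 1/epsilon gives epsilon-DP.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Toy records (illustrative only, not real data).
records = [
    {"name": "Alice", "age": 20, "diabetes": False},
    {"name": "Bob", "age": 35, "diabetes": True},
    {"name": "Charlie", "age": 43, "diabetes": False},
]

rng = random.Random(0)
noisy = private_count(records, lambda r: r["diabetes"], epsilon=0.5, rng=rng)
print(f"noisy diabetes count: {noisy:.2f}")
```

A smaller epsilon means a larger noise scale, which is the privacy versus utility trade-off described above: the released count is unbiased, but any single release can be far from the true value.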

Synthetic data generation as a mode of data corruption

[Diagram: original data (has identifiable information) → learn generative model → generate synthetic data (has no identifiable information)]

How do we know we did a good job?

Statistical tests

Membership Inference attacks

Qnets: A New Model for General Synthetic Data Generation

for tabular data

with mathematical guarantee of "good generator"

Differentially Private Generative Adversarial Networks (DP-GANs) combine the power of Generative Adversarial Networks (GANs) with the privacy guarantees of differential privacy.

Xie, L., Lin, K., Wang, S., Wang, F., & Zhou, J. (2018). Differentially Private Generative Adversarial Network. arXiv preprint arXiv:1802.06739.

Existing Methods based on Adversarial Networks

discriminator

generator

Qnets: A New Model for General Synthetic Data Generation

CAD-PTSD Dataset

  • 211 items
  • 304 respondents
PTSD1 PTSD2 PTSD3
patient1 agree
patient2 disagree
patient3 strongly agree
patient4 neutral
patient5

items

respondents

example

with mathematical guarantee of "good generator"

Intrinsic Structure of Survey Responses

  • Can we determine if a response vector is "valid"?
  • Can we distinguish algorithmically between actual/honest responses vs random/adversarial responses?
  • Each response vector is an element of the "response space".
  • Is there a natural metric on the response space? What would such a metric mean intuitively?

PTSD4

PTSD93

PTSD86

QNet Trees

(3 out of 211)

Together they form

a recursive forest

Nodes "hyperlinked" to trees: Potentially Infinite Hierarchy

The QNet Structure

Nodes Hyperlinked to Trees

click on nodes to change trees

The q-distance Metric

The collection of all such conditional inference trees is the recursive forest, which answers the following question:

\textrm{If we have $n$ items/questions } X_1, \cdots , X_n, \\ \textrm{ and we have a subject responding with}\\ {\color{yellow} x_1, \cdots, x_{i-1},x_{i+1},\cdots, x_{n}, }\\ \textrm{ then the distribution of responses to question $X_i$ is given by } \\ {\color{yellow}\Phi_i:\prod_{j \neq i} \Sigma_j \rightarrow \mathcal{D}(\Sigma_i)}\\ \textrm{ where } \mathcal{D}(\Sigma_i) \textrm{ is the set of all possible distributions}\\ \textrm{over the set of all possible responses $\Sigma_i$ }

The q-distance Metric

\textrm{Intrinsic metric between response vectors}
\textrm{For two opinion vectors $x,y$}
{\theta(x,y) \triangleq \mathbf{E}_i \left ( \mathbb{J}^{\frac{1}{2}} \left (\Phi_i^P(x_{-i}) , \Phi_i^Q(y_{-i})\right ) \right )}
\textrm{where $P,Q$ are possibly two distinct populations}\\ \textrm{with distinct qnets, such that }\\ x \in P, y \in Q \textrm{ and }\\ \mathbb{J} \textrm{ is the Jensen-Shannon divergence }
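The q-distance definition can be sketched in a few lines of Python. This is a minimal illustration, not the Qnet implementation: a qnet is represented here by a hypothetical list of callables, one per item, each mapping the remaining responses to a distribution over that item's answers, and `toy_estimator` is an invented rule standing in for the conditional inference trees.

```python
import math
from typing import Callable, List, Sequence

Distribution = List[float]
# Hypothetical stand-in for a qnet: one conditional estimator per item,
# mapping the other responses x_{-i} to a distribution over item i's answers.
Qnet = List[Callable[[Sequence[int]], Distribution]]


def js_divergence(p: Distribution, q: Distribution) -> float:
    """Jensen-Shannon divergence with base-2 logs, so the value lies in [0, 1]."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(u: Distribution, v: Distribution) -> float:
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def q_distance(x: Sequence[int], y: Sequence[int], qnet_p: Qnet, qnet_q: Qnet) -> float:
    """theta(x, y): average over items i of sqrt(JS(Phi_i^P(x_-i), Phi_i^Q(y_-i)))."""
    total = 0.0
    for i in range(len(x)):
        x_rest = list(x[:i]) + list(x[i + 1:])
        y_rest = list(y[:i]) + list(y[i + 1:])
        js = js_divergence(qnet_p[i](x_rest), qnet_q[i](y_rest))
        total += math.sqrt(max(js, 0.0))  # clamp tiny negative rounding error
    return total / len(x)


def toy_estimator(rest: Sequence[int]) -> Distribution:
    """Toy rule (an assumption): P(answer 1) tracks the share of 1s elsewhere."""
    p_one = (sum(rest) + 1) / (len(rest) + 2)  # Laplace-smoothed share of 1s
    return [1.0 - p_one, p_one]


toy_qnet: Qnet = [toy_estimator] * 4  # same population on both sides, so P = Q
x, y, z = [1, 1, 1, 0], [1, 1, 0, 1], [0, 0, 0, 0]
print(q_distance(x, x, toy_qnet, toy_qnet))  # identical vectors
print(q_distance(x, y, toy_qnet, toy_qnet))  # similar vectors
print(q_distance(x, z, toy_qnet, toy_qnet))  # dissimilar vectors
```

Note that the distance compares the predicted distributions around each item rather than the raw answers, so two vectors that differ in several entries can still be close if they induce similar conditionals.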

The q-distance Metric: Why Is This a Natural Metric?

\textrm{items } X_1, X_2, \cdots , X_{i-1},X_{i+1}, \cdots, X_N
[Figure: possible responses a, b, c, d, e to item X_i]

Similar opinion/response vectors can spontaneously switch:

intrinsic metric quantifies the odds of this spontaneous switch

Theorem: q-distance is "natural"

\textrm{With $N$ distinct questions, at a significance level $\alpha$, we have }\\ \omega_y e^{\frac{\sqrt{8}N^2}{1-\alpha}\theta(x,y)} \geqq Pr(x \in P \rightarrow y \in Q) \geqq \omega_y e^{-\frac{\sqrt{8}N^2}{1-\alpha}\theta(x,y)}\\ \textrm{ where } \omega_y \textrm{ is the probability $y \in P$ }

Sanov's Theorem & Pinsker's Inequality

 Framework

\Phi_i:\prod_{j \neq i} \Sigma_j \rightarrow \mathcal{D}(\Sigma_i)

Qnets

\theta(x,y) \triangleq \mathbf{E}_i \left ( \mathbb{J}^{\frac{1}{2}} \left (\Phi_i^P(x_{-i}) , \Phi_i^Q(y_{-i})\right ) \right )

Q-distance

\left \vert \ln \frac{Pr(x \rightarrow y)}{Pr(y \rightarrow y)} \right \vert \leqq \beta \theta(x,y)
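This log-ratio form is the theorem's two-sided bound rewritten on a logarithmic scale. A short derivation follows, under the assumptions (not spelled out on the slide) that \(\omega_y > 0\) plays the role of \(Pr(y \rightarrow y)\) and that \(\beta\) abbreviates \(\sqrt{8}N^2/(1-\alpha)\):

```latex
\omega_y e^{-\beta\theta(x,y)} \leqq Pr(x \in P \rightarrow y \in Q) \leqq \omega_y e^{\beta\theta(x,y)},
\qquad \beta = \frac{\sqrt{8}N^2}{1-\alpha} \\
\Rightarrow\; -\beta\theta(x,y) \leqq \ln \frac{Pr(x \in P \rightarrow y \in Q)}{\omega_y} \leqq \beta\theta(x,y) \\
\Rightarrow\; \left\vert \ln \frac{Pr(x \in P \rightarrow y \in Q)}{\omega_y} \right\vert \leqq \beta\,\theta(x,y)
```

The first step divides through by \(\omega_y\) and takes logarithms, which preserves the inequalities because the logarithm is monotone; the second step combines the two sides into an absolute value.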

Dynamics 

Q-sampling as a means of synthetic data generation

Assume that one question \(X_i\) is unanswered.

\textrm{questions/opinions } X_1, X_2, \cdots , X_i, \cdots, X_N
[Figure: possible responses a, b, c, d, e to the unanswered item]

Distribution of responses to this item given remaining responses

X_i

Given this distribution the probability that "b" is the answer

Pr(x \in P \rightarrow y \in Q) = \prod_{i=1}^N\Phi_i^P(x_{-i}) \vert_{y_i}
\theta(x,y) \triangleq \mathbf{E}_i \left ( \mathbb{J}^{\frac{1}{2}} \left (\Phi_i^P(x_{-i}) , \Phi_i^Q(y_{-i})\right ) \right )\\
\omega_y e^{\frac{\sqrt{8}N^2}{1-\alpha}{\color{green}\theta(x,y)}} \geqq {\color{red} Pr(x \in P \rightarrow y \in Q) } \geqq \omega_y e^{-\frac{\sqrt{8}N^2}{1-\alpha}{\color{green}\theta(x,y)}}

Follows from first principles:

Distance metric such that log-likelihood of jump scales as the distance

theorem

Q-sampling is just Gibbs sampling

  • It is computationally difficult to directly sample a model distribution over hundreds or thousands of variables
  • Starting from a known sample, we may iteratively update its entries, one index at a time, by sampling the corresponding conditional distribution
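The iterative update can be sketched as follows. This is a minimal illustration under stated assumptions: the qnet is represented by a hypothetical list of conditional estimators, the index to update is chosen uniformly at random at each step (the slide does not specify the update order), and `toy_conditional` is an invented rule, not a learned model.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical stand-in for a qnet: conditionals[i] maps the other responses
# x_{-i} to a probability distribution over the possible answers to item i.
Conditional = Callable[[Sequence[int]], List[float]]


def q_sample(seed: Sequence[int], conditionals: List[Conditional],
             n_steps: int, rng: random.Random) -> List[int]:
    """Gibbs-style update: resample one randomly chosen item per step."""
    x = list(seed)  # copy, so the seed record itself is never modified
    for _ in range(n_steps):
        i = rng.randrange(len(x))
        rest = x[:i] + x[i + 1:]
        probs = conditionals[i](rest)
        # Draw the new answer for item i from its conditional distribution.
        x[i] = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return x


def toy_conditional(rest: Sequence[int]) -> List[float]:
    """Toy rule (an assumption): answers tend to agree with the other items."""
    p_one = (sum(rest) + 1) / (len(rest) + 2)
    return [1.0 - p_one, p_one]


rng = random.Random(0)
conditionals = [toy_conditional] * 5
original = [1, 1, 1, 0, 1]
synthetic = q_sample(original, conditionals, n_steps=20, rng=rng)
print(original, "->", synthetic)
```

The number of steps controls how far the synthetic record drifts from its seed: zero steps returns the original record unchanged, while many steps approach a draw from the model distribution.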

"Corrupting" datasets with Qnets

\textrm{Given dataset } \mathcal{D},
\mathcal{D}' = \left \{ \zeta(x,n): x \in \mathcal{D}, 1 \leqq n \leqq N\right \}

q-sampled responses

Perturbation of original dataset in terms of induced metric:

\displaystyle \Theta(\mathcal{D},\mathcal{D}') \triangleq \max_{x \in \mathcal{D}} \min_{y \in \mathcal{D}'} \theta(x,y)
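The max-min perturbation measure is straightforward to compute once a pairwise distance is available. In the sketch below the pairwise distance is passed in as a function; `hamming_fraction` is only a placeholder used to make the example runnable and is not the q-distance, and the two small datasets are invented.

```python
from typing import Callable, List, Sequence

Vector = Sequence[int]


def dataset_perturbation(original: List[Vector], corrupted: List[Vector],
                         theta: Callable[[Vector, Vector], float]) -> float:
    """Theta(D, D') = max over x in D of (min over y in D' of theta(x, y))."""
    return max(min(theta(x, y) for y in corrupted) for x in original)


def hamming_fraction(x: Vector, y: Vector) -> float:
    """Placeholder distance (NOT the q-distance): share of differing answers."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


D = [[1, 1, 0, 0], [0, 0, 1, 1]]
D_prime = [[1, 1, 0, 1], [0, 1, 1, 1], [1, 1, 1, 1]]
print(dataset_perturbation(D, D_prime, hamming_fraction))
```

Read this way, the measure is the distance from the worst-covered original record to its nearest synthetic record: a value of zero means every original record still appears verbatim in the corrupted dataset.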

Homework

Find \(f,g\) such that

f(\Theta(\mathcal{D},\mathcal{D}')) \leqq \epsilon \leqq g(\Theta(\mathcal{D},\mathcal{D}'))

Recall:

\mathcal{M} \textrm{ guarantees } \epsilon \textrm{ privacy budget if }
\forall S \subseteq \text{Range}(\mathcal{M}), \forall D, D' \text{ such that } ||D - D'||_1 \leq 1, \\ P(\mathcal{M}(D) \in S) \leq e^{\epsilon} P(\mathcal{M}(D') \in S)

?

ishanu@uchicago.edu

Data Governance and Privacy

By Ishanu Chattopadhyay
