Data Governance and Privacy

Ishanu Chattopadhyay

Assistant Professor of Medicine

University of Chicago

zed.uchicago.edu

Data Governance

Privacy in the age of AI

Friday the 20th, October 2023

Age

AI & Data

Vast databases of detailed personal information are everywhere

Data privacy and governance is crucial to protect individual rights and maintain trust.

Machine learning and AI have posed new challenges

name	age	occupation	address	diabetes
Alice	20	doctor	23 maple st PA	No
Bob	35	teacher	12 E av IL	Yes
Charlie	43	farmer	42 oak st IL	No

Classical Example

Truven MarketScan (IBM)
Commercial Claims & Encounters Database

2003-2021

150M patients

12B transactions

Current Example

Discover and leverage comorbidity patterns from large patient databases
Infer or predict medical conditions for individuals from sparse, noisy medical history

End of Privacy?

Universal screening for complex disorders
Prevent missed or late diagnosis
Accelerate scientific research by making it cheaper to do clinical trials

Governance must not limit AI capabilities

Example:

Onishchenko, D., Marlowe, R.J., Ngufor, C.G. et al. Screening for idiopathic pulmonary fibrosis using comorbidity signatures in electronic health records. Nat Med 28, 2107–2116 (2022). https://doi.org/10.1038/s41591-022-02010-y

ZCoR: Zero-burden Co-morbid Risk Score

shortness of breath

dry cough

doctor can hear velcro crackles

Common Symptoms

>50 years old

more men than women

IPF

Rare disease

~5 in 10,000

Post-Dx

Survival

~4 years

At least one misdiagnosis

~55%

Two or more misdiagnoses

38%

Initially attributed to age-related symptoms:

72%

Cannot always be seen on CXR

Non-specific symptoms

PCP workflow demands

Initial misdiagnoses

[Figure: timeline marking ZCoR screening and current clinical Dx, with ~4 yr intervals; post-Dx survival ~4 yrs]

ZCoR screening

n=~3M

AUC~90%

Likelihood ratio ~30
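A quick worked calculation helps put a likelihood ratio of ~30 in context for a rare disease. The sketch below assumes the figure is a positive likelihood ratio and that it can be combined with the ~5 in 10,000 prevalence quoted earlier; both are assumptions made for illustration, and the function name is hypothetical.

```python
def post_test_probability(prevalence: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability into a post-test probability via odds."""
    pre_odds = prevalence / (1.0 - prevalence)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


prevalence = 5 / 10_000    # "~5 in 10,000" from the slides
likelihood_ratio = 30.0    # "Likelihood ratio ~30", assumed to be a positive LR

p_flagged = post_test_probability(prevalence, likelihood_ratio)
print(f"Probability of IPF among flagged patients: {p_flagged:.2%}")
```

Under these assumptions a flag raises the probability of disease from 0.05% to about 1.5%, which is why a flag is best read as a trigger for follow-up rather than a diagnosis.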

Conventional AI/ML attempts to model the physician

AI in IPF Research

Discover and leverage comorbidity patterns
No data demands
Use whatever data is already in the patient file

Primary Care

Reliable screening and diagnosis

ZCoR Flag

No blood tests
No imaging
No pulmonary function tests

Sparse, noisy data with a priori unknown "risk factors"

Correct labels (diagnoses)

ZeD Lab: Predictive Screening from Comorbidity Footprints

Nature Medicine

JAHA

CELL Reports

Science Adv.

Predictive Screening from Comorbidity Footprints

	ZED performance	Competition
Autism	>80% AUC at 2 yrs	Double false positives
Alzheimer's Disease	~90% AUC	60-70% AUC
Idiopathic Pulmonary Fibrosis	~90% AUC	NA
MACE	~80% AUC	~70% AUC
Bipolar Disorder	~85% AUC	NA
CKD	~85% AUC	NA
Cancers	~75% AUC	NA

Complex systems with many variables

Cross-talk

Constrained "feasible space"

Reconstruction from noisy incomplete data

How can we maintain the statistical properties of data while ensuring individual privacy?
Can we 'corrupt' data to destroy identifiability without compromising its utility?

The Classical Challenge

Introduce 'noise' or alterations to the data in a way that individual records cannot be traced back, but the overall statistical patterns remain intact.

Corrupting Data
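One classical way to corrupt individual records while keeping population statistics recoverable is randomized response. It is not named on the slides and is shown here only as a minimal sketch; the function names and the 30% true rate are illustrative assumptions.

```python
import random


def randomize(truth: bool, rng: random.Random) -> bool:
    """Report the truth on a first coin flip of heads, else a second coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_rate(reports: list) -> float:
    """Invert P(report yes) = 0.5 * p + 0.25 to recover the true rate p."""
    observed = sum(reports) / len(reports)
    return 2.0 * observed - 0.5


rng = random.Random(0)
true_rate = 0.30  # hypothetical fraction of "Yes" answers in the population
truths = [rng.random() < true_rate for _ in range(100_000)]
reports = [randomize(t, rng) for t in truths]
estimate = estimate_rate(reports)
print(f"true rate {sum(truths) / len(truths):.3f}, estimate {estimate:.3f}")
```

No single report reveals the respondent's true answer with certainty, yet the aggregate rate is recovered up to sampling error.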

Identifiability refers to the ability to trace back data to an individual.
Destroying it ensures privacy but poses challenges in maintaining data integrity.

Destroying Identifiability

Mathematically, determining the right amount and type of 'noise' to add is non-trivial.

Too much alteration can render data useless, while too little can compromise privacy.

Mathematical Challenges

Mathematical framework to quantify and manage privacy risks

Differential Privacy

(Dwork 2006)

Removing or adding a single data point does not significantly impact the outcome of any analysis, thus ensuring the protection of individual privacy.

\mathcal{M} \textrm{ guarantees } \epsilon \textrm{ privacy budget if } \\ \forall S \subseteq \text{Range}(\mathcal{M}), \forall D, D' \text{ such that } ||D - D'||_1 \leq 1, \\ P(\mathcal{M}(D) \in S) \leq e^{\epsilon} P(\mathcal{M}(D') \in S)
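The definition can be checked numerically on a small case. The sketch below uses the Laplace mechanism on a counting query, a standard construction that the slides do not name, applied to the three-row toy table shown earlier. The helper names and the choice epsilon = 0.5 are assumptions for illustration.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Counting query with sensitivity 1, so the noise scale is 1 / epsilon."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def laplace_density(z: float, center: float, scale: float) -> float:
    return math.exp(-abs(z - center) / scale) / (2.0 * scale)


records = [("Alice", False), ("Bob", True), ("Charlie", False)]  # (name, diabetes)
epsilon = 0.5
rng = random.Random(0)
noisy = private_count(records, lambda r: r[1], epsilon, rng)

# Neighboring datasets: with Bob the true count is 1, without Bob it is 0.
# The output density ratio should never exceed e^epsilon.
grid = [k / 10 for k in range(-50, 51)]
worst_ratio = max(
    laplace_density(z, 1.0, 1.0 / epsilon) / laplace_density(z, 0.0, 1.0 / epsilon)
    for z in grid
)
print(f"noisy count {noisy:.2f}, worst ratio {worst_ratio:.4f}, bound {math.exp(epsilon):.4f}")
```

A smaller epsilon means a larger noise scale, which is the privacy versus utility trade-off in its simplest form.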

Synthetic data generation as a mode of data corruption

original data

generate synthetic data

Learn generative model

has identifiable information

has no identifiable information

How do we know we did a good job?

Statistical tests

Membership Inference attacks
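Two very simple checks in this spirit can be sketched as follows: a total variation distance between per-column marginals (a crude statistical test of fidelity) and the fraction of synthetic rows that exactly copy an original row (a crude proxy for memorization, far weaker than a real membership inference attack). The function names and toy rows are hypothetical.

```python
from collections import Counter


def marginal_tv_distance(original, synthetic, column: int) -> float:
    """Total variation distance between the two marginals of one column."""
    p = Counter(row[column] for row in original)
    q = Counter(row[column] for row in synthetic)
    values = set(p) | set(q)
    return 0.5 * sum(
        abs(p[v] / len(original) - q[v] / len(synthetic)) for v in values
    )


def exact_copy_rate(original, synthetic) -> float:
    """Fraction of synthetic rows that appear verbatim in the original data."""
    seen = set(original)
    return sum(1 for row in synthetic if row in seen) / len(synthetic)


original = [("agree", "neutral"), ("disagree", "agree"),
            ("agree", "agree"), ("neutral", "neutral")]
synthetic = [("agree", "agree"), ("disagree", "neutral"),
             ("agree", "neutral"), ("neutral", "agree")]

tv_first_column = marginal_tv_distance(original, synthetic, 0)
copy_rate = exact_copy_rate(original, synthetic)
print(tv_first_column, copy_rate)
```

Matching marginals alone do not show that joint structure is preserved, and a low copy rate alone does not show that privacy is preserved; both checks are necessary rather than sufficient.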

Qnets: A New Model for General Synthetic Data Generation

for tabular data

with mathematical guarantee of "good generator"

Differentially Private Generative Adversarial Networks (DP-GANs) combine the power of Generative Adversarial Networks (GANs) with the privacy guarantees of differential privacy.

Xie, L., Lin, K., Wang, S., Wang, F., & Zhou, J. (2018). Differentially Private Generative Adversarial Network. arXiv preprint arXiv:1802.06739.

Existing Methods based on Adversarial Networks

discriminator

generator

Qnets: A New Model for General Synthetic Data Generation

CAD-PTSD Dataset

211 items
304 respondents

	PTSD1	PTSD2	PTSD3
patient1	agree
patient2	disagree
patient3	strongly agree
patient4	neutral
patient5

items

respondents

example

Intrinsic Structure of Survey Responses

Can we determine if a response vector is "valid"?
Can we distinguish algorithmically between actual/honest responses vs random/adversarial responses?

Each response vector is an element of the "response space".
Is there a natural metric on the response space? What would such a metric mean intuitively?

PTSD4

PTSD93

PTSD86

QNet Trees

(3 out of 211)

Together they form

a recursive forest

Nodes "hyperlinked" to trees: Potentially Infinite Hierarchy

The QNet Structure

The q-distance Metric

Collection of all such conditional inference trees is the recursive forest, answering the following question:

\textrm{If we have $n$ items/questions } X_1, \cdots , X_n, \\ \textrm{ and we have a subject responding with}\\ {\color{yellow} x_1, \cdots, x_{i-1},x_{i+1},\cdots, x_{n}, }\\ \textrm{ then the distribution of responses to question $X_i$ is given by } \\ {\color{yellow}\Phi_i:\prod_{j \neq i} \Sigma_j \rightarrow \mathcal{D}(\Sigma_i)}\\ \textrm{ where } \mathcal{D}(\Sigma_i) \textrm{ is the set of all possible distributions}\\ \textrm{over the set of all possible responses $\Sigma_i$ }
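To make the map from the other responses to a distribution over one item concrete, here is a minimal sketch that estimates it with a smoothed lookup table. This is a stand-in chosen for brevity, not the conditional inference trees that qnets use, and it is only workable for a handful of items; `fit_phi`, the smoothing constant, and the toy rows are all assumptions.

```python
from collections import Counter, defaultdict


def fit_phi(data, i: int, alphabet, smoothing: float = 1.0):
    """Estimate P(X_i | all other items) by counting, with additive smoothing."""
    table = defaultdict(Counter)
    for row in data:
        context = row[:i] + row[i + 1:]   # responses to every item except i
        table[context][row[i]] += 1

    def phi(x):
        context = x[:i] + x[i + 1:]
        counts = table.get(context, Counter())
        total = sum(counts.values()) + smoothing * len(alphabet)
        return {s: (counts[s] + smoothing) / total for s in alphabet}

    return phi


alphabet = ("agree", "neutral", "disagree")
data = [
    ("agree", "agree", "agree"),
    ("agree", "agree", "neutral"),
    ("agree", "agree", "agree"),
    ("disagree", "disagree", "disagree"),
]

phi_2 = fit_phi(data, 2, alphabet)
dist = phi_2(("agree", "agree", "neutral"))  # entry at index 2 is ignored
print(dist)
```

With hundreds of items almost every context is unseen, which is why a tree-based estimator that conditions on only a few informative items is needed in practice.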

The q-distance Metric

\textrm{Intrinsic metric between response vectors}

\textrm{For two opinion vectors $x,y$}

\theta(x,y) \triangleq \mathbf{E}_i \left ( \mathbb{J}^{\frac{1}{2}} \left (\Phi_i^P(x_{-i}) , \Phi_i^Q(y_{-i})\right ) \right )

\textrm{where $P,Q$ are two possibly distinct populations}\\ \textrm{with distinct qnets, such that }\\ x \in P, y \in Q \textrm{ and }\\ \mathbb{J} \textrm{ is the Jensen-Shannon divergence }
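The q-distance can be sketched in a few lines once the conditional distributions are available. The code below takes the expectation over items as a uniform average and uses base-2 logarithms, both of which are assumptions, and it uses a hypothetical `toy_phi` in place of a trained qnet.

```python
import math


def js_divergence(p, q) -> float:
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log2(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def q_distance(x, y, phi_p, phi_q) -> float:
    """Average over items i of sqrt(JS(Phi_i^P(x_{-i}), Phi_i^Q(y_{-i})))."""
    n = len(x)
    total = 0.0
    for i in range(n):
        js = js_divergence(phi_p(i, x), phi_q(i, y))
        total += math.sqrt(max(js, 0.0))  # guard against tiny negative rounding
    return total / n


def toy_phi(i, x, k=3):
    """Hypothetical stand-in for Phi_i: favors symbols common among other items."""
    counts = [1.0] * k
    for j, v in enumerate(x):
        if j != i:
            counts[v] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


x = (0, 0, 1, 2)
y = (2, 2, 1, 0)
d_xy = q_distance(x, y, toy_phi, toy_phi)
d_xx = q_distance(x, x, toy_phi, toy_phi)
print(d_xy, d_xx)
```

Note that the distance compares the predicted distributions around each item rather than the responses themselves, so two vectors that differ in many entries can still be close.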

The q-distance Metric: Why Is This a Natural Metric?

\textrm{items } X_1, X_2, \cdots , X_{i-1},X_{i+1}, \cdots, X_N

X_i

Similar opinion/response vectors can spontaneously switch:

intrinsic metric quantifies the odds of this spontaneous switch

Theorem: q-distance is "natural"

\textrm{With $N$ distinct questions, at a significance level $\alpha$, we have }\\ \omega_y e^{\frac{\sqrt{8}N^2}{1-\alpha}\theta(x,y)} \geqq Pr(x \in P \rightarrow y \in Q) \geqq \omega_y e^{-\frac{\sqrt{8}N^2}{1-\alpha}\theta(x,y)}\\ \textrm{ where } \omega_y \textrm{ is the probability $y \in P$ }

Sanov's Theorem & Pinsker's Inequality

Framework

\Phi_i:\prod_{j \neq i} \Sigma_j \rightarrow \mathcal{D}(\Sigma_i)

Qnets

\theta(x,y) \triangleq \mathbf{E}_i \left ( \mathbb{J}^{\frac{1}{2}} \left (\Phi_i^P(x_{-i}) , \Phi_i^Q(y_{-i})\right ) \right )

Q-distance

\left \vert \ln \frac{Pr(x \rightarrow y)}{Pr(y \rightarrow y)} \right \vert \leqq \beta \theta(x,y)

Dynamics

Q-sampling as a means of synthetic data generation

Assume that one question $X_i$ is unanswered.

\textrm{questions/opinions } X_1, X_2, \cdots , X_i, \cdots, X_N

Distribution of responses to this item given remaining responses

X_i

Given this distribution the probability that "b" is the answer

Pr(x \in P \rightarrow y \in Q) = \prod_{i=1}^N\Phi_i^P(x_{-i}) \vert_{y_i}
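The product formula translates directly into code: for each item, evaluate the conditional distribution at the rest of x and read off the probability assigned to the corresponding entry of y. The sketch below works in log space to avoid underflow with many items and uses a hypothetical `toy_phi` in place of a trained qnet.

```python
import math


def toy_phi(i, x, k=3):
    """Hypothetical stand-in for Phi_i^P: favors symbols common among other items."""
    counts = [1.0] * k
    for j, v in enumerate(x):
        if j != i:
            counts[v] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def jump_log_probability(x, y, phi) -> float:
    """log Pr(x -> y) = sum_i log Phi_i(x_{-i}) evaluated at y_i."""
    return sum(math.log(phi(i, x)[y[i]]) for i in range(len(x)))


x = (0, 0, 1)
y = (0, 0, 0)
log_p = jump_log_probability(x, y, toy_phi)
print(math.exp(log_p))
```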

\theta(x,y) \triangleq \mathbf{E}_i \left ( \mathbb{J}^{\frac{1}{2}} \left (\Phi_i^P(x_{-i}) , \Phi_i^Q(y_{-i})\right ) \right )

\omega_y e^{\frac{\sqrt{8}N^2}{1-\alpha}{\color{green}\theta(x,y)}} \geqq {\color{red} Pr(x \in P \rightarrow y \in Q) } \geqq \omega_y e^{-\frac{\sqrt{8}N^2}{1-\alpha}{\color{green}\theta(x,y)}}

Follows from first principles:

Distance metric such that log-likelihood of jump scales as the distance

theorem

Q-sampling is just Gibbs sampling

It is computationally difficult to sample directly from a model distribution over hundreds or thousands of variables

Starting from a known sample, we may iteratively update its entries, one index at a time, by sampling the corresponding conditional distribution
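The update loop can be sketched as a random-scan Gibbs sampler. Picking the index uniformly at random is an assumption (a fixed sweep order would also be a valid Gibbs scheme), and `toy_phi` is a hypothetical stand-in for the trained conditional distributions.

```python
import random


def toy_phi(i, x, k=3):
    """Hypothetical stand-in for Phi_i: favors symbols common among other items."""
    counts = [1.0] * k
    for j, v in enumerate(x):
        if j != i:
            counts[v] += 1.0
    total = sum(counts)
    return [c / total for c in counts]


def q_sample(x, phi, steps: int, rng: random.Random):
    """Resample one randomly chosen entry per step from its conditional."""
    state = list(x)
    for _ in range(steps):
        i = rng.randrange(len(state))
        dist = phi(i, state)
        state[i] = rng.choices(range(len(dist)), weights=dist, k=1)[0]
    return tuple(state)


seed_response = (0, 0, 1, 2, 2)
sampled = q_sample(seed_response, toy_phi, steps=20, rng=random.Random(1))
print(sampled)
```

The number of steps controls how far the sample drifts from the known starting response: zero steps returns the original record unchanged.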

"Corrupting" datasets with Qnets

\textrm{Given dataset } \mathcal{D},

\mathcal{D}' = \left \{ \zeta(x,n): x \in \mathcal{D}, 1 \leqq n \leqq N\right \}

q-sampled responses

Perturbation of original dataset in terms of induced metric:

\displaystyle \Theta(\mathcal{D},\mathcal{D}') \triangleq \max_{x \in \mathcal{D}} \min_{y \in \mathcal{D}'} \theta(x,y)
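The max-min construction is a one-sided (directed) Hausdorff distance from the original dataset to the corrupted one, and it can be computed for any pairwise distance. In the sketch below a normalized Hamming distance stands in for the q-distance purely to keep the example self-contained; that substitution and the function names are assumptions.

```python
def hamming_fraction(x, y) -> float:
    """Stand-in pairwise distance: fraction of positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def dataset_perturbation(original, corrupted, distance) -> float:
    """max over x in original of min over y in corrupted of distance(x, y)."""
    return max(min(distance(x, y) for y in corrupted) for x in original)


original = [(0, 0), (1, 1)]
corrupted = [(0, 0), (1, 0)]
theta_big = dataset_perturbation(original, corrupted, hamming_fraction)
print(theta_big)
```

A value of zero means every original record has an exact counterpart in the corrupted set; larger values mean at least one original record has been moved far from anything that was released.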

Homework

Find $f,g$ such that

f(\Theta(\mathcal{D},\mathcal{D}')) \leqq \epsilon \leqq g(\Theta(\mathcal{D},\mathcal{D}'))

Recall:

\mathcal{M} \textrm{ guarantees } \epsilon \textrm{ privacy budget if } \\ \forall S \subseteq \text{Range}(\mathcal{M}), \forall D, D' \text{ such that } ||D - D'||_1 \leq 1, \\ P(\mathcal{M}(D) \in S) \leq e^{\epsilon} P(\mathcal{M}(D') \in S)

ishanu@uchicago.edu