Differential Privacy

Color Mixing

Differential Privacy is kind of like color mixing on hard mode!

Standard data encryption offers little to no protection here. Like the color purple, you can easily decipher its constituents (the sensitive information): red and blue.

Problems with other standard approaches to data privacy protection

Data Encryption

advanced computer algorithms can decrypt the data easily

 

Anonymization

example: an "anonymized" data set still includes gender, residence, age, DOB, and malaria status

---> cross-referencing these attributes with other available data sets lets an attacker deduce who the data belongs to

---> sensitive data is leaked

^^ this is known as a linkage attack

Mediated Access

analysts can still ask specific, targeted questions of the trusted "curator" and combine the answers to identify who the data belongs to

This is differential privacy:

Imagine your raw data is green, but differential privacy encodes it as an ambiguous, noisy green. We (the data seekers) can still obtain the useful information that it's green, but it is much harder to find its exact components.

Definition

  • Learn nothing about any individual while still learning useful information about the population
    • trade-off between privacy and accuracy
  • A concept rather than a specific algorithm
  • no additional harm is done to an individual should they provide their data
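
For reference, the standard formal statement behind these bullets (a textbook fact, not spelled out on the slide): a randomized algorithm M is ε-differentially private if, for every pair of neighboring data sets D and D′ (differing in one person's entry) and every set of possible outputs S,

    Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D′) ∈ S]

A smaller ε forces the two output distributions closer together, i.e. stronger privacy.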

To achieve differential privacy, we have to introduce randomness. The amount of randomness depends on:

1. sensitivity of the query (global sensitivity)

2. desired level of privacy (privacy-utility trade-off)

sensitivity of the query + desired level of privacy  →  amount of randomness (aka noise)
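
A standard way these two factors combine (a textbook fact, illustrated here with the Laplace mechanism introduced later): the scale of the added noise is

    noise scale b = Δf / ε

where Δf is the global sensitivity of the query and ε is the privacy parameter, so a more sensitive query or a smaller ε (stronger privacy) both require more noise.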

Global Sensitivity

  • how much one data entry influences the outcome of a query
  • formally: maximum possible difference between the function output on two neighboring data sets
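
Written out, the "formally" bullet above corresponds to

    Δf = max over neighboring data sets D, D′ of |f(D) − f(D′)|

where D and D′ differ in a single entry. The Laplace mechanism later in the deck measures this difference with the L1 (Manhattan) norm; the Gaussian mechanism uses the L2 (Euclidean) norm.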

Global Sensitivity: low sensitivity

Query: is it still a mix of colors?

(asked of two neighboring color mixes that differ by one color) → yes / yes

Global Sensitivity: high sensitivity

Query: a mix of how many colors?

(asked of the same two neighboring color mixes) → 3 / 2

Global Sensitivity

difference of one data entry heavily impacts the function output = high sensitivity

difference of one data entry negligibly impacts the function output = low sensitivity

  • higher sensitivity requires more noise to compensate
  • lower sensitivity requires less noise to compensate
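
A minimal sketch of the color example in code (the data sets and queries are made up to mirror the slides): the yes/no query barely changes between neighboring data sets, while the counting query changes by a full unit.

    colors = ["red", "yellow", "blue"]   # original "mix"
    neighbor = ["red", "yellow"]         # neighboring data set: one entry removed

    def is_still_a_mix(data):
        # low-sensitivity query: the answer rarely changes when one entry is removed
        return len(set(data)) > 1

    def number_of_colors(data):
        # higher-sensitivity query: every added or removed color shifts the answer by 1
        return len(set(data))

    print(is_still_a_mix(colors), is_still_a_mix(neighbor))      # True True
    print(number_of_colors(colors), number_of_colors(neighbor))  # 3 2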

Privacy-Utility Trade-off

If you want accuracy, there is less privacy

Easier to infer blue and yellow from green

If you want privacy, there is less accuracy

harder to infer blue and yellow from green


Summary of Randomness

  • Sensitivity of query: direct relationship with amount of noise we want to add
  • Utility: inverse relationship with amount of noise we want to add

How do we add randomness to our data set?

One example: Randomized Response

  • developed to collect sensitive/embarrassing information
  • participants report whether or not they have property P
  • each participant first flips a coin:
    • tails = respond truthfully
    • heads = ignore the truth and flip a second coin:
      • heads = answer "yes"
      • tails = answer "no"
  • respondent is not incriminated: there is a 1/4 probability that any given answer does not reflect the truth, so no single answer is conclusive (see the simulation sketch below)
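
A hedged simulation sketch of this protocol (the 30% "true rate" and the estimator are illustrative, not from the slides). Because P(yes) = 0.5 × (true rate) + 0.25, the analyst can still recover the population rate from the noisy answers:

    import random

    def randomized_response(has_property):
        if random.random() < 0.5:      # first flip comes up tails -> respond truthfully
            return has_property
        return random.random() < 0.5   # heads -> second flip decides: heads = "yes", tails = "no"

    # Simulate a population where 30% truly have property P.
    true_rate = 0.30
    answers = [randomized_response(random.random() < true_rate) for _ in range(100_000)]

    # Invert P(yes) = 0.5 * true_rate + 0.25 to estimate the population rate.
    observed_yes = sum(answers) / len(answers)
    estimated_rate = 2 * observed_yes - 0.5
    print(f"observed yes-rate: {observed_yes:.3f}, estimated true rate: {estimated_rate:.3f}")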

Synthetic Data Set

  • A completely new, artificial set of data that reflects the same attributes (and similar statistics) as the original data set
  • Example: given that my (only) attribute is the number of colors mixed, the original data set and the synthetic data set share that same attribute (see the toy sketch below)!

(image: original color mix vs. synthetic color mix)
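
One toy way to build such a set (an illustrative sketch under assumptions not in the slides: a single numeric attribute and a Laplace-noised histogram):

    import numpy as np
    from collections import Counter

    original = [3, 2, 3, 4, 2, 3, 3, 2]   # attribute: number of colors mixed per record
    epsilon = 1.0
    sensitivity = 1                        # one record changes one histogram count by at most 1

    # (1) histogram the original attribute, (2) add Laplace noise to each count,
    # (3) resample brand-new records from the noisy histogram.
    hist = Counter(original)
    noisy = {value: max(count + np.random.laplace(0, sensitivity / epsilon), 0)
             for value, count in hist.items()}

    values = list(noisy.keys())
    weights = np.array(list(noisy.values()))
    synthetic = list(np.random.choice(values, size=len(original), p=weights / weights.sum()))
    print(synthetic)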

Specific Algorithms for Randomization

Laplace

  • adds noise from the Laplace distribution
  • uses Manhattan (L1) distance to determine the noise scale
  • good for simple numeric queries (see the sketch below)
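
A minimal sketch of the Laplace mechanism (the query, its sensitivity, and ε below are made-up examples):

    import numpy as np

    def laplace_mechanism(true_answer, sensitivity, epsilon):
        # Noise scale b = sensitivity / epsilon: a more sensitive query or a
        # smaller epsilon (stronger privacy) both mean wider noise.
        return true_answer + np.random.laplace(0, sensitivity / epsilon)

    # Example: a counting query ("how many people in the data set have malaria?")
    # has global sensitivity 1, since one person changes the count by at most 1.
    print(laplace_mechanism(true_answer=42, sensitivity=1, epsilon=0.5))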

Gaussian

  • adds noise from the normal (Gaussian) distribution
  • uses Euclidean (L2) distance to determine the noise scale
  • good for complex analyses and multiple queries (see the sketch below)
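
A minimal sketch of the Gaussian mechanism (the σ calibration below is the standard textbook formula for (ε, δ)-differential privacy; the query and parameters are made-up examples):

    import numpy as np

    def gaussian_mechanism(true_answer, l2_sensitivity, epsilon, delta):
        # Standard calibration: sigma = sqrt(2 ln(1.25/delta)) * (L2 sensitivity) / epsilon
        sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / epsilon
        return true_answer + np.random.normal(0, sigma, size=np.shape(true_answer))

    # Example: a vector-valued query (say, the averages of three numeric columns).
    true_answer = np.array([12.0, 7.5, 3.2])
    print(gaussian_mechanism(true_answer, l2_sensitivity=1.0, epsilon=1.0, delta=1e-5))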

Some Implications of Differential Privacy

  • quantification of privacy loss (the privacy parameter puts a number on how much is revealed)
  • protection against arbitrary risks (the acceptable loss threshold is quantified by the researcher)
  • protection against post-processing (a data analyst cannot further process the output of a differentially private algorithm to increase the privacy loss)

anni zhao diff. privacy

By Dan Ryan