Data Methods: Ethics and Privacy


Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

How would medical data be protected? 

Who can access these data? 

  • Hospitals

  • Doctors

  • Apple and data hosts

  • Governments

How much do data companies like Google or Facebook influence our lives?

  • Elections

  • Consumption

  • Decisions

  • Activities 

Do we need more education to read new, big data?

  • Neighborhood crime rate

  • Medical/Health data (aggregate)

  • Government data

  • News media

  • Social media

What is data privacy?

Privacy “encompasses not only the famous ‘right to be left alone,’ (Warren and Brandeis 1890) or keeping one’s personal matters and relationships secret, but also the ability to share information selectively but not publicly”*
 (Foster et al. p. 299)



* President’s Council of Advisors on Science and Technology. Big data and privacy: A technological perspective. Technical report, Executive Office of the President, 2014.

What is Confidentiality?

Confidentiality is “preserving authorized restrictions on information access and disclosure, including means for protecting personal privacy and proprietary information” (McCallister, Grance, and Scarfone 2010).


With the advent of Big data

“‘Big data’ has great potential to benefit society. At the same time, its availability creates significant potential for mistaken, misguided or malevolent uses of personal information.

The conundrum for the law is to provide space for big data to fulfill its potential for societal benefit, while protecting citizens adequately from related individual and social harms.

Current privacy law evolved to address different concerns and must be adapted to confront big data’s challenges.” (Strandburg 2014)

Sensitive Personal Data

  • Racial or ethnic origin

  • Political opinions

  • Religious beliefs

  • Trade union activities

  • Personal health

  • Sexual life

  • Criminal offences

Aggregate vs. Subgroup vs. Individual data

  • Aggregate data

  • Subgroup data

  • Individual data
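To make the three levels concrete, the sketch below summarizes one small table at each level. The records, the field names, and the `min_cell` threshold of 3 are illustrative assumptions, not values from the source. The point is only that subgroup cells containing very few people start to behave like individual data.

```python
from collections import defaultdict
from statistics import mean

# Synthetic, made-up records (individual-level data): one row per person.
records = [
    {"id": 1, "county": "A", "age_group": "18-39", "income": 42000},
    {"id": 2, "county": "A", "age_group": "18-39", "income": 51000},
    {"id": 3, "county": "A", "age_group": "18-39", "income": 47000},
    {"id": 4, "county": "A", "age_group": "40-64", "income": 63000},
    {"id": 5, "county": "B", "age_group": "18-39", "income": 39000},
    {"id": 6, "county": "B", "age_group": "40-64", "income": 58000},
    {"id": 7, "county": "B", "age_group": "40-64", "income": 61000},
    {"id": 8, "county": "B", "age_group": "40-64", "income": 66000},
]

def aggregate_mean(rows):
    """Aggregate level: a single summary for the whole population."""
    return mean(r["income"] for r in rows)

def subgroup_means(rows, keys, min_cell=3):
    """Subgroup level: one summary per cell. Cells with fewer than
    min_cell people (a hypothetical threshold) are suppressed as None."""
    cells = defaultdict(list)
    for r in rows:
        cells[tuple(r[k] for k in keys)].append(r["income"])
    return {
        cell: (mean(vals) if len(vals) >= min_cell else None)
        for cell, vals in cells.items()
    }

overall = aggregate_mean(records)
by_cell = subgroup_means(records, ["county", "age_group"])

print("aggregate:", overall)
for cell, value in sorted(by_cell.items()):
    print("subgroup:", cell, value)
```

In this toy table two cells contain a single person each, so they are suppressed: publishing their "mean" would amount to publishing one person's income.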

Choice between Utility and Privacy

As more data become available externally, it becomes more difficult to maintain privacy.


Some data are more valuable than others.

Weight of the Extreme Values

"Spending on health care services in the United States is highly concentrated among a small proportion of people with extremely high use. For the overall civilian population living in the community, the latest data indicate that more than 20% of all personal health care spending in 2009 ($275 billion) was on behalf of just 1% of the population (Schoenman 2012)".

- Bender et al. 2017
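The arithmetic behind a concentration figure of this kind is easy to reproduce. The sketch below uses made-up spending values (not the 2009 figures quoted above) and a hypothetical helper, `top_share`, to compute the share of a total that is attributable to the top few percent of people.

```python
import math

def top_share(values, percent):
    """Share of the total accounted for by the top `percent` of values.

    `percent` is a whole-number percentage (e.g. 1 for the top 1%).
    The size of the top group is rounded up, so it always contains
    at least one person.
    """
    if not values or not 0 < percent <= 100:
        raise ValueError("need non-empty values and 0 < percent <= 100")
    ranked = sorted(values, reverse=True)
    k = max(1, math.ceil(len(ranked) * percent / 100))
    return sum(ranked[:k]) / sum(ranked)

# Made-up population of 100 people: 99 spend 1,000 each, one spends 30,000.
spending = [1_000] * 99 + [30_000]

share = top_share(spending, 1)
print(f"top 1% share of spending: {share:.1%}")
```

A record this extreme carries a large part of the analytical signal, and for the same reason it is among the easiest to single out.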

Why Identify subjects?

  • Purpose of study

  • Disciplines

    • Journalism

    • Medicine

    • Marketing

      • Smart marketing

    • Security

    • What else?

Risks of Identifying subjects, even unintentionally

  • Design

  • External data sources

  • Sample size

  • Question design

Why Data Access?

Pro                                  | Con
Data is a public good.               | Individual data is private.
Innovations                          | Technology gap
Optimize redistribution              | "Information" inequality
More data access, more public good.  | More data access, less personal protection.
National security                    | Personal rights

Providing Data Access

Dissemination of data to the public usually occurs in three steps:

  1. Evaluation of disclosure risks

  2. Data anonymization

  3. Evaluation of disclosure risks and analytical quality of the candidate data release(s). 
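The three steps can be sketched on a toy table. This is a minimal illustration under stated assumptions, not a production disclosure-control method: the records are synthetic, the quasi-identifiers (`zip`, `birth_year`), the helper names, and the coarsening rule are all hypothetical. Risk is measured simply as the share of records that are unique on the quasi-identifiers, and analytical quality is checked only by comparing one mean and the number of distinct ZIP values before and after.

```python
from collections import Counter
from statistics import mean

QUASI_IDENTIFIERS = ("zip", "birth_year")  # assumed for this example

# Synthetic records: no direct identifiers, but quasi-identifiers remain.
original = [
    {"zip": "75080", "birth_year": 1961, "visits": 2},
    {"zip": "75080", "birth_year": 1964, "visits": 5},
    {"zip": "75081", "birth_year": 1962, "visits": 1},
    {"zip": "75081", "birth_year": 1968, "visits": 7},
    {"zip": "75252", "birth_year": 1983, "visits": 3},
    {"zip": "75254", "birth_year": 1987, "visits": 4},
]

def unique_share(rows, keys=QUASI_IDENTIFIERS):
    """Steps 1 and 3: share of records whose quasi-identifier
    combination appears only once (a simple disclosure-risk proxy)."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    unique = sum(1 for r in rows if counts[tuple(r[k] for k in keys)] == 1)
    return unique / len(rows)

def coarsen(row):
    """Step 2: generalize quasi-identifiers (3-digit ZIP, birth decade)."""
    out = dict(row)
    out["zip"] = row["zip"][:3] + "**"
    out["birth_year"] = row["birth_year"] // 10 * 10
    return out

risk_before = unique_share(original)                # step 1
candidate = [coarsen(r) for r in original]          # step 2
risk_after = unique_share(candidate)                # step 3: risk

# Step 3: analytical quality, checked very crudely.
mean_before = mean(r["visits"] for r in original)
mean_after = mean(r["visits"] for r in candidate)
zips_before = len({r["zip"] for r in original})
zips_after = len({r["zip"] for r in candidate})

print("unique share before/after:", risk_before, risk_after)
print("mean visits before/after:", mean_before, mean_after)
print("distinct ZIP values before/after:", zips_before, zips_after)
```

In this example the overall mean survives because the outcome variable is untouched, while geographic detail is lost: an analysis by 5-digit ZIP is no longer possible on the candidate release. That trade-off is what the third step is meant to expose.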

Controlling Data Access

The two main approaches:

  1. Statistical disclosure control

    • anonymized public use data sets

  2. Controlled access through a research data center.

The Debate

Are you willing to share your data?

To whom?

How much?

Who can protect our data?

A. The Government

B. Legal Framework

C. Scientists

D. Yourself

Why do data scientists have to be concerned?

  • Risk could jeopardize the whole project

  • Legal issues

  • Social responsibilities

    • "Do no harm" Rule

  • Sustainability 

Personally Identifiable Information (PII)

PII is “any information about an individual maintained by an agency, including:

  1. Any information that can be used to distinguish or trace an individual’s identity, such as name, social security number, date and place of birth, mother’s maiden name, or biometric records; and
  2. Any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information”



  • Remove personal data plus geographic data (e.g. address, county, state, etc.)


  • Re-identify individuals from anonymized data

    • Merge with external data
    • Ecological Inference (King, Rosen and Tanner 2004)
    • Google
    • A.I.
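The first route, merging with external data, can be illustrated with a toy linkage. Everything below is synthetic and hypothetical (the names, the fields, and the `link` helper). It only shows the mechanism: when a combination of quasi-identifiers is unique both in the released file and in an external file that carries names, a join re-attaches a name to a "de-identified" record.

```python
# "Anonymized" release: names removed, quasi-identifiers kept.
released = [
    {"zip": "00001", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "X"},
    {"zip": "00001", "birth_date": "1972-06-02", "sex": "F", "diagnosis": "Y"},
    {"zip": "00002", "birth_date": "1972-06-02", "sex": "F", "diagnosis": "Z"},
]

# External file (for example, a public roll) that includes names.
external = [
    {"name": "Person A", "zip": "00001", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Person B", "zip": "00002", "birth_date": "1972-06-02", "sex": "F"},
    {"name": "Person C", "zip": "00003", "birth_date": "1980-11-30", "sex": "M"},
]

KEYS = ("zip", "birth_date", "sex")  # assumed shared quasi-identifiers

def link(released_rows, external_rows, keys=KEYS):
    """Return (name, released_row) pairs where the key combination
    matches exactly one row on each side."""
    def index(rows):
        table = {}
        for row in rows:
            table.setdefault(tuple(row[k] for k in keys), []).append(row)
        return table

    rel_index, ext_index = index(released_rows), index(external_rows)
    matches = []
    for key, rel_rows in rel_index.items():
        ext_rows = ext_index.get(key, [])
        if len(rel_rows) == 1 and len(ext_rows) == 1:
            matches.append((ext_rows[0]["name"], rel_rows[0]))
    return matches

reidentified = link(released, external)
for name, row in reidentified:
    print(name, "->", row["diagnosis"])
```

Two of the three released records are linked to a name here even though the release itself contains no names. The remaining record stays unlinked only because the external file happens not to cover that person.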



In 2006, the release of supposedly deidentified web search data by AOL allowed two New York Times reporters to reidentify a customer simply from her browsing habits (New York Times 2006).



In the 1990s, Massachusetts Group Insurance released “deidentified” data on the hospital visits of state employees; researcher Latanya Sweeney quickly reidentified the hospital records of the then Governor William Weld using nothing more than state voter records about residence and date of birth (Sweeney 2001). 



In 2012, statisticians at the department store Target used a teenager’s shopping patterns to determine that she was pregnant before her father knew (Hill 2012).

Legal and Ethical Framework

"The right of the people to be secure in their persons, houses, papers, and effects, against unreasonable searches and seizures, shall not be violated, and no Warrants shall issue, but upon probable cause, supported by Oath or affirmation, and particularly describing the place to be searched, and the persons or things to be seized."

Amendment IV


Institutional Review Board (IRB)

  1. Protect your subjects

    1. Pregnant subjects

    2. Minors

    3. Sexual preference

  2. Protect your project

  3. Protect yourself


  1. Use multiple data sources but release aggregate results

  2. Anonymize

  3. Be mindful of the IRB (legal protection)

  4. Geocodes/IP exploitation

  5. Data consent

From Algorithm Bias to FOMO (Fear of Missing Out): Why STEM-focused curricula could bring more problems than solutions?


Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, by Cathy O'Neil


We live in the age of the algorithm. Increasingly, the decisions that affect our lives—where we go to school, whether we get a car loan, how much we pay for health insurance—are being made not by humans, but by mathematical models. 

In theory, this should lead to greater fairness: Everyone is judged according to the same rules, and bias is eliminated.

In reality, the opposite is true!


The models being used today are opaque, unregulated, and uncontestable, even when they’re wrong. 

Most troubling, they reinforce discrimination: If a poor student can’t get a loan because a lending model deems him too risky (by virtue of his zip code), he’s then cut off from the kind of education that could pull him out of poverty, and a vicious spiral ensues. 



Models are propping up the lucky and punishing the downtrodden, creating a “toxic cocktail for democracy.” 

Welcome to the dark side of Big Data.

  • Meta-ethics

  • Normative ethics

  • Applied ethics 

Ethics and AI

Meta-ethics is about the nature of ethics and moral reasoning.

  • Discussions about whether ethics is relative and whether we always act from self-interest are examples of meta-ethical discussions.

  • Drawing the conceptual distinction between meta-ethics, normative ethics, and applied ethics is itself a "meta-ethical analysis."


Normative ethics is interested in determining the content of our moral behavior.

  • Normative ethical theories seek to provide action guides: procedures for answering the Practical Question ("What ought I to do?").

  • The moral theories of Kant and Bentham are examples of normative theories that seek to provide guidelines for determining a specific course of moral action. 


Applied ethics deals with specific realms of human action and crafts criteria for discussing issues that might arise within those realms.

  • Contemporary topics:

    • Business Ethics

    • Computer Ethics

    • Engineering Ethics

    • Medical Ethics


Questions for future Data Scientists:

  • AI is inevitable

  • Institutionalized discrimination

  • Who is responsible for the collective wrongs of AI?


Notes for Prospective Data Scientists

  1. Think "datawise"

  2. Think temporally and spatially

  3. Reproducible/Replicable

  4. Accumulate

  5. Collaborate

  6. Build a social network