ML in Society

Cornell CS 3/5780 · Spring 2026

Continuing last lecture's focus on data:

1. Bias in/from Data
2. Data Ownership
3. Invisible Labor

Then zooming out...
4. Utility & Harms
5. Intelligence & "Alignment"
6. Future of Work

1. Bias in/from Data

1.1 Bias from non-representative data

Racial Bias in Gender Recognition: Facial recognition technology works poorly on sub-populations who are poorly represented in training and evaluation data (Buolamwini & Gebru)
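One way to surface this kind of failure is to report accuracy per sub-population rather than only in aggregate. The sketch below uses made-up evaluation records (the group labels "A" and "B" and all counts are hypothetical) to show how a high overall accuracy can hide a poorly served minority group:

```python
from collections import defaultdict

# Hypothetical (group, prediction_correct) records; group "A" dominates the data.
records = (
    [("A", True)] * 93 + [("A", False)] * 2
    + [("B", True)] * 3 + [("B", False)] * 2
)

def accuracy_by_group(rows):
    """Return accuracy computed separately for each group label."""
    totals, hits = defaultdict(int), defaultdict(int)
    for group, correct in rows:
        totals[group] += 1
        hits[group] += correct
    return {g: hits[g] / totals[g] for g in totals}

overall = sum(correct for _, correct in records) / len(records)
per_group = accuracy_by_group(records)

print(f"overall accuracy: {overall:.2f}")
for group, acc in per_group.items():
    print(f"group {group}: {acc:.2f}")
```

Because group "B" contributes only 5 of the 100 records, its errors barely move the aggregate number.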

1.2 Bias in data reflected in models

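Gender Stereotypes in Word Embeddings: Word embeddings trained on large text corpora reproduce the gender stereotypes present in that text.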

1.3 Amplification of Bias

Miscalibration in Recommendation: Designing recommendations which optimize engagement leads to over-recommending the most prevalent types (Steck, 2018)

rom-com 80% of the time

horror 20% of the time

optimize for probable click

$$\mathbb P(\mathsf{click})= \mathbb P(\mathsf{click}\mid \mathsf{romcom}) \mathbb P(\mathsf{romcom}) + \mathbb P(\mathsf{click}\mid \mathsf{horror}) \mathbb P(\mathsf{horror})$$

  • law of total probability (assuming every movie is either horror or rom-com), so \(\mathbb P(\mathsf{horror}) = 1-\mathbb P(\mathsf{romcom})\)

recommend rom-com 100% of the time

$$\max_{0\leq p\leq 1}0.8\times p + 0.2 \times (1-p)$$
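The objective above is linear in p, so its maximum sits at an endpoint of [0, 1]. A minimal sketch, assuming p is the fraction of recommendations that are rom-coms and the click probabilities are the 0.8 and 0.2 from the example (the function name `expected_click` and the grid are illustrative):

```python
def expected_click(p, click_romcom=0.8, click_horror=0.2):
    """Expected click rate when a fraction p of recommendations are rom-coms."""
    return click_romcom * p + click_horror * (1 - p)

# Evaluate the objective on a grid of p values in [0, 1].
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=expected_click)

greedy = expected_click(best_p)     # recommend whatever maximizes clicks
calibrated = expected_click(0.8)    # recommend in proportion to interest

print(f"click-maximizing p = {best_p}, click rate = {greedy:.2f}")
print(f"calibrated p = 0.8, click rate = {calibrated:.2f}")
```

The click-maximizing policy never shows horror, even though the user wants it 20% of the time; matching recommendations to the user's actual mix costs some expected clicks, which is why an engagement objective alone does not produce calibrated recommendations.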

Gender Bias in Translation: Data-driven machine translation perpetuates gender bias.

(screenshot from fall 2023)

Machine translation works by maximizing the probability of an English sentence given a Hungarian sentence
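To see why picking the most probable sentence amplifies bias, consider a gender-neutral Hungarian source sentence with two candidate English translations. The sketch below uses made-up probabilities (the candidate sentences, the 0.6 / 0.4 split, and the `translate` helper are all hypothetical) and a deliberately simplified decoder that just returns the highest-scoring candidate:

```python
# Hypothetical model scores P(english | hungarian) for one gender-neutral
# source sentence; the numbers are invented for illustration.
candidates = {
    "she is a doctor": 0.4,
    "he is a doctor": 0.6,
}

def translate(scores):
    """Return the single most probable candidate translation."""
    return max(scores, key=scores.get)

# Translate the same source sentence many times and measure the output mix.
outputs = [translate(candidates) for _ in range(1000)]
share_he = sum(o.startswith("he ") for o in outputs) / len(outputs)

print(translate(candidates))
print(f"share of 'he' translations: {share_he:.0%}")
```

A 60/40 skew in the model's probabilities becomes a 100/0 skew in its outputs, the same endpoint effect as in the recommendation example.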

Spring 2026 update:

Bias (regardless of source) is amplified when ML is deployed

  • Solutions depend on context
  • More data and better models can reduce uncertainty

2. Data Ownership

2.1 Privacy

Risk of re-identification from "anonymized" data

Now, personal browsing information is more rarely released for public research

...but companies still collect, use, and sell it outside public view

themarkup.org

An old idea about ensuring privacy from survey studies...
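One classic survey technique that fits this description is randomized response; the sketch below assumes that is the idea meant here. Each respondent flips a coin in private: on heads they answer truthfully, on tails they answer with a second coin flip. No individual answer reveals the truth, yet the population rate can still be estimated. The function names, the true rate of 0.3, and the sample size are illustrative choices:

```python
import random

def randomized_response(truth, rng):
    """Answer truthfully on heads; otherwise answer with a second coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_rate(responses):
    """Invert P(yes) = 0.5 * rate + 0.25 to recover the population rate."""
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5

rng = random.Random(0)
true_rate = 0.3  # hypothetical fraction of people whose true answer is "yes"
population = [rng.random() < true_rate for _ in range(100_000)]
responses = [randomized_response(t, rng) for t in population]
estimate = estimate_rate(responses)

print(f"estimated rate: {estimate:.3f} (true rate {true_rate})")
```

The price of plausible deniability is a noisier estimate, so more respondents are needed for the same accuracy.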

Is re-identification possible when only a model, not a dataset, is released?

Making re-identification unlikely with differential privacy
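A minimal sketch of the idea on the simplest possible release, a single count rather than a full model: the Laplace mechanism adds noise scaled to how much one person's record can change the answer. The records, the `epsilon` value, and the helper names below are hypothetical, and this illustrates the principle rather than how any particular system trains a private model:

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with Laplace noise of scale 1 / epsilon.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so this scale gives epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1 / epsilon, rng)

rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 38, 61, 19]  # hypothetical records
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)

print(f"noisy count of records with age >= 40: {noisy:.2f}")
```

Smaller `epsilon` means more noise and stronger privacy. Differentially private model training follows the same pattern: bound each person's influence, then add calibrated noise.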

2.3 Copyright

...but Google was prevented from allowing access to the full text of the books themselves.

(even out of print books that are otherwise impossible to access)

Generative AI has pushed "fair use" to its limits

3. Invisible Labor

3.1 Data Annotation

3.2 Algorithm Override

4. ML/AI is Useful, but Handle with Care…

4.1 ML is Useful: AI in Drug Discovery

4.2 ML is Useful: AI Astronomy

4.3 ML is Useful: AI in Material Science

Let’s play a game…

5. Model Alignment

AI and Jobs

AI and CS Jobs

If history is any indicator…

  1. Growth of jobs in AI
  2. Displacement of jobs from classic programmer/software engineer roles toward people who can use AI in their workflow
  3. The nature of jobs will change, but jobs will grow in the long term

2024: ~119,900 new AI-related roles were added, outpacing AI-linked losses.

ML in Society

By Sarah Dean