Datasets

Cornell CS 3/5780 · Spring 2026

Datasets are central to machine learning. In particular, the competitive testing paradigm, in which methods are compared on shared held-out test sets, has been the engine of progress in ML and AI.

1. Beginnings
2. Benchmark Era
3. Pretraining
4. Evaluation

1. Beginnings of Competitive Testing

Bill Highleyman and Louis Kamentsky, Bell Labs, 1959

  • 26 letters + 10 digits × 50 writers
  • train/test split by writer: 40 writers for training, 10 held out (see the sketch below)
  • A held-out test set is fundamental
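A minimal sketch of the writer-level split (the `samples` structure and field names are hypothetical, not from the 1959 study): holding out entire writers, rather than individual characters, means the test error estimates performance on people the system has never seen.

```python
import random

# Hypothetical records: one per handwritten character, tagged with its writer.
samples = [
    {"writer_id": w, "label": c}  # pixel data omitted in this sketch
    for w in range(50)
    for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
]

# Split by WRITER, not by individual sample: 40 writers train, 10 held out.
rng = random.Random(0)
writers = list(range(50))
rng.shuffle(writers)
train_writers, test_writers = set(writers[:40]), set(writers[40:])

train = [s for s in samples if s["writer_id"] in train_writers]
test = [s for s in samples if s["writer_id"] in test_writers]

print(len(train), len(test))  # 1440 training characters, 360 held-out characters
```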

1. Data Sharing and Common Task

Larger and higher-quality datasets are needed for work aimed at achieving useful results ... contain hundreds, or even thousands, of samples in each class.

Research Group              Organization                         Error Rate
Woody Bledsoe               Sandia Labs                          ~60%
Chao Kong Chow              Burroughs Corporation                41.7%
Munson, Duda, and Hart      Stanford Research Institute (SRI)    31.7% (linear model)
Munson, Duda, and Hart      Stanford Research Institute (SRI)    12% (digits only)
Human subjects              (informal experiment)                15.7%

1. Enter: the Internet

  • UCI Machine Learning Repository, available via FTP starting in 1987
    • predominantly tabular data
  • MNIST, released by LeCun et al., “Gradient-Based Learning Applied to Document Recognition,” 1998
    • 28×28 images (60k train / 10k test); see the loading sketch below
    • LeCun's leaderboard, 1999
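A rough sketch of the benchmark-era workflow on MNIST (assumes scikit-learn and network access; the simple linear model here is nowhere near the leaderboard's best):

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

# 70,000 28x28 images, flattened to 784-dimensional feature vectors.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Canonical split: first 60,000 images for training, last 10,000 for testing.
X_train, y_train = X[:60_000] / 255.0, y[:60_000]
X_test, y_test = X[60_000:] / 255.0, y[60_000:]

# Train on the training set only; report error on the held-out test set.
clf = LogisticRegression(max_iter=100, solver="saga", tol=0.1)
clf.fit(X_train, y_train)
print("test error:", 1 - clf.score(X_test, y_test))
```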

2. ImageNet

  • Fei-Fei Li's idea: an image dataset with as many categories as there are nouns (synsets) in WordNet
  • 2009: 5k classes with ~600 images each; 2011: 32k classes
  • Images came from Flickr; label quality control came from workers on Amazon Mechanical Turk (MTurk)

2. ImageNet Competition

2. Enter: Deep Learning

  • 2012: AlexNet (a 60M-parameter CNN) spurs interest in the competition and in deep learning

2. Pushing Test Set to the Limit

  • Competitors evaluate their models many times on the same test set
  • Checking test error alone gives enough signal to fit the test set with weak classifiers (recall: boosting); see the simulation sketch below
  • Doesn't this break a key ML tenet?
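A small simulation of the worry (a hypothetical sketch, not any competition's actual protocol): the test labels below are pure noise, so no classifier can genuinely beat 50%, yet repeatedly querying test accuracy and aggregating the queries that happened to score above chance produces apparent accuracy well above 50%.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test = 1_000      # size of the fixed, reused test set
n_queries = 1_000   # number of times we "submit" and read off test accuracy

# Binary test labels that are pure noise: true achievable accuracy is 50%.
y_test = rng.choice([-1, 1], size=n_test)

# Each "submission" is a random predictor; we record its test-set accuracy.
preds = rng.choice([-1, 1], size=(n_queries, n_test))
accs = (preds == y_test).mean(axis=1)

# Adaptive step (a crude boosting-style aggregation): keep only the predictors
# that scored above chance on this test set and take their majority vote.
keep = preds[accs > 0.5]
ensemble = np.sign(keep.sum(axis=0))

print("best single query:  ", accs.max())                    # roughly 0.55
print("ensemble 'accuracy':", (ensemble == y_test).mean())   # well above 0.5
```

The ensemble's apparent accuracy is an artifact of reusing the same test set; on a fresh test set it falls back to 50%.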

2. ImageNet 2.0

  • 2019: ImageNet 2.0 is a fresh test set built by following the original construction steps
  • Good news: no adaptive overfitting
  • Bad news: accuracy drops substantially, revealing extreme fragility to distribution shift

2. Fragility and Distribution Shift

3. Enter: Pretraining

  • Distribution shift: difference between training and evaluation data
  • Pretraining: train on WAY more data
  • Challenge: how to get data?
  • CLIP: 400 million image-text pairs collected from the web by OpenAI (not released publicly), trained with a contrastive loss (sketch below)
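A minimal sketch of a CLIP-style symmetric contrastive loss (not OpenAI's implementation; the temperature value and embedding sizes are illustrative): within a batch of matched image/caption pairs, each image embedding is pushed toward its own caption and away from the others, and symmetrically for text.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of matched pairs."""
    # Normalize so that dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(logits.shape[0])  # pair i should match pair i

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image classification losses.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy usage: random "embeddings" for a batch of 8 image/caption pairs.
rng = np.random.default_rng(0)
print(clip_style_loss(rng.normal(size=(8, 512)), rng.normal(size=(8, 512))))
```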

3. Language data

Language models are trained to predict the next token; a standard quality measure is perplexity (PPL) over a held-out sequence of \(D\) tokens:

$$ \mathrm{PPL}= \exp \left(-\frac{1}{D}\sum_{i=1}^D \log p(t_i \mid t_1t_2\cdots  t_{i-1})\right) $$
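A tiny sketch of the formula, computed from per-token predicted probabilities (the probabilities below are made up; in practice they come from a language model):

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/D) * sum_i log p(t_i | t_1 ... t_{i-1}))."""
    D = len(token_probs)
    avg_log_prob = sum(math.log(p) for p in token_probs) / D
    return math.exp(-avg_log_prob)

# Hypothetical predicted probabilities p(t_i | prefix) for a D = 4 token sequence.
print(perplexity([0.25, 0.5, 0.1, 0.4]))  # ~3.76; lower is better, 1.0 is perfect
```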

  • Common Crawl, a non-profit that has maintained web crawl archives since 2007

3. Scaling laws

4. Evaluation Tasks

  • MMLU (Measuring Massive Multitask Language Understanding), 2021
    • multiple-choice, thousands of questions, college level

    • measures language comprehension and knowledge

  • GSM8K (Grade School Math 8K), 2021

    • 8.5k math word problems with natural language solutions

    • evaluated by numerical accuracy of the final answer (see the sketch after this list)

    • measures comprehension and reasoning

  • HumanEval, 2021

    • 164 hand-crafted programming problems, each with a function signature, a descriptive docstring, a reference implementation, and unit tests

    • evaluated by whether the generated code passes the unit tests

    • measures ability to generate correct code
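A rough sketch of the GSM8K-style scoring mentioned above (the extraction regex and the example strings are illustrative, not the benchmark's official harness): generation is reduced to whether the final number matches the reference answer.

```python
import re

def extract_final_number(text):
    """Pull the last number out of a model's free-form solution."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def final_answer_accuracy(model_outputs, reference_answers):
    """Fraction of problems where the extracted final answer is correct."""
    correct = 0
    for output, ref in zip(model_outputs, reference_answers):
        pred = extract_final_number(output)
        correct += pred is not None and abs(pred - ref) < 1e-6
    return correct / len(reference_answers)

# Hypothetical model outputs and reference answers for two word problems.
outputs = [
    "Each box holds 12 eggs, so 3 boxes hold 3 * 12 = 36 eggs. The answer is 36.",
    "She spends 5 + 7 = 13 dollars.",   # arithmetic slip: reference answer is 12
]
print(final_answer_accuracy(outputs, [36, 12]))  # 0.5
```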

4. Challenges

  • Nonlinear emergence of capabilities (in contrast to smooth, predictable scaling of loss)
  • Model ranking depends on evaluation task

4. New ideas: "Tune before Test"

  • Common practice amounts to "training on the test task": some models see task-formatted data during training, others do not
  • Put models on equal footing: "tune before test", i.e., fine-tune every model on the same task data before evaluating

Summary

References

Lecture 25: Datasets

By Sarah Dean
