Datasets

Cornell CS 3/5780 · Spring 2026

Datasets are central to machine learning. In particular, the competitive testing paradigm, in which methods are compared on shared held-out test sets, has been the engine of progress in ML and AI.

1. Beginnings
2. Benchmark Era
3. Pretraining
4. Evaluation

1. Beginnings of Competitive Testing

Bill Highleyman and Louis Kamentsky, Bell Labs, 1959

  • 26 letters + 10 digits × 50 writers
  • train/test split by writer: 40 writers for training, 10 held out (see the sketch below)
  • A held-out test set is fundamental
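A minimal sketch of the writer-level split (the `samples` structure and field names are hypothetical, not from the 1959 study): holding out entire writers, rather than individual characters, means the test error estimates performance on people the system has never seen.

```python
import random

# Hypothetical records: one per handwritten character, tagged with its writer.
samples = [
    {"writer_id": w, "label": c}  # pixel data omitted in this sketch
    for w in range(50)
    for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
]

# Split by WRITER, not by individual sample: 40 writers train, 10 held out.
rng = random.Random(0)
writers = list(range(50))
rng.shuffle(writers)
train_writers, test_writers = set(writers[:40]), set(writers[40:])

train = [s for s in samples if s["writer_id"] in train_writers]
test = [s for s in samples if s["writer_id"] in test_writers]

print(len(train), len(test))  # 1440 training characters, 360 held-out characters
```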

1. Data Sharing and Common Task

Larger and higher-quality datasets are needed for work aimed at achieving useful results ... contain hundreds, or even thousands, of samples in each class.

Research Group              Organization                         Error Rate
Woody Bledsoe               Sandia Labs                          ~60%
Chao Kong Chow              Burroughs Corporation                41.7%
Munson, Duda, and Hart      Stanford Research Institute (SRI)    31.7% (linear model)
Munson, Duda, and Hart      Stanford Research Institute (SRI)    12% (digits only)
Human subjects              (informal experiment)                15.7%

1. Enter: the Internet

  • UCI Machine Learning Repository, available via FTP starting in 1987
    • predominantly tabular data
  • MNIST, released by LeCun et al., “Gradient-Based Learning Applied to Document Recognition,” 1998
    • 28×28 images (60k train / 10k test); see the loading sketch below
    • LeCun's leaderboard, 1999
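A rough sketch of the benchmark-era workflow on MNIST (assumes scikit-learn and network access; the simple linear model here is nowhere near the leaderboard's best):

```python
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression

# 70,000 28x28 images, flattened to 784-dimensional feature vectors.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

# Canonical split: first 60,000 images for training, last 10,000 for testing.
X_train, y_train = X[:60_000] / 255.0, y[:60_000]
X_test, y_test = X[60_000:] / 255.0, y[60_000:]

# Train on the training set only; report error on the held-out test set.
clf = LogisticRegression(max_iter=100, solver="saga", tol=0.1)
clf.fit(X_train, y_train)
print("test error:", 1 - clf.score(X_test, y_test))
```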

2. ImageNet

  • Fei-Fei Li's idea: an image dataset with as many categories as there are nouns (synsets) in WordNet
  • 2009: 5k classes with ~600 images each; 2011: 32k classes
  • Images came from Flickr; label quality control came from workers on Amazon Mechanical Turk (MTurk)

2. ImageNet Competition

2. Enter: Deep Learning

  • 2012: AlexNet (a 60M-parameter CNN) spurs interest in the competition and in deep learning

2. Pushing Test Set to the Limit

  • Competitors evaluate their models many times on the same test set
  • Checking test error alone gives enough signal to fit the test set with weak classifiers (recall: boosting); see the simulation sketch below
  • Doesn't this break a key ML tenet?
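A small simulation of the worry (a hypothetical sketch, not any competition's actual protocol): the test labels below are pure noise, so no classifier can genuinely beat 50%, yet repeatedly querying test accuracy and aggregating the queries that happened to score above chance produces apparent accuracy well above 50%.

```python
import numpy as np

rng = np.random.default_rng(0)
n_test = 1_000      # size of the fixed, reused test set
n_queries = 1_000   # number of times we "submit" and read off test accuracy

# Binary test labels that are pure noise: true achievable accuracy is 50%.
y_test = rng.choice([-1, 1], size=n_test)

# Each "submission" is a random predictor; we record its test-set accuracy.
preds = rng.choice([-1, 1], size=(n_queries, n_test))
accs = (preds == y_test).mean(axis=1)

# Adaptive step (a crude boosting-style aggregation): keep only the predictors
# that scored above chance on this test set and take their majority vote.
keep = preds[accs > 0.5]
ensemble = np.sign(keep.sum(axis=0))

print("best single query:  ", accs.max())                    # roughly 0.55
print("ensemble 'accuracy':", (ensemble == y_test).mean())   # well above 0.5
```

The ensemble's apparent accuracy is an artifact of reusing the same test set; on a fresh test set it falls back to 50%.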

2. ImageNet 2.0

  • 2019: ImageNet 2.0 is a fresh test set built by following the original construction steps
  • Good news: no adaptive overfitting
  • Bad news: accuracy drops substantially, revealing extreme fragility to distribution shift

2. Fragility and Distribution Shift

3. Enter: Pretraining

  • Distribution shift: difference between training and evaluation data
  • Pretraining: train on WAY more data
  • Challenge: how to get data?
  • CLIP: 400 million image-text pairs collected from the web by OpenAI (not released publicly), trained with a contrastive loss (sketch below)
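A minimal sketch of a CLIP-style symmetric contrastive loss (not OpenAI's implementation; the temperature value and embedding sizes are illustrative): within a batch of matched image/caption pairs, each image embedding is pushed toward its own caption and away from the others, and symmetrically for text.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over a batch of matched pairs."""
    # Normalize so that dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    logits = img @ txt.T / temperature   # (batch, batch) similarity matrix
    labels = np.arange(logits.shape[0])  # pair i should match pair i

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image classification losses.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy usage: random "embeddings" for a batch of 8 image/caption pairs.
rng = np.random.default_rng(0)
print(clip_style_loss(rng.normal(size=(8, 512)), rng.normal(size=(8, 512))))
```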

3. Language data

Language models are trained to predict the next token; a standard quality measure is perplexity (PPL) over a held-out sequence of \(D\) tokens:

$$ \mathrm{PPL}= \exp \left(-\frac{1}{D}\sum_{i=1}^D \log p(t_i \mid t_1t_2\cdots  t_{i-1})\right) $$
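A tiny sketch of the formula, computed from per-token predicted probabilities (the probabilities below are made up; in practice they come from a language model):

```python
import math

def perplexity(token_probs):
    """PPL = exp(-(1/D) * sum_i log p(t_i | t_1 ... t_{i-1}))."""
    D = len(token_probs)
    avg_log_prob = sum(math.log(p) for p in token_probs) / D
    return math.exp(-avg_log_prob)

# Hypothetical predicted probabilities p(t_i | prefix) for a D = 4 token sequence.
print(perplexity([0.25, 0.5, 0.1, 0.4]))  # ~3.76; lower is better, 1.0 is perfect
```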

  • Common Crawl, a non-profit that has maintained web crawl archives since 2007

3. Scaling laws

4. Evaluation Tasks

  • MMLU (Measuring Massive Multitask Language Understanding), 2021
    • multiple-choice, thousands of questions, college level

    • measures language comprehension and knowledge

  • GSM8K (Grade School Math 8K), 2021

    • 8.5k math word problems with natural language solutions

    • evaluated by numerical accuracy of the final answer (see the sketch after this list)

    • measures comprehension and reasoning

  • HumanEval, 2021

    • 164 hand-crafted programming problems, each with a function signature, a descriptive docstring, a reference implementation, and unit tests

    • evaluated by whether the generated code passes the unit tests

    • measures ability to generate correct code
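A rough sketch of the GSM8K-style scoring mentioned above (the extraction regex and the example strings are illustrative, not the benchmark's official harness): generation is reduced to whether the final number matches the reference answer.

```python
import re

def extract_final_number(text):
    """Pull the last number out of a model's free-form solution."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def final_answer_accuracy(model_outputs, reference_answers):
    """Fraction of problems where the extracted final answer is correct."""
    correct = 0
    for output, ref in zip(model_outputs, reference_answers):
        pred = extract_final_number(output)
        correct += pred is not None and abs(pred - ref) < 1e-6
    return correct / len(reference_answers)

# Hypothetical model outputs and reference answers for two word problems.
outputs = [
    "Each box holds 12 eggs, so 3 boxes hold 3 * 12 = 36 eggs. The answer is 36.",
    "She spends 5 + 7 = 13 dollars.",   # arithmetic slip: reference answer is 12
]
print(final_answer_accuracy(outputs, [36, 12]))  # 0.5
```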

4. Challenges

  • Nonlinear emergence of capabilities (in contrast to smooth, predictable scaling of loss)
  • Model ranking depends on evaluation task

4. New ideas: "Tune before Test"

  • Common practice amounts to "training on the test task": some models see task-formatted data during training, others do not
  • Put models on equal footing: "tune before test", i.e., fine-tune every model on the same task data before evaluating

Summary

References

Lecture 25: Datasets

By Sarah Dean
