Vaswani et al. (2017)
Huh et al. (2024)
Chang and Bergen (2022)
ChatGPT (2022)
Scaling and multi-modal
Kaplan et al. (2020), Henighan et al. (2020)
Bengio et al. (2003), Mikolov et al. (2013), Vaswani et al. (2017)
Chang and Bergen (2022), Meister et al. (2023), Huh et al. (2024)
KL divergence between the model's output probabilities and various reference distributions
Shorthand notation for the divergence between models:
We compute the expectation by approximating it as an average over the validation samples
Definition:
Approximation with validation set samples:
By predicted token:
By context token:
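A minimal sketch of the quantity, assuming p_\theta is the model's next-token distribution, q the reference distribution (e.g. unigram or another model), and \mathcal{V} the validation set:

D_{\mathrm{KL}}\!\left(p_\theta \,\middle\|\, q\right) = \mathbb{E}_{x \sim \mathcal{D}}\!\left[\, \sum_{w \in V} p_\theta(w \mid x) \log \frac{p_\theta(w \mid x)}{q(w \mid x)} \right] \;\approx\; \frac{1}{|\mathcal{V}|} \sum_{x \in \mathcal{V}} \sum_{w \in V} p_\theta(w \mid x) \log \frac{p_\theta(w \mid x)}{q(w \mid x)}

The per-position values can then be aggregated by the P.o.S. tag of either the predicted token at position t or the context token at position t-1.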
Convergence to the unigram distribution around step 256
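A hypothetical reproduction sketch (assumptions: the 410m model is EleutherAI/pythia-410m, its per-step revisions on Hugging Face serve as checkpoints, and input_ids stands in for a batch of validation sequences):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

STEPS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]   # assumed checkpoint grid

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")
input_ids = tok("The model was trained on the Pile .", return_tensors="pt").input_ids

@torch.no_grad()
def mean_kl_to_unigram(model, input_ids):
    logits = model(input_ids).logits                      # [batch, seq, vocab]
    logp = torch.log_softmax(logits, dim=-1)
    # Add-one-smoothed unigram estimate over the same (padded) vocabulary.
    counts = torch.ones(logits.size(-1))
    counts.scatter_add_(0, input_ids.flatten(), torch.ones(input_ids.numel()))
    unigram_logp = (counts / counts.sum()).log()
    # Per-position KL(p_model || unigram), averaged over all validation positions.
    return (logp.exp() * (logp - unigram_logp)).sum(-1).mean().item()

for step in STEPS:
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/pythia-410m", revision=f"step{step}").eval()
    print(f"step {step}: KL to unigram = {mean_kl_to_unigram(model, input_ids):.3f}")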
Goal: determine which P.o.S. tags are relatively significant at each training step
For predicted token:
For context token:
Looking for outlier ratios ≠ 1
Looking at learnability
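A hypothetical sketch of the per-P.o.S. aggregation: average the per-position KL values by the P.o.S. tag of either the predicted token (pos(t)) or the context token (pos(t-1)), divide by the overall mean, and flag tags whose ratio deviates from 1. Function names and the threshold are illustrative assumptions.

from collections import defaultdict

def pos_kl_ratios(kl_values, pos_tags, threshold=0.25):
    # kl_values[i]: KL at position i; pos_tags[i]: tag used for grouping
    # (the predicted token's tag at i, or the context token's tag at i-1).
    overall = sum(kl_values) / len(kl_values)
    by_tag = defaultdict(list)
    for kl, tag in zip(kl_values, pos_tags):
        by_tag[tag].append(kl)
    ratios = {tag: (sum(v) / len(v)) / overall for tag, v in by_tag.items()}
    # Outliers: tags whose per-tag divergence deviates notably from the overall mean.
    outliers = {t: r for t, r in ratios.items() if abs(r - 1.0) > threshold}
    return ratios, outliers

# Toy example: determiners (DT) diverge less than average, proper nouns (NNP) more.
kl = [0.2, 0.3, 1.9, 0.25, 2.1, 0.8]
tags = ["DT", "DT", "NNP", "DT", "NNP", "MD"]
print(pos_kl_ratios(kl, tags))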
[Figure: example tokens per tag - DT: all, either, every, ...; NNP: Trump, Escobar, Zurich; MD: can, may, need, ...]
[Figure panels: pos(t-1); NNP - e.g. Switzerland, Zurich]
Greater sensitivity is associated with higher divergence at later training steps
[Figure panels: pos(t); DT - e.g. all, either, there]
For (early-acquired) functional linguistic features, the KL divergence lags the cross-entropy
This holds whether we fit by ratio or by difference
A low cross-entropy difference is a necessary condition for low KL!
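One way to read the last point, as a minimal sketch assuming p is the target distribution and q the model's output: from the identity

D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) \;\ge\; 0,

the KL divergence can only be small when the cross-entropy H(p, q) is close to the entropy H(p), i.e. when the cross-entropy gap is small.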
[Figure: 410m model, Context vs. Prediction panels - NN: house, cat, dog; NNP: ETH]
[Figure: 410m model, Context vs. Prediction panels - VBG: eating; VBN: called]
[Figure: 410m model, Context vs. Prediction panels]
Earlier divergence during training correlates with greater stability later on