Vaswani et al. (2017)
Huh et al. (2024)
Chang and Bergen (2022)
ChatGPT (2022)
Scaling and multi-modal
Kaplan et al. (2020), Henighan et al. (2020)
Bengio et al. (2003), Mikolov et al. (2013), Vaswani et al. (2017)
Chang and Bergen (2022), Meister et al. (2023), Huh et al. (2024)
KL divergence between the model's output probabilities and various reference distributions
Shorthand notation for the divergence between models:
We compute the expectation by approximating it as an average over the validation samples
Definition:
Approximation with validation set samples:
By predicted token:
By context token:
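A minimal sketch of the quantity, assuming p_\theta is the model's next-token distribution, q the reference distribution (e.g. unigram or another model), and \mathcal{V} the validation set:

D_{\mathrm{KL}}\!\left(p_\theta \,\middle\|\, q\right) = \mathbb{E}_{x \sim \mathcal{D}}\!\left[\, \sum_{w \in V} p_\theta(w \mid x) \log \frac{p_\theta(w \mid x)}{q(w \mid x)} \right] \;\approx\; \frac{1}{|\mathcal{V}|} \sum_{x \in \mathcal{V}} \sum_{w \in V} p_\theta(w \mid x) \log \frac{p_\theta(w \mid x)}{q(w \mid x)}

The per-position values can then be aggregated by the P.o.S. tag of either the predicted token at position t or the context token at position t-1.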
Convergence to the unigram distribution around step 256
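A hypothetical reproduction sketch (assumptions: the 410m model is EleutherAI/pythia-410m, its per-step revisions on Hugging Face serve as checkpoints, and input_ids stands in for a batch of validation sequences):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

STEPS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]   # assumed checkpoint grid

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")
input_ids = tok("The model was trained on the Pile .", return_tensors="pt").input_ids

@torch.no_grad()
def mean_kl_to_unigram(model, input_ids):
    logits = model(input_ids).logits                      # [batch, seq, vocab]
    logp = torch.log_softmax(logits, dim=-1)
    # Add-one-smoothed unigram estimate over the same (padded) vocabulary.
    counts = torch.ones(logits.size(-1))
    counts.scatter_add_(0, input_ids.flatten(), torch.ones(input_ids.numel()))
    unigram_logp = (counts / counts.sum()).log()
    # Per-position KL(p_model || unigram), averaged over all validation positions.
    return (logp.exp() * (logp - unigram_logp)).sum(-1).mean().item()

for step in STEPS:
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/pythia-410m", revision=f"step{step}").eval()
    print(f"step {step}: KL to unigram = {mean_kl_to_unigram(model, input_ids):.3f}")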
Goal: determine which P.o.S. tags are relatively significant at each training step
For predicted token:
For context token:
Looking for outlier ratios ≠ 1
Looking at learnability
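A hypothetical sketch of the per-P.o.S. aggregation: average the per-position KL values by the P.o.S. tag of either the predicted token (pos(t)) or the context token (pos(t-1)), divide by the overall mean, and flag tags whose ratio deviates from 1. Function names and the threshold are illustrative assumptions.

from collections import defaultdict

def pos_kl_ratios(kl_values, pos_tags, threshold=0.25):
    # kl_values[i]: KL at position i; pos_tags[i]: tag used for grouping
    # (the predicted token's tag at i, or the context token's tag at i-1).
    overall = sum(kl_values) / len(kl_values)
    by_tag = defaultdict(list)
    for kl, tag in zip(kl_values, pos_tags):
        by_tag[tag].append(kl)
    ratios = {tag: (sum(v) / len(v)) / overall for tag, v in by_tag.items()}
    # Outliers: tags whose per-tag divergence deviates notably from the overall mean.
    outliers = {t: r for t, r in ratios.items() if abs(r - 1.0) > threshold}
    return ratios, outliers

# Toy example: determiners (DT) diverge less than average, proper nouns (NNP) more.
kl = [0.2, 0.3, 1.9, 0.25, 2.1, 0.8]
tags = ["DT", "DT", "NNP", "DT", "NNP", "MD"]
print(pos_kl_ratios(kl, tags))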
[Figure: example tokens per tag - DT: all, either, every, ...; NNP: Trump, Escobar, Zurich; MD: can, may, need, ...]
[Figure panels: pos(t-1); NNP - e.g. Switzerland, Zurich]
Greater sensitivity is associated with higher divergence at later training steps
[Figure panels: pos(t); DT - e.g. all, either, there]
For (early-acquired) functional linguistic features, the KL divergence lags the cross-entropy
This holds whether we fit by ratio or by difference
A low cross-entropy difference is a necessary condition for low KL!
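One way to read the last point, as a minimal sketch assuming p is the target distribution and q the model's output: from the identity

D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) \;\ge\; 0,

the KL divergence can only be small when the cross-entropy H(p, q) is close to the entropy H(p), i.e. when the cross-entropy gap is small.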
[Figure: 410m model, Context vs. Prediction panels - NN: house, cat, dog; NNP: ETH]
[Figure: 410m model, Context vs. Prediction panels - VBG: eating; VBN: called]
[Figure: 410m model, Context vs. Prediction panels]
Earlier divergence during training correlates with greater stability later on