Analysing the Convergence and Divergence of Language Models

Introduction

Vaswani et al. (2017)

Huh et al. (2024)

Chang and Bergen (2022)

ChatGPT (2022)

Scaling and multi-modality

Kaplan et al. (2020); Henighan et al. (2020)

Objective & Significance

  • Analyse convergence of model probabilities
  • Quantify impact of different features
  • Large(r) randomly initialised models converge during training; behaviour for smaller models is unclear
  • Hypothesise different phases of word acquisition

Background

Language Models

Bengio et al. (2003), Mikolov et al. (2013), Vaswani et al. (2017)

p(w_t \mid \mathbf{w}_{<t}) = \frac{e^{z_{w_t}}}{\sum_{w' \in \mathcal{V}} e^{z_{w'}}}
w_t \in \mathcal{V}
\mathbf{z} = f(\mathbf{w}_{<t})
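A minimal numerical sketch of the softmax parameterisation above; the toy logits stand in for $\mathbf{z} = f(\mathbf{w}_{<t})$ and do not come from any real model:

```python
import numpy as np

def next_token_distribution(logits):
    """Softmax over vocabulary logits z, giving p(w_t | w_<t)."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy vocabulary of 4 types; in a real model z = f(w_<t).
p = next_token_distribution([2.0, 1.0, 0.0, -1.0])
```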

Evolution in Training

Chang and Bergen (2022), Meister et al. (2023), Huh et al. (2024)

\text{Uniform: } p(w_t \mid \mathbf{w}_{<t}) \approx \mathcal{U} := \frac{1}{|\mathcal{V}|}
\text{Unigram: } p(w_t \mid \mathbf{w}_{<t}) \approx \frac{\sum_{w_v \in \mathcal{D}_\mathrm{VAL}} \mathbf{1}_{\{w_v = w_t\}}}{|\mathcal{D}_\mathrm{VAL}|}

KL divergence between the model's output probabilities and these reference distributions
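A sketch of the two reference distributions and the KL from a model's output to each. All names and values here are illustrative (a toy four-word vocabulary, a hand-picked model distribution):

```python
import numpy as np
from collections import Counter

def unigram_distribution(val_tokens, vocab):
    """Empirical token frequencies over the validation data."""
    counts = Counter(val_tokens)
    return np.array([counts[w] / len(val_tokens) for w in vocab])

def kl(p, q, eps=1e-12):
    """D_KL(p || q) in nats, with eps to avoid log(0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

vocab = ["the", "cat", "sat", "mat"]
uniform = np.full(len(vocab), 1.0 / len(vocab))
unigram = unigram_distribution(["the", "cat", "the", "mat"], vocab)

model_p = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical model output
d_uniform = kl(model_p, uniform)
d_unigram = kl(model_p, unigram)
```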

Methodology

Kullback Leibler Divergence

Shorthand notation for divergences between models:

We compute the expectation by approximating over the validation samples:

\mathbb{E}[D_\text{KL}] \approx \frac{1}{\sum_{\mathbf{w} \in \mathcal{D}} |\mathbf{w}|} \sum_{\mathbf{w} \in \mathcal{D}} \sum_{t=1}^{|\mathbf{w}|} D_\text{KL}\!\left(p_{m_1,t_1}(\cdot \mid \mathbf{w}_{<t}) \,\big\|\, p_{m_2,t_2}(\cdot \mid \mathbf{w}_{<t})\right)
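The token-averaged approximation above can be sketched as follows, with tiny hand-made distributions standing in for two checkpoints' per-position outputs:

```python
import numpy as np

def mean_kl(p_model1, p_model2, eps=1e-12):
    """Token-averaged KL between two checkpoints' next-token
    distributions, approximating E[D_KL] over validation sequences.

    Each argument is a list with one (seq_len, vocab) array per
    validation sequence w in D.
    """
    total_kl, total_tokens = 0.0, 0
    for p1, p2 in zip(p_model1, p_model2):
        p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
        # sum over t of D_KL(p1(. | w_<t) || p2(. | w_<t))
        total_kl += np.sum(p1 * np.log((p1 + eps) / (p2 + eps)))
        total_tokens += p1.shape[0]
    return total_kl / total_tokens

# Two toy sequences (lengths 2 and 1) over a 3-word vocabulary.
seqs_m1 = [np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]),
           np.array([[1/3, 1/3, 1/3]])]
seqs_m2 = [np.array([[0.6, 0.3, 0.1], [0.1, 0.7, 0.2]]),
           np.array([[1/3, 1/3, 1/3]])]
avg = mean_kl(seqs_m1, seqs_m2)
```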

Cross-Entropy

Definition:

Approximation with validation set samples:
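The validation-set approximation of cross-entropy (the negative mean log-probability assigned to each observed token) can be sketched as:

```python
import numpy as np

def cross_entropy(token_probs):
    """Validation cross-entropy H_{m,t} in nats.

    token_probs: the probabilities p(w_t | w_<t) a checkpoint
    assigns to the actual next tokens across the validation set.
    """
    p = np.asarray(token_probs, dtype=float)
    return float(-np.mean(np.log(p)))

# Hypothetical per-token probabilities from one checkpoint.
H = cross_entropy([0.25, 0.5, 0.1, 0.4])
```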

Frequency

Part of Speech

By predicted token:

By context token:

Begin/End of Word
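The feature groupings above (P.o.S. of predicted vs. context token, word-boundary position) can be sketched as bucketing per-token divergences by a feature key. The tokens, tags, and KL values are toy data, and the `##` continuation marker is just one WordPiece-style convention used for illustration:

```python
from collections import defaultdict

def group_divergences(tokens, tags, kls, feature):
    """Mean per-token KL grouped by a linguistic feature.

    feature: 'pos_t' (P.o.S. of predicted token), 'pos_prev'
    (P.o.S. of context token), or 'word_start' (token begins a
    word, e.g. no leading '##' in a WordPiece-style tokeniser).
    """
    groups = defaultdict(list)
    for i, (tok, tag, d) in enumerate(zip(tokens, tags, kls)):
        if feature == "pos_t":
            key = tag
        elif feature == "pos_prev":
            key = tags[i - 1] if i > 0 else "<BOS>"
        else:  # word_start
            key = not tok.startswith("##")
        groups[key].append(d)
    return {k: sum(v) / len(v) for k, v in groups.items()}

# Toy tagged stream with hypothetical per-token KL values.
tokens = ["the", "cat", "##s", "sat"]
tags = ["DT", "NN", "NN", "VBD"]
kls = [0.1, 0.4, 0.2, 0.3]
by_pos = group_divergences(tokens, tags, kls, "pos_t")
```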

Experimental Setup

Model Selection

  • Pythia models (Biderman et al., 2023)
  • Logarithmic steps
  • Random seeds 1, 3, 5, 7, 9

Model Selection

  • Pythia models (Biderman et al., 2023)
  • Logarithmic steps
  • Random seeds 1, 3, 5, 7, 9

Model Selection

  • Pythia models (Biderman et al., 2023)
  • Logarithmic steps
  • Random seeds 1, 3, 5, 7, 9

Model Selection

  • Pythia models (Biderman et al., 2023)
  • Logarithmic steps
  • Random seeds 1, 3, 5, 7, 9
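A sketch of the logarithmic checkpoint schedule, assuming the Pythia release pattern (step 0, powers of two up to 512, then every 1000 steps up to 143000); the `step…` revision names mirror how the suite's checkpoints are exposed on the Hugging Face Hub:

```python
def pythia_checkpoint_steps(max_step=143_000):
    """Checkpoint schedule: step 0, log-spaced early steps
    (1, 2, 4, ..., 512), then every 1000 steps.

    Each step corresponds to a released revision such as
    revision='step512'.
    """
    log_steps = [0] + [2 ** i for i in range(10)]      # 0 .. 512
    linear_steps = list(range(1000, max_step + 1, 1000))
    return log_steps + linear_steps

steps = pythia_checkpoint_steps()
```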

Part of Speech Tagging


Results

Reference Distributions

Model outputs converge to the unigram distribution around step 256

Relative Divergences

Goal: determine which P.o.S. tags show significant relative divergence at each training step

For predicted token:

For context token:

Looking for outlier ratios ≠ 1
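The outlier search can be sketched as a ratio of each group's mean KL to the overall mean, flagging tags far from 1. The per-tag means and the factor-of-two threshold are hypothetical:

```python
def relative_divergence(group_means, overall_mean):
    """Ratio of each P.o.S. group's mean KL to the overall mean;
    values far from 1 mark tags diverging unusually fast or slowly."""
    return {tag: m / overall_mean for tag, m in group_means.items()}

def outliers(ratios, threshold=2.0):
    """Tags whose ratio differs from 1 by more than the threshold factor."""
    return sorted(t for t, r in ratios.items()
                  if r > threshold or r < 1.0 / threshold)

# Hypothetical per-tag mean divergences at one training step.
means = {"DT": 0.9, "NN": 0.3, "NNP": 0.08, "VB": 0.25}
ratios = relative_divergence(means, overall_mean=0.3)
flagged = outliers(ratios)
```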

Relative Divergences

\sigma_{(t)}
t_2 = t_1^{(+1)} \;\; \mid \;\; m_1=m_2

Looking at learnability

all, either, every ...

Trump, Escobar, Zurich

Relative Divergences

\sigma_{(t-1)}
t_2 = t_1^{(+1)} \;\; \mid \;\; m_1=m_2

can, may, need ...

all, either, every, ...

Training Phases

  1. Uniform Phase (until approx. step 8)
  2. Functional Universality Phase (approx. step 8 to 64)
  3. Unigram Transition Phase (approx. step 64 to 1000)
  4. Contextual Learning Phase (approx. step 1000 onwards)
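The four phases above can be sketched as a simple step-to-phase mapping (boundaries are the approximate ones listed, not exact cut-offs):

```python
def training_phase(step):
    """Map a training step onto the four phases identified above
    (boundaries are approximate)."""
    if step <= 8:
        return "uniform"
    if step <= 64:
        return "functional universality"
    if step <= 1000:
        return "unigram transition"
    return "contextual learning"

phases = [training_phase(s) for s in (4, 32, 256, 5000)]
```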

Cross-Entropy Fitting

pos(t-1); NNP - e.g. Switzerland, Zurich

\mathcal Q

\alpha \cdot \mathcal Q + b

\alpha = 394.09\%

Greater sensitivity is connected to higher divergence in later steps
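The fit $\alpha \cdot \mathcal{Q} + b$ can be sketched as an ordinary least-squares line through per-checkpoint values; the series below are invented placeholders for one P.o.S. group, and the slope (here a raw coefficient, which the slides report as a percentage) is what $\alpha$ denotes:

```python
import numpy as np

# Hypothetical per-checkpoint series for one P.o.S. group:
# Q is the cross-entropy based predictor, d_kl the group's KL.
Q = np.array([4.0, 3.1, 2.4, 1.9, 1.6])
d_kl = np.array([2.1, 1.7, 1.3, 1.1, 0.9])

# Least-squares fit of d_kl ≈ alpha * Q + b.
alpha, b = np.polyfit(Q, d_kl, 1)
```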

Cross-Entropy Fitting

pos(t); DT - e.g. all, either, there

\Delta H

\alpha = -2.05\%

For (early) functional linguistic features, KL lags CE

Cross-Entropy Fitting

Holds true for fitting by both ratio and difference

Cross-Entropy Difference

\Delta H = H_{m_1,t_1} - H_{m_2,t_2}

Low cross-entropy difference is a necessary condition for low KL!

Seed Comparison

Seeds: Frequency

Seeds: Nouns

Context

Prediction

410m model - NN: house, cat, dog - NNP: ETH

Seeds: Verbs

Context

Prediction

410m model - VBG: eating - VBN: called

Seeds: Other P.o.S.

Context

Prediction

410m model

Earlier divergence in training correlates with greater stability later on

Linear Regression

D_{\text{KL}}(w_t) \approx \alpha \cdot \log(\text{freq}(w_t)) + \sum_{p \in \text{P.o.S. tags}} \beta_{p} \cdot \mathbf{1}_{\text{tag}(w_t)=p} + \sum_{p \in \text{P.o.S. tags}} \gamma_{p} \cdot \mathbf{1}_{\text{tag}(w_{t-1})=p} + \sum_{m \in \text{Model Sizes}} \delta_m \cdot \mathbf{1}_{m_1=m_2=m}

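The regression can be sketched by assembling a design matrix of log-frequency plus one-hot P.o.S. indicators and solving by least squares. All data below are toy values, and the model-size dummies are omitted for brevity:

```python
import numpy as np

def design_matrix(freqs, pos_t, pos_prev, tags):
    """Features for D_KL(w_t) ≈ α·log freq(w_t)
    + Σ_p β_p · 1[tag(w_t)=p] + Σ_p γ_p · 1[tag(w_{t-1})=p]
    (model-size dummies omitted in this sketch)."""
    n = len(freqs)
    X = np.zeros((n, 1 + 2 * len(tags)))
    X[:, 0] = np.log(freqs)
    for i in range(n):
        X[i, 1 + tags.index(pos_t[i])] = 1.0
        X[i, 1 + len(tags) + tags.index(pos_prev[i])] = 1.0
    return X

# Toy data: token frequencies, predicted- and context-token tags,
# and per-token KL targets.
tags = ["DT", "NN", "VB"]
freqs = [100.0, 5.0, 20.0, 100.0]
pos_t = ["DT", "NN", "VB", "DT"]
pos_prev = ["VB", "DT", "NN", "NN"]
y = np.array([0.2, 0.9, 0.5, 0.25])

X = design_matrix(freqs, pos_t, pos_prev, tags)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha = coef[0]  # weight on log frequency
```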
Conclusion

Summary

  • Trajectory in model training
  • Model size's role in enhancing self-consistency
  • Significant differences by part of speech
  • Efficiency gains can likely be made by adapting model training to linguistic structure

Limitations & Future Work

  • Lack of different model architectures  
  • No variability in tokenisers
  • Issues with Pythia model suite
  • Improvement by emphasis on unstable linguistic features
  • Connection between KL and cross-entropy could help improve model performance
  • Focus on multi-lingual P.o.S. features

Q&A

Analysing the Convergence and Divergence of Language Models

By Finlay Fehlauer
