Language Models & Transformers

Cornell CS 3/5780 · Spring 2026

1. Prediction & Generation
2. Representation

3. Word2Vec
4. Attention

1. Prediction & Generation Games

Please open a large, searchable text document on your laptop or phone

1. Nonlinear Methods

  • Review: Kernels
    • Kernel methods lift linear models into nonlinear predictors.
    • Nonlinear features via a feature map \( \phi(x) \).
    • Kernel function (the feature map stays implicit and may be very high dimensional):
      $$ k(x, x') = \phi(x)^\top \phi(x') $$
    • Training complexity: at least \( O(n^2) \) (forming the kernel matrix); test complexity: \( O(n) \) per prediction.
  • Today: Neural Networks
    • Perceptron invented at Cornell by Frank Rosenblatt (1958); multilayer perceptrons followed in his Principles of Neurodynamics (1962).
    • Learns the feature mapping explicitly from data.
    • Can capture structure by stacking layers of "neurons".
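As a concrete contrast between the two approaches, the sketch below (illustrative, not from the lecture) verifies the kernel identity for the degree-2 polynomial kernel, then shows a one-hidden-layer network that learns its features explicitly instead:

```python
import numpy as np

# --- Kernel trick: implicit vs. explicit feature map ---
# For the homogeneous degree-2 polynomial kernel, k(x, x') = (x^T x')^2
# equals phi(x)^T phi(x'), where phi lists all pairwise products x_i * x_j.

def phi(x):
    """Explicit degree-2 feature map: all products x_i * x_j."""
    return np.outer(x, x).ravel()

def k(x, xp):
    """Kernel function: same inner product, computed without forming phi."""
    return float(np.dot(x, xp)) ** 2

x  = np.array([1.0, 2.0, 3.0])
xp = np.array([0.5, -1.0, 2.0])
assert np.isclose(k(x, xp), phi(x) @ phi(xp))  # same value, different cost

# --- Neural network: learn the feature map instead (one hidden layer) ---
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer: 3 -> 4
w2, b2 = rng.normal(size=4), 0.0               # output layer: 4 -> 1

def mlp(x):
    h = np.maximum(0.0, W1 @ x + b1)  # learned "features": ReLU(W1 x + b1)
    return w2 @ h + b2                # linear predictor on learned features

print(mlp(x))  # scalar prediction
```

Here the kernel evaluates the inner product of implicit features at fixed cost, while the network's hidden layer plays the role of \( \phi \) but its weights are fit from data.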


2. Word Representation

[Figure: a word \(\approx\) a vector]
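The idea of the slide: represent each word by a dense vector, so that geometric similarity between vectors stands in for semantic similarity. A toy sketch, with made-up embedding numbers purely for illustration:

```python
import numpy as np

# Hypothetical 4-dim word embeddings (numbers are made up for illustration):
vecs = {
    "king":  np.array([0.8, 0.7, 0.1, 0.0]),
    "queen": np.array([0.8, 0.6, 0.9, 0.0]),
    "apple": np.array([0.0, 0.1, 0.0, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: inner product of the normalized vectors."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Related words get a higher similarity score than unrelated ones:
assert cosine(vecs["king"], vecs["queen"]) > cosine(vecs["king"], vecs["apple"])
```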


3. Word2Vec
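A minimal skip-gram training loop, sketched in numpy. This omits word2vec's negative-sampling table and subsampling, and the toy corpus and hyperparameters below are made up for illustration:

```python
import numpy as np

# Minimal skip-gram sketch (not full word2vec): predict context words from
# the center word via logistic loss, with one random negative sample each.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
d = 8
W_in  = rng.normal(scale=0.1, size=(len(vocab), d))  # "input" embeddings
W_out = rng.normal(scale=0.1, size=(len(vocab), d))  # "output" embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, window = 0.1, 1
for epoch in range(200):
    for t, center in enumerate(corpus):
        for o in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if o == t:
                continue
            c, ctx = idx[center], idx[corpus[o]]
            neg = rng.integers(len(vocab))  # one random "negative" word
            for tgt, label in ((ctx, 1.0), (neg, 0.0)):
                v, u = W_in[c], W_out[tgt]
                g = sigmoid(v @ u) - label  # gradient of the logistic loss
                grad_u, grad_v = g * v, g * u
                W_out[tgt] -= lr * grad_u
                W_in[c]   -= lr * grad_v

# Words appearing in similar contexts ("cat"/"dog") should end up nearby:
def cos(u, v): return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos(W_in[idx["cat"]], W_in[idx["dog"]]))
```

The key design choice is training embeddings on a *prediction* task (center word predicts context) so that distributional similarity becomes geometric similarity.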


4. Attention & LLMs
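The core operation of this section can be sketched as scaled dot-product attention (Vaswani et al., 2017); the shapes and random inputs below are illustrative:

```python
import numpy as np

# Scaled dot-product attention in a few lines of numpy.
# Shapes: queries Q (n, d), keys K (m, d), values V (m, d_v).

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # similarity of each query to each key
    weights = softmax(scores, axis=-1) # each row is a distribution over keys
    return weights @ V, weights        # output: weighted average of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
out, w = attention(Q, K, V)
assert out.shape == (2, 5)
assert np.allclose(w.sum(axis=1), 1.0)
```

Each output row is a convex combination of the value vectors, with weights set by how well the query matches each key; stacking this operation (with learned projections for Q, K, V) is the building block of transformers.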

Lecture 24: Language Models & Transformers

By Sarah Dean
