Language Models & Transformers

Cornell CS 3/5780 · Spring 2026

1. Prediction & Generation
2. Representation

3. Word2Vec
4. Attention

1. Prediction & Generation Games

Please open a large, searchable text document on your laptop or phone

1. Nonlinear Methods

  • Review: Kernels
    • Kernel methods lift linear models into nonlinear predictors.
    • Nonlinear features via a feature map \( \phi(x) \).
    • Kernel function (the feature map stays implicit and may be very high dimensional):
      $$ k(x, x') = \phi(x)^\top \phi(x') $$
    • Training complexity: at least \( O(n^2) \) (forming the kernel matrix); test complexity: \( O(n) \) per prediction.
  • Today: Neural Networks
    • Perceptron invented at Cornell by Frank Rosenblatt (1958); multilayer perceptrons followed in his Principles of Neurodynamics (1962).
    • Learns the feature mapping explicitly from data.
    • Can capture structure by stacking layers of "neurons".
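As a concrete contrast between the two approaches, the sketch below (illustrative, not from the lecture) verifies the kernel identity for the degree-2 polynomial kernel, then shows a one-hidden-layer network that learns its features explicitly instead:

```python
import numpy as np

# --- Kernel trick: implicit vs. explicit feature map ---
# For the homogeneous degree-2 polynomial kernel, k(x, x') = (x^T x')^2
# equals phi(x)^T phi(x'), where phi lists all pairwise products x_i * x_j.

def phi(x):
    """Explicit degree-2 feature map: all products x_i * x_j."""
    return np.outer(x, x).ravel()

def k(x, xp):
    """Kernel function: same inner product, computed without forming phi."""
    return float(np.dot(x, xp)) ** 2

x  = np.array([1.0, 2.0, 3.0])
xp = np.array([0.5, -1.0, 2.0])
assert np.isclose(k(x, xp), phi(x) @ phi(xp))  # same value, different cost

# --- Neural network: learn the feature map instead (one hidden layer) ---
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer: 3 -> 4
w2, b2 = rng.normal(size=4), 0.0               # output layer: 4 -> 1

def mlp(x):
    h = np.maximum(0.0, W1 @ x + b1)  # learned "features": ReLU(W1 x + b1)
    return w2 @ h + b2                # linear predictor on learned features

print(mlp(x))  # scalar prediction
```

Here the kernel evaluates the inner product of implicit features at fixed cost, while the network's hidden layer plays the role of \( \phi \) but its weights are fit from data.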


2. Word Representation

[Figure: a word \(\approx\) a vector]
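The idea of the slide: represent each word by a dense vector, so that geometric similarity between vectors stands in for semantic similarity. A toy sketch, with made-up embedding numbers purely for illustration:

```python
import numpy as np

# Hypothetical 4-dim word embeddings (numbers are made up for illustration):
vecs = {
    "king":  np.array([0.8, 0.7, 0.1, 0.0]),
    "queen": np.array([0.8, 0.6, 0.9, 0.0]),
    "apple": np.array([0.0, 0.1, 0.0, 0.9]),
}

def cosine(u, v):
    """Cosine similarity: inner product of the normalized vectors."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Related words get a higher similarity score than unrelated ones:
assert cosine(vecs["king"], vecs["queen"]) > cosine(vecs["king"], vecs["apple"])
```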


3. Word2Vec
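A minimal skip-gram training loop, sketched in numpy. This omits word2vec's negative-sampling table and subsampling, and the toy corpus and hyperparameters below are made up for illustration:

```python
import numpy as np

# Minimal skip-gram sketch (not full word2vec): predict context words from
# the center word via logistic loss, with one random negative sample each.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
d = 8
W_in  = rng.normal(scale=0.1, size=(len(vocab), d))  # "input" embeddings
W_out = rng.normal(scale=0.1, size=(len(vocab), d))  # "output" embeddings

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr, window = 0.1, 1
for epoch in range(200):
    for t, center in enumerate(corpus):
        for o in range(max(0, t - window), min(len(corpus), t + window + 1)):
            if o == t:
                continue
            c, ctx = idx[center], idx[corpus[o]]
            neg = rng.integers(len(vocab))  # one random "negative" word
            for tgt, label in ((ctx, 1.0), (neg, 0.0)):
                v, u = W_in[c], W_out[tgt]
                g = sigmoid(v @ u) - label  # gradient of the logistic loss
                grad_u, grad_v = g * v, g * u
                W_out[tgt] -= lr * grad_u
                W_in[c]   -= lr * grad_v

# Words appearing in similar contexts ("cat"/"dog") should end up nearby:
def cos(u, v): return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos(W_in[idx["cat"]], W_in[idx["dog"]]))
```

The key design choice is training embeddings on a *prediction* task (center word predicts context) so that distributional similarity becomes geometric similarity.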


4. Attention & LLMs
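The core operation of this section can be sketched as scaled dot-product attention (Vaswani et al., 2017); the shapes and random inputs below are illustrative:

```python
import numpy as np

# Scaled dot-product attention in a few lines of numpy.
# Shapes: queries Q (n, d), keys K (m, d), values V (m, d_v).

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)      # similarity of each query to each key
    weights = softmax(scores, axis=-1) # each row is a distribution over keys
    return weights @ V, weights        # output: weighted average of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
out, w = attention(Q, K, V)
assert out.shape == (2, 5)
assert np.allclose(w.sum(axis=1), 1.0)
```

Each output row is a convex combination of the value vectors, with weights set by how well the query matches each key; stacking this operation (with learned projections for Q, K, V) is the building block of transformers.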

Lecture 24: Language Models & Transformers

By Sarah Dean
