The original Transformer paper.
Goal: Get us all on the same page about what makes up a Transformer (a minimal code sketch of these pieces follows the list below):
Attention
Self-Attention
Positional Encodings
etc.
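As a concrete reference point, here is a minimal NumPy sketch of single-head scaled dot-product self-attention plus fixed sinusoidal positional encodings. The shapes and variable names are illustrative, not taken from any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv        # project the same sequence to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # pairwise similarity between positions
    return softmax(scores, axis=-1) @ V     # weighted sum of value vectors

def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed sinusoidal encodings added to token embeddings to inject order information."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# Tiny usage example with random weights (illustrative only).
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model)) + sinusoidal_positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```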
The GPT-3 paper
A look at the trend of increasingly large LMs and, more importantly, their ability to perform well on tasks unseen during training.
The reasoning capabilities of language models can be improved by prompting them appropriately.
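As a rough illustration of what "prompting appropriately" can mean (the prompt text below is made up for this note, not taken from any paper): prepend a worked example that demonstrates step-by-step reasoning, chain-of-thought style, before asking the real question.

```python
# Illustrative only: the same arithmetic question posed directly vs. preceded by a
# worked example that demonstrates step-by-step reasoning. How the prompt is sent
# to a model is omitted here.

direct_prompt = (
    "Q: A library has 120 books and receives 3 boxes of 15 books each. "
    "How many books does it have now?\nA:"
)

reasoning_prompt = (
    "Q: A shop has 40 apples and sells 3 bags of 6 apples each. How many apples are left?\n"
    "A: 3 bags of 6 apples is 18 apples. 40 - 18 = 22. The answer is 22.\n"
    "Q: A library has 120 books and receives 3 boxes of 15 books each. "
    "How many books does it have now?\nA:"
)

print(reasoning_prompt)
```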
Other papers are linked in the document.
Transformers as an alternative to CNNs?
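One way to read that question: the Vision Transformer line of work replaces convolutions by splitting an image into fixed-size patches and feeding their linear projections to a standard Transformer. A rough NumPy sketch of that patch-embedding step (sizes are arbitrary):

```python
import numpy as np

def image_to_patch_tokens(image, patch_size, W_proj):
    """Split an (H, W, C) image into non-overlapping patches and linearly project
    each flattened patch, yielding a (num_patches, d_model) token sequence."""
    H, W, C = image.shape
    ph = pw = patch_size
    patches = image.reshape(H // ph, ph, W // pw, pw, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, ph * pw * C)   # one row per patch
    return patches @ W_proj                      # linear "patch embedding"

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))
W = rng.normal(size=(16 * 16 * 3, 64))           # 16x16 patches -> 64-dim tokens
print(image_to_patch_tokens(img, 16, W).shape)   # (4, 64)
```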
Recasts reinforcement learning as a conditional sequence-modelling problem (a toy sketch of this framing follows below)
Matches or exceeds the performance of SoTA model-free offline RL algorithms
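A toy sketch of that framing, with tokenisation and embedding details omitted and the representation below chosen purely for illustration: each timestep contributes a (return-to-go, state, action) triple, and a causal Transformer is trained to predict the action given the desired return and the history.

```python
import numpy as np

def returns_to_go(rewards):
    """Cumulative future reward at each timestep: R_t = sum over t' >= t of r_t'."""
    return np.cumsum(rewards[::-1])[::-1]

def build_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) per timestep so a causal model
    can be trained to predict actions conditioned on the target return."""
    rtg = returns_to_go(np.asarray(rewards, dtype=float))
    seq = []
    for R, s, a in zip(rtg, states, actions):
        seq.extend([("rtg", R), ("state", s), ("action", a)])
    return seq

# Toy 3-step trajectory with a single terminal reward.
seq = build_sequence(states=[0, 1, 2], actions=[1, 0, 1], rewards=[0.0, 0.0, 1.0])
print(seq[:3])  # first timestep: return-to-go, then state, then action
```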
Neural Scaling Laws: Empirically shows that LLM performance follows a power law in model size, data, and compute (the basic functional form is given below)
Exploring the limits: Investigates the implications of these scaling trends for downstream tasks
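For reference, the basic form reported in the scaling-laws work (stated from memory, so treat the details as approximate): test loss falls off as a power law in the non-embedding parameter count N, with analogous laws in dataset size and compute.

```latex
% Power-law scaling of loss with model size N; analogous forms hold for data D and compute C.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```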
An attempt to better understand how/what transformers learn by training them on a simple reasoning task.
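The "simple reasoning task" in this line of work is typically a small algorithmic problem; the modular-arithmetic task below is an assumption chosen for illustration, not necessarily the exact setup in the paper.

```python
import numpy as np

def modular_addition_dataset(p=97):
    """All (a, b) -> (a + b) mod p examples: a tiny, fully enumerable task of the
    kind used to study what transformers learn (illustrative choice of task)."""
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    inputs = np.stack([a.ravel(), b.ravel()], axis=1)   # input token pairs (a, b)
    targets = (inputs[:, 0] + inputs[:, 1]) % p         # label: (a + b) mod p
    return inputs, targets

X, y = modular_addition_dataset()
print(X.shape, y.shape)  # (9409, 2) (9409,)
```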
A joint image/language model, trained to match captions to images (its contrastive training objective is sketched below)
Impressive few-shot transfer to downstream tasks
An essential part of DALL-E's architecture
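A simplified NumPy sketch of the batch-level contrastive objective: within a batch, each image embedding should be most similar to its own caption embedding and vice versa. The real model uses learned image and text encoders and a learnable temperature, both omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)  # L2-normalize
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(logits.shape[0])       # the i-th caption matches the i-th image
    loss_i2t = -np.log(softmax(logits, axis=1)[labels, labels]).mean()  # image -> text
    loss_t2i = -np.log(softmax(logits, axis=0)[labels, labels]).mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2

# Random embeddings stand in for encoder outputs (illustrative only).
rng = np.random.default_rng(0)
print(contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```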
DALL-E
Requires (at least) a brief explanation of diffusion models
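As a starting point for that explanation, a minimal sketch of the forward (noising) process in a DDPM-style diffusion model: data is progressively corrupted with Gaussian noise, and the generative model is trained to predict (and thereby undo) that noise. The schedule values below are illustrative.

```python
import numpy as np

def forward_diffusion_sample(x0, t, betas, rng):
    """Sample x_t directly from x_0 via the closed form
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]        # product of alphas up to step t
    noise = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise
    return xt, noise                          # the model learns to predict `noise` from (xt, t)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)         # illustrative linear noise schedule
x0 = rng.normal(size=(32, 32, 3))             # stand-in for a data sample (e.g., an image)
xt, eps = forward_diffusion_sample(x0, t=500, betas=betas, rng=rng)
print(xt.shape)
```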