>1000× larger than the median activation magnitude.
<10 massive activations among the millions of activations in a hidden state.
(Mostly) appear on the first token.
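A minimal sketch of this magnitude criterion (the function name and the choice of the median over all \( T \times D \) entries are illustrative assumptions; the slide only gives the >1000× ratio):

```python
import torch

def find_massive_activations(h, ratio=1000.0):
    """h: hidden state of one layer, shape (T, D).

    Returns the (token, dim) indices whose magnitude exceeds `ratio`
    times the median magnitude of the layer; typically only a handful
    of entries qualify.
    """
    mags = h.abs()
    return (mags > ratio * mags.median()).nonzero(as_tuple=False)
```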
Which Layers?
Typically appear in the early layers.
Stay roughly constant throughout the residual stream.
(Somehow) get cancelled during the final layers.
Which dimensions?
Activation at layer \( l \): \( h_l \in \mathbb{R}^{T \times D} \) (\( T \) tokens, \( D \) feature dimensions)
Along features: a few fixed dimensions
Along tokens:
Starting token only (GPT2)
Starting token and first delimiter token (LLaMA2-7B)
Starting token and some other "semantically weak" tokens (LLaMA2-70B, Phi-2)
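A hedged sketch of how one might locate these coordinates empirically; the model, prompt, and threshold are illustrative, and looping over all hidden states also exposes the layer-wise pattern above (emergence in early layers, roughly constant magnitude, drop near the end):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # illustrative; the slides also discuss LLaMA2-7B/70B and Phi-2
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

inputs = tok("Summer is warm. Winter is cold.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: one (1, T, D) tensor per residual-stream state (embeddings + each block)
for layer, h in enumerate(out.hidden_states):
    h = h[0]                                   # (T, D)
    mags = h.abs()
    for t, d in (mags > 1000 * mags.median()).nonzero().tolist():
        print(f"layer {layer:2d}  token {t:2d}  dim {d:4d}  value {h[t, d].item():+.1f}")
```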
Massive Activations as Biases
Algorithm:
Change the value of the massive activation at its first appearance (patch the activation)
Feed the altered hidden state to the rest of the blocks as usual
Evaluate the perplexity (WikiText, C4, ...)
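A sketch of this intervention using a forward hook; the module path model.transformer.h is GPT-2-specific, and the layer/token/feature coordinates are placeholders that in practice come from the detection step above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # illustrative model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def make_patch_hook(token_idx, dim_idx, new_value):
    """Overwrite one activation in a block's output hidden state."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (B, T, D)
        hidden[:, token_idx, dim_idx] = new_value                    # in-place patch
        return output
    return hook

# Placeholder coordinates for where a massive activation first appears.
layer, tok_i, dim_i = 2, 0, 100                        # illustrative values, not from the slides
handle = model.transformer.h[layer].register_forward_hook(
    make_patch_hook(tok_i, dim_i, new_value=0.0)       # 0.0 or an empirical mean
)

# Perplexity of one sequence under the patched model (the slides use WikiText, C4, ...).
ids = tok("Summer is warm. Winter is cold.", return_tensors="pt").input_ids
with torch.no_grad():
    loss = model(ids, labels=ids).loss
print("perplexity:", torch.exp(loss).item())

handle.remove()                                        # restore the unpatched model
```

Running this once with new_value=0.0 and once with the activation's empirical mean would correspond to the two interventions compared on the next slide.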
Massive Activations as Biases
Eliminating (zeroing out) only a few massive activations drastically degrades performance.
Fixing them to their empirical mean (computed over many sentences) has negligible impact.
Why these tokens?
(Maybe) The first token is always visible: under causal attention, every query position can attend to it.
(Maybe) The existence of massive features affects attention patterns, so it is better to put them on "semantically meaningless" tokens.
(Maybe) The semantic content of these tokens might have already been transferred to other positions.
...
Effects on Attention
What is an attention sink?
A surprisingly large amount of attention score is allocated to the initial tokens, irrespective of their relevance to the language modeling task. We term these tokens “attention sinks”. Despite their lack of semantic significance, they collect significant attention scores.
"Efficient Streaming Language Models with Attention Sinks"