\mathbf v (\mathbf x_i) = W^T_v \mathbf x_i, \quad \mathbf q (\mathbf x_i) = W^T_q \mathbf x_i, \quad \mathbf k (\mathbf x_i) = W^T_k \mathbf x_i
\alpha_{i, j} = \text{softmax}_j \left( \frac{\langle \mathbf q(\mathbf x_i),\: \mathbf k(\mathbf x_j) \rangle}{\sqrt{d'}} \right)
\mathbf u_i = \sum_{j=1}^{n} \alpha_{i, j} \mathbf v(\mathbf x_j)
\mathbf u_i' = \text{LayerNorm}(\mathbf x_i + \mathbf u_i; \gamma_1, \beta_1)
\mathbf z_i = W_2^T \text{ReLU} (W_1^T \mathbf u_i')
\mathbf z_i' = \text{LayerNorm}(\mathbf u_i' + \mathbf z_i; \gamma_2, \beta_2)

This block is self-attention: queries, keys, and values are all computed from the same sequence. In ordinary (cross-)attention, the queries come from a different input than the keys and values.

W_v, W_q, W_k \in \mathbb R^{d \times d'}
\text{LayerNorm}(\mathbf z; \boldsymbol\gamma, \boldsymbol\beta) = \boldsymbol\gamma \odot \frac{\mathbf z - \mu_z}{\sigma_z} + \boldsymbol\beta

Each position gets a value, key, and query vector

Attention weights measure how similar the query of each position is to the keys of all positions (including its own)

For each position, value vectors are linearly combined weighted by attention.

Add output to residual input + normalize

Feed through linear / nonlinearity

Add another residual connection + normalize

Note: often add positional encoding into x
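One common instantiation of this note is the sinusoidal encoding; a minimal NumPy sketch (the name `sinusoidal_pe` is illustrative, and the equations above do not prescribe this particular choice):

```python
import numpy as np

def sinusoidal_pe(n, d):
    # sinusoidal positional encoding; assumes an even feature dimension d
    pos = np.arange(n)[:, None]         # positions 0..n-1
    i = np.arange(d // 2)[None, :]      # frequency index
    angles = pos / 10000 ** (2 * i / d)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)        # even dims get sin
    pe[:, 1::2] = np.cos(angles)        # odd dims get cos
    return pe

n, d = 4, 8
x = np.random.default_rng(2).normal(size=(n, d))
x_pe = x + sinusoidal_pe(n, d)          # "add positional encoding into x"
assert x_pe.shape == (n, d)
```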

Transformer block maps

\mathbf x \in \mathbb R^{n \times d} \to \mathbb R^{n \times d}
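The six equations above can be sketched directly in NumPy. All names here are illustrative; for the sketch we take d' = d and a d-dimensional hidden layer so the residual additions type-check (in practice the hidden layer is usually wider):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(z, gamma, beta, eps=1e-5):
    # normalize each position's features to zero mean / unit scale, then rescale
    mu = z.mean(axis=-1, keepdims=True)
    sigma = z.std(axis=-1, keepdims=True)
    return gamma * (z - mu) / (sigma + eps) + beta

def transformer_block(x, Wv, Wq, Wk, W1, W2, g1, b1, g2, b2):
    d_prime = Wq.shape[1]
    v, q, k = x @ Wv, x @ Wq, x @ Wk                     # rows: v(x_i), q(x_i), k(x_i)
    alpha = softmax(q @ k.T / np.sqrt(d_prime), axis=1)  # alpha[i, j]; rows sum to 1
    u = alpha @ v                                        # u_i = sum_j alpha_ij v(x_j)
    u_prime = layer_norm(x + u, g1, b1)                  # residual + LayerNorm
    z = np.maximum(u_prime @ W1, 0.0) @ W2               # W2^T ReLU(W1^T u_i')
    return layer_norm(u_prime + z, g2, b2)               # second residual + LayerNorm

rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.normal(size=(n, d))
Wv, Wq, Wk, W1, W2 = (0.1 * rng.normal(size=(d, d)) for _ in range(5))
out = transformer_block(x, Wv, Wq, Wk, W1, W2,
                        np.ones(d), np.zeros(d), np.ones(d), np.zeros(d))
assert out.shape == (n, d)  # maps R^{n x d} -> R^{n x d}
```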

The block above is self-attention with only one "head" (like a conv layer with only one filter)

\mathbf v^{\textcolor{skyblue}{(h)}} (\mathbf x_i) = W^T_{v, \textcolor{skyblue}{h}} \mathbf x_i, \quad \mathbf q^{\textcolor{skyblue}{(h)}} (\mathbf x_i) = W^T_{q, \textcolor{skyblue}{h}} \mathbf x_i, \quad \mathbf k^{\textcolor{skyblue}{(h)}} (\mathbf x_i) = W^T_{k, \textcolor{skyblue}{h}} \mathbf x_i
\alpha_{i, j}^{\textcolor{skyblue}{(h)}} = \text{softmax}_j \left( \frac{\langle \mathbf q^{\textcolor{skyblue}{(h)}}(\mathbf x_i),\: \mathbf k^{\textcolor{skyblue}{(h)}}(\mathbf x_j) \rangle}{\sqrt{d'}} \right)
\mathbf u_i = \textcolor{skyblue}{\sum_{h=1}^{H} W^T_{c, h}} \sum_{j=1}^{n} \alpha_{i, j}^{\textcolor{skyblue}{(h)}} \mathbf v^{\textcolor{skyblue}{(h)}}(\mathbf x_j)
\mathbf u_i' = \text{LayerNorm}(\mathbf x_i + \mathbf u_i; \gamma_1, \beta_1)
\mathbf z_i = W_2^T \text{ReLU} (W_1^T \mathbf u_i')
\mathbf z_i' = \text{LayerNorm}(\mathbf u_i' + \mathbf z_i; \gamma_2, \beta_2)

Multi-headed attention with H heads (the head count plays the role of the conv-filter dimension)

This block is self-attention: queries, keys, and values are all computed from the same sequence. In ordinary (cross-)attention, the queries come from a different input than the keys and values.

W_{v, \textcolor{skyblue}{h}}, W_{q, \textcolor{skyblue}{h}}, W_{k, \textcolor{skyblue}{h}} \in \mathbb R^{d \times d'}, \quad \textcolor{skyblue}{W_{c, h} \in \mathbb R^{d' \times d}}
\text{LayerNorm}(\mathbf z; \boldsymbol\gamma, \boldsymbol\beta) = \boldsymbol\gamma \odot \frac{\mathbf z - \mu_z}{\sigma_z} + \boldsymbol\beta

Each position gets a value, key, and query for each head.

Attention weights calculate how similar the query of each position is to the keys of all other positions for each head.

For each position and each head, value vectors are linearly combined weighted by attention.

Add output to residual input + normalize

Feed through linear / nonlinearity

Add another residual connection + normalize

Note: often add positional encoding into x
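The multi-head combination step can be sketched in NumPy as follows (names are illustrative; the combination matrices W_{c,h} are assumed to live in R^{d' x d} so each head's output maps back to R^d for the residual and LayerNorm steps):

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, Wv, Wq, Wk, Wc):
    # Wv, Wq, Wk: (H, d, d') per-head projections; Wc: (H, d', d) combines heads
    H, d, d_prime = Wq.shape
    u = np.zeros_like(x)
    for h in range(H):
        v, q, k = x @ Wv[h], x @ Wq[h], x @ Wk[h]            # per-head v, q, k
        alpha = softmax(q @ k.T / np.sqrt(d_prime), axis=1)  # alpha^{(h)}_{ij}
        u += (alpha @ v) @ Wc[h]  # W_{c,h}^T sum_j alpha^{(h)}_{ij} v^{(h)}(x_j)
    return u  # in R^{n x d}, ready for the residual + LayerNorm steps

rng = np.random.default_rng(1)
n, d, d_prime, H = 4, 8, 2, 3
x = rng.normal(size=(n, d))
Wv, Wq, Wk = (rng.normal(size=(H, d, d_prime)) for _ in range(3))
Wc = rng.normal(size=(H, d_prime, d))
u = multi_head_self_attention(x, Wv, Wq, Wk, Wc)
assert u.shape == (n, d)
```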


Self-attention cheat-sheet

By Chandan Singh
