📖 Lesson ⏱️ 120 minutes

Self-Attention and Multi-Head Attention

Understanding self-attention and multi-head attention mechanisms

Introduction

Self-attention and multi-head attention are foundational components of transformers, enabling models to learn context-aware representations efficiently.

They allow each token in a sequence to attend to other tokens, capturing relationships regardless of position.


1️⃣ What is Self-Attention?

Self-attention allows each token in an input sequence to look at other tokens and gather relevant context dynamically when producing its output representation.

Example: In a sentence, self-attention enables the word “it” to attend to the noun it refers to.


2️⃣ How Does Self-Attention Work?

The input embeddings are passed through three learned linear projections to produce:

  • Queries (\( Q \))
  • Keys (\( K \))
  • Values (\( V \))

Then compute:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \]

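✅ \( Q K^T \) computes a dot-product similarity score between every query and every key.
✅ Division by \( \sqrt{d_k} \) keeps the scores from growing with the key dimension \( d_k \); without it, large scores saturate the softmax and gradients become very small.
✅ Softmax turns each row of scores into attention weights that sum to 1.
✅ Multiplying by \( V \) takes a weighted sum of the value vectors, giving each token a context-enriched output.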
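
The four steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: it uses plain lists instead of a tensor library so it runs without dependencies, and the helper names (`matmul`, `transpose`, `softmax`, `attention`) and the toy numbers are assumptions made for this example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                 # query-key similarities
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]       # each row sums to 1
    return matmul(weights, v), weights               # weighted sum of values

# Toy input: 3 tokens with d_k = d_v = 2 (illustrative values only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over all three tokens. Because the weights in a row are non-negative and sum to 1, every output vector is a weighted average of the rows of `V`.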


3️⃣ Why is Self-Attention Important?

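✅ Captures long-range dependencies, because any two tokens interact directly no matter how far apart they are.
✅ Is position-agnostic on its own: it has no built-in notion of token order, so transformers add positional encodings to supply it.
✅ Allows parallel processing across sequence positions, unlike RNNs.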
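
The position-agnostic property can be checked directly. The sketch below is a simplified setting chosen for this example, not how a full transformer layer is built: it uses the raw inputs as queries, keys, and values (no learned projections, no positional encodings), and the function name `self_attention` and the numbers are hypothetical. It shuffles the input tokens and confirms that the outputs are shuffled in exactly the same way.

```python
import math

def self_attention(x):
    """Self-attention with Q = K = V = x (no projections, no positional encoding)."""
    d_k = len(x[0])
    out = []
    for q in x:  # each query row is independent of the others, so rows could run in parallel
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in x]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d_k)])
    return out

x = [[1.0, 0.0], [0.0, 2.0], [1.5, 1.5]]   # 3 toy tokens
perm = [2, 0, 1]                           # a reordering of the tokens
x_shuffled = [x[i] for i in perm]

out = self_attention(x)
out_shuffled = self_attention(x_shuffled)

# Row i of the shuffled output should match row perm[i] of the original output.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)
```

The check passes because nothing in the computation depends on where a token sits in the list, only on the token vectors themselves.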


4️⃣ What is Multi-Head Attention?

Multi-head attention runs multiple self-attention mechanisms in parallel, allowing the model to capture different types of relationships in the data.

Each head has its own \( Q, K, V \) projections, so different heads can specialize in different patterns, such as:

✅ Syntax patterns.
✅ Positional dependencies.
✅ Semantic relationships.

Outputs from each head are concatenated and linearly transformed to produce the final output.
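
As a worked example of the shapes involved, assume the configuration used in the original Transformer paper, which this lesson does not itself specify: model dimension \( d_{\text{model}} = 512 \) and \( h = 8 \) heads, with every head working in an equal slice of the model dimension. For a sequence of \( n \) tokens:

\[ d_k = d_v = \frac{d_{\text{model}}}{h} = \frac{512}{8} = 64 \]

\[ \text{each head output} \in \mathbb{R}^{n \times 64}, \qquad \text{concatenation of 8 heads} \in \mathbb{R}^{n \times (8 \cdot 64)} = \mathbb{R}^{n \times 512} \]

Under this assumption the final linear transformation is a \( 512 \times 512 \) matrix, so the output has the same shape as the input and layers can be stacked.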


5️⃣ Multi-Head Attention Formula

Given \( h \) heads:

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O \]

where:

\[ \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \]

✅ Each head has its own learned projections \( W_i^Q, W_i^K, W_i^V \).
✅ \( W^O \) projects the concatenated output back to the model dimension.
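
The formula can be sketched in dependency-free Python as follows. Several things here are assumptions made for illustration: the weight matrices are random stand-ins for learned parameters, the sizes (4 tokens, model dimension 8, 2 heads) are arbitrary, each head is given a slice of width `d_model // h`, and the helper names are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (random here, trained in practice)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head(q, k, v, heads, w_o):
    """heads is a list of (W_i^Q, W_i^K, W_i^V) triples, one per head."""
    head_outputs = [
        attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concat: place the per-head vectors for each token side by side.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(q))]
    return matmul(concat, w_o)  # W^O maps back to the model dimension

rng = random.Random(0)
n_tokens, d_model, h = 4, 8, 2
d_head = d_model // h  # assumed split: each head works in d_model / h dimensions

x = random_matrix(n_tokens, d_model, rng)  # toy token embeddings
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(h * d_head, d_model, rng)

out = multi_head(x, x, x, heads, w_o)      # self-attention: Q = K = V = x
print(len(out), len(out[0]))
```

The output has one row per token and `d_model` columns, the same shape as the input. Looping over heads keeps the sketch close to the formula; tensor libraries typically compute all heads in a single batched operation instead.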


6️⃣ Visualization Example

Attention heatmaps show which tokens each token focuses on during processing, revealing:

✅ Pronoun resolution patterns.
✅ Subject-verb-object relationships.
✅ Context learning across long sequences.
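
A heatmap of this kind can be approximated in plain text. The sketch below only shows the mechanics: the sentence is made up for this example, the embeddings are random placeholders, and the helper names `attention_weights` and `shade` are hypothetical. The resulting pattern is therefore not linguistically meaningful; a trained model's attention weights would be needed to see effects such as pronoun resolution.

```python
import math
import random

tokens = ["the", "cat", "sat", "because", "it", "was", "tired"]
rng = random.Random(0)
d = 4
# Random embeddings are placeholders; a trained model would supply real ones.
emb = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in tokens]

def attention_weights(x):
    """Row i holds the attention weights of token i over all tokens (Q = K = x)."""
    rows = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in x]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

weights = attention_weights(emb)

shades = " .:-=+*#"  # later characters stand for larger weights

def shade(w):
    """Map a weight in [0, 1] to one of the shade characters."""
    return shades[min(int(w * len(shades)), len(shades) - 1)]

# Header row: first three letters of each attended-to token.
print(" " * 8 + " ".join(t[:3].ljust(3) for t in tokens))
for tok, row in zip(tokens, weights):
    print(tok.ljust(8) + " ".join(shade(w) * 3 for w in row))
```

Each row shows where one token directs its attention and each column shows which token receives it. Plotting the same `weights` matrix with a charting library gives the usual colored heatmap.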


7️⃣ Applications

✅ Transformers: Core component in encoder and decoder blocks.
✅ BERT, GPT, T5: Use multi-head self-attention for rich embeddings.
✅ Vision Transformers: Apply self-attention to image patches for image classification.


Conclusion

✅ Self-attention enables context learning across sequences without recurrence.
✅ Multi-head attention enriches learning by capturing multiple relationship types.
✅ Mastering these concepts is key to understanding and building transformers.


What’s Next?

✅ Visualize self-attention in a transformer on a sample sentence.
✅ Fine-tune a pretrained transformer and observe attention behavior.
✅ Continue structured learning on superml.org.


Join the SuperML Community to share and discuss your attention experiments.


Happy Learning! 🧭