Self-Attention and Multi-Head Attention
Understanding self-attention and multi-head attention mechanisms
Introduction
Self-attention and multi-head attention are foundational components of transformers, enabling models to learn context-aware representations efficiently.
They allow each token in a sequence to attend to other tokens, capturing relationships regardless of position.
1️⃣ What is Self-Attention?
Self-attention allows each token in an input sequence to look at other tokens and gather relevant context dynamically when producing its output representation.
Example: In a sentence, self-attention enables the word “it” to attend to the noun it refers to.
2️⃣ How Does Self-Attention Work?
Given the input embeddings, we project them into:
- Queries (\( Q \))
- Keys (\( K \))
- Values (\( V \))
Then compute:
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \]
✅ \( Q K^T \) calculates similarity between tokens.
✅ Division by \( \sqrt{d_k} \) stabilizes gradients.
✅ Softmax normalizes to weights summing to 1.
✅ We multiply by \( V \) to compute a weighted sum, resulting in context-enriched outputs.
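To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The function name, shapes, and random toy inputs are illustrative placeholders, not tied to any particular library's API.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence (no batching, no masking)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # token-to-token similarity
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability before exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights                     # context-enriched outputs, attention map

# Toy usage: 4 tokens, d_k = 8 (values are random, purely illustrative)
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```

Each row of `attn` is the weight distribution one token places over all tokens, and `out` is the corresponding weighted sum of value vectors.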
3️⃣ Why is Self-Attention Important?
✅ Captures long-range dependencies in sequences.
✅ Is position-agnostic while enabling context learning.
✅ Allows parallel processing across sequence positions, unlike RNNs.
4️⃣ What is Multi-Head Attention?
Multi-head attention runs multiple self-attention mechanisms in parallel, allowing the model to capture different types of relationships in the data.
Each head has separate \( Q, K, V \) projections, learning:
✅ Syntax patterns.
✅ Positional dependencies.
✅ Semantic relationships.
Outputs from each head are concatenated and linearly transformed to produce the final output.
5️⃣ Multi-Head Attention Formula
Given \( h \) heads:
\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O \]
where:
\[ \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \]
✅ Each head has its own learned projections \( W_i^Q, W_i^K, W_i^V \).
✅ \( W^O \) projects the concatenated output back to the model dimension.
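The snippet below is a small NumPy sketch of this formula with explicit per-head projection matrices. The dimensions (d_model = 16, h = 4 heads) and the random weights are illustrative placeholders; in a real model these projections are learned.

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, as in the single-head formula above
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """X: (seq_len, d_model); Wq/Wk/Wv: lists of per-head (d_model, d_k); Wo: (h*d_k, d_model)."""
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo  # Concat(head_1, ..., head_h) W^O

# Toy usage: d_model = 16, h = 4 heads, so d_k = d_model // h = 4
rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 16, 4
d_k = d_model // h
X = rng.normal(size=(seq_len, d_model))
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(X, Wq, Wk, Wv, Wo).shape)  # (5, 16) — back to d_model
```

Because each head works in a smaller subspace of size d_k = d_model / h, multi-head attention costs roughly the same as a single full-width head while letting the heads specialize.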
6️⃣ Visualization Example
Attention heatmaps show which tokens each token focuses on during processing, revealing:
✅ Pronoun resolution patterns.
✅ Subject-verb-object relationships.
✅ Context learning across long sequences.
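As a simple illustration, the snippet below draws such a heatmap with matplotlib. The attention weights here are random placeholders standing in for weights taken from a trained model, and the token list is just an example sentence.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder attention weights (rows: query tokens, cols: key tokens).
# In practice these would come from a trained model's attention layer.
tokens = ["The", "cat", "sat", "because", "it", "was", "tired"]
rng = np.random.default_rng(0)
scores = rng.normal(size=(len(tokens), len(tokens)))
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # rows sum to 1

fig, ax = plt.subplots(figsize=(5, 5))
im = ax.imshow(weights, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45, ha="right")
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("Key (attended-to) token")
ax.set_ylabel("Query token")
fig.colorbar(im, ax=ax, label="Attention weight")
plt.tight_layout()
plt.show()
```

Reading a real heatmap row by row shows where each query token concentrates its attention, for example the row for “it” lighting up on the noun it refers to.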
7️⃣ Applications
✅ Transformers: Core component in encoder and decoder blocks.
✅ BERT, GPT, T5: Use multi-head self-attention for rich embeddings.
✅ Vision Transformers: Apply self-attention to image patches for image classification.
Conclusion
✅ Self-attention enables context learning across sequences without recurrence.
✅ Multi-head attention enriches learning by capturing multiple relationship types.
✅ Mastering these concepts is key to understanding and building transformers.
What’s Next?
✅ Visualize self-attention in a transformer on a sample sentence (a sketch for extracting attention weights follows below).
✅ Fine-tune a pretrained transformer and observe attention behavior.
✅ Continue structured learning on superml.org.
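For the visualization and fine-tuning suggestions above, one possible starting point is pulling attention weights out of a pretrained model. The sketch below assumes the Hugging Face transformers library and PyTorch are installed, and uses bert-base-uncased purely as a convenient public checkpoint; adapt the model name, layer, and sentence to your own experiment.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# bert-base-uncased is just one convenient checkpoint for inspecting attention.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The animal didn't cross the street because it was too tired."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each shaped (batch, num_heads, seq_len, seq_len).
last_layer = outputs.attentions[-1][0]  # last layer, first (and only) sentence
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(len(outputs.attentions), "layers,", last_layer.shape[0], "heads,", len(tokens), "tokens")
```

Feeding one head of `last_layer` into the heatmap snippet from the visualization section lets you inspect which tokens each token attends to.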
Join the SuperML Community to share and discuss your attention experiments.
Happy Learning! 🧠