πŸ“– Lesson ⏱️ 120 minutes

Self-Attention and Multi-Head Attention

Understanding self-attention and multi-head attention mechanisms

Introduction

Self-attention and multi-head attention are foundational components of transformers, enabling models to learn context-aware representations efficiently.

They allow each token in a sequence to attend to other tokens, capturing relationships regardless of position.


1️⃣ What is Self-Attention?

Self-attention allows each token in an input sequence to look at other tokens and gather relevant context dynamically when producing its output representation.

Example: In a sentence, self-attention enables the word β€œit” to attend to the noun it refers to.


2️⃣ How Does Self-Attention Work?

Each input embedding is projected into three vectors:

  • Queries ($Q$)
  • Keys ($K$)
  • Values ($V$)

Then compute:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$

βœ… $Q K^T$ computes the similarity between tokens.
βœ… Division by $\sqrt{d_k}$ stabilizes gradients.
βœ… Softmax normalizes the scores into weights that sum to 1.
βœ… Multiplying by $V$ gives a weighted sum, producing context-enriched outputs.
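
To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention. The token count, embedding size, and random projection matrices are illustrative assumptions, not values from the lesson.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) token-to-token similarity
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # context-enriched outputs + attention map

# Toy example: 4 tokens, embedding dimension 8 (illustrative sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                                  # stand-in input embeddings
W_q, W_k, W_v = [rng.normal(size=(8, 8)) for _ in range(3)]  # learned in a real model
out, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, attn.shape)   # (4, 8) (4, 4)
```

Each row of `attn` is the weight distribution one token places over all tokens, which is exactly what attention heatmaps visualize.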


3️⃣ Why is Self-Attention Important?

βœ… Captures long-range dependencies in sequences.
βœ… Is position-agnostic on its own (order comes from positional encodings), while still enabling context learning.
βœ… Allows parallel processing across sequence positions, unlike RNNs.


4️⃣ What is Multi-Head Attention?

Multi-head attention runs multiple self-attention mechanisms in parallel, allowing the model to capture different types of relationships in the data.

Each head has separate $Q$, $K$, $V$ projections, learning:

βœ… Syntax patterns.
βœ… Positional dependencies.
βœ… Semantic relationships.

Outputs from each head are concatenated and linearly transformed to produce the final output.


5️⃣ Multi-Head Attention Formula

Given $h$ heads:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$$

where:

$$\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

βœ… Each head has its own learned projections $W_i^Q, W_i^K, W_i^V$.
βœ… $W^O$ projects the concatenated output back to the model dimension.
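
Putting the two formulas together, the sketch below runs $h$ independent heads and applies the output projection $W^O$. The dimensions (a model width of 16 split across 4 heads) and the random parameter initialization are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, head_params, W_o):
    """Concat(head_1, ..., head_h) W_O, with per-head projections W_i^Q, W_i^K, W_i^V."""
    heads = []
    for W_q, W_k, W_v in head_params:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v         # per-head projections
        d_k = Q.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))   # scaled dot-product attention
        heads.append(weights @ V)                   # one (seq_len, d_head) output per head
    return np.concatenate(heads, axis=-1) @ W_o     # back to (seq_len, d_model)

# Toy configuration: d_model = 16 split across 4 heads of size 4.
d_model, num_heads = 16, 4
d_head = d_model // num_heads
rng = np.random.default_rng(0)
head_params = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
               for _ in range(num_heads)]
W_o = rng.normal(size=(num_heads * d_head, d_model))
X = rng.normal(size=(6, d_model))                   # 6 tokens
print(multi_head_attention(X, head_params, W_o).shape)   # (6, 16)
```

Production implementations typically fuse the per-head projections into single larger weight matrices for efficiency, but the computation is equivalent to the per-head loop shown here.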


6️⃣ Visualization Example

Attention heatmaps show which tokens each token focuses on during processing, revealing:

βœ… Pronoun resolution patterns.
βœ… Subject-verb-object relationships.
βœ… Context learning across long sequences.
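
As a quick illustration, the sketch below renders an attention map as a heatmap with matplotlib. The sentence and the weights are hypothetical placeholders; in practice you would plot weights extracted from a model (for example, the `attn` matrix returned by the earlier sketch, or the attention outputs of a pretrained transformer).

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical tokens and attention weights, used purely for illustration.
tokens = ["The", "cat", "sat", "because", "it", "was", "tired"]
rng = np.random.default_rng(0)
scores = rng.random((len(tokens), len(tokens)))
weights = scores / scores.sum(axis=-1, keepdims=True)   # rows sum to 1, like softmax output

fig, ax = plt.subplots()
im = ax.imshow(weights, cmap="viridis")                 # one row per query token
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=45)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_xlabel("Attended-to token (key)")
ax.set_ylabel("Attending token (query)")
fig.colorbar(im, label="Attention weight")
plt.tight_layout()
plt.show()
```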


7️⃣ Applications

βœ… Transformers: Core component in encoder and decoder blocks.
βœ… BERT, GPT, T5: Use multi-head self-attention to build rich contextual representations.
βœ… Vision Transformers: Apply self-attention to image patches for image classification.


Conclusion

βœ… Self-attention enables context learning across sequences without recurrence.
βœ… Multi-head attention enriches learning by capturing multiple relationship types.
βœ… Mastering these concepts is key to understanding and building transformers.


What’s Next?

βœ… Visualize self-attention in a transformer on a sample sentence.
βœ… Fine-tune a pretrained transformer and observe attention behavior.
βœ… Continue structured learning on superml.org.


Join the SuperML Community to share and discuss your attention experiments.


Happy Learning! 🧭