Self-Attention and Multi-Head Attention

Introduction

Self-attention and multi-head attention are foundational components of transformers, enabling models to learn context-aware representations efficiently.

They allow each token in a sequence to attend to every other token, capturing relationships regardless of how far apart the tokens are.

1️⃣ What is Self-Attention?

Self-attention allows each token in an input sequence to look at other tokens and gather relevant context dynamically when producing its output representation.

Example: In the sentence “The animal didn't cross the street because it was too tired,” self-attention lets the word “it” attend to “animal,” the noun it refers to.

2️⃣ How Does Self-Attention Work?

We project the input embeddings with three learned linear maps into:

Queries (\( Q \))
Keys (\( K \))
Values (\( V \))

Then compute:

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \]

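✅ \( Q K^T \) computes a dot-product similarity score between every query and every key.
✅ Division by \( \sqrt{d_k} \) keeps the scores from growing with the key dimension \( d_k \), so the softmax does not saturate and gradients stay usable.
✅ Softmax normalizes each row of scores into weights that sum to 1.
✅ Multiplying by \( V \) computes a weighted sum of the value vectors, giving a context-enriched output for each token.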
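
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the sizes (`seq_len`, `d_model`, `d_k`), the random inputs, and the untrained projection matrices `W_q`, `W_k`, `W_v` are all assumptions chosen only to make the example run.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # query-key similarity, scaled
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V, weights                      # weighted sum of value vectors


# Toy setup: every size and value here is an illustrative assumption.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))   # stand-in for input embeddings
W_q = rng.normal(size=(d_model, d_k))     # untrained projection matrices
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)    # (4, 8): one context-enriched vector per token
print(weights.shape)   # (4, 4): one weight per (query token, key token) pair
```

In a trained model the projection matrices are learned, so the weights reflect meaningful relationships rather than random noise.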

3️⃣ Why is Self-Attention Important?

✅ Captures long-range dependencies in sequences.
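✅ Relates tokens by content rather than by distance. Self-attention on its own is order-agnostic, which is why transformers add positional encodings to the inputs.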
✅ Allows parallel processing across sequence positions, unlike RNNs.
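
The order-agnostic and parallel properties can be checked with a small experiment. The sketch below assumes a single head with random, untrained weights (all names and sizes are illustrative). It computes every position in one matrix product, with no loop over time steps, and then shows that shuffling the input tokens only shuffles the output rows.

```python
import numpy as np


def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention; all positions are handled in one matrix product."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V


rng = np.random.default_rng(1)
X = rng.normal(size=(5, 6))                 # 5 tokens, 6-dim embeddings (toy values)
W_q, W_k, W_v = (rng.normal(size=(6, 6)) for _ in range(3))

perm = np.array([3, 0, 4, 1, 2])            # an arbitrary reordering of the tokens
out = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# Reordering the input only reorders the output rows: no position information is used.
order_agnostic = np.allclose(out_shuffled, out[perm])
print(order_agnostic)  # True
```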

4️⃣ What is Multi-Head Attention?

Multi-head attention runs multiple self-attention mechanisms in parallel, allowing the model to capture different types of relationships in the data.

Each head has its own \( Q, K, V \) projections, so different heads can specialize in different patterns, for example:

✅ Syntax patterns.
✅ Positional dependencies.
✅ Semantic relationships.

Outputs from each head are concatenated and linearly transformed to produce the final output.

5️⃣ Multi-Head Attention Formula

Given \( h \) heads:

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O \]

where:

\[ \text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \]

✅ Each head has its own learned projections \( W_i^Q, W_i^K, W_i^V \).
✅ \( W^O \) projects the concatenated output back to the model dimension.
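
The formula can be written almost literally in NumPy. The sketch below is a minimal illustration under stated assumptions: two heads, per-head size `d_k = d_v = d_model // h` (a common choice, assumed here), and random untrained matrices in place of learned ones.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention for 2-D inputs."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V


def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v are lists of h per-head matrices; W_o is (h * d_v, d_model)."""
    heads = [
        attention(Q @ wq, K @ wk, V @ wv)      # head_i
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o  # Concat(head_1, ..., head_h) W^O


# Toy sizes and random weights: illustrative assumptions only.
rng = np.random.default_rng(2)
seq_len, d_model, h = 4, 8, 2
d_k = d_v = d_model // h

X = rng.normal(size=(seq_len, d_model))
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_o = rng.normal(size=(h * d_v, d_model))

# Self-attention: queries, keys, and values all come from the same sequence.
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
print(out.shape)  # (4, 8): back in the model dimension
```

Looping over heads keeps the code close to the formula; practical implementations usually batch all heads into a single tensor operation instead.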

6️⃣ Visualization Example

Attention heatmaps plot the attention weights as a grid, with one row per attending token and one column per attended-to token. They can reveal:

✅ Pronoun resolution patterns.
✅ Subject-verb-object relationships.
✅ Context learning across long sequences.
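
As a rough sketch of what such a heatmap contains, the snippet below renders an attention matrix as text, with denser characters standing for larger weights. The sentence, the `render_heatmap` helper, and the random weights are all hypothetical; with a trained model you would substitute its real attention weights, and a plotting library would normally replace the text rendering.

```python
import numpy as np

tokens = ["the", "cat", "sat", "because", "it", "was", "tired"]  # hypothetical sentence

# Random stand-in for attention weights: each row is a softmax over random scores.
rng = np.random.default_rng(3)
scores = rng.normal(size=(len(tokens), len(tokens)))
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)


def render_heatmap(weights, tokens, shades=" .:-=+*#%@"):
    """Return a text heatmap: rows are attending tokens, columns are attended-to tokens."""
    width = max(len(t) for t in tokens)
    lines = []
    for token, row in zip(tokens, weights):
        # Map each weight in [0, 1] to a character; denser means a larger weight.
        cells = "".join(
            shades[min(int(w * len(shades)), len(shades) - 1)] for w in row
        )
        lines.append(f"{token:>{width}} |{cells}|")
    return "\n".join(lines)


heatmap = render_heatmap(weights, tokens)
print(heatmap)

# For each token, the token it attends to most strongly.
most_attended = [tokens[j] for j in weights.argmax(axis=-1)]
print(most_attended)
```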

7️⃣ Applications

✅ Transformers: Core component in encoder and decoder blocks.
✅ BERT, GPT, T5: Use multi-head self-attention for rich embeddings.
✅ Vision Transformers: Apply self-attention to image patches for image classification.
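
For the Vision Transformer case, the main extra step is turning an image into a sequence of patch vectors that attention can treat like tokens. The sketch below shows only that step, under assumed sizes (a 32×32 RGB image and 8×8 patches); `image_to_patches` is a hypothetical helper, and the learned patch embedding and positional encodings used in real models are omitted.

```python
import numpy as np


def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a (num_patches, patch_size * patch_size * C) sequence."""
    H, W, C = image.shape
    if H % patch_size or W % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = H // patch_size, W // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (gh, gw, patch, patch, C)
    return patches.reshape(gh * gw, patch_size * patch_size * C)


# Dummy image with assumed dimensions; the values are just a counter.
image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patches = image_to_patches(image, patch_size=8)
print(patches.shape)  # (16, 192): 16 patch "tokens", each a flattened 8x8x3 block
```

Each row of `patches` can then play the role that a token embedding plays in the text examples.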

Conclusion

✅ Self-attention enables context learning across sequences without recurrence.
✅ Multi-head attention enriches learning by capturing multiple relationship types.
✅ Mastering these concepts is key to understanding and building transformers.

What’s Next?

✅ Visualize self-attention in a transformer on a sample sentence.
✅ Fine-tune a pretrained transformer and observe attention behavior.
✅ Continue structured learning on superml.org.

Join the SuperML Community to share and discuss your attention experiments.

Happy Learning! 🧭
