Positional Embeddings in Transformers

Learn what positional embeddings are, why they are crucial in transformers, and how they help models understand the order of sequences in deep learning.

🔰 beginner
⏱️ 35 minutes
👤 SuperML Team


📋 Prerequisites

  • Basic understanding of transformers and attention mechanisms

🎯 What You'll Learn

  • Understand what positional embeddings are
  • Learn why positional embeddings are necessary in transformers
  • Explore different types of positional embeddings
  • Gain practical intuition with clear examples

Introduction

Transformers process input sequences in parallel, which makes them fast and scalable. However, this means they do not inherently understand the order of tokens in a sequence.

Positional embeddings solve this problem by injecting position information into the input embeddings so that transformers can utilize sequence order during processing.


1️⃣ Why Do We Need Positional Embeddings?

In NLP, the meaning of a sentence depends on word order:

✅ “The dog chased the cat.”
✅ “The cat chased the dog.”

Self-attention treats the input as an unordered set of tokens, so without positional information a transformer cannot tell these two sentences apart.

Positional embeddings allow transformers to capture sequential information while maintaining parallel processing.
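To make this concrete, here is a minimal sketch (using PyTorch, which this tutorial does not otherwise assume) showing that self-attention on its own is order-blind: shuffling the input tokens simply shuffles the outputs in the same way, with nothing in the result that depends on the original order.

```python
# A minimal sketch: self-attention without positional information is
# permutation-equivariant, i.e. reordering tokens just reorders outputs.
import torch

torch.manual_seed(0)
attn = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

tokens = torch.randn(1, 5, 8)            # 5 token embeddings, no position info
perm = torch.tensor([3, 0, 4, 1, 2])     # a shuffled word order
shuffled = tokens[:, perm, :]

out, _ = attn(tokens, tokens, tokens)
out_shuffled, _ = attn(shuffled, shuffled, shuffled)

# The shuffled output is just the original output, shuffled the same way:
print(torch.allclose(out[:, perm, :], out_shuffled, atol=1e-5))  # True
```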


2️⃣ What Are Positional Embeddings?

Positional embeddings are vectors added to token embeddings to encode the position of each token in the sequence.

They allow the model to understand:

✅ The relative and absolute positions of tokens.
✅ The structure of the sequence during attention calculations.


3️⃣ Types of Positional Embeddings

a) Sinusoidal Positional Embeddings

Introduced in the original transformer paper, "Attention Is All You Need" (Vaswani et al., 2017), these are fixed, deterministic embeddings based on sine and cosine functions of different frequencies.

For position $pos$ and embedding dimension index $i$:

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

✅ Enable the model to learn relative positions easily, since for any fixed offset $k$, $PE_{pos+k}$ is a linear function of $PE_{pos}$.
✅ Require no additional learnable parameters and can extrapolate to sequence lengths longer than those seen during training.
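Below is a minimal sketch of how such a table could be computed, directly following the formulas above; the function name and shapes are our own choices, not part of any standard API.

```python
# A minimal sketch: build a sinusoidal positional-embedding table of shape
# (max_len, d_model), following the sine/cosine formulas above.
import torch

def sinusoidal_positional_embeddings(max_len: int, d_model: int) -> torch.Tensor:
    position = torch.arange(max_len).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2)             # even dimension indices 2i
    div_term = 10000.0 ** (two_i / d_model)         # 10000^(2i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)    # even dimensions: sine
    pe[:, 1::2] = torch.cos(position / div_term)    # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_embeddings(max_len=128, d_model=64)
print(pe.shape)  # torch.Size([128, 64])
```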


b) Learnable Positional Embeddings

Instead of fixed embeddings, these are learned during training:

✅ Each position has a learnable vector.
✅ Allow the model to adapt positional information specific to the dataset.

Learnable positional embeddings are used in models such as BERT and GPT-2.
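Here is a minimal sketch of the learnable variant, assuming a BERT-style setup in which every position index gets its own trainable vector (the class name is illustrative, not a library API):

```python
# A minimal sketch: one trainable embedding vector per position index.
import torch
import torch.nn as nn

class LearnablePositionalEmbedding(nn.Module):
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)   # trained with the rest of the model

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_emb(positions)  # broadcast over the batch

x = torch.randn(2, 10, 64)                  # a batch of 2 sequences, 10 tokens each
layer = LearnablePositionalEmbedding(max_len=512, d_model=64)
print(layer(x).shape)                       # torch.Size([2, 10, 64])
```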


4️⃣ How Positional Embeddings Are Used

In transformers:

✅ The positional embeddings are added to the input token embeddings before passing them into the encoder.

embeddings = token_embeddings + positional_embeddings

This combined embedding is then used in self-attention layers to process the sequence with position awareness.
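Here is a minimal, self-contained sketch of that step, using illustrative names that mirror the pseudocode above; learnable positional embeddings are used here for brevity, but the sinusoidal table from the earlier sketch could be added in exactly the same way.

```python
# A minimal sketch: look up token embeddings, add positional embeddings,
# and use the sum as the position-aware input to the encoder.
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 1000, 64, 128
token_embedding = nn.Embedding(vocab_size, d_model)      # one vector per token id
positional_embedding = nn.Embedding(max_len, d_model)    # one vector per position

token_ids = torch.tensor([[5, 42, 7, 99]])                # (batch=1, seq_len=4)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2, 3]]

token_embeddings = token_embedding(token_ids)             # (1, 4, d_model)
positional_embeddings = positional_embedding(positions)   # (1, 4, d_model)

embeddings = token_embeddings + positional_embeddings     # position-aware input
print(embeddings.shape)                                    # torch.Size([1, 4, 64])
```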


5️⃣ Practical Insights

✅ Sinusoidal embeddings are lightweight and effective for many tasks.
✅ Learnable embeddings may provide better performance on large datasets where capturing dataset-specific positional patterns helps.


Conclusion

Positional embeddings are essential in transformers to encode sequence order while maintaining parallelism.
✅ They enable transformers to understand the structure of language and sequences, making them effective for NLP, speech, and even vision tasks.


What’s Next?

✅ Visualize positional embeddings in a transformer model to understand how position is encoded.
✅ Experiment with sinusoidal and learnable positional embeddings to see their impact on model performance.
✅ Continue structured deep learning on superml.org.


Join the SuperML Community to discuss and share your learnings with positional embeddings.


Happy Learning! 🔢
