πŸ“– Lesson ⏱️ 90 minutes

Positional Embeddings

Encoding positional information in transformer models

Introduction

Transformers process input sequences in parallel, which makes them fast and scalable. However, this means they do not inherently understand the order of tokens in a sequence.

Positional embeddings solve this problem by injecting position information into the input embeddings so that transformers can utilize sequence order during processing.


1️⃣ Why Do We Need Positional Embeddings?

In NLP, the meaning of a sentence depends on word order:

βœ… β€œThe dog chased the cat.”
βœ… β€œThe cat chased the dog.”

Self-attention itself has no notion of order: it treats the input as a set of tokens, so without positional information the model cannot tell these two sentences apart.

Positional embeddings allow transformers to capture sequential information while maintaining parallel processing.


2️⃣ What Are Positional Embeddings?

Positional embeddings are vectors added to token embeddings to encode the position of each token in the sequence.

They allow the model to understand:

βœ… The relative and absolute positions of tokens.
βœ… The structure of the sequence during attention calculations.


3️⃣ Types of Positional Embeddings

a) Sinusoidal Positional Embeddings

Introduced in the original transformer paper (β€œAttention Is All You Need”, Vaswani et al., 2017), these are fixed, deterministic embeddings built from sine and cosine functions of different frequencies.

For position \( pos \) and dimension index \( i \), with model dimension \( d_{model} \):

\[
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
\]
\[
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
\]

βœ… Enable the model to reason about relative positions, since for a fixed offset \( k \), \( PE_{pos+k} \) is a linear function of \( PE_{pos} \).
βœ… Do not require additional learnable parameters.
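
To make the formula concrete, here is a minimal NumPy sketch of the sinusoidal table. The function name, array shapes, and the assumption that d_model is even are illustrative choices, not part of the original paper:

```python
import numpy as np

def sinusoidal_positional_embeddings(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of fixed sinusoidal embeddings.

    Assumes d_model is even.
    """
    positions = np.arange(max_len)[:, np.newaxis]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # even indices 2i
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2i/d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

# Example: embeddings for a 50-token sequence with d_model = 128
pe = sinusoidal_positional_embeddings(max_len=50, d_model=128)
print(pe.shape)  # (50, 128)
```

Because the table depends only on the maximum length and d_model, it can be precomputed once and reused for any sequence of that length or shorter.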


b) Learnable Positional Embeddings

Instead of fixed embeddings, these are learned during training:

βœ… Each position has a learnable vector.
βœ… Allow the model to adapt positional information specific to the dataset.

Popular in models like BERT.
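
For comparison, here is a minimal PyTorch sketch of learnable positional embeddings. The class name and shapes are illustrative, not the implementation of any particular model:

```python
import torch
import torch.nn as nn

class LearnablePositionalEmbedding(nn.Module):
    """One learnable vector per position, trained jointly with the model."""

    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        # Embedding table of shape (max_len, d_model); each row is a position vector.
        self.position_embeddings = nn.Embedding(max_len, d_model)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch_size, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)   # (seq_len,)
        return token_embeddings + self.position_embeddings(positions)       # broadcast over batch

# Example usage
token_embeddings = torch.randn(2, 10, 64)   # batch of 2, 10 tokens, d_model = 64
pos_layer = LearnablePositionalEmbedding(max_len=512, d_model=64)
output = pos_layer(token_embeddings)
print(output.shape)  # torch.Size([2, 10, 64])
```

Because the table is an nn.Embedding, its rows are updated by backpropagation just like the token embeddings.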


4️⃣ How Positional Embeddings Are Used

In transformers:

βœ… The positional embeddings are added to the input token embeddings before passing them into the encoder.

embeddings = token_embeddings + positional_embeddings

This combined embedding is then used in self-attention layers to process the sequence with position awareness.
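
Putting the pieces together, the sketch below shows how this addition might look at the encoder input. It reuses the sinusoidal_positional_embeddings helper from the sketch above, and the vocabulary size and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 64

# Learned token embedding table; the positional table reuses the
# sinusoidal helper sketched earlier in this lesson.
token_embedding = nn.Embedding(vocab_size, d_model)
pe_table = torch.tensor(
    sinusoidal_positional_embeddings(max_len=512, d_model=d_model),
    dtype=torch.float32,
)

token_ids = torch.randint(0, vocab_size, (2, 10))        # (batch_size=2, seq_len=10)
token_embeddings = token_embedding(token_ids)             # (2, 10, d_model)
positional_embeddings = pe_table[: token_ids.size(1)]     # first 10 position vectors

embeddings = token_embeddings + positional_embeddings     # broadcast over the batch
# `embeddings` is what would be fed into the first self-attention layer.
print(embeddings.shape)  # torch.Size([2, 10, 64])
```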


5️⃣ Practical Insights

βœ… Sinusoidal embeddings are lightweight and effective for many tasks.
βœ… Learnable embeddings may provide better performance on large datasets where capturing dataset-specific positional patterns helps.


Conclusion

βœ… Positional embeddings are essential in transformers to encode sequence order while maintaining parallelism.
βœ… They enable transformers to understand the structure of language and sequences, making them effective for NLP, speech, and even vision tasks.


What’s Next?

βœ… Visualize positional embeddings in a transformer model to understand how position is encoded.
βœ… Experiment with sinusoidal and learnable positional embeddings to see their impact on model performance.
βœ… Continue structured deep learning on superml.org.


Join the SuperML Community to discuss and share your learnings with positional embeddings.


Happy Learning! πŸ”’