Vanishing and Exploding Gradients
Addressing vanishing and exploding gradient problems
Introduction
Training deep neural networks can be challenging due to vanishing and exploding gradients, which can slow or even completely stop learning.
1️⃣ What Are Vanishing and Exploding Gradients?
Vanishing Gradients:
During backpropagation, gradients become very small as they are propagated backward through the layers, causing:
✅ Early layers to learn very slowly or not at all.
✅ Stagnation in loss reduction.
Exploding Gradients:
Gradients become excessively large during backpropagation, causing:
✅ Unstable training.
✅ Weights to grow too large, resulting in NaN values or model divergence.
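You can observe the vanishing effect directly by inspecting per-layer gradient norms. The sketch below is a minimal, hypothetical setup (a deep stack of sigmoid layers with random data) that uses tf.GradientTape to print the gradient norm of each weight tensor; the earliest layers typically show the smallest norms.
# Minimal sketch (hypothetical model and data) for inspecting gradient magnitudes.
import tensorflow as tf

# A deep stack of sigmoid layers tends to exhibit vanishing gradients.
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(32, activation='sigmoid') for _ in range(10)]
    + [tf.keras.layers.Dense(1)]
)

x = tf.random.normal((64, 16))  # dummy inputs
y = tf.random.normal((64, 1))   # dummy targets

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))

grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    print(var.name, float(tf.norm(grad)))  # early layers usually show much smaller norms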
2️⃣ Why Do These Issues Occur?
In deep networks, the chain rule multiplies many per-layer factors (weights and activation derivatives) during backpropagation:
✅ If these factors are mostly smaller than 1 in magnitude, gradients shrink exponentially (vanishing).
✅ If these factors are mostly larger than 1 in magnitude, gradients grow exponentially (exploding).
Activation functions like sigmoid and tanh further contribute to vanishing gradients because their derivatives are small (at most 0.25 for sigmoid).
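A quick back-of-the-envelope calculation (illustrative numbers only) shows how quickly repeated multiplication compounds across layers:
# Illustrative only: repeated multiplication of per-layer gradient factors.
depth = 50
small_factor, large_factor = 0.9, 1.1

print(small_factor ** depth)  # ~0.005 -> shrinks exponentially (vanishing)
print(large_factor ** depth)  # ~117   -> grows exponentially (exploding)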
3️⃣ Effects on Model Training
- Vanishing gradients prevent lower layers from learning, leading to poor performance.
- Exploding gradients cause instability and divergence during training.
4️⃣ Strategies to Mitigate These Issues
a) Proper Weight Initialization
- Xavier/Glorot Initialization for tanh activations.
- He Initialization for ReLU activations.
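For example, in Keras these initializers can be set per layer via kernel_initializer; a minimal sketch (layer sizes are arbitrary):
# Minimal sketch: choosing initializers per layer in Keras (sizes are arbitrary).
import tensorflow as tf

tanh_layer = tf.keras.layers.Dense(
    64, activation='tanh',
    kernel_initializer=tf.keras.initializers.GlorotUniform())  # Xavier/Glorot

relu_layer = tf.keras.layers.Dense(
    64, activation='relu',
    kernel_initializer=tf.keras.initializers.HeNormal())  # He initialization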
b) Using ReLU or Variants
ReLU activations help mitigate vanishing gradients because their derivative is 1 for positive inputs; leaky variants additionally keep a small gradient for negative inputs.
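In Keras this is simply the activation choice; a minimal sketch (layer sizes are arbitrary):
# Minimal sketch: ReLU and a leaky variant as Keras layers (sizes are arbitrary).
import tensorflow as tf

hidden = tf.keras.layers.Dense(128, activation='relu')  # derivative is 1 for positive inputs
leaky = tf.keras.layers.LeakyReLU()  # keeps a small slope for negative inputs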
c) Gradient Clipping
Clipping gradients during training caps their magnitude at a chosen threshold, preventing them from exploding.
# Example in TensorFlow: clip the global gradient norm to 1.0
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
d) Batch Normalization
Normalizing activations helps maintain stable gradients across layers.
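A minimal sketch of placing BatchNormalization between a dense layer and its activation (layer sizes are arbitrary):
# Minimal sketch: BatchNormalization between a Dense layer and its activation.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128),
    tf.keras.layers.BatchNormalization(),  # normalizes the previous layer's outputs
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])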
e) Residual Connections
Using skip connections allows gradients to flow directly, mitigating vanishing gradients in very deep networks.
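A minimal sketch of a skip connection with the Keras functional API (shapes are arbitrary; the added branches must have matching shapes):
# Minimal sketch: a residual block where the input is added back to the output.
import tensorflow as tf

inputs = tf.keras.Input(shape=(64,))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = tf.keras.layers.Dense(64)(x)
x = tf.keras.layers.Add()([inputs, x])  # skip connection: gradients flow directly through the add
outputs = tf.keras.layers.Activation('relu')(x)
model = tf.keras.Model(inputs, outputs)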
5️⃣ Practical Example: Gradient Clipping
import tensorflow as tf

# Assumes `model` is an already-defined Keras model.
model.compile(optimizer=tf.keras.optimizers.Adam(clipvalue=1.0),  # clip each gradient element to [-1.0, 1.0]
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Conclusion
✅ Vanishing and exploding gradients are common issues when training deep networks.
✅ Understanding these concepts helps you design and train stable, effective models.
✅ Proper weight initialization, ReLU activations, gradient clipping, batch normalization, and residual connections can mitigate these problems.
What's Next?
✅ Experiment with gradient clipping in your models.
✅ Explore residual networks and advanced architectures that handle these challenges.
✅ Continue structured deep learning on superml.org.
Join the SuperML Community to discuss your experiments and get practical help.
Happy Learning! ⚡