Vanishing and Exploding Gradients
Addressing vanishing and exploding gradient problems
Introduction
Training deep neural networks can be challenging due to vanishing and exploding gradients, which can slow or even completely stop learning.
1️⃣ What Are Vanishing and Exploding Gradients?
Vanishing Gradients:
During backpropagation, gradients become very small as they are propagated backward through the layers, causing:
✅ Early layers to learn very slowly or not at all.
✅ Stagnation in loss reduction.
Exploding Gradients:
Gradients become excessively large during backpropagation, causing:
✅ Unstable training.
✅ Weights to grow too large, resulting in NaN values or model divergence.
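You can observe the vanishing effect directly by inspecting per-layer gradient norms. The sketch below is a minimal, hypothetical setup (a deep stack of sigmoid layers with random data) that uses tf.GradientTape to print the gradient norm of each weight tensor; the earliest layers typically show the smallest norms.
# Minimal sketch (hypothetical model and data) for inspecting gradient magnitudes.
import tensorflow as tf

# A deep stack of sigmoid layers tends to exhibit vanishing gradients.
model = tf.keras.Sequential(
    [tf.keras.layers.Dense(32, activation='sigmoid') for _ in range(10)]
    + [tf.keras.layers.Dense(1)]
)

x = tf.random.normal((64, 16))  # dummy inputs
y = tf.random.normal((64, 1))   # dummy targets

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))

grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    print(var.name, float(tf.norm(grad)))  # early layers usually show much smaller norms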
2️⃣ Why Do These Issues Occur?
In deep networks, the chain rule multiplies many per-layer factors (weights and activation derivatives) during backpropagation:
✅ If these factors are mostly smaller than 1 in magnitude, gradients shrink exponentially (vanishing).
✅ If these factors are mostly larger than 1 in magnitude, gradients grow exponentially (exploding).
Activation functions like sigmoid and tanh further contribute to vanishing gradients because their derivatives are small (at most 0.25 for sigmoid).
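A quick back-of-the-envelope calculation (illustrative numbers only) shows how quickly repeated multiplication compounds across layers:
# Illustrative only: repeated multiplication of per-layer gradient factors.
depth = 50
small_factor, large_factor = 0.9, 1.1

print(small_factor ** depth)  # ~0.005 -> shrinks exponentially (vanishing)
print(large_factor ** depth)  # ~117   -> grows exponentially (exploding)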
3️⃣ Effects on Model Training
- Vanishing gradients prevent lower layers from learning, leading to poor performance.
- Exploding gradients cause instability and divergence during training.
4️⃣ Strategies to Mitigate These Issues
a) Proper Weight Initialization
- Xavier/Glorot Initialization for tanh activations.
- He Initialization for ReLU activations.
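For example, in Keras these initializers can be set per layer via kernel_initializer; a minimal sketch (layer sizes are arbitrary):
# Minimal sketch: choosing initializers per layer in Keras (sizes are arbitrary).
import tensorflow as tf

tanh_layer = tf.keras.layers.Dense(
    64, activation='tanh',
    kernel_initializer=tf.keras.initializers.GlorotUniform())  # Xavier/Glorot

relu_layer = tf.keras.layers.Dense(
    64, activation='relu',
    kernel_initializer=tf.keras.initializers.HeNormal())  # He initialization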
b) Using ReLU or Variants
ReLU activations help mitigate vanishing gradients because their derivative is 1 for positive inputs; leaky variants additionally keep a small gradient for negative inputs.
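In Keras this is simply the activation choice; a minimal sketch (layer sizes are arbitrary):
# Minimal sketch: ReLU and a leaky variant as Keras layers (sizes are arbitrary).
import tensorflow as tf

hidden = tf.keras.layers.Dense(128, activation='relu')  # derivative is 1 for positive inputs
leaky = tf.keras.layers.LeakyReLU()  # keeps a small slope for negative inputs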
c) Gradient Clipping
Clipping gradients during training caps their magnitude at a chosen threshold, preventing them from exploding.
# Example in TensorFlow: clip the global gradient norm to 1.0
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
d) Batch Normalization
Normalizing activations helps maintain stable gradients across layers.
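A minimal sketch of placing BatchNormalization between a dense layer and its activation (layer sizes are arbitrary):
# Minimal sketch: BatchNormalization between a Dense layer and its activation.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128),
    tf.keras.layers.BatchNormalization(),  # normalizes the previous layer's outputs
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])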
e) Residual Connections
Using skip connections allows gradients to flow directly, mitigating vanishing gradients in very deep networks.
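A minimal sketch of a skip connection with the Keras functional API (shapes are arbitrary; the added branches must have matching shapes):
# Minimal sketch: a residual block where the input is added back to the output.
import tensorflow as tf

inputs = tf.keras.Input(shape=(64,))
x = tf.keras.layers.Dense(64, activation='relu')(inputs)
x = tf.keras.layers.Dense(64)(x)
x = tf.keras.layers.Add()([inputs, x])  # skip connection: gradients flow directly through the add
outputs = tf.keras.layers.Activation('relu')(x)
model = tf.keras.Model(inputs, outputs)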
5️⃣ Practical Example: Gradient Clipping
import tensorflow as tf

# Assumes `model` is an already-defined Keras model.
model.compile(optimizer=tf.keras.optimizers.Adam(clipvalue=1.0),  # clip each gradient element to [-1.0, 1.0]
              loss='categorical_crossentropy',
              metrics=['accuracy'])
Conclusion
✅ Vanishing and exploding gradients are common issues when training deep networks.
✅ Understanding these concepts helps you design and train stable, effective models.
✅ Proper weight initialization, ReLU activations, gradient clipping, batch normalization, and residual connections can mitigate these problems.
What's Next?
✅ Experiment with gradient clipping in your models.
✅ Explore residual networks and advanced architectures that handle these challenges.
✅ Continue structured deep learning on superml.org.
Join the SuperML Community to discuss your experiments and get practical help.
Happy Learning! ⚡