📋 Prerequisites
- Basic understanding of SGD
- Basic Python knowledge
🎯 What You'll Learn
- Understand what variance in SGD means
- Learn why variance reduction is important
- Explore practical methods for variance reduction
- Connect these concepts to your model training and debugging
Introduction
Stochastic Gradient Descent (SGD) updates model parameters using individual samples, introducing high variance in gradient estimates during training.
While this variance helps escape shallow local minima, too much variance:
✅ Causes noisy and unstable training.
✅ Slows down convergence.
✅ Makes it harder to tune learning rates.
1️⃣ What is Variance in SGD?
In Batch Gradient Descent, gradients are computed using the entire dataset, providing a stable gradient estimate.
In SGD, gradients are computed using a single sample, leading to:
✅ High variance between updates.
✅ Fluctuations in the loss curve (illustrated in the sketch below).
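To make this concrete, here is a minimal sketch on a toy linear-regression problem (the synthetic data is purely illustrative, not a real training setup) comparing single-sample gradients against the full-batch gradient:

import numpy as np

# Toy linear-regression problem with synthetic data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)  # current parameters

# Full-batch gradient of the mean-squared-error loss.
full_grad = 2 * X.T @ (X @ w - y) / len(X)

# Per-sample gradients: each row is the gradient from one example.
per_sample_grads = 2 * (X @ w - y)[:, None] * X

print("full-batch gradient:         ", full_grad)
print("mean of per-sample gradients:", per_sample_grads.mean(axis=0))
print("std of per-sample gradients: ", per_sample_grads.std(axis=0))

The per-sample gradients average out to the full-batch gradient, but any single one can be far from it; that spread is the variance SGD injects into each update.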
2️⃣ Why Reduce Variance?
Reducing variance helps:
✅ Achieve smoother and more stable convergence.
✅ Use larger learning rates effectively.
✅ Speed up training without getting stuck in noise.
3️⃣ Methods for Variance Reduction in SGD
a) Mini-Batch Gradient Descent
Using mini-batches (e.g., 32 or 64 samples) reduces gradient variance while maintaining computational efficiency, as the sketch after the list below illustrates.
Benefits:
✅ Smoother updates.
✅ Faster training compared to pure SGD.
✅ Efficient use of GPU parallelism.
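A quick way to see this effect, reusing the toy problem from the sketch above (again an illustrative assumption, not a real training setup): average per-sample gradients over mini-batches of different sizes and compare the spread of the resulting estimates.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
w = np.zeros(5)

# Gradient of the squared error for each individual example.
per_sample_grads = 2 * (X @ w - y)[:, None] * X

for batch_size in (1, 32, 64):
    # Draw 500 random mini-batches, average the per-sample gradients within
    # each batch, and measure the spread of the resulting estimates.
    idx = rng.integers(0, len(X), size=(500, batch_size))
    batch_grads = per_sample_grads[idx].mean(axis=1)  # shape: (500, 5)
    print(f"batch size {batch_size:>2}: spread = {batch_grads.std(axis=0).mean():.4f}")

The spread shrinks roughly with the square root of the batch size, which is why even modest batch sizes already give much smoother updates.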
b) Momentum
Momentum accelerates SGD in relevant directions, reducing oscillations and stabilizing training.
Update rule:
\[
v_t = \gamma v_{t-1} + \eta \nabla L(w)
\]
\[
w = w - v_t
\]
where:
- \(v_t\): velocity,
- \(\gamma\): momentum factor (commonly 0.9),
- \(\eta\): learning rate.
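In Keras, momentum is a parameter of the plain SGD optimizer. The snippet below shows both the built-in option and the update rule above written out by hand; the hand-rolled momentum_step function is a minimal sketch for a single parameter vector, not a full training loop.

import numpy as np
import tensorflow as tf

# Built-in: plain SGD with a momentum factor of 0.9.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)

# Hand-rolled version of the update rule, for a single parameter vector w.
def momentum_step(w, grad, velocity, lr=0.01, gamma=0.9):
    velocity = gamma * velocity + lr * grad   # v_t = gamma * v_{t-1} + eta * grad
    w = w - velocity                          # w = w - v_t
    return w, velocity

# One illustrative step with a dummy gradient.
w = np.zeros(5)
velocity = np.zeros(5)
w, velocity = momentum_step(w, grad=np.ones(5), velocity=velocity)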
c) Advanced Optimizers
Optimizers like Adam, RMSProp, and AdaGrad keep moving averages of past gradients and adapt per-parameter learning rates, which dampens the effect of noisy gradient estimates.
These optimizers combine this smoothing with adaptive step sizes, improving convergence stability.
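In Keras these optimizers can be used by name (as in the example below) or configured explicitly. The decay factors shown here are the library defaults and control the moving averages mentioned above.

import tensorflow as tf

# Adam: moving averages of gradients (beta_1) and of squared gradients (beta_2).
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

# RMSProp: moving average of squared gradients controlled by rho.
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)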
4️⃣ Example: Using Mini-Batching and Adam
import tensorflow as tf

# Example model: a small fully connected classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile with the Adam optimizer (adaptive, variance-dampening updates).
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train with a mini-batch size of 64; x_train and y_train are assumed to be
# preprocessed features and one-hot labels (see the data sketch below).
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.1)
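To run this end to end you need x_train and y_train. A minimal sketch using the Keras built-in MNIST loader, assuming the 784-feature input above corresponds to flattened 28×28 grayscale images:

import tensorflow as tf

# Load MNIST and flatten images to 784-dimensional vectors.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0

# One-hot encode labels to match categorical_crossentropy.
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)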
Conclusion
✅ Variance in SGD affects training stability and speed.
✅ Variance reduction techniques like mini-batching, momentum, and advanced optimizers help stabilize and speed up training.
✅ Understanding and applying these will improve your model training efficiency and reliability.
What’s Next?
✅ Try training a model with and without momentum to observe differences.
✅ Explore optimizers like Adam and RMSProp for your projects.
✅ Continue structured learning on superml.org for deeper optimization techniques.
Join the SuperML Community to discuss variance reduction strategies and optimize your learning pipeline.
Happy Optimizing! 🎯