Variance Reduction in SGD
Techniques to reduce variance in stochastic gradient descent
Introduction
Stochastic Gradient Descent (SGD) updates model parameters using individual samples, introducing high variance in gradient estimates during training.
While this variance helps escape shallow local minima, too much variance:
✅ Causes noisy and unstable training.
✅ Slows down convergence.
✅ Makes it harder to tune learning rates.
1️⃣ What is Variance in SGD?
In Batch Gradient Descent, gradients are computed using the entire dataset, providing a stable gradient estimate.
In SGD, gradients are computed using a single sample, leading to:
✅ High variance between updates.
✅ Fluctuations in the loss curve.
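To make this concrete, here is a minimal NumPy sketch (the linear model, synthetic data, and squared-error loss are illustrative assumptions, not part of the lesson) that compares per-sample gradient estimates with the full-batch gradient at the same parameter values:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                             # synthetic inputs
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)   # noisy targets
w = np.zeros(5)                                            # current parameters

def gradient(X_batch, y_batch, w):
    # Gradient of the mean squared-error loss 0.5 * mean((X w - y)^2).
    residual = X_batch @ w - y_batch
    return X_batch.T @ residual / len(y_batch)

full_grad = gradient(X, y, w)  # stable full-batch estimate
single_grads = np.array([gradient(X[i:i + 1], y[i:i + 1], w) for i in range(len(y))])

print("full-batch gradient:     ", full_grad)
print("mean of per-sample grads:", single_grads.mean(axis=0))  # matches the full-batch gradient
print("std of per-sample grads: ", single_grads.std(axis=0))   # wide spread = high variance

The per-sample estimates are unbiased (their average equals the full-batch gradient) but individually noisy; that noise is the variance the rest of this lesson targets.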
2️⃣ Why Reduce Variance?
Reducing variance helps:
✅ Achieve smoother and more stable convergence.
✅ Use larger learning rates effectively.
✅ Speed up training without getting stuck in noise.
3️⃣ Methods for Variance Reduction in SGD
a) Mini-Batch Gradient Descent
Using mini-batches (e.g., 32 or 64 samples) reduces the variance of each gradient estimate while maintaining computational efficiency; a minimal sketch follows the list of benefits below.
Benefits:
✅ Smoother updates.
✅ Faster training compared to pure SGD.
✅ Makes efficient use of GPU parallelism.
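The sketch referenced above: a bare-bones mini-batch loop in NumPy for an illustrative linear-regression problem (the data, learning rate, and batch size are assumptions chosen for demonstration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
learning_rate = 0.1
batch_size = 64  # averaging 64 samples shrinks the gradient variance vs. single-sample SGD

for epoch in range(5):
    order = rng.permutation(len(y))               # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / len(idx)     # averaged mini-batch gradient
        w -= learning_rate * grad                 # one lower-variance update

Averaging over a mini-batch of size B reduces the variance of the gradient estimate roughly by a factor of B, which is why the updates are smoother than in pure SGD.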
b) Momentum
Momentum accelerates SGD in relevant directions, reducing oscillations and stabilizing training.
Update rule:
\[
v_t = \gamma v_{t-1} + \eta \nabla L(w)
\]
\[
w = w - v_t
\]
where:
- \(v_t\): velocity,
- \(\gamma\): momentum factor (commonly 0.9),
- \(\eta\): learning rate.
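A minimal sketch of the update rule above, using a simple quadratic loss and illustrative hyperparameter values (both are assumptions for demonstration only):

import numpy as np

def grad_L(w):
    # Gradient of the illustrative loss L(w) = 0.5 * ||w||^2.
    return w

w = np.array([5.0, -3.0])   # initial parameters
v = np.zeros_like(w)        # velocity v_t
gamma = 0.9                 # momentum factor
eta = 0.1                   # learning rate

for step in range(200):
    v = gamma * v + eta * grad_L(w)   # v_t = gamma * v_{t-1} + eta * grad L(w)
    w = w - v                         # w = w - v_t

print(w)  # ends up close to the minimum at the origin

In Keras, the same idea is available through the momentum argument of the SGD optimizer, e.g. tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9).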
c) Advanced Optimizers
Optimizers like Adam, RMSProp, and AdaGrad adapt per-parameter learning rates using running statistics of past gradients (moving averages in Adam and RMSProp, accumulated squared gradients in AdaGrad), which damps the effect of noisy gradient estimates during training.
These optimizers combine variance reduction with adaptive learning, improving convergence stability.
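For reference, these optimizers can also be instantiated explicitly in Keras rather than passed as strings; the learning rates below are common illustrative values, not recommendations from this lesson:

import tensorflow as tf

# Adaptive optimizers that scale updates using running statistics of past gradients.
adam = tf.keras.optimizers.Adam(learning_rate=0.001)        # moving averages of gradients and squared gradients
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)  # moving average of squared gradients
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)   # accumulated squared gradients

Any of these objects can be passed to model.compile(optimizer=...), as in the example in the next section.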
4️⃣ Example: Using Mini-Batching and Adam
import tensorflow as tf

# Example model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile with Adam optimizer (variance reduction benefits)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train with mini-batch size of 64
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.1)
Conclusion
✅ Variance in SGD affects training stability and speed.
✅ Variance reduction techniques like mini-batching, momentum, and advanced optimizers help stabilize and speed up training.
✅ Understanding and applying these techniques will improve your model training efficiency and reliability.
What's Next?
✅ Try training a model with and without momentum to observe the differences.
✅ Explore optimizers like Adam and RMSProp for your projects.
✅ Continue structured learning on superml.org for deeper optimization techniques.
Join the SuperML Community to discuss variance reduction strategies and optimize your learning pipeline.
Happy Optimizing! 🎯