πŸ“– Lesson ⏱️ 60 minutes

Variance Reduction in SGD

Techniques to reduce variance in stochastic gradient descent

Introduction

Stochastic Gradient Descent (SGD) updates model parameters using individual samples, introducing high variance in gradient estimates during training.

While this variance helps escape shallow local minima, too much variance:

βœ… Causes noisy and unstable training.
βœ… Slows down convergence.
βœ… Makes it harder to tune learning rates.


1️⃣ What is Variance in SGD?

In Batch Gradient Descent, gradients are computed using the entire dataset, providing a stable gradient estimate.

In SGD, gradients are computed using a single sample, leading to:

βœ… High variance between updates.
βœ… Fluctuations in the loss curve.
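
To see this difference numerically, the short sketch below compares the full-batch gradient of a mean-squared-error loss with the spread of single-sample and 32-sample gradient estimates. The toy linear-regression data and the fixed weight vector are assumptions made purely for illustration.

import numpy as np

# Toy linear regression: y = X @ w_true + noise (synthetic data, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)  # current parameter estimate

def grad(xb, yb, w):
    """Gradient of the mean squared error 0.5 * mean((X w - y)^2) with respect to w."""
    return xb.T @ (xb @ w - yb) / len(yb)

full_grad = grad(X, y, w)  # stable full-batch gradient
single_grads = np.array([grad(X[i:i+1], y[i:i+1], w) for i in range(1000)])
batch_grads = np.array([grad(X[i:i+32], y[i:i+32], w) for i in range(0, 1000 - 32, 32)])

print("full-batch gradient:        ", np.round(full_grad, 3))
print("variance of 1-sample grads: ", single_grads.var(axis=0).mean())
print("variance of 32-sample grads:", batch_grads.var(axis=0).mean())

The 32-sample estimates cluster far more tightly around the full-batch gradient than the single-sample ones; that spread is exactly the variance this lesson is about.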


2️⃣ Why Reduce Variance?

Reducing variance helps:

βœ… Achieve smoother and more stable convergence.
βœ… Use larger learning rates effectively.
βœ… Speed up training without getting stuck in noise.


3️⃣ Methods for Variance Reduction in SGD

a) Mini-Batch Gradient Descent

Using mini-batches (e.g., 32 or 64 samples) averages the gradient over several samples, which reduces variance while keeping each update computationally cheap (see the sketch after the list below).

Benefits:

βœ… Smoother updates.
βœ… Faster wall-clock training than single-sample SGD, thanks to vectorized computation.
βœ… Efficient use of GPU parallelism.
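
As a rough sketch of what a mini-batch update loop does under the hood (synthetic regression data and hand-picked hyperparameters are assumed purely for illustration; frameworks such as Keras handle this batching for you, as in the example later in this lesson):

import numpy as np

# Synthetic regression data, assumed only for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1024)

w = np.zeros(5)
lr, batch_size = 0.1, 64

for epoch in range(10):
    perm = rng.permutation(len(X))               # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = xb.T @ (xb @ w - yb) / len(yb)    # gradient averaged over the mini-batch
        w -= lr * grad                           # one parameter update per mini-batch

print("learned weights:", np.round(w, 3))

Because each update averages the per-sample gradients in the batch, the variance of the update shrinks roughly in proportion to the batch size.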


b) Momentum

Momentum accelerates SGD in relevant directions, reducing oscillations and stabilizing training.

Update rule:

\[
v_t = \gamma v_{t-1} + \eta \nabla L(w)
\]
\[
w = w - v_t
\]

where:

  • \( v_t \): velocity,
  • \( \gamma \): momentum factor (commonly 0.9),
  • \( \eta \): learning rate.
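
Below is a minimal sketch of this update rule applied to a toy regression problem (the synthetic data and hand-picked hyperparameters are assumptions for illustration); in Keras, the equivalent built-in is tf.keras.optimizers.SGD(momentum=0.9).

import numpy as np

# Synthetic regression data, assumed only for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1024)

w = np.zeros(5)
v = np.zeros(5)                        # velocity v_t
lr, gamma, batch_size = 0.05, 0.9, 64  # eta, gamma, mini-batch size

for epoch in range(10):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = xb.T @ (xb @ w - yb) / len(yb)
        v = gamma * v + lr * grad      # v_t = gamma * v_{t-1} + eta * grad
        w = w - v                      # w = w - v_t

print("learned weights:", np.round(w, 3))

Because the velocity accumulates gradients over many steps, noise in individual mini-batch gradients partially cancels out, which is why momentum damps oscillations.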

c) Advanced Optimizers

Optimizers such as Adam and RMSProp maintain moving averages of the gradients (and of their squares), while AdaGrad accumulates squared gradients; all of them adapt the learning rate per parameter, which dampens the effect of noisy individual gradient estimates.

These optimizers combine variance reduction with adaptive learning, improving convergence stability.
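
To make the "moving averages" idea concrete, here is a single Adam-style update step written out in NumPy. The standard published default hyperparameters are used; the function name and the dummy gradient are illustrative, not part of any library API.

import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update using moving averages of the gradient (m) and squared gradient (v)."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment moving average
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment moving average
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive per-parameter step
    return w, m, v

# Example usage with a dummy gradient
w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
w, m, v = adam_step(w, np.array([0.5, -1.0, 0.2]), m, v, t=1)
print(w)

The first moment acts like momentum over the gradients, while the second moment rescales each parameter's step so that noisy coordinates take smaller steps.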


4️⃣ Example: Using Mini-Batching and Adam

import tensorflow as tf

# Load and flatten a dataset with 784 features and 10 classes
# (MNIST is assumed here for illustration)
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)

# Example model: a small fully connected classifier
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile with the Adam optimizer (its moving averages smooth out noisy gradients)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train with a mini-batch size of 64: each update averages 64 per-sample gradients
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.1)

Conclusion

βœ… Variance in SGD affects training stability and speed.
βœ… Variance reduction techniques like mini-batching, momentum, and advanced optimizers help stabilize and speed up training.
βœ… Understanding and applying these will improve your model training efficiency and reliability.


What’s Next?

βœ… Try training a model with and without momentum to observe differences.
βœ… Explore optimizers like Adam and RMSProp for your projects.
βœ… Continue structured learning on superml.org for deeper optimization techniques.


Join the SuperML Community to discuss variance reduction strategies and optimize your learning pipeline.


Happy Optimizing! 🎯