Variance Reduction in Stochastic Gradient Descent

Learn why variance in SGD matters, how it affects training, and practical methods like mini-batching, momentum, and advanced optimizers to reduce variance effectively.

🔰 beginner
⏱️ 40 minutes
👤 SuperML Team


📋 Prerequisites

  • Basic understanding of SGD
  • Basic Python knowledge

🎯 What You'll Learn

  • Understand what variance in SGD means
  • Learn why variance reduction is important
  • Explore practical methods for variance reduction
  • Connect these concepts to your model training and debugging

Introduction

Stochastic Gradient Descent (SGD) updates model parameters using individual samples, introducing high variance in gradient estimates during training.

While this variance helps escape shallow local minima, too much variance:

✅ Causes noisy and unstable training.
✅ Slows down convergence.
✅ Makes it harder to tune learning rates.


1️⃣ What is Variance in SGD?

In Batch Gradient Descent, gradients are computed using the entire dataset, providing a stable gradient estimate.

In SGD, gradients are computed using a single sample, leading to:

✅ High variance between updates.
✅ Fluctuations in the loss curve.
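
To see this concretely, here is a minimal NumPy sketch (synthetic data and a toy least-squares loss, not part of the original tutorial) that compares the spread of single-sample gradients with the spread of 64-sample mini-batch gradients around the full-batch gradient:

import numpy as np

# Toy least-squares problem: y ≈ 3x, with the parameter w currently at 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000,))
y = 3.0 * X + rng.normal(scale=0.5, size=1000)
w = 0.0

# Per-sample gradient of the squared error (w*x - y)^2 with respect to w.
per_sample_grads = 2 * X * (w * X - y)

print("Full-batch gradient:", per_sample_grads.mean())
print("Std. dev. across single-sample gradients:", per_sample_grads.std())

# Averaging 64 samples shrinks the spread by roughly sqrt(64) = 8.
mini_batch_means = per_sample_grads[:960].reshape(-1, 64).mean(axis=1)
print("Std. dev. across 64-sample mini-batch gradients:", mini_batch_means.std())

The mini-batch estimates cluster much more tightly around the full-batch gradient, which is exactly the effect the methods below exploit.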


2️⃣ Why Reduce Variance?

Reducing variance helps:

✅ Achieve smoother and more stable convergence.
✅ Use larger learning rates effectively.
✅ Speed up training without getting stuck in noise.


3️⃣ Methods for Variance Reduction in SGD

a) Mini-Batch Gradient Descent

Using mini-batches (e.g., 32 or 64 samples) reduces gradient variance while maintaining computational efficiency; a short example loop follows the benefits list below.

Benefits:

✅ Smoother updates.
✅ Faster training compared to pure SGD.
✅ Better utilization of GPU parallelism through batched operations.
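
Below is a minimal sketch of a mini-batch gradient descent loop on a toy least-squares problem (synthetic data; the batch size and learning rate are illustrative choices, not tuned values):

import numpy as np

# Toy least-squares problem: y ≈ 3x.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000,))
y = 3.0 * X + rng.normal(scale=0.5, size=1000)

w, lr, batch_size = 0.0, 0.1, 64
for epoch in range(5):
    order = rng.permutation(len(X))             # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = (2 * xb * (w * xb - yb)).mean()  # gradient averaged over the batch
        w -= lr * grad
    print(f"epoch {epoch}: w = {w:.3f}")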


b) Momentum

Momentum accelerates SGD in relevant directions, reducing oscillations and stabilizing training.

Update rule:

\[ v_t = \gamma v_{t-1} + \eta \nabla L(w) \]
\[ w = w - v_t \]

where:

  • \(v_t\): velocity,
  • \(\gamma\): momentum factor (commonly 0.9),
  • \(\eta\): learning rate.
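
Translated directly into code, the two update equations look like this (a minimal sketch; the gradient is a fixed placeholder rather than a real mini-batch gradient, and the values of \(\gamma\) and \(\eta\) are illustrative):

import numpy as np

# Momentum update: v_t = gamma * v_{t-1} + eta * grad,  then  w = w - v_t
gamma, eta = 0.9, 0.01
w = np.array([0.5, -1.0])    # current parameters
v = np.zeros_like(w)         # velocity starts at zero

for step in range(3):
    grad = np.array([0.2, -0.4])   # placeholder for the mini-batch gradient of L(w)
    v = gamma * v + eta * grad
    w = w - v
    print(f"step {step}: v = {v}, w = {w}")

Because the velocity accumulates a running average of past gradients, random fluctuations in individual mini-batch gradients partially cancel out, which is where the variance reduction comes from.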

c) Advanced Optimizers

Optimizers like Adam, RMSProp, and AdaGrad adapt per-parameter learning rates using running statistics of past gradients (and squared gradients), which damps the effect of noisy individual updates.

These optimizers combine variance reduction with adaptive learning, improving convergence stability.
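
As a sketch of how these optimizers are selected in Keras (the hyperparameter values shown are illustrative, close to the library defaults but not tuned for any particular model):

import tensorflow as tf

# Adaptive optimizers track statistics of past (squared) gradients to scale updates.
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.001)
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

# Any of these objects can be passed to model.compile(optimizer=...),
# as the full example in the next section does with 'adam'.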


4️⃣ Example: Using Mini-Batching and Adam

import tensorflow as tf

# Load MNIST as example data (an assumption: the original snippet leaves
# x_train/y_train undefined, but the 784-input / 10-class model matches MNIST).
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)

# Example model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile with Adam optimizer (variance reduction benefits)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train with mini-batch size of 64
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_split=0.1)

Conclusion

✅ Variance in SGD affects training stability and speed.
✅ Variance reduction techniques like mini-batching, momentum, and advanced optimizers help stabilize and speed up training.
✅ Understanding and applying these will improve your model training efficiency and reliability.


What’s Next?

✅ Try training a model with and without momentum to observe differences.
✅ Explore optimizers like Adam and RMSProp for your projects.
✅ Continue structured learning on superml.org for deeper optimization techniques.


Join the SuperML Community to discuss variance reduction strategies and optimize your learning pipeline.


Happy Optimizing! 🎯
