📖 Lesson ⏱️ 90 minutes

Stochastic Gradient Descent

Implementing SGD and understanding its variants

Introduction

Stochastic Gradient Descent (SGD) is one of the most important optimization algorithms used in deep learning to train neural networks.

SGD allows models to learn by updating weights using individual randomly chosen samples (or small random subsets) of the data rather than the entire dataset, making each update far cheaper.


1️⃣ What is Gradient Descent?

Gradient Descent is an algorithm used to minimize the loss function by iteratively adjusting the model's weights in the direction of the negative gradient.
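
As a quick, self-contained illustration (not part of the lesson's code), here is a minimal sketch of gradient descent minimizing the toy loss L(w) = (w - 3)², whose gradient is 2(w - 3); the starting weight and learning rate below are arbitrary example values:

# Toy loss L(w) = (w - 3)**2 with gradient dL/dw = 2*(w - 3).
# Starting weight and learning rate are arbitrary example values.
w = 0.0      # initial weight
eta = 0.1    # learning rate
for step in range(50):
    grad = 2 * (w - 3)    # gradient of the loss at the current weight
    w = w - eta * grad    # move in the direction of the negative gradient
print(w)     # converges toward 3, the minimizer of the loss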


2️⃣ What is Stochastic Gradient Descent?

In Batch Gradient Descent:

✅ Gradients are computed using the entire dataset in each iteration.
✅ Computationally expensive for large datasets.

In Stochastic Gradient Descent (SGD):

✅ Gradients are computed using only one randomly selected data point at each iteration.
✅ Weights are updated more frequently, leading to faster but noisier convergence.
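
To make the contrast concrete, here is a minimal NumPy sketch for a toy linear model with mean-squared-error loss; the data, shapes, and model are invented purely for illustration:

import numpy as np

# Toy linear model: predictions are X @ w, loss is mean squared error.
# Data and shapes are invented purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # 1000 samples, 5 features
y = rng.normal(size=1000)
w = np.zeros(5)

# Batch gradient descent: one gradient computed from ALL 1000 samples.
grad_batch = 2 * X.T @ (X @ w - y) / len(X)

# Stochastic gradient descent: one gradient from a SINGLE random sample.
i = rng.integers(len(X))
grad_single = 2 * (X[i] @ w - y[i]) * X[i]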


3️⃣ Mini-Batch Gradient Descent

A middle ground:

✅ Uses small batches (e.g., 32 or 64 samples) instead of the whole dataset or a single data point.
✅ Balances computational efficiency and stability.
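
One common way to form such mini-batches in TensorFlow is with the tf.data API; the array shapes, values, and batch size below are arbitrary example choices:

import numpy as np
import tensorflow as tf

# Placeholder data: shapes and values are arbitrary for illustration.
X = np.random.rand(1000, 784).astype("float32")
y = np.random.randint(0, 10, size=1000)

# Shuffle the samples and group them into mini-batches of 32,
# so each gradient update sees a small random subset of the data.
dataset = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(1000).batch(32)

for batch_x, batch_y in dataset:
    # each (batch_x, batch_y) pair would drive one weight update
    pass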


4️⃣ Mathematical Overview

SGD Update Rule:

\[ w = w - \eta \cdot \nabla L(w; x_i, y_i) \]

where:

  • \(w\): the model weights.
  • \(\eta\): the learning rate.
  • \(\nabla L(w; x_i, y_i)\): the gradient of the loss with respect to the weights, computed for a single data point \((x_i, y_i)\).
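
To tie the symbols to code, here is a minimal NumPy sketch of one SGD update on a single data point \((x_i, y_i)\), assuming a linear model with squared-error loss; the loss choice and all numeric values are assumptions made for illustration:

import numpy as np

# One SGD step for a linear model with squared-error loss
# L(w; x_i, y_i) = (x_i @ w - y_i)**2. All values are illustrative only.
w = np.array([0.5, -0.2, 0.1])    # current weights
eta = 0.01                        # learning rate (eta)
x_i = np.array([1.0, 2.0, -1.0])  # one randomly selected input
y_i = 0.7                         # its target

grad = 2 * (x_i @ w - y_i) * x_i  # gradient of the loss for (x_i, y_i)
w = w - eta * grad                # w <- w - eta * grad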

5️⃣ Benefits of SGD

✅ Faster updates, leading to quicker progress in early training stages.
✅ Can escape local minima due to noisy updates.
✅ Handles large datasets efficiently.


6️⃣ Limitations of SGD

✅ Noisy updates can lead to fluctuating loss values.
✅ Requires careful tuning of the learning rate.
✅ May require additional techniques like momentum to stabilize learning.


7️⃣ Using SGD in TensorFlow/Keras

import tensorflow as tf

# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile using SGD
sgd_optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=sgd_optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
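
Continuing from the snippet above, training could then look like this; the random arrays are placeholders standing in for real data, and the epoch count and batch size are arbitrary example choices:

import numpy as np

# Placeholder data: 1000 flattened 28x28 "images" and one-hot labels.
x_train = np.random.rand(1000, 784).astype("float32")
y_train = tf.keras.utils.to_categorical(
    np.random.randint(0, 10, size=1000), num_classes=10)

# batch_size controls how many samples contribute to each SGD update.
model.fit(x_train, y_train, epochs=5, batch_size=32)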

Tips for Using SGD Effectively

✅ Experiment with different learning rates.
✅ Use learning rate schedules to decay the learning rate over time (see the sketch after the momentum example below).
✅ Consider using SGD with momentum for smoother convergence:

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
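
For the learning rate schedule tip above, one possible sketch (reusing the tensorflow import from the earlier snippet) is Keras's built-in ExponentialDecay schedule; the initial rate, decay steps, and decay rate below are arbitrary example values:

# Decay the learning rate by a factor of 0.96 every 1000 steps (example values).
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,
    decay_rate=0.96)

sgd_scheduled = tf.keras.optimizers.SGD(learning_rate=lr_schedule)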

Conclusion

✅ Stochastic Gradient Descent is a core optimizer in deep learning, enabling efficient training on large datasets.
✅ Understanding SGD builds your foundation for advanced optimizers like Adam and RMSProp.
✅ It provides practical knowledge for debugging and improving model training.


What's Next?

✅ Experiment with SGD on your first models and visualize training loss behavior.
✅ Learn about advanced optimization techniques (Momentum, Adam) next on superml.org.


Join the SuperML Community to share your experiments and learn collaboratively.


Happy Optimizing! ⚡