Stochastic Gradient Descent
Implementing SGD and understanding its variants
Introduction
Stochastic Gradient Descent (SGD) is one of the most important optimization algorithms used in deep learning to train neural networks.
SGD allows models to learn by updating weights efficiently using small, random subsets of data rather than the entire dataset.
1️⃣ What is Gradient Descent?
Gradient Descent is an algorithm used to minimize the loss function by iteratively adjusting the model's weights in the direction of the negative gradient.
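To make this concrete, here is a minimal sketch (plain Python written for this explanation, not part of any library) of a few gradient descent steps on the simple loss L(w) = w², whose gradient is 2w:

# Gradient descent on L(w) = w**2, which has gradient dL/dw = 2*w
w = 5.0                 # initial weight
learning_rate = 0.1
for step in range(3):
    gradient = 2 * w                      # gradient of the loss at the current weight
    w = w - learning_rate * gradient      # step in the direction of the negative gradient
    print(step, w)                        # prints 0 4.0, 1 3.2, 2 2.56 -- w moves toward the minimum at 0

Each step shrinks the loss because the weight moves opposite to the gradient; real networks do the same thing, just with many weights at once.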
2️⃣ What is Stochastic Gradient Descent?
In Batch Gradient Descent:
- Gradients are computed using the entire dataset in each iteration.
- Computationally expensive for large datasets.
In Stochastic Gradient Descent (SGD):
- Gradients are computed using only one randomly selected data point at each iteration.
- Weights are updated more frequently, leading to faster, noisier convergence.
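As a minimal illustration (a NumPy sketch written for this lesson summary, not code from the original course), one run of SGD on a toy one-parameter linear regression picks a single random sample per update:

import numpy as np

# Toy data generated from y = 3x plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3 * x + 0.1 * rng.normal(size=100)

w = 0.0
learning_rate = 0.01
for step in range(500):
    i = rng.integers(len(x))               # one randomly selected data point
    grad = 2 * (w * x[i] - y[i]) * x[i]    # gradient of the squared error for that point
    w -= learning_rate * grad              # frequent, noisy update
print(w)                                   # ends up close to the true slope of 3

The loss bounces around from step to step, but on average the weight moves toward the value a full-batch computation would find.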
3️⃣ Mini-Batch Gradient Descent
A middle ground:
- Uses small batches (e.g., 32 or 64 samples) instead of the whole dataset or a single data point.
- Balances computational efficiency and stability.
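Continuing the same toy example (again an illustrative sketch, with an arbitrary batch size of 32), a mini-batch update averages the gradient over a small random subset instead of using one point or the whole dataset:

import numpy as np

# Same toy data as above: y = 3x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3 * x + 0.1 * rng.normal(size=100)

w = 0.0
learning_rate = 0.01
batch_size = 32
for step in range(200):
    idx = rng.choice(len(x), size=batch_size, replace=False)   # random mini-batch of indices
    grad = np.mean(2 * (w * x[idx] - y[idx]) * x[idx])         # average gradient over the batch
    w -= learning_rate * grad                                   # smoother update than single-sample SGD
print(w)                                                        # again close to 3, with less step-to-step noise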
4️⃣ Mathematical Overview
SGD Update Rule:
\[ w = w - \eta \cdot \nabla L(w; x_i, y_i) \]
where:
- \(w\): weights.
- \(\eta\): learning rate.
- \(\nabla L(w; x_i, y_i)\): gradient of the loss with respect to the weights, computed for a single data point \((x_i, y_i)\).
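For example, with a current weight \(w = 0.5\), learning rate \(\eta = 0.1\), and a gradient of \(2.0\) at the sampled point, the update gives \(w = 0.5 - 0.1 \cdot 2.0 = 0.3\).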
5️⃣ Benefits of SGD
- Faster updates, leading to quicker progress in early training stages.
- Can escape local minima due to noisy updates.
- Handles large datasets efficiently.
6️⃣ Limitations of SGD
- Noisy updates can lead to fluctuating loss values.
- Requires careful tuning of the learning rate.
- May require additional techniques like momentum to stabilize learning.
7️⃣ Using SGD in TensorFlow/Keras
import tensorflow as tf

# Create a simple model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax')
])

# Compile using SGD
sgd_optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=sgd_optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
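Training then proceeds with model.fit as usual; for example (assuming x_train and y_train are preprocessed inputs such as flattened 28×28 images and one-hot labels, which this lesson does not define):

# Each epoch runs many SGD updates, one per mini-batch of 32 samples
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.1)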
Tips for Using SGD Effectively
- Experiment with different learning rates.
- Use learning rate schedules to decay the learning rate over time (see the schedule sketch after the momentum example below).
- Consider using SGD with momentum for smoother convergence:
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
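For the learning-rate-schedule tip above, one option is Keras's built-in ExponentialDecay schedule; the decay values here are illustrative, not prescribed by this lesson:

# Multiply the learning rate by 0.9 every 1000 optimizer steps
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,
    decay_rate=0.9
)
sgd_scheduled = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)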
Conclusion
- Stochastic Gradient Descent is a core optimizer in deep learning, enabling efficient training on large datasets.
- Understanding SGD builds your foundation for advanced optimizers like Adam and RMSProp.
- It provides practical knowledge for debugging and improving model training.
What's Next?
- Experiment with SGD on your first models and visualize training loss behavior.
- Learn about advanced optimization techniques (Momentum, Adam) next on superml.org.
Join the SuperML Community to share your experiments and learn collaboratively.
Happy Optimizing! ⚡