Residuals and Normalizations Combined
Understanding residual connections and normalization techniques
Introduction
As networks grow deeper, training becomes more challenging due to vanishing gradients and unstable activation distributions across layers.
Residual connections and normalization techniques address these issues, allowing deep networks to train effectively and improve performance.
1. The Vanishing Gradient Problem
In deep networks:
- Gradients can shrink toward zero as they are propagated backward through many layers.
- Layers close to the input receive tiny updates and learn very slowly.
- This makes it hard to train deep architectures effectively, as the sketch below illustrates.
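The following is a minimal sketch (not part of the course code; the exact numbers depend on the random initialization): it builds a plain stack of sigmoid layers with no skip connections or normalization, and compares the gradient norm of the first layer with that of the last.

```python
import tensorflow as tf
from tensorflow.keras import layers

# A plain (no skips, no normalization) stack of sigmoid layers.
model = tf.keras.Sequential(
    [layers.Dense(32, activation='sigmoid') for _ in range(20)]
    + [layers.Dense(1)]
)
model.build(input_shape=(None, 32))

x = tf.random.normal((64, 32))
y = tf.random.normal((64, 1))

with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(model(x) - y))
grads = tape.gradient(loss, model.trainable_variables)

# Compare the gradient norm of the first kernel (closest to the input)
# with the last kernel: the first is typically orders of magnitude smaller.
print("first-layer kernel grad norm:", float(tf.norm(grads[0])))
print("last-layer kernel grad norm:", float(tf.norm(grads[-2])))
```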
2. What are Residual Connections?
Residual connections (skip connections) directly add the input of a block to its output:

\[
y = F(x) + x
\]

where:
- \( x \): the input to the block.
- \( F(x) \): the output after passing through several layers.
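To see why this helps with gradients, differentiate the block output with respect to its input:

\[
\frac{\partial y}{\partial x} = \frac{\partial F(x)}{\partial x} + I
\]

The identity term \( I \) gives gradients a direct path back toward earlier layers, even when \( \partial F(x) / \partial x \) becomes very small.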
Benefits:
- Allow gradients to flow directly through skip paths, mitigating vanishing gradients.
- Enable training of very deep networks (e.g., ResNets with 50+ layers).
- Simplify learning by letting layers learn the residual mapping \( F(x) \) instead of the entire transformation.
3. What is Normalization?
Normalization techniques help stabilize and speed up training by:
- Reducing internal covariate shift (the change in a layer's input distribution during training).
- Smoothing the loss landscape for more stable optimization.
Batch Normalization
Batch Normalization normalizes the output of a layer across the current mini-batch.
Formula:

\[
\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}
\]

where:
- \( \mu \): mean of the mini-batch.
- \( \sigma^2 \): variance of the mini-batch.
- \( \epsilon \): a small constant to prevent division by zero.
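As a quick numerical check of the formula, here is a small NumPy sketch with made-up values (in a Keras BatchNormalization layer, a learnable scale and shift are additionally applied after this normalization step):

```python
import numpy as np

# Made-up mini-batch: 4 examples, 3 features.
x = np.array([[1.0,  2.0,  3.0],
              [2.0,  4.0,  6.0],
              [3.0,  6.0,  9.0],
              [4.0,  8.0, 12.0]])

eps = 1e-5                        # small constant to avoid division by zero
mu = x.mean(axis=0)               # per-feature mean over the batch
var = x.var(axis=0)               # per-feature variance over the batch
x_hat = (x - mu) / np.sqrt(var + eps)

print(x_hat.mean(axis=0))         # approximately 0 for each feature
print(x_hat.std(axis=0))          # approximately 1 for each feature
```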
Benefits:
- Allows higher learning rates.
- Acts as a form of regularization.
- Reduces the need for careful weight initialization.
4. Example: Using Residual Connections and Batch Normalization
TensorFlow Example
```python
import tensorflow as tf
from tensorflow.keras import layers

# Residual block with batch normalization
# (Dense -> BatchNorm -> activation ordering, matching the tips below).
def residual_block(x, units):
    shortcut = x                        # save the block input for the skip path
    x = layers.Dense(units)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dense(units)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Add()([x, shortcut])     # skip connection: y = F(x) + x
    x = layers.Activation('relu')(x)
    return x

inputs = tf.keras.Input(shape=(128,))
x = residual_block(inputs, 128)         # units must match the input width for Add
x = layers.Dense(10, activation='softmax')(x)
model = tf.keras.Model(inputs=inputs, outputs=x)
```
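As a quick sanity check (illustrative only, with random placeholder data and arbitrary training settings), the model above can be compiled and fit for a single epoch:

```python
# Illustrative only: random inputs and integer labels in place of a real dataset.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

x_train = tf.random.normal((256, 128))
y_train = tf.random.uniform((256,), maxval=10, dtype=tf.int32)

model.fit(x_train, y_train, epochs=1, batch_size=32)
```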
5. Practical Tips
- Use batch normalization after dense/convolutional layers and before activation functions.
- Add residual connections around blocks of layers for deeper models.
- Combine both techniques for stability and faster convergence in deep architectures (see the sketch after this list).
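For example, a deeper model could be sketched by stacking several residual blocks (reusing the residual_block function and imports from the example above; the depth of 4 here is arbitrary):

```python
# Stack several residual blocks; each block normalizes its layers and
# adds a skip connection around them.
inputs = tf.keras.Input(shape=(128,))
x = inputs
for _ in range(4):                  # arbitrary depth for illustration
    x = residual_block(x, 128)      # reuses residual_block from the example above
outputs = layers.Dense(10, activation='softmax')(x)
deep_model = tf.keras.Model(inputs=inputs, outputs=outputs)
deep_model.summary()
```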
Conclusion
- Residual connections and normalization techniques are critical for building deep, trainable, and effective networks.
- They help mitigate vanishing gradients and improve convergence.
What's Next?
- Experiment with adding residual blocks and batch normalization in your projects.
- Explore deeper architectures like ResNet and DenseNet.
- Continue your structured learning on superml.org.

Join the SuperML Community to share your experiments and learn collaboratively.
Happy Learning!