2-Stage Backpropagation in Python

A practical, step-by-step tutorial explaining 2-Stage Backpropagation with PyTorch code examples for better convergence and generalization in training neural networks.

⚡ intermediate
⏱️ 25 minutes
👤 SuperML Team


📋 Prerequisites

  • Python
  • PyTorch Basics
  • Neural Networks

🎯 What You'll Learn

  • Understand the concept of 2-Stage Backpropagation
  • Implement 2-Stage Backpropagation in PyTorch
  • Learn to improve convergence and generalization in neural networks

Looking to improve your model’s performance and generalization? 🚀 This tutorial will guide you through 2-Stage Backpropagation in a practical, step-by-step manner using PyTorch, helping you understand and implement this advanced technique with clarity.

🔹 What is 2-Stage Backpropagation?

2-Stage Backpropagation is a training strategy for neural networks that can improve both convergence and generalization. The idea is simple: split your network into two parts, first train the feature-extraction stage, then train the classification stage on top of those (frozen) features, and finally fine-tune the whole model end to end.

import torch
import torch.nn as nn

class TwoStageNetwork(nn.Module):
    def __init__(self):
        super(TwoStageNetwork, self).__init__()
        self.stage1 = nn.Sequential(
            nn.Linear(784, 256),
            nn.ReLU(),
            nn.Linear(256, 128)
        )
        self.stage2 = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 10)
        )

    def forward(self, x):
        x = self.stage1(x)
        x = self.stage2(x)
        return x
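
As a quick sanity check, assuming MNIST-style inputs flattened to 784 features, you can confirm that the two stages chain together as expected:

model = TwoStageNetwork()
dummy_batch = torch.randn(32, 784)    # a batch of 32 flattened 28x28 images
logits = model(dummy_batch)
print(logits.shape)                   # torch.Size([32, 10])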

🔹 Stage 1: Learning Robust Features

The first stage is all about helping your network discover meaningful, robust features from your data. This part usually involves convolutional or dense layers that extract useful patterns and representations. In the sketch below, Stage 1 is pretrained with a simple self-supervised reconstruction objective: a small auxiliary decoder tries to rebuild the input from the 128-dimensional features, so the criterion you pass in should be a reconstruction loss such as nn.MSELoss().

def train_stage1(model, decoder, dataloader, optimizer, criterion, epochs):
    # Pretrain Stage 1 with a simple self-supervised reconstruction objective:
    # an auxiliary decoder (e.g. nn.Linear(128, 784)) tries to rebuild the input
    # from the Stage 1 features, and criterion is a reconstruction loss such as
    # nn.MSELoss(). The optimizer should cover both stage1 and decoder parameters.
    for epoch in range(epochs):
        for inputs, _ in dataloader:
            optimizer.zero_grad()
            features = model.stage1(inputs)
            reconstruction = decoder(features)
            loss = criterion(reconstruction, inputs)
            loss.backward()
            optimizer.step()

🔹 Stage 2: Final Classification

Now, the features your model learned in Stage 1 are put to work! In this stage, the network uses those features to perform the main classification task—typically using fully connected layers to make predictions.

def train_stage2(model, dataloader, optimizer, criterion, epochs):
    for epoch in range(epochs):
        for inputs, labels in dataloader:
            optimizer.zero_grad()
            # detach() stops gradients from flowing back into Stage 1,
            # so only the classification head is updated here
            features = model.stage1(inputs).detach()
            outputs = model.stage2(features)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

🔹 Freezing Learned Features

When training Stage 2, it’s helpful to “freeze” the parameters from Stage 1. This locks in the learned features and lets the classification layers do the learning. Note that train_stage2 above already detaches the Stage 1 output, so freezing mainly matters when you construct the optimizer or run the full model end to end.

def freeze_stage1(model):
    for param in model.stage1.parameters():
        param.requires_grad = False

def unfreeze_stage1(model):
    for param in model.stage1.parameters():
        param.requires_grad = True
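
As a small usage sketch (assuming a model instance and a learning rate of your choosing), freezing Stage 1 pairs naturally with building the Stage 2 optimizer over only the parameters that still require gradients:

freeze_stage1(model)

# Only the Stage 2 parameters remain trainable, so the optimizer skips Stage 1 entirely
stage2_optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3,
)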

🔹 Fine-Tuning for Best Performance

Once both stages are trained, it’s time for the magic touch: fine-tune the entire network together! Unfreeze Stage 1 and train all layers jointly, typically with a smaller learning rate so the pretrained features are refined rather than overwritten.

def finetune(model, dataloader, optimizer, criterion, epochs):
    unfreeze_stage1(model)
    for epoch in range(epochs):
        for inputs, labels in dataloader:
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
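
Putting the pieces together, here is a minimal end-to-end sketch of the three phases. The auxiliary decoder, the train_loader, the epoch counts, and the learning rates are illustrative assumptions rather than a fixed recipe:

model = TwoStageNetwork()
decoder = nn.Linear(128, 784)  # auxiliary decoder used only for Stage 1 pretraining

# Phase 1: self-supervised feature pretraining
stage1_optimizer = torch.optim.Adam(
    list(model.stage1.parameters()) + list(decoder.parameters()), lr=1e-3
)
train_stage1(model, decoder, train_loader, stage1_optimizer, nn.MSELoss(), epochs=5)

# Phase 2: train the classifier on frozen features
freeze_stage1(model)
stage2_optimizer = torch.optim.Adam(model.stage2.parameters(), lr=1e-3)
train_stage2(model, train_loader, stage2_optimizer, nn.CrossEntropyLoss(), epochs=5)

# Phase 3: fine-tune everything with a smaller learning rate
finetune_optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
finetune(model, train_loader, finetune_optimizer, nn.CrossEntropyLoss(), epochs=3)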

🔹 Using Learning Rate Schedulers

Smart learning rate scheduling can make a big difference. Adjusting the learning rate at the right time helps your model converge faster and more reliably—let’s see how to set that up.

from torch.optim.lr_scheduler import StepLR

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = StepLR(optimizer, step_size=5, gamma=0.1)  # multiply the LR by 0.1 every 5 epochs

for epoch in range(epochs):
    train_epoch(model, dataloader, optimizer, criterion)  # train_epoch stands in for one pass of any training loop above
    scheduler.step()

🔹 Boosting Stage 1 with Data Augmentation

Want even better features? Try data augmentation during Stage 1! By transforming your input data in creative ways, your model learns to generalize and handle real-world variations.

from torch.utils.data import DataLoader
from torchvision import transforms

# Single-channel (MNIST-style) images, so we stick to brightness/contrast jitter
stage1_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])

stage1_dataset = CustomDataset(transform=stage1_transforms)  # CustomDataset is a placeholder for your own Dataset
stage1_dataloader = DataLoader(stage1_dataset, batch_size=32, shuffle=True)

🔹 Stabilizing Training with Gradient Clipping

Ever had your training go haywire? Gradient clipping is a simple trick to keep your training stable—especially useful during fine-tuning when gradients can get out of control.

def train_with_gradient_clipping(model, dataloader, optimizer, criterion, clip_value):
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        # Rescale gradients so their global norm does not exceed clip_value
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)
        optimizer.step()

🔹 Avoiding Overfitting with Early Stopping

Don’t let your model overtrain! Early stopping helps you halt training at just the right moment, preventing overfitting and saving you time.

class EarlyStopping:
    def __init__(self, patience=5, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.best_loss = float('inf')

    def __call__(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
        else:
            self.counter += 1
            if self.counter >= self.patience:
                return True
        return False
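
Here’s a minimal sketch of how you might wire this into the training flow, assuming a val_loader for validation data, a max_epochs upper bound, and reusing evaluate_model, which is defined later in this tutorial:

early_stopping = EarlyStopping(patience=5, min_delta=1e-4)

for epoch in range(max_epochs):
    finetune(model, train_loader, optimizer, criterion, epochs=1)  # one training pass per outer epoch
    _, val_loss = evaluate_model(model, val_loader, criterion)     # evaluate_model is defined below
    if early_stopping(val_loss):
        print(f"Stopping early after epoch {epoch + 1}")
        break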

🔹 Understanding What Your Network Learns

Curious about what your network is actually “seeing”? Visualizing feature maps gives you a peek into the inner workings of Stage 1, showing the features your model has learned.

import matplotlib.pyplot as plt

def visualize_feature_maps(model, input_image):
    # Assumes a convolutional Stage 1 whose output has shape (1, C, H, W);
    # the fully connected TwoStageNetwork above returns a flat vector, so use
    # this with a model like the ResNet-based network shown later.
    model.eval()
    with torch.no_grad():
        features = model.stage1(input_image.unsqueeze(0))

    fig, axes = plt.subplots(4, 4, figsize=(12, 12))
    for i, ax in enumerate(axes.flat):
        if i < features.shape[1]:
            ax.imshow(features[0, i].cpu().numpy(), cmap='viridis')
            ax.axis('off')
    plt.tight_layout()
    plt.show()

🔹 Track Your Training Progress

Stay on top of your training! Monitoring your model’s progress (for example, with TensorBoard) helps you spot issues early and keep your experiments organized.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs/experiment_1')

def train_with_monitoring(model, dataloader, optimizer, criterion, epoch):
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(dataloader):
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item()
        if i % 100 == 99:
            writer.add_scalar('training loss',
                              running_loss / 100,
                              epoch * len(dataloader) + i)
            running_loss = 0.0
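
While training runs, you can launch the dashboard from a terminal with tensorboard --logdir=runs and watch the loss curve update live.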

🔹 Using Transfer Learning with 2-Stage Backprop

Want to leverage pre-trained models? You can combine 2-Stage Backpropagation with transfer learning—using a powerful, pre-trained feature extractor as your Stage 1.

import torchvision.models as models

class TransferTwoStageNetwork(nn.Module):
    def __init__(self, num_classes):
        super(TransferTwoStageNetwork, self).__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained=True is deprecated in recent torchvision
        self.stage1 = nn.Sequential(*list(resnet.children())[:-1])
        self.stage2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.stage1(x)
        x = x.view(x.size(0), -1)
        x = self.stage2(x)
        return x
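
A short usage sketch, assuming a train_loader of 3-channel images and reusing the freezing helper from earlier; once the classifier converges, you can call unfreeze_stage1 and fine-tune with a smaller learning rate:

model = TransferTwoStageNetwork(num_classes=10)
freeze_stage1(model)  # keep the pre-trained ResNet features fixed at first

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.stage2.parameters(), lr=1e-3)

for inputs, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()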

🔹 Handling Class Imbalance

Imbalanced datasets? No problem! With a few tweaks to your loss function, you can help your model learn fairly from all classes—even the rare ones.

def calculate_class_weights(dataset, num_classes):
    # Weight each class by the inverse of its frequency so rare classes count more
    class_counts = torch.zeros(num_classes)
    for _, label in dataset:
        class_counts[label] += 1
    return 1.0 / class_counts

class_weights = calculate_class_weights(train_dataset, num_classes=10)
criterion = nn.CrossEntropyLoss(weight=class_weights)

def train_with_weighted_loss(model, dataloader, optimizer, criterion):
    for inputs, labels in dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

🔹 Evaluating Your Model

After all that work, it’s time to see how your model performs. Let’s evaluate accuracy and loss on a test set to get a clear picture of its real-world power.

def evaluate_model(model, test_loader, criterion):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0
    
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            total_loss += loss.item()
            
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    
    accuracy = 100 * correct / total
    average_loss = total_loss / len(test_loader)
    
    return accuracy, average_loss
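
For example, assuming a test_loader built from your held-out test set:

test_accuracy, test_loss = evaluate_model(model, test_loader, criterion)
print(f"Test accuracy: {test_accuracy:.2f}% | Test loss: {test_loss:.4f}")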

🔹 Want to Dive Deeper?

Ready to keep learning? Check out these resources to explore 2-Stage Backpropagation and related deep learning techniques:

  1. “Deep Learning” by Goodfellow, Bengio, and Courville - Available at: https://www.deeplearningbook.org/
  2. “Curriculum Learning” by Bengio et al. (2009) - ArXiv: https://arxiv.org/abs/0904.2425
  3. “Progressive Neural Networks” by Rusu et al. (2016) - ArXiv: https://arxiv.org/abs/1606.04671
  4. “An Overview of Multi-Task Learning in Deep Neural Networks” by Ruder (2017) - ArXiv: https://arxiv.org/abs/1706.05098