Training Loop, Hyperparameters, and Debugging

The Training Loop Is Where Things Go Wrong

You have prepared a clean dataset, loaded your model with QLoRA, configured your LoRA adapter. Now you run training, and the loss curve does one of three things:

1. It decreases smoothly and levels off. You got lucky, or you have done this before.
2. It explodes: NaN values, or a spike to infinity in the first few steps. The learning rate is too high.
3. It barely moves: flat or very slowly declining. The learning rate is too low, or the data has a problem.

Most practitioners spend significant time in scenarios 2 and 3 before reaching scenario 1. This lesson is about understanding why each failure mode happens and having a systematic approach to diagnosing and fixing them.

The Key Hyperparameters

Learning Rate

The learning rate controls how large each gradient update step is. Too large, and the optimizer overshoots the loss minimum. Too small, and training converges so slowly that you give up or run out of compute budget.

For LoRA fine-tuning, the empirically validated starting range is 1e-4 to 3e-4, with 2e-4 being the most common choice. This is significantly higher than the learning rates used for full fine-tuning (typically 1e-5 to 5e-5), because LoRA has far fewer parameters and each step needs to carry more signal.

For full fine-tuning (when your hardware allows), start at 2e-5.

The learning rate schedule matters as much as the initial value. The most common choice for fine-tuning is a cosine decay with warmup:

Step 0 to warmup_steps: LR increases linearly from 0 to max_lr
Step warmup_steps to total_steps: LR decays following a cosine curve to ~0
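The two phases above can be sketched as a single function. This is a minimal illustration rather than the library's own scheduler; the name `lr_at_step` and the default values (6 warmup steps, 188 total steps) are assumptions chosen for the example.

```python
import math

def lr_at_step(step, max_lr=2e-4, warmup_steps=6, total_steps=188):
    """Linear warmup from 0 to max_lr, then cosine decay towards 0."""
    if step < warmup_steps:
        # Warmup phase: LR grows linearly with the step index
        return max_lr * step / max(1, warmup_steps)
    # Decay phase: progress runs from 0.0 (end of warmup) to 1.0 (last step)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Print the LR at a few points along the run
for step in (0, 3, 6, 97, 188):
    print(step, f"{lr_at_step(step):.2e}")
```

The learning rate peaks exactly when warmup ends, sits at half the peak midway through the decay phase, and reaches zero on the final step.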

# Loss curves for different learning rates (conceptual):
#
# lr = 3e-3 (too high):
# Step:  0   50  100  150  200
# Loss:  2.1  NaN  ---  ---  ---
#
# lr = 2e-4 (good):
# Step:  0   50  100  150  200
# Loss:  2.1  1.6  1.2  0.95  0.82
#
# lr = 1e-6 (too low):
# Step:  0   50  100  150  200
# Loss:  2.1  2.08  2.06  2.04  2.02  # barely moving

Batch Size and Gradient Accumulation

The effective batch size is the number of training examples the model sees before each parameter update. Larger effective batch sizes:

Produce smoother gradient estimates (less noise)
Allow higher learning rates
Are more compute-efficient (fewer optimizer updates per example)

But a larger batch size requires more GPU memory per step, because you must store activations for all examples in the batch simultaneously.

Gradient accumulation solves this: instead of processing 16 examples in one step, you process 4 examples in each of 4 consecutive forward/backward passes, accumulate the gradients, then update the weights once. With equal-sized micro-batches and a loss that is averaged consistently across them, the result is mathematically identical to a single step with 16 examples (ignoring batch normalization, which is not used in LLMs).
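The equivalence can be checked on a toy problem. The sketch below uses a hypothetical one-parameter model `y = w * x` with mean squared error, and made-up data; it assumes equal-sized micro-batches, which is the case where the two gradients match exactly.

```python
def mean_grad(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# 16 made-up (x, y) pairs
data = [(float(i), 3.0 * i + 1.0) for i in range(16)]
w = 0.5

# One step over all 16 examples
full = mean_grad(w, data)

# Four micro-batches of 4: average the per-micro-batch gradients
micro_batches = [data[i:i + 4] for i in range(0, 16, 4)]
accumulated = 0.0
for mb in micro_batches:
    accumulated += mean_grad(w, mb) / len(micro_batches)

print(full, accumulated)
```

Dividing each micro-batch gradient by the number of accumulation steps is what keeps the accumulated value an average rather than a sum.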

# These two configurations produce the same effective batch size:

# Option A: direct batch size (requires 16x memory for activations)
TrainingArguments(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
)

# Option B: gradient accumulation (4x memory for activations, 4 forward/backward passes per optimizer update)
TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)

# Effective batch size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus
# Both: 4 × 4 × 1 = 16  or  16 × 1 × 1 = 16

On a single 16 GB T4 with QLoRA, per_device_train_batch_size=4 and gradient_accumulation_steps=4 (effective=16) is a reliable starting configuration.
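If you fix the effective batch size first and let memory decide the per-device batch size, the accumulation steps follow from the formula above. A small sketch; the helper name `accumulation_steps` is hypothetical.

```python
def accumulation_steps(target_effective, per_device, num_gpus=1):
    """Return the gradient_accumulation_steps needed to reach a target effective batch size."""
    per_step = per_device * num_gpus
    if target_effective % per_step != 0:
        raise ValueError(
            f"target {target_effective} is not a multiple of {per_step} examples per step"
        )
    return target_effective // per_step

print(accumulation_steps(16, per_device=4))              # single GPU, batch 4
print(accumulation_steps(64, per_device=4, num_gpus=2))  # two GPUs, batch 4 each
```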

Epochs and Steps

For instruction-tuning datasets of 500–5000 examples, 1–3 epochs is usually sufficient. More than 3 epochs almost always leads to overfitting — the model memorizes training examples and its responses on held-out data degrade.

A useful rule of thumb: total training steps = (num_examples / effective_batch_size) × num_epochs. For 1000 examples, batch size 16, 3 epochs: 1000/16 × 3 = ~188 steps. That is a 5-minute training run on a T4.

# Calculate expected training steps before starting:
import math

num_examples = 1000
effective_batch_size = 16  # per_device × gradient_accumulation × num_gpus
num_epochs = 3
# Approximation: the Trainer's exact count depends on how it handles the last partial batch
total_steps = math.ceil(num_examples * num_epochs / effective_batch_size)
print(f"Expected total steps: {total_steps}")  # 188

For larger datasets (50,000+ examples), a single epoch may be sufficient. The model sees each example once but the total gradient signal is strong because of the dataset size.

Warmup Steps

Warmup addresses a specific training instability: at the start of training, the LoRA matrices are freshly initialized (A with small random values, B at zero). The gradients in the first few steps are noisy and potentially large. If you apply the full learning rate immediately, these early steps can drive the parameters into a bad region from which recovery is slow.

Warmup gradually increases the learning rate from near-zero to the target value over the first few percent of training steps, smoothing out this initial instability.

TrainingArguments(
    warmup_ratio=0.03,    # warmup for 3% of total steps
    # or (if both are set, warmup_steps takes precedence):
    warmup_steps=50,      # fixed number of warmup steps
)

For 188 total steps, 3% warmup = 5–6 steps. For longer runs of 1,000+ steps, 50–100 warmup steps is typical.
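The conversion from ratio to steps is a rounded-up multiplication. A short sketch of the arithmetic; the helper name is hypothetical.

```python
import math

def warmup_steps_from_ratio(total_steps, warmup_ratio):
    """Number of warmup steps implied by a warmup ratio, rounded up."""
    return math.ceil(total_steps * warmup_ratio)

print(warmup_steps_from_ratio(188, 0.03))
```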

Complete TrainingArguments Configuration

from transformers import TrainingArguments

training_args = TrainingArguments(
    # Output
    output_dir="./qlora-output",
    
    # Training schedule
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,          # effective batch = 16
    
    # Learning rate
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    
    # Optimization
    optim="paged_adamw_32bit",
    weight_decay=0.001,                     # decoupled weight decay (AdamW), similar in effect to L2 regularization
    max_grad_norm=0.3,                      # gradient clipping — prevents explosions
    
    # Memory efficiency
    gradient_checkpointing=True,            # trade compute for activation memory
    fp16=False,                             # on pre-Ampere GPUs (T4, V100): fp16=True, bf16=False
    bf16=True,                              # bfloat16: more numerically stable than fp16
    
    # Saving and logging
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,                     # keep only 3 checkpoints to save disk
    load_best_model_at_end=True,
    
    # Evaluation
    evaluation_strategy="steps",            # named eval_strategy in newer transformers releases
    eval_steps=50,
    
    # Reporting
    report_to="tensorboard",
    run_name="mistral-7b-lora-v1",
)

Memory-Saving Techniques

Gradient Checkpointing

During the forward pass, PyTorch stores all intermediate activations (the values computed at each layer) because they are needed during backpropagation. For a 7B model with sequence length 512, this can consume 5–15 GB of activation memory.

Gradient checkpointing discards these intermediate activations during the forward pass and recomputes them on-the-fly during backpropagation. The cost is roughly 33% more compute, but activation memory drops dramatically.

model.gradient_checkpointing_enable()
# or via TrainingArguments:
training_args = TrainingArguments(
    gradient_checkpointing=True,
    # ...remaining arguments
)

Always enable gradient checkpointing when training on consumer hardware.
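To make the trade-off concrete, here is a toy sketch of the idea on a chain of scalar `tanh` layers. It is not how PyTorch implements checkpointing; the layer function, the segment size of 4, and the weights are all assumptions for illustration. Both versions compute the same gradient, but the checkpointed one keeps fewer activations alive at once and pays for it by re-running each segment's forward pass.

```python
import math

def layer(x, w):
    return math.tanh(w * x)

def layer_grad(x, w):
    """Derivative of layer(x, w) with respect to its input x."""
    return w * (1.0 - math.tanh(w * x) ** 2)

def grad_store_all(x0, weights):
    """Standard backprop: keep every activation from the forward pass."""
    acts = [x0]
    for w in weights:
        acts.append(layer(acts[-1], w))
    g = 1.0
    for x, w in zip(reversed(acts[:-1]), reversed(weights)):
        g *= layer_grad(x, w)
    return g, len(acts)  # gradient, number of stored activations

def grad_checkpointed(x0, weights, segment=4):
    """Keep only segment inputs; recompute the rest during the backward pass."""
    checkpoints = {}
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # save the input to each segment
        x = layer(x, w)
    g = 1.0
    peak = len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg_w = weights[start:start + segment]
        acts = [checkpoints[start]]
        for w in seg_w[:-1]:
            acts.append(layer(acts[-1], w))  # recomputed forward pass
        peak = max(peak, len(checkpoints) + len(acts))
        for x_in, w in zip(reversed(acts), reversed(seg_w)):
            g *= layer_grad(x_in, w)
    return g, peak  # gradient, upper bound on activations held at once

weights = [0.9 + 0.01 * i for i in range(16)]
g_full, stored_full = grad_store_all(0.5, weights)
g_ckpt, stored_ckpt = grad_checkpointed(0.5, weights)
print(g_full, stored_full)
print(g_ckpt, stored_ckpt)
```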

Mixed Precision (BF16)

Training in float32 doubles memory requirements compared to bfloat16. BF16 keeps the same 8-bit exponent as float32, and therefore the same dynamic range (important for stable training), but has only 7 explicit mantissa bits instead of 23. In practice, the precision reduction usually does not affect convergence for fine-tuning.

Use bf16=True if your GPU supports it (Ampere architecture and later: A100, RTX 3090, RTX 4090). Use fp16=True for older GPUs (V100, T4).
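The difference between the two 16-bit formats can be seen with the standard library alone. The sketch below emulates bfloat16 by truncating the float32 encoding (real hardware rounds to nearest, so treat this as an approximation) and uses `struct`'s half-precision format for fp16. The helper names are hypothetical.

```python
import struct

def to_bf16(x):
    """Emulate bfloat16 by keeping the top 16 bits of the float32 encoding.

    Simplification: truncates instead of rounding to nearest.
    """
    top_two_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]

def to_fp16(x):
    """Round-trip through IEEE half precision; returns None if x overflows."""
    try:
        return struct.unpack(">e", struct.pack(">e", x))[0]
    except (OverflowError, struct.error):
        return None

for value in (1.001, 70000.0, 1e38):
    print(value, to_bf16(value), to_fp16(value))
```

Small differences near 1.0 survive in fp16 but are lost in bf16, while large values that bf16 can still represent overflow fp16. That overflow behavior is why fp16 training typically relies on loss scaling.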

Diagnosing Training Failures

Failure 1: Loss Does Not Decrease

Symptom: Training loss stays near its initial value (typically 2.0–2.5) for 50+ steps.

Diagnostic checklist:

# 1. Check if labels are correctly set
print(tokenized_dataset['train'][0]['labels'][:20])
# Should have -100 for masked tokens, actual token IDs for assistant responses
# If all labels are -100, the model has nothing to learn from

# 2. Check if any examples have target tokens
labels = tokenized_dataset['train']['labels']
non_masked = [(i, sum(1 for l in ex if l != -100)) for i, ex in enumerate(labels)]
print(f"Examples with >0 target tokens: {sum(1 for _, n in non_masked if n > 0)}")

# 3. Check learning rate
# Try 10x higher: if loss starts moving, your LR was too low
# Try 10x lower: if loss was exploding, your LR was too high

The most common cause is incorrect label masking: if all your labels are -100 (masked), there is no signal to learn from and loss stays constant.
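One way this happens silently is truncation: if the prompt alone fills the maximum sequence length, the response is cut off and every remaining label is masked. The sketch below shows the effect with made-up token IDs; the helper names and the masking scheme (mask the prompt, train on the response) are assumptions about a typical setup, not the only way to build labels.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def build_labels(prompt_ids, response_ids, max_length):
    """Mask prompt tokens, keep response tokens, then truncate to max_length."""
    input_ids = (prompt_ids + response_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    return input_ids, labels

def count_target_tokens(labels):
    """Number of tokens that actually contribute to the loss."""
    return sum(1 for t in labels if t != IGNORE_INDEX)

prompt = list(range(100, 140))    # 40 made-up prompt token IDs
response = list(range(200, 215))  # 15 made-up response token IDs

_, ok_labels = build_labels(prompt, response, max_length=64)
_, bad_labels = build_labels(prompt, response, max_length=32)

# With max_length=32 the whole response is truncated away
print(count_target_tokens(ok_labels), count_target_tokens(bad_labels))
```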

Failure 2: Loss Explodes (NaN or Rapid Increase)

Symptom: Loss spikes to NaN or 100+ within the first 20 steps.

Fixes, in order of likelihood:

# Fix 1: Lower learning rate
learning_rate = 2e-5  # 10x lower than 2e-4

# Fix 2: Enable gradient clipping
max_grad_norm = 0.3

# Fix 3: Increase warmup steps
warmup_steps = 100  # more steps before full LR

# Fix 4: Check for bad training examples
# A single example with 0-length labels or extremely long text can cause instability
# Filter examples with fewer than 10 target tokens or longer than max_length

Failure 3: Loss Hits 0 (Overfitting)

Symptom: Training loss reaches near-0, but validation loss increases or evaluation outputs are identical to training examples.

# Signs of overfitting:
# - Training loss < 0.1 while validation loss > 1.0
# - Model outputs exact strings from training data
# - Model refuses to generalize to slightly rephrased inputs

# Fixes:
# 1. Add more training data (most reliable fix)
# 2. Reduce epochs (try 1 instead of 3)
# 3. Increase LoRA dropout (try 0.1–0.2)
# 4. Reduce rank (try r=4 instead of r=8)
# 5. Increase weight decay (try weight_decay=0.01)

Reading Loss Curves

# Healthy training loss curve (rough numbers for 3 epochs on 1000 examples):
#
# Epoch 1:  2.1 → 1.4 (sharp initial decrease)
# Epoch 2:  1.4 → 0.95 (slower, steady decrease)  
# Epoch 3:  0.95 → 0.82 (diminishing returns)
#
# Validation loss should track training loss with a 0.1–0.3 gap
# A growing gap between train and val loss = overfitting

# How to view with TensorBoard:
# tensorboard --logdir ./qlora-output/runs
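The "growing gap" rule can also be checked programmatically on logged losses. The sketch below is a simple patience-based check; the function name and the loss values are made up for illustration. The transformers library also provides an `EarlyStoppingCallback` built on a similar patience idea, which may be a better fit inside a real training run.

```python
def best_eval_step(history, patience=2):
    """history: list of (step, train_loss, val_loss) tuples in step order.

    Returns (step with the lowest val loss so far, overfitting flag).
    The flag is True once val loss fails to improve `patience` evals in a row.
    """
    best_step, best_val = history[0][0], history[0][2]
    bad_evals = 0
    for step, _train_loss, val_loss in history[1:]:
        if val_loss < best_val:
            best_step, best_val, bad_evals = step, val_loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, True
    return best_step, False

# Made-up logs: val loss bottoms out at step 150 while train loss keeps falling
overfit_run = [(50, 1.60, 1.70), (100, 1.20, 1.40), (150, 0.80, 1.35),
               (200, 0.40, 1.45), (250, 0.15, 1.60)]
healthy_run = [(50, 1.60, 1.70), (100, 1.20, 1.40), (150, 0.95, 1.20)]

print(best_eval_step(overfit_run))
print(best_eval_step(healthy_run))
```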

A Systematic Hyperparameter Search

When you are unsure what settings to use, run a grid search over the most impactful parameters before committing to a full training run:

import itertools

# Small search space for quick iteration
search_grid = {
    "learning_rate": [1e-4, 2e-4, 5e-4],
    "lora_r": [4, 8, 16],
}

# For each combination, train for 100 steps and record validation loss
for lr, r in itertools.product(search_grid["learning_rate"], search_grid["lora_r"]):
    print(f"Testing lr={lr}, r={r}")
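    # Quick 100-step run
    args = TrainingArguments(
        learning_rate=lr,
        max_steps=100,  # short run for search
        # ...remaining arguments
    )
    # build the LoRA adapter with rank r (r belongs in the LoRA config, not TrainingArguments)
    # run trainer, record val loss at step 100

# after the loop: pick the best combination, then run full training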

In practice, for instruction-tuning with QLoRA, the defaults (lr=2e-4, r=8) work well enough that you rarely need an extensive search. The bigger lever is data quality, not hyperparameter tuning.

Final Checklist Before Running Training

[ ] Dataset has at least 100 examples with non-masked labels
[ ] max_seq_length is long enough to include the full assistant response
[ ] bf16=True (or fp16=True for T4)
[ ] gradient_checkpointing=True
[ ] warmup_ratio=0.03
[ ] max_grad_norm=0.3
[ ] eval_steps set to evaluate several times per epoch
[ ] save_total_limit=3 (avoid filling your disk with checkpoints)
[ ] GPU memory confirmed below 90% of VRAM at step 1
[ ] Confirmed loss is decreasing at step 10
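Several of these items can be checked mechanically before launching. The sketch below validates a plain dictionary of settings; the function name, the dictionary layout, and the specific rules are assumptions for illustration and cover only part of the checklist. GPU memory and the step-10 loss still have to be checked during the run itself.

```python
def preflight_issues(cfg, num_examples):
    """Return a list of human-readable problems found in a training config dict."""
    issues = []
    if cfg.get("bf16") and cfg.get("fp16"):
        issues.append("bf16 and fp16 are both enabled; pick one")
    if not (cfg.get("bf16") or cfg.get("fp16")):
        issues.append("no mixed precision enabled")
    if not cfg.get("gradient_checkpointing"):
        issues.append("gradient_checkpointing is off")
    effective = cfg["per_device_train_batch_size"] * cfg["gradient_accumulation_steps"]
    steps_per_epoch = max(1, num_examples // effective)
    if cfg["eval_steps"] > steps_per_epoch:
        issues.append("eval_steps is larger than one epoch")
    if cfg.get("load_best_model_at_end") and cfg["save_steps"] % cfg["eval_steps"] != 0:
        issues.append("save_steps is not a multiple of eval_steps")
    return issues

good = {
    "bf16": True, "fp16": False, "gradient_checkpointing": True,
    "per_device_train_batch_size": 4, "gradient_accumulation_steps": 4,
    "eval_steps": 50, "save_steps": 100, "load_best_model_at_end": True,
}
bad = dict(good, fp16=True, save_steps=75)

print(preflight_issues(good, num_examples=1000))
print(preflight_issues(bad, num_examples=1000))
```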

Running that final check — confirming loss decreases at step 10 — before walking away from a training job saves you from discovering after 2 hours that the labels were wrong and the model learned nothing.
