Full Fine-Tuning: The Baseline
Classic fine-tuning — and why it's impractical for large models
What Full Fine-Tuning Actually Does
In classical supervised learning — training a ResNet on images or an LSTM on text — fine-tuning means loading pretrained weights and updating all of them on your downstream task. You run gradient descent, and every single parameter in the network shifts slightly to reduce loss on your training data.
This is called full fine-tuning in the LLM context, to distinguish it from the parameter-efficient methods we cover later. The word “full” means you update the full parameter set.
For models with millions of parameters, this worked fine. ResNet-50 has 25 million parameters. You can load it, fine-tune it, and save it on any GPU that was manufactured in the last decade. The gradient computation is fast, the optimizer states fit in memory, and the whole process takes minutes to hours.
Then language models got big. GPT-2 launched in 2019 with 1.5 billion parameters. GPT-3 in 2020 had 175 billion. Today’s “small” open-source models (Mistral-7B, LLaMA 3.1 8B, Phi-3 Mini) have roughly 4 to 8 billion parameters. Full fine-tuning on these models has a memory problem that makes it practically impossible on consumer hardware. Let us do the math.
The Memory Math
Understanding why full fine-tuning is expensive requires knowing what you need to store in GPU memory during training. There are three fixed components, plus activations, whose size depends on batch size and sequence length.
1. Model Weights
Every parameter in the model must be stored in GPU memory. The precision (data type) of the parameters determines how much memory each one occupies.
- float32 (full precision): 4 bytes per parameter
- float16 (half precision): 2 bytes per parameter
- bfloat16 (brain float): 2 bytes per parameter
Mistral-7B has 7.24 billion parameters. At float16, that is:
    7,240,000,000 parameters × 2 bytes = 14,480,000,000 bytes = ~14.5 GB
A high-end consumer GPU such as the RTX 4090 has 24 GB of VRAM. So the weights alone consume about 60% of a 4090’s memory before we even start training. On an RTX 3080 with 10 GB? The weights do not fit at all.
2. Gradients
During backpropagation, PyTorch computes a gradient for every trainable parameter. Each gradient tensor has the same shape and data type as the corresponding parameter.
    Gradients: 7,240,000,000 × 2 bytes = ~14.5 GB
We are now at 29 GB for weights and gradients alone. That already exceeds the 24 GB available on an RTX 4090.
3. Optimizer States
The Adam optimizer (the standard choice for LLM training, usually in its AdamW variant) maintains two additional tensors for each parameter: the first moment (a running mean of gradients, m) and the second moment (a running mean of squared gradients, v, often loosely called the variance). In standard mixed-precision training, both are stored in float32 even when the model is in float16.
    Adam first moment:  7,240,000,000 × 4 bytes = ~28.96 GB
    Adam second moment: 7,240,000,000 × 4 bytes = ~28.96 GB
Total optimizer state: ~57.9 GB.
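To make the two extra tensors concrete, here is a minimal pure-Python sketch of a single Adam update on a toy parameter list. The function name `adam_step` and the hyperparameter defaults are illustrative assumptions (they mirror commonly used values) and are not taken from any particular library. The point to notice is that `m` and `v` each hold exactly one value per parameter, which is where the extra memory comes from.

```python
import math

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place. m and v hold one value per parameter."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g       # first moment: running mean of gradients
        v[i] = beta2 * v[i] + (1 - beta2) * g * g   # second moment: running mean of squared gradients
        m_hat = m[i] / (1 - beta1 ** t)             # bias correction for step t (1-indexed)
        v_hat = v[i] / (1 - beta2 ** t)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

# Toy example: three parameters, so three entries in m and three in v.
params = [0.5, -1.2, 3.0]
grads = [0.1, -0.4, 0.0]
m = [0.0] * len(params)
v = [0.0] * len(params)
adam_step(params, grads, m, v, t=1)
print(params, m, v)
```

Scaling this from 3 parameters to 7.24 billion is what produces the two ~28.96 GB tensors.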
The Full Accounting
| Component | Memory (float16 weights) |
|---|---|
| Model weights | ~14.5 GB |
| Gradients | ~14.5 GB |
| Adam first moment (float32) | ~28.96 GB |
| Adam second moment (float32) | ~28.96 GB |
| Activations (batch-dependent) | ~5–15 GB |
| Total | ~92–102 GB |
To fully fine-tune Mistral-7B, you need approximately 92–102 GB of GPU memory: about 87 GB of fixed state plus activations. That is more than a single A100 80GB can hold, so you need a multi-GPU node; four A100 80GB GPUs run roughly $10–15 per hour on cloud providers. Over a serious training run, this adds up quickly.
For a 70B model like LLaMA 3.1 70B, scale everything by 10x. At 12 bytes per parameter, weights, gradients, and Adam states alone come to roughly 840 GB before activations, which means a cluster of more than ten A100 80GB GPUs.
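The accounting above can be reproduced with a few lines of arithmetic. The sketch below is an estimate under this lesson's assumptions (2-byte weights and gradients, two 4-byte Adam moments, decimal gigabytes); the function name `full_finetune_memory_gb` is hypothetical, and activations are deliberately left out because they depend on batch size and sequence length.

```python
GB = 1e9  # decimal gigabytes, matching the arithmetic in the text

def full_finetune_memory_gb(num_params, weight_bytes=2, grad_bytes=2, adam_state_bytes=4):
    """Estimate static training memory in GB, excluding activations."""
    weights = num_params * weight_bytes
    grads = num_params * grad_bytes
    adam = 2 * num_params * adam_state_bytes  # first and second moments
    return {
        "weights": weights / GB,
        "gradients": grads / GB,
        "adam_states": adam / GB,
        "total": (weights + grads + adam) / GB,
    }

mistral = full_finetune_memory_gb(7.24e9)
llama_70b = full_finetune_memory_gb(70e9)
for name, estimate in [("7.24B", mistral), ("70B", llama_70b)]:
    print(name, {key: round(value, 1) for key, value in estimate.items()})
```

Changing `weight_bytes` to 4 shows how quickly the picture worsens if the weights are kept in float32.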
The Code Anyway
Despite the memory requirements, it is instructive to see what full fine-tuning looks like in code. This is the baseline you are always comparing against, and understanding it makes the efficiency gains of LoRA and QLoRA tangible.
    import torch
    from datasets import load_dataset
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        DataCollatorForSeq2Seq,
        Trainer,
        TrainingArguments,
    )

    model_name = "mistralai/Mistral-7B-v0.1"

    # Reuse the EOS token for padding
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    # Load model in bfloat16 (more numerically stable than float16 for training)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # NO peft config: all parameters are trainable
    print("Total trainable parameters:")
    total = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f" {total:,} ({total / 1e9:.2f}B)")
    # Total trainable parameters:
    # 7,241,732,096 (7.24B)

    # Load and tokenize dataset
    dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

    def tokenize(example):
        text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
        # No padding here: the collator pads each batch dynamically
        tokens = tokenizer(text, truncation=True, max_length=512)
        # Causal LM objective: labels are a copy of the input ids
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

    # Training configuration
    training_args = TrainingArguments(
        output_dir="./full-ft-mistral",
        num_train_epochs=3,
        per_device_train_batch_size=1,   # forced to 1 because of memory
        gradient_accumulation_steps=16,  # effective batch size = 16
        learning_rate=2e-5,              # lower than LoRA, since all weights are updated
        bf16=True,                       # match the bfloat16 weights loaded above
        logging_steps=10,
        save_strategy="epoch",           # evaluation is off by default
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized,
        # Pads input_ids with the pad token and labels with -100,
        # so padded positions are ignored by the loss
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model, padding=True),
    )

    # This will OOM on a single consumer GPU:
    # trainer.train()
If you try to run this on a 24 GB GPU, you will see an out-of-memory error along these lines:
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.50 GiB.
    GPU has 1.87 GiB of free space remaining.
That error message is the entire motivation for the rest of this course.
Tricks That Help (But Do Not Solve the Problem)
The ML community has developed several memory-reduction techniques for full fine-tuning. They help, but they do not change the fundamental arithmetic.
Gradient Checkpointing
Instead of storing all intermediate activations during the forward pass (needed for backprop), gradient checkpointing keeps only a subset and recomputes the rest on the fly during the backward pass. This trades compute for memory: you do about 33% more FLOPs, and with evenly spaced checkpoints activation memory can drop from O(layers) to O(√layers).
    model.gradient_checkpointing_enable()
This can reduce activation memory by 60–70%, but it does not touch the optimizer states (the biggest contributor to memory usage).
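The square-root behaviour can be seen in a simplified cost model. The sketch below assumes every layer's activations are the same size, and that the network is split into equal segments with one checkpoint kept per segment; the function name `peak_activations` is hypothetical. Real implementations differ in where they place checkpoints, so treat this as an illustration of the trade-off and not as a description of any library's behaviour.

```python
import math

def peak_activations(num_layers, num_segments):
    """Peak number of layer activations held at once in the simplified model:
    one checkpoint per segment, plus the activations of the single segment
    being recomputed during the backward pass."""
    segment_len = math.ceil(num_layers / num_segments)
    return num_segments + segment_len

num_layers = 32
no_checkpointing = num_layers  # every layer's activations are kept
best_k = min(range(1, num_layers + 1), key=lambda k: peak_activations(num_layers, k))
print(f"without checkpointing: {no_checkpointing} layers of activations")
print(f"with {best_k} segments: {peak_activations(num_layers, best_k)} layers of activations")
```

In this model the minimum sits near √32 ≈ 5.7 segments, which is where the O(√layers) figure comes from.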
8-bit Adam (bitsandbytes)
The bitsandbytes library implements an 8-bit version of Adam that quantizes optimizer states from float32 to 8-bit values. This cuts optimizer state memory by roughly 4x: from 4 bytes to 1 byte per value, or from ~57.9 GB to ~14.5 GB for Mistral-7B.
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        # ... (other arguments unchanged)
        optim="adamw_bnb_8bit",  # 8-bit Adam
    )
Even with both of these tricks, fully fine-tuning Mistral-7B requires 40+ GB of GPU memory. You still need an A100.
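A rough budget shows where the 40+ GB figure comes from. This is a back-of-the-envelope sketch under stated assumptions: 2-byte weights and gradients, two 1-byte Adam moments, and the 5–15 GB activation range from the accounting table reduced by the 60–70% quoted for gradient checkpointing. It ignores the small bookkeeping overhead that quantized optimizers carry.

```python
NUM_PARAMS = 7.24e9
GB = 1e9

weights = NUM_PARAMS * 2 / GB        # bfloat16 or float16 weights
gradients = NUM_PARAMS * 2 / GB      # same dtype as the weights
adam_8bit = 2 * NUM_PARAMS * 1 / GB  # two moments at 1 byte each

# Activations: best case is the low end with a 70% cut,
# worst case is the high end with a 60% cut.
activations_low = 5 * (1 - 0.70)
activations_high = 15 * (1 - 0.60)

total_low = weights + gradients + adam_8bit + activations_low
total_high = weights + gradients + adam_8bit + activations_high
print(f"static state: {weights + gradients + adam_8bit:.1f} GB")
print(f"estimated total: {total_low:.1f} to {total_high:.1f} GB")
```

Weights and gradients now dominate, and neither trick reduces them.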
DeepSpeed ZeRO
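Microsoft’s DeepSpeed library shards training state across multiple GPUs in cumulative stages: ZeRO Stage 1 shards optimizer states, Stage 2 also shards gradients, and Stage 3 also shards the parameters themselves. With ZeRO Stage 3, four 80 GB GPUs pool 320 GB of memory, which at roughly 12 bytes per parameter covers models of about 20B parameters once activations are included; larger models need more GPUs or offloading to CPU memory. This is how most large-scale fine-tuning was done before QLoRA.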
But this requires multiple expensive GPUs and significant infrastructure setup. It is the right answer for teams training 70B+ models, not for most fine-tuning use cases.
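The effect of each stage on per-GPU memory can be estimated with simple division. The sketch below uses this lesson's accounting (2 bytes for weights, 2 for gradients, 8 for the two float32 Adam moments) as an assumption; the function name `zero_per_gpu_gb` is hypothetical, and the estimate ignores activations, communication buffers, and any float32 master copy of the weights.

```python
def zero_per_gpu_gb(num_params, num_gpus, stage, weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Estimate per-GPU static memory in GB for ZeRO stages 0-3 (0 = no sharding)."""
    weights = num_params * weight_bytes
    grads = num_params * grad_bytes
    optim = num_params * optim_bytes
    if stage >= 1:
        optim /= num_gpus    # Stage 1 shards optimizer states
    if stage >= 2:
        grads /= num_gpus    # Stage 2 also shards gradients
    if stage >= 3:
        weights /= num_gpus  # Stage 3 also shards parameters
    return (weights + grads + optim) / 1e9

for stage in range(4):
    per_gpu = zero_per_gpu_gb(7.24e9, num_gpus=4, stage=stage)
    print(f"ZeRO stage {stage}: {per_gpu:.1f} GB per GPU")
```

Only Stage 3 makes the per-GPU cost fall in proportion to the number of GPUs, which is why it is the stage used for the largest models.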
Why This Sets Up LoRA Perfectly
The arithmetic above reveals exactly where the problem lives: updating 7.24 billion parameters requires storing 7.24 billion gradients and 14.48 billion Adam optimizer states. That is the cost you pay when every parameter is trainable.
The insight behind LoRA is this: do you actually need to update all 7.24 billion parameters to teach a model new behavior?
Empirically, the answer is often no. The updates needed to adapt a pretrained model to a new task tend to be low-rank: they live in a much lower-dimensional subspace than the full parameter space. If the effective rank of the update to a 4096 × 4096 weight matrix is 8 (rather than 4096), you only need to store and compute gradients for a tiny fraction of the original parameters.
LoRA exploits this fact directly. Instead of updating the full weight matrix W, it freezes W and learns two small matrices A and B whose product approximates ΔW. The number of trainable parameters drops from 7.24B to roughly 4M, a reduction of about 1,800x; the exact figure depends on the rank and on which matrices are adapted.
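The parameter count is easy to check by hand. The sketch below assumes a hypothetical configuration: rank 8, applied to two square 4096 × 4096 attention projections in each of 32 layers. These shapes are illustrative and not read from any model config; a real model may use different shapes or adapt more matrices, which changes the total. The function name `lora_param_count` is also hypothetical.

```python
def lora_param_count(shapes, rank, num_layers):
    """Trainable parameters when each (d_out, d_in) matrix gets a rank-r pair:
    B has shape (d_out, r) and A has shape (r, d_in)."""
    per_layer = sum(rank * (d_out + d_in) for d_out, d_in in shapes)
    return per_layer * num_layers

# Hypothetical setup: two square 4096x4096 projections per layer, 32 layers, rank 8.
adapted_shapes = [(4096, 4096), (4096, 4096)]
trainable = lora_param_count(adapted_shapes, rank=8, num_layers=32)
full = 7_241_732_096  # total parameter count printed in the full fine-tuning example

print(f"{trainable:,} trainable vs {full:,} total ({full / trainable:.0f}x fewer)")
```

Gradients and Adam states are only needed for these trainable parameters, so their memory cost shrinks by the same factor.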
That is the subject of the next lesson. But now you understand why LoRA exists: because the alternative requires a GPU cluster, and most practitioners do not have one.
When Full Fine-Tuning Is Still Worth It
Full fine-tuning does make sense in specific circumstances:
Smaller models: For models up to around 1B parameters, full fine-tuning is feasible on a single GPU and often produces better results than LoRA. DistilGPT2 at 82M is easy; Phi-3 Mini at 3.8B is borderline.
Maximum quality on critical tasks: For production models serving millions of users where quality differences of 2–3% matter, the marginal quality improvement of full fine-tuning over LoRA may justify the compute cost.
Continued pretraining: If you are continuing to pretrain an existing model on domain-specific text (a medical LLM, a code model) rather than instruction tuning, you typically need full fine-tuning because you are modifying the knowledge representation deeply, not just surface behavior.
For the rest of this course, we will work with parameter-efficient methods that run on a single consumer GPU. But keep full fine-tuning in your toolkit — understanding its costs and tradeoffs helps you make informed decisions about when the alternatives are genuinely sufficient.
