QLoRA: Fine-Tuning on Consumer Hardware

The Breakthrough Paper

In May 2023, a paper titled “QLoRA: Efficient Finetuning of Quantized LLMs” landed on arXiv and immediately changed what was possible for practitioners without access to GPU clusters. The paper’s central claim: you can fine-tune a 65-billion parameter model on a single 48 GB GPU — and a 7-billion parameter model on a single 24 GB consumer GPU — without measurable quality degradation compared to full fine-tuning.

The technique: load the frozen base model in 4-bit precision (cutting weight memory roughly 4x relative to float16), then attach LoRA adapters that are kept in higher precision. The base model’s weights stay quantized and frozen. Only the small LoRA adapter matrices receive gradient updates.

This lesson covers exactly how QLoRA works and gives you a complete, runnable implementation that fits on a free Colab T4 (16 GB VRAM).

Understanding Quantization

Quantization is the process of representing numbers using fewer bits. This is the same trade-off behind MP3 audio (fewer bits per sample than WAV) or JPEG images (fewer bits per pixel than PNG). You lose some information, but the loss is often imperceptible for the task at hand.

Standard Float16

A float16 number uses 16 bits: 1 sign bit, 5 exponent bits, and 10 mantissa bits. It can represent magnitudes up to 65,504 and has about 3 decimal digits of precision.

Each float16 takes 2 bytes of memory.
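These properties can be checked with Python's standard library, whose struct module supports IEEE 754 half precision through the "e" format code. The helper name to_float16 is only illustrative.

```python
import struct

def to_float16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

bytes_per_value = len(struct.pack("<e", 1.0))   # storage size of one float16
largest = to_float16(65504.0)                    # largest finite float16 value
rounded = to_float16(0.1)                        # 0.1 is not exactly representable

print(bytes_per_value, largest, rounded)
# 2 65504.0 0.0999755859375
```

The rounded value of 0.1 agrees with the original to about three significant digits, which is the precision limit described above.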

4-bit Quantization (NF4)

A 4-bit number has only 16 possible values. To represent a float16 weight matrix in 4 bits, you need to:

Determine the range of values in the matrix (e.g., -0.8 to +0.7)
Map those values to the 16 quantization levels
Store only the 4-bit code + a small “quantization constant” that allows recovery

Each 4-bit value takes 0.5 bytes of memory — a 4x reduction compared to float16.
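The three steps above can be sketched in plain Python. This is a minimal illustration, not the bitsandbytes implementation: it assumes absolute-maximum scaling of one small block, uses 16 evenly spaced levels rather than NF4, and the function names quantize_block and dequantize_block are hypothetical.

```python
# 16 evenly spaced levels in [-1, 1] (a uniform grid, NOT the NF4 table)
UNIFORM_LEVELS = [-1 + 2 * i / 15 for i in range(16)]

def quantize_block(weights, levels):
    """Return 4-bit codes (0-15) plus the block's quantization constant."""
    absmax = max(abs(w) for w in weights)          # step 1: range of the block
    scale = absmax if absmax > 0 else 1.0
    codes = [
        min(range(len(levels)), key=lambda i: abs(w / scale - levels[i]))
        for w in weights                           # step 2: nearest level
    ]
    return codes, absmax                           # step 3: codes + constant

def dequantize_block(codes, absmax, levels):
    """Recover approximate weights from the codes and the constant."""
    return [levels[c] * absmax for c in codes]

block = [-0.8, -0.31, -0.05, 0.0, 0.02, 0.27, 0.7]
codes, absmax = quantize_block(block, UNIFORM_LEVELS)
restored = dequantize_block(codes, absmax, UNIFORM_LEVELS)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, absmax, round(max_error, 4))
```

With a uniform grid, the worst-case error is half the spacing between levels times the block's constant, which is why the choice of levels matters.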

The specific variant used in QLoRA is called NF4 (NormalFloat4). It is designed for the statistical distribution of neural network weights, which tend to follow a normal (bell-curve) distribution. NF4 spaces its 16 quantization levels non-uniformly to minimize quantization error for normally distributed values — more levels near 0 (where most weights cluster), fewer at the extremes.
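To see what non-uniform spacing means, the sketch below builds 16 levels from equal-probability quantiles of a standard normal distribution and compares them with an even grid. This is a simplified construction for illustration only: the real NF4 table is a fixed set of values defined by the QLoRA authors (it includes an exact zero, which this version does not), and the function names are hypothetical.

```python
from statistics import NormalDist

def quantile_levels(n: int = 16) -> list[float]:
    """n levels at equal-probability quantiles of N(0, 1), rescaled to [-1, 1]."""
    raw = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]

def uniform_levels(n: int = 16) -> list[float]:
    """n evenly spaced levels in [-1, 1]."""
    return [-1 + 2 * i / (n - 1) for i in range(n)]

def count_near_zero(levels, radius=0.5):
    """How many levels fall within [-radius, radius]."""
    return sum(1 for v in levels if abs(v) <= radius)

normal_spaced = quantile_levels()
evenly_spaced = uniform_levels()
print("levels within ±0.5, quantile-based:", count_near_zero(normal_spaced))
print("levels within ±0.5, uniform:       ", count_near_zero(evenly_spaced))
```

The quantile-based grid places more of its 16 levels near zero, so the many small weights are represented more finely than the rare large ones.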

The Memory Math

Mistral-7B in float16:  7.24B × 2 bytes = 14.48 GB
Mistral-7B in 4-bit:    7.24B × 0.5 bytes = 3.62 GB

Plus overhead for quantization constants, LoRA adapters, activations, and optimizer states:

Base model (4-bit):  ~3.6 GB
LoRA adapters:       ~0.03 GB
Activations:         ~2-4 GB
Optimizer states:    ~0.06 GB
CUDA overhead:       ~1 GB
Total:               ~7-9 GB
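The arithmetic above is easy to reproduce and adapt to other model sizes. The sketch below copies the figures from the breakdown; the per-component numbers are the lesson's estimates, not measurements, and the helper name weights_gb is illustrative.

```python
PARAMS = 7.24e9  # Mistral-7B parameter count used in the text

def weights_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9

fp16_gb = weights_gb(PARAMS, 2.0)
nf4_gb = weights_gb(PARAMS, 0.5)

# (low, high) estimates in GB, copied from the breakdown above
budget = {
    "base model (4-bit)": (3.6, 3.6),
    "LoRA adapters": (0.03, 0.03),
    "activations": (2.0, 4.0),
    "optimizer states": (0.06, 0.06),
    "CUDA overhead": (1.0, 1.0),
}
low = sum(lo for lo, _ in budget.values())
high = sum(hi for _, hi in budget.values())

print(f"float16 weights: {fp16_gb:.2f} GB, 4-bit weights: {nf4_gb:.2f} GB")
print(f"training estimate: {low:.1f} to {high:.1f} GB")
```

Changing PARAMS or the activation range gives a quick first check of whether a different model is likely to fit on a given GPU.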

A free Google Colab T4 has 16 GB of VRAM. Mistral-7B fits with room to spare. This is the lesson that makes the rest of the course accessible to everyone.

Complete QLoRA Implementation

Install the required packages:

pip install transformers datasets peft accelerate bitsandbytes trl

Step 1: Configure 4-bit Quantization

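from transformers import BitsAndBytesConfig
import torch

# bfloat16 needs an Ampere-or-newer GPU (compute capability >= 8.0).
# A Colab T4 is Turing (7.5), so fall back to float16 there.
has_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
compute_dtype = torch.bfloat16 if has_bf16 else torch.float16

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                          # load in 4-bit instead of float16
    bnb_4bit_quant_type="nf4",                  # use NormalFloat4 quantization
    bnb_4bit_compute_dtype=compute_dtype,       # 16-bit dtype used for the matrix multiplications
    bnb_4bit_use_double_quant=True,             # quantize the quantization constants too
)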

The bnb_4bit_compute_dtype is critical: even though weights are stored in 4 bits, the actual matrix multiplications are performed in this 16-bit dtype (bfloat16 on GPUs that support it, float16 on older cards such as the T4). The 4-bit values are dequantized on the fly whenever a layer is used. This means you get 4-bit storage density with near-16-bit computation quality.

The bnb_4bit_use_double_quant=True option quantizes the quantization constants themselves (storing them in 8-bit instead of 32-bit). The QLoRA paper puts the saving at about 0.37 bits per parameter, which works out to roughly 0.3 GB on a 7B model.
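The saving from double quantization can be worked out by hand. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per first-level block, 256 constants per second-level block); these values do not appear elsewhere in this lesson, so treat them as assumptions to verify against your bitsandbytes version.

```python
BLOCK = 64       # weights sharing one quantization constant (assumed, from the paper)
BLOCK2 = 256     # constants sharing one second-level constant (assumed, from the paper)
PARAMS = 7.24e9  # Mistral-7B parameter count used in the text

# Overhead in bits per weight
single = 32 / BLOCK                          # one float32 constant per block
double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # 8-bit constants plus float32 second-level constants

saved_bits = single - double
saved_gb = saved_bits * PARAMS / 8 / 1e9

print(f"overhead without double quantization: {single:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"saving on a 7.24B model:              {saved_gb:.2f} GB")
```

The overhead drops from 0.5 bits to about 0.127 bits per parameter, which is small per weight but adds up across billions of parameters.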

Step 2: Load Model with Quantization

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Required for QLoRA — prepares the quantized model for gradient updates
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)

model.config.use_cache = False
print("Model loaded in 4-bit. Memory footprint:")
print(f"  {model.get_memory_footprint() / 1e9:.2f} GB")
# Model loaded in 4-bit. Memory footprint:
#   3.89 GB

Step 3: Apply LoRA

The LoRA configuration is identical to the previous lesson. LoRA is applied on top of the quantized model: the adapter matrices (A and B) are created in a regular floating-point dtype (float32 by default in PEFT) and are the only parameters that receive gradients.
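Conceptually, each adapted layer computes the frozen base output plus a scaled low-rank correction. The toy sketch below uses plain Python lists and made-up numbers to show that structure; it is not how PEFT implements the layer, and the function names are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(w_frozen, lora_a, lora_b, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); only lora_a and lora_b would be trained."""
    base = matvec(w_frozen, x)                    # frozen (dequantized) base weights
    update = matvec(lora_b, matvec(lora_a, x))    # low-rank path through A, then B
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy shapes: 3 inputs, 2 outputs, rank 1
w = [[0.5, -0.25, 0.0], [0.1, 0.2, 0.3]]
a = [[0.3, -0.1, 0.2]]        # r x in_features
b_zero = [[0.0], [0.0]]       # out_features x r, zero at initialization
x = [1.0, 2.0, 3.0]

y_init = lora_forward(w, a, b_zero, x, alpha=2, r=1)
y_trained = lora_forward(w, a, [[0.5], [-0.5]], x, alpha=2, r=1)
print(y_init, y_trained)
```

Because B starts at zero, the adapted model initially behaves exactly like the base model, and training moves it away from that starting point only through A and B.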

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "v_proj",
        "k_proj",
        "o_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
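# Expect about 6.8M trainable parameters for this config (r=8 on the four attention
# projections of Mistral-7B). The reported total and percentage depend on how your
# PEFT version counts the packed 4-bit weights.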
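The trainable parameter count can be estimated without loading the model. The sketch below assumes the published Mistral-7B layer shapes (32 layers, hidden size 4096, and 1024-wide key/value projections from grouped-query attention); check them against model.config before relying on the result. The function name is hypothetical.

```python
# Assumed Mistral-7B projection shapes as (in_features, out_features)
ATTENTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),   # grouped-query attention: 8 KV heads x 128 dims
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
NUM_LAYERS = 32

def lora_param_count(shapes, r, num_layers):
    """Each adapted projection adds A (r x in) and B (out x r)."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * num_layers

trainable = lora_param_count(ATTENTION_SHAPES, r=8, num_layers=NUM_LAYERS)
print(f"{trainable:,}")
# 6,815,744
```

Doubling r doubles this count, and adding the MLP projections to target_modules increases it further.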

Step 4: Load and Format Dataset

from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")

def format_alpaca(example):
    if example["input"]:
        text = (f"Below is an instruction that describes a task, paired with an input.\n\n"
                f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}")
    else:
        text = (f"Below is an instruction that describes a task.\n\n"
                f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}")
    return {"text": text}

dataset = dataset.map(format_alpaca)
print(dataset[0]['text'][:200])

Step 5: Configure Training

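import torch
from transformers import TrainingArguments
from trl import SFTTrainer  # Supervised Fine-Tuning Trainer from HuggingFace TRL

# A T4 has no native bfloat16 support, so pick the mixed-precision mode per GPU
has_cuda = torch.cuda.is_available()
has_bf16 = has_cuda and torch.cuda.get_device_capability()[0] >= 8

training_args = TrainingArguments(
    output_dir="./qlora-mistral-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,     # effective batch size = 16
    gradient_checkpointing=True,       # save activations memory at cost of ~15% speed
    optim="paged_adamw_32bit",         # paged optimizer: states can spill to CPU RAM when GPU memory runs out
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=has_cuda and not has_bf16,    # float16 mixed precision on pre-Ampere GPUs such as the T4
    bf16=has_bf16,                     # bfloat16 for stability where the GPU supports it
    max_grad_norm=0.3,                 # gradient clipping
    warmup_ratio=0.03,                 # 3% of steps for warmup
    lr_scheduler_type="cosine",
    logging_steps=25,
    save_steps=100,
    save_total_limit=2,
    report_to="tensorboard",           # or "wandb" if you have it configured
)

# Keyword names below follow older TRL releases. Newer releases move
# dataset_text_field, max_seq_length, and packing into SFTConfig and
# rename tokenizer to processing_class, so check your installed version.
trainer = SFTTrainer(
    model=model,                       # already wrapped by get_peft_model in Step 3, so no peft_config here
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
    packing=False,
)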
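To get a feel for what these settings imply, the sketch below estimates the number of optimizer steps and traces the learning rate under linear warmup followed by cosine decay. It is an approximation: the Trainer's own step rounding may differ by a few steps, and lr_at is a hypothetical helper rather than a library function.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

examples, epochs = 5000, 3
effective_batch = 4 * 4                      # per-device batch size x accumulation steps
total_steps = math.ceil(examples / effective_batch) * epochs
warmup_steps = math.ceil(0.03 * total_steps)

print(f"about {total_steps} optimizer steps, {warmup_steps} of them warmup")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps, warmup_steps, 2e-4):.2e}")
```

With logging_steps=25 and save_steps=100, a run of this length produces a few dozen log entries and a handful of checkpoints, of which save_total_limit=2 keeps the latest two.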

Step 6: Train

trainer.train()

# Save the LoRA adapter
trainer.model.save_pretrained("./qlora-mistral-adapter")
tokenizer.save_pretrained("./qlora-mistral-adapter")

print("Training complete. Adapter saved.")

The paged_adamw_32bit Optimizer

This deserves its own explanation. The standard Adam optimizer holds all optimizer states (first and second moments) in GPU memory continuously. paged_adamw_32bit uses NVIDIA’s unified memory so that optimizer states can be paged out to CPU RAM when the GPU runs short of memory (for example, during an activation spike on a long sequence) and paged back in when the optimizer step needs them.

This is the difference between:

Standard Adam at r=8: about 6.8M trainable params × 8 bytes (two float32 tensors) = ~55 MB, which fits easily, so paging is rarely needed for LoRA
What really matters: if you increase r to 64 and target all linear modules, you can have 100M+ trainable params and on the order of 1 GB of optimizer state, where paging becomes valuable

For typical QLoRA runs with r=8, the paged optimizer does not save much. But it is cheap insurance against OOM errors when you experiment with higher ranks.
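The two scenarios can be compared numerically. The sketch below assumes the published Mistral-7B linear layer shapes (including a 14336-wide MLP) and counts two float32 moment tensors per trainable parameter; the shapes and the helper name adam_state_mb are assumptions to check against your model.

```python
# Assumed Mistral-7B linear layer shapes as (in_features, out_features)
ATTENTION = [(4096, 4096), (4096, 1024), (4096, 1024), (4096, 4096)]   # q, k, v, o
MLP = [(4096, 14336), (4096, 14336), (14336, 4096)]                    # gate, up, down
NUM_LAYERS = 32

def adam_state_mb(shapes, r):
    """Two float32 moment tensors (8 bytes total) per trainable LoRA parameter."""
    n_params = NUM_LAYERS * sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes)
    return n_params * 8 / 1e6

small = adam_state_mb(ATTENTION, r=8)          # the configuration used in this lesson
large = adam_state_mb(ATTENTION + MLP, r=64)   # higher rank, all linear layers

print(f"r=8, attention only:     {small:.0f} MB")
print(f"r=64, all linear layers: {large:.0f} MB")
```

The second configuration needs well over a gigabyte for optimizer state alone, which is where a paged optimizer starts to earn its keep on a 16 GB card.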

Monitoring Your Training Run

Always watch these metrics during training:

# Training loss should decrease smoothly (the absolute value depends on the dataset, so watch the trend)
# Validation loss, if you pass an eval_dataset, should track training loss (not diverge upward)
# GPU memory should stay stable (steady growth over time suggests a memory leak)

# Check GPU memory usage
import subprocess
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.free", "--format=csv,noheader"],
    capture_output=True, text=True
)
print(result.stdout)
# 9842 MiB, 6094 MiB  (using ~10 GB of 16 GB T4)

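A healthy training run on a free T4 should use roughly 10–13 GB of VRAM with these settings. That is more than the 7–9 GB estimate above, most likely because a batch size of 4 at 512 tokens pushes activation memory higher. If you hit 15+ GB, reduce per_device_train_batch_size first, then max_seq_length; lowering r frees comparatively little memory.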
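If you want to act on the nvidia-smi output programmatically, the query result is simple to parse. The sketch below is a hypothetical helper pair, and the 90% warning threshold is an arbitrary assumption; note that used plus free can be slightly less than the card's total because some memory is reserved.

```python
def parse_gpu_memory(csv_line: str) -> tuple[int, int]:
    """Parse one 'used MiB, free MiB' line from the nvidia-smi query above."""
    used, free = (int(field.strip().split()[0]) for field in csv_line.split(","))
    return used, free

def memory_status(used_mib: int, free_mib: int, warn_fraction: float = 0.9) -> str:
    """Flag the run when the used share of memory crosses warn_fraction."""
    fraction = used_mib / (used_mib + free_mib)
    return "near limit" if fraction >= warn_fraction else "ok"

used, free = parse_gpu_memory("9842 MiB, 6094 MiB")
print(used, free, memory_status(used, free))
# 9842 6094 ok
```

Calling this every few hundred steps and logging the result makes slow memory growth visible long before it turns into an out-of-memory error.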

What Gets Saved

After training, your adapter directory contains:

./qlora-mistral-adapter/
├── adapter_config.json         # r=8, alpha=16, target_modules=[...]
├── adapter_model.safetensors   # A and B matrices for all layers (~27 MB in float32)
├── tokenizer.json
├── tokenizer_config.json
└── special_tokens_map.json

The adapter file is ~27 MB (about 6.8M parameters stored in float32). The base model occupies ~3.9 GB in 4-bit. Compare this to 14.5 GB for the full float16 model: you are working with roughly 27% of the memory requirement.

Quality vs Full Fine-Tuning

The QLoRA paper benchmarked their method against full fine-tuning on multiple standard NLP tasks. The headline result: QLoRA matches or comes within 1–2% of full fine-tuning quality on most instruction-following benchmarks.

Why does a 4x lossy compression of the base model not hurt quality? Two reasons:

The quantization error is tiny for normally distributed weights: NF4 is specifically designed for neural network weight distributions. The average quantization error is very small compared to the weight magnitudes.
The LoRA adapter compensates: Even if quantization introduces small systematic errors in the base model’s representations, the LoRA adapter can learn to correct for them during fine-tuning. The adapter has enough capacity to compensate for the quantization noise while also learning the new task.

For production deployments where you need maximum quality and have A100 access, 16-bit LoRA (float16 base + LoRA) can be marginally better. For experimentation, prototyping, and most practical applications, the difference from QLoRA is usually negligible.

The next lesson covers a broader comparison of PEFT methods beyond LoRA — including adapter layers, prefix tuning, and IA³ — so you can make informed choices for different task types.
