LoRA: Low-Rank Adaptation Explained

The Core Insight

In the previous lesson, we calculated that full fine-tuning of Mistral-7B requires roughly 87–97 GB of GPU memory. The root cause: updating all 7.24 billion parameters means storing 7.24 billion gradients and 14.48 billion Adam optimizer state values.

LoRA (Low-Rank Adaptation) addresses this with a single observation: the weight updates needed to adapt a pretrained model to a new task tend to have low intrinsic rank, so they can be approximated well by low-rank matrices.

What does “low-rank” mean in plain terms? Imagine the weight matrix of a transformer attention layer as a 4096×4096 grid of numbers. During full fine-tuning, every cell in that grid gets updated. But the actual information you are injecting (new behaviors, styles, or domain knowledge) can often be expressed as a combination of a small number of “directions” in that high-dimensional space. LoRA restricts the update to a small number of such directions and learns them during training.

A physical analogy: you have a 4096-dimensional space (like a city with 4096 streets). To get anywhere useful, you do not need to move along all 4096 streets independently. Most useful destinations can be reached by moving along 8–16 primary directions. LoRA learns those 8–16 primary directions instead of the full 4096×4096 space.

The Mathematics

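Let us make this concrete. In a standard transformer, a weight matrix W has shape (d × k). For Mistral-7B’s query and output projections, d = k = 4096; its key and value projections are smaller (1024 × 4096) because the model uses grouped-query attention. The examples below use the square 4096 × 4096 case.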

In full fine-tuning, during training the weights update as:

W_new = W + ΔW

Where ΔW is a (4096 × 4096) matrix: about 16.8 million parameters per weight matrix.

LoRA’s key insight: instead of learning ΔW directly, constrain it to be a product of two smaller matrices:

ΔW = B × A

Where:

B has shape (d × r), i.e., (4096 × 8)
A has shape (r × k), i.e., (8 × 4096)
r is the “rank”, a small number you choose (typically 4–64)

This follows the naming used in the LoRA paper and in peft, where lora_A maps the input down to r dimensions and lora_B maps it back up.

The product B × A produces a (4096 × 4096) matrix, but its rank is at most r. It can only represent r independent “directions” in the full parameter space.
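The rank constraint can be checked on a toy example. The sketch below is illustrative only: it uses small integer matrices in place of the 4096-dimensional ones, and the names matmul, matrix_rank, left, and right are made up for this example. It multiplies a (d × r) factor by an (r × k) factor and confirms that the full-size product never exceeds rank r.

```python
from fractions import Fraction
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matrix_rank(M):
    """Rank via Gaussian elimination over exact fractions (no rounding error)."""
    rows = [[Fraction(v) for v in row] for row in M]
    rank = 0
    for col in range(len(rows[0])):
        # Look for a row at or below `rank` with a nonzero entry in this column.
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][col] / rows[rank][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

random.seed(0)
d, k, r = 6, 6, 2  # toy sizes standing in for 4096, 4096, 8

left = [[random.randint(-3, 3) for _ in range(r)] for _ in range(d)]   # d x r
right = [[random.randint(-3, 3) for _ in range(k)] for _ in range(r)]  # r x k
delta_W = matmul(left, right)                                          # d x k

print(f"delta_W shape: {len(delta_W)} x {len(delta_W[0])}")
print(f"rank(delta_W) = {matrix_rank(delta_W)} (never more than r = {r})")
```

The product has d × k entries, but only r × (d + k) numbers were needed to produce it, which is where the parameter savings come from.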

The Parameter Count Reduction

With r = 8, d = k = 4096:

Full ΔW parameters:    4096 × 4096 = 16,777,216
LoRA parameters (A+B): 4096 × 8 + 8 × 4096 = 65,536

Ratio: 65,536 / 16,777,216 = 0.39% of the original, a 256x reduction in trainable parameters per adapted weight matrix.
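The same arithmetic generalizes to any rank. As a minimal sketch (the function names lora_params and full_params are hypothetical helpers, not part of any library), the adapter size for a single d × k weight matrix is r × (d + k):

```python
def full_params(d, k):
    """Number of entries in a dense d x k update matrix."""
    return d * k

def lora_params(d, k, r):
    """Entries in the two low-rank factors: (d x r) plus (r x k)."""
    return r * (d + k)

d = k = 4096
for r in (4, 8, 16, 64):
    n = lora_params(d, k, r)
    reduction = full_params(d, k) / n
    print(f"r={r:>2}: {n:>7,} adapter parameters ({reduction:.0f}x fewer than dense)")
```

Because the count is linear in r, doubling the rank doubles the adapter size for that matrix.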

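Across all 32 layers in Mistral-7B with LoRA applied to query and value projections (r = 8):

# From peft's print_trainable_parameters():
# trainable params: 3,407,872 || all params: 7,245,139,968 || trainable%: 0.0470

The total is lower than 32 × 2 × 65,536 because Mistral-7B uses grouped-query attention: v_proj maps 4096 inputs to only 1024 outputs, so its adapter holds 8 × (4096 + 1024) = 40,960 parameters, while the q_proj adapter holds 65,536. That gives 106,496 per layer and 3,407,872 across 32 layers. Just 3.4 million trainable parameters out of 7.2 billion is roughly a 2,100x reduction. And because gradients and optimizer states are only stored for trainable parameters, Adam’s memory requirement drops proportionally.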

The Forward Pass

During the forward pass, LoRA adds its output to the frozen pretrained weight’s output:

h = W × x + (alpha / r) × B × (A × x)

Here A (shape r × k) projects the input down to r dimensions and B (shape d × r) projects it back up, the same roles lora_A and lora_B play in peft. The term (alpha / r) is a scaling factor. alpha is a hyperparameter that controls the magnitude of the LoRA update relative to the pretrained weight. At initialization, A receives small random values and B is set to zeros, so B × A = 0 and LoRA has no effect: training starts from the pretrained model’s behavior and gradually introduces the adapter’s influence.
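To see the zero-initialization property in action, here is a toy forward pass in plain Python. It is a sketch under stated assumptions, not peft's implementation: ToyLoRALinear, matvec, down, and up are hypothetical names, the matrices are 4 × 4 rather than 4096 × 4096, and dropout is omitted. The zero-initialized factor is the up-projection (the one peft stores as lora_B).

```python
import random

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

class ToyLoRALinear:
    """Frozen weight W plus a trainable low-rank branch (toy illustration)."""

    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d, k = len(W), len(W[0])
        self.W = W  # frozen pretrained weight, d x k
        # Down-projection (r x k): small random values.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(k)] for _ in range(r)]
        # Up-projection (d x r): zeros, so the branch contributes nothing at first.
        self.up = [[0.0] * r for _ in range(d)]
        self.scaling = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        low_rank = matvec(self.up, matvec(self.down, x))
        return [b + self.scaling * l for b, l in zip(base, low_rank)]

rng = random.Random(1)
W = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(4)]
x = [rng.uniform(-1, 1) for _ in range(4)]

layer = ToyLoRALinear(W, r=2, alpha=4)
at_init = layer.forward(x)
print(at_init == matvec(W, x))  # True: output matches the frozen layer exactly

layer.up[0][0] = 1.0  # stand-in for a single training update
after_update = layer.forward(x)
print(after_update[0] - at_init[0])     # shift in output 0 caused by the adapter
print(after_update[1:] == at_init[1:])  # True: other outputs are untouched
```

Note that the branch computes the down-projection first, so the full d × k update matrix is never materialized during training.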

The Hyperparameters You Control

Rank (r)

Rank controls the expressiveness of the adapter. Higher rank = more parameters = more capacity to adapt, but also more memory and more risk of overfitting.

Practical guidance:

r = 4: Very lightweight. Works for simple style transfer with 500+ examples.
r = 8: The most common choice. Good balance for most instruction-tuning tasks.
r = 16: Use when the task requires significant behavioral change or when you have 5,000+ training examples.
r = 64: Rarely needed. Approaches the quality of full fine-tuning for complex tasks, at 8x the adapter parameters (and adapter optimizer memory) of r = 8 and with a higher risk of overfitting on small datasets.

Alpha (lora_alpha)

Alpha controls the scaling of the LoRA updates. In practice, setting lora_alpha = 2 × r (e.g., alpha=16 when r=8) is a widely used starting point. The adapter’s output is multiplied by alpha / r, so increasing alpha at a fixed rank amplifies the adapter’s influence on outputs, with an effect similar to raising the learning rate of the LoRA weights.

A common pattern: if your model is not adapting enough (the outputs look too much like the base model), increase alpha. If the model is losing general capabilities (catastrophic forgetting), decrease alpha.
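One consequence of the alpha / r scaling is easy to miss: if you change the rank but leave alpha untouched, the adapter's scale changes too. The short sketch below (lora_scaling is a hypothetical helper that simply evaluates alpha / r) compares the alpha = 2r rule of thumb with a fixed alpha of 16.

```python
def lora_scaling(lora_alpha, r):
    """Multiplier applied to the adapter's output."""
    return lora_alpha / r

ranks = (4, 8, 16, 64)

# Rule of thumb from above: alpha = 2r keeps the multiplier constant.
rule_of_thumb = {r: lora_scaling(2 * r, r) for r in ranks}
print(rule_of_thumb)  # {4: 2.0, 8: 2.0, 16: 2.0, 64: 2.0}

# Holding alpha at 16 while changing r silently rescales the adapter.
fixed_alpha = {r: lora_scaling(16, r) for r in ranks}
print(fixed_alpha)    # {4: 4.0, 8: 2.0, 16: 1.0, 64: 0.25}
```

When comparing ranks, scaling alpha together with r keeps the multiplier fixed, so differences in results reflect the rank rather than the scale.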

Target Modules

LoRA can be applied to any linear layer in the model. In transformer attention, there are typically four projection matrices: Q (query), K (key), V (value), and O (output). There are also the feed-forward layers (up_proj, down_proj, gate_proj in LLaMA-style models).

# Minimal: only Q and V (most common starting point)
target_modules = ["q_proj", "v_proj"]

# Extended: Q, K, V, O + FFN (better quality, more parameters)
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", 
                  "up_proj", "down_proj", "gate_proj"]

Adding K and O projections typically improves quality by 5–15% at the cost of ~2x more trainable parameters. Adding FFN layers helps for tasks that require factual knowledge injection but can hurt for pure style adaptation.
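To estimate the parameter cost of each choice before loading a model, you can count adapter parameters from the layer shapes alone. The sketch below is an approximation under stated assumptions: the shapes are taken from the published Mistral-7B-v0.1 configuration (hidden size 4096, 8 key/value heads of dimension 128, feed-forward size 14336, 32 layers), and SHAPES, NUM_LAYERS, and lora_trainable_params are hypothetical names for this example.

```python
# (out_features, in_features) of each linear layer in one decoder block.
# Assumed from the Mistral-7B-v0.1 config; adjust for other models.
SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (1024, 4096),
    "v_proj": (1024, 4096),
    "o_proj": (4096, 4096),
    "gate_proj": (14336, 4096),
    "up_proj": (14336, 4096),
    "down_proj": (4096, 14336),
}
NUM_LAYERS = 32

def lora_trainable_params(target_modules, r):
    """Each adapted layer adds r * (in_features + out_features) parameters."""
    per_layer = sum(r * sum(SHAPES[name]) for name in target_modules)
    return NUM_LAYERS * per_layer

minimal = lora_trainable_params(["q_proj", "v_proj"], r=8)
attention = lora_trainable_params(["q_proj", "k_proj", "v_proj", "o_proj"], r=8)
extended = lora_trainable_params(list(SHAPES), r=8)

print(f"q+v:           {minimal:,}")    # 3,407,872
print(f"q+k+v+o:       {attention:,}")  # 6,815,744
print(f"attention+FFN: {extended:,}")   # 20,971,520
print(f"attention-only cost vs q+v: {attention / minimal:.1f}x")
```

Most of the extra cost in the extended configuration comes from the three wide feed-forward projections rather than from the attention layers.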

Complete LoRA Setup Code

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

model_name = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Disable cache for training (incompatible with gradient checkpointing)
model.config.use_cache = False

# Configure LoRA
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,       # dropout on the input to the LoRA branch (regularization)
    bias="none",             # do not train bias terms
    task_type=TaskType.CAUSAL_LM,
)

# Apply LoRA — freezes base model weights, adds trainable A/B matrices
model = get_peft_model(model, lora_config)

# Verify parameter counts
model.print_trainable_parameters()
# trainable params: 3,407,872 || all params: 7,245,139,968 || trainable%: 0.0470

Inspecting What Changed

You can verify that the base model weights are frozen and only the LoRA matrices are trainable:

# All trainable parameters should have "lora_" in their name
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"TRAINABLE: {name} -> {param.shape}")

# Sample output:
# TRAINABLE: base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight -> torch.Size([8, 4096])
# TRAINABLE: base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight -> torch.Size([4096, 8])
# TRAINABLE: base_model.model.model.layers.0.self_attn.v_proj.lora_A.default.weight -> torch.Size([8, 4096])
# TRAINABLE: base_model.model.model.layers.0.self_attn.v_proj.lora_B.default.weight -> torch.Size([1024, 8])
# ... (same pattern for all 32 layers)
# v_proj.lora_B has 1024 rows because Mistral-7B's value projection
# outputs 1024 features (grouped-query attention).

Memory Comparison

# Full fine-tuning memory estimate for Mistral-7B:
# Weights:         14.5 GB
# Gradients:       14.5 GB
# Adam (float32):  57.9 GB
# Total:          ~87.0 GB

# LoRA (r=8, q+v only) memory estimate, excluding activations:
# Weights (frozen): 14.5 GB  (loaded but no gradient storage needed)
# Gradients:        ~0.01 GB (only for 3.4M trainable params)
# Adam:             ~0.03 GB (only for trainable params)
# Total:           ~14.5 GB

print("LoRA reduces optimizer state memory by roughly 2,100x")
print("Total memory: ~14.5 GB vs ~87 GB for full fine-tuning")
print("Fits in a single A100 40GB or even 24 GB GPU with QLoRA")
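These estimates can be reproduced with a few lines of arithmetic. The sketch below is a rough calculator, not a profiler: training_memory_gb is a hypothetical helper, the byte sizes are assumptions (2-byte weights and gradients for full fine-tuning, 4-byte adapter weights and gradients for LoRA, two float32 Adam moments per trainable parameter), the parameter counts assume Mistral-7B-v0.1 with r=8 on q_proj and v_proj, and activation memory is ignored entirely.

```python
GB = 1e9  # decimal gigabytes

def training_memory_gb(total_params, trainable_params,
                       frozen_bytes=2, trainable_bytes=2,
                       grad_bytes=2, adam_bytes=8):
    """Rough memory for weights, gradients, and Adam states (no activations)."""
    frozen = (total_params - trainable_params) * frozen_bytes
    trainable = trainable_params * trainable_bytes
    grads = trainable_params * grad_bytes
    adam = trainable_params * adam_bytes  # two float32 moments per parameter
    return {
        "weights": (frozen + trainable) / GB,
        "gradients": grads / GB,
        "adam": adam / GB,
        "total": (frozen + trainable + grads + adam) / GB,
    }

BASE_PARAMS = 7_241_732_096  # assumed Mistral-7B-v0.1 parameter count
LORA_PARAMS = 3_407_872      # assumed adapter size: r=8 on q_proj and v_proj

full = training_memory_gb(BASE_PARAMS, BASE_PARAMS)
lora = training_memory_gb(BASE_PARAMS + LORA_PARAMS, LORA_PARAMS,
                          trainable_bytes=4, grad_bytes=4)

for label, est in (("full fine-tuning", full), ("LoRA", lora)):
    parts = ", ".join(f"{key}={value:.2f} GB" for key, value in est.items())
    print(f"{label}: {parts}")
```

In practice activations, temporary buffers, and framework overhead add several gigabytes on top of these figures, so treat the totals as lower bounds.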

The Adapter File

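After training, the LoRA adapter is saved as a small file: on the order of 10 MB for the r = 8, q+v configuration above (a few million values at 2–4 bytes each), and typically tens of megabytes for higher ranks or more target modules, compared to 14 GB for the full model weights. The base model weights are unchanged and can be shared across multiple adapters.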

# Save only the LoRA adapter weights
model.save_pretrained("./mistral-lora-adapter")

# This creates:
# ./mistral-lora-adapter/
#   adapter_config.json    (hyperparameters: r, alpha, target_modules)
#   adapter_model.safetensors  (the actual A and B matrices)

# Load later:
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
model_with_adapter = PeftModel.from_pretrained(base_model, "./mistral-lora-adapter")

This separation is powerful for production: you can serve one 14 GB base model and dynamically load different 50 MB adapters depending on the user’s context (customer A gets the support adapter, customer B gets the sales adapter, internal users get the engineering adapter).

Rank Selection in Practice

A quick experiment to calibrate your rank selection: train on the same dataset with r=4, r=8, r=16, and r=32, then compare validation loss and output quality. For many instruction-tuning tasks on 500–5000 examples, r=8 and r=16 tend to perform nearly identically, r=4 can be noticeably weaker, and much larger ranks such as r=64 may start to overfit.

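# Quick rank sweep: report the adapter size for each candidate rank.
# Training and evaluating each configuration is left to your own loop.
# base_model is the plain (non-PEFT) model loaded with from_pretrained above.
for rank in [4, 8, 16, 32]:
    config = LoraConfig(
        r=rank,
        lora_alpha=rank * 2,  # keep alpha = 2r
        target_modules=["q_proj", "v_proj"],
        task_type=TaskType.CAUSAL_LM,
    )
    m = get_peft_model(base_model, config)
    params = sum(p.numel() for p in m.parameters() if p.requires_grad)
    print(f"r={rank}: {params:,} trainable parameters")
    # get_peft_model injects LoRA layers into base_model in place;
    # remove them so the next rank starts from a clean model
    base_model = m.unload()

# r=4:  1,703,936 trainable parameters
# r=8:  3,407,872 trainable parameters
# r=16: 6,815,744 trainable parameters
# r=32: 13,631,488 trainable parameters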

For most practical fine-tuning tasks, start with r=8 and only increase if your evaluation results are clearly below target.

Why LoRA Works

One might ask: by constraining ΔW to be low-rank, are we not limiting what the model can learn? The answer is yes — but the constraint is rarely a binding one for instruction-tuning tasks.

The theoretical justification: research on the intrinsic dimensionality of neural networks (Aghajanyan et al., 2021) shows that large pretrained models can reach about 90% of full fine-tuning performance on many tasks when updates are restricted to a subspace of only a few hundred to a few thousand dimensions, even though the full parameter space has hundreds of millions of dimensions. The pretrained model has already learned such a rich representation that new tasks can be learned by modestly steering existing representations, not rewriting them.

LoRA gives you a principled way to exploit this. The next lesson, QLoRA, takes LoRA further by also quantizing the frozen base model weights — making it possible to fine-tune a 7B parameter model on a single consumer GPU.
