PEFT: Comparing Adapters, Prefix Tuning, and IA³
Beyond LoRA — a practical comparison of parameter-efficient methods
Why Bother with Alternatives to LoRA?
LoRA is the default choice for parameter-efficient fine-tuning, and for good reason: it is well-understood, widely supported, and performs well across a broad range of tasks. For most practitioners, LoRA is the answer and you can skip this lesson.
But “most” is not “all.” There are task types, resource constraints, and quality requirements where one of the alternative PEFT methods — Adapter Layers, Prefix Tuning, or IA³ — offers a genuine advantage. This lesson gives you enough depth to recognize those situations and make an informed choice.
Method 1: Adapter Layers (Houlsby Adapters)
What They Do
Adapter layers, introduced by Houlsby et al. in 2019 (predating LoRA by two years), insert small trainable modules between the existing layers of a transformer. The original weight matrices are frozen; adapters are plugged in.
The architecture of a single adapter module:
- A down-projection layer: reduces dimensionality from d_model to r (e.g., 4096 → 64)
- A nonlinearity (ReLU or GELU)
- An up-projection layer: restores dimensionality from r to d_model (64 → 4096)
- A residual connection: adds the original input to the adapter’s output
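The four components above can be sketched in a few lines of plain Python. This is a minimal illustration with toy dimensions, not the implementation used by any adapter library. The helper names (`matvec`, `relu`, `adapter_forward`) and the zero initialization of the up-projection are assumptions made for the example.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in vector]


def adapter_forward(x, w_down, w_up):
    """Bottleneck adapter: down-project, ReLU, up-project, add the residual."""
    hidden = relu(matvec(w_down, x))              # d_model -> r
    delta = matvec(w_up, hidden)                  # r -> d_model
    return [xi + di for xi, di in zip(x, delta)]  # residual connection


# Toy sizes for readability; the lesson's example uses d_model=4096 and r=64.
d_model, r = 8, 2
rng = random.Random(0)
w_down = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(r)]
# Assumption for this sketch: the up-projection starts at zero.
w_up = [[0.0] * r for _ in range(d_model)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

y = adapter_forward(x, w_down, w_up)
print(y == x)
```

With a zero up-projection the adapter starts out as the identity function, so inserting it leaves the frozen model's behavior unchanged until training moves the weights.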
x → LayerNorm → Attention → [Adapter: down → GELU → up] → + x → FFN → [Adapter] → output
How Many Parameters?
For a single adapter with bottleneck size r=64 in a model with d_model=4096:
Down projection: 4096 × 64 = 262,144 parameters
Up projection: 64 × 4096 = 262,144 parameters
Total per adapter: 524,288 parameters
With two adapters per layer (after attention and after FFN) across 32 layers:
Total: 2 × 524,288 × 32 = ~33.5M parameters
Compare to LoRA (r=8, q+v only): ~4.2M parameters. Adapters have roughly 8x more parameters than a typical LoRA setup.
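The arithmetic above is easy to reproduce. The sketch below uses two hypothetical helper functions and ignores bias terms, as the hand calculation does. The LoRA figure assumes square 4096 × 4096 query and value projections, which is what the ~4.2M estimate implies; models with grouped-query attention have a narrower value projection and a smaller count.

```python
def bottleneck_adapter_params(d_model, bottleneck, num_layers, adapters_per_layer=2):
    """Down + up projection weights for every adapter in the model (no biases)."""
    per_adapter = 2 * d_model * bottleneck
    return per_adapter * adapters_per_layer * num_layers


def lora_params(d_in, d_out, rank, num_layers, modules_per_layer):
    """LoRA adds two factors per target module: (rank x d_in) and (d_out x rank)."""
    per_module = rank * (d_in + d_out)
    return per_module * modules_per_layer * num_layers


adapter_total = bottleneck_adapter_params(d_model=4096, bottleneck=64, num_layers=32)
lora_total = lora_params(d_in=4096, d_out=4096, rank=8, num_layers=32, modules_per_layer=2)
ratio = adapter_total / lora_total

print(f"Adapters: {adapter_total:,}")
print(f"LoRA:     {lora_total:,}")
print(f"Ratio:    {ratio:.1f}x")
```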
Code
# Note: HuggingFace PEFT's AdaptionPromptConfig implements LLaMA-Adapter-style
# prompts, not classic Houlsby bottleneck adapters.
# This example uses the AdapterHub `adapters` library (the successor to
# adapter-transformers) and assumes a release that supports the chosen model.
# pip install adapters
from adapters import AutoAdapterModel, BnConfig

model = AutoAdapterModel.from_pretrained("mistralai/Mistral-7B-v0.1")
adapter_config = BnConfig(
    mh_adapter=True,        # adapter after multi-head attention
    output_adapter=True,    # adapter after feed-forward
    reduction_factor=64,    # d_model / reduction_factor = bottleneck size
    non_linearity="relu",
)
model.add_adapter("task_adapter", config=adapter_config)
model.train_adapter("task_adapter")  # freezes the base model; only the adapter trains
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
When to Use Adapters
Adapters have one meaningful advantage over LoRA: the bottleneck nonlinearity. LoRA’s update ΔW = A×B is purely linear (no activation function). Adapter layers introduce a nonlinear transformation, which can theoretically capture more complex adaptation patterns.
In practice, this matters for:
- Highly structured output tasks: code generation with complex syntax, structured JSON generation
- Multi-task learning: multiple adapters can be stacked, enabling task composition
- Tasks requiring significant behavioral shifts: when the base model behavior is very far from the target
The inference overhead is the key downside: adapters add extra compute at every layer during inference, whereas LoRA adapters can be merged into the base model weights with zero inference overhead (covered in the deployment lesson).
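To see why a LoRA update can be merged at no inference cost, here is a small sketch with integer matrices. It assumes the common convention ΔW = B·A, with B of shape d × r and A of shape r × d; the matrices and helper functions are made up for the example.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(matrix, vector):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def add(a, b):
    """Element-wise sum of two matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


W = [[1, 2, 0], [0, 1, 3], [2, 0, 1]]  # frozen base weight, 3 x 3
B = [[1], [0], [2]]                    # LoRA factor, 3 x 1 (rank 1)
A = [[1, -1, 2]]                       # LoRA factor, 1 x 3
x = [1, 2, 3]

# Unmerged: run the base layer and the low-rank branch separately.
unmerged = [base + low for base, low in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merged: fold the update into the weight once, then run one matmul.
W_merged = add(W, matmul(B, A))
merged = matvec(W_merged, x)

print(unmerged, merged)
```

Both paths give the same output because the update is linear. A bottleneck adapter cannot be folded into W this way, since the nonlinearity sits between its two projections.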
Method 2: Prefix Tuning
What It Does
Prefix tuning (Li & Liang, 2021) takes a completely different approach. Instead of modifying the model architecture or weight matrices, it prepends learnable “virtual tokens” to the attention keys and values of every transformer layer.
Think of it this way: when you do few-shot prompting, you prepend example inputs and outputs to steer the model’s behavior. Prefix tuning learns optimal “virtual prompt” embeddings that are injected directly into the key-value matrices at every layer — not just at the input layer.
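The effect on a single attention step can be sketched in plain Python. This is a toy, single-head illustration with hand-picked numbers, not how any library implements prefix tuning; in practice the prefix keys and values are trained parameters, one set per layer.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Keys and values computed from the real input tokens.
token_keys = [[1.0, 0.0], [0.0, 1.0]]
token_values = [[1.0, 2.0], [3.0, 4.0]]
# One learned prefix position (values chosen by hand for the example).
prefix_keys = [[0.5, 0.5]]
prefix_values = [[10.0, -10.0]]
query = [1.0, 1.0]

plain_out, plain_w = attend(query, token_keys, token_values)
prefix_out, prefix_w = attend(query, prefix_keys + token_keys, prefix_values + token_values)

print("without prefix:", plain_out)
print("with prefix:   ", prefix_out)
```

The query now spreads its attention over three positions instead of two, so the learned prefix value pulls the output away from what the real tokens alone would produce.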
# Standard attention: attends to [user tokens]
# Prefix tuning attention: attends to [prefix tokens] + [user tokens]
# The prefix tokens are not real words; they are learnable continuous vectors
How Many Parameters?
For a prefix of length 20 virtual tokens across 32 layers with d_model=4096:
20 tokens × 32 layers × 4096 × 2 (key + value) = 5,242,880 parameters
Similar to LoRA (r=8), but the parameters are structured differently.
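Because the count is linear in the prefix length, it is worth checking how quickly it moves. The helper below is a hypothetical one-liner that mirrors the hand calculation; it assumes keys and values are both d_model wide and that no reparameterization MLP is trained.

```python
def prefix_tuning_params(num_virtual_tokens, num_layers, d_model):
    """One key vector and one value vector per virtual token per layer."""
    return num_virtual_tokens * num_layers * d_model * 2


total = prefix_tuning_params(num_virtual_tokens=20, num_layers=32, d_model=4096)
print(f"20 tokens: {total:,}")

# Sweep a few prefix lengths to show the linear scaling.
for length in (5, 10, 50):
    print(f"{length} tokens: {prefix_tuning_params(length, 32, 4096):,}")
```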
Code
from peft import PrefixTuningConfig, get_peft_model, TaskType
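# `model` is the base causal LM loaded earlier
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,     # number of prefix tokens
    prefix_projection=True,    # reparameterize the prefix through an MLP (more expressive, but the MLP adds trainable parameters)
)
model = get_peft_model(model, prefix_config)
model.print_trainable_parameters()
# Reported output: trainable params: 7,864,320 || all params: 7,249,596,416 || trainable%: 0.1085
# The printed count differs from the 5,242,880 hand estimate: it depends on the model and on prefix_projection.
When to Use Prefix Tuning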
Prefix tuning was originally developed for seq2seq tasks (T5, BART) and shows its best results there. For causal LMs, it tends to underperform LoRA on instruction-following tasks.
Best use cases:
- Seq2seq summarization and translation: prefix tuning was designed for these
- Tasks where you want to stack multiple adapters at inference time without the overhead of actual inserted layers
- Extremely low resource budgets, using a short prefix: the parameter count scales linearly with prefix length, so a prefix of a few tokens falls well below LoRA’s count
Limitation: prefix tuning is sensitive to the prefix length hyperparameter and can be unstable to train. The model can sometimes ignore the prefix entirely, especially for shorter texts.
Method 3: IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)
What It Does
IA³ (Liu et al., 2022) is the most parameter-efficient method in this comparison. The key idea: instead of adding new matrices (LoRA) or new modules (adapters), IA³ multiplies the activations at specific points in the transformer by learned scaling vectors.
Three locations are scaled:
- Key activations in attention (element-wise multiply)
- Value activations in attention (element-wise multiply)
- FFN intermediate activations (element-wise multiply)
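The element-wise scaling is simple enough to show directly. The sketch below uses a made-up 3 × 2 key projection and hypothetical helper names. It also illustrates two properties that are not stated in the lesson text and are assumptions here: a scaling vector of ones leaves the model unchanged, and a learned vector can be folded into the projection weights by scaling their rows.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def ia3_scale(activations, scale):
    """IA3-style rescaling: multiply each activation by its learned factor."""
    return [a * s for a, s in zip(activations, scale)]


W_k = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy key projection, 2 -> 3
x = [1.0, -1.0]
keys = matvec(W_k, x)

ones = [1.0, 1.0, 1.0]  # a vector of ones is the identity
l_k = [0.5, 2.0, 1.5]   # example "learned" factors, chosen by hand

unchanged = ia3_scale(keys, ones)
scaled = ia3_scale(keys, l_k)

# Merging: scale each output row of the weight once, then drop the extra multiply.
W_k_merged = [[w * s for w in row] for row, s in zip(W_k, l_k)]
merged = matvec(W_k_merged, x)

print(keys, scaled, merged)
```

Each scaling vector adds only as many parameters as the activation has features, which is why the totals in the next section are so small.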
# Conceptually:
# Standard: attention_output = softmax(Q @ K.T) @ V
# IA³: attention_output = softmax(Q @ (l_k * K).T) @ (l_v * V)
# where l_k and l_v are learned vectors with one entry per key/value feature (the output width of k_proj / v_proj)
How Many Parameters?
For Mistral-7B with d_model=4096 across 32 layers. Mistral-7B uses grouped-query attention with 8 key/value heads of dimension 128, so k_proj and v_proj each output 1,024 features:
Key scaling: 1,024 × 32 = 32,768 parameters (key projection width × layers)
Value scaling: 1,024 × 32 = 32,768 parameters
FFN scaling: 14,336 × 32 = 458,752 parameters (FFN intermediate dim × layers)
Total: 524,288 parameters
That is roughly 8x fewer parameters than LoRA (r=8, q+v only). At this scale, the adapter file is under 3 MB.
Code
from peft import IA3Config, get_peft_model, TaskType
ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],  # subset of target_modules treated as feedforward (their input, not their output, is scaled)
)
model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()
# Expected for Mistral-7B: trainable params: 524,288 || all params: 7,242,256,384 || trainable%: 0.0072
# (1,024 + 1,024 + 14,336 scaling values per layer × 32 layers)
When to Use IA³
IA³ shines for:
- Extremely data-limited scenarios: when you have fewer than 100 training examples, IA³’s tiny parameter count reduces overfitting risk dramatically
- Continual learning / rapid task switching: because IA³ adapters are so small, you can keep hundreds of them in memory simultaneously
- Resource-constrained edge deployment: every MB matters when deploying to edge hardware
The quality trade-off: IA³ typically scores 3–8% below LoRA on standard instruction-following benchmarks. For tasks where “pretty good” adaptation is acceptable and you care deeply about parameter budget, IA³ is the right choice.
Comparison Table
| Method | Trainable Params (7B model) | Memory Overhead | Inference Overhead | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | 7.24B | ~87 GB | None | Max quality, large GPU budgets |
| LoRA (r=8, q+v) | 4.2M | ~14.6 GB | None (mergeable) | Most instruction-tuning tasks |
| LoRA (r=16, all) | 33M | ~15 GB | None (mergeable) | Complex tasks, more data |
| Adapter Layers | 33M | ~15 GB | 5–15% slower | Structured outputs, multi-task |
| Prefix Tuning | 7.9M | ~14.6 GB | Minor | Seq2seq, translation |
| IA³ | 0.52M | ~14.5 GB | None (mergeable) | Few-shot, edge deployment |
Multi-Adapter Composition
One advanced use case that makes alternatives to LoRA attractive: running multiple adapters simultaneously. HuggingFace PEFT supports loading multiple LoRA adapters and switching between them or combining them.
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(model_name, ...)
# Load multiple adapters
model = PeftModel.from_pretrained(base_model, "./adapter-task-a", adapter_name="task_a")
model.load_adapter("./adapter-task-b", adapter_name="task_b")
# Switch adapters at inference time
model.set_adapter("task_a")
output_a = model.generate(...)
model.set_adapter("task_b")
output_b = model.generate(...)
# Or combine adapters with a weighted sum (LoRA addition);
# the "linear" combination expects both adapters to use the same rank
model.add_weighted_adapter(
    adapters=["task_a", "task_b"],
    weights=[0.7, 0.3],
    adapter_name="combined",
    combination_type="linear",
)
model.set_adapter("combined")
This is powerful for production systems that serve multiple specialized tasks from one base model, with per-request adapter selection based on the user’s context or subscription tier.
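Per-request selection can be as simple as a lookup followed by `set_adapter`. The sketch below is hypothetical throughout: `StubMultiAdapterModel` stands in for a loaded PEFT model and mimics only `set_adapter()` and `generate()`, and the tier-to-adapter mapping and `handle_request` function are invented for illustration.

```python
class StubMultiAdapterModel:
    """Hypothetical stand-in for a model with several named adapters loaded."""

    def __init__(self, adapter_names):
        self.adapter_names = set(adapter_names)
        self.active = None

    def set_adapter(self, name):
        if name not in self.adapter_names:
            raise ValueError(f"unknown adapter: {name}")
        self.active = name

    def generate(self, prompt):
        # A real model would return generated tokens; the stub tags the prompt.
        return f"[{self.active}] {prompt}"


# Hypothetical mapping from subscription tier to adapter name.
TIER_TO_ADAPTER = {"free": "task_a", "pro": "task_b"}


def handle_request(model, tier, prompt, default_adapter="task_a"):
    """Activate the adapter for this request's tier, then generate."""
    adapter = TIER_TO_ADAPTER.get(tier, default_adapter)
    model.set_adapter(adapter)
    return model.generate(prompt)


model = StubMultiAdapterModel(["task_a", "task_b"])
print(handle_request(model, "pro", "Summarize this ticket"))
print(handle_request(model, "unknown-tier", "Summarize this ticket"))
```

Note that the active adapter is state on the model object, so a sketch like this assumes requests are handled one at a time or batched by adapter.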
The Practical Decision
For most practitioners reading this course, the decision tree is:
1. Default to LoRA (r=8, target q+v, alpha=16). It works well for the vast majority of instruction-tuning tasks.
2. Try r=16 with all attention modules if LoRA quality is insufficient after training. This costs 8x more parameters but usually closes 50–70% of the gap to full fine-tuning.
3. Consider IA³ only if you have fewer than 200 training examples or are deploying to very memory-constrained environments.
4. Consider Adapter Layers if you need multi-task composition with strict layer-level isolation between tasks, or if you are using the adapter-transformers ecosystem, which has first-class support for adapter stacking.
5. Consider Prefix Tuning primarily for seq2seq models (T5, BART). For causal LMs, LoRA is almost always superior.
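The decision tree can be written down as a small function. This is one possible reading of the guidance above, not a definitive rule: the function name, the argument names, and especially the order in which the special cases are checked are assumptions.

```python
def choose_peft_method(
    num_examples,
    *,
    memory_constrained=False,
    needs_task_isolation=False,
    is_seq2seq=False,
    lora_r8_sufficient=True,
):
    """Return a suggested PEFT method following the lesson's decision tree."""
    # Very little data or a tight memory budget: smallest parameter count.
    if num_examples < 200 or memory_constrained:
        return "IA3"
    # Multi-task composition with strict isolation between tasks.
    if needs_task_isolation:
        return "Adapter Layers"
    # Encoder-decoder models such as T5 or BART.
    if is_seq2seq:
        return "Prefix Tuning"
    # LoRA default, with a larger configuration as the fallback.
    if not lora_r8_sufficient:
        return "LoRA (r=16, all attention modules)"
    return "LoRA (r=8, q+v, alpha=16)"


print(choose_peft_method(5000))
print(choose_peft_method(150))
print(choose_peft_method(5000, lora_r8_sufficient=False))
```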
The field continues to evolve — new PEFT methods appear regularly. But the underlying principle is constant: find a low-dimensional parameterization of the adaptation delta that is expressive enough for your task. LoRA found an excellent point in this space, and the alternatives are variations on the same theme.
