📖 Lesson ⏱️ 90 minutes

PEFT: Comparing Adapters, Prefix Tuning, and IA³

Beyond LoRA — a practical comparison of parameter-efficient methods

Why Bother with Alternatives to LoRA?

LoRA is the default choice for parameter-efficient fine-tuning, and for good reason: it is well-understood, widely supported, and performs well across a broad range of tasks. For most practitioners, LoRA is the answer and you can skip this lesson.

But “most” is not “all.” There are task types, resource constraints, and quality requirements where one of the alternative PEFT methods — Adapter Layers, Prefix Tuning, or IA³ — offers a genuine advantage. This lesson gives you enough depth to recognize those situations and make an informed choice.

Method 1: Adapter Layers (Houlsby Adapters)

What They Do

Adapter layers, introduced by Houlsby et al. in 2019 (predating LoRA by two years), insert small trainable modules between the existing layers of a transformer. The original weight matrices are frozen; adapters are plugged in.

The architecture of a single adapter module:

  1. A down-projection layer: reduces dimensionality from d_model to r (e.g., 4096 → 64)
  2. A nonlinearity (ReLU or GELU)
  3. An up-projection layer: restores dimensionality from r to d_model (64 → 4096)
  4. A residual connection: adds the original input to the adapter’s output

x → LayerNorm → Attention → [Adapter: down → GELU → up] → + x → FFN → [Adapter] → output
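
The four steps above can be sketched in a few lines. This is a toy illustration in plain Python, not the reference implementation: the class name `BottleneckAdapter`, the tiny dimensions, the ReLU choice, and the zero initialization of the up-projection are all assumptions made for the example, and biases are left out.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


class BottleneckAdapter:
    """Toy adapter: down-projection -> nonlinearity -> up-projection -> residual."""

    def __init__(self, d_model, r, seed=0):
        rng = random.Random(seed)
        # Down-projection (r x d_model), small random values.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(r)]
        # Up-projection (d_model x r), zeros: an assumption of this sketch so that
        # the untrained adapter behaves as the identity function.
        self.w_up = [[0.0] * r for _ in range(d_model)]

    def num_parameters(self):
        return sum(len(row) for row in self.w_down) + sum(len(row) for row in self.w_up)

    def forward(self, x):
        hidden = relu(matvec(self.w_down, x))          # d_model -> r
        delta = matvec(self.w_up, hidden)              # r -> d_model
        return [xi + di for xi, di in zip(x, delta)]   # residual connection


adapter = BottleneckAdapter(d_model=8, r=2)
x = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 1.0]
print(adapter.forward(x) == x)    # True: the zero up-projection leaves x unchanged
print(adapter.num_parameters())   # 32 = 2 * (8 * 2) weights
```

Starting from an identity mapping means the frozen base model's behavior is preserved at step zero, and training only has to learn the task-specific correction.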

How Many Parameters?

For a single adapter with bottleneck size r=64 in a model with d_model=4096:

Down projection: 4096 × 64 = 262,144 parameters
Up projection:   64 × 4096 = 262,144 parameters
Total per adapter: 524,288 parameters

With two adapters per layer (after attention and after FFN) across 32 layers:

Total: 2 × 524,288 × 32 = ~33.5M parameters

Compare to LoRA (r=8, q+v only): ~4.2M parameters. Adapters have roughly 8x more parameters than a typical LoRA setup.
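
The arithmetic above is easy to reproduce for other bottleneck sizes and ranks. The helper names below are hypothetical, and the sketch assumes square d_model × d_model projections and ignores bias terms, as the calculation above does.

```python
def adapter_param_count(d_model, bottleneck, n_layers, adapters_per_layer=2):
    """Weights in bottleneck adapters: one down- and one up-projection each."""
    per_adapter = 2 * d_model * bottleneck
    return per_adapter * adapters_per_layer * n_layers


def lora_param_count(d_model, rank, n_layers, n_target_modules=2):
    """Weights in LoRA: two low-rank factors per targeted projection."""
    per_module = 2 * d_model * rank
    return per_module * n_target_modules * n_layers


adapters = adapter_param_count(d_model=4096, bottleneck=64, n_layers=32)
lora = lora_param_count(d_model=4096, rank=8, n_layers=32)  # q and v only

print(f"Adapters: {adapters:,}")        # Adapters: 33,554,432
print(f"LoRA:     {lora:,}")            # LoRA:     4,194,304
print(f"Ratio:    {adapters / lora}")   # Ratio:    8.0
```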

Code

# Note: HuggingFace PEFT's AdaptionPromptConfig ("ADAPTION_PROMPT") implements
# LLaMA-Adapter, which is a different method from Houlsby-style bottleneck adapters.
# For classic bottleneck adapters, use the `adapters` library (the successor to
# adapter-transformers) or insert the modules manually.

# pip install adapters
import adapters
from adapters import BnConfig
from transformers import AutoModelForCausalLM

# Assumes an `adapters` release that supports the Mistral architecture
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
adapters.init(model)  # add adapter support to the plain Transformers model

adapter_config = BnConfig(
    mh_adapter=True,        # adapter after multi-head attention
    output_adapter=True,    # adapter after feed-forward
    reduction_factor=64,    # d_model / reduction_factor = bottleneck size (4096 / 64 = 64)
    non_linearity="relu",
)

model.add_adapter("task_adapter", config=adapter_config)
model.train_adapter("task_adapter")  # freeze the base model; train only the adapter

print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

When to Use Adapters

Adapters have one meaningful advantage over LoRA: the bottleneck nonlinearity. LoRA’s update ΔW = BA is the product of two low-rank matrices with no activation function between them, so it is purely linear. Adapter layers introduce a nonlinear transformation, which can theoretically capture more complex adaptation patterns.

In practice, this matters for:

  • Highly structured output tasks: code generation with complex syntax, structured JSON generation
  • Multi-task learning: multiple adapters can be stacked, enabling task composition
  • Tasks requiring significant behavioral shifts: when the base model behavior is very far from the target

The inference overhead is the key downside: adapters add extra compute at every layer during inference, whereas LoRA adapters can be merged into the base model weights with zero inference overhead (covered in the deployment lesson).
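
Why LoRA can be merged while a bottleneck adapter cannot comes down to linearity. A minimal sketch with hand-picked 3×3 numbers, writing the update as B·A (B is d×r, A is r×d, with r=1 here; the variable names are illustrative):

```python
def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matadd(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0],
     [2.0, 0.0, 1.0]]      # frozen base weight
B = [[1.0], [0.0], [2.0]]  # LoRA factor, shape 3 x 1
A = [[0.5, -1.0, 1.0]]     # LoRA factor, shape 1 x 3
x = [1.0, 2.0, 3.0]

# Unmerged: two extra small matmuls on every forward pass.
unmerged = [wx + d for wx, d in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Merged: fold the update into the weight once, then run a single matmul.
W_merged = matadd(W, matmul(B, A))
merged = matvec(W_merged, x)

print(unmerged)  # [6.5, -1.0, 8.0]
print(merged)    # [6.5, -1.0, 8.0]
```

The same trick fails for a bottleneck adapter: with a ReLU or GELU between the down- and up-projection, there is no single matrix that reproduces the adapter's output for every input, so the extra layers must stay in the forward pass.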

Method 2: Prefix Tuning

What It Does

Prefix tuning (Li & Liang, 2021) takes a completely different approach. Instead of modifying the model architecture or weight matrices, it prepends learnable “virtual tokens” to the input of every transformer layer.

Think of it this way: when you do few-shot prompting, you prepend example inputs and outputs to steer the model’s behavior. Prefix tuning learns optimal “virtual prompt” embeddings that are injected directly into the key-value matrices at every layer — not just at the input layer.

# Standard attention: attends to [user tokens]
# Prefix tuning attention: attends to [prefix tokens] + [user tokens]
# The prefix tokens are not real words — they are learnable continuous vectors
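
To make the mechanism concrete, here is a small single-query attention sketch in plain Python. The prefix key and value are set by hand purely for illustration; in real prefix tuning they are learned by gradient descent, one set per layer. The function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Keys and values computed from the real input tokens (frozen model).
user_keys = [[1.0, 0.0], [0.0, 1.0]]
user_values = [[1.0, 0.0], [0.0, 1.0]]

# One virtual token: trainable in practice, fixed by hand in this sketch.
prefix_keys = [[2.0, 2.0]]
prefix_values = [[5.0, 5.0]]

query = [1.0, 1.0]

plain_out, plain_weights = attend(query, user_keys, user_values)
prefixed_out, prefixed_weights = attend(
    query, prefix_keys + user_keys, prefix_values + user_values
)

print(plain_weights)      # two weights, one per user token
print(prefixed_weights)   # three weights; the first belongs to the prefix
print(prefixed_out)       # output is pulled toward the prefix value
```

The model's weights never change. The prefix steers the output only by giving attention something extra to attend to, which also explains the failure mode where the model learns to assign the prefix almost no attention weight.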

How Many Parameters?

For a prefix of length 20 virtual tokens across 32 layers with d_model=4096:

20 tokens × 32 layers × 4096 × 2 (key + value) = 5,242,880 parameters

Similar to LoRA (r=8), but the parameters are structured differently.

Code

from peft import PrefixTuningConfig, get_peft_model, TaskType

prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,    # number of prefix tokens
    prefix_projection=True,   # use an MLP to project prefix embeddings (more expressive)
)

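model = get_peft_model(model, prefix_config)  # `model`: a base causal LM loaded beforehand
model.print_trainable_parameters()
# The printed count depends on the model and on prefix_projection.
# With prefix_projection=False it is the raw prefix size computed above
# (20 × 32 × 4096 × 2 = 5,242,880 for full-width key/value projections).
# With prefix_projection=True the MLP's weights are trainable as well,
# so the number reported during training is considerably larger.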

When to Use Prefix Tuning

Prefix tuning was originally developed for seq2seq tasks (T5, BART) and shows its best results there. For causal LMs, it tends to underperform LoRA on instruction-following tasks.

Best use cases:

  • Seq2seq summarization and translation: prefix tuning was designed for these
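  • Tasks where you want to swap or combine several task-specific prefixes at inference time without inserting extra layers into the model
  • Very small parameter budgets: the count scales linearly with prefix length, so a short prefix (for example, 5 virtual tokens, about 1.3M parameters in the setup above) comes in well under LoRA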

Limitation: prefix tuning is sensitive to the prefix length hyperparameter and can be unstable to train. The model can sometimes ignore the prefix entirely, especially for shorter texts.

Method 3: IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations)

What It Does

IA³ (Liu et al., 2022) is the most parameter-efficient method in this comparison. The key idea: instead of adding new matrices (LoRA) or new modules (adapters), IA³ multiplies the activations at specific points in the transformer by learned scaling vectors.

Three locations are scaled:

  1. Key activations in attention (element-wise multiply)
  2. Value activations in attention (element-wise multiply)
  3. FFN intermediate activations (element-wise multiply)
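
# Conceptually:
# Standard: attention_output = softmax(Q @ K.T) @ V
# IA³:      attention_output = softmax(Q @ (l_k * K).T) @ (l_v * V)
#           ffn_output       = W_2 applied to (l_ff * h), where h is the FFN intermediate activation
# l_k and l_v are learned vectors as wide as the key/value projection outputs;
# l_ff is as wide as the FFN intermediate layer. The usual 1/sqrt(d_k) factor is omitted.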
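
A small numeric sketch of the scaling idea, using a hand-written 3×2 key projection. The function names are hypothetical. It illustrates two properties: a scaling vector of ones leaves the frozen model untouched, and a trained vector can be folded into the projection weights, so no extra work remains at inference time.

```python
def linear(weight, x):
    """Apply a frozen projection (list of rows) to an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def ia3_scaled_output(weight, scale, x):
    """Unmerged IA³: run the frozen projection, then rescale each output feature."""
    return [s * y for s, y in zip(scale, linear(weight, x))]


def merge_ia3(weight, scale):
    """Fold the scaling vector into the weights: row i is multiplied by scale[i]."""
    return [[s * w for w in row] for s, row in zip(scale, weight)]


w_k = [[1.0, 2.0],
       [0.5, -1.0],
       [3.0, 0.0]]          # frozen key projection: 2 inputs -> 3 outputs
x = [2.0, 1.0]

ones = [1.0, 1.0, 1.0]      # identity scaling: nothing changes
l_k = [0.5, 2.0, 1.0]       # a "trained" scaling vector, chosen by hand here

print(linear(w_k, x))                    # [4.0, 0.0, 6.0]
print(ia3_scaled_output(w_k, ones, x))   # [4.0, 0.0, 6.0]
print(ia3_scaled_output(w_k, l_k, x))    # [2.0, 0.0, 6.0]
print(linear(merge_ia3(w_k, l_k), x))    # [2.0, 0.0, 6.0]
```

For the feed-forward vector the scaling applies to the input of the down-projection rather than its output, so merging multiplies columns instead of rows; the principle is the same.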

How Many Parameters?

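For Mistral-7B (32 layers, FFN intermediate size 14,336). Mistral uses grouped-query attention with 8 key/value heads of dimension 128, so the key and value projections output 1,024 features rather than d_model=4096:

Key scaling:   1,024 × 32 = 32,768 parameters   (key projection width × layers)
Value scaling: 1,024 × 32 = 32,768 parameters   (value projection width × layers)
FFN scaling:   14,336 × 32 = 458,752 parameters (FFN intermediate dim × layers)
Total:         524,288 parameters

That is roughly 8x fewer parameters than LoRA (r=8, q+v only). At this scale, the adapter file is under 3 MB.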

Code

from peft import IA3Config, get_peft_model, TaskType

ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],  # which modules are feedforward (scaled differently)
)

model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()
# Expected for Mistral-7B: 524,288 trainable params = 32 layers × (1,024 + 1,024 + 14,336), about 0.007% of the model

When to Use IA³

IA³ shines for:

  • Extremely data-limited scenarios: when you have fewer than 100 training examples, IA³’s tiny parameter count reduces overfitting risk dramatically
  • Continual learning / rapid task switching: because the adapters are so small, you can keep hundreds of IA³ adapters in memory simultaneously
  • Resource-constrained edge deployment: every MB matters when deploying to edge hardware

The quality trade-off: IA³ typically scores 3–8% below LoRA on standard instruction-following benchmarks. For tasks where “pretty good” adaptation is acceptable and you care deeply about parameter budget, IA³ is the right choice.

Comparison Table

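| Method                | Trainable Params (7B model) | Memory Overhead | Inference Overhead | Best For                       |
|-----------------------|-----------------------------|-----------------|--------------------|--------------------------------|
| Full Fine-Tuning      | 7.24B                       | ~87 GB          | None               | Max quality, large GPU budgets |
| LoRA (r=8, q+v)       | 4.2M                        | ~14.6 GB        | None (mergeable)   | Most instruction-tuning tasks  |
| LoRA (r=16, q+k+v+o)  | 16.8M                       | ~15 GB          | None (mergeable)   | Complex tasks, more data       |
| Adapter Layers        | 33M                         | ~15 GB          | 5–15% slower       | Structured outputs, multi-task |
| Prefix Tuning         | 5.2M (20 tokens)            | ~14.6 GB        | Minor              | Seq2seq, translation           |
| IA³                   | 0.52M                       | ~14.5 GB        | None (mergeable)   | Few-shot, edge deployment      |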

Multi-Adapter Composition

One advanced use case applies to LoRA and its alternatives alike: keeping several adapters loaded on a single base model. HuggingFace PEFT supports loading multiple LoRA adapters, switching between them, or combining them.

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(model_name, ...)

# Load multiple adapters
model = PeftModel.from_pretrained(base_model, "./adapter-task-a", adapter_name="task_a")
model.load_adapter("./adapter-task-b", adapter_name="task_b")

# Switch adapters at inference time
model.set_adapter("task_a")
output_a = model.generate(...)

model.set_adapter("task_b")
output_b = model.generate(...)

# Or combine LoRA adapters with a weighted sum
# ("linear" requires all combined adapters to share the same rank)
model.add_weighted_adapter(
    adapters=["task_a", "task_b"],
    weights=[0.7, 0.3],
    adapter_name="combined",
    combination_type="linear",
)
model.set_adapter("combined")

This is powerful for production systems that serve multiple specialized tasks from one base model, with per-request adapter selection based on the user’s context or subscription tier.
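
Per-request selection can be as simple as a lookup table in front of `set_adapter`. The sketch below is hypothetical throughout: `AdapterRouter`, `handle_request`, the tier names, and `StubModel` (a stand-in so the example runs without loading any weights) are inventions for illustration. A real PeftModel's `generate` takes tokenized inputs rather than a raw string.

```python
class AdapterRouter:
    """Maps a request attribute (here: subscription tier) to an adapter name."""

    def __init__(self, default_adapter):
        self.default_adapter = default_adapter
        self.routes = {}

    def register(self, tier, adapter_name):
        self.routes[tier] = adapter_name

    def resolve(self, tier):
        # Unknown tiers fall back to the default adapter.
        return self.routes.get(tier, self.default_adapter)


class StubModel:
    """Stand-in exposing the two methods used here: set_adapter and generate."""

    def __init__(self):
        self.active_adapter = None

    def set_adapter(self, name):
        self.active_adapter = name

    def generate(self, prompt):
        return f"[{self.active_adapter}] {prompt}"


def handle_request(model, router, tier, prompt):
    """Activate the adapter for this request, then generate."""
    model.set_adapter(router.resolve(tier))
    return model.generate(prompt)


router = AdapterRouter(default_adapter="task_a")
router.register("premium", "task_b")

model = StubModel()
print(handle_request(model, router, "premium", "Summarize this ticket"))
print(handle_request(model, router, "free", "Summarize this ticket"))
```

Because `set_adapter` changes state on a shared model object, a server handling concurrent requests would need to serialize the switch-and-generate step or batch requests by adapter; that coordination is outside the scope of this sketch.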

The Practical Decision

For most practitioners reading this course, the decision tree is:

  1. Default to LoRA (r=8, target q+v, alpha=16). It works well for the vast majority of instruction-tuning tasks.

  2. Try r=16 with all attention modules (q, k, v, o) if LoRA quality is insufficient after training. Doubling both the rank and the number of target modules costs 4x more parameters (about 16.8M) but usually closes 50–70% of the gap to full fine-tuning.

  3. Consider IA³ only if you have fewer than 200 training examples or are deploying to very memory-constrained environments.

  4. Consider Adapter Layers if you need multi-task composition with strict layer-level isolation between tasks, or if you are using the adapter-transformers ecosystem which has first-class support for adapter stacking.

  5. Consider Prefix Tuning primarily for seq2seq models (T5, BART). For causal LMs, LoRA is almost always superior.
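
The list above can be written down as a small function, which is sometimes handy as a checklist in a training script. The function name, argument names, and returned labels are hypothetical, and the order in which the conditions are checked is an assumption of this sketch: the list does not say which rule wins when several apply.

```python
def choose_peft_method(
    model_type="causal_lm",
    n_train_examples=10_000,
    memory_constrained=False,
    needs_task_isolation=False,
    default_lora_sufficient=True,
):
    """Return a label for the PEFT setup suggested by the decision list."""
    # Very little data or a tight memory budget: the smallest method.
    if n_train_examples < 200 or memory_constrained:
        return "ia3"
    # Strict layer-level isolation between tasks: bottleneck adapters.
    if needs_task_isolation:
        return "adapter_layers"
    # Encoder-decoder models such as T5 or BART: prefix tuning is worth a look.
    if model_type == "seq2seq":
        return "prefix_tuning"
    # Default LoRA was tried and fell short: raise rank, widen the targets.
    if not default_lora_sufficient:
        return "lora_r16_all_attention"
    # Everything else: LoRA with r=8 on q and v, alpha=16.
    return "lora_r8_qv"


print(choose_peft_method())                                # lora_r8_qv
print(choose_peft_method(n_train_examples=150))            # ia3
print(choose_peft_method(needs_task_isolation=True))       # adapter_layers
print(choose_peft_method(model_type="seq2seq"))            # prefix_tuning
print(choose_peft_method(default_lora_sufficient=False))   # lora_r16_all_attention
```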

The field continues to evolve — new PEFT methods appear regularly. But the underlying principle is constant: find a low-dimensional parameterization of the adaptation delta that is expressive enough for your task. LoRA found an excellent point in this space, and the alternatives are variations on the same theme.