📖 Lesson ⏱️ 120 minutes

Merging and Deploying Fine-Tuned Models

Merge LoRA weights, quantize for inference, and serve with vLLM

What You Have After Training

When your QLoRA training completes, you have two separate artifacts:

  1. The base model — unchanged. Mistral-7B in 4-bit quantization, ~3.9 GB on disk (or float16, ~14.5 GB).
  2. The LoRA adapter — a small file containing your trained A and B matrices, ~20–100 MB.
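To see why the adapter is so small, the sketch below counts LoRA parameters for one hypothetical configuration: rank 16 applied to the q_proj and v_proj matrices of all 32 Mistral-7B layers. The rank and the target modules are assumptions for illustration (substitute the values from your own training config); the layer shapes follow the published Mistral-7B architecture.

```python
# Rough LoRA adapter size estimate. Rank and target modules are assumed,
# not taken from this lesson's training run.
NUM_LAYERS = 32
# (d_in, d_out) of the adapted projections in one Mistral-7B layer
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "v_proj": (4096, 1024),  # grouped-query attention: 8 KV heads x 128 dims
}

def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted matrix adds rank * (d_in + d_out) trainable parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers

params = lora_param_count(rank=16)
size_mb = params * 4 / 1e6  # assuming the adapter is saved in float32 (4 bytes/param)
print(f"{params:,} parameters, about {size_mb:.0f} MB")
```

Under these assumptions the adapter holds a few million parameters, which is consistent with the 20–100 MB range above. Higher ranks or more target modules scale the size linearly.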

These two artifacts work together at inference time: the base model computes each layer's output as usual, and the adapter adds a low-rank correction to the adapted layers to produce your fine-tuned behavior. Both must be available when you serve the model.

This two-artifact structure presents a choice at deployment time. You can serve them separately (flexible) or merge them into one (simpler, slightly faster).

Option 1: Serving with a Separate Adapter

Serving the adapter on top of the base model is the most flexible approach. It allows you to:

  • Swap different adapters at runtime (task A, task B, task C)
  • Maintain one large base model and many small adapters
  • A/B test different adapter versions

Loading the adapter on top of the base model with PEFT looks like this:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_name = "mistralai/Mistral-7B-v0.1"
adapter_path = "./qlora-mistral-adapter"

# Load base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load adapter on top
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()

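# Run inference
prompt = "[INST] What is our refund policy? [/INST]"
# Move inputs to wherever device_map="auto" placed the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,  # greedy decoding; sampling parameters such as temperature do not apply
    )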

response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

The adapter adds two small low-rank matrix multiplications to every adapted projection. For most serving workloads this overhead is small (roughly 2–5% slower than the merged model), but measure it on your own hardware before relying on that figure.

Option 2: Merging the Adapter into the Base Model

Merging combines the LoRA adapter mathematically into the base model weights. After merging, the resulting model is a standard transformer with no PEFT dependencies — you can serve it with any inference framework.

The math: recall that during inference, the output of a LoRA-adapted layer is:

h = W × x + (A × B) × x × (alpha / r)

Merging is simply computing W_merged = W + A × B × (alpha / r) offline, so the inference path becomes:

h = W_merged × x

The separate adapter path disappears, so a merged layer costs exactly as much as the original base layer: effectively zero overhead.
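The equivalence can be checked numerically on a toy layer. The sketch below uses plain Python lists and the notation of the formulas above, assuming A has shape d_out × r and B has shape r × d_in so that A × B matches the shape of W. All values are made up for illustration; note that PEFT names its stored matrices differently, so this is a check of the algebra, not of the library internals.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    """Multiply a matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Toy layer: d_out = 2, d_in = 3, rank r = 1 (illustrative values only)
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]
A = [[0.5], [-1.0]]        # d_out x r
B = [[2.0, 0.0, 4.0]]      # r x d_in
alpha, r = 2.0, 1
scale = alpha / r
x = [1.0, 2.0, 3.0]

# Unmerged path: base output plus the scaled low-rank correction
base_out = matvec(W, x)
lora_out = matvec(A, matvec(B, x))
h_unmerged = [b + scale * l for b, l in zip(base_out, lora_out)]

# Merged path: fold the correction into the weights once, offline
delta = matmul(A, B)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
h_merged = matvec(W_merged, x)

print(h_unmerged, h_merged)
```

Both paths give the same vector for this input. With real float16 weights the two results agree only up to rounding error, which is why a quick comparison of generations before and after merging is a useful sanity check.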

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model_name = "mistralai/Mistral-7B-v0.1"
adapter_path = "./qlora-mistral-adapter"
merged_output_path = "./mistral-7b-merged"

# Load base model in float16 (NOT 4-bit: merge into unquantized weights)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.float16,
    device_map="cpu",  # merge on CPU to avoid VRAM pressure
)

# Load adapter
model_with_adapter = PeftModel.from_pretrained(base_model, adapter_path)

# Merge adapter into base weights
print("Merging adapter into base model...")
merged_model = model_with_adapter.merge_and_unload()

# Verify it is now a standard model (no PEFT wrapper)
print(type(merged_model))  
# <class 'transformers.models.mistral.modeling_mistral.MistralForCausalLM'>

# Save the merged model
print("Saving merged model...")
merged_model.save_pretrained(merged_output_path)
tokenizer.save_pretrained(merged_output_path)

print(f"Saved to {merged_output_path}")
# Files:
# ./mistral-7b-merged/
#   config.json
#   model-00001-of-00003.safetensors  (~4.8 GB each)
#   model-00002-of-00003.safetensors
#   model-00003-of-00003.safetensors
#   model.safetensors.index.json
#   tokenizer.json
#   tokenizer_config.json

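Important: merge into an unquantized base model, not the 4-bit one used for training. Load it with torch_dtype=torch.float16 (no BitsAndBytesConfig). Merging into 4-bit weights forces the result to be re-quantized, which adds rounding error on top of the adapter update. The merged model will also be in float16.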

Post-Merge Quantization for Inference

After merging, you have a 14.5 GB float16 model. For production inference, you want to quantize it for memory efficiency. This is inference-time quantization (different from QLoRA’s training-time quantization).

GGUF Quantization with llama.cpp

GGUF is the format used by llama.cpp, the most widely deployed LLM inference runtime for edge and on-premises deployment. GGUF supports 2-bit through 8-bit quantization and runs efficiently on CPUs as well as GPUs.

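# Install llama.cpp (recent versions build with CMake; older checkouts used `make`)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt  # Python dependencies for the conversion script
cmake -B build && cmake --build build --config Release

# Convert HuggingFace model to GGUF format
# (the merged model was saved one directory above llama.cpp; adjust the path if yours differs)
python convert_hf_to_gguf.py ../mistral-7b-merged --outtype f16 --outfile mistral-merged.f16.gguf

# Quantize to Q4_K_M (4-bit, medium quality; recommended default)
./build/bin/llama-quantize mistral-merged.f16.gguf mistral-merged.q4_k_m.gguf Q4_K_M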

# Q4_K_M file size: ~4.1 GB (down from 14.5 GB float16)

GGUF quantization options:

  • Q2_K: 2-bit, ~2.7 GB, significant quality loss
  • Q4_0: 4-bit basic, ~3.8 GB, good quality
  • Q4_K_M: 4-bit with K-quantization (medium), ~4.1 GB, recommended
  • Q5_K_M: 5-bit, ~5.1 GB, slightly better quality
  • Q8_0: 8-bit, ~7.7 GB, near-lossless, for quality-sensitive applications
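File sizes like these follow from the average number of bits stored per weight. The sketch below is a back-of-the-envelope estimate, assuming roughly 7.24 billion parameters for Mistral-7B and ignoring metadata. It covers float16 and Q8_0 only, because Q8_0 has a fixed block layout (32 int8 weights plus one float16 scale), whereas the K-quant variants mix block types across tensors and do not reduce to a single exact figure.

```python
def model_size_gb(num_params, bits_per_weight):
    """Approximate file size in GB (10^9 bytes), ignoring headers and metadata."""
    return num_params * bits_per_weight / 8 / 1e9

MISTRAL_7B_PARAMS = 7.24e9  # approximate parameter count (assumption)
FLOAT16_BITS = 16
Q8_0_BITS = 8.5  # 34 bytes per block of 32 weights = 8.5 bits per weight

f16_gb = model_size_gb(MISTRAL_7B_PARAMS, FLOAT16_BITS)
q8_gb = model_size_gb(MISTRAL_7B_PARAMS, Q8_0_BITS)
print(f"float16: ~{f16_gb:.1f} GB, Q8_0: ~{q8_gb:.1f} GB")
```

The estimate lands on the ~14.5 GB float16 and ~7.7 GB Q8_0 figures quoted in this lesson. Real files can differ by a few hundred MB depending on the model and the llama.cpp version.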

AWQ Quantization (for GPU serving)

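For GPU-based serving with vLLM, AWQ (Activation-aware Weight Quantization) is a common choice. It uses a small calibration dataset to find the weight channels that matter most for the activations and rescales them before 4-bit rounding, which typically preserves more quality than uncalibrated 4-bit quantization.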

# pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "./mistral-7b-merged"
quant_path = "./mistral-7b-awq"

# Load model for quantization
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization configuration
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,           # 4-bit quantization
    "version": "GEMM"
}

# Run quantization (takes 15–30 minutes; AutoAWQ uses its default calibration set unless you pass calib_data)
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f"AWQ model saved to {quant_path}")
# Size: ~3.9 GB (vs 14.5 GB float16)
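The quant_config fields above (w_bit, zero_point, q_group_size) describe group-wise quantization: weights are split into groups, and each group gets its own scale and zero point. The sketch below shows only that baseline step on a toy group of four weights in plain Python. It is not the AWQ algorithm itself, which additionally rescales channels using calibration activations, and the function names are hypothetical.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group of weights."""
    q_max = 2 ** w_bit - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max or 1.0      # avoid division by zero for constant groups
    zero_point = round(-w_min / scale)
    q = [min(max(round(w / scale) + zero_point, 0), q_max) for w in weights]
    return q, scale, zero_point

def dequantize_group(q, scale, zero_point):
    """Map the stored integers back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]

# A real layer would use groups of q_group_size = 128 weights; 4 keeps the example readable.
group = [-0.30, -0.02, 0.11, 0.45]
q, scale, zero_point = quantize_group(group)
restored = dequantize_group(q, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(q, [round(v, 2) for v in restored], round(max_error, 3))
```

Each weight is stored as an integer between 0 and 15, and the reconstruction error in this example stays within half a quantization step. Smaller groups give tighter scales at the cost of storing more scale and zero-point values.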

Serving with vLLM

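vLLM is a widely used engine for high-throughput LLM serving. It implements PagedAttention, a KV-cache management algorithm that stores the cache in small fixed-size blocks instead of one contiguous buffer per request. This removes most of the memory wasted by over-reservation and fragmentation, so more requests fit into each batch. In the benchmark later in this lesson, that translates to roughly 20x the throughput of naive HuggingFace generation.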
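A simplified accounting example makes the memory saving concrete. The sketch below compares two allocation strategies for three in-flight sequences: reserving a contiguous buffer of max-model-len slots per request versus allocating fixed-size blocks on demand. The block size of 16 tokens and the sequence lengths are assumptions for illustration, and the model ignores everything else vLLM does (prefix sharing, preemption, scheduling).

```python
import math

def contiguous_slots(seq_lens, max_model_len):
    """Naive strategy: every request reserves space for the maximum length."""
    return len(seq_lens) * max_model_len

def paged_slots(seq_lens, block_size=16):
    """Paged strategy: each request holds only the blocks it has started to fill."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

seq_lens = [100, 700, 33]          # current token counts of three requests (made up)
used = sum(seq_lens)
reserved_contiguous = contiguous_slots(seq_lens, max_model_len=4096)
reserved_paged = paged_slots(seq_lens)

print(f"contiguous: {used}/{reserved_contiguous} slots used "
      f"({used / reserved_contiguous:.1%})")
print(f"paged:      {used}/{reserved_paged} slots used "
      f"({used / reserved_paged:.1%})")
```

With paging, the only waste is the unfilled tail of each request's last block, so the freed memory can hold additional concurrent requests.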

Installing and Starting vLLM

pip install vllm

# Serve the merged model
vllm serve ./mistral-7b-merged \
    --dtype float16 \
    --max-model-len 4096 \
    --port 8000

# Serve an AWQ quantized model
vllm serve ./mistral-7b-awq \
    --quantization awq \
    --dtype float16 \
    --max-model-len 4096 \
    --port 8000

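vLLM exposes an OpenAI-compatible API, so any client that works with the OpenAI SDK works with vLLM. The chat endpoint formats messages with the tokenizer's chat template; if your merged tokenizer does not define one, start the server with --chat-template or call the plain completions endpoint instead: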

from openai import OpenAI

# Point to your vLLM server instead of OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require auth by default
)

response = client.chat.completions.create(
    model="./mistral-7b-merged",  # use the model path as the model name
    messages=[
        {"role": "system", "content": "You are a helpful customer support agent."},
        {"role": "user", "content": "My billing invoice shows the wrong amount."}
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)

Throughput Benchmark

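import asyncio
import time
from openai import AsyncOpenAI

async def timed_request(client, prompt):
    """Send one chat request and return (response, latency in seconds)."""
    start = time.perf_counter()
    response = await client.chat.completions.create(
        model="./mistral-7b-merged",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return response, time.perf_counter() - start

async def benchmark_throughput(num_requests=100):
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")

    prompts = ["Explain our return policy in one paragraph."] * num_requests

    # Fire all requests concurrently so vLLM can batch them
    start = time.perf_counter()
    results = await asyncio.gather(*(timed_request(client, p) for p in prompts))
    elapsed = time.perf_counter() - start

    total_output_tokens = sum(r.usage.completion_tokens for r, _ in results)
    throughput = total_output_tokens / elapsed
    # Requests overlap, so average the individually measured latencies
    # instead of dividing wall-clock time by the request count
    mean_latency = sum(latency for _, latency in results) / num_requests

    print(f"Throughput: {throughput:.0f} output tokens/second")
    print(f"Mean latency per request: {mean_latency*1000:.0f} ms")
    print(f"Total time for {num_requests} requests: {elapsed:.1f}s")

asyncio.run(benchmark_throughput())
# Throughput: ~1800 output tokens/second (vLLM on A10G)
# vs ~90 tokens/second (naive HuggingFace generation)
# 20x improvement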

Multi-GPU Serving with Tensor Parallelism

For a 7B model on a machine with two A10G GPUs (24 GB each), use tensor parallelism to split the model across GPUs:

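# --tensor-parallel-size 2: split the model across 2 GPUs
# --max-model-len 8192: the combined memory leaves room for a longer context window
# (comments cannot follow a trailing backslash, so they are kept above the command)
vllm serve ./mistral-7b-merged \
    --dtype float16 \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --port 8000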

For a 70B model, the bfloat16 weights alone take about 140 GB, so a typical setup is 4× A100 80GB:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --dtype bfloat16 \
    --tensor-parallel-size 4 \
    --max-model-len 8192 \
    --port 8000
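The GPU counts above come from a simple memory budget: weights are divided across the tensor-parallel group, and whatever remains on each GPU is available for the KV cache. The sketch below estimates both quantities. It is a rough model under stated assumptions: parameter counts are approximate, the Mistral-7B attention shape (32 layers, 8 KV heads, head dimension 128) is taken from the published architecture, and activations, CUDA overhead, and sliding-window effects are ignored.

```python
def weights_gb_per_gpu(num_params, bytes_per_param, tensor_parallel_size):
    """Weight memory per GPU when the model is split evenly across GPUs."""
    return num_params * bytes_per_param / tensor_parallel_size / 1e9

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (factor 2) stored for every layer, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# 7B model in float16 on 2 GPUs, and 70B model in bfloat16 on 4 GPUs
mistral_per_gpu = weights_gb_per_gpu(7.24e9, 2, tensor_parallel_size=2)
llama70b_per_gpu = weights_gb_per_gpu(70e9, 2, tensor_parallel_size=4)

# KV cache cost of one token for Mistral-7B in 16-bit precision
mistral_kv = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

print(f"Mistral-7B weights per GPU: {mistral_per_gpu:.2f} GB")
print(f"70B weights per GPU:        {llama70b_per_gpu:.1f} GB")
print(f"Mistral-7B KV cache:        {mistral_kv / 1024:.0f} KiB per token")
```

Under these assumptions a full 8192-token sequence needs about 1 GiB of KV cache for Mistral-7B, which shows why spare memory per GPU directly limits how many long requests can run concurrently.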

Serving the LoRA Adapter Dynamically (Advanced)

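vLLM also supports serving a base model together with LoRA adapters that are selected per request, which is the most flexible production architecture:

# Register each adapter under a name at startup.
# --max-lora-rank must be at least the rank r used to train the adapter.
vllm serve mistralai/Mistral-7B-v0.1 \
    --enable-lora \
    --lora-modules customer_support_v2=./qlora-mistral-adapter \
    --max-lora-rank 8 \
    --port 8000

# Select the adapter per request by passing its registered name as the model
response = client.chat.completions.create(
    model="customer_support_v2",
    messages=[{"role": "user", "content": "..."}],
)

Recent vLLM versions can also load and unload adapters while the server is running; check the LoRA documentation for the version you deploy.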

This architecture allows one vLLM instance to serve hundreds of fine-tuned adapters with a single copy of the base model in memory. It is ideal for multi-tenant SaaS products where each customer has a customized model.

Deployment Decision Tree

Do you need to swap adapters dynamically?
├── Yes → Serve adapter separately (vLLM with --enable-lora)
└── No → Merge adapter into base model

    ├── Deploying on GPU?
    │   ├── Need maximum throughput → vLLM + AWQ quantization
    │   └── Need maximum quality → vLLM + float16

    └── Deploying on CPU/edge?
        └── llama.cpp + GGUF Q4_K_M
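The same tree can be written as a small function, which is convenient if the deployment choice is made in a script or CI pipeline. This is a minimal sketch: the function name, parameter names, and returned strings are hypothetical and simply mirror the branches above.

```python
def choose_deployment(swap_adapters, target="gpu", priority="throughput"):
    """Return the deployment option suggested by the decision tree.

    swap_adapters: True if adapters must be switched at runtime
    target:        "gpu" or "cpu" (CPU/edge)
    priority:      "throughput" or "quality" (only used for GPU targets)
    """
    if swap_adapters:
        return "serve adapter separately: vLLM with --enable-lora"
    if target == "gpu":
        if priority == "throughput":
            return "merge, then vLLM + AWQ quantization"
        if priority == "quality":
            return "merge, then vLLM + float16"
        raise ValueError(f"unknown priority: {priority!r}")
    if target == "cpu":
        return "merge, then llama.cpp + GGUF Q4_K_M"
    raise ValueError(f"unknown target: {target!r}")

print(choose_deployment(swap_adapters=False, target="gpu", priority="throughput"))
```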

The most common production setup: merge the adapter, quantize with AWQ, serve with vLLM. This gives you a clean model file, near-maximum quality, and high throughput — the best trade-off for most production workloads.