Merging and Deploying Fine-Tuned Models
Merge LoRA weights, quantize for inference, and serve with vLLM
What You Have After Training
When your QLoRA training completes, you have two separate artifacts:
- The base model — unchanged. Mistral-7B in 4-bit quantization, ~3.9 GB on disk (or float16, ~14.5 GB).
- The LoRA adapter — a small file containing your trained A and B matrices, ~20–100 MB.
These two artifacts work together at inference time: the base model processes each token, and the adapter adds a low-rank correction to the outputs of the adapted layers, which produces your fine-tuned behavior. Unless you merge them, both must be present for serving.
This two-artifact structure presents a choice at deployment time. You can serve them separately (flexible) or merge them into one (simpler, slightly faster).
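The sizes quoted above follow from simple arithmetic on parameter counts. The sketch below reproduces them under stated assumptions: a parameter count of about 7.24 billion for Mistral-7B, an adapter rank of r=16, and LoRA applied only to the query and value projections. None of these values come from your training run, so substitute your own.

```python
# Back-of-the-envelope artifact sizes. All constants are illustrative assumptions.
N_PARAMS = 7_241_732_096      # approximate Mistral-7B parameter count
HIDDEN = 4096                 # model hidden size
KV_DIM = 1024                 # value projection output size (grouped-query attention)
N_LAYERS = 32


def weights_gb(n_params: int, bits_per_param: float) -> float:
    """Size of n_params weights stored at bits_per_param, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


def lora_params(r: int) -> int:
    """Trainable parameters when adapting q_proj and v_proj in every layer."""
    q = r * (HIDDEN + HIDDEN)   # both factors for a 4096 -> 4096 projection
    v = r * (HIDDEN + KV_DIM)   # both factors for a 4096 -> 1024 projection
    return N_LAYERS * (q + v)


fp16_gb = weights_gb(N_PARAMS, 16)
four_bit_gb = weights_gb(N_PARAMS, 4)
adapter_mb = lora_params(r=16) * 4 / 1e6   # assumes the adapter is saved in float32

print(f"float16 base: {fp16_gb:.1f} GB")
print(f"4-bit base:   {four_bit_gb:.1f} GB before quantization metadata")
print(f"adapter r=16: {adapter_mb:.1f} MB")
```

The raw 4-bit figure comes out below the on-disk size because quantized checkpoints also store scaling constants and typically keep some layers at higher precision.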
Option 1: Serving with a Separate Adapter
Serving the adapter on top of the base model is the most flexible approach. It allows you to:
- Swap different adapters at runtime (task A, task B, task C)
- Maintain one large base model and many small adapters
- A/B test different adapter versions
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base_model_name = "mistralai/Mistral-7B-v0.1"
adapter_path = "./qlora-mistral-adapter"
# Load base model
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="auto",
)
# Load adapter on top
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()
# Run inference
prompt = "[INST] What is our refund policy? [/INST]"
# Move inputs to the device that device_map="auto" chose for the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,  # greedy decoding, so no temperature is needed
    )
# Decode only the newly generated tokens, not the prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
The overhead of the unmerged adapter (an extra low-rank matrix path in every adapted layer) is small for most serving workloads: roughly 2–5% slower than the merged model.
Option 2: Merging the Adapter into the Base Model
Merging combines the LoRA adapter mathematically into the base model weights. After merging, the resulting model is a standard transformer with no PEFT dependencies — you can serve it with any inference framework.
The math: recall that during inference, the output of a LoRA-adapted layer is:
h = W × x + (A × B) × x × (alpha / r)
Merging is simply computing W_merged = W + A × B × (alpha / r) offline, so the inference path becomes:
h = W_merged × x
Each adapted layer now does one matrix multiply instead of a base multiply plus a low-rank side path, so the adapter adds effectively zero overhead.
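The equivalence can be checked numerically on a toy layer. The sketch below is pure Python with made-up values; it assumes A has shape (d_out × r) and B has shape (r × d_in) so that A × B matches the formula above. Note that PEFT names its factors the other way around (it multiplies `lora_B` by `lora_A`), so map the names accordingly when reading adapter files.

```python
# Toy check that merging reproduces the adapted layer's output.
# Shapes follow the formula above: A is (d_out x r), B is (r x d_in).

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def scale(X, s):
    """Multiply every entry of X by the scalar s."""
    return [[s * a for a in row] for row in X]


alpha, r = 16, 2
W = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]      # base weight, (d_out=2, d_in=3)
A = [[0.5, -1.0], [2.0, 0.25]]              # (d_out=2, r=2)
B = [[1.0, 0.0, 2.0], [0.0, -1.0, 1.0]]     # (r=2, d_in=3)
x = [[1.0], [2.0], [3.0]]                   # input as a column vector

# Unmerged path: base output plus the scaled low-rank correction
h_adapter = add(matmul(W, x), scale(matmul(A, matmul(B, x)), alpha / r))

# Merged path: fold the correction into the weights once, offline
W_merged = add(W, scale(matmul(A, B), alpha / r))
h_merged = matmul(W_merged, x)

print(h_adapter, h_merged)
```

Both paths give the same vector here; with real float16 weights the two results agree only up to rounding error.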
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
base_model_name = "mistralai/Mistral-7B-v0.1"
adapter_path = "./qlora-mistral-adapter"
merged_output_path = "./mistral-7b-merged"
# Load base model in float16 (NOT 4-bit — merging requires full precision)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="cpu", # merge on CPU to avoid VRAM pressure
)
# Load adapter
model_with_adapter = PeftModel.from_pretrained(base_model, adapter_path)
# Merge adapter into base weights
print("Merging adapter into base model...")
merged_model = model_with_adapter.merge_and_unload()
# Verify it is now a standard model (no PEFT wrapper)
print(type(merged_model))
# <class 'transformers.models.mistral.modeling_mistral.MistralForCausalLM'>
# Save the merged model
print("Saving merged model...")
merged_model.save_pretrained(merged_output_path)
tokenizer.save_pretrained(merged_output_path)
print(f"Saved to {merged_output_path}")
# Files:
# ./mistral-7b-merged/
#     config.json
#     model-00001-of-00003.safetensors  (~4.8 GB each)
#     model-00002-of-00003.safetensors
#     model-00003-of-00003.safetensors
#     model.safetensors.index.json
#     tokenizer.json
#     tokenizer_config.json
Important: merge into a float16 copy of the base model, not the 4-bit one. Load with torch_dtype=torch.float16 and without a BitsAndBytesConfig, because folding the adapter into 4-bit weights means the result has to be quantized again, which costs accuracy. The merged model will also be in float16.
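To see why the merge is done in float16, consider what happens to a small weight update on a coarse grid. The sketch below is a toy: it uses a simple uniform 16-level quantizer with a hypothetical step size, not the NF4 scheme QLoRA uses, and the weight and update values are invented for illustration.

```python
# Toy illustration: a small LoRA update can vanish if the merged weight is
# snapped back onto a coarse 4-bit grid. Uniform quantizer, not NF4.

def quantize(w: float, step: float = 0.125, levels: int = 16) -> float:
    """Round w to the nearest of `levels` evenly spaced grid values."""
    lo = -step * levels / 2                 # smallest representable value
    idx = round((w - lo) / step)            # nearest grid index
    idx = max(0, min(levels - 1, idx))      # clamp to the grid
    return lo + idx * step


w_fp16 = 0.40    # original weight (made-up value)
delta = 0.03     # low-rank update for this weight, scaling already applied

merged_fp16 = w_fp16 + delta                        # merge in full precision
merged_4bit = quantize(quantize(w_fp16) + delta)    # merge, then requantize

print(f"float16 merge keeps the update: {merged_fp16:.2f}")
print(f"4-bit grid value before merge:  {quantize(w_fp16):.3f}")
print(f"4-bit grid value after merge:   {merged_4bit:.3f}")
```

In this example the update is smaller than half a grid step, so the requantized weight lands on the same grid value it started from and the fine-tuning signal for that weight is lost.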
Post-Merge Quantization for Inference
After merging, you have a 14.5 GB float16 model. For production inference, you want to quantize it for memory efficiency. This is inference-time quantization (different from QLoRA’s training-time quantization).
GGUF Quantization with llama.cpp
GGUF is the format used by llama.cpp, a widely used LLM inference runtime for edge and on-premises deployment. GGUF supports 2-bit through 8-bit quantization and runs efficiently on CPUs as well as GPUs.
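# Install llama.cpp (recent versions build with CMake; older checkouts used `make`)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
pip install -r requirements.txt  # dependencies of the conversion script
cmake -B build && cmake --build build --config Release
# Convert HuggingFace model to GGUF format
# (assumes the merged model directory sits next to the llama.cpp checkout)
python convert_hf_to_gguf.py ../mistral-7b-merged --outtype f16 --outfile mistral-merged.f16.gguf
# Quantize to Q4_K_M (4-bit, medium quality; recommended default)
./build/bin/llama-quantize mistral-merged.f16.gguf mistral-merged.q4_k_m.gguf Q4_K_M
# Q4_K_M file size: ~4.1 GB (down from 14.5 GB float16)
GGUF quantization options: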
- Q2_K: 2-bit, ~2.7 GB, significant quality loss
- Q4_0: 4-bit basic, ~3.8 GB, good quality
- Q4_K_M: 4-bit with K-quantization (medium), ~4.1 GB, recommended
- Q5_K_M: 5-bit, ~5.1 GB, slightly better quality
- Q8_0: 8-bit, ~7.7 GB, near-lossless, for quality-sensitive applications
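The simpler formats in this list can be sized by hand, because each block stores 32 quantized weights plus one float16 scale. The sketch below estimates sizes that way; it assumes every tensor uses the same block format and reuses an approximate Mistral-7B parameter count, so treat the output as a rough check and not as exact file sizes.

```python
# Estimate GGUF file sizes from the block layout of the basic formats.
# Assumes one block format for all tensors, which real files do not strictly follow.
N_PARAMS = 7_241_732_096   # approximate Mistral-7B parameter count (assumption)


def bits_per_weight(weight_bits: int, block_size: int = 32, scale_bits: int = 16) -> float:
    """Effective bits per weight for a block of quantized values plus one scale."""
    return (weight_bits * block_size + scale_bits) / block_size


def size_gb(n_params: int, bpw: float) -> float:
    """File size in decimal GB for n_params weights at bpw bits each."""
    return n_params * bpw / 8 / 1e9


q4_0 = bits_per_weight(4)   # 32 four-bit weights + one float16 scale
q8_0 = bits_per_weight(8)   # 32 eight-bit weights + one float16 scale

print(f"Q4_0: {q4_0} bits/weight, ~{size_gb(N_PARAMS, q4_0):.1f} GB")
print(f"Q8_0: {q8_0} bits/weight, ~{size_gb(N_PARAMS, q8_0):.1f} GB")
```

The Q4_0 estimate is about 4.1 decimal GB, which is roughly 3.8 GiB, so small differences from the list above can come down to units and to tensors stored in other formats. The K-quant variants mix block types, so this formula does not reproduce their sizes.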
AWQ Quantization (for GPU serving)
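For GPU-based serving with vLLM, AWQ (Activation-aware Weight Quantization) is a common choice. It calibrates on a small dataset to identify the weights that matter most for the activations, which typically preserves quality better than plain round-to-nearest 4-bit quantization.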
# pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "./mistral-7b-merged"
quant_path = "./mistral-7b-awq"
# Load model for quantization
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantization configuration
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4, # 4-bit quantization
"version": "GEMM"
}
# Run quantization (takes 15–30 minutes, requires calibration data)
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
print(f"AWQ model saved to {quant_path}")
# Size: ~3.9 GB (vs 14.5 GB float16)
Serving with vLLM
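vLLM is a widely used engine for high-throughput LLM serving. It implements PagedAttention, a KV-cache management algorithm that stores the attention cache in small fixed-size blocks instead of one contiguous region per request. This removes most of the memory waste in the cache and lets the server batch many more requests at once, which can yield throughput on the order of 20x that of naive HuggingFace generation (see the benchmark below).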
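The memory saving is easiest to see by counting reserved cache slots. The sketch below is a toy comparison, not vLLM's allocator: it contrasts reserving a full maximum-length region per request with allocating fixed-size blocks on demand. The block size of 16 tokens, the maximum length, and the sequence lengths are all assumed values for illustration.

```python
# Toy comparison of KV-cache reservations: one contiguous max-length region per
# request versus fixed-size blocks allocated on demand (the idea behind
# PagedAttention). All numbers are made-up illustration values.
import math

MAX_LEN = 4096      # token slots reserved per request by the contiguous scheme
BLOCK_SIZE = 16     # tokens per block in the paged scheme (assumed)


def contiguous_slots(seq_lens):
    """Every request reserves MAX_LEN token slots up front."""
    return MAX_LEN * len(seq_lens)


def paged_slots(seq_lens):
    """Each request holds only as many blocks as its current length needs."""
    return sum(math.ceil(n / BLOCK_SIZE) * BLOCK_SIZE for n in seq_lens)


seq_lens = [130, 512, 47, 2000]     # current lengths of four in-flight requests
used = sum(seq_lens)
for name, reserved in [("contiguous", contiguous_slots(seq_lens)),
                       ("paged", paged_slots(seq_lens))]:
    print(f"{name:>10}: reserved {reserved} slots, {1 - used / reserved:.1%} unused")
```

With paged allocation the only waste is the partially filled last block of each request, so the freed memory can hold the cache of additional concurrent requests.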
Installing and Starting vLLM
pip install vllm
# Serve the merged model
vllm serve ./mistral-7b-merged \
--dtype float16 \
--max-model-len 4096 \
--port 8000
# Serve an AWQ quantized model
vllm serve ./mistral-7b-awq \
--quantization awq \
--dtype float16 \
--max-model-len 4096 \
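--port 8000
The chat endpoint formats messages with the tokenizer's chat template; if the saved tokenizer does not define one, pass a template file with --chat-template when starting the server. vLLM exposes an OpenAI-compatible API, so any client that works with the OpenAI SDK works with vLLM: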
from openai import OpenAI
# Point to your vLLM server instead of OpenAI
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed", # vLLM doesn't require auth by default
)
response = client.chat.completions.create(
model="./mistral-7b-merged", # use the model path as the model name
messages=[
{"role": "system", "content": "You are a helpful customer support agent."},
{"role": "user", "content": "My billing invoice shows the wrong amount."}
],
max_tokens=256,
temperature=0.7,
)
print(response.choices[0].message.content)
Throughput Benchmark
import asyncio
import time
from openai import AsyncOpenAI
async def benchmark_throughput(num_requests=100):
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")
    prompts = ["Explain our return policy in one paragraph."] * num_requests
    start = time.time()
    # Submit all requests concurrently so the server can batch them
    tasks = [
        client.chat.completions.create(
            model="./mistral-7b-merged",
            messages=[{"role": "user", "content": p}],
            max_tokens=128,
        )
        for p in prompts
    ]
    results = await asyncio.gather(*tasks)
    elapsed = time.time() - start
    total_output_tokens = sum(r.usage.completion_tokens for r in results)
    throughput = total_output_tokens / elapsed
    print(f"Throughput: {throughput:.0f} output tokens/second")
    # Requests overlap, so this is amortized time, not per-request latency
    print(f"Amortized time per request: {elapsed/num_requests*1000:.0f} ms")
    print(f"Total time for {num_requests} requests: {elapsed:.1f}s")
asyncio.run(benchmark_throughput())
# Throughput: ~1800 output tokens/second (vLLM on A10G)
# vs ~90 tokens/second (naive HuggingFace generation)
# 20x improvement
Multi-GPU Serving with Tensor Parallelism
For a 7B model on a machine with two A10G GPUs (24 GB each), use tensor parallelism to split the model across GPUs:
# --tensor-parallel-size 2 splits the model across both GPUs;
# the pooled memory leaves room for a larger --max-model-len.
vllm serve ./mistral-7b-merged \
--dtype float16 \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--port 8000
For a 70B model in bfloat16, a typical setup is 4× A100 80GB:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--dtype bfloat16 \
--tensor-parallel-size 4 \
--max-model-len 8192 \
--port 8000
Serving the LoRA Adapter Dynamically (Advanced)
vLLM also supports serving a base model with dynamically loaded LoRA adapters — the most flexible production architecture:
# Register each adapter under a name when the server starts.
# --max-lora-rank must be at least the rank r used in training.
vllm serve mistralai/Mistral-7B-v0.1 \
--enable-lora \
--lora-modules customer_support_v2=./qlora-mistral-adapter \
--max-lora-rank 8 \
--port 8000
# Select an adapter per request by passing its registered name as the model
response = client.chat.completions.create(
    model="customer_support_v2",
    messages=[{"role": "user", "content": "..."}],
)
This architecture allows one vLLM instance to serve many fine-tuned adapters with a single copy of the base model in memory. It is ideal for multi-tenant SaaS products where each customer has a customized model.
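Conceptually, multi-adapter serving keeps one shared base weight and applies a different low-rank correction depending on which adapter a request names. The sketch below is a toy model of that routing in pure Python, not vLLM's implementation; the second adapter name and all numeric values are hypothetical.

```python
# Toy multi-adapter routing: one shared base weight, many small per-tenant
# adapters, selected by name per request. Names and numbers are hypothetical.

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


BASE_W = [[1.0, 0.0], [0.0, 1.0]]   # loaded once, shared by every tenant

# Each adapter is a rank-1 pair (A: d_out x 1, B: 1 x d_in) plus its scaling
ADAPTERS = {
    "customer_support_v2": {"A": [[1.0], [0.0]], "B": [[0.0, 2.0]], "scale": 0.5},
    "legal_review_v1":     {"A": [[0.0], [1.0]], "B": [[4.0, 0.0]], "scale": 0.25},
}


def forward(x, adapter_name=None):
    """Base output, plus the named adapter's low-rank correction if one is given."""
    h = matvec(BASE_W, x)
    if adapter_name is None:
        return h
    ad = ADAPTERS[adapter_name]
    low_rank = matvec(ad["A"], matvec(ad["B"], x))
    return [hi + ad["scale"] * li for hi, li in zip(h, low_rank)]


x = [1.0, 1.0]
for name in [None, "customer_support_v2", "legal_review_v1"]:
    print(name, forward(x, name))
```

Adding a tenant only adds another small entry to the adapter table, which is why the memory cost grows with adapter size and not with base-model size.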
Deployment Decision Tree
Do you need to swap adapters dynamically?
├── Yes → Serve adapter separately (vLLM with --enable-lora)
└── No → Merge adapter into base model
    │
    ├── Deploying on GPU?
    │   ├── Need maximum throughput → vLLM + AWQ quantization
    │   └── Need maximum quality → vLLM + float16
    │
    └── Deploying on CPU/edge?
        └── llama.cpp + GGUF Q4_K_M
The most common production setup: merge the adapter, quantize with AWQ, serve with vLLM. This gives you a clean model file, near-maximum quality, and high throughput, which is the best trade-off for most production workloads.
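The decision tree can also be written as a small helper, which is handy in deployment scripts. The function name, argument names, and returned strings below are hypothetical and simply restate the branches of the tree.

```python
def choose_deployment(dynamic_adapters: bool, target: str = "gpu",
                      priority: str = "throughput") -> str:
    """Walk the decision tree above and return the suggested stack."""
    if dynamic_adapters:
        return "vLLM with --enable-lora (adapter served separately)"
    if target == "gpu":
        if priority == "throughput":
            return "merge + AWQ quantization + vLLM"
        if priority == "quality":
            return "merge + float16 + vLLM"
        raise ValueError(f"unknown priority: {priority!r}")
    if target in ("cpu", "edge"):
        return "merge + GGUF Q4_K_M + llama.cpp"
    raise ValueError(f"unknown target: {target!r}")


print(choose_deployment(dynamic_adapters=False, target="gpu", priority="quality"))
print(choose_deployment(dynamic_adapters=False, target="edge"))
```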
