🏗️ Project ⏱️ 360 minutes

Capstone: Domain-Specific Assistant

Fine-tune a 7B model on a custom dataset and deploy it as an API

The Project

You are a machine learning engineer at a B2B SaaS company called Acme. The customer support team handles 500 tickets per day. The average first response time is 4 hours. Your task: fine-tune Mistral-7B on historical support tickets to create an AI assistant that drafts responses matching Acme’s brand voice — empathetic, specific, and product-savvy.

Target: the fine-tuned model should achieve a 70%+ win rate against the base model in LLM-as-judge evaluation, and the drafted responses should require only minor edits before sending.

This capstone integrates everything from the course: data preparation, QLoRA training, evaluation, merging, and vLLM deployment. Budget 4–6 hours of working time in total; the compute steps themselves take roughly 90 minutes on a Colab A100 (see the timing table in the project summary).

Step 1: Data Collection and Formatting

The Raw Dataset

For this capstone, we will use the bitext/Bitext-customer-support-llm-chatbot-training-dataset from Hugging Face, which contains 26,872 customer support question/answer pairs covering 27 intents grouped into broader categories. We will treat it as our “historical tickets” and apply Acme’s system prompt on top.

# Step 1: Install dependencies
# pip install transformers datasets peft accelerate bitsandbytes trl mlflow openai pandas scikit-learn

from datasets import load_dataset
import pandas as pd

# Load dataset
dataset = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")
print(dataset)
# DatasetDict({
#     train: Dataset({features: ['flags', 'instruction', 'category', 'intent', 'response'], num_rows: 26872})
# })

# Inspect categories
df = dataset['train'].to_pandas()
print(df['category'].value_counts().head(10))

Quality Filtering

SYSTEM_PROMPT = """You are a customer support specialist for Acme SaaS, a B2B workflow automation platform.

Your responses are:
- Professional and empathetic (acknowledge frustration before solving)
- Specific to Acme's product features (not generic)
- Actionable — always provide next steps
- Concise — aim for 2-4 sentences for simple issues, 4-8 for complex ones

You have deep knowledge of: billing and subscriptions, API integrations, workflow automation, 
team permissions, SSO/SAML configuration, and data exports."""

def is_quality_example(row):
    response = row['response']
    # Filter: too short
    if len(response.split()) < 15:
        return False
    # Filter: too long (copy-pasted documentation)
    if len(response.split()) > 300:
        return False
    # Filter: starts with generic filler
    generic_starts = ["I am sorry", "I apologize for", "Thank you for contacting"]
    if any(response.startswith(s) for s in generic_starts):
        return False
    return True

df_clean = df[df.apply(is_quality_example, axis=1)].copy()
df_clean = df_clean.drop_duplicates(subset=['response'])

print(f"Kept {len(df_clean):,} / {len(df):,} examples")
# Kept 18,432 / 26,872 examples

# For this capstone, use a 500-example subset for a fast training run
# (use the full dataset for production quality)
df_sample = df_clean.sample(n=500, random_state=42)

Format Into Chat Template

from transformers import AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

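def format_as_chat(row):
    # Mistral-7B-v0.1 is a base model, and its tokenizer may not ship a chat template
    # (recent transformers releases raise an error in that case). Build the prompt by
    # hand in Mistral's [INST] format instead. That format has no separate system
    # role, so the system prompt is prepended to the user turn.
    # No BOS token is written here because the tokenizer adds it during training;
    # the EOS token marks where the assistant reply ends.
    text = (
        f"[INST] {SYSTEM_PROMPT}\n\n{row['instruction']} [/INST] "
        f"{row['response']}{tokenizer.eos_token}"
    )
    return {"text": text, "instruction": row['instruction'], "response": row['response']}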

df_sample['formatted'] = df_sample.apply(format_as_chat, axis=1)

# 80/10/10 split
from sklearn.model_selection import train_test_split
from datasets import Dataset

train_df, temp_df = train_test_split(df_sample, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)

train_data = [row['formatted'] for _, row in train_df.iterrows()]
val_data = [row['formatted'] for _, row in val_df.iterrows()]
test_data = [row['formatted'] for _, row in test_df.iterrows()]

train_dataset = Dataset.from_list(train_data)
val_dataset = Dataset.from_list(val_data)
test_dataset = Dataset.from_list(test_data)

print(f"Train: {len(train_dataset)}, Val: {len(val_dataset)}, Test: {len(test_dataset)}")
# Train: 400, Val: 50, Test: 50

Step 2: QLoRA Training

# train.py

import torch
import mlflow
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer, 
    TrainingArguments,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from trl import SFTTrainer

# --- Configuration ---
MODEL_NAME = "mistralai/Mistral-7B-v0.1"
OUTPUT_DIR = "./acme-support-qlora"
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
LEARNING_RATE = 2e-4
NUM_EPOCHS = 3
BATCH_SIZE = 4
GRAD_ACCUM = 4
MAX_SEQ_LENGTH = 512

# --- Quantization Config ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# --- Load Model ---
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model.config.use_cache = False

# --- LoRA Config ---
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=LORA_DROPOUT,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 6,815,744 || all params: 3,758,886,912 || trainable%: 0.1813
# (the all-params figure depends on how your PEFT version counts 4-bit weights)

# --- Training Arguments ---
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    learning_rate=LEARNING_RATE,
    weight_decay=0.001,
    bf16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
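    evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
    eval_steps=25,   # 400 examples / effective batch of 16 = 25 steps per epoch, 75 in total
    save_strategy="steps",
    save_steps=25,   # checkpoint at each evaluation so load_best_model_at_end has candidates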
    save_total_limit=2,
    load_best_model_at_end=True,
    report_to="none",  # we'll use mlflow manually
)

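# --- Trainer ---
# The model is already wrapped by get_peft_model above, so peft_config is not passed
# again here. The dataset_text_field, max_seq_length, and tokenizer keyword arguments
# match older TRL releases; newer releases move these settings (for example to SFTConfig).
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    tokenizer=tokenizer,
    args=training_args,
    packing=False,
)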

# --- MLflow Logging ---
mlflow.set_experiment("acme-support-fine-tuning")

with mlflow.start_run(run_name=f"qlora-r{LORA_R}-lr{LEARNING_RATE}"):
    # Log hyperparameters
    mlflow.log_params({
        "model": MODEL_NAME,
        "lora_r": LORA_R,
        "lora_alpha": LORA_ALPHA,
        "learning_rate": LEARNING_RATE,
        "num_epochs": NUM_EPOCHS,
        "per_device_batch_size": BATCH_SIZE,
        "effective_batch_size": BATCH_SIZE * GRAD_ACCUM,
        "train_examples": len(train_dataset),
    })
    
    # Train
    train_result = trainer.train()
    
    # Log final metrics
    mlflow.log_metrics({
        "train_loss": train_result.training_loss,
        "train_runtime_seconds": train_result.metrics['train_runtime'],
    })
    
    # Save adapter
    trainer.model.save_pretrained(OUTPUT_DIR + "/adapter")
    tokenizer.save_pretrained(OUTPUT_DIR + "/adapter")
    
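    # The adapter is a directory, so use log_artifacts (plural) rather than log_artifact
    mlflow.log_artifacts(OUTPUT_DIR + "/adapter", artifact_path="adapter")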

print(f"Training complete. Loss: {train_result.training_loss:.4f}")

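Illustrative training output (loss values vary from run to run; with 400 training examples and an effective batch size of 16, this configuration performs 75 optimizer steps, so your log will stop earlier than the step numbers shown here):

trainable params: 6,815,744 || all params: 3,758,886,912 || trainable%: 0.1813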
Step  10: loss=2.1842, lr=6.7e-05
Step  20: loss=1.8934, lr=1.3e-04
Step  50: loss=1.4221, lr=2.0e-04
Step 100: loss=1.1045, lr=1.8e-04
Step 150: loss=0.9234, lr=1.2e-04
Step 187: loss=0.8521, lr=0.0e+00

Training complete. Loss: 0.8521
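
The two headline numbers of a run, the trainable-parameter count and the number of optimizer steps, can be checked by hand before launching it. The sketch below assumes the published Mistral-7B-v0.1 dimensions (hidden size 4096, 8 key/value heads of dimension 128, FFN size 14336, 32 layers) and a single GPU. The helper names are illustrative and are not part of PEFT or Transformers.

```python
# Assumed Mistral-7B-v0.1 dimensions (taken from the model's published config).
HIDDEN = 4096
KV_DIM = 8 * 128      # 8 key/value heads x head_dim 128 (grouped-query attention)
FFN = 14336
NUM_LAYERS = 32

# (in_features, out_features) of each linear layer that LoRA can target.
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}


def lora_trainable_params(r, target_modules):
    """LoRA adds A (r x in) and B (out x r) to every targeted layer."""
    per_layer = sum(
        r * (MODULE_SHAPES[name][0] + MODULE_SHAPES[name][1])
        for name in target_modules
    )
    return per_layer * NUM_LAYERS


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Optimizer steps for a single-GPU run (hypothetical helper)."""
    batches_per_epoch = -(-num_examples // per_device_batch)  # ceiling division
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
print(lora_trainable_params(8, attention_only))   # 6815744
print(optimizer_steps(400, 4, 4, 3))              # 75
```

With r=16 on all seven projection layers, the same function gives 41,943,040 trainable parameters, which is a quick way to see how much larger that adapter is.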

Step 3: Evaluation with LLM-as-Judge

# evaluate.py

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from openai import OpenAI
import json
import pandas as pd
import torch

BASE_MODEL = "mistralai/Mistral-7B-v0.1"
ADAPTER_PATH = "./acme-support-qlora/adapter"

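# Use the same NF4 quantization as in training so the adapter sees the same base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)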

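# Load base model (no adapter)
base_tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)

# Load fine-tuned model (with adapter).
# Caution: PeftModel.from_pretrained injects the LoRA layers into base_model in place,
# so base_model no longer behaves like the unmodified model after this call. Wrap
# generation in `with ft_model.disable_adapter():` to get base-model outputs.
ft_model = PeftModel.from_pretrained(base_model, ADAPTER_PATH)
ft_model.eval()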

def generate_response(model, tokenizer, customer_message, system_prompt=None):
    # Build the prompt by hand in Mistral's [INST] format. The base tokenizer may not
    # ship a chat template, and this format has no separate system role, so the
    # system prompt (if given) is prepended to the user turn.
    if system_prompt:
        user_turn = f"{system_prompt}\n\n{customer_message}"
    else:
        user_turn = customer_message
    prompt = f"[INST] {user_turn} [/INST]"
    
    # Send inputs to the model's device rather than hard-coding "cuda"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=200,
            do_sample=False,  # greedy decoding keeps the comparison reproducible
            pad_token_id=tokenizer.eos_token_id,
        )
    # Drop the prompt tokens and decode only the newly generated continuation
    generated = outputs[0][inputs['input_ids'].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True).strip()

# LLM judge
client = OpenAI()

JUDGE_PROMPT = """You are evaluating customer support responses for a B2B SaaS company.
Compare Response A and Response B and choose which is better.
A better response: acknowledges the customer's issue, provides specific actionable steps, 
is professional and empathetic, and is appropriately concise.

Respond ONLY with valid JSON: {"winner": "A" or "B" or "tie", "reason": "brief explanation"}"""

def judge(customer_msg, resp_a, resp_b):
    result = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Customer: {customer_msg}\n\nResponse A:\n{resp_a}\n\nResponse B:\n{resp_b}"}
        ],
        temperature=0,
        response_format={"type": "json_object"}
    )
    return json.loads(result.choices[0].message.content)

# Run evaluation on up to 50 test examples
# (SYSTEM_PROMPT is the Acme system prompt defined in Step 1)
results = []
for example in test_dataset.select(range(min(50, len(test_dataset)))):
    msg = example['instruction']
    
    # Base model (no system prompt, vanilla base). The adapter was injected into
    # base_model in place, so disable it temporarily to recover base behavior.
    with ft_model.disable_adapter():
        base_resp = generate_response(ft_model, base_tokenizer, msg)
    
    # Fine-tuned model, prompted with the Acme system prompt it was trained on
    ft_resp = generate_response(ft_model, base_tokenizer, msg, system_prompt=SYSTEM_PROMPT)
    
    verdict = judge(msg, base_resp, ft_resp)  # A=base, B=fine-tuned
    
    results.append({
        "message": msg,
        "base_response": base_resp,
        "ft_response": ft_resp,
        "winner": verdict["winner"],
        "reason": verdict["reason"]
    })

df = pd.DataFrame(results)
n = len(df)
ft_wins = int((df['winner'] == 'B').sum())
base_wins = int((df['winner'] == 'A').sum())
ties = int((df['winner'] == 'tie').sum())
decided = ft_wins + base_wins

print(f"\n{'='*50}")
print(f"EVALUATION RESULTS (n={n})")
print(f"{'='*50}")
print(f"Fine-tuned model wins: {ft_wins:>2} ({ft_wins/n*100:4.1f}%)")
print(f"Base model wins:       {base_wins:>2} ({base_wins/n*100:4.1f}%)")
print(f"Ties:                  {ties:>2} ({ties/n*100:4.1f}%)")
print(f"{'='*50}")
# Guard against division by zero when every comparison is a tie
if decided:
    print(f"Win rate (excl. ties): {ft_wins/decided*100:.1f}%")

df.to_csv("evaluation_results.csv", index=False)

Expected output:

==================================================
EVALUATION RESULTS (n=50)
==================================================
Fine-tuned model wins: 38 (76.0%)
Base model wins:        8 (16.0%)
Ties:                   4 ( 8.0%)
==================================================
Win rate (excl. ties): 82.6%
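
LLM judges can favor whichever response appears first or second, and the loop above always shows the base model as Response A. One common mitigation is to randomize the order per example and map the verdict back afterwards. The sketch below is a minimal illustration: `judge_with_random_order` and `always_a_judge` are hypothetical helpers, and the stub judge stands in for the real `judge` function so the example runs without an API key.

```python
import random


def judge_with_random_order(judge_fn, customer_msg, base_resp, ft_resp, rng):
    """Call judge_fn with the two responses in random order, then map the verdict back."""
    ft_is_a = rng.random() < 0.5
    if ft_is_a:
        verdict = judge_fn(customer_msg, ft_resp, base_resp)
        label_to_model = {"A": "fine-tuned", "B": "base"}
    else:
        verdict = judge_fn(customer_msg, base_resp, ft_resp)
        label_to_model = {"A": "base", "B": "fine-tuned"}
    return {
        # Anything other than "A" or "B" (for example "tie") is recorded as a tie
        "winner": label_to_model.get(verdict["winner"], "tie"),
        "reason": verdict.get("reason", ""),
        "ft_position": "A" if ft_is_a else "B",
    }


def always_a_judge(customer_msg, resp_a, resp_b):
    """Stand-in for a maximally position-biased judge: it always picks Response A."""
    return {"winner": "A", "reason": "stub"}


rng = random.Random(0)  # fixed seed so the shuffle is reproducible
verdicts = [
    judge_with_random_order(always_a_judge, "msg", "base reply", "ft reply", rng)
    for _ in range(1000)
]
ft_share = sum(v["winner"] == "fine-tuned" for v in verdicts) / len(verdicts)
print(f"Fine-tuned share under a judge that always picks A: {ft_share:.2f}")
```

With a fixed order, this biased judge would hand every win to one model. With randomized order its bias averages out to roughly half, so it no longer inflates either side's win rate.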

Step 4: Before vs After Examples

The difference becomes concrete when you compare actual outputs:

Customer Query: “My team can’t access the shared workspace after I upgraded to the Business plan yesterday. I added them as admins but they still see the Free tier interface.”

Base Model (Mistral-7B, no fine-tuning):

Shared workspaces are a feature available in certain subscription plans. To add team members to a shared workspace, you typically need to navigate to your account settings and look for a “Team” or “Members” section. Once there, you should be able to invite people by email. If you’ve already done this and they’re still seeing the Free tier interface, it might be a cache issue — try having them log out and log back in, or clear their browser cache. If the problem persists, I’d recommend reaching out to the official support channels for your software provider.

Fine-Tuned Model (Acme voice, QLoRA):

Congratulations on the Business plan upgrade! This is a known sync delay that occasionally affects workspace permissions immediately after an upgrade. Here’s the fix: go to Settings > Team > Members, find each affected member, and click “Resync Permissions” next to their name. This forces an immediate role refresh and typically resolves the issue in under 60 seconds. If any members still see the Free tier interface after resyncing, please reply here with their email addresses and I’ll manually escalate to our billing team.

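The fine-tuned response names a specific fix (“Resync Permissions”), gives a concrete action with a time estimate, and closes with a clear escalation path. The base model gives generic advice that does not apply to Acme’s product. Product-specific details like these can only be learned from real Acme tickets; a model trained solely on the generic Bitext data has no source for them, so verify any feature names in a draft before sending it.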

Step 5: Merge and Quantize

# merge.py

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

BASE_MODEL = "mistralai/Mistral-7B-v0.1"
ADAPTER_PATH = "./acme-support-qlora/adapter"
MERGED_PATH = "./acme-support-merged"

print("Loading base model in float16...")
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="cpu",
)

print("Loading adapter...")
model = PeftModel.from_pretrained(base, ADAPTER_PATH)

print("Merging...")
merged = model.merge_and_unload()

print(f"Saving to {MERGED_PATH}...")
merged.save_pretrained(MERGED_PATH)
tokenizer.save_pretrained(MERGED_PATH)
print("Done.")

# Quantize to AWQ for vLLM serving (run as a separate Python script; requires the autoawq package)

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained('./acme-support-merged', device_map='auto')
tokenizer = AutoTokenizer.from_pretrained('./acme-support-merged')
# 4-bit weights, group size 128, zero-point quantization, GEMM kernels
model.quantize(tokenizer, quant_config={'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'})
model.save_quantized('./acme-support-awq')
tokenizer.save_pretrained('./acme-support-awq')
print('AWQ quantization complete.')

Step 6: Deploy with vLLM

#!/bin/bash
# serve.sh
vllm serve ./acme-support-awq \
    --quantization awq \
    --dtype float16 \
    --max-model-len 4096 \
    --port 8000 \
    --served-model-name acme-support-v1 \
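    --gpu-memory-utilization 0.90

# Test the deployed API (Python client; run after the server has started)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# The chat endpoint formats messages with the served tokenizer's chat template.
# If the tokenizer ships without one, supply a template via vLLM's --chat-template option.
# For responses in the trained style, send the same system prompt used during training.
response = client.chat.completions.create(
    model="acme-support-v1",
    messages=[
        {"role": "user", "content": "My API key stopped working after I rotated it 10 minutes ago."}
    ],
    max_tokens=200,
    temperature=0.3,
)
print(response.choices[0].message.content)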
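
For capacity planning, it helps to estimate how much GPU memory the weights alone occupy before and after quantization. The following is a back-of-the-envelope sketch, not a measurement. It assumes the published Mistral-7B-v0.1 dimensions and a simplified AWQ layout: 4-bit weights for the transformer linear layers, one fp16 scale plus one 4-bit zero point per group of 128 weights, and embeddings, output head, and norms kept in fp16.

```python
# Assumed Mistral-7B-v0.1 dimensions (from the model's published config).
VOCAB, HIDDEN, FFN, KV_DIM, NUM_LAYERS = 32000, 4096, 14336, 1024, 32

attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM   # q_proj, o_proj + k_proj, v_proj
mlp = 3 * HIDDEN * FFN                             # gate_proj, up_proj, down_proj
linear_params = NUM_LAYERS * (attn + mlp)

# Embeddings, output head, and RMSNorm weights (two norms per layer plus a final one)
other_params = 2 * VOCAB * HIDDEN + (2 * NUM_LAYERS + 1) * HIDDEN
total_params = linear_params + other_params

fp16_bytes = total_params * 2  # 2 bytes per parameter

# Simplified AWQ estimate (assumption, see the paragraph above)
GROUP_SIZE = 128
awq_bytes = (
    linear_params * 0.5                          # 4-bit weights
    + (linear_params / GROUP_SIZE) * (2 + 0.5)   # fp16 scale + 4-bit zero per group
    + other_params * 2                           # unquantized fp16 tensors
)

print(f"parameters:           {total_params:,}")
print(f"float16 weights:      {fp16_bytes / 1e9:.1f} GB")
print(f"AWQ 4-bit (estimate): {awq_bytes / 1e9:.1f} GB ({awq_bytes / 2**30:.1f} GiB)")
```

The weights are only part of what `--gpu-memory-utilization` has to cover; the remainder of that budget goes to the KV cache and activations, which is why a smaller checkpoint leaves more room for concurrent requests.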

Step 7: A/B Test in Production

Before routing 100% of traffic to the fine-tuned model, run a 50/50 A/B test.

# ab_router.py — simple production A/B routing

import hashlib
from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"
# Assumption: this endpoint exposes both model names. The Step 6 command serves only
# acme-support-v1, so the base model needs its own vLLM instance (and URL) in practice.
client = OpenAI(base_url=BASE_URL, api_key="none")

def get_support_response(customer_message: str, user_id: str) -> dict:
    """Route to base or fine-tuned model based on a hash of user_id."""
    
    # Deterministic split: same user always gets same model.
    # Hashing also handles non-numeric IDs, which int(user_id) would reject.
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 2
    use_finetuned = bucket == 0
    model = "acme-support-v1" if use_finetuned else "mistralai/Mistral-7B-v0.1"
    
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": customer_message}],
        max_tokens=256,
        temperature=0.3,
    )
    
    return {
        "response": response.choices[0].message.content,
        "model": model,
        "variant": "fine-tuned" if use_finetuned else "base",
    }

# Track A/B metrics in your analytics:
# - Agent edit rate (how often agents edit the draft)
# - First contact resolution rate
# - Customer satisfaction score
# - Response send time
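
Once the A/B test has collected data, a metric such as agent edit rate can be compared across the two variants with a two-proportion z-test. The sketch below uses only the standard library; the function name and the counts in the example are made up for illustration and are not results from this project.

```python
import math


def two_proportion_z_test(events_a, n_a, events_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("Both groups have identical all-or-nothing rates; z is undefined.")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: drafts that agents edited before sending, per variant
base_edited, base_total = 300, 500
ft_edited, ft_total = 240, 500

z, p_value = two_proportion_z_test(base_edited, base_total, ft_edited, ft_total)
print(f"base edit rate: {base_edited / base_total:.1%}")
print(f"fine-tuned edit rate: {ft_edited / ft_total:.1%}")
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference in edit rate is unlikely to be noise. Decide the sample size and the primary metric before the test starts, so the comparison is not tuned after the fact.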

Project Summary and Results

A complete end-to-end run of this capstone on a Colab A100 (40 GB) takes approximately:

| Step | Time |
|---|---|
| Data prep and formatting | 10 min |
| QLoRA training (400 examples, 3 epochs) | 25 min |
| LLM-as-Judge evaluation (50 examples) | 15 min |
| Merge + AWQ quantization | 35 min |
| vLLM deployment test | 5 min |
| Total | ~90 min |

Expected outcomes:

  • Training loss: 2.1 → 0.85 over 3 epochs
  • LLM-as-judge win rate vs base model: 72–82%
  • Model size: 14.5 GB (float16) → 3.9 GB (AWQ 4-bit)
  • vLLM throughput: ~1,600 tokens/second on A10G
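
With only 50 test examples, a win rate in the 72–82% range carries wide sampling uncertainty. A Wilson score interval is one common way to quantify it. The sketch below uses only the standard library, and the example counts (38 wins out of 50, ties counted as non-wins) are taken from the sample evaluation output in Step 3.

```python
import math


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(38, 50)
print(f"win rate 76.0%, 95% interval: {low:.1%} to {high:.1%}")
```

For 38 wins out of 50 the interval runs from roughly 63% to 86%, so the lower bound sits below the 70% target. A result like this is encouraging rather than conclusive; a larger test set narrows the interval.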

What to try next:

  1. Scale from 500 to 5,000 examples — win rate should improve to 80–85%
  2. Experiment with r=16 and all attention + FFN modules — typically adds 3–5 percentage points of win rate
  3. Implement continuous training: add new tickets monthly and retrain incrementally
  4. Add rejection sampling — generate 5 responses, have human agents rate them, keep only 5-star responses for the next training round
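
The filtering half of the rejection-sampling idea in item 4 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `RatedDraft` record and `select_for_training` helper are hypothetical names, and the example drafts and ratings are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatedDraft:
    instruction: str   # customer message
    response: str      # model-generated draft
    rating: int        # 1-5 stars assigned by a human agent


def select_for_training(drafts, min_rating=5):
    """Keep drafts rated at least min_rating, dropping exact duplicates."""
    seen = set()
    kept = []
    for draft in drafts:
        key = (draft.instruction, draft.response)
        if draft.rating >= min_rating and key not in seen:
            seen.add(key)
            kept.append({"instruction": draft.instruction, "response": draft.response})
    return kept


# Made-up ratings for five drafts of the same customer message
message = "How do I export my data?"
drafts = [
    RatedDraft(message, "draft 1", 3),
    RatedDraft(message, "draft 2", 5),
    RatedDraft(message, "draft 3", 4),
    RatedDraft(message, "draft 2", 5),   # exact duplicate of an accepted draft
    RatedDraft(message, "draft 5", 5),
]

selected = select_for_training(drafts)
print(f"kept {len(selected)} of {len(drafts)} drafts")  # kept 2 of 5 drafts
```

The selected rows have the same `instruction` and `response` fields as the Step 1 data, so they can be formatted and appended to the training set for the next round.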

The patterns you have used in this capstone (QLoRA training, LLM-as-Judge evaluation, AWQ quantization, and vLLM serving) are the same building blocks used in production fine-tuning workflows at companies deploying specialized LLMs. You now have a complete template to adapt for any domain-specific fine-tuning project.