Evaluation: BLEU, ROUGE, and LLM-as-Judge

The Evaluation Problem

You have fine-tuned a model. The training loss went down. The outputs look better. But is the model actually better?

“Looks better to me” is not an evaluation. It is a feeling. Feelings are not reproducible, do not generalize across users, and cannot be tracked over time as you iterate on your model. To know whether your fine-tuned model is better than the base model — and by how much — you need a rigorous evaluation protocol.

The challenge: unlike image classification (accuracy is unambiguous) or regression (MSE is unambiguous), language model evaluation is fundamentally about meaning, coherence, and usefulness. Two responses can mean the same thing and score completely differently under word-overlap metrics. Two responses can score the same and one can be dramatically better in practice.

This lesson covers the available evaluation tools, their limitations, and the current industry-standard approach: LLM-as-Judge.

BLEU Score

BLEU (Bilingual Evaluation Understudy) was developed in 2002 for evaluating machine translation. It measures n-gram precision: what fraction of n-grams in the model’s output appear in the reference translation?
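
To make the definition concrete, here is a minimal single-reference sketch of how a BLEU score is assembled: clipped n-gram precisions for n = 1 to 4, combined with a geometric mean and scaled by a brevity penalty. The function name `sentence_bleu_sketch` and the whitespace tokenization are simplifications chosen for illustration. The library metric used in this lesson applies its own tokenizer and aggregates over a corpus, so its numbers can differ slightly.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu_sketch(prediction, reference, max_n=4):
    """Illustrative BLEU for one prediction and one reference (whitespace tokens)."""
    pred, ref = prediction.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        pred_counts, ref_counts = ngram_counts(pred, n), ngram_counts(ref, n)
        # Clipping: an n-gram is credited at most as often as it occurs in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
        total = sum(pred_counts.values())
        precisions.append(matches / total if total else 0.0)
    if min(precisions) == 0:
        # A single zero precision zeroes the geometric mean
        return 0.0, precisions
    # Brevity penalty: predictions shorter than the reference are scaled down
    brevity_penalty = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / len(pred))
    score = brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)
    return score, precisions

reference = "The quick brown fox jumps over the lazy dog"
score, precisions = sentence_bleu_sketch("The quick brown cat jumps over the lazy dog", reference)
print(precisions)       # 8/9, 6/8, 4/7, 2/6
print(round(score, 3))  # 0.597
```

A single changed word breaks every n-gram that spans it, which is why the higher-order precisions fall fastest.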

import evaluate

bleu = evaluate.load("bleu")

# Perfect match
predictions = ["The quick brown fox jumps over the lazy dog"]
references = [["The quick brown fox jumps over the lazy dog"]]
result = bleu.compute(predictions=predictions, references=references)
print(result['bleu'])  # 1.0

# Synonym substitution — same meaning, different words
predictions = ["The rapid auburn fox leaps over the idle dog"]
references = [["The quick brown fox jumps over the lazy dog"]]
result = bleu.compute(predictions=predictions, references=references)
print(result['bleu'])  # 0.0  ← same meaning, score is 0!

# Wrong but overlapping
predictions = ["The quick brown cat jumps over the lazy dog"]
references = [["The quick brown fox jumps over the lazy dog"]]
result = bleu.compute(predictions=predictions, references=references)
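print(result['bleu'])  # ≈ 0.60  ← wrong animal, yet still well above zero

BLEU’s core problem: it measures surface word overlap, not meaning. A perfectly correct paraphrase can score 0 (here no 3-gram or 4-gram survives the synonym swap, which zeroes the geometric mean), while a response that gets the key fact wrong still scores about 0.60 because most of its n-grams match.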

BLEU works well for machine translation, where there are limited correct translations and word choice is relatively constrained. It works poorly for:

Open-ended generation (many valid responses)
Style transfer (model uses different but equivalent vocabulary)
Summarization with different levels of detail
Customer support (many ways to say “we’ll look into this”)

Bottom line: Calculate BLEU for completeness and benchmarking against published numbers, but do not use it as your primary signal for whether fine-tuning worked.

ROUGE

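ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was developed for summarization evaluation. Where BLEU is built on precision, ROUGE was designed around recall: what fraction of the reference’s n-grams appear in the model output? In practice most implementations, including the Hugging Face evaluate metric used below, report an F1 score that combines precision and recall.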

There are three main variants:

ROUGE-1: unigram (single word) overlap
ROUGE-2: bigram (two-word phrase) overlap
ROUGE-L: longest common subsequence (captures word order)
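
The variants above can be sketched in a few lines of plain Python. This is an illustrative approximation, not the library's implementation: the helper names (`tokenize`, `rouge_n`, `rouge_l`) are made up for this example, tokenization is a simple lowercase alphanumeric split, no stemming is applied, and each function returns the F1 of precision and recall.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep alphanumeric runs (a simplification of real tokenizers)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def f1(matches, pred_total, ref_total):
    """Harmonic mean of precision (matches/pred_total) and recall (matches/ref_total)."""
    if matches == 0:
        return 0.0
    precision, recall = matches / pred_total, matches / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(prediction, reference, n=1):
    """N-gram overlap F1: n=1 corresponds to ROUGE-1, n=2 to ROUGE-2."""
    pred, ref = tokenize(prediction), tokenize(reference)
    pred_counts = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    # Counter intersection keeps the minimum count of each shared n-gram
    matches = sum((pred_counts & ref_counts).values())
    return f1(matches, sum(pred_counts.values()), sum(ref_counts.values()))

def rouge_l(prediction, reference):
    """Longest-common-subsequence F1 (order matters, gaps are allowed)."""
    pred, ref = tokenize(prediction), tokenize(reference)
    # Dynamic-programming table: table[i][j] = LCS length of pred[:i] and ref[:j]
    table = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            if p == r:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return f1(table[-1][-1], len(pred), len(ref))

prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(round(rouge_n(prediction, reference, 1), 3))  # 0.833 (5 of 6 unigrams shared)
print(round(rouge_n(prediction, reference, 2), 3))  # 0.6   (3 of 5 bigrams shared)
print(round(rouge_l(prediction, reference), 3))     # 0.833 (LCS: the cat on the mat)
```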

rouge = evaluate.load("rouge")

predictions = ["The company reported strong quarterly earnings, beating analyst expectations"]
references = ["The company exceeded analyst expectations with strong quarterly results"]

result = rouge.compute(predictions=predictions, references=references)
print(result)
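# Approximate values (the metric reports F1 scores):
# {
#   'rouge1': 0.667,     # 6 shared unigrams, 9 tokens on each side
#   'rouge2': 0.375,     # 3 shared bigrams, 8 bigrams on each side
#   'rougeL': 0.444,     # longest common subsequence is 4 of 9 tokens
#   'rougeLsum': 0.444,  # ROUGE-L computed per line; identical for one-line text
# }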

ROUGE is better than BLEU for summarization tasks because the reference summaries tend to use the same key vocabulary as the source documents. But it still fails on the fundamental problem: meaning is not word overlap.

# A classic ROUGE failure:
predictions = ["The treatment was not effective and caused serious side effects"]
references = ["The treatment was effective and caused no side effects"]

result = rouge.compute(predictions=predictions, references=references)
print(result['rouge1'])  # ≈ 0.84: high overlap, completely opposite meaning

Bottom line: Use ROUGE-L for summarization tasks where it correlates reasonably with quality. Do not use any ROUGE variant as your sole evaluation for open-ended generation.

Perplexity

Perplexity measures how “surprised” the model is by a test set. It is computed as the exponential of the average negative log-likelihood per token. Lower perplexity = model assigns higher probability to the test sequences = model “expects” these sequences.
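
A tiny worked example of that formula, using made-up token probabilities rather than a real model (the function name `perplexity_from_probs` is illustrative only). It shows that perplexity is the inverse of the geometric mean probability the model assigns to each token.

```python
import math

def perplexity_from_probs(token_probs):
    """exp(mean negative log-likelihood) over the probabilities of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical probabilities the model assigned to three observed tokens
print(round(perplexity_from_probs([0.5, 0.25, 0.125]), 3))  # 4.0
# A model that is confident about every token approaches a perplexity of 1
print(round(perplexity_from_probs([0.9, 0.9, 0.9]), 3))     # 1.111
```

A perplexity of 4 can be read as the model being, on average, as uncertain as if it were choosing uniformly among four tokens at each step.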

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compute_perplexity(model, tokenizer, texts, device="cuda"):
    model.eval()
    total_loss = 0
    total_tokens = 0
    
    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
            inputs = {k: v.to(device) for k, v in inputs.items()}
            
            # Labels = input_ids (predict each token from previous tokens)
            outputs = model(**inputs, labels=inputs["input_ids"])
            
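            # outputs.loss is the mean NLL over predicted tokens. Labels are shifted
            # inside the model, so a sequence of length n yields n - 1 predictions.
            num_tokens = inputs["input_ids"].shape[1] - 1
            if num_tokens < 1:
                continue  # a single-token input has nothing to predict
            total_loss += outputs.loss.item() * num_tokens
            total_tokens += num_tokens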
    
    avg_nll = total_loss / total_tokens
    perplexity = torch.exp(torch.tensor(avg_nll)).item()
    return perplexity

# Compare base model vs fine-tuned model on test set
test_texts = [example['text'] for example in test_dataset]

base_ppl = compute_perplexity(base_model, tokenizer, test_texts)
ft_ppl = compute_perplexity(fine_tuned_model, tokenizer, test_texts)

print(f"Base model perplexity: {base_ppl:.2f}")
print(f"Fine-tuned model perplexity: {ft_ppl:.2f}")
print(f"Improvement: {(base_ppl - ft_ppl) / base_ppl * 100:.1f}%")

Perplexity is useful for:

Confirming your fine-tuned model has not catastrophically forgotten general language patterns
Measuring domain adaptation (a medical fine-tuned model should have lower perplexity on medical text)
Detecting overfitting (train perplexity near 1.0 while test perplexity remains high)

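Perplexity’s critical limitation: it does not capture factual accuracy, instruction following, or response usefulness. It scores how well the model predicts reference text, not what the model writes when it generates, so a model that produces fluent, plausible-sounding hallucinations can still have excellent perplexity.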

LLM-as-Judge: The Current Industry Standard

The approach that has become the gold standard for evaluating instruction-tuned models: use a strong LLM (GPT-4, Claude, Gemini) to compare your fine-tuned model’s output against a baseline.

The protocol:

Take your held-out test set (50–200 examples)
Generate responses from both the base model and the fine-tuned model
For each test example, ask the judge LLM: “Which response is better, A or B?”
Calculate the win rate: percentage of examples where the fine-tuned model is preferred

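As a rule of thumb, a win rate above 55% against the base model suggests a real improvement, and above 65% is strong. Below 45% means the fine-tuning made things worse and you should investigate. With only 50–200 examples these thresholds are noisy, so treat results close to 50% as inconclusive.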
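
How much a given win rate can be trusted depends on the number of judged examples. As a rough check, one option is a Wilson score interval around the observed win rate. This is an addition for illustration, not part of the protocol above: the function name `wilson_interval` is hypothetical, z = 1.96 assumes a 95% interval, and ties are assumed to have been dropped before counting.

```python
import math

def wilson_interval(wins, total, z=1.96):
    """Approximate confidence interval for a win rate (Wilson score interval)."""
    if total == 0:
        raise ValueError("need at least one judged example")
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(28, 50)    # 56% win rate on 50 examples
print(f"{low:.2f} to {high:.2f}")      # 0.42 to 0.69: still compatible with 50%

low, high = wilson_interval(130, 200)  # 65% win rate on 200 examples
print(f"{low:.2f} to {high:.2f}")      # 0.58 to 0.71: clearly above 50%
```

If the interval includes 50%, the comparison has not yet separated the two models, and judging more examples is usually the simplest remedy.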

from openai import OpenAI
import json

client = OpenAI()  # requires OPENAI_API_KEY

JUDGE_PROMPT = """You are an expert evaluator for AI-generated customer support responses.

You will be given:
- A customer's message
- Response A
- Response B

Your task: determine which response is better for the customer. Consider:
1. Accuracy: Is the information correct?
2. Helpfulness: Does it solve the customer's problem?
3. Tone: Is it professional and empathetic?
4. Specificity: Does it give concrete next steps?

Respond with a JSON object:
{"winner": "A" or "B" or "tie", "reasoning": "one sentence explanation"}

Do not let response length bias your judgment. A concise correct response beats a verbose incorrect one."""

def judge_responses(customer_message, response_a, response_b):
    """Ask GPT-4 to compare two responses and return the winner."""
    
    user_prompt = f"""Customer message: {customer_message}

Response A:
{response_a}

Response B:
{response_b}

Which response is better?"""
    
    result = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0,  # deterministic for consistent evaluation
        response_format={"type": "json_object"}
    )
    
    return json.loads(result.choices[0].message.content)

Running the Full Evaluation

from transformers import pipeline
import pandas as pd

# Load both models
base_generator = pipeline("text-generation", model=base_model, tokenizer=tokenizer, 
                           max_new_tokens=256, do_sample=False)
ft_generator = pipeline("text-generation", model=ft_model, tokenizer=ft_tokenizer,
                         max_new_tokens=256, do_sample=False)

results = []

for i, example in enumerate(test_dataset):
    customer_msg = example['customer_message']
    
    # Generate from both models
    prompt = f"[INST] {customer_msg} [/INST]"
    
    base_output = base_generator(prompt)[0]['generated_text'].split('[/INST]')[-1].strip()
    ft_output = ft_generator(prompt)[0]['generated_text'].split('[/INST]')[-1].strip()
    
    # Get judge verdict
    judgment = judge_responses(customer_msg, base_output, ft_output)
    
    results.append({
        "example_id": i,
        "customer_message": customer_msg,
        "base_response": base_output,
        "ft_response": ft_output,
        "winner": judgment["winner"],
        "reasoning": judgment["reasoning"]
    })
    
    if (i + 1) % 10 == 0:
        print(f"Evaluated {i+1}/{len(test_dataset)} examples")

# Calculate win rate
df = pd.DataFrame(results)
wins_ft = (df['winner'] == 'B').sum()       # Response B = fine-tuned
wins_base = (df['winner'] == 'A').sum()     # Response A = base model
ties = (df['winner'] == 'tie').sum()

total = len(df)
win_rate = wins_ft / total * 100
print(f"\nEvaluation Results ({total} examples)")
print(f"Fine-tuned wins: {wins_ft} ({win_rate:.1f}%)")
print(f"Base model wins: {wins_base} ({wins_base/total*100:.1f}%)")
print(f"Ties:            {ties} ({ties/total*100:.1f}%)")
decided = wins_ft + wins_base
if decided:  # avoid dividing by zero when every verdict is a tie
    print(f"\nWin rate (excluding ties): {wins_ft/decided*100:.1f}%")

Avoiding Position Bias

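LLM judges show known biases toward whichever response appears first (“Response A”) and toward longer responses. The length instruction in the judge prompt targets the second. For the first, judge every pair in both orders and only trust verdicts that survive the swap: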

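from itertools import islice

def generate(generator, customer_msg):
    """Generate one response, using the same prompt format as the loop above."""
    prompt = f"[INST] {customer_msg} [/INST]"
    return generator(prompt)[0]['generated_text'].split('[/INST]')[-1].strip()

def evaluate_with_swap(test_dataset, base_gen, ft_gen, judge_fn, n_samples=50):
    """Judge each pair in both orders; count a win only if it holds in both."""

    ft_wins = 0     # fine-tuned model preferred in both orderings
    base_wins = 0   # base model preferred in both orderings
    total = 0

    # islice works for plain lists and for datasets that yield one row per iteration
    for example in islice(test_dataset, n_samples):
        msg = example['customer_message']
        base_r = generate(base_gen, msg)
        ft_r = generate(ft_gen, msg)

        # Forward: A=base, B=ft
        j1 = judge_fn(msg, base_r, ft_r)
        # Swapped: A=ft, B=base
        j2 = judge_fn(msg, ft_r, base_r)

        if j1['winner'] == 'B' and j2['winner'] == 'A':
            ft_wins += 1
        elif j1['winner'] == 'A' and j2['winner'] == 'B':
            base_wins += 1
        # Ties and verdicts that flip with position count as inconclusive
        total += 1

    if total == 0:
        raise ValueError("test_dataset is empty")

    inconclusive = total - ft_wins - base_wins
    print(f"FT consistently better: {ft_wins}/{total} ({ft_wins/total*100:.1f}%)")
    print(f"Base consistently better: {base_wins}/{total} ({base_wins/total*100:.1f}%)")
    print(f"Inconclusive (tie or flipped): {inconclusive}/{total} ({inconclusive/total*100:.1f}%)")

    return {
        "win_rate": ft_wins / total * 100,
        "base_win_rate": base_wins / total * 100,
        "inconclusive_rate": inconclusive / total * 100,
        "num_examples": total,
    }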

Using Claude Instead of GPT-4

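import json
import anthropic

# Separate name so it does not overwrite the OpenAI client used by judge_responses
claude_client = anthropic.Anthropic()  # requires ANTHROPIC_API_KEY

def judge_with_claude(customer_message, response_a, response_b):
    message = claude_client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        temperature=0,  # keep verdicts as repeatable as possible
        system=JUDGE_PROMPT,
        messages=[{
            "role": "user",
            "content": f"Customer: {customer_message}\n\nResponse A:\n{response_a}\n\nResponse B:\n{response_b}"
        }]
    )
    # Assumes the reply is bare JSON; json.loads raises if the model adds extra text
    return json.loads(message.content[0].text)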
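
Without a JSON-only response mode, a judge may occasionally wrap its verdict in extra prose or a code fence, which makes a bare json.loads call fail. A minimal, defensive parsing sketch follows. The helper name `parse_verdict` and the choice to return None for unusable replies are assumptions made for this example, not part of either SDK.

```python
import json

VALID_WINNERS = {"A", "B", "tie"}

def parse_verdict(raw_text):
    """Extract the judge's JSON verdict from raw text; return None if unusable."""
    # Take the outermost {...} span so surrounding prose or fences are ignored
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        verdict = json.loads(raw_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    winner = verdict.get("winner")
    if not isinstance(winner, str) or winner not in VALID_WINNERS:
        return None
    return verdict

print(parse_verdict('Verdict: {"winner": "B", "reasoning": "Gives concrete next steps."}'))
print(parse_verdict("I cannot decide."))  # None
```

Counting None results separately, rather than silently treating them as ties, keeps parsing failures visible in the final report.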

Putting It Together: A Complete Evaluation Script

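# eval.py: run after each training run
# Assumes load_model, load_lora_model, load_test_dataset, compute_bleu and
# compute_rouge are project helpers defined elsewhere, alongside the
# compute_perplexity, evaluate_with_swap and judge_with_claude functions above.
import json
import pandas as pd
from pathlib import Path
from transformers import pipeline

def run_evaluation(base_model_name, ft_model_path, test_dataset_path, output_path):
    """Complete evaluation pipeline."""
    
    print("Loading models...")
    base_model, base_tokenizer = load_model(base_model_name)
    ft_model, ft_tokenizer = load_lora_model(base_model_name, ft_model_path)
    
    print("Loading test data...")
    test_data = load_test_dataset(test_dataset_path)
    
    print("Computing perplexity...")
    # compute_perplexity expects a list of strings (same 'text' field as earlier)
    test_texts = [example['text'] for example in test_data]
    base_ppl = compute_perplexity(base_model, base_tokenizer, test_texts)
    ft_ppl = compute_perplexity(ft_model, ft_tokenizer, test_texts)
    
    print("Running LLM-as-Judge evaluation...")
    # evaluate_with_swap calls text-generation pipelines, not raw models
    base_gen = pipeline("text-generation", model=base_model, tokenizer=base_tokenizer,
                        max_new_tokens=256, do_sample=False)
    ft_gen = pipeline("text-generation", model=ft_model, tokenizer=ft_tokenizer,
                      max_new_tokens=256, do_sample=False)
    judge_results = evaluate_with_swap(test_data, base_gen, ft_gen, judge_with_claude)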
    
    # BLEU and ROUGE for completeness
    bleu_score = compute_bleu(ft_model, ft_tokenizer, test_data)
    rouge_scores = compute_rouge(ft_model, ft_tokenizer, test_data)
    
    report = {
        "base_perplexity": base_ppl,
        "ft_perplexity": ft_ppl,
        "perplexity_improvement": f"{(base_ppl - ft_ppl) / base_ppl * 100:.1f}%",
        "win_rate": judge_results["win_rate"],
        "bleu": bleu_score,
        "rouge_l": rouge_scores["rougeL"],
        "num_test_examples": len(test_data),
    }
    
    print(f"\n{'='*50}")
    print("EVALUATION REPORT")
    print(f"{'='*50}")
    for k, v in report.items():
        print(f"{k:30s}: {v}")
    
    with open(output_path, "w") as f:
        json.dump(report, f, indent=2)
    
    return report

What Good Numbers Look Like

After a successful fine-tuning run on a customer support dataset:

Metric	Expected Improvement
Perplexity (domain text)	20–40% reduction
BLEU	5–15% increase (often not meaningful)
ROUGE-L	10–25% increase
LLM-as-Judge win rate	65–80% (vs base model)

If your win rate is below 55%, fine-tuning has not produced a clear improvement, and below 45% it has made things worse. Common causes: the training data did not match the test set distribution, training data quality was poor, or the base model already handles the task well and fine-tuning added noise.

The LLM-as-Judge win rate is the number you should optimize. The automatic metrics are secondary context. When you ship a fine-tuned model to users, their experience is what matters — and LLM-as-Judge approximates that experience better than n-gram overlap.
