📖 Lesson ⏱️ 75 minutes

Prompt Testing and Evaluation

Build a prompt evaluation framework to measure and improve quality

Why Systematic Evaluation Matters

Gut-feel prompt testing doesn’t scale. A prompt that works on 3 examples might fail on the 4th. Systematic evaluation gives you confidence that your prompt is actually better, not just better on the handful of examples you happened to try.

Building a Test Set

A good prompt test set has:

  • 20–50 examples covering diverse input patterns
  • Edge cases: unusual inputs, short inputs, long inputs
  • Known tricky cases: inputs where naive prompts fail
  • Expected outputs for each input (your ground truth)
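
The checklist above could be captured in a small data structure. This is a minimal sketch, assuming a hypothetical `PromptCase` dataclass and a sentiment-labeling task invented purely for illustration; the field names, tags, and example inputs are not from the lesson.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptCase:
    """One test-set entry (hypothetical structure, for illustration only)."""
    input_text: str
    expected: str          # ground-truth output for this input
    tag: str = "typical"   # e.g. "typical", "edge", "tricky"


# Illustrative entries only; a real test set would hold 20-50 of these.
test_set = [
    PromptCase("The delivery was fast and the product works great.", "positive"),
    PromptCase("ok", "neutral", tag="edge"),  # very short input
    PromptCase("Great, it broke after one day.", "negative", tag="tricky"),  # sarcasm
]


def coverage(cases):
    """Count cases per tag to check that edge and tricky cases are represented."""
    return Counter(case.tag for case in cases)


print(coverage(test_set))
```

Tagging each case makes it easy to see at a glance whether edge cases and known tricky cases are actually present, rather than assuming they are.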

Scoring Rubrics

Define what “good” means before you test:

rubric = {
    "accuracy": "Does the output match the ground truth?",  # score 0-1
    "format": "Does the output follow the specified format?",  # score 0-1
    "conciseness": "Is the output appropriately concise (no padding)?",  # score 0-1
    "tone": "Is the tone appropriate for the audience?",  # score 0-1
}
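
One possible way to turn per-dimension rubric scores into a single number is an unweighted mean. The sketch below assumes every dimension is scored between 0 and 1 and weighted equally; the helper name `overall_score` and the equal weighting are assumptions for illustration, not something the lesson prescribes.

```python
RUBRIC_DIMENSIONS = ("accuracy", "format", "conciseness", "tone")


def overall_score(scores: dict[str, float]) -> float:
    """Average per-dimension scores (each in 0-1) into one number."""
    missing = [dim for dim in RUBRIC_DIMENSIONS if dim not in scores]
    if missing:
        raise ValueError(f"missing rubric dimensions: {missing}")
    for dim in RUBRIC_DIMENSIONS:
        if not 0.0 <= scores[dim] <= 1.0:
            raise ValueError(f"{dim} score must be in [0, 1], got {scores[dim]}")
    # Equal weighting is an assumption; some tasks may weight accuracy higher.
    return sum(scores[dim] for dim in RUBRIC_DIMENSIONS) / len(RUBRIC_DIMENSIONS)


example = {"accuracy": 1.0, "format": 1.0, "conciseness": 0.5, "tone": 0.5}
print(overall_score(example))  # 0.75
```

Rejecting missing or out-of-range scores up front keeps a scoring mistake from silently skewing the average.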

LLM-as-Judge

For complex tasks, use an LLM to evaluate outputs:

def build_eval_prompt(question: str, response: str, expected: str) -> str:
    """Fill in the judge prompt for one test case."""
    return f"""
Rate the following AI response on a scale of 1-5 for each dimension.
Return ONLY JSON.

User question: {question}
AI response: {response}
Ground truth: {expected}

Rate: accuracy, completeness, conciseness, format_compliance
"""

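Because the judge is asked to return only JSON, its reply can be parsed and validated before the scores are trusted. The following is a sketch under stated assumptions: the judge is assumed to reply with a flat JSON object keyed by the four dimension names, `call_llm` is a hypothetical caller-supplied function (not a real library API), and `fake_llm` is a stand-in so the example runs offline.

```python
import json

DIMENSIONS = ("accuracy", "completeness", "conciseness", "format_compliance")


def parse_judge_scores(raw: str) -> dict[str, int]:
    """Parse the judge's JSON reply and validate each 1-5 score."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON replies
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")
    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid score for {dim!r}: {value!r}")
        scores[dim] = value
    return scores


def judge(eval_prompt: str, call_llm) -> dict[str, int]:
    """Send the judge prompt through a caller-supplied function and parse the reply."""
    return parse_judge_scores(call_llm(eval_prompt))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical, returns a fixed reply)."""
    return '{"accuracy": 4, "completeness": 5, "conciseness": 3, "format_compliance": 5}'


print(judge("...filled-in judge prompt...", fake_llm))
```

Passing the model call in as a function keeps the parsing logic testable without network access, and a reply that fails validation can be retried or flagged instead of being scored.
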
A/B Comparison

Run both prompt variants on the same test set and compare their scores. Ship the new prompt only if it is meaningfully better (>5% improvement) when scored across the full test set, not just on a few examples.
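
The comparison can be sketched as a small function over per-example scores. This assumes the 5% threshold is a relative improvement in the mean score, which is one reading of the rule above; the function name `compare_variants`, its parameters, and the sample scores are hypothetical.

```python
from statistics import mean


def compare_variants(scores_a, scores_b, min_relative_gain=0.05):
    """Compare per-example scores for prompt A (current) and prompt B (candidate)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both variants must be scored on the same test set")
    mean_a, mean_b = mean(scores_a), mean(scores_b)
    if mean_a <= 0:
        raise ValueError("baseline mean must be positive to compute a relative gain")
    # Assumption: "5% improvement" is read as relative gain over the baseline mean.
    relative_gain = (mean_b - mean_a) / mean_a
    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "relative_gain": relative_gain,
        "ship_b": relative_gain > min_relative_gain,
    }


# Made-up scores for illustration: same four test cases, two prompt variants.
result = compare_variants([0.5, 0.5, 1.0, 1.0], [0.75, 0.75, 1.0, 1.0])
print(result)
```

With a test set of only 20–50 examples, a gain near the threshold can come from noise, so it may be worth rerunning the comparison or inspecting the individual examples that changed before shipping.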