Prompt Testing and Evaluation
Build a prompt evaluation framework to measure and improve quality
Why Systematic Evaluation Matters
Gut-feel prompt testing doesn’t scale. A prompt that works on 3 examples might fail on the 4th. Systematic evaluation gives you confidence that a prompt change is a real improvement, not just an improvement on the handful of examples you happened to try.
Building a Test Set
A good prompt test set has:
- 20–50 examples covering diverse input patterns
- Edge cases: unusual inputs, short inputs, long inputs
- Known tricky cases: inputs where naive prompts fail
- Expected outputs for each input (your ground truth)
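One way to hold such a test set in code is a list of small records with tags, so thin categories are easy to spot. This is only a sketch: the sentiment-labeling task, the `PromptCase` type, the `tags` field, and the example inputs are hypothetical and chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptCase:
    """One test example: an input, its ground truth, and descriptive tags."""
    input_text: str
    expected: str
    tags: tuple[str, ...] = ()


# Three illustrative cases; a real set would hold 20-50.
TEST_SET = [
    PromptCase("The battery lasts all day and charges fast.", "positive", ("typical",)),
    PromptCase("ok", "neutral", ("edge", "short")),
    PromptCase("Great, it broke after one day.", "negative", ("tricky", "sarcasm")),
]


def coverage(cases):
    """Count cases per tag to reveal categories with too few examples."""
    return Counter(tag for case in cases for tag in case.tags)
```

Calling `coverage(TEST_SET)` before each evaluation run is a cheap check that edge cases and tricky cases have not been crowded out by typical inputs.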
Scoring Rubrics
Define what “good” means before you test:
rubric = {
    "accuracy": "Does the output match the ground truth?",  # 0-1
    "format": "Does the output follow the specified format?",  # 0-1
    "conciseness": "Is the output appropriately concise (no padding)?",  # 0-1
    "tone": "Is the tone appropriate for the audience?",  # 0-1
}
LLM-as-Judge
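Not every rubric dimension needs a judge model. Where the expected output is short and the format is machine-checkable, plain code can score the 0-1 dimensions directly. A minimal sketch, assuming the specified format is a JSON object and the length budget is 50 words; the helper names, the JSON format, and the word limit are illustrative assumptions rather than part of the course:

```python
import json
from statistics import mean


def score_accuracy(output: str, expected: str) -> float:
    """1.0 on a case- and whitespace-insensitive exact match, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())


def score_format(output: str) -> float:
    """1.0 if the output parses as a JSON object (the assumed format), else 0.0."""
    try:
        return float(isinstance(json.loads(output), dict))
    except json.JSONDecodeError:
        return 0.0


def score_conciseness(output: str, max_words: int = 50) -> float:
    """1.0 if the output stays within the assumed word budget, else 0.0."""
    return float(len(output.split()) <= max_words)


def score_output(output: str, expected: str) -> dict[str, float]:
    """Score one output on the code-checkable dimensions plus their mean."""
    scores = {
        "accuracy": score_accuracy(output, expected),
        "format": score_format(output),
        "conciseness": score_conciseness(output),
    }
    scores["overall"] = mean(scores.values())
    return scores
```

Dimensions such as tone have no simple programmatic check, which is where a judge model earns its cost.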
For complex tasks, use an LLM to evaluate outputs:
# question, response, and expected come from the test case being judged
eval_prompt = f"""
Rate the following AI response on a scale of 1-5 for each dimension.
Return ONLY a JSON object with one integer score per dimension.
User question: {question}
AI response: {response}
Ground truth: {expected}
Rate: accuracy, completeness, conciseness, format_compliance
"""
A/B Comparison
Test two prompt variants on the same test set and compare scores. Only ship the new prompt if it’s meaningfully better (>5% improvement) across the full test set.
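A sketch of that comparison, assuming each variant already has one score per test case (same cases, same order) and reading ">5%" as a relative gain in mean score. Both are assumptions, since the threshold could equally mean absolute points, and `compare_variants` is a hypothetical helper:

```python
from statistics import mean


def compare_variants(scores_a, scores_b, min_relative_gain=0.05):
    """Compare per-example scores of baseline A and candidate B."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("both variants need scores for the same non-empty test set")

    mean_a, mean_b = mean(scores_a), mean(scores_b)
    if mean_a > 0:
        gain = (mean_b - mean_a) / mean_a
    else:
        # A zero baseline makes relative gain undefined; treat any
        # positive candidate mean as an unbounded improvement.
        gain = float("inf") if mean_b > 0 else 0.0

    return {
        "mean_a": mean_a,
        "mean_b": mean_b,
        "relative_gain": gain,
        # Per-example wins and losses show whether the gain is broad
        # or driven by a few cases.
        "wins": sum(b > a for a, b in zip(scores_a, scores_b)),
        "losses": sum(b < a for a, b in zip(scores_a, scores_b)),
        "ship_b": gain > min_relative_gain,
    }


result = compare_variants([0.5, 0.5, 1.0, 0.0], [1.0, 0.5, 1.0, 0.5])
```

On a set of only 20–50 examples a small difference in means can come from a handful of cases, so it is worth reading the `wins` and `losses` counts alongside `ship_b` rather than relying on the threshold alone.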
