📖 Lesson ⏱️ 90 minutes

Agent Evaluation: Measuring What Matters

How to measure task completion, tool accuracy, and reasoning quality

The Problem with “Does It Work?”

You’ve built a web research agent. You run it on a few test queries, it produces reasonable-looking answers, and you ship it. Three weeks later, users are complaining that it gets questions wrong, goes in circles, and sometimes returns confident nonsense.

What went wrong? You never actually measured whether it worked.

“Testing” an agent by eyeballing a few outputs isn’t evaluation; it’s hoping. Real evaluation means defining what success looks like, building a test set that covers your use cases, and running automated measurements that produce a score. That score should go up as you improve the agent and alert you when it drops.

Agent evaluation is harder than model evaluation. You’re not just checking an output against a label — you’re measuring a process: did the agent take the right steps, in the right order, using the right tools? This lesson gives you a framework for doing that rigorously.

The Four Dimensions of Agent Quality

1. Task Completion Rate

Did the agent finish the task at all?

This is the most basic metric. An agent that runs into an error, hits max iterations without a result, or produces “I’m unable to help with that” on a valid task has a completion failure.

def measure_task_completion(agent_response: dict) -> bool:
    """Check if the agent completed the task."""
    # Check for explicit failure signals
    failure_phrases = [
        "i'm unable to", "i cannot", "i don't have access",
        "task incomplete", "max iterations", "error occurred"
    ]
    
    # Treat a missing or None output as an empty string
    output = (agent_response.get("output") or "").lower()
    
    if any(phrase in output for phrase in failure_phrases):
        return False
    
    # Check if the agent actually produced meaningful content
    if len(output.strip()) < 50:  # Suspiciously short response
        return False
    
    return True

A good baseline to aim for: 95%+ task completion on your defined test cases. Below 90% means your agent has reliability problems that will frustrate users.
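
To turn per-test pass/fail results into the rate discussed above, a minimal sketch might look like this. The helper names and the "needs attention" label for the 90-95% band are illustrative assumptions, not part of the lesson's framework.

```python
def completion_rate(outcomes: list[bool]) -> float:
    """Fraction of test cases the agent completed (0.0 for an empty list)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def classify_reliability(rate: float) -> str:
    """Map a completion rate onto the thresholds suggested in this lesson."""
    if rate >= 0.95:
        return "on target"
    if rate >= 0.90:
        return "needs attention"  # assumed label for the in-between band
    return "reliability problem"


# 19 of 20 hypothetical test cases completed
outcomes = [True] * 19 + [False]
rate = completion_rate(outcomes)
print(f"{rate:.0%} -> {classify_reliability(rate)}")  # prints: 95% -> on target
```

With only 20 test cases, a single failure moves the rate by five points, so treat small shifts with caution.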

2. Tool Accuracy

Did the agent call the right tool with the right parameters?

This is trickier to measure because you need to define what the “right” tool call looks like. For a research agent, a question about recent events should trigger web_search, not wikipedia, while a question about well-established concepts should prefer wikipedia for depth. The function below compares tool names only; it does not inspect the parameters passed to each tool.

def evaluate_tool_calls(
    actual_steps: list,
    expected_tool_sequence: list[str]
) -> dict:
    """
    Compare the tools the agent called against the expected tools.
    Only tool names are compared; call order and arguments are ignored.

    actual_steps: list of (AgentAction, observation) tuples from the agent
    expected_tool_sequence: list of expected tool names (order is not checked)
    """
    actual_tools = [step[0].tool for step in actual_steps]
    
    # Check if all expected tools were called (order-independent)
    expected_set = set(expected_tool_sequence)
    actual_set = set(actual_tools)
    
    coverage = len(expected_set & actual_set) / len(expected_set) if expected_set else 1.0
    
    # Check for unnecessary tool calls (hallucinated steps)
    unnecessary = [t for t in actual_tools if t not in expected_set]
    
    return {
        "tool_coverage": coverage,           # Did it use all expected tools?
        "unnecessary_calls": unnecessary,    # What extra calls did it make?
        "total_calls": len(actual_tools),    # How many total calls?
        "expected_calls": len(expected_tool_sequence)
    }
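
If you also care about call order and parameters, one possible extension is sketched here. `ToolCall` is a hypothetical stand-in for whatever step record your agent framework produces, and exact-match argument checking is an assumption that is probably too strict for free-text search queries.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one agent step: tool name plus its arguments."""
    tool: str
    tool_input: dict


def ordered_match_ratio(actual: list[str], expected: list[str]) -> float:
    """Share of the expected sequence that appears in order in the actual calls.

    Uses the length of the longest common subsequence, so extra calls in
    between do not count against the agent, but swapped order does.
    """
    if not expected:
        return 1.0
    dp = [[0] * (len(expected) + 1) for _ in range(len(actual) + 1)]
    for i, a in enumerate(actual, start=1):
        for j, e in enumerate(expected, start=1):
            if a == e:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / len(expected)


def arguments_match(call: ToolCall, required_args: dict) -> bool:
    """True if every required argument is present with exactly the expected value."""
    return all(call.tool_input.get(k) == v for k, v in required_args.items())


# The agent searched the web first, but the test expected wikipedia first
ratio = ordered_match_ratio(["web_search", "wikipedia"], ["wikipedia", "web_search"])
print(ratio)  # 0.5: only one of the two expected calls is in the right order
```

A set-based coverage score would rate this trajectory 1.0, which is why an order-aware score can be a useful second signal.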

3. Answer Quality

Is the final answer actually correct and useful?

This is the hardest dimension to pin down. For research summaries, “correct” is a spectrum rather than a binary yes/no. You need at least one of:

  • Reference answers: Human-written gold standards to compare against
  • LLM-as-judge: Use a separate LLM to score the answer (covered below)
  • Factual verification: Check specific claims in the answer against known facts
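
For the reference-answer option, one simple baseline is token-overlap F1, a score commonly used in question-answering evaluation. This is a sketch under that assumption, not the only way to compare against a gold answer; the function names are illustrative.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(answer: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    answer_counts = Counter(tokenize(answer))
    reference_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((answer_counts & reference_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(answer_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)


score = token_f1(
    answer="Flash Attention saves memory",
    reference="Flash Attention",
)
print(round(score, 2))  # 0.67: full recall, but half the answer tokens are extra
```

Token overlap rewards shared wording, not shared meaning, so a correct paraphrase can score low. It works best as a cheap first filter.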

4. Trajectory Efficiency

Did the agent take an efficient path?

An agent that calls search 7 times when 2 would suffice is inefficient — it’s slower and costs more. Measure this:

def measure_trajectory_efficiency(
    actual_steps: int,
    optimal_steps: int
) -> float:
    """
    Returns a score between 0 and 1.
    1.0 = perfectly efficient, took exactly optimal steps.
    0.5 = took twice as many steps as needed.
    """
    if actual_steps == 0:
        return 0.0
    return min(1.0, optimal_steps / actual_steps)
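
Under this formula, the agent that searches 7 times when 2 would do scores 2/7, or about 0.29. Step count alone doesn't tell you whether the agent was going in circles, though. A complementary check, sketched here with an assumed `(tool, input)` tuple format, counts calls that exactly repeat an earlier one.

```python
from collections import Counter


def repeated_call_ratio(calls: list[tuple[str, str]]) -> float:
    """Fraction of calls that exactly repeat an earlier (tool, input) pair."""
    if not calls:
        return 0.0
    counts = Counter(calls)
    # Every occurrence after the first one of a pair is a repeat
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(calls)


# Hypothetical trajectory: the same search is issued three times
calls = [
    ("web_search", "flash attention"),
    ("web_search", "flash attention"),
    ("wikipedia", "Attention (machine learning)"),
    ("web_search", "flash attention"),
]
print(repeated_call_ratio(calls))  # 0.5: two of the four calls were repeats
```

Exact matching misses near-duplicate queries, so this is a lower bound on how much looping actually happened.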

Building a Test Suite

A good evaluation requires a systematic test set. Here’s a template for research agent evaluation:

# test_cases.py

TEST_CASES = [
    {
        "id": "TC001",
        "query": "What is Flash Attention and why was it invented?",
        "expected_tools": ["wikipedia"],  # Background concept, not breaking news
        "expected_answer_contains": ["memory", "attention", "GPU", "efficient"],
        "optimal_steps": 2,
        "category": "technical_background"
    },
    {
        "id": "TC002", 
        "query": "What LLM models were released in the last 3 months?",
        "expected_tools": ["web_search"],  # Requires current info
        "expected_answer_contains": ["model", "release", "2025", "2026"],
        "optimal_steps": 2,
        "category": "current_events"
    },
    {
        "id": "TC003",
        "query": "Compare BERT and GPT architectures. What are their key differences?",
        "expected_tools": ["wikipedia", "web_search"],  # Both background + current context
        "expected_answer_contains": ["encoder", "decoder", "pre-training", "bidirectional"],
        "optimal_steps": 3,
        "category": "comparison"
    },
    {
        "id": "TC004",
        "query": "What is the current state of the art accuracy on ImageNet?",
        "expected_tools": ["web_search"],
        "expected_answer_contains": ["percent", "accuracy", "top-1"],
        "optimal_steps": 2,
        "category": "benchmarks"
    },
    # ... 16 more test cases
]

Design your test cases to cover:

  • Different task types (factual, analytical, comparative, current events)
  • Different complexity levels (1 tool call vs 3+)
  • Edge cases (ambiguous queries, multi-part questions)
  • Failure cases (queries outside the agent’s capability)
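
Before running anything, it can help to sanity-check the test suite itself. The sketch below assumes the dictionary layout from the template above; the function names are illustrative.

```python
from collections import Counter

REQUIRED_KEYS = {
    "id", "query", "expected_tools",
    "expected_answer_contains", "optimal_steps", "category",
}


def validate_test_cases(test_cases: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the suite looks sane."""
    problems = []
    seen_ids = set()
    for case in test_cases:
        case_id = case.get("id", "<missing id>")
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            problems.append(f"{case_id}: missing keys {sorted(missing)}")
        if case_id in seen_ids:
            problems.append(f"{case_id}: duplicate id")
        seen_ids.add(case_id)
    return problems


def category_counts(test_cases: list[dict]) -> Counter:
    """Count test cases per category to spot thin coverage."""
    return Counter(case.get("category", "uncategorized") for case in test_cases)


# Two hypothetical cases: the second is missing fields and reuses an id
suite = [
    {"id": "TC001", "query": "q1", "expected_tools": ["wikipedia"],
     "expected_answer_contains": ["memory"], "optimal_steps": 2,
     "category": "technical_background"},
    {"id": "TC001", "query": "q2", "category": "current_events"},
]
for problem in validate_test_cases(suite):
    print(problem)
print(category_counts(suite))
```

A category with only one or two cases is a hint that the suite won't tell you much about that kind of query.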

The Automated Evaluation Framework

import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvaluationResult:
    test_id: str
    completed: bool
    tool_coverage: float
    unnecessary_calls: list
    answer_keyword_score: float
    trajectory_efficiency: float
    overall_score: float
    notes: str = ""


def evaluate_agent_on_test(
    agent_executor,
    test_case: dict
) -> EvaluationResult:
    """Run a single test case and return metrics."""
    
    try:
        result = agent_executor.invoke(
            {"input": test_case["query"]},
        )
        
        completed = measure_task_completion(result)
        
        tool_metrics = evaluate_tool_calls(
            result.get("intermediate_steps", []),
            test_case["expected_tools"]
        )
        
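        # Fraction of expected keywords found in the answer (substring match)
        output_lower = (result.get("output") or "").lower()
        expected_keywords = test_case["expected_answer_contains"]
        keyword_matches = sum(
            1 for kw in expected_keywords
            if kw.lower() in output_lower
        )
        # Avoid dividing by zero when a test case lists no keywords
        keyword_score = keyword_matches / len(expected_keywords) if expected_keywords else 1.0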
        
        efficiency = measure_trajectory_efficiency(
            actual_steps=len(result.get("intermediate_steps", [])),
            optimal_steps=test_case["optimal_steps"]
        )
        
        overall = (
            (1.0 if completed else 0.0) * 0.3 +
            tool_metrics["tool_coverage"] * 0.25 +
            keyword_score * 0.3 +
            efficiency * 0.15
        )
        
        return EvaluationResult(
            test_id=test_case["id"],
            completed=completed,
            tool_coverage=tool_metrics["tool_coverage"],
            unnecessary_calls=tool_metrics["unnecessary_calls"],
            answer_keyword_score=keyword_score,
            trajectory_efficiency=efficiency,
            overall_score=overall
        )
    
    except Exception as e:
        return EvaluationResult(
            test_id=test_case["id"],
            completed=False,
            tool_coverage=0.0,
            unnecessary_calls=[],
            answer_keyword_score=0.0,
            trajectory_efficiency=0.0,
            overall_score=0.0,
            notes=f"Exception: {str(e)}"
        )


def run_full_evaluation(agent_executor, test_cases: list) -> dict:
    """Run all test cases and aggregate metrics."""
    results = []
    
    for test in test_cases:
        print(f"Running {test['id']}: {test['query'][:60]}...")
        result = evaluate_agent_on_test(agent_executor, test)
        results.append(result)
        print(f"  Score: {result.overall_score:.2f} | Complete: {result.completed}")
    
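    if not results:
        raise ValueError("test_cases is empty; nothing to evaluate")

    # Aggregate metrics (simple means over all test cases)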
    completion_rate = sum(1 for r in results if r.completed) / len(results)
    avg_tool_coverage = sum(r.tool_coverage for r in results) / len(results)
    avg_keyword_score = sum(r.answer_keyword_score for r in results) / len(results)
    avg_efficiency = sum(r.trajectory_efficiency for r in results) / len(results)
    avg_overall = sum(r.overall_score for r in results) / len(results)
    
    report = {
        "total_tests": len(results),
        "completion_rate": completion_rate,
        "avg_tool_coverage": avg_tool_coverage,
        "avg_answer_quality": avg_keyword_score,
        "avg_efficiency": avg_efficiency,
        "overall_score": avg_overall,
        "failing_tests": [r.test_id for r in results if r.overall_score < 0.6],
        "raw_results": [vars(r) for r in results]
    }
    
    return report
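
A single average can hide a weak spot in one category. As a sketch, assuming result dictionaries shaped like the `raw_results` entries above and test cases shaped like the earlier template, you could break the overall score down per category like this.

```python
from collections import defaultdict


def scores_by_category(raw_results: list[dict], test_cases: list[dict]) -> dict[str, float]:
    """Average overall_score per test-case category."""
    category_of = {case["id"]: case["category"] for case in test_cases}
    buckets = defaultdict(list)
    for result in raw_results:
        # Results whose id is not in the suite are grouped under "unknown"
        category = category_of.get(result["test_id"], "unknown")
        buckets[category].append(result["overall_score"])
    return {category: sum(scores) / len(scores) for category, scores in buckets.items()}


# Hypothetical results for three test cases
cases = [
    {"id": "TC001", "category": "technical_background"},
    {"id": "TC002", "category": "current_events"},
    {"id": "TC004", "category": "current_events"},
]
results = [
    {"test_id": "TC001", "overall_score": 1.0},
    {"test_id": "TC002", "overall_score": 0.5},
    {"test_id": "TC004", "overall_score": 0.25},
]
print(scores_by_category(results, cases))
# {'technical_background': 1.0, 'current_events': 0.375}
```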

The LLM-as-Judge Pattern

Keyword matching is a blunt instrument. For evaluating reasoning quality and answer correctness, use a second LLM as a judge. This scales where human evaluation doesn’t.

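import json
from typing import Optional

import anthropic

# Reads the API key from the ANTHROPIC_API_KEY environment variable
judge_client = anthropic.Anthropic()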

def llm_judge_answer(
    question: str,
    agent_answer: str,
    reference_answer: Optional[str] = None
) -> dict:
    """Use Claude to evaluate the quality of an agent's answer."""
    
    reference_context = ""
    if reference_answer:
        reference_context = f"\nReference answer (ground truth): {reference_answer}"
    
    prompt = f"""You are evaluating the quality of an AI agent's response.

Question asked: {question}

Agent's answer: {agent_answer}
{reference_context}

Evaluate the answer on these criteria (score each 1-5):
1. Accuracy: Is the information factually correct?
2. Completeness: Does it fully address the question?
3. Reasoning quality: Is the answer well-reasoned and logical?
4. Conciseness: Is it appropriately concise without missing key points?

Respond in JSON format:
{{
  "accuracy": <1-5>,
  "completeness": <1-5>,
  "reasoning_quality": <1-5>,
  "conciseness": <1-5>,
  "overall": <1-5>,
  "strengths": "...",
  "weaknesses": "...",
  "verdict": "pass" or "fail"
}}"""
    
    response = judge_client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    
    try:
        return json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return {"error": "Could not parse judge response", "raw": response.content[0].text}


# Example usage
judgement = llm_judge_answer(
    question="What is Flash Attention and why was it invented?",
    agent_answer="""Flash Attention is a memory-efficient attention algorithm introduced in 2022. 
    It reformulates the standard attention computation to work in tiles, keeping data 
    in fast SRAM instead of slow HBM (GPU memory). This reduces memory usage from O(n²) 
    to O(n) and significantly speeds up training of long-context transformers."""
)
print(json.dumps(judgement, indent=2))
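
Judge models sometimes wrap the JSON in extra prose, which makes a bare `json.loads` fail. A more forgiving parser might look like the sketch below. The function name is illustrative, and the greedy regex assumes the reply contains a single JSON object.

```python
import json
import re


def extract_json_object(text: str) -> dict:
    """Return the JSON object found in text, or raise ValueError if there is none."""
    candidates = [text]
    # Greedy match from the first "{" to the last "}" in the reply.
    # This will not parse if the reply contains two separate objects.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    raise ValueError("no JSON object found in judge reply")


reply = 'Sure, here is my evaluation: {"overall": 4, "verdict": "pass"} Let me know if you need more.'
print(extract_json_object(reply))  # {'overall': 4, 'verdict': 'pass'}
```

If the reply still can't be parsed, counting it as a judge error rather than a failing grade keeps parser problems from dragging down the agent's score.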

The LLM-as-judge pattern is powerful but has a known bias: LLMs tend to prefer longer, more detailed answers even when concise answers are better. Mitigate this by explicitly including a conciseness criterion and tuning the rubric to your use case.
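
One way to check whether a rubric fits your use case is to compare the judge's verdicts with a small set of human labels. The sketch below uses Cohen's kappa, which corrects raw agreement for the agreement you'd expect by chance. The labels are made-up examples, and the choice of kappa is an assumption rather than something this lesson prescribes.

```python
def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two label lists (1.0 = perfect)."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    labels = set(judge) | set(human)
    # Agreement expected if both raters labelled at random with their own base rates
    expected = sum((judge.count(label) / n) * (human.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on the same four answers
judge_verdicts = ["pass", "pass", "fail", "fail"]
human_verdicts = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(judge_verdicts, human_verdicts))  # 0.5
```

Raw agreement here is 75%, but kappa is only 0.5 because half of that agreement would be expected by chance. If the disagreements cluster on long answers, that points at the length bias described above.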

Regression Testing: Catching Quality Drops Early

Evaluation isn’t just for measuring quality — it’s for catching when things get worse. Set up a regression test that runs on every code change:

BASELINE_SCORES = {
    "completion_rate": 0.95,
    "avg_tool_coverage": 0.88,
    "avg_answer_quality": 0.80,
    "overall_score": 0.85
}

def check_for_regressions(current_report: dict, tolerance: float = 0.05) -> list[str]:
    """Return list of regression warnings if any metric dropped significantly."""
    warnings = []
    
    for metric, baseline in BASELINE_SCORES.items():
        current = current_report.get(metric, 0)
        if current < baseline - tolerance:
            warnings.append(
                f"REGRESSION: {metric} dropped from {baseline:.2f} to {current:.2f}"
            )
    
    return warnings

Run this in CI/CD before deploying agent updates. If a prompt change or model upgrade causes a regression, you catch it before users do.
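
In CI, the usual way to block a deploy is a non-zero exit code. Here is a minimal sketch of that wiring. The `gate` function and the sample numbers are assumptions, and the comparison logic is restated so the example runs on its own.

```python
def find_regressions(report: dict, baseline: dict, tolerance: float = 0.05) -> list[str]:
    """List metrics that fell more than `tolerance` below their baseline."""
    problems = []
    for metric, expected in baseline.items():
        current = report.get(metric, 0.0)  # a missing metric counts as 0
        if current < expected - tolerance:
            problems.append(f"{metric}: {expected:.2f} -> {current:.2f}")
    return problems


def gate(report: dict, baseline: dict, tolerance: float = 0.05) -> int:
    """Return a process exit code: 0 if clean, 1 if any metric regressed."""
    problems = find_regressions(report, baseline, tolerance)
    for line in problems:
        print(f"REGRESSION {line}")
    return 1 if problems else 0


# In a CI script, the return value would typically be passed to sys.exit(...)
baseline = {"completion_rate": 0.95, "overall_score": 0.85}
report = {"completion_rate": 0.80, "overall_score": 0.86}
exit_code = gate(report, baseline)  # prints: REGRESSION completion_rate: 0.95 -> 0.80
print(exit_code)  # 1
```

Because agent outputs vary from run to run, a tolerance that is too tight will make this gate flaky. Averaging over several runs before comparing is one way to steady it.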

A Practical Evaluation Workflow

For a typical agent development cycle:

  1. Define 20-30 test cases covering your key use cases before writing a line of code
  2. Run evaluation after each significant change — new tools, updated prompts, different models
  3. Use keyword scoring for fast feedback during development
  4. Use LLM-as-judge weekly for deeper quality assessment (it’s slower and costs more)
  5. Flag and manually review any test case scoring below 0.6
  6. Track scores over time — plot them on a chart so regressions are visually obvious
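
For step 6, you need the scores stored somewhere before you can chart them. One lightweight option, sketched here with an assumed JSON Lines file and illustrative function names, is to append one line per evaluation run. Plotting is left to whatever charting tool you already use.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_to_history(report: dict, history_path: Path, label: str = "") -> None:
    """Append the numeric metrics of one evaluation run as a single JSON line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "label": label,
        # Keep only scalar metrics; lists such as raw_results are skipped
        **{k: v for k, v in report.items() if isinstance(v, (int, float))},
    }
    with history_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")


def load_history(history_path: Path) -> list[dict]:
    """Read all recorded runs, oldest first."""
    if not history_path.exists():
        return []
    with history_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "eval_history.jsonl"
    append_to_history({"overall_score": 0.85, "raw_results": []}, path, label="baseline")
    append_to_history({"overall_score": 0.88, "raw_results": []}, path, label="new prompt")
    trend = [run["overall_score"] for run in load_history(path)]

print(trend)  # [0.85, 0.88]
```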

Summary

  • Task completion rate measures whether the agent finished the job at all
  • Tool accuracy measures whether it called the right tools with the right parameters
  • Answer quality can be measured with keyword matching (fast but crude) or LLM-as-judge (slower and costlier, but more nuanced)
  • Trajectory efficiency measures how many steps the agent took vs the optimal
  • Build a test set of 20+ cases before shipping — cover different task types and edge cases
  • Use regression testing to catch when changes make the agent worse
  • LLM-as-judge scales where human evaluation doesn’t — but calibrate it to your specific quality criteria

Next: Multi-Agent Systems — what happens when one agent isn’t enough.