
What good AI agent evaluation looks like

A chatbot gives you information, but an agent takes action. That distinction changes what "correct" means in ways that aren't obvious until you start evaluating agent output in practice.

When you ask a chatbot for a recipe, it tells you what to cook. An agent given the same goal can check your grocery list, order missing ingredients, and block time on your calendar to bake. One produces text in response to a prompt. The other operates across multiple steps, using tools to complete a task in the world. That operational difference means the standards for evaluating quality have to shift accordingly.

The loop at the center of everything

AI agents work in a continuous cycle: observe, think, act. The agent gathers information about its current situation, develops a plan, executes an action by calling a tool, observes the result, and loops again.
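
Stripped of any particular framework, the loop itself is short. Here is a minimal sketch in Python, with the planning model and the tool layer injected as placeholder functions rather than calls to any real library:

```python
def run_agent(goal: str, plan, call_tool, max_steps: int = 10) -> str:
    """Observe-think-act loop. `plan` maps the history to the next
    action; `call_tool` executes a named tool. Both are injected
    placeholders, since the model and tool layer vary by framework."""
    history = [f"Goal: {goal}"]            # everything observed so far
    for _ in range(max_steps):
        action = plan(history)             # think: choose the next step
        if action["type"] == "finish":     # plan says the goal is met
            return action["answer"]
        result = call_tool(action["tool"], action["args"])  # act
        history.append(f"{action['tool']} -> {result}")     # observe
    return "stopped: step budget exhausted"
```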

Consider asking an agent to book a flight to San Francisco next month. It observes the goal and thinks through what it needs first: available flights. It calls a flight search tool with the relevant parameters, observes the results, uses those dates to inform a hotel search, and eventually presents a complete trip plan. The quality of the agent's output isn't located in any single step. It's in the trajectory: whether the chain of observations, plans, and actions follows a logical path from goal to completion.
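
Concretely, the artifact under review might look something like the hypothetical trajectory record below, where the tool names and observations are illustrative. The evaluator reads the whole chain, not any single entry:

```python
# A hypothetical trajectory for the flight-booking example. The unit of
# evaluation is the chain: does each step follow logically from the last?
trajectory = [
    {"think": "Need flight options before anything else",
     "act": ("search_flights", {"destination": "SFO", "month": "next"}),
     "observe": "3 options found; cheapest departs on the 12th"},
    {"think": "Anchor the hotel search to the chosen flight dates",
     "act": ("search_hotels", {"city": "San Francisco",
                               "checkin": "the 12th",
                               "checkout": "the 15th"}),
     "observe": "4 hotels with availability near downtown"},
    {"think": "Enough information to assemble a complete plan",
     "act": ("present_plan", {}),
     "observe": "User shown a combined flight and hotel itinerary"},
]
```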

This is what makes agent evaluation different from evaluating a chatbot response. A chatbot response can be assessed on its own terms. An agent response needs to be assessed in context: does this step move the task forward without introducing new problems?

System prompts and what they constrain

Before evaluating any agent action, the first question is whether it follows the system prompt. A system prompt is a set of universal rules that govern the agent's behavior throughout an interaction, and it functions as the authoritative reference for everything the agent does.

Failure to follow the system prompt is a critical failure, even when an individual step is technically correct. Consider an agent instructed to "always ask for explicit confirmation before making changes in production." An agent that modifies production code without asking, even if the modification is correct, has failed on the most fundamental criterion. The analogy is a software engineer who produces good code but ignores the team's review process: the output might be fine, but the process failure still matters and would matter more at scale.
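
Rules like this can often be checked mechanically. A sketch of one such gate, using hypothetical tool names for the confirmation and production-change steps:

```python
def violates_confirmation_rule(tool_calls: list[str]) -> bool:
    """True if the agent touched production before explicitly asking
    for confirmation. Tool names here are hypothetical."""
    confirmed = False
    for tool in tool_calls:
        if tool == "ask_confirmation":
            confirmed = True
        elif tool == "modify_production" and not confirmed:
            return True   # change made before explicit confirmation
    return False

# A technically correct change still fails the gate if it skipped the rule.
assert violates_confirmation_rule(["read_code", "modify_production"])
```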

How to stress-test an agent

One way to assess an agent's real capabilities is to give it low-context or low-specificity prompts and observe how it handles the ambiguity.

Low-context prompts test whether the agent can identify what information it's missing before acting. "Something's breaking when a buyer clicks the purchase button, can you look at the code and fix it?" is deliberately underspecified. A capable agent asks clarifying questions before touching the codebase. A poor one guesses at parameters, hallucinates details that weren't provided, and makes changes that may or may not address the actual problem. The ability to recognize the limits of available information, and ask rather than assume, is a meaningful capability to train for.

Low-specificity prompts test whether the agent can plan a logical trajectory without being told each step. "Make me a restaurant reservation at an Italian restaurant at 5pm" doesn't specify how to do it. A capable agent checks the user's location first, searches for available restaurants, confirms hours, and then proceeds. A naive agent might skip the location check and book somewhere without knowing whether it's near the user or even open. Prompts that require multi-step reasoning without spelling out each step reveal whether an agent understands context or just follows explicit instructions.
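
One way to operationalize both tests is a small table of underspecified prompts paired with the behavior a capable agent should show before acting. The entries below are illustrative, drawn from the examples above, not a standard benchmark:

```python
# Illustrative stress-test cases: prompt plus the pre-action behavior
# an evaluator should look for.
STRESS_TESTS = [
    {"kind": "low-context",
     "prompt": ("Something's breaking when a buyer clicks the purchase "
                "button, can you look at the code and fix it?"),
     "expect": "asks for the error message or repro steps before editing"},
    {"kind": "low-specificity",
     "prompt": ("Make me a restaurant reservation at an Italian "
                "restaurant at 5pm"),
     "expect": "checks the user's location and restaurant hours "
               "before booking"},
]
```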

Why failure recovery is a core metric

A good agent is a resilient agent. If a tool call fails because the agent passed an invalid date, a capable agent reads the error message, identifies what went wrong, and retries with a valid date. A poorly designed agent runs the same broken call until it exhausts its retry limit.
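
The difference between the two behaviors fits in a few lines. A sketch of error-aware retry, with the tool layer and the model's repair step injected as placeholder functions:

```python
def resilient_call(call_tool, repair_args, tool: str, args: dict,
                   max_retries: int = 3):
    """Error-aware retry: read the failure, repair the arguments, try
    again. `call_tool` and `repair_args` stand in for the tool layer
    and the model's diagnosis step; neither is a real library call."""
    for _ in range(max_retries):
        result = call_tool(tool, args)
        if "error" not in result:
            return result
        # The key behavior: change something based on the error message,
        # rather than re-running the identical broken call.
        args = repair_args(tool, args, result["error"])
    raise RuntimeError(f"{tool} still failing after {max_retries} attempts")
```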

In evaluation, resilience and correctness are scored separately. An agent that fails on step three, correctly diagnoses the problem, and recovers on step four is producing good output. An agent that arrives at the right answer through an illogical path is still penalized, because the trajectory reflects how the agent would behave across a range of scenarios, not just this one.
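
One way to keep those axes separate is to record them as independent fields. A hypothetical rubric:

```python
from dataclasses import dataclass

@dataclass
class AgentScore:
    task_completed: bool         # did the final state satisfy the goal?
    trajectory_sound: bool       # logical path from goal to completion?
    recovered_from_errors: bool  # diagnosed failures and adapted?

# Right answer via an illogical path: penalized on the trajectory axis.
lucky_run = AgentScore(task_completed=True,
                       trajectory_sound=False,
                       recovered_from_errors=False)
```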

The practical frame for evaluators: you're a senior engineer reviewing a junior engineer's work. The junior engineer is talented and, given clear goals and context, gets things done effectively. Your job is to assess whether the decisions were sound, how the process would hold up under different conditions, and whether problems were handled in a way that reflects competence, not just a lucky outcome.
