How to evaluate an agent

“It works on my prompt” is where most agent evaluation starts and stops. You run the agent on a task you have in your head, read the final message, decide it looks right, and ship. This holds up until the agent meets a task you didn’t think of — and then you have no way to tell whether a change made things better or worse.

Scoring an agent is not the same as scoring a single model call. A model call has one input and one output; you grade the output. An agent has a goal, calls tools, reads results, changes state, and decides when to stop. The final answer is only part of what you’re grading. The path matters too: which tools it called, in what order, and whether it recovered when something broke.

So the first question is what to measure.

What to measure

I find it useful to split the metrics into three layers.

Did it take the right actions? Did it call the tools you expected, with sane arguments? The tool-name check is deterministic: you have a list of expected tool names and you compare it against the tools the agent actually called. No model needed. It’s cheap, and it catches a whole class of regressions where the agent quietly stops using a tool it should.

Did it execute well? Did it finish the task, and without wasting steps? “Finished the task” needs a judge: another model that reads the run and the end state and checks them against a description of the task. “Wasted steps” is the same idea: a judge comparing the actions taken to what the task actually required.

Did the output clear a bar? This is the rubric: a pass/fail criterion in plain language, checked by a judge. For example: “PASS only if the invoice totals 78.30 and the out-of-stock item is backordered.” The rubric is where you write down the thing you actually care about.

None of this is exotic. The trap is stopping at “did the final message look plausible,” which a judge will happily rubber-stamp.
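
The first layer is simple enough to sketch in a few lines. This is a minimal illustration, not eval-go's implementation: the function name `ToolCorrectness` and the scoring rule (fraction of expected tools that were called at least once, ignoring order and arguments) are assumptions made for the example.

```go
package main

import "fmt"

// ToolCorrectness returns the fraction of expected tools that appear at least
// once among the tools the agent called. Order, repeats and arguments are
// ignored in this sketch (hypothetical scoring rule).
func ToolCorrectness(expected, called []string) float64 {
	if len(expected) == 0 {
		return 1.0 // nothing was required, so nothing can be missing
	}
	seen := make(map[string]bool, len(called))
	for _, name := range called {
		seen[name] = true
	}
	hit := 0
	for _, name := range expected {
		if seen[name] {
			hit++
		}
	}
	return float64(hit) / float64(len(expected))
}

func main() {
	expected := []string{"read_file", "write_file"}

	// Every expected tool was used: full score.
	fmt.Println(ToolCorrectness(expected, []string{"read_file", "read_file", "write_file"}))

	// write_file was never called: half of the expected tools are missing.
	fmt.Println(ToolCorrectness(expected, []string{"read_file"}))
}
```

Checking arguments as well would need a per-task expectation for each call, which is why the name check alone is usually the first thing to automate.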

Run the agent, don’t grade a transcript

A lot of eval tooling takes a frozen transcript — input, output, recorded tool calls — and scores it. That’s fine for a regression set, but it has a ceiling: you can only evaluate runs you already captured. Change the agent and you have to go re-capture everything by hand.

I’d rather the harness run the agent itself: take a task, drive the agent, record what happened, and score the run, all in one pass. That’s the shape I landed on in eval-go:

bench := evalgo.Bench{
    Target:  myAgent, // anything that runs a task and reports what it did
    Tasks:   tasks,   // input + expected tools + rubric (+ seed files)
    Metrics: metrics, // tool_correctness, task_completion, step_efficiency, rubric
}
report, err := bench.Run(ctx)
if err != nil {
    log.Fatal(err) // assuming the second return value is an error
}

The agent under test is just a Target — something that takes a task and reports back. It can run in-process, or as a subprocess in any language. That last part matters more than it sounds: it lets you put a Go agent, a Python agent and a Rust agent on the same tasks and compare them directly.
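
To make "something that takes a task and reports back" concrete, here is one way such an abstraction could look. The types below (`Task`, `Result`, `Target`, `TargetFunc`, `Outcome`, `RunAll`) are hypothetical and written for this example; they are not eval-go's actual definitions.

```go
package main

import (
	"context"
	"fmt"
)

// Task and Result are hypothetical shapes for this sketch.
type Task struct {
	Name  string
	Input string
}

type Result struct {
	Output    string   // the agent's final message
	ToolCalls []string // names of the tools it called, in order
}

// Target is anything that can run a task and report what it did.
type Target interface {
	Run(ctx context.Context, task Task) (Result, error)
}

// TargetFunc adapts a plain function to the Target interface.
type TargetFunc func(ctx context.Context, task Task) (Result, error)

func (f TargetFunc) Run(ctx context.Context, task Task) (Result, error) {
	return f(ctx, task)
}

// Outcome keeps the error next to the result so one failing target
// does not stop the rest of the grid from running.
type Outcome struct {
	Result Result
	Err    error
}

// RunAll drives every target over every task; the grid is keyed by
// task name, then target name.
func RunAll(ctx context.Context, targets map[string]Target, tasks []Task) map[string]map[string]Outcome {
	grid := make(map[string]map[string]Outcome, len(tasks))
	for _, task := range tasks {
		row := make(map[string]Outcome, len(targets))
		for name, target := range targets {
			res, err := target.Run(ctx, task)
			row[name] = Outcome{Result: res, Err: err}
		}
		grid[task.Name] = row
	}
	return grid
}

// demoGrid runs a trivial in-process target over a single task.
func demoGrid() map[string]map[string]Outcome {
	echo := TargetFunc(func(ctx context.Context, task Task) (Result, error) {
		return Result{Output: "done: " + task.Input, ToolCalls: []string{"read_file"}}, nil
	})
	targets := map[string]Target{"echo": echo}
	tasks := []Task{{Name: "demo", Input: "hello"}}
	return RunAll(context.Background(), targets, tasks)
}

func main() {
	fmt.Println(demoGrid()["demo"]["echo"].Result.Output)
}
```

A subprocess target would satisfy the same interface by launching a binary and exchanging the task and result over some agreed format; the harness code above would not need to change.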

A concrete task

Here’s one I use. It’s an order-fulfilment scenario over a small file “database” the agent reads and writes.

{
  "name": "refund_recovery",
  "expected_tools": ["read_file", "write_file"],
  "input": "Process refunds for orders O1002 and O1001, in that order. Read each invoice from invoices/<id>.json and restock the items in catalog.json. If an invoice file does not exist, append '<id>: invoice not found' to refund_errors.log and continue to the next order — do not stop.",
  "rubric": "PASS only if refund_errors.log notes O1002's invoice was not found, the missing invoice did NOT abort the task (O1001 was still processed), and catalog.json shows the restocked quantities.",
  "files": { "catalog.json": "…", "invoices/O1001.json": "…" }
}

Look at what the rubric tests: not “did it write a file,” but a specific behaviour — recover from a missing file mid-batch and keep going. The seed files set up the world, and invoices/O1002.json is deliberately absent.
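
A harness needs to turn that "files" map into a real directory before the run, and read the directory back afterwards so the end state can be inspected. The sketch below shows one way to do both. The `Task` struct, `Seed` and `Snapshot` are assumptions for illustration, and the file contents are placeholders because the real seed data is elided above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// Task mirrors the JSON fields shown above; the Go names are assumptions.
type Task struct {
	Name          string            `json:"name"`
	ExpectedTools []string          `json:"expected_tools"`
	Input         string            `json:"input"`
	Rubric        string            `json:"rubric"`
	Files         map[string]string `json:"files"`
}

// Seed writes each file under dir, creating parent directories as needed.
func Seed(dir string, files map[string]string) error {
	for rel, content := range files {
		path := filepath.Join(dir, filepath.FromSlash(rel))
		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
			return err
		}
		if err := os.WriteFile(path, []byte(content), 0o644); err != nil {
			return err
		}
	}
	return nil
}

// Snapshot reads every file under dir into a map keyed by slash-separated
// relative path, so the end state can be compared or shown to a judge.
func Snapshot(dir string) (map[string]string, error) {
	out := make(map[string]string)
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(dir, path)
		if err != nil {
			return err
		}
		out[filepath.ToSlash(rel)] = string(data)
		return nil
	})
	return out, err
}

// demo seeds a temp directory from a task and snapshots it straight back.
func demo() (map[string]string, error) {
	// Placeholder contents ("{}"); invoices/O1002.json is absent on purpose.
	raw := `{"name":"refund_recovery","files":{"catalog.json":"{}","invoices/O1001.json":"{}"}}`
	var task Task
	if err := json.Unmarshal([]byte(raw), &task); err != nil {
		return nil, err
	}
	dir, err := os.MkdirTemp("", "task-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	if err := Seed(dir, task.Files); err != nil {
		return nil, err
	}
	return Snapshot(dir)
}

func main() {
	state, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	_, hasO1002 := state["invoices/O1002.json"]
	fmt.Println("files seeded:", len(state), "O1002 present:", hasO1002)
}
```

Seeding into a fresh temporary directory per run keeps tasks from leaking state into each other.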

Running a set of these across several agents is one command:

evalgo bench --tasks tasks.json --targets targets.json --judge env \
  -m tool_correctness,task_completion,step_efficiency,rubric \
  --gate task_completion,rubric

which gives you a grid:

task                  agent-go   harness-rs   miniagent
fulfill_order         PASS       PASS         PASS
payments_dedupe       PASS       PASS         PASS
refund_recovery       PASS       PASS         PASS
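
The PASS/FAIL cells come from the gated metrics only. A rough sketch of how gating could work is below; the `Verdict` function, the 0.5 threshold and the example scores are assumptions for illustration, not eval-go's actual rule.

```go
package main

import "fmt"

// Verdict turns per-metric scores into PASS or FAIL. Only metrics named in
// gate can fail the run; every other metric is reported but never blocks.
func Verdict(scores map[string]float64, gate []string, threshold float64) string {
	for _, name := range gate {
		score, ok := scores[name]
		if !ok || score < threshold {
			return "FAIL" // a missing gated metric counts as a failure
		}
	}
	return "PASS"
}

func main() {
	// Illustrative scores: the outcome is correct but the run was wasteful.
	scores := map[string]float64{
		"tool_correctness": 1.0,
		"task_completion":  1.0,
		"step_efficiency":  0.2,
		"rubric":           1.0,
	}

	// Gating on outcome only: the low efficiency score does not fail the run.
	fmt.Println(Verdict(scores, []string{"task_completion", "rubric"}, 0.5))

	// Gating on efficiency as well would turn the same run into a failure.
	fmt.Println(Verdict(scores, []string{"task_completion", "rubric", "step_efficiency"}, 0.5))
}
```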

What clean tasks hide

Here’s what happened when I ran this across my three agent frameworks with a capable model: they all passed. Every reasoning-heavy task I could design — deduplicating payments by transaction id, processing orders against shared stock, applying a discount with an exclusion rule — got solved correctly by all of them. The model does the reasoning; the framework just carries its tool calls.

If you stop at PASS/FAIL on clean tasks, you conclude the frameworks are interchangeable. They aren’t. The difference shows up under one condition: when something fails at runtime.

refund_recovery is the tell. The invoice for O1002 doesn’t exist, so the read_file tool returns an error. What happens next is decided by the framework, not the model:

One framework fed the error back to the model as a tool result. The model saw “file not found,” logged it, and moved on to O1001. Clean pass, full marks on efficiency.
The other two took a shortcut around the error — wrong order, or never actually attempting the read — and scored 0.2 on efficiency even though they limped to an acceptable end state.

That gap (error recovery, loop termination, what happens when a tool throws) is invisible on happy-path tasks and decisive on real ones. The eval also surfaced a real bug in my own Go framework: tool errors were aborting the whole run instead of going back to the model. I fixed it because the benchmark made it impossible to miss.
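
The difference between aborting and recovering comes down to a few lines in the agent loop. Here is a minimal sketch of the recovering version, with a scripted stand-in for the model. The `Action`, `Model` and `Tool` types and the transcript format are hypothetical, and this is not the code of any of the three frameworks.

```go
package main

import (
	"errors"
	"fmt"
)

// Action is what the model asks for next: a tool call, or Done with final text.
type Action struct {
	Done bool
	Tool string
	Arg  string
	Text string
}

// Model picks the next action from the transcript so far (hypothetical shape).
type Model func(transcript []string) Action

// Tool runs one tool call and may fail.
type Tool func(arg string) (string, error)

// RunLoop executes tool calls until the model is done or maxSteps is reached.
// Tool errors are appended to the transcript as tool results instead of being
// returned, so the model gets a chance to recover.
func RunLoop(model Model, tools map[string]Tool, maxSteps int) (string, []string, error) {
	var transcript []string
	for step := 0; step < maxSteps; step++ {
		act := model(transcript)
		if act.Done {
			return act.Text, transcript, nil
		}
		tool, ok := tools[act.Tool]
		if !ok {
			transcript = append(transcript, fmt.Sprintf("%s(%s) -> error: unknown tool", act.Tool, act.Arg))
			continue
		}
		out, err := tool(act.Arg)
		if err != nil {
			// Returning err here instead would abort the whole run on the
			// first tool failure, which is the bug described above.
			transcript = append(transcript, fmt.Sprintf("%s(%s) -> error: %v", act.Tool, act.Arg, err))
			continue
		}
		transcript = append(transcript, fmt.Sprintf("%s(%s) -> %s", act.Tool, act.Arg, out))
	}
	return "", transcript, errors.New("step limit reached")
}

// demo replays the refund_recovery situation with a scripted model.
func demo() (string, []string, error) {
	files := map[string]string{"invoices/O1001.json": "{}"} // O1002 is absent
	tools := map[string]Tool{
		"read_file": func(path string) (string, error) {
			content, ok := files[path]
			if !ok {
				return "", errors.New("file not found")
			}
			return content, nil
		},
	}
	// The script stands in for a real model: one action per transcript entry.
	script := []Action{
		{Tool: "read_file", Arg: "invoices/O1002.json"},
		{Tool: "read_file", Arg: "invoices/O1001.json"},
		{Done: true, Text: "processed O1001; O1002 invoice not found"},
	}
	model := func(transcript []string) Action { return script[len(transcript)] }
	return RunLoop(model, tools, 10)
}

func main() {
	final, transcript, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, line := range transcript {
		fmt.Println(line)
	}
	fmt.Println(final)
}
```

The step limit doubles as the loop-termination guard: a model that never says it is done ends the run with an error instead of spinning forever.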

What I’d actually do

A few things that made my agent evals worth running:

Gate on outcome, not efficiency. A correct-but-slow run is still correct. Let task_completion and rubric decide PASS/FAIL; keep step_efficiency as a number you watch, not a gate.
Seed real state. Give the agent files to read, not just a prompt. Most interesting agent behaviour is read-modify-write.
Let the judge see the end state. Don’t grade the final message alone — show the judge the files the agent left behind. The message is the agent’s claim; the files are the truth.
Put traps in the tasks. A duplicate transaction id, a missing file, an item that’s the exception to the rule. Clean tasks tell you the model is good. Traps tell you the agent is.
To see a real gap, make life hard. Use a weaker model, or inject failures at the tool layer. Strong-model happy-path eval saturates fast.
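
Injecting failures at the tool layer can be as small as a wrapper around each tool. The sketch below fails every nth call deterministically, so a run is reproducible. The names `Tool`, `FailEvery` and `ErrInjected` are made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Tool runs one tool call and may fail (hypothetical shape).
type Tool func(arg string) (string, error)

// ErrInjected marks failures produced by the harness rather than by the tool.
var ErrInjected = errors.New("injected failure")

// FailEvery wraps a tool so that every nth call fails without running the
// underlying tool. n <= 0 disables injection. The counter is not safe for
// concurrent use; a real harness would guard it with a mutex.
func FailEvery(n int, tool Tool) Tool {
	calls := 0
	return func(arg string) (string, error) {
		calls++
		if n > 0 && calls%n == 0 {
			return "", ErrInjected
		}
		return tool(arg)
	}
}

// demo calls a wrapped tool six times and records which calls were failed.
func demo() []bool {
	read := FailEvery(3, func(path string) (string, error) {
		return "contents of " + path, nil
	})
	failed := make([]bool, 0, 6)
	for i := 0; i < 6; i++ {
		_, err := read("catalog.json")
		failed = append(failed, errors.Is(err, ErrInjected))
	}
	return failed
}

func main() {
	// The third and sixth calls fail; the rest reach the real tool.
	fmt.Println(demo())
}
```

A deterministic schedule is easier to debug than random failures: when an agent scores differently between two runs, the cause is the agent and not the dice.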

Evaluation won’t make an agent perfect, but it’s what lets you change the code and know whether it actually got better.