Three agent frameworks, one eval harness
Table of Contents
I ended up writing the same thing three times — an agent framework — in Go (agent-go), Rust (harness-rs) and Python (miniagent). Not on purpose. Each one answered a different question: Go for something I’d actually ship, Rust to see how the loop feels with a strict type system, Python to teach the ideas with as little ceremony as possible.
Three frameworks raise an obvious question: are they any good, and how would I know? “It works on my prompt” is not an answer. So I built a fourth thing — eval-go — to score agent runs against rubrics, and then to run the agents itself and compare them on identical tasks.
What the benchmark actually found #
I gave all three the same file-tool tasks — fulfill an order, reconcile payments, recover from a missing file mid-batch. The interesting result wasn’t a leaderboard. With a capable model, all three were correct. The gap showed up under one condition: runtime adversity.
On a task where a tool errors (a file that doesn’t exist), only the framework that feeds the error back to the model — instead of aborting the run — actually recovers. That bug was real, it was in agent-go, and the benchmark is what surfaced it. I shipped the fix.
That’s the whole thesis of this blog: the parts of agents that matter least in a demo — error recovery, loop termination, honest scoring — are the parts that decide whether they work. More to come.