[{"content":"","date":null,"permalink":"https://blog.superleo.app/tags/agents/","section":"Tags","summary":"","title":"Agents"},{"content":"","date":null,"permalink":"https://blog.superleo.app/tags/eval/","section":"Tags","summary":"","title":"Eval"},{"content":"","date":null,"permalink":"https://blog.superleo.app/tags/go/","section":"Tags","summary":"","title":"Go"},{"content":"\u0026ldquo;It works on my prompt\u0026rdquo; is where most agent evaluation starts and stops. You run the agent on a task you have in your head, read the final message, decide it looks right, and ship. This holds up until the agent meets a task you didn\u0026rsquo;t think of — and then you have no way to tell whether a change made things better or worse.\nScoring an agent is not the same as scoring a single model call. A model call has one input and one output; you grade the output. An agent has a goal, calls tools, reads results, changes state, and decides when to stop. The final answer is only part of what you\u0026rsquo;re grading. The path matters too: which tools it called, in what order, and whether it recovered when something broke.\nSo the first question is what to measure.\nWhat to measure\nI find it useful to split the metrics into three layers.\nDid it take the right actions. Did it call the tools you expected, with sane arguments? This is mostly deterministic — you have a list of expected tool names and you compare. No model needed. It\u0026rsquo;s cheap, and it catches a whole class of regressions where the agent quietly stops using a tool it should.\nDid it execute well. Did it finish the task, and without wasting steps? \u0026ldquo;Finished the task\u0026rdquo; needs a judge — another model reading the run and the end state against a description. \u0026ldquo;Wasted steps\u0026rdquo; is the same idea: a judge comparing the actions taken to what the task actually required.\nDid the output clear a bar. A rubric: a pass/fail criterion in plain language, checked by a judge. 
\u0026ldquo;PASS only if the invoice totals 78.30 and the out-of-stock item is backordered.\u0026rdquo; The rubric is where you write down the thing you actually care about.\nNone of this is exotic. The trap is stopping at \u0026ldquo;did the final message look plausible,\u0026rdquo; which a judge will happily rubber-stamp.\nRun the agent, don\u0026rsquo;t grade a transcript\nA lot of eval tooling takes a frozen transcript — input, output, recorded tool calls — and scores it. That\u0026rsquo;s fine for a regression set, but it has a ceiling: you can only evaluate runs you already captured. Change the agent and you have to go re-capture everything by hand.\nI\u0026rsquo;d rather the harness run the agent itself: give it a task, let it drive the agent, record what happened, and score it — in one pass. That\u0026rsquo;s the shape I landed on in eval-go:\nbench := evalgo.Bench{\n\tTarget:  myAgent, // anything that runs a task and reports what it did\n\tTasks:   tasks,   // input + expected tools + rubric (+ seed files)\n\tMetrics: metrics, // tool_correctness, task_completion, step_efficiency, rubric\n}\nreport, _ := bench.Run(ctx)\nThe agent under test is just a Target — something that takes a task and reports back. It can run in-process, or as a subprocess in any language. That last part matters more than it sounds: it lets you put a Go agent, a Python agent and a Rust agent on the same tasks and compare them directly.\nA concrete task\nHere\u0026rsquo;s one I use. It\u0026rsquo;s an order-fulfilment scenario over a small file \u0026ldquo;database\u0026rdquo; the agent reads and writes.\n{\n  \u0026#34;name\u0026#34;: \u0026#34;refund_recovery\u0026#34;,\n  \u0026#34;expected_tools\u0026#34;: [\u0026#34;read_file\u0026#34;, \u0026#34;write_file\u0026#34;],\n  \u0026#34;input\u0026#34;: \u0026#34;Process refunds for orders O1002 and O1001, in that order. Read each invoice from invoices/\u0026lt;id\u0026gt;.json and restock the items in catalog.json. 
If an invoice file does not exist, append \u0026#39;\u0026lt;id\u0026gt;: invoice not found\u0026#39; to refund_errors.log and continue to the next order — do not stop.\u0026#34;,\n  \u0026#34;rubric\u0026#34;: \u0026#34;PASS only if refund_errors.log notes O1002\u0026#39;s invoice was not found, the missing invoice did NOT abort the task (O1001 was still processed), and catalog.json shows the restocked quantities.\u0026#34;,\n  \u0026#34;files\u0026#34;: { \u0026#34;catalog.json\u0026#34;: \u0026#34;…\u0026#34;, \u0026#34;invoices/O1001.json\u0026#34;: \u0026#34;…\u0026#34; }\n}\nLook at what the rubric tests: not \u0026ldquo;did it write a file,\u0026rdquo; but a specific behaviour — recover from a missing file mid-batch and keep going. The seed files set up the world, and invoices/O1002.json is deliberately absent.\nRunning a set of these across several agents is one command:\nevalgo bench --tasks tasks.json --targets targets.json --judge env \\\n  -m tool_correctness,task_completion,step_efficiency,rubric \\\n  --gate task_completion,rubric\nwhich gives you a grid:\ntask             agent-go  harness-rs  miniagent\nfulfill_order    PASS      PASS        PASS\npayments_dedupe  PASS      PASS        PASS\nrefund_recovery  PASS      PASS        PASS\nWhat clean tasks hide\nHere\u0026rsquo;s what happened when I ran this across my three agent frameworks with a capable model: they all passed. Every reasoning-heavy task I could design — deduplicating payments by transaction id, processing orders against shared stock, applying a discount with an exclusion rule — got solved correctly by all of them. The model does the reasoning; the framework just carries its tool calls.\nIf you stop at PASS/FAIL on clean tasks, you conclude the frameworks are interchangeable. They aren\u0026rsquo;t. The difference shows up under one condition: when something fails at runtime.\nrefund_recovery is the tell. The invoice for O1002 doesn\u0026rsquo;t exist, so the read_file tool returns an error. 
What happens next is decided by the framework, not the model:\nOne fed the error back to the model as a tool result. The model saw \u0026ldquo;file not found,\u0026rdquo; logged it, and moved on to O1001. Clean pass, full marks on efficiency.\nThe other two took a shortcut around the error — wrong order, or never actually attempting the read — and scored 0.2 on efficiency even though they limped to an acceptable end state.\nThat gap (error recovery, loop termination, what happens when a tool throws) is invisible on happy-path tasks and decisive on real ones. The eval also surfaced a real bug in my own Go framework: tool errors were aborting the whole run instead of going back to the model. I fixed it because the benchmark made it impossible to miss.\nWhat I\u0026rsquo;d actually do\nA few things that made my agent evals worth running:\nGate on outcome, not efficiency. A correct-but-slow run is still correct. Let task_completion and rubric decide PASS/FAIL; keep step_efficiency as a number you watch, not a gate.\nSeed real state. Give the agent files to read, not just a prompt. Most interesting agent behaviour is read-modify-write.\nLet the judge see the end state. Don\u0026rsquo;t grade the final message alone — show the judge the files the agent left behind. The message is the agent\u0026rsquo;s claim; the files are the truth.\nPut traps in the tasks. A duplicate transaction id, a missing file, an item that\u0026rsquo;s the exception to the rule. Clean tasks tell you the model is good. Traps tell you the agent is.\nTo see a real gap, make life hard. Use a weaker model, or inject failures at the tool layer. Strong-model happy-path eval saturates fast.\n
Evaluation won\u0026rsquo;t make an agent perfect, but it\u0026rsquo;s what lets you change the code and know whether it actually got better.\n","date":"28 June 2026","permalink":"https://blog.superleo.app/posts/how-to-evaluate-an-agent/","section":"Posts","summary":"","title":"How to evaluate an agent"},{"content":"I build agent frameworks across Go, Rust and Python — and the eval tooling to compare them. This is where I write up what I learn: agent loops, tool calling, evaluation, and the unglamorous plumbing that makes agents actually work.\n","date":null,"permalink":"https://blog.superleo.app/","section":"Liang Li","summary":"","title":"Liang Li"},{"content":"","date":null,"permalink":"https://blog.superleo.app/posts/","section":"Posts","summary":"","title":"Posts"},{"content":"","date":null,"permalink":"https://blog.superleo.app/tags/python/","section":"Tags","summary":"","title":"Python"},{"content":"","date":null,"permalink":"https://blog.superleo.app/tags/rust/","section":"Tags","summary":"","title":"Rust"},{"content":"","date":null,"permalink":"https://blog.superleo.app/tags/","section":"Tags","summary":"","title":"Tags"},{"content":"","date":null,"permalink":"https://blog.superleo.app/tags/testing/","section":"Tags","summary":"","title":"Testing"},{"content":"I ended up writing the same thing three times — an agent framework — in Go (agent-go), Rust (harness-rs) and Python (miniagent). Not on purpose. Each one answered a different question: Go for something I\u0026rsquo;d actually ship, Rust to see how the loop feels with a strict type system, Python to teach the ideas with as little ceremony as possible.\nThree frameworks raise an obvious question: are they any good, and how would I know? \u0026ldquo;It works on my prompt\u0026rdquo; is not an answer. 
So I built a fourth thing — eval-go — to score agent runs against rubrics, and then to run the agents itself and compare them on identical tasks.\nWhat the benchmark actually found\nI gave all three the same file-tool tasks — fulfill an order, reconcile payments, recover from a missing file mid-batch. The interesting result wasn\u0026rsquo;t a leaderboard. With a capable model, all three were correct. The gap showed up under one condition: runtime adversity.\nOn a task where a tool errors (a file that doesn\u0026rsquo;t exist), only the framework that feeds the error back to the model — instead of aborting the run — actually recovers. Aborting on a tool error was a real bug in agent-go, and the benchmark is what surfaced it. I shipped the fix.\nThat\u0026rsquo;s the whole thesis of this blog: the parts of agents that matter least in a demo — error recovery, loop termination, honest scoring — are the parts that decide whether they work. More to come.\n","date":"28 June 2026","permalink":"https://blog.superleo.app/posts/three-agent-frameworks-one-eval/","section":"Posts","summary":"","title":"Three agent frameworks, one eval harness"},{"content":"I\u0026rsquo;m Liang Li, an engineer working on AI agent infrastructure.\nI build and maintain a small but connected stack:\nagent-go — a Go agent framework: builder API, tool calling, memory, multi-agent teams.\nharness-rs — an agent framework in Rust.\nminiagent — a pedagogical agent framework in Python.\neval-go — a native-Go eval framework that scores and runs agents, so the three above can be benchmarked side by side.\nI care about the parts that don\u0026rsquo;t demo well: error recovery, loop robustness, honest evaluation.\nReach me at ll_faw@hotmail.com or on GitHub.\n","date":null,"permalink":"https://blog.superleo.app/about/","section":"Liang Li","summary":"","title":"About"}]