AI Agent Evaluation Benchmarks: How to Score Agents on Speed, Cost & Quality

April 19, 2026 · 15 min read
Most teams pick an AI agent framework on vibes. A year later they discover the “smart” model loops forever, the “cheap” model costs twice as much per completed task, and nobody knows which change broke last week’s golden path. This guide covers the benchmarks that actually predict production behavior — and how to run them yourself.
TL;DR
AI agent benchmarks measure a full agent loop, not a single model call. Score on four dimensions: quality (task completion rate), speed (wall-clock latency per task), cost (dollars per completed task), and reliability (variance across runs). Use public benchmarks — SWE-bench Verified, Tau-bench, GAIA, WebArena, Terminal-Bench — as a baseline, then build a small custom eval set for your workload. Automate with agent-bench and track scores on every model, prompt, or tool change.
Benchmark your agent on managed infrastructure.
Try Rapid Claw
1. Why Agent Benchmarks Are Not LLM Benchmarks
MMLU, HumanEval, GSM8K, and the rest of the classic LLM benchmarks measure one thing: the quality of a single forward pass. Give the model a prompt, get a completion, grade it. That’s a useful signal for raw model capability, but it doesn’t predict agent behavior.
An agent is a loop. It reads state, plans a step, calls a tool, observes the result, updates memory, and plans again. Any of those stages can fail independently of model quality. A model with 90% MMLU can still:
- Call the wrong tool (or the right tool with wrong arguments) and never recover.
- Retry a failing action 50 times, burning tokens and patience.
- Forget a critical constraint after the third turn because memory management is broken.
- Produce a perfect final answer — two minutes and $4 later.
If you’re debugging one of these in production, our companion post on why AI agents fail in production catalogs the top five failure modes. Benchmarks are how you catch them before production.
The right benchmark for an agent scores the full loop: did the task complete correctly, how many tool calls did it take, how long did it take, and how many dollars did it burn? Everything else is noise.
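Concretely, each benchmark run should emit one record per task carrying exactly those four signals. A minimal sketch of such a record (the `TaskRun` name and fields are illustrative, not part of any particular harness):

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    """One full agent loop, scored on the signals that matter."""
    task_id: str
    completed: bool      # did the task finish correctly?
    tool_calls: int      # how many tool invocations did it take?
    wall_clock_s: float  # how long did the full loop run?
    cost_usd: float      # how many dollars did it burn?

run = TaskRun("refund-happy-path", completed=True,
              tool_calls=4, wall_clock_s=12.3, cost_usd=0.07)
```

Aggregating a list of these records gives you every metric in the next section; a bare model-quality score gives you none of them.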
2. The Four Dimensions That Matter
Every useful agent benchmark score rolls up into four numbers. Track all four on every change. Optimizing one in isolation almost always wrecks the others.
| Dimension | What it measures | Primary metric |
|---|---|---|
| Quality | Did the agent finish the task correctly? | Task completion rate (%) |
| Speed | How long did the full loop take? | p50 / p95 wall-clock per task |
| Cost | How much did a successful run cost? | $ per completed task |
| Reliability | How consistent is the agent across runs? | Variance of quality/cost at N=10 |
The trap: reporting cost per token or accuracy per attempt. A cheap model that loops ten times costs more than a premium model that solves the task once. An agent that gets the right answer 70% of the time and fails loudly the other 30% is more useful than one that gets 90% right but confidently hallucinates the rest. Normalize on “per completed task” and you get honest numbers. For the cost side, see why AI agents cost $100K/year for the economics that make this framing obvious.
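The per-completed-task framing is a one-line calculation. The numbers below are made up purely to illustrate the looping-cheap-model trap:

```python
def cost_per_completed_task(runs: list[dict]) -> float:
    """Total spend divided by successful runs — the honest unit."""
    total = sum(r["cost_usd"] for r in runs)
    completed = sum(1 for r in runs if r["completed"])
    if completed == 0:
        return float("inf")  # nothing finished: infinitely expensive
    return total / completed

# "Cheap" model: $0.12 per attempt after looping, finishes 4 of 10 tasks.
cheap   = [{"cost_usd": 0.12, "completed": i < 4} for i in range(10)]
# "Premium" model: $0.20 per attempt, finishes 9 of 10 tasks.
premium = [{"cost_usd": 0.20, "completed": i < 9} for i in range(10)]

cost_per_completed_task(cheap)    # 0.30 per completed task
cost_per_completed_task(premium)  # ~0.22 — the "expensive" model wins
```

Per token, the cheap model looks like a bargain; per completed task, the premium model is about 25% cheaper.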
3. Public Benchmarks: What Each One Measures
There are dozens of agent benchmarks. Five of them are worth your attention in 2026. Pick the one closest to your workload, and pair it with one general-purpose benchmark.
SWE-bench Verified
Code. 500 real-world GitHub issues from popular Python repos, human-verified to be solvable. The agent must patch the repo to pass the hidden test suite. Current frontier models land around 60–65% resolution rate in mid-2026. The gold standard for coding agents.
Tau-bench (τ-bench)
Tool use + multi-turn. Simulated customer-support scenarios in retail and airline domains. Agents must call the right tools in the right order to satisfy a user simulator while obeying domain rules. Measures tool accuracy, policy adherence, and multi-turn coherence — exactly the failure modes that break production agents.
GAIA
General assistant. 466 real-world questions that require browsing, file reading, reasoning, and synthesis — the kind of research a personal assistant does. Answers are short and objectively gradeable. Excellent for testing whether an agent can finish open-ended tasks, not just start them.
WebArena / VisualWebArena
Browser. Realistic self-hosted clones of popular websites (shopping, forums, maps, GitLab) with 812 tasks. Tests whether an agent can navigate, fill forms, compare options, and complete goals end-to-end. VisualWebArena adds screenshots for vision-enabled agents.
Terminal-Bench
Shell + DevOps. Docker-sandboxed terminal tasks — fix a broken build, debug a container, trace a flaky test, recover a corrupted git repo. The closest public benchmark to the kind of work OpenClaw agents do on a developer’s machine. Results often expose weaknesses that language-only benchmarks miss.
Watch for contamination
Public benchmarks end up in training data. A model that scored 85% on SWE-bench last quarter might score the same via memorization, not capability. Always cross-check against SWE-bench Verified (a held-out subset), and pair public scores with a private custom eval set that no model has seen.
4. Running agent-bench Against Your Deployment
We built agent-bench (MIT-licensed, open source) as a thin harness to benchmark any agent endpoint against speed, cost, and quality in a single report card. The workflow is the same whether you’re comparing providers or regression-testing a prompt change.
```shell
# Clone and install
git clone https://github.com/arcane-bear/agent-bench
cd agent-bench && pip install -e .

# Run a baseline suite against three providers
agent-bench run \
  --suite gaia-lite \
  --providers anthropic,openai,openrouter \
  --models claude-opus-4-7,gpt-5,mistral-large-3 \
  --n-runs 5 \
  --out report.html

# Open the report
open report.html
```
The HTML report gives you a matrix: completion rate, p50/p95 latency, dollar cost per completed task, and variance over N runs. You can plug in a local OpenClaw endpoint, a self-hosted Hermes Agent, or a Rapid Claw-managed deployment — agent-bench doesn’t care where the agent runs, only what it returns.
```yaml
# agent-bench: custom suite definition
suite:
  name: checkout-flow-regressions
  description: "Our own production-shaped tasks"
  seed: 42
  tasks:
    - id: refund-happy-path
      input: "Refund order #A-1042 and email the customer."
      tools_required: [lookup_order, issue_refund, send_email]
      success:
        judge: llm
        rubric: |
          Passes only if: (1) order was located,
          (2) refund was issued once, (3) customer email
          was sent with refund amount.
    - id: refund-ambiguous
      input: "Customer says 'I want my money back for my recent order.'"
      tools_required: [clarify, lookup_order, issue_refund]
      success:
        judge: llm
        rubric: |
          Must ask for clarification before refunding.
  scoring:
    primary: completion_rate
    cost_cap_usd: 0.50   # fail the task if a run costs more
    time_cap_sec: 60     # fail the task if a run exceeds the cap
```
5. Building a Custom Eval Set That Predicts Production
Public benchmarks tell you whether a model is generally competent. A custom eval tells you whether your agent, with your prompt and your tools, handles your workload. Without the second, the first is interesting trivia.
A useful custom eval has three parts:
1. Golden path tasks (10\u201320)
The happy-path scenarios your agent should always nail. If any of these regress, you roll back the change. Build these from real production traffic, not imagination.
2. Edge-case tasks (15\u201330)
Ambiguous inputs, missing data, tool failures, conflicting constraints. These are where agents go off the rails in production. Mining your observability logs for real failures is the fastest way to build this set.
3. Adversarial tasks (5\u201310)
Prompt injections, contradictory instructions, attempts to bypass permission boundaries. These should always fail gracefully. If the agent executes a protected action because of a crafted input, you need to know before an attacker does.
Keep the eval set under 100 tasks. Any larger and it takes too long to run on every change, which means it won’t get run, which means it might as well not exist.
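The “roll back if any golden-path task regresses” rule is easy to automate as a CI gate. A minimal sketch, assuming per-task pass/fail results keyed by task id (the function name and result shape are ours, not agent-bench’s):

```python
def golden_path_gate(baseline: dict[str, bool],
                     candidate: dict[str, bool]) -> list[str]:
    """Return golden-path tasks that passed on baseline but fail now.

    Any non-empty result means: reject the change.
    """
    return sorted(
        task for task, passed in baseline.items()
        if passed and not candidate.get(task, False)
    )

regressions = golden_path_gate(
    baseline={"refund-happy-path": True, "refund-ambiguous": True},
    candidate={"refund-happy-path": True, "refund-ambiguous": False},
)
# regressions == ["refund-ambiguous"] → roll back the change
```

Wire this into CI so a prompt or model change cannot merge while the list is non-empty; edge-case and adversarial tasks can warn instead of block.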
6. LLM-as-Judge: Patterns That Actually Work
For tasks without a deterministic correct answer (most agent tasks), you grade with another LLM. LLM-as-judge is fast and cheap but notoriously unreliable if you wing it. Three rules keep judgment stable: grade against an explicit rubric, not the judge’s own taste; show the judge the full trajectory (tool calls plus final response), not just the final answer; and force a constrained, structured output — a fixed integer scale returned as JSON only. The prompt template below applies all three:
```text
You are grading whether an AI agent correctly
completed a customer-service task.

Task: {task_input}

Rubric:
{rubric}

Agent trajectory (tool calls + final response):
{agent_trace}

Score 0–5:
0 = Did nothing useful / hallucinated
1 = Attempted but completely wrong outcome
2 = Partial — right direction, wrong result
3 = Correct outcome, messy execution
4 = Correct outcome, reasonable efficiency
5 = Correct outcome, minimal tool calls, clean

Respond ONLY with JSON:
{"score": <int>, "reason": "<one sentence>"}
```
7. Continuous Evals in Production
Offline benchmarks catch known problems. Production evals catch the ones you haven’t imagined yet. Sample 1\u20135% of real agent traffic, run the same judge against the sampled traces, and alert when the completion-rate or cost-per-task metric drifts outside a rolling window.
For the observability plumbing that makes this practical, see the AI agent observability guide. The short version: emit structured traces per run (inputs, tool calls, outputs, token counts), and pipe them into a table you can sample from.
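Sampling is handled by the grading harness below; the drift-alert half can be as simple as comparing the mean of recent scores against a longer rolling baseline. A minimal sketch — the class name, window sizes, and tolerance are illustrative choices, not a prescribed design:

```python
from collections import deque

class DriftDetector:
    """Alert when the recent mean falls well below the rolling baseline."""

    def __init__(self, window: int = 500, recent: int = 50,
                 tolerance: float = 0.1):
        self.baseline = deque(maxlen=window)   # long rolling window
        self.recent = deque(maxlen=recent)     # short recent window
        self.tolerance = tolerance             # allowed downward drift

    def observe(self, value: float) -> bool:
        """Record one sampled score; return True if drift is detected."""
        self.baseline.append(value)
        self.recent.append(value)
        if len(self.baseline) < self.baseline.maxlen:
            return False  # not enough history to judge drift yet
        base = sum(self.baseline) / len(self.baseline)
        now = sum(self.recent) / len(self.recent)
        return now < base - self.tolerance
```

The same shape works for cost per completed task (alert on drift upward) by flipping the comparison.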
```python
# continuous_eval.py — sampled production grading
import random

from agent_bench.judges import LlmJudge

judge = LlmJudge(model="claude-opus-4-7")

def grade_if_sampled(trace: dict, sample_rate: float = 0.02) -> None:
    if random.random() > sample_rate:
        return  # skip most traffic

    score = judge.grade(
        task=trace["user_input"],
        rubric=trace["rubric"] or DEFAULT_RUBRIC,
        trajectory=trace["tool_calls"] + [trace["final"]],
    )
    emit_metric(
        name="agent.production.score",
        value=score.score,
        tags={
            "agent_id": trace["agent_id"],
            "model": trace["model"],
            "version": trace["prompt_version"],
        },
    )
    if score.score <= 2:
        alert_on_call(trace, score)  # low scores get human eyes
```
8. Common Benchmarking Pitfalls
The pitfalls worth recapping are the ones this guide has already named, because each shows up constantly in the wild:
- Scoring single model calls instead of the full agent loop (Section 1).
- Normalizing on cost per token or accuracy per attempt instead of per completed task (Section 2).
- Optimizing one of the four dimensions in isolation and wrecking the others (Section 2).
- Trusting public benchmark scores without checking for training-data contamination (Section 3).
- Building an eval set so large that nobody runs it on every change (Section 5).
Skip the benchmark plumbing
Rapid Claw deployments ship with per-agent dashboards for latency, cost, and task-completion signals. Plug agent-bench in once and every prompt or model change gets graded automatically.