agenteval is a lightweight framework for scoring AI agents on four things that matter in practice: accuracy, latency, cost, and safety.
It is designed for string-in / string-out agent interfaces, which makes it useful for quick baselines, regression checks, side-by-side comparisons, and task-suite-based evaluation without binding your project to one orchestration framework.
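Here, an agent is anything callable as prompt-string in, response-string out. The toy agents referenced in the quick start below might look like this (a minimal sketch standing in for real model calls; `my_agent_a` and `my_agent_b` are placeholders you supply):

```python
# Toy string-in / string-out agents. Any callable that maps a prompt string
# to a response string can be evaluated; a real agent would call a model or
# tool chain here.
def my_agent_a(prompt: str) -> str:
    if "2+2" in prompt:
        return "4"
    return "Paris"

def my_agent_b(prompt: str) -> str:
    if "2+2" in prompt:
        return "four"  # close, but fails an exact-match judge
    return "paris"
```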
A lot of "agent evaluation" still collapses into screenshots, vibes, and one cherry-picked run.
That is not useful.
You need a repeatable way to answer questions like:
- Which agent is actually more accurate on this task suite?
- How much slower is the safer variant?
- What is the p95 latency under repeated runs?
- Did the output leak PII or reflect prompt-injection style failure modes?
- Which version wins when accuracy, reliability, and cost are considered together?
agenteval is built to make those comparisons explicit.
| Layer | What it handles |
|---|---|
| TaskSuite | task definitions from Python lists, JSON, or YAML |
| AgentRunner and AsyncAgentRunner | single-run execution, repeated runs, concurrency, retry, and timeout support |
| Metrics | accuracy, success rate, latency percentiles, token-based cost estimates |
| SafetyChecker | PII leakage checks, prompt injection indicators, and custom forbidden patterns |
| Judges | exact, contains, numeric, custom, and optional LLM-as-judge functions |
| ComparisonReport | side-by-side comparison and winner selection |
| Export | JSON, CSV, and Markdown outputs |
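AsyncAgentRunner uses the same string-in / string-out contract as the synchronous runner; a sketch of an async agent, assuming the async runner drives awaitable callables (the function name here is illustrative):

```python
import asyncio

# Illustrative async agent for AsyncAgentRunner: same prompt-in / text-out
# contract, but awaitable so repeated runs can execute concurrently.
async def my_async_agent(prompt: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for a real model or API call
    return "4" if "2+2" in prompt else "Paris"
```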
```mermaid
flowchart LR
    A["TaskSuite"] --> B["AgentRunner / AsyncAgentRunner"]
    B --> C["Run Results"]
    C --> D["Metrics"]
    C --> E["Safety Checks"]
    D --> F["Comparison Report"]
    E --> F
    F --> G["JSON / CSV / Markdown Export"]
```
```python
from agenteval import AgentEvaluator, TaskSuite

suite = TaskSuite.from_list([
    {"name": "math", "prompt": "What is 2+2?", "expected": "4", "category": "math"},
    {"name": "capital", "prompt": "Capital of France?", "expected": "Paris", "category": "geo"},
])

evaluator = AgentEvaluator(
    agents={
        "agent_a": my_agent_a,
        "agent_b": my_agent_b,
    },
    runs_per_task=3,
)

results = evaluator.run(suite)
comparison = evaluator.compare_results(results)

print(results["agent_a"].metrics.accuracy)
print(results["agent_a"].metrics.latency_p95)
print(results["agent_a"].safety.safety_score)
comparison.print_table()
```

Install:

```bash
pip install agenteval
```

Optional LLM judges:

```bash
pip install agenteval[openai]
pip install agenteval[anthropic]
```

| Dimension | Included signals |
|---|---|
| Accuracy | exact match, contains match, numeric match, custom judge functions |
| Reliability | success rate and error handling across repeated runs |
| Latency | mean, p50, p95, p99 |
| Cost | token-based estimation from configured input/output pricing |
| Safety | emails, phones, SSNs, credit cards, IP addresses, injection indicators, custom regex patterns |
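The cost signal is a token-based estimate from the pricing you configure, not a billing record. The arithmetic behind it is roughly the following (a standalone sketch for intuition, not agenteval's actual implementation):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    # Token-based cost estimate from configured input/output pricing.
    return ((input_tokens / 1000) * input_price_per_1k
            + (output_tokens / 1000) * output_price_per_1k)

# Example: 1,200 prompt tokens and 300 completion tokens at
# $0.0005 / $0.0015 per 1K tokens comes to about $0.00105.
print(estimate_cost(1200, 300, 0.0005, 0.0015))
```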
The repo includes a small CLI for task-suite-level workflows:

```bash
agenteval validate examples/tasks.yaml
agenteval info examples/tasks.yaml
agenteval version
```

The example flow in `examples/basic_eval.py` demonstrates:
- multiple agents
- repeated task runs
- side-by-side comparison
- a deliberately leaky agent
- safety scanning over outputs
The repo also includes examples/tasks.yaml for YAML-driven suites.
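A YAML suite mirrors the name/prompt/expected/category fields used with TaskSuite.from_list. The actual schema of examples/tasks.yaml may differ, but a file of that shape might look roughly like this (illustrative only):

```yaml
# Illustrative sketch of a YAML task suite; check examples/tasks.yaml
# in the repo for the real schema.
tasks:
  - name: math
    prompt: "What is 2+2?"
    expected: "4"
    category: math
  - name: capital
    prompt: "Capital of France?"
    expected: "Paris"
    category: geo
```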
Use the included judge helpers when exact string equality is too brittle:

- `exact_match`
- `contains_match`
- `numeric_match`
- `custom_judge`
- `llm_judge`
- `anthropic_judge`
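A custom judge is just a predicate over the agent output and the expected value; for example (a sketch; the exact signature `custom_judge` expects may differ, so treat the argument names as assumptions):

```python
# Illustrative custom judge: accept the answer if the expected string appears
# anywhere in the output, ignoring case and surrounding whitespace.
def lenient_judge(output: str, expected: str) -> bool:
    return expected.strip().lower() in output.strip().lower()
```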
SafetyChecker currently flags:
- email addresses
- phone numbers
- SSNs
- credit card-like patterns
- IP addresses
- prompt-injection style phrases
- custom forbidden regexes you provide
That makes it suitable for catching obvious output regressions before you over-interpret a high accuracy score.
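A custom forbidden pattern is just a regex you never want to see in agent output. As a rough standalone illustration of the idea (not SafetyChecker's actual interface):

```python
import re

# Patterns you never want to see in agent output, e.g. secret keys or
# SSN-shaped strings. SafetyChecker's built-in checks cover the common PII
# cases; custom regexes extend them.
FORBIDDEN = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),    # API-key-shaped strings
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped strings
]

def leaks(output: str) -> bool:
    return any(p.search(output) for p in FORBIDDEN)
```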
This project is still alpha.
Current boundaries:
- optimized for agents that can be treated as callables over text input and output
- not a full conversation-trace platform
- token cost is estimation-based, using configured pricing inputs
- LLM-as-judge support is optional and provider-specific
That scope is intentional. The project is trying to be a fast evaluation harness, not a monolithic agent platform.
```bash
git clone https://github.com/atharvajoshi01/agenteval.git
cd agenteval
python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"
.venv/bin/pytest -q
```

Tests currently cover task suites, metrics, export, async execution, retries, timeouts, judges, and CLI behavior.
MIT