agenteval is a lightweight framework for scoring AI agents on four things that matter in practice: accuracy, latency, cost, and safety.
It is designed for string-in / string-out agent interfaces, which makes it useful for quick baselines, regression checks, side-by-side comparisons, and task-suite-based evaluation without binding your project to one orchestration framework.
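Here, an agent is anything callable as prompt-string in, response-string out. The toy agents referenced in the quick start below might look like this (a minimal sketch standing in for real model calls; `my_agent_a` and `my_agent_b` are placeholders you supply):

```python
# Toy string-in / string-out agents. Any callable that maps a prompt string
# to a response string can be evaluated; a real agent would call a model or
# tool chain here.
def my_agent_a(prompt: str) -> str:
    if "2+2" in prompt:
        return "4"
    return "Paris"

def my_agent_b(prompt: str) -> str:
    if "2+2" in prompt:
        return "four"  # close, but fails an exact-match judge
    return "paris"
```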
A lot of "agent evaluation" still collapses into screenshots, vibes, and one cherry-picked run.
That is not useful.
You need a repeatable way to answer questions like:
- Which agent is actually more accurate on this task suite?
- How much slower is the safer variant?
- What is the p95 latency under repeated runs?
- Did the output leak PII or reflect prompt-injection style failure modes?
- Which version wins when accuracy, reliability, and cost are considered together?
agenteval is built to make those comparisons explicit.
| Layer | What it handles |
|---|---|
| TaskSuite | task definitions from Python lists, JSON, or YAML |
| AgentRunner and AsyncAgentRunner | single-run execution, repeated runs, concurrency, retry, and timeout support |
| Metrics | accuracy, success rate, latency percentiles, token-based cost estimates |
| SafetyChecker | PII leakage checks, prompt injection indicators, and custom forbidden patterns |
| Judges | exact, contains, numeric, custom, and optional LLM-as-judge functions |
| ComparisonReport | side-by-side comparison and winner selection |
| Export | JSON, CSV, and Markdown outputs |
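AsyncAgentRunner uses the same string-in / string-out contract as the synchronous runner; a sketch of an async agent, assuming the async runner drives awaitable callables (the function name here is illustrative):

```python
import asyncio

# Illustrative async agent for AsyncAgentRunner: same prompt-in / text-out
# contract, but awaitable so repeated runs can execute concurrently.
async def my_async_agent(prompt: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for a real model or API call
    return "4" if "2+2" in prompt else "Paris"
```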
```mermaid
flowchart LR
    A["TaskSuite"] --> B["AgentRunner / AsyncAgentRunner"]
    B --> C["Run Results"]
    C --> D["Metrics"]
    C --> E["Safety Checks"]
    D --> F["Comparison Report"]
    E --> F
    F --> G["JSON / CSV / Markdown Export"]
```
```python
from agenteval import AgentEvaluator, TaskSuite

suite = TaskSuite.from_list([
    {"name": "math", "prompt": "What is 2+2?", "expected": "4", "category": "math"},
    {"name": "capital", "prompt": "Capital of France?", "expected": "Paris", "category": "geo"},
])

evaluator = AgentEvaluator(
    agents={
        "agent_a": my_agent_a,
        "agent_b": my_agent_b,
    },
    runs_per_task=3,
)

results = evaluator.run(suite)
comparison = evaluator.compare_results(results)

print(results["agent_a"].metrics.accuracy)
print(results["agent_a"].metrics.latency_p95)
print(results["agent_a"].safety.safety_score)
comparison.print_table()
```

Install:

```bash
pip install agenteval
```

Optional LLM judges:

```bash
pip install agenteval[openai]
pip install agenteval[anthropic]
```

| Dimension | Included signals |
|---|---|
| Accuracy | exact match, contains match, numeric match, custom judge functions |
| Reliability | success rate and error handling across repeated runs |
| Latency | mean, p50, p95, p99 |
| Cost | token-based estimation from configured input/output pricing |
| Safety | emails, phones, SSNs, credit cards, IP addresses, injection indicators, custom regex patterns |
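The cost signal is a token-based estimate from the pricing you configure, not a billing record. The arithmetic behind it is roughly the following (a standalone sketch for intuition, not agenteval's actual implementation):

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_1k: float, output_price_per_1k: float) -> float:
    # Token-based cost estimate from configured input/output pricing.
    return ((input_tokens / 1000) * input_price_per_1k
            + (output_tokens / 1000) * output_price_per_1k)

# Example: 1,200 prompt tokens and 300 completion tokens at
# $0.0005 / $0.0015 per 1K tokens comes to about $0.00105.
print(estimate_cost(1200, 300, 0.0005, 0.0015))
```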
The repo includes a small CLI for task-suite-level workflows:

```bash
agenteval validate examples/tasks.yaml
agenteval info examples/tasks.yaml
agenteval version
```

The example flow in `examples/basic_eval.py` demonstrates:
- multiple agents
- repeated task runs
- side-by-side comparison
- a deliberately leaky agent
- safety scanning over outputs
The repo also includes examples/tasks.yaml for YAML-driven suites.
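A YAML suite mirrors the name/prompt/expected/category fields used with TaskSuite.from_list. The actual schema of examples/tasks.yaml may differ, but a file of that shape might look roughly like this (illustrative only):

```yaml
# Illustrative sketch of a YAML task suite; check examples/tasks.yaml
# in the repo for the real schema.
tasks:
  - name: math
    prompt: "What is 2+2?"
    expected: "4"
    category: math
  - name: capital
    prompt: "Capital of France?"
    expected: "Paris"
    category: geo
```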
Use the included judge helpers when exact string equality is too brittle:

- `exact_match`
- `contains_match`
- `numeric_match`
- `custom_judge`
- `llm_judge`
- `anthropic_judge`
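A custom judge is just a predicate over the agent output and the expected value; for example (a sketch; the exact signature `custom_judge` expects may differ, so treat the argument names as assumptions):

```python
# Illustrative custom judge: accept the answer if the expected string appears
# anywhere in the output, ignoring case and surrounding whitespace.
def lenient_judge(output: str, expected: str) -> bool:
    return expected.strip().lower() in output.strip().lower()
```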
SafetyChecker currently flags:
- email addresses
- phone numbers
- SSNs
- credit card-like patterns
- IP addresses
- prompt-injection style phrases
- custom forbidden regexes you provide
That makes it suitable for catching obvious output regressions before you over-interpret a high accuracy score.
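A custom forbidden pattern is just a regex you never want to see in agent output. As a rough standalone illustration of the idea (not SafetyChecker's actual interface):

```python
import re

# Patterns you never want to see in agent output, e.g. secret keys or
# SSN-shaped strings. SafetyChecker's built-in checks cover the common PII
# cases; custom regexes extend them.
FORBIDDEN = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),    # API-key-shaped strings
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-shaped strings
]

def leaks(output: str) -> bool:
    return any(p.search(output) for p in FORBIDDEN)
```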
This project is still alpha.
Current boundaries:
- optimized for agents that can be treated as callables over text input and output
- not a full conversation-trace platform
- token cost is estimation-based, using configured pricing inputs
- LLM-as-judge support is optional and provider-specific
That scope is intentional. The project is trying to be a fast evaluation harness, not a monolithic agent platform.
```bash
git clone https://github.com/atharvajoshi01/agenteval.git
cd agenteval
python3 -m venv .venv
.venv/bin/pip install -e ".[dev]"
.venv/bin/pytest -q
```

Tests currently cover task suites, metrics, export, async execution, retries, timeouts, judges, and CLI behavior.
MIT