ComtradeBench: An OpenEnv Benchmark for Reliable LLM Tool-Use Under Adversarial Conditions#3378

Open
yonghongzhang-io wants to merge 1 commit into `huggingface:main` from `yonghongzhang-io:comtradebench-openenv-benchmark`

Conversation

@yonghongzhang-io

Summary

This PR adds a new community blog post: ComtradeBench, an OpenEnv-native benchmark for evaluating reliable LLM tool-use under adversarial API conditions.

The post documents the design of a 10-task benchmark, evaluation results for 5 frontier and open-source LLMs, a full GRPO training pipeline, and the discovery of a latent GRPO bug (a silent policy/actor desync).

Resources

Author

Member of blog-explorers.

Submitted as part of AgentBeats Phase 2 OpenEnv Challenge.

ComtradeBench is an OpenEnv-native benchmark that evaluates whether LLM agents
can execute multi-step API workflows reliably under adversarial conditions:
pagination, duplicates, rate limits, server errors, page drift, totals traps,
and adaptive adversaries.
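The adversarial conditions above can be illustrated with a minimal sketch: a toy paginated API that injects duplicates and transient rate limits, plus a client loop that retries and deduplicates. All names (`AdversarialPagedAPI`, `fetch_all`) are hypothetical and not taken from the ComtradeBench code.

```python
import random


class AdversarialPagedAPI:
    """Toy paginated API that injects duplicates and rate limits.

    Illustrative sketch only, not the actual ComtradeBench environment.
    """

    def __init__(self, records, page_size=3, seed=0):
        self.records = records
        self.page_size = page_size
        self.rng = random.Random(seed)

    def get_page(self, page):
        # Transient rate limit on roughly 1 in 4 calls.
        if self.rng.random() < 0.25:
            return {"error": 429, "items": []}
        start = page * self.page_size
        items = list(self.records[start:start + self.page_size])
        # Adversarial duplicate: sometimes re-emit the last record
        # of the previous page at the front of this one.
        if items and start > 0 and self.rng.random() < 0.5:
            items.insert(0, self.records[start - 1])
        return {"error": None, "items": items}


def fetch_all(api, max_retries=10):
    """Client loop: retry the same page on 429, deduplicate by id."""
    seen, out, page, retries = set(), [], 0, 0
    while True:
        resp = api.get_page(page)
        if resp["error"] == 429:
            retries += 1
            if retries > max_retries:
                raise RuntimeError("rate-limited too long")
            continue  # retry the same page
        retries = 0
        if not resp["items"]:
            break  # past the last page
        for rec in resp["items"]:
            if rec["id"] not in seen:
                seen.add(rec["id"])
                out.append(rec)
        page += 1
    return out
```

A naive client that trusts each page verbatim would double-count the injected duplicates; tracking seen ids is what makes the workflow robust to this adversary.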

Key contributions:
- 10 procedurally-generated tasks (T1-T9 + adaptive variant)
- 6-dimension scoring with anti-gaming governance gates
- 5 LLM evaluations (Kimi K2.5, Claude Sonnet 4.6, GPT-5, Llama 3.3 70B, Qwen2.5-7B)
- Full GRPO training pipeline with reproducible rollouts
- Latent GRPO bug found and fixed (silent policy/actor desync)
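A silent policy/actor desync of the kind mentioned above can be made loud with a cheap invariant check: fingerprint both parameter sets before each rollout and fail fast on mismatch. This is a generic sketch with hypothetical names (`param_fingerprint`, `check_sync`), not the fix from the PR.

```python
import hashlib
import json


def param_fingerprint(state_dict):
    """Hash model parameters (name -> list of floats) so drift is detectable.

    Illustrative only; real frameworks would hash tensor bytes instead.
    """
    blob = json.dumps(state_dict, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def check_sync(policy_params, actor_params):
    """Raise instead of silently generating rollouts with stale actor weights."""
    if param_fingerprint(policy_params) != param_fingerprint(actor_params):
        raise RuntimeError(
            "policy/actor desync: refresh actor weights before rollout"
        )
```

Running this check at the top of every rollout turns a silent training-quality bug into an immediate, debuggable failure.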

