ComtradeBench: An OpenEnv Benchmark for Reliable LLM Tool-Use Under Adversarial Conditions#3378

Open
yonghongzhang-io wants to merge 1 commit into `huggingface:main` from `yonghongzhang-io:comtradebench-openenv-benchmark`

Conversation

@yonghongzhang-io

Summary

This PR adds a new community blog post: ComtradeBench, an OpenEnv-native benchmark for evaluating reliable LLM tool-use under adversarial API conditions.

The post documents the design of a 10-task benchmark, evaluation results for 5 frontier and open-source LLMs, a full GRPO training pipeline, and the discovery of a latent GRPO bug (a silent policy/actor desync).

Resources

Author

Member of blog-explorers.

Submitted as part of AgentBeats Phase 2 OpenEnv Challenge.

ComtradeBench is an OpenEnv-native benchmark that evaluates whether LLM agents
can execute multi-step API workflows reliably under adversarial conditions:
pagination, duplicates, rate limits, server errors, page drift, totals traps,
and adaptive adversaries.
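The adversarial conditions above can be illustrated with a minimal sketch: a toy paginated API that injects duplicates and transient rate limits, plus a client loop that retries and deduplicates. All names (`AdversarialPagedAPI`, `fetch_all`) are hypothetical and not taken from the ComtradeBench code.

```python
import random


class AdversarialPagedAPI:
    """Toy paginated API that injects duplicates and rate limits.

    Illustrative sketch only, not the actual ComtradeBench environment.
    """

    def __init__(self, records, page_size=3, seed=0):
        self.records = records
        self.page_size = page_size
        self.rng = random.Random(seed)

    def get_page(self, page):
        # Transient rate limit on roughly 1 in 4 calls.
        if self.rng.random() < 0.25:
            return {"error": 429, "items": []}
        start = page * self.page_size
        items = list(self.records[start:start + self.page_size])
        # Adversarial duplicate: sometimes re-emit the last record
        # of the previous page at the front of this one.
        if items and start > 0 and self.rng.random() < 0.5:
            items.insert(0, self.records[start - 1])
        return {"error": None, "items": items}


def fetch_all(api, max_retries=10):
    """Client loop: retry the same page on 429, deduplicate by id."""
    seen, out, page, retries = set(), [], 0, 0
    while True:
        resp = api.get_page(page)
        if resp["error"] == 429:
            retries += 1
            if retries > max_retries:
                raise RuntimeError("rate-limited too long")
            continue  # retry the same page
        retries = 0
        if not resp["items"]:
            break  # past the last page
        for rec in resp["items"]:
            if rec["id"] not in seen:
                seen.add(rec["id"])
                out.append(rec)
        page += 1
    return out
```

A naive client that trusts each page verbatim would double-count the injected duplicates; tracking seen ids is what makes the workflow robust to this adversary.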

Key contributions:
- 10 procedurally-generated tasks (T1-T9 + adaptive variant)
- 6-dimension scoring with anti-gaming governance gates
- 5 LLM evaluations (Kimi K2.5, Claude Sonnet 4.6, GPT-5, Llama 3.3 70B, Qwen2.5-7B)
- Full GRPO training pipeline with reproducible rollouts
- Latent GRPO bug found and fixed (silent policy/actor desync)
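A silent policy/actor desync of the kind mentioned above can be made loud with a cheap invariant check: fingerprint both parameter sets before each rollout and fail fast on mismatch. This is a generic sketch with hypothetical names (`param_fingerprint`, `check_sync`), not the fix from the PR.

```python
import hashlib
import json


def param_fingerprint(state_dict):
    """Hash model parameters (name -> list of floats) so drift is detectable.

    Illustrative only; real frameworks would hash tensor bytes instead.
    """
    blob = json.dumps(state_dict, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def check_sync(policy_params, actor_params):
    """Raise instead of silently generating rollouts with stale actor weights."""
    if param_fingerprint(policy_params) != param_fingerprint(actor_params):
        raise RuntimeError(
            "policy/actor desync: refresh actor weights before rollout"
        )
```

Running this check at the top of every rollout turns a silent training-quality bug into an immediate, debuggable failure.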

