---
title: BugHunt
emoji: 🐛
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
tags:
---
BugHunt is an OpenEnv-compliant reinforcement learning environment where AI agents learn to debug Python code through systematic investigation and targeted fixes.
Unique differentiator: in hard mode, bugs are interdependent, so fixing one without the other yields zero improvement. Agents must reason about bug coupling, not just pattern-match.
```
┌──────────────────────────────────────────────────────┐
│                     BugHunt v2.0                     │
├──────────┬──────────┬──────────┬──────────┬──────────┤
│  Gradio  │ Analytics│Curriculum│ Bug Dep  │ Leader-  │
│  Web UI  │ Tracking │ Learning │  Graphs  │  board   │
├──────────┴──────────┴──────────┴──────────┴──────────┤
│            OpenEnv Core SDK (Environment)            │
│       reset() · step() · state · create_app()        │
├──────────────────────────────────────────────────────┤
│              Sandboxed Python Execution              │
│      Restricted builtins · No imports · No eval      │
└──────────────────────────────────────────────────────┘
```
Agents receive buggy Python functions and must:
- Inspect function source code (free, no penalty)
- Run tests to identify failures and read diagnostic hints (free)
- Propose fixes by submitting corrected function definitions (scored)
- Submit to finalize their solution

This mirrors real-world debugging: gather information → form hypotheses → apply fixes → validate.
| Task | Difficulty | Bugs | Tests | Max Ops | Key Challenge |
|---|---|---|---|---|---|
| `easy` | ⭐ | 1 | 5 | 10 | Off-by-one divisor in `calculate_average` |
| `medium` | ⭐⭐ | 2 | 7 | 15 | Two independent bugs in text processing |
| `hard` | ⭐⭐⭐ | 3 | 9 | 20 | Two interdependent bugs + one independent |
The hard task features a unique challenge: two of three bugs are interdependent. Bug 1 (wrong operator in `weighted_average`) masks Bug 2 (swapped arguments in `calculate_final_grade`). Fixing Bug 2 alone produces zero observable improvement because the arithmetic is still wrong. The agent must understand both bugs before fixing either: a test of genuine debugging reasoning, not just pattern matching.
```
H_BUG1 (weighted_average: + → *) ──masks──▶ H_BUG2 (calculate_final_grade: args swapped)

H_BUG3 (class_statistics: > → >=) ──independent──
```
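To make the coupling concrete, here is an illustrative reconstruction of the two interdependent functions. Only the function names and bug descriptions come from the task metadata above; the bodies and sample values are assumptions.

```python
# Illustrative sketch of the hard-mode bug coupling. Function names and bug
# descriptions are from this README; the exact bodies are assumptions.

def weighted_average(values, weights):
    # Bug 1: '+' where '*' belongs, so weights are never actually applied.
    return sum(v + w for v, w in zip(values, weights)) / sum(weights)

def weighted_average_fixed(values, weights):
    # Correct version, for comparison.
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def calculate_final_grade(scores, weights):
    # Bug 2: arguments passed to weighted_average in swapped order.
    return weighted_average(weights, scores)

# Unswapping the arguments (fixing Bug 2 alone) still routes every value
# through the broken '+' arithmetic, so the grade tests keep failing and the
# graded score delta is zero: Bug 1 masks Bug 2.
grade_bug2_only_fixed = weighted_average([0.9, 0.5], [2, 1])       # still wrong
grade_correct = weighted_average_fixed([0.9, 0.5], [2, 1])
```

The point of the sketch: a fix can be locally correct yet invisible to the grader until the upstream bug is also repaired.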
- ✅ OpenEnv SDK v2: full compliance with `openenv-core` types and server
- ✅ Deterministic grading: no randomness, no LLM-as-judge
- ✅ Sandboxed execution: restricted builtins, no imports, no file access
- ✅ Concurrent sessions: via OpenEnv WebSocket protocol
- 📈 Curriculum Learning: auto-promotes difficulty when avg score > 0.8
- 📊 Analytics & Metrics: per-task episode stats, reward distributions
- 🏆 Leaderboard: top 10 scores per difficulty level
- 🔗 Bug Dependency Graphs: structured metadata for interdependent bugs
- 🎨 Interactive Web UI: Gradio playground at `/web`
- 🔍 Capabilities API: machine-readable feature discovery
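The "restricted builtins, no imports" sandbox can be sketched as follows. This is a minimal illustration of the technique, not BugHunt's actual sandbox implementation.

```python
# Minimal restricted-builtins sandbox sketch (an assumption, not the env's
# real code): submitted code executes with a small allow-list of builtins,
# so import, open, and eval are simply unavailable in its namespace.
SAFE_BUILTINS = {"len": len, "sum": sum, "range": range, "min": min,
                 "max": max, "abs": abs, "zip": zip, "enumerate": enumerate}

def run_sandboxed(source, func_name, *args):
    namespace = {"__builtins__": SAFE_BUILTINS}
    exec(source, namespace)   # no __import__ in scope, so imports fail
    return namespace[func_name](*args)
```

With this namespace, an `import` statement in submitted code raises `ImportError` because `__import__` is missing from the builtins mapping, and `open`/`eval` raise `NameError`.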
| Action | Parameters | Reward | Description |
|---|---|---|---|
| `inspect_function` | `function_name` | 0.0 | Read function source code |
| `run_test` | `test_id` | 0.0 | Execute test, see pass/fail + hint |
| `propose_fix` | `function_name`, `new_code` | Δscore or -0.05 | Replace function with fix |
| `submit` | (none) | final_score | Finalize episode |
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment |
| `/step` | POST | Take an action |
| `/state` | GET | Current state |
| `/ws` | WS | WebSocket sessions |
| `/docs` | GET | Swagger API docs |
| `/web` | GET | Interactive Gradio UI |
| Endpoint | Method | Description |
|---|---|---|
| `/analytics` | GET | Aggregate episode metrics |
| `/analytics/record` | POST | Record episode result |
| `/leaderboard` | GET | Top scores per task |
| `/curriculum` | GET | Curriculum learning status |
| `/curriculum/step` | POST | Record & check promotion |
| `/tasks/info` | GET | Task metadata & challenges |
| `/tasks/dependency_graph/{id}` | GET | Bug dependency graph |
| `/env/capabilities` | GET | Feature discovery |
```python
import asyncio

from client import BugHuntEnv
from models import BugHuntAction

async def main():
    async with BugHuntEnv(base_url="https://ayushxx9-bughunt-env.hf.space") as env:
        result = await env.reset(task_id="easy")
        result = await env.inspect_function("calculate_average")
        result = await env.run_test("E1")
        result = await env.propose_fix(
            "calculate_average",
            "def calculate_average(numbers):\n"
            "    if not numbers:\n"
            "        return 0\n"
            "    return sum(numbers) / len(numbers)\n",
        )
        result = await env.submit()
        print(f"Score: {result.reward}")

asyncio.run(main())
```

```shell
# Health check
curl https://ayushxx9-bughunt-env.hf.space/health

# Reset
curl -X POST https://ayushxx9-bughunt-env.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy"}'

# Step
curl -X POST https://ayushxx9-bughunt-env.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "inspect_function", "function_name": "calculate_average"}}'

# Bug dependency graph (hard mode)
curl https://ayushxx9-bughunt-env.hf.space/tasks/dependency_graph/hard
```

```shell
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860

# With web UI enabled
ENABLE_WEB_INTERFACE=true uvicorn server.app:app --host 0.0.0.0 --port 7860
```

| Event | Reward | Rationale |
|---|---|---|
| `inspect_function` | 0.0 | Free information gathering |
| `run_test` | 0.0 | Free information gathering |
| `propose_fix` (improves) | Δscore | Positive delta encourages progress |
| `propose_fix` (no improvement) | -0.05 | Small penalty prevents guess-and-check |
| Invalid action | -0.05 | Penalty for errors |
| `submit` | final_score ∈ [0, 1] | Fraction of tests passing |
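The shaping above can be summarized in a small function. The event labels and this signature are illustrative labels of mine; the numeric values are the documented ones.

```python
# Reward shaping from the table above. Event names and this signature are
# assumptions; the reward values come from the README's reward table.
def action_reward(event, score_delta=0.0, final_score=0.0):
    if event in ("inspect_function", "run_test"):
        return 0.0                  # free information gathering
    if event == "propose_fix":
        # positive delta when the fix improves tests, flat penalty otherwise
        return score_delta if score_delta > 0 else -0.05
    if event == "invalid":
        return -0.05                # penalty for malformed actions
    if event == "submit":
        return final_score          # fraction of tests passing, in [0, 1]
    raise ValueError(f"unknown event: {event}")
```

The asymmetry is deliberate: information gathering is never taxed, while blind guess-and-check on `propose_fix` bleeds -0.05 per miss.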
BugHunt supports automatic difficulty progression:

```
easy ──(avg > 0.8)──▶ medium ──(avg > 0.8)──▶ hard
```

Use `/curriculum` to check the current level and `/curriculum/step` to record scores. The controller auto-promotes when the sliding-window average exceeds the threshold.
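A sliding-window controller of the kind `/curriculum/step` describes might look like the sketch below. The 0.8 threshold and the easy → medium → hard ladder are documented above; the window size (10) and this class API are assumptions.

```python
from collections import deque

LEVELS = ["easy", "medium", "hard"]

# Sketch of a sliding-window promotion controller. Threshold and level ladder
# are from this README; the window size and this API are assumptions about
# the real /curriculum/step behavior.
class Curriculum:
    def __init__(self, threshold=0.8, window=10):
        self.level_idx = 0
        self.threshold = threshold
        self.scores = deque(maxlen=window)

    def record(self, score):
        """Record an episode score; promote when the full window's average clears the bar."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        avg = sum(self.scores) / len(self.scores)
        if window_full and avg > self.threshold and self.level_idx < len(LEVELS) - 1:
            self.level_idx += 1
            self.scores.clear()   # start fresh at the new difficulty
        return LEVELS[self.level_idx]
```

Clearing the window on promotion avoids instantly re-promoting off scores earned at the easier tier.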
| Variable | Default | Description |
|---|---|---|
| `API_BASE_URL` | `https://api.openai.com/v1` | LLM API endpoint |
| `MODEL_NAME` | `gpt-4o-mini` | Model for inference |
| `HF_TOKEN` | (none) | HuggingFace / API token |
| `ENV_BASE_URL` | `http://localhost:7860` | Environment server URL |
| `ENABLE_WEB_INTERFACE` | `false` | Enable Gradio at `/web` |
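For a local run against a self-hosted server, the variables above might be set like this (all values are placeholders):

```shell
# Placeholder local configuration; adjust values for your deployment.
export ENV_BASE_URL=http://localhost:7860
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4o-mini
export ENABLE_WEB_INTERFACE=true
# HF_TOKEN has no default; supply your own token (value below is a placeholder).
export HF_TOKEN=hf_xxx
```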
- SDK: `openenv-core` with `Environment`, `Action`, `Observation`, `State` types
- Server: `create_app()` factory with custom `gradio_builder`
- Grading: 100% deterministic (no randomness, no LLM-as-judge)
- Sandbox: code execution uses restricted builtins
- Concurrency: supports concurrent sessions via WebSocket
- Analytics: in-memory episode tracking with leaderboard
- Curriculum: sliding-window auto-promotion across difficulty tiers
MIT