Deterministic model selection + judge-only regrading #27
jb510 wants to merge 2 commits into pinchbench:main from
Conversation
if assistant_error:
    status = "error"
    stderr = f"{stderr}\nAssistant error: {assistant_error}".strip()
if runtime_model is None:
WARNING: Status override chain masks timeout status
When a task times out, status is set to "timeout" at line 632. However, a timed-out task will likely have an incomplete or empty transcript, causing runtime_model to be None. This check then unconditionally overrides the status to "error" with a misleading "Could not verify runtime provider/model" message, masking the real root cause (timeout).
The same pattern exists in run_openclaw_prompt at line 773.
Consider guarding the model verification checks so they only run when status == "success":
if status == "success":
    if runtime_model is None:
        status = "error"
        stderr = f"{stderr}\nCould not verify runtime provider/model from transcript.".strip()
    elif runtime_model.lower() != requested_model.lower():
        status = "error"
        stderr = (
            f"{stderr}\nModel mismatch: requested `{requested_model}` but runtime used `{runtime_model}`."
        ).strip()

recent_entries = []
for entry in transcript:
    entry_ts = _entry_timestamp_epoch(entry)
    if entry_ts is not None and entry_ts >= cutoff:
WARNING: Transcript filtering silently drops entries without timestamps
The condition entry_ts is not None and entry_ts >= cutoff only keeps entries that have a parseable timestamp AND are recent enough. Entries without timestamps (e.g., tool-result events, custom events, or malformed entries with parse_error) are silently dropped when recent_entries is non-empty.
This could cause incomplete transcripts for downstream consumers like _extract_runtime_model_ref (which looks for custom/model-snapshot entries) and _extract_usage_from_transcript.
Consider also including entries that have no timestamp (they likely belong to the current invocation):
if entry_ts is None or entry_ts >= cutoff:
    recent_entries.append(entry)
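As a self-contained sketch, the loop with that fix applied might look like the following (the timestamp helper here is a simplified stand-in for the repo's `_entry_timestamp_epoch`, and `filter_recent` is an illustrative name):

```python
from datetime import datetime, timezone


def _entry_timestamp_epoch(entry):
    """Parse an entry's ISO-8601 timestamp into epoch seconds.

    Returns None when the timestamp is absent or unparseable, so callers
    can decide how to treat timestamp-less entries.
    """
    ts = entry.get("timestamp")
    if not ts:
        return None
    try:
        return datetime.fromisoformat(ts.replace("Z", "+00:00")).timestamp()
    except (ValueError, AttributeError, TypeError):
        return None


def filter_recent(transcript, cutoff):
    """Keep entries at/after cutoff, plus entries without a timestamp.

    Timestamp-less entries (tool results, custom events, parse errors)
    likely belong to the current invocation, so they are retained rather
    than silently dropped.
    """
    recent_entries = []
    for entry in transcript:
        entry_ts = _entry_timestamp_epoch(entry)
        if entry_ts is None or entry_ts >= cutoff:
            recent_entries.append(entry)
    return recent_entries
```

This keeps custom/model-snapshot entries visible to downstream consumers such as `_extract_runtime_model_ref`, at the cost of possibly including a few stale untimestamped entries.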
Code Review Summary
Status: 4 Issues Found | Recommendation: Address before merge

Overview

Issue Details
WARNING
SUGGESTION

Files Reviewed (5 files)
    return _MODEL_CATALOG_CACHE

try:
    result = subprocess.run(
[WARNING]: Missing timeout on subprocess.run — if openclaw models list hangs (broken CLI, network issue, unresponsive daemon), the benchmark process will block indefinitely.
Consider adding a reasonable timeout (e.g., 30s) so the benchmark fails fast with a clear error instead of hanging:
result = subprocess.run(
    ["openclaw", "models", "list", "--all", "--json"],
    capture_output=True,
    text=True,
    check=False,
    timeout=30,
)
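A runnable sketch of how that timeout could be paired with explicit error handling (the `load_model_catalog` name, the parameterized `cmd`, and the error messages are illustrative, not from the PR):

```python
import json
import subprocess


def load_model_catalog(cmd=("openclaw", "models", "list", "--all", "--json"),
                       timeout_s=30):
    """Run the model-listing CLI with a timeout so a hung daemon or broken
    CLI fails fast instead of blocking the benchmark indefinitely."""
    try:
        result = subprocess.run(
            list(cmd),
            capture_output=True,
            text=True,
            check=False,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run raises TimeoutExpired after killing the child
        raise RuntimeError(
            f"`{' '.join(cmd)}` did not respond within {timeout_s}s"
        ) from None
    except FileNotFoundError:
        raise RuntimeError(f"`{cmd[0]}` not found on PATH") from None
    if result.returncode != 0:
        raise RuntimeError(f"model listing failed: {result.stderr.strip()}")
    return json.loads(result.stdout)
```

Converting both failure modes into a single `RuntimeError` gives callers one clear error path instead of an indefinite hang or an unhandled traceback.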
| "stdout": result.get("stdout", ""), | ||
| "stderr": result.get("stderr", ""), | ||
| "transcript_length": len(result.get("transcript", [])), | ||
| "transcript": result.get("transcript", []), |
[SUGGESTION]: Including full transcripts in checkpoint payloads means every _flush_checkpoint() call re-serializes all accumulated transcripts to disk. With N tasks, total checkpoint I/O grows as O(N²) in transcript size.
For a 22-task suite with large transcripts this is likely fine, but if transcript sizes grow (e.g., multi-turn tasks with tool use) this could become a bottleneck. Consider either:
- Writing only the latest task's data incrementally (append-style), or
- Deferring full transcript inclusion to the final output only (checkpoints could store `transcript_length` without the full `transcript` array)
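A minimal sketch of the append-style alternative, assuming a JSON Lines checkpoint file (`append_checkpoint` and `load_checkpoint` are hypothetical names, not the PR's `_flush_checkpoint`):

```python
import json


def append_checkpoint(path, task_result):
    """Append one task's summary as a JSON Lines record.

    Each flush writes only the newest record, so total checkpoint I/O
    stays O(N) rather than O(N^2) in accumulated transcript size.
    """
    record = {
        "task_id": task_result.get("task_id"),
        "stdout": task_result.get("stdout", ""),
        "stderr": task_result.get("stderr", ""),
        # store only the length; the full transcript is deferred
        # to the final output file
        "transcript_length": len(task_result.get("transcript", [])),
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_checkpoint(path):
    """Read all records back, e.g. for resume or judge-only regrading."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending one line per task also makes a partially written checkpoint recoverable: every complete line is a valid record regardless of where the run stopped.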
I feel like n^2 is okay here. If in the future we have 100s or 1000s of tasks, you can show me this comment and point and laugh
|
@olearycrew Merged #29 first as it's simpler. Will be testing this one today before merging |
|
👋 Hi @jb510! I'm @olearycrew's OpenClaw bot doing a triage pass. Heads up: this PR has merge conflicts with main that need to be resolved. Looks like #29 was merged first (per evanjacobson's comment), which probably caused some overlap. Just flagging in case you missed it! |
olearycrew
left a comment
@jb510 thanks again for this submission. There are a few merge conflicts, as well as some bot comments that I wanted to make sure we either dismiss or address properly before merging
Summary
This PR makes benchmark model routing deterministic and adds a regrading workflow that avoids rerunning task execution.
Deterministic routing
- Model selection resolves to an explicit `provider/model` reference
- No implicit `openrouter/` rewriting

Error handling improvements
Session persistence + checkpoints
- `--clear-sessions` to explicitly clear transcripts before turns

Regrading workflow
- `--judge-only <results-or-checkpoint.json>` to re-run grading without re-executing tasks
- `--judge-model` for replay grading

Docs
- `--judge-only` and `--clear-sessions`

Notes
Closes #16