Deterministic model selection + judge-only regrading #27
jb510 wants to merge 2 commits into pinchbench:main from
Conversation
if assistant_error:
    status = "error"
    stderr = f"{stderr}\nAssistant error: {assistant_error}".strip()
if runtime_model is None:
WARNING: Status override chain masks timeout status
When a task times out, status is set to "timeout" at line 632. However, a timed-out task will likely have an incomplete or empty transcript, causing runtime_model to be None. This check then unconditionally overrides the status to "error" with a misleading "Could not verify runtime provider/model" message, masking the real root cause (timeout).
The same pattern exists in run_openclaw_prompt at line 773.
Consider guarding the model verification checks so they only run when status == "success":
if status == "success":
    if runtime_model is None:
        status = "error"
        stderr = f"{stderr}\nCould not verify runtime provider/model from transcript.".strip()
    elif runtime_model.lower() != requested_model.lower():
        status = "error"
        stderr = (
            f"{stderr}\nModel mismatch: requested `{requested_model}` but runtime used `{runtime_model}`."
        ).strip()

recent_entries = []
for entry in transcript:
    entry_ts = _entry_timestamp_epoch(entry)
    if entry_ts is not None and entry_ts >= cutoff:
WARNING: Transcript filtering silently drops entries without timestamps
The condition entry_ts is not None and entry_ts >= cutoff only keeps entries that have a parseable timestamp AND are recent enough. Entries without timestamps (e.g., tool-result events, custom events, or malformed entries with parse_error) are silently dropped when recent_entries is non-empty.
This could cause incomplete transcripts for downstream consumers like _extract_runtime_model_ref (which looks for custom/model-snapshot entries) and _extract_usage_from_transcript.
Consider also including entries that have no timestamp (they likely belong to the current invocation):
if entry_ts is None or entry_ts >= cutoff:
    recent_entries.append(entry)
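As a self-contained sketch, the loop with that fix applied might look like the following (the timestamp helper here is a simplified stand-in for the repo's `_entry_timestamp_epoch`, and `filter_recent` is an illustrative name):

```python
from datetime import datetime, timezone


def _entry_timestamp_epoch(entry):
    """Parse an entry's ISO-8601 timestamp into epoch seconds.

    Returns None when the timestamp is absent or unparseable, so callers
    can decide how to treat timestamp-less entries.
    """
    ts = entry.get("timestamp")
    if not ts:
        return None
    try:
        return datetime.fromisoformat(ts.replace("Z", "+00:00")).timestamp()
    except (ValueError, AttributeError, TypeError):
        return None


def filter_recent(transcript, cutoff):
    """Keep entries at/after cutoff, plus entries without a timestamp.

    Timestamp-less entries (tool results, custom events, parse errors)
    likely belong to the current invocation, so they are retained rather
    than silently dropped.
    """
    recent_entries = []
    for entry in transcript:
        entry_ts = _entry_timestamp_epoch(entry)
        if entry_ts is None or entry_ts >= cutoff:
            recent_entries.append(entry)
    return recent_entries
```

This keeps custom/model-snapshot entries visible to downstream consumers such as `_extract_runtime_model_ref`, at the cost of possibly including a few stale untimestamped entries.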
Code Review Summary
Status: 4 Issues Found | Recommendation: Address before merge

Overview

Issue Details
WARNING
SUGGESTION

Files Reviewed (5 files)
    return _MODEL_CATALOG_CACHE

try:
    result = subprocess.run(
[WARNING]: Missing timeout on subprocess.run — if openclaw models list hangs (broken CLI, network issue, unresponsive daemon), the benchmark process will block indefinitely.
Consider adding a reasonable timeout (e.g., 30s) so the benchmark fails fast with a clear error instead of hanging:
result = subprocess.run(
    ["openclaw", "models", "list", "--all", "--json"],
    capture_output=True,
    text=True,
    check=False,
    timeout=30,
)
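A runnable sketch of how that timeout could be paired with explicit error handling (the `load_model_catalog` name, the parameterized `cmd`, and the error messages are illustrative, not from the PR):

```python
import json
import subprocess


def load_model_catalog(cmd=("openclaw", "models", "list", "--all", "--json"),
                       timeout_s=30):
    """Run the model-listing CLI with a timeout so a hung daemon or broken
    CLI fails fast instead of blocking the benchmark indefinitely."""
    try:
        result = subprocess.run(
            list(cmd),
            capture_output=True,
            text=True,
            check=False,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run raises TimeoutExpired after killing the child
        raise RuntimeError(
            f"`{' '.join(cmd)}` did not respond within {timeout_s}s"
        ) from None
    except FileNotFoundError:
        raise RuntimeError(f"`{cmd[0]}` not found on PATH") from None
    if result.returncode != 0:
        raise RuntimeError(f"model listing failed: {result.stderr.strip()}")
    return json.loads(result.stdout)
```

Converting both failure modes into a single `RuntimeError` gives callers one clear error path instead of an indefinite hang or an unhandled traceback.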
| "stdout": result.get("stdout", ""), | ||
| "stderr": result.get("stderr", ""), | ||
| "transcript_length": len(result.get("transcript", [])), | ||
| "transcript": result.get("transcript", []), |
[SUGGESTION]: Including full transcripts in checkpoint payloads means every _flush_checkpoint() call re-serializes all accumulated transcripts to disk. With N tasks, total checkpoint I/O grows as O(N²) in transcript size.
For a 22-task suite with large transcripts this is likely fine, but if transcript sizes grow (e.g., multi-turn tasks with tool use) this could become a bottleneck. Consider either:
- Writing only the latest task's data incrementally (append-style), or
- Deferring full transcript inclusion to the final output only (checkpoints could store `transcript_length` without the full `transcript` array)
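A minimal sketch of the append-style alternative, assuming a JSON Lines checkpoint file (`append_checkpoint` and `load_checkpoint` are hypothetical names, not the PR's `_flush_checkpoint`):

```python
import json


def append_checkpoint(path, task_result):
    """Append one task's summary as a JSON Lines record.

    Each flush writes only the newest record, so total checkpoint I/O
    stays O(N) rather than O(N^2) in accumulated transcript size.
    """
    record = {
        "task_id": task_result.get("task_id"),
        "stdout": task_result.get("stdout", ""),
        "stderr": task_result.get("stderr", ""),
        # store only the length; the full transcript is deferred
        # to the final output file
        "transcript_length": len(task_result.get("transcript", [])),
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_checkpoint(path):
    """Read all records back, e.g. for resume or judge-only regrading."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending one line per task also makes a partially written checkpoint recoverable: every complete line is a valid record regardless of where the run stopped.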
I feel like n^2 is okay here. If in the future we have 100s or 1000s of tasks, you can show me this comment and point and laugh
|
@olearycrew Merged #29 first as it's simpler. Will be testing this one today before merging |
|
👋 Hi @jb510! I'm @olearycrew's OpenClaw bot doing a triage pass. Heads up: this PR has merge conflicts with main that need to be resolved. Looks like #29 was merged first (per evanjacobson's comment), which probably caused some overlap. Just flagging in case you missed it! |
olearycrew
left a comment
@jb510 thanks again for this submission. There are a few merge conflicts, as well as some bot comments that I wanted to make sure we either dismiss or address properly before merging
Summary
This PR makes benchmark model routing deterministic and adds a regrading workflow that avoids rerunning task execution.
Deterministic routing
- Model selection resolves to an explicit `provider/model` reference
- No implicit `openrouter/` rewriting

Error handling improvements
Session persistence + checkpoints
- `--clear-sessions` to explicitly clear transcripts before turns

Regrading workflow
- `--judge-only <results-or-checkpoint.json>` to re-run grading without re-executing tasks
- `--judge-model` for replay grading

Docs
- `--judge-only` and `--clear-sessions`

Notes
Closes #16