Normalize judge totals to 0-1 scale#73

Open
luccathescientist wants to merge 1 commit into pinchbench:main from luccathescientist:fix-judge-total-normalization

Conversation

@luccathescientist

Summary

This fixes an issue where LLM-judge scores could exceed 1.0 when the judge returned a total that was the sum of the per-criterion scores rather than their arithmetic mean.

The grading pipeline already expects scores on a 0..1 scale, but some recent runs showed impossible values like 3.85/1.0, 5.0/1.0, and hybrid
scores above 1.0.

What changed

  • Clarified the judge prompt so total must be the arithmetic mean of criterion scores and must remain in 0..1

  • Normalized judge responses so that when:

    • per-criterion scores are in 0..1, and
    • the returned total is greater than 1.0

    we treat that total as a summed score and convert it back to the mean

  • Added regression tests for:

    • summed LLM-judge totals being normalized back to 0..1
    • hybrid scoring staying within range after normalization
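The normalization described above can be sketched roughly as follows (a minimal illustration; the function name is hypothetical, and the actual implementation lives in scripts/lib_grading.py):

```python
def normalize_judge_total(criteria: dict[str, float], total: float) -> float:
    """Return a total on the 0..1 scale.

    If every per-criterion score is in [0.0, 1.0] but the reported total
    exceeds 1.0, the judge most likely summed the criteria; convert that
    sum back to the arithmetic mean. Otherwise, return the total as-is.
    """
    values = [float(v) for v in criteria.values()]
    # Guard: only fire when the criteria themselves look like 0..1 scores
    # AND the total overflows the expected range.
    if values and all(0.0 <= v <= 1.0 for v in values) and total > 1.0:
        return sum(values) / len(values)
    return total
```

Note that the guard deliberately leaves wider scales alone: criteria scored 0..5 with a total of 7.5 pass through untouched, since the individual values already exceed 1.0.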

Evidence

Observed in saved benchmark runs:

  • results/0009_local-openai-gpt-oss-20b.json
    • task_15_daily_summary: 3.85/1.0
    • task_16_email_triage: 2.8925/1.0
  • results/0010_local-openai-gpt-oss-20b.json
    • task_15_daily_summary: 5.0/1.0
    • task_17_email_search: 3.39/1.0

In these cases, the criterion breakdowns were already on a 0..1 scale, so the overflowing totals were consistent with summed judge outputs leaking
into aggregate scoring.

Validation

python3 -m py_compile scripts/lib_grading.py tests/test_lib_grading.py
python3 -m unittest tests/test_lib_grading.py
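The regression tests might look roughly like this (hypothetical test and function names; a simplified copy of the guard is inlined here for illustration, whereas the real logic lives in scripts/lib_grading.py):

```python
import unittest


def normalize_judge_total(criteria: dict[str, float], total: float) -> float:
    # Simplified inline copy of the normalization guard, for illustration only.
    values = [float(v) for v in criteria.values()]
    if values and all(0.0 <= v <= 1.0 for v in values) and total > 1.0:
        return sum(values) / len(values)
    return total


class TestJudgeTotalNormalization(unittest.TestCase):
    def test_summed_total_is_converted_to_mean(self):
        # Criteria on 0..1; a returned total of 3.85 is their sum.
        criteria = {"accuracy": 0.9, "format": 1.0, "tone": 0.95, "coverage": 1.0}
        self.assertAlmostEqual(normalize_judge_total(criteria, 3.85), 0.9625)

    def test_in_range_total_is_untouched(self):
        criteria = {"accuracy": 0.9, "format": 1.0}
        self.assertEqual(normalize_judge_total(criteria, 0.95), 0.95)

    def test_wider_scale_is_left_alone(self):
        # Criteria scored 0..5: a total above 1.0 is legitimate here,
        # so normalization must NOT fire.
        criteria = {"accuracy": 4.0, "format": 3.5}
        self.assertEqual(normalize_judge_total(criteria, 7.5), 7.5)
```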

@kilo-code-bot
Contributor

kilo-code-bot bot commented Mar 21, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

The normalization logic is well-guarded: it only fires when all criterion scores are individually within [0.0, 1.0] AND the reported total exceeds 1.0, which correctly distinguishes summed totals from legitimately high scores on a wider scale. The prompt clarification is a sensible belt-and-suspenders approach alongside the runtime normalization.

The regression tests cover the primary fix path. One minor gap is the absence of a negative test (e.g., criteria scored on a 0..5 scale with total > 1.0 should NOT be normalized), but the guard condition all(0.0 <= float(v) <= 1.0 for v in values) handles that correctly.

Files Reviewed (2 files)
  • scripts/lib_grading.py - 0 issues
  • tests/test_lib_grading.py - 0 issues

Reviewed by claude-4.6-opus-20260205 · 173,890 tokens
