First increment of the post-generation judge pass. Two independent checks that guard ADR *output* (analyze.py decides which candidates exist; this guards what the agent then writes):

- check_tag_syntax: DETERMINISTIC. After @adr comment tags are placed, the file must still parse. Dispatches to the language's own compiler (py_compile, node --check, ruby -c, bash -n, gofmt -e); unknown/absent checkers degrade to skipped, never false-fail.
- judge_options: LLM. Audits the "Considered Options" MADR section (the top hallucination risk) for whether each option is evidenced by the diff/messages. Runs N times, majority-vote overall, reports agreement as the variance signal. Can only DEMOTE unevidenced options, never invent, so it cannot drop correctness below the deterministic floor. Fails closed if all runs garble. Defaults to Haiku (cheap N-run verdict; JUDGE_MODEL override).

The model call is injected (runner callable), so prompt building, verdict parsing, and N-run aggregation are fully unit-tested offline: tests/test_judge.py (17 tests) never spawns claude. Verified end-to-end against the live CLI separately.

Relates to the LLM-judge follow-up from #2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

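For reviewers, a minimal sketch of how the `check_tag_syntax` dispatch described above could look. The `CHECKERS` table, the extension mapping, and the `"pass"`/`"fail"`/`"skipped"` return strings are assumptions made for illustration, not the actual contents of judge.py.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Hypothetical dispatch table: file extension -> syntax-check command prefix.
# The commands mirror the ones named in the commit message.
CHECKERS = {
    ".py": [sys.executable, "-m", "py_compile"],
    ".js": ["node", "--check"],
    ".rb": ["ruby", "-c"],
    ".sh": ["bash", "-n"],
    ".go": ["gofmt", "-e"],
}


def check_tag_syntax(path):
    """Return "pass", "fail", or "skipped" for a single file."""
    cmd = CHECKERS.get(Path(path).suffix)
    # Unknown language or checker not installed: skip rather than false-fail.
    if cmd is None or shutil.which(cmd[0]) is None:
        return "skipped"
    result = subprocess.run(cmd + [str(path)], capture_output=True, text=True)
    return "pass" if result.returncode == 0 else "fail"
```

The `shutil.which` guard is what keeps a missing toolchain from turning into a failure: a file is only ever reported as `"fail"` when a real checker ran and rejected it.
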
- SKILL Step 4: run judge-options after generating an ADR with Considered Options; on fail, drop unevidenced options / fall back to "No alternatives recorded". Includes the evidence-file build commands.
- SKILL Step 5: run check-tags after placing tags; on fail, revert/move the insertion. States the fixed syntax-check language set explicitly.
- README: new "Output verification" subsection (the two judge checks) and a "Tag validation coverage" table: Python/JS/Ruby/Shell/Go are syntax-checked; TypeScript/Java/Kotlin/Rust get tags but are skipped by the checker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- build_evidence(repo, shas) renders the judge's commit-evidence block (subject lines + name-status, where "instead of X" phrases and deletions of the replaced approach live) straight from a candidate's SHAs. Full hunks are omitted to keep the prompt dense.
- evidence_for_candidate(repo, candidate) pulls the SHAs from an analyze.py candidate dict, so the two scripts compose into one pipeline.
- new `judge.py evidence <repo> --commits ...` subcommand prints the block.
- `judge-options` gains --repo/--commits to build evidence automatically (--evidence file still supported as override).
- SKILL Step 4 now calls --commits instead of a hand-rolled `git show` combo.

Tests: build/compose against the real synthetic fixture repos (no model call). Suite: 31 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Implements the citation-grounded approach chosen after researching alternatives (NLI, claim-decomposition, RAG faithfulness, self-consistency). Rationale: the git history is a CLOSED corpus, so "retrieve then verify" collapses to "cite then look up": most checks become exact git lookups, and a fabricated SHA fails `git cat-file` immediately. This is a stronger and cheaper floor than the all-LLM N-run judge, whose majority vote measures variance, not the judge's systematic over-inference from terse metadata.

New deterministic layer in judge.py:

- inline citation grammar embedded as an HTML comment after each option: `* **Redis** <!-- evidence: <sha> deleted:path -->`
- extract_cited_options() pairs each option with its citation
- verify_citation() checks SHA existence + name-status (added/deleted/renamed) or message-phrase presence; exact, no model call
- verify_options() returns per-option verdicts and an overall of pass / fail (failed or uncited) / review (fuzzy -> LLM residue)
- `judge.py verify-options` CLI subcommand

The existing LLM judge_options is now the tier-2 residue handler for the fuzzy minority, not the primary check. NLI as a future zero-model-call residue layer is noted but deferred (keeps the stdlib-only property).

Tests: 9 new, derived from real fixture commit data (verified/failed/fabricated-sha/uncited/fuzzy/message-phrase). Suite: 40 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

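To make the citation grammar concrete, here is a rough sketch of the extract-then-look-up idea. The function names match the commit message, but the regexes, the dict shapes, the verdict strings, and the choice to pass name-status text in as a plain string are all assumptions for illustration. The real verify_citation() also checks that the SHA exists (for example with `git cat-file -e <sha>`), which this sketch leaves out.

```python
import re

# Assumed option shape: `* **Name** <!-- evidence: <sha> type:detail -->`
OPTION_RE = re.compile(r"^\*\s+\*\*(?P<name>.+?)\*\*")
CITE_RE = re.compile(
    r"<!--\s*evidence:\s*(?P<sha>[0-9a-f]{7,40})\s+"
    r"(?P<kind>[a-z]+):(?P<detail>.+?)\s*-->"
)

# Evidence type -> first letter of the matching `git show --name-status` row.
STATUS_LETTER = {"added": "A", "deleted": "D", "renamed": "R"}


def extract_cited_options(markdown):
    """Pair each bulleted option with its inline citation (or None)."""
    options = []
    for line in markdown.splitlines():
        head = OPTION_RE.match(line.strip())
        if not head:
            continue
        cite = CITE_RE.search(line)
        options.append({
            "name": head["name"],
            "citation": cite.groupdict() if cite else None,
        })
    return options


def verify_citation(citation, name_status):
    """Check a structural citation against name-status output for its commit."""
    if citation is None:
        return "uncited"
    letter = STATUS_LETTER.get(citation["kind"])
    if letter is None:
        return "uncheckable"  # not a name-status evidence type
    path = citation["detail"].strip('"')
    for row in name_status.splitlines():
        fields = row.split("\t")
        # Rename rows look like "R100\told\tnew", so match on the first letter.
        if fields[0].startswith(letter) and path in fields[1:]:
            return "verified"
    return "failed"
```

Because the lookup is an exact string comparison against git output, the same ADR text and the same repo always produce the same verdicts.
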
Completes the citation-grounded pipeline:

- verify_adr(): runs tier-1 structural checks, routes only the `fuzzy` residue (real SHA, non-structural evidence type) to the LLM judge, fed the actual diff hunks (build_evidence gains include_patch). Returns kept/dropped outcomes + residue agreement.
- render_verified_adr(): drops failed/uncited/unevidenced options, re-emits verified ones verbatim with their hidden citation comments, collapses to "No alternatives recorded" if none survive, and upserts an evidence-verified provenance block into the YAML frontmatter (preserving existing keys).
- `judge.py verify-adr [--write]` CLI for the whole flow.

SKILL:

- MADR template now requires an inline `<!-- evidence: sha type:detail -->` citation per option.
- Step 4 calls verify-adr (two tiers) instead of the old judge-options.
- Adds the downstream trust-boundary: evidence-verified ADRs are to be trusted, not re-investigated.

README: Output-verification section rewritten for the two-tier citation model.

Tests: +6 (kept/dropped routing, fuzzy->residue, frontmatter upsert, all-dropped collapse, no-op on no-options). Suite: 46 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

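The frontmatter upsert can be illustrated with a small stdlib-only sketch. `upsert_frontmatter` is a hypothetical helper name, and the flat `key: value` handling is an assumption; the real render_verified_adr() may treat nested YAML differently.

```python
def upsert_frontmatter(text, updates):
    """Insert or overwrite top-level keys in a `---` delimited frontmatter block.

    Assumes simple `key: value` lines; existing keys and their order are kept.
    """
    lines = text.split("\n")
    if lines and lines[0] == "---" and "---" in lines[1:]:
        end = lines.index("---", 1)
        head, body = lines[1:end], lines[end + 1:]
    else:
        head, body = [], lines  # no frontmatter yet: create one

    remaining = dict(updates)
    new_head = []
    for line in head:
        key = line.split(":", 1)[0].strip()
        is_top_level = not line.startswith((" ", "\t", "-"))
        if is_top_level and key in remaining:
            new_head.append(f"{key}: {remaining.pop(key)}")  # overwrite in place
        else:
            new_head.append(line)  # preserve untouched keys verbatim
    new_head += [f"{k}: {v}" for k, v in remaining.items()]
    return "\n".join(["---", *new_head, "---", *body])
```

Running it twice with the same updates yields the same text, which matters if a verification command with a write flag is re-run on an ADR that was already stamped.
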
Per decision that citation grounding is sufficient: drop the LLM residue tier and the N-run variance plan entirely.

- Add `removed:"<line>"` evidence type: greps the cited commit's diff for a deleted line, covering "code replaced inside a modified file" (the one case that previously needed the LLM) deterministically.
- verify_adr no longer takes a runner/runs and never calls a model. Options that aren't `verified` (failed, uncited, or an uncheckable evidence type) are dropped. Fully reproducible: same repo + text -> same result.
- verify_options collapses to pass/fail (no review/fuzzy state).
- Provenance string is now `citation-structural`.
- judge_options (LLM, N-run majority vote) is kept ONLY as an optional manual CLI escape hatch, explicitly off the verification path, with a comment on why it was demoted (vote measures variance, not systematic over-inference).
- Citation detail now supports quoting to carry spaces/punctuation.

SKILL + README: document the single deterministic check and the `removed:"line"` citation; drop all tier-2/N-run language.

Tests: residue/runner tests replaced with deterministic ones (uncheckable drop, removed-line verify, reproducibility). Suite: 47 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

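As an illustration of the `removed:"<line>"` check, the sketch below scans unified diff text (such as the output of `git show <sha>`) for a deleted line. `diff_removes_line` is a hypothetical name, and matching on whitespace-trimmed equality is an assumption; the actual check may be stricter or looser.

```python
def diff_removes_line(diff_text, cited_line):
    """Return True if the unified diff deletes a line equal to cited_line.

    Comparison ignores leading and trailing whitespace (an assumption).
    """
    target = cited_line.strip()
    in_hunk = False
    for row in diff_text.splitlines():
        if row.startswith("diff --git"):
            in_hunk = False  # new file header; "--- a/..." lines follow
        elif row.startswith("@@"):
            in_hunk = True   # hunk body starts after this marker
        elif in_hunk and row.startswith("-") and row[1:].strip() == target:
            return True
    return False
```

Tracking hunk boundaries keeps the `--- a/path` file headers from being mistaken for deletions, while still recognising deleted lines whose own content begins with dashes.
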
LLM-judge follow-up from #2, redesigned as fully deterministic citation-grounded verification after researching alternatives (NLI, claim-decomposition, RAG faithfulness, self-consistency). Key insight: the git history is a closed corpus, so "retrieve-then-verify" collapses to "cite-then-look-up": every check is an exact git lookup, and a fabricated SHA fails `git cat-file` instantly. No model on the verification path.

Pipeline (`scripts/judge.py`)

- Considered Options (deterministic). The generator cites each option inline (`<!-- evidence: <sha> deleted:path -->`). `verify_adr` checks each citation against git: `deleted`/`added`/`renamed` → name-status; `message:` → commit message; `removed:"line"` → greps the commit's diff for that deleted line (covers code replaced inside a modified file, the one case that used to need an LLM).
- Provenance render. `render_verified_adr` drops unevidenced options, re-emits verified ones verbatim (citation comments preserved as audit trail), collapses to "No alternatives recorded" if none survive, and stamps `evidence-verified: true` / `verification: citation-structural` into the YAML frontmatter (existing keys preserved).
- Tag syntax check (deterministic). After `@ADR` tags are placed, files are parse-checked with each language's own compiler (Python/JS/Ruby/Shell/Go; TS/Java/Rust degrade to skipped).

`judge_options` (LLM, N-run majority vote) is kept ONLY as an optional manual CLI escape hatch, explicitly OFF the verification path. It was demoted because a model verdict over terse metadata measures sample variance, not the judge's systematic over-inference.

CLIs: `verify-adr [--write]`, `verify-options`, `evidence`, `check-tags`, `judge-options`.

Skill + docs

- […] `removed:"line"`).
- […] `verify-adr`; Step 5 runs `check-tags`.
- […] `evidence-verified: true` ADR is to be trusted, NOT re-investigated; citations are an audit trail, not a re-check prompt.

Testability

Tier-1, evidence builder, frontmatter render, and the (off-path) LLM judge are unit-tested offline against the real synthetic fixtures; the verification path has no model to stub. `tests/test_judge.py`: 36 tests, never spawns claude. Full suite 47 passed, 1 xfailed. Verified end-to-end against the live CLI.

Done

- […] `removed:"line"` content check)

Optional follow-ups (not blocking)

🤖 Generated with Claude Code