Citation-grounded output verification for ADRs by hailcpy · Pull Request #12 · hailcpy/gen-adr

hailcpy · 2026-06-05T15:22:09Z

LLM-judge follow-up from #2, redesigned as fully deterministic citation-grounded verification after researching alternatives (NLI, claim-decomposition, RAG faithfulness, self-consistency). Key insight: the git history is a closed corpus, so "retrieve-then-verify" collapses to "cite-then-look-up" — every check is an exact git lookup, and a fabricated SHA fails git cat-file instantly. No model on the verification path.

Pipeline (`scripts/judge.py`)

Considered Options: deterministic. The generator cites each option inline, in an HTML comment placed after the option. verify_adr checks each citation against git:

deleted / added / renamed → name-status; message: → commit message; removed:"line" → greps the commit's diff for that deleted line (covers code replaced inside a modified file — the one case that used to need an LLM).
Fabricated SHA / claim-that-didn't-happen / uncited / uncheckable evidence type → dropped. An option that can only be argued semantically is dropped, not adjudicated — for a retroactive ADR, if you can't point at the bytes it shouldn't be asserted.
Fully reproducible: same repo + text → same result.
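The lookups above can be sketched roughly as follows. This is an illustrative sketch, not the code in `scripts/judge.py`: the helper names (`git`, `sha_exists`, `name_status_matches`, `diff_removes_line`) are hypothetical, and it assumes the name-status text comes from something like `git show --name-status --format= <sha>` and the diff from `git show <sha>`.

```python
import subprocess

def git(repo: str, *args: str) -> subprocess.CompletedProcess:
    """Run a git command inside `repo` and capture its output as text."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True)

def sha_exists(repo: str, sha: str) -> bool:
    # `git cat-file -e` exits non-zero when the object is missing,
    # so a fabricated SHA is rejected before any other check runs.
    return git(repo, "cat-file", "-e", f"{sha}^{{commit}}").returncode == 0

# Status letters git prints in --name-status output (assumed mapping).
STATUS_LETTER = {"added": "A", "deleted": "D", "renamed": "R"}

def name_status_matches(name_status: str, kind: str, path: str) -> bool:
    """True if the name-status text records `kind` happening to `path`."""
    letter = STATUS_LETTER.get(kind)
    if letter is None:
        return False  # uncheckable evidence type: never counts as verified
    for line in name_status.splitlines():
        fields = line.split("\t")
        # Renames look like "R100<TAB>old<TAB>new"; accept either side.
        if fields[0].startswith(letter) and path in fields[1:]:
            return True
    return False

def diff_removes_line(diff: str, wanted: str) -> bool:
    """True if the unified diff deletes a line equal to `wanted`."""
    for line in diff.splitlines():
        # "-" marks a deleted line; "--- a/..." is the old-file header.
        if line.startswith("-") and not line.startswith(("--- a/", "--- /dev/null")):
            if line[1:].strip() == wanted.strip():
                return True
    return False
```

Because every function is either an exact git lookup or a pure string check, running it twice on the same repo and text gives the same answer, which is the reproducibility property claimed above.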

Provenance render. render_verified_adr drops unevidenced options, re-emits verified ones verbatim (citation comments preserved as audit trail), collapses to "No alternatives recorded" if none survive, stamps evidence-verified: true / verification: citation-structural into the YAML frontmatter (existing keys preserved).
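The frontmatter stamp can be illustrated with a small upsert helper. This is a sketch under assumptions, not the actual `render_verified_adr`: the function name `upsert_frontmatter` is hypothetical, and it only handles flat top-level `key: value` lines between `---` fences.

```python
def upsert_frontmatter(text: str, updates: dict[str, str]) -> str:
    """Set `updates` in a leading YAML frontmatter block, keeping other keys."""
    lines = text.split("\n")
    if lines and lines[0] == "---" and "---" in lines[1:]:
        end = lines.index("---", 1)          # closing fence
        head, body = lines[1:end], lines[end + 1:]
    else:
        head, body = [], lines               # no frontmatter yet: create one
    pending = dict(updates)
    out = []
    for line in head:
        key = line.split(":", 1)[0].strip()
        if not line.startswith((" ", "\t")) and key in pending:
            out.append(f"{key}: {pending.pop(key)}")  # overwrite in place
        else:
            out.append(line)                 # existing key preserved untouched
    out += [f"{k}: {v}" for k, v in pending.items()]  # append keys not yet present
    return "\n".join(["---", *out, "---", *body])
```

Overwriting in place and appending only missing keys makes the stamp idempotent, so re-running the verifier on an already-stamped ADR would not duplicate the provenance keys.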

Tag syntax check (deterministic). After @ADR tags are placed, files are parse-checked with each language's own compiler (Python/JS/Ruby/Shell/Go; TS/Java/Rust degrade to skipped).
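A minimal sketch of the dispatch-and-degrade behaviour, assuming dispatch by file extension. The name `check_syntax` and the `EXTERNAL` table are hypothetical; the commands are the ones listed in the commit message, except that Python is checked here with the built-in `compile()` rather than `py_compile`, to avoid writing `.pyc` files in the example.

```python
import shutil
import subprocess
from pathlib import Path

# External checkers, each invoked as `<cmd...> <file>` (extension mapping assumed).
EXTERNAL = {
    ".js": ["node", "--check"],
    ".rb": ["ruby", "-c"],
    ".sh": ["bash", "-n"],
    ".go": ["gofmt", "-e"],
}

def check_syntax(path: str) -> str:
    """Return "ok", "fail", or "skipped" for one tagged file."""
    suffix = Path(path).suffix
    if suffix == ".py":
        try:
            compile(Path(path).read_text(), path, "exec")  # parse only, never run
            return "ok"
        except (SyntaxError, ValueError):
            return "fail"
    cmd = EXTERNAL.get(suffix)
    # Unknown language or checker not installed: skip, never false-fail.
    if cmd is None or shutil.which(cmd[0]) is None:
        return "skipped"
    done = subprocess.run([*cmd, path], capture_output=True)
    return "ok" if done.returncode == 0 else "fail"
```

Returning a third state for unsupported files keeps a missing toolchain from being reported as a broken tag insertion.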

judge_options (LLM, N-run majority vote) is kept ONLY as an optional manual CLI escape hatch, explicitly OFF the verification path. It was demoted because an N-run majority vote measures sample variance between runs, not the judge's systematic over-inference from terse metadata.
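For reference, the N-run aggregation used by the off-path judge can be sketched as below. This is an assumption-laden illustration, not the real `judge_options`: the name `majority_verdict`, the return shape, and the tie rule are hypothetical. The `runner` argument stands in for the injected model call, so the aggregation can be exercised with no model at all.

```python
from collections import Counter
from typing import Callable, Optional

def majority_verdict(runner: Callable[[], Optional[str]], runs: int = 3) -> dict:
    """Call `runner` `runs` times and aggregate "pass"/"fail" verdicts.

    `runner` returns "pass", "fail", or None when a reply could not be parsed.
    """
    votes = [v for v in (runner() for _ in range(runs)) if v in ("pass", "fail")]
    if not votes:
        # Fail closed: no usable verdict means the options are not trusted.
        return {"overall": "fail", "agreement": 0.0, "valid_runs": 0}
    counts = Counter(votes)
    # A tie also fails closed: "pass" needs a strict majority (assumed rule).
    overall = "pass" if counts["pass"] > counts["fail"] else "fail"
    return {"overall": overall,
            "agreement": counts[overall] / len(votes),
            "valid_runs": len(votes)}
```

Note that `agreement` only reports how consistent the runs were with each other. A judge that over-infers in the same way on every run would score full agreement while still being wrong, which is the reason given for moving it off the verification path.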

CLIs: verify-adr [--write], verify-options, evidence, check-tags, judge-options.

Skill + docs

MADR template requires an inline citation per option (incl. removed:"line").
SKILL Step 4 runs verify-adr; Step 5 runs check-tags.
Downstream trust boundary documented: an evidence-verified: true ADR is to be trusted, NOT re-investigated — citations are an audit trail, not a re-check prompt.
README "Output verification" section + tag-coverage table.

Testability

Tier-1, evidence builder, frontmatter render, and the (off-path) LLM judge are unit-tested offline against the real synthetic fixtures; the verification path has no model to stub. tests/test_judge.py — 36 tests, never spawns claude. Full suite 47 passed, 1 xfailed. Verified end-to-end against the live CLI.

Done

deterministic citation verifier (incl. removed:"line" content check)
LLM taken off the verification path; kept as optional manual tool
provenance frontmatter render + drop/rewrite
wire into SKILL Step 4/5 + downstream trust-boundary
README + coverage docs
deterministic tag syntax check

Optional follow-ups (not blocking)

generation-time CoVe nudge ("can you cite a commit?") in the SKILL prompt
(deferred) NLI residue layer: only worth adding if we ever want to rescue genuinely semantic options instead of dropping them; it would break the stdlib-only property, so it is left for the future

🤖 Generated with Claude Code

First increment of the post-generation judge pass. Two independent checks that guard ADR *output* (analyze.py decides which candidates exist; this guards what the agent then writes):

- check_tag_syntax: DETERMINISTIC. After @adr comment tags are placed, the file must still parse. Dispatches to the language's own compiler (py_compile, node --check, ruby -c, bash -n, gofmt -e); unknown/absent checkers degrade to skipped, never false-fail.
- judge_options: LLM. Audits the "Considered Options" MADR section (the top hallucination risk) for whether each option is evidenced by the diff/messages. Runs N times, majority-vote overall, reports agreement as the variance signal. Can only DEMOTE unevidenced options, never invent, so it cannot drop correctness below the deterministic floor. Fails closed if all runs garble. Defaults to Haiku (cheap N-run verdict; JUDGE_MODEL override).

The model call is injected (runner callable), so prompt building, verdict parsing, and N-run aggregation are fully unit-tested offline: tests/test_judge.py (17 tests) never spawns claude. Verified end-to-end against the live CLI separately.

Relates to the LLM-judge follow-up from #2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- SKILL Step 4: run judge-options after generating an ADR with Considered Options; on fail, drop unevidenced options / fall back to "No alternatives recorded". Includes the evidence-file build commands.
- SKILL Step 5: run check-tags after placing tags; on fail, revert/move the insertion. States the fixed syntax-check language set explicitly.
- README: new "Output verification" subsection (the two judge checks) and a "Tag validation coverage" table. Python/JS/Ruby/Shell/Go are syntax-checked; TypeScript/Java/Kotlin/Rust get tags but are skipped by the checker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

- build_evidence(repo, shas) renders the judge's commit-evidence block (subject lines + name-status, where "instead of X" phrases and deletions of the replaced approach live) straight from a candidate's SHAs. Full hunks are omitted to keep the prompt dense.
- evidence_for_candidate(repo, candidate) pulls the SHAs from an analyze.py candidate dict, so the two scripts compose into one pipeline.
- new `judge.py evidence <repo> --commits ...` subcommand prints the block.
- `judge-options` gains --repo/--commits to build evidence automatically (--evidence file still supported as override).
- SKILL Step 4 now calls --commits instead of a hand-rolled `git show` combo.

Tests: build/compose against the real synthetic fixture repos (no model call). Suite: 31 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Implements the citation-grounded approach chosen after researching alternatives (NLI, claim-decomposition, RAG faithfulness, self-consistency).

Rationale: the git history is a CLOSED corpus, so "retrieve then verify" collapses to "cite then look up". Most checks become exact git lookups, and a fabricated SHA fails `git cat-file` immediately. This is a stronger and cheaper floor than the all-LLM N-run judge, whose majority vote measures variance, not the judge's systematic over-inference from terse metadata.

New deterministic layer in judge.py:

- inline citation grammar embedded as an HTML comment after each option: `* **Redis** `
- extract_cited_options() pairs each option with its citation
- verify_citation() checks SHA existence + name-status (added/deleted/renamed) or message-phrase presence; exact, no model call
- verify_options() returns per-option verdicts and an overall of pass / fail (failed or uncited) / review (fuzzy -> LLM residue)
- `judge.py verify-options` CLI subcommand

The existing LLM judge_options is now the tier-2 residue handler for the fuzzy minority, not the primary check. NLI as a future zero-model-call residue layer is noted but deferred (keeps the stdlib-only property).

Tests: 9 new, derived from real fixture commit data (verified/failed/fabricated-sha/uncited/fuzzy/message-phrase). Suite: 40 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Completes the citation-grounded pipeline:

- verify_adr(): runs tier-1 structural checks, routes only the `fuzzy` residue (real SHA, non-structural evidence type) to the LLM judge, fed the actual diff hunks (build_evidence gains include_patch). Returns kept/dropped outcomes + residue agreement.
- render_verified_adr(): drops failed/uncited/unevidenced options, re-emits verified ones verbatim with their hidden citation comments, collapses to "No alternatives recorded" if none survive, and upserts an evidence-verified provenance block into the YAML frontmatter (preserving existing keys).
- `judge.py verify-adr [--write]` CLI for the whole flow.

SKILL:

- MADR template now requires an inline citation comment per option.
- Step 4 calls verify-adr (two tiers) instead of the old judge-options.
- Adds the downstream trust-boundary: evidence-verified ADRs are to be trusted, not re-investigated.

README: Output-verification section rewritten for the two-tier citation model.

Tests: +6 (kept/dropped routing, fuzzy->residue, frontmatter upsert, all-dropped collapse, no-op on no-options). Suite: 46 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Per decision that citation grounding is sufficient: drop the LLM residue tier and the N-run variance plan entirely.

- Add `removed:"<line>"` evidence type: greps the cited commit's diff for a deleted line, covering "code replaced inside a modified file" (the one case that previously needed the LLM) deterministically.
- verify_adr no longer takes a runner/runs and never calls a model. Options that aren't `verified` (failed, uncited, or an uncheckable evidence type) are dropped. Fully reproducible: same repo + text -> same result.
- verify_options collapses to pass/fail (no review/fuzzy state).
- Provenance string is now `citation-structural`.
- judge_options (LLM, N-run majority vote) is kept ONLY as an optional manual CLI escape hatch, explicitly off the verification path, with a comment on why it was demoted (vote measures variance, not systematic over-inference).
- Citation detail now supports quoting to carry spaces/punctuation.

SKILL + README: document the single deterministic check and the `removed:"line"` citation; drop all tier-2/N-run language.

Tests: residue/runner tests replaced with deterministic ones (uncheckable drop, removed-line verify, reproducibility). Suite: 47 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jaditya8889 and others added 4 commits June 5, 2026 20:51

hailcpy force-pushed the llm-judge branch from c577219 to 231a154 Compare June 5, 2026 18:20

hailcpy changed the title ~~LLM-judge output-safety layer (WIP)~~ Citation-grounded output verification for ADRs Jun 5, 2026

hailcpy wants to merge 6 commits into main from llm-judge


2 participants
