
Citation-grounded output verification for ADRs #12

Draft
hailcpy wants to merge 6 commits into main from llm-judge


hailcpy (Owner) commented Jun 5, 2026

LLM-judge follow-up from #2, redesigned as fully deterministic citation-grounded verification after researching alternatives (NLI, claim-decomposition, RAG faithfulness, self-consistency). Key insight: the git history is a closed corpus, so "retrieve-then-verify" collapses to "cite-then-look-up" — every check is an exact git lookup, and a fabricated SHA fails git cat-file instantly. No model on the verification path.

Pipeline (scripts/judge.py)

Considered Options — deterministic. The generator cites each option inline (<!-- evidence: <sha> deleted:path -->). verify_adr checks each citation against git:

  • deleted / added / renamed → name-status; message: → commit message; removed:"line" → greps the commit's diff for that deleted line (covers code replaced inside a modified file — the one case that used to need an LLM).
  • Fabricated SHA / claim-that-didn't-happen / uncited / uncheckable evidence type → dropped. An option that can only be argued semantically is dropped, not adjudicated — for a retroactive ADR, if you can't point at the bytes it shouldn't be asserted.
  • Fully reproducible: same repo + text → same result.
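The lookup described in the bullets above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in scripts/judge.py: the helper names `sha_exists`, `name_status`, and `check_name_status` are hypothetical, and the only real interfaces used are `git cat-file -e` and `git show --name-status`.

```python
import subprocess

# Map citation evidence types to the status letters git prints.
STATUS_LETTER = {"added": "A", "deleted": "D", "renamed": "R"}


def sha_exists(repo: str, sha: str) -> bool:
    """True if `sha` names a commit in `repo`; a fabricated SHA fails here."""
    proc = subprocess.run(
        ["git", "-C", repo, "cat-file", "-e", f"{sha}^{{commit}}"],
        capture_output=True,
    )
    return proc.returncode == 0


def name_status(repo: str, sha: str) -> str:
    """Raw `status<TAB>path` lines for one commit (empty --format hides the header)."""
    proc = subprocess.run(
        ["git", "-C", repo, "show", "--name-status", "--format=", sha],
        capture_output=True, text=True, check=True,
    )
    return proc.stdout


def check_name_status(listing: str, kind: str, path: str) -> bool:
    """Does the listing record `path` with the status claimed by `kind`?"""
    letter = STATUS_LETTER.get(kind)
    if letter is None:
        return False  # uncheckable evidence type: treated as not verified
    for line in listing.splitlines():
        fields = line.split("\t")
        if len(fields) < 2 or not fields[0].startswith(letter):
            continue
        # Renames print "R100<TAB>old<TAB>new"; accept either side.
        if path in fields[1:]:
            return True
    return False
```

Keeping the comparison (`check_name_status`) separate from the two git calls means the comparison logic can be exercised on captured listings without a repository, while the SHA check stays a single exit-code test.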

Provenance render. render_verified_adr drops unevidenced options, re-emits verified ones verbatim (citation comments preserved as audit trail), collapses to "No alternatives recorded" if none survive, and stamps evidence-verified: true / verification: citation-structural into the YAML frontmatter (existing keys preserved).
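A minimal sketch of the frontmatter stamping step, assuming flat `key: value` frontmatter delimited by `---` lines. The function name `upsert_frontmatter` is hypothetical and the real render_verified_adr may handle more cases (nested YAML, for instance, is simply passed through here).

```python
def upsert_frontmatter(text: str, updates: dict[str, str]) -> str:
    """Set `key: value` pairs in a leading YAML block, keeping other keys."""
    lines = text.split("\n")
    if lines and lines[0] == "---" and "---" in lines[1:]:
        end = lines.index("---", 1)  # closing delimiter of the block
        head, body = lines[1:end], lines[end + 1:]
    else:
        head, body = [], lines  # no frontmatter yet: create one
    remaining = dict(updates)
    new_head = []
    for line in head:
        key = line.split(":", 1)[0].strip()
        if key in remaining:
            # Overwrite in place so key order is stable across runs.
            new_head.append(f"{key}: {remaining.pop(key)}")
        else:
            new_head.append(line)
    # Keys that were not already present are appended at the end.
    new_head.extend(f"{k}: {v}" for k, v in remaining.items())
    return "\n".join(["---", *new_head, "---", *body])
```

Because existing keys are rewritten in place and new ones appended, applying the same updates twice yields the same text, which matches the reproducibility goal stated above.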

Tag syntax check (deterministic). After @ADR tags are placed, files are parse-checked with each language's own compiler (Python/JS/Ruby/Shell/Go; TS/Java/Rust degrade to skipped).
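The dispatch could look something like the sketch below. The checker commands are the ones named in the commit message (py_compile, node --check, ruby -c, bash -n, gofmt -e); the `CHECKERS` table, the `check_syntax` name, and the extension mapping are assumptions made for illustration.

```python
import shutil
import subprocess
import sys
from pathlib import Path

# Extension -> command prefix; the file path is appended at call time.
CHECKERS = {
    ".py": [sys.executable, "-m", "py_compile"],
    ".js": ["node", "--check"],
    ".rb": ["ruby", "-c"],
    ".sh": ["bash", "-n"],
    ".go": ["gofmt", "-e"],
}


def check_syntax(path: str) -> str:
    """Return "ok", "fail", or "skipped" for one tagged file."""
    cmd = CHECKERS.get(Path(path).suffix)
    if cmd is None or shutil.which(cmd[0]) is None:
        # Unknown language or missing tool: degrade, never false-fail.
        return "skipped"
    proc = subprocess.run(cmd + [path], capture_output=True)
    return "ok" if proc.returncode == 0 else "fail"
```

Treating a missing checker the same as an unknown language keeps the check from failing a file merely because a toolchain is not installed on the machine running it.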

judge_options (LLM, N-run majority vote) is kept ONLY as an optional manual CLI escape hatch — explicitly OFF the verification path. It was demoted because a model verdict over terse metadata measures sample variance, not the judge's systematic over-inference.
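For reference, an N-run majority vote with an injected runner can be sketched as below. The name `majority_verdict`, the default of 5 runs, the "pass"/"fail" vocabulary, and resolving ties to "fail" are all assumptions of this sketch rather than details confirmed by the PR.

```python
from collections import Counter
from typing import Callable, Optional


def majority_verdict(
    runner: Callable[[str], Optional[str]], prompt: str, runs: int = 5
) -> dict:
    """Call `runner` `runs` times and aggregate "pass"/"fail" verdicts."""
    votes = []
    for _ in range(runs):
        answer = runner(prompt)
        if answer in ("pass", "fail"):
            votes.append(answer)  # garbled or missing answers are ignored
    if not votes:
        return {"overall": "fail", "agreement": 0.0}  # fail closed
    counts = Counter(votes)
    # Ties resolve to "fail" (an assumption; the conservative choice).
    overall = "pass" if counts["pass"] > counts["fail"] else "fail"
    return {"overall": overall, "agreement": counts[overall] / len(votes)}
```

The sketch also shows the limitation that motivated the demotion: a runner that is consistently wrong returns the same answer every time, so agreement reads 1.0 even though the verdict is incorrect.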

CLIs: verify-adr [--write], verify-options, evidence, check-tags, judge-options.

Skill + docs

  • MADR template requires an inline citation per option (incl. removed:"line").
  • SKILL Step 4 runs verify-adr; Step 5 runs check-tags.
  • Downstream trust boundary documented: an evidence-verified: true ADR is to be trusted, NOT re-investigated — citations are an audit trail, not a re-check prompt.
  • README "Output verification" section + tag-coverage table.

Testability

Tier-1, evidence builder, frontmatter render, and the (off-path) LLM judge are unit-tested offline against the real synthetic fixtures; the verification path has no model to stub. tests/test_judge.py: 36 tests, never spawns claude. Full suite: 47 passed, 1 xfailed. Verified end-to-end against the live CLI.

Done

  • deterministic citation verifier (incl. removed:"line" content check)
  • LLM taken off the verification path; kept as optional manual tool
  • provenance frontmatter render + drop/rewrite
  • wire into SKILL Step 4/5 + downstream trust-boundary
  • README + coverage docs
  • deterministic tag syntax check

Optional follow-ups (not blocking)

  • generation-time CoVe nudge ("can you cite a commit?") in the SKILL prompt
  • (deferred) NLI residue layer: only if we ever want to rescue genuinely-semantic options without dropping them. It would break the stdlib-only property, so it is left for later.

🤖 Generated with Claude Code

jaditya8889 and others added 4 commits June 5, 2026 20:51
First increment of the post-generation judge pass. Two independent checks
that guard ADR *output* (analyze.py decides which candidates exist; this
guards what the agent then writes):

- check_tag_syntax: DETERMINISTIC. After @adr comment tags are placed, the
  file must still parse. Dispatches to the language's own compiler
  (py_compile, node --check, ruby -c, bash -n, gofmt -e); unknown/absent
  checkers degrade to skipped, never false-fail.

- judge_options: LLM. Audits the "Considered Options" MADR section (the
  top hallucination risk) for whether each option is evidenced by the diff/
  messages. Runs N times, majority-vote overall, reports agreement as the
  variance signal. Can only DEMOTE unevidenced options, never invent, so it
  cannot drop correctness below the deterministic floor. Fails closed if all
  runs garble. Defaults to Haiku (cheap N-run verdict; JUDGE_MODEL override).

The model call is injected (runner callable), so prompt building, verdict
parsing, and N-run aggregation are fully unit-tested offline — tests/
test_judge.py (17 tests) never spawns claude. Verified end-to-end against
the live CLI separately.

Relates to the LLM-judge follow-up from #2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- SKILL Step 4: run judge-options after generating an ADR with Considered
  Options; on fail, drop unevidenced options / fall back to "No alternatives
  recorded". Includes the evidence-file build commands.
- SKILL Step 5: run check-tags after placing tags; on fail, revert/move the
  insertion. States the fixed syntax-check language set explicitly.
- README: new "Output verification" subsection (the two judge checks) and a
  "Tag validation coverage" table — Python/JS/Ruby/Shell/Go are syntax-checked;
  TypeScript/Java/Kotlin/Rust get tags but are skipped by the checker.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
build_evidence(repo, shas) renders the judge's commit-evidence block
(subject lines + name-status, where "instead of X" phrases and deletions
of the replaced approach live) straight from a candidate's SHAs. Full
hunks are omitted to keep the prompt dense.

- evidence_for_candidate(repo, candidate) pulls the SHAs from an analyze.py
  candidate dict, so the two scripts compose into one pipeline.
- new `judge.py evidence <repo> --commits ...` subcommand prints the block.
- `judge-options` gains --repo/--commits to build evidence automatically
  (--evidence file still supported as override).
- SKILL Step 4 now calls --commits instead of a hand-rolled `git show` combo.

Tests: build/compose against the real synthetic fixture repos (no model
call). Suite: 31 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements the citation-grounded approach chosen after researching
alternatives (NLI, claim-decomposition, RAG faithfulness, self-consistency).
Rationale: the git history is a CLOSED corpus, so "retrieve then verify"
collapses to "cite then look up" — most checks become exact git lookups, and
a fabricated SHA fails `git cat-file` immediately. This is a stronger and
cheaper floor than the all-LLM N-run judge, whose majority vote measures
variance, not the judge's systematic over-inference from terse metadata.

New deterministic layer in judge.py:
- inline citation grammar embedded as an HTML comment after each option:
  `* **Redis** <!-- evidence: <sha> deleted:path -->`
- extract_cited_options() pairs each option with its citation
- verify_citation() checks SHA existence + name-status (added/deleted/renamed)
  or message-phrase presence — exact, no model call
- verify_options() returns per-option verdicts and an overall of
  pass / fail (failed or uncited) / review (fuzzy -> LLM residue)
- `judge.py verify-options` CLI subcommand
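As an illustration of the citation grammar, the sketch below pairs each bulleted option with its inline evidence comment. It is a stand-in, not the actual extract_cited_options: the name `parse_cited_options`, the regexes, and the 7 to 40 hex-digit SHA assumption are mine, and quoted details are included because a later commit in this PR adds them.

```python
import re

# <!-- evidence: <sha> kind:detail -->, where detail may be double-quoted.
CITATION = re.compile(
    r"<!--\s*evidence:\s*(?P<sha>[0-9a-fA-F]{7,40})\s+"
    r"(?P<kind>[a-z]+):(?P<detail>\"[^\"]*\"|\S+)\s*-->"
)
# A Markdown bullet ("* " or "- ") followed by the option text.
OPTION = re.compile(r"^\s*[*-]\s+(?P<text>.+)$")


def parse_cited_options(section: str) -> list[dict]:
    """Return one dict per bulleted option, with its citation or None."""
    options = []
    for line in section.splitlines():
        m = OPTION.match(line)
        if not m:
            continue
        text = m.group("text")
        c = CITATION.search(text)
        citation = None
        if c:
            citation = {
                "sha": c.group("sha"),
                "kind": c.group("kind"),
                "detail": c.group("detail").strip('"'),
            }
        # The visible label is the bullet text minus the hidden comment.
        options.append({"option": CITATION.sub("", text).strip(), "citation": citation})
    return options
```

An option whose `citation` is `None` is the "uncited" case, which the verifier then treats as a failure rather than something to adjudicate.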

The existing LLM judge_options is now the tier-2 residue handler for the
fuzzy minority, not the primary check. NLI as a future zero-model-call
residue layer is noted but deferred (keeps the stdlib-only property).

Tests: 9 new, derived from real fixture commit data (verified/failed/
fabricated-sha/uncited/fuzzy/message-phrase). Suite: 40 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes the citation-grounded pipeline:

- verify_adr(): runs tier-1 structural checks, routes only the `fuzzy`
  residue (real SHA, non-structural evidence type) to the LLM judge, fed the
  actual diff hunks (build_evidence gains include_patch). Returns kept/dropped
  outcomes + residue agreement.
- render_verified_adr(): drops failed/uncited/unevidenced options, re-emits
  verified ones verbatim with their hidden citation comments, collapses to
  "No alternatives recorded" if none survive, and upserts an
  evidence-verified provenance block into the YAML frontmatter (preserving
  existing keys).
- `judge.py verify-adr [--write]` CLI for the whole flow.

SKILL:
- MADR template now requires an inline `<!-- evidence: sha type:detail -->`
  citation per option.
- Step 4 calls verify-adr (two tiers) instead of the old judge-options.
- Adds the downstream trust-boundary: evidence-verified ADRs are to be
  trusted, not re-investigated.
README: Output-verification section rewritten for the two-tier citation model.

Tests: +6 (kept/dropped routing, fuzzy->residue, frontmatter upsert, all-
dropped collapse, no-op on no-options). Suite: 46 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
hailcpy changed the title from "LLM-judge output-safety layer (WIP)" to "Citation-grounded output verification for ADRs" on Jun 5, 2026
Per decision that citation grounding is sufficient: drop the LLM residue
tier and the N-run variance plan entirely.

- Add `removed:"<line>"` evidence type: greps the cited commit's diff for a
  deleted line, covering "code replaced inside a modified file" — the one
  case that previously needed the LLM — deterministically.
- verify_adr no longer takes a runner/runs and never calls a model. Options
  that aren't `verified` (failed, uncited, or an uncheckable evidence type)
  are dropped. Fully reproducible: same repo + text -> same result.
- verify_options collapses to pass/fail (no review/fuzzy state).
- Provenance string is now `citation-structural`.
- judge_options (LLM, N-run majority vote) is kept ONLY as an optional manual
  CLI escape hatch, explicitly off the verification path, with a comment on
  why it was demoted (vote measures variance, not systematic over-inference).
- Citation detail now supports quoting to carry spaces/punctuation.
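The `removed:"<line>"` check can be pictured as a scan over the cited commit's unified diff. The sketch below is an assumption about how such a check might work, not the code in judge.py: `diff_removes_line` is a hypothetical name, and it compares whole lines after stripping surrounding whitespace, whereas the commit message only says the diff is grepped.

```python
def diff_removes_line(diff_text: str, line: str) -> bool:
    """True if a unified diff deletes a line whose content matches `line`."""
    wanted = line.strip()
    for raw in diff_text.splitlines():
        # "-" marks a deleted line; "---" is the old-file header, not content.
        if raw.startswith("-") and not raw.startswith("---"):
            if raw[1:].strip() == wanted:
                return True
    return False
```

One known gap in this simplification: a deleted line whose own content begins with `--` (an SQL comment, for example) appears in the diff as `---...` and is skipped along with the file headers.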

SKILL + README: document the single deterministic check and the
`removed:"line"` citation; drop all tier-2/N-run language.

Tests: residue/runner tests replaced with deterministic ones (uncheckable
drop, removed-line verify, reproducibility). Suite: 47 passed, 1 xfailed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>