Evaluator divergence: comply suite and CLI runner produce materially different grades on the same agent

## Summary

The AAO conformance suite and the `@adcp/client` CLI runner — both AAO-owned, both grading against the same storyboard corpus — report materially different verdicts when run against the same agent at the same time with the same auth. Reported in [adcontextprotocol/adcp#4419](https://github.com/adcontextprotocol/adcp/issues/4419):

| Evaluator | Score | Pass rate |
|---|---|---|
| Comply suite (runner token) | 63/128 | 49% |
| `storyboard run` CLI | 45/59 (steps_passed/total) | 76% |

Same agent (`https://adcp.bidmachine.io/adcp/mcp` master commit Wave 23.20.9), same auth, same window. 27-point delta.

> ## 📌 Update 2026-05-12: original divergence resolved; issue narrowed to regression guard
>
> The 27-point delta turned out to be **version-driven**, not evaluator-driven. Both surfaces run on the same underlying SDK; the evaluators just happened to be running different `@adcp/sdk` versions, and one of them tripped the `no_secret_echo` over-rejection (adcp-client#1713 / PR #1714), a check whose scan had grown too broad in the newer version.
>
> Fixes for both root causes shipped in `@adcp/sdk@7.1.0`:
>
> - **adcp-client#1713** (PR #1714, merged `110b49e1`) — `findSecretEcho` now requires a suspect name AND a non-empty string value; structured `authorization` payloads pass through. This was the BidMachine unblocker.
> - **adcp-client#1709** (PR #1712, merged `36d279f5`) — Zod-reject error attribution; failures now correctly resolve to `response_schema` instead of falling through to the next invariant.
>
> BidMachine retested on `7.1.0` ([receipt comment](https://github.com/adcontextprotocol/adcp-client/issues/1711)): 47/59 ≈ 80% on `media_buy_seller`, 26/37 ≈ 70% on `sales_non_guaranteed`. Both surfaces now converge. The three remaining failures are clean seller-side issues they've already triaged.
>
> Status of the two underlying buckets I originally enumerated as contributing causes:
>
> - **"Zod `.strict()` codegen lag"** — was never the cause. Verified the published @adcp/sdk tarballs 5.x / 6.x / 7.x all use `.passthrough()` on `SyncAccountsResponseSchema.accounts[]`. Removed from #1707's scope.
> - **"Failure attribution differences"** — fixed by #1709.
>
> The third bucket (build-time schema pin, #1707) is real cosmetic-URL drift, but it is bounded by the `additionalProperties: true` permissiveness on most response schemas and doesn't drive verdict divergence today. Held pending adopter demand.
>
> **Narrowed scope for this issue**: a small in-SDK parity smoke test that catches the *class* of bug we just chased — when `comply()` and `runStoryboard()` disagree on the verdict for the same fixture. See Recommended fix shape below.

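As a rough illustration of the post-#1713 `no_secret_echo` rule described in the update (flag only when the field name is suspect AND the value is a non-empty string), here is a minimal sketch. It is not the SDK's actual `findSecretEcho`: the function name `findSecretEchoSketch`, the `SUSPECT_NAMES` list, the path format, and the choice to recurse into structured values are all assumptions made for illustration.

```typescript
// Hypothetical sketch of the post-#1713 rule; NOT the SDK's real implementation.
// SUSPECT_NAMES is an assumed list chosen for illustration only.
const SUSPECT_NAMES = new Set(['authorization', 'api_key', 'token', 'secret']);

interface EchoHit {
  path: string; // location of the offending field, e.g. "$.accounts[0].authorization"
  name: string; // the suspect field name that matched
}

function findSecretEchoSketch(value: unknown, path = '$'): EchoHit[] {
  if (Array.isArray(value)) {
    // Walk each array element, extending the path with its index.
    return value.flatMap((item, i) => findSecretEchoSketch(item, `${path}[${i}]`));
  }
  if (value === null || typeof value !== 'object') {
    // Primitives carry no field name of their own, so they cannot be flagged here.
    return [];
  }
  const hits: EchoHit[] = [];
  for (const [name, child] of Object.entries(value as Record<string, unknown>)) {
    const childPath = `${path}.${name}`;
    const suspect = SUSPECT_NAMES.has(name.toLowerCase());
    if (suspect && typeof child === 'string' && child.length > 0) {
      // Both conditions hold: suspect name AND non-empty string value.
      hits.push({ path: childPath, name });
    } else {
      // Structured or empty values pass through; keep scanning their children.
      hits.push(...findSecretEchoSketch(child, childPath));
    }
  }
  return hits;
}

// A structured `authorization` object passes; a raw credential string is flagged.
const structured = { accounts: [{ authorization: { type: 'oauth', scopes: ['read'] } }] };
const leaked = { accounts: [{ authorization: 'Bearer abc123' }] };
console.log(findSecretEchoSketch(structured).length); // 0
console.log(findSecretEchoSketch(leaked).map((h) => h.path)); // [ '$.accounts[0].authorization' ]
```

Requiring both conditions is what lets a structured validation-result object named `authorization` through while still catching a raw credential string under the same name.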
## Why this is the load-bearing issue (worse than either underlying bug)

~~Two AAO-owned evaluators producing this big a disagreement on the same wire means:~~

**Update (above):** the disagreement was version-driven, not evaluator-driven, and fixes for both root causes shipped in 7.1.0. Both evaluators now converge. The argument below is preserved for context, but the urgency framing no longer applies.

~~1. **Adopters can't trust either score.** BidMachine ran 10+ deploys chasing the comply-suite verdict; their CLI runner verdict would have pointed at a different (and partially different) set of root causes.~~
~~2. **The badge claim doesn't anchor to a single contract.** Whatever the next adopter sees on the dashboard is whichever evaluator the dashboard happens to consult. They can grade the same passing run as failing depending on which surface answered.~~
~~3. **Each underlying bug (Zod `.strict()`, build-time schema pin in #1707, codegen regen lag, error-attribution flattening, etc.) hides behind the divergence.** Even after #1707 + the strict→passthrough flip land, there's no parity check ensuring the two surfaces stay in sync as one gets fixed before the other.~~

The remaining real value: **a regression guard so a future change can't reintroduce the divergence we just spent five PRs collapsing.**

## Recommended fix shape (narrowed)

A focused in-SDK parity smoke test, ~50-100 LOC + a fixture file:

1. **Fixture: a mock agent that emits a BidMachine-shaped response** — passthrough-tolerated `authorization` field on `sync_accounts.accounts[]`, a structured `authorization` validation-result object somewhere, no actual credential leakage. Pure JSON / inline TS fixture, no live network.
2. **Test: same fixture × both surfaces.**
   - Run `runStoryboard(fixtureAgentUrl, sb, options)` against each universal storyboard once.
   - Run `comply(fixtureAgentUrl, options)` once.
   - Assert that `comply().failures` agrees exactly with the per-storyboard `runStoryboard().validations` failures: same `failures.length`, same set of `(storyboard_id, step_id, validation.check)` tuples.
3. **Test: regression cases for the bugs we just fixed.**
   - A response with `authorization: { type: 'oauth', ... }` MUST pass `no_secret_echo`. Both surfaces.
   - A response that triggers a Zod reject MUST attribute to `response_schema`, NOT to a downstream invariant. Both surfaces.
   - A capability-unsupported storyboard MUST skip identically through `comply()` and `runStoryboard()`.
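The tuple comparison in step 2 could be sketched roughly as follows. This is a pure helper that only compares two lists of failure records; it does not call `comply()` or `runStoryboard()`, whose real result shapes may differ. The `FailureRecord` type, the `diffFailures` name, and the `sync_accounts` step id in the usage lines are hypothetical.

```typescript
// Hypothetical failure record. Field names mirror the
// (storyboard_id, step_id, validation.check) tuple from the text; the real
// result shapes of comply() and runStoryboard() are not assumed here.
interface FailureRecord {
  storyboard_id: string;
  step_id: string;
  check: string;
}

interface ParityDiff {
  lengthsMatch: boolean;      // same failures.length on both surfaces
  onlyInComply: string[];     // tuples reported by comply() only
  onlyInStoryboard: string[]; // tuples reported by runStoryboard() only
}

// Serialize the tuple so it can be used as a Set member.
const tupleKey = (f: FailureRecord): string =>
  JSON.stringify([f.storyboard_id, f.step_id, f.check]);

function diffFailures(
  complyFailures: FailureRecord[],
  storyboardFailures: FailureRecord[],
): ParityDiff {
  const complyKeys = new Set(complyFailures.map(tupleKey));
  const storyboardKeys = new Set(storyboardFailures.map(tupleKey));
  return {
    lengthsMatch: complyFailures.length === storyboardFailures.length,
    onlyInComply: Array.from(complyKeys).filter((k) => !storyboardKeys.has(k)).sort(),
    onlyInStoryboard: Array.from(storyboardKeys).filter((k) => !complyKeys.has(k)).sort(),
  };
}

// Parity holds only when the lengths match and neither side has extra tuples.
function isParity(diff: ParityDiff): boolean {
  return diff.lengthsMatch && diff.onlyInComply.length === 0 && diff.onlyInStoryboard.length === 0;
}

// Usage with made-up records: one surface over-rejects on no_secret_echo.
const fromComply: FailureRecord[] = [
  { storyboard_id: 'media_buy_seller', step_id: 'sync_accounts', check: 'no_secret_echo' },
];
const fromStoryboard: FailureRecord[] = [];
const diff = diffFailures(fromComply, fromStoryboard);
console.log(isParity(diff)); // false
console.log(diff.onlyInComply); // [ '["media_buy_seller","sync_accounts","no_secret_echo"]' ]
```

Reporting the two one-sided lists, rather than a single boolean, means a failing smoke test names the exact tuple on which the surfaces disagree.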

Specifically NOT in scope (would be a separate effort):

- **Shared assertion engine** — the two surfaces already share the underlying `runValidations` / `runAssertions` library. A "shared engine" refactor would be solving a problem we don't have evidence for; the smoke test catches divergence whether or not the engines are formally unified.
- **Single source of truth for schema set per run** — that's #1707's deliverable 7. Tracked separately.
- **Cross-version drift** (one evaluator on @adcp/sdk@5.x, the other on 7.x) — the smoke test runs in a single CI process, so by construction it tests one version against itself. Cross-version drift is what `COMPATIBLE_ADCP_VERSIONS` (#1706, merged) and #1707 (parked) address.

## Refs

- adcontextprotocol/adcp#4419 — original BidMachine report with the side-by-side data
- adcp-client#1711 — fgranata's report and retest confirmation
- adcp-client#1713 / PR #1714 — actual root cause of the original divergence; merged into 7.1.0
- adcp-client#1709 / PR #1712 — error-attribution fix; merged into 7.1.0
- adcp-client#1707 — build-time schema pin (scope-corrected; parked for adopter demand)
- adcp-client#1676–#1680 — earlier filed runner-side issues (account fabrication, package shape, error preservation, not_applicable grading) — pattern of the comply harness path silently diverging from spec behaviour, addressed in 7.0.0

