Summary
The AAO conformance suite and the @adcp/client CLI runner — both AAO-owned, both grading against the same storyboard corpus — report materially different verdicts when run against the same agent at the same time with the same auth. Reported in adcontextprotocol/adcp#4419:
| Evaluator | Score | Pass rate |
| --- | --- | --- |
| Comply suite (runner token) | 63/128 | 49% |
| `storyboard run` CLI | 45/59 (steps_passed/total) | 76% |
Same agent (https://adcp.bidmachine.io/adcp/mcp master commit Wave 23.20.9), same auth, same window. 27-point delta.
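For reference, the 27-point figure is the difference between the two rounded pass rates reported above. The sketch below only reproduces that arithmetic; `passRate` is a hypothetical helper, not part of the SDK. The two evaluators also report against different denominators (128 vs 59), so the delta compares rates, not raw counts.

```typescript
// Pass rate as a whole-number percentage, rounded to the nearest point.
// Hypothetical helper for illustration only; not an SDK function.
function passRate(passed: number, total: number): number {
  if (total <= 0) throw new RangeError("total must be positive");
  return Math.round((passed / total) * 100);
}

// Scores as reported in this issue.
const complySuite = passRate(63, 128); // comply suite (runner token)
const cliRunner = passRate(45, 59);    // storyboard run CLI (steps_passed/total)

// Delta in percentage points between the two evaluators.
const deltaPoints = cliRunner - complySuite;

console.log({ complySuite, cliRunner, deltaPoints });
```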
📌 Update 2026-05-12: original divergence resolved; issue narrowed to regression guard
The 27-point delta turned out to be version-driven, not evaluator-driven. Both surfaces run on the same underlying SDK; the evaluators just happened to be on different @adcp/sdk versions, and one of them tripped the no_secret_echo over-rejection (adcp-client#1713 / PR #1714), a check the newer version had grown to overscan.
Root causes both shipped in @adcp/sdk@7.1.0:
BidMachine retested on 7.1.0 (receipt comment): 47/59 ≈ 80% on media_buy_seller, 26/37 ≈ 70% on sales_non_guaranteed. Both surfaces now converge. The three remaining failures are clean seller-side issues they've already triaged.
The two underlying buckets I originally enumerated as contributing causes:
The third bucket (build-time schema pin, #1707) is real but cosmetic URL drift, bounded by the `additionalProperties: true` permissiveness of most response schemas; it doesn't drive verdict divergence today. Held pending adopter demand.
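To make the permissiveness point concrete, here is a minimal, dependency-free sketch of the difference between a strict object check and a passthrough one. It does not use Zod or the SDK's validators; `checkKeys`, the `Mode` type, and the `account_id` / `status` field names are assumptions made up for illustration.

```typescript
// Two validation modes, loosely mirroring strict vs passthrough object schemas.
type Mode = "strict" | "passthrough";

interface CheckResult {
  ok: boolean;
  unknownKeys: string[];
}

// Hypothetical helper: compares an object's keys against the keys a schema knows.
// "strict" rejects any unknown key; "passthrough" tolerates them, which is the
// behaviour `additionalProperties: true` allows in JSON Schema.
function checkKeys(
  value: Record<string, unknown>,
  known: ReadonlySet<string>,
  mode: Mode,
): CheckResult {
  const unknownKeys = Object.keys(value).filter((k) => !known.has(k));
  return { ok: mode === "passthrough" || unknownKeys.length === 0, unknownKeys };
}

// Hypothetical account entry carrying an extra `authorization` field.
const account = {
  account_id: "acct_1",
  status: "active",
  authorization: { type: "oauth" },
};
const knownKeys = new Set(["account_id", "status"]);

const strictResult = checkKeys(account, knownKeys, "strict");
const passthroughResult = checkKeys(account, knownKeys, "passthrough");

console.log(strictResult, passthroughResult);
```

Under these assumptions the same payload is rejected in strict mode and accepted in passthrough mode, which is the kind of gap that lets two surfaces on different validator settings disagree on one wire response.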
Narrowed scope for this issue: a small in-SDK parity smoke test that catches the class of bug we just chased — when comply() and runStoryboard() disagree on the verdict for the same fixture. See Recommended fix shape below.
Why this is the load-bearing issue (worse than either underlying bug)
Update (above): the disagreement was version-driven, not evaluator-driven, and both root causes shipped in 7.1.0. Both evaluators now converge. The argument below is preserved for context but the urgency framing no longer applies.
Two AAO-owned evaluators producing this big a disagreement on the same wire means:
1. Adopters can't trust either score. BidMachine ran 10+ deploys chasing the comply-suite verdict; their CLI runner verdict would have pointed at a different (only partially overlapping) set of root causes.
2. The badge claim doesn't anchor to a single contract. Whatever the next adopter sees on the dashboard is whichever evaluator the dashboard happens to consult. The same run can be graded as passing or failing depending on which surface answered.
3. Each underlying bug (Zod .strict(), build-time schema pin in #1707, codegen regen lag, error-attribution flattening, etc.) hides behind the divergence. Even after #1707 + the strict→passthrough flip land, there's no parity check ensuring the two surfaces stay in sync as one gets fixed before the other.
The remaining real value: a regression guard so a future change can't reintroduce the divergence we just spent five PRs collapsing.
Recommended fix shape (narrowed)
A focused in-SDK parity smoke test, ~50-100 LOC + a fixture file:
- Fixture: a mock agent that emits a BidMachine-shaped response: a passthrough-tolerated `authorization` field on `sync_accounts.accounts[]`, a structured `authorization` validation-result object somewhere, no actual credential leakage. Pure JSON / inline TS fixture, no live network.
- Test: same fixture × both surfaces.
  - Run `runStoryboard(fixtureAgentUrl, sb, options)` against each universal storyboard once.
  - Run `comply(fixtureAgentUrl, options)` once.
  - Assert `comply().failures` is a subset of (and consistent with) the per-storyboard `runStoryboard().validations` failures. Concretely: same `failures.length`, same set of `(storyboard_id, step_id, validation.check)` tuples.
- Test: regression cases for the bugs we just fixed.
  - A response with `authorization: { type: 'oauth', ... }` MUST pass `no_secret_echo`. Both surfaces.
  - A response that triggers a Zod reject MUST attribute to `response_schema`, NOT to a downstream invariant. Both surfaces.
  - A capability-unsupported storyboard MUST skip identically through `comply()` and `runStoryboard()`.
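The parity assertion described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the SDK's code: the `Failure` shape, `diffFailures`, and `assertParity` are hypothetical names, and the real result types of `comply()` and `runStoryboard()` are not shown in this issue, so a real test would first map their outputs into this tuple shape.

```typescript
// Hypothetical normalized failure record; the real SDK result types may differ.
interface Failure {
  storyboard_id: string;
  step_id: string;
  check: string; // corresponds to validation.check
}

interface ParityDiff {
  onlyInComply: string[];
  onlyInRunner: string[];
}

// Stable string key for a (storyboard_id, step_id, check) tuple.
const tupleKey = (f: Failure): string =>
  JSON.stringify([f.storyboard_id, f.step_id, f.check]);

// Set difference in both directions between the two surfaces' failures.
function diffFailures(complyFailures: Failure[], runnerFailures: Failure[]): ParityDiff {
  const complyKeys = new Set(complyFailures.map(tupleKey));
  const runnerKeys = new Set(runnerFailures.map(tupleKey));
  return {
    onlyInComply: Array.from(complyKeys).filter((k) => !runnerKeys.has(k)).sort(),
    onlyInRunner: Array.from(runnerKeys).filter((k) => !complyKeys.has(k)).sort(),
  };
}

// Throws when the surfaces disagree: different counts or different tuple sets.
function assertParity(complyFailures: Failure[], runnerFailures: Failure[]): void {
  const diff = diffFailures(complyFailures, runnerFailures);
  const sameLength = complyFailures.length === runnerFailures.length;
  if (!sameLength || diff.onlyInComply.length > 0 || diff.onlyInRunner.length > 0) {
    throw new Error(`verdict divergence: ${JSON.stringify(diff)}`);
  }
}

// Usage with made-up step ids: identical failure sets pass silently.
const fromComply: Failure[] = [
  { storyboard_id: "media_buy_seller", step_id: "step_a", check: "response_schema" },
];
const fromRunner: Failure[] = [
  { storyboard_id: "media_buy_seller", step_id: "step_a", check: "response_schema" },
];
assertParity(fromComply, fromRunner);
```

Reporting the two-way difference instead of a bare boolean keeps the failure message actionable: it names which tuples only one surface flagged.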
Specifically NOT in scope (would be a separate effort):
- [...] `runValidations`/`runAssertions` library. A "shared engine" refactor would be solving a problem we don't have evidence for; the smoke test catches divergence whether or not the engines are formally unified.
- [...] `adcp_version` #1707's deliverable 7. Tracked separately.
- [...] `COMPATIBLE_ADCP_VERSIONS` ("fix(version): auto-derive COMPATIBLE_ADCP_VERSIONS from ADCP_VERSION pin" #1706, merged) and "Schema URLs and Zod validators are baked at SDK build time: drift from agent's advertised `adcp_version`" #1707 (parked) address.

Refs