Evaluator divergence: comply suite and CLI runner produce materially different grades on the same agent #1708

@bokelley

Summary

The AAO conformance suite and the @adcp/client CLI runner — both AAO-owned, both grading against the same storyboard corpus — report materially different verdicts when run against the same agent at the same time with the same auth. Reported in adcontextprotocol/adcp#4419:

| Evaluator | Score | Pass rate |
| --- | --- | --- |
| Comply suite (runner token) | 63/128 | 49% |
| storyboard run CLI | 45/59 (steps_passed/total) | 76% |

Same agent (https://adcp.bidmachine.io/adcp/mcp master commit Wave 23.20.9), same auth, same window. 27-point delta.
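
For reference, the arithmetic behind the 27-point figure can be reproduced with a few lines. This is only a sketch: the `Score` shape and the `passRate` helper are illustrative names, not part of any SDK, and the numbers are the ones reported above. Note that the two denominators differ (128 vs. 59), so the two percentages may not be counting the same unit.

```typescript
// Illustrative shape only; not an SDK type.
interface Score {
  evaluator: string;
  passed: number;
  total: number;
}

// Pass rate as a whole-number percentage.
function passRate(score: Score): number {
  return Math.round((score.passed / score.total) * 100);
}

// Figures as reported in this issue.
const complySuite: Score = { evaluator: "comply suite", passed: 63, total: 128 };
const cliRunner: Score = { evaluator: "storyboard run CLI", passed: 45, total: 59 };

// Difference between the two evaluators, in percentage points.
const deltaPoints = passRate(cliRunner) - passRate(complySuite);
```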

📌 Update 2026-05-12: original divergence resolved; issue narrowed to regression guard

The 27-point delta turned out to be version-driven, not evaluator-driven. Both surfaces run on the same underlying SDK; the two evaluators just happened to be running different @adcp/sdk versions, and one of them tripped the no_secret_echo over-rejection (adcp-client#1713 / PR #1714), a check that the newer version had grown to overscan.
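
To make "over-rejection" concrete, here is a hypothetical sketch of how a secret-echo check can overscan. It is not the @adcp/sdk implementation of no_secret_echo: the function names, the sample payloads, and the key-name heuristic are all assumptions chosen for illustration. The first check rejects on a field name alone; the second rejects only when the secret value itself appears in the response.

```typescript
// Hypothetical illustration only; not the @adcp/sdk implementation of no_secret_echo.

// Collect every string value found in a JSON-like structure.
function collectStrings(value: unknown, out: string[]): string[] {
  if (typeof value === "string") {
    out.push(value);
  } else if (Array.isArray(value)) {
    for (let i = 0; i < value.length; i++) {
      collectStrings(value[i], out);
    }
  } else if (value !== null && typeof value === "object") {
    const record = value as { [key: string]: unknown };
    const keys = Object.keys(record);
    for (let i = 0; i < keys.length; i++) {
      collectStrings(record[keys[i]], out);
    }
  }
  return out;
}

// Over-broad variant: rejects on a suspicious key name, whatever the value is.
function flagsByKeyName(value: unknown): boolean {
  if (Array.isArray(value)) {
    return value.some(function (item) { return flagsByKeyName(item); });
  }
  if (value !== null && typeof value === "object") {
    const record = value as { [key: string]: unknown };
    return Object.keys(record).some(function (key) {
      return key === "authorization" || flagsByKeyName(record[key]);
    });
  }
  return false;
}

// Narrower variant: rejects only when the secret itself appears in a string value.
function echoesSecret(value: unknown, secret: string): boolean {
  return collectStrings(value, []).some(function (text) {
    return text.indexOf(secret) !== -1;
  });
}

// Made-up payloads: one with a benign authorization object, one that leaks the token.
const secret = "test-secret-123";
const benign = { accounts: [{ account_id: "acct_1", authorization: { type: "oauth" } }] };
const leaky = { debug: "received Bearer test-secret-123" };
```

With these payloads the key-name check rejects the benign response and misses the leaky one, while the value-based check does the opposite.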

Fixes for both root causes shipped in @adcp/sdk@7.1.0:

BidMachine retested on 7.1.0 (receipt comment): 47/59 ≈ 80% on media_buy_seller, 26/37 ≈ 70% on sales_non_guaranteed. Both surfaces now converge. The three remaining failures are clean seller-side issues they've already triaged.

The two underlying buckets I originally enumerated as contributing causes:

The third bucket (build-time schema pin, #1707) is real cosmetic-URL drift, but it is bounded by the additionalProperties: true permissiveness on most response schemas and doesn't drive verdict divergence today. Held pending adopter demand.
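
A minimal sketch of why additionalProperties: true bounds the impact: undeclared fields are tolerated rather than rejected, so a schema that lags behind the wire format still validates. The `MiniSchema` type and `rejectedKeys` function below are a simplified stand-in for one JSON Schema rule, not a real validator, and the field names are made up.

```typescript
// Simplified stand-in for one JSON Schema rule; not a real validator.
interface MiniSchema {
  properties: string[];          // declared property names
  additionalProperties: boolean; // whether undeclared keys are tolerated
}

// Returns the undeclared keys that would cause a rejection (empty when valid).
function rejectedKeys(schema: MiniSchema, value: { [key: string]: unknown }): string[] {
  if (schema.additionalProperties) {
    return [];
  }
  return Object.keys(value).filter(function (key) {
    return schema.properties.indexOf(key) === -1;
  });
}

// Hypothetical response carrying one field the pinned schema does not declare.
const response = { account_id: "acct_1", status: "active", authorization: { type: "oauth" } };

const permissive: MiniSchema = { properties: ["account_id", "status"], additionalProperties: true };
const strict: MiniSchema = { properties: ["account_id", "status"], additionalProperties: false };

const permissiveRejects = rejectedKeys(permissive, response);
const strictRejects = rejectedKeys(strict, response);
```

Under the permissive schema nothing is rejected; under the strict one the undeclared `authorization` key is.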

Narrowed scope for this issue: a small in-SDK parity smoke test that catches the class of bug we just chased — when comply() and runStoryboard() disagree on the verdict for the same fixture. See Recommended fix shape below.

Why this is the load-bearing issue (worse than either underlying bug)

Two AAO-owned evaluators producing this big a disagreement on the same wire means:

Update (above): the disagreement was version-driven, not evaluator-driven, and both root causes shipped in 7.1.0. Both evaluators now converge. The argument below is preserved for context but the urgency framing no longer applies.

1. Adopters can't trust either score. BidMachine ran 10+ deploys chasing the comply-suite verdict; their CLI runner verdict would have pointed at a different (if partially overlapping) set of root causes.
2. The badge claim doesn't anchor to a single contract. Whatever the next adopter sees on the dashboard is whichever evaluator the dashboard happens to consult. The same run can be graded as passing or failing depending on which surface answered.
3. Each underlying bug (Zod .strict(), build-time schema pin in #1707, codegen regen lag, error-attribution flattening, etc.) hides behind the divergence. Even after #1707 + the strict→passthrough flip land, there's no parity check ensuring the two surfaces stay in sync as one gets fixed before the other.

The remaining real value: a regression guard so a future change can't reintroduce the divergence we just spent five PRs collapsing.

Recommended fix shape (narrowed)

A focused in-SDK parity smoke test, ~50-100 LOC + a fixture file:

  1. Fixture: a mock agent that emits a BidMachine-shaped response — passthrough-tolerated authorization field on sync_accounts.accounts[], a structured authorization validation-result object somewhere, no actual credential leakage. Pure JSON / inline TS fixture, no live network.
  2. Test: same fixture × both surfaces.
    • Run runStoryboard(fixtureAgentUrl, sb, options) against each universal storyboard once.
    • Run comply(fixtureAgentUrl, options) once.
    • Assert comply().failures matches the per-storyboard runStoryboard().validations failures. Concretely: same failures.length, same set of (storyboard_id, step_id, validation.check) tuples.
  3. Test: regression cases for the bugs we just fixed.
    • A response with authorization: { type: 'oauth', ... } MUST pass no_secret_echo. Both surfaces.
    • A response that triggers a Zod reject MUST attribute to response_schema, NOT to a downstream invariant. Both surfaces.
    • A capability-unsupported storyboard MUST skip identically through comply() and runStoryboard().
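
The tuple comparison in step 2 could look roughly like the sketch below. It assumes the failures from each surface have already been flattened into a common `Failure` shape; that shape, the helper names, and the sample step IDs are hypothetical, and the real comply() and runStoryboard() result types may differ.

```typescript
// Hypothetical failure shape; the real SDK result types may differ.
interface Failure {
  storyboard_id: string;
  step_id: string;
  check: string;
}

interface ParityReport {
  onlyInComply: string[];
  onlyInRunner: string[];
  inParity: boolean;
}

// Stable string key for a (storyboard_id, step_id, check) tuple.
function tupleKey(f: Failure): string {
  return JSON.stringify([f.storyboard_id, f.step_id, f.check]);
}

// Keys present in `left` but missing from `right`, sorted for stable output.
function missingFrom(left: Failure[], right: Failure[]): string[] {
  const rightKeys = right.map(function (f) { return tupleKey(f); });
  return left
    .map(function (f) { return tupleKey(f); })
    .filter(function (key) { return rightKeys.indexOf(key) === -1; })
    .sort();
}

// Parity holds when both surfaces report the same number and the same set of tuples.
function compareFailures(complyFailures: Failure[], runnerFailures: Failure[]): ParityReport {
  const onlyInComply = missingFrom(complyFailures, runnerFailures);
  const onlyInRunner = missingFrom(runnerFailures, complyFailures);
  return {
    onlyInComply: onlyInComply,
    onlyInRunner: onlyInRunner,
    inParity:
      complyFailures.length === runnerFailures.length &&
      onlyInComply.length === 0 &&
      onlyInRunner.length === 0,
  };
}

// Made-up data: the same step fails on both surfaces but is attributed to different checks.
const runnerFailures: Failure[] = [
  { storyboard_id: "sales_non_guaranteed", step_id: "sync_accounts", check: "response_schema" },
];
const complyFailures: Failure[] = [
  { storyboard_id: "sales_non_guaranteed", step_id: "sync_accounts", check: "no_secret_echo" },
];

const divergent = compareFailures(complyFailures, runnerFailures);
const matching = compareFailures(runnerFailures, runnerFailures);
```

Reporting the tuples unique to each side, rather than a bare boolean, keeps a failing parity test actionable: an attribution mismatch shows up as one entry on each side for the same step.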

Specifically NOT in scope (would be a separate effort):

Refs
