The structural gap
The presence of comply_test_controller currently flips the storyboard's grading rubric implicitly. That conflates two different compliance contexts under one summary, and operators have no way to declare which one they're in.
Mode A: controlled-test (cert / onboarding)
The seller exposes comply_test_controller. The SDK uses it to seed deterministic state — force a buy into the submitted arm, inject delivery metrics, plant governance fixtures. This is the rubric for certifying a new seller: every scenario in the bundle must grade, including the ones that need seeded state.
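A minimal sketch of what mode A seeding looks like from the runner's side; the skill names (force_state, inject_delivery_metrics) and the client shape are illustrative assumptions, not the real comply_test_controller surface:

```typescript
// Illustrative sketch only: skill names and argument shapes are assumptions,
// not the actual comply_test_controller API.
interface TestControllerClient {
  invoke(skill: string, args: Record<string, unknown>): void;
}

// Seed the deterministic state a controlled-test scenario depends on.
function seedSubmittedArm(ctl: TestControllerClient, buyId: string): void {
  // Force the buy into the submitted arm so the async scenario can grade.
  ctl.invoke("force_state", { media_buy_id: buyId, state: "submitted" });
  // Plant delivery metrics for the reporting scenarios to read back.
  ctl.invoke("inject_delivery_metrics", { media_buy_id: buyId, impressions: 1000 });
}
```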
In this mode, missing the test controller is a hard fail. The operator has signed up to be certified; they're asserting their stack supports the full compliance surface.
Mode B: live-sandbox / verified-live (production monitoring)
The seller is a real production agent. No comply_test_controller (you don't ship test seeding skills to prod). The runner grades what's naturally observable on the wire — happy paths, real get_products queries, real create_media_buy against real inventory. Controlled-only scenarios should be marked not-applicable, not skipped — they're a different rubric.
In this mode, missing the test controller is expected. The verdict should be "live agent passed N/M live-applicable scenarios", not "passed N/M with 28 skipped for unspecified reasons."
What's broken today
Today the runner's behavior is "if test controller present → grade everything; if absent → silently skip controller-driven scenarios with skip_reason: missing_test_controller." The operator never declared intent. Downstream tooling can't tell:
- a real production agent that's correctly in mode B (28 N/A is fine, agent is healthy), from
- a stack that should be in mode A but is missing seeding skills (28 skipped means it can't be certified)
These have opposite operator-action implications — "ship it" vs "block release" — but the same wire signal today.
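To make the conflation concrete, today's implicit rule reduces to something like the following sketch (names are illustrative, not the runner's real API):

```typescript
// Both a healthy mode-B prod agent and a broken mode-A cert candidate
// produce the same "skipped" signal under today's behavior.
function gradeOrSkip(
  hasTestController: boolean,
  needsController: boolean,
): "graded" | "skipped" {
  if (needsController && !hasTestController) {
    // Downstream tooling sees an identical wire signal in both cases.
    return "skipped"; // skip_reason: missing_test_controller
  }
  return "graded";
}
```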
Proposal
1. Operator declares grading mode
Add a --grading-mode flag (or env var ADCP_GRADING_MODE) with values:
- controlled-test (default for npx storyboard run against localhost?)
- live-sandbox (default for runs against public HTTPS URLs?)
- auto (current behavior; flagged as deprecated)
The mode picks which rubric is loaded and what counts as a fail.
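A sketch of the resolution order this implies (explicit flag, then env var, then the tentative URL-based defaults); the function and type names are illustrative:

```typescript
// Illustrative resolution of the grading mode: explicit flag wins, then the
// env var, then the tentative URL-based defaults from the proposal.
type GradingMode = "controlled-test" | "live-sandbox" | "auto";

function resolveGradingMode(
  cliFlag: string | undefined,              // value of --grading-mode
  env: Record<string, string | undefined>,  // process.env
  targetUrl: string,
): GradingMode {
  const declared = cliFlag ?? env["ADCP_GRADING_MODE"];
  if (declared === "controlled-test" || declared === "live-sandbox" || declared === "auto") {
    return declared;
  }
  if (declared !== undefined) throw new Error(`unknown grading mode: ${declared}`);
  // Tentative defaults: localhost implies a controlled test rig; a public
  // HTTPS endpoint implies live-sandbox monitoring.
  const host = new URL(targetUrl).hostname;
  if (host === "localhost" || host === "127.0.0.1") return "controlled-test";
  if (targetUrl.startsWith("https://")) return "live-sandbox";
  return "auto"; // deprecated fallback, current behavior
}
```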
2. Scenarios declare their applicable mode(s)
Each storyboard scenario YAML grows an applicable_modes list:
```yaml
id: media_buy_seller/create_media_buy_async/force_submitted_arm
applicable_modes: [controlled-test]   # only graded when --grading-mode=controlled-test
```

```yaml
id: media_buy_seller/refine_products/setup
applicable_modes: [controlled-test, live-sandbox]  # graded in both, but under different rules
required_tools_one_of:
  - [sync_accounts, list_accounts]    # one of these is required
```
On a mode mismatch, the scenario is marked not_applicable (distinct from skipped) and counted in a new bucket.
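The gate itself is small; a sketch under the assumption that scenario metadata carries applicable_modes as proposed:

```typescript
// not_applicable is its own outcome bucket, distinct from skipped.
type GradingMode = "controlled-test" | "live-sandbox";

interface Scenario {
  id: string;
  applicable_modes: GradingMode[];
}

function gateScenario(s: Scenario, mode: GradingMode): "grade" | "not_applicable" {
  // A mode mismatch is not a skip: the scenario belongs to a different rubric.
  return s.applicable_modes.includes(mode) ? "grade" : "not_applicable";
}
```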
3. Mode-aware verdict
```
── Controlled-test mode ──
Steps: 31 passed, 28 failed (missing test controller), 0 skipped
STORYBOARD-FAIL: cannot certify without comply_test_controller
```

```
── Live-sandbox mode ──
Steps: 31 passed, 0 failed, 28 not-applicable (controlled-test only)
STORYBOARD-OK: agent passes all 31 live-applicable scenarios
```
Same agent, same wire behavior, different verdict — because the operator declared intent.
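The verdict line can then branch on the declared mode; a simplified sketch that assumes controlled-test failures stem from the missing test controller, matching the summaries above:

```typescript
type GradingMode = "controlled-test" | "live-sandbox";

interface Tally {
  passed: number;
  failed: number;
  notApplicable: number;
}

// Simplified: assumes controlled-test failures are controller-related, as in
// the example summaries; a real runner would carry per-step failure causes.
function verdictLine(mode: GradingMode, t: Tally): string {
  if (t.failed > 0) {
    return mode === "controlled-test"
      ? "STORYBOARD-FAIL: cannot certify without comply_test_controller"
      : `STORYBOARD-FAIL: ${t.failed} live-applicable scenarios failed`;
  }
  return mode === "live-sandbox"
    ? `STORYBOARD-OK: agent passes all ${t.passed} live-applicable scenarios`
    : `STORYBOARD-OK: all ${t.passed} scenarios graded`;
}
```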
Why this matters more than #1623 / #1624
#1623 (skip-cause aggregation) is presentation. #1624 (required-tool fail) is a rubric change for one specific gap. This issue is the architecture both should land in: the rubric isn't fixed — it depends on the mode the operator is grading in. Once mode is explicit:
- #1624's required-tool rule applies in controlled-test; in live-sandbox the operator may legitimately not have account discovery wired (different deployment).
- #1623's skip-cause aggregation only needs to surface in live-sandbox mode.
Strawman migration
- Land the mode flag with auto as the default (no behavior change).
- Tag scenarios with applicable_modes over the next few releases.
- Flip the default to require explicit --grading-mode once tagging is complete.
- Verified-live infrastructure (badges, dashboards) starts emitting from live-sandbox-mode runs only.
Happy to spec this out further or send a PR for step 1 — wanted to surface the framing before going deeper.
Related: #1623, #1624.