storyboard: declare grading mode (controlled-test vs live-sandbox); test controller presence shouldn't silently change rubric #1626

@bokelley

The structural gap

comply_test_controller presence currently flips the storyboard's grading rubric implicitly. That conflates two different compliance contexts under one summary, and operators have no way to declare which one they're in.

Mode A: controlled-test (cert / onboarding)

The seller exposes comply_test_controller. The SDK uses it to seed deterministic state — force a buy into the submitted arm, inject delivery metrics, plant governance fixtures. This is the rubric for certifying a new seller: every scenario in the bundle must grade, including the ones that need seeded state.

In this mode, missing the test controller is a hard fail. The operator has signed up to be certified; they're asserting their stack supports the full compliance surface.

Mode B: live-sandbox / verified-live (production monitoring)

The seller is a real production agent. No comply_test_controller (you don't ship test seeding skills to prod). The runner grades what's naturally observable on the wire — happy paths, real get_products queries, real create_media_buy against real inventory. Controlled-only scenarios should be marked not-applicable, not skipped — they're a different rubric.

In this mode, missing the test controller is expected. The verdict should be "live agent passed N/M live-applicable scenarios", not "passed N/M with 28 skipped for unspecified reasons."

What's broken today

The runner's current behavior is: "if test controller present → grade everything; if absent → silently skip controller-driven scenarios with skip_reason: missing_test_controller." The operator never declares intent, so downstream tooling can't tell:

  • a real production agent that's correctly in mode B (28 N/A is fine, agent is healthy), from
  • a stack that should be in mode A but is missing seeding skills (28 skipped means it can't be certified)

These have opposite operator-action implications — "ship it" vs "block release" — but the same wire signal today.

Proposal

1. Operator declares grading mode

Add a --grading-mode flag (or env var ADCP_GRADING_MODE) with values:

  • controlled-test (default for npx storyboard run against localhost?)
  • live-sandbox (default for runs against public HTTPS URLs?)
  • auto (current behavior; flagged as deprecated)

The mode picks which rubric is loaded and what counts as a fail.
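
A minimal sketch of how the mode could be resolved, assuming the flag takes precedence over ADCP_GRADING_MODE and that auto is the fallback (matching step 1 of the migration). The function and type names are hypothetical, not existing runner API:

```typescript
// Hypothetical sketch: names and precedence are assumptions, not the runner's API.
type GradingMode = "controlled-test" | "live-sandbox" | "auto";

const GRADING_MODES: readonly GradingMode[] = ["controlled-test", "live-sandbox", "auto"];

// Type guard so that unvalidated strings never reach the rubric loader.
function isGradingMode(value: string): value is GradingMode {
  return (GRADING_MODES as readonly string[]).indexOf(value) !== -1;
}

// Assumed precedence: --grading-mode value, then ADCP_GRADING_MODE, then "auto".
function resolveGradingMode(flagValue?: string, envValue?: string): GradingMode {
  const raw = flagValue ?? envValue ?? "auto";
  if (!isGradingMode(raw)) {
    throw new Error(
      `unknown grading mode "${raw}"; expected one of ${GRADING_MODES.join(", ")}`
    );
  }
  return raw;
}
```

Rejecting unknown values rather than falling back to auto keeps a typo in the flag from quietly selecting the deprecated behavior.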

2. Scenarios declare their applicable mode(s)

Each storyboard scenario YAML grows an applicable_modes list:

id: media_buy_seller/create_media_buy_async/force_submitted_arm
applicable_modes: [controlled-test]   # only graded when --grading-mode=controlled-test

id: media_buy_seller/refine_products/setup
applicable_modes: [controlled-test, live-sandbox]   # graded in both, but
required_tools_one_of:                              # under different rules
  - [sync_accounts, list_accounts]                  # one of these is required

When the declared mode is not in a scenario's applicable_modes, the scenario is reported as not_applicable (distinct from skipped) and counted in a new bucket.
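
The not_applicable rule above can be sketched as a pure function over the parsed scenario. The Scenario shape mirrors only the two YAML fields shown here; the type and function names are assumptions made for illustration:

```typescript
// Hypothetical sketch: Scenario mirrors the YAML fields in this proposal only.
type GradingMode = "controlled-test" | "live-sandbox";
type Disposition = "graded" | "not_applicable";

interface Scenario {
  id: string;
  applicable_modes: GradingMode[];
}

// A scenario is graded only if the declared mode appears in applicable_modes.
function dispositionFor(scenario: Scenario, mode: GradingMode): Disposition {
  return scenario.applicable_modes.indexOf(mode) !== -1 ? "graded" : "not_applicable";
}

// The two scenarios used as examples in this issue.
const forceSubmittedArm: Scenario = {
  id: "media_buy_seller/create_media_buy_async/force_submitted_arm",
  applicable_modes: ["controlled-test"],
};
const refineProductsSetup: Scenario = {
  id: "media_buy_seller/refine_products/setup",
  applicable_modes: ["controlled-test", "live-sandbox"],
};
```

Because the disposition depends only on the scenario's declaration and the operator's mode, controller presence no longer has any say in whether a scenario counts.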

3. Mode-aware verdict

── Controlled-test mode ──
  Steps: 31 passed, 28 failed (missing test controller), 0 skipped
  STORYBOARD-FAIL: cannot certify without comply_test_controller

── Live-sandbox mode ──
  Steps: 31 passed, 0 failed, 28 not-applicable (controlled-test only)
  STORYBOARD-OK: agent passes all 31 live-applicable scenarios

Same agent, same wire behavior, different verdict — because the operator declared intent.
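
As a rough sketch of how one set of step results could produce both verdicts, assume each step carries its applicable modes and a raw outcome. The StepResult and Verdict shapes and the gradeRun name are hypothetical:

```typescript
// Hypothetical sketch: result shapes are assumptions, not the runner's types.
type GradingMode = "controlled-test" | "live-sandbox";
type Outcome = "passed" | "failed" | "missing_test_controller";

interface StepResult {
  applicableModes: GradingMode[];
  outcome: Outcome;
}

interface Verdict {
  passed: number;
  failed: number;
  notApplicable: number;
  ok: boolean;
}

function gradeRun(stepResults: StepResult[], mode: GradingMode): Verdict {
  let passed = 0;
  let failed = 0;
  let notApplicable = 0;
  for (const step of stepResults) {
    if (step.applicableModes.indexOf(mode) === -1) {
      notApplicable += 1; // outside this mode's rubric, so it is neither pass nor fail
    } else if (step.outcome === "passed") {
      passed += 1;
    } else {
      failed += 1; // includes missing_test_controller: a hard fail when applicable
    }
  }
  return { passed, failed, notApplicable, ok: failed === 0 };
}

// Rebuild the example above: 31 steps that pass on the wire, 28 that need seeded state.
const exampleSteps: StepResult[] = [];
for (let i = 0; i < 31; i++) {
  exampleSteps.push({ applicableModes: ["controlled-test", "live-sandbox"], outcome: "passed" });
}
for (let i = 0; i < 28; i++) {
  exampleSteps.push({ applicableModes: ["controlled-test"], outcome: "missing_test_controller" });
}

const controlledVerdict = gradeRun(exampleSteps, "controlled-test");
const liveVerdict = gradeRun(exampleSteps, "live-sandbox");
```

In this sketch there is no skipped bucket at all: once the mode is declared, every step is either inside the rubric (pass or fail) or outside it (not applicable).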

Why this matters more than #1623 / #1624

#1623 (skip-cause aggregation) is presentation. #1624 (required-tool fail) is rubric for one specific gap. This issue is the architecture both should land into: the rubric isn't fixed — it depends on the mode the operator is grading in. Once mode is explicit:

Strawman migration

  1. Land the mode flag with auto as the default (no behavior change).
  2. Tag scenarios with applicable_modes over the next few releases.
  3. Flip the default to require explicit --grading-mode once tagging is complete.
  4. Verified-live infrastructure (badges, dashboards) starts emitting from live-sandbox-mode runs only.

Happy to spec this out further or send a PR for step 1 — wanted to surface the framing before going deeper.

Related: #1623, #1624.
