storyboard: declare grading mode (controlled-test vs live-sandbox); test controller presence shouldn't silently change rubric

## The structural gap

`comply_test_controller` presence currently flips the storyboard's grading rubric *implicitly*. That conflates two different compliance contexts under one summary, and operators have no way to declare which one they're in.

### Mode A: controlled-test (cert / onboarding)

The seller exposes `comply_test_controller`. The SDK uses it to seed deterministic state — force a buy into the `submitted` arm, inject delivery metrics, plant governance fixtures. This is the rubric for **certifying a new seller**: every scenario in the bundle must grade, including the ones that need seeded state.

In this mode, **missing the test controller is a hard fail**. The operator has signed up to be certified; they're asserting their stack supports the full compliance surface.

### Mode B: live-sandbox / verified-live (production monitoring)

The seller is a real production agent. No `comply_test_controller` (you don't ship test seeding skills to prod). The runner grades what's naturally observable on the wire — happy paths, real `get_products` queries, real `create_media_buy` against real inventory. Controlled-only scenarios should be **marked not-applicable**, not skipped — they're a different rubric.

In this mode, **missing the test controller is expected**. The verdict should be "live agent passed N/M live-applicable scenarios", not "passed N/M with 28 skipped for unspecified reasons."

## What's broken today

Today the runner's behavior is "if test controller present → grade everything; if absent → silently skip controller-driven scenarios with `skip_reason: missing_test_controller`." The operator never declared intent. Downstream tooling can't tell:

- a real production agent that's correctly in mode B (28 N/A is fine, agent is healthy), from
- a stack that *should* be in mode A but is missing seeding skills (28 skipped means it can't be certified)

These have **opposite operator-action implications** — "ship it" vs "block release" — but the same wire signal today.

## Proposal

### 1. Operator declares grading mode

Add a `--grading-mode` flag (or env var `ADCP_GRADING_MODE`) with values:

- `controlled-test` (default for `npx storyboard run` against `localhost`?)
- `live-sandbox` (default for runs against public HTTPS URLs?)
- `auto` (current behavior; flagged as deprecated)

The mode picks which rubric is loaded and what counts as a fail.

### 2. Scenarios declare their applicable mode(s)

Each storyboard scenario YAML grows an `applicable_modes` list:

```yaml
id: media_buy_seller/create_media_buy_async/force_submitted_arm
applicable_modes: [controlled-test]   # only graded when --grading-mode=controlled-test
```

```yaml
id: media_buy_seller/refine_products/setup
applicable_modes: [controlled-test, live-sandbox]   # graded in both, but
required_tools_one_of:                              # under different rules
  - [sync_accounts, list_accounts]                  # one of these is required
```

When mode mismatch → scenario is `not_applicable` (distinct from `skipped`), and counted in a new bucket.

### 3. Mode-aware verdict

```
── Controlled-test mode ──
  Steps: 31 passed, 28 failed (missing test controller), 0 skipped
  STORYBOARD-FAIL: cannot certify without comply_test_controller

── Live-sandbox mode ──
  Steps: 31 passed, 0 failed, 28 not-applicable (controlled-test only)
  STORYBOARD-OK: agent passes all 31 live-applicable scenarios
```

Same agent, same wire behavior, different verdict — because the operator declared intent.

## Why this matters more than #1623 / #1624

#1623 (skip-cause aggregation) is presentation. #1624 (required-tool fail) is rubric for one specific gap. **This issue is the architecture both should land into**: the rubric isn't fixed — it depends on the mode the operator is grading in. Once mode is explicit:

- #1623's skip-cause block gets organized by mode-applicability.
- #1624's "required tool missing → fail" only applies in `controlled-test`; in `live-sandbox` the operator may legitimately not have account discovery wired (different deployment).
- Future "verified live" badge work has a clean place to plug in: it's just a published artifact of `live-sandbox` mode.

## Strawman migration

1. Land the mode flag with `auto` as the default (no behavior change).
2. Tag scenarios with `applicable_modes` over the next few releases.
3. Flip the default to require explicit `--grading-mode` once tagging is complete.
4. Verified-live infrastructure (badges, dashboards) starts emitting from `live-sandbox`-mode runs only.

Happy to spec this out further or send a PR for step 1 — wanted to surface the framing before going deeper.

Related: #1623, #1624.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

storyboard: declare grading mode (controlled-test vs live-sandbox); test controller presence shouldn't silently change rubric #1626

The structural gap

Mode A: controlled-test (cert / onboarding)

Mode B: live-sandbox / verified-live (production monitoring)

What's broken today

Proposal

1. Operator declares grading mode

2. Scenarios declare their applicable mode(s)

3. Mode-aware verdict

Why this matters more than #1623 / #1624

Strawman migration

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

storyboard: declare grading mode (controlled-test vs live-sandbox); test controller presence shouldn't silently change rubric #1626

Description

The structural gap

Mode A: controlled-test (cert / onboarding)

Mode B: live-sandbox / verified-live (production monitoring)

What's broken today

Proposal

1. Operator declares grading mode

2. Scenarios declare their applicable mode(s)

3. Mode-aware verdict

Why this matters more than #1623 / #1624

Strawman migration

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions