
Reconciler Phase 5: Debug endpoints, Grafana integration, and alerting #1335

@jrf0110

Summary

Complete the remaining Phase 5 (Hardening + Observability) deliverables from the reconciliation implementation plan. The reconciler is live in production with invariant checking and per-tick metrics, but is missing debug tooling, dashboard integration, and alerting.

Parent issue: #204 (Phase 4: Hardening)

Deliverables

1. Event replay debug endpoint

POST /api/towns/:townId/debug/replay-events

  • Accepts a time range (from/to ISO timestamps)
  • Reads all town_events for that range
  • Applies them to a fresh in-memory state snapshot
  • Runs the reconciler against the resulting state
  • Returns the computed actions without applying them

This is the highest-value debug tool: it lets us replay the exact sequence of events that led to a stuck state. It would have saved significant time during the review-round debugging sessions that motivated the reconciler rewrite.
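The replay core can be sketched as a pure function, independent of the HTTP layer. The names below (`TownEvent`, `applyEvent`, the shape of `Action`) are illustrative stand-ins, not the actual project API:

```typescript
// Hypothetical core of the replay endpoint. The handler would load
// town_events for the from/to range, call this, and serialize the result.
interface TownEvent { id: number; type: string; at: string; payload?: unknown }
interface Action { type: string; [k: string]: unknown }

function replayEvents<S>(
  events: TownEvent[],
  freshState: S,
  applyEvent: (state: S, ev: TownEvent) => S,
  reconcile: (state: S) => Action[],
): { finalState: S; actions: Action[] } {
  // Fold the events for the requested window into a fresh snapshot,
  // replaying in timestamp order...
  const finalState = events
    .slice()
    .sort((a, b) => a.at.localeCompare(b.at))
    .reduce(applyEvent, freshState);
  // ...then run the reconciler and return its actions WITHOUT applying them.
  return { finalState, actions: reconcile(finalState) };
}
```

Keeping this pure (state in, actions out) is what makes the endpoint safe: nothing here touches live storage.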

2. Reconciler dry-run debug endpoint

POST /api/towns/:townId/debug/reconcile-dry-run

  • Runs reconciler.reconcile(sql) against current live state
  • Returns the actions it would emit
  • Does NOT apply them

Useful for inspecting what the reconciler thinks should happen right now without affecting state. Simpler than event replay — no time range, no state reconstruction.
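A minimal sketch of the dry-run path, assuming a `Reconciler` interface around the real `reconciler.reconcile(sql)` entry point (the interface and generic database parameter here are illustrative):

```typescript
// The dry-run endpoint is just "reconcile, don't apply": return the planned
// actions so the handler can serialize them as the response body.
interface Action { type: string; [k: string]: unknown }
interface Reconciler<Db> { reconcile(db: Db): Action[] }

function reconcileDryRun<Db>(reconciler: Reconciler<Db>, db: Db): Action[] {
  // Deliberately no apply step: live state is read, never written.
  return reconciler.reconcile(db);
}
```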

3. Grafana dashboard integration

Update the existing Grafana dashboard (gastown-grafana-dash-1.json) with reconciler-specific panels:

  • Events drained per tick (timeseries)
  • Actions emitted per tick by type (stacked bar)
  • Side effects attempted / succeeded / failed (timeseries)
  • Invariant violations (counter, alert threshold > 0)
  • Reconciler wall clock time (timeseries, alert threshold > 500ms)
  • Pending event queue depth (gauge, alert threshold > 50)

Data source: the _lastReconcilerMetrics field exposed via getAlarmStatus().
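One plausible shape for `_lastReconcilerMetrics`, mirroring the panel list above; the exact field names in the real code may differ, and the alert predicate just restates the thresholds from the panel bullets:

```typescript
// Hypothetical metrics snapshot exposed via getAlarmStatus() for Grafana.
interface ReconcilerMetrics {
  eventsDrained: number;                         // events drained per tick
  actionsByType: Record<string, number>;         // actions emitted, by type
  sideEffects: { attempted: number; succeeded: number; failed: number };
  invariantViolations: number;                   // alert threshold > 0
  wallClockMs: number;                           // alert threshold > 500ms
  pendingQueueDepth: number;                     // alert threshold > 50
}

// Restates the dashboard alert thresholds in one place.
function hasAlert(m: ReconcilerMetrics): boolean {
  return m.invariantViolations > 0 || m.wallClockMs > 500 || m.pendingQueueDepth > 50;
}
```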

4. Alerting on invariant violations

Invariant violations are currently only logged via console.error. Add:

  • Sentry error capture for each violation (with structured context: invariant number, message, townId)
  • A counter metric for violations that Grafana can alert on
  • Consider: auto-recovery actions for specific invariants (e.g., invariant 7 "working agent with no hook" → transition to idle)
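A sketch of the structured capture, with the Sentry call injected as a parameter so the reporting logic stays testable. `reportInvariantViolation` and the `capture` signature are hypothetical; wiring `capture` to the real Sentry SDK is left to the implementation:

```typescript
interface InvariantViolation { invariant: number; message: string; townId: string }

// Counter metric for violations that Grafana can alert on (threshold > 0).
let violationCount = 0;

function reportInvariantViolation(
  v: InvariantViolation,
  capture: (err: Error, context: Record<string, unknown>) => void,
): void {
  violationCount += 1;
  // Structured context: invariant number, message, townId, per the issue.
  capture(new Error(`Invariant ${v.invariant} violated: ${v.message}`), {
    invariant: v.invariant,
    message: v.message,
    townId: v.townId,
  });
}
```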

5. Filter container_status events to reduce noise

The alarm pre-phase currently inserts a container_status event for every working/stalled agent on every tick (every 5s). Most of these report running, which is a no-op in applyEvent. This creates ~720 events per hour per working agent.

Fix: only insert container_status events when the status is NOT running, or when it differs from the last-observed status. Track last-observed status on agent_metadata or in DO transient storage.
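The filter decision can be sketched as a small predicate, assuming a last-observed status is tracked per agent (on agent_metadata or in DO transient storage). The function name and status union are illustrative:

```typescript
// Hypothetical statuses; only "running" matters for the filter.
type ContainerStatus = "running" | "stopped" | "crashed" | "unknown";

function shouldInsertContainerStatusEvent(
  lastObserved: ContainerStatus | undefined,
  current: ContainerStatus,
): boolean {
  // Insert when the status is not `running`, or when it changed since the
  // last observation; steady-state `running` ticks produce no event.
  return current !== "running" || (lastObserved !== undefined && lastObserved !== current);
}
```

With this predicate, an agent that stays `running` emits zero events instead of ~720/hour, while transitions (e.g. `running` → `stopped` → `running`) are still recorded.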

Verification

  • Event replay endpoint returns correct actions for a known stuck-state scenario
  • Dry-run endpoint returns actions matching what the next alarm tick would produce
  • Grafana dashboard shows reconciler metrics with no gaps
  • Invariant violations trigger Sentry alerts
  • Container status event volume drops by ~99% after filtering

Labels: enhancement (New feature or request), kilo-auto-fix, kilo-triaged (auto-generated by Kilo)