-
Notifications
You must be signed in to change notification settings - Fork 12
Description
Summary
Complete the remaining Phase 5 (Hardening + Observability) deliverables from the reconciliation implementation plan. The reconciler is live in production with invariant checking and per-tick metrics, but is missing debug tooling, dashboard integration, and alerting.
Parent issue: #204 (Phase 4: Hardening)
Deliverables
1. Event replay debug endpoint
POST /api/towns/:townId/debug/replay-events
- Accepts a time range (
from/toISO timestamps) - Reads all
town_eventsfor that range - Applies them to a fresh in-memory state snapshot
- Runs the reconciler against the resulting state
- Returns the computed actions without applying them
This is the highest-value debug tool — it allows replaying the exact sequence of events that led to a stuck state. Would have saved significant time during the review-round debugging sessions that motivated the reconciler rewrite.
2. Reconciler dry-run debug endpoint
POST /api/towns/:townId/debug/reconcile-dry-run
- Runs
reconciler.reconcile(sql)against current live state - Returns the actions it would emit
- Does NOT apply them
Useful for inspecting what the reconciler thinks should happen right now without affecting state. Simpler than event replay — no time range, no state reconstruction.
3. Grafana dashboard integration
Update the existing Grafana dashboard (gastown-grafana-dash-1.json) with reconciler-specific panels:
- Events drained per tick (timeseries)
- Actions emitted per tick by type (stacked bar)
- Side effects attempted / succeeded / failed (timeseries)
- Invariant violations (counter, alert threshold > 0)
- Reconciler wall clock time (timeseries, alert threshold > 500ms)
- Pending event queue depth (gauge, alert threshold > 50)
Data source: the _lastReconcilerMetrics field exposed via getAlarmStatus().
4. Alerting on invariant violations
Currently invariant violations are logged as console.error. Add:
- Sentry error capture for each violation (with structured context: invariant number, message, townId)
- A counter metric for violations that Grafana can alert on
- Consider: auto-recovery actions for specific invariants (e.g., invariant 7 "working agent with no hook" → transition to idle)
5. Filter container_status events to reduce noise
The alarm pre-phase currently inserts a container_status event for every working/stalled agent on every tick (every 5s). Most return running which is a no-op in applyEvent. This creates ~720 events/hour per working agent.
Fix: only insert container_status events when the status is NOT running, or when it differs from the last-observed status. Track last-observed status on agent_metadata or in DO transient storage.
Verification
- Event replay endpoint returns correct actions for a known stuck-state scenario
- Dry-run endpoint returns actions matching what the next alarm tick would produce
- Grafana dashboard shows reconciler metrics with no gaps
- Invariant violations trigger Sentry alerts
- Container status event volume drops by ~99% after filtering
References
- Reconciliation spec — §6 (Invariants), §7 (Timing)
- Implementation plan — Phase 5
- Deviations — Deviation 17 (resolved), 22