
Reconciler Phase 5: Debug endpoints, Grafana integration, and alerting #1335

@jrf0110

Summary

Complete the remaining Phase 5 (Hardening + Observability) deliverables from the reconciliation implementation plan. The reconciler is live in production with invariant checking and per-tick metrics, but is missing debug tooling, dashboard integration, and alerting.

Parent issue: #204 (Phase 4: Hardening)

Deliverables

1. Event replay debug endpoint

POST /api/towns/:townId/debug/replay-events

  • Accepts a time range (from/to ISO timestamps)
  • Reads all town_events for that range
  • Applies them to a fresh in-memory state snapshot
  • Runs the reconciler against the resulting state
  • Returns the computed actions without applying them

This is the highest-value debug tool: it lets us replay the exact sequence of events that led to a stuck state. It would have saved significant time during the review-round debugging sessions that motivated the reconciler rewrite.
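The replay core can be sketched as a pure function, independent of the HTTP layer. The names below (`TownEvent`, `applyEvent`, the shape of `Action`) are illustrative stand-ins, not the actual project API:

```typescript
// Hypothetical core of the replay endpoint. The handler would load
// town_events for the from/to range, call this, and serialize the result.
interface TownEvent { id: number; type: string; at: string; payload?: unknown }
interface Action { type: string; [k: string]: unknown }

function replayEvents<S>(
  events: TownEvent[],
  freshState: S,
  applyEvent: (state: S, ev: TownEvent) => S,
  reconcile: (state: S) => Action[],
): { finalState: S; actions: Action[] } {
  // Fold the events for the requested window into a fresh snapshot,
  // replaying in timestamp order...
  const finalState = events
    .slice()
    .sort((a, b) => a.at.localeCompare(b.at))
    .reduce(applyEvent, freshState);
  // ...then run the reconciler and return its actions WITHOUT applying them.
  return { finalState, actions: reconcile(finalState) };
}
```

Keeping this pure (state in, actions out) is what makes the endpoint safe: nothing here touches live storage.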

2. Reconciler dry-run debug endpoint

POST /api/towns/:townId/debug/reconcile-dry-run

  • Runs reconciler.reconcile(sql) against current live state
  • Returns the actions it would emit
  • Does NOT apply them

Useful for inspecting what the reconciler thinks should happen right now without affecting state. Simpler than event replay — no time range, no state reconstruction.
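A minimal sketch of the dry-run path, assuming a `Reconciler` interface around the real `reconciler.reconcile(sql)` entry point (the interface and generic database parameter here are illustrative):

```typescript
// The dry-run endpoint is just "reconcile, don't apply": return the planned
// actions so the handler can serialize them as the response body.
interface Action { type: string; [k: string]: unknown }
interface Reconciler<Db> { reconcile(db: Db): Action[] }

function reconcileDryRun<Db>(reconciler: Reconciler<Db>, db: Db): Action[] {
  // Deliberately no apply step: live state is read, never written.
  return reconciler.reconcile(db);
}
```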

3. Grafana dashboard integration

Update the existing Grafana dashboard (gastown-grafana-dash-1.json) with reconciler-specific panels:

  • Events drained per tick (timeseries)
  • Actions emitted per tick by type (stacked bar)
  • Side effects attempted / succeeded / failed (timeseries)
  • Invariant violations (counter, alert threshold > 0)
  • Reconciler wall clock time (timeseries, alert threshold > 500ms)
  • Pending event queue depth (gauge, alert threshold > 50)

Data source: the _lastReconcilerMetrics field exposed via getAlarmStatus().
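One plausible shape for `_lastReconcilerMetrics`, mirroring the panel list above; the exact field names in the real code may differ, and the alert predicate just restates the thresholds from the panel bullets:

```typescript
// Hypothetical metrics snapshot exposed via getAlarmStatus() for Grafana.
interface ReconcilerMetrics {
  eventsDrained: number;                         // events drained per tick
  actionsByType: Record<string, number>;         // actions emitted, by type
  sideEffects: { attempted: number; succeeded: number; failed: number };
  invariantViolations: number;                   // alert threshold > 0
  wallClockMs: number;                           // alert threshold > 500ms
  pendingQueueDepth: number;                     // alert threshold > 50
}

// Restates the dashboard alert thresholds in one place.
function hasAlert(m: ReconcilerMetrics): boolean {
  return m.invariantViolations > 0 || m.wallClockMs > 500 || m.pendingQueueDepth > 50;
}
```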

4. Alerting on invariant violations

Invariant violations are currently only logged via console.error. Add:

  • Sentry error capture for each violation (with structured context: invariant number, message, townId)
  • A counter metric for violations that Grafana can alert on
  • Consider: auto-recovery actions for specific invariants (e.g., invariant 7 "working agent with no hook" → transition to idle)
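A sketch of the structured capture, with the Sentry call injected as a parameter so the reporting logic stays testable. `reportInvariantViolation` and the `capture` signature are hypothetical; wiring `capture` to the real Sentry SDK is left to the implementation:

```typescript
interface InvariantViolation { invariant: number; message: string; townId: string }

// Counter metric for violations that Grafana can alert on (threshold > 0).
let violationCount = 0;

function reportInvariantViolation(
  v: InvariantViolation,
  capture: (err: Error, context: Record<string, unknown>) => void,
): void {
  violationCount += 1;
  // Structured context: invariant number, message, townId, per the issue.
  capture(new Error(`Invariant ${v.invariant} violated: ${v.message}`), {
    invariant: v.invariant,
    message: v.message,
    townId: v.townId,
  });
}
```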

5. Filter container_status events to reduce noise

The alarm pre-phase currently inserts a container_status event for every working/stalled agent on every tick (every 5s). Most of these report running, which is a no-op in applyEvent. This creates ~720 events per hour per working agent.

Fix: only insert container_status events when the status is NOT running, or when it differs from the last-observed status. Track last-observed status on agent_metadata or in DO transient storage.
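The filter decision can be sketched as a small predicate, assuming a last-observed status is tracked per agent (on agent_metadata or in DO transient storage). The function name and status union are illustrative:

```typescript
// Hypothetical statuses; only "running" matters for the filter.
type ContainerStatus = "running" | "stopped" | "crashed" | "unknown";

function shouldInsertContainerStatusEvent(
  lastObserved: ContainerStatus | undefined,
  current: ContainerStatus,
): boolean {
  // Insert when the status is not `running`, or when it changed since the
  // last observation; steady-state `running` ticks produce no event.
  return current !== "running" || (lastObserved !== undefined && lastObserved !== current);
}
```

With this predicate, an agent that stays `running` emits zero events instead of ~720/hour, while transitions (e.g. `running` → `stopped` → `running`) are still recorded.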

Verification

  • Event replay endpoint returns correct actions for a known stuck-state scenario
  • Dry-run endpoint returns actions matching what the next alarm tick would produce
  • Grafana dashboard shows reconciler metrics with no gaps
  • Invariant violations trigger Sentry alerts
  • Container status event volume drops by ~99% after filtering

Labels: enhancement (New feature or request), kilo-auto-fix, kilo-triaged (auto-generated by Kilo)