
feat(gastown): add debug replay-events endpoint for reconciler phase 5 #1373

Open
jrf0110 wants to merge 4 commits into main from convoy/reconciler-phase-5-debug-endpoints-grafa/4763028e/head

jrf0110 commented Mar 21, 2026

Summary

Adds a POST /debug/towns/:townId/replay-events endpoint that replays town events from a given time range for debugging purposes. The endpoint:

  • Accepts from/to ISO timestamps, queries all town_events in that range (regardless of processed_at)
  • Applies each event via reconciler.applyEvent() to reconstruct state transitions
  • Runs reconciler.reconcile() against the resulting state to compute what actions would be emitted
  • Captures agent and non-terminal bead snapshots
  • Rolls back all mutations via SQLite SAVEPOINT so the endpoint is fully side-effect-free

Also includes the preceding commits on this convoy branch: dry-run reconciler endpoint, debug dry-run with event draining, and a fix for skipping container_status events.

Verification

  • Code review: all patterns match existing debugDryRun endpoint conventions (SAVEPOINT/ROLLBACK, parameterized queries, Zod validation, eslint-disable comments)
  • Imports verified: town_events, TownEventRecord, reconciler, query, Action all correctly imported
  • SQL injection safe: user inputs passed as parameterized ? placeholders
  • Input validation: missing fields (400), invalid dates (400), reversed range (400)
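
The three 400 cases listed above could be checked roughly as follows. The PR says the endpoint validates with Zod; this sketch swaps in plain TypeScript so that it has no dependencies. `parseReplayRange`, its result type, and the error strings are hypothetical names, not taken from the diff, and treating `from === to` as valid is an assumption.

```typescript
// Hypothetical result type: either a parsed range or a 400-style error.
type ReplayRange = { from: Date; to: Date };
type ParseResult =
  | { ok: true; range: ReplayRange }
  | { ok: false; status: 400; error: string };

// Validates a request body of the form { from: ISO, to: ISO }.
function parseReplayRange(body: unknown): ParseResult {
  const fail = (error: string): ParseResult => ({ ok: false, status: 400, error });

  if (typeof body !== "object" || body === null) return fail("body must be an object");
  const { from, to } = body as Record<string, unknown>;

  // Missing fields (400)
  if (typeof from !== "string" || typeof to !== "string") {
    return fail("from and to are required");
  }

  // Invalid dates (400): new Date() yields NaN time for unparseable input
  const fromDate = new Date(from);
  const toDate = new Date(to);
  if (Number.isNaN(fromDate.getTime()) || Number.isNaN(toDate.getTime())) {
    return fail("from and to must be valid dates");
  }

  // Reversed range (400)
  if (fromDate.getTime() > toDate.getTime()) {
    return fail("from must not be after to");
  }
  return { ok: true, range: { from: fromDate, to: toDate } };
}
```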

Visual Changes

N/A

Reviewer Notes

  • This is an unauthenticated /debug/ route, consistent with existing debug endpoints marked for removal after debugging
  • Unlike debugDryRun, this endpoint does NOT call events.markProcessed() — this is intentional since it replays historical (already-processed) events rather than draining pending ones
  • The SAVEPOINT pattern (SAVEPOINT → try/finally → ROLLBACK TO, then RELEASE) is identical to the existing debugDryRun method
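
As a rough illustration of the savepoint pattern described in these notes, the sketch below wraps a body in `SAVEPOINT` and always rolls back in `finally`. The `SqlExec` interface, `withRollback`, `RecordingSql`, and the savepoint name are hypothetical stand-ins; the real storage API and method bodies are not shown in this PR text. The SQL statements themselves are standard SQLite.

```typescript
// Minimal stand-in for a SQL handle (assumption, not the real storage API).
interface SqlExec {
  exec(statement: string): void;
}

// Runs `body` inside a savepoint and always undoes its writes,
// whether `body` returns normally or throws.
function withRollback<T>(sql: SqlExec, name: string, body: () => T): T {
  sql.exec(`SAVEPOINT ${name}`);
  try {
    return body();
  } finally {
    // In SQLite, ROLLBACK TO undoes the writes but leaves the savepoint
    // on the stack, so RELEASE is still needed to pop it.
    sql.exec(`ROLLBACK TO ${name}`);
    sql.exec(`RELEASE ${name}`);
  }
}

// Recording fake, used only to show the order of statements.
class RecordingSql implements SqlExec {
  readonly statements: string[] = [];
  exec(statement: string): void {
    this.statements.push(statement);
  }
}
```

With the recording fake, a body that issues one write produces the sequence `SAVEPOINT`, the write, `ROLLBACK TO`, `RELEASE`; a body that throws still produces the rollback and release before the error propagates.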

jrf0110 and others added 4 commits March 21, 2026 11:15
Filter out 'running' status in the alarm pre-phase before calling
upsertContainerStatus(). Running is the steady-state for healthy agents
and a no-op in applyEvent(), so recording it just bloats the event table
(~720 events/hour/agent). Non-running statuses (stopped, error, unknown)
still get inserted for reconciler detection.
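
A minimal sketch of the filter this commit message describes, assuming the four status values it names. `ContainerStatus` and `shouldRecordStatus` are illustrative names, not taken from the diff, and the real status union may differ. If the quoted rate is taken as exact, it corresponds to one status event every five seconds per agent.

```typescript
// Status values named in the commit message (the real type may have more).
type ContainerStatus = "running" | "stopped" | "error" | "unknown";

// Returns true when the status should be written to the event table.
// "running" is the steady state and a no-op in applyEvent(), so it is skipped.
function shouldRecordStatus(status: ContainerStatus): boolean {
  return status !== "running";
}

// 720 events/hour/agent works out to one event every 5 seconds.
const secondsBetweenEvents = 3600 / 720;
```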
Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount
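
The `actionsEmitted` and `actionsByType` fields in the response could be derived along these lines. This is a sketch under the assumption that each action carries a string `type` field; the real `Action` type is not shown here, and `summarizeActions` is a hypothetical helper name.

```typescript
// Only the `type` field is assumed; the real Action type is not shown in the PR text.
type Action = { type: string };

type DryRunMetrics = {
  actionsEmitted: number;
  actionsByType: Record<string, number>;
};

// Counts the actions overall and per type.
function summarizeActions(actions: readonly Action[]): DryRunMetrics {
  const actionsByType: Record<string, number> = {};
  for (const action of actions) {
    actionsByType[action.type] = (actionsByType[action.type] ?? 0) + 1;
  }
  return { actionsEmitted: actions.length, actionsByType };
}
```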
* feat(claw): evaluate button-vs-card feature flag for PostHog experiment tracking

* fix(claw): move button-vs-card flag eval to CreateInstanceCard

Moves useFeatureFlagVariantKey('button-vs-card') from ClawDashboard
(which renders for all users including those with existing instances)
to CreateInstanceCard (which only renders for users who haven't
provisioned yet). This scopes the experiment exposure to users who
can actually see the create CTA, avoiding population dilution.

* feat(gastown): add POST /debug/reconcile-dry-run endpoint

Add a debug endpoint that runs the reconciler against current live state
and returns the actions it would emit without applying them. This enables
inspecting what the reconciler thinks should happen at any given moment.

- Add debugDryRun() method to TownDO that calls reconciler.reconcile()
  and returns actions + metrics without calling applyAction()
- Add POST /debug/towns/:townId/reconcile-dry-run route following the
  same unauthenticated debug pattern as GET /debug/towns/:townId/status
- Response includes actions array, actionsEmitted count, actionsByType
  breakdown, and pendingEventCount

* fix(gastown): drain pending events in debugDryRun() before reconciling

Wrap debugDryRun() in a SQLite savepoint so it can drain and apply
pending town_events (Phase 0) before running reconcile (Phase 1),
matching the real alarm loop behavior. The savepoint is rolled back
in a finally block so the endpoint remains fully side-effect-free.

Adds eventsDrained to the returned metrics.

---------

Co-authored-by: kiloconnect[bot] <240665456+kiloconnect[bot]@users.noreply.github.com>
Co-authored-by: Pedro Heyerdahl <pedro@kilocode.ai>
Co-authored-by: Pedro Heyerdahl <61753986+pedroheyerdahl@users.noreply.github.com>
…y debugging

Adds debugReplayEvents(from, to) method to Town.do.ts that queries all
town_events in a time range (regardless of processed_at), applies them
to reconstruct state transitions, runs the reconciler, and returns the
computed actions and a state snapshot. Uses a SQLite SAVEPOINT that is
rolled back so the endpoint remains fully side-effect-free.

Route: POST /debug/towns/:townId/replay-events
Body: { from: ISO, to: ISO }
Response: { eventsReplayed, actions, stateSnapshot }
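
For reference, a client-side request for the route above might be assembled like this. The route, body fields, and response field names come from the commit message; the base URL, the helper names `buildReplayRequest` and `isReplayEventsResponse`, and the assumption that `eventsReplayed` is a number and `actions` an array are illustrative only.

```typescript
type ReplayEventsRequest = { from: string; to: string };

// Field names are from the commit message; payload shapes are left as unknown.
type ReplayEventsResponse = {
  eventsReplayed: number;
  actions: unknown[];
  stateSnapshot: unknown;
};

// Builds the URL and fetch-style init object without sending anything.
function buildReplayRequest(baseUrl: string, townId: string, from: Date, to: Date) {
  const body: ReplayEventsRequest = { from: from.toISOString(), to: to.toISOString() };
  return {
    url: `${baseUrl}/debug/towns/${encodeURIComponent(townId)}/replay-events`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Loose runtime check that a parsed response has the three documented fields.
function isReplayEventsResponse(value: unknown): value is ReplayEventsResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["eventsReplayed"] === "number" &&
    Array.isArray(v["actions"]) &&
    "stateSnapshot" in v
  );
}
```

The returned `url` and `init` can be passed directly to `fetch(url, init)`.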

// Apply each event to reconstruct state transitions
for (const event of rangeEvents) {
  reconciler.applyEvent(this.sql, event);
WARNING: Replaying onto live state double-applies historical events

debugReplayEvents() starts from the current town tables, then runs each selected event through non-idempotent handlers like reviewQueue.agentDone() and completeReviewWithResult(). If the selected window has already been processed, this can target different beads than it originally did and return actions/snapshots that never would have existed at from. A real replay needs to start from a state snapshot taken before the requested range (or otherwise reset the affected state before applying the events).
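
The concern can be illustrated with a toy model. Everything below is hypothetical: the state shape, the single `agent_done` event, and the handler are simplified stand-ins, not the real town schema or the real `reviewQueue.agentDone()`. The sketch only shows why a handler that acts on "whatever is at the head right now" gives a different result when replayed onto state that already reflects the event.

```typescript
// Toy state: a review queue of bead ids. Not the real town schema.
type ToyState = { queue: string[]; completed: string[] };
type ToyEvent = { type: "agent_done" };

// Non-idempotent handler: completes whichever bead is at the head right now.
function applyEvent(state: ToyState, _event: ToyEvent): ToyState {
  const [head, ...rest] = state.queue;
  if (head === undefined) return state;
  return { queue: rest, completed: [...state.completed, head] };
}

// Applies the events in order, starting from `start`.
function replay(start: ToyState, events: readonly ToyEvent[]): ToyState {
  return events.reduce(applyEvent, start);
}

const beforeRange: ToyState = { queue: ["bead-1", "bead-2"], completed: [] };
const events: ToyEvent[] = [{ type: "agent_done" }];

// Live state already reflects the event, because it was processed once.
const liveState = replay(beforeRange, events);

// Replaying from the pre-range snapshot completes bead-1, as it originally did.
const fromSnapshot = replay(beforeRange, events);

// Replaying onto live state applies the event a second time and hits bead-2.
const ontoLive = replay(liveState, events);
```

In this toy case `fromSnapshot.completed` contains only `bead-1`, while `ontoLive.completed` contains both beads, which is the kind of divergence the comment warns about.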

kilo-code-bot bot commented Mar 21, 2026

Code Review Summary

Status: 1 Issue Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 1 |
| SUGGESTION | 0 |

Issue Details

WARNING

| File | Line | Issue |
| --- | --- | --- |
| cloudflare-gastown/src/dos/Town.do.ts | 3730 | debugReplayEvents() re-applies historical events on top of current state, so non-idempotent handlers can return misleading actions and snapshots. |

Other Observations (not in diff)

N/A

Files Reviewed (3 files)
  • cloudflare-gastown/src/dos/Town.do.ts - 1 issue
  • cloudflare-gastown/src/gastown.worker.ts - 0 issues
  • src/app/(app)/claw/components/CreateInstanceCard.tsx - 0 issues

Reviewed by gpt-5.4-20260305 · 797,828 tokens
