telegram-bridge: serial update dispatch deadlocks approvals — poll loop must never await a turn #2966

@Hmbown

Problem

Hit live in the v0.8.56 remote-workbench smoke (#2964): the agent requested a tool approval, the bridge rendered the keyboard, and the user's "Allow + remember" button press did nothing. The turn sat pending-approval until the operator delivered the decision by hand through the runtime API (POST /v1/approvals/<id>).

The deadlock is structural, not a race:

  1. A prompt enters runPrompt and the dispatch path parks on await streamTurnEvents(...) (src/index.mjs:503), which holds an open SSE read until the turn ends (src/index.mjs:549-660).
  2. The runtime emits approval.required; the bridge sends the approval keyboard (src/index.mjs:587-624) and keeps waiting for SSE events. The runtime turn is itself paused awaiting the decision.
  3. The user taps Allow. That arrives as a new Telegram update — which can never be fetched, because getUpdates (src/index.mjs:202-206) only runs again after handleIncomingUpdate returns (src/index.mjs:207-212), which requires the turn to finish, which requires the approval. Circular wait.
  4. The only exit is the turn-timeout abort (src/index.mjs:550-551; turnTimeoutMs default 900000 ms).
  5. After the timeout, the queued approval is dropped anyway: handleCallbackQuery awaits answerCallback(query.id, "Working...") (src/index.mjs:342) before handleModalAction (src/index.mjs:343). By then the callback query is stale, Telegram returns 400, telegramApi throws (src/index.mjs:840-848), the throw is swallowed by the batch loop (src/index.mjs:209-211), and decideApproval never runs.

autoApprove defaults to false, so this is the default-path behavior for any tool needing approval.
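
The circular wait in the steps above can be reproduced with a toy model. This is only a sketch: `makeRuntime`, `serialPoll`, and `demoDeadlock` are hypothetical names, the update objects are simplified stand-ins for Telegram updates, and a short timer stands in for turnTimeoutMs. It is not the bridge's actual code.

```javascript
// Toy model of the circular wait. All names are hypothetical stand-ins,
// not the bridge's real functions.
function makeRuntime() {
  let resolveApproval;
  const approved = new Promise((resolve) => { resolveApproval = resolve; });
  return { approved, decide: () => resolveApproval("approved") };
}

// Serial dispatch: update N+1 is not looked at until handler N returns.
async function serialPoll(updates, runtime, turnTimeoutMs) {
  const log = [];
  for (const update of updates) {
    if (update.type === "prompt") {
      // Stands in for `await streamTurnEvents(...)`: parks until the
      // approval arrives or the turn timeout fires.
      const outcome = await Promise.race([
        runtime.approved,
        new Promise((resolve) => setTimeout(() => resolve("timed out"), turnTimeoutMs)),
      ]);
      log.push(`turn ${outcome}`);
    } else if (update.type === "callback") {
      // The user's tap is only processed after the turn has already ended.
      runtime.decide();
      log.push("approval delivered");
    }
  }
  return log;
}

// The prompt and the Allow tap are both queued, yet the turn still times out.
function demoDeadlock() {
  const updates = [{ type: "prompt" }, { type: "callback" }];
  return serialPoll(updates, makeRuntime(), 50);
}
```

Calling `demoDeadlock()` resolves to `["turn timed out", "approval delivered"]`: the decision is only processed once the turn it was meant for has already given up.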

Everything else is dead too while a turn streams

All inbound interaction arrives exclusively via getUpdates, so during any runPrompt-initiated turn the following are blocked for up to 15 minutes: the Interrupt button (cw:interrupt, src/lib.mjs:181,197; src/index.mjs:362-363), which is unusable for the exact turn it accompanies; the /allow, /deny, /interrupt, /status, /threads, /new, and /compact text commands; all other button callbacks; and all other chats' prompts (process-global head-of-line blocking). Graceful shutdown is also parked: stopping is only checked at the top of the poll loop (src/index.mjs:200), never inside the stream.

Asymmetry confirming the diagnosis: turns resumed by reattachActiveTurns (fire-and-forget at src/index.mjs:180-182) run concurrently with the poll loop, so approvals and /interrupt work during reattached turns — only normally started turns deadlock.

Root cause

The poll loop serially awaits the full lifecycle of every update (src/index.mjs:207-212), and a prompt's lifecycle includes the entire turn stream. Dispatch and turn execution share one await chain.

Fix design

Three rules:

  1. Handlers return in milliseconds; turns run as detached tasks. The poll loop dispatches an update and immediately continues polling. A prompt spawns its turn (POST /turns + SSE stream) as a tracked background promise; the update queue keeps draining for the turn's whole lifetime.
  2. Per-chat serialization, global concurrency. Keep a per-chat active-turn registry (e.g. Map<chatId, { promise, abort }>), with the entry installed synchronously before the async work starts so two rapid prompts can't double-spawn. A second prompt to a busy chat gets the existing activeTurnBlock refusal (src/index.mjs:461-469, src/lib.mjs:290-299) — which becomes actually reachable mid-turn. Other chats are unaffected.
  3. Approvals and interrupts are always reachable mid-turn. They are already short HTTP calls to the runtime (src/index.mjs:767-782, 730-754); dispatched from the now-always-live poll loop they simply signal the waiting turn through the runtime API.
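
A minimal sketch of rules 1 and 2, assuming a module-level registry. `activeTurns`, `startTurn`, and the `runTurn(signal)` callback shape are hypothetical names chosen for illustration; the real change would wire this into runPrompt and the existing activeTurnBlock refusal.

```javascript
// Hypothetical per-chat registry: chatId -> { promise, abort }.
const activeTurns = new Map();

// Returns synchronously: either a refusal or a handle to the detached turn.
// `runTurn(signal)` is an assumed callback that starts the turn and streams it.
function startTurn(chatId, runTurn) {
  if (activeTurns.has(chatId)) {
    // Busy chat: the caller would send the existing activeTurnBlock refusal.
    return { started: false, reason: "active-turn" };
  }
  const controller = new AbortController();
  const entry = { abort: () => controller.abort(), promise: null };
  // Install the entry before any async work so a second rapid prompt sees it.
  activeTurns.set(chatId, entry);
  entry.promise = Promise.resolve()
    .then(() => runTurn(controller.signal))
    .catch((err) => {
      // Detached task: log instead of throwing into the poll loop.
      console.error(`turn failed for chat ${chatId}:`, err);
    })
    .finally(() => {
      // Clear only our own entry, whether the turn succeeded or failed.
      if (activeTurns.get(chatId) === entry) activeTurns.delete(chatId);
    });
  return { started: true, promise: entry.promise };
}
```

The poll loop would call `startTurn(...)` without awaiting `promise`, so it keeps draining updates for the turn's whole lifetime. Shutdown could walk `activeTurns.values()` and call each `abort()`.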

Concrete changes:

  • handleCommand case "prompt" must not await the turn: spawn runPrompt(...) as a tracked background promise (with error logging) and return.
  • Fix the callback-drop compounder: make answerCallback best-effort (.catch) and/or answer after handleModalAction succeeds, so a stale/failed answerCallbackQuery can never discard the user's decision (src/index.mjs:342-343).
  • Shutdown: track per-chat AbortControllers so SIGINT/SIGTERM (src/index.mjs:163-168) aborts in-flight streams promptly instead of waiting out turnTimeoutMs.
  • Route reattachActiveTurns through the same per-chat registry — fixing its serial cross-chat head-of-line blocking (src/index.mjs:513,539) and the double-attach race against the poll loop (both paths patchChat the same record, last writer wins: src/index.mjs:573 vs :493-497/:505-508).
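
The callback-drop fix could look roughly like the sketch below. It takes the "answer after the action" option and makes the acknowledgement best-effort. The dependency-injected signature and the "Done" text are assumptions for illustration, and `answerCallback` is assumed to return a promise; the real handleCallbackQuery has a different shape.

```javascript
// Hypothetical handler shape: answerCallback and handleModalAction are passed
// in so the sketch stays self-contained.
async function handleCallbackQuery(query, { answerCallback, handleModalAction }) {
  // Run the user's decision first so a failed acknowledgement cannot discard it.
  const result = await handleModalAction(query);
  // Acknowledge best-effort: a stale query (Telegram 400) is logged, not thrown.
  await answerCallback(query.id, "Done").catch((err) => {
    console.warn(`answerCallback failed for ${query.id}: ${err.message}`);
  });
  return result;
}
```

One trade-off with this ordering is that the button's loading indicator stays visible until the action finishes. Answering first with a `.catch` attached would clear it sooner while still protecting the decision.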

Scope

Smallest correct change, confined to integrations/telegram-bridge:

  • src/index.mjs: detach turn streams from dispatch; per-chat turn registry + guard; answerCallback hardening; shutdown abort; reattach through the registry.
  • src/lib.mjs: pure helpers for registry/guard decisions as needed (keep testable).
  • test/: new tests (below).

Non-goals

Progress-edit streaming, typing indicators, MarkdownV2, smarter chunking, offset persistence, callback replay dedupe, 409/429 backoff curves, busy-chat queue/steer semantics, webhook mode — tracked in the companion hardening issue.

Acceptance criteria

  • Approval buttons (Allow once / Allow + remember / Deny) and /allow <id> / /deny <id> text commands resolve an approval while that chat's turn is actively streaming — no timeout wait, no operator intervention.
  • /interrupt and the Interrupt button stop the active turn while it is streaming.
  • One chat's running turn never blocks another chat's prompts, commands, or callbacks (cross-chat isolation).
  • A second prompt to a chat with an active turn gets the existing activeTurnBlock refusal immediately (per-chat serialization preserved; no duplicate turn starts).
  • A failed answerCallbackQuery (e.g. stale query) never prevents the underlying action from executing.
  • SIGTERM aborts in-flight turn streams promptly instead of waiting up to turnTimeoutMs.
  • All 23 existing tests in test/ still pass (node --test), and new tests cover concurrent dispatch.

Test plan

  • Unit (mock fetch for both Telegram and runtime): simulate getUpdates delivering [prompt], then while the SSE stream is held open with approval.required emitted, deliver [callback_query Allow] — assert POST /v1/approvals/<id> fires before the turn completes.
  • Same shape for /interrupt (assert /turns/<id>/interrupt fires mid-stream) and for a second chat's prompt starting while chat A streams.
  • Registry tests: duplicate prompt to busy chat → refusal; turn completion clears the registry entry even on stream error; abort on shutdown.
  • Regression: stale answerCallbackQuery rejection does not skip handleModalAction.
  • Manual smoke: re-run the exact failed scenario from #2964 ("v0.8.56: Ship DigitalOcean + Telegram remote-workbench setup"): tool approval over Telegram, tap "Allow + remember" mid-turn, against a live runtime.
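
The first unit test's ordering assertion might be shaped like the sketch below. `makeFakeRuntime` and `dispatchAll` are hypothetical harness names, and the event strings are simplified labels rather than real SSE payloads. The actual tests would mock fetch for both Telegram and the runtime, as described above.

```javascript
// Hypothetical harness: a fake runtime whose turn stays open until approved.
function makeFakeRuntime() {
  const events = [];
  let approve;
  const approved = new Promise((resolve) => { approve = resolve; });
  return {
    events,
    async streamTurn() {
      events.push("approval.required");
      await approved; // the SSE stream is held open here
      events.push("turn.completed");
    },
    async postApproval(id) {
      events.push(`POST /v1/approvals/${id}`);
      approve();
    },
  };
}

// Detached dispatch: a prompt spawns its turn and returns immediately, so the
// callback that follows is handled while the turn is still streaming.
async function dispatchAll(updates, runtime) {
  const turns = [];
  for (const update of updates) {
    if (update.type === "prompt") {
      turns.push(runtime.streamTurn()); // not awaited here
    } else if (update.type === "callback") {
      await runtime.postApproval(update.approvalId);
    }
  }
  await Promise.all(turns);
  return runtime.events;
}
```

A test would then assert that the `POST /v1/approvals/<id>` entry appears before `turn.completed` in the returned events.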

docs/REMOTE_SETUP_DESIGN.md checklist note

The "Telegram hardening checklist" table in docs/REMOTE_SETUP_DESIGN.md tracks send/format/transport edge cases but has no row for concurrent update dispatch / reaching a running turn — the exact class that bit us. Add a row ("Handlers return fast; approvals/interrupts reachable mid-turn") marked done when this lands.

Blocks #2964. Related: #1990.

Labels: bug, documentation

Status: In progress
