Problem
Hit live in the v0.8.56 remote-workbench smoke (#2964): the agent requested a tool approval, the bridge rendered the keyboard, and the user's "Allow + remember" button press did nothing. The turn sat pending-approval until the operator delivered the decision by hand through the runtime API (POST /v1/approvals/<id>).
The deadlock is structural, not a race:
A prompt enters runPrompt and the dispatch path parks on await streamTurnEvents(...) (src/index.mjs:503), which holds an open SSE read until the turn ends (src/index.mjs:549-660).
The runtime emits approval.required; the bridge sends the approval keyboard (src/index.mjs:587-624) and keeps waiting for SSE events. The runtime turn is itself paused awaiting the decision.
The user taps Allow. That arrives as a new Telegram update — which can never be fetched, because getUpdates (src/index.mjs:202-206) only runs again after handleIncomingUpdate returns (src/index.mjs:207-212), which requires the turn to finish, which requires the approval. Circular wait.
The only exit is the turn-timeout abort (src/index.mjs:550-551; turnTimeoutMs default 900000 ms, i.e. 15 minutes).
After the timeout, the queued approval is dropped anyway: handleCallbackQuery awaits answerCallback(query.id, "Working...") (src/index.mjs:342) before handleModalAction (src/index.mjs:343). By then the callback query is stale, Telegram returns 400, telegramApi throws (src/index.mjs:840-848), the throw is swallowed by the batch loop (src/index.mjs:209-211), and decideApproval never runs.
autoApprove defaults to false, so this is the default-path behavior for any tool needing approval.
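The circular wait above can be reproduced in a few lines. This is a minimal sketch, not the bridge's code: simulateSerialLoop, handleUpdate, streamTurn, and the in-memory queue are hypothetical stand-ins for the getUpdates loop, handleIncomingUpdate, streamTurnEvents, and Telegram's pending updates, and the turn timeout is shortened from 900000 ms to 50 ms.

```javascript
// Hypothetical stand-ins, not the bridge's code: `queue` plays Telegram's
// pending updates and `streamTurn` plays a turn paused on approval.
async function simulateSerialLoop(turnTimeoutMs = 50) {
  const queue = [{ type: "prompt" }, { type: "callback", data: "allow" }];
  const log = [];
  let deliverApproval = null; // set only while a turn is waiting

  // The "turn" ends when the approval arrives or the timeout fires.
  function streamTurn() {
    return new Promise((resolve) => {
      const timer = setTimeout(() => {
        deliverApproval = null;
        resolve("timeout");
      }, turnTimeoutMs);
      deliverApproval = () => {
        clearTimeout(timer);
        deliverApproval = null;
        resolve("approved");
      };
    });
  }

  async function handleUpdate(update) {
    if (update.type === "prompt") {
      // The bug: dispatch awaits the whole turn before returning.
      log.push(`turn ended: ${await streamTurn()}`);
    } else if (deliverApproval) {
      deliverApproval();
      log.push("approval delivered");
    } else {
      log.push("approval arrived too late");
    }
  }

  // Serial loop: the next update is not read until the handler returns.
  for (const update of queue) {
    await handleUpdate(update);
  }
  return log;
}
```

Under these assumptions the resolved log is "turn ended: timeout" followed by "approval arrived too late". The sketch drops the late approval simply because no turn is waiting any more; in the bridge the drop happens through the stale callback query described above.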
Everything else is dead too while a turn streams
All inbound interaction arrives exclusively via getUpdates, so during any runPrompt-initiated turn the following are blocked for up to 15 minutes: the Interrupt button (cw:interrupt, src/lib.mjs:181,197 → src/index.mjs:362-363) — unusable for the exact turn it accompanies; /allow, /deny, /interrupt, /status, /threads, /new, /compact text commands; all other button callbacks; and all other chats' prompts (process-global head-of-line blocking). Graceful shutdown is also parked: stopping is only checked at the top of the poll loop (src/index.mjs:200), never inside the stream.
Asymmetry confirming the diagnosis: turns resumed by reattachActiveTurns (fire-and-forget at src/index.mjs:180-182) run concurrently with the poll loop, so approvals and /interrupt work during reattached turns — only normally started turns deadlock.
Root cause
The poll loop serially awaits the full lifecycle of every update (src/index.mjs:207-212), and a prompt's lifecycle includes the entire turn stream. Dispatch and turn execution share one await chain.
Fix design
Three rules:
Handlers return in milliseconds; turns run as detached tasks. The poll loop dispatches an update and immediately continues polling. A prompt spawns its turn (POST /turns + SSE stream) as a tracked background promise; the update queue keeps draining for the turn's whole lifetime.
Per-chat serialization, global concurrency. Keep a per-chat active-turn registry (e.g. Map<chatId, { promise, abort }>), with the entry installed synchronously before the async work starts so two rapid prompts can't double-spawn. A second prompt to a busy chat gets the existing activeTurnBlock refusal (src/index.mjs:461-469, src/lib.mjs:290-299) — which becomes actually reachable mid-turn. Other chats are unaffected.
Approvals and interrupts are always reachable mid-turn. They are already short HTTP calls to the runtime (src/index.mjs:767-782, 730-754); dispatched from the now-always-live poll loop they simply signal the waiting turn through the runtime API.
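The three rules could be sketched roughly as follows. Everything here is an assumption for illustration: activeTurns, dispatchPrompt, and abortAllTurns are hypothetical names, and runTurn(signal) stands in for POST /turns plus the SSE read. The real change would live in src/index.mjs and reuse the existing activeTurnBlock refusal.

```javascript
// Sketch only: `activeTurns`, `dispatchPrompt`, and `abortAllTurns` are
// hypothetical names; `runTurn(signal)` stands in for POST /turns + the SSE read.
const activeTurns = new Map(); // chatId -> { promise, abort }

function dispatchPrompt(chatId, runTurn) {
  if (activeTurns.has(chatId)) {
    return false; // busy chat: the caller would send the activeTurnBlock refusal
  }
  const controller = new AbortController();
  const entry = { promise: null, abort: () => controller.abort() };
  // Installed synchronously, before any await, so two rapid prompts
  // for the same chat cannot both pass the guard above.
  activeTurns.set(chatId, entry);
  entry.promise = (async () => runTurn(controller.signal))()
    .catch((error) => console.error(`turn for chat ${chatId} failed:`, error))
    .finally(() => {
      // Runs on success, error, and abort alike; only clears our own entry.
      if (activeTurns.get(chatId) === entry) activeTurns.delete(chatId);
    });
  return true; // the poll loop continues without awaiting entry.promise
}

// Shutdown path: abort every in-flight stream, then wait for cleanup.
function abortAllTurns() {
  const entries = [...activeTurns.values()];
  for (const entry of entries) entry.abort();
  return Promise.all(entries.map((entry) => entry.promise));
}
```

In this sketch the caller never awaits the turn, so approvals and interrupts arriving in later getUpdates batches are dispatched while the stream is still open. A runTurn implementation would need to honor the signal (for example by passing it to fetch) for abortAllTurns to end streams promptly.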
Concrete changes:
handleCommand case "prompt" must not await the turn: spawn runPrompt(...) as a tracked background promise (with error logging) and return.
Fix the callback-drop compounder: make answerCallback best-effort (.catch) and/or answer after handleModalAction succeeds, so a stale/failed answerCallbackQuery can never discard the user's decision (src/index.mjs:342-343).
Shutdown: track per-chat AbortControllers so SIGINT/SIGTERM (src/index.mjs:163-168) aborts in-flight streams promptly instead of waiting out turnTimeoutMs.
Route reattachActiveTurns through the same per-chat registry — fixing its serial cross-chat head-of-line blocking (src/index.mjs:513,539) and the double-attach race against the poll loop (both paths patchChat the same record, last writer wins: src/index.mjs:573 vs :493-497/:505-508).
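For the callback-drop change, one possible shape is the best-effort variant below. It is a sketch under assumptions: answerCallback and handleModalAction are injected through a hypothetical deps argument so the example runs on its own, whereas the real functions are module-level in src/index.mjs and their exact signatures are not shown in this issue.

```javascript
// Sketch: `answerCallback` and `handleModalAction` are injected so the
// example runs standalone; `deps` and `warn` are hypothetical.
async function handleCallbackQuery(query, deps) {
  const { answerCallback, handleModalAction, warn = console.warn } = deps;
  try {
    // Best-effort acknowledgement: a stale query (Telegram 400) is logged, not fatal.
    await answerCallback(query.id, "Working...");
  } catch (error) {
    warn(`answerCallbackQuery failed for ${query.id}: ${error.message}`);
  }
  // The user's decision is always applied, whatever happened above.
  return handleModalAction(query);
}
```

The alternative named above, answering only after handleModalAction succeeds, gives the same guarantee but delays the acknowledgement Telegram clients wait for, so the two options trade responsiveness against simplicity.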
Scope
Smallest correct change, confined to integrations/telegram-bridge:
src/index.mjs: detach turn streams from dispatch; per-chat turn registry + guard; answerCallback hardening; shutdown abort; reattach through the registry.
src/lib.mjs: pure helpers for registry/guard decisions as needed (keep testable).
test/: new tests (below).
Acceptance criteria
Approval buttons (Allow once / Allow + remember / Deny) and /allow <id> / /deny <id> text commands resolve an approval while that chat's turn is actively streaming, with no timeout wait and no operator intervention.
/interrupt and the Interrupt button stop the active turn while it is streaming.
One chat's running turn never blocks another chat's prompts, commands, or callbacks (cross-chat isolation).
A second prompt to a chat with an active turn gets the existing activeTurnBlock refusal immediately (per-chat serialization preserved; no duplicate turn starts).
A failed answerCallbackQuery (e.g. stale query) never prevents the underlying action from executing.
SIGTERM aborts in-flight turn streams promptly instead of waiting up to turnTimeoutMs.
All existing 23 tests in test/ still pass (node --test), plus new tests covering concurrent dispatch.
Test plan
Unit (mock fetch for both Telegram and runtime): simulate getUpdates delivering [prompt], then while the SSE stream is held open with approval.required emitted, deliver [callback_query Allow] — assert POST /v1/approvals/<id> fires before the turn completes.
Same shape for /interrupt (assert /turns/<id>/interrupt fires mid-stream) and for a second chat's prompt starting while chat A streams.
Registry tests: duplicate prompt to busy chat → refusal; turn completion clears the registry entry even on stream error; abort on shutdown.
Regression: stale answerCallbackQuery rejection does not skip handleModalAction.
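The first unit test in the plan might take roughly this shape. It is only a sketch: createFakeRuntime and runBatches are hypothetical helpers, the fake runtime replaces the mocked fetch, and runBatches is a simplified dispatcher rather than the bridge's poll loop. In the repository the assertion would sit inside a node --test case.

```javascript
// Fake runtime (hypothetical): the stream stays open until a decision is posted.
function createFakeRuntime() {
  const calls = [];
  let finishStream;
  const streamDone = new Promise((resolve) => { finishStream = resolve; });
  return {
    calls,
    async streamTurn() {
      calls.push("approval.required emitted");
      await streamDone; // held open, like the SSE read
      calls.push("turn completed");
    },
    async decideApproval(id, decision) {
      calls.push(`POST /v1/approvals/${id} ${decision}`);
      finishStream();
    },
  };
}

// Simplified dispatcher under test: prompts detach, callbacks are awaited.
async function runBatches(batches, runtime) {
  const turns = [];
  for (const batch of batches) {
    for (const update of batch) {
      if (update.type === "prompt") {
        turns.push(runtime.streamTurn()); // not awaited here
      } else if (update.type === "callback") {
        await runtime.decideApproval(update.approvalId, update.decision);
      }
    }
  }
  await Promise.all(turns);
  return runtime.calls;
}
```

A test would feed it [[prompt], [callback Allow]] and assert that the POST /v1/approvals/<id> entry appears in calls before "turn completed". With a dispatcher that awaits the turn inline, the same input never reaches the callback and the test hangs until its timeout, which is the failure being guarded against.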
docs/REMOTE_SETUP_DESIGN.md checklist note
The "Telegram hardening checklist" table in docs/REMOTE_SETUP_DESIGN.md tracks send/format/transport edge cases but has no row for concurrent update dispatch or for reaching a running turn, which is the exact class of bug that bit us. Add a row ("Handlers return fast; approvals/interrupts reachable mid-turn") and mark it done when this lands.
Non-goals
Progress-edit streaming, typing indicators, MarkdownV2, smarter chunking, offset persistence, callback replay dedupe, 409/429 backoff curves, busy-chat queue/steer semantics, webhook mode — tracked in the companion hardening issue.
Blocks #2964. Related: #1990.