Summary
The Telegram bridge (integrations/telegram-bridge/src/index.mjs, src/lib.mjs) is a deliberately minimal MVP. This issue folds the production behaviors a long-running phone bridge needs into one incremental workstream (one issue, sliced PRs — avoids issue spam). Depends on the dispatch-concurrency bug fix (#2966) landing first.
This picks up the deferred/partial rows of the "Telegram hardening checklist" in docs/REMOTE_SETUP_DESIGN.md (the doc's own note says "Revisit after remote-setup lands" — this is that revisit; update the table statuses as PRs land):
| Checklist row | Status in doc | Picked up here |
| --- | --- | --- |
| Text batching + progress-edit (edit one msg vs spam) | deferred | Yes (items 1 and 4) |
| MarkdownV2 escaping + table rendering | deferred | Yes (item 3) |
| Network/connect-timeout retry | partial (generic 3s backoff) | Yes (item 6) |
| 409 polling conflict | done (flat 10s forever) | Item 6 adds escalating ladder + fatal taxonomy |
| 429 retry_after | done (poll path only) | Item 6 extends to the send path |
| Webhook mode | out of scope | Still out of scope (non-goal) |
Items
1. Progress-edit streaming (one message, edited in place)
- Gap: deltas accumulate silently; a flush requires BOTH >3500 chars AND >15s (src/index.mjs:577-585), so any response under ~3500 chars is total silence between "Started turn" and the final dump (src/index.mjs:636). Every flush is a brand-new sendMessage (src/index.mjs:802-815); the only edit helper today is editMessageReplyMarkup (src/index.mjs:825-831).
- Design: first flush sends the progress message and captures its message_id; subsequent flushes editMessageText that same message with a trailing cursor marker (e.g. ▌); flush when ~0.8s has elapsed or a small content threshold accrues; on flood/rate errors back off adaptively (doubling toward ~10s), and after repeated edit failures degrade gracefully to today's append-only sends; the final edit removes the cursor and becomes the answer.
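The pacing rules in this design could be sketched as a small pure helper. Everything named here is hypothetical and not existing bridge code: `createEditPacer`, `renderProgress`, the 40-character threshold, the three-failure degrade limit, and the halving of the interval after a successful edit are illustration choices only.

```javascript
// Hypothetical helper: decides when the progress message should be edited.
const CURSOR = "▌";

function createEditPacer({ baseMs = 800, maxMs = 10000, minChars = 40, maxFailures = 3 } = {}) {
  let intervalMs = baseMs; // current minimum gap between edits
  let failures = 0;        // consecutive flood/rate errors

  return {
    // Flush when the interval elapsed, or (only while not backing off)
    // when a small amount of new content has accrued.
    shouldFlush(elapsedMs, pendingChars) {
      if (pendingChars === 0) return false;
      if (elapsedMs >= intervalMs) return true;
      return intervalMs === baseMs && pendingChars >= minChars;
    },
    // Flood/rate error: double the interval, capped at maxMs.
    onFlood() {
      intervalMs = Math.min(intervalMs * 2, maxMs);
      failures += 1;
    },
    // Successful edit: reset the failure streak and relax toward baseMs
    // (assumption: the issue does not specify a recovery rule).
    onSuccess() {
      failures = 0;
      intervalMs = Math.max(baseMs, intervalMs / 2);
    },
    // True once the caller should fall back to append-only sends.
    isDegraded() {
      return failures >= maxFailures;
    },
    currentIntervalMs() {
      return intervalMs;
    },
  };
}

// Appends the cursor while streaming; the final edit omits it.
function renderProgress(text, done) {
  return done ? text : text + " " + CURSOR;
}
```

Keeping the decision logic free of timers and network calls means it can be unit-tested with plain numbers, in line with the "pure and testable" goal stated for the helpers in src/lib.mjs.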
2. Typing indicator cadence
- Gap: the bridge never calls sendChatAction, so there is zero feedback during the (common) silent phase of a turn.
- Design: refresh typing every ~2s (Telegram expires the state at ~5s), each call time-bounded (~1.5s) so a slow round-trip can't break the cadence; re-fire after every send (delivery clears the state); pause while an approval prompt is pending and resume on resolution (hooks into the approval.required path, src/index.mjs:587-624).
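A minimal sketch of the cadence described above, with the timer functions injected so the loop can be tested without real time passing. `createTypingLoop` and its option names are hypothetical; `sendTyping` stands in for whatever function ends up calling sendChatAction.

```javascript
// Default timers are wrapped so they can be swapped for fakes in tests.
const defaultTimers = {
  set: (fn, ms) => setInterval(fn, ms),
  clear: (handle) => clearInterval(handle),
};

// Hypothetical controller: fires sendTyping immediately, then every everyMs,
// and stays quiet while paused (e.g. while an approval prompt is pending).
function createTypingLoop(sendTyping, { everyMs = 2000, timers = defaultTimers } = {}) {
  let handle = null;
  let paused = false;

  const tick = () => {
    if (!paused) sendTyping();
  };

  return {
    start() {
      if (handle !== null) return; // already running
      tick();
      handle = timers.set(tick, everyMs);
    },
    pause() {
      paused = true;
    },
    // Resuming (or re-firing after a send) triggers an immediate refresh.
    resume() {
      paused = false;
      tick();
    },
    stop() {
      if (handle === null) return;
      timers.clear(handle);
      handle = null;
    },
  };
}
```

The ~1.5s bound on each call would live inside `sendTyping` itself, for example by passing `AbortSignal.timeout(1500)` to the underlying fetch, assuming the bridge's request helper can accept a signal.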
3. MarkdownV2 escaping + table rendering
- Gap: sendMessage sets no parse_mode (src/index.mjs:805-809); model markdown and code fences render as raw text on a phone.
- Design: escape pipeline with placeholder protection for code spans/fences; rewrite GFM pipe tables (MarkdownV2 has no tables) into a bold-heading + bullet layout; and, critically, a send-time fallback chain: attempt MarkdownV2, on a parse error resend as stripped plain text. The fallback is what makes MDV2 shippable despite its parser-escaping bug class (the reason the design doc deferred it). Helpers live in src/lib.mjs, pure and unit-tested.
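As a rough illustration of the escaping step, the sketch below escapes MarkdownV2's reserved characters outside code and only backtick and backslash inside code spans and fences. The function names are hypothetical. It deliberately does not translate bold, italic, links, or tables, so those would render as literal characters; a real pipeline needs those passes plus the plain-text fallback.

```javascript
// Characters that must be escaped outside code entities in MarkdownV2,
// plus the backslash itself.
const MDV2_SPECIALS = /[_*\[\]()~`>#+\-=|{}.!\\]/g;

function escapeMarkdownV2(text) {
  return text.replace(MDV2_SPECIALS, "\\$&");
}

// Inside code spans and fences only backtick and backslash are escaped.
function escapeCode(text) {
  return text.replace(/[`\\]/g, "\\$&");
}

// Hypothetical entry point: protects fenced blocks and inline code spans,
// escapes everything else.
function renderMarkdownV2(source) {
  // split() with one capture group puts the code segments at odd indexes.
  const parts = source.split(/(```[\s\S]*?```|`[^`\n]+`)/);
  return parts
    .map((part, i) => {
      if (i % 2 === 0) return escapeMarkdownV2(part);
      if (part.startsWith("```")) {
        return "```" + escapeCode(part.slice(3, -3)) + "```";
      }
      return "`" + escapeCode(part.slice(1, -1)) + "`";
    })
    .join("");
}
```

For example, `renderMarkdownV2("a.b (c)")` yields `a\.b \(c\)`, while the contents of a fenced block pass through with only backticks and backslashes escaped.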
4. 4096 chunking: fence-aware splits + document fallback
- Gap: splitMessage hard-splits at exactly maxChars by code point (src/lib.mjs:260-271), which is surrogate-safe but splits mid-word/mid-line/mid-code-fence; the streaming path slices the same way (src/index.mjs:581-582); reply_markup only on the last chunk (src/index.mjs:810-811).
- Design: prefer newline, then space, as split points; close an open triple-backtick fence at the split and reopen it (with language tag) in the next chunk; avoid splitting inside inline code; reserve room for (n/m) part indicators; measure length in UTF-16 code units (Telegram's actual limit unit; we currently count code points). Optional follow-up: very long outputs delivered as a document upload instead of a message stack.
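One possible shape for the fence-aware splitter, sketched under simplifying assumptions: fences are recognised only as lines starting with three backticks, inline code spans and (n/m) indicators are not handled, and all names are hypothetical. JavaScript's `String.prototype.length` already counts UTF-16 code units, so it is used directly as the measure.

```javascript
const FENCE = "```";

// Picks a cut index <= limit: newline first, then space, then a hard cut
// that avoids separating a surrogate pair.
function pickSplit(text, limit) {
  if (text.length <= limit) return text.length;
  const nl = text.lastIndexOf("\n", limit - 1);
  if (nl > 0) return nl + 1; // keep the newline with the left chunk
  const sp = text.lastIndexOf(" ", limit - 1);
  if (sp > 0) return sp + 1;
  const code = text.charCodeAt(limit - 1);
  const isHighSurrogate = code >= 0xd800 && code <= 0xdbff;
  return isHighSurrogate ? limit - 1 : limit;
}

// Tracks whether a fence is open after a chunk; returns the opener line
// (e.g. "```js") when inside a fence, else null.
function fenceStateAfter(chunk, openTag) {
  let state = openTag;
  for (const line of chunk.split("\n")) {
    if (line.startsWith(FENCE)) state = state === null ? line.trimEnd() : null;
  }
  return state;
}

function splitFenceAware(text, maxUnits = 4096) {
  const chunks = [];
  let rest = text;
  let openTag = null;
  while (rest.length > 0) {
    // Reopen the fence (with its language tag) if the previous chunk closed one early.
    const prefix = openTag === null ? "" : openTag + "\n";
    // Reserve room for a possible "\n```" closer.
    const limit = maxUnits - prefix.length - (FENCE.length + 1);
    if (limit < 2) throw new RangeError("maxUnits too small for fence handling");
    const cut = pickSplit(rest, limit);
    const body = rest.slice(0, cut);
    rest = rest.slice(cut);
    openTag = fenceStateAfter(body, openTag);
    const needsClose = openTag !== null && rest.length > 0;
    const suffix = needsClose ? (body.endsWith("\n") ? "" : "\n") + FENCE : "";
    chunks.push(prefix + body + suffix);
  }
  return chunks;
}
```

Because the closer and reopener are budgeted before the cut is chosen, every emitted chunk stays within `maxUnits` and contains a balanced pair of fence markers whenever the source fence was itself closed.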
5. Offset persistence + callback replay protection
- Gap: TELEGRAM_UPDATE_OFFSET is read from env once and advanced only in memory (src/index.mjs:161, 208), so restarts redeliver unconfirmed updates; the offset advances before handling (:208 vs :209-211) so a handler error skips updates. Message dedupe is a 500-entry ring keyed chatId:messageId (src/index.mjs:54-62, 236-237). handleCallbackQuery has no dedupe at all (src/index.mjs:311-344): replayed cw:new/cw:compact/cw:interrupt callbacks re-execute fully (duplicate threads/compactions), and resume stored actions are looked up via getAction and never consumed (src/index.mjs:380). Approval actions alone are protected by single-use takeAction (src/index.mjs:395, 102-109).
- Design: persist updateOffset in the existing threadStore, advancing only after successful handling, or explicitly drop pending updates on fresh start (documented choice, with redelivery accepted only where harmless); dedupe callbacks by update_id; make resume actions one-shot via takeAction like approvals.
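The "advance only after successful handling" rule can be sketched as below. `handleBatch`, `persist`, and `takeOnce` are hypothetical names; `takeOnce` is only a stand-in for the idea behind the existing takeAction, whose actual implementation is not reproduced here. The handler is synchronous for brevity, whereas the bridge's handlers are presumably async.

```javascript
// Processes updates in order. The offset moves (and is persisted) only after
// a handler returns, so a throw leaves the failed update to be redelivered.
function handleBatch(updates, state, handle, persist) {
  for (const update of updates) {
    // Anything below the stored offset was already confirmed: skip replays.
    if (update.update_id < state.offset) continue;
    handle(update);
    // getUpdates expects offset = highest handled update_id + 1.
    state.offset = update.update_id + 1;
    persist(state.offset);
  }
  return state.offset;
}

// One-shot lookup: the stored action is removed as it is read, so a replayed
// callback finds nothing to execute.
function takeOnce(actions, id) {
  const action = actions.get(id);
  actions.delete(id);
  return action;
}
```

Because update IDs increase monotonically, comparing against the persisted offset doubles as the update_id dedupe for both messages and callbacks after a restart.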
6. 409/429/network backoff curves + send-path retry
- Gap: 409 conflict retries forever at a flat 10s (src/index.mjs:214-217); poll errors honor retry_after else flat 3s (src/lib.mjs:301-307); the send path has zero retry: one failed sendMessage throws straight out of telegramApi (src/index.mjs:833-850). No fatal-error taxonomy: a permanently stolen bot token and a transient blip look identical.
- Design: 409 → escalating ladder (e.g. 15/25/35/45/55s, bounded attempts), then a non-retryable fatal with a clear operator message ("another process owns this bot token"); network errors → exponential 5s→60s with bounded attempts, then retryable-fatal so systemd restarts cleanly; send path → retry transient network errors briefly (1s/2s), sleep exactly retry_after on flood (up to 3 attempts), and do not retry generic timeouts (duplicate-send risk). A single failed send must never kill a turn stream.
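The three curves could be expressed as pure delay functions that return `null` when the caller should stop retrying. This is a sketch: the function names, the `kind` field used to tell timeouts from other network failures, and the cap of 8 network attempts are assumptions. The `error_code` and `parameters.retry_after` fields follow the shape of a Bot API error response.

```javascript
const CONFLICT_LADDER_S = [15, 25, 35, 45, 55];

// 409 conflict: fixed ladder, then null (non-retryable fatal).
// attempt is 1-based.
function conflictDelayMs(attempt) {
  const seconds = CONFLICT_LADDER_S[attempt - 1];
  return seconds === undefined ? null : seconds * 1000;
}

// Poll-path network errors: 5s doubling to a 60s cap, then null
// (retryable fatal, e.g. exit and let the supervisor restart).
function networkDelayMs(attempt, maxAttempts = 8) {
  if (attempt > maxAttempts) return null;
  return Math.min(5000 * 2 ** (attempt - 1), 60000);
}

// Send path: attempt is the 1-based number of the attempt that just failed.
function sendRetryDelayMs(err, attempt) {
  if (attempt >= 3) return null; // at most 3 attempts in total
  if (err.error_code === 429 && err.parameters) {
    return err.parameters.retry_after * 1000; // sleep exactly retry_after
  }
  if (err.kind === "timeout") return null; // may have been delivered: no retry
  if (err.kind === "network") return attempt === 1 ? 1000 : 2000;
  return null;
}
```

A `null` from `sendRetryDelayMs` would be logged and swallowed by the streaming loop rather than rethrown, which is how a single failed send avoids killing the turn stream.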
7. Restart reattach UX
- Gap: reattachActiveTurns (src/index.mjs:512-547) loops chats serially, so chat B's "Bridge restarted. Reattaching..." notice waits for chat A's entire reattached turn; a reattach stream and a poll-loop update for the same chat race on patchChat (last writer wins, :573 vs :493-497). patchChat(lastSeq) rewrites the whole JSON state file on every SSE event (src/index.mjs:121-127, 573).
- Design: reattach all chats in parallel through the per-chat turn registry (from the concurrency fix); apply a freshness window so stale turns get a clean "no active turn remains" notice; debounce state-file writes (no write-per-SSE-event).
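A sketch of the two mechanical pieces of this design: a write coalescer for the state file and a parallel fan-out for reattach. All names and the 500ms delay are hypothetical, and the per-chat turn registry and freshness window are left out.

```javascript
const defaultTimers = {
  set: (fn, ms) => setTimeout(fn, ms),
  clear: (handle) => clearTimeout(handle),
};

// Coalesces many schedule() calls into one write of the latest state,
// at most delayMs after the first unsaved change.
function createDebouncedWriter(write, delayMs = 500, timers = defaultTimers) {
  let latest;
  let dirty = false;
  let handle = null;

  function flush() {
    if (handle !== null) {
      timers.clear(handle);
      handle = null;
    }
    if (!dirty) return;
    dirty = false;
    write(latest);
  }

  function schedule(state) {
    latest = state;
    dirty = true;
    if (handle === null) handle = timers.set(flush, delayMs);
  }

  // flush() should also be called on shutdown so the last change is not lost.
  return { schedule, flush };
}

// Starts every chat's reattach at once; one chat failing does not block
// or reject the others.
function reattachAll(chats, reattachOne) {
  return Promise.allSettled(chats.map(async (chat) => reattachOne(chat)));
}
```

With this shape, each SSE event would call `schedule(state)` instead of writing the file directly, and each chat's restart notice goes out as soon as its own `reattachOne` begins rather than after the previous chat finishes.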
Non-goals
Webhook mode (no inbound ports, per the design doc); pairing-code flows (allowlist refusal text is adequate); inbound >4096 user-message handling; draft streaming.
Suggested PR slicing (each independently shippable)
- Correctness first — offset persistence + callback replay protection (item 5). Small, state-layer only, de-risks everything after it.
- Backoff curves + send-path retry + fatal taxonomy (item 6).
- Typing indicator (item 2). Tiny, immediate UX win.
- Progress-edit streaming (item 1). The big UX change; lands on top of the two preceding PRs (backoff/retry and typing indicator).
- Fence-aware chunking (item 4), then MarkdownV2 + tables with plain-text fallback (item 3) — two PRs; the fallback chain ships in the same PR as MDV2 or not at all.
- Reattach UX + state-write debounce (item 7).
Acceptance criteria
- Streaming (1+2): during a turn the chat shows typing within ~2s; a progress message appears once and is edited in place (no message-per-chunk spam); typing pauses while an approval is pending and resumes after; flood errors degrade gracefully to append-only.
- Formatting (3+4): code fences, bold/italic, links, and pipe tables render correctly; an MDV2 parse failure falls back to plain text rather than dropping the message; splits never break a code fence mid-block; chunk lengths measured in UTF-16 code units.
- State safety (5): restarting the bridge neither replays handled prompts/callbacks nor re-executes cw:new/cw:compact/cw:interrupt/resume actions; offset survives restart (or pending updates are explicitly dropped, as a documented choice); a handler error doesn't silently skip later updates.
- Resilience (6): 409 escalates and goes fatal after bounded attempts with a clear operator message; transient network/429 on send retries with bounded backoff; a single failed send never kills a turn stream.
- Reattach (7): after a restart with N chats mid-turn, all N get their notice promptly; state writes are debounced.
- All existing tests pass (node --test, currently 23) plus new unit tests per item (formatting/chunking helpers in src/lib.mjs stay pure/testable).
Related: #2964, #1990, docs/REMOTE_SETUP_DESIGN.md.