telegram-bridge: streaming & resilience hardening — progress-edit, typing, MarkdownV2, chunking, offset/replay safety, backoff #2967

@Hmbown

Summary

The Telegram bridge (integrations/telegram-bridge/src/index.mjs, src/lib.mjs) is a deliberately minimal MVP. This issue folds the production behaviors a long-running phone bridge needs into one incremental workstream (one issue, sliced PRs — avoids issue spam). Depends on the dispatch-concurrency bug fix (#2966) landing first.

This picks up the deferred/partial rows of the "Telegram hardening checklist" in docs/REMOTE_SETUP_DESIGN.md (the doc's own note says "Revisit after remote-setup lands" — this is that revisit; update the table statuses as PRs land):

| Checklist row | Status in doc | Picked up here |
| --- | --- | --- |
| Text batching + progress-edit (edit one msg vs spam) | deferred | Yes (items 1 and 4) |
| MarkdownV2 escaping + table rendering | deferred | Yes (item 3) |
| Network/connect-timeout retry | partial (generic 3s backoff) | Yes (item 6) |
| 409 polling conflict | done (flat 10s forever) | Item 6 adds escalating ladder + fatal taxonomy |
| 429 retry_after | done (poll path only) | Item 6 extends to the send path |
| Webhook mode | out of scope | Still out of scope (non-goal) |

Items

1. Progress-edit streaming (one message, edited in place)

  • Gap: deltas accumulate silently; a flush requires BOTH >3500 chars AND >15s (src/index.mjs:577-585), so any response under ~3500 chars is total silence between "Started turn" and the final dump (src/index.mjs:636). Every flush is a brand-new sendMessage (src/index.mjs:802-815); the only edit helper today is editMessageReplyMarkup (src/index.mjs:825-831).
  • Design: first flush sends the progress message and captures its message_id; subsequent flushes call editMessageText on that same message with a trailing cursor marker; flush when ~0.8s has elapsed or a small content threshold accrues; on flood/rate errors back off adaptively (doubling toward ~10s), and after repeated edit failures degrade gracefully to today's append-only sends; the final edit removes the cursor and becomes the answer.
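
A minimal sketch of the flush and backoff decision described in the design, written as pure functions so it could sit beside the other helpers in src/lib.mjs. Every name here (createProgressState, planFlush, recordSuccess, recordFloodError) and the MIN_NEW_CHARS / MAX_EDIT_FAILURES values are illustrative assumptions, not existing bridge code; only the ~0.8s and ~10s figures come from the design above.

```javascript
// Hypothetical helper names; thresholds other than 0.8s/10s are assumed.
const BASE_INTERVAL_MS = 800; // "~0.8s elapsed"
const MAX_INTERVAL_MS = 10_000; // backoff doubles toward ~10s
const MIN_NEW_CHARS = 40; // assumed "small content threshold"
const MAX_EDIT_FAILURES = 3; // assumed cutoff before degrading to append-only

function createProgressState() {
  return {
    messageId: null, // set once the first sendMessage succeeds
    sentLength: 0, // how much of the buffer the chat has already seen
    lastFlushAt: 0,
    intervalMs: BASE_INTERVAL_MS,
    failures: 0,
    appendOnly: false,
  };
}

// Returns "wait", "send" (first flush), "edit" (in place) or "append" (degraded).
function planFlush(state, bufferedLength, now) {
  const newChars = bufferedLength - state.sentLength;
  if (newChars <= 0) return "wait";
  const backingOff = state.intervalMs > BASE_INTERVAL_MS;
  const timeDue = now - state.lastFlushAt >= state.intervalMs;
  // The content threshold may only short-circuit the timer when not backing off.
  const sizeDue = !backingOff && newChars >= MIN_NEW_CHARS;
  if (!timeDue && !sizeDue) return "wait";
  if (state.appendOnly) return "append";
  return state.messageId === null ? "send" : "edit";
}

// A successful send or edit resets the backoff and the failure count.
function recordSuccess(state, messageId, bufferedLength, now) {
  return {
    ...state,
    messageId,
    sentLength: bufferedLength,
    lastFlushAt: now,
    intervalMs: BASE_INTERVAL_MS,
    failures: 0,
  };
}

// A flood/rate error doubles the interval (capped) and may trip append-only mode.
function recordFloodError(state, now) {
  const failures = state.failures + 1;
  return {
    ...state,
    failures,
    lastFlushAt: now,
    intervalMs: Math.min(state.intervalMs * 2, MAX_INTERVAL_MS),
    appendOnly: state.appendOnly || failures >= MAX_EDIT_FAILURES,
  };
}
```

The streaming loop would call planFlush on each delta and perform the returned action; keeping the state transitions pure makes the backoff curve unit-testable without a Telegram round-trip.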

2. Typing indicator cadence

  • Gap: the bridge never calls sendChatAction — zero feedback during the (common) silent phase of a turn.
  • Design: refresh typing every ~2s (Telegram expires the state at ~5s), each call time-bounded (~1.5s) so a slow round-trip can't break the cadence; re-fire after every send (delivery clears the state); pause while an approval prompt is pending and resume on resolution (hooks into the approval.required path, src/index.mjs:587-624).
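
The cadence above could be sketched as a small controller. createTypingLoop and its option names are hypothetical; sendTyping(signal) is assumed to wrap a sendChatAction call with action "typing" and to honor the AbortSignal it receives. The timer functions are injectable purely so the loop can be unit-tested without real timers.

```javascript
// Hypothetical helper; `sendTyping(signal)` is assumed to wrap sendChatAction.
function createTypingLoop(sendTyping, options = {}) {
  const {
    intervalMs = 2000, // refresh well inside Telegram's ~5s expiry
    timeoutMs = 1500, // bound each call so a slow round-trip can't stall the cadence
    setIntervalFn = setInterval, // injectable for tests
    clearIntervalFn = clearInterval,
  } = options;
  let timer = null;
  let paused = false;

  const fire = () => {
    if (paused) return;
    try {
      // Typing is best-effort: swallow sync throws, rejections and timeouts alike.
      Promise.resolve(sendTyping(AbortSignal.timeout(timeoutMs))).catch(() => {});
    } catch {
      /* ignore */
    }
  };

  return {
    start() {
      if (timer !== null) return;
      fire();
      timer = setIntervalFn(fire, intervalMs);
    },
    nudge: fire, // call after every send, since delivery clears the typing state
    pause() {
      paused = true; // approval prompt pending
    },
    resume() {
      paused = false; // approval resolved
      fire();
    },
    stop() {
      if (timer !== null) clearIntervalFn(timer);
      timer = null;
    },
  };
}
```

In this shape the approval.required path would call pause() when the prompt is shown and resume() on resolution, and the turn's cleanup would call stop().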

3. MarkdownV2 escaping + table rendering

  • Gap: sendMessage sets no parse_mode (src/index.mjs:805-809); model markdown and code fences render as raw text on a phone.
  • Design: escape pipeline with placeholder protection for code spans/fences; rewrite GFM pipe tables (MarkdownV2 has no tables) into a bold-heading + bullet layout; and — critically — a send-time fallback chain: attempt MarkdownV2, on a parse error resend as stripped plain text. The fallback is what makes MDV2 shippable despite its parser-escaping bug class (the reason the design doc deferred it). Helpers live in src/lib.mjs, pure and unit-tested.
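
A reduced sketch of the placeholder-protected escape step and the send-time fallback. It is deliberately narrower than the design: it protects code fences and inline code spans and escapes everything else as literal text, so bold/italic/link handling and the table rewrite are not shown. toMarkdownV2, sendWithFallback, and the send(payload) callback are hypothetical names, and matching on "can't parse entities" in err.message is an assumption about how the bridge surfaces Telegram's 400 response.

```javascript
const FENCE = "`".repeat(3); // built indirectly so the sample can live inside a fenced block
const FENCE_RE = new RegExp(FENCE + "([^\\n`]*)\\n([\\s\\S]*?)" + FENCE, "g");
const NUL = "\u0000"; // placeholder delimiter; stripped from input first

// Characters MarkdownV2 requires escaping outside code entities.
function escapeText(text) {
  return text.replace(/[_*\[\]()~`>#+\-=|{}.!\\]/g, "\\$&");
}

// Inside code spans and fences only backtick and backslash need escaping.
function escapeCode(code) {
  return code.replace(/[`\\]/g, "\\$&");
}

function toMarkdownV2(input) {
  const stashed = [];
  const stash = (rendered) => {
    stashed.push(rendered);
    return NUL + (stashed.length - 1) + NUL;
  };
  let text = input.split(NUL).join("");
  // 1. Lift fences, then inline spans, out of the text.
  text = text.replace(FENCE_RE, (_, lang, body) =>
    stash(FENCE + lang + "\n" + escapeCode(body) + FENCE)
  );
  text = text.replace(/`([^`\n]+)`/g, (_, body) => stash("`" + escapeCode(body) + "`"));
  // 2. Escape what remains, then 3. restore the protected pieces.
  text = escapeText(text);
  return text.replace(/\u0000(\d+)\u0000/g, (_, i) => stashed[Number(i)]);
}

// Attempt MarkdownV2; on a parse error resend the raw text with no parse_mode.
async function sendWithFallback(send, text) {
  try {
    return await send({ text: toMarkdownV2(text), parse_mode: "MarkdownV2" });
  } catch (err) {
    if (!/can't parse entities/i.test(String(err && err.message))) throw err;
    return send({ text });
  }
}
```

Because the fallback only triggers on parse failures, other errors (network, flood) still propagate to whatever retry policy the send path applies.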

4. 4096 chunking: fence-aware splits + document fallback

  • Gap: splitMessage hard-splits at exactly maxChars by code point (src/lib.mjs:260-271) — surrogate-safe but splits mid-word/mid-line/mid-code-fence; the streaming path slices the same way (src/index.mjs:581-582); reply_markup only on the last chunk (src/index.mjs:810-811).
  • Design: prefer newline, then space, as split points; close an open triple-backtick fence at the split and reopen it (with language tag) in the next chunk; avoid splitting inside inline code; reserve room for (n/m) part indicators; measure length in UTF-16 code units (Telegram's actual limit unit — we currently count code points). Optional follow-up: very long outputs delivered as a document upload instead of a message stack.
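
A sketch of the core split rule: prefer a newline, then a space, close an open fence at the cut and reopen it with its language tag in the next chunk. splitForTelegram and openFenceAt are hypothetical names. JavaScript string length and slice already operate on UTF-16 code units, so the only extra care is not cutting between the halves of a surrogate pair. The (n/m) indicators, inline-code protection, and the document fallback are left out, and the sketch always reserves room for a closing fence even when none is needed.

```javascript
const FENCE = "`".repeat(3); // built indirectly so the sample can live inside a fenced block
const CLOSE = "\n" + FENCE;

// Returns the opening fence line (fence plus language tag) if `text` ends inside a block.
function openFenceAt(text) {
  let opener = null;
  for (const line of text.split("\n")) {
    if (line.startsWith(FENCE)) opener = opener === null ? line : null;
  }
  return opener;
}

// `limit` is measured in UTF-16 code units, like String.prototype.length.
function splitForTelegram(text, limit = 4096) {
  const chunks = [];
  let rest = text;
  let prefix = ""; // fence reopened at the top of the next chunk
  while (prefix.length + rest.length > limit) {
    const budget = limit - prefix.length - CLOSE.length;
    if (budget < 2) throw new RangeError("limit too small for fence overhead");
    let cut = rest.lastIndexOf("\n", budget);
    if (cut <= 0) cut = rest.lastIndexOf(" ", budget);
    let skip = 1; // drop the separator we split on
    if (cut <= 0) {
      // No separator in range: hard cut, but never between surrogate halves.
      cut = budget;
      skip = 0;
      const prev = rest.charCodeAt(cut - 1);
      if (prev >= 0xd800 && prev <= 0xdbff) cut -= 1;
    }
    let body = prefix + rest.slice(0, cut);
    rest = rest.slice(cut + skip);
    const opener = openFenceAt(body);
    if (opener !== null) body += CLOSE;
    prefix = opener !== null ? opener + "\n" : "";
    chunks.push(body);
  }
  if (rest.length > 0) chunks.push(prefix + rest);
  return chunks;
}
```

Each emitted chunk is at most limit code units long, because the closing fence is budgeted before the cut point is chosen.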

5. Offset persistence + callback replay protection

  • Gap: TELEGRAM_UPDATE_OFFSET is read from env once and advanced only in memory (src/index.mjs:161, 208) — restarts redeliver unconfirmed updates; the offset advances before handling (:208 vs :209-211) so a handler error skips updates. Message dedupe is a 500-entry ring keyed chatId:messageId (src/index.mjs:54-62, 236-237). handleCallbackQuery has no dedupe at all (src/index.mjs:311-344): replayed cw:new/cw:compact/cw:interrupt callbacks re-execute fully (duplicate threads/compactions), and resume stored actions are looked up via getAction and never consumed (src/index.mjs:380). Approval actions alone are protected by single-use takeAction (src/index.mjs:395, 102-109).
  • Design: persist updateOffset in the existing threadStore, advancing only after successful handling — or explicitly drop pending updates on fresh start (documented choice, with redelivery accepted only where harmless); dedupe callbacks by update_id; make resume actions one-shot via takeAction like approvals.
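
One way the "advance only after successful handling" rule and one-shot actions could look. createUpdateTracker, drainUpdates, takeOnce, and the persistOffset callback are hypothetical; persistOffset stands in for a write to the existing threadStore. The sketch assumes updates are handled in order, in which case a persisted monotonic offset also acts as update_id dedupe for callbacks. It takes the persist branch of the design, not the drop-on-fresh-start branch.

```javascript
// Hypothetical helpers; only the update_id/offset semantics come from the Bot API.
function createUpdateTracker(initialOffset = 0) {
  let offset = initialOffset;
  return {
    get offset() {
      return offset;
    },
    // Anything below the offset was already handled (possibly before a restart).
    shouldHandle(update) {
      return update.update_id >= offset;
    },
    // getUpdates expects the highest handled update_id plus one.
    markHandled(update) {
      offset = Math.max(offset, update.update_id + 1);
      return offset;
    },
  };
}

// Handle first, advance and persist second: a throwing handler leaves the
// offset untouched, so the update is redelivered instead of skipped.
async function drainUpdates(updates, tracker, handle, persistOffset) {
  for (const update of updates) {
    if (!tracker.shouldHandle(update)) continue;
    await handle(update);
    await persistOffset(tracker.markHandled(update));
  }
}

// One-shot lookup: a replayed callback finds nothing the second time.
function takeOnce(actions, id) {
  const action = actions.get(id);
  actions.delete(id);
  return action;
}
```

A consequence of this policy is that an update whose handler always throws would be retried indefinitely, so a real implementation would likely need a bounded retry or a skip-with-log escape hatch.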

6. 409/429/network backoff curves + send-path retry

  • Gap: 409 conflict retries forever at a flat 10s (src/index.mjs:214-217); poll errors honor retry_after else flat 3s (src/lib.mjs:301-307); the send path has zero retry — one failed sendMessage throws straight out of telegramApi (src/index.mjs:833-850). No fatal-error taxonomy: a permanently stolen bot token and a transient blip look identical.
  • Design: 409 → escalating ladder (e.g. 15/25/35/45/55s, bounded attempts), then a non-retryable fatal with a clear operator message ("another process owns this bot token"); network errors → exponential 5s→60s with bounded attempts, then retryable-fatal so systemd restarts cleanly; send path → retry transient network errors briefly (1s/2s), sleep exactly retry_after on flood (up to 3 attempts), and do not retry generic timeouts (duplicate-send risk). A single failed send must never kill a turn stream.
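
The three curves can be expressed as pure delay functions, sketched below. The function names, the { kind, retryAfter } error shape, and NETWORK_MAX_ATTEMPTS are assumptions for illustration; the ladder values, the 5s to 60s range, the 1s/2s send retries, and the three-attempt flood bound come from the design above. A null return means "stop retrying", and the caller decides what that implies: non-retryable fatal for 409, retryable fatal for network errors, and dropping just that send for the send path.

```javascript
const CONFLICT_LADDER_MS = [15, 25, 35, 45, 55].map((s) => s * 1000);
const NETWORK_MAX_ATTEMPTS = 8; // assumed bound; the design only says "bounded"

// 409 polling conflict: walk the ladder, then give up (non-retryable fatal).
function conflictDelay(attempt) {
  return attempt < CONFLICT_LADDER_MS.length ? CONFLICT_LADDER_MS[attempt] : null;
}

// Poll-path network errors: exponential from 5s, capped at 60s.
function networkDelay(attempt) {
  if (attempt >= NETWORK_MAX_ATTEMPTS) return null;
  return Math.min(5000 * 2 ** attempt, 60_000);
}

// Send path. `error.kind` is a hypothetical classification done by the caller:
// "flood" (429 with retryAfter in seconds), "network", or "timeout".
function sendRetryDelay(error, attempt) {
  if (error.kind === "flood") {
    return attempt < 3 ? error.retryAfter * 1000 : null; // sleep exactly retry_after
  }
  if (error.kind === "network") {
    return attempt < 2 ? 1000 * 2 ** attempt : null; // 1s, then 2s
  }
  return null; // generic timeouts are not retried (duplicate-send risk)
}
```

Keeping these as plain functions of the attempt number means the curves can be asserted directly in node --test without sleeping.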

7. Restart reattach UX

  • Gap: reattachActiveTurns (src/index.mjs:512-547) loops chats serially — chat B's "Bridge restarted. Reattaching..." notice waits for chat A's entire reattached turn; a reattach stream and a poll-loop update for the same chat race on patchChat (last writer wins, :573 vs :493-497). patchChat(lastSeq) rewrites the whole JSON state file on every SSE event (src/index.mjs:121-127, 573).
  • Design: reattach all chats in parallel through the per-chat turn registry (from the concurrency fix); apply a freshness window so stale turns get a clean "no active turn remains" notice; debounce state-file writes (no write-per-SSE-event).
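
A sketch of the write coalescing and the parallel reattach. createDebouncedWriter, reattachAll, and the write / reattachOne callbacks are hypothetical, and the 500ms default is an assumed value. The writer does not reset its timer on each call, which is a deliberate variation on a classic debounce: under a steady stream of SSE events a resetting timer might never fire, whereas this version writes the latest state at most once per window.

```javascript
// Coalesces many schedule() calls into one write of the most recent state.
function createDebouncedWriter(write, options = {}) {
  const {
    delayMs = 500, // assumed window
    setTimeoutFn = setTimeout, // injectable for tests
    clearTimeoutFn = clearTimeout,
  } = options;
  let pending = null;
  let timer = null;

  const flush = () => {
    if (timer !== null) clearTimeoutFn(timer);
    timer = null;
    if (pending === null) return;
    const state = pending;
    pending = null;
    write(state);
  };

  return {
    schedule(state) {
      pending = state; // later calls overwrite earlier ones
      if (timer === null) timer = setTimeoutFn(flush, delayMs);
    },
    flush, // call on shutdown so the last lastSeq is not lost
  };
}

// Reattach every chat concurrently; one failing chat does not block the others.
async function reattachAll(chats, reattachOne) {
  const results = await Promise.allSettled(chats.map((chat) => reattachOne(chat)));
  return results.filter((r) => r.status === "rejected").length;
}
```

With this shape patchChat(lastSeq) would call schedule() on every SSE event, and the shutdown path would call flush() before exiting.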

Non-goals

Webhook mode (no inbound ports, per the design doc); pairing-code flows (allowlist refusal text is adequate); inbound >4096 user-message handling; draft streaming.

Suggested PR slicing (each independently shippable)

  1. Correctness first — offset persistence + callback replay protection (item 5). Small, state-layer only, de-risks everything after it.
  2. Backoff curves + send-path retry + fatal taxonomy (item 6).
  3. Typing indicator (item 2). Tiny, immediate UX win.
  4. Progress-edit streaming (item 1). The big UX change; lands on top of 2/3.
  5. Fence-aware chunking (item 4), then MarkdownV2 + tables with plain-text fallback (item 3) — two PRs; the fallback chain ships in the same PR as MDV2 or not at all.
  6. Reattach UX + state-write debounce (item 7).

Acceptance criteria

  • Streaming (1+2): during a turn the chat shows typing within ~2s; a progress message appears once and is edited in place (no message-per-chunk spam); typing pauses while an approval is pending and resumes after; flood errors degrade gracefully to append-only.
  • Formatting (3+4): code fences, bold/italic, links, and pipe tables render correctly; an MDV2 parse failure falls back to plain text rather than dropping the message; splits never break a code fence mid-block; chunk lengths measured in UTF-16 code units.
  • State safety (5): restarting the bridge neither replays handled prompts/callbacks nor re-executes cw:new/cw:compact/cw:interrupt/resume actions; offset survives restart (or pending updates are explicitly dropped — documented choice); a handler error doesn't silently skip later updates.
  • Resilience (6): 409 escalates and goes fatal after bounded attempts with a clear operator message; transient network/429 on send retries with bounded backoff; a single failed send never kills a turn stream.
  • Reattach (7): after a restart with N chats mid-turn, all N get their notice promptly; state writes are debounced.
  • All existing tests pass (node --test, currently 23) plus new unit tests per item (formatting/chunking helpers in src/lib.mjs stay pure/testable).

Related: #2964, #1990, docs/REMOTE_SETUP_DESIGN.md.

Labels: bug, documentation, enhancement
