feat(campaign-sender): exponential backoff retry on transient dispatch failures (EVO-1219)#54
Conversation
…atcher (EVO-1219)
Transient dispatch failures (HTTP 5xx or network errors) retry with
min(base*2^n, cap) backoff — DISPATCH_RETRY_COUNT retries after the initial
attempt (default 3; 0 is a valid no-retries override), base/cap via
DISPATCH_BACKOFF_BASE_MS / DISPATCH_BACKOFF_CAP_MS. 4xx fails immediately
(http_4xx: <status>); exhaustion aggregates every attempt's error code.
Each attempt re-acquires a rate-limit token (4.4 contract) and backoff
sleeps are sliced ~1s polling the sender's shouldAbort probe, so a
paused/stopped campaign stops retrying without failing the contact.
The dispatch runs with the new optional transportRetries=1, disabling the
legacy in-transport network/429 quick retries on this path (the legacy
CampaignMessageSenderService keeps the default 3); a 429 therefore surfaces
as a plain 4xx with no Retry-After wait, per the story's scope. Also fixes
the transport abort timer leaking for 30s after a rejected fetch.
New metrics: evo_flow_dispatch_retries_total{mode,attempt} and
evo_flow_dispatch_terminal_failures_total{mode,reason}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…gration smoke (EVO-1219) The sender hands the dispatcher an abort probe backed by the existing TTL status cache; an aborted retry ends the page without marking the contact (row stays PENDING, the message is acked, resume reprocesses it — same semantic the 4.3 pre-dispatch recheck established). Failed outcomes carry the policy's reason strings into the structured markFailed log. Integration spec runs the REAL CrmInboxDispatcher + BatchDispatcherService against an ephemeral http server answering 503,503,200 — delivery after two backoff retries with measured exponential timing, and fail-fast on 422. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Reviewer's GuideImplements an exponential backoff retry loop around CRM inbox dispatches in BatchDispatcherService, adds abort-aware backoff using a campaign-status probe, centralizes retry ownership by passing transportRetries=1 into CrmInboxDispatcher, records new Prometheus metrics for retries and terminal failures, and adapts CampaignSenderService and tests to the new outcome model while fixing a transport timeout leak. Sequence diagram for BatchDispatcherService exponential backoff and abort-aware retriessequenceDiagram
actor CampaignSenderService
participant BatchDispatcherService
participant RateLimiterService
participant CrmInboxDispatcher
participant ShouldAbortProbe
CampaignSenderService->>BatchDispatcherService: dispatch(BatchDispatchInput { shouldAbort })
loop attempts 0..retryCount
BatchDispatcherService->>RateLimiterService: acquireWithBackpressure(inboxId)
RateLimiterService-->>BatchDispatcherService: token acquired
alt attempt > 0
BatchDispatcherService->>BatchDispatcherService: retriesTotal.labels(mode, attempt).inc()
end
BatchDispatcherService->>CrmInboxDispatcher: dispatch(contact, inboxId, transportRetries=1)
CrmInboxDispatcher-->>BatchDispatcherService: DispatchResult
alt result.success
BatchDispatcherService-->>CampaignSenderService: { kind: sent, result }
else result.statusCode in 400..499
BatchDispatcherService->>BatchDispatcherService: terminalFailuresTotal.labels(mode, http_4xx).inc()
BatchDispatcherService-->>CampaignSenderService: { kind: failed, reason: http_4xx, statusCode }
else [transient failure]
BatchDispatcherService->>BatchDispatcherService: accumulate attemptErrors
alt attempt == retryCount
BatchDispatcherService->>BatchDispatcherService: terminalFailuresTotal.labels(mode, exhausted_retries).inc()
BatchDispatcherService-->>CampaignSenderService: { kind: failed, reason: dispatch_exhausted_retries, statusCode }
else attempt < retryCount
BatchDispatcherService->>BatchDispatcherService: compute backoff delay
BatchDispatcherService->>BatchDispatcherService: logger.warn(dispatch retry scheduled)
alt shouldAbort defined
BatchDispatcherService->>ShouldAbortProbe: backoffOrAbort(delayMs, shouldAbort)
loop sliced sleeps
ShouldAbortProbe-->>BatchDispatcherService: shouldAbort()
end
alt abortReason
BatchDispatcherService->>BatchDispatcherService: logger.warn(aborted: campaign paused/stopped during retry)
BatchDispatcherService-->>CampaignSenderService: { kind: aborted, abortReason }
end
else no shouldAbort
BatchDispatcherService->>BatchDispatcherService: sleep(delayMs)
end
end
end
end
Sequence diagram for transportRetries ownership between BatchDispatcherService and CrmInboxDispatchersequenceDiagram
participant BatchDispatcherService
participant CrmInboxDispatcher
participant executeRequest
participant fetch
Note over BatchDispatcherService,CrmInboxDispatcher: New campaign-sender path
BatchDispatcherService->>CrmInboxDispatcher: dispatch(..., transportRetries=1)
CrmInboxDispatcher->>executeRequest: executeRequest(url, options, transportRetries=1)
executeRequest->>fetch: fetch(url, options)
fetch-->>executeRequest: Response or network error
executeRequest-->>CrmInboxDispatcher: single DispatchResult (no internal retries)
Note over CrmInboxDispatcher,fetch: Legacy behavior when transportRetries omitted
BatchDispatcherService-x CrmInboxDispatcher: dispatch(..., transportRetries omitted)
CrmInboxDispatcher->>executeRequest: executeRequest(url, options, maxRetries=3)
alt [network error]
executeRequest->>fetch: fetch(url, options) retry
else [HTTP 429 and attempt < maxRetries]
executeRequest->>fetch: fetch(url, options) after Retry-After
end
executeRequest-->>CrmInboxDispatcher: DispatchResult after quick retries
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- BatchDispatcherService currently creates and retrieves prom-client counters directly from the global register in its constructor; consider injecting a metrics facade or the counters themselves so the service is easier to reuse in other contexts and less coupled to prom-client’s global state.
- The retry configuration (DISPATCH_RETRY_COUNT/BACKOFF_*_MS) is read from process.env on every dispatch call; if these values are effectively static at runtime, you could cache them on construction to avoid repeated parsing and make the behavior less sensitive to mid-process env changes.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- BatchDispatcherService currently creates and retrieves prom-client counters directly from the global register in its constructor; consider injecting a metrics facade or the counters themselves so the service is easier to reuse in other contexts and less coupled to prom-client’s global state.
- The retry configuration (DISPATCH_RETRY_COUNT/BACKOFF_*_MS) is read from process.env on every dispatch call; if these values are effectively static at runtime, you could cache them on construction to avoid repeated parsing and make the behavior less sensitive to mid-process env changes.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
dpaes
left a comment
There was a problem hiding this comment.
✅ Code review — APPROVED (EVO-1219 · story 4.5)
Exponential-backoff retry engine in BatchDispatcherService with pause/stop-aware abort, transportRetries=1 (single retry owner), two prom metrics + a real-HTTP integration smoke. Logic verified against the code; all 5 ACs met under the coherent "1 initial + N retries" convention.
Acceptance criteria
- AC1 503×all, RETRY_COUNT=3 → 4 attempts, backoffs 1s/2s/4s, FAILED aggregating errors ✓ (see note)
- AC2 400 → immediate FAILED
http_4xx: 400; 429 same no-retry path ✓ - AC3 503→200 → SENT +
dispatch_retries_total{attempt=1}✓ - AC4 stop during backoff → abort, ACK (no FAILED), log
aborted: campaign stopped during retry✓ (interruptible ~1s slices) - AC5 RETRY_COUNT=2 / BASE=500 → ~1.5s ✓ (real-timer spec)
Non-blocking
- [low] AC1 literal wording ("tries 3 times" / 3-element reason) reconciles to the implemented "1 initial + N retries" convention → RETRY_COUNT=3 = 4 attempts / 4-element array. Coherent, documented, tested; matches AC5/AC3. One-line note on the card so QA validates the convention.
- [low] FAILED reason is log-only (CampaignContact has no reason column) — matches the 4.3 design.
- [nit] verify the campaign-sender process sets
RUN_MODEso the{mode}metric label isn'tunknown.
Note: branch was 1 commit behind develop (EVO-1255 merged today) but GitHub reports clean and the PR's diff does not touch EVO-1255's files — squash is safe and does not revert it. Approving.
Summary
BatchDispatcherService(story 4.5): HTTP 5xx and network errors retry withmin(base·2^n, cap)backoff —DISPATCH_RETRY_COUNTretries after the initial attempt (default 3;0is a valid no-retries override),DISPATCH_BACKOFF_BASE_MS/DISPATCH_BACKOFF_CAP_MS(1s/30s defaults). 4xx fails immediately withhttp_4xx: <status>; exhaustion marks FAILED with a reason aggregating every attempt's error code (dispatch_exhausted_retries: ["503","DISPATCH_ERROR","504","503"]).shouldAbortprobe (backed by 4.3's TTL status cache) — a paused/stopped campaign stops retrying within ~1s, the contact stays PENDING (no FAILED), the page is acked and resume reprocesses it (same semantic 4.3 established for its pre-dispatch recheck). Log wording per AC4:aborted: campaign stopped during retry.transportRetries: 1throughChannelDispatchInput, disabling the legacy in-transport quick network/429 retries on this path — a 429 surfaces as a plain 4xx with no Retry-After wait, per the story's scope. The legacyCampaignMessageSenderServicekeeps the default (3) untouched.RateLimitedErrormid-retry still requeues the page (and is not counted as an executed retry).evo_flow_dispatch_retries_total{mode,attempt}andevo_flow_dispatch_terminal_failures_total{mode,reason}.clearTimeoutnow infinally) — pre-existing, exposed by the new network-error tests.Acceptance Criteria
http_4xx: 400, single attemptdispatch_retries_total{attempt=1}incrementedDISPATCH_RETRY_COUNT=2+DISPATCH_BACKOFF_BASE_MS=500→ FAILED after measured ~1.5s (real-timer test)DISPATCH_RETRY_COUNTcounts retries (1 initial + N), matching the scope bullet's 4-error example and AC5's timing math — AC1's "3 erros" phrasing is the inconsistent corner of the card, resolved with the assignee during discovery.Validation
npx tsc --noEmit— cleannpx jest src/runners/campaign-sender/ src/shared/messaging-channels/— 6 suites, 55 tests (16 new), no leaked handlesnpx eslint— 0 errors on touched files (1 pre-existing error remains incampaigns-send.consumer.spec.ts:74, untouched, shipped with 4.3)CrmInboxDispatcher+BatchDispatcherServiceagainst an ephemeral http server answering 503,503,200 → delivered after 2 backoff retries with measured exponential timing; 422 → fail fast (the story's E2E smoke, repeatable)Design notes
consumer_timeout(default 30min), redelivery is safe: tabular idempotency skips processed rows and the EVO-1677 delivery-limit backstop bounds the loop.Linked Issue
🤖 Generated with Claude Code
Summary by Sourcery
Introduce an exponential backoff retry policy for campaign dispatches, with pause/stop-aware abort semantics and single-owner retry control in the CRM inbox transport.
New Features:
Bug Fixes:
Enhancements:
Tests: