Skip to content

feat(campaign-sender): exponential backoff retry on transient dispatch failures (EVO-1219)#54

Merged
dpaes merged 2 commits into
developfrom
feat/EVO-1219
Jun 11, 2026
Merged

feat(campaign-sender): exponential backoff retry on transient dispatch failures (EVO-1219)#54
dpaes merged 2 commits into
developfrom
feat/EVO-1219

Conversation

@nickoliveira23

@nickoliveira23 nickoliveira23 commented Jun 10, 2026

Copy link
Copy Markdown

Summary

  • Retry engine in BatchDispatcherService (story 4.5): HTTP 5xx and network errors retry with min(base·2^n, cap) backoff — DISPATCH_RETRY_COUNT retries after the initial attempt (default 3; 0 is a valid no-retries override), DISPATCH_BACKOFF_BASE_MS/DISPATCH_BACKOFF_CAP_MS (1s/30s defaults). 4xx fails immediately with http_4xx: <status>; exhaustion marks FAILED with a reason aggregating every attempt's error code (dispatch_exhausted_retries: ["503","DISPATCH_ERROR","504","503"]).
  • Pause/stop-aware backoff: sleeps are sliced ~1s polling a sender-provided shouldAbort probe (backed by 4.3's TTL status cache) — a paused/stopped campaign stops retrying within ~1s, the contact stays PENDING (no FAILED), the page is acked and resume reprocesses it (same semantic 4.3 established for its pre-dispatch recheck). Log wording per AC4: aborted: campaign stopped during retry.
  • Single retry owner: the dispatch passes the new optional transportRetries: 1 through ChannelDispatchInput, disabling the legacy in-transport quick network/429 retries on this path — a 429 surfaces as a plain 4xx with no Retry-After wait, per the story's scope. The legacy CampaignMessageSenderService keeps the default (3) untouched.
  • Rate-limiter contract preserved (4.4): every attempt re-acquires a token; a RateLimitedError mid-retry still requeues the page (and is not counted as an executed retry).
  • New metrics: evo_flow_dispatch_retries_total{mode,attempt} and evo_flow_dispatch_terminal_failures_total{mode,reason}.
  • Drive-by hygiene fix: the transport's 30s abort timer leaked after a rejected fetch (clearTimeout now in finally) — pre-existing, exposed by the new network-error tests.

Acceptance Criteria

  • ✅ AC1 persistent 503 with default count → 4 attempts, capped backoff sequence asserted via log meta, FAILED aggregating all attempt errors
  • ✅ AC2 HTTP 400 → immediate FAILED http_4xx: 400, single attempt
  • ✅ AC3 503→200 → SENT after 1 retry, dispatch_retries_total{attempt=1} incremented
  • ✅ AC4 campaign stopped during backoff → retries abort, no FAILED, page acked, literal log
  • ✅ AC5 DISPATCH_RETRY_COUNT=2 + DISPATCH_BACKOFF_BASE_MS=500 → FAILED after measured ~1.5s (real-timer test)
  • Note: DISPATCH_RETRY_COUNT counts retries (1 initial + N), matching the scope bullet's 4-error example and AC5's timing math — AC1's "3 erros" phrasing is the inconsistent corner of the card, resolved with the assignee during discovery.

Validation

  • npx tsc --noEmit — clean
  • npx jest src/runners/campaign-sender/ src/shared/messaging-channels/ — 6 suites, 55 tests (16 new), no leaked handles
  • npx eslint — 0 errors on touched files (1 pre-existing error remains in campaigns-send.consumer.spec.ts:74, untouched, shipped with 4.3)
  • Integration smoke in CI: real CrmInboxDispatcher + BatchDispatcherService against an ephemeral http server answering 503,503,200 → delivered after 2 backoff retries with measured exponential timing; 422 → fail fast (the story's E2E smoke, repeatable)

Design notes

  • A full CRM outage now burns a page slowly by design (~7s+ per contact before FAILED with defaults). If a huge page under outage ever exceeds RabbitMQ's consumer_timeout (default 30min), redelivery is safe: tabular idempotency skips processed rows and the EVO-1677 delivery-limit backstop bounds the loop.

Linked Issue

  • EVO-1219 — next link of the Campaign Dispatch Pipeline chain (blocks EVO-1220 [4.6])

🤖 Generated with Claude Code

Summary by Sourcery

Introduce an exponential backoff retry policy for campaign dispatches, with pause/stop-aware abort semantics and single-owner retry control in the CRM inbox transport.

New Features:

  • Add configurable exponential backoff retries for transient dispatch failures in BatchDispatcherService, returning structured sent/failed/aborted outcomes.
  • Expose a transportRetries option on channel dispatches so the campaign-sender path can own all retry behavior.
  • Record dispatch retry attempts and terminal failure reasons via new Prometheus counters.

Bug Fixes:

  • Fix a CRM inbox dispatcher timeout leak by always clearing the abort timer even when fetch rejects.

Enhancements:

  • Wire CampaignSenderService to the new dispatch outcome model and propagate campaign pause/stop status into the retry loop via an abort probe.

Tests:

  • Extend unit and integration coverage for BatchDispatcherService, CampaignSenderService, and CrmInboxDispatcher to cover retry policy, abort behavior, metrics, and transportRetries handling.

nickoliveira23 and others added 2 commits June 10, 2026 20:12
…atcher (EVO-1219)

Transient dispatch failures (HTTP 5xx or network errors) retry with
min(base*2^n, cap) backoff — DISPATCH_RETRY_COUNT retries after the initial
attempt (default 3; 0 is a valid no-retries override), base/cap via
DISPATCH_BACKOFF_BASE_MS / DISPATCH_BACKOFF_CAP_MS. 4xx fails immediately
(http_4xx: <status>); exhaustion aggregates every attempt's error code.
Each attempt re-acquires a rate-limit token (4.4 contract) and backoff
sleeps are sliced ~1s polling the sender's shouldAbort probe, so a
paused/stopped campaign stops retrying without failing the contact.

The dispatch runs with the new optional transportRetries=1, disabling the
legacy in-transport network/429 quick retries on this path (the legacy
CampaignMessageSenderService keeps the default 3); a 429 therefore surfaces
as a plain 4xx with no Retry-After wait, per the story's scope. Also fixes
the transport abort timer leaking for 30s after a rejected fetch.

New metrics: evo_flow_dispatch_retries_total{mode,attempt} and
evo_flow_dispatch_terminal_failures_total{mode,reason}.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…gration smoke (EVO-1219)

The sender hands the dispatcher an abort probe backed by the existing TTL
status cache; an aborted retry ends the page without marking the contact
(row stays PENDING, the message is acked, resume reprocesses it — same
semantic the 4.3 pre-dispatch recheck established). Failed outcomes carry
the policy's reason strings into the structured markFailed log.

Integration spec runs the REAL CrmInboxDispatcher + BatchDispatcherService
against an ephemeral http server answering 503,503,200 — delivery after two
backoff retries with measured exponential timing, and fail-fast on 422.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@sourcery-ai

sourcery-ai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Reviewer's Guide

Implements an exponential backoff retry loop around CRM inbox dispatches in BatchDispatcherService, adds abort-aware backoff using a campaign-status probe, centralizes retry ownership by passing transportRetries=1 into CrmInboxDispatcher, records new Prometheus metrics for retries and terminal failures, and adapts CampaignSenderService and tests to the new outcome model while fixing a transport timeout leak.

Sequence diagram for BatchDispatcherService exponential backoff and abort-aware retries

sequenceDiagram
  actor CampaignSenderService
  participant BatchDispatcherService
  participant RateLimiterService
  participant CrmInboxDispatcher
  participant ShouldAbortProbe

  CampaignSenderService->>BatchDispatcherService: dispatch(BatchDispatchInput { shouldAbort })
  loop attempts 0..retryCount
    BatchDispatcherService->>RateLimiterService: acquireWithBackpressure(inboxId)
    RateLimiterService-->>BatchDispatcherService: token acquired
    alt attempt > 0
      BatchDispatcherService->>BatchDispatcherService: retriesTotal.labels(mode, attempt).inc()
    end
    BatchDispatcherService->>CrmInboxDispatcher: dispatch(contact, inboxId, transportRetries=1)
    CrmInboxDispatcher-->>BatchDispatcherService: DispatchResult

    alt result.success
      BatchDispatcherService-->>CampaignSenderService: { kind: sent, result }
    else result.statusCode in 400..499
      BatchDispatcherService->>BatchDispatcherService: terminalFailuresTotal.labels(mode, http_4xx).inc()
      BatchDispatcherService-->>CampaignSenderService: { kind: failed, reason: http_4xx, statusCode }
    else [transient failure]
      BatchDispatcherService->>BatchDispatcherService: accumulate attemptErrors
      alt attempt == retryCount
        BatchDispatcherService->>BatchDispatcherService: terminalFailuresTotal.labels(mode, exhausted_retries).inc()
        BatchDispatcherService-->>CampaignSenderService: { kind: failed, reason: dispatch_exhausted_retries, statusCode }
      else attempt < retryCount
        BatchDispatcherService->>BatchDispatcherService: compute backoff delay
        BatchDispatcherService->>BatchDispatcherService: logger.warn(dispatch retry scheduled)
        alt shouldAbort defined
          BatchDispatcherService->>ShouldAbortProbe: backoffOrAbort(delayMs, shouldAbort)
          loop sliced sleeps
            ShouldAbortProbe-->>BatchDispatcherService: shouldAbort()
          end
          alt abortReason
            BatchDispatcherService->>BatchDispatcherService: logger.warn(aborted: campaign paused/stopped during retry)
            BatchDispatcherService-->>CampaignSenderService: { kind: aborted, abortReason }
          end
        else no shouldAbort
          BatchDispatcherService->>BatchDispatcherService: sleep(delayMs)
        end
      end
    end
  end
Loading

Sequence diagram for transportRetries ownership between BatchDispatcherService and CrmInboxDispatcher

sequenceDiagram
  participant BatchDispatcherService
  participant CrmInboxDispatcher
  participant executeRequest
  participant fetch

  Note over BatchDispatcherService,CrmInboxDispatcher: New campaign-sender path
  BatchDispatcherService->>CrmInboxDispatcher: dispatch(..., transportRetries=1)
  CrmInboxDispatcher->>executeRequest: executeRequest(url, options, transportRetries=1)
  executeRequest->>fetch: fetch(url, options)
  fetch-->>executeRequest: Response or network error
  executeRequest-->>CrmInboxDispatcher: single DispatchResult (no internal retries)

  Note over CrmInboxDispatcher,fetch: Legacy behavior when transportRetries omitted
  BatchDispatcherService-x CrmInboxDispatcher: dispatch(..., transportRetries omitted)
  CrmInboxDispatcher->>executeRequest: executeRequest(url, options, maxRetries=3)
  alt [network error]
    executeRequest->>fetch: fetch(url, options) retry
  else [HTTP 429 and attempt < maxRetries]
    executeRequest->>fetch: fetch(url, options) after Retry-After
  end
  executeRequest-->>CrmInboxDispatcher: DispatchResult after quick retries
Loading

File-Level Changes

Change Details Files
Introduce an exponential backoff, abort-aware retry loop and new metrics in BatchDispatcherService, changing its return type from raw DispatchResult to a structured outcome union.
  • Wrap each dispatch attempt in a for-loop that retries transient failures (HTTP 5xx or network errors) up to DISPATCH_RETRY_COUNT with base*2^n capped at DISPATCH_BACKOFF_CAP_MS between attempts.
  • Classify 4xx responses as permanent failures with immediate termination and reason http_4xx: , while aggregating all transient attempt error codes into dispatch_exhausted_retries: [...] on exhaustion.
  • Add support for an optional shouldAbort() probe polled in sliced sleeps (≤1s) during backoff to abort retries when campaigns pause/stop, returning an aborted outcome that leaves contacts PENDING.
  • Create Prometheus counters evo_flow_dispatch_retries_total{mode,attempt} and evo_flow_dispatch_terminal_failures_total{mode,reason}, wiring them into the retry/terminal-failure paths and resetting them per-test in specs.
  • Refactor dispatch into dispatchOnce with transportRetries=1 and add env helpers/env-backed configuration for retry count and backoff timings, with DISPATCH_RETRY_COUNT=0 as a no-retry override.
src/runners/campaign-sender/services/batch-dispatcher.service.ts
src/runners/campaign-sender/services/batch-dispatcher.service.spec.ts
src/runners/campaign-sender/batch-dispatcher.retry.integration.spec.ts
Adapt CampaignSenderService to consume BatchDispatcherService’s new outcome model and supply a status-based abort probe, updating tests accordingly.
  • Change send() to handle sent, failed, and aborted outcomes from the dispatcher, marking contacts SENT only on sent, FAILED with the dispatcher’s reason/status on failed, and leaving contacts PENDING while marking the batch result as aborted on aborted.
  • Wire a shouldAbort probe into BatchDispatchInput that uses the existing TTL-backed campaign status cache to return paused or stopped, and log aborted pages with the current campaign status.
  • Update unit tests to mock the new outcome union, assert proper FAILED categorization for exhausted retries, verify aborted behavior leaves contacts PENDING and short-circuits the page, and verify shouldAbort reflects status changes after cache expiry.
src/runners/campaign-sender/services/campaign-sender.service.ts
src/runners/campaign-sender/services/campaign-sender.service.spec.ts
Extend CrmInboxDispatcher and ChannelDispatchInput to support configurable transport-level retries and fix an abort timeout leak on network errors.
  • Add an optional transportRetries field to ChannelDispatchInput and thread it to CrmInboxDispatcher.executeRequest to control the internal network/429 quick-retry loop.
  • Use input.transportRetries=1 from BatchDispatcherService to disable in-transport retries on the campaign-sender path, making the sender’s exponential loop the single retry owner while preserving legacy behavior when transportRetries is omitted.
  • Wrap fetch in a try/finally that always clears the per-request AbortController timeout, preventing a 30s timer leak on rejected fetch calls.
  • Add tests verifying that transportRetries=1 does not retry network errors, surfaces 429 without waiting on Retry-After, and that legacy quick retries still occur when transportRetries is absent.
src/shared/messaging-channels/interfaces/channel-dispatcher.interface.ts
src/shared/messaging-channels/dispatchers/crm-inbox.dispatcher.ts
src/shared/messaging-channels/dispatchers/crm-inbox.dispatcher.spec.ts

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • BatchDispatcherService currently creates and retrieves prom-client counters directly from the global register in its constructor; consider injecting a metrics facade or the counters themselves so the service is easier to reuse in other contexts and less coupled to prom-client’s global state.
  • The retry configuration (DISPATCH_RETRY_COUNT/BACKOFF_*_MS) is read from process.env on every dispatch call; if these values are effectively static at runtime, you could cache them on construction to avoid repeated parsing and make the behavior less sensitive to mid-process env changes.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- BatchDispatcherService currently creates and retrieves prom-client counters directly from the global register in its constructor; consider injecting a metrics facade or the counters themselves so the service is easier to reuse in other contexts and less coupled to prom-client’s global state.
- The retry configuration (DISPATCH_RETRY_COUNT/BACKOFF_*_MS) is read from process.env on every dispatch call; if these values are effectively static at runtime, you could cache them on construction to avoid repeated parsing and make the behavior less sensitive to mid-process env changes.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@dpaes dpaes left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Code review — APPROVED (EVO-1219 · story 4.5)

Exponential-backoff retry engine in BatchDispatcherService with pause/stop-aware abort, transportRetries=1 (single retry owner), two prom metrics + a real-HTTP integration smoke. Logic verified against the code; all 5 ACs met under the coherent "1 initial + N retries" convention.

Acceptance criteria

  • AC1 503×all, RETRY_COUNT=3 → 4 attempts, backoffs 1s/2s/4s, FAILED aggregating errors ✓ (see note)
  • AC2 400 → immediate FAILED http_4xx: 400; 429 same no-retry path ✓
  • AC3 503→200 → SENT + dispatch_retries_total{attempt=1}
  • AC4 stop during backoff → abort, ACK (no FAILED), log aborted: campaign stopped during retry ✓ (interruptible ~1s slices)
  • AC5 RETRY_COUNT=2 / BASE=500 → ~1.5s ✓ (real-timer spec)

Non-blocking

  • [low] AC1 literal wording ("tries 3 times" / 3-element reason) reconciles to the implemented "1 initial + N retries" convention → RETRY_COUNT=3 = 4 attempts / 4-element array. Coherent, documented, tested; matches AC5/AC3. One-line note on the card so QA validates the convention.
  • [low] FAILED reason is log-only (CampaignContact has no reason column) — matches the 4.3 design.
  • [nit] verify the campaign-sender process sets RUN_MODE so the {mode} metric label isn't unknown.

Note: branch was 1 commit behind develop (EVO-1255 merged today) but GitHub reports clean and the PR's diff does not touch EVO-1255's files — squash is safe and does not revert it. Approving.

@dpaes dpaes merged commit 50b45c5 into develop Jun 11, 2026
2 checks passed
@dpaes dpaes deleted the feat/EVO-1219 branch June 11, 2026 01:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants