Skip to content

feat(sdks/go): stream reconnect and lifecycle for listeners#4257

Draft
igor-kupczynski wants to merge 8 commits into
mainfrom
stream-reconnect-lifecycle
Draft

feat(sdks/go): stream reconnect and lifecycle for listeners#4257
igor-kupczynski wants to merge 8 commits into
mainfrom
stream-reconnect-lifecycle

Conversation

@igor-kupczynski

@igor-kupczynski igor-kupczynski commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Description

Implements PR #2 for issue #4228: stream reconnect and lifecycle for the legacy Go SDK (pkg/client). This PR stacks on #4240 (rubylabs-audit-go-sdk), which added REST read retries and shared backoff primitives; here we extend that foundation to gRPC stream listeners.

Long-lived gRPC subscriptions (workflow runs, durable events, worker actions, metadata streams) now use explicit app-level reconnect with full-jitter backoff instead of relying on the unary gRPC interceptor or ad-hoc hot loops. Sync paths (initial subscribe / AddWorkflowRun) use bounded reconnect; background listen loops reconnect unboundedly while the listener remains open. Permanent errors and consecutive no-progress failures still surface to callers.

Dependencies: #4240 (rubylabs-audit-go-sdk base branch)

Fixes #4228

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Documentation change (pure documentation change)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (non-breaking changes to code which doesn't change any behaviour)
  • CI (any automation pipeline changes)
  • Chore (changes which are not directly related to any business logic)
  • Test changes (add, refactor, improve or change a test)
  • This change requires a documentation update

What's Changed

  • Add pkg/client/retry/stream.go: stream backoff (1s base, 30s cap, full jitter), cancellable sleep helpers, and ClassifyStreamError
  • Split synchronous vs background reconnect paths (retrySubscribeSync / retrySubscribeBackground, mirrored for durable events) with singleflight coalescing
  • Introduce dedicated listener lifecycle contexts cancelled only by Close()
  • Migrate workflow runs listener, durable event listener, and worker action listener to the new reconnect model
  • Worker action listener: unbounded reconnect on transient Recv failures; permanent errors still surface on errCh; V2→V1 fallback preserved
  • Disable gRPC stream interceptor retry on worker Listen/ListenV2; add app-level reconnect for StreamByAdditionalMetadata (including EOF)
  • Durable task listener: cancellable retry.Sleep during reconnect backoff
  • Workflow.Result(): fail fast when AddWorkflowRun fails instead of blocking indefinitely
  • Review / Bugbot follow-ups: stop unbounded background reconnect on NoProgress errors, pass caller context into initial subscribe/listen setup, deterministic transient-failure reconnect tests via stream sleep test hook
  • Worker reconnect logging: structured warn/debug logs when the action listener enters and recovers from reconnect mode

Checklist

Changes have been:

  • Tested (unit, integration, or manually with steps specified)
  • Linted and formatted
  • Documented (where applicable)
  • Added to CHANGELOG (where applicable) -- see Keep a Changelog

Testing

  • go test ./pkg/client/retry ./pkg/client -count=1
  • Stream backoff / classifier unit tests (pkg/client/retry/stream_test.go)
  • Workflow listener: bounded sync subscribe, background reconnect past sync cap, EOF / no-handler guard, permanent error stop, handler replay, stale generation, grpc_retry.Disable() verification
  • Durable listener: bounded AddSignal, permanent error stop, reconnect during retry backoff
  • Worker dispatcher: >5 transient Recv failures, V2→V1 fallback, V1 Unimplemented terminal, ctx cancel
  • StreamByAdditionalMetadata recv reconnect (including EOF)
  • Durable task listener cancellation during sleep
  • Workflow.Result() fail-fast when subscribe fails

Related

Remaining risks

  • Unknown stream errors now stop / no-progress instead of hot-looping — may surface latent bugs sooner
  • Workflow.Result() has no caller ctx; relies on bounded sync reconnect only
  • Worker no longer terminates after 5 transient retries (intentional; documented in code)

🤖 AI Disclosure

@vercel

vercel Bot commented Jun 22, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hatchet-docs Ready Ready Preview, Comment Jun 30, 2026 5:08pm

Request Review

@github-actions github-actions Bot added the engine Related to the core Hatchet engine label Jun 22, 2026
@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Optional test failure: The load-deadlock job failed on this PR (optimistic-scheduling=false). This check is non-mandatory and does not block merging, but may be worth investigating. View logs

@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Optional test failure: The load-deadlock job failed on this PR (optimistic-scheduling=true). This check is non-mandatory and does not block merging, but may be worth investigating. View logs

@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Optional test failure: The load-deadlock job failed on this PR (optimistic-scheduling=true). This check is non-mandatory and does not block merging, but may be worth investigating. View logs

@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Optional test failure: The load-deadlock job failed on this PR (optimistic-scheduling=false). This check is non-mandatory and does not block merging, but may be worth investigating. View logs

@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Optional test failure: The load-deadlock job failed on this PR (optimistic-scheduling=true). This check is non-mandatory and does not block merging, but may be worth investigating. View logs

@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Optional test failure: The load-deadlock job failed on this PR (optimistic-scheduling=false). This check is non-mandatory and does not block merging, but may be worth investigating. View logs

@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Optional test failure: The load-deadlock job failed on this PR (optimistic-scheduling=true). This check is non-mandatory and does not block merging, but may be worth investigating. View logs

@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Optional test failure: The load-deadlock job failed on this PR (optimistic-scheduling=false). This check is non-mandatory and does not block merging, but may be worth investigating. View logs

@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Optional test failure: The load-deadlock job failed on this PR (optimistic-scheduling=true). This check is non-mandatory and does not block merging, but may be worth investigating. View logs

@github-actions

Copy link
Copy Markdown
Contributor

⚠️ Optional test failure: The load-deadlock job failed on this PR (optimistic-scheduling=false). This check is non-mandatory and does not block merging, but may be worth investigating. View logs

Base automatically changed from rubylabs-audit-go-sdk to main June 24, 2026 10:04
Share listener reconnect state across stream consumers and split worker action listening into smaller lifecycle helpers.
Move duplicated workflow and durable event listen loops into the shared reconnect helper and cover the behavior once at the policy layer.
@github-actions

Copy link
Copy Markdown
Contributor

Benchmark results

goos: linux
goarch: amd64
pkg: github.com/hatchet-dev/hatchet/pkg/scheduling/v1
cpu: AMD Ryzen 9 7950X3D 16-Core Processor          
              │ /tmp/old.txt │         /tmp/new.txt         │
              │    sec/op    │   sec/op     vs base         │
RateLimiter-8    48.80µ ± 8%   48.96µ ± 7%  ~ (p=1.000 n=6)

              │ /tmp/old.txt │         /tmp/new.txt          │
              │     B/op     │     B/op      vs base         │
RateLimiter-8   137.7Ki ± 0%   137.7Ki ± 0%  ~ (p=0.303 n=6)

              │ /tmp/old.txt │          /tmp/new.txt          │
              │  allocs/op   │  allocs/op   vs base           │
RateLimiter-8    1.022k ± 0%   1.022k ± 0%  ~ (p=1.000 n=6) ¹
¹ all samples are equal

Compared against main (6b176e1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

engine Related to the core Hatchet engine

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Go SDK Retries

1 participant