Disclosure: this report was drafted with AI assistance (Claude). All quoted code, file paths, and proto definitions were verified against current main at filing time. The reproduction, environment data, and binary-symbol investigation come from a live self-hosted deployment.
Summary
The Workers tab on a self-hosted cluster shows worker entries with status "Running" for workers whose process has been dead for days. The UI renders worker.status (the last value the SDK reported) directly, without consulting heartbeat freshness. Crashed / OOM-killed / forcibly terminated workers therefore stay "Running" forever, because no graceful-shutdown handler ran to flip their status to SHUTDOWN.
Verified against current main
The row component derives the badge with no liveness check:
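And worker-status.svelte takes the status literal as a prop and renders a HeartBeat indicator whenever it equals 'Running', with no freshness threshold anywhere.
Meanwhile the proto carries the data needed to compute liveness (temporal/api/worker/v1/message.proto):
Neither field is referenced by the workers-table render path.
Why status alone is insufficient
The WorkerStatus enum in temporal/api/enums/v1/common.proto has only four values: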
The SDK transitions RUNNING → SHUTTING_DOWN → SHUTDOWN during a graceful shutdown path. Hard crashes, OOM kills, kill -9, docker kill, host failures, network partitions — none of these run the shutdown path, so the last persisted status stays RUNNING. There is no enum value, and no client-side derivation, that can express "alive record but the worker stopped reporting." That is the gap.
Environment
temporalio Python 1.27.2
temporalio/auto-setup image, single-node docker-compose
Steps to reproduce
Run a self-hosted cluster with frontend.workerHeartbeatsEnabled=true and frontend.listWorkersEnabled=true.
Start a worker. Confirm it appears in the Workers tab with status Running.
Kill the worker (docker compose stop <worker>, kill -9 <pid>, or just stop the container).
Wait. Refresh the Workers tab after 1 hour, 1 day, 2 days.
The dead worker still shows Running.
Observed behaviour
Status badge: Running
Worker entry persists across the visible heartbeat-record lifetime (multi-day in our case).
Multiple zombies accumulate when workers use PID-based identities ({pid}@{host}) and the container restarts — each restart adds a new "Running" row without removing the previous one.
Example — single live worker, but nine "Running" entries from prior PIDs across the last 48 hours:
Status Identity Last Heartbeat
Running 16579@worker-host 2026-05-28 08:55
Running 48561@worker-host 2026-05-28 10:37
Running 68895@worker-host 2026-05-28 23:09 <-- only live one
Running 38935@worker-host 2026-05-27 16:29
Running 9512@worker-host 2026-05-28 00:23
...
Ground truth from the matching service (real pollers) shows only one is alive:
$ temporal task-queue describe --task-queue workflows --namespace example
Pollers:
BuildID TaskQueueType Identity LastAccessTime
UNVERSIONED workflow 68895@worker-host 43 seconds ago
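The manual cross-check above (Workers tab rows versus real pollers) can be sketched as a small filter. This is only an illustration of the comparison, not code from the UI: the WorkerRow shape and the findZombieRows name are hypothetical, and the identities are taken from the example rows listed above.

```typescript
// Hypothetical row shape for illustration; the UI's real types may differ.
interface WorkerRow {
  identity: string;
  status: string;
}

// Return rows that claim "Running" but whose identity is absent from the
// poller list reported by `temporal task-queue describe`.
function findZombieRows(rows: WorkerRow[], livePollerIdentities: string[]): WorkerRow[] {
  const live = new Set(livePollerIdentities);
  return rows.filter((row) => row.status === "Running" && !live.has(row.identity));
}

// The five rows visible in the example table above.
const exampleRows: WorkerRow[] = [
  { identity: "16579@worker-host", status: "Running" },
  { identity: "48561@worker-host", status: "Running" },
  { identity: "68895@worker-host", status: "Running" },
  { identity: "38935@worker-host", status: "Running" },
  { identity: "9512@worker-host", status: "Running" },
];

// Only 68895@worker-host appears in the Pollers output.
const zombieRows = findZombieRows(exampleRows, ["68895@worker-host"]);
```

With these inputs, zombieRows holds the four rows whose PIDs no longer poll, which is the set an operator currently has to work out by hand.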
Expected behaviour
The Workers tab should derive a displayed liveness independent of the persisted WorkerStatus enum, using heartbeat_time (or elapsed_since_last_heartbeat, which the server already computes). Suggested rule:
Condition | Displayed status
worker.status == SHUTTING_DOWN / SHUTDOWN | passthrough (current behaviour)
worker.status == RUNNING AND elapsed_since_last_heartbeat <= 2 × interval | [label missing in source]
worker.status == RUNNING AND 2 × interval < elapsed <= grace_window | [label missing in source]
worker.status == RUNNING AND elapsed > grace_window | Unreachable (or hidden by default)
A sensible default for 2 × interval is 120s (matching the matching-service poller TTL); a default grace_window of 1h keeps the tab clean during incident response.
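A minimal sketch of the suggested rule as a pure function, using the defaults proposed above (120s and 1h). Several things here are assumptions rather than facts about the UI: the function and type names are hypothetical, the status is assumed to arrive as the string 'Running', and the labels "Running" (fresh band) and "Stale" (between 2 × interval and grace_window) are placeholders chosen for illustration. Only "Unreachable" and the passthrough behaviour come from the proposal itself.

```typescript
// Thresholds from the proposal: "2 × interval" and "grace_window", in seconds.
interface LivenessThresholds {
  freshSeconds: number;
  graceSeconds: number;
}

// Defaults suggested in the issue text: 120s and 1h.
const DEFAULT_THRESHOLDS: LivenessThresholds = { freshSeconds: 120, graceSeconds: 3600 };

// Derive the label to display from the reported status and heartbeat age.
// "Stale" is a placeholder label (assumption), not an existing UI value.
function deriveDisplayedStatus(
  reportedStatus: string,
  elapsedSeconds: number,
  thresholds: LivenessThresholds = DEFAULT_THRESHOLDS,
): string {
  // Anything other than Running is passed through unchanged.
  if (reportedStatus !== "Running") return reportedStatus;
  if (elapsedSeconds <= thresholds.freshSeconds) return "Running";
  if (elapsedSeconds <= thresholds.graceSeconds) return "Stale";
  return "Unreachable";
}

// A worker last heard from 43 seconds ago versus one silent for two days.
const liveLabel = deriveDisplayedStatus("Running", 43);
const deadLabel = deriveDisplayedStatus("Running", 2 * 24 * 3600);
```

Keeping the derivation separate from the badge component would let the thresholds be tuned, or the "Unreachable" rows be hidden by default, without touching how the badge renders.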
At minimum, surface heartbeat_time / elapsed_since_last_heartbeat as a column so operators can spot zombies without manual cross-checking against temporal task-queue describe.
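For the column itself, something along these lines would be enough to make a two-day-old heartbeat stand out. It is a sketch under assumptions: the helper names are hypothetical, the heartbeat is assumed to be available as a Date, and the "ago" formatting is one possible choice rather than an existing UI convention.

```typescript
// Seconds between the last heartbeat and "now" (both passed in for testability).
function elapsedSecondsSince(heartbeat: Date, now: Date): number {
  return (now.getTime() - heartbeat.getTime()) / 1000;
}

// Render a heartbeat age as a short relative string for a table cell.
function formatElapsed(elapsedSeconds: number): string {
  // Negative or non-finite ages indicate clock skew or a missing timestamp.
  if (!Number.isFinite(elapsedSeconds) || elapsedSeconds < 0) return "unknown";
  const seconds = Math.floor(elapsedSeconds);
  if (seconds < 60) return `${seconds}s ago`;
  const minutes = Math.floor(seconds / 60);
  if (minutes < 60) return `${minutes}m ago`;
  const hours = Math.floor(minutes / 60);
  if (hours < 48) return `${hours}h ago`;
  return `${Math.floor(hours / 24)}d ago`;
}

// Example: a heartbeat 43 seconds before the reference time.
const age = elapsedSecondsSince(
  new Date("2026-05-28T23:09:00Z"),
  new Date("2026-05-28T23:09:43Z"),
);
const cellText = formatElapsed(age);
```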
Optional follow-up (server-side)
The narrower fix is purely UI. A more thorough fix would extend WorkerStatus with a WORKER_STATUS_STALE value the server sets when elapsed_since_last_heartbeat exceeds a configurable threshold — but that's a larger conversation and not strictly required to resolve this issue.
Workarounds I used
Use stable identity= strings on each Worker(...) so restarts overwrite the same record instead of accumulating new ones. Reduces growth but doesn't fix the misleading "Running" label.
Treat temporal task-queue describe's Pollers[] as the source of truth for liveness. The Workers tab is not used for monitoring.
Why this matters
Operators look at the Workers tab to answer "is anything running, and what?". A page that confidently reports "Running" for dead workers is actively misleading during incident response — engineers waste time investigating phantom workers, or worse, assume coverage exists when it doesn't.
Related
#570 (Workers list on workflow page is misleading). Distinct surface (the per-workflow Workers sub-tab, not the top-level navigation tab) and a different complaint (relevance, not liveness), but adjacent enough to cross-link.
A delete-worker-modal.svelte exists under src/lib/components/workers/ but does not appear to be wired into the workers-table render path on main. If there's an in-progress operator action for manual cleanup, this issue is complementary — manual deletion doesn't replace automatic staleness signalling.
Related server-side ask
File separately at https://github.com/temporalio/temporal
The server has no exposed dynamic config knob to bound how long heartbeat records persist after a worker stops reporting. I searched the server binary symbols:
Only frontend.WorkerHeartbeatsEnabled and frontend.ListWorkersEnabled exist — no *TTL, no *Retention, no *Expiry. An operator-facing retention key (e.g. matching.workerHeartbeatRetention) would let clusters bound the zombie window even without the UI fix above.