Workers tab shows crashed workers as "Running" indefinitely #3475

@Tobias-Mann

Disclosure: this report was drafted with AI assistance (Claude). All quoted code, file paths, and proto definitions were verified against current main at filing time. The reproduction, environment data, and binary-symbol investigation come from a live self-hosted deployment.

Summary

The Workers tab on a self-hosted cluster shows worker entries with status "Running" for workers whose process has been dead for days. The UI renders worker.status (the last value the SDK reported) directly, without consulting heartbeat freshness. Crashed / OOM-killed / forcibly terminated workers therefore stay "Running" forever, because no graceful-shutdown handler ran to flip their status to SHUTDOWN.

Verified against current main

The row component derives the badge with no liveness check:

// src/lib/components/workers/workers-table/workers-table-row.svelte
const status = $derived(toWorkerStatusReadable(worker?.status));

And worker-status.svelte takes the status literal as a prop and renders a HeartBeat indicator whenever it equals 'Running' — no freshness threshold anywhere.

Meanwhile the proto carries the data needed to compute liveness (temporal/api/worker/v1/message.proto):

google.protobuf.Timestamp heartbeat_time = 10;
google.protobuf.Duration  elapsed_since_last_heartbeat = 11;

Neither field is referenced by the workers-table render path.

Why status alone is insufficient

The WorkerStatus enum in temporal/api/enums/v1/common.proto has only four values:

WORKER_STATUS_UNSPECIFIED = 0;
WORKER_STATUS_RUNNING     = 1;
WORKER_STATUS_SHUTTING_DOWN = 2;
WORKER_STATUS_SHUTDOWN    = 3;

The SDK transitions RUNNING → SHUTTING_DOWN → SHUTDOWN only along the graceful shutdown path. Hard crashes, OOM kills, kill -9, docker kill, host failures, and network partitions never run that path, so the last persisted status stays RUNNING. There is no enum value, and no client-side derivation, that can express "the record exists but the worker has stopped reporting." That is the gap.

Environment

  • Temporal Server version: 1.29.3
  • Temporal UI version: 2.49.1
  • SDK: temporalio Python 1.27.2
  • Deployment: self-hosted, Postgres-backed (temporalio/auto-setup image), single-node docker-compose
  • Dynamic config (relevant entries):
    frontend.workerHeartbeatsEnabled:
      - value: true
        constraints: {}
    frontend.listWorkersEnabled:
      - value: true
        constraints: {}

Steps to reproduce

  1. Run a self-hosted cluster with frontend.workerHeartbeatsEnabled=true and frontend.listWorkersEnabled=true.
  2. Start a worker. Confirm it appears in the Workers tab with status Running.
  3. Kill the worker without a graceful shutdown (docker compose stop <worker>, docker kill <container>, or kill -9 <pid>).
  4. Wait. Refresh the Workers tab after 1 hour, 1 day, 2 days.
  5. The dead worker still shows Running.

Observed behaviour

  • Status badge: Running
  • The worker entry persists for as long as the heartbeat record is retained (multiple days in our case).
  • Multiple zombies accumulate when workers use PID-based identities ({pid}@{host}) and the container restarts — each restart adds a new "Running" row without removing the previous one.

Example — single live worker, but nine "Running" entries from prior PIDs across the last 48 hours:

Status   Identity                          Last Heartbeat
Running  16579@worker-host                 2026-05-28 08:55
Running  48561@worker-host                 2026-05-28 10:37
Running  68895@worker-host                 2026-05-28 23:09   <-- only live one
Running  38935@worker-host                 2026-05-27 16:29
Running   9512@worker-host                 2026-05-28 00:23
...

Ground truth from the matching service (real pollers) shows only one is alive:

$ temporal task-queue describe --task-queue workflows --namespace example
Pollers:
    BuildID    TaskQueueType         Identity         LastAccessTime
  UNVERSIONED  workflow       68895@worker-host       43 seconds ago

Expected behaviour

The Workers tab should derive a displayed liveness independent of the persisted WorkerStatus enum, using heartbeat_time (or elapsed_since_last_heartbeat, which the server already computes). Suggested rule:

Condition                                                                    Displayed status
worker.status == SHUTTING_DOWN / SHUTDOWN                                    passthrough (current behaviour)
worker.status == RUNNING AND elapsed_since_last_heartbeat <= 2 × interval    Running
worker.status == RUNNING AND 2 × interval < elapsed <= grace_window          Stale
worker.status == RUNNING AND elapsed > grace_window                          Unreachable (or hidden by default)

A sensible default for the 2 × interval threshold is 120s (the same as the matching-service poller TTL); a default grace_window of 1h keeps the tab clean during incident response.
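The suggested rule could be sketched roughly as follows. This is only an illustration, not the UI's actual code: the type names, the function deriveDisplayedStatus, the millisecond units, and the fallback for a missing heartbeat value are all assumptions made for the example.

```typescript
// Hypothetical types and names; none of these exist in the UI codebase.
type WorkerStatus = 'Unspecified' | 'Running' | 'ShuttingDown' | 'Shutdown';
type DisplayedStatus = WorkerStatus | 'Stale' | 'Unreachable';

interface LivenessThresholds {
  staleAfterMs: number; // the "2 × interval" threshold
  unreachableAfterMs: number; // the "grace_window" threshold
}

// Defaults proposed in this issue: 120s and 1h.
const DEFAULT_THRESHOLDS: LivenessThresholds = {
  staleAfterMs: 120_000,
  unreachableAfterMs: 3_600_000,
};

function deriveDisplayedStatus(
  status: WorkerStatus,
  elapsedSinceLastHeartbeatMs: number | undefined,
  thresholds: LivenessThresholds = DEFAULT_THRESHOLDS,
): DisplayedStatus {
  // Non-running statuses pass through unchanged (current behaviour).
  if (status !== 'Running') return status;
  // Assumption: with no heartbeat data, fall back to the reported status.
  if (elapsedSinceLastHeartbeatMs === undefined) return status;
  if (elapsedSinceLastHeartbeatMs <= thresholds.staleAfterMs) return 'Running';
  if (elapsedSinceLastHeartbeatMs <= thresholds.unreachableAfterMs) return 'Stale';
  return 'Unreachable';
}
```

Under these assumptions, a worker that last reported 43 seconds ago stays Running, while one silent for two days becomes Unreachable. Keeping the thresholds in a parameter would leave room for making them configurable later.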

At minimum, surface heartbeat_time / elapsed_since_last_heartbeat as a column so operators can spot zombies without manual cross-checking against temporal task-queue describe.
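For such a column, a small formatting helper might look like the sketch below. It assumes the UI receives elapsed_since_last_heartbeat as a proto3 JSON Duration string (for example "43s"), which I have not verified for this endpoint; parseDurationMs and formatElapsed are hypothetical names.

```typescript
// Hypothetical helpers; names and input format are assumptions for illustration.

// Parse a proto3 JSON Duration string such as "43s" or "1.5s" into milliseconds.
// Returns undefined when the input does not match that shape.
function parseDurationMs(duration: string): number | undefined {
  const match = /^(-?\d+(?:\.\d+)?)s$/.exec(duration);
  return match ? Math.round(Number(match[1]) * 1000) : undefined;
}

// Render an elapsed time in the coarsest unit that fits, e.g. "43s ago" or "2d ago".
function formatElapsed(ms: number): string {
  const seconds = Math.floor(ms / 1000);
  if (seconds < 60) return `${seconds}s ago`;
  const minutes = Math.floor(seconds / 60);
  if (minutes < 60) return `${minutes}m ago`;
  const hours = Math.floor(minutes / 60);
  if (hours < 24) return `${hours}h ago`;
  return `${Math.floor(hours / 24)}d ago`;
}
```

Even without any status derivation, a "2d ago" cell next to a "Running" badge would make a zombie row easy to spot.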

Optional follow-up (server-side)

The narrower fix is purely UI. A more thorough fix would extend WorkerStatus with a WORKER_STATUS_STALE value the server sets when elapsed_since_last_heartbeat exceeds a configurable threshold — but that's a larger conversation and not strictly required to resolve this issue.

Workarounds I used

  • Use stable identity= strings on each Worker(...) so restarts overwrite the same record instead of accumulating new ones. Reduces growth but doesn't fix the misleading "Running" label.
  • Treat temporal task-queue describe's Pollers[] as the source of truth for liveness. We no longer rely on the Workers tab for monitoring.
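The manual cross-check in the second workaround amounts to a set difference between the Workers tab rows and the poller identities. A minimal sketch, assuming both lists have already been fetched and reduced to plain objects (WorkerRow and findZombies are hypothetical names, not part of any Temporal API):

```typescript
// Hypothetical shape for a Workers tab row; not a real Temporal type.
interface WorkerRow {
  identity: string;
  status: string;
}

// Return the rows that claim to be Running but have no matching live poller.
function findZombies(
  workers: WorkerRow[],
  pollerIdentities: Iterable<string>,
): WorkerRow[] {
  const live = new Set(pollerIdentities);
  return workers.filter((w) => w.status === 'Running' && !live.has(w.identity));
}

// Identities taken from the example in this report.
const rows: WorkerRow[] = [
  { identity: '16579@worker-host', status: 'Running' },
  { identity: '48561@worker-host', status: 'Running' },
  { identity: '68895@worker-host', status: 'Running' },
  { identity: '38935@worker-host', status: 'Running' },
  { identity: '9512@worker-host', status: 'Running' },
];
const zombies = findZombies(rows, ['68895@worker-host']);
```

With the data above, every row except 68895@worker-host is flagged.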

Why this matters

Operators look at the Workers tab to answer "is anything running, and what?". A page that confidently reports "Running" for dead workers is actively misleading during incident response — engineers waste time investigating phantom workers, or worse, assume coverage exists when it doesn't.

Related

  • #570, "Workers list on workflow page is misleading". Distinct surface (the per-workflow Workers sub-tab, not the top-level navigation tab) and a different complaint (relevance, not liveness), but adjacent enough to cross-link.
  • A delete-worker-modal.svelte exists under src/lib/components/workers/ but does not appear to be wired into the workers-table render path on main. If there's an in-progress operator action for manual cleanup, this issue is complementary — manual deletion doesn't replace automatic staleness signalling.

Related server-side ask

To be filed separately at https://github.com/temporalio/temporal.

The server has no exposed dynamic config knob to bound how long heartbeat records persist after a worker stops reporting. I searched the server binary symbols:

strings $(which temporal-server) | grep -oE '(frontend|matching|history|system|worker)\.[a-zA-Z]*([Hh]eartbeat|[Ll]istWorker)[a-zA-Z]*' | sort -u

Only frontend.WorkerHeartbeatsEnabled and frontend.ListWorkersEnabled exist — no *TTL, no *Retention, no *Expiry. An operator-facing retention key (e.g. matching.workerHeartbeatRetention) would let clusters bound the zombie window even without the UI fix above.


Labels: bug
