Workers tab shows crashed workers as "Running" indefinitely #3475

---

<sub>Disclosure: this report was drafted with AI assistance (Claude). All quoted code, file paths, and proto definitions were verified against current `main` at filing time. The reproduction, environment data, and binary-symbol investigation come from a live self-hosted deployment.</sub>

---

## Summary

The Workers tab on a self-hosted cluster shows worker entries with status **"Running"** for workers whose process has been dead for days. The UI renders `worker.status` (the last value the SDK reported) directly, without consulting heartbeat freshness. Crashed / OOM-killed / forcibly terminated workers therefore stay "Running" forever, because no graceful-shutdown handler ran to flip their status to `SHUTDOWN`.

### Verified against current `main`

The row component derives the badge with no liveness check:

```ts
// src/lib/components/workers/workers-table/workers-table-row.svelte
const status = $derived(toWorkerStatusReadable(worker?.status));
```

And [`worker-status.svelte`](https://github.com/temporalio/ui/blob/main/src/lib/components/workers/worker-status.svelte) takes the `status` literal as a prop and renders a HeartBeat indicator whenever it equals `'Running'` — no freshness threshold anywhere.

Meanwhile the proto carries the data needed to compute liveness ([`temporal/api/worker/v1/message.proto`](https://github.com/temporalio/api/blob/master/temporal/api/worker/v1/message.proto)):

```proto
google.protobuf.Timestamp heartbeat_time = 10;
google.protobuf.Duration  elapsed_since_last_heartbeat = 11;
```

Neither field is referenced by the workers-table render path.
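To make the available data concrete: in the proto3 JSON mapping, a `google.protobuf.Timestamp` is serialized as an RFC 3339 string and a `google.protobuf.Duration` as decimal seconds with an `s` suffix (for example `"172800s"`). Below is a minimal sketch of turning either field into elapsed milliseconds, assuming the UI receives the worker record in that JSON form with lowerCamelCase field names. `WorkerHeartbeatFields`, `parseDurationMs`, and `elapsedMs` are hypothetical names, not existing code in the repo. The sketch prefers `heartbeat_time`, since the exact semantics of `elapsed_since_last_heartbeat` should be confirmed against the proto comments before relying on it.

```typescript
// Hypothetical shape: the two liveness fields as they would appear in
// proto3 JSON (assumed lowerCamelCase names).
interface WorkerHeartbeatFields {
  heartbeatTime?: string; // RFC 3339 timestamp, e.g. "2026-05-28T23:09:00Z"
  elapsedSinceLastHeartbeat?: string; // Duration, e.g. "172800s" or "1.500s"
}

// Parse a proto3 JSON Duration string into milliseconds.
// Returns undefined when the string is not in "<seconds>s" form.
function parseDurationMs(value: string): number | undefined {
  const match = /^(-?\d+(?:\.\d{1,9})?)s$/.exec(value);
  return match ? Number(match[1]) * 1000 : undefined;
}

// Elapsed time since the last heartbeat, in milliseconds.
// Uses now - heartbeatTime when available, otherwise the duration field.
function elapsedMs(
  fields: WorkerHeartbeatFields,
  nowMs: number,
): number | undefined {
  if (fields.heartbeatTime !== undefined) {
    const heartbeatMs = Date.parse(fields.heartbeatTime);
    if (!Number.isNaN(heartbeatMs)) return Math.max(0, nowMs - heartbeatMs);
  }
  if (fields.elapsedSinceLastHeartbeat !== undefined) {
    return parseDurationMs(fields.elapsedSinceLastHeartbeat);
  }
  return undefined;
}
```

Note that `heartbeat_time` comes from the worker's clock, so a real implementation would need some tolerance for clock skew.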

### Why `status` alone is insufficient

The `WorkerStatus` enum in [`temporal/api/enums/v1/common.proto`](https://github.com/temporalio/api/blob/master/temporal/api/enums/v1/common.proto) has only four values:

```proto
WORKER_STATUS_UNSPECIFIED = 0;
WORKER_STATUS_RUNNING     = 1;
WORKER_STATUS_SHUTTING_DOWN = 2;
WORKER_STATUS_SHUTDOWN    = 3;
```

The SDK transitions `RUNNING → SHUTTING_DOWN → SHUTDOWN` only during graceful shutdown. Hard crashes, OOM kills, `kill -9`, `docker kill`, host failures, and network partitions never run the shutdown path, so the last persisted status stays `RUNNING`. There is no enum value, and no client-side derivation, that can express "the record exists, but the worker has stopped reporting." That is the gap.

## Environment

- **Temporal Server version:** 1.29.3
- **Temporal UI version:** 2.49.1
- **SDK:** `temporalio` Python 1.27.2
- **Deployment:** self-hosted, Postgres-backed (`temporalio/auto-setup` image), single-node docker-compose
- **Dynamic config (relevant entries):**
  ```yaml
  frontend.workerHeartbeatsEnabled:
    - value: true
      constraints: {}
  frontend.listWorkersEnabled:
    - value: true
      constraints: {}
  ```

## Steps to reproduce

1. Run a self-hosted cluster with `frontend.workerHeartbeatsEnabled=true` and `frontend.listWorkersEnabled=true`.
2. Start a worker. Confirm it appears in the **Workers** tab with status **Running**.
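3. Kill the worker so that its graceful-shutdown path does not run (`kill -9 <pid>` or `docker kill <container>`; `docker compose stop <worker>` also reproduces it if the worker does not handle `SIGTERM`).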
4. Wait. Refresh the Workers tab after 1 hour, 1 day, 2 days.
5. The dead worker still shows **Running**.

## Observed behaviour

- Status badge: **Running**
- The worker entry persists for as long as its heartbeat record remains visible (multiple days in our case).
- Multiple zombies accumulate when workers use PID-based identities (`{pid}@{host}`) and the container restarts — each restart adds a new "Running" row without removing the previous one.

Example — single live worker, but nine "Running" entries from prior PIDs across the last 48 hours:

```
Status   Identity                          Last Heartbeat
Running  16579@worker-host                 2026-05-28 08:55
Running  48561@worker-host                 2026-05-28 10:37
Running  68895@worker-host                 2026-05-28 23:09   <-- only live one
Running  38935@worker-host                 2026-05-27 16:29
Running   9512@worker-host                 2026-05-28 00:23
...
```

Ground truth from the matching service (real pollers) shows only one is alive:

```
$ temporal task-queue describe --task-queue workflows --namespace example
Pollers:
    BuildID    TaskQueueType         Identity         LastAccessTime
  UNVERSIONED  workflow       68895@worker-host       43 seconds ago
```
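The manual cross-check above amounts to a set difference between the identities listed in the Workers tab and the identities of current pollers. A minimal sketch, using the identities from the example output; `WorkerRow` and `findZombies` are hypothetical names, and the input shapes are assumptions rather than the actual API response types:

```typescript
// Hypothetical row shape: one entry from the Workers tab.
interface WorkerRow {
  identity: string;
  status: string;
}

// Return rows that claim to be running but have no matching live poller.
function findZombies(
  rows: WorkerRow[],
  pollerIdentities: string[],
): WorkerRow[] {
  const live = new Set(pollerIdentities);
  return rows.filter(
    (row) => row.status === 'Running' && !live.has(row.identity),
  );
}

// The five rows shown in the Workers tab excerpt.
const rows: WorkerRow[] = [
  { identity: '16579@worker-host', status: 'Running' },
  { identity: '48561@worker-host', status: 'Running' },
  { identity: '68895@worker-host', status: 'Running' },
  { identity: '38935@worker-host', status: 'Running' },
  { identity: '9512@worker-host', status: 'Running' },
];

// The single poller reported by `temporal task-queue describe`.
const zombies = findZombies(rows, ['68895@worker-host']);
console.log(zombies.map((row) => row.identity));
```

Pollers are reported per task queue, so a complete check would need the union of pollers across every task queue the workers serve.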

## Expected behaviour

The Workers tab should derive a *displayed* liveness independent of the persisted `WorkerStatus` enum, using `heartbeat_time` (or `elapsed_since_last_heartbeat`, which the server already computes). Suggested rule:

| Condition | Displayed status |
| --- | --- |
| `worker.status` is `SHUTTING_DOWN` or `SHUTDOWN` | Passthrough (current behaviour) |
| `worker.status == RUNNING` AND `elapsed_since_last_heartbeat <= 2 × interval` | Running |
| `worker.status == RUNNING` AND `2 × interval < elapsed_since_last_heartbeat <= grace_window` | Stale |
| `worker.status == RUNNING` AND `elapsed_since_last_heartbeat > grace_window` | Unreachable (or hidden by default) |

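Here `interval` is the worker heartbeat interval. A sensible default for `2 × interval` is `120s` (in line with the matching-service poller TTL); a default `grace_window` of `1h` keeps recently lost workers visible during incident response while dropping older zombies from the default view.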
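The suggested rule can be sketched as a small pure function. This is an illustration under stated assumptions, not the UI's actual code: `deriveDisplayedStatus`, `LivenessThresholds`, and the status literals other than `'Running'` are hypothetical, and the defaults are the `120s` and `1h` values proposed above.

```typescript
// Assumed readable status literals; only 'Running' is confirmed by the
// component quoted earlier in this report.
type WorkerStatus = 'Unspecified' | 'Running' | 'ShuttingDown' | 'Shutdown';
type DisplayedStatus = WorkerStatus | 'Stale' | 'Unreachable';

interface LivenessThresholds {
  freshMs: number; // "2 × interval" in the table
  graceWindowMs: number; // "grace_window" in the table
}

const DEFAULT_THRESHOLDS: LivenessThresholds = {
  freshMs: 2 * 60 * 1000, // 120s
  graceWindowMs: 60 * 60 * 1000, // 1h
};

function deriveDisplayedStatus(
  status: WorkerStatus,
  elapsedMs: number | undefined,
  thresholds: LivenessThresholds = DEFAULT_THRESHOLDS,
): DisplayedStatus {
  // Non-running statuses pass through unchanged.
  if (status !== 'Running') return status;
  // Without heartbeat data, keep the current behaviour rather than guess.
  if (elapsedMs === undefined) return 'Running';
  if (elapsedMs <= thresholds.freshMs) return 'Running';
  if (elapsedMs <= thresholds.graceWindowMs) return 'Stale';
  return 'Unreachable';
}
```

Keeping the thresholds as a parameter would let the component read them from configuration later without changing the rule itself.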

At minimum, surface `heartbeat_time` / `elapsed_since_last_heartbeat` as a column so operators can spot zombies without manual cross-checking against `temporal task-queue describe`.
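For the column itself, a relative label in the style of the CLI's `LastAccessTime` ("43 seconds ago") would be enough to make zombies stand out. A minimal sketch follows; `formatElapsed` is a hypothetical helper, and an existing date-formatting utility in the UI codebase, if one fits, would be preferable to new code.

```typescript
// Build a label such as "1 hour ago" or "2 days ago".
function agoLabel(count: number, unit: string): string {
  return `${count} ${unit}${count === 1 ? '' : 's'} ago`;
}

// Format elapsed milliseconds using the largest whole unit that fits.
function formatElapsed(elapsedMs: number): string {
  const seconds = Math.floor(elapsedMs / 1000);
  if (seconds < 60) return agoLabel(seconds, 'second');
  const minutes = Math.floor(seconds / 60);
  if (minutes < 60) return agoLabel(minutes, 'minute');
  const hours = Math.floor(minutes / 60);
  if (hours < 24) return agoLabel(hours, 'hour');
  return agoLabel(Math.floor(hours / 24), 'day');
}
```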

### Optional follow-up (server-side)

The narrower fix is purely UI. A more thorough fix would extend `WorkerStatus` with a `WORKER_STATUS_STALE` value the server sets when `elapsed_since_last_heartbeat` exceeds a configurable threshold — but that's a larger conversation and not strictly required to resolve this issue.

## Workarounds I used

- **Use stable `identity=` strings** on each `Worker(...)` so restarts overwrite the same record instead of accumulating new ones. Reduces growth but doesn't fix the misleading "Running" label.
- **Treat `temporal task-queue describe`'s `Pollers[]` as the source of truth** for liveness. I no longer rely on the Workers tab for monitoring.

## Why this matters

Operators look at the Workers tab to answer "is anything running, and what?". A page that confidently reports "Running" for dead workers is actively misleading during incident response — engineers waste time investigating phantom workers, or worse, assume coverage exists when it doesn't.

## Related

- **#570 — Workers list on workflow page is misleading.** Distinct surface (the per-workflow Workers sub-tab, not the top-level navigation tab) and a different complaint (relevance, not liveness), but adjacent enough to cross-link.
- A `delete-worker-modal.svelte` exists under `src/lib/components/workers/` but does not appear to be wired into the workers-table render path on `main`. If there's an in-progress operator action for manual cleanup, this issue is complementary — manual deletion doesn't replace automatic staleness signalling.

---

## Related server-side ask

**File separately at https://github.com/temporalio/temporal**

The server has no exposed dynamic config knob to bound how long heartbeat records persist after a worker stops reporting. I searched the server binary symbols:

```bash
strings $(which temporal-server) | grep -oE '(frontend|matching|history|system|worker)\.[a-zA-Z]*([Hh]eartbeat|[Ll]istWorker)[a-zA-Z]*' | sort -u
```

Only `frontend.WorkerHeartbeatsEnabled` and `frontend.ListWorkersEnabled` exist — no `*TTL`, no `*Retention`, no `*Expiry`. An operator-facing retention key (e.g. `matching.workerHeartbeatRetention`) would let clusters bound the zombie window even without the UI fix above.
