Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions .github/workflows/per-pr.yml
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ jobs:
- os: windows-latest
timeoutMinutes: 20
- os: ubuntu-latest
useWftChunkingV2: "true"
- os: ubuntu-arm
runsOn: ubuntu-24.04-arm64-2-core
- os: macos-arm
Expand Down Expand Up @@ -87,6 +88,8 @@ jobs:
cache-bin: false
save-if: ${{ github.ref == 'refs/heads/main' }}
- run: cargo test -- --include-ignored --nocapture
env:
TEMPORAL_USE_WFT_CHUNKING_V2: ${{ matrix.useWftChunkingV2 || 'false' }}
- name: Find test executable for cgroup tests
id: find-cgroup-test
if: runner.os == 'Linux' && runner.arch == 'X64'
Expand Down Expand Up @@ -162,6 +165,7 @@ jobs:
include:
- os: windows-latest
- os: ubuntu-latest
useWftChunkingV2: "true"
- os: ubuntu-arm
runsOn: ubuntu-24.04-arm64-2-core
- os: macos-arm
Expand Down Expand Up @@ -192,6 +196,8 @@ jobs:
cache-bin: false
save-if: ${{ github.ref == 'refs/heads/main' }}
- run: cargo integ-test
env:
TEMPORAL_USE_WFT_CHUNKING_V2: ${{ matrix.useWftChunkingV2 || 'false' }}

cloud-tests:
if: github.event.pull_request.head.repo.full_name == '' || github.event.pull_request.head.repo.full_name == 'temporalio/sdk-rust'
Expand Down
259 changes: 224 additions & 35 deletions arch_docs/workflow_task_chunking.md
Original file line number Diff line number Diff line change
@@ -1,51 +1,240 @@
# Workflow Task Chunking

## Workflow Tasks Model

Conceptually, Workflow Tasks are usually presented using the following model:

- \[Preceding Events\] (optional)
- WFT Scheduled
- WFT Started
- WFT Completed
- \[Command Events\] (optional)

That is, some events happen that the workflow needs to react to (Preceding Events), so the server
schedules a Workflow Task (WFT Scheduled). A worker then polls the task (WFT Started), and executes
workflow code. The workflow may produce commands, which are sent back to the server when the task
completes (WFT Completed). The server interprets those commands and records the corresponding
history events (Command Events).

Commands and events are both "optional" in the sense that:

- Workflow code, after being woken up, may not do anything and produce no new commands.
- There may be no inbound events, for a few reasons:
1. The workflow might have been running a long-running local activity (LA). In that case the
workflow must "WFT heartbeat" — completing the current WFT with no commands while the LA is
still in progress — in order to avoid timing out the WFT on the server.
2. The workflow might have received an Update, which is not represented as a history event but
rather as a "protocol message" attached to the task.
3. The server can occasionally force a new WFT via some obscure internal APIs.

## Logical Workflow Tasks Model

Though useful as a general mental model, the sequence described previously does not faithfully align
with how Workflow Tasks are actually _processed_ by a Worker.

Indeed, when a given Workflow Task is processed for the first time, that Task has obviously not yet
been completed and no commands have yet been produced, so the corresponding events are not present
in history. The Workflow Task handed to the Worker therefore ends at the WFT Started event. The WFT
Completed and Command Events resulting from that WFT execution only become visible to the Worker on
the _next_ Workflow Task.

Therefore, from the perspective of how the Worker advances through a Workflow history, the
processing unit is better described as follows:

- WFT Completed
- [Command Events] (optional)
- [Preceding Events] (optional)
- WFT Scheduled
- WFT Started

Another particularity of the processing-oriented model is that it is possible, under some
conditions, for multiple consecutive Workflow Tasks to be collapsed into a single processing unit.
It is notably the case of Workflow Tasks that completed unsuccessfully (i.e. failed or timed out),
as well as most Workflow Tasks that belong to a Workflow Task Heartbeat sequence.

This processing-oriented model is what we call **logical Workflow Tasks** (LWFT). Note that LWFT do
not represent a different history: all events in the Workflow Execution history remain present and
in the same order. It only changes how those events are aligned and grouped into processing units as
the Worker advances through history.

// FIXME: What are absolute must-haves for LWFT?

## Workflow History Pagination

There are various ways The Worker obtains Workflow history from the server.

## Chunking history into Logical Workflow Tasks

It is critical that, on replay, Workflow history be split into LWFTs that correctly align with the
LWFTs processed during the original execution. Differences in how events are grouped affect how and
when those events are presented to workflow code. That, in turn, may affect which commands the
workflow produces, or when it produces them, resulting in non-determinism errors.

It is also critical that

Note that we said that LWFTs have to _align_ with the original execution, not that they have to be
_strictly the same_. In some cases, it is

Comment on lines +60 to +77
One source of complexity in Core is the chunking of history into "logical" Workflow Tasks.

Workflow tasks (WFTs) always take the following form in event history:
A LWFT that contains no inbound events, no commands, and no protocol messages is a no-op from the
workflow's perspective. To avoid waking lang for no purpose, Core collapses such empty LWFTs into
the preceding non-empty one.

* \[Preceding Events\] (optional)
* WFT Scheduled
* WFT Started
* WFT Completed
* \[Commands\] (optional)
Deciding which sequences of WFTs are safe to collapse is the job of the WFT chunking algorithm.

In the typical case, the "logical" WFT consists of all the commands from the last workflow task,
any events generated in the interrim, and the scheduled/started preamble. So:
Two chunking algorithms exist: **v1** (the legacy heuristic) and **v2** (the current, more rigorous
algorithm). Both ship in this version of Core. Which one a given workflow uses is recorded as the
`WftChunkingV2` SDK flag on its first `WorkflowTaskCompleted` event. Workflows that don't carry the
flag use v1; workflows that do, use v2. v2 is opt-in for new workflows via the
`TEMPORAL_USE_WFT_CHUNKING_V2` environment variable.

* WFT Completed
* \[Commands\] (optional)
* \[Events\] (optional)
* WFT Scheduled
* WFT Started
## v1 — heartbeat detection heuristic

Commands and events are both "optional" in the sense that:
v1 (`find_end_index_of_next_wft_seq_v1`) walks history events linearly. At each
`WorkflowTaskStarted`, it peeks two events ahead and applies a simple heuristic to
decide whether the current WFT is a "WFT heartbeat" and should therefore be collapsed
into the next one. In essence:

```rust
if !saw_command && next_next_event.event_type() == EventType::WorkflowTaskScheduled {
// This WFT contained no commands, and history immediately rolls into the next
// WFTScheduled with no intervening events. Treat it as a heartbeat: don't
// conclude the current LWFT here, keep walking.
continue;
}
```

`saw_command` tracks whether any _command event_ — i.e. an event generated by a worker
command, like `ActivityTaskScheduled`, `TimerStarted`, `MarkerRecorded`, etc. — appeared
before the current `WorkflowTaskStarted`. The combination "no commands + no inbound
events between WFTCompleted and the next WFTScheduled" is v1's signature of a heartbeat.

This compact heuristic is the heart of v1's chunking decisions.

### Downsides of v1

v1's compactness comes at the cost of accuracy. Several edge cases led to incorrect
chunking and, in some cases, non-determinism errors (NDE) at replay time:

1. **Updates after empty WFTs.** Workflow Updates are delivered as protocol messages
attached to a WFT, not as history events. When a server delivers an Update right
after a WFT heartbeat, v1 sees the heartbeat as "empty" (no commands, nothing
between WFTCompleted and the next WFTScheduled) and collapses it into the next
WFT. The Update is then delivered as part of the combined LWFT, but the original
execution produced two separate activations — the heartbeat, then the Update.
On replay this discrepancy typically manifests as an NDE.

2. **Brittle handling of paginated history.** v1's heuristic needs to look two events
past each `WorkflowTaskStarted` to commit a decision. When history is fetched in
pages, page boundaries can fall mid-decision. v1 returns `Incomplete(...)` in
that case, but the paginator's handling of `Incomplete` interacts subtly with
the heuristic, and the chunking outcome can diverge from the original run.

3. **Inbound events could be absorbed into a heartbeat chain.** v1's heuristic
tracks only _command_ events explicitly (via the `saw_command` flag). Other
event types that can affect workflow state — signals, timer fires, accepted
updates, child-workflow events, etc. — are only handled indirectly through the
two-event look-ahead at each `WorkflowTaskStarted`. In some scenarios — most
visibly when the heuristic interacts with paginated history, but also in plain
in-memory chunking when events fall just-so — v1 ended up considering WFTs that
actually carried inbound events as part of the preceding heartbeat chain.
Because a collapsed LWFT presents _one_ timestamp (that of the first WFT in
the chain) to user code, any code that runs for the absorbed WFT observes
`workflow.now()` at the wrong moment in time, which leads to a non-determinism
error at replay.

These problems motivated the v2 rewrite.

## v2 — explicit state machine with time-sensitive events

v2 (`find_end_index_of_next_wft_seq_v2`) replaces v1's compact heuristic with an
explicit state machine over the structure of WFT sequences in history. Three ideas
make v2 more robust than v1:

### 1. Time-sensitive events explicitly prevent collapsing

v2 introduces the concept of a **time-sensitive event**: any event that can cause user
workflow code to run on replay. Examples include `WorkflowExecutionStarted`,
`WorkflowExecutionSignaled`, `TimerStarted`/`TimerFired`, `ActivityTaskScheduled` and
its terminal counterparts, `WorkflowExecutionUpdateAccepted`, child-workflow events,
Nexus events, and so on. Events that cannot independently wake user code — WFT framing
events, `MarkerRecorded`, `UpsertWorkflowSearchAttributes`, etc. — are _not_
time-sensitive.

The full classification lives in `HistoryEvent::is_wft_time_sensitive_event` and uses
an exhaustive `match` (no catch-all) so any new event type added to the proto must be
explicitly classified.

While walking events, v2 maintains a `prevent_heartbeat` flag. Once any time-sensitive
event is seen, the flag is latched on and v2 will not collapse the current LWFT into
the next WFT. This replaces v1's implicit "saw a command" check with an explicit,
complete list of which event types force a boundary.

### 2. Pending speculative Updates are taken into account

The chunker is parameterized with `has_pending_speculative_updates` — set true when the
current task carries Update protocol messages that haven't yet produced
`WorkflowExecutionUpdateAccepted` events in history. When set, v2 refuses to collapse
the **last** WFT in a heartbeat chain into the preceding chain, because the pending
Update message must be delivered in its own activation in order to match the original
execution. Earlier (non-last) heartbeats in the same chain can still be collapsed
together.

This is the structural fix for the "Update after heartbeat" class of NDE that v1
caused.

### 3. Bounded look-ahead with explicit `NeedMore`

v2 looks up to five events past the current `WorkflowTaskStarted` to make its
decision, and exposes its outcome via a three-variant enum:

- `Complete(ix)` — a LWFT boundary has been found at the `WorkflowTaskStarted` at
index `ix`.
- `NeedMore` — not enough events are available to commit a decision; the caller
should fetch more history pages and retry.
- `Tail` — no more LWFT boundaries exist in the slice; any remaining events are
trailing matter (terminal `WorkflowExecution*` events, `WorkflowTaskCompleted`
with commands).

When the look-ahead runs past the end of the slice, v2 either commits a decision
(if the slice covers the last WFT of the workflow) or returns `NeedMore`. This
moves "I'm not sure yet, fetch more" out of the look-ahead's blind spot and into
an explicit return value, which the paginator can act on cleanly.

Workflow code, after being woken up, might not do anything, and thus generate no new commands
### How v2 addresses each v1 downside

There may be no events for more nuanced reasons:
1. **Update-after-heartbeat NDE.** Fixed structurally: `has_pending_speculative_updates`
prevents the last WFT in a heartbeat chain from being collapsed.

1. The workflow might have been running a long-running local activity. In such cases, the workflow
must "workflow task heartbeat" in order to avoid timing out the workflow task. This means
completing the WFT with no commands while the LA is ongoing.
2. The workflow might have received an update, which does not come as an event in history, but
rather as a "protocol message" attached to the task.
3. Server can forcibly generate a new WFT with some obscure APIs
2. **Pagination correctness.** Fixed by the explicit `NeedMore` return value, which
gives the paginator a clean signal to fetch more events rather than relying on
ambiguous `Incomplete` handling.

Core does not consider such empty WFT sequences as worthy of waking lang (on replay - as a new
task, they always will), since nothing meaningful has happened. Thus, they are grouped together
as part of a "logical" WFT with the last WFT that had any real work in it.
3. **Inbound events absorbed into a heartbeat chain.** Fixed by
`is_wft_time_sensitive_event` and the `prevent_heartbeat` flag — every event
type that can wake user workflow code is enumerated and classified explicitly,
and seeing any one of them latches the chunker out of heartbeat-collapsing
mode for the remainder of the current LWFT. A WFT that carries any such event
can no longer be silently merged with a preceding heartbeat chain, so user
code always observes `workflow.now()` at the timestamp of the WFT it really
belongs to.

## Possible issues as of this writing (5/25)
## Rollout

The "new WFT force-issued by server" case would, currently, not cause a wakeup on replay for the
reasons discussed above. In some obscure edge cases (inspecting workflow clock) this could cause
NDE.
Per Core's SDK-flag rollout policy, this version of Core ships **support** for v2 but
leaves it off by default. Operators opt in via the `TEMPORAL_USE_WFT_CHUNKING_V2`
environment variable. When set to a truthy value (`"true"` or `"1"`, case-insensitive),
the worker, for each newly started workflow:

### Possible solutions
- records `WftChunkingV2` in `sdk_metadata.core_used_flags` on the first
`WorkflowTaskCompleted` event, and
- uses v2 chunking for that workflow from then on, regardless of whether the env
variable is set on subsequent worker versions (the flag in history is the source
of truth).

* Core can attach a flag on WFT completes in order to be explicit that that WFT may be skipped on
replay. IE: During WFT heartbeating for LAs.
* We could legislate that server should never send empty WFTs. Seemingly the only case of this
is
the [obscure api](https://github.com/temporalio/temporal/blob/d189737aa2ed1b07c221abb9fbdd28ecf68f0492/proto/internal/temporal/server/api/adminservice/v1/service.proto#L151)
Workflows that started without the flag — either before opt-in, or on a worker where
opt-in was not enabled — continue to use v1 indefinitely. This guarantees that a
worker can be rolled back to the previous version without stranding workflows that
already adopted v2, since the previous version also understood the flag (it just
didn't write it by default).
Loading
Loading