Summary
When the Actions Service's TaskAction CRD watcher (actions/k8s/client.go:notifyRunService) fails to call InternalRunService.RecordAction, subsequent UpdateActionStatus calls for the same action silently no-op — the gRPC response is success, but zero rows are updated and the action is effectively invisible in the Runs DB until a K8s watch re-list re-emits ADDED.
Reproduction
- Runs Service is briefly unavailable when a TaskAction's ADDED event is processed.
- RecordAction fails. The Actions Service logs a Warnf and moves on (no retry). The bloom filter is not updated, but no further action is taken.
- Runs Service recovers.
- Subsequent MODIFIED events for that TaskAction call UpdateActionStatus. The SQL UPDATE matches zero rows. Runs Service returns nil error (see runs/repository/impl/action.go:397-416, runs/service/internal_run_service.go:232).
- The action never appears to clients using WatchRunDetails until a watch re-list occurs (controller restart / watcher reconnect).
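The silent no-op in the steps above can be illustrated with a small in-memory sketch. This is not the Runs Service code: `actionStore`, `updatePhase`, and `updateActionStatusSilent` are hypothetical stand-ins, with a map playing the role of the actions table and the returned count playing the role of rows affected.

```go
package main

import "fmt"

// actionStore is a hypothetical in-memory stand-in for the actions table.
type actionStore struct {
	phases map[string]string // action key -> phase
}

// updatePhase mimics an SQL UPDATE with a WHERE clause: it returns the number
// of rows matched, and a missing row is not an error.
func (s *actionStore) updatePhase(key, phase string) (int64, error) {
	if _, ok := s.phases[key]; !ok {
		return 0, nil
	}
	s.phases[key] = phase
	return 1, nil
}

// updateActionStatusSilent mirrors the behavior described in the report:
// the row count is discarded, so zero rows updated still looks like success.
func updateActionStatusSilent(s *actionStore, key, phase string) error {
	_, err := s.updatePhase(key, phase)
	return err
}

func main() {
	// RecordAction never succeeded, so the table has no row for this action.
	store := &actionStore{phases: map[string]string{}}
	err := updateActionStatusSilent(store, "run-1/a0", "RUNNING")
	fmt.Println("error:", err, "rows in table:", len(store.phases))
}
```

The caller sees a nil error while the table stays empty, which is the state the MODIFIED events leave behind.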
Root cause
Two coupled issues in InternalRunService:
- UpdateActionPhase returns success on 0 rows affected. The caller has no way to know the action is missing from the DB.
- RecordAction and UpdateActionStatus are split RPCs. The client must call both in order; there is no single idempotent "report" operation.
Proposed fix
1. Make UpdateActionStatus return NotFound when no row is matched.
    // runs/repository/impl/action.go
    if rowsAffected == 0 {
        return ErrActionNotFound
    }
…mapped to connect.CodeNotFound in the gRPC handler.
2. Add ActionStatus status = 6 to RecordActionRequest so the initial insert carries the correct phase (no transient PHASE_UNSPECIFIED window).
3. Self-heal in actions/k8s/client.go:notifyRunService:
    _, err := c.runClient.UpdateActionStatus(ctx, statusReq)
    if connect.CodeOf(err) == connect.CodeNotFound {
        // Row missing: RecordAction must have failed earlier. Rebuild from CRD
        // and re-record. Follow the existing pattern: Add to the bloom filter only
        // on success (the filter is add-only; never Remove).
        recordReq := buildRecordRequestFromCRD(taskAction)
        if _, recErr := c.runClient.RecordAction(ctx, recordReq); recErr == nil {
            if c.recordedFilter != nil {
                c.recordedFilter.Add(ctx, actionKey)
            }
            // Optional: retry UpdateActionStatus now that the row exists.
        }
    }
The Actions Service still has the TaskAction CRD in its informer cache, so rebuilding the RecordActionRequest needs no extra API-server round trip.
4. On ADDED events, skip the separate UpdateActionStatus — RecordAction now carries the status.
Independent improvements
- Bounded retry with backoff for transient errors (Unavailable, DeadlineExceeded): 3 attempts, exponential (100ms / 500ms / 2s). Don't block the sharded worker on long outages.
- Periodic reconciliation sweep: every ~5min, list TaskAction CRDs without the flyte.org/terminal-status-recorded label and force a notifyRunService pass. Backstop for cases where the watcher doesn't reconnect but Runs Service has recovered.
References
actions/k8s/client.go:554-646 — notifyRunService
runs/service/internal_run_service.go:191-255 — UpdateActionStatus handler
runs/repository/impl/action.go:361-417 — UpdateActionPhase (the silent-success site)