Describe the bug
When updating a Pipeline CRD, the scheduler incorrectly removes the pipeline from envoy routing after the old pipeline version is deleted, even though the new version was successfully loaded. This leaves the pipeline in a broken state where:
Pipeline status shows Ready: True
Actual requests return 503 no healthy upstream
Environment
Seldon Core version: 2.10.2
Kubernetes version: 1.28
Installation method: Helm
Kafka: Tested with both local Strimzi and Confluent Cloud (same behavior)
To Reproduce
Deploy a working pipeline
Verify pipeline is functional (returns 200/400, not 503)
Apply an update to the pipeline spec (e.g., change stepsJoin: inner to stepsJoin: outer)
Observe pipeline gateway logs
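The routability check in step 2 can be scripted so the same probe runs before and after the update. This is a minimal sketch, not part of the original reproduction: the gateway address, the request path, the `Seldon-Model` header value, and the empty payload are all assumptions and need adjusting to the actual deployment.

```python
import json
import urllib.error
import urllib.request

# Hypothetical placeholders; neither value comes from the report's environment.
GATEWAY_URL = "http://localhost:9000"
PIPELINE = "mlserver-example-pipeline"


def is_routable(status_code: int) -> bool:
    """Apply the criterion from step 2: 200 or 400 means the request reached
    the pipeline; 503 (or anything else) is treated as not routable."""
    return status_code in (200, 400)


def probe(url: str = GATEWAY_URL, pipeline: str = PIPELINE, timeout: float = 5.0) -> int:
    """Send one inference request and return the HTTP status code.

    The path and header follow the usual Seldon Core 2 pipeline convention,
    which is an assumption here. The empty payload will most likely be
    rejected with 400, which still proves the route exists.
    """
    request = urllib.request.Request(
        f"{url}/v2/models/{pipeline}/infer",
        data=json.dumps({"inputs": []}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Seldon-Model": f"{pipeline}.pipeline",
        },
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what matters here.
        return err.code
```

Calling `is_routable(probe())` once before step 3 and again a few seconds after it should show the transition from routable to not routable if the bug is present. Connection failures raise `urllib.error.URLError` and are deliberately not caught.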
Expected behavior
Pipeline should remain routable after the update. The old version should be deleted without affecting the new version's routing.
Actual behavior
The pipeline becomes unroutable (503 errors) despite showing Ready: True in the CRD status.
Logs
Scheduler logs showing the issue:
time="2026-01-06T16:14:47Z" level=info msg="Received pipeline status event update:{op:Create pipeline:\"mlserver-example-pipeline\" version:9 ...} success:true reason:\"Pipeline mlserver-example-pipeline loaded\""
time="2026-01-06T16:14:47Z" level=info msg="Pipeline mlserver-example-pipeline status counts: 1/1 ready"
time="2026-01-06T16:14:47Z" level=info msg="Adding normal pipeline route mlserver-example-pipeline"
time="2026-01-06T16:14:48Z" level=info msg="Pipeline mlserver-example-pipeline status counts: 1/1 terminated"
time="2026-01-06T16:14:49Z" level=info msg="Received pipeline status event update:{op:Delete pipeline:\"mlserver-example-pipeline\" version:8 ...} success:true reason:\"Pipeline mlserver-example-pipeline deleted\""
time="2026-01-06T16:14:49Z" level=info msg="Pipeline mlserver-example-pipeline has been terminated, removing from conflict resolution and envoy"
Key observation: Version 9 is created and loaded successfully, but when version 8 is deleted, the scheduler's GetPipelineStatus function reports 1/1 terminated and removes the pipeline from envoy - even though version 9 is still active.
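To check other scheduler logs for the same pattern, the sequence above can be detected mechanically. The sketch below is an illustration only: it assumes the log lines keep exactly the wording shown in this report, and the function name `find_stale_removals` is made up for this example.

```python
import re

# Matches the status events shown above, with or without escaped quotes.
EVENT_RE = re.compile(r'op:(Create|Delete) pipeline:\\?"([^"\\]+)\\?" version:(\d+)')
REMOVE_RE = re.compile(
    r"Pipeline (\S+) has been terminated, removing from conflict resolution and envoy"
)


def find_stale_removals(lines):
    """Return (pipeline, deleted_version, newest_loaded_version) for every
    envoy removal that follows the deletion of an older version while a
    newer version had already been loaded successfully."""
    newest_loaded = {}
    last_deleted = {}
    findings = []
    for line in lines:
        event = EVENT_RE.search(line)
        if event:
            if "success:true" not in line:
                continue  # ignore failed operations
            op, name, version = event.group(1), event.group(2), int(event.group(3))
            if op == "Create":
                newest_loaded[name] = max(version, newest_loaded.get(name, version))
            else:
                last_deleted[name] = version
            continue
        removal = REMOVE_RE.search(line)
        if removal:
            name = removal.group(1)
            loaded = newest_loaded.get(name)
            deleted = last_deleted.get(name)
            if loaded is not None and deleted is not None and deleted < loaded:
                findings.append((name, deleted, loaded))
    return findings


# Three of the scheduler log lines from this report, copied verbatim.
SAMPLE = [
    r'time="2026-01-06T16:14:47Z" level=info msg="Received pipeline status event update:{op:Create pipeline:\"mlserver-example-pipeline\" version:9 ...} success:true reason:\"Pipeline mlserver-example-pipeline loaded\""',
    r'time="2026-01-06T16:14:49Z" level=info msg="Received pipeline status event update:{op:Delete pipeline:\"mlserver-example-pipeline\" version:8 ...} success:true reason:\"Pipeline mlserver-example-pipeline deleted\""',
    r'time="2026-01-06T16:14:49Z" level=info msg="Pipeline mlserver-example-pipeline has been terminated, removing from conflict resolution and envoy"',
]

print(find_stale_removals(SAMPLE))
```

A non-empty result means a pipeline was removed from envoy right after an older version was deleted, which is the behavior described here.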
Root cause analysis
The bug appears to be in the scheduler's dataflow-conflict-resolution component. When processing the delete event for the old pipeline version, GetPipelineStatus incorrectly counts the pipeline as terminated and triggers removal from envoy, ignoring that a newer version is still loaded.
The sequence is:
Pipeline v9 created → "1/1 ready" → added to envoy ✓
Pipeline v8 delete event received
GetPipelineStatus returns "1/1 terminated" (BUG: should still show ready because v9 exists)
Pipeline removed from envoy (BUG: v9 is still valid)
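The expected behavior can be stated as a small model: a route is dropped only when no loaded version of the pipeline remains. The sketch below is a toy illustration of that rule. It is not Seldon's scheduler code, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineRoutes:
    """Toy stand-in for the scheduler's routing state (hypothetical)."""

    loaded: dict = field(default_factory=dict)  # pipeline name -> set of loaded versions
    routed: set = field(default_factory=set)    # pipeline names currently in envoy

    def on_create(self, name: str, version: int) -> None:
        # A successfully loaded version makes the pipeline routable.
        self.loaded.setdefault(name, set()).add(version)
        self.routed.add(name)

    def on_delete(self, name: str, version: int) -> None:
        # Deleting one version only removes that version from the set.
        versions = self.loaded.get(name, set())
        versions.discard(version)
        # The route goes away only when no version is left at all.
        if not versions:
            self.loaded.pop(name, None)
            self.routed.discard(name)


# Replay the sequence from this report: v8 exists, v9 is created, v8 is deleted.
routes = PipelineRoutes()
routes.on_create("mlserver-example-pipeline", 8)
routes.on_create("mlserver-example-pipeline", 9)
routes.on_delete("mlserver-example-pipeline", 8)
still_routed_after_v8_delete = "mlserver-example-pipeline" in routes.routed

# Only deleting the last remaining version should remove the route.
routes.on_delete("mlserver-example-pipeline", 9)
routed_after_v9_delete = "mlserver-example-pipeline" in routes.routed
```

Under this rule the v8 delete event leaves the pipeline routed, because v9 is still in the loaded set. The observed behavior corresponds to dropping the route on any delete event, whichever version it refers to.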
Workaround
Restarting the pipeline gateway pod after any pipeline update resolves the issue:
This works because the fresh pod connects to the scheduler and loads the current pipeline version without any "old version" delete events to process.
Additional context
Impact