[BUG] Workflow re-registration fails with RST_STREAM due to non-deterministic digest and oversized error #7212

@andresgomezfrr

Describe the issue

When re-registering a workflow with the same version, FlyteAdmin returns INTERNAL: RST_STREAM closed stream. HTTP/2 error code: INTERNAL_ERROR instead of ALREADY_EXISTS or INVALID_ARGUMENT. This happens silently — no error is logged server-side.

Root cause (two bugs):

  1. Non-deterministic workflow digest: ValidateWorkflow in workflow_compiler.go iterates wf.Nodes (a Go map) without sorting keys. Go map iteration order is randomized, so the same workflow produces different CompiledWorkflowClosure outputs across compilations. FlyteAdmin's digest comparison (bytes.Equal) fails, incorrectly taking the "different structure" code path.

  2. Oversized gRPC error message: The "different structure" code path in errors.go (NewWorkflowExistsDifferentStructureError) computes a jsondiff of two large compiled workflow closures and includes the full diff in the gRPC status description. For large workflows (e.g. with 400+ JAR dependencies), this diff exceeds gRPC's default 4MB MaxSendMsgSize. gRPC-Go rejects the response at the transport layer with RST_STREAM INTERNAL_ERROR — no server-side log is produced.
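A minimal sketch of the idea behind fixing bug 1: iterate the node map in sorted key order so the output no longer depends on Go's randomized map iteration. The node type, sortedNodeIDs, and digestSorted are hypothetical stand-ins for illustration, not the actual flytepropeller types or the change in the linked PR.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// node is a hypothetical stand-in for a workflow node; it is not the
// flytepropeller type.
type node struct {
	ID     string
	Target string
}

// sortedNodeIDs returns the map keys in lexical order so callers can
// iterate the map deterministically.
func sortedNodeIDs(nodes map[string]node) []string {
	ids := make([]string, 0, len(nodes))
	for id := range nodes {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// digestSorted hashes the nodes in sorted key order, so two maps with the
// same contents always produce the same digest.
func digestSorted(nodes map[string]node) [32]byte {
	h := sha256.New()
	for _, id := range sortedNodeIDs(nodes) {
		fmt.Fprintf(h, "%s=%s\n", id, nodes[id].Target)
	}
	var sum [32]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

func main() {
	a := map[string]node{"n0": {ID: "n0", Target: "task-a"}, "n1": {ID: "n1", Target: "task-b"}}
	b := map[string]node{"n1": {ID: "n1", Target: "task-b"}, "n0": {ID: "n0", Target: "task-a"}}
	// Same contents, different insertion order: the digests should match.
	fmt.Println(digestSorted(a) == digestSorted(b))
}
```

Ranging over the map directly inside digestSorted would feed the hash in a different order from run to run, which is the failure mode described in bug 1.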

Steps to reproduce

  1. Register a workflow with version v1 (succeeds)
  2. Re-register the same workflow with version v1 but with a trivially different template (e.g. add a metadata tag)
  3. Client receives INTERNAL: RST_STREAM closed stream. HTTP/2 error code: INTERNAL_ERROR
  4. FlyteAdmin logs nothing

Even without step 2's modification, identical workflows can trigger this because the digest is non-deterministic — it depends on Go map iteration order.

Expected behavior

  • Identical workflow + same version → ALREADY_EXISTS
  • Different workflow + same version → INVALID_ARGUMENT with a bounded error message
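The expected behavior could be sketched roughly as follows, using only the standard library. The string codes stand in for the gRPC status codes, and classify, boundMessage, and the 4 KiB maxDiffBytes cap are assumptions made for illustration; none of them are FlyteAdmin APIs or values.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
	"unicode/utf8"
)

const (
	// Hypothetical stand-ins for the gRPC status codes.
	codeAlreadyExists   = "ALREADY_EXISTS"
	codeInvalidArgument = "INVALID_ARGUMENT"
	// maxDiffBytes is an assumed cap, not a value taken from FlyteAdmin.
	maxDiffBytes = 4 * 1024
)

// boundMessage keeps at most limit bytes of msg and appends a short note
// saying how much was dropped. The cut backs up to a rune boundary so the
// result stays valid UTF-8.
func boundMessage(msg string, limit int) string {
	if limit < 0 {
		limit = 0
	}
	if len(msg) <= limit {
		return msg
	}
	cut := limit
	for cut > 0 && !utf8.RuneStart(msg[cut]) {
		cut--
	}
	return fmt.Sprintf("%s... [truncated %d of %d bytes]", msg[:cut], len(msg)-cut, len(msg))
}

// classify maps a digest comparison to the expected status code and a
// size-bounded message.
func classify(newDigest, existingDigest []byte, diff string) (code, message string) {
	if bytes.Equal(newDigest, existingDigest) {
		return codeAlreadyExists, "workflow already exists with identical structure"
	}
	return codeInvalidArgument, "workflow exists with different structure: " + boundMessage(diff, maxDiffBytes)
}

func main() {
	// A diff of roughly 10 MB, far larger than anything that should go
	// into a status description.
	hugeDiff := strings.Repeat("-old\n+new\n", 1000000)
	code, msg := classify([]byte{1}, []byte{2}, hugeDiff)
	fmt.Println(code, len(msg))
}
```

The returned message is slightly longer than the cap because of the truncation note, so the cap should be chosen with some headroom.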

Additional context

  • createTask is not affected because task compilation does not iterate Go maps non-deterministically
  • The issue is particularly impactful for CI systems that use content-based versioning (e.g. Bazel) where the same version may be re-registered when only dependencies change

Relevant code

  • Digest comparison: flyteadmin/pkg/manager/impl/workflow_manager.go (CreateWorkflow → bytes.Equal(workflowDigest, existingWorkflowModel.Digest))
  • Non-deterministic map iteration: flytepropeller/pkg/compiler/workflow_compiler.go lines 220 and 235 (for nodeID, n := range wf.Nodes)
  • Unbounded error message: flyteadmin/pkg/errors/errors.go (NewWorkflowExistsDifferentStructureError → jsondiff.Compare → strings.Join with no size limit)

Proposed fix

PR: #7211

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
