
fix(cache): use status code (not identity) to detect Service AlreadyExists #3507

Open
1fanwang wants to merge 1 commit into kubeflow:master from 1fanwang:fix/cache-service-conflict-check

Conversation


@1fanwang 1fanwang commented May 12, 2026

What this PR does

`pkg/initializers/dataset/cache.py:272` checks `if e is ConflictError:` after a failed `create_namespaced_service`. That is an identity comparison between the exception instance and the `ConflictError` class, so it is always `False`. As a result, when a cache cluster's Service already exists (HTTP 409 on retry), the code falls through to the cleanup branch and deletes the ServiceAccount that was just successfully created.

The sibling ServiceAccount creation block 145 lines above uses the correct `if e.status == 409:` pattern. This PR aligns the Service branch with it.
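
To illustrate the difference between the two checks, here is a minimal sketch. Both `ApiException` and `ConflictError` below are stand-in classes defined only for this example (assumptions, so the snippet runs without the Kubernetes client installed), and `classify` is a hypothetical helper, not code from `cache.py`.

```python
class ApiException(Exception):
    """Stand-in for the Kubernetes client's ApiException (assumption)."""

    def __init__(self, status=None, reason=None):
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


class ConflictError(ApiException):
    """Stand-in for the class the old check compared against (assumption)."""


def classify(exc):
    """Hypothetical helper: evaluate both checks for one exception instance."""
    return {
        # Instance compared to a class object with `is`: never the same object.
        "identity_check": exc is ConflictError,
        # Status-code comparison, the pattern this PR switches to.
        "status_check": exc.status == 409,
    }


result = classify(ApiException(status=409, reason="Conflict"))
print(result)
```

The identity check stays `False` even if the raised exception were an instance of `ConflictError`, because `is` compares the instance with the class object itself rather than testing its type.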

Why

Symptom on re-run of a dataset cache initialization:

  • ServiceAccount created (200)
  • Service create → 409 AlreadyExists
  • ConflictError identity check fails → cleanup branch fires
  • ServiceAccount deleted
  • Cache pod ends up orphaned without its SA

The fix is two lines (one comparison change, one now-unused import removed).
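
The symptom sequence above can be reproduced with a small simulation. This is only a sketch under stated assumptions: `FakeCoreV1`, `provision`, the `"ns"` namespace, and the `"cache-sa"` name are hypothetical, the exception classes are stand-ins, and the nesting of the `try` blocks is inferred from this description rather than copied from `cache.py`.

```python
class ApiException(Exception):
    """Stand-in for the Kubernetes client's ApiException (assumption)."""

    def __init__(self, status=None):
        super().__init__(f"status={status}")
        self.status = status


class ConflictError(ApiException):
    """Stand-in for the class used in the old identity check (assumption)."""


class FakeCoreV1:
    """Hypothetical fake API that records calls; the Service always exists."""

    def __init__(self):
        self.calls = []

    def create_namespaced_service_account(self, namespace, body):
        self.calls.append("create_sa")

    def create_namespaced_service(self, namespace, body):
        self.calls.append("create_service")
        raise ApiException(status=409)  # AlreadyExists on re-run

    def delete_namespaced_service_account(self, name, namespace):
        self.calls.append("delete_sa")


def provision(api, is_conflict):
    """Hypothetical flow: create SA, create Service, clean up SA on failure."""
    try:
        api.create_namespaced_service_account(namespace="ns", body={})
        try:
            api.create_namespaced_service(namespace="ns", body={})
        except ApiException as e:
            if not is_conflict(e):
                raise  # not recognized as a conflict: escalate to cleanup
            # Recognized conflict: the Service already exists, nothing to do.
    except Exception:
        api.delete_namespaced_service_account(name="cache-sa", namespace="ns")
        return False
    return True


old_api, new_api = FakeCoreV1(), FakeCoreV1()
old_ok = provision(old_api, lambda e: e is ConflictError)  # check before the fix
new_ok = provision(new_api, lambda e: e.status == 409)     # check after the fix
print(old_ok, old_api.calls)
print(new_ok, new_api.calls)
```

With the identity check, the recorded calls end in `delete_sa`; with the status-code check, the 409 is treated as a no-op and the ServiceAccount is left in place.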

How was this tested?

Added `test_download_dataset_service_already_exists` in `pkg/initializers/dataset/cache_test.py`. It mocks `create_namespaced_service` to raise `ApiException(status=409)` and asserts `delete_namespaced_service_account` is NOT called. The new test fails on master and passes with this PR.
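
For readers unfamiliar with the mocking approach, the shape of such a regression test might look roughly like the sketch below. It is not the test added in this PR: `download_dataset_sketch` is a hypothetical stand-in for the code under test, `ApiException` is a stand-in class, and the namespace and ServiceAccount name are made up for illustration.

```python
from unittest.mock import MagicMock


class ApiException(Exception):
    """Stand-in for the Kubernetes client's ApiException (assumption)."""

    def __init__(self, status=None):
        super().__init__(f"status={status}")
        self.status = status


def download_dataset_sketch(core_v1):
    """Hypothetical stand-in for the fixed code path, not the real function."""
    core_v1.create_namespaced_service_account(namespace="ns", body={})
    try:
        try:
            core_v1.create_namespaced_service(namespace="ns", body={})
        except ApiException as e:
            if e.status != 409:
                raise  # only AlreadyExists is tolerated
    except Exception:
        # Cleanup path: remove the ServiceAccount, then propagate the error.
        core_v1.delete_namespaced_service_account(name="cache-sa", namespace="ns")
        raise


def test_service_already_exists():
    core_v1 = MagicMock()
    # Make the Service creation call fail with HTTP 409.
    core_v1.create_namespaced_service.side_effect = ApiException(status=409)

    download_dataset_sketch(core_v1)

    core_v1.create_namespaced_service.assert_called_once()
    core_v1.delete_namespaced_service_account.assert_not_called()


test_service_already_exists()
```

Setting `side_effect` to an exception instance makes the mock raise it when called, which is enough to drive the 409 path without a cluster.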

Signed-off-by: 1fanwang <1fannnw@gmail.com>

The `cache` dataset initializer compared the exception instance to the
`ConflictError` class using `is`, which is always False:

    except ApiException as e:
        if e is ConflictError:  # always False

`core_v1.create_namespaced_service()` raises a plain `ApiException` with
`status=409`, not a `ConflictError`, so this branch was unreachable. When
the Service already existed (e.g., re-reconcile after a crash between
Service creation and LWS readiness), the code re-raised, fell into the
outer cleanup, and deleted the ServiceAccount — leaving the cache
cluster in a broken state.

Match the pattern used a few lines above for the ServiceAccount and
check `e.status == 409` instead.

Add a unit test that mocks `create_namespaced_service` to raise 409 and
asserts `download_dataset` completes without calling
`delete_namespaced_service_account`.

Signed-off-by: 1fanwang <1fannnw@gmail.com>
Copilot AI review requested due to automatic review settings May 12, 2026 08:26
@google-oss-prow google-oss-prow Bot requested review from jinchihe and kuizhiqing May 12, 2026 08:26
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Contributor

Copilot AI left a comment


Pull request overview

Fixes a Kubernetes Service creation retry path in the dataset cache initializer by correctly treating HTTP 409 (AlreadyExists) as an idempotent no-op, preventing unintended ServiceAccount cleanup on reruns.

Changes:

  • Replace an incorrect exception identity comparison with ApiException.status == 409 handling for Service creation conflicts.
  • Remove the now-unused ConflictError import.
  • Add a unit test to ensure a 409 from create_namespaced_service does not trigger delete_namespaced_service_account.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/initializers/dataset/cache.py | Handles Service AlreadyExists via HTTP status code to avoid falling into the failure/cleanup path. |
| pkg/initializers/dataset/cache_test.py | Adds regression coverage for the 409 Service AlreadyExists scenario to ensure ServiceAccount is not deleted. |

Member

@andreyvelich andreyvelich left a comment


Thanks for this fix @1fanwang!
/assign @akshaychitneni

@andreyvelich
Member

/ok-to-test
