fix(cache): use status code (not identity) to detect Service AlreadyExists#3507
fix(cache): use status code (not identity) to detect Service AlreadyExists#35071fanwang wants to merge 1 commit into
Conversation
The `cache` dataset initializer compared the exception instance to the
`ConflictError` class using `is`, which is always False:
except ApiException as e:
if e is ConflictError: # always False
`core_v1.create_namespaced_service()` raises a plain `ApiException` with
`status=409`, not a `ConflictError`, so this branch was unreachable. When
the Service already existed (e.g., re-reconcile after a crash between
Service creation and LWS readiness), the code re-raised, fell into the
outer cleanup, and deleted the ServiceAccount — leaving the cache
cluster in a broken state.
Match the pattern used a few lines above for the ServiceAccount and
check `e.status == 409` instead.
Add a unit test that mocks `create_namespaced_service` to raise 409 and
asserts `download_dataset` completes without calling
`delete_namespaced_service_account`.
Signed-off-by: 1fanwang <1fannnw@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
Fixes a Kubernetes Service creation retry path in the dataset cache initializer by correctly treating HTTP 409 (AlreadyExists) as an idempotent no-op, preventing unintended ServiceAccount cleanup on reruns.
Changes:
- Replace an incorrect exception identity comparison with
ApiException.status == 409handling for Service creation conflicts. - Remove the now-unused
ConflictErrorimport. - Add a unit test to ensure a 409 from
create_namespaced_servicedoes not triggerdelete_namespaced_service_account.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/initializers/dataset/cache.py | Handles Service AlreadyExists via HTTP status code to avoid falling into the failure/cleanup path. |
| pkg/initializers/dataset/cache_test.py | Adds regression coverage for the 409 Service AlreadyExists scenario to ensure ServiceAccount is not deleted. |
andreyvelich
left a comment
There was a problem hiding this comment.
Thanks for this fix @1fanwang!
/assign @akshaychitneni
|
/ok-to-test |
What this PR does
pkg/initializers/dataset/cache.py:272checksif e is ConflictError:after a failedcreate_namespaced_service. That's a class-identity comparison against the exception instance, which is always False. So when a cache cluster's Service already exists (HTTP 409 on retry), the code falls through to the cleanup branch and deletes the ServiceAccount that was just successfully created.The sibling ServiceAccount creation block 145 lines above uses the correct
if e.status == 409:pattern. This PR aligns the Service branch with it.Why
Symptom on re-run of a dataset cache initialization:
The fix is two lines (one comparison change, one now-unused import removed).
How was this tested?
Added
test_download_dataset_service_already_existsinpkg/initializers/dataset/cache_test.py. It mockscreate_namespaced_serviceto raiseApiException(status=409)and assertsdelete_namespaced_service_accountis NOT called. The new test fails on master and passes with this PR.Signed-off-by: 1fanwang 1fannnw@gmail.com