[pull] master from ray-project:master#4083
Merged
pull[bot] merged 4 commits into miqdigital:master (Apr 25, 2026)
Conversation
## Description

When the Serve controller crashes mid-shutdown, replica actors are orphaned and their resources are leaked. This happens because KV checkpoints are deleted at the very start of the shutdown process. If the controller crashes and restarts after checkpoint deletion but before actor teardown completes, the restarted controller has no record of the apps/deployments it needs to clean up.

This PR fixes the issue by:
- Persisting a `SHUTDOWN_IN_PROGRESS_KEY` in the KV store at the start of `graceful_shutdown()`. On restart, the controller checks this key and automatically re-enters the shutdown path.
- Deferring checkpoint deletion to the very end, after all the resources are released and before the controller kills itself.

### Tests

- `test_shutdown.py` - controller crash mid-shutdown recovers and continues shutdown
- `test_application_state.py` - app checkpoint survives shutdown lifecycle
- `test_deployment_state.py` - deployment checkpoint survives shutdown lifecycle

## Related issues

Fixes #62729

---------

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Co-authored-by: Abrar Sheikh <abrar@anyscale.com>
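The crash-safe ordering described above (persist a marker first, tear down resources, delete checkpoints last) can be sketched as follows. This is a minimal illustration, not the actual Serve controller code: `KVStore`, the `Controller` class, and the `"app_checkpoint"` key are simplified stand-ins; only the name `SHUTDOWN_IN_PROGRESS_KEY` and the ordering come from the description.

```python
# Sketch of the crash-safe shutdown pattern: marker first, checkpoints last.
SHUTDOWN_IN_PROGRESS_KEY = "shutdown_in_progress"


class KVStore:
    """Minimal in-memory stand-in for the controller's KV checkpoint store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


class Controller:
    def __init__(self, kv):
        self.kv = kv
        self.replicas_torn_down = False
        # On restart, re-enter the shutdown path if a previous shutdown
        # was interrupted before it finished.
        if self.kv.get(SHUTDOWN_IN_PROGRESS_KEY):
            self.graceful_shutdown()

    def graceful_shutdown(self):
        # 1. Persist the marker FIRST, so a crash after this point is
        #    recoverable on restart.
        self.kv.put(SHUTDOWN_IN_PROGRESS_KEY, "1")
        # 2. Tear down replicas using the still-present checkpoints.
        self._teardown_replicas()
        # 3. Only now delete checkpoints, after all resources are released.
        self.kv.delete("app_checkpoint")
        self.kv.delete(SHUTDOWN_IN_PROGRESS_KEY)

    def _teardown_replicas(self):
        self.replicas_torn_down = True


# Simulate a crash: shutdown started (marker set), teardown never completed.
kv = KVStore()
kv.put("app_checkpoint", "state")
kv.put(SHUTDOWN_IN_PROGRESS_KEY, "1")

# The restarted controller sees the marker and finishes cleanup itself.
restarted = Controller(kv)
assert restarted.replicas_torn_down
assert kv.get("app_checkpoint") is None
```

Note that if the marker were written after checkpoint deletion, the original bug would reappear; the ordering is the whole fix.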
#62882)

## Description

Splits `test_limit_operator.py` into unit and integration test files.

Moved to unit file (1 test):
* `test_per_block_limit_fn`

Remain in integration file (3 tests):
* `test_limit_operator` (uses `ray_start_regular_shared`)
* `test_limit_operator_memory_leak_fix` (uses `ray_start_regular_shared`, `ray.data.read_parquet`, `StreamingExecutor`, `ray.get`)
* `test_limit_estimated_num_output_bundles` (uses `ray_start_regular_shared`; `make_ref_bundles` internally calls `ray.put`)

`test_per_block_limit_fn` exercises `_per_block_limit_fn` directly with hand-constructed pandas DataFrames and a `TaskContext`; no Ray cluster is involved. Imports in both files were trimmed to only what each file actually uses.

## Related issues

Related to #61125

---------

Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
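To make the unit/integration distinction concrete, here is a hedged sketch of what a per-block limit function like `_per_block_limit_fn` does and why it is unit-testable with plain pandas DataFrames. The function name, signature, and `rows_so_far` parameter are illustrative, not the Ray Data internals.

```python
# Illustrative per-block limit: truncate each incoming block so the running
# row count never exceeds the limit. No Ray cluster needed to test this.
import pandas as pd


def per_block_limit_fn(block: pd.DataFrame, limit: int, rows_so_far: int) -> pd.DataFrame:
    """Return at most `limit - rows_so_far` rows from this block."""
    remaining = max(limit - rows_so_far, 0)
    return block.iloc[:remaining]


# Hand-constructed DataFrames, mirroring the unit-test style described above.
b1 = pd.DataFrame({"x": [1, 2, 3]})
b2 = pd.DataFrame({"x": [4, 5, 6]})

out1 = per_block_limit_fn(b1, limit=4, rows_so_far=0)  # takes all 3 rows
out2 = per_block_limit_fn(b2, limit=4, rows_so_far=3)  # takes only 1 more
assert len(out1) == 3
assert out2["x"].tolist() == [4]
```

Because the function is pure over its inputs, it needs no `ray_start_*` fixture, which is exactly the classification criterion the split uses.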
…ts (#62874)

## Description

Splits `test_auto_batch_size.py` into unit and integration test files following the `tests/unit/` convention.

Moved to `tests/unit/` (5 tests):
* `test_compute_auto_batch_size_basic`
* `test_compute_auto_batch_size_clamped_to_one`
* `test_compute_auto_batch_size_returns_none` (parametrized: `empty_iterator`, `zero_rows`)
* `test_compute_auto_batch_size_iterator_includes_peeked_block`
* `test_auto_batches_respect_target_size`

Remain in integration file (1 test):
* `test_map_batches_auto_correctness` (uses `ray_start_regular_shared`, `ray.data`)

## Related issues

Related to #61125

## Additional information

Tests were classified by whether they use `ray_start_*` fixtures or make runtime `ray.*` calls, not by import paths, since the unit tests still import from `ray.data._internal.*` to exercise internal classes directly.

---------

Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
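The unit-test names above suggest the shape of the computation being tested: peek the iterator, return `None` for empty input, and clamp the result to at least one row. The sketch below follows that shape under stated assumptions; the function signature, `target_bytes`, and `bytes_per_row` parameters are hypothetical, not the actual `ray.data._internal` API.

```python
# Hedged sketch of an auto-batch-size computation matching the test names:
# returns None for empty/zero-row input, clamps to a minimum of one row.
from typing import Iterator, Optional


def compute_auto_batch_size(
    blocks: Iterator[list], target_bytes: int, bytes_per_row: int
) -> Optional[int]:
    # Peek the first block; an exhausted iterator yields no batch size.
    first = next(blocks, None)
    if first is None or len(first) == 0:
        return None  # the `empty_iterator` / `zero_rows` cases
    # Never return a batch size below one row, even for tiny targets.
    return max(target_bytes // bytes_per_row, 1)


assert compute_auto_batch_size(iter([]), 1024, 8) is None       # empty_iterator
assert compute_auto_batch_size(iter([[]]), 1024, 8) is None     # zero_rows
assert compute_auto_batch_size(iter([[0] * 10]), 1024, 8) == 128
assert compute_auto_batch_size(iter([[0]]), 4, 8) == 1          # clamped to one
```

A function like this is a pure computation over an iterator, which is why all five of its tests could move to `tests/unit/` without any Ray fixtures.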
…asible queue (#62483)

## Description

Suppose a placement group has its bundles scheduled on nodes within some domain. Some of those nodes go down, so the scheduler tries to reschedule the bundles under the same domain assignment; this fails, and the placement group is moved to the infeasible queue. Now suppose that, while the group sits in the infeasible queue, the rest of the bundles' nodes also go down, leaving the placement group entirely unplaced. There is currently no code that wakes this placement group up for rescheduling; it stays stuck in the infeasible queue until `OnNodeAdd` clears it. These changes fix that.

## Testing

If you copy the test and run it without these changes, it fails with a timeout on the `pg2.ready()` line. With the changes applied, the test succeeds.

Command to run:
`python -m pytest python/ray/tests/test_bundle_label_selector.py::test_scheduling_feasible_after_rack_kill`

---------

Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaronscalene <aaron.li@anyscale.com>
Signed-off-by: Joshua Lee <joshlee@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Joshua Lee <joshlee@anyscale.com>
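The wake-up behavior described above can be sketched as follows. This is a minimal Python illustration of the idea only: the `Scheduler` and `PlacementGroup` classes and the `on_node_removed` hook are hypothetical stand-ins, not the GCS scheduler's actual C++ structures.

```python
# Sketch: when a node is removed, an infeasible placement group whose bundles
# have all become unplaced is moved back to the pending queue for a fresh
# scheduling attempt, instead of being stuck until OnNodeAdd.


class PlacementGroup:
    def __init__(self, name, bundle_nodes):
        self.name = name
        # bundle index -> node id, or None if that bundle is unplaced
        self.bundle_nodes = bundle_nodes

    def all_unplaced(self):
        return all(node is None for node in self.bundle_nodes.values())


class Scheduler:
    def __init__(self):
        self.infeasible = []
        self.pending = []

    def on_node_removed(self, node_id):
        # Mark any bundles placed on the dead node as unplaced.
        for pg in self.infeasible:
            for idx, node in pg.bundle_nodes.items():
                if node == node_id:
                    pg.bundle_nodes[idx] = None
        # Wake up fully-unplaced groups: their old (failed) domain
        # assignment no longer constrains them.
        woken = [pg for pg in self.infeasible if pg.all_unplaced()]
        self.infeasible = [pg for pg in self.infeasible if not pg.all_unplaced()]
        self.pending.extend(woken)


sched = Scheduler()
# Partially placed group that previously failed rescheduling (infeasible).
pg = PlacementGroup("pg2", {0: "nodeA", 1: None})
sched.infeasible.append(pg)

sched.on_node_removed("nodeA")  # the last placed bundle's node goes down
assert sched.pending == [pg]
assert sched.infeasible == []
```

Without the wake-up step in `on_node_removed`, `pg` would remain in `infeasible` forever, which is the hang the test reproduces at `pg2.ready()`.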
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)