
[pull] master from ray-project:master #4083

Merged

pull[bot] merged 4 commits into miqdigital:master from ray-project:master on Apr 25, 2026
Conversation


pull[bot] commented Apr 25, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

vaishdho1 and others added 4 commits April 24, 2026 12:40
## Description
When the Serve controller crashes mid-shutdown, replica actors are
orphaned and their resources are leaked. This happens because KV
checkpoints are deleted at the very start of the shutdown process. If
the controller crashes and restarts after checkpoint deletion but before
actor teardown completes, the restarted controller has no record of the
apps/deployments it needs to clean up.
This PR fixes the issue by:
- Persisting a `SHUTDOWN_IN_PROGRESS_KEY` in the KV store at the start of `graceful_shutdown()`. On restart, the controller checks this key and automatically re-enters the shutdown path.
- Deferring checkpoint deletion to the very end, after all resources are released and just before the controller kills itself (see the sketch below).

### Tests
- `test_shutdown.py` - controller crash mid-shutdown recovers and
continues shutdown
- `test_application_state.py` - app checkpoint survives shutdown
lifecycle
- `test_deployment_state.py` - deployment checkpoint survives shutdown
lifecycle
## Related issues
Fixes #62729

---------

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Co-authored-by: Abrar Sheikh <abrar@anyscale.com>
#62882)

## Description

Splits `test_limit_operator.py` into unit and integration test files.

Moved to `tests/unit/` (1 test):

* `test_per_block_limit_fn`

Remain in integration file (3 tests):

* `test_limit_operator` (uses `ray_start_regular_shared`)
* `test_limit_operator_memory_leak_fix` (uses
`ray_start_regular_shared`, `ray.data.read_parquet`,
`StreamingExecutor`, `ray.get`)
* `test_limit_estimated_num_output_bundles` (uses
`ray_start_regular_shared` — `make_ref_bundles` internally calls
`ray.put`)

`test_per_block_limit_fn` exercises `_per_block_limit_fn` directly with
hand-constructed pandas DataFrames and a `TaskContext` — no Ray cluster
involved. Imports in both files were trimmed to only what each file
actually uses.
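
As a rough illustration of that unit-style test: the real `_per_block_limit_fn` lives under `ray.data._internal` and its exact signature may differ, so the stand-in below is hypothetical.

```python
import pandas as pd

def _per_block_limit_fn(block: pd.DataFrame, limit: int) -> pd.DataFrame:
    # Hypothetical stand-in for the internal helper: truncate one block
    # of a dataset to at most `limit` rows.
    return block.iloc[:limit]

def test_per_block_limit_fn_sketch():
    # Pure-function unit test: hand-built DataFrame, no Ray cluster involved.
    block = pd.DataFrame({"id": range(10)})
    out = _per_block_limit_fn(block, limit=3)
    assert list(out["id"]) == [0, 1, 2]
```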

## Related issues

Related to #61125

---------

Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
…ts (#62874)

## Description

Splits `test_auto_batch_size.py` into unit and integration test files
following the `tests/unit/` convention.

Moved to `tests/unit/` (5 tests):

* `test_compute_auto_batch_size_basic`
* `test_compute_auto_batch_size_clamped_to_one`
* `test_compute_auto_batch_size_returns_none` (parametrized:
`empty_iterator`, `zero_rows`)
* `test_compute_auto_batch_size_iterator_includes_peeked_block`
* `test_auto_batches_respect_target_size`

Remain in integration file (1 test):

* `test_map_batches_auto_correctness` (uses `ray_start_regular_shared`,
`ray.data`)

## Related issues

Related to #61125

## Additional information

Tests were classified by whether they use `ray_start_*` fixtures or make
runtime `ray.*` calls — not by import paths, since the unit tests still
import from `ray.data._internal.*` to exercise internal classes
directly.
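
To make the rule concrete, a hedged sketch of the two styles (the test names and helper are invented; `ray_start_regular_shared` is the shared-cluster fixture from Ray's test conftest):

```python
import ray

def compute_batch_size(total_rows: int, num_batches: int) -> int:
    # Invented pure helper standing in for internal logic under test.
    return max(1, total_rows // num_batches)

def test_batch_size_unit_style():
    # Unit: no ray_start_* fixture and no runtime ray.* calls, even though
    # the real tests import from ray.data._internal.* to reach such helpers.
    assert compute_batch_size(100, 8) == 12

def test_map_batches_integration_style(ray_start_regular_shared):
    # Integration: the fixture starts a Ray cluster, and ray.data schedules
    # actual tasks on it.
    ds = ray.data.range(8).map_batches(lambda batch: batch)
    assert ds.count() == 8
```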

---------

Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
…asible queue (#62483)

## Description
Suppose a placement group has its bundles scheduled on nodes in some domain. Some of those nodes go down, the scheduler tries to place the bundles back on the same domain assignment, that fails, and the placement group is moved to the infeasible queue.

Now that the placement group is infeasible, suppose all of the bundles' nodes go down, leaving the placement group entirely unplaced. There is currently no code that wakes up scheduling for this placement group from the infeasible queue; it stays stuck there forever until OnNodeAdd clears the queue. These changes add that wake-up path, sketched below.
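
A hypothetical Python sketch of the wake-up idea (the real placement group scheduler lives in the C++ GCS and is structured differently; every name here is invented):

```python
from dataclasses import dataclass, field

@dataclass
class PG:
    name: str
    placed_node_ids: set = field(default_factory=set)

class PlacementGroupSchedulerSketch:
    def __init__(self):
        self.pending_queue = []     # groups waiting to be scheduled
        self.infeasible_queue = []  # groups whose last scheduling attempt failed

    def on_node_added(self, node_id):
        # Pre-fix, this was the only path that drained the infeasible queue.
        self.pending_queue.extend(self.infeasible_queue)
        self.infeasible_queue.clear()

    def on_node_removed(self, node_id):
        # Fix: if a dead node held bundles of an infeasible group, the old
        # (failed) placement no longer constrains it, so retry scheduling now
        # instead of waiting indefinitely for a node to be added.
        for pg in list(self.infeasible_queue):
            if node_id in pg.placed_node_ids:
                self.infeasible_queue.remove(pg)
                self.pending_queue.append(pg)
```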

## Testing
Running the new test without these changes currently fails with a timeout on the `pg2.ready()` line; with the changes applied, it succeeds.

Command to run:
`python -m pytest python/ray/tests/test_bundle_label_selector.py::test_scheduling_feasible_after_rack_kill`

---------

Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaronscalene <aaron.li@anyscale.com>
Signed-off-by: Joshua Lee <joshlee@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Joshua Lee <joshlee@anyscale.com>
pull[bot] locked and limited conversation to collaborators Apr 25, 2026
pull[bot] added the ⤵️ pull label Apr 25, 2026
pull[bot] merged commit 65d2640 into miqdigital:master Apr 25, 2026
