
[pull] master from ray-project:master #4083

Merged

pull[bot] merged 4 commits into miqdigital:master from ray-project:master on Apr 25, 2026
Conversation


pull[bot] commented Apr 25, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

vaishdho1 and others added 4 commits April 24, 2026 12:40
## Description
When the Serve controller crashes mid-shutdown, replica actors are
orphaned and their resources are leaked. This happens because KV
checkpoints are deleted at the very start of the shutdown process. If
the controller crashes and restarts after checkpoint deletion but before
actor teardown completes, the restarted controller has no record of the
apps/deployments it needs to clean up.
This PR fixes the issue by:
- Persisting a `SHUTDOWN_IN_PROGRESS_KEY` in the KV store at the start of `graceful_shutdown()`. On restart, the controller checks this key and automatically re-enters the shutdown path.
- Deferring checkpoint deletion to the very end, after all resources are released and just before the controller kills itself (see the sketch below).

### Tests
- `test_shutdown.py` - controller crash mid-shutdown recovers and
continues shutdown
- `test_application_state.py` - app checkpoint survives shutdown
lifecycle
- `test_deployment_state.py` - deployment checkpoint survives shutdown
lifecycle
## Related issues
Fixes #62729

---------

Signed-off-by: Vaishnavi Panchavati <vaishdho10@gmail.com>
Co-authored-by: Abrar Sheikh <abrar@anyscale.com>
#62882)

## Description

Splits `test_limit_operator.py` into unit and integration test files.

Moved to `tests/unit/` (1 test):

* `test_per_block_limit_fn`

Remain in integration file (3 tests):

* `test_limit_operator` (uses `ray_start_regular_shared`)
* `test_limit_operator_memory_leak_fix` (uses
`ray_start_regular_shared`, `ray.data.read_parquet`,
`StreamingExecutor`, `ray.get`)
* `test_limit_estimated_num_output_bundles` (uses
`ray_start_regular_shared` — `make_ref_bundles` internally calls
`ray.put`)

`test_per_block_limit_fn` exercises `_per_block_limit_fn` directly with
hand-constructed pandas DataFrames and a `TaskContext` — no Ray cluster
involved. Imports in both files were trimmed to only what each file
actually uses.
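
As a rough illustration of that unit-style test: the real `_per_block_limit_fn` lives under `ray.data._internal` and its exact signature may differ, so the stand-in below is hypothetical.

```python
import pandas as pd

def _per_block_limit_fn(block: pd.DataFrame, limit: int) -> pd.DataFrame:
    # Hypothetical stand-in for the internal helper: truncate one block
    # of a dataset to at most `limit` rows.
    return block.iloc[:limit]

def test_per_block_limit_fn_sketch():
    # Pure-function unit test: hand-built DataFrame, no Ray cluster involved.
    block = pd.DataFrame({"id": range(10)})
    out = _per_block_limit_fn(block, limit=3)
    assert list(out["id"]) == [0, 1, 2]
```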

## Related issues

Related to #61125

---------

Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
…ts (#62874)

## Description

Splits `test_auto_batch_size.py` into unit and integration test files
following the `tests/unit/` convention.

Moved to `tests/unit/` (5 tests):

* `test_compute_auto_batch_size_basic`
* `test_compute_auto_batch_size_clamped_to_one`
* `test_compute_auto_batch_size_returns_none` (parametrized:
`empty_iterator`, `zero_rows`)
* `test_compute_auto_batch_size_iterator_includes_peeked_block`
* `test_auto_batches_respect_target_size`

Remain in integration file (1 test):

* `test_map_batches_auto_correctness` (uses `ray_start_regular_shared`,
`ray.data`)

## Related issues

Related to #61125

## Additional information

Tests were classified by whether they use `ray_start_*` fixtures or make
runtime `ray.*` calls — not by import paths, since the unit tests still
import from `ray.data._internal.*` to exercise internal classes
directly.
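
To make the rule concrete, a hedged sketch of the two styles (the test names and helper are invented; `ray_start_regular_shared` is the shared-cluster fixture from Ray's test conftest):

```python
import ray

def compute_batch_size(total_rows: int, num_batches: int) -> int:
    # Invented pure helper standing in for internal logic under test.
    return max(1, total_rows // num_batches)

def test_batch_size_unit_style():
    # Unit: no ray_start_* fixture and no runtime ray.* calls, even though
    # the real tests import from ray.data._internal.* to reach such helpers.
    assert compute_batch_size(100, 8) == 12

def test_map_batches_integration_style(ray_start_regular_shared):
    # Integration: the fixture starts a Ray cluster, and ray.data schedules
    # actual tasks on it.
    ds = ray.data.range(8).map_batches(lambda batch: batch)
    assert ds.count() == 8
```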

---------

Signed-off-by: Hyunoh-Yeo <hyunoh.yeo@gmail.com>
…asible queue (#62483)

## Description
Suppose a placement group has its bundles scheduled on nodes in some domain. Some of those nodes go down, the scheduler tries to place the bundles back on the same domain assignment, that fails, and the placement group is moved to the infeasible queue.

Now that the placement group is infeasible, suppose all of the bundles' nodes go down, leaving the placement group entirely unplaced. There is currently no code that wakes up scheduling for this placement group from the infeasible queue; it stays stuck there forever until OnNodeAdd clears the queue. These changes add that wake-up path, sketched below.
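
A hypothetical Python sketch of the wake-up idea (the real placement group scheduler lives in the C++ GCS and is structured differently; every name here is invented):

```python
from dataclasses import dataclass, field

@dataclass
class PG:
    name: str
    placed_node_ids: set = field(default_factory=set)

class PlacementGroupSchedulerSketch:
    def __init__(self):
        self.pending_queue = []     # groups waiting to be scheduled
        self.infeasible_queue = []  # groups whose last scheduling attempt failed

    def on_node_added(self, node_id):
        # Pre-fix, this was the only path that drained the infeasible queue.
        self.pending_queue.extend(self.infeasible_queue)
        self.infeasible_queue.clear()

    def on_node_removed(self, node_id):
        # Fix: if a dead node held bundles of an infeasible group, the old
        # (failed) placement no longer constrains it, so retry scheduling now
        # instead of waiting indefinitely for a node to be added.
        for pg in list(self.infeasible_queue):
            if node_id in pg.placed_node_ids:
                self.infeasible_queue.remove(pg)
                self.pending_queue.append(pg)
```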

## Testing
Running the new test without these changes currently fails with a timeout on the `pg2.ready()` line; with the changes applied, it succeeds.

Command to run:
`python -m pytest python/ray/tests/test_bundle_label_selector.py::test_scheduling_feasible_after_rack_kill`

---------

Signed-off-by: aaron.li <aaron.li@anyscale.com>
Signed-off-by: aaronscalene <aaron.li@anyscale.com>
Signed-off-by: Joshua Lee <joshlee@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Joshua Lee <joshlee@anyscale.com>
pull[bot] locked and limited conversation to collaborators Apr 25, 2026
pull[bot] added the ⤵️ pull label Apr 25, 2026
pull[bot] merged commit 65d2640 into miqdigital:master Apr 25, 2026
