[pull] master from ray-project:master #4076
Merged
pull[bot] merged 6 commits into miqdigital:master from ray-project:master on Apr 23, 2026
Conversation
…me killing policy (#62643)

## Description
This PR replaces the existing by-group killing policy with the time-based worker killing policy.

**Note:** This introduces a behavioral change in the next version:
1. We no longer prioritize killing large worker groups first.
2. We can now select multiple workers to kill at a time when memory pressure is detected.

The existing by-group killing policy selects the worker to kill based on group size (where a group is the set of workers belonging to the same owner), retry-ability, and submission time. It also selects only a single worker to kill at a time. This is problematic when eliminating a single worker is not enough to bring the system back below the memory threshold, so memory pressure is not relieved. Sorting by group size also fails our goal of always preserving the tasks that have completed more work (estimated by the elapsed time since the start of task execution). Finally, most workloads have only a single owner for their tasks, so sorting by groups is redundant.

The new killing policy addresses these issues by always prioritizing the preservation of workers with longer execution durations, so more completed work is retained, and it can kill multiple workers when necessary to bring usage back under the killing threshold.

## Additional information
PR that introduced the time-based killing policy: #61323

---------
Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
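To make the selection logic concrete, here is a minimal Python sketch of the idea described above. It is illustrative only, not the merged raylet implementation; all names (`WorkerInfo`, `select_workers_to_kill`, the fields) are hypothetical:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerInfo:
    worker_id: str
    used_memory_bytes: int
    execution_time_s: float  # elapsed time since the task started executing


def select_workers_to_kill(
    workers: List[WorkerInfo],
    current_usage_bytes: int,
    kill_threshold_bytes: int,
) -> List[WorkerInfo]:
    """Pick victims until projected memory usage drops below the kill threshold."""
    victims = []
    projected = current_usage_bytes
    # Kill the shortest-running workers first, so long-running tasks
    # (the ones that have completed the most work) are preserved.
    for worker in sorted(workers, key=lambda w: w.execution_time_s):
        if projected <= kill_threshold_bytes:
            break
        victims.append(worker)
        projected -= worker.used_memory_bytes
    return victims
```

Unlike the by-group policy, this can return several victims in one pass when a single kill would not be enough to get back under the threshold.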
Image classification is using `s3://ray-benchmark-data-internal-us-west-2/imagenet/parquet_split`, which only has <100 GB of data. We manually duplicate the data up to ~1 TB and save it to `s3://ray-benchmark-data-internal-us-west-2/imagenet/parquet_split_1t`. Use this dataset in the slow and super-slow tests.

---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
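For reference, the 1 TB copy could be produced with a one-off Ray Data script along these lines; the replication factor and output layout are assumptions, not taken from the PR:

```python
import ray

SRC = "s3://ray-benchmark-data-internal-us-west-2/imagenet/parquet_split"
DST = "s3://ray-benchmark-data-internal-us-west-2/imagenet/parquet_split_1t"

# Read the ~100 GB source once, then write it out several times under the
# new prefix so the total lands around 1 TB.
ds = ray.data.read_parquet(SRC)
for i in range(10):  # 10 copies of ~100 GB ≈ 1 TB (assumed factor)
    ds.write_parquet(f"{DST}/copy_{i}")
```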
…ot registered. (#62789)

## Description
This PR addresses the following issue:
* We should only emit the unexpected idle worker failure when a registered client without a lease id or actor id disconnects.

**We should only emit the unexpected idle worker failure when a registered idle worker disconnects**

Previously, we emitted the unexpected idle worker failure whenever a client for an unregistered process disconnected. This is inaccurate: idle workers are still registered, but possess neither a lease id nor an actor id. This PR addresses that by only emitting an unexpected idle worker failure when a registered worker without a lease id and actor id disconnects.

The graphs below show the existing OOM graph and the unexpected worker failure graph. The two graphs are identical, indicating that each OOM kill is also being emitted as an unexpected idle worker failure.

<img width="1709" height="613" alt="image" src="https://github.com/user-attachments/assets/d4336cbb-f0e4-40a6-b502-b6a2e0d5304e" />

The graph below shows the OOM graph and the unexpected worker failure graph after the PR's fix. Here, the OOM kill count can increase independently of the unexpected idle worker failure count.

<img width="1705" height="612" alt="image" src="https://github.com/user-attachments/assets/9d13a837-d8fe-47c4-bb4c-c2b5ff581f5b" />

## Related issues

## Additional information
PR that introduced the unexpected worker failure metric: #62297

---------
Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
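The emission condition described above boils down to a small predicate. This is a hedged sketch; the worker fields and function name are hypothetical rather than the raylet's actual data model:

```python
def should_emit_unexpected_idle_worker_failure(worker) -> bool:
    """Return True only for a registered worker holding neither a lease nor an actor."""
    if not worker.is_registered:
        # A client for an unregistered process disconnecting is not an
        # unexpected idle worker failure; skip it.
        return False
    # Idle workers are registered but have neither a lease id nor an actor id.
    return worker.lease_id is None and worker.actor_id is None
```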
…orker nodes for local clusters (#62169)

## Description
`ray down` on an SSH Docker cluster stops the head container but **skips all workers**; their Docker containers keep running indefinitely.

**Root cause:** Two separate `LocalNodeProvider` instances maintain independent state files: one on the machine invoking `ray down` and one on the head node (managed by the autoscaler). Workers are only ever marked `"running"` by the head's autoscaler in its own `/tmp/ray/cluster-<name>.state` file. The invoking machine's state file initializes workers as `"terminated"` in `ClusterState.__init__` and never receives those updates. When `teardown_cluster` calls `remaining_nodes()` → `provider.non_terminated_nodes()`, all workers are filtered out, so the `docker stop` loop has nothing to iterate over.

**Fix:** Add `NodeProvider.get_all_node_ids(tag_filters)`, which returns all known node IDs regardless of state. The base class delegates to `non_terminated_nodes()` (no behavior change for cloud providers, which query live infrastructure). `LocalNodeProvider` overrides it to skip the `state == "terminated"` filter. `teardown_cluster` now uses `get_all_node_ids` to build the Docker stop target list, ensuring worker containers are stopped even when the local state file is stale.

## Related issues
Closes: #62058

## Additional information
**Files changed:**
- `python/ray/autoscaler/node_provider.py`: Added `get_all_node_ids()` to the base `NodeProvider` class (defaults to `non_terminated_nodes`)
- `python/ray/autoscaler/_private/local/node_provider.py`: `LocalNodeProvider` override that includes terminated nodes
- `python/ray/autoscaler/_private/commands.py`: `teardown_cluster`'s Docker stop phase uses `get_all_node_ids` instead of `remaining_nodes()`
- `python/ray/tests/test_coordinator_server.py`: Added `testGetAllNodeIdsIncludesTerminated` with a `_make_local_provider` helper

**Backward compatibility:** The `terminate_nodes` loop is unchanged (it still uses `non_terminated_nodes`). For cloud providers (AWS, GCP, Azure, etc.), `get_all_node_ids` delegates to `non_terminated_nodes`, so behavior is identical. The only change is that the Docker stop phase now targets all configured local nodes during teardown.

---------
Signed-off-by: dev-miro26 <devmiro26@gmail.com>
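A rough sketch of the new hook, based on the description above; the method bodies and the shape of the local state dictionary are assumptions, not the merged code in `python/ray/autoscaler/`:

```python
class NodeProvider:
    def get_all_node_ids(self, tag_filters):
        # Base-class default: cloud providers query live infrastructure, so
        # "all known nodes" and "non-terminated nodes" are the same set.
        return self.non_terminated_nodes(tag_filters)


class LocalNodeProvider(NodeProvider):
    def get_all_node_ids(self, tag_filters):
        # Unlike non_terminated_nodes(), do NOT drop nodes whose local state
        # file entry says "terminated": their Docker containers may still be up.
        nodes = self.state.get()  # assumed shape: node_id -> {"tags": ..., "state": ...}
        return [
            node_id
            for node_id, info in nodes.items()
            if all(info["tags"].get(k) == v for k, v in tag_filters.items())
        ]
```

Because only the Docker stop phase of `teardown_cluster` switches to this call, the termination path and every cloud provider keep their existing behavior.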
…ease test (#62662)

## Description
- `release/serve_tests/workloads/router_microbenchmark.py`: Router benchmark. This is the exact benchmark that #62323 uses.
- `release/serve_tests/workloads/plot_router_benchmark.py`: Plot script (requires a manual trigger; it doesn't run in release tests).

## Related issues

## Additional information
<img width="1453" height="394" alt="Screenshot 2026-04-16 at 12 45 05 AM" src="https://github.com/user-attachments/assets/b6307a22-5cda-4073-8de4-d1174c7d25c8" />

---------
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Abrar Sheikh <abrar@anyscale.com>
## Description
As titled. I ran this with and without the change and found about a 25% reduction in time spent in deserialization on a 100-node cluster (the map_benchmark.py release test).

Before:
<img width="637" height="315" alt="image" src="https://github.com/user-attachments/assets/abd70765-5242-4175-b86c-39abf484ebdb" />

After:
<img width="697" height="315" alt="image" src="https://github.com/user-attachments/assets/1fa7dc06-c193-4897-8abb-e4a787f49533" />

## Related issues

## Additional information

---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
See Commits and Changes for more details.
Created by pull[bot] (v2.0.0-alpha.4)