[pull] master from ray-project:master #4076
Merged
pull[bot] merged 6 commits into miqdigital:master from ray-project:master on Apr 23, 2026
Conversation
…me killing policy (#62643)

## Description
This PR replaces the existing by-group killing policy with the time-based worker killing policy.

**Note:** This introduces a behavioral change in the next version:
1. We no longer prioritize killing large worker groups first.
2. We can now select multiple workers to kill at a time when memory pressure is detected.

The existing by-group killing policy selects the worker to kill based on group size (where a group is the set of workers belonging to the same owner), retry-ability, and submission time. It also selects only a single worker to kill at a time. This is problematic when eliminating a single worker is not enough to bring the system back below the memory threshold, so memory pressure is not relieved. Sorting by group size also fails our goal of always preserving the tasks that have completed more work (estimated by the elapsed time since the start of task execution). Finally, most workloads have only a single owner for their tasks, so sorting by groups is redundant.

The new killing policy addresses these issues by always prioritizing the preservation of workers with longer execution durations, so more completed work is retained, and it can kill multiple workers when necessary to bring usage back under the killing threshold.

## Additional information
PR that introduced the time-based killing policy: #61323

---------
Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
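To make the selection logic concrete, here is a minimal Python sketch of the idea described above. It is illustrative only, not the merged raylet implementation; all names (`WorkerInfo`, `select_workers_to_kill`, the fields) are hypothetical:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerInfo:
    worker_id: str
    used_memory_bytes: int
    execution_time_s: float  # elapsed time since the task started executing


def select_workers_to_kill(
    workers: List[WorkerInfo],
    current_usage_bytes: int,
    kill_threshold_bytes: int,
) -> List[WorkerInfo]:
    """Pick victims until projected memory usage drops below the kill threshold."""
    victims = []
    projected = current_usage_bytes
    # Kill the shortest-running workers first, so long-running tasks
    # (the ones that have completed the most work) are preserved.
    for worker in sorted(workers, key=lambda w: w.execution_time_s):
        if projected <= kill_threshold_bytes:
            break
        victims.append(worker)
        projected -= worker.used_memory_bytes
    return victims
```

Unlike the by-group policy, this can return several victims in one pass when a single kill would not be enough to get back under the threshold.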
Image classification is using `s3://ray-benchmark-data-internal-us-west-2/imagenet/parquet_split`, which only has <100 GB of data. We manually duplicate the data up to ~1 TB and save it to `s3://ray-benchmark-data-internal-us-west-2/imagenet/parquet_split_1t`. Use this dataset in the slow and super-slow tests.

---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
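For reference, the 1 TB copy could be produced with a one-off Ray Data script along these lines; the replication factor and output layout are assumptions, not taken from the PR:

```python
import ray

SRC = "s3://ray-benchmark-data-internal-us-west-2/imagenet/parquet_split"
DST = "s3://ray-benchmark-data-internal-us-west-2/imagenet/parquet_split_1t"

# Read the ~100 GB source once, then write it out several times under the
# new prefix so the total lands around 1 TB.
ds = ray.data.read_parquet(SRC)
for i in range(10):  # 10 copies of ~100 GB ≈ 1 TB (assumed factor)
    ds.write_parquet(f"{DST}/copy_{i}")
```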
…ot registered. (#62789)

## Description
This PR addresses the following issue:
* We should only emit the unexpected idle worker failure when a registered client without a lease id or actor id disconnects.

**We should only emit the unexpected idle worker failure when a registered idle worker disconnects**

Previously, we emitted the unexpected idle worker failure whenever a client for an unregistered process disconnected. This is inaccurate: idle workers are still registered, but possess neither a lease id nor an actor id. This PR addresses that by only emitting an unexpected idle worker failure when a registered worker without a lease id and actor id disconnects.

The graphs below show the existing OOM graph and the unexpected worker failure graph. The two graphs are identical, indicating that each OOM kill is also being emitted as an unexpected idle worker failure.

<img width="1709" height="613" alt="image" src="https://github.com/user-attachments/assets/d4336cbb-f0e4-40a6-b502-b6a2e0d5304e" />

The graph below shows the OOM graph and the unexpected worker failure graph after the PR's fix. Here, the OOM kill count can increase independently of the unexpected idle worker failure count.

<img width="1705" height="612" alt="image" src="https://github.com/user-attachments/assets/9d13a837-d8fe-47c4-bb4c-c2b5ff581f5b" />

## Related issues

## Additional information
PR that introduced the unexpected worker failure metric: #62297

---------
Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
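The emission condition described above boils down to a small predicate. This is a hedged sketch; the worker fields and function name are hypothetical rather than the raylet's actual data model:

```python
def should_emit_unexpected_idle_worker_failure(worker) -> bool:
    """Return True only for a registered worker holding neither a lease nor an actor."""
    if not worker.is_registered:
        # A client for an unregistered process disconnecting is not an
        # unexpected idle worker failure; skip it.
        return False
    # Idle workers are registered but have neither a lease id nor an actor id.
    return worker.lease_id is None and worker.actor_id is None
```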
…orker nodes for local clusters (#62169)

## Description
`ray down` on an SSH Docker cluster stops the head container but **skips all workers**; their Docker containers keep running indefinitely.

**Root cause:** Two separate `LocalNodeProvider` instances maintain independent state files: one on the machine invoking `ray down` and one on the head node (managed by the autoscaler). Workers are only ever marked `"running"` by the head's autoscaler in its own `/tmp/ray/cluster-<name>.state` file. The invoking machine's state file initializes workers as `"terminated"` in `ClusterState.__init__` and never receives those updates. When `teardown_cluster` calls `remaining_nodes()` → `provider.non_terminated_nodes()`, all workers are filtered out, so the `docker stop` loop has nothing to iterate over.

**Fix:** Add `NodeProvider.get_all_node_ids(tag_filters)`, which returns all known node IDs regardless of state. The base class delegates to `non_terminated_nodes()` (no behavior change for cloud providers, which query live infrastructure). `LocalNodeProvider` overrides it to skip the `state == "terminated"` filter. `teardown_cluster` now uses `get_all_node_ids` to build the Docker stop target list, ensuring worker containers are stopped even when the local state file is stale.

## Related issues
Closes: #62058

## Additional information
**Files changed:**
- `python/ray/autoscaler/node_provider.py`: Added `get_all_node_ids()` to the base `NodeProvider` class (defaults to `non_terminated_nodes`)
- `python/ray/autoscaler/_private/local/node_provider.py`: `LocalNodeProvider` override that includes terminated nodes
- `python/ray/autoscaler/_private/commands.py`: `teardown_cluster`'s Docker stop phase uses `get_all_node_ids` instead of `remaining_nodes()`
- `python/ray/tests/test_coordinator_server.py`: Added `testGetAllNodeIdsIncludesTerminated` with a `_make_local_provider` helper

**Backward compatibility:** The `terminate_nodes` loop is unchanged (it still uses `non_terminated_nodes`). For cloud providers (AWS, GCP, Azure, etc.), `get_all_node_ids` delegates to `non_terminated_nodes`, so behavior is identical. The only change is that the Docker stop phase now targets all configured local nodes during teardown.

---------
Signed-off-by: dev-miro26 <devmiro26@gmail.com>
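A rough sketch of the new hook, based on the description above; the method bodies and the shape of the local state dictionary are assumptions, not the merged code in `python/ray/autoscaler/`:

```python
class NodeProvider:
    def get_all_node_ids(self, tag_filters):
        # Base-class default: cloud providers query live infrastructure, so
        # "all known nodes" and "non-terminated nodes" are the same set.
        return self.non_terminated_nodes(tag_filters)


class LocalNodeProvider(NodeProvider):
    def get_all_node_ids(self, tag_filters):
        # Unlike non_terminated_nodes(), do NOT drop nodes whose local state
        # file entry says "terminated": their Docker containers may still be up.
        nodes = self.state.get()  # assumed shape: node_id -> {"tags": ..., "state": ...}
        return [
            node_id
            for node_id, info in nodes.items()
            if all(info["tags"].get(k) == v for k, v in tag_filters.items())
        ]
```

Because only the Docker stop phase of `teardown_cluster` switches to this call, the termination path and every cloud provider keep their existing behavior.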
…ease test (#62662)

## Description
- `release/serve_tests/workloads/router_microbenchmark.py`: Router benchmark. This is the exact benchmark that #62323 uses.
- `release/serve_tests/workloads/plot_router_benchmark.py`: Plot script (requires a manual trigger; it doesn't run in release tests).

## Related issues

## Additional information
<img width="1453" height="394" alt="Screenshot 2026-04-16 at 12 45 05 AM" src="https://github.com/user-attachments/assets/b6307a22-5cda-4073-8de4-d1174c7d25c8" />

---------
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Abrar Sheikh <abrar@anyscale.com>
## Description
As titled. I ran this with and without the change and found about a 25% reduction in time spent in deserialization on a 100-node cluster (the map_benchmark.py release test).

Before:
<img width="637" height="315" alt="image" src="https://github.com/user-attachments/assets/abd70765-5242-4175-b86c-39abf484ebdb" />

After:
<img width="697" height="315" alt="image" src="https://github.com/user-attachments/assets/1fa7dc06-c193-4897-8abb-e4a787f49533" />

## Related issues

## Additional information

---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
See Commits and Changes for more details.
Created by pull[bot] (v2.0.0-alpha.4)