
[pull] master from ray-project:master#4076

Merged
pull[bot] merged 6 commits into miqdigital:master from ray-project:master
Apr 23, 2026
Conversation


pull[bot] commented Apr 23, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

Kunchd and others added 6 commits April 22, 2026 12:57
…me killing policy (#62643)

## Description
This PR replaces the existing by group killing policy with the time
based worker killing policy.
**Note:** This introduces a behavioral change in the next version,
where:
1. We no longer prioritize killing large worker groups first. 
2. We can now select multiple workers to kill at a time when memory
pressure is detected.

The existing by-group killing policy selects the worker to kill based on
group size (where a group is defined by the number of workers belonging
to the same owner), retryability, and submission time. Additionally, it
only selects a single worker to kill at a time. This is problematic in
situations where eliminating a single worker is insufficient to bring
the system back below the memory threshold, so memory pressure is not
reduced. Sorting by group size also doesn't achieve our goal of always
preserving the tasks that have completed more work (estimated by elapsed
time since the start of task execution). Finally, in most workloads
there is typically only a single owner for all tasks, so sorting by
groups is redundant.

The new killing policy addresses these issues by always prioritizing
workers with longer execution duration in order to preserve more work,
and it can kill multiple workers when necessary to put us back under the
killing threshold.
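The selection logic described above can be sketched as follows. This is a minimal Python illustration, not the actual implementation (which lives in Ray's memory monitor); the `Worker` dataclass, `select_workers_to_kill` name, and the per-worker memory accounting are all assumptions made for the example:

```python
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: str
    task_start_time: float  # when the current task began executing
    memory_used: int        # bytes attributed to this worker


def select_workers_to_kill(workers, current_memory, memory_threshold):
    """Pick workers to kill until projected memory falls below the threshold.

    Workers whose tasks started most recently (least elapsed work) are
    killed first, preserving tasks that have completed more work. Unlike
    the by-group policy, this may return more than one worker.
    """
    # Most recently started first => least work lost per kill.
    candidates = sorted(workers, key=lambda w: w.task_start_time, reverse=True)
    to_kill = []
    projected = current_memory
    for w in candidates:
        if projected <= memory_threshold:
            break
        to_kill.append(w)
        projected -= w.memory_used
    return to_kill
```

Note how killing stops as soon as the projected usage drops under the threshold, which is what allows multiple workers to be selected in a single pass when one kill is not enough.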

## Additional information
PR that introduced the time based killing policy:
#61323

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
Image classification uses
`s3://ray-benchmark-data-internal-us-west-2/imagenet/parquet_split`,
which has less than 100 GB of data. We manually duplicate the data to
1 TB and save it to
`s3://ray-benchmark-data-internal-us-west-2/imagenet/parquet_split_1t`.
Use this dataset in the slow and super-slow tests.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ot registered. (#62789)

## Description
This PR addresses the following issue: 
* We should only emit unexpected idle worker failure when a registered
client without lease or actor id disconnects.

**We should only emit an unexpected idle worker failure when a
registered idle worker disconnects**
Previously, we emitted an unexpected idle worker failure when a client
for an unregistered process disconnected. This is inaccurate, since
idle workers are still registered but possess neither a lease id nor an
actor id. This PR addresses this by only emitting an unexpected idle
worker failure when a registered worker without a lease id and actor id
disconnects.
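The emission condition can be illustrated with a short Python sketch. The real check lives in the raylet's C++ disconnect handling; the dict-based worker state and the function name here are assumptions made purely for illustration:

```python
def should_emit_unexpected_idle_worker_failure(worker):
    """Emit the metric only for a *registered* idle worker.

    An idle worker is one that has completed registration but holds
    neither a lease id nor an actor id. Unregistered processes that
    disconnect (e.g. workers killed before registering) must not count.
    """
    return (
        worker.get("registered", False)
        and worker.get("lease_id") is None
        and worker.get("actor_id") is None
    )
```

Under the previous behavior, the `registered` check was effectively missing, so every disconnect without a lease or actor id (including OOM-killed unregistered processes) inflated the metric.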

The graphs below show the existing OOM graph and the unexpected worker
failure graph. As we can see, the two graphs are identical, indicating
that each OOM kill is also being emitted as an unexpected idle worker
failure.
<img width="1709" height="613" alt="image"
src="https://github.com/user-attachments/assets/d4336cbb-f0e4-40a6-b502-b6a2e0d5304e"
/>

The graph below shows the OOM graph and the unexpected worker failure
graph after the PR's fix. Here, we see that the OOM kill graph can now
increase independently of the unexpected idle worker failure graph.
<img width="1705" height="612" alt="image"
src="https://github.com/user-attachments/assets/9d13a837-d8fe-47c4-bb4c-c2b5ff581f5b"
/>


## Related issues

## Additional information
PR that introduced the unexpected worker failure metric:
#62297

---------

Signed-off-by: davik <davik@anyscale.com>
Co-authored-by: davik <davik@anyscale.com>
…orker nodes for local clusters (#62169)

## Description

`ray down` on an SSH Docker cluster stops the head container but **skips
all workers**. Their Docker containers keep running indefinitely.

**Root cause:** Two separate `LocalNodeProvider` instances maintain
independent state files — one on the machine invoking `ray down` and one
on the head node (managed by the autoscaler). Workers are only ever
marked `"running"` by the head's autoscaler in its own
`/tmp/ray/cluster-<name>.state` file. The invoking machine's state file
initializes workers as `"terminated"` in `ClusterState.__init__` and
never receives those updates. When `teardown_cluster` calls
`remaining_nodes()` → `provider.non_terminated_nodes()`, all workers are
filtered out, so the `docker stop` loop has nothing to iterate.

**Fix:** Add `NodeProvider.get_all_node_ids(tag_filters)` that returns
all known node IDs regardless of state. The base class delegates to
`non_terminated_nodes()` (no behavior change for cloud providers that
query live infrastructure). `LocalNodeProvider` overrides it to skip the
`state == "terminated"` filter. `teardown_cluster` now uses
`get_all_node_ids` to build the Docker stop target list, ensuring worker
containers are stopped even when the local state file is stale.
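The fix described above can be sketched in simplified form. Method signatures are abbreviated from the real `NodeProvider` API, and the state-dict constructor is an assumption made for the example:

```python
class NodeProvider:
    """Abbreviated base provider interface."""

    def non_terminated_nodes(self, tag_filters):
        raise NotImplementedError

    def get_all_node_ids(self, tag_filters):
        # Default: delegate, so cloud providers that query live
        # infrastructure see no behavior change.
        return self.non_terminated_nodes(tag_filters)


class LocalNodeProvider(NodeProvider):
    def __init__(self, state):
        # state: node_id -> {"state": ..., "tags": {...}}
        # (stand-in for the /tmp/ray/cluster-<name>.state file)
        self.state = state

    def _matches(self, info, tag_filters):
        return all(info["tags"].get(k) == v for k, v in tag_filters.items())

    def non_terminated_nodes(self, tag_filters):
        return [
            nid for nid, info in self.state.items()
            if info["state"] != "terminated" and self._matches(info, tag_filters)
        ]

    def get_all_node_ids(self, tag_filters):
        # Include nodes the local state file marks "terminated" -- the
        # head node's autoscaler may hold newer state this machine
        # never received, so the local file can be stale.
        return [
            nid for nid, info in self.state.items()
            if self._matches(info, tag_filters)
        ]
```

With this split, `teardown_cluster` can keep using `non_terminated_nodes` for the terminate loop while building the Docker stop target list from `get_all_node_ids`.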

## Related issues

Closes: #62058 

## Additional information

**Files changed:**
- `python/ray/autoscaler/node_provider.py` — Added `get_all_node_ids()`
to base `NodeProvider` class (defaults to `non_terminated_nodes`)
- `python/ray/autoscaler/_private/local/node_provider.py` —
`LocalNodeProvider` override that includes terminated nodes
- `python/ray/autoscaler/_private/commands.py` — `teardown_cluster`
Docker stop phase uses `get_all_node_ids` instead of `remaining_nodes()`
- `python/ray/tests/test_coordinator_server.py` — Added
`testGetAllNodeIdsIncludesTerminated` with `_make_local_provider` helper

**Backward compatibility:** The `terminate_nodes` loop is unchanged
(still uses `non_terminated_nodes`). For cloud providers (AWS, GCP,
Azure, etc.), `get_all_node_ids` delegates to `non_terminated_nodes`, so
behavior is identical. The only change is that Docker stop now targets
all configured local nodes during teardown.

---------

Signed-off-by: dev-miro26 <devmiro26@gmail.com>
…ease test (#62662)

## Description
- `release/serve_tests/workloads/router_microbenchmark.py`: Router
benchmark. This is the exact benchmark that
#62323 uses.
- `release/serve_tests/workloads/plot_router_benchmark.py`: Plot script
(requires manual trigger; doesn't run in release tests)


## Additional information
<img width="1453" height="394" alt="Screenshot 2026-04-16 at 12 45
05 AM"
src="https://github.com/user-attachments/assets/b6307a22-5cda-4073-8de4-d1174c7d25c8"
/>

---------

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Abrar Sheikh <abrar@anyscale.com>
## Description
As titled: I ran this with and without the change, and found about a 25%
reduction in time spent in deserialization.

On a 100-node cluster (`map_benchmark.py` release test):

Before:
<img width="637" height="315" alt="image"
src="https://github.com/user-attachments/assets/abd70765-5242-4175-b86c-39abf484ebdb"
/>

After:
<img width="697" height="315" alt="image"
src="https://github.com/user-attachments/assets/1fa7dc06-c193-4897-8abb-e4a787f49533"
/>



---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
pull[bot] locked and limited conversation to collaborators Apr 23, 2026
pull[bot] added the ⤵️ pull label Apr 23, 2026
pull[bot] merged commit ab9a7d7 into miqdigital:master Apr 23, 2026
