
[pull] master from ray-project:master #4079

Merged
pull[bot] merged 10 commits into miqdigital:master from ray-project:master
Apr 24, 2026

Conversation


pull[bot] commented Apr 24, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

abrarsheikh and others added 10 commits April 23, 2026 13:03
…ency_ms` (#62868)

`serve_long_poll_latency_ms` measures `receive_time − notify_timestamp`,
where `notify_timestamp` is set once when the controller calls
`notify_changed`. This timestamp is never reset — it stays frozen at the
time of the *last data change* for that key.

When a brand-new `LongPollClient` starts (e.g. a new replica or proxy),
its snapshot IDs are initialised to `-1`. The host's "stale snapshot"
fast-path returns the current value immediately, stamped with the
original `notify_timestamp`. For long-lived, rarely-changing data (route
tables, deployment configs), this produces latency observations of tens
of minutes — not because propagation is slow, but because the client is
bootstrapping against data that was last changed long ago.
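
A minimal sketch of the failure mode described above (variable names are illustrative, not the actual Serve internals):

```python
import time

# Set once, when the controller calls notify_changed for this key; never reset.
notify_timestamp = time.time()

# ... a long time passes with no further changes to this key ...

# A new LongPollClient starts with snapshot_id == -1, so the host's
# "stale snapshot" fast path answers immediately, still carrying the
# original notify_timestamp.
receive_time = time.time()
latency_ms = (receive_time - notify_timestamp) * 1000
# latency_ms measures time since the last data change, not propagation delay,
# so for rarely-changing keys it can reach tens of minutes.
```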

Signed-off-by: abrar <abrar@anyscale.com>
## Description
I thought the task name was equal to the operator name, but unfortunately it is not:
- Regular tasks: same as the operator name
- Actor tasks: equal to `f"MapWorker(op_name).submit"`

In the interest of keeping things simple, I decided not to filter using
the core filter, so that I can parse the raw results myself. I don't
think this should have a dramatic impact because:
- This feature is still gated behind `detail=True`
- Core still does a full scan of the tasks anyway

I now filter by checking whether `name in t.name`, since we know that
`t.name` will contain the operator name as a substring.
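
A rough sketch of that filter (hypothetical variable names, not the exact implementation):

```python
# Keep a task if the operator name appears anywhere in its reported name; this
# covers both regular tasks (name == op_name) and actor tasks
# (name like "MapWorker(<op_name>).submit").
matching_tasks = [t for t in raw_task_results if op_name in t.name]
```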

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ory (#62341)

Signed-off-by: Joshua Lee <joshlee@anyscale.com>
## Description

This PR delegates the KubeRay autoscaler's TPU topology logic to the
utils added in `ray/_private/accelerators/tpu.py`, consolidating the
logic so that fewer changes are required for new TPU releases. This PR
also adds the `tpu7x` node selector string to the known mapping so that
the head resource is automatically added to the autoscaling config.
Finally, this PR changes `v5e` to `v5litepod` to match what GKE sets in
the `TPU_ACCELERATOR_TYPE` env var, and for consistency with the rest of
the code base. The GKE documentation makes it clear that v5e maps to
v5litepod, so this should not cause confusion for users.
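
To make the renaming concrete, a purely illustrative sketch of the idea (the real constants and helpers live in `ray/_private/accelerators/tpu.py`; the function name below is hypothetical):

```python
# Hypothetical helper, not the actual Ray code: report the TPU generation using
# the same string GKE sets in the TPU_ACCELERATOR_TYPE env var, so "v5e" is
# surfaced as "v5litepod".
def normalize_tpu_generation(generation: str) -> str:
    return "v5litepod" if generation == "v5e" else generation
```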



---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
…el locality options (#62487)

In #61614 we introduced the proto changes required to support label
locality, i.e. the autoscaler has enough information to understand what
to scale up. In this PR, we populate that proto with the correct
information for GPU domain label locality, and add a couple of unit
tests to verify this behavior.

---------

Signed-off-by: Joshua Lee <joshlee@anyscale.com>
…s through state API (#62533)

Adding observability for label domain labels through the state API, using
`ray list placement-groups --detail` and the Ray dashboard. The main
incentive for doing this is to give users an easy way to see which racks
contain a specific placement group. Here's an example of what the output
looks like in the Ray dashboard now, when label domains are set and when
they aren't set.

![Screenshot 2026-04-12 at 9 44 36 PM](https://github.com/user-attachments/assets/212a4659-80c8-49b8-ac4d-8ca28b577953)
![Screenshot 2026-04-12 at 9 43 56 PM](https://github.com/user-attachments/assets/f4d1ece2-0e54-4cf9-9b65-02af9c20a19e)
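
For reference, the same listing should also be reachable from Python through the state API; a minimal sketch using `ray.util.state.list_placement_groups`:

```python
from ray.util.state import list_placement_groups

# Equivalent of `ray list placement-groups --detail`: with detail=True the
# returned records include the extra fields, which is where the label domain
# information described above would show up.
for pg in list_placement_groups(detail=True):
    print(pg)
```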

---------

Signed-off-by: Joshua Lee <joshlee@anyscale.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
#61840 updated our benchmark
utility to track object store memory spilling. Its implementation calls
`ray.get_runtime_context()`, and that API implicitly starts a Ray
cluster.
 
Since the `does_not_over_provision` script explicitly calls
`ray.init()`, it started failing with this error:

```
results_working_dirs_does_not_over_provision_kmvkzybtvq__anyscale_pkg_bb2d7f959c995d17beba491351814253/autoscaling/does_not_over_provision.py", line 11, in main
--
ray.init()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 107, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 1832, in init
raise RuntimeError(
RuntimeError: Maybe you called ray.init twice by accident? This error can be suppressed by passing in 'ignore_reinit_error=True' or by calling 'ray.shutdown()' prior to 'ray.init()'.
Subprocess return code: 1
```

To fix the failure, I've added a guard to check if Ray is initialized
before calling `ray.init()`. I've also changed the frequency to nightly
so that we capture failures in this test.
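
The guard is essentially the standard pattern below (`ray.is_initialized()` is part of the public API):

```python
import ray

# Only start a new driver connection if one isn't already active; this avoids
# the "called ray.init twice" RuntimeError when another code path (here,
# ray.get_runtime_context()) has already initialized Ray.
if not ray.is_initialized():
    ray.init()
```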

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Description

Problem: `get_allocated_resources` was called every ~1s from the
scheduling loop but used a blocking `ray.get()`, so any actor queue
delay or result transfer latency directly stalled dataset execution.

This PR makes `get_allocated_resources` non-blocking: it fires the
remote call in the background and immediately returns the last cached
value, updating the cache when the response arrives on the next loop
iteration. The first call for a new requester returns `[]` while the
initial response is in flight, resolving ~1s later.
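
A hedged sketch of the pattern, assuming a handle to the coordinator actor; class and attribute names here are illustrative, not the actual `DefaultAutoscalingCoordinator` internals:

```python
import ray


class NonBlockingResourceReader:
    """Fire the remote call in the background and always return the cached value."""

    def __init__(self, coordinator_actor):
        self._actor = coordinator_actor
        self._cached = []        # last known result; [] until the first reply lands
        self._in_flight = None   # ObjectRef of the outstanding call, if any

    def get_allocated_resources(self):
        # Harvest a finished reply without blocking (timeout=0 returns immediately).
        if self._in_flight is not None:
            done, _ = ray.wait([self._in_flight], timeout=0)
            if done:
                self._cached = ray.get(done[0])
                self._in_flight = None
        # Kick off a new request if none is outstanding.
        if self._in_flight is None:
            self._in_flight = self._actor.get_allocated_resources.remote()
        # Return immediately with whatever we last saw.
        return self._cached
```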

## Additional information

Currently, only `DefaultAutoscalingCoordinator.get_allocated_resources`
is made non-blocking. `request_resources` and `cancel_request` remain
blocking since they are not on the hot path.

Unit tests cover each behavior independently: in-flight caching, cache
update on success, non-Ray error propagation, and failure counter
escalation for both actor exceptions and timeouts.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: HFFuture <ray.huang@anyscale.com>
pull[bot] locked and limited conversation to collaborators Apr 24, 2026
pull[bot] added the ⤵️ pull label Apr 24, 2026
pull[bot] merged commit 1112d90 into miqdigital:master Apr 24, 2026
