[pull] master from ray-project:master #4079
Merged
pull[bot] merged 10 commits into miqdigital:master from ray-project:master on Apr 24, 2026
Conversation
…ency_ms` (#62868)

`serve_long_poll_latency_ms` measures `receive_time − notify_timestamp`, where `notify_timestamp` is set once when the controller calls `notify_changed`. This timestamp is never reset: it stays frozen at the time of the *last data change* for that key.

When a brand-new `LongPollClient` starts (e.g. a new replica or proxy), its snapshot IDs are initialised to `-1`. The host's "stale snapshot" fast path returns the current value immediately, stamped with the original `notify_timestamp`. For long-lived, rarely-changing data (route tables, deployment configs), this produces latency observations of tens of minutes, not because propagation is slow, but because the client is bootstrapping against data that was last changed long ago.

Signed-off-by: abrar <abrar@anyscale.com>
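A minimal toy sketch of the skew described above; the names here are illustrative stand-ins, not Serve's actual internals:

```python
# Toy reproduction of the metric skew: the observation measures the age of
# the data, not the propagation delay.
import time

notify_timestamp = time.time()  # set once when the controller notifies a change
time.sleep(2)                   # stand-in for "no data changes for a long time"

# A new LongPollClient bootstraps with snapshot_id = -1, so the host's
# stale-snapshot fast path answers immediately with the current value...
receive_time = time.time()

# ...yet the recorded latency is computed against the frozen notify_timestamp.
serve_long_poll_latency_ms = (receive_time - notify_timestamp) * 1000
print(serve_long_poll_latency_ms)  # ~2000 ms, despite near-instant delivery
```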
## Description

I thought the task name was equal to the operator name, but unfortunately it is not:

- Regular tasks: same as the operator name
- Actor tasks: equal to `f"MapWorker({op_name}).submit"`

In the interest of keeping things simple, I decided not to filter using the core filter so that I can parse the raw results myself. I don't think this should have a dramatic impact because:

- This feature is still gated behind `detail=True`
- Core still does a full scan of the tasks anyway

I now filter by checking `name in t.name` (see the sketch below), since we know the operator name appears as a substring of the task name.

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
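A sketch of the substring-based filtering described above, using Ray's public state API; `op_name` is an assumed example operator name:

```python
# Fetch tasks with detail=True (the feature is gated behind it) and filter
# client-side, since actor tasks are named "MapWorker(<op_name>).submit"
# rather than the bare operator name.
from ray.util.state import list_tasks

op_name = "ReadParquet"  # assumed example operator name
tasks = list_tasks(detail=True, limit=10_000)
matching = [t for t in tasks if op_name in t.name]
```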
…ory (#62341) Signed-off-by: Joshua Lee <joshlee@anyscale.com>
## Description

This PR delegates the KubeRay autoscaler's TPU topology logic to the utils added in `ray/_private/accelerators/tpu.py`, consolidating the logic so that fewer changes are required for new TPU releases. This PR also adds the `tpu7x` node selector string to the known mapping so that the head resource is automatically added to the autoscaling config (see the sketch below).

Finally, this PR changes `v5e` to `v5litepod` to match what's set by GKE in the `TPU_ACCELERATOR_TYPE` env var, and for consistency with the rest of the code base. The GKE documentation is clear that v5e maps to v5litepod, so this should not cause confusion for users.

---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
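A hypothetical sketch of the kind of consolidated mapping this PR relies on; the real table lives in `ray/_private/accelerators/tpu.py`, and the exact keys and resource names below are assumptions for illustration:

```python
# Illustrative generation table: adding a new TPU release should only require
# a new entry here, rather than parallel changes in the KubeRay autoscaler.
TPU_GENERATION_TO_HEAD_RESOURCE = {
    "v5litepod": "TPU-v5litepod-head",  # "v5e" in older naming; GKE's
                                        # TPU_ACCELERATOR_TYPE uses v5litepod
    "v5p": "TPU-v5p-head",
    "v6e": "TPU-v6e-head",
    "tpu7x": "TPU-tpu7x-head",          # new selector string added by this PR
}

def head_resource_for(generation: str) -> str:
    # Look up the autoscaling-config head resource for a TPU generation.
    return TPU_GENERATION_TO_HEAD_RESOURCE[generation]
```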
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com> Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
…el locality options (#62487)

In #61614 we introduced the proto changes required to support label locality, i.e. to give the autoscaler enough information to understand what to scale up. In this PR, we populate this proto with the correct information for GPU domain label locality (illustrated in the sketch below) and add a couple of unit tests to verify this behavior.

---------

Signed-off-by: Joshua Lee <joshlee@anyscale.com>
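A pure-Python illustration (not a Ray API) of what "GPU domain label locality" gives the autoscaler: nodes sharing the same value for an assumed domain label form a group that workloads should be co-located in:

```python
from collections import defaultdict

# Assumed label key and node inventory, for illustration only.
nodes = [
    {"id": "n1", "labels": {"gpu-domain": "rack-a"}},
    {"id": "n2", "labels": {"gpu-domain": "rack-a"}},
    {"id": "n3", "labels": {"gpu-domain": "rack-b"}},
]

domains = defaultdict(list)
for node in nodes:
    domains[node["labels"]["gpu-domain"]].append(node["id"])

# With this information in the proto, the autoscaler can prefer scaling up
# inside a single domain: {'rack-a': ['n1', 'n2'], 'rack-b': ['n3']}
print(dict(domains))
```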
…s through state API (#62533)

Adds observability for placement group label domains through the state API, using `ray list placement-groups --detail` and the Ray dashboard. The main incentive is to give users an easy way to see which racks contain a specific placement group.

Here's an example of what the output looks like in the Ray dashboard when label domains are set and when they aren't:

<img width="441" height="151" alt="Screenshot 2026-04-12 at 9 44 36 PM" src="https://github.com/user-attachments/assets/212a4659-80c8-49b8-ac4d-8ca28b577953" />
<img width="794" height="268" alt="Screenshot 2026-04-12 at 9 43 56 PM" src="https://github.com/user-attachments/assets/f4d1ece2-0e54-4cf9-9b65-02af9c20a19e" />

---------

Signed-off-by: Joshua Lee <joshlee@anyscale.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
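The same information is reachable programmatically; a small sketch using the state API (which fields carry the label domains is an assumption here):

```python
# Python equivalent of `ray list placement-groups --detail`.
from ray.util.state import list_placement_groups

for pg in list_placement_groups(detail=True):
    # With detail=True, entries now surface the label domains (e.g. racks)
    # that each placement group landed in.
    print(pg.placement_group_id, pg.state)
```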
#61840 updated our benchmark utility to track object store memory spilling. Its implementation calls `ray.get_runtime_context()`, and that API implicitly starts a Ray cluster. Since the `does_not_over_provision` script explicitly calls `ray.init()`, it started failing with this error:

```
results_working_dirs_does_not_over_provision_kmvkzybtvq__anyscale_pkg_bb2d7f959c995d17beba491351814253/autoscaling/does_not_over_provision.py", line 11, in main
    ray.init()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 107, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 1832, in init
    raise RuntimeError(
RuntimeError: Maybe you called ray.init twice by accident? This error can be suppressed by passing in 'ignore_reinit_error=True' or by calling 'ray.shutdown()' prior to 'ray.init()'.
Subprocess return code: 1
```

To fix the failure, I've added a guard to check if Ray is initialized before calling `ray.init()` (see the sketch below). I've also changed the test's frequency to nightly so that we capture failures.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
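The guard itself is a one-liner; a minimal sketch of the fix:

```python
import ray

# Only initialize Ray if an earlier API call (e.g. ray.get_runtime_context())
# hasn't already started a cluster implicitly.
if not ray.is_initialized():
    ray.init()
```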
## Description

Problem: `get_allocated_resources` was called every ~1s from the scheduling loop but used a blocking `ray.get()`, so any actor queue delay or result transfer latency directly stalled dataset execution.

This PR makes `get_allocated_resources` non-blocking: it fires the remote call in the background and immediately returns the last cached value, updating the cache when the response arrives on the next loop step. The first call for a new requester returns `[]` while the initial response is in flight, resolving ~1s later. (See the sketch below.)

## Additional information

Currently, only `DefaultAutoscalingCoordinator.get_allocated_resources` is made non-blocking. `request_resources` and `cancel_request` remain blocking since they are not on the hot path.

Unit tests cover each behavior independently: in-flight caching, cache update on success, non-Ray error propagation, and failure counter escalation for both actor exceptions and timeouts.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: HFFuture <ray.huang@anyscale.com>
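A minimal sketch of the non-blocking pattern described above; the class and attribute names are assumptions, not the actual Ray Data internals:

```python
import ray

class NonBlockingResourceFetcher:
    def __init__(self, coordinator_actor):
        self._actor = coordinator_actor
        self._in_flight = None  # pending ObjectRef, if any
        self._cached = []       # last known allocated resources

    def get_allocated_resources(self):
        # Harvest a completed response without blocking (timeout=0).
        if self._in_flight is not None:
            ready, _ = ray.wait([self._in_flight], timeout=0)
            if ready:
                self._cached = ray.get(ready[0])
                self._in_flight = None
        # Fire the next request in the background if none is outstanding.
        if self._in_flight is None:
            self._in_flight = self._actor.get_allocated_resources.remote()
        # Returns [] until the first response arrives (~1s later).
        return self._cached
```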
See Commits and Changes for more details.
Created by pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )