
[pull] master from ray-project:master #4079

Merged
pull[bot] merged 10 commits into miqdigital:master from ray-project:master
Apr 24, 2026

Conversation


pull[bot] commented Apr 24, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

abrarsheikh and others added 10 commits April 23, 2026 13:03
…ency_ms` (#62868)

`serve_long_poll_latency_ms` measures `receive_time − notify_timestamp`,
where `notify_timestamp` is set once when the controller calls
`notify_changed`. This timestamp is never reset — it stays frozen at the
time of the *last data change* for that key.

When a brand-new `LongPollClient` starts (e.g. a new replica or proxy),
its snapshot IDs are initialised to `-1`. The host's "stale snapshot"
fast-path returns the current value immediately, stamped with the
original `notify_timestamp`. For long-lived, rarely-changing data (route
tables, deployment configs), this produces latency observations of tens
of minutes — not because propagation is slow, but because the client is
bootstrapping against data that was last changed long ago.
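
A minimal sketch of the failure mode described above (variable names are illustrative, not the actual Serve internals):

```python
import time

# Set once, when the controller calls notify_changed for this key; never reset.
notify_timestamp = time.time()

# ... a long time passes with no further changes to this key ...

# A new LongPollClient starts with snapshot_id == -1, so the host's
# "stale snapshot" fast path answers immediately, still carrying the
# original notify_timestamp.
receive_time = time.time()
latency_ms = (receive_time - notify_timestamp) * 1000
# latency_ms measures time since the last data change, not propagation delay,
# so for rarely-changing keys it can reach tens of minutes.
```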

Signed-off-by: abrar <abrar@anyscale.com>
## Description
I thought the task name was equal to the operator name, but unfortunately it is not:
- Regular tasks: same as the operator name
- Actor tasks: equal to `f"MapWorker(op_name).submit"`

In the interest of keeping things simple, I decided not to filter using
the core filter, so that I can parse the raw results myself. I don't
think this should have a dramatic impact because:
- This feature is still gated behind `detail=True`
- Core still does a full scan of the tasks anyway

I now filter by checking whether `name in t.name`, since we know that
`t.name` will contain the operator name as a substring.
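
A rough sketch of that filter (hypothetical variable names, not the exact implementation):

```python
# Keep a task if the operator name appears anywhere in its reported name; this
# covers both regular tasks (name == op_name) and actor tasks
# (name like "MapWorker(<op_name>).submit").
matching_tasks = [t for t in raw_task_results if op_name in t.name]
```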

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ory (#62341)

Signed-off-by: Joshua Lee <joshlee@anyscale.com>
## Description

This PR delegates the KubeRay autoscaler's TPU topology logic to the
utils added in `ray/_private/accelerators/tpu.py`, consolidating the
logic so that fewer changes are required for new TPU releases. This PR
also adds the `tpu7x` node selector string to the known mapping so that
the head resource is automatically added to the autoscaling config.
Finally, this PR changes `v5e` to `v5litepod` to match what GKE sets in
the `TPU_ACCELERATOR_TYPE` env var, and for consistency with the rest of
the code base. The GKE documentation makes it clear that v5e maps to
v5litepod, so this should not cause confusion for users.
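
To make the renaming concrete, a purely illustrative sketch of the idea (the real constants and helpers live in `ray/_private/accelerators/tpu.py`; the function name below is hypothetical):

```python
# Hypothetical helper, not the actual Ray code: report the TPU generation using
# the same string GKE sets in the TPU_ACCELERATOR_TYPE env var, so "v5e" is
# surfaced as "v5litepod".
def normalize_tpu_generation(generation: str) -> str:
    return "v5litepod" if generation == "v5e" else generation
```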



---------

Signed-off-by: ryanaoleary <ryanaoleary@google.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
…el locality options (#62487)

In #61614 we introduced the proto changes required to support label
locality, i.e. the autoscaler has enough information to understand what
to scale up. In this PR, we populate that proto with the correct
information for GPU domain label locality, and add a couple of unit
tests to verify this behavior.

---------

Signed-off-by: Joshua Lee <joshlee@anyscale.com>
…s through state API (#62533)

Adding observability for label domain labels through the state API, using
`ray list placement-groups --detail` and the Ray dashboard. The main
incentive for doing this is to give users an easy way to see which racks
contain a specific placement group. Here's an example of what the output
looks like in the Ray dashboard now, when label domains are set and when
they aren't set.

![Screenshot 2026-04-12 at 9 44 36 PM](https://github.com/user-attachments/assets/212a4659-80c8-49b8-ac4d-8ca28b577953)
![Screenshot 2026-04-12 at 9 43 56 PM](https://github.com/user-attachments/assets/f4d1ece2-0e54-4cf9-9b65-02af9c20a19e)
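
For reference, the same listing should also be reachable from Python through the state API; a minimal sketch using `ray.util.state.list_placement_groups`:

```python
from ray.util.state import list_placement_groups

# Equivalent of `ray list placement-groups --detail`: with detail=True the
# returned records include the extra fields, which is where the label domain
# information described above would show up.
for pg in list_placement_groups(detail=True):
    print(pg)
```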

---------

Signed-off-by: Joshua Lee <joshlee@anyscale.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
#61840 updated our benchmark
utility to track object store memory spilling. Its implementation calls
`ray.get_runtime_context()`, and that API implicitly starts a Ray
cluster.
 
Since the `does_not_over_provision` script explicitly calls
`ray.init()`, it started failing with this error:

```
results_working_dirs_does_not_over_provision_kmvkzybtvq__anyscale_pkg_bb2d7f959c995d17beba491351814253/autoscaling/does_not_over_provision.py", line 11, in main
--
ray.init()
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 107, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 1832, in init
raise RuntimeError(
RuntimeError: Maybe you called ray.init twice by accident? This error can be suppressed by passing in 'ignore_reinit_error=True' or by calling 'ray.shutdown()' prior to 'ray.init()'.
Subprocess return code: 1
```

To fix the failure, I've added a guard to check if Ray is initialized
before calling `ray.init()`. I've also changed the frequency to nightly
so that we capture failures in this test.
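
The guard is essentially the standard pattern below (`ray.is_initialized()` is part of the public API):

```python
import ray

# Only start a new driver connection if one isn't already active; this avoids
# the "called ray.init twice" RuntimeError when another code path (here,
# ray.get_runtime_context()) has already initialized Ray.
if not ray.is_initialized():
    ray.init()
```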

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Description

Problem: `get_allocated_resources` was called every ~1s from the
scheduling loop but used a blocking `ray.get()`, so any actor queue
delay or result transfer latency directly stalled dataset execution.

This PR makes `get_allocated_resources` non-blocking: it fires the
remote call in the background and immediately returns the last cached
value, updating the cache when the response arrives on the next loop
iteration. The first call for a new requester returns `[]` while the
initial response is in flight, resolving ~1s later.
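
A hedged sketch of the pattern, assuming a handle to the coordinator actor; class and attribute names here are illustrative, not the actual `DefaultAutoscalingCoordinator` internals:

```python
import ray


class NonBlockingResourceReader:
    """Fire the remote call in the background and always return the cached value."""

    def __init__(self, coordinator_actor):
        self._actor = coordinator_actor
        self._cached = []        # last known result; [] until the first reply lands
        self._in_flight = None   # ObjectRef of the outstanding call, if any

    def get_allocated_resources(self):
        # Harvest a finished reply without blocking (timeout=0 returns immediately).
        if self._in_flight is not None:
            done, _ = ray.wait([self._in_flight], timeout=0)
            if done:
                self._cached = ray.get(done[0])
                self._in_flight = None
        # Kick off a new request if none is outstanding.
        if self._in_flight is None:
            self._in_flight = self._actor.get_allocated_resources.remote()
        # Return immediately with whatever we last saw.
        return self._cached
```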

## Additional information

Currently, only `DefaultAutoscalingCoordinator.get_allocated_resources`
is made non-blocking. `request_resources` and `cancel_request` remain
blocking since they are not on the hot path.

Unit tests cover each behavior independently: in-flight caching, cache
update on success, non-Ray error propagation, and failure counter
escalation for both actor exceptions and timeouts.

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: HFFuture <ray.huang@anyscale.com>
pull[bot] locked and limited conversation to collaborators Apr 24, 2026
pull[bot] added the ⤵️ pull label Apr 24, 2026
pull[bot] merged commit 1112d90 into miqdigital:master Apr 24, 2026
