[pull] master from ray-project:master #4075

Merged

pull[bot] merged 6 commits into miqdigital:master from ray-project:master
Apr 22, 2026

Conversation


pull[bot] commented Apr 22, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

edoakes and others added 6 commits April 22, 2026 09:39
….cc` (#62557)

We have a temporary `RedisContext` used for `RedisDelKeyPrefixSync`, but
it was held in a `shared_ptr` even though its arguments were stack
allocated, which is somewhat dangerous. To avoid any lifetime issues in
the future, this change makes the `RedisContext` stack allocated as well.

The `RedisStoreClient` now calls `make_shared` and `Connect` for its
`RedisContext` directly in its constructor.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Commit 6634f27 generated doctest_depset_py3.10.lock from a tree that
predated commit 8b6c71e ("[train][doc] Fix flaky doc test for
TorchDetectionPredictor"), which added boto3==1.29.7 to
python/requirements/ml/py313/{train,tune}-test-requirements.txt. The
boto3 entry in the lock file was missing the two "# via -r ..."
provenance comments, causing raydepsets --check to fail in premerge CI.
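The failure mode above (a pinned package missing its "# via" provenance
comments) can be sketched with a toy checker. This is a simplified,
hypothetical reading of pip-compile-style lock files, not the real
`raydepsets` implementation:

```python
def missing_provenance(lock_lines):
    """Return pinned packages that lack a '# via ...' provenance comment.

    Assumes the pip-compile lock-file shape where each 'name==version'
    line is followed by at least one indented '# via ...' comment.
    (Illustrative sketch only.)
    """
    missing = []
    for i, line in enumerate(lock_lines):
        # Skip comment lines; a pin looks like 'boto3==1.29.7'.
        if "==" in line and not line.lstrip().startswith("#"):
            nxt = lock_lines[i + 1] if i + 1 < len(lock_lines) else ""
            if "# via" not in nxt:
                missing.append(line.split("==")[0].strip())
    return missing
```

A checker like this would flag the regenerated lock file described above,
since the `boto3` pin carried no provenance comments.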

Topic: fix-raydepset-doc
Signed-off-by: andrew <andrew@anyscale.com>

Signed-off-by: andrew <andrew@anyscale.com>
…ING (#62751)

## Description
During controller recovery, live replica actors are re-registered in
`RECOVERING` state before they transition to `RUNNING`. The long-poll
broadcast (`broadcast_running_replicas_if_changed`) only includes
`RUNNING`/`PENDING_MIGRATION` replicas, so as replicas complete recovery
one by one, each transition triggered a partial broadcast.

Proxies and routers applied that snapshot immediately, briefly **routing
to a reduced replica set** (or none at all if no replicas had finished
recovering yet).

## Related issues
Fixes #62728 

## Additional information
Skip the broadcast if any replicas are still `RECOVERING`, and don't
reset the "dirty flag" `_broadcasted_replicas_set_changed`.
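The deferral rule can be sketched as follows. This is an illustrative
toy, not the actual Serve controller code; the `Replica` class, the
`state` dict, and the function shape are assumptions:

```python
from dataclasses import dataclass


@dataclass
class Replica:
    replica_id: str
    status: str  # "RECOVERING", "RUNNING", or "PENDING_MIGRATION"


def broadcast_running_replicas_if_changed(replicas, state, broadcast):
    """Defer the long-poll broadcast while any replica is recovering.

    `state` carries the dirty flag and the last broadcast set;
    `broadcast` is only called once no replica is RECOVERING.
    """
    if any(r.status == "RECOVERING" for r in replicas):
        # Defer: leave the dirty flag set so the broadcast fires later.
        return False
    running = {
        r.replica_id
        for r in replicas
        if r.status in ("RUNNING", "PENDING_MIGRATION")
    }
    if state.get("dirty") or running != state.get("last_broadcast"):
        broadcast(running)
        state["last_broadcast"] = running
        state["dirty"] = False
        return True
    return False
```

With this shape, partial recoveries produce no broadcast at all, and a
single broadcast with the full replica set fires once the last replica
transitions out of `RECOVERING`.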

Added `test_broadcast_deferred_while_replicas_recovering`, which asserts:
1. No `DEPLOYMENT_TARGETS` broadcast while all replicas are
`RECOVERING`.
2. No broadcast during partial recovery (2 of 3 done, 1 still
`RECOVERING`).
3. A single broadcast fires with all 3 replicas once the last one
transitions.

## Downsides

This change does introduce performance/guarantee regressions in the
following cases:
- ServeController dies AND a replica dies -> the router's stale cache
contains the dead replica's handle
- ServeController dies AND a replica is added -> the router's stale cache
does not contain the fresh replica's handle

The cache is now refreshed only once ALL replicas have checked in, so the
wall-clock time to a fresh cache is dominated by the slowest replica.

Overall, the decision for this approach (delaying the broadcast until no
replicas are still `RECOVERING`) seems to come down to **how tolerant we
are of a stale cache.**

## Alternative
- The grace period suggested in the root issue; this would reduce the
number of blips but not provide a guarantee (the grace period must be
constant, while recovery time is not).

Signed-off-by: Soham Rajpure <srajpure@outlook.com>
Co-authored-by: Abrar Sheikh <abrar@anyscale.com>
## Description
1. **Docstring security warnings** on `Tuner.restore` and
`BaseTrainer.restore`. These APIs deserialize `tuner.pkl` /
`trainer.pkl` and other experiment-state files from the supplied `path`
(including remote URIs) using `pickle` / `cloudpickle` before any
validation, so loading from a path an untrusted party can write to is
equivalent to executing arbitrary Python code. The new `.. warning::`
block makes this constraint explicit, matching the convention used by
`pickle`, `numpy.load`, and `torch.load`.
2. **Opt-in cloudpickle expansion in `TuneFunctionDecoder`**.
`TuneFunctionEncoder` may embed a cloudpickle blob inside JSON output
under a `CLOUDPICKLE_FALLBACK` marker for objects that cannot be
JSON-encoded. The matching decoder previously expanded those blobs
silently, turning every `experiment_state-*.json` into a code-execution
sink that looks like data. The decoder now refuses to expand
`CLOUDPICKLE_FALLBACK` payloads by default and raises `ValueError`
instead. Tune-internal callers that load state written by Tune itself
opt in explicitly through a new private helper,
`_loads_with_cloudpickle`.

The four internal call sites that legitimately need to expand
cloudpickle blobs (`tune_controller.py`, `experiment_analysis.py`,
`experiment/trial.py`, and `trainable/metadata.py`) are updated to use the
helper. Any other caller using `json.loads(s, cls=TuneFunctionDecoder)`
on a document that contains an embedded cloudpickle payload will now get
a clear `ValueError` instead of silently executing it.

## Test

Unit tests.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
#62584)

## Description
Improve the TPU section of the Ray Train scaling/accelerators user
guide:
- Clarify that `topology` and `accelerator_type` are required for all
`use_tpu=True` usage.
- Document multi-slice TPU support: `num_workers` can be a multiple of
the VM count to launch multiple slices.
- List all valid TPU accelerator types: TPU-V2, TPU-V3, TPU-V4, TPU-V5P,
TPU-V5LITEPOD, TPU-V6E, TPU-V7X.
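The multi-slice sizing rule documented above can be illustrated with a
small helper. This is a hypothetical sketch of the arithmetic only, not a
Ray Train API:

```python
def num_slices(num_workers: int, vms_per_slice: int) -> int:
    """Sizing check for the documented rule: num_workers must be a
    positive multiple of the per-slice VM count; the quotient is the
    number of TPU slices launched. (Illustrative helper.)"""
    if vms_per_slice <= 0:
        raise ValueError("vms_per_slice must be positive")
    if num_workers <= 0 or num_workers % vms_per_slice != 0:
        raise ValueError(
            f"num_workers={num_workers} must be a positive multiple of "
            f"the VM count per slice ({vms_per_slice})"
        )
    return num_workers // vms_per_slice
```

For example, with 4 VMs per slice, `num_workers=8` launches two slices,
while `num_workers=6` is rejected.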


## Additional information

Created a redirect:
<img width="1524" height="158" alt="image"
src="https://github.com/user-attachments/assets/7bb7b631-4670-4e31-a590-6b60e6183cb5"
/>

`make develop` && `make local`
<img width="1882" height="1386" alt="e70ba6d8873321e9ab649873d043f696"
src="https://github.com/user-attachments/assets/f2a68d70-4ed2-450f-9905-51f6bae0d69d"
/>

<img width="1936" height="1528" alt="cdc7bc8a93f060acf275a03d453c454c"
src="https://github.com/user-attachments/assets/3342437d-bc57-498f-8896-ed7333f34138"
/>
<img width="1894" height="906" alt="35918821801237cbbef0bee411793017"
src="https://github.com/user-attachments/assets/adb35fb7-df97-4f25-b575-2984602c760e"
/>
<img width="1816" height="984" alt="2aafc2ba7313a4b1517a65ef9426aa64"
src="https://github.com/user-attachments/assets/b6aa49f5-42e0-4ac8-94e4-e3139076709a"
/>

<img width="1808" height="1556" alt="2f4f793892da067df8736188f2305853"
src="https://github.com/user-attachments/assets/65b1a0f3-5dd5-4164-8646-33c05a3397e9"
/>
<img width="1830" height="840" alt="4a4445a76f39151adc293d4cdf9f10a7"
src="https://github.com/user-attachments/assets/ac1e0ab7-7197-43d0-8cc1-8b52ad1ab20a"
/>

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
…_and_test_init (#62737)

Follow-up to the chunked release_tests.json upload: have
`custom_image_build_and_test_init` itself call `buildkite-agent pipeline
upload` for each chunk it writes, gated behind a new
`--upload-to-buildkite` flag. The bash init script drops its while loop
and just passes the flag through.
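The flag-gated per-chunk behavior can be sketched as a pure function that
builds one `buildkite-agent pipeline upload` invocation per chunk. The
function name and shape are assumptions, not the actual tooling:

```python
def chunk_upload_commands(chunk_paths, upload_to_buildkite=False):
    """Build one `buildkite-agent pipeline upload` command per written
    chunk, gated behind the flag. Returns argv lists for a subprocess
    runner to execute. (Illustrative sketch.)"""
    if not upload_to_buildkite:
        # Flag off: chunks are written but nothing is uploaded here.
        return []
    return [
        ["buildkite-agent", "pipeline", "upload", path]
        for path in chunk_paths
    ]
```

Keeping command construction separate from execution makes the gating
trivially testable without a Buildkite agent present.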

Signed-off-by: andrew <andrew@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
pull[bot] locked and limited conversation to collaborators Apr 22, 2026
pull[bot] added the ⤵️ pull label Apr 22, 2026
pull[bot] merged commit 622f071 into miqdigital:master Apr 22, 2026