[pull] master from ray-project:master#4075
Merged
pull[bot] merged 6 commits into miqdigital:master from ray-project:master on Apr 22, 2026
Conversation
….cc` (#62557) We have a temporary `RedisContext` that is used for `RedisDelKeyPrefixSync`, but it was made a `shared_ptr`. This is somewhat dangerous because its arguments were stack allocated. To avoid any lifetime issues in the future, this refactors the `RedisContext` to be stack allocated as well. The `RedisStoreClient` now calls `make_shared` and `Connect` for its `RedisContext` directly in its constructor. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Commit 6634f27 generated doctest_depset_py3.10.lock from a tree that predated commit 8b6c71e ("[train][doc] Fix flaky doc test for TorchDetectionPredictor"), which added boto3==1.29.7 to python/requirements/ml/py313/{train,tune}-test-requirements.txt. The boto3 entry in the lock file was missing the two "# via -r ..." provenance comments, causing raydepsets --check to fail in premerge CI. Topic: fix-raydepset-doc Signed-off-by: andrew <andrew@anyscale.com>
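For context, the missing provenance comments are the `# via` annotations that pip-compile-style lockers write under each pinned package to record which requirements file pulled it in. The restored entry would look roughly like this (illustrative shape only, not the exact lock-file content):

```
boto3==1.29.7
    # via
    #   -r python/requirements/ml/py313/train-test-requirements.txt
    #   -r python/requirements/ml/py313/tune-test-requirements.txt
```

`raydepsets --check` regenerates the lock and diffs it against the committed file, so even comment-only drift like this fails CI.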
…ING (#62751)

## Description
During controller recovery, live replica actors are re-registered in `RECOVERING` state before they transition to `RUNNING`. The long-poll broadcast (`broadcast_running_replicas_if_changed`) only includes `RUNNING`/`PENDING_MIGRATION` replicas, so as replicas completed recovery one by one, each transition triggered a partial broadcast. Proxies and routers applied that snapshot immediately, briefly **routing to a reduced replica set** (or none at all if no replicas had finished recovering yet).

## Related issues
Fixes #62728

## Additional information
Skip the broadcast if any replica is still `RECOVERING`, and don't reset the "dirty flag" `_broadcasted_replicas_set_changed`. Added `test_broadcast_deferred_while_replicas_recovering`, which asserts:
1. No `DEPLOYMENT_TARGETS` broadcast while all replicas are `RECOVERING`.
2. No broadcast during partial recovery (2 of 3 done, 1 still `RECOVERING`).
3. A single broadcast fires with all 3 replicas once the last one transitions.

## Downsides
This change does introduce performance/freshness regressions in the following cases:
- The controller dies AND a replica dies -> the router's stale cache contains the dead replica's handle.
- The controller dies AND a replica is added -> the router's stale cache does not contain the fresh replica's handle.

The cache is now updated only once ALL replicas have checked in, so the wall-clock time is dominated by the slowest replica. The overall decision for this approach (delaying the broadcast until no replica is still `RECOVERING`) comes down to **how tolerant we are to a stale cache**.

## Alternative
The grace period suggested in the root issue; this would reduce the number of blips but not provide a guarantee (the grace period must be constant, while recovery time is not).

Signed-off-by: Soham Rajpure <srajpure@outlook.com>
Co-authored-by: Abrar Sheikh <abrar@anyscale.com>
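The deferred-broadcast logic described above can be sketched as follows. This is a minimal stand-alone sketch, not Ray Serve's actual code: `DeploymentStateSketch`, its fields, and the recorded-broadcast list are simplified stand-ins for the controller's real state machine.

```python
# Sketch: defer the long-poll broadcast until no replica is RECOVERING.
from enum import Enum, auto


class ReplicaState(Enum):
    RECOVERING = auto()
    RUNNING = auto()
    PENDING_MIGRATION = auto()


class DeploymentStateSketch:
    def __init__(self, replicas):
        self.replicas = replicas  # {replica_id: ReplicaState}
        self._dirty = False       # stand-in for _broadcasted_replicas_set_changed
        self.broadcasts = []      # recorded broadcast payloads, for inspection

    def mark_changed(self):
        self._dirty = True

    def broadcast_running_replicas_if_changed(self):
        if not self._dirty:
            return
        # The fix: if any replica is still recovering, skip this partial
        # broadcast entirely and leave the dirty flag set, so the full
        # snapshot goes out once the last replica finishes recovery.
        if any(s is ReplicaState.RECOVERING for s in self.replicas.values()):
            return
        running = sorted(
            rid
            for rid, s in self.replicas.items()
            if s in (ReplicaState.RUNNING, ReplicaState.PENDING_MIGRATION)
        )
        self.broadcasts.append(running)
        self._dirty = False
```

Driving this sketch through the test's three phases (all recovering, 2 of 3 recovered, all recovered) produces no broadcast, no broadcast, then a single broadcast containing all three replicas.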
## Description
1. **Docstring security warnings** on `Tuner.restore` and `BaseTrainer.restore`. These APIs deserialize `tuner.pkl` / `trainer.pkl` and other experiment-state files from the supplied `path` (including remote URIs) using `pickle` / `cloudpickle` before any validation, so loading from a path an untrusted party can write to is equivalent to executing arbitrary Python code. The new `.. warning::` block makes this constraint explicit, matching the convention used by `pickle`, `numpy.load`, and `torch.load`.
2. **Opt-in cloudpickle expansion in `TuneFunctionDecoder`**. `TuneFunctionEncoder` may embed a cloudpickle blob inside JSON output under a `CLOUDPICKLE_FALLBACK` marker for objects that cannot be JSON-encoded. The matching decoder previously expanded those blobs silently, turning every `experiment_state-*.json` into a code-execution sink that looks like data. The decoder now refuses to expand `CLOUDPICKLE_FALLBACK` payloads by default and raises `ValueError` instead. Tune-internal callers that load state Tune itself just wrote opt in explicitly through a new private helper, `_loads_with_cloudpickle`. The four internal call sites that legitimately need to expand cloudpickle blobs (`tune_controller.py`, `experiment_analysis.py`, `experiment/trial.py`, `trainable/metadata.py`) are updated to use the helper. Any other caller using `json.loads(s, cls=TuneFunctionDecoder)` on a document that contains an embedded cloudpickle payload will now get a clear `ValueError` instead of silently executing it.

## Test
Unit tests.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
#62584)

## Description
Improve the TPU section of the Ray Train scaling/accelerators user guide:
- Clarify that `topology` and `accelerator_type` are required for all `use_tpu=True` usage.
- Document multi-slice TPU support: `num_workers` can be a multiple of the VM count to launch multiple slices.
- List all valid TPU accelerator types: TPU-V2, TPU-V3, TPU-V4, TPU-V5P, TPU-V5LITEPOD, TPU-V6E, TPU-V7X.

## Additional information
Created a redirect for the moved page, and verified the rendered docs locally with `make develop && make local` (verification screenshots omitted).

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
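A hedged usage sketch of the configuration the guide describes, assuming Ray Train's `ScalingConfig` accepts these TPU fields as documented; the topology string and worker count are invented for illustration, and the snippet is not runnable without a TPU cluster:

```python
from ray.train import ScalingConfig

# Sketch only: field names follow the user guide's description; verify
# against the installed Ray version before relying on them.
scaling_config = ScalingConfig(
    num_workers=8,              # multiple of the per-slice VM count -> multi-slice
    use_tpu=True,
    topology="2x2x1",           # required whenever use_tpu=True
    accelerator_type="TPU-V4",  # required; one of TPU-V2 ... TPU-V7X
)
```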
…_and_test_init (#62737) Follow-up to the chunked release_tests.json upload: have `custom_image_build_and_test_init` itself call `buildkite-agent pipeline upload` for each chunk it writes, gated behind a new `--upload-to-buildkite` flag. The bash init script drops its while loop and just passes the flag through. Signed-off-by: andrew <andrew@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
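The chunk-and-upload flow can be sketched as follows. This is a hypothetical stand-alone helper, not the actual Ray CI script: the function name, file-naming scheme, and chunk format are invented for illustration; only the `buildkite-agent pipeline upload <file>` invocation is the real agent CLI.

```python
# Sketch: write pipeline steps in chunked JSON files and optionally
# upload each chunk to Buildkite as it is written.
import json
import subprocess
from pathlib import Path


def write_and_upload_chunks(steps, chunk_size, out_dir, upload_to_buildkite=False):
    """Split `steps` into chunked JSON files; optionally upload each chunk."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(steps), chunk_size):
        path = out_dir / f"release_tests_{i // chunk_size}.json"
        path.write_text(json.dumps({"steps": steps[i : i + chunk_size]}))
        paths.append(path)
        if upload_to_buildkite:
            # `buildkite-agent pipeline upload <file>` dynamically appends
            # the steps in <file> to the running build.
            subprocess.run(
                ["buildkite-agent", "pipeline", "upload", str(path)],
                check=True,
            )
    return paths
```

With the flag off, the helper only writes the chunk files, matching the old behavior where the bash init script looped over them and uploaded each one itself.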
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)
Can you help keep this open source service alive? 💖 Please sponsor : )