[pull] master from ray-project:master #4075

Merged

pull[bot] merged 6 commits into miqdigital:master from ray-project:master
Apr 22, 2026

Conversation


pull[bot] commented Apr 22, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

edoakes and others added 6 commits April 22, 2026 09:39
….cc` (#62557)

We have a temporary `RedisContext` used for `RedisDelKeyPrefixSync`, but
it was held in a `shared_ptr` even though its arguments were stack
allocated, which is somewhat dangerous. To avoid any lifetime issues in
the future, this change makes the `RedisContext` stack allocated as well.

The `RedisStoreClient` now calls `make_shared` and `Connect` for its
`RedisContext` directly in its constructor.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Commit 6634f27 generated doctest_depset_py3.10.lock from a tree that
predated commit 8b6c71e ("[train][doc] Fix flaky doc test for
TorchDetectionPredictor"), which added boto3==1.29.7 to
python/requirements/ml/py313/{train,tune}-test-requirements.txt. The
boto3 entry in the lock file was missing the two "# via -r ..."
provenance comments, causing raydepsets --check to fail in premerge CI.
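The failure mode above (a pinned package missing its "# via" provenance
comments) can be sketched with a toy checker. This is a simplified,
hypothetical reading of pip-compile-style lock files, not the real
`raydepsets` implementation:

```python
def missing_provenance(lock_lines):
    """Return pinned packages that lack a '# via ...' provenance comment.

    Assumes the pip-compile lock-file shape where each 'name==version'
    line is followed by at least one indented '# via ...' comment.
    (Illustrative sketch only.)
    """
    missing = []
    for i, line in enumerate(lock_lines):
        # Skip comment lines; a pin looks like 'boto3==1.29.7'.
        if "==" in line and not line.lstrip().startswith("#"):
            nxt = lock_lines[i + 1] if i + 1 < len(lock_lines) else ""
            if "# via" not in nxt:
                missing.append(line.split("==")[0].strip())
    return missing
```

A checker like this would flag the regenerated lock file described above,
since the `boto3` pin carried no provenance comments.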

Topic: fix-raydepset-doc
Signed-off-by: andrew <andrew@anyscale.com>

Signed-off-by: andrew <andrew@anyscale.com>
…ING (#62751)

## Description
During controller recovery, live replica actors are re-registered in
`RECOVERING` state before they transition to `RUNNING`. The long-poll
broadcast (`broadcast_running_replicas_if_changed`) only includes
`RUNNING`/`PENDING_MIGRATION` replicas, so as replicas complete recovery
one by one, each transition triggered a partial broadcast.

Proxies and routers applied that snapshot immediately, briefly **routing
to a reduced replica set** (or none at all if no replicas had finished
recovering yet).

## Related issues
Fixes #62728 

## Additional information
Skip the broadcast if any replicas are still `RECOVERING`, and don't
reset the "dirty flag" `_broadcasted_replicas_set_changed`.
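The deferral rule can be sketched as follows. This is an illustrative
toy, not the actual Serve controller code; the `Replica` class, the
`state` dict, and the function shape are assumptions:

```python
from dataclasses import dataclass


@dataclass
class Replica:
    replica_id: str
    status: str  # "RECOVERING", "RUNNING", or "PENDING_MIGRATION"


def broadcast_running_replicas_if_changed(replicas, state, broadcast):
    """Defer the long-poll broadcast while any replica is recovering.

    `state` carries the dirty flag and the last broadcast set;
    `broadcast` is only called once no replica is RECOVERING.
    """
    if any(r.status == "RECOVERING" for r in replicas):
        # Defer: leave the dirty flag set so the broadcast fires later.
        return False
    running = {
        r.replica_id
        for r in replicas
        if r.status in ("RUNNING", "PENDING_MIGRATION")
    }
    if state.get("dirty") or running != state.get("last_broadcast"):
        broadcast(running)
        state["last_broadcast"] = running
        state["dirty"] = False
        return True
    return False
```

With this shape, partial recoveries produce no broadcast at all, and a
single broadcast with the full replica set fires once the last replica
transitions out of `RECOVERING`.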

Added `test_broadcast_deferred_while_replicas_recovering`, which asserts:
1. No `DEPLOYMENT_TARGETS` broadcast while all replicas are
`RECOVERING`.
2. No broadcast during partial recovery (2 of 3 done, 1 still
`RECOVERING`).
3. A single broadcast fires with all 3 replicas once the last one
transitions.

## Downsides

This change does introduce performance/guarantee regressions in the
following cases:
- ServeController dies AND a replica dies -> the router's stale cache
contains the dead replica's handle
- ServeController dies AND a replica is added -> the router's stale cache
does not contain the fresh replica's handle

The cache is now refreshed only once ALL replicas have checked in, so the
wall-clock time to a fresh cache is dominated by the slowest replica.

Overall, the decision for this approach (delaying the broadcast until no
replicas are still `RECOVERING`) seems to come down to **how tolerant we
are of a stale cache.**

## Alternative
- The grace period suggested in the root issue; this would reduce the
number of blips but not provide a guarantee (the grace period must be
constant, while recovery time is not).

Signed-off-by: Soham Rajpure <srajpure@outlook.com>
Co-authored-by: Abrar Sheikh <abrar@anyscale.com>
## Description
1. **Docstring security warnings** on `Tuner.restore` and
`BaseTrainer.restore`. These APIs deserialize `tuner.pkl` /
`trainer.pkl` and other experiment-state files from the supplied `path`
(including remote URIs) using `pickle` / `cloudpickle` before any
validation, so loading from a path an untrusted party can write to is
equivalent to executing arbitrary Python code. The new `.. warning::`
block makes this constraint explicit, matching the convention used by
`pickle`, `numpy.load`, and `torch.load`.
2. **Opt-in cloudpickle expansion in `TuneFunctionDecoder`**.
`TuneFunctionEncoder` may embed a cloudpickle blob inside JSON output
under a `CLOUDPICKLE_FALLBACK` marker for objects that cannot be
JSON-encoded. The matching decoder previously expanded those blobs
silently, turning every `experiment_state-*.json` into a code-execution
sink that looks like data. The decoder now refuses to expand
`CLOUDPICKLE_FALLBACK` payloads by default and raises `ValueError`
instead. Tune-internal callers that load state written by Tune itself
opt in explicitly through a new private helper,
`_loads_with_cloudpickle`.

The four internal call sites that legitimately need to expand
cloudpickle blobs (`tune_controller.py`, `experiment_analysis.py`,
`experiment/trial.py`, and `trainable/metadata.py`) are updated to use the
helper. Any other caller using `json.loads(s, cls=TuneFunctionDecoder)`
on a document that contains an embedded cloudpickle payload will now get
a clear `ValueError` instead of silently executing it.

## Test

Unit tests.

Signed-off-by: Lehui Liu <lehui@anyscale.com>
#62584)

## Description
Improve the TPU section of the Ray Train scaling/accelerators user
guide:
- Clarify that `topology` and `accelerator_type` are required for all
`use_tpu=True` usage.
- Document multi-slice TPU support: `num_workers` can be a multiple of
the VM count to launch multiple slices.
- List all valid TPU accelerator types: TPU-V2, TPU-V3, TPU-V4, TPU-V5P,
TPU-V5LITEPOD, TPU-V6E, TPU-V7X.
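The multi-slice sizing rule documented above can be illustrated with a
small helper. This is a hypothetical sketch of the arithmetic only, not a
Ray Train API:

```python
def num_slices(num_workers: int, vms_per_slice: int) -> int:
    """Sizing check for the documented rule: num_workers must be a
    positive multiple of the per-slice VM count; the quotient is the
    number of TPU slices launched. (Illustrative helper.)"""
    if vms_per_slice <= 0:
        raise ValueError("vms_per_slice must be positive")
    if num_workers <= 0 or num_workers % vms_per_slice != 0:
        raise ValueError(
            f"num_workers={num_workers} must be a positive multiple of "
            f"the VM count per slice ({vms_per_slice})"
        )
    return num_workers // vms_per_slice
```

For example, with 4 VMs per slice, `num_workers=8` launches two slices,
while `num_workers=6` is rejected.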


## Additional information

Created a redirect:
<img width="1524" height="158" alt="image"
src="https://github.com/user-attachments/assets/7bb7b631-4670-4e31-a590-6b60e6183cb5"
/>

`make develop` && `make local`
<img width="1882" height="1386" alt="e70ba6d8873321e9ab649873d043f696"
src="https://github.com/user-attachments/assets/f2a68d70-4ed2-450f-9905-51f6bae0d69d"
/>

<img width="1936" height="1528" alt="cdc7bc8a93f060acf275a03d453c454c"
src="https://github.com/user-attachments/assets/3342437d-bc57-498f-8896-ed7333f34138"
/>
<img width="1894" height="906" alt="35918821801237cbbef0bee411793017"
src="https://github.com/user-attachments/assets/adb35fb7-df97-4f25-b575-2984602c760e"
/>
<img width="1816" height="984" alt="2aafc2ba7313a4b1517a65ef9426aa64"
src="https://github.com/user-attachments/assets/b6aa49f5-42e0-4ac8-94e4-e3139076709a"
/>

<img width="1808" height="1556" alt="2f4f793892da067df8736188f2305853"
src="https://github.com/user-attachments/assets/65b1a0f3-5dd5-4164-8646-33c05a3397e9"
/>
<img width="1830" height="840" alt="4a4445a76f39151adc293d4cdf9f10a7"
src="https://github.com/user-attachments/assets/ac1e0ab7-7197-43d0-8cc1-8b52ad1ab20a"
/>

---------

Signed-off-by: Lehui Liu <lehui@anyscale.com>
…_and_test_init (#62737)

Follow-up to the chunked release_tests.json upload: have
`custom_image_build_and_test_init` itself call `buildkite-agent pipeline
upload` for each chunk it writes, gated behind a new
`--upload-to-buildkite` flag. The bash init script drops its while loop
and just passes the flag through.
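The flag-gated per-chunk behavior can be sketched as a pure function that
builds one `buildkite-agent pipeline upload` invocation per chunk. The
function name and shape are assumptions, not the actual tooling:

```python
def chunk_upload_commands(chunk_paths, upload_to_buildkite=False):
    """Build one `buildkite-agent pipeline upload` command per written
    chunk, gated behind the flag. Returns argv lists for a subprocess
    runner to execute. (Illustrative sketch.)"""
    if not upload_to_buildkite:
        # Flag off: chunks are written but nothing is uploaded here.
        return []
    return [
        ["buildkite-agent", "pipeline", "upload", path]
        for path in chunk_paths
    ]
```

Keeping command construction separate from execution makes the gating
trivially testable without a Buildkite agent present.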

Signed-off-by: andrew <andrew@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
pull[bot] locked and limited conversation to collaborators Apr 22, 2026
pull[bot] added the ⤵️ pull label Apr 22, 2026
pull[bot] merged commit 622f071 into miqdigital:master Apr 22, 2026