diff --git a/README.md b/README.md index c0f7e44847..76bef01a01 100644 --- a/README.md +++ b/README.md @@ -9,7 +9,7 @@ [![License](https://img.shields.io/github/license/radixark/miles)](LICENSE) [![Slack](https://img.shields.io/badge/slack-join-brightgreen.svg)](https://slack.sglang.ai) -[**Latest Updates**](#latest-updates) | [**Quick Start**](#quick-start) | [**Key Features**](#key-features) | [**Documentation**](https://www.radixark.com/miles/docs) +[**Latest Updates**](#latest-updates) | [**Quick Start**](#quick-start) | [**Key Features**](#key-features) | [**Documentation**](https://miles.radixark.com/docs) @@ -18,11 +18,11 @@ ## Latest Updates -* **[2026/02]** 💡 **Miles Detailed Arguments**: We've added a detailed command-line argument guide used to configure Miles for RL training and inference. These arguments enable precise control over cluster resources, training backends (Megatron/FSDP), inference optimization via SGLang, and RL algorithmic hyperparameters. [Link](https://github.com/radixark/miles/blob/main/docs/en/advanced/miles_server_args.md) +* **[2026/02]** 💡 **Miles Detailed Arguments**: We've added a detailed command-line argument guide used to configure Miles for RL training and inference. These arguments enable precise control over cluster resources, training backends (Megatron/FSDP), inference optimization via SGLang, and RL algorithmic hyperparameters. [Link](https://miles.radixark.com/docs/user-guide/cli-reference) * **[2026/01]** 💎 **INT4 Quantization-Aware Training (QAT)**: Inspired by the Kimi K2-Thinking report, Miles now features a full-stack INT4 W4A16 QAT pipeline. This allows 1TB-scale models to fit into single-machine VRAM (e.g., NVIDIA H200), doubling rollout efficiency by eliminating cross-node bottlenecks while maintaining BF16-equivalent accuracy. 
[Blog](https://lmsys.org/blog/2026-01-26-int4-qat/) * **[2026/01]** 💎 **Unified VLM/LLM Multi-Turn Training**: We provided an implementation for the VLM multi-turn sampling paradigm. Developers only need to write a customized `rollout` function to easily start multi-turn RL for VLM, just like training LLM. [Blog](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/vlm-multi-turn/readme-en.md) * **[2026/01]** 🤖 **Multi-Agent Co-Evolution**: Miles now supports **MrlX**, a novel asynchronous co-evolutionary framework for Multi-Agent RL. Achieve superior performance in complex tasks like Doctor-Patient simulations and DeepResearch pipelines by enabling specialized agents to evolve together symbiotically. [[Link]](https://github.com/AQ-MedAI/MrlX) -* **[2025/12]** 🔄 **Rollout Routing Replay (R3)**: In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [[Paper]](https://arxiv.org/pdf/2510.11370) [[Docs]](docs/en/advanced/miles-router.md#22-rollout-routing-replay-r3-for-moe) +* **[2025/12]** 🔄 **Rollout Routing Replay (R3)**: In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [[Paper]](https://arxiv.org/pdf/2510.11370) [[Docs]](https://miles.radixark.com/docs/advanced/miles-router) * **[2025/11]** 🔥 **Unified FP8 Release**: Solves the stability issues in MoE RL by ensuring training and inference use the exact same FP8 quantization logic. 
[[Blog]](https://lmsys.org/blog/2025-11-25-fp8-rl/) * **[2025/11]** ⚡ **Speculative Decoding in RL**: Integrated speculative rollout with online SFT for draft models, achieving massive throughput gains. [[Blog]](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md) * **[2025/11]** 🎉 **Miles Project Launch**: A joint effort by InfiXAI, Ant Group, SGLang RL Team, and the Miles community. [[Announcement]](https://lmsys.org/blog/2025-11-19-miles/) @@ -107,7 +107,7 @@ python train.py \ --n-samples-per-prompt 8 ``` -For comprehensive guides on environment setup and custom reward functions, see the [Quick Start Guide](docs/en/get_started/quick_start.md). +For comprehensive guides on environment setup and custom reward functions, see the [Quick Start Guide](https://miles.radixark.com/docs/getting-started/quick-start). --- diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000000..1a96cd006a --- /dev/null +++ b/docs/README.md @@ -0,0 +1,34 @@ +# Miles Documentation + +Live site: https://miles.radixark.com/docs + +## Layout + +``` +docs/ +├── docs.json # Mintlify config: navigation, theme, redirects +├── index.md # Homepage +├── getting-started/ models/ user-guide/ advanced/ +├── examples/ developer/ platforms/ blog/ +└── assets/ # Images and stylesheets +``` + +## Previewing locally + +```bash +npm i -g mint +cd docs +mint dev +``` + +Then open http://localhost:3000. + +## Adding or editing a page + +1. Add or edit a `.md` file (e.g. `models/qwen/qwen4.md`). +2. New pages need an entry in the `navigation` tree in `docs.json`, otherwise they won't + show up in the sidebar. +3. When linking between pages, use absolute paths: `[Quick Start](/getting-started/quick-start)`. + Drop the `.md` extension. +4. Images and other assets go in `assets/` and are referenced the same way: + `/assets/images/arch.png`. 
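Reviewer note on step 2 of the page-editing checklist above: a page that is missing from the `navigation` tree fails silently, so a small check may be worth having. The following is a minimal sketch and not part of this PR. `collect_nav_pages` and `find_unlisted_pages` are hypothetical helper names, the traversal assumes only the `tabs` / `groups` / `pages` / `root` keys visible in this PR's `docs.json`, and skipping `README.md` is an assumption about which files are published.

```python
import json
from pathlib import Path


def collect_nav_pages(node):
    """Recursively collect every page slug referenced in a navigation subtree.

    Assumed shape (taken from the docs.json in this PR): dicts may carry a
    "root" slug plus nested "tabs", "groups", or "pages"; "pages" lists hold
    either slug strings or nested group dicts.
    """
    pages = set()
    if isinstance(node, str):
        pages.add(node)
    elif isinstance(node, list):
        for item in node:
            pages |= collect_nav_pages(item)
    elif isinstance(node, dict):
        root = node.get("root")
        if isinstance(root, str):
            pages.add(root)
        # Labels such as "tab" and "group" are display names, not slugs,
        # so only the container keys are traversed.
        for key in ("tabs", "groups", "pages"):
            if key in node:
                pages |= collect_nav_pages(node[key])
    return pages


def find_unlisted_pages(docs_dir):
    """Return slugs of .md files under docs_dir that docs.json never lists."""
    docs_dir = Path(docs_dir)
    config = json.loads((docs_dir / "docs.json").read_text(encoding="utf-8"))
    listed = collect_nav_pages(config.get("navigation", {}))
    on_disk = {
        path.relative_to(docs_dir).with_suffix("").as_posix()
        for path in docs_dir.rglob("*.md")
        if path.name != "README.md"  # assumption: README files are not published pages
    }
    return sorted(on_disk - listed)
```

Calling `find_unlisted_pages("docs")` from the repository root would return the slugs that still need a `navigation` entry; an empty list means every page is reachable from the sidebar.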
diff --git a/docs/advanced/architecture-support.md b/docs/advanced/architecture-support.md index 1d680c209e..0c08c45d63 100644 --- a/docs/advanced/architecture-support.md +++ b/docs/advanced/architecture-support.md @@ -2,9 +2,6 @@ title: Backends Beyond Megatron description: Embed HuggingFace implementations as black-box modules inside Megatron's parallel pipeline. --- - -# Backends Beyond Megatron - Adding a new architecture (such as Qwen3-Next's Gated-Delta-Net) directly to Megatron-LM's native code path is invasive. Miles takes a different approach: wrap the model's official HuggingFace implementation as a black-box module and @@ -33,7 +30,6 @@ starts from `get_gpt_decoder_block_spec`, then for the layers whose HF params={"args": args})` (referenced from `miles_plugins/models/`): ```python -# miles_plugins/models/qwen3_next.py (simplified excerpt) transformer_layer_spec = get_gpt_decoder_block_spec(config, **kwargs) ... for layer_id in range(num_layers_to_build): @@ -122,10 +118,8 @@ of the model is bf16. Qwen3.5's `A_log` is the canonical example. Rounding it to bf16 makes Megatron-side activations diverge from SGLang-side rollout, causing precision drift. -The canonical cast point is Megatron's `Float16Module`, which (per the -docstring on `enforce_marked_param_dtypes` in -`miles/backends/megatron_utils/fp32_param_utils.py`) "unconditionally casts -every floating-point parameter to bf16/fp16 at wrap time". The mbridge +The canonical cast point is Megatron's `Float16Module`, which casts +every floating-point parameter to bf16/fp16 at wrap time. The mbridge weight-conversion path (`_weight_to_mcore_format` and friends) is the other place fp32 weights can be silently downcast. Two steps are required to keep tagged params in fp32. 
diff --git a/docs/advanced/fault-tolerance.md b/docs/advanced/fault-tolerance.md index d6bbc1d8da..8657b6b3d9 100644 --- a/docs/advanced/fault-tolerance.md +++ b/docs/advanced/fault-tolerance.md @@ -2,17 +2,14 @@ title: Fault Tolerance description: Rollout-side health checks and engine recovery, gated by --use-fault-tolerance. --- - -# Fault Tolerance - -`--use-fault-tolerance` enables Miles's rollout-side fault-tolerance machinery. -It gates two code paths: +The `--use-fault-tolerance` flag enables Miles's rollout-side +fault-tolerance machinery. It gates two code paths: 1. A `RolloutHealthMonitor` thread per server group, started in - `miles/ray/rollout.py:379`, which periodically heart-beats each SGLang + `miles/ray/rollout.py`, which periodically heart-beats each SGLang engine. 2. A recovery hook in the trainer's weight-update step - (`miles/backends/megatron_utils/actor.py:500`), which restarts engines + (`miles/backends/megatron_utils/actor.py`), which restarts engines that the health monitor has killed. ```bash @@ -20,14 +17,14 @@ It gates two code paths: ``` The flag is `action="store_true"`, default `False` -(`miles/utils/arguments.py:528`). +(`miles/utils/arguments.py`). ## Health monitor `RolloutHealthMonitor` (`miles/utils/health_monitor.py`) runs in a daemon thread. Lifecycle: `start` (called once during init), `pause` and `resume` (called when engines offload / onload), `stop` (called during dispose). -`pause` / `resume` are wired up in `miles/ray/rollout.py:497, 501` and called +`pause` / `resume` are wired up in `miles/ray/rollout.py` and called around offload / onload events. Each loop iteration does: @@ -37,7 +34,7 @@ Each loop iteration does: 2. For every active engine in the group, call `engine.health_generate.remote(timeout=self._check_timeout)`. 3. If the call raises, run `_kill_engine`: `engine.shutdown.remote()`, `ray.kill(engine)`, and the engine slot is set to `None` - (`miles/utils/health_monitor.py:160-180`). 
+ (`miles/utils/health_monitor.py`). 4. Sleep `--rollout-health-check-interval` seconds, then repeat. ### Flags @@ -52,14 +49,14 @@ Each loop iteration does: When `--use-fault-tolerance` is on, `MegatronActor.update_weights` calls `rollout_manager.recover_updatable_engines` on rank 0 before each weight -update (`miles/backends/megatron_utils/actor.py:500`). +update (`miles/backends/megatron_utils/actor.py`). -`recover_updatable_engines` (`miles/ray/rollout.py:513`): +`recover_updatable_engines` (`miles/ray/rollout.py`): 1. Pauses health monitoring. 2. Calls `srv.recover()` on the updatable server. -`srv.recover()` (`miles/ray/rollout.py:263`): +`srv.recover()` (`miles/ray/rollout.py`): 1. Finds engine slots set to `None` (killed by the health monitor). 2. Calls `start_engines` for each affected group. @@ -72,14 +69,14 @@ the new engines and the next weight transfer proceeds normally. When `--update-weight-transfer-mode p2p` is on, every P2P transfer is bounded by `--p2p-transfer-timeout` (default `30.0`s, defined in -`miles/utils/arguments.py:519`; consumed at -`miles/backends/megatron_utils/update_weight/update_weight_from_distributed/p2p.py:73`). +`miles/utils/arguments.py`; consumed at +`miles/backends/megatron_utils/update_weight/update_weight_from_distributed/p2p.py`). On timeout the failed transfer is logged (`[P2P] Transfer future failed: ...`) in `p2p_transfer_utils.py`. There is no automatic retry or automatic broadcast-mode fallback in the source today. ## Dumper-mode interaction -In dumper mode (`miles/utils/arguments.py:2102`), Miles forces +In dumper mode (`miles/utils/arguments.py`), Miles forces `use_fault_tolerance = False` and `rollout_health_check_interval = 1e18` to keep heartbeats from firing. 
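Reviewer note: the check, kill, and recover cycle that the fault-tolerance page describes can be sketched without Ray. This is an illustration under stated assumptions, not the real `RolloutHealthMonitor`: the class and function names are hypothetical, the `check`, `kill`, and `start_engine` callables stand in for `health_generate`, `_kill_engine`, and `start_engines`, and the first loop step (elided in the hunk above) is not modeled.

```python
import threading


class HealthMonitorSketch:
    """Illustrative stand-in for a per-group health monitor (hypothetical)."""

    def __init__(self, engines, check, kill, interval_s):
        self.engines = engines        # engine handles; None marks a dead slot
        self._check = check           # callable(engine); raises when unhealthy
        self._kill = kill             # callable(engine); tears the engine down
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._paused = threading.Event()
        self._thread = None

    def check_once(self):
        """Heart-beat every live engine; kill and blank the ones that raise."""
        for slot, engine in enumerate(self.engines):
            if engine is None:
                continue
            try:
                self._check(engine)
            except Exception:
                self._kill(engine)
                self.engines[slot] = None

    def _run(self):
        while not self._stop.is_set():
            if not self._paused.is_set():
                self.check_once()
            # Sleep for the check interval, waking early if stop() is called.
            self._stop.wait(self._interval_s)

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def pause(self):
        self._paused.set()

    def resume(self):
        self._paused.clear()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()


def recover(engines, start_engine):
    """Refill slots that were set to None; return the refilled slot indices."""
    dead = [slot for slot, engine in enumerate(engines) if engine is None]
    for slot in dead:
        engines[slot] = start_engine(slot)
    return dead
```

In this sketch the trainer side would call `pause()`, then `recover(...)`, before a weight update, mirroring the order the page gives for `recover_updatable_engines`.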
diff --git a/docs/advanced/fp8-low-precision.md b/docs/advanced/fp8-low-precision.md index fba328a789..5d5e354551 100644 --- a/docs/advanced/fp8-low-precision.md +++ b/docs/advanced/fp8-low-precision.md @@ -2,9 +2,6 @@ title: Low Precision RL description: Unified low-precision pipelines for RL โ€” block-wise FP8, MXFP8, and NVFP4 across rollout and training. --- - -# Low Precision RL - A common failure mode in MoE RL is precision drift between training and inference. Pipelines that train in BF16 and serve in FP8 accumulate per-layer numerical disagreement, which compounds into divergent log-probabilities and @@ -82,7 +79,6 @@ recipe to use on Hopper, and the recipe DeepSeek-V3 / DeepSeek-R1 ship in. Block layout is 128ร—128 with FP32 scales. ```bash -# Megatron / TransformerEngine --transformer-impl transformer_engine --bf16 --fp8-format e4m3 diff --git a/docs/advanced/index.md b/docs/advanced/index.md index d281ef0fea..c6438d18cf 100644 --- a/docs/advanced/index.md +++ b/docs/advanced/index.md @@ -2,43 +2,40 @@ title: Advanced Features description: Systems-level features for large-scale and long-running RL. --- - -# Advanced Features - This section covers the Miles features that the Core-features section of the homepage points at: low-precision training (FP8 / MXFP8 / INT4 QAT), Rollout Routing Replay for MoE, speculative decoding, and LoRA training and serving. - + The unified FP8 path: matched quantization between training and inference, BF16 backward and master weights. - + W4A16 quantization-aware training for fitting large models on a single 8-GPU node. - + Capture expert routing during inference and replay during training. The mechanism that keeps MoE RL stable. - + Draft + target speculative rollout, with online MTP-SFT for the draft. - + Train LoRA adapters with SFT or RL and serve them through SGLang from the same checkpoint. 
diff --git a/docs/advanced/int4-qat.md b/docs/advanced/int4-qat.md index 2e3aa043d3..e7fd5cbb29 100644 --- a/docs/advanced/int4-qat.md +++ b/docs/advanced/int4-qat.md @@ -2,9 +2,6 @@ title: INT4 Quantization-Aware Training description: Fit large models on a single 8-GPU node by training with W4A16 quantization in the loop. --- - -# INT4 W4A16 Quantization-Aware Training - When the model is large enough that even FP8 will not fit on one node, the options are spreading across more nodes (and paying cross-node bandwidth) or quantizing further. Miles ships an INT4 W4A16 quant-aware-training pipeline. @@ -83,10 +80,10 @@ so the KL anchor stays full-precision. ## Pairs with -* [R3](miles-router.md). Keeps MoE routing stable across the quantized forward. -* [P2P weight transfer](p2p-weight-transfer.md). INT4 weights are 4× smaller, +* [R3](/advanced/miles-router). Keeps MoE routing stable across the quantized forward. +* [P2P weight transfer](/advanced/p2p-weight-transfer). INT4 weights are 4× smaller, so weight sync transfers less data. -* [Speculative decoding](speculative-decoding.md). Compounds for end-to-end +* [Speculative decoding](/advanced/speculative-decoding). Compounds for end-to-end rollout speedup. ## When QAT is not appropriate diff --git a/docs/advanced/lora.md b/docs/advanced/lora.md index 0a49d43f68..bb4944ad9f 100644 --- a/docs/advanced/lora.md +++ b/docs/advanced/lora.md @@ -2,23 +2,17 @@ title: LoRA Training and Serving description: Train LoRA adapters with miles SFT or RL recipes and serve them through SGLang from the same checkpoint. --- - -# LoRA Training and Serving - Miles supports LoRA adapters for both SFT and RL recipes. Adapters trained by miles load directly into SGLang for rollout, so there is no separate merge or conversion step in the training-serving loop. -This page is a stub; the full LoRA tutorial is being written. In the meantime, -the pieces below are enough to get a recipe running. 
- ## Example launchers The canonical LoRA recipes live under [`examples/lora/`](https://github.com/radixark/miles/tree/main/examples/lora) in the miles repo: -- `examples/lora/run-qwen2.5-0.5B-megatron-lora.sh` โ€” small dense, single GPU. +- `examples/lora/run-qwen2.5-0.5B-megatron-lora.sh` โ€” small dense model, single GPU. - `examples/lora/run-qwen3-4B-megatron-lora.sh` โ€” Qwen3-4B, RL with LoRA. - `examples/lora/run-gpt-oss-20B-megatron-moe-lora.sh` โ€” MoE example. @@ -35,12 +29,15 @@ the miles repo: | `--lora-adapter-path` | Path to a pre-trained adapter to resume from. | | `--lora-sync-from-tensor` | Sync adapter weights to SGLang via in-memory tensors instead of a file round-trip. | -Two existing arguments also have LoRA-specific requirements that are easy to -miss: the launcher has to pass `--megatron-to-hf-mode bridge` (the LoRA path -goes through Megatron-Bridge's PEFT integration; the default `raw` converter -does not understand LoRA layers), and the Ray job has to run with -`--colocate`. Distributed (PD-disaggregated) rollout with LoRA is not -supported today. + +Two existing arguments are easy to miss when configuring LoRA: + +- **`--megatron-to-hf-mode bridge`** is required. The LoRA path goes through + Megatron-Bridge's PEFT integration; the default `raw` converter does not + understand LoRA layers. +- **`--colocate`** is required. Distributed (PD-disaggregated) rollout with + LoRA is not supported today. + ## MoE @@ -75,11 +72,14 @@ reason. drives `train.py`. * **Low-precision training**: the LoRA branch follows the surrounding precision, so block-wise FP8, MXFP8, and INT4 QAT recipes are compatible. - See [Low Precision RL](fp8-low-precision.md) and [INT4 QAT](int4-qat.md). -* **`--target-modules` is mandatory** when `--lora-rank > 0`. There is no - auto-detection; the launcher asserts at startup. -* **Single adapter per run**: multi-LoRA training in a single job is not - implemented today. 
+ See [Low Precision RL](/advanced/fp8-low-precision) and [INT4 QAT](/advanced/int4-qat). +* **Target modules**: `--target-modules` is required whenever + `--lora-rank > 0`. There is no auto-detection; the launcher asserts at + startup. +* **Single adapter per run**: only one set of `--lora-*` arguments is + honored per training job. Training multiple LoRA adapters in parallel + within a single `train.py` run is not implemented today โ€” run separate + jobs if you need multiple adapters. ## Internals @@ -93,7 +93,7 @@ The bridge between Megatron's LoRA path and SGLang adapter loading is in: - `miles/backends/megatron_utils/checkpoint.py` โ€” adapter-aware save and load. - `miles/backends/megatron_utils/update_weight/update_weight_from_tensor.py` โ€” colocate-mode weight sync from the trainer's LoRA tensors into the SGLang - rollout engine. We will merge this [PR](https://github.com/radixark/miles/pull/988) soon to support disaggregate mode. + rollout engine. Disaggregate-mode weight sync is not supported yet. A worked tutorial covering checkpoint conversion, SGLang adapter loading, and LoRA-specific evaluation will land here in a future doc pass. diff --git a/docs/advanced/miles-router.md b/docs/advanced/miles-router.md index 35c8c73ed7..4f021cdc6b 100644 --- a/docs/advanced/miles-router.md +++ b/docs/advanced/miles-router.md @@ -1,22 +1,19 @@ --- title: Rollout Routing Replay (R3) -description: Capture expert routing during inference and replay it during training so MoE RL is stable. +description: Capture expert routing during inference and replay it during training to stabilize RL. --- - -# Rollout Routing Replay (R3) - Rollout Routing Replay (R3) records the expert routing decisions made during inference and replays them during training, producing bit-identical expert allocation between rollout and training. -## Why MoE RL was previously unstable +## Why MoE RL is unstable without R3 For each token, an MoE router picks `top-k` experts. 
The choice depends on the -input through a soft router and a top-k op. In production the router is a +input through a soft router and a top-k operation. In production the router is a learned `nn.Linear` with non-deterministic kernels and FP8 quantization, so tiny numerical differences flip routes at the per-layer, per-token level. -Without R3: +An example without R3: * Rollout selects experts `{2, 7}` for token 314. * Training (with the same weights but slightly different precision and kernels) @@ -25,8 +22,8 @@ Without R3: layers, tens of thousands of tokens, and thousands of training steps, the policy diverges. -With R3 the inference router's choice is what training also uses. Numerical -noise no longer flips routes. +With R3, the trainer replays the rollout router's expert assignments verbatim, +so numerical noise no longer flips routes. ## How R3 wires up @@ -50,7 +47,7 @@ forward pass so recorded routes are used instead of recomputed ones. ## Memory cost `(num_tokens - 1) × num_layers × top_k × 4 bytes` (int32 per element, see -`miles/utils/types.py:29`). For a 32K-token sequence, 60 layers, and +`miles/utils/types.py`). For a 32K-token sequence, 60 layers, and `top_k = 8`, that is roughly 60 MB per sample of routing metadata. ## When R3 is not required diff --git a/docs/advanced/p2p-weight-transfer.md b/docs/advanced/p2p-weight-transfer.md index 31212c2fbc..2abb50bee1 100644 --- a/docs/advanced/p2p-weight-transfer.md +++ b/docs/advanced/p2p-weight-transfer.md @@ -2,9 +2,6 @@ title: P2P Weight Transfer description: Direct rank-to-rank weight sync from actor to rollout via RDMA writes. --- - -# P2P Weight Transfer - miles supports P2P (point-to-point) weight transfer between training and rollout engines. By using `--update-weight-transfer-mode p2p`, miles enables more efficient weight transfer from training ranks to rollout engine ranks. More details on the design and implementation can be found in [this issue](https://github.com/radixark/miles/issues/755). 
## Usage @@ -125,7 +122,7 @@ Models marked with ★ are MoE architectures, where P2P benefits are most pronou | DeepSeek-V3 ★ | Kimi-K2 | 1T(64B) | `DeepseekV3ForCausalLM` | TP=8, PP=8, CP=4, EP=32, ETP=1, 32 nodes | TP=32, EP=32, 32 nodes | 53,279.1 | 7,227.3 | **−86.4%** | -![P2P vs NCCL Broadcast Scaling](../assets/images/p2p_vs_nccl_scaling.png) +![P2P vs NCCL Broadcast Scaling](/assets/images/p2p_vs_nccl_scaling.png) \* Kimi-K2 RDMA time includes ~884 ms GPU-side `post_load_weights` requantization on rollout engines. @@ -136,7 +133,6 @@ Models marked with ★ are MoE architectures, where P2P benefits are most pronou The P2P weight transfer E2E test validates correctness on a single node using `Qwen3-4B`: ```python -# tests/e2e/megatron/test_qwen3_4B_p2p.py # # Train: 4 GPUs (TP=2, CP=2) # Rollout: 4 GPUs (sglang, 2 engines × 2 GPUs each) diff --git a/docs/advanced/pd-disaggregation.md b/docs/advanced/pd-disaggregation.md index 197315cb9b..2c7bf44fcc 100644 --- a/docs/advanced/pd-disaggregation.md +++ b/docs/advanced/pd-disaggregation.md @@ -2,9 +2,6 @@ title: PD Disaggregation description: Separate prefill and decode pools so each is sized for its workload. --- - -# PD Disaggregation - In a typical SGLang deployment, every engine handles both **prefill** (the one-shot forward over the prompt) and **decode** (the per-token autoregressive loop). The two phases have different compute profiles: @@ -25,13 +22,13 @@ PD disaggregation splits them into two pools, each sized for its own workload. `--prefill-num-servers` is a Miles-native flag added by `add_prefill_decode_disaggregation_arguments` in `miles/utils/arguments.py`. -When set, `miles/ray/rollout.py:1090` calls +When set, `miles/ray/rollout.py` calls `SglangConfig.from_prefill_num_servers(args)` to dedicate that many SGLang servers to prefill, with the rest used for decode. 
`--prefill-num-servers` is mutually exclusive with the `sglang_config` attribute (the YAML `server_groups` config), and also cannot be combined -with `--rollout-external` (`arguments.py:2082-2087`). +with `--rollout-external` (`arguments.py`). ## When PD is worth it @@ -80,9 +77,9 @@ on observed queueing: ## Pairs with -* [DeepSeek R1 recipe](../models/deepseek/deepseek.md). PD is a clear win at +* [DeepSeek R1 recipe](/models/deepseek/deepseek). PD is a clear win at 671B scale. -* [Speculative decoding](speculative-decoding.md). Both are SGLang-side +* [Speculative decoding](/advanced/speculative-decoding). Both are SGLang-side features; pool sizing should account for the verify-batch size when speculative is on. diff --git a/docs/advanced/speculative-decoding.md b/docs/advanced/speculative-decoding.md index 420946de41..515697ee6e 100644 --- a/docs/advanced/speculative-decoding.md +++ b/docs/advanced/speculative-decoding.md @@ -2,9 +2,6 @@ title: Speculative Decoding description: Draft + target speculative rollout, with online SFT for MTP-style drafts. --- - -# Speculative Decoding - Speculative decoding accelerates rollout by letting a lightweight draft model generate ahead a few tokens and then verifying them with a single batched forward of the target model. When the draft is correct the target produces N @@ -81,10 +78,10 @@ rollouts and reload it. ## Pairs with -* [Unified FP8](fp8-low-precision.md). Draft and target both quantized the +* [Unified FP8](/advanced/fp8-low-precision). Draft and target both quantized the same way. -* [INT4 QAT](int4-qat.md). A quantized draft is cheaper to verify. -* [R3](miles-router.md). R3 captures routing for the verified tokens emitted +* [INT4 QAT](/advanced/int4-qat). A quantized draft is cheaper to verify. +* [R3](/advanced/miles-router). R3 captures routing for the verified tokens emitted by the target. 
## When to skip diff --git a/docs/assets/images/miles_logo_dark.png b/docs/assets/images/miles_logo_dark.png new file mode 100644 index 0000000000..e3e742c146 Binary files /dev/null and b/docs/assets/images/miles_logo_dark.png differ diff --git a/docs/assets/images/miles_logo_light.png b/docs/assets/images/miles_logo_light.png new file mode 100644 index 0000000000..a25fb34920 Binary files /dev/null and b/docs/assets/images/miles_logo_light.png differ diff --git a/docs/blog/index.md b/docs/blog/index.md index cd20f64425..08113c2af5 100644 --- a/docs/blog/index.md +++ b/docs/blog/index.md @@ -2,12 +2,9 @@ title: Blog description: Engineering posts and release notes from the Miles team. --- - -# Blog - - + Why we built Miles, what it means for production-scale RL post-training, and what we shipped on day one. @@ -19,5 +16,4 @@ description: Engineering posts and release notes from the Miles team. ## Subscribe * GitHub releases: [radixark/miles/releases](https://github.com/radixark/miles/releases) -* RSS: coming soon -* Twitter/X: [@radixark_miles](https://twitter.com) +* Twitter/X: [@radixark](https://twitter.com/radixark) diff --git a/docs/blog/introducing-miles.md b/docs/blog/introducing-miles.md index 5b71acdbb2..be397b35d1 100644 --- a/docs/blog/introducing-miles.md +++ b/docs/blog/introducing-miles.md @@ -3,9 +3,6 @@ title: Introducing Miles description: Why RadixArk built Miles, and what it means for production-scale RL post-training. date: 2025-11-19 --- - -# Introducing Miles - *November 19, 2025 โ€” RadixArk team* Today RadixArk is open-sourcing **Miles**, a reinforcement learning framework purpose- @@ -45,7 +42,7 @@ on the RL part. ## Design principles **Small core, many edges.** The trainer is a short Python program; almost every -behaviour is swappable through a `--*-path` flag rather than a code patch. +behavior is swappable through a `--*-path` flag rather than a code patch. 
**Match the hardware.** Miles is designed around NVLink, InfiniBand, and RDMA โ€” at trillion-parameter scale, the interconnect is the rate limiter, so we optimize for it @@ -56,7 +53,7 @@ runs (routing mismatch, precision drift, NCCL hangs) before chasing the next alg ## Try it -Head to the [Quick Start](../getting-started/quick-start.md) for a quick GRPO +Head to the [Quick Start](/getting-started/quick-start) for a quick GRPO run on Qwen3-4B with a single 8-GPU node. ## What's next diff --git a/docs/developer/architecture.md b/docs/developer/architecture.md index d66cb4f713..972fcb637c 100644 --- a/docs/developer/architecture.md +++ b/docs/developer/architecture.md @@ -2,9 +2,6 @@ title: Architecture Overview description: The 30-minute tour of how Miles is organized internally. --- - -# Architecture Overview - A reading guide before you start patching. ## The processes @@ -108,7 +105,7 @@ loop and uses a continuously-running worker. ## Extension points (the right way) The trainer is plug-in-friendly. Most extensions don't need a code change inside Miles โ€” -just pass a `--something-path my_pkg.thing`. See [Customization](../user-guide/customization.md) +just pass a `--something-path my_pkg.thing`. See [Customization](/user-guide/customization) for the full list. If you find yourself patching the trainer to make something work, that's a sign we're @@ -124,8 +121,8 @@ tests/ โ””โ”€โ”€ e2e/ # end-to-end (spins up Ray + SGLang); GPU or CPU CI, registered explicitly ``` -CI discovery is location-based. `tests/fast/` may hold **only CPU CI**: every `test_*.py` there -auto-registers as `stage-a-cpu`, so no boilerplate is needed โ€” write a literal `register_cpu_ci(...)` +CI discovery is location-based. 
The `tests/fast/` folder may hold **only CPU CI**: every `test_*.py` +there auto-registers as `stage-a-cpu`, so no boilerplate is needed โ€” write a literal `register_cpu_ci(...)` only to override the defaults, and a `register_cuda_ci` under `tests/fast/` is an error (move the file to `tests/fast-gpu/`). Every other folder may hold **GPU or CPU CI** and must register each test explicitly with `register_cpu_ci` / `register_cuda_ci`. The runner collects `tests/fast/`, diff --git a/docs/developer/contributing.md b/docs/developer/contributor-guide.md similarity index 95% rename from docs/developer/contributing.md rename to docs/developer/contributor-guide.md index 864652d6b1..987bef89c9 100644 --- a/docs/developer/contributing.md +++ b/docs/developer/contributor-guide.md @@ -2,9 +2,6 @@ title: Contributing description: PR conventions, code layout, and how reviews work. --- - -# Contributing - Miles is open source under the LICENSE file in the repo. We accept community contributions of every size โ€” bug reports, doc fixes, new model recipes, full features. @@ -40,7 +37,6 @@ miles/ ## Local dev loop ```bash -# Inside the radixark/miles container cd /root/miles git remote add me git@github.com:/miles.git git checkout -b feat/awesome @@ -67,12 +63,12 @@ gh pr create --title "..." --body "..." Before you click "Ready for review": - [ ] `pre-commit run --all-files` passes. -- [ ] You added or updated tests for new behaviour. +- [ ] You added or updated tests for new behavior. - [ ] You ran `pytest -x` and it's green. - [ ] If you touched the launch flags, `python3 train.py --help` still parses. -- [ ] If you added a public flag, it appears in [Server Arguments](../user-guide/cli-reference.md). +- [ ] If you added a public flag, it appears in [Server Arguments](/user-guide/cli-reference). - [ ] If you added a new example, you wrote a real walkthrough (use - [examples/index](../examples/index.md) as the structural template). 
+ [examples/index](/examples/index) as the structural template). ## Commit messages diff --git a/docs/developer/debug.md b/docs/developer/debug.md index dd9b1a41b8..49d7e4d016 100644 --- a/docs/developer/debug.md +++ b/docs/developer/debug.md @@ -2,9 +2,6 @@ title: Debugging description: Aligning precision, separate train/rollout debugging, common kernel pitfalls. --- - -# Debugging - When something is wrong with a Miles run, the question is almost always: **rollout or training?** Once you've isolated which side is misbehaving, the rest is a normal debugging session. @@ -40,7 +37,7 @@ They should be: at step 1 the actor and reference are the same weights. If KL > * **Large values (KL > 1).** Configuration error. Re-check parallelism and precision. * **Slightly elevated logp on instruct (~0.8 per token).** Almost always a chat-template mismatch โ€” your prompts don't match the format the model was trained on. Run the - [chat template verifier](../user-guide/agentic-chat-template.md). + [chat template verifier](/user-guide/agentic-chat-template). #### Is `grad_norm` reasonable? @@ -80,7 +77,7 @@ This is the single most useful pattern in the Miles workflow. Use it. ## Determinism for bisecting When you need to A/B test a code change, bit-wise reproducibility is your friend. See -the [Reproducibility recipe](../examples/reproducibility.md) for the exact flag set +the [Reproducibility recipe](/examples/reproducibility) for the exact flag set and env vars. The 25% throughput cost is worth it during development. ## Common kernel pitfalls diff --git a/docs/developer/experimental-features.md b/docs/developer/experimental-features.md index 7fa9dc1c21..ed268765b3 100644 --- a/docs/developer/experimental-features.md +++ b/docs/developer/experimental-features.md @@ -2,9 +2,6 @@ title: Experimental Features description: Backends and features that exist in tree but are not production-ready โ€” opt-in at your own risk. 
--- - -# Experimental Features - These features live in the Miles tree but are **not** production-ready. They typically have rough edges, missing parallelism, or known bugs against current dependency versions. Use them when you want to iterate quickly or co-develop a feature, not for @@ -34,7 +31,7 @@ models, not for production runs. - You want a HuggingFace-native checkpoint at every step with no conversion. For large MoE models, multi-rack jobs, or anything where TP / PP / CP / EP matters, -use the production [Megatron-LM backend](../user-guide/usage.md#megatron-lm) instead. +use the production [Megatron-LM backend](/user-guide/usage#megatron-lm) instead. ### Enabling it @@ -64,7 +61,6 @@ Most RL-level flags carry over unchanged. Backend-specific differences: ### Quick start ```bash -# Optional: wandb export WANDB_API_KEY= # Model + data diff --git a/docs/developer/index.md b/docs/developer/index.md index 8ab006543b..9b5a33daa8 100644 --- a/docs/developer/index.md +++ b/docs/developer/index.md @@ -2,39 +2,36 @@ title: Developer Guide description: Architecture, contribution conventions, debugging, and migration notes. --- - -# Developer Guide - You're here because you want to change Miles, not just use it. This section is the short tour for new contributors. - + PR conventions, code layout, how reviews work. - + Aligning precision, separate train/rollout debugging, common kernel pitfalls. - + Sync → async loop, breaking flag changes between releases. - + The 30-minute tour of how Miles is organized internally. - + Opt-in backends and features (FSDP, …) that aren't production-ready yet. @@ -45,7 +42,7 @@ short tour for new contributors. ## TL;DR for first-time contributors 1. Pick something small from `good first issue` on [GitHub](https://github.com/radixark/miles/issues). -2.
Run the [Reproducibility recipe](/examples/reproducibility) so you can be sure "I changed X and it broke" actually means that. 3. Use `--debug-train-only` or `--debug-rollout-only` to scope your changes. 4. Open a PR. We'll review within ~48h. diff --git a/docs/developer/migration.md b/docs/developer/migration.md index 2cd43ceb6f..2039385607 100644 --- a/docs/developer/migration.md +++ b/docs/developer/migration.md @@ -2,9 +2,6 @@ title: Migration Guide description: Sync → async loop, breaking flag changes between releases. --- - -# Migration Guide - This page tracks breaking changes between Miles releases and how to update your code or launch scripts. diff --git a/docs/docs.json b/docs/docs.json new file mode 100644 index 0000000000..ea0a01473a --- /dev/null +++ b/docs/docs.json @@ -0,0 +1,322 @@ +{ + "$schema": "https://mintlify.com/schema.json", + "name": "Miles", + "theme": "mint", + "logo": { + "light": "/assets/images/miles_logo_light.png", + "dark": "/assets/images/miles_logo_dark.png", + "href": "/", + "width": 130 + }, + "favicon": "/assets/images/miles_square.png", + "seo": { + "metatags": { + "canonical": "https://miles.radixark.com/docs", + "og:site_name": "Miles Documentation" + } + }, + "colors": { + "primary": "#d55816", + "light": "#e8722a", + "dark": "#b84a12" + }, + "navbar": { + "links": [ + { + "type": "github", + "href": "https://github.com/radixark/miles" + } + ], + "primary": { + "type": "button", + "label": "Contact", + "href": "mailto:miles@radixark.ai" + } + }, + "navigation": { + "tabs": [ + { + "tab": "Welcome", + "groups": [ + { + "group": "Welcome", + "root": "index", + "pages": [ + "getting-started/index", + "getting-started/installation", + "getting-started/quick-start" + ] + } + ] + }, + { + "tab": "Models", + "groups": [ + { + "group": "Models", + "root": "models/index", + "pages": [ + { + "group": "DeepSeek", + "root": "models/deepseek/index", + "pages": [ + { + "group": "DeepSeek-V4", + "pages": [ +
"models/deepseek/deepseek-v4-flash", + "models/deepseek/deepseek-v4-pro" + ], + "expanded": false + }, + "models/deepseek/deepseek" + ], + "expanded": false + }, + { + "group": "Qwen", + "root": "models/qwen/index", + "pages": [ + { + "group": "Qwen3.6", + "pages": [ + "models/qwen/qwen3-6", + "models/qwen/qwen3-6-moe" + ], + "expanded": false + }, + { + "group": "Qwen3.5", + "pages": [ + "models/qwen/qwen3-5", + "models/qwen/qwen3-5-moe" + ], + "expanded": false + }, + "models/qwen/qwen3-next", + { + "group": "Qwen3", + "pages": [ + "models/qwen/qwen3", + "models/qwen/qwen3-moe" + ], + "expanded": false + } + ], + "expanded": false + }, + { + "group": "GLM", + "root": "models/glm/index", + "pages": [ + "models/glm/glm5", + "models/glm/glm4-7-flash", + "models/glm/glm4-5", + "models/glm/glm4" + ], + "expanded": false + }, + { + "group": "Kimi", + "root": "models/kimi/index", + "pages": [ + "models/kimi/kimi-k2.5", + "models/kimi/kimi-k2", + "models/kimi/moonlight" + ], + "expanded": false + }, + { + "group": "Nemotron", + "root": "models/nemotron/index", + "pages": [ + { + "group": "Nemotron-3-Nano", + "pages": [ + "models/nemotron/nemotron-3-nano", + "models/nemotron/nemotron-3-nano-moe" + ], + "expanded": false + }, + "models/nemotron/nemotron-3-super" + ], + "expanded": false + }, + "models/mimo/mimo", + "models/gpt-oss/gpt-oss" + ] + } + ] + }, + { + "tab": "User Guide", + "groups": [ + { + "group": "User Guide", + "root": "user-guide/index", + "pages": [ + "user-guide/concepts", + "user-guide/argument-groups", + "user-guide/usage", + "user-guide/training-script-walkthrough", + "user-guide/monitoring", + "user-guide/customization", + "user-guide/rollout-endpoints", + "user-guide/fully-async", + "user-guide/agentic-chat-template", + "user-guide/cli-reference" + ] + } + ] + }, + { + "tab": "Advanced Features", + "groups": [ + { + "group": "Advanced Features", + "root": "advanced/index", + "pages": [ + { + "group": "Performance", + "pages": [ + 
"advanced/fp8-low-precision", + "advanced/int4-qat", + "advanced/speculative-decoding", + "advanced/lora" + ], + "expanded": false + }, + { + "group": "Scale & Reliability", + "pages": [ + "advanced/fault-tolerance", + "advanced/pd-disaggregation", + "advanced/p2p-weight-transfer" + ], + "expanded": false + }, + { + "group": "MoE & Routing", + "pages": [ + "advanced/miles-router" + ], + "expanded": false + }, + { + "group": "Backends", + "pages": [ + "advanced/architecture-support" + ], + "expanded": false + } + ] + } + ] + }, + { + "tab": "Examples", + "groups": [ + { + "group": "Examples", + "root": "examples/index", + "pages": [ + "examples/fully-async", + "examples/search-r1", + "examples/retool", + "examples/multi-agent", + "examples/reproducibility", + "examples/openhermes-sft" + ] + } + ] + }, + { + "tab": "Developer Guide", + "groups": [ + { + "group": "Developer Guide", + "root": "developer/index", + "pages": [ + "developer/contributor-guide", + "developer/debug", + "developer/migration", + "developer/architecture", + "developer/experimental-features" + ] + } + ] + }, + { + "tab": "Platforms", + "groups": [ + { + "group": "Platforms", + "root": "platforms/index", + "pages": [ + "platforms/nvidia", + "platforms/amd" + ] + } + ] + }, + { + "tab": "Resources", + "groups": [ + { + "group": "Resources", + "pages": [ + "faq", + "blog/index" + ] + } + ] + } + ] + }, + "background": { + "dark": "#0c0b0a", + "light": "#faf7f4" + }, + "url": "https://www.radixark.com", + "contextual": { + "options": [ + { + "title": "Raise a docs issue", + "description": "Flag a wording, formatting, or accuracy issue on this page.", + "icon": "github", + "href": "https://github.com/radixark/miles/issues/new" + } + ] + }, + "footer": { + "socials": { + "github": "https://github.com/radixark/miles" + }, + "links": [ + { + "header": "Contribute", + "items": [ + { + "label": "Raise a docs issue", + "href": "https://github.com/radixark/miles/issues/new" + }, + { + "label": "Source on 
GitHub", + "href": "https://github.com/radixark/miles" + } + ] + } + ] + }, + "font": { + "headings": { + "family": "Hanken Grotesk", + "weight": 600, + "url": "https://fonts.googleapis.com/css2?family=Hanken+Grotesk:wght@100..900&display=swap" + }, + "body": { + "family": "Inter", + "weight": 400, + "url": "https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500&display=swap" + } + } +} diff --git a/docs/examples/fully-async.md b/docs/examples/fully-async.md index d5c4fbebb8..b706ece316 100644 --- a/docs/examples/fully-async.md +++ b/docs/examples/fully-async.md @@ -2,9 +2,6 @@ title: Fully Async Rollout description: Keep generation running continuously in the background so the trainer never waits. --- - -# Fully Async Rollout - **What you'll learn:** how to make rollout production and trainer consumption fully parallel, with a queue in between, by using a custom rollout function. @@ -26,9 +23,9 @@ instead of the sum. ## Prerequisites -* You completed the [Qwen3-4B](../models/qwen/qwen3.md) recipe (or have an +* You completed the [Qwen3-4B](/models/qwen/qwen3) recipe (or have an equivalent model + dataset). -* Comfortable with [Customization](../user-guide/customization.md) -- async rollout uses +* Comfortable with [Customization](/user-guide/customization) -- async rollout uses a custom rollout function. ## Files @@ -181,7 +178,7 @@ check GPU utilization. * **Best-effort ordering.** Samples are sorted by index at drain time, but exact-order guarantees aren't provided. * **Minimal error handling.** If a generate task throws, it's logged but the worker - keeps going. Production users wire in [fault tolerance](../advanced/fault-tolerance.md). + keeps going. Production users wire in [fault tolerance](/advanced/fault-tolerance).
## Variations diff --git a/docs/examples/index.md b/docs/examples/index.md index ef6e96ef2f..8d629b4e2f 100644 --- a/docs/examples/index.md +++ b/docs/examples/index.md @@ -2,9 +2,6 @@ title: Examples description: Annotated end-to-end walkthroughs for the workflows people actually want to build. --- - -# Examples - The model recipes show you how to train a model. The examples below show you how to *build something useful* with Miles -- tools, search, multi-agent, distillation, and async rollout. @@ -21,46 +18,46 @@ Each example follows the same template: 8. **Troubleshooting** -- the failure modes we've actually hit. 9. **Variations** -- common adaptations. -## The catalogue +## The catalog - + Continuous background generation with a queue between rollout and training. Up to 2× end-to-end speedup. - + Multi-turn rollout where the model can issue `...` actions, get observations from a retrieval server, and produce a final answer. - + SFT + RL pipeline for tool-augmented reasoning. Sandboxed Python code execution interleaved with thinking. - + Two specialized agents (e.g. doctor + patient) train together and improve each other. - + Bit-stable training across reruns. Determinism flags, seeds, and what to watch. - + Plain SFT (no RL) -- sometimes you just need a quick fine-tune. @@ -70,7 +67,7 @@ Each example follows the same template: ## Where to start -* **Never used Miles for anything beyond GRPO?** → [Fully Async Rollout](fully-async.md). -* **Want tool use / RAG?** → [Search-R1](search-r1.md), then [ReTool](retool.md). -* **VLM / multi-agent?** → [Multi-Agent Co-Evolution](multi-agent.md). -* **Replay an old result?** → [Reproducibility Recipe](reproducibility.md). +* **Never used Miles for anything beyond GRPO?** → [Fully Async Rollout](/examples/fully-async). +* **Want tool use / RAG?** → [Search-R1](/examples/search-r1), then [ReTool](/examples/retool). +* **VLM / multi-agent?** → [Multi-Agent Co-Evolution](/examples/multi-agent).
+* **Replay an old result?** → [Reproducibility Recipe](/examples/reproducibility). diff --git a/docs/examples/multi-agent.md b/docs/examples/multi-agent.md index 4c5b559280..2bc5c278ff 100644 --- a/docs/examples/multi-agent.md +++ b/docs/examples/multi-agent.md @@ -2,9 +2,6 @@ title: Multi-Agent Co-Evolution description: Two specialized agents train together and improve each other. --- - -# Multi-Agent Co-Evolution - **What you'll learn:** how to wire up an asynchronous multi-agent system in Miles, where two (or more) specialized agents take alternating turns and the joint outcome drives a single shared reward. @@ -22,9 +19,9 @@ you can hack on it without pulling in MrlX's full dependency tree. ## Prerequisites -* You've completed the [Qwen3-30B-A3B](../models/qwen/qwen3-moe.md) recipe (the +* You've completed the [Qwen3-30B-A3B](/models/qwen/qwen3-moe) recipe (the example uses that model). -* Familiar with [Customization](../user-guide/customization.md). +* Familiar with [Customization](/user-guide/customization). ## Files @@ -185,7 +182,7 @@ verifier becomes verbose. Tighten its system prompt or reduce its `max_tokens`. Replace `call_role` with a VLM-aware caller that includes images in messages. Miles supports VLM multi-turn natively -- same pattern, just `multimodal_train_inputs` in the -sample dict (see [Customization #13](../user-guide/customization.md#training)). +sample dict (see [Customization #13](/user-guide/customization#training)). ### True asymmetric agents diff --git a/docs/examples/openhermes-sft.md b/docs/examples/openhermes-sft.md index 183bfeffa3..c3893afaab 100644 --- a/docs/examples/openhermes-sft.md +++ b/docs/examples/openhermes-sft.md @@ -2,9 +2,6 @@ title: SFT on OpenHermes description: Plain supervised fine-tuning of Qwen3-4B-Base on the OpenHermes-2.5 dataset. --- - -# SFT on OpenHermes - **What you'll learn:** how to use Miles for plain supervised fine-tuning. No RL, no rollout, no reward -- just data → loss → optimizer.
@@ -16,7 +13,7 @@ Why use Miles for SFT? Two reasons: ## Prerequisites -* You completed the [Qwen3-4B](../models/qwen/qwen3.md) recipe (we reuse the +* You completed the [Qwen3-4B](/models/qwen/qwen3) recipe (we reuse the conversion). * ~50 GB free disk for OpenHermes-2.5. @@ -67,7 +64,7 @@ bash scripts/run-qwen3-4B-base-sft.sh ## What changes vs. the GRPO recipe -Compare to [run-qwen3-4B.sh](../models/qwen/qwen3.md). The deltas: +Compare to [run-qwen3-4B.sh](/models/qwen/qwen3). The deltas: ```diff - python3 train.py diff --git a/docs/examples/reproducibility.md b/docs/examples/reproducibility.md index c12457a9da..4d4c9787db 100644 --- a/docs/examples/reproducibility.md +++ b/docs/examples/reproducibility.md @@ -2,9 +2,6 @@ title: Reproducibility Recipe description: Bit-stable training across reruns. Determinism flags, seeds, and what to watch. --- - -# Reproducibility Recipe - **What you'll learn:** how to configure Miles + SGLang + Megatron for **bit-wise reproducible** RL training. Same inputs → identical outputs across reruns, machines, and time. diff --git a/docs/examples/retool.md b/docs/examples/retool.md index 8ea7478a93..53bef0952c 100644 --- a/docs/examples/retool.md +++ b/docs/examples/retool.md @@ -2,9 +2,6 @@ title: ReTool -- Code Execution Tool Use description: SFT + RL pipeline that teaches a model to interleave thinking with sandboxed Python execution. --- - -# ReTool -- Code Execution Tool Use - **What you'll learn:** the full SFT → RL pipeline for tool-augmented reasoning, with sandboxed Python execution and a reward function that checks both the final answer and the tool-use trace. @@ -14,7 +11,7 @@ Python snippet, get the result, continue reasoning, repeat. ## Prerequisites -* Familiar with [Search-R1](search-r1.md) -- same general pattern. +* Familiar with [Search-R1](/examples/search-r1) -- same general pattern. * `radixark/miles:latest` container. * ~150 GB free disk for the SFT data + base model.
@@ -48,7 +45,6 @@ You can skip the SFT phase by using the pre-trained checkpoint we publish: ## Quick start (RL only) ```bash -# 1. Download model + datasets hf download font-info/qwen3-4b-sft-SGLang-RL --local-dir /root/font-info/qwen3-4b-sft hf download --repo-type dataset BytedTsinghua-SIA/DAPO-Math-17K --local-dir /root/dapo-math-17k hf download --repo-type dataset zhuzilin/aime-2024 --local-dir /root/aime-2024 diff --git a/docs/examples/search-r1.md b/docs/examples/search-r1.md index 9c41f320bc..b5ef519029 100644 --- a/docs/examples/search-r1.md +++ b/docs/examples/search-r1.md @@ -2,9 +2,6 @@ title: Search-R1 (Tool Use) description: Train a model to issue search queries, integrate observations, and answer multi-turn QA. --- - -# Search-R1 -- Tool-augmented multi-turn RL - **What you'll learn:** how to wire up a tool (web search) into a Miles training loop -- custom multi-turn rollout, observation interleaving, reward function, and TIS to keep training stable when train ≠ inference. @@ -17,7 +14,7 @@ This is a Miles-friendly reproduction of the original * `radixark/miles:latest` container. * Either a serper.dev API key (Google search backend) or ~135 GB free disk for the local Wikipedia retriever (see [appendix](#appendix-local-wikipedia-retriever)). -* You completed [Customization](../user-guide/customization.md) -- this example uses a +* You completed [Customization](/user-guide/customization) -- this example uses a custom rollout function and reward. ## Files @@ -144,7 +141,7 @@ async def generate(args, sample: Sample, sampling_params) -> Sample: learns to *predict the search results*, which is both wrong and wildly unhelpful. 2. **Tokenization alignment.** The model must see and the trainer must score the *exact same tokens*. Pre-tokenizing vs.
re-tokenizing at training time can drift -- - that's where the [chat template verifier](../user-guide/agentic-chat-template.md) + that's where the [chat template verifier](/user-guide/agentic-chat-template) matters. ## Walkthrough -- reward @@ -234,7 +231,6 @@ conflicting with Miles. ### One-time setup ```bash -# 1. Conda wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh bash ~/miniconda.sh -b -p $HOME/miniconda3 source ~/miniconda3/etc/profile.d/conda.sh diff --git a/docs/faq.md b/docs/faq.md index a3ba0be132..41dcd75e99 100644 --- a/docs/faq.md +++ b/docs/faq.md @@ -2,9 +2,6 @@ title: FAQ description: The questions every new Miles user asks in their first week. --- - -# FAQ - @@ -28,7 +25,7 @@ A common mistake is forgetting `--colocate` when sharing GPUs. - +I'm OOM during training. What is max_tokens_per_gpu?}> `max-tokens-per-gpu` caps how many tokens a single GPU sees per micro-batch (only when `--use-dynamic-batch-size` is on, which it should be). @@ -44,7 +41,7 @@ parallel (`--context-parallel-size N`) to spread one sample across N ranks. - +Multi-node training fails with transformers cannot find a model.}> Multiple workers calling `AutoConfig.from_pretrained` on a shared filesystem race each other. Set `--model-name ` so workers don't re-resolve the path. @@ -81,7 +78,7 @@ never need to pad manually. - +SGLang gives Max retries exceeded with url: /get_model_info.}> Multiple SGLang servers are colliding on the same node. Reduce the number of SGLang instances per node -- e.g. set `--rollout-num-gpus-per-engine 8` so there's exactly @@ -92,7 +89,7 @@ one server per host. Check the chat template first -- most "exploding gradient" reports come from feeding already-templated data into a model that re-applies its template. Then read -[Debugging](developer/debug.md). +[Debugging](/developer/debug). @@ -103,14 +100,14 @@ them explicitly with `--rollout-stop` or `--rollout-stop-token-ids`.
- +SGLang error: illegal memory access.}> Per the [SGLang FAQ](https://docs.sglang.io/references/faq.html), this is usually OOM masquerading. Lower `--sglang-mem-fraction-static`. - +JSONDecodeError from torch.compile / inductor.}> Torch's compiler cache is corrupt. Add `TORCHINDUCTOR_FORCE_DISABLE_CACHES=1` to your Ray env vars and re-run. @@ -139,3 +136,4 @@ then go investigate the data + model alignment that caused it. Still stuck? Drop a thread in the Miles channel of the [SGLang Slack](https://slack.sglang.ai) or open an issue on [GitHub](https://github.com/radixark/miles/issues). + diff --git a/docs/getting-started/index.md b/docs/getting-started/index.md index 77783edaa8..6c45d9f76d 100644 --- a/docs/getting-started/index.md +++ b/docs/getting-started/index.md @@ -2,12 +2,9 @@ title: Getting Started description: Install Miles and run your first RL training job -- the two pages you need to go from zero to a working loop. --- - -# Getting Started - Two pages take you from a fresh machine to a running RL training job: -- **[Installation](installation.md)** -- Docker (recommended), pip, or building from source on NVIDIA / AMD. -- **[Quick Start](quick-start.md)** -- `docker pull` to a GRPO run on Qwen3-4B in under an hour, on a single 8-GPU node. +- **[Installation](/getting-started/installation)** -- Docker (recommended), pip, or building from source on NVIDIA / AMD. +- **[Quick Start](/getting-started/quick-start)** -- `docker pull` to a GRPO run on Qwen3-4B in under an hour, on a single 8-GPU node. -After the loop is running, [Models](../models/index.md) covers the family-specific recipes and the [User Guide](../user-guide/index.md) walks through concepts, data, monitoring, and customization. +After the loop is running, [Models](/models/index) covers the family-specific recipes and the [User Guide](/user-guide/index) walks through concepts, data, monitoring, and customization.
diff --git a/docs/getting-started/installation.md b/docs/getting-started/installation.md index e3d78d25db..de50accbca 100644 --- a/docs/getting-started/installation.md +++ b/docs/getting-started/installation.md @@ -2,9 +2,6 @@ title: Installation description: Install Miles on NVIDIA or AMD GPUs. Docker is the recommended path. --- - -# Installation - There are three ways to install Miles. Docker is recommended because Miles pins patched versions of SGLang, Megatron-LM, and a few CUDA kernels. @@ -48,7 +45,7 @@ The image ships with: - Megatron-LM, SGLang, FlashAttention-3, DeepGEMM, Apex - Ray, uv, and Miles installed editable at `/root/miles` -See [Platforms](../platforms/index.md) for platform-specific notes. +See [Platforms](/platforms/index) for platform-specific notes. ## Method 2: From source @@ -88,7 +85,7 @@ python -c "import miles; print('Miles import OK')" nvidia-smi ``` -If either command fails, see [Debugging](../developer/debug.md) or the [FAQ](../faq.md). +If either command fails, see [Debugging](/developer/debug) or the [FAQ](/faq). ## Hardware requirements @@ -104,6 +101,6 @@ or Slingshot -- and 200+ GB/s per node. Single-node jobs run fine over NVLink o ## Next steps -- [Quick Start](quick-start.md) -- run your first training job. -- [Core concepts](../user-guide/concepts.md) -- the mental model behind Miles. -- [Training backends](../user-guide/usage.md) -- Megatron vs FSDP. +- [Quick Start](/getting-started/quick-start) -- run your first training job. +- [Core concepts](/user-guide/concepts) -- the mental model behind Miles. +- [Training backends](/user-guide/usage) -- Megatron vs FSDP. diff --git a/docs/getting-started/quick-start.md b/docs/getting-started/quick-start.md index c8fa98923a..aeba8284ed 100644 --- a/docs/getting-started/quick-start.md +++ b/docs/getting-started/quick-start.md @@ -2,13 +2,10 @@ title: Quick Start description: A working RL training job on Qwen3-4B in under an hour.
--- - -# Quick Start - This page takes you from `docker pull` to a running GRPO training job on Qwen3-4B. It assumes an 8-GPU node (H100 / H200 / B-series) and roughly 200 GB of disk. -For other models, see [Models](../models/index.md). +For other models, see [Models](/models/index). ## 1. Start the container @@ -54,7 +51,7 @@ PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \ ``` For larger models, run the converter under `torchrun --nproc-per-node 8` (optionally -multi-node). See the [Models](../models/index.md) section for per-family conversion +multi-node). See the [Models](/models/index) section for per-family conversion commands. ## 4. Launch training @@ -112,12 +109,12 @@ Miles fills in whichever side you leave unset. ## Next steps -- [Core concepts](../user-guide/concepts.md) -- the model behind rollout / actor / reference. -- [Training script walkthrough](../user-guide/training-script-walkthrough.md) -- +- [Core concepts](/user-guide/concepts) -- the model behind rollout / actor / reference. +- [Training script walkthrough](/user-guide/training-script-walkthrough) -- an annotated tour through every argument group in a launch script, plus colocation, dynamic sampling, partial rollout, and BF16+FP8 inference. -- [Training backends](../user-guide/usage.md) -- Megatron vs FSDP. -- [Customization](../user-guide/customization.md) -- plug in custom rollout / reward. -- [Models](../models/index.md) -- recipes for Qwen3.5, GLM4.5, DeepSeek R1, Kimi K2, and more. +- [Training backends](/user-guide/usage) -- Megatron vs FSDP. +- [Customization](/user-guide/customization) -- plug in custom rollout / reward. +- [Models](/models/index) -- recipes for Qwen3.5, GLM4.5, DeepSeek R1, Kimi K2, and more. -If you hit issues, the [FAQ](../faq.md) covers the common ones. +If you hit issues, the [FAQ](/faq) covers the common ones.
diff --git a/docs/index.md b/docs/index.md index 146e72ba59..0d49fe640e 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,9 +1,6 @@ --- title: Miles Documentation --- - -# Miles - Miles is a high-performance, enterprise-ready reinforcement learning (RL) framework specifically optimized for **Large-Scale model Post-Training**. It couples [SGLang](https://github.com/sgl-project/sglang) for high-throughput rollout with [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for scalable training, and ships the precision, stability, and observability features @@ -17,7 +14,7 @@ needed to run RL at trillion-parameter scale. - **Fast and stable support for the latest models.** Day-0 enablement of frontier releases such as DeepSeek-V4, with rapid follow-on support for new architectures including GLM-5, Qwen 3.6, and Nemotron-3-Super. -- **Unified low-precision training.** Customisable precision across the rollout and +- **Unified low-precision training.** Customizable precision across the rollout and training engines, with unified **BF16**, **FP8**, **MXFP8**, and **INT4 QAT** recipes available now and an **NVFP4** training recipe in progress. - **Efficient Rollout Routing Replay (R3).** For MoE models, expert routing captured @@ -41,31 +38,19 @@ needed to run RL at trillion-parameter scale. 
## Supported models - - - - - - **Qwen**: [Qwen3.6](models/qwen/qwen3-6.md), [Qwen3.5](models/qwen/qwen3-5.md), [Qwen3](models/qwen/qwen3.md) - - **GLM**: [GLM4](models/glm/glm4.md) - - **Nemotron**: [Nemotron-3-Nano](models/nemotron/nemotron-3-nano.md) - - **MiMo**: [MiMo](models/mimo/mimo.md) - - **GPT-OSS**: [GPT-OSS](models/gpt-oss/gpt-oss.md) - - - - - - - **DeepSeek**: [DeepSeek-V4 Pro](models/deepseek/deepseek-v4-pro.md), [DeepSeek-V4 Flash](models/deepseek/deepseek-v4-flash.md), [DeepSeek-V3 / R1](models/deepseek/deepseek.md) - - **Qwen**: [Qwen3.6 MoE](models/qwen/qwen3-6-moe.md), [Qwen3.5 MoE](models/qwen/qwen3-5-moe.md), [Qwen3-Next](models/qwen/qwen3-next.md), [Qwen3 MoE](models/qwen/qwen3-moe.md) - - **GLM**: [GLM5 / GLM5.1](models/glm/glm5.md), [GLM4.7](models/glm/glm4-7-flash.md), [GLM4.5](models/glm/glm4-5.md) - - **Kimi**: [Kimi K2.5 / K2.6](models/kimi/kimi-k2.5.md), [Kimi K2](models/kimi/kimi-k2.md), [Moonlight](models/kimi/moonlight.md) - - **Nemotron**: [Nemotron-3-Nano MoE](models/nemotron/nemotron-3-nano-moe.md), [Nemotron-3-Super](models/nemotron/nemotron-3-super.md) - - +Each model name links to its recipe page. - +| Family | Models | +|---|---| +| **DeepSeek** | [DeepSeek-V4 Pro](/models/deepseek/deepseek-v4-pro)
[DeepSeek-V4 Flash](/models/deepseek/deepseek-v4-flash)
[DeepSeek-R1](/models/deepseek/deepseek)
[DeepSeek-V3](/models/deepseek/deepseek) | +| **Qwen** | [Qwen3.6 MoE](/models/qwen/qwen3-6-moe)
[Qwen3.6](/models/qwen/qwen3-6)
[Qwen3.5-35B-A3B](/models/qwen/qwen3-5-moe)
[Qwen3.5-4B / 9B / 27B](/models/qwen/qwen3-5)
[Qwen3-Next-80B-A3B-Thinking](/models/qwen/qwen3-next)
[Qwen3-30B-A3B / 235B-A22B](/models/qwen/qwen3-moe)
[Qwen3-0.6B / 1.7B / 4B / 8B / 14B / 32B](/models/qwen/qwen3) | +| **GLM** | [GLM-5.1](/models/glm/glm5)
[GLM-5](/models/glm/glm5)
[GLM-4.7-Flash](/models/glm/glm4-7-flash)
[GLM-4.5](/models/glm/glm4-5)
[GLM-Z1-9B-0414](/models/glm/glm4) | +| **Kimi** | [Kimi-K2.6](/models/kimi/kimi-k2.5)
[Kimi-K2.5](/models/kimi/kimi-k2.5)
[Kimi-K2-Instruct / Thinking](/models/kimi/kimi-k2)
[Moonlight-16B-A3B](/models/kimi/moonlight) | +| **Nemotron** | [Nemotron-3-Super-120B-A12B-FP8](/models/nemotron/nemotron-3-super)
[Nemotron-3-Nano MoE](/models/nemotron/nemotron-3-nano-moe)
[Nemotron-3-Nano](/models/nemotron/nemotron-3-nano) | +| **MiMo** | [MiMo-7B-RL](/models/mimo/mimo) | +| **GPT-OSS** | [gpt-oss-20b](/models/gpt-oss/gpt-oss) | -See [Models](models/index.md) for exact conversion commands, launch scripts, and +See [Models](/models/index) for exact conversion commands, launch scripts, and parallelism settings. ## Supported hardware @@ -73,28 +58,28 @@ parallelism settings. - **NVIDIA**: GB300, GB200, B200, B100, H200, H100, A100. - **AMD**: MI300X, MI325, MI350, MI355X (via ROCm). -See [Platforms](platforms/index.md). +See [Platforms](/platforms/index). ## Latest updates -- **[2026/02]** Complete argument reference. [CLI Reference](user-guide/cli-reference.md) -- **[2026/01]** INT4 W4A16 QAT. [INT4 Quantization-Aware Training](advanced/int4-qat.md) -- **[2026/01]** Unified VLM/LLM multi-turn rollout. [Multi-Agent Co-Evolution](examples/multi-agent.md) -- **[2025/12]** Rollout Routing Replay (R3) for MoE. [Rollout Routing Replay (R3)](advanced/miles-router.md) -- **[2025/11]** Unified FP8 pipeline generally available. [FP8 and Low Precision](advanced/fp8-low-precision.md) -- **[2025/11]** Speculative decoding with online MTP-SFT. [Speculative Decoding](advanced/speculative-decoding.md) +- **[2026/02]** Complete argument reference. [CLI Reference](/user-guide/cli-reference) +- **[2026/01]** INT4 W4A16 QAT. [INT4 Quantization-Aware Training](/advanced/int4-qat) +- **[2026/01]** Unified VLM/LLM multi-turn rollout. [Multi-Agent Co-Evolution](/examples/multi-agent) +- **[2025/12]** Rollout Routing Replay (R3) for MoE. [Rollout Routing Replay (R3)](/advanced/miles-router) +- **[2025/11]** Unified FP8 pipeline generally available. [FP8 and Low Precision](/advanced/fp8-low-precision) +- **[2025/11]** Speculative decoding with online MTP-SFT. [Speculative Decoding](/advanced/speculative-decoding) ## Start here -1. **[Installation](getting-started/installation.md)** -- Docker, bare metal, AMD. -2.
**[Quick Start](getting-started/quick-start.md)** -- a working training run in under an hour. -3. **[Core concepts](user-guide/concepts.md)** -- the four objects in every Miles job. -4. **[Training backend](user-guide/usage.md)** -- Megatron-LM, parallelism, checkpoints, and hooks. -5. **[Training script walkthrough](user-guide/training-script-walkthrough.md)** -- every +1. **[Installation](/getting-started/installation)** -- Docker, bare metal, AMD. +2. **[Quick Start](/getting-started/quick-start)** -- a working training run in under an hour. +3. **[Core concepts](/user-guide/concepts)** -- the four objects in every Miles job. +4. **[Training backend](/user-guide/usage)** -- Megatron-LM, parallelism, checkpoints, and hooks. +5. **[Training script walkthrough](/user-guide/training-script-walkthrough)** -- every argument group in a launch script, annotated. ## Contribute - GitHub: [github.com/radixark/miles](https://github.com/radixark/miles) - Slack: [slack.sglang.ai](https://slack.sglang.ai), channel `#miles` -- Contributing: [developer guide](developer/contributing.md) +- Contributing: [developer guide](/developer/contributor-guide) diff --git a/docs/models/deepseek/deepseek-v4-flash.md b/docs/models/deepseek/deepseek-v4-flash.md index 12db7db593..e927666dc2 100644 --- a/docs/models/deepseek/deepseek-v4-flash.md +++ b/docs/models/deepseek/deepseek-v4-flash.md @@ -2,14 +2,11 @@ title: DeepSeek-V4 Flash description: Launch recipe for DeepSeek-V4-Flash (284 B) -- FP8 rollout / BF16 train, 8-node H200 (64 GPUs). --- - -# DeepSeek-V4 Flash - DeepSeek V4 training tracking issue: [`radixark/miles#1046`](https://github.com/radixark/miles/issues/1046). ## 1. Model Introduction -[DeepSeek-V4-Flash](https://huggingface.co/sgl-project/DeepSeek-V4-Flash-FP8) is a 13 B-active / 284 B-total MoE model with a substantially different attention stack from V3/R1.
The miles + Megatron-Core (`mcore`) integration is shipped together in the [`radixark/miles#1045`](https://github.com/radixark/miles/pull/1045) and [`radixark/Megatron-LM#28`](https://github.com/radixark/Megatron-LM/pull/28) pull requests, and ships in the `radixark/miles:dev` image. +[DeepSeek-V4-Flash](https://huggingface.co/sgl-project/DeepSeek-V4-Flash-FP8) is a 13 B-active / 284 B-total MoE model with a substantially different attention stack from V3/R1. It ships in the `radixark/miles:latest` image. The larger [DeepSeek-V4-Pro](/models/deepseek/deepseek-v4-pro) shares the same V4 architecture family at Pro scale. **Key highlights:** @@ -32,8 +29,8 @@ DeepSeek V4 training tracking issue: [`radixark/miles#1046`](https://github.com/ One command runs the full pipeline — dataset download, FP8 → BF16 cast, distributed `torch_dist` conversion, and the training loop: ```bash -# Pull the dev image: -docker pull radixark/miles:dev +# Pull the image: +docker pull radixark/miles:latest # 8-node Flash run (colocated), inside the container cd /root/miles @@ -73,7 +70,7 @@ In this section, we explain what `full-train` does under the hood, and how to dr ### 4.1 Download model + datasets ```bash -# inside the radixark/miles:dev container +# inside the radixark/miles:latest container hf download sgl-project/DeepSeek-V4-Flash-FP8 --local-dir /root/models/DeepSeek-V4-Flash-FP8 hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/datasets/dapo-math-17k hf download --repo-type dataset zhuzilin/aime-2024 --local-dir /root/datasets/aime-2024 @@ -111,7 +108,7 @@ The Python launcher's `prepare-spmd` subcommand drives the same conversion.
### 4.3 Multi-node fan-out -The Python launcher manages Ray internally — start each pod with the `radixark/miles:dev` image and a working shared filesystem mounted at the same path on every node, then on the head node: +The Python launcher manages Ray internally — start each pod with the `radixark/miles:latest` image and a working shared filesystem mounted at the same path on every node, then on the head node: ```bash ray start --head --num-gpus 8 --disable-usage-stats @@ -137,7 +134,7 @@ These are the validated layouts shipped with the launcher; All parallelisms are | GB300 | 8 × 4 = 32 | 8 | 4 | 1 | 8 | 1 | first 11 / last 10 layers | | GB300 | 8 × 4 = 32 | 2 | 8 | 2 | 4 | 1 | first 4 / last 3 layers | -The Nodes × GPUs column counts **training nodes** — in disaggregated mode (see [§3.3](#33-colocated-vs-disaggregated-rollout)) rollout nodes come on top of these. +The Nodes × GPUs column counts **actor (training) nodes** — in disaggregated mode (see [§3.3](#33-colocated-vs-disaggregated-rollout)) rollout nodes come on top of these. ### 5.2 Algorithm @@ -192,5 +189,5 @@ The `--low-memory-resume` flag (off by default) puts optimizer states on CPU dur ## 6. Pairs Well With -- [FP8 & Low Precision](../../advanced/fp8-low-precision.md) -- [Architecture Support](../../advanced/architecture-support.md) — the V4 plugin lives under `miles_plugins/models/deepseek_v4/`. +- [FP8 & Low Precision](/advanced/fp8-low-precision) +- [Architecture Support](/advanced/architecture-support) — the V4 plugin lives under `miles_plugins/models/deepseek_v4/`. diff --git a/docs/models/deepseek/deepseek-v4-pro.md b/docs/models/deepseek/deepseek-v4-pro.md index 96e006ea78..a2fd8c028b 100644 --- a/docs/models/deepseek/deepseek-v4-pro.md +++ b/docs/models/deepseek/deepseek-v4-pro.md @@ -2,23 +2,20 @@ title: DeepSeek-V4 Pro description: Launch recipe for DeepSeek-V4-Pro (1.6 T) — V4-family architecture at Pro scale.
--- - -# DeepSeek-V4 Pro - DeepSeek V4 training tracking issue: [`radixark/miles#1046`](https://github.com/radixark/miles/issues/1046). ## 1. Model Introduction -[DeepSeek-V4-Pro](https://huggingface.co/sgl-project/DeepSeek-V4-Pro-FP8) is a 49 B-active / 1.6 T-total MoE that scales up the same sparse-MLA + DSA-indexer + KV-compressor + hyper-connection stack as [V4-Flash](deepseek-v4-flash.md). The architecture family is identical; the deltas are size and a handful of tuned knobs (indexer top-k, output-projection groups, compression schedule). The miles + Megatron-Core integration ships in the same image as Flash and is selected with `--model-name DeepSeek-V4-Pro-FP8`. +[DeepSeek-V4-Pro](https://huggingface.co/sgl-project/DeepSeek-V4-Pro-FP8) is a 49 B-active / 1.6 T-total MoE that scales up the same sparse-MLA + DSA-indexer + KV-compressor + hyper-connection stack as [V4-Flash](/models/deepseek/deepseek-v4-flash). The architecture family is identical; the deltas are size and a handful of tuned knobs (indexer top-k, output-projection groups, compression schedule). The miles + Megatron-Core integration ships in the same image as Flash and is selected with `--model-name DeepSeek-V4-Pro-FP8`. -**Key highlights** (deltas vs [V4-Flash](deepseek-v4-flash.md#1-model-introduction)): +**Key highlights** (deltas vs [V4-Flash](/models/deepseek/deepseek-v4-flash#1-model-introduction)): - **Scaled-up V4 architecture**: 61 layers (vs 43), hidden-size 7168 (vs 4096), 128 attention heads (vs 64), `ffn_hidden_size=3072` and `moe_ffn_hidden_size=3072` (vs 2048). All layers are MoE (same `--moe-layer-freq` pattern). `q_lora_rank=1536` (vs 1024); latent KV (`kv_lora_rank=512`, `qk_head_dim=512`, `v_head_dim=512`) is unchanged across V4. - **Hybrid Attention with wider indexer and output projection**: `index_topk=1024` (vs Flash's 512) — Pro keeps 64 indexer heads × 128 dim but picks twice as many KV per query.
Grouped output projection uses `o_groups=16` (vs 8), keeping `o_lora_rank=1024`. - **KV compressors start heavily compressed**: 60-element schedule `[128, 128, 4, 128, 4, 128, …, 4, 0]` — Pro skips Flash's two leading uncompressed layers and starts at ratio-128 (HCA) from layer 0. Middle layers still alternate 4× (CSA) and 128× (HCA); only the final layer is uncompressed. Compressor RoPE base (`compress_rope_theta=160000`) is shared with Flash. - **MoE topology**: 384 routed experts + 1 shared (vs Flash's 256 + 1), top-6. `--moe-router-topk-scaling-factor 2.5` (vs Flash 1.5) compensates for the larger expert pool. The first 3 layers (`num_hash_layers=3`) remain dense-routed via hash buckets. - **Identical YaRN RoPE and context**: `rope_theta=10000`, YaRN `factor=16`, `original_max_position_embeddings=65536` → effective context length **1,048,576 tokens (1 M)**, same as Flash. -- **Hyper-connection (HC) routing**: `hc_mult=4` parallel streams with sinkhorn-normalised mixing, same as Flash (PP buffers stay 4-D). +- **Hyper-connection (HC) routing**: `hc_mult=4` parallel streams with sinkhorn-normalized mixing, same as Flash (PP buffers stay 4-D). - **FP8 weights with simulated FP8 QAT** on indexer and compressor activations; default training is BF16 on the cast checkpoint and default rollout is FP8 in SGLang with `--sglang-attention-backend compressed`. ## 2.
Supported Variants @@ -32,8 +29,8 @@ DeepSeek V4 training tracking issue: [`radixark/miles#1046`](https://github.com/ ### 3.1 One-line launch ```bash -# Pull the dev image: -docker pull radixark/miles:dev +# Pull the image: +docker pull radixark/miles:latest # Production Pro run, inside the container cd /root/miles @@ -53,11 +50,11 @@ The `full-train` subcommand chains `prepare-download → prepare-single → prep | `--model-local-dir` | unset → same as `--model-dir` | local NVMe path on each node; `prepare-cp` rsyncs the HF checkpoint and `_torch_dist` here so the trainer reads from local disk (set it when `--model-dir` is on shared/remote storage) | | `--save-dir` | `/root/models` | training checkpoints under `{save-dir}/{run-id}/checkpoints/` | -Pro uses the same launcher as V4-Flash, so every option above can also be preconfigured via `MILES_SCRIPT_` env vars (precedence: CLI flag > env var > built-in default) — see [V4-Flash §3.2](deepseek-v4-flash.md#32-launcher-path-defaults) for details. +Pro uses the same launcher as V4-Flash, so every option above can also be preconfigured via `MILES_SCRIPT_` env vars (precedence: CLI flag > env var > built-in default) — see [V4-Flash §3.2](/models/deepseek/deepseek-v4-flash#32-launcher-path-defaults) for details. ## 4. Script breakdown -The under-the-hood stages are essentially identical to V4-Flash — see the [V4-Flash Script breakdown](deepseek-v4-flash.md#4-script-breakdown) and substitute the Pro model name and path defaults shown above. +The under-the-hood stages are essentially identical to V4-Flash — see the [V4-Flash Script breakdown](/models/deepseek/deepseek-v4-flash#4-script-breakdown) and substitute the Pro model name and path defaults shown above. ## 5. Example Recipe Configuration @@ -71,7 +68,7 @@ These are the validated layouts shipped with the launcher; All parallelisms are ### 5.2 Algorithm -Same as Flash — see [V4-Flash §5.2 Algorithm](deepseek-v4-flash.md#52-algorithm).
+Same as Flash — see [V4-Flash §5.2 Algorithm](/models/deepseek/deepseek-v4-flash#52-algorithm). ### 5.3 Rollout & SGLang @@ -97,10 +94,11 @@ Required env vars (the launcher sets these for you): `SGLANG_SKIP_CHECKPOINT_LOA Megatron side: `--qkv-format bshd` (V4 needs `bshd` with CP-aware data slicing). The DSA indexer additionally supports replay via `--use-rollout-indexer-replay` (off by default). -!!! warning "Pro-specific rollout caveats" - 1. **Engine size ≥ 32 GPUs.** Pro needs a single SGLang engine spanning at least 32 GPUs — the launcher hard-codes `--rollout-num-gpus-per-engine 32`. Smaller engines do not leave enough memory after weights, KV cache, indexer state, and DeepEP buffers, and rollout will OOM under load. - 2. **EP is mandatory; pure TP will not shard the model.** 384 routed experts × `moe_ffn_hidden_size=3072` cannot be partitioned by tensor parallelism alone — the model must use expert parallelism (`--sglang-ep-size 32`) to spread the expert MLPs across ranks. `--sglang-tp-size 32` only covers the attention / embedding paths. - 3. **DeepEP normal-mode + CUDA graphs can hang at large batch sizes.** When `--sglang-moe-a2a-backend deepep` is on, an overly large `--sglang-cuda-graph-max-bs` makes SGLang hang during graph capture or replay. The launcher pins it to `8` for Pro — raise it only after verifying the engine doesn't deadlock at your target batch. + +1. **Engine size ≥ 32 GPUs.** Pro needs a single SGLang engine spanning at least 32 GPUs — the launcher hard-codes `--rollout-num-gpus-per-engine 32`. Smaller engines do not leave enough memory after weights, KV cache, indexer state, and DeepEP buffers, and rollout will OOM under load. +2. **EP is mandatory; pure TP will not shard the model.** 384 routed experts × `moe_ffn_hidden_size=3072` cannot be partitioned by tensor parallelism alone — the model must use expert parallelism (`--sglang-ep-size 32`) to spread the expert MLPs across ranks.
`--sglang-tp-size 32` only covers the attention / embedding paths. +3. **DeepEP normal-mode + CUDA graphs can hang at large batch sizes.** When `--sglang-moe-a2a-backend deepep` is on, an overly large `--sglang-cuda-graph-max-bs` makes SGLang hang during graph capture or replay. The launcher pins it to `8` for Pro — raise it only after verifying the engine doesn't deadlock at your target batch. + ### 5.4 Optimizer @@ -123,6 +121,6 @@ Pro selects `--model-name DeepSeek-V4-Pro-FP8`, which flips `optimizer_offload=T ## 6. Pairs Well With -- [FP8 & Low Precision](../../advanced/fp8-low-precision.md) -- [Architecture Support](../../advanced/architecture-support.md) -- [DeepSeek V4 Flash](deepseek-v4-flash.md) — sibling recipe; shares the V4-family architecture. +- [FP8 & Low Precision](/advanced/fp8-low-precision) +- [Architecture Support](/advanced/architecture-support) +- [DeepSeek V4 Flash](/models/deepseek/deepseek-v4-flash) — sibling recipe; shares the V4-family architecture. diff --git a/docs/models/deepseek/deepseek.md b/docs/models/deepseek/deepseek.md index decfbe2deb..ffc3d4cba9 100644 --- a/docs/models/deepseek/deepseek.md +++ b/docs/models/deepseek/deepseek.md @@ -2,9 +2,6 @@ title: DeepSeek R1 / V3 description: Launch recipe for DeepSeek-R1 / DeepSeek-V3 (671 B total / 37 B active) on 16 nodes × 8 H100. --- - -# DeepSeek R1 / V3 - ## 1. Model Introduction [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) is a large-scale Mixture-of-Experts language model from DeepSeek, and [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) is the reasoning-tuned variant built on the same architecture. Both expose the same Megatron-side definition in miles and share the launch recipe on this page. @@ -241,6 +238,6 @@ OPTIMIZER_ARGS=( ## 6.
Pairs Well With -- [PD Disaggregation](../../advanced/pd-disaggregation.md) -- [P2P Weight Transfer](../../advanced/p2p-weight-transfer.md) -- [Fault Tolerance](../../advanced/fault-tolerance.md) +- [PD Disaggregation](/advanced/pd-disaggregation) +- [P2P Weight Transfer](/advanced/p2p-weight-transfer) +- [Fault Tolerance](/advanced/fault-tolerance) diff --git a/docs/models/deepseek/index.md b/docs/models/deepseek/index.md index 322a3cbdcb..ff15dda912 100644 --- a/docs/models/deepseek/index.md +++ b/docs/models/deepseek/index.md @@ -2,25 +2,22 @@ title: DeepSeek description: Miles recipes for the DeepSeek family — DeepSeek-V4 Flash (sparse-MLA + DSA indexer), V3, and R1. --- - -# DeepSeek family - Miles ships recipes for the DeepSeek family across two generations: **DeepSeek-V4 Flash** introduces sparse multi-head latent attention with a learned indexer and KV compressors (8-node H200), while **V3 / R1** remain the canonical 16-node 671 B-parameter recipes (BF16 train + 128×128 block-wise FP8 rollout, DeepEP, DAPO-style dynamic sampling).
## Variants | Model | Active / Total | HF ID | Recipe | |---|---|---|---| -| DeepSeek-V4-Pro | 49 B / 1.6 T | TBA | [deepseek-v4-pro](deepseek-v4-pro.md) | -| DeepSeek-V4-Flash | 13 B / 284 B | `sgl-project/DeepSeek-V4-Flash-FP8` | [deepseek-v4-flash](deepseek-v4-flash.md) | -| DeepSeek-V3 | 37 B / 671 B | `deepseek-ai/DeepSeek-V3` | [deepseek](deepseek.md) | -| DeepSeek-R1 | 37 B / 671 B | `deepseek-ai/DeepSeek-R1` | [deepseek](deepseek.md) | +| DeepSeek-V4-Pro | 49 B / 1.6 T | TBA | [deepseek-v4-pro](/models/deepseek/deepseek-v4-pro) | +| DeepSeek-V4-Flash | 13 B / 284 B | `sgl-project/DeepSeek-V4-Flash-FP8` | [deepseek-v4-flash](/models/deepseek/deepseek-v4-flash) | +| DeepSeek-V3 | 37 B / 671 B | `deepseek-ai/DeepSeek-V3` | [deepseek](/models/deepseek/deepseek) | +| DeepSeek-R1 | 37 B / 671 B | `deepseek-ai/DeepSeek-R1` | [deepseek](/models/deepseek/deepseek) | A validated DeepSeek-V4-Pro recipe is not yet available — see [`radixark/miles#1046`](https://github.com/radixark/miles/issues/1046) for tracking. ## Fastest path to train -DeepSeek-V4-Flash needs 8 nodes of 8× H200 and the `radixark/miles:dev` image: +DeepSeek-V4-Flash needs 8 nodes of 8× H200 and the `radixark/miles:latest` image: ```bash cd /root/miles @@ -36,10 +33,10 @@ cd /root/miles bash scripts/run-deepseek-r1.sh # full 16-node run ``` -See the [DeepSeek-V4 Flash](deepseek-v4-flash.md) page for the V4 architecture summary, parallelism layouts, and known workarounds; see the [DeepSeek R1 / V3](deepseek.md) page for the V3 flow — FP8 → BF16 conversion, Megatron parallelism layout (TP8 / PP4 / EP32 / CP4), per-arg walkthrough, and the alternate Python launcher (`scripts/run_deepseek.py`).
+See the [DeepSeek-V4 Flash](/models/deepseek/deepseek-v4-flash) page for the V4 architecture summary, parallelism layouts, and known workarounds; see the [DeepSeek R1 / V3](/models/deepseek/deepseek) page for the V3 flow — FP8 → BF16 conversion, Megatron parallelism layout (TP8 / PP4 / EP32 / CP4), per-arg walkthrough, and the alternate Python launcher (`scripts/run_deepseek.py`). ## Pairs well with -- [PD Disaggregation](../../advanced/pd-disaggregation.md) — 671 B is where PD really earns its keep. -- [P2P Weight Transfer](../../advanced/p2p-weight-transfer.md) — amortize weight sync across ranks. -- [Fault Tolerance](../../advanced/fault-tolerance.md) — node failures are inevitable at 16-node scale. +- [PD Disaggregation](/advanced/pd-disaggregation) — 671 B is where PD really earns its keep. +- [P2P Weight Transfer](/advanced/p2p-weight-transfer) — amortize weight sync across ranks. +- [Fault Tolerance](/advanced/fault-tolerance) — node failures are inevitable at 16-node scale. diff --git a/docs/models/glm/glm4-5.md b/docs/models/glm/glm4-5.md new file mode 100644 index 0000000000..0d7ab59bd9 --- /dev/null +++ b/docs/models/glm/glm4-5.md @@ -0,0 +1,141 @@ +--- +title: GLM4.5 +description: Launch recipes for GLM-4.5 (355B-A32B) — bash launcher and Python launcher. +--- +## 1. Model Introduction + +[GLM-4.5](https://huggingface.co/zai-org/GLM-4.5) is Zhipu AI's flagship MoE language model with advanced capabilities in reasoning, function calling, and multi-modal understanding. + +**Key highlights:** + +- **Sparse MoE architecture**: 355 B / 32 B-active for frontier runs and 106 B / 12 B-active for two-node experimentation. +- **Strong reasoning**: built-in step-by-step reasoning, with FP8 rollout supported on Blackwell hardware. +- **Speculative decoding**: EAGLE/MTP rollout supported by the bash launcher; the Python launcher exposes `--enable-mtp`.
+- **R3 / MIS opt-in**: routing-stability extensions available behind a flag (`--enable-mis`) on the Python launcher. + +## 2. Supported Variants + +| Model | Active / Total | HF ID | +|---|---|---| +| GLM-4.5-355B-A32B | 32 B / 355 B | [zai-org/GLM-4.5](https://huggingface.co/zai-org/GLM-4.5) | +| GLM-4.5-Air (106B-A12B) | 12 B / 106 B | [zai-org/GLM-4.5-Air](https://huggingface.co/zai-org/GLM-4.5-Air) | + +The 106B-A12B variant has no launcher under `scripts/`; the canonical recipe is [`examples/p2p_weight_transfer/GLM-4.5-Air.sh`](https://github.com/radixark/miles/blob/main/examples/p2p_weight_transfer/GLM-4.5-Air.sh) (8-node, P2P weight transfer). + +## 3. Environment Setup + +### 3.1 Required env vars + +The bash launcher (`run-glm4.5-355B-A32B.sh`) requires: + +```bash +export BASE_DIR= +# so it comes from the cluster orchestrator. +``` + +The Python launcher (`run_glm45_355b_a32b.py`) reads no env vars — pass options via the Typer CLI. + +### 3.2 Download model + datasets + +```bash +hf download zai-org/GLM-4.5 --local-dir $BASE_DIR/GLM-4.5-355B-A32B +hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir $BASE_DIR/dapo-math-17k +hf download --repo-type dataset zhuzilin/aime-2024 --local-dir $BASE_DIR/rl_data/aime-2024 +``` + +### 3.3 HF → Megatron `torch_dist` conversion + +The bash launcher does **not** convert for you — produce `$BASE_DIR/GLM-4.5-355B-A32B_torch_dist/` ahead of time: + +```bash +cd /root/miles +source scripts/models/glm4.5-355B-A32B.sh +PYTHONPATH=/root/Megatron-LM torchrun --nproc-per-node 8 \ + tools/convert_hf_to_torch_dist.py \ + ${MODEL_ARGS[@]} \ + --hf-checkpoint $BASE_DIR/GLM-4.5-355B-A32B \ + --save $BASE_DIR/GLM-4.5-355B-A32B_torch_dist +``` + +The Python launcher automates the full flow (download → optional `tools/convert_hf_to_fp8.py` → `convert_checkpoint` → `rsync` to `model_local_dir` → submit). + +## 4.
Launch + +### 4.1 Quick start + +```bash +# Bash launcher (8 nodes × 8 GPU) +cd /root/miles +export BASE_DIR=... +bash scripts/run-glm4.5-355B-A32B.sh + +# Python launcher (Blackwell hardware only — _execute_train asserts hardware != "H100") +python scripts/run_glm45_355b_a32b.py train --hardware GB300 +``` + +### 4.2 Multi-node fan-out + +`run-glm4.5-355B-A32B.sh` performs Ray fan-out internally via the `ssh` loop over `/root/mpi_rack_hostfile`. + +## 5. Recipe Configuration + +### 5.1 Parallelism + +| Source | TP | PP | CP | EP | expert-TP | `max_tokens_per_gpu` | GPUs | +|---|---|---|---|---|---|---|---| +| `run-glm4.5-355B-A32B.sh` | 8 | 4 | 2 | 16 | 1 | 16384 | 64 (8 × 8) | +| `run_glm45_355b_a32b.py` (`num_nodes ≤ 4`, debug) | 4 | 1 | 1 | 4 | 1 | 16384 | ≤ 32 (≤ 4 × 8) | +| `run_glm45_355b_a32b.py` (`num_nodes == 8`) | 4 | 8 | 2 | 8 | 1 | 16384 | 64 (8 × 8) | + +### 5.2 Algorithm + +| Source | Advantage | Notable flags | +|---|---|---| +| `run-glm4.5-355B-A32B.sh` | GSPO | `--eps-clip 1e-4 --eps-clip-high 2e-4 --use-tis` | +| `run_glm45_355b_a32b.py` | GRPO | `--eps-clip 1e-4 --eps-clip-high 2e-4 --use-tis` | + +Neither launcher enables `--use-rollout-routing-replay` by default. The Python launcher exposes `--enable-mis` (TIS/RS config) as an opt-in. + +### 5.3 Rollout & SGLang + +```bash +SGLANG_ARGS=( + --rollout-num-gpus-per-engine 32 + --sglang-mem-fraction-static 0.7 + --sglang-enable-dp-attention + --sglang-dp-size 4 + --sglang-ep-size 32 + --sglang-enable-dp-lm-head + --sglang-moe-dense-tp-size 1 + + # mtp / EAGLE + --sglang-speculative-algorithm EAGLE + --sglang-speculative-num-steps 1 + --sglang-speculative-eagle-topk 1 + --sglang-speculative-num-draft-tokens 2 + --sglang-enable-draft-weights-cpu-backup +) +``` + +Megatron side: `--moe-token-dispatcher-type flex`, `--moe-enable-deepep`.
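As a sanity check on the §5.1 parallelism table, the sketch below recomputes the implied data-parallel size from TP × PP × CP and confirms that EP × expert-TP divides the ranks in each pipeline stage. The divisibility rules follow the usual Megatron-Core convention and are an assumption here, not something read from the launchers; the `Layout` class and its method names are hypothetical, and the debug row is evaluated at its 32-GPU upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """One row of the parallelism table (hypothetical helper, not a Miles API)."""

    name: str
    tp: int
    pp: int
    cp: int
    ep: int
    etp: int
    gpus: int

    def data_parallel(self) -> int:
        # Dense layers: world size = TP * PP * CP * DP.
        model_parallel = self.tp * self.pp * self.cp
        if self.gpus % model_parallel:
            raise ValueError(f"{self.name}: TP*PP*CP does not divide {self.gpus} GPUs")
        return self.gpus // model_parallel

    def expert_data_parallel(self) -> int:
        # Expert layers: ranks in one pipeline stage = EP * expert-TP * expert-DP.
        per_stage = self.gpus // self.pp
        expert_parallel = self.ep * self.etp
        if per_stage % expert_parallel:
            raise ValueError(f"{self.name}: EP*ETP does not divide {per_stage} ranks")
        return per_stage // expert_parallel


LAYOUTS = [
    Layout("run-glm4.5-355B-A32B.sh", tp=8, pp=4, cp=2, ep=16, etp=1, gpus=64),
    Layout("run_glm45_355b_a32b.py, 8 nodes", tp=4, pp=8, cp=2, ep=8, etp=1, gpus=64),
    Layout("run_glm45_355b_a32b.py, 4 nodes (debug)", tp=4, pp=1, cp=1, ep=4, etp=1, gpus=32),
]

for layout in LAYOUTS:
    print(layout.name, "DP =", layout.data_parallel(), "expert DP =", layout.expert_data_parallel())
```

Both 64-GPU rows come out at DP = 1, so every GPU is consumed by model parallelism; only the debug layout leaves room for data parallelism.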
+ +### 5.4 Optimizer + +CPU Adam on: + +```bash +--optimizer-cpu-offload +--overlap-cpu-optimizer-d2h-h2d +--use-precision-aware-optimizer +``` + +### 5.5 Notable quirks + +- The bash launcher does not set `--load`/`--save` in `CKPT_ARGS` — `--load` defaults to the value of `--ref-load`. +- `run_glm45_355b_a32b.py` is Blackwell-only: `_execute_train` asserts `args.hardware != "H100"`. + +## 6. Pairs Well With + +- [Low Precision RL](/advanced/fp8-low-precision) +- [INT4 QAT](/advanced/int4-qat) +- [Rollout Routing Replay (R3)](/advanced/miles-router) — opt-in via `--enable-mis` on the Python launcher. diff --git a/docs/models/glm/glm4-7-flash.md b/docs/models/glm/glm4-7-flash.md new file mode 100644 index 0000000000..cf467c0ae8 --- /dev/null +++ b/docs/models/glm/glm4-7-flash.md @@ -0,0 +1,112 @@ +--- +title: GLM4.7 Flash +description: Launch recipes for GLM-4.7-Flash — compact MLA + MoE with R3 enabled by default. +--- +## 1. Model Introduction + +[GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) is a lightweight, high-speed MoE model in the GLM-4.7 series from Zhipu AI, designed for single-GPU-node deployment. + +**Key highlights:** + +- **Compact MoE architecture**: 30 B total / 3 B active, sparse activation for efficient inference. +- **MLA attention**: Multi-head Latent Attention with q-LoRA rank 768 and kv-LoRA rank 512. +- **MTP head + EAGLE speculative**: built-in `--mtp-num-layers 1` and EAGLE rollout enabled by default. +- **R3 on by default**: both miles launchers enable `--use-rollout-routing-replay` out of the box. + +## 2. Supported Variants + +| Model | Active / Total | HF ID | +|---|---|---| +| GLM-4.7-Flash | 3 B / 30 B | [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) | + +## 3.
Environment Setup + +### 3.1 Download model + datasets + +```bash +hf download zai-org/GLM-4.7-Flash --local-dir /root/shared/GLM-4.7-Flash +hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/shared/dapo-math-17k +hf download --repo-type dataset zhuzilin/aime-2024 --local-dir /root/shared/aime-2024 +``` + +The bash launcher hardcodes `BASE_DIR=/root/shared`. The Python launcher downloads `zhuzilin/dapo-math-17k` and `zhuzilin/aime-2024` automatically. + +### 3.2 HF → Megatron `torch_dist` conversion + +```bash +cd /root/miles +source scripts/models/glm4.7-flash.sh +PYTHONPATH=/root/Megatron-LM torchrun --nproc-per-node 8 \ + tools/convert_hf_to_torch_dist.py \ + ${MODEL_ARGS[@]} \ + --hf-checkpoint /root/shared/GLM-4.7-Flash \ + --save /root/shared/GLM-4.7-Flash_torch_dist +``` + +The Python launcher does the conversion automatically. + +## 4. Launch + +### 4.1 Quick start + +```bash +cd /root/miles +bash scripts/run-glm4.7-flash.sh + +# Python launcher (H200 only — `hardware` literal in the dataclass) +python scripts/run_glm47_flash.py +``` + +Defaults of the Python launcher (see `ScriptArgs`): `model_org=zai-org`, `model_name=GLM-4.7-Flash`, `num_gpus_per_node=8`, `hardware=H200`, `data_dir=/root/datasets`, `model_dir=/root/models`. + +## 5. Recipe Configuration + +### 5.1 Parallelism + +| TP | PP | CP | EP | expert-TP | `max_tokens_per_gpu` | GPUs | +|---|---|---|---|---|---|---| +| 4 | 1 | 1 | 8 | 1 | 32768 | 8 (1 × 8) | + +`--rollout-num-gpus-per-engine 4` (TP must divide 20 attention heads, so TP=4). The bash launcher's `SGLANG_ARGS` keeps `--sglang-enable-dp-attention` / `--sglang-dp-size` commented out — the in-source comment notes that DP-attention requires `tp_size % dp_size == 0`. + +### 5.2 Algorithm + +GRPO with `--eps-clip 0.2 --eps-clip-high 0.28 --use-kl-loss --kl-loss-coef 0.00`.
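Read together, the two clip flags describe an asymmetric trust region: the importance ratio is clipped to [1 - 0.2, 1 + 0.28], and a KL coefficient of 0.00 gives the KL term zero weight in the loss. The sketch below shows the standard PPO-style clipped surrogate under that reading. It is a plain-Python illustration for a single token, not Miles's implementation, and `clipped_surrogate_loss` is a hypothetical name.

```python
def clipped_surrogate_loss(
    ratio: float,
    advantage: float,
    eps_clip: float = 0.2,
    eps_clip_high: float = 0.28,
) -> float:
    """Per-token policy loss with asymmetric clipping (illustrative only).

    ratio is pi_new(token) / pi_old(token); advantage is the token's advantage.
    """
    # Clip the ratio into [1 - eps_clip, 1 + eps_clip_high].
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip_high)
    # PPO takes the pessimistic (smaller) objective, then negates it to get a loss.
    return -min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: gains are capped once the ratio exceeds 1.28.
upside = clipped_surrogate_loss(ratio=1.5, advantage=1.0)
# Negative advantage: the penalty is held at the 0.8 floor once the ratio drops below it.
downside = clipped_surrogate_loss(ratio=0.5, advantage=-1.0)
# Inside the trust region the loss is the unclipped surrogate.
inside = clipped_surrogate_loss(ratio=1.1, advantage=1.0)
```

The wider upper bound lets low-probability tokens with positive advantage grow further than symmetric clipping would allow, which is the usual motivation for setting `--eps-clip-high` above `--eps-clip`.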
+ +### 5.3 Rollout & SGLang + +```bash +SGLANG_ARGS=( + --rollout-num-gpus-per-engine 4 + --sglang-mem-fraction-static 0.7 + + # EAGLE speculative decoding (MTP) + --sglang-speculative-algorithm EAGLE + --sglang-speculative-num-steps 2 + --sglang-speculative-eagle-topk 1 + --sglang-speculative-num-draft-tokens 3 + + # R3 — on by default in this script + --use-rollout-routing-replay +) +``` + +### 5.4 Optimizer + +CPU Adam on: + +```bash +--optimizer-cpu-offload +--overlap-cpu-optimizer-d2h-h2d +--use-precision-aware-optimizer +``` + +### 5.5 Notable quirks + +- Megatron-side DeepEP / `flex` dispatcher are commented out by default in this recipe. +- R3 (`--use-rollout-routing-replay`) is enabled by default — atypical for the rest of the model lineup. + +## 6. Pairs Well With + +- [Rollout Routing Replay (R3)](/advanced/miles-router) — already on by default. +- [Low Precision RL](/advanced/fp8-low-precision) diff --git a/docs/models/glm/glm4.md b/docs/models/glm/glm4.md new file mode 100644 index 0000000000..688ba67ce8 --- /dev/null +++ b/docs/models/glm/glm4.md @@ -0,0 +1,100 @@ +--- +title: GLM4 +description: Launch recipes for GLM-Z1-9B-0414. The 32 B model config ships without a launcher. +--- +## 1. Model Introduction + +[GLM-Z1-9B-0414](https://huggingface.co/zai-org/GLM-Z1-9B-0414) is a dense reasoning-tuned model from Zhipu AI's GLM-4 series, sized for single-node experimentation. + +**Key highlights:** + +- **Dense 9 B architecture**: fits comfortably on a single 8-GPU node. +- **Reasoning-tuned**: post-trained for step-by-step reasoning under the GLM-Z1 line. +- **Compatible RL recipe**: GRPO with DAPO-style rollout, drop-in replacement for other dense Qwen / LLaMA-class workloads. + +## 2. Supported Variants + +| Model | HF ID | +|---|---| +| GLM-Z1-9B-0414 | [zai-org/GLM-Z1-9B-0414](https://huggingface.co/zai-org/GLM-Z1-9B-0414) | + +## 3.
Environment Setup + +### 3.1 Download model + datasets + +```bash +hf download zai-org/GLM-Z1-9B-0414 --local-dir /root/GLM-Z1-9B-0414 +hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k +hf download --repo-type dataset zhuzilin/aime-2024 --local-dir /root/aime-2024 +``` + +### 3.2 HF → Megatron `torch_dist` conversion + +```bash +cd /root/miles +source scripts/models/glm4-9B.sh +PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \ + ${MODEL_ARGS[@]} \ + --hf-checkpoint /root/GLM-Z1-9B-0414 \ + --save /root/GLM-Z1-9B-0414_torch_dist +``` + +## 4. Launch + +### 4.1 Quick start + +```bash +cd /root/miles +bash scripts/run-glm4-9B.sh # 8 GPU +bash scripts/run-glm4-9B-4xgpu-radixtree.sh # 4 GPU smoke test +``` + +## 5. Recipe Configuration + +### 5.1 Parallelism + +| Script | TP | PP | CP | EP | `max_tokens_per_gpu` | actor / rollout GPUs | GPUs | +|---|---|---|---|---|---|---|---| +| `run-glm4-9B.sh` | 2 | 1 | 2 | 1 | 4608 | 4 / 4 (non-colocate) | 8 (1 × 8) | +| `run-glm4-9B-4xgpu-radixtree.sh` | 2 | 1 | 1 | 1 | 2304 | 4 / 2 | 4 (1 × 4) | + +### 5.2 Algorithm + +GRPO across both scripts: + +```bash +GRPO_ARGS=( + --advantage-estimator grpo + --use-kl-loss + --kl-loss-coef 0.00 + --eps-clip 0.2 + --eps-clip-high 0.28 +) +``` + +### 5.3 Rollout & SGLang + +```bash +# run-glm4-9B.sh +SGLANG_ARGS=( + --rollout-num-gpus-per-engine 2 +) + +# run-glm4-9B-4xgpu-radixtree.sh +SGLANG_ARGS=( + --rollout-num-gpus-per-engine 2 +) +``` + +### 5.4 Optimizer + +CPU Adam is not enabled in either launcher. + +### 5.5 Notable quirks + +- `run-glm4-9B.sh` runs actor and rollout on disjoint GPUs (non-colocate). + +## 6.
Pairs Well With + +- [Rollout Routing Replay (R3)](/advanced/miles-router) +- [Low Precision RL](/advanced/fp8-low-precision) diff --git a/docs/models/glm/glm5.md b/docs/models/glm/glm5.md new file mode 100644 index 0000000000..9703e32c03 --- /dev/null +++ b/docs/models/glm/glm5.md @@ -0,0 +1,117 @@ +--- +title: GLM-5 / GLM-5.1 +description: Launch recipe for GLM-5 and GLM-5.1 (744 B / 40 B active) — Python launcher, 16+ node config. +--- +## 1. Model Introduction + +[GLM-5](https://huggingface.co/zai-org/GLM-5) is the most powerful language model in Zhipu AI's GLM series, scaling to 744 B parameters (40 B active) and integrating DeepSeek Sparse Attention (DSA) for long-context efficiency. [GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) is the next-generation model for agentic engineering on top of GLM-5, sharing the same model architecture. + +**Key highlights:** + +- **Sparse MoE at frontier scale**: 744 B total / 40 B active per token, 256 routed experts top-8 + 1 shared. +- **MLA + DSA attention**: Multi-head Latent Attention (q-LoRA 2048 / kv-LoRA 512) combined with DeepSeek Sparse Attention to keep KV-cache cost low at long context. +- **Speculative decoding**: EAGLE/MTP rollout supported via `--enable-mtp`. +- **PD disaggregation**: prefill/decode disaggregation enabled by default for ≥1 node. + +## 2. Supported Variants + +| Model | Active / Total | HF ID | +|---|---|---| +| GLM-5.1 | 40 B / 744 B | [zai-org/GLM-5.1](https://huggingface.co/zai-org/GLM-5.1) | +| GLM-5 | 40 B / 744 B | [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5) | + +## 3. Environment Setup + +### 3.1 Download model + datasets + +The Python launcher's `prepare` subcommand handles download + dataset staging: + +```bash +python scripts/run_glm5_744b_a40b.py prepare --model-name GLM-5 --num-nodes 16 +``` + +### 3.2 HF → Megatron `torch_dist` conversion + +Also handled by `prepare`.
The launcher patches `config.json` to set `model_type=deepseek_v32` (`_process_glm_checkpoint`) before conversion — GLM-5 is loaded through the DeepseekV32 architecture path. Run `prepare-cp` afterwards on every node to copy the converted checkpoint from shared NFS to local disk. + +## 4. Launch + +### 4.1 Quick start + +```bash +python scripts/run_glm5_744b_a40b.py full-train --model-name GLM-5 --num-nodes 16 +``` + +The Typer app exposes four subcommands: + +```bash +python scripts/run_glm5_744b_a40b.py full-train --model-name GLM-5 --num-nodes + +# Just download model + datasets and convert to Megatron +python scripts/run_glm5_744b_a40b.py prepare --model-name GLM-5 --num-nodes + +# Copy converted checkpoint from shared NFS to local disk (run on every node) +python scripts/run_glm5_744b_a40b.py prepare-cp --model-name GLM-5 --num-nodes + +# Train only (assumes prepare/prepare-cp done) +python scripts/run_glm5_744b_a40b.py train --model-name GLM-5 --num-nodes +``` + +The launcher's docstring says it's tested on **H200 / B200 / GB300**; the dataclass restricts `--hardware` to `{H200, B200, GB300}`. + +## 5. Recipe Configuration + +### 5.1 Parallelism + +Verbatim from `_execute_train`, `--num-nodes ≥ 16` branch: + +| TP | PP | CP | EP | expert-TP | `decoder-last-pipeline-num-layers` | `max_tokens_per_gpu` | GPUs | +|---|---|---|---|---|---|---|---| +| 4 | 4 | 2 | 32 | 1 | 18 | 16384 | ≥ 128 (≥ 16 × 8) | + +Plus `--use-dynamic-batch-size`, `--data-pad-size-multiplier 4096`, `--log-probs-chunk-size 1024`, `--recompute-granularity full --recompute-method uniform --recompute-num-layers 1`. + +### 5.2 Algorithm + +GRPO with `--eps-clip 0.2 --eps-clip-high 0.28`. R3 (`--use-rollout-routing-replay`) is **not** enabled by default.
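GRPO replaces a learned value baseline with a group-relative one: each prompt is sampled several times, and every reward is normalized against the other samples for that same prompt. The sketch below is a minimal illustration, assuming the common mean/std normalization with a small epsilon for stability. The epsilon value and the function name are illustrative, and Miles may normalize differently.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one prompt's sample group (illustrative only)."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std over the group
    # eps keeps the division finite when every sample earns the same reward.
    return [(reward - mean) / (std + eps) for reward in rewards]


# Four rollouts of the same prompt: two correct (reward 1.0), two wrong (reward 0.0).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every rollout earns the same reward carries no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

A uniform group yields all-zero advantages, which is why DAPO-style dynamic sampling filters such groups out before the update.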
+ +### 5.3 Rollout & SGLang + +Always-on flags: + +```bash +--sglang-mem-fraction-static 0.70 +--sglang-enable-dp-attention +--sglang-ep-size +--sglang-dp-size +--sglang-moe-dense-tp-size 1 +--sglang-enable-dp-lm-head + +# DSA / NSA attention +--sglang-page-size 64 +--sglang-nsa-decode-backend flashmla_sparse +--sglang-nsa-prefill-backend flashmla_sparse +--sglang-attention-backend nsa + +--sglang-max-running-requests 512 +--sglang-watchdog-timeout 3600 +``` + +### 5.4 Optimizer + +`--enable-optimizer-offload` adds `--optimizer-cpu-offload --overlap-cpu-optimizer-d2h-h2d --use-precision-aware-optimizer` (opt-in). + +### 5.5 Notable quirks + +The launcher exposes these as flags: + +- `--fp8-rollout` — runs `tools/convert_hf_to_fp8.py --strategy block --block-size 128 128` and feeds the FP8 directory to SGLang (Megatron stays BF16). +- `--enable-mtp` — adds SGLang EAGLE speculative decoding (`--sglang-speculative-{algorithm,num-steps,eagle-topk,num-draft-tokens}`). +- `--enable-pd` (default `True` for ≥1 node) — enables prefill/decode disaggregation; with PD the launcher uses larger SGLang world sizes (16 for `<16` nodes, 64 for `≥16` nodes). +- `--use-deepep` (default `True`) — enables Megatron-side DeepEP (`--moe-enable-deepep --moe-token-dispatcher-type flex`); falls back to `alltoall`. Forced off on GB300. + +## 6. Pairs Well With + +- [PD Disaggregation](/advanced/pd-disaggregation) — on by default for `num_nodes ≥ 1`. +- [Low Precision RL](/advanced/fp8-low-precision) — opt-in via `--fp8-rollout`. +- [Speculative Decoding](/advanced/speculative-decoding) — opt-in via `--enable-mtp`. diff --git a/docs/models/glm/index.md b/docs/models/glm/index.md new file mode 100644 index 0000000000..8650fec856 --- /dev/null +++ b/docs/models/glm/index.md @@ -0,0 +1,35 @@ +--- +title: GLM +description: Miles recipes for the GLM4, GLM4.5, GLM4.7 Flash, and GLM5 families — dense and MoE.
+--- +Miles ships RL recipes for every GLM generation currently in production: the dense GLM4 line (9 B, 32 B โ€” Zhipu "Z1" reasoning checkpoints), the GLM4.5 MoE at 106 B-A12B and 355 B-A32B, the compact GLM4.7 Flash with 64 routed experts, and the 744 B-A40B GLM5 flagship. + +## Variants + +| Family | Class | Sizes | Recipe | +|---|---|---|---| +| GLM4 | Dense | 9 B ยท 32 B | [glm4](/models/glm/glm4) | +| GLM4.5 | MoE | 12 B / 106 B ยท 32 B / 355 B | [glm4-5](/models/glm/glm4-5) | +| GLM4.7 Flash | MoE (64 experts, top-4) | Compact | [glm4-7-flash](/models/glm/glm4-7-flash) | +| GLM5 | MoE | 40 B / 744 B | [glm5](/models/glm/glm5) | + +## Fastest path to train + +GLM4-9B (GLM-Z1-9B-0414) on a single 8ร— H100 node โ€” the smallest GLM recipe: + +```bash +cd /root/miles +hf download zai-org/GLM-Z1-9B-0414 --local-dir /root/GLM-Z1-9B-0414 +bash scripts/run-glm4-9B.sh +``` + +See the [GLM4 Dense](/models/glm/glm4) page for weight conversion and the full walkthrough. + +## Which variant do I pick? + +- **Single-node GLM first try** โ†’ GLM4-9B ([glm4](/models/glm/glm4)). +- **Larger dense** โ†’ GLM4-32B ([glm4](/models/glm/glm4)). +- **MoE on a budget** โ†’ GLM4.5-106B-A12B ([glm4-5](/models/glm/glm4-5)). +- **Full MoE scale (multi-node)** โ†’ GLM4.5-355B-A32B ([glm4-5](/models/glm/glm4-5)). +- **Compact MoE for routing experiments (R3)** โ†’ GLM4.7 Flash ([glm4-7-flash](/models/glm/glm4-7-flash)). +- **Frontier scale (744 B)** โ†’ GLM5 ([glm5](/models/glm/glm5)). diff --git a/docs/models/gpt-oss/gpt-oss.md b/docs/models/gpt-oss/gpt-oss.md index a07a8ed7e3..ca25816570 100644 --- a/docs/models/gpt-oss/gpt-oss.md +++ b/docs/models/gpt-oss/gpt-oss.md @@ -1,10 +1,8 @@ --- title: GPT-OSS 20B +sidebarTitle: GPT-OSS description: Two launchers โ€” Megatron BF16 (8 GPU, mbridge) and FSDP (4 GPU, dequantizes MXFP4 โ†’ BF16 first). --- - -# GPT-OSS 20B - ## 1. 
Model Introduction [GPT-OSS](https://huggingface.co/openai/gpt-oss-20b) is OpenAI's open-weight language model, designed for reasoning, agentic tasks, and developer use cases. miles supports the 20 B variant. @@ -27,7 +25,6 @@ description: Two launchers โ€” Megatron BF16 (8 GPU, mbridge) and FSDP (4 GPU, d ### 3.1 Download model + datasets ```bash -# Megatron BF16 path โ€” stage everything under /root/shared (script's hardcoded BASE_DIR) hf download openai/gpt-oss-20b --local-dir /root/shared/gpt-oss-20b hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/shared/dapo-math-17k @@ -118,5 +115,5 @@ Neither launcher writes `--save`/`--load`/`--save-interval`. ## 6. Pairs Well With -- [Backends Beyond Megatron](../../advanced/architecture-support.md) โ€” the FSDP variant. -- [Low Precision RL](../../advanced/fp8-low-precision.md) +- [Backends Beyond Megatron](/advanced/architecture-support) โ€” the FSDP variant. +- [Low Precision RL](/advanced/fp8-low-precision) diff --git a/docs/models/index.md b/docs/models/index.md index 7529335043..a81b880c1a 100644 --- a/docs/models/index.md +++ b/docs/models/index.md @@ -2,9 +2,6 @@ title: Supported Models description: Per-family recipes covering weight conversion, launch flags, and parallelism choices. --- - -# Supported Models - Miles ships ready-to-run recipes for every model family listed below. Each page covers weight conversion, parallelism, and the launch script in the order you'd actually run them. @@ -15,13 +12,13 @@ Each model name links to its recipe page. | Family | Models | |---|---| -| **DeepSeek** | [DeepSeek-V4 Pro](deepseek/deepseek-v4-pro.md)
[DeepSeek-V4 Flash](deepseek/deepseek-v4-flash.md)
[DeepSeek-R1](deepseek/deepseek.md)
[DeepSeek-V3](deepseek/deepseek.md) | -| **Qwen** | [Qwen3.6 MoE](qwen/qwen3-6-moe.md)
[Qwen3.6](qwen/qwen3-6.md)
[Qwen3.5-35B-A3B](qwen/qwen3-5-moe.md)
[Qwen3.5-4B / 9B / 27B](qwen/qwen3-5.md)
[Qwen3-Next-80B-A3B-Thinking](qwen/qwen3-next.md)
[Qwen3-30B-A3B / 235B-A22B](qwen/qwen3-moe.md)
[Qwen3-0.6B / 1.7B / 4B / 8B / 14B / 32B](qwen/qwen3.md) | -| **GLM** | [GLM-5.1](glm/glm5.md)
[GLM-5](glm/glm5.md)
[GLM-4.7-Flash](glm/glm4-7-flash.md)
[GLM-4.5](glm/glm4-5.md)
[GLM-Z1-9B-0414](glm/glm4.md) | -| **Kimi** | [Kimi-K2.6](kimi/kimi-k2.5.md)
[Kimi-K2.5](kimi/kimi-k2.5.md)
[Kimi-K2-Instruct / Thinking](kimi/kimi-k2.md)
[Moonlight-16B-A3B](kimi/moonlight.md) | -| **Nemotron** | [Nemotron-3-Super-120B-A12B-FP8](nemotron/nemotron-3-super.md)
[Nemotron-3-Nano MoE](nemotron/nemotron-3-nano-moe.md)
[Nemotron-3-Nano](nemotron/nemotron-3-nano.md) | -| **MiMo** | [MiMo-7B-RL](mimo/mimo.md) | -| **GPT-OSS** | [gpt-oss-20b](gpt-oss/gpt-oss.md) | +| **DeepSeek** | [DeepSeek-V4 Pro](/models/deepseek/deepseek-v4-pro)
[DeepSeek-V4 Flash](/models/deepseek/deepseek-v4-flash)
[DeepSeek-R1](/models/deepseek/deepseek)
[DeepSeek-V3](/models/deepseek/deepseek) | +| **Qwen** | [Qwen3.6 MoE](/models/qwen/qwen3-6-moe)
[Qwen3.6](/models/qwen/qwen3-6)
[Qwen3.5-35B-A3B](/models/qwen/qwen3-5-moe)
[Qwen3.5-4B / 9B / 27B](/models/qwen/qwen3-5)
[Qwen3-Next-80B-A3B-Thinking](/models/qwen/qwen3-next)
[Qwen3-30B-A3B / 235B-A22B](/models/qwen/qwen3-moe)
[Qwen3-0.6B / 1.7B / 4B / 8B / 14B / 32B](/models/qwen/qwen3) | +| **GLM** | [GLM-5.1](/models/glm/glm5)
[GLM-5](/models/glm/glm5)
[GLM-4.7-Flash](/models/glm/glm4-7-flash)
[GLM-4.5](/models/glm/glm4-5)
[GLM-Z1-9B-0414](/models/glm/glm4) | +| **Kimi** | [Kimi-K2.6](/models/kimi/kimi-k2.5)
[Kimi-K2.5](/models/kimi/kimi-k2.5)
[Kimi-K2-Instruct / Thinking](/models/kimi/kimi-k2)
[Moonlight-16B-A3B](/models/kimi/moonlight) | +| **Nemotron** | [Nemotron-3-Super-120B-A12B-FP8](/models/nemotron/nemotron-3-super)
[Nemotron-3-Nano MoE](/models/nemotron/nemotron-3-nano-moe)
[Nemotron-3-Nano](/models/nemotron/nemotron-3-nano) | +| **MiMo** | [MiMo-7B-RL](/models/mimo/mimo) | +| **GPT-OSS** | [gpt-oss-20b](/models/gpt-oss/gpt-oss) | ## How a recipe is structured @@ -38,4 +35,4 @@ Every recipe page follows the same six sections: Miles's plugin architecture lets you wrap a HuggingFace implementation as a Megatron module without patching Megatron core. See -[Backends Beyond Megatron](../advanced/architecture-support.md) for the workflow. +[Backends Beyond Megatron](/advanced/architecture-support) for the workflow. diff --git a/docs/models/kimi/index.md b/docs/models/kimi/index.md index 6708717a8b..819c570c9b 100644 --- a/docs/models/kimi/index.md +++ b/docs/models/kimi/index.md @@ -1,19 +1,18 @@ --- title: Kimi -description: Miles recipes for the Moonshot family โ€” Kimi K2 / K2-Thinking (1 T / 32 B-A) and Moonlight 16B-A3B. +description: Miles recipes for the Moonshot family โ€” Kimi K2.6 / K2.5 (multimodal, 1 T / 32 B-A), Kimi K2 / K2-Thinking, and Moonlight 16B-A3B. --- - -# Kimi family - -Miles supports both ends of Moonshot's MoE line: the 1 T-parameter Kimi K2 (Instruct and Thinking variants) at 32 B active per token, and the compact Moonlight 16B-A3B that fits on a single 8ร— H100 node โ€” handy as a single-node test target before scaling K2 across 16 nodes. K2-Thinking is also the canonical target for INT4 QAT. +Miles supports Moonshot's MoE line from top to bottom. The latest Kimi K2.6 and K2.5 are natively multimodal agentic models at 1 T total / 32 B active per token; the text-only Kimi K2 (Instruct and Thinking variants) runs at the same 1 T / 32 B scale; and the compact Moonlight 16B-A3B fits on a single 8ร— H100 node, a handy single-node test target before scaling K2 across many nodes. K2-Thinking is the canonical INT4 QAT target, and the K2.5 / K2.6 recipe trains an INT4 actor under the same QAT path. 
## Variants | Model | Active / Total | HF ID | Recipe | |---|---|---|---| -| Kimi-K2-Instruct | 32 B / 1 T | `moonshotai/Kimi-K2-Instruct` | [kimi-k2](kimi-k2.md) | -| Kimi-K2-Thinking | 32 B / 1 T | `moonshotai/Kimi-K2-Thinking` | [kimi-k2](kimi-k2.md) | -| Moonlight-16B-A3B | 3 B / 16 B | `moonshotai/Moonlight-16B-A3B` | [moonlight](moonlight.md) | +| Kimi-K2.6 | 32 B / 1 T | `moonshotai/Kimi-K2.6` | [kimi-k2.5](/models/kimi/kimi-k2.5) | +| Kimi-K2.5 | 32 B / 1 T | `moonshotai/Kimi-K2.5` | [kimi-k2.5](/models/kimi/kimi-k2.5) | +| Kimi-K2-Instruct | 32 B / 1 T | `moonshotai/Kimi-K2-Instruct` | [kimi-k2](/models/kimi/kimi-k2) | +| Kimi-K2-Thinking | 32 B / 1 T | `moonshotai/Kimi-K2-Thinking` | [kimi-k2](/models/kimi/kimi-k2) | +| Moonlight-16B-A3B | 3 B / 16 B | `moonshotai/Moonlight-16B-A3B` | [moonlight](/models/kimi/moonlight) | ## Fastest path to train @@ -25,10 +24,11 @@ hf download moonshotai/Moonlight-16B-A3B --local-dir /root/Moonlight-16B-A3B bash scripts/run-moonlight-16B-A3B.sh ``` -See the [Moonlight](moonlight.md) page for the full walkthrough, or [Kimi K2](kimi-k2.md) for the 16-node K2-Thinking recipe (including the one-line `model_type` patch that lets Miles treat K2 as a DeepSeek-V3-shaped architecture). +See the [Moonlight](/models/kimi/moonlight) page for the full walkthrough, or [Kimi K2](/models/kimi/kimi-k2) for the 16-node K2-Thinking recipe (including the one-line `model_type` patch that lets Miles treat K2 as a DeepSeek-V3-shaped architecture). ## Which variant do I pick? -- **Single-node MoE smoke test** โ†’ Moonlight-16B-A3B ([moonlight](moonlight.md)). -- **Frontier-scale instruction-tuned MoE** โ†’ Kimi-K2-Instruct ([kimi-k2](kimi-k2.md)). -- **Reasoning-style training, INT4 QAT target** โ†’ Kimi-K2-Thinking ([kimi-k2](kimi-k2.md)). +- **Latest multimodal agentic model** โ†’ Kimi-K2.6 or Kimi-K2.5 ([kimi-k2.5](/models/kimi/kimi-k2.5)). +- **Single-node MoE smoke test** โ†’ Moonlight-16B-A3B ([moonlight](/models/kimi/moonlight)). 
+- **Frontier-scale instruction-tuned MoE** โ†’ Kimi-K2-Instruct ([kimi-k2](/models/kimi/kimi-k2)). +- **Reasoning-style training, INT4 QAT target** โ†’ Kimi-K2-Thinking ([kimi-k2](/models/kimi/kimi-k2)). diff --git a/docs/models/kimi/kimi-k2.5.md b/docs/models/kimi/kimi-k2.5.md index 32141f8582..53fd87db3d 100644 --- a/docs/models/kimi/kimi-k2.5.md +++ b/docs/models/kimi/kimi-k2.5.md @@ -1,7 +1,194 @@ --- -title: Kimi K2.5 -description: Launch recipe for Kimi K2.5. +title: Kimi K2.5 / K2.6 +description: Launch recipe for Kimi-K2.5, running full-parameter GRPO on 32 ร— 8 H200 with an INT4 actor and a BF16 reference. --- -# Kimi K2.5 +The reference launcher is [`scripts/run-kimi-k25.sh`](https://github.com/radixark/miles/blob/main/scripts/run-kimi-k25.sh), which sources the shared model definition in `scripts/models/kimi-k2-thinking.sh`. -Placeholder for the tutorial, refer to [PR](https://github.com/radixark/miles/pull/792) \ No newline at end of file +## 1. Model Introduction + +[Kimi-K2.5](https://huggingface.co/moonshotai/Kimi-K2.5) is an open-source, natively multimodal agentic model from Moonshot AI. It is built by continual pretraining on roughly 15 T mixed vision and text tokens on top of Kimi-K2-Base, and it pairs a Mixture-of-Experts (MoE) language backbone with a MoonViT vision encoder so a single model handles both image and text inputs. K2.5 keeps the 1 T-total / 32 B-active shape of the K2 family and extends the context window to 256K tokens. 
+
+**Architecture at a glance** (from the model card):
+
+| Specification | Value |
+|---|---|
+| Architecture | Mixture-of-Experts (MoE) |
+| Total / activated parameters | 1 T / 32 B |
+| Layers (1 dense + 60 MoE) | 61 |
+| Attention hidden dimension | 7168 |
+| MoE hidden dimension (per expert) | 2048 |
+| Attention heads | 64 |
+| Routed experts (selected per token) | 384 (top-8) |
+| Shared experts | 1 |
+| Attention mechanism | Multi-head Latent Attention (MLA) |
+| Activation | SwiGLU |
+| Vocabulary size | 160K |
+| Context length | 256K |
+| Vision encoder | MoonViT (400 M) |
+
+**Key features:**
+
+- **Native multimodality.** K2.5 is pretrained on both vision and language tokens, so it covers visual knowledge, cross-modal reasoning, and tool use grounded in images alongside text.
+- **Coding with vision.** It generates code from visual specifications such as UI designs and video workflows, and drives tools for visual data processing.
+- **Agent swarm.** It decomposes a complex task into parallel sub-tasks run by dynamically instantiated, domain-specific agents, rather than scaling a single agent.
+
+## 2. Supported Variants
+
+The K2.5 launcher expects two checkpoints under `$BASE_DIR`: an INT4 actor checkpoint and a BF16 reference checkpoint.
+
+| Role | Checkpoint | Loaded with |
+|---|---|---|
+| Actor (trained) | `$BASE_DIR/Kimi-K2.5-int4` | `--hf-checkpoint` |
+| Reference | `$BASE_DIR/Kimi-K2.5-bf16` | `--ref-load` |
+
+Both share the K2 family's 1 T-total / 32 B-active MoE + MLA shape inherited from Kimi-K2-Thinking.
+
+## 3. Quick start
+
+### 3.1 Prerequisites
+
+The launcher references two environment variables but never sets them, so you should export them yourself before launch:
+
+```bash
+export BASE_DIR=
+export MASTER_ADDR=
+```
+
+The `$BASE_DIR` directory must already hold the two K2.5 checkpoints from §2 alongside the DAPO-Math-17k training set (`dapo-math-17k/dapo-math-17k.jsonl`) and the AIME-2024 eval set (`aime-2024.jsonl`).
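Because the launcher never sets these variables itself, a small preflight check can catch a missing export or file before a multi-node job is submitted. The sketch below is a hypothetical helper, not part of Miles. It assumes only the two variable names and the four relative paths listed in this section.

```python
import os
from pathlib import Path

# Names and relative paths taken from the prerequisites described above.
REQUIRED_ENV = ("BASE_DIR", "MASTER_ADDR")
REQUIRED_PATHS = (
    "Kimi-K2.5-int4",                      # INT4 actor checkpoint
    "Kimi-K2.5-bf16",                      # BF16 reference checkpoint
    "dapo-math-17k/dapo-math-17k.jsonl",   # training prompts
    "aime-2024.jsonl",                     # eval prompts
)


def missing_prerequisites(env=None):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = [
        f"env var {name} is not set"
        for name in REQUIRED_ENV
        if not env.get(name)
    ]
    base = env.get("BASE_DIR")
    if base:
        # Only check the directory layout once BASE_DIR is known.
        for rel in REQUIRED_PATHS:
            path = Path(base) / rel
            if not path.exists():
                problems.append(f"missing {path}")
    return problems
```

Calling `missing_prerequisites()` with no arguments reads the real environment. It only checks that the paths exist, not that the checkpoints are complete or in the expected format.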
+
+### 3.2 One-line launch
+
+The script submits to an **already-running Ray cluster** (`ray job submit --address http://127.0.0.1:8265`); it does not run `ray start --head` itself. It also runs a `pkill` / `ray stop` cleanup pass at the top so a failed run can be re-launched cleanly.
+
+```bash
+cd /root/miles
+export BASE_DIR=...; export MASTER_ADDR=...
+bash scripts/run-kimi-k25.sh
+```
+
+### 3.3 Multi-node fan-out
+
+Bring up Ray on every node before launching, the same way as the other Kimi recipes:
+
+```bash
+# head
+ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats
+# each worker
+ray start --address=${MASTER_ADDR}:6379 --num-gpus 8 --node-ip-address ${WORKER_IP}
+```
+
+## 4. Script breakdown
+
+The launcher groups its flags into the arrays that are passed to `train.py`. The model shape comes from `MODEL_ARGS`, which is sourced from `scripts/models/kimi-k2-thinking.sh`. That definition sets the MLA latent ranks (`q_lora_rank=1536`, `kv_lora_rank=512`, `qk_head_dim=128`, `qk_pos_emb_head_dim=64`, `v_head_dim=128`), the MoE routing (384 experts, top-8, sigmoid pre-softmax scoring, FP32 router, `--moe-router-topk-scaling-factor 2.827`), and RoPE (`--rotary-base 50000`, `--rotary-scaling-factor 64.0`). The K2.5 recipe then layers the following on top:
+
+- **`CKPT_ARGS`** wires up the dual checkpoint (INT4 actor via `--hf-checkpoint`, BF16 reference via `--ref-load`) together with `--megatron-to-hf-mode bridge` and `--model-name kimi_k25`.
+- **`ROLLOUT_ARGS`** and **`EVAL_ARGS`** configure GRPO sampling and periodic AIME evaluation (covered in §5.2).
+- **`PERF_ARGS`** sets the parallelism layout and recomputation (§5.1).
+- **`GRPO_ARGS`** and **`OPTIMIZER_ARGS`** set the algorithm and CPU-offloaded Adam (§5.2, §5.4).
+- **`SGLANG_ARGS`** configures the colocated rollout engine (§5.3).
+
+The job runs colocated (`--colocate`) across 32 nodes (`--actor-num-nodes 32 --actor-num-gpus-per-node 8`) and uses the miles router (`--use-miles-router`) with `--update-weight-buffer-size $((4*512*1024*1024))`.
+
+## 5. Example Recipe Configuration
+
+### 5.1 Megatron Parallelism
+
+This is the validated layout shipped with the launcher. All parallelisms are supported, so you can supply any other TP / EP / PP / CP combination that fits your compute.
+
+| Hardware | Nodes × GPUs | TP | PP | CP | EP | expert-TP | `--decoder-last-pipeline-num-layers` | `--max-tokens-per-gpu` |
+|---|---|---|---|---|---|---|---|---|
+| H200 | 32 × 8 = 256 | 8 | 8 | 4 | 32 | 1 | 5 | 4096 |
+
+Sequence parallelism (`--sequence-parallel`) is on, and the trainer uses dynamic batching (`--use-dynamic-batch-size`) capped at `--max-tokens-per-gpu 4096`. Recomputation is full and uniform over a single layer:
+
+```bash
+--recompute-granularity full
+--recompute-method uniform
+--recompute-num-layers 1
+```
+
+### 5.2 Algorithm
+
+The recipe uses GRPO with KL and entropy losses disabled:
+
+```bash
+--advantage-estimator grpo
+--eps-clip 0.2 --eps-clip-high 0.28
+--kl-loss-coef 0.00 --kl-loss-type low_var_kl
+--entropy-coef 0.00
+```
+
+Rollouts draw from DAPO-Math-17k and score with the `deepscaler` reward:
+
+```bash
+--prompt-data $BASE_DIR/dapo-math-17k/dapo-math-17k.jsonl
+--input-key prompt --label-key label
+--apply-chat-template
+--rollout-shuffle --balance-data
+--rm-type deepscaler
+
+--num-rollout 20
+--rollout-batch-size 32
+--n-samples-per-prompt 8
+--rollout-max-response-len 16384
+--rollout-temperature 1
+
+--global-batch-size 256
+--filter-zero-reward-samples
+--use-dynamic-global-batch-size
+```
+
+Evaluation runs every 20 steps against AIME-2024, sampling 16 responses per prompt:
+
+```bash
+--eval-interval 20
+--eval-prompt-data aime $BASE_DIR/aime-2024.jsonl
+--n-samples-per-eval-prompt 16
+--eval-max-response-len 16384
+--eval-top-p 1
+```
+
+### 5.3 Rollout & SGLang
+
+The rollout engine is colocated with training, spanning 8 GPUs per engine with 8-way expert parallelism:
+
+```bash
+SGLANG_ARGS=(
+   --rollout-num-gpus-per-engine 8
+   --sglang-mem-fraction-static 0.7
+   --sglang-ep-size 8
+   --sglang-server-concurrency 1024
+   --sglang-cuda-graph-bs 1 2 4 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128
+   --use-rollout-routing-replay
+)
+```
+
+The `--use-rollout-routing-replay` flag replays the rollout-time MoE routing decisions during training so the two stages stay consistent. On the Megatron side, attention uses the Flash backend (`--attention-backend flash`).
+
+The launcher sets the required env vars for you, including the INT4 QAT pair (`OPEN_TRAINING_INT4_FAKE_QAT_FLAG=1`, `OPEN_TRAINING_INT4_GROUP_SIZE=32`), a long NCCL timeout (`NCCL_TIMEOUT=3600`), `CUDA_DEVICE_MAX_CONNECTIONS=1`, and NVLink-gated NVLS (`NCCL_NVLS_ENABLE` follows the script's NVLink autodetection).
+
+### 5.4 Optimizer
+
+CPU-offloaded Adam is combined with the distributed optimizer:
+
+```bash
+--optimizer adam
+--lr 1e-6 --lr-decay-style constant
+--weight-decay 0.1
+--adam-beta1 0.9 --adam-beta2 0.98
+
+--optimizer-cpu-offload
+--overlap-cpu-optimizer-d2h-h2d
+--use-precision-aware-optimizer
+--use-distributed-optimizer
+```
+
+Adam states live on host RAM and are D2H/H2D-overlapped with the backward pass, freeing GPU memory for the 1 T-parameter weight footprint. Gradients accumulate and all-reduce in FP32 (`--accumulate-allreduce-grads-in-fp32`), and the attention softmax also runs in FP32 (`--attention-softmax-in-fp32`).
+
+## 6. Pairs Well With
+
+- [INT4 QAT](/advanced/int4-qat)
+- [PD Disaggregation](/advanced/pd-disaggregation)
+- [P2P Weight Transfer](/advanced/p2p-weight-transfer)
+- [Fault Tolerance](/advanced/fault-tolerance)
+- [Kimi K2](/models/kimi/kimi-k2): sibling recipe; K2.5 reuses the K2-Thinking architecture.
diff --git a/docs/models/kimi/kimi-k2.md b/docs/models/kimi/kimi-k2.md index 7b7c8dfd34..7bb410b834 100644 --- a/docs/models/kimi/kimi-k2.md +++ b/docs/models/kimi/kimi-k2.md @@ -2,9 +2,6 @@ title: Kimi K2 description: Launch recipes for Kimi-K2-Instruct and Kimi-K2-Thinking โ€” 32 nodes ร— 8 GPU. --- - -# Kimi K2 - ## 1. Model Introduction [Kimi-K2](https://moonshotai.github.io/Kimi-K2/) is a state-of-the-art MoE language model from Moonshot AI with 32 B activated parameters and 1 T total parameters. @@ -38,7 +35,6 @@ Both are referenced but never set inside the scripts โ€” export them yourself be ```bash hf download moonshotai/Kimi-K2-Instruct --local-dir $BASE_DIR/Kimi-K2-Instruct -# or, for Thinking (FP8 by default): hf download moonshotai/Kimi-K2-Thinking --local-dir $BASE_DIR/Kimi-K2-Thinking-fp8 hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir $BASE_DIR/dapo-math-17k @@ -168,7 +164,7 @@ CPU Adam is enabled in both: ## 6. Pairs Well With -- [PD Disaggregation](../../advanced/pd-disaggregation.md) -- [P2P Weight Transfer](../../advanced/p2p-weight-transfer.md) -- [Fault Tolerance](../../advanced/fault-tolerance.md) -- [INT4 QAT](../../advanced/int4-qat.md) +- [PD Disaggregation](/advanced/pd-disaggregation) +- [P2P Weight Transfer](/advanced/p2p-weight-transfer) +- [Fault Tolerance](/advanced/fault-tolerance) +- [INT4 QAT](/advanced/int4-qat) diff --git a/docs/models/kimi/moonlight.md b/docs/models/kimi/moonlight.md index e4968b7f97..5f1e781ba5 100644 --- a/docs/models/kimi/moonlight.md +++ b/docs/models/kimi/moonlight.md @@ -2,9 +2,6 @@ title: Moonlight description: Single-node MoE recipe (8 GPU) โ€” DAPO-style dynamic sampling and CPU Adam on by default. --- - -# Moonlight - ## 1. 
Model Introduction [Moonlight](https://huggingface.co/moonshotai/Moonlight-16B-A3B) is Moonshot AI's compact MoE โ€” 16 B total / 3 B active, trained with the Muon optimizer โ€” and a useful single-node test target for MoE RL code changes before scaling to Kimi K2. @@ -110,5 +107,5 @@ CPU Adam on: ## 6. Pairs Well With -- [Rollout Routing Replay (R3)](../../advanced/miles-router.md) -- [Low Precision RL](../../advanced/fp8-low-precision.md) +- [Rollout Routing Replay (R3)](/advanced/miles-router) +- [Low Precision RL](/advanced/fp8-low-precision) diff --git a/docs/models/mimo/mimo.md b/docs/models/mimo/mimo.md index ee72bf1cea..bec3687702 100644 --- a/docs/models/mimo/mimo.md +++ b/docs/models/mimo/mimo.md @@ -2,9 +2,6 @@ title: MiMo description: Single-node GRPO + EAGLE speculative recipe with online MTP training. --- - -# MiMo 7B - ## 1. Model Introduction [MiMo-7B-RL](https://huggingface.co/XiaomiMiMo/MiMo-7B-RL) is Xiaomi's dense reasoning RL model with a built-in MTP (Multi-Token Prediction) layer. @@ -120,4 +117,4 @@ CPU Adam is **not** enabled. ## 6. Pairs Well With -- [Speculative Decoding](../../advanced/speculative-decoding.md) +- [Speculative Decoding](/advanced/speculative-decoding) diff --git a/docs/models/nemotron/index.md b/docs/models/nemotron/index.md index 23370848b1..82e64ff6d8 100644 --- a/docs/models/nemotron/index.md +++ b/docs/models/nemotron/index.md @@ -2,18 +2,15 @@ title: Nemotron description: Miles recipes for NVIDIA's Nemotron-3 family โ€” Mamba+Attention(+MoE) hybrids loaded via Megatron AutoBridge. --- - -# Nemotron family - Miles supports NVIDIA's Nemotron-3 line: a Mamba + Attention hybrid that, in the Super tier, adds MoE and ships natively in FP8. All three variants load via the Megatron AutoBridge path, so there is no offline HF โ†’ `torch_dist` conversion step. 
## Variants | Model | Active / Total | HF ID | Recipe | |---|---|---|---| -| Nemotron-3-Nano | 4 B / 4 B (dense) | `nvidia/Nemotron-3-Nano-4B` | [nemotron-3-nano](nemotron-3-nano.md) | -| Nemotron-3-Nano MoE | 3 B / 30 B | `nvidia/Nemotron-3-Nano-30B-A3B` | [nemotron-3-nano-moe](nemotron-3-nano-moe.md) | -| Nemotron-3-Super | 12 B / 120 B (FP8) | `nvidia/Nemotron-3-Super-120B-A12B-FP8` | [nemotron-3-super](nemotron-3-super.md) | +| Nemotron-3-Nano | 4 B / 4 B (dense) | `nvidia/Nemotron-3-Nano-4B` | [nemotron-3-nano](/models/nemotron/nemotron-3-nano) | +| Nemotron-3-Nano MoE | 3 B / 30 B | `nvidia/Nemotron-3-Nano-30B-A3B` | [nemotron-3-nano-moe](/models/nemotron/nemotron-3-nano-moe) | +| Nemotron-3-Super | 12 B / 120 B (FP8) | `nvidia/Nemotron-3-Super-120B-A12B-FP8` | [nemotron-3-super](/models/nemotron/nemotron-3-super) | ## Fastest path to train @@ -24,15 +21,15 @@ cd /root/miles bash scripts/run-nemotron-3-nano.sh ``` -See the [Nemotron-3-Nano](nemotron-3-nano.md) page for the dense walkthrough, [Nemotron-3-Nano MoE](nemotron-3-nano-moe.md) for the 30 B MoE variant, and [Nemotron-3-Super](nemotron-3-super.md) for the FP8-native 120 B-A12B recipe. +See the [Nemotron-3-Nano](/models/nemotron/nemotron-3-nano) page for the dense walkthrough, [Nemotron-3-Nano MoE](/models/nemotron/nemotron-3-nano-moe) for the 30 B MoE variant, and [Nemotron-3-Super](/models/nemotron/nemotron-3-super) for the FP8-native 120 B-A12B recipe. ## Which variant do I pick? -- **Smallest, single-node smoke test** โ†’ Nemotron-3-Nano ([nemotron-3-nano](nemotron-3-nano.md)). -- **Mid-scale hybrid MoE** โ†’ Nemotron-3-Nano MoE ([nemotron-3-nano-moe](nemotron-3-nano-moe.md)). -- **Frontier-scale FP8-native MoE** โ†’ Nemotron-3-Super ([nemotron-3-super](nemotron-3-super.md)). +- **Smallest, single-node smoke test** โ†’ Nemotron-3-Nano ([nemotron-3-nano](/models/nemotron/nemotron-3-nano)). 
+- **Mid-scale hybrid MoE** โ†’ Nemotron-3-Nano MoE ([nemotron-3-nano-moe](/models/nemotron/nemotron-3-nano-moe)). +- **Frontier-scale FP8-native MoE** โ†’ Nemotron-3-Super ([nemotron-3-super](/models/nemotron/nemotron-3-super)). ## Pairs well with -- [Backends Beyond Megatron](../../advanced/architecture-support.md) โ€” the AutoBridge path Nemotron rides on. -- [Low Precision RL](../../advanced/fp8-low-precision.md) โ€” Super ships natively in FP8. +- [Backends Beyond Megatron](/advanced/architecture-support) โ€” the AutoBridge path Nemotron rides on. +- [Low Precision RL](/advanced/fp8-low-precision) โ€” Super ships natively in FP8. diff --git a/docs/models/nemotron/nemotron-3-nano-moe.md b/docs/models/nemotron/nemotron-3-nano-moe.md index 489d8213d7..a260c2c722 100644 --- a/docs/models/nemotron/nemotron-3-nano-moe.md +++ b/docs/models/nemotron/nemotron-3-nano-moe.md @@ -2,9 +2,6 @@ title: Nemotron-3-Nano MoE description: Launch recipe for NVIDIA Nemotron-3-Nano-30B-A3B (Mamba+Attention+MoE hybrid) via Megatron AutoBridge. --- - -# Nemotron-3-Nano MoE - ## 1. Model Introduction [NVIDIA Nemotron-3-Nano-30B-A3B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16) @@ -140,11 +137,11 @@ From `scripts/models/nemotron-3-nano-30b-a3b.sh` and `scripts/run-nemotron-3-nan - `--position-embedding-type none`, `--vocab-size 131072 --make-vocab-size-divisible-by 128`. - `--attention-backend auto` (Mamba layers select their own kernel). -See [Backends Beyond Megatron](../../advanced/architecture-support.md) for how the bridge +See [Backends Beyond Megatron](/advanced/architecture-support) for how the bridge shim layers `routed_scaling_factor` / `n_group` / `topk_group` onto the Megatron provider. ## 6. 
Pairs Well With -- [Backends Beyond Megatron](../../advanced/architecture-support.md) -- [P2P Weight Transfer](../../advanced/p2p-weight-transfer.md) -- [FP8 & Low Precision](../../advanced/fp8-low-precision.md) +- [Backends Beyond Megatron](/advanced/architecture-support) +- [P2P Weight Transfer](/advanced/p2p-weight-transfer) +- [FP8 & Low Precision](/advanced/fp8-low-precision) diff --git a/docs/models/nemotron/nemotron-3-nano.md b/docs/models/nemotron/nemotron-3-nano.md index c8337e9d42..f45af01253 100644 --- a/docs/models/nemotron/nemotron-3-nano.md +++ b/docs/models/nemotron/nemotron-3-nano.md @@ -2,9 +2,6 @@ title: Nemotron-3-Nano description: Launch recipe for the dense NVIDIA Nemotron-3-Nano-4B (Mamba+Attention hybrid) via Megatron AutoBridge. --- - -# Nemotron-3-Nano - ## 1. Model Introduction [NVIDIA Nemotron-3-Nano-4B-BF16](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16) @@ -123,9 +120,9 @@ From `scripts/models/nemotron-3-nano-4b.sh` and `scripts/run-nemotron-3-nano-4b. - `--attention-backend auto` (the Mamba layers select their own kernel; flash-only is not safe here). - Bridge load is required for hybrid `nemotron_h`: the AutoBridge wires `mamba_num_heads`, `mamba_state_dim`, `hybrid_override_pattern`. PP additionally needs miles' PP-unwrap shim (already on the `feat/nemotron-gemma4-rl` branch). -See [Backends Beyond Megatron](../../advanced/architecture-support.md) for the AutoBridge wiring. +See [Backends Beyond Megatron](/advanced/architecture-support) for the AutoBridge wiring. ## 6. 
Pairs Well With -- [Backends Beyond Megatron](../../advanced/architecture-support.md) -- [P2P Weight Transfer](../../advanced/p2p-weight-transfer.md) +- [Backends Beyond Megatron](/advanced/architecture-support) +- [P2P Weight Transfer](/advanced/p2p-weight-transfer) diff --git a/docs/models/nemotron/nemotron-3-super.md b/docs/models/nemotron/nemotron-3-super.md index bf76d880f2..365af85be1 100644 --- a/docs/models/nemotron/nemotron-3-super.md +++ b/docs/models/nemotron/nemotron-3-super.md @@ -2,9 +2,6 @@ title: Nemotron-3-Super description: Launch recipe for NVIDIA Nemotron-3-Super-120B-A12B-FP8 (Mamba+Attention+MoE hybrid, FP8 native) via Megatron AutoBridge. --- - -# Nemotron-3-Super - ## 1. Model Introduction [NVIDIA Nemotron-3-Super-120B-A12B-FP8](https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8) @@ -146,13 +143,13 @@ this scale. --make-vocab-size-divisible-by 128`. - `--attention-backend auto` (Mamba layers select their own kernel). -See [Backends Beyond Megatron](../../advanced/architecture-support.md) for how +See [Backends Beyond Megatron](/advanced/architecture-support) for how the bridge shim layers `routed_scaling_factor` / `n_group` / `topk_group` onto -the Megatron provider, and [FP8 & Low Precision](../../advanced/fp8-low-precision.md) +the Megatron provider, and [FP8 & Low Precision](/advanced/fp8-low-precision) for the FP8 weight format. ## 6. 
Pairs Well With -- [Backends Beyond Megatron](../../advanced/architecture-support.md) -- [FP8 & Low Precision](../../advanced/fp8-low-precision.md) -- [P2P Weight Transfer](../../advanced/p2p-weight-transfer.md) +- [Backends Beyond Megatron](/advanced/architecture-support) +- [FP8 & Low Precision](/advanced/fp8-low-precision) +- [P2P Weight Transfer](/advanced/p2p-weight-transfer) diff --git a/docs/models/qwen/index.md b/docs/models/qwen/index.md index 00e1e98524..6ee30e18d5 100644 --- a/docs/models/qwen/index.md +++ b/docs/models/qwen/index.md @@ -2,20 +2,17 @@ title: Qwen description: Miles recipes for the full Qwen3, Qwen3.5, and Qwen3-Next line โ€” dense and MoE. --- - -# Qwen family - Miles ships ready-to-run RL recipes for every generation of the Qwen line: the dense Qwen3 series (0.6 B โ†’ 32 B), the Qwen3.5 family with its gated-attention architecture, the Qwen3 and Qwen3.5 MoE variants, and the Gated-Delta-Net Qwen3-Next-80B-A3B. ## Variants | Family | Class | Sizes | Recipe | |---|---|---|---| -| Qwen3 | Dense | 0.6 B ยท 1.7 B ยท 4 B ยท 8 B ยท 14 B ยท 32 B | [qwen3](qwen3.md) | -| Qwen3 | MoE | 3 B / 30 B ยท 22 B / 235 B | [qwen3-moe](qwen3-moe.md) | -| Qwen3.5 | Dense | 4 B ยท 9 B ยท 27 B | [qwen3-5](qwen3-5.md) | -| Qwen3.5 | MoE | 3 B / 35 B | [qwen3-5-moe](qwen3-5-moe.md) | -| Qwen3-Next | MoE (GDN) | 3 B / 80 B | [qwen3-next](qwen3-next.md) | +| Qwen3 | Dense | 0.6 B ยท 1.7 B ยท 4 B ยท 8 B ยท 14 B ยท 32 B | [qwen3](/models/qwen/qwen3) | +| Qwen3 | MoE | 3 B / 30 B ยท 22 B / 235 B | [qwen3-moe](/models/qwen/qwen3-moe) | +| Qwen3.5 | Dense | 4 B ยท 9 B ยท 27 B | [qwen3-5](/models/qwen/qwen3-5) | +| Qwen3.5 | MoE | 3 B / 35 B | [qwen3-5-moe](/models/qwen/qwen3-5-moe) | +| Qwen3-Next | MoE (GDN) | 3 B / 80 B | [qwen3-next](/models/qwen/qwen3-next) | ## Fastest path to train @@ -27,13 +24,13 @@ hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B bash scripts/run-qwen3-4B.sh ``` -Dataset is 
[DAPO-Math-17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17K) at `/root/dapo-math-17k/dapo-math-17k.jsonl`. See the [Qwen3 Dense](qwen3.md) page for the full walkthrough, weight conversion, and variants. +Dataset is [DAPO-Math-17k](https://huggingface.co/datasets/BytedTsinghua-SIA/DAPO-Math-17K) at `/root/dapo-math-17k/dapo-math-17k.jsonl`. See the [Qwen3 Dense](/models/qwen/qwen3) page for the full walkthrough, weight conversion, and variants. ## Which variant do I pick? -- **Learning Miles for the first time** โ†’ Qwen3-4B ([qwen3](qwen3.md)). Fits one H100 node, fast loop. -- **Need MoE on a single node** โ†’ Qwen3-30B-A3B ([qwen3-moe](qwen3-moe.md)). -- **Scaling to multi-node** โ†’ Qwen3-235B-A22B ([qwen3-moe](qwen3-moe.md)). -- **Latest dense architecture (gated attention, A\_log FP32)** โ†’ Qwen3.5-4B ([qwen3-5](qwen3-5.md)). -- **Hybrid MTP / speculative decoding experiments** โ†’ Qwen3.5-35B-A3B ([qwen3-5-moe](qwen3-5-moe.md)). -- **Gated-Delta-Net (fla backend, real-CP)** โ†’ Qwen3-Next-80B-A3B ([qwen3-next](qwen3-next.md)). +- **Learning Miles for the first time** โ†’ Qwen3-4B ([qwen3](/models/qwen/qwen3)). Fits one H100 node, fast loop. +- **Need MoE on a single node** โ†’ Qwen3-30B-A3B ([qwen3-moe](/models/qwen/qwen3-moe)). +- **Scaling to multi-node** โ†’ Qwen3-235B-A22B ([qwen3-moe](/models/qwen/qwen3-moe)). +- **Latest dense architecture (gated attention, A\_log FP32)** โ†’ Qwen3.5-4B ([qwen3-5](/models/qwen/qwen3-5)). +- **Hybrid MTP / speculative decoding experiments** โ†’ Qwen3.5-35B-A3B ([qwen3-5-moe](/models/qwen/qwen3-5-moe)). +- **Gated-Delta-Net (fla backend, real-CP)** โ†’ Qwen3-Next-80B-A3B ([qwen3-next](/models/qwen/qwen3-next)). 
diff --git a/docs/models/qwen/qwen3-5-moe.md b/docs/models/qwen/qwen3-5-moe.md index a62c07fbd2..0c0fd7092b 100644 --- a/docs/models/qwen/qwen3-5-moe.md +++ b/docs/models/qwen/qwen3-5-moe.md @@ -2,9 +2,6 @@ title: Qwen3.5 MoE description: Launch recipe for Qwen3.5-35B-A3B with MTP training and EAGLE speculative rollout. --- - -# Qwen3.5 MoE - ## 1. Model Introduction [Qwen3.5-35B-A3B](https://github.com/QwenLM/Qwen3) is the MoE branch of the Qwen3.5 line โ€” 3 B active / 35 B total โ€” combining the gated-attention architecture with a built-in MTP head. @@ -29,7 +26,6 @@ description: Launch recipe for Qwen3.5-35B-A3B with MTP training and EAGLE specu ```bash hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k hf download --repo-type dataset zhuzilin/aime-2024 --local-dir /root/aime-2024 -# Place the model checkpoint at /root/Qwen3.5-35B-A3B ``` ### 3.2 HF โ†’ Megatron `torch_dist` conversion @@ -102,9 +98,9 @@ CPU Adam is enabled (`--optimizer-cpu-offload --overlap-cpu-optimizer-d2h-h2d -- ### 5.5 Notable quirks - The Megatron side uses `--moe-token-dispatcher-type flex`; DeepEP isn't enabled here, unlike Qwen3-Next. -- The model config (`scripts/models/qwen3.5-35B-A3B.sh`) reuses the Qwen3.5 spec: `--attention-output-gate`, `--rotary-base 10000000`, `--rotary-percent 0.25`, `A_log` kept in FP32 via the bridge. See [Backends Beyond Megatron](../../advanced/architecture-support.md). +- The model config (`scripts/models/qwen3.5-35B-A3B.sh`) reuses the Qwen3.5 spec: `--attention-output-gate`, `--rotary-base 10000000`, `--rotary-percent 0.25`, `A_log` kept in FP32 via the bridge. See [Backends Beyond Megatron](/advanced/architecture-support). ## 6. 
Pairs Well With -- [Speculative Decoding](../../advanced/speculative-decoding.md) -- [Backends Beyond Megatron](../../advanced/architecture-support.md) +- [Speculative Decoding](/advanced/speculative-decoding) +- [Backends Beyond Megatron](/advanced/architecture-support) diff --git a/docs/models/qwen/qwen3-5.md b/docs/models/qwen/qwen3-5.md index 793cd6995d..8169df811d 100644 --- a/docs/models/qwen/qwen3-5.md +++ b/docs/models/qwen/qwen3-5.md @@ -2,16 +2,13 @@ title: Qwen3.5 description: Launch recipes for Qwen3.5-4B / 9B / 27B with attention-output-gate. --- - -# Qwen3.5 - ## 1. Model Introduction [Qwen3.5](https://github.com/QwenLM/Qwen3) is the next iteration of the Qwen3 dense series, introducing the gated-attention architecture and an FP32-preserved `A_log` parameter. **Key highlights:** -- **Attention-output gate**: a learned gate on the attention output, trained alongside attention weights for stronger long-context behaviour. +- **Attention-output gate**: a learned gate on the attention output, trained alongside attention weights for stronger long-context behavior. - **Extended rotary base**: `--rotary-base 10000000`, `--rotary-percent 0.25` โ€” wider effective context than the original Qwen3. - **Larger vocabulary**: 248320 tokens. - **FP32 `A_log` preservation**: a parameter that must stay in FP32 through Megatron's mixed-precision pipeline; miles handles this via the bridge. @@ -31,7 +28,6 @@ description: Launch recipes for Qwen3.5-4B / 9B / 27B with attention-output-gate ```bash hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k hf download --repo-type dataset zhuzilin/aime-2024 --local-dir /root/aime-2024 -# Place the model checkpoint at /root/Qwen3.5-{4B,9B,27B} ``` ### 3.2 HF โ†’ Megatron `torch_dist` conversion @@ -104,9 +100,9 @@ From `scripts/models/qwen3.5-4B.sh` (and analogous configs for 9 B / 27 B): - `--apply-layernorm-1p`, `--qk-layernorm`, `--group-query-attention`. - `--attention-output-gate`. 
-See [Backends Beyond Megatron](../../advanced/architecture-support.md) for how miles preserves FP32 parameters like `A_log` through Megatron's mixed-precision pipeline. +See [Backends Beyond Megatron](/advanced/architecture-support) for how miles preserves FP32 parameters like `A_log` through Megatron's mixed-precision pipeline. ## 6. Pairs Well With -- [Backends Beyond Megatron](../../advanced/architecture-support.md) -- [Low Precision RL](../../advanced/fp8-low-precision.md) +- [Backends Beyond Megatron](/advanced/architecture-support) +- [Low Precision RL](/advanced/fp8-low-precision) diff --git a/docs/models/qwen/qwen3-6-moe.md b/docs/models/qwen/qwen3-6-moe.md index 43d3430343..1e0b21c87d 100644 --- a/docs/models/qwen/qwen3-6-moe.md +++ b/docs/models/qwen/qwen3-6-moe.md @@ -2,9 +2,6 @@ title: Qwen3.6 MoE description: Launch recipe for Qwen3.6-35B-A3B with MTP training and EAGLE speculative rollout. --- - -# Qwen3.6 MoE - ## 1. Model Introduction [Qwen3.6-35B-A3B](https://github.com/QwenLM/Qwen3) is the sparse MoE branch of @@ -148,10 +145,10 @@ From `scripts/models/qwen3.6-35B-A3B.sh` and `scripts/run_qwen3_6_35b_a3b_mtp.py - `--moe-grouped-gemm`, `--moe-token-drop-policy probs`, `--moe-router-dtype fp32`, `--moe-permute-fusion`, `--moe-aux-loss-coeff 0`. - `--attention-output-gate`, `--rotary-base 10000000`, `--rotary-percent 0.25`, `--vocab-size 248320`. -See [Backends Beyond Megatron](../../advanced/architecture-support.md) for FP32 parameter handling and how miles wires the spec. +See [Backends Beyond Megatron](/advanced/architecture-support) for FP32 parameter handling and how miles wires the spec. ## 6. 
Pairs Well With -- [Speculative Decoding](../../advanced/speculative-decoding.md) -- [Backends Beyond Megatron](../../advanced/architecture-support.md) -- [P2P Weight Transfer](../../advanced/p2p-weight-transfer.md) +- [Speculative Decoding](/advanced/speculative-decoding) +- [Backends Beyond Megatron](/advanced/architecture-support) +- [P2P Weight Transfer](/advanced/p2p-weight-transfer) diff --git a/docs/models/qwen/qwen3-6.md b/docs/models/qwen/qwen3-6.md index 0a460966ec..adac7cbba7 100644 --- a/docs/models/qwen/qwen3-6.md +++ b/docs/models/qwen/qwen3-6.md @@ -2,9 +2,6 @@ title: Qwen3.6 description: Launch recipe for the dense Qwen3.6-27B with attention-output-gate. --- - -# Qwen3.6 - ## 1. Model Introduction [Qwen3.6](https://github.com/QwenLM/Qwen3) is the next iteration of Alibaba's @@ -42,7 +39,6 @@ wider, deeper Qwen3.5 with the gated-attention design preserved. ```bash hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k hf download --repo-type dataset zhuzilin/aime-2024 --local-dir /root/aime-2024 -# Place the model checkpoint at /root/Qwen3.6-27B ``` ### 3.2 HF โ†’ Megatron `torch_dist` conversion @@ -121,10 +117,10 @@ From `scripts/models/qwen3.6-27B.sh`: - `--apply-layernorm-1p`, `--qk-layernorm`, `--group-query-attention`. - `--attention-output-gate`. -See [Backends Beyond Megatron](../../advanced/architecture-support.md) for how miles +See [Backends Beyond Megatron](/advanced/architecture-support) for how miles preserves FP32 parameters like `A_log` through Megatron's mixed-precision pipeline. ## 6. 
Pairs Well With -- [Backends Beyond Megatron](../../advanced/architecture-support.md) -- [FP8 & Low Precision](../../advanced/fp8-low-precision.md) +- [Backends Beyond Megatron](/advanced/architecture-support) +- [FP8 & Low Precision](/advanced/fp8-low-precision) diff --git a/docs/models/qwen/qwen3-moe.md b/docs/models/qwen/qwen3-moe.md index 1af4c86c47..4c8f92c2b4 100644 --- a/docs/models/qwen/qwen3-moe.md +++ b/docs/models/qwen/qwen3-moe.md @@ -2,9 +2,6 @@ title: Qwen3 MoE description: Launch recipes for Qwen3-30B-A3B (single node) and Qwen3-235B-A22B (multi-node). --- - -# Qwen3 MoE - ## 1. Model Introduction [Qwen3 MoE](https://github.com/QwenLM/Qwen3) is the Mixture-of-Experts branch of the Qwen3 series, available in two sizes: 30 B-A3B (single-node) and 235 B-A22B (multi-node). @@ -39,7 +36,6 @@ The 30 B Python launcher reads no env vars โ€” pass options via the Typer CLI. ### 3.2 Download model + datasets ```bash -# 30 B (single node) hf download Qwen/Qwen3-30B-A3B --local-dir /root/models/Qwen3-30B-A3B hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/datasets/dapo-math-17k hf download --repo-type dataset zhuzilin/aime-2024 --local-dir /root/datasets/aime-2024 @@ -140,5 +136,5 @@ Both `run_qwen3_30b_a3b.py` (H100, 1 node) and `run-qwen3-235B-A22B.sh` enable C ## 6. Pairs Well With -- [Low Precision RL](../../advanced/fp8-low-precision.md) -- [Rollout Routing Replay (R3)](../../advanced/miles-router.md) +- [Low Precision RL](/advanced/fp8-low-precision) +- [Rollout Routing Replay (R3)](/advanced/miles-router) diff --git a/docs/models/qwen/qwen3-next.md b/docs/models/qwen/qwen3-next.md index dc23260875..9aef4afe56 100644 --- a/docs/models/qwen/qwen3-next.md +++ b/docs/models/qwen/qwen3-next.md @@ -2,9 +2,6 @@ title: Qwen3-Next 80B-A3B description: Launch recipes for Qwen3-Next-80B-A3B-Thinking on Megatron and FSDP backends. --- - -# Qwen3-Next 80B-A3B - ## 1. 
Model Introduction [Qwen3-Next](https://huggingface.co/collections/Qwen/qwen3-next) is Alibaba's next-generation Qwen architecture, swapping classical attention for a hybrid Gated DeltaNet + Full Attention design. @@ -91,7 +88,6 @@ The canonical script enables EAGLE speculative rollout: --sglang-ep-size 8 --sglang-cuda-graph-bs 1 2 4 8 $(seq 16 8 128) -# mtp / EAGLE --sglang-speculative-algorithm EAGLE --sglang-speculative-num-steps 2 --sglang-speculative-eagle-topk 1 @@ -120,6 +116,6 @@ The FSDP variant leaves Adam on GPU. ## 6. Pairs Well With -- [Backends Beyond Megatron](../../advanced/architecture-support.md) -- [Rollout Routing Replay (R3)](../../advanced/miles-router.md) -- [Speculative Decoding](../../advanced/speculative-decoding.md) +- [Backends Beyond Megatron](/advanced/architecture-support) +- [Rollout Routing Replay (R3)](/advanced/miles-router) +- [Speculative Decoding](/advanced/speculative-decoding) diff --git a/docs/models/qwen/qwen3.md b/docs/models/qwen/qwen3.md index 14382b71bb..d7616f9007 100644 --- a/docs/models/qwen/qwen3.md +++ b/docs/models/qwen/qwen3.md @@ -2,9 +2,6 @@ title: Qwen3 description: Launch recipes for dense Qwen3 models (0.6 B โ€“ 32 B). --- - -# Qwen3 - ## 1. Model Introduction [Qwen3](https://github.com/QwenLM/Qwen3) is the latest generation of Alibaba's Qwen language model series, available in dense and MoE variants with both Instruct and reasoning-enhanced Thinking editions. @@ -13,7 +10,7 @@ description: Launch recipes for dense Qwen3 models (0.6 B โ€“ 32 B). - **Stronger general intelligence**: significant improvements in instruction following, logical reasoning, mathematics, science, coding, and tool usage over Qwen2.5. - **Extended context length**: trained for 256 K-token contexts, useful for long-document reasoning and agentic workflows. -- **Flexible deployment options**: dense sizes from 0.6 B up to 32 B; this page covers the dense recipes (MoE recipes live in [qwen3-moe](qwen3-moe.md)). 
+- **Flexible deployment options**: dense sizes from 0.6 B up to 32 B; this page covers the dense recipes (MoE recipes live in [qwen3-moe](/models/qwen/qwen3-moe)). - **Stronger agent interaction**: improved tool-use and search-based agent performance. ## 2. Supported Variants @@ -119,11 +116,11 @@ The 4 B / 8 B / 14 B recipes leave Adam on GPU. ### 5.5 Notable quirks -- **BF16 train + FP8 inference**: `run-qwen3-4B.sh:30-31` ships a commented `--hf-checkpoint /root/Qwen3-4B-FP8` alternative โ€” uncomment it (and download `Qwen/Qwen3-4B-FP8`) to swap rollout to FP8 while keeping BF16 training. See [Low Precision RL](../../advanced/fp8-low-precision.md). +- **BF16 train + FP8 inference**: `run-qwen3-4B.sh` ships a commented `--hf-checkpoint /root/Qwen3-4B-FP8` alternative โ€” uncomment it (and download `Qwen/Qwen3-4B-FP8`) to swap rollout to FP8 while keeping BF16 training. See [Low Precision RL](/advanced/fp8-low-precision). - **FSDP backend**: `run-qwen3-4B-fsdp.sh` runs the same recipe with `--train-backend fsdp`; no Megatron `torch_dist` conversion needed. - **AMD ROCm**: `scripts/amd/run-qwen3-4B-amd.sh` mirrors the recipe with `${NUM_GPUS}` resolved from the AMD environment. ## 6. Pairs Well With -- [Low Precision RL](../../advanced/fp8-low-precision.md) -- [Backends Beyond Megatron](../../advanced/architecture-support.md) โ€” for the FSDP variant. +- [Low Precision RL](/advanced/fp8-low-precision) +- [Backends Beyond Megatron](/advanced/architecture-support) โ€” for the FSDP variant. diff --git a/docs/platforms/amd.md b/docs/platforms/amd.md index e824440993..f1cd7bd380 100644 --- a/docs/platforms/amd.md +++ b/docs/platforms/amd.md @@ -2,16 +2,12 @@ title: AMD MI300X description: ROCm 6.3+ with patches for virtual memory management. Same launch scripts. --- - -# AMD MI300X - Miles runs on AMD Instinct GPUs (MI300, MI325, MI350, MI355X) with ROCm. The launch scripts are the same as on NVIDIA โ€” only the container and a few env vars differ. 
## Container images ```bash -# MI350 / MI355X docker pull rlsys/miles:MI350-355-latest # MI300 / MI325 diff --git a/docs/platforms/index.md b/docs/platforms/index.md index 5cd2043b37..c87dafbada 100644 --- a/docs/platforms/index.md +++ b/docs/platforms/index.md @@ -2,20 +2,17 @@ title: Platforms description: Hardware-specific tutorials. Most users want NVIDIA H/B; AMD MI300X is supported via ROCm. --- - -# Platforms - Miles runs on NVIDIA H/B-series and AMD MI300X with the same launch scripts. Each platform page covers driver versions, build flags, and the FP8 / ROCm quirks you need to know before kicking off a job. - + The default GB300 / GB200 / B200 / B100 / H200 / H100 with FP8, NVLink, and InfiniBand. - + ROCm 6.3+ with patches for virtual memory management. Same launch scripts. diff --git a/docs/platforms/nvidia.md b/docs/platforms/nvidia.md index 7bf0a1aa3a..bc9a7bc28c 100644 --- a/docs/platforms/nvidia.md +++ b/docs/platforms/nvidia.md @@ -2,9 +2,6 @@ title: NVIDIA H / B Series description: H100, H200, B100, B200 โ€” Miles's primary target. --- - -# NVIDIA GPUs - NVIDIA Blackwell (GB300 / GB200 / B200 / B100) and Hopper (H200 / H100) are Miles's first-class targets. ## Recommended setup diff --git a/docs/style.css b/docs/style.css index bcbc904178..7040f5ceed 100644 --- a/docs/style.css +++ b/docs/style.css @@ -1,9 +1,159 @@ -/* Regular-weight model links inside the homepage CardGroup. - The Card body applies a bold weight to its children, so the family - name (e.g. **Qwen**) and the comma-separated model links would both - render bold without this override. 
*/ +/* ============================================= + Design tokens โ€” light / dark + ============================================= */ +:root { + --rx-code-bg: rgba(26, 20, 16, 0.05); + --rx-code-border: rgba(26, 20, 16, 0.10); + --rx-divider: rgba(26, 20, 16, 0.10); + --rx-table-bg: rgba(26, 20, 16, 0.02); +} + +html.dark { + --rx-code-bg: rgba(255, 255, 255, 0.06); + --rx-code-border: rgba(255, 255, 255, 0.08); + --rx-divider: rgba(255, 255, 255, 0.08); + --rx-table-bg: rgba(255, 255, 255, 0.02); +} + +/* ============================================= + Links โ€” internal orange, external lighter + โ†— + ============================================= */ +a { + color: #d55816; + text-decoration-thickness: 1px; + text-underline-offset: 3px; +} + +a:hover { + text-decoration-thickness: 2px; +} + +a[href^="http"]:not([href*="radixark.com"])::after { + content: "โ†—"; + margin-left: 0.15em; + font-size: 0.7em; + opacity: 0.45; + vertical-align: super; +} + +/* ============================================= + Nav / UI โ€” Inter (matches RadixArk buttons) + ============================================= */ +#navbar, button { + font-family: "Inter", system-ui, sans-serif; +} + +/* ============================================= + Code โ€” consistent across headings, body, FAQ + ============================================= */ +code { + background: var(--rx-code-bg) !important; + border-radius: 4px !important; + font-size: 0.875em !important; + padding: 0.15em 0.4em !important; +} + +pre { + border: 1px solid var(--rx-code-border) !important; + border-radius: 6px !important; + overflow-x: auto; + -webkit-overflow-scrolling: touch; +} + +pre code { + font-size: 0.875rem !important; + padding: 0 !important; + background: none !important; + border-radius: 0 !important; +} + +/* ============================================= + Tables โ€” readable + mobile-safe + ============================================= */ +table { + display: block; + overflow-x: auto; + 
-webkit-overflow-scrolling: touch; + border-collapse: collapse; + width: max-content; + max-width: 100%; + font-size: 0.9375rem; +} + +th { + background: var(--rx-table-bg); + font-weight: 600; + letter-spacing: 0.01em; + white-space: nowrap; +} + +td { + border-color: var(--rx-divider); + vertical-align: top; + line-height: 1.5; +} + +/* ============================================= + Images โ€” always responsive + ============================================= */ +img { + max-width: 100%; + height: auto; + border-radius: 4px; +} + +/* ============================================= + Dividers + ============================================= */ +hr { + border-color: var(--rx-divider) !important; + margin: 2rem 0; +} + +/* ============================================= + Model-list card fix (homepage CardGroup) + ============================================= */ .model-list, .model-list a { font-weight: 400 !important; } +.model-list a { + font-size: 0.9em; + text-underline-offset: 2px; +} + +/* ============================================= + Mobile โ€” overflow and text consistency + ============================================= */ +@media (max-width: 768px) { + pre, table, img, video, iframe { + max-width: 100%; + } + + pre code { + font-size: 0.8125rem !important; + } + + table { + font-size: 0.875rem; + } + + h1 code, h2 code, h3 code { + font-size: 0.8em !important; + } +} + +/* ============================================= + Accordion / FAQ โ€” title size + ============================================= */ +/* Mintlify renders accordion titles as buttons; + scale them down relative to body text */ +button[data-accordion-trigger], +[data-accordion] button, +.accordion-title, +details summary { + font-size: 0.9375rem !important; /* 15px โ€” tighter than body */ + line-height: 1.4 !important; + font-weight: 500 !important; +} diff --git a/docs/user-guide/agentic-chat-template.md b/docs/user-guide/agentic-chat-template.md index 933bd06933..23b53beb01 100644 --- 
a/docs/user-guide/agentic-chat-template.md +++ b/docs/user-guide/agentic-chat-template.md @@ -49,7 +49,7 @@ A full multi-turn agentic setup on the session-server TITO path lives in [`examp ## Add a new model -Models in the table are verified by Miles maintainers โ€” just pick the family. To support a new model (or a new append-role surface), register a `TITOTokenizer` subclass plus its fixed Jinja template (or HF-native + kwargs) and `SUPPORTED_TEMPLATES` rows in [`tito_tokenizer.py`](../../miles/utils/chat_template_utils/tito_tokenizer.py), then verify with both scripts โ€” either failing blocks it. Each prints `Verdict: PASS/FAIL`. +Models in the table are verified by Miles maintainers โ€” just pick the family. To support a new model (or a new append-role surface), register a `TITOTokenizer` subclass plus its fixed Jinja template (or HF-native + kwargs) and `SUPPORTED_TEMPLATES` rows in [`tito_tokenizer.py`](https://github.com/radixark/miles/blob/main/miles/utils/chat_template_utils/tito_tokenizer.py), then verify with both scripts โ€” either failing blocks it. Each prints `Verdict: PASS/FAIL`. ```bash # CPU / fast โ€” rendered token sequence is append-only diff --git a/docs/user-guide/argument-groups.md b/docs/user-guide/argument-groups.md index 92f226cccd..e0b783523b 100644 --- a/docs/user-guide/argument-groups.md +++ b/docs/user-guide/argument-groups.md @@ -2,14 +2,11 @@ title: Argument Groups description: The launch-script argument groups used by Miles recipes, with links to the flags that belong in each group. --- - -# Argument Groups - Miles launch scripts are bash arrays. The grouping is deliberately boring: each array owns one operational concern, then the script expands all arrays into `train.py` or `train_async.py`. -Use this page to decide where a flag belongs. Use the [CLI Reference](cli-reference.md) +Use this page to decide where a flag belongs. 
Use the [CLI Reference](/user-guide/cli-reference) when you need the full default and type for an individual flag. | Group | Owns | Typical source | @@ -75,7 +72,7 @@ produces. | Filtering | `--over-sampling-batch-size`, `--dynamic-sampling-filter-path` | The rollout volume and training consumption must satisfy the -[four-knob invariant](concepts.md#the-four-knob-invariant). +[four-knob invariant](/user-guide/concepts#the-four-knob-invariant). ## EVAL_ARGS - evaluation overrides @@ -110,7 +107,7 @@ Flags not set in `EVAL_ARGS` inherit from `ROLLOUT_ARGS`. Megatron exposes TP, PP, CP, EP, and ETP, but not every product of those dimensions is valid or worth using for every model. Start from the recipe's tested combination and -see [parallelism compatibility](usage.md#parallelism-compatibility) before changing +see [parallelism compatibility](/user-guide/usage#parallelism-compatibility) before changing more than one dimension. diff --git a/docs/user-guide/cli-reference.md b/docs/user-guide/cli-reference.md index 4ce581236a..30487b5a3d 100644 --- a/docs/user-guide/cli-reference.md +++ b/docs/user-guide/cli-reference.md @@ -2,9 +2,6 @@ title: CLI Reference description: Every command-line flag Miles accepts, grouped by subsystem. --- - -# CLI Reference - Miles is configured through command-line flags passed to `train.py` or `train_async.py`. The Megatron flags (such as `--num-layers`, `--rotary-base`, `--recompute-granularity`) are inherited via Megatron's argument parser; Miles adds @@ -30,7 +27,7 @@ This page has two passes. | `--rollout-num-gpus-per-engine` | `1` | TP size of each SGLang engine. | | `--colocate` | off | Share GPUs between actor and rollout. | -See [Training Script Walkthrough: Colocation](training-script-walkthrough.md#colocation-share-gpus-or-dont) +See [Training Script Walkthrough: Colocation](/user-guide/training-script-walkthrough#colocation-share-gpus-or-dont) for what `--colocate` flips on under the hood. 
### Batch sizing @@ -265,7 +262,7 @@ Sections mirror the launch-script argument groups. | `--rm-type` | enum | โ€“ | Built-in reward: `math`, `dapo`, `deepscaler`, `f1`, `gpqa`, `ifbench`, `remote_rm`, `random`. | | `--rm-url` | str | โ€“ | Endpoint when `--rm-type remote_rm`. | | `--group-rm` | flag | off | Batched reward computation. | -| `--custom-rm-path` | str | โ€“ | Custom reward function (see [Customization](customization.md)). | +| `--custom-rm-path` | str | โ€“ | Custom reward function (see [Customization](/user-guide/customization)). | | `--dynamic-sampling-filter-path` | str | โ€“ | Group filter (DAPO-style). | | `--buffer-filter-path` | str | โ€“ | Buffer dequeue filter. | | `--rollout-sample-filter-path` | str | โ€“ | Per-sample filter. | @@ -349,5 +346,5 @@ Common `--sglang-*` flags: ### Customization -See [Customization](customization.md) for the full catalogue of `--*-path` flags -that replace or extend Miles's behaviour. +See [Customization](/user-guide/customization) for the full catalog of `--*-path` flags +that replace or extend Miles's behavior. diff --git a/docs/user-guide/concepts.md b/docs/user-guide/concepts.md index a4d79ac013..8a177b2b9b 100644 --- a/docs/user-guide/concepts.md +++ b/docs/user-guide/concepts.md @@ -2,9 +2,6 @@ title: Core Concepts description: The four objects that make up every Miles job, and how data flows between them. --- - -# Core Concepts - A Miles training job is a loop over four objects. Once you understand what each one *is* and how data flows between them, every flag in the system has an obvious home. @@ -76,19 +73,19 @@ Use this map when reading any launch script: | Argument group | Concerns | |---|---| -| [`MODEL_ARGS`](argument-groups.md#model-args) | Architecture constants (layers, hidden size, rotary base, ...) 
| -| [`CKPT_ARGS`](argument-groups.md#ckpt-args) | Filesystem paths for the actor / reference / save directory | -| [`ROLLOUT_ARGS`](argument-groups.md#rollout-args) | Prompt dataset, batch knobs, sampling parameters, reward type | -| [`EVAL_ARGS`](argument-groups.md#eval-args) | Eval dataset, cadence, sampling overrides for evaluation | -| [`PERF_ARGS`](argument-groups.md#perf-args) | Parallelism (TP/PP/CP/EP/ETP), recomputation, dynamic batching | -| [`GRPO_ARGS`](argument-groups.md#grpo-args) | RL algorithm, KL, clipping, entropy bonus, advantage estimator | -| [`OPTIMIZER_ARGS`](argument-groups.md#optimizer-args) | Learning rate, schedule, weight decay, Adam betas | -| [`SGLANG_ARGS`](argument-groups.md#sglang-args) | Engine TP, memory fraction, log level, `--sglang-*` passthrough | +| [`MODEL_ARGS`](/user-guide/argument-groups#model-args) | Architecture constants (layers, hidden size, rotary base, ...) | +| [`CKPT_ARGS`](/user-guide/argument-groups#ckpt-args) | Filesystem paths for the actor / reference / save directory | +| [`ROLLOUT_ARGS`](/user-guide/argument-groups#rollout-args) | Prompt dataset, batch knobs, sampling parameters, reward type | +| [`EVAL_ARGS`](/user-guide/argument-groups#eval-args) | Eval dataset, cadence, sampling overrides for evaluation | +| [`PERF_ARGS`](/user-guide/argument-groups#perf-args) | Parallelism (TP/PP/CP/EP/ETP), recomputation, dynamic batching | +| [`GRPO_ARGS`](/user-guide/argument-groups#grpo-args) | RL algorithm, KL, clipping, entropy bonus, advantage estimator | +| [`OPTIMIZER_ARGS`](/user-guide/argument-groups#optimizer-args) | Learning rate, schedule, weight decay, Adam betas | +| [`SGLANG_ARGS`](/user-guide/argument-groups#sglang-args) | Engine TP, memory fraction, log level, `--sglang-*` passthrough | ## Next -- [Training Backend](usage.md) โ€” Megatron-LM, parallelism, checkpoints, and hooks. -- [Argument Groups](argument-groups.md) โ€” where each launch-script array belongs. 
-- [Training Script Walkthrough](training-script-walkthrough.md) โ€” the launch script +- [Training Backend](/user-guide/usage) โ€” Megatron-LM, parallelism, checkpoints, and hooks. +- [Argument Groups](/user-guide/argument-groups) โ€” where each launch-script array belongs. +- [Training Script Walkthrough](/user-guide/training-script-walkthrough) โ€” the launch script group by group, plus execution modes (colocation, sync/async, dynamic sampling, โ€ฆ). -- [CLI Reference](cli-reference.md) โ€” every flag, grouped and fully catalogued. +- [CLI Reference](/user-guide/cli-reference) โ€” every flag, grouped and fully cataloged. diff --git a/docs/user-guide/customization.md b/docs/user-guide/customization.md index fdbc86b89b..ee617f3ce1 100644 --- a/docs/user-guide/customization.md +++ b/docs/user-guide/customization.md @@ -2,10 +2,7 @@ title: Customization description: The 22 plug-points where you can drop in your own Python without forking Miles. --- - -# Customization - -Most of Miles's behaviour can be replaced with user-supplied Python by passing a +Most of Miles's behavior can be replaced with user-supplied Python by passing a `--*-path` flag. This page lists every such hook, the function signature it expects, and the default it replaces. @@ -92,7 +89,6 @@ configured. ### `--custom-rm-path` ```python -# Single-sample mode async def custom_rm(args, sample: Sample) -> float: ... @@ -281,4 +277,4 @@ ROLLOUT_ARGS+=( That is the entire delta from the stock GRPO recipe, with no source changes to Miles. -โ†’ Next: [Server arguments reference](cli-reference.md) +โ†’ Next: [Server arguments reference](/user-guide/cli-reference) diff --git a/docs/user-guide/fully-async.md b/docs/user-guide/fully-async.md index 8c615b3c20..3ac4dd4aee 100644 --- a/docs/user-guide/fully-async.md +++ b/docs/user-guide/fully-async.md @@ -2,9 +2,6 @@ title: Fully Async Rollout description: How fully async rollout decouples generation from training, when to use it, and which flags enable it.
--- - -# Fully Async Rollout - Fully async rollout splits Miles into two concurrent loops: 1. A background rollout worker keeps SGLang generation in flight and pushes completed @@ -38,7 +35,7 @@ function that owns the background worker: + --rollout-function-path fully_async_rollout.generate_rollout_fully_async ``` -Everything else belongs in the same [argument groups](argument-groups.md) as a +Everything else belongs in the same [argument groups](/user-guide/argument-groups) as a synchronous run. ## Queue model @@ -94,11 +91,11 @@ Warning: No progress for s. Queue size: , Collected: / ``` Treat large staleness windows as a training-quality signal, not just a performance -signal. Fast [P2P weight transfer](../advanced/p2p-weight-transfer.md) keeps the +signal. Fast [P2P weight transfer](/advanced/p2p-weight-transfer) keeps the rollout engines closer to the latest actor weights so fewer groups get recycled by `--max-weight-staleness`. ## Example implementation For a complete Qwen3 launch script and worker implementation, see the -[Fully Async Rollout example](../examples/fully-async.md). +[Fully Async Rollout example](/examples/fully-async). diff --git a/docs/user-guide/index.md b/docs/user-guide/index.md index 6763ea1af7..5b9e62d8e8 100644 --- a/docs/user-guide/index.md +++ b/docs/user-guide/index.md @@ -2,25 +2,22 @@ title: User Guide description: Concepts, launch script walkthrough, customization hooks, and a complete CLI reference. --- - -# User Guide - | Page | What it covers | |---|---| -| [Core Concepts](concepts.md) | The four objects in the training loop and the four-knob invariant. | -| [Argument Groups](argument-groups.md) | Where `MODEL_ARGS`, `PERF_ARGS`, `GRPO_ARGS`, and the other launch-script arrays belong. | -| [Training Backend](usage.md) | Megatron-LM as the training backend โ€” parallelism, checkpoints, and hooks. 
| -| [Training Script Walkthrough](training-script-walkthrough.md) | The eight `XXX_ARGS` arrays in a launch script, plus the execution modes (sync/async, colocation, dynamic sampling, partial rollout, BF16+FP8). | -| [Monitoring & Logging](monitoring.md) | wandb, structured logs, per-source breakdowns, profiling, router metrics. | -| [Customization](customization.md) | The 22 `--*-path` plug-points for custom Python โ€” rollout, reward, filters, loss, hooks. | -| [Rollout Endpoints](rollout-endpoints.md) | The `/generate` endpoint and the OpenAI chat endpoint for agentic sessions. | -| [Fully Async Rollout](fully-async.md) | Queue-backed rollout production, tuning knobs, and when to use `train_async.py`. | -| [Agentic Chat Templates](agentic-chat-template.md) | Turning on and verifying TITO so multi-turn agentic rollout stays append-only. | -| [CLI Reference](cli-reference.md) | Every flag Miles accepts, grouped by subsystem. | +| [Core Concepts](/user-guide/concepts) | The four objects in the training loop and the four-knob invariant. | +| [Argument Groups](/user-guide/argument-groups) | Where `MODEL_ARGS`, `PERF_ARGS`, `GRPO_ARGS`, and the other launch-script arrays belong. | +| [Training Backend](/user-guide/usage) | Megatron-LM as the training backend โ€” parallelism, checkpoints, and hooks. | +| [Training Script Walkthrough](/user-guide/training-script-walkthrough) | The eight `XXX_ARGS` arrays in a launch script, plus the execution modes (sync/async, colocation, dynamic sampling, partial rollout, BF16+FP8). | +| [Monitoring & Logging](/user-guide/monitoring) | wandb, structured logs, per-source breakdowns, profiling, router metrics. | +| [Customization](/user-guide/customization) | The 22 `--*-path` plug-points for custom Python โ€” rollout, reward, filters, loss, hooks. | +| [Rollout Endpoints](/user-guide/rollout-endpoints) | The `/generate` endpoint and the OpenAI chat endpoint for agentic sessions. 
| +| [Fully Async Rollout](/user-guide/fully-async) | Queue-backed rollout production, tuning knobs, and when to use `train_async.py`. | +| [Agentic Chat Templates](/user-guide/agentic-chat-template) | Turning on and verifying TITO so multi-turn agentic rollout stays append-only. | +| [CLI Reference](/user-guide/cli-reference) | Every flag Miles accepts, grouped by subsystem. | ## Which pages do I actually need? -- **Training my first job** โ€” read [Core Concepts](concepts.md), then [Training Script Walkthrough](training-script-walkthrough.md). -- **Tuning a running job** โ€” [Training Script Walkthrough](training-script-walkthrough.md) in depth + [CLI Reference](cli-reference.md). -- **Plugging in a custom reward / rollout / filter** โ€” skim [Core Concepts](concepts.md) for vocabulary, then go to [Customization](customization.md). +- **Training my first job** โ€” read [Core Concepts](/user-guide/concepts), then [Training Script Walkthrough](/user-guide/training-script-walkthrough). +- **Tuning a running job** โ€” [Training Script Walkthrough](/user-guide/training-script-walkthrough) in depth + [CLI Reference](/user-guide/cli-reference). +- **Plugging in a custom reward / rollout / filter** โ€” skim [Core Concepts](/user-guide/concepts) for vocabulary, then go to [Customization](/user-guide/customization). - **Contributor onboarding** โ€” read top to bottom. diff --git a/docs/user-guide/monitoring.md b/docs/user-guide/monitoring.md index 6bd317ee7c..77a17e1008 100644 --- a/docs/user-guide/monitoring.md +++ b/docs/user-guide/monitoring.md @@ -2,9 +2,6 @@ title: Monitoring & Logging description: wandb, structured logs, profiling, and what to look at when something looks off. --- - -# Monitoring & Logging - Miles emits per-rollout metrics to stdout and (optionally) Weights & Biases. SGLang and Ray write their own logs to their default directories. 
diff --git a/docs/user-guide/rollout-endpoints.md b/docs/user-guide/rollout-endpoints.md
index ca9782b958..2625f4840b 100644
--- a/docs/user-guide/rollout-endpoints.md
+++ b/docs/user-guide/rollout-endpoints.md
@@ -2,9 +2,6 @@
 title: Rollout Endpoints
 description: How Miles talks to SGLang. The /generate endpoint and the OpenAI-format /v1/chat/completions endpoint.
 ---
-
-# Rollout Endpoints
-
 Miles supports two ways for a custom rollout function to talk to SGLang. The `/generate` endpoint is the most direct interface; you control tokenization. The OpenAI-format `/v1/chat/completions` endpoint is router-session aware and fits
@@ -229,7 +226,7 @@ inherited across turns. Each request is tokenized independently.
 ## Next
-- [Customization](customization.md): the full catalogue of `--*-path` hooks.
-- [Agentic Chat Templates](agentic-chat-template.md): verifying that a template is
+- [Customization](/user-guide/customization): the full catalog of `--*-path` hooks.
+- [Agentic Chat Templates](/user-guide/agentic-chat-template): verifying that a template is
 append-only across turns.
-- [Multi-agent example](../examples/multi-agent.md): full agentic walkthrough.
+- [Multi-agent example](/examples/multi-agent): full agentic walkthrough.
diff --git a/docs/user-guide/training-script-walkthrough.md b/docs/user-guide/training-script-walkthrough.md
index 87782a0cd5..12b7e74865 100644
--- a/docs/user-guide/training-script-walkthrough.md
+++ b/docs/user-guide/training-script-walkthrough.md
@@ -2,9 +2,6 @@
 title: Training Script Walkthrough
 description: An annotated tour through every argument group in a Miles launch script, plus the feature modes you turn on when a recipe isn't enough.
 ---
-
-# Training Script Walkthrough
-
 A Miles launch script is plain bash — a sequence of `XXX_ARGS=( ... )` arrays handed to `train.py` or `train_async.py`. This page walks through each group and then covers the execution modes you turn on beyond the default recipe.
@@ -18,14 +15,14 @@ off to `train.py`:
 | Array | Governs |
 |---|---|
-| [`MODEL_ARGS`](argument-groups.md#model-args) | Architecture constants (layers, hidden size, rotary base, ...) |
-| [`CKPT_ARGS`](argument-groups.md#ckpt-args) | Filesystem paths for the actor / reference / save directory |
-| [`ROLLOUT_ARGS`](argument-groups.md#rollout-args) | Prompt dataset, batch knobs, sampling parameters, reward type |
-| [`EVAL_ARGS`](argument-groups.md#eval-args) | Eval dataset, cadence, sampling overrides for evaluation |
-| [`PERF_ARGS`](argument-groups.md#perf-args) | Parallelism (TP/PP/CP/EP/ETP), recomputation, dynamic batching |
-| [`GRPO_ARGS`](argument-groups.md#grpo-args) | RL algorithm, KL, clipping, entropy bonus, advantage estimator |
-| [`OPTIMIZER_ARGS`](argument-groups.md#optimizer-args) | Learning rate, schedule, weight decay, Adam betas |
-| [`SGLANG_ARGS`](argument-groups.md#sglang-args) | Engine TP, memory fraction, log level, `--sglang-*` passthrough |
+| [`MODEL_ARGS`](/user-guide/argument-groups#model-args) | Architecture constants (layers, hidden size, rotary base, ...)
|
+| [`CKPT_ARGS`](/user-guide/argument-groups#ckpt-args) | Filesystem paths for the actor / reference / save directory |
+| [`ROLLOUT_ARGS`](/user-guide/argument-groups#rollout-args) | Prompt dataset, batch knobs, sampling parameters, reward type |
+| [`EVAL_ARGS`](/user-guide/argument-groups#eval-args) | Eval dataset, cadence, sampling overrides for evaluation |
+| [`PERF_ARGS`](/user-guide/argument-groups#perf-args) | Parallelism (TP/PP/CP/EP/ETP), recomputation, dynamic batching |
+| [`GRPO_ARGS`](/user-guide/argument-groups#grpo-args) | RL algorithm, KL, clipping, entropy bonus, advantage estimator |
+| [`OPTIMIZER_ARGS`](/user-guide/argument-groups#optimizer-args) | Learning rate, schedule, weight decay, Adam betas |
+| [`SGLANG_ARGS`](/user-guide/argument-groups#sglang-args) | Engine TP, memory fraction, log level, `--sglang-*` passthrough |
 ---
@@ -57,7 +54,7 @@ MODEL_ARGS+=(--rotary-base 10000)
 ## CKPT_ARGS — paths
-The three roles — actor, frozen reference, HuggingFace directory — are defined in [Core Concepts](concepts.md#the-four-objects). Here they map to four flags:
+The three roles — actor, frozen reference, HuggingFace directory — are defined in [Core Concepts](/user-guide/concepts#the-four-objects). Here they map to four flags:
 ```bash
 CKPT_ARGS=(
@@ -90,11 +87,11 @@ Their product is the total sample count produced each rollout.
 - `--global-batch-size` — samples used per optimizer step.
 - `--num-steps-per-rollout` — optimizer steps per rollout. Leave at `1` for strict
- on-policy behaviour; raise it for off-policy reuse of rollout batches.
+ on-policy behavior; raise it for off-policy reuse of rollout batches.
 Their product is the total sample count consumed each rollout.
-These two products must be equal — that's the [four-knob invariant](concepts.md#the-four-knob-invariant). Set three sides; Miles fills in the fourth. Set all four and Miles validates the equation — inconsistent values abort early.
+These two products must be equal — that's the [four-knob invariant](/user-guide/concepts#the-four-knob-invariant). Set three sides; Miles fills in the fourth. Set all four and Miles validates the equation — inconsistent values abort early.
 **Outer loop**
@@ -136,7 +133,7 @@ regardless of how many optimizer steps fired in between.
 ## EVAL_ARGS — a strict subset of rollout
-Evaluation reuses the rollout machinery but lets you override sampling behaviour so
+Evaluation reuses the rollout machinery but lets you override sampling behavior so
 that eval is deterministic and comparable across runs.
 ```bash
@@ -223,7 +220,7 @@ A few design choices become visible here:
 or when you want length-proportional weighting.
 - **`--use-tis` is the numerical safety belt.** Switch it on when rollout and trainer operate at different precisions or when you explicitly want off-policy reuse. See
- the R3 deep dive in [Rollout Routing Replay (R3)](../advanced/miles-router.md).
+ the R3 deep dive in [Rollout Routing Replay (R3)](/advanced/miles-router).
 ## OPTIMIZER_ARGS — nothing surprising
@@ -274,7 +271,6 @@ DP-attention is off, the effective `dp_size` is derived from
 ---
-# Execution modes
 The eight argument groups describe **what** you're training. The next set of sections describe **how** the training runs — the execution modes that flip Miles from its
@@ -304,7 +300,7 @@ Enable it with two changes to the launch script:
 | Sync *(default)* | Lower | Lower overall | Strict on-policy, debugging |
 | Async | Higher | Up to 2× | Rollout-bound jobs, long runs |
-See the [Fully Async Rollout example](../examples/fully-async.md) for the full
+See the [Fully Async Rollout example](/examples/fully-async) for the full
 walkthrough including the worker implementation.
 ## Colocation: share GPUs or don't
@@ -458,15 +454,15 @@ KL anchor silently and makes the loss curve incomparable to earlier runs.
 For end-to-end FP8 (trainer and inference at bit-identical precision), see
-[Low Precision RL](../advanced/fp8-low-precision.md). For INT4 quant-aware
-training, see [INT4 QAT](../advanced/int4-qat.md).
+[Low Precision RL](/advanced/fp8-low-precision). For INT4 quant-aware
+training, see [INT4 QAT](/advanced/int4-qat).
 ---
 ## Next
-- [Configuration](cli-reference.md) — the same material organized as a flag-by-flag
+- [Configuration](/user-guide/cli-reference) — the same material organized as a flag-by-flag
 reference.
-- [Server Arguments](cli-reference.md) — the complete CLI surface.
-- [Customization](customization.md) — the twenty-plus Python extension points.
-- [Training Backends](usage.md) — Megatron vs FSDP and each one's plumbing.
+- [Server Arguments](/user-guide/cli-reference) — the complete CLI surface.
+- [Customization](/user-guide/customization) — the twenty-plus Python extension points.
+- [Training Backends](/user-guide/usage) — Megatron vs FSDP and each one's plumbing.
diff --git a/docs/user-guide/usage.md b/docs/user-guide/usage.md
index cbbd7c848a..3a2b0b0c34 100644
--- a/docs/user-guide/usage.md
+++ b/docs/user-guide/usage.md
@@ -2,9 +2,6 @@
 title: Training Backend
 description: Megatron-LM as the training backend — parameters, parallelism, checkpoints, and hooks.
 ---
-
-# Training Backend
-
 Miles decouples the **training backend** (how the model is sharded, checkpointed, and stepped) from the **inference backend** (SGLang). The production training backend is **Megatron-LM**.
@@ -47,7 +44,7 @@ MODEL_ARGS=(
 The spec function replaces specific Megatron submodules with the HF implementation without patching Megatron itself. Details:
-[Backends Beyond Megatron](../advanced/architecture-support.md).
+[Backends Beyond Megatron](/advanced/architecture-support).
 ### Parallelism compatibility
@@ -66,7 +63,7 @@ the model recipe's tested combination, then change one dimension at a time.
 Do not assume TP, CP, EP, and ETP can all be raised independently for a new model — the exact set of supported combinations depends on the Megatron Core kernels and model spec
-being used. The [Argument Groups](argument-groups.md#perf-args) page lists the flags
+being used. The [Argument Groups](/user-guide/argument-groups#perf-args) page lists the flags
 that belong in `PERF_ARGS`.
 ### Checkpoint format
@@ -105,7 +102,7 @@ command.
 ### Hooks
-Three extension points override Megatron behaviour without forking:
+Three extension points override Megatron behavior without forking:
 | Flag | Runs |
 |---|---|
@@ -114,7 +111,7 @@ Three extension points override Megatron behaviour without forking:
 | `--custom-megatron-before-train-step-hook-path` | Before every training step |
 Typical use cases: mixing in an auxiliary loss, instrumenting per-step metrics, or
-clipping weights surgically. See [Customization](customization.md#megatron-hooks).
+clipping weights surgically. See [Customization](/user-guide/customization#megatron-hooks).
 ---
@@ -173,13 +170,13 @@ at startup.
 ## Further reading
-- [Core concepts](concepts.md) — the four objects that make up any Miles job.
-- [Training script walkthrough](training-script-walkthrough.md) — the launch script,
+- [Core concepts](/user-guide/concepts) — the four objects that make up any Miles job.
+- [Training script walkthrough](/user-guide/training-script-walkthrough) — the launch script,
 argument group by argument group.
-- [Fully Async Rollout](fully-async.md) — decouple generation from trainer steps with
+- [Fully Async Rollout](/user-guide/fully-async) — decouple generation from trainer steps with
 a queue-backed rollout worker.
-- [Configuration](cli-reference.md) — the flag taxonomy and defaults.
-- [Backends beyond Megatron](../advanced/architecture-support.md) — wrapping new
+- [Configuration](/user-guide/cli-reference) — the flag taxonomy and defaults.
+- [Backends beyond Megatron](/advanced/architecture-support) — wrapping new
 architectures without patching Megatron core.
-- [Experimental Features → FSDP backend](../developer/experimental-features.md#fsdp-backend)
+- [Experimental Features → FSDP backend](/developer/experimental-features#fsdp-backend)
 — experimental PyTorch FSDP2 backend for fast iteration on small dense models.
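
Reviewer note on the four-knob invariant that the walkthrough hunks above relink: the text says the samples produced per rollout must equal the samples consumed per rollout, that Miles fills in a missing knob, and that inconsistent values abort early. The sketch below is one way to express that rule in Python. It is not Miles' implementation. The function name `resolve_four_knobs` and the parameter names `prompts_per_rollout` and `samples_per_prompt` are hypothetical stand-ins, because this diff does not show the flag names for the producing side. Only `global_batch_size` and `num_steps_per_rollout` mirror flags quoted in the docs.

```python
from typing import List, Optional, Tuple


def resolve_four_knobs(
    prompts_per_rollout: Optional[int],    # hypothetical name, producing side
    samples_per_prompt: Optional[int],     # hypothetical name, producing side
    global_batch_size: Optional[int],      # mirrors --global-batch-size
    num_steps_per_rollout: Optional[int],  # mirrors --num-steps-per-rollout
) -> Tuple[int, ...]:
    """Fill in at most one missing knob so that produced == consumed.

    Illustrative only: a sketch of the documented rule, not Miles' code.
    """
    knobs: List[Optional[int]] = [
        prompts_per_rollout,
        samples_per_prompt,
        global_batch_size,
        num_steps_per_rollout,
    ]
    missing = [i for i, value in enumerate(knobs) if value is None]
    if len(missing) > 1:
        raise ValueError("set at least three of the four knobs")
    if any(value is not None and value <= 0 for value in knobs):
        raise ValueError("every knob must be a positive integer")

    if missing:
        idx = missing[0]
        if idx < 2:
            # Producing side is incomplete; the consuming side gives the total.
            total = knobs[2] * knobs[3]
            partner = knobs[1 - idx]
        else:
            # Consuming side is incomplete; the producing side gives the total.
            total = knobs[0] * knobs[1]
            partner = knobs[5 - idx]
        if total % partner != 0:
            raise ValueError(f"{total} samples cannot be split evenly by {partner}")
        knobs[idx] = total // partner

    produced = knobs[0] * knobs[1]
    consumed = knobs[2] * knobs[3]
    if produced != consumed:
        raise ValueError(f"produced {produced} samples but consumed {consumed}")
    return tuple(knobs)
```

For example, calling `resolve_four_knobs(32, 8, None, 1)` derives a global batch size of 256, while passing `100` for the third argument instead raises `ValueError` because 32 × 8 is not 100 × 1.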