radixark · Zhichenzzz · Jun 12, 2026 · Apr 29, 2026
diff --git a/README.md b/README.md
@@ -9,7 +9,7 @@
 [![License](https://img.shields.io/github/license/radixark/miles)](LICENSE)
 [![Slack](https://img.shields.io/badge/slack-join-brightgreen.svg)](https://slack.sglang.ai)
 
-[**Latest Updates**](#latest-updates) | [**Quick Start**](#quick-start) | [**Key Features**](#key-features) | [**Documentation**](https://www.radixark.com/miles/docs)
+[**Latest Updates**](#latest-updates) | [**Quick Start**](#quick-start) | [**Key Features**](#key-features) | [**Documentation**](https://miles.radixark.com/docs)
 
 </div>
 
@@ -18,11 +18,11 @@
 
 ## Latest Updates
 
-*   **[2026/02]** 💡 **Miles Detailed Arguments**: We've added a detailed command-line argument guide used to configure Miles for RL training and inference. These arguments enable precise control over cluster resources, training backends (Megatron/FSDP), inference optimization via SGLang, and RL algorithmic hyperparameters. [Link](https://github.com/radixark/miles/blob/main/docs/en/advanced/miles_server_args.md)
+*   **[2026/02]** 💡 **Miles Detailed Arguments**: We've added a detailed command-line argument guide used to configure Miles for RL training and inference. These arguments enable precise control over cluster resources, training backends (Megatron/FSDP), inference optimization via SGLang, and RL algorithmic hyperparameters. [Link](https://miles.radixark.com/docs/user-guide/cli-reference)
 *   **[2026/01]** 💎 **INT4 Quantization-Aware Training (QAT)**: Inspired by the Kimi K2-Thinking report, Miles now features a full-stack INT4 W4A16 QAT pipeline. This allows 1TB-scale models to fit into single-machine VRAM (e.g., NVIDIA H200), doubling rollout efficiency by eliminating cross-node bottlenecks while maintaining BF16-equivalent accuracy. [Blog](https://lmsys.org/blog/2026-01-26-int4-qat/)
 *   **[2026/01]** 💎 **Unified VLM/LLM Multi-Turn Training**: We provided an implementation for the VLM multi-turn sampling paradigm. Developers only need to write a customized `rollout` function to easily start multi-turn RL for VLM, just like training LLM. [Blog](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/vlm-multi-turn/readme-en.md)
 *   **[2026/01]** 🤖 **Multi-Agent Co-Evolution**: Miles now supports **MrlX**, a novel asynchronous co-evolutionary framework for Multi-Agent RL. Achieve superior performance in complex tasks like Doctor-Patient simulations and DeepResearch pipelines by enabling specialized agents to evolve together symbiotically. [[Link]](https://github.com/AQ-MedAI/MrlX)
-*   **[2025/12]** 🔄 **Rollout Routing Replay (R3)**: In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [[Paper]](https://arxiv.org/pdf/2510.11370) [[Docs]](docs/en/advanced/miles-router.md#22-rollout-routing-replay-r3-for-moe)
+*   **[2025/12]** 🔄 **Rollout Routing Replay (R3)**: In collaboration with SGLang, we've launched R3 to solve MoE RL instability. R3 records inference routing decisions and replays them during training, effectively eliminating the "training-inference mismatch" and preventing training collapse in large MoE models like Qwen3 and DeepSeek-V3. [[Paper]](https://arxiv.org/pdf/2510.11370) [[Docs]](https://miles.radixark.com/docs/advanced/miles-router)
 *   **[2025/11]** 🔥 **Unified FP8 Release**: Solves the stability issues in MoE RL by ensuring training and inference use the exact same FP8 quantization logic. [[Blog]](https://lmsys.org/blog/2025-11-25-fp8-rl/)
 *   **[2025/11]** ⚡ **Speculative Decoding in RL**: Integrated speculative rollout with online SFT for draft models, achieving massive throughput gains. [[Blog]](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/rlhf/slime/spec/readme-en.md)
 *   **[2025/11]** 🎉 **Miles Project Launch**: A joint effort by InfiXAI, Ant Group, SGLang RL Team, and the Miles community. [[Announcement]](https://lmsys.org/blog/2025-11-19-miles/)
@@ -107,7 +107,7 @@ python train.py \
     --n-samples-per-prompt 8
 ```
 
-For comprehensive guides on environment setup and custom reward functions, see the [Quick Start Guide](docs/en/get_started/quick_start.md).
+For comprehensive guides on environment setup and custom reward functions, see the [Quick Start Guide](https://miles.radixark.com/docs/getting-started/quick-start).
 
 ---
 

diff --git a/docs/README.md b/docs/README.md
@@ -0,0 +1,34 @@
+# Miles Documentation
+
+Live site: https://miles.radixark.com/docs
+
+## Layout
+
+```
+docs/
+├── docs.json        # Mintlify config: navigation, theme, redirects
+├── index.md         # Homepage
+├── getting-started/ models/ user-guide/ advanced/
+├── examples/ developer/ platforms/ blog/
+└── assets/          # Images and stylesheets
+```
+
+## Previewing locally
+
+```bash
+npm i -g mint
+cd docs
+mint dev
+```
+
+Then open http://localhost:3000.
+
+## Adding or editing a page
+
+1. Add or edit a `.md` file (e.g. `models/qwen/qwen4.md`).
+2. New pages need an entry in the `navigation` tree in `docs.json`, otherwise they won't
+   show up in the sidebar.
+3. When linking between pages, use absolute paths: `[Quick Start](/getting-started/quick-start)`.
+   Drop the `.md` extension.
+4. Images and other assets go in `assets/` and are referenced the same way:
+   `/assets/images/arch.png`.
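
Step 2 of the new `docs/README.md` (every new page needs a `navigation` entry in `docs.json`) is easy to forget, so a small check may help. The sketch below is an assumption-laden illustration, not part of this PR: it assumes page entries are strings inside lists keyed `"pages"` anywhere in the `navigation` tree, and the helper names `listed_pages` and `unlisted_pages` are hypothetical.

```python
import json
from pathlib import Path


def listed_pages(node):
    """Yield every page path found under any "pages" key in the navigation tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "pages" and isinstance(value, list):
                for item in value:
                    if isinstance(item, str):
                        yield item.strip("/")
                    else:
                        # Nested group object: recurse into it.
                        yield from listed_pages(item)
            else:
                yield from listed_pages(value)
    elif isinstance(node, list):
        for item in node:
            yield from listed_pages(item)


def unlisted_pages(docs_dir):
    """Return .md pages under docs_dir that the navigation tree does not mention."""
    docs_dir = Path(docs_dir)
    config = json.loads((docs_dir / "docs.json").read_text(encoding="utf-8"))
    listed = set(listed_pages(config.get("navigation", {})))
    on_disk = {
        path.relative_to(docs_dir).with_suffix("").as_posix()
        for path in docs_dir.rglob("*.md")
    }
    # README.md documents the docs tree itself and is not a site page.
    on_disk.discard("README")
    return sorted(on_disk - listed)
```

Calling `unlisted_pages("docs")` from the repository root would list pages that exist on disk but would not appear in the sidebar, under the assumptions above.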
diff --git a/docs/advanced/architecture-support.md b/docs/advanced/architecture-support.md
@@ -2,9 +2,6 @@
 title: Backends Beyond Megatron
 description: Embed HuggingFace implementations as black-box modules inside Megatron's parallel pipeline.
 ---
-
-# Backends Beyond Megatron
-
 Adding a new architecture (such as Qwen3-Next's Gated-Delta-Net) directly to
 Megatron-LM's native code path is invasive. Miles takes a different approach:
 wrap the model's official HuggingFace implementation as a black-box module and
@@ -33,7 +30,6 @@ starts from `get_gpt_decoder_block_spec`, then for the layers whose HF
 params={"args": args})` (referenced from `miles_plugins/models/`):
 
 ```python
-# miles_plugins/models/qwen3_next.py (simplified excerpt)
 transformer_layer_spec = get_gpt_decoder_block_spec(config, **kwargs)
 ...
 for layer_id in range(num_layers_to_build):
@@ -122,10 +118,8 @@ of the model is bf16. Qwen3.5's `A_log` is the canonical example. Rounding it
 to bf16 makes Megatron-side activations diverge from SGLang-side rollout,
 causing precision drift.
 
-The canonical cast point is Megatron's `Float16Module`, which (per the
-docstring on `enforce_marked_param_dtypes` in
-`miles/backends/megatron_utils/fp32_param_utils.py`) "unconditionally casts
-every floating-point parameter to bf16/fp16 at wrap time". The mbridge
+The canonical cast point is Megatron's `Float16Module`, which casts
+every floating-point parameter to bf16/fp16 at wrap time. The mbridge
 weight-conversion path (`_weight_to_mcore_format` and friends) is the
 other place fp32 weights can be silently downcast. Two steps are required
 to keep tagged params in fp32.

diff --git a/docs/advanced/fault-tolerance.md b/docs/advanced/fault-tolerance.md
@@ -2,32 +2,29 @@
 title: Fault Tolerance
 description: Rollout-side health checks and engine recovery, gated by --use-fault-tolerance.
 ---
-
-# Fault Tolerance
-
-`--use-fault-tolerance` enables Miles's rollout-side fault-tolerance machinery.
-It gates two code paths:
+The `--use-fault-tolerance` flag enables Miles's rollout-side
+fault-tolerance machinery. It gates two code paths:
 
 1. A `RolloutHealthMonitor` thread per server group, started in
-   `miles/ray/rollout.py:379`, which periodically heart-beats each SGLang
+   `miles/ray/rollout.py`, which periodically heart-beats each SGLang
    engine.
 2. A recovery hook in the trainer's weight-update step
-   (`miles/backends/megatron_utils/actor.py:500`), which restarts engines
+   (`miles/backends/megatron_utils/actor.py`), which restarts engines
    that the health monitor has killed.
 
 ```bash
 --use-fault-tolerance
 ```
 
 The flag is `action="store_true"`, default `False`
-(`miles/utils/arguments.py:528`).
+(`miles/utils/arguments.py`).
 
 ## Health monitor
 
 `RolloutHealthMonitor` (`miles/utils/health_monitor.py`) runs in a daemon
 thread. Lifecycle: `start` (called once during init), `pause` and `resume`
 (called when engines offload / onload), `stop` (called during dispose).
-`pause` / `resume` are wired up in `miles/ray/rollout.py:497, 501` and called
+`pause` / `resume` are wired up in `miles/ray/rollout.py` and called
 around offload / onload events.
 
 Each loop iteration does:
@@ -37,7 +34,7 @@ Each loop iteration does:
 2. For every active engine in the group, call `engine.health_generate.remote(timeout=self._check_timeout)`.
 3. If the call raises, run `_kill_engine`: `engine.shutdown.remote()`,
    `ray.kill(engine)`, and the engine slot is set to `None`
-   (`miles/utils/health_monitor.py:160-180`).
+   (`miles/utils/health_monitor.py`).
 4. Sleep `--rollout-health-check-interval` seconds, then repeat.
 
 ### Flags
@@ -52,14 +49,14 @@ Each loop iteration does:
 
 When `--use-fault-tolerance` is on, `MegatronActor.update_weights` calls
 `rollout_manager.recover_updatable_engines` on rank 0 before each weight
-update (`miles/backends/megatron_utils/actor.py:500`).
+update (`miles/backends/megatron_utils/actor.py`).
 
-`recover_updatable_engines` (`miles/ray/rollout.py:513`):
+`recover_updatable_engines` (`miles/ray/rollout.py`):
 
 1. Pauses health monitoring.
 2. Calls `srv.recover()` on the updatable server.
 
-`srv.recover()` (`miles/ray/rollout.py:263`):
+`srv.recover()` (`miles/ray/rollout.py`):
 
 1. Finds engine slots set to `None` (killed by the health monitor).
 2. Calls `start_engines` for each affected group.
@@ -72,14 +69,14 @@ the new engines and the next weight transfer proceeds normally.
 
 When `--update-weight-transfer-mode p2p` is on, every P2P transfer is
 bounded by `--p2p-transfer-timeout` (default `30.0`s, defined in
-`miles/utils/arguments.py:519`; consumed at
-`miles/backends/megatron_utils/update_weight/update_weight_from_distributed/p2p.py:73`).
+`miles/utils/arguments.py`; consumed at
+`miles/backends/megatron_utils/update_weight/update_weight_from_distributed/p2p.py`).
 On timeout the failed transfer is logged (`[P2P] Transfer future failed: ...`)
 in `p2p_transfer_utils.py`. There is no automatic retry or automatic
 broadcast-mode fallback in the source today.
 
 ## Dumper-mode interaction
 
-In dumper mode (`miles/utils/arguments.py:2102`), Miles forces
+In dumper mode (`miles/utils/arguments.py`), Miles forces
 `use_fault_tolerance = False` and `rollout_health_check_interval = 1e18`
 to keep heartbeats from firing.
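
The health-monitor loop and the recovery step described in `docs/advanced/fault-tolerance.md` can be sketched roughly as follows. This is a toy model under stated assumptions, not the Miles implementation: it uses plain threads and local method calls instead of Ray actors, `HealthMonitorSketch` and `recover` are hypothetical names, and the first step of each loop iteration (not shown in this hunk) is left out.

```python
import threading


class HealthMonitorSketch:
    """Toy stand-in for the loop described above; not the Miles implementation."""

    def __init__(self, engines, check_interval, check_timeout):
        self.engines = engines  # list of engine objects, or None for killed slots
        self._check_interval = check_interval
        self._check_timeout = check_timeout
        self._stop = threading.Event()
        self._paused = threading.Event()
        self._thread = None

    def check_once(self):
        """Probe every live engine; clear the slot of any engine that fails."""
        for slot, engine in enumerate(self.engines):
            if engine is None:
                continue
            try:
                engine.health_generate(timeout=self._check_timeout)
            except Exception:
                self._kill_engine(slot)

    def _kill_engine(self, slot):
        engine = self.engines[slot]
        try:
            engine.shutdown()
        except Exception:
            pass  # the engine may already be unreachable
        self.engines[slot] = None  # recovery looks for None slots

    def _loop(self):
        while not self._stop.is_set():
            if not self._paused.is_set():
                self.check_once()
            # Event.wait doubles as an interruptible sleep.
            self._stop.wait(self._check_interval)

    def start(self):
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def pause(self):
        self._paused.set()

    def resume(self):
        self._paused.clear()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()


def recover(engines, start_engine):
    """Refill slots the monitor cleared; return the indices that were restarted."""
    restarted = []
    for slot, engine in enumerate(engines):
        if engine is None:
            engines[slot] = start_engine(slot)
            restarted.append(slot)
    return restarted
```

As in the documented flow, a caller would invoke `pause()` before `recover(...)` so the monitor does not probe engines while they are being restarted.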
diff --git a/docs/advanced/fp8-low-precision.md b/docs/advanced/fp8-low-precision.md
@@ -2,9 +2,6 @@
 title: Low Precision RL
 description: Unified low-precision pipelines for RL — block-wise FP8, MXFP8, and NVFP4 across rollout and training.
 ---
-
-# Low Precision RL
-
 A common failure mode in MoE RL is precision drift between training and
 inference. Pipelines that train in BF16 and serve in FP8 accumulate per-layer
 numerical disagreement, which compounds into divergent log-probabilities and
@@ -82,7 +79,6 @@ recipe to use on Hopper, and the recipe DeepSeek-V3 / DeepSeek-R1 ship in.
 Block layout is 128×128 with FP32 scales.
 
 ```bash
-# Megatron / TransformerEngine
 --transformer-impl transformer_engine
 --bf16
 --fp8-format e4m3

diff --git a/docs/advanced/index.md b/docs/advanced/index.md
@@ -2,43 +2,40 @@
 title: Advanced Features
 description: Systems-level features for large-scale and long-running RL.
 ---
-
-# Advanced Features
-
 This section covers the Miles features that the Core-features section of the
 homepage points at: low-precision training (FP8 / MXFP8 / INT4 QAT), Rollout
 Routing Replay for MoE, speculative decoding, and LoRA training and serving.
 
 <CardGroup cols={2}>
 
-  <Card title="Low Precision RL" icon="bolt" href="fp8-low-precision">
+  <Card title="Low Precision RL" icon="bolt" href="/advanced/fp8-low-precision">
 
     The unified FP8 path: matched quantization between training and inference,
     BF16 backward and master weights.
 
   </Card>
 
-  <Card title="INT4 QAT" icon="microchip" href="int4-qat">
+  <Card title="INT4 QAT" icon="microchip" href="/advanced/int4-qat">
 
     W4A16 quantization-aware training for fitting large models on a single
     8-GPU node.
 
   </Card>
 
-  <Card title="Rollout Routing Replay (R3)" icon="network-wired" href="miles-router">
+  <Card title="Rollout Routing Replay (R3)" icon="network-wired" href="/advanced/miles-router">
 
     Capture expert routing during inference and replay during training. The
     mechanism that keeps MoE RL stable.
 
   </Card>
 
-  <Card title="Speculative Decoding" icon="rocket" href="speculative-decoding">
+  <Card title="Speculative Decoding" icon="rocket" href="/advanced/speculative-decoding">
 
     Draft + target speculative rollout, with online MTP-SFT for the draft.
 
   </Card>
 
-  <Card title="LoRA Training and Serving" icon="sliders" href="lora">
+  <Card title="LoRA Training and Serving" icon="sliders" href="/advanced/lora">
 
     Train LoRA adapters with SFT or RL and serve them through SGLang from the
     same checkpoint.

diff --git a/docs/advanced/int4-qat.md b/docs/advanced/int4-qat.md
@@ -2,9 +2,6 @@
 title: INT4 Quantization-Aware Training
 description: Fit large models on a single 8-GPU node by training with W4A16 quantization in the loop.
 ---
-
-# INT4 W4A16 Quantization-Aware Training
-
 When the model is large enough that even FP8 will not fit on one node, the
 options are spreading across more nodes (and paying cross-node bandwidth) or
 quantizing further. Miles ships an INT4 W4A16 quant-aware-training pipeline.
@@ -83,10 +80,10 @@ so the KL anchor stays full-precision.
 
 ## Pairs with
 
-* [R3](miles-router.md). Keeps MoE routing stable across the quantized forward.
-* [P2P weight transfer](p2p-weight-transfer.md). INT4 weights are 4× smaller,
+* [R3](/advanced/miles-router). Keeps MoE routing stable across the quantized forward.
+* [P2P weight transfer](/advanced/p2p-weight-transfer). INT4 weights are 4× smaller,
   so weight sync transfers less data.
-* [Speculative decoding](speculative-decoding.md). Compounds for end-to-end
+* [Speculative decoding](/advanced/speculative-decoding). Compounds for end-to-end
   rollout speedup.
 
 ## When QAT is not appropriate

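The link changes in this hunk follow the convention stated in `docs/README.md`: absolute paths, no `.md` extension. A minimal sketch of that rewrite is shown here, assuming the page's directory is known; `absolutize_links` is a hypothetical helper, not a script in this PR. It does not handle the directory renames that the PR also performs (for example `get_started/quick_start` becoming `getting-started/quick-start`).

```python
import posixpath
import re

# Matches inline Markdown links whose target is a relative .md file,
# optionally followed by a #fragment. External, absolute, and
# fragment-only links are skipped by the negative lookahead.
_MD_LINK = re.compile(r"\]\((?!https?://|/|#)([^)\s#]+)\.md(#[^)\s]*)?\)")


def absolutize_links(text, page_dir):
    """Rewrite relative .md links in text to absolute, extension-free paths."""

    def replace(match):
        # Resolve the target against the page's directory, then normalize "..".
        target = posixpath.normpath(posixpath.join("/", page_dir, match.group(1)))
        return f"]({target}{match.group(2) or ''})"

    return _MD_LINK.sub(replace, text)
```

For a page under `advanced/`, `absolutize_links("[R3](miles-router.md)", "advanced")` yields the same target as the `+` line above.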
diff --git a/docs/advanced/lora.md b/docs/advanced/lora.md
@@ -2,23 +2,17 @@
 title: LoRA Training and Serving
 description: Train LoRA adapters with miles SFT or RL recipes and serve them through SGLang from the same checkpoint.
 ---
-
-# LoRA Training and Serving
-
 Miles supports LoRA adapters for both SFT and RL recipes. Adapters trained by
 miles load directly into SGLang for rollout, so there is no separate merge or
 conversion step in the training-serving loop.
 
-This page is a stub; the full LoRA tutorial is being written. In the meantime,
-the pieces below are enough to get a recipe running.
-
 ## Example launchers
 
 The canonical LoRA recipes live under
 [`examples/lora/`](https://github.com/radixark/miles/tree/main/examples/lora) in
 the miles repo:
 
-- `examples/lora/run-qwen2.5-0.5B-megatron-lora.sh` — small dense, single GPU.
+- `examples/lora/run-qwen2.5-0.5B-megatron-lora.sh` — small dense model, single GPU.
 - `examples/lora/run-qwen3-4B-megatron-lora.sh` — Qwen3-4B, RL with LoRA.
 - `examples/lora/run-gpt-oss-20B-megatron-moe-lora.sh` — MoE example.
 
@@ -35,12 +29,15 @@ the miles repo:
 | `--lora-adapter-path` | Path to a pre-trained adapter to resume from. |
 | `--lora-sync-from-tensor` | Sync adapter weights to SGLang via in-memory tensors instead of a file round-trip. |
 
-Two existing arguments also have LoRA-specific requirements that are easy to
-miss: the launcher has to pass `--megatron-to-hf-mode bridge` (the LoRA path
-goes through Megatron-Bridge's PEFT integration; the default `raw` converter
-does not understand LoRA layers), and the Ray job has to run with
-`--colocate`. Distributed (PD-disaggregated) rollout with LoRA is not
-supported today.
+<Warning>
+Two existing arguments are easy to miss when configuring LoRA:
+
+- **`--megatron-to-hf-mode bridge`** is required. The LoRA path goes through
+  Megatron-Bridge's PEFT integration; the default `raw` converter does not
+  understand LoRA layers.
+- **`--colocate`** is required. Distributed (PD-disaggregated) rollout with
+  LoRA is not supported today.
+</Warning>
 
 ## MoE
 
@@ -75,11 +72,14 @@ reason.
   drives `train.py`.
 * **Low-precision training**: the LoRA branch follows the surrounding
   precision, so block-wise FP8, MXFP8, and INT4 QAT recipes are compatible.
-  See [Low Precision RL](fp8-low-precision.md) and [INT4 QAT](int4-qat.md).
-* **`--target-modules` is mandatory** when `--lora-rank > 0`. There is no
-  auto-detection; the launcher asserts at startup.
-* **Single adapter per run**: multi-LoRA training in a single job is not
-  implemented today.
+  See [Low Precision RL](/advanced/fp8-low-precision) and [INT4 QAT](/advanced/int4-qat).
+* **Target modules**: `--target-modules` is required whenever
+  `--lora-rank > 0`. There is no auto-detection; the launcher asserts at
+  startup.
+* **Single adapter per run**: only one set of `--lora-*` arguments is
+  honored per training job. Training multiple LoRA adapters in parallel
+  within a single `train.py` run is not implemented today — run separate
+  jobs if you need multiple adapters.
 
 ## Internals
 
@@ -93,7 +93,7 @@ The bridge between Megatron's LoRA path and SGLang adapter loading is in:
 - `miles/backends/megatron_utils/checkpoint.py` — adapter-aware save and load.
 - `miles/backends/megatron_utils/update_weight/update_weight_from_tensor.py`
   — colocate-mode weight sync from the trainer's LoRA tensors into the SGLang
-  rollout engine. We will merge this [PR](https://github.com/radixark/miles/pull/988) soon to support disaggregate mode.
+  rollout engine. Disaggregate-mode weight sync is not supported yet.
 
 A worked tutorial covering checkpoint conversion, SGLang adapter loading, and
 LoRA-specific evaluation will land here in a future doc pass.
diff --git a/docs/advanced/miles-router.md b/docs/advanced/miles-router.md
@@ -1,22 +1,19 @@
 ---
 title: Rollout Routing Replay (R3)
-description: Capture expert routing during inference and replay it during training so MoE RL is stable.
+description: Capture expert routing during inference and replay it during training to stabilize RL.
 ---
-
-# Rollout Routing Replay (R3)
-
 Rollout Routing Replay (R3) records the expert routing decisions made during
 inference and replays them during training, producing bit-identical expert
 allocation between rollout and training.
 
-## Why MoE RL was previously unstable
+## Why MoE RL is unstable without R3
 
 For each token, an MoE router picks `top-k` experts. The choice depends on the
-input through a soft router and a top-k op. In production the router is a
+input through a soft router and a top-k operation. In production the router is a
 learned `nn.Linear` with non-deterministic kernels and FP8 quantization, so tiny
 numerical differences flip routes at the per-layer, per-token level.
 
-Without R3:
+An example without R3:
 
 * Rollout selects experts `{2, 7}` for token 314.
 * Training (with the same weights but slightly different precision and kernels)
@@ -25,8 +22,8 @@ Without R3:
   layers, tens of thousands of tokens, and thousands of training steps, the
   policy diverges.
 
-With R3 the inference router's choice is what training also uses. Numerical
-noise no longer flips routes.
+With R3, the trainer replays the rollout router's expert assignments verbatim,
+so numerical noise no longer flips routes.
 
 ## How R3 wires up
 
@@ -50,7 +47,7 @@ forward pass so recorded routes are used instead of recomputed ones.
 ## Memory cost
 
 `(num_tokens - 1) × num_layers × top_k × 4 bytes` (int32 per element, see
-`miles/utils/types.py:29`). For a 32K-token sequence, 60 layers, and
+`miles/utils/types.py`). For a 32K-token sequence, 60 layers, and
 `top_k = 8`, that is roughly 60 MB per sample of routing metadata.
 
 ## When R3 is not required
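
Worked arithmetic for the Memory cost formula quoted earlier in this hunk, as a quick sanity check of the "roughly 60 MB" figure. The function name `routing_replay_bytes` is hypothetical, and "32K" is assumed to mean 32 × 1024 tokens.

```python
def routing_replay_bytes(num_tokens, num_layers, top_k, bytes_per_index=4):
    """(num_tokens - 1) x num_layers x top_k x 4 bytes (int32 per element)."""
    return (num_tokens - 1) * num_layers * top_k * bytes_per_index


size = routing_replay_bytes(num_tokens=32 * 1024, num_layers=60, top_k=8)
print(f"{size} bytes = {size / 1e6:.1f} MB = {size / 2**20:.1f} MiB")
```

Under these assumptions the result is 62,912,640 bytes, about 62.9 MB in decimal units or just under 60 MiB, which matches the rounded figure in the text.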