ROCm/support test_deepep_fp8: e2e docs, aiter/sglang patches, mori rollout harness on gfx950 #1320
…st case on AMD gfx950
Code Review
This pull request introduces support for the Mori EP FP8 backend and includes several bug fixes and optimizations. Specifically, it implements lazy importing of DeepSeek V3.2 and V4 encoders to maintain compatibility with older sglang builds, fixes an activation scale transposition issue in aiter's fused MoE kernels, and addresses a bug where the expert_mask is silently zeroed during CUDA-graph capture by rebuilding it on every forward pass. Additionally, it adds debugging probes, updates the Qwen3 MoE model to support the Mori backend, and provides setup documentation and end-to-end tests. The review feedback suggests appending the user's UID to temporary file paths in /tmp to prevent permission conflicts on multi-user clusters.
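The lazy-import idea mentioned in the review summary can be illustrated with a minimal sketch. The module path `sglang_hypothetical.encoding_dsv32`, the `encode` attribute, and both function names are placeholders chosen for illustration; they are not confirmed by this PR. Only the deferral pattern is the point: the optional dependency is resolved on first use, so importing the surrounding package does not fail on builds that lack it.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def load_optional_symbol(module_name: str, symbol: str):
    """Import `module_name` on first use and return `symbol`, or None if unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Older builds may not ship the module at all; report that as "missing".
        return None
    return getattr(module, symbol, None)


def encode_messages(messages):
    """Hypothetical caller: only touches the optional encoder when it is needed."""
    # Placeholder module path and attribute name, assumed for this sketch.
    encoder = load_optional_symbol("sglang_hypothetical.encoding_dsv32", "encode")
    if encoder is None:
        raise RuntimeError("encoder not available in this build")
    return encoder(messages)
```

With this shape, workloads that never call `encode_messages` are unaffected by a missing encoder module, and the failure surfaces only at the point where the encoder is actually required.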
+ "num_local_experts": self.num_local_experts,
+ }
+ blob["a1_nonzero"] = a1nz
+ path = f"/tmp/mori_fmoe_fail_r{rank}.pt"
When using shared temporary directories like /tmp for process-specific storage, it is recommended to append the user's UID (using _os.getuid()) to the path. This prevents permission conflicts and directory sharing issues on multi-user clusters.
path = f"/tmp/mori_fmoe_fail_r{rank}_{_os.getuid()}.pt"
References
- When using shared temporary directories (such as /tmp) for caching or process-specific storage, append the user's UID (e.g., using os.getuid()) to the path to avoid permission conflicts and directory sharing issues on multi-user clusters.
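The reviewer's suggestion can be sketched as a small helper. This is an illustration only: the function name `per_user_dump_path` and its parameters are hypothetical, and it uses `tempfile.gettempdir()` rather than a hard-coded `/tmp`. `os.getuid()` is available on POSIX systems only.

```python
import os
import tempfile


def per_user_dump_path(prefix: str, rank: int) -> str:
    """Build a debug-dump path that is unique per rank and per user."""
    uid = os.getuid()  # POSIX only; distinguishes users sharing one temp directory
    filename = f"{prefix}_r{rank}_{uid}.pt"
    return os.path.join(tempfile.gettempdir(), filename)


# Example: a path for rank 0 using one of the dump prefixes from the diff.
dump_path = per_user_dump_path("mori_fmoe_fail", 0)
```

Two users running the same rank on a shared node then write to different files instead of colliding on one path that only its first creator can overwrite.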
+ )
+ if amax0 == 0.0 and not getattr(self, "_mori_dumped", False):
+ self._mori_dumped = True
+ path = f"/tmp/mori_combine_fail_r{rank}.pt"
When using shared temporary directories like /tmp for process-specific storage, it is recommended to append the user's UID (using _os.getuid()) to the path. This prevents permission conflicts and directory sharing issues on multi-user clusters.
path = f"/tmp/mori_combine_fail_r{rank}_{_os.getuid()}.pt"
Summary
- `readme_deepep_fp8.md`: End-to-end guide for MI350–355 / gfx950, covering Docker launch (`rlsys/miles:MI350-355-latest`), dry-run and apply of `patches/aiter.patch` and `patches/sglang.patch` against `/sgl-workspace/aiter` and `/sgl-workspace/sglang`, mori editable install with `MORI_GPU_ARCHS=gfx950`, uccl.ep + deep_ep install without `install_deps.sh` (avoids overwriting ROCm torch), how to run the DeepEP FP8 test from the miles repo root, and Python 3.11+ (see note below).
- `patches/aiter.patch`, `patches/sglang.patch`: Checked-in diffs for downstream trees:
  - […] (`partial_transpose`); 1-stage and 2-stage `asm_stage1` paths.
  - `MILES_MORI_*` diagnostics; Qwen3 MoE block skips `forward_normal` when backend is mori.
- `tests/e2e/megatron/test_qwen3_30B_A3B/_common.py`: When `use_deepep` is set, the SGLang MoE A2A backend is `mori` (still `--sglang-deepep-mode auto`); optional `MILES_DEBUG_DISABLE_CUDA_GRAPH=1` adds `--sglang-disable-cuda-graph`; forwards `SGLANG_USE_AITER`, `SGLANG_MORI_*`, `MILES_MORI_*` from the parent env into Ray `extra_env_vars` so subprocesses see them.
- `run_test_deepep_fp8.sh`: Sets `PYTHONPATH`, `SGLANG_USE_AITER=1`, `SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16384`, unsets `SGLANG_DEEPEP_BF16_DISPATCH`, stops stale Ray, runs `test_deepep_fp8.py`.
- `deepseek_v32.py`, `deepseek_v4.py`: Lazy import of `encoding_dsv32` / `encoding_dsv4` so older SGLang images that do not export those symbols still import miles for non-DeepSeek workloads.

Test plan
- `git apply --check` both patches, then apply; install mori + uccl per readme (or use a prebuilt image that already matches). Confirm Python 3.11+ (or accept that 3.10 will hit `StrEnum` import errors until upgraded).
- `bash tests/e2e/megatron/test_qwen3_30B_A3B/run_test_deepep_fp8.sh`: confirm `test_deepep_fp8.py` completes (or document known infra limits if CI cannot run 8-GPU ROCm).
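For reviewers following the test plan, the environment forwarding that the Summary describes for `_common.py` might look roughly like the sketch below. The helper name `collect_forwarded_env` and the two constants are hypothetical; the PR's actual implementation is not shown here. The sketch only demonstrates selecting one exact variable name plus two prefixes from the parent environment.

```python
import os
from typing import Dict, Mapping

# Names taken from the PR summary; the constants themselves are illustrative.
FORWARD_EXACT = ("SGLANG_USE_AITER",)
FORWARD_PREFIXES = ("SGLANG_MORI_", "MILES_MORI_")


def collect_forwarded_env(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the subset of `environ` that should reach worker subprocesses."""
    return {
        key: value
        for key, value in environ.items()
        # str.startswith accepts a tuple, so both prefixes are checked at once.
        if key in FORWARD_EXACT or key.startswith(FORWARD_PREFIXES)
    }


# The result could then be merged into whatever mapping the launcher passes to
# its workers (the summary calls it `extra_env_vars`).
extra_env_vars = collect_forwarded_env()
```

Filtering by an explicit allow-list keeps unrelated variables from the parent shell out of the workers while still letting new `SGLANG_MORI_*` or `MILES_MORI_*` switches pass through without further code changes.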