ROCm/support test_deepep_fp8: e2e docs, aiter/sglang patches, mori rollout harness on gfx950 #1320

Draft
kailashg26 wants to merge 1 commit into radixark:main from kailashg26:kailash/rocm/test_deepep_fp8_stage-c-8-gpu-h100

@kailashg26

Summary

  • Docs (readme_deepep_fp8.md) — End-to-end guide for MI350–355 / gfx950: Docker launch (rlsys/miles:MI350-355-latest), dry-run and apply patches/aiter.patch and patches/sglang.patch against /sgl-workspace/aiter and /sgl-workspace/sglang, mori editable install with MORI_GPU_ARCHS=gfx950, uccl.ep + deep_ep install without install_deps.sh (avoids overwriting ROCm torch), how to run the DeepEP FP8 test from the miles repo root, and Python 3.11+ (see note below).
  • Patches (patches/aiter.patch, patches/sglang.patch) — Checked-in diffs for downstream trees:
    • AITER: FP8 per-1×128 scale layout for caller-quantized / padded mori dispatch buffers (transpose vs partial_transpose); 1-stage and 2-stage asm_stage1 paths.
    • SGLang: MoriEPMoE expert-mask rebuild (CUDA-graph / memory reuse); optional MILES_MORI_* diagnostics; Qwen3 MoE block skips forward_normal when backend is mori.
  • Megatron e2e harness (tests/e2e/megatron/test_qwen3_30B_A3B/_common.py) — When use_deepep, SGLang MoE A2A backend is mori (still --sglang-deepep-mode auto); optional MILES_DEBUG_DISABLE_CUDA_GRAPH=1 adds --sglang-disable-cuda-graph; forwards SGLANG_USE_AITER, SGLANG_MORI_*, MILES_MORI_* from the parent env into Ray extra_env_vars so subprocesses see them.
  • Runner (run_test_deepep_fp8.sh) — Sets PYTHONPATH, SGLANG_USE_AITER=1, SGLANG_MORI_NUM_MAX_DISPATCH_TOKENS_PER_RANK=16384, unsets SGLANG_DEEPEP_BF16_DISPATCH, stops stale Ray, runs test_deepep_fp8.py.
  • Chat templates (deepseek_v32.py, deepseek_v4.py): Lazy import of encoding_dsv32 / encoding_dsv4, so older SGLang images that do not export those symbols can still import miles for non-DeepSeek workloads.
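
The env-forwarding behavior described for the Megatron e2e harness can be sketched roughly as below. This is an illustration only, not the code in _common.py: the function name collect_forwarded_env and the constants EXACT_NAMES and PREFIXES are hypothetical, and only the variable names and prefixes come from the summary.

```python
import os
from typing import Mapping

# Hypothetical constants: the names and prefixes mirror the PR summary,
# but the real harness may organize them differently.
EXACT_NAMES = ("SGLANG_USE_AITER",)
PREFIXES = ("SGLANG_MORI_", "MILES_MORI_")


def collect_forwarded_env(parent_env: Mapping[str, str]) -> dict[str, str]:
    """Return the subset of parent_env meant to reach Ray subprocesses."""
    return {
        name: value
        for name, value in parent_env.items()
        # str.startswith accepts a tuple of prefixes.
        if name in EXACT_NAMES or name.startswith(PREFIXES)
    }


if __name__ == "__main__":
    # The resulting dict is what would be merged into extra_env_vars.
    extra_env_vars = collect_forwarded_env(os.environ)
    print(sorted(extra_env_vars))
```

Filtering by prefix rather than by a fixed list means a new SGLANG_MORI_* or MILES_MORI_* knob is forwarded without touching the harness, while unrelated variables stay out of the subprocess environment.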

Note (Python / enum.StrEnum): Use Python 3.11+ in the container for this workflow. On Python 3.10, from enum import StrEnum raises ImportError: cannot import name 'StrEnum' from 'enum' because enum.StrEnum exists only from 3.11 onward. Upgrade Python (or use a 3.11+ image) before importing miles or running the e2e test; details are in readme_deepep_fp8.md.
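
A preflight check along these lines could surface the version problem before any miles import is attempted. It is a minimal sketch and not part of this PR; the helper names supports_strenum and require_strenum are hypothetical.

```python
import sys


def supports_strenum(version_info=sys.version_info) -> bool:
    """True when the interpreter is 3.11+, where enum.StrEnum exists."""
    return tuple(version_info[:2]) >= (3, 11)


def require_strenum(version_info=sys.version_info) -> None:
    """Fail early with a readable message instead of a late ImportError."""
    if not supports_strenum(version_info):
        major, minor = version_info[:2]
        raise RuntimeError(
            f"Python {major}.{minor} lacks enum.StrEnum; "
            "use Python 3.11+ for this workflow."
        )


if __name__ == "__main__":
    require_strenum()
    print("Python version is new enough for enum.StrEnum")
```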


Test plan

  • In MI350–355 image (or matching layout): git apply --check both patches, then apply; install mori + uccl per readme (or use a prebuilt image that already matches). Confirm Python 3.11+ (or accept that 3.10 will hit StrEnum import errors until upgraded).
  • From miles repo root: bash tests/e2e/megatron/test_qwen3_30B_A3B/run_test_deepep_fp8.sh — confirm test_deepep_fp8.py completes (or document known infra limits if CI cannot run 8-GPU ROCm).
  • Smoke: import miles / run a small job that does not use DeepSeek V3.2/V4 templates — confirm no regression from lazy encoder imports.
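
The lazy encoder import that the smoke test exercises follows a common deferred-import pattern, sketched below under assumptions. The helpers load_symbol and render_with_encoder are hypothetical, the module and symbol names are passed in rather than guessed, and the actual change in deepseek_v32.py / deepseek_v4.py may look different.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def load_symbol(module_name: str, symbol: str):
    """Import module_name on first use and return one attribute from it."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is unavailable; this template may need a newer build"
        ) from exc
    try:
        return getattr(module, symbol)
    except AttributeError as exc:
        raise ImportError(f"{module_name} does not export {symbol}") from exc


def render_with_encoder(messages, module_name: str, symbol: str):
    """Resolve the encoder at call time, not when this file is imported."""
    encoder = load_symbol(module_name, symbol)
    return encoder(messages)
```

Because nothing is imported at module load, a build that lacks the encoder only fails when a DeepSeek template is actually rendered, which is the property the smoke item checks.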

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces support for the Mori EP FP8 backend and includes several bug fixes and optimizations. Specifically, it implements lazy importing of DeepSeek V3.2 and V4 encoders to maintain compatibility with older sglang builds, fixes an activation scale transposition issue in aiter's fused MoE kernels, and addresses a bug where the expert_mask is silently zeroed during CUDA-graph capture by rebuilding it on every forward pass. Additionally, it adds debugging probes, updates the Qwen3 MoE model to support the Mori backend, and provides setup documentation and end-to-end tests. The review feedback suggests appending the user's UID to temporary file paths in /tmp to prevent permission conflicts on multi-user clusters.

Comment thread patches/sglang.patch
+ "num_local_experts": self.num_local_experts,
+ }
+ blob["a1_nonzero"] = a1nz
+ path = f"/tmp/mori_fmoe_fail_r{rank}.pt"

medium

When using shared temporary directories like /tmp for process-specific storage, it is recommended to append the user's UID (using _os.getuid()) to the path. This prevents permission conflicts and directory sharing issues on multi-user clusters.

            path = f"/tmp/mori_fmoe_fail_r{rank}_{_os.getuid()}.pt"
References
  1. When using shared temporary directories (such as /tmp) for caching or process-specific storage, append the user's UID (e.g., using os.getuid()) to the path to avoid permission conflicts and directory sharing issues on multi-user clusters.
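
One way to apply this suggestion to the dump paths is a small helper like the one below. It is a sketch under assumptions: per_user_dump_path is a hypothetical name, and using tempfile.gettempdir() instead of a hard-coded /tmp is a choice made here, not something the patch or the review prescribes.

```python
import os
import tempfile


def per_user_dump_path(prefix: str, rank: int) -> str:
    """Build a dump path in the temp dir that includes the rank and the UID."""
    # os.getuid() exists on Unix only; fall back to a fixed tag elsewhere.
    uid = os.getuid() if hasattr(os, "getuid") else "nouid"
    return os.path.join(tempfile.gettempdir(), f"{prefix}_r{rank}_{uid}.pt")


if __name__ == "__main__":
    print(per_user_dump_path("mori_fmoe_fail", 0))
```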

Comment thread patches/sglang.patch
+ )
+ if amax0 == 0.0 and not getattr(self, "_mori_dumped", False):
+ self._mori_dumped = True
+ path = f"/tmp/mori_combine_fail_r{rank}.pt"

medium

When using shared temporary directories like /tmp for process-specific storage, it is recommended to append the user's UID (using _os.getuid()) to the path. This prevents permission conflicts and directory sharing issues on multi-user clusters.

                path = f"/tmp/mori_combine_fail_r{rank}_{_os.getuid()}.pt"
References
  1. When using shared temporary directories (such as /tmp) for caching or process-specific storage, append the user's UID (e.g., using os.getuid()) to the path to avoid permission conflicts and directory sharing issues on multi-user clusters.

@kailashg26 marked this pull request as draft on June 11, 2026, 06:55