Skip to content

[AMD] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark#1570

Merged
functionstackx merged 5 commits into
mainfrom
chang/qwen35-fp8-mi355x-sglang-disagg-pr-1
May 28, 2026
Merged

[AMD] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark#1570
functionstackx merged 5 commits into
mainfrom
chang/qwen35-fp8-mi355x-sglang-disagg-pr-1

Conversation

@ChangLiu0709
Copy link
Copy Markdown
Collaborator

@ChangLiu0709 ChangLiu0709 commented May 27, 2026

## rebase of @chunfangamd & @ChangLiu0709 's enablement branch https://github.com/SemiAnalysisAI/InferenceX/tree/refs/heads/chun-chang/sglang-disagg-qwen3.5

Summary

  • Add SGLang disaggregated prefill–decode (PD) benchmark recipe for Qwen/Qwen3.5-397B-A17B-FP8 on MI355X (MoRI backend), mirroring the existing DeepSeek-R1 disagg structure with model-specific tuning (no MTP, no DeepEP dispatch)
  • New CI config qwen3.5-fp8-mi355x-sglang-disagg in amd-master.yaml: 1P+1D TP8/EP1 smoke sweep for ISL 1K1K and 8K1K, concurrency 8–512, image lmsysorg/sglang-rocm:v0.5.11-rocm700-mi35x-20260511
  • New Qwen3.5-397B-A17B-FP8 entry in benchmarks/multi_node/amd_utils/models.yaml (MoRI transfer, FP8 KV cache, aiter attention, DP / no-DP / EP-only prefill and decode profiles)
  • New multinode entry script benchmarks/multi_node/qwen3.5_fp8_mi355x_sglang-disagg.sh (maps CI sweep params → shared amd_utils/submit.sh; optional NODE_LIST for local smoke on fixed nodes)
  • runners/launch_mi355x-amds.sh: KEEP_LOGS=1 skips EXIT trap log cleanup for local debugging
  • No changes to shared job.slurm / submit.sh / server.sh (uses existing main disagg plumbing)

Test plan

  • Verify CI passes: amd-master.yaml loads and validates the new qwen3.5-fp8-mi355x-sglang-disagg entry (scenarios.fixed-seq-len schema)
  • Confirm existing SGLang disagg benchmarks are unaffected (this PR does not modify shared amd_utils scripts)
  • Smoke on mia1: qwen3.5-fp8-mi355x-sglang-disagg 1P+1D, ISL/OSL 1024/1024, conc=64 (target: requests complete, results JSON produced)
  • Optional local: re-run with KEEP_LOGS=1 and confirm Slurm logs are retained under benchmark_logs/

Out of scope (follow-up PRs)

  • utils/bench_serving/benchmark_serving.py Qwen3.5 tokenizer fallback
  • GLM-5 disagg config, setup_deps.sh image patches, v0.5.12 + MoRI conn.py overlay
  • Upstream sglang fix for Qwen3.5 + MoRI + enable-dp-attention (is_deepep_class_backend / shared-expert slot accounting)

Note

Risk
Low: only adds new config, model block, and launch script; reuses unmodified amd_utils from main.

Overview
First InferenceX recipe for Qwen3.5-397B-A17B-FP8 PD disaggregation on MI355X using the existing multinode SGLang + MoRI stack. Scope is intentionally narrow (PR-1): CI config, model flags, launch script, and KEEP_LOGS runner tweak—no shared amd_utils edits, no in-container patches, no benchmark_serving changes.


Note

Low Risk
Additive benchmark config, models.yaml block, and launch script only; reuses existing multinode disagg plumbing with no shared amd_utils core changes.

Overview
Adds Qwen/Qwen3.5-397B-A17B-FP8 prefill–decode disaggregation on MI355X via a new CI key qwen3.5-fp8-mi355x-sglang-disagg (sglang-disagg, mi355x-disagg, image lmsysorg/sglang-rocm:v0.5.11-rocm700-mi35x-20260511). Sweeps 1P+1D, TP8/EP1, no MTP, conc 8–512 for 1k/1k and 8k/1k; both rows set dp-attn: false because MoRI + DP-attention currently trips a Qwen3.5 MoE expert-count assertion in upstream SGLang.

Adds Qwen3.5-397B-A17B-FP8 in benchmarks/multi_node/amd_utils/models.yaml (MoRI transfer, FP8 KV, aiter attention, prefill/decode tuning profiles) and benchmarks/multi_node/qwen3.5_fp8_mi355x_sglang-disagg.sh to map matrix env vars into existing amd_utils/submit.sh. Documents the change in perf-changelog.yaml.

runners/launch_mi355x-amds.sh: KEEP_LOGS=1 skips the EXIT trap that deletes Slurm logs (local debug only). Shared submit.sh / job.slurm are unchanged.

Reviewed by Cursor Bugbot for commit 224d444. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@ChangLiu0709 ChangLiu0709 marked this pull request as ready for review May 27, 2026 14:06
@ChangLiu0709 ChangLiu0709 requested a review from a team May 27, 2026 14:06
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 27, 2026

Claude finished @ChangLiu0709's task in 2m 57s —— View job


Review: [AMD] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark

  • Read CLAUDE.md and repository guidelines
  • Read all changed files
  • Validate master config & perf-changelog
  • Validate model prefix in config keys
  • Validate container image accessibility
  • Validate benchmark script code style
  • Check shared amd_utils changes for correctness
  • Post review findings

Findings

1 blocking issue found. See inline comment.

🔴 perf-changelog.yaml not updated.github/configs/amd-master.yaml was modified (new qwen3.5-fp8-mi355x-sglang-disagg entry added) but perf-changelog.yaml has no corresponding entry. A new entry must be appended to the end of perf-changelog.yaml.

What looks good

  • New config (amd-master.yaml): follows the existing disagg schema (scenarios.fixed-seq-len), valid public Docker Hub image (lmsysorg/sglang-rocm:v0.5.11-rocm700-mi35x-20260511), model-prefix: qwen3.5 is consistent with existing Qwen entries.
  • New benchmark script (qwen3.5_fp8_mi355x_sglang-disagg.sh): mirrors the existing dsr1_fp8_mi355x_sglang-disagg.sh structure exactly, with the addition of ${NODE_LIST:-} for local smoke testing. Clean and minimal.
  • Sudo auto-detection (job.slurm): well-implemented dual-path — detects passwordless sudo on both the submit node ($SUDO) and compute nodes ($_SUDO), falls back gracefully. Backward-compatible since CI runners have passwordless sudo.
  • Opt-in decode parallelism flags (server.sh): ENABLE_DISAGG_DECODE_PARALLELISM_FLAGS defaults to false, so older ROCm images aren't affected. The is_truthy() helper is thorough.
  • Model config (models.yaml): Qwen3.5-397B-A17B-FP8 entry with mtp_flags: "" correctly disables MTP, --moe-dense-tp-size 1 and --disaggregation-transfer-backend mori are Qwen-specific tuning. DP/no-DP/EP-only profiles are well-structured.
  • KEEP_LOGS=1 (launch_mi355x-amds.sh): simple, clean opt-in for retaining Slurm logs during local debugging.

Comment thread .github/configs/amd-master.yaml
Comment on lines +56 to +60
if [[ "${KEEP_LOGS:-0}" == "1" ]]; then
trap '' EXIT
else
trap 'sudo rm -rf "$BENCHMARK_LOGS_DIR" 2>/dev/null || true' EXIT
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new KEEP_LOGS=1 path only disarms the EXIT trap (lines 56-60), but line 178's unconditional sudo rm -rf "$BENCHMARK_LOGS_DIR" 2>/dev/null || true (and the $BENCHMARK_LOGS_DIR/logs rm at line 167) still fires on the happy path. So a successful run with KEEP_LOGS=1 and no GITHUB_ACTIONS set wipes slurm_job-${JOB_ID}.{out,err} anyway — exactly the local-debugging use case the PR description claims to support. Fix: gate the rms at lines 167 and 178 on "${KEEP_LOGS:-0}" != "1" as well.

Extended reasoning...

What the bug is

The PR documents KEEP_LOGS=1 as the local-debugging knob that "skips EXIT trap log cleanup" so users can inspect Slurm logs under benchmark_logs/ after a run. The implementation (runners/launch_mi355x-amds.sh:56-60) replaces the EXIT trap with trap '' EXIT when KEEP_LOGS=1:

if [[ "${KEEP_LOGS:-0}" == "1" ]]; then
    trap '' EXIT
else
    trap 'sudo rm -rf "$BENCHMARK_LOGS_DIR" 2>/dev/null || true' EXIT
fi

That correctly preserves logs when the script exits early (e.g. before slurm finishes). But the script ALSO has two inline sudo rm -rf calls at the bottom of the IS_MULTINODE=true branch that are unrelated to the trap:

  • Line 167: sudo rm -rf "$BENCHMARK_LOGS_DIR/logs" 2>/dev/null || true
  • Line 178: sudo rm -rf "$BENCHMARK_LOGS_DIR" 2>/dev/null || true

Neither is gated on KEEP_LOGS. On the happy path, both fire after the result-collection phase, so KEEP_LOGS=1 does not actually preserve logs after a successful run — only after an early/failed exit. The PR's own test-plan item ("re-run with KEEP_LOGS=1 and confirm Slurm logs are retained under benchmark_logs/") fails.

Step-by-step proof

Walk through the local-debug case: KEEP_LOGS=1, no GITHUB_ACTIONS set, IS_MULTINODE=true, successful run.

  1. Line 56-60: KEEP_LOGS=1trap '' EXIT disarms the EXIT trap. ✓
  2. Lines 62-86: Slurm job submitted, log file $BENCHMARK_LOGS_DIR/slurm_job-${JOB_ID}.out (and .err) gets populated as the job runs. ✓
  3. Lines 88-158: Job completes, result JSONs are copied to $GITHUB_WORKSPACE — but the slurm_job-*.{out,err} files themselves are still in $BENCHMARK_LOGS_DIR. ✓
  4. Line 167: sudo rm -rf "$BENCHMARK_LOGS_DIR/logs" — removes the per-job subdir. Unconditional.
  5. Lines 170-175: if [[ -n "${GITHUB_ACTIONS:-}" ]] block is skipped because we're running locally — so the cp of slurm_job-*.{out,err} to $ARTIFACT_DIR never happens.
  6. Line 178: sudo rm -rf "$BENCHMARK_LOGS_DIR" 2>/dev/null || true — wipes the entire directory, including the slurm out/err files the user wanted to inspect. This is the bug.
  7. Script exits cleanly. EXIT trap is no-op (good — but irrelevant, the damage is already done by step 6).

User looks under benchmark_logs/ — empty. Feature does not work as advertised in its primary use case.

Why existing code doesn't prevent it

The trap disarmament only matters for paths where control exits before reaching line 178 (e.g. the exit 1 at line 81 when the job dies before creating the log file). On any successful run, control always reaches line 178 because there's no early return. The KEEP_LOGS check was added in only one place when it needed to be added in three.

The GH Actions branch at lines 170-175 partially masks the issue in CI: when GITHUB_ACTIONS is set, the slurm out/err files are first copied to $ARTIFACT_DIR ($GITHUB_WORKSPACE/benchmark_artifacts) before the final rm wipes $BENCHMARK_LOGS_DIR. So CI still gets logs uploaded as artifacts — but local users (the entire target audience for KEEP_LOGS=1) get nothing.

Impact

Documented user-facing feature (KEEP_LOGS=1 for local debugging) does not work on the happy path. The PR's own test plan checkbox cannot be satisfied as written. Severity is normal because the feature is new in this PR and ships broken — not a pre-existing issue.

Fix

Gate both inline rms on KEEP_LOGS, mirroring the trap setup:

if [[ "${KEEP_LOGS:-0}" != "1" ]]; then
    sudo rm -rf "$BENCHMARK_LOGS_DIR/logs" 2>/dev/null || true
fi

…and similarly at line 178. (Or factor into a helper.) The line-167 rm is arguably also safe to skip on KEEP_LOGS=1 since $BENCHMARK_LOGS_DIR/logs is the per-job subdir with the in-container result tree, which is exactly what a local debugger would want to inspect alongside the slurm out/err files.

Comment thread benchmarks/multi_node/amd_utils/server.sh Outdated
Comment thread .github/configs/amd-master.yaml
@ChangLiu0709 ChangLiu0709 force-pushed the chang/qwen35-fp8-mi355x-sglang-disagg-pr-1 branch from a562e2a to 60bd801 Compare May 27, 2026 14:35
ChangLiu0709 pushed a commit that referenced this pull request May 27, 2026
Required when adding new amd-master.yaml benchmark configs (PR #1570).

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment thread runners/launch_mi355x-amds.sh
Comment thread benchmarks/multi_node/qwen3.5_fp8_mi355x_sglang-disagg.sh
functionstackx pushed a commit that referenced this pull request May 27, 2026
Required when adding new amd-master.yaml benchmark configs (PR #1570).

Co-authored-by: Cursor <cursoragent@cursor.com>
@functionstackx functionstackx force-pushed the chang/qwen35-fp8-mi355x-sglang-disagg-pr-1 branch from 20e2974 to b5e21e8 Compare May 27, 2026 16:08
- { tp: 4, ep: 1, conc-start: 4, conc-end: 256, spec-decoding: mtp }
- { tp: 8, ep: 1, conc-start: 4, conc-end: 256, spec-decoding: mtp }

qwen3.5-fp8-mi355x-sglang-disagg:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chang, we will need to rework this PR to keep the git commit history

@chunfangamd chunfangamd marked this pull request as draft May 27, 2026 16:45
functionstackx pushed a commit that referenced this pull request May 27, 2026
Required when adding new amd-master.yaml benchmark configs (PR #1570).

Co-authored-by: Cursor <cursoragent@cursor.com>
@functionstackx functionstackx force-pushed the chang/qwen35-fp8-mi355x-sglang-disagg-pr-1 branch from b5e21e8 to b564db2 Compare May 27, 2026 21:37
@functionstackx functionstackx marked this pull request as ready for review May 27, 2026 21:42
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 27, 2026

Claude finished @functionstackx's task in 2m 15s —— View job


Review: [AMD] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark

  • Read CLAUDE.md and repository guidelines
  • Read all changed files and validate against review checklist
  • Validate master config & perf-changelog
  • Validate model prefix in config keys
  • Validate container image accessibility
  • Validate benchmark script code style
  • Check KEEP_LOGS implementation correctness
  • Post review findings

Findings

1 blocking issue found. See inline comment.

🔴 KEEP_LOGS=1 doesn't guard inline cleanup — The EXIT trap is correctly gated (runners/launch_mi355x-amds.sh:73-74), but the inline sudo rm -rf "$BENCHMARK_LOGS_DIR/logs" at line 191 still fires unconditionally on successful runs. This wipes the per-job result tree even with KEEP_LOGS=1, making the feature ineffective for its primary use case (local debugging of successful runs). Gate line 191 with if [[ "${KEEP_LOGS:-0}" != "1" ]] — see inline comment for the suggested fix.

What looks good

  • Master config (amd-master.yaml): Valid schema, public Docker Hub image (lmsysorg/sglang-rocm:v0.5.11-rocm700-mi35x-20260511), model-prefix: qwen3.5 consistent with existing entries.
  • perf-changelog.yaml: Entry correctly appended at the end of the file with proper config-key and PR link.
  • Benchmark script (qwen3.5_fp8_mi355x_sglang-disagg.sh): Mirrors the existing dsr1_fp8_mi355x_sglang-disagg.sh structure cleanly, with the useful ${NODE_LIST:-} addition for local smoke testing.
  • Model config (models.yaml): Qwen3.5-397B-A17B-FP8 with mtp_flags: "" correctly disables MTP; MoRI transfer, FP8 KV, aiter attention flags look correct. DP/no-DP/EP-only profiles are well-structured.
  • 8k1k comment was fixed in b564db2 — now accurately says TP8/EP1 smoke sweep with dp-attn.

@functionstackx functionstackx changed the title [AMD] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark [AMD][WIP- small changes suggested by chunfang] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark May 27, 2026
Comment on lines +73 to +77
if [[ "${KEEP_LOGS:-0}" == "1" ]]; then
trap '' EXIT
else
trap cleanup_and_save_logs EXIT
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 BLOCKING: KEEP_LOGS=1 only disarms the EXIT trap (line 73-74) but the inline sudo rm -rf "$BENCHMARK_LOGS_DIR/logs" at line 191 still fires unconditionally on the happy path, wiping the per-job result tree that a local debugger would want to inspect.

Why it matters: The feature's stated purpose (retain logs under benchmark_logs/ for local debugging) doesn't work for successful runs — only for early/failed exits. The PR's own test plan item ("re-run with KEEP_LOGS=1 and confirm Slurm logs are retained") cannot be satisfied.

Fix: Gate line 191 on KEEP_LOGS as well:

Suggested change
if [[ "${KEEP_LOGS:-0}" == "1" ]]; then
trap '' EXIT
else
trap cleanup_and_save_logs EXIT
fi
+ if [[ "${KEEP_LOGS:-0}" != "1" ]]; then
+ sudo rm -rf "$BENCHMARK_LOGS_DIR/logs" 2>/dev/null || true
+ fi

@functionstackx functionstackx changed the title [AMD][WIP- small changes suggested by chunfang] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark [AMD] Add Qwen3.5 FP8 MI355X SGLang disaggregated benchmark May 27, 2026
functionstackx pushed a commit that referenced this pull request May 27, 2026
Required when adding new amd-master.yaml benchmark configs (PR #1570).

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: chunfangamd <chun.fang@amd.com>
@functionstackx functionstackx force-pushed the chang/qwen35-fp8-mi355x-sglang-disagg-pr-1 branch from b564db2 to 715d44f Compare May 27, 2026 22:47
@github-actions
Copy link
Copy Markdown
Contributor

functionstackx pushed a commit that referenced this pull request May 27, 2026
Required when adding new amd-master.yaml benchmark configs (PR #1570).

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: chunfangamd <chun.fang@amd.com>
@functionstackx functionstackx force-pushed the chang/qwen35-fp8-mi355x-sglang-disagg-pr-1 branch from 715d44f to fd83f40 Compare May 27, 2026 22:48
@github-actions
Copy link
Copy Markdown
Contributor

Comment thread benchmarks/multi_node/qwen3.5_fp8_mi355x_sglang-disagg.sh Outdated
Copy link
Copy Markdown
Collaborator

@functionstackx functionstackx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @ChangLiu0709 @chunfangamd @HaiShaw thanks for the SGLang qwen3.5 mi355 disagg PR but unfortunately it doesn't work yet, can u look into fixing this?

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26543322403/job/78198590470?pr=1570

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/26543322403/job/78198590469?pr=1570

Image
026-05-28 00:11:09.816 DP0 TP0 EP0] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3985, in run_scheduler_process
    scheduler = Scheduler(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 434, in __init__
    self.init_model_worker()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 705, in init_model_worker
    self.init_tp_model_worker()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 660, in init_tp_model_worker
    self.tp_worker = TpModelWorker(**worker_kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 262, in __init__
    self._init_model_runner()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 347, in _init_model_runner
    self._model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 532, in __init__
    self.initialize(pre_model_load_memory)
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 652, in initialize
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 1445, in load_model
    self.model = self.loader.load_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 716, in load_model
    model = _initialize_model(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 290, in _initialize_model
    return model_class(**kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_5.py", line 1608, in __init__
    super().__init__(config, quant_config, prefix, language_model_cls)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_vl.py", line 1124, in __init__
    self.model = language_model_cls(
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_5.py", line 1237, in __init__
    super().__init__(config=config, quant_config=quant_config, prefix=prefix)
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_5.py", line 1046, in __init__
    self.layers, self._start_layer, self._end_layer = make_layers(
  File "/sgl-workspace/sglang/python/sglang/srt/utils/common.py", line 675, in make_layers
    + get_offloader().wrap_modules(
  File "/sgl-workspace/sglang/python/sglang/srt/utils/offloader.py", line 36, in wrap_modules
    return list(all_modules_generator)
  File "/sgl-workspace/sglang/python/sglang/srt/utils/common.py", line 677, in <genexpr>
    layer_fn(idx=idx, prefix=add_prefix(idx, prefix))
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_5.py", line 1037, in get_layer
    return layer_class(
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen3_5.py", line 554, in __init__
    self.mlp = Qwen2MoeSparseMoeBlock(
  File "/sgl-workspace/sglang/python/sglang/srt/models/qwen2_moe.py", line 253, in __init__
    self.experts = get_moe_impl_class(quant_config)(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/layer.py", line 613, in __init__
    super().__init__(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/ep_moe/layer.py", line 93, in __init__
    super().__init__(
  File "/sgl-workspace/sglang/python/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 212, in __init__
    assert (num_experts - num_shared_slots) % self.moe_ep_size == 0
AssertionError
Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 1279, in _wait_for_scheduler_ready
    data = scheduler_pipe_readers[i].recv()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 69, in <module>
    run_server(server_args)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 50, in run_server
    launch_server(server_args)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py", line 2346, in launch_server
    ) = Engine._launch_subprocesses(
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 800, in _launch_subprocesses
    scheduler_init_result.wait_for_ready()
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 651, in wait_for_ready
    infos = _wait_for_scheduler_ready(scheduler_pipe_readers, scheduler_procs)
  File "/sgl-workspace/sglang/python/sglang/srt/entrypoints/engine.py", line 1281, in _wait_for_scheduler_ready
    raise _scheduler_died_error(i, scheduler_procs[i])
RuntimeError: Rank 0 scheduler died during initialization (exit code: -3). If exit code is -9 (SIGKILL), a common cause is the OS OOM killer. Run `dmesg -T | grep -i oom` to check.

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

cliu1004@amd.com and others added 4 commits May 27, 2026 21:58
Introduce CI config, model server flags, multinode launch script, and
amd_utils plumbing (sudo auto-detect, optional disagg decode TP/DP flags)
for qwen3.5-fp8-mi355x-sglang-disagg smoke sweeps on MI355X.

Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
Required when adding new amd-master.yaml benchmark configs (PR #1570).

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
The search-space uses 1P+1D TP8/EP1 with dp-attn, not TP2/EP2 from the
aggregated qwen3.5-fp8-mi355x-sglang recipe.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
With --enable-dp-attention + --moe-a2a-backend mori, sglang auto-promotes
moe_ep_size=tp_size=8, but is_deepep_class_backend() excludes MoRI, so
num_shared_slots stays at the global value (1) and the
(num_experts - num_shared_slots) % moe_ep_size assertion in
fused_moe_triton/layer.py fires for Qwen3.5 (512 routed + 1 shared).
Track upstream sglang for a fix; flip back to dp-attn=true once MoRI is
added to is_deepep_class_backend() or shared-slot accounting is
reconciled.

Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
@functionstackx functionstackx force-pushed the chang/qwen35-fp8-mi355x-sglang-disagg-pr-1 branch from d3fa9ff to 99fc3fb Compare May 28, 2026 01:59
@functionstackx functionstackx added the non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim) label May 28, 2026
Matches sister sglang-disagg scripts (dsr1_fp8, dsr1_fp4) and the
GLM-5 disagg launch script. submit.sh requires FRAMEWORK; surfacing
the missing-var failure at the top of the launch script gives a
cleaner error than letting it fail deep inside submit.sh.

Addresses Cursor Bugbot review comment on PR #1570.

Co-authored-by: chunfangamd <chun.fang@amd.com>
Co-authored-by: ChangLiu0709 <cliu1004@amd.com>
@github-actions
Copy link
Copy Markdown
Contributor

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 224d444. Configure here.

Comment thread perf-changelog.yaml
- "1P+1D TP8/EP1 smoke sweep for 1k1k and 8k1k (conc 8-512); MoRI transfer backend"
- "Add models.yaml server flags and multinode launch script qwen3.5_fp8_mi355x_sglang-disagg.sh"
- "8k1k row uses dp-attn=false (matches 1k1k): with --enable-dp-attention + --moe-a2a-backend mori, sglang auto-promotes moe_ep_size=tp_size=8, but is_deepep_class_backend() excludes MoRI, so num_shared_slots stays at the global value (1) and the (num_experts - num_shared_slots) % moe_ep_size assertion in fused_moe_triton/layer.py fires for Qwen3.5 (512 routed + 1 shared). Track upstream sglang; flip back to dp-attn=true once MoRI is added to is_deepep_class_backend() or shared-slot accounting is reconciled."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1570
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changelog entry indentation inconsistent with existing entries

Medium Severity

The new perf-changelog.yaml entry starts at 0-space indentation (- config-keys:) while all surrounding entries use 2-space indentation ( - config-keys:). In YAML, list items at different indentation levels belong to different structural contexts. This means process_changelog.py may not parse the new entry as part of the same changelog list, potentially causing the CI sweep to skip the qwen3.5-fp8-mi355x-sglang-disagg config entirely.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 224d444. Configure here.

@github-actions
Copy link
Copy Markdown
Contributor

@functionstackx
Copy link
Copy Markdown
Collaborator

/reuse-sweep-run

@functionstackx functionstackx merged commit 7ae613c into main May 28, 2026
27 checks passed
@functionstackx functionstackx deleted the chang/qwen35-fp8-mi355x-sglang-disagg-pr-1 branch May 28, 2026 04:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

non-canary-full-sweep-enabled Run the full sweep without the canary gate (full search space, no trim)

Projects

Development

Successfully merging this pull request may close these issues.

3 participants