
feat(blog): MI355X DeepSeek-V4-Pro SGLang — 110.5x throughput per GPU in 26 days#388

Merged: functionstackx merged 5 commits into master from feat/blog-mi355x-deepseek-v4-pro-sglang-110x on May 26, 2026


@functionstackx functionstackx commented May 26, 2026

Summary

  • New blog post: MI355X DeepSeek-V4-Pro on SGLang — 110.5x throughput per GPU in 26 days (/blog/mi355x-deepseek-v4-pro-sglang-110x-in-26-days). Time-series story of the sgl-project/sglang amd/deepseek_v4 side branch: first-light 20.4 tok/s/GPU at 2.4 tok/s/user on 2026-04-25 (FP8, recipe needed SGLANG_HACK_FLASHMLA_BACKEND=torch to compile) → 2,256 tok/s/GPU at 9.4 tok/s/user on 2026-05-21 (FP4, DP attention, lmsysorg/sglang:v0.5.12-rocm720-mi35x). Both axes climb together: 110.5x throughput-per-GPU × 3.85x interactivity.
  • Decomposes the 31 numbered side-branch PRs into 5 mechanism buckets: DSA attention (TileLang indexer + Triton sparse MLA), mHC fusion, RoPE+Hadamard fusion, MoE (FlyDSL + FP4 + fused hash topk), and AITER + misc fusions. Also traces each container-image bump in the InferenceX recipe loop (rocm/sgl-dev → lmsysorg/sglang).
  • Adds 9 files: the MDX post + 4 light/dark image pairs (benchmark chart, DSv4-Pro-Max vs Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 quality bars, MI355X-vs-B200 SXM /gpu-specs radar, B200-vs-MI355X SGLang FP4 performance curve).
  • On-Paper Specs section in What's Next anchors the residual ~5x gap to B200 SGLang as software, not silicon: MI355X has 1.60x HBM capacity, the same 8 TB/s HBM BW, and 1.12x dense FP4 / FP8 / BF16 vs B200 SXM; only scale-up BW (576 vs 900 GB/s uni-di, 0.64x) favors B200. Pulled from packages/app/src/lib/gpu-specs.ts.
  • Iso-interactivity table uses the bundled iso_interactivity.py helper (monotone cubic Hermite on the upper-left Pareto frontier — matches d3.curveMonotoneX exactly so the table values track the rendered chart, not a linear approximation). Unreachable cells render as _unreachable_.
  • What's Next explicitly flags the upstream-to-main migration story via PR #24933 (kk, merged 2026-05-18, +3,678 / -70), the first chunk landing DSv4 on ROCm in main in eager mode, with the perf-critical fusions / TileLang indexer / FlyDSL MoE / SGLANG_OPT_* toggles still side-branch-only. Hard reproducibility note: SGLang main will under-perform this post by an order of magnitude until those land upstream.
  • FAQ JSON-LD covers the 5 questions readers actually ask: peak number, what shipped, why 04-25 was so slow, NVIDIA comparison, what's still uncovered.
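
For readers unfamiliar with the iso-interactivity approach described in the Summary, here is a minimal sketch of the idea. It is not the bundled `iso_interactivity.py`. The function names, the domination rule (higher interactivity and higher throughput are both better), the Steffen-style slope limiter, and the sample points are all assumptions made for illustration. It keeps only non-dominated points, fits a monotone cubic Hermite curve through them, and returns `None` (rendered as _unreachable_) for any target outside the frontier's interactivity range.

```python
from bisect import bisect_right


def pareto_frontier(points):
    """Keep (interactivity, throughput) points that no other point beats on both axes."""
    frontier, best_tput = [], float("-inf")
    # Walk from highest interactivity down; keep a point only if its
    # throughput beats every point already seen at higher interactivity.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if y > best_tput:
            frontier.append((x, y))
            best_tput = y
    return frontier[::-1]  # ascending interactivity


def _sign(v):
    return (v > 0) - (v < 0)


def monotone_slopes(xs, ys):
    """Tangent at each knot, limited so the cubic never overshoots its neighbours."""
    n = len(xs)
    secants = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(n - 1)]
    if n == 2:
        return [secants[0], secants[0]]  # two knots: straight line
    slopes = [0.0] * n
    for i in range(1, n - 1):
        h0, h1 = xs[i] - xs[i - 1], xs[i + 1] - xs[i]
        s0, s1 = secants[i - 1], secants[i]
        p = (s0 * h1 + s1 * h0) / (h0 + h1)
        slopes[i] = (_sign(s0) + _sign(s1)) * min(abs(s0), abs(s1), 0.5 * abs(p))
    slopes[0] = (3 * secants[0] - slopes[1]) / 2
    slopes[-1] = (3 * secants[-1] - slopes[-2]) / 2
    return slopes


def iso_throughput(frontier, target):
    """Throughput at a target interactivity, or None when the target is unreachable."""
    xs = [p[0] for p in frontier]
    ys = [p[1] for p in frontier]
    if len(xs) < 2 or not xs[0] <= target <= xs[-1]:
        return None  # no extrapolation beyond the measured frontier
    m = monotone_slopes(xs, ys)
    i = min(bisect_right(xs, target) - 1, len(xs) - 2)
    h = xs[i + 1] - xs[i]
    t = (target - xs[i]) / h
    # Standard cubic Hermite basis functions on [0, 1].
    h00 = 2 * t**3 - 3 * t**2 + 1
    h10 = t**3 - 2 * t**2 + t
    h01 = -2 * t**3 + 3 * t**2
    h11 = t**3 - t**2
    return h00 * ys[i] + h10 * h * m[i] + h01 * ys[i + 1] + h11 * h * m[i + 1]


# Hypothetical sweep, NOT measured data: (tok/s/user, tok/s/GPU).
sweep = [(5.0, 900.0), (10.0, 1000.0), (20.0, 600.0), (40.0, 200.0)]
frontier = pareto_frontier(sweep)    # (5.0, 900.0) is dominated and dropped
at_8 = iso_throughput(frontier, 8)   # below the frontier's range
at_15 = iso_throughput(frontier, 15)
```

In this toy sweep the lowest-interactivity point is dominated, so the frontier starts at 10 tok/s/user and a query at 8 tok/s/user comes back as `None`. That is the same situation the review comment further down describes for the 05-21 column.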

Test plan

  • pnpm dev and visit /blog/mi355x-deepseek-v4-pro-sglang-110x-in-26-days — verify all 4 figures render in light + dark modes
  • Post appears in /blog listing with correct title, subtitle, publish date (2026-05-26)
  • OG image renders correctly for /blog/mi355x-deepseek-v4-pro-sglang-110x-in-26-days
  • DashboardCTA at top + bottom links land on the preset MI355X DSv4-Pro 5-date view on inferencex.semianalysis.com
  • Live chart link inside body opens the same preset view
  • Sitemap / RSS feed / llms.txt include the new post
  • All 31 SGLang PR links + PR #24933 + DeepSeek announcement + SemiAnalysis tweet resolve

🤖 Generated with Claude Code


Note

Low Risk
Content-only addition (MDX + referenced static images); no application logic, auth, or data-path changes.

Overview
Adds a new long-form benchmark article at /blog/mi355x-deepseek-v4-pro-sglang-110x-in-26-days documenting 26 days of MI355X SGLang tuning for DeepSeek-V4-Pro on the amd/deepseek_v4 fork—from first-light ~20 tok/s/GPU (2026-04-25 FP8) to ~2,256 tok/s/GPU (2026-05-21 FP4)—with throughput and interactivity rising together.

The post ties gains to 31 side-branch PRs (DSA/TileLang + Triton sparse MLA, mHC/RoPE/Hadamard fusions, FlyDSL MoE, FP4 path, InferenceX container bumps), publishes date-stamped concurrency tables, an iso-interactivity comparison, and MI355X vs B200 software-gap framing. It uses existing blog MDX patterns: Figure (light/dark assets under /images/mi355x-deepseek-v4-pro-sglang-110x-in-26-days/), DashboardCTA + live InferenceX links, and JsonLd FAQ schema for SEO.

Reviewed by Cursor Bugbot for commit 036d473. Bugbot is set up for automated code reviews on this repo. Configure here.

… in 26 days

Time-series story of the sgl-project/sglang amd/deepseek_v4 side branch:
20.4 → 2,256 tok/s/GPU at iso/improving interactivity on 8K/1K, across 31
numbered performance optimization PRs. Includes per-date tables, iso-iv
interpolation (12-14x in the 10-20 tok/s/user serving band from FP4 first
light to 05-21), DSv4-Pro vs Claude/GPT/Gemini quality benchmarks, on-paper
specs framing (MI355X vs B200 — 1.60x HBM + 1.12x dense compute, only 0.64x
scale-up BW, so the residual 5x gap to B200 SGLang is software not silicon),
and the upstream-to-main migration story via PR #24933.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot commented May 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| ---------------- | ---------- | ---------------- | ------------------- |
| inferencemax-app | Ready | Preview, Comment | May 26, 2026 4:21am |


Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n quality-benchmarks caption

Reapplies four 'integration → performance optimization PR' wording fixes that
the editor reverted, and adds an inline hyperlink from the quality-benchmarks
figure source to the DeepSeek V4 preview announcement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Widen Figure caption prop from string to ReactNode so MDX captions can hold
inline JSX (e.g. a source link). Convert the DSv4-Pro-Max vs Claude/GPT/Gemini
caption to a JSX expression so the DeepSeek V4 preview release source renders
as a real underlined external link instead of raw markdown syntax.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Revert the ReactNode widening of the Figure caption prop and convert the
DSv4-Pro-Max vs Claude/GPT/Gemini caption back to a plain string. The
source URL is now spelled out as readable text inside the caption.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx merged commit 19ae49e into master May 26, 2026
18 checks passed
@functionstackx functionstackx deleted the feat/blog-mi355x-deepseek-v4-pro-sglang-110x branch May 26, 2026 04:26

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.



| Interactivity (tok/s/user) | 04-25 | 05-02 | 05-04 | 05-10 | 05-21 | 05-02 → 05-21 |
| -------------------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| 8 | _unreachable_ | 221 | 401 | 1,363 | _unreachable_ | _∞_ |

Iso-interactivity table starts with unreachable ratio row

Medium Severity

The iso-interactivity table's first row (8 tok/s/user) shows _unreachable_ for 05-21 and _∞_ in the ratio column. This happens because the 05-21 Pareto frontier ends at 9.37 tok/s/user (conc 256); conc 512 at 5.59 tok/s/user is dominated and excluded from the frontier. The project's SKILL.md row-pruning heuristic explicitly states "The first row of the table must have two real numbers — never start with an _unreachable_ row" and warns that "a table that opens with _∞_ reads like the data is missing." Starting at 10 tok/s/user — where both 05-02 and 05-21 have real values — avoids this.
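
The row-pruning rule quoted above could be sketched roughly as follows. This is an illustration only: `prune_leading_unreachable` is a hypothetical name, `None` stands in for an _unreachable_ cell, and apart from the 221 in the 8 tok/s/user row every number is made up.

```python
def prune_leading_unreachable(rows):
    """Drop rows from the top until both compared columns hold real numbers.

    Each row is (interactivity, baseline_value, latest_value); None marks an
    unreachable cell. Rows after the first fully populated one are kept as-is.
    """
    for start, (_, baseline, latest) in enumerate(rows):
        if baseline is not None and latest is not None:
            return rows[start:]
    return []  # no row has two real numbers


# Hypothetical table: only the 221 in the first row comes from the post.
rows = [
    (8, 221.0, None),      # 05-21 unreachable, so this row is pruned
    (10, 200.0, 2400.0),   # first row with two real numbers
    (20, None, 1500.0),    # later unreachable cells are left in place
]
pruned = prune_leading_unreachable(rows)
```

Only leading rows are dropped here. Whether trailing unreachable rows should also be trimmed is a separate decision this sketch does not address.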



<DashboardCTA href="https://inferencex.semianalysis.com/inference?g_rundate=2026-05-22&g_runid=26306422380&g_model=DeepSeek-V4-Pro&i_gpus=mi355x_sglang&i_dates=2026-05-03%2C2026-05-04%2C2026-05-08%2C2026-05-19%2C2026-05-21%2C2026-04-25&i_prec=fp4%2Cfp8&i_dstart=2026-04-25&i_dend=2026-05-21&i_linelabel=1">
Click to see the full InferenceX dashboard →
</DashboardCTA>

Dashboard URL dates mismatch blog's discussed five dates

Medium Severity

The i_dates URL parameter decodes to six dates (04-25, 05-03, 05-04, 05-08, 05-19, 05-21), but the blog tables and figure captions reference five dates (04-25, 05-02, 05-04, 05-10, 05-21). Two dates are offset — blog says 05-02 but URL has 05-03, blog says 05-10 but URL has 05-08 — and 05-19 appears in the URL but is never discussed. The text at line 158 says "across the 5 measured dates" while the link shows 6 curves. Readers clicking the CTA will see different date labels and an extra undiscussed curve.
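
The mismatch can be reproduced with a short standard-library check. The URL is copied from the `DashboardCTA` snippet above. `BLOG_DATES` is an assumption: the five dates named in this comment, with the year 2026 taken from the rest of the PR. `dashboard_dates` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit

CTA_URL = (
    "https://inferencex.semianalysis.com/inference"
    "?g_rundate=2026-05-22&g_runid=26306422380&g_model=DeepSeek-V4-Pro"
    "&i_gpus=mi355x_sglang"
    "&i_dates=2026-05-03%2C2026-05-04%2C2026-05-08%2C2026-05-19%2C2026-05-21%2C2026-04-25"
    "&i_prec=fp4%2Cfp8&i_dstart=2026-04-25&i_dend=2026-05-21&i_linelabel=1"
)

# Assumed: the five dates the blog tables and captions discuss.
BLOG_DATES = {"2026-04-25", "2026-05-02", "2026-05-04", "2026-05-10", "2026-05-21"}


def dashboard_dates(url):
    """Return the set of dates encoded in the i_dates query parameter."""
    query = parse_qs(urlsplit(url).query)  # parse_qs percent-decodes %2C to ","
    return set(query["i_dates"][0].split(","))


url_dates = dashboard_dates(CTA_URL)
only_in_url = sorted(url_dates - BLOG_DATES)
only_in_blog = sorted(BLOG_DATES - url_dates)
print("only in URL: ", only_in_url)
print("only in blog:", only_in_blog)
```

A check along these lines could run in the blog's test plan so the CTA preset and the tables cannot drift apart unnoticed.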

