feat(blog): B200 NVFP4 vs H200 FP8 on GLM-5 — up to 3.65x better perf/$ by functionstackx · Pull Request #386 · SemiAnalysisAI/InferenceX-app

functionstackx · 2026-05-26T03:14:05Z

Summary

- New blog post comparing B200 NVFP4 + MTP vs H200 FP8 + MTP on GLM-5 8K/1K with SGLang v0.5.12 (GHA run 26381101926). Headline: up to 3.65x better performance per dollar at 80 tok/s/user, stable in a 3.24x–3.65x band across H200's 25–84 tok/s/user range.
- Decomposes the peak 3.65x into a 1.22x generation step (Blackwell vs Hopper at FP8 + MTP) and a 2.98x precision step (B200 FP8 → B200 NVFP4).
- New `## On-Paper Specs` section with the `/gpu-specs` radar + an absolute-values table sourced from `packages/app/src/lib/gpu-specs.ts`. Bridges from raw silicon ratios to the measured perf/$ lift via a "compute-bound ceiling / HBM-bound floor" bracket.
- Iso-interactivity table uses the bundled Pareto helper at `.claude/skills/write-inferencex-blog/iso_interactivity.py`; cells outside H200's range render as `_unreachable_`.
- Anchors: SGLang tracking #19380, kernel PRs #21783 / #21405, FlashInfer #2726 / #2836, InferenceX recipes #1087 (B200 NVFP4+MTP) and #1480 (H200 FP8+MTP).
- SKILL.md updates: codifies the new On-Paper Specs section pattern, adds the `/gpu-specs` radar image to the Step 0 ask list, and adds a banned-phrases banner so future iso-interactivity intros stop leaking algorithm names into reader-facing prose.
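The decomposition in the summary can be sanity-checked with a few lines of arithmetic. The sketch below assumes the two steps compose multiplicatively and that each quoted factor was rounded to two decimals; the function names are illustrative and do not come from the repository.

```python
# Multipliers quoted in the PR summary (rounded to two decimals there).
GENERATION_STEP = 1.22  # H200 FP8 + MTP -> B200 FP8 + MTP
PRECISION_STEP = 2.98   # B200 FP8 + MTP -> B200 NVFP4 + MTP
HEADLINE = 3.65         # H200 FP8 + MTP -> B200 NVFP4 + MTP


def compose(*steps: float) -> float:
    """Multiply successive speedup factors into one end-to-end factor."""
    total = 1.0
    for step in steps:
        total *= step
    return total


def product_bounds(factors: list[float], half_step: float = 0.005) -> tuple[float, float]:
    """Range of the true product if each factor was rounded to two decimals."""
    low = compose(*(f - half_step for f in factors))
    high = compose(*(f + half_step for f in factors))
    return low, high


composed = compose(GENERATION_STEP, PRECISION_STEP)
low, high = product_bounds([GENERATION_STEP, PRECISION_STEP])
print(f"composed = {composed:.4f}, rounding range = [{low:.4f}, {high:.4f}]")
print("headline consistent with rounding:", low <= HEADLINE <= high)
```

The rounded factors multiply to about 3.64, and the quoted 3.65x falls inside the range that two-decimal rounding of each step allows, so the headline and the decomposition agree.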

Test plan

- Vercel preview renders at `/blog/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar`
- Hero benchmark `<Figure>` and radar `<Figure>` render in both themes (the same image currently fills both the light and dark slots; drop in a real dark radar export later if desired)
- Iso-interactivity table values match the live cost view for spot-checked iv rows (25 / 50 / 80 / 84)
- OG image generates
- Sitemap + RSS pick up the new slug
- FAQ JSON-LD passes a schema validator

🤖 Generated with Claude Code

Note

Low Risk
Content-only change (MDX blog post and authoring skill docs); no application runtime, auth, or data-path code.

Overview
Adds a new InferenceX benchmark post comparing B200 NVFP4 + MTP vs H200 FP8 + MTP on GLM-5 8K/1K (SGLang v0.5.12): headline up to 3.65× perf/$ at 80 tok/s/user, decomposed into ~1.22× generation (FP8+MTP) and ~2.98× precision (NVFP4 on B200). The post follows the standard blog shape—dashboard CTAs, hero/repeat figures, per-concurrency tables, iso-interactivity cost table with _unreachable_ cells, upstream PR narrative, FAQ JSON-LD—and introduces an ## On-Paper Specs block (radar figure + table from gpu-specs.ts + perf/$ bracket vs measured lift).
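To illustrate how an iso-interactivity cell can be filled or marked `_unreachable_`, here is a minimal sketch. It is not the bundled helper at `.claude/skills/write-inferencex-blog/iso_interactivity.py`, whose code is not shown in this PR. It assumes plain linear interpolation between measured points as a stand-in, assumes perf/$ is the inverse of cost per million tokens, and uses placeholder data points that are not benchmark results. All names (`cost_at`, `iso_cell`, the `*_PLACEHOLDER` lists) are hypothetical.

```python
from bisect import bisect_left
from typing import Optional

# (interactivity in tok/s/user, cost per million tokens)
Point = tuple[float, float]


def cost_at(frontier: list[Point], target_iv: float) -> Optional[float]:
    """Linearly interpolate cost at target_iv; None if outside the measured range."""
    pts = sorted(frontier)
    xs = [x for x, _ in pts]
    if not pts or target_iv < xs[0] or target_iv > xs[-1]:
        return None
    i = bisect_left(xs, target_iv)
    if xs[i] == target_iv:
        return pts[i][1]
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]
    return y0 + (y1 - y0) * (target_iv - x0) / (x1 - x0)


def iso_cell(baseline: list[Point], challenger: list[Point], target_iv: float) -> str:
    """Render the perf/$ ratio at one interactivity level, or the unreachable marker."""
    base_cost = cost_at(baseline, target_iv)
    new_cost = cost_at(challenger, target_iv)
    if base_cost is None or new_cost is None:
        return "_unreachable_"
    # Lower cost per token means higher perf/$, so the ratio is baseline / challenger.
    return f"{base_cost / new_cost:.2f}x"


# Placeholder points only: synthetic values chosen for easy arithmetic, NOT measurements.
H200_PLACEHOLDER = [(25.0, 4.0), (50.0, 5.0), (84.0, 10.0)]
B200_PLACEHOLDER = [(20.0, 1.0), (60.0, 3.0), (120.0, 6.0)]

for iv in (50.0, 100.0):
    print(iv, iso_cell(H200_PLACEHOLDER, B200_PLACEHOLDER, iv))
```

Returning `None` for out-of-range targets, rather than extrapolating, is what keeps a cell from reporting a ratio at an interactivity level one SKU never reached.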

Updates write-inferencex-blog SKILL.md: Step 0 now asks for a /gpu-specs radar screenshot; documents the full On-Paper Specs section template; renumbers upstream/recipe inputs; and adds an explicit banned-phrases list so published posts must not name the dashboard’s iso-interactivity spline/Hermite implementation (approved copy: “interpolated along each SKU’s Pareto frontier”).

^{Reviewed by Cursor Bugbot for commit f479a2d.}

New post comparing B200 NVFP4 + MTP vs H200 FP8 + MTP on GLM-5 8K/1K with SGLang v0.5.12 (GHA run 26381101926). Headline: up to 3.65x better performance per dollar at 80 tok/s/user, stable in a 3.24x–3.65x band across H200's 25–84 tok/s/user range.

Decomposes at the peak into a 1.22x generation step (Blackwell vs Hopper at FP8 + MTP) and a 2.98x precision step (B200 FP8 → B200 NVFP4).

Adds a new "On-Paper Specs" section anchoring the silicon ratios sourced from packages/app/src/lib/gpu-specs.ts before the recipes. SKILL.md codifies the new section pattern, adds the /gpu-specs radar image to the Step 0 ask list, and adds a banned-phrases banner so the iso-iv intro stops leaking algorithm names into reader-facing prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel · 2026-05-26T03:14:07Z


Project	Deployment	Actions	Updated (UTC)
inferencemax-app	Ready	Preview, Comment	May 26, 2026 3:14am

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-05-26T03:20:51Z

+
+Three gaps still narrow the headline number from here, all upstream-tracked:
+
+- **Disaggregated serving on NVL72.** The numbers above are single-node aggregated. The [tracking issue](https://github.com/sgl-project/sglang/issues/19380) is actively closing FP8 B200 disaggregated 8K/1K and GB300 disaggregated MTP. Wide EP on NVL72 has already demonstrated a [~3x throughput-per-GPU advantage on Kimi K2.5](/blog/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200) — the same lever should lift GLM-5's perf/$ further at the low-interactivity / high-throughput end where the FP4 frontier plateaus.


"Three gaps" listed but only one bullet present

High Severity

The "What's Next" section promises "Three gaps still narrow the headline number" but only one bullet point (Disaggregated serving on NVL72) follows. The FAQ JsonLd at the bottom lists all three: (1) disaggregated serving, (2) piecewise CUDA graph for B200 FP8 Agg prefill (#23351/#24276), and (3) H200 recipe improvements (disagg, trtllm-mha, KV FP8). Two bullet points were dropped from the published prose. The sibling blog post (mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx) correctly matches its "Two gaps" count with two bullets.

Additional Locations (1)

packages/app/content/blog/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar.mdx#L210-L211
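A mismatch like this one can be caught mechanically before publishing. The sketch below is a hypothetical pre-merge check, not part of the repository or of Bugbot. It assumes the intro line spells the count as an English word followed by "gaps" and ends with a colon, and that each bullet sits on a single line.

```python
import re

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

# An intro such as "Three gaps still narrow the headline number ...:".
INTRO_RE = re.compile(r"^\s*(\w+) gaps\b.*:\s*$", re.IGNORECASE)
BULLET_RE = re.compile(r"^\s*[-*] ")


def check_gap_counts(text: str) -> list[tuple[int, int, int]]:
    """Return (line_number, promised, found) for each intro whose bullet count differs."""
    lines = text.splitlines()
    problems = []
    for idx, line in enumerate(lines):
        match = INTRO_RE.match(line)
        if not match:
            continue
        promised = NUMBER_WORDS.get(match.group(1).lower())
        if promised is None:
            continue
        found = 0
        for follow in lines[idx + 1:]:
            if BULLET_RE.match(follow):
                found += 1
            elif follow.strip():
                break  # first non-blank, non-bullet line ends the list
        if found != promised:
            problems.append((idx + 1, promised, found))
    return problems


sample = (
    "Three gaps still narrow the headline number from here, all upstream-tracked:\n"
    "\n"
    "- **Disaggregated serving on NVL72.** The numbers above are single-node aggregated.\n"
    "\n"
    "## FAQ\n"
)
print(check_gap_counts(sample))
```

Run against the excerpt above, the check reports that line 1 promises three items while only one bullet follows, which is the discrepancy described in this finding.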


vercel Bot deployed to Preview May 26, 2026 03:14

functionstackx merged commit 13e9dad into master May 26, 2026
18 checks passed

functionstackx deleted the blog/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar branch May 26, 2026 03:17

