feat(blog): B200 NVFP4 vs H200 FP8 on GLM-5, up to 3.65x better perf/$ (#386)
New post comparing B200 NVFP4 + MTP vs H200 FP8 + MTP on GLM-5 8K/1K with SGLang v0.5.12 (GHA run 26381101926). Headline: up to 3.65x better performance per dollar at 80 tok/s/user, stable in a 3.24x–3.65x band across H200's 25–84 tok/s/user range. Decomposes at the peak into a 1.22x generation step (Blackwell vs Hopper at FP8 + MTP) and a 2.98x precision step (B200 FP8 → B200 NVFP4).

Adds a new "On-Paper Specs" section anchoring the silicon ratios sourced from packages/app/src/lib/gpu-specs.ts before the recipes. SKILL.md codifies the new section pattern, adds the /gpu-specs radar image to the Step 0 ask list, and adds a banned-phrases banner so the iso-iv intro stops leaking algorithm names into reader-facing prose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
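As a quick sanity check on the decomposition quoted above, the two steps should multiply back to roughly the headline figure. The sketch below uses only the rounded factors from this description; the variable names are illustrative, and because the factors are rounded to two decimals the product is only expected to agree with the headline to within rounding.

```python
# Rounded factors as quoted in the PR description (not raw benchmark data).
generation_step = 1.22  # B200 FP8 + MTP vs H200 FP8 + MTP
precision_step = 2.98   # B200 NVFP4 vs B200 FP8
headline = 3.65         # B200 NVFP4 + MTP vs H200 FP8 + MTP, perf/$

# The two steps compose multiplicatively.
product = generation_step * precision_step

# Rounded inputs mean the product can differ from the headline in the
# last digit, so compare with a small tolerance rather than exactly.
consistent = abs(product - headline) < 0.02

print(f"{product:.2f}x vs headline {headline:.2f}x, consistent={consistent}")
```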
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit f479a2d. Configure here.
Three gaps still narrow the headline number from here, all upstream-tracked:

- **Disaggregated serving on NVL72.** The numbers above are single-node aggregated. The [tracking issue](https://github.com/sgl-project/sglang/issues/19380) is actively closing FP8 B200 disaggregated 8K/1K and GB300 disaggregated MTP. Wide EP on NVL72 has already demonstrated a [~3x throughput-per-GPU advantage on Kimi K2.5](/blog/gb200-nvl72-kimi-k2-5-vllm-wide-ep-3x-vs-b200). The same lever should lift GLM-5's perf/$ further at the low-interactivity / high-throughput end where the FP4 frontier plateaus.
"Three gaps" listed but only one bullet present
High Severity
The "What's Next" section promises "Three gaps still narrow the headline number" but only one bullet point (Disaggregated serving on NVL72) follows. The FAQ JsonLd at the bottom lists all three: (1) disaggregated serving, (2) piecewise CUDA graph for B200 FP8 Agg prefill (#23351/#24276), and (3) H200 recipe improvements (disagg, trtllm-mha, KV FP8). Two bullet points were dropped from the published prose. The sibling blog post (mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx) correctly matches its "Two gaps" count with two bullets.
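A mismatch like this can be caught mechanically before review. The following is a minimal sketch, assuming the post announces its count as a number word followed by "gaps" and lists the items as `- ` bullets directly below; `find_count_mismatches` is a hypothetical helper, not something that exists in this repo.

```python
import re

# Number words the check understands; extend as needed.
WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
CLAIM = re.compile(r"\s*(one|two|three|four|five)\s+gaps\b", re.IGNORECASE)


def find_count_mismatches(text):
    """Return (claim_line, promised, found) for every 'N gaps' claim
    whose following bullet list has a different number of items."""
    lines = text.splitlines()
    mismatches = []
    for i, line in enumerate(lines):
        match = CLAIM.match(line)
        if not match:
            continue
        promised = WORDS[match.group(1).lower()]
        found = 0
        # Count bullets after the claim, skipping blank lines, and stop
        # at the first non-blank line that is not a bullet.
        for following in lines[i + 1:]:
            stripped = following.strip()
            if not stripped:
                continue
            if stripped.startswith("- "):
                found += 1
            else:
                break
        if found != promised:
            mismatches.append((line.strip(), promised, found))
    return mismatches


sample = (
    "Three gaps still narrow the headline number from here:\n"
    "\n"
    "- **Disaggregated serving on NVL72.** ...\n"
    "\n"
    "## FAQ\n"
)
print(find_count_mismatches(sample))
```

The sketch treats every non-blank, non-bullet line as the end of the list, so wrapped bullet continuations would need extra handling.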


Summary
- `## On-Paper Specs` section with the `/gpu-specs` radar + an absolute-values table sourced from `packages/app/src/lib/gpu-specs.ts`. Bridges from raw silicon ratios to the measured perf/$ lift via a "compute-bound ceiling / HBM-bound floor" bracket.
- `.claude/skills/write-inferencex-blog/iso_interactivity.py`; cells outside H200's range render as `_unreachable_`.
- `SKILL.md` codifies the new section pattern, adds the `/gpu-specs` radar image to the Step 0 ask list, and adds a banned-phrases banner so future iso-interactivity intros stop leaking algorithm names into reader-facing prose.

Test plan
- `/blog/b200-glm5-nvfp4-vs-h200-fp8-3-6x-perf-per-dollar`
- `<Figure>` and radar `<Figure>` render in both themes (currently the same image is in both light + dark slots; drop a real dark radar export later if desired)

🤖 Generated with Claude Code
Note
Low Risk
Content-only change (MDX blog post and authoring skill docs); no application runtime, auth, or data-path code.
Overview
Adds a new InferenceX benchmark post comparing B200 NVFP4 + MTP vs H200 FP8 + MTP on GLM-5 8K/1K (SGLang v0.5.12): headline up to 3.65× perf/$ at 80 tok/s/user, decomposed into ~1.22× generation (FP8+MTP) and ~2.98× precision (NVFP4 on B200). The post follows the standard blog shape (dashboard CTAs, hero/repeat figures, per-concurrency tables, iso-interactivity cost table with `_unreachable_` cells, upstream PR narrative, FAQ JSON-LD) and introduces an `## On-Paper Specs` block (radar figure + table from `gpu-specs.ts` + perf/$ bracket vs measured lift).

Updates `write-inferencex-blog` `SKILL.md`: Step 0 now asks for a `/gpu-specs` radar screenshot; documents the full On-Paper Specs section template; renumbers upstream/recipe inputs; and adds an explicit banned-phrases list so published posts must not name the dashboard's iso-interactivity spline/Hermite implementation (approved copy: "interpolated along each SKU's Pareto frontier").
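To make the `_unreachable_` behavior concrete, here is a rough sketch of how an iso-interactivity cell could be filled. It deliberately swaps in plain piecewise-linear interpolation as a stand-in and does not reproduce what `iso_interactivity.py` or the dashboard actually do. The function names and the sample cost values are hypothetical; only the 25–84 tok/s/user range for H200 comes from this PR.

```python
from bisect import bisect_left


def interpolate_at(points, target):
    """Linearly interpolate a value at `target` interactivity (tok/s/user).

    `points` is a list of (interactivity, value) pairs measured for one SKU.
    Returns None when `target` lies outside the measured range, i.e. the
    SKU cannot reach that interactivity level.
    """
    pts = sorted(points)
    xs = [x for x, _ in pts]
    if not pts or target < xs[0] or target > xs[-1]:
        return None
    i = bisect_left(xs, target)
    if xs[i] == target:
        return pts[i][1]
    (x0, y0), (x1, y1) = pts[i - 1], pts[i]
    return y0 + (y1 - y0) * (target - x0) / (x1 - x0)


def render_cell(value):
    """Format a table cell, using the post's marker for missing values."""
    return "_unreachable_" if value is None else f"{value:.2f}"


# Made-up (interactivity, cost) points spanning H200's 25-84 tok/s/user range.
h200_points = [(25.0, 4.0), (50.0, 6.0), (84.0, 10.0)]

for target in (37.5, 50.0, 100.0):
    print(target, render_cell(interpolate_at(h200_points, target)))
```

Returning `None` rather than extrapolating keeps the table honest: a SKU that was never measured at a given interactivity level gets no cost figure at all.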