Add Custom-Rocm-Kernel-Skill #3308
burtenshaw
left a comment
Nice post overall, especially the structure around the two hardware targets and the benchmark sections.
I left a few small readability-focused comments. Main theme: trim a couple of dense paragraphs near the top, and make the takeaways a little more skimmable. I’d also consider adding a one-line interpretation in the R9700 end-to-end section, since that section currently goes straight from the heading into the collapsed table.
Co-authored-by: burtenshaw <ben.burtenshaw@gmail.com>
Co-authored-by: burtenshaw <burtenshaw@users.noreply.github.com>
Hi @burtenshaw, thanks for the review! All suggestions applied. Please let me know if anything else needs adjusting.
Friendly ping @burtenshaw, @sayakpaul — let me know if this is good to go or if there's anything else I should tweak. Thanks!
Summary
Adds a new blog post: "ROCm Kernel Skill: Custom Triton Kernels for AMD, from Datacenter to Desktop"
Introduces the ROCm kernel agent skill — a companion to the existing CUDA kernel skill, built from scratch for AMD GPUs
Covers why AMD needs a separate skill (different language, constraints, and hardware), what's included, and how to install it
Key content
Two architectures tested: MI355X (CDNA3+, datacenter) and R9700 (RDNA4, desktop)
Benchmark results with charts and collapsible tables:
MI355X: up to 2.87x peak kernel speedup, 25% E2E speedup on LTX-Video
R9700: up to 3.97x peak kernel speedup, 79.5% bandwidth utilization
Cross-hardware comparison highlighting the performance characteristics of each architecture
Explains ROCm-specific challenges: tl.libdevice unavailability, BLOCK_D autotuning pitfalls, XCD Swizzle, Wave64 vs Wave32
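For readers unsure how figures like "2.87x kernel speedup" or "79.5% bandwidth utilization" are usually derived, here is a minimal sketch of the arithmetic. It is not taken from the blog post's benchmark harness: the function names, the tensor shape, the timings, and the peak-bandwidth value are all hypothetical placeholders, and the byte count assumes a simple elementwise kernel that reads and writes each element once.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency (2.0 means twice as fast)."""
    return baseline_ms / optimized_ms


def bandwidth_utilization(bytes_moved: int, elapsed_ms: float, peak_gb_per_s: float) -> float:
    """Fraction of peak memory bandwidth achieved by a memory-bound kernel."""
    achieved_gb_per_s = bytes_moved / (elapsed_ms * 1e-3) / 1e9
    return achieved_gb_per_s / peak_gb_per_s


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the calculation.
    n_elements = 4096 * 4096          # assumed tensor size
    bytes_per_element = 2             # assumed 16-bit dtype
    bytes_moved = 2 * n_elements * bytes_per_element  # one read plus one write

    baseline_ms, optimized_ms = 0.50, 0.20   # placeholder timings
    peak_gb_per_s = 640.0                    # placeholder; check the vendor spec sheet

    print(f"speedup: {speedup(baseline_ms, optimized_ms):.2f}x")
    util = bandwidth_utilization(bytes_moved, optimized_ms, peak_gb_per_s)
    print(f"bandwidth utilization: {util:.1%}")
```

Stating which definition a post uses matters because "25% E2E speedup" can mean either a 1.25x throughput ratio or a 25% reduction in wall-clock time, and the two are not the same number.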
Files added
blog/rocm-kernel-skill.md — the blog post
blog/assets/rocm-kernel-skill/meme.png — cover/thumbnail image
blog/assets/rocm-kernel-skill/MI355x_kernels.png — MI355X kernel benchmark chart
blog/assets/rocm-kernel-skill/R9700_kernels.png — R9700 kernel benchmark chart
blog/assets/rocm-kernel-skill/Cross_hardware.png — cross-hardware comparison chart