Add Custom-Rocm-Kernel-Skill #3308

Open
01xjw wants to merge 8 commits into huggingface:main from 01xjw:main

01xjw commented Mar 23, 2026

Summary
Adds a new blog post: "ROCm Kernel Skill: Custom Triton Kernels for AMD, from Datacenter to Desktop"
Introduces the ROCm kernel agent skill — a companion to the existing CUDA kernel skill, built from scratch for AMD GPUs
Covers why AMD needs a separate skill (different language, constraints, and hardware), what's included, and how to install it

Key content
Two architectures tested: MI355X (CDNA3+, datacenter) and R9700 (RDNA4, desktop)
Benchmark results with charts and collapsible tables:
MI355X: up to 2.87x peak kernel speedup, 25% E2E speedup on LTX-Video
R9700: up to 3.97x peak kernel speedup, 79.5% bandwidth utilization
Cross-hardware comparison highlighting the performance characteristics of each architecture
Explains ROCm-specific challenges: tl.libdevice unavailability, BLOCK_D autotuning pitfalls, XCD Swizzle, Wave64 vs Wave32
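
As a rough illustration of the XCD Swizzle idea mentioned above, here is a minimal pure-Python sketch of the program-ID remapping commonly used on multi-XCD AMD GPUs. It is not taken from the blog post or the skill itself. It assumes the hardware hands out workgroups to XCDs round-robin, that there are 8 XCDs, and that the grid size is a multiple of the XCD count. The function name `xcd_swizzle` and its parameters are hypothetical.

```python
def xcd_swizzle(pid: int, num_blocks: int, num_xcds: int = 8) -> int:
    """Map a hardware program ID to a logical block ID.

    Assumption: workgroup `pid` is scheduled on XCD `pid % num_xcds`
    (round-robin dispatch). The remap gives each XCD a contiguous range
    of logical block IDs, so neighbouring blocks can share that XCD's L2.
    """
    if num_blocks % num_xcds != 0:
        # Uneven grids need extra handling that this sketch leaves out.
        raise ValueError("num_blocks must be a multiple of num_xcds")
    blocks_per_xcd = num_blocks // num_xcds
    xcd = pid % num_xcds    # XCD the workgroup is assumed to land on
    slot = pid // num_xcds  # position of the workgroup within that XCD
    return xcd * blocks_per_xcd + slot


# Example: 16 blocks over 8 XCDs. Hardware IDs 0 and 8 both run on XCD 0
# and are remapped to the adjacent logical blocks 0 and 1.
remapped = [xcd_swizzle(p, num_blocks=16) for p in range(16)]
```

In a Triton kernel the same arithmetic would typically be applied to the value returned by `tl.program_id(0)` before computing tile offsets. Whether it helps depends on how much data neighbouring blocks actually share.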

Files added
blog/rocm-kernel-skill.md — the blog post
blog/assets/rocm-kernel-skill/meme.png — cover/thumbnail image
blog/assets/rocm-kernel-skill/MI355x_kernels.png — MI355X kernel benchmark chart
blog/assets/rocm-kernel-skill/R9700_kernels.png — R9700 kernel benchmark chart
blog/assets/rocm-kernel-skill/Cross_hardware.png — cross-hardware comparison chart

01xjw marked this pull request as ready for review March 24, 2026 15:05

burtenshaw (Collaborator) left a comment

Nice post overall, especially the structure around the two hardware targets and the benchmark sections.

I left a few small readability-focused comments. Main theme: trim a couple of dense paragraphs near the top, and make the takeaways a little more skimmable. I’d also consider adding a one-line interpretation in the R9700 end-to-end section, since that section currently goes straight from the heading into the collapsed table.

01xjw and others added 5 commits April 16, 2026 16:29
Co-authored-by: burtenshaw <ben.burtenshaw@gmail.com>
Co-authored-by: burtenshaw <ben.burtenshaw@gmail.com>
Co-authored-by: burtenshaw <ben.burtenshaw@gmail.com>
Co-authored-by: burtenshaw <ben.burtenshaw@gmail.com>
Co-authored-by: burtenshaw <burtenshaw@users.noreply.github.com>

01xjw (Author) commented Apr 16, 2026

> Nice post overall, especially the structure around the two hardware targets and the benchmark sections.
>
> I left a few small readability-focused comments. Main theme: trim a couple of dense paragraphs near the top, and make the takeaways a little more skimmable. I’d also consider adding a one-line interpretation in the R9700 end-to-end section, since that section currently goes straight from the heading into the collapsed table.

Hi @burtenshaw, thanks for the review! All suggestions applied. Please let me know if anything else needs adjusting.

01xjw (Author) commented Apr 17, 2026

Friendly ping @burtenshaw, @sayakpaul — let me know if this is good to go or if there's anything else I should tweak. Thanks!
