
Conversation

vpietila-amd (Contributor) commented Jan 14, 2026

Proposed changes

This PR has four main changes:

  1. Unified the XDL and WMMA warp-level descriptions under a single WarpGemmDescriptor concept and corresponding types.
  2. Refactored the convolution algorithm concepts for the conv dispatcher so that they have a well-defined hierarchy. The convolution algorithms are grouped into three categories: XDL, WMMA, and DL. The XDL and WMMA algorithms share a common base algorithm concept, ConvAlgorithm, from which the hierarchy of XDL and WMMA algorithms is derived.
  3. The convolution algorithm specializations are now bit flags, so that multiple specializations can be defined at the same time. This allows us to achieve the hierarchy of the conv algorithms.
  4. The input/output tile related thread clusters now have more descriptive names.

The unified warp GEMM description (item 1) lets us treat XDL and WMMA algorithms on an equal footing and reduces the amount of boilerplate code.

The hierarchical description of the conv algorithms (item 2) simplifies the convolution algorithm concepts and allows better compile-time error messages when no corresponding factory is found for a given algorithm description.

Together, items 1 and 2 make it easier to add new factories.
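
For illustration, the unified warp GEMM description and the derived concept hierarchy could look roughly like the sketch below. This is a minimal, hypothetical sketch: the member names (MPerWarp, NPerWarp, KPerWarp, DataType, IsXdl, IsWmma) and the *Sketch concept names are assumptions made for this example, not the identifiers used in the PR.

```cpp
#include <concepts>
#include <cstddef>

// Hypothetical sketch of a single concept that covers both XDL and WMMA
// warp-level GEMM descriptions; member names are illustrative only.
template <typename T>
concept WarpGemmDescriptorSketch = requires {
    typename T::DataType;
    { T::MPerWarp } -> std::convertible_to<std::size_t>;
    { T::NPerWarp } -> std::convertible_to<std::size_t>;
    { T::KPerWarp } -> std::convertible_to<std::size_t>;
};

// Hypothetical base concept for XDL/WMMA convolution algorithms, plus the
// derived concepts that refine it for each instruction family.
template <typename T>
concept ConvAlgorithmSketch = WarpGemmDescriptorSketch<typename T::WarpGemm>;

template <typename T>
concept XdlConvAlgorithmSketch = ConvAlgorithmSketch<T> && T::IsXdl;

template <typename T>
concept WmmaConvAlgorithmSketch = ConvAlgorithmSketch<T> && T::IsWmma;

// Example descriptor that models the WMMA branch of the hierarchy.
struct ExampleWmmaAlgorithmSketch
{
    struct WarpGemm
    {
        using DataType = float;
        static constexpr std::size_t MPerWarp = 16;
        static constexpr std::size_t NPerWarp = 16;
        static constexpr std::size_t KPerWarp = 16;
    };
    static constexpr bool IsXdl  = false;
    static constexpr bool IsWmma = true;
};
static_assert(WmmaConvAlgorithmSketch<ExampleWmmaAlgorithmSketch>);
```

The point of the unified descriptor is that both instruction families satisfy the same base concept, so factories can be constrained on the base and refined with the derived concepts.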

Ville Pietilä added 30 commits December 18, 2025 04:36
concept SpecifiesTileTransferParameters3D = TileTransferParameters<T, 3>;
// Base algorithm concepts
template <typename T, size_t ThreadClusterRank = 3>
concept ConvAlgorithm = ConvAlgorithmDescriptor<T> && SpecifiesThreadBlock<T> &&
Collaborator:

Why are we requiring that a convolution always has to specify all of this information? This is a very strong design decision. We had talked about options for heuristics and default values. This appears to be blocking off those options.

Collaborator:

I realize that's what the current code is, but I was really trying to move us away from concepts here. The if statements below were a bridge to more explicit logic for determining the implementation.
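
For readers of this thread, the alternative being alluded to here, explicit build-time branching instead of concept pattern matching, usually looks something like the following minimal, self-contained sketch; all type and member names are hypothetical and not taken from the PR.

```cpp
#include <type_traits>

// Hypothetical factory types used only for this illustration.
struct XdlFactorySketch {};
struct WmmaFactorySketch {};
struct DlFactorySketch {};

// Explicit compile-time dispatch on descriptor flags, as opposed to selecting
// a factory by matching the descriptor against a hierarchy of concepts.
template <typename AlgorithmDesc>
constexpr auto make_conv_factory_sketch()
{
    if constexpr (AlgorithmDesc::uses_xdl)
    {
        return XdlFactorySketch{};
    }
    else if constexpr (AlgorithmDesc::uses_wmma)
    {
        return WmmaFactorySketch{};
    }
    else
    {
        static_assert(AlgorithmDesc::uses_dl, "unsupported algorithm description");
        return DlFactorySketch{};
    }
}

// Example descriptor and usage.
struct ExampleDescSketch
{
    static constexpr bool uses_xdl  = false;
    static constexpr bool uses_wmma = true;
    static constexpr bool uses_dl   = false;
};

static_assert(std::is_same_v<decltype(make_conv_factory_sketch<ExampleDescSketch>()),
                             WmmaFactorySketch>);
```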

Collaborator:

It looks like we've gone all in on concepts. I think this is a mistake, but we can ride it out and see where we get to. I think this is putting a huge burden on the calling code and may make MIOpen integration much more difficult. I had hoped to simplify the requirements on the calling code, but this looks pretty rigid and will probably require more helper code.


XDL and WMMA algorithms share a common base, while DL algorithms have their own independent base.

## Common Base Hierarchy (XDL & WMMA)
Collaborator:

This is going to be very complicated to explain to new CK library users. Without default values they will have to know the details of the convolution algorithms and the hardware they are compiling for. Rather than allowing for optional expert-mode tuning, even beginning library users will need to know all the values, and if they get anything wrong they will get compiler errors. We're losing a lot of the benefits of an abstract builder layer with this rigid design.


## Overview

The convolution algorithms are organized into three main categories:
Collaborator:

XDL and WMMA are specific to hardware architecture, not really a user choice. How does that fit into this design?

What is the logic of the DL algorithms? Are those only for older CDNA2, or will they also work on newer architectures?

What about reference implementations? They don't fit in any of these three categories, but they are also convolutions created by the builder, right?

INTERWAVE
};

enum class ConvAlgorithmSpecialization
Collaborator:

Please document why these values of 0, 1, 2, 4, 8, 16 are chosen. I'm guessing it's some kind of a bitfield scheme. Are we planning on doing math on enum class values?
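
For context on the question above: powers of two (1, 2, 4, 8, 16, ...) are the usual choice for a bit-flag enum class, because combined values can then be built and queried with bitwise operators defined on the enum. A minimal sketch of that pattern, with made-up enumerator names rather than the PR's actual definitions:

```cpp
#include <cstdint>
#include <type_traits>

// Illustrative bit-flag enum; the enumerator names are hypothetical.
enum class ConvSpecializationSketch : std::uint32_t
{
    Default   = 0,
    Filter1x1 = 1 << 0,
    Pad0      = 1 << 1,
    Stride1   = 1 << 2,
    OddC      = 1 << 3,
};

// Bitwise OR so multiple specializations can be combined in one value.
constexpr ConvSpecializationSketch operator|(ConvSpecializationSketch a,
                                             ConvSpecializationSketch b)
{
    using U = std::underlying_type_t<ConvSpecializationSketch>;
    return static_cast<ConvSpecializationSketch>(static_cast<U>(a) | static_cast<U>(b));
}

// Test whether a combined value contains a given flag.
constexpr bool has_flag(ConvSpecializationSketch value, ConvSpecializationSketch flag)
{
    using U = std::underlying_type_t<ConvSpecializationSketch>;
    return (static_cast<U>(value) & static_cast<U>(flag)) != 0;
}

// Usage: combine flags and query them at compile time.
constexpr auto spec = ConvSpecializationSketch::Filter1x1 | ConvSpecializationSketch::Pad0;
static_assert(has_flag(spec, ConvSpecializationSketch::Pad0));
static_assert(!has_flag(spec, ConvSpecializationSketch::Stride1));
```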

return static_cast<T>(spec) == 0;
}

enum class MatrixInstructionType
Collaborator:

Shouldn't we start considering hardware architecture for some of this?

.weight = {.config = {.layout = GKXC}},
.output = {.config = {.layout = GNWK}}};

constexpr auto FwdConvAlgorithm =
Collaborator:

Is this kind of code with the chained methods going to be part of the public API? It is very different from the reflection interface. I thought this was some kind of testing detail, but it looks like it's getting more and more baked into the design. How complex is this going to get when we support twice as many GPU architectures? Constants like GemmParams_Wmma_16_16_2x2_per_wave were not part of the original design and look like they are masking an increasingly complex builder interface.
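
For reference in this discussion, a chained (fluent) builder of the kind being questioned generally looks like the self-contained sketch below; every type and method name here is hypothetical and not the library's actual API.

```cpp
#include <cstddef>

// Hypothetical parameter and result types for this illustration only.
struct GemmParamsSketch
{
    std::size_t m_per_warp = 0;
    std::size_t n_per_warp = 0;
};

struct ConvAlgorithmResultSketch
{
    GemmParamsSketch warp_gemm{};
    std::size_t block_size = 0;
};

// Minimal fluent builder: each setter returns a modified copy, so the calls
// can be chained and evaluated at compile time.
class ConvAlgorithmBuilderSketch
{
public:
    constexpr ConvAlgorithmBuilderSketch with_warp_gemm(GemmParamsSketch p) const
    {
        ConvAlgorithmBuilderSketch copy = *this;
        copy.result_.warp_gemm = p;
        return copy;
    }

    constexpr ConvAlgorithmBuilderSketch with_block_size(std::size_t n) const
    {
        ConvAlgorithmBuilderSketch copy = *this;
        copy.result_.block_size = n;
        return copy;
    }

    constexpr ConvAlgorithmResultSketch build() const { return result_; }

private:
    ConvAlgorithmResultSketch result_{};
};

// Usage resembling the chained calls under discussion.
constexpr auto fwd_conv_algorithm_sketch =
    ConvAlgorithmBuilderSketch{}
        .with_warp_gemm({.m_per_warp = 16, .n_per_warp = 16})
        .with_block_size(256)
        .build();
static_assert(fwd_conv_algorithm_sketch.block_size == 256);
```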

WarpGemm_,
InputOutputTileTransfer_<4>,
ConvSpecializationBwdWeight_,
GemmPipeline_, // Not needed, but we need this to meet the ConvAlgorithm
Collaborator:

This is a smaller example of a complication I'm worried is coming from over-use of concepts. It looks to me like we've dumped a lot of logic into concept pattern matching. I think it's going to get more and more complicated and limit what we can do in build-time logic.

vpietila-amd (Contributor, author):

This was a bit of a code smell. I fixed the hierarchy of the algorithm concepts such that this is not needed.

shumway (Collaborator) commented Jan 16, 2026

I like some of the refactoring around simplifying XDL and WMMA. As I've been saying, I disagree with this kind of use of concepts and I think it's going to create a lot of problems. If other people agree with going forward this way, I'm fine with seeing where it goes. I'm anticipating more complications with MIOpen integration, and we may have to rework this all to get to a more user-friendly builder. I'm also concerned that we are not introducing hardware architecture yet. It seems like knowing which GPU we are compiling for is a key part of the kernel selection logic. I don't have a good feel for how this is going to work once we're integrated with MIOpen for all supported hardware, and I keep thinking we are complicating things by overusing concepts.

vpietila-amd marked this pull request as draft January 16, 2026 10:02
Ville Pietilä added 2 commits January 16, 2026 05:32
…com:ROCm/composable_kernel into vpietila/ckb-refactor-warp-gemm-descriptors