-
Notifications
You must be signed in to change notification settings - Fork 266
[CK_BUILDER] Refactor warp GEMM and conv algorithm concepts #3567
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
[CK_BUILDER] Refactor warp GEMM and conv algorithm concepts #3567
Conversation
…ing representations.
| concept SpecifiesTileTransferParameters3D = TileTransferParameters<T, 3>; | ||
| // Base algorithm concepts | ||
| template <typename T, size_t ThreadClusterRank = 3> | ||
| concept ConvAlgorithm = ConvAlgorithmDescriptor<T> && SpecifiesThreadBlock<T> && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we requiring that a convolution always have to specify all this information.? This is a very strong design decision. We had talked about options for heuristics and default values. This appears to be blocking off those options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize that's what the currrent code is, but I was really trying to move us away from concepts here. The if statements below were a bridge to more explicit logic for determining the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we've gone all in on concepts. I think this is a mistake, but we can ride it out and see where we get to. I think this is putting a huge burden on the calling code and may make MIOpen integration much more difficult. I had hoped to simplify the requirements on the calling code, but this looks pretty rigid and will probably require more helper code.
|
|
||
| XDL and WMMA algorithms share a common base, while DL algorithms have their own independent base. | ||
|
|
||
| ## Common Base Hierarchy (XDL & WMMA) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to be very complicated to explain to new CK library users. Without default values they will have know details of the convolution algorithms and the hardware they are compiling for. Rather than allowing for optional expert mode tuning, even beginning library users will need to know all the values, and if they get anything wrong they will get compiler errors. We're loosing a lot of the benefits of an abstract builder layer with this rigid design.
|
|
||
| ## Overview | ||
|
|
||
| The convolution algorithms are organized into three main categories: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XDL and WMMA are specific to hardware architecture, not really a user choice. How does that fit into this design.
What is the logic of DL algorithms. Are those only for older CDNA2, or will they also work on newer architectures?
What about reference implementations? They don't fit in any of these three categories, but they are also convolutions create by the builder, right?
| INTERWAVE | ||
| }; | ||
|
|
||
| enum class ConvAlgorithmSpecialization |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document why these values of 0, 1, 2, 4, 8, 16 are chosen. I'm guessing it's some kind of a bitfield scheme. Are we planning on doing math on enum class values?
| return static_cast<T>(spec) == 0; | ||
| } | ||
|
|
||
| enum class MatrixInstructionType |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we start considering hardware architecture for some of this?
| .weight = {.config = {.layout = GKXC}}, | ||
| .output = {.config = {.layout = GNWK}}}; | ||
|
|
||
| constexpr auto FwdConvAlgorithm = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this kind of code with the chained methods going to part of the public API. This is very different from the reflection interface. I thought this was some kind of testing detail, but it looks like it's getting more and more baked into the design. How complex is this going to get when we support twice as many GPU architectures? Constants like GemmParams_Wmma_16_16_2x2_per_wave were not part of the original design and look like they are masking an increasingly complex builder interface.
| WarpGemm_, | ||
| InputOutputTileTransfer_<4>, | ||
| ConvSpecializationBwdWeight_, | ||
| GemmPipeline_, // Not needed, but we need this to meet the ConvAlgorithm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a smaller example of complication I'm worried is coming from over-use of concepts. It looks to me like we've dumped a lot of logic into concept pattern matching. I think it's going to get more and more complicated and limit what we can do in build-time logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a bit of a code smell. I fixed the hierarchy of the algorithms concepts such that this is not needed.
|
I like some of the refactoring around simplifying XDL and WMMA. As I've been saying, I disagree with this kind of use of concepts and I think it's going to create a lot of problems. If other people agree with going forward this way, I'm fine with seeing where it goes. I'm anticipating more complications with MIOpen integration, and we may have to rework this all to get to a more user-friendly builder. I'm also concerned that we are not introducing hardware architecture yet. It seems like knowing which GPU we are compiling for is a key part of the kernel selection logic. I don't have a good feel for how this is going to work when were integrated with MIOpen for all supported hardware, and I keep thinking we are complicating things by overusing concepts. |
…com:ROCm/composable_kernel into vpietila/ckb-refactor-warp-gemm-descriptors
Proposed changes
This PR has four main changes
WarpGemmDescriptorconcept and corresponding types.XDL,WMMA, andDL. The XDL and WMMA algorithm have a common base algorithm conceptConvAlgorithmfrom which the hierarchy ofXDLandWMMAalgorithms is derived.The unified warp GEMM description (item 1.) allows us to treat XDL and WMMA algorithms at unique footing and reduces the amount of boilerplate code.
The hierarchical description of the conv algorithms (item 2.) simplifies the convolution algorithm concepts and allows better compile-time error messages when no corresponding factory is found for a given algorithms description.
Items 1. and 2. combined allow easier addition of new factories.