Add intrinsic for launch-sized workgroup memory on GPUs #146181
Flakebi wants to merge 1 commit into rust-lang:main
rustbot has assigned @petrochenkov. Use […]
Some changes occurred in src/tools/compiletest
cc @jieyouxu
Some changes occurred in compiler/rustc_codegen_ssa
Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter […]
Force-pushed from 0aa0e58 to 3ebaccb.
Force-pushed from 3ebaccb to 2378959.
#[rustc_nounwind]
#[unstable(feature = "dynamic_shared_memory", issue = "135513")]
#[cfg(any(target_arch = "amdgpu", target_arch = "nvptx64"))]
pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T;
Note that outside the GPU world, "shared memory" typically refers to memory shared between processes. So I would suggest using a name that's less likely to be confused, like something that explicitly involves "GPU" or so.
This sounds like a form of "global" memory (similar to a static item), but then apparently OpenCL calls it "local" which is very confusing...
Does it make sense to add a mod gpu?
I think there are more intrinsics for GPUs that can be added (although more in the traditional intrinsic sense, relating to an instruction, edit: re-exposing intrinsics from core::arch::nvptx and the amdgpu equivalent).
Or should it be in core::arch::gpu?
(From #135516 (comment), cc @workingjubilee)
Rust intrinsic names are not namespaced. They are exposed in a module, but inside the compiler they are identified entirely by their name. So moving them into a different module doesn't alleviate the need for a clear name that will be understandable to non-GPU people working in the compiler (which is the vast majority of compiler devs).
If there's more GPU intrinsics to come, moving them into a gpu.rs file here still might make sense.
I don't have a strong opinion on how the eventually stable public API is organized, I am commenting entirely as someone who has an interest in keeping the set of intrinsics the Rust compiler offers understandable and well-defined (the ones in this folder, not the ones in core::arch which you call "more traditional" but that's very dependent on your background ;). These intrinsics are just an implementation detail, but every intrinsic we add here is a new language primitive -- it's like adding a new keyword, just without the syntax discussions and perma-unstable. In the past we used to have intrinsics that entirely break the internal consistency of the language, and we used to have intrinsics whose safety requirements were very poorly documented.
Sorry for drowning you in questions here, but extending the core language with new operations (as in, adding a new intrinsic doing things that couldn't be done before) is a big deal, and we had a bad experience in the past when this was done without wider discussion in the team to ensure that the intrinsics actually make sense in the context of Rust. Not everything that exists in the hardware can be 1:1 exposed in Rust, sometimes this requires a lot of work and sometimes it's just basically impossible. It can be a lot of work to clean these things up later, and as someone who did a bunch of that work, I'd rather not have to do it again. :)
I agree that it makes a lot of sense to have the discussion now. Thanks for taking a look and helping to design something useful!
Heh, yes, that’s something that should be mentioned in the doc comment as well. (Especially comments on how to safely use it.)
Depends on the size specified on the CPU side when launching the gpu-kernel. It may or it may not.
There are “higher-level APIs” like “do a fast matrix-matrix multiplication”, but not much in-between. I’d assume that people usually use this in its raw form.
Two general use cases are:
1) All threads in a group load a part from global memory (the RAM/VRAM) and store it in shared memory. Then all threads read from the collaboratively loaded data.
2) All threads in a group do some work and collaborate on shared memory (with atomics or so) to aggregate results. Then one of the threads stores the final result to global memory.
So, shared memory is meant to be accessed collaboratively and the developer must ensure proper synchronization. It is hard to provide a safe abstraction for this and tbh, I don’t want to try 😅 (though I can see 3rd party crates doing this – at least to some extent). From Rust’s perspective, guarantees should be the same as with memory that’s shared between processes.
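To make the two use cases concrete for readers without a GPU background, here is a rough CPU-side simulation. It is only a sketch: OS threads stand in for the threads of one workgroup, an array of atomics stands in for workgroup memory, and `simulate_workgroup` and `GROUP_SIZE` are hypothetical names that do not appear in this PR. It says nothing about how the intrinsic itself is called.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Barrier;
use std::thread;

// Hypothetical number of threads in the simulated workgroup.
const GROUP_SIZE: usize = 4;

/// Simulates one workgroup on the CPU. `global` stands in for global memory.
fn simulate_workgroup(global: &[u32; GROUP_SIZE]) -> u32 {
    // Stand-in for workgroup memory: visible to every thread of the group.
    let shared: [AtomicU32; GROUP_SIZE] = std::array::from_fn(|_| AtomicU32::new(0));
    let total = AtomicU32::new(0);
    // Stand-in for the final result written back to global memory.
    let result = AtomicU32::new(0);
    let barrier = Barrier::new(GROUP_SIZE);

    thread::scope(|s| {
        for tid in 0..GROUP_SIZE {
            let (shared, total, result, barrier) = (&shared, &total, &result, &barrier);
            s.spawn(move || {
                // Use case 1: every thread loads its part into shared memory.
                shared[tid].store(global[tid], Ordering::Relaxed);
                // Synchronize before anyone reads the collaboratively loaded data.
                barrier.wait();
                let local_sum: u32 = shared.iter().map(|v| v.load(Ordering::Relaxed)).sum();

                // Use case 2: aggregate per-thread results with an atomic.
                total.fetch_add(local_sum, Ordering::Relaxed);
                barrier.wait();
                // One thread stores the final result.
                if tid == 0 {
                    result.store(total.load(Ordering::Relaxed), Ordering::Relaxed);
                }
            });
        }
    });

    result.load(Ordering::Relaxed)
}
```

Calling `simulate_workgroup(&[1, 2, 3, 4])` returns 40: each of the four threads sums the shared data to 10 and adds it to the total. The barriers are what "the developer must ensure proper synchronization" refers to; without them the reads could race with the stores.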
I agree, it would be nice to have good documentation for the intrinsics in Rust!
Wait, there's a single static size set when launching the kernel? Why is it called "dynamic" memory? "dynamic" memory usually means […] Are you saying dynamic shared memory is neither dynamic in the normal sense nor shared in the normal sense? ;)
r? @RalfJung
I won't be able to do the final approval here, I can just help with ensuring that the intrinsics are documented well enough that they can be understood without GPU expertise, and that the LLVM codegen looks vaguely reasonable. I don't know if we have anyone who actually knows what the generated LLVM IR should look like and can ensure it makes sense. r? @nikic maybe?
@bors try jobs=x86_64-gnu-nopt,x86_64-gnu-debug
Add intrinsic for launch-sized workgroup memory on GPUs
try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-debug
Force-pushed from 6236dd9 to ceb1be7.
Rebased to fix merge conflicts in a […]
If Jubilee doesn't get to it by next week and doesn't mind, Marcelo and I can also have a look then.
Force-pushed from ceb1be7 to 8a95ca4.
Fixed merge conflict in another use statement.
Force-pushed from 8a95ca4 to 1f7a58d.
Fixed another merge conflict in use statement.
Force-pushed from 1f7a58d to 752c152.
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed; this note is just to help reviewers.
And same use statement again! :D
Force-pushed from 752c152 to 1270c5d.
Workgroup memory is a memory region that is shared between all threads in a workgroup on GPUs. Workgroup memory can be allocated statically or after compilation, when launching a gpu-kernel. The intrinsic added here returns the pointer to the memory that is allocated at launch-time.

# Interface

With this change, workgroup memory can be accessed in Rust by calling the new `gpu_launch_sized_workgroup_mem<T>() -> *mut T` intrinsic.

It returns the pointer to workgroup memory, guaranteeing that it is aligned to at least the alignment of `T`. The pointer is dereferenceable for the size specified when launching the current gpu-kernel (which may be the size of `T` but can also be larger or smaller or zero).

All calls to this intrinsic return a pointer to the same address.

See the intrinsic documentation for more details.

## Alternative Interfaces

It was also considered to expose dynamic workgroup memory as extern static variables in Rust, like they are represented in LLVM IR. However, due to the pointer not being guaranteed to be dereferenceable (that depends on the allocated size at runtime), such a global must be zero-sized, which makes global variables a bad fit.

# Implementation Details

Workgroup memory in amdgpu and nvptx lives in address space 3. Workgroup memory from a launch is implemented by creating an external global variable in address space 3. The global is declared with size 0, as the actual size is only known at runtime. It is defined behavior in LLVM to access an external global outside the defined size.

There is no similar way to get the allocated size of launch-sized workgroup memory on amdgpu and nvptx, so users have to pass this out-of-band or rely on target specific ways for now.
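Since the size has to travel out-of-band, a caller would typically combine the raw pointer with a byte count that the host passed separately (for example as a kernel argument). The following is a hedged sketch of that idea that runs on the CPU: `workgroup_slice` and `demo` are hypothetical helpers, not part of this PR, and an ordinary stack array stands in for the launch-sized allocation.

```rust
use std::mem::{align_of, size_of};
use std::slice;

/// Hypothetical helper: view `len_bytes` bytes starting at `base` as a slice of `T`.
/// Returns `None` for zero-sized `T`, a null pointer, or a misaligned pointer.
/// A launch size that is not a multiple of `size_of::<T>()` is rounded down.
///
/// # Safety
/// `base` must be valid for reads and writes of `len_bytes` bytes for `'a`,
/// and nothing else may access that memory while the slice is alive.
unsafe fn workgroup_slice<'a, T>(base: *mut u8, len_bytes: usize) -> Option<&'a mut [T]> {
    if size_of::<T>() == 0 || base.is_null() || (base as usize) % align_of::<T>() != 0 {
        return None;
    }
    let count = len_bytes / size_of::<T>();
    Some(unsafe { slice::from_raw_parts_mut(base.cast::<T>(), count) })
}

/// Assumption for illustration: a 32-byte, 8-aligned stack array plays the
/// role of workgroup memory, and the "launch size" is 10 bytes.
fn demo() -> (usize, u32) {
    let mut backing = [0u64; 4];
    let base = backing.as_mut_ptr().cast::<u8>();
    let words = unsafe { workgroup_slice::<u32>(base, 10) }.expect("pointer is aligned");
    for (i, w) in words.iter_mut().enumerate() {
        *w = i as u32 + 1;
    }
    (words.len(), words.iter().sum())
}
```

Here `demo()` returns `(2, 3)`: a 10-byte launch size holds two whole `u32` values, which are set to 1 and 2. The point of the sketch is only that the element count comes from the out-of-band size and never from `T`.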
Force-pushed from 1270c5d to acdf598.
Thanks for taking the time and working through all the feedback!
@Sa4dUs and I reviewed the IR and it does look like what we were expecting. From LLVM 23 onwards, we should now also have the same IR for both vendors, which is nice. For testing, I also built a minimal frontend for it and wrote a shared-memory matmul on top of it, which worked like a charm. Just to summarize the previous discussion for those who got lost in ~140 comments:
All previous questions/feedback (especially from Jubilee as the last reviewer) were addressed, and code, design and documentation look reasonable to Marcelo and me. Individual aspects were also thoroughly reviewed by others before: #146181 (comment).
I'm happy to make the final sign-off. I was told it's conventional to add everyone involved in the partial reviewing, so let's see who is on a team. I added everyone from #146181 (comment), with the exception of Flakebi, who's the author.
cc @workingjubilee @RalfJung @nikic @kjetilkjeka @kulst
@bors r=ZuseZ4,Sa4dus,workingjubilee,RalfJung,nikic,kjetilkjeka,kulst
Yeah that seems fine as long as it is properly documented.
/// allocated at launch-time.
/// All calls to `gpu_launch_sized_workgroup_mem` in a workgroup, independent of the
/// generic type, return the same address, so alias the same memory.
/// The returned pointer is aligned by at least the alignment of `T`.
@RalfJung I don't think there is anything in these docs that currently supports this assumption, so I wasn't particularly concerned about:
Hmm. I guess I'm worrying about someone calling it with ::<u8>, casting the pointer, and then assuming that a call elsewhere (perhaps in a library they are depending on?) will enforce the alignment they want, but that call might get subject to DCE or other "non-compilation events".
The docs only tie the alignment of "[this]" returned pointer to "[this]" T, and Rust also isn't really known for spooky actions at a distance that would support other interpretations. But if both you and Jubilee are concerned, we can also be more explicit. Do you prefer this (please feel free to suggest better wording)?
/// The returned pointer is aligned by at least the alignment of `T`.
/// No stronger alignment guarantee is provided.
/// In particular, callers may not rely on one invocation of
/// `gpu_launch_sized_workgroup_mem` to affect the alignment of a pointer
/// returned by another invocation.
The docs only tie the alignment of "[this]" returned pointer to "[this]" T, and Rust also isn't really known for spooky actions at a distance that would support other interpretations.
This intrinsic adds spooky action at a distance, that's why I am so concerned. ;) All invocations of the intrinsic return the same pointer, so they magically affect each other in terms of alignment.
callers may not rely on one invocation of
/// `gpu_launch_sized_workgroup_mem` to affect the alignment of a pointer
/// returned by another invocation.
This is somewhat contradicting the statement that they all return the same address.
I'd propose something like:
If gpu_launch_sized_workgroup_mem is invoked multiple times with different types that have different alignment, then you may only rely on the resulting pointer having the alignment of T after a call to gpu_launch_sized_workgroup_mem::<T> has occurred in the current program execution.
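One way to read the proposed wording is as bookkeeping over calls that have already happened. The sketch below models only that reading; it does not model the hardware or the intrinsic. `AlignmentKnowledge`, `A4`, `A8` and `walk_through` are hypothetical names, and the marker types are used instead of `u32`/`u64` so the alignments are 4 and 8 on every target.

```rust
use std::mem::align_of;

// Marker types with fixed alignment, standing in for `u32` and `u64`.
#[repr(align(4))]
struct A4;
#[repr(align(8))]
struct A8;

/// Hypothetical model: the alignment a caller may rely on, based only on
/// calls that have already occurred in the current execution.
struct AlignmentKnowledge {
    established: usize,
}

impl AlignmentKnowledge {
    fn new() -> Self {
        // Before any call, nothing beyond byte alignment is known.
        Self { established: 1 }
    }

    /// Records that a call with type parameter `T` has returned.
    fn record_call<T>(&mut self) {
        self.established = self.established.max(align_of::<T>());
    }

    /// Alignments are powers of two, so `<=` is enough here.
    fn may_rely_on(&self, align: usize) -> bool {
        align <= self.established
    }
}

fn walk_through() -> [bool; 3] {
    let mut known = AlignmentKnowledge::new();
    known.record_call::<A4>();
    let eight_before = known.may_rely_on(8); // no 8-aligned call has happened yet
    let four = known.may_rely_on(4);
    known.record_call::<A8>();
    let eight_after = known.may_rely_on(8);
    [eight_before, four, eight_after]
}
```

`walk_through()` returns `[false, true, true]`: 8-byte alignment becomes usable only once the 8-aligned call has actually happened, regardless of what later code might do.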
The two properties guaranteed by the intrinsic are
- the returned pointer has at least the alignment of T and
- all invocations within a workgroup return the same pointer.
That allows a bunch of implications, but I don’t think they are important. The core goal is, you want to use launch-sized workgroup mem, you call the intrinsic with your needed alignment, you use the pointer. That’s it, nothing else needed, no other derived guarantees used.
In other words, I do not see a use-case for (ab)using this action at a distance.
If gpu_launch_sized_workgroup_mem is invoked multiple times with different types that have different alignment, then you may only rely on the resulting pointer having the alignment of T after a call to gpu_launch_sized_workgroup_mem::<T> has occurred in the current program execution.
The two properties allow inferring even wider guarantees. If gpu_launch_sized_workgroup_mem is invoked with a certain alignment, in any execution within the same workgroup, every other call to gpu_launch_sized_workgroup_mem in that workgroup at any time before or after is guaranteed to receive at least this alignment.
The calls that “observe” the action at a distance do not need to be in the same thread of execution, nor do they need to be after the “observed” call.
The two core guarantees are written down in the docs. If there is no use-case for such inferred guarantees (I cannot think of any), I fear that writing down inferred guarantees in the docs adds more confusion than it helps.
(If I read something like this in the docs, it would leave me wondering if there is an intended use-case for this and if I am supposed to hold it differently.)
In other words, I do not see a use-case for (ab)using this action at a distance.
That's great. But other people will read these docs, notice the implications, and if it even remotely fits their usecase they will (ab)use everything they can find. If there are implications of our spec, or things that seem like implications, that we don't actually intend to be used or guaranteed, then we can't just hope that people will not use them. We have to make it explicit, or someone will use them.
If gpu_launch_sized_workgroup_mem is invoked with a certain alignment, in any execution within the same workgroup, every other call to gpu_launch_sized_workgroup_mem in that workgroup at any time before or after is guaranteed to receive at least this alignment.
This spec is extremely problematic. We do not allow time travel in Rust; time travel usually leads to semantic contradictions. That's why I insist on a clarification like what I described: we do absolutely not want code that might be executed in the future to affect the reasoning I am allowed to do here and now.
If there truly is no usecase for such "time travel" use of the intrinsic, then my proposed clarification should be uncontroversial.
Not sure I get that. For me, what I wrote follows logically from the two properties (alignment + all invocations return the same pointer).
We cannot do anything to prevent that (any implementation that would not satisfy the wide guarantee can never be correct).
Let me try to explain with an example (pseudo-code):
fn main() {
    let p = gpu_launch_sized_workgroup_mem::<u32>();
    // As I understand it, you say we should declare that this assert can fail
    assert!(p is aligned to at least 8 bytes);
    let p2 = gpu_launch_sized_workgroup_mem::<u64>();
    // This is guaranteed to be true
    assert!(p2 is aligned to at least 8 bytes);
    assert_eq!(p, p2);
}
Given that p2 must be aligned to at least 8 bytes and p2 == p, I can’t imagine any implementation where the first assert is allowed to fail.
Or, declaring that it can fail would contradict the other guarantees we give.
Am I missing something here?
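The implementation-side argument in the pseudo-code can be reproduced on the CPU with a stand-in. This is a sketch under loud assumptions: `launch_sized_mem_model` is a hypothetical function backed by a 16-byte-aligned static, chosen so that every call returns the same address. It shows why no implementation with a single fixed address can make the first check fail; it does not settle what callers are allowed to assume.

```rust
use std::mem::align_of;

// Assumption: one fixed, generously aligned allocation plays the role of
// launch-sized workgroup memory. It is never written through.
#[repr(align(16))]
struct Backing([u8; 64]);

static BACKING: Backing = Backing([0; 64]);

/// Hypothetical stand-in for the intrinsic: same address for every `T`,
/// aligned to at least `align_of::<T>()`.
fn launch_sized_mem_model<T>() -> *mut T {
    assert!(align_of::<T>() <= align_of::<Backing>());
    std::ptr::addr_of!(BACKING) as *mut T
}

/// Mirrors the three asserts from the pseudo-code above.
fn check_example() -> bool {
    let p = launch_sized_mem_model::<u32>();
    let first = (p as usize) % 8 == 0;
    let p2 = launch_sized_mem_model::<u64>();
    let second = (p2 as usize) % 8 == 0;
    first && second && p as usize == p2 as usize
}
```

In this model all three checks hold, because the address is fixed before either call runs.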
all invocations
The problem is defining what exactly "all invocations" are. Only the invocations that are actually executed in this run of the program matter. And since it's impossible to tell whether the program will actually reach an invocation further down the code, I think we want to be very sure to exclude any reasoning "elsewhere / in the future, this invocation exists, and hence ...".
@Flakebi I think what Ralf is trying to say, is that your definition, with just the two points above, is too strong to be soundly expressed in Rust. As you point out, you can derive a lot of things, including Ralf's extension to the docs. We also know that with today's LLVM, we could never break his extension to the docs.
But explaining how the alignment within a workgroup is affected by multiple calls is hard (impossible) to do without using time-travelling, which is ~prohibited in Rust.
Then again, time-travelling is totally fine in code that is UB (afaik). So, by just making it UB to argue over the alignment based on later calls, we are now allowed to use our time-travelling implementation.
Implementation-wise, nothing would change on our side. We just prohibit Rust devs who peeked into LLVM to use that internal knowledge.
UB currently can time-travel but we'd like to rein that in.
Anyway the approach here isn't really about introducing UB. It's about saying that there is only one way to actually know that a certain function call will happen in a certain program execution: knowing that it already happened (e.g. because we have a value that was returned by that function).
I think this program should be considered invalid (potential-UB):
fn main() {
    let p = gpu_launch_sized_workgroup_mem::<u32>();
    std::hint::assert_unchecked(p is aligned to at least 8 bytes);
    let p2 = gpu_launch_sized_workgroup_mem::<u64>();
}
The UB execution is one that for some reason aborts the program before the 2nd call, therefore never invoking gpu_launch_sized_workgroup_mem a 2nd time, therefore never raising the alignment to that point.
We don't (currently) allow executions to just randomly abort... except we kind of do since stack overflows could occur any time... but allowing any kind of UB-relevant reasoning based on "this will definitely happen" is sufficiently suspicious that I think we should only allow it if we have a really good reason for it.
Tracking issue: #135516