Add intrinsic for launch-sized workgroup memory on GPUs #146181
Flakebi wants to merge 1 commit into rust-lang:main
Conversation
rustbot has assigned @petrochenkov. Use `r?` to explicitly pick a reviewer.
Some changes occurred in src/tools/compiletest cc @jieyouxu
Some changes occurred in compiler/rustc_codegen_ssa
Some changes occurred to the intrinsics. Make sure the CTFE / Miri interpreter is adapted for the changes, if necessary.
Force-pushed from 0aa0e58 to 3ebaccb
Force-pushed from 3ebaccb to 2378959
library/core/src/intrinsics/mod.rs (outdated)
```rust
#[rustc_nounwind]
#[unstable(feature = "dynamic_shared_memory", issue = "135513")]
#[cfg(any(target_arch = "amdgpu", target_arch = "nvptx64"))]
pub fn dynamic_shared_memory<T: ?Sized>() -> *mut T;
```
Note that outside the GPU world, "shared memory" typically refers to memory shared between processes. So I would suggest using a name that's less likely to be confused, like something that explicitly involves "GPU" or so.
This sounds like a form of "global" memory (similar to a static item), but then apparently OpenCL calls it "local" which is very confusing...
Does it make sense to add a `mod gpu`?
I think there are more GPU intrinsics that can be added (although more in the traditional intrinsic sense, relating to an instruction; edit: re-exposing intrinsics from core::arch::nvptx and the amdgpu equivalent).
Or should it be in core::arch::gpu?
(From #135516 (comment), cc @workingjubilee)
Rust intrinsic names are not namespaced. They are exposed in a module, but inside the compiler they are identified entirely by their name. So moving them into a different module doesn't alleviate the need for a clear name that will be understandable to non-GPU people working in the compiler (which is the vast majority of compiler devs).
If there's more GPU intrinsics to come, moving them into a gpu.rs file here still might make sense.
I don't have a strong opinion on how the eventually stable public API is organized; I am commenting entirely as someone who has an interest in keeping the set of intrinsics the Rust compiler offers understandable and well-defined (the ones in this folder, not the ones in core::arch, which you call "more traditional" -- but that's very dependent on your background ;).

These intrinsics are just an implementation detail, but every intrinsic we add here is a new language primitive -- it's like adding a new keyword, just without the syntax discussions and perma-unstable. In the past we used to have intrinsics that entirely break the internal consistency of the language, and we used to have intrinsics whose safety requirements were very poorly documented.
Sorry for drowning you in questions here, but extending the core language with new operations (as in, adding a new intrinsic doing things that couldn't be done before) is a big deal, and we had a bad experience in the past when this was done without wider discussion in the team to ensure that the intrinsics actually make sense in the context of Rust. Not everything that exists in the hardware can be 1:1 exposed in Rust; sometimes this requires a lot of work and sometimes it's just basically impossible. It can be a lot of work to clean these things up later, and as someone who did a bunch of that work, I'd rather not have to do it again. :)
I agree that it makes a lot of sense to have the discussion now. Thanks for taking a look and helping to design something useful!
Heh, yes, that’s something that should be mentioned in the doc comment as well. (Especially comments on how to safely use it.)
Depends on the size specified on the CPU side when launching the gpu-kernel. It may or it may not.
There are “higher-level APIs” like “do a fast matrix-matrix multiplication”, but not much in-between. I’d assume that people usually use this in its raw form.
Two general use cases are (see the sketch below):

1. All threads in a group load a part from global memory (the RAM/VRAM) and store it in shared memory. Then all threads read from the collaboratively loaded data.
2. All threads in a group do some work and collaborate on shared memory (with atomics or so) to aggregate results. Then one of the threads stores the final result to global memory.

So, shared memory is meant to be accessed collaboratively and the developer must ensure proper synchronization. It is hard to provide a safe abstraction for this and tbh, I don't want to try 😅 (though I can see 3rd party crates doing this – at least to some extent). From Rust's perspective, guarantees should be the same as with memory that's shared between processes.
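To make use case 1 concrete, a minimal sketch, not part of this PR: `workgroup_barrier` is a hypothetical stand-in for a target barrier (such as `_syncthreads` on nvptx), the launch is assumed to have allocated at least `TILE * size_of::<f32>()` bytes, and the intrinsic name matches the revision under discussion here.

```rust
#![feature(core_intrinsics, dynamic_shared_memory)]
use core::intrinsics::dynamic_shared_memory;

const TILE: usize = 256;

/// Hypothetical stand-in for a target-specific barrier
/// (e.g. `_syncthreads` on nvptx).
unsafe fn workgroup_barrier() { /* target-specific */ }

/// Use case 1: all threads collaboratively load a tile, then all read it.
/// Assumes the launch allocated at least `TILE * size_of::<f32>()` bytes.
unsafe fn tile_sum(global: *const f32, thread_id: usize) -> f32 {
    // Every call returns the same launch-allocated region, aligned for `f32`.
    let tile: *mut f32 = dynamic_shared_memory::<f32>();
    // Each thread stores one element into workgroup memory...
    *tile.add(thread_id) = *global.add(thread_id);
    // ...and all threads synchronize before anyone reads.
    workgroup_barrier();
    // Now every thread can read the whole collaboratively loaded tile.
    let mut sum = 0.0;
    for i in 0..TILE {
        sum += *tile.add(i);
    }
    sum
}
```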
I agree, it would be nice to have good documentation for the intrinsics in Rust!
Wait, there's a single static size set when launching the kernel? Why is it called "dynamic" memory? "dynamic" memory usually means memory that gets allocated and freed at runtime, with varying sizes. Are you saying dynamic shared memory is neither dynamic in the normal sense nor shared in the normal sense? ;)
r? @RalfJung
I won't be able to do the final approval here; I can just help with ensuring that the intrinsics are documented well enough that they can be understood without GPU expertise, and that the LLVM codegen looks vaguely reasonable. I don't know if we have anyone who actually knows what the generated LLVM IR should look like and can ensure it makes sense. r? @nikic maybe?
I pushed the rename. Sorry for breaking nvptx names again. I tried to add a test for this, but the failure only shows up in ptxas, so I didn't manage to.
I made the llvm PR: llvm/llvm-project#173018. I also tested your branch with the fix and that works great! The ptx is arguably nicer, too.

I also think the name in the current revision is a nice improvement, but in case the reviewers don't like it, the other alternatives are fine as well. All in all this PR looks pretty great from my point of view. I wouldn't expect @RDambrosio016 to take a look, so there's at least no reason to block it on nvptx-related things.
Thanks for testing and for opening the LLVM PR. (You probably need a lit test for people to approve it.)
Just for context on why I removed the name and why I think it should eventually be removed for nvptx as well: With the named global, the maximum alignment is enforced for all kernels in the IR module, which is unnecessarily conservative. With the unnamed global, different kernels in the same module can end up with different minimum alignments, depending on the calls they make. IMO the latter behavior is what we want.
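A sketch of the difference (the IR in the comments is illustrative of the intended lowering, not verified compiler output; the intrinsic name matches the current revision):

```rust
#![feature(core_intrinsics, gpu_launch_sized_workgroup_mem)]
use core::intrinsics::gpu_launch_sized_workgroup_mem;

unsafe fn kernel_a() {
    // Per-call anonymous global: lowers to roughly
    //   @0 = external addrspace(3) global [0 x i8], align 4
    let _p: *mut f32 = gpu_launch_sized_workgroup_mem::<f32>();
}

unsafe fn kernel_b() {
    // Gets its own global with its own minimum alignment:
    //   @1 = external addrspace(3) global [0 x i8], align 8
    // With a single *named* global instead, both kernels would share one
    // declaration, forcing kernel_a up to align 8 as well.
    let _p: *mut u64 = gpu_launch_sized_workgroup_mem::<u64>();
}
```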
☔ The latest upstream changes (presumably #150448) made this pull request unmergeable. Please resolve the merge conflicts.
Force-pushed from c9c04f0 to d45fd86
Rebased to fix conflicts, no other changes.
This should be good to merge.
@RalfJung, are you ok with giving the final sign-off based on these approvals?
Given how disconnected I am from the implementation work, I am not comfortable doing the final review here. @nikic is assigned; @workingjubilee might also be willing to take over based on their statements in other PRs.
Force-pushed from d45fd86 to cc300ab
Rebased to fix conflicts, no other changes.
Force-pushed from cc300ab to 3541dd4
This PR was rebased onto a different main commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed; this note is just to help reviewers.
Rebased to fix conflicts, no other changes. Maybe r? @workingjubilee?
I am indeed happy to take this review as long as you are happy to be patient. ^^; |
"Per-launch-sized workgroup memory". What a mouthful.
...Yeah. The intrinsic is indeed named correctly but... what a name. But! I am very much not going to suggest alternatives. If all the maintainers are happy enough with this name, it is good by me.
I see that we have further discussed the possibility of the alternative interface I suggested, using the close variant of extern { static }, and rejected it. The reasons for rejecting it make sense to me, and I am happy to see that we inspected it a bit further since I feel like the best case for some variation on the alternative I was thinking of was indeed made aptly enough by @Flakebi. Maybe the language should get some way to express this concept more aptly, but not today.
This looks good overall, but there's some stuff you should do to make sure this passes CI, and I would like to see a small amount of bonus annotation for maintainability.
```rust
#[rustc_nounwind]
#[unstable(feature = "gpu_launch_sized_workgroup_mem", issue = "135513")]
#[cfg(any(target_arch = "amdgpu", target_arch = "nvptx64"))]
pub fn gpu_launch_sized_workgroup_mem<T>() -> *mut T;
```
We did indeed decide to have a `mod gpu`; can this be moved there?
```rust
// Checks that the GPU intrinsic to get launch-sized workgroup memory works.

//@ revisions: amdgpu nvptx
//@ compile-flags: --crate-type=rlib
//
//@ [amdgpu] compile-flags: --target amdgcn-amd-amdhsa -Ctarget-cpu=gfx900
//@ [amdgpu] needs-llvm-components: amdgpu
//@ [nvptx] compile-flags: --target nvptx64-nvidia-cuda
//@ [nvptx] needs-llvm-components: nvptx
//@ add-minicore
```
As this is a codegen-llvm test, you must specify an opt level, or add revisions for multiple opt levels, if you want this to pass CI.
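For example, the single-opt-level variant could look like this (assuming one opt level is enough for what the test checks; the exact level is a judgment call):

```rust
//@ compile-flags: --crate-type=rlib -Copt-level=3
```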
Suggested change:
```rust
// Checks that the GPU intrinsic to get launch-sized workgroup memory works
// and correctly aligns the `external addrspace(...) global`s over multiple calls.
```
At first I kind of wanted there to be another test that does the cross-crate version of this. Then I remembered what was discussed elsewhere: that the targets in question are pure LLVM bitcode that gets mashed together anyways, so I am not sure it would actually benefit us, and it would probably involve a ton of tedium with run-make, having considered it in more detail. So, meh.
Basically only leaving this note here to remind myself that if this turns out to go awry in the future, I can update in the direction of following this kind of instinct more often. :^)
```cpp
                                     size_t NameLen,
                                     LLVMTypeRef Ty) {
  Module *Mod = unwrap(M);
  unsigned AddressSpace = Mod->getDataLayout().getDefaultGlobalsAddressSpace();
```
I keep forgetting this "helpful" C "abbreviation", so let's expand it:
Suggested change:
```cpp
  unsigned int AddressSpace = Mod->getDataLayout().getDefaultGlobalsAddressSpace();
```
| extern "C" LLVMValueRef | ||
| LLVMRustGetOrInsertGlobalInAddrspace(LLVMModuleRef M, const char *Name, | ||
| size_t NameLen, LLVMTypeRef Ty, | ||
| unsigned AddressSpace) { |
Suggested change:
```cpp
                                     unsigned int AddressSpace) {
```
| extern "C" LLVMValueRef | ||
| LLVMRustGetOrInsertGlobalInAddrspace(LLVMModuleRef M, const char *Name, |
If this function will always create a specifically-extern global, can we document that here? The name is whatever, to me.
```rust
// The name of the global variable is not relevant, the important properties are.
// 1. The global is in the address space for workgroup memory
// 2. It is an extern global
// All instances of extern addrspace(gpu_workgroup) globals are merged in the LLVM backend.
// Generate an unnamed global per intrinsic call, so that different kernels can have
// different minimum alignments.
```
Lead with what is important, instead of leading with "don't look behind this curtain". Also, the alignment is an important property.
Suggested change:
```rust
// Generate an anonymous global per call, with these properties:
// 1. The global is in the address space for workgroup memory
// 2. It is an `external` global
// 3. It is correctly aligned for the pointee `T`
// All instances of extern addrspace(gpu_workgroup) globals are merged in the LLVM backend.
// The name is irrelevant.
```
```rust
/// # Safety
///
/// The pointer is safe to dereference from the start (the returned pointer) up to the
/// size of workgroup memory that was specified when launching the current gpu-kernel.
```
One small add maybe: It seems beneficial to explicitly deny this relationship? At your discretion.
Suggested change:
```rust
/// size of workgroup memory that was specified when launching the current gpu-kernel.
/// This allocated size is not related in any way to `T`.
```
Otherwise this documentation is fine now. Thank you!
@bors try jobs=x86_64-gnu-nopt,x86_64-gnu-debug
Add intrinsic for launch-sized workgroup memory on GPUs
try-job: x86_64-gnu-nopt
try-job: x86_64-gnu-debug
💔 Test for 42a3a3f failed: CI.
Workgroup memory is a memory region that is shared between all threads in a workgroup on GPUs. Workgroup memory can be allocated statically or after compilation, when launching a gpu-kernel. The intrinsic added here returns the pointer to the memory that is allocated at launch time.

# Interface

With this change, workgroup memory can be accessed in Rust by calling the new `gpu_launch_sized_workgroup_mem<T>() -> *mut T` intrinsic. It returns the pointer to workgroup memory, guaranteeing that it is aligned to at least the alignment of `T`. The pointer is dereferenceable for the size specified when launching the current gpu-kernel (which may be the size of `T` but can also be larger or smaller or zero). All calls to this intrinsic return a pointer to the same address. See the intrinsic documentation for more details.

## Alternative Interfaces

It was also considered to expose dynamic workgroup memory as extern static variables in Rust, like they are represented in LLVM IR. However, because the pointer is not guaranteed to be dereferenceable (that depends on the allocated size at runtime), such a global must be zero-sized, which makes global variables a bad fit.

# Implementation Details

Workgroup memory on amdgpu and nvptx lives in address space 3. Workgroup memory from a launch is implemented by creating an external global variable in address space 3. The global is declared with size 0, as the actual size is only known at runtime. It is defined behavior in LLVM to access an external global outside the defined size.

There is no similar way to get the allocated size of launch-sized workgroup memory on amdgpu and nvptx, so users have to pass this out-of-band or rely on target-specific ways for now.

Tracking issue: #135516
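To close, a hedged sketch of the interface described above (the `workgroup_size_bytes` kernel argument is an assumed out-of-band channel for the allocated size, not part of this PR):

```rust
#![feature(core_intrinsics, gpu_launch_sized_workgroup_mem)]
use core::intrinsics::gpu_launch_sized_workgroup_mem;
use core::mem::size_of;

/// The allocated size is unrelated to `T` and must arrive out of band;
/// here it is assumed to be passed as an ordinary kernel argument.
unsafe fn kernel(workgroup_size_bytes: usize) {
    // Aligned for `u32`; dereferenceable for `workgroup_size_bytes` bytes.
    let mem: *mut u32 = gpu_launch_sized_workgroup_mem::<u32>();
    // Stay within the launch-allocated size, which may even be zero.
    let len = workgroup_size_bytes / size_of::<u32>();
    for i in 0..len {
        *mem.add(i) = 0;
    }
}
```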