Add sycl_khr_free_function_commands extension#922
Conversation
This extension provides an alternative mechanism for submitting commands to a device via free-functions that require developers to opt-in to the creation of event objects. It also proposes alternative names for several commands (e.g., launch) and simplifies some concepts (e.g., by removing the need for the nd_range class).
Previous "0 or more" wording only made sense when reductions could be optionally provided to functions like parallel_for; now that there are dedicated *_reduce functions, at least one reduction is required.
"is" is more consistent with ISO C++ wording.
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
There is no need to constrain T here because T must be device-copyable in order to construct the accessor passed as an argument.
Renaming sycl::nd_item is not a necessary part of the API redesign for submitting work, so it should be moved to its own extension. This will also give us more time to consider the design and naming of any proposed replacement(s), including how they should interact with new functionality proposed in other KHRs.
There are currently no backends that define interop for reductions, so we can remove these functions for now. If we decide later that these functions are necessary, we can release a revision of the KHR.
Co-authored-by: Andrey Alekseenko <al42and@gmail.com>
…gnal_event function.
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
|
Thanks a bunch for the efforts to push this to finish line. My recollection, prefetch_host is sparingly used ATM. And these few apps (qmcpack, exachem, etc) use it at production-scale. My two-cents is towards free function standardization of this API just to reduce the cycles over non-free |
There was a problem hiding this comment.
Very nice! This extension will solve many pain point that we have with our current SYCL usage, it looks great.
I just had kind of neatpick. In the examples i did not find guidelines related to the use of local memory or anything that require the sycl::handler since this extension hides it away. Is the use of free function for commands limited to kernels that do not employ local memory ? If yes are there plans to lift such limit ?
Also i don't recall seeing any named kernel in the proposed extension. Can we name them for profiling (#1002 maybe ?) ?
Yes. the working group has discussed this. We plan to add a KHR similar to this oneAPI extension: And I think we might consider a KHR similar to this as well: |
|
Thanks for the clarification. I don't really like the approach as one has to rebuild aliases every time to use it. It looks overly complex for what it aims to achieve and also probably for people coming from other portability solutions / CUDA & HIP, ... What is the motivation behind that design rather than the second one ? This one on the other hand would be wonderful to have. And if we want to use it as scratch we can always declare a char[] local as one would do in e.g. CUDA. Also i would favor it since it is the closest to the other GPU backends. |
Why do you think it's more complex or harder to use than the CUDA API where you declare an unbounded array? We actually designed this API as a cleaner version of the CUDA one. One thing we don't like about the CUDA interface is that multiple unbounded array declarations alias each other, which seems very non-obvious. For example: This looks like two different arrays of shared memory (work-group local memory). However, the two arrays We think the proposed SYCL API makes this less surprising because there's just a single global function that returns a pointer to the start of the dynamically-sized local memory.
What do you mean by "rebuild aliases"? Do you mean that you need to cast the pointer to a particular type? Would it help if we made the function templated like this: Then, you would use it like: |
|
Oh i see now why. It's hard to always remember what is what between CUDA and SYCL. My confusion was related to "Static Shared Memory" in CUDA. which does resemble a lot more the second extension and is the most used one in CUDA. Whereas the "Dynamic Shared Memory" variant corresponding work_group_scratch_memory is quite rare in the type of apps i'm working with.
Yup, this syntax is much cleaner i think, with the static cast it would have made a very long line. int* pscratch = sycl::khr::get_work_group_scratch_memory<int>();which is already 65 chars so it allows for 3 level of indentation (4 space) before passing the 80 columns mark. We are still forced to do some weird business though if we have multiple types similarely to cuda (there is probably some memory alignment to account for but not yet 😅 ) extern __shared__ int s[];
int *integerData = s; // nI ints
float *floatData = (float*)&integerData[nI]; // nF floats
char *charData = (char*)&floatData[nF]; // nC charsI don't know if there is a clean way to handle this ... maybe this ? But a clean wrapper that does it automatically would be wonderfull. int *integerData = sycl::khr::get_work_group_scratch_memory<int>(); // nI ints
float *floatData = reinterpret_cast<float>(integerData + nI); // nF floats
char *charData = reinterpret_cast<char>(floatData + nF); // nC charsThat probably goes beyond the scope of the discussion of this PR though 😅 PS: why name it scratch instead of local in the name since it refers to local memory ? |
I think you'd need to do something similar to what you show. I'm not sure about providing a "clean wrapper". I think a wrapper like this would be very complicated, and it would be easier to just do the pointer math like you show.
I think we were trying to use the term "work-group memory" instead of "local memory" because that seemed clearer. Therefore, we were trying to avoid using the word "local" in the API name. |
I get this one, but since all of the doc and SYCL tutorials outline it as local memory (= work group local memory) in the memory hierarchy of the model wouldn't it be too confusing to introduce a new name. Also is shorter 😛 |
|
I have implementation concerns regarding this proposal. In particular, I'm concerned about the ability for implementations that use separate host and device compilation steps (like DPC++) to correlate kernel definitions across host and device compilation in a way that doesn't results in spurious kernel identification conflicts. See #893 for examples where the type of a named function object is insufficient to differentiate use in distinct kernel invocation calls. @TApplencourt, I'd like to resume kernel naming/identification discussions before this proposal moves forward. |
|
Totally agree, the naming/identification discussions are support important! We should see if we can use For |
I'm happy to provide a simple presentation of my concerns if you want to schedule it for a future SYCL WG meeting (not tomorrow, please).
That isn't wrong; it can be layered on top, but naively doing so preserves the many problems we already have with kernel naming. If we are going to provide new kernel invocation interfaces, I'd like to try to fix as many of those issues as we can. I would put #893 at the top of the list of issues to resolve (especially with respect to class types that can be used as multiple kinds of kernels as in the examples with overloaded member call operators). Next up would be #900 and #858 since the inability to specify an explicit kernel name imposes a requirement on the implementation to do so which prevents some implementation strategies currently in use (e.g., use of a SYCL unaware host compiler in conjunction with a device compiler). Issue #901 would then be next on the list. |
|
Oh I see. We should take the opportunity of having a new launch API to fix implementation problem here. That make sense.
You start to know me, and avoid my traps :( |
This is a new, follow-up PR to #644, originally created by John Pennycook. All the future work related to that PR will be continued here. The reason for creating a new PR is that the PR ownership transfer is required.
This extension provides an alternative mechanism for submitting commands to a device via free-functions that require developers to opt-in to the creation of event objects.
It also proposes alternative names for several commands (e.g., launch) and simplifies some concepts (e.g., by removing the need for the nd_range class).