Skip to content

Add sycl_khr_free_function_commands extension#922

Open
slawekptak wants to merge 77 commits into
KhronosGroup:mainfrom
slawekptak:khr_free_function_commands_new
Open

Add sycl_khr_free_function_commands extension#922
slawekptak wants to merge 77 commits into
KhronosGroup:mainfrom
slawekptak:khr_free_function_commands_new

Conversation

@slawekptak
Copy link
Copy Markdown

This is a new, follow-up PR to #644, originally created by John Pennycook. All the future work related to that PR will be continued here. The reason for creating a new PR is that the PR ownership transfer is required.

This extension provides an alternative mechanism for submitting commands to a device via free-functions that require developers to opt-in to the creation of event objects.

It also proposes alternative names for several commands (e.g., launch) and simplifies some concepts (e.g., by removing the need for the nd_range class).

Pennycook and others added 30 commits October 17, 2024 15:00
This extension provides an alternative mechanism for submitting commands to a
device via free-functions that require developers to opt-in to the creation of
event objects.

It also proposes alternative names for several commands (e.g., launch) and
simplifies some concepts (e.g., by removing the need for the nd_range class).
Previous "0 or more" wording only made sense when reductions could be
optionally provided to functions like parallel_for; now that there are
dedicated *_reduce functions, at least one reduction is required.
"is" is more consistent with ISO C++ wording.
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
There is no need to constrain T here because T must be device-copyable in order
to construct the accessor passed as an argument.
Renaming sycl::nd_item is not a necessary part of the API redesign for
submitting work, so it should be moved to its own extension.

This will also give us more time to consider the design and naming of any
proposed replacement(s), including how they should interact with new
functionality proposed in other KHRs.
There are currently no backends that define interop for reductions,
so we can remove these functions for now. If we decide later that
these functions are necessary, we can release a revision of the KHR.
Co-authored-by: Andrey Alekseenko <al42and@gmail.com>
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
slawekptak and others added 6 commits February 5, 2026 08:57
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated
Copy link
Copy Markdown
Member

@keryell keryell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@abagusetty
Copy link
Copy Markdown

Thanks a bunch for the efforts to push this to finish line.

My recollection, prefetch_host is sparingly used ATM. And these few apps (qmcpack, exachem, etc) use it at production-scale.

My two-cents is towards free function standardization of this API just to reduce the cycles over non-free

Copy link
Copy Markdown

@tdavidcl tdavidcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! This extension will solve many pain point that we have with our current SYCL usage, it looks great.

I just had kind of neatpick. In the examples i did not find guidelines related to the use of local memory or anything that require the sycl::handler since this extension hides it away. Is the use of free function for commands limited to kernels that do not employ local memory ? If yes are there plans to lift such limit ?

Also i don't recall seeing any named kernel in the proposed extension. Can we name them for profiling (#1002 maybe ?) ?

@gmlueck
Copy link
Copy Markdown
Contributor

gmlueck commented Apr 10, 2026

In the examples i did not find guidelines related to the use of local memory or anything that require the sycl::handler since this extension hides it away. Is the use of free function for commands limited to kernels that do not employ local memory ? If yes are there plans to lift such limit ?

Yes. the working group has discussed this. We plan to add a KHR similar to this oneAPI extension:

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_scratch_memory.asciidoc

And I think we might consider a KHR similar to this as well:

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_static.asciidoc

@tdavidcl
Copy link
Copy Markdown

tdavidcl commented Apr 10, 2026

Thanks for the clarification.

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_scratch_memory.asciidoc

I don't really like the approach as one has to rebuild aliases every time to use it. It looks overly complex for what it aims to achieve and also probably for people coming from other portability solutions / CUDA & HIP, ... What is the motivation behind that design rather than the second one ?

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_static.asciidoc

This one on the other hand would be wonderful to have. And if we want to use it as scratch we can always declare a char[] local as one would do in e.g. CUDA. Also i would favor it since it is the closest to the other GPU backends.

@gmlueck
Copy link
Copy Markdown
Contributor

gmlueck commented Apr 10, 2026

I don't really like the approach as one has to rebuild aliases every time to use it. It looks overly complex for what it aims to achieve and also probably for people coming from other portability solutions / CUDA & HIP, ... What is the motivation behind that design rather than the second one ?

Why do you think it's more complex or harder to use than the CUDA API where you declare an unbounded array? We actually designed this API as a cleaner version of the CUDA one. One thing we don't like about the CUDA interface is that multiple unbounded array declarations alias each other, which seems very non-obvious. For example:

extern __shared__ float data1[];
extern __shared__ char data2[];

This looks like two different arrays of shared memory (work-group local memory). However, the two arrays data1 and data2 actually alias each other and point to exactly the same memory.

We think the proposed SYCL API makes this less surprising because there's just a single global function that returns a pointer to the start of the dynamically-sized local memory.

I don't really like the approach as one has to rebuild aliases every time to use it.

What do you mean by "rebuild aliases"? Do you mean that you need to cast the pointer to a particular type? Would it help if we made the function templated like this:

template<typename T>
T* get_work_group_scratch_memory();

Then, you would use it like:

auto pscratch = sycl::khr::get_work_group_scratch_memory<int>();

@tdavidcl
Copy link
Copy Markdown

Oh i see now why.

It's hard to always remember what is what between CUDA and SYCL.

My confusion was related to "Static Shared Memory" in CUDA. which does resemble a lot more the second extension and is the most used one in CUDA. Whereas the "Dynamic Shared Memory" variant corresponding work_group_scratch_memory is quite rare in the type of apps i'm working with.

Do you mean that you need to cast the pointer to a particular type? Would it help if we made the function templated like this

Yup, this syntax is much cleaner i think, with the static cast it would have made a very long line.

int* pscratch = sycl::khr::get_work_group_scratch_memory<int>();

which is already 65 chars so it allows for 3 level of indentation (4 space) before passing the 80 columns mark.

We are still forced to do some weird business though if we have multiple types similarely to cuda (there is probably some memory alignment to account for but not yet 😅 )

extern __shared__ int s[];
int *integerData = s;                        // nI ints
float *floatData = (float*)&integerData[nI]; // nF floats
char *charData = (char*)&floatData[nF];      // nC chars

I don't know if there is a clean way to handle this ... maybe this ? But a clean wrapper that does it automatically would be wonderfull.

int *integerData = sycl::khr::get_work_group_scratch_memory<int>(); // nI ints
float *floatData = reinterpret_cast<float>(integerData + nI);   // nF floats
char *charData   = reinterpret_cast<char>(floatData + nF);      // nC chars

That probably goes beyond the scope of the discussion of this PR though 😅

PS: why name it scratch instead of local in the name since it refers to local memory ?

@gmlueck
Copy link
Copy Markdown
Contributor

gmlueck commented Apr 10, 2026

I don't know if there is a clean way to handle this ... maybe this ? But a clean wrapper that does it automatically would be wonderfull.

I think you'd need to do something similar to what you show. I'm not sure about providing a "clean wrapper". I think a wrapper like this would be very complicated, and it would be easier to just do the pointer math like you show.

PS: why name it scratch instead of local in the name since it refers to local memory ?

I think we were trying to use the term "work-group memory" instead of "local memory" because that seemed clearer. Therefore, we were trying to avoid using the word "local" in the API name.

@tdavidcl
Copy link
Copy Markdown

I think we were trying to use the term "work-group memory" instead of "local memory" because that seemed clearer. Therefore, we were trying to avoid using the word "local" in the API name.

I get this one, but since all of the doc and SYCL tutorials outline it as local memory (= work group local memory) in the memory hierarchy of the model wouldn't it be too confusing to introduce a new name. Also

int* pscratch = sycl::khr::get_local_scratch_memory<int>();
// or
int* pscratch = sycl::khr::get_local_memory_pointer<int>();

is shorter 😛

@tahonermann
Copy link
Copy Markdown

I have implementation concerns regarding this proposal. In particular, I'm concerned about the ability for implementations that use separate host and device compilation steps (like DPC++) to correlate kernel definitions across host and device compilation in a way that doesn't results in spurious kernel identification conflicts. See #893 for examples where the type of a named function object is insufficient to differentiate use in distinct kernel invocation calls.

@TApplencourt, I'd like to resume kernel naming/identification discussions before this proposal moves forward.

@TApplencourt
Copy link
Copy Markdown
Contributor

Totally agree, the naming/identification discussions are support important!

We should see if we can use compile-time property if we can solve your issue. A presentation to help drive discusion will be super nice nudge, nudge

For sycl_khr_free_function my naive recollection was that it was possible to implement it as a "layered" implementation on top of "old submit mechanism" -- minus compile time property. I guess I was wrong :)

@tahonermann
Copy link
Copy Markdown

@TApplencourt,

We should see if we can use compile-time property if we can solve your issue. A presentation to help drive discusion will be super nice nudge, nudge

I'm happy to provide a simple presentation of my concerns if you want to schedule it for a future SYCL WG meeting (not tomorrow, please).

For sycl_khr_free_function my naive recollection was that it was possible to implement it as a "layered" implementation on top of "old submit mechanism" -- minus compile time property. I guess I was wrong :)

That isn't wrong; it can be layered on top, but naively doing so preserves the many problems we already have with kernel naming. If we are going to provide new kernel invocation interfaces, I'd like to try to fix as many of those issues as we can. I would put #893 at the top of the list of issues to resolve (especially with respect to class types that can be used as multiple kinds of kernels as in the examples with overloaded member call operators). Next up would be #900 and #858 since the inability to specify an explicit kernel name imposes a requirement on the implementation to do so which prevents some implementation strategies currently in use (e.g., use of a SYCL unaware host compiler in conjunction with a device compiler). Issue #901 would then be next on the list.

@TApplencourt
Copy link
Copy Markdown
Contributor

Oh I see. We should take the opportunity of having a new launch API to fix implementation problem here. That make sense.

SYCL WG meeting (not tomorrow, please).

You start to know me, and avoid my traps :(
Just let me know when you are ready :) And thanks again for tackling those issues!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants