Add sycl_khr_free_function_commands extension by slawekptak · Pull Request #922 · KhronosGroup/SYCL-Docs

slawekptak · 2025-10-10T12:46:16Z

This is a new, follow-up PR to #644, originally created by John Pennycook. All the future work related to that PR will be continued here. The reason for creating a new PR is that the PR ownership transfer is required.

This extension provides an alternative mechanism for submitting commands to a device via free-functions that require developers to opt-in to the creation of event objects.

It also proposes alternative names for several commands (e.g., launch) and simplifies some concepts (e.g., by removing the need for the nd_range class).

This extension provides an alternative mechanism for submitting commands to a device via free-functions that require developers to opt-in to the creation of event objects. It also proposes alternative names for several commands (e.g., launch) and simplifies some concepts (e.g., by removing the need for the nd_range class).

Previous "0 or more" wording only made sense when reductions could be optionally provided to functions like parallel_for; now that there are dedicated *_reduce functions, at least one reduction is required.

"is" is more consistent with ISO C++ wording.

Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>

There is no need to constrain T here because T must be device-copyable in order to construct the accessor passed as an argument.

Renaming sycl::nd_item is not a necessary part of the API redesign for submitting work, so it should be moved to its own extension. This will also give us more time to consider the design and naming of any proposed replacement(s), including how they should interact with new functionality proposed in other KHRs.

There are currently no backends that define interop for reductions, so we can remove these functions for now. If we decide later that these functions are necessary, we can release a revision of the KHR.

Co-authored-by: Andrey Alekseenko <al42and@gmail.com>

API calls.

…gnal_event function.

Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>

keryell

Thanks!

abagusetty · 2026-02-19T18:11:24Z

Thanks a bunch for the efforts to push this to finish line.

My recollection, prefetch_host is sparingly used ATM. And these few apps (qmcpack, exachem, etc) use it at production-scale.

My two-cents is towards free function standardization of this API just to reduce the cycles over non-free

tdavidcl

Very nice! This extension will solve many pain point that we have with our current SYCL usage, it looks great.

I just had kind of neatpick. In the examples i did not find guidelines related to the use of local memory or anything that require the sycl::handler since this extension hides it away. Is the use of free function for commands limited to kernels that do not employ local memory ? If yes are there plans to lift such limit ?

Also i don't recall seeing any named kernel in the proposed extension. Can we name them for profiling (#1002 maybe ?) ?

gmlueck · 2026-04-10T15:00:31Z

In the examples i did not find guidelines related to the use of local memory or anything that require the sycl::handler since this extension hides it away. Is the use of free function for commands limited to kernels that do not employ local memory ? If yes are there plans to lift such limit ?

Yes. the working group has discussed this. We plan to add a KHR similar to this oneAPI extension:

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_scratch_memory.asciidoc

And I think we might consider a KHR similar to this as well:

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_static.asciidoc

tdavidcl · 2026-04-10T15:14:39Z

Thanks for the clarification.

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_scratch_memory.asciidoc

I don't really like the approach as one has to rebuild aliases every time to use it. It looks overly complex for what it aims to achieve and also probably for people coming from other portability solutions / CUDA & HIP, ... What is the motivation behind that design rather than the second one ?

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_work_group_static.asciidoc

This one on the other hand would be wonderful to have. And if we want to use it as scratch we can always declare a char[] local as one would do in e.g. CUDA. Also i would favor it since it is the closest to the other GPU backends.

gmlueck · 2026-04-10T15:53:58Z

I don't really like the approach as one has to rebuild aliases every time to use it. It looks overly complex for what it aims to achieve and also probably for people coming from other portability solutions / CUDA & HIP, ... What is the motivation behind that design rather than the second one ?

Why do you think it's more complex or harder to use than the CUDA API where you declare an unbounded array? We actually designed this API as a cleaner version of the CUDA one. One thing we don't like about the CUDA interface is that multiple unbounded array declarations alias each other, which seems very non-obvious. For example:

extern __shared__ float data1[];
extern __shared__ char data2[];

This looks like two different arrays of shared memory (work-group local memory). However, the two arrays data1 and data2 actually alias each other and point to exactly the same memory.

We think the proposed SYCL API makes this less surprising because there's just a single global function that returns a pointer to the start of the dynamically-sized local memory.

I don't really like the approach as one has to rebuild aliases every time to use it.

What do you mean by "rebuild aliases"? Do you mean that you need to cast the pointer to a particular type? Would it help if we made the function templated like this:

template<typename T>
T* get_work_group_scratch_memory();

Then, you would use it like:

auto pscratch = sycl::khr::get_work_group_scratch_memory<int>();

tdavidcl · 2026-04-10T17:35:03Z

Oh i see now why.

It's hard to always remember what is what between CUDA and SYCL.

My confusion was related to "Static Shared Memory" in CUDA. which does resemble a lot more the second extension and is the most used one in CUDA. Whereas the "Dynamic Shared Memory" variant corresponding work_group_scratch_memory is quite rare in the type of apps i'm working with.

Do you mean that you need to cast the pointer to a particular type? Would it help if we made the function templated like this

Yup, this syntax is much cleaner i think, with the static cast it would have made a very long line.

int* pscratch = sycl::khr::get_work_group_scratch_memory<int>();

which is already 65 chars so it allows for 3 level of indentation (4 space) before passing the 80 columns mark.

We are still forced to do some weird business though if we have multiple types similarely to cuda (there is probably some memory alignment to account for but not yet 😅 )

extern __shared__ int s[];
int *integerData = s;                        // nI ints
float *floatData = (float*)&integerData[nI]; // nF floats
char *charData = (char*)&floatData[nF];      // nC chars

I don't know if there is a clean way to handle this ... maybe this ? But a clean wrapper that does it automatically would be wonderfull.

int *integerData = sycl::khr::get_work_group_scratch_memory<int>(); // nI ints
float *floatData = reinterpret_cast<float>(integerData + nI);   // nF floats
char *charData   = reinterpret_cast<char>(floatData + nF);      // nC chars

That probably goes beyond the scope of the discussion of this PR though 😅

PS: why name it scratch instead of local in the name since it refers to local memory ?

gmlueck · 2026-04-10T17:47:25Z

I don't know if there is a clean way to handle this ... maybe this ? But a clean wrapper that does it automatically would be wonderfull.

I think you'd need to do something similar to what you show. I'm not sure about providing a "clean wrapper". I think a wrapper like this would be very complicated, and it would be easier to just do the pointer math like you show.

PS: why name it scratch instead of local in the name since it refers to local memory ?

I think we were trying to use the term "work-group memory" instead of "local memory" because that seemed clearer. Therefore, we were trying to avoid using the word "local" in the API name.

tdavidcl · 2026-04-10T18:00:48Z

I think we were trying to use the term "work-group memory" instead of "local memory" because that seemed clearer. Therefore, we were trying to avoid using the word "local" in the API name.

I get this one, but since all of the doc and SYCL tutorials outline it as local memory (= work group local memory) in the memory hierarchy of the model wouldn't it be too confusing to introduce a new name. Also

int* pscratch = sycl::khr::get_local_scratch_memory<int>();
// or
int* pscratch = sycl::khr::get_local_memory_pointer<int>();

is shorter 😛

tahonermann · 2026-05-19T15:48:34Z

I have implementation concerns regarding this proposal. In particular, I'm concerned about the ability for implementations that use separate host and device compilation steps (like DPC++) to correlate kernel definitions across host and device compilation in a way that doesn't results in spurious kernel identification conflicts. See #893 for examples where the type of a named function object is insufficient to differentiate use in distinct kernel invocation calls.

@TApplencourt, I'd like to resume kernel naming/identification discussions before this proposal moves forward.

TApplencourt · 2026-05-19T16:57:11Z

Totally agree, the naming/identification discussions are support important!

We should see if we can use compile-time property if we can solve your issue. A presentation to help drive discusion will be super nice nudge, nudge

For sycl_khr_free_function my naive recollection was that it was possible to implement it as a "layered" implementation on top of "old submit mechanism" -- minus compile time property. I guess I was wrong :)

tahonermann · 2026-05-20T15:46:55Z

@TApplencourt,

We should see if we can use compile-time property if we can solve your issue. A presentation to help drive discusion will be super nice nudge, nudge

I'm happy to provide a simple presentation of my concerns if you want to schedule it for a future SYCL WG meeting (not tomorrow, please).

For sycl_khr_free_function my naive recollection was that it was possible to implement it as a "layered" implementation on top of "old submit mechanism" -- minus compile time property. I guess I was wrong :)

That isn't wrong; it can be layered on top, but naively doing so preserves the many problems we already have with kernel naming. If we are going to provide new kernel invocation interfaces, I'd like to try to fix as many of those issues as we can. I would put #893 at the top of the list of issues to resolve (especially with respect to class types that can be used as multiple kinds of kernels as in the examples with overloaded member call operators). Next up would be #900 and #858 since the inability to specify an explicit kernel name imposes a requirement on the implementation to do so which prevents some implementation strategies currently in use (e.g., use of a SYCL unaware host compiler in conjunction with a device compiler). Issue #901 would then be next on the list.

TApplencourt · 2026-05-20T16:03:02Z

Oh I see. We should take the opportunity of having a new launch API to fix implementation problem here. That make sense.

SYCL WG meeting (not tomorrow, please).

You start to know me, and avoid my traps :(
Just let me know when you are ready :) And thanks again for tackling those issues!

Pennycook and others added 30 commits October 17, 2024 15:00

Merge branch 'main' into khr_free_function_commands

ebe4dc0

Reword khr_free_function_commands comment

ce08652

Add periods to khr_free_function_commands comments

372bb3b

Add + marks to code blocks containing ...

9747f7a

Require at least 1 reduction in *_reduce functions

2527e90

Previous "0 or more" wording only made sense when reductions could be optionally provided to functions like parallel_for; now that there are dedicated *_reduce functions, at least one reduction is required.

Replace "must be" with "is" in constraints

f63adeb

"is" is more consistent with ISO C++ wording.

Use bulleted list for multiple constraints

47c08f8

Rewrite preconditions for USM copy functions

15fd80a

Fix typo in non-normative note

9c62792

Define kernel object overloads via equivalence

4377549

Clarify dependencies for command_/event_barrier

c24fb13

Clarify that event_barrier can be a no-op

664a912

Add missing invocation constructor

16111b2

Restart numbering at 1 in each synopsis block

88b540a

Replace backticks with [code] environment

2575901

Add + marks to code blocks containing ... again

8fa1ce2

Fix grammar: "is" to "are"

8d79af4

Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>

Fix formatting of bulleted lists

2ba2394

Fix more instances of "is" that should be "are"

e382dbc

Remove unnecessary device-copyable constraint

a9bdc10

There is no need to constrain T here because T must be device-copyable in order to construct the accessor passed as an argument.

Remove empty issues section

fa8a8f6

Add missing constraints to fill overloads

db380b4

Remove *_reduce functions for kernel objects

d26831a

There are currently no backends that define interop for reductions, so we can remove these functions for now. If we decide later that these functions are necessary, we can release a revision of the KHR.

Fix copy-paste error in launch_task definition

75867f7

Remove unnecessary "is"

151f632

Co-authored-by: Andrey Alekseenko <al42and@gmail.com>

Explain potential performance overhead of events

32d11f5

Add no-op note to command_barrier

165d07e

Weaken note about no-op from "is" to "may be"

be56bb7

slawekptak added 2 commits January 21, 2026 11:55

Fix formatting

0053611

Add a host task function restriction, so it cannot make any SYCL

94262f8

API calls.

gmlueck reviewed Jan 22, 2026

View reviewed changes

Clarified the constraints for the requirements object, for enqueue_si…

3064bc1

…gnal_event function.

gmlueck reviewed Feb 4, 2026

View reviewed changes

Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated

slawekptak and others added 6 commits February 5, 2026 08:57

Update adoc/extensions/sycl_khr_free_function_commands.adoc

b5ba06c

Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>

Update adoc/extensions/sycl_khr_free_function_commands.adoc

14355fd

Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>

Update adoc/extensions/sycl_khr_free_function_commands.adoc

8196926

Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>

Update adoc/extensions/sycl_khr_free_function_commands.adoc

8b3f92a

Co-authored-by: Greg Lueck <gregory.m.lueck@intel.com>

Address the review feedback, and some minor changes.

dbb35ef

Fix formatting.

bfaa2fd

gmlueck reviewed Feb 5, 2026

View reviewed changes

Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated

Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated

Comment thread adoc/extensions/sycl_khr_free_function_commands.adoc Outdated

slawekptak added 2 commits February 6, 2026 09:58

Address review comments.

ea55089

Fix formatting.

58fc68e

keryell approved these changes Feb 12, 2026

View reviewed changes

gmlueck mentioned this pull request Feb 25, 2026

[KHR] Prefetch host #960

Open

gmlueck mentioned this pull request Mar 16, 2026

First draft of a KHR for properties #980

Draft

tdavidcl reviewed Apr 10, 2026

View reviewed changes

tdavidcl mentioned this pull request Apr 10, 2026

Attribute based kernel naming #1002

Open

Conversation

slawekptak commented Oct 10, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

keryell left a comment

Choose a reason for hiding this comment

Uh oh!

abagusetty commented Feb 19, 2026

Uh oh!

tdavidcl left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gmlueck commented Apr 10, 2026

Uh oh!

tdavidcl commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gmlueck commented Apr 10, 2026

Uh oh!

tdavidcl commented Apr 10, 2026

Uh oh!

gmlueck commented Apr 10, 2026

Uh oh!

tdavidcl commented Apr 10, 2026

Uh oh!

tahonermann commented May 19, 2026

Uh oh!

TApplencourt commented May 19, 2026

Uh oh!

tahonermann commented May 20, 2026

Uh oh!

TApplencourt commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

tdavidcl left a comment •

edited

Loading

tdavidcl commented Apr 10, 2026 •

edited

Loading