Algorithm Harmonization #5.1 CKF, main branch (2026.02.12.)#1259
Algorithm Harmonization #5.1 CKF, main branch (2026.02.12.)#1259krasznaa wants to merge 11 commits intoacts-project:mainfrom
Conversation
stephenswat
left a comment
There was a problem hiding this comment.
Looks okay, but far too verbose with the payload structs. Try to deduplicate those using the existing structs we have.
device/common/include/traccc/finding/device/impl/gather_best_tips_per_measurement.ipp
Outdated
Show resolved
Hide resolved
device/common/include/traccc/finding/device/combinatorial_kalman_filter_algorithm.hpp
Show resolved
Hide resolved
| // Here we could give control back to the caller, once our code allows | ||
| // for it. (coroutines...) |
There was a problem hiding this comment.
Instead of copying the same unstructured comment six times across this file, prepend them with TODO: to make them findable.
There was a problem hiding this comment.
TODO is flagged by SonarCloud. This is not.
I'm copying the same sentence to make it easily searchable in our code once we embark on such a code change.
There was a problem hiding this comment.
TODOis flagged by SonarCloud.
That's exactly the point: SonarCloud and other tools give you a list of code sections marked with TODO (and FIXME, etc.) comments so that you can easily find them. 😛
There was a problem hiding this comment.
But it's not obvious that we will want to do anything here. Not to me. Not yet.
b9265c0 to
e0e2ca5
Compare
stephenswat
left a comment
There was a problem hiding this comment.
Starting to look a bit better. 👍
device/common/include/traccc/finding/device/impl/gather_best_tips_per_measurement.ipp
Show resolved
Hide resolved
device/cuda/src/finding/kernels/specializations/apply_interaction_src.cuh
Outdated
Show resolved
Hide resolved
device/common/include/traccc/finding/device/impl/propagate_to_next_surface.ipp
Show resolved
Hide resolved
device/common/include/traccc/finding/device/impl/find_tracks.ipp
Outdated
Show resolved
Hide resolved
device/cuda/src/finding/kernels/specializations/propagate_to_next_surface_src.cuh
Outdated
Show resolved
Hide resolved
Performance summaryHere is a summary of the performance effects of this PR: GraphicalTabular
Important All metrics in this report are given as reciprocal throughput, not as wallclock runtime. Note This is an automated message produced upon the explicit request of a human being. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
This looks good! |
|
But FYI: the physics CI currently fails with |
😦 To be fixed then... |
e0e2ca5 to
d00f44e
Compare
|
I did not manage to reproduce a crash with: I even tried 2 different CUDA versions. Could you re-check @stephenswat? If you still see a crash, I'll need to test on the same node. 🤔 |
Physics performance summaryHere is a summary of the physics performance effects of this PR. Command used: Seeding performanceTotal number of seeds went from 298344 to 298340 (-0.0%) Track finding performanceTotal number of found tracks went from 50221 to 50224 (+0.0%) Track fitting performanceSeeding to track finding relative performanceNote This is an automated message produced on the explicit request of a human being. |
Performance summaryHere is a summary of the performance effects of this PR: GraphicalTabular
Important All metrics in this report are given as reciprocal throughput, not as wallclock runtime. Note This is an automated message produced upon the explicit request of a human being. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Interestingly testing on At current main: With this PR: So that would rather be a 10% slowdown. Fascinating! |
|
Ahh, never mind. When I actually add up all the time that is spent in |
|
Throughputs on the A5000:
|
|
Regarding the d00f44e commit, what happens here is that the register usage changes (probably due to the kernel arguments) which increases occupancy but also increases register spilling. So the compiler is doing a poor job optimising there, and the CI benchmark is being tricked by the increased occupancy. |
|
Indeed I increased the block size in some cases. If that's the culprit, that would be a pretty clean issue to fix. There were some comments here and there in the CUDA code for some of the block size choices. But not for all of them. I remember that one of those didn't seem to make sense for me, so I changed it on purpose. I'll do some tests of my own on an L40s a little later today, and let you know what I find. Your findings are very useful, to be very clear about that. |
device/common/src/finding/combinatorial_kalman_filter_algorithm.cpp
Outdated
Show resolved
Hide resolved
device/common/src/finding/combinatorial_kalman_filter_algorithm.cpp
Outdated
Show resolved
Hide resolved
This commit adds the `__grid_constant__` qualifier to the CUDA track finding kernel, allowing the compiler to make some additional optimisations. This should also help us better understand performance issues such as the ones in acts-project#1259.
This commit adds the `__grid_constant__` qualifier to the CUDA track finding kernel, allowing the compiler to make some additional optimisations. This should also help us better understand performance issues such as the ones in acts-project#1259.
d00f44e to
08f0655
Compare
|
With I'll do some further work a bit later on. 🤔 |
On the A5000 there is still a very noticeable performance impact, with the throughput going from 151 Hz to 137 Hz. 🙁 |
One hope I have (and it would be lovely if it turned out to be true) is that once acts-project/vecmem#350 is collected into this project, that would get rid of a lot of this difference. Since the unified code does all of its synchronization through (CUDA) events. Versus the current code doing a bunch of CUDA stream synchronizations. Let's see... |
Unfortunately the results I collect are with the event pooling already enabled, so I am afraid that this won't help. |
…ion_kernel_payload. Modified device::apply_interaction_payload not to be templated, by device::apply_interaction receiving the detector view as a separate argument. And then updated all the clients of the common algorithm base class to implement their versions of apply_interaction_kernel accordingly.
…rnel_payload. Ended up putting the modified device::find_tracks_payload into its own header file to avoid compilation issues arising from the Thrust code use in find_tracks.ipp.
…ext_surface_kernel_payload.
c5fdf02 to
1eb7409
Compare
|
This commit adds the `__grid_constant__` qualifier to the CUDA track finding kernel, allowing the compiler to make some additional optimisations. This should also help us better understand performance issues such as the ones in acts-project#1259.




































Following up on #1240, this is finally synchronizing the behaviour of the Alpaka CKF algorithm with the CUDA and SYCL ones. Using the same code re-write done in previous "harmonization PRs".
While at it, I also added unit tests for the Alpaka CKF algorithm. These unit tests will need to be re-designed a bit in a future PR to reduce the amount of code duplication. But didn't want to bother with that in this PR.
Note that one of the tests I cannot run successfully with Alpaka on a CPU. 😕 While the other test runs on a CPU happily. And I was also able to get reasonable outputs from various example binaries with the CKF. So I'm not sure what that particular test has against CPU running. (With Alpaka's CUDA backend it runs fine.) But I just gave up on understanding that after about an hour of looking at it.
Once I added the latest kernels/device functions to Alpaka, GCC flagged a few things in that code. 🤔 The complaint about
candidate_linkvariables not being initialized could be GCC just being too afraid. But setting an initial value toout_idxI believe was a good find by the compiler. 🤔