
Algorithm Harmonization #5.1 CKF, main branch (2026.02.12.) #1259

Open

krasznaa wants to merge 11 commits into acts-project:main from krasznaa:CKFHarmonization-main-20260210

Conversation

@krasznaa (Member)

Following up on #1240, this finally synchronizes the behaviour of the Alpaka CKF algorithm with the CUDA and SYCL ones, using the same code re-write as in the previous "harmonization PRs".

While at it, I also added unit tests for the Alpaka CKF algorithm. These unit tests will need to be re-designed a bit in a future PR to reduce the amount of code duplication, but I didn't want to bother with that in this PR.

Note that I cannot run one of the tests successfully with Alpaka on a CPU. 😕 The other test runs happily on a CPU, and I was also able to get reasonable outputs from various example binaries with the CKF. So I'm not sure what that particular test has against running on a CPU. (With Alpaka's CUDA backend it runs fine.) But I gave up on understanding it after about an hour of looking at it.

Once I added the latest kernels/device functions to Alpaka, GCC flagged a few things in that code. 🤔 The complaint about candidate_link variables not being initialized could just be GCC being overly cautious. But setting an initial value for out_idx was, I believe, a good find by the compiler. 🤔
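(For illustration only, a minimal sketch of the class of issue GCC pointed at, not the actual traccc kernel; the helper function and its arguments below are invented.)

```cpp
#include <cuda_runtime.h>

// Hypothetical device helper, only to show the shape of the fix: out_idx used
// to be declared without an initial value and was only assigned on one branch,
// so GCC warned that later reads could see an indeterminate value.
__device__ void record_candidate(bool accept, unsigned int* n_out,
                                 unsigned int* out_links, unsigned int link) {
    unsigned int out_idx = 0u;  // initial value added following the warning
    if (accept) {
        out_idx = atomicAdd(n_out, 1u);
        out_links[out_idx] = link;
    }
    // Any later use of out_idx is now well-defined even when accept == false.
}
```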

@krasznaa krasznaa requested a review from stephenswat February 12, 2026 16:52
@krasznaa krasznaa added the refactor, cleanup, cuda, sycl, and alpaka labels Feb 12, 2026
@stephenswat (Member) left a comment

Looks okay, but far too verbose with the payload structs. Try to deduplicate those using the existing structs we have.
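(Purely illustrative of what such deduplication could look like; none of these struct names are the ones in the PR.)

```cpp
// Verbose pattern: every kernel payload repeats the same fields.
struct find_tracks_payload_verbose {
    unsigned int n_measurements = 0u;
    unsigned int n_in_params = 0u;
    unsigned int step = 0u;
};

// Deduplicated pattern: reuse one shared struct as a member, and keep only the
// kernel-specific fields in the per-kernel payload.
struct common_finding_data {
    unsigned int n_measurements = 0u;
    unsigned int n_in_params = 0u;
};

struct find_tracks_payload {
    common_finding_data common;
    unsigned int step = 0u;
};
```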

Comment on lines +66 to +67
// Here we could give control back to the caller, once our code allows
// for it. (coroutines...)
@stephenswat (Member) commented Feb 13, 2026

Instead of copying the same unstructured comment six times across this file, prepend them with TODO: to make them findable.

@krasznaa (Member, Author)

TODO is flagged by SonarCloud. This is not.

I'm copying the same sentence to make it easily searchable in our code once we embark on such a code change.

@stephenswat (Member)

> TODO is flagged by SonarCloud.

That's exactly the point: SonarCloud and other tools give you a list of code sections marked with TODO (and FIXME, etc.) comments so that you can easily find them. 😛

@krasznaa (Member, Author)

But it's not obvious that we will want to do anything here. Not to me. Not yet.

@krasznaa krasznaa force-pushed the CKFHarmonization-main-20260210 branch from b9265c0 to e0e2ca5 on February 13, 2026 16:26
@stephenswat (Member) left a comment

Starting to look a bit better. 👍

@stephenswat (Member)

Performance summary

Here is a summary of the performance effects of this PR:


| Kernel | Reciprocal throughput (0158246) | Reciprocal throughput (e0e2ca5) | Delta | Parallelism (0158246) | Parallelism (e0e2ca5) |
| --- | --- | --- | --- | --- | --- |
| propagate_to_next_surface | 7.84 ms | 6.49 ms | -17.2% | 3.45 | 4.09 |
| find_tracks | 1.74 ms | 1.74 ms | 0.0% | 1.83 | 1.83 |
| ccl_kernel | 824.83 μs | 825.14 μs | 0.0% | 1.37 | 1.37 |
| count_doublets | 811.36 μs | 814.67 μs | 0.4% | 1.61 | 1.61 |
| count_triplets | 568.02 μs | 567.41 μs | -0.1% | 1.02 | 1.02 |
| find_doublets | 532.92 μs | 532.45 μs | -0.1% | 3.08 | 3.08 |
| Thrust::sort | 379.63 μs | 379.50 μs | -0.0% | 7.32 | 7.32 |
| find_triplets | 170.65 μs | 170.34 μs | -0.2% | 1.31 | 1.31 |
| estimate_track_params | 146.39 μs | 146.37 μs | -0.0% | 2.68 | 2.68 |
| build_tracks | 125.26 μs | 125.67 μs | 0.3% | 3.71 | 3.71 |
| select_seeds | 58.91 μs | 59.45 μs | 0.9% | 1.34 | 1.34 |
| populate_grid | 24.00 μs | 24.05 μs | 0.2% | 1.22 | 1.22 |
| remove_duplicates | 23.39 μs | 23.52 μs | 0.6% | 26.19 | 26.10 |
| count_grid_capacities | 22.12 μs | 22.15 μs | 0.2% | 1.22 | 1.22 |
| fill_sorted_measurements | 19.77 μs | 19.78 μs | 0.0% | 1.13 | 1.13 |
| update_triplet_weights | 14.77 μs | 14.79 μs | 0.1% | 1.27 | 1.27 |
| apply_interaction | 13.89 μs | 13.85 μs | -0.3% | 6.70 | 6.72 |
| fill_finding_propagation_sort_keys | 8.84 μs | 8.81 μs | -0.4% | 7.64 | 7.66 |
| form_spacepoints | 8.33 μs | 8.33 μs | 0.0% | 1.48 | 1.48 |
| reduce_triplet_counts | 5.60 μs | 5.62 μs | 0.4% | 3.09 | 3.09 |
| unknown | 5.08 μs | 5.08 μs | -0.1% | 4.26 | 4.27 |
| fill_finding_duplicate_removal_sort_keys | 1.57 μs | 1.57 μs | -0.0% | 37.98 | 38.00 |
| Total | 13.35 ms | 12.00 ms | -10.1% | 2.98 | 3.28 |

Important

All metrics in this report are given as reciprocal throughput, not as wallclock runtime.

Note

This is an automated message produced upon the explicit request of a human being.

@stephenswat (Member)

> Performance summary

This looks good!

@stephenswat (Member)

But FYI: the physics CI currently fails with `an illegal memory access was encountered`.

@krasznaa (Member, Author)

> But FYI: the physics CI currently fails with `an illegal memory access was encountered`.

😦 To be fixed then...

@krasznaa krasznaa force-pushed the CKFHarmonization-main-20260210 branch from e0e2ca5 to d00f44e on February 16, 2026 14:28
@krasznaa (Member, Author)

I did not manage to reproduce a crash with:

./bin/traccc_seeding_example_cuda --input-directory=/home/krasznaa/ATLAS/data/odd-simulations-20240506/geant4_ttbar_mu200 --digitization-file=geometries/odd/odd-digi-geometric-config.json --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --material-file=geometries/odd/odd-detray_material_detray.json --input-events=10 --use-acts-geom-source=on --check-performance --truth-finding-min-track-candidates=5 --truth-finding-min-pt=1.0 --truth-finding-min-z=-150 --truth-finding-max-z=150 --truth-finding-max-r=10 --seed-matching-ratio=0.99 --track-matching-ratio=0.5 --track-candidates-range=5:100 --seedfinder-vertex-range=-150:150

I even tried 2 different CUDA versions.

Could you re-check @stephenswat? If you still see a crash, I'll need to test on the same node. 🤔

@stephenswat (Member)

Physics performance summary

Here is a summary of the physics performance effects of this PR. Command used:

traccc_seeding_example_cuda --input-directory=/data/Acts/odd-simulations-20240506/geant4_ttbar_mu200 --digitization-file=geometries/odd/odd-digi-geometric-config.json --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --material-file=geometries/odd/odd-detray_material_detray.json --input-events=10 --use-acts-geom-source=on --check-performance --truth-finding-min-track-candidates=5 --truth-finding-min-pt=1.0 --truth-finding-min-z=-150 --truth-finding-max-z=150 --truth-finding-max-r=10 --seed-matching-ratio=0.99 --track-matching-ratio=0.5 --track-candidates-range=5:100 --seedfinder-vertex-range=-150:150

Seeding performance

Total number of seeds went from 298344 to 298340 (-0.0%)

Seeding plots (images not included)

Track finding performance

Total number of found tracks went from 50221 to 50224 (+0.0%)

Finding plots (images not included)

Track fitting performance

Fitting plots (images not included)

Seeding to track finding relative performance

Seeding to track finding plots (images not included)
Note

This is an automated message produced on the explicit request of a human being.

@stephenswat (Member)

Performance summary

Here is a summary of the performance effects of this PR:


| Kernel | Reciprocal throughput (0158246) | Reciprocal throughput (d00f44e) | Delta | Parallelism (0158246) | Parallelism (d00f44e) |
| --- | --- | --- | --- | --- | --- |
| propagate_to_next_surface | 7.83 ms | 6.49 ms | -17.1% | 3.45 | 4.09 |
| find_tracks | 1.74 ms | 1.74 ms | 0.2% | 1.83 | 1.83 |
| ccl_kernel | 826.95 μs | 825.77 μs | -0.1% | 1.37 | 1.37 |
| count_doublets | 818.44 μs | 814.08 μs | -0.5% | 1.61 | 1.61 |
| count_triplets | 568.02 μs | 568.85 μs | 0.1% | 1.02 | 1.02 |
| find_doublets | 534.21 μs | 535.06 μs | 0.2% | 3.08 | 3.08 |
| Thrust::sort | 379.77 μs | 379.41 μs | -0.1% | 7.32 | 7.32 |
| find_triplets | 171.31 μs | 170.41 μs | -0.5% | 1.31 | 1.31 |
| estimate_track_params | 146.52 μs | 146.40 μs | -0.1% | 2.68 | 2.68 |
| build_tracks | 125.29 μs | 125.36 μs | 0.1% | 3.72 | 3.71 |
| select_seeds | 58.47 μs | 58.10 μs | -0.6% | 1.34 | 1.34 |
| populate_grid | 24.02 μs | 23.98 μs | -0.2% | 1.22 | 1.22 |
| remove_duplicates | 23.54 μs | 23.55 μs | 0.0% | 26.06 | 26.06 |
| count_grid_capacities | 22.17 μs | 22.22 μs | 0.2% | 1.22 | 1.22 |
| fill_sorted_measurements | 19.77 μs | 19.71 μs | -0.3% | 1.13 | 1.13 |
| update_triplet_weights | 14.75 μs | 14.80 μs | 0.4% | 1.27 | 1.27 |
| apply_interaction | 13.89 μs | 13.86 μs | -0.2% | 6.71 | 6.71 |
| fill_finding_propagation_sort_keys | 8.82 μs | 8.81 μs | -0.1% | 7.66 | 7.67 |
| form_spacepoints | 8.35 μs | 8.33 μs | -0.3% | 1.48 | 1.49 |
| reduce_triplet_counts | 5.67 μs | 5.63 μs | -0.6% | 3.08 | 3.09 |
| unknown | 5.07 μs | 5.08 μs | 0.1% | 4.26 | 4.26 |
| fill_finding_duplicate_removal_sort_keys | 1.57 μs | 1.57 μs | 0.0% | 38.00 | 38.05 |
| Total | 13.34 ms | 12.00 ms | -10.0% | 2.98 | 3.28 |

Important

All metrics in this report are given as reciprocal throughput, not as wallclock runtime.

Note

This is an automated message produced upon the explicit request of a human being.

@stephenswat (Member)

Interestingly, testing on pcadp04 reveals exactly the opposite behaviour:

./build/bin/traccc_throughput_mt_cuda --material-file=geometries/odd/odd-detray_material_detray.json --input-directory=/data/Acts/odd-simulations-20240506/geant4_ttbar_mu140 --digitization-file=geometries/odd/odd-digi-geometric-config.json --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --input-events=10 --cold-run-events=100 --processed-events=1000 --use-acts-geom-source=on --read-bfield-from-file --cpu-threads=20 --track-candidates-range=5:20 --seedfinder-vertex-range=-150:150 --deterministic --initial-links-per-seed=6

At current main:

16:52:40    ThroughputExample             INFO      Reconstructed track parameters: 2909792
16:52:40    ThroughputExample             INFO      Time totals:                   File reading  529 ms
16:52:40    ThroughputExample             INFO                  Warm-up processing  744 ms
16:52:40    ThroughputExample             INFO                    Event processing  6612 ms
16:52:40    ThroughputExample             INFO      Throughput:            Warm-up processing  7.44117 ms/event, 134.388 events/s
16:52:40    ThroughputExample             INFO                    Event processing  6.61275 ms/event, 151.223 events/s

With this PR:

16:54:29    ThroughputExample             INFO      Reconstructed track parameters: 2909798
16:54:29    ThroughputExample             INFO      Time totals:                   File reading  491 ms
16:54:29    ThroughputExample             INFO                  Warm-up processing  829 ms
16:54:29    ThroughputExample             INFO                    Event processing  7425 ms
16:54:29    ThroughputExample             INFO      Throughput:            Warm-up processing  8.29486 ms/event, 120.557 events/s
16:54:29    ThroughputExample             INFO                    Event processing  7.42517 ms/event, 134.677 events/s

So in reality that is more like a 10% slowdown. Fascinating!

@krasznaa (Member, Author)

Well, this is worrisome. 🤔 I don't claim to fully understand the situation, but it seems that cudaEventSynchronize(...) is less efficient than cudaStreamSynchronize(...).

This is how a 10-thread job (with the same parameters as posted in the previous comment) looks in the current main branch:

(profiler screenshot not included)

While this is how it looks with this branch's code:

(profiler screenshot not included)

In this code, size copies go through vecmem::async_size, which relies on event rather than stream synchronization.

But I have some doubts about the answer being quite so simple. 🤔 I'll do some profiling tomorrow with NSys as well.
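(To make the two patterns concrete, here is a minimal sketch of stream- versus event-based synchronization around an asynchronous size copy. This is not the traccc or vecmem code; the function and variable names are invented.)

```cpp
#include <cuda_runtime.h>

// Stream-based pattern (roughly what the current main branch ends up doing):
// wait for everything queued on the stream before reading the copied value.
unsigned int get_size_stream_sync(const unsigned int* d_size, cudaStream_t stream) {
    unsigned int h_size = 0u;
    cudaMemcpyAsync(&h_size, d_size, sizeof(unsigned int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    return h_size;
}

// Event-based pattern (roughly what this branch picks up through the async
// size copies): record an event right after the copy and wait only for that.
unsigned int get_size_event_sync(const unsigned int* d_size, cudaStream_t stream,
                                 cudaEvent_t event) {
    unsigned int h_size = 0u;
    cudaMemcpyAsync(&h_size, d_size, sizeof(unsigned int),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(event, stream);
    cudaEventSynchronize(event);
    return h_size;
}
```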

@krasznaa (Member, Author)

Ahh, never mind. When I actually add up all the time that is spent in cuEventSynchronize and cuStreamSynchronize in both cases, I come to pretty much the same value. The proportions between these two types have shifted. But the total time spent in them didn't change in any meaningful way. 🤔

@stephenswat (Member)

Throughputs on the A5000:

@stephenswat (Member)

Regarding the d00f44e commit, what happens here is that the register usage changes (probably due to the kernel arguments), which increases occupancy but also increases register spilling. So the compiler is doing a poor job optimising there, and the CI benchmark is being tricked by the increased occupancy.
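(Side note, purely as a sketch of how this kind of regression is usually diagnosed and constrained; the kernel below is invented. Compiling with `nvcc -Xptxas=-v` reports the per-kernel register count and any spill loads/stores.)

```cpp
#include <cuda_runtime.h>

// __launch_bounds__ tells the compiler the maximum block size the kernel will
// be launched with, letting it trade registers per thread against occupancy
// explicitly instead of guessing from the kernel arguments.
__global__ void __launch_bounds__(128)
example_kernel(const float* in, float* out, unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * in[i];
    }
}
```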

@stephenswat
Copy link
Copy Markdown
Member

stephenswat commented Feb 17, 2026

The performance change in 8e67711 is easily explained by the fact that the block sizes change:

main:

const unsigned int nThreads = warp_size * 4;
...

8e67711:

const unsigned int deviceThreads = warp_size() * 2;
...

However, 8e67711 is also the commit that reduces throughput from 151 Hz to 139 Hz.
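(Just to spell out what that constant controls, a minimal sketch of how the thread count feeds into the launch configuration; the kernel and workload size are placeholders, not the traccc code.)

```cpp
#include <cuda_runtime.h>

__global__ void example_kernel(unsigned int n) { (void)n; }

void launch(cudaStream_t stream, unsigned int n_items) {
    // Halving the threads per block (warp_size() * 2 instead of warp_size * 4)
    // means twice as many, smaller blocks for the same workload, which can
    // shift both occupancy and register behaviour.
    const unsigned int warp = 32u;                 // placeholder warp size
    const unsigned int deviceThreads = warp * 2u;  // 64 threads per block
    const unsigned int nBlocks = (n_items + deviceThreads - 1u) / deviceThreads;
    example_kernel<<<nBlocks, deviceThreads, 0, stream>>>(n_items);
}
```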

@krasznaa (Member, Author)

Indeed I increased the block size in some cases. If that's the culprit, that would be a pretty clean issue to fix.

There were some comments here and there in the CUDA code for some of the block size choices, but not for all of them. I remember that one of them didn't seem to make sense to me, so I changed it on purpose.

I'll do some tests of my own on an L40s a little later today, and let you know what I find. Your findings are very useful, to be very clear about that.

stephenswat added a commit to stephenswat/traccc that referenced this pull request Feb 24, 2026
This commit adds the `__grid_constant__` qualifier to the CUDA track
finding kernel, allowing the compiler to make some additional
optimisations. This should also help us better understand performance
issues such as the ones in acts-project#1259.
stephenswat added a commit to stephenswat/traccc that referenced this pull request Feb 24, 2026
This commit adds the `__grid_constant__` qualifier to the CUDA track
finding kernel, allowing the compiler to make some additional
optimisations. This should also help us better understand performance
issues such as the ones in acts-project#1259.
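(For readers unfamiliar with the qualifier, a rough sketch of its use on a kernel parameter; the payload struct and kernel here are hypothetical, not the actual traccc track-finding kernel.)

```cpp
#include <cuda_runtime.h>

struct finding_payload {  // hypothetical payload
    const float* measurements;
    float* out_params;
    unsigned int n;
};

// __grid_constant__ marks a by-value, const kernel parameter as uniform across
// the grid, so the compiler can read it from constant memory instead of
// spilling a per-thread copy into local memory.
__global__ void find_tracks_kernel(const __grid_constant__ finding_payload payload) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < payload.n) {
        payload.out_params[i] = payload.measurements[i];
    }
}
```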
@krasznaa krasznaa force-pushed the CKFHarmonization-main-20260210 branch from d00f44e to 08f0655 on February 25, 2026 09:27
@krasznaa (Member, Author)

With __grid_constant__ added, on my desktop's GPU the difference between the main branch and this one is now seemingly smaller. But there is still a small throughput difference. (~0.5 Hz on this gaming GPU.)

I'll do some further work a bit later on. 🤔

@stephenswat (Member)

> With __grid_constant__ added, on my desktop's GPU the difference between the main branch and this one is now seemingly smaller. But there is still a small throughput difference. (~0.5 Hz on this gaming GPU.)

On the A5000 there is still a very noticeable performance impact, with the throughput going from 151 Hz to 137 Hz. 🙁

@krasznaa (Member, Author)

> With __grid_constant__ added, on my desktop's GPU the difference between the main branch and this one is now seemingly smaller. But there is still a small throughput difference. (~0.5 Hz on this gaming GPU.)

> On the A5000 there is still a very noticeable performance impact, with the throughput going from 151 Hz to 137 Hz. 🙁

One hope I have (and it would be lovely if it turned out to be true) is that once acts-project/vecmem#350 is incorporated into this project, it would get rid of a lot of this difference, since the unified code does all of its synchronization through (CUDA) events, while the current code does a bunch of CUDA stream synchronizations.

Let's see...

@stephenswat (Member)

> One hope I have (and it would be lovely if it turned out to be true) is that once acts-project/vecmem#350 is incorporated into this project, it would get rid of a lot of this difference, since the unified code does all of its synchronization through (CUDA) events, while the current code does a bunch of CUDA stream synchronizations.

Unfortunately the results I collect are with the event pooling already enabled, so I am afraid that this won't help.

…ion_kernel_payload.

Modified device::apply_interaction_payload not to be templated, by having device::apply_interaction receive the detector view as a separate argument. Then updated all the clients of the common algorithm base class to implement their versions of apply_interaction_kernel accordingly.
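(A rough sketch of the shape of this change, with made-up field names; not the actual traccc definitions.)

```cpp
#include <cuda_runtime.h>

// Before (illustrative): the payload was a class template carrying the
// detector view itself, so every payload instantiation depended on detector_t.
template <typename detector_t>
struct apply_interaction_payload_old {
    typename detector_t::view_type det_data;
    unsigned int n_params = 0u;
};

// After (illustrative): a plain, non-templated payload...
struct apply_interaction_payload {
    unsigned int n_params = 0u;
};

// ...with the detector view handed to the device function as its own argument,
// so only the function, not the payload type, needs the template parameter.
template <typename detector_t>
__device__ void apply_interaction(typename detector_t::view_type det_data,
                                  const apply_interaction_payload& payload) {
    (void)det_data;
    (void)payload;  // device-side work would go here
}
```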
…rnel_payload.

Ended up putting the modified device::find_tracks_payload into its own header file, to avoid compilation issues arising from the Thrust code used in find_tracks.ipp.
@krasznaa krasznaa force-pushed the CKFHarmonization-main-20260210 branch from c5fdf02 to 1eb7409 on February 26, 2026 15:58
@sonarqubecloud

mimodekjaer pushed a commit to mimodekjaer/traccc that referenced this pull request Mar 30, 2026
This commit adds the `__grid_constant__` qualifier to the CUDA track
finding kernel, allowing the compiler to make some additional
optimisations. This should also help us better understand performance
issues such as the ones in acts-project#1259.

Labels

- alpaka: Changes related to Alpaka
- cleanup: Makes the code all clean and tidy
- cuda: Changes related to CUDA
- refactor: Change the structure of the code
- sycl: Changes related to SYCL
