
Algorithm Harmonization #5.1 CKF, main branch (2026.02.12.) #1259

Open

krasznaa wants to merge 11 commits into acts-project:main from krasznaa:CKFHarmonization-main-20260210

Conversation

@krasznaa (Member)

Following up on #1240, this finally synchronizes the behaviour of the Alpaka CKF algorithm with the CUDA and SYCL ones, using the same code re-write as in the previous "harmonization PRs".

While at it, I also added unit tests for the Alpaka CKF algorithm. These unit tests will need to be re-designed a bit in a future PR to reduce the amount of code duplication, but I didn't want to bother with that in this PR.

Note that I cannot run one of the tests successfully with Alpaka on a CPU. 😕 The other test runs happily on a CPU, and I was also able to get reasonable outputs from various example binaries with the CKF. So I'm not sure what that particular test has against running on a CPU. (With Alpaka's CUDA backend it runs fine.) But I gave up on understanding it after about an hour of looking at it.

Once I added the latest kernels/device functions to Alpaka, GCC flagged a few things in that code. 🤔 The complaint about candidate_link variables not being initialized could just be GCC being overly cautious. But setting an initial value for out_idx was, I believe, a good find by the compiler. 🤔
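(For illustration only, a minimal sketch of the class of issue GCC pointed at, not the actual traccc kernel; the helper function and its arguments below are invented.)

```cpp
#include <cuda_runtime.h>

// Hypothetical device helper, only to show the shape of the fix: out_idx used
// to be declared without an initial value and was only assigned on one branch,
// so GCC warned that later reads could see an indeterminate value.
__device__ void record_candidate(bool accept, unsigned int* n_out,
                                 unsigned int* out_links, unsigned int link) {
    unsigned int out_idx = 0u;  // initial value added following the warning
    if (accept) {
        out_idx = atomicAdd(n_out, 1u);
        out_links[out_idx] = link;
    }
    // Any later use of out_idx is now well-defined even when accept == false.
}
```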

@krasznaa krasznaa requested a review from stephenswat February 12, 2026 16:52
@krasznaa krasznaa added the refactor, cleanup, cuda, sycl, and alpaka labels Feb 12, 2026
@stephenswat (Member) left a comment

Looks okay, but far too verbose with the payload structs. Try to deduplicate those using the existing structs we have.
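(Purely illustrative of what such deduplication could look like; none of these struct names are the ones in the PR.)

```cpp
// Verbose pattern: every kernel payload repeats the same fields.
struct find_tracks_payload_verbose {
    unsigned int n_measurements = 0u;
    unsigned int n_in_params = 0u;
    unsigned int step = 0u;
};

// Deduplicated pattern: reuse one shared struct as a member, and keep only the
// kernel-specific fields in the per-kernel payload.
struct common_finding_data {
    unsigned int n_measurements = 0u;
    unsigned int n_in_params = 0u;
};

struct find_tracks_payload {
    common_finding_data common;
    unsigned int step = 0u;
};
```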

Comment on lines +66 to +67
// Here we could give control back to the caller, once our code allows
// for it. (coroutines...)
@stephenswat (Member) commented Feb 13, 2026

Instead of copying the same unstructured comment six times across this file, prepend them with TODO: to make them findable.

@krasznaa (Member, Author)

TODO is flagged by SonarCloud. This is not.

I'm copying the same sentence to make it easily searchable in our code once we embark on such a code change.

@stephenswat (Member)

> TODO is flagged by SonarCloud.

That's exactly the point: SonarCloud and other tools give you a list of code sections marked with TODO (and FIXME, etc.) comments so that you can easily find them. 😛

@krasznaa (Member, Author)

But it's not obvious that we will want to do anything here. Not to me. Not yet.

@krasznaa krasznaa force-pushed the CKFHarmonization-main-20260210 branch from b9265c0 to e0e2ca5 on February 13, 2026 16:26
@stephenswat (Member) left a comment

Starting to look a bit better. 👍

@stephenswat (Member)

Performance summary

Here is a summary of the performance effects of this PR:


| Kernel | Reciprocal throughput (0158246) | Reciprocal throughput (e0e2ca5) | Delta | Parallelism (0158246) | Parallelism (e0e2ca5) |
| --- | --- | --- | --- | --- | --- |
| propagate_to_next_surface | 7.84 ms | 6.49 ms | -17.2% | 3.45 | 4.09 |
| find_tracks | 1.74 ms | 1.74 ms | 0.0% | 1.83 | 1.83 |
| ccl_kernel | 824.83 μs | 825.14 μs | 0.0% | 1.37 | 1.37 |
| count_doublets | 811.36 μs | 814.67 μs | 0.4% | 1.61 | 1.61 |
| count_triplets | 568.02 μs | 567.41 μs | -0.1% | 1.02 | 1.02 |
| find_doublets | 532.92 μs | 532.45 μs | -0.1% | 3.08 | 3.08 |
| Thrust::sort | 379.63 μs | 379.50 μs | -0.0% | 7.32 | 7.32 |
| find_triplets | 170.65 μs | 170.34 μs | -0.2% | 1.31 | 1.31 |
| estimate_track_params | 146.39 μs | 146.37 μs | -0.0% | 2.68 | 2.68 |
| build_tracks | 125.26 μs | 125.67 μs | 0.3% | 3.71 | 3.71 |
| select_seeds | 58.91 μs | 59.45 μs | 0.9% | 1.34 | 1.34 |
| populate_grid | 24.00 μs | 24.05 μs | 0.2% | 1.22 | 1.22 |
| remove_duplicates | 23.39 μs | 23.52 μs | 0.6% | 26.19 | 26.10 |
| count_grid_capacities | 22.12 μs | 22.15 μs | 0.2% | 1.22 | 1.22 |
| fill_sorted_measurements | 19.77 μs | 19.78 μs | 0.0% | 1.13 | 1.13 |
| update_triplet_weights | 14.77 μs | 14.79 μs | 0.1% | 1.27 | 1.27 |
| apply_interaction | 13.89 μs | 13.85 μs | -0.3% | 6.70 | 6.72 |
| fill_finding_propagation_sort_keys | 8.84 μs | 8.81 μs | -0.4% | 7.64 | 7.66 |
| form_spacepoints | 8.33 μs | 8.33 μs | 0.0% | 1.48 | 1.48 |
| reduce_triplet_counts | 5.60 μs | 5.62 μs | 0.4% | 3.09 | 3.09 |
| unknown | 5.08 μs | 5.08 μs | -0.1% | 4.26 | 4.27 |
| fill_finding_duplicate_removal_sort_keys | 1.57 μs | 1.57 μs | -0.0% | 37.98 | 38.00 |
| Total | 13.35 ms | 12.00 ms | -10.1% | 2.98 | 3.28 |

Important

All metrics in this report are given as reciprocal throughput, not as wallclock runtime.

Note

This is an automated message produced upon the explicit request of a human being.

@stephenswat (Member)

> Performance summary

This looks good!

@stephenswat (Member)

But FYI: the physics CI currently fails with `an illegal memory access was encountered`.

@krasznaa (Member, Author)

> But FYI: the physics CI currently fails with `an illegal memory access was encountered`.

😦 To be fixed then...

@krasznaa krasznaa force-pushed the CKFHarmonization-main-20260210 branch from e0e2ca5 to d00f44e on February 16, 2026 14:28
@krasznaa (Member, Author)

I did not manage to reproduce a crash with:

./bin/traccc_seeding_example_cuda --input-directory=/home/krasznaa/ATLAS/data/odd-simulations-20240506/geant4_ttbar_mu200 --digitization-file=geometries/odd/odd-digi-geometric-config.json --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --material-file=geometries/odd/odd-detray_material_detray.json --input-events=10 --use-acts-geom-source=on --check-performance --truth-finding-min-track-candidates=5 --truth-finding-min-pt=1.0 --truth-finding-min-z=-150 --truth-finding-max-z=150 --truth-finding-max-r=10 --seed-matching-ratio=0.99 --track-matching-ratio=0.5 --track-candidates-range=5:100 --seedfinder-vertex-range=-150:150

I even tried 2 different CUDA versions.

Could you re-check @stephenswat? If you still see a crash, I'll need to test on the same node. 🤔

@stephenswat (Member)

Physics performance summary

Here is a summary of the physics performance effects of this PR. Command used:

traccc_seeding_example_cuda --input-directory=/data/Acts/odd-simulations-20240506/geant4_ttbar_mu200 --digitization-file=geometries/odd/odd-digi-geometric-config.json --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --material-file=geometries/odd/odd-detray_material_detray.json --input-events=10 --use-acts-geom-source=on --check-performance --truth-finding-min-track-candidates=5 --truth-finding-min-pt=1.0 --truth-finding-min-z=-150 --truth-finding-max-z=150 --truth-finding-max-r=10 --seed-matching-ratio=0.99 --track-matching-ratio=0.5 --track-candidates-range=5:100 --seedfinder-vertex-range=-150:150

Seeding performance

Total number of seeds went from 298344 to 298340 (-0.0%)

Seeding plots (images not included)

Track finding performance

Total number of found tracks went from 50221 to 50224 (+0.0%)

Finding plots (images not included)

Track fitting performance

Fitting plots (images not included)

Seeding to track finding relative performance

Seeding to track finding plots (images not included)
Note

This is an automated message produced on the explicit request of a human being.

@stephenswat (Member)

Performance summary

Here is a summary of the performance effects of this PR:


| Kernel | Reciprocal throughput (0158246) | Reciprocal throughput (d00f44e) | Delta | Parallelism (0158246) | Parallelism (d00f44e) |
| --- | --- | --- | --- | --- | --- |
| propagate_to_next_surface | 7.83 ms | 6.49 ms | -17.1% | 3.45 | 4.09 |
| find_tracks | 1.74 ms | 1.74 ms | 0.2% | 1.83 | 1.83 |
| ccl_kernel | 826.95 μs | 825.77 μs | -0.1% | 1.37 | 1.37 |
| count_doublets | 818.44 μs | 814.08 μs | -0.5% | 1.61 | 1.61 |
| count_triplets | 568.02 μs | 568.85 μs | 0.1% | 1.02 | 1.02 |
| find_doublets | 534.21 μs | 535.06 μs | 0.2% | 3.08 | 3.08 |
| Thrust::sort | 379.77 μs | 379.41 μs | -0.1% | 7.32 | 7.32 |
| find_triplets | 171.31 μs | 170.41 μs | -0.5% | 1.31 | 1.31 |
| estimate_track_params | 146.52 μs | 146.40 μs | -0.1% | 2.68 | 2.68 |
| build_tracks | 125.29 μs | 125.36 μs | 0.1% | 3.72 | 3.71 |
| select_seeds | 58.47 μs | 58.10 μs | -0.6% | 1.34 | 1.34 |
| populate_grid | 24.02 μs | 23.98 μs | -0.2% | 1.22 | 1.22 |
| remove_duplicates | 23.54 μs | 23.55 μs | 0.0% | 26.06 | 26.06 |
| count_grid_capacities | 22.17 μs | 22.22 μs | 0.2% | 1.22 | 1.22 |
| fill_sorted_measurements | 19.77 μs | 19.71 μs | -0.3% | 1.13 | 1.13 |
| update_triplet_weights | 14.75 μs | 14.80 μs | 0.4% | 1.27 | 1.27 |
| apply_interaction | 13.89 μs | 13.86 μs | -0.2% | 6.71 | 6.71 |
| fill_finding_propagation_sort_keys | 8.82 μs | 8.81 μs | -0.1% | 7.66 | 7.67 |
| form_spacepoints | 8.35 μs | 8.33 μs | -0.3% | 1.48 | 1.49 |
| reduce_triplet_counts | 5.67 μs | 5.63 μs | -0.6% | 3.08 | 3.09 |
| unknown | 5.07 μs | 5.08 μs | 0.1% | 4.26 | 4.26 |
| fill_finding_duplicate_removal_sort_keys | 1.57 μs | 1.57 μs | 0.0% | 38.00 | 38.05 |
| Total | 13.34 ms | 12.00 ms | -10.0% | 2.98 | 3.28 |

Important

All metrics in this report are given as reciprocal throughput, not as wallclock runtime.

Note

This is an automated message produced upon the explicit request of a human being.

@stephenswat (Member)

Interestingly, testing on pcadp04 reveals exactly the opposite behaviour:

./build/bin/traccc_throughput_mt_cuda --material-file=geometries/odd/odd-detray_material_detray.json --input-directory=/data/Acts/odd-simulations-20240506/geant4_ttbar_mu140 --digitization-file=geometries/odd/odd-digi-geometric-config.json --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --input-events=10 --cold-run-events=100 --processed-events=1000 --use-acts-geom-source=on --read-bfield-from-file --cpu-threads=20 --track-candidates-range=5:20 --seedfinder-vertex-range=-150:150 --deterministic --initial-links-per-seed=6

At current main:

16:52:40    ThroughputExample             INFO      Reconstructed track parameters: 2909792
16:52:40    ThroughputExample             INFO      Time totals:                   File reading  529 ms
16:52:40    ThroughputExample             INFO                  Warm-up processing  744 ms
16:52:40    ThroughputExample             INFO                    Event processing  6612 ms
16:52:40    ThroughputExample             INFO      Throughput:            Warm-up processing  7.44117 ms/event, 134.388 events/s
16:52:40    ThroughputExample             INFO                    Event processing  6.61275 ms/event, 151.223 events/s

With this PR:

16:54:29    ThroughputExample             INFO      Reconstructed track parameters: 2909798
16:54:29    ThroughputExample             INFO      Time totals:                   File reading  491 ms
16:54:29    ThroughputExample             INFO                  Warm-up processing  829 ms
16:54:29    ThroughputExample             INFO                    Event processing  7425 ms
16:54:29    ThroughputExample             INFO      Throughput:            Warm-up processing  8.29486 ms/event, 120.557 events/s
16:54:29    ThroughputExample             INFO                    Event processing  7.42517 ms/event, 134.677 events/s

So in reality that is more like a 10% slowdown. Fascinating!

@krasznaa (Member, Author)

Well, this is worrisome. 🤔 I don't claim to fully understand the situation, but it seems that cudaEventSynchronize(...) is less efficient than cudaStreamSynchronize(...).

This is how a 10-thread job (with the same parameters as posted in the previous comment) looks in the current main branch:

(profiler screenshot not included)

While this is how it looks with this branch's code:

(profiler screenshot not included)

In this code, size copies go through vecmem::async_size, which relies on event rather than stream synchronization.

But I have some doubts about the answer being quite so simple. 🤔 I'll do some profiling tomorrow with NSys as well.
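(To make the two patterns concrete, here is a minimal sketch of stream- versus event-based synchronization around an asynchronous size copy. This is not the traccc or vecmem code; the function and variable names are invented.)

```cpp
#include <cuda_runtime.h>

// Stream-based pattern (roughly what the current main branch ends up doing):
// wait for everything queued on the stream before reading the copied value.
unsigned int get_size_stream_sync(const unsigned int* d_size, cudaStream_t stream) {
    unsigned int h_size = 0u;
    cudaMemcpyAsync(&h_size, d_size, sizeof(unsigned int),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    return h_size;
}

// Event-based pattern (roughly what this branch picks up through the async
// size copies): record an event right after the copy and wait only for that.
unsigned int get_size_event_sync(const unsigned int* d_size, cudaStream_t stream,
                                 cudaEvent_t event) {
    unsigned int h_size = 0u;
    cudaMemcpyAsync(&h_size, d_size, sizeof(unsigned int),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(event, stream);
    cudaEventSynchronize(event);
    return h_size;
}
```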

@krasznaa (Member, Author)

Ahh, never mind. When I actually add up all the time that is spent in cuEventSynchronize and cuStreamSynchronize in both cases, I come to pretty much the same value. The proportions between these two types have shifted. But the total time spent in them didn't change in any meaningful way. 🤔

@stephenswat (Member)

Throughputs on the A5000:

@stephenswat (Member)

Regarding the d00f44e commit, what happens here is that the register usage changes (probably due to the kernel arguments), which increases occupancy but also increases register spilling. So the compiler is doing a poor job optimising there, and the CI benchmark is being tricked by the increased occupancy.
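(Side note, purely as a sketch of how this kind of regression is usually diagnosed and constrained; the kernel below is invented. Compiling with `nvcc -Xptxas=-v` reports the per-kernel register count and any spill loads/stores.)

```cpp
#include <cuda_runtime.h>

// __launch_bounds__ tells the compiler the maximum block size the kernel will
// be launched with, letting it trade registers per thread against occupancy
// explicitly instead of guessing from the kernel arguments.
__global__ void __launch_bounds__(128)
example_kernel(const float* in, float* out, unsigned int n) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * in[i];
    }
}
```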

@stephenswat
Copy link
Copy Markdown
Member

stephenswat commented Feb 17, 2026

The performance change in 8e67711 is easily explained by the fact that the block sizes change:

main:

const unsigned int nThreads = warp_size * 4;
...

8e67711:

const unsigned int deviceThreads = warp_size() * 2;
...

However, 8e67711 is also the commit that reduces throughput from 151 Hz to 139 Hz.
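(Just to spell out what that constant controls, a minimal sketch of how the thread count feeds into the launch configuration; the kernel and workload size are placeholders, not the traccc code.)

```cpp
#include <cuda_runtime.h>

__global__ void example_kernel(unsigned int n) { (void)n; }

void launch(cudaStream_t stream, unsigned int n_items) {
    // Halving the threads per block (warp_size() * 2 instead of warp_size * 4)
    // means twice as many, smaller blocks for the same workload, which can
    // shift both occupancy and register behaviour.
    const unsigned int warp = 32u;                 // placeholder warp size
    const unsigned int deviceThreads = warp * 2u;  // 64 threads per block
    const unsigned int nBlocks = (n_items + deviceThreads - 1u) / deviceThreads;
    example_kernel<<<nBlocks, deviceThreads, 0, stream>>>(n_items);
}
```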

@krasznaa (Member, Author)

Indeed I increased the block size in some cases. If that's the culprit, that would be a pretty clean issue to fix.

There were some comments here and there in the CUDA code for some of the block size choices, but not for all of them. I remember that one of them didn't seem to make sense to me, so I changed it on purpose.

I'll do some tests of my own on an L40s a little later today, and let you know what I find. Your findings are very useful, to be very clear about that.

stephenswat added a commit to stephenswat/traccc that referenced this pull request Feb 24, 2026
This commit adds the `__grid_constant__` qualifier to the CUDA track
finding kernel, allowing the compiler to make some additional
optimisations. This should also help us better understand performance
issues such as the ones in acts-project#1259.
stephenswat added a commit to stephenswat/traccc that referenced this pull request Feb 24, 2026
This commit adds the `__grid_constant__` qualifier to the CUDA track
finding kernel, allowing the compiler to make some additional
optimisations. This should also help us better understand performance
issues such as the ones in acts-project#1259.
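(For readers unfamiliar with the qualifier, a rough sketch of its use on a kernel parameter; the payload struct and kernel here are hypothetical, not the actual traccc track-finding kernel.)

```cpp
#include <cuda_runtime.h>

struct finding_payload {  // hypothetical payload
    const float* measurements;
    float* out_params;
    unsigned int n;
};

// __grid_constant__ marks a by-value, const kernel parameter as uniform across
// the grid, so the compiler can read it from constant memory instead of
// spilling a per-thread copy into local memory.
__global__ void find_tracks_kernel(const __grid_constant__ finding_payload payload) {
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < payload.n) {
        payload.out_params[i] = payload.measurements[i];
    }
}
```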
@krasznaa krasznaa force-pushed the CKFHarmonization-main-20260210 branch from d00f44e to 08f0655 on February 25, 2026 09:27
@krasznaa (Member, Author)

With __grid_constant__ added, on my desktop's GPU the difference between the main branch and this one is now seemingly smaller. But there is still a small throughput difference. (~0.5 Hz on this gaming GPU.)

I'll do some further work a bit later on. 🤔

@stephenswat (Member)

> With __grid_constant__ added, on my desktop's GPU the difference between the main branch and this one is now seemingly smaller. But there is still a small throughput difference. (~0.5 Hz on this gaming GPU.)

On the A5000 there is still a very noticeable performance impact, with the throughput going from 151 Hz to 137 Hz. 🙁

@krasznaa (Member, Author)

> With __grid_constant__ added, on my desktop's GPU the difference between the main branch and this one is now seemingly smaller. But there is still a small throughput difference. (~0.5 Hz on this gaming GPU.)

> On the A5000 there is still a very noticeable performance impact, with the throughput going from 151 Hz to 137 Hz. 🙁

One hope I have (and it would be lovely if it turned out to be true) is that once acts-project/vecmem#350 is incorporated into this project, it would get rid of a lot of this difference, since the unified code does all of its synchronization through (CUDA) events, while the current code does a bunch of CUDA stream synchronizations.

Let's see...

@stephenswat (Member)

> One hope I have (and it would be lovely if it turned out to be true) is that once acts-project/vecmem#350 is incorporated into this project, it would get rid of a lot of this difference, since the unified code does all of its synchronization through (CUDA) events, while the current code does a bunch of CUDA stream synchronizations.

Unfortunately the results I collect are with the event pooling already enabled, so I am afraid that this won't help.

…ion_kernel_payload.

Modified device::apply_interaction_payload not to be templated, by having device::apply_interaction receive the detector view as a separate argument. Then updated all the clients of the common algorithm base class to implement their versions of apply_interaction_kernel accordingly.
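(A rough sketch of the shape of this change, with made-up field names; not the actual traccc definitions.)

```cpp
#include <cuda_runtime.h>

// Before (illustrative): the payload was a class template carrying the
// detector view itself, so every payload instantiation depended on detector_t.
template <typename detector_t>
struct apply_interaction_payload_old {
    typename detector_t::view_type det_data;
    unsigned int n_params = 0u;
};

// After (illustrative): a plain, non-templated payload...
struct apply_interaction_payload {
    unsigned int n_params = 0u;
};

// ...with the detector view handed to the device function as its own argument,
// so only the function, not the payload type, needs the template parameter.
template <typename detector_t>
__device__ void apply_interaction(typename detector_t::view_type det_data,
                                  const apply_interaction_payload& payload) {
    (void)det_data;
    (void)payload;  // device-side work would go here
}
```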
…rnel_payload.

Ended up putting the modified device::find_tracks_payload into its own header file, to avoid compilation issues arising from the Thrust code used in find_tracks.ipp.
@krasznaa krasznaa force-pushed the CKFHarmonization-main-20260210 branch from c5fdf02 to 1eb7409 on February 26, 2026 15:58
@sonarqubecloud

mimodekjaer pushed a commit to mimodekjaer/traccc that referenced this pull request Mar 30, 2026
This commit adds the `__grid_constant__` qualifier to the CUDA track
finding kernel, allowing the compiler to make some additional
optimisations. This should also help us better understand performance
issues such as the ones in acts-project#1259.

Labels

- alpaka: Changes related to Alpaka
- cleanup: Makes the code all clean and tidy
- cuda: Changes related to CUDA
- refactor: Change the structure of the code
- sycl: Changes related to SYCL
