Stability changes to SIMD code (building for Clang + on the M1). #108
TL;DR: This PR makes some small structural changes to the SIMD code. The code works as before, but it's slightly closer to standard C++. It might also be worth considering whether we want to test how fast a pure C++ version of the bucketer is. Big thanks to both @ElenaKirshanova and @malb for helping me to debug some of these issues and fixing some of the build code respectively.
This PR fixes the SIMD code on Clang and ARM machines. It turns out that some of the SIMD code either didn't compile with Clang or crashed on the M1. This meant that some changes had to be made to e.g. the shuffling code. There were also some type-punning differences that I had to fix (see all of the extra calls to `memcpy`).
Note that this PR doesn't add full support for G6K on the M1. All of the sieves work, except for the HK3 sieve in a multi-threaded setting (there's a use-after-free crash). This is something we hope to fix.
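For anyone wondering why the type-punning fixes go through `memcpy`: reading an object through a pointer of an unrelated type is undefined behaviour in standard C++, and compilers are allowed to treat such code differently. Copying the bytes is the portable alternative. Below is a minimal sketch of the pattern, not the code from this PR; `float_bits` and `bits_to_float` are hypothetical helper names, and the example assumes a 32-bit `float`.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the PR): view the bits of a float as a
// 32-bit unsigned integer without violating strict aliasing.
inline std::uint32_t float_bits(float value) {
    static_assert(sizeof(std::uint32_t) == sizeof(float),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    // memcpy copies the object representation; compilers typically lower
    // a fixed-size copy like this to a single register move.
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Hypothetical inverse helper: rebuild a float from its bit pattern.
inline float bits_to_float(std::uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

The same pattern applies when moving data between a vector register type and a plain array: copy the bytes rather than casting the pointer.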
As before, the tests for this code are here.
Note that as part of implementing this, I had to re-implement every Intel intrinsic that we use in standard C++ (so that I could test the intrinsics on ARM). This means that we could switch over to a pure C++ implementation of the bucketer if needed: we just have to switch out the calls to the intrinsics.
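To give a feel for what a standard C++ stand-in for an intrinsic can look like, here is a small sketch that mirrors the lane selection of `_mm_shuffle_epi32` on a plain array. This is an illustration only and not the code from this PR: the `Vec4i` alias and the `shuffle_epi32_portable` name are made up for the example, and I'm not claiming this particular intrinsic is one the bucketer uses.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for a 128-bit register holding four 32-bit lanes.
using Vec4i = std::array<std::int32_t, 4>;

// Portable model of _mm_shuffle_epi32(a, Imm): output lane i takes the
// input lane selected by bits [2i+1 : 2i] of the 8-bit immediate.
template <unsigned Imm>
Vec4i shuffle_epi32_portable(const Vec4i& a) {
    static_assert(Imm < 256, "the immediate must fit in 8 bits");
    Vec4i out{};
    for (unsigned lane = 0; lane < 4; ++lane) {
        out[lane] = a[(Imm >> (2 * lane)) & 3u];
    }
    return out;
}
```

For example, `shuffle_epi32_portable<0x1B>(v)` reverses the four lanes of `v`, and `shuffle_epi32_portable<0xE4>(v)` returns `v` unchanged. A reference version like this can be compared against the real intrinsic on x86 and then run on its own on ARM.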
Finally, there's one outstanding curiosity: the vectorised random number code (that I wrote) sometimes gets stuck in a fixed point. I just replaced this with two calls to `rand`, and I didn't notice any performance differences. This might be something to look into.
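As a rough illustration of the "two calls to `rand`" idea, here is one way two calls could be combined into a single wider value. This is a sketch under assumptions rather than the replacement code itself: `random_30_bits` is a hypothetical name, and the 15-bits-per-call split is my choice, based on the standard only guaranteeing that `RAND_MAX` is at least 32767.

```cpp
#include <cstdint>
#include <cstdlib>

// Hypothetical helper (not from the PR): build a 30-bit value from two
// calls to std::rand, keeping only the low 15 bits of each call because
// that is all the standard guarantees RAND_MAX covers.
inline std::uint32_t random_30_bits() {
    const std::uint32_t high = static_cast<std::uint32_t>(std::rand()) & 0x7FFFu;
    const std::uint32_t low = static_cast<std::uint32_t>(std::rand()) & 0x7FFFu;
    return (high << 15) | low;
}
```

One thing to keep in mind if this route is kept: `std::rand` is not guaranteed to be thread-safe, so multi-threaded callers may want a per-thread generator instead.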