Skip to content

Conversation

@joerowell
Copy link
Collaborator

TL;DR: This PR makes some small structural changes to the SIMD code. The code works as before, but it's slightly closer to standard C++. Might also be worth considering if we want to test how fast a pure C++ version of the bucketer is. Big thanks to both @ElenaKirshanova and @malb for helping me to debug some of these issues and fixing some of the build code respectively.

This PR fixes the SIMD code on Clang and ARM machines. It turns out that some of the SIMD code didn't build compile with Clang or crashed on the M1. This meant that some changes had to be made to e.g the shuffling code. There were also some type-punning differences that I had to fix (see all of the extra calls to memcpy).

Note that this PR doesn't add full support for G6K to the M1. All of the sieves work, except for the HK3 sieve in a multi-threaded setting (there's a use-after-free crash). This is something we hope to fix.

As before, the tests for this code are here.

Note that as part of implementing this, I had to re-implement every Intel intrinsic that we use in standard C++ (so that I could test the intrinsics on ARM). This means that we could switch over to a pure C++ implementation of the bucketer if needed: we just have to switch out the calls to the intrinsics.

Finally, there's one outstanding curiousity: the vectorised random number code (that I wrote) sometimes gets stuck in a fixed point. I just replaced this with two calls to rand, and I didn't notice any performance differences. This might be something to look into.

@ElenaKirshanova
Copy link
Collaborator

Tested on M1 with Martin's setup.py from #107
All compiles well.

@lducas
Copy link
Member

lducas commented Dec 2, 2023

Any reason we haven't merged this yet ??

@joerowell
Copy link
Collaborator Author

joerowell commented Dec 2, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants