ARM64 JIT: Optimize Group E register conversion #324
Conversation
The AND and ORR sequence can be simplified down to a single BIF instruction if we correct the Group E AND mask to additionally clear the lower 22 bits of each double. This is possible because Group E registers are always loaded and converted from signed 32-bit integers. The int32-to-double conversion process never sets the lower 22 bits of the resulting double, so it doesn't matter whether or not we clear them with the mask. And since we're able to clear those bits with the mask, we can treat the AND/OR process like a bit-select operation, where the AND mask is used to select between the bits in the OR mask and the bits to keep unchanged. This change boosts performance by ~0.9% on an Apple M1 Pro, and likely more than that on systems with weaker OoO execution capabilities.
Good find. I didn't know about this ARM64 instruction. I'll test it tomorrow.
coffnix
left a comment
I compiled the RandomX library with the submitted AArch64 patches, including the Group E register conversion optimization that replaces the AND+ORR sequence with a single BIF instruction and adjusts the AND mask, and deployed it system-wide. I also built Monero with a custom patch that removes the in-tree RandomX (external/randomx) and forces linkage against the system-provided librandomx, and confirmed that monerod links against the patched /usr/lib64/librandomx.so (verified via lsof and /proc//maps).

I then performed runtime validation under real workload conditions:

- Local mining at full CPU utilization (start_mining with all available threads) ran with a stable and expected hashrate (~1.7 kH/s on a 12-thread ARMv9.2A system) and no crashes, SIGILL, or memory faults.
- Integration with P2Pool also validated cleanly: the RandomX hasher allocated the full dataset (~2.5 GiB), updated the cache and dataset across multiple threads, and synchronized the sidechain without errors such as invalid shares, hashing failures, or rejected work.

This indicates that the modified JIT AArch64 path, the SIMD masking changes, and the BIF-based optimization behave correctly under concurrent load and do not introduce observable consensus or stability issues in practical mining scenarios on a Linux system using an external RandomX library.
thanks for your patch @cyrozap
Tested it on my phone - everything compiled fine, hashes matched both on RandomX and RandomX v2.
See #324 Co-authored-by: cyrozap <220973+cyrozap@users.noreply.github.com>