Add JIT code generator for PPC64 by cyrozap · Pull Request #320 · tevador/RandomX

cyrozap · 2026-04-04T20:30:57Z

Adds a JIT backend for POWER8 and later Power ISA CPUs. Assembly instructions were restricted to those available in Power ISA v2.06 in order to facilitate adding support for POWER7, but currently only RandomX V1 is supported on those chips due to its lack of AES instructions.

Support has been added for both little-endian and big-endian CPUs, but only little-endian has been tested.

Fixes #132

Adds a JIT backend for POWER8 and later Power ISA CPUs. Assembly instructions were restricted to those available in Power ISA v2.06 in order to facilitate adding support for POWER7, but currently only RandomX V1 is supported on those chips due to its lack of AES instructions. Support has been added for both little-endian and big-endian CPUs, but only little-endian has been tested. Fixes tevador#132

cyrozap · 2026-04-05T03:02:20Z

Benchmarks on a Raptor Computing Systems Talos II with dual POWER9 CPUs:

./randomx-benchmark --auto --verify:

RandomX benchmark v2.0
 - Argon2 implementation: reference
 - light memory mode (256 MiB)
 - JIT compiled mode
 - hardware AES mode
 - small pages mode
 - batch mode
Initializing ...
Memory initialized in 1.01555 s
Initializing 1 virtual machine(s) ...
Running benchmark (1000 nonces) ...
Calculated result: 10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Reference result:  10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Performance: 25.3256 ms per hash

./randomx-benchmark --auto --verify --v2:

RandomX benchmark v2.0
 - Argon2 implementation: reference
 - light memory mode (256 MiB)
 - JIT compiled mode
 - hardware AES mode
 - small pages mode
 - batch mode
Initializing ...
Memory initialized in 1.0161 s
Initializing 1 virtual machine(s) ...
Running benchmark (1000 nonces) ...
Calculated result: b85d79e080b10b6ad28c2e6c993601a1361917dba979e03a0a8f7248aaf4ba52
Reference result:  b85d79e080b10b6ad28c2e6c993601a1361917dba979e03a0a8f7248aaf4ba52
Performance: 26.8799 ms per hash

./randomx-benchmark --auto --mine:

RandomX benchmark v2.0
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - hardware AES mode
 - small pages mode
 - batch mode
Initializing (144 threads) ...
Memory initialized in 2.69269 s
Initializing 1 virtual machine(s) ...
Running benchmark (1000 nonces) ...
Calculated result: 10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Reference result:  10b649a3f15c7c7f88277812f2e74b337a0f20ce909af09199cccb960771cfa1
Performance: 237.531 hashes per second

./randomx-benchmark --auto --mine --v2:

RandomX benchmark v2.0
 - Argon2 implementation: reference
 - full memory mode (2080 MiB)
 - JIT compiled mode
 - hardware AES mode
 - small pages mode
 - batch mode
Initializing (144 threads) ...
Memory initialized in 2.71528 s
Initializing 1 virtual machine(s) ...
Running benchmark (1000 nonces) ...
Calculated result: b85d79e080b10b6ad28c2e6c993601a1361917dba979e03a0a8f7248aaf4ba52
Reference result:  b85d79e080b10b6ad28c2e6c993601a1361917dba979e03a0a8f7248aaf4ba52
Performance: 177.077 hashes per second

The vector permutation is unnecessary on little-endian systems when using `stvx`.

SChernykh · 2026-04-05T16:17:02Z

due to its lack of AES instructions.

RandomX v1 also uses AES in the scratchpad hash/fill step, so you can use the existing soft AES code for RandomX v2 loop. It shouldn't be that hard compared to the full JIT implementation that you've done already.

This only saves one or two instructions, but there are no drawbacks to how this optimization is implemented so there's no reason not to do it.

tevador · 2026-04-06T10:23:01Z

FYI, the ppc64le build is failing: https://github.com/tevador/RandomX/actions/runs/24008647458/job/70025217444

This only saves one or two instructions in a very cold path in the code, but there are no drawbacks to implementing this optimization so there's no reason not to do it.

This optimization can save one or two instructions for some immediates.

cyrozap · 2026-04-06T20:25:53Z

FYI, the ppc64le build is failing: https://github.com/tevador/RandomX/actions/runs/24008647458/job/70025217444

From the CI log:

/home/runner/work/RandomX/RandomX/src/cpu.cpp:52:18: fatal error: asm/cputable.h: No such file or directory
   52 |         #include <asm/cputable.h>
      |                  ^~~~~~~~~~~~~~~~

asm/cputable.h is a Linux kernel header, so I think you would need to run apk add linux-headers in the ppc64le VM setup script to resolve that error.

We could also avoid the dependency entirely by #define-ing the value ourselves. This constant is needed in both src/cpu.cpp and src/jit_compiler_ppc64.cpp, so if you want to do this we'd have to make that change in both places (or make our own header that gets included into both files). Edit: I just realized I didn't actually need to call getauxval a second time, so I pushed a change to fix that. Now the constant is only needed in src/cpu.cpp and in the JIT backend we just query the cpu object for feature support when we need it.

Which option would you prefer?

FWIW, in the future I plan to use more of those definitions (full list is here) to detect the system's ISA version in order to patch in more-optimized code for the newer architectures, so my personal preference is to just use the Linux kernel header so there's no possibility of copy/paste issues. That said, I'll understand completely if you want to avoid a dependency on kernel headers just for a handful of constant definitions (which AIUI should never change between kernel versions).

We already query the CPU feature support in cpu.cpp, so there's no need to do it again.

This is the same split in Debian--the ppc64el port is only supported on POWER8 and later, so POWER7 and earlier can only run Debian ppc64 (big-endian 64-bit PowerPC). Because of this, we set the default little-endian architecture to POWER8. And since the RandomX JIT backend for PPC64 requires VSX, which is only supported by POWER7 and later, the lowest we can set the default big-endian architecture to is POWER7.

Most big-endian Power systems use V1 of the Power ELF ABI, so in order for this code to run on those systems we need to make a few changes. - We increase the size of the stacks by 80 bytes to avoid overwriting any of the information callers will write there. These larger stack frames are compatible with both ABI V1 and V2. - We add a macro to declare a C function with metadata that the linker is able to use when building for an ABI V1 system. - We generate function descriptors for JIT-generated functions, since pointers to functions in the V1 ABI actually point to function descriptors, not the functions themselves.

The Argon2 implementation writes to the cache in native endianness, so we need to read it in native endianness. And since nothing that reads the dataset cares about its byte order, we can keep that in native endianness as well despite the spec saying that it should be in little-endian byte order.

The scratchpad and register file must both be read from and written to in little-endian byte order.

We need to byte-swap 128-bit vectors for AES mixing on big-endian PPC64. Otherwise, the interpreter v2 hash tests will fail.

In ABI v1, register GPR2 is loaded by the caller from the function descriptor, so we don't need to emit instructions to load it ourselves.

cyrozap · 2026-04-10T01:07:38Z

Just a quick update on this: Big-endian PPC64 works now, but after doing more thorough testing on little-endian, it seems there's a very intermittent bug where the scratchpad pointer base address is getting added to spAddr0 / spAddr1 twice, resulting in out-of-bounds writes with associated segfaults and data corruption. I'll let you know once I've found the root cause of the issue and fixed it.

This should make the code a little bit easier to reason about.

GCC's built-in cache clearing function didn't do anything on Power, so we use our own code, borrowed from LLVM and modified to detect the cache line size at runtime. Also, to avoid a huge hit to performance (~13%), rather than clearing the cache for the whole 128 KiB of constants and code, we clear the cache just for the bytes of the program that was just written, and we only do that cache clearing after the whole program has been written. And since the constants aren't used as instructions, we can skip clearing any caches for that data. So now instead of clearing caches for 128 KiB of memory, we're only doing that for about 4+ KiB of just the program memory. Total hit to performance from actually clearing the cache seems to be in the range of 0.2%-0.3%, which is more than acceptable considering the alternative results in wasted cycles and random crashes.

cyrozap · 2026-04-10T19:37:19Z

I fixed the segfaults and memory corruption bugs. Turns out the caches weren't being invalidated after writing the code, causing stale instructions to get executed from the cache and leading to intermittent, unpredictable results. But now the caches are being properly invalidated and the code is running reliably.

On Power ISA processors that support v3.0B or later, use mffscrn instead of mtfsf to avoid a pipeline flush. This gives us an extra 0.5%-1.0% performance on RandomX V1 and a negligible performance increase (much less than 0.1%) on RandomX V2.

This gives us an extra 2.0% performance on RandomX V1 and V2.

This saves maybe one instruction every once in a while.

This should help slightly with pipelining.

The only benefit to using the VSX versions of these instructions is that they have access to more registers. But we don't use those extra registers, so in case anyone wants to port this code to processors without VSX we can make their job easier by removing the VSX-only instructions where they're not strictly needed.

We only need these two definitions, so we can avoid adding a new dependency by copying them from the headers.

cyrozap · 2026-04-16T23:00:21Z

@tevador I hadn't heard back from you regarding what you wanted to do about the missing kernel header issue, so in 9a77acf I just went ahead and removed the need for that dependency entirely. So if you want to run CI on this PR, it should be working now.

Separately, I got software AES working, so RandomX V2 should be able to run on POWER7 now. I've tested the implementation and it works both on my little-endian POWER9 system and in a Qemu-emulated big-endian POWER7 system.

At this point, I consider this implementation basically "complete" (pending review, of course). It works on (emulated) POWER7 through (bare-metal) POWER9, on both big-endian and little-endian systems. There's things that could be optimized in the software AES implementation, but I've decided to hold off on doing any of that until I can get access to a real POWER7 system to validate that any "improvements" I make actually increase performance.

Anyways, if there's anything I need to change or fix, please let me know!

This gives us an extra ~2% performance on RandomX V1 and V2.

musl doesn't define _SC_LEVEL1_DCACHE_LINESIZE or (I assume) _SC_LEVEL1_ICACHE_LINESIZE, so attempting to use those values with sysconf to get the cache line sizes causes the build to fail on Alpine Linux and other distros that use musl libc. To fix this, we just check if those values are defined and if they're not, we skip querying the cache line size and fall back to the safe default of 32 bytes. While we could probably get away with assuming a minimum cache line size of 64 bytes or even 128 bytes [1] for 64-bit systems, caching issues can be extremely difficult to catch, so I think it's best to play it safe to avoid those issues entirely. This will cause a small performance penalty on systems that use musl libc. That said, if any users of those systems need this fixed it would be straightforward to modify the code to use whatever mechanism is normally used on those systems to detect the cache line sizes. [1]: The lowest version of the Power ISA we target is v2.06 and we also require VSX. The only chips I'm aware of that meet both of those requirements are POWER7 and later, and every one of those chips has 128-byte cache lines.

cyrozap · 2026-04-17T19:36:07Z

I pushed a fix for the missing sysconf definitions (and also a 2% perf increase)--please run CI again at your earliest convenience.

We don't exclusively use it on big-endian systems anymore as little-endian systems need it for the RandomX V2 AES mixing process.

This is more correct since it more accurately describes what is being done to the Group E registers (the upper eight bits and lower 22 bits of each double come directly from the Group E OR mask values. That said, despite clearing the lower 22 bits of the mask this change is effectively a no-op. The reason for this is that because the Group E registers are loaded exclusively by converting signed 32-bit integers into doubles, the lower 22 bits of each double are always zero before they each get set by the OR mask. So, clearing those bits does not change their values.

With the Group E AND mask properly set to include just the bits we want to keep from the original register, we can combine the `vand` and `vor` operations into a single `vsel` operation, saving us one vector instruction every time we need to load a Group E register from memory. This gives us an extra ~1.3% boost to performance on both V1 and V2.

Fixes: 8cd6435 ("Optimize Group E register conversion on PPC64")

This will sometimes save us one instruction.

On POWER9 the LSU can perform four loads at a time, and we can make it easier for the CPU to do this by using different destination registers for each load.

The POWER9 LSU can perform four loads at a time and a POWER9 core can perform four ALU operations at a time, so if we group four loads followed by four ALU operations we can better hide the load latency and get a very small performance boost.

Fixes: 31ff28d ("Cache reciprocals in PPC64 JIT compiler")

This will better indicate what each part is doing.

This doesn't affect performance since it's the same sequence of instructions, but doing this enables us to more easily optimize the code for each supported architecture.

On Power ISA v3.0 and later, we can use stxv to perform the Group F scratchpad stores using immediate offsets instead of register offsets. This saves us from having to load the offsets into registers before performing the stores.

We don't need to load zero into a register to use it as an offset--we can just set RA in the instruction word to zero. This saves one ALU operation per loop iteration.

This doesn't really improve execution latency because the lis/ori immediate load is executed in parallel with other instructions before the xor, but for cases where the upper 16 bits are zero or where the sign bit and lower 16 bits are zero, this will save a tiny bit of instruction cache and maybe sometimes speed things up when the ALU pipelines are very busy before the xor.

They're short enough to use one line for each.

We know for a fact the branch to skip the rounding mode update is taken 93.75% of the time, so we might as well add the hint to indicate that to the CPU. Adding the hint reduces the number of branch mispredictions by about 5%, from ~1.25% to ~1.19%, in RandomX V2 on POWER9.

This just makes it easier to see at a glance what bits are set in the field. This change has no performance impact.

We know for a fact the branch is only taken 0.390625% of the time, so we might as well add the hint to indicate that to the CPU. Adding the hint reduces the number of branch mispredictions on POWER9 by about 6.9% in V1 (from ~0.961% to ~0.895%) and by about 7.5% in V2 (from ~1.19% to ~1.10%). This increases V1 performance by about 0.1% and V2 performance by <0.05%

Removing the macro enables us to group all the permutation operations together, which should reduce stalls on big-endian systems and makes the code slightly easier to read. This change has no effect on little-endian systems.

According to the ISA manual, loading the offset value into a register using `li rX, simm` and then performing the vector load or store immediately after that using `rX` as the third argument will optimize the memory access on POWER9. Presumably, this sequence gets fused into a single micro-op in the CPU. Changing the code to use this sequence for the vector loads and stores shows a small but measurable 0.05%-0.1% performance increase for RandomX V1 on POWER9.

cyrozap mentioned this pull request Apr 4, 2026

POWER support #132

Open

cyrozap added 3 commits April 5, 2026 10:40

Correct comment for STORE_LE_VR

65ba514

Optimize STORE_LE_VR on little-endian POWER

05ff7dd

The vector permutation is unnecessary on little-endian systems when using `stvx`.

Make it clear that f0-f31 are aliased by vs0-vs31

dcc710a

cyrozap added 3 commits April 5, 2026 11:51

Mark r23 as unused

279514b

Optimize CBRANCH

712da1d

This only saves one or two instructions, but there are no drawbacks to how this optimization is implemented so there's no reason not to do it.

Add a comment on the importance of using dcbt to prefetch the next block

09ddaf3

cyrozap added 2 commits April 6, 2026 10:08

Optimize scratchpad address calculation in program suffix

c103a1b

This only saves one or two instructions in a very cold path in the code, but there are no drawbacks to implementing this optimization so there's no reason not to do it.

Optimize emitMovImm64 for rotated 32-bit immediates

2241c36

This optimization can save one or two instructions for some immediates.

cyrozap added 7 commits April 6, 2026 15:35

Use cpu.hasAes() instead of getauxval on PPC64

cf9ca50

We already query the CPU feature support in cpu.cpp, so there's no need to do it again.

Fix BE PPC64 scratchpad and register endianness

205f4a5

The scratchpad and register file must both be read from and written to in little-endian byte order.

Fix interpreter v2 tests on big-endian PPC64

1865a43

We need to byte-swap 128-bit vectors for AES mixing on big-endian PPC64. Otherwise, the interpreter v2 hash tests will fail.

Remove unnecessary immediate load on PPC64 BE with v1 ABI

b7a3154

In ABI v1, register GPR2 is loaded by the caller from the function descriptor, so we don't need to emit instructions to load it ourselves.

cyrozap mentioned this pull request Apr 8, 2026

Include ppc64le CPU mining support xmrig/xmrig#2227

Open

cyrozap added 3 commits April 9, 2026 22:51

Move PPC64 VM prologue generation into prefix generation function

778a58f

This should make the code a little bit easier to reason about.

Factor out the common parts of the scratchpad store

c2e4355

cyrozap added 3 commits April 11, 2026 00:39

Optimize CFROUND for POWER9 (ISA v3.0B)

d8f508d

On Power ISA processors that support v3.0B or later, use mffscrn instead of mtfsf to avoid a pipeline flush. This gives us an extra 0.5%-1.0% performance on RandomX V1 and a negligible performance increase (much less than 0.1%) on RandomX V2.

Cache reciprocals in PPC64 JIT compiler

31ff28d

This gives us an extra 2.0% performance on RandomX V1 and V2.

Simplify scratchpad loading code

8abf44d

This saves maybe one instruction every once in a while.

cyrozap added 4 commits April 11, 2026 15:44

Move the creation of the zero vector further from where it's used

7f815b9

This should help slightly with pipelining.

Implement software AES for PPC64

2016660

Avoid dependency on Linux kernel headers

9a77acf

We only need these two definitions, so we can avoid adding a new dependency by copying them from the headers.

cyrozap added 2 commits April 17, 2026 00:02

Use round-robin temporary register allocator in PPC64 JIT compiler

62e9457

This gives us an extra ~2% performance on RandomX V1 and V2.

cyrozap added 22 commits April 24, 2026 10:19

Rename the PPC64 byte-reverse mask to better reflect its purpose

23b22fb

We don't exclusively use it on big-endian systems anymore as little-endian systems need it for the RandomX V2 AES mixing process.

Use .octa for vector byte-reverse mask to avoid confusion

2e4c986

PPC64 JIT: Correct maximum RandomX instruction code size

694dd00

Fixes: 8cd6435 ("Optimize Group E register conversion on PPC64")

PPC64 JIT: Optimize emitAddImm32 by using addis for supported values

2bb1ab2

This will sometimes save us one instruction.

PPC64 JIT: Add some notes on optimizing emitAddImm32

f633ec1

PPC64 JIT: Make sure groups of four loads use different temporary regs

bf0b5cd

On POWER9 the LSU can perform four loads at a time, and we can make it easier for the CPU to do this by using different destination registers for each load.

PPC64 JIT: Group loads four at a time

578ada3

The POWER9 LSU can perform four loads at a time and a POWER9 core can perform four ALU operations at a time, so if we group four loads followed by four ALU operations we can better hide the load latency and get a very small performance boost.

PPC64 JIT: Reorder ld arguments to match the assembly instruction

279a960

Fixes: 31ff28d ("Cache reciprocals in PPC64 JIT compiler")

PPC64 JIT: Rename scratchpad store prologue/epilogue

8f1c2f3

This will better indicate what each part is doing.

PPC64 JIT: Move the Group F scratchpad store into the code generator

7144c2d

This doesn't affect performance since it's the same sequence of instructions, but doing this enables us to more easily optimize the code for each supported architecture.

PPC64 JIT: Avoid moving register ma when we don't need to

7b48443

PPC64 JIT: Optimize Group F register scratchpad stores on pre-v3.0

047320a

We don't need to load zero into a register to use it as an offset--we can just set RA in the instruction word to zero. This saves one ALU operation per loop iteration.

PPC64 JIT: Rearrange the beq/bne instruction formatters

50b0658

They're short enough to use one line for each.

PPC64 JIT: Convert the BO field values to hexadecimal

ef71a76

This just makes it easier to see at a glance what bits are set in the field. This change has no performance impact.

PPC64 JIT: Remove STORE_LE_VR macro

e9a2c8e

Removing the macro enables us to group all the permutation operations together, which should reduce stalls on big-endian systems and makes the code slightly easier to read. This change has no effect on little-endian systems.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add JIT code generator for PPC64#320

Add JIT code generator for PPC64#320
cyrozap wants to merge 50 commits into
tevador:masterfrom
cyrozap:power-jit

cyrozap commented Apr 4, 2026

Uh oh!

cyrozap commented Apr 5, 2026

Uh oh!

SChernykh commented Apr 5, 2026

Uh oh!

tevador commented Apr 6, 2026

Uh oh!

cyrozap commented Apr 6, 2026 •

edited

Loading

Uh oh!

cyrozap commented Apr 10, 2026

Uh oh!

cyrozap commented Apr 10, 2026

Uh oh!

cyrozap commented Apr 16, 2026

Uh oh!

cyrozap commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

cyrozap commented Apr 4, 2026

Uh oh!

cyrozap commented Apr 5, 2026

Uh oh!

SChernykh commented Apr 5, 2026

Uh oh!

tevador commented Apr 6, 2026

Uh oh!

cyrozap commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cyrozap commented Apr 10, 2026

Uh oh!

cyrozap commented Apr 10, 2026

Uh oh!

cyrozap commented Apr 16, 2026

Uh oh!

cyrozap commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

cyrozap commented Apr 6, 2026 •

edited

Loading