With reference to https://www.corsix.org/content/fast-crc32c-4k, what I call crc32_4k is your option 12 ("8-byte Hardware-accelerated"), and what I call crc32_4k_three_way is your option 13 ("Golden"). The theoretical upper bound on option 13 is 64 bits/cycle, which your implementation gets close to, at 62 bits/cycle. What I realised is that:
- There's an inferior option, that I call crc32_4k_pclmulqdq, but you might call "Silver".
- Gold and silver use separate execution ports, and thus can be alloyed together, for a theoretical upper bound of 120.89 bits/cycle (that is, 64+72 = 136 bytes, or 1088 bits, every 9 cycles). I'm measuring 93 bits/cycle for this alloy, and I imagine that a well-tuned implementation could get closer to 120.89.
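
For anyone who wants to check the arithmetic, here is a quick sketch in Python. It only uses the figures quoted above; the helper name bits_per_cycle is mine, and reading the 72 bytes as the crc32 share (8 bytes/cycle over 9 cycles) and the 64 bytes as the pclmulqdq share is my interpretation of the numbers.

```python
# Figures quoted in the comment above; nothing here is measured by this script.
GOLD_BOUND = 64          # bits/cycle, theoretical bound for option 13
GOLD_MEASURED = 62       # bits/cycle, measured for option 13
ALLOY_MEASURED = 93      # bits/cycle, measured for gold + silver together

# Assumed split of one 9-cycle window: 72 bytes via crc32, 64 bytes via pclmulqdq.
WINDOW_BYTES = 64 + 72
WINDOW_CYCLES = 9


def bits_per_cycle(num_bytes: int, cycles: int) -> float:
    """Convert a bytes-per-window figure into bits per cycle."""
    return num_bytes * 8 / cycles


alloy_bound = bits_per_cycle(WINDOW_BYTES, WINDOW_CYCLES)

print(f"alloy upper bound: {alloy_bound:.2f} bits/cycle")
print(f"gold reaches  {GOLD_MEASURED / GOLD_BOUND:.1%} of its bound")
print(f"alloy reaches {ALLOY_MEASURED / alloy_bound:.1%} of its bound")
```

The point of the comparison is that gold is already close to its ceiling, while the alloy still has a fair amount of headroom left.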