Sign: Unpack s1, s2, and t0 on the fly in REDUCE_RAM mode #1002

Open
mkannwischer wants to merge 8 commits into main from sign-recompute-s1s2t0

@mkannwischer mkannwischer commented Mar 22, 2026

Introduce mld_s1vec, mld_s2vec, and mld_t0vec, following the same pattern as mld_polymat for reduced RAM usage. In normal mode, they store the full NTT'd polyvec. In REDUCE_RAM mode, they store a pointer to the packed data in the secret key and unpack + NTT individual polynomials on demand.
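The two storage strategies described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: the type and function names (`vec_full`, `vec_lazy`, `toy_unpack`, `toy_transform`), the 2-bit packed format, and the vector length are all hypothetical, and `toy_transform` is only a placeholder where the real code would run the NTT. The PR selects one representation at build time; the sketch defines both side by side so they can be compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 256               /* coefficients per polynomial */
#define VEC_LEN 4           /* vector length chosen for this sketch */
#define TOY_PACKED_BYTES 64 /* hypothetical format: 2 bits per coefficient */

typedef struct { int32_t coeffs[N]; } poly;

/* Toy stand-in for an unpack routine: 2 bits -> value in [-1, 2]. */
static void toy_unpack(poly *p, const uint8_t *packed)
{
  for (size_t i = 0; i < N; i++)
  {
    p->coeffs[i] = (int32_t)((packed[i / 4] >> (2 * (i % 4))) & 3) - 1;
  }
}

/* Placeholder for the NTT; it only scales and is NOT a real transform. */
static void toy_transform(poly *p)
{
  for (size_t i = 0; i < N; i++) p->coeffs[i] *= 3;
}

/* Normal mode: every transformed polynomial is kept in memory. */
typedef struct { poly vec[VEC_LEN]; } vec_full;

static void vec_full_init(vec_full *v, const uint8_t *packed)
{
  for (size_t i = 0; i < VEC_LEN; i++)
  {
    toy_unpack(&v->vec[i], packed + i * TOY_PACKED_BYTES);
    toy_transform(&v->vec[i]);
  }
}

static void vec_full_get(poly *out, const vec_full *v, size_t i)
{
  assert(i < VEC_LEN);
  *out = v->vec[i];
}

/* Reduced-RAM mode: only a pointer into the packed key is kept. */
typedef struct { const uint8_t *packed; } vec_lazy;

static void vec_lazy_init(vec_lazy *v, const uint8_t *packed)
{
  v->packed = packed;
}

/* Unpack and transform polynomial i on demand into caller scratch space. */
static void vec_lazy_get(poly *out, const vec_lazy *v, size_t i)
{
  assert(i < VEC_LEN);
  toy_unpack(out, v->packed + i * TOY_PACKED_BYTES);
  toy_transform(out);
}
```

Both accessors return the same polynomial; the lazy variant trades repeated unpack and transform work for a struct that holds one pointer instead of `VEC_LEN` full polynomials.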

This reduces the memory allocated during signing in REDUCE_RAM mode (in bytes):

  • ML-DSA-44: 32,448 -> 20,256 (-37.6%)
  • ML-DSA-65: 44,768 -> 27,456 (-38.7%)
  • ML-DSA-87: 59,104 -> 35,648 (-39.7%)
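The percentages can be recomputed from the byte counts with a small helper. The function names below are made up for this sketch; only the before and after figures come from the list above.

```c
#include <assert.h>

/* Absolute saving in bytes between two allocation figures. */
static unsigned bytes_saved(unsigned before, unsigned after)
{
  assert(after <= before);
  return before - after;
}

/* Relative saving as a percentage of the original allocation. */
static double reduction_percent(unsigned before, unsigned after)
{
  assert(before > 0);
  return 100.0 * (double)bytes_saved(before, after) / (double)before;
}
```

For example, `reduction_percent(44768, 27456)` evaluates to roughly 38.67, and `bytes_saved(32448, 20256)` is 12,192 bytes for ML-DSA-44.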

TODO:

  • CBMC proofs for non-reduced-RAM mode
  • Avoid double unpacking of s1/s2/t0 in pk_from_sk

Introduce mld_s1vec, following the same pattern as mld_polymat for
reduced RAM usage. In normal mode, it stores the full NTT'd polyvecl.
In REDUCE_RAM mode, it stores a pointer to the packed s1 data in the
secret key and unpacks + NTTs individual polynomials on demand.

This reduces signing memory in REDUCE_RAM mode:
- ML-DSA-44: 32,448 -> 28,384 (-4,064 bytes)
- ML-DSA-65: 44,768 -> 39,680 (-5,088 bytes)
- ML-DSA-87: 59,104 -> 51,968 (-7,136 bytes)

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 34362 cycles 34508 cycles 1.00
ML-DSA-44 sign 120129 cycles 119762 cycles 1.00
ML-DSA-44 verify 38068 cycles 38106 cycles 1.00
ML-DSA-65 keypair 61396 cycles 61327 cycles 1.00
ML-DSA-65 sign 201975 cycles 202109 cycles 1.00
ML-DSA-65 verify 62883 cycles 62771 cycles 1.00
ML-DSA-87 keypair 93605 cycles 94593 cycles 0.99
ML-DSA-87 sign 238491 cycles 240827 cycles 0.99
ML-DSA-87 verify 95314 cycles 96019 cycles 0.99

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 94456 cycles 93753 cycles 1.01
ML-DSA-44 sign 332649 cycles 333304 cycles 1.00
ML-DSA-44 verify 99711 cycles 99738 cycles 1.00
ML-DSA-65 keypair 160279 cycles 159678 cycles 1.00
ML-DSA-65 sign 543374 cycles 544024 cycles 1.00
ML-DSA-65 verify 161092 cycles 160787 cycles 1.00
ML-DSA-87 keypair 267434 cycles 267177 cycles 1.00
ML-DSA-87 sign 709085 cycles 705890 cycles 1.00
ML-DSA-87 verify 270229 cycles 270246 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 113162 cycles 113125 cycles 1.00
ML-DSA-44 sign 357353 cycles 355404 cycles 1.01
ML-DSA-44 verify 117895 cycles 117806 cycles 1.00
ML-DSA-65 keypair 196212 cycles 196440 cycles 1.00
ML-DSA-65 sign 592734 cycles 588870 cycles 1.01
ML-DSA-65 verify 194625 cycles 194523 cycles 1.00
ML-DSA-87 keypair 322467 cycles 322254 cycles 1.00
ML-DSA-87 sign 756570 cycles 752961 cycles 1.00
ML-DSA-87 verify 320130 cycles 320091 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 69194 cycles 69272 cycles 1.00
ML-DSA-44 sign 188252 cycles 188132 cycles 1.00
ML-DSA-44 verify 68906 cycles 69431 cycles 0.99
ML-DSA-65 keypair 119088 cycles 119537 cycles 1.00
ML-DSA-65 sign 302886 cycles 300738 cycles 1.01
ML-DSA-65 verify 115778 cycles 115521 cycles 1.00
ML-DSA-87 keypair 203113 cycles 204457 cycles 0.99
ML-DSA-87 sign 408515 cycles 395562 cycles 1.03
ML-DSA-87 verify 195375 cycles 196251 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 68176 cycles 68092 cycles 1.00
ML-DSA-44 sign 203637 cycles 202357 cycles 1.01
ML-DSA-44 verify 70924 cycles 70840 cycles 1.00
ML-DSA-65 keypair 121292 cycles 120892 cycles 1.00
ML-DSA-65 sign 334250 cycles 332262 cycles 1.01
ML-DSA-65 verify 118051 cycles 117993 cycles 1.00
ML-DSA-87 keypair 198088 cycles 198285 cycles 1.00
ML-DSA-87 sign 431278 cycles 428165 cycles 1.01
ML-DSA-87 verify 194701 cycles 194638 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 56406 cycles 56810 cycles 0.99
ML-DSA-44 sign 181106 cycles 181256 cycles 1.00
ML-DSA-44 verify 60974 cycles 61127 cycles 1.00
ML-DSA-65 keypair 99070 cycles 98683 cycles 1.00
ML-DSA-65 sign 301150 cycles 298776 cycles 1.01
ML-DSA-65 verify 100716 cycles 100109 cycles 1.01
ML-DSA-87 keypair 152856 cycles 152672 cycles 1.00
ML-DSA-87 sign 358498 cycles 355205 cycles 1.01
ML-DSA-87 verify 153513 cycles 153314 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 134975 cycles 135154 cycles 1.00
ML-DSA-44 sign 527071 cycles 524730 cycles 1.00
ML-DSA-44 verify 147865 cycles 147590 cycles 1.00
ML-DSA-65 keypair 228558 cycles 228675 cycles 1.00
ML-DSA-65 sign 865183 cycles 866364 cycles 1.00
ML-DSA-65 verify 236295 cycles 236755 cycles 1.00
ML-DSA-87 keypair 371936 cycles 372434 cycles 1.00
ML-DSA-87 sign 1079730 cycles 1081953 cycles 1.00
ML-DSA-87 verify 383598 cycles 383807 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 40542 cycles 41153 cycles 0.99
ML-DSA-44 sign 132746 cycles 132931 cycles 1.00
ML-DSA-44 verify 43385 cycles 43836 cycles 0.99
ML-DSA-65 keypair 71975 cycles 72244 cycles 1.00
ML-DSA-65 sign 214826 cycles 214745 cycles 1.00
ML-DSA-65 verify 72365 cycles 73096 cycles 0.99
ML-DSA-87 keypair 108760 cycles 108337 cycles 1.00
ML-DSA-87 sign 254141 cycles 253357 cycles 1.00
ML-DSA-87 verify 111590 cycles 110812 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 128366 cycles 128232 cycles 1.00
ML-DSA-44 sign 448227 cycles 447685 cycles 1.00
ML-DSA-44 verify 138227 cycles 144647 cycles 0.96
ML-DSA-65 keypair 220728 cycles 220666 cycles 1.00
ML-DSA-65 sign 728968 cycles 727390 cycles 1.00
ML-DSA-65 verify 222560 cycles 223179 cycles 1.00
ML-DSA-87 keypair 364646 cycles 365048 cycles 1.00
ML-DSA-87 sign 926537 cycles 925897 cycles 1.00
ML-DSA-87 verify 372923 cycles 372806 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 212388 cycles 212677 cycles 1.00
ML-DSA-44 sign 761595 cycles 759475 cycles 1.00
ML-DSA-44 verify 228709 cycles 228953 cycles 1.00
ML-DSA-65 keypair 379903 cycles 380253 cycles 1.00
ML-DSA-65 sign 1257270 cycles 1251269 cycles 1.00
ML-DSA-65 verify 371502 cycles 372050 cycles 1.00
ML-DSA-87 keypair 604456 cycles 605509 cycles 1.00
ML-DSA-87 sign 1598919 cycles 1591320 cycles 1.00
ML-DSA-87 verify 618280 cycles 617579 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 72371 cycles 72262 cycles 1.00
ML-DSA-44 sign 213687 cycles 212358 cycles 1.01
ML-DSA-44 verify 75757 cycles 75722 cycles 1.00
ML-DSA-65 keypair 127613 cycles 127611 cycles 1.00
ML-DSA-65 sign 353316 cycles 350840 cycles 1.01
ML-DSA-65 verify 125589 cycles 125699 cycles 1.00
ML-DSA-87 keypair 205884 cycles 208501 cycles 0.99
ML-DSA-87 sign 447609 cycles 450025 cycles 0.99
ML-DSA-87 verify 205683 cycles 205765 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 157318 cycles 157591 cycles 1.00
ML-DSA-44 sign 549881 cycles 551560 cycles 1.00
ML-DSA-44 verify 169319 cycles 169402 cycles 1.00
ML-DSA-65 keypair 268418 cycles 267815 cycles 1.00
ML-DSA-65 sign 906561 cycles 904542 cycles 1.00
ML-DSA-65 verify 274795 cycles 274303 cycles 1.00
ML-DSA-87 keypair 447731 cycles 448249 cycles 1.00
ML-DSA-87 sign 1161252 cycles 1156908 cycles 1.00
ML-DSA-87 verify 457349 cycles 458389 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 120710 cycles 120615 cycles 1.00
ML-DSA-44 sign 448467 cycles 447589 cycles 1.00
ML-DSA-44 verify 130598 cycles 130296 cycles 1.00
ML-DSA-65 keypair 204115 cycles 204314 cycles 1.00
ML-DSA-65 sign 729459 cycles 728144 cycles 1.00
ML-DSA-65 verify 210276 cycles 210151 cycles 1.00
ML-DSA-87 keypair 337081 cycles 338739 cycles 1.00
ML-DSA-87 sign 927755 cycles 924086 cycles 1.00
ML-DSA-87 verify 347169 cycles 347015 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 138670 cycles 138488 cycles 1.00
ML-DSA-44 sign 484824 cycles 483902 cycles 1.00
ML-DSA-44 verify 148462 cycles 162298 cycles 0.91
ML-DSA-65 keypair 241326 cycles 241720 cycles 1.00
ML-DSA-65 sign 794542 cycles 792693 cycles 1.00
ML-DSA-65 verify 240735 cycles 241300 cycles 1.00
ML-DSA-87 keypair 395465 cycles 396574 cycles 1.00
ML-DSA-87 sign 1016682 cycles 1012397 cycles 1.00
ML-DSA-87 verify 402879 cycles 402619 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 113678 cycles 113486 cycles 1.00
ML-DSA-44 sign 358177 cycles 355929 cycles 1.01
ML-DSA-44 verify 118427 cycles 118313 cycles 1.00
ML-DSA-65 keypair 196712 cycles 196525 cycles 1.00
ML-DSA-65 sign 592715 cycles 588739 cycles 1.01
ML-DSA-65 verify 194904 cycles 194868 cycles 1.00
ML-DSA-87 keypair 322556 cycles 323107 cycles 1.00
ML-DSA-87 sign 757645 cycles 753767 cycles 1.01
ML-DSA-87 verify 320481 cycles 320405 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

Introduce mld_s2vec, following the same pattern as mld_s1vec: in normal
mode it stores the full NTT'd polyveck; in REDUCE_RAM mode it stores a
pointer to the packed s2 data and unpacks + NTTs individual polynomials
on demand.

REDUCE_RAM signing memory reduction:
- ML-DSA-44: 28,384 -> 24,320 (-4,064 bytes)
- ML-DSA-65: 39,680 -> 33,568 (-6,112 bytes)
- ML-DSA-87: 51,968 -> 43,808 (-8,160 bytes)

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-44 keypair 213320 cycles 212744 cycles 1.00
ML-DSA-44 sign 762194 cycles 760342 cycles 1.00
ML-DSA-44 verify 241436 cycles 234472 cycles 1.03
ML-DSA-65 keypair 380938 cycles 380565 cycles 1.00
ML-DSA-65 sign 1259322 cycles 1254252 cycles 1.00
ML-DSA-65 verify 372512 cycles 372074 cycles 1.00
ML-DSA-87 keypair 606252 cycles 604302 cycles 1.00
ML-DSA-87 sign 1597536 cycles 1594512 cycles 1.00
ML-DSA-87 verify 618242 cycles 618492 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Graviton2 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: c0ac5e8 Previous: bb07ee8 Ratio
ML-DSA-44 verify 241913 cycles 234472 cycles 1.03

This comment was automatically generated by workflow using github-action-benchmark.
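A short sketch of the check behind such an alert, assuming the ratio is current cycles divided by previous cycles and that the alert fires only when the ratio strictly exceeds the threshold. That reading is an assumption about the benchmark workflow, and the helper names are hypothetical.

```c
#include <assert.h>

/* Ratio of the current measurement to the previous one (lower is better). */
static double bench_ratio(unsigned long current, unsigned long previous)
{
  assert(previous > 0);
  return (double)current / (double)previous;
}

/* Nonzero when the slowdown strictly exceeds the alert threshold. */
static int exceeds_threshold(unsigned long current, unsigned long previous,
                             double threshold)
{
  return bench_ratio(current, previous) > threshold;
}
```

Under that assumption, 241913 / 234472 is about 1.0317 and trips a 1.03 threshold, while 241436 / 234472 is about 1.0297, which is displayed as 1.03 after rounding to two decimals but stays below the threshold.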

oqs-bot commented Mar 22, 2026

CBMC Results (ML-DSA-44)

⚠️ Attention Required

Proof Status Current Previous Change
sign_verify_internal ⚠️ 218s 126s +73%
Full Results (178 proofs)
Proof Status Current Previous Change
**TOTAL** 2134s 1998s +6.8%
polyvecl_pointwise_acc_montgomery_c 222s 208s +7%
sign_verify_internal ⚠️ 218s 126s +73%
mld_attempt_signature_generation 207s 231s -10%
poly_pointwise_montgomery_c 162s 152s +7%
rej_uniform_native 145s 146s -1%
mld_invntt_layer 90s 88s +2%
mld_ct_memcmp 72s 77s -6%
mld_ntt_layer 55s 59s -7%
keccak_squeezeblocks_x4 43s 42s +2%
polyvec_matrix_expand 28s 28s +0%
rej_uniform 23s 22s +5%
polymat_permute_bitrev_to_custom 22s 15s +47%
poly_chknorm_c 21s 19s +11%
fqmul 20s 20s +0%
sign_signature_internal 20s 31s -35%
poly_uniform_eta_4x 19s 17s +12%
polyeta_unpack 18s 16s +12%
mld_compute_t0_t1_tr_from_sk_components 15s 14s +7%
rej_uniform_c 15s 14s +7%
mld_ntt_butterfly_block 14s 12s +17%
poly_add 14s 9s +56%
polyt0_unpack 14s 14s +0%
polyz_unpack_c 14s 11s +27%
keccakf1600x4_permute_native 13s 13s +0%
mld_check_pct 13s 5s +160%
poly_uniform_4x 13s 14s -7%
polyveck_power2round 13s 12s +8%
polyvec_matrix_expand_serial 12s 13s -8%
polyvec_matrix_pointwise_montgomery 12s 13s -8%
sign_pk_from_sk 12s 7s +71%
keccak_absorb_once_x4 11s 10s +10%
keccakf1600_permute 10s 7s +43%
mld_h 9s 6s +50%
polyveck_add 9s 9s +0%
keccak_absorb 8s 9s -11%
keccakf1600_permute_native 8s 7s +14%
mld_polyvecl_permute_bitrev_to_custom_native 8s 7s +14%
mld_compute_pack_z 7s 7s +0%
polyveck_ntt 7s 6s +17%
polyveck_pointwise_poly_montgomery_t0 7s - new
polyveck_use_hint 7s 9s -22%
sign_signature 7s 5s +40%
poly_invntt_tomont 6s 3s +100%
poly_invntt_tomont_c 6s 5s +20%
poly_sub 6s 3s +100%
poly_uniform 6s 3s +100%
poly_uniform_gamma1_4x 6s 3s +100%
polyveck_decompose 6s 5s +20%
polyveck_make_hint 6s 4s +50%
polyveck_shiftl 6s 4s +50%
polyvecl_unpack_z 6s 4s +50%
rej_eta_native 6s 5s +20%
sign 6s 7s -14%
sign_open 6s 5s +20%
unpack_sk 6s 5s +20%
intt_native_x86_64 5s 3s +67%
make_hint 5s 3s +67%
mld_sample_s1_s2 5s 2s +150%
poly_chknorm 5s 4s +25%
polyveck_caddq 5s 3s +67%
polyveck_reduce 5s 6s -17%
polyvecl_chknorm 5s 4s +25%
polyvecl_unpack_eta 5s 3s +67%
polyz_unpack 5s 2s +150%
shake128_squeeze 5s 3s +67%
sign_keypair_internal 5s 9s -44%
sign_verify_extmu 5s 5s +0%
sign_verify_pre_hash_internal 5s 4s +25%
sys_check_capability 5s 2s +150%
unpack_hints 5s 8s -38%
decompose 4s 4s +0%
fqscale 4s 4s +0%
keccakf1600_extract_bytes (big endian) 4s 2s +100%
keccakf1600x4_xor_bytes 4s 4s +0%
mld_ct_cmask_nonzero_u32 4s 4s +0%
mld_keccakf1600_extract_bytes 4s 3s +33%
mld_value_barrier_u8 4s 1s +300%
ntt_native_x86_64 4s 4s +0%
pack_pk 4s 3s +33%
poly_caddq_c 4s 4s +0%
poly_caddq_native_aarch64 4s 6s -33%
poly_challenge 4s 4s +0%
poly_decompose 4s 2s +100%
poly_decompose_c 4s 2s +100%
poly_ntt_native 4s 3s +33%
poly_pointwise_montgomery_native 4s 2s +100%
poly_power2round 4s 8s -50%
poly_reduce 4s 3s +33%
poly_use_hint_c 4s 5s -20%
polyeta_pack 4s 4s +0%
polyt1_unpack 4s 4s +0%
polyveck_chknorm 4s 4s +0%
polyveck_pack_w1 4s 6s -33%
polyveck_sub 4s 5s -20%
polyvecl_pack_eta 4s 2s +100%
polyvecl_uniform_gamma1 4s 3s +33%
polyvecl_uniform_gamma1_serial 4s 2s +100%
reduce32 4s 3s +33%
rej_eta_c 4s 5s -20%
shake128x4_squeezeblocks 4s 2s +100%
shake256 4s 3s +33%
shake256_absorb 4s 6s -33%
sign_keypair 4s 3s +33%
sign_verify 4s 7s -43%
unpack_sig 4s 4s +0%
caddq 3s 3s +0%
keccak_init 3s 2s +50%
keccak_squeeze 3s 3s +0%
keccakf1600_xor_bytes (big endian) 3s 3s +0%
keccakf1600x4_extract_bytes 3s 5s -40%
keccakf1600x4_permute 3s 3s +0%
mld_ct_abs_i32 3s 3s +0%
mld_ct_cmask_nonzero_u8 3s 2s +50%
mld_ct_get_optblocker_i64 3s 2s +50%
mld_ct_get_optblocker_u32 3s 3s +0%
mld_sample_s1_s2_serial 3s 3s +0%
mld_value_barrier_i64 3s 3s +0%
mld_value_barrier_u32 3s 2s +50%
montgomery_reduce 3s 3s +0%
pack_sig_c_h 3s 5s -40%
pack_sig_z 3s 3s +0%
poly_caddq 3s 3s +0%
poly_caddq_native 3s 3s +0%
poly_chknorm_native 3s 2s +50%
poly_chknorm_native_aarch64 3s 5s -40%
poly_decompose_native 3s 3s +0%
poly_invntt_tomont_native 3s 4s -25%
poly_ntt 3s 3s +0%
poly_pointwise_montgomery 3s 4s -25%
poly_shiftl 3s 2s +50%
poly_uniform_eta 3s 3s +0%
poly_uniform_gamma1 3s 2s +50%
polyt0_pack 3s 5s -40%
polyveck_pack_eta 3s 2s +50%
polyveck_pointwise_poly_montgomery 3s 2s +50%
polyveck_pointwise_poly_montgomery_s2 3s - new
polyvecl_ntt 3s 7s -57%
polyvecl_pointwise_acc_montgomery_native 3s 3s +0%
polyw1_pack 3s 5s -40%
polyz_pack 3s 5s -40%
polyz_unpack_native 3s 3s +0%
power2round 3s 3s +0%
rej_eta 3s 4s -25%
shake128_init 3s 2s +50%
shake128x4_absorb_once 3s 1s +200%
shake256_init 3s 1s +200%
shake256_release 3s 2s +50%
sign_signature_pre_hash_shake256 3s 3s +0%
sign_verify_pre_hash_shake256 3s 5s -40%
use_hint 3s 2s +50%
keccak_finalize 2s 4s -50%
keccakf1600_xor_bytes 2s 1s +100%
mld_ct_get_optblocker_u8 2s 3s -33%
mld_prepare_domain_separation_prefix 2s 7s -71%
pack_sk 2s 3s -33%
poly_make_hint 2s 4s -50%
poly_ntt_c 2s 4s -50%
poly_use_hint 2s 3s -33%
poly_use_hint_native 2s 3s -33%
polyt1_pack 2s 3s -33%
polyveck_invntt_tomont 2s 3s -33%
polyveck_pack_t0 2s 3s -33%
polyveck_unpack_eta 2s 2s +0%
polyvecl_permute_bitrev_to_custom 2s 4s -50%
polyvecl_pointwise_acc_montgomery 2s 2s +0%
shake128_absorb 2s 3s -33%
shake128_finalize 2s 2s +0%
shake256x4_absorb_once 2s 3s -33%
sign_signature_extmu 2s 5s -60%
sign_signature_pre_hash_internal 2s 3s -33%
unpack_pk 2s 3s -33%
mld_ct_cmask_neg_i32 1s 1s +0%
mld_ct_sel_int32 1s 2s -50%
polyveck_unpack_t0 1s 3s -67%
shake128_release 1s 2s -50%
shake256_finalize 1s 3s -67%
shake256_squeeze 1s 2s -50%
shake256x4_squeezeblocks 1s 2s -50%

Introduce mld_t0vec, following the same pattern as mld_s1vec and
mld_s2vec: in normal mode it stores the full NTT'd polyveck; in
REDUCE_RAM mode it stores a pointer to the packed t0 data and unpacks
+ NTTs individual polynomials on demand.

REDUCE_RAM signing memory reduction:
- ML-DSA-44: 24,320 -> 20,256 (-4,064 bytes)
- ML-DSA-65: 33,568 -> 27,456 (-6,112 bytes)
- ML-DSA-87: 43,808 -> 35,648 (-8,160 bytes)

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
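The per-commit savings reported in the three commit messages line up with the size of the vector that is no longer stored. The sketch below assumes each unpacked polynomial is 256 `int32_t` coefficients (1,024 bytes) and uses the FIPS 204 dimensions (k, l) = (4, 4), (6, 5), (8, 7), with s1 holding l polynomials and s2 and t0 holding k each. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MLDSA_N 256 /* coefficients per polynomial */

/* Size of one unpacked polynomial, assuming 32-bit coefficients. */
static unsigned poly_bytes(void)
{
  return (unsigned)(MLDSA_N * sizeof(int32_t));
}

/* Gap between the full vector size and the saving a commit reports. */
static unsigned residual_bytes(unsigned npolys, unsigned reported_saving)
{
  assert(npolys * poly_bytes() >= reported_saving);
  return npolys * poly_bytes() - reported_saving;
}
```

Under these assumptions every reported saving is the full vector size minus 32 bytes, for example `residual_bytes(4, 4064)` and `residual_bytes(8, 8160)` both give 32. The commit messages do not say what the remaining 32 bytes hold.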

oqs-bot commented Mar 22, 2026

CBMC Results (ML-DSA-65)

Full Results (178 proofs)
Proof Status Current Previous Change
**TOTAL** 2482s 2485s -0.1%
mld_attempt_signature_generation 287s 275s +4%
sign_verify_internal 274s 336s -18%
polyvecl_pointwise_acc_montgomery_c 196s 192s +2%
poly_pointwise_montgomery_c 168s 153s +10%
rej_uniform_native 148s 148s +0%
polyvec_matrix_expand 120s 126s -5%
mld_invntt_layer 96s 96s +0%
mld_ct_memcmp 80s 76s +5%
polyvec_matrix_expand_serial 68s 68s +0%
mld_ntt_layer 56s 55s +2%
keccak_squeezeblocks_x4 44s 43s +2%
mld_compute_t0_t1_tr_from_sk_components 24s 27s -11%
poly_chknorm_c 24s 21s +14%
sign_signature_internal 24s 39s -38%
rej_uniform 21s 24s -12%
fqmul 18s 21s -14%
polymat_permute_bitrev_to_custom 17s 30s -43%
rej_uniform_c 17s 13s +31%
poly_uniform_eta_4x 16s 16s +0%
keccakf1600x4_permute_native 15s 13s +15%
mld_ntt_butterfly_block 15s 12s +25%
poly_add 15s 12s +25%
poly_uniform_4x 15s 17s -12%
polyt0_unpack 14s 15s -7%
polyvec_matrix_pointwise_montgomery 14s 11s +27%
polyveck_decompose 14s 11s +27%
polyveck_shiftl 12s 7s +71%
keccak_absorb_once_x4 11s 11s +0%
polyveck_sub 11s 11s +0%
sign_pk_from_sk 11s 8s +38%
mld_polyvecl_permute_bitrev_to_custom_native 10s 7s +43%
poly_invntt_tomont_c 10s 6s +67%
polyveck_add 10s 9s +11%
polyveck_pointwise_poly_montgomery_t0 10s - new
polyveck_power2round 10s 11s -9%
polyvecl_chknorm 10s 11s -9%
keccakf1600_permute 9s 7s +29%
polyveck_ntt 9s 12s -25%
polyveck_reduce 9s 6s +50%
polyveck_use_hint 9s 8s +12%
keccakf1600_permute_native 8s 8s +0%
mld_check_pct 8s 8s +0%
poly_decompose_c 8s 9s -11%
unpack_sk 8s 5s +60%
keccak_absorb 7s 7s +0%
mld_compute_pack_z 7s 6s +17%
mld_sample_s1_s2_serial 7s 5s +40%
polyveck_caddq 7s 8s -12%
polyveck_invntt_tomont 7s 9s -22%
polyveck_pointwise_poly_montgomery 7s 7s +0%
sign_signature_pre_hash_shake256 7s 3s +133%
poly_caddq_c 6s 5s +20%
polyeta_unpack 6s 6s +0%
polyveck_unpack_eta 6s 3s +100%
polyvecl_ntt 6s 7s -14%
polyvecl_uniform_gamma1 6s 4s +50%
polyz_pack 6s 3s +100%
sign 6s 7s -14%
keccakf1600x4_xor_bytes 5s 2s +150%
mld_ct_get_optblocker_i64 5s 5s +0%
mld_prepare_domain_separation_prefix 5s 4s +25%
pack_sk 5s 3s +67%
poly_caddq 5s 3s +67%
poly_chknorm 5s 2s +150%
poly_chknorm_native_aarch64 5s 3s +67%
poly_power2round 5s 5s +0%
poly_use_hint_c 5s 4s +25%
polyt0_pack 5s 3s +67%
polyveck_pointwise_poly_montgomery_s2 5s - new
polyveck_unpack_t0 5s 6s -17%
polyvecl_pointwise_acc_montgomery_native 5s 4s +25%
polyz_unpack 5s 4s +25%
rej_eta 5s 3s +67%
rej_eta_c 5s 3s +67%
shake128_squeeze 5s 3s +67%
sign_keypair 5s 5s +0%
sign_keypair_internal 5s 5s +0%
sign_open 5s 5s +0%
sign_verify 5s 7s -29%
sign_verify_extmu 5s 4s +25%
keccak_init 4s 3s +33%
keccakf1600_xor_bytes 4s 3s +33%
keccakf1600x4_extract_bytes 4s 2s +100%
make_hint 4s 5s -20%
mld_ct_cmask_nonzero_u8 4s 2s +100%
mld_ct_get_optblocker_u32 4s 2s +100%
mld_ct_get_optblocker_u8 4s 5s -20%
mld_h 4s 3s +33%
mld_sample_s1_s2 4s 8s -50%
mld_value_barrier_i64 4s 3s +33%
ntt_native_x86_64 4s 5s -20%
pack_sig_c_h 4s 4s +0%
poly_challenge 4s 4s +0%
poly_chknorm_native 4s 4s +0%
poly_decompose_native 4s 3s +33%
poly_invntt_tomont 4s 4s +0%
poly_ntt_c 4s 3s +33%
poly_ntt_native 4s 2s +100%
poly_pointwise_montgomery 4s 7s -43%
poly_reduce 4s 4s +0%
polyeta_pack 4s 4s +0%
polyt1_unpack 4s 2s +100%
polyveck_chknorm 4s 3s +33%
polyveck_make_hint 4s 7s -43%
polyveck_pack_eta 4s 2s +100%
polyveck_pack_t0 4s 5s -20%
polyveck_pack_w1 4s 3s +33%
polyvecl_uniform_gamma1_serial 4s 4s +0%
polyvecl_unpack_z 4s 4s +0%
polyz_unpack_native 4s 3s +33%
rej_eta_native 4s 4s +0%
shake128_finalize 4s 3s +33%
shake128x4_squeezeblocks 4s 3s +33%
shake256_init 4s 2s +100%
shake256_release 4s 2s +100%
shake256x4_squeezeblocks 4s 2s +100%
sign_signature_extmu 4s 4s +0%
sign_verify_pre_hash_shake256 4s 2s +100%
unpack_hints 4s 5s -20%
use_hint 4s 3s +33%
decompose 3s 2s +50%
intt_native_x86_64 3s 4s -25%
keccak_finalize 3s 2s +50%
keccak_squeeze 3s 4s -25%
keccakf1600_extract_bytes (big endian) 3s 2s +50%
mld_keccakf1600_extract_bytes 3s 2s +50%
montgomery_reduce 3s 2s +50%
pack_pk 3s 5s -40%
pack_sig_z 3s 4s -25%
poly_caddq_native 3s 2s +50%
poly_caddq_native_aarch64 3s 5s -40%
poly_invntt_tomont_native 3s 3s +0%
poly_pointwise_montgomery_native 3s 4s -25%
poly_shiftl 3s 4s -25%
poly_uniform 3s 4s -25%
poly_uniform_eta 3s 5s -40%
poly_uniform_gamma1 3s 4s -25%
poly_uniform_gamma1_4x 3s 7s -57%
poly_use_hint_native 3s 5s -40%
polyt1_pack 3s 2s +50%
polyvecl_permute_bitrev_to_custom 3s 3s +0%
polyvecl_pointwise_acc_montgomery 3s 3s +0%
polyvecl_unpack_eta 3s 1s +200%
polyw1_pack 3s 3s +0%
polyz_unpack_c 3s 3s +0%
shake128_init 3s 2s +50%
shake128_release 3s 5s -40%
shake128x4_absorb_once 3s 2s +50%
shake256_squeeze 3s 3s +0%
shake256x4_absorb_once 3s 3s +0%
sign_signature 3s 3s +0%
sign_signature_pre_hash_internal 3s 6s -50%
sign_verify_pre_hash_internal 3s 4s -25%
sys_check_capability 3s 2s +50%
unpack_pk 3s 4s -25%
caddq 2s 4s -50%
fqscale 2s 2s +0%
keccakf1600_xor_bytes (big endian) 2s 4s -50%
mld_ct_cmask_neg_i32 2s 1s +100%
mld_ct_cmask_nonzero_u32 2s 3s -33%
mld_ct_sel_int32 2s 3s -33%
mld_value_barrier_u32 2s 2s +0%
mld_value_barrier_u8 2s 2s +0%
poly_decompose 2s 2s +0%
poly_make_hint 2s 5s -60%
poly_ntt 2s 2s +0%
poly_sub 2s 3s -33%
poly_use_hint 2s 3s -33%
polyvecl_pack_eta 2s 4s -50%
power2round 2s 3s -33%
shake128_absorb 2s 1s +100%
shake256 2s 3s -33%
shake256_absorb 2s 1s +100%
shake256_finalize 2s 5s -60%
unpack_sig 2s 4s -50%
keccakf1600x4_permute 1s 3s -67%
mld_ct_abs_i32 1s 2s -50%
reduce32 1s 4s -75%

oqs-bot commented Mar 22, 2026

CBMC Results (ML-DSA-87)

Full Results (178 proofs)
Proof Status Current Previous Change
**TOTAL** 2589s 2658s -2.6%
polyvecl_pointwise_acc_montgomery_c 282s 276s +2%
mld_attempt_signature_generation 261s 237s +10%
sign_verify_internal 251s 330s -24%
polyvec_matrix_expand 161s 177s -9%
poly_pointwise_montgomery_c 157s 153s +3%
rej_uniform_native 141s 140s +1%
mld_invntt_layer 98s 94s +4%
polyvec_matrix_expand_serial 81s 79s +3%
mld_ct_memcmp 77s 73s +5%
polyveck_decompose 57s 56s +2%
mld_ntt_layer 54s 52s +4%
sign_signature_internal 45s 56s -20%
keccak_squeezeblocks_x4 43s 42s +2%
polymat_permute_bitrev_to_custom 27s 45s -40%
mld_compute_t0_t1_tr_from_sk_components 26s 25s +4%
poly_chknorm_c 20s 18s +11%
rej_uniform 20s 21s -5%
fqmul 18s 18s +0%
poly_uniform_4x 18s 14s +29%
poly_uniform_eta_4x 16s 16s +0%
polyeta_unpack 16s 18s -11%
keccakf1600x4_permute_native 15s 15s +0%
polyt0_unpack 14s 14s +0%
rej_uniform_c 14s 12s +17%
sign_pk_from_sk 14s 6s +133%
mld_ntt_butterfly_block 13s 12s +8%
polyvec_matrix_pointwise_montgomery 13s 12s +8%
mld_polyvecl_permute_bitrev_to_custom_native 12s 13s -8%
mld_check_pct 11s 9s +22%
poly_add 11s 13s -15%
polyveck_add 11s 9s +22%
polyveck_use_hint 11s 13s -15%
polyvecl_ntt 11s 9s +22%
polyveck_reduce 10s 10s +0%
keccakf1600_permute_native 9s 9s +0%
polyveck_invntt_tomont 9s 7s +29%
polyveck_shiftl 9s 7s +29%
unpack_sk 9s 5s +80%
keccak_absorb_once_x4 8s 10s -20%
keccakf1600_permute 8s 7s +14%
poly_decompose_c 8s 7s +14%
poly_uniform_gamma1_4x 8s 4s +100%
polyveck_caddq 8s 9s -11%
polyveck_power2round 8s 8s +0%
polyz_unpack_c 8s 7s +14%
keccak_absorb 7s 7s +0%
mld_compute_pack_z 7s 8s -12%
mld_sample_s1_s2 7s 7s +0%
poly_caddq_c 7s 7s +0%
polyveck_ntt 7s 7s +0%
polyveck_pointwise_poly_montgomery_t0 7s - new
polyveck_sub 7s 6s +17%
sign_keypair_internal 7s 7s +0%
mld_sample_s1_s2_serial 6s 9s -33%
poly_chknorm_native_aarch64 6s 2s +200%
poly_invntt_tomont_c 6s 8s -25%
poly_uniform_eta 6s 5s +20%
polyveck_pointwise_poly_montgomery 6s 7s -14%
polyveck_pointwise_poly_montgomery_s2 6s - new
polyw1_pack 6s 4s +50%
intt_native_x86_64 5s 4s +25%
pack_sig_z 5s 6s -17%
pack_sk 5s 3s +67%
poly_caddq_native_aarch64 5s 2s +150%
poly_challenge 5s 4s +25%
poly_invntt_tomont 5s 2s +150%
poly_invntt_tomont_native 5s 4s +25%
poly_power2round 5s 5s +0%
polyveck_make_hint 5s 7s -29%
polyveck_unpack_eta 5s 5s +0%
polyz_unpack_native 5s 6s -17%
reduce32 5s 3s +67%
sign 5s 7s -29%
sign_signature_pre_hash_shake256 5s 6s -17%
sign_verify_pre_hash_internal 5s 5s +0%
caddq 4s 4s +0%
decompose 4s 2s +100%
keccak_finalize 4s 2s +100%
mld_ct_cmask_nonzero_u8 4s 5s -20%
mld_h 4s 4s +0%
mld_prepare_domain_separation_prefix 4s 5s -20%
mld_value_barrier_u8 4s 1s +300%
poly_chknorm_native 4s 2s +100%
poly_ntt_c 4s 4s +0%
poly_ntt_native 4s 2s +100%
poly_pointwise_montgomery_native 4s 2s +100%
poly_shiftl 4s 3s +33%
poly_uniform_gamma1 4s 2s +100%
polyt0_pack 4s 4s +0%
polyveck_chknorm 4s 6s -33%
polyveck_pack_t0 4s 3s +33%
polyveck_unpack_t0 4s 6s -33%
polyvecl_pack_eta 4s 4s +0%
polyvecl_uniform_gamma1 4s 4s +0%
rej_eta_native 4s 4s +0%
shake128_absorb 4s 2s +100%
shake128_finalize 4s 3s +33%
shake256_init 4s 1s +300%
shake256x4_squeezeblocks 4s 7s -43%
sign_signature 4s 6s -33%
sign_verify 4s 3s +33%
sign_verify_extmu 4s 5s -20%
sys_check_capability 4s 4s +0%
unpack_hints 4s 5s -20%
unpack_pk 4s 3s +33%
keccak_init 3s 4s -25%
keccakf1600_xor_bytes (big endian) 3s 2s +50%
keccakf1600x4_extract_bytes 3s 2s +50%
keccakf1600x4_permute 3s 3s +0%
keccakf1600x4_xor_bytes 3s 3s +0%
make_hint 3s 3s +0%
mld_ct_abs_i32 3s 2s +50%
mld_ct_get_optblocker_i64 3s 3s +0%
mld_ct_get_optblocker_u32 3s 4s -25%
montgomery_reduce 3s 4s -25%
ntt_native_x86_64 3s 3s +0%
pack_pk 3s 2s +50%
poly_caddq 3s 2s +50%
poly_caddq_native 3s 2s +50%
poly_chknorm 3s 3s +0%
poly_pointwise_montgomery 3s 4s -25%
poly_sub 3s 5s -40%
poly_uniform 3s 4s -25%
poly_use_hint 3s 2s +50%
poly_use_hint_c 3s 3s +0%
poly_use_hint_native 3s 4s -25%
polyt1_unpack 3s 3s +0%
polyvecl_permute_bitrev_to_custom 3s 2s +50%
polyvecl_pointwise_acc_montgomery 3s 4s -25%
polyvecl_uniform_gamma1_serial 3s 4s -25%
polyvecl_unpack_eta 3s 4s -25%
polyvecl_unpack_z 3s 6s -50%
polyz_pack 3s 2s +50%
power2round 3s 2s +50%
rej_eta 3s 6s -50%
shake256_absorb 3s 2s +50%
shake256_release 3s 1s +200%
shake256x4_absorb_once 3s 2s +50%
sign_keypair 3s 3s +0%
sign_open 3s 5s -40%
sign_signature_extmu 3s 3s +0%
sign_signature_pre_hash_internal 3s 3s +0%
sign_verify_pre_hash_shake256 3s 6s -50%
use_hint 3s 3s +0%
fqscale 2s 5s -60%
keccak_squeeze 2s 2s +0%
keccakf1600_extract_bytes (big endian) 2s 2s +0%
mld_ct_cmask_neg_i32 2s 2s +0%
mld_ct_cmask_nonzero_u32 2s 5s -60%
mld_ct_get_optblocker_u8 2s 1s +100%
mld_ct_sel_int32 2s 1s +100%
mld_value_barrier_i64 2s 4s -50%
pack_sig_c_h 2s 3s -33%
poly_decompose 2s 3s -33%
poly_decompose_native 2s 2s +0%
poly_make_hint 2s 4s -50%
poly_ntt 2s 3s -33%
poly_reduce 2s 3s -33%
polyeta_pack 2s 3s -33%
polyt1_pack 2s 3s -33%
polyveck_pack_eta 2s 4s -50%
polyvecl_chknorm 2s 6s -67%
polyvecl_pointwise_acc_montgomery_native 2s 4s -50%
polyz_unpack 2s 4s -50%
rej_eta_c 2s 3s -33%
shake256 2s 2s +0%
shake256_finalize 2s 3s -33%
unpack_sig 2s 5s -60%
keccakf1600_xor_bytes 1s 4s -75%
mld_keccakf1600_extract_bytes 1s 2s -50%
mld_value_barrier_u32 1s 4s -75%
polyveck_pack_w1 1s 2s -50%
shake128_init 1s 2s -50%
shake128_release 1s 3s -67%
shake128_squeeze 1s 2s -50%
shake128x4_absorb_once 1s 4s -75%
shake128x4_squeezeblocks 1s 3s -67%
shake256_squeeze 1s 2s -50%

Comment on lines +1468 to +1482
  /* Unpack s1 again in raw form for norm check and recomputation.
   * TODO: avoid this double unpacking */
  mld_polyvecl_unpack_eta(s1_raw, sk + 2 * MLDSA_SEEDBYTES + MLDSA_TRBYTES);

  /* Unpack s2 again in raw form for norm check and recomputation.
   * TODO: avoid this double unpacking */
  mld_polyveck_unpack_eta(s2_raw, sk + 2 * MLDSA_SEEDBYTES + MLDSA_TRBYTES +
                                      MLDSA_L * MLDSA_POLYETA_PACKEDBYTES);

  /* Unpack t0 again in raw form for validation.
   * TODO: avoid this double unpacking */
  mld_polyveck_unpack_t0(t0_raw, sk + 2 * MLDSA_SEEDBYTES + MLDSA_TRBYTES +
                                     MLDSA_L * MLDSA_POLYETA_PACKEDBYTES +
                                     MLDSA_K * MLDSA_POLYETA_PACKEDBYTES);
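The pointer arithmetic in the snippet above follows the secret-key layout: two seeds, tr, then packed s1, s2, and t0. As a rough worked example, the offsets can be computed from the FIPS 204 sizes (32-byte seeds, 64-byte tr, 96 or 128 bytes per eta-packed polynomial, 416 bytes per packed t0 polynomial). The struct and function names here are local to this sketch and not taken from the code base.

```c
#include <assert.h>
#include <stddef.h>

#define SEEDBYTES 32
#define TRBYTES 64
#define POLYT0_PACKEDBYTES 416

/* Parameter-set-dependent values (hypothetical container type). */
typedef struct { unsigned k, l, polyeta_packedbytes; } params;

/* Byte offsets of the packed vectors inside the secret key. */
typedef struct { size_t s1, s2, t0, total; } sk_layout;

static sk_layout sk_offsets(params p)
{
  sk_layout o;
  assert(p.k > 0 && p.l > 0);
  o.s1 = 2 * SEEDBYTES + TRBYTES;                    /* after rho, key, tr */
  o.s2 = o.s1 + (size_t)p.l * p.polyeta_packedbytes; /* after l polys of s1 */
  o.t0 = o.s2 + (size_t)p.k * p.polyeta_packedbytes; /* after k polys of s2 */
  o.total = o.t0 + (size_t)p.k * POLYT0_PACKEDBYTES; /* end of secret key */
  return o;
}
```

For ML-DSA-44 (k = l = 4, 96 bytes per eta-packed polynomial) this gives offsets 128, 512, and 896 and a total of 2,560 bytes, which matches the standard secret-key size.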

@mkannwischer (Contributor, Author) commented:

pk_from_sk gets quite a bit uglier here, because we need to do the bounds check between unpacking and the NTT.
I can't think of a good way to avoid this right now. Ideas?
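The ordering constraint can be illustrated with a per-polynomial helper that unpacks, checks the raw coefficients, and only then transforms. This is one possible shape sketched with toy stand-ins, not a proposal taken from the PR: the 4-bit packed format, the scaling "transform", and every name below are hypothetical, and the real routines would have to meet the project's constant-time requirements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 256
#define ETA 2
#define TOY_PACKED_BYTES (N / 2) /* toy format: 4 bits per coefficient */

typedef struct { int32_t coeffs[N]; } poly;

/* Toy unpack: nibble t maps to ETA - t, so t > 2 * ETA is out of range. */
static void toy_unpack_eta(poly *p, const uint8_t *in)
{
  for (size_t i = 0; i < N / 2; i++)
  {
    p->coeffs[2 * i] = ETA - (int32_t)(in[i] & 0x0F);
    p->coeffs[2 * i + 1] = ETA - (int32_t)(in[i] >> 4);
  }
}

/* Returns 1 if some |coefficient| exceeds bound, else 0. Illustration only. */
static int toy_chknorm(const poly *p, int32_t bound)
{
  int bad = 0;
  for (size_t i = 0; i < N; i++)
  {
    int32_t c = p->coeffs[i];
    bad |= (c > bound) | (c < -bound);
  }
  return bad;
}

/* Placeholder for the NTT: scales coefficients so they are no longer small. */
static void toy_transform(poly *p)
{
  for (size_t i = 0; i < N; i++) p->coeffs[i] *= 1000;
}

/* Unpack one polynomial, validate it while still raw, then transform it.
 * Returns 0 on success and -1 if the raw coefficients are out of range. */
static int unpack_check_transform(poly *out, const uint8_t *packed)
{
  assert(out != NULL && packed != NULL);
  toy_unpack_eta(out, packed);
  if (toy_chknorm(out, ETA)) return -1;
  toy_transform(out);
  return 0;
}
```

The check has to sit between the two steps because the bound applies to the raw coefficients; once the polynomial has been transformed, the same norm check no longer says anything about the packed input.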

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 3rd gen (c6a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: 881c2e4 Previous: bb07ee8 Ratio
ML-DSA-87 sign 408515 cycles 395562 cycles 1.03

This comment was automatically generated by workflow using github-action-benchmark.

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
@mkannwischer mkannwischer force-pushed the sign-recompute-s1s2t0 branch from 269c1ee to b3ee120 on March 22, 2026 09:08
Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
@mkannwischer

@gilles-peskine-arm @waleed-elmelegy-arm, this PR may be of interest to you.

@mkannwischer mkannwischer changed the title from "Sign Memory: Unpack s1, s2, and t0 on the fly in REDUCE_RAM mode" to "Sign: Unpack s1, s2, and t0 on the fly in REDUCE_RAM mode" on Mar 22, 2026
@mkannwischer mkannwischer marked this pull request as ready for review March 22, 2026 10:15
@mkannwischer mkannwischer requested a review from a team as a code owner March 22, 2026 10:15