Sign: Unpack s1, s2, and t0 on the fly in REDUCE_RAM mode #1002
mkannwischer wants to merge 8 commits into main
Introduce mld_s1vec, following the same pattern as mld_polymat for reduced RAM usage. In normal mode, it stores the full NTT'd polyvecl. In REDUCE_RAM mode, it stores a pointer to the packed s1 data in the secret key and unpacks + NTTs individual polynomials on demand.

This reduces signing memory in REDUCE_RAM mode:
- ML-DSA-44: 32,448 -> 28,384 (-4,064 bytes)
- ML-DSA-65: 44,768 -> 39,680 (-5,088 bytes)
- ML-DSA-87: 59,104 -> 51,968 (-7,136 bytes)

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
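The pattern described in this commit message can be illustrated with a minimal sketch. Everything here is hypothetical: the `toy_` names, the tiny parameters, the 4-bit packing, and the placeholder transform are stand-ins chosen for illustration and are not the PR's actual types or functions. The only point is the shape: one handle type, one accessor, and a compile-time switch that decides whether the handle owns the expanded vector or just borrows a pointer into the packed key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy parameters, far smaller than real ML-DSA ones. */
#define TOY_N 8           /* coefficients per polynomial */
#define TOY_L 3           /* polynomials per vector */
#define TOY_PACKEDBYTES 4 /* two 4-bit coefficients per byte */
#define TOY_ETA 2

typedef struct
{
  int32_t coeffs[TOY_N];
} toy_poly;

/* Placeholder for the forward NTT: doubling merely marks data as transformed. */
static void toy_transform(toy_poly *p)
{
  for (unsigned i = 0; i < TOY_N; i++)
  {
    p->coeffs[i] *= 2;
  }
}

/* Unpack one polynomial: each nibble t encodes the coefficient TOY_ETA - t. */
static void toy_unpack(toy_poly *p, const uint8_t *a)
{
  for (unsigned i = 0; i < TOY_N / 2; i++)
  {
    p->coeffs[2 * i] = TOY_ETA - (int32_t)(a[i] & 0x0F);
    p->coeffs[2 * i + 1] = TOY_ETA - (int32_t)(a[i] >> 4);
  }
}

typedef struct
{
#if defined(TOY_REDUCE_RAM)
  const uint8_t *packed; /* borrowed pointer into the packed secret key */
#else
  toy_poly vec[TOY_L]; /* fully unpacked and transformed vector */
#endif
} toy_s1vec;

static void toy_s1vec_init(toy_s1vec *s, const uint8_t *packed)
{
#if defined(TOY_REDUCE_RAM)
  s->packed = packed; /* defer all work to toy_s1vec_get */
#else
  for (unsigned i = 0; i < TOY_L; i++)
  {
    toy_unpack(&s->vec[i], packed + i * TOY_PACKEDBYTES);
    toy_transform(&s->vec[i]);
  }
#endif
}

/* Return the i-th transformed polynomial. In REDUCE_RAM mode the result lives
 * in the caller-provided scratch buffer and is recomputed on every call. */
static const toy_poly *toy_s1vec_get(const toy_s1vec *s, toy_poly *scratch,
                                     size_t i)
{
  assert(i < TOY_L);
#if defined(TOY_REDUCE_RAM)
  toy_unpack(scratch, s->packed + i * TOY_PACKEDBYTES);
  toy_transform(scratch);
  return scratch;
#else
  (void)scratch;
  return &s->vec[i];
#endif
}
```

Callers use `toy_s1vec_get` identically in both builds; compiling with `-DTOY_REDUCE_RAM` trades the stored vector for repeated unpack + transform work on each access.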
0e1dfbf to c0ac5e8
Intel Xeon 4th gen (c7i)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 34362 cycles | 34508 cycles | 1.00 |
| ML-DSA-44 sign | 120129 cycles | 119762 cycles | 1.00 |
| ML-DSA-44 verify | 38068 cycles | 38106 cycles | 1.00 |
| ML-DSA-65 keypair | 61396 cycles | 61327 cycles | 1.00 |
| ML-DSA-65 sign | 201975 cycles | 202109 cycles | 1.00 |
| ML-DSA-65 verify | 62883 cycles | 62771 cycles | 1.00 |
| ML-DSA-87 keypair | 93605 cycles | 94593 cycles | 0.99 |
| ML-DSA-87 sign | 238491 cycles | 240827 cycles | 0.99 |
| ML-DSA-87 verify | 95314 cycles | 96019 cycles | 0.99 |

This comment was automatically generated by workflow using github-action-benchmark.
Intel Xeon 4th gen (c7i) (no-opt)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 94456 cycles | 93753 cycles | 1.01 |
| ML-DSA-44 sign | 332649 cycles | 333304 cycles | 1.00 |
| ML-DSA-44 verify | 99711 cycles | 99738 cycles | 1.00 |
| ML-DSA-65 keypair | 160279 cycles | 159678 cycles | 1.00 |
| ML-DSA-65 sign | 543374 cycles | 544024 cycles | 1.00 |
| ML-DSA-65 verify | 161092 cycles | 160787 cycles | 1.00 |
| ML-DSA-87 keypair | 267434 cycles | 267177 cycles | 1.00 |
| ML-DSA-87 sign | 709085 cycles | 705890 cycles | 1.00 |
| ML-DSA-87 verify | 270229 cycles | 270246 cycles | 1.00 |

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 113162 cycles | 113125 cycles | 1.00 |
| ML-DSA-44 sign | 357353 cycles | 355404 cycles | 1.01 |
| ML-DSA-44 verify | 117895 cycles | 117806 cycles | 1.00 |
| ML-DSA-65 keypair | 196212 cycles | 196440 cycles | 1.00 |
| ML-DSA-65 sign | 592734 cycles | 588870 cycles | 1.01 |
| ML-DSA-65 verify | 194625 cycles | 194523 cycles | 1.00 |
| ML-DSA-87 keypair | 322467 cycles | 322254 cycles | 1.00 |
| ML-DSA-87 sign | 756570 cycles | 752961 cycles | 1.00 |
| ML-DSA-87 verify | 320130 cycles | 320091 cycles | 1.00 |

AMD EPYC 3rd gen (c6a)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 69194 cycles | 69272 cycles | 1.00 |
| ML-DSA-44 sign | 188252 cycles | 188132 cycles | 1.00 |
| ML-DSA-44 verify | 68906 cycles | 69431 cycles | 0.99 |
| ML-DSA-65 keypair | 119088 cycles | 119537 cycles | 1.00 |
| ML-DSA-65 sign | 302886 cycles | 300738 cycles | 1.01 |
| ML-DSA-65 verify | 115778 cycles | 115521 cycles | 1.00 |
| ML-DSA-87 keypair | 203113 cycles | 204457 cycles | 0.99 |
| ML-DSA-87 sign | 408515 cycles | 395562 cycles | 1.03 |
| ML-DSA-87 verify | 195375 cycles | 196251 cycles | 1.00 |

Graviton4

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 68176 cycles | 68092 cycles | 1.00 |
| ML-DSA-44 sign | 203637 cycles | 202357 cycles | 1.01 |
| ML-DSA-44 verify | 70924 cycles | 70840 cycles | 1.00 |
| ML-DSA-65 keypair | 121292 cycles | 120892 cycles | 1.00 |
| ML-DSA-65 sign | 334250 cycles | 332262 cycles | 1.01 |
| ML-DSA-65 verify | 118051 cycles | 117993 cycles | 1.00 |
| ML-DSA-87 keypair | 198088 cycles | 198285 cycles | 1.00 |
| ML-DSA-87 sign | 431278 cycles | 428165 cycles | 1.01 |
| ML-DSA-87 verify | 194701 cycles | 194638 cycles | 1.00 |

Intel Xeon 3rd gen (c6i)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 56406 cycles | 56810 cycles | 0.99 |
| ML-DSA-44 sign | 181106 cycles | 181256 cycles | 1.00 |
| ML-DSA-44 verify | 60974 cycles | 61127 cycles | 1.00 |
| ML-DSA-65 keypair | 99070 cycles | 98683 cycles | 1.00 |
| ML-DSA-65 sign | 301150 cycles | 298776 cycles | 1.01 |
| ML-DSA-65 verify | 100716 cycles | 100109 cycles | 1.01 |
| ML-DSA-87 keypair | 152856 cycles | 152672 cycles | 1.00 |
| ML-DSA-87 sign | 358498 cycles | 355205 cycles | 1.01 |
| ML-DSA-87 verify | 153513 cycles | 153314 cycles | 1.00 |

AMD EPYC 3rd gen (c6a) (no-opt)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 134975 cycles | 135154 cycles | 1.00 |
| ML-DSA-44 sign | 527071 cycles | 524730 cycles | 1.00 |
| ML-DSA-44 verify | 147865 cycles | 147590 cycles | 1.00 |
| ML-DSA-65 keypair | 228558 cycles | 228675 cycles | 1.00 |
| ML-DSA-65 sign | 865183 cycles | 866364 cycles | 1.00 |
| ML-DSA-65 verify | 236295 cycles | 236755 cycles | 1.00 |
| ML-DSA-87 keypair | 371936 cycles | 372434 cycles | 1.00 |
| ML-DSA-87 sign | 1079730 cycles | 1081953 cycles | 1.00 |
| ML-DSA-87 verify | 383598 cycles | 383807 cycles | 1.00 |

AMD EPYC 4th gen (c7a)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 40542 cycles | 41153 cycles | 0.99 |
| ML-DSA-44 sign | 132746 cycles | 132931 cycles | 1.00 |
| ML-DSA-44 verify | 43385 cycles | 43836 cycles | 0.99 |
| ML-DSA-65 keypair | 71975 cycles | 72244 cycles | 1.00 |
| ML-DSA-65 sign | 214826 cycles | 214745 cycles | 1.00 |
| ML-DSA-65 verify | 72365 cycles | 73096 cycles | 0.99 |
| ML-DSA-87 keypair | 108760 cycles | 108337 cycles | 1.00 |
| ML-DSA-87 sign | 254141 cycles | 253357 cycles | 1.00 |
| ML-DSA-87 verify | 111590 cycles | 110812 cycles | 1.01 |

Graviton4 (no-opt)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 128366 cycles | 128232 cycles | 1.00 |
| ML-DSA-44 sign | 448227 cycles | 447685 cycles | 1.00 |
| ML-DSA-44 verify | 138227 cycles | 144647 cycles | 0.96 |
| ML-DSA-65 keypair | 220728 cycles | 220666 cycles | 1.00 |
| ML-DSA-65 sign | 728968 cycles | 727390 cycles | 1.00 |
| ML-DSA-65 verify | 222560 cycles | 223179 cycles | 1.00 |
| ML-DSA-87 keypair | 364646 cycles | 365048 cycles | 1.00 |
| ML-DSA-87 sign | 926537 cycles | 925897 cycles | 1.00 |
| ML-DSA-87 verify | 372923 cycles | 372806 cycles | 1.00 |

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 212388 cycles | 212677 cycles | 1.00 |
| ML-DSA-44 sign | 761595 cycles | 759475 cycles | 1.00 |
| ML-DSA-44 verify | 228709 cycles | 228953 cycles | 1.00 |
| ML-DSA-65 keypair | 379903 cycles | 380253 cycles | 1.00 |
| ML-DSA-65 sign | 1257270 cycles | 1251269 cycles | 1.00 |
| ML-DSA-65 verify | 371502 cycles | 372050 cycles | 1.00 |
| ML-DSA-87 keypair | 604456 cycles | 605509 cycles | 1.00 |
| ML-DSA-87 sign | 1598919 cycles | 1591320 cycles | 1.00 |
| ML-DSA-87 verify | 618280 cycles | 617579 cycles | 1.00 |

Graviton3

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 72371 cycles | 72262 cycles | 1.00 |
| ML-DSA-44 sign | 213687 cycles | 212358 cycles | 1.01 |
| ML-DSA-44 verify | 75757 cycles | 75722 cycles | 1.00 |
| ML-DSA-65 keypair | 127613 cycles | 127611 cycles | 1.00 |
| ML-DSA-65 sign | 353316 cycles | 350840 cycles | 1.01 |
| ML-DSA-65 verify | 125589 cycles | 125699 cycles | 1.00 |
| ML-DSA-87 keypair | 205884 cycles | 208501 cycles | 0.99 |
| ML-DSA-87 sign | 447609 cycles | 450025 cycles | 0.99 |
| ML-DSA-87 verify | 205683 cycles | 205765 cycles | 1.00 |

Intel Xeon 3rd gen (c6i) (no-opt)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 157318 cycles | 157591 cycles | 1.00 |
| ML-DSA-44 sign | 549881 cycles | 551560 cycles | 1.00 |
| ML-DSA-44 verify | 169319 cycles | 169402 cycles | 1.00 |
| ML-DSA-65 keypair | 268418 cycles | 267815 cycles | 1.00 |
| ML-DSA-65 sign | 906561 cycles | 904542 cycles | 1.00 |
| ML-DSA-65 verify | 274795 cycles | 274303 cycles | 1.00 |
| ML-DSA-87 keypair | 447731 cycles | 448249 cycles | 1.00 |
| ML-DSA-87 sign | 1161252 cycles | 1156908 cycles | 1.00 |
| ML-DSA-87 verify | 457349 cycles | 458389 cycles | 1.00 |

AMD EPYC 4th gen (c7a) (no-opt)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 120710 cycles | 120615 cycles | 1.00 |
| ML-DSA-44 sign | 448467 cycles | 447589 cycles | 1.00 |
| ML-DSA-44 verify | 130598 cycles | 130296 cycles | 1.00 |
| ML-DSA-65 keypair | 204115 cycles | 204314 cycles | 1.00 |
| ML-DSA-65 sign | 729459 cycles | 728144 cycles | 1.00 |
| ML-DSA-65 verify | 210276 cycles | 210151 cycles | 1.00 |
| ML-DSA-87 keypair | 337081 cycles | 338739 cycles | 1.00 |
| ML-DSA-87 sign | 927755 cycles | 924086 cycles | 1.00 |
| ML-DSA-87 verify | 347169 cycles | 347015 cycles | 1.00 |

Graviton3 (no-opt)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 138670 cycles | 138488 cycles | 1.00 |
| ML-DSA-44 sign | 484824 cycles | 483902 cycles | 1.00 |
| ML-DSA-44 verify | 148462 cycles | 162298 cycles | 0.91 |
| ML-DSA-65 keypair | 241326 cycles | 241720 cycles | 1.00 |
| ML-DSA-65 sign | 794542 cycles | 792693 cycles | 1.00 |
| ML-DSA-65 verify | 240735 cycles | 241300 cycles | 1.00 |
| ML-DSA-87 keypair | 395465 cycles | 396574 cycles | 1.00 |
| ML-DSA-87 sign | 1016682 cycles | 1012397 cycles | 1.00 |
| ML-DSA-87 verify | 402879 cycles | 402619 cycles | 1.00 |

Graviton2

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 113678 cycles | 113486 cycles | 1.00 |
| ML-DSA-44 sign | 358177 cycles | 355929 cycles | 1.01 |
| ML-DSA-44 verify | 118427 cycles | 118313 cycles | 1.00 |
| ML-DSA-65 keypair | 196712 cycles | 196525 cycles | 1.00 |
| ML-DSA-65 sign | 592715 cycles | 588739 cycles | 1.01 |
| ML-DSA-65 verify | 194904 cycles | 194868 cycles | 1.00 |
| ML-DSA-87 keypair | 322556 cycles | 323107 cycles | 1.00 |
| ML-DSA-87 sign | 757645 cycles | 753767 cycles | 1.01 |
| ML-DSA-87 verify | 320481 cycles | 320405 cycles | 1.00 |

Same pattern as mld_s1vec: in normal mode stores the full NTT'd polyveck, in REDUCE_RAM mode stores a pointer and unpacks + NTTs on demand.

REDUCE_RAM signing memory reduction:
- ML-DSA-44: 28,384 -> 24,320 (-4,064 bytes)
- ML-DSA-65: 39,680 -> 33,568 (-6,112 bytes)
- ML-DSA-87: 51,968 -> 43,808 (-8,160 bytes)

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
Graviton2 (no-opt)

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 keypair | 213320 cycles | 212744 cycles | 1.00 |
| ML-DSA-44 sign | 762194 cycles | 760342 cycles | 1.00 |
| ML-DSA-44 verify | 241436 cycles | 234472 cycles | 1.03 |
| ML-DSA-65 keypair | 380938 cycles | 380565 cycles | 1.00 |
| ML-DSA-65 sign | 1259322 cycles | 1254252 cycles | 1.00 |
| ML-DSA-65 verify | 372512 cycles | 372074 cycles | 1.00 |
| ML-DSA-87 keypair | 606252 cycles | 604302 cycles | 1.00 |
| ML-DSA-87 sign | 1597536 cycles | 1594512 cycles | 1.00 |
| ML-DSA-87 verify | 618242 cycles | 618492 cycles | 1.00 |

oqs-bot left a comment

⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Graviton2 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: c0ac5e8 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-44 verify | 241913 cycles | 234472 cycles | 1.03 |

CBMC Results (ML-DSA-44)
Full Results (178 proofs)
Same pattern as mld_s1vec and mld_s2vec: in normal mode stores the full NTT'd polyveck, in REDUCE_RAM mode stores a pointer and unpacks + NTTs on demand.

REDUCE_RAM signing memory reduction:
- ML-DSA-44: 24,320 -> 20,256 (-4,064 bytes)
- ML-DSA-65: 33,568 -> 27,456 (-6,112 bytes)
- ML-DSA-87: 43,808 -> 35,648 (-8,160 bytes)

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
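As a sanity check on the numbers in the commit messages, the sketch below compares each reported saving with the size of the vector it replaces. It assumes polynomials of 256 coefficients stored as `int32_t` (an assumption about the in-memory representation) and the FIPS 204 dimensions l = 4, 5, 7 and k = 4, 6, 8 for ML-DSA-44/65/87. The helper names are hypothetical. Under those assumptions every reported saving is exactly 32 bytes less than the full vector size; the commit messages do not say what the remaining 32 bytes are.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MLDSA_N 256 /* coefficients per polynomial */

/* Bytes needed for `len` unpacked polynomials, assuming int32_t coefficients. */
static size_t polyvec_bytes(size_t len)
{
  return len * MLDSA_N * sizeof(int32_t);
}

/* Bytes that remain after subtracting a reported saving from the full vector
 * size, i.e. whatever the reduced-RAM handle still costs. */
static size_t residual_bytes(size_t len, size_t reported_saving)
{
  assert(reported_saving <= polyvec_bytes(len));
  return polyvec_bytes(len) - reported_saving;
}
```

For example, `residual_bytes(4, 4064)` covers the ML-DSA-44 figure for s1 (l = 4), and `residual_bytes(8, 8160)` covers the ML-DSA-87 figure for s2 and t0 (k = 8).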
CBMC Results (ML-DSA-65)
Full Results (178 proofs)

CBMC Results (ML-DSA-87)
Full Results (178 proofs)
    /* Unpack s1 again in raw form for norm check and recomputation.
     * TODO: avoid this double unpacking */
    mld_polyvecl_unpack_eta(s1_raw, sk + 2 * MLDSA_SEEDBYTES + MLDSA_TRBYTES);

    /* Unpack s2 again in raw form for norm check and recomputation.
     * TODO: avoid this double unpacking */
    mld_polyveck_unpack_eta(s2_raw, sk + 2 * MLDSA_SEEDBYTES + MLDSA_TRBYTES +
                                        MLDSA_L * MLDSA_POLYETA_PACKEDBYTES);

    /* Unpack t0 again in raw form for validation.
     * TODO: avoid this double unpacking */
    mld_polyveck_unpack_t0(t0_raw, sk + 2 * MLDSA_SEEDBYTES + MLDSA_TRBYTES +
                                       MLDSA_L * MLDSA_POLYETA_PACKEDBYTES +
                                       MLDSA_K * MLDSA_POLYETA_PACKEDBYTES);

pk_from_sk gets quite a bit uglier here, because we need to do the bounds check between unpacking and NTTing.
I can't think of a good way right now. Ideas?
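Purely as a discussion sketch of the ordering constraint mentioned in this comment, the fragment below folds the bounds check into a single per-polynomial step: unpack, validate the raw coefficients, and only then transform. All names, the toy 4-bit packing, and the doubling "transform" are hypothetical stand-ins, not the library's functions, and the sketch ignores constant-time concerns. It also covers only the check ordering; the recomputation that the code comments above say needs the raw form is not addressed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy parameters. */
#define TOY_N 8
#define TOY_ETA 2

typedef struct
{
  int32_t coeffs[TOY_N];
} toy_poly;

/* Unpack: each nibble t encodes the coefficient TOY_ETA - t. */
static void toy_unpack(toy_poly *p, const uint8_t *a)
{
  for (unsigned i = 0; i < TOY_N / 2; i++)
  {
    p->coeffs[2 * i] = TOY_ETA - (int32_t)(a[i] & 0x0F);
    p->coeffs[2 * i + 1] = TOY_ETA - (int32_t)(a[i] >> 4);
  }
}

/* Return 1 if any coefficient has absolute value >= bound, else 0. */
static int toy_chknorm(const toy_poly *p, int32_t bound)
{
  for (unsigned i = 0; i < TOY_N; i++)
  {
    int32_t c = p->coeffs[i];
    int32_t a = c < 0 ? -c : c;
    if (a >= bound)
    {
      return 1;
    }
  }
  return 0;
}

/* Placeholder for the forward NTT. */
static void toy_transform(toy_poly *p)
{
  for (unsigned i = 0; i < TOY_N; i++)
  {
    p->coeffs[i] *= 2;
  }
}

/* Unpack one polynomial, check the raw coefficients, then transform.
 * Returns 0 on success; returns -1 and leaves `out` untransformed if the
 * raw coefficients fall outside [-TOY_ETA, TOY_ETA]. */
static int toy_unpack_check_transform(toy_poly *out, const uint8_t *packed)
{
  assert(out != NULL && packed != NULL);
  toy_unpack(out, packed);
  if (toy_chknorm(out, TOY_ETA + 1))
  {
    return -1;
  }
  toy_transform(out);
  return 0;
}
```

Whether a fused step like this fits the real accessor interface depends on details not visible in this conversation, so it is offered only as one possible direction.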
oqs-bot left a comment

⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'AMD EPYC 3rd gen (c6a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

| Benchmark suite | Current: 881c2e4 | Previous: bb07ee8 | Ratio |
|---|---|---|---|
| ML-DSA-87 sign | 408515 cycles | 395562 cycles | 1.03 |

Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
269c1ee to b3ee120
Signed-off-by: Matthias J. Kannwischer <matthias@zerorisc.com>
@gilles-peskine-arm @waleed-elmelegy-arm, this PR may be of interest to you.
Introduce `mld_s1vec`, `mld_s2vec`, and `mld_t0vec`, following the same pattern as `mld_polymat` for reduced RAM usage. In normal mode, they store the full NTT'd polyvec. In `REDUCE_RAM` mode, they store a pointer to the packed data in the secret key and unpack + NTT individual polynomials on demand.

This reduces signing memory allocation in `REDUCE_RAM` mode:

TODO:
- `pk_from_sk`