Summary
We generated C++ from the working DSL flash-attention MLIR kernel (the ~140 TFLOP/s one).
We then edited the emitted C++ to remove a single autosync-inserted barrier in the final vector normalization path:
- generated file: fa_dsl.cpp
- edited variant: fa_dsl_nobar_finaldiv.cpp
- removed line in generated C++: pipe_barrier(PIPE_V);
- immediately before:
  TROWEXPANDDIV(v61, v61, v68);
  TROWEXPANDDIV(v62, v62, v69);
You can find all the files attached here:
fa_140tflops_barriers.zip
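The manual edit described above can be scripted so it is repeatable after regenerating the C++. The following is a minimal sketch, not part of PTOAS: the helper name `drop_barrier_before` is hypothetical, and the sample string is a tiny stand-in for the relevant region of fa_dsl.cpp, not the real file. It removes only the barrier line that sits directly before the first final TROWEXPANDDIV and refuses to edit if that pattern is not found exactly once.

```python
def drop_barrier_before(source: str,
                        barrier: str = "pipe_barrier(PIPE_V);",
                        anchor: str = "TROWEXPANDDIV(v61, v61, v68);") -> str:
    """Remove the single `barrier` line that immediately precedes `anchor`."""
    lines = source.splitlines(keepends=True)
    # Indices of anchor lines whose previous line is exactly the barrier.
    hits = [i for i in range(1, len(lines))
            if lines[i].strip() == anchor and lines[i - 1].strip() == barrier]
    if len(hits) != 1:
        # Fail loudly rather than edit the wrong place in regenerated code.
        raise ValueError(f"expected exactly one barrier/anchor pair, found {len(hits)}")
    del lines[hits[0] - 1]
    return "".join(lines)


# Stand-in for the tail of the generated kernel (assumption, not the real file).
sample = (
    "  pipe_barrier(PIPE_V);\n"
    "  // ... earlier vector ops ...\n"
    "  pipe_barrier(PIPE_V);\n"
    "  TROWEXPANDDIV(v61, v61, v68);\n"
    "  TROWEXPANDDIV(v62, v62, v69);\n"
)
edited = drop_barrier_before(sample)
print(edited, end="")
```

To apply it to the real files, read fa_dsl.cpp into a string, pass it through the helper, and write the result to fa_dsl_nobar_finaldiv.cpp. The register names v61 and v68 come from this particular generated file and may change if the MLIR input changes.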
This edited variant remains numerically correct on the tested lengths and improves long-sequence throughput.
Measured error against the fp32 reference:
S1=8192: 1.850910484790802e-05
S1=16384: 1.3465061783790588e-05
S1=65536: 6.201036740094423e-06
S1=131072: 4.507601261138916e-06
Measured throughput for the edited variant:
S1=8192: 98.90 TFLOP/s
S1=16384: 127.07 TFLOP/s
S1=32768: 131.62 TFLOP/s
S1=65536: 138.73 TFLOP/s
S1=131072: 164.18 TFLOP/s
This suggests that the barrier before the final vector normalize step is an over-synchronization candidate in PTOAS autosync.
Command line
ptoas --pto-arch=a3 --enable-insert-sync fa_dsl.mlir > fa_dsl.cpp
Reproduction input
See the files attached in the Summary above.
Expected performance
PTOAS autosync should avoid unnecessary synchronization in the final vector normalize path when correctness does not depend on it.
For this kernel shape, the generated kernel should be able to match the edited variant’s long-sequence behavior, reaching roughly 164-165 TFLOP/s at S1=131072 without requiring manual edits to the generated C++.
Actual performance
The unedited autosync-generated kernel includes a final pipe_barrier(PIPE_V) before the last TROWEXPANDDIV normalize step and is slower at long sequence lengths than the edited variant without that barrier.
Observed unedited throughput:
S1=131072: about 143.5 TFLOP/s
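For scale, the gap between the two S1=131072 measurements reported in this issue can be worked out as below. This is only arithmetic on the numbers quoted above; the unedited figure is given as "about 143.5", so the result is approximate.

```python
edited_tflops = 164.18    # edited variant at S1=131072, from the Summary
unedited_tflops = 143.5   # unedited autosync kernel at S1=131072 ("about")

speedup = edited_tflops / unedited_tflops
gain_pct = (speedup - 1.0) * 100.0
print(f"speedup: {speedup:.3f}x ({gain_pct:.1f}% higher throughput)")
# prints: speedup: 1.144x (14.4% higher throughput)
```

In other words, removing the one barrier is worth roughly 14% at the longest tested sequence length.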
Profiling data (optional)
No response
Git commit
40570a2
Environment:
- A3
- Linux aarch64
- driver: 25.5.1
- CANN: 8.5.0