[Performance] autosync inserts a final PIPE_V barrier before TROWEXPANDDIV; removing it preserves correctness and improves long-sequence FA throughput to ~165 TFLOP/s #646

@MirkoDeVita98


Summary

We generated C++ from the working 140 TFLOP/s DSL MLIR flash-attention kernel, then edited the emitted C++ to remove a single autosync barrier in the final vector normalization path:

  • generated file: fa_dsl.cpp
  • edited variant: fa_dsl_nobar_finaldiv.cpp
  • removed line in generated C++:
    • pipe_barrier(PIPE_V);
    • immediately before:
      • TROWEXPANDDIV(v61, v61, v68);
      • TROWEXPANDDIV(v62, v62, v69);
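For anyone reproducing the edit without the attachment, the manual change can be sketched as a small text filter over the generated source. This is only an illustration: `strip_final_barrier` and `trim` are hypothetical helpers, not part of PTOAS, and the sketch assumes the barrier and the first `TROWEXPANDDIV` call sit on adjacent lines, as in the list above. It drops only the last such barrier in the file.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: strips leading and trailing spaces, tabs and CRs.
static std::string trim(const std::string& s) {
    const auto first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: removes the LAST `pipe_barrier(PIPE_V);` line that is
// directly followed by a line starting with `TROWEXPANDDIV(`.
// All other lines, including other barriers, are kept unchanged.
std::string strip_final_barrier(const std::string& src) {
    std::vector<std::string> lines;
    std::istringstream in(src);
    for (std::string line; std::getline(in, line);) lines.push_back(line);

    std::size_t drop = lines.size();  // sentinel: nothing to drop
    for (std::size_t i = 0; i + 1 < lines.size(); ++i) {
        if (trim(lines[i]) == "pipe_barrier(PIPE_V);" &&
            trim(lines[i + 1]).rfind("TROWEXPANDDIV(", 0) == 0) {
            drop = i;  // remember the latest match only
        }
    }

    std::string out;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (i == drop) continue;
        out += lines[i];
        out += '\n';
    }
    return out;
}
```

Running this over the contents of fa_dsl.cpp should change exactly one line. It is worth diffing the result against the original before benchmarking, since the pattern match is an assumption about how the emitted code is laid out.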

You can find all the files attached here:

fa_140tflops_barriers.zip

This edited variant remains numerically correct on the tested lengths and improves long-sequence throughput.

Measured error against the fp32 reference:

  • S1=8192: 1.850910484790802e-05
  • S1=16384: 1.3465061783790588e-05
  • S1=65536: 6.201036740094423e-06
  • S1=131072: 4.507601261138916e-06

Measured throughput for the edited variant:

  • S1=8192: 98.90 TFLOP/s
  • S1=16384: 127.07 TFLOP/s
  • S1=32768: 131.62 TFLOP/s
  • S1=65536: 138.73 TFLOP/s
  • S1=131072: 164.18 TFLOP/s

This suggests that the barrier before the final vector normalize step is an over-synchronization candidate in PTOAS autosync.

Command line

ptoas --pto-arch=a3 --enable-insert-sync fa_dsl.mlir > fa_dsl.cpp

Reproduction input

See the attached files in the Summary above.

Expected performance

PTOAS autosync should avoid unnecessary synchronization in the final vector normalize path when correctness does not depend on it.

For this kernel shape, the generated kernel should be able to match the edited variant’s long-sequence behavior, reaching roughly:

  • ~164-165 TFLOP/s at S1=131072

without requiring manual edits to the generated C++.

Actual performance

The unedited autosync-generated kernel includes a final pipe_barrier(PIPE_V) before the last TROWEXPANDDIV normalize step and is slower at long sequence lengths than the edited variant without that barrier.

Observed unedited throughput:

  • S1=131072: about 143.5 TFLOP/s
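To put the gap in relative terms, the two S1=131072 figures reported in this issue (about 143.5 TFLOP/s unedited, 164.18 TFLOP/s edited) can be compared with a small helper. `speedup_percent` is a hypothetical name used only for this sketch.

```cpp
#include <cassert>

// Relative throughput gain of the edited kernel over the baseline, in percent.
// Both arguments are throughputs in the same unit (here TFLOP/s).
double speedup_percent(double baseline_tflops, double edited_tflops) {
    assert(baseline_tflops > 0.0);  // guard against division by zero
    return (edited_tflops / baseline_tflops - 1.0) * 100.0;
}
```

With the numbers above, `speedup_percent(143.5, 164.18)` comes out at roughly 14.4, so the single barrier accounts for about a 14% throughput difference at this sequence length.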

Profiling data (optional)

No response

Git commit

40570a2

Environment:

  • A3
  • Linux aarch64
  • driver: 25.5.1
  • CANN: 8.5.0
