[Performance] autosync inserts a final `PIPE_V` barrier before `TROWEXPANDDIV`; removing it preserves correctness and improves long-sequence FA throughput to ~165 TFLOP/s

### Summary

We generated C++ from the working DSL flash-attention MLIR kernel (the ~140 TFLOP/s version), then edited the emitted C++ to remove a single autosync barrier in the final vector normalization path:

- generated file: `fa_dsl.cpp`
- edited variant: `fa_dsl_nobar_finaldiv.cpp`
- removed line in generated C++:
  - `pipe_barrier(PIPE_V);`
  - immediately before:
    - `TROWEXPANDDIV(v61, v61, v68);`
    - `TROWEXPANDDIV(v62, v62, v69);`
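
For anyone who wants to recreate the edited variant without the attachment, the manual edit can be sketched as a small script. This is a hedged illustration, not the tool we used: the helper name `drop_barrier_before` is hypothetical, and it assumes the barrier sits alone on the line directly above the first `TROWEXPANDDIV` call, which may not match the exact layout of your generated `fa_dsl.cpp`.

```python
def drop_barrier_before(source: str, target: str,
                        barrier: str = "pipe_barrier(PIPE_V);") -> str:
    """Remove a barrier line only when it directly precedes `target`.

    Hypothetical helper: all other barriers in the file are left untouched.
    """
    lines = source.splitlines(keepends=True)
    kept = []
    for i, line in enumerate(lines):
        next_stmt = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if line.strip() == barrier and next_stmt == target:
            continue  # drop this single barrier
        kept.append(line)
    return "".join(kept)


# Toy stand-in for the tail of the generated file (assumed layout).
sample = """\
  pipe_barrier(PIPE_V);
  // ... earlier vector work (placeholder) ...
  pipe_barrier(PIPE_V);
  TROWEXPANDDIV(v61, v61, v68);
  TROWEXPANDDIV(v62, v62, v69);
"""

edited = drop_barrier_before(sample, "TROWEXPANDDIV(v61, v61, v68);")
print(edited)
```

Applied to the real file, the input would be the text of `fa_dsl.cpp` and the result would be written to `fa_dsl_nobar_finaldiv.cpp`; diffing the two should show exactly one removed line.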

All files are attached here:

[fa_140tflops_barriers.zip](https://github.com/user-attachments/files/27519899/fa_140tflops_barriers.zip)

This edited variant remains numerically correct on the tested lengths and improves long-sequence throughput.

Measured correctness against fp32 reference:

- `S1=8192`: `1.850910484790802e-05`
- `S1=16384`: `1.3465061783790588e-05`
- `S1=65536`: `6.201036740094423e-06`
- `S1=131072`: `4.507601261138916e-06`
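
The list above does not name the error metric behind these values. As a rough sketch for reviewers who want to run a comparable check, the two most common choices are shown below; both function names are hypothetical, and the inputs are assumed to be flattened kernel output and fp32 reference values of equal length.

```python
def max_abs_error(out, ref):
    """Largest element-wise absolute difference (assumed metric)."""
    return max(abs(o - r) for o, r in zip(out, ref))


def mean_abs_error(out, ref):
    """Average element-wise absolute difference (alternative assumed metric)."""
    return sum(abs(o - r) for o, r in zip(out, ref)) / len(ref)


# Toy values only; these are not measurements from the kernel.
out = [1.0, 2.0, 3.5]
ref = [1.0, 2.5, 3.0]
print(max_abs_error(out, ref), mean_abs_error(out, ref))
```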

Measured throughput for the edited variant:

- `S1=8192`: `98.90 TFLOP/s`
- `S1=16384`: `127.07 TFLOP/s`
- `S1=32768`: `131.62 TFLOP/s`
- `S1=65536`: `138.73 TFLOP/s`
- `S1=131072`: `164.18 TFLOP/s`
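
To compare these numbers with other measurements, it helps to fix the FLOP convention. A minimal sketch, assuming the usual non-causal forward flash-attention count of `4 * S0 * S1 * D` FLOPs per head (two matmuls at 2 FLOPs per multiply-add), is shown below. The parameters `batch`, `heads`, `s0`, and `head_dim` are illustrative assumptions; this issue only states `S1`, and the benchmark's actual convention may differ.

```python
def attention_tflops(batch, heads, s0, s1, head_dim, seconds):
    """TFLOP/s for one forward attention pass under the assumed FLOP count."""
    # QK^T and PV each cost 2 * s0 * s1 * head_dim FLOPs per head.
    flops = 4 * batch * heads * s0 * s1 * head_dim
    return flops / seconds / 1e12


# Toy shape and timing; not taken from the measurements above.
print(attention_tflops(batch=1, heads=1, s0=1024, s1=1024,
                       head_dim=128, seconds=1e-3))
```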

This suggests that the `pipe_barrier(PIPE_V)` before the final vector normalize step is an over-synchronization candidate in PTOAS autosync.

### Command line

`ptoas --pto-arch=a3 --enable-insert-sync fa_dsl.mlir > fa_dsl.cpp`

### Reproduction input

See the files attached in the Summary section above.

### Expected performance

PTOAS autosync should avoid unnecessary synchronization in the final vector normalize path when correctness does not depend on it.

For this kernel shape, the generated kernel should be able to match the edited variant’s long-sequence behavior, reaching roughly:

- `~164-165 TFLOP/s` at `S1=131072`

without requiring manual edits to the generated C++.

### Actual performance

The unedited autosync-generated kernel includes a final `pipe_barrier(PIPE_V)` before the last `TROWEXPANDDIV` normalize step and is slower at long sequence lengths than the edited variant without that barrier.

Observed unedited throughput:

- `S1=131072`: about `143.5 TFLOP/s`
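
For reference, the size of the gap at `S1=131072` can be worked out from the two figures reported in this issue. The unedited number is approximate, so the result is approximate as well.

```python
edited_tflops = 164.18    # edited variant, S1=131072 (from this issue)
unedited_tflops = 143.5   # unedited autosync output, approximate (from this issue)

speedup = edited_tflops / unedited_tflops
gain_pct = (speedup - 1.0) * 100.0
print(f"speedup: {speedup:.3f}x, gain: {gain_pct:.1f}%")
# prints "speedup: 1.144x, gain: 14.4%"
```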


### Profiling data (optional)

_No response_

### Git commit

40570a2c9a5f58b19ced95a0734b847c64cc00e6

### Environment:

- A3
- Linux `aarch64`
- driver: `25.5.1`
- CANN: `8.5.0`
