Skip to content

Fix loop-tail synthetic wait ordering#645

Open
TaoTao-real wants to merge 2 commits intohw-native-sys:mainfrom
TaoTao-real:codex/issue622-loop-tail-ordering
Open

Fix loop-tail synthetic wait ordering#645
TaoTao-real wants to merge 2 commits intohw-native-sys:mainfrom
TaoTao-real:codex/issue622-loop-tail-ordering

Conversation

@TaoTao-real
Copy link
Copy Markdown
Contributor

Summary

  • Keep loop-tail synthetic waits ahead of post-loop local sets on the same LOOP_END anchor.
  • Add an issue622 reproducer and adjust related lit checks to match the repaired ordering.

Motivation

  • Fixes issue [Bug] QK_PRELOAD=4 FlashAttention DSL kernel builds but deadlocks under auto-sync #622.
  • Before this change, UpdateBackwardMatchSync appended the synthetic tail wait to LOOP_END.pipeAfter. When the same anchor also carried a sunk post-loop local set reusing the same (srcPipe, dstPipe, eventId), the generated C++ could emit:
    • set(E0)
    • set(E0)
    • wait(E0)
    • wait(E0)
  • That creates overlapping same-key use on the loop-tail anchor. The intended order is serialized reuse:
    • set(E0)
    • wait(E0)
    • set(E0)
    • wait(E0)

Design

  • In SyncEventIdAllocation::UpdateBackwardMatchSync, insert the synthetic tail wait with push_front instead of push_back on LOOP_END.pipeAfter.
  • This preserves the intended order on the shared loop-tail anchor:
    • consume the carried event first
    • then emit the new post-loop local set
  • The loop-head synthetic set remains unchanged (push_back on LOOP_BEGIN.pipeBefore), which is already the conservative order for PRE syncs.
  • Because this changes legal generated sync order, the affected lit checks are updated to match the new stable shape.

Testing

  • Local targeted checks:
    • ptoas --pto-arch=a3 --enable-insert-sync test/lit/pto/issue428_cube_sync_regression.pto
    • ptoas --pto-arch=a3 --pto-level=level3 --enable-insert-sync test/lit/pto/issue564_k_loop_mte1_mte2_wait_regression.pto
    • ptoas --pto-arch=a3 --pto-level=level3 --enable-insert-sync test/lit/pto/issue622_v_mte2_eventid_overlap_reproducer.pto
  • Verified issue622 changes from overlapping same-key use to serialized reuse at loop tail.

Risk / Rollback

  • Risk: loop-tail sync order changes for synthetic backward-match waits that share a LOOP_END anchor with sunk local sets.
  • Rollback: revert this PR.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses synchronization issues where event IDs could be incorrectly reused across outstanding sync pairs, specifically targeting issue #622. The primary change in SyncEventIdAllocation.cpp ensures that synthetic tail waits are inserted at the beginning of the pipeAfter list, allowing them to be emitted before post-loop sets. Additionally, the PR updates several regression tests and introduces a new test case to verify the fix for event ID overlap in mixed-width pipelines. I have no feedback to provide.

@reedhecre
Copy link
Copy Markdown

Codex Review

该评论由 review 机器人自动更新。

  • PR: Fix loop-tail synthetic wait ordering #645 Fix loop-tail synthetic wait ordering
  • Author: TaoTao-real
  • Base/Head: main / codex/issue622-loop-tail-ordering
  • Head SHA: d0ee3a47b8d7
  • Trigger: 检测到新的 open PR
  • Generated At: 2026-05-08T09:12:22Z
  • Status: completed

Summary

发现 1 个 P2 问题:loop-tail synthetic wait 改为前插后,会在 ChangeNoEventIdSyncToPipeAll() 扫描同一 pipeAfter 时提前触发自修改,可能漏掉后续空 event-id 的 set,并在 codegen 阶段崩溃或生成错误同步序列。

Findings

  1. P2 将 tail wait 前插后,会在 no-event-id 降级扫描里提前自修改同一 `pipeAfter` deque lib/PTO/Transforms/InsertSync/SyncEventIdAllocation.cpp:402

把 reallocated loop-tail synthetic wait 改成 push_front 后,这个 WAIT 会成为 pipeAfter 的第一个元素。后面的 ChangeNoEventIdSyncToPipeAll() 仍然用 for (auto &sync : e->pipeAfter) 顺序扫描同一个 deque;一旦先看到这个 WAIT,就会进入 MoveOutBackwardMatchSync(),而无冲突分支会再次调用 UpdateBackwardMatchSync(),继续往同一个 pipeAfter 里插入新的 tail wait。这样现在是在“还没扫描完后续 loop-end SET_EVENT”之前就自修改正在遍历的容器,后面的 empty-event-id set 可能被跳过,无法走 568-573 行的 PIPE_ALL 降级。那类遗漏的 set 到 codegen 阶段会直接取 eventIds[0]SyncCodegen.cpp 340/353),会变成编译崩溃或坏的同步序列。旧实现把 synthetic wait 放在末尾时,至少不会在剩余 set 尚未处理前触发这条路径。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants