Thank you for releasing this easy-to-use artifact. I'm trying to reproduce the results from the paper on an H100 SXM5 machine. The following plot shows the normalized performance for KV_HEADS=Q_HEADS=32 (similar to the Figure 10 top plot: https://arxiv.org/pdf/2511.22333). Some interesting observations:
1. While PAT is the best performing, the next best in many cases is often FlashInfer (which is prefix-unaware) rather than FastTree (which is prefix-aware).
2. Cascade attention (which is also prefix-aware) performs much worse than FlashInfer.

For 1, I wonder if there are any intuitive reasons for it. For 2, was this also visible on the A100 machine (which the paper uses for its experiments)? Is there any reason why cascade performs so badly?
Your thoughts would be very helpful! Thank you for your time!
I'm attaching the raw kernel perf numbers as well.
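For clarity, this is roughly how I normalized the numbers for the plot above: within each benchmark configuration, every kernel's latency is divided by the fastest kernel's latency, so 1.0 is the best and larger is slower. The kernel names are the ones compared above; the latency values here are purely illustrative placeholders, not my measured data (see the attached raw numbers for those).

```python
# Sketch of the per-config normalization (illustrative values only).
latencies_us = {
    "PAT": 95.0,
    "FlashInfer": 110.0,
    "FastTree": 130.0,
    "Cascade": 250.0,
}

# Divide by the fastest kernel in this config: 1.0 = best, higher = slower.
best = min(latencies_us.values())
normalized = {k: v / best for k, v in latencies_us.items()}

for kernel, n in sorted(normalized.items(), key=lambda kv: kv[1]):
    print(f"{kernel}: {n:.2f}x of best")
```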