
Reproducing results on an H100 SXM5 #1

@christindbose

Description

Thank you for releasing this easy-to-use artifact. I'm trying to reproduce the results from the paper on an H100 SXM5 machine. The plot below shows the normalized performance for KV_HEADS = Q_HEADS = 32 (similar to the top plot of Figure 10: https://arxiv.org/pdf/2511.22333). Some interesting observations:

  1. While PAT is the best performing, the next best is often FlashInfer (which is prefix-unaware) rather than FastTree (which is prefix-aware).
  2. Cascade attention (which is also prefix-aware) performs much worse than FlashInfer.

For 1, I wonder whether there is an intuitive reason for this. For 2, was this also visible on the A100 machine the paper uses for its experiments? Any idea why cascade performs so badly?

Your thoughts would be very helpful! Thank you for your time!

[Plot: normalized kernel performance, KV_HEADS = Q_HEADS = 32]

I'm also attaching the raw kernel performance numbers:

kernel_perf.json
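For reference, here is a minimal sketch of how the raw numbers could be normalized for a plot like the one above. The schema is hypothetical (the actual kernel_perf.json layout may differ): I assume a mapping from configuration name to per-method latencies, and normalize against FlashInfer as the baseline.

```python
import json

def normalize(perf: dict, baseline: str = "FlashInfer") -> dict:
    """Return each method's speedup relative to the baseline, per config.

    Assumed (hypothetical) schema: {config: {method: latency, ...}, ...}.
    Values > 1.0 mean the method is faster than the baseline.
    """
    out = {}
    for config, methods in perf.items():
        base = methods[baseline]
        out[config] = {m: base / t for m, t in methods.items()}
    return out

# Toy example with made-up latencies (microseconds), not measured data:
example = {"bs32_len1k": {"PAT": 80.0, "FlashInfer": 100.0,
                          "FastTree": 110.0, "Cascade": 150.0}}
print(json.dumps(normalize(example), indent=2))
```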
