
Reproducing results on an H100 SXM5 #1

@christindbose

Description

Thank you for releasing this easy-to-use artifact. I'm trying to reproduce the results from the paper on an H100 SXM5 machine. The plot below shows the normalized performance for KV_HEADS = Q_HEADS = 32 (similar to the top plot of Figure 10: https://arxiv.org/pdf/2511.22333). Some interesting observations:

  1. While PAT is the best performing, the next best is often FlashInfer (which is prefix-unaware) rather than FastTree (which is prefix-aware).
  2. Cascade attention (which is also prefix-aware) performs much worse than FlashInfer.

For 1, I wonder whether there is an intuitive reason for this. For 2, was this also visible on the A100 machine the paper uses for its experiments? Any idea why cascade performs so badly?

Your thoughts would be very helpful! Thank you for your time!

[Plot: normalized kernel performance, KV_HEADS = Q_HEADS = 32]

I'm also attaching the raw kernel performance numbers:

kernel_perf.json
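For reference, here is a minimal sketch of how the raw numbers could be normalized for a plot like the one above. The schema is hypothetical (the actual kernel_perf.json layout may differ): I assume a mapping from configuration name to per-method latencies, and normalize against FlashInfer as the baseline.

```python
import json

def normalize(perf: dict, baseline: str = "FlashInfer") -> dict:
    """Return each method's speedup relative to the baseline, per config.

    Assumed (hypothetical) schema: {config: {method: latency, ...}, ...}.
    Values > 1.0 mean the method is faster than the baseline.
    """
    out = {}
    for config, methods in perf.items():
        base = methods[baseline]
        out[config] = {m: base / t for m, t in methods.items()}
    return out

# Toy example with made-up latencies (microseconds), not measured data:
example = {"bs32_len1k": {"PAT": 80.0, "FlashInfer": 100.0,
                          "FastTree": 110.0, "Cascade": 150.0}}
print(json.dumps(normalize(example), indent=2))
```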
