Enable FLASH_ATTN backend with upstream flash-attn CK on ROCm for decode #866
Draft
mgehre-amd wants to merge 6 commits into ROCm:gfx11 from
Conversation
git fetch upstream --tags and git describe can fail if the upstream repo is unreachable or no tags are reachable from HEAD. Use || to avoid aborting the workflow.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Trigger on push to gfx11 instead of main/matthias.awq_gemv. Remove the create-release and publish-to-gh-pages jobs; the wheel is still available as a GitHub Actions artifact.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
- Trigger workflow on PRs targeting gfx11 (build-only)
- On push to gfx11, upload wheel to S3 via OIDC + boto3
- S3 upload gated on ROCm org to skip on forks

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
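A minimal sketch of what a gated upload step like the one described in this commit could look like. The bucket name, key layout, and wheel path below are illustrative assumptions, not the workflow's actual values; credentials are assumed to come from the role assumed via OIDC earlier in the job.

```python
# Sketch of an org-gated wheel upload; bucket, key prefix, and paths are assumed.
import glob
import os
import sys

import boto3  # credentials resolved from the environment set up by the OIDC step

# Skip the upload on forks: only run when the repository lives in the ROCm org.
if os.environ.get("GITHUB_REPOSITORY_OWNER") != "ROCm":
    sys.exit(0)

s3 = boto3.client("s3")
for wheel in glob.glob("dist/*.whl"):
    key = f"wheels/gfx11/{os.path.basename(wheel)}"   # assumed key layout
    s3.upload_file(wheel, "example-vllm-wheels", key)  # assumed bucket name
```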
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
The FLASH_ATTN backend in vLLM V1 was tightly coupled to vllm_flash_attn (CUDA-only). On ROCm, fa_utils.py imported the upstream flash_attn_varlen_func, but the forward path passed vLLM-specific kwargs (out, fa_version, scheduler_metadata, etc.) that the upstream API doesn't accept, and get_flash_attn_version() returned None, causing an assertion failure.

Changes:
- Replace the raw upstream import with a wrapper in fa_utils.py that translates vLLM's calling convention to the upstream _wrapped_flash_attn_varlen_forward API, handling the seqused_k -> cu_seqlens_k conversion for the paged KV cache
- Return FA version 2 on ROCm when upstream flash-attn is available
- Set block_size to MultipleOf(128) on ROCm (CK kernel requirement)

The Triton AMD backend (FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE) does NOT support paged attention, so FLASH_ATTN as the vLLM backend requires the CK backend (FLASH_ATTENTION_TRITON_AMD_ENABLE unset).

Validated on gfx1151 with Qwen2.5-1.5B-Instruct: correct text generation with paged KV cache, concurrent requests, and multi-token decode.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
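A minimal sketch of the kind of translation such a wrapper performs, assuming a recent upstream flash-attn whose flash_attn_varlen_func accepts block_table. The PR itself targets _wrapped_flash_attn_varlen_forward, and the parameter handling shown here, including the seqused_k -> cu_seqlens_k conversion and the set of ignored kwargs, is illustrative rather than the PR's exact code.

```python
# Illustrative ROCm wrapper adapting vLLM's calling convention to upstream flash-attn.
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_varlen_func as _upstream_varlen_func


def flash_attn_varlen_func(
    q, k, v,
    max_seqlen_q, max_seqlen_k,
    cu_seqlens_q=None,
    seqused_k=None,           # per-sequence K lengths used with the paged cache
    block_table=None,         # maps logical blocks to physical KV-cache blocks
    softmax_scale=None,
    causal=False,
    out=None,                 # vLLM-specific: preallocated output buffer
    fa_version=None,          # vLLM-specific: ignored, upstream picks the kernel
    scheduler_metadata=None,  # vLLM-specific: FA3-only, ignored here
    **_ignored,               # swallow any other vLLM-only kwargs
):
    # Upstream expects cumulative K lengths; derive them from seqused_k by
    # prepending a zero to the running sum.
    cu_seqlens_k = F.pad(torch.cumsum(seqused_k, dim=0, dtype=torch.int32), (1, 0))

    result = _upstream_varlen_func(
        q, k, v,
        cu_seqlens_q=cu_seqlens_q,
        cu_seqlens_k=cu_seqlens_k,
        max_seqlen_q=max_seqlen_q,
        max_seqlen_k=max_seqlen_k,
        softmax_scale=softmax_scale,
        causal=causal,
        block_table=block_table,
    )
    if out is not None:
        out.copy_(result)
        return out
    return result
```

Swallowing unknown kwargs keeps vLLM's call sites unchanged while the upstream API evolves; only the arguments the CK path actually understands are forwarded.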
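A hypothetical way to exercise the validated configuration end to end. The Hugging Face model id, prompts, and sampling parameters are assumptions; only the model family and the backend/environment choices come from the commit message above.

```python
# Reproduction sketch: force vLLM's FLASH_ATTN backend on ROCm with CK kernels.
import os

# Leave FLASH_ATTENTION_TRITON_AMD_ENABLE unset so upstream flash-attn uses the
# CK backend, and select the FLASH_ATTN attention backend explicitly.
os.environ.pop("FLASH_ATTENTION_TRITON_AMD_ENABLE", None)
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # assumed HF id for the tested model
params = SamplingParams(temperature=0.0, max_tokens=64)

# Several identical prompts to exercise concurrent requests and the paged KV cache.
outputs = llm.generate(["What is the capital of France?"] * 4, params)
for o in outputs:
    print(o.outputs[0].text)
```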