
[Bug]: FLASHINFER - chunked_prefill crashes with multiple concurrent requests #1279

@frenzybiscuit

Description


Your current environment

In single-user mode with a single request, chunked prefill works with FLASHINFER, and I can reach 160k context with an FP8 KV cache.

When multiple concurrent requests come in, it crashes with an error saying this is not supported with FLASHINFER.

However, without chunked prefill, my maximum FP8 context drops from 132k to 15k, which makes FLASHINFER useless to me.
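
The following is a simplified, hypothetical sketch of how a chunked-prefill scheduler might split work under a per-step token budget; it is not the engine's actual scheduler. It illustrates one plausible reason the single-request and multi-request cases behave differently: with one request, every forward pass is either pure prefill or pure decode, while concurrent requests produce steps that mix decode tokens with a prefill chunk, a batch shape the attention backend has to support explicitly. Chunking also caps the tokens in any single forward pass at the budget instead of the full prompt length, which is one likely reason disabling it shrinks the usable context. The names `Request`, `schedule_step`, `is_mixed`, and the 4,096-token budget are assumptions for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request state: prompt length and tokens prefilled so far."""
    name: str
    prompt_len: int
    prefilled: int = 0

    @property
    def in_prefill(self) -> bool:
        return self.prefilled < self.prompt_len


def schedule_step(requests: list[Request], token_budget: int) -> list[tuple[str, str, int]]:
    """Plan one forward pass under a fixed token budget.

    Decoding requests get one token each first; whatever budget remains is
    spent on prefill chunks. Returns (name, phase, num_tokens) entries.
    """
    plan = []
    budget = token_budget
    # Each request that has finished prefill contributes exactly one decode token.
    for req in requests:
        if not req.in_prefill and budget > 0:
            plan.append((req.name, "decode", 1))
            budget -= 1
    # Fill the rest of the budget with (possibly partial) prefill chunks.
    for req in requests:
        if req.in_prefill and budget > 0:
            chunk = min(budget, req.prompt_len - req.prefilled)
            req.prefilled += chunk
            plan.append((req.name, "prefill", chunk))
            budget -= chunk
    return plan


def is_mixed(plan: list[tuple[str, str, int]]) -> bool:
    """True if one forward pass contains both prefill and decode tokens."""
    return {phase for _, phase, _ in plan} == {"prefill", "decode"}


# Single request: a 10,000-token prompt is prefilled in budget-sized chunks.
solo = [Request("a", prompt_len=10_000)]
for _ in range(3):
    print(schedule_step(solo, token_budget=4096))

# Concurrent requests: "a" is already decoding while "b" starts its prefill.
busy = [Request("a", prompt_len=100, prefilled=100), Request("b", prompt_len=10_000)]
print(is_mixed(schedule_step(busy, token_budget=4096)))
```

The decode-first ordering is just one simple policy choice for the sketch; it keeps in-flight generations moving while long prompts are ingested piece by piece.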

With FLASH ATTENTION 2, I can reach over 60k context with an FP16 KV cache, but I cannot use FP8 because it's not supported on FLASH ATTENTION 2.
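
A back-of-the-envelope sketch of why FP8 KV cache support matters here: the cache stores one key and one value vector per layer per KV head for every token, so its size per token is proportional to the element size, and for a fixed memory budget an FP8 cache holds roughly twice as many tokens as FP16. The model shape and the 20 GiB budget below are hypothetical, not the reporter's actual setup, and the calculation ignores block granularity and any per-block scaling-factor overhead a real engine may add.

```python
# Hypothetical model shape (roughly a 70B-class model with grouped-query attention).
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
# Hypothetical GPU memory left for the KV cache after weights and activations.
KV_BUDGET_BYTES = 20 * 1024**3

BYTES_PER_ELEM = {"fp16": 2, "fp8": 1}


def kv_bytes_per_token(dtype: str) -> int:
    """Bytes of KV cache a single token occupies across all layers."""
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM[dtype]


def max_cached_tokens(dtype: str, budget_bytes: int = KV_BUDGET_BYTES) -> int:
    """How many tokens fit in the cache budget for a given element type."""
    return budget_bytes // kv_bytes_per_token(dtype)


for dtype in ("fp16", "fp8"):
    print(dtype, kv_bytes_per_token(dtype), max_cached_tokens(dtype))
# fp16 327680 65536
# fp8 163840 131072
```

Under these assumptions, halving the element size exactly doubles the token capacity, which is the gap between the FP16 and FP8 context figures a user would hope to close.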

Is there any way to get FLASHINFER with chunked_prefill fixed, or to get a quantized KV cache supported on FLASH ATTENTION 2?

Thank you!

Model Input Dumps

.

🐛 Describe the bug

.

Metadata


Assignees

No one assigned

    Labels

    bug (Something isn't working)

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
