Thanks for the great work!
I noticed that the dflash drafters for qwen3.6 use sliding window attention, and I'm curious about the reason behind it.
Is it meant to lower draft latency, or does sliding window attention increase the accept length?
If I force the sliding window layers to use full attention, will the accept length drop?