
feat(qwen3): double-buffered decoding with on-device embedding lookup #46

Open
vgene wants to merge 1 commit into main from feat/double-buffering

Conversation

Contributor

@vgene vgene commented Mar 27, 2026

Summary

  • Fuse greedy sampling + embedding lookup into a single device kernel (greedy_sampling_with_embedding) so the token embedding stays on device, eliminating the per-token host round-trip (D2H token ID → host embedding lookup → H2D embedding)
  • Double-buffer next_id output: two alternating DeviceTensors so the D2H of the previous token overlaps with the current iteration's non-blocking kernel execution
  • Add --no-double-buffering CLI flag to fall back to the original baseline decode path for A/B performance comparison
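The fused kernel described above can be illustrated with a minimal host-side sketch in NumPy: greedy sampling (argmax over logits) and the embedding-table gather happen in one call, so the selected token's embedding can feed the next decode step directly instead of round-tripping the token ID through the host. The function and buffer names here are illustrative, not the PR's actual API, and a real implementation would run as a single device kernel.

```python
import numpy as np

def greedy_sampling_with_embedding(logits, embedding_table, next_id_out):
    """Host-side sketch of the fused device kernel (illustrative only).

    On device: argmax over logits selects the next token, the token ID is
    written into a device-resident buffer (copied D2H later, off the
    critical path), and the embedding row is returned directly so the
    next iteration never waits on a host round-trip.
    """
    token_id = int(np.argmax(logits))   # greedy sampling
    next_id_out[0] = token_id           # device buffer in the real kernel
    return embedding_table[token_id]    # embedding stays "on device"

# Toy usage: vocabulary of 4 tokens, embedding dimension 3.
emb = np.arange(12, dtype=np.float32).reshape(4, 3)
logits = np.array([0.1, 2.0, 0.5, -1.0], dtype=np.float32)
next_id = np.zeros(1, dtype=np.int64)
vec = greedy_sampling_with_embedding(logits, emb, next_id)
```

Here `vec` is row 1 of the embedding table (token 1 has the highest logit), and `next_id` holds the sampled ID for the eventual asynchronous copy back to the host.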

Test plan

  • Run with double buffering (default): torchrun ... qwen3.py "prompt" — verify correct output and tokens/sec
  • Run without: torchrun ... qwen3.py --no-double-buffering "prompt" — verify identical output
  • Compare tokens/sec between the two modes

feat(qwen3): double-buffered decoding with on-device embedding lookup

Fuse greedy sampling and embedding lookup into a single device kernel
so the selected token's embedding stays on device and feeds the next
iteration directly, eliminating the per-token host round-trip
(D2H token ID → host embedding lookup → H2D embedding).

Two next_id buffers alternate so the D2H of the previous token
overlaps with the current iteration's non-blocking kernel execution.

Adds --no-double-buffering flag to compare performance against
the baseline decode path.
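The alternating-buffer scheme can be sketched as a decode loop that writes the sampled ID into buffer `i % 2` while reading the previous token out of buffer `(i - 1) % 2`; on a real device the read would be a non-blocking D2H copy overlapping the current kernel. All names and the toy model below are illustrative, not the PR's code.

```python
import numpy as np

def decode(model_step, emb, first_emb, num_steps):
    """Double-buffered greedy decode loop (illustrative sketch).

    Two next_id buffers alternate: while the kernel for step i writes
    buffer i % 2, the D2H copy of buffer (i - 1) % 2 can proceed in
    parallel. Here both happen synchronously on the host for clarity.
    """
    next_id_bufs = [np.zeros(1, dtype=np.int64), np.zeros(1, dtype=np.int64)]
    x = first_emb
    tokens = []
    for i in range(num_steps):
        logits = model_step(x)
        buf = next_id_bufs[i % 2]        # alternate output buffers
        buf[0] = int(np.argmax(logits))  # fused greedy sample...
        x = emb[buf[0]]                  # ...and embedding lookup feeding step i+1
        if i > 0:
            # "D2H" of the previous token, overlapping the current step
            tokens.append(int(next_id_bufs[(i - 1) % 2][0]))
    tokens.append(int(next_id_bufs[(num_steps - 1) % 2][0]))  # drain last buffer
    return tokens

# Toy model: embedding of token t is a one-hot at (t + 1) % 4, and the
# "model" is the identity, so decoding walks the vocabulary in order.
emb = np.roll(np.eye(4), 1, axis=1)
out = decode(lambda x: x, emb, emb[0], 4)
```

Note that with only two buffers, buffer `i % 2` is reused at step `i + 2`; the baseline path (restored by `--no-double-buffering`) instead copies every token ID to the host synchronously before the next step can start.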
@vgene vgene requested a review from a team March 27, 2026 05:31
