Restore Phi-3.5 test configurations with head_dim fix #1090
Draft
Conversation
Move maybe_pad_interleaved, maybe_pad_tail, and replicate_kv from gqa.py to a new padding.py module. Extract FP8 cast helpers for reuse. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
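The helpers moved into the new padding.py module can be sketched as follows. This is a minimal NumPy illustration of the intent, not the actual NxD implementation; the function names match the commit, but the signatures and the `head_axis` convention are assumptions.

```python
import numpy as np

def maybe_pad_tail(x: np.ndarray, multiple: int, axis: int = -1) -> np.ndarray:
    """Zero-pad `axis` up to the next multiple of `multiple`; no-op if already aligned."""
    pad = (-x.shape[axis]) % multiple
    if pad == 0:
        return x
    widths = [(0, 0)] * x.ndim
    widths[axis] = (0, pad)
    return np.pad(x, widths)

def replicate_kv(kv: np.ndarray, n_rep: int, head_axis: int = 1) -> np.ndarray:
    """Repeat each KV head n_rep times so GQA K/V match the query head count."""
    return np.repeat(kv, n_rep, axis=head_axis)
```

For example, a Phi-3.5-style head_dim of 96 padded with `multiple=128` becomes 128, while an already-aligned tensor is returned unchanged.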
This works around an explosion in the number of ops produced by the compiler when unrolling tensors whose shape is not aligned with the hardware tiling constraints. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Head_dim alignment (64→128) changes compiled tensor shapes, which can produce slightly different token generation on Neuron hardware. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The head_dim padding (align to 128) caused qwen3 and qwen3_moe to create q_layernorm/k_layernorm with the padded dimension, while the HF checkpoint weights have the original dimension. This shape mismatch broke NxD weight loading. The fix is to use original_head_dim for the layernorm and to apply it only to the active (non-padded) portion of the Q/K tensors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
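The active-slice normalization described in this commit can be sketched as below. Qwen3 applies a per-head RMSNorm to Q and K; the helper names and the in-place slicing strategy here are illustrative assumptions, not the PR's exact code.

```python
import numpy as np

def rms_norm(x: np.ndarray, weight: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm over the last axis, as used for Qwen3 q/k norms."""
    var = np.mean(x.astype(np.float64) ** 2, axis=-1, keepdims=True)
    return (x / np.sqrt(var + eps)) * weight

def apply_qk_layernorm(q: np.ndarray, weight: np.ndarray, original_head_dim: int) -> np.ndarray:
    """Normalize only the first `original_head_dim` entries of the padded head
    dimension; the zero padding beyond it is left untouched.  `weight` keeps the
    checkpoint's original shape (original_head_dim,), so HF weight loading works."""
    out = q.copy()
    out[..., :original_head_dim] = rms_norm(q[..., :original_head_dim], weight)
    return out
```

The key point is that the layernorm weight stays at the checkpoint's original head_dim (e.g. 96) even though the Q/K tensors carry the padded dimension (e.g. 128).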
This pull request fixes an edge case for decoder models whose head_dim is not a multiple of 128.
The empirical observation is that the compiler creates a huge HLO graph just to unroll the matrix multiplications when the head_dim does not fit the hardware tiling constraints.
This was revealed when compiling the Phi-3.5 model (head_dim is 96) with a smaller sequence length (1024). Compilation failed because the default XLA tracing path was used instead of the ISA (assembly) attention kernel, which is only selected when the sequence length is above 4096.
The solution is to pad the attention weights so that the head dimension becomes a multiple of 128, and to use that padded dimension throughout the attention operations.
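The weight-padding step can be sketched as below: each head's slice of a projection weight is zero-padded from the original head_dim to the next multiple of the alignment. This is a simplified NumPy sketch under the assumption of a `(num_heads * head_dim, hidden)` weight layout; the real code must also handle the output projection and FP8 casts.

```python
import numpy as np

def pad_attention_weight(w: np.ndarray, num_heads: int, head_dim: int,
                         align: int = 128) -> np.ndarray:
    """Zero-pad each head's rows of a (num_heads * head_dim, hidden) projection
    weight so the per-head dimension becomes the next multiple of `align`.
    The zero rows contribute nothing to the matmul output, so results on the
    active slice are unchanged."""
    padded_dim = head_dim + (-head_dim) % align
    per_head = w.reshape(num_heads, head_dim, -1)
    out = np.zeros((num_heads, padded_dim, per_head.shape[-1]), dtype=w.dtype)
    out[:, :head_dim, :] = per_head
    return out.reshape(num_heads * padded_dim, -1)
```

For Phi-3.5 (head_dim 96), each head is padded to 128, so the compiler sees tile-aligned shapes and can dispatch the fast attention path instead of unrolling.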