Restore Phi-3.5 test configurations with head_dim fix #1090

Draft
dacorvo wants to merge 7 commits into main from head_dim_padding

Conversation

@dacorvo
Collaborator

@dacorvo dacorvo commented Mar 13, 2026

This pull request solves an edge case for decoder models whose head_dim is not a multiple of 128.
The empirical observation is that the compiler creates a huge HLO graph just to unroll the matrix multiplications when the head_dim does not fit well within the hardware tiling constraints.
This surfaced when trying to compile the Phi-3.5 model (head_dim is 96) with a smaller sequence length (1024): the default XLA tracing path was used instead of the ISA (assembly) attention kernel, which is only selected when the sequence length is above 4096, and compilation failed with an error.
The solution is to pad the head dimension of the attention weights to a multiple of 128 and to use that padded dimension throughout all the attention operations.
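The padding described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation; `pad_head_dim` and the tile size constant are hypothetical names, and the tensor shapes are only illustrative:

```python
import torch
import torch.nn.functional as F

TILE = 128  # assumed hardware tiling alignment for head_dim

def pad_head_dim(weights: torch.Tensor, head_dim: int, tile: int = TILE) -> torch.Tensor:
    """Zero-pad the trailing head_dim axis of per-head projection
    weights up to the next multiple of `tile` (e.g. 96 -> 128)."""
    padded = (head_dim + tile - 1) // tile * tile
    if padded == head_dim:
        return weights
    # F.pad pads the last dimension with (left, right) amounts
    return F.pad(weights, (0, padded - head_dim))

# Phi-3.5-like example: head_dim=96 is padded to 128
w = torch.randn(32, 4096, 96)  # (num_heads, hidden_size, head_dim), illustrative
w_padded = pad_head_dim(w, 96)
print(w_padded.shape)  # torch.Size([32, 4096, 128])
```

Zero-padding the Q/K head dimension leaves the attention scores unchanged, since the padded positions contribute nothing to the dot products; the padded portion of the output only needs to be sliced away (or mapped through correspondingly padded output-projection weights).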

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dacorvo dacorvo changed the title test(phi3.5): restore Phi-3.5 test configurations with head_dim fix Restore Phi-3.5 test configurations with head_dim fix Mar 13, 2026
dacorvo and others added 3 commits March 13, 2026 13:48
Move maybe_pad_interleaved, maybe_pad_tail, and replicate_kv from
gqa.py to a new padding.py module. Extract FP8 cast helpers for reuse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This works around an explosion in the number of ops produced
by the compiler when unrolling tensors whose shapes are not aligned
with the hardware tiling constraints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dacorvo and others added 3 commits March 13, 2026 14:50
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Head_dim alignment (64→128) changes compiled tensor shapes, which
can produce slightly different token generation on Neuron hardware.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The head_dim padding (align to 128) caused qwen3 and qwen3_moe to
create q_layernorm/k_layernorm with the padded dimension, but HF
weights have the original dimension. This shape mismatch broke NxD
weight loading. Use original_head_dim for the layernorm and apply
it only to the active (non-padded) portion of Q/K tensors.
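The layernorm fix described in this commit can be sketched as follows. This is a hedged illustration, not the actual NxD code: `apply_active_layernorm` is a hypothetical helper, and `torch.nn.LayerNorm` stands in for the model's actual per-head norm (Qwen3 uses an RMSNorm variant):

```python
import torch

def apply_active_layernorm(q: torch.Tensor,
                           norm: torch.nn.Module,
                           original_head_dim: int) -> torch.Tensor:
    """Apply a per-head norm sized for the original (checkpoint) head_dim
    to the active slice of a padded Q/K tensor, leaving padding untouched."""
    active = norm(q[..., :original_head_dim])
    return torch.cat([active, q[..., original_head_dim:]], dim=-1)

# The checkpoint norm weight keeps the original 96 dims,
# while the runtime Q/K tensors carry the padded head_dim of 128.
norm = torch.nn.LayerNorm(96)
q = torch.randn(2, 32, 10, 128)  # (batch, heads, seq, padded head_dim), illustrative
out = apply_active_layernorm(q, norm, 96)
print(out.shape)  # torch.Size([2, 32, 10, 128])
```

Keeping the norm at `original_head_dim` means the Hugging Face weights load without any shape mismatch, and normalizing only the active slice avoids letting the padded zeros skew the normalization statistics.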

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>