Restore Phi-3.5 test configurations with head_dim fix #1090

Draft
dacorvo wants to merge 7 commits into main from head_dim_padding

Conversation

@dacorvo
Collaborator

@dacorvo dacorvo commented Mar 13, 2026

This pull request solves an edge case for decoder models whose head_dim is not a multiple of 128.
The empirical observation is that the compiler creates a huge HLO graph just to unroll the matrix multiplications when the head_dim does not fit well within the hardware tiling constraints.
This surfaced when trying to compile the Phi-3.5 model (head_dim is 96) with a smaller sequence length (1024): the default XLA tracing path was used instead of the ISA (assembly) attention kernel, which is only selected when the sequence length is above 4096, and compilation failed with an error.
The solution is to pad the head dimension of the attention weights to a multiple of 128 and to use that padded dimension throughout all the attention operations.
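The padding described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation; `pad_head_dim` and the tile size constant are hypothetical names, and the tensor shapes are only illustrative:

```python
import torch
import torch.nn.functional as F

TILE = 128  # assumed hardware tiling alignment for head_dim

def pad_head_dim(weights: torch.Tensor, head_dim: int, tile: int = TILE) -> torch.Tensor:
    """Zero-pad the trailing head_dim axis of per-head projection
    weights up to the next multiple of `tile` (e.g. 96 -> 128)."""
    padded = (head_dim + tile - 1) // tile * tile
    if padded == head_dim:
        return weights
    # F.pad pads the last dimension with (left, right) amounts
    return F.pad(weights, (0, padded - head_dim))

# Phi-3.5-like example: head_dim=96 is padded to 128
w = torch.randn(32, 4096, 96)  # (num_heads, hidden_size, head_dim), illustrative
w_padded = pad_head_dim(w, 96)
print(w_padded.shape)  # torch.Size([32, 4096, 128])
```

Zero-padding the Q/K head dimension leaves the attention scores unchanged, since the padded positions contribute nothing to the dot products; the padded portion of the output only needs to be sliced away (or mapped through correspondingly padded output-projection weights).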

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dacorvo dacorvo changed the title test(phi3.5): restore Phi-3.5 test configurations with head_dim fix Restore Phi-3.5 test configurations with head_dim fix Mar 13, 2026
dacorvo and others added 3 commits March 13, 2026 13:48
Move maybe_pad_interleaved, maybe_pad_tail, and replicate_kv from
gqa.py to a new padding.py module. Extract FP8 cast helpers for reuse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This works around an explosion in the number of ops produced
by the compiler when unrolling tensors whose shapes are not aligned
with the hardware tiling constraints.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
dacorvo and others added 3 commits March 13, 2026 14:50
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Head_dim alignment (64→128) changes compiled tensor shapes, which
can produce slightly different token generation on Neuron hardware.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The head_dim padding (align to 128) caused qwen3 and qwen3_moe to
create q_layernorm/k_layernorm with the padded dimension, but HF
weights have the original dimension. This shape mismatch broke NxD
weight loading. Use original_head_dim for the layernorm and apply
it only to the active (non-padded) portion of Q/K tensors.
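The layernorm fix described in this commit can be sketched as follows. This is a hedged illustration, not the actual NxD code: `apply_active_layernorm` is a hypothetical helper, and `torch.nn.LayerNorm` stands in for the model's actual per-head norm (Qwen3 uses an RMSNorm variant):

```python
import torch

def apply_active_layernorm(q: torch.Tensor,
                           norm: torch.nn.Module,
                           original_head_dim: int) -> torch.Tensor:
    """Apply a per-head norm sized for the original (checkpoint) head_dim
    to the active slice of a padded Q/K tensor, leaving padding untouched."""
    active = norm(q[..., :original_head_dim])
    return torch.cat([active, q[..., original_head_dim:]], dim=-1)

# The checkpoint norm weight keeps the original 96 dims,
# while the runtime Q/K tensors carry the padded head_dim of 128.
norm = torch.nn.LayerNorm(96)
q = torch.randn(2, 32, 10, 128)  # (batch, heads, seq, padded head_dim), illustrative
out = apply_active_layernorm(q, norm, 96)
print(out.shape)  # torch.Size([2, 32, 10, 128])
```

Keeping the norm at `original_head_dim` means the Hugging Face weights load without any shape mismatch, and normalizing only the active slice avoids letting the padded zeros skew the normalization statistics.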

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>