[ROCm][quant] INC: route w4a16-sym MoE through HybridW4A16 HIP path #929
Draft

mgehre-amd wants to merge 1 commit into gfx11
Wires INC (Intel Neural Compressor / auto-round) quantized models into the same HIP HybridW4A16 MoE path that compressed-tensors w4a16 already uses on ROCm. Auto-round emits its checkpoints in `auto_round:auto_gptq` packing (the same on-disk layout as compressed-tensors `pack_quantized`), so the only INC-specific piece is registering the parameters under the GPTQ names (`w*_qweight` / `w*_scales` / `w*_qzeros`) that the standard FusedMoE expert-name mapping resolves; the conversion to ExLlama-shuffled `[E, N, K//8]` and the `HybridW4A16MoEExperts` modular-kernel install are reused from compressed-tensors.

Verified on Strix Halo (gfx1151) with `Intel/Qwen3.5-35B-A3B-int4-AutoRound`: the `_rocm_C::fused_moe_wvSplitK_int4_gemm` kernel now drives the per-token MoE GEMMs on decode; non-MoE INT4 linears were already going through `HybridW4A16LinearKernel` via `choose_mp_linear_kernel`.

Changes:

- `vllm/platforms/rocm.py`: add `"inc"` to `supported_quantization` (the dispatcher behind it ultimately picks AWQ/GPTQ kernels through `choose_mp_linear_kernel`, so ROCm support is no longer unconditionally rejected at config validation).
- `vllm/model_executor/layers/quantization/inc.py`: in `apply_awq_quant_layer` / `apply_gptq_quant_layer`, when the gate passes (`is_rocm`, 4-bit, sym, `group_size > 0`, FusedMoE, non-marlin), return the new `INCHybridW4A16MoEMethod` instead of falling back to the generic `MoeWNA16Method` (the gate is sketched below).
- `vllm/model_executor/layers/quantization/inc_moe.py` (new): `INCHybridW4A16MoEMethod` registers GPTQ-named params, drops the sym `qzeros` (7-sentinel) before the kernel sees them, aliases the params to the names the helper expects, and installs the modular kernel. Originals are freed after the repack so weight memory stays at the checkpoint footprint instead of doubling (see the second and third sketches below).
- `vllm/model_executor/layers/fused_moe/hybrid_w4a16_moe_helper.py` (new): shared `setup_hybrid_w4a16_moe(method, layer)` extracted from compressed-tensors; called by both backends so the conversion + modular-kernel install lives in one place (see the last sketch below).
- `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_wna16.py`: `_process_weights_hybrid_w4a16` now delegates to the shared helper.

Bench (gfx1151, `Intel/Qwen3.5-35B-A3B-int4-AutoRound`, synthetic-mm 640x480, ISL/OSL = 100/128, conc = 1, `--enforce-eager`): decode 25.9 -> 36.4 tok/s (+40%), TPOT 38.7 -> 27.5 ms. Profile confirms `_rocm_C::fused_moe_wvSplitK_int4_gemm` is now on the decode hot path (was Triton MoeWNA16 before).
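To make the routing concrete, here is a minimal sketch of the gate described in the `inc.py` bullet above. Only the conditions themselves come from this PR; the flat attribute names (`weight_bits`, `sym`, `group_size`, `use_marlin`) and the constructor signatures are assumptions, not the exact vLLM code.

```python
# Sketch of the dispatch gate in apply_gptq_quant_layer (apply_awq_quant_layer
# is analogous). Attribute names and constructors below are assumptions.
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.model_executor.layers.quantization.inc_moe import (  # new in this PR
    INCHybridW4A16MoEMethod)
from vllm.model_executor.layers.quantization.moe_wna16 import MoeWNA16Method
from vllm.platforms import current_platform


def apply_gptq_quant_layer(self, layer, prefix: str = ""):
    if isinstance(layer, FusedMoE):
        if (current_platform.is_rocm()       # HIP kernel path is ROCm-only
                and self.weight_bits == 4    # int4 weights
                and self.sym                 # symmetric, no real zero points
                and self.group_size > 0      # grouped scales, not per-channel
                and not self.use_marlin):    # marlin would take its own path
            # New: per-token MoE GEMMs run on the HIP wvSplitK int4 kernel.
            return INCHybridW4A16MoEMethod(self, layer.moe_config)
        # Previous behavior: generic Triton WNA16 MoE method.
        return MoeWNA16Method(self, layer.moe_config)
```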
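The parameter-registration side can be pictured as below: a hedged sketch of what `create_weights` in `INCHybridW4A16MoEMethod` registers, assuming the standard auto_gptq int4 layout (eight 4-bit values per int32 word, packed along K) and omitting vLLM's real parameter classes and loader attributes.

```python
import torch
from torch.nn import Parameter


def create_weights(self, layer, num_experts, hidden_size,
                   intermediate_size_per_partition, params_dtype, **extra):
    pack, gs = 8, self.group_size  # eight int4 values per int32 word
    # (prefix, K, N): w13 fuses gate_proj + up_proj; w2 is down_proj.
    shapes = {
        "w13": (hidden_size, 2 * intermediate_size_per_partition),
        "w2": (intermediate_size_per_partition, hidden_size),
    }
    for prefix, (k, n) in shapes.items():
        # GPTQ packs qweight along K: [E, K // 8, N], int32.
        layer.register_parameter(f"{prefix}_qweight", Parameter(
            torch.empty(num_experts, k // pack, n, dtype=torch.int32),
            requires_grad=False))
        # One scale per group along K: [E, K // group_size, N].
        layer.register_parameter(f"{prefix}_scales", Parameter(
            torch.empty(num_experts, k // gs, n, dtype=params_dtype),
            requires_grad=False))
        # Sym checkpoints still ship qzeros (all-7 sentinel); register them
        # so the loader resolves them, then drop after load (next sketch).
        layer.register_parameter(f"{prefix}_qzeros", Parameter(
            torch.empty(num_experts, k // gs, n // pack, dtype=torch.int32),
            requires_grad=False))
```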
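After load, the sym `qzeros` are validated against the 7-sentinel and dropped before the handoff to the shared helper. A sketch follows; the sentinel constant, assertion message, and attribute handling are illustrative assumptions.

```python
import torch

SYM_QZERO_WORD = 0x77777777  # eight packed int4 nibbles of 7 in one int32


def process_weights_after_loading(self, layer):
    # GPTQ sym checkpoints store the zero point 8 as 7 (the classic
    # "minus one" encoding), so every packed word must equal the sentinel.
    for name in ("w13_qzeros", "w2_qzeros"):
        qzeros = getattr(layer, name)
        assert torch.all(qzeros == SYM_QZERO_WORD), \
            f"{name}: asymmetric zero points, HybridW4A16 path cannot apply"
        delattr(layer, name)  # the HIP kernel never reads zero points

    # Repack to ExLlama-shuffled [E, N, K//8], free the GPTQ originals, and
    # install the modular kernel -- all shared with compressed-tensors.
    setup_hybrid_w4a16_moe(self, layer)
```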
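Finally, a skeleton of the shared helper itself. The real ExLlama conversion also permutes the packed nibbles; `_placeholder_shuffle` below only performs the layout transpose so the sketch stays runnable, and the `HybridW4A16MoEExperts` install is summarized in a comment rather than reproduced.

```python
# Skeleton of setup_hybrid_w4a16_moe in hybrid_w4a16_moe_helper.py.
import torch


def _placeholder_shuffle(qweight: torch.Tensor) -> torch.Tensor:
    # Stand-in for vLLM's actual GPTQ -> ExLlama repack utility:
    # [E, K//8, N] -> [E, N, K//8], without the nibble permutation.
    return qweight.transpose(1, 2).contiguous()


def setup_hybrid_w4a16_moe(method, layer):
    for prefix in ("w13", "w2"):
        packed = _placeholder_shuffle(getattr(layer, f"{prefix}_qweight"))
        # Free the GPTQ-layout original before registering the repacked copy,
        # so weight memory stays at the checkpoint footprint.
        delattr(layer, f"{prefix}_qweight")
        torch.cuda.empty_cache()
        layer.register_parameter(
            f"{prefix}_weight_packed",
            torch.nn.Parameter(packed, requires_grad=False))
    # The helper then builds HybridW4A16MoEExperts and installs it as the
    # layer's modular MoE kernel; that wiring is omitted here.
```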
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>