
[ROCm][quant] INC: route w4a16-sym MoE through HybridW4A16 HIP path #929

Draft
mgehre-amd wants to merge 1 commit into gfx11 from matthias.inc-rocm-hybrid-w4a16-moe

Conversation


@mgehre-amd mgehre-amd commented May 8, 2026

Wires INC (Intel Neural Compressor / auto-round) quantized models into
the same HIP HybridW4A16 MoE path that compressed-tensors w4a16 already
uses on ROCm. Auto-round emits its checkpoints in `auto_round:auto_gptq`
packing (the same on-disk layout as compressed-tensors `pack_quantized`),
so the only INC-specific piece is registering the parameters under the
GPTQ names (`w*_qweight` / `w*_scales` / `w*_qzeros`) that the standard
FusedMoE expert-name mapping resolves. The conversion to ExLlama-shuffled
`[E, N, K//8]` and the `HybridW4A16MoEExperts` modular-kernel install
are reused from compressed-tensors.
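
For reference, the shared layout: GPTQ-style packing stores eight
consecutive 4-bit values along K in each int32, so `qweight` is
`[K//8, N]` per expert on disk. A minimal unpacking sketch (illustrative
only, not code from this PR; `unpack_gptq_int4` is a made-up name):

```python
import torch

def unpack_gptq_int4(qweight: torch.Tensor) -> torch.Tensor:
    """Expand a GPTQ-packed [K // 8, N] int32 qweight to [K, N] nibbles.

    For MoE this applies per expert; the HIP kernel instead consumes the
    repacked, ExLlama-shuffled [E, N, K // 8] form described above.
    """
    k8, n = qweight.shape
    shifts = torch.arange(0, 32, 4, device=qweight.device)  # 8 nibble lanes
    # broadcast to [K//8, 8, N]; the mask keeps the low 4 bits of each lane
    nibbles = (qweight.unsqueeze(1) >> shifts.view(1, 8, 1)) & 0xF
    return nibbles.reshape(k8 * 8, n)
```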

Verified on Strix Halo (gfx1151) with `Intel/Qwen3.5-35B-A3B-int4-AutoRound`:
the `_rocm_C::fused_moe_wvSplitK_int4_gemm` kernel now drives the per-token
MoE GEMMs on decode; non-MoE INT4 linears were already going through
`HybridW4A16LinearKernel` via `choose_mp_linear_kernel`.

Changes:
- `vllm/platforms/rocm.py`: add `"inc"` to `supported_quantization`
  (the dispatcher behind it ultimately picks AWQ/GPTQ kernels through
  `choose_mp_linear_kernel`, so ROCm support is no longer
  unconditionally rejected at config validation).
- `vllm/model_executor/layers/quantization/inc.py`: in
  `apply_awq_quant_layer` / `apply_gptq_quant_layer`, when the gate
  passes (`is_rocm`, 4-bit, sym, group_size > 0, FusedMoE, non-marlin;
  see the gate sketch after this list), return the new
  `INCHybridW4A16MoEMethod` instead of falling back to the generic
  `MoeWNA16Method`.
- `vllm/model_executor/layers/quantization/inc_moe.py` (new):
  `INCHybridW4A16MoEMethod` registers GPTQ-named params, drops the
  symmetric `qzeros` (the packed 7-sentinel) before the kernel sees
  them, aliases the remaining tensors to the names the helper expects,
  and installs the modular kernel (sketched after this list). Originals
  are freed after the repack so weight memory stays at the checkpoint
  footprint instead of doubling.
- `vllm/model_executor/layers/fused_moe/hybrid_w4a16_moe_helper.py`
  (new): shared `setup_hybrid_w4a16_moe(method, layer)` extracted from
  compressed-tensors; called by both backends so the conversion +
  modular-kernel install lives in one place.
- `vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe/compressed_tensors_moe_wna16.py`:
  `_process_weights_hybrid_w4a16` now delegates to the shared helper.
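
A condensed view of the `inc.py` gate above (a sketch under assumed
local names and signature, not the exact code added by this PR):

```python
from vllm.model_executor.layers.fused_moe import FusedMoE
from vllm.platforms import current_platform

def _use_hybrid_w4a16_moe(layer, weight_bits: int, sym: bool,
                          group_size: int, use_marlin: bool) -> bool:
    # Illustrative gate: route to INCHybridW4A16MoEMethod only when the
    # HybridW4A16 HIP kernel actually applies; otherwise the existing
    # MoeWNA16Method fallback is kept.
    return (
        current_platform.is_rocm()    # HybridW4A16 is a ROCm/HIP path
        and isinstance(layer, FusedMoE)
        and weight_bits == 4          # w4a16 only
        and sym                       # symmetric (no real zero points)
        and group_size > 0            # grouped scales required
        and not use_marlin            # marlin keeps its own dispatch
    )
```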
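
And a sketch of the `inc_moe.py` qzeros handling (attribute names like
`w13_qzeros` / `w2_qzeros` are assumptions based on the standard FusedMoE
naming, not verified against the PR):

```python
import torch

def _drop_sym_qzeros(layer: torch.nn.Module) -> None:
    # Symmetric GPTQ checkpoints store qzeros as the packed 7-sentinel
    # (zero point 8, stored off by one), so they carry no information;
    # drop them before the repack to stay at the checkpoint footprint.
    for name in ("w13_qzeros", "w2_qzeros"):
        qzeros = getattr(layer, name, None)
        if qzeros is not None:
            assert torch.all(qzeros == 0x77777777), "expected sym sentinel"
            delattr(layer, name)
    # After aliasing the remaining qweight/scales tensors to the names
    # the helper expects, setup_hybrid_w4a16_moe(method, layer) performs
    # the ExLlama shuffle to [E, N, K//8] and installs
    # HybridW4A16MoEExperts.
```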

Bench (gfx1151, Intel/Qwen3.5-35B-A3B-int4-AutoRound,
synthetic-mm 640x480, ISL/OSL=100/128, conc=1, --enforce-eager):
  decode 25.9 -> 36.4 tok/s (+40%), TPOT 38.7 -> 27.5 ms.
  Profile confirms `_rocm_C::fused_moe_wvSplitK_int4_gemm` is now
  on the decode hot path (was Triton MoeWNA16 before).

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>