Support Qwen3 Omni model quantization #1404
base: main
Conversation
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
for more information, see https://pre-commit.ci
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Thank you for the PR! Could you help verify all inferences (vLLM, Transformers 4, and Transformers 5) before merging?
for more information, see https://pre-commit.ci
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Quantization and inference verified with Transformers 5.1.0.
vLLM tests are currently blocked because the latest vLLM release depends on an outdated Transformers version; Qwen3-Omni requires Transformers >= 5.1.0 to address several known issues.
Pull request overview
Adds quantization support for the Qwen3-Omni MoE model family by integrating model-specific loading/version gating, calibration forward behavior for thinker/talker, and custom multimodal block discovery.
Changes:
- Added an explicit Transformers version guard for `qwen3_omni_moe`.
- Introduced Qwen3-Omni processor/template registration and model-specific multimodal block name discovery.
- Implemented a Qwen3-Omni-specific forward path that runs the thinker (and optionally the talker) during calibration; see the sketch after this list.
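A minimal sketch of what such a calibration forward could look like, assuming the `thinker`/`talker` submodels, `has_talker` flag, and `text_projection` named in the review comments below; this is an illustration, not the PR's actual implementation:

```python
import torch

def qwen3_omni_calib_forward(model, input_ids, **kwargs):
    with torch.no_grad():
        # Run the thinker so its blocks see calibration data.
        thinker_output = model.thinker(
            input_ids=input_ids, output_hidden_states=True, **kwargs
        )
        # Optionally run the talker as well, guarded as suggested below.
        if getattr(model, "has_talker", False) and getattr(model, "talker", None) is not None:
            if hasattr(model.talker, "text_projection"):
                # Project thinker hidden states into the talker embedding space.
                thinker_hidden = thinker_output.hidden_states[-1]
                talker_embeds = model.talker.text_projection(thinker_hidden)
                model.talker(inputs_embeds=talker_embeds)
    return thinker_output
```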
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pyproject.toml | Adds a project-specific word to typos’ allowlist. |
| auto_round/utils/model.py | Adds Transformers version guard and adjusts lm_head discovery logic. |
| auto_round/utils/common.py | Adds _no_split_modules normalization (sketched after this table) and extends multimodal ignore-key lists. |
| auto_round/special_model_handler.py | Adds Qwen3-Omni special forward + block discovery + ignore-layer rule. |
| auto_round/compressors/shard_writer.py | Improves tie_word_embeddings lookup for nested multimodal configs. |
| auto_round/compressors/mllm/utils.py | Extends multimodal ignore-key list for Qwen3-Omni components. |
| auto_round/compressors/mllm/template.py | Registers a Qwen3-Omni model template with the new processor. |
| auto_round/compressors/mllm/processor.py | Adds a custom processor for Qwen3-Omni chat-template inputs. |
| auto_round/compressors/base.py | Imports the new normalization helper. |
| auto_round/auto_scheme/utils.py | Uses normalized _no_split_modules when dispatching across devices. |
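A hypothetical sketch (not the PR's actual helper) of what normalizing `_no_split_modules` across nested multimodal submodels could look like; the function name and the `thinker`/`talker` attribute names are assumptions based on the Qwen3-Omni architecture:

```python
def normalize_no_split_modules(model):
    modules = set(getattr(model, "_no_split_modules", None) or [])
    # Multimodal wrappers like Qwen3-Omni nest full submodels (thinker/talker),
    # each potentially carrying its own _no_split_modules list.
    for name in ("thinker", "talker"):
        sub = getattr(model, name, None)
        if sub is not None:
            modules.update(getattr(sub, "_no_split_modules", None) or [])
    return sorted(modules)
```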
| ) | ||
|
|
||
| # Run talker forward if available (for calibration purposes) | ||
| if hasattr(model, "talker") and model.has_talker: |
Copilot AI (Feb 11, 2026):
This can raise AttributeError when model.has_talker doesn’t exist (the hasattr only checks talker). Use getattr(model, "has_talker", False) (and optionally also ensure model.talker is not None) to make this guard safe.
Suggested change:

```diff
- if hasattr(model, "talker") and model.has_talker:
+ if getattr(model, "has_talker", False) and getattr(model, "talker", None) is not None:
```
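A tiny repro of the pitfall, using nothing beyond plain Python: `hasattr` only checks the first attribute, so accessing the second can still raise.

```python
class Model:
    pass

m = Model()
m.talker = object()  # talker exists, but has_talker was never set

# hasattr(m, "talker") is True, so Python goes on to evaluate m.has_talker,
# which raises AttributeError. The getattr form short-circuits safely:
print(getattr(m, "has_talker", False) and getattr(m, "talker", None) is not None)  # False
```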
Reviewed code:

```python
# Use text projection to convert thinker embeddings to talker space
if hasattr(model.talker, "text_projection"):
    # Get thinker embeddings
    thinker_embeds = model.thinker.get_input_embeddings()(input_ids)
    talker_inputs_embeds = model.talker.text_projection(thinker_embeds)
```
Copilot AI (Feb 11, 2026):
This path assumes input_ids is provided; if calibration runs with inputs_embeds (or other modalities without input_ids), this will throw and then be silently ignored (due to the broad except), meaning the talker forward never runs. Consider deriving inputs from inputs_embeds when present, or projecting from thinker_output.hidden_states[-1] (which you already compute) instead of re-embedding input_ids.
Suggested change:

```diff
- # Use text projection to convert thinker embeddings to talker space
- if hasattr(model.talker, "text_projection"):
-     # Get thinker embeddings
-     thinker_embeds = model.thinker.get_input_embeddings()(input_ids)
-     talker_inputs_embeds = model.talker.text_projection(thinker_embeds)
+ # Use text projection to convert thinker hidden states to talker space
+ if hasattr(model.talker, "text_projection"):
+     # Project thinker hidden states directly into the talker embedding space
+     talker_inputs_embeds = model.talker.text_projection(thinker_hidden)
```
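A sketch of the reviewer's other suggested option: fall back to `inputs_embeds` when `input_ids` is absent, so the talker forward still runs during calibration. The names `model`, `input_ids`, `inputs_embeds`, and `thinker_output` are assumed from the surrounding calibration forward:

```python
if hasattr(model.talker, "text_projection"):
    if input_ids is not None:
        thinker_embeds = model.thinker.get_input_embeddings()(input_ids)
    elif inputs_embeds is not None:
        # Calibration ran with embeddings (e.g., non-text modalities).
        thinker_embeds = inputs_embeds
    else:
        # Last resort: reuse the hidden states the thinker already produced.
        thinker_embeds = thinker_output.hidden_states[-1]
    talker_inputs_embeds = model.talker.text_projection(thinker_embeds)
```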
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Description
This update adds quantization support for Qwen3-Omni by integrating a custom MLLM processor and template, implementing dedicated forward logic for thinker/talker calibration, and introducing model-specific block discovery.
Note: This feature requires Transformers >= 5.1.0, as earlier versions contain compatibility issues with Qwen3-Omni.
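A hedged sketch of the kind of guard this note implies; the exact function name, message, and placement (the file table above points to auto_round/utils/model.py) may differ in the PR:

```python
from packaging.version import Version
import transformers

def require_transformers_for_qwen3_omni():
    # qwen3_omni_moe needs transformers >= 5.1.0; earlier versions have
    # known compatibility issues with Qwen3-Omni.
    if Version(transformers.__version__) < Version("5.1.0"):
        raise ImportError(
            f"qwen3_omni_moe requires transformers >= 5.1.0, got {transformers.__version__}"
        )
```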
Type of Change
Related Issues
#1387
Fixes or relates to #
Checklist Before Submitting