
[Enhancement] support online quantization#653

Open
haoyangli0109 wants to merge 1 commit into ROCm:main from haoyangli0109:lhy/online_quantization

Conversation

@haoyangli0109 (Contributor) commented Apr 28, 2026

  1. Support linear layers mixing MXFP4 and PTPC-FP8 quantization.
  2. Support MoE layers mixing MXFP4 and PTPC-FP8 quantization.
  3. For the PTPC format and certain other necessary cases, gather all weights before quantization.
  4. Support DeepSeek (dpsk) dequantization (DQ) and quantization (Q).
  5. Check EP (expert-parallel) mode.
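The `--online_quant_config` JSON used below combines a global default (`global_quant_config`), per-layer glob overrides (`layer_quant_config`), and an exclusion list (`exclude_layer`). As a rough illustration of those semantics, here is a minimal Python sketch; the helper name `resolve_quant_method` and the precedence (exclusions win, then per-layer patterns, then the global default) are assumptions for illustration, not taken from the atom source:

```python
# Hypothetical sketch of per-layer quant-method resolution for an
# --online_quant_config spec. Precedence is assumed: exclude_layer wins,
# then layer_quant_config glob patterns, then global_quant_config.
import fnmatch
import json
from typing import Optional


def resolve_quant_method(layer_name: str, spec: dict) -> Optional[str]:
    # Excluded layers are left unquantized.
    for pattern in spec.get("exclude_layer", []):
        if fnmatch.fnmatch(layer_name, pattern):
            return None
    # Per-layer glob patterns override the global default.
    for pattern, method in spec.get("layer_quant_config", {}).items():
        if fnmatch.fnmatch(layer_name, pattern):
            return method
    return spec.get("global_quant_config")


# Same spec as the DeepSeek command below.
spec = json.loads(
    '{"global_quant_config":"ptpc_fp8",'
    '"layer_quant_config":{"*expert*":"mxfp4"},'
    '"exclude_layer":["lm_head","*.gate.*"]}'
)
print(resolve_quant_method("model.layers.0.mlp.experts.0.up_proj", spec))  # mxfp4
print(resolve_quant_method("model.layers.0.self_attn.q_proj", spec))       # ptpc_fp8
print(resolve_quant_method("lm_head", spec))                               # None
```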

ACC and Performance test

| model | TTFT online | TTFT offline | TPOT online | TPOT offline | gsm8k online | gsm8k offline |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 | 296.39 | 296.06 | 65.22 | 65.46 | 0.9484 | 0.9462 |
| DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 (mtp mode) | 339.78 | 339.86 | 26.47 | 26.39 | 0.9439 | 0.9439 |
| Qwen3-30B-A3B-Thinking-2507-ptpc | 126.93 | 126.81 | 11.55 | 11.65 | 0.6971 | 0.6861 |
| Qwen3-235B-A22B-Instruct-2507-MXFP4 | 445.52 | 450.51 | 34.03 | 34.16 | 0.8976 | 0.8961 |

Reproduction
aiter: d6e73f96141bcdb61c2cc7ed1b09d874dea8ecf8
atom: 81054f9

command:

**qwen3-30B ptpc online & offline command**
python3 -m atom.entrypoints.openai_server --model /shareddata/Qwen/Qwen3-30B-A3B-Thinking-2507 \
  -tp 4 --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"ptpc_fp8","layer_quant_config":{"*expert*":"ptpc_fp8"},"exclude_layer":["lm_head","*.gate.*"]}' 

python3 -m atom.entrypoints.openai_server --model /shareddata/amd/Qwen3-30B-A3B-Thinking-2507-ptpc \
  -tp 4 --port 5679 --server-port 7778

**deepseek-r1-0528 online & offline command**
python3 -m atom.entrypoints.openai_server --model /shareddata/deepseek-ai/DeepSeek-R1-0528 \
  --enforce-eager -tp 8 \
  --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"ptpc_fp8","layer_quant_config":{"*expert*":"mxfp4"},"exclude_layer":["lm_head","*.gate.*"]}' \
  --method mtp --num-speculative-tokens 3 

**Qwen3-235B-A22B-Instruct-2507 mxfp4 online & offline command**
python -m atom.entrypoints.openai_server \
  --model /shareddata/Qwen/Qwen3-235B-A22B-Instruct-2507 \
  -tp 2 --enable-expert-parallel \
  --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"mxfp4","exclude_layer":["lm_head","*.gate.*"]}'

  

**ACC & performance command**
lm_eval \
  --model local-completions \
  --model_args "model=model_path,base_url=http://localhost:7778/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto
  
python -m atom.benchmarks.benchmark_serving \
  --model=model_path --backend=vllm --base-url=http://localhost:7778 \
  --dataset-name=random \
  --random-input-len=1024 --random-output-len=1024 \
  --random-range-ratio=0.8 \
  --num-prompts=1280 --max-concurrency=128 \
  --request-rate=inf --ignore-eos \
  --save-result --percentile-metrics="ttft,tpot,itl,e2el"

@haoyangli0109 haoyangli0109 force-pushed the lhy/online_quantization branch from efba94e to e8fca54 Compare April 28, 2026 05:51
@haoyangli0109 haoyangli0109 force-pushed the lhy/online_quantization branch from e8fca54 to 9abf8bf Compare May 7, 2026 08:07
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
@haoyangli0109 haoyangli0109 force-pushed the lhy/online_quantization branch from 9abf8bf to 92ec964 Compare May 7, 2026 08:30
@haoyangli0109 haoyangli0109 marked this pull request as ready for review May 7, 2026 08:54
@lihaoyang-amd lihaoyang-amd requested a review from valarLip May 8, 2026 11:10
