
[Enhancement] support online quantization#653

Open
haoyangli0109 wants to merge 1 commit into ROCm:main from haoyangli0109:lhy/online_quantization

Conversation

@haoyangli0109 (Contributor) commented Apr 28, 2026

  1. Support linear layers mixing MXFP4 and PTPC-FP8 quantization.
  2. Support MoE layers mixing MXFP4 and PTPC-FP8 quantization.
  3. For the PTPC format and certain other necessary cases, gather all weights before quantization.
  4. Support DeepSeek (dpsk) dequantization (DQ) and quantization (Q).
  5. Check EP (expert-parallel) mode.
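The `--online_quant_config` JSON used below combines a global default (`global_quant_config`), per-layer glob overrides (`layer_quant_config`), and an exclusion list (`exclude_layer`). As a rough illustration of those semantics, here is a minimal Python sketch; the helper name `resolve_quant_method` and the precedence (exclusions win, then per-layer patterns, then the global default) are assumptions for illustration, not taken from the atom source:

```python
# Hypothetical sketch of per-layer quant-method resolution for an
# --online_quant_config spec. Precedence is assumed: exclude_layer wins,
# then layer_quant_config glob patterns, then global_quant_config.
import fnmatch
import json
from typing import Optional


def resolve_quant_method(layer_name: str, spec: dict) -> Optional[str]:
    # Excluded layers are left unquantized.
    for pattern in spec.get("exclude_layer", []):
        if fnmatch.fnmatch(layer_name, pattern):
            return None
    # Per-layer glob patterns override the global default.
    for pattern, method in spec.get("layer_quant_config", {}).items():
        if fnmatch.fnmatch(layer_name, pattern):
            return method
    return spec.get("global_quant_config")


# Same spec as the DeepSeek command below.
spec = json.loads(
    '{"global_quant_config":"ptpc_fp8",'
    '"layer_quant_config":{"*expert*":"mxfp4"},'
    '"exclude_layer":["lm_head","*.gate.*"]}'
)
print(resolve_quant_method("model.layers.0.mlp.experts.0.up_proj", spec))  # mxfp4
print(resolve_quant_method("model.layers.0.self_attn.q_proj", spec))       # ptpc_fp8
print(resolve_quant_method("lm_head", spec))                               # None
```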

ACC and Performance test

| model | TTFT online | TTFT offline | TPOT online | TPOT offline | gsm8k online | gsm8k offline |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 | 296.39 | 296.06 | 65.22 | 65.46 | 0.9484 | 0.9462 |
| DeepSeek-R1-0528-MXFP4-MTP-MoEFP4 (mtp mode) | 339.78 | 339.86 | 26.47 | 26.39 | 0.9439 | 0.9439 |
| Qwen3-30B-A3B-Thinking-2507-ptpc | 126.93 | 126.81 | 11.55 | 11.65 | 0.6971 | 0.6861 |
| Qwen3-235B-A22B-Instruct-2507-MXFP4 | 445.52 | 450.51 | 34.03 | 34.16 | 0.8976 | 0.8961 |

Reproduction
aiter: d6e73f96141bcdb61c2cc7ed1b09d874dea8ecf8
atom: 81054f9

command:

**qwen3-30B ptpc online & offline command**
python3 -m atom.entrypoints.openai_server --model /shareddata/Qwen/Qwen3-30B-A3B-Thinking-2507 \
  -tp 4 --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"ptpc_fp8","layer_quant_config":{"*expert*":"ptpc_fp8"},"exclude_layer":["lm_head","*.gate.*"]}' 

python3 -m atom.entrypoints.openai_server --model /shareddata/amd/Qwen3-30B-A3B-Thinking-2507-ptpc \
  -tp 4 --port 5679 --server-port 7778

**deepseek-r1-0528 online & offline command**
python3 -m atom.entrypoints.openai_server --model /shareddata/deepseek-ai/DeepSeek-R1-0528 \
  --enforce-eager -tp 8 \
  --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"ptpc_fp8","layer_quant_config":{"*expert*":"mxfp4"},"exclude_layer":["lm_head","*.gate.*"]}' \
  --method mtp --num-speculative-tokens 3 

**Qwen3-235B-A22B-Instruct-2507 mxfp4 online & offline command**
python -m atom.entrypoints.openai_server \
  --model /shareddata/Qwen/Qwen3-235B-A22B-Instruct-2507 \
  -tp 2 --enable-expert-parallel \
  --port 5679 --server-port 7778 \
  --online_quant_config '{"global_quant_config":"mxfp4","exclude_layer":["lm_head","*.gate.*"]}'

  

**ACC & performance command**
lm_eval \
  --model local-completions \
  --model_args "model=model_path,base_url=http://localhost:7778/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto
  
python -m atom.benchmarks.benchmark_serving \
  --model=model_path --backend=vllm --base-url=http://localhost:7778 \
  --dataset-name=random \
  --random-input-len=1024 --random-output-len=1024 \
  --random-range-ratio=0.8 \
  --num-prompts=1280 --max-concurrency=128 \
  --request-rate=inf --ignore-eos \
  --save-result --percentile-metrics="ttft,tpot,itl,e2el"

@haoyangli0109 haoyangli0109 force-pushed the lhy/online_quantization branch from efba94e to e8fca54 Compare April 28, 2026 05:51
@haoyangli0109 haoyangli0109 force-pushed the lhy/online_quantization branch from e8fca54 to 9abf8bf Compare May 7, 2026 08:07
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
@haoyangli0109 haoyangli0109 force-pushed the lhy/online_quantization branch from 9abf8bf to 92ec964 Compare May 7, 2026 08:30
@haoyangli0109 haoyangli0109 marked this pull request as ready for review May 7, 2026 08:54
@lihaoyang-amd lihaoyang-amd requested a review from valarLip May 8, 2026 11:10
