Optimize CPU RAM peak memory during quantization by lvliang-intel · Pull Request #1386 · intel/auto-round

lvliang-intel · 2026-02-03T07:06:07Z

Description

Optimize CPU RAM peak memory during quantization:

Two optional CPU RAM optimizations, gated by [low_cpu_mem_usage]:
cpu_stream_offload_blocks: offload block weights to disk and load them on demand during block-wise quantization, then re-offload quantized weights; restore at the end.
cpu_stream_loss: avoid caching block outputs by computing targets on-the-fly with a frozen block copy (requires [nblocks=1]).
The quantization flow caches inputs once, then processes blocks sequentially, loading/offloading weights and optionally streaming loss to keep peak CPU RAM low.
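The offload cycle described above (offload, load on demand, re-offload, restore at the end) can be sketched roughly as follows. This is a minimal stdlib-only illustration, not the PR's implementation: `DiskOffloader`, `fake_quantize`, and `quantize_streaming` are hypothetical names, and plain Python lists stand in for tensors.

```python
import os
import pickle
import tempfile


class DiskOffloader:
    """Hypothetical helper: keeps at most one block's weights in RAM."""

    def __init__(self, offload_dir):
        self.offload_dir = offload_dir

    def _path(self, name):
        return os.path.join(self.offload_dir, f"{name}.pkl")

    def offload(self, name, block):
        # Write the weights to disk, then drop the in-memory reference.
        with open(self._path(name), "wb") as f:
            pickle.dump(block["weights"], f)
        block["weights"] = None

    def load(self, name, block):
        # Bring the weights back only when the block is about to be used.
        with open(self._path(name), "rb") as f:
            block["weights"] = pickle.load(f)


def fake_quantize(weights):
    # Stand-in for the real tuning step: round each weight to one decimal.
    return [round(w, 1) for w in weights]


def quantize_streaming(blocks, offloader):
    # 1) Offload every block up front so none of them stays resident.
    for name, block in blocks.items():
        offloader.offload(name, block)
    # 2) Process blocks sequentially: load, quantize, re-offload.
    for name, block in blocks.items():
        offloader.load(name, block)
        block["weights"] = fake_quantize(block["weights"])
        offloader.offload(name, block)
    # 3) Restore all (now quantized) weights at the end.
    for name, block in blocks.items():
        offloader.load(name, block)
    return blocks


with tempfile.TemporaryDirectory() as tmp:
    blocks = {
        "block0": {"weights": [0.12, 0.57]},
        "block1": {"weights": [1.94, -0.33]},
    }
    result = quantize_streaming(blocks, DiskOffloader(tmp))
```

In this sketch only one block's weights are resident during step 2, which is the property the option relies on to lower peak RAM; the cost is extra disk I/O per block.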

Test

Quantize Qwen/Qwen3-4B-Instruct-2507 with AutoRound (4-bit) and compare CPU RAM peak usage with different optimization options.

Optimization options:

cpu_stream_offload_blocks: Offload block weights to disk, load on demand
cpu_stream_loss: Compute loss on-the-fly using frozen block copy
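To illustrate the streaming-loss idea, the sketch below recomputes each reference output from a frozen copy of the block instead of keeping a list of cached outputs. It is a toy under stated assumptions: `ToyBlock`, `mse`, and `streamed_loss` are hypothetical names, and the one-parameter block and the manual change to `scale` merely stand in for a real block and its quantization.

```python
import copy


class ToyBlock:
    """Hypothetical one-parameter block: y = scale * x."""

    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return [self.scale * v for v in x]


def mse(a, b):
    # Mean squared error between two equally sized lists.
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def streamed_loss(block, frozen, cached_inputs):
    # The reference output is recomputed per sample from the frozen copy,
    # so no list of cached block outputs is held in memory.
    total = 0.0
    for x in cached_inputs:
        target = frozen.forward(x)  # computed on the fly, then discarded
        total += mse(block.forward(x), target)
    return total / len(cached_inputs)


block = ToyBlock(scale=2.0)
frozen = copy.deepcopy(block)  # snapshot taken before the block is modified
block.scale = 1.5              # stand-in for a quantization-induced change

inputs = [[1.0, 2.0], [3.0, 4.0]]
loss = streamed_loss(block, frozen, inputs)
```

The trade is one extra copy of a single block in exchange for not storing every block output for the calibration set.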

Summary: Peak RAM Comparison

Configuration	Peak RAM (GB)	Time (s)	RAM Saved
Baseline	24.29	1582.3	baseline
+ offload_blocks	20.26	1609.1	-4.03 GB
+ stream_loss	21.31	1364.0	-2.98 GB
All optimizations	15.57	1269.3	-8.72 GB
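The "RAM Saved" column is the baseline peak minus each configuration's peak. A short check of that arithmetic, using only the peak values reported in the table and adding the relative reduction for context:

```python
# Peak RAM in GB, copied from the table above.
results = {
    "Baseline": 24.29,
    "+ offload_blocks": 20.26,
    "+ stream_loss": 21.31,
    "All optimizations": 15.57,
}

baseline = results["Baseline"]
savings = {}
for name, peak in results.items():
    saved_gb = baseline - peak
    # Store (absolute saving in GB, relative saving in percent of baseline).
    savings[name] = (round(saved_gb, 2), round(100 * saved_gb / baseline, 1))

for name, (gb, pct) in savings.items():
    print(f"{name:<18} saved {gb:5.2f} GB ({pct:4.1f}% of baseline)")
```

With both options enabled, the reported peak drops by roughly 36% relative to the baseline run.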

Type of Change

Related Issues

Fixes or relates to #

Checklist Before Submitting

My code has been tested locally.
Documentation has been updated as needed.
New or updated tests are included where applicable.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Copilot

Pull request overview

This PR optimizes CPU RAM usage during model quantization by introducing two optional streaming strategies. The changes enable efficient quantization of large models by reducing peak memory consumption through block-wise weight offloading to disk and on-the-fly loss computation.

Changes:

Added CPU RAM optimization options (cpu_stream_offload_blocks and cpu_stream_loss) to reduce memory usage during quantization
Modified export logic to only save quantization config attributes that differ from scheme defaults
Added comprehensive test for CPU RAM optimization with memory tracking
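The export change (saving only attributes that differ from the scheme defaults) amounts to a dictionary diff. A minimal sketch, assuming a hypothetical helper `non_default_attrs` and illustrative keys; the actual export code in the PR may be structured differently:

```python
_MISSING = object()  # sentinel so keys absent from the defaults are always kept


def non_default_attrs(layer_cfg, scheme_defaults):
    """Return only the entries of layer_cfg that differ from scheme_defaults."""
    return {
        key: value
        for key, value in layer_cfg.items()
        if scheme_defaults.get(key, _MISSING) != value
    }


# Illustrative values only; the keys are assumptions for this example.
scheme_defaults = {"bits": 4, "group_size": 128, "sym": True}
layer_cfg = {"bits": 8, "group_size": 128, "sym": True}

extra_config_entry = non_default_attrs(layer_cfg, scheme_defaults)
print(extra_config_entry)
```

A layer that matches the scheme exactly yields an empty dict, which is why the updated tests check for the absence of default values rather than their presence.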

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
auto_round/compressors/base.py	Core implementation of CPU RAM optimization with block offloading and streaming loss computation
auto_round/utils/model.py	Added utility functions for saving/loading/clearing module weights to support offloading
auto_round/export/export_to_autoround/export.py	Modified to only save non-default config attributes in extra_config
auto_round/export/export_to_autoround/export_to_fp8.py	Modified to only save non-default config attributes in extra_config
auto_round/export/export_to_autoround/export_to_nvfp_mxfp.py	Modified to only save non-default config attributes in extra_config
test/test_cuda/advanced/test_cpu_ram_optimization.py	New test file to validate CPU RAM optimization features
test/test_cuda/quantization/test_mix_bits.py	Updated assertions to verify only non-default attributes are saved
test/test_cpu/quantization/test_mix_bits.py	Updated assertions to verify only non-default attributes are saved
test/test_cuda/integrations/test_sglang.py	Updated test configuration and assertions
test/test_cpu/quantization/test_act_quantization.py	Removed assertions for default config values
test/test_cuda/export/test_gguf.py	Changed device specification from integer to string format
auto_round/auto_scheme/utils.py	Added fallback device handling for string device specifications


lvliang-intel · 2026-02-09T06:42:46Z

Support AutoScheme CPU RAM Optimization:

CUDA_VISIBLE_DEVICES=0 python compare_auto_scheme_ram.py --model /models/Qwen2.5-3B-Instruct/ --nsamples 8 --seqlen 256 --batch-size 1
[Result] low_cpu_mem_usage=False -> 'peak_ram': 13.65GB, 'peak_vram': 3.44GB
[Result] low_cpu_mem_usage=False -> time=688.6s
[Result] low_cpu_mem_usage=True -> 'peak_ram': 8.83GB, 'peak_vram': 3.44GB
[Result] low_cpu_mem_usage=True -> time=637.2s
=== Summary ===

Configuration	Peak RAM (GB)	Time (s)	RAM Saved
Baseline	13.65	688.6	--
Optimizations	8.83	637.2	4.82 GB
Ratio	0.65x	0.92x
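The comparison script itself is not shown in this thread, so how it records `peak_ram` is unknown. One common way to read a process's peak resident memory on Unix-like systems is the standard-library `resource` module; the helper name `peak_rss_gb` below is hypothetical.

```python
import resource
import sys


def peak_rss_gb():
    """Peak resident set size of the current process, in GiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_bytes = peak if sys.platform == "darwin" else peak * 1024
    return peak_bytes / 1024**3


before = peak_rss_gb()
data = [float(i) for i in range(200_000)]  # some work that allocates memory
after = peak_rss_gb()
print(f"peak RSS before: {before:.3f} GiB, after: {after:.3f} GiB")
```

Because this is a high-water mark, it never decreases within a process, so each configuration would need to run in its own process for the peaks to be comparable.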

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Optimize CPU RAM peak memory during quantization

dee1db7

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Copilot AI review requested due to automatic review settings February 3, 2026 07:06

Merge branch 'main' into lvl/ram_usage_optimization

459ee8a

Copilot AI reviewed Feb 3, 2026

auto_round/compressors/base.py (resolved)

WeiweiZhang1 and others added 4 commits February 3, 2026 07:17

rm duplicate args of the quantization extra config (#1334)

2a78a18

Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

fix --device_map cuda and xpu issue (#1383)

e00c176

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d6d9f77

for more information, see https://pre-commit.ci

refine test case

ca55ae8

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

n1ck-guo reviewed Feb 4, 2026

auto_round/compressors/base.py (outdated, resolved)

yiliu30 and others added 8 commits February 4, 2026 03:10

Disable replace FP8Expert (#1379)

7a3dcac

Signed-off-by: yiliu30 <yi4.liu@intel.com>

Support general MOE replacement for MOE models (Transformers 5.0 compatible) (#1374)

082bf4c

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

fix cuda ut fail (#1370)

dd45c31

Signed-off-by: n1ck-guo <heng.guo@intel.com>

[Regression] Detach scale tensor to prevent holding computation graph references (#1389)

10028e8

fix layer config (#1373)

b2dff81

Signed-off-by: n1ck-guo <heng.guo@intel.com> Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com> Co-authored-by: n1ck-guo <heng.guo@intel.com> Co-authored-by: WeiweiZhang1 <weiwei1.zhang@intel.com> Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Merge branch 'main' into lvl/ram_usage_optimization

5614894

update code for comments

a041da8

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e62a708

for more information, see https://pre-commit.ci

lvliang-intel added 2 commits February 9, 2026 06:55

support AutoScheme cpu ram optimization

5dcd064

Signed-off-by: lvliang-intel <liang1.lv@intel.com>

Merge branch 'main' into lvl/ram_usage_optimization

10f0a4a

lvliang-intel wants to merge 16 commits into main from lvl/ram_usage_optimization


6 participants
