Optimize CPU RAM peak memory during quantization #1386
lvliang-intel wants to merge 16 commits into main
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Pull request overview
This PR optimizes CPU RAM usage during model quantization by introducing two optional streaming strategies. The changes enable efficient quantization of large models by reducing peak memory consumption through block-wise weight offloading to disk and on-the-fly loss computation.
Changes:
- Added CPU RAM optimization options (`cpu_stream_offload_blocks` and `cpu_stream_loss`) to reduce memory usage during quantization
- Modified export logic to only save quantization config attributes that differ from scheme defaults
- Added comprehensive test for CPU RAM optimization with memory tracking
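The export change above, saving only the attributes that differ from the scheme defaults, can be illustrated with a minimal sketch. The helper name `non_default_attrs` and the example keys and values are assumptions made for illustration; they are not taken from the PR's export code.

```python
def non_default_attrs(layer_cfg: dict, scheme_defaults: dict) -> dict:
    """Keep only the entries whose value differs from the scheme default."""
    return {
        key: value
        for key, value in layer_cfg.items()
        if key not in scheme_defaults or scheme_defaults[key] != value
    }


# Hypothetical scheme defaults and per-layer settings, for illustration only.
scheme_defaults = {"bits": 4, "group_size": 128, "sym": True}
layer_cfg = {"bits": 8, "group_size": 128, "sym": True}

extra = non_default_attrs(layer_cfg, scheme_defaults)
print(extra)  # {'bits': 8}
```

A layer that matches the scheme on every attribute yields an empty dict, so it would contribute nothing to the saved config under this approach.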
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| auto_round/compressors/base.py | Core implementation of CPU RAM optimization with block offloading and streaming loss computation |
| auto_round/utils/model.py | Added utility functions for saving/loading/clearing module weights to support offloading |
| auto_round/export/export_to_autoround/export.py | Modified to only save non-default config attributes in extra_config |
| auto_round/export/export_to_autoround/export_to_fp8.py | Modified to only save non-default config attributes in extra_config |
| auto_round/export/export_to_autoround/export_to_nvfp_mxfp.py | Modified to only save non-default config attributes in extra_config |
| test/test_cuda/advanced/test_cpu_ram_optimization.py | New test file to validate CPU RAM optimization features |
| test/test_cuda/quantization/test_mix_bits.py | Updated assertions to verify only non-default attributes are saved |
| test/test_cpu/quantization/test_mix_bits.py | Updated assertions to verify only non-default attributes are saved |
| test/test_cuda/integrations/test_sglang.py | Updated test configuration and assertions |
| test/test_cpu/quantization/test_act_quantization.py | Removed assertions for default config values |
| test/test_cuda/export/test_gguf.py | Changed device specification from integer to string format |
| auto_round/auto_scheme/utils.py | Added fallback device handling for string device specifications |
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
for more information, see https://pre-commit.ci
…atible) (#1374) Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com>
Signed-off-by: n1ck-guo <heng.guo@intel.com> Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com> Co-authored-by: n1ck-guo <heng.guo@intel.com> Co-authored-by: WeiweiZhang1 <weiwei1.zhang@intel.com> Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Support AutoScheme CPU RAM Optimization:
`CUDA_VISIBLE_DEVICES=0 python compare_auto_scheme_ram.py --model /models/Qwen2.5-3B-Instruct/ --nsamples 8 --seqlen 256 --batch-size 1`
Description
Optimize CPU RAM peak memory during quantization:
Two optional CPU RAM optimizations, gated by `low_cpu_mem_usage`:
- `cpu_stream_offload_blocks`: offload block weights to disk and load them on demand during block-wise quantization, then re-offload the quantized weights; all weights are restored at the end.
- `cpu_stream_loss`: avoid caching block outputs by computing targets on-the-fly with a frozen copy of the block (requires `nblocks=1`).
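The idea behind streaming the loss can be sketched in a few lines: rather than storing the reference output of every batch up front, keep a frozen copy of the block and recompute each target just before it is needed. This is a toy illustration, not the PR's implementation; `ToyBlock`, `mse`, and both loss functions are hypothetical names, and plain Python lists stand in for tensors.

```python
import copy


class ToyBlock:
    """Stand-in for one block: multiplies every input value by a scale."""

    def __init__(self, scale):
        self.scale = scale

    def __call__(self, xs):
        return [self.scale * x for x in xs]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def loss_with_cached_targets(block, tuned, batches):
    # Baseline: store the reference output of every batch before tuning.
    cached = [block(xs) for xs in batches]
    return sum(mse(tuned(xs), t) for xs, t in zip(batches, cached))


def loss_streaming(block, tuned, batches):
    # Streaming: keep a frozen copy and recompute each target when needed.
    frozen = copy.deepcopy(block)
    total = 0.0
    for xs in batches:
        target = frozen(xs)  # only one target is alive per iteration
        total += mse(tuned(xs), target)
    return total


batches = [[1.0, 2.0], [3.0, 4.0]]
reference = ToyBlock(scale=2.0)
tuned = ToyBlock(scale=1.5)

cached_loss = loss_with_cached_targets(reference, tuned, batches)
stream_loss = loss_streaming(reference, tuned, batches)
```

Both variants produce the same loss. The streaming one trades an extra forward pass per batch for not holding every block output in memory at once.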
The quantization flow caches inputs once, then processes blocks sequentially, loading/offloading weights and optionally streaming loss to keep peak CPU RAM low.
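The sequential flow described above can be sketched as follows, assuming a toy `Block` class whose weights are plain lists and using `pickle` files as the offload target. None of these names come from the PR; the real code operates on model modules and uses the utilities added in `auto_round/utils/model.py`, whose signatures are not shown here.

```python
import pickle
import tempfile
from pathlib import Path


class Block:
    """Toy stand-in for a transformer block; weights is a list of floats."""

    def __init__(self, weights):
        self.weights = weights


def offload(block, path):
    """Write the block's weights to disk and drop them from RAM."""
    with open(path, "wb") as f:
        pickle.dump(block.weights, f)
    block.weights = None


def load(block, path):
    """Read the block's weights back from disk."""
    with open(path, "rb") as f:
        block.weights = pickle.load(f)


def toy_quantize(weights, step=0.5):
    """Placeholder for the real tuning step: round to a fixed grid."""
    return [round(w / step) * step for w in weights]


def quantize_streaming(blocks, workdir):
    paths = [Path(workdir) / f"block_{i}.pkl" for i in range(len(blocks))]
    # 1. Offload every block so no weights stay resident.
    for block, path in zip(blocks, paths):
        offload(block, path)
    peak_resident = 0
    # 2. Process blocks one at a time: load, quantize, re-offload.
    for block, path in zip(blocks, paths):
        load(block, path)
        block.weights = toy_quantize(block.weights)
        resident = sum(b.weights is not None for b in blocks)
        peak_resident = max(peak_resident, resident)
        offload(block, path)
    # 3. Restore all (now quantized) weights at the end.
    for block, path in zip(blocks, paths):
        load(block, path)
    return peak_resident


blocks = [Block([0.1 * i, 0.26 + i, -0.74]) for i in range(4)]
with tempfile.TemporaryDirectory() as tmp:
    peak = quantize_streaming(blocks, tmp)
```

In this sketch at most one block's weights are resident while the loop runs, which is the property that keeps peak memory low; the cost is one extra disk write and read per block.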
Test
Quantize Qwen/Qwen3-4B-Instruct-2507 with AutoRound (4-bit) and compare CPU RAM peak usage with different optimization options.
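One way to compare peak memory between two strategies is sketched below using the standard-library `tracemalloc` module. This is an assumption about how such a comparison could be done, not the measurement method used in the PR's test. Note that `tracemalloc` only sees allocations made through Python's allocator, so it is not a substitute for process-level RSS when native tensor libraries are involved. The functions `hold_all` and `one_at_a_time` are hypothetical workloads.

```python
import tracemalloc


def peak_python_alloc(fn, *args, **kwargs):
    """Run fn and return (result, peak bytes allocated via Python)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def hold_all(n_blocks, size):
    # Keeps every buffer alive at once, like an unoptimized run.
    blocks = [bytearray(size) for _ in range(n_blocks)]
    return len(blocks)


def one_at_a_time(n_blocks, size):
    # Releases each buffer before allocating the next one.
    count = 0
    for _ in range(n_blocks):
        block = bytearray(size)
        count += 1
        del block
    return count


_, peak_all = peak_python_alloc(hold_all, 8, 1_000_000)
_, peak_stream = peak_python_alloc(one_at_a_time, 8, 1_000_000)
print(f"all resident: {peak_all} bytes, streamed: {peak_stream} bytes")
```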
Optimization options:
Summary: Peak RAM Comparison