
Conversation


avtc (Contributor) commented on Jan 1, 2026

@Qubitium I have spent endless days and nights of debugging and finally figured out why offload_to_disk=True uses more VRAM on cuda:0 than offload_to_disk=False. I tried many tools: claude code + glm 4.7, kilo + glm 4.7, and antigravity with opus 4.5. They helped with adding logs, but even when looking at the logs they could not figure out why this happens. Only aider-desk + glm 4.7 (but without aider itself) reached the correct hypothesis: with offload_to_disk=False, the base modules and the captured inputs are placed on CPU, and that is why VRAM usage is lower. An excessive input-capture loop was also identified and removed for the tp-pre-pad processor.

This PR makes offload_to_disk=True behave like offload_to_disk=False by materializing the base modules to CPU and storing the captured inputs on CPU. It does not yet change how outputs are stored after the forward pass (they are still stored on cur_layer_device), because I do not want to introduce merge conflicts with the moe_disable_router feature right now. I also have a feature in a separate branch where calibration_data_device can be set, plus a balanced mode, waiting for the merge of moe_disable_router, so it is better to wait and then propose an improvement for storing outputs on the proper device.
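
The change boils down to the device handling during input capture. A minimal sketch of the idea (not the actual diff; capture_inputs_on_cpu, base_modules, and run_forward are hypothetical names standing in for the looper's real code):

    import torch

    def capture_inputs_on_cpu(base_modules, calibration_batches, run_forward):
        # base_modules and run_forward are hypothetical stand-ins for the looper's
        # real objects; only the device handling is the point here.
        for module in base_modules:
            module.to("cpu")  # materialize base modules on CPU, mirroring offload_to_disk=False

        captured = []
        with torch.no_grad():
            for batch in calibration_batches:
                hidden_states = run_forward(batch)        # forward pass used for input capture
                captured.append(hidden_states.to("cpu"))  # keep captured inputs off cuda:0
        return captured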

Here are some debug logs as evidence. I used 256 samples from c4/en with qwen3-coder-30b-a3b-instruct.
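
For context, a rough sketch of how such a comparison run can be set up with GPTQModel (the exact calibration slice and where offload_to_disk is passed are assumptions, not the script actually used):

    from datasets import load_dataset
    from gptqmodel import GPTQModel, QuantizeConfig

    # 256 calibration texts from c4/en (the exact shard/slice is an assumption)
    calibration = load_dataset(
        "allenai/c4", data_files="en/c4-train.00000-of-01024.json.gz", split="train"
    ).select(range(256))["text"]

    qcfg = QuantizeConfig(bits=4, group_size=128)
    model = GPTQModel.load("Qwen/Qwen3-Coder-30B-A3B-Instruct", qcfg)
    # offload_to_disk=True vs. False was toggled between the two runs compared
    # below; where that flag is passed depends on the GPTQModel version.
    model.quantize(calibration, batch_size=1)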

offload_to_disk=False

INFO  [VRAM-DEBUG] ========== Starting Input Capture ==========
INFO  gc.collect() reclaimed 113 objects in 0.152s
INFO  [VRAM-DEBUG] cuda:0 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:1 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:2 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:3 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  ModuleLooper: capturing layer inputs from 256 calibration batches
INFO  [VRAM-DEBUG] ===== INITIAL STATE (Line 75-76) =====
INFO  [VRAM-DEBUG] cur_layer_device=cpu, data_device=cpu
INFO  [VRAM-DEBUG] quantize_config.device=DEVICE.CUDA
INFO  [VRAM-DEBUG] ===== AFTER MATERIALIZATION (Line 180) =====
INFO  [VRAM-DEBUG] cur_layer_device=cpu, data_device=cpu
INFO  [VRAM-DEBUG] Base modules to materialize: ['model.embed_tokens', 'model.norm', 'model.rotary_emb']
INFO  [VRAM-DEBUG] model.embed_tokens device before materialize: cpu
INFO  [VRAM-DEBUG] model.norm device before materialize: cpu
INFO  [VRAM-DEBUG] model.rotary_emb device before materialize: cpu
INFO  [VRAM-DEBUG] ========== Input Capture: After Base Module Materialization ==========
INFO  gc.collect() reclaimed 40 objects in 0.156s
INFO  [VRAM-DEBUG] cuda:0 - Allocated: 1.17GB, Reserved: 1.17GB
INFO  [VRAM-DEBUG] cuda:1 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:2 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:3 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] ===== INPUT CAPTURE (Batch 1) =====
INFO  [VRAM-DEBUG] data_device=cpu, cur_layer_device=cpu
INFO  [VRAM-DEBUG] hidden_states device=cpu
INFO  [VRAM-DEBUG] ===== INPUT CAPTURE (Batch 2) =====
INFO  [VRAM-DEBUG] data_device=cpu, cur_layer_device=cpu
INFO  [VRAM-DEBUG] hidden_states device=cpu
INFO  [VRAM-DEBUG] ===== INPUT CAPTURE (Batch 256) =====
INFO  [VRAM-DEBUG] data_device=cpu, cur_layer_device=cpu
INFO  [VRAM-DEBUG] hidden_states device=cpu
INFO  [VRAM-DEBUG] ========== Input Capture: After Capture Loop (before InputCache) ==========
INFO  gc.collect() reclaimed 40 objects in 0.157s
INFO  [VRAM-DEBUG] cuda:0 - Allocated: 1.17GB, Reserved: 1.17GB
INFO  [VRAM-DEBUG] cuda:1 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:2 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:3 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] InputCache total size: 0.00MB (0 tensors across 256 batches)
INFO  [VRAM-DEBUG] ========== Input Capture: After InputCache Creation ==========
INFO  gc.collect() reclaimed 40 objects in 0.166s
INFO  [VRAM-DEBUG] cuda:0 - Allocated: 1.17GB, Reserved: 1.17GB
INFO  [VRAM-DEBUG] cuda:1 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:2 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:3 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] ========== Input Capture Complete ==========
INFO  gc.collect() reclaimed 40 objects in 0.157s
INFO  [VRAM-DEBUG] cuda:0 - Allocated: 1.17GB, Reserved: 1.17GB
INFO  [VRAM-DEBUG] cuda:1 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:2 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:3 - Allocated: 0.01GB, Reserved: 0.02GB

vs offload_to_disk=True:

INFO  [VRAM-DEBUG] ========== Starting Input Capture ==========
INFO  gc.collect() reclaimed 78 objects in 0.202s
INFO  [VRAM-DEBUG] cuda:0 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:1 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:2 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:3 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] Suspending turtle reload during input capture (original threshold: 0.50GB)
INFO  ModuleLooper: capturing layer inputs from 256 calibration batches
INFO  [VRAM-DEBUG] ===== INITIAL STATE (Line 75-76) =====
INFO  [VRAM-DEBUG] cur_layer_device=meta, data_device=meta
INFO  [VRAM-DEBUG] quantize_config.device=DEVICE.CUDA
INFO  [VRAM-DEBUG] shell_module_materialize: Qwen3MoeDecoderLayer | device_before=meta -> target_device=DEVICE.CUDA | VRAM_before=0.01GB
INFO  gc.collect() reclaimed 45 objects in 0.199s
INFO  [VRAM-DEBUG] shell_module_materialize: Qwen3MoeDecoderLayer | VRAM_after=1.17GB | delta=1.16GB
INFO  [VRAM-DEBUG] _maybe_auto_reload_after_alias: module=Qwen3MoeDecoderLayer, bytes_added=1.16GB, accum=1.16GB, threshold=infGB
INFO  [VRAM-DEBUG] ===== AFTER MATERIALIZATION (Line 178) =====
INFO  [VRAM-DEBUG] cur_layer_device=DEVICE.CUDA, data_device=meta
INFO  [VRAM-DEBUG] Base modules to materialize: ['model.embed_tokens', 'model.norm', 'model.rotary_emb']
INFO  [VRAM-DEBUG] model.embed_tokens device before materialize: meta
INFO  [VRAM-DEBUG] Materializing model.embed_tokens from meta to CPU first (to avoid extra VRAM allocation)
INFO  [VRAM-DEBUG] shell_module_materialize: Embedding | device_before=meta -> target_device=cpu | VRAM_before=1.17GB
INFO  gc.collect() reclaimed 10 objects in 0.202s
INFO  [VRAM-DEBUG] shell_module_materialize: Embedding | VRAM_after=1.17GB | delta=0.00GB
INFO  [VRAM-DEBUG] _maybe_auto_reload_after_alias: module=Embedding, bytes_added=0.58GB, accum=1.74GB, threshold=infGB
INFO  [VRAM-DEBUG] Moving model.embed_tokens from CPU to CUDA
INFO  [VRAM-DEBUG] model.norm device before materialize: meta
INFO  [VRAM-DEBUG] Materializing model.norm from meta to CPU first (to avoid extra VRAM allocation)
INFO  [VRAM-DEBUG] shell_module_materialize: Qwen3MoeRMSNorm | device_before=meta -> target_device=cpu | VRAM_before=1.75GB
INFO  gc.collect() reclaimed 11 objects in 0.205s
INFO  [VRAM-DEBUG] shell_module_materialize: Qwen3MoeRMSNorm | VRAM_after=1.75GB | delta=0.00GB
INFO  [VRAM-DEBUG] _maybe_auto_reload_after_alias: module=Qwen3MoeRMSNorm, bytes_added=0.00GB, accum=1.74GB, threshold=infGB
INFO  [VRAM-DEBUG] Moving model.norm from CPU to CUDA
INFO  [VRAM-DEBUG] model.rotary_emb device before materialize: meta
INFO  [VRAM-DEBUG] Materializing model.rotary_emb from meta to CPU first (to avoid extra VRAM allocation)
INFO  [VRAM-DEBUG] shell_module_materialize: Qwen3MoeRotaryEmbedding | device_before=meta -> target_device=cpu | VRAM_before=1.75GB
INFO  gc.collect() reclaimed 10 objects in 0.214s
INFO  [VRAM-DEBUG] shell_module_materialize: Qwen3MoeRotaryEmbedding | VRAM_after=1.75GB | delta=0.00GB
INFO  [VRAM-DEBUG] _maybe_auto_reload_after_alias: module=Qwen3MoeRotaryEmbedding, bytes_added=0.00GB, accum=1.74GB, threshold=infGB
INFO  [VRAM-DEBUG] Moving model.rotary_emb from CPU to CUDA
INFO  [VRAM-DEBUG] ========== Input Capture: After Base Module Materialization ==========
INFO  gc.collect() reclaimed 5 objects in 0.199s
INFO  [VRAM-DEBUG] cuda:0 - Allocated: 1.75GB, Reserved: 1.76GB
INFO  [VRAM-DEBUG] cuda:1 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:2 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:3 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] ===== INPUT CAPTURE (Batch 1) =====
INFO  [VRAM-DEBUG] data_device=DEVICE.CUDA, cur_layer_device=DEVICE.CUDA
INFO  [VRAM-DEBUG] hidden_states device=cuda:0
INFO  [VRAM-DEBUG] ===== INPUT CAPTURE (Batch 2) =====
INFO  [VRAM-DEBUG] data_device=DEVICE.CUDA, cur_layer_device=DEVICE.CUDA
INFO  [VRAM-DEBUG] hidden_states device=cuda:0
INFO  [VRAM-DEBUG] ===== INPUT CAPTURE (Batch 256) =====
INFO  [VRAM-DEBUG] data_device=DEVICE.CUDA, cur_layer_device=DEVICE.CUDA
INFO  [VRAM-DEBUG] hidden_states device=cuda:0
INFO  [VRAM-DEBUG] ========== Input Capture: After Capture Loop (before InputCache) ==========
INFO  gc.collect() reclaimed 40 objects in 0.207s
INFO  [VRAM-DEBUG] cuda:0 - Allocated: 2.25GB, Reserved: 2.26GB
INFO  [VRAM-DEBUG] cuda:1 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:2 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:3 - Allocated: 0.01GB, Reserved: 0.02GB
/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/__init__.py:1136: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(obj, torch.Tensor)
INFO  [VRAM-DEBUG] ===== Objects in VRAM after input capture loop =====
INFO  [VRAM-DEBUG]   Parameter: count=389, total=1781.5MB, shapes=['torch.Size([768, 2048])', 'torch.Size([768, 2048])', 'torch.Size([768, 2048])']
INFO  [VRAM-DEBUG]   Tensor: count=125, total=387.8MB, shapes=['torch.Size([1, 1147, 2048])', 'torch.Size([1, 1684, 2048])', 'torch.Size([1, 1286, 2048])']
INFO  [VRAM-DEBUG]   Total tracked: 2169.3MB
INFO  [VRAM-DEBUG] InputCache total size: 444.98MB (256 tensors across 256 batches)
INFO  [VRAM-DEBUG] ========== Input Capture: After InputCache Creation ==========
INFO  gc.collect() reclaimed 44 objects in 0.187s
INFO  [VRAM-DEBUG] cuda:0 - Allocated: 2.25GB, Reserved: 2.26GB
INFO  [VRAM-DEBUG] cuda:1 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:2 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:3 - Allocated: 0.01GB, Reserved: 0.02GB
/home/ubuntu/venvs/gptqmodelt/lib/python3.13t/site-packages/torch/__init__.py:1136: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
  return isinstance(obj, torch.Tensor)
INFO  [VRAM-DEBUG] ===== Objects in VRAM after InputCache creation =====
INFO  [VRAM-DEBUG]   Parameter: count=389, total=1781.5MB, shapes=['torch.Size([768, 2048])', 'torch.Size([768, 2048])', 'torch.Size([768, 2048])']
INFO  [VRAM-DEBUG]   Tensor: count=125, total=387.8MB, shapes=['torch.Size([1, 1147, 2048])', 'torch.Size([1, 1684, 2048])', 'torch.Size([1, 1286, 2048])']
INFO  [VRAM-DEBUG]   Total tracked: 2169.3MB
INFO  [VRAM-DEBUG] ========== Input Capture Complete ==========
INFO  gc.collect() reclaimed 143 objects in 0.238s
INFO  [VRAM-DEBUG] cuda:0 - Allocated: 2.25GB, Reserved: 2.26GB
INFO  [VRAM-DEBUG] cuda:1 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:2 - Allocated: 0.01GB, Reserved: 0.02GB
INFO  [VRAM-DEBUG] cuda:3 - Allocated: 0.01GB, Reserved: 0.02GB

I don't know if it was a bug, but after layers[0] = layers[0].to(self.gptq_model.quantize_config.device), cur_layer_device was not updated from CPU to self.gptq_model.quantize_config.device in the offload_to_disk=False case. So the base modules were first moved to cuda:0 by layers[0] = layers[0].to(self.gptq_model.quantize_config.device), and then back to CPU by:

            if module is not None:
                self.gptq_model.shell_module_materialize(
                    target_submodule=module,
                    device=cur_layer_device,
                )
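
Spelled out as a timeline (comments only; a sketch of the sequence described above, not the exact looper code):

    # Sequence for offload_to_disk=False, as described above:
    #   1. layers[0] = layers[0].to(quantize_config.device)       -> modules moved to cuda:0
    #   2. cur_layer_device was never updated and still reads cpu
    #   3. shell_module_materialize(..., device=cur_layer_device) -> moved back to CPU
    # Net effect: base modules and captured inputs end up on CPU, matching the
    # lower VRAM usage seen in the offload_to_disk=False logs.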

Qubitium merged commit 377e198 into ModelCloud:main on Jan 4, 2026.