fix: offload_to_disk=True uses more vram than offload_to_disk=False #2325
+20
−18
@Qubitium I have spent endless days and nights debugging and finally figured out why `offload_to_disk=True` uses more VRAM on `cuda:0` than `offload_to_disk=False`. I tried many tools: claude code + glm 4.7, kilo + glm 4.7, and antigravity with opus 4.5. They helped with adding logs, but even while looking at the logs they could not figure out why this happens. Only aider-desk + glm 4.7 (but without aider itself) reached the correct hypothesis: with `offload_to_disk=False` the base modules and the captured inputs are placed on CPU, which is why VRAM usage is lower (see the sketch below). An excessive input-capture loop was also identified and removed for the tp-pre-pad processor.
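For illustration only, here is a minimal, hypothetical sketch (not the GPTQModel code; the hook and list names are made up) of what "placing captured inputs on CPU" means in practice: calibration inputs are detached and moved to CPU at capture time, so they do not sit in VRAM on `cuda:0` while later layers are processed.

```python
# Hypothetical sketch, not GPTQModel's implementation: capture a layer's
# calibration inputs on CPU instead of keeping them on cuda:0.
import torch
import torch.nn as nn

captured_inputs = []  # calibration inputs, stored on CPU


def capture_to_cpu(module, args, kwargs):
    # Detach and move every tensor argument to CPU before storing it.
    cpu_args = tuple(a.detach().cpu() if torch.is_tensor(a) else a for a in args)
    cpu_kwargs = {k: v.detach().cpu() if torch.is_tensor(v) else v for k, v in kwargs.items()}
    captured_inputs.append((cpu_args, cpu_kwargs))


device = "cuda:0" if torch.cuda.is_available() else "cpu"
layer = nn.Linear(16, 16).to(device)
handle = layer.register_forward_pre_hook(capture_to_cpu, with_kwargs=True)

with torch.no_grad():
    layer(torch.randn(4, 16, device=device))

handle.remove()
# Each stored input is moved back to the layer's device only when it is needed
# for that layer's forward pass, instead of living on the GPU the whole time.
```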
This PR makes `offload_to_disk=True` behave like `offload_to_disk=False` by materializing base modules to CPU and storing captured inputs on CPU. It does not yet change how outputs are stored after the forward pass (right now they are stored on `cur_layer_device`), because I do not want to introduce merge conflicts with the `moe_disable_router` feature right now. I also have a feature in a separate branch where `calibration_data_device` can be set, plus a `balanced` mode, waiting for the merge of `moe_disable_router`, so it is better to wait and then propose an improvement for storing outputs on the proper device.

Here are some debug logs with evidence:
Used 256 samples from `c4/en` with `qwen3-coder-30b-a3b-instruct`, `offload_to_disk=False` vs `offload_to_disk=True`:

I don't know if it was a bug, but after `layers[0] = layers[0].to(self.gptq_model.quantize_config.device)` the `cur_layer_device` was not changed from `CPU` to `self.gptq_model.quantize_config.device` in the `offload_to_disk=False` case, so the base modules were first moved to `cuda:0` with `layers[0] = layers[0].to(self.gptq_model.quantize_config.device)`, and then back to `CPU` with:
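The snippet that followed in the original comment is not reproduced here. Purely as a hedged, standalone illustration of the pattern described above (a cached device variable going stale after `layers[0].to(...)`, so that later code moves the modules straight back to CPU), consider:

```python
# Standalone illustration (not the GPTQModel code) of a stale cur_layer_device.
import torch
import torch.nn as nn

target_device = "cuda:0" if torch.cuda.is_available() else "cpu"

layers = [nn.Linear(8, 8)]  # the layer starts on CPU
cur_layer_device = next(layers[0].parameters()).device  # reads "cpu"

layers[0] = layers[0].to(target_device)  # moved to cuda:0 for quantization

# cur_layer_device was never refreshed, so it still says "cpu" here,
# and any code that trusts it moves the module straight back to CPU:
layers[0] = layers[0].to(cur_layer_device)

# Refreshing the variable right after the move avoids the round trip:
layers[0] = layers[0].to(target_device)
cur_layer_device = next(layers[0].parameters()).device  # now cuda:0
```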