Background
The current PTQ pipeline assumes float32 as the default dtype for model execution and intermediate tensors. This assumption originates from legacy constraints in the CNN-era toolchain (e.g., ONE), where models were expected to be in float32 format.
However, this assumption is no longer suitable for modern Transformer/LLM workflows:
- Most pretrained models are in fp16 or bf16
- Converting to float32 doubles the size of every 16-bit tensor, which increases memory usage significantly
- The dependency on float32 is not algorithmically required, but historically inherited
Problem
- Hardcoded float32 conversions (.float(), torch.float32) exist across the pipeline
- Model dtype is ignored even when explicitly defined in the checkpoint or load configuration
- This leads to:
  - Unnecessary memory overhead
  - Redundant dtype casts
  - Reduced flexibility for modern models
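The memory overhead of a hardcoded cast is easy to measure. The following is a minimal sketch, assuming PyTorch; the toy nn.Linear layer stands in for a real bf16 checkpoint, and the helper param_bytes is a hypothetical name introduced only for this illustration.

```python
import copy

import torch
from torch import nn


def param_bytes(module: nn.Module) -> int:
    """Total bytes occupied by the module's parameters."""
    return sum(p.numel() * p.element_size() for p in module.parameters())


# Toy stand-in for a checkpoint that was saved in bf16 (assumption).
model = nn.Linear(1024, 1024).to(torch.bfloat16)
bytes_native = param_bytes(model)

# What a hardcoded `.float()` in the pipeline effectively does.
# nn.Module.float() casts in place, so work on a copy to keep the original.
model_fp32 = copy.deepcopy(model).float()
bytes_fp32 = param_bytes(model_fp32)

print(f"native dtype : {next(model.parameters()).dtype}, {bytes_native} bytes")
print(f"after float(): {next(model_fp32.parameters()).dtype}, {bytes_fp32} bytes")
print(f"ratio        : {bytes_fp32 / bytes_native}")
```

The same factor of two applies to any intermediate activation that is cast to float32 during calibration.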
Proposal
Refactor the PTQ pipeline to remove the implicit float32 assumption and instead follow the original model dtype.
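One possible way to follow the model dtype is a small resolver that every stage of the pipeline calls instead of assuming float32. The sketch below assumes PyTorch and applies the precedence suggested under Key Changes; the function name resolve_execution_dtype and its override and config_dtype parameters are hypothetical and do not refer to an existing API in the pipeline.

```python
from typing import Optional

import torch
from torch import nn


def resolve_execution_dtype(
    model: nn.Module,
    override: Optional[torch.dtype] = None,
    config_dtype: Optional[torch.dtype] = None,
) -> torch.dtype:
    """Pick the dtype to run PTQ in (hypothetical helper)."""
    # 1. An explicit user override always wins.
    if override is not None:
        return override
    # 2. Otherwise use the dtype of the first floating-point parameter.
    for param in model.parameters():
        if param.is_floating_point():
            return param.dtype
    # 3. No floating-point parameters: fall back to the config, if any.
    if config_dtype is not None:
        return config_dtype
    # 4. Last resort: keep the legacy default.
    return torch.float32


fp16_model = nn.Linear(4, 4).to(torch.float16)
print(resolve_execution_dtype(fp16_model))
print(resolve_execution_dtype(fp16_model, override=torch.bfloat16))
```

Taking the dtype from the first floating-point parameter is a simplification; a model with mixed parameter dtypes would need an explicit policy.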
Key Changes
- Remove float32 hardcoding
  - Eliminate .float() and torch.float32 assumptions in:
    - model wrapping
    - calibration
    - observer inputs
    - export paths
- Adopt model dtype as default execution dtype
  - Use the actual model parameter dtype as the primary source of truth
  - Suggested precedence, highest first:
    - Explicit user override
    - Model parameter dtype
    - Model config dtype
    - Fallback to float32
- Validate via regression
  - Run calibration, quantization, and evaluation flows
  - Compare against current float32-based baseline
  - Identify any numerical instability or accuracy degradation
- Introduce selective fp32 promotion only if needed
  - If regression reveals instability, apply fp32 only to specific components:
    - observer statistics
    - scale/zero-point computation
    - error metrics
  - Avoid global dtype overrides
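Selective promotion could look roughly like the sketch below, assuming PyTorch and a plain min/max affine scheme for uint8. Only the reduced statistics and the scale/zero-point arithmetic are promoted to float32, while the activation tensor stays in its original dtype. The function affine_qparams is a hypothetical illustration, and the observers in the actual pipeline may compute their parameters differently.

```python
import torch


def affine_qparams(x: torch.Tensor, qmin: int = 0, qmax: int = 255):
    """Min/max affine scale and zero-point, computed in fp32 (hypothetical)."""
    # Reduce in the tensor's own dtype, then promote only the two scalars.
    x_min = x.detach().min().to(torch.float32)
    x_max = x.detach().max().to(torch.float32)
    # Extend the range to include zero so that 0.0 maps to an exact integer.
    x_min = torch.minimum(x_min, torch.zeros_like(x_min))
    x_max = torch.maximum(x_max, torch.zeros_like(x_max))
    # Scale/zero-point arithmetic happens entirely in fp32.
    scale = (x_max - x_min) / float(qmax - qmin)
    scale = torch.clamp(scale, min=torch.finfo(torch.float32).eps)
    zero_point = torch.clamp(torch.round(qmin - x_min / scale), qmin, qmax)
    return scale, zero_point.to(torch.int64)


torch.manual_seed(0)
x = torch.randn(8, 16).to(torch.bfloat16)  # activation stays in bf16
scale, zero_point = affine_qparams(x)
print(x.dtype, scale.dtype, zero_point.dtype)
```

Because only two scalars per tensor are promoted, the extra memory is negligible compared with casting the whole model or its activations.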
Expected Benefits
- Reduced memory footprint for large models
- Better alignment with modern LLM/VLM checkpoints
- Cleaner and more general PTQ design
- Decoupling from legacy backend assumptions (e.g., ONE)