[quantization] Refactor: Remove float32 assumption and follow model dtype #621

@mhs4670go

Background

The current PTQ pipeline assumes float32 as the default dtype for model execution and intermediate tensors. This assumption originates from legacy constraints in the CNN-era toolchain (e.g., ONE), where models were expected to be in float32 format.

However, this assumption is no longer suitable for modern Transformer/LLM workflows:

  • Most pretrained Transformer/LLM checkpoints are released in fp16 or bf16
  • Converting such a checkpoint to float32 doubles the memory needed for its weights (4 bytes per element instead of 2)
  • The dependency on float32 is not algorithmically required, but historically inherited

Problem

  • Hardcoded float32 conversions (.float(), torch.float32) exist across the pipeline
  • Model dtype is ignored even when explicitly defined in the checkpoint or load configuration
  • This leads to:
    • Unnecessary memory overhead
    • Redundant dtype casts
    • Reduced flexibility for modern models
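
To put a rough number on the memory overhead, here is a minimal back-of-the-envelope sketch. The helper `weight_memory_gib` and the 7B parameter count are hypothetical and chosen only for illustration; the calculation covers weights only and ignores activations and calibration buffers.

```python
# Bytes per element for the dtypes discussed in this issue.
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the memory needed to hold `num_params` weights, in GiB."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 2**30


# Hypothetical 7B-parameter model, used only as an example size.
num_params = 7_000_000_000
native = weight_memory_gib(num_params, "bfloat16")
upcast = weight_memory_gib(num_params, "float32")

print(f"bf16 weights: {native:.1f} GiB")
print(f"fp32 weights: {upcast:.1f} GiB")
print(f"extra memory from upcasting: {upcast - native:.1f} GiB")
```

For a model of this size, a blanket float32 cast costs roughly 13 GiB of additional weight memory before any calibration data is processed.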

Proposal

Refactor the PTQ pipeline to remove the implicit float32 assumption and instead follow the original model dtype.

Key Changes

  1. Remove float32 hardcoding

    • Eliminate .float() and torch.float32 assumptions in:
      • model wrapping
      • calibration
      • observer inputs
      • export paths
  2. Adopt model dtype as default execution dtype

    • Use the actual model parameter dtype as the primary source of truth unless the user explicitly overrides it
    • Suggested precedence:
      1. Explicit user override
      2. Model parameter dtype
      3. Model config dtype
      4. Fallback to float32
  3. Validate via regression

    • Run calibration, quantization, and evaluation flows
    • Compare against the current float32-based baseline
    • Identify any numerical instability or accuracy degradation
  4. Introduce selective fp32 promotion only if needed

    • If regression reveals instability, apply fp32 only to specific components:
      • observer statistics
      • scale/zero-point computation
      • error metrics
    • Avoid global dtype overrides
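
The suggested precedence from item 2 could be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: `resolve_execution_dtype` and `make_stub` are hypothetical names, the config attribute `torch_dtype` is an assumption about how the checkpoint config exposes its dtype, and taking the first parameter's dtype is a simplification that ignores mixed-dtype models. The function treats dtypes as opaque values, so the stub models below use strings, while real code would pass `torch.dtype` values such as `torch.float32` for the fallback.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_execution_dtype(
    model: Any,
    override: Optional[Any] = None,
    fallback: Any = "float32",
) -> Any:
    """Pick the execution dtype using the precedence proposed in this issue."""
    # 1. Explicit user override
    if override is not None:
        return override

    # 2. Model parameter dtype (first parameter only; a simplification)
    first_param = next(iter(model.parameters()), None)
    if first_param is not None:
        return first_param.dtype

    # 3. Model config dtype (attribute name `torch_dtype` is an assumption)
    config = getattr(model, "config", None)
    config_dtype = getattr(config, "torch_dtype", None)
    if config_dtype is not None:
        return config_dtype

    # 4. Fallback to float32
    return fallback


def make_stub(param_dtypes, config_dtype=None):
    """Build a stand-in object that mimics model.parameters() and model.config."""
    params = [SimpleNamespace(dtype=d) for d in param_dtypes]
    return SimpleNamespace(
        parameters=lambda: iter(params),
        config=SimpleNamespace(torch_dtype=config_dtype),
    )


bf16_model = make_stub(["bfloat16"], config_dtype="float16")
print(resolve_execution_dtype(bf16_model))                      # parameter dtype wins over config
print(resolve_execution_dtype(bf16_model, override="float16"))  # explicit override wins
print(resolve_execution_dtype(make_stub([])))                   # nothing known, so fallback
```

Keeping the resolution in one helper like this would give model wrapping, calibration, observers, and export a single place to ask for the dtype instead of each calling `.float()` on its own.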

Expected Benefits

  • Reduced memory footprint for large models
  • Better alignment with modern LLM/VLM checkpoints
  • Cleaner and more general PTQ design
  • Decoupling from legacy backend assumptions (e.g., ONE)
