[quantization] Refactor: Remove float32 assumption and follow model dtype #621

## Background

The current post-training quantization (PTQ) pipeline assumes `float32` as the default dtype for model execution and intermediate tensors. This assumption originates from legacy constraints in the CNN-era toolchain (e.g., ONE), where models were expected to be in `float32` format.

However, this assumption is no longer suitable for modern Transformer/LLM workflows:

- Most pretrained Transformer/LLM checkpoints are distributed in `fp16` or `bf16`
- Converting to `float32` doubles the memory needed for weights and activations relative to `fp16`/`bf16`
- The dependency on `float32` is historically inherited, not algorithmically required

## Problem

- Hardcoded float32 conversions (`.float()`, `torch.float32`) exist across the pipeline
- Model dtype is ignored even when explicitly defined in the checkpoint or load configuration
- This leads to:
  - Unnecessary memory overhead
  - Redundant dtype casts
  - Reduced flexibility for modern models
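
As a rough illustration of the memory overhead, the sketch below measures parameter storage before and after a blanket `.float()` cast. The `param_bytes` helper and the small `nn.Linear` stand-in for a pretrained checkpoint are hypothetical and not part of the current pipeline.

```python
import torch
import torch.nn as nn


def param_bytes(model: nn.Module) -> int:
    """Total bytes occupied by the model's parameters."""
    return sum(p.numel() * p.element_size() for p in model.parameters())


# Hypothetical stand-in for a checkpoint that was loaded in bf16.
model = nn.Linear(1024, 1024).to(torch.bfloat16)
bf16_bytes = param_bytes(model)

# The kind of blanket cast that is currently hardcoded in the pipeline.
model = model.float()
fp32_bytes = param_bytes(model)

print(f"bf16: {bf16_bytes} bytes, fp32: {fp32_bytes} bytes")
```

The same factor of two applies to any activation or intermediate tensor that is cast along the way.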

## Proposal

Refactor the PTQ pipeline to remove the implicit float32 assumption and instead follow the original model dtype.

### Key Changes

1. **Remove float32 hardcoding**
   - Eliminate `.float()` and `torch.float32` assumptions in:
     - model wrapping
     - calibration
     - observer inputs
     - export paths

2. **Adopt model dtype as default execution dtype**
   - Use the actual model parameter dtype as the primary source of truth
   - Suggested precedence:
     1. Explicit user override
     2. Model parameter dtype
     3. Model config dtype
     4. Fallback to float32

3. **Validate via regression**
   - Run calibration, quantization, and evaluation flows
   - Compare against the current float32-based baseline
   - Identify any numerical instability or accuracy degradation

4. **Introduce selective fp32 promotion only if needed**
   - If regression reveals instability, apply fp32 only to specific components:
     - observer statistics
     - scale/zero-point computation
     - error metrics
   - Avoid global dtype overrides
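
Items 2 and 4 could be sketched together as follows. This is a minimal sketch under stated assumptions, not the pipeline's actual API: `resolve_dtype`, `compute_qparams`, and their parameters are hypothetical names, and the asymmetric per-tensor uint8 scheme is chosen only for illustration. The first function applies the suggested precedence; the second promotes only the scale/zero-point computation to fp32 while the input tensor keeps its original dtype.

```python
from typing import Optional, Tuple

import torch
import torch.nn as nn


def resolve_dtype(
    model: nn.Module,
    override: Optional[torch.dtype] = None,
    config_dtype: Optional[torch.dtype] = None,
) -> torch.dtype:
    """Pick the execution dtype using the suggested precedence (hypothetical helper)."""
    if override is not None:              # 1. explicit user override
        return override
    for p in model.parameters():          # 2. model parameter dtype
        if p.is_floating_point():
            return p.dtype
    if config_dtype is not None:          # 3. model config dtype
        return config_dtype
    return torch.float32                  # 4. fallback


def compute_qparams(
    x: torch.Tensor, qmin: int = 0, qmax: int = 255
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Asymmetric per-tensor scale/zero-point with statistics promoted to fp32.

    Only the two scalar statistics are promoted; `x` is never copied to fp32.
    """
    # min/max select existing elements, so reducing in the low-precision dtype
    # and promoting the scalar result loses no information.
    lo = x.detach().min().to(torch.float32).clamp(max=0.0)
    hi = x.detach().max().to(torch.float32).clamp(min=0.0)
    eps = torch.finfo(torch.float32).eps
    scale = ((hi - lo) / (qmax - qmin)).clamp(min=eps)
    zero_point = torch.round(qmin - lo / scale).clamp(qmin, qmax).to(torch.int64)
    return scale, zero_point


# Usage: the model stays in bf16, only the qparams are computed in fp32.
model = nn.Linear(4, 4).to(torch.bfloat16)
exec_dtype = resolve_dtype(model)
x = torch.tensor([-1.0, 0.5, 2.0], dtype=exec_dtype)
scale, zero_point = compute_qparams(x)
print(exec_dtype, scale.dtype, x.dtype)
```

Keeping the promotion inside `compute_qparams` avoids a global dtype override: callers pass tensors in whatever dtype the model uses, and the fp32 cost is limited to a few scalars.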

### Expected Benefits

- Reduced memory footprint for large models
- Better alignment with modern LLM/VLM checkpoints
- Cleaner and more general PTQ design
- Decoupling from legacy backend assumptions (e.g., ONE)

