[quantization] Ablation study: float vs quantized prefill for decode calibration #622

@mhs4670go

Summary

We have not yet finalized how decode calibration inputs should be generated.

From a PTQ perspective, using float/reference prefill outputs appears to be more principled, since calibration is expected to reflect the original float model behavior. However, using quantized prefill outputs may better reflect the actual deployment pipeline.

At this stage, it is unclear which approach is more appropriate in practice, so we propose to support both and compare them empirically.

Motivation

PTQ fundamentally aims to approximate the float model behavior.

For decode PTQ, the ideal objective is:

Make quantized decode approximate float decode under the original float input distribution.

However, using quantized prefill outputs changes the optimization target to:

Approximate float decode under already-quantized inputs.

This mixes two error sources in the calibration data:

  • prefill quantization error
  • decode quantization error

Because the two are entangled, the resulting calibration may be suboptimal or harder to interpret.
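The mixing can be illustrated with a toy numeric sketch. Everything here is a hypothetical stand-in, not code from this repository: `fake_quant` is a simple uniform rounding quantizer, `float_decode` is an arbitrary scalar function playing the role of the float decode step, and `quant_decode` fake-quantizes its output. The point is only that the error observed under quantized prefill inputs contains both a prefill term and a decode term.

```python
def fake_quant(x, scale):
    """Round x to the nearest multiple of scale (uniform fake-quantization)."""
    return round(x / scale) * scale


def float_decode(x):
    """Hypothetical stand-in for the float decode step."""
    return 1.7 * x + 0.3


def quant_decode(x, scale):
    """Hypothetical stand-in for the quantized decode step."""
    return fake_quant(float_decode(x), scale)


def error_terms(x, prefill_scale, decode_scale):
    """Compare decode error under float inputs vs. quantized-prefill inputs."""
    x_q = fake_quant(x, prefill_scale)  # what a quantized prefill would hand over
    reference = float_decode(x)         # float decode on the float input
    return {
        # decode quantization error alone (float input)
        "decode_only": quant_decode(x, decode_scale) - reference,
        # prefill quantization error alone, propagated through float decode
        "prefill_only": float_decode(x_q) - reference,
        # what is observed when quantized decode runs on quantized-prefill input
        "mixed": quant_decode(x_q, decode_scale) - reference,
    }


if __name__ == "__main__":
    for name, value in error_terms(0.37, prefill_scale=0.1, decode_scale=0.05).items():
        print(f"{name}: {value:+.4f}")
```

In this toy setting the `mixed` term cannot be attributed to decode alone, which is the interpretability concern raised above. Real models are nonlinear and high-dimensional, so the two terms will not generally add up this cleanly.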

Proposal

1. Float-based decode calibration

  • Run calibration prompts through the float/reference model (prefill).
  • Collect:
    • past_key_values
    • decode step inputs (hidden states, masks, position embeddings, etc.)
  • Use these as decode calibration inputs for PTQ.

2. Quantized-prefill-based calibration

  • Run the same calibration prompts through the quantized prefill model and collect the same decode calibration inputs from its outputs.
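One way to support both options is to keep the collection logic independent of which prefill model produced the outputs. The sketch below assumes a hypothetical `prefill_fn` callable that takes a prompt and returns a dict with `past_key_values` and `decode_inputs` keys; neither the callable nor the key layout comes from this repository, and the stub prefill exists only to make the example runnable.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical interface: a prefill callable maps a prompt to its outputs.
PrefillFn = Callable[[str], Dict[str, Any]]


def collect_decode_calibration_inputs(
    prefill_fn: PrefillFn, prompts: Iterable[str], source: str
) -> List[Dict[str, Any]]:
    """Run prefill on each prompt and keep what the decode step consumes."""
    samples = []
    for prompt in prompts:
        out = prefill_fn(prompt)
        samples.append(
            {
                "source": source,  # "float" or "quantized", kept for bookkeeping
                "past_key_values": out["past_key_values"],
                "decode_inputs": out["decode_inputs"],
            }
        )
    return samples


def make_stub_prefill(tag: str) -> PrefillFn:
    """Toy prefill used only for illustration; a real one would run a model."""

    def prefill(prompt: str) -> Dict[str, Any]:
        n_tokens = len(prompt.split())
        return {
            "past_key_values": [f"{tag}-kv-{i}" for i in range(n_tokens)],
            "decode_inputs": {"position": n_tokens},
        }

    return prefill


if __name__ == "__main__":
    prompts = ["hello world", "a b c"]
    float_set = collect_decode_calibration_inputs(make_stub_prefill("fp"), prompts, "float")
    quant_set = collect_decode_calibration_inputs(make_stub_prefill("q"), prompts, "quantized")
    print(len(float_set), len(quant_set))
```

With this shape, the two calibration sets differ only in the prefill callable passed in, which keeps the comparison between them controlled.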

Ablation Study

Compare the following two approaches:

  1. Float-prefill-based decode calibration
  2. Quantized-prefill-based decode calibration

Metrics:

  • Perplexity (Wikitext-2)
  • lm_eval tasks (e.g., openbookqa, winogrande)
  • Optional: layer-wise error / sensitivity
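For reference, two of these metrics reduce to short formulas. The sketch below is a minimal pure-Python version under stated assumptions: `perplexity` takes per-token negative log-likelihoods in natural log and omits the windowing a real Wikitext-2 evaluation needs, and `layerwise_relative_error` uses a relative L2 error as one possible layer-wise measure, since the issue does not fix one. The function names and the dict-of-lists activation layout are hypothetical.

```python
import math
from typing import Dict, Sequence


def perplexity(token_nlls: Sequence[float]) -> float:
    """exp of the mean per-token negative log-likelihood (natural log)."""
    if not token_nlls:
        raise ValueError("token_nlls must not be empty")
    return math.exp(sum(token_nlls) / len(token_nlls))


def layerwise_relative_error(
    float_acts: Dict[str, Sequence[float]],
    quant_acts: Dict[str, Sequence[float]],
) -> Dict[str, float]:
    """Per layer, ||quant - float||_2 / ||float||_2 over flattened activations."""
    errors = {}
    for name, ref in float_acts.items():
        test = quant_acts[name]
        if len(ref) != len(test):
            raise ValueError(f"shape mismatch at layer {name!r}")
        diff_norm = math.sqrt(sum((t - r) ** 2 for r, t in zip(ref, test)))
        ref_norm = math.sqrt(sum(r * r for r in ref))
        if ref_norm == 0.0:
            # Degenerate reference: report 0 if identical, otherwise infinity.
            errors[name] = 0.0 if diff_norm == 0.0 else math.inf
        else:
            errors[name] = diff_norm / ref_norm
    return errors
```

Computing the layer-wise error once per calibration variant, against the same float reference activations, would show where the two approaches diverge rather than only whether the end metrics differ.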

Goals:

  • Identify which approach yields better decode performance
  • Evaluate the trade-offs between principled calibration and deployment-faithful calibration

Expected Outcome

  • Clarify the correct strategy for decode calibration inputs
  • Establish a reliable baseline for the prefill-decode PTQ pipeline
  • Provide empirical guidance for future design decisions
