Summary
We have not yet finalized how decode calibration inputs should be generated.
From a PTQ perspective, using float/reference prefill outputs appears to be more principled, since calibration is expected to reflect the original float model behavior. However, using quantized prefill outputs may better reflect the actual deployment pipeline.
At this stage, it is unclear which approach is more appropriate in practice, so we propose to support both and compare them empirically.
Motivation
PTQ fundamentally aims to approximate the float model behavior.
For decode PTQ, the ideal objective is:
Make quantized decode approximate float decode under the original float input distribution.
However, using quantized prefill outputs changes the optimization target to:
Approximate float decode under already-quantized inputs.
This mixes two error sources:
- prefill quantization error
- decode quantization error
As a result, calibration may be suboptimal or harder to interpret.
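The two targets can be written side by side. The notation is an assumption introduced here for illustration, not taken from an existing implementation: $f_p$ and $f_d$ are the float prefill and decode models, $\hat f_p$ and $\hat f_d$ their quantized counterparts, $x$ is a calibration prompt drawn from $\mathcal{D}$, and the squared L2 norm stands in for whatever distance the chosen PTQ method actually minimizes.

```latex
% Float-based calibration: decode inputs come from float prefill.
\min_{\hat f_d}\; \mathbb{E}_{x \sim \mathcal{D}}
  \bigl\| \hat f_d(f_p(x)) - f_d(f_p(x)) \bigr\|_2^2

% Quantized-prefill-based calibration: decode inputs come from quantized prefill.
\min_{\hat f_d}\; \mathbb{E}_{x \sim \mathcal{D}}
  \bigl\| \hat f_d(\hat f_p(x)) - f_d(\hat f_p(x)) \bigr\|_2^2

% End-to-end error splits into a decode term and a propagated prefill term.
\hat f_d(\hat f_p(x)) - f_d(f_p(x))
  = \underbrace{\hat f_d(\hat f_p(x)) - f_d(\hat f_p(x))}_{\text{decode quantization error}}
  + \underbrace{f_d(\hat f_p(x)) - f_d(f_p(x))}_{\text{prefill error seen through float decode}}
```

Under this reading, the second objective minimizes only the first term of the decomposition, evaluated at inputs that already carry prefill error, which is why the two error sources become harder to separate.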
Proposal
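1. Float-based decode calibration
- Run calibration prompts through the float/reference model (prefill).
- Collect past_key_values (the prefill KV cache).
- Use these as decode calibration inputs for PTQ.
2. Quantized-prefill-based calibration
- Generate decode calibration inputs from the quantized prefill model.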
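The two variants differ only in which prefill model produces the decode inputs, so one collection routine could serve both. The following is a minimal sketch under stated assumptions: `PrefillFn`, `DecodeCalibSample`, `collect_decode_calib_inputs`, and the toy prefill functions are hypothetical names invented for illustration. If the models follow the Hugging Face Transformers convention, the prefill call would typically be a forward pass with `use_cache=True`, whose output exposes `past_key_values`; the toy functions below only stand in for that.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Sequence, Tuple

# Hypothetical prefill signature: token ids -> (next_token_id, past_key_values).
PrefillFn = Callable[[Sequence[int]], Tuple[int, Any]]


@dataclass
class DecodeCalibSample:
    """One decode-step calibration input: the token to feed plus the KV cache."""
    input_id: int
    past_key_values: Any
    source: str  # "float" or "quantized", recorded so runs stay comparable


def collect_decode_calib_inputs(
    prefill_fn: PrefillFn,
    prompts: Iterable[Sequence[int]],
    source: str,
) -> List[DecodeCalibSample]:
    """Run prefill on each prompt and keep what the first decode step consumes."""
    samples = []
    for prompt in prompts:
        next_token, past_key_values = prefill_fn(prompt)
        samples.append(DecodeCalibSample(next_token, past_key_values, source))
    return samples


# Toy stand-ins so the sketch runs without a real model (assumptions, not real APIs).
def toy_float_prefill(prompt):
    kv = [float(t) * 0.5 for t in prompt]  # pretend KV cache
    return prompt[-1] + 1, kv


def toy_quantized_prefill(prompt):
    next_token, kv = toy_float_prefill(prompt)
    return next_token, [round(v) for v in kv]  # pretend quantization noise


prompts = [[1, 2, 3], [4, 5]]
float_inputs = collect_decode_calib_inputs(toy_float_prefill, prompts, "float")
quant_inputs = collect_decode_calib_inputs(toy_quantized_prefill, prompts, "quantized")
```

Tagging each sample with its source keeps the two calibration sets distinguishable when both are fed to the same decode PTQ routine.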
Ablation Study
Compare the following two approaches:
- Float-prefill-based decode calibration
- Quantized-prefill-based decode calibration
Metrics:
- Perplexity (WikiText-2)
- lm_eval tasks (e.g., openbookqa, winogrande)
- Optional: layer-wise error / sensitivity
Goals:
- Identify which approach yields better decode performance
- Evaluate the trade-offs between principled calibration and deployment-faithful calibration
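For the perplexity metric, the comparison reduces to computing the exponential of the mean negative log-likelihood for each calibration variant on the same evaluation tokens. A minimal sketch, assuming per-token natural-log probabilities are already available from the evaluation run; the function names and the numbers in `example` are made up for illustration and are not measured results.

```python
import math
from typing import Dict, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the mean negative log-likelihood (natural log) over tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def compare_variants(logprobs_by_variant: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Compute perplexity per calibration variant for a side-by-side report."""
    return {name: perplexity(lp) for name, lp in logprobs_by_variant.items()}


# Illustrative inputs only; real values would come from the evaluation harness.
example = {
    "float_prefill_calib": [math.log(0.5), math.log(0.25)],
    "quantized_prefill_calib": [math.log(0.25), math.log(0.25)],
}
report = compare_variants(example)
```

Both variants should be scored on identical tokens and with the same quantized prefill and decode pipeline at evaluation time, so that any difference is attributable to the calibration inputs alone.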
Expected Outcome
- Clarify the correct strategy for decode calibration inputs
- Establish a reliable baseline for the prefill-decode PTQ pipeline
- Provide empirical guidance for future design decisions