[quantization] Ablation study: float vs quantized prefill for decode calibration

## Summary

We have not yet finalized how decode calibration inputs should be generated.

From a post-training quantization (PTQ) perspective, using **float/reference prefill outputs** appears more principled, since calibration is expected to reflect the behavior of the original float model. However, using **quantized prefill outputs** may better reflect the actual deployment pipeline, where decode consumes the outputs of a quantized prefill.

At this stage, it is unclear which approach is more appropriate in practice, so we propose to support both and compare them empirically.

## Motivation

PTQ fundamentally aims to approximate the **float model behavior**.

For decode PTQ, the ideal objective is:

> Make quantized decode approximate float decode under the original float input distribution.

However, using quantized prefill outputs changes the optimization target to:

> Approximate float decode under already-quantized inputs.

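This mixes two error sources in the calibration data:

* prefill quantization error
* decode quantization error

As a result, calibration may be suboptimal, and it becomes harder to attribute any accuracy loss to one stage or the other.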
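One way to make the mixing explicit, using notation that is assumed here rather than taken from the codebase: let $P_f$, $P_q$ denote float and quantized prefill, $D_f$, $D_q$ float and quantized decode, and $x$ a calibration prompt. By the triangle inequality, the end-to-end error splits as:

$$
\lVert D_q(P_q(x)) - D_f(P_f(x)) \rVert \le \lVert D_q(P_q(x)) - D_f(P_q(x)) \rVert + \lVert D_f(P_q(x)) - D_f(P_f(x)) \rVert
$$

The first term is the decode quantization error measured on quantized-prefill inputs, and the second is the prefill quantization error propagated through float decode. Float-based calibration instead targets $\lVert D_q(P_f(x)) - D_f(P_f(x)) \rVert$, which isolates the decode error on the float input distribution.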

## Proposal

### 1. Float-based decode calibration

* Run calibration prompts through the **float/reference model** (prefill).
* Collect:

  * `past_key_values`
  * decode step inputs (hidden states, masks, position embeddings, etc.)
* Use these as decode calibration inputs for PTQ.

### 2. Quantized-prefill-based calibration

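* Run the same calibration prompts through the **quantized prefill model**.
* Collect the same inputs as in the float-based approach (`past_key_values` and decode step inputs).
* Use these as decode calibration inputs for PTQ.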
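The two strategies differ only in which prefill model produces the decode inputs, so a single collection routine could serve both. The following is a minimal sketch under stated assumptions: `DecodeCalibSample`, `collect_decode_calibration`, and the toy `float_prefill` / `quantized_prefill` stand-ins are hypothetical names for illustration, not part of any existing API, and a real prefill would return model tensors rather than Python lists.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Tuple

# Hypothetical container; field names are illustrative, not a project API.
@dataclass
class DecodeCalibSample:
    past_key_values: Any
    decode_inputs: Dict[str, Any]  # e.g. hidden states, masks, position embeddings

# Assumed signature: a prefill callable maps token ids to
# (past_key_values, decode_inputs).
PrefillFn = Callable[[List[int]], Tuple[Any, Dict[str, Any]]]

def collect_decode_calibration(
    prefill: PrefillFn, prompts: Iterable[List[int]]
) -> List[DecodeCalibSample]:
    """Run each prompt through the given prefill and keep its decode inputs."""
    samples = []
    for token_ids in prompts:
        past_key_values, decode_inputs = prefill(token_ids)
        samples.append(DecodeCalibSample(past_key_values, decode_inputs))
    return samples

# Toy stand-ins so the sketch runs without a model.
def float_prefill(token_ids: List[int]) -> Tuple[List[float], Dict[str, Any]]:
    kv = [0.1 * t for t in token_ids]
    return kv, {"position": len(token_ids)}

def quantized_prefill(
    token_ids: List[int], scale: float = 0.25
) -> Tuple[List[float], Dict[str, Any]]:
    # Simulate quantization by rounding the float cache to a uniform grid.
    kv, decode_inputs = float_prefill(token_ids)
    return [round(v / scale) * scale for v in kv], decode_inputs

prompts = [[1, 2, 3], [4, 5]]
float_calib = collect_decode_calibration(float_prefill, prompts)   # approach 1
quant_calib = collect_decode_calibration(quantized_prefill, prompts)  # approach 2
```

Keeping the prefill as an injected callable means the ablation can swap calibration sources without touching the rest of the PTQ flow.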

## Ablation Study

Compare the following two approaches:

1. **Float-prefill-based decode calibration**
2. **Quantized-prefill-based decode calibration**

Metrics:

* Perplexity (Wikitext-2)
* lm_eval tasks (e.g., openbookqa, winogrande)
* Optional: layer-wise error / sensitivity
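For reference, two of these metrics can be written down compactly. The sketch below assumes perplexity is computed from per-token natural-log probabilities and uses relative L2 error as one possible layer-wise error measure. The function names are hypothetical, and the issue itself does not prescribe a specific layer-wise metric.

```python
import math
from typing import Sequence

def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def relative_l2_error(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """||candidate - reference||_2 / ||reference||_2 for one layer's output."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    num = math.sqrt(sum((c - r) ** 2 for r, c in zip(reference, candidate)))
    den = math.sqrt(sum(r * r for r in reference))
    if den == 0.0:
        raise ValueError("reference has zero norm")
    return num / den
```

In the ablation, `reference` would be a float layer output and `candidate` the corresponding quantized output, flattened to a sequence, computed once per calibration strategy.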

Goals:

* Identify which approach yields better decode performance
* Evaluate the trade-offs between principled calibration and deployment-faithful calibration

## Expected Outcome

* Clarify the correct strategy for decode calibration inputs
* Establish a reliable baseline for the prefill-decode PTQ pipeline
* Provide empirical guidance for future design decisions

