1. Dataset Candidates
We limit the candidates to practical, widely-used, and easy-to-integrate datasets.
1.1 LLM (LLaMA)
(A) Baseline
- WikiText-2
- Purpose: baseline comparison
- Characteristics: plain natural language
(B) Instruction / Chat
- Alpaca (Stanford Alpaca)
- ShareGPT (filtered subset)
(C) Structured Tasks
- FLAN-style mixtures (optional subset)
- Includes QA, reasoning, summarization
- Good for diverse activation patterns
1.2 VLM (Qwen3-VL)
(A) Text-only baseline
- Same as LLM (WikiText / Alpaca subset)
(B) Vision-Language
- COCO Captions
- VQAv2
- Instruction-style VLM data (if available)
2. Sampling Strategy
2.1 Number of Samples
Recommended:
Guideline:
- Too small → unstable calibration
- Too large → diminishing returns
2.2 Sequence Length
We should match deployment characteristics:
| Scenario | Strategy |
|---|---|
| Chat / QA | 128 ~ 512 tokens |
| Long context | include some 1K+ samples |
| Mixed | stratified by length |
2.3 Sampling Method
Option A (Simple Random)
- Uniform random sampling
- Fast and easy baseline
Option B (Recommended: Stratified)
Split by:
- Sequence length buckets
- Prompt types (instruction / plain / QA)
Example:
- 50% instruction/chat
- 30% general text
- 20% long-context samples
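A stratified draw like the 50/30/20 example could be sketched as follows. The function name, the pool layout (a dict of lists), and the largest-remainder rounding are illustrative assumptions rather than a prescribed implementation.

```python
import random


def stratified_sample(pools, ratios, total, seed=0):
    """Draw `total` samples across named pools according to `ratios`.

    pools:  dict mapping stratum name -> list of candidate samples
    ratios: dict mapping stratum name -> fraction (must sum to 1.0)
    """
    if abs(sum(ratios.values()) - 1.0) > 1e-6:
        raise ValueError("ratios must sum to 1.0")
    rng = random.Random(seed)  # fixed seed keeps the calibration set reproducible

    # Floor each quota, then give leftover slots to the largest remainders.
    exact = {name: ratios[name] * total for name in ratios}
    quotas = {name: int(exact[name]) for name in ratios}
    leftover = total - sum(quotas.values())
    by_remainder = sorted(ratios, key=lambda n: exact[n] - quotas[n], reverse=True)
    for name in by_remainder[:leftover]:
        quotas[name] += 1

    picked = []
    for name, k in quotas.items():
        if k > len(pools[name]):
            raise ValueError(f"pool {name!r} has fewer than {k} samples")
        picked.extend(rng.sample(pools[name], k))  # sampling without replacement
    rng.shuffle(picked)
    return picked
```

For example, calling it with `ratios={"instruction": 0.5, "general": 0.3, "long": 0.2}` and `total=128` yields 64, 38, and 26 samples from the three pools, provided each pool is large enough.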
2.4 Prompt Formatting (Important)
For instruction models, use the actual inference format:
Example:
### Instruction:
<instruction>
### Response:
<response>
or chat template:
<|system|>
...
<|user|>
...
<|assistant|>
...
A mismatch between calibration and inference formatting can significantly shift the activation distribution.
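The two formats above can be rendered with small helpers so that calibration text goes through the same template as inference. This is only a sketch: the function names are hypothetical, and the templates are copied from the examples in this section rather than from any specific model. If the deployment uses Hugging Face transformers, the tokenizer's own chat template (`tokenizer.apply_chat_template`) is likely the safer source of truth.

```python
# Template copied from the "### Instruction / ### Response" example above.
ALPACA_TEMPLATE = "### Instruction:\n{instruction}\n### Response:\n{response}"

# Roles taken from the chat-template example above.
ALLOWED_ROLES = {"system", "user", "assistant"}


def format_alpaca(instruction: str, response: str = "") -> str:
    """Render one instruction/response pair in the Alpaca-style layout."""
    return ALPACA_TEMPLATE.format(
        instruction=instruction.strip(), response=response.strip()
    )


def format_chat(messages) -> str:
    """Render a list of {"role", "content"} dicts with <|role|> markers."""
    parts = []
    for msg in messages:
        if msg["role"] not in ALLOWED_ROLES:
            raise ValueError(f"unknown role: {msg['role']!r}")
        parts.append(f"<|{msg['role']}|>\n{msg['content'].strip()}")
    return "\n".join(parts)
```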
3. VLM Input Construction
For Qwen3-VL:
Each sample should include:
- Image (or dummy image if needed)
- Text prompt
Example:
User: What is happening in this image?
<image>
Important:
- Maintain real inference preprocessing
- Use actual tokenizer + image processor
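One way to keep VLM samples uniform before they reach the real processor is a small record type plus a converter to a chat-style message list. This is a sketch under assumptions: `VLMSample`, `to_messages`, and the `dummy.png` placeholder are hypothetical names, and the message layout (a `content` list of typed parts) mirrors a convention commonly used with Qwen-VL processors but should be checked against the actual processor documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VLMSample:
    prompt: str
    image_path: Optional[str] = None  # None means "use a dummy image"


def to_messages(sample: VLMSample, dummy_image: str = "dummy.png"):
    """Build a single-turn user message: text prompt followed by the image.

    The part order (text, then image) follows the example above; the dict
    layout is an assumption about what the image processor expects.
    """
    image = sample.image_path if sample.image_path is not None else dummy_image
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": sample.prompt},
                {"type": "image", "image": image},
            ],
        }
    ]
```

The resulting message list would then be passed through the real tokenizer and image processor, so the calibration inputs see the same preprocessing as inference.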
4. Calibration Execution Details
4.1 Prefill + Decode Coverage
Ensure calibration includes:
- Prefill (full sequence)
- Short decode steps (important for KV cache behavior)
Example:
- Run 1 full forward (prefill)
- Run 2~4 decode steps
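The prefill-then-decode pattern can be sketched independently of any framework. Here `step_fn(token_ids, cache) -> (next_token_id, cache)` is a hypothetical interface standing in for the real model forward pass with calibration observers attached, and `toy_step` is a stand-in used only to make the sketch runnable.

```python
def calibrate_sequence(step_fn, prompt_ids, decode_steps=3):
    """Run one prefill pass and a few single-token decode steps.

    step_fn is a hypothetical callable: (token_ids, cache) -> (next_id, cache).
    Returns the token ids produced along the way.
    """
    # Prefill: the whole prompt in one forward pass, starting with no cache.
    next_id, cache = step_fn(list(prompt_ids), None)
    generated = [next_id]
    # Decode: feed one token at a time, reusing the cache from earlier steps.
    for _ in range(decode_steps):
        next_id, cache = step_fn([next_id], cache)
        generated.append(next_id)
    return generated


def toy_step(token_ids, cache):
    """Toy model: the 'cache' is simply the running list of tokens seen."""
    cache = (cache or []) + list(token_ids)
    return sum(cache) % 100, cache
```

With `decode_steps=3`, the model is invoked once on the full prompt and three times on a single token, so observers see both the prefill and decode activation shapes.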
4.2 Token Distribution Coverage
We want to expose:
- Special tokens (BOS, EOS, role tokens)
- Punctuation-heavy inputs
- Rare tokens (optional but helpful)
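A quick coverage check can confirm that the calibration set actually contains the tokens listed above. This is a minimal sketch: `token_coverage` is a hypothetical helper, and the token strings in the usage note are placeholders for whatever the real tokenizer defines.

```python
from collections import Counter


def token_coverage(tokenized_samples, required_tokens):
    """Return (counts, missing) for tokens the calibration set should hit.

    tokenized_samples: iterable of token sequences (strings or ids)
    required_tokens:   tokens expected to appear at least once
    """
    counts = Counter(tok for sample in tokenized_samples for tok in sample)
    missing = [tok for tok in required_tokens if counts[tok] == 0]
    return counts, missing
```

For instance, passing `required_tokens=["<s>", "</s>", "<|user|>", "<|assistant|>"]` returns the subset that never appears, which is a signal to add or reformat samples before calibrating.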
5. Ablation Plan
| Dataset | Expected Outcome |
|---|---|
| WikiText | Baseline |
| Alpaca | Better instruction alignment |
| ShareGPT | Better chat realism |
| COCO (VLM) | Basic multimodal alignment |
| VQAv2 (VLM) | Complex cross-modal |
6. Evaluation Focus
We prioritize:
- Perplexity delta vs. the FP baseline
- lm-eval tasks (subset)
- Representative prompts
- Qualitative output stability
- (Optional) Layer-wise activation error
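Two of the metrics above reduce to short formulas. The sketch below assumes per-token negative log-likelihoods (in nats) are already collected for both models, and it uses relative mean squared error as the layer-wise activation error; that metric choice and all function names are assumptions.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def perplexity_delta(fp_nlls, quant_nlls):
    """Positive values mean the calibrated model is worse than the FP baseline."""
    return perplexity(quant_nlls) - perplexity(fp_nlls)


def relative_mse(fp_acts, quant_acts):
    """Activation error for one layer: ||fp - q||^2 / ||fp||^2."""
    if len(fp_acts) != len(quant_acts):
        raise ValueError("activation vectors must have the same length")
    num = sum((a - b) ** 2 for a, b in zip(fp_acts, quant_acts))
    den = sum(a * a for a in fp_acts)
    if den == 0.0:
        return 0.0 if num == 0.0 else float("inf")
    return num / den
```

Computing `relative_mse` per layer on the same inputs for both models makes it easier to see which layers a given calibration set serves poorly.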
7. Recommended Default (Initial Guess)
If we had to pick a strong default:
LLM
VLM
- 50% COCO
- 30% Alpaca-style text
- 20% VQAv2
8. Risks & Considerations
- Overfitting calibration to specific formats
- Dataset preprocessing mismatch
- Ignoring decode-phase behavior
- Too homogeneous sampling
9. Next Actions