[examples] Refactor WrapQ examples into config-driven reusable recipe structure #716

@mhs4670go

@Samsung/tico_developers

Summary

The current WrapQ example directory has grown into many scripts split by model, module, algorithm, and debugging purpose. As the number of examples has increased, we now have substantial duplicated logic across scripts, and the examples are becoming less useful as user-facing references.

This issue proposes refactoring the WrapQ examples into a smaller, config-driven structure:

tico/quantization/examples/
├── README.md
├── quantize.py
├── evaluate.py
├── inspect.py
└── configs/
    ├── llama_gptq_ptq.yaml
    ├── llama_ptq_only.yaml
    ├── qwen3_vl_gptq_ptq.yaml
    └── qwen3_vl_ptq_only.yaml

The actual reusable logic should be moved out of examples/ and into a new recipe layer:

tico/quantization/recipes/
├── __init__.py
├── config.py
├── context.py
├── runner.py
├── utils.py
├── qparams.py
├── adapters/
├── stages/
├── data/
├── evaluation/
├── export/
└── debug/

The goal is to make examples/ contain only thin CLI entrypoints, while model-specific behavior, algorithm stages, calibration, evaluation, export, and debugging logic live in reusable modules.

Motivation

The current example structure has several issues:

1. Too many example scripts

Scripts are currently split by model type, algorithm, module, and debug purpose. This makes it hard for users to know which script they should run.

2. Duplicated logic

Model loading, calibration data preparation, GPTQ/PTQ flow, qparam injection, evaluation, and export logic are repeated across multiple files.

3. Examples are doing too much

Some example scripts are effectively acting as libraries, evaluation tools, export tools, and debugging tools at the same time.

4. Examples import from other examples

Some scripts depend on helper functions defined in other example scripts. This makes the structure fragile and difficult to refactor.

5. Module-level examples are closer to integration tests

Scripts such as the individual Qwen vision MLP/attention or LLaMA decoder-layer examples are useful, but they fit better as tests or debug recipes than as public user-facing examples.

Proposed Design

1. Keep only three user-facing example CLIs

quantize.py
evaluate.py
inspect.py

Their responsibilities should be:

quantize.py : load model → prepare calibration data → run quantization pipeline → optionally evaluate/export
evaluate.py : evaluate FP model or fake-quant/checkpoint model
inspect.py  : trace, parity check, layer/runtime debugging

Example usage:

python -m tico.quantization.examples.quantize \
  --config tico/quantization/examples/configs/llama_gptq_ptq.yaml \
  --model Maykeye/TinyLLama-v0
python -m tico.quantization.examples.quantize \
  --config tico/quantization/examples/configs/qwen3_vl_gptq_ptq.yaml \
  --model Qwen/Qwen3-VL-2B-Instruct
python -m tico.quantization.examples.evaluate \
  --config tico/quantization/examples/configs/llama_gptq_ptq.yaml \
  --checkpoint ./out/llama/quantized_model.pt
python -m tico.quantization.examples.inspect \
  --config tico/quantization/examples/configs/qwen3_vl_ptq_only.yaml \
  --mode trace \
  --interesting-modules model.language_model model.visual

2. Move reusable logic into tico.quantization.recipes

Proposed structure:

tico/quantization/recipes/
├── __init__.py
├── config.py
├── context.py
├── runner.py
├── utils.py
├── qparams.py
├── adapters/
│   ├── base.py
│   ├── llama.py
│   └── qwen3_vl.py
├── stages/
│   ├── base.py
│   ├── gptq.py
│   ├── ptq.py
│   ├── spinquant.py
│   ├── cle.py
│   └── smoothquant.py
├── data/
│   ├── llm.py
│   └── vlm.py
├── evaluation/
│   ├── llm.py
│   ├── vlm.py
│   └── mmlu.py
├── export/
│   ├── checkpoint.py
│   └── circle.py
└── debug/
    └── trace.py

The examples/ scripts should only parse CLI arguments, load config, and call the recipe runner or adapter APIs.
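
As a rough illustration of how thin such an entrypoint could be, here is a minimal sketch of quantize.py. Only the `--config` and `--model` flags come from the usage examples above. The names `load_yaml` and `main`, the injectable `run` callable, and the `cfg["model"]["name"]` override key are assumptions made for this sketch, not the planned implementation.

```python
import argparse
from typing import Any, Callable, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser; --config and --model mirror the usage examples."""
    parser = argparse.ArgumentParser(description="Quantize a model from a recipe config.")
    parser.add_argument("--config", required=True, help="Path to a YAML recipe config.")
    parser.add_argument("--model", default=None, help="Optional model name or path override.")
    return parser


def load_yaml(path: str) -> Dict[str, Any]:
    """Read a YAML file into a dict. PyYAML is imported lazily so the module
    can be imported (and tested) without it."""
    import yaml

    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f) or {}


def main(
    argv: Optional[List[str]] = None,
    load_config: Callable[[str], Dict[str, Any]] = load_yaml,
    run: Optional[Callable[[Dict[str, Any]], Any]] = None,
) -> Any:
    """Parse arguments, load the config, and hand everything to the runner."""
    args = build_parser().parse_args(argv)
    cfg = load_config(args.config)
    if args.model is not None:
        # Assumption: a CLI override is stored under cfg["model"]["name"].
        cfg.setdefault("model", {})["name"] = args.model
    if run is None:
        # Placeholder: the real script would call the recipe runner here.
        raise NotImplementedError("plug in the recipe runner")
    return run(cfg)
```

Passing `load_config` and `run` as parameters keeps the script free of quantization logic and makes it testable without a model or a YAML file on disk.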


3. Represent algorithm combinations through config, not script names

Instead of creating separate scripts such as:

quantize_full_qmodel_with_gptq.py
quantize_qwen3_vl_with_gptq.py
quantize_model.py
quantize_for_conditional_generation.py

we should use YAML configs:

pipeline:
  - name: gptq
    enabled: true
    weight_bits: 4
    perchannel: true

  - name: ptq
    enabled: true
    activation_dtype: int16
    linear_weight_bits: 4

For PTQ-only:

pipeline:
  - name: ptq
    enabled: true
    activation_dtype: int16
    linear_weight_bits: 8

This makes the example layer stable even when we add more algorithms or model families.
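
One way the runner could consume these configs is to normalize each pipeline entry into a small typed record before dispatching. The sketch below assumes that `name` is required, that `enabled` defaults to true, and that every other key is a stage-specific option. `StageConfig` and `parse_pipeline` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class StageConfig:
    """One entry of the `pipeline:` list (hypothetical structure)."""
    name: str
    enabled: bool = True
    options: Dict[str, Any] = field(default_factory=dict)


def parse_pipeline(cfg: Dict[str, Any]) -> List[StageConfig]:
    """Validate the pipeline list and split common keys from stage options."""
    stages = []
    for i, entry in enumerate(cfg.get("pipeline", [])):
        if "name" not in entry:
            raise ValueError(f"pipeline[{i}] is missing required key 'name'")
        options = {k: v for k, v in entry.items() if k not in ("name", "enabled")}
        stages.append(StageConfig(entry["name"], bool(entry.get("enabled", True)), options))
    return stages


# The GPTQ + PTQ config above, written as the dict a YAML loader would return.
cfg = {
    "pipeline": [
        {"name": "gptq", "enabled": True, "weight_bits": 4, "perchannel": True},
        {"name": "ptq", "enabled": True, "activation_dtype": "int16", "linear_weight_bits": 4},
    ]
}
stages = parse_pipeline(cfg)
print([s.name for s in stages if s.enabled])  # ['gptq', 'ptq']
```

Validating early means a typo in a config fails with a clear message before any model is loaded.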


4. Use model adapters for model-family-specific behavior

Model-specific code should live in adapters:

recipes/adapters/
├── base.py
├── llama.py
└── qwen3_vl.py

Each adapter should handle:

- model/tokenizer/processor loading
- calibration input construction
- PTQ config construction
- calibration forward pass
- evaluation
- export behavior
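
The responsibilities listed above could map onto an abstract base class plus a small registry. In this sketch, `load_model`, `build_calibration_inputs`, `evaluate`, `export`, and `get_adapter` follow the names used in this proposal. `build_ptq_config`, `run_calibration`, and `register_adapter` are hypothetical names introduced here, and the context type is left open.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Type


class ModelAdapter(ABC):
    """Hypothetical interface for recipes/adapters/base.py."""

    @abstractmethod
    def load_model(self, ctx: Any) -> Any:
        """Load model/tokenizer/processor and return the updated context."""

    @abstractmethod
    def build_calibration_inputs(self, ctx: Any) -> Any:
        """Construct calibration inputs for this model family."""

    @abstractmethod
    def build_ptq_config(self, ctx: Any, stage_cfg: Dict[str, Any]) -> Any:
        """Translate a PTQ stage config into a family-specific PTQ config."""

    @abstractmethod
    def run_calibration(self, ctx: Any) -> None:
        """Run the calibration forward pass."""

    @abstractmethod
    def evaluate(self, ctx: Any) -> Any:
        """Evaluate the model held by the context."""

    @abstractmethod
    def export(self, ctx: Any) -> None:
        """Export the quantized model."""


_ADAPTERS: Dict[str, Type[ModelAdapter]] = {}


def register_adapter(family: str):
    """Class decorator that registers an adapter under a model-family key."""
    def decorator(cls: Type[ModelAdapter]) -> Type[ModelAdapter]:
        _ADAPTERS[family] = cls
        return cls
    return decorator


def get_adapter(family: str) -> ModelAdapter:
    """Instantiate the adapter registered for `family`."""
    if family not in _ADAPTERS:
        raise ValueError(f"unknown model family {family!r}; known: {sorted(_ADAPTERS)}")
    return _ADAPTERS[family]()
```

With a registry like this, supporting a new model family would mean adding one decorated class in recipes/adapters/ without touching the runner.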

This allows the common runner to stay generic:

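# cfg is the parsed YAML config; ctx is the shared context object passed
# between steps (presumably defined in recipes/context.py).
adapter = get_adapter(cfg["model"]["family"])
ctx = adapter.load_model(ctx)
ctx.calibration_inputs = adapter.build_calibration_inputs(ctx)

# Run each enabled stage in the order listed in the config.
for stage_cfg in cfg["pipeline"]:
    if not stage_cfg.get("enabled", True):
        continue
    stage = get_stage(stage_cfg["name"])
    ctx = stage.run(ctx, stage_cfg)

adapter.evaluate(ctx)
adapter.export(ctx)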
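
To make the stage side of this loop concrete, here is a self-contained toy version with a stage registry and placeholder stages that only record what they would do. `get_stage` and `stage.run(ctx, stage_cfg)` follow the snippet above. `RecipeContext`, `Stage`, `register_stage`, `run_pipeline`, and the skip for disabled entries are assumptions made for this sketch, and no real quantization happens.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RecipeContext:
    """Hypothetical shared state passed from stage to stage."""
    cfg: Dict[str, Any]
    model: Any = None
    calibration_inputs: Any = None
    history: List[str] = field(default_factory=list)


class Stage:
    """Hypothetical stage interface: take a context, return a context."""
    def run(self, ctx: RecipeContext, stage_cfg: Dict[str, Any]) -> RecipeContext:
        raise NotImplementedError


_STAGES: Dict[str, Stage] = {}


def register_stage(name: str):
    """Class decorator that registers one stage instance under `name`."""
    def decorator(cls):
        _STAGES[name] = cls()
        return cls
    return decorator


def get_stage(name: str) -> Stage:
    if name not in _STAGES:
        raise ValueError(f"unknown stage {name!r}; known: {sorted(_STAGES)}")
    return _STAGES[name]


def run_pipeline(ctx: RecipeContext) -> RecipeContext:
    """Run enabled stages in config order (skipping disabled ones is an assumption)."""
    for stage_cfg in ctx.cfg["pipeline"]:
        if not stage_cfg.get("enabled", True):
            continue
        ctx = get_stage(stage_cfg["name"]).run(ctx, stage_cfg)
    return ctx


@register_stage("gptq")
class FakeGptqStage(Stage):
    # Placeholder: records the call instead of quantizing weights.
    def run(self, ctx, stage_cfg):
        ctx.history.append(f"gptq(w{stage_cfg['weight_bits']})")
        return ctx


@register_stage("ptq")
class FakePtqStage(Stage):
    # Placeholder: records the call instead of wrapping and calibrating.
    def run(self, ctx, stage_cfg):
        ctx.history.append(f"ptq({stage_cfg['activation_dtype']})")
        return ctx


ctx = RecipeContext(cfg={"pipeline": [
    {"name": "gptq", "enabled": False, "weight_bits": 4},
    {"name": "ptq", "enabled": True, "activation_dtype": "int16"},
]})
print(run_pipeline(ctx).history)  # ['ptq(int16)']
```

Because stages only see the context and their own config entry, a GPTQ + PTQ recipe and a PTQ-only recipe differ only in the YAML file.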

Proposed Migration Mapping

| Current script | Proposed handling |
| --- | --- |
| quantize_full_qmodel_with_gptq.py | Move logic into recipes/adapters/llama.py, recipes/stages/*, recipes/evaluation/llm.py, and recipes/export/*. Replace usage with examples/quantize.py --config configs/llama_gptq_ptq.yaml. |
| evaluate_fk_llama_model.py | Merge into examples/evaluate.py and recipes/evaluation/llm.py. |
| static_llama_layer_runtime.py | Move to recipes/debug/ and expose through examples/inspect.py. |
| llama/quantize_decoder_layer_prefill.py | Move to integration tests or expose as an inspect/debug recipe. |
| quantize_qwen3_vl_with_gptq.py | Move logic into recipes/adapters/qwen3_vl.py, recipes/stages/*, recipes/evaluation/vlm.py, and recipes/evaluation/mmlu.py. |
| quantize_full_vlm_model_with_gptq.py | Merge into the common Qwen3-VL recipe path. |
| qwen/quantize_model.py | Move to synthetic smoke test or PTQ-only config. |
| qwen/quantize_for_conditional_generation.py | Merge into Qwen3-VL adapter/config path. |
| qwen/quantize_vision_mlp.py | Move to wrapper-level integration test. |
| qwen/quantize_vision_attention.py | Move to wrapper-level integration test. |
| qwen/trace_qwen.py | Move to recipes/debug/trace.py and expose through examples/inspect.py --mode trace. |

Integration Plan

I'll merge the related code into tico/quantization/examples. Until the new structure is stable, I'll keep the existing tico/quantization/wrapq/examples directory alongside it. After some time, the existing directory will be renamed to examples_deprecated and removed later.

Expected Outcome

After this refactor, adding a new model family or algorithm variant should not require adding another example script. Instead, we should add or update:

- one adapter, if it is a new model family
- one stage, if it is a new algorithm
- one YAML config, if it is a new recipe

The public example surface will remain small and stable, while the actual quantization workflow becomes easier to reuse, test, and maintain.
