Lewis-panda/llm-quant-diagnostics

AWQ-Diag

Understand and visualize AWQ: an empirical look at whether "activation-aware importance" is real and actually matters.

📖 New here? Start with the 0→100 walkthrough (docs/understanding.md): it builds up every concept (quantization, activations, AWQ, importance) from scratch and explains every figure.

AWQ (Activation-aware Weight Quantization, MLSys 2024 Best Paper) rests on one idea: not all weights are equal. A weight matters in proportion to the activation it multiplies, so the few "salient" channels should be protected. AWQ-Diag instruments Qwen2.5 with PyTorch forward hooks to make that idea concrete, visual, and testable:

What does AWQ's "activation importance" actually look like inside a real LLM, and does protecting the important channels really reduce quantization error?

The answer is yes, clearly, and the project shows it four ways:

  1. Importance is highly concentrated. This is the classic AWQ hockey-stick: a handful of channels hold a disproportionate share of the total importance.
  2. The important channels are genuine activation outliers, concentrated in specific modules (o_proj, down_proj) rather than spread uniformly.
  3. You can see it: a 3D map of importance across every channel and layer (the spiky "outlier-channel" surface from the AWQ / SmoothQuant / LLM.int8() papers).
  4. It is operationally meaningful. We implement AWQ's scaling (protect the salient channels before quantizing) and measure the payoff: it cuts low-bit error most exactly where importance concentrates (down_proj ~2.3×, up to 25.9× on a single layer), and barely touches the low-importance modules. Protecting the "important" channels demonstrably helps, so the importance notion has real meaning, not just intuition.

The goal is to reproduce, visualize, and empirically verify AWQ's core mechanism: its saliency picture, the salient-channel structure, and the payoff of protecting it. It is not a new quantization method.


Key findings

Measured on Qwen/Qwen2.5-1.5B, replicated on Qwen/Qwen2.5-0.5B:

| What we ask of AWQ's idea | Metric | Result |
| --- | --- | --- |
| Is importance concentrated? | top-1% channel importance share | up to 17.6% in one layer (≈18× the uniform 1%); hockey-stick ✅ |
| Are the salient channels real outliers? | max excess kurtosis | κ ≈ 12 (layers.1.mlp.down_proj); heavy-tailed ✅ |
| Where do they live? | mean kurtosis / importance by module | concentrated in down_proj & o_proj (≈35–55× the other projections) ✅ |
| Does protecting them actually help? | 3-bit output-error reduction, AWQ vs RTN | down_proj 2.31× · o_proj 1.63× vs others ~1.2× (max 25.9×) ✅ |
| Does that hold across sizes? | same, on Qwen2.5-0.5B | down_proj/o_proj ~2.3× vs others ~1.2× (max 28.9×) ✅ |

The one-line takeaway: AWQ's "activation importance" is not just a heuristic. Importance is sharply concentrated in a few real outlier channels (o_proj/down_proj), and protecting exactly those channels with AWQ scaling measurably reduces quantization error. The protection helps most precisely where the importance says it should.


Figures

1 · AWQ importance is concentrated (the hockey-stick). Sorted per-channel importance |W|·|x| (log y): a few channels dominate, which is the entire premise of AWQ.

[Figure: saliency]
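To make the "top-1% share" behind this figure concrete, here is a minimal NumPy sketch on synthetic data. It assumes the per-channel importance is the mean |W| over output rows times the mean |x| of that input channel; the helper names `channel_importance` and `top_share` are hypothetical and not taken from the repository.

```python
import numpy as np


def channel_importance(weight, act_magnitude):
    """Per-input-channel importance: mean |W| over output rows times mean |x| (assumed definition)."""
    return np.abs(weight).mean(axis=0) * act_magnitude


def top_share(importance, fraction=0.01):
    """Share of the total importance held by the top `fraction` of channels."""
    k = max(1, int(round(fraction * importance.size)))
    top = np.sort(importance)[::-1][:k]
    return float(top.sum() / importance.sum())


# Synthetic layer: 1000 input channels, ten of which carry inflated activations.
rng = np.random.default_rng(0)
weight = rng.normal(size=(64, 1000))               # (out_features, in_features)
act_magnitude = np.abs(rng.normal(size=1000)) + 0.1
act_magnitude[:10] *= 50.0                         # synthetic outlier channels

importance = channel_importance(weight, act_magnitude)
share = top_share(importance)                      # concentrated case
uniform_share = top_share(np.ones(1000))           # baseline: exactly 1% when all channels are equal
print(f"top-1% share: {share:.3f} (uniform baseline {uniform_share:.3f})")
```

Sorting `importance` in descending order and plotting it on a log axis would give the hockey-stick shape described in the caption.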

2 · Visualizing AWQ importance across the network. x = input channel, y = layer, z = importance |W|·|x|. The spiky towers are the salient/outlier channels AWQ protects (here for down_proj, the highest-outlier family). This is the picture that motivates activation-aware quantization, drawn from a real model.

[Figure: surface]

3 · Where the importance lives. Activation outliers, and therefore importance, are concentrated in o_proj and down_proj, far above the other projections.

[Figure: family]

4 · Importance is meaningful: protecting the salient channels works. We implement AWQ's per-channel scaling and measure the 3-bit output-error reduction vs plain round-to-nearest. The benefit lands exactly on the high-importance families (o_proj, down_proj); the low-importance families barely move. This is the empirical payoff of the importance notion.

[Figure: awq]

Cross-model replication. The AWQ benefit by module family on both model sizes; the protection lands on o_proj/down_proj in each:

[Figure: cross]

Supporting diagnostics

Per layer, kurtosis confirms the salient channels are genuine heavy-tailed outliers (an importance_surface_o_proj view is also generated per model):

[Figure: kurtosis]


What it measures

For every nn.Linear inside the Transformer blocks, a forward hook collects per-input-channel:

| Statistic | Meaning |
| --- | --- |
| channel_magnitude | mean \|x\|, the AWQ saliency signal |
| kurtosis | excess kurtosis; confirms the salient channels are genuine heavy-tailed outliers |
| outlier_ratio | fraction of \|x\| > 6σ |
| channel_variance, channel_max | distribution spread / worst case |
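The statistics above can be sketched with a plain PyTorch forward hook on a single `nn.Linear`. This is an illustration under stated assumptions, not the repository's `ActivationCollector`: the class name `ChannelStatsHook` is hypothetical, σ is taken per channel, and all activations are kept in memory, which only suits a toy input.

```python
import torch
import torch.nn as nn


class ChannelStatsHook:
    """Hypothetical collector: gathers the inputs of one nn.Linear via a forward hook."""

    def __init__(self, outlier_sigma=6.0):
        self.outlier_sigma = outlier_sigma
        self.batches = []                            # list of (tokens, in_features) tensors

    def __call__(self, module, inputs, output):
        x = inputs[0].detach().float()
        self.batches.append(x.reshape(-1, x.shape[-1]))

    def summary(self):
        x = torch.cat(self.batches, dim=0)           # (tokens, channels)
        mean = x.mean(dim=0)
        var = ((x - mean) ** 2).mean(dim=0)
        std = var.sqrt().clamp_min(1e-12)
        z = (x - mean) / std
        return {
            "channel_magnitude": x.abs().mean(dim=0),
            "channel_variance": var,
            "channel_max": x.abs().amax(dim=0),
            "kurtosis": (z ** 4).mean(dim=0) - 3.0,  # excess kurtosis, 0 for a Gaussian
            # Assumption: sigma is the per-channel standard deviation.
            "outlier_ratio": (x.abs() > self.outlier_sigma * std).float().mean(dim=0),
        }


torch.manual_seed(0)
layer = nn.Linear(8, 4)
hook = ChannelStatsHook()
handle = layer.register_forward_hook(hook)

x = torch.randn(2, 512, 8)
x[0, :5, 3] = 50.0                                   # a few spikes turn channel 3 into an outlier channel
with torch.no_grad():
    layer(x)
handle.remove()

stats = hook.summary()
print("highest-kurtosis channel:", int(stats["kurtosis"].argmax()))
```

A real run would register one such hook per linear layer and accumulate running moments instead of storing every activation.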

From these it builds the AWQ importance |W|·|x| per channel (and the top-1% share), the per-layer weight-quantization error across {8,6,4,3,2} bits (both an activation-weighted proxy and the real output error ‖Wx − Ŵx‖/‖Wx‖), and, as the key step, runs an AWQ scaling search: for each layer and bit width it grid-searches the exponent α of the per-input-channel scaling s = (mean|x|)^α to minimize output error (α=0 is exactly plain RTN), and reports how much protecting the salient channels beats RTN.
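The scaling search described above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the repository's `AWQErrorCollector`: the function names, the 11-point α grid on [0, 1], and the synthetic data are all assumptions.

```python
import numpy as np


def quantize_rtn(weight, bits):
    """Symmetric per-output-channel round-to-nearest (one step size per row)."""
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(weight).max(axis=1, keepdims=True) / qmax
    step = np.where(step == 0, 1.0, step)
    return np.clip(np.round(weight / step), -qmax, qmax) * step


def output_error(weight, weight_hat, x):
    """Relative output error ||Wx - What x|| / ||Wx|| on calibration inputs x."""
    ref = x @ weight.T
    return float(np.linalg.norm(ref - x @ weight_hat.T) / np.linalg.norm(ref))


def awq_search(weight, x, bits, alphas=np.linspace(0.0, 1.0, 11)):
    """Grid-search alpha for s = mean|x| ** alpha; alpha = 0 reproduces plain RTN."""
    magnitude = np.abs(x).mean(axis=0)
    best_alpha, best_err = 0.0, np.inf
    for alpha in alphas:
        s = magnitude ** alpha
        # Scale the input channels up, quantize, then fold the scale back out.
        weight_hat = quantize_rtn(weight * s, bits) / s
        err = output_error(weight, weight_hat, x)
        if err < best_err:
            best_alpha, best_err = float(alpha), err
    return best_alpha, best_err


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))                 # (tokens, in_features)
x[:, :2] *= 30.0                               # two synthetic salient input channels
weight = rng.normal(size=(32, 64))             # (out_features, in_features)

rtn_err = output_error(weight, quantize_rtn(weight, 3), x)
alpha, awq_err = awq_search(weight, x, bits=3)
print(f"best alpha={alpha:.1f}, error reduction vs RTN: {rtn_err / awq_err:.2f}x")
```

Because α=0 is part of the grid, the searched error can never be worse than RTN, so the reported reduction is at least 1×.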

See docs/report.md for the full method and math.


Extension (investigation): is AWQ's per-layer scale search necessary?

A small follow-up, not claimed as novel. AWQ picks each group's scaling exponent α by a grid search (the bulk of its calibration cost). When the whole model is quantized (group-wise asymmetric) and WikiText-2 perplexity is measured, a single global α matches or beats the official-style block-level AWQ scale search in every config (clearly at 3-bit, tied at 4-bit):

[Figure: ppl]

This is consistent with prior work (a global migration strength is exactly SmoothQuant's α=0.5; 4-bit group-wise RTN is known to be near-lossless), so the value is the clean end-to-end reproduction, not novelty. The official llm-awq scale search would not run on Qwen2.5 + current transformers, so the AWQ baseline is a faithful reimplementation of its block-level grouping (no clipping); see docs/report.md §9 for method, the per-layer α study, and caveats.
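For readers unfamiliar with the "group-wise asymmetric" quantizer used in this extension, the sketch below shows the general technique in NumPy. The function name and the group size of 128 are assumptions made for illustration; the settings actually used are described in docs/report.md.

```python
import numpy as np


def quantize_groupwise_asymmetric(weight, bits, group_size=128):
    """Asymmetric (zero-point) RTN with one scale and zero-point per group of input channels."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    groups = weight.reshape(out_features, in_features // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    qmax = 2 ** bits - 1
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero = np.round(-w_min / scale)                    # integer zero-point per group
    q = np.clip(np.round(groups / scale) + zero, 0, qmax)
    return ((q - zero) * scale).reshape(weight.shape)  # dequantized weights


def relative_error(weight, weight_hat):
    return float(np.linalg.norm(weight - weight_hat) / np.linalg.norm(weight))


rng = np.random.default_rng(0)
weight = rng.normal(size=(16, 256))
err4 = relative_error(weight, quantize_groupwise_asymmetric(weight, bits=4))
err3 = relative_error(weight, quantize_groupwise_asymmetric(weight, bits=3))
print(f"relative weight error: 4-bit {err4:.3f}, 3-bit {err3:.3f}")
```

Each group gets its own range, so a single outlier weight only coarsens the grid of its own group rather than the whole row.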


Quickstart

The environment is managed with micromamba (or conda/mamba).

```shell
# 1. Create the environment (PyTorch cu128; adjust for your CUDA / CPU)
micromamba env create -f environment.yml
micromamba activate awq-diag

# 2. Run the diagnostic on one model (writes results/ + figures/)
python scripts/run_diagnostic.py --model Qwen/Qwen2.5-1.5B
python scripts/run_diagnostic.py --model Qwen/Qwen2.5-0.5B

# 3. Build the cross-model comparison
python scripts/compare_models.py results/diagnostic_*.json

# 4. (optional) run the unit tests
pytest
```

CPU-only / non-CUDA machines:

```shell
python scripts/run_diagnostic.py --model Qwen/Qwen2.5-0.5B --device cpu --dtype float32
```

Outputs land in:

```text
results/diagnostic_<model>.json      # full per-layer record + summary (see schema below)
figures/<model>/*.png                # 9 per-model figures (incl. 3D importance surfaces + AWQ benefit)
results/cross_model_summary.md
```

Repository layout

```text
AWQ-Diag/
├── src/awq_diag/          # the package
│   ├── config.py          # DiagConfig: one object controls a run
│   ├── data.py            # calibration texts
│   ├── model_utils.py     # model loading + layer bookkeeping
│   ├── hooks.py           # ActivationCollector + AWQErrorCollector (the AWQ scaling search)
│   ├── quant.py           # symmetric per-channel quant, AWQ scaling, error metrics
│   ├── analysis.py        # per-layer records, summary, module-family, importance
│   ├── plotting.py        # the 9 figures (incl. 3D AWQ importance surface, AWQ benefit)
│   ├── pipeline.py        # end-to-end orchestration
│   └── cli.py             # `awq-diag` console entry
├── scripts/
│   ├── run_diagnostic.py  # run one model (importance / outliers / AWQ benefit)
│   ├── compare_models.py  # cross-model summary
│   ├── alpha_study.py     # extension: does a cheap stat predict AWQ's optimal α?
│   ├── perplexity_eval.py # extension: const-α vs block-AWQ vs RTN on WikiText-2 ppl
│   └── plot_perplexity.py # extension: the perplexity comparison figure
├── results/               # JSON outputs + cross-model table
├── figures/               # generated PNGs
├── notebooks/
│   └── awq_diagnostic.ipynb   # the original exploratory notebook (bilingual, educational)
├── docs/
│   ├── report.md          # full write-up (English)
│   └── understanding.md   # 0→100 walkthrough (Chinese)
├── tests/                 # pytest (quant core, CPU-only, no model download)
├── environment.yml        # micromamba/conda environment
├── requirements.txt       # pip fallback
└── pyproject.toml
```

The .py pipeline is the canonical, reproducible entry point; it reproduces the original exploratory notebook's importance/saliency numbers (e.g. top-κ layer layers.1.mlp.down_proj, κ ≈ 12).


Output JSON schema

```json
{
  "model": "Qwen/Qwen2.5-1.5B",
  "config":      { "bit_widths": [8,6,4,3,2], "outlier_sigma": 6.0, "seed": 0, ... },
  "model_info":  { "num_params": ..., "num_layers": 28, "num_linear_analyzed": 196, ... },
  "summary": {
    "awq_reduction_3bit": { "min": 1.0, "median": .., "max": 25.85, ... },  // AWQ vs RTN benefit
    "module_family": { "down_proj": { "mean_kurtosis": 4.31, "mean_awq_reduction_3bit": 2.31, ... }, ... },
    "per_bit_median_output_error": { "8": .., "4": 0.022, "3": 0.079, "2": 0.268 },
    "correlations": { ... }            // includes the supporting kurtosis / proxy diagnostics
  },
  "layers": {
    "model.layers.0.self_attn.q_proj": {
      "module_type": "q_proj", "layer_idx": 0,
      "mean_kurtosis": .., "top1pct_importance_share": ..,
      "output_error": { "8": .., "3": .., "2": .. }, "awq_output_error": { ... },
      "awq_reduction_3bit": .., "awq_best_alpha": { ... }
    }
  }
}
```
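As a usage sketch, the snippet below ranks module families and finds the best-helped layer from a record shaped like this schema. The inline `sample` is a hypothetical miniature with made-up placeholder values; only the key names follow the schema, and the helper names are not part of the package.

```python
import json

# Hypothetical miniature of results/diagnostic_<model>.json.
# The numbers are placeholders for illustration, not measured results.
sample = json.loads("""
{
  "model": "example/model",
  "summary": {
    "module_family": {
      "q_proj":    {"mean_kurtosis": 0.1, "mean_awq_reduction_3bit": 1.0},
      "down_proj": {"mean_kurtosis": 5.0, "mean_awq_reduction_3bit": 3.0},
      "o_proj":    {"mean_kurtosis": 4.0, "mean_awq_reduction_3bit": 2.0}
    }
  },
  "layers": {
    "model.layers.0.self_attn.q_proj": {"module_type": "q_proj", "layer_idx": 0, "awq_reduction_3bit": 1.0},
    "model.layers.1.mlp.down_proj": {"module_type": "down_proj", "layer_idx": 1, "awq_reduction_3bit": 3.5}
  }
}
""")


def rank_families(result):
    """Module families sorted by mean 3-bit AWQ-vs-RTN error reduction, best first."""
    families = result["summary"]["module_family"]
    return sorted(families, key=lambda name: families[name]["mean_awq_reduction_3bit"], reverse=True)


def best_layer(result):
    """Name of the layer with the largest 3-bit AWQ-vs-RTN error reduction."""
    layers = result["layers"]
    return max(layers, key=lambda name: layers[name]["awq_reduction_3bit"])


ranking = rank_families(sample)
top_layer = best_layer(sample)
print(ranking, top_layer)
```

For a real run, replace `sample` with `json.load(open(path))` on one of the files written to results/.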

Limitations & honest scope

  • Simplified quantizer. The base is symmetric per-output-channel round-to-nearest; the AWQ pass adds the activation-aware per-channel scaling search on top. It captures AWQ's mechanism but is not the full deployed AWQ (group-wise + asymmetric zero-point + folded scales), and there is no GPTQ baseline, so absolute error magnitudes are illustrative, not production numbers.
  • Layer-local error, not end-task quality (perplexity / accuracy). In the main diagnostic the AWQ benefit is measured at the layer output and is not propagated to model-level metrics; only the extension above reports WikiText-2 perplexity.
  • One architecture family (Qwen2.5, two sizes) and a small calibration set (4 paragraphs).

Next steps

  1. Group-wise + asymmetric AWQ to move from "mechanism demo" toward the real quantizer.
  2. Connect the layer-level AWQ benefit to model-level quality (perplexity / logit KL): does protecting the important channels recover end-task accuracy, not just layer-output error?
  3. More model families (Llama / Gemma / Phi) to test whether the o_proj/down_proj importance concentration is universal.
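The two model-level metrics named in step 2 can be written down compactly. The following NumPy sketch uses synthetic logits and hypothetical function names; it only illustrates the definitions (perplexity as the exponential of the mean token negative log-likelihood, and the mean KL divergence between reference and quantized output distributions), not an evaluation pipeline.

```python
import numpy as np


def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def perplexity(logits, targets):
    """exp of the mean negative log-likelihood of the target tokens."""
    logp = log_softmax(logits)
    nll = -logp[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))


def mean_logit_kl(ref_logits, quant_logits):
    """Mean KL(p_ref || p_quant) over token positions."""
    logp, logq = log_softmax(ref_logits), log_softmax(quant_logits)
    return float((np.exp(logp) * (logp - logq)).sum(axis=-1).mean())


rng = np.random.default_rng(0)
vocab, tokens = 100, 32
ref_logits = rng.normal(size=(tokens, vocab))
quant_logits = ref_logits + 0.1 * rng.normal(size=(tokens, vocab))  # stand-in for quantization noise
targets = rng.integers(0, vocab, size=tokens)

uniform_ppl = perplexity(np.zeros((tokens, vocab)), targets)        # equals the vocabulary size
kl_same = mean_logit_kl(ref_logits, ref_logits)
kl_quant = mean_logit_kl(ref_logits, quant_logits)
print(f"uniform ppl={uniform_ppl:.1f}, KL(ref, quant)={kl_quant:.5f}")
```

Logit KL needs no labels and compares the quantized model directly against its full-precision reference, which makes it a useful complement to perplexity.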

License

MIT (see LICENSE).
