Understand and visualize AWQ: an empirical look at whether "activation-aware importance" is real and actually matters.
New here? Start with the 0→100 walkthrough (docs/understanding.md): it builds up every concept (quantization, activations, AWQ, importance) from scratch and explains every figure.
AWQ (Activation-aware Weight Quantization, MLSys 2024 Best Paper) rests on one idea: not all weights are equal. A weight matters in proportion to the activation it multiplies, so the few "salient" channels should be protected. AWQ-Diag instruments Qwen2.5 with PyTorch forward hooks to make that idea concrete, visual, and testable:
What does AWQ's "activation importance" actually look like inside a real LLM, and does protecting the important channels really reduce quantization error?
The answer is yes, clearly, and the project shows it four ways:
- Importance is highly concentrated: the classic AWQ hockey-stick, where a handful of channels hold a disproportionate share of the total importance.
- The important channels are genuine activation outliers, concentrated in specific modules (`o_proj`, `down_proj`), not spread uniformly.
- You can see it: a 3D map of importance across every channel and layer (the spiky "outlier-channel" surface from the AWQ / SmoothQuant / LLM.int8() papers).
- It is operationally meaningful: we implement AWQ's scaling (protect the salient channels before quantizing) and measure the payoff. It cuts low-bit error most exactly where importance concentrates (`down_proj` ~2.3×, up to 25.9× on a single layer), and barely touches the low-importance modules. Protecting the "important" channels demonstrably helps, so the importance notion has real meaning, not just intuition.
The goal is to reproduce, visualize, and empirically verify AWQ's core mechanism: its saliency picture, the salient-channel structure, and the payoff of protecting it. It is not a new quantization method.
Measured on Qwen/Qwen2.5-1.5B, replicated on Qwen/Qwen2.5-0.5B:
| What we ask of AWQ's idea | Metric | Result |
|---|---|---|
| Is importance concentrated? | top-1% channel importance share | up to 17.6% in one layer (≈18× the uniform 1%): hockey-stick ✓ |
| Are the salient channels real outliers? | max excess kurtosis | κ ≈ 12 (`layers.1.mlp.down_proj`); heavy-tailed ✓ |
| Where do they live? | mean kurtosis / importance by module | concentrated in `down_proj` & `o_proj` (≈35–55× the other projections) ✓ |
| Does protecting them actually help? | 3-bit output-error reduction, AWQ vs RTN | `down_proj` 2.31× · `o_proj` 1.63× vs others ~1.2× (max 25.9×) ✓ |
| Does that hold across sizes? | same, on Qwen2.5-0.5B | `down_proj`/`o_proj` ~2.3× vs others ~1.2× (max 28.9×) ✓ |
The one-line takeaway: AWQ's "activation importance" is not just a heuristic. Importance is
sharply concentrated in a few real outlier channels (o_proj/down_proj), and protecting exactly
those channels with AWQ scaling measurably reduces quantization error. The protection helps most
precisely where the importance says it should.
1 · AWQ importance is concentrated (the hockey-stick). Sorted per-channel importance
|W|·|x| (log y): a few channels dominate, which is the entire premise of AWQ.
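As a rough illustration of the quantity behind this figure, the sketch below computes a per-input-channel importance |W|·|x| and the share held by the top 1% of channels. It is a minimal NumPy sketch, not the project's code: the function names are hypothetical, the data is synthetic (five planted outlier channels), and it assumes |W| is averaged over output rows and |x| over tokens.

```python
import numpy as np

def channel_importance(weight, acts):
    """Per-input-channel importance |W|·|x|.

    weight: (out_features, in_features); acts: (tokens, in_features).
    Assumption: |W| is averaged over output rows and |x| over tokens.
    """
    w_mag = np.abs(weight).mean(axis=0)
    x_mag = np.abs(acts).mean(axis=0)
    return w_mag * x_mag

def top_share(importance, frac=0.01):
    """Share of the total importance held by the top `frac` of channels."""
    k = max(1, int(round(frac * importance.size)))
    top = np.sort(importance)[-k:]
    return float(top.sum() / importance.sum())

# Synthetic stand-in for one layer: 1000 input channels, 5 of them outliers.
rng = np.random.default_rng(0)
acts = rng.normal(size=(512, 1000))
acts[:, :5] *= 50.0
weight = rng.normal(size=(64, 1000))

importance = channel_importance(weight, acts)
print(f"top-1% importance share: {top_share(importance):.1%}")
print(f"uniform baseline:        {top_share(np.ones(1000)):.1%}")
```

Sorting `importance` in descending order and plotting it on a log y-axis gives the hockey-stick shape described above.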
2 · Visualizing AWQ importance across the network. x = input channel, y = layer,
z = importance |W|·|x|. The spiky towers are the salient/outlier channels AWQ protects (here for
down_proj, the highest-outlier family). This is the picture that motivates activation-aware
quantization, drawn from a real model.
3 · Where the importance lives. Activation outliers, and therefore importance, are
concentrated in o_proj and down_proj, far above the other projections.
4 · Importance is meaningful: protecting the salient channels works. We implement AWQ's
per-channel scaling and measure the 3-bit output-error reduction vs plain round-to-nearest (RTN). The
benefit lands exactly on the high-importance families (o_proj, down_proj); the low-importance
families barely move. This is the empirical payoff of the importance notion.
Cross-model replication. The AWQ benefit by module family on both sizes; the protection lands on
o_proj/down_proj in each:
Supporting diagnostics
Per layer, kurtosis confirms the salient channels are genuine heavy-tailed outliers (an
importance_surface_o_proj view is also generated per model):
For every nn.Linear inside the Transformer blocks, a forward hook collects per-input-channel:
| Statistic | Meaning |
|---|---|
| `channel_magnitude` | mean \|x\|: the AWQ saliency signal |
| `kurtosis` | excess kurtosis: confirms the salient channels are genuine heavy-tailed outliers |
| `outlier_ratio` | fraction of \|x\| > 6σ |
| `channel_variance`, `channel_max` | distribution spread / worst case |
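The statistics in this table can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the project's `ActivationCollector`: the function name is hypothetical, σ is assumed to be the per-channel standard deviation, `channel_max` is assumed to mean max |x|, and the activations are synthetic (Gaussian channels plus one heavier-tailed Laplace channel).

```python
import numpy as np

def channel_stats(acts, outlier_sigma=6.0):
    """Per-input-channel statistics for activations of shape (tokens, channels)."""
    mean = acts.mean(axis=0)
    var = acts.var(axis=0)
    std = np.sqrt(var)
    # Fourth central moment, used for excess kurtosis (0 for a Gaussian).
    m4 = ((acts - mean) ** 4).mean(axis=0)
    return {
        "channel_magnitude": np.abs(acts).mean(axis=0),
        "kurtosis": m4 / np.maximum(var, 1e-12) ** 2 - 3.0,
        # Assumption: sigma is the per-channel standard deviation.
        "outlier_ratio": (np.abs(acts) > outlier_sigma * std).mean(axis=0),
        "channel_variance": var,
        "channel_max": np.abs(acts).max(axis=0),
    }

# Synthetic activations: 7 Gaussian channels and 1 heavy-tailed (Laplace) channel.
rng = np.random.default_rng(0)
acts = rng.normal(size=(20000, 8))
acts[:, -1] = rng.laplace(size=20000)

stats = channel_stats(acts)
print("excess kurtosis per channel:", np.round(stats["kurtosis"], 2))
```

In a real run the `acts` array would come from a forward hook on each linear layer rather than from a random generator.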
From these it builds the AWQ importance |W|·|x| per channel (and the top-1% share), the per-layer
weight-quantization error across {8,6,4,3,2} bits (both an activation-weighted proxy and the real
output error ‖Wx − Ŵx‖/‖Wx‖), and, as the key step, runs an AWQ scaling search: for each
layer/bit it grid-searches the per-input-channel scaling s = (mean|x|)^α that minimizes output error
(α=0 is exactly plain RTN), reporting how much protecting the salient channels beats RTN.
See docs/report.md for the full method and math.
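The scaling search described above can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the code in `quant.py` or `hooks.py`: the function names, the 11-point α grid over [0, 1], and the synthetic layer with four planted outlier channels are all illustrative. The quantizer is symmetric per-output-channel round-to-nearest, and the error is the relative output error ‖Wx − Ŵx‖/‖Wx‖.

```python
import numpy as np

def rtn_quantize(weight, bits):
    """Symmetric per-output-channel round-to-nearest (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(weight).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    return np.clip(np.round(weight / scale), -qmax, qmax) * scale

def output_error(weight, weight_hat, acts):
    """Relative output error over the calibration activations."""
    ref = acts @ weight.T
    return float(np.linalg.norm(ref - acts @ weight_hat.T) / np.linalg.norm(ref))

def awq_scale_search(weight, acts, bits=3, n_grid=11):
    """Grid-search alpha for s = (mean|x|)^alpha; alpha = 0 is plain RTN."""
    x_mag = np.maximum(np.abs(acts).mean(axis=0), 1e-8)
    best_alpha, best_err = 0.0, np.inf
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = x_mag ** alpha
        # Scale salient input channels up before quantizing, then fold 1/s back
        # into the weights (equivalent to dividing the activations by s).
        w_hat = rtn_quantize(weight * s, bits) / s
        err = output_error(weight, w_hat, acts)
        if err < best_err:
            best_alpha, best_err = float(alpha), err
    return best_alpha, best_err

# Synthetic layer: 128 input channels, 4 of them with large activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 128))
acts[:, :4] *= 30.0
weight = rng.normal(size=(64, 128))

rtn_err = output_error(weight, rtn_quantize(weight, 3), acts)
alpha, awq_err = awq_scale_search(weight, acts, bits=3)
print(f"RTN error {rtn_err:.4f} | AWQ error {awq_err:.4f} (alpha={alpha:.1f}) "
      f"| reduction {rtn_err / awq_err:.2f}x")
```

Because α=0 is part of the grid, the searched error can never exceed the RTN error, so the reported reduction factor is at least 1.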
A small follow-up, not claimed as novel. AWQ picks each group's scaling exponent α by a grid
search (the bulk of its calibration cost). When the whole model is quantized (group-wise asymmetric) and
WikiText-2 perplexity is measured, a single global α matches or beats the official-style
block-level AWQ scale search in every config (clearly at 3-bit, tied at 4-bit):
This is consistent with prior work (a global migration strength is exactly
SmoothQuant's α=0.5; 4-bit group-wise RTN is known to be
near-lossless), so the value is the clean end-to-end reproduction, not novelty. The official
llm-awq scale search would not run on Qwen2.5 + current transformers, so the AWQ baseline is a
faithful reimplementation of its block-level grouping (no clipping); see
docs/report.md §9 for method, the per-layer α study, and caveats.
The environment is managed with micromamba (or conda/mamba).

    # 1. Create the environment (PyTorch cu128; adjust for your CUDA / CPU)
    micromamba env create -f environment.yml
    micromamba activate awq-diag

    # 2. Run the diagnostic on one model (writes results/ + figures/)
    python scripts/run_diagnostic.py --model Qwen/Qwen2.5-1.5B
    python scripts/run_diagnostic.py --model Qwen/Qwen2.5-0.5B

    # 3. Build the cross-model comparison
    python scripts/compare_models.py results/diagnostic_*.json

    # 4. (optional) run the unit tests
    pytest

CPU-only / non-CUDA machines:

    python scripts/run_diagnostic.py --model Qwen/Qwen2.5-0.5B --device cpu --dtype float32

Outputs land in:

    results/diagnostic_<model>.json   # full per-layer record + summary (see schema below)
    figures/<model>/*.png             # 9 per-model figures (incl. 3D importance surfaces + AWQ benefit)
    results/cross_model_summary.md


    AWQ-Diag/
    ├── src/awq_diag/              # the package
    │   ├── config.py              # DiagConfig: one object controls a run
    │   ├── data.py                # calibration texts
    │   ├── model_utils.py         # model loading + layer bookkeeping
    │   ├── hooks.py               # ActivationCollector + AWQErrorCollector (the AWQ scaling search)
    │   ├── quant.py               # symmetric per-channel quant, AWQ scaling, error metrics
    │   ├── analysis.py            # per-layer records, summary, module-family, importance
    │   ├── plotting.py            # the 9 figures (incl. 3D AWQ importance surface, AWQ benefit)
    │   ├── pipeline.py            # end-to-end orchestration
    │   └── cli.py                 # `awq-diag` console entry
    ├── scripts/
    │   ├── run_diagnostic.py      # run one model (importance / outliers / AWQ benefit)
    │   ├── compare_models.py      # cross-model summary
    │   ├── alpha_study.py         # extension: does a cheap stat predict AWQ's optimal α?
    │   ├── perplexity_eval.py     # extension: const-α vs block-AWQ vs RTN on WikiText-2 ppl
    │   └── plot_perplexity.py     # extension: the perplexity comparison figure
    ├── results/                   # JSON outputs + cross-model table
    ├── figures/                   # generated PNGs
    ├── notebooks/
    │   └── awq_diagnostic.ipynb   # the original exploratory notebook (bilingual, educational)
    ├── docs/
    │   ├── report.md              # full write-up (English)
    │   └── understanding.md       # 0→100 walkthrough (Chinese)
    ├── tests/                     # pytest (quant core, CPU-only, no model download)
    ├── environment.yml            # micromamba/conda environment
    ├── requirements.txt           # pip fallback
    └── pyproject.toml

The .py pipeline is the canonical, reproducible entry point; it reproduces the original
exploratory notebook's importance/saliency numbers (e.g. top-κ layer layers.1.mlp.down_proj, κ ≈ 12).
- Simplified quantizer. The base is symmetric per-output-channel round-to-nearest; the AWQ pass adds the activation-aware per-channel scaling search on top. It captures AWQ's mechanism but is not the full deployed AWQ (group-wise + asymmetric zero-point + folded scales), and there is no GPTQ baseline, so absolute error magnitudes are illustrative, not production numbers.
- Layer-local error, not end-task quality (perplexity / accuracy): the AWQ benefit is measured at the layer output, not yet propagated to model-level metrics.
- One architecture family (Qwen2.5, two sizes) and a small calibration set (4 paragraphs).
- Group-wise + asymmetric AWQ to move from "mechanism demo" toward the real quantizer.
- Connect the layer-level AWQ benefit to model-level quality (perplexity / logit KL): does protecting the important channels recover end-task accuracy, not just layer-output error?
- More model families (Llama / Gemma / Phi) to test whether the `o_proj`/`down_proj` importance concentration is universal.
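For orientation, the group-wise asymmetric scheme mentioned in the list above is commonly implemented along the following lines. This is a generic NumPy sketch, not code from this repository: the function name, the group size of 128, and the random test matrix are assumptions chosen for illustration, and it performs quantize-then-dequantize ("fake" quantization) only.

```python
import numpy as np

def groupwise_asym_quantize(weight, bits=4, group_size=128):
    """Group-wise asymmetric fake quantization along the input dimension.

    Each run of `group_size` input channels in a row gets its own scale and
    integer zero-point, so the grid covers [min, max] of that group.
    """
    out_features, in_features = weight.shape
    if in_features % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    groups = weight.reshape(out_features, in_features // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    qmax = 2 ** bits - 1
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero = np.round(-w_min / scale)              # integer zero-point per group
    q = np.clip(np.round(groups / scale) + zero, 0, qmax)
    return ((q - zero) * scale).reshape(out_features, in_features)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 256))
w_hat = groupwise_asym_quantize(w, bits=4, group_size=128)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"4-bit group-wise asymmetric relative weight error: {rel_err:.4f}")
```

Compared with one symmetric scale per output row, the per-group range is usually narrower, which is why this variant tends to lose less precision at the same bit width.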
- AWQ: Activation-aware Weight Quantization (MLSys 2024 Best Paper)
- GPTQ: Accurate Post-Training Quantization
- SmoothQuant: arXiv 2211.10438
- LLM.int8(): arXiv 2208.07339
MIT; see LICENSE.







Schema of results/diagnostic_<model>.json (abridged):

    {
      "model": "Qwen/Qwen2.5-1.5B",
      "config": { "bit_widths": [8,6,4,3,2], "outlier_sigma": 6.0, "seed": 0, ... },
      "model_info": { "num_params": ..., "num_layers": 28, "num_linear_analyzed": 196, ... },
      "summary": {
        "awq_reduction_3bit": { "min": 1.0, "median": .., "max": 25.85, ... },   // AWQ vs RTN benefit
        "module_family": {
          "down_proj": { "mean_kurtosis": 4.31, "mean_awq_reduction_3bit": 2.31, ... },
          ...
        },
        "per_bit_median_output_error": { "8": .., "4": 0.022, "3": 0.079, "2": 0.268 },
        "correlations": { ... }   // includes the supporting kurtosis / proxy diagnostics
      },
      "layers": {
        "model.layers.0.self_attn.q_proj": {
          "module_type": "q_proj", "layer_idx": 0,
          "mean_kurtosis": .., "top1pct_importance_share": ..,
          "output_error": { "8": .., "3": .., "2": .. },
          "awq_output_error": { ... },
          "awq_reduction_3bit": .., "awq_best_alpha": { ... }
        }
      }
    }
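A result file with this layout can be summarized with the standard library alone. The sketch below is illustrative: `family_benefit` is a hypothetical helper, and the inline `demo` record is a hand-written stand-in containing only the two per-family values quoted in the results table, assuming they correspond to the `mean_awq_reduction_3bit` field. In practice the record would come from `json.load` on a file under `results/`.

```python
import json

def family_benefit(record):
    """Return (family, mean 3-bit AWQ-vs-RTN reduction) pairs, largest first."""
    families = record["summary"]["module_family"]
    pairs = [(name, stats["mean_awq_reduction_3bit"]) for name, stats in families.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Hand-written stand-in for a loaded diagnostic file; only the fields used
# by family_benefit are present.
demo = json.loads("""
{"summary": {"module_family": {
    "o_proj":    {"mean_awq_reduction_3bit": 1.63},
    "down_proj": {"mean_awq_reduction_3bit": 2.31}
}}}
""")

for name, reduction in family_benefit(demo):
    print(f"{name:10s} {reduction:.2f}x")
```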