[DRAFT][quantization] Full quantization of LLama compatible models by stamalakhov · Pull Request #436 · Samsung/TICO

stamalakhov · 2026-01-13T13:44:30Z

This draft tries to produce fully quantized Circle layers for Llama models.

TODO:

tests/cleanup

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

mhs4670go · 2026-01-21T05:42:10Z

                    gptq[name] = GPTQ(subset[name])
                    gptq[name].quantizer.configure(
-                        bits=8, perchannel=True, sym=False, mse=False
+                        bits=4, perchannel=True, sym=False, mse=False


FYI, you can pass this as an option with this PR.

@mhs4670go
Thank you. I'll rebase after #441 is merged.
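
To make the `bits=4, perchannel=True, sym=False` settings in the diff above concrete, here is a minimal pure-Python sketch of asymmetric, per-output-channel min/max quantization. It follows the scheme commonly used by GPTQ-style quantizers, but it is not TICO's code; the names `quantize_row_asym` and `quantize_per_channel` are hypothetical.

```python
def quantize_row_asym(row, bits=4):
    """Fake-quantize one output channel on an asymmetric min/max grid."""
    maxq = 2**bits - 1                    # 15 levels above zero for 4 bits
    lo = min(min(row), 0.0)               # the range always contains 0.0
    hi = max(max(row), 0.0)
    if hi == lo:                          # all-zero row: nothing to quantize
        return list(row), 1.0, 0
    scale = (hi - lo) / maxq
    zero = round(-lo / scale)             # integer zero point (sym=False)
    q = [min(max(round(x / scale) + zero, 0), maxq) for x in row]
    return [scale * (v - zero) for v in q], scale, zero


def quantize_per_channel(weight, bits=4):
    """perchannel=True: every row gets its own scale and zero point."""
    deq, scales, zeros = [], [], []
    for row in weight:
        d, s, z = quantize_row_asym(row, bits)
        deq.append(d)
        scales.append(s)
        zeros.append(z)
    return deq, scales, zeros


# Hypothetical 2x4 weight matrix whose rows have very different ranges.
weight = [[-1.0, 0.0, 0.5, 2.0], [0.1, 0.2, 0.3, 0.4]]
deq4, scales4, zeros4 = quantize_per_channel(weight, bits=4)
deq8, scales8, zeros8 = quantize_per_channel(weight, bits=8)
print("4-bit scales:", scales4)
print("8-bit scales:", scales8)
```

Going from 8 to 4 bits makes each row's step size 17 times coarser (255 / 15), which is why the accuracy tables later in this thread matter.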

mhs4670go · 2026-01-21T05:57:26Z

Could you explain the reasons for the observer-related changes?

Deleting some attributes and registering them as buffers.

Changing ObserverBase's parent from ABC to torch.nn.Module (and likewise for MinMaxObserver).

> Deleting some attributes and registering them as buffers.

It turned out that model.to("cuda") and model.to("cpu") did not move the scales and zero_points to the GPU/CPU, because they were not registered as buffers or parameters; that is why they are now registered as buffers. The existing attributes have to be deleted first, because otherwise torch fails to register an already existing attribute name as a buffer.

> Changing ObserverBase's parent from ABC to torch.nn.Module

This enables buffer registration and the correct automatic transfer of scales/zero-points between CPU and GPU. The same approach is used in gptq/quant.py, for the same reason (I suppose).
Please see

TICO/tico/quantization/algorithm/gptq/quant.py

Line 32 in aaf55d7

class Quantizer(nn.Module):

This draft is just a PoC (quick and dirty).
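
The buffer behaviour described in this reply can be reproduced with a small sketch. The classes `PlainObserver`, `BufferObserver`, and `Wrapper` are hypothetical stand-ins rather than TICO's observers, and a dtype cast stands in for the CPU/GPU move so that the example runs without a GPU.

```python
import torch
from torch import nn


class PlainObserver:
    """Not an nn.Module: tensors stored here are invisible to model.to()."""

    def __init__(self):
        self.scale = torch.ones(1)


class BufferObserver(nn.Module):
    """nn.Module observer whose quantization parameters are buffers."""

    def __init__(self):
        super().__init__()
        self.scale = torch.ones(1)  # plain attribute, as in the old code
        # register_buffer raises KeyError if the attribute name already
        # exists, so the plain attribute has to be deleted first.
        del self.scale
        self.register_buffer("scale", torch.ones(1))
        self.register_buffer("zero_point", torch.zeros(1, dtype=torch.int64))


class Wrapper(nn.Module):
    """Toy model holding one linear layer and one observer."""

    def __init__(self, observer):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.observer = observer


# Module.to() walks parameters and buffers of registered submodules only.
plain = Wrapper(PlainObserver()).to(torch.float64)
buffered = Wrapper(BufferObserver()).to(torch.float64)

print(plain.observer.scale.dtype)     # left behind as float32
print(buffered.observer.scale.dtype)  # follows the model to float64
print(sorted(buffered.state_dict()))  # buffers also appear in state_dict
```

The same mechanism applies to `model.to("cuda")`: only tensors that are parameters or buffers of registered submodules follow the model to the new device.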

Ah, thanks for the clarification. I'll reconsider those and apply the changes soon.


ok. Thank you very much.

mhs4670go · 2026-01-21T06:12:52Z

+                continue
+            if (
+                dq.target
+                != torch.ops.circle_custom.dequantize_mx_to_float.default


Seems that just quantize_mx and dequantize_mx are simpler. Is there some consideration for exposing dtypes in the name?

There is no fake_quantize for mx types (just circle_custom::quantize_mx). So quantize_float_to_mx is an attempt (maybe a failed one) to distinguish it from quantize_mx. If circle_custom::quantize_mx is renamed to circle_custom::fakequantize_mx, then the usual quantize/dequantize naming scheme applies.

@mhs4670go
It can be renamed to any other (more appropriate) name.

Ah, I see. How about going with quantize_mx_decomposed and dequantize_mx_decomposed? This aligns with torch.ops.quantized_decomposed.

@mhs4670go
Ok. Got it. Thank you.
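
The naming question comes down to a fused fake-quantize op (float in, float out) versus a decomposed quantize/dequantize pair. Below is a simplified pure-Python sketch of that split for an MXINT8-like format. It assumes, following my reading of the OCP Microscaling spec, blocks of 32 elements that share one power-of-two scale and 8-bit elements with 6 fractional bits. The function names are hypothetical and this is not the circle_custom implementation.

```python
import math

BLOCK_SIZE = 32      # assumed: one shared scale per block of 32 elements
ELEM_FRAC_BITS = 6   # assumed: int8 elements with 6 fractional bits


def quantize_mx_block(block):
    """Decomposed 'quantize': floats -> (shared exponent, int8 elements)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    shared_exp = math.frexp(max_abs)[1] - 1
    step = 2.0 ** (shared_exp - ELEM_FRAC_BITS)
    elems = [max(-127, min(127, round(x / step))) for x in block]
    return shared_exp, elems


def dequantize_mx_block(shared_exp, elems):
    """Decomposed 'dequantize': (shared exponent, int8 elements) -> floats."""
    step = 2.0 ** (shared_exp - ELEM_FRAC_BITS)
    return [e * step for e in elems]


def fake_quantize_mx(values, block_size=BLOCK_SIZE):
    """Fused variant: quantize and immediately dequantize, block by block."""
    out = []
    for i in range(0, len(values), block_size):
        shared_exp, elems = quantize_mx_block(values[i:i + block_size])
        out.extend(dequantize_mx_block(shared_exp, elems))
    return out


shared_exp, elems = quantize_mx_block([1.0, -0.5, 0.25, 0.0])
print(shared_exp, elems)
print(dequantize_mx_block(shared_exp, elems))
```

With the decomposed pair, a graph pass can match a dequantize node by its target, as the diff above does, and fold the integer elements and shared exponent into the exported layer.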

stamalakhov · 2026-03-05T11:04:32Z

@mhs4670go
I've run advanced mse modes from this draft for unsloth/Llama-3.2-3B-Instruct:

seq_len == 2048:

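| Config ID | PPL | arc_easy (%) | arc_challenge (%) | winogrande (%) | openbookqa (%) |
|-----------|----:|-------------:|------------------:|---------------:|---------------:|
| FP32 | 11.05 | 75 | 44 | 69 | 29 |
| GPTQ_MSE_w4A16_main_branch_mse | 12.92 | 72 | 41 | 67 | 28 |
| GPTQ_MSE_w4A16_smse | 12.12 | 73 | 41 | 67 | 30 |
| GPTQ_MSE_w4A16_smse_for_gptq | 12.11 | 72 | 41 | 67 | 27 |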

tables from logs

GPTQ_MSE_w4A16_main_branch_mse:

Quantized RESULTS ARE:
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4078|±  |0.0144|
|             |       |none  |     0|acc_norm|↑  |0.4386|±  |0.0145|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7226|±  |0.0092|
|             |       |none  |     0|acc_norm|↑  |0.6738|±  |0.0096|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2840|±  |0.0202|
|             |       |none  |     0|acc_norm|↑  |0.3920|±  |0.0219|
|winogrande   |      1|none  |     0|acc     |↑  |0.6661|±  |0.0133|

GPTQ_MSE_w4A16_smse:

Quantized RESULTS ARE:
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4096|±  |0.0144|
|             |       |none  |     0|acc_norm|↑  |0.4352|±  |0.0145|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7323|±  |0.0091|
|             |       |none  |     0|acc_norm|↑  |0.6696|±  |0.0097|
|openbookqa   |      1|none  |     0|acc     |↑  |0.3020|±  |0.0206|
|             |       |none  |     0|acc_norm|↑  |0.3760|±  |0.0217|
|winogrande   |      1|none  |     0|acc     |↑  |0.6748|±  |0.0132|

GPTQ_MSE_w4A16_smse_for_gptq:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4061|±  |0.0144|
|             |       |none  |     0|acc_norm|↑  |0.4309|±  |0.0145|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7247|±  |0.0092|
|             |       |none  |     0|acc_norm|↑  |0.6692|±  |0.0097|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2720|±  |0.0199|
|             |       |none  |     0|acc_norm|↑  |0.3740|±  |0.0217|
|winogrande   |      1|none  |     0|acc     |↑  |0.6654|±  |0.0133|

GPTQ_MSE_w4A16_smse_for_gptq_with_4096_calib_seqlen:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4078|±  |0.0144|
|             |       |none  |     0|acc_norm|↑  |0.4369|±  |0.0145|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7256|±  |0.0092|
|             |       |none  |     0|acc_norm|↑  |0.6696|±  |0.0097|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2680|±  |0.0198|
|             |       |none  |     0|acc_norm|↑  |0.3780|±  |0.0217|
|winogrande   |      1|none  |     0|acc     |↑  |0.6677|±  |0.0132|

GPTQ_MSE_w4A16_smse_for_gptq_with_4096_calib_seqlen_and_256_qcalib_samples:

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4104|±  |0.0144|
|             |       |none  |     0|acc_norm|↑  |0.4241|±  |0.0144|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7201|±  |0.0092|
|             |       |none  |     0|acc_norm|↑  |0.6545|±  |0.0098|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2800|±  |0.0201|
|             |       |none  |     0|acc_norm|↑  |0.3660|±  |0.0216|
|winogrande   |      1|none  |     0|acc     |↑  |0.6567|±  |0.0133|

Do we need something like this (e.g. smse) in the main branch?

Please note that smse_for_gptq produces a better PPL but fails to improve accuracy. This looks like overfitting, which can be avoided by using more data, e.g.
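
One way to sanity-check the accuracy comparison above is to set each difference against the reported standard errors. The sketch below is an illustration rather than anything from the PR: the helper `z_score` is hypothetical, the numbers are copied from the `acc` rows of the smse and main_branch_mse tables, and the two runs are treated as independent, which is only a rough approximation because both use the same test sets.

```python
import math


def z_score(acc_a, se_a, acc_b, se_b):
    """Difference of two accuracies in units of their combined stderr."""
    return (acc_a - acc_b) / math.sqrt(se_a**2 + se_b**2)


# task: ((smse acc, stderr), (main_branch_mse acc, stderr))
results = {
    "arc_challenge": ((0.4096, 0.0144), (0.4078, 0.0144)),
    "arc_easy": ((0.7323, 0.0091), (0.7226, 0.0092)),
    "openbookqa": ((0.3020, 0.0206), (0.2840, 0.0202)),
    "winogrande": ((0.6748, 0.0132), (0.6661, 0.0133)),
}

z_scores = {
    task: z_score(a, se_a, b, se_b)
    for task, ((a, se_a), (b, se_b)) in results.items()
}
for task, z in z_scores.items():
    print(f"{task:14s} z = {z:+.2f}")
```

All four differences come out below one combined standard error, so under these assumptions the task accuracies do not separate the two configurations, even though their perplexities differ.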

stamalakhov · 2026-04-01T17:02:15Z

@mhs4670go
For mxint8 (input/output of linear layers only) we get the following:

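| Config ID | PPL | arc_easy (%) | arc_challenge (%) | winogrande (%) | openbookqa (%) |
|-----------|----:|-------------:|------------------:|---------------:|---------------:|
| FP32 | 11.05 | 75 | 44 | 69 | 29 |
| GPTQ_MSE_w4_linear_in_mxint8_others_in_int16 | 12.74 | 72 | 40 | 66 | 28 |
| GPTQ_SMSE_w4_linear_in_mxint8_others_in_int16 | 12.37 | 72 | 40 | 66 | 27 |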

more

GPTQ_MSE_w4_linear_in_mxint8_others_in_int16:

┌── Wikitext-2 test perplexity ─────────────
│ int16 :    12.74
└───────────────────────────────────────────
Quantized RESULTS ARE:
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4019|±  |0.0143|
|             |       |none  |     0|acc_norm|↑  |0.4283|±  |0.0145|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7155|±  |0.0093|
|             |       |none  |     0|acc_norm|↑  |0.6608|±  |0.0097|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2760|±  |0.0200|
|             |       |none  |     0|acc_norm|↑  |0.3740|±  |0.0217|
|winogrande   |      1|none  |     0|acc     |↑  |0.6606|±  |0.0133|

GPTQ_SMSE_w4_linear_in_mxint8_others_in_int16:

┌── Wikitext-2 test perplexity ─────────────
│ int16 :    12.37
└───────────────────────────────────────────
Quantized RESULTS ARE:
|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|-----:|
|arc_challenge|      1|none  |     0|acc     |↑  |0.4019|±  |0.0143|
|             |       |none  |     0|acc_norm|↑  |0.4352|±  |0.0145|
|arc_easy     |      1|none  |     0|acc     |↑  |0.7176|±  |0.0092|
|             |       |none  |     0|acc_norm|↑  |0.6814|±  |0.0096|
|openbookqa   |      1|none  |     0|acc     |↑  |0.2720|±  |0.0199|
|             |       |none  |     0|acc_norm|↑  |0.3660|±  |0.0216|
|winogrande   |      1|none  |     0|acc     |↑  |0.6614|±  |0.0133|


stamalakhov · 2026-05-21T11:13:16Z

@mhs4670go
I've updated this draft to use MX types for quantizing the activations of rms_norm, softmax, and matmuls.
Do we need MXTypes in the main branch?


stamalakhov self-assigned this Jan 13, 2026

stamalakhov added DRAFT No merge labels Jan 13, 2026

stamalakhov force-pushed the full_quantization_br branch 9 times, most recently from 0253cb9 to 5201525 on January 20, 2026 11:37

mhs4670go reviewed Jan 21, 2026

stamalakhov force-pushed the full_quantization_br branch 7 times, most recently from 60fcd6a to f7bb4d9 on January 27, 2026 13:51

stamalakhov force-pushed the full_quantization_br branch 6 times, most recently from 06581cb to 542db37 on January 30, 2026 06:07

stamalakhov changed the title ~~[DRAFT][NO_MERGE][quantization] Full quantization~~ [DRAFT][quantization] Full quantization Jan 30, 2026

stamalakhov force-pushed the full_quantization_br branch from 542db37 to 7f684a3 on January 30, 2026 10:31

stamalakhov force-pushed the full_quantization_br branch 3 times, most recently from 797e714 to 07cf6fd on February 22, 2026 09:38

stamalakhov mentioned this pull request Feb 25, 2026

[quantization] Add eval_tasks option #522

Merged

stamalakhov force-pushed the full_quantization_br branch 2 times, most recently from be606ae to 91b6916 on February 27, 2026 12:28

stamalakhov force-pushed the full_quantization_br branch from 91b6916 to cde8f76 on March 5, 2026 08:22

stamalakhov force-pushed the full_quantization_br branch 3 times, most recently from b14c3b0 to 810175d on March 16, 2026 10:09

stamalakhov force-pushed the full_quantization_br branch 4 times, most recently from 1760c64 to 4ef076a on April 1, 2026 07:36

stamalakhov force-pushed the full_quantization_br branch 4 times, most recently from d6b8476 to af15b68 on April 16, 2026 06:15

stamalakhov force-pushed the full_quantization_br branch 2 times, most recently from 6ed93f3 to e34300f on May 15, 2026 12:48

[quantization] Full quantization

754ed7d

This draft tries to get fully quantized model. TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

stamalakhov force-pushed the full_quantization_br branch from e34300f to 754ed7d on May 19, 2026 10:20

stamalakhov added 3 commits May 20, 2026 16:01

mxdtype

04a7b24

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

use of mx dtype

a3a099c

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

mxdtype using builders

ed5f0da

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

stamalakhov added 2 commits May 21, 2026 16:15

fix tests

ff7f128

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

fix_add_mul

d3a2ce6

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
