[DRAFT][quantization] Full quantization of LLama compatible models #436
stamalakhov wants to merge 6 commits into
  gptq[name] = GPTQ(subset[name])
  gptq[name].quantizer.configure(
-     bits=8, perchannel=True, sym=False, mse=False
+     bits=4, perchannel=True, sym=False, mse=False
FYI, you can give the option for this with this PR.
> FYI, you can give the option for this with this PR.

@mhs4670go
Thank you. I'll rebase after #441 is merged.
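For context on the `bits`, `perchannel` and `sym` settings in the snippet above, here is a minimal sketch of per-channel min-max quantization in plain Python. The helper names (`minmax_qparams`, `quantize_row`, `dequantize_row`) are hypothetical and this is not the GPTQ quantizer from the PR; it only illustrates how fewer bits give a coarser grid per output channel.

```python
def minmax_qparams(row, bits=4, sym=False):
    """Return (scale, zero_point) for one output channel (one weight row)."""
    qmax = 2 ** bits - 1
    # The representable range always includes 0.0.
    lo = min(min(row), 0.0)
    hi = max(max(row), 0.0)
    if sym:
        bound = max(-lo, hi)
        lo, hi = -bound, bound
    if lo == hi:  # all-zero row: fall back to a dummy range
        lo, hi = -1.0, 1.0
    scale = (hi - lo) / qmax
    zero_point = (qmax + 1) // 2 if sym else round(-lo / scale)
    return scale, zero_point


def quantize_row(row, scale, zero_point, bits):
    """Map floats to integer codes in [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    return [max(0, min(qmax, round(x / scale) + zero_point)) for x in row]


def dequantize_row(codes, scale, zero_point):
    """Map integer codes back to floats."""
    return [scale * (q - zero_point) for q in codes]


# Illustrative weight matrix: each row is one output channel.
weight = [[-1.0, 0.0, 0.5, 2.0], [0.0, 0.1, 0.2, 0.3]]

for bits in (8, 4):
    for row in weight:
        scale, zp = minmax_qparams(row, bits=bits, sym=False)
        codes = quantize_row(row, scale, zp, bits)
        restored = dequantize_row(codes, scale, zp)
        max_err = max(abs(a - b) for a, b in zip(row, restored))
        print(f"bits={bits} scale={scale:.5f} zero_point={zp} max_err={max_err:.5f}")
```

With `perchannel=True` each row gets its own scale and zero point, so a row with a narrow value range is not penalized by a wide one elsewhere in the matrix.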
Could you explain the reasons for the observer-related changes?
- Deleting some attributes and registering them as buffers.
- Changing ObserverBase's parent from ABC to torch.nn.Module (and MinMaxObserver).
> Could you explain the reasons for the observer-related changes?
> - Deleting some attributes and registering them as buffers.

It turned out that model.to("cuda") and model.to("cpu") did not transfer scales and zero_points to the GPU/CPU, because they were not registered as buffers or parameters. That is why they are now registered as buffers. Deleting them first is needed because otherwise torch fails to register already-existing attributes as buffers.

> - Changing ObserverBase's parent from ABC to torch.nn.Module

This enables buffer registration and the correct automatic transfer of scales/zero points between CPU and GPU. The same approach is used in gptq/quant.py, for the same reason (I suppose).
Please see
This draft is just a PoC (quick and dirty).
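The buffer behaviour described in this reply can be reproduced with a small sketch. `PlainObserver`, `BufferObserver` and `promote_to_buffer` are hypothetical names made up for illustration, not the observer classes from this PR. A dtype cast is used in place of a device move so that the example runs without a GPU; both go through `Module.to()`.

```python
import torch
from torch import nn


class PlainObserver(nn.Module):
    """Hypothetical observer that keeps `scale` as a plain tensor attribute."""

    def __init__(self):
        super().__init__()
        self.scale = torch.ones(1)  # not tracked by .to() or state_dict()


class BufferObserver(nn.Module):
    """Hypothetical observer that registers its qparams as buffers."""

    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.ones(1))
        self.register_buffer("zero_point", torch.zeros(1, dtype=torch.int64))


def promote_to_buffer(module: nn.Module, name: str) -> None:
    """Turn an existing plain attribute into a buffer (delete, then register)."""
    value = getattr(module, name)
    delattr(module, name)  # register_buffer raises KeyError if the name exists
    module.register_buffer(name, value)


plain, buffered = PlainObserver(), BufferObserver()

# Module.to() only touches parameters and buffers, so the plain attribute
# keeps its original dtype while the buffer is converted.
plain.to(torch.float64)
buffered.to(torch.float64)
print(plain.scale.dtype, buffered.scale.dtype)

# After promotion the former plain attribute follows .to() as well
# and shows up in the state dict.
promote_to_buffer(plain, "scale")
plain.to(torch.float64)
print(plain.scale.dtype, list(plain.state_dict()))
```

Buffers registered this way are also saved in `state_dict()` by default, which may or may not be desired for observer statistics.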
Ah, thanks for the clarification. I'll reconsider those and apply the changes soon.
> Ah, thanks for the clarification. I'll reconsider those and apply the changes soon.

OK. Thank you very much.
continue
if (
    dq.target
    != torch.ops.circle_custom.dequantize_mx_to_float.default
It seems that plain quantize_mx and dequantize_mx would be simpler. Is there a reason for exposing the dtypes in the name?
There is no fake_quantize for MX types (only circle_custom::quantize_mx). So quantize_float_to_mx is an attempt (maybe a failed one) to distinguish it from quantize_mx. If circle_custom::quantize_mx becomes circle_custom::fakequantize_mx, then the usual quantize/dequantize naming scheme applies.
@mhs4670go
It can be renamed to any other (more appropriate) name.
Ah, I see. How about going with quantize_mx_decomposed and dequantize_mx_decomposed? This aligns with torch.ops.quantized_decomposed.
> Ah, I see. How about going with quantize_mx_decomposed and dequantize_mx_decomposed? This aligns with torch.ops.quantized_decomposed.

@mhs4670go
Ok. Got it. Thank you.
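The naming question in this thread comes down to a single fused fake-quantize op versus a separate quantize/dequantize pair. A minimal sketch of that relationship follows, using a simplified block format with one shared power-of-two scale per block. This is an assumption made for illustration: it is neither the exact MX encoding nor the circle_custom implementation, and `quantize_block`, `dequantize_block` and `fake_quantize_block` are hypothetical names.

```python
import math

QMAX = 127  # symmetric int8 element range (assumed)


def quantize_block(xs):
    """Float block -> (shared exponent, integer codes)."""
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        return 0, [0] * len(xs)
    # Smallest power-of-two scale that keeps amax / scale within QMAX.
    exp = math.ceil(math.log2(amax / QMAX))
    scale = 2.0 ** exp
    codes = [max(-QMAX, min(QMAX, round(x / scale))) for x in xs]
    return exp, codes


def dequantize_block(exp, codes):
    """(shared exponent, integer codes) -> float block."""
    scale = 2.0 ** exp
    return [c * scale for c in codes]


def fake_quantize_block(xs):
    """Fused form: float in, float out, same values as the decomposed pair."""
    return dequantize_block(*quantize_block(xs))


block = [0.01 * i - 0.1 for i in range(32)]
exp, codes = quantize_block(block)
restored = dequantize_block(exp, codes)
print("shared exponent:", exp)
print("max abs error:", max(abs(a - b) for a, b in zip(block, restored)))
```

Keeping the two halves as separate ops makes the integer representation visible in the graph, which is what the quantize_mx_decomposed / dequantize_mx_decomposed naming would signal.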
@mhs4670go
Tables from logs:
- GPTQ_MSE_w4A16_main_branch_mse:
- GPTQ_MSE_w4A16_smse:
- GPTQ_MSE_w4A16_smse_for_gptq:
- GPTQ_MSE_w4A16_smse_for_gptq_with_4096_calib_seqlen:
- GPTQ_MSE_w4A16_smse_for_gptq_with_4096_calib_seqlen_and_256_qcalib_samples:
Do we need something alike (e.g.
Please note that
@mhs4670go
More:
- GPTQ_MSE_w4_linear_in_mxint8_others_in_int16:
- GPTQ_SMSE_w4_linear_in_mxint8_others_in_int16:
This draft tries to get fully quantized model. TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
This draft tries to get fully quantized circle layers for the Llama model.
TODO:
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>