Open
22 commits
536fb3f
Add CPU DynamicQuantMatMulFp8 contrib op with MLAS FP8 fallback
melkap01-Arm May 8, 2026
b53b1c5
wording for tile replaced with block
melkap01-Arm May 8, 2026
563dcaa
Merge branch 'microsoft:main' into fp8_DynamicQuantMatMul_Support
melkap01-Arm May 11, 2026
c5fe8e1
cleaning the zp checks from MlasFp8GemmBatch after symmetric quantis…
melkap01-Arm May 11, 2026
a258ad8
documentation updated for failing build, copilot comment addressed
melkap01-Arm May 12, 2026
7fb80e7
Reusable A buffer implemented before gemm, tests covering all fp8 typ…
melkap01-Arm May 12, 2026
864b1a8
Merge branch 'microsoft:main' into fp8_DynamicQuantMatMul_Support
melkap01-Arm May 12, 2026
1966da4
redundant lines removed
melkap01-Arm May 13, 2026
98ea9ff
Optimize DynamicQuantMatMulFp8 A quantization
melkap01-Arm May 13, 2026
473287e
Merge branch 'microsoft:main' into fp8_DynamicQuantMatMul_Support
melkap01-Arm May 14, 2026
10c88a7
Merge branch 'microsoft:main' into fp8_DynamicQuantMatMul_Support
melkap01-Arm May 14, 2026
e041589
Merge branch 'microsoft:main' into fp8_DynamicQuantMatMul_Support
melkap01-Arm May 15, 2026
b913105
Merge branch 'microsoft:main' into fp8_DynamicQuantMatMul_Support
melkap01-Arm May 18, 2026
16efd04
Merge branch 'microsoft:main' into fp8_DynamicQuantMatMul_Support
melkap01-Arm May 18, 2026
4655e91
documentation difference patched
melkap01-Arm May 18, 2026
f3018d9
Merge branch 'microsoft:main' into fp8_DynamicQuantMatMul_Support
melkap01-Arm May 18, 2026
cc27b94
Merge branch 'microsoft:main' into fp8_DynamicQuantMatMul_Support
melkap01-Arm May 19, 2026
93592f3
answering copilot comment regarding N==0 case
melkap01-Arm May 20, 2026
cdb66cf
Merge branch 'microsoft:main' into fp8_DynamicQuantMatMul_Support
melkap01-Arm May 20, 2026
f5424a9
LHS,RHS block layouts changed, scales adjusted
melkap01-Arm May 20, 2026
3696f36
review comments addressed, docs patched
melkap01-Arm May 21, 2026
68e2037
Merge branch 'microsoft:main' into fp8_DynamicQuantMatMul_Support
melkap01-Arm May 21, 2026
1 change: 1 addition & 0 deletions cmake/onnxruntime_mlas.cmake
@@ -24,6 +24,7 @@ onnxruntime_add_static_library(onnxruntime_mlas
${MLAS_SRC_DIR}/sgemm.cpp
${MLAS_SRC_DIR}/halfgemm.cpp
${MLAS_SRC_DIR}/qgemm.cpp
${MLAS_SRC_DIR}/qgemm_fp8.cpp
${MLAS_SRC_DIR}/qdwconv.cpp
${MLAS_SRC_DIR}/convolve.cpp
${MLAS_SRC_DIR}/sconv_nchw_depthwise_multiplier_greater_than_1.cpp
64 changes: 62 additions & 2 deletions docs/ContribOperators.md
@@ -26,6 +26,7 @@ Do not modify directly.*
* <a href="#com.microsoft.DequantizeBFP">com.microsoft.DequantizeBFP</a>
* <a href="#com.microsoft.DequantizeLinear">com.microsoft.DequantizeLinear</a>
* <a href="#com.microsoft.DequantizeWithOrder">com.microsoft.DequantizeWithOrder</a>
* <a href="#com.microsoft.DynamicQuantMatMulFp8">com.microsoft.DynamicQuantMatMulFp8</a>
* <a href="#com.microsoft.DynamicQuantizeLSTM">com.microsoft.DynamicQuantizeLSTM</a>
* <a href="#com.microsoft.DynamicQuantizeMatMul">com.microsoft.DynamicQuantizeMatMul</a>
* <a href="#com.microsoft.DynamicTimeWarping">com.microsoft.DynamicTimeWarping</a>
@@ -1493,6 +1494,67 @@ This version of the operator has been available since version 1 of the 'com.micr
</dl>


### <a name="com.microsoft.DynamicQuantMatMulFp8"></a><a name="com.microsoft.dynamicquantmatmulfp8">**com.microsoft.DynamicQuantMatMulFp8**</a>

Symmetric quantized MatMul for fp8 weights (with optional prepack conversion from float16/bfloat16/float) and dynamic runtime quantization of activations to fp8 using internally computed block-wise scales. All zero-point inputs, when provided, must encode 0.0.
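
To illustrate what symmetric block-wise scaling of activations could look like, the sketch below computes one scale per block of K for a single row of A (consistent with `block_size_m` being 1) and applies it. This is a sketch under stated assumptions, not the kernel's implementation: the function name, the `max(|x|) / fp8_max` scale rule, and the fallback for all-zero blocks are hypothetical, and rounding to the fp8 grid is omitted. The constant 448.0 is the largest finite FLOAT8E4M3FN value.

```python
# Illustrative sketch only; the actual kernel's scale rule may differ.
FP8_E4M3FN_MAX = 448.0  # largest finite FLOAT8E4M3FN value


def quantize_row_blockwise(row, block_size_k, fp8_max=FP8_E4M3FN_MAX):
    """Return (scaled_values, scales) for one row of A.

    Symmetric quantization: no zero point, one scale per block of K.
    Assumed rule: scale = max(|x|) / fp8_max for each block.
    """
    scaled, scales = [], []
    for start in range(0, len(row), block_size_k):
        block = row[start:start + block_size_k]
        amax = max(abs(x) for x in block)
        # Hypothetical fallback: an all-zero block gets scale 1.0.
        scale = amax / fp8_max if amax > 0.0 else 1.0
        scales.append(scale)
        # Values now lie in [-fp8_max, fp8_max]; fp8 rounding is omitted.
        scaled.extend(x / scale for x in block)
    return scaled, scales
```

Because the scheme is symmetric, dequantization is a single multiply by the block's scale, which is why the zero-point inputs carry no information beyond 0.0.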

#### Version

This version of the operator has been available since version 1 of the 'com.microsoft' operator set.

#### Attributes

<dl>
<dt><tt>block_size_k</tt> : int</dt>
<dd>Block size along K for A and B block-wise scales.</dd>
<dt><tt>block_size_m</tt> : int</dt>
<dd>Block size along M for A block-wise scales. Must be 1.</dd>
<dt><tt>block_size_n</tt> : int</dt>
<dd>Block size along N for B block-wise scales.</dd>
<dt><tt>fp8_type</tt> : int</dt>
<dd>FP8 TensorProto data type used when non-FP8 constant B is dynamically quantized during prepack. Defaults to FLOAT8E4M3FN.</dd>
</dl>
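
The attribute constraints above can be summarized as a small validator. This is a hypothetical helper for illustration, not part of ONNX Runtime; the requirement that block sizes be positive is an assumption the documentation does not state. The integer codes are the ONNX `TensorProto.DataType` values for the four fp8 types.

```python
# TensorProto.DataType values for the four ONNX fp8 types.
FP8_TYPES = {
    17: "FLOAT8E4M3FN",
    18: "FLOAT8E4M3FNUZ",
    19: "FLOAT8E5M2",
    20: "FLOAT8E5M2FNUZ",
}


def check_attributes(block_size_k, block_size_n, block_size_m, fp8_type=17):
    """Hypothetical validator mirroring the documented attribute rules."""
    if block_size_m != 1:
        raise ValueError("block_size_m must be 1")
    if block_size_k <= 0 or block_size_n <= 0:
        # Assumption: block sizes are positive; the docs do not say so.
        raise ValueError("block sizes must be positive")
    if fp8_type not in FP8_TYPES:
        raise ValueError(f"fp8_type {fp8_type} is not an fp8 TensorProto type")
    # fp8_type defaults to FLOAT8E4M3FN (17), as documented.
    return FP8_TYPES[fp8_type]
```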

#### Inputs (2 - 6)

<dl>
<dt><tt>A</tt> : TA</dt>
<dd>Input tensor A.</dd>
<dt><tt>B</tt> : TB</dt>
<dd>Input tensor B. FP8 B may be provided at runtime. Float, float16, and bfloat16 B are only supported when B is a constant initializer that can be quantized during prepack.</dd>
<dt><tt>B_scale</tt> (optional) : TS</dt>
<dd>Scale of FP8 input 'B'. Must be a block-wise tensor with shape (N / block_size_n, K / block_size_k). Required when B is already FP8. Ignored for non-FP8 constant B, where scales are computed during prepack.</dd>
<dt><tt>B_zero_point</tt> (optional) : TZ</dt>
<dd>Zero point tensor for input 'B'. Must have the same shape as B_scale and all values must encode 0.0.</dd>
<dt><tt>Y_scale</tt> (optional) : TS</dt>
<dd>Scale of output 'Y'. Must be a scalar when provided.</dd>
<dt><tt>Y_zero_point</tt> (optional) : TZ</dt>
<dd>Zero point tensor for output 'Y'. Must be a scalar encoding 0.0 when provided.</dd>
</dl>
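
As a rough guide to the `B_scale` shape rule, the sketch below derives the expected shape from N, K, and the block sizes, and shows which scale entry would cover a given element of B. Both helpers are hypothetical. Requiring exact divisibility is an assumption: the documentation writes `N / block_size_n` without saying whether partial blocks are allowed.

```python
def expected_b_scale_shape(n, k, block_size_n, block_size_k):
    """Shape of B_scale as documented: (N / block_size_n, K / block_size_k)."""
    if n % block_size_n != 0 or k % block_size_k != 0:
        # Assumption: dimensions must divide evenly into blocks.
        raise ValueError("N and K must be multiples of their block sizes")
    return (n // block_size_n, k // block_size_k)


def b_scale_index(k_idx, n_idx, block_size_n, block_size_k):
    """Index into B_scale of the block covering B element (k_idx, n_idx)."""
    return (n_idx // block_size_n, k_idx // block_size_k)
```

For example, with N = 256, K = 512, and 128 x 128 blocks, `B_scale` would have shape (2, 4), and `B_zero_point`, when supplied, would need the same shape.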

#### Outputs

<dl>
<dt><tt>Y</tt> : TY</dt>
<dd>Output tensor of shape (..., M, N).</dd>
</dl>
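
One way to picture how block-wise scales combine into the (M, N) output is the naive reference below. It is a sketch under several assumptions, none confirmed by the source: B is laid out as (K, N), per-row activation scales are stored as `a_scale[m][k_block]`, and each K block's partial sum is multiplied by the product of the A and B scales for that block. `Y_scale` is deliberately left out because its exact semantics are not described here. All names are hypothetical.

```python
def blockwise_matmul(qa, a_scale, qb, b_scale, block_size_k, block_size_n):
    """Naive reference: Y = sum over K blocks of (qa @ qb) * a_scale * b_scale.

    qa:      M x K quantized activations (plain numbers in this sketch)
    a_scale: M x (K / block_size_k) scales, one per row and K block
    qb:      K x N quantized weights (assumed layout)
    b_scale: (N / block_size_n) x (K / block_size_k) scales
    """
    m_dim, k_dim, n_dim = len(qa), len(qb), len(qb[0])
    y = [[0.0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            acc = 0.0
            for k0 in range(0, k_dim, block_size_k):
                kb = k0 // block_size_k
                k1 = min(k0 + block_size_k, k_dim)
                # Accumulate the unscaled dot product inside one K block.
                partial = sum(qa[m][k] * qb[k][n] for k in range(k0, k1))
                # Apply both block scales once per block.
                acc += partial * a_scale[m][kb] * b_scale[n // block_size_n][kb]
            y[m][n] = acc
    return y
```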

#### Type Constraints

<dl>
<dt><tt>TA</tt> : tensor(float16), tensor(bfloat16), tensor(float)</dt>
<dd>Constrain input A type to float16, bfloat16, or float.</dd>
<dt><tt>TB</tt> : tensor(float16), tensor(bfloat16), tensor(float), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)</dt>
<dd>Constrain input B type to fp8, or to float16, bfloat16, or float for constant initializers.</dd>
<dt><tt>TZ</tt> : tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)</dt>
<dd>Constrain zero point types to fp8. Only zero-valued zero points are supported.</dd>
<dt><tt>TS</tt> : tensor(float), tensor(float16), tensor(bfloat16)</dt>
<dd>Constrain scale types to float, float16, or bfloat16.</dd>
<dt><tt>TY</tt> : tensor(float16), tensor(bfloat16), tensor(float)</dt>
<dd>Constrain output type to float16, bfloat16, or float.</dd>
</dl>


### <a name="com.microsoft.DynamicQuantizeLSTM"></a><a name="com.microsoft.dynamicquantizelstm">**com.microsoft.DynamicQuantizeLSTM**</a>

#### Version
@@ -6690,5 +6752,3 @@ No versioning maintained for experimental ops.
<dt><tt>T</tt> : tensor(float)</dt>
<dd>Constrain input and output types to float32 tensors.</dd>
</dl>


1 change: 1 addition & 0 deletions docs/OperatorKernels.md
@@ -577,6 +577,7 @@ The **OpSet Version** column uses the following notation:
|CropAndResize|*in* X:**T1**<br> *in* rois:**T1**<br> *in* batch_indices:**T2**<br> *in* crop_size:**T2**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int32)|
|DecoderMaskedMultiHeadAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* mask_index:**M**<br> *in* attention_bias:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *in* past_sequence_length:**M**<br> *in* beam_width:**M**<br> *in* cache_indirection:**M**<br> *in* bias:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**<br> *out* qk:**QK**|1+|**T** = tensor(float)|
|DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int16), tensor(int32), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8)<br/> **T2** = tensor(float)|
|DynamicQuantMatMulFp8|*in* A:**TA**<br> *in* B:**TB**<br> *in* B_scale:**TS**<br> *in* B_zero_point:**TZ**<br> *in* Y_scale:**TS**<br> *in* Y_zero_point:**TZ**<br> *out* Y:**TY**|1+|**TA** = tensor(bfloat16), tensor(float), tensor(float16)<br/> **TB** = tensor(bfloat16), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)<br/> **TS** = tensor(bfloat16), tensor(float), tensor(float16)<br/> **TY** = tensor(bfloat16), tensor(float), tensor(float16)<br/> **TZ** = tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)|
|DynamicQuantizeLSTM|*in* X:**T**<br> *in* W:**T2**<br> *in* R:**T2**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *in* initial_c:**T**<br> *in* P:**T**<br> *in* W_scale:**T**<br> *in* W_zero_point:**T2**<br> *in* R_scale:**T**<br> *in* R_zero_point:**T2**<br> *out* Y:**T**<br> *out* Y_h:**T**<br> *out* Y_c:**T**|1+|**T** = tensor(float)<br/> **T1** = tensor(int32)<br/> **T2** = tensor(int8), tensor(uint8)|
|DynamicQuantizeMatMul|*in* A:**T1**<br> *in* B:**T2**<br> *in* b_scale:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int8), tensor(uint8)|
|DynamicTimeWarping|*in* input:**F**<br> *out* output:**I**|1+|**F** = tensor(float)<br/> **I** = tensor(int32)|
6 changes: 6 additions & 0 deletions onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc
@@ -116,6 +116,9 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1,
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, NhwcMaxPool);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, NhwcMaxPool);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, QEmbedLayerNormalization);
#if !defined(DISABLE_FLOAT8_TYPES)
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, DynamicQuantMatMulFp8);
#endif
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QGemm);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QGemm);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, MLFloat16, QMoE);
@@ -284,6 +287,9 @@ Status RegisterQuantizationKernels(KernelRegistry& kernel_registry) {
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, NhwcMaxPool)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, NhwcMaxPool)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, QEmbedLayerNormalization)>,
#if !defined(DISABLE_FLOAT8_TYPES)
BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, DynamicQuantMatMulFp8)>,
#endif
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QGemm)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QGemm)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, MLFloat16, QMoE)>,