62 changes: 60 additions & 2 deletions docs/ContribOperators.md
@@ -26,6 +26,7 @@ Do not modify directly.*
* <a href="#com.microsoft.DequantizeBFP">com.microsoft.DequantizeBFP</a>
* <a href="#com.microsoft.DequantizeLinear">com.microsoft.DequantizeLinear</a>
* <a href="#com.microsoft.DequantizeWithOrder">com.microsoft.DequantizeWithOrder</a>
* <a href="#com.microsoft.DynamicQuantMatMulFp8">com.microsoft.DynamicQuantMatMulFp8</a>
* <a href="#com.microsoft.DynamicQuantizeLSTM">com.microsoft.DynamicQuantizeLSTM</a>
* <a href="#com.microsoft.DynamicQuantizeMatMul">com.microsoft.DynamicQuantizeMatMul</a>
Contributor

There already exists a DynamicQuantizeMatMul contrib op. Could we modify its op schema to support all of these FP8 changes?

* <a href="#com.microsoft.DynamicTimeWarping">com.microsoft.DynamicTimeWarping</a>
@@ -1493,6 +1494,65 @@ This version of the operator has been available since version 1 of the 'com.micr
</dl>


### <a name="com.microsoft.DynamicQuantMatMulFp8"></a><a name="com.microsoft.dynamicquantmatmulfp8">**com.microsoft.DynamicQuantMatMulFp8**</a>

Symmetric quantized MatMul for fp8 weights (with optional prepack conversion from float16/bfloat16/float) and dynamic runtime quantization of activations to fp8 using internally computed block-wise scales. All zero-point inputs, when provided, must encode 0.0. Trailing optional inputs may be omitted; an omitted intermediate optional input must be passed as an empty input name so that later inputs keep their positions.
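
The input-position rule above can be sketched in plain Python. This is an illustration only: `build_input_names` is a hypothetical helper, not part of ONNX Runtime, and the tensor names in the usage note are made up. It shows how an omitted intermediate optional input becomes an empty string while omitted trailing inputs are dropped.

```python
from typing import List, Optional


def build_input_names(
    a: str,
    b: str,
    b_scale: Optional[str] = None,
    b_zero_point: Optional[str] = None,
    y_scale: Optional[str] = None,
    y_zero_point: Optional[str] = None,
) -> List[str]:
    """Return the ordered input-name list for a DynamicQuantMatMulFp8 node.

    Hypothetical helper: omitted optional inputs become "" so that later
    inputs keep their positions, and trailing empty names are removed.
    """
    optional = (b_scale, b_zero_point, y_scale, y_zero_point)
    names = [a, b] + [name or "" for name in optional]
    # A and B are required, so never trim below two entries.
    while len(names) > 2 and names[-1] == "":
        names.pop()
    return names
```

For example, `build_input_names("A", "B_fp8", "B_scale", y_scale="Y_scale")` yields `["A", "B_fp8", "B_scale", "", "Y_scale"]`: the skipped `B_zero_point` slot is held by an empty name, and the unused `Y_zero_point` slot is simply omitted.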

#### Version

This version of the operator has been available since version 1 of the 'com.microsoft' operator set.

#### Attributes

<dl>
<dt><tt>block_size_k</tt> : int</dt>
<dd>Block size along K for A and B block-wise scales.</dd>
<dt><tt>block_size_n</tt> : int</dt>
<dd>Block size along N for B block-wise scales.</dd>
<dt><tt>fp8_type</tt> : int</dt>
<dd>FP8 TensorProto data type used when non-FP8 constant B is dynamically quantized during prepack. Defaults to FLOAT8E4M3FN.</dd>
</dl>
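
As a rough illustration of what block-wise scales along K could look like for one row of A, the sketch below uses the common abs-max scheme: each block's scale is its largest absolute value divided by the largest finite value of the FP8 format (448 for float8e4m3fn). The documentation above does not state how the kernel computes its scales, so the scheme, the requirement that K be a multiple of `block_size_k`, and the name `blockwise_scales` are all assumptions.

```python
from typing import List

# Largest finite value representable in float8e4m3fn.
FP8_E4M3FN_MAX = 448.0


def blockwise_scales(
    row: List[float],
    block_size_k: int,
    fp8_max: float = FP8_E4M3FN_MAX,
) -> List[float]:
    """Compute one symmetric scale per block of `block_size_k` values.

    Assumed scheme (not taken from the kernel): scale = max(|x|) / fp8_max,
    so that x / scale fits in the FP8 range. All-zero blocks get scale 1.0
    to avoid dividing by zero later.
    """
    if block_size_k <= 0 or len(row) % block_size_k != 0:
        raise ValueError("K must be a positive multiple of block_size_k")
    scales = []
    for start in range(0, len(row), block_size_k):
        block = row[start:start + block_size_k]
        amax = max(abs(v) for v in block)
        scales.append(amax / fp8_max if amax > 0.0 else 1.0)
    return scales
```

Because the quantization is symmetric, no zero point is computed; this matches the requirement that any zero-point input encode 0.0.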

#### Inputs (2 - 6)

<dl>
<dt><tt>A</tt> : TA</dt>
<dd>Input tensor A.</dd>
<dt><tt>B</tt> : TB</dt>
<dd>Input tensor B. FP8 B may be provided at runtime. Float, float16, and bfloat16 B are only supported when B is a constant initializer that can be quantized during prepack.</dd>
<dt><tt>B_scale</tt> (optional) : TS</dt>
<dd>Scale of FP8 input 'B'. Must be a block-wise tensor with shape (N / block_size_n, K / block_size_k). Required when B is already FP8. Ignored for non-FP8 constant B, where scales are computed during prepack.</dd>
<dt><tt>B_zero_point</tt> (optional) : TZ</dt>
<dd>Zero point tensor for input 'B'. Must have the same shape as B_scale and all values must encode 0.0.</dd>
<dt><tt>Y_scale</tt> (optional) : TS</dt>
<dd>Scale of output 'Y'. Must be a scalar when provided.</dd>
<dt><tt>Y_zero_point</tt> (optional) : TZ</dt>
<dd>Zero point tensor for output 'Y'. Must be a scalar encoding 0.0 when provided. May be provided without Y_scale; only Y_scale changes the floating-point output values.</dd>
</dl>
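
The `B_scale` shape rule can be checked with a few lines of Python. This is a sketch under one assumption that the text does not state: N and K are exact multiples of their block sizes. Both function names are hypothetical.

```python
from typing import Tuple


def expected_b_scale_shape(
    n: int, k: int, block_size_n: int, block_size_k: int
) -> Tuple[int, int]:
    """Return the B_scale shape (N / block_size_n, K / block_size_k).

    Assumes exact divisibility; behavior for ragged blocks is not
    described in the operator documentation.
    """
    if block_size_n <= 0 or block_size_k <= 0:
        raise ValueError("block sizes must be positive")
    if n % block_size_n != 0 or k % block_size_k != 0:
        raise ValueError("N and K must be multiples of their block sizes")
    return (n // block_size_n, k // block_size_k)


def scale_index(
    n_idx: int, k_idx: int, block_size_n: int, block_size_k: int
) -> Tuple[int, int]:
    """Return the (row, column) in B_scale covering the weight at (k_idx, n_idx)."""
    return (n_idx // block_size_n, k_idx // block_size_k)
```

With N = 256, K = 512, and both block sizes set to 128, `expected_b_scale_shape` gives `(2, 4)`, and the weight at column 130, row 300 is covered by the scale at `(1, 2)`.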

#### Outputs

<dl>
<dt><tt>Y</tt> : TY</dt>
<dd>Output tensor of shape (..., M, N).</dd>
</dl>

#### Type Constraints

<dl>
<dt><tt>TA</tt> : tensor(float16), tensor(bfloat16), tensor(float)</dt>
<dd>Constrain input A type to float16, bfloat16, or float.</dd>
<dt><tt>TB</tt> : tensor(float16), tensor(bfloat16), tensor(float), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)</dt>
<dd>Constrain input B type to fp8, or to float16, bfloat16, or float for constant initializers.</dd>
<dt><tt>TZ</tt> : tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)</dt>
<dd>Constrain zero point types to fp8. Only zero-valued zero points are supported.</dd>
<dt><tt>TS</tt> : tensor(float), tensor(float16), tensor(bfloat16)</dt>
<dd>Constrain scale types to float, float16, or bfloat16.</dd>
<dt><tt>TY</tt> : tensor(float16), tensor(bfloat16), tensor(float)</dt>
<dd>Constrain output type to float16, bfloat16, or float.</dd>
</dl>


### <a name="com.microsoft.DynamicQuantizeLSTM"></a><a name="com.microsoft.dynamicquantizelstm">**com.microsoft.DynamicQuantizeLSTM**</a>

#### Version
@@ -6691,5 +6751,3 @@ No versioning maintained for experimental ops.
<dt><tt>T</tt> : tensor(float)</dt>
<dd>Constrain input and output types to float32 tensors.</dd>
</dl>


1 change: 1 addition & 0 deletions docs/OperatorKernels.md
@@ -577,6 +577,7 @@ The **OpSet Version** column uses the following notation:
|CropAndResize|*in* X:**T1**<br> *in* rois:**T1**<br> *in* batch_indices:**T2**<br> *in* crop_size:**T2**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int32)|
|DecoderMaskedMultiHeadAttention|*in* query:**T**<br> *in* key:**T**<br> *in* value:**T**<br> *in* mask_index:**M**<br> *in* attention_bias:**T**<br> *in* past_key:**T**<br> *in* past_value:**T**<br> *in* past_sequence_length:**M**<br> *in* beam_width:**M**<br> *in* cache_indirection:**M**<br> *in* bias:**T**<br> *out* output:**T**<br> *out* present_key:**T**<br> *out* present_value:**T**<br> *out* qk:**QK**|1+|**T** = tensor(float)|
|DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int16), tensor(int32), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8)<br/> **T2** = tensor(float)|
|DynamicQuantMatMulFp8|*in* A:**TA**<br> *in* B:**TB**<br> *in* B_scale:**TS**<br> *in* B_zero_point:**TZ**<br> *in* Y_scale:**TS**<br> *in* Y_zero_point:**TZ**<br> *out* Y:**TY**|1+|**TA** = tensor(bfloat16), tensor(float), tensor(float16)<br/> **TB** = tensor(bfloat16), tensor(float), tensor(float16), tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)<br/> **TS** = tensor(bfloat16), tensor(float), tensor(float16)<br/> **TY** = tensor(bfloat16), tensor(float), tensor(float16)<br/> **TZ** = tensor(float8e4m3fn), tensor(float8e4m3fnuz), tensor(float8e5m2), tensor(float8e5m2fnuz)|
|DynamicQuantizeLSTM|*in* X:**T**<br> *in* W:**T2**<br> *in* R:**T2**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *in* initial_c:**T**<br> *in* P:**T**<br> *in* W_scale:**T**<br> *in* W_zero_point:**T2**<br> *in* R_scale:**T**<br> *in* R_zero_point:**T2**<br> *out* Y:**T**<br> *out* Y_h:**T**<br> *out* Y_c:**T**|1+|**T** = tensor(float)<br/> **T1** = tensor(int32)<br/> **T2** = tensor(int8), tensor(uint8)|
|DynamicQuantizeMatMul|*in* A:**T1**<br> *in* B:**T2**<br> *in* b_scale:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int8), tensor(uint8)|
|DynamicTimeWarping|*in* input:**F**<br> *out* output:**I**|1+|**F** = tensor(float)<br/> **I** = tensor(int32)|
6 changes: 6 additions & 0 deletions onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc
@@ -116,6 +116,9 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1,
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, NhwcMaxPool);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, NhwcMaxPool);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, QEmbedLayerNormalization);
#if !defined(DISABLE_FLOAT8_TYPES)
class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, DynamicQuantMatMulFp8);
#endif
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QGemm);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QGemm);
class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, MLFloat16, QMoE);
@@ -284,6 +287,9 @@ Status RegisterQuantizationKernels(KernelRegistry& kernel_registry) {
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, NhwcMaxPool)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, NhwcMaxPool)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, float, QEmbedLayerNormalization)>,
#if !defined(DISABLE_FLOAT8_TYPES)
BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, DynamicQuantMatMulFp8)>,
#endif
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, int8_t, QGemm)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, uint8_t, QGemm)>,
BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, MLFloat16, QMoE)>,