Add CPU DynamicQuantMatMulFp8 contrib op by melkap01-Arm · Pull Request #28688 · microsoft/onnxruntime

melkap01-Arm · 2026-05-27T16:56:45Z

Description

This MR adds a CPU contrib implementation for com.microsoft::DynamicQuantMatMulFp8. It keeps the FP8 GEMM path
internal to the contrib kernel for this PR, without adding a public MLAS FP8 API.

Scope

Adds DynamicQuantMatMulFp8 schema under the Microsoft contrib opset.
Registers the CPU contrib kernel when FP8 types are enabled.
Adds the CPU opkernel implementation and an internal scalar FP8 GEMM helper.
Adds provider tests for the FP8 opkernel path.

Operator Contract

A supports float, float16, and bfloat16.
Runtime B supports FP8 only and must be rank-2.
Constant initializer B supports float, float16, bfloat16, or FP8.
Non-FP8 constant B is dynamically quantized once during PrePack.
Dynamic non-FP8 B is intentionally rejected.
Y supports float, float16, and bfloat16.
Optional Y_scale and Y_zero_point are supported.
Supported FP8 formats: FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ.
The op enforces symmetric quantization; provided zero points must encode 0.0.
A scales are computed dynamically by the kernel.
For non-FP8 constant B, scales are computed during PrePack.
For FP8 runtime/constant B, B_scale is required and validated.
block_size_k and block_size_n default to 128.
B_scale and B_zero_point use [N / block_size_n, K / block_size_k] layout.
Runtime FP8 B is consumed directly.
Constant non-FP8 B is quantized to FP8 in PrePack.
Constant FP8 B preserves its FP8 type metadata.
Shared prepack metadata restores B shape/type, quantized B size, and B scale count.
K == 0 produces zero-filled output.
M == 0 and N == 0 return cleanly after cheap runtime validation.
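
The `[N / block_size_n, K / block_size_k]` layout above can be illustrated with a small indexing sketch. It assumes B is a row-major `[K, N]` matrix and that partial trailing blocks are handled by rounding the block counts up; neither detail is confirmed by the PR description, and `BlockLayout`, `ScaleCount`, and `ScaleIndex` are hypothetical names used only for illustration.

```cpp
#include <cstddef>

// Hypothetical description of the blocked layout; not a type from the PR.
struct BlockLayout {
  std::size_t K;
  std::size_t N;
  std::size_t block_size_k;
  std::size_t block_size_n;
};

inline std::size_t CeilDiv(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Total number of scale entries in a [N / block_size_n, K / block_size_k] tensor.
// Assumption: partial trailing blocks still get their own scale (ceil division).
inline std::size_t ScaleCount(const BlockLayout& l) {
  return CeilDiv(l.N, l.block_size_n) * CeilDiv(l.K, l.block_size_k);
}

// Flat row-major index of the scale that covers element B[k][n].
// The N-block index is the outer dimension, the K-block index the inner one.
inline std::size_t ScaleIndex(const BlockLayout& l, std::size_t k, std::size_t n) {
  const std::size_t k_blocks = CeilDiv(l.K, l.block_size_k);
  return (n / l.block_size_n) * k_blocks + (k / l.block_size_k);
}
```

With the default 128x128 blocks, a `[256, 384]` B would carry 3 x 2 = 6 scales under these assumptions, and the element at `k = 130, n = 300` would use the last one.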

Tests
Provider tests cover
Ran the FP8-converted Qwen3 ONNX model successfully.

Known Limitations

Dynamic non-FP8 B is not supported.
No public MLAS FP8 API is introduced in this PR.
No packed-B optimized backend or KleidiAI dispatch is included.

Comments Regarding the review points submitted on the main MR: #28416
Item 4 - Constant-input validation

Zero-point inputs are still part of the operator contract because they currently also carry the FP8 zero-point
type/encoding. The op only supports symmetric quantization, so any provided zero point must encode 0.0; non-zero
zero points are rejected.
Until a better contract is suggested, such as moving the quantization/FP8 encoding choice into a separate
attribute or argument, we prefer to keep zero points as explicit inputs. That keeps the current model format
flexible while we validate what scheme best matches the models we expect to support.
To reduce per-run cost, constant B_scale and B_zero_point validation is handled during PrePack when those inputs
are initializers. Runtime validation is kept only for dynamic inputs, where values can change between runs.
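
As a rough illustration of what "must encode 0.0" can mean at the bit level, the sketch below checks raw FP8 bytes. It relies on the published FP8 encodings: `0x00` is +0.0 in all four formats, while `0x80` is -0.0 in FLOAT8E4M3FN and FLOAT8E5M2 but NaN in the two FNUZ formats. Whether the kernel accepts -0.0 as a valid zero point is an assumption here, and `Fp8Format`, `EncodesZero`, and `AllZeroPointsAreZero` are hypothetical names, not identifiers from the PR.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical tag for the four FP8 formats listed in the operator contract.
enum class Fp8Format { E4M3FN, E4M3FNUZ, E5M2, E5M2FNUZ };

// Returns true when the raw byte represents zero in the given format.
// Assumption: -0.0 (0x80) is treated as an acceptable zero where the format has it.
inline bool EncodesZero(std::uint8_t bits, Fp8Format format) {
  if (bits == 0x00) return true;
  const bool has_negative_zero =
      format == Fp8Format::E4M3FN || format == Fp8Format::E5M2;
  return has_negative_zero && bits == 0x80;  // 0x80 is NaN in the FNUZ formats
}

// A check like this could run once in PrePack for initializer zero points,
// and on every run only for dynamic zero points.
inline bool AllZeroPointsAreZero(const std::uint8_t* zp, std::size_t count, Fp8Format format) {
  for (std::size_t i = 0; i < count; ++i) {
    if (!EncodesZero(zp[i], format)) return false;
  }
  return true;
}
```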

Item 6 - Temporary allocations

The current implementation keeps the operator contract flexible, so temporary buffers are used only on the paths
that need them. A is dynamically quantized at runtime, so the kernel needs temporary FP8 A data and computed A
scales. Lower-precision outputs use a float scratch buffer because accumulation is done in float and then
converted to the requested output type. B_scale conversion is only needed when model-provided runtime/FP8 B_scale is not already float; for non-FP8 constant B, scales are computed and stored during PrePack.

This does not mean every allocation happens on every execution path. Some allocations are required for the main
dynamic-activation path, while others only happen for specific cases such as non-float scale tensors or
lower-precision outputs. This PR keeps the implementation correctness-focused while preserving the flexible contract. If a more constrained contract is suggested, such as not supporting a runtime-provided B_scale or B tensor, those allocations would be reduced as well.
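
The float scratch buffer for lower-precision outputs can be sketched as follows. This is a minimal stand-alone illustration, not the kernel's code: it uses plain float inputs instead of FP8, `BFloat16Bits` and `MatMulToBFloat16` are hypothetical names, and the conversion uses round-to-nearest-even without NaN handling.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a bfloat16 storage type.
struct BFloat16Bits {
  std::uint16_t bits;
};

// Narrow a float to bfloat16 with round-to-nearest-even (NaN handling omitted).
inline BFloat16Bits FloatToBFloat16(float v) {
  std::uint32_t u;
  std::memcpy(&u, &v, sizeof(u));
  u += 0x7FFFu + ((u >> 16) & 1u);
  return {static_cast<std::uint16_t>(u >> 16)};
}

// Accumulate in a float scratch buffer, then convert once to the output type.
// The scratch allocation exists only because the output is not float.
inline void MatMulToBFloat16(const float* a, const float* b,
                             std::size_t M, std::size_t K, std::size_t N,
                             BFloat16Bits* y) {
  std::vector<float> scratch(M * N, 0.0f);
  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t k = 0; k < K; ++k) {
      for (std::size_t n = 0; n < N; ++n) {
        scratch[m * N + n] += a[m * K + k] * b[k * N + n];
      }
    }
  }
  for (std::size_t i = 0; i < M * N; ++i) {
    y[i] = FloatToBFloat16(scratch[i]);
  }
}
```

In this sketch a float output would let the accumulation write straight into the result and skip the scratch vector, and `K == 0` leaves the scratch at its zero initial value, which gives a zero-filled output.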

Motivation and Context

Signed-off-by: melkap01 <melike.kaptan@arm.com>

kunal-vaishnavi · 2026-05-28T16:59:41Z

  * <a href="#com.microsoft.DequantizeWithOrder">com.microsoft.DequantizeWithOrder</a>
+  * <a href="#com.microsoft.DynamicQuantMatMulFp8">com.microsoft.DynamicQuantMatMulFp8</a>
  * <a href="#com.microsoft.DynamicQuantizeLSTM">com.microsoft.DynamicQuantizeLSTM</a>
  * <a href="#com.microsoft.DynamicQuantizeMatMul">com.microsoft.DynamicQuantizeMatMul</a>


There already exists a DynamicQuantizeMatMul contrib op. Could we modify its op schema to support all of these FP8 changes?

kunal-vaishnavi · 2026-05-28T21:33:08Z

+
+  b_type_ = fp8_type_;
+  has_b_type_ = true;
+  if (K == 0) {


if (K == 0 || N == 0) { return Status::OK(); }

kunal-vaishnavi · 2026-05-28T21:35:02Z

+                                     size_t block_size_n,
+                                     float fp8_max_abs,
+                                     float* scales) {
+  // Reference-style dynamic quantization: derive one positive scale from each source block.


Can we use concurrency::ThreadPool::TryParallelFor here instead of many nested for loops?
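
One way to read this suggestion: collapse the (N-block, K-block) loops into a single flat index so the work can be handed out as `[first, last)` ranges, which is the callback shape `concurrency::ThreadPool::TryParallelFor` expects. The sketch below is an assumption-laden illustration rather than the PR's code. `ForEachRangeSketch` is a hypothetical serial stand-in for the thread pool so the example stays self-contained, B is assumed to be row-major `[K, N]`, and the fallback scale of 1.0 for an all-zero block is also an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>

using RangeFn = std::function<void(std::ptrdiff_t first, std::ptrdiff_t last)>;

// Hypothetical serial stand-in: calls fn on consecutive [first, last) chunks.
// A real kernel would pass the same lambda to the ORT thread pool instead.
inline void ForEachRangeSketch(std::ptrdiff_t total, std::ptrdiff_t chunk, const RangeFn& fn) {
  for (std::ptrdiff_t first = 0; first < total; first += chunk) {
    fn(first, std::min(first + chunk, total));
  }
}

// One positive scale per source block, stored as [n_blocks, k_blocks].
inline void ComputeBlockScalesFlat(const float* b, std::size_t K, std::size_t N,
                                   std::size_t block_size_k, std::size_t block_size_n,
                                   float fp8_max_abs, float* scales) {
  const std::size_t k_blocks = (K + block_size_k - 1) / block_size_k;
  const std::size_t n_blocks = (N + block_size_n - 1) / block_size_n;
  const auto total = static_cast<std::ptrdiff_t>(n_blocks * k_blocks);
  ForEachRangeSketch(total, /*chunk=*/4, [&](std::ptrdiff_t first, std::ptrdiff_t last) {
    for (std::ptrdiff_t idx = first; idx < last; ++idx) {
      // Recover the 2-D block coordinates from the flat index.
      const std::size_t nb = static_cast<std::size_t>(idx) / k_blocks;
      const std::size_t kb = static_cast<std::size_t>(idx) % k_blocks;
      const std::size_t k_end = std::min(K, (kb + 1) * block_size_k);
      const std::size_t n_end = std::min(N, (nb + 1) * block_size_n);
      float max_abs = 0.0f;
      for (std::size_t k = kb * block_size_k; k < k_end; ++k) {
        for (std::size_t n = nb * block_size_n; n < n_end; ++n) {
          max_abs = std::max(max_abs, std::fabs(b[k * N + n]));
        }
      }
      // Each flat index writes its own element, so ranges never overlap.
      scales[idx] = max_abs > 0.0f ? max_abs / fp8_max_abs : 1.0f;
    }
  });
}
```

Because every flat index owns exactly one output scale, the ranges are independent and need no synchronization when they run on separate threads.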

kunal-vaishnavi · 2026-05-28T22:24:43Z

+}
+
+// Reject invalid scales before quantization divides by them or the GEMM dequantizes with them.
+Status ValidatePositiveFiniteScaleTensor(const Tensor& scale, const char* scale_name) {


Can we reduce duplication across the if conditions and use a templated method? Something like the following could work.

template <typename T>
Status ValidatePositiveFiniteScaleTensorImpl(const Tensor& scale, const char* scale_name) {
  const auto* data = scale.Data<T>();
  const size_t count = static_cast<size_t>(scale.Shape().Size());
  for (size_t i = 0; i < count; ++i) {
    const float value = static_cast<float>(data[i]);
    ORT_RETURN_IF(!std::isfinite(value) || value <= 0.0f,
                  "DynamicQuantMatMulFp8 requires ", scale_name, " values to be finite and positive.");
  }
  return Status::OK();
}

Status ValidatePositiveFiniteScaleTensor(const Tensor& scale, const char* scale_name) {
  if (scale.IsDataType<float>()) {
    return ValidatePositiveFiniteScaleTensorImpl<float>(scale, scale_name);
  }
  if (scale.IsDataType<MLFloat16>()) {
    return ValidatePositiveFiniteScaleTensorImpl<MLFloat16>(scale, scale_name);
  }
  if (scale.IsDataType<BFloat16>()) {
    return ValidatePositiveFiniteScaleTensorImpl<BFloat16>(scale, scale_name);
  }
  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "DynamicQuantMatMulFp8 requires ", scale_name,
                         " input to be float, float16, or bfloat16.");
}

melkap01-Arm added 3 commits May 27, 2026 15:44

Add DynamicQuantMatMulFp8 contrib op with internal FP8 GEMM helper

4cb522c

Signed-off-by: melkap01 <melike.kaptan@arm.com>

removing 2 mlas references from internal implementation

deebf48

Signed-off-by: melkap01 <melike.kaptan@arm.com>

missing header added

1e7951f

Signed-off-by: melkap01 <melike.kaptan@arm.com>

melkap01-Arm marked this pull request as ready for review May 27, 2026 17:16

github-advanced-security AI found potential problems May 27, 2026

Comment thread onnxruntime/contrib_ops/cpu/quantization/dynamic_quant_matmul_fp8.cc Fixed

melkap01-Arm added 2 commits May 27, 2026 21:31

github review comments addessed

8dabaa2

Signed-off-by: melkap01 <melike.kaptan@arm.com>

Merge branch 'microsoft:main' into Splitted_DynamicQuantMatMulFp8_sup…

163bc73

…port_on_contribOps

tianleiwu requested review from jambayk and kunal-vaishnavi May 27, 2026 22:59

kunal-vaishnavi reviewed May 28, 2026

melkap01-Arm wants to merge 5 commits into microsoft:main from melkap01-Arm:Splitted_DynamicQuantMatMulFp8_support_on_contribOps

3 participants
