Add CPU DynamicQuantMatMulFp8 contrib op#28688

Open
melkap01-Arm wants to merge 5 commits into
microsoft:mainfrom
melkap01-Arm:Splitted_DynamicQuantMatMulFp8_support_on_contribOps

Conversation

Contributor

@melkap01-Arm melkap01-Arm commented May 27, 2026

Description

This PR adds a CPU contrib implementation of com.microsoft::DynamicQuantMatMulFp8. The FP8 GEMM path stays
internal to the contrib kernel for this PR; no public MLAS FP8 API is added.

Scope

  • Adds DynamicQuantMatMulFp8 schema under the Microsoft contrib opset.
  • Registers the CPU contrib kernel when FP8 types are enabled.
  • Adds the CPU opkernel implementation and an internal scalar FP8 GEMM helper.
  • Adds provider tests for the FP8 opkernel path.

Operator Contract

  • A supports float, float16, and bfloat16.

  • Runtime B supports FP8 only and must be rank-2.

  • Constant initializer B supports float, float16, bfloat16, or FP8.

  • Non-FP8 constant B is dynamically quantized once during PrePack.

  • Dynamic non-FP8 B is intentionally rejected.

  • Y supports float, float16, and bfloat16.

  • Optional Y_scale and Y_zero_point are supported.

  • Supported FP8 formats: FLOAT8E4M3FN, FLOAT8E4M3FNUZ, FLOAT8E5M2, FLOAT8E5M2FNUZ.

  • The op enforces symmetric quantization; provided zero points must encode 0.0.

  • A scales are computed dynamically by the kernel.

  • For non-FP8 constant B, scales are computed during PrePack.

  • For FP8 runtime/constant B, B_scale is required and validated.

  • block_size_k and block_size_n default to 128.

  • B_scale and B_zero_point use [N / block_size_n, K / block_size_k] layout.

  • Runtime FP8 B is consumed directly.

  • Constant non-FP8 B is quantized to FP8 in PrePack.

  • Constant FP8 B preserves its FP8 type metadata.

  • Shared prepack metadata restores B shape/type, quantized B size, and B scale count.

  • K == 0 produces zero-filled output.

  • M == 0 and N == 0 return cleanly after cheap runtime validation.
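To make the block-wise scale layout above concrete, here is a minimal sketch of symmetric per-block scale computation. It is not the kernel code from this PR: `ComputeBlockScales` is a hypothetical helper, B is assumed to be a row-major K x N float matrix, partial trailing blocks are handled with a ceiling division (the PR may instead require divisibility), and an all-zero block is assumed to keep a scale of 1.0 so that every scale stays positive.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): derives one positive scale per
// (block_size_k x block_size_n) block of a row-major K x N matrix.
// Scales are stored as [n_blocks, k_blocks], mirroring the
// [N / block_size_n, K / block_size_k] layout in the operator contract.
// Assumes block_size_k > 0 and block_size_n > 0.
std::vector<float> ComputeBlockScales(const std::vector<float>& b,
                                      std::size_t K, std::size_t N,
                                      std::size_t block_size_k,
                                      std::size_t block_size_n,
                                      float fp8_max_abs) {
  // Assumption: a partial trailing block still gets its own scale.
  const std::size_t k_blocks = (K + block_size_k - 1) / block_size_k;
  const std::size_t n_blocks = (N + block_size_n - 1) / block_size_n;
  std::vector<float> scales(n_blocks * k_blocks, 1.0f);

  for (std::size_t nb = 0; nb < n_blocks; ++nb) {
    const std::size_t n_begin = nb * block_size_n;
    const std::size_t n_end = std::min(N, n_begin + block_size_n);
    for (std::size_t kb = 0; kb < k_blocks; ++kb) {
      const std::size_t k_begin = kb * block_size_k;
      const std::size_t k_end = std::min(K, k_begin + block_size_k);

      // Largest magnitude inside this block.
      float max_abs = 0.0f;
      for (std::size_t k = k_begin; k < k_end; ++k) {
        for (std::size_t n = n_begin; n < n_end; ++n) {
          max_abs = std::max(max_abs, std::fabs(b[k * N + n]));
        }
      }

      // Symmetric quantization: map max_abs onto the largest finite FP8
      // magnitude. All-zero blocks keep the default scale of 1.0f (assumption).
      if (max_abs > 0.0f) {
        scales[nb * k_blocks + kb] = max_abs / fp8_max_abs;
      }
    }
  }
  return scales;
}
```

For FLOAT8E4M3FN the largest finite magnitude is 448, so a caller would pass `fp8_max_abs = 448.0f`; the other formats have different maxima.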

Tests
Provider tests cover
Ran the FP8-converted Qwen3 ONNX model successfully.

Known Limitations

  • Dynamic non-FP8 B is not supported.
  • No public MLAS FP8 API is introduced in this PR.
  • No packed-B optimized backend or KleidiAI dispatch is included.

Comments on the review points submitted on the main PR: #28416
Item 4 - Constant-input validation

Zero-point inputs are still part of the operator contract because they currently also carry the FP8 zero-point
type/encoding. The op only supports symmetric quantization, so any provided zero point must encode 0.0; non-zero
values are rejected. Until a better contract is suggested, such as moving the quantization/FP8 encoding choice into
a separate attribute or argument, we prefer to keep zero points as explicit inputs. That keeps the current model
format flexible while we validate which scheme best matches the models we expect to support.
To reduce per-run cost, constant B_scale and B_zero_point validation is handled during PrePack when those inputs
are initializers. Runtime validation is kept only for dynamic inputs, where values can change between runs.
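As an illustration of what "must encode 0.0" can mean at the byte level, the sketch below checks a raw FP8 byte against the four supported formats. The `Fp8Format` enum and `EncodesZero` function are hypothetical names for this example, not identifiers from the PR. Whether the kernel accepts negative zero as a valid zero point is an assumption here; the sketch accepts it for the formats that have one.

```cpp
#include <cstdint>

// Hypothetical enum for this sketch; the PR uses ONNX tensor element types.
enum class Fp8Format { E4M3FN, E4M3FNUZ, E5M2, E5M2FNUZ };

// Returns true if the raw FP8 byte represents zero in the given format.
bool EncodesZero(std::uint8_t bits, Fp8Format format) {
  // 0x00 is positive zero in all four formats.
  if (bits == 0x00) {
    return true;
  }
  // 0x80 is negative zero in E4M3FN and E5M2. The FNUZ variants have no
  // negative zero; there 0x80 is the NaN encoding.
  const bool has_negative_zero =
      format == Fp8Format::E4M3FN || format == Fp8Format::E5M2;
  // Assumption: negative zero is treated as an acceptable zero point.
  return has_negative_zero && bits == 0x80;
}
```

When the zero-point tensor is an initializer, a check like this only needs to run once, which matches the PrePack-time validation described above.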

Item 6 - Temporary allocations

The current implementation keeps the operator contract flexible, so temporary buffers are used only on the paths
that need them. A is dynamically quantized at runtime, so the kernel needs temporary FP8 A data and computed A
scales. Lower-precision outputs use a float scratch buffer because accumulation is done in float and then
converted to the requested output type. B_scale conversion is only needed when model-provided runtime/FP8 B_scale is not already float; for non-FP8 constant B, scales are computed and stored during PrePack.

This does not mean every allocation happens on every execution path. Some allocations are required for the main
dynamic-activation path, while others only happen for specific cases such as non-float scale tensors or
lower-precision outputs. This PR keeps the implementation correctness-focused while preserving the flexible contract.
If a more constrained contract is suggested, such as not supporting a runtime-provided B_scale or B tensor, those
allocations would be reduced as well.
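The role of the float scratch buffer and the per-block scales can be sketched with a scalar reference GEMM. This is an illustrative sketch, not the PR's internal helper: `BlockScaledGemm` is a hypothetical name, the quantized operands are represented as already-decoded float values to keep the example self-contained, and the A-scale layout (one scale per row and K block) is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference (not the PR's helper). Inputs:
//   a_q     : M x K quantized A values, decoded to float for this sketch
//   a_scale : M x k_blocks (assumed layout: one scale per row and K block)
//   b_q     : K x N quantized B values, decoded to float for this sketch
//   b_scale : n_blocks x k_blocks, as in the operator contract
// Returns a float M x N result; a lower-precision output type would be
// converted from this buffer afterwards.
std::vector<float> BlockScaledGemm(const std::vector<float>& a_q,
                                   const std::vector<float>& a_scale,
                                   const std::vector<float>& b_q,
                                   const std::vector<float>& b_scale,
                                   std::size_t M, std::size_t K, std::size_t N,
                                   std::size_t block_size_k,
                                   std::size_t block_size_n) {
  const std::size_t k_blocks = (K + block_size_k - 1) / block_size_k;
  std::vector<float> y(M * N, 0.0f);  // stays zero-filled when K == 0

  for (std::size_t m = 0; m < M; ++m) {
    for (std::size_t n = 0; n < N; ++n) {
      const std::size_t nb = n / block_size_n;
      float acc = 0.0f;  // accumulation is done in float
      for (std::size_t kb = 0; kb < k_blocks; ++kb) {
        const std::size_t k_begin = kb * block_size_k;
        const std::size_t k_end = std::min(K, k_begin + block_size_k);
        float partial = 0.0f;
        for (std::size_t k = k_begin; k < k_end; ++k) {
          partial += a_q[m * K + k] * b_q[k * N + n];
        }
        // Dequantize each K block with the matching A and B scales.
        acc += partial * a_scale[m * k_blocks + kb] * b_scale[nb * k_blocks + kb];
      }
      y[m * N + n] = acc;
    }
  }
  return y;
}
```

With K == 0 the block loop never runs, so the output stays zero-filled, which is consistent with the contract listed in the description.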

Motivation and Context

Signed-off-by: melkap01 <melike.kaptan@arm.com>
@melkap01-Arm melkap01-Arm marked this pull request as ready for review May 27, 2026 17:16
Comment thread docs/ContribOperators.md
* <a href="#com.microsoft.DequantizeWithOrder">com.microsoft.DequantizeWithOrder</a>
* <a href="#com.microsoft.DynamicQuantMatMulFp8">com.microsoft.DynamicQuantMatMulFp8</a>
* <a href="#com.microsoft.DynamicQuantizeLSTM">com.microsoft.DynamicQuantizeLSTM</a>
* <a href="#com.microsoft.DynamicQuantizeMatMul">com.microsoft.DynamicQuantizeMatMul</a>
Contributor

There already exists a DynamicQuantizeMatMul contrib op. Could we modify its op schema to support all of these FP8 changes?


b_type_ = fp8_type_;
has_b_type_ = true;
if (K == 0) {
Contributor

if (K == 0 || N == 0) {
  return Status::OK();
}

size_t block_size_n,
float fp8_max_abs,
float* scales) {
// Reference-style dynamic quantization: derive one positive scale from each source block.
Contributor

Can we use concurrency::ThreadPool::TryParallelFor here instead of many nested for loops?
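One way to read this suggestion: flatten the (N block, K block) pair into a single index so that a range-based parallel-for can hand out contiguous index ranges. The sketch below only shows that index shape. `RunInChunks` is a serial stand-in written for this example, not the ONNX Runtime thread pool, and `VisitBlocks` is a hypothetical function; in the kernel the lambda body would compute a block scale instead of counting visits.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Serial stand-in for a range-based parallel-for (an assumption for this
// sketch): calls fn(first, last) on consecutive half-open chunks.
// Requires chunk > 0.
void RunInChunks(std::ptrdiff_t total, std::ptrdiff_t chunk,
                 const std::function<void(std::ptrdiff_t, std::ptrdiff_t)>& fn) {
  for (std::ptrdiff_t first = 0; first < total; first += chunk) {
    fn(first, std::min(total, first + chunk));
  }
}

// Hypothetical example: one flat loop over all blocks replaces the nested
// block loops. Returns how many times each block was visited.
std::vector<int> VisitBlocks(std::ptrdiff_t n_blocks, std::ptrdiff_t k_blocks,
                             std::ptrdiff_t chunk) {
  std::vector<int> visits(static_cast<std::size_t>(n_blocks * k_blocks), 0);
  RunInChunks(n_blocks * k_blocks, chunk,
              [&](std::ptrdiff_t first, std::ptrdiff_t last) {
                for (std::ptrdiff_t i = first; i < last; ++i) {
                  // Recover the 2-D block coordinates from the flat index.
                  const std::ptrdiff_t nb = i / k_blocks;
                  const std::ptrdiff_t kb = i % k_blocks;
                  // Each block belongs to exactly one range, so workers
                  // would write disjoint output slots.
                  visits[static_cast<std::size_t>(nb * k_blocks + kb)] += 1;
                }
              });
  return visits;
}
```

Because every flat index maps to a distinct scale slot, the ranges need no synchronization between them, which is what makes the per-block scale loop a plausible candidate for this kind of parallel-for.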

}

// Reject invalid scales before quantization divides by them or the GEMM dequantizes with them.
Status ValidatePositiveFiniteScaleTensor(const Tensor& scale, const char* scale_name) {
Contributor

Can we reduce duplication across the if conditions and use a templated method? Something like the following could work.

template <typename T>
Status ValidatePositiveFiniteScaleTensorImpl(const Tensor& scale, const char* scale_name) {
  const auto* data = scale.Data<T>();
  const size_t count = static_cast<size_t>(scale.Shape().Size());

  for (size_t i = 0; i < count; ++i) {
    const float value = static_cast<float>(data[i]);
    ORT_RETURN_IF(!std::isfinite(value) || value <= 0.0f,
                  "DynamicQuantMatMulFp8 requires ", scale_name,
                  " values to be finite and positive.");
  }

  return Status::OK();
}

Status ValidatePositiveFiniteScaleTensor(const Tensor& scale, const char* scale_name) {
  if (scale.IsDataType<float>()) {
    return ValidatePositiveFiniteScaleTensorImpl<float>(scale, scale_name);
  }
  if (scale.IsDataType<MLFloat16>()) {
    return ValidatePositiveFiniteScaleTensorImpl<MLFloat16>(scale, scale_name);
  }
  if (scale.IsDataType<BFloat16>()) {
    return ValidatePositiveFiniteScaleTensorImpl<BFloat16>(scale, scale_name);
  }

  return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                         "DynamicQuantMatMulFp8 requires ", scale_name,
                         " input to be float, float16, or bfloat16.");
}

3 participants