[MLAS] Update the NHWC sans transposes path to also support Depthwise convolutions by orlmon01 · Pull Request #28565 · microsoft/onnxruntime

orlmon01 · 2026-05-19T13:41:32Z

Description

A path for MLAS to support NHWC convolutions without the need for transposes was added in PR #26834.
This PR extends those changes to also support depthwise convolutions via the same pathway.

What changed:

The shared NHWC capability gate in onnxruntime/core/mlas/lib/convolve.cpp:1348 no longer requires GroupCount == 1. It now allows GroupCount > 1 only when the op is true depthwise, meaning filters_per_group == 1.
The NHWC transformer in onnxruntime/core/optimizer/nhwc_transformer.cc:162 was updated to pass the real group value and compute filter_count per group instead of hard-coding group 1. That is what lets grouped depthwise Conv/FusedConv nodes get rewritten to com.microsoft.NhwcFusedConv.
The KleidiAI execution path in onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp:553 learned how to handle grouped NHWC tensors by:
- gathering one group’s channels out of interleaved NHWC input into a temporary contiguous buffer,
- running the existing per-group kernel,
- scattering that group’s output channels back into interleaved NHWC output.
Tests were added for a working NHWC depthwise case in onnxruntime/test/contrib_ops/fused_conv_test.cc:466, and transformer tests were updated to verify both the new positive case and the expected skip cases in onnxruntime/test/optimizer/nhwc_transformer_test.cc:416.
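
For readers who want to picture the gather / run / scatter steps described for the KleidiAI path, here is a minimal sketch. It is not the code in convolve_kleidiai.cpp: `GatherGroup`, `ScatterGroup`, and `ProcessGrouped` are hypothetical names, and the "kernel" is a stand-in that doubles each value so the round trip is easy to check.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy one group's channels out of an interleaved NHWC
// image (pixels x total_channels) into a contiguous pixels x group_channels buffer.
void GatherGroup(const float* nhwc, std::size_t pixels, std::size_t total_channels,
                 std::size_t group, std::size_t group_channels, float* packed) {
    assert((group + 1) * group_channels <= total_channels);
    for (std::size_t p = 0; p < pixels; ++p) {
        const float* src = nhwc + p * total_channels + group * group_channels;
        std::copy_n(src, group_channels, packed + p * group_channels);
    }
}

// Hypothetical helper: the inverse of GatherGroup. Writes a contiguous
// per-group result back into its channel slots in the interleaved NHWC output.
void ScatterGroup(const float* packed, std::size_t pixels, std::size_t total_channels,
                  std::size_t group, std::size_t group_channels, float* nhwc) {
    assert((group + 1) * group_channels <= total_channels);
    for (std::size_t p = 0; p < pixels; ++p) {
        float* dst = nhwc + p * total_channels + group * group_channels;
        std::copy_n(packed + p * group_channels, group_channels, dst);
    }
}

// Gather each group, run a stand-in per-group "kernel" (x -> 2x), scatter back.
std::vector<float> ProcessGrouped(const std::vector<float>& input, std::size_t pixels,
                                  std::size_t groups, std::size_t group_channels) {
    const std::size_t total_channels = groups * group_channels;
    assert(input.size() == pixels * total_channels);
    std::vector<float> output(input.size());
    std::vector<float> in_buf(pixels * group_channels);
    std::vector<float> out_buf(pixels * group_channels);
    for (std::size_t g = 0; g < groups; ++g) {
        GatherGroup(input.data(), pixels, total_channels, g, group_channels, in_buf.data());
        std::transform(in_buf.begin(), in_buf.end(), out_buf.begin(),
                       [](float v) { return 2.0f * v; });
        ScatterGroup(out_buf.data(), pixels, total_channels, g, group_channels, output.data());
    }
    return output;
}
```

With a real convolution the output spatial size and per-group channel count can differ from the input's, so the scatter would use the output dimensions rather than reusing `pixels` and `group_channels`.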

Added performance benchmark tests to allow for comparison between the new NHWC path and the old NCHW default.
Sample output:

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                                                     Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:1/Cpg:64/Fpg:64/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time                508509 ns       508507 ns         1374
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:1/Cpg:128/Fpg:128/I:28/28/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time              700573 ns       700386 ns          997
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:64/Cpg:1/Fpg:1/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time                6471094 ns      6471114 ns          132
SCONV_NCHW/KleidiAiNhwcComparison_NchwBaseline/Rank:2/N:1/G:72/Cpg:1/Fpg:1/I:48/80/K:3/3/P:1/1/1/1/S:2/2/D:1/1/real_time                3768969 ns      3767797 ns          217
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:1/Cpg:64/Fpg:64/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time       414198 ns       414197 ns         1688
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:1/Cpg:128/Fpg:128/I:28/28/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time     652454 ns       652454 ns         1074
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:64/Cpg:1/Fpg:1/I:56/56/K:3/3/P:1/1/1/1/S:1/1/D:1/1/real_time       6032947 ns      6032940 ns          117
SCONV_NHWC_KLEIDIAI/KleidiAiNhwcComparison_NhwcFastPath/Rank:2/N:1/G:72/Cpg:1/Fpg:1/I:48/80/K:3/3/P:1/1/1/1/S:2/2/D:1/1/real_time       3022041 ns      3018352 ns          227
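
To make the sample output easier to read, the sketch below computes NCHW time divided by NHWC time for each matching shape. The timings are copied from the rows above; the struct and function names are illustrative only. On this single sample run the ratios come out at roughly 1.23x and 1.07x for the two G:1 shapes, and roughly 1.07x and 1.25x for the two depthwise shapes.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>

// One NCHW/NHWC pair from the sample benchmark output (real_time, ns).
struct BenchPair {
    const char* shape;
    double nchw_ns;
    double nhwc_ns;
};

constexpr BenchPair kPairs[] = {
    {"G:1/Cpg:64/Fpg:64/I:56/56", 508509.0, 414198.0},
    {"G:1/Cpg:128/Fpg:128/I:28/28", 700573.0, 652454.0},
    {"G:64/Cpg:1/Fpg:1/I:56/56", 6471094.0, 6032947.0},
    {"G:72/Cpg:1/Fpg:1/I:48/80/S:2/2", 3768969.0, 3022041.0},
};

// Speedup of the NHWC fast path relative to the NCHW baseline (>1 means faster).
inline double Speedup(const BenchPair& pair) {
    return pair.nchw_ns / pair.nhwc_ns;
}

// Print one line per shape with its speedup.
void PrintSpeedups() {
    for (const BenchPair& pair : kPairs) {
        std::printf("%-32s %.3fx\n", pair.shape, Speedup(pair));
    }
}
```

These numbers come from one run on one machine, so they indicate direction rather than a guaranteed gain.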

* Allow for NHWC Depthwise convolutions when groups are values other than 1
* Added verification tests
* Changed the fallback / skip tests to now check for asymmetric padding, non-depthwise grouped conv, and multiplier > 1

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>

orlmon01 · 2026-05-19T13:41:53Z

@microsoft-github-policy-service agree company="Arm"

Copilot

Pull request overview

Expands the existing MLAS/KleidiAI NHWC “no-transpose” convolution fast path to support true depthwise convolutions (grouped conv where filters-per-group == 1), and wires that capability through the NHWC transformer plus adds test/benchmark coverage.

Changes:

Relax MLAS NHWC capability gating to allow GroupCount > 1 only for true depthwise (FilterCount-per-group == 1).
Update NHWC transformer filtering to pass the real group count and compute per-group filter count.
Extend KleidiAI NHWC execution to handle grouped NHWC tensors via per-group gather/compute/scatter, plus add unit tests and a benchmark comparison.
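
The grouping rule summarised above can be sketched as a small predicate. This is a hypothetical stand-in, not the MLAS function or its signature, and it models only the grouping condition; the real gate also checks other properties that are not shown here. The per-group filter count is assumed to be the total filter count divided by the group count.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the grouping part of the NHWC capability gate.
// Returns true for ungrouped convolutions and for true depthwise convolutions
// (exactly one filter per group); returns false for every other grouped case.
bool NhwcFastPathAllowsGrouping(std::size_t group_count, std::size_t total_filter_count) {
    if (group_count == 0 || total_filter_count % group_count != 0) {
        return false;  // malformed shape
    }
    if (group_count == 1) {
        return true;  // the case the original path already handled
    }
    const std::size_t filters_per_group = total_filter_count / group_count;
    return filters_per_group == 1;  // true depthwise only
}
```

Under this sketch a 64-group convolution with 64 filters passes, while 4 groups with 64 filters (non-depthwise grouped) and 64 groups with 128 filters (multiplier 2) are rejected, matching the skip cases named in the tests.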

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.


File	Description
onnxruntime/core/mlas/lib/convolve.cpp	Updates NHWC capability gate to allow depthwise grouped convs.
onnxruntime/core/mlas/lib/kleidiai/convolve_kleidiai.cpp	Implements grouped-NHWC handling by gathering/scattering channels per group.
onnxruntime/core/optimizer/nhwc_transformer.cc	Passes group count + per-group filter count into the NHWC fast-path capability check.
onnxruntime/core/providers/cpu/nn/conv.h	Broadens KleidiAI fast-path compilation guard to `MLAS_TARGET_ARM64`.
onnxruntime/core/providers/cpu/nn/conv.cc	Same guard update for KleidiAI fast-path code.
onnxruntime/test/optimizer/nhwc_transformer_test.cc	Adds/updates tests validating depthwise enablement and expected skip cases.
onnxruntime/test/contrib_ops/fused_conv_test.cc	Adds an NHWC depthwise FusedConv correctness test (conditionally enabled).
onnxruntime/test/mlas/bench/bench_sconv.cpp	Adds benchmark cases comparing NCHW baseline vs NHWC KleidiAI fast path, including depthwise shapes.


orlmon01 · 2026-05-21T14:13:43Z

  const auto group = node.GetAttributeInt("group").value_or(1);
-  if (group != 1) {
+  if (group <= 0) {
    return false;
  }
+  const auto group_count = narrow<size_t>(group);



This is a valid concern. I've added a limit check to be safe.
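
As an illustration of the kind of check discussed in this thread, the sketch below validates a signed attribute value before converting it to `size_t`. `TryGetGroupCount` and its default upper bound are assumptions for the example; the actual limit used in the PR is not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper, not ONNX Runtime code. Converts a signed "group"
// attribute to size_t only when it lies in (0, max_groups]. The default
// bound of INT32_MAX is an assumption chosen for illustration.
bool TryGetGroupCount(std::int64_t group, std::size_t& group_count,
                      std::int64_t max_groups = std::numeric_limits<std::int32_t>::max()) {
    if (group <= 0 || group > max_groups) {
        return false;  // reject zero, negative, and oversized values
    }
    group_count = static_cast<std::size_t>(group);  // safe: range checked above
    return true;
}
```

Rejecting the value up front means the caller can skip the NHWC rewrite instead of hitting a narrowing failure later.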

orlmon01 · 2026-05-21T14:25:40Z

    for (size_t g = 0; g < groups; ++g) {
+        const float* input_group = in;
+        std::vector<float> input_group_buffer;
+        if (grouped_channels_last) {
+            input_group_buffer.resize(ih * iw * ci);
+            for (size_t pixel = 0; pixel < ih * iw; ++pixel) {
+                const float* src = input_base + pixel * input_channels_total + g * ci;
+                std::copy_n(src, ci, input_group_buffer.data() + pixel * ci);
+            }


This is a fairly minor concern, but it does affect performance, so I've moved the input_group_buffer allocation out of the loop and it is now sized only once.
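
A minimal sketch of the hoisting pattern, assuming every group has the same channel count so one buffer size fits all iterations. `SumPerGroup` is a hypothetical function with a trivial per-group reduction standing in for the real kernel; it is not the code from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: the scratch buffer is allocated and sized once,
// before the group loop, then overwritten on each iteration.
std::vector<float> SumPerGroup(const std::vector<float>& nhwc, std::size_t pixels,
                               std::size_t groups, std::size_t ci) {
    const std::size_t total_channels = groups * ci;
    assert(nhwc.size() == pixels * total_channels);

    std::vector<float> group_buffer(pixels * ci);  // single allocation, reused below
    std::vector<float> sums(groups, 0.0f);

    for (std::size_t g = 0; g < groups; ++g) {
        // Gather this group's channels into the reused contiguous buffer.
        for (std::size_t pixel = 0; pixel < pixels; ++pixel) {
            const float* src = nhwc.data() + pixel * total_channels + g * ci;
            std::copy_n(src, ci, group_buffer.data() + pixel * ci);
        }
        // Stand-in for the per-group kernel: sum the gathered values.
        for (float v : group_buffer) {
            sums[g] += v;
        }
    }
    return sums;
}
```

Declaring the vector inside the loop would construct, resize, and free it once per group, which adds up for depthwise shapes where the group count equals the channel count.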

orlmon01 · 2026-05-21T14:23:24Z

+  if (rank <= 0) throw std::invalid_argument("Kernel rank must greater than 0!");
+  if (batch_size <= 0) throw std::invalid_argument("Batch size must greater than 0!");
+  if (groups <= 0) throw std::invalid_argument("Group count must greater than 0!");
+  if (input_channels_per_group <= 0) throw std::invalid_argument("input_channels_per_group must greater than 0!");
+  if (output_channels_per_group <= 0) throw std::invalid_argument("output_channels_per_group must greater than 0!");



orlmon01 added 2 commits May 19, 2026 14:30

Adding benchmark tests removed unnecessary linux ifdefs

6a89722

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>

hariharans29 requested a review from Copilot May 19, 2026 16:09

Copilot started reviewing on behalf of hariharans29 May 19, 2026 16:13

Copilot AI reviewed May 19, 2026

JonathanC-ARM mentioned this pull request May 19, 2026

Remove ifdef for linux targets - NHWC Transformer in Conv op #28564

Closed

orlmon01 added 6 commits May 20, 2026 10:52

Merge branch 'microsoft:main' into depthwise

0463008

Merge branch 'microsoft:main' into depthwise

934693b

Add a limit to FloatNhwcWrapperFilter to avoid narrow oversize errors

ef85de2

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>

Move the input_group_buffer allocation outside of the loop in ConvolveSme and only size it once

2c43663

* Fixed some grammar in throw statements

Signed-off-by: Orlaith Monahan <orlaith.monahan@arm.com>

Merge branch 'microsoft:main' into depthwise

7bbca2e

Merge branch 'microsoft:main' into depthwise

7763847

orlmon01 wants to merge 8 commits into microsoft:main from orlmon01:depthwise
