[improvement] Batch the non-grouped Conv GEMM with gemmStridedBatched instead of a per-sample loop #29

In `Generate_GPU_ALPAKA` of `ROperator_Conv.hxx` (the batch loop, around line 867), the non-grouped path runs one blas.matmul per batch sample. Each iteration does im2col into the shared `_xcol` buffer, broadcasts bias, calls matmul, then `alpaka::wait(queue)` before the next sample (since the next im2col would overwrite `_xcol` while the GEMM is still reading it). For batch size B that is B separate small GEMMs and roughly 3B sync points.

For every sample the weight `_f` is identical; only the im2col input and the output slice change. This is exactly the case gemmStridedBatched handles. It already exists in sofieBLAS dev (commit fa108fb) and the Gemm operator uses it for the stacked MatMul case (`ROperator_Gemm.hxx`, `useSBatched` path around line 665), so there is a working reference for how to call it.

Proposed change: give each sample its own slice of _xcol so they don't alias, im2col each sample into its slice, then replace the B matmul calls with a single gemmStridedBatched over all samples.
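To make the per-sample slicing concrete, here is a minimal CPU sketch. It is not the ALPAKA kernel from `ROperator_Conv.hxx`: `im2colSample` and `im2colBatch` are hypothetical names, and the sketch assumes stride 1, no padding, and row-major storage. It only shows that sample `b` writes to `xcol + b * colElements`, so no two samples share memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for the im2col kernel (stride 1, no padding).
// Writes a row-major (outH*outW) x (inC*kH*kW) matrix for one sample.
void im2colSample(const float* x, float* col,
                  std::size_t inC, std::size_t H, std::size_t W,
                  std::size_t kH, std::size_t kW) {
    assert(H >= kH && W >= kW);
    const std::size_t outH = H - kH + 1, outW = W - kW + 1;
    const std::size_t gemm_k = inC * kH * kW;
    for (std::size_t oh = 0; oh < outH; ++oh)
        for (std::size_t ow = 0; ow < outW; ++ow) {
            float* row = col + (oh * outW + ow) * gemm_k;
            for (std::size_t c = 0; c < inC; ++c)
                for (std::size_t i = 0; i < kH; ++i)
                    for (std::size_t j = 0; j < kW; ++j)
                        row[(c * kH + i) * kW + j] =
                            x[(c * H + oh + i) * W + (ow + j)];
        }
}

// Each sample gets its own slice of xcol, so the slices never alias
// and no synchronization is needed between samples.
std::vector<float> im2colBatch(const std::vector<float>& x, std::size_t batch,
                               std::size_t inC, std::size_t H, std::size_t W,
                               std::size_t kH, std::size_t kW) {
    assert(x.size() == batch * inC * H * W);
    const std::size_t gemm_m = (H - kH + 1) * (W - kW + 1);
    const std::size_t gemm_k = inC * kH * kW;
    const std::size_t colElements = gemm_m * gemm_k;  // strideA of the batched GEMM
    std::vector<float> xcol(batch * colElements);
    for (std::size_t b = 0; b < batch; ++b)
        im2colSample(x.data() + b * inC * H * W,
                     xcol.data() + b * colElements, inC, H, W, kH, kW);
    return xcol;
}
```

In the real operator the per-sample im2col launches would be enqueued back to back, followed by a single batched GEMM on the same queue.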

The GEMM per sample is `Y = Xcol * W` with `m = gemm_m` (output spatial), `n = gemm_n` (output channels), `k = gemm_k` (`inC * kH * kW`). Strides for the batched call:
- A = _xcol, strideA = colElements (gemm_m * gemm_k), since each sample has its own im2col
- B = _f, strideB = 0, since the weight is shared across samples
- C = _Y, strideC = gemm_n * gemm_m, each sample's output block
- batchCount = B (the batch size, not the B operand above)
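The stride choices above can be checked against a small CPU reference. This is a sketch under stated assumptions: `gemmStridedBatchedRef` is a hypothetical helper, not the sofieBLAS signature, and it uses row-major indexing for readability (the real call presumably follows BLAS conventions, so the index arithmetic may differ while the stride values stay the same). It shows `strideB = 0` reusing one weight matrix and `beta = 1` accumulating onto an output that already holds the broadcast bias.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, not the sofieBLAS API.
// Row-major: A_b is m x k, B_b is k x n, C_b is m x n.
// Computes C_b = alpha * A_b * B_b + beta * C_b for every b in [0, batchCount).
void gemmStridedBatchedRef(std::size_t m, std::size_t n, std::size_t k, float alpha,
                           const float* A, std::size_t strideA,
                           const float* B, std::size_t strideB,
                           float beta, float* C, std::size_t strideC,
                           std::size_t batchCount) {
    assert(A != nullptr && B != nullptr && C != nullptr);
    for (std::size_t b = 0; b < batchCount; ++b) {
        const float* a = A + b * strideA;  // per-sample im2col slice
        const float* w = B + b * strideB;  // strideB == 0: same weights every time
        float* c = C + b * strideC;        // per-sample output block
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (std::size_t p = 0; p < k; ++p)
                    acc += a[i * k + p] * w[p * n + j];
                c[i * n + j] = alpha * acc + beta * c[i * n + j];
            }
    }
}
```

Calling it with `strideA = gemm_m * gemm_k`, `strideB = 0`, `strideC = gemm_n * gemm_m`, and `beta = 1` on an output pre-filled with bias should give the same result as the current per-sample loop.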

Bias still works: Conv already broadcasts bias into the output with a separate kernel, then the GEMM accumulates with beta = 1. With per-sample slices the inter-sample `alpaka::wait` calls also go away.

The tradeoff is that `_xcol` (registered around line 322) has to grow by a factor of B so that all samples' im2col buffers can coexist.
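To get a feel for the memory cost, here is a rough sizing sketch. The helper names and the layer shape are illustrative assumptions, not values taken from the code base: a 3x3 convolution with 64 input channels and a 56x56 output, in float32.

```cpp
#include <cstddef>

// Elements in one sample's im2col matrix (strideA of the batched call).
constexpr std::size_t colElements(std::size_t gemm_m, std::size_t gemm_k) {
    return gemm_m * gemm_k;
}

// Bytes needed for _xcol when every sample in the batch has its own slice.
constexpr std::size_t xcolBytes(std::size_t batch, std::size_t gemm_m,
                                std::size_t gemm_k, std::size_t elemSize) {
    return batch * colElements(gemm_m, gemm_k) * elemSize;
}

// Illustrative shape: gemm_m = 56 * 56 = 3136, gemm_k = 64 * 3 * 3 = 576.
constexpr std::size_t kOneSample = xcolBytes(1, 3136, 576, 4);   // 7225344 bytes, about 6.9 MiB
constexpr std::size_t kBatch32   = xcolBytes(32, 3136, 576, 4);  // 231211008 bytes, 220.5 MiB
```

For shapes like this the buffer is still modest, but for large spatial outputs or large batches it may be worth capping the number of samples per batched call.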