In Generate_GPU_ALPAKA of ROperator_Conv.hxx (the batch loop, around line 867), the non-grouped path runs one blas.matmul per batch sample. Each iteration does im2col into the shared _xcol buffer, broadcasts bias, calls matmul, then calls alpaka::wait(queue) before the next sample, because the next im2col would otherwise overwrite _xcol while the GEMM is still reading it. For batch size B that is B separate small GEMMs and roughly 3B sync points.
For every sample the weight _f is identical; only the im2col input and the output slice change. This is exactly the case gemmStridedBatched handles. It already exists in sofieBLAS dev (commit fa108fb) and the Gemm operator uses it for the stacked MatMul case (ROperator_Gemm.hxx, useSBatched path around line 665), so there is a working reference for how to call it.
Proposed change: give each sample its own slice of _xcol so they don't alias, im2col each sample into its slice, then replace the B matmul calls with a single gemmStridedBatched over all samples.
The GEMM per sample is Y = Xcol * W with m = gemm_m (output spatial), n = gemm_n (output channels), k = gemm_k (inC * kH * kW). Strides for the batched call:
- A = _xcol, strideA = colElements (gemm_m * gemm_k), since each sample has its own im2col
- B = _f, strideB = 0, since the weight is shared across samples
- C = _Y, strideC = gemm_n * gemm_m, since each sample writes its own output block
- batchCount = B
Bias still works: Conv already broadcasts bias into the output with a separate kernel, then the GEMM accumulates with beta = 1. With per-sample slices the inter-sample alpaka::wait calls also go away.
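The stride layout and the bias handling described above can be checked against a small CPU reference. This is only a sketch: gemmStridedBatchedRef and broadcastBias are hypothetical helper names, not the sofieBLAS signatures, and the column-major layout is an assumption made for the example (the transpose flags and leading dimensions actually used in ROperator_Conv.hxx may differ).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference, NOT the sofieBLAS API.
// Column-major (assumed): for each batch b,
//   C_b = alpha * A_b * B_b + beta * C_b
// with A_b = A + b*strideA (m x k), B_b = B + b*strideB (k x n),
// C_b = C + b*strideC (m x n). A stride of 0 reuses the same matrix.
void gemmStridedBatchedRef(std::size_t m, std::size_t n, std::size_t k, float alpha,
                           const float *A, std::size_t strideA,
                           const float *B, std::size_t strideB,
                           float beta, float *C, std::size_t strideC,
                           std::size_t batchCount)
{
   // Output blocks must not overlap between batches.
   assert(batchCount <= 1 || strideC >= m * n);
   for (std::size_t b = 0; b < batchCount; ++b) {
      const float *Ab = A + b * strideA;
      const float *Bb = B + b * strideB;
      float *Cb = C + b * strideC;
      for (std::size_t j = 0; j < n; ++j) {
         for (std::size_t i = 0; i < m; ++i) {
            float acc = 0.f;
            for (std::size_t p = 0; p < k; ++p)
               acc += Ab[i + p * m] * Bb[p + j * k];
            Cb[i + j * m] = alpha * acc + beta * Cb[i + j * m];
         }
      }
   }
}

// Hypothetical stand-in for the bias broadcast kernel: fills every
// sample's output block with the per-output-channel bias before the GEMM.
void broadcastBias(float *Y, const float *bias, std::size_t m, std::size_t n,
                   std::size_t batchCount)
{
   for (std::size_t b = 0; b < batchCount; ++b)
      for (std::size_t j = 0; j < n; ++j)
         for (std::size_t i = 0; i < m; ++i)
            Y[b * m * n + j * m + i] = bias[j];
}
```

With these helpers the whole batch is one call: broadcastBias(Y, bias, gemm_m, gemm_n, B) followed by gemmStridedBatchedRef(gemm_m, gemm_n, gemm_k, 1.f, xcol, gemm_m * gemm_k, f, 0, 1.f, Y, gemm_m * gemm_n, B). Passing strideB = 0 makes every sample read the same weight, and beta = 1 keeps the pre-filled bias in the result.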
The tradeoff is memory: _xcol (registered around line 322) has to grow by a factor of B so that all samples' im2col data can coexist.
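To get a feel for the size of that growth, here is a rough sizing sketch. The helper names and the layer shape (64 input channels, 3x3 kernel, 56x56 output, batch 32, float elements) are made up for illustration and are not taken from any particular model.

```cpp
#include <cstddef>

// Elements in one sample's im2col buffer: gemm_m * gemm_k.
constexpr std::size_t colElements(std::size_t gemm_m, std::size_t gemm_k)
{
   return gemm_m * gemm_k;
}

// Bytes needed for _xcol when each of the `batch` samples has its own slice.
constexpr std::size_t xcolBytes(std::size_t batch, std::size_t gemm_m, std::size_t gemm_k,
                                std::size_t elemSize = sizeof(float))
{
   return batch * colElements(gemm_m, gemm_k) * elemSize;
}

// Hypothetical layer: gemm_m = 56 * 56 = 3136, gemm_k = 64 * 3 * 3 = 576.
static_assert(colElements(3136, 576) == 1806336, "elements per sample");
static_assert(xcolBytes(1, 3136, 576, 4) == 7225344, "about 6.9 MiB for one sample");
static_assert(xcolBytes(32, 3136, 576, 4) == 231211008, "about 220.5 MiB for batch 32");
```

For a layer of this shape the buffer goes from about 6.9 MiB to about 220.5 MiB, so it may be worth capping the number of samples per batched call if memory turns out to be tight.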