Fuse conv and ReLU into the GEMM epilogue (alpaka GPU)

Right now on the alpaka GPU path, a Conv followed by a ReLU runs as two passes: the Conv writes its full output to device memory, then a standalone element-wise ReLU kernel reads that whole tensor back and writes it again. That's an extra full-tensor memory round-trip per conv layer, and in a CNN (Conv→ReLU→Conv→ReLU…) it's one wasted pass per layer.

A similar fusion already exists for Gemm -> ReLU: `FuseGemmActivations_GPU` detects the pattern, sets an activation on the Gemm, and emits `blas.gemmrelu` so the ReLU runs in the GEMM epilogue instead of a separate kernel. This issue proposes the same for Conv.

### Proposed improvement:
- Add a `FuseConvActivations_GPU` pass (mirroring `FuseGemmActivations_GPU`): detect a Conv immediately followed by a ReLU, set an activation flag on the `ROperator_Conv`, and mark the ReLU node to be skipped.
- In `Generate_GPU_ALPAKA`, have the final GEMM emit `blas.matmulrelu` instead of `blas.matmul` when the flag is set. Conv already broadcasts its bias into the output C and runs the GEMM with beta=1, so relu(xcol·W + bias) comes out correct with the activation fused in, and the standalone ReLU kernel is dropped.
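
The detection step in the first bullet could look roughly like the sketch below. It runs on a toy graph, not on the real operator classes: `Node`, `OpKind`, `Activation`, `CountConsumers` and `FuseConvRelu` are hypothetical names made up for illustration. Rewiring the Conv to write the ReLU's output tensor and requiring the ReLU to be the only reader of the Conv output are both assumptions, not something the existing Gemm pass is confirmed to do.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for model operators (not the real ROperator API).
enum class OpKind { Conv, Relu, Other };
enum class Activation { None, Relu };

struct Node {
    OpKind kind;
    std::string input;   // name of the single data input tensor
    std::string output;  // name of the output tensor
    Activation activation = Activation::None;  // fused epilogue, meaningful for Conv only
    bool skip = false;   // true if code generation should omit this node
};

// Count how many non-skipped nodes read the given tensor.
std::size_t CountConsumers(const std::vector<Node>& graph, const std::string& tensor) {
    std::size_t n = 0;
    for (const Node& node : graph) {
        if (!node.skip && node.input == tensor) ++n;
    }
    return n;
}

// Fuse each Conv that is immediately followed by a ReLU reading its output.
// Returns the number of fused pairs.
std::size_t FuseConvRelu(std::vector<Node>& graph) {
    std::size_t fused = 0;
    for (std::size_t i = 0; i + 1 < graph.size(); ++i) {
        Node& conv = graph[i];
        Node& relu = graph[i + 1];
        if (conv.kind != OpKind::Conv || relu.kind != OpKind::Relu) continue;
        if (relu.input != conv.output) continue;
        // Assumption: only fuse when nothing else needs the pre-activation tensor.
        // A real pass would also have to check that it is not a model output.
        if (CountConsumers(graph, conv.output) != 1) continue;
        assert(conv.activation == Activation::None);
        conv.activation = Activation::Relu;
        conv.output = relu.output;  // the Conv now writes the ReLU's output tensor
        relu.skip = true;
        ++fused;
    }
    return fused;
}
```

On a Conv→ReLU→Conv chain, this marks the first Conv with `Activation::Relu`, skips the ReLU, and leaves the second Conv untouched because no ReLU follows it.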

Note: the current `gemmrelu` doesn't seem suitable for reuse here, since it broadcasts the bias along the cuBLASLt M axis. Conv, on the other hand, applies its bias in a separate kernel, so we would need a ReLU epilogue without bias.
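
To make the intended semantics concrete, here is a minimal CPU reference of a bias-free ReLU epilogue, assuming row-major storage. `BroadcastBias` and `MatmulReluReference` are hypothetical helpers, not the cuBLASLt call or the existing `blas` wrappers, and placing the bias along the columns of C is an illustrative choice only; which GEMM axis carries the output channels in the real code depends on the layout used there.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pre-fill C (M x N, row-major) with one bias value per column.
// This stands in for the separate bias-broadcast step that Conv already performs.
void BroadcastBias(std::size_t M, std::size_t N,
                   const std::vector<float>& bias, std::vector<float>& C) {
    assert(bias.size() == N && C.size() == M * N);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < N; ++j)
            C[i * N + j] = bias[j];
}

// Reference for C = relu(alpha * A*B + beta * C) with no bias term in the epilogue.
// A is M x K, B is K x N, C is M x N, all row-major.
void MatmulReluReference(std::size_t M, std::size_t N, std::size_t K, float alpha,
                         const std::vector<float>& A, const std::vector<float>& B,
                         float beta, std::vector<float>& C) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            // With beta = 1, the bias already stored in C is added before the clamp.
            const float v = alpha * acc + beta * C[i * N + j];
            C[i * N + j] = std::max(v, 0.0f);
        }
    }
}
```

Calling `BroadcastBias` followed by `MatmulReluReference` with `alpha = 1` and `beta = 1` yields relu(A·B + bias), which is the result the fused Conv path would need to match.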
