On the alpaka GPU path, a Conv followed by a ReLU currently runs as two passes: the Conv writes its full output to device memory, then a standalone element-wise ReLU kernel reads that whole tensor back and writes it again. That is an extra full-tensor memory round-trip per conv layer, so a CNN (Conv→ReLU→Conv→ReLU…) wastes one pass per layer.
A similar fusion already exists for Gemm → ReLU: FuseGemmActivations_GPU detects the pattern, sets an activation on the Gemm, and emits blas.gemmrelu so the ReLU runs in the GEMM epilogue instead of a separate kernel. This issue proposes the same for Conv.
Proposed improvement:
- Add a FuseConvActivations_GPU pass (mirroring FuseGemmActivations_GPU): detect a Conv immediately followed by a ReLU, set an activation flag on the ROperator_Conv, and mark the ReLU node to be skipped.
- In Generate_GPU_ALPAKA, emit blas.matmulrelu instead of blas.matmul for the final GEMM when the flag is set. Conv already broadcasts its bias into the output C and runs the GEMM with beta=1, so relu(xcol·W + bias) comes out correct with the activation fused in, and the standalone ReLU kernel is dropped.
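
The detection step in the first bullet could look roughly like the sketch below. It runs on a toy graph, not on SOFIE's real model classes: `Node`, `OpKind`, `Activation`, `CountConsumers` and `FuseConvRelu` are hypothetical names made up for illustration. Two details are assumptions beyond what is described above: the pass only fuses when the ReLU is the sole reader of the Conv output, and it retargets the Conv to write the ReLU's output tensor. The sketch also does not check whether the Conv output is itself a graph output.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for operator nodes; NOT the real ROperator classes.
enum class OpKind { Conv, Relu, Other };
enum class Activation { None, Relu };

struct Node {
    OpKind kind;
    std::string input;                    // name of the data input tensor
    std::string output;                   // name of the output tensor
    Activation fused = Activation::None;  // set on a Conv that absorbs a ReLU
    bool skip = false;                    // set on a ReLU that was absorbed
};

// Count how many non-skipped nodes read the given tensor.
std::size_t CountConsumers(const std::vector<Node>& graph, const std::string& tensor) {
    std::size_t n = 0;
    for (const Node& node : graph) {
        if (!node.skip && node.input == tensor) ++n;
    }
    return n;
}

// Fuse each Conv that is immediately followed by a ReLU reading its output.
// Returns the number of fused pairs.
std::size_t FuseConvRelu(std::vector<Node>& graph) {
    std::size_t fused = 0;
    for (std::size_t i = 0; i + 1 < graph.size(); ++i) {
        Node& conv = graph[i];
        Node& relu = graph[i + 1];
        if (conv.kind != OpKind::Conv || relu.kind != OpKind::Relu) continue;
        if (relu.input != conv.output) continue;
        // Assumption: only fuse when nothing else reads the pre-activation tensor.
        if (CountConsumers(graph, conv.output) != 1) continue;
        conv.fused = Activation::Relu;  // the activation flag on the Conv
        conv.output = relu.output;      // Conv now writes the ReLU's output tensor
        relu.skip = true;               // code generation would skip this node
        ++fused;
    }
    return fused;
}
```

On a Conv→ReLU→Conv→ReLU chain this marks both ReLU nodes as skipped. If a second operator also reads the Conv output, the pair is left alone, because the pre-activation values are still needed.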
The current gemmrelu does not seem suitable here, since it broadcasts the bias along the cuBLASLt M axis. Conv, on the other hand, applies its bias in a separate kernel, so we would need a bias-free ReLU epilogue.
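
To make the intended semantics concrete, here is a small CPU reference sketch of a bias-free ReLU epilogue. It is not the cuBLASLt or SOFIE BLAS API: `GemmRef`, `BroadcastBias`, `ConvTwoPass` and `ConvFused` are hypothetical names, and the row-major layout (xcol as P×K, W as K×OC, bias per output channel) is assumed for illustration only. It shows that when C is pre-filled with the broadcast bias and the GEMM runs with beta=1, applying ReLU at write time gives the same result as the current GEMM followed by a separate ReLU sweep.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Row-major reference GEMM: C = A(MxK) * B(KxN) + beta * C.
// When reluEpilogue is true, ReLU is applied as each element is written,
// and no bias vector is involved in the epilogue.
void GemmRef(std::size_t M, std::size_t N, std::size_t K,
             const std::vector<float>& A, const std::vector<float>& B,
             float beta, std::vector<float>& C, bool reluEpilogue) {
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
            const float v = acc + beta * C[m * N + n];
            C[m * N + n] = reluEpilogue ? std::max(v, 0.0f) : v;
        }
    }
}

// Stand-in for the separate bias kernel: fill every row of C with the bias.
std::vector<float> BroadcastBias(std::size_t rows, const std::vector<float>& bias) {
    std::vector<float> C(rows * bias.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < bias.size(); ++c) C[r * bias.size() + c] = bias[c];
    return C;
}

// Current behaviour: GEMM with beta=1, then a separate full-tensor ReLU pass.
std::vector<float> ConvTwoPass(std::size_t P, std::size_t K, const std::vector<float>& xcol,
                               const std::vector<float>& W, const std::vector<float>& bias) {
    std::vector<float> C = BroadcastBias(P, bias);
    GemmRef(P, bias.size(), K, xcol, W, 1.0f, C, false);
    for (float& v : C) v = std::max(v, 0.0f);  // the extra read/write sweep
    return C;
}

// Proposed behaviour: same GEMM, ReLU folded into the write of each element.
std::vector<float> ConvFused(std::size_t P, std::size_t K, const std::vector<float>& xcol,
                             const std::vector<float>& W, const std::vector<float>& bias) {
    std::vector<float> C = BroadcastBias(P, bias);
    GemmRef(P, bias.size(), K, xcol, W, 1.0f, C, true);
    return C;
}
```

Because the bias already sits in C before the GEMM starts, the epilogue only needs to clamp at zero; it does not have to know along which axis the bias was broadcast.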