Conversation
Any idea why matmul FLOP/S dropped and layer norm got so much better? The explanations behind the relative performance seem to have flipped, comparing memory- and compute-bound kernels. @maleadt
I think I did the previous measurements with an older version. LayerNorm got much better because the example wasn't storing the fastest-iterating dimension as the contiguous one, as became obvious from the IR / code still containing …
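A minimal sketch of the layout effect being described, not code from the repo: a layer-norm-style reduction is memory-bound, so whether the reduced (fastest-iterating) dimension is stored contiguously dominates throughput. The array sizes, helper name `norm_rows`, and use of NumPy here are illustrative assumptions; the same principle applies to the GPU kernels under discussion.

```python
import time
import numpy as np

n, d = 4096, 1024
rows = np.random.rand(n, d)     # row-major: the reduced axis (axis=1) is contiguous
cols = np.asfortranarray(rows)  # same values, column-major: axis=1 reads are strided

def norm_rows(x):
    # Hypothetical layer-norm-style normalization over the last axis.
    mu = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, keepdims=True)
    return (x - mu) / sd

for name, a in [("contiguous", rows), ("strided", cols)]:
    t = time.perf_counter()
    for _ in range(20):
        norm_rows(a)
    print(f"{name}: {time.perf_counter() - t:.4f}s")
```

Both variants compute identical results; only the memory-access pattern differs, which is exactly the kind of discrepancy that shows up in the generated IR rather than in the source.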
So was the previous comparison not apples-to-apples? Also, the new paragraph in the README states:
Is the logic here that it's not bandwidth-limited, but spends a smaller portion of time on compute because it's waiting on serialized loads? Does closing #1 have the anticipated effect?
It was apples-to-apples, but because the times it was using were lower, the reported throughput came out slightly higher. That said, these benchmarks aren't rigorous, just indicative of the expected performance.
Yeah. However, fixing #1 didn't improve performance as I expected (as indicated in that paragraph). I've found the actual issue though; PRs incoming.