Update examples#144

Merged
maleadt merged 4 commits into main from tb/examples
Mar 26, 2026
Conversation

@maleadt
Member

@maleadt maleadt commented Mar 25, 2026

No description provided.

@maleadt maleadt merged commit 850a808 into main Mar 26, 2026
9 checks passed
@maleadt maleadt deleted the tb/examples branch March 26, 2026 05:40
@AntonOresten
Contributor

Any idea why matmul FLOP/s dropped and layer norm got so much better? The explanation behind the relative performance seems to have flipped when comparing the memory- and compute-bound kernels @maleadt

@maleadt
Member Author

maleadt commented Mar 26, 2026

I think I did the previous measurements with nsys, so those only measured kernel times, as opposed to measuring the whole call here (including launch overhead, etc.).

Layernorm got much better because the example wasn't storing the fastest-iterating dimension as the contiguous one, which became obvious when the generated IR/code still contained permute calls after #142 (they shouldn't be necessary anymore).
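The layout issue above can be illustrated outside the GPU stack. This is a hypothetical NumPy sketch (not taken from the example code): a layer-norm-style reduction iterates over the feature axis per sample, so that axis should be the contiguous one; a permuted view leaves the data in place but turns those iterations into strided reads.

```python
import numpy as np

# Hypothetical shapes: rows = samples, cols = features to reduce over.
# In C (row-major) order, the last axis is the contiguous one, so
# iterating over features walks memory sequentially.
a = np.zeros((1024, 256), dtype=np.float32)
print(a.flags["C_CONTIGUOUS"])   # True
print(a.strides)                 # (1024, 4): features are 4 bytes apart

# A transposed view doesn't move any data, but now the last axis
# strides across rows -- the fastest-iterating dimension is no
# longer contiguous, which is the situation the example was in.
p = a.T
print(p.flags["C_CONTIGUOUS"])   # False
print(p.strides)                 # (4, 1024): features are 1024 bytes apart
```

On the GPU the same mismatch shows up as uncoalesced loads (and, here, as leftover permute calls in the generated code).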

@AntonOresten
Contributor

AntonOresten commented Mar 26, 2026

> so only measuring kernel times as opposed to measuring the whole call here

So was the previous comparison not apples-to-apples?

Also, the new paragraph in the README states:

> Compute-intensive kernels (matmul, batch matmul, FFT) are slower due to conservative token threading in the generated Tile IR, which serializes loads that could otherwise be pipelined.

Is the logic here that it's not bandwidth limited, but spends a smaller portion of time on compute because it's waiting on serialized loads?

Does the closing of #1 have the anticipated effect?

@maleadt
Member Author

maleadt commented Mar 26, 2026

> so only measuring kernel times as opposed to measuring the whole call here
>
> So was the previous comparison not apples-to-apples?

It was apples-to-apples, but because the times it was using were lower, the reported throughput was slightly higher. That said, these benchmarks are not rigorous, just indicative of the expected performance.
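The effect of the two timing methodologies on reported throughput can be made concrete with illustrative numbers (not taken from the actual benchmarks): the kernel does the same FLOPs either way, but dividing by a whole-call time that includes launch overhead yields a lower FLOP/s figure.

```python
# Illustrative numbers only: a 4096^3 GEMM does 2*N^3 FLOPs.
flops = 2 * 4096**3

kernel_time = 1.0e-3    # seconds, kernel-only (what nsys reports)
overhead    = 0.05e-3   # hypothetical launch/call overhead

throughput_kernel = flops / kernel_time               # kernel-only timing
throughput_whole  = flops / (kernel_time + overhead)  # whole-call timing

# Same kernel, same work: the whole-call number is strictly lower.
print(throughput_whole < throughput_kernel)   # True
```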

> Compute-intensive kernels (matmul, batch matmul, FFT) are slower due to conservative token threading in the generated Tile IR, which serializes loads that could otherwise be pipelined.
>
> Is the logic here that it's not bandwidth limited, but spends a smaller portion of time on compute because it's waiting on serialized loads?

Yeah. However, fixing #1 didn't fix the performance as I expected (as indicated in that paragraph). I've found the actual issue though; PRs incoming.
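Why serialized loads hurt compute-bound kernels can be sketched with a toy cost model (the model and numbers are illustrative, not Tile IR's actual scheduling): if every load must complete before the next one issues, load latency adds up per tile; if loads are pipelined, only the first latency is exposed and the rest hides under compute.

```python
# Toy cost model: n_tiles loads, each with load_latency cycles,
# plus compute_cycles of work per tile.
n_tiles, load_latency, compute_cycles = 8, 100, 40

# Serialized (conservative token threading): each load waits for
# the previous tile's load+compute to finish.
serialized = n_tiles * (load_latency + compute_cycles)

# Pipelined: only the first load's latency is exposed; subsequent
# loads overlap with the previous tile's compute.
pipelined = load_latency + n_tiles * compute_cycles

print(serialized, pipelined)   # 1120 420
```

The kernel isn't bandwidth-limited in either case; the serialized schedule just leaves the compute units idle while each load drains.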

