Update examples#144

Merged
maleadt merged 4 commits into main from tb/examples
Mar 26, 2026
Conversation

@maleadt
Member

@maleadt maleadt commented Mar 25, 2026

No description provided.

@maleadt maleadt merged commit 850a808 into main Mar 26, 2026
9 checks passed
@maleadt maleadt deleted the tb/examples branch March 26, 2026 05:40
@AntonOresten
Contributor

Any idea why matmul FLOP/s dropped and layer norm got so much better? The explanation behind the relative performance seems to have flipped when comparing the memory- and compute-bound kernels @maleadt

@maleadt
Member Author

maleadt commented Mar 26, 2026

I think I did the previous measurements with nsys, so those only measured kernel times, as opposed to measuring the whole call here (including launch overhead, etc.).

Layernorm got much better because the example wasn't storing the fastest-iterating dimension as the contiguous one, which became obvious when the generated IR/code still contained permute calls after #142 (they shouldn't be necessary anymore).
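The layout issue above can be illustrated outside the GPU stack. This is a hypothetical NumPy sketch (not taken from the example code): a layer-norm-style reduction iterates over the feature axis per sample, so that axis should be the contiguous one; a permuted view leaves the data in place but turns those iterations into strided reads.

```python
import numpy as np

# Hypothetical shapes: rows = samples, cols = features to reduce over.
# In C (row-major) order, the last axis is the contiguous one, so
# iterating over features walks memory sequentially.
a = np.zeros((1024, 256), dtype=np.float32)
print(a.flags["C_CONTIGUOUS"])   # True
print(a.strides)                 # (1024, 4): features are 4 bytes apart

# A transposed view doesn't move any data, but now the last axis
# strides across rows -- the fastest-iterating dimension is no
# longer contiguous, which is the situation the example was in.
p = a.T
print(p.flags["C_CONTIGUOUS"])   # False
print(p.strides)                 # (4, 1024): features are 1024 bytes apart
```

On the GPU the same mismatch shows up as uncoalesced loads (and, here, as leftover permute calls in the generated code).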

@AntonOresten
Contributor

AntonOresten commented Mar 26, 2026

> so only measuring kernel times as opposed to measuring the whole call here

So was the previous comparison not apples-to-apples?

Also, the new paragraph in the README states:

> Compute-intensive kernels (matmul, batch matmul, FFT) are slower due to conservative token threading in the generated Tile IR, which serializes loads that could otherwise be pipelined.

Is the logic here that it's not bandwidth limited, but spends a smaller portion of time on compute because it's waiting on serialized loads?

Does the closing of #1 have the anticipated effect?

@maleadt
Member Author

maleadt commented Mar 26, 2026

> so only measuring kernel times as opposed to measuring the whole call here
>
> So was the previous comparison not apples-to-apples?

It was apples-to-apples, but because the times it was using were lower, the reported throughput was slightly higher. That said, these benchmarks are not rigorous, just indicative of the expected performance.
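The effect of the two timing methodologies on reported throughput can be made concrete with illustrative numbers (not taken from the actual benchmarks): the kernel does the same FLOPs either way, but dividing by a whole-call time that includes launch overhead yields a lower FLOP/s figure.

```python
# Illustrative numbers only: a 4096^3 GEMM does 2*N^3 FLOPs.
flops = 2 * 4096**3

kernel_time = 1.0e-3    # seconds, kernel-only (what nsys reports)
overhead    = 0.05e-3   # hypothetical launch/call overhead

throughput_kernel = flops / kernel_time               # kernel-only timing
throughput_whole  = flops / (kernel_time + overhead)  # whole-call timing

# Same kernel, same work: the whole-call number is strictly lower.
print(throughput_whole < throughput_kernel)   # True
```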

> Compute-intensive kernels (matmul, batch matmul, FFT) are slower due to conservative token threading in the generated Tile IR, which serializes loads that could otherwise be pipelined.
>
> Is the logic here that it's not bandwidth limited, but spends a smaller portion of time on compute because it's waiting on serialized loads?

Yeah. However, fixing #1 didn't fix the performance as I expected (as indicated in that paragraph). I've found the actual issue though; PRs incoming.
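Why serialized loads hurt compute-bound kernels can be sketched with a toy cost model (the model and numbers are illustrative, not Tile IR's actual scheduling): if every load must complete before the next one issues, load latency adds up per tile; if loads are pipelined, only the first latency is exposed and the rest hides under compute.

```python
# Toy cost model: n_tiles loads, each with load_latency cycles,
# plus compute_cycles of work per tile.
n_tiles, load_latency, compute_cycles = 8, 100, 40

# Serialized (conservative token threading): each load waits for
# the previous tile's load+compute to finish.
serialized = n_tiles * (load_latency + compute_cycles)

# Pipelined: only the first load's latency is exposed; subsequent
# loads overlap with the previous tile's compute.
pipelined = load_latency + n_tiles * compute_cycles

print(serialized, pipelined)   # 1120 420
```

The kernel isn't bandwidth-limited in either case; the serialized schedule just leaves the compute units idle while each load drains.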

