Initial softmax GPU kernel and tests added #41

Open
harz05 wants to merge 3 commits into ML4EP:gpu/alpaka from harz05:feat/softmax-gpu

harz05 commented Jun 18, 2026

One block of threads cooperates on each row, where a row is one slice along the reduction axis. Each thread strides over the row so that the reads are coalesced. Element l of a row is addressed as row_base + l*inner_stride. Exp and log go through alpaka::math.
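The addressing scheme above can be sketched on the host in plain C++. This is only an illustration, not the kernel from this PR: it assumes the tensor is viewed as a contiguous [outer, axis_len, inner] array, and the names `RowLayout`, `row_base`, and `thread_offsets` are hypothetical. Only the expression row_base + l*inner_stride comes from the description.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout (an assumption, not taken from the PR): the tensor is
// viewed as [outer, axis_len, inner], stored contiguously, and softmax is
// taken over the middle dimension.
struct RowLayout {
    std::size_t axis_len;      // number of elements in one row
    std::size_t inner_stride;  // distance between consecutive row elements
};

// Offset of the first element of row `row`; rows enumerate (outer, inner) pairs.
inline std::size_t row_base(const RowLayout& lay, std::size_t row) {
    const std::size_t outer_idx = row / lay.inner_stride;
    const std::size_t inner_idx = row % lay.inner_stride;
    return outer_idx * lay.axis_len * lay.inner_stride + inner_idx;
}

// Flat offsets visited by thread `tid` in a block of `block_size` threads:
// l = tid, tid + block_size, tid + 2 * block_size, ...
inline std::vector<std::size_t> thread_offsets(const RowLayout& lay, std::size_t row,
                                               std::size_t tid, std::size_t block_size) {
    std::vector<std::size_t> offsets;
    const std::size_t base = row_base(lay, row);
    for (std::size_t l = tid; l < lay.axis_len; l += block_size) {
        offsets.push_back(base + l * lay.inner_stride);
    }
    return offsets;
}
```

Under this assumed layout, when inner_stride is 1 (softmax over the last axis), threads tid and tid + 1 touch adjacent addresses on every iteration, which is the access pattern that coalesces.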

The kernel uses online softmax (https://arxiv.org/pdf/1805.02867). Each thread keeps a running (max, sum) pair over its slice in a single fused pass. The block then combines the pairs with one shared-memory tree reduction using the online merge operator, and a final pass normalizes the row.
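A minimal host-side sketch of the online merge operator and the three phases may help reviewers follow the reduction. It is sequential C++ that simulates one block, not the alpaka kernel itself. The names `MaxSum`, `merge`, and `softmax_row` are hypothetical, `block_size` is assumed to be a power of two, and `std::exp` stands in for the alpaka::math call.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Running pair: m is the maximum seen so far, s is the sum of exp(x - m).
struct MaxSum {
    float m = -std::numeric_limits<float>::infinity();
    float s = 0.0f;
};

// Online merge: rescale both partial sums to the common maximum, then add.
inline MaxSum merge(MaxSum a, MaxSum b) {
    const float m = std::fmax(a.m, b.m);
    if (std::isinf(m) && m < 0.0f) return MaxSum{};  // both sides empty, avoid inf - inf
    return MaxSum{m, a.s * std::exp(a.m - m) + b.s * std::exp(b.m - m)};
}

// Simulates one block working on one contiguous row.
inline std::vector<float> softmax_row(const std::vector<float>& x, std::size_t block_size) {
    // Phase 1: each "thread" folds its strided slice into a (max, sum) pair.
    std::vector<MaxSum> shared(block_size);
    for (std::size_t tid = 0; tid < block_size; ++tid) {
        MaxSum acc;
        for (std::size_t l = tid; l < x.size(); l += block_size) {
            acc = merge(acc, MaxSum{x[l], 1.0f});
        }
        shared[tid] = acc;
    }
    // Phase 2: a single tree reduction over the pairs (stands in for shared memory).
    for (std::size_t offset = block_size / 2; offset > 0; offset /= 2) {
        for (std::size_t tid = 0; tid < offset; ++tid) {
            shared[tid] = merge(shared[tid], shared[tid + offset]);
        }
    }
    // Phase 3: normalize with the row-wide maximum and sum.
    const MaxSum total = shared[0];
    std::vector<float> y(x.size());
    for (std::size_t l = 0; l < x.size(); ++l) {
        y[l] = std::exp(x[l] - total.m) / total.s;
    }
    return y;
}
```

Because the merge rescales both partial sums to the shared maximum, the maximum and the sum can travel through one reduction instead of two, and large inputs such as {1000, 1000} do not overflow.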

Superseded:

First version: one thread per row, three passes, no shared memory.
Second version: one block per row with two shared-memory tree reductions (max, then sum).

harz05 marked this pull request as ready for review June 18, 2026 18:25

harz05 commented Jun 18, 2026

Author

Working on a warp-level approach to see whether it can be used for this kernel.
