inital softmax gpu kernel and tests added by harz05 · Pull Request #41 · ML4EP/SOFIE

harz05 · 2026-06-18T13:08:56Z

One block of threads handles each row cooperatively, where a row is one slice along the reduction axis. Each thread strides over the row so the reads are coalesced. Element l of a row is addressed as row_base + l*inner_stride. Exp and log go through alpaka::math.
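The addressing scheme above can be sketched on the host in plain C++. This is a minimal illustration, not the PR's code: it assumes a contiguous tensor viewed as [outer, axis_len, inner] and reduced along the middle axis, and the names RowLayout, row_base, and thread_offsets are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical layout (an assumption): a contiguous tensor viewed as
// [outer, axis_len, inner], reduced along the middle axis.
struct RowLayout {
    std::size_t axis_len;      // number of elements in one row
    std::size_t inner_stride;  // distance between consecutive row elements
};

// Base offset of row `row`, with rows enumerated as (outer, inner) pairs.
inline std::size_t row_base(std::size_t row, const RowLayout& lay) {
    const std::size_t outer = row / lay.inner_stride;
    const std::size_t inner = row % lay.inner_stride;
    return outer * lay.axis_len * lay.inner_stride + inner;
}

// Offsets visited by thread `tid` in a block of `block_size` threads:
// l = tid, tid + block_size, tid + 2*block_size, ...
inline std::vector<std::size_t> thread_offsets(std::size_t row, std::size_t tid,
                                               std::size_t block_size,
                                               const RowLayout& lay) {
    std::vector<std::size_t> out;
    const std::size_t base = row_base(row, lay);
    for (std::size_t l = tid; l < lay.axis_len; l += block_size)
        out.push_back(base + l * lay.inner_stride);
    return out;
}
```

Under this assumed layout, when inner_stride is 1 consecutive threads touch consecutive addresses at each step of the strided loop.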

The kernel uses online softmax (https://arxiv.org/pdf/1805.02867). Each thread keeps a running (max, sum) pair over its slice in one fused pass. The block then combines the per-thread pairs with a single shared-memory tree reduction using the online merge operator, and a final pass normalizes the row.
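A host-side sketch of this scheme, simulating the threads of one block with ordinary loops, might look like the following. It is not the alpaka kernel from this PR: MaxSum, merge, and softmax_row are hypothetical names, block_size is assumed to be a power of two, and on a GPU the loops over tid would run in parallel with a barrier between reduction steps.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

constexpr float kNegInf = -std::numeric_limits<float>::infinity();

// Running state: m is the max seen so far, s is the sum of exp(x - m).
struct MaxSum {
    float m;
    float s;
};

// Online merge operator: rescale both partial sums to the common maximum.
inline MaxSum merge(MaxSum a, MaxSum b) {
    const float m = std::max(a.m, b.m);
    if (m == kNegInf) return MaxSum{kNegInf, 0.0f};  // both sides empty
    return MaxSum{m, a.s * std::exp(a.m - m) + b.s * std::exp(b.m - m)};
}

// Softmax of one row, simulating a block of `block_size` threads
// (assumed to be a power of two).
std::vector<float> softmax_row(const std::vector<float>& x, std::size_t block_size) {
    // Fused pass: each simulated thread folds its strided slice into one pair.
    std::vector<MaxSum> shared(block_size, MaxSum{kNegInf, 0.0f});
    for (std::size_t tid = 0; tid < block_size; ++tid)
        for (std::size_t l = tid; l < x.size(); l += block_size)
            shared[tid] = merge(shared[tid], MaxSum{x[l], 1.0f});

    // Single tree reduction over the per-thread pairs.
    for (std::size_t half = block_size / 2; half > 0; half /= 2)
        for (std::size_t tid = 0; tid < half; ++tid)
            shared[tid] = merge(shared[tid], shared[tid + half]);

    // Normalize pass using the block-wide (max, sum).
    std::vector<float> y(x.size());
    for (std::size_t l = 0; l < x.size(); ++l)
        y[l] = std::exp(x[l] - shared[0].m) / shared[0].s;
    return y;
}
```

Because merge is associative, the per-thread pairs can be combined in any tree order, which is what lets a single reduction replace separate max and sum reductions.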

Superseded:

First version: one thread per row, three passes, no shared memory.
Second version: one block per row with two shared-memory tree reductions (max, then sum).
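For comparison, the three-pass structure of the first version can be sketched as below. This is the textbook max, sum, normalize formulation as a single thread would run it over one row, written as an assumption about what "three passes" means here rather than the code from the superseded commit. The name softmax_three_pass is hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Three sequential passes over one row, as a single thread would perform them.
std::vector<float> softmax_three_pass(const std::vector<float>& x) {
    std::vector<float> y(x.size());
    if (x.empty()) return y;

    // Pass 1: row maximum, for numerical stability.
    float m = x[0];
    for (float v : x) m = std::max(m, v);

    // Pass 2: sum of shifted exponentials.
    float s = 0.0f;
    for (float v : x) s += std::exp(v - m);

    // Pass 3: normalize.
    for (std::size_t l = 0; l < x.size(); ++l)
        y[l] = std::exp(x[l] - m) / s;
    return y;
}
```

A reference like this is a convenient baseline for checking that a fused or block-parallel variant produces the same values up to floating-point tolerance.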

harz05 · 2026-06-18T18:27:16Z

Working on a warp-level approach to see whether it can be applied here.

harz05 added 2 commits June 18, 2026 18:30

inital softmax gpu kernel and tests added

f50dd4a

block per row reduction

2331e16

harz05 marked this pull request as ready for review June 18, 2026 18:25

online softmax

2e53990

harz05 wants to merge 3 commits into ML4EP:gpu/alpaka from harz05:feat/softmax-gpu
