Add INT8 support for LDS transpose load #2214

Open
stefankoncarevic wants to merge 2 commits into lds-transpose-load-fp8 from lds-transpose-load-int8
@stefankoncarevic

⚠️ Do not merge until #2210 is merged - this PR depends on LDS transpose load fp8 support

Motivation

Extends the LDS transpose load optimization to INT8 data types for GEMM and Attention kernels on gfx950. This enables hardware-accelerated transposed loads (ds_read_tr8_b64) for all INT8 MFMAs (16x16x32, 16x16x64, 32x32x16, 32x32x32), improving performance for INT8 quantized inference.

Technical Details

  • LdsTransposeLoad.cpp: Added INT8 type support, offset formulas for (16,64) and (32,32) geometries, and double-rate K-coverage logic
  • AccelEmitter.cpp: Added K-dimension transformation for INT8 MFMAs with kBase=16 when kpack=1
  • RockDialect.cpp/RockOps.td: Updated validation and type support for INT8 LDS transpose
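
As a rough illustration of why the two larger-K INT8 MFMAs are treated as "double-rate", the arithmetic below counts how many 64-bit transposed loads one lane needs per MFMA operand. It is a sketch under stated assumptions, not code from this PR: `MfmaShape`, `kElemsPerLane`, and `loadsPerLane` are hypothetical names, and it assumes a 64-lane wavefront with the M x K operand spread evenly across lanes and one ds_read_tr8_b64 returning 64 bits (eight 8-bit elements) per lane.

```cpp
// Hypothetical illustration only; none of these names come from the PR.
struct MfmaShape {
  int m;
  int n;
  int k;
};

constexpr int kWaveSize = 64;          // assumed wavefront size on gfx950
constexpr int kTransposeLoadBits = 64; // one ds_read_tr8_b64 result per lane

// Number of elements a single transposed load hands to each lane.
constexpr int elemsPerTransposeLoad(int elemBits) {
  return kTransposeLoadBits / elemBits;
}

// K elements of the A (or B) operand that each lane supplies to one MFMA,
// assuming the M x K operand is divided evenly over the wavefront.
constexpr int kElemsPerLane(MfmaShape s) { return s.m * s.k / kWaveSize; }

// Transposed loads each lane must issue to fill one MFMA operand.
constexpr int loadsPerLane(MfmaShape s, int elemBits) {
  return kElemsPerLane(s) / elemsPerTransposeLoad(elemBits);
}
```

Under these assumptions, 16x16x32 and 32x32x16 need one load per lane (8 elements), while 16x16x64 and 32x32x32 need two (16 elements), which is consistent with the kBase=16 case mentioned for AccelEmitter.cpp.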

Test Plan

Added MLIR unit tests
Added E2E tests
All tests verified on gfx950 hardware with numerical correctness validation

Test Result

Submission Checklist

@stefankoncarevic stefankoncarevic force-pushed the lds-transpose-load-fp8 branch 3 times, most recently from f3176a8 to a75ab7a Compare January 29, 2026 14:10
@stefankoncarevic stefankoncarevic force-pushed the lds-transpose-load-fp8 branch 2 times, most recently from 24d9bf6 to 076a998 Compare February 27, 2026 13:37
@stefankoncarevic stefankoncarevic force-pushed the lds-transpose-load-int8 branch from 3ccbf35 to a6318df Compare March 2, 2026 10:05
@stefankoncarevic stefankoncarevic force-pushed the lds-transpose-load-fp8 branch 3 times, most recently from a6e9ccf to b8674ba Compare April 2, 2026 13:56
@stefankoncarevic stefankoncarevic force-pushed the lds-transpose-load-fp8 branch from b8674ba to b54ad4c Compare April 6, 2026 12:21
@stefankoncarevic stefankoncarevic force-pushed the lds-transpose-load-fp8 branch 2 times, most recently from 43f0c7e to b4f76ac Compare April 23, 2026 15:29
@stefankoncarevic stefankoncarevic force-pushed the lds-transpose-load-fp8 branch 4 times, most recently from 80e126e to 7a60515 Compare May 6, 2026 10:48
@stefankoncarevic stefankoncarevic force-pushed the lds-transpose-load-int8 branch from a6318df to 06376ea Compare May 6, 2026 14:24
@stefankoncarevic stefankoncarevic force-pushed the lds-transpose-load-fp8 branch from f394fde to 024e85a Compare May 6, 2026 22:18
Extend LDS transpose load (ds_read_tr8_b64) to cover INT8 (i8) on top
of the refactored FP8/BF8 path. INT8 reuses the 8-bit lane swizzle and
adds two new double-rate geometries: (16, 64) and (32, 32).
Core changes:
* LdsTransposeLoad.{h,cpp}: add isInt8Type / uses8BitTransposeLoad and
  isInt8OnlyLdsTransposeGeometry helpers; extend
  isValidLdsTransposeMfmaGeometry, getTransposeLoadVectorLength,
  getDoubleRateKOffsetBase, getBasePanelOffsets and
  emitThreadwiseHWTranspose to compute the right kStride / kOffsetBase /
  highHalfOffset for INT8 double-rate loads.
* makeDecision: reject INT8 with FP8-only or F16-only geometries and
  vice versa.
* RockOps.td: allow I8 on rock.lds_transpose_load source/result types.
* RockDialect.cpp: ThreadwiseReadIntoOp::verify accepts i8 destinations
  and enforces (geometry, type) consistency for INT8.
Cleanup applied during the review:
* Add isF16DoubleRateGeometry helper and reuse it in
  getDoubleRateKOffsetBase / getBasePanelOffsets /
  emitThreadwiseHWTranspose.
* Fix outdated assert message in buildTransposeAttrFromParams and the
  kStride doc comment in getDoubleRateKOffsetBase.
* Reorganize the LdsTransposeLoad.h header doc to list supported MFMA
  geometries by element type.
Tests:
* ops.mlir: positive rock.lds_transpose_load + LDSTransposeConfigAttr
  cases for the four INT8 geometries.
* lowering_load_transpose_lds.mlir: lowering check for i8 ->
  amdgpu.transpose_load.
* lds_transpose_load_panels.mlir: rename from the FP8-only file and
  add an INT8 (32, 16) panel-count check.
* lds_transpose_error.mlir: negative tests for INT8 with f16-only and
  fp8-only geometries, f16 / fp8 with int8-only geometries, updated
  valid-geometry messages, and a kPerBlock divisibility test for the
  INT8 (32, 32) double-rate geometry.
* PrLdsTransposeLoadI8.{toml,cfg} and
  PrLdsTransposeLoadAttentionI8.{toml,cfg}: new e2e configs for INT8
  GEMM and Attention with LDS transpose on A/B and on K/Q.
* lds_transpose_load_panels.mlir: add panel-count Lit guards for
  the two INT8 double-rate geometries (16,64) and (32,32). Each
  stanza checks both the number of amdgpu.transpose_load ops and the
  resulting amdgpu.mfma instruction.
* LdsTransposeLoad.h: promote isInt8Type next to isFp8Type and use
  hwtranspose::isInt8Type from RockDialect.cpp for symmetry.
* LdsTransposeLoad.cpp: refresh stale doc comments that listed only
  fp8/bf8 where the path now also handles INT8.
* PrLdsTransposeLoadI8.toml / PrLdsTransposeLoadAttentionI8.toml:
  replace inaccurate suite banners with a description of the INT8
  MFMA geometries actually exercised.
No functional change for supported configurations.
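
The (geometry, type) consistency rule described in the commit message can be pictured with the minimal sketch below. It models only the INT8 side of the rule and is not the PR's implementation: `ElemKind`, `Geometry`, and the three functions are hypothetical stand-ins. The (mnDim, kDim) pairs are an assumption read off the four INT8 MFMA shapes in the PR description, with (16, 64) and (32, 32) taken as the INT8-only ones. The F16-only and FP8-only geometry sets are not listed on this page, so they are left out.

```cpp
// Hypothetical stand-ins; the real helpers live in LdsTransposeLoad.{h,cpp}
// and their signatures are not shown on this page.
enum class ElemKind { F16, FP8, INT8 };

struct Geometry {
  int mnDim;
  int kDim;
};

constexpr bool sameGeometry(Geometry a, Geometry b) {
  return a.mnDim == b.mnDim && a.kDim == b.kDim;
}

// Assumed INT8-only set: the two new double-rate geometries.
constexpr bool isInt8OnlyGeometry(Geometry g) {
  return sameGeometry(g, {16, 64}) || sameGeometry(g, {32, 32});
}

// Assumed full INT8 set, derived from 16x16x32, 16x16x64, 32x32x16, 32x32x32.
constexpr bool isInt8Geometry(Geometry g) {
  return sameGeometry(g, {16, 32}) || sameGeometry(g, {32, 16}) ||
         isInt8OnlyGeometry(g);
}

// INT8 must use an INT8 geometry; other types must avoid INT8-only ones.
// A true result for F16/FP8 only means this rule does not reject the pair;
// it does not by itself establish that the geometry is valid for that type.
constexpr bool passesInt8ConsistencyRule(ElemKind kind, Geometry g) {
  return kind == ElemKind::INT8 ? isInt8Geometry(g) : !isInt8OnlyGeometry(g);
}
```

In this sketch an FP8 or F16 load paired with (32, 32) is rejected, as is an INT8 load paired with a geometry outside the four INT8 shapes, mirroring the negative cases listed for lds_transpose_error.mlir.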
@stefankoncarevic stefankoncarevic force-pushed the lds-transpose-load-int8 branch from 06376ea to 4dae620 Compare May 8, 2026 16:14