
Conversation

@rampitec
Collaborator

We can encode 16-bit operands in a short form for VGPRs [0..127].
When we have 1K registers available we can in fact allocate 4
times more from all 4 banks. That, however, requires an allocatable
class for these operands. While for most instructions this only
results in the longer VOP3 form, for V_FMAMK_F16/V_FMAAK_F16 it
outright prohibits the encoding because these have no VOP3 forms.

A straightforward solution would be to create a register class
containing only registers with bit 8 of the encoding zero, i.e. a
register class with holes punched in it: [0-127, 256-383,
512-639, 768-895]. LLVM, however, does not like register classes
with punched holes when they also have subregisters. The cross
product of all classes explodes and some combinations of a 'class
having a common subreg with another' become impossible. Doing this
naively blows up the generated register info to 4+ GB, which does
not compile either.

The proposed solution is to define a _lo128 RC with a contiguous
range of 896 VGPRs, but with an allocation order that hides the
prohibited registers. That keeps the generated register info at a
reasonable size (about +50%).

The same is needed for VGPR_16_Lo128 in true16 mode.

In general we could later reuse VGPR_32 with AltOrderSelect, but
we would need to pass the operand type there and deal with the
AsmParser.

One other consideration is that we can allocate a register leaving
a hole of a whole 128 registers, but a subsequent patch can fix
that: by the time of RA we know the estimated register pressure
and can further limit the allocation order.

In short: without this we either spill while VGPRs are still
available, or outright run out of registers when a lot of 16-bit
registers are used and there is plenty of budget, but we cannot
encode them.

@rampitec
Collaborator Author

rampitec commented Dec 17, 2025

@llvmbot
Member

llvmbot commented Dec 17, 2025

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes



Patch is 442.47 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/172614.diff

42 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp (+15-1)
  • (modified) llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp (+5-1)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrFormats.td (+21)
  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.td (+1-1)
  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.td (+9-11)
  • (modified) llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp (+2-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll (+16-16)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankcombiner-ignore-copies-crash.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relax-indirect-branch.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relax-no-terminators.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/coalesce-copy-to-agpr-to-av-registers.mir (+120-120)
  • (modified) llvm/test/CodeGen/AMDGPU/coalescer-early-clobber-subreg.mir (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/dst-sel-hazard.mir (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/hazards-gfx950.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/inflate-reg-class-vgpr-mfma-to-av-with-load-source.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/inline-asm.i128.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/local-stack-alloc-add-references.gfx10.mir (+15-15)
  • (modified) llvm/test/CodeGen/AMDGPU/local-stack-alloc-add-references.gfx8.mir (+180-180)
  • (modified) llvm/test/CodeGen/AMDGPU/local-stack-alloc-add-references.gfx9.mir (+90-90)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-scheduler-sink-trivial-remats.mir (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/no-limit-coalesce.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/optimize-exec-mask-pre-ra-no-fold-exec-copy.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/partial-regcopy-and-spill-missed-at-regalloc.ll (+12-12)
  • (added) llvm/test/CodeGen/AMDGPU/regalloc-vgpr_lo128-gfx1250.mir (+94)
  • (modified) llvm/test/CodeGen/AMDGPU/rename-independent-subregs.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/rewrite-vgpr-mfma-to-agpr-copy-from.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/rewrite-vgpr-mfma-to-agpr-subreg-insert-extract.mir (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/rewrite-vgpr-mfma-to-agpr-subreg-src2-chain.mir (+18-18)
  • (modified) llvm/test/CodeGen/AMDGPU/sched-assert-dead-def-subreg-use-other-subreg.mir (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/sched-handleMoveUp-subreg-def-across-subreg-def.mir (+8-8)
  • (added) llvm/test/CodeGen/AMDGPU/shrink-vgpr_lo128-gfx1250.mir (+89)
  • (modified) llvm/test/CodeGen/AMDGPU/spill-vector-superclass.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/subreg-undef-def-with-other-subreg-defs.mir (+12-12)
  • (modified) llvm/test/MC/Disassembler/AMDGPU/gfx11_dasm_vop1_dpp16.txt (+46-46)
  • (modified) llvm/test/MC/Disassembler/AMDGPU/gfx11_dasm_vop1_dpp8.txt (+46-23)
  • (modified) llvm/test/MC/Disassembler/AMDGPU/gfx11_dasm_vop2_dpp16.txt (+6-6)
  • (modified) llvm/test/MC/Disassembler/AMDGPU/gfx11_dasm_vop2_dpp8.txt (+6-3)
  • (modified) llvm/test/MC/Disassembler/AMDGPU/gfx1250_dasm_vop1_dpp8.txt (+13-13)
  • (modified) llvm/test/MC/Disassembler/AMDGPU/gfx12_dasm_vop1_dpp16.txt (+46-46)
  • (modified) llvm/test/MC/Disassembler/AMDGPU/gfx12_dasm_vop1_dpp8.txt (+49-46)
  • (modified) llvm/test/MC/Disassembler/AMDGPU/gfx12_dasm_vop2_dpp16.txt (+6-6)
  • (modified) llvm/test/MC/Disassembler/AMDGPU/gfx12_dasm_vop2_dpp8.txt (+6-3)
diff --git a/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp b/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
index 7a91a40e18cde..8430f2e3b4344 100644
--- a/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
+++ b/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
@@ -426,6 +426,8 @@ class AMDGPUOperand : public MCParsedAsmOperand {
     return isRegOrInline(RCID, type) && !hasModifiers();
   }
 
+  bool isVGPR32_Lo128() const;
+
   bool isSCSrcB16() const {
     return isRegOrInlineNoMods(AMDGPU::SReg_32RegClassID, MVT::i16);
   }
@@ -2243,7 +2245,19 @@ bool AMDGPUOperand::isLiteralImm(MVT type) const {
 }
 
 bool AMDGPUOperand::isRegClass(unsigned RCID) const {
-  return isRegKind() && AsmParser->getMRI()->getRegClass(RCID).contains(getReg());
+  if (!isRegKind() ||
+      !AsmParser->getMRI()->getRegClass(RCID).contains(getReg()))
+    return false;
+  if (RCID == AMDGPU::VGPR_32_Lo128RegClassID ||
+      RCID == AMDGPU::VS_32_Lo128RegClassID)
+    return getReg() <= AMDGPU::VGPR127 || getReg() > AMDGPU::VGPR1023;
+  return true;
+}
+
+bool AMDGPUOperand::isVGPR32_Lo128() const {
+  if (!isRegKind())
+    return false;
+  return getReg() >= AMDGPU::VGPR0 && getReg() <= AMDGPU::VGPR127;
 }
 
 bool AMDGPUOperand::isVRegWithInputMods() const {
diff --git a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp
index b63d71dc2fde9..55cd800d201db 100644
--- a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp
+++ b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp
@@ -813,7 +813,11 @@ void AMDGPUInstPrinter::printRegularOperand(const MCInst *MI, unsigned OpNo,
           OpInfo, STI.getHwMode(MCSubtargetInfo::HwMode_RegInfo));
       const MCRegisterClass &RC = MRI.getRegClass(RCID);
       auto Reg = mc2PseudoReg(Op.getReg());
-      if (!RC.contains(Reg) && !isInlineValue(Reg)) {
+      bool Err = !RC.contains(Reg) && !isInlineValue(Reg);
+      if (!Err && (RCID == AMDGPU::VGPR_32_Lo128RegClassID ||
+                   RCID == AMDGPU::VS_32_Lo128RegClassID))
+        Err = Reg >= AMDGPU::VGPR128 && Reg <= AMDGPU::VGPR1023;
+      if (Err) {
         bool IsWaveSizeOp = OpInfo.isLookupRegClassByHwMode() &&
                             (OpInfo.RegClass == AMDGPU::SReg_1 ||
                              OpInfo.RegClass == AMDGPU::SReg_1_XEXEC);
diff --git a/llvm/lib/Target/AMDGPU/SIInstrFormats.td b/llvm/lib/Target/AMDGPU/SIInstrFormats.td
index de66c472be0ca..cb036b517f2e2 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrFormats.td
+++ b/llvm/lib/Target/AMDGPU/SIInstrFormats.td
@@ -323,6 +323,12 @@ def CPolBit {
 
 class VOPDstOperand<RegisterClassLike rc> : RegisterOperand<rc, "printVOPDst">;
 
+def Vgpr32Lo128 : AsmOperandClass {
+  let Name = "Vgpr32Lo128";
+  let PredicateMethod = "isVGPR32_Lo128";
+  let RenderMethod = "addRegOperands";
+}
+
 def VOPDstOperand_t16 : VOPDstOperand <VGPR_16> {
   let EncoderMethod = "getMachineOpValueT16";
   let DecoderMethod = "DecodeVGPR_16RegisterClass";
@@ -333,12 +339,27 @@ def VOPDstOperand_t16Lo128 : VOPDstOperand <VGPR_16_Lo128> {
   let DecoderMethod = "DecodeVGPR_16_Lo128RegisterClass";
 }
 
+def VOPDstOperand_Vgpr32Lo128 : VOPDstOperand <VGPR_32_Lo128> {
+  let ParserMatchClass = Vgpr32Lo128;
+}
+
 // Source-encoded destination operand for instructions like v_swap_b16.
 def VOPSrcEncodedDstOperand_t16Lo128 : VOPDstOperand <VGPR_16_Lo128> {
   let EncoderMethod = VSrcT_b16_Lo128.EncoderMethod;
   let DecoderMethod = VSrcT_b16_Lo128.DecoderMethod;
 }
 
+
+def VGPROp_16_Lo128 : RegisterOperand<VGPR_16_Lo128> {
+  let DecoderMethod = "DecodeVGPR_16_Lo128RegisterClass";
+  let EncoderMethod = "getMachineOpValueT16Lo128";
+}
+
+def VGPROp_32_Lo128 : RegisterOperand<VGPR_32_Lo128> {
+  let DecoderMethod = "DecodeVGPR_32RegisterClass";
+  let ParserMatchClass = Vgpr32Lo128;
+}
+
 class VINTRPe <bits<2> op> : Enc32 {
   bits<8> vdst;
   bits<8> vsrc;
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.td b/llvm/lib/Target/AMDGPU/SIInstrInfo.td
index 628b972f97086..8738df31a9a56 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.td
@@ -1817,7 +1817,7 @@ class getVALUDstForVT_fake16<ValueType VT> {
   RegisterOperand ret = !if(!eq(VT.Size, 32), VOPDstOperand<VGPR_32>,
                           !if(!eq(VT.Size, 128), VOPDstOperand<VReg_128_AlignTarget>,
                             !if(!eq(VT.Size, 64), VOPDstOperand<VReg_64_AlignTarget>,
-                              !if(!eq(VT.Size, 16), VOPDstOperand<VGPR_32_Lo128>,
+                              !if(!eq(VT.Size, 16), VOPDstOperand_Vgpr32Lo128,
                               VOPDstS64orS32)))); // else VT == i1
 }
 
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
index 272d4b5609dfb..838d31df5bacd 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
@@ -632,13 +632,20 @@ def VGPR_32 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg16Types
   let BaseClassOrder = 32;
 }
 
-// Identical to VGPR_32 except it only contains the low 128 (Lo128) registers.
+// Identical to VGPR_32 except only the low 128 (Lo128) registers in each
+// register bank are allocatable.
 def VGPR_32_Lo128 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg16Types.types), 32,
-                            (add (sequence "VGPR%u", 0, 127))> {
+                            (add (sequence "VGPR%u", 0, 895))> {
+  let AltOrders = [(add (sequence "VGPR%u", 0, 127),
+                        (sequence "VGPR%u", 256, 383),
+                        (sequence "VGPR%u", 512, 639),
+                        (sequence "VGPR%u", 768, 895))];
+  let AltOrderSelect = [{ return 1; }];
   let AllocationPriority = !add(0, !mul(BaseClassPriority, BaseClassScaleFactor));
   let GeneratePressureSet = 0;
   let Size = 32;
   let Weight = 1;
+  let BaseClassOrder = 33;
 }
 
 // Identical to VGPR_32 except it only contains the low 256 (Lo256) registers.
@@ -1487,15 +1494,6 @@ foreach size = ["64", "96", "128", "160", "192", "224", "256", "288", "320", "35
   def VGPROp_#size#_Align2 : RegisterOperand<!cast<RegisterClassLike>("VReg_"#size#_Align2)>;
 }
 
-def VGPROp_16_Lo128 : RegisterOperand<VGPR_16_Lo128> {
-  let DecoderMethod = "DecodeVGPR_16_Lo128RegisterClass";
-  let EncoderMethod = "getMachineOpValueT16Lo128";
-}
-
-def VGPROp_32_Lo128 : RegisterOperand<VGPR_32_Lo128> {
-  let DecoderMethod = "DecodeVGPR_32RegisterClass";
-}
-
 //===----------------------------------------------------------------------===//
 //  ASrc_* Operands with an AccVGPR
 //===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp b/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
index 1b78f67e76d07..df8abccf97362 100644
--- a/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
+++ b/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
@@ -157,7 +157,8 @@ bool SIShrinkInstructions::shouldShrinkTrue16(MachineInstr &MI) const {
       assert(!Reg.isVirtual() && "Prior checks should ensure we only shrink "
                                  "True16 Instructions post-RA");
       if (AMDGPU::VGPR_32RegClass.contains(Reg) &&
-          !AMDGPU::VGPR_32_Lo128RegClass.contains(Reg))
+          !llvm::is_contained(
+              AMDGPU::VGPR_32_Lo128RegClass.getRawAllocationOrder(*MF), Reg))
         return false;
 
       if (AMDGPU::VGPR_16RegClass.contains(Reg) &&
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
index b290c314f1154..4089e21040b24 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
@@ -66,7 +66,7 @@ define amdgpu_kernel void @asm_simple_agpr_clobber() {
 define i32 @asm_vgpr_early_clobber() {
   ; CHECK-LABEL: name: asm_vgpr_early_clobber
   ; CHECK: bb.1 (%ir-block.0):
-  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 $0, 7; v_mov_b32 $1, 7", 1 /* sideeffect attdialect */, 1245195 /* regdef-ec:VGPR_32 */, def early-clobber %8, 1245195 /* regdef-ec:VGPR_32 */, def early-clobber %9, !1
+  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 $0, 7; v_mov_b32 $1, 7", 1 /* sideeffect attdialect */, 1376267 /* regdef-ec:VGPR_32 */, def early-clobber %8, 1376267 /* regdef-ec:VGPR_32 */, def early-clobber %9, !1
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY %8
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:_(s32) = COPY %9
   ; CHECK-NEXT:   [[ADD:%[0-9]+]]:_(s32) = G_ADD [[COPY]], [[COPY1]]
@@ -94,7 +94,7 @@ entry:
 define i32 @test_single_vgpr_output() nounwind {
   ; CHECK-LABEL: name: test_single_vgpr_output
   ; CHECK: bb.1.entry:
-  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 $0, 7", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %8
+  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 $0, 7", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %8
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY %8
   ; CHECK-NEXT:   $vgpr0 = COPY [[COPY]](s32)
   ; CHECK-NEXT:   SI_RETURN implicit $vgpr0
@@ -106,7 +106,7 @@ entry:
 define i32 @test_single_sgpr_output_s32() nounwind {
   ; CHECK-LABEL: name: test_single_sgpr_output_s32
   ; CHECK: bb.1.entry:
-  ; CHECK-NEXT:   INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1835018 /* regdef:SReg_32 */, def %8
+  ; CHECK-NEXT:   INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %8
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY %8
   ; CHECK-NEXT:   $vgpr0 = COPY [[COPY]](s32)
   ; CHECK-NEXT:   SI_RETURN implicit $vgpr0
@@ -119,7 +119,7 @@ entry:
 define float @test_multiple_register_outputs_same() #0 {
   ; CHECK-LABEL: name: test_multiple_register_outputs_same
   ; CHECK: bb.1 (%ir-block.0):
-  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 $0, 0; v_mov_b32 $1, 1", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %8, 1245194 /* regdef:VGPR_32 */, def %9
+  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 $0, 0; v_mov_b32 $1, 1", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %8, 1376266 /* regdef:VGPR_32 */, def %9
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY %8
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:_(s32) = COPY %9
   ; CHECK-NEXT:   [[FADD:%[0-9]+]]:_(s32) = G_FADD [[COPY]], [[COPY1]]
@@ -136,7 +136,7 @@ define float @test_multiple_register_outputs_same() #0 {
 define double @test_multiple_register_outputs_mixed() #0 {
   ; CHECK-LABEL: name: test_multiple_register_outputs_mixed
   ; CHECK: bb.1 (%ir-block.0):
-  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 $0, 0; v_add_f64 $1, 0, 0", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %8, 2818058 /* regdef:VReg_64 */, def %9
+  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 $0, 0; v_add_f64 $1, 0, 0", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %8, 2883594 /* regdef:VReg_64 */, def %9
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY %8
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:_(s64) = COPY %9
   ; CHECK-NEXT:   [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[COPY1]](s64)
@@ -171,7 +171,7 @@ define amdgpu_kernel void @test_input_vgpr_imm() {
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; CHECK-NEXT:   [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 42
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[C]](s32)
-  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 v0, $0", 1 /* sideeffect attdialect */, 1245193 /* reguse:VGPR_32 */, [[COPY1]]
+  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 v0, $0", 1 /* sideeffect attdialect */, 1376265 /* reguse:VGPR_32 */, [[COPY1]]
   ; CHECK-NEXT:   S_ENDPGM 0
   call void asm sideeffect "v_mov_b32 v0, $0", "v"(i32 42)
   ret void
@@ -185,7 +185,7 @@ define amdgpu_kernel void @test_input_sgpr_imm() {
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
   ; CHECK-NEXT:   [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 42
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:sreg_32 = COPY [[C]](s32)
-  ; CHECK-NEXT:   INLINEASM &"s_mov_b32 s0, $0", 1 /* sideeffect attdialect */, 1835017 /* reguse:SReg_32 */, [[COPY1]]
+  ; CHECK-NEXT:   INLINEASM &"s_mov_b32 s0, $0", 1 /* sideeffect attdialect */, 1966089 /* reguse:SReg_32 */, [[COPY1]]
   ; CHECK-NEXT:   S_ENDPGM 0
   call void asm sideeffect "s_mov_b32 s0, $0", "s"(i32 42)
   ret void
@@ -212,7 +212,7 @@ define float @test_input_vgpr(i32 %src) nounwind {
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[COPY]](s32)
-  ; CHECK-NEXT:   INLINEASM &"v_add_f32 $0, 1.0, $1", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %9, 1245193 /* reguse:VGPR_32 */, [[COPY1]]
+  ; CHECK-NEXT:   INLINEASM &"v_add_f32 $0, 1.0, $1", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %9, 1376265 /* reguse:VGPR_32 */, [[COPY1]]
   ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:_(s32) = COPY %9
   ; CHECK-NEXT:   $vgpr0 = COPY [[COPY2]](s32)
   ; CHECK-NEXT:   SI_RETURN implicit $vgpr0
@@ -227,7 +227,7 @@ define i32 @test_memory_constraint(ptr addrspace(3) %a) nounwind {
   ; CHECK-NEXT:   liveins: $vgpr0
   ; CHECK-NEXT: {{  $}}
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(p3) = COPY $vgpr0
-  ; CHECK-NEXT:   INLINEASM &"ds_read_b32 $0, $1", 8 /* mayload attdialect */, 1245194 /* regdef:VGPR_32 */, def %9, 262158 /* mem:m */, [[COPY]](p3)
+  ; CHECK-NEXT:   INLINEASM &"ds_read_b32 $0, $1", 8 /* mayload attdialect */, 1376266 /* regdef:VGPR_32 */, def %9, 262158 /* mem:m */, [[COPY]](p3)
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:_(s32) = COPY %9
   ; CHECK-NEXT:   $vgpr0 = COPY [[COPY1]](s32)
   ; CHECK-NEXT:   SI_RETURN implicit $vgpr0
@@ -244,7 +244,7 @@ define i32 @test_vgpr_matching_constraint(i32 %a) nounwind {
   ; CHECK-NEXT:   [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
   ; CHECK-NEXT:   [[AND:%[0-9]+]]:_(s32) = G_AND [[COPY]], [[C]]
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[AND]](s32)
-  ; CHECK-NEXT:   INLINEASM &";", 1 /* sideeffect attdialect */, 1245194 /* regdef:VGPR_32 */, def %11, 2147483657 /* reguse tiedto:$0 */, [[COPY1]](tied-def 3)
+  ; CHECK-NEXT:   INLINEASM &";", 1 /* sideeffect attdialect */, 1376266 /* regdef:VGPR_32 */, def %11, 2147483657 /* reguse tiedto:$0 */, [[COPY1]](tied-def 3)
   ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:_(s32) = COPY %11
   ; CHECK-NEXT:   $vgpr0 = COPY [[COPY2]](s32)
   ; CHECK-NEXT:   SI_RETURN implicit $vgpr0
@@ -256,13 +256,13 @@ define i32 @test_vgpr_matching_constraint(i32 %a) nounwind {
 define i32 @test_sgpr_matching_constraint() nounwind {
   ; CHECK-LABEL: name: test_sgpr_matching_constraint
   ; CHECK: bb.1.entry:
-  ; CHECK-NEXT:   INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1835018 /* regdef:SReg_32 */, def %8
+  ; CHECK-NEXT:   INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %8
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY %8
-  ; CHECK-NEXT:   INLINEASM &"s_mov_b32 $0, 8", 0 /* attdialect */, 1835018 /* regdef:SReg_32 */, def %10
+  ; CHECK-NEXT:   INLINEASM &"s_mov_b32 $0, 8", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %10
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:_(s32) = COPY %10
   ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:sreg_32 = COPY [[COPY]](s32)
   ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:sreg_32 = COPY [[COPY1]](s32)
-  ; CHECK-NEXT:   INLINEASM &"s_add_u32 $0, $1, $2", 0 /* attdialect */, 1835018 /* regdef:SReg_32 */, def %12, 1835017 /* reguse:SReg_32 */, [[COPY2]], 2147483657 /* reguse tiedto:$0 */, [[COPY3]](tied-def 3)
+  ; CHECK-NEXT:   INLINEASM &"s_add_u32 $0, $1, $2", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %12, 1966089 /* reguse:SReg_32 */, [[COPY2]], 2147483657 /* reguse tiedto:$0 */, [[COPY3]](tied-def 3)
   ; CHECK-NEXT:   [[COPY4:%[0-9]+]]:_(s32) = COPY %12
   ; CHECK-NEXT:   $vgpr0 = COPY [[COPY4]](s32)
   ; CHECK-NEXT:   SI_RETURN implicit $vgpr0
@@ -285,7 +285,7 @@ define void @test_many_matching_constraints(i32 %a, i32 %b, i32 %c) nounwind {
   ; CHECK-NEXT:   [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[COPY2]](s32)
   ; CHECK-NEXT:   [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[COPY]](s32)
   ; CHECK-NEXT:   [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY1]](s32)
-  ; CHECK-NEXT:   INLINEASM &"; ", 1 /* sideeffect attdialect */, 1245194 /* regdef:VGPR_32 */, def %11, 1245194 /* regdef:VGPR_32 */, def %12, 1245194 /* regdef:VGPR_32 */, def %13, 2147483657 /* reguse tiedto:$0 */, [[COPY3]](tied-def 3), 2147614729 /* reguse tiedto:$2 */, [[COPY4]](tied-def 7), 2147549193 /* reguse tiedto:$1 */, [[COPY5]](tied-def 5)
+  ; CHECK-NEXT:   INLINEASM &"; ", 1 /* sideeffect attdialect */, 1376266 /* regdef:VGPR_32 */, def %11, 1376266 /* regdef:VGPR_32 */, def %12, 1376266 /* regdef:VGPR_32 */, def %13, 2147483657 /* reguse tiedto:$0 */, [[COPY3]](tied-def 3), 2147614729 /* reguse tiedto:$2 */, [[COPY4]](tied-def 7), 2147549193 /* reguse tiedto:$1 */, [[COPY5]](tied-def 5)
   ; CHECK-NEXT:   [[COPY6:%[0-9]+]]:_(s32) = COPY %11
   ; CHECK-NEXT:   [[COPY7:%[0-9]+]]:_(s32) = COPY %12
   ; CHECK-NEXT:   [[COPY8:%[0-9]+]]:_(s32) = COPY %13
@@ -306,10 +306,10 @@ define void @test_many_matching_constraints(i32 %a, i32 %b, i32 %c) nounwind {
 define i32 @test_sgpr_to_vgpr_move_matching_constraint() nounwind {
   ; CHECK-LABEL: name: test_sgpr_to_vgpr_move_matching_constraint
   ; CHECK: bb.1.entry:
-  ; CHECK-NEXT:   INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1835018 /* regdef:SReg_32 */, def %8
+  ; CHECK-NEXT:   INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %8
   ; CHECK-NEXT:   [[COPY:%[0-9]+]]:_(s32) = COPY %8
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[COPY]](s32)
-  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 $0, $1", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %10, 2147483657 /* reguse tiedto:$0 */, [[COPY1]](tied-def 3)
+  ; CHECK-NEXT:   INLINEASM &"v_mov_b32 $0, $1", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %10, 2147483657 /* reguse tiedto:$0 */, [[COPY1]](tied-def 3)
   ; CHECK-NEXT:   [[COPY2:%[0-9]+]]:_(s32) = COPY %10
   ; CHECK-NEXT:   $vgpr0 = COPY [[COPY2]](s32)
   ; CHECK-NEXT:   SI_RETURN implicit $vgpr0
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankcombiner-ignore-copies-crash.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankcombiner-ignore-copies-crash.mir
index 7ca3869b535e4..65bde1e7d8efb 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankcombiner-ignore-copies-crash.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankcombiner-ignore-copies-crash.mir
@@ -24,7 +24,7 @@ body: |
     ; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr(s32) = COPY [[C]](s32)
     ; CHECK-NEXT: [[FMUL:%[0-9]+]]:vgpr(s32) = G_FMUL [[COPY]], [[COPY1]]
     ; CHECK-NEXT: [[C1:%[0-9]+]]:sgpr(s32) = G_FCONSTANT float 1.000000e+00
-    ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 0", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %5(s32)
+    ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 0", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %5(s32)
     ; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr(s32) = COPY [[C1]](s32)
     ; CHECK-NEXT: [[AMDGPU_FMED3_:%[0-9]+]]:vgpr(s32) = nnan G_AMDGPU_FMED3 [[FMUL]], %5, [[COPY2]]
     ; CHECK-NEXT: $vgpr0 = COPY [[AMDGPU_FMED3_]](s32)
@@ -33,7 +33,7 @@ body: |
     %2:vgpr(s32) = COPY %1(s32)
     %3:vgpr(s32) = G_FMUL %0, %2
     %4:sgpr(s32) = G_FCONSTANT float 1.000000e+00
-    INLINEASM &"v_mov_b32 $0, 0", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %5:vgpr_32
+    INLINEASM &"v_mov_b32 $0, 0", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %5:vgpr_32
     %6:vgpr(s32) = COPY %4(s32)
     %7:vgpr(s32) = nnan G_AMDGPU_FMED3 %3(s32), %5(s32), %6(s32)
     $vgpr0 = COPY %7(s32)
diff --git a/llvm/test/CodeGen/AMDGPU/branch-relax-indirect-branch.mir b/llvm/test/CodeGen/AMDGPU/branch-relax-indirect-branch.mir
index 34c0159dd3ddb..b1690860178de 100644
--- a/llvm/test/CodeGen/AMDGPU/branch-relax-indirect-branch.mir
+++ b/llvm/test/CodeGen/AMDGPU/branch-relax-indirect-branch.mir
@@ -68,7 +68,7 @@ body:             |
   ; CHECK-NEXT:   successors: %bb.3(0x04000000), %bb.7(0x7c000000)
   ; CHECK-NEXT:   liveins: $vcc_hi, $vcc_lo, $sgpr5, $sgpr6, $sgpr7, $sgpr8, $sgpr9, $sgpr10, $sgpr11, $sgpr12, $sgpr13, $sgpr14, $sgpr15, $sgpr16, $sgpr17, $sgpr18, $sgpr19, $sgpr20, $sgpr21, $sgpr22, $sgpr23, $sgpr24, $sgpr25, $sgpr26, $sgpr27, $sgpr28, $sgpr29, $sgpr30, $sgpr31, $sgpr34, $sgpr35, $sgpr36, $sgpr37, $sgpr38, $sgpr39, $sgpr40, $sgpr41, $sgpr42, $sgpr43, $sgpr44, $sgpr45, $sgpr46, $sgpr47, $sgpr48, $sgpr49, $sgpr50, $sgpr51, $sgpr52, $sgpr53, $sgpr54, $sg...
[truncated]

@rampitec rampitec requested a review from jayfoad December 17, 2025 09:04
@rampitec
Collaborator Author

It also shows some progression in disasm, but I guess that is because I have specialized some operand classes. This could be done with just a part of the patch, I think.

@github-actions

github-actions bot commented Dec 17, 2025

🐧 Linux x64 Test Results

  • 187675 tests passed
  • 4983 tests skipped

✅ The build succeeded and all tests passed.

@rampitec rampitec force-pushed the users/rampitec/lo128-regs-from-all-banks branch from e1a1b8f to ad09d7d Compare December 17, 2025 09:51
@rampitec
Collaborator Author

The real-true16 case is more complex because VGPR_16_Lo128 is not allocatable and is in fact unused as an operand RC in real codegen, only as a part of VS_16_Lo256. In addition, I have discovered that we simply do not form fmaak_f16/fmamk_f16 since GFX11, so in practice we do not run into an unencodable situation, just into a suboptimal VOP3 encoding for the rest of the true16 instructions.

Collaborator

@qcolombet qcolombet left a comment


Looks reasonable to me but I'm not too much in the details yet.

; CHECK-LABEL: name: asm_vgpr_early_clobber
; CHECK: bb.1 (%ir-block.0):
; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 7; v_mov_b32 $1, 7", 1 /* sideeffect attdialect */, 1245195 /* regdef-ec:VGPR_32 */, def early-clobber %8, 1245195 /* regdef-ec:VGPR_32 */, def early-clobber %9, !1
; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 7; v_mov_b32 $1, 7", 1 /* sideeffect attdialect */, 1376267 /* regdef-ec:VGPR_32 */, def early-clobber %8, 1376267 /* regdef-ec:VGPR_32 */, def early-clobber %9, !1
Collaborator


What's this magic number?

Collaborator Author


These are RC IDs encoded in MIR. I do not like these tests, but every time you touch register info they have to be updated. It does not have anything to do with the patch itself.

let GeneratePressureSet = 0;
let Size = 32;
let Weight = 1;
let BaseClassOrder = 33;
Collaborator


Should we go with say 64 to leave us some space between 32 and this one?
(Happy to go with 33, just asking)

Collaborator Author


Frankly it does not work whatever number I set. In theory it has to be higher than VGPR_32's. In practice it neither works nor matters; I've tried 10000. Thus I have shrunk the RC size to 896 registers instead of 1024 so tablegen will not make it a base class for anything. I have to note it is counterintuitive, but if it were 1024 registers the generated reginfo would be a third smaller, because the class would become indistinguishable from VGPR_32. I wish we could get rid of this RC altogether, though, and pass the operand type to getRawAllocationOrder() instead, but its current uses by RA do not seem to collect operands at all. At least some of them do not.

Collaborator


I wish we could get rid of this RC altogether though, and pass operand type to the getRawAllocationOrder()

What would the best solution look like if you had infinite time?

Where I'm going is that RA is definitely not flexible enough and may need an overhaul, so I'm trying to gauge your ideas on this side.

bool Err = !RC.contains(Reg) && !isInlineValue(Reg);
if (!Err && (RCID == AMDGPU::VGPR_32_Lo128RegClassID ||
RCID == AMDGPU::VS_32_Lo128RegClassID))
Err = Reg >= AMDGPU::VGPR128 && Reg <= AMDGPU::VGPR1023;
Collaborator

Not a big deal, but should we somehow factor out this logic?
I feel it is easy to miss the RCID (RCID == AMDGPU::VGPR_32_Lo128RegClassID || RCID == AMDGPU::VS_32_Lo128RegClassID) and/or the range (Reg >= AMDGPU::VGPR128 && Reg <= AMDGPU::VGPR1023;).
And we're repeating it twice in this patch (the second time in AMDGPUOperand::isRegClass).

I'm afraid finding all the places where this needs to be fixed will be problematic if it's not shared.

Collaborator Author

@rampitec Dec 18, 2025

The logic is repeated on purpose; it is not the same logic. One place checks that the register is a VGPR in the range [v0-v127], while the other checks that it is not a VGPR outside of this range, because that operand can also accept SGPRs. Yes, this is quite puzzling unfortunately.
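A toy sketch of the two deliberately different predicates (made-up register numbering, not the real LLVM enums or API):

```python
# Hypothetical sketch of the two deliberately different checks.
VGPR_BASE = 1000  # made-up enum offset for VGPRs; SGPRs live below it

def is_vgpr(reg: int) -> bool:
    return reg >= VGPR_BASE

def vgpr_index(reg: int) -> int:
    return reg - VGPR_BASE

# Check 1 (VGPR-only operand): must be a VGPR in [v0..v127].
def ok_vgpr_lo128(reg: int) -> bool:
    return is_vgpr(reg) and vgpr_index(reg) < 128

# Check 2 (VS-style operand): SGPRs are acceptable; only reject
# VGPRs outside the encodable range.
def ok_vs_lo128(reg: int) -> bool:
    return not (is_vgpr(reg) and vgpr_index(reg) >= 128)

sgpr = 5
assert not ok_vgpr_lo128(sgpr)  # SGPR fails the VGPR-only check
assert ok_vs_lo128(sgpr)        # but passes the VS-style check
```

The two predicates only agree on VGPRs; for SGPRs they intentionally diverge, which is why sharing one helper between the two call sites is not straightforward.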

Collaborator Author

Let's try it with the last patch though. It may work slower, but at least it is more readable and commented.

@rampitec force-pushed the users/rampitec/lo128-regs-from-all-banks branch 3 times, most recently from 06712e3 to d935fcd on December 18, 2025 at 08:58
@arsenm
Contributor

arsenm commented Dec 18, 2025

A straightforward solution would be to create a register class with all registers having bit 8 of the encoding zero, i.e. to create a register class with holes punched in it: [0-127, 256-383, 512-639, 768-895]. LLVM, however, does not like register classes with punched holes when they also have subregisters. The cross-product of all classes explodes, and some combinations of a 'class having a common subreg with another' become impossible. Just doing so explodes our register info to 4+ GB, which is uncompilable too.

I don't understand the problem here. Fix the tablegen bug? It would be far nicer to express this as a class instead of functionally creating a class that requires special casing.

@rampitec
Collaborator Author

I don't understand the problem here. Fix the tablegen bug? It would be far nicer to express this as a class instead of functionally creating a class that requires special casing.

It's not a bug, it is by design. getMatchingSuperReg, getClassWithSubReg, etc. all need a register class to return and static tables. When we have a fractured RC it literally explodes these tables, because we have a lot of possible subregs, and when tablegen tries to apply a subreg window to the fractured class it just sees impossible combinations.
