[AMDGPU] Allow allocation of lo128 registers from all banks #172614
base: main
Conversation
This stack of pull requests is managed by sgh.
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

We can encode 16-bit operands in a short form for VGPRs [0..127]. When we have 1K registers available we can in fact allocate 4 times more from all 4 banks. That, however, requires an allocatable class for these operands. While for most instructions this only results in the longer VOP3 form, for V_FMAAK/V_FMAMK_F16 it prohibits the encoding outright because these do not have VOP3 forms.

A straightforward solution would be to create a register class with all registers having bit 8 of the encoding zero, i.e. a register class with holes punched in it: [0-127, 256-383, 512-639, 768-895]. LLVM, however, does not like register classes with punched holes when they also have subregisters: the cross-product of all classes explodes, and some combinations of 'a class having a common subreg with another' become impossible. Just doing so grows our generated register info to 4+ GB, which also fails to compile.

The solution proposed is to define a _Lo128 RC with 896 contiguous VGPRs, but with an allocation order that hides the prohibited registers. That keeps the generated register info a reasonable size (+~50%).

The same is needed for VGPR_16_Lo128 in true16 mode.

In general we could later reuse VGPR_32 with AltOrderSelect, but we would need to pass the operand type there and deal with the AsmParser.

One other consideration is that we can allocate a register leaving a hole of a whole 128 registers, but a subsequent patch can fix that: by the time of RA we know the estimated register pressure and can further limit the allocation order.

In short: without this we either have spilling while VGPRs are still available, or outright 'ran out of registers' when many 16-bit registers are used with plenty of budget, but we cannot encode them.

Patch is 442.47 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/172614.diff 42 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp b/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
index 7a91a40e18cde..8430f2e3b4344 100644
--- a/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
+++ b/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
@@ -426,6 +426,8 @@ class AMDGPUOperand : public MCParsedAsmOperand {
return isRegOrInline(RCID, type) && !hasModifiers();
}
+ bool isVGPR32_Lo128() const;
+
bool isSCSrcB16() const {
return isRegOrInlineNoMods(AMDGPU::SReg_32RegClassID, MVT::i16);
}
@@ -2243,7 +2245,19 @@ bool AMDGPUOperand::isLiteralImm(MVT type) const {
}
bool AMDGPUOperand::isRegClass(unsigned RCID) const {
- return isRegKind() && AsmParser->getMRI()->getRegClass(RCID).contains(getReg());
+ if (!isRegKind() ||
+ !AsmParser->getMRI()->getRegClass(RCID).contains(getReg()))
+ return false;
+ if (RCID == AMDGPU::VGPR_32_Lo128RegClassID ||
+ RCID == AMDGPU::VS_32_Lo128RegClassID)
+ return getReg() <= AMDGPU::VGPR127 || getReg() > AMDGPU::VGPR1023;
+ return true;
+}
+
+bool AMDGPUOperand::isVGPR32_Lo128() const {
+ if (!isRegKind())
+ return false;
+ return getReg() >= AMDGPU::VGPR0 && getReg() <= AMDGPU::VGPR127;
}
bool AMDGPUOperand::isVRegWithInputMods() const {
diff --git a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp
index b63d71dc2fde9..55cd800d201db 100644
--- a/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp
+++ b/llvm/lib/Target/AMDGPU/MCTargetDesc/AMDGPUInstPrinter.cpp
@@ -813,7 +813,11 @@ void AMDGPUInstPrinter::printRegularOperand(const MCInst *MI, unsigned OpNo,
OpInfo, STI.getHwMode(MCSubtargetInfo::HwMode_RegInfo));
const MCRegisterClass &RC = MRI.getRegClass(RCID);
auto Reg = mc2PseudoReg(Op.getReg());
- if (!RC.contains(Reg) && !isInlineValue(Reg)) {
+ bool Err = !RC.contains(Reg) && !isInlineValue(Reg);
+ if (!Err && (RCID == AMDGPU::VGPR_32_Lo128RegClassID ||
+ RCID == AMDGPU::VS_32_Lo128RegClassID))
+ Err = Reg >= AMDGPU::VGPR128 && Reg <= AMDGPU::VGPR1023;
+ if (Err) {
bool IsWaveSizeOp = OpInfo.isLookupRegClassByHwMode() &&
(OpInfo.RegClass == AMDGPU::SReg_1 ||
OpInfo.RegClass == AMDGPU::SReg_1_XEXEC);
diff --git a/llvm/lib/Target/AMDGPU/SIInstrFormats.td b/llvm/lib/Target/AMDGPU/SIInstrFormats.td
index de66c472be0ca..cb036b517f2e2 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrFormats.td
+++ b/llvm/lib/Target/AMDGPU/SIInstrFormats.td
@@ -323,6 +323,12 @@ def CPolBit {
class VOPDstOperand<RegisterClassLike rc> : RegisterOperand<rc, "printVOPDst">;
+def Vgpr32Lo128 : AsmOperandClass {
+ let Name = "Vgpr32Lo128";
+ let PredicateMethod = "isVGPR32_Lo128";
+ let RenderMethod = "addRegOperands";
+}
+
def VOPDstOperand_t16 : VOPDstOperand <VGPR_16> {
let EncoderMethod = "getMachineOpValueT16";
let DecoderMethod = "DecodeVGPR_16RegisterClass";
@@ -333,12 +339,27 @@ def VOPDstOperand_t16Lo128 : VOPDstOperand <VGPR_16_Lo128> {
let DecoderMethod = "DecodeVGPR_16_Lo128RegisterClass";
}
+def VOPDstOperand_Vgpr32Lo128 : VOPDstOperand <VGPR_32_Lo128> {
+ let ParserMatchClass = Vgpr32Lo128;
+}
+
// Source-encoded destination operand for instructions like v_swap_b16.
def VOPSrcEncodedDstOperand_t16Lo128 : VOPDstOperand <VGPR_16_Lo128> {
let EncoderMethod = VSrcT_b16_Lo128.EncoderMethod;
let DecoderMethod = VSrcT_b16_Lo128.DecoderMethod;
}
+
+def VGPROp_16_Lo128 : RegisterOperand<VGPR_16_Lo128> {
+ let DecoderMethod = "DecodeVGPR_16_Lo128RegisterClass";
+ let EncoderMethod = "getMachineOpValueT16Lo128";
+}
+
+def VGPROp_32_Lo128 : RegisterOperand<VGPR_32_Lo128> {
+ let DecoderMethod = "DecodeVGPR_32RegisterClass";
+ let ParserMatchClass = Vgpr32Lo128;
+}
+
class VINTRPe <bits<2> op> : Enc32 {
bits<8> vdst;
bits<8> vsrc;
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.td b/llvm/lib/Target/AMDGPU/SIInstrInfo.td
index 628b972f97086..8738df31a9a56 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.td
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.td
@@ -1817,7 +1817,7 @@ class getVALUDstForVT_fake16<ValueType VT> {
RegisterOperand ret = !if(!eq(VT.Size, 32), VOPDstOperand<VGPR_32>,
!if(!eq(VT.Size, 128), VOPDstOperand<VReg_128_AlignTarget>,
!if(!eq(VT.Size, 64), VOPDstOperand<VReg_64_AlignTarget>,
- !if(!eq(VT.Size, 16), VOPDstOperand<VGPR_32_Lo128>,
+ !if(!eq(VT.Size, 16), VOPDstOperand_Vgpr32Lo128,
VOPDstS64orS32)))); // else VT == i1
}
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
index 272d4b5609dfb..838d31df5bacd 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.td
@@ -632,13 +632,20 @@ def VGPR_32 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg16Types
let BaseClassOrder = 32;
}
-// Identical to VGPR_32 except it only contains the low 128 (Lo128) registers.
+// Identical to VGPR_32 except only the low 128 (Lo128) registers in each
+// register bank are allocatable.
def VGPR_32_Lo128 : SIRegisterClass<"AMDGPU", !listconcat(Reg32Types.types, Reg16Types.types), 32,
- (add (sequence "VGPR%u", 0, 127))> {
+ (add (sequence "VGPR%u", 0, 895))> {
+ let AltOrders = [(add (sequence "VGPR%u", 0, 127),
+ (sequence "VGPR%u", 256, 383),
+ (sequence "VGPR%u", 512, 639),
+ (sequence "VGPR%u", 768, 895))];
+ let AltOrderSelect = [{ return 1; }];
let AllocationPriority = !add(0, !mul(BaseClassPriority, BaseClassScaleFactor));
let GeneratePressureSet = 0;
let Size = 32;
let Weight = 1;
+ let BaseClassOrder = 33;
}
// Identical to VGPR_32 except it only contains the low 256 (Lo256) registers.
@@ -1487,15 +1494,6 @@ foreach size = ["64", "96", "128", "160", "192", "224", "256", "288", "320", "35
def VGPROp_#size#_Align2 : RegisterOperand<!cast<RegisterClassLike>("VReg_"#size#_Align2)>;
}
-def VGPROp_16_Lo128 : RegisterOperand<VGPR_16_Lo128> {
- let DecoderMethod = "DecodeVGPR_16_Lo128RegisterClass";
- let EncoderMethod = "getMachineOpValueT16Lo128";
-}
-
-def VGPROp_32_Lo128 : RegisterOperand<VGPR_32_Lo128> {
- let DecoderMethod = "DecodeVGPR_32RegisterClass";
-}
-
//===----------------------------------------------------------------------===//
// ASrc_* Operands with an AccVGPR
//===----------------------------------------------------------------------===//
diff --git a/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp b/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
index 1b78f67e76d07..df8abccf97362 100644
--- a/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
+++ b/llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
@@ -157,7 +157,8 @@ bool SIShrinkInstructions::shouldShrinkTrue16(MachineInstr &MI) const {
assert(!Reg.isVirtual() && "Prior checks should ensure we only shrink "
"True16 Instructions post-RA");
if (AMDGPU::VGPR_32RegClass.contains(Reg) &&
- !AMDGPU::VGPR_32_Lo128RegClass.contains(Reg))
+ !llvm::is_contained(
+ AMDGPU::VGPR_32_Lo128RegClass.getRawAllocationOrder(*MF), Reg))
return false;
if (AMDGPU::VGPR_16RegClass.contains(Reg) &&
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
index b290c314f1154..4089e21040b24 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/irtranslator-inline-asm.ll
@@ -66,7 +66,7 @@ define amdgpu_kernel void @asm_simple_agpr_clobber() {
define i32 @asm_vgpr_early_clobber() {
; CHECK-LABEL: name: asm_vgpr_early_clobber
; CHECK: bb.1 (%ir-block.0):
- ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 7; v_mov_b32 $1, 7", 1 /* sideeffect attdialect */, 1245195 /* regdef-ec:VGPR_32 */, def early-clobber %8, 1245195 /* regdef-ec:VGPR_32 */, def early-clobber %9, !1
+ ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 7; v_mov_b32 $1, 7", 1 /* sideeffect attdialect */, 1376267 /* regdef-ec:VGPR_32 */, def early-clobber %8, 1376267 /* regdef-ec:VGPR_32 */, def early-clobber %9, !1
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY %8
; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY %9
; CHECK-NEXT: [[ADD:%[0-9]+]]:_(s32) = G_ADD [[COPY]], [[COPY1]]
@@ -94,7 +94,7 @@ entry:
define i32 @test_single_vgpr_output() nounwind {
; CHECK-LABEL: name: test_single_vgpr_output
; CHECK: bb.1.entry:
- ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 7", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %8
+ ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 7", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %8
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY %8
; CHECK-NEXT: $vgpr0 = COPY [[COPY]](s32)
; CHECK-NEXT: SI_RETURN implicit $vgpr0
@@ -106,7 +106,7 @@ entry:
define i32 @test_single_sgpr_output_s32() nounwind {
; CHECK-LABEL: name: test_single_sgpr_output_s32
; CHECK: bb.1.entry:
- ; CHECK-NEXT: INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1835018 /* regdef:SReg_32 */, def %8
+ ; CHECK-NEXT: INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %8
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY %8
; CHECK-NEXT: $vgpr0 = COPY [[COPY]](s32)
; CHECK-NEXT: SI_RETURN implicit $vgpr0
@@ -119,7 +119,7 @@ entry:
define float @test_multiple_register_outputs_same() #0 {
; CHECK-LABEL: name: test_multiple_register_outputs_same
; CHECK: bb.1 (%ir-block.0):
- ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 0; v_mov_b32 $1, 1", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %8, 1245194 /* regdef:VGPR_32 */, def %9
+ ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 0; v_mov_b32 $1, 1", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %8, 1376266 /* regdef:VGPR_32 */, def %9
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY %8
; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY %9
; CHECK-NEXT: [[FADD:%[0-9]+]]:_(s32) = G_FADD [[COPY]], [[COPY1]]
@@ -136,7 +136,7 @@ define float @test_multiple_register_outputs_same() #0 {
define double @test_multiple_register_outputs_mixed() #0 {
; CHECK-LABEL: name: test_multiple_register_outputs_mixed
; CHECK: bb.1 (%ir-block.0):
- ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 0; v_add_f64 $1, 0, 0", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %8, 2818058 /* regdef:VReg_64 */, def %9
+ ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 0; v_add_f64 $1, 0, 0", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %8, 2883594 /* regdef:VReg_64 */, def %9
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY %8
; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s64) = COPY %9
; CHECK-NEXT: [[UV:%[0-9]+]]:_(s32), [[UV1:%[0-9]+]]:_(s32) = G_UNMERGE_VALUES [[COPY1]](s64)
@@ -171,7 +171,7 @@ define amdgpu_kernel void @test_input_vgpr_imm() {
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
; CHECK-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 42
; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[C]](s32)
- ; CHECK-NEXT: INLINEASM &"v_mov_b32 v0, $0", 1 /* sideeffect attdialect */, 1245193 /* reguse:VGPR_32 */, [[COPY1]]
+ ; CHECK-NEXT: INLINEASM &"v_mov_b32 v0, $0", 1 /* sideeffect attdialect */, 1376265 /* reguse:VGPR_32 */, [[COPY1]]
; CHECK-NEXT: S_ENDPGM 0
call void asm sideeffect "v_mov_b32 v0, $0", "v"(i32 42)
ret void
@@ -185,7 +185,7 @@ define amdgpu_kernel void @test_input_sgpr_imm() {
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(p4) = COPY $sgpr8_sgpr9
; CHECK-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 42
; CHECK-NEXT: [[COPY1:%[0-9]+]]:sreg_32 = COPY [[C]](s32)
- ; CHECK-NEXT: INLINEASM &"s_mov_b32 s0, $0", 1 /* sideeffect attdialect */, 1835017 /* reguse:SReg_32 */, [[COPY1]]
+ ; CHECK-NEXT: INLINEASM &"s_mov_b32 s0, $0", 1 /* sideeffect attdialect */, 1966089 /* reguse:SReg_32 */, [[COPY1]]
; CHECK-NEXT: S_ENDPGM 0
call void asm sideeffect "s_mov_b32 s0, $0", "s"(i32 42)
ret void
@@ -212,7 +212,7 @@ define float @test_input_vgpr(i32 %src) nounwind {
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY $vgpr0
; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[COPY]](s32)
- ; CHECK-NEXT: INLINEASM &"v_add_f32 $0, 1.0, $1", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %9, 1245193 /* reguse:VGPR_32 */, [[COPY1]]
+ ; CHECK-NEXT: INLINEASM &"v_add_f32 $0, 1.0, $1", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %9, 1376265 /* reguse:VGPR_32 */, [[COPY1]]
; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY %9
; CHECK-NEXT: $vgpr0 = COPY [[COPY2]](s32)
; CHECK-NEXT: SI_RETURN implicit $vgpr0
@@ -227,7 +227,7 @@ define i32 @test_memory_constraint(ptr addrspace(3) %a) nounwind {
; CHECK-NEXT: liveins: $vgpr0
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(p3) = COPY $vgpr0
- ; CHECK-NEXT: INLINEASM &"ds_read_b32 $0, $1", 8 /* mayload attdialect */, 1245194 /* regdef:VGPR_32 */, def %9, 262158 /* mem:m */, [[COPY]](p3)
+ ; CHECK-NEXT: INLINEASM &"ds_read_b32 $0, $1", 8 /* mayload attdialect */, 1376266 /* regdef:VGPR_32 */, def %9, 262158 /* mem:m */, [[COPY]](p3)
; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY %9
; CHECK-NEXT: $vgpr0 = COPY [[COPY1]](s32)
; CHECK-NEXT: SI_RETURN implicit $vgpr0
@@ -244,7 +244,7 @@ define i32 @test_vgpr_matching_constraint(i32 %a) nounwind {
; CHECK-NEXT: [[C:%[0-9]+]]:_(s32) = G_CONSTANT i32 1
; CHECK-NEXT: [[AND:%[0-9]+]]:_(s32) = G_AND [[COPY]], [[C]]
; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[AND]](s32)
- ; CHECK-NEXT: INLINEASM &";", 1 /* sideeffect attdialect */, 1245194 /* regdef:VGPR_32 */, def %11, 2147483657 /* reguse tiedto:$0 */, [[COPY1]](tied-def 3)
+ ; CHECK-NEXT: INLINEASM &";", 1 /* sideeffect attdialect */, 1376266 /* regdef:VGPR_32 */, def %11, 2147483657 /* reguse tiedto:$0 */, [[COPY1]](tied-def 3)
; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY %11
; CHECK-NEXT: $vgpr0 = COPY [[COPY2]](s32)
; CHECK-NEXT: SI_RETURN implicit $vgpr0
@@ -256,13 +256,13 @@ define i32 @test_vgpr_matching_constraint(i32 %a) nounwind {
define i32 @test_sgpr_matching_constraint() nounwind {
; CHECK-LABEL: name: test_sgpr_matching_constraint
; CHECK: bb.1.entry:
- ; CHECK-NEXT: INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1835018 /* regdef:SReg_32 */, def %8
+ ; CHECK-NEXT: INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %8
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY %8
- ; CHECK-NEXT: INLINEASM &"s_mov_b32 $0, 8", 0 /* attdialect */, 1835018 /* regdef:SReg_32 */, def %10
+ ; CHECK-NEXT: INLINEASM &"s_mov_b32 $0, 8", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %10
; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(s32) = COPY %10
; CHECK-NEXT: [[COPY2:%[0-9]+]]:sreg_32 = COPY [[COPY]](s32)
; CHECK-NEXT: [[COPY3:%[0-9]+]]:sreg_32 = COPY [[COPY1]](s32)
- ; CHECK-NEXT: INLINEASM &"s_add_u32 $0, $1, $2", 0 /* attdialect */, 1835018 /* regdef:SReg_32 */, def %12, 1835017 /* reguse:SReg_32 */, [[COPY2]], 2147483657 /* reguse tiedto:$0 */, [[COPY3]](tied-def 3)
+ ; CHECK-NEXT: INLINEASM &"s_add_u32 $0, $1, $2", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %12, 1966089 /* reguse:SReg_32 */, [[COPY2]], 2147483657 /* reguse tiedto:$0 */, [[COPY3]](tied-def 3)
; CHECK-NEXT: [[COPY4:%[0-9]+]]:_(s32) = COPY %12
; CHECK-NEXT: $vgpr0 = COPY [[COPY4]](s32)
; CHECK-NEXT: SI_RETURN implicit $vgpr0
@@ -285,7 +285,7 @@ define void @test_many_matching_constraints(i32 %a, i32 %b, i32 %c) nounwind {
; CHECK-NEXT: [[COPY3:%[0-9]+]]:vgpr_32 = COPY [[COPY2]](s32)
; CHECK-NEXT: [[COPY4:%[0-9]+]]:vgpr_32 = COPY [[COPY]](s32)
; CHECK-NEXT: [[COPY5:%[0-9]+]]:vgpr_32 = COPY [[COPY1]](s32)
- ; CHECK-NEXT: INLINEASM &"; ", 1 /* sideeffect attdialect */, 1245194 /* regdef:VGPR_32 */, def %11, 1245194 /* regdef:VGPR_32 */, def %12, 1245194 /* regdef:VGPR_32 */, def %13, 2147483657 /* reguse tiedto:$0 */, [[COPY3]](tied-def 3), 2147614729 /* reguse tiedto:$2 */, [[COPY4]](tied-def 7), 2147549193 /* reguse tiedto:$1 */, [[COPY5]](tied-def 5)
+ ; CHECK-NEXT: INLINEASM &"; ", 1 /* sideeffect attdialect */, 1376266 /* regdef:VGPR_32 */, def %11, 1376266 /* regdef:VGPR_32 */, def %12, 1376266 /* regdef:VGPR_32 */, def %13, 2147483657 /* reguse tiedto:$0 */, [[COPY3]](tied-def 3), 2147614729 /* reguse tiedto:$2 */, [[COPY4]](tied-def 7), 2147549193 /* reguse tiedto:$1 */, [[COPY5]](tied-def 5)
; CHECK-NEXT: [[COPY6:%[0-9]+]]:_(s32) = COPY %11
; CHECK-NEXT: [[COPY7:%[0-9]+]]:_(s32) = COPY %12
; CHECK-NEXT: [[COPY8:%[0-9]+]]:_(s32) = COPY %13
@@ -306,10 +306,10 @@ define void @test_many_matching_constraints(i32 %a, i32 %b, i32 %c) nounwind {
define i32 @test_sgpr_to_vgpr_move_matching_constraint() nounwind {
; CHECK-LABEL: name: test_sgpr_to_vgpr_move_matching_constraint
; CHECK: bb.1.entry:
- ; CHECK-NEXT: INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1835018 /* regdef:SReg_32 */, def %8
+ ; CHECK-NEXT: INLINEASM &"s_mov_b32 $0, 7", 0 /* attdialect */, 1966090 /* regdef:SReg_32 */, def %8
; CHECK-NEXT: [[COPY:%[0-9]+]]:_(s32) = COPY %8
; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[COPY]](s32)
- ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, $1", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %10, 2147483657 /* reguse tiedto:$0 */, [[COPY1]](tied-def 3)
+ ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, $1", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %10, 2147483657 /* reguse tiedto:$0 */, [[COPY1]](tied-def 3)
; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(s32) = COPY %10
; CHECK-NEXT: $vgpr0 = COPY [[COPY2]](s32)
; CHECK-NEXT: SI_RETURN implicit $vgpr0
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankcombiner-ignore-copies-crash.mir b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankcombiner-ignore-copies-crash.mir
index 7ca3869b535e4..65bde1e7d8efb 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankcombiner-ignore-copies-crash.mir
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankcombiner-ignore-copies-crash.mir
@@ -24,7 +24,7 @@ body: |
; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr(s32) = COPY [[C]](s32)
; CHECK-NEXT: [[FMUL:%[0-9]+]]:vgpr(s32) = G_FMUL [[COPY]], [[COPY1]]
; CHECK-NEXT: [[C1:%[0-9]+]]:sgpr(s32) = G_FCONSTANT float 1.000000e+00
- ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 0", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %5(s32)
+ ; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 0", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %5(s32)
; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr(s32) = COPY [[C1]](s32)
; CHECK-NEXT: [[AMDGPU_FMED3_:%[0-9]+]]:vgpr(s32) = nnan G_AMDGPU_FMED3 [[FMUL]], %5, [[COPY2]]
; CHECK-NEXT: $vgpr0 = COPY [[AMDGPU_FMED3_]](s32)
@@ -33,7 +33,7 @@ body: |
%2:vgpr(s32) = COPY %1(s32)
%3:vgpr(s32) = G_FMUL %0, %2
%4:sgpr(s32) = G_FCONSTANT float 1.000000e+00
- INLINEASM &"v_mov_b32 $0, 0", 0 /* attdialect */, 1245194 /* regdef:VGPR_32 */, def %5:vgpr_32
+ INLINEASM &"v_mov_b32 $0, 0", 0 /* attdialect */, 1376266 /* regdef:VGPR_32 */, def %5:vgpr_32
%6:vgpr(s32) = COPY %4(s32)
%7:vgpr(s32) = nnan G_AMDGPU_FMED3 %3(s32), %5(s32), %6(s32)
$vgpr0 = COPY %7(s32)
diff --git a/llvm/test/CodeGen/AMDGPU/branch-relax-indirect-branch.mir b/llvm/test/CodeGen/AMDGPU/branch-relax-indirect-branch.mir
index 34c0159dd3ddb..b1690860178de 100644
--- a/llvm/test/CodeGen/AMDGPU/branch-relax-indirect-branch.mir
+++ b/llvm/test/CodeGen/AMDGPU/branch-relax-indirect-branch.mir
@@ -68,7 +68,7 @@ body: |
; CHECK-NEXT: successors: %bb.3(0x04000000), %bb.7(0x7c000000)
; CHECK-NEXT: liveins: $vcc_hi, $vcc_lo, $sgpr5, $sgpr6, $sgpr7, $sgpr8, $sgpr9, $sgpr10, $sgpr11, $sgpr12, $sgpr13, $sgpr14, $sgpr15, $sgpr16, $sgpr17, $sgpr18, $sgpr19, $sgpr20, $sgpr21, $sgpr22, $sgpr23, $sgpr24, $sgpr25, $sgpr26, $sgpr27, $sgpr28, $sgpr29, $sgpr30, $sgpr31, $sgpr34, $sgpr35, $sgpr36, $sgpr37, $sgpr38, $sgpr39, $sgpr40, $sgpr41, $sgpr42, $sgpr43, $sgpr44, $sgpr45, $sgpr46, $sgpr47, $sgpr48, $sgpr49, $sgpr50, $sgpr51, $sgpr52, $sgpr53, $sgpr54, $sg...
[truncated]
It also shows some progression in disasm, but I guess that is because I have specialized some operand classes. This can be done with just a part of the patch, I think.
🐧 Linux x64 Test Results
✅ The build succeeded and all tests passed.
The real-true16 case is more complex because VGPR_16_Lo128 is not allocatable and is in fact unused in real codegen as an operand RC; it only exists as a part of VS_16_Lo256. In addition, I have discovered that we simply do not form fmaak_f16/fmamk_f16 since GFX11, so we practically never run into an unencodable situation, only into a deficient VOP3 encoding for the rest of the true16 instructions.
qcolombet left a comment
Looks reasonable to me, but I'm not too deep into the details yet.
; CHECK-LABEL: name: asm_vgpr_early_clobber
; CHECK: bb.1 (%ir-block.0):
; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 7; v_mov_b32 $1, 7", 1 /* sideeffect attdialect */, 1245195 /* regdef-ec:VGPR_32 */, def early-clobber %8, 1245195 /* regdef-ec:VGPR_32 */, def early-clobber %9, !1
; CHECK-NEXT: INLINEASM &"v_mov_b32 $0, 7; v_mov_b32 $1, 7", 1 /* sideeffect attdialect */, 1376267 /* regdef-ec:VGPR_32 */, def early-clobber %8, 1376267 /* regdef-ec:VGPR_32 */, def early-clobber %9, !1
What's this magic number?
These are RC IDs encoded in MIR. I do not like these tests, but every time you touch register info these have to be updated. It does not have anything to do with the patch itself.
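For context, a hedged sketch of how such a flag word could be decoded, assuming LLVM's layout from llvm/IR/InlineAsm.h (kind in bits 0-2, register count in bits 3-15, register class ID + 1 in the upper bits); the `AsmFlag` struct and `decodeAsmFlag` are hypothetical names, not LLVM API:

```cpp
#include <cstdint>

// Hypothetical decoder for the flag words printed in INLINEASM MIR
// lines, assuming the llvm/IR/InlineAsm.h layout: kind in bits 0-2
// (1 = reguse, 2 = regdef, 3 = regdef-ec), register count in bits
// 3-15, and (register class ID + 1) in bits 16 and up.
struct AsmFlag {
  unsigned Kind;
  unsigned NumRegs;
  unsigned RCID; // ~0u when no register class is encoded
};

AsmFlag decodeAsmFlag(uint32_t Flag) {
  AsmFlag F;
  F.Kind = Flag & 7;
  F.NumRegs = (Flag >> 3) & 0x1FFF;
  unsigned RC = Flag >> 16;
  F.RCID = RC ? RC - 1 : ~0u;
  return F;
}
```

Under that layout, 1245194 (before the patch) and 1376266 (after) decode to the same kind and register count; only the RC ID field shifts by 2, consistent with two new register classes being added, which is why so many tests needed updating.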
let GeneratePressureSet = 0;
let Size = 32;
let Weight = 1;
let BaseClassOrder = 33;
Should we go with, say, 64 to leave us some space between 32 and this one?
(Happy to go with 33, just asking)
Frankly it does not work, whatever number I set. In theory it has to be higher than VGPR_32's. In practice it neither works nor matters; I've tried 10000. Thus I have shrunk the RC size to 896 registers instead of 1024 so that tablegen will not make it a base class for anything. I have to note it is counterintuitive, but if it were 1024 registers, the generated reginfo would be a third smaller, because the class would become indistinguishable from VGPR_32. I wish we could get rid of this RC altogether, though, and pass the operand type to getRawAllocationOrder() instead, but its current uses by RA do not seem to collect operands at all. At least some of them don't.
I wish we could get rid of this RC altogether though, and pass operand type to the getRawAllocationOrder()
What would the best solution look like if you had infinite time?
Where I'm going with this is that RA is definitely not flexible enough and may need an overhaul, so I'm trying to gauge your ideas on this side.
bool Err = !RC.contains(Reg) && !isInlineValue(Reg);
if (!Err && (RCID == AMDGPU::VGPR_32_Lo128RegClassID ||
             RCID == AMDGPU::VS_32_Lo128RegClassID))
  Err = Reg >= AMDGPU::VGPR128 && Reg <= AMDGPU::VGPR1023;
Not a big deal, but should we somehow factor this logic?
I feel it is easy to miss the RCID (RCID == AMDGPU::VGPR_32_Lo128RegClassID || RCID == AMDGPU::VS_32_Lo128RegClassID) and/or the range (Reg >= AMDGPU::VGPR128 && Reg <= AMDGPU::VGPR1023;).
And we're repeating it twice in this patch (the second time in AMDGPUOperand::isRegClass).
I'm afraid finding all the places where this needs to be fixed will be problematic if it's not shared.
The logic is repeated on purpose; it is not the same logic. One place checks that the operand is a VGPR in the range [v0-v127]; the other checks that it is not a VGPR outside of this range, because it can also accept SGPRs. Yes, this is quite puzzling unfortunately.
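A minimal sketch of why the two checks are intentionally asymmetric (hypothetical helper names, not the actual LLVM code): the parser predicate accepts only VGPRs v0-v127, while the printer flags an error only for VGPRs v128-v1023, so a non-VGPR operand such as an SGPR in VS_32_Lo128 is rejected by the first but not flagged by the second:

```cpp
// Hypothetical model of the two predicates: an operand is either a
// VGPR with an index, or some other register (e.g. an SGPR).
struct Operand {
  bool IsVGPR;
  unsigned Index; // VGPR number, meaningful only when IsVGPR is true
};

// AsmParser side (like isVGPR32_Lo128): must be a VGPR in [v0, v127].
bool parserAccepts(const Operand &Op) {
  return Op.IsVGPR && Op.Index <= 127;
}

// InstPrinter side: only an out-of-range VGPR is an error; SGPRs
// are legal for VS_32_Lo128 and pass through.
bool printerFlagsError(const Operand &Op) {
  return Op.IsVGPR && Op.Index >= 128 && Op.Index <= 1023;
}
```

An SGPR operand fails `parserAccepts` yet is not an error for `printerFlagsError`, which is exactly why the two sites cannot share one predicate.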
Let's try it with the last patch though. It may work slower, but at least it is more readable and commented.
Then when I say we do not form fmaak/fmamk since gfx11, I mean this:
I don't understand the problem here. Fix the tablegen bug? It would be far nicer to express this as a class instead of functionally creating a class that requires special casing.
It's not a bug, it is by design. All that
We can encode 16-bit operands in a short form for VGPRs [0..127].
When we have 1K registers available we can in fact allocate 4
times more from all 4 banks. That, however, requires an allocatable
class for these operands. While for most instructions this only
results in the longer VOP3 form, for V_FMAAK/V_FMAMK_F16 it
prohibits the encoding outright because these do not have VOP3 forms.
A straightforward solution would be to create a register class
with all registers having bit 8 of the encoding zero, i.e. to
create a register class with holes punched in it: [0-127, 256-383,
512-639, 768-895]. LLVM, however, does not like register classes
with punched holes when they also have subregisters. The cross-
product of all classes explodes and some combinations of 'a class
having a common subreg with another' become impossible. Just
doing so grows our register info to 4+ GB, which also fails to compile.
The solution proposed is to define a _Lo128 RC with 896 contiguous
VGPRs, but with an allocation order that hides the prohibited registers.
That keeps the generated register info a reasonable size (+~50%).
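The banked allocation order above can be sketched as follows (a hedged model, not compiler code; it reads the commit message's "bit 8" as the 0-indexed bit 7, value 0x80, of the VGPR encoding):

```cpp
#include <vector>

// Model of the AltOrders list: of the 896 VGPRs in the _Lo128 class,
// the short-encodable ones are those whose encoding has bit 7
// (value 0x80) clear, i.e. the low 128 registers of each
// 256-register bank: [0-127], [256-383], [512-639], [768-895].
std::vector<unsigned> lo128AllocationOrder() {
  std::vector<unsigned> Order;
  for (unsigned V = 0; V < 896; ++V)
    if ((V & 0x80) == 0) // low half of the bank
      Order.push_back(V);
  return Order;
}
```

This yields 512 allocatable registers, four times the 128 available before the patch, which is the point of opening up all four banks.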
The same is needed for VGPR_16_Lo128 in true16 mode.
In general we could later reuse VGPR_32 with AltOrderSelect, but
we would need to pass the operand type there and deal with the AsmParser.
One other consideration is that we can allocate a register leaving
a hole of a whole 128 registers, but a subsequent patch can fix
that: by the time of RA we know the estimated register pressure
and can further limit the allocation order.
In short: without this we either have spilling while VGPRs are
still available, or outright 'ran out of registers' when many
16-bit registers are used with plenty of budget, but we cannot
encode them.