path: root/llvm/lib/Target/AMDGPU
Age | Commit message | Author | Files | Lines
30 hours[AMDGPU]: Unpack packed instructions overlapped by MFMAs post-RA scheduling (#157968)Akash Dutta3-5/+398
This is a cleaned up version of PR #151704. These optimizations are now performed post-RA scheduling.
35 hoursCodeGen: Add RegisterClass by HwMode (#158269)Matt Arsenault3-4/+9
This is a generalization of the LookupPtrRegClass mechanism. AMDGPU has several use cases for swapping the register class of instruction operands based on the subtarget, but none of them really fit into the box of being pointer-like. The current system requires manual management of an arbitrary integer ID. For the AMDGPU use case, this would end up being around 40 new entries to manage. This just introduces the base infrastructure. I have ports of all the target specific usage of PointerLikeRegClass ready.
37 hours[SDAG][AMDGPU] Allow opting in to OOB-generating PTRADD transforms (#146074)Fabian Ritter2-49/+13
This PR adds a TargetLowering hook, canTransformPtrArithOutOfBounds, that targets can use to allow transformations to introduce out-of-bounds pointer arithmetic. It also moves two such transformations from the AMDGPU-specific DAG combines to the generic DAGCombiner. This is motivated by target features like AArch64's checked pointer arithmetic, CPA, which does not tolerate the introduction of out-of-bounds pointer arithmetic.
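A conceptual illustration of the hazard the hook guards against (plain integer arithmetic standing in for pointers; this is not the DAGCombiner code): reassociating nested pointer adds can materialize an intermediate address that the original program never formed.

```cpp
#include <cstdint>

// Before the transform: the only address ever formed is the final one, which
// the program presumably keeps in bounds.
uintptr_t beforeTransform(uintptr_t P, int64_t A, int64_t B) {
  return P + static_cast<uintptr_t>(A + B);
}

// After a reassociating transform: P + A is materialized as an address in its
// own right. A target with checked pointer arithmetic (e.g. AArch64's CPA) may
// reject this intermediate value even though the final result is identical,
// which is why targets have to opt in to such transforms.
uintptr_t afterTransform(uintptr_t P, int64_t A, int64_t B) {
  uintptr_t Mid = P + static_cast<uintptr_t>(A); // possibly out of bounds
  return Mid + static_cast<uintptr_t>(B);
}
```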
38 hours[AMDGPU][SDAG] Handle ISD::PTRADD in various special cases (#145330)Fabian Ritter2-6/+7
There are more places in SIISelLowering.cpp and AMDGPUISelDAGToDAG.cpp that check for ISD::ADD in a pointer context, but as far as I can tell those are only relevant for 32-bit pointer arithmetic (like frame indices/scratch addresses and LDS), for which we don't enable PTRADD generation yet. For SWDEV-516125.
46 hoursAMDGPU: Remove unnecessary AGPR legalize logic (#159491)Matt Arsenault1-13/+0
The manual legalizeOperands code only needs to consider cases that require full instruction context to know whether the operand is legal. It does not need to handle basic operand register class constraints.
47 hours[AMDGPU] gfx1251 VOP3 dpp support (#159654)Stanislav Mekhanoshin3-51/+92
48 hours[AMDGPU] gfx1251 VOP2 dpp support (#159641)Stanislav Mekhanoshin1-34/+45
2 days[AMDGPU] gfx1251 VOP1 dpp support (#159637)Stanislav Mekhanoshin1-22/+43
2 days[AMDGPU][SDAG] Handle ISD::PTRADD in VOP3 patterns (#143881)Fabian Ritter1-5/+20
This patch mirrors similar patterns for ISD::ADD. The main difference is that ISD::ADD is commutative, so that a pattern definition for, e.g., (add (mul x, y), z), automatically also handles (add z, (mul x, y)). ISD::PTRADD is not commutative, so we would need to handle these cases explicitly. This patch only implements (ptradd z, (op x, y)) patterns, where the nested operation (shift or multiply) is the offset of the ptradd (i.e., the right operand), since base pointers that are the result of a shift or multiply seem less likely. For SWDEV-516125.
2 days[AMDGPU][SIInsertWaitcnts] Track SCC. Insert KM_CNT waits for SCC writes. (#157843)Petar Avramovic1-6/+75
Add a new event, SCC_WRITE, for s_barrier_signal_isfirst and s_barrier_leave, instructions that write to SCC; the corresponding counter is KM_CNT. Also start tracking SCC reads and writes. An s_barrier_wait on the same barrier guarantees that the SCC write from s_barrier_signal_isfirst has landed, so there is no need to insert s_wait_kmcnt in that case.
3 daysAMDGPU: Remove unnecessary operand legalization for WMMAs (#159370)Matt Arsenault1-15/+0
The operand constraints already express this constraint, and InstrEmitter will respect them.
3 daysAMDGPU: Constrain regclass when replacing SGPRs with VGPRs (#159369)Matt Arsenault1-1/+4
3 daysAMDGPU: Set RegTupleAlignUnits on _Lo256_Align2 class (#159383)Matt Arsenault1-1/+3
3 days[AMDGPU] Mark cluster_workgroup_id_* intrinsics always uniform (#159439)Stanislav Mekhanoshin1-0/+8
3 days[AMDGPU] Add gfx1251 subtarget (#159430)Stanislav Mekhanoshin3-0/+10
3 days[AMDGPU] Fold copies of constant physical registers into their uses (#154410)Stanislav Mekhanoshin1-4/+11
Co-authored-by: Jay Foad <Jay.Foad@amd.com> Co-authored-by: Jay Foad <Jay.Foad@amd.com>
3 days[AMDGPU][CodeGen][True16] Track waitcnt as vgpr32 instead of vgpr16 for D16 Instructions in GFX11 (#157795)Brox Chen4-1/+30
It seems a VMEM access on the hi/lo half of a register can interfere with the other half. Track waitcnt at vgpr32 granularity instead of vgpr16 for 16-bit registers in GFX11. --------- Co-authored-by: Joe Nash <joseph.nash@amd.com>
3 daysAMDGPU: Fixes for regbankselecting copies of i1 physregs to sgprs (#159283)Matt Arsenault1-4/+10
If the source register of a copy was a physical sgpr copied to an s1 value, this would assert.
4 daysAMDGPU: Remove subtarget feature hacking in AsmParser (#159227)Matt Arsenault2-14/+3
The wavesize hacking part was already done in createAMDGPUMCSubtargetInfo, and we can move the default target hack there too.
4 days[TableGen] Add mapping from processor ID to resource index for packetizer (#158182)Luo, Yuanke1-0/+5
TableGen generates code to access TargetResourceIndices by processor ID. The TargetProcResourceIndexStart[] array is generated only for processors that have itineraries; processors without itineraries are excluded from the array. When a target has a mix of such processors, the processor ID may exceed the array size and cause an error. This patch generates a table mapping processors with itineraries to resource indexes, so that the scheduler can get the correct resource index from the processor ID.
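A minimal standalone sketch of the lookup scheme described above; the struct, table contents, and function names are hypothetical stand-ins for the TableGen-generated code.

```cpp
#include <optional>

// Only processors that have itineraries get an entry.
struct ProcResourceIndexEntry {
  unsigned ProcID;             // processor ID as seen by the scheduler
  unsigned ResourceIndexStart; // first resource index for this processor
};

constexpr ProcResourceIndexEntry ProcResourceIndexTable[] = {
    {0, 0}, {2, 8}, {5, 20}, // example values only
};

// Searching by processor ID stays correct even when IDs are sparse, unlike
// indexing a dense array that omits processors without itineraries.
std::optional<unsigned> lookupResourceIndexStart(unsigned ProcID) {
  for (const ProcResourceIndexEntry &E : ProcResourceIndexTable)
    if (E.ProcID == ProcID)
      return E.ResourceIndexStart;
  return std::nullopt;
}
```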
4 days[AMDGPU] Prevent re-visits in LowerBufferFatPointers (#159168)Krzysztof Drewniak1-0/+6
Fixes https://github.com/iree-org/iree/issues/22001 The visitor in SplitPtrStructs would re-visit instructions if an instruction earlier in program order caused a recursive visit() call via getPtrParts(), so instructions could be processed multiple times. As a consequence, PHI nodes could be added to the Conditionals array multiple times, which would lead to a conditional that had already been simplified being processed again. After the code moved to InstSimplifyFolder, this re-processing, combined with more aggressive simplifications, would lead to an attempt to replace an instruction with itself, causing an assertion failure and crash. This commit resolves the issue and adds the reduced form of the crashing input as a test.
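A minimal sketch of the kind of re-visit guard described above, using stand-in types; the actual pass is built on LLVM's instruction visitor infrastructure.

```cpp
#include <unordered_set>

struct Instruction; // stand-in for the IR instruction type

struct SplitVisitor {
  std::unordered_set<Instruction *> Visited;

  void visit(Instruction *I) {
    // If a recursive visit (e.g. while resolving an operand's pointer parts)
    // already processed this instruction, never process it a second time.
    if (!Visited.insert(I).second)
      return;
    // ... split the fat-pointer value, possibly recursing into operands ...
  }
};
```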
4 days[AMDGPU] Add s_cluster_barrier on gfx1250 (#159175)Stanislav Mekhanoshin3-9/+69
4 days[AMDGPU] Set TGID_EN_X/Y/Z when cluster ID intrinsics are used (#159120)Shilei Tian3-25/+32
Hardware initializes a single value in ttmp9 which is either the workgroup ID X or cluster ID X. Most of this patch is a refactoring to use a single `PreloadedValue` enumerator for this value, instead of two enumerators `WORKGROUP_ID_X` and `CLUSTER_ID_X` referring to the same value. This makes it simpler to have a single attribute `amdgpu-no-workgroup-id-x` indicating that this value is not used, which in turn sets the TGID_EN_X bit appropriately to tell the hardware whether to initialize it. All of the above applies to Y and Z similarly. Fixes: LWPSCGFX13-568 Co-authored-by: Jay Foad <jay.foad@amd.com>
4 days[AMDGPU] Change `scale_sel` to be 4 bits (#157900)Shilei Tian2-4/+3
The latest SP changes updated it to use `OP_SEL[0:3]` instead of `OP_SEL[0:2]`. Fixes SWDEV-554472.
4 days[AMDGPU] Add aperture classes to VS_64 (#158823)Stanislav Mekhanoshin1-3/+5
Should not do anything.
4 days[AMDGPU] Elide bitcast fold i64 imm to build_vector (#154115)Janek van Oirschot3-1/+55
Elide the bitcast combine to build_vector in the case of an i64 immediate that can be materialized through a 64-bit mov.
4 daysAMDGPU: Try to unspill VGPRs after rewriting MFMAs to AGPR form (#154323)Matt Arsenault1-4/+171
After replacing VGPR MFMAs with the AGPR form, we've alleviated VGPR pressure, which may have triggered spills during allocation. Identify these spill slots, try to reassign them to newly freed VGPRs, and replace the spill instructions with copies. Fixes #154260
4 days[AMDGPU] Use larger immediate values in S_NOP (#158990)Jay Foad2-1/+12
The S_NOP instruction has an immediate operand which is one less than the number of cycles to delay for. The maximum value that may be encoded in this field was increased in GFX8 and again in GFX12.
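A standalone sketch of the encoding rule stated above: an S_NOP with immediate I delays for I + 1 cycles, so a longer delay is split into chunks bounded by the maximum encodable immediate (passed in as a parameter here, since the exact per-generation limits are not restated in this log).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns the immediates of an S_NOP sequence that delays for `Cycles` cycles,
// given that a single S_NOP with immediate I delays for I + 1 cycles.
std::vector<uint16_t> splitNopDelay(unsigned Cycles, unsigned MaxImm) {
  std::vector<uint16_t> Imms;
  while (Cycles > 0) {
    unsigned Chunk = std::min(Cycles, MaxImm + 1); // cycles one S_NOP can cover
    Imms.push_back(static_cast<uint16_t>(Chunk - 1));
    Cycles -= Chunk;
  }
  return Imms;
}
```

A larger maximum immediate simply means fewer S_NOP instructions are needed to cover the same delay.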
5 days[AMDGPU] Fix codegen to emit COPY instead of S_MOV_B64 for aperture regs (#158754)Stanislav Mekhanoshin2-23/+5
5 days[AMDGPU] Drop high 32 bits of aperture registers (#158725)Stanislav Mekhanoshin2-17/+25
Fixes: SWDEV-551181
5 days[AMDGPU][MC] Keep MCOperands unencoded. (#158685)Ivan Kosarev4-71/+46
We have proper encoding facilities to encode operands and instructions; there's no need to pollute the MC representation with encoding details. Supposed to be an NFCI, but happens to fix some re-encoded instruction codes in disassembler tests. The 64-bit operands are to be addressed in following patches introducing MC-level representation for lit() and lit64() modifiers, to then be respected by both the assembler and disassembler.
5 daysCodeGen: Surface shouldRewriteCopySrc utility function (#158524)Matt Arsenault1-15/+2
Change shouldRewriteCopySrc to return the common register class and expose it as a utility function. I've found myself reproducing essentially the same logic in multiple places. The purpose of this function is just to work through the API constraints of which combination of register class and subreg indexes you have; i.e., you need to use a different function depending on whether 0, 1, or 2 subregister indexes are involved in a pair of copy-like operations.
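A rough illustration of the API constraint mentioned above, where 0, 1, or 2 subregister indexes each require a different TargetRegisterInfo query; this is a sketch of the shape of the logic, not the new utility itself.

```cpp
#include "llvm/CodeGen/TargetRegisterInfo.h"
using namespace llvm;

// Return a register class common to both sides of a copy-like pair, choosing
// the query based on how many subregister indexes are involved.
static const TargetRegisterClass *
findCommonCopyClass(const TargetRegisterInfo &TRI,
                    const TargetRegisterClass *DefRC, unsigned DefSubReg,
                    const TargetRegisterClass *SrcRC, unsigned SrcSubReg) {
  if (!DefSubReg && !SrcSubReg) // no subregisters: plain common subclass
    return TRI.getCommonSubClass(DefRC, SrcRC);
  if (DefSubReg && SrcSubReg) { // subregisters on both sides
    unsigned PreA, PreB;
    return TRI.getCommonSuperRegClass(DefRC, DefSubReg, SrcRC, SrcSubReg,
                                      PreA, PreB);
  }
  // Exactly one subregister index involved.
  if (DefSubReg)
    return TRI.getMatchingSuperRegClass(DefRC, SrcRC, DefSubReg);
  return TRI.getMatchingSuperRegClass(SrcRC, DefRC, SrcSubReg);
}
```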
5 days[AMDGPU] Refactor out common exec mask opcode patterns (NFCI) (#154718)Carl Ritson11-386/+338
Create utility mechanism for finding wave size dependent opcodes used to manipulate exec/lane masks.
5 days[AMDGPU] Add the support for `.cluster_dims` code object metadata (#158721)Shilei Tian2-8/+28
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
5 days[AMDGPU][Attributor] Add `AAAMDGPUClusterDims` (#158076)Shilei Tian3-2/+167
5 days[AMDGPU][AsmParser] Simplify getting source locations of operands. (#158323)Ivan Kosarev1-186/+102
Remember indexes of MCOperands in MCParsedAsmOperands as we add them to instructions. Then use those indexes to find source locations from the known MCOperand indexes. Happens to fix some reported locations in tests. NFCI otherwise. getImmLoc() is to be eliminated as well; there's enough work for another patch.
5 days[AMDGPU][NFC] Refactor FLAT_Global_* pseudos. (#120244)sstipano1-87/+68
6 daysAMDGPU: Report unaligned scratch access as fast if supported by tgt (#158036)macurtis-amd1-1/+7
This enables more consecutive load folding during aggressive-instcombine. The original motivating example provided by Jeff Byrnes: https://godbolt.org/z/8ebcTEjTs Example provided by Nikita Popov: https://godbolt.org/z/Gv1j4vjqE as part of my original attempt to fix the issue (PR [#133301](https://github.com/llvm/llvm-project/pull/133301), see his [comment](https://github.com/llvm/llvm-project/pull/133301#issuecomment-2984905809)). This changes the value of `IsFast` returned by `SITargetLowering::allowsMisalignedMemoryAccessesImpl` to be non-zero for private and flat addresses if the subtarget supports unaligned scratch accesses. This enables aggressive-instcombine to do more folding of consecutive loads (see [here](https://github.com/llvm/llvm-project/blob/cbd496581fb6953a9a8d8387a010cc3a67d4654b/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp#L811)). Summary performance impact on [composable_kernel](https://github.com/ROCm/composable_kernel):
|GPU|speedup (geomean*)|
|---|---|
|MI300A| 1.11|
|MI300X| 1.14|
|MI350X| 1.03|
[*] Just to be clear, this is the geomean across kernels which were impacted by this change - not across all CK kernels.
8 daysAMDGPU: Relax verifier for agpr/vgpr loads and stores (#158391)Matt Arsenault1-1/+1
8 days[AMDGPU] Support lowering of cluster related instrinsics (#157978)Shilei Tian12-30/+567
Since much of the code is interconnected, this also changes how the workgroup ID is lowered. Co-authored-by: Jay Foad <jay.foad@amd.com> Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
8 days[AMDGPU] Remove scope check in SIInsertWaitcnts::generateWaitcntInstBefore (#157821)choikwa1-7/+1
This change was motivated by CK, where many VMCNT(0)'s were generated due to instructions lacking !alias.scope metadata. The two causes of this were: 1) LowerLDSModule not tacking scope metadata onto a single LDS variable, and 2) the IPSCCP pass, run before the inliner, replacing a noalias pointer derivative with a global value, which made the inliner unable to track it back to the noalias pointer argument. However, it turns out that IPSCCP losing the scope information was largely ineffectual, as ScopedNoAliasAA was able to handle the asymmetric condition, where one MemLoc was missing its scope, and still return a NoAlias result. AMDGPU, however, was checking for the existence of a scope in SIInsertWaitcnts, conservatively treating a missing scope as aliasing everything, and inserting VMCNT(0) before DS_READs, forcing them to wait for all previous LDS DMA instructions. Since we know that ScopedNoAliasAA can handle the asymmetry, we should also allow an AA query to determine whether two MIs may alias. Passed PSDB. A previous attempt addressed the issue in IPSCCP, but is likely stalled: https://github.com/llvm/llvm-project/pull/154522 This solution may be preferable over that one, as the issue only affects AMDGPU.
9 daysCodeGen: Remove MachineFunction argument from getRegClass (#158188)Matt Arsenault4-17/+14
This is a low level utility to parse the MCInstrInfo and should not depend on the state of the function.
9 daysAMDGPU: Move spill pseudo special case out of adjustAllocatableRegClass (#158246)Matt Arsenault2-7/+7
This is special for the same reason av_mov_b64_imm_pseudo is special.
9 days[NFC][AMDGPU][SIMemoryLegalizer] remove effectively empty function (#156806)Sameer Sahasrabuddhe1-39/+0
The removed function SIGfx90ACacheControl::enableLoadCacheBypass() does not actually do anything except one assert and one unreachable.
9 daysCodeGen: Remove MachineFunction argument from getPointerRegClass (#158185)Matt Arsenault2-4/+4
getPointerRegClass is a layering violation. Its primary purpose is to determine how to interpret an MCInstrDesc's operand RegClass fields. This should be context free, and only depend on the subtarget. The model of this is also wrong, since this should be an instruction / operand specific property, not a global pointer class. Remove the function argument to help stage removal of this hook and avoid introducing any new obstacles to replacing it. The remaining uses of the function were to get the subtarget, which the TargetRegisterInfo already belongs to. A few targets needed new subtarget derived properties copied there.
9 daysAMDGPU: Remove MIMG special case in adjustAllocatableRegClass (#158184)Matt Arsenault1-2/+1
I have no idea why this was here. MIMG atomics use tied operands for the input and output, so AV classes should have always worked. We have poor test coverage for AGPRs with atomics, so add a partial set of tests. Everything seems to work OK, although image cmpswap always appears to use VGPRs unnecessarily.
9 days[AMDGPUPromoteAlloca][NFC] Avoid unnecessary APInt/int64_t conversions (#157864)Fabian Ritter1-12/+10
Follow-up to #157682
9 days[AMDGPU] Remove an unused variable (NFC)Jie Fu1-1/+0
/llvm-project/llvm/lib/Target/AMDGPU/SIISelLowering.cpp:3416:9: error: unused variable 'ValVT' [-Werror,-Wunused-variable]
  EVT ValVT = VA.getValVT();
      ^
1 error generated.
9 daysAMDGPU: Fix returning wrong type for stack passed sub-dword arguments (#158002)Matt Arsenault2-27/+42
Fixes assertion with -debug-only=isel on LowerFormalArguments result. That assert really shouldn't be under LLVM_DEBUG. Fixes #157997
9 daysAMDGPU: Relax legal register operand constraint (#157989)Matt Arsenault1-5/+3
Find a common subclass instead of directly checking for a subclass relationship. This fixes folding logic for unaligned register defs into aligned use contexts. e.g., a vreg_64 def into an av_64_align2 use should be able to find the common subclass vreg_align2. This avoids regressions in future patches. Checking the subclass was also redundant on the subregister path; getMatchingSuperRegClass is sufficient.
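A small sketch of the relaxed check described here: instead of requiring the def's register class to already be a subclass of the class the use demands, ask for a common subclass and constrain the register to it when one exists (the surrounding folding code is assumed, not shown).

```cpp
#include "llvm/CodeGen/TargetRegisterInfo.h"
using namespace llvm;

// Returns the class to constrain the virtual register to if the fold is
// legal, or nullptr if the def and use classes share no common subclass.
// For example, an unaligned 64-bit VGPR def feeding an aligned 64-bit AV-class
// use can meet in an aligned 64-bit VGPR class that satisfies both.
static const TargetRegisterClass *
classForFold(const TargetRegisterInfo &TRI, const TargetRegisterClass *DefRC,
             const TargetRegisterClass *UseRC) {
  return TRI.getCommonSubClass(DefRC, UseRC);
}
```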