path: root/llvm/lib/Target/AMDGPU
Age | Commit message | Author | Files | Lines
9 hours | [AMDGPU] LRO: allow same-BB non-lookthrough users for PHI (#160909) | michaelselehov | 1 | -1/+4
Loop headers frequently consume the loop-carried value in the header block via non-lookthrough ops (e.g. byte-wise vector binops). LiveRegOptimizer’s same-BB filter currently prunes these users, so the loop-carried PHI is not coerced to i32 and the intended packed form is lost. Relax the filter: when the def is a PHI, allow same-BB non-lookthrough users. Also fix the check to look at the user (CII) rather than the def (II) so the walk does not terminate prematurely.
9 hours | [AMDGPU][LowerBufferFatPointers] Erase dead ptr(7) intrinsics (#160798) | Krzysztof Drewniak | 1 | -1/+3
Fix a crash that arose when dead intrinsics such as llvm.masked.load.T.p7 were left in the module when AMDGPULowerBufferFatPointers was applied: a captures(none) annotation would then be applied to a non-pointer value, triggering a verifier failure.
Co-authored-by: Shilei Tian <i@tianshilei.me>
3 days | AMDGPU: Check if immediate is legal for av_mov_b32_imm_pseudo (#160819) | Matt Arsenault | 1 | -0/+9
This is primarily to avoid folding a frame index materialized into an SGPR into the pseudo; this would end up looking like:

  %sreg = s_mov_b32 %stack.0
  %av_32 = av_mov_b32_imm_pseudo %sreg

which is not useful. Match the check used for the b64 case. This is limited to the pseudo to avoid a regression due to gfx908's special case: it expects to pass here with v_accvgpr_write_b32 for illegal cases, and to stay in the intermediate state with an SGPR input. This avoids regressions in a future patch.
3 days | [AMDGPU][True16][CodeGen] Avoid setting hi part in copysign (#160891) | Piotr Sobczak | 1 | -2/+3
This is a temporary fix for a regression from #154875. The new pattern sets the hi part of the V_BFI result, which confuses si-fix-sgpr-copies (where the proper fix likely belongs). During si-fix-sgpr-copies, an incorrect fold happens:

  %86:vgpr_32 = V_BFI_B32_e64
  %87:sreg_32 = COPY %86.hi16:vgpr_32
  %95:vgpr_32 = nofpexcept V_PACK_B32_F16_t16_e64 0, killed %87:sreg_32, 0, %63:vgpr_16, 0, 0

into

  %86:vgpr_32 = V_BFI_B32_e64
  %95:vgpr_32 = nofpexcept V_PACK_B32_F16_t16_e64 0, %86.lo16:vgpr_32, 0, %63:vgpr_16, 0, 0

Fixes: Vulkan CTS dEQP-VK.glsl.builtin.precision_fp16_storage32b.*.
3 days | [AMDGPU] Ensure divergence for v_alignbit (#129159) | Jeffrey Byrnes | 1 | -7/+7
Selecting a VGPR instruction for the uniform version of this pattern may lead to unnecessary VGPR usage and waterfall loops.
3 days | [NFC][LLVM] Pass/return SMLoc by value instead of const reference (#160797) | Rahul Joshi | 1 | -13/+11
SMLoc itself encapsulates just a pointer, so there is no need to pass or return it by reference.
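The reasoning can be illustrated with a stand-in type. The sketch below uses a hypothetical `Loc` class (an assumption for illustration, not the real `llvm::SMLoc`) that wraps a single pointer, and shows why passing it by value costs no more than passing a reference.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for a source-location handle; not the real llvm::SMLoc.
class Loc {
  const char *Ptr = nullptr;

public:
  Loc() = default;
  explicit Loc(const char *P) : Ptr(P) {}
  const char *getPointer() const { return Ptr; }
  bool isValid() const { return Ptr != nullptr; }
};

// The wrapper is pointer-sized and trivially copyable, so copying it is as
// cheap as copying the pointer it holds.
static_assert(sizeof(Loc) == sizeof(const char *), "Loc is just a pointer");
static_assert(std::is_trivially_copyable<Loc>::value, "Loc copies trivially");

// Taking and returning by value avoids the extra indirection of `const Loc &`.
inline Loc advance(Loc L, unsigned N) {
  assert(L.isValid() && "cannot advance an invalid location");
  return Loc(L.getPointer() + N);
}
```

With this shape, `advance(Loc(Buf), 2)` returns a new handle pointing two characters into `Buf`, and no reference to a caller-owned object is ever formed.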
4 days | [AMDGPU] Skip debug uses in SIInsertWaitcnts::shouldFlushVmCnt (#160818) | Jay Foad | 1 | -1/+1
4 days | [AMDGPU] Avoid constraining RC based on folded into operand (NFC) (#160743) | Josh Hutton | 1 | -4/+9
The RC of the folded operand does not need to be constrained based on the RC of the operand we are folding into. This change is intended to facilitate https://github.com/llvm/llvm-project/pull/151033
4 days | [AMDGPU] Calc IsVALU correctly during UADDO/USUBO selection (#159814) | LU-JOHN | 2 | -7/+14
Fix two bugs. The first bug hid the second.
1. Calculate IsVALU correctly during UADDO/USUBO selection. IsVALU should be false if the carry-out users are UADDO_CARRY/USUBO_CARRY. However, instruction selection visits uses before defs, so the UADDO_CARRY/USUBO_CARRY nodes are normally (probably always) already converted to S_ADD_CO_PSEUDO/S_SUB_CO_PSEUDO. Fix this by checking for these machine opcodes.
2. Without this fix, UADDO/USUBO selection always selects the VALU instructions V_ADD_CO_U32_e64/V_SUB_CO_U32_e64. S_UADDO_PSEUDO/S_USUBO_PSEUDO were never selected in the CodeGen/AMDGPU tests, so the S_UADDO_PSEUDO/S_USUBO_PSEUDO cases were never hit in EmitInstrWithCustomInserter. The code generation for S_UADDO_PSEUDO/S_USUBO_PSEUDO had a bug where it could not handle a 32-bit $scc_out.
Signed-off-by: John Lu <John.Lu@amd.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
4 days | [AMDGPU] Add GFX12 wave register names with WAVE_ prefix (#144352) | Aleksandar Spasojevic | 1 | -87/+79
Rename canonical register names with the WAVE_ prefix for GFX12. Maintain backward compatibility through aliases.
5 days | AMDGPU: Ensure both wavesize features are not set (#159234) | Matt Arsenault | 6 | -16/+62
Make sure we cannot be in a mode with both wavesizes. This prevents assertions in a future change. This should probably just be an error, but we do not have a good way to report errors from the MCSubtargetInfo constructor.
5 days | [AMDGPU] Fix vector legalization for bf16 valu ops (#158439) | Giuseppe Rossini | 2 | -6/+19
Add v4, v8, v16, v32 legalizations for the following operations:
- `FADD`
- `FMUL`
- `FMA`
- `FCANONICALIZE`
5 days | [TII] Split isTrivialReMaterializable into two versions [nfc] (#160377) | Philip Reames | 2 | -8/+7
This change builds on https://github.com/llvm/llvm-project/pull/160319 which tries to clarify which *callers* (not backends) assume that the result is actually trivial. This change itself should be NFC. Essentially, I'm just renaming the existing isTrivialReMaterializable to the non-trivial version and then adding a new trivial version (with the same name as the prior function) and simplifying a few callers which want that semantic. This change does *not* enable non-trivial remat any more broadly than was already done for our targets which were lying through the old APIs; that will come separately. The goal here is simply to make the code easier to follow in terms of what assumptions are being made where.
Co-authored-by: Luke Lau <luke_lau@icloud.com>
5 days | [AMDGPU] Update comments in memory legalizer. NFC (#160453) | Stanislav Mekhanoshin | 1 | -5/+14
5 days | [AMDGPU][True16][CodeGen] true16 isel pattern for fma_mix_f16/bf16 (#159648) | Brox Chen | 5 | -50/+155
This patch includes:
1. The fma_mix instructions take fp16 inputs but place the operands in vgpr32. Update the selector to insert vgpr32 in true16 mode where necessary.
2. The fma_mix instructions return fp16 outputs but place vdst in vgpr32. Create an fma_mix_t16 pseudo instruction for the isel pattern, and lower it to mix_lo/hi in the MC lowering pass.
These changes stop isel from emitting an illegal `vgpr32 = COPY vgpr16` and improve code quality.
5 days | [AMDGPU] Add the support for 45-bit buffer resource (#159702) | Shilei Tian | 4 | -45/+109
On new targets like `gfx1250`, the buffer resource (V#) now uses this format:

```
base (57-bit): resource[56:0]
num_records (45-bit): resource[101:57]
reserved (6-bit): resource[107:102]
stride (14-bit): resource[121:108]
```

This PR changes the type of `num_records` from `i32` to `i64` in both builtin and intrinsic, and also adds the support for lowering the new format. Fixes SWDEV-554034.
Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
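The bit layout above can be made concrete with a small packing sketch. It is only an illustration of the field positions listed in the commit message: the `BufferRsrc` struct, the helper names, and the choice of two 64-bit words are assumptions, and the actual lowering in the backend is not shown here. Bits 127:122 are not described in the message and are left at zero.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 128-bit descriptor held as two 64-bit words (Lo = bits 63:0).
struct BufferRsrc {
  uint64_t Lo = 0;
  uint64_t Hi = 0;
};

// Mask with the low N bits set; only valid for N < 64.
constexpr uint64_t maskBits(unsigned N) { return (uint64_t(1) << N) - 1; }

inline BufferRsrc packRsrc(uint64_t Base, uint64_t NumRecords, uint32_t Stride) {
  assert(Base <= maskBits(57) && "base must fit in 57 bits");
  assert(NumRecords <= maskBits(45) && "num_records must fit in 45 bits");
  assert(Stride <= maskBits(14) && "stride must fit in 14 bits");
  BufferRsrc R;
  // base occupies bits 56:0; the low 7 bits of num_records fill bits 63:57.
  R.Lo = Base | ((NumRecords & maskBits(7)) << 57);
  // The remaining 38 bits of num_records land in bits 101:64, i.e. Hi[37:0].
  R.Hi = NumRecords >> 7;
  // reserved (bits 107:102, Hi[43:38]) stays zero; stride is bits 121:108,
  // i.e. Hi[57:44].
  R.Hi |= uint64_t(Stride) << 44;
  return R;
}

inline uint64_t getBase(const BufferRsrc &R) { return R.Lo & maskBits(57); }

inline uint64_t getNumRecords(const BufferRsrc &R) {
  return (R.Lo >> 57) | ((R.Hi & maskBits(38)) << 7);
}

inline uint32_t getStride(const BufferRsrc &R) {
  return uint32_t((R.Hi >> 44) & maskBits(14));
}
```

Because `num_records` spans 45 bits and straddles the 64-bit word boundary, a 32-bit value cannot represent its full range, which is consistent with the move from `i32` to `i64` described above.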
5 days | [NFC][AMDGPU] Refactor common declarations (#160406) | LU-JOHN | 1 | -20/+3
Move common declarations from switch cases to function entry.
Signed-off-by: John Lu <John.Lu@amd.com>
6 days | [AMDGPU][AsmParser] Introduce MC representation for lit() and lit64(). (#160316) | Ivan Kosarev | 8 | -81/+220
And rework the lit64() support to use it. The rules for when to add lit64() can be simplified and improved. In this change, however, we just follow the existing conventions on the assembler and disassembler sides. In codegen we do not (and normally should not need to) add explicit lit() and lit64() modifiers, so the codegen tests lose them. The change is an NFCI otherwise. Simplifies printing operands.
6 days | [AMDGPU] SIMemoryLegalizer: Factor out check if memory operations can affect the global AS (#160129) | Fabian Ritter | 1 | -18/+31
Mostly NFC, and adds an assertion for gfx12 to ensure that no atomic scratch instructions are present in the case of GloballyAddressableScratch. This should always hold because of #154710.
6 days | [AMDGPU] Refine GCNHazardRecognizer hasHazard() (#138841) | Carl Ritson | 1 | -31/+106
Remove recursion to avoid stack overflow on large CFGs. Avoid worklist for hazard search within single MachineBasicBlock. Ensure predecessors are visited for all state combinations.
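The general technique, replacing a recursive predecessor walk with an explicit worklist and a visited set keyed on (block, state), can be sketched on a toy CFG. Everything below is hypothetical and chosen for illustration (the `Block` struct, the wait-state budget used as the search state, and the function name); it is not the code in GCNHazardRecognizer.

```cpp
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Toy CFG node; all names here are hypothetical, not LLVM types.
struct Block {
  std::vector<int> Preds; // indices of predecessor blocks
  int WaitStates = 0;     // wait states this block contributes (non-negative)
  bool HasHazard = false; // block contains the hazardous instruction
};

// Returns true if a hazardous block is reachable by walking backwards from
// Start's predecessors while some of the wait-state budget remains.
inline bool hasHazardInPreds(const std::vector<Block> &CFG, int Start,
                             int Budget) {
  assert(Start >= 0 && Start < static_cast<int>(CFG.size()));
  // Explicit worklist of (block, remaining budget) instead of recursion, so
  // the depth of the CFG cannot overflow the call stack.
  std::vector<std::pair<int, int>> Worklist;
  // Visited is keyed on the pair, so a block is revisited when it is reached
  // with a different remaining budget.
  std::set<std::pair<int, int>> Visited;
  for (int P : CFG[Start].Preds)
    Worklist.push_back(std::make_pair(P, Budget));

  while (!Worklist.empty()) {
    std::pair<int, int> Item = Worklist.back();
    Worklist.pop_back();
    const int B = Item.first;
    const int Left = Item.second;
    if (Left <= 0)
      continue; // budget exhausted on this path
    if (!Visited.insert(Item).second)
      continue; // this (block, state) combination was already explored
    if (CFG[B].HasHazard)
      return true;
    for (int P : CFG[B].Preds)
      Worklist.push_back(std::make_pair(P, Left - CFG[B].WaitStates));
  }
  return false;
}
```

Keying the visited set on the state as well as the block is what lets a loop in the CFG terminate without hiding a path that reaches the same block with more budget left.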
6 days | [AMDGPU] SILowerControlFlow: ensure EXEC/SCC interval recompute (#160459) | Carl Ritson | 1 | -5/+4
Ensure live intervals for EXEC and SCC are removed on all paths which generate instructions.
6 days | [AMDGPU] Handle S_GETREG_B32_const in the hazard recognizer. NFCI (#160364) | Stanislav Mekhanoshin | 1 | -1/+1
6 days | [AMDGPU] Support `xor cond, -1` when lowering `BRCOND` (#160341) | Shilei Tian | 1 | -3/+16
This can happen when `xor cond, -1` is not combined.
6 days | [CodeGen] Rename isReallyTriviallyReMaterializable [nfc] | Philip Reames | 2 | -3/+3
... to isReMaterializableImpl. The "Really" naming has always been awkward, and we're working towards removing the "Trivial" part now, so go ahead and remove both pieces in a single rename. Note that this doesn't change any aspect of the current implementation; we still "mostly" only return instructions which are trivial (meaning no virtual register uses), but some targets do lie about that today.
6 days | [AMDGPU] Fix high vgpr printing with true16 (#160209) | Stanislav Mekhanoshin | 2 | -2/+19
6 days | [AMDGPU][AsmParser][NFC] Combine the Lit and Lit64 modifier flags. (#160315) | Ivan Kosarev | 1 | -34/+37
They represent mutually exclusive values of the same attribute.
6 days | [AMDGPU] Fix sub-dword atomic flat saddr store with no D16. NFCI (#160253) | Stanislav Mekhanoshin | 1 | -2/+2
The pattern does not factor in saddr. There is no way to write a test for it, because gfx1200 has neither sram-ecc nor saddr, and gfx1250 has sram-ecc but does not fall into this preserving category. Nevertheless, this would become a problem the day that changes. For now it is enough that the change does not break anything. It was untested before and is untested now, but at least the t16 block uses t16 patterns.
6 days | Revert "[AMDGPU] Elide bitcast fold i64 imm to build_vector" (#160325) | Janek van Oirschot | 3 | -55/+1
Reverts llvm/llvm-project#154115
Co-authored-by: ronlieb <ron.lieberman@amd.com>
6 days | [MCA] Use Bare Reference for InstrPostProcess (#160229) | Aiden Grossman | 2 | -7/+6
This patch makes it so that InstrPostProcess::postProcessInstruction takes in a reference to a mca::Instruction rather than a reference to a std::unique_ptr. Without this, InstrPostProcess cannot be used with MCA instruction recycling because it needs to be called on both newly created instructions and instructions that have been recycled. We only have access to a raw pointer for instructions that have been recycled rather than a reference to the std::unique_ptr that owns them. This patch adds a call in the existing instruction recycling unit test to ensure the API remains compatible with this use case.
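The ownership argument can be shown with stand-in types. In the sketch below, `Instruction`, `postProcess`, and `Recycler` are hypothetical simplifications and not the MCA classes; the point is only that a function taking a plain reference works both for an object held by a `std::unique_ptr` and for one reached through a raw pointer.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for mca::Instruction.
struct Instruction {
  int Latency = 1;
};

// Taking a plain reference places no requirement on how the object is owned.
inline void postProcess(Instruction &Inst) { Inst.Latency += 1; }

// Hypothetical recycling pool: it owns the instructions and hands out raw
// pointers to callers that want to reuse one.
class Recycler {
  std::vector<std::unique_ptr<Instruction>> Pool;

public:
  void release(std::unique_ptr<Instruction> I) { Pool.push_back(std::move(I)); }

  Instruction *reuse() {
    assert(!Pool.empty() && "nothing to recycle");
    return Pool.back().get();
  }
};
```

A caller can write `postProcess(*Fresh)` for a freshly created `std::unique_ptr<Instruction>` and `postProcess(*Pool.reuse())` for a recycled one. A signature taking `std::unique_ptr<Instruction> &` would rule out the second call, because the caller only has a raw pointer at that point.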
7 days | [AMDGPU] Insert waitcnt for non-global fence release in GFX12 (#159282) | Fabian Ritter | 1 | -38/+38
A fence release could be followed by a barrier, so it should wait for the relevant memory accesses to complete, even if it is mmra-limited to LDS. So far, that would be skipped for non-global fence releases. Fixes SWDEV-554932.
7 days | [MIR] Support save/restore points with independent sets of registers (#119358) | Elizaveta Noskova | 1 | -2/+4
This patch adds the MIR parsing and serialization support for save and restore points with subsets of callee saved registers. That is, it syntactically allows a function to contain two or more distinct sub-regions in which distinct subsets of registers are spilled/filled as callee save. This is useful if e.g. one of the CSRs isn't modified in one of the sub-regions, but is in the other(s). Support for actually using this capability in code generation is still forthcoming.

This patch is the next logical step for multiple save/restore points support. All points are now stored in a DenseMap from MBB to a vector of CalleeSavedInfo.

Shrink-Wrap points split Part 4.
RFC: https://discourse.llvm.org/t/shrink-wrap-save-restore-points-splitting/83581
Part 1: https://github.com/llvm/llvm-project/pull/117862 (landed)
Part 2: https://github.com/llvm/llvm-project/pull/119355 (landed)
Part 3: https://github.com/llvm/llvm-project/pull/119357 (landed)
Part 5: https://github.com/llvm/llvm-project/pull/119359 (likely to be further split)
7 days | [AMDGPU] Add PAL metadata names for 32 user SGPRs (#160126) | Jay Foad | 1 | -0/+16
Since #154205 some subtargets can use up to 32 user SGPRs. Add names for them all so they can be pretty printed in PAL metadata.
7 days | [AMDGPU] Skip debug instructions in SIShrinkInstructions::matchSwap (#160123) | Jay Foad | 1 | -1/+6
7 days | [NFC][AMDGPU] Streamline code (#160177) | LU-JOHN | 1 | -18/+6
Streamline code by only declaring TRI/TII once and using isWave64().
Signed-off-by: John Lu <John.Lu@amd.com>
7 days | [AMDGPU] Use unsigned overflow for S_UADDO_PSEUDO/S_USUBO_PSEUDO (#160142) | LU-JOHN | 1 | -2/+2
Use the correct unsigned overflow instructions for S_UADDO_PSEUDO/S_USUBO_PSEUDO. Note that this issue was hidden because instruction selection never selected S_UADDO_PSEUDO/S_USUBO_PSEUDO, which will be addressed in https://github.com/llvm/llvm-project/pull/159814.
Signed-off-by: John Lu <John.Lu@amd.com>
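The difference between unsigned and signed overflow is easy to see in scalar code. The sketch below models the semantics of unsigned add/subtract with a carry-out and contrasts them with signed overflow; the helper names are hypothetical, and the code says nothing about which machine instructions the backend emits.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Unsigned add with carry-out: the carry is set exactly when the 32-bit sum
// wrapped around, which shows up as Sum < A.
inline std::pair<uint32_t, bool> uaddo(uint32_t A, uint32_t B) {
  uint32_t Sum = A + B;
  return std::make_pair(Sum, Sum < A);
}

// Unsigned subtract with borrow-out: the borrow is set exactly when A < B.
inline std::pair<uint32_t, bool> usubo(uint32_t A, uint32_t B) {
  return std::make_pair(A - B, A < B);
}

// Signed overflow is a different condition, checked here in a wider type.
inline bool saddOverflows(int32_t A, int32_t B) {
  int64_t Wide = int64_t(A) + int64_t(B);
  return Wide < INT32_MIN || Wide > INT32_MAX;
}

// A 64-bit add built from 32-bit halves: the carry-out of the low half feeds
// the high half, which is the role a carry-consuming add plays.
inline uint64_t add64ViaHalves(uint64_t A, uint64_t B) {
  std::pair<uint32_t, bool> Lo = uaddo(uint32_t(A), uint32_t(B));
  uint32_t Hi = uint32_t(A >> 32) + uint32_t(B >> 32) + (Lo.second ? 1u : 0u);
  uint64_t Result = (uint64_t(Hi) << 32) | Lo.first;
  assert(Result == A + B && "carry chain must match a native 64-bit add");
  return Result;
}
```

For example, `0xFFFFFFFF + 1` produces an unsigned carry but no signed overflow (it is `-1 + 1` when read as signed), while `0x7FFFFFFF + 1` overflows as a signed add but produces no unsigned carry. Using the signed condition for a carry chain would therefore give wrong results.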
7 days | [AMDGPU] Simplify "class HasMember##member" with llvm::is_detected (NFC) (#160037) | Kazu Hirata | 1 | -10/+2
"class HasMember##member" detects a specific member with a complex SFINAE logic involving multiple inheritance. This patch simplifies that by switching to llvm::is_detected.
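The detection idiom can be sketched without LLVM headers. The block below hand-rolls a minimal `is_detected` on top of `std::void_t` (C++17) as a stand-in for `llvm::is_detected`; the probed member name `Count` and the two test structs are hypothetical and only serve the illustration.

```cpp
#include <type_traits>
#include <utility>

// Minimal stand-in for llvm::is_detected, built on std::void_t (C++17).
template <typename, template <typename...> class Op, typename... Args>
struct Detector : std::false_type {};

// Chosen only when Op<Args...> is a well-formed type.
template <template <typename...> class Op, typename... Args>
struct Detector<std::void_t<Op<Args...>>, Op, Args...> : std::true_type {};

template <template <typename...> class Op, typename... Args>
using is_detected = Detector<void, Op, Args...>;

// The operation to probe: is `obj.Count` a valid member access on T?
template <typename T> using CountMemberT = decltype(std::declval<T &>().Count);

struct WithCount { int Count; };    // hypothetical type that has the member
struct WithoutCount { int Other; }; // hypothetical type that lacks it

static_assert(is_detected<CountMemberT, WithCount>::value, "member detected");
static_assert(!is_detected<CountMemberT, WithoutCount>::value, "member absent");

// Read the member when it exists, otherwise fall back to 0.
template <typename T> constexpr int countOrZero(const T &V) {
  if constexpr (is_detected<CountMemberT, T>::value)
    return V.Count;
  else
    return 0;
}
```

Compared with a hand-written SFINAE class per member, the only per-member code is the one-line alias that names the expression to test.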
7 days | [AMDGPU] Skip debug uses in SIInstrInfo::foldImmediate (#160102) | Jay Foad | 1 | -2/+2
7 days | [AMDGPU] Skip debug uses in SIPeepholeSDWA (#160092) | Jay Foad | 1 | -1/+2
8 days | [AMDGPU] Simplify template metaprogramming in IsMCExpr##member (NFC) (#160005) | Kazu Hirata | 1 | -8/+5
Without this patch, we compute a type trait in a roundabout manner:
- Compute a boolean value in the primary template.
- Pass the value to std::enable_if_t.
- Return std::true_type (or std::false_type on the fallback path).
- Compare the return type to std::true_type.
That is, when the expression for the first boolean value above is well formed, we already have the answer we are looking for. This patch bypasses the entire sequence by having the primary template return std::bool_constant and adjusting RESULT to extract the ::value of the boolean type.
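The before and after shapes can be compared on a toy trait. The sketch below is a hypothetical reconstruction of the pattern described above, not the code from the patch: it asks whether a type has a member named `Field` of pointer type, first through the `enable_if_t` round trip and then through `std::bool_constant` (C++17).

```cpp
#include <type_traits>

// Roundabout form: compute a boolean, feed it to enable_if_t, return
// true_type, then compare the deduced return type with true_type.
template <typename T> struct IsPointerFieldOld {
  template <typename U>
  static auto test(int)
      -> std::enable_if_t<std::is_pointer<decltype(U::Field)>::value,
                          std::true_type>;
  template <typename U> static std::false_type test(...);
  static constexpr bool value =
      std::is_same<decltype(test<T>(0)), std::true_type>::value;
};

// Direct form: return bool_constant of the boolean and read its ::value.
template <typename T> struct IsPointerFieldNew {
  template <typename U>
  static auto test(int)
      -> std::bool_constant<std::is_pointer<decltype(U::Field)>::value>;
  template <typename U> static std::false_type test(...);
  static constexpr bool value = decltype(test<T>(0))::value;
};

struct PtrField { int *Field; }; // hypothetical: member is a pointer
struct IntField { int Field; };  // hypothetical: member is not a pointer
struct NoField {};               // hypothetical: member is missing

// Both forms agree on all three cases.
static_assert(IsPointerFieldOld<PtrField>::value &&
                  IsPointerFieldNew<PtrField>::value, "pointer member");
static_assert(!IsPointerFieldOld<IntField>::value &&
                  !IsPointerFieldNew<IntField>::value, "non-pointer member");
static_assert(!IsPointerFieldOld<NoField>::value &&
                  !IsPointerFieldNew<NoField>::value, "no member");
```

In the direct form the fallback overload is still needed for types where the expression is ill-formed, but the `enable_if_t` and `is_same` steps disappear.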
10 days | [AMDGPU]: Unpack packed instructions overlapped by MFMAs post-RA scheduling (#157968) | Akash Dutta | 3 | -5/+398
This is a cleaned up version of PR #151704. These optimizations are now performed post-RA scheduling.
11 days | CodeGen: Add RegisterClass by HwMode (#158269) | Matt Arsenault | 3 | -4/+9
This is a generalization of the LookupPtrRegClass mechanism. AMDGPU has several use cases for swapping the register class of instruction operands based on the subtarget, but none of them really fit into the box of being pointer-like. The current system requires manual management of an arbitrary integer ID. For the AMDGPU use case, this would end up being around 40 new entries to manage. This just introduces the base infrastructure. I have ports of all the target specific usage of PointerLikeRegClass ready.
11 days | [SDAG][AMDGPU] Allow opting in to OOB-generating PTRADD transforms (#146074) | Fabian Ritter | 2 | -49/+13
This PR adds a TargetLowering hook, canTransformPtrArithOutOfBounds, that targets can use to allow transformations to introduce out-of-bounds pointer arithmetic. It also moves two such transformations from the AMDGPU-specific DAG combines to the generic DAGCombiner. This is motivated by target features like AArch64's checked pointer arithmetic, CPA, which does not tolerate the introduction of out-of-bounds pointer arithmetic.
11 days | [AMDGPU][SDAG] Handle ISD::PTRADD in various special cases (#145330) | Fabian Ritter | 2 | -6/+7
There are more places in SIISelLowering.cpp and AMDGPUISelDAGToDAG.cpp that check for ISD::ADD in a pointer context, but as far as I can tell those are only relevant for 32-bit pointer arithmetic (like frame indices/scratch addresses and LDS), for which we don't enable PTRADD generation yet. For SWDEV-516125.
11 days | AMDGPU: Remove unnecessary AGPR legalize logic (#159491) | Matt Arsenault | 1 | -13/+0
The manual legalizeOperands code only needs to consider cases that require full instruction context to know if the operand is legal. It does not need to handle basic operand register class constraints.
11 days | [AMDGPU] gfx1251 VOP3 dpp support (#159654) | Stanislav Mekhanoshin | 3 | -51/+92
11 days | [AMDGPU] gfx1251 VOP2 dpp support (#159641) | Stanislav Mekhanoshin | 1 | -34/+45
11 days | [AMDGPU] gfx1251 VOP1 dpp support (#159637) | Stanislav Mekhanoshin | 1 | -22/+43
11 days | [AMDGPU][SDAG] Handle ISD::PTRADD in VOP3 patterns (#143881) | Fabian Ritter | 1 | -5/+20
This patch mirrors similar patterns for ISD::ADD. The main difference is that ISD::ADD is commutative, so that a pattern definition for, e.g., (add (mul x, y), z), automatically also handles (add z, (mul x, y)). ISD::PTRADD is not commutative, so we would need to handle these cases explicitly. This patch only implements (ptradd z, (op x, y)) patterns, where the nested operation (shift or multiply) is the offset of the ptradd (i.e., the right operand), since base pointers that are the result of a shift or multiply seem less likely. For SWDEV-516125.
11 days | [AMDGPU][SIInsertWaitcnts] Track SCC. Insert KM_CNT waits for SCC writes. (#157843) | Petar Avramovic | 1 | -6/+75
Add a new event, SCC_WRITE, for s_barrier_signal_isfirst and s_barrier_leave, instructions that write to SCC; the counter is KM_CNT. Also start tracking SCC for reads and writes. An s_barrier_wait on the same barrier guarantees that the SCC write from s_barrier_signal_isfirst has landed, so there is no need to insert s_wait_kmcnt.
12 days | AMDGPU: Remove unnecessary operand legalization for WMMAs (#159370) | Matt Arsenault | 1 | -15/+0
The operand constraints already express this constraint, and InstrEmitter will respect them.