path: root/llvm/lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
Age · Commit message · Author · Files/Lines
13 days · [AMDGPU] Set TGID_EN_X/Y/Z when cluster ID intrinsics are used (#159120) · Shilei Tian · 1 file, -3/+6
Hardware initializes a single value in ttmp9 which is either the workgroup ID X or cluster ID X. Most of this patch is a refactoring to use a single `PreloadedValue` enumerator for this value, instead of two enumerators `WORKGROUP_ID_X` and `CLUSTER_ID_X` referring to the same value. This makes it simpler to have a single attribute `amdgpu-no-workgroup-id-x` indicating that this value is not used, which in turn sets the TGID_EN_X bit appropriately to tell the hardware whether to initialize it. All of the above applies to Y and Z similarly. Fixes: LWPSCGFX13-568 Co-authored-by: Jay Foad <jay.foad@amd.com>
2025-09-12 · [AMDGPU] Support lowering of cluster related intrinsics (#157978) · Shilei Tian · 1 file, -0/+2
Since much of this code is interconnected, this also changes how the workgroup ID is lowered. Co-authored-by: Jay Foad <jay.foad@amd.com> Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
2025-09-11 · [AMDGPU] Use subtarget call to determine number of VGPRs (#157927) · Stanislav Mekhanoshin · 1 file, -1/+1
Since the register file was increased, it is no longer valid to call VGPR_32RegClass.getNumRegs() to get the total number of architected registers available on a subtarget. Fixes: SWDEV-550425
2025-08-09 · AMDGPU: Add missing static to cl::opt (#152747) · Matt Arsenault · 1 file, -1/+1
2025-08-08 · [AMDGPU] AsmPrinter: Unify arg handling (#151672) · Diana Picus · 1 file, -0/+4
When computing the number of registers required by entry functions, the `AMDGPUAsmPrinter` needs to take into account both the register usage computed by the `AMDGPUResourceUsageAnalysis` pass, and the number of registers initialized by the hardware.

At the moment, the way it computes the latter is different for graphics vs compute, due to differences in the implementation. For kernels, all the information needed is available in the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over the `Function` arguments in the `AMDGPUAsmPrinter`. This pretty much repeats some of the logic from instruction selection.

This patch introduces 2 new members to `SIMachineFunctionInfo`, one for SGPRs and one for VGPRs. Both will be computed during instruction selection and then used during `AMDGPUAsmPrinter`, removing the need to refer to the `Function` when printing assembly.

This patch is NFC except for the fact that we now add the extra SGPRs (VCC, XNACK etc) to the number of SGPRs computed for graphics entry points. I'm not sure why these weren't included before. It would be nice if someone could confirm if that was just an oversight or if we have some docs somewhere that I haven't managed to find. Only one test is affected (its SGPR usage increases because we now take into account the XNACK registers).
2025-07-25 · AMDGPU: Fix -amdgpu-mfma-vgpr-form flag on gfx908 (#150599) · Matt Arsenault · 1 file, -5/+9
This should be ignored since there are no VGPR forms. This makes it possible to flip the default for the flag to true.
2025-07-21 · [AMDGPU] ISel & PEI for whole wave functions (#145858) · Diana Picus · 1 file, -2/+7
Whole wave functions are functions that will run with a full EXEC mask. They will not be invoked directly, but instead will be launched by way of a new intrinsic, `llvm.amdgcn.call.whole.wave` (to be added in a future patch). These functions are meant as an alternative to the `llvm.amdgcn.init.whole.wave` or `llvm.amdgcn.strict.wwm` intrinsics.

Whole wave functions will set EXEC to -1 in the prologue and restore the original value of EXEC in the epilogue. They must have a special first argument, `i1 %active`, that is going to be mapped to EXEC. They may have either the default calling convention or amdgpu_gfx. The inactive lanes need to be preserved for all registers used, active lanes only for the CSRs.

At the IR level, arguments to a whole wave function (other than `%active`) contain poison in their inactive lanes. Likewise, the return value for the inactive lanes is poison.

This patch contains the following work:
* 2 new pseudos, SI_SETUP_WHOLE_WAVE_FUNC and SI_WHOLE_WAVE_FUNC_RETURN, used for managing the EXEC mask. SI_SETUP_WHOLE_WAVE_FUNC will return a SReg_1 representing `%active`, which needs to be passed into SI_WHOLE_WAVE_FUNC_RETURN.
* SelectionDAG support for generating these 2 new pseudos and the special handling of %active. Since the return may be in a different basic block, it's difficult to add the virtual reg for %active to SI_WHOLE_WAVE_FUNC_RETURN, so we initially generate an IMPLICIT_DEF which is later replaced via a custom inserter.
* Expansion of the 2 pseudos during prolog/epilog insertion. PEI also marks any used VGPRs as WWM registers, which are then spilled and restored with the usual logic.

Future patches will include the `llvm.amdgcn.call.whole.wave` intrinsic and a lot of optimization work (especially in order to reduce spills around function calls).
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Co-authored-by: Shilei Tian <i@tianshilei.me>
2025-07-18 · [AMDGPU] Provide control to force VGPR MFMA form (#148079) · Jeffrey Byrnes · 1 file, -2/+12
This gives the user an override to force selection of the VGPR form of MFMA. Eventually we will drop this in favor of the compiler making better decisions, but it provides a mechanism for users to address cases where MayNeedAGPRs favors the AGPR form and performance is degraded due to poor RA.
2025-07-15 · [AMDGPU] Remove an unnecessary cast (NFC) (#148868) · Kazu Hirata · 1 file, -1/+1
STI is already of type const GCNSubtarget *.
2025-06-24 · [AMDGPU] Replace dynamic VGPR feature with attribute (#133444) · Diana Picus · 1 file, -0/+7
Use a function attribute (amdgpu-dynamic-vgpr) instead of a subtarget feature, as requested in #130030.
2025-04-13 · [AMDGPU] Use llvm::find and llvm::find_if (NFC) (#135582) · Kazu Hirata · 1 file, -1/+1
2025-03-19 · [AMDGPU] Allocate scratch space for dVGPRs for CWSR (#130055) · Diana Picus · 1 file, -1/+2
The CWSR trap handler needs to save and restore the VGPRs. When dynamic VGPRs are in use, the fixed function hardware will only allocate enough space for one VGPR block. The rest will have to be stored in scratch, at offset 0. This patch allocates the necessary space by:
- generating a prologue that checks at runtime if we're on a compute queue (since CWSR only works on compute queues); for this we will have to check the ME_ID bits of the ID_HW_ID2 register
- if that is non-zero, we can assume we're on a compute queue and initialize the SP and FP with enough room for the dynamic VGPRs
- forcing all compute entry functions to use a FP so they can access their locals/spills correctly (this isn't ideal but it's the quickest to implement)

Note that at the moment we allocate enough space for the theoretical maximum number of VGPRs that can be allocated dynamically (for blocks of 16 registers, this will be 128, of which we subtract the first 16, which are already allocated by the fixed function hardware). Future patches may decide to allocate less if they can prove the shader never allocates that many blocks. Also note that this should not affect any reported stack sizes (e.g. PAL backend_stack_size etc).
2025-03-14 · [AMDGPU] Avoid repeated hash lookups (NFC) (#131419) · Kazu Hirata · 1 file, -5/+7
2025-03-06 · AMDGPU: Replace amdgpu-no-agpr with amdgpu-agpr-alloc (#129893) · Matt Arsenault · 1 file, -1/+4
This performs the minimal replacement of amdgpu-no-agpr with amdgpu-agpr-alloc=0. Most of the test diffs are due to the new attribute sorting later alphabetically. We could do better by trying to perform range merging in the attributor, and trying to pick non-0 values.
2025-02-23 · AMDGPU: Respect amdgpu-no-agpr in functions and with calls (#128147) · Matt Arsenault · 1 file, -48/+6
Remove the MIR scan to detect whether AGPRs are used or not, and the special case for callable functions. This behavior was confusing, and not overridable. The amdgpu-no-agpr attribute was intended to avoid this imprecise heuristic for how many AGPRs to allocate. It was also too confusing to make this interact with the pending amdgpu-num-agpr replacement for amdgpu-no-agpr. Also adds an xfail-ish test where the register allocator asserts after allocation fails which I ran into. Future work should reintroduce a more refined MIR scan to estimate AGPR pressure for how to split AGPRs and VGPRs.
2025-01-23 · [AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748) · Lucas Ramirez · 1 file, -3/+2
Occupancy (i.e., the number of waves per EU) depends, in addition to register usage, on per-workgroup LDS usage as well as on the range of possible workgroup sizes. Mirroring the latter, occupancy should therefore be expressed as a range since different group sizes generally yield different achievable occupancies. `getOccupancyWithLocalMemSize` currently returns a scalar occupancy based on the maximum workgroup size and LDS usage. With respect to the workgroup size range, this scalar can be the minimum, the maximum, or neither of the two bounds of the range of achievable occupancies. This commit fixes the function by making it compute and return the range of achievable occupancies w.r.t. workgroup size and LDS usage; it also renames it to `getOccupancyWithWorkGroupSizes` since it is the range of workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved. Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum occupancies, i.e., sometimes workgroup sizes inside the range yield the occupancy bounds. The implementation finds these sizes in constant time; heavy documentation explains the rationale behind the sometimes relatively obscure calculations. As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU, 64-wide waves. Also consider a function with no LDS usage and a flat workgroup size range of [513,1024].
- A group of 513 items requires 9 waves per group. Only 4 groups made up of 9 waves each can fit fully on a CU at any given time, for a total of 36 waves on the CU, or 9 per EU. However, filling as much as possible the remaining 40-36=4 wave slots without decreasing the number of groups reveals that a larger group of 640 items yields 40 waves on the CU, or 10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2 groups made up of 16 waves each can fit fully on a CU at any given time, for a total of 32 waves on the CU, or 8 per EU. However, removing as many waves as possible from the groups without being able to fit another equal-sized group on the CU reveals that a smaller group of 896 items yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9] as the group size bounds directly yield, but [7,10]. Naturally this change causes a lot of test churn as instruction scheduling is driven by achievable occupancy estimates. In most unit tests the flat workgroup size range is the default [1,1024] which, ignoring potential LDS limitations, would previously produce a scalar occupancy of 8 (derived from 1024) on a lot of targets, whereas we now consider the maximum occupancy to be 10 in such cases. Most tests are updated automatically and checked manually for sanity. I also manually changed some non-automatically generated assertions when necessary. Fixes #118220.
2025-01-17 · [AMDGPU] Fix printing hasInitWholeWave in mir (#123232) · Stanislav Mekhanoshin · 1 file, -1/+1
2025-01-10 · [AMDGPU] Add backward compatibility layer for kernarg preloading (#119167) · Austin Kerbow · 1 file, -0/+2
Add a prologue to the kernel entry to handle cases where code designed for kernarg preloading is executed on hardware equipped with incompatible firmware. If hardware has compatible firmware the 256 bytes at the start of the kernel entry will be skipped. This skipping is done automatically by hardware that supports the feature. A pass is added which is intended to be run at the very end of the pipeline to avoid any optimizations that would assume the prologue is a real predecessor block to the actual code start. In reality we have two possible entry points for the function:
1. The optimized path that supports kernarg preloading, which begins at an offset of 256 bytes.
2. The backwards compatible entry point, which starts at offset 0.
2024-12-18 · [AMDGPU] Make max dwords of memory cluster configurable (#119342) · Ruiling, Song · 1 file, -4/+8
We find it helpful to increase the value for graphics workloads. Make it configurable so we can experiment with different values.
2024-11-07 · Revert "[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes" (#115353) · dyung · 1 file, -5/+4
Reverts llvm/llvm-project#115291 Reverting due to test failures on many bots including https://lab.llvm.org/buildbot/#/builders/174/builds/8049
2024-11-07 · [AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes (#115291) · Akshat Oke · 1 file, -4/+5
2024-11-05 · [AMDGPU][MIR] Serialize SpillPhysVGPRs (#113129) · Akshat Oke · 1 file, -0/+3
2024-10-03 · [AMDGPU] Qualify auto. NFC. (#110878) · Jay Foad · 1 file, -1/+1
Generated automatically with: $ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)
2024-09-30 · [AMDGPU] Split vgpr regalloc pipeline (#93526) · Christudasan Devadasan · 1 file, -6/+22
Allocating wwm-registers and per-thread VGPR operands together imposes many challenges in the way the registers are reused during allocation. At times, regalloc reuses the registers of regular VGPR operations for wwm-operations in a small range, inadvertently clobbering their inactive lanes and causing correctness issues that are hard to trace. This patch splits the VGPR allocation pipeline further to allocate wwm-registers first and the regular VGPR operands in a separate pipeline. The splitting ensures that the physical registers used for wwm allocations won't take part in the next allocation pipeline, avoiding any such clobbering.
2024-09-26 · [AMDGPU] Merge the conditions used for deciding CS spills for amdgpu_cs_chain[_preserve] (#109911) · Christudasan Devadasan · 1 file, -2/+8
Multiple conditions exist to decide whether callee save spills/restores are required for the amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions. This patch consolidates them all and moves them to a single place.
2024-09-19 · [LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133) · Jay Foad · 1 file, -1/+1
It is almost always simpler to use {} instead of std::nullopt to initialize an empty ArrayRef. This patch changes all occurrences I could find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor could be deprecated or removed.
2024-08-19 · [AMDGPU] Move AMDGPUCodeGenPassBuilder into AMDGPUTargetMachine (NFC) (#103720) · Christudasan Devadasan · 1 file, -1/+0
This will allow us to reuse the existing flags and the static functions while building the pipeline for the new pass manager.
2024-07-17 · [AMDGPU] clang-tidy: no else after return etc. NFC. (#99298) · Jay Foad · 1 file, -1/+2
2024-07-17 · [AMDGPU] Use range-based for loops. NFC. (#99047) · Jay Foad · 1 file, -3/+3
2024-07-17 · [AMDGPU] clang-tidy: use emplace_back instead of push_back. NFC. · Jay Foad · 1 file, -4/+2
2024-07-16 · [AMDGPU] clang-tidy: replace macro with enum. NFC. · Jay Foad · 1 file, -1/+1
2024-06-25 · AMDGPU: Add plumbing for private segment size argument (#96445) · Nicolai Hähnle · 1 file, -0/+6
The actual size of scratch/private is determined at dispatch time, so add more plumbing to request it. Will be used in a subsequent change.
2024-06-25 · AMDGPU: Remove an outdated TODO (#96446) · Nicolai Hähnle · 1 file, -1/+0
We have a fixed calling convention for the stack pointer and frame pointer; we shouldn't try to shift anything around.
2024-04-29 · [AMDGPU] Fix typo in #89773 · Jay Foad · 1 file, -1/+1
Fixes #90281
2024-04-24 · [AMDGPU] Allow WorkgroupID intrinsics in amdgpu_gfx functions (#89773) · Jay Foad · 1 file, -1/+2
With GFX12 architected SGPRs the workgroup ids are trivially available in any function called from a compute entrypoint.
2024-03-21 · AMDGPU: Infer no-agpr usage in AMDGPUAttributor (#85948) · Matt Arsenault · 1 file, -29/+1
SIMachineFunctionInfo scans the function body for inline asm which may use AGPRs, and for callees recorded in SIMachineFunctionInfo. Move this into the attributor, so it actually works interprocedurally. Could probably avoid most of the test churn if this bothered to avoid adding this on subtargets without AGPRs. We should also probably try to delete the MIR scan in usesAGPRs but it seems to be trickier to eliminate.
2024-03-12 · [AMDGPU] Adding the amdgpu_num_work_groups function attribute (#79035) · Jun Wang · 1 file, -0/+2
A new function attribute named amdgpu_num_work_groups is added. This attribute, which consists of three integers, allows programmers to let the compiler know the number of workgroups to be launched in each of the three dimensions and do optimizations based on that information. --------- Co-authored-by: Jun Wang <jun.wang7@amd.com>
2024-02-09 · [AMDGPU] Don't fix the scavenge slot at offset 0 (#79136) · Diana Picus · 1 file, -8/+4
At the moment, the emergency spill slot is a fixed object for entry functions and chain functions, and a regular stack object otherwise. This patch adopts the latter behaviour for entry/chain functions too. It seems this was always the intention [1] and it will also save us a bit of stack space in cases where the first stack object has a large alignment. [1] https://github.com/llvm/llvm-project/commit/34c8b835b16fb3879f1b9770e91df21883356bb6
2024-01-24 · [AMDGPU] Pick available high VGPR for CSR SGPR spilling (#78669) · Christudasan Devadasan · 1 file, -11/+43
CSR SGPR spilling currently uses the earliest available physical VGPRs. It imposes high register pressure while trying to allocate large VGPR tuples within the default register budget. This patch changes the spilling strategy by picking the VGPRs in reverse order, the highest available VGPR first, and after regalloc shifting them back to the lowest available range. With that, the initial VGPRs are available for allocation and the likelihood of finding a large number of contiguous registers is higher.
2023-12-17 · [AMDGPU] Track physical VGPRs used for SGPR spills (#75573) · Carl Ritson · 1 file, -1/+2
Physical VGPRs used for SGPR spills need to be tracked independently of WWM reserved registers. The WWM reserved set contains extra registers allocated during the WWM pre-allocation pass. This causes SGPR spills allocated after WWM pre-allocation to overlap with WWM register usage, e.g. if the frame pointer is spilled during prologue/epilogue insertion.
2023-12-13 · [AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030) · Piotr Sobczak · 1 file, -1/+1
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2023-12-11 · [llvm] Use StringRef::{starts,ends}_with (NFC) (#74956) · Kazu Hirata · 1 file, -1/+1
This patch replaces uses of StringRef::{starts,ends}with with StringRef::{starts,ends}_with for consistency with std::{string,string_view}::{starts,ends}_with in C++20. I'm planning to deprecate and eventually remove StringRef::{starts,ends}with.
2023-12-06 · [AMDGPU] Introduce isBottomOfStack helper. NFC (#74288) · Diana Picus · 1 file, -1/+1
Introduce a helper to check if a function is at the bottom of the stack, i.e. if it's an entry function or a chain function. This was suggested in #71913.
2023-11-14 · [AMDGPU] Use immediates for stack accesses in chain funcs (#71913) · Diana · 1 file, -1/+1
Switch to using immediate offsets instead of the SP register to access objects on the current stack frame in chain functions. This means we no longer need to reserve an SP register just for accessing stack objects, and it also allows us to set the SP (when one is actually needed) to the stack size from the very beginning. This only works if we use a FixedObject for the ScavengeFI, which is what we do for entry functions anyway (and we generally want to keep chain functions close to amdgpu_cs behaviour where we don't have a good reason to diverge).
2023-11-08 · [AMDGPU] Callee saves for amdgpu_cs_chain[_preserve] (#71526) · Diana · 1 file, -0/+8
Teach prolog epilog insertion how to handle functions with the amdgpu_cs_chain or amdgpu_cs_chain_preserve calling conventions. For amdgpu_cs_chain functions, we only need to preserve the inactive lanes of VGPRs above v8, and only in the presence of calls via @llvm.amdgcn.cs.chain. For amdgpu_cs_chain_preserve functions, we will also need to preserve the active lanes for registers above the last argument VGPR. AFAICT there's no direct way to find out what the last argument VGPR is, so instead the patch uses the fact that chain calls from amdgpu_cs_chain_preserve functions can't use more VGPRs than the caller's VGPR arguments. In other words, it removes the operands of SI_CS_CHAIN_TC instructions from the list of callee saved registers. For both calling conventions, registers v0-v7 never need to be saved and restored, so we should never add them as WWM spills. Differential Revision: https://reviews.llvm.org/D156412
2023-09-25 · [AMDGPU] Add DAG ISel support for preloaded kernel arguments · Austin Kerbow · 1 file, -0/+28
This patch adds the DAG isel changes for kernel argument preloading. These changes are not usable with older firmware but subsequent patches in the series will make the codegen backwards compatible. This patch should only be submitted alongside that subsequent patch. Preloading here begins from the start of the kernel arguments until the amount of arguments indicated by the CL flag amdgpu-kernarg-preload-count. Aggregates and arguments passed by-ref are not supported. Special care for the alignment of the kernarg segment is needed as well as consideration of the alignment of addressable SGPR tuples when we cannot directly use misaligned large tuples that the arguments are loaded to. Reviewed By: bcahoon Differential Revision: https://reviews.llvm.org/D158579
2023-09-12 · [AMDGPU] Add utilities to track number of user SGPRs. NFC. · Austin Kerbow · 1 file, -60/+10
Factor out and unify some common code that calculates and tracks the number of user SGPRs. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D159439
2023-08-21 · [AMDGPU] ISel for amdgpu_cs_chain[_preserve] functions · Diana Picus · 1 file, -1/+14
Lower formal arguments and returns for functions with the `amdgpu_cs_chain` and `amdgpu_cs_chain_preserve` calling conventions:
* Put `inreg` arguments into SGPRs, starting at s0, and other arguments into VGPRs, starting at v8. No arguments should end up on the stack; if we don't have enough registers we should error out.
* Lower the return (which is always void) as an S_ENDPGM.
* Set the ScratchRSrc register to s48:51, as described in the docs.
* Set the SP to s32, matching amdgpu_gfx. This might be revisited in a future patch.

Differential Revision: https://reviews.llvm.org/D153517
2023-07-31 · Reapply "[CodeGen] Allow targets to use target specific COPY instructions for live range splitting" · Matt Arsenault · 1 file, -35/+34
This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd. The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work around the underlying problem with SUBREG_TO_REG.
2023-07-26 · Revert "[CodeGen] Allow targets to use target specific COPY instructions for live range splitting" · Vitaly Buka · 1 file, -34/+35
And dependent commits. Details in D150388. This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c. This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e. This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826. This reverts commit b7836d856206ec39509d42529f958c920368166b. No conflicts in the code; a few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll
Reviewed By: alexfh Differential Revision: https://reviews.llvm.org/D156381