path: root/llvm/test/CodeGen/AMDGPU/udiv.ll
Age  Commit message  (Author; files changed, -lines removed/+lines added)
2024-02-06  [AMDGPU] Use correct number of bits needed for div/rem shrinking (#80622)  (choikwa; 1 file, -127/+81)
There was an error where a dividend of type i64 whose actual number of used bits was 32 fell into the path that assumes only 24 bits are used. Check that the AtLeast field is used correctly when calling computeNumSignBits and add the necessary extend/trunc for the 32-bit path. Regolden and update testcases. @jrbyrnes @bcahoon @arsenm @rampitec
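For illustration, a minimal IR sketch (hypothetical, not copied from udiv.ll) of the kind of case the fix targets: an i64 division whose operands are known to fit in 32 bits, but not in 24, and so must take the 32-bit shrinking path rather than the 24-bit one:
```
; operands are masked to 32 bits, so the divide can be shrunk to 32-bit,
; but it must not be treated as a 24-bit divide
define i64 @udiv64_operands_fit_in_32_bits(i64 %a, i64 %b) {
  %a32 = and i64 %a, 4294967295
  %b32 = and i64 %b, 4294967295
  %r = udiv i64 %a32, %b32
  ret i64 %r
}
```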
2024-01-16  [AMDGPU,test] Change llc -march= to -mtriple= (#75982)  (Fangrui Song; 1 file, -1/+1)
Similar to 806761a7629df268c8aed49657aeccffa6bca449. For IR files without a target triple, -mtriple= specifies the full target triple while -march= merely sets the architecture part of the default target triple, leaving a target triple which may not make sense, e.g. amdgpu-apple-darwin. Therefore, -march= is error-prone and not recommended for tests without a target triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead of rejecting it outright. This patch changes AMDGPU tests to not rely on the default OS/environment components. Tests that need fixes are not changed:
```
LLVM :: CodeGen/AMDGPU/fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fabs.ll
LLVM :: CodeGen/AMDGPU/floor.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll
LLVM :: CodeGen/AMDGPU/fneg-fabs.ll
LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll
LLVM :: CodeGen/AMDGPU/schedule-if-2.ll
```
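A sketch of what this kind of change looks like in a test's RUN lines (illustrative only, not copied from udiv.ll; the -mcpu value is an assumption):
```
; RUN line using only the architecture; OS/environment come from the host's default triple:
; RUN: llc -march=amdgcn -mcpu=tonga < %s | FileCheck %s
; RUN line pinning the full target triple instead:
; RUN: llc -mtriple=amdgcn -mcpu=tonga < %s | FileCheck %s
```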
2023-10-24  [AMDGPU] Regenerate udiv.ll  (pvanhout; 1 file, -24/+25)
2023-10-24  [DAG] Constant Folding for U/SMUL_LOHI (#69437)  (Pierre van Houtryve; 1 file, -24/+25)
2023-10-19  [AMDGPU] Constant fold FMAD_FTZ (#69443)  (Pierre van Houtryve; 1 file, -78/+38)
Solves #68315
2023-10-18  [DAG] Constant fold FMAD (#69324)  (Pierre van Houtryve; 1 file, -27/+8)
This has very little effect on codegen in practice, but it is a nice-to-have, I think. See #68315
2023-10-09  Revert "[CodeGen] Really renumber slot indexes before register allocation (#67038)"  (Jay Foad; 1 file, -48/+48)
This reverts commit 2501ae58e3bb9a70d279a56d7b3a0ed70a8a852c. Reverted due to various buildbot failures.
2023-10-09  [CodeGen] Really renumber slot indexes before register allocation (#67038)  (Jay Foad; 1 file, -48/+48)
PR #66334 tried to renumber slot indexes before register allocation, but the numbering was still affected by list entries for instructions which had been erased. Fix this to make the register allocator's live range length heuristics even less dependent on the history of how instructions have been added to and removed from SlotIndexes's maps.
2023-10-05  [AMDGPU][CodeGen] Fold immediates in src1 operands of V_MAD/MAC/FMA/FMAC. (#68002)  (Ivan Kosarev; 1 file, -3/+3)
2023-09-19  [CodeGen] Renumber slot indexes before register allocation (#66334)  (Jay Foad; 1 file, -48/+48)
RegAllocGreedy uses SlotIndexes::getApproxInstrDistance to approximate the length of a live range for its heuristics. Renumbering all slot indexes with the default instruction distance ensures that this estimate will be as accurate as possible, and will not depend on the history of how instructions have been added to and removed from SlotIndexes's maps. This also means that enabling -early-live-intervals, which runs the SlotIndexes analysis earlier, will not cause large amounts of churn due to different register allocator decisions.
2023-09-11  [test] Change llc -march= to -mtriple=  (Fangrui Song; 1 file, -2/+2)
The issue was uncovered by #47698: for IR files without a target triple, -mtriple= specifies the full target triple while -march= merely sets the architecture part of the default target triple, leaving a target triple which may not make sense, e.g. riscv64-apple-darwin. Therefore, -march= is error-prone and not recommended for tests without a target triple. The issue has been benign as we recognize $unknown-apple-darwin as ELF instead of rejecting it outright.
2023-07-04  [AMDGPU] Do not wait for vscnt on function entry and return  (Jay Foad; 1 file, -1/+0)
SIInsertWaitcnts inserts waitcnt instructions to resolve data dependencies. The GFX10+ vscnt (VMEM store count) counter is never used in this way. It is only used to resolve memory dependencies, and that is handled by SIMemoryLegalizer. Hence there is no need to conservatively wait for vscnt to be 0 on function entry and before returns. Differential Revision: https://reviews.llvm.org/D153537
2023-04-10  [AMDGPU] Introduce SIInstrWorklist to process instructions in moveToVALU  (skc7; 1 file, -44/+44)
Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D147168
2023-03-03  [AMDGPU] Vectorize misaligned global loads & stores  (Jeffrey Byrnes; 1 file, -20/+11)
Based on experimentation on gfx906, gfx908, gfx90a and gfx1030, wider global loads/stores are more performant than multiple narrower ones, independent of alignment. This is especially true when combining 8-bit loads/stores, in which case the speedup was usually 2x across all alignments. Differential Revision: https://reviews.llvm.org/D145170 Change-Id: I6ee6c76e6ace7fc373cc1b2aac3818fc1425a0c1
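A minimal sketch (hypothetical, not taken from udiv.ll) of the kind of pattern this enables: adjacent byte-wide global accesses that may now be merged into a single under-aligned vector access:
```
define void @copy4_bytes(ptr addrspace(1) %src, ptr addrspace(1) %dst) {
  %s1 = getelementptr i8, ptr addrspace(1) %src, i64 1
  %s2 = getelementptr i8, ptr addrspace(1) %src, i64 2
  %s3 = getelementptr i8, ptr addrspace(1) %src, i64 3
  ; four 1-byte loads/stores that a vectorizer may combine into one <4 x i8> access
  %b0 = load i8, ptr addrspace(1) %src, align 1
  %b1 = load i8, ptr addrspace(1) %s1, align 1
  %b2 = load i8, ptr addrspace(1) %s2, align 1
  %b3 = load i8, ptr addrspace(1) %s3, align 1
  %d1 = getelementptr i8, ptr addrspace(1) %dst, i64 1
  %d2 = getelementptr i8, ptr addrspace(1) %dst, i64 2
  %d3 = getelementptr i8, ptr addrspace(1) %dst, i64 3
  store i8 %b0, ptr addrspace(1) %dst, align 1
  store i8 %b1, ptr addrspace(1) %d1, align 1
  store i8 %b2, ptr addrspace(1) %d2, align 1
  store i8 %b3, ptr addrspace(1) %d3, align 1
  ret void
}
```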
2023-01-23  AMDGPU: Clean up LDS-related occupancy calculations  (Nicolai Hähnle; 1 file, -18/+18)
Occupancy is expressed as waves per SIMD. This means that we need to take into account the number of SIMDs per "CU" or, to be more precise, the number of SIMDs over which a workgroup may be distributed. getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs into account at all. At the same time, we need to take into account that WGP mode offers access to a larger total amount of LDS, since this can affect how non-power-of-two LDS allocations are rounded. To make this work consistently, we distinguish between (available) local memory size and addressable local memory size (which is always limited by 64kB on gfx10+, even with WGP mode). This change results in a massive amount of test churn. A lot of it is caused by the fact that the default work group size is 1024, which means that (due to rounding effects) the default occupancy on older hardware is 8 instead of 10, which affects scheduling via register pressure estimates. I've adjusted most tests by just running the UTC tools, but in some cases I manually changed the work group size to 32 or 64 to make sure that work group size chunkiness has no effect. Differential Revision: https://reviews.llvm.org/D139468
2022-12-19  [AMDGPU] Convert some tests to opaque pointers (NFC)  (Nikita Popov; 1 file, -59/+59)
2022-11-30  [AMDGPU] Use s_cmp instead of s_cmpk  (Jay Foad; 1 file, -2/+2)
Don't bother pre-shrinking "s_cmp_lg_u32 reg, 0" to s_cmpk_lg_u32 because 0 is already an inline constant so the s_cmpk form is no smaller. This is just for consistency with the surrounding code and to simplify a downstream patch. Differential Revision: https://reviews.llvm.org/D138993
2022-09-15  [AMDGPU] Always select s_cselect_b32 for uniform 'select' SDNode  (Alexander Timofeev; 1 file, -294/+336)
This patch contains changes necessary to carry physical condition register (SCC) dependencies through the SDNode scheduler. It adds the edge in the SDNodeScheduler dependency graph instead of inserting the SCC copy between each definition and use. This approach lets the scheduler place instructions in an optimal way placing the copy only when the dependency cannot be resolved. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D133593
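A minimal sketch (hypothetical, not from udiv.ll) of a uniform select: the condition depends only on scalar kernel arguments, so the select is uniform and is expected to stay on the SALU as s_cselect_b32:
```
define amdgpu_kernel void @uniform_select(ptr addrspace(1) %out, i32 %a, i32 %b) {
  ; %a and %b are kernel arguments, hence uniform across the wave
  %cmp = icmp eq i32 %a, 0
  %sel = select i1 %cmp, i32 %b, i32 7
  store i32 %sel, ptr addrspace(1) %out
  ret void
}
```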
2022-08-02  [AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs  (Alexander Timofeev; 1 file, -11/+11)
In 2e29b0138ca243 we introduced a specific solving algorithm that analyzes the VGPR to SGPR copy use chains and either lowers the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms. At the same time we still have the code that blindly converts REG_SEQUENCE and PHIs to VALU in case they produce an SGPR but have VGPR input operands. If the REG_SEQUENCE and PHIs are in a VGPR to SGPR copy use chain, and this chain was considered long enough to convert the copy to v_readfirstlane_b32, further lowering them to VALU leads to several kinds of issues. First, we get a v_readfirstlane_b32 which is completely useless because most parts of its use chain were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF because of the weird mixing of SALU and VALU instructions. This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define an SGPR but have VGPR inputs, we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common VGPR to SGPR lowering algorithm. This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130367
2022-07-30  [AMDGPU] Extend SILoadStoreOptimizer to s_load instructions  (Carl Ritson; 1 file, -37/+34)
Apply merging to s_load as is done for s_buffer_load. Reviewed By: foad Differential Revision: https://reviews.llvm.org/D130742
2022-07-29  Revert "[AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs"  (Alexander Timofeev; 1 file, -11/+11)
This reverts commit 76d9ae924cc361578ecbb5688559f7cebc78ab87 because it causes several VK CTS tests to fail.
2022-07-28  [AMDGPU] avoid blind converting to VALU REG_SEQUENCE and PHIs  (Alexander Timofeev; 1 file, -11/+11)
In 2e29b0138ca243 we introduced a specific solving algorithm that analyzes the VGPR to SGPR copy use chains and either lowers the copy to v_readfirstlane_b32 or converts the whole chain to VALU forms. At the same time we still have the code that blindly converts REG_SEQUENCE and PHIs to VALU in case they produce an SGPR but have VGPR input operands. If the REG_SEQUENCE and PHIs are in a VGPR to SGPR copy use chain, and this chain was considered long enough to convert the copy to v_readfirstlane_b32, further lowering them to VALU leads to several kinds of issues. First, we get a v_readfirstlane_b32 which is completely useless because most parts of its use chain were moved to VALU forms. Second, we may encounter subtle bugs related to the EXEC-dependent CF because of the weird mixing of SALU and VALU instructions. This change removes the code that moves REG_SEQUENCE and PHIs to VALU. Instead, we use the fact that both REG_SEQUENCE and PHIs have copy semantics. That is, if they define an SGPR but have VGPR inputs, we insert VGPR to SGPR copies to make them pure SGPR. Then, the new copies are processed by the common VGPR to SGPR lowering algorithm. This is Part 2 in the series of commits aiming at the massive refactoring of the SIFixSGPRCopies pass. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D130367
2022-07-14  [AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable.  (Alexander Timofeev; 1 file, -74/+80)
Since the divergence-driven instruction selection has been enabled for AMDGPU, all uniform instructions are expected to be selected to SALU form, except those not having one. VGPR to SGPR copies appear in MIR to connect value producers and consumers. This change implements an algorithm that strikes a reasonable tradeoff between the profit achieved from keeping the uniform instructions in SALU form and the overhead introduced by the data transfer between the VGPRs and SGPRs. Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D128252
2022-06-13  [AMDGPU] Use null for dead sdst operand  (Stanislav Mekhanoshin; 1 file, -5/+5)
Differential Revision: https://reviews.llvm.org/D127542
2022-06-09  [AMDGPU] Use v_mad_u64_u32 for IMAD32  (Stanislav Mekhanoshin; 1 file, -12/+11)
Nic Curtis did the experiments to prove it is faster than a separate mul and add. Fixes: SWDEV-332806 Differential Revision: https://reviews.llvm.org/D127253
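An illustrative IR pattern (hypothetical, not from udiv.ll) of the 32-bit multiply plus 64-bit accumulate that can map to a single v_mad_u64_u32:
```
define i64 @imad32(i32 %a, i32 %b, i64 %acc) {
  %ax = zext i32 %a to i64
  %bx = zext i32 %b to i64
  ; 32 x 32 -> 64 multiply followed by a 64-bit add of the accumulator
  %mul = mul i64 %ax, %bx
  %mad = add i64 %mul, %acc
  ret i64 %mad
}
```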
2022-05-18  [AMDGPU] Aggressively fold immediates in SIShrinkInstructions  (Jay Foad; 1 file, -24/+18)
Fold immediates regardless of how many uses they have. This is expected to increase overall code size, but decrease register usage. Differential Revision: https://reviews.llvm.org/D114644
2022-05-18  [AMDGPU] Aggressively fold immediates in SIFoldOperands  (Jay Foad; 1 file, -71/+62)
Previously SIFoldOperands::foldInstOperand would only fold a non-inlinable immediate into a single user, so as not to increase code size by adding the same 32-bit literal operand to many instructions. This patch removes that restriction, so that a non-inlinable immediate will be folded into any number of users. The rationale is: - It reduces the number of registers used for holding constant values, which might increase occupancy. (On the other hand, many of these registers are SGPRs which no longer affect occupancy on GFX10+.) - It reduces ALU stalls between the instruction that loads a constant into a register, and the instruction that uses it. - The above benefits are expected to outweigh any increase in code size. Differential Revision: https://reviews.llvm.org/D114643
2022-03-09  [AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range  (Venkata Ramanaiah Nalamothu; 1 file, -36/+36)
Currently the return address ABI registers s[30:31], which fall in the call clobbered register range, are added as a live-in on the function entry to preserve their value when we have calls, so that they get saved and restored around the calls. But the DWARF unwind information (CFI) needs to track where the return address resides in a frame, and the above approach makes it difficult to track the return address when the CFI information is emitted during frame lowering, because doing so requires understanding the control flow. This patch moves the return address ABI registers s[30:31] into the callee saved registers range and stops adding a live-in for the return address registers, so that the CFI machinery will know where the return address resides when the CSR save/restore happens during frame lowering. Doing the above poses an issue in that the return instruction now uses the undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use by the return instruction through the `SI_RETURN` pseudo instruction, which doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to `S_SETPC_B64_return` during `expandPostRAPseudo()`. As an added benefit, this patch simplifies overall return instruction handling. Note: The AMDGPU CFI changes are there only in the downstream code and another version of this patch will be posted for review for the downstream code. Reviewed By: arsenm, ronlieb Differential Revision: https://reviews.llvm.org/D114652
2022-01-26  [LSV] Vectorize loads of vectors by turning it into a larger vector  (Benjamin Kramer; 1 file, -32/+31)
Use shufflevector to do the subvector extracts. This allows a lot more load merging on AMDGPU and also on NVPTX when <2 x half> is involved. Differential Revision: https://reviews.llvm.org/D117219
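A minimal sketch (hypothetical) of the output shape: two adjacent <2 x half> loads merged into one <4 x half> load, with shufflevector used for the sub-vector extracts:
```
define void @merged_vector_loads(ptr %p, ptr %out0, ptr %out1) {
  ; one wide load replaces two adjacent <2 x half> loads
  %wide = load <4 x half>, ptr %p, align 4
  %lo = shufflevector <4 x half> %wide, <4 x half> poison, <2 x i32> <i32 0, i32 1>
  %hi = shufflevector <4 x half> %wide, <4 x half> poison, <2 x i32> <i32 2, i32 3>
  store <2 x half> %lo, ptr %out0
  store <2 x half> %hi, ptr %out1
  ret void
}
```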
2021-12-22  Revert "[AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range"  (Ron Lieberman; 1 file, -36/+36)
This reverts commit 9075009d1fd5f2bf9aa6c2f362d2993691a316b3. Failed amdgpu runtime buildbot # 3514.
2021-12-22  [AMDGPU] Move call clobbered return address registers s[30:31] to callee saved range  (RamNalamothu; 1 file, -36/+36)
Currently the return address ABI registers s[30:31], which fall in the call clobbered register range, are added as a live-in on the function entry to preserve their value when we have calls, so that they get saved and restored around the calls. But the DWARF unwind information (CFI) needs to track where the return address resides in a frame, and the above approach makes it difficult to track the return address when the CFI information is emitted during frame lowering, because doing so requires understanding the control flow. This patch moves the return address ABI registers s[30:31] into the callee saved registers range and stops adding a live-in for the return address registers, so that the CFI machinery will know where the return address resides when the CSR save/restore happens during frame lowering. Doing the above poses an issue in that the return instruction now uses the undefined register `sgpr30_sgpr31`. This is resolved by hiding the return address register use by the return instruction through the `SI_RETURN` pseudo instruction, which doesn't take any input operands, until the `SI_RETURN` pseudo gets lowered to `S_SETPC_B64_return` during `expandPostRAPseudo()`. As an added benefit, this patch simplifies overall return instruction handling. Note: The AMDGPU CFI changes are there only in the downstream code and another version of this patch will be posted for review for the downstream code. Reviewed By: arsenm Differential Revision: https://reviews.llvm.org/D114652
2021-12-01  [AMDGPU] Set most sched model resource's BufferSize to one  (Austin Kerbow; 1 file, -436/+449)
Using a BufferSize of one for memory ProcResources will result in better ILP since it more accurately models the dependencies between memory ops and their consumers on an in-order processor. After this change, the scheduler will treat the data edges from loads as blocking so that stalls are guaranteed when waiting for data to be retrieved from memory. Since we don't actually track waitcnt here, this should do a better job of modeling their behavior. Practically, this means that the scheduler will trigger the 'STALL' heuristic more often. This type of change needs to be evaluated experimentally. Preliminary results are positive. Fixes: SWDEV-282962 Reviewed By: rampitec Differential Revision: https://reviews.llvm.org/D114777
2021-11-24  [AMDGPU] Implement widening multiplies with v_mad_i64_i32/v_mad_u64_u32  (Jay Foad; 1 file, -193/+165)
Select SelectionDAG ops smul_lohi/umul_lohi to v_mad_i64_i32/v_mad_u64_u32 respectively, with an addend of 0. v_mul_lo, v_mul_hi and v_mad_i64/u64 are all quarter-rate instructions so it is better to use one instruction than two. Further improvements are possible to make better use of the addend operand, but this is already a strict improvement over what we have now. Differential Revision: https://reviews.llvm.org/D113986
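An illustrative IR pattern (hypothetical, not from udiv.ll) that produces a umul_lohi node, the high half of a 32 x 32 -> 64 multiply, as commonly appears in udiv-by-constant expansions:
```
define i32 @umulhi(i32 %a, i32 %b) {
  %ax = zext i32 %a to i64
  %bx = zext i32 %b to i64
  %mul = mul i64 %ax, %bx
  ; high 32 bits of the widened product
  %hi = lshr i64 %mul, 32
  %r = trunc i64 %hi to i32
  ret i32 %r
}
```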
2021-11-19  [AMDGPU] Use new opcode for indexed vgpr reads  (Jay Foad; 1 file, -51/+48)
Introduce V_MOV_B32_indirect_read for indexed vgpr reads (and rename the old V_MOV_B32_indirect to V_MOV_B32_indirect_write) so they can be unambiguously distinguished from regular V_MOV_B32_e32. Previously they were distinguished by looking for extra implicit operands but this is fragile because regular moves sometimes have extra implicit operands too: - either by accident, when instructions end up with duplicate implicit operands (see e.g. D100939) - or by design, when SIInstrInfo::copyPhysReg breaks a multi-dword copy into individual subreg mov instructions and adds implicit operands for the super-register. The effect of this is that SIInstrInfo::isFoldableCopy can be simplified and identifies more foldable copies. The test diffs show that more immediate 0 values have been folded as inline operands. SIInstrInfo::isReallyTriviallyReMaterializable could probably be simplified too but that is not part of this patch. Differential Revision: https://reviews.llvm.org/D114230
2021-11-12  [AMDGPU] Regenerate udiv.ll tests  (Simon Pilgrim; 1 file, -73/+2730)
2021-09-21  [AMDGPU] Prefer v_fmac over v_fma only when no source modifiers are used  (Jay Foad; 1 file, -1/+1)
v_fmac with source modifiers forces VOP3 encoding, but it is strictly better to use the VOP3-only v_fma instead, because $dst and $src2 are not tied so it gives the register allocator more freedom and avoids a copy in some cases. This is the same strategy we already use for v_mad vs v_mac and v_fma_legacy vs v_fmac_legacy. Differential Revision: https://reviews.llvm.org/D110070
2021-07-05  [DAGCombiner] Add support for mulhi const folding in DAGCombiner  (David Stuttard; 1 file, -9/+3)
Differential Revision: https://reviews.llvm.org/D103323 Change-Id: I4ffaaa32301795ba8a339567a68e77fe0862b869
2021-07-05  [DAGCombiner] Pre-commit test to demonstrate mulhi const folding  (David Stuttard; 1 file, -0/+15)
D103323 will fold this Differential Revision: https://reviews.llvm.org/D105424 Change-Id: I64947215eb531fbd70b52a72203b39e43fefafcc
2020-06-15  [AMDGPU] Add gfx1030 target  (Stanislav Mekhanoshin; 1 file, -0/+2)
Differential Revision: https://reviews.llvm.org/D81886
2020-04-02  AMDGPU: Remove denormal subtarget features  (Matt Arsenault; 1 file, -2/+2)
Switch to using the denormal-fp-math/denormal-fp-math-f32 attributes.
2018-09-11  [AMDGPU] Preliminary patch for divergence driven instruction selection. Immediate selection predicate changed  (Alexander Timofeev; 1 file, -2/+2)
Differential revision: https://reviews.llvm.org/D51734 Reviewers: rampitec llvm-svn: 341928
2018-06-27  [AMDGPU] Convert rcp to rcp_iflag  (Stanislav Mekhanoshin; 1 file, -3/+3)
If the source of an rcp instruction is the result of any conversion from an integer, convert it into an rcp_iflag instruction. No FP exception can ever happen, except division by zero, if a single-precision rcp argument is a representation of an integral number. Differential Revision: https://reviews.llvm.org/D48569 llvm-svn: 335742
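A minimal sketch (hypothetical, not from udiv.ll) of the shape involved: the value fed to the reciprocal comes from an integer conversion, as in the fast-math expansion of an integer division, so only a divide-by-zero exception is possible:
```
define float @rcp_of_converted_int(i32 %d) {
  ; the operand is an exact representation of an integer
  %df = uitofp i32 %d to float
  ; 1.0 / x may be selected to a reciprocal instruction
  %r = fdiv fast float 1.000000e+00, %df
  ret float %r
}
```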
2017-07-04  [AMDGPU] Switch scalarize global loads ON by default  (Alexander Timofeev; 1 file, -4/+4)
Differential revision: https://reviews.llvm.org/D34407 llvm-svn: 307097
2017-07-04  Revert r307026, "[AMDGPU] Switch scalarize global loads ON by default"  (NAKAMURA Takumi; 1 file, -4/+4)
It broke a testcase. Failing Tests (1): LLVM :: CodeGen/AMDGPU/alignbit-pat.ll llvm-svn: 307054
2017-07-03  [AMDGPU] Switch scalarize global loads ON by default  (Alexander Timofeev; 1 file, -4/+4)
Differential revision: https://reviews.llvm.org/D34407 llvm-svn: 307026
2017-05-30  [AMDGPU] Allow SDWA in instructions with immediates and SGPRs  (Stanislav Mekhanoshin; 1 file, -3/+3)
The encoding does not allow using SDWA in an instruction with scalar operands, either literals or SGPRs. It is, however, possible to copy these operands into a VGPR first. Several copies of the value are produced if multiple SDWA conversions were done. To clean up, MachineLICM (to hoist copies out of loops), MachineCSE (to remove duplicate copies) and SIFoldOperands (to replace an SGPR-to-VGPR copy with an immediate copy right to the VGPR) runs are added after the SDWA pass. Differential Revision: https://reviews.llvm.org/D33583 llvm-svn: 304219
2017-03-21  AMDGPU: Mark all unspecified CC functions in tests as amdgpu_kernel  (Matt Arsenault; 1 file, -14/+14)
Currently the default C calling convention functions are treated the same as compute kernels. Make this explicit so the default calling convention can be changed to a non-kernel. Converted with perl -pi -e 's/define void/define amdgpu_kernel void/' on the relevant test directories (and undoing in one place that actually wanted a non-kernel). llvm-svn: 298444
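The effect on a test function looks roughly like this (illustrative signature; the function name and arguments are assumptions, not taken from udiv.ll):
```
; before: default C calling convention, previously treated like a kernel
; define void @test_udiv(ptr addrspace(1) %out, i32 %a, i32 %b) { ... }
; after: calling convention made explicit
; define amdgpu_kernel void @test_udiv(ptr addrspace(1) %out, i32 %a, i32 %b) { ... }
```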
2017-02-24  AMDGPU : Replace FMAD with FMA when denormals are enabled.  (Wei Ding; 1 file, -1/+19)
Differential Revision: http://reviews.llvm.org/D29958 llvm-svn: 296186
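A sketch (hypothetical, not from udiv.ll) of a multiply-add that may be formed during lowering; when f32 denormal support is enabled, the fused FMA form is preferred, since the unfused mad is assumed here not to preserve denormals:
```
define float @mul_add(float %a, float %b, float %c) {
  ; a contractable multiply-add pair
  %m = fmul contract float %a, %b
  %r = fadd contract float %m, %c
  ret float %r
}
```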
2017-02-22  AMDGPU: Remove some uses of llvm.SI.export in tests  (Matt Arsenault; 1 file, -3/+25)
Merge some of the old, smaller tests into more complete versions. llvm-svn: 295792
2017-01-24  Enable FeatureFlatForGlobal on Volcanic Islands  (Matt Arsenault; 1 file, -1/+1)
This switches to the workaround that HSA defaults to for the mesa path. This should be applied to the 4.0 branch. Patch by Vedran Miletić <vedran@miletic.net> llvm-svn: 292982