path: root/llvm/test/Transforms/LoopStrengthReduce/AArch64
Age | Commit message | Author | Files | Lines
2025-12-03 | [LSR] Make OptimizeLoopTermCond able to handle some non-cmp conditions (#165590) | John Brawn | 1 | -0/+205
Currently OptimizeLoopTermCond can only convert a cmp instruction to using a postincrement induction variable, which means it can't handle predicated loops where the termination condition comes from get_active_lane_mask. Relax this restriction so that we can handle any kind of instruction, though only if it's the instruction immediately before the branch (except for possibly an extractelement).
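As a hedged illustration (invented names and vector width, not taken from the patch), this is the kind of predicated loop that can now be handled: the back-edge condition is an extractelement of a lane mask rather than a cmp, and it sits immediately before the branch:
```llvm
declare <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64, i64)

define void @predicated(ptr %p, i64 %n) {
entry:
  %mask.entry = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 0, i64 %n)
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %mask = phi <4 x i1> [ %mask.entry, %entry ], [ %mask.next, %loop ]
  ; ... masked loads/stores predicated on %mask elided ...
  %iv.next = add i64 %iv, 4
  ; The termination condition comes from get_active_lane_mask, read back
  ; out with an extractelement immediately before the branch.
  %mask.next = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 %iv.next, i64 %n)
  %continue = extractelement <4 x i1> %mask.next, i64 0
  br i1 %continue, label %loop, label %exit

exit:
  ret void
}
```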
2025-12-02 | [LSR] Insert the transformed IV increment in the user block (#169515) | John Brawn | 1 | -3/+1
Currently we try to hoist the transformed IV increment instruction to the header block to help with generation of postincrement instructions, but this only works if the user instruction is also in the header. We should instead be trying to insert it in the same block as the user.
2025-11-10 | [AArch64][SVE] Avoid movprfx by reusing register for _UNDEF pseudos. (#166926) | Sander de Smalen | 1 | -3/+2
For predicated SVE instructions where we know that the inactive lanes are undef, it is better to pick a destination register that is not unique. This avoids introducing a movprfx to copy a unique register to the destination operand, which would be needed to comply with the tied-operand constraints. For example:
```
%src1 = COPY $z1
%src2 = COPY $z2
%dst = SDIV_ZPZZ_S_UNDEF %p, %src1, %src2
```
Here it is beneficial to pick $z1 or $z2 as the destination register, because if a unique register (e.g. $z0) had been chosen, the pseudo-expansion pass would need to insert a MOVPRFX to expand the operation into:
```
$z0 = SDIV_ZPZZ_S_UNDEF $p0, $z1, $z2
->
$z0 = MOVPRFX $z1
$z0 = SDIV_ZPmZ_S $p0, $z0, $z2
```
By picking $z1 directly, we'd get:
```
$z1 = SDIV_ZPmZ_S $p0, $z1, $z2
```
2025-10-30 | [LSR] Don't count conditional loads/store as enabling pre/post-index (#159573) | John Brawn | 1 | -4/+140
When a load/store is conditionally executed in a loop it isn't a candidate for pre/post-index addressing, as the increment of the address would only happen on those loop iterations where the load/store is executed. Detect this and only discount the AddRec cost when the load/store is unconditional.
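A minimal sketch (invented IR) of the shape now detected: the store below executes only on iterations where %c holds, so its address increment cannot be treated as a free post-index update and the AddRec cost is still counted:
```llvm
define void @conditional_store(ptr %p, ptr %flags, i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %latch ]
  %f.addr = getelementptr i8, ptr %flags, i64 %iv
  %f = load i8, ptr %f.addr
  %c = icmp ne i8 %f, 0
  br i1 %c, label %if.store, label %latch

if.store:                                ; runs on only some iterations
  %addr = getelementptr i64, ptr %p, i64 %iv
  store i64 0, ptr %addr
  br label %latch

latch:
  %iv.next = add i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}
```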
2025-09-16 | [LSR] Add an addressing mode that considers all addressing modes (#158110) | John Brawn | 1 | -0/+178
The way loop strength reduction works is that the target has to decide up front whether it wants its addressing to be preindex, postindex, or neither. This choice affects:
* Which potential solutions we generate
* Whether we consider a pre/post-index load/store as costing an AddRec or not
None of these choices is a good fit for either AArch64 or ARM, where both preindex and postindex addressing are typically free:
* If we pick None then we count pre/post-index addressing as costing one AddRec more than is correct, so we don't pick them when we should.
* If we pick PreIndexed or PostIndexed then we get the correct cost for that addressing type, but still get it wrong for the other, and also exclude potential solutions using offset addressing that could have lower cost.
This patch adds an "all" addressing mode that causes all potential solutions to be generated and counts both pre and postindex as having an AddRecCost of zero. Unfortunately this reveals problems elsewhere in how we calculate the cost of things, which need to be fixed before we can make use of it.
2025-08-01 | [LLVM][DAGCombiner] fold (shl (X * vscale(C0)), C1) -> (X * vscale(C0 << C1)). (#150651) | Paul Walker | 1 | -4/+4
2025-07-11 | [AArch64LoadStoreOpt] BaseReg update is searched also in CF successor (#145583) | Sergey Shcherbinin | 1 | -2/+1
Look for the register-update instruction (to merge with a memory instruction into pre/post-increment form) not only inside a single MBB but also along a control-flow path going downward without side entries, such that BaseReg is live along it but not at its exits. The regression test is updated accordingly.
2025-07-03 | [PHIElimination] Revert #131837 #146320 #146337 (#146850) | Guy David | 1 | -12/+14
Reverting because mis-compiles:
- https://github.com/llvm/llvm-project/pull/131837
- https://github.com/llvm/llvm-project/pull/146320
- https://github.com/llvm/llvm-project/pull/146337
2025-06-29 | [PHIElimination] Reuse existing COPY in predecessor basic block (#131837) | Guy David | 1 | -14/+12
The insertion point of the COPY isn't always optimal and could eventually lead to a worse block layout; see the regression test in the first commit. This change affects many architectures, but the total number of instructions in the test cases seems to be slightly lower.
2025-06-16 | [LSR] Make canHoistIVInc allow non-integer types (#143707) | John Brawn | 1 | -0/+189
canHoistIVInc was made to only allow integer types in order to avoid a crash in isIndexedLoadLegal/isIndexedStoreLegal, which failed an assertion in getValueType (or rather in MVT::getVT, which gets called from it) when passed a struct type. Adjusting these functions to pass AllowUnknown=true to getValueType means we don't get an assertion failure (MVT::Other is returned, which TLI->isIndexedLoadLegal should then return false for), meaning we can remove the integer-type check.
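For illustration, a hypothetical input of the kind that used to trip the assertion: the loop's memory access has a struct type, for which getValueType (with AllowUnknown=true) now yields MVT::Other instead of asserting:
```llvm
define void @struct_access(ptr %p, i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %addr = getelementptr { i64, i64 }, ptr %p, i64 %iv
  %v = load { i64, i64 }, ptr %addr      ; aggregate-typed memory access
  %iv.next = add i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ret void
}
```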
2025-04-17 | [AArch64][SVE] Fold ADD+CNTB to INCB/DECB (#118280) | Ricardo Jesus | 1 | -32/+33
Currently, given:
```cpp
uint64_t incb(uint64_t x) {
  return x + svcntb();
}
```
LLVM generates:
```gas
incb:
        addvl   x0, x0, #1
        ret
```
which is equivalent to:
```gas
incb:
        incb    x0
        ret
```
However, on microarchitectures like the Neoverse V2 and Neoverse V3, the second form (with INCB) can have significantly better latency and throughput (according to their SWOGs). On the Neoverse V2, for example, ADDVL has a latency and throughput of 2, whereas some forms of INCB have a latency of 1 and a throughput of 4. The same applies to DECB. This patch adds patterns to prefer the cheaper INCB/DECB forms over ADDVL where applicable.
2025-02-26 | [AArch64][SVE] Lower unpredicated loads/stores as LDR/STR. (#127837) | Ricardo Jesus | 1 | -38/+29
Currently, given:
```cpp
svuint8_t foo(uint8_t *x) {
  return svld1(svptrue_b8(), x);
}
```
we generate:
```gas
foo:
        ptrue   p0.b
        ld1b    { z0.b }, p0/z, [x0]
        ret
```
However, on little-endian targets and with unaligned memory accesses allowed, we could instead be using LDR as follows:
```gas
foo:
        ldr     z0, [x0]
        ret
```
The second form avoids the predicate dependency. Likewise for other types and stores.
2024-12-29 | Remove -print-lsr-output in favor of --stop-after=loop-reduce | Fangrui Song | 1 | -1/+1
Pull Request: https://github.com/llvm/llvm-project/pull/121305
2024-11-21 | [llvm] Remove `br i1 undef` from some regression tests [NFC] (#117112) | Lee Wei | 1 | -36/+36
This PR removes tests with `br i1 undef` under `llvm/test/Transforms/Loop*` and `llvm/test/Transforms/Lower*`.
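The usual mechanical rewrite looks like this sketch (the replacement values chosen in the PR may differ; `%c` is an invented argument):
```llvm
; Before: branching on undef is immediate UB, so the test exercises nothing.
define void @before() {
entry:
  br i1 undef, label %a, label %b
a:
  ret void
b:
  ret void
}

; After: the condition is threaded in as a function argument instead.
define void @after(i1 %c) {
entry:
  br i1 %c, label %a, label %b
a:
  ret void
b:
  ret void
}
```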
2024-07-24 | [LSR] Fix matching vscale immediates (#100080) | Benjamin Maxwell | 1 | -8/+12
Somewhat confusingly a `SCEVMulExpr` is a `SCEVNAryExpr`, so it can have more than two operands. Previously, the vscale immediate matching did not check the number of operands of the `SCEVMulExpr`, so it would ignore any operands after the first two. This led to incorrect codegen (and results) for ArmSME in IREE (https://github.com/iree-org/iree), which sometimes addresses things that are a `vscale * vscale` multiple away. The test added with this change shows an example reduced from IREE. The second write should be offset from the first by `16 * vscale * vscale` (* 4 bytes); however, previously LSR dropped the second vscale and instead offset the write by `#4, mul vl`, which is an offset of `16 * vscale` (* 4 bytes).
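A reduced sketch of the pattern (invented names and constants): the offset below canonicalizes to a single three-operand `SCEVMulExpr`, `(16 * vscale * vscale)`, and the old matcher inspected only the first two operands:
```llvm
declare i64 @llvm.vscale.i64()

define void @two_vscales(ptr %base, <vscale x 4 x float> %v) {
  %vs = call i64 @llvm.vscale.i64()
  %vsvs = mul i64 %vs, %vs
  %off = mul i64 %vsvs, 16            ; 16 * vscale * vscale elements
  %addr = getelementptr float, ptr %base, i64 %off
  store <vscale x 4 x float> %v, ptr %addr
  ret void
}
```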
2024-07-23 | Precommit vscale-fixups.ll test (NFC) | Benjamin Maxwell | 1 | -0/+47
Precommit test for #100080.
2024-07-01 | [LSR] Recognize vscale-relative immediates (#88124) | Graham Hunter | 1 | -0/+387
Extends LoopStrengthReduce to recognize immediates multiplied by vscale, and to query the current target for whether they are legal offsets for memory operations or adds.
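For example (a sketch with assumed constants, not taken from the patch), an access whose offset is a vscale-scaled immediate, which on AArch64/SVE can fold into reg+imm `mul vl` addressing:
```llvm
declare i64 @llvm.vscale.i64()

define <vscale x 16 x i8> @vscale_offset(ptr %base) {
  %vs = call i64 @llvm.vscale.i64()
  %off = mul i64 %vs, 32                       ; 32 * vscale bytes, i.e. 2 x VL
  %addr = getelementptr i8, ptr %base, i64 %off
  %val = load <vscale x 16 x i8>, ptr %addr    ; candidate for ldr z0, [x0, #2, mul vl]
  ret <vscale x 16 x i8> %val
}
```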
2024-06-06 | [AArch64] Override isLSRCostLess, take number of instructions into account (#84189) | Graham Hunter | 1 | -2/+4
Adds an AArch64-specific version of isLSRCostLess, changing the relative importance of the various terms from the formulae being evaluated. This has been split out from my vscale-aware LSR work, see the RFC for reference: https://discourse.llvm.org/t/rfc-vscale-aware-loopstrengthreduce/77131
2023-08-21 | [AArch64] Update generic sched model to A510 | Harvin Iriawan | 3 | -14/+14
Refresh of the generic scheduling model to use the A510 instead of the A55. The main benefits are to the little core, and it introduces SVE scheduling information. Changes were tested on various OoO cores; no performance degradation was seen. Differential Revision: https://reviews.llvm.org/D156799
2023-07-13 | [LSR] Don't consider users of constant outside loop | Nikita Popov | 1 | -19/+14
In CollectLoopInvariantFixupsAndFormulae(), LSR looks at users outside the loop. E.g. if we have an addrec based on %base, and %base is also used outside the loop, then we have to keep it in a register anyway, which may make it more profitable to use %base + %idx style addressing. This reasoning doesn't hold up when the base is a constant, because the constant can be rematerialized. The lsr-memcpy.ll test regressed when enabling opaque pointers, because inttoptr (i64 6442450944 to ptr) now also has a use outside the loop (previously it didn't due to a pointer type difference), and that extra "use" results in worse use of addressing modes in the loop. However, the use outside the loop actually gets rematerialized, so the alleged register saving does not occur. The same reasoning also applies to other types of constants, such as global variable references. Differential Revision: https://reviews.llvm.org/D155073
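A simplified sketch of the situation (a global stands in for the rematerializable constant; names are invented):
```llvm
@g = global [256 x i32] zeroinitializer

define i32 @const_base(i64 %n) {
entry:
  br label %loop

loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %addr = getelementptr i32, ptr @g, i64 %iv   ; addrec based on a constant
  store i32 0, ptr %addr
  %iv.next = add i64 %iv, 1
  %done = icmp eq i64 %iv.next, %n
  br i1 %done, label %exit, label %loop

exit:
  ; A "use" of @g outside the loop, but the constant is simply
  ; rematerialized here, so no register is actually kept live for it.
  %v = load i32, ptr @g
  ret i32 %v
}
```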
2023-07-13 | [LSR] Add test variant with global variables (NFC) | Nikita Popov | 1 | -2/+44
A variant of the test using globals instead of inttoptr expressions for D155073.
2023-07-12 | [LSR] Convert test to opaque pointers (NFC) | Nikita Popov | 1 | -14/+17
This regresses with opaque pointers. I'll submit a patch to recover the regression.
2023-06-27 | [test] Replace aarch64-*-eabi with aarch64 | Fangrui Song | 1 | -1/+1
Using "eabi" for aarch64 targets is a common mistake and warned by Clang Driver. We want to avoid it elsewhere as well. Just use the common "aarch64" without other triple components.
2023-06-23 | [LSR] Regenerate test checks (NFC) | Nikita Popov | 1 | -5/+19
2023-06-19 | [LSR] Add test for issue leading to revert of abfeda5af329b5. | Florian Hahn | 1 | -0/+53
Add unit test triggering an assertion with abfeda5af329b5.
2023-05-25 | [CodeGen][ShrinkWrap] Enable PostShrinkWrap by default | sgokhale | 1 | -3/+3
This is an attempt to reland D42600, enabling this optimisation by default. It also resolves the issue pointed out in the context of a PGO build. Differential Revision: https://reviews.llvm.org/D42600
2023-05-17 | [NFC][Py Reformat] Reformat lit.local.cfg python files in llvm | Tobias Hieta | 1 | -2/+2
This is a follow-up to b71edfaa4ec3c998aadb35255ce2f60bba2940b0 since I forgot the lit.local.cfg files in that one. Reformatting is done with `black`. If you end up having problems merging this commit because you have made changes to a python file, the best way to handle that is to run `git checkout --ours <yourfile>` and then reformat it with `black`. If you run into any problems, post to discourse about it and we will try to help. RFC Thread below: https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style Reviewed By: barannikov88, kwk Differential Revision: https://reviews.llvm.org/D150762
2023-05-08 | Revert "[CodeGen][ShrinkWrap] Split restore point" | Alan Zhao | 1 | -3/+3
This reverts commit 1ddfd1c8186735c62b642df05c505dc4907ffac4. The original commit causes a Chrome build assertion failure with ThinLTO: https://crbug.com/1443635
2023-05-08 | [CodeGen][ShrinkWrap] Split restore point | sgokhale | 1 | -3/+3
Try to reland D42600. Differential Revision: https://reviews.llvm.org/D42600
2023-04-21 | Recommit "[AArch64] Fix incorrect `isLegalAddressingMode`" | Momchil Velikov | 1 | -3/+3
This patch recommits 0827e2fa3fd15b49fd2d0fc676753f11abb60cab after reverting it in ed7ada259f665a742561b88e9e6c078e9ea85224. Added a workaround for `TargetLowering::AddrMode` no longer being an aggregate in C++20. `AArch64TargetLowering::isLegalAddressingMode` has a number of defects, including accepting an addressing mode which consists of only an immediate operand, and not checking the offset range for an addressing mode of the form `1*ScaledReg + Offs`. This patch fixes the above issues. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D143895 Change-Id: I41a520c13ce21da503ca45019979bfceb8b648fa
2023-04-20 | Revert "[AArch64] Fix incorrect `isLegalAddressingMode`" | Momchil Velikov | 1 | -3/+3
This reverts commit 0827e2fa3fd15b49fd2d0fc676753f11abb60cab. Failing buildbot, perhaps due to `-std=c++20`.
2023-04-20 | [AArch64] Fix incorrect `isLegalAddressingMode` | Momchil Velikov | 1 | -3/+3
`AArch64TargetLowering::isLegalAddressingMode` has a number of defects, including accepting an addressing mode which consists of only an immediate operand, and not checking the offset range for an addressing mode of the form `1*ScaledReg + Offs`. This patch fixes the above issues. Reviewed By: dmgreen Differential Revision: https://reviews.llvm.org/D143895 Change-Id: I756fa21941844ded44f082ac7eea4391219f9851
2023-03-17 | [SCEVExpander] Always use scevgep as name | Nikita Popov | 1 | -1/+1
With opaque pointers the scevgep / uglygep distinction no longer makes sense -- GEPs are always emitted in offset-based representation.
2023-02-10 | [Reland][LSR] Hoist IVInc to loop header if its all uses are in the loop header | chenglin.bi | 1 | -6/+126
The original code caused a crash when the load/store memory type is a structure, because isIndexedLoadLegal/isIndexedStoreLegal don't support struct types, so we limit the load/store memory type to integers. Original commit message: When the latch block is different from the header block, the IVInc will be expanded in the latch block, and we can't generate a post-index load/store in that case. But if the IVInc is only used in the loop, we can actually still use a post-index load/store, because on loop exit we don't care about the last IVInc value. So, try to hoist the IVInc to help the backend generate more post-index loads/stores. Fix #53625 Reviewed By: eopXD Differential Revision: https://reviews.llvm.org/D138636
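As a hedged illustration (reduced, invented IR with an integer memory type, as the reland requires): the increment naturally sits in the latch, but since %iv.next has no use outside the loop it can be hoisted into the header next to its user, letting the backend form a post-index load:
```llvm
declare void @use(i64)

define void @latch_inc(ptr %p, ptr %end) {
entry:
  br label %header

header:
  %iv = phi ptr [ %p, %entry ], [ %iv.next, %latch ]
  %v = load i64, ptr %iv              ; wants post-index: ldr xN, [xM], #8
  %c = icmp eq i64 %v, 0
  br i1 %c, label %side, label %latch

side:
  call void @use(i64 %v)
  br label %latch

latch:
  ; By default the IVInc is expanded here; hoisting it up next to the
  ; load in the header enables the post-index form.
  %iv.next = getelementptr inbounds i64, ptr %iv, i64 1
  %done = icmp eq ptr %iv.next, %end
  br i1 %done, label %exit, label %header

exit:
  ret void
}
```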
2023-01-18 | [AsmParser] Remove typed pointer auto-detection | Nikita Popov | 1 | -1/+1
IR is now always parsed in opaque pointer mode, unless -opaque-pointers=0 is explicitly given. There is no automatic detection of typed pointers anymore. The -opaque-pointers=0 option is added to any remaining IR tests that haven't been migrated yet. Differential Revision: https://reviews.llvm.org/D141912
2023-01-11 | Revert "[LSR] Hoist IVInc to loop header if its all uses are in the loop header" | chenglin.bi | 1 | -73/+6
The original commit seems to cause a regression in a numba test. This reverts commit b1b4758e7f4b2ffe1faa28b00eb037832e5d26a7.
2023-01-10 | [LSR] Hoist IVInc to loop header if its all uses are in the loop header | chenglin.bi | 1 | -6/+73
When the latch block is different from the header block, the IVInc will be expanded in the latch block, and we can't generate a post-index load/store in that case. But if the IVInc is only used in the loop, we can actually still use a post-index load/store, because on loop exit we don't care about the last IVInc value. So, try to hoist the IVInc to help the backend generate more post-index loads/stores. Fix #53625 Reviewed By: eopXD Differential Revision: https://reviews.llvm.org/D138636
2023-01-05 | [Transforms] Convert some tests to opaque pointers (NFC) | Nikita Popov | 7 | -121/+112
These are all tests where conversion worked automatically, and required no manual fixup.
2022-12-20 | Revert "Reland "[SimplifyCFG] `FoldBranchToCommonDest()`: deal with mismatched IV's in PHI's in common successor block"" | Roman Lebedev | 1 | -22/+20
One of these two changes is exposing (or causing) some more miscompiles. A reproducer is in progress, so reverting until resolved. This reverts commit 428f36401b1b695fd501ebfdc8773bed8ced8d4e.
2022-12-17 | Reland "[SimplifyCFG] `FoldBranchToCommonDest()`: deal with mismatched IV's in PHI's in common successor block" | Roman Lebedev | 1 | -20/+22
This reverts commit 37b8f09a4b61bf9bf9d0b9017d790c8b82be2e17 and returns commit 1bd0b82e508d049efdb07f4f8a342f35818df341. The miscompile was in InstCombine, and it has been addressed. This tries to approach the problem noted by @arsenm: terrible codegen for `__builtin_fpclassify()`: https://godbolt.org/z/388zqdE37 Just because the PHI in the common successor happens to have different incoming values for these two blocks doesn't mean we have to give up. It's quite easy to deal with this; we just need to produce a select: https://alive2.llvm.org/ce/z/000srb Now, the cost model for this transform is rather overly strict, so this will basically never fire: we tally the selects needed (over all preds) into NumBonusInsts. Differential Revision: https://reviews.llvm.org/D139275
2022-12-16 | Revert "[SimplifyCFG] `FoldBranchToCommonDest()`: deal with mismatched IV's in PHI's in common successor block" | Alexander Kornienko | 1 | -22/+20
This reverts commit 1bd0b82e508d049efdb07f4f8a342f35818df341, since it leads to miscompiles. See https://reviews.llvm.org/D139275#3993229 and https://reviews.llvm.org/D139275#4001580.
2022-12-12 | [SimplifyCFG] `FoldBranchToCommonDest()`: deal with mismatched IV's in PHI's in common successor block | Roman Lebedev | 1 | -20/+22
This tries to approach the problem noted by @arsenm: terrible codegen for `__builtin_fpclassify()`: https://godbolt.org/z/388zqdE37 Just because the PHI in the common successor happens to have different incoming values for these two blocks doesn't mean we have to give up. It's quite easy to deal with this; we just need to produce a select: https://alive2.llvm.org/ce/z/000srb Now, the cost model for this transform is rather overly strict, so this will basically never fire: we tally the selects needed (over all preds) into NumBonusInsts. Differential Revision: https://reviews.llvm.org/D139275
2022-11-25 | [LSR] precommit test for D138636; NFC | chenglin.bi | 1 | -0/+52
2022-06-15 | [LSR] Add test for LoopStrenghtReduce for Ldp; NFC | chenglin.bi | 1 | -0/+75
#53877
2021-10-09 | [AArch64] Make -mcpu=generic schedule for an in-order core | David Green | 2 | -7/+7
We would like to start pushing -mcpu=generic towards enabling the set of features that improves performance for some CPUs, without hurting any others: a blend of the performance options hopefully beneficial to all CPUs. The largest part of that is enabling in-order scheduling using the Cortex-A55 schedule model. This is similar to the Arm backend change from eecb353d0e25ba which made -mcpu=generic perform in-order scheduling using the cortex-a8 schedule model.
The idea is that in-order cpus require the most help in instruction scheduling, whereas out-of-order cpus can for the most part schedule out-of-order around different codegen. Our benchmarking suggests that hypothesis holds. When running on an in-order core this improved performance by 3.8% geomean on a set of DSP workloads, 2% geomean on some other embedded benchmarks and between 1% and 1.8% on a set of singlecore and multicore workloads, all running on a Cortex-A55 cluster. On an out-of-order cpu the results are a lot more noisy but show flat performance or an improvement. On the set of DSP and embedded benchmarks, run on a Cortex-A78, there was a very noisy 1% speed improvement. Using the most detailed results I could find, SPEC2006 runs on a Neoverse N1 show a small increase in instruction count (+0.127%), but a decrease in cycle counts (-0.155%, on average). The instruction count is very low noise; the cycle count is more noisy, with a 0.15% decrease not being significant. SPEC2k17 shows a small decrease (-0.2%) in instruction count leading to a -0.296% decrease in cycle count. These results are within noise margins but tend to show a small improvement in general.
When specifying an Apple target, clang will set "-target-cpu apple-a7" on the command line, so it should not be affected by this change when running from clang. This also doesn't enable more runtime unrolling like -mcpu=cortex-a55 does, only changing the schedule used.
A lot of existing tests have been updated. This is a summary of the important differences:
- Most changes are the same instructions in a different order.
- Sometimes this leads to very minor inefficiencies, such as requiring an extra mov to move variables into r0/v0 for the return value of a test function.
- misched-fusion.ll was no longer fusing the pairs of instructions it should, as per D110561. I've changed the schedule used in the test for now.
- neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to the different latencies. This seems fine to me.
- Some SVE tests do not always remove movprfx where they did before due to different register allocation giving different destructive forms.
- The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll produce two LDR where they previously produced an LDP due to store-pair-suppress kicking in.
- arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LDP.
- Some tests such as arm64-neon-mul-div.ll and ragreedy-local-interval-cost.ll have more, less or just different spilling.
- In aarch64_generated_funcs.ll.generated.expected one part of the function is no longer outlined. Interestingly, if I switch this to use any other schedule even less is outlined.
Some of these are expected to happen, such as differences in outlining or register spilling. There will be places where these result in worse codegen, places where they are better, with the SPEC instruction counts suggesting it is not a decrease overall, on average. Differential Revision: https://reviews.llvm.org/D110830
2021-10-06 | [AArch64] Regenerate even more tests | David Green | 1 | -0/+1
This updates a few more check lines in some mte tests that were already close to auto-generated, and in some CodeGenPrepare/consthoist tests where being able to see the entire code sequence is useful for determining whether code differences are improvements or not.
2021-08-03 | [AArch64] Prefer fmov over orr v.16b when copying f32/f64 | David Green | 1 | -3/+2
This changes the lowering of f32 and f64 COPY from a 128-bit vector ORR to an fmov of the appropriate type. At least on some CPUs with 64-bit NEON data paths this is expected to be faster, and it shouldn't be slower on any CPU that treats fmov as a register rename. Differential Revision: https://reviews.llvm.org/D106365
2021-08-03 | [AArch64InstPrinter] Change printAddSubImm to comment imm value when shifted | Jason Molenda | 2 | -7/+7
Add a comment when there is a shifted value, `add x9, x0, #291, lsl #12 ; =1191936`, but not when the immediate value is unshifted, `subs x9, x0, #256 ; =256`, where the comment adds nothing additional for the reader. Differential Revision: https://reviews.llvm.org/D107196
2021-04-15 | [LSR] Fix for pre-indexed generated constant offset | Stelios Ioannou | 1 | -0/+53
This patch changes the isLegalUse check to ensure that LSRInstance::GenerateConstantOffsetsImpl generates an offset that results in a legal addressing mode and formula. The check is changed to look similar to the assertion used for illegal formulas. Differential Revision: https://reviews.llvm.org/D100383 Change-Id: Iffb9e32d59df96b8f072c00f6c339108159a009a
2020-09-10 | Mark FMOV constant materialization as being as cheap as a move. | Owen Anderson | 1 | -7/+5
This prevents us from doing things like LICM'ing it out of a loop, which is usually a net loss because we end up having to spill a callee-saved FPR to accommodate it. This does perturb instruction scheduling around this instruction, so a number of tests had to be updated to account for it. Reviewed By: t.p.northover Differential Revision: https://reviews.llvm.org/D87316