Currently OptimizeLoopTermCond can only convert a cmp instruction to
use a postincrement induction variable, which means it can't handle
predicated loops where the termination condition comes from
get_active_lane_mask. Relax this restriction so that we can handle any
kind of instruction, though only if it's the instruction immediately
before the branch (except for possibly an extractelement).
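As a hedged illustration (the function is made up): when a loop like this is vectorized with tail folding, each iteration ends by branching on an extractelement of the next get_active_lane_mask result rather than on a cmp, which the old code could not convert.
```cpp
// Sketch: with tail folding, the vector loop's termination condition is
// a lane of the next get_active_lane_mask, extracted and branched on,
// not an integer compare against the trip count.
void saxpy(float *dst, const float *src, float a, long n) {
  for (long i = 0; i < n; ++i)
    dst[i] += a * src[i];
}
```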
|
|
Currently we try to hoist the transformed IV increment instruction to
the header block to help with generation of postincrement instructions,
but this only works if the user instruction is also in the header. We
should instead be trying to insert it in the same block as the user.
|
|
For predicated SVE instructions where we know that the inactive lanes
are undef, it is better to pick a destination register that is not
unique, i.e. one that reuses a source operand. This avoids introducing
a movprfx to copy a unique register to the destination operand, which
would otherwise be needed to comply with the tied-operand constraints.
For example:
```
%src1 = COPY $z1
%src2 = COPY $z2
%dst = SDIV_ZPZZ_S_UNDEF %p, %src1, %src2
```
Here it is beneficial to pick $z1 or $z2 as the destination register,
because if a unique register (e.g. $z0) were chosen, the pseudo-expand
pass would need to insert a MOVPRFX to expand the operation into:
```
$z0 = SDIV_ZPZZ_S_UNDEF $p0, $z1, $z2
->
$z0 = MOVPRFX $z1
$z0 = SDIV_ZPmZ_S $p0, $z0, $z2
```
By picking $z1 directly, we'd get:
```
$z1 = SDIV_ZPmZ_S $p0, $z1, $z2
```
|
|
When a load/store is conditionally executed in a loop it isn't a
candidate for pre/post-index addressing, as the increment of the address
would only happen on those loop iterations where the load/store is
executed.
Detect this and only discount the AddRec cost when the load/store is
unconditional.
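A minimal sketch of the problem case (the function is made up): the pointer bump happens only on iterations where the store executes, so a pre/post-indexed store, which would update the address every iteration, does not apply.
```cpp
// Sketch: dst is only incremented when the store runs, so the AddRec
// cost of its address must not be discounted as free pre/post-indexing.
void compact(int *dst, const int *src, int n) {
  for (int i = 0; i < n; ++i)
    if (src[i] != 0)
      *dst++ = src[i];
}
```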
|
|
The way that loop strength reduction works is that the target has to
decide up front whether it wants its addressing to be preindex,
postindex, or neither. This choice affects:
* Which potential solutions we generate
* Whether we consider a pre/post index load/store as costing an AddRec
or not.
None of these choices are a good fit for either AArch64 or ARM, where
both preindex and postindex addressing are typically free:
* If we pick None then we count pre/post index addressing as costing
one AddRec more than is correct, so we don't pick them when we should.
* If we pick PreIndexed or PostIndexed then we get the correct cost for
that addressing type, but still get it wrong for the other, and we also
exclude potential solutions using offset addressing that could have a
lower cost.
This patch adds an "all" addressing mode that causes all potential
solutions to be generated and counts both preindex and postindex as
having an AddRecCost of zero. Unfortunately this reveals problems
elsewhere in our cost calculations that need to be fixed before we can
make use of it.
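For context, a hedged sketch of why both forms are free on AArch64 (the loop is illustrative; the assembly in the comments shows the folded updates):
```cpp
// post-index: ldr w8, [x0], #4   -- load, then advance the pointer
// pre-index:  ldr w8, [x0, #4]!  -- advance the pointer, then load
// Either way the address update costs no separate add instruction, so
// counting an extra AddRec for it overestimates the formula's cost.
int sum(const int *p, int n) {
  int s = 0;
  for (int i = 0; i < n; ++i)
    s += *p++; // natural post-index candidate
  return s;
}
```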
|
|
C1)). (#150651)
|
|
Look for the register update instruction (to merge with the memory
instruction into pre/post-increment form) not only inside a single MBB
but also along a control-flow path going downward without side
entries, such that BaseReg is live along it but not at its exits.
The regression test is updated accordingly.
|
|
Reverting because mis-compiles:
- https://github.com/llvm/llvm-project/pull/131837
- https://github.com/llvm/llvm-project/pull/146320
- https://github.com/llvm/llvm-project/pull/146337
|
|
The insertion point of the COPY isn't always optimal and could
eventually lead to a worse block layout; see the regression test in
the first commit.
This change affects many architectures, but the total number of
instructions in the test cases seems to be slightly lower.
|
|
canHoistIVInc was made to only allow integer types, to avoid a crash
in isIndexedLoadLegal/isIndexedStoreLegal caused by them failing an
assertion in getValueType (or rather in MVT::getVT, which it calls)
when passed a struct type. Adjusting these functions to pass
AllowUnknown=true to getValueType avoids the assertion failure
(MVT::Other is returned, which TLI->isIndexedLoadLegal should then
return false for), meaning we can remove the integer-type check.
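A hedged sketch of the adjusted query (the helper name and surrounding code are assumptions; it only builds against LLVM's headers):
```cpp
#include "llvm/CodeGen/TargetLowering.h"
#include "llvm/IR/DataLayout.h"
using namespace llvm;

// With AllowUnknown=true a struct type maps to MVT::Other instead of
// hitting the assertion in MVT::getVT; the legality hook then just
// returns false for it.
static bool isLegalPostIndexedAccess(const TargetLowering &TLI,
                                     const DataLayout &DL, Type *AccessTy) {
  EVT VT = TLI.getValueType(DL, AccessTy, /*AllowUnknown=*/true);
  return TLI.isIndexedLoadLegal(ISD::POST_INC, VT); // false for MVT::Other
}
```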
|
|
Currently, given:
```cpp
uint64_t incb(uint64_t x) {
return x+svcntb();
}
```
LLVM generates:
```gas
incb:
addvl x0, x0, #1
ret
```
Which is equivalent to:
```gas
incb:
incb x0
ret
```
However, on microarchitectures like the Neoverse V2 and Neoverse V3,
the second form (with INCB) can have significantly better latency and
throughput (according to their SWOG). On the Neoverse V2, for example,
ADDVL has a latency and throughput of 2, whereas some forms of INCB
have a latency of 1 and a throughput of 4. The same applies to DECB.
This patch adds patterns to prefer the cheaper INCB/DECB forms over
ADDVL where applicable.
|
|
Currently, given:
```cpp
svuint8_t foo(uint8_t *x) {
return svld1(svptrue_b8(), x);
}
```
We generate:
```gas
foo:
ptrue p0.b
ld1b { z0.b }, p0/z, [x0]
ret
```
However, on little-endian and with unaligned memory accesses allowed, we
could instead be using LDR as follows:
```gas
foo:
ldr z0, [x0]
ret
```
The second form avoids the predicate dependency.
Likewise for other types and stores.
|
|
Pull Request: https://github.com/llvm/llvm-project/pull/121305
|
|
This PR removes tests with `br i1 undef` under
`llvm/test/Transforms/Loop*` and `Lower*`.
|
|
Somewhat confusingly, a `SCEVMulExpr` is a `SCEVNAryExpr`, so it can
have more than two operands. Previously, the vscale immediate matching
did not check the number of operands of the `SCEVMulExpr`, so it would
ignore any operands after the first two.
This led to incorrect codegen (and results) for ArmSME in IREE
(https://github.com/iree-org/iree), which sometimes addresses things
that are a `vscale * vscale` multiple away. The test added with this
change shows an example reduced from IREE. The second write should be
offset from the first by `16 * vscale * vscale` (* 4 bytes); however,
previously LSR dropped the second vscale and instead offset the write
by `#4, mul vl`, which is an offset of `16 * vscale` (* 4 bytes).
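A hedged C++ reduction of the shape involved (not the exact IREE code; ACLE SVE intrinsics assumed available):
```cpp
#include <arm_sve.h>

// svcntw() is 4 * vscale 32-bit lanes, so the element offset below is
// 16 * vscale * vscale -- the two-vscale multiple the old matcher
// silently truncated to a single vscale.
void two_stores(float *base, svfloat32_t v, svbool_t pg) {
  svst1(pg, base, v);
  svst1(pg, base + svcntw() * svcntw(), v);
}
```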
|
|
Precommit test for #100080.
|
|
Extends LoopStrengthReduce to recognize immediates multiplied by vscale, and query the current target for whether they are legal offsets for memory operations or adds.
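A hedged SVE sketch of the kind of offset this covers (the function is made up):
```cpp
#include <arm_sve.h>

// The second load sits a whole number of vectors (a vscale multiple)
// away from the first; if the target reports the immediate legal, it
// can fold into the [x0, #4, mul vl] addressing mode rather than
// needing a separate add.
svfloat32_t load_two(const float *p, svbool_t pg) {
  svfloat32_t a = svld1(pg, p);
  svfloat32_t b = svld1(pg, p + 4 * svcntw());
  return svadd_x(pg, a, b);
}
```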
|
|
(#84189)
Adds an AArch64-specific version of isLSRCostLess, changing the relative
importance of the various terms from the formulae being evaluated.
This has been split out from my vscale-aware LSR work; see the RFC
for reference:
https://discourse.llvm.org/t/rfc-vscale-aware-loopstrengthreduce/77131
|
|
Refresh of the generic scheduling model to use A510 instead of A55.
The main benefits are to the little core, along with introducing SVE
scheduling information. Changes were tested on various OoO cores; no
performance degradation was seen.
Differential Revision: https://reviews.llvm.org/D156799
|
|
In CollectLoopInvariantFixupsAndFormulae(), LSR looks at users
outside the loop. E.g. if we have an addrec based on %base, and
%base is also used outside the loop, then we have to keep it in a
register anyway, which may make it more profitable to use
%base + %idx style addressing.
This reasoning doesn't hold up when the base is a constant, because
the constant can be rematerialized. The lsr-memcpy.ll test regressed
when enabling opaque pointers, because inttoptr (i64 6442450944 to ptr)
now also has a use outside the loop (previously it didn't due to a
pointer type difference), and that extra "use" results in worse use
of addressing modes in the loop. However, the use outside the loop
actually gets rematerialized, so the alleged register saving does
not occur.
The same reasoning also applies to other types of constants, such
as global variable references.
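A hedged sketch of the lsr-memcpy.ll shape (the constant address matches the inttoptr above; the loop itself is illustrative):
```cpp
// The base is a constant address, so its use after the loop can be
// rematerialized with a mov/movk pair; keeping it live in a register
// across the loop saves nothing.
void drain(long n) {
  volatile int *p = (volatile int *)0x180000000; // i64 6442450944
  for (long i = 0; i < n; ++i)
    p[i] = 0;
  *p = 1; // out-of-loop use; rematerializable, not a register cost
}
```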
Differential Revision: https://reviews.llvm.org/D155073
|
|
A variant of the test using globals instead of inttoptr expressions
for D155073.
|
|
This regresses with opaque pointers. I'll submit a patch to recover
the regression.
|
|
Using "eabi" for aarch64 targets is a common mistake and warned by Clang Driver.
We want to avoid it elsewhere as well. Just use the common "aarch64" without
other triple components.
|
|
|
|
Add unit test triggering an assertion with abfeda5af329b5.
|
|
This is an attempt to reland D42600, enabling this optimisation by
default. This also resolves the issue pointed out in the context of a
PGO build.
Differential Revision: https://reviews.llvm.org/D42600
|
|
This is a follow-up to b71edfaa4ec3c998aadb35255ce2f60bba2940b0
since I forgot the lit.local.cfg files in that one.
Reformatting is done with `black`.
If you end up having problems merging this commit because you have
made changes to a Python file, the best way to handle that is to run
`git checkout --ours <yourfile>` and then reformat it with `black`.
If you run into any problems, post to discourse about it and
we will try to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Reviewed By: barannikov88, kwk
Differential Revision: https://reviews.llvm.org/D150762
|
|
This reverts commit 1ddfd1c8186735c62b642df05c505dc4907ffac4.
The original commit causes a Chrome build assertion failure with
ThinLTO: https://crbug.com/1443635
|
|
Try to reland D42600
Differential Revision: https://reviews.llvm.org/D42600
|
|
This patch recommits 0827e2fa3fd15b49fd2d0fc676753f11abb60cab after
reverting it in ed7ada259f665a742561b88e9e6c078e9ea85224, adding a
workaround for `TargetLowering::AddrMode` no longer being an aggregate
in C++20.
`AArch64TargetLowering::isLegalAddressingMode` has a number of
defects, including accepting an addressing mode which consists of only
an immediate operand, and not checking the offset range for an
addressing mode in the form `1*ScaledReg + Offs`.
This patch fixes the above issues.
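A hedged illustration of the two rejected shapes, using the fields of `TargetLowering::AddrMode` (the values are made up; this only builds against LLVM's headers):
```cpp
#include "llvm/CodeGen/TargetLowering.h"

void addrModeExamples() {
  // Immediate only: no base register, no scaled register. Previously
  // accepted, but AArch64 has no register-free addressing mode.
  llvm::TargetLowering::AddrMode ImmOnly;
  ImmOnly.BaseOffs = 42;

  // 1*ScaledReg + Offs: the scaled register acts as a base register,
  // so the offset still needs its range checked.
  llvm::TargetLowering::AddrMode ScaledPlusOff;
  ScaledPlusOff.Scale = 1;
  ScaledPlusOff.BaseOffs = 1 << 20; // out of range for AArch64 ldr/str
}
```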
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D143895
Change-Id: I41a520c13ce21da503ca45019979bfceb8b648fa
|
|
This reverts commit 0827e2fa3fd15b49fd2d0fc676753f11abb60cab.
Failing buildbot, perhaps due to `-std=c++20`.
|
|
`AArch64TargetLowering::isLegalAddressingMode` has a number of
defects, including accepting an addressing mode which consists of only
an immediate operand, or not checking the offset range for an
addressing mode in the form `1*ScaledReg + Offs`.
This patch fixes the above issues.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D143895
Change-Id: I756fa21941844ded44f082ac7eea4391219f9851
|
|
With opaque pointers the scevgep / uglygep distinction no longer
makes sense -- GEPs are always emitted in offset-based representation.
|
|
The original code crashed when the load/store memory type was a
struct, because isIndexedLoadLegal/isIndexedStoreLegal don't support
struct types. So we limit the load/store memory type to integers.
Original commit message:
When the latch block is different from the header block, IVInc will be
expanded in the latch block. We can't generate the post-index
load/store in this case. But if the IVInc is only used in the loop, we
can actually still use the post-index load/store, because we don't
care about the last IVInc value when we exit the loop.
So, try to hoist IVInc to help the backend generate more post-index
load/stores.
Fix #53625
Reviewed By: eopXD
Differential Revision: https://reviews.llvm.org/D138636
|
|
IR is now always parsed in opaque pointer mode, unless
-opaque-pointers=0 is explicitly given. There is no automatic
detection of typed pointers anymore.
The -opaque-pointers=0 option is added to any remaining IR tests
that haven't been migrated yet.
Differential Revision: https://reviews.llvm.org/D141912
|
|
The original commit seems to cause a regression in a numba test.
This reverts commit b1b4758e7f4b2ffe1faa28b00eb037832e5d26a7.
|
|
When the latch block is different from the header block, IVInc will be
expanded in the latch block. We can't generate the post-index
load/store in this case.
But if the IVInc is only used in the loop, we can actually still use
the post-index load/store, because we don't care about the last IVInc
value when we exit the loop.
So, try to hoist IVInc to help the backend generate more post-index
load/stores.
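A minimal sketch of the shape involved (the loop is made up, assuming the branch in the body keeps the latch distinct from the header):
```cpp
// The if gives the loop a latch block separate from its header, so the
// IV increment is expanded in the latch. Since i's final value is dead
// after the loop, the increment can be hoisted next to the store and
// selected as a post-index store, e.g. str w8, [x0], #4.
void clamp(int *p, int n) {
  for (int i = 0; i < n; ++i) {
    int v = p[i];
    if (v < 0)
      v = 0;   // body control flow: latch != header
    p[i] = v;  // unconditional store: post-index candidate
  }
}
```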
Fix #53625
Reviewed By: eopXD
Differential Revision: https://reviews.llvm.org/D138636
|
|
These are all tests where conversion worked automatically, and
required no manual fixup.
|
|
mismatched IV's in PHI's in common successor block""
One of these two changes is exposing (or causing) some more miscompiles.
A reproducer is in progress, so reverting until resolved.
This reverts commit 428f36401b1b695fd501ebfdc8773bed8ced8d4e.
|
|
in PHI's in common successor block"
This reverts commit 37b8f09a4b61bf9bf9d0b9017d790c8b82be2e17,
and returns commit 1bd0b82e508d049efdb07f4f8a342f35818df341.
The miscompile was in InstCombine, and it has been addressed.
This tries to approach the problem noted by @arsenm:
terrible codegen for `__builtin_fpclassify()`:
https://godbolt.org/z/388zqdE37
Just because the PHI in the common successor happens to have different
incoming values for these two blocks, doesn't mean we have to give up.
It's quite easy to deal with this, we just need to produce a select:
https://alive2.llvm.org/ce/z/000srb
Now, the cost model for this transform is rather overly strict, so
this will basically never fire: we tally the selects needed (over all
predecessors) into NumBonusInsts.
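The motivating case from the message, for reference (this compiles as-is; the builtin takes the five classification results followed by the value):
```cpp
#include <math.h>

// Classification expands to a chain of compares and branches; when the
// merge block's PHI has differing incoming values, emitting a select
// lets SimplifyCFG keep folding instead of giving up.
int classify(double x) {
  return __builtin_fpclassify(FP_NAN, FP_INFINITE, FP_NORMAL,
                              FP_SUBNORMAL, FP_ZERO, x);
}
```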
Differential Revision: https://reviews.llvm.org/D139275
|
|
in PHI's in common successor block"
This reverts commit 1bd0b82e508d049efdb07f4f8a342f35818df341, since it leads to
miscompiles. See https://reviews.llvm.org/D139275#3993229 and
https://reviews.llvm.org/D139275#4001580.
|
|
in common successor block
This tries to approach the problem noted by @arsenm:
terrible codegen for `__builtin_fpclassify()`:
https://godbolt.org/z/388zqdE37
Just because the PHI in the common successor happens to have different
incoming values for these two blocks, doesn't mean we have to give up.
It's quite easy to deal with this, we just need to produce a select:
https://alive2.llvm.org/ce/z/000srb
Now, the cost model for this transform is rather overly strict, so
this will basically never fire: we tally the selects needed (over all
predecessors) into NumBonusInsts.
Differential Revision: https://reviews.llvm.org/D139275
|
|
|
|
#53877
|
|
We would like to start pushing -mcpu=generic towards enabling the set of
features that improves performance for some CPUs, without hurting any
others. A blend of the performance options hopefully beneficial to all
CPUs. The largest part of that is enabling in-order scheduling using the
Cortex-A55 schedule model. This is similar to the Arm backend change
from eecb353d0e25ba which made -mcpu=generic perform in-order scheduling
using the cortex-a8 schedule model.
The idea is that in-order CPUs require the most help in instruction
scheduling, whereas out-of-order CPUs can, for the most part, schedule
around different codegen. Our benchmarking suggests that
hypothesis holds. When running on an in-order core this improved
performance by 3.8% geomean on a set of DSP workloads, 2% geomean on
some other embedded benchmark and between 1% and 1.8% on a set of
singlecore and multicore workloads, all running on a Cortex-A55 cluster.
On an out-of-order CPU the results are a lot more noisy but show flat
performance or an improvement. On the set of DSP and embedded
benchmarks, run on a Cortex-A78 there was a very noisy 1% speed
improvement. Using the most detailed results I could find, SPEC2006 runs
on a Neoverse N1 show a small increase in instruction count (+0.127%),
but a decrease in cycle counts (-0.155%, on average). The instruction
count is very low noise, the cycle count is more noisy with a 0.15%
decrease not being significant. SPEC2k17 shows a small decrease (-0.2%)
in instruction count leading to a -0.296% decrease in cycle count. These
results are within noise margins but tend to show a small improvement in
general.
When specifying an Apple target, clang will set "-target-cpu apple-a7"
on the command line, so should not be affected by this change when
running from clang. This also doesn't enable more runtime unrolling like
-mcpu=cortex-a55 does, only changing the schedule used.
A lot of existing tests have been updated. This is a summary of the important
differences:
- Most changes are the same instructions in a different order.
- Sometimes this leads to very minor inefficiencies, such as requiring
an extra mov to move variables into r0/v0 for the return value of a test
function.
- misched-fusion.ll was no longer fusing the pairs of instructions it
should, as per D110561. I've changed the schedule used in the test
for now.
- neon-mla-mls.ll now uses "mul; sub" as opposed to "neg; mla" due to
the different latencies. This seems fine to me.
- Some SVE tests do not always remove movprfx where they did before due
to different register allocation giving different destructive forms.
- The tests argument-blocks-array-of-struct.ll and arm64-windows-calls.ll
produce two LDR where they previously produced an LDP due to
store-pair-suppress kicking in.
- arm64-ldp.ll and arm64-neon-copy.ll are missing pre/postinc on LDP.
- Some tests such as arm64-neon-mul-div.ll and
ragreedy-local-interval-cost.ll have more, less or just different
spilling.
- In aarch64_generated_funcs.ll.generated.expected one part of the
function is no longer outlined. Interestingly, if I switch this to use
any other schedule, even less is outlined.
Some of these are expected to happen, such as differences in outlining
or register spilling. There will be places where these result in worse
codegen, places where they are better, with the SPEC instruction counts
suggesting it is not a decrease overall, on average.
Differential Revision: https://reviews.llvm.org/D110830
|
|
This updates a few more check lines: some MTE tests that were already
close to auto-generated, and some CodeGenPrepare/consthoist tests
where being able to see the entire code sequence is useful for determining
whether code differences are improvements or not.
|
|
This changes the lowering of f32 and f64 COPY from a 128-bit vector ORR
to an fmov of the appropriate type. At least on some CPUs with 64-bit
NEON data paths this is expected to be faster, and it shouldn't be
slower on any CPU that treats fmov as a register rename.
Differential Revision: https://reviews.llvm.org/D106365
|
|
Add a comment when there is a shifted value:
  `add x9, x0, #291, lsl #12 ; =1191936`
but not when the immediate value is unshifted:
  `subs x9, x0, #256 ; =256`
since in that case the comment adds nothing for the reader.
Differential Revision: https://reviews.llvm.org/D107196
|
|
This patch changed the isLegalUse check to ensure that
LSRInstance::GenerateConstantOffsetsImpl generates an
offset that results in a legal addressing mode and
formula. The check is changed to look similar to the
assert check used for illegal formulas.
Differential Revision: https://reviews.llvm.org/D100383
Change-Id: Iffb9e32d59df96b8f072c00f6c339108159a009a
|
|
This prevents us from doing things like LICM'ing it out of a loop,
which is usually a net loss because we end up having to spill a
callee-saved FPR to accommodate it.
This does perturb instruction scheduling around this instruction,
so a number of tests had to be updated to account for it.
Reviewed By: t.p.northover
Differential Revision: https://reviews.llvm.org/D87316
|