|
addrspace(2/3/5)` in AMDGPU tests (#181710)
|
|
Allow SCEVPtrToAddr as a cast in the assertion in SCEVDbgValueBuilder.
SCEVPtrToAddr is handled similarly to SCEVPtrToInt.
Fixes a crash with debug info after bd40d1de9c9ee, which started to
generate ptrtoaddr instead of ptrtoint expressions.
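As a minimal sketch (hypothetical function, not taken from the patch), the two casts the builder now treats alike:
```llvm
define i64 @addr(ptr %p) {
  ; Previously SCEV-based salvaging only ever saw this form:
  %i = ptrtoint ptr %p to i64
  ; Since bd40d1de9c9ee this form is generated instead, and the
  ; assertion in SCEVDbgValueBuilder now accepts it as well:
  %a = ptrtoaddr ptr %p to i64
  ret i64 %a
}
```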
|
|
From https://github.com/llvm/llvm-project/pull/171456#issuecomment-3675488354.
|
|
Test case for:
https://github.com/llvm/llvm-project/pull/171456#issuecomment-3670755137
|
|
Currently OptimizeLoopTermCond can only convert a cmp instruction to
use a postincrement induction variable, which means it can't handle
predicated loops where the termination condition comes from
get_active_lane_mask. Relax this restriction so that we can handle any
kind of instruction, though only if it's the instruction immediately
before the branch (except for possibly an extractelement).
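As a rough sketch (hypothetical IR, fixed-width vectors for brevity) of the kind of latch this now permits:
```llvm
loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  %iv.next = add i64 %iv, 4
  ; The termination condition is not a cmp; it comes from
  ; get_active_lane_mask, immediately before the branch (modulo the
  ; extractelement).
  %mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 %iv.next, i64 %n)
  %cond = extractelement <4 x i1> %mask, i64 0
  br i1 %cond, label %loop, label %exit
```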
|
|
Currently we try to hoist the transformed IV increment instruction to
the header block to help with generation of postincrement instructions,
but this only works if the user instruction is also in the header. We
should instead be trying to insert it in the same block as the user.
|
|
|
|
For predicated SVE instructions where we know that the inactive lanes
are undef, it is better to pick a destination register that is not
unique. This avoids introducing a movprfx to copy a unique register to
the destination operand, which would be needed to comply with the tied
operand constraints.
For example:
```
%src1 = COPY $z1
%src2 = COPY $z2
%dst = SDIV_ZPZZ_S_UNDEF %p, %src1, %src2
```
Here it is beneficial to pick $z1 or $z2 as the destination register,
because if a unique register (e.g. $z0) had been chosen, the
pseudo-expand pass would need to insert a MOVPRFX to expand the
operation into:
```
$z0 = SDIV_ZPZZ_S_UNDEF $p0, $z1, $z2
->
$z0 = MOVPRFX $z1
$z0 = SDIV_ZPmZ_S $p0, $z0, $z2
```
By picking $z1 directly, we'd get:
```
$z1 = SDIV_ZPmZ_S $p0, $z1, $z2
```
|
|
When a load/store is conditionally executed in a loop, it isn't a
candidate for pre/post-index addressing, as the increment of the address
would only happen on those loop iterations where the load/store is
executed.
Detect this and only discount the AddRec cost when the load/store is
unconditional.
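A minimal sketch (hypothetical names) of the shape that is now excluded:
```llvm
loop:
  %p = phi ptr [ %base, %entry ], [ %p.next, %latch ]
  br i1 %c, label %if, label %latch
if:
  ; The load is guarded, but %p advances on every iteration, so a
  ; post-indexed load could not absorb the increment.
  %v = load i32, ptr %p, align 4
  br label %latch
latch:
  %p.next = getelementptr i8, ptr %p, i64 4
  br i1 %done, label %exit, label %loop
```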
|
|
Post cleanup for #164534.
|
|
This reverts commit fd58f235f8c5bd40d98acfd8e7fb11d41de301c7.
The recommit contains an extra check to make sure that D is a multiple of
C2 if C2 > C1. This fixes the issue that caused the revert (fd58f235f8c).
Tests have been added in 6a726e9a4d3d0.
Original message:
If C2 >u C1 and C1 >u 1, fold to A /u (C2 /u C1).
Depends on https://github.com/llvm/llvm-project/pull/157555.
Alive2 Proof: https://alive2.llvm.org/ce/z/BWvQYN
PR: https://github.com/llvm/llvm-project/pull/157656
|
|
Add additional test coverage for
https://github.com/llvm/llvm-project/pull/157656 covering cases where
the dividend is not a multiple of the divisor, which caused the revert
of 70012fda631.
|
|
The way that loop strength reduction works is that the target has to
decide upfront whether it wants its addressing to be preindex,
postindex, or neither. This choice affects:
* Which potential solutions we generate
* Whether we consider a pre/post-index load/store as costing an AddRec
or not.
None of these choices is a good fit for either AArch64 or ARM, where
both preindex and postindex addressing are typically free:
* If we pick None, then we count pre/post-index addressing as costing one
AddRec more than it should, so we don't pick them when we should.
* If we pick PreIndexed or PostIndexed, then we get the correct cost for
that addressing type, but still get it wrong for the other, and we also
exclude potential solutions using offset addressing that could have lower
cost.
This patch adds an "all" addressing mode that causes all potential
solutions to be generated and counts both pre- and post-index as having
an AddRecCost of zero. Unfortunately this reveals problems elsewhere in
how we calculate the cost of things, which need to be fixed before we
can make use of it.
|
|
Reverts llvm/llvm-project#157656
There are multiple reports that this causes miscompiles, both in the
MSan test suite after bootstrapping and in rustc. Let's revert for now,
and work to capture a reproducer next week.
|
|
If C2 >u C1 and C1 >u 1, fold to A /u (C2 /u C1).
Depends on https://github.com/llvm/llvm-project/pull/157555.
Alive2 Proof: https://alive2.llvm.org/ce/z/BWvQYN
PR: https://github.com/llvm/llvm-project/pull/157656
|
|
(#156730)
Alive2 Proof: https://alive2.llvm.org/ce/z/JoHJE9
PR: https://github.com/llvm/llvm-project/pull/156730
|
|
Move transform test to llvm/test/Transforms/LoopStrengthReduce/X86,
clean up the names a bit and regenerate check lines.
|
|
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.
This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
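A before/after sketch of the intrinsic at a call site:
```llvm
%a = alloca i64
; Old form: explicit size argument (8 bytes here).
call void @llvm.lifetime.start.p0(i64 8, ptr %a)
; New form: the size is implied by the alloca.
call void @llvm.lifetime.start.p0(ptr %a)
```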
|
|
C1)). (#150651)
|
|
We should ignore uses of pointers in lifetime intrinsics, as these are
not actually materialized in the final code and so don't affect register
pressure or anything else LSR needs to model.
Handling these only results in peculiar rewrites where additional
intermediate GEPs are introduced.
|
|
instruction (#147241)
Fix #147238
|
|
This reverts commit a60405c2d51ffe428dd38b8b1020c19189968f76.
Causes buildbot failures.
|
|
|
|
Look for the reg-update instruction (to merge with the mem instruction
into pre/post-increment form) not only inside a single MBB but also along
a CF path going downward without side entries, such that BaseReg is alive
along it but not at its exits.
The regression test is updated accordingly.
|
|
Reverting because of mis-compiles:
- https://github.com/llvm/llvm-project/pull/131837
- https://github.com/llvm/llvm-project/pull/146320
- https://github.com/llvm/llvm-project/pull/146337
|
|
The insertion point of COPY isn't always optimal and could eventually
lead to a worse block layout; see the regression test in the first
commit.
This change affects many architectures, but the total instruction count
in the test cases seems to be slightly lower.
|
|
canHoistIVInc was made to only allow integer types, to avoid a crash in
isIndexedLoadLegal/isIndexedStoreLegal due to them failing an assertion
in getValueType (or rather in MVT::getVT, which gets called from it)
when passed a struct type. Adjusting these functions to pass
AllowUnknown=true to getValueType means we don't get an assertion
failure (MVT::Other is returned, for which TLI->isIndexedLoadLegal
should then return false), meaning we can remove the check for integer
types.
|
|
The existing way of managing clustered nodes was to add weak edges
between the neighbouring cluster nodes, forming a sort of ordered queue,
which is later recorded as `NextClusterPred` or `NextClusterSucc` in
`ScheduleDAGMI`.
But instructions may not be picked in the exact order of the queue. For
example, suppose we have a queue of cluster nodes A B C. During
scheduling, node B might be picked first; then it is very likely that we
only cluster B and C for top-down scheduling (leaving A alone).
Another issue is:
```
if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
  std::swap(SUa, SUb);
if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
```
may break the cluster queue.
For example, say we want to cluster nodes (in `MemOpRecords` order): 1 3
2. Normally 1 (SUa) becomes a pred of 3 (SUb). But when it comes to
(3, 2), as 3 (SUa) > 2 (SUb), we reorder the two nodes, which makes 2 a
pred of 3. Both 1 and 2 are now preds of 3, but there is no edge between
1 and 2, so we get a broken cluster chain.
To fix both issues, this change introduces an unordered set, which
should help improve clustering in some hard cases.
One key reason the change causes so many test check changes is that, as
the cluster candidates are no longer ordered, they may be picked in a
different order than before.
The most affected targets are AMDGPU, AArch64 and RISC-V.
For RISC-V, most changes are just minor instruction reordering, with no
obvious regression.
For AArch64, some combining of ldr into ldp is affected, with two cases
regressed and two improved. The deeper reason is that the machine
scheduler cannot cluster them well either before or after the change,
and the later load-combine algorithm is also not smart enough.
For AMDGPU, some cases use more v_dual instructions while some are
regressed; it seems less critical. The test `v_vselect_v32bf16` gets
more buffer_loads clustered into clauses.
|
|
Since e39f6c1844fab59c638d8059a6cf139adb42279a, opt will infer the
correct datalayout when given a triple. Avoid explicitly specifying it
in tests that depend on the AMDGPU target being present, to avoid the
string becoming out of sync with the TargetInfo value.
Only tests with REQUIRES: amdgpu-registered-target or a local lit.cfg
were updated to ensure that tests for non-target-specific passes that
happen to use the AMDGPU layout still pass when building with a limited
set of targets.
Reviewed By: shiltian, arsenm
Pull Request: https://github.com/llvm/llvm-project/pull/137921
|
|
Currently, given:
```cpp
uint64_t incb(uint64_t x) {
return x+svcntb();
}
```
LLVM generates:
```gas
incb:
addvl x0, x0, #1
ret
```
Which is equivalent to:
```gas
incb:
incb x0
ret
```
However, on microarchitectures like the Neoverse V2 and Neoverse V3,
the second form (with INCB) can have significantly better latency and
throughput (according to their SWOG). On the Neoverse V2, for example,
ADDVL has a latency and throughput of 2, whereas some forms of INCB
have a latency of 1 and a throughput of 4. The same applies to DECB.
This patch adds patterns to prefer the cheaper INCB/DECB forms over
ADDVL where applicable.
|
|
Setting `EnableLoopTermFold` enables the `loop-term-fold` pass.
|
|
This reverts commit 71f43a7c42a37d18be98b0885d62ef76e658f242.
Caused build bot failures:
https://lab.llvm.org/buildbot/#/builders/190/builds/17267
https://lab.llvm.org/buildbot/#/builders/144/builds/21445
https://lab.llvm.org/buildbot/#/builders/46/builds/14343
Please consider fixing the test failure in long-array-initialize.ll
before recommitting.
|
|
Setting `EnableLoopTermFold` enables the `loop-term-fold` pass.
|
|
So that LoopStrengthReduce can work for MIPS.
The code is copied from RISC-V.
---------
Co-authored-by: qethu <190734095+qethu@users.noreply.github.com>
|
|
These date back to when the non-intrinsic format of variable locations
was still being tested and was behind a compile-time flag, so not all
builds / bots would correctly run them. The solution at the time, to get
at least some test coverage, was to have tests opt in to non-intrinsic
debug-info if it was built into LLVM.
Nowadays, the non-intrinsic format is the default and has been on for
more than a year, so there's no need for this flag to exist.
(I've downgraded the flag from "try" to explicitly requesting
non-intrinsic format in some places, so that we can deal with tests that
are explicitly about non-intrinsic format in their own commit).
|
|
Currently, given:
```cpp
svuint8_t foo(uint8_t *x) {
return svld1(svptrue_b8(), x);
}
```
We generate:
```gas
foo:
ptrue p0.b
ld1b { z0.b }, p0/z, [x0]
ret
```
However, on little-endian and with unaligned memory accesses allowed, we
could instead be using LDR as follows:
```gas
foo:
ldr z0, [x0]
ret
```
The second form avoids the predicate dependency.
Likewise for other types and stores.
|
|
Providing the correct operand index allows addPhysRegDataDeps to compute
the correct latency.
Pull Request: https://github.com/llvm/llvm-project/pull/123541
|
|
This PR removes the old `nocapture` attribute, replacing it with the new
`captures` attribute introduced in #116990. This change is
intended to be essentially NFC, replacing existing uses of `nocapture`
with `captures(none)` without adding any new analysis capabilities.
Making use of non-`none` values is left for a followup.
Some notes:
* `nocapture` will be upgraded to `captures(none)` by the bitcode
reader.
* `nocapture` will also be upgraded by the textual IR reader. This is to
make it easier to use old IR files and somewhat reduce the test churn in
this PR.
* Helper APIs like `doesNotCapture()` will check for `captures(none)`.
* MLIR import will convert `captures(none)` into an `llvm.nocapture`
attribute. The representation in the LLVM IR dialect should be updated
separately.
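At a use site the upgrade is a pure spelling change, e.g.:
```llvm
; Old attribute:
declare void @use(ptr nocapture)
; Auto-upgraded replacement:
declare void @use(ptr captures(none))
```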
|
|
The `ptx_kernel` calling convention is a more idiomatic and standard way
of specifying an NVPTX kernel than using metadata, which is not
supposed to change the meaning of the program. Further, checking the
calling convention is significantly faster than traversing the metadata,
improving compile time.
This change updates the clang and mlir frontends as well as the
NVPTXCtorDtorLowering pass to emit kernels using the calling convention.
In addition, this updates all NVPTX unit tests to use the calling
convention as well.
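A sketch of the two encodings (hypothetical kernel names):
```llvm
; Metadata style: kernel-ness recorded in module-level metadata.
define void @kern_old(ptr %out) {
  ret void
}
!nvvm.annotations = !{!0}
!0 = !{ptr @kern_old, !"kernel", i32 1}

; Calling-convention style: checked directly on the function.
define ptx_kernel void @kern_new(ptr %out) {
  ret void
}
```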
|
|
Pull Request: https://github.com/llvm/llvm-project/pull/121305
|
|
This PR removes tests with `br i1 undef` under
`llvm/test/Transforms/Loop*` and `Lower*`.
|
|
The expansion of a SCEVUnknown is trivial (it's just the wrapped value).
If we try to reuse an existing value it might be a more complex
expression that simplifies to the SCEVUnknown.
This is inspired by https://github.com/llvm/llvm-project/issues/114879,
because SCEVExpander replacing a constant with a phi node is just silly.
(I don't consider this a fix for that issue though.)
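A minimal illustration of the reuse hazard (hypothetical, assuming a global @g): the phi below simplifies to @g, so SCEV models it as the SCEVUnknown @g, and expanding that SCEVUnknown could previously pick the phi as an "existing value" instead of just emitting @g:
```llvm
merge:
  %phi = phi ptr [ @g, %a ], [ @g, %b ]
```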
|
|
See the following case:
```
@GlobIntONE = global i32 0, align 4

define ptr @src() {
entry:
  br label %for.body.peel.begin
for.body.peel.begin: ; preds = %entry
  br label %for.body.peel
for.body.peel: ; preds = %for.body.peel.begin
  br i1 true, label %cleanup.peel, label %cleanup.loopexit.peel
cleanup.loopexit.peel: ; preds = %for.body.peel
  br label %cleanup.peel
cleanup.peel: ; preds = %cleanup.loopexit.peel, %for.body.peel
  %retval.2.peel = phi ptr [ undef, %for.body.peel ], [ @GlobIntONE, %cleanup.loopexit.peel ]
  br i1 true, label %for.body.peel.next, label %cleanup7
for.body.peel.next: ; preds = %cleanup.peel
  br label %for.body.peel.next1
for.body.peel.next1: ; preds = %for.body.peel.next
  br label %entry.peel.newph
entry.peel.newph: ; preds = %for.body.peel.next1
  br label %for.body
for.body: ; preds = %cleanup, %entry.peel.newph
  %retval.0 = phi ptr [ %retval.2.peel, %entry.peel.newph ], [ %retval.2, %cleanup ]
  br i1 false, label %cleanup, label %cleanup.loopexit
cleanup.loopexit: ; preds = %for.body
  br label %cleanup
cleanup: ; preds = %cleanup.loopexit, %for.body
  %retval.2 = phi ptr [ %retval.0, %for.body ], [ @GlobIntONE, %cleanup.loopexit ]
  br i1 false, label %for.body, label %cleanup7.loopexit
cleanup7.loopexit: ; preds = %cleanup
  %retval.2.lcssa.ph = phi ptr [ %retval.2, %cleanup ]
  br label %cleanup7
cleanup7: ; preds = %cleanup7.loopexit, %cleanup.peel
  %retval.2.lcssa = phi ptr [ %retval.2.peel, %cleanup.peel ], [ %retval.2.lcssa.ph, %cleanup7.loopexit ]
  ret ptr %retval.2.lcssa
}

define ptr @tgt() {
entry:
  br label %for.body.peel.begin
for.body.peel.begin: ; preds = %entry
  br label %for.body.peel
for.body.peel: ; preds = %for.body.peel.begin
  br i1 true, label %cleanup.peel, label %cleanup.loopexit.peel
cleanup.loopexit.peel: ; preds = %for.body.peel
  br label %cleanup.peel
cleanup.peel: ; preds = %cleanup.loopexit.peel, %for.body.peel
  %retval.2.peel = phi ptr [ undef, %for.body.peel ], [ @GlobIntONE, %cleanup.loopexit.peel ]
  br i1 true, label %for.body.peel.next, label %cleanup7
for.body.peel.next: ; preds = %cleanup.peel
  br label %for.body.peel.next1
for.body.peel.next1: ; preds = %for.body.peel.next
  br label %entry.peel.newph
entry.peel.newph: ; preds = %for.body.peel.next1
  br label %for.body
for.body: ; preds = %cleanup, %entry.peel.newph
  br i1 false, label %cleanup, label %cleanup.loopexit
cleanup.loopexit: ; preds = %for.body
  br label %cleanup
cleanup: ; preds = %cleanup.loopexit, %for.body
  br i1 false, label %for.body, label %cleanup7.loopexit
cleanup7.loopexit: ; preds = %cleanup
  %retval.2.lcssa.ph = phi ptr [ %retval.2.peel, %cleanup ]
  br label %cleanup7
cleanup7: ; preds = %cleanup7.loopexit, %cleanup.peel
  %retval.2.lcssa = phi ptr [ %retval.2.peel, %cleanup.peel ], [ %retval.2.lcssa.ph, %cleanup7.loopexit ]
  ret ptr %retval.2.lcssa
}
```
1. `simplifyInstruction(%retval.2.peel)` returns `@GlobIntONE`. Thus,
`ScalarEvolution::createNodeForPHI` returns SCEV expr `@GlobIntONE` for
`%retval.2.peel`.
2. `SimplifyIndvar::replaceIVUserWithLoopInvariant` tries to replace the
use of `%retval.2.peel` in `%retval.2.lcssa.ph` with `@GlobIntONE`.
3. `simplifyLoopAfterUnroll -> simplifyLoopIVs -> SCEVExpander::expand`
reuses `%retval.2.peel = phi ptr [ undef, %for.body.peel ], [
@GlobIntONE, %cleanup.loopexit.peel ]` to generate code for
`@GlobIntONE`. This is incorrect.
This patch disallows simplifying `phi(undef, X)` to `X` by setting
`CanUseUndef` to false.
Closes https://github.com/llvm/llvm-project/issues/114879.
|
|
Comments attached to the `ScaledReg` field of `struct Formula` explain
that `ScaledReg` must be non-null when `Scale` is non-zero.
This fixes up a code path where this invariant is violated, and adds an
assert to ensure the invariant holds.
Without this patch, the compiler aborts on the attached test case.
Fixes #76504
|
|
wide (#110979)
Fixes #110494
|
|
As of rev ea222be0d, LLVM's assembler will actually try to honour the
"fill value" part of p2align directives. X86 printed these as 0x90, which
isn't actually what it wanted: we want multi-byte nops for .text
padding. Since ea222be0d, compiling via a textual assembly file produces
single-byte nop padding, while the built-in assembler produces
multi-byte nops. This divergent behaviour is undesirable.
To fix: don't set the byte padding field for x86, which allows the
assembler to pick multi-byte nops. Test that we get the same multi-byte
padding when compiled via textual assembly or directly to object file.
Added same-align-bytes-with-llasm-llobj.ll to that effect, updated
numerous other tests to not contain check-lines for the explicit padding.
|
|
When expanding SCEV adds to geps, transfer the nuw flag to the resulting
gep. (Note that this doesn't apply to IV increment GEPs, which go
through a different code path.)
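For instance (hypothetical values), an add known to be nuw now expands to a gep carrying the same flag:
```llvm
; %base + %off, where the SCEV add had the nuw flag:
%addr = getelementptr nuw i8, ptr %base, i64 %off
```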
|
|
As pointed out in the review of #102133, SCEVExpander currently
incorrectly reuses GEP instructions that have poison-generating flags
set. Fix this by clearing the flags on the reused instruction.
|
|
|
|
form" (#107380)
Motivating example: https://godbolt.org/z/eb97zrxhx
Here we have two induction variables in the loop: one corresponding to
the i variable (add rdx, 4), the other to res (add rax, 2). The second
induction variable can be removed by rewriteLoopExitValues() (the final
value of res at loop exit is unroll_iter * -2); however, this doesn't
happen because we have duplicated LCSSA phi nodes at the loop exit:
```
; Preheader:
for.body.preheader.new: ; preds = %for.body.preheader
  %unroll_iter = and i64 %N, -4
  br label %for.body

; Loop:
for.body: ; preds = %for.body, %for.body.preheader.new
  %lsr.iv = phi i64 [ %lsr.iv.next, %for.body ], [ 0, %for.body.preheader.new ]
  %i.07 = phi i64 [ 0, %for.body.preheader.new ], [ %inc.3, %for.body ]
  %inc.3 = add nuw i64 %i.07, 4
  %lsr.iv.next = add nsw i64 %lsr.iv, -2
  %niter.ncmp.3.not = icmp eq i64 %unroll_iter, %inc.3
  br i1 %niter.ncmp.3.not, label %for.end.loopexit.unr-lcssa.loopexit, label %for.body, !llvm.loop !7

; Exit blocks
for.end.loopexit.unr-lcssa.loopexit: ; preds = %for.body
  %inc.3.lcssa = phi i64 [ %inc.3, %for.body ]
  %lsr.iv.next.lcssa11 = phi i64 [ %lsr.iv.next, %for.body ]
  %lsr.iv.next.lcssa = phi i64 [ %lsr.iv.next, %for.body ]
  br label %for.end.loopexit.unr-lcssa
```
rewriteLoopExitValues requires the %lsr.iv.next value to have only 2 uses:
one in the LCSSA phi node, the other in the induction phi node. Here we
have 3 uses of this value because of the duplicated LCSSA nodes, so the
transform doesn't apply, leaving an extra add operation inside the loop.
The proposed solution is to accumulate the inserted instructions that
require an LCSSA form update into a SetVector and then call
formLCSSAForInstructions once for this SetVector, so the same
instructions aren't processed twice.
The reland fixes the issue with the preserve-lcssa.ll test: it failed
when the x86_64-unknown-linux-gnu target is unavailable in opt. The
changes are moved into a separate duplicated-phis.ll test with an
explicit x86 target requirement, to fix bots which do not build this
target.
|