The initial implementation used a very crude check where a value was
considered ephemeral if it had only one use. This is insufficient if
there are multiple assumes acting on the same value, or in more complex
cases like cyclic phis.
Generalize this to a more typical ephemeral value check, i.e. make sure
that all transitive users are in assumes, while stopping at
side-effecting instructions.
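A minimal IR sketch (function and values hypothetical) of a case the one-use check missed but the generalized check handles:
```
define void @f(i64 %x) {
  ; %cmp has two users, but both are assumes, so it is still
  ; ephemeral and both assumes can be dropped:
  %cmp = icmp ugt i64 %x, 10
  call void @llvm.assume(i1 %cmp)
  call void @llvm.assume(i1 %cmp)
  ret void
}

declare void @llvm.assume(i1 noundef)
```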
|
|
The loop fusion pass will use the information provided by the recent
DA patch to fuse additional legal loops, including those with
forward loop-carried dependencies.
|
|
(#149049)
If a direction vector with all `*` elements, like `[* * *]`, is present,
it indicates that none of the loop pairs are legal to interchange. In
such cases, continuing the analysis is meaningless.
This patch introduces a check that detects such direction vectors and
exits early when one is found. This slightly reduces compile time.
|
|
This extends the DropUnnecessaryAssumes pass to also handle operand
bundle assumes. For this purpose, export the affected value analysis for
operand bundles from AssumptionCache.
If the bundle only affects ephemeral values, drop it. If all bundles on
an assume are dropped, drop the whole assume.
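For illustration (names hypothetical), an assume whose only information is a bundle on an otherwise unused value:
```
; %p has no users outside assumes, so the "align" bundle only
; affects ephemeral values; dropping it leaves no bundles, so the
; whole assume is dropped as well:
call void @llvm.assume(i1 true) ["align"(ptr %p, i64 16)]
```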
|
|
Assume operand bundles are now emitted in a few more places, including
in various places in libc++. Add a dedicated ID for them.
PR: https://github.com/llvm/llvm-project/pull/158078
|
|
ResultElem stores a weak handle of an assume, plus an index for
referring to a specific operand bundle. This makes sense for the results
of assumptionsFor(), which refers to specific operands of assumes.
However, assumptions() is a plain list of assumes. It does *not* contain
separate entries for each operand bundle. The operand bundle index is
always ExprResultIdx.
As such, we should be directly using WeakVH for this case, without the
additional wrapper.
|
|
Summary:
The changes made in https://github.com/llvm/llvm-project/pull/156057
allow the alignment value to be increased. We assert effectively
infinite alignment when the pointer argument is invalid / null. The
problem is that, for whatever reason, the masked load / store functions
use i32 for their alignment value, which means this gets truncated to
zero.
Add a special check for this; long term, we probably want to just remove
this argument entirely.
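A hedged sketch of the hazard: the alignment operand of these intrinsics is a plain i32, so an "effectively infinite" alignment like 1 << 32 wraps to 0 when written into it (values illustrative):
```
declare <4 x float> @llvm.masked.load.v4f32.p0(ptr, i32 immarg, <4 x i1>, <4 x float>)

; with a null pointer we may assert an alignment of 1 << 32, which
; does not fit in the i32 alignment operand and would truncate to 0:
%v = call <4 x float> @llvm.masked.load.v4f32.p0(ptr null, i32 4, <4 x i1> %m, <4 x float> %pt)
```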
|
|
This patch introduces a new optimization in SROA that handles the
pattern where multiple non-overlapping vector `store`s completely fill
an `alloca`.
The current approach to handle this pattern introduces many `.vecexpand`
and `.vecblend` instructions, which can dramatically slow down
compilation when dealing with large `alloca`s built from many small
vector `store`s. For example, consider an `alloca` of type `<128 x
float>` filled by 64 `store`s of `<2 x float>` each. The current
implementation requires:
- 64 `shufflevector`s (`.vecexpand`)
- 64 `select`s (`.vecblend`)
- All operations use masks of size 128
- These operations form a long dependency chain
This kind of IR is both difficult to optimize and slow to compile,
particularly impacting the `InstCombine` pass.
This patch introduces a tree-structured merge approach that
significantly reduces the number of operations and improves compilation
performance.
Key features:
- Detects when vector `store`s completely fill an `alloca` without gaps
- Ensures no loads occur in the middle of the store sequence
- Uses a tree-based approach with `shufflevector`s to merge stored
values
- Reduces the number of intermediate operations compared to linear
merging
- Eliminates the long dependency chains that hurt optimization
Example transformation:
```
; Before (stores do not have to be in order):
%alloca = alloca <8 x float>
%p1 = getelementptr inbounds i8, ptr %alloca, i64 8
%p2 = getelementptr inbounds i8, ptr %alloca, i64 16
%p3 = getelementptr inbounds i8, ptr %alloca, i64 24
store <2 x float> %val0, ptr %alloca   ; elements 0-1
store <2 x float> %val2, ptr %p2       ; elements 4-5
store <2 x float> %val1, ptr %p1       ; elements 2-3
store <2 x float> %val3, ptr %p3       ; elements 6-7
%result = load <8 x float>, ptr %alloca

; After (tree-structured merge):
%shuffle0 = shufflevector <2 x float> %val0, <2 x float> %val1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%shuffle1 = shufflevector <2 x float> %val2, <2 x float> %val3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%result = shufflevector <4 x float> %shuffle0, <4 x float> %shuffle1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
```
Benefits:
- Logarithmic depth (O(log n)) instead of linear dependency chains
- Fewer total operations for large vectors
- Better optimization opportunities for subsequent passes
- Significant compilation time improvements for large vector patterns
For some large cases, the compile time can be reduced from about 60s to
less than 3s.
---------
Co-authored-by: chengjunp <chengjunp@nividia.com>
|
|
when preserving inbounds (#159515)
If we know that the initial GEP was inbounds, and we change it to a sequence of
GEPs from the same base pointer where every offset is non-negative, then the
new GEPs are inbounds. So far, the implementation only checked if the extracted
offsets are non-negative. In cases where non-extracted offsets can be negative,
this would cause the inbounds flag to be wrongly preserved.
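A hedged IR sketch of the bug (offsets illustrative): the original GEP below is inbounds for the combined offset `%i + 4`, but if `%i` may be negative, the split GEPs must not be marked inbounds even though the extracted constant 4 is non-negative.
```
; before: one inbounds GEP with offset %i + 4 (%i possibly negative)
%off = add i64 %i, 4
%gep = getelementptr inbounds i8, ptr %base, i64 %off

; after extraction: the intermediate GEP applies only %i, which may
; leave the object on its own, so inbounds is not preserved:
%gep1 = getelementptr i8, ptr %base, i64 %i
%gep2 = getelementptr i8, ptr %gep1, i64 4
```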
Fixes an issue in #130617 found by nikic.
|
|
ConstantExpr addrspacecast (#159695)
This PR extends isSafeToCastConstAddrSpace to treat
ConstantAggregateZero like ConstantPointerNull.
Tests show that an extra addrspacecast instruction is removed and that
the address space of an icmp's pointer-vector operand is now inferred.
This change is motivated by inspecting the test in commit f7629f5945f6.
|
|
Previously, an undef pointer operand was only supported for the select
instruction, where undef in the generic AS behaves like `take the other
side`.
This PR extends the support to other instructions, e.g. the phi
instruction. Defer joining and inferring a constant pointer operand
until all other operands' AS states have been considered.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
|
|
This adds a new pass for dropping assumes that are unlikely to be useful
for further optimization.
It works by discarding any assumes whose affected values are one-use
(which implies that they are only used by the assume, i.e. ephemeral).
This pass currently runs at the start of the module optimization
pipeline, that is, post-inline and post-link. Before that point, it is
more likely for previously "useless" assumes to become useful again,
e.g. because an additional user of the value is introduced after
inlining + CSE.
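A minimal sketch of what the pass drops (names hypothetical):
```
; %cmp is one-use: its only user is the assume itself, so the assume
; carries no information for other instructions and is dropped:
%cmp = icmp sgt i32 %x, 0
call void @llvm.assume(i1 %cmp)
```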
|
|
(#159516)
No loop pass seems to use it now after LoopPredication stopped using it
in https://reviews.llvm.org/D111668
|
|
Function analyses in LoopStandardAnalysisResults are marked as preserved
by the loop pass adaptor, because LoopAnalysisManagerFunctionProxy
manually invalidates most of them.
However the proxy doesn't invalidate BFI, since it is only preserved on
a "lossy" basis: see https://reviews.llvm.org/D86156 and
https://reviews.llvm.org/D110438.
So any changes to the CFG will result in BFI giving incorrect results,
which is fine for loop passes which deal with the lossiness.
But the loop pass adaptor still marks it as preserved, which causes the
lossy result to leak out into function passes.
This causes incorrect results when viewed from e.g. LoopVectorizer,
where an innermost loop header may be reported to have a smaller
frequency than its successors.
Fix this by dropping the call to preserve, and add a test with the -O1
pipeline which shows the effects whenever the CFG is changed and
UseBlockFrequencyInfo is set.
I've also dropped it for BranchProbabilityAnalysis too, but I couldn't
test for it since UseBranchProbabilityInfo always seems to be false?
This may be dead code.
|
|
A common idiom is the usage of the PatternMatch match function within a
functional algorithm like all_of. Introduce a match functor to shorten
this idiom.
Co-authored-by: Luke Lau <luke@igalia.com>
|
|
The way that loop strength reduction works is that the target has to
decide upfront whether it wants its addressing to be preindex,
postindex, or neither. This choice affects:
* Which potential solutions we generate
* Whether we consider a pre/post index load/store as costing an AddRec
or not.
None of these choices are a good fit for either AArch64 or ARM, where
both preindex and postindex addressing are typically free:
* If we pick None then we count pre/post index addressing as costing one
addrec more than is correct, so we don't pick them when we should.
* If we pick PreIndexed or PostIndexed then we get the correct cost for
that addressing type, but still get it wrong for the other, and also
exclude potential solutions using offset addressing that could have
lower cost.
This patch adds an "all" addressing mode that causes all potential
solutions to be generated and counts both pre and postindex as having
AddRecCost of zero. Unfortunately this reveals problems elsewhere in
our cost calculations, which need to be fixed before we can make use
of it.
|
|
This fixes a bug where a zero-cost instruction was hoisted to the
nearest common dominator, but the hoisted instruction's operands didn't
dominate the common dominator, resulting in poison values.
|
|
(#154635)
`MD_prof` is safe to keep when e.g. hoisting instructions.
Issue #147390
|
|
coroutine function (#149936)" (#157986)
Since #156788 has resolved #149604, we can revert this workaround now.
|
|
Rather than passes using `!prof = !{!"unknown"}` for cases where we don't have enough information to emit profile values, this patch captures the pass (or some other information) that can help diagnostics - i.e. `!{!"unknown", !"some-pass-name"}`.
For example, suppose we emitted a `select` with the unknown metadata, and, later, end up needing to lower that to a conditional branch. If we observe (via sample profiling, for example) that the branch is biased and would have benefitted from a valid profile, the extra information can help speed up debugging.
We can also (in a subsequent pass) generate optimization remarks about such lowered selects, with a similar aim - identify patterns lowering to `select` that may be worth some extra investment in extracting a more precise profile.
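For illustration, a select carrying the extended form of the metadata (the pass name here is a placeholder, as in the description above):
```
%r = select i1 %c, i32 %a, i32 %b, !prof !0

!0 = !{!"unknown", !"some-pass-name"}
```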
|
|
In simplifyPartiallyRedundantLoad we may replace a load with a PHI of
available values in predecessor blocks. As part of this process, we may
need to cast those values, which we do by inserting a new cast at the
end of the predecessor. These cast instructions should take their debug
location from the load instruction, just as the PHI does; we make an
exception if the predecessor does not unconditionally branch to the
load's block, as in that case we are not guaranteed to reach the load
and must therefore drop its debug location.
Found using https://github.com/llvm/llvm-project/pull/107279.
|
|
The limit 'dfa-max-num-paths' that is used to control the number of
enumerated paths was not checked inside getPathsFromStateDefMap, which
may lead to large memory consumption for complex enough switch
statements.
Reland llvm/llvm-project#145482
|
|
(#149699)
Update unrolling preferences for Apple Silicon CPUs to enable partial
unrolling and runtime unrolling for small loops with reductions.
This builds on top of unroller changes to introduce parallel reduction
phis, if possible: https://github.com/llvm/llvm-project/pull/149470.
PR: https://github.com/llvm/llvm-project/pull/149699
|
|
Fixes #157450
---------
Co-authored-by: Nikita Popov <github@npopov.com>
|
|
Optimize CRC loops using a Sarwate table-lookup by using the results of
HashRecognize in LoopIdiomRecognize. The optimization is checked for
correctness using the SingleSource/UnitTests/HashRecognize tests in
llvm-test-suite.
|
|
When estimating the cost to avoid exponential unswitches of non-trivial
invariant conditions, also consider the size of the parent loop's basic
blocks, ensuring this does not grow unexpectedly.
Fixes: https://github.com/llvm/llvm-project/issues/138509.
|
|
Update code comments and variable/function names to make it more clear
that we handle trunc instructions (and not only sext/zext) when
extracting constant offsets from GEP index expressions.
This, for example, renames the vector ExtInsts to CastInsts.
|
|
In most cases, GEPs should be canonicalized by InstCombine. Bail out on
non-canonical forms for simplicity.
Fixes
https://github.com/llvm/llvm-project/pull/155253#issuecomment-3248457478.
|
|
fixes #154407
In HLSL the GVNPass was adding a phi node on
a target extension type.
https://hlsl.godbolt.org/z/sc14YenEe
This is something we cleaned up in a past PR
(https://github.com/llvm/llvm-project/pull/154620) by introducing
`isTokenLikeTy`. In the case of the GVN pass the target extension type
was still making its way through. This change makes it so that if we
see this type we don't do PRE.
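An IR sketch of the phi that PRE must now refuse to create (the target extension type name is made up):
```
; the operand type is token-like, so PRE bails out rather than
; synthesizing a phi like this:
%h = phi target("example.handle") [ %a, %then ], [ %b, %else ]
```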
|
|
This patch removes bound checks that are dominated by bounded memory
accesses. For example, if we have an array `int A[5]` and `A[idx]` is
performed successfully, we know that `idx u< 5` holds after the load.
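A minimal IR sketch of the idea (assuming the object size is known):
```
define i1 @f(i64 %idx) {
  %A = alloca [5 x i32]
  %ptr = getelementptr inbounds [5 x i32], ptr %A, i64 0, i64 %idx
  %v = load i32, ptr %ptr
  ; the load above accessed the 5-element array without UB, so this
  ; dominated bound check folds to true:
  %inb = icmp ult i64 %idx, 5
  ret i1 %inb
}
```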
compile-time impact (+0.1%):
https://llvm-compile-time-tracker.com/compare.php?from=f0e9bba024d44b55d54b02025623ce4a3ba5a37c&to=5227b08a4a514159ec524d1b1ca18ed8ab5407df&stat=instructions%3Au
llvm-opt-benchmark:
https://github.com/dtcxzyw/llvm-opt-benchmark/pull/2709
Proof: https://alive2.llvm.org/ce/z/JEyjA2
|
|
known (#156057)
Summary:
The masked load / store LLVM intrinsics take an argument for the
alignment. If the user is pessimistic about alignment, they can provide
a value of `1` for an unaligned load. This patch updates
infer-alignment to increase the alignment argument if the known
alignment is greater than the provided one.
Ignoring the gather / scatter versions for now since they contain many
pointers.
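A before/after sketch, under the assumption that `%p` is known to be 16-byte aligned:
```
; before: pessimistic alignment of 1 provided by the frontend
%v0 = call <4 x float> @llvm.masked.load.v4f32.p0(ptr %p, i32 1, <4 x i1> %m, <4 x float> %pt)

; after infer-alignment: the argument is raised to the known value
%v1 = call <4 x float> @llvm.masked.load.v4f32.p0(ptr %p, i32 16, <4 x i1> %m, <4 x float> %pt)
```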
|
|
When zero-cost instructions are hoisted, the simplifyHoistedPhi function
was setting incoming phi values which did not dominate the use, causing
a runtime failure; the rebuildSSA function would then set these to
poison. This commit fixes the issue.
|
|
(#153902)
…d loops
Add to the loop metadata a count of the number of partitions LoopDist
made when distributing a loop, then check for loops which are already
distributed to prevent reprocessing.
We see this happen on some SPEC apps; LoopDist is on by default at
SiFive.
|
|
We just replaced SmallSet<T *, N> with SmallPtrSet<T *, N>, bypassing
the redirection found in SmallSet.h. With that, we no longer need to
include SmallSet.h in many files.
|
|
(#154582)
Drop poison-generating flags on trunc when distributing trunc over
add/sub/or. We need to do this since, for example,
(add (trunc nuw A), (trunc nuw B)) is more poisonous than
(trunc nuw (add A, B)).
In some situations it is pessimistic to drop the flags, such as
when the add in the example above also has the nuw flag. For now we
keep it simple and always drop the flags.
Worth mentioning is that we drop the flags when cloning
instructions and rebuilding the chain. This is done after the
"allowsPreservingNUW" checks in ConstantOffsetExtractor::Extract.
So we still take the "trunc nuw" into consideration when determining
if nuw can be preserved in the gep (which should be ok since that
check also requires that all the involved binary operations have nuw).
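An IR sketch of the flag drop described above (values hypothetical):
```
; before: the add is performed wide and truncated with nuw
%add = add i64 %a, %b
%t = trunc nuw i64 %add to i32

; after distributing the trunc: nuw is dropped on the new truncs,
; since they could be more poisonous than the original expression
%ta = trunc i64 %a to i32
%tb = trunc i64 %b to i32
%t2 = add i32 %ta, %tb
```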
Fixes #154116
|
|
insertParsePoints collects live variables and then deduplicates them
while retaining the original insertion order, which is exactly what
SetVector is designed for. This patch replaces SmallVector with
SmallSetVector while deleting unique_unsorted.
While we are at it, this patch reduces the number of inline elements
to a reasonable level for linear search.
|
|
When separating the constant offset from a GEP, if the pointer operand
is a constant ptradd (likely generated when we performed this transform
on that GEP), we accumulate the offset into the current offset. This
ensures that when there is a chain of GEPs the constant offset reaches
the final memory instruction where it can likely be folded into the
addressing.
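A hedged sketch of the accumulation (offsets illustrative): the pointer operand already carries a constant offset from a previous round of the transform, and the newly extracted constant is folded into it.
```
; %p was produced when this transform ran on an earlier GEP:
%p = getelementptr i8, ptr %base, i64 16
; extracting the constant 8 from the next GEP's index (%i + 8)
; accumulates it with the 16, so the final access sees base + %i + 24:
%q1 = getelementptr i8, ptr %base, i64 %i
%q2 = getelementptr i8, ptr %q1, i64 24
%v = load i32, ptr %q2
```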
|
|
I'm trying to remove the redirection in SmallSet.h:
template <typename PointeeType, unsigned N>
class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N>
{};
to make it clear that we are using SmallPtrSet. There are only a
handful of places that rely on this redirection.
This patch replaces SmallSet with SmallPtrSet where the element type is
a pointer.
|
|
The goal is simply to reduce direct usage of getLength and setLength so
that
if we end up moving memset.pattern (whose length is in elements) there
are fewer places to audit.
|
|
SCCP can use PredicateInfo to constrain ranges based on assume and
branch conditions. Currently, this is only enabled during IPSCCP.
This enables it for SCCP as well, which runs after functions have
already been simplified, while IPSCCP runs pre-inline. To a large
degree, CVP already handles range-based optimizations, but SCCP is more
reliable for the cases it can handle. In particular, SCCP works reliably
inside loops, which is something that CVP struggles with due to LVI
cycles.
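A minimal sketch of the kind of fold this enables (hypothetical function):
```
define i1 @f(i32 %x) {
entry:
  %cmp = icmp ult i32 %x, 10
  br i1 %cmp, label %then, label %exit

then:
  ; PredicateInfo records that %x u< 10 holds here, so SCCP can
  ; fold this comparison to true:
  %cmp2 = icmp ult i32 %x, 20
  ret i1 %cmp2

exit:
  ret i1 false
}
```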
I have made various optimizations to make PredicateInfo more efficient,
but unfortunately this still has significant compile-time cost (around
0.1-0.2%).
|
|
It can happen that the call is originally created as a MemoryDef,
and then later transforms show it is actually read-only and could
be a MemoryUse -- however, this is not guaranteed to be reflected
in MSSA.
|
|
conditions"
This reverts commit e9de32fd159d30cfd6fcc861b57b7e99ec2742ab due to
multiple performance regressions observed across downstream Numba
benchmarks (https://github.com/llvm/llvm-project/issues/138509#issuecomment-3193855772).
While avoiding non-trivial unswitches on newly-cloned loops helps
mitigate the pathological case reported in https://github.com/llvm/llvm-project/issues/138509,
it can also make the IR less friendly to vectorization / loop
canonicalization (in the reported test, previously no select with a
loop-carried dependence existed in the new specialized loops), leading
to the above-mentioned approach being reconsidered.
|
|
This patch replaces SmallSet<T *, N> with SmallPtrSet<T *, N>. Note
that SmallSet.h "redirects" SmallSet to SmallPtrSet for pointer
element types:
template <typename PointeeType, unsigned N>
class SmallSet<PointeeType*, N> : public SmallPtrSet<PointeeType*, N>
{};
We only have 140 instances that rely on this "redirection", with the
vast majority of them under llvm/. Since relying on the redirection
doesn't improve readability, this patch replaces SmallSet with
SmallPtrSet for pointer element types.
|
|
Part of Issue #147390
|
|
This also fixes a bug introduced accidentally in #153651, whereby
`JumpTableToSwitch` would convert all the branch weights to 0 except
for one. It didn't trip the test because `update_test_checks` wasn't
run with `-check-globals`; it is now. This also made it noticeable that
the direct calls promoted from the indirect call inherited the
`VP` metadata, which should be dropped as it no longer makes sense.
|
|
Emit safety guards for pointer accesses when cross-partition loads
exist which have a corresponding store to the same address in a
different partition. This will emit the necessary pointer checks for
these accesses.
The test case was obtained from SuperTest, which SiFive runs regularly.
We enabled LoopDistribution by default in our downstream compiler; this
change was part of that enablement.
|
|
Fixing buildbot failures after PR #153305, e.g.
https://lab.llvm.org/buildbot/#/builders/203/builds/19861
Analysis already depends on `ProfileData`, so the transitive closure of
the dependencies of `ScalarOpts` doesn't change.
Also avoided an extra (and quite unnecessary) dependency on
`Instrumentation`. The API previously used doesn't need to live in
Instrumentation to begin with, but that's something to address in a
follow-up.
|
|
If the indirect call target being recognized as a jump table has profile info, we can accurately synthesize the branch weights of the switch that replaces the indirect call.
Otherwise, we insert the "unknown" `MD_prof` to indicate this is the best we can do here.
Part of Issue #147390
|
|
Limit the re-association of BOps with multiple users to FP and Vector
arithmetic.
|