Age | Commit message (Collapse) | Author | Files | Lines |
|
LAA doesn't keep references to IR outside the loop or references to
SCEVs that may be invalidated, unless runtime checks are needed (either
memory or SCEV predicates). For the current LAA users, it should be
sufficient to invalidate entries for loops that require runtime checks,
thus avoiding analyzing loops again unnecessarily.
This helps reduce compile-time, in particular when removing the
restrictions added in 234cc40adc6.
https://llvm-compile-time-tracker.com/compare.php?from=73894dba2cdbcc00678d0c13a6b61765675f60b4&to=05c6bdc41b5f63696ebeb7116325725fa94f66d6&stat=instructions:u
|
|
This avoids expanding the same bounds multiple times, which helps reduce
the compile-time impact of removing the restrictions added in
234cc40adc6, notably -0.06% on stage1-O3 and -0.05% on both
stage1-ReleaseThinLTO and stage1-ReleaseLTO-g.
https://llvm-compile-time-tracker.com/compare.php?from=8b9ebc4bb86cf0979e05908cbb04336f2d01dda5&to=fabd36f96c31e47ea72653f5a404feaadfc7b5b5&stat=instructions:u
|
|
This is a helper to avoid writing `getModule()->getDataLayout()`. I
regularly try to use this method only to remember it doesn't exist...
`getModule()->getDataLayout()` is also a common (the most common?)
reason why code has to include the Module.h header.
|
|
|
|
Introduce a Loop::getLocStr stolen from LoopVectorize's static function
getDebugLocString in order to have uniform debug output headers across
LoopVectorize, LoopAccessAnalysis, and LoopDistribute. The motivation
for this change is to have UpdateTestChecks recognize the headers and
automatically generate CHECK lines for debug output, with minimal
special-casing.
|
|
733b8b2 ([LAA] Simplify identification of speculatable strides [nfc])
refactored getStrideFromPointer() to compute directly on SCEVs, and
return an SCEV expression instead of a Value. However, it left behind a
call to getUniqueCastUse(), which is completely unnecessary. Remove
this, showing a positive test update, and simplify the surrounding
program logic.
|
|
Avoid wastefully setting CanVecMem in several places in analyzeLoop,
complicating the logic, to get the function to return a bool, and set
CanVecMem in the caller.
|
|
Update LAA to use PSE::getSymbolicMaxBackedgeTakenCount which returns
the minimum of the countable exits.
When analyzing dependences and computing runtime checks, we need the
smallest upper bound on the number of iterations. In terms of memory
safety, it shouldn't matter if any uncomputable exits leave the loop,
as long as we prove that there are no dependences given the minimum of
the countable exits. The same should apply also for generating runtime
checks.
Note that this shifts the responsiblity of checking whether all exit
counts are computable or handling early-exits to the users of LAA.
Depends on https://github.com/llvm/llvm-project/pull/93498
PR: https://github.com/llvm/llvm-project/pull/93499
|
|
This avoids unnecessarily passing a number of parameters, and avoids
needing to add extra parameters in the future.
|
|
This reduces the need for explicitly passing it through multiple layers
of function calls.
|
|
Limit the logic added in https://github.com/llvm/llvm-project/pull/9230
to cases where either sink or source are loop-invariant, to avoid
compile-time increases. This is not needed for correctness.
I am working on follow-up changes to reduce the compile-time impact in
general to allow us to enable this again for any source/sink.
This should fix the compile-time regression introduced by this change:
* compile-time improvement with this change:
https://llvm-compile-time-tracker.com/compare.php?from=4351787fb650da6d1bfb8d6e58753c90dcd4c418&to=b89010a2eb5f98494787c1c3b77f25208c59090c&stat=instructions:u
* compile-time improvement with original patch reverted on top of this
change:
https://llvm-compile-time-tracker.com/compare.php?from=b89010a2eb5f98494787c1c3b77f25208c59090c&to=19a1103fe68115cfd7d6472c6961f4fabe81a593&stat=instructions:u
|
|
Implement NFC improvements spotted during a cursory reading of
LoopAccessAnalysis.
|
|
Use getStartAndEndForAccess to compute the start and end of both src
and sink (factored out to helper in bce3680f45b57f). If they do not
overlap (i.e. SrcEnd <= SinkStart || SinkEnd <= SrcStart), there is no
dependence, regardless of stride.
PR: https://github.com/llvm/llvm-project/pull/92307
|
|
This allows use at other places, in particular an updated version of
https://github.com/llvm/llvm-project/pull/92307.
|
|
Use the destructuring syntax in C++ and llvm::enumerate to make
sortPtrAccesses a little more readable.
|
|
Applying the loop guards to the distance may prevent
isSafeDependenceDistance from determining NoDep, unless loop guards are
also applied to the backedge-taken-count.
Instead of applying the guards to both Dist and the
backedge-taken-count, just apply them after handling
isSafeDependenceDistance and constant distances; there is no benefit to
applying the guards before then.
This fixes a regression flagged by @bjope due to
ecae3ed958481cba7d60868cf3504292f7f4fdf5.
|
|
tryToCreateDiffCheck has one caller, and exits early if CanUseDiffCheck
is false. Hence, we can get/set CanUseDiffCheck in the caller to avoid
wastefully calling tryToCreateDiffCheck. This patch is an NFC
simplification of program logic.
|
|
Following up to 933f49248, also update the code reasoning about
backwards dependences to support non-constant distances.
Update the code to use the signed minimum distance instead of a constant
distance
This means e checked the lower bound of the dependence distance and the
distance may be larger at runtime (and safe for vectorization). Whether
to classify it as Unknown or Backwards depends on the vector width and
LAA was updated to take TTI to get the maximum vector register width.
If the minimum dependence distance is larger than the max vector width,
we consider it as backwards-vectorizable. Otherwise we classify them as
Unknown, so we re-try with runtime checks.
PR: https://github.com/llvm/llvm-project/pull/91525
|
|
After supporting non-constant dependence distances in 933f49248bf,
applying information from loop guards can help further disambiguate
dependencies.
|
|
Instead of passing LoopAccessInfo only to fetch the MemoryDepChecker,
directly pass MemoryDepChecker. This simplifies the code and also allows
new uses in places where no LAI is available.
|
|
Code checking stores to invariant addresses and reductions made an
incorrect assumption that the case of both a load & store to the same
invariant address does not need to be handled.
In some cases when vectorizing with runtime checks, there may be
dependences with a load and store to the same address, storing a
reduction value.
Update LAA to separately track if there was a store-store and a
load-store dependence with an invariant addresses.
Bail out early if there as a load-store dependence with invariant
address. If there was a store-store one, still apply the logic checking
if they all store a reduction.
|
|
As discussed in https://github.com/llvm/llvm-project/pull/88039, support
different strides with isSafeDependenceDistance by passing the maximum
of both strides.
isSafeDependenceDistance tries to prove that
|Dist| > BackedgeTakenCount * Step
holds. Chosing the maximum stride computes the maximum range accesed by
the loop for all strides.
PR: https://github.com/llvm/llvm-project/pull/90036
|
|
(#88039)
Extend LoopAccessAnalysis to support different strides and as a
consequence non-constant distances between dependences using SCEV to
reason about the direction of the dependence.
In multiple places, logic to rule out dependences using the stride has
been updated to only be used if StrideA == StrideB, i.e. there's a
common stride.
We now also may bail out at multiple places where we may have to set
FoundNonConstantDistanceDependence. This is done when we need to bail
out and the distance is not constant to preserve original behavior.
Fixes https://github.com/llvm/llvm-project/issues/87336
PR: https://github.com/llvm/llvm-project/pull/88039
|
|
As suggested in https://github.com/llvm/llvm-project/pull/88039, add
extra documentation for reasoning in isDependent.
PR: https://github.com/llvm/llvm-project/pull/89381
|
|
As suggested in https://github.com/llvm/llvm-project/pull/88039, replace
the tuple with a struct, to make it easier to extend.
|
|
Fix type in textual analysis output.
|
|
LAA currently adds memory locations with their original AATags to AST.
However, scoped alias AATags may be valid only within one loop
iteration, while LAA reasons across iterations.
Fix this by determining which alias scopes are defined inside the loop,
and drop AATags that reference these scopes.
Fixes https://github.com/llvm/llvm-project/issues/79137.
|
|
This changes the AliasSetTracker to track memory locations instead of
pointers in its alias sets. The motivation for this is outlined in an RFC
posted on LLVM discourse:
https://discourse.llvm.org/t/rfc-dont-merge-memory-locations-in-aliassettracker/73336
In the data structures of the AST implementation, I made the choice to
replace the linked list of `PointerRec` entries (that had to go anyway)
with a simple flat vector of `MemoryLocation` objects, but for the
`AliasSet` objects referenced from a lookup table, I retained the
mechanism of a linked list, reference counting, forwarding, etc. The
data structures could be revised in a follow-up change.
|
|
Vectors are always bit-packed and don't respect the elements' alignment
requirements. This is different from arrays. This means offsets of
vector GEPs need to be computed differently than offsets of array GEPs.
This PR fixes many places that rely on an incorrect pattern
that always relies on `DL.getTypeAllocSize(GTI.getIndexedType())`.
We replace these by usages of `GTI.getSequentialElementStride(DL)`,
which is a new helper function added in this PR.
This changes behavior for GEPs into vectors with element types for which
the (bit) size and alloc size is different. This includes two cases:
* Types with a bit size that is not a multiple of a byte, e.g. i1.
GEPs into such vectors are questionable to begin with, as some elements
are not even addressable.
* Overaligned types, e.g. i16 with 32-bit alignment.
Existing tests are unaffected, but a miscompilation of a new test is fixed.
---------
Co-authored-by: Nikita Popov <github@npopov.com>
|
|
With commit https://reviews.llvm.org/D152366 I introduced functionality
that permitted the hoisting of runtime memory checks from a vectorised
inner loop to the preheader of the next outer-most loop. This is useful
for benchmarks like SPEC2017's x264 where the inner loop is vectorised
and only has a small trip count. In such cases the runtime memory checks
become expensive and since the checks never fail in the case of x264 it
makes sense to do this. However, this behaviour was controlled by the
flag -hoist-runtime-checks which was off by default.
This patch enables this flag by default for all targets, since I believe
this is a generally beneficial thing to do. I have tested this with
SPEC2017 and I see 2.3% and 2.6% improvements with x264 on neoverse-v1
and neoverse-n1, respectively. Similarly, I saw slight improvements in
the overall geomean on both machines. The only other notable changes
were a 1% drop in the roms benchmark, which was compensated for by a 1%
improvement in fotonik3d.
|
|
Pass MemoryLocation as one argument, instead of passing all its
parts separately.
|
|
When attempting to hoist runtime checks out of a loop we currently avoid
creating pointer diff checks and prefer to do expanded range checks
instead. This gives us the opportunity to hoist runtime checks out of a
loop, since these checks are loop invariant. However, in some cases the
pointer diff checks would also be loop invariant and so will naturally
get hoisted. Therefore, since diff checks are cheaper so we should
prefer to use those instead.
|
|
As shown in #70473, the following loop was not considered safe to
vectorize. When determining the memory access dependencies in
a loop which has negative iteration step, we invert the source and
sink of the dependence. Perhaps we should just invert the operands
to getMinusSCEV(). This way the dependency is not regarded to be
true, since the users of the `IsWrite` variables, which correspond to
each of the memory accesses, rely on program order and therefore
should not be swapped.
void vectorizable_Read_Write(int *A) {
for (unsigned i = 1022; i >= 0; i--)
A[i+1] = A[i] + 1;
}
|
|
After 9645267, TypeByteSize is 0 if both access do not have the same
size (i.e. HasSameSize will be false). This can cause an infinite loop
in couldPreventStoreLoadForward, if HasSameSize is not checked first.
So check HasSameSize first instead of after
couldPreventStoreLoadForward. Checking HasSameSize first is also
cheaper.
|
|
This patch refactors the logic to compute the dependence distance,
stride, size and write info to a separate function. This limits the
scope of various A* and B* variables, which in turn makes it easier to
reason about their uses.
In particular this makes it more explicit why dropping the various
std::swaps as done in https://github.com/llvm/llvm-project/pull/70819/
is valid.
|
|
Commit was pushed by accident.
This reverts commit 0e15f245557000038f27f2dc8926a65bf3d4ccaf.
|
|
|
|
This patch adds a new dependence kind UnsafeIndirect, for cases where at
least one of the memory access instructions may access a loop varying object,
e.g. the address of underlying object is loaded inside the loop, like A[B[i]].
We cannot determine direction or distance in those cases, and also are unable
to generate any runtime checks.
This fixes a miscompile, if we attempt to generate runtime checks for
unknown dependencies.
Note that in most cases we do not attempt to generate runtime checks for
unknown dependences, except if FoundNonConstantDistanceDependence is
true.
Fixes https://github.com/llvm/llvm-project/issues/69744.
|
|
Base on D19403, the exact pragma of distribute is
`#pragma clang loop distribute`
|
|
Identified with misc-include-cleaner.
|
|
Identified with misc-include-cleaner.
|
|
Given a function like the following: https://godbolt.org/z/T9c99fr88
```c
1161_noReadWrite(int *Preds) {
for (int i = 0; i < LEN_1D-1; ++i) {
if (Preds[i] != 0)
b[i] = c[i] + 1;
else
a[i] = i * i;
}
}
```
LLVM will optimize the IR to a single store by a phi instruction:
```llvm
%1 = load ptr, ptr @a, align 64
%2 = load ptr, ptr @b, align 64
...
for.inc:
%.sink = phi ptr [ %1, %if.then ], [ %2, %if.else ]
%add.sink = phi double [ %add, %if.then ], [ %conv8, %if.else ]
%arrayidx7 = getelementptr inbounds double, ptr %.sink, i64 %indvars.iv
store double %add.sink, ptr %arrayidx7, align 8
```
LAA is currently unable to analyze such IR, since ScalarEvolution will
return a SCEVUnknown for the forked pointer operand of the store.
This patch adds initial optional support for analyzing both
possibilities for the pointer and allowing LAA to generate runtime
checks for the bounds if required, refers to D108699, but here address
the phi node.
Fixes https://github.com/llvm/llvm-project/issues/64888
Reviewed By: huntergr-arm, fhahn
Differential Revision: https://reviews.llvm.org/D158965
|
|
Don't report 'Use #pragma loop distribute(enable) to allow loop
distribution' when we already add #pragma clang loop distribute(enable)
Fixes https://github.com/llvm/llvm-project/issues/64637
|
|
loop
Suppose we have a nested loop like this:
void foo(int32_t *dst, int32_t *src, int m, int n) {
for (int i = 0; i < m; i++) {
for (int j = 0; j < n; j++) {
dst[(i * n) + j] += src[(i * n) + j];
}
}
}
We currently generate runtime memory checks as a precondition for
entering the vectorised version of the inner loop. However, if the
runtime-determined trip count for the inner loop is quite small
then the cost of these checks becomes quite expensive. This patch
attempts to mitigate these costs by adding a new option to
expand the memory ranges being checked to include the outer loop
as well. This leads to runtime checks that can then be hoisted
above the outer loop. For example, rather than looking for a
conflict between the memory ranges:
1. &dst[(i * n)] -> &dst[(i * n) + n]
2. &src[(i * n)] -> &src[(i * n) + n]
we can instead look at the expanded ranges:
1. &dst[0] -> &dst[((m - 1) * n) + n]
2. &src[0] -> &src[((m - 1) * n) + n]
which are outer-loop-invariant. As with many optimisations there
is a trade-off here, because there is a danger that using the
expanded ranges we may never enter the vectorised inner loop,
whereas with the smaller ranges we might enter at least once.
I have added a HoistRuntimeChecks option that is turned off by
default, but can be enabled for workloads where we know this is
guaranteed to be of real benefit. In future, we can also use
PGO to determine if this is worthwhile by using the inner loop
trip count information.
When enabling this option for SPEC2017 on neoverse-v1 with the
flags "-Ofast -mcpu=native -flto" I see an overall geomean
improvement of ~0.5%:
SPEC2017 results (+ is an improvement, - is a regression):
520.omnetpp: +2%
525.x264: +2%
557.xz: +1.2%
...
GEOMEAN: +0.5%
I didn't investigate all the differences to see if they are
genuine or noise, but I know the x264 improvement is real because
it has some hot nested loops with low trip counts where I can
see this hoisting is beneficial.
Tests have been added here:
Transforms/LoopVectorize/runtime-checks-hoist.ll
Differential Revision: https://reviews.llvm.org/D152366
|
|
This patch fixes:
llvm/lib/Analysis/LoopAccessAnalysis.cpp:2001:12: error: unused
variable 'MinDepDistBytesOld' [-Werror,-Wunused-variable]
|
|
`MaxSafeDepDistBytes` was not correct based on its name an semantics
in instances when there was a non-unit stride loop. For example,
```
for (int k = 0; k < len; k+=3) {
a[k] = a[k+4];
a[k+2] = a[k+6];
}
```
Here, the smallest dependence distance is 24 bytes, but only vectorizing 8 bytes
is safe. `MaxSafeVectorWidthInBits` reported the correct number of bits
that could be vectorized as 64 bits.
The semantics of of `MaxSafeDepDistBytes` should be:
The smallest dependence distance in bytes in the loop. This may not be
the same as the maximum number of bytes that are safe to operate on
simultaneously.
The name of this variable should reflect those semantics and
its docstring should be updated accordingly, `MinDepDistBytes`.
A debug message that used `MaxSafeDepDistBytes` to signify to the user
how many bytes could be accessed in parallel is updated to use
`MaxSafeVectorWidthInBits` instead. That way, the same message if
communicated to the user, just in different units.
This patch makes sure that when `MinDepDistBytes` is modified in a way
that should impact `MaxSafeVectorWidthInBits`, that we update the latter
accordingly. This patch also clarifies why `MaxSafeVectorWidthInBits`
does not to be updated when `MinDepDistBytes` is (i.e. in the case of a
forward dependency).
Differential Revision: https://reviews.llvm.org/D156158
|
|
|
|
It was `inaccessiblemem: readwrite` before, no need for the read.
No real benefit is expected but it can help debugging and other efforts.
Differential Revision: https://reviews.llvm.org/D156478
|
|
Add extra assert to check invariant of RuntimePointerChecking::insert to
guard against subtle changes when extending the scope of LAA.
|
|
|