Age | Commit message | Author | Files | Lines |
|
C1 (#160163)
We can rewrite this to (srai(w)/srli X, C1) == C2 so the AND immediate
is free. This transform is done by performSETCCCombine in
RISCVISelLowering.cpp.
This fixes the opaque constant case mentioned in #157416.
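As a minimal C++ illustration of the equivalence being exploited (the constants here are made up; the actual combine runs on SelectionDAG nodes in performSETCCCombine):

    #include <cstdint>

    // (X & C1) == C2 with C1 a mask of the high bits: the large AND
    // immediate C1 must be materialized.
    bool cmpWithAnd(uint64_t X) {
      return (X & 0xFFFF000000000000ull) == 0x1234000000000000ull;
    }

    // Rewritten as (srli X, 48) == C2': the shift encodes the mask, so no
    // AND immediate is needed and the result is identical.
    bool cmpWithShift(uint64_t X) {
      return (X >> 48) == 0x1234;
    }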
|
|
This reverts commit aa08b1a9963f33ded658d3ee655429e1121b5212.
|
|
This patch is based on https://github.com/llvm/llvm-project/pull/159713.
It extends AddressSanitizer to support indexed/segment
instructions in RVV. It enables proper instrumentation for these memory
operations.
A new member, `MaybeOffset`, is added to `InterestingMemoryOperand` to
describe the offset between the base pointer and the actual memory
reference address.
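A hedged sketch of how an instrumentation pass might consume the new member (`InterestingMemoryOperand` and `MaybeOffset` come from this commit; the surrounding code is illustrative, not the exact upstream implementation):

    // Illustrative only: derive the accessed address from the base pointer
    // plus the optional offset before emitting the shadow check. The exact
    // type of MaybeOffset is not shown in the commit text.
    Value *Addr = Op.getPtr();
    if (Value *Off = Op.MaybeOffset) // offset from base to accessed address
      Addr = IRB.CreateGEP(IRB.getInt8Ty(), Addr, Off);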
Co-authored-by: Yeting Kuo <yeting.kuo@sifive.com>
|
|
Split out from #151300 to isolate TargetTransformInfo cost modelling for
fault-only-first loads from VPlan implementation details. This change
adds costing support for vp.load.ff independently of the VPlan work.
For now, model a vp.load.ff as cost-equivalent to a vp.load.
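What "cost-equivalent to a vp.load" amounts to, sketched (the case labels and the helper are hypothetical, an assumption about the shape of the code rather than the upstream implementation):

    // Sketch: in the intrinsic cost switch, fall through to the vp.load
    // handling so both intrinsics get identical costs.
    case Intrinsic::vp_load_ff: // hypothetical case label
    case Intrinsic::vp_load:
      return costVPLoad(ICA, CostKind); // hypothetical helper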
|
|
embed in MemIntrinsicInfo #157863 (#159713)
[Previously reverted due to failures on asan-rvv-intrinsics.ll; the test
case is RISC-V-only and the failure was triggered by other targets.]
Reland [#157863](https://github.com/llvm/llvm-project/pull/157863), and
add `; REQUIRES: riscv-registered-target` to the test case to skip
configurations that don't register the RISC-V target.
Previously asan considered target intrinsics as black boxes, so asan
could not instrument accurate checks. This patch makes
SmallVector<InterestingMemoryOperand> a member of MemIntrinsicInfo so
that, via TTI, targets can describe their intrinsics' memory behavior
to asan.
Note:
1. This patch moves InterestingMemoryOperand from Transforms to Analysis.
2. It extends MemIntrinsicInfo by adding a
SmallVector<InterestingMemoryOperand> member.
3. It does not support RVV indexed/segment loads/stores.
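A hedged sketch of the shape of the extension (the member type comes from the commit text; the field name and surrounding fields are abbreviated, not the exact upstream definition):

    struct MemIntrinsicInfo {
      // ... existing fields describing the intrinsic's memory behavior ...

      // New: per-operand access descriptions that a target's TTI fills in so
      // asan can emit precise checks instead of treating the intrinsic as a
      // black box.
      SmallVector<InterestingMemoryOperand, 1> InterestingOperands;
    };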
|
|
embed in MemIntrinsicInfo" (#159700)
Reverts llvm/llvm-project#157863
|
|
MemIntrinsicInfo (#157863)
Previously asan considered target intrinsics as black boxes, so asan
could not instrument accurate checks. This patch makes
SmallVector<InterestingMemoryOperand> a member of MemIntrinsicInfo so
that, via TTI, targets can describe their intrinsics' memory behavior
to asan.
Note:
1. This patch moves InterestingMemoryOperand from Transforms to Analysis.
2. It extends MemIntrinsicInfo by adding a
SmallVector<InterestingMemoryOperand> member.
3. It does not support RVV indexed/segment loads/stores.
|
|
|
|
#149955" (#156386)
This patch implements `getAddressComputationCost()` in the RISCV TTI,
which makes gather/scatter with address calculation more expensive than
the strided cost.
Note that the only user of `getAddressComputationCost()` with a vector
type is `VPWidenMemoryRecipe::computeCost()`, so this patch changes
some LV tests.
I've checked the test changes in LV, and they can be divided into two
groups:
* gather/scatter with a uniform vector pointer, which can seemingly be
optimized to a masked load.
* accesses that can be optimized to strided loads/stores.
----
After #155739 landed, the assertion (cost misalignment) is fixed.
I've tested llvm-test-suite with rva23u64 and rva23u64_zvl1024b locally
and no assertion occurred.
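An illustrative shape of the override (the TTI hook is real; the signature details and the choice of cost here are assumptions):

    InstructionCost
    RISCVTTIImpl::getAddressComputationCost(Type *PtrTy, ScalarEvolution *SE,
                                            const SCEV *Ptr) {
      // Gather/scatter materializes one address per lane; charge a cost that
      // grows with the legalized type so strided accesses win on cost.
      if (isa<VectorType>(PtrTy))
        return getTypeLegalizationCost(PtrTy).first;
      return BaseT::getAddressComputationCost(PtrTy, SE, Ptr);
    }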
|
|
Return `true` in `RISCVTTIImpl::preferAlternateOpcodeVectorization` if
the subtarget supports unaligned memory accesses.
|
|
(#155535)
Reverts llvm/llvm-project#149955
|
|
This patch implements `getAddressComputationCost()` in the RISCV TTI,
which makes gather/scatter with address calculation more expensive than
the strided cost.
Note that the only user of `getAddressComputationCost()` with a vector
type is `VPWidenMemoryRecipe::computeCost()`, so this patch changes
some LV tests.
I've checked the test changes in LV, and they can be divided into two
groups:
* gather/scatter with a uniform vector pointer, which can seemingly be
optimized to a masked load.
* accesses that can be optimized to strided loads/stores.
|
|
There are a couple of places in the loop vectoriser where we
want to calculate the cost of extracting the last lane in a
vector. However, we wrongly assume that asking for the cost
of extracting lane (VF.getKnownMinValue() - 1) is an accurate
representation of the cost of extracting the last lane. For
SVE at least, this is non-trivial as it requires the use of
whilelo and lastb instructions.
To solve this problem I have added a new
getReverseVectorInstrCost interface where the index is counted in
reverse from the end of the vector: for a vector with a given
ElementCount EC, the extracted/inserted lane is EC - 1 - Index. For
scalable vectors this index is unknown at compile time. I've added an
AArch64 hook that better represents the cost, and also a RISCV hook
that maintains compatibility with the behaviour prior to this PR.
I've also taken the liberty of adding support in vplan for
calculating the cost of VPInstruction::ExtractLastElement.
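A small self-contained illustration of the index convention (the helper is hypothetical; only the EC - 1 - Index relation comes from the text above):

    // For a vector with element count EC, reverse index I addresses lane
    // EC - 1 - I. For scalable vectors EC is vscale * KnownMin, so the lane
    // is only a compile-time constant for fixed-length vectors.
    unsigned reverseLane(unsigned EC, unsigned I) { return EC - 1 - I; }
    // reverseLane(8, 0) == 7: the last lane of a fixed <8 x i32> vector.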
|
|
cttz/ctlz/ctpop. NFC. (#154064)
That isn't necessary if we've checked ST->hasStdExtZvbb().
|
|
If we have a floating point vector and no zve32f/zve64f/zve64d, we can
end up with an invalid type-legalization cost from
getTypeLegalizationCost.
Previously this triggered an assertion that the type must have been
legalized if the "legal" type is a vector, but in this case, where
legalization isn't possible, the original type is spat back out.
This fixes it by just checking that the legalization cost is valid.
We don't have much testing for zve64x, so we may have other places in
the cost model with this issue.
Fixes #153008
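The guard described above, as a hedged sketch (the API names exist; the exact placement in the cost routine is illustrative):

    std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Ty);
    // Without zve32f/zve64f/zve64d the FP vector type cannot be legalized
    // and the returned cost is invalid; propagate that instead of asserting.
    if (!LT.first.isValid())
      return InstructionCost::getInvalid();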
|
|
Now that support for masked loads/stores of interleave groups has
landed, we can enable the loop vectorizer to generate masked interleave
access where applicable.
This improves vectorization in several ways:
* Internal predication support: This enables interleave group
vectorization for loops with internal control flow predication, provided
all members of the group share the same predicate. Gaps in interleave
groups are still not efficiently handled by masking, so masking for gaps
remains disabled for now.
* Tail folding: This allows tail folding of loops with interleave groups
by using masking. Without this, vectorized loops with interleaves would
fall back to using separate gather/scatter accesses, which can be
significantly less efficient.
"[RISCV][TTI] Enable masked interleave access for scalable vector
(#149981)" was reverted by 5294793bdcf6ca142f7a0df897638bd4e85ed1a7 due
to triggering an assertion. The issue has been addressed in the patch
"[LV] Fix gap mask requirement for interleaved access (#151105)". On the
other hand, this patch also enable fixed-length masked interleave access
(#150624) since support for fixed-length has also been landed
992118cb4deab139ae384bb85f03225a9a21b008.
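At its core the enablement is the target answering yes to the relevant hook; a minimal sketch, assuming the existing TTI interface (the override shown is illustrative):

    // Allow the loop vectorizer to form masked interleave groups on RISC-V.
    bool RISCVTTIImpl::enableMaskedInterleavedAccessVectorization() {
      return true;
    }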
---------
Co-authored-by: Philip Reames <preames@rivosinc.com>
|
|
Adjust the unrolling preferences to unroll hand-vectorized code, as well
as the scalar remainder of a vectorized loop. Inspired by a similar
effort in AArch64: see #147420 and #151164.
|
|
Follow up on a review of bd66fd0 ([CostModel/RISCV] Fix costs of vector
[l](lrint|lround)) post-landing to fix a subtle problem with the cost
of vector [l](lrint|lround). We should use source LMUL in the case of
a narrowing op.
Co-authored-by: Luke Lau <luke@igalia.com>
|
|
bd66fd0 ([CostModel/RISCV] Fix costs of vector [l](lrint|lround))
introduced buildbot failures by using a temporary ArrayRef when a
SmallVector should have been used. Fix this.
Failure: https://lab.llvm.org/buildbot/#/builders/186/builds/11133
|
|
Take the actual instruction cost into account, and don't fallthrough to
code that doesn't apply to [l]lrint. Also strip invalid costs for
[b]f16, as a companion to #146507, and unify it with [l]lround costs as
a companion to #147713.
|
|
(#149981)"
This reverts commit ee3a7714b7a69ac9aae4b79f4c67adc38bc6876b.
Causes an assertion for the zvl1024b RISC-V build configuration. See
comment with reproducer at
<https://github.com/llvm/llvm-project/pull/149981#issuecomment-3118482801>
|
|
Now that support for masked loads/stores of interleave groups has
landed, we can enable the loop vectorizer to generate masked interleave
access where applicable.
This improves vectorization in several ways:
* Internal predication support: This enables interleave group
vectorization for loops with internal control flow predication, provided
all members of the group share the same predicate. Gaps in interleave
groups are still not efficiently handled by masking, so masking for gaps
remains disabled for now.
* Tail folding: This allows tail folding of loops with interleave groups
by using masking. Without this, vectorized loops with interleaves would
fall back to using separate gather/scatter accesses, which can be
significantly less efficient.
* Scalable vector support: Currently, only scalable vector types are
supported for masked interleave lowering. Fixed-length vector support
will be enabled in the future.
As interleave access is not yet supported with tail folding by EVL, that
functionality is temporarily disabled. We are going to create another
patch to support it.
Co-authored-by: Philip Reames <preames@rivosinc.com>
|
|
This patch implements vector costs for `llvm.fptoui.sat()` in the RISCV TTI.
|
|
Currently we have slightly different costing for the vp and non-vp
versions of the rounding intrinsics.
We can delete this code and use the generic BasicTTIImpl code for the
vp intrinsics, which falls back to the non-vp versions.
I'm not sure if the zvfh costing is correct; this should probably be
fixed in a follow-up patch. At the moment the non-vp cost is more
important, since it is what the loop vectorizer will use.
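A hedged sketch of the generic fallback (the VPIntrinsic helper exists in LLVM; the surrounding dispatch is illustrative):

    // Cost a VP intrinsic as its non-VP functional equivalent when there is
    // no dedicated handling.
    if (std::optional<Intrinsic::ID> FID =
            VPIntrinsic::getFunctionalIntrinsicIDForVP(ICA.getID()))
      return thisT()->getIntrinsicInstrCost(
          IntrinsicCostAttributes(*FID, ICA.getReturnType(),
                                  ICA.getArgTypes()),
          CostKind);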
|
|
Move the costing to the generic implementation in BasicTTIImpl since it
just falls back to the non-vp costing.
Also pass through the OperandValueInfo if using value based costing, but
I don't believe this affects the result for any in-tree target
currently.
|
|
A shuffle will take two input vectors and a mask, to produce a new
vector of size <MaskElts x SrcEltTy>. Historically it has been assumed
that the SrcTy and the DstTy are the same for getShuffleCost, with that
being relaxed in recent years. If the Tp passed to getShuffleCost is the
SrcTy, then the DstTy can be calculated from the Mask elts and the src
elt size, but the Mask is not always provided and the Tp is not reliably
always the SrcTy. This has led to situations, notably in the SLP
vectorizer but also in the generic cost routines, where assumptions
about how vectors will be legalized are built into the generic cost
routines - for example whether they will widen or promote, with the
cost modelling assuming they will widen while the default lowering
promotes integer vectors.
This patch attempts to start improving that - it originally tried to
alter more of the cost model but that too quickly became too many
changes at once, so this patch just plumbs in a DstTy to getShuffleCost
so that DstTy and SrcTy can be reliably distinguished. The callers of
getShuffleCost have been updated to try and include a DstTy that is more
accurate. Otherwise it tries to be fairly non-functional, keeping the
SrcTy used as the primary type used in shuffle cost routines, only using
DstTy where it was in the past (for InsertSubVector for example).
Some asserts have been added that help to check for consistent values
when a Mask and a DstTy are provided to getShuffleCost. Some of them
took a while to get right, and some non-mask calls might still be
incorrect. Hopefully this will provide a useful base to build more
shuffles that alter size.
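The approximate shape of the plumbed-in interface, per the description above (parameter order and defaults are assumptions):

    InstructionCost getShuffleCost(TTI::ShuffleKind Kind, VectorType *DstTy,
                                   VectorType *SrcTy, ArrayRef<int> Mask,
                                   TTI::TargetCostKind CostKind,
                                   int Index = 0,
                                   VectorType *SubTp = nullptr) const;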
|
|
Purely for the sake of being idiomatic with other TTI costing routines,
no direct motivation beyond that.
|
|
PPCTTIImpl defines hasActiveVectorLength and also getVPMemoryOpCost, but
they appear unused (i.e. no changes to tests).
Remove them, as they complicate the interface for hasActiveVectorLength.
This simplifies the only use in LV as now no placeholder values need to
be passed.
PR: https://github.com/llvm/llvm-project/pull/142310
|
|
We can convert non-power-of-2 types into extended value types
and then they will be widened.
Reviewers: lukel97
Reviewed By: lukel97
Pull Request: https://github.com/llvm/llvm-project/pull/114971
|
|
Put one copy on RISCVTargetLowering as a static function so that both
locations can use it, and rename the method to getM1VT for slightly
improved readability.
|
|
|
|
This contains two closely related changes:
1) Explicitly recurse on the i1 case - "3" happens to be the right
magic constant at m1, but is not otherwise correct, and we're
better off deferring this to existing logic.
2) Match the lowering for high LMUL shuffles - we've switched to using
a linear number of m1 vrgather instead of a single big vrgather.
This results in substantially faster (but also larger) code for
reverse shuffles larger than m1. Note that fixed vectors need
a slide at the end, but scalable ones don't.
This will have the effect of biasing the vectorizer towards larger
(particularly scalable larger) vector factors. This increases VF for the
s112 and s1112 loops from TSVC_2 (in all configurations).
We could refine the high LMUL estimates a bit more, but I think getting
the linear scaling right is probably close enough for the moment.
|
|
This patch adds support for generating vector instructions for
`memcmp`. This implementation is inspired by X86's.
We convert integer comparisons (eq/ne only) into vector comparisons
and do a vector reduction to get the result.
The range of supported load sizes is (XLEN, VLEN * LMUL8] and
non-power-of-2 types are not supported.
Fixes #143294.
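A scalar C++ model of what the expansion computes (illustrative; the real lowering emits one vector load per side, a vector compare/xor, and a single reduction):

    #include <cstddef>

    // memcmp(a, b, 16) == 0 modeled as lane-wise XOR followed by an
    // OR-reduction and a test for zero; vectorized this becomes roughly
    // two loads, a vxor, and a vredor.
    bool eq16(const unsigned char *A, const unsigned char *B) {
      unsigned char Acc = 0;
      for (std::size_t I = 0; I < 16; ++I)
        Acc |= static_cast<unsigned char>(A[I] ^ B[I]);
      return Acc == 0;
    }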
Reviewers: lukel97, asb, preames, topperc, dtcxzyw
Reviewed By: topperc, lukel97
Pull Request: https://github.com/llvm/llvm-project/pull/114517
|
|
This depends on the recently added partial_reduce_sumla node for lowering
but at this point, we have all the parts.
|
|
(#142036)
If we have the ri.vinsert/vextract instructions from xrivosvisni, we can
do an element insert or extract without needing a vslide or a vector
temporary register. Adjust the TTI cost to reflect this.
|
|
Two changes:
1) Handle fixed vector cases now that 77a3f8 has landed.
2) Fix a mistake in the original costing - the VF passed in is the
input VF, not the output VF. Given that, we should be costing the
accumulator type with VF/4.
Note that (2) does not cause any visible test differences as the
vectorizer (outside of maximize-bandwidth mode) does not consider wide
enough VF for the costing difference to matter.
|
|
Doing so tells the loop vectorizer that the partial.reduce intrinsic is
profitable to use over the plain extend/multiply/reduce.add sequence.
|
|
This does not alter much at the moment, but allows const pointers to be
passed as Op0 and Op1, simplifying later patches
|
|
`ad9909d "[SLP]Fix perfect diamond match with extractelements in scalars" `
changed SLPVectorizer getScalarizationOverhead() to call
TTI.getVectorInstrCost() instead of TTI.getScalarizationOverhead() in some
cases. This was due to X86-specific handling in these (overridden) methods,
and unfortunately the general preference for TTI.getScalarizationOverhead()
was dropped.
getScalarizationOverhead(), and this is indeed the case for SystemZ which
has a special insertion instruction that can insert two GPR64s.
Then ` 33af951 "[SLP]Synchronize cost of gather/buildvector nodes with
codegen"` reworked SLPVectorizer getGatherCost() which together with
ad9909d caused the SystemZ test vec-elt-insertion.ll to fail.
This patch restores the SystemZ test and reverts the change in SLPVectorizer
getScalarizationOverhead() so that TTI.getScalarizationOverhead() is always
called again. The ForPoisonSrc argument is now passed on to the TTI method
so that X86 can handle this as required.
Fixes: #135346
|
|
This patch introduces a `vp.splat` matching method for VP support by
sinking the `vp.splat` operand of VP operations back into the same basic
block as the VP operation, facilitating the generation of .vx
instructions to reduce vector register pressure.
---------
Co-authored-by: yanming <ming.yan@terapines.com>
|
|
InstructionCost is already an optional value, containing an Invalid
state that can be checked with isValid(). There is little point in
returning another optional from getValue(). Most uses do not make use of
it being a std::optional, dereferencing the value directly (either
isValid has been checked previously or the Cost is assumed to be valid).
The one case that did, in AMDGPU, used value_or; it has been replaced
by an isValid() check.
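What the change means at a use site, sketched (isValid() and getValue() are the real InstructionCost API; the surrounding code is illustrative):

    InstructionCost Cost = TTI.getInstructionCost(&I, CostKind);
    if (!Cost.isValid())
      return; // the Invalid state is still signalled through isValid()
    int64_t V = Cost.getValue(); // now the value itself, not a std::optional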
|
|
The change built before merge, but apparently a constness change landed
since I posted this for review.
|
|
We had some code which tried to estimate legalization costs for
illegally typed shuffles, but it only handled the case of a widening
shuffle, and used a somewhat adhoc heuristic. We can reuse the
processShuffleMask utility (which we already use for individual vector
register splitting when exact VLEN is known) to perform the same
splitting given the legal vector type as the unit of split instead. This
makes the costing both simpler and more robust.
Note that this swings costs for illegal shuffles pretty wildly as we
were previously sometimes hitting the adhoc code, and sometimes falling
through into generic scalarization costing. I don't know that any of the
costs for the individual tests in tree are significant, but the test
which triggered my finding this was reported to me by Alexey, reduced
from something triggering a bad choice in SLP for x264. So this has the
potential to be somewhat high impact.
|
|
(NFCI) (#136655)
These are not diagnosed because implementations hide the methods of the base class rather than overriding them.
This works as long as a hiding function is callable with the same arguments as the same function from the base class.
Pull Request: https://github.com/llvm/llvm-project/pull/136655
|
|
Making `TargetTransformInfo::Model::Impl` `const` makes sure all
interface methods are `const`, in `BasicTTIImpl`, its bases, and in all
derived classes.
Pull Request: https://github.com/llvm/llvm-project/pull/136598
|
|
The main change is making the `thisT` method `const`; the rest of the
changes fix compilation errors (*).
(*) There are two tricky methods, `getVectorInstrCost()` and
`getIntImmCost()`.
They have several overloads; some of these overloads are typically
pulled in to derived classes using the `using` directive, and then
hidden by methods in the derived class.
The compiler does not complain if the hiding methods are not marked as
`const`, which means that clients will use the methods from the base
class. If after this change your target fails cost model tests, this
must be the reason. To resolve the issue you need to make all hiding
overloads `const`. See the second commit in this PR.
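The hiding pitfall in miniature (illustrative, not LLVM code): on a const object the non-const hiding overload is not viable, so overload resolution silently falls back to the base-class method.

    #include <cstdio>

    struct Base {
      int cost(int) const { return 1; }
    };
    struct Derived : Base {
      using Base::cost;
      int cost(int) { return 2; } // missing const: hides, doesn't override
    };

    int main() {
      const Derived D;
      std::printf("%d\n", D.cost(0)); // prints 1: Base::cost is selected
    }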
Pull Request: https://github.com/llvm/llvm-project/pull/136575
|
|
This fixes a crash reported at
https://github.com/llvm/llvm-project/pull/114250#issuecomment-2813686061
If the vector type isn't legal at all, e.g. bfloat with +zvfbfmin,
then the legalized type will be scalarized. So use getScalarType()
instead of getVectorElement() when checking for f16/bf16.
|
|
If we have a shuffle which can be split via VLA where two or more of the
destinations have exactly the same elements, then we only need to
account for them once in costing. The duplicate copies are (at
worst) whole register moves.
Note that this change only handles the single source case. Doing the
multiple source case seemed a bit more complicated, and I didn't have a
motivating test case.
|
|
Inspired by https://reviews.llvm.org/D130755.
I don't know the logic behind the value 5; it is copied from AArch64.
For some tests, I have to change the trip count so that we don't
break what they are testing.
|
|
getArithmeticReductionCost(). (#131968)
In the implementation of getExtendedReductionCost(), it often calls
getArithmeticReductionCost() with FMFs. But we shouldn't call
getArithmeticReductionCost() with FMFs for non-floating-point
reductions, as that returns the wrong cost.
This patch makes the FMFs in getExtendedReductionCost() optional,
aligning it with getArithmeticReductionCost(). The TTI will then return
the correct cost for non-FP extended-reduction queries without FMFs.
This patch is not quite NFC but it's hard to test from the CostModel
side.
Split from #113903.
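A hedged sketch of the aligned interface (the optional-FMF shape follows getArithmeticReductionCost; the exact parameter list is approximate):

    InstructionCost
    getExtendedReductionCost(unsigned Opcode, bool IsUnsigned, Type *ResTy,
                             VectorType *Ty, std::optional<FastMathFlags> FMF,
                             TTI::TargetCostKind CostKind) const;
    // Integer reductions can now pass std::nullopt instead of fabricated FMFs.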
|