Age | Commit message | Author | Files | Lines |
|
Summary:
This patch has broken the `libc` build bot. I could work around that but
the changes seem unnecessary.
This reverts commit 9ba844eb3a21d461c3adc7add7691a076c6992fc.
|
|
Different instructions are used for the 32-bit and 64-bit cases
anyway, so directly use the concrete register class in the
instruction.
|
|
STI exists in the base class; use it instead.
Fixes #159862.
|
|
Fixes #148052.
The last PR did not account for the scenario where more than one
instruction uses the `catchpad` label.
In that case I deleted uses that had already been chosen to be iterated
over by the early-increment iterator. This issue was not visible in a
normal x86 release build, but the address sanitizer buildbot later
caught it.
Here is the diff from the last version of this PR: #158435
```diff
diff --git a/llvm/lib/Transforms/Utils/BasicBlockUtils.cpp b/llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
index 91e245e5e8f5..1dd8cb4ee584 100644
--- a/llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
+++ b/llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
@@ -106,7 +106,8 @@ void llvm::detachDeadBlocks(ArrayRef<BasicBlock *> BBs,
// first block, the we would have possible cleanupret and catchret
// instructions with poison arguments, which wouldn't be valid.
if (isa<FuncletPadInst>(I)) {
- for (User *User : make_early_inc_range(I.users())) {
+ SmallPtrSet<BasicBlock *, 4> UniqueEHRetBlocksToDelete;
+ for (User *User : I.users()) {
Instruction *ReturnInstr = dyn_cast<Instruction>(User);
// If we have a cleanupret or catchret block, replace it with just an
// unreachable. The other alternative, that may use a catchpad is a
@@ -114,33 +115,12 @@ void llvm::detachDeadBlocks(ArrayRef<BasicBlock *> BBs,
if (isa<CatchReturnInst>(ReturnInstr) ||
isa<CleanupReturnInst>(ReturnInstr)) {
BasicBlock *ReturnInstrBB = ReturnInstr->getParent();
- // This catchret or catchpad basic block is detached now. Let the
- // successors know it.
- // This basic block also may have some predecessors too. For
- // example the following LLVM-IR is valid:
- //
- // [cleanuppad_block]
- // |
- // [regular_block]
- // |
- // [cleanupret_block]
- //
- // The IR after the cleanup will look like this:
- //
- // [cleanuppad_block]
- // |
- // [regular_block]
- // |
- // [unreachable]
- //
- // So regular_block will lead to an unreachable block, which is also
- // valid. There is no need to replace regular_block with unreachable
- // in this context now.
- // On the other hand, the cleanupret/catchret block's successors
- // need to know about the deletion of their predecessors.
- emptyAndDetachBlock(ReturnInstrBB, Updates, KeepOneInputPHIs);
+ UniqueEHRetBlocksToDelete.insert(ReturnInstrBB);
}
}
+ for (BasicBlock *EHRetBB :
+ make_early_inc_range(UniqueEHRetBlocksToDelete))
+ emptyAndDetachBlock(EHRetBB, Updates, KeepOneInputPHIs);
}
}
```
|
|
Currently MCA takes instruction properties from the scheduling model.
However, some instructions may execute differently depending on external
factors - for example, the latency of memory instructions may vary
depending on whether the load comes from the L1 cache, L2 or
DRAM. While MCA as a static analysis tool cannot model such differences
(and currently takes some static decision, e.g. all memory ops are
treated as L1 accesses), it makes sense to allow manual modification of
instruction properties to model different behavior (e.g. sensitivity of
code performance to cache misses in a particular load instruction). This
patch addresses this need.
The library modification is intentionally generic - arbitrary
modifications to InstrDesc are allowed. The tool support is currently
limited to changing instruction latencies (a single number applies to all
output arguments and MaxLatency) via comments in the input assembly
code; the format is like this:
add (%eax), %eax // LLVM-MCA-LATENCY:100
Users of the MCA library can already make additional customizations; the
command-line tool can be extended in the future.
Note that InstructionView currently shows per-instruction information
according to the scheduling model and is not affected by this change.
See https://github.com/llvm/llvm-project/issues/133429 for additional
clarifications (including an explanation of why the existing customization
mechanisms do not provide the required functionality).
---------
Co-authored-by: Min-Yih Hsu <min@myhsu.dev>
|
|
The split in this code path was left over from when we had to support
the old PM and the new PM at the same time. Now that the legacy pass has
been dropped, this simplifies the code a little bit and swaps pointers
for references in a couple places.
Reviewers: aeubanks, efriedma-quic, wlei-llvm
Reviewed By: aeubanks
Pull Request: https://github.com/llvm/llvm-project/pull/159858
|
|
after LUI now. NFC (#159829)
The simm32 base case only uses lui+addiw when necessary after
3d2650bdeb8409563d917d8eef70b906323524ef.
The worst-case 8-instruction sequence doesn't leave a full 32 bits for
the LUI+ADDI(W) after the three 12-bit ADDI and SLLI pairs are created,
so we will never generate LUI+ADDIW in the worst-case sequence.
|
|
This patch introduces a new optimization in SROA that handles the
pattern where multiple non-overlapping vector `store`s completely fill
an `alloca`.
The current approach to handle this pattern introduces many `.vecexpand`
and `.vecblend` instructions, which can dramatically slow down
compilation when dealing with large `alloca`s built from many small
vector `store`s. For example, consider an `alloca` of type `<128 x
float>` filled by 64 `store`s of `<2 x float>` each. The current
implementation requires:
- 64 `shufflevector`s (`.vecexpand`)
- 64 `select`s (`.vecblend`)
- All operations use masks of size 128
- These operations form a long dependency chain
This kind of IR is both difficult to optimize and slow to compile,
particularly impacting the `InstCombine` pass.
This patch introduces a tree-structured merge approach that
significantly reduces the number of operations and improves compilation
performance.
Key features:
- Detects when vector `store`s completely fill an `alloca` without gaps
- Ensures no loads occur in the middle of the store sequence
- Uses a tree-based approach with `shufflevector`s to merge stored
values
- Reduces the number of intermediate operations compared to linear
merging
- Eliminates the long dependency chains that hurt optimization
Example transformation:
```
// Before: (stores do not have to be in order)
%alloca = alloca <8 x float>
store <2 x float> %val0, ptr %alloca ; offset 0-1
store <2 x float> %val2, ptr %alloca+16 ; offset 4-5
store <2 x float> %val1, ptr %alloca+8 ; offset 2-3
store <2 x float> %val3, ptr %alloca+24 ; offset 6-7
%result = load <8 x float>, ptr %alloca
// After (tree-structured merge):
%shuffle0 = shufflevector %val0, %val1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%shuffle1 = shufflevector %val2, %val3, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%result = shufflevector %shuffle0, %shuffle1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
```
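For illustration only, here is a rough C++ sketch of such a pairwise tree
merge (not the actual SROA implementation; it assumes the number of stored
parts is a power of two and that all parts have the same element count):
```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"
#include <numeric>

using namespace llvm;

// Pairwise-combine adjacent stored vectors with shufflevectors until a
// single value covering the whole alloca remains (logarithmic depth).
static Value *treeMergeStoredParts(IRBuilder<> &B,
                                   SmallVectorImpl<Value *> &Parts) {
  while (Parts.size() > 1) {
    SmallVector<Value *, 8> Next;
    for (unsigned I = 0, E = Parts.size(); I < E; I += 2) {
      unsigned N =
          cast<FixedVectorType>(Parts[I]->getType())->getNumElements();
      // Concatenate two equal-width parts: the mask selects elements
      // 0..N-1 from the first operand and N..2N-1 from the second.
      SmallVector<int, 16> Mask(2 * N);
      std::iota(Mask.begin(), Mask.end(), 0);
      Next.push_back(B.CreateShuffleVector(Parts[I], Parts[I + 1], Mask));
    }
    Parts.assign(Next.begin(), Next.end());
  }
  return Parts.front();
}
```
The real implementation additionally has to map each stored value to its
offset and handle stores that arrive out of order, as noted above.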
Benefits:
- Logarithmic depth (O(log n)) instead of linear dependency chains
- Fewer total operations for large vectors
- Better optimization opportunities for subsequent passes
- Significant compilation time improvements for large vector patterns
For some large cases, the compile time can be reduced from about 60s to
less than 3s.
---------
Co-authored-by: chengjunp <chengjunp@nividia.com>
|
|
When there is a dependency between two memory instructions in separate loops that have the same iteration space and depth, SIV will be able to test them and compute the direction and the distance of the dependency.
|
|
When deciding to sink address instructions into their uses, we check if
it is profitable to do so. The profitability check is based on the types
of uses of this address instruction -- if there are users which are not
memory instructions, then do not fold.
However, this profitability check wasn't considering target intrinsics,
which may be loads / stores.
This adds some logic to handle target memory intrinsics.
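A minimal sketch of the idea (not necessarily the exact mechanism used
here) is to ask the target whether an intrinsic user behaves like a memory
operation before declaring the fold unprofitable:
```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/IntrinsicInst.h"

using namespace llvm;

// Returns true if this user of the address computation behaves like a
// memory access, so sinking the address next to it can still pay off.
static bool isMemoryLikeUser(Instruction *U, const TargetTransformInfo &TTI) {
  if (isa<LoadInst>(U) || isa<StoreInst>(U))
    return true;
  if (auto *II = dyn_cast<IntrinsicInst>(U)) {
    MemIntrinsicInfo Info;
    // Target intrinsics that read or write memory through a pointer count
    // as memory instructions for the profitability check.
    return TTI.getTgtMemIntrinsic(II, Info);
  }
  return false;
}
```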
|
|
Reverts llvm/llvm-project#159782
The PR breaks multiple buildbots as well as CI.
|
|
This is a common pattern for initializing KnownBits that occurs before
loops that call intersectWith.
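For context, one way the kind of loop being referred to can be written (a
sketch with hypothetical names, not necessarily the exact pattern this
change factors out):
```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/KnownBits.h"
#include <optional>

using namespace llvm;

// Intersect the known bits of a sequence of values: seed from the first
// element, then keep only the bit information every element agrees on.
KnownBits commonKnownBits(ArrayRef<KnownBits> Elements, unsigned BitWidth) {
  std::optional<KnownBits> Common;
  for (const KnownBits &K : Elements)
    Common = Common ? Common->intersectWith(K) : K;
  return Common.value_or(KnownBits(BitWidth)); // nothing known if empty
}
```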
|
|
Pass operand info to getMemoryOpCost in getMemInstScalarizationCost.
This matches the behavior in VPReplicateRecipe::computeCost.
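A hedged sketch of what passing operand info looks like, illustrated for
the store case (hypothetical helper, not the actual code in
getMemInstScalarizationCost):
```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Pass operand info for the stored value so targets can cost stores of
// constants or uniform values more accurately, mirroring what
// VPReplicateRecipe::computeCost does.
InstructionCost costScalarizedStore(const TargetTransformInfo &TTI,
                                    StoreInst *SI, unsigned AddrSpace,
                                    TargetTransformInfo::TargetCostKind Kind) {
  TargetTransformInfo::OperandValueInfo OpInfo =
      TargetTransformInfo::getOperandInfo(SI->getValueOperand());
  return TTI.getMemoryOpCost(Instruction::Store,
                             SI->getValueOperand()->getType(), SI->getAlign(),
                             AddrSpace, Kind, OpInfo, SI);
}
```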
|
|
|
|
Just do a custom lowering instead.
Also copy-paste the cmov-neg fold to prevent regressions in nabs.
|
|
When handling CUDA ELF files via objdump or LLDB, the ELF parser in LLVM
needs to distinguish whether an ELF file is sass or not, which requires a
triple for sass to exist in LLVM. This patch includes all the necessary
changes for LLDB and objdump to correctly identify these files with the
correct triple.
|
|
Clean up the unused PPC target feature FeatureBPERMD.
|
|
This change adds basic `MCInst` verification (checks the number of
operands) and fixes detected bugs.
* `RFE*` instructions have only one operand, but `DecodeRFEInstruction`
added two.
* `DecodeMVEModImmInstruction` and `DecodeMVEVCMP` added a `vpred`
operand, but this is what `AddThumbPredicate` normally does. This
resulted in an extra `vpred` operand.
* `DecodeMVEVADCInstruction` added an extra immediate operand.
* `getARMInstruction` added a `pred` operand to instructions that don't
have one (via `DecodePredicateOperand`).
* `AddThumb1SBit` appended an extra register operand to instructions
that don't modify CPSR (such as `tBL`).
* Instructions in the `NEONDup` namespace have a `pred` operand that the
generated code successfully decodes. The operand was then added again by
`getARMInstruction`/`getThumbInstruction` via `AddThumbPredicate`.
Functional changes extracted from #156540.
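A rough sketch of the kind of operand-count check referred to above
(hypothetical helper, not the verifier added by this patch):
```cpp
#include "llvm/MC/MCInst.h"
#include "llvm/MC/MCInstrInfo.h"

using namespace llvm;

// Compare the number of operands produced by the disassembler against the
// number declared by the instruction's MCInstrDesc.
static bool hasExpectedOperandCount(const MCInst &MI, const MCInstrInfo &MCII) {
  const MCInstrDesc &Desc = MCII.get(MI.getOpcode());
  if (Desc.isVariadic())
    return true; // variadic instructions may legitimately carry extra operands
  return MI.getNumOperands() == Desc.getNumOperands();
}
```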
|
|
This is a very rough state of what this can look like, but I didn't want
to spend too much time on what could be a dead end.
Currently the only way to invoke callbacks is by using the default
pipelines. This is an issue if you want to define your own pipeline
using the C string API (we do that in LLVM.jl in Julia), so I extended
the API to allow invoking those callbacks just like one would call a
pass of that kind.
There are some questions about the parameters that these callbacks take,
and I'm also missing some of them (some are also invoked by the backend,
so we may not want to expose them).
Code written with AI help; bugs are mine. (Not sure what LLVM's policy
on this is.)
|
|
We're only going to modify existing items, not add elements to or
remove elements from the vector.
|
|
Add `bits<0>` fields to instructions using the ZTR/MPR/MPR8 register
classes. These register classes contain only one register, and it is
not encoded in the instruction. This way, the generated decoder can
completely decode instructions without having to perform a post-decoding
pass to insert missing operands.
Some immediate operands are also not encoded and have only one possible
value, zero. Use this trick for them, too.
Finally, remove the `-ignore-non-decodable-operands` option from the
`llvm-tblgen` invocation to ensure that non-decodable operands do not
appear in the future.
|
|
externally (#159143)
Rather than defining these tags in each object file that requires them,
we can declare them as undefined and require that they be defined
externally in, for example, compiler-rt or libcxxabi.
|
|
(#157968)
This is a cleaned-up version of PR #151704. These optimizations are now
performed post-RA scheduling.
|
|
combineOp_VLToVWOp_VL. (#159205)
These instructions have one already narrow operand. Previously, we
pretended like this operand was a supported extension.
This could cause problems when we called getOrCreateExtendedOp on this
narrow operand when creating the VWADD_VL. If the narrow operand
happened to be an extend of the opposite type, we would peek through it
and then rebuild it with the wrong extension type. So (vwadd_w_vl (i32
(sext X)), (i16 (zext Y))) would become (vwadd_vl (i16 (sext X)), (i16
(sext Y))).
To prevent this, we ignore the operand instead and pass std::nullopt for
SupportsExt to getOrCreateExtendedOp so it won't peek through any
extends on the narrow source.
Fixes #159152.
|
|
|
|
|
|
Teach the IR parser and writer to support metadata on ifuncs, and update
documentation.
In PR #153049, we have a use case of attaching the `!associated`
metadata to an ifunc.
Since an ifunc is similar to a function declaration, it seems natural to
allow metadata on ifuncs.
Currently, the metadata API allows adding Metadata to
llvm::GlobalObject, so the in-memory IR allows for metadata on ifuncs,
but the IR reader/writer is not aware of that.
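For context, a minimal sketch (with placeholder names) of attaching
`!associated` metadata to an ifunc through the existing in-memory API; what
this patch adds is the ability of the IR reader/writer to round-trip it:
```cpp
#include "llvm/IR/GlobalIFunc.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// "my_ifunc" and "my_counter" are placeholder names.
void attachAssociated(Module &M) {
  GlobalIFunc *GIF = M.getNamedIFunc("my_ifunc");
  GlobalVariable *GV = M.getNamedGlobal("my_counter");
  if (!GIF || !GV)
    return;
  MDNode *Assoc = MDNode::get(M.getContext(), ValueAsMetadata::get(GV));
  GIF->setMetadata(LLVMContext::MD_associated, Assoc);
}
```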
---------
Co-authored-by: Wael Yehia <wyehia@ca.ibm.com>
|
|
|
|
The result type of the vector extend intrinsics generated by the
BUILD_VECTOR lowering code should match how they are actually defined.
Currently the result type defaults to the operand type there. This
can conflict with calls to the same intrinsic from other paths.
|
|
AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __strlen routine is a millicode implementation;
we use millicode for the strlen function instead of a library call to
improve performance.
|
|
|
|
Replace the target uses of PointerLikeRegClass with RegClassByHwMode
|
|
In this commit:
(1) Added new pass manager support for `ReachingDefAnalysis`.
(2) Added a printer pass.
(3) Made the old pass manager use `ReachingDefInfoWrapperPass`.
|
|
Just directly check x86_64; isArch64Bit only adds extra
steps around this.
|
|
|
|
(#159331)
The current implementation assumes ConstantInt return values are scalar,
which is not true when use-constant-int-for-fixed-length-splat is
enabled.
|
|
compares. (#141798)
My use case is simplifying the control flow generated by LoopVectorize
when vectorising loops whose tripcount is a function of the runtime
vector length. This can be problematic because:
* CSE is a pre-LoopVectorize transform and so it's common for an IR
function to include several calls to llvm.vscale(). (NOTE: Code
generation will typically remove the duplicates)
* Pre-LoopVectorize instcombines will rewrite some multiplies as shifts.
This leads to a mismatch between the VL-based maths of the scalar loop and
that created for the vector loop, which prevents some obvious
simplifications.
SCEV does not suffer these issues because it effectively does CSE during
construction and shifts are represented as multiplies.
|
|
This patch implements the fold `lo(X * 1) + Z --> lo(X) + Z --> X iff X
== lo(X)`.
|
|
This is a generalization of the LookupPtrRegClass mechanism.
AMDGPU has several use cases for swapping the register class of
instruction operands based on the subtarget, but none of them
really fit into the box of being pointer-like.
The current system requires manual management of an arbitrary integer
ID. For the AMDGPU use case, this would end up being around 40 new
entries to manage.
This just introduces the base infrastructure. I have ports of all
the target-specific usage of PointerLikeRegClass ready.
|
|
This patch adds an overflow check to the `exactSIVtest` function to fix
the issue demonstrated in the test case added in #157085. This patch
only fixes one of the routines. To fully resolve the test case, the
other functions need to be addressed as well.
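For illustration, the shape of such a guard using overflow-aware APInt
arithmetic (hypothetical names, not the exact code in `exactSIVtest`):
```cpp
#include "llvm/ADT/APInt.h"

using namespace llvm;

// Bail out of the SIV test when an intermediate product overflows, rather
// than reasoning from a silently wrapped value.
bool productOverflows(const APInt &Coeff, const APInt &Iterations) {
  bool Overflow = false;
  (void)Coeff.smul_ov(Iterations, Overflow);
  return Overflow; // caller should treat the dependence as unknown
}
```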
|
|
There is no reason why smin has to be limited to 32 and 64 bits.
hasAndNot only exists for 32 and 64 bits, so this does not affect smax.
|
|
- The output for `--output-sort=id` matches `--output-sort=offset` for
the available readers. Tests were updated accordingly.
- For `--output-sort=none`, and per `LVReader::sortScopes()`,
`LVScope::sort()` is called on the root scope.
`LVScope::sort()` has no effect if `getSortFunction() == nullptr`, and
thus the elements are currently traversed in the order in which they
were initially added. This should change, however, after
`LVScope::Children` is removed.
|
|
If we can't fold a PTRADD's offset into its users, lowering them to
disjoint ORs is preferable: Often, a 32-bit OR instruction suffices
where we'd otherwise use a pair of 32-bit additions with carry.
This needs to be a DAGCombine (and not a selection rule) because its
main purpose is to enable subsequent DAGCombines for bitwise operations.
We don't want to just turn PTRADDs into disjoint ORs whenever that's
sound because this transform loses the information that the operation
implements pointer arithmetic, which AMDGPU for instance needs when
folding constant offsets.
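A hedged sketch of the combine's core idea (simplified; the operands, loc,
and type are assumed to come from the PTRADD node being visited):
```cpp
#include "llvm/CodeGen/SelectionDAG.h"

using namespace llvm;

// If the base and offset share no set bits, the addition is equivalent to an
// OR; marking it 'disjoint' preserves that fact for later bitwise combines.
static SDValue lowerPtrAddToDisjointOr(SelectionDAG &DAG, const SDLoc &DL,
                                       EVT PtrVT, SDValue N0, SDValue N1) {
  if (!DAG.haveNoCommonBitsSet(N0, N1))
    return SDValue(); // not provably disjoint; keep the PTRADD
  SDNodeFlags Flags;
  Flags.setDisjoint(true);
  return DAG.getNode(ISD::OR, DL, PtrVT, N0, N1, Flags);
}
```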
For SWDEV-516125.
|
|
The `NDEBUG` macro is tested for defined-ness everywhere else. The
instance here triggers a warning when compiling with `-Wundef`.
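For illustration (the exact offending line is not shown in this message),
the difference is roughly:
```cpp
// Before: warns under -Wundef when NDEBUG is not defined, because the
// macro is evaluated as an expression (undefined macros are treated as 0).
#if !NDEBUG
void debugOnlyHelperBefore() {} // placeholder body
#endif

// After: test for defined-ness instead, matching the rest of the codebase.
#ifndef NDEBUG
void debugOnlyHelperAfter() {} // placeholder body
#endif
```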
|
|
The following changes are made:
a) Typo fix (from the previous PR: https://github.com/llvm/llvm-project/pull/155747)
b) Builtins support for MIPS P8700 execution control instructions.
c) Test case.
|
|
This PR adds a TargetLowering hook, canTransformPtrArithOutOfBounds,
that targets can use to allow transformations to introduce out-of-bounds
pointer arithmetic. It also moves two such transformations from the
AMDGPU-specific DAG combines to the generic DAGCombiner.
This is motivated by target features like AArch64's checked pointer
arithmetic (CPA), which does not tolerate the introduction of
out-of-bounds pointer arithmetic.
|
|
There are more places in SIISelLowering.cpp and AMDGPUISelDAGToDAG.cpp
that check for ISD::ADD in a pointer context, but as far as I can tell
those are only relevant for 32-bit pointer arithmetic (like frame
indices/scratch addresses and LDS), for which we don't enable PTRADD
generation yet.
For SWDEV-516125.
|
|
This patch adds MC support for Zvfofp8min
https://github.com/aswaterman/riscv-misc/blob/main/isa/zvfofp8min.adoc.
|
|
when preserving inbounds (#159515)
If we know that the initial GEP was inbounds, and we change it to a sequence of
GEPs from the same base pointer where every offset is non-negative, then the
new GEPs are inbounds. So far, the implementation only checked if the extracted
offsets are non-negative. In cases where non-extracted offsets can be negative,
this would cause the inbounds flag to be wrongly preserved.
Fixes an issue in #130617 found by nikic.
|
|
For the vx form, we legalize it by widening the scalar. For the vf form, we select the right register bank.
|