Age | Commit message (Collapse) | Author | Files | Lines |
|
Remove NoSignedZerosFPMath in TargetLowering part, users should always
use instruction level fast math flags.
|
|
This patch renames comparators
- from `std::equal_to<llvm::rdf::RegisterRef>` to
`llvm::rdf::RegisterRefEqualTo`, and
- from `std::less<llvm::rdf::RegisterRef>` to
`llvm::rdf::RegisterRefLess`.
The original specializations don't satisfy the requirements for the
original `std` templates by being stateful and
non-default-constructible, so they make the program have UB due to C++17
[namespace.std]/2, C++20/23 [namespace.std]/5.
> A program may explicitly instantiate a class template defined in the
standard library only if the declaration
> - depends on the name of at least one program-defined type, and
> - the instantiation meets the standard library requirements for the
original template.
|
|
Fixes #160981
The exponential part of a floating-point number is signed. This patch
prevents treating it as unsigned.
|
|
|
|
(#160294)
This is essentially the same patch as
116ca9522e89f1e4e02676b5bbe505e80c4d4933;
when trying to match a physreg hint, try to find a compatible physreg if
there is
a subregister copy. This has the slight difference of using getSubReg on
the hint
instead of getMatchingSuperReg (the other use should also use getSubReg
instead,
it's faster).
At the moment this turns out to have very little effect. The adjacent
code needs
better handling of subregisters, so continue adding this piecemeal. The
X86 test
shows a net reduction in real instructions, plus a few new kills.
|
|
[nfc]" (#160897)
Reverts llvm/llvm-project#160765. Failures on buildbot indicate second
assertion does not in fact hold.
|
|
Uses the existing format of the LiveRange printer, and just factors it
out so that you can do vni->dump() when debugging, or log a vni in a
debug print statement.
|
|
We should always be able to find the VNInfo in the original live
interval which corresponds to the subset we're trying to spill, and the
only cases where we have a VNInfo without a definition instruction are
if the vni is unused, or corresponds to a phi. Adjust the code structure
to explicitly check for PHIDef, and assert the stronger conditions.
|
|
We didn't have trace logging for two cases in this routine which makes
it sometimes hard to tell what is going on. In addition to debug trace
statements, add comments to explain the logic behind the early exits
which don't mark the virtual register live. Suggestions on how to word
these more precisely very welcome; I'm not clear I understand all the
intrinicies of this code myself.
|
|
This heuristic was originally added in 40c4aa with the stated purpose of
avoiding global split on live long ranges created by MachineLICM
hoisting trivially rematerializable instructions. In the meantime,
various backends have introduced non-trivial rematerialization cases,
MachineLICM gained an explicitly triviality check, and we've reworked
our APIs to match naming wise. Let's move this heuristic back to truely
trivial remat only.
This is a functional change, though somewhat hard to hit. This change
will cause non-trivially rematerializable instructions to be globally
split more often. This is likely a good thing since non-trivial remat
may not be legal at all possible points in the live interval, but may
cost slightly more compile time.
I don't have a motivating example; I found it when reviewing the callers
of isRemMaterializable(MI).
|
|
On targets where f32 maximumnum is legal, but maximumnum on vectors of
smaller types is not legal (e.g. v2f16), try unrolling the vector first
as part of the expansion.
Only fall back to expanding the full maximumnum computation into
compares + selects if maximumnum on the scalar element type cannot be
supported.
|
|
successor is loop header (#154063)
This addresses a performance issue for our downstream GPU target that
sets requiresStructuredCFG to true. The issue is that EarlyMachineLICM
pass does not hoist loop invariants because a critical edge is not
split.
The critical edge's destination a loop header. Splitting the critical
edge will not break structured CFG.
Add a nvptx test to demonstrate the issue since the target also
requires structured CFG.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
|
|
(#159110)
Currently, something like:
```
$eax = MOV32ri -11, implicit-def $rax
%al = COPY $eax
```
Can be rematerialized as:
```
dead $eax = MOV32ri -11, implicit-def $rax
```
Which marks the full $rax as used, not just $al.
With this change, this is rematerialized as:
```
dead $eax = MOV32ri -11, implicit-def dead $rax, implicit-def $al
```
To indicate that only $al is used.
Note: This issue is latent right now, but is exposed when #134408 is
applied, as it results in the register pressure being incorrectly
calculated (unless this patch is applied too).
I think this change is in line with past fixes in this area, notably:
https://github.com/llvm/llvm-project/commit/059cead5ed7aa11ce1eae0bcc751ea0d1e23ea75
https://github.com/llvm/llvm-project/commit/69cd121dd9945429b565b6a5eb8719130de880a7
|
|
Post-RA machine sinking could sink a copy of sub-register into
a successor. However, the sub-register might not be removed from the
live-in bitmask of its super register in successor and then a later
pass, e.g, if-converter, may add an implicit use of the register from
live-in resulting in an use of an undefined register. This change makes
sure subrange of live-ins from super register could be removed as well.
|
|
Remove these global flags and use node level flags instead.
|
|
weight discount (#159180)
This aims to fix the issue that caused https://reviews.llvm.org/D106408
to be reverted.
CalcSpillWeights will reduce the weight of an interval by half if it's
considered rematerializable, so it will be evicted before others.
It does this by checking TII.isTriviallyReMaterializable. However
rematerialization may still fail if any of the defining MI's uses aren't
available at the locations it needs to be rematerialized.
LiveRangeEdit::canRematerializeAt calls allUsesAvailableAt to check this
but CalcSpillWeights doesn't, so the two diverge.
This fixes it by also checking allUsesAvailableAt in CalcSpillWeights.
In practice this has zero change AArch64/X86-64/RISC-V as measured on
llvm-test-suite, but prevents weights from being perturbed in an
upcoming patch which enables more rematerialization by re-attempting
https://reviews.llvm.org/D106408
|
|
as the comparison (#160358)
We match uadd's behavior here.
Codegen comparison: https://godbolt.org/z/x8j4EhGno
|
|
- This implementation is adapted from **SDAG
X86TargetLowering::LowerSET_ROUNDING**.
|
|
(#160683)
The insert point management is messy here. We probably should
have an insert point guard, and not have ths dest operand utilities
modify the insert point.
Fixes #159716
|
|
Some passes, like AMDGPU's SIInsertHardClauses, wrap sequences of
instructions into bundles, and these bundles may end up with debug
instructions in the middle. Assuming that this is allowed, this patch
fixes MachineStripDebug to be able to remove these instructions from
inside a bundle.
|
|
|
|
This change builds on https://github.com/llvm/llvm-project/pull/160319
which tries to clarify which *callers* (not backends) assume that the
result is actually trivial.
This change itself should be NFC. Essentially, I'm just renaming the
existing isTrivialRematerializable to the non-trivial version and then
adding a new trivial version (with the same name as the prior function)
and simplifying a few callers which want that semantic.
This change does *not* enable non-trivial remat any more broadly than
was already done for our targets which were lying through the old APIs;
that will come separately. The goal here is simply to make the code
easier to follow in terms of what assumptions are being made where.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
|
|
ucmp (#159889)
Same deal we use for determining ucmp vs scmp.
Using selects on platforms that like selects is better than using usubo.
Rename function to be more general fitting this new description.
|
|
The CFG allows us to do layout optimization in the compiler.
Furthermore, it allows further branch optimization.
|
|
The AArch64 zsub regs are scalable, so defined with a size of -1 (which
comes through as 65535). The RegisterSize is only 128, so code to try
and find overlapping regs of a z30_z31 in DwarfEmitter can crash on
trying to access out of range bits in a BitVector. Hexagon and x86 also
contain subregs with unknown sizes.
Ideally most of these would be scalable values but in the meantime add a
check that the register are small enough to overlap with the current
register size, to prevent us from crashing.
This fixes the issue reported on #153810.
|
|
(#160145)
It is unnecessary and confusing to have a do/while loop that checks
SU->isScheduled as this should never be true.
ScheduleDAGMI::updateQueues() is always called after pickNode() and it
sets isScheduled on the SU. Turn this into an assertion instead.
|
|
This reverts commit bd2dac98ed4f19dcf90c098ae0b9976604880b59, and part of ca2e8fc928ad103f46ca9f827e147c43db3a5c47.
My speculative attempt at fixing buildbot failed, so just roll back the
relavant part of the change.
|
|
.. to isReMaterializableImpl. The "Really" naming has always been
awkward, and we're working towards removing the "Trivial" part now,
so go ehead and remove both pieces in a single rename.
Note that this doesn't change any aspect of the current
implementation; we still "mostly" only return instructions which
are trivial (meaning no virtual register uses), but some targets
do lie about that today.
|
|
This is a preparatory change for an upcoming reorganization of our
rematerialization APIs. Despite the interface being documented as
"trivial" (meaning no virtual register uses on the instruction being
considered for remat), our actual implementation inconsistently supports
non-trivial remat, and certain backends (AMDGPU and RISC-V mostly) lie
about instructions being trivial to abuse that. We want to allow
non-triial remat more broadly, but first we need to do some cleanup to
make it understandable what's going on.
These three call sites are ones which appear to actually want the
trivial definition, and appear fairly low risk to change.
p.s. I'm deliberately *not* updating any APIs in this change, I'm going
to do that as a followup once it's clear which category each callsite
fits in.
|
|
subprogram DIEs" (#160349)
Reverts llvm/llvm-project#159104 due to the issues reported in
https://github.com/llvm/llvm-project/issues/160197.
|
|
This patch adds the MIR parsing and serialization support for save and
restore points with subsets of callee saved registers. That is, it
syntactically allows a function to contain two or more distinct
sub-regions in which distinct subsets of registers are spilled/filled as
callee save. This is useful if e.g. one of the CSRs isn't modified in
one of the sub-regions, but is in the other(s).
Support for actually using this capability in code generation is still
forthcoming. This patch is the next logical step for multiple
save/restore points support.
All points are now stored in DenseMap from MBB to vector of
CalleeSavedInfo.
Shrink-Wrap points split Part 4.
RFC:
https://discourse.llvm.org/t/shrink-wrap-save-restore-points-splitting/83581
Part 1: https://github.com/llvm/llvm-project/pull/117862 (landed)
Part 2: https://github.com/llvm/llvm-project/pull/119355 (landed)
Part 3: https://github.com/llvm/llvm-project/pull/119357 (landed)
Part 5: https://github.com/llvm/llvm-project/pull/119359 (likely to be
further split)
|
|
Change the eviction advisor heuristic cost based on number of broken
hints to work in units of copy cost, rather than a magic number 1.
The intent is to allow breaking hints for cheap subregisters in favor
of more expensive register tuples.
The llvm.amdgcn.image.dim.gfx90a.ll change shows a simple example of
the case I am attempting to solve. Use of tuples in ABI contexts ends up
looking like this:
%argN = COPY $vgprN
%tuple = inst %argN
$vgpr0 = COPY %tuple.sub0
$vgpr1 = COPY %tuple.sub1
$vgpr2 = COPY %tuple.sub2
$vgpr3 = COPY %tuple.sub3
Since there are physreg copies in the input and output sequence,
both have hints to a physreg. The wider tuple hint on the output
should win though, since this satisfies 4 hints instead of 1.
This is the obvious part of a larger change to better handle
subregister interference with register tuples, and is not sufficient
to handle the original case I am looking at. There are several bugs here
that are proving tricky to untangle. In particular, there is a double
counting bug for all registers with multiple regunits; the cost of
breaking
the interfering hint is added for each interfering virtual register,
which have repeat visits across regunits. Fixing the double counting
badly
regresses a number of RISCV tests, which seem to rely on overestimating
the cost in tryFindEvictionCandidate to avoid early-exiting the eviction
candidate loop (RISCV is possibly underestimating the copy costs for
vector registers).
|
|
Extend CallGraphSection to include metadata about direct calls. This
simplifies the design of tools that must parse .callgraph section to not
require dependency on MC layer.
|
|
Currently there are two serialization modes for bitstream Remarks:
standalone and separate. The separate mode splits remark metadata (e.g.
the string table) from actual remark data. The metadata is written into
the object file by the AsmPrinter, while the remark data is stored in a
separate remarks file. This means we can't use bitstream remarks with
tools like opt that don't generate an object file. Also, it is confusing
to post-process bitstream remarks files, because only the standalone
files can be read by llvm-remarkutil. We always need to use dsymutil
to convert the separate files to standalone files, which only works for
MachO. It is not possible for clang/opt to directly emit bitstream
remark files in standalone mode, because the string table can only be
serialized after all remarks were emitted.
Therefore, this change completely removes the separate serialization
mode. Instead, the remark string table is now always written to the end
of the remarks file. This requires us to tell the serializer when to
finalize remark serialization. This automatically happens when the
serializer goes out of scope. However, often the remark file goes out of
scope before the serializer is destroyed. To diagnose this, I have added
an assert to alert users that they need to explicitly call
finalizeLLVMOptimizationRemarks.
This change paves the way for further improvements to the remark
infrastructure, including more tooling (e.g. #159784), size optimizations
for bitstream remarks, and more.
Pull Request: https://github.com/llvm/llvm-project/pull/156715
|
|
This PR introduces the support for the SPIR-V extension
`SPV_KHR_bfloat16`. This extension extends the `OpTypeFloat` instruction
to enable the use of bfloat16 types with cooperative matrices and dot
products.
TODO:
Per the `SPV_KHR_bfloat16` extension, there are a limited number of
instructions that can use the bfloat16 type. For example, arithmetic
instructions like `FAdd` or `FMul` can't operate on `bfloat16` values.
Therefore, a future patch should be added to either emit an error or
fall back to FP32 for arithmetic in cases where bfloat16 must not be
used.
Reference Specification:
https://github.com/KhronosGroup/SPIRV-Registry/blob/main/extensions/KHR/SPV_KHR_bfloat16.asciidoc
|
|
Make the actual use context less ugly.
|
|
If a COPY uses Reg but only in an implicit operand then the new
implementation ignores it but the old implementation would have treated
it as a copy of Reg. Probably this case never occurs in practice. Other
than that, this patch is NFC.
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
|
|
computeKnownBits/ComputeNumSignBits (#159692)
Resolves #158332
|
|
Fixes [157370](https://github.com/llvm/llvm-project/issues/157370)
UREM General proof: https://alive2.llvm.org/ce/z/b_GQJX
SREM General proof: https://alive2.llvm.org/ce/z/Whkaxh
I have added it as rv32i and rv64i tests because they are the only architectures where I could verify that it works.
|
|
NFC (#159965)
|
|
Fixes #159912
|
|
The code is taken from `SelectionDAG::computeKnownBits`.
This ticks off ABS from #150515
|
|
(#159839)
LiveRangeEdit's rematerialization checking logic is used in two quite
different ways. For SplitKit and InlineSpiller, we're analyzing all defs
associated with a live interval, doing that analysis up front, and then
using the result a bit later. The RegisterCoalescer, we're analysing
exactly one ValNo at a time, and using the legality result immediately.
LRE had a checkRematerializable which existed basically to adapt the
later into the former usage model.
Instead, this change bypasses the ScannedRemat and Remattable
structures, and directly queries the underlying routines. This is easy
to read, and makes it more clear as to which uses actually need the
deferred analysis. (A following change may try to unwind that too, but
it's not strictly NFC.)
|
|
When deciding to sink address instructions into their uses, we check if
it is profitable to do so. The profitability check is based on the types
of uses of this address instruction -- if there are users which are not
memory instructions, then do not fold.
However, this profitability check wasn't considering target intrinsics,
which may be loads / stores.
This adds some logic to handle target memory intrinsics.
|
|
This is a common pattern to initialize Knownbits that occurs before
loops that call intersectWith.
|
|
externally (#159143)
Rather then defining these tags in each object file that requires them
we can can declare them as undefined and require that they defined
externally in, for example, compiler-rt or libcxxabi.
|
|
AIX has "millicode" routines, which are functions loaded at boot time
into fixed addresses in kernel memory. This allows them to be customized
for the processor. The __strlen routine is a millicode implementation;
we use millicode for the strlen function instead of a library call to
improve performance.
|
|
In this commit:
(1) Added new pass manager support for `ReachingDefAnalysis`.
(2) Added printer pass.
(3) Make old pass manager use `ReachingDefInfoWrapperPass`
|
|
This is a generalization of the LookupPtrRegClass mechanism.
AMDGPU has several use cases for swapping the register class of
instruction operands based on the subtarget, but none of them
really fit into the box of being pointer-like.
The current system requires manual management of an arbitrary integer
ID. For the AMDGPU use case, this would end up being around 40 new
entries to manage.
This just introduces the base infrastructure. I have ports of all
the target specific usage of PointerLikeRegClass ready.
|
|
If we can't fold a PTRADD's offset into its users, lowering them to
disjoint ORs is preferable: Often, a 32-bit OR instruction suffices
where we'd otherwise use a pair of 32-bit additions with carry.
This needs to be a DAGCombine (and not a selection rule) because its
main purpose is to enable subsequent DAGCombines for bitwise operations.
We don't want to just turn PTRADDs into disjoint ORs whenever that's
sound because this transform loses the information that the operation
implements pointer arithmetic, which AMDGPU for instance needs when
folding constant offsets.
For SWDEV-516125.
|