Age | Commit message | Author | Files | Lines |
|
Loop headers frequently consume the loop-carried value in the header
block via non-lookthrough ops (e.g. byte-wise vector binops).
LiveRegOptimizer’s same-BB filter currently prunes these users, so the
loop-carried PHI is not coerced to i32 and the intended packed form is
lost.
Relax the filter: when the def is a PHI, allow same-BB non-lookthrough
users. Also fix the check to look at the user (CII) rather than the def
(II) so the walk does not terminate prematurely.
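A minimal sketch of the relaxed filter in that spirit; `isNonLookthrough` is a hypothetical stand-in for the pass's real lookthrough test, and the surrounding structure is an assumption, not the actual LiveRegOptimizer code:
```
// Sketch only: isNonLookthrough() is hypothetical; II is the def being
// walked, CII one of its users (names from the commit message).
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

static bool isNonLookthrough(const Instruction *I); // hypothetical helper

static void collectUsersToCoerce(Instruction *II,
                                 SmallVectorImpl<Instruction *> &Out) {
  for (User *U : II->users()) {
    auto *CII = dyn_cast<Instruction>(U);
    if (!CII)
      continue;
    // Old behavior: same-BB users were pruned unconditionally, and the
    // lookthrough test was applied to the def (II), ending the walk early.
    // New behavior: when the def is a PHI, keep same-BB non-lookthrough
    // users (e.g. byte-wise vector binops in the loop header), and apply
    // the lookthrough test to the user (CII).
    if (CII->getParent() == II->getParent() &&
        !(isa<PHINode>(II) && isNonLookthrough(CII)))
      continue;
    Out.push_back(CII);
  }
}
```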
|
|
Fix a crash that arose when intrinsics like llvm.masked.load.T.p7 were
still in the module when AMDGPULowerBufferFatPointers ran: a
captures(none) annotation would be applied to a non-pointer value,
triggering a verifier failure.
---------
Co-authored-by: Shilei Tian <i@tianshilei.me>
|
|
This is primarily to avoid folding a frame index materialized
into an SGPR into the pseudo, which would end up looking like:
```
%sreg = s_mov_b32 %stack.0
%av_32 = av_mov_b32_imm_pseudo %sreg
```
which is not useful.
Match the check used for the b64 case. This is limited to the
pseudo to avoid regressions from gfx908's special case: it
expects to pass here with v_accvgpr_write_b32 for illegal cases
and to stay in the intermediate state with an SGPR input.
This avoids regressions in a future patch.
|
|
This is a temporary fix for a regression from #154875.
The new pattern sets the hi part of the V_BFI result, and that confuses
si-fix-sgpr-copies, which is where the proper fix likely belongs.
During si-fix-sgpr-copies, an incorrect fold turns:
```
%86:vgpr_32 = V_BFI_B32_e64
%87:sreg_32 = COPY %86.hi16:vgpr_32
%95:vgpr_32 = nofpexcept V_PACK_B32_F16_t16_e64 0, killed %87:sreg_32, 0, %63:vgpr_16, 0, 0
```
into:
```
%86:vgpr_32 = V_BFI_B32_e64
%95:vgpr_32 = nofpexcept V_PACK_B32_F16_t16_e64 0, %86.lo16:vgpr_32, 0, %63:vgpr_16, 0, 0
```
Fixes: Vulkan CTS dEQP-VK.glsl.builtin.precision_fp16_storage32b.*.
|
|
Selecting VGPRs for the uniform version of this pattern may lead to
unnecessary VGPR usage and waterfall loops.
|
|
SMLoc itself encapsulates just a pointer, so there is no need to pass or
return it by reference.
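For illustration, a sketch of what by-value passing looks like (not taken from the patch; `advance` is a made-up function):
```
#include "llvm/Support/SMLoc.h"
#include <cstddef>

// SMLoc holds only a const char *, so copying it costs the same as
// copying a reference and avoids an indirection.
static llvm::SMLoc advance(llvm::SMLoc Loc, size_t N) { // by value
  return llvm::SMLoc::getFromPointer(Loc.getPointer() + N);
}
```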
|
|
|
|
The RC of the folded operand does not need to be constrained based on
the RC of the current operand we are folding into.
The purpose of this PR is to facilitate
https://github.com/llvm/llvm-project/pull/151033.
|
|
Fix two bugs. The first bug hid the second bug.
1. Calculate IsVALU correctly during UADDO/USUBO selection. IsVALU
should be false if the carryout users are UADDO_CARRY/USUBO_CARRY.
However instruction selection visits uses before defs, so the
UADDO_CARRY/USUBO_CARRY nodes are normally (probably always) already
converted to S_ADD_CO_PSEUDO/S_SUB_CO_PSEUDO. Fix to check for these
machine opcodes.
2. Without this fix, UADDO/USUBO selection will always select the VALU
instructions V_ADD_CO_U32_e64/V_SUB_CO_U32_e64.
S_UADDO_PSEUDO/S_USUBO_PSEUDO were never selected in the CodeGen/AMDGPU
tests. Thus, S_UADDO_PSEUDO/S_USUBO_PSEUDO cases were never hit in
EmitInstrWithCustomInserter. The code generation for
S_UADDO_PSEUDO/S_USUBO_PSEUDO had a bug: it could not handle a 32-bit
$scc_out.
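A hedged sketch of fix 1; the opcode names come from this message, while the helper, the iteration shape, and the carryout result number are assumptions about the selector code:
```
// Sketch only; the AMDGPU::* opcodes come from the generated AMDGPU
// instruction info headers.
#include "llvm/CodeGen/SelectionDAGNodes.h"

// Returns true if some carryout user forces the VALU forms.
static bool carryoutForcesVALU(llvm::SDNode *N) {
  for (llvm::SDUse &U : N->uses()) {
    if (U.getResNo() != 1)
      continue; // only users of the carryout result matter
    llvm::SDNode *User = U.getUser();
    // Selection visits uses before defs, so UADDO_CARRY/USUBO_CARRY users
    // have normally already been turned into machine nodes; check for the
    // scalar pseudos rather than the ISD opcodes.
    if (User->isMachineOpcode() &&
        (User->getMachineOpcode() == AMDGPU::S_ADD_CO_PSEUDO ||
         User->getMachineOpcode() == AMDGPU::S_SUB_CO_PSEUDO))
      continue; // scalar carry consumer: stay on the SALU path
    return true;
  }
  return false;
}
```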
---------
Signed-off-by: John Lu <John.Lu@amd.com>
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
|
|
Rename canonical register names with a WAVE_ prefix for GFX12.
Maintain backward compatibility through aliases.
|
|
Make sure we cannot be in a mode with both wavesizes. This
prevents assertions in a future change. This should probably
just be an error, but we do not have a good way to report
errors from the MCSubtargetInfo constructor.
|
|
Add v4,v8,v16,v32 legalizations for the following operations:
- `FADD`
- `FMUL`
- `FMA`
- `FCANONICALIZE`
|
|
This change builds on https://github.com/llvm/llvm-project/pull/160319
which tries to clarify which *callers* (not backends) assume that the
result is actually trivial.
This change itself should be NFC. Essentially, I'm just renaming the
existing isTrivialRematerializable to the non-trivial version and then
adding a new trivial version (with the same name as the prior function)
and simplifying a few callers which want that semantic.
This change does *not* enable non-trivial remat any more broadly than
was already done for our targets which were lying through the old APIs;
that will come separately. The goal here is simply to make the code
easier to follow in terms of what assumptions are being made where.
---------
Co-authored-by: Luke Lau <luke_lau@icloud.com>
|
|
|
|
This patch includes:
1. The fma_mix instruction takes an fp16 input but places the operand in
a vgpr32. Update the selector to insert a vgpr32 for true16 mode if
necessary.
2. The fma_mix instruction returns an fp16 output but places the vdst in
a vgpr32. Create an fma_mix_t16 pseudo instruction for the isel pattern
and lower it to mix_lo/hi in the MC lowering pass.
These changes stop isel from emitting an illegal `vgpr32 = COPY vgpr16`
and improve code quality.
|
|
On new targets like `gfx1250`, the buffer resource (V#) now uses this
format:
```
base (57-bit): resource[56:0]
num_records (45-bit): resource[101:57]
reserved (6-bit): resource[107:102]
stride (14-bit): resource[121:108]
```
This PR changes the type of `num_records` from `i32` to `i64` in both
the builtin and the intrinsic, and adds support for lowering the new
format.
Fixes SWDEV-554034.
---------
Co-authored-by: Krzysztof Drewniak <Krzysztof.Drewniak@amd.com>
|
|
Move common declarations from switch cases to function entry.
Signed-off-by: John Lu <John.Lu@amd.com>
|
|
And rework the lit64() support to use it.
The rules for when to add lit64() can be simplified and
improved. In this change, however, we just follow the existing
conventions on the assembler and disassembler sides.
In codegen we do not (and normally should not need to) add explicit
lit() and lit64() modifiers, so the codegen tests lose them. The change
is otherwise NFCI.
It also simplifies operand printing.
|
|
the global AS (#160129)
Mostly NFC, and adds an assertion for gfx12 to ensure that no atomic scratch
instructions are present in the case of GloballyAddressableScratch. This should
always hold because of #154710.
|
|
Remove recursion to avoid stack overflow on large CFGs.
Avoid worklist for hazard search within single MachineBasicBlock.
Ensure predecessors are visited for all state combinations.
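The generic shape of such a change, as a self-contained sketch (illustrative only, not the actual pass code):
```
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

struct Block { std::vector<Block *> preds; };

// Explicit worklist instead of recursion: deep CFGs cannot overflow the
// stack, and tracking (block, state) pairs lets a block be revisited
// under a different state combination without looping forever.
void visitPredecessors(Block *entry, uint32_t initialState) {
  std::vector<std::pair<Block *, uint32_t>> worklist{{entry, initialState}};
  std::set<std::pair<Block *, uint32_t>> visited;
  while (!worklist.empty()) {
    auto [b, state] = worklist.back();
    worklist.pop_back();
    if (!visited.insert({b, state}).second)
      continue; // this state combination was already processed here
    uint32_t outState = state; // would be updated by the hazard scan of b
    for (Block *p : b->preds)
      worklist.push_back({p, outState});
  }
}
```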
|
|
Ensure live intervals for EXEC and SCC are removed on all paths which
generate instructions.
|
|
|
|
This can happen when `xor cond, -1` is not combined.
|
|
.. to isReMaterializableImpl. The "Really" naming has always been
awkward, and we're working towards removing the "Trivial" part now,
so go ahead and remove both pieces in a single rename.
Note that this doesn't change any aspect of the current
implementation; we still "mostly" only return instructions which
are trivial (meaning no virtual register uses), but some targets
do lie about that today.
|
|
|
|
They represent mutually exclusive values of the same attribute.
|
|
The pattern does not factor in saddr. There is no way to write a test
for it because gfx1200 has neither sram-ecc nor saddr, and gfx1250 has
sram-ecc but does not fall into this preserving category. Nevertheless,
the day we can fix it, this would become a problem. For now it is OK
that the change does not fail.
This was untested before and it is untested now, but at least the t16
block uses t16 patterns.
|
|
Reverts llvm/llvm-project#154115
Co-authored-by: ronlieb <ron.lieberman@amd.com>
|
|
This patch makes it so that InstrPostProcess::postProcessInstruction
takes in a reference to a mca::Instruction rather than a reference to a
std::unique_ptr. Without this, InstrPostProcess cannot be used with MCA
instruction recycling because it needs to be called on both newly
created instructions and instructions that have been recycled. We only
have access to a raw pointer for instructions that have been recycled
rather than a reference to the std::unique_ptr that owns them.
This patch adds a call in the existing instruction recycling unit test
to ensure the API remains compatible with this use case.
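A simplified view of the interface change (the "before" shape follows the description above; qualifiers and other parameters are elided):
```
#include <memory>

namespace llvm { class MCInst; namespace mca { class Instruction; } }

struct Before {
  // Requires the owning unique_ptr, which is unavailable for recycled
  // instructions that the caller only holds by raw pointer.
  virtual void postProcessInstruction(
      std::unique_ptr<llvm::mca::Instruction> &Inst,
      const llvm::MCInst &MCI) = 0;
};

struct After {
  // A plain reference works for both newly created and recycled
  // instructions.
  virtual void postProcessInstruction(llvm::mca::Instruction &Inst,
                                      const llvm::MCInst &MCI) = 0;
};
```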
|
|
A fence release could be followed by a barrier, so it should wait for
the relevant memory accesses to complete even if it is MMRA-limited to
LDS. Previously, that wait was skipped for non-global fence releases.
Fixes SWDEV-554932.
|
|
This patch adds the MIR parsing and serialization support for save and
restore points with subsets of callee saved registers. That is, it
syntactically allows a function to contain two or more distinct
sub-regions in which distinct subsets of registers are spilled/filled as
callee save. This is useful if e.g. one of the CSRs isn't modified in
one of the sub-regions, but is in the other(s).
Support for actually using this capability in code generation is still
forthcoming. This patch is the next logical step for multiple
save/restore points support.
All points are now stored in a DenseMap from MBB to a vector of
CalleeSavedInfo.
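The shape of that storage, as a sketch (the alias name is invented; the types are the ones named above):
```
#include "llvm/ADT/DenseMap.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFrameInfo.h" // CalleeSavedInfo
#include <vector>

// Each save/restore point maps to the subset of CSRs handled there.
using SaveRestorePoints =
    llvm::DenseMap<llvm::MachineBasicBlock *,
                   std::vector<llvm::CalleeSavedInfo>>;
```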
Shrink-Wrap points split Part 4.
RFC:
https://discourse.llvm.org/t/shrink-wrap-save-restore-points-splitting/83581
Part 1: https://github.com/llvm/llvm-project/pull/117862 (landed)
Part 2: https://github.com/llvm/llvm-project/pull/119355 (landed)
Part 3: https://github.com/llvm/llvm-project/pull/119357 (landed)
Part 5: https://github.com/llvm/llvm-project/pull/119359 (likely to be
further split)
|
|
Since #154205 some subtargets can use up to 32 user SGPRs. Add names for
them all so they can be pretty printed in PAL metadata.
|
|
|
|
Streamline code by only declaring TRI/TII once and using isWave64().
Signed-off-by: John Lu <John.Lu@amd.com>
|
|
Use correct unsigned overflow instructions for
S_UADDO_PSEUDO/S_USUBO_PSEUDO. Note that this issue was hidden because
instruction selection never selected S_UADDO_PSEUDO/S_USUBO_PSEUDO, which
will be addressed in https://github.com/llvm/llvm-project/pull/159814.
Signed-off-by: John Lu <John.Lu@amd.com>
|
|
(#160037)
"class HasMember##member" detects a specific member with a complex
SFINAE logic involving multiple inheritance. This patch simplifies
that by switching to llvm::is_detected.
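A standalone illustration of the detection idiom the patch switches to; `HasFoo` and `foo()` are made-up names, not the members the real macro checks:
```
#include "llvm/ADT/STLExtras.h" // llvm::is_detected
#include <utility>

template <typename T>
using has_foo_t = decltype(std::declval<T &>().foo());

template <typename T>
constexpr bool HasFoo = llvm::is_detected<has_foo_t, T>::value;

struct WithFoo { void foo(); };
struct WithoutFoo {};
static_assert(HasFoo<WithFoo> && !HasFoo<WithoutFoo>,
              "one alias template replaces the SFINAE class machinery");
```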
|
|
|
|
|
|
Without this patch, we compute a type trait in a roundabout manner:
- Compute a boolean value in the primary template.
- Pass the value to std::enable_if_t.
- Return std::true_type (or std::false_type on the fallback path).
- Compare the return type to std::true_type.
That is, when the expression for the first boolean value above is well
formed, we already have the answer we are looking for.
This patch bypasses the entire sequence by having the primary template
return std::bool_constant and adjusting RESULT to extract the ::value
of the boolean type.
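A self-contained sketch of the shortcut with invented names (the real code checks a different expression):
```
#include <type_traits>
#include <utility>

// Well formed only when T + T is valid; the primary template returns the
// answer as a std::bool_constant instead of routing it through
// std::enable_if_t and std::true_type.
template <typename T>
auto hasPlusImpl(int)
    -> std::bool_constant<(sizeof(std::declval<T>() + std::declval<T>()) > 0)>;
template <typename T> std::false_type hasPlusImpl(...); // fallback path

template <typename T>
constexpr bool HasPlus = decltype(hasPlusImpl<T>(0))::value;

struct NoPlus {};
static_assert(HasPlus<int> && !HasPlus<NoPlus>, "::value read directly");
```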
|
|
(#157968)
This is a cleaned up version of PR #151704. These optimizations are now
performed post-RA scheduling.
|
|
This is a generalization of the LookupPtrRegClass mechanism.
AMDGPU has several use cases for swapping the register class of
instruction operands based on the subtarget, but none of them
really fit into the box of being pointer-like.
The current system requires manual management of an arbitrary integer
ID. For the AMDGPU use case, this would end up being around 40 new
entries to manage.
This just introduces the base infrastructure. I have ports of all
the target-specific usage of PointerLikeRegClass ready.
|
|
This PR adds a TargetLowering hook, canTransformPtrArithOutOfBounds,
that targets can use to allow transformations to introduce out-of-bounds
pointer arithmetic. It also moves two such transformations from the
AMDGPU-specific DAG combines to the generic DAGCombiner.
This is motivated by target features like AArch64's checked pointer
arithmetic, CPA, which does not tolerate the introduction of
out-of-bounds pointer arithmetic.
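A hedged sketch of how a CPA-like target might use the hook; only the hook's name comes from this PR, and the parameters and override shape here are assumptions:
```
// Sketch only: the real hook's signature may differ.
class MyCPATargetLowering /* : public llvm::TargetLowering */ {
public:
  // Returning false keeps the combines from introducing pointer
  // arithmetic that transiently leaves the underlying object's bounds.
  bool canTransformPtrArithOutOfBounds(/* hypothetical parameters */) const {
    return false;
  }
};
```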
|
|
There are more places in SIISelLowering.cpp and AMDGPUISelDAGToDAG.cpp
that check for ISD::ADD in a pointer context, but as far as I can tell
those are only relevant for 32-bit pointer arithmetic (like frame
indices/scratch addresses and LDS), for which we don't enable PTRADD
generation yet.
For SWDEV-516125.
|
|
The manual legalizeOperands code only needs to consider cases that
require full instruction context to know if the operand is legal.
It does not need to handle basic operand register class constraints.
|
|
|
|
|
|
|
|
This patch mirrors similar patterns for ISD::ADD. The main difference is
that ISD::ADD is commutative, so that a pattern definition for, e.g.,
(add (mul x, y), z), automatically also handles (add z, (mul x, y)).
ISD::PTRADD is not commutative, so we would need to handle these cases
explicitly. This patch only implements (ptradd z, (op x, y)) patterns,
where the nested operation (shift or multiply) is the offset of the
ptradd (i.e., the right operand), since base pointers that are the
result of a shift or multiply seem less likely.
For SWDEV-516125.
|
|
(#157843)
Add a new event, SCC_WRITE, for s_barrier_signal_isfirst and
s_barrier_leave, instructions that write to SCC; the associated counter
is KM_CNT.
Also start tracking SCC reads and writes.
s_barrier_wait on the same barrier guarantees that the SCC write from
s_barrier_signal_isfirst has landed, so there is no need to insert
s_wait_kmcnt.
|
|
The operand constraints already express this requirement, and
InstrEmitter will respect them.
|