Age | Commit message (Collapse) | Author | Files | Lines |
|
(#157968)
This is a cleaned up version of PR #151704. These optimizations are now
performed post-RA scheduling.
|
|
This is a generalization of the LookupPtrRegClass mechanism.
AMDGPU has several use cases for swapping the register class of
instruction operands based on the subtarget, but none of them
really fit into the box of being pointer-like.
The current system requires manual management of an arbitrary integer
ID. For the AMDGPU use case, this would end up being around 40 new
entries to manage.
This just introduces the base infrastructure. I have ports of all
the target specific usage of PointerLikeRegClass ready.
|
|
This PR adds a TargetLowering hook, canTransformPtrArithOutOfBounds,
that targets can use to allow transformations to introduce out-of-bounds
pointer arithmetic. It also moves two such transformations from the
AMDGPU-specific DAG combines to the generic DAGCombiner.
This is motivated by target features like AArch64's checked pointer
arithmetic, CPA, which does not tolerate the introduction of
out-of-bounds pointer arithmetic.
|
|
There are more places in SIISelLowering.cpp and AMDGPUISelDAGToDAG.cpp
that check for ISD::ADD in a pointer context, but as far as I can tell
those are only relevant for 32-bit pointer arithmetic (like frame
indices/scratch addresses and LDS), for which we don't enable PTRADD
generation yet.
For SWDEV-516125.
|
|
The manual legalizeOperands code only need to consider cases that
require full instruction context to know if the operand is legal.
This does not need to handle basic operand register class constraints.
|
|
|
|
|
|
|
|
This patch mirrors similar patterns for ISD::ADD. The main difference is
that ISD::ADD is commutative, so that a pattern definition for, e.g.,
(add (mul x, y), z), automatically also handles (add z, (mul x, y)).
ISD::PTRADD is not commutative, so we would need to handle these cases
explicitly. This patch only implements (ptradd z, (op x, y)) patterns,
where the nested operation (shift or multiply) is the offset of the
ptradd (i.e., the right operand), since base pointers that are the
result of a shift or multiply seem less likely.
For SWDEV-516125.
|
|
(#157843)
Add new event SCC_WRITE for s_barrier_signal_isfirst and s_barrier_leave,
instructions that write to SCC, counter is KM_CNT.
Also start tracking SCC for reads and writes.
s_barrier_wait on the same barrier guarantees that the SCC write from
s_barrier_signal_isfirst has landed, no need to insert s_wait_kmcnt.
|
|
The operand constraints already express this constraint, and
InstrEmitter will respect them.
|
|
|
|
|
|
|
|
|
|
Co-authored-by: Jay Foad <Jay.Foad@amd.com>
Co-authored-by: Jay Foad <Jay.Foad@amd.com>
|
|
Instructions in GFX11 (#157795)
It seems the VMEM access on hi/lo half could interfere the other half.
Track waitcnt of vgpr32 instead of vgpr16 for 16bit reg in GFX11.
---------
Co-authored-by: Joe Nash <joseph.nash@amd.com>
|
|
If the source register of a copy was a physical sgpr copied to an
s1 value, this would assert.
|
|
The wavesize hacking part was already done in
createAMDGPUMCSubtargetInfo, and we can move the default
target hack there too.
|
|
(#158182)
Tablegen would generate code to access TargetResourceIndices with
processor ID.
The TargetProcResourceIndexStart[] array is generated for each processor
which has itineraries. The processor which doesn't has itineraries is excluded
from the array. When a target has mixed processors, the processor ID may
exceed the array size and cause the error.
This patch is to generate a table mapping processor with itineraries to
resource index, so that scheduler can get the correct resource index with
processor ID.
|
|
Fixes https://github.com/iree-org/iree/issues/22001
The visitor in SplitPtrStructs would re-visit instructions if an
instruction earlier in program order caused a recursive visit() call via
getPtrParts(). This would cause instructions to be processed multiple
times.
As a consequence of this, PHI nodes could be added to the Conditionals
array multiple times, which would to a conditinoal that was already
simplified being processed multiple times. After the code moved to
InstSimplifyFolder, this re-processing, combined with more agressive
simplifications, would lead to an attempt to replace an instruction with
itself, causing an assertion failure and crash.
This commit resolves the issue and adds the reduced form of the crashing
input as a test.
|
|
|
|
Hardware initializes a single value in ttmp9 which is either the
workgroup ID X or cluster ID X. Most of this patch is a refactoring to
use a single `PreloadedValue` enumerator for this value, instead of two
enumerators `WORKGROUP_ID_X` and `CLUSTER_ID_X` referring to the same
value.
This makes it simpler to have a single attribute
`amdgpu-no-workgroup-id-x` indicating that this value is not used, which
in turns sets the TGID_EN_X bit appropriately to tell the hardware
whether to initialize it.
All of the above applies to Y and Z similarly.
Fixes: LWPSCGFX13-568
Co-authored-by: Jay Foad <jay.foad@amd.com>
|
|
The latest SP changes updated it to use `OP_SEL[0:3]` instead of
`OP_SEL[0:2]`.
Fixes SWDEV-554472.
|
|
Should not do anything.
|
|
Elide bitcast combine to build_vector in case of i64 immediate that can
be materialized through 64b mov
|
|
After replacing VGPR MFMAs with the AGPR form, we've alleviated VGPR
pressure which may have triggered spills during allocation. Identify
these spill slots, and try to reassign them to newly freed VGPRs,
and replace the spill instructions with copies.
Fixes #154260
|
|
The S_NOP instruction has an immediate operand which is one less than
the number of cycles to delay for. The maximum value that may be encoded
in this field was increased in GFX8 and again in GFX12.
|
|
(#158754)
|
|
Fixes: SWDEV-551181
|
|
We have proper encoding facilities to encode operands and instructions;
there's no need to pollute the MC representation with encoding details.
Supposed to be an NFCI, but happens to fix some re-encoded instruction
codes in disassembler tests.
The 64-bit operands are to be addressed in following patches introducing
MC-level representation for lit() and lit64() modifiers, to then be
respected by both the assembler and disassembler.
|
|
Change shouldRewriteCopySrc to return the common register
class and expose it as a utility function. I've found myself
reproducing essentially the same logic in multiple places. The
purpose of this function is to jsut work through the API constraints
of which combination of register class and subreg indexes you have.
i.e. you need to use a different function if you have 0, 1, or 2
subregister indexes involved in a pair of copy-like operations.
|
|
Create utility mechanism for finding wave size dependent opcodes used to
manipulate exec/lane masks.
|
|
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
|
|
|
|
Remember indexes of MCOperands in MCParsedAsmOperands as we add them to
instructions. Then use the indexes to find locations by known MCOperands
indexes.
Happens to fix some reported locations in tests. NFCI otherwise.
getImmLoc() is to be eliminated as well; there's enough work for another
patch.
|
|
|
|
This enables more consecutive load folding during
aggressive-instcombine.
The original motivating example provided by Jeff Byrnes:
https://godbolt.org/z/8ebcTEjTs
Example provided by Nikita Popov: https://godbolt.org/z/Gv1j4vjqE as
part of my original attempt to fix the issue (PR
[#133301](https://github.com/llvm/llvm-project/pull/133301), see his
[comment](https://github.com/llvm/llvm-project/pull/133301#issuecomment-2984905809)).
This changes the value of `IsFast` returned by `In
SITargetLowering::allowsMisalignedMemoryAccessesImpl` to be non-zero for
private and flat addresses if the subtarget supports unaligned scratch
accesses.
This enables aggressive-instcombine to do more folding of consecutive
loads (see
[here](https://github.com/llvm/llvm-project/blob/cbd496581fb6953a9a8d8387a010cc3a67d4654b/llvm/lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp#L811)).
Summary performance impact on
[composable_kernel](https://github.com/ROCm/composable_kernel):
|GPU|speedup (geomean*)|
|---|---|
|MI300A| 1.11|
|MI300X| 1.14|
|MI350X| 1.03|
[*] Just to be clear, this is the geomean across kernels which were
impacted by this change - not across all CK kernels.
|
|
|
|
Since many code are connected, this also changes how workgroup id is lowered.
Co-authored-by: Jay Foad <jay.foad@amd.com>
Co-authored-by: Ivan Kosarev <ivan.kosarev@amd.com>
|
|
(#157821)
This change was motivated by CK where many VMCNT(0)'s were generated due
to instructions lacking !alias.scope metadata. The two causes of this
were:
1) LowerLDSModule not tacking on scope metadata on a single LDS variable
2) IPSCCP pass before inliner replacing noalias ptr derivative with a
global value, which made inliner unable to track it back to the noalias
ptr argument.
However, it turns out that IPSCCP losing the scope information was
largely ineffectual as ScopedNoAliasAA was able to handle asymmetric
condition, where one MemLoc was missing scope, and still return NoAlias
result.
AMDGPU however was checking for existence of scope in SIInsertWaitcnts
and conservatively treating it as aliasing all and inserted VMCNT(0)
before DS_READs, forcing it to wait for all previous LDS DMA
instructions.
Since we know that ScopedNoAliasAA can handle asymmetry, we should also
allow AA query to determine if two MIs may alias.
Passed PSDB.
Previous attempt to address the issue in IPSCCP, likely stalled:
https://github.com/llvm/llvm-project/pull/154522
This solution may be preferrable over that as issue only affects AMDGPU.
|
|
This is a low level utility to parse the MCInstrInfo and should
not depend on the state of the function.
|
|
(#158246)
This is special for the same reason av_mov_b64_imm_pseudo is special.
|
|
The removed function SIGfx90ACacheControl::enableLoadCacheBypass() does
not actually do anything except one assert and one unreachable.
|
|
getPointerRegClass is a layering violation. Its primary purpose
is to determine how to interpret an MCInstrDesc's operands RegClass
fields. This should be context free, and only depend on the subtarget.
The model of this is also wrong, since this should be an
instruction / operand specific property, not a global pointer class.
Remove the the function argument to help stage removal of this hook
and avoid introducing any new obstacles to replacing it.
The remaining uses of the function were to get the subtarget, which
TargetRegisterInfo already belongs to. A few targets needed new
subtarget derived properties copied there.
|
|
I have no idea why this was here. MIMG atomics use tied operands
for the input and output, so AV classes should have always worked.
We have poor test coverage for AGPRs with atomics, so add a partial
set. Everything seems to work OK, although it seems image cmpswap
always uses VGPRs unnecessarily.
|
|
Follow-up to #157682
|
|
/llvm-project/llvm/lib/Target/AMDGPU/SIISelLowering.cpp:3416:9:
error: unused variable 'ValVT' [-Werror,-Wunused-variable]
EVT ValVT = VA.getValVT();
^
1 error generated.
|
|
Fixes assertion with -debug-only=isel on LowerFormalArguments result.
That assert really shouldn't be under LLVM_DEBUG.
Fixes #157997
|
|
Find a common subclass instead of directly checking for a subclass
relationship. This fixes folding logic for unaligned register defs
into aligned use contexts. e.g., a vreg_64 def into an av_64_align2
use should be able to find the common subclass vreg_align2. This
avoids regressions in future patches.
Checking the subclass was also redundant on the subregister path;
getMatchingSuperRegClass is sufficient.
|