|
This patch provides a helper function to handle non-IEEE denormal
flushing behaviours. For the dynamic mode, it returns a union of all
possible results.
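A hypothetical sketch of the idea (names and types are illustrative, not the actual helper added here): for a denormal input, each non-IEEE mode flushes to a specific value, and the dynamic mode reports the union of all of them.
```cpp
#include <cmath>
#include <vector>

// Illustrative only: which values a denormal input may be flushed to
// under a given mode; "Dynamic" is the union over all concrete modes.
enum class FlushMode { IEEE, PreserveSign, PositiveZero, Dynamic };

std::vector<double> possibleFlushResults(double Denormal, FlushMode Mode) {
  switch (Mode) {
  case FlushMode::IEEE:
    return {Denormal};                       // kept as-is
  case FlushMode::PreserveSign:
    return {std::copysign(0.0, Denormal)};   // zero with the input's sign
  case FlushMode::PositiveZero:
    return {0.0};                            // always +0
  case FlushMode::Dynamic:                   // union of all possible results
    return {Denormal, std::copysign(0.0, Denormal), 0.0};
  }
  return {};
}
```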
|
|
pressure. (#161850)
Basically this allows analyzing "why are so many VGPRs used?".
It prints all live registers at the point of maximum register pressure
and dumps the defs/uses of each of them.
Currently it can be run before and after the scheduler, but it would be
nice if it could be run in between arbitrary passes (not sure this is
possible with the legacy pass manager). Requires a debug build or a
compiler built with assertions.
It is highly recommended to run with debug info so that instructions
have debug locations.
Example output:
```
*** Register pressure info (VGPRs) for _ZN7ck_tile6ken.... ***
Max pressure is 256 VGPRs at 41780e@BB.18 (LoopHdr BB.16, Depth 1): %9858:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10137:vreg_128_align2, %10141:vreg_128_align2, %9858:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
Live registers with single definition (123 VGPRs):
%10126:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs)
def 41600r@BB.18 (LoopHdr BB.16, Depth 1): undef %10126.sub0_sub1:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 15232, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1314, !noalias !60, addrspace 3)
def 41608r@BB.18 (LoopHdr BB.16, Depth 1): %10126.sub2_sub3:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 16320, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1315, !noalias !60, addrspace 3)
use 41848r@BB.18 (LoopHdr BB.16, Depth 1): %9856:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10126:vreg_128_align2, %10138:vreg_128_align2, %9856:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
%10136:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs)
def 41264r@BB.18 (LoopHdr BB.16, Depth 1): undef %10136.sub0_sub1:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 2176, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1294, !noalias !60, addrspace 3)
def 41272r@BB.18 (LoopHdr BB.16, Depth 1): %10136.sub2_sub3:vreg_128_align2 = DS_READ_B64_TR_B16 %478:vgpr_32, 3264, 0, implicit $exec :: (load (s64) from %ir.sunkaddr1295, !noalias !60, addrspace 3)
use 41788r@BB.18 (LoopHdr BB.16, Depth 1): %9858:vreg_512_align2 = contract V_MFMA_F32_32X32X16_BF16_mac_vgprcd_e64 %10136:vreg_128_align2, %10140:vreg_128_align2, %9858:vreg_512_align2(tied-def 0), 0, 0, 0, implicit $mode, implicit $exec
%10129:VReg_128_Align2, LiveMask 00000000000000FF (4 VGPRs)
...
Live registers with multiple definitions (133 VGPRs):
%9856:VReg_512_Align2, LiveMask 00000000FFFFFFFF (16 VGPRs)
def 16544r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def undef %9856.sub0_sub1:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
def 16592r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub2_sub3:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
def 16608r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub4_sub5:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
def 16656r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub6_sub7:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
def 16672r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub8_sub9:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
def 16720r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub10_sub11:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
def 16736r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub12_sub13:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
def 16784r@BB.8: INLINEASM &"v_pk_mul_f32 $0, $1, $2" [sideeffect] [isconvergent] [attdialect], $0:[regdef:VReg_64_Align2], def %9856.sub14_sub15:vreg_512_align2, $1:[reguse:VReg_64_Align2], %4069:vreg_64_align2, $2:[reguse:VReg_64_Align2], %10159:vreg_64_align2, !52
def use 41828r@BB.18 (LoopHdr BB.16, Depth 1): %9856:vreg_512_align2 = contract V_MFMA_F32_...
...
********** INTERVALS **********
...
********** MACHINEINSTRS **********
# Machine code for function _ZN7ck_tile6kentr...
```
|
|
|
|
This change updates architecture support checks for tcgen05 intrinsics
(except tcgen05.mma.*). The newer checks will support family-specific
architecture variants as well. After this change, the arch checks will
be accurate and match the PTX ISA.
Intrinsics affected:
- tcgen05.ld/st
- tcgen05.alloc/dealloc/relinquish
- tcgen05.cp
- tcgen05.fence/wait
- tcgen05.commit
- tcgen05.shift
|
|
Treat it the same way as ptrtoint.
|
|
- Use explicit types instead of auto when type is not obvious.
- Move local functions out of the anonymous namespace and make them static.
- Use structured bindings in some range for loops.
- Simplify PHI handling code a bit.
|
|
This optimization is performed as a separate pass over newly inserted PHI nodes to simplify and deduplicate them. By processing PHIs separately, we avoid the complexity of tracking reference bookkeeping needed to update BBValueInfo structures during insertion.
|
|
Fixes #122624.
Assisted-by: gpt-5-codex
|
|
Allows using And, Or and Xor instructions as the base for copyables.
|
|
To properly support the matrix intrinsics on, e.g., 32-bit platforms
(without the need to emit `libc` calls), the `LowerMatrixIntrinsics` pass
should generate code that performs strided index calculations using the
same pointer bit-width as the matrix pointers, as determined by the data
layout. This patch updates the `LowerMatrixIntrinsics` transform to make
this the case.
PR: https://github.com/llvm/llvm-project/pull/162646
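A minimal sketch of the intent (illustrative names, not the pass's code): the strided element offset is computed in an integer type as wide as a pointer, so a 32-bit target does the index arithmetic in 32 bits.
```cpp
#include <cstdint>

// Illustrative: address of element (Row, Col) of a column-major matrix,
// with the index math done in uintptr_t, i.e. the pointer's bit-width.
float *columnMajorElement(float *Base, uintptr_t Row, uintptr_t Col,
                          uintptr_t Stride) {
  uintptr_t LinearIdx = Col * Stride + Row; // strided index, pointer-width
  return Base + LinearIdx;
}
```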
|
|
Only `ninf` should be used.
|
|
Reverts llvm/llvm-project#161355
Looks like I've broken some intrinsic code generation.
|
|
This patch ports the ISD::SUB handling from SelectionDAG’s ComputeNumSignBits to GlobalISel.
Related to https://github.com/llvm/llvm-project/issues/150515.
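Roughly, the conservative fallback rule being ported looks like the following sketch (illustrative only, ignoring the special cases for constants):
```cpp
#include <algorithm>

// A subtraction can lose at most one known sign bit to the borrow, so the
// result keeps at least min(sign bits of LHS, sign bits of RHS) - 1 of
// them (and never fewer than 1).
unsigned numSignBitsOfSub(unsigned LHSSignBits, unsigned RHSSignBits) {
  return std::max(std::min(LHSSignBits, RHSSignBits) - 1, 1u);
}
```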
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Co-authored-by: Simon Pilgrim <llvm-dev@redking.me.uk>
|
|
We can perform CSE on recipes that do not directly map to Instruction
opcodes. One example is VPVectorPointerRecipe. Currently this is handled
by supporting them in ::canHandle, but that means getOpcodeOrIntrinsicID()
returns std::nullopt for them. This only works because
VPVectorPointerRecipe is the only case where we return std::nullopt and
still perform CSE. It breaks down once we support more such recipes, like
VPPredInstPHIRecipe
(https://github.com/llvm/llvm-project/pull/162110).
To fix this, return a custom opcode from getOpcodeOrIntrinsicID for
recipes like VPVectorPointerRecipe, using the VPDefID offset past all
regular instruction opcodes.
PR: https://github.com/llvm/llvm-project/pull/162267
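A hypothetical sketch of the keying scheme (names are illustrative, not the VPlan code):
```cpp
#include <optional>

// Recipes with a real LLVM instruction opcode use it directly; recipes
// without one get a synthetic opcode numbered past all regular opcodes,
// derived from their recipe (VPDef) ID, so distinct recipe kinds never
// collapse into a shared "no opcode" CSE bucket.
unsigned cseOpcodeKey(std::optional<unsigned> InstrOpcode, unsigned VPDefID,
                      unsigned FirstSyntheticOpcode) {
  return InstrOpcode ? *InstrOpcode : FirstSyntheticOpcode + VPDefID;
}
```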
|
|
|
|
Lower v4f32 and v2f64 fmuladd calls to relaxed_madd instructions.
If we have FP16, then lower v8f16 fmuladds to FMA.
I've introduced an ISD node for fmuladd to maintain the rounding
ambiguity through legalization / combine / isel.
|
|
For ptrtoint(inttoptr) and ptrtoaddr(inttoptr), handle the case where
the source and destination size do not match and convert to either zext
or trunc. We can't do this if the middle size is smaller than both
src/dest, because we'd have to perform an additional masking operation
in that case.
Most of these cases are handled by dint of ptrtoint/inttoptr size
canonicalization (so I added some unit tests instead). However, the
ptrtoaddr(inttoptr) case where the pointer size and address size differ
is relevant, as in that case the mismatch in integer sizes is canonical.
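The fold can be modelled with plain integer zext/trunc semantics; a small self-contained check of the three cases (the widths are just examples):
```cpp
#include <cassert>
#include <cstdint>

// Model: inttoptr/ptrtoint act as a zero-extend or truncate between bit
// widths. ptrtoint(inttoptr x) folds to a single zext/trunc from source
// to destination width unless the middle (pointer) width is smaller than
// both, where the masking of high bits would still be needed.
uint64_t zextOrTrunc(uint64_t V, unsigned ToBits) {
  return ToBits >= 64 ? V : (V & ((UINT64_C(1) << ToBits) - 1));
}

int main() {
  uint64_t X = 0x123456789abcdef0;
  // src 64 -> ptr 32 -> dst 16: middle >= dst, folds to a plain trunc.
  assert(zextOrTrunc(zextOrTrunc(X, 32), 16) == zextOrTrunc(X, 16));
  // src 16 -> ptr 32 -> dst 64: middle >= src, folds to a plain zext.
  assert(zextOrTrunc(zextOrTrunc(zextOrTrunc(X, 16), 32), 64) ==
         zextOrTrunc(X, 16));
  // src 64 -> ptr 32 -> dst 48: middle smaller than both, not a plain trunc.
  assert(zextOrTrunc(zextOrTrunc(X, 32), 48) != zextOrTrunc(X, 48));
  return 0;
}
```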
|
|
Add a `-print-debug-counter-queries` option which prints the current
value of the counter and whether it is executed/skipped each time it is
queried. This is useful when interleaving the output with the usual
transform debug output, in order to find the correct counter value to
use to hit a specific point in the transform.
|
|
Allow LLVMGetVolatile() to work with any kind of Instruction, rather
than only memory instructions that accept a volatile flag. For
instructions that can never be volatile, the function now returns false
instead of asserting. This matches the behavior of
`Instruction::isVolatile()` in the C++ API.
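A minimal C API sketch of the new behaviour (the module contents are arbitrary): querying an instruction that can never be volatile now reports false rather than tripping an assertion.
```cpp
#include <llvm-c/Core.h>
#include <cstdio>

int main() {
  LLVMContextRef Ctx = LLVMContextCreate();
  LLVMModuleRef M = LLVMModuleCreateWithNameInContext("m", Ctx);
  LLVMTypeRef I32 = LLVMInt32TypeInContext(Ctx);
  LLVMTypeRef Params[] = {I32, I32};
  LLVMTypeRef FnTy = LLVMFunctionType(I32, Params, 2, /*IsVarArg=*/0);
  LLVMValueRef Fn = LLVMAddFunction(M, "f", FnTy);
  LLVMBasicBlockRef BB = LLVMAppendBasicBlockInContext(Ctx, Fn, "entry");
  LLVMBuilderRef B = LLVMCreateBuilderInContext(Ctx);
  LLVMPositionBuilderAtEnd(B, BB);
  // An add can never be volatile; LLVMGetVolatile now returns 0 for it.
  LLVMValueRef Add =
      LLVMBuildAdd(B, LLVMGetParam(Fn, 0), LLVMGetParam(Fn, 1), "sum");
  LLVMBuildRet(B, Add);
  printf("volatile: %d\n", LLVMGetVolatile(Add));
  LLVMDisposeBuilder(B);
  LLVMDisposeModule(M);
  LLVMContextDispose(Ctx);
  return 0;
}
```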
|
|
Converting a vector float op into a vector int op may be non-profitable,
especially for targets where the float op for a given type is legal, but
the integer op is not.
We could of course also try to address this via a reverse transform in
the backend, but I don't think it's worth the bother, given that vectors
were never the intended use case for this transform in the first place.
Fixes https://github.com/llvm/llvm-project/issues/162749.
|
|
CalleeSavesAboveFrameRecord (#156452)
Previously, we would pop `FixedObject`-bytes after deallocating the SVE
area, then again as part of the "AfterCSRPopSize". This could be seen in
the tests `@f6` and `@f9`.
This patch removes the erroneous pop, and refactors
`CalleeSavesAboveFrameRecord` to reuse more of the existing GPR
deallocation logic, which allows for post-decrements.
|
|
|
|
Extend the existing "scalarize" function, which is used for the
fp-to-integer conversion instruction expansion, to BinaryOperator
instructions and reuse it for the frem expansion; a similar function
for scalarizing BinaryOperator instructions exists in the ExpandLargeDivRem
pass, and this change is a step towards merging that pass with ExpandFp.
Further refactoring: scalarize directly instead of using the
"ReplaceVector" as a worklist, rename the "Replace" vector to "Worklist",
and hoist a check for unsupported scalable vectors to the top of the
instruction visiting loop.
|
|
This gives MemoryMapper implementations a chance to allocate working
memory using the LinkGraph's allocator.
|
|
|
|
Add patterns which reduce `or` operations to register sequences when
combining i16 values to i32. This removes many intermediate VGPRs and
reduces register pressure.
|
|
[[fallthrough]] is now part of C++17, so we don't need to use
LLVM_FALLTHROUGH.
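For illustration, the standard attribute in a switch (equivalent to what the macro expanded to):
```cpp
// Count newlines but otherwise treat '\n' like any other whitespace.
int classifyChar(int C, int &LineCount) {
  switch (C) {
  case '\n':
    ++LineCount;
    [[fallthrough]]; // deliberate fall-through, no LLVM_FALLTHROUGH needed
  case ' ':
  case '\t':
    return 0; // whitespace
  default:
    return 1; // anything else
  }
}
```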
|
|
Closes #95219
|
|
|
|
Generally G_UADDE, G_UADDO, G_USUBE, G_USUBO are used together and it
was enough to simply define EFLAGS. But if extractvalue is used, we end
up with a copy of EFLAGS into a GPR.
Always generate a SETB instruction to put the carry bit in a GPR and a
CMP to set the carry bit back. This gives the correct lowering in all cases.
Closes #120029
|
|
Need to insert the vector value for the postponed gather/buildvector
node after all uses not only if the vector value of the user node is a
phi, but also if the user node itself is a PHI node, which may produce a
vector phi + shuffle.
Fixes #162799
|
|
Currently we cannot vectorize loops with latch blocks terminated by a
switch. In the future this could be handled by materializing appropriate
compares.
Fixes https://github.com/llvm/llvm-project/issues/156894.
|
|
Corrects the spelling of 'IsGlobaLinkage' to 'IsGlobalLinkage' in
XCOFF-related code, comments, and tests across the codebase.
|
|
Follow-up as suggested in
https://github.com/llvm/llvm-project/pull/162617.
Just use an APInt for DividesBy, as the existing code already operates
on APInt and thus handles the case of DividesBy being 1.
PR: https://github.com/llvm/llvm-project/pull/163017
|
|
If the non-commutative user has several identical operands and at least
one of them (but not the first) is copyable, we need to consider this
opportunity when calculating the number of dependencies. Otherwise, the
schedule bundle might not be scheduled correctly and cause a compiler
crash.
Fixes #162925
|
|
VPWidenCastRecipes with Trunc opcodes were missing the correct OpType
for IR flags. Update createWidenCast to set the correct flags for
truncs, and use it consistently.
Fixes https://github.com/llvm/llvm-project/issues/162374.
|
|
(#162826)
This patch uses the RI member variable directly in the member function
AArch64InstrInfo::copyPhysReg, instead of redundant calls to the public
API.
|
|
Fixes #162983
|
|
This patch adds support for fadd/fsub operations. I only tested this
patch with some special ranges because the exhaustive check is too
expensive.
|
|
Users should use fast-math flags instead.
|
|
Instead of re-setting the start value of the canonical IV when
vectorizing the epilogue, we can emit an Add VPInstruction to provide the
canonical IV value, adjusted by the resume value from the main loop.
This is in preparation to make the canonical IV a VPValue defined by
loop regions. It ensures that the canonical IV always starts at 0.
PR: https://github.com/llvm/llvm-project/pull/161589
|
|
Support `@llvm.spv.resource.nonuniformindex` in SPIRV Codegen.
- Add the `NonUniformEXT` decoration to the registers marked as
`nonuniformindex`, and recursively decorate their child registers (e.g.
Copy, AccessChain, Load) that access such an index.
- `OpCapability ShaderNonUniformEXT` is already added in the code.
- [SPV_EXT_descriptor_indexing](https://github.khronos.org/SPIRV-Registry/extensions/EXT/SPV_EXT_descriptor_indexing.html)
  is skipped because it's added to SPIRV Core in 1.5.
## Unit test
- The unit test checks that the register being used in the final
Store/Load/Write instruction is decorated, as required by the spec.
- The implementation follows [DXC](https://godbolt.org/z/zhqGThcaf) in
that it recursively decorates all the child elements until the end.
```hlsl
RWStructuredBuffer<uint4> StructuredOut[64];
RWBuffer<uint> UnStructuredOut[64];
[numthreads(64,1,1)]
void main(uint3 GTID: SV_GroupThreadID) {
StructuredOut[(NonUniformResourceIndex(GTID.x + 1))][98][0] = 99;
UnStructuredOut[(NonUniformResourceIndex(GTID.x))][96] = 95;
}
```
Resolves https://github.com/llvm/llvm-project/issues/160231,
https://github.com/llvm/llvm-project/issues/161852.
Verified
[offload-test-suite](https://github.com/llvm/offload-test-suite/blob/cfc37840c8ad0d9c08ee900ecbc0b02cc56478ae/test/Feature/ResourceArrays/unbounded-array-nuri.test)
started passing for clang.
|
|
Replication is currently not supported for scalable VFs. Make sure
VPReplicateRecipe::computeCost returns an invalid cost early, for
scalable VFs if the recipe is not a single-scalar.
Note that this moves the existing invalid-costs.ll out of the AArch64
subdirectory, as it does not use a target triple.
Fixes https://github.com/llvm/llvm-project/issues/160792.
|
|
This reduces RISCVGenDAGISel.inc by about 750 bytes.
|
|
Depends on:
* https://github.com/llvm/llvm-project/pull/162621
Now that we can start emitting `DW_AT_language_name`, make sure
`DwarfUnit::getSourceLanguage` is equipped to handle it. Otherwise the
new test case would assert.
|
|
patterns. NFC (#162790)
Tablegen was creating unnecessary patterns for the RV64 HwMode.
|
|
Source: Hacker's Delight, page 21.
Using the carry, we can use contractions to implement the ucmp.
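For reference, one borrow-based identity from that chapter (a sketch of the general technique, not necessarily the exact contraction selected here): the sign bit of `(~x & y) | ((~x | y) & (x - y))` is the borrow out of `x - y`, i.e. the unsigned `x < y`, and two such borrows give the three-way ucmp.
```cpp
#include <cassert>
#include <cstdint>

// Branch-free unsigned "less than": the MSB of the expression is the
// borrow produced by x - y.
bool ultViaBorrow(uint32_t X, uint32_t Y) {
  uint32_t Borrow = (~X & Y) | ((~X | Y) & (X - Y));
  return Borrow >> 31;
}

// Three-way unsigned compare, ucmp(x, y) = (x > y) - (x < y).
int ucmpViaBorrow(uint32_t X, uint32_t Y) {
  return (int)ultViaBorrow(Y, X) - (int)ultViaBorrow(X, Y);
}

int main() {
  assert(ultViaBorrow(0, 1) && !ultViaBorrow(1, 0) && !ultViaBorrow(5, 5));
  assert(ucmpViaBorrow(3, 7) == -1 && ucmpViaBorrow(7, 3) == 1 &&
         ucmpViaBorrow(4, 4) == 0);
  return 0;
}
```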
|
|
This patch adds support for fpext/fptrunc operations.
I noticed that finite-only semantics are not supported by the current
representation of constant FP ranges. It should be okay for now, as we
don't expose these types in the IR.
|
|
Merge the two checks using the existing API, NFC.
|
|
Based on top of #157211.
`FNEG` and `FABS` must preserve signalling NaNs, meaning they should not
convert to f32 to perform the operation. Instead legalize to `XOR` and
`AND`.
Fixes almost all of #104915
|