author | Luke Lau <luke@igalia.com> | 2025-09-15 12:14:04 +0800 |
---|---|---|
committer | GitHub <noreply@github.com> | 2025-09-15 04:14:04 +0000 |
commit | 65ad21d730d25789454d18e811f8ff5db79cb5d4 (patch) | |
tree | 24e87afdce8784ba28574ce05edb5810e12a0795 /clang/lib/AST/ByteCode/Compiler.cpp | |
parent | c44e015b49e832c9f3749c33cf5c9d5aacaf60a4 (diff) | |
[RISCV] Handle recurrences in RISCVVLOptimizer (#151285)
After #144666 we now support vectorizing loops with induction variables
under EVL tail folding. To avoid VL toggles, the induction updates don't
use VP intrinsics and instead rely on RISCVVLOptimizer. However,
RISCVVLOptimizer can't reason about cycles or recurrences today, which
means we are left with a VL toggle to VLMAX:
```asm
# %bb.1:                                # %for.body.preheader
        li      a2, 0
        vsetvli a3, zero, e32, m2, ta, ma
        vid.v   v8
.LBB0_2:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        sub     a3, a1, a2
        sh2add  a4, a2, a0
        vsetvli a3, a3, e32, m2, ta, ma
        vle32.v v10, (a4)
        add     a2, a2, a3
        vadd.vv v10, v10, v8
        vse32.v v10, (a4)
        vsetvli a4, zero, e32, m2, ta, ma
        vadd.vx v8, v8, a3
        bne     a2, a1, .LBB0_2
```
This patch teaches RISCVVLOptimizer to reason about recurrences so we
can remove the VLMAX toggle:
```asm
# %bb.1:                                # %for.body.preheader
        li      a2, 0
        vsetvli a3, zero, e32, m2, ta, ma
        vid.v   v8
.LBB0_2:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        sub     a3, a1, a2
        sh2add  a4, a2, a0
        vsetvli a3, a3, e32, m2, ta, ma
        vle32.v v10, (a4)
        add     a2, a2, a3
        vadd.vv v10, v10, v8
        vse32.v v10, (a4)
        vadd.vx v8, v8, a3
        bne     a2, a1, .LBB0_2
```
With this we remove a significant number of VL toggles and vsetvli
instructions across llvm-test-suite and SPEC CPU 2017 with tail folding
enabled, since the optimization applies to every loop with an induction
variable.
This builds upon the work in #124530, where we started computing the VL
that each instruction demands, and generalizes it into an optimistic
sparse dataflow analysis:
- We begin by optimistically assuming no VL is used by any instruction,
and push instructions onto the worklist starting from the bottom.
- For each instruction on the worklist we apply the transfer function,
which propagates the VL needed by that instruction upwards to the
instructions it uses. If a use's demanded VL changes, it's added to the
worklist.
- Eventually this converges to a fixpoint once all uses have been
processed and every demanded VL has been propagated throughout the
entire use-def chain. Only after this is the DemandedVL map accurate
(see the sketch after this list).
Some implementation details:
- The roots are stores (or other unsupported instructions not in
`isSupportedInstr`) or copies to physical registers (they fail the
`any_of(MI.defs(), isPhysical)` check)
- This patch untangles `getMinimumVLForUser` and `checkUsers`.
`getMinimumVLForUser` now returns how many lanes of an operand are read
by an instruction, whilst `checkUsers` checks that an instruction and
its users have compatible EEW/EMULs.
- The `DemandedVL` struct was added so that the values in `DenseMap<const
MachineInstr *, DemandedVL> DemandedVLs` default-construct to 0, which
means we don't need to check whether a key exists when looking things
up (see the sketch below).
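For illustration, a hypothetical sketch of that default-value trick,
with `std::unordered_map` standing in for `llvm::DenseMap` (both
value-initialize a missing key on `operator[]`):

```cpp
#include <cassert>
#include <unordered_map>

struct DemandedVL {
  unsigned VL = 0; // default-constructs to a demand of 0 lanes
};

int main() {
  std::unordered_map<const void *, DemandedVL> DemandedVLs;
  int UnvisitedInst = 0; // stand-in for an instruction not yet seen
  // operator[] value-initializes the missing entry, so the lookup
  // safely yields a demand of 0 instead of requiring a prior
  // find()/contains() check.
  assert(DemandedVLs[&UnvisitedInst].VL == 0);
}
```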
There was no measurable compile time impact on llvm-test-suite or SPEC
CPU 2017. The analysis always terminates, since demanded VLs only ever
grow towards a fixpoint; there are more details in this EuroLLVM talk:
https://www.youtube.com/watch?v=Mfb5fRSdJAc
Fixes #149354