riscv-gnu-toolchain/llvm.git

author	Luke Lau <luke@igalia.com>	2024-08-29 14:32:41 +0800
committer	GitHub <noreply@github.com>	2024-08-29 14:32:41 +0800
commit	3b64ede096ce0a0230c4d3f77782e6fa18f2943a (patch)
tree	914137e8060ccdc0cab649295ca2d7469938d926 /clang/lib/CodeGen/CodeGenAction.cpp
parent	2adc94cd6c3dd1fc713a6ba8301fc04f21908700 (diff)

[RISCV] Decompose LMUL > 1 reverses into LMUL * M1 vrgather.vv (#104574)

As far as I'm aware, vrgather.vv is quadratic in LMUL on most microarchitectures today due to each output register needing to read from each input register in the group.

For example, the reciprocal throughput for vrgather.vv on the spacemit-x60 is listed on https://camel-cdr.github.io/rvv-bench-results/bpi_f3 as:

    LMUL1   LMUL2   LMUL4   LMUL8
    4.0     16.0    64.0    256.1

Vector reverses are commonly emitted by the loop vectorizer and are lowered as vrgather.vvs, but since the loop vectorizer uses LMUL 2 by default they end up being quadratic.

The output registers in a reverse only need to read from one input register though, so we can decompose this into LMUL * M1 vrgather.vvs to get linear performance.

This gives a 0.43% runtime improvement on 526.blender_r at rva22u64_v O3 on the Banana Pi F3.
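The decomposition described above can be illustrated with a small Python model. This is only a sketch and not the code from the patch: it assumes the vector fills the whole register group exactly (VL equals VLMAX), models each register as a plain list, and uses hypothetical helper names (`vrgather`, `reverse_whole_group`, `reverse_per_register`, `register_reads`). The read-count function is a toy cost model of the "each output register reads each input register" argument, not a measurement.

```python
def vrgather(src, idx):
    """Toy model of vrgather.vv: out[i] = src[idx[i]], or 0 if idx[i] is out of range."""
    return [src[i] if 0 <= i < len(src) else 0 for i in idx]


def reverse_whole_group(group):
    """Reverse using one gather across the whole LMUL-register group."""
    m1 = len(group[0])                       # elements per single (M1) register
    flat = [x for reg in group for x in reg]
    n = len(flat)
    out = vrgather(flat, [n - 1 - i for i in range(n)])
    # Split the result back into LMUL registers.
    return [out[j * m1:(j + 1) * m1] for j in range(len(group))]


def reverse_per_register(group):
    """Reverse using LMUL separate M1 gathers, one per output register."""
    lmul = len(group)
    m1 = len(group[0])
    idx = [m1 - 1 - k for k in range(m1)]    # reversal indices within one register
    # Output register j only needs input register (lmul - 1 - j).
    return [vrgather(group[lmul - 1 - j], idx) for j in range(lmul)]


def register_reads(lmul, decomposed):
    """Toy cost model: count input-register reads across all output registers."""
    return lmul if decomposed else lmul * lmul


if __name__ == "__main__":
    group = [[0, 1, 2, 3], [4, 5, 6, 7]]     # LMUL = 2, four elements per register
    print(reverse_whole_group(group))
    print(reverse_per_register(group))
    for lmul in (1, 2, 4, 8):
        print(lmul, register_reads(lmul, False), register_reads(lmul, True))
```

Both functions produce the same reversed group, but the per-register version touches each input register once, so its modelled cost grows linearly with LMUL instead of quadratically.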

Diffstat (limited to 'clang/lib/CodeGen/CodeGenAction.cpp')

0 files changed, 0 insertions, 0 deletions

