| author | Tamar Christina <tamar.christina@arm.com> | 2023-03-12 18:42:59 +0000 |
|---|---|---|
| committer | Tamar Christina <tamar.christina@arm.com> | 2023-03-12 18:42:59 +0000 |
| commit | f23dc726875c26f2c38dfded453aa9beba0b9be9 (patch) | |
| tree | a83c43779c456b5acf203efafeb639a443642a0b /libgcc | |
| parent | 81fd62d1378b7ddc1fa0967cbddcdcdcdd2d8d8c (diff) | |
AArch64: Update div-bitmask to implement new optab instead of target hook [PR108583]
This replaces the custom division hook with an implementation through
add_highpart. For NEON we implement the add highpart (addition plus extraction
of the upper half of each element, in the same precision) as ADD + LSR.
This representation allows us to easily optimize the sequence using existing
patterns, and it gets us a pretty decent sequence using SRA:
umull v1.8h, v0.8b, v3.8b
umull2 v0.8h, v0.16b, v3.16b
add v5.8h, v1.8h, v2.8h
add v4.8h, v0.8h, v2.8h
usra v1.8h, v5.8h, 8
usra v0.8h, v4.8h, 8
uzp2 v1.16b, v1.16b, v0.16b
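For reference, here is a minimal scalar sketch of the identity these vector
sequences implement; the 257 addend (assumed to be what the constant vectors
v2/v4 above and z3 below hold, per 16-bit lane) and the helper name are
illustrative only:

#include <assert.h>
#include <stdint.h>

/* Division by the bitmask constant 0xff:
   x / 0xff == (x + ((x + 257) >> 8)) >> 8 for all 16-bit x.
   The first add + shift is the "add highpart" (ADD + LSR), the second
   add is folded into USRA, and the final >> 8 becomes the high-half
   extraction (UZP2 for NEON, LSR for SVE).  */
static uint8_t
div_by_0xff (uint16_t x)
{
  uint32_t t = ((uint32_t) x + 257) >> 8;
  return (uint8_t) (((uint32_t) x + t) >> 8);
}

int
main (void)
{
  for (uint32_t x = 0; x <= 0xffff; x++)
    assert (div_by_0xff ((uint16_t) x) == x / 0xff);
  return 0;
}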
To get the optimal sequence, however, we match (a + ((b + c) >> n)), where n
is half the precision of the mode of the operation, into addhn + uaddw. This
is a generally good optimization on its own and gets us back to:
.L4:
ldr q0, [x3]
umull v1.8h, v0.8b, v5.8b
umull2 v0.8h, v0.16b, v5.16b
addhn v3.8b, v1.8h, v4.8h
addhn v2.8b, v0.8h, v4.8h
uaddw v1.8h, v1.8h, v3.8b
uaddw v0.8h, v0.8h, v2.8b
uzp2 v1.16b, v1.16b, v0.16b
str q1, [x3], 16
cmp x3, x4
bne .L4
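For illustration, roughly the same loop written with ACLE NEON intrinsics;
this is a hand-written sketch of the kind of code the vectorizer handles
(byte-wise scaling followed by division by 0xff), not the testcase from the
PR, and it relies on the assumed 257 addend noted above:

#include <arm_neon.h>
#include <stdint.h>

/* dst[i] = dst[i] * scale[i] / 0xff, 16 bytes per iteration, mirroring the
   umull/addhn/uaddw/uzp2 sequence above.  n is assumed to be a multiple
   of 16.  */
void
scale_div255 (uint8_t *dst, const uint8_t *scale, int n)
{
  const uint16x8_t c = vdupq_n_u16 (257);
  for (int i = 0; i < n; i += 16)
    {
      uint8x16_t x = vld1q_u8 (dst + i);
      uint8x16_t s = vld1q_u8 (scale + i);
      uint16x8_t lo = vmull_u8 (vget_low_u8 (x), vget_low_u8 (s)); /* umull  */
      uint16x8_t hi = vmull_high_u8 (x, s);                        /* umull2 */
      uint8x8_t tlo = vaddhn_u16 (lo, c);     /* addhn: (lo + c) >> 8        */
      uint8x8_t thi = vaddhn_u16 (hi, c);
      lo = vaddw_u8 (lo, tlo);                /* uaddw: lo += widen (tlo)    */
      hi = vaddw_u8 (hi, thi);
      /* uzp2 on the byte views takes the high byte of every 16-bit lane,
         i.e. the final >> 8 plus the narrowing.  */
      uint8x16_t r = vuzp2q_u8 (vreinterpretq_u8_u16 (lo),
                                vreinterpretq_u8_u16 (hi));
      vst1q_u8 (dst + i, r);
    }
}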
For SVE2 we optimize the initial sequence to the same ADD + LSR, which gets us:
.L3:
ld1b z0.h, p0/z, [x0, x3]
mul z0.h, p1/m, z0.h, z2.h
add z1.h, z0.h, z3.h
usra z0.h, z1.h, #8
lsr z0.h, z0.h, #8
st1b z0.h, p0, [x0, x3]
inch x3
whilelo p0.h, w3, w2
b.any .L3
.L1:
ret
and to get the optimal sequence I match (a + b) >> n (with the same constraint
on n) to addhnb, which gets us to:
.L3:
ld1b z0.h, p0/z, [x0, x3]
mul z0.h, p1/m, z0.h, z2.h
addhnb z1.b, z0.h, z3.h
addhnb z0.b, z0.h, z1.h
st1b z0.h, p0, [x0, x3]
inch x3
whilelo p0.h, w3, w2
b.any .L3
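The equivalent sketch with SVE2 ACLE intrinsics, again hand-written for
illustration and using the assumed 257 addend; the two svaddhnb calls map
directly onto the two addhnb instructions above:

#include <arm_sve.h>
#include <stdint.h>

/* dst[i] = dst[i] * scale / 0xff with scale <= 0xff, keeping the data in
   16-bit lanes as in the loop above.  Needs SVE2.  The second addhnb reuses
   the narrowed result as a 16-bit operand, just as the asm reuses z1.h.  */
void
scale_div255_sve2 (uint8_t *dst, uint8_t scale, uint64_t n)
{
  svuint16_t vscale = svdup_n_u16 (scale);
  svuint16_t c = svdup_n_u16 (257);
  for (uint64_t i = 0; i < n; i += svcnth ())
    {
      svbool_t pg = svwhilelt_b16_u64 (i, n);
      svuint16_t x = svld1ub_u16 (pg, dst + i);                   /* ld1b   */
      x = svmul_u16_x (pg, x, vscale);                            /* mul    */
      svuint8_t t = svaddhnb_u16 (x, c);                          /* addhnb */
      svuint8_t r = svaddhnb_u16 (x, svreinterpret_u16_u8 (t));   /* addhnb */
      svst1b_u16 (pg, dst + i, svreinterpret_u16_u8 (r));         /* st1b   */
    }
}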
There are multiple possible RTL representations for these optimizations. I did
not represent them using a zero_extend because we seem to be very inconsistent
about this in the backend, and since they are unspecs we won't match them from
vector ops anyway. I figured maintainers would prefer this, but my maintainer
ouija board is still out for repairs :)
There are no new tests, as correctness tests were added to the mid-end and
codegen tests for this already exist.
gcc/ChangeLog:
PR target/108583
* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
(*bitmask_shift_plus<mode>): New.
* config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
(@aarch64_bitmask_udiv<mode>3): Remove.
* config/aarch64/aarch64.cc
(aarch64_vectorize_can_special_div_by_constant,
TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
(TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
aarch64_vectorize_preferred_div_as_shifts_over_mult): New.