aarch64: Avoid unnecessary use of 2-input TBLs [PR115258]

When using TBL for (say) a V4SI permutation, the aarch64 port first
asks target-independent code to lower to a V16QI permutation.
Then, during code generation, an input like:

  (reg:V4SI R)

gets converted to:

  (subreg:V16QI (reg:V4SI R) 0)

aarch64_vectorize_vec_perm_const had:

  d.op0 = op0 ? force_reg (op_mode, op0) : NULL_RTX;
  if (op0 == op1)
    d.op1 = d.op0;
  else
    d.op1 = op1 ? force_reg (op_mode, op1) : NULL_RTX;

But subregs (unlike regs) are not shared, so the op0 == op1 check
always failed for this case.  We'd then force each subreg into
a fresh register, meaning that during the later:

  aarch64_expand_vec_perm_1 (d->target, d->op0, d->op1, sel);

there is no way for aarch64_expand_vec_perm_1 to realise that
d->op0 and d->op1 are the same value.  It would therefore generate
a two-input TBL in the testcase, even though a single-input TBL
is enough.

I'm not sure forcing subregs to a fresh register is a good idea --
it caused problems for copysign & co. -- but that's not something
to fiddle with during stage 4.  Using op0 == op1 for rtx equality
is independently wrong, so we might as well just fix that for now.

The patch gets rid of extra MOVs that are a regression from GCC 14.
The testcase is based on one from Kugan, itself based on TSVC.

gcc/
	PR target/115258
	* config/aarch64/aarch64.cc (aarch64_vectorize_vec_perm_const):
	Use d.one_vector_p to decide whether op1 should be a copy of op0.

gcc/testsuite/
	PR target/115258
	* gcc.target/aarch64/pr115258_2.c: New test.

Co-authored-by: Kugan Vivekanandarajah <kvivekananda@nvidia.com>
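
The sharing problem described in the message can be illustrated outside GCC. The following is a minimal sketch in plain C, not GCC code: `struct node`, `gen_reg`, `gen_subreg`, `node_equal` and `compare_two_subregs` are hypothetical names invented for the illustration. It assumes only what the message states: register objects are shared, subreg wrappers are allocated afresh each time. A pointer comparison therefore fails for two subregs of the same register, while a structural comparison (the role `rtx_equal_p` plays in GCC) succeeds.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for an RTL expression; this is NOT GCC's rtx type.  */
enum code { REG, SUBREG };

struct node
{
  enum code code;
  int regno;           /* Meaningful when code == REG.  */
  struct node *inner;  /* Meaningful when code == SUBREG.  */
  int byte;            /* Meaningful when code == SUBREG.  */
};

/* Registers are shared: one object per register number.  */
static struct node reg_table[8];

static struct node *
gen_reg (int regno)
{
  reg_table[regno].code = REG;
  reg_table[regno].regno = regno;
  return &reg_table[regno];
}

/* Subregs are not shared: every call allocates a fresh object.  */
static struct node *
gen_subreg (struct node *inner, int byte)
{
  struct node *x = malloc (sizeof *x);
  if (!x)
    abort ();
  x->code = SUBREG;
  x->regno = -1;
  x->inner = inner;
  x->byte = byte;
  return x;
}

/* Structural comparison: same shape and same leaves.  */
static bool
node_equal (const struct node *a, const struct node *b)
{
  if (a == b)
    return true;
  if (!a || !b || a->code != b->code)
    return false;
  if (a->code == REG)
    return a->regno == b->regno;
  return a->byte == b->byte && node_equal (a->inner, b->inner);
}

/* Wrap the same register twice and compare the wrappers both ways.  */
static void
compare_two_subregs (bool *pointer_equal, bool *structurally_equal)
{
  struct node *r = gen_reg (3);
  struct node *s0 = gen_subreg (r, 0);
  struct node *s1 = gen_subreg (r, 0);
  *pointer_equal = (s0 == s1);
  *structurally_equal = node_equal (s0, s1);
  free (s0);
  free (s1);
}
```

Calling `compare_two_subregs` from a small driver reports that the two wrappers differ as pointers but match structurally. The patch itself sidesteps the comparison altogether by testing `d.one_vector_p`.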
author: Richard Sandiford <richard.sandiford@arm.com> 2025-03-10 20:29:52 +0000
committer: Richard Sandiford <richard.sandiford@arm.com> 2025-03-10 20:29:52 +0000
commit: 31dcf941ac78c4b1b01dc4b2ce9809f0209153b8 (patch)
tree: 2900cfa91025367e053907f51ed075db16e3edb5
parent: e355fe414aa3aaf215c7dd9dd789ce217a1b458c (diff)
2 files changed, 19 insertions, 2 deletions
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 9bea8ce..36b65df 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -26851,8 +26851,8 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
   d.op_vec_flags = aarch64_classify_vector_mode (d.op_mode);
   d.target = target;
   d.op0 = op0 ? force_reg (op_mode, op0) : NULL_RTX;
-  if (op0 == op1)
-    d.op1 = d.op0;
+  if (op0 && d.one_vector_p)
+    d.op1 = copy_rtx (d.op0);
   else
     d.op1 = op1 ? force_reg (op_mode, op1) : NULL_RTX;
   d.testing_p = !target;
diff --git a/gcc/testsuite/gcc.target/aarch64/pr115258_2.c b/gcc/testsuite/gcc.target/aarch64/pr115258_2.c
new file mode 100644
index 0000000..065e1bb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr115258_2.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-v2" } */
+
+extern __attribute__((aligned(64))) float a[32000], b[32000];
+int dummy(float[32000], float[32000], float);
+
+void s1112() {
+
+  for (int nl = 0; nl < 100000 * 3; nl++) {
+    for (int i = 32000 - 1; i >= 0; i--) {
+      a[i] = b[i] + (float)1.;
+    }
+    dummy(a, b, 0.);
+  }
+}
+
+/* { dg-final { scan-assembler-not {\tmov\tv[0-9]+\.16b,} } } */
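
To see why a single-input TBL is enough when both operands hold the same value, here is a rough scalar model of the byte lookup, written as a sketch rather than as anything taken from the port. `tbl1`, `tbl2`, `lower_sel` and `one_input_matches_two_input` are hypothetical helper names. The V4SI element reversal is chosen purely for illustration, and the byte numbering assumes a little-endian element layout. The model follows the architectural TBL rule that an out-of-range index yields zero.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a one-register TBL on 16-byte vectors:
   an index outside 0..15 produces a zero byte.  */
static void
tbl1 (uint8_t out[16], const uint8_t table[16], const uint8_t sel[16])
{
  for (int i = 0; i < 16; i++)
    out[i] = sel[i] < 16 ? table[sel[i]] : 0;
}

/* Scalar model of a two-register TBL: indices 0..15 read the first
   table, 16..31 read the second, anything else produces zero.  */
static void
tbl2 (uint8_t out[16], const uint8_t t0[16], const uint8_t t1[16],
      const uint8_t sel[16])
{
  for (int i = 0; i < 16; i++)
    {
      if (sel[i] < 16)
        out[i] = t0[sel[i]];
      else if (sel[i] < 32)
        out[i] = t1[sel[i] - 16];
      else
        out[i] = 0;
    }
}

/* Lower a 4-element (32-bit lane) selector to a 16-byte selector:
   element E becomes bytes 4*E .. 4*E+3.  */
static void
lower_sel (uint8_t out[16], const uint8_t sel[4])
{
  for (int i = 0; i < 4; i++)
    for (int b = 0; b < 4; b++)
      out[4 * i + b] = (uint8_t) (4 * sel[i] + b);
}

/* Return 1 if the one-input and two-input lookups agree when the
   second input is merely a copy of the first.  */
static int
one_input_matches_two_input (void)
{
  static const uint8_t reverse[4] = { 3, 2, 1, 0 };
  uint8_t r[16], copy[16], sel[16], a[16], b[16];

  for (int i = 0; i < 16; i++)
    r[i] = (uint8_t) (0xA0 + i);
  memcpy (copy, r, sizeof copy);  /* Plays the role of the extra MOV.  */

  lower_sel (sel, reverse);
  tbl1 (a, r, sel);
  tbl2 (b, r, copy, sel);
  return memcmp (a, b, sizeof a) == 0;
}
```

Because a one-vector selector never indexes past byte 15, the second table is never read, so the copy feeding it is wasted work. That copy corresponds to the kind of MOV the new test scans for.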