riscv-gnu-toolchain/gcc.git

author	Richard Sandiford <richard.sandiford@arm.com>	2025-03-14 10:28:01 +0000
committer	Richard Sandiford <richard.sandiford@arm.com>	2025-03-14 10:28:01 +0000
commit	d1c5edc94b6d07ec29a93572f3b5086e88bf3b0e (patch)
tree	09d9112f880cd076f59c9f5157c62208a54e2393 /libgcc/config/i386/32
parent	df0e6509bf74421ea68a2e025300bcd6ca63722f (diff)

vect: Fix aarch64/pr99873_2.c ld4/st4 failure

vect_slp_prefer_store_lanes_p allows an SLP tree to be split even
if the tree could use store-lanes, provided that one of the new
groups would operate on full vectors for each scalar iteration.
That heuristic is no longer firing for gcc.target/aarch64/pr99873_2.c.

The test contains:

  void __attribute ((noipa))
  foo (uint64_t *__restrict x, uint64_t *__restrict y, int n)
  {
    for (int i = 0; i < n; i += 4)
      {
        x[i] += y[i];
        x[i + 1] += y[i + 1];
        x[i + 2] |= y[i + 2];
        x[i + 3] |= y[i + 3];
      }
  }

and wants us to use V2DI for the first two elements and V2DI for
the second two elements, rather than LD4s and ST4s. This gives:

.L3:
        ldp     q31, q0, [x0]
        add     w3, w3, 1
        ldp     q29, q30, [x1], 32
        orr     v30.16b, v0.16b, v30.16b
        add     v31.2d, v29.2d, v31.2d
        stp     q31, q30, [x0], 32
        cmp     w2, w3
        bhi     .L3

instead of:

.L4:
        ld4     {v28.2d - v31.2d}, [x2]
        ld4     {v24.2d - v27.2d}, [x3], 64
        add     v24.2d, v28.2d, v24.2d
        add     v25.2d, v29.2d, v25.2d
        orr     v26.16b, v30.16b, v26.16b
        orr     v27.16b, v31.16b, v27.16b
        st4     {v24.2d - v27.2d}, [x2], 64
        cmp     x2, x5
        bne     .L4

The first loop only handles half the amount of data per iteration,
but it requires far fewer internal permutations.

One reason the heuristic no longer fired looks like a typo: the call
to vect_slp_prefer_store_lanes_p was passing "1" as the new group size,
instead of "i".

However, even with that fixed, vect_analyze_slp later falls back on
single-lane SLP with load/store lanes. I think that heuristic too
should use vect_slp_prefer_store_lanes_p (but it otherwise looks good).

The question is whether every load should pass
vect_slp_prefer_store_lanes_p or whether just one is enough.
I don't have an example that would make the call either way,
so I went for the latter, given that it's the smaller change
from the status quo.

This also appears to improve fotonik3d and roms from SPEC2017
(cross-checked against two different systems).

gcc/
	* tree-vect-slp.cc (vect_build_slp_instance): Pass the new group
	size (i) rather than 1 to vect_slp_prefer_store_lanes_p.
	(vect_analyze_slp): Only force the use of load-lanes and
	store-lanes if that is preferred for at least one load/store pair.
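
For readers who want to check what the quoted loop computes, regardless of whether the compiler picks the ldp/stp form or the ld4/st4 form, here is a minimal sketch that compares foo against a per-element scalar reference. The function foo is copied from the commit message. The helpers expected_lane and check_foo, the 64-element buffer limit and the input values are illustrative assumptions, not part of the GCC testsuite.

```c
#include <stdint.h>

/* Loop from the commit message, copied as-is.  noipa keeps GCC from
   inlining or specializing the function at its call site.  */
void __attribute ((noipa))
foo (uint64_t *__restrict x, uint64_t *__restrict y, int n)
{
  for (int i = 0; i < n; i += 4)
    {
      x[i] += y[i];
      x[i + 1] += y[i + 1];
      x[i + 2] |= y[i + 2];
      x[i + 3] |= y[i + 3];
    }
}

/* Hypothetical helper: the value element i should hold afterwards.
   Elements 0 and 1 of each group of four are added, 2 and 3 are ORed.  */
static uint64_t
expected_lane (uint64_t xv, uint64_t yv, int i)
{
  return (i % 4) < 2 ? xv + yv : (xv | yv);
}

/* Hypothetical check: returns 1 if foo matches the scalar reference
   for n elements, 0 otherwise.  Assumes n is a multiple of 4 and at
   most 64; any other n is rejected with 0.  */
int
check_foo (int n)
{
  uint64_t x[64], y[64], ref[64];

  if (n < 0 || n > 64 || n % 4 != 0)
    return 0;

  for (int i = 0; i < n; i++)
    {
      x[i] = 3 * (uint64_t) i + 1;   /* arbitrary test inputs */
      y[i] = 5 * (uint64_t) i + 2;
      ref[i] = expected_lane (x[i], y[i], i);
    }

  foo (x, y, n);

  for (int i = 0; i < n; i++)
    if (x[i] != ref[i])
      return 0;
  return 1;
}
```

Calling check_foo (16) from a small main and compiling once with and once without vectorization is one way to confirm that both code shapes produce the same results. The sketch checks values only and says nothing about which instructions are emitted.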

Diffstat (limited to 'libgcc/config/i386/32')

0 files changed, 0 insertions, 0 deletions

