author | Richard Biener <rguenther@suse.de> | 2025-03-17 15:04:28 +0100 |
---|---|---|
committer | Richard Biener <rguenth@gcc.gnu.org> | 2025-05-06 13:36:17 +0200 |
commit | 76c33109074b8e7cf6c326116b46792070122c7b (patch) | |
tree | 340a1fe0ce062b8de85a69828462102dd0b20593 /libstdc++-v3/include/std/numeric | |
parent | 81475602c3dd57ff6987e5f902814e8e3a0a0dde (diff) | |
tree-optimization/115777 - STLF fails with BB vectorization of loop
The following tries to address BB vectorization of a loop body that
swaps consecutive elements of an array, as in bubble sort. The
vector store in the previous iteration then fails to forward
(store-to-load forwarding, STLF) to the vector load in the current
iteration because the two accesses only partially overlap.
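To make the access pattern concrete, here is a minimal sketch of such a loop body. It is a hypothetical illustration, not the testcase from the PR; the name swap_pass and the use of int elements are assumptions.

```c
#include <stddef.h>

/* Hypothetical illustration, not the testcase from the PR: one pass of
   a bubble-sort style loop.  Each iteration reads a[i] and a[i+1] and
   writes both back, so iteration i+1 reads a[i+1], which iteration i
   has just written.  */
void
swap_pass (int *a, size_t n)
{
  for (size_t i = 0; i + 1 < n; i++)
    {
      int lo = a[i], hi = a[i + 1];
      if (lo > hi)
        {
          int tmp = lo;
          lo = hi;
          hi = tmp;
        }
      /* If these two stores are combined into one two-element vector
         store, a two-element vector load of a[i+1..i+2] in the next
         iteration overlaps only the upper half of that store.  */
      a[i] = lo;
      a[i + 1] = hi;
    }
}
```

With BB vectorization the two loads and the two stores can each become a single vector access, which is what creates the partial overlap between consecutive iterations.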
We try to detect this situation by looking for a load-to-store
data dependence and analyzing it with respect to the containing loop
for a provably problematic access. Currently the search for a
problematic pair is limited to loads and stores in the same SLP
instance, which means the problematic load happens in the next
loop iteration; larger dependence distances are not considered.
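The kind of check described above can be sketched as a byte-range test. This is a simplified, hypothetical helper and not the code added to tree-vect-data-refs.cc; the name stlf_likely_fails, its parameters, and the rule that a load fully contained in the store can be forwarded are all assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical helper, not GCC code.  All quantities are byte offsets
   or sizes relative to the same base address.  STEP is how far the
   accesses advance per loop iteration.  Returns true when the load of
   the next iteration partially overlaps the store of this iteration.
   Assumption: a load that lies wholly inside the store is treated as
   forwardable, which real hardware may or may not honor.  */
bool
stlf_likely_fails (long store_off, long store_size,
                   long load_off, long load_size, long step)
{
  long load_start = load_off + step;  /* Load in the next iteration.  */
  long load_end = load_start + load_size;
  long store_end = store_off + store_size;

  bool overlaps = load_start < store_end && store_off < load_end;
  bool contained = load_start >= store_off && load_end <= store_end;
  return overlaps && !contained;
}
```

For a two-int vector store and a two-int vector load at the same offset with a step of one int, `stlf_likely_fails (0, 8, 0, 8, 4)` is true; with a step of a full vector, `stlf_likely_fails (0, 8, 0, 8, 8)` is false because the accesses no longer overlap.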
On x86 with generic costing this avoids vectorizing the loop body.
With core-specific tuning the cost saved by the vector store over
the scalar stores still makes vectorization profitable, but at
least the STLF issue is avoided.
For example, on my Zen4 machine with -O2 -march=znver4 the testcase in
the PR improves from
insertion_sort => 2327
to
insertion_sort => 997
but plain -O2 (or -fno-tree-slp-vectorize) gives
insertion_sort => 183
In the end a better target-side cost model for small vector
vectorization is needed to reject this vectorization from this side.
I'll note this is a machine-independent heuristic (similar to the
avoid-store-forwarding RTL optimization pass); I expect that uarchs
implementing vectors will suffer from this kind of issue. I know
some aarch64 uarchs can forward from upper/lower part stores, but
this isn't considered at the moment. The actual vector size/overlap
distance check could be moved to a target hook if that turns out to
be necessary.
It might be possible to use a smaller vector size for the loads to
avoid the penalty rather than falling back to elementwise accesses;
that is not implemented either.
PR tree-optimization/115777
* tree-vectorizer.h (_slp_tree::avoid_stlf_fail): New member.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize it.
(vect_print_slp_tree): Dump it.
* tree-vect-data-refs.cc (vect_slp_analyze_instance_dependence):
For dataflow dependent loads of a store check whether there's
a cross-iteration data dependence that for sure prohibits
store-to-load forwarding and mark involved loads.
* tree-vect-stmts.cc (get_group_load_store_type): For avoid_stlf_fail
marked loads use VMAT_ELEMENTWISE.
* gcc.dg/vect/bb-slp-pr115777.c: New testcase.
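As a rough source-level picture of what using elementwise loads while keeping the wide store means for a swap loop, consider the sketch below. It is hypothetical and does not show the vectorizer's actual output; the name swap_pass_elementwise and the use of memcpy to stand in for a two-element vector store are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch, not vectorizer output: the loads stay scalar
   (elementwise) while the two results are written with one
   two-element store, modeled here with memcpy.  */
void
swap_pass_elementwise (int *a, size_t n)
{
  for (size_t i = 0; i + 1 < n; i++)
    {
      /* Each scalar load either lies wholly inside the previous
         iteration's two-element store or does not touch it, so no
         load straddles the edge of that store.  */
      int x = a[i], y = a[i + 1];
      int pair[2];
      pair[0] = x < y ? x : y;
      pair[1] = x < y ? y : x;
      /* One store of both elements, standing in for the vector store.  */
      memcpy (&a[i], pair, sizeof pair);
    }
}
```

Whether a scalar load is actually forwarded from part of a wider store depends on the microarchitecture, so this only illustrates the shape of the accesses.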