|
As we can't cope with removed SLP instances during analysis there's
no point in removing them or even continuing analysis of SLP instances
after a failure. The following makes us abort early.
* tree-vect-slp.cc (vect_slp_analyze_operations): When
doing loop analysis fail after the first failed SLP
instance. Only remove instances when doing BB vectorization.
* tree-vect-loop.cc (vect_analyze_loop_2): Check whether
vect_slp_analyze_operations failed instead of checking
the number of SLP instances remaining.
|
|
When we decide to not process an association chain of size two and
that would also mismatch with a different chain size on another lane
we shouldn't fail discovery hard at this point. Instead let the
regular discovery figure out matching lanes so the parent can
decide to perform operand swapping or we can split groups at better
points rather than forcefully splitting away the first single lane.
For example on gcc.dg/vect/vect-strided-u8-i8.c we now see two
groups of size 4 feeding the store instead of groups of sizes one,
three, two, one and one.
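As an illustration (a made-up example, not from the patch), lanes
with different association chain lengths can look like:
void foo (int * __restrict a, int *b, int *c, int *d)
{
  for (int i = 0; i < 1024; i += 2)
    {
      a[i] = b[i] + c[i];                        /* chain of length two */
      a[i + 1] = b[i + 1] + c[i + 1] + d[i + 1]; /* chain of length three */
    }
}
Instead of failing discovery outright the mismatch is now recorded
in matches[] so the parent can split the group at a better point or
swap operands.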
* tree-vect-slp.cc (vect_build_slp_tree_2): On reassociation
chain length mismatch do not fail discovery of the node
but try without re-associating to compute a better matches[].
Provide a reassociation failure hint in the dump.
(vect_slp_analyze_node_operations): Avoid stray failure
dumping.
(vectorizable_slp_permutation_1): Dump the address of the
SLP node representing the permutation.
|
|
Permute nodes do not have a representative so we have to guard
vect_is_slp_load_node against those.
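A minimal sketch of the guarded helper (assuming the usual SLP
accessors; the exact shape in tree-vect-slp.cc may differ):
static bool
vect_is_slp_load_node (slp_tree root)
{
  /* Permute nodes lack a representative, so reject them first.  */
  return (SLP_TREE_CODE (root) != VEC_PERM_EXPR
	  && SLP_TREE_DEF_TYPE (root) == vect_internal_def
	  && STMT_VINFO_GROUPED_ACCESS (SLP_TREE_REPRESENTATIVE (root))
	  && DR_IS_READ (STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (root))));
}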
PR tree-optimization/116658
* tree-vect-slp.cc (vect_is_slp_load_node): Make sure
node isn't a permute.
* g++.dg/vect/pr116658.cc: New testcase.
|
|
When doing SLP discovery I forgot to handle double reductions even
though they are already queued in LOOP_VINFO_REDUCTIONS.
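For reference, a classic double reduction looks like the following
(hypothetical example): the outer-loop PHI of sum participates in the
inner loop's reduction cycle.
int foo (int a[64][64])
{
  int sum = 0;
  for (int i = 0; i < 64; ++i)      /* outer loop */
    for (int j = 0; j < 64; ++j)    /* inner reduction feeding the outer PHI */
      sum += a[i][j];
  return sum;
}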
* tree-vect-slp.cc (vect_analyze_slp): Also handle discovery
for double reductions.
|
|
The following enables single-lane loop SLP discovery for non-grouped stores
and adjusts vectorizable_store to properly handle those.
For gfortran.dg/vect/vect-8.f90 we vectorize one additional loop,
not running into the "not falling back to strided accesses" bail-out.
I have not investigated in detail.
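A minimal example of a non-grouped store now discovered as
single-lane SLP (hypothetical, not from the testsuite):
void foo (int * __restrict a, int *b, int n)
{
  /* A single non-grouped store per iteration; previously this was
     only handled by non-SLP loop vectorization.  */
  for (int i = 0; i < n; ++i)
    a[i] = b[i] + 1;
}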
There is a set of i386 target assembler test FAILs;
gcc.target/i386/pr88531-2[bc].c in particular fail because the
target cannot identify SLP-emulated gathers, see another mail from me.
Others need adjustment; I've adjusted only one with this patch.
In particular there are gcc.target/i386/cond_op_fma_*-1.c FAILs
which happen because we no longer fold a VEC_COND_EXPR during the
region value-numbering we do after vectorization, since we now
code-generate a { 0.0, ... } constant in the VEC_COND_EXPR
instead of having a separate statement which gets forwarded
and then triggers folding. This leads to slightly different
code generation. The solution is probably to use gimple_build
when building stmts or, in this case, directly emit .COND_FMA
instead of .FMA and a VEC_COND_EXPR.
gcc.dg/vect/slp-19a.c mixes contiguous 8-lane SLP with a single
lane contiguous store from one lane of the 8-lane load and we
expect to use load-lanes for this reason but the heuristic for
forcing single-lane rediscovery as implemented doesn't trigger
here as it treats both SLP instances separately. This FAILs on RISC-V.
gcc.dg/vect/slp-19c.c shows we fail to implement an interleaving
scheme for group_size 12 (by extension, using the group_size 3
scheme to reduce to 4 lanes and then continuing with a pow2 scheme
would work); we are also not considering load-lanes because of
the above reason, but aarch64 cannot do ld12. This FAILs on aarch64
(the load requires three vectors) and on x86_64.
gcc.dg/vect/slp-19c.c FAILs with variable-length vectors because
of "SLP induction not supported for variable-length vectors".
gcc.target/aarch64/pr110449.c will FAIL because the (contested)
optimization in r14-2367-g224fd59b2dc8a5 was only applied to
loop-vect but not SLP vect. I'll leave it to target maintainers
to either XFAIL (the optimization is bad) or remove the test.
* tree-vect-slp.cc (vect_analyze_slp): Perform single-lane
loop SLP discovery for non-grouped stores. Move check on the root
for re-doing SLP analysis with a single lane for load/store-lanes
earlier and make sure we are dealing with a grouped access.
* tree-vect-stmts.cc (vectorizable_store): Always set
vec_num for SLP.
* gcc.dg/vect/O3-pr39675-2.c: Adjust expected number of SLP instances.
* gcc.dg/vect/fast-math-vect-call-1.c: Likewise.
* gcc.dg/vect/no-scevccp-slp-31.c: Likewise.
* gcc.dg/vect/slp-12b.c: Likewise.
* gcc.dg/vect/slp-12c.c: Likewise.
* gcc.dg/vect/slp-19a.c: Likewise.
* gcc.dg/vect/slp-19b.c: Likewise.
* gcc.dg/vect/slp-4-big-array.c: Likewise.
* gcc.dg/vect/slp-4.c: Likewise.
* gcc.dg/vect/slp-5.c: Likewise.
* gcc.dg/vect/slp-7.c: Likewise.
* gcc.dg/vect/slp-perm-7.c: Likewise.
* gcc.dg/vect/slp-37.c: Likewise.
* gcc.dg/vect/fast-math-vect-call-2.c: Likewise.
* gcc.dg/vect/slp-26.c: RISC-V can now SLP two instances.
* gcc.dg/vect/vect-outer-slp-3.c: Disable vectorization of
initialization loop.
* gcc.dg/vect/slp-reduc-5.c: Likewise.
* gcc.dg/vect/no-scevccp-outer-12.c: Un-XFAIL. SLP can handle
inner loop inductions with multiple vector stmt copies.
* gfortran.dg/vect/vect-8.f90: Adjust expected number of
vectorized loops.
* gcc.target/i386/vectorize1.c: Adjust what we scan for.
|
|
The following adds SLP discovery for roots that are only live but
otherwise unused. These are usually inductions. This allows a
few more testcases to be handled fully with SLP, for example
gcc.dg/vect/no-scevccp-pr86725-1.c.
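A hypothetical sketch of such a root: the induction j has no use
inside the loop besides the induction cycle itself but its final
value is live after the loop.
int foo (int *a, int n)
{
  int j = 0;
  for (int i = 0; i < n; ++i)
    {
      a[i] = 2 * a[i];
      j += 3;      /* only live after the loop, otherwise unused */
    }
  return j;
}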
* tree-vect-slp.cc (vect_analyze_slp): Analyze SLP for live
but otherwise unused defs.
|
|
This makes sure to produce interleaving schemes or load-lanes
for single-element interleaving and other permutes that otherwise
would use more than three vectors.
It exposes the latent issue that single-element interleaving with
large gaps can be inefficient - the mitigation in get_group_load_store_type
doesn't trigger when we clear the load permutation.
It also exposes the fact that not all permutes can be lowered in
the best way in a vector length agnostic way so I've added an
exception to keep power-of-two size contiguous aligned chunks
unlowered (unless we want load-lanes). The optimal handling
of load/store vectorization is going to continue to be a learning
process.
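An example of the kind of access affected (hypothetical): a
single-element interleaving load whose permute would otherwise need
more than three vectors with V4SI.
void foo (int * __restrict a, int *b, int n)
{
  for (int i = 0; i < n; ++i)
    a[i] = b[8 * i];    /* one element used out of a group spanning 8 */
}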
* tree-vect-slp.cc (vect_lower_load_permutations): Also
process single-use grouped loads.
Avoid lowering contiguous aligned power-of-two sized
chunks, those are better handled by the vector size
specific SLP code generation.
* tree-vect-stmts.cc (get_group_load_store_type): Drop
the unrelated requirement of a load permutation for the
single-element interleaving limit.
* gcc.dg/vect/slp-46.c: Remove XFAIL.
|
|
This makes it easier to discover whether SLP load or store nodes
participate in load/store-lanes accesses.
* tree-vect-slp.cc (vect_print_slp_tree): Annotate load
and store-lanes nodes.
|
|
The following avoids performing re-discovery with single lanes in
the attempt to use mask_load_lanes, as rediscovery will fail
since a single lane of a mask load will appear permuted, which
isn't supported.
PR tree-optimization/116575
* tree-vect-slp.cc (vect_analyze_slp): Properly compute
the mask argument for vect_load/store_lanes_supported.
When the load is masked for now avoid rediscovery.
* gcc.dg/vect/pr116575.c: New testcase.
|
|
The following makes sure we handle a SLP load/store group from
a structure with complex and scalar members. This for example
happens in gcc.target/i386/pr106010-9a.c.
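A hypothetical reduction of the situation: real/imaginary parts and
a plain scalar member form one load/store group.
struct s { _Complex double c; double d; };
void foo (struct s * __restrict a, struct s *b)
{
  __real a->c = __real b->c;   /* REALPART_EXPR */
  __imag a->c = __imag b->c;   /* IMAGPART_EXPR */
  a->d = b->d;                 /* plain COMPONENT_REF in the same group */
}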
* tree-vect-slp.cc (vect_build_slp_tree_1): Handle mixing of
all handled components besides ARRAY_RANGE_REF; drop
handling of INDIRECT_REF.
|
|
The following is a prototype for how to represent load/store-lanes
within SLP. I've for now settled on having a single load node
with multiple permute nodes acting as selection, one for each loaded lane,
and a single store node fed from all stored lanes. For
for (int i = 0; i < 1024; ++i)
{
a[2*i] = b[2*i] + 7;
a[2*i+1] = b[2*i+1] * 3;
}
you have the following SLP graph where I explain how things are set
up and code-generated:
t.c:23:21: note: SLP graph after lowering permutations:
t.c:23:21: note: node 0x50dc8b0 (max_nunits=1, refcnt=1) vector(4) int
t.c:23:21: note: op template: *_6 = _7;
t.c:23:21: note: stmt 0 *_6 = _7;
t.c:23:21: note: stmt 1 *_12 = _13;
t.c:23:21: note: children 0x50dc488 0x50dc6e8
This is the store node, it's marked with ldst_lanes = true during
SLP discovery. This node code-generates
vect_array.65[0] = vect__7.61_29;
vect_array.65[1] = vect__13.62_28;
MEM <int[8]> [(int *)vectp_a.63_27] = .STORE_LANES (vect_array.65);
...
t.c:23:21: note: node 0x50dc520 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note: op: VEC_PERM_EXPR
t.c:23:21: note: stmt 0 _5 = *_4;
t.c:23:21: note: lane permutation { 0[0] }
t.c:23:21: note: children 0x50dc948
t.c:23:21: note: node 0x50dc780 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note: op: VEC_PERM_EXPR
t.c:23:21: note: stmt 0 _11 = *_10;
t.c:23:21: note: lane permutation { 0[1] }
t.c:23:21: note: children 0x50dc948
These are the selection nodes, marked with ldst_lanes = true.
They code generate nothing.
t.c:23:21: note: node 0x50dc948 (max_nunits=4, refcnt=3) vector(4) int
t.c:23:21: note: op template: _5 = *_4;
t.c:23:21: note: stmt 0 _5 = *_4;
t.c:23:21: note: stmt 1 _11 = *_10;
t.c:23:21: note: load permutation { 0 1 }
This is the load node, marked with ldst_lanes = true (the load
permutation is only accurate when taking into account the lane permute
in the selection nodes). It code generates
vect_array.58 = .LOAD_LANES (MEM <int[8]> [(int *)vectp_b.56_33]);
vect__5.59_31 = vect_array.58[0];
vect__5.60_30 = vect_array.58[1];
This scheme allows us to leave code generation in vectorizable_load/store
mostly as-is.
While this should support both load-lanes and (masked) store-lanes
the decision to do either is made at SLP discovery time and
cannot be reversed without altering the SLP tree - as-is the SLP
tree is not usable for non-store-lanes on the store side, the
load side is OK representation-wise but will very likely fail
permute handling as the lowering to deal with the two input vector
restriction isn't done - but of course since the permute node is
marked as to be ignored that doesn't work out. So I've put
restrictions in place that fail vectorization if a load/store-lane
SLP tree is later classified differently by get_load_store_type.
I'll note that for example gcc.target/aarch64/sve/mask_struct_store_3.c
will not get SLP store-lanes used because the full store SLPs just
fine though we then fail to handle the "splat" load-permutation
t2.c:5:21: note: node 0x4db2630 (max_nunits=4, refcnt=2) vector([4,4]) int
t2.c:5:21: note: op template: _6 = *_5;
t2.c:5:21: note: stmt 0 _6 = *_5;
t2.c:5:21: note: stmt 1 _6 = *_5;
t2.c:5:21: note: stmt 2 _6 = *_5;
t2.c:5:21: note: stmt 3 _6 = *_5;
t2.c:5:21: note: load permutation { 0 0 0 0 }
the load permute lowering code currently doesn't consider it worth
lowering single loads from a group (or in this case not grouped loads).
The expectation is the target can handle this by two interleaves with
itself.
So what we see here is that while the explicit SLP representation is
helpful in some cases, in cases like this it would require changing
it when we make decisions how to vectorize. My idea is that this
all will change a lot when we re-do SLP discovery (for loops) and
when we get rid of non-SLP as I think vectorizable_* should be
allowed to alter the SLP graph during analysis.
The patch also removes the code cancelling SLP if we can use
load/store-lanes from the main loop vector analysis code and
re-implements it as re-discovering the SLP instance with
forced single-lane splits so SLP load/store-lanes scheme can be
used.
This is now done after SLP discovery and SLP pattern recog are
complete to not disturb the latter but per SLP instance instead
of being a global decision on the whole loop.
This is a behavioral change that for example shows in
gcc.dg/vect/slp-perm-6.c on ARM where we formerly used SLP permutes
but now use a mix of SLP without permutes and load/store lanes. The
previous flaky heuristic is now flaky in a different way.
Testing on RISC-V and aarch64 reveals several testcases that require
adjustment as to now expect SLP even when load/store lanes are being
used. If in doubt I've adjusted them to the final expectation which
will lead to one or two new FAILs where we still do the SLP cancelling.
I have a followup that implements that while remaining in SLP that's
in final testing.
Note that gcc.dg/vect/slp-42.c and gcc.dg/vect/pr68445.c will FAIL
on aarch64 with SVE because for some odd reason vect_stridedN
is true for any N for check_effective_target_vect_fully_masked
targets but SVE cannot do ld8 while risc-v can.
I have not bothered to adjust target tests that now fail assembly-scan.
* tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark
load, store and permute nodes.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize ldst_lanes.
(vect_build_slp_instance): For stores iff the target prefers
store-lanes discover single-lane sub-groups, do not perform
interleaving lowering but mark the node with ldst_lanes.
Also allow i == 0 - fatal failure - for splitting up a store group
when we're not doing single-lane discovery already.
(vect_lower_load_permutations): When the target supports
load lanes and the loads all fit the pattern split out
a single level of permutes only and mark the load and
permute nodes with ldst_lanes.
(vectorizable_slp_permutation_1): Handle the load-lane permute
forwarding of vector defs.
(vect_analyze_slp): After SLP pattern recog is finished see if
there are any SLP instances that would benefit from using
load/store-lanes and re-discover those with forced single lanes.
* tree-vect-stmts.cc (get_group_load_store_type): Support
load/store-lanes for SLP.
(vectorizable_store): Support SLP code generation for store-lanes.
(vectorizable_load): Support SLP code generation for load-lanes.
* tree-vect-loop.cc (vect_analyze_loop_2): Do not cancel SLP
when store-lanes can be used.
* gcc.dg/vect/slp-55.c: New testcase.
* gcc.dg/vect/slp-56.c: Likewise.
* gcc.dg/vect/slp-11c.c: Adjust.
* gcc.dg/vect/slp-53.c: Likewise.
* gcc.dg/vect/slp-cond-1.c: Likewise.
* gcc.dg/vect/vect-complex-5.c: Likewise.
* gcc.dg/vect/slp-1.c: Likewise.
* gcc.dg/vect/slp-54.c: Remove riscv XFAIL.
* gcc.dg/vect/slp-perm-5.c: Adjust.
* gcc.dg/vect/slp-perm-7.c: Likewise.
* gcc.dg/vect/slp-perm-8.c: Likewise.
* gcc.dg/vect/slp-multitypes-11.c: Likewise.
* gcc.dg/vect/slp-multitypes-11-big-array.c: Likewise.
* gcc.dg/vect/slp-perm-9.c: Remove expected SLP fail due to
three-vector permute.
* gcc.dg/vect/slp-perm-6.c: Remove XFAIL.
* gcc.dg/vect/slp-perm-1.c: Adjust.
* gcc.dg/vect/slp-perm-2.c: Likewise.
* gcc.dg/vect/slp-perm-3.c: Likewise.
* gcc.dg/vect/slp-perm-4.c: Likewise.
* gcc.dg/vect/pr68445.c: Likewise.
* gcc.dg/vect/slp-11b.c: Likewise.
* gcc.dg/vect/slp-2.c: Likewise.
* gcc.dg/vect/slp-23.c: Likewise.
* gcc.dg/vect/slp-33.c: Likewise.
* gcc.dg/vect/slp-42.c: Likewise.
* gcc.dg/vect/slp-46.c: Likewise.
* gcc.dg/vect/slp-perm-10.c: Likewise.
|
|
The following emulates classical interleaving for SLP load permutes
that we are unlikely to handle natively. This is to handle cases
where interleaving (or load/store-lanes) is the optimal choice for
vectorizing even when we are doing that within SLP. An example
would be
void foo (int * __restrict a, int * b)
{
for (int i = 0; i < 16; ++i)
{
a[4*i + 0] = b[4*i + 0] * 3;
a[4*i + 1] = b[4*i + 1] + 3;
a[4*i + 2] = (b[4*i + 2] * 3 + 3);
a[4*i + 3] = b[4*i + 3] * 3;
}
}
where currently the SLP store is merging four single-lane SLP
sub-graphs but none of the loads in it can be code-generated
with V4SImode vectors and a VF of four as the permutes would need
three vectors.
The patch introduces a lowering phase after SLP discovery but
before SLP pattern recognition or permute optimization that
analyzes all loads from the same dataref group and creates an
interleaving scheme starting from an unpermuted load.
What can be handled is power-of-two group size and a group size of
three. The possibility for doing the interleaving with a load-lanes
like instruction is done as followup.
For a group-size of three this is done by using
the non-interleaving fallback code which then creates at VF == 4 from
{ { a0, b0, c0 }, { a1, b1, c1 }, { a2, b2, c2 }, { a3, b3, c3 } }
the intermediate vectors { c0, c0, c1, c1 } and { c2, c2, c3, c3 }
to produce { c0, c1, c2, c3 }. This turns out to be more effective
than the scheme implemented for non-SLP for SSE and only slightly
worse for AVX512 and a bit worse for AVX2. It seems to me that
this would extend to other non-power-of-two group-sizes though (but
the patch does not). Optimal schemes are likely difficult to lay out
in VF agnostic form.
I'll note that while the lowering assumes even/odd extract is
generally available for all vector element sizes (which is probably
a good assumption), it doesn't in any way constrain the other
permutes it generates based on target availability. Again difficult
to do in a VF agnostic way (but at least currently the vector type
is fixed).
I'll also note that the SLP store side merges lanes in a way
producing three-vector permutes for store group-size of three, so
the testcase uses a store group-size of four.
The patch has a fallback for when there are multi-lane groups
and the resulting permutes do not fit interleaving. Code
generation is not optimal when this triggers and might be
worse than doing single-lane group interleaving.
The patch handles gaps by representing them with NULL
entries in SLP_TREE_SCALAR_STMTS for the unpermuted load node.
The SLP discovery changes could be elided if we manually build the
load node instead.
SLP load nodes covering enough lanes to not need intermediate
permutes are retained as having a load-permutation and do not
use the single SLP load node for each dataref group. That's
something we might want to change, making load-permutation
something purely local to SLP discovery (but then SLP discovery
could do part of the lowering).
The patch misses CSEing intermediate generated permutes and
registering them with the bst_map which is possibly required
for SLP pattern detection in some cases - this re-spin of the
patch moves the lowering after SLP pattern detection.
* tree-vect-slp.cc (vect_build_slp_tree_1): Handle NULL stmt.
(vect_build_slp_tree_2): Likewise. Release load permutation
when there's a NULL in SLP_TREE_SCALAR_STMTS and assert there's
no actual permutation in that case.
(vllp_cmp): New function.
(vect_lower_load_permutations): Likewise.
(vect_analyze_slp): Call it.
* gcc.dg/vect/slp-11a.c: Expect SLP.
* gcc.dg/vect/slp-12a.c: Likewise.
* gcc.dg/vect/slp-51.c: New testcase.
* gcc.dg/vect/slp-52.c: New testcase.
|
|
It just clutters the dump files and takes up compile-time.
* tree-vect-slp.cc (vect_build_slp_tree_2): Disable SLP
reassociation for single-lane.
|
|
As preliminary work towards an overhaul of how optinfo_items
interact with dump_pretty_printer, replace uses of optinfo_item * with
std::unique_ptr<optinfo_item> to make ownership clearer.
No functional change intended.
gcc/ChangeLog:
* config/aarch64/aarch64.cc: Define INCLUDE_MEMORY.
* config/arm/arm.cc: Likewise.
* config/i386/i386.cc: Likewise.
* config/loongarch/loongarch.cc: Likewise.
* config/riscv/riscv-vector-costs.cc: Likewise.
* config/riscv/riscv.cc: Likewise.
* config/rs6000/rs6000.cc: Likewise.
* dump-context.h (dump_context::emit_item): Convert "item" param
from * to const &.
(dump_pretty_printer::stash_item): Convert "item" param from
optinfo_item * to std::unique_ptr<optinfo_item>.
(dump_pretty_printer::emit_item): Likewise.
* dumpfile.cc: Include "make-unique.h".
(make_item_for_dump_gimple_stmt): Replace uses of optinfo_item *
with std::unique_ptr<optinfo_item>.
(dump_context::dump_gimple_stmt): Likewise.
(make_item_for_dump_gimple_expr): Likewise.
(dump_context::dump_gimple_expr): Likewise.
(make_item_for_dump_generic_expr): Likewise.
(dump_context::dump_generic_expr): Likewise.
(make_item_for_dump_symtab_node): Likewise.
(dump_pretty_printer::emit_items): Likewise.
(dump_pretty_printer::emit_any_pending_textual_chunks): Likewise.
(dump_pretty_printer::emit_item): Likewise.
(dump_pretty_printer::stash_item): Likewise.
(dump_pretty_printer::decode_format): Likewise.
(dump_context::dump_printf_va): Fix overlong line.
(make_item_for_dump_dec): Replace uses of optinfo_item * with
std::unique_ptr<optinfo_item>.
(dump_context::dump_dec): Likewise.
(dump_context::dump_symtab_node): Likewise.
(dump_context::begin_scope): Likewise.
(dump_context::emit_item): Likewise.
* gimple-loop-interchange.cc: Define INCLUDE_MEMORY.
* gimple-loop-jam.cc: Likewise.
* gimple-loop-versioning.cc: Likewise.
* graphite-dependences.cc: Likewise.
* graphite-isl-ast-to-gimple.cc: Likewise.
* graphite-optimize-isl.cc: Likewise.
* graphite-poly.cc: Likewise.
* graphite-scop-detection.cc: Likewise.
* graphite-sese-to-poly.cc: Likewise.
* graphite.cc: Likewise.
* opt-problem.cc: Likewise.
* optinfo.cc (optinfo::add_item): Convert "item" param from
optinfo_item * to std::unique_ptr<optinfo_item>.
(optinfo::emit_for_opt_problem): Update for change to
dump_context::emit_item.
* optinfo.h: Add #error to fail immediately if INCLUDE_MEMORY
wasn't defined, rather than fail to find std::unique_ptr.
(optinfo::add_item): Convert "item" param from optinfo_item * to
std::unique_ptr<optinfo_item>.
* sese.cc: Define INCLUDE_MEMORY.
* targhooks.cc: Likewise.
* tree-data-ref.cc: Likewise.
* tree-if-conv.cc: Likewise.
* tree-loop-distribution.cc: Likewise.
* tree-parloops.cc: Likewise.
* tree-predcom.cc: Likewise.
* tree-ssa-live.cc: Likewise.
* tree-ssa-loop-ivcanon.cc: Likewise.
* tree-ssa-loop-ivopts.cc: Likewise.
* tree-ssa-loop-prefetch.cc: Likewise.
* tree-ssa-loop-unswitch.cc: Likewise.
* tree-ssa-phiopt.cc: Likewise.
* tree-ssa-threadbackward.cc: Likewise.
* tree-ssa-threadupdate.cc: Likewise.
* tree-vect-data-refs.cc: Likewise.
* tree-vect-generic.cc: Likewise.
* tree-vect-loop-manip.cc: Likewise.
* tree-vect-loop.cc: Likewise.
* tree-vect-patterns.cc: Likewise.
* tree-vect-slp-patterns.cc: Likewise.
* tree-vect-slp.cc: Likewise.
* tree-vect-stmts.cc: Likewise.
* tree-vectorizer.cc: Likewise.
gcc/testsuite/ChangeLog:
* gcc.dg/plugin/dump_plugin.c: Define INCLUDE_MEMORY.
Signed-off-by: David Malcolm <dmalcolm@redhat.com>
|
|
I found it helpful to be able to print a whole SLP instance from gdb.
* tree-vect-slp.cc (debug): Add overload for slp_instance.
|
|
The following fixes a leak of the discovered single-lane store
SLP nodes from which we only use their children. This uncovers
a latent reference counting issue in the interleaving build where
we fail to increment their reference count.
* tree-vect-slp.cc (vect_build_slp_store_interleaving):
Fix reference counting.
(vect_build_slp_instance): Release rhs_nodes.
|
|
This splits out SLP store interleaving into a separate function.
* tree-vect-slp.cc (vect_build_slp_store_interleaving): Split
out from ...
(vect_build_slp_instance): Here.
|
|
The following tries to address the fact that the vectorizer lacks
precise knowledge of argument and return calling conventions and
views some accesses as loads and stores that are not.
This is mainly important when doing basic-block vectorization as
otherwise loop indexing would force such arguments to memory.
On x86 the reduction in the number of apparent loads and stores
often dominates cost analysis so the following tries to mitigate
this aggressively by adjusting only the scalar load and store
cost, reducing them to the cost of a simple scalar statement,
but not touching the vector access cost which would be much
harder to estimate. Thereby we err on the side of not performing
basic-block vectorization.
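A hypothetical example of the effect: p and res never live in
memory, so their "loads" and "stores" should be costed as simple
scalar stmts.
struct pair { int x, y; };   /* DImode on x86-64, passed/returned in a register */
struct pair foo (struct pair p)
{
  struct pair res;
  res.x = p.x + p.y;   /* GIMPLE load/store, but no actual memory access */
  res.y = p.x - p.y;
  return res;
}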
PR tree-optimization/116274
* tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads
and stores as simple scalar stmts when they access a non-global,
not address-taken variable that doesn't have BLKmode assigned.
* gcc.target/i386/pr116274-2.c: New testcase.
|
|
This change checks when a two_operators SLP node has multiple occurrences of
the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the operands
so that there are no duplicates. Two vec_perm expressions are then introduced
to recreate the original ordering. These duplicates can appear due to how
two_operators nodes are handled, and they prevent vectorization in some cases.
This targets the vectorization of the SPEC2017 x264 pixel_satd functions.
On some processors a larger than 10% improvement on x264 has been observed.
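A hypothetical illustration of how such duplicates arise after CSE
of the scalar code:
void foo (int * __restrict r, int a, int b)
{
  int s = a + b;   /* stmt A */
  int d = a - b;   /* stmt B */
  r[0] = s;        /* the node feeding the store has lanes {A, B, A, B}; */
  r[1] = d;        /* rearranged to {A, B} plus two vec_perms that */
  r[2] = s;        /* recreate the original lane order */
  r[3] = d;
}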
PR tree-optimization/98138
gcc/ChangeLog:
* tree-vect-slp.cc: Avoid duplicates in two_operators nodes.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/vect-slp-two-operator.c: New test.
|
|
The following avoids some useless work when the SLP discovery limit
is reached, for example allocating a node to cache the failure
and starting discovery on split store groups when analyzing BBs.
It does not address the issue in the PR, which is an overly
generous budget for discovery when the store group size approaches
the number of overall statements.
PR tree-optimization/116083
* tree-vect-slp.cc (vect_build_slp_tree): Do not allocate
a discovery fail node when we reached the discovery limit.
(vect_build_slp_instance): Terminate early when the
discovery limit is reached.
|
|
The number of vector stmts for an operation is calculated based on
the output vectype. This is over-estimated for lane-reducing
operations, which would cause vector def/use mismatches when we want
to support loop reductions mixed with lane-reducing and normal
operations. One solution is to refit lane-reducing operations to
behave like normal ones, by adding new pass-through copies to fix
possible def/use gaps. The resulting superfluous statements can be
optimized away after vectorization. For example:
int sum = 1;
for (i)
{
sum += d0[i] * d1[i]; // dot-prod <vector(16) char>
}
The vector size is 128-bit, the vectorization factor is 16. Reduction
statements would be transformed as:
vector<4> int sum_v0 = { 0, 0, 0, 1 };
vector<4> int sum_v1 = { 0, 0, 0, 0 };
vector<4> int sum_v2 = { 0, 0, 0, 0 };
vector<4> int sum_v3 = { 0, 0, 0, 0 };
for (i / 16)
{
sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
sum_v1 = sum_v1; // copy
sum_v2 = sum_v2; // copy
sum_v3 = sum_v3; // copy
}
sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0
2024-07-02 Feng Xue <fxue@os.amperecomputing.com>
gcc/
* tree-vect-loop.cc (vect_reduction_update_partial_vector_usage):
Calculate effective vector stmts number with generic
vect_get_num_copies.
(vect_transform_reduction): Insert copies for lane-reducing so as to
fix over-estimated vector stmts number.
(vect_transform_cycle_phi): Calculate vector PHI number only based on
output vectype.
* tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove
adjustment on vector stmts number specific to slp reduction.
|
|
Extend the original vect_get_num_copies (purely loop-based) to calculate
the number of vector stmts for an SLP node in a generic vect region.
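A sketch of the overload's shape (assumed, simplified from the
actual signature):
inline unsigned int
vect_get_num_copies (vec_info *vinfo, slp_tree node)
{
  poly_uint64 vf = 1;
  /* Only loop vectorization contributes a vectorization factor;
     for a BB region the SLP lanes alone determine the count.  */
  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
    vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
  return vect_get_num_vectors (vf * SLP_TREE_LANES (node),
			       SLP_TREE_VECTYPE (node));
}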
2024-07-12 Feng Xue <fxue@os.amperecomputing.com>
gcc/
* tree-vectorizer.h (vect_get_num_copies): New overload function.
* tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Calculate
number of vector stmts for slp node with vect_get_num_copies.
(vect_slp_analyze_node_operations): Calculate number of vector elements
for constant/external slp node with vect_get_num_copies.
|
|
The following implements the group-size three scheme from
vect_permute_store_chain in SLP grouped store permute lowering
and extends it to power-of-two multiples of group size three.
The scheme goes from vectors A, B and C to
{ A[0], B[0], C[0], A[1], B[1], C[1], ... } by first producing
{ A[0], B[0], X, A[1], B[1], X, ... } (with X a don't-care, chosen
to be A[n]) and then permuting C[n] into the appropriate places.
The extension replaces vector elements with a power-of-two
number of lanes, and you get pairwise interleaving
until the final three-input permutes happen.
The last permute step could be seen as extending C to { C[0], C[0],
C[0], ... } and then performing a blend.
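Concretely, at V4SI this looks like (my own sketch, X denoting a
don't-care lane):
A  = { a0, a1, a2, a3 }  B = { b0, b1, b2, b3 }  C = { c0, c1, c2, c3 }
step 1:  T0 = { a0, b0, X, a1 }  T1 = { b1, X, a2, b2 }  T2 = { X, a3, b3, X }
step 2:  R0 = { a0, b0, c0, a1 }  R1 = { b1, c1, a2, b2 }  R2 = { c2, a3, b3, c3 }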
VLA archs will want to use store-lanes here I guess; I'm not sure
if the three-vector interleave operation is also available with
a register source and destination and thus available for a shuffle.
* tree-vect-slp.cc (vect_build_slp_instance): Special case
three input permute with the same number of lanes in store
permute lowering.
* gcc.dg/vect/slp-53.c: New testcase.
* gcc.dg/vect/slp-54.c: New testcase.
|
|
The following removes redundant dumping in vect permute vectorization.
* tree-vect-slp.cc (vectorizable_slp_permutation_1): Remove
redundant dump.
|
|
The following starts to handle NULL elements in SLP_TREE_SCALAR_STMTS
with the first candidate being the two-operator nodes where some
lanes are do-not-care and also do not have a scalar stmt computing
the result. I originally added SLP_TREE_SCALAR_STMTS to two-operator
nodes but this exposes PR115764, so I've split that out.
I have a patch that uses NULL elements for loads from groups with gaps
where we get around not doing that by having a load permutation.
* tree-vect-slp.cc (bst_traits::hash): Handle NULL elements
in SLP_TREE_SCALAR_STMTS.
(vect_print_slp_tree): Likewise.
(vect_mark_slp_stmts): Likewise.
(vect_mark_slp_stmts_relevant): Likewise.
(vect_find_last_scalar_stmt_in_slp): Likewise.
(vect_bb_slp_mark_live_stmts): Likewise.
(vect_slp_prune_covered_roots): Likewise.
(vect_bb_partition_graph_r): Likewise.
(vect_remove_slp_scalar_calls): Likewise.
(vect_slp_gather_vectorized_scalar_stmts): Likewise.
(vect_bb_slp_scalar_cost): Likewise.
(vect_contains_pattern_stmt_p): Likewise.
(vect_slp_convert_to_external): Likewise.
(vect_find_first_scalar_stmt_in_slp): Likewise.
(vect_optimize_slp_pass::remove_redundant_permutations): Likewise.
(vect_slp_analyze_node_operations_1): Likewise.
(vect_schedule_slp_node): Likewise.
* tree-vect-stmts.cc (can_vectorize_live_stmts): Likewise.
(vectorizable_shift): Likewise.
* tree-vect-data-refs.cc (vect_slp_analyze_load_dependences):
Handle NULL elements in SLP_TREE_SCALAR_STMTS.
|
|
The following makes sure that for SLP reductions all lanes have
the same STMT_VINFO_REDUC_IDX. Once we move that info and can adjust
it we can implement swapping. It also makes the existing protection
against operand swapping trigger for all stmts participating in a
reduction, not just the final one marked as reduction-def.
* tree-vect-slp.cc (vect_build_slp_tree_1): Compare
STMT_VINFO_REDUC_IDX.
(vect_build_slp_tree_2): Prevent operand swapping for
all stmts participating in a reduction.
|
|
The following addresses the corner case of an outer loop with an empty
header, where we ended up asking for the BB of a NULL stmt, by
special-casing this situation.
PR tree-optimization/115652
* tree-vect-slp.cc (vect_schedule_slp_node): Handle the case
where the outer loop header block is empty.
|
|
The following avoids associating a reduction path as that might
get STMT_VINFO_REDUC_IDX out-of-sync with the SLP operand order.
This is a latent issue with SLP reductions but now easily exposed
as we're doing single-lane SLP reductions.
When we achieved SLP only we can move and update this meta-data.
PR tree-optimization/115669
* tree-vect-slp.cc (vect_build_slp_tree_2): Do not reassociate
chains that participate in a reduction.
* gcc.dg/vect/pr115669.c: New testcase.
|
|
The previous fix breaks in the degenerate case when the discovered
last_stmt is equal to the first stmt in the block since then we
undo a required stmt advancement.
PR tree-optimization/115652
* tree-vect-slp.cc (vect_schedule_slp_node): Only insert
at the start of the block if that strictly dominates
the discovered dependent stmt.
|
|
The following adjusts how SLP computes the insertion location. In
particular it advanced the insert iterator of the found last_stmt.
The vectorizer will later insert stmts _before_ it. But we also
have the constraint that possibly masked ops may not be scheduled
outside of the loop and as we do not model the loop mask in the
SLP graph we have to adjust for that. The following moves this
to after the advance since it isn't compatible with that as the
current GIMPLE_COND exception shows. The PR is about in-order
reduction vectorization which also isn't happy when that's the
very first stmt.
PR tree-optimization/115652
* tree-vect-slp.cc (vect_schedule_slp_node): Advance the
iterator based on last_stmt only for vector defs.
|
|
The following prevents SLP CSE from creating new cycles, which happened
because of a 1:1 permute node being present where its child was then
CSEd to the permute node. Fixed by making a node available to
CSE only after recursing.
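A sketch of the fixed recursion (assumed shape; the real function
also has to deal with reduction backedges and non-internal defs):
static void
vect_cse_slp_nodes (scalar_stmts_to_slp_tree_map_t *bst_map, slp_tree &node)
{
  if (SLP_TREE_DEF_TYPE (node) != vect_internal_def
      || SLP_TREE_SCALAR_STMTS (node).is_empty ())
    return;
  if (slp_tree *leader = bst_map->get (SLP_TREE_SCALAR_STMTS (node)))
    {
      if (*leader != node)
	{
	  /* CSE node to the existing equivalent tree.  */
	  SLP_TREE_REF_COUNT (*leader)++;
	  vect_free_slp_tree (node);
	  node = *leader;
	}
      return;
    }
  /* Recurse first and only then make NODE available, so a child
     permute cannot be CSEd to NODE itself, forming a cycle.  */
  for (slp_tree &child : SLP_TREE_CHILDREN (node))
    if (child)
      vect_cse_slp_nodes (bst_map, child);
  bst_map->put (SLP_TREE_SCALAR_STMTS (node).copy (), node);
}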
PR tree-optimization/115602
* tree-vect-slp.cc (vect_cse_slp_nodes): Delay populating the
bst-map to avoid cycles.
* gcc.dg/vect/pr115602.c: New testcase.
|
|
The following makes sure to always CSE when there's SLP_TREE_SCALAR_STMTS
as otherwise a chain of two-operator node operations can result in
exponential behavior of the CSE process as likely seen when building
510.parest on aarch64.
PR tree-optimization/115597
* tree-vect-slp.cc (vect_cse_slp_nodes): Allow to CSE
VEC_PERM nodes.
|
|
We currently fail to re-CSE SLP nodes after optimizing permutes
which results in off cost estimates. For gcc.dg/vect/bb-slp-32.c
this shows in not re-using the SLP node with the load and arithmetic
for both the store and the reduction. The following implements
CSE by re-bst-mapping nodes as a finalization part of vect_optimize_slp.
I've tried to make the CSE part of permute materialization but it
isn't a very good fit there. I've not bothered to implement something
more complete, also handling external defs or defs without
SLP_TREE_SCALAR_STMTS.
I realize this might result in more BB SLP which in turn might slow
down code given costing for BB SLP is difficult (even that we now
vectorize gcc.dg/vect/bb-slp-32.c on x86_64 might not be a good idea).
This is nevertheless feeding more accurate info to costing which is
good.
PR tree-optimization/114413
* tree-vect-slp.cc (release_scalar_stmts_to_slp_tree_map):
New function, split out from ...
(vect_analyze_slp): ... here. Call it.
(vect_cse_slp_nodes): New function.
(vect_optimize_slp): Call it.
* gcc.dg/vect/bb-slp-32.c: Expect CSE and vectorization on x86.
|
|
Add a utility function to check if a statement is a lane-reducing operation,
which could simplify some existing code.
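A minimal sketch of the helper (assumed shape), building on
lane_reducing_op_p:
inline bool
lane_reducing_stmt_p (gimple *stmt)
{
  /* Lane-reducing ops only appear as assignments.  */
  if (auto *assign = dyn_cast <gassign *> (stmt))
    return lane_reducing_op_p (gimple_assign_rhs_code (assign));
  return false;
}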
2024-06-16 Feng Xue <fxue@os.amperecomputing.com>
gcc/
* tree-vectorizer.h (lane_reducing_stmt_p): New function.
* tree-vect-slp.cc (vect_analyze_slp): Use new function
lane_reducing_stmt_p to check statement.
|
|
When there's a permute after an extern vector we can run into code
that didn't consider the scheduled node being a permute, which lacks
a representative.
PR tree-optimization/115508
* tree-vect-slp.cc (vect_schedule_slp_node): Guard check on
representative.
* gcc.target/i386/pr115508.c: New testcase.
|
|
The following makes double reduction vectorization work when
using (single-lane) SLP vectorization.
* tree-vect-loop.cc (vect_analyze_scalar_cycles_1): Queue
double reductions in LOOP_VINFO_REDUCTIONS.
(vect_create_epilog_for_reduction): Remove asserts disabling
SLP for double reductions.
(vectorizable_reduction): Analyze SLP double reductions
only once and start off at the correct places.
* tree-vect-slp.cc (vect_get_and_check_slp_defs): Allow
vect_double_reduction_def.
(vect_build_slp_tree_2): Fix condition for the ignored
reduction initial values.
* tree-vect-stmts.cc (vect_analyze_stmt): Allow
vect_double_reduction_def.
|
|
The following performs single-lane SLP discovery for reductions.
It requires a fixup for outer loop vectorization where a check
for multiple types needs adjustments as otherwise bogus pointer
IV increments happen when there are multiple copies of vector stmts
in the inner loop.
For the reduction epilog handling this extends the optimized path
to cover the trivial single-lane SLP reduction case.
The fix for PR65518 implemented in vect_grouped_load_supported for
non-SLP needs a SLP counterpart that I put in get_group_load_store_type.
I've decided to adjust three testcases for appearing single-lane
SLP instances instead of not dumping "vectorizing stmts using SLP"
for single-lane instances as that also requires testsuite adjustments.
* tree-vect-slp.cc (vect_build_slp_tree_2): Only multi-lane
discoveries are reduction chains and need special backedge
treatment.
(vect_analyze_slp): Fall back to single-lane SLP discovery
for reductions. Make sure to try single-lane SLP reduction
for all reductions as fallback.
(vectorizable_load): Avoid outer loop SLP vectorization with
multi-copy vector stmts in the inner loop.
(vectorizable_store): Likewise.
* tree-vect-loop.cc (vect_create_epilog_for_reduction): Allow
direct opcode and shift reduction also for SLP reductions
with a single lane.
* tree-vect-stmts.cc (get_group_load_store_type): For SLP also
check for the PR65518 single-element interleaving case as done in
vect_grouped_load_supported.
* gcc.dg/vect/slp-24.c: Expect another SLP instance for the
reduction.
* gcc.dg/vect/slp-24-big-array.c: Likewise.
* gcc.dg/vect/slp-reduc-6.c: Remove scan for zero SLP instances.
|
|
When vectorizing an early break loop with LENs (do we miss some
check here to disallow this?) we can end up deciding to insert
stmts after a GIMPLE_COND when doing SLP scheduling and trying
to be conservative with placing of stmts only dependent on
the implicit loop mask/len. The following avoids this; I guess
it's not perfect but it does the job, fixing some observed
RISC-V regressions.
* tree-vect-slp.cc (vect_schedule_slp_node): For mask/len
loops make sure to not advance the insertion iterator
beyond a GIMPLE_COND.
|
|
Checking if an operation is lane-reducing requires comparing its code
against three kinds (DOT_PROD_EXPR/WIDEN_SUM_EXPR/SAD_EXPR). Add a
utility function to make the check handy and concise.
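The helper boils down to (a sketch, assuming a code_helper
parameter):
inline bool
lane_reducing_op_p (code_helper code)
{
  return code == DOT_PROD_EXPR || code == WIDEN_SUM_EXPR || code == SAD_EXPR;
}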
2024-05-29 Feng Xue <fxue@os.amperecomputing.com>
gcc/
* tree-vectorizer.h (lane_reducing_op_p): New function.
* tree-vect-slp.cc (vect_analyze_slp): Use new function
lane_reducing_op_p to check statement code.
* tree-vect-loop.cc (vect_transform_reduction): Likewise.
(vectorizable_reduction): Likewise, and change name of a local
variable that holds the result flag.
|
|
Both derived classes have their own "bbs" field with exactly the same
purpose of recording all basic blocks inside the corresponding vect
region, but the fields use different data types: one is a plain array,
the other an auto_vec. This difference causes some duplicated code
handling the same stuff, mostly in tree-vect-patterns.cc. One
refinement is to lift this field into the base class "vec_info" and
set its value to the contiguous memory area pointed to by the two old
"bbs" in each constructor of the derived classes.
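A sketch of the lifted fields (assumed shapes, simplified):
class vec_info
{
public:
  /* All basic blocks inside the vect region, shared by loop and BB
     vectorization; points into loop->bbs or the BB region's vector.  */
  basic_block *bbs;
  /* Number of blocks in BBS.  */
  unsigned int nbbs;
};
/* The BB region constructor (sketch) aliases the auto_vec's storage: */
_bb_vec_info::_bb_vec_info (vec<basic_block> _bbs, vec_info_shared *shared)
  : vec_info (vec_info::bb, shared)
{
  bbs = _bbs.address ();
  nbbs = _bbs.length ();
}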
2024-05-16 Feng Xue <fxue@os.amperecomputing.com>
gcc/
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Move
initialization of bbs to explicit construction code. Adjust the
definition of nbbs.
(update_epilogue_loop_vinfo): Update nbbs for epilog vinfo.
* tree-vect-patterns.cc (vect_determine_precisions): Make
loop_vec_info and bb_vec_info share same code.
(vect_pattern_recog): Remove duplicated vect_pattern_recog_1 loop.
* tree-vect-slp.cc (vect_get_and_check_slp_defs): Access to bbs[0]
via base vec_info class.
(_bb_vec_info::_bb_vec_info): Initialize bbs and nbbs using data
fields of input auto_vec<> bbs.
(vect_slp_region): Use access to nbbs to replace original
bbs.length().
(vect_schedule_slp_node): Access to bbs[0] via base vec_info class.
* tree-vectorizer.cc (vec_info::vec_info): Add initialization of
bbs and nbbs.
(vec_info::insert_seq_on_entry): Access to bbs[0] via base vec_info
class.
* tree-vectorizer.h (vec_info): Add new fields bbs and nbbs.
(LOOP_VINFO_NBBS): New macro.
(BB_VINFO_BBS): Rename BB_VINFO_BB to BB_VINFO_BBS.
(BB_VINFO_NBBS): New macro.
(_loop_vec_info): Remove field bbs.
(_bb_vec_info): Rename field bbs.
|
|
The following avoids accounting single-lane SLP to the discovery
limit. As the two testcases show, accounting it made discovery fail,
unfortunately not even in the same way across targets. The following
should fix two FAILs for GCN as a side-effect.
PR tree-optimization/115254
* tree-vect-slp.cc (vect_build_slp_tree): Only account
multi-lane SLP to limit.
* gcc.dg/vect/slp-cond-2-big-array.c: Expect 4 times SLP.
* gcc.dg/vect/slp-cond-2.c: Likewise.
|
|
The following avoids splitting store dataref groups during SLP
discovery but instead forces (eventually single-lane) consecutive
lane SLP discovery for all lanes of the group, creating VEC_PERM
SLP nodes merging them so the store will always cover the whole group.
With this for example
int x[1024], y[1024], z[1024], w[1024];
void foo (void)
{
for (int i = 0; i < 256; i++)
{
x[4*i+0] = y[2*i+0];
x[4*i+1] = y[2*i+1];
x[4*i+2] = z[i];
x[4*i+3] = w[i];
}
}
which was previously using hybrid SLP can now be fully SLPed and
the generated SSE code looks better (but of course you never know,
I didn't actually benchmark). We of course need a VF of four here.
.L2:
movdqa z(%rax), %xmm0
movdqa w(%rax), %xmm4
movdqa y(%rax,%rax), %xmm2
movdqa y+16(%rax,%rax), %xmm1
movdqa %xmm0, %xmm3
punpckhdq %xmm4, %xmm0
punpckldq %xmm4, %xmm3
movdqa %xmm2, %xmm4
shufps $238, %xmm3, %xmm2
movaps %xmm2, x+16(,%rax,4)
movdqa %xmm1, %xmm2
shufps $68, %xmm3, %xmm4
shufps $68, %xmm0, %xmm2
movaps %xmm4, x(,%rax,4)
shufps $238, %xmm0, %xmm1
movaps %xmm2, x+32(,%rax,4)
movaps %xmm1, x+48(,%rax,4)
addq $16, %rax
cmpq $1024, %rax
jne .L2
The extra permute nodes merging distinct branches of the SLP
tree might be unexpected for some code, esp. since
SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we
cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS
consistently as we can have a mix of both.
The patch keeps the sub-trees formed from consecutive lanes, but that's
in principle not necessary if we for example have an even/odd
split which now would result in N single-lane sub-trees. That's
left for future improvements.
The interesting part is how VLA vector ISAs handle merging of
two vectors that's not trivial even/odd merging. The strategy
of how to build the permute tree might need adjustments for that
(in the end splitting each branch to single lanes and then doing
even/odd merging would be the brute-force fallback). Not sure
how much we can or should rely on the SLP optimize pass to handle
this.
The gcc.dg/vect/slp-12a.c case is interesting as we currently split
the 8 store group into lanes 0-5 which we SLP with an unroll factor
of two (on x86-64 with SSE) and the remaining two lanes are using
interleaving vectorization with a final unroll factor of four. Thus
we're using hybrid SLP within a single store group. After the change
we discover the same 0-5 lane SLP part as well as two single-lane
parts feeding the full store group. But that results in a load
permutation that isn't supported (I have WIP patches to rectify that).
So we end up cancelling SLP and vectorizing the whole loop with
interleaving which is IMO good and results in better code.
This is similar for gcc.target/i386/pr52252-atom.c where interleaving
generates much better code than hybrid SLP. I'm unsure how to update
the testcase though.
gcc.dg/vect/slp-21.c runs into similar situations. Note that
currently, when we discard an instance while analyzing SLP operations,
we force the full loop to have no SLP because hybrid detection is
broken. It's probably not worth fixing this at this moment.
For gcc.dg/vect/pr97428.c we are not splitting the 16 store group
into two but merge the two 8 lane loads into one before doing the
store and thus have only a single SLP instance. A similar situation
happens in gcc.dg/vect/slp-11c.c but the branches feeding the
single SLP store only have a single lane. Likewise for
gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.
gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization
with a SLP store group of size two but two single-lane branches.
* tree-vect-slp.cc (vect_build_slp_instance): Do not split
store dataref groups on loop SLP discovery failure but create
a single SLP instance for the stores but branch to SLP sub-trees
and merge with a series of VEC_PERM nodes.
* gcc.dg/vect/pr97428.c: Expect a single store SLP group.
* gcc.dg/vect/slp-11c.c: Likewise, if !vect_load_lanes.
* gcc.dg/vect/vect-complex-5.c: Likewise.
* gcc.dg/vect/slp-12a.c: Do not expect SLP.
* gcc.dg/vect/slp-21.c: Remove not important scanning for SLP.
* gcc.dg/vect/slp-cond-1.c: Expect one more SLP if !vect_load_lanes.
* gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
* gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.
|
|
When change_vec_perm_layout runs into a permute combining two
nodes where one is invariant and one internal the partition of
one input can be -1 but the other might not be. The following
supports this case by simply ignoring inputs with input partiton -1.
I'm not sure this is correct but it avoids ICEing when accessing
that partitions layout for gcc.target/i386/pr98928.c with the
change to avoid splitting store dataref groups during SLP discovery.
* tree-vect-slp.cc (change_vec_perm_layout): Ignore an
input partition of -1.
|
|
SLP permute nodes can end up without a SLP_REPRESENTATIVE now;
the following avoids touching it in this case in vect_schedule_slp_node.
* tree-vect-slp.cc (vect_schedule_slp_node): Avoid looking
at SLP_REPRESENTATIVE for VEC_PERM nodes.
|
|
The following refactors a bit how we perform SLP reduction group
discovery possibly making it easier to have multiple reduction
groups later, esp. with single-lane SLP.
* tree-vect-slp.cc (vect_analyze_slp_instance): Remove
slp_inst_kind_reduc_group handling.
(vect_analyze_slp): Add the meat here.
|
|
The following removes the over-broad rejection of patterns for SLP
reductions which is done by removing them from LOOP_VINFO_REDUCTIONS
during pattern detection. That's also insufficient in case the
pattern only appears on the reduction path. Instead this implements
the proper correctness check in vectorizable_reduction and guides
SLP discovery to heuristically avoid forming later invalid groups.
I also couldn't find any testcase that FAILs when allowing the SLP
reductions to form so I've added one.
I came across this for single-lane SLP reductions with the all-SLP
work where we rely on patterns to properly vectorize COND_EXPR
reductions.
* tree-vect-patterns.cc (vect_pattern_recog_1): Do not
remove reductions involving patterns.
* tree-vect-loop.cc (vectorizable_reduction): Reject SLP
reduction groups with multiple lane-reducing reductions.
* tree-vect-slp.cc (vect_analyze_slp_instance): When discovering
SLP reduction groups avoid including lane-reducing ones.
* gcc.dg/vect/vect-reduc-sad-9.c: New testcase.
|
|
The following notes which lanes are considered live and adds an overload
to produce a graphviz graph for multiple entries into an SLP graph.
* tree-vect-slp.cc (vect_print_slp_tree): Mark live lanes.
(dot_slp_tree): New overload for multiple entries.
|
|
The following plugs a hole with computing whether a SLP node has any
pattern stmts which is important to know when we want to replace it
by a CTOR from external defs.
PR tree-optimization/114799
* tree-vect-slp.cc (vect_get_and_check_slp_defs): Properly
update ->any_pattern when swapping operands.
* gcc.dg/vect/bb-slp-pr114799.c: New testcase.
|
|
The following fixes a DFS walk issue when identifying to-be-ignored
latch edges. We have (bogus) SLP_TREE_REPRESENTATIVEs for VEC_PERM
nodes so those have to be explicitly ignored as possibly being PHIs.
PR tree-optimization/114736
* tree-vect-slp.cc (vect_optimize_slp_pass::is_cfg_latch_edge):
Do not consider VEC_PERM_EXPRs as PHI use.
* gfortran.dg/vect/pr114736.f90: New testcase.
|
|
The following makes sure to record the scalars we add to the BB
reduction vectorization result as scalar uses for the purpose of
computing live lanes. This restores vectorization in the
bondfree.c TU of 435.gromacs.
PR tree-optimization/114057
* tree-vect-slp.cc (vect_bb_slp_mark_live_stmts): Mark
BB reduction remain defs as scalar uses.
|