path: root/gcc/tree-vect-slp.cc
2024-09-12  Abort loop SLP analysis quicker  (Richard Biener, 1 file, -1/+9)

As we can't cope with removed SLP instances during analysis there's no point in doing that or even continuing analysis of SLP instances after a failure. The following makes us abort early.

	* tree-vect-slp.cc (vect_slp_analyze_operations): When doing loop analysis fail after the first failed SLP instance. Only remove instances when doing BB vectorization.
	* tree-vect-loop.cc (vect_analyze_loop_2): Check whether vect_slp_analyze_operations failed instead of checking the number of SLP instances remaining.
2024-09-12  Better recover from SLP reassociation fails during discovery  (Richard Biener, 1 file, -15/+14)

When we decide not to process an association chain of size two, and that would also mismatch with a different chain size on another lane, we shouldn't fail discovery hard at this point. Instead let the regular discovery figure out matching lanes so the parent can decide to perform operand swapping, or so we can split groups at better points rather than forcefully splitting away the first single lane. For example on gcc.dg/vect/vect-strided-u8-i8.c we now see two groups of size four feeding the store instead of groups of size one, three, two, one and one.

	* tree-vect-slp.cc (vect_build_slp_tree_2): On reassociation chain length mismatch do not fail discovery of the node but try without re-associating to compute a better matches[]. Provide a reassociation failure hint in the dump.
	(vect_slp_analyze_node_operations): Avoid stray failure dumping.
	(vectorizable_slp_permutation_1): Dump the address of the SLP node representing the permutation.
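A hypothetical illustration of the chain-size mismatch described above (this is not the cited testcase): lane 0 of the store group is fed by an addition chain of length two while lane 1 is fed by a chain of length three, so per-lane reassociation chain sizes disagree.

    /* Sketch only: the two lanes have addition chains of different
       lengths, the situation the recovery above is about.  */
    void
    foo (int *restrict a, int *restrict b, int *restrict c, int *restrict d)
    {
      for (int i = 0; i < 1024; ++i)
        {
          a[2*i]     = b[2*i] + c[2*i];                /* chain of two */
          a[2*i + 1] = b[2*i + 1] + c[2*i + 1] + d[i]; /* chain of three */
        }
    }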
2024-09-10  tree-optimization/116658 - latent issue in vect_is_slp_load_node  (Richard Biener, 1 file, -3/+4)

Permute nodes do not have a representative so we have to guard vect_is_slp_load_node against those.

	PR tree-optimization/116658
	* tree-vect-slp.cc (vect_is_slp_load_node): Make sure node isn't a permute.
	* g++.dg/vect/pr116658.cc: New testcase.
2024-09-06  Fix SLP double-reduction support  (Richard Biener, 1 file, -1/+3)

When doing SLP discovery I forgot to handle double reductions even though they are already queued in LOOP_VINFO_REDUCTIONS.

	* tree-vect-slp.cc (vect_analyze_slp): Also handle discovery for double reductions.
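For reference, a typical double-reduction shape (my sketch, not taken from the commit): the outer-loop PHI for sum provides the initial value of the inner-loop reduction, and the inner-loop result feeds back into the outer-loop PHI.

    /* Assumed example of a double reduction when vectorizing the
       outer loop.  */
    int
    foo (int a[64][64])
    {
      int sum = 0;
      for (int i = 0; i < 64; ++i)
        for (int j = 0; j < 64; ++j)
          sum += a[i][j];
      return sum;
    }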
2024-09-06  Handle non-grouped stores as single-lane SLP  (Richard Biener, 1 file, -9/+36)

The following enables single-lane loop SLP discovery for non-grouped stores and adjusts vectorizable_store to properly handle those. For gfortran.dg/vect/vect-8.f90 we vectorize one additional loop, not running into the "not falling back to strided accesses" bail-out. I have not investigated in detail.

There is a set of i386 target assembler test FAILs; gcc.target/i386/pr88531-2[bc].c in particular fail because the target cannot identify SLP emulated gathers, see another mail from me. Others need adjustment; I've adjusted one with this patch only. In particular there are gcc.target/i386/cond_op_fma_*-1.c FAILs that are because we no longer fold a VEC_COND_EXPR during the region value-numbering we do after vectorization, since we code-generate a { 0.0, ... } constant in the VEC_COND_EXPR now instead of having a separate statement which gets forwarded and then triggers folding. This leads to slightly different code generation. The solution is probably to use gimple_build when building stmts or, in this case, directly emit .COND_FMA instead of .FMA and a VEC_COND_EXPR.

gcc.dg/vect/slp-19a.c mixes contiguous 8-lane SLP with a single-lane contiguous store from one lane of the 8-lane load, and we expect to use load-lanes for this reason, but the heuristic for forcing single-lane rediscovery as implemented doesn't trigger here as it treats both SLP instances separately. FAILs on RISC-V.

gcc.dg/vect/slp-19c.c shows we fail to implement an interleaving scheme for group_size 12 (by extension, using the group_size 3 scheme to reduce to 4 lanes and then continuing with a pow2 scheme would work); we are also not considering load-lanes because of the above reason, but aarch64 cannot do ld12. FAILs on AARCH64 (load requires three vectors) and x86_64.

gcc.dg/vect/slp-19c.c FAILs with variable-length vectors because of "SLP induction not supported for variable-length vectors".

gcc.target/aarch64/pr110449.c will FAIL because the (contested) optimization in r14-2367-g224fd59b2dc8a5 was only applied to loop-vect but not SLP vect. I'll leave it to target maintainers to either XFAIL (the optimization is bad) or remove the test.

	* tree-vect-slp.cc (vect_analyze_slp): Perform single-lane loop SLP discovery for non-grouped stores. Move check on the root for re-doing SLP analysis with a single lane for load/store-lanes earlier and make sure we are dealing with a grouped access.
	* tree-vect-stmts.cc (vectorizable_store): Always set vec_num for SLP.
	* gcc.dg/vect/O3-pr39675-2.c: Adjust expected number of SLP.
	* gcc.dg/vect/fast-math-vect-call-1.c: Likewise.
	* gcc.dg/vect/no-scevccp-slp-31.c: Likewise.
	* gcc.dg/vect/slp-12b.c: Likewise.
	* gcc.dg/vect/slp-12c.c: Likewise.
	* gcc.dg/vect/slp-19a.c: Likewise.
	* gcc.dg/vect/slp-19b.c: Likewise.
	* gcc.dg/vect/slp-4-big-array.c: Likewise.
	* gcc.dg/vect/slp-4.c: Likewise.
	* gcc.dg/vect/slp-5.c: Likewise.
	* gcc.dg/vect/slp-7.c: Likewise.
	* gcc.dg/vect/slp-perm-7.c: Likewise.
	* gcc.dg/vect/slp-37.c: Likewise.
	* gcc.dg/vect/fast-math-vect-call-2.c: Likewise.
	* gcc.dg/vect/slp-26.c: RISC-V can now SLP two instances.
	* gcc.dg/vect/vect-outer-slp-3.c: Disable vectorization of initialization loop.
	* gcc.dg/vect/slp-reduc-5.c: Likewise.
	* gcc.dg/vect/no-scevccp-outer-12.c: Un-XFAIL. SLP can handle inner loop inductions with multiple vector stmt copies.
	* gfortran.dg/vect/vect-8.f90: Adjust expected number of vectorized loops.
	* gcc.target/i386/vectorize1.c: Adjust what we scan for.
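A minimal sketch of the newly handled case (illustrative only): the store to a[i] is not part of any interleaving group, so it previously wasn't a loop SLP discovery root.

    /* Non-grouped store, now discovered as single-lane SLP.  */
    void
    foo (int *restrict a, const int *restrict b, int n)
    {
      for (int i = 0; i < n; ++i)
        a[i] = b[i] + 7;
    }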
2024-09-05  Handle unused-only-live stmts in SLP discovery  (Richard Biener, 1 file, -0/+30)

The following adds SLP discovery for roots that are only live but otherwise unused. These are usually inductions. This allows a few more testcases to be handled fully with SLP, for example gcc.dg/vect/no-scevccp-pr86725-1.c.

	* tree-vect-slp.cc (vect_analyze_slp): Analyze SLP for live but otherwise unused defs.
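A hypothetical shape of such a root (the no-scevccp prefix of the cited testcase suggests these survive when final-value replacement is disabled): the induction j is unused inside the loop but its final value is live after it.

    /* Sketch: j is a live-but-otherwise-unused induction.  */
    int
    foo (int *restrict a, int n)
    {
      int j = 0;
      for (int i = 0; i < n; ++i)
        {
          a[i] = 5;
          j += 2;
        }
      return j;  /* only-live use of the induction */
    }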
2024-09-04  Also lower SLP grouped loads with just one consumer  (Richard Biener, 1 file, -18/+38)

This makes sure to produce interleaving schemes or load-lanes for single-element interleaving and other permutes that otherwise would use more than three vectors. It exposes the latent issue that single-element interleaving with large gaps can be inefficient - the mitigation in get_group_load_store_type doesn't trigger when we clear the load permutation.

It also exposes the fact that not all permutes can be lowered in the best way in a vector-length-agnostic manner, so I've added an exception to keep power-of-two size contiguous aligned chunks unlowered (unless we want load-lanes). The optimal handling of load/store vectorization is going to continue to be a learning process.

	* tree-vect-slp.cc (vect_lower_load_permutations): Also process single-use grouped loads. Avoid lowering contiguous aligned power-of-two sized chunks, those are better handled by the vector size specific SLP code generation.
	* tree-vect-stmts.cc (get_group_load_store_type): Drop the unrelated requirement of a load permutation for the single-element interleaving limit.
	* gcc.dg/vect/slp-46.c: Remove XFAIL.
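An assumed illustration of single-element interleaving with a large gap: only one element out of each group of four is actually loaded.

    /* Sketch: a single-element interleaving load with gaps.  */
    void
    foo (int *restrict a, const int *restrict b)
    {
      for (int i = 0; i < 256; ++i)
        a[i] = b[4*i];
    }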
2024-09-03  Dump whether a SLP node represents load/store-lanes  (Richard Biener, 1 file, -2/+5)

This makes it easier to discover whether SLP load or store nodes participate in load/store-lanes accesses.

	* tree-vect-slp.cc (vect_print_slp_tree): Annotate load and store-lanes nodes.
2024-09-03  tree-optimization/116575 - avoid ICE with SLP mask_load_lane  (Richard Biener, 1 file, -2/+17)

The following avoids performing re-discovery with single lanes in the attempt to allow the use of mask_load_lane, as rediscovery will fail since a single lane of a mask load will appear permuted, which isn't supported.

	PR tree-optimization/116575
	* tree-vect-slp.cc (vect_analyze_slp): Properly compute the mask argument for vect_load/store_lanes_supported. When the load is masked, for now avoid rediscovery.
	* gcc.dg/vect/pr116575.c: New testcase.
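An assumed shape of a mask_load_lanes candidate (not the actual testcase): after if-conversion the interleaved loads from b become masked grouped loads.

    /* Sketch: conditional access to an interleaved load group.  */
    void
    foo (int *restrict a, const int *restrict b, const int *restrict c)
    {
      for (int i = 0; i < 1024; ++i)
        if (c[i])
          a[i] = b[2*i] + b[2*i + 1];
    }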
2024-09-03  Handle mixing REALPART/IMAGPART with other components in SLP groups  (Richard Biener, 1 file, -2/+4)

The following makes sure we handle a SLP load/store group from a structure with complex and scalar members. This for example happens in gcc.target/i386/pr106010-9a.c.

	* tree-vect-slp.cc (vect_build_slp_tree_1): Handle mixing all of handled components besides ARRAY_RANGE_REF, drop handling of INDIRECT_REF.
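An illustrative sketch (not the cited testcase) of a group mixing REALPART/IMAGPART accesses with a plain scalar member:

    /* Sketch: the three accesses per element form one load/store
       group mixing complex parts and a scalar.  */
    struct s { _Complex float c; float f; };

    void
    foo (struct s *restrict d, const struct s *restrict e)
    {
      for (int i = 0; i < 1024; ++i)
        {
          __real__ d[i].c = __real__ e[i].c;
          __imag__ d[i].c = __imag__ e[i].c;
          d[i].f = e[i].f;
        }
    }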
2024-09-02  load and store-lanes with SLP  (Richard Biener, 1 file, -17/+238)

The following is a prototype for how to represent load/store-lanes within SLP. I've for now settled with having a single load node with multiple permute nodes acting as selection, one for each loaded lane, and a single store node fed from all stored lanes. For

    for (int i = 0; i < 1024; ++i)
      {
        a[2*i] = b[2*i] + 7;
        a[2*i+1] = b[2*i+1] * 3;
      }

you have the following SLP graph, where I explain how things are set up and code-generated:

    t.c:23:21: note:   SLP graph after lowering permutations:
    t.c:23:21: note:   node 0x50dc8b0 (max_nunits=1, refcnt=1) vector(4) int
    t.c:23:21: note:   op template: *_6 = _7;
    t.c:23:21: note:     stmt 0 *_6 = _7;
    t.c:23:21: note:     stmt 1 *_12 = _13;
    t.c:23:21: note:     children 0x50dc488 0x50dc6e8

This is the store node, it's marked with ldst_lanes = true during SLP discovery. This node code-generates

    vect_array.65[0] = vect__7.61_29;
    vect_array.65[1] = vect__13.62_28;
    MEM <int[8]> [(int *)vectp_a.63_27] = .STORE_LANES (vect_array.65);

...

    t.c:23:21: note:   node 0x50dc520 (max_nunits=4, refcnt=2) vector(4) int
    t.c:23:21: note:   op: VEC_PERM_EXPR
    t.c:23:21: note:     stmt 0 _5 = *_4;
    t.c:23:21: note:     lane permutation { 0[0] }
    t.c:23:21: note:     children 0x50dc948
    t.c:23:21: note:   node 0x50dc780 (max_nunits=4, refcnt=2) vector(4) int
    t.c:23:21: note:   op: VEC_PERM_EXPR
    t.c:23:21: note:     stmt 0 _11 = *_10;
    t.c:23:21: note:     lane permutation { 0[1] }
    t.c:23:21: note:     children 0x50dc948

These are the selection nodes, marked with ldst_lanes = true. They code-generate nothing.

    t.c:23:21: note:   node 0x50dc948 (max_nunits=4, refcnt=3) vector(4) int
    t.c:23:21: note:   op template: _5 = *_4;
    t.c:23:21: note:     stmt 0 _5 = *_4;
    t.c:23:21: note:     stmt 1 _11 = *_10;
    t.c:23:21: note:     load permutation { 0 1 }

This is the load node, marked with ldst_lanes = true (the load permutation is only accurate when taking into account the lane permute in the selection nodes). It code-generates

    vect_array.58 = .LOAD_LANES (MEM <int[8]> [(int *)vectp_b.56_33]);
    vect__5.59_31 = vect_array.58[0];
    vect__5.60_30 = vect_array.58[1];

This scheme allows us to leave code generation in vectorizable_load/store mostly as-is.

While this should support both load-lanes and (masked) store-lanes, the decision to do either is made during SLP discovery time and cannot be reversed without altering the SLP tree - as-is the SLP tree is not usable for non-store-lanes on the store side; the load side is OK representation-wise but will very likely fail permute handling as the lowering to deal with the two input vector restriction isn't done - but of course since the permute node is marked as to be ignored that doesn't work out. So I've put restrictions in place that fail vectorization if a load/store-lane SLP tree is later classified differently by get_load_store_type.

I'll note that for example gcc.target/aarch64/sve/mask_struct_store_3.c will not get SLP store-lanes used because the full store SLPs just fine, though we then fail to handle the "splat" load-permutation

    t2.c:5:21: note:   node 0x4db2630 (max_nunits=4, refcnt=2) vector([4,4]) int
    t2.c:5:21: note:   op template: _6 = *_5;
    t2.c:5:21: note:     stmt 0 _6 = *_5;
    t2.c:5:21: note:     stmt 1 _6 = *_5;
    t2.c:5:21: note:     stmt 2 _6 = *_5;
    t2.c:5:21: note:     stmt 3 _6 = *_5;
    t2.c:5:21: note:     load permutation { 0 0 0 0 }

as the load permute lowering code currently doesn't consider it worth lowering single loads from a group (or in this case not grouped loads). The expectation is that the target can handle this by two interleaves with itself.

So what we see here is that while the explicit SLP representation is helpful in some cases, in cases like this it would require changing it when we make decisions how to vectorize. My idea is that this all will change a lot when we re-do SLP discovery (for loops) and when we get rid of non-SLP, as I think vectorizable_* should be allowed to alter the SLP graph during analysis.

The patch also removes the code cancelling SLP if we can use load/store-lanes from the main loop vector analysis code and re-implements it as re-discovering the SLP instance with forced single-lane splits so the SLP load/store-lanes scheme can be used. This is now done after SLP discovery and SLP pattern recog are complete, to not disturb the latter, but per SLP instance instead of being a global decision on the whole loop. This is a behavioral change that for example shows in gcc.dg/vect/slp-perm-6.c on ARM where we formerly used SLP permutes but now a mix of SLP without permutes and load/store lanes. The previous flaky heuristic is now flaky in a different way.

Testing on RISC-V and aarch64 reveals several testcases that require adjustment so as to now expect SLP even when load/store lanes are being used. If in doubt I've adjusted them to the final expectation, which will lead to one or two new FAILs where we still do the SLP cancelling. I have a followup that implements that while remaining in SLP that's in final testing.

Note that gcc.dg/vect/slp-42.c and gcc.dg/vect/pr68445.c will FAIL on aarch64 with SVE because for some odd reason vect_stridedN is true for any N for check_effective_target_vect_fully_masked targets, but SVE cannot do ld8 while risc-v can. I have not bothered to adjust target tests that now fail assembly-scan.

	* tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark load, store and permute nodes.
	* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize ldst_lanes.
	(vect_build_slp_instance): For stores iff the target prefers store-lanes discover single-lane sub-groups, do not perform interleaving lowering but mark the node with ldst_lanes. Also allow i == 0 - fatal failure - for splitting up a store group when we're not doing single-lane discovery already.
	(vect_lower_load_permutations): When the target supports load lanes and the loads all fit the pattern split out a single level of permutes only and mark the load and permute nodes with ldst_lanes.
	(vectorizable_slp_permutation_1): Handle the load-lane permute forwarding of vector defs.
	(vect_analyze_slp): After SLP pattern recog is finished see if there are any SLP instances that would benefit from using load/store-lanes and re-discover those with forced single lanes.
	* tree-vect-stmts.cc (get_group_load_store_type): Support load/store-lanes for SLP.
	(vectorizable_store): Support SLP code generation for store-lanes.
	(vectorizable_load): Support SLP code generation for load-lanes.
	* tree-vect-loop.cc (vect_analyze_loop_2): Do not cancel SLP when store-lanes can be used.
	* gcc.dg/vect/slp-55.c: New testcase.
	* gcc.dg/vect/slp-56.c: Likewise.
	* gcc.dg/vect/slp-11c.c: Adjust.
	* gcc.dg/vect/slp-53.c: Likewise.
	* gcc.dg/vect/slp-cond-1.c: Likewise.
	* gcc.dg/vect/vect-complex-5.c: Likewise.
	* gcc.dg/vect/slp-1.c: Likewise.
	* gcc.dg/vect/slp-54.c: Remove riscv XFAIL.
	* gcc.dg/vect/slp-perm-5.c: Adjust.
	* gcc.dg/vect/slp-perm-7.c: Likewise.
	* gcc.dg/vect/slp-perm-8.c: Likewise.
	* gcc.dg/vect/slp-multitypes-11.c: Likewise.
	* gcc.dg/vect/slp-multitypes-11-big-array.c: Likewise.
	* gcc.dg/vect/slp-perm-9.c: Remove expected SLP fail due to three-vector permute.
	* gcc.dg/vect/slp-perm-6.c: Remove XFAIL.
	* gcc.dg/vect/slp-perm-1.c: Adjust.
	* gcc.dg/vect/slp-perm-2.c: Likewise.
	* gcc.dg/vect/slp-perm-3.c: Likewise.
	* gcc.dg/vect/slp-perm-4.c: Likewise.
	* gcc.dg/vect/pr68445.c: Likewise.
	* gcc.dg/vect/slp-11b.c: Likewise.
	* gcc.dg/vect/slp-2.c: Likewise.
	* gcc.dg/vect/slp-23.c: Likewise.
	* gcc.dg/vect/slp-33.c: Likewise.
	* gcc.dg/vect/slp-42.c: Likewise.
	* gcc.dg/vect/slp-46.c: Likewise.
	* gcc.dg/vect/slp-perm-10.c: Likewise.
2024-09-02  lower SLP load permutation to interleaving  (Richard Biener, 1 file, -2/+345)

The following emulates classical interleaving for SLP load permutes that we are unlikely handling natively. This is to handle cases where interleaving (or load/store-lanes) is the optimal choice for vectorizing even when we are doing that within SLP. An example would be

    void foo (int * __restrict a, int * b)
    {
      for (int i = 0; i < 16; ++i)
        {
          a[4*i + 0] = b[4*i + 0] * 3;
          a[4*i + 1] = b[4*i + 1] + 3;
          a[4*i + 2] = (b[4*i + 2] * 3 + 3);
          a[4*i + 3] = b[4*i + 3] * 3;
        }
    }

where currently the SLP store is merging four single-lane SLP sub-graphs but none of the loads in it can be code-generated with V4SImode vectors and a VF of four, as the permutes would need three vectors.

The patch introduces a lowering phase after SLP discovery but before SLP pattern recognition or permute optimization that analyzes all loads from the same dataref group and creates an interleaving scheme starting from an unpermuted load. What can be handled is power-of-two group size and a group size of three. The possibility of doing the interleaving with a load-lanes like instruction is left as a followup.

For a group-size of three this is done by using the non-interleaving fallback code which then creates at VF == 4 from

    { { a0, b0, c0 }, { a1, b1, c1 }, { a2, b2, c2 }, { a3, b3, c3 } }

the intermediate vectors { c0, c0, c1, c1 } and { c2, c2, c3, c3 } to produce { c0, c1, c2, c3 }. This turns out to be more effective than the scheme implemented for non-SLP for SSE, only slightly worse for AVX512 and a bit worse still for AVX2. It seems to me that this would extend to other non-power-of-two group-sizes though (but the patch does not). Optimal schemes are likely difficult to lay out in VF agnostic form.

I'll note that while the lowering assumes even/odd extract is generally available for all vector element sizes (which is probably a good assumption), it doesn't in any way constrain the other permutes it generates based on target availability. Again difficult to do in a VF agnostic way (but at least currently the vector type is fixed).

I'll also note that the SLP store side merges lanes in a way producing three-vector permutes for store group-size of three, so the testcase uses a store group-size of four.

The patch has a fallback for when there are multi-lane groups and the resulting permutes do not fit interleaving. Code generation is not optimal when this triggers and might be worse than doing single-lane group interleaving.

The patch handles gaps by representing them with NULL entries in SLP_TREE_SCALAR_STMTS for the unpermuted load node. The SLP discovery changes could be elided if we manually built the load node instead.

SLP load nodes covering enough lanes to not need intermediate permutes are retained as having a load-permutation and do not use the single SLP load node for each dataref group. That's something we might want to change, making load-permutation something purely local to SLP discovery (but then SLP discovery could do part of the lowering).

The patch misses CSEing intermediate generated permutes and registering them with the bst_map, which is possibly required for SLP pattern detection in some cases - this re-spin of the patch moves the lowering after SLP pattern detection.

	* tree-vect-slp.cc (vect_build_slp_tree_1): Handle NULL stmt.
	(vect_build_slp_tree_2): Likewise. Release load permutation when there's a NULL in SLP_TREE_SCALAR_STMTS and assert there's no actual permutation in that case.
	(vllp_cmp): New function.
	(vect_lower_load_permutations): Likewise.
	(vect_analyze_slp): Call it.
	* gcc.dg/vect/slp-11a.c: Expect SLP.
	* gcc.dg/vect/slp-12a.c: Likewise.
	* gcc.dg/vect/slp-51.c: New testcase.
	* gcc.dg/vect/slp-52.c: New testcase.
2024-08-30  Do not bother with reassociation in SLP discovery for single-lane  (Richard Biener, 1 file, -0/+2)

It just clutters the dump files and takes up compile-time.

	* tree-vect-slp.cc (vect_build_slp_tree_2): Disable SLP reassociation for single-lane.
2024-08-29  Use std::unique_ptr for optinfo_item  (David Malcolm, 1 file, -0/+1)

As preliminary work towards an overhaul of how optinfo_items interact with dump_pretty_printer, replace uses of optinfo_item * with std::unique_ptr<optinfo_item> to make ownership clearer. No functional change intended.

gcc/ChangeLog:
	* config/aarch64/aarch64.cc: Define INCLUDE_MEMORY.
	* config/arm/arm.cc: Likewise.
	* config/i386/i386.cc: Likewise.
	* config/loongarch/loongarch.cc: Likewise.
	* config/riscv/riscv-vector-costs.cc: Likewise.
	* config/riscv/riscv.cc: Likewise.
	* config/rs6000/rs6000.cc: Likewise.
	* dump-context.h (dump_context::emit_item): Convert "item" param from * to const &.
	(dump_pretty_printer::stash_item): Convert "item" param from optinfo_ * to std::unique_ptr<optinfo_item>.
	(dump_pretty_printer::emit_item): Likewise.
	* dumpfile.cc: Include "make-unique.h".
	(make_item_for_dump_gimple_stmt): Replace uses of optinfo_item * with std::unique_ptr<optinfo_item>.
	(dump_context::dump_gimple_stmt): Likewise.
	(make_item_for_dump_gimple_expr): Likewise.
	(dump_context::dump_gimple_expr): Likewise.
	(make_item_for_dump_generic_expr): Likewise.
	(dump_context::dump_generic_expr): Likewise.
	(make_item_for_dump_symtab_node): Likewise.
	(dump_pretty_printer::emit_items): Likewise.
	(dump_pretty_printer::emit_any_pending_textual_chunks): Likewise.
	(dump_pretty_printer::emit_item): Likewise.
	(dump_pretty_printer::stash_item): Likewise.
	(dump_pretty_printer::decode_format): Likewise.
	(dump_context::dump_printf_va): Fix overlong line.
	(make_item_for_dump_dec): Replace uses of optinfo_item * with std::unique_ptr<optinfo_item>.
	(dump_context::dump_dec): Likewise.
	(dump_context::dump_symtab_node): Likewise.
	(dump_context::begin_scope): Likewise.
	(dump_context::emit_item): Likewise.
	* gimple-loop-interchange.cc: Define INCLUDE_MEMORY.
	* gimple-loop-jam.cc: Likewise.
	* gimple-loop-versioning.cc: Likewise.
	* graphite-dependences.cc: Likewise.
	* graphite-isl-ast-to-gimple.cc: Likewise.
	* graphite-optimize-isl.cc: Likewise.
	* graphite-poly.cc: Likewise.
	* graphite-scop-detection.cc: Likewise.
	* graphite-sese-to-poly.cc: Likewise.
	* graphite.cc: Likewise.
	* opt-problem.cc: Likewise.
	* optinfo.cc (optinfo::add_item): Convert "item" param from optinfo_ * to std::unique_ptr<optinfo_item>.
	(optinfo::emit_for_opt_problem): Update for change to dump_context::emit_item.
	* optinfo.h: Add #error to fail immediately if INCLUDE_MEMORY wasn't defined, rather than fail to find std::unique_ptr.
	(optinfo::add_item): Convert "item" param from optinfo_ * to std::unique_ptr<optinfo_item>.
	* sese.cc: Define INCLUDE_MEMORY.
	* targhooks.cc: Likewise.
	* tree-data-ref.cc: Likewise.
	* tree-if-conv.cc: Likewise.
	* tree-loop-distribution.cc: Likewise.
	* tree-parloops.cc: Likewise.
	* tree-predcom.cc: Likewise.
	* tree-ssa-live.cc: Likewise.
	* tree-ssa-loop-ivcanon.cc: Likewise.
	* tree-ssa-loop-ivopts.cc: Likewise.
	* tree-ssa-loop-prefetch.cc: Likewise.
	* tree-ssa-loop-unswitch.cc: Likewise.
	* tree-ssa-phiopt.cc: Likewise.
	* tree-ssa-threadbackward.cc: Likewise.
	* tree-ssa-threadupdate.cc: Likewise.
	* tree-vect-data-refs.cc: Likewise.
	* tree-vect-generic.cc: Likewise.
	* tree-vect-loop-manip.cc: Likewise.
	* tree-vect-loop.cc: Likewise.
	* tree-vect-patterns.cc: Likewise.
	* tree-vect-slp-patterns.cc: Likewise.
	* tree-vect-slp.cc: Likewise.
	* tree-vect-stmts.cc: Likewise.
	* tree-vectorizer.cc: Likewise.

gcc/testsuite/ChangeLog:
	* gcc.dg/plugin/dump_plugin.c: Define INCLUDE_MEMORY.

Signed-off-by: David Malcolm <dmalcolm@redhat.com>
2024-08-28  Add debug overload for slp_instance  (Richard Biener, 1 file, -0/+9)

I found it helpful to be able to print a whole SLP instance from gdb.

	* tree-vect-slp.cc (debug): Add overload for slp_instance.
2024-08-28  Fix leak of SLP nodes when building store interleaving  (Richard Biener, 1 file, -0/+4)

The following fixes a leak of the discovered single-lane store SLP nodes from which we only use their children. This uncovers a latent reference counting issue in the interleaving build where we fail to increment their reference count.

	* tree-vect-slp.cc (vect_build_slp_store_interleaving): Fix reference counting.
	(vect_build_slp_instance): Release rhs_nodes.
2024-08-28  Split out vect_build_slp_store_interleaving  (Richard Biener, 1 file, -174/+182)

This splits out SLP store interleaving into a separate function.

	* tree-vect-slp.cc (vect_build_slp_store_interleaving): Split out from ...
	(vect_build_slp_instance): Here.
2024-08-20  tree-optimization/116274 - overzealous SLP vectorization  (Richard Biener, 1 file, -1/+11)

The following tries to address the fact that the vectorizer lacks precise knowledge of argument and return calling conventions and views some accesses as loads and stores that are not. This is mainly important when doing basic-block vectorization, as otherwise loop indexing would force such arguments to memory.

On x86 the reduction in the number of apparent loads and stores often dominates cost analysis, so the following tries to mitigate this aggressively by adjusting only the scalar load and store cost, reducing them to the cost of a simple scalar statement, but not touching the vector access cost, which would be much harder to estimate. Thereby we err on the side of not performing basic-block vectorization.

	PR tree-optimization/116274
	* tree-vect-slp.cc (vect_bb_slp_scalar_cost): Cost scalar loads and stores as simple scalar stmts when they access a non-global, not address-taken variable that doesn't have BLKmode assigned.
	* gcc.target/i386/pr116274-2.c: New testcase.
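A hypothetical shape of the problem (not the actual testcase): the aggregate argument and return value are passed in registers per the calling convention, so the component accesses below need not be real memory loads and stores even though the GIMPLE IL shows them as such.

    /* Sketch: p and the return value need not live in memory.  */
    struct pair { double a, b; };

    struct pair
    flip (struct pair p)
    {
      struct pair r;
      r.a = p.b;
      r.b = p.a;
      return r;
    }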
2024-08-08  Rearrange SLP nodes with duplicate statements [PR98138]  (Manolis Tsamis, 1 file, -0/+114)

This change checks when a two_operators SLP node has multiple occurrences of the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the operands so that there are no duplicates. Two vec_perm expressions are then introduced to recreate the original ordering. These duplicates can appear due to how two_operators nodes are handled, and they prevent vectorization in some cases.

This targets the vectorization of the SPEC2017 x264 pixel_satd functions. In some processors a larger than 10% improvement on x264 has been observed.

	PR tree-optimization/98138

gcc/ChangeLog:
	* tree-vect-slp.cc: Avoid duplicates in two_operators nodes.

gcc/testsuite/ChangeLog:
	* gcc.target/aarch64/vect-slp-two-operator.c: New test.
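A hedged illustration of how such duplicates arise (a satd-like shape, not the actual x264 code): the +/- lanes of the two_operators store node reuse the same sum and difference statements, so a child node can come out as {s0, s0, s1, s1} or {d0, d0, d1, d1}.

    /* Sketch: two_operators lanes reusing the same statements.  */
    void
    satd_like (int *restrict out, const int *restrict in)
    {
      int s0 = in[0] + in[1];
      int d0 = in[0] - in[1];
      int s1 = in[2] + in[3];
      int d1 = in[2] - in[3];
      out[0] = s0 + d0;
      out[1] = s0 - d0;
      out[2] = s1 + d1;
      out[3] = s1 - d1;
    }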
2024-07-25  tree-optimization/116083 - improve behavior when SLP discovery limit is reached  (Richard Biener, 1 file, -14/+12)

The following avoids some useless work when the SLP discovery limit is reached, for example allocating a node to cache the failure and starting discovery on split store groups when analyzing BBs. It does not address the issue in the PR, which is an overly generous budget for discovery when the store group size approaches the number of overall statements.

	PR tree-optimization/116083
	* tree-vect-slp.cc (vect_build_slp_tree): Do not allocate a discovery fail node when we reached the discovery limit.
	(vect_build_slp_instance): Terminate early when the discovery limit is reached.
2024-07-17  vect: Refit lane-reducing to be normal operation  (Feng Xue, 1 file, -20/+7)

The number of vector stmts of an operation is calculated based on the output vectype. This is over-estimated for a lane-reducing operation, which would cause vector def/use mismatches when we want to support loop reduction mixed with lane-reducing and normal operations. One solution is to refit lane-reducing to make it behave like a normal one, by adding new pass-through copies to fix possible def/use gaps. The resultant superfluous statements could be optimized away after vectorization. For example:

    int sum = 1;
    for (i)
      {
        sum += d0[i] * d1[i];   // dot-prod <vector(16) char>
      }

The vector size is 128-bit, the vectorization factor is 16. Reduction statements would be transformed as:

    vector<4> int sum_v0 = { 0, 0, 0, 1 };
    vector<4> int sum_v1 = { 0, 0, 0, 0 };
    vector<4> int sum_v2 = { 0, 0, 0, 0 };
    vector<4> int sum_v3 = { 0, 0, 0, 0 };

    for (i / 16)
      {
        sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
        sum_v1 = sum_v1;  // copy
        sum_v2 = sum_v2;  // copy
        sum_v3 = sum_v3;  // copy
      }

    sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0

2024-07-02  Feng Xue  <fxue@os.amperecomputing.com>

gcc/
	* tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): Calculate effective vector stmts number with generic vect_get_num_copies.
	(vect_transform_reduction): Insert copies for lane-reducing so as to fix over-estimated vector stmts number.
	(vect_transform_cycle_phi): Calculate vector PHI number only based on output vectype.
	* tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove adjustment on vector stmts number specific to slp reduction.
2024-07-17  vect: Add a unified vect_get_num_copies for slp and non-slp  (Feng Xue, 1 file, -16/+3)

Extend the original vect_get_num_copies (pure loop-based) to calculate the number of vector stmts for an slp node regarding a generic vect region.

2024-07-12  Feng Xue  <fxue@os.amperecomputing.com>

gcc/
	* tree-vectorizer.h (vect_get_num_copies): New overload function.
	* tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Calculate number of vector stmts for slp node with vect_get_num_copies.
	(vect_slp_analyze_node_operations): Calculate number of vector elements for constant/external slp node with vect_get_num_copies.
2024-07-05  Support group size of three in SLP store permute lowering  (Richard Biener, 1 file, -1/+64)

The following implements the group-size three scheme from vect_permute_store_chain in SLP grouped store permute lowering and extends it to power-of-two multiples of group size three.

The scheme goes from vectors A, B and C to { A[0], B[0], C[0], A[1], B[1], C[1], ... } by first producing { A[0], B[0], X, A[1], B[1], X, ... } (with X random, but chosen to be A[n]) and then permuting in C[n] in the appropriate places. The extension replaces single vector elements by a power-of-two number of lanes, giving pairwise interleaving until the final three-input permutes happen. The last permute step could be seen as extending C to { C[0], C[0], C[0], ... } and then performing a blend.

VLA archs will want to use store-lanes here, I guess; I'm not sure if the three-vector interleave operation is also available with a register source and destination and thus available for a shuffle.

	* tree-vect-slp.cc (vect_build_slp_instance): Special case three input permute with the same number of lanes in store permute lowering.
	* gcc.dg/vect/slp-53.c: New testcase.
	* gcc.dg/vect/slp-54.c: New testcase.
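A minimal sketch of a group-size-three store that the scheme above targets (assumed example, not one of the new testcases):

    /* Sketch: a store group of size three built from vectors of
       b, c and d values.  */
    void
    foo (int *restrict a, const int *restrict b,
         const int *restrict c, const int *restrict d)
    {
      for (int i = 0; i < 1024; ++i)
        {
          a[3*i + 0] = b[i];
          a[3*i + 1] = c[i];
          a[3*i + 2] = d[i];
        }
    }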
2024-07-03  Remove redundant vector permute dump  (Richard Biener, 1 file, -10/+0)

The following removes redundant dumping in vect permute vectorization.

	* tree-vect-slp.cc (vectorizable_slp_permutation_1): Remove redundant dump.
2024-07-03  Handle NULL stmt in SLP_TREE_SCALAR_STMTS  (Richard Biener, 1 file, -29/+47)

The following starts to handle NULL elements in SLP_TREE_SCALAR_STMTS, with the first candidate being the two-operator nodes where some lanes are do-not-care and also do not have a scalar stmt computing the result. I originally added SLP_TREE_SCALAR_STMTS to two-operator nodes, but this exposes PR115764, so I've split that out. I have a patch to use NULL elements for loads from groups with gaps, where we currently get around not doing that by having a load permutation.

	* tree-vect-slp.cc (bst_traits::hash): Handle NULL elements in SLP_TREE_SCALAR_STMTS.
	(vect_print_slp_tree): Likewise.
	(vect_mark_slp_stmts): Likewise.
	(vect_mark_slp_stmts_relevant): Likewise.
	(vect_find_last_scalar_stmt_in_slp): Likewise.
	(vect_bb_slp_mark_live_stmts): Likewise.
	(vect_slp_prune_covered_roots): Likewise.
	(vect_bb_partition_graph_r): Likewise.
	(vect_remove_slp_scalar_calls): Likewise.
	(vect_slp_gather_vectorized_scalar_stmts): Likewise.
	(vect_bb_slp_scalar_cost): Likewise.
	(vect_contains_pattern_stmt_p): Likewise.
	(vect_slp_convert_to_external): Likewise.
	(vect_find_first_scalar_stmt_in_slp): Likewise.
	(vect_optimize_slp_pass::remove_redundant_permutations): Likewise.
	(vect_slp_analyze_node_operations_1): Likewise.
	(vect_schedule_slp_node): Likewise.
	* tree-vect-stmts.cc (can_vectorize_live_stmts): Likewise.
	(vectorizable_shift): Likewise.
	* tree-vect-data-refs.cc (vect_slp_analyze_load_dependences): Handle NULL elements in SLP_TREE_SCALAR_STMTS.
2024-06-30  Harden SLP reduction support wrt STMT_VINFO_REDUC_IDX  (Richard Biener, 1 file, -2/+21)

The following makes sure that for SLP reductions all lanes have the same STMT_VINFO_REDUC_IDX. Once we move that info and can adjust it we can implement swapping. It also makes the existing protection against operand swapping trigger for all stmts participating in a reduction, not just the final one marked as reduction-def.

	* tree-vect-slp.cc (vect_build_slp_tree_1): Compare STMT_VINFO_REDUC_IDX.
	(vect_build_slp_tree_2): Prevent operand swapping for all stmts participating in a reduction.
2024-06-28  tree-optimization/115652 - more fixing of the fix  (Richard Biener, 1 file, -2/+9)

The following addresses the corner case of an outer loop with an empty header, where we end up asking for the BB of a NULL stmt, by special-casing this situation.

	PR tree-optimization/115652
	* tree-vect-slp.cc (vect_schedule_slp_node): Handle the case where the outer loop header block is empty.
2024-06-27  tree-optimization/115669 - fix SLP reduction association  (Richard Biener, 1 file, -0/+3)

The following avoids associating a reduction path, as that might get STMT_VINFO_REDUC_IDX out-of-sync with the SLP operand order. This is a latent issue with SLP reductions but is now easily exposed as we're doing single-lane SLP reductions. Once we've achieved SLP-only we can move and update this meta-data.

	PR tree-optimization/115669
	* tree-vect-slp.cc (vect_build_slp_tree_2): Do not reassociate chains that participate in a reduction.
	* gcc.dg/vect/pr115669.c: New testcase.
2024-06-27  tree-optimization/115652 - amend last fix  (Richard Biener, 1 file, -1/+2)

The previous fix breaks in the degenerate case where the discovered last_stmt is equal to the first stmt in the block, since then we undo a required stmt advancement.

	PR tree-optimization/115652
	* tree-vect-slp.cc (vect_schedule_slp_node): Only insert at the start of the block if that strictly dominates the discovered dependent stmt.
2024-06-26  tree-optimization/115652 - adjust insertion gsi for SLP  (Richard Biener, 1 file, -16/+13)

The following adjusts how SLP computes the insertion location. In particular it advanced the insert iterator of the found last_stmt. The vectorizer will later insert stmts _before_ it. But we also have the constraint that possibly masked ops may not be scheduled outside of the loop, and as we do not model the loop mask in the SLP graph we have to adjust for that. The following moves this to after the advance since it isn't compatible with that, as the current GIMPLE_COND exception shows. The PR is about in-order reduction vectorization, which also isn't happy when that's the very first stmt.

	PR tree-optimization/115652
	* tree-vect-slp.cc (vect_schedule_slp_node): Advance the iterator based on last_stmt only for vector defs.
2024-06-24  tree-optimization/115602 - SLP CSE results in cycles  (Richard Biener, 1 file, -12/+21)

The following prevents SLP CSE from creating new cycles, which happened because a 1:1 permute node was present whose child was then CSEd to the permute node itself. Fixed by making a node only available to CSE after recursing.

	PR tree-optimization/115602
	* tree-vect-slp.cc (vect_cse_slp_nodes): Delay populating the bst-map to avoid cycles.
	* gcc.dg/vect/pr115602.c: New testcase.
2024-06-23  tree-optimization/115597 - allow CSE of two-operator VEC_PERM nodes  (Richard Biener, 1 file, -1/+0)

The following makes sure to always CSE when there's SLP_TREE_SCALAR_STMTS, as otherwise a chain of two-operator node operations can result in exponential behavior of the CSE process, as likely seen when building 510.parest on aarch64.

	PR tree-optimization/115597
	* tree-vect-slp.cc (vect_cse_slp_nodes): Allow to CSE VEC_PERM nodes.
2024-06-20  tree-optimization/114413 - SLP CSE after permute optimization  (Richard Biener, 1 file, -12/+64)

We currently fail to re-CSE SLP nodes after optimizing permutes, which results in off cost estimates. For gcc.dg/vect/bb-slp-32.c this shows in not re-using the SLP node with the load and arithmetic for both the store and the reduction.

The following implements CSE by re-bst-mapping nodes as the finalization part of vect_optimize_slp. I've tried to make the CSE part of permute materialization but it isn't a very good fit there. I've not bothered to implement something more complete, also handling external defs or defs without SLP_TREE_SCALAR_STMTS.

I realize this might result in more BB SLP, which in turn might slow down code given costing for BB SLP is difficult (even that we now vectorize gcc.dg/vect/bb-slp-32.c on x86_64 might be not a good idea). This is nevertheless feeding more accurate info to costing, which is good.

	PR tree-optimization/114413
	* tree-vect-slp.cc (release_scalar_stmts_to_slp_tree_map): New function, split out from ...
	(vect_analyze_slp): ... here. Call it.
	(vect_cse_slp_nodes): New function.
	(vect_optimize_slp): Call it.
	* gcc.dg/vect/bb-slp-32.c: Expect CSE and vectorization on x86.
2024-06-20  vect: Add a function to check lane-reducing stmt  (Feng Xue, 1 file, -3/+1)

Add a utility function to check if a statement is a lane-reducing operation, which could simplify some existing code.

2024-06-16  Feng Xue  <fxue@os.amperecomputing.com>

gcc/
	* tree-vectorizer.h (lane_reducing_stmt_p): New function.
	* tree-vect-slp.cc (vect_analyze_slp): Use new function lane_reducing_stmt_p to check statement.
2024-06-17  tree-optimization/115508 - fix ICE with SLP scheduling and extern vector  (Richard Biener, 1 file, -0/+1)

When there's a permute after an extern vector we can run into a case that didn't consider the scheduled node being a permute, which lacks a representative.

	PR tree-optimization/115508
	* tree-vect-slp.cc (vect_schedule_slp_node): Guard check on representative.
	* gcc.target/i386/pr115508.c: New testcase.
2024-06-06  Add double reduction support for SLP vectorization  (Richard Biener, 1 file, -1/+2)

The following makes double reduction vectorization work when using (single-lane) SLP vectorization.

	* tree-vect-loop.cc (vect_analyze_scalar_cycles_1): Queue double reductions in LOOP_VINFO_REDUCTIONS.
	(vect_create_epilog_for_reduction): Remove asserts disabling SLP for double reductions.
	(vectorizable_reduction): Analyze SLP double reductions only once and start off the correct places.
	* tree-vect-slp.cc (vect_get_and_check_slp_defs): Allow vect_double_reduction_def.
	(vect_build_slp_tree_2): Fix condition for the ignored reduction initial values.
	* tree-vect-stmts.cc (vect_analyze_stmt): Allow vect_double_reduction_def.
2024-06-04  Do single-lane SLP discovery for reductions  (Richard Biener, 1 file, -17/+54)

The following performs single-lane SLP discovery for reductions. It requires a fixup for outer loop vectorization where a check for multiple types needs adjustment, as otherwise bogus pointer IV increments happen when there are multiple copies of vector stmts in the inner loop.

For the reduction epilog handling this extends the optimized path to cover the trivial single-lane SLP reduction case. The fix for PR65518 implemented in vect_grouped_load_supported for non-SLP needs a SLP counterpart that I put in get_group_load_store_type.

I've decided to adjust three testcases for appearing single-lane SLP instances instead of not dumping "vectorizing stmts using SLP" for single-lane instances, as that also requires testsuite adjustments.

	* tree-vect-slp.cc (vect_build_slp_tree_2): Only multi-lane discoveries are reduction chains and need special backedge treatment.
	(vect_analyze_slp): Fall back to single-lane SLP discovery for reductions. Make sure to try single-lane SLP reduction for all reductions as fallback.
	(vectorizable_load): Avoid outer loop SLP vectorization with multi-copy vector stmts in the inner loop.
	(vectorizable_store): Likewise.
	* tree-vect-loop.cc (vect_create_epilog_for_reduction): Allow direct opcode and shift reduction also for SLP reductions with a single lane.
	* tree-vect-stmts.cc (get_group_load_store_type): For SLP also check for the PR65518 single-element interleaving case as done in vect_grouped_load_supported.
	* gcc.dg/vect/slp-24.c: Expect another SLP instance for the reduction.
	* gcc.dg/vect/slp-24-big-array.c: Likewise.
	* gcc.dg/vect/slp-reduc-6.c: Remove scan for zero SLP instances.
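The simplest shape this covers (my sketch): a plain sum reduction, now discovered as a single-lane SLP instance instead of going through the non-SLP path.

    /* Sketch: a single-lane SLP reduction.  */
    int
    foo (const int *restrict a, int n)
    {
      int sum = 0;
      for (int i = 0; i < n; ++i)
        sum += a[i];
      return sum;
    }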
2024-06-04  Avoid inserting after a GIMPLE_COND with SLP and early break  (Richard Biener, 1 file, -1/+6)

When vectorizing an early break loop with LENs (do we miss some check here to disallow this?) we can end up deciding to insert stmts after a GIMPLE_COND when doing SLP scheduling and trying to be conservative with placing of stmts only dependent on the implicit loop mask/len. The following avoids this; I guess it's not perfect but it does the job, fixing some observed RISC-V regression.

	* tree-vect-slp.cc (vect_schedule_slp_node): For mask/len loops make sure to not advance the insertion iterator beyond a GIMPLE_COND.
2024-06-01  vect: Add a function to check lane-reducing code  (Feng Xue, 1 file, -3/+1)

Checking if an operation is lane-reducing requires comparing its code against three kinds (DOT_PROD_EXPR/WIDEN_SUM_EXPR/SAD_EXPR). Add a utility function to make the check handy and concise.

2024-05-29  Feng Xue  <fxue@os.amperecomputing.com>

gcc/
	* tree-vectorizer.h (lane_reducing_op_p): New function.
	* tree-vect-slp.cc (vect_analyze_slp): Use new function lane_reducing_op_p to check statement code.
	* tree-vect-loop.cc (vect_transform_reduction): Likewise.
	(vectorizable_reduction): Likewise, and change name of a local variable that holds the result flag.
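Presumably the helper boils down to something like the following sketch (simplified from the description; the actual signature in tree-vectorizer.h may differ):

    /* Sketch: return true if CODE is one of the three lane-reducing
       operation codes named above.  */
    static bool
    lane_reducing_op_p (enum tree_code code)
    {
      return code == DOT_PROD_EXPR
	     || code == WIDEN_SUM_EXPR
	     || code == SAD_EXPR;
    }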
2024-05-29  vect: Unify bbs in loop_vec_info and bb_vec_info  (Feng Xue, 1 file, -9/+14)

Both derived classes have their own "bbs" field, which has exactly the same purpose of recording all basic blocks inside the corresponding vect region, but the fields use different data types: one is a plain array, the other an auto_vec. This difference causes some duplicated code handling the same stuff, mostly in tree-vect-patterns. One refinement is lifting this field into the base class "vec_info" and resetting its value to the continuous memory area pointed to by the two old "bbs" fields in each constructor of the derived classes.

2024-05-16  Feng Xue  <fxue@os.amperecomputing.com>

gcc/
	* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Move initialization of bbs to explicit construction code. Adjust the definition of nbbs.
	(update_epilogue_loop_vinfo): Update nbbs for epilog vinfo.
	* tree-vect-patterns.cc (vect_determine_precisions): Make loop_vec_info and bb_vec_info share same code.
	(vect_pattern_recog): Remove duplicated vect_pattern_recog_1 loop.
	* tree-vect-slp.cc (vect_get_and_check_slp_defs): Access to bbs[0] via base vec_info class.
	(_bb_vec_info::_bb_vec_info): Initialize bbs and nbbs using data fields of input auto_vec<> bbs.
	(vect_slp_region): Use access to nbbs to replace original bbs.length().
	(vect_schedule_slp_node): Access to bbs[0] via base vec_info class.
	* tree-vectorizer.cc (vec_info::vec_info): Add initialization of bbs and nbbs.
	(vec_info::insert_seq_on_entry): Access to bbs[0] via base vec_info class.
	* tree-vectorizer.h (vec_info): Add new fields bbs and nbbs.
	(LOOP_VINFO_NBBS): New macro.
	(BB_VINFO_BBS): Rename BB_VINFO_BB to BB_VINFO_BBS.
	(BB_VINFO_NBBS): New macro.
	(_loop_vec_info): Remove field bbs.
	(_bb_vec_info): Rename field bbs.
2024-05-28  tree-optimization/115254 - don't account single-lane SLP against discovery limit  (Richard Biener, 1 file, -13/+18)

The following avoids accounting single-lane SLP to the discovery limit. As the two testcases show, the limit otherwise makes discovery fail, unfortunately not even consistently across targets. The following should fix two FAILs for GCN as a side-effect.

	PR tree-optimization/115254
	* tree-vect-slp.cc (vect_build_slp_tree): Only account multi-lane SLP to limit.
	* gcc.dg/vect/slp-cond-2-big-array.c: Expect 4 times SLP.
	* gcc.dg/vect/slp-cond-2.c: Likewise.
2024-05-24  Avoid splitting store dataref groups during SLP discovery  (Richard Biener, 1 file, -26/+222)

The following avoids splitting store dataref groups during SLP discovery but instead forces (eventually single-lane) consecutive lane SLP discovery for all lanes of the group, creating VEC_PERM SLP nodes merging them so the store will always cover the whole group. With this, for example

    int x[1024], y[1024], z[1024], w[1024];
    void foo (void)
    {
      for (int i = 0; i < 256; i++)
        {
          x[4*i+0] = y[2*i+0];
          x[4*i+1] = y[2*i+1];
          x[4*i+2] = z[i];
          x[4*i+3] = w[i];
        }
    }

which was previously using hybrid SLP can now be fully SLPed, and the SSE code generated looks better (but of course you never know, I didn't actually benchmark). We of course need a VF of four here.

    .L2:
            movdqa  z(%rax), %xmm0
            movdqa  w(%rax), %xmm4
            movdqa  y(%rax,%rax), %xmm2
            movdqa  y+16(%rax,%rax), %xmm1
            movdqa  %xmm0, %xmm3
            punpckhdq       %xmm4, %xmm0
            punpckldq       %xmm4, %xmm3
            movdqa  %xmm2, %xmm4
            shufps  $238, %xmm3, %xmm2
            movaps  %xmm2, x+16(,%rax,4)
            movdqa  %xmm1, %xmm2
            shufps  $68, %xmm3, %xmm4
            shufps  $68, %xmm0, %xmm2
            movaps  %xmm4, x(,%rax,4)
            shufps  $238, %xmm0, %xmm1
            movaps  %xmm2, x+32(,%rax,4)
            movaps  %xmm1, x+48(,%rax,4)
            addq    $16, %rax
            cmpq    $1024, %rax
            jne     .L2

The extra permute nodes merging distinct branches of the SLP tree might be unexpected for some code, esp. since SLP_TREE_REPRESENTATIVE cannot be meaningfully set and we cannot populate SLP_TREE_SCALAR_STMTS or SLP_TREE_SCALAR_OPS consistently, as we can have a mix of both.

The patch keeps the sub-trees formed from consecutive lanes, but that's in principle not necessary if we for example have an even/odd split which now would result in N single-lane sub-trees. That's left for future improvements.

The interesting part is how VLA vector ISAs handle merging of two vectors that's not trivial even/odd merging. The strategy of how to build the permute tree might need adjustments for that (in the end splitting each branch to single lanes and then doing even/odd merging would be the brute-force fallback). Not sure how much we can or should rely on the SLP optimize pass to handle this.

The gcc.dg/vect/slp-12a.c case is interesting as we currently split the 8 store group into lanes 0-5, which we SLP with an unroll factor of two (on x86-64 with SSE), and the remaining two lanes are using interleaving vectorization with a final unroll factor of four. Thus we're using hybrid SLP within a single store group. After the change we discover the same 0-5 lane SLP part as well as two single-lane parts feeding the full store group. But that results in a load permutation that isn't supported (I have WIP patches to rectify that). So we end up cancelling SLP and vectorizing the whole loop with interleaving, which is IMO good and results in better code.

This is similar for gcc.target/i386/pr52252-atom.c where interleaving generates much better code than hybrid SLP. I'm unsure how to update the testcase though. gcc.dg/vect/slp-21.c runs into similar situations. Note that when we discard an instance while analyzing SLP operations we currently force the full loop to have no SLP because hybrid detection is broken. It's probably not worth fixing this at this moment.

For gcc.dg/vect/pr97428.c we are not splitting the 16 store group into two but merge the two 8-lane loads into one before doing the store and thus have only a single SLP instance. A similar situation happens in gcc.dg/vect/slp-11c.c, but the branches feeding the single SLP store only have a single lane. Likewise for gcc.dg/vect/vect-complex-5.c and gcc.dg/vect/vect-gather-2.c.

gcc.dg/vect/slp-cond-1.c has an additional SLP vectorization with a SLP store group of size two but two single-lane branches.

	* tree-vect-slp.cc (vect_build_slp_instance): Do not split store dataref groups on loop SLP discovery failure but create a single SLP instance for the stores but branch to SLP sub-trees and merge with a series of VEC_PERM nodes.
	* gcc.dg/vect/pr97428.c: Expect a single store SLP group.
	* gcc.dg/vect/slp-11c.c: Likewise, if !vect_load_lanes.
	* gcc.dg/vect/vect-complex-5.c: Likewise.
	* gcc.dg/vect/slp-12a.c: Do not expect SLP.
	* gcc.dg/vect/slp-21.c: Remove not important scanning for SLP.
	* gcc.dg/vect/slp-cond-1.c: Expect one more SLP if !vect_load_lanes.
	* gcc.dg/vect/vect-gather-2.c: Expect SLP to be used.
	* gcc.target/i386/pr52252-atom.c: XFAIL test for palignr.
2024-05-22  Fix mixed input kind permute optimization  (Richard Biener, 1 file, -0/+2)

When change_vec_perm_layout runs into a permute combining two nodes where one is invariant and one internal, the partition of one input can be -1 but the other's might not be. The following supports this case by simply ignoring inputs with input partition -1. I'm not sure this is correct, but it avoids ICEing when accessing that partition's layout for gcc.target/i386/pr98928.c with the change to avoid splitting store dataref groups during SLP discovery.

	* tree-vect-slp.cc (change_vec_perm_layout): Ignore an input partition of -1.
2024-05-22  Avoid SLP_REPRESENTATIVE access for VEC_PERM in SLP scheduling  (Richard Biener, 1 file, -12/+16)

SLP permute nodes can end up without a SLP_REPRESENTATIVE now; the following avoids touching it in this case in vect_schedule_slp_node.

	* tree-vect-slp.cc (vect_schedule_slp_node): Avoid looking at SLP_REPRESENTATIVE for VEC_PERM nodes.
2024-05-13  Refactor SLP reduction group discovery  (Richard Biener, 1 file, -33/+34)

The following refactors a bit how we perform SLP reduction group discovery, possibly making it easier to have multiple reduction groups later, esp. with single-lane SLP.

	* tree-vect-slp.cc (vect_analyze_slp_instance): Remove slp_inst_kind_reduc_group handling.
	(vect_analyze_slp): Add the meat here.
2024-05-10  Allow patterns in SLP reductions  (Richard Biener, 1 file, -8/+18)

The following removes the over-broad rejection of patterns for SLP reductions, which is done by removing them from LOOP_VINFO_REDUCTIONS during pattern detection. That's also insufficient in case the pattern only appears on the reduction path. Instead this implements the proper correctness check in vectorizable_reduction and guides SLP discovery to heuristically avoid forming later invalid groups.

I also couldn't find any testcase that FAILs when allowing the SLP reductions to form, so I've added one. I came across this for single-lane SLP reductions with the all-SLP work where we rely on patterns to properly vectorize COND_EXPR reductions.

	* tree-vect-patterns.cc (vect_pattern_recog_1): Do not remove reductions involving patterns.
	* tree-vect-loop.cc (vectorizable_reduction): Reject SLP reduction groups with multiple lane-reducing reductions.
	* tree-vect-slp.cc (vect_analyze_slp_instance): When discovering SLP reduction groups avoid including lane-reducing ones.
	* gcc.dg/vect/vect-reduc-sad-9.c: New testcase.
2024-05-02  Improve SLP dump and graph  (Richard Biener, 1 file, -1/+20)

The following notes which lanes are considered live and adds an overload to produce a graphviz graph for multiple entries into an SLP graph.

	* tree-vect-slp.cc (vect_print_slp_tree): Mark live lanes.
	(dot_slp_tree): New overload for multiple entries.
2024-04-23  tree-optimization/114799 - SLP and patterns  (Richard Biener, 1 file, -0/+6)

The following plugs a hole with computing whether a SLP node has any pattern stmts, which is important to know when we want to replace it by a CTOR from external defs.

	PR tree-optimization/114799
	* tree-vect-slp.cc (vect_get_and_check_slp_defs): Properly update ->any_pattern when swapping operands.
	* gcc.dg/vect/bb-slp-pr114799.c: New testcase.
2024-04-16  tree-optimization/114736 - SLP DFS walk issue  (Richard Biener, 1 file, -1/+2)

The following fixes a DFS walk issue when identifying to-be-ignored latch edges. We have (bogus) SLP_TREE_REPRESENTATIVEs for VEC_PERM nodes so those have to be explicitly ignored as possibly being PHIs.

	PR tree-optimization/114736
	* tree-vect-slp.cc (vect_optimize_slp_pass::is_cfg_latch_edge): Do not consider VEC_PERM_EXPRs as PHI use.
	* gfortran.dg/vect/pr114736.f90: New testcase.
2024-03-27  tree-optimization/114057 - handle BB reduction remain defs as LIVE  (Richard Biener, 1 file, -3/+10)

The following makes sure to record the scalars we add to the BB reduction vectorization result as scalar uses for the purpose of computing live lanes. This restores vectorization in the bondfree.c TU of 435.gromacs.

	PR tree-optimization/114057
	* tree-vect-slp.cc (vect_bb_slp_mark_live_stmts): Mark BB reduction remain defs as scalar uses.