path: root/gcc/tree-vect-loop-manip.cc
Age  Commit message  Author  Files  Lines
2024-07-16  tree-optimization/115843 - fix wrong-code with fully-masked loop and peeling  (Richard Biener; 1 file, -2/+6)

When AVX512 uses a fully masked loop and peeling, we fail in some cases to create the correct initial loop mask when the mask is composed of multiple components.  The following fixes this by properly applying the bias for the component to the shift amount.

        PR tree-optimization/115843
        * tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors_avx512):
        Properly bias the shift of the initial mask for alignment peeling.
        * gcc.dg/vect/pr115843.c: New testcase.
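A self-contained toy model of the component/bias arithmetic the fix describes (the mask layout, lane widths and function below are illustrative inventions, not GCC's code): when a logical loop mask is split into components of `w` lanes each and alignment peeling deactivates the first `peel` lanes, each component must subtract its bias (its starting lane) from the shift amount.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: lane i of the whole mask is active iff i >= peel; each
   component covers w lanes starting at lane comp * w.  */
uint64_t
initial_mask_component (int w, int comp, int peel)
{
  int bias = comp * w;                 /* first lane this component covers */
  int shift = peel - bias;             /* lanes to clear in this component */
  uint64_t full = (w >= 64) ? ~(uint64_t) 0 : (((uint64_t) 1 << w) - 1);
  if (shift <= 0)
    return full;                       /* peeling does not reach this component */
  if (shift >= w)
    return 0;                          /* component is fully peeled away */
  return (full << shift) & full;       /* clear the low 'shift' lanes */
}
```

Forgetting the `bias` subtraction here would shift every component by the raw `peel` amount, clearing lanes that should stay active in later components, which is the flavor of bug the commit message describes.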
2024-06-03  Remove value_range typedef.  (Aldy Hernandez; 1 file, -16/+16)

Now that pointers and integers have been disambiguated from irange, and all the pointer range temporaries use prange, we can reclaim value_range as a general purpose range container.  This patch removes the typedef in favor of int_range_max, thus providing slightly better ranges in places.  I have also used int_range<1> or <2> when it's known ahead of time how big the range will be, thus saving a few words.  In a follow-up patch I will rename the Value_Range temporary to value_range.  No change in performance.

gcc/ChangeLog:

        * builtins.cc (expand_builtin_strnlen): Replace value_range use
        with int_range_max or irange when appropriate.
        (determine_block_size): Same.
        * fold-const.cc (minmax_from_comparison): Same.
        * gimple-array-bounds.cc (check_out_of_bounds_and_warn): Same.
        (array_bounds_checker::check_array_ref): Same.
        * gimple-fold.cc (size_must_be_zero_p): Same.
        * gimple-predicate-analysis.cc (find_var_cmp_const): Same.
        * gimple-ssa-sprintf.cc (get_int_range): Same.
        (format_integer): Same.
        (try_substitute_return_value): Same.
        (handle_printf_call): Same.
        * gimple-ssa-warn-restrict.cc (builtin_memref::extend_offset_range): Same.
        * graphite-sese-to-poly.cc (add_param_constraints): Same.
        * internal-fn.cc (get_min_precision): Same.
        * match.pd: Same.
        * pointer-query.cc (get_size_range): Same.
        * range-op.cc (get_shift_range): Same.
        (operator_trunc_mod::op1_range): Same.
        (operator_trunc_mod::op2_range): Same.
        * range.cc (range_negatives): Same.
        * range.h (range_positives): Same.
        (range_negatives): Same.
        * tree-affine.cc (expr_to_aff_combination): Same.
        * tree-data-ref.cc (compute_distributive_range): Same.
        (nop_conversion_for_offset_p): Same.
        (split_constant_offset): Same.
        (split_constant_offset_1): Same.
        (dr_step_indicator): Same.
        * tree-dfa.cc (get_ref_base_and_extent): Same.
        * tree-scalar-evolution.cc (iv_can_overflow_p): Same.
        * tree-ssa-math-opts.cc (optimize_spaceship): Same.
        * tree-ssa-pre.cc (insert_into_preds_of_block): Same.
        * tree-ssa-reassoc.cc (optimize_range_tests_to_bit_test): Same.
        * tree-ssa-strlen.cc (compare_nonzero_chars): Same.
        (dump_strlen_info): Same.
        (get_range_strlen_dynamic): Same.
        (set_strlen_range): Same.
        (maybe_diag_stxncpy_trunc): Same.
        (strlen_pass::get_len_or_size): Same.
        (strlen_pass::handle_builtin_string_cmp): Same.
        (strlen_pass::count_nonzero_bytes_addr): Same.
        (strlen_pass::handle_integral_assign): Same.
        * tree-switch-conversion.cc (bit_test_cluster::emit): Same.
        * tree-vect-loop-manip.cc (vect_gen_vector_loop_niters): Same.
        (vect_do_peeling): Same.
        * tree-vect-patterns.cc (vect_get_range_info): Same.
        (vect_recog_divmod_pattern): Same.
        * tree.cc (get_range_pos_neg): Same.
        * value-range.cc (debug): Remove value_range variants.
        * value-range.h (value_range): Remove typedef.
        * vr-values.cc (simplify_using_ranges::op_with_boolean_value_range_p):
        Replace value_range use with int_range_max or irange when appropriate.
        (check_for_binary_op_overflow): Same.
        (simplify_using_ranges::legacy_fold_cond_overflow): Same.
        (find_case_label_ranges): Same.
        (simplify_using_ranges::simplify_abs_using_ranges): Same.
        (test_for_singularity): Same.
        (simplify_using_ranges::simplify_compare_using_ranges_1): Same.
        (simplify_using_ranges::simplify_casted_compare): Same.
        (simplify_using_ranges::simplify_switch_using_ranges): Same.
        (simplify_conversion_using_ranges): Same.
        (simplify_using_ranges::two_valued_val_range_p): Same.
2024-04-24  tree-optimization/114832 - wrong dominator info with vect peeling  (Richard Biener; 1 file, -1/+1)

When we update the dominator of the redirected exit after peeling we check whether the immediate dominator was the loop header rather than the exit source, when we later want to just update it to the new source.  The following fixes this oversight.

        PR tree-optimization/114832
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Fix dominance check.
        * gcc.dg/vect/pr114832.c: New testcase.
2024-04-04  tree-optimization/114485 - neg induction with partial vectors  (Richard Biener; 1 file, -7/+7)

We can't use vect_update_ivs_after_vectorizer for partial vectors; the following fixes vect_can_peel_nonlinear_iv_p accordingly.

        PR tree-optimization/114485
        * tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p):
        vect_step_op_neg isn't OK for partial vectors but only for
        unknown niter.
        * gcc.dg/vect/pr114485.c: New testcase.
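For context, a "neg" induction (vect_step_op_neg) is a variable that flips its sign every iteration, so its value at iteration n depends only on the parity of n.  A minimal sketch of the loop shape involved (illustrative only; this is not the pr114485.c testcase):

```c
#include <assert.h>

/* x = -x each iteration is a vect_step_op_neg-style nonlinear induction:
   after n iterations x equals init when n is even, -init when n is odd.  */
int
negate_n_times (int init, int n)
{
  int x = init;
  for (int i = 0; i < n; i++)
    x = -x;
  return x;
}
```

The parity-only dependence is what makes this induction recoverable after peeling when the number of peeled iterations is known.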
2024-03-07  vect: Do not peel epilogue for partial vectors.  (Robin Dapp; 1 file, -23/+7)

r14-7036-gcbf569486b2dec added an epilogue vectorization guard for early break, but PR114196 shows that we also run into the problem without early break.  Therefore merge the condition into the topmost vectorization guard.

gcc/ChangeLog:

        PR middle-end/114196
        * tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p):
        Merge vectorization guards.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/pr114196.c: New test.
        * gcc.target/riscv/rvv/autovec/pr114196.c: New test.
2024-02-27  tree-optimization/114081 - dominator update for prologue peeling  (Richard Biener; 1 file, -22/+56)

The following implements manual dominator update for multi-exit loop prologue peeling during vectorization.

        PR tree-optimization/114081
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Perform manual dominator update for prologue peeling.
        (vect_do_peeling): Properly update dominators after adding the
        prologue-around guard.
        * gcc.dg/vect/vect-early-break_121-pr114081.c: New testcase.
2024-02-26  tree-optimization/114099 - virtual LC PHIs and early exit vect  (Richard Biener; 1 file, -22/+13)

In some cases exits can lack LC PHI nodes for the virtual operand.  We have to create them when the epilog loop requires them, which also allows us to remove some only halfway correct fixups.  This is the variant triggering for alternate exits.

        PR tree-optimization/114099
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Create and fill in a needed virtual LC PHI for the alternate
        exits.  Remove the code dealing with it being missing.
        * gcc.dg/vect/vect-early-break_120-pr114099.c: New testcase.
2024-02-26  tree-optimization/114068 - missed virtual LC PHI after vect peeling  (Richard Biener; 1 file, -13/+39)

When we choose the IV exit to be one leading to no virtual use, we fail to have a virtual LC PHI even though we need it for the epilog entry.  The following makes sure to create it so that later updating works.

        PR tree-optimization/114068
        * tree-vect-loop-manip.cc (get_live_virtual_operand_on_edge):
        New function.
        (slpeel_tree_duplicate_loop_to_edge_cfg): Add a virtual LC PHI
        on the main exit if needed.  Remove band-aid for the case it
        was missing.
        * gcc.dg/vect/vect-early-break_118-pr114068.c: New testcase.
        * gcc.dg/vect/vect-early-break_119-pr114068.c: Likewise.
2024-01-30  tree-optimization/113659 - early exit vectorization and missing VUSE  (Richard Biener; 1 file, -3/+17)

The following handles the case of the main exit going to a path without virtual use, similarly to the alternate exit handling.

        PR tree-optimization/113659
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Handle main exit without virtual use.
        * gcc.dg/pr113659.c: New testcase.
2024-01-23  Refactor exit PHI handling in vectorizer epilogue peeling  (Richard Biener; 1 file, -61/+74)

This refactors the handling of PHIs in between the main and the epilogue loop.  Instead of trying to handle the multiple-exit and the original single-exit case together, the following separates these cases, resulting in much easier to understand code.

        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Separate single and multi-exit case when creating PHIs between
        the main and epilogue.
2024-01-22  tree-optimization/113373 - add missing LC PHIs for live operations  (Richard Biener; 1 file, -5/+29)

The following makes reduction epilogue code generation happy by properly adding LC PHIs to the exit blocks for multiple-exit vectorized loops.  Some refactoring might make the flow easier to follow, but I've refrained from doing that with this patch.  I've kept some fixes in reduction epilogue generation from the earlier attempt at fixing this PR.

        PR tree-optimization/113373
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Create LC PHIs in the exit blocks where necessary.
        * tree-vect-loop.cc (vectorizable_live_operation): Do not try
        to handle missing LC PHIs.
        (find_connected_edge): Remove.
        (vect_create_epilog_for_reduction): Cleanup use of auto_vec.
        * gcc.dg/vect/vect-early-break_104-pr113373.c: New testcase.
2024-01-19  tree-optimization/113494 - Fix two observed regressions with r14-8206  (Richard Biener; 1 file, -5/+21)

The following handles the situation where we lack a loop-closed PHI for a virtual operand because a loop exit goes to a code region not having any virtual use (an endless loop).  It also handles the situation of edge redirection re-allocating a PHI node in the destination block, so we have to re-lookup that before populating the new PHI argument.

        PR tree-optimization/113494
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Handle endless loop on exit.  Handle re-allocated PHI.
2024-01-18  tree-optimization/113374 - early break vect and virtual operands  (Richard Biener; 1 file, -106/+96)

The following fixes wrong virtual operands being used for peeled early breaks, where we can have different live ones, and for multiple exits it makes sure to update the correct PHI arguments.  I've introduced SET_PHI_ARG_DEF_ON_EDGE so we can avoid using a wrong edge to compute the PHI arg index from.  I've taken the liberty to understand the code again and refactor and comment it a bit differently.  The main functional change is that we preserve the live virtual operand on all exits.

        PR tree-optimization/113374
        * tree-ssa-operands.h (SET_PHI_ARG_DEF_ON_EDGE): New.
        * tree-vect-loop.cc (move_early_exit_stmts): Update virtual
        LC PHIs.
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Refactor.  Preserve virtual LC PHIs on all exits.
        * gcc.dg/vect/vect-early-break_106-pr113374.c: New testcase.
2024-01-17  tree-optimization/113371 - avoid prologue peeling for peeled early exits  (Richard Biener; 1 file, -1/+2)

The following avoids prologue peeling when doing early exit vectorization with the IV exit before the early exit.  That's because it invalidates the invariant that the effective latch of the loop is empty, causing wrong continuation to the main loop.  In particular this is prone to break virtual SSA form.

        PR tree-optimization/113371
        * tree-vect-data-refs.cc (vect_enhance_data_refs_alignment):
        Do not peel when LOOP_VINFO_EARLY_BREAKS_VECT_PEELED.
        * tree-vect-loop-manip.cc (vect_do_peeling): Assert we do not
        perform prologue peeling when LOOP_VINFO_EARLY_BREAKS_VECT_PEELED.
        * gcc.dg/vect/pr113371.c: New testcase.
2024-01-15  tree-optimization/113385 - wrong loop father with early exit vectorization  (Richard Biener; 1 file, -2/+2)

The following avoids splitting an edge before redirecting it.  This allows the loop father of the new block to be correct in the first place.

        PR tree-optimization/113385
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        First redirect, then split the exit edge.
2024-01-12  middle-end: remove more usages of single_exit  (Tamar Christina; 1 file, -1/+10)

This replaces two more usages of single_exit that I had missed before.  They both seem to happen when we re-use the ifcvt scalar loop for versioning.  The condition in versioning is the same as the one for when we don't re-use the scalar loop.

gcc/ChangeLog:

        * tree-vect-loop-manip.cc (vect_loop_versioning): Replace
        single_exit.
        * tree-vect-loop.cc (vect_transform_loop): Likewise.
2024-01-12  middle-end: thread through existing LCSSA variable for alternative exits too [PR113237]  (Tamar Christina; 1 file, -2/+7)

Building on top of the previous patch: similar to when we have a single exit, if all exits are considered early exits and there are existing non-virtual PHIs, then in order to maintain LCSSA we have to use the existing PHI variables.  We can't simply clear them and rebuild them because the order of the PHIs in the main exit must match the original exit for when we add the skip_epilog guard.  But the infrastructure is already in place to maintain them; we just have to use the right value.

gcc/ChangeLog:

        PR tree-optimization/113237
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Use existing LCSSA variable for exit when all exits are early
        break.

gcc/testsuite/ChangeLog:

        PR tree-optimization/113237
        * gcc.dg/vect/vect-early-break_98-pr113237.c: New test.
2024-01-12  middle-end: maintain LCSSA form when peeled vector iterations have virtual operands  (Tamar Christina; 1 file, -8/+35)

This patch fixes several interconnected issues.

1. When picking an exit we wanted to check for niter_desc.may_be_zero not being true, i.e. we want to pick an exit which we know will iterate at least once.  However niter_desc.may_be_zero is not a boolean; it is a tree that encodes a boolean value.  !niter_desc.may_be_zero just checks whether we have some information, not what the information is.  This led us to pick a more difficult to vectorize exit more often than we should.

2. Because we had this bug, we used to pick an alternative exit much more often, which exposed another issue: when the loop accesses memory and we "invert it" we would corrupt the VUSE chain.  This is because on a peeled vector iteration every exit restarts the loop (i.e. they're all early), BUT since we may have performed a store, the VUSE would need to be updated.  This version maintains virtual PHIs correctly in these cases.  Note that we can't simply remove all of them and recreate them because we need the PHI nodes still in the right order for if skip_vector.

3. Since we're moving the stores to a safe location, I don't think we actually need to analyze whether the store is in range of the memref, because if we ever get there, we know that the loads must be in range, and if the loads are in range and we get to the store we know the early breaks were not taken and so the scalar loop would have done the VF stores too.

4. Instead of searching for where to move stores to, they should always be in the exit belonging to the latch.  We can only ever delay stores, and even if we pick a different exit than the latch one as the main one, effects still happen in program order when vectorized.  If we don't move the stores to the latch exit but instead to whatever we pick as the "main" exit, then we can perform incorrect memory accesses (luckily these are trapped by verify_ssa).

5. We only used to analyze loads inside the same BB as an early break, and we'd never analyze the ones inside the block where we'd be moving memory references to.  This is obviously bogus; to fix it this patch splits the two constraints apart.  We first validate that all load memory references are in bounds, and only after that do we perform the alias checks for the writes.  This makes the code simpler to understand and more trivially correct.

gcc/ChangeLog:

        PR tree-optimization/113137
        PR tree-optimization/113136
        PR tree-optimization/113172
        PR tree-optimization/113178
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Maintain PHIs on inverted loops.
        (vect_do_peeling): Maintain virtual PHIs on inverted loops.
        * tree-vect-loop.cc (vec_init_loop_exit_info): Pick exit closest
        to the latch.
        (vect_create_loop_vinfo): Record all conds instead of only alt
        ones.

gcc/testsuite/ChangeLog:

        PR tree-optimization/113137
        PR tree-optimization/113136
        PR tree-optimization/113172
        PR tree-optimization/113178
        * g++.dg/vect/vect-early-break_4-pr113137.cc: New test.
        * g++.dg/vect/vect-early-break_5-pr113137.cc: New test.
        * gcc.dg/vect/vect-early-break_95-pr113137.c: New test.
        * gcc.dg/vect/vect-early-break_96-pr113136.c: New test.
        * gcc.dg/vect/vect-early-break_97-pr113172.c: New test.
2024-01-10  middle-end: Fix dominator updates when peeling with multiple exits [PR113144]  (Tamar Christina; 1 file, -9/+4)

When we peel at_exit we are moving the new loop to the exit of the previous loop.  This means that the blocks outside the loop that the previous loop used to dominate are no longer being dominated by it.  The new dominators however are hard to predict, since if the loop has multiple exits and all the exits are "early" ones then we always execute the scalar loop.  In this case the scalar loop can completely dominate the new loop.  If we later have skip_vector then there's an additional skip edge added that might change the dominators.  The previous patch would force an update of all blocks reachable from the new exits.  This one updates *only* blocks that we know the scalar exits dominated.  For the examples this reduces the blocks to update from 18 to 3.

gcc/ChangeLog:

        PR tree-optimization/113144
        PR tree-optimization/113145
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Update all BBs that the original exits dominated.

gcc/testsuite/ChangeLog:

        PR tree-optimization/113144
        PR tree-optimization/113145
        * gcc.dg/vect/vect-early-break_94-pr113144.c: New test.
2024-01-09  tree-optimization/113026 - fix vector epilogue maximum iter bound  (Richard Biener; 1 file, -32/+15)

The late amendment with a limit based on VF was redundant and wrong for peeled early exits.  The following moves the adjustment done when we don't have a skip edge down to the place where the already existing VF-based max iter check is done, and removes the amendment.

        PR tree-optimization/113026
        * tree-vect-loop-manip.cc (vect_do_peeling): Remove redundant
        and wrong niter bound setting.  Move niter bound adjustment down.
2024-01-09  middle-end: reject loops with nonlinear inductions and early breaks [PR113163]  (Tamar Christina; 1 file, -0/+19)

We can't support nonlinear inductions other than neg when vectorizing early breaks and the iteration count is known.  For early break we currently require a peeled epilog, but in these cases we can't compute the remaining values.

gcc/ChangeLog:

        PR middle-end/113163
        * tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p):
        Reject non-linear inductions that aren't supported.

gcc/testsuite/ChangeLog:

        PR middle-end/113163
        * gcc.target/gcn/pr113163.c: New test.
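As an illustration (not the pr113163.c testcase, which isn't shown here), a loop combining a multiplicative nonlinear induction with an early break — the shape this commit makes the vectorizer reject, since after the break a peeled epilogue would need the induction's value at an arbitrary iteration:

```c
#include <assert.h>

#define N 16
unsigned a[N];

/* 'm' is a vect_step_op_mul-style nonlinear induction; the early break
   means we can't cheaply recompute m for the remaining iterations.  */
unsigned
mul_iv_with_break (unsigned x)
{
  unsigned m = 1;
  for (int i = 0; i < N; i++)
    {
      m *= 3;
      if (a[i] > x)
        break;
    }
  return m;
}
```

Contrast this with the "neg" induction case above, whose value depends only on the parity of the iteration count and so stays supported.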
2024-01-08  tree-optimization/113026 - avoid vector epilog in more cases  (Richard Biener; 1 file, -0/+32)

The following avoids creating a niter peeling epilog more consistently, matching what peeling later uses for the skip_vector condition, in particular when versioning is required, which then also ensures the vector loop is entered unless the epilog is vectorized.  This should ideally match LOOP_VINFO_VERSIONING_THRESHOLD, which is only computed later; some refactoring could make that match better.  The patch also makes sure to adjust the upper bound of the epilogues when we do not have a skip edge around the vector loop.

        PR tree-optimization/113026
        * tree-vect-loop.cc (vect_need_peeling_or_partial_vectors_p):
        Avoid an epilog in more cases.
        * tree-vect-loop-manip.cc (vect_do_peeling): Adjust the
        epilogues niter upper bounds and estimates.
        * gcc.dg/torture/pr113026-1.c: New testcase.
        * gcc.dg/torture/pr113026-2.c: Likewise.
2024-01-03  Update copyright years.  (Jakub Jelinek; 1 file, -1/+1)
2023-12-24  middle-end: Support vectorization of loops with multiple exits.  (Tamar Christina; 1 file, -64/+267)

Hi All,

This patch adds initial support for early break vectorization in GCC.  In other words, it implements support for vectorization of loops with multiple exits.  The support is added for any target that implements a vector cbranch optab; this includes both fully masked and non-masked targets.  Depending on the operation, the vectorizer may also require support for boolean mask reductions using Inclusive OR/Bitwise AND.  This is however only checked when the comparison would produce multiple statements.

This also fully decouples the vectorizer's notion of exit from the existing loop infrastructure's exit.  Before this patch the vectorizer always picked the natural latch-connected exit as the main exit.  After this patch the vectorizer is free to choose any exit it deems appropriate as the main exit.  This means that even if the main exit is not countable (i.e. the termination condition could not be determined) we might still be able to vectorize should one of the other exits be countable.  In such situations the loop is reflowed, which enables vectorization of many other loop forms.

Concretely the kinds of loops supported are of the form:

  for (int i = 0; i < N; i++)
    {
      <statements1>
      if (<condition>)
        {
          ...
          <action>;
        }
      <statements2>
    }

where <action> can be:
 - break
 - return
 - goto

Any number of statements can be used before the <action> occurs.

Since this is an initial version for GCC 14 it has the following limitations and features:

- Only fixed-size iterations and buffers are supported.  That is to say, any vectors loaded or stored must be to statically allocated arrays with known sizes.  N must also be known.  This limitation exists because our primary target for this optimization is SVE.  For VLA SVE we can't easily do cross-page iteration checks, and the result is likely to also not be beneficial.  For that reason we punt support for variable buffers till we have First-Faulting support in GCC 15.
- Any stores in <statements1> should not be to the same objects as in <condition>.  Loads are fine as long as they don't have the possibility to alias.  More concretely, we block RAW dependencies when the intermediate value can't be separated from the store, or the store itself can't be moved.
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported.
- Any number of loop early exits are supported.
- No support for epilogue vectorization.  The only epilogue supported is the scalar final one.  The peeling code supports it, but the code motion code cannot find instructions to make the move in the epilog.
- Early breaks are only supported for inner loop vectorization.  With the help of IPA and LTO this still gets hit quite often; during bootstrap it hit rather frequently.

Additionally TSVC s332, s481 and s482 all pass now, since these are tests for support for early exit vectorization.

This implementation does not support completely handling the early break inside the vector loop itself; instead it supports adding checks such that if we know that we have to exit in the current iteration then we branch to scalar code to actually do the final VF iterations, which handles all the code in <action>.  For the scalar loop we know that whatever exit you take you have to perform at most VF iterations.  For vector code we only care about the state of fully performed iterations and reset the scalar code to the (partially) remaining loop.  That is to say, the first vector loop executes so long as the early exit isn't needed.  Once the exit is taken, the scalar code will perform at most VF extra iterations, the exact number depending on peeling, iteration start, and which exit was taken (natural or early).  For this scalar loop, all early exits are treated the same.

When we vectorize, we move any statement not related to the early break itself that would be incorrect to execute before the break (i.e. has side effects) to after the break.
If this is not possible we decline to vectorize.  The analysis and code motion also take into account that they must not introduce a RAW dependency after the move of the stores.  This means that we check at the start of iterations whether we are going to exit or not.  During the analysis phase we check whether we are allowed to do this moving of statements.  Also note that we only move the scalar statements, and only do so after peeling but just before we start transforming statements.  With this, the vector flow no longer necessarily needs to match that of the scalar code.  In addition, most of the infrastructure is in place to support general control flow safely; however we are punting this to GCC 15.

Codegen for e.g.

  unsigned vect_a[N];
  unsigned vect_b[N];

  unsigned test4(unsigned x)
  {
    unsigned ret = 0;
    for (int i = 0; i < N; i++)
      {
        vect_b[i] = x + i;
        if (vect_a[i] > x)
          break;
        vect_a[i] = x;
      }
    return ret;
  }

We generate for Adv. SIMD:

test4:
        adrp    x2, .LC0
        adrp    x3, .LANCHOR0
        dup     v2.4s, w0
        add     x3, x3, :lo12:.LANCHOR0
        movi    v4.4s, 0x4
        add     x4, x3, 3216
        ldr     q1, [x2, #:lo12:.LC0]
        mov     x1, 0
        mov     w2, 0
        .p2align 3,,7
.L3:
        ldr     q0, [x3, x1]
        add     v3.4s, v1.4s, v2.4s
        add     v1.4s, v1.4s, v4.4s
        cmhi    v0.4s, v0.4s, v2.4s
        umaxp   v0.4s, v0.4s, v0.4s
        fmov    x5, d0
        cbnz    x5, .L6
        add     w2, w2, 1
        str     q3, [x1, x4]
        str     q2, [x3, x1]
        add     x1, x1, 16
        cmp     w2, 200
        bne     .L3
        mov     w7, 3
.L2:
        lsl     w2, w2, 2
        add     x5, x3, 3216
        add     w6, w2, w0
        sxtw    x4, w2
        ldr     w1, [x3, x4, lsl 2]
        str     w6, [x5, x4, lsl 2]
        cmp     w0, w1
        bcc     .L4
        add     w1, w2, 1
        str     w0, [x3, x4, lsl 2]
        add     w6, w1, w0
        sxtw    x1, w1
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        add     w4, w2, 2
        str     w0, [x3, x1, lsl 2]
        sxtw    x1, w4
        add     w6, w1, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w6, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
        add     w2, w2, 3
        cmp     w7, 3
        beq     .L4
        sxtw    x1, w2
        add     w2, w2, w0
        ldr     w4, [x3, x1, lsl 2]
        str     w2, [x5, x1, lsl 2]
        cmp     w0, w4
        bcc     .L4
        str     w0, [x3, x1, lsl 2]
.L4:
        mov     w0, 0
        ret
        .p2align 2,,3
.L6:
        mov     w7, 4
        b       .L2

and for SVE:

test4:
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        add     x5, x2, 3216
        mov     x3, 0
        mov     w1, 0
        cntw    x4
        mov     z1.s, w0
        index   z0.s, #0, #1
        ptrue   p1.b, all
        ptrue   p0.s, all
        .p2align 3,,7
.L3:
        ld1w    z2.s, p1/z, [x2, x3, lsl 2]
        add     z3.s, z0.s, z1.s
        cmplo   p2.s, p0/z, z1.s, z2.s
        b.any   .L2
        st1w    z3.s, p1, [x5, x3, lsl 2]
        add     w1, w1, 1
        st1w    z1.s, p1, [x2, x3, lsl 2]
        add     x3, x3, x4
        incw    z0.s
        cmp     w3, 803
        bls     .L3
.L5:
        mov     w0, 0
        ret
        .p2align 2,,3
.L2:
        cntw    x5
        mul     w1, w1, w5
        cbz     w5, .L5
        sxtw    x1, w1
        sub     w5, w5, #1
        add     x5, x5, x1
        add     x6, x2, 3216
        b       .L6
        .p2align 2,,3
.L14:
        str     w0, [x2, x1, lsl 2]
        cmp     x1, x5
        beq     .L5
        mov     x1, x4
.L6:
        ldr     w3, [x2, x1, lsl 2]
        add     w4, w0, w1
        str     w4, [x6, x1, lsl 2]
        add     x4, x1, 1
        cmp     w0, w3
        bcs     .L14
        mov     w0, 0
        ret

On the workloads this work is based on we see between 2-3x performance uplift using this patch.

Follow-up plan:
 - Boolean vectorization has several shortcomings.  I've filed PR110223 with the bigger ones that cause vectorization to fail with this patch.
 - SLP support.  This is planned for GCC 15, as for the majority of the cases building the SLP itself fails.  This means I'll need to spend time making this more robust first.  Additionally it requires:
     * Adding support for vectorizing CFG (gconds)
     * Support for CFG to differ between vector and scalar loops.
   Both of these would be disruptive to the tree, and I suspect I'll be handling fallouts from this patch for a while.  So I plan to work on the surrounding building blocks first for the remainder of the year.

Additionally it also contains reduced cases from issues found running over various codebases.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.  Also regtested with:
 -march=armv8.3-a+sve
 -march=armv8.3-a+nosve
 -march=armv9-a
 -mcpu=neoverse-v1
 -mcpu=neoverse-n2

Bootstrapped and regtested on x86_64-pc-linux-gnu with no issues.  Bootstrapped and regtested on arm-none-linux-gnueabihf with no issues.

gcc/ChangeLog:

        * tree-if-conv.cc (idx_within_array_bound): Expose.
        * tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
        (vect_analyze_data_ref_dependences): Use it.
        * tree-vect-loop-manip.cc (vect_iv_increment_position): New.
        (vect_set_loop_controls_directly,
        vect_set_loop_condition_partial_vectors,
        vect_set_loop_condition_partial_vectors_avx512,
        vect_set_loop_condition_normal): Support multiple exits.
        (slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSSA peeling
        for multiple exits.
        (slpeel_can_duplicate_loop_p): Change vectorizer to look at the
        loop shape instead of the BB count.
        (vect_update_ivs_after_vectorizer): Drop asserts.
        (vect_gen_vector_loop_niters_mult_vf): Support peeled vector
        iterations.
        (vect_do_peeling): Support multiple exits.
        (vect_loop_versioning): Likewise.
        * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
        early_breaks.
        (vect_analyze_loop_form): Support loop flows with more than a
        single BB loop body.
        (vect_create_loop_vinfo): Support niters analysis for multiple
        exits.
        (vect_analyze_loop): Likewise.
        (vect_get_vect_def): New.
        (vect_create_epilog_for_reduction): Support early exit reductions.
        (vectorizable_live_operation_1): New.
        (find_connected_edge): New.
        (vectorizable_live_operation): Support early exit live operations.
        (move_early_exit_stmts): New.
        (vect_transform_loop): Use it.
        * tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
        (vect_recog_bitfield_ref_pattern): Support gconds and bools.
        (vect_recog_gcond_pattern): New.
        (possible_vector_mask_operation_p): Support gcond masks.
        (vect_determine_mask_precision): Likewise.
        (vect_mark_pattern_stmts): Set gcond def type.
        (can_vectorize_live_stmts): Force early break inductions to be
        live.
        * tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy
        analysis for early breaks.
        (vect_mark_stmts_to_be_vectorized): Process gcond usage.
        (perm_mask_for_reverse): Expose.
        (vectorizable_comparison_1): New.
        (vectorizable_early_exit): New.
        (vect_analyze_stmt): Support early break and gcond.
        (vect_transform_stmt): Likewise.
        (vect_is_simple_use): Likewise.
        (vect_get_vector_types_for_stmt): Likewise.
        * tree-vectorizer.cc (pass_vectorize::execute): Update exits for
        value numbering.
        * tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
        (LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
        LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
        LOOP_VINFO_EARLY_BRK_VUSES): New.
        (is_loop_header_bb_p): Drop assert.
        (class loop): Add early_breaks, early_break_stores,
        early_break_dest_bb, early_break_vuses.
        (vect_iv_increment_position, perm_mask_for_reverse,
        ref_within_array_bound): New.
        (slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
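The test4 loop from the commit message above, made self-contained for experimentation.  N is not stated in the message; 800 here is an assumption chosen to match the generated code's trip counts, and the driver arrays start zero-initialized:

```c
#include <assert.h>

#define N 800                     /* assumed value; not stated in the message */
unsigned vect_a[N];
unsigned vect_b[N];

unsigned
test4 (unsigned x)
{
  unsigned ret = 0;
  for (int i = 0; i < N; i++)
    {
      vect_b[i] = x + i;          /* store that happens before the break test */
      if (vect_a[i] > x)
        break;                    /* the early exit */
      vect_a[i] = x;              /* store skipped once the break is taken */
    }
  return ret;
}
```

Running this scalar version is a handy way to check what the vectorized code must preserve: stores sequenced before the break in the breaking iteration still happen; stores after it do not.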
2023-11-16  middle-end: skip checking loop exits if loop malformed [PR111878]  (Tamar Christina; 1 file, -0/+4)

Before my refactoring, if loop->latch was incorrect then find_loop_location skipped checking the edges and would eventually return a dummy location.  It turns out that a loop can have loops_state_satisfies_p (LOOPS_HAVE_RECORDED_EXITS) but also not have a latch, in which case get_loop_exit_edges traps.  This restores the old behavior.

gcc/ChangeLog:

        PR tree-optimization/111878
        * tree-vect-loop-manip.cc (find_loop_location): Skip edges check
        if latch incorrect.

gcc/testsuite/ChangeLog:

        PR tree-optimization/111878
        * gcc.dg/graphite/pr111878.c: New test.
2023-11-06  tree-optimization/111950 - vectorizer loop copying  (Richard Biener; 1 file, -217/+25)

The following simplifies LC-PHI arg population during epilog peeling, thereby fixing the testcase in this PR.

        PR tree-optimization/111950
        * tree-vect-loop-manip.cc (slpeel_duplicate_current_defs_from_edges):
        Remove.
        (find_guard_arg): Likewise.
        (slpeel_update_phi_nodes_for_guard2): Likewise.
        (slpeel_tree_duplicate_loop_to_edge_cfg): Remove calls to
        slpeel_duplicate_current_defs_from_edges, do not elide LC-PHIs
        for invariant values.
        (vect_do_peeling): Materialize PHI arguments for the edge around
        the epilog from the PHI defs of the main loop exit.
        * gcc.dg/torture/pr111950.c: New testcase.
2023-10-23  middle-end: don't keep .MEM guard nodes for PHI nodes that dominate the loop [PR111860]  (Tamar Christina; 1 file, -1/+20)

The previous patch tried to remove PHI nodes that dominated the first loop; however the correct fix is to only remove .MEM nodes.  This patch thus makes the condition a bit stricter and only tries to remove .MEM PHI nodes.

I couldn't figure out a way to easily determine whether a particular PHI is VUSE related, so the patch checks:

1. whether the definition is a vDEF and not defined in the main loop;
2. whether the definition is a PHI and not defined in the main loop;
3. whether the definition is a default definition.

For nos. 2 and 3 we may misidentify the PHI; in both cases the value is defined outside of the loop version block, which also makes it OK to remove.

gcc/ChangeLog:

        PR tree-optimization/111860
        * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
        Drop .MEM nodes only.

gcc/testsuite/ChangeLog:

        PR tree-optimization/111860
        * gcc.dg/vect/pr111860-2.c: New test.
        * gcc.dg/vect/pr111860-3.c: New test.
2023-10-23Avoid compile time hog on vect_peel_nonlinear_iv_init for nonlinear ↵liuhongt1-3/+25
induction vec_step_op_mul when the iteration count is too big. There's a loop in vect_peel_nonlinear_iv_init to compute init_expr * pow (step_expr, skip_niters). When skip_niters is too big, compile time blows up. To avoid that, optimize init_expr * pow (step_expr, skip_niters) to init_expr << (exact_log2 (step_expr) * skip_niters) when step_expr is a power of 2; otherwise give up on vectorization when skip_niters >= TYPE_PRECISION (TREE_TYPE (init_expr)). Also give up on vectorization when niters_skip is negative, which will be used for fully masked loops. gcc/ChangeLog: PR tree-optimization/111820 PR tree-optimization/111833 * tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): Give up vectorization for nonlinear iv vect_step_op_mul when step_expr is not exact_log2 and niters is greater than TYPE_PRECISION (TREE_TYPE (step_expr)). Also don't vectorize for negative niters_skip which will be used by fully masked loop. (vect_can_advance_ivs_p): Pass whole phi_info to vect_can_peel_nonlinear_iv_p. * tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Optimize init_expr * pow (step_expr, skipn) to init_expr << (log2 (step_expr) * skipn) when step_expr is exact_log2. gcc/testsuite/ChangeLog: * gcc.target/i386/pr111820-1.c: New test. * gcc.target/i386/pr111820-2.c: New test. * gcc.target/i386/pr111820-3.c: New test. * gcc.target/i386/pr103144-mul-1.c: Adjust testcase. * gcc.target/i386/pr103144-mul-2.c: Adjust testcase.
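The shift rewrite described above is plain modular arithmetic. A minimal Python sketch of why it is safe when step_expr is a power of two (function names are made up for illustration; this is not GCC code):

```python
def peel_init_mul(init, step, skip, bits=32):
    # Direct computation of init * step**skip in 'bits'-bit modular
    # arithmetic; cost grows with skip, which is the compile-time hog
    # the patch avoids for huge skip counts.
    mask = (1 << bits) - 1
    for _ in range(skip):
        init = (init * step) & mask
    return init

def peel_init_shift(init, step, skip, bits=32):
    # When step is a power of two, init * step**skip equals
    # init << (exact_log2(step) * skip), reduced mod 2**bits.
    assert step > 0 and step & (step - 1) == 0
    log2_step = step.bit_length() - 1
    return (init << (log2_step * skip)) & ((1 << bits) - 1)
```

For init = 3, step = 4, skip = 5 both forms give 3 * 4**5 = 3072, but the shift form is O(1) no matter how large skip is; and once exact_log2(step) * skip reaches the type precision the result is simply 0, consistent with giving up for very large skip counts.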
2023-10-20middle-end: don't pass loop_vinfo to vect_set_loop_condition during prolog ↵Tamar Christina1-1/+1
peeling During the refactoring I had passed loop_vinfo on to vect_set_loop_condition during prolog peeling. This parameter is unused in most cases except in vect_set_loop_condition_partial_vectors, where its behaviour depends on whether loop_vinfo is NULL or not. Apparently this code expects it to be NULL and reads the structures from a different location. This fixes the failing testcase, which was not using the lens values determined earlier in vectorizable_store because it was looking them up in the given loop_vinfo instead. gcc/ChangeLog: PR tree-optimization/111866 * tree-vect-loop-manip.cc (vect_do_peeling): Pass null as vinfo to vect_set_loop_condition during prolog peeling.
2023-10-19middle-end: don't create LC-SSA PHI variables for PHI nodes who dominate loopTamar Christina1-0/+15
As the testcase shows, when a PHI node dominates the loop there is no new definition inside the loop. As such there would be no PHI nodes to update. When we maintain LCSSA form we create an intermediate node in between the two loops to thread along the value. However, later on when we update the second loop we don't have any PHI nodes to update and so adjust_phi_and_debug_stmts does nothing. This leaves us with an incorrect PHI node. Normally this does nothing and just gets ignored. But in the case of the vUSE chain we end up corrupting the chain. As such, whenever a PHI node's argument dominates the loop, we should remove the newly created PHI node after edge redirection. The one exception to this is when the loop has been versioned. In such cases the versioned loop may not use the value but the second loop can. When this happens and we add the loop guard, it can't find the original value for use inside the guard block unless the join block has the PHI. The next refactoring in the series moves the formation of the guard block inside peeling itself. There we have all the information and wouldn't need to re-create it later. gcc/ChangeLog: PR tree-optimization/111860 * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Remove PHI nodes that dominate loop. gcc/testsuite/ChangeLog: PR tree-optimization/111860 * gcc.dg/vect/pr111860.c: New test.
2023-10-18middle-end: maintain LCSSA throughout loop peelingTamar Christina1-189/+146
This final patch updates peeling to maintain LCSSA all the way through. It's significantly easier to maintain it during peeling, while we still know where all new edges connect, rather than touching it up later as is currently being done. This allows us to remove many of the helper functions that touch up the loops at various points. The only complication is for loop distribution, where we should be able to use the same approach; however ldist, depending on whether redirect_lc_phi_defs is true or not, will either try to maintain a limited LCSSA form itself or remove all non-virtual PHIs. The problem here is that if we maintain LCSSA then in some cases the blocks connecting the two loops get PHIs to keep the loop IV up to date. However there is no loop: the guard condition is rewritten as 0 != 0, so the "loop" always exits. However, due to the PHI nodes the probabilities get completely wrong. It seems to think that the impossible exit is the likely edge. This causes incorrect warnings, and the presence of the PHIs prevents the blocks from being simplified. While it may be possible to make ldist work with LCSSA form, doing so seems more work than not. For that reason the peeling code has an additional parameter, used only by ldist, to not connect the two loops during peeling. This preserves the current behaviour of ldist until I can dive into the implementation more. Hopefully that's OK for now. gcc/ChangeLog: * tree-loop-distribution.cc (copy_loop_before): Request no LCSSA. * tree-vect-loop-manip.cc (adjust_phi_and_debug_stmts): Add additional asserts. (slpeel_tree_duplicate_loop_to_edge_cfg): Keep LCSSA during peeling. (find_guard_arg): Look value up through explicit edge and original defs. (vect_do_peeling): Use it. (slpeel_update_phi_nodes_for_guard2): Take explicit exit edge. (slpeel_update_phi_nodes_for_lcssa, slpeel_update_phi_nodes_for_loops): Remove. * tree-vect-loop.cc (vect_create_epilog_for_reduction): Initialize phi. 
* tree-vectorizer.h (slpeel_tree_duplicate_loop_to_edge_cfg): Add optional param to turn off LCSSA mode.
2023-10-18middle-end: updated niters analysis to handle multiple exits.Tamar Christina1-0/+14
This second part updates niters analysis to be able to analyze any number of exits. If we have multiple exits we determine the main exit by finding the first counting IV. The change allows the vectorizer to pass analysis for loops with multiple exits, but we later gracefully reject them. It does, however, allow us to test whether the exit handling uses the right exit everywhere. Additionally, since we analyze all exits, we now return all conditions for them and determine which condition belongs to the main exit. The main condition is needed because the vectorizer needs to ignore the main IV condition during vectorization, as it will replace it during codegen. To track versioned loops we extend the contract between ifcvt and the vectorizer to store the exit number in aux so that we can match it up again during peeling. gcc/ChangeLog: * tree-if-conv.cc (tree_if_conversion): Record exits in aux. * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Use it. * tree-vect-loop.cc (vect_get_loop_niters): Determine main exit. (vec_init_loop_exit_info): Extend analysis when multiple exits. (vect_analyze_loop_form): Record conds and determine main cond. (vect_create_loop_vinfo): Extend bookkeeping of conds. (vect_analyze_loop): Release conds. * tree-vectorizer.h (LOOP_VINFO_LOOP_CONDS, LOOP_VINFO_LOOP_IV_COND): New. (struct vect_loop_form_info): Add conds, alt_loop_conds; (struct loop_vec_info): Add conds, loop_iv_cond.
2023-10-18middle-end: Refactor vectorizer loop conditionals and separate out IV to new ↵Tamar Christina1-68/+102
variables This is extracted out of the patch series to support early break vectorization in order to simplify the review of that patch series. The goal of this one is to separate out the refactoring from the new functionality. This first patch separates out the vectorizer's definition of an exit into its own values inside loop_vinfo. During vectorization we can have three separate copies for each loop: scalar, vectorized, epilogue. The scalar loop can also be the versioned loop before peeling. Because of this we track 3 different exits inside loop_vinfo, corresponding to each of these loops. Additionally, each function that uses an exit will now take the exit explicitly as an argument when it is not obviously clear which exit is needed. This is because oftentimes the callers switch the loops being passed around. While the caller knows which loop it is, the callee does not. For now the loop exits are simply initialized to the same value as before, determined by single_exit (..). No change in functionality is expected throughout this patch series. gcc/ChangeLog: * tree-loop-distribution.cc (copy_loop_before): Pass exit explicitly. (loop_distribution::distribute_loop): Bail out if not single exit. * tree-scalar-evolution.cc (get_loop_exit_condition): New. * tree-scalar-evolution.h (get_loop_exit_condition): New. * tree-vect-data-refs.cc (vect_enhance_data_refs_alignment): Pass exit explicitly. * tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors, vect_set_loop_condition_partial_vectors_avx512, vect_set_loop_condition_normal, vect_set_loop_condition): Explicitly take exit. (slpeel_tree_duplicate_loop_to_edge_cfg): Explicitly take exit and return the corresponding peeled exit. (slpeel_can_duplicate_loop_p): Explicitly take exit. (find_loop_location): Handle not knowing an explicit exit. (vect_update_ivs_after_vectorizer, vect_gen_vector_loop_niters_mult_vf, find_guard_arg, slpeel_update_phi_nodes_for_loops, slpeel_update_phi_nodes_for_guard2): Use new exits. 
(vect_do_peeling): Update bookkeeping to keep track of exits. * tree-vect-loop.cc (vect_get_loop_niters): Explicitly take exit to analyze. (vec_init_loop_exit_info): New. (_loop_vec_info::_loop_vec_info): Initialize vec_loop_iv, vec_epilogue_loop_iv, scalar_loop_iv. (vect_analyze_loop_form): Initialize exits. (vect_create_loop_vinfo): Set main exit. (vect_create_epilog_for_reduction, vectorizable_live_operation, vect_transform_loop): Use it. (scale_profile_for_vect_loop): Explicitly take exit to scale. * tree-vectorizer.cc (set_uid_loop_bbs): Initialize loop exit. * tree-vectorizer.h (LOOP_VINFO_IV_EXIT, LOOP_VINFO_EPILOGUE_IV_EXIT, LOOP_VINFO_SCALAR_IV_EXIT): New. (struct loop_vec_info): Add vec_loop_iv, vec_epilogue_loop_iv, scalar_loop_iv. (vect_set_loop_condition, slpeel_can_duplicate_loop_p, slpeel_tree_duplicate_loop_to_edge_cfg): Take explicit exits. (vec_init_loop_exit_info): New. (struct vect_loop_form_info): Add loop_exit.
2023-10-09Fixes for profile count/probability maintenanceEugene Rozenfeld1-1/+1
Verifier checks have recently been strengthened to check that all counts and probabilities are initialized. The checks fired during an autoprofiledbootstrap build, and this patch fixes that. Tested on x86_64-pc-linux-gnu. gcc/ChangeLog: * auto-profile.cc (afdo_calculate_branch_prob): Fix count comparisons. * tree-vect-loop-manip.cc (vect_do_peeling): Guard against zero count when scaling loop profile.
2023-08-07Fix profile update after versioning ifconverted loopJan Hubicka1-3/+11
If a loop is if-converted and later versioned by the vectorizer, the vectorizer will reuse the scalar loop produced by ifcvt. Curiously enough, it does not seem to do so for versions produced by loop distribution, even though for loop distribution this matters (since both ldist versions survive to the final code), while after ifcvt it does not (since we remove the non-vectorized path). This patch fixes the associated profile update. Here it is necessary to scale both arms of the conditional according to the runtime checks inserted. We got the loop body partly right, but not the preheader block and the block after the exit. The first is particularly bad since it changes loop iteration estimates. So we now turn 4 original loops: loop 1: iterations by profile: 473.497707 (reliable) entry count:84821 (precise, freq 0.9979) loop 2: iterations by profile: 100.000000 (reliable) entry count:39848881 (precise, freq 468.8104) loop 3: iterations by profile: 100.000000 (reliable) entry count:39848881 (precise, freq 468.8104) loop 4: iterations by profile: 100.999596 (reliable) entry count:84167 (precise, freq 0.9902) into the following loops: iterations by profile: 5.312499 (unreliable, maybe flat) entry count:12742188 (guessed, freq 149.9081) vectorized and split loop 1, peeled iterations by profile: 0.009496 (unreliable, maybe flat) entry count:374798 (guessed, freq 4.4094) split loop 1 (last iteration), peeled iterations by profile: 100.000008 (unreliable) entry count:3945039 (guessed, freq 46.4122) scalar version of loop 1 iterations by profile: 100.000007 (unreliable) entry count:7101070 (guessed, freq 83.5420) redundant scalar version of loop 1 which we could eliminate if the vectorizer understood ldist iterations by profile: 100.000000 (unreliable) entry count:35505353 (guessed, freq 417.7100) unvectorized loop 2 iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, freq 300.7512) vectorized loop 2, not peeled (hits max-peel-insns) iterations by profile: 100.000007 (unreliable) entry 
count:7101070 (guessed, freq 83.5420) unvectorized loop 3 iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, freq 300.7512) vectorized loop 3, not peeled (hits max-peel-insns) iterations by profile: 473.497707 (reliable) entry count:84821 (precise, freq 0.9979) loop 1 iterations by profile: 100.999596 (reliable) entry count:84167 (precise, freq 0.9902) loop 4 With this change we are at 0 profile errors on the hmmer benchmark: Pass dump id |dynamic mismatch |overall | |in count |size |time | 172t ch_vect | 0 | 996 | 385812023346 | 173t ifcvt | 71010686 +71010686| 1021 +2.5%| 468361969416 +21.4%| 174t vect | 210830784 +139820098| 1497 +46.6%| 216073467874 -53.9%| 175t dce | 210830784 | 1387 -7.3%| 205273170281 -5.0%| 176t pcom | 210830784 | 1387 | 201722634966 -1.7%| 177t cunroll | 0 -210830784| 1443 +4.0%| 180441501289 -10.5%| 182t ivopts | 0 | 1385 -4.0%| 136412345683 -24.4%| 183t lim | 0 | 1389 +0.3%| 135093950836 -1.0%| 192t reassoc | 0 | 1381 -0.6%| 134778347700 -0.2%| 193t slsr | 0 | 1380 -0.1%| 134738100330 -0.0%| 195t tracer | 0 | 1521 +10.2%| 134738179146 +0.0%| 196t fre | 2680654 +2680654| 1489 -2.1%| 134659672725 -0.1%| 198t dom | 5361308 +2680654| 1473 -1.1%| 134449553658 -0.2%| 201t vrp | 5361308 | 1474 +0.1%| 134489004050 +0.0%| 202t ccp | 5361308 | 1472 -0.1%| 134440752274 -0.0%| 204t dse | 5361308 | 1444 -1.9%| 133802300525 -0.5%| 206t forwprop| 5361308 | 1433 -0.8%| 133542828370 -0.2%| 207t sink | 5361308 | 1431 -0.1%| 133542658728 -0.0%| 211t store-me| 5361308 | 1430 -0.1%| 133542573728 -0.0%| 212t cddce | 5361308 | 1428 -0.1%| 133541776728 -0.0%| 258r expand | 5361308 |----------------|--------------------| 260r into_cfg| 5361308 | 9334 -0.8%| 885820707913 -0.6%| 261r jump | 5361308 | 9330 -0.0%| 885820367913 -0.0%| 265r fwprop1 | 5361308 | 9206 -1.3%| 876756504385 -1.0%| 267r rtl pre | 5361308 | 9210 +0.0%| 876914305953 +0.0%| 269r cprop | 5361308 | 9202 -0.1%| 876756165101 -0.0%| 271r cse_loca| 5361308 | 9198 -0.0%| 
876727760821 -0.0%| 272r ce1 | 5361308 | 9126 -0.8%| 875726815885 -0.1%| 276r loop2_in| 5361308 | 9167 +0.4%| 873573110570 -0.2%| 282r cprop | 5361308 | 9095 -0.8%| 871937317262 -0.2%| 284r cse2 | 5361308 | 9091 -0.0%| 871936977978 -0.0%| 285r dse1 | 5361308 | 9067 -0.3%| 871437031602 -0.1%| 290r combine | 5361308 | 9071 +0.0%| 869206278202 -0.3%| 292r stv | 5361308 | 17157 +89.1%| 2111071925708+142.9%| 295r bbpart | 5361308 | 17161 +0.0%| 2111071925708 | 296r outof_cf| 5361308 | 17233 +0.4%| 2111655121000 +0.0%| 297r split1 | 5361308 | 17245 +0.1%| 2111656138852 +0.0%| 306r ira | 5361308 | 19189 +11.3%| 2136098398308 +1.2%| 307r reload | 5361308 | 12101 -36.9%| 981091222830 -54.1%| 309r postrelo| 5361308 | 12019 -0.7%| 978750345475 -0.2%| 310r gcse2 | 5361308 | 12027 +0.1%| 978329108320 -0.0%| 311r split2 | 5361308 | 12023 -0.0%| 978507631352 +0.0%| 312r ree | 5361308 | 12027 +0.0%| 978505414244 -0.0%| 313r cmpelim | 5361308 | 11979 -0.4%| 977531601988 -0.1%| 314r pro_and_| 5361308 | 12091 +0.9%| 977541801988 +0.0%| 315r dse2 | 5361308 | 12091 | 977541801988 | 316r csa | 5361308 | 12087 -0.0%| 977541461988 -0.0%| 317r jump2 | 5361308 | 12039 -0.4%| 977683176572 +0.0%| 318r compgoto| 5361308 | 12039 | 977683176572 | 320r peephole| 5361308 | 12047 +0.1%| 977362727612 -0.0%| 321r ce3 | 5361308 | 12047 | 977362727612 | 323r cprop_ha| 5361308 | 11907 -1.2%| 968751076676 -0.9%| 324r rtl_dce | 5361308 | 11903 -0.0%| 968593274820 -0.0%| 325r bbro | 5361308 | 11883 -0.2%| 967964046644 -0.1%| Bootstrapped/regtested x86_64-linux, plan to commit it tomorrow if there are no complaints. gcc/ChangeLog: PR tree-optimization/106293 * tree-vect-loop-manip.cc (vect_loop_versioning): Fix profile update. * tree-vect-loop.cc (vect_transform_loop): Likewise. gcc/testsuite/ChangeLog: PR tree-optimization/106293 * gcc.dg/vect/vect-cond-11.c: Check profile consistency. * gcc.dg/vect/vect-widen-mult-extern-1.c: Check profile consistency.
2023-08-06Fix profile update after peeled epiloguesJan Hubicka1-2/+11
Epilogue peeling expects the scalar loop to have the same number of executions as the vector loop, which is true at the beginning of vectorization. However, if the epilogues are vectorized, this is no longer the case. In this situation the loop preheader is replaced by new guard code with a correct profile; however, the loop body is left unscaled. This leads to a loop that exits more often than it is entered. This patch adds logic to scale the frequencies down and also to fix the profile of the original preheader where necessary. Bootstrapped/regtested x86_64-linux, committed. gcc/ChangeLog: * tree-vect-loop-manip.cc (vect_do_peeling): Fix profile update of peeled epilogues. gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-bitfield-read-1.c: Check profile consistency. * gcc.dg/vect/vect-bitfield-read-2.c: Check profile consistency. * gcc.dg/vect/vect-bitfield-read-3.c: Check profile consistency. * gcc.dg/vect/vect-bitfield-read-4.c: Check profile consistency. * gcc.dg/vect/vect-bitfield-read-5.c: Check profile consistency. * gcc.dg/vect/vect-bitfield-read-6.c: Check profile consistency. * gcc.dg/vect/vect-bitfield-read-7.c: Check profile consistency. * gcc.dg/vect/vect-bitfield-write-1.c: Check profile consistency. * gcc.dg/vect/vect-bitfield-write-2.c: Check profile consistency. * gcc.dg/vect/vect-bitfield-write-3.c: Check profile consistency. * gcc.dg/vect/vect-bitfield-write-4.c: Check profile consistency. * gcc.dg/vect/vect-bitfield-write-5.c: Check profile consistency. * gcc.dg/vect/vect-epilogues-2.c: Check profile consistency. * gcc.dg/vect/vect-epilogues.c: Check profile consistency. * gcc.dg/vect/vect-mask-store-move-1.c: Check profile consistency.
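The fix amounts to rescaling the epilogue body by the ratio of the new entry count to the old one, so the loop no longer appears to exit more often than it is entered. A toy Python model of that arithmetic (made-up names; this is only a sketch of the idea, not the GCC implementation):

```python
def rescale_epilogue_body(body_counts, old_entry_count, new_entry_count):
    # After the new guard code reduces how often the epilogue loop is
    # entered, scale every body count by the same ratio so the body
    # profile is consistent with the entry count again.
    assert old_entry_count > 0
    scale = new_entry_count / old_entry_count
    return [round(c * scale) for c in body_counts]
```

For example, an epilogue body profiled for 1000 entries that is now entered only 125 times gets all of its block counts scaled by 1/8.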
2023-08-01Fix profile update after prologue peeling in vectorizerJan Hubicka1-2/+7
This patch fixes the profile update after constant prologue peeling. We have now reached 0 profile update bugs on tramp3d vectorization and also on quite a few testcases, so I am enabling the testsuite checks so we do not regress again. gcc/ChangeLog: * tree-vect-loop-manip.cc (vect_do_peeling): Fix profile update after constant prologue peeling. gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-1-big-array.c: Check profile consistency. * gcc.dg/vect/vect-1.c: Check profile consistency. * gcc.dg/vect/vect-10-big-array.c: Check profile consistency. * gcc.dg/vect/vect-10.c: Check profile consistency. * gcc.dg/vect/vect-100.c: Check profile consistency. * gcc.dg/vect/vect-103.c: Check profile consistency. * gcc.dg/vect/vect-104.c: Check profile consistency. * gcc.dg/vect/vect-105-big-array.c: Check profile consistency. * gcc.dg/vect/vect-105.c: Check profile consistency. * gcc.dg/vect/vect-106.c: Check profile consistency. * gcc.dg/vect/vect-107.c: Check profile consistency. * gcc.dg/vect/vect-108.c: Check profile consistency. * gcc.dg/vect/vect-109.c: Check profile consistency. * gcc.dg/vect/vect-11.c: Check profile consistency. * gcc.dg/vect/vect-110.c: Check profile consistency. * gcc.dg/vect/vect-112-big-array.c: Check profile consistency. * gcc.dg/vect/vect-112.c: Check profile consistency. * gcc.dg/vect/vect-113.c: Check profile consistency. * gcc.dg/vect/vect-114.c: Check profile consistency. * gcc.dg/vect/vect-115.c: Check profile consistency. * gcc.dg/vect/vect-116.c: Check profile consistency. * gcc.dg/vect/vect-117.c: Check profile consistency. * gcc.dg/vect/vect-118.c: Check profile consistency. * gcc.dg/vect/vect-119.c: Check profile consistency. * gcc.dg/vect/vect-11a.c: Check profile consistency. * gcc.dg/vect/vect-12.c: Check profile consistency. * gcc.dg/vect/vect-120.c: Check profile consistency. * gcc.dg/vect/vect-121.c: Check profile consistency. * gcc.dg/vect/vect-122.c: Check profile consistency. * gcc.dg/vect/vect-123.c: Check profile consistency. 
* gcc.dg/vect/vect-124.c: Check profile consistency. * gcc.dg/vect/vect-126.c: Check profile consistency. * gcc.dg/vect/vect-13.c: Check profile consistency. * gcc.dg/vect/vect-14.c: Check profile consistency. * gcc.dg/vect/vect-15-big-array.c: Check profile consistency. * gcc.dg/vect/vect-15.c: Check profile consistency. * gcc.dg/vect/vect-17.c: Check profile consistency. * gcc.dg/vect/vect-18.c: Check profile consistency. * gcc.dg/vect/vect-19.c: Check profile consistency. * gcc.dg/vect/vect-2-big-array.c: Check profile consistency. * gcc.dg/vect/vect-2.c: Check profile consistency. * gcc.dg/vect/vect-20.c: Check profile consistency. * gcc.dg/vect/vect-21.c: Check profile consistency. * gcc.dg/vect/vect-22.c: Check profile consistency. * gcc.dg/vect/vect-23.c: Check profile consistency. * gcc.dg/vect/vect-24.c: Check profile consistency. * gcc.dg/vect/vect-25.c: Check profile consistency. * gcc.dg/vect/vect-26.c: Check profile consistency. * gcc.dg/vect/vect-27.c: Check profile consistency. * gcc.dg/vect/vect-28.c: Check profile consistency. * gcc.dg/vect/vect-29.c: Check profile consistency. * gcc.dg/vect/vect-3.c: Check profile consistency. * gcc.dg/vect/vect-30.c: Check profile consistency. * gcc.dg/vect/vect-31-big-array.c: Check profile consistency. * gcc.dg/vect/vect-31.c: Check profile consistency. * gcc.dg/vect/vect-32-big-array.c: Check profile consistency. * gcc.dg/vect/vect-32-chars.c: Check profile consistency. * gcc.dg/vect/vect-32.c: Check profile consistency. * gcc.dg/vect/vect-33-big-array.c: Check profile consistency. * gcc.dg/vect/vect-33.c: Check profile consistency. * gcc.dg/vect/vect-34-big-array.c: Check profile consistency. * gcc.dg/vect/vect-34.c: Check profile consistency. * gcc.dg/vect/vect-35-big-array.c: Check profile consistency. * gcc.dg/vect/vect-35.c: Check profile consistency. * gcc.dg/vect/vect-36-big-array.c: Check profile consistency. * gcc.dg/vect/vect-36.c: Check profile consistency. 
* gcc.dg/vect/vect-38.c: Check profile consistency. * gcc.dg/vect/vect-4.c: Check profile consistency. * gcc.dg/vect/vect-40.c: Check profile consistency. * gcc.dg/vect/vect-42.c: Check profile consistency. * gcc.dg/vect/vect-44.c: Check profile consistency. * gcc.dg/vect/vect-46.c: Check profile consistency. * gcc.dg/vect/vect-48.c: Check profile consistency. * gcc.dg/vect/vect-5.c: Check profile consistency. * gcc.dg/vect/vect-50.c: Check profile consistency. * gcc.dg/vect/vect-52.c: Check profile consistency. * gcc.dg/vect/vect-54.c: Check profile consistency. * gcc.dg/vect/vect-56.c: Check profile consistency. * gcc.dg/vect/vect-58.c: Check profile consistency. * gcc.dg/vect/vect-6-big-array.c: Check profile consistency. * gcc.dg/vect/vect-6.c: Check profile consistency. * gcc.dg/vect/vect-60.c: Check profile consistency. * gcc.dg/vect/vect-62.c: Check profile consistency. * gcc.dg/vect/vect-63.c: Check profile consistency. * gcc.dg/vect/vect-64.c: Check profile consistency. * gcc.dg/vect/vect-65.c: Check profile consistency. * gcc.dg/vect/vect-66.c: Check profile consistency. * gcc.dg/vect/vect-67.c: Check profile consistency. * gcc.dg/vect/vect-68.c: Check profile consistency. * gcc.dg/vect/vect-7.c: Check profile consistency. * gcc.dg/vect/vect-70.c: Check profile consistency. * gcc.dg/vect/vect-71.c: Check profile consistency. * gcc.dg/vect/vect-72.c: Check profile consistency. * gcc.dg/vect/vect-73-big-array.c: Check profile consistency. * gcc.dg/vect/vect-73.c: Check profile consistency. * gcc.dg/vect/vect-74-big-array.c: Check profile consistency. * gcc.dg/vect/vect-74.c: Check profile consistency. * gcc.dg/vect/vect-75-big-array.c: Check profile consistency. * gcc.dg/vect/vect-75.c: Check profile consistency. * gcc.dg/vect/vect-76-big-array.c: Check profile consistency. * gcc.dg/vect/vect-76.c: Check profile consistency. * gcc.dg/vect/vect-77-alignchecks.c: Check profile consistency. * gcc.dg/vect/vect-77-global.c: Check profile consistency. 
* gcc.dg/vect/vect-77.c: Check profile consistency. * gcc.dg/vect/vect-78-alignchecks.c: Check profile consistency. * gcc.dg/vect/vect-78-global.c: Check profile consistency. * gcc.dg/vect/vect-78.c: Check profile consistency. * gcc.dg/vect/vect-8.c: Check profile consistency. * gcc.dg/vect/vect-80-big-array.c: Check profile consistency. * gcc.dg/vect/vect-80.c: Check profile consistency. * gcc.dg/vect/vect-82.c: Check profile consistency. * gcc.dg/vect/vect-82_64.c: Check profile consistency. * gcc.dg/vect/vect-83.c: Check profile consistency. * gcc.dg/vect/vect-83_64.c: Check profile consistency. * gcc.dg/vect/vect-85-big-array.c: Check profile consistency. * gcc.dg/vect/vect-85.c: Check profile consistency. * gcc.dg/vect/vect-86.c: Check profile consistency. * gcc.dg/vect/vect-87.c: Check profile consistency. * gcc.dg/vect/vect-88.c: Check profile consistency. * gcc.dg/vect/vect-89-big-array.c: Check profile consistency. * gcc.dg/vect/vect-89.c: Check profile consistency. * gcc.dg/vect/vect-9.c: Check profile consistency. * gcc.dg/vect/vect-91.c: Check profile consistency. * gcc.dg/vect/vect-92.c: Check profile consistency. * gcc.dg/vect/vect-93.c: Check profile consistency. * gcc.dg/vect/vect-95.c: Check profile consistency. * gcc.dg/vect/vect-96.c: Check profile consistency. * gcc.dg/vect/vect-97-big-array.c: Check profile consistency. * gcc.dg/vect/vect-97.c: Check profile consistency. * gcc.dg/vect/vect-98-big-array.c: Check profile consistency. * gcc.dg/vect/vect-98.c: Check profile consistency. * gcc.dg/vect/vect-99.c: Check profile consistency.
2023-07-29Fix profile update after vectorize loop versioningJan Hubicka1-3/+12
While versioning a loop, the vectorizer produces a versioned loop guarded by two conditionals of the form if (cond1) goto scalar_loop else goto next_bb next_bb: if (cond2) goto scalar_loop else goto vector_loop It wants the combined test to be prob (which is set to likely) and uses profile_probability::split to determine the probabilities of cond1 and cond2. However, splitting turns: if (cond) goto lab; // ORIG probability into if (cond1) goto lab; // FIRST = ORIG * CPROB probability if (cond2) goto lab; // SECOND probability which is an "or" instead of an "and". As a result we get a pretty low probability of entering the vectorized loop. This patch fixes it by introducing sqrt on profile probability (which is the correct way to split this) and also adding pow, which is needed elsewhere. During loop versioning I now produce code as if there were only one combined conditional and then update the probability of the conditional produced (containing cond1). Later the edge is split and a new conditional is added. At that time it is necessary to update the probability of the BB containing the second conditional so everything matches. gcc/ChangeLog: * profile-count.cc (profile_probability::sqrt): New member function. (profile_probability::pow): Likewise. * profile-count.h: (profile_probability::sqrt): Declare. (profile_probability::pow): Likewise. * tree-vect-loop-manip.cc (vect_loop_versioning): Fix profile update.
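The difference between the old "or"-style split and the new sqrt-based "and"-style split can be checked numerically. A small Python illustration of the two probability models (this sketches the math only, not the profile_probability implementation):

```python
import math

def and_split(prob):
    # New scheme: each of the two sequential tests passes with
    # sqrt(prob), so passing *both* tests has probability exactly prob.
    p = math.sqrt(prob)
    return p, p

def or_split(prob, cprob):
    # Old split semantics: models
    #   if (cond1) goto lab;  // first = prob * cprob
    #   if (cond2) goto lab;  // second
    # so that taking *either* branch has combined probability prob.
    first = prob * cprob
    second = (prob - first) / (1.0 - first)
    return first, second
```

With prob = 0.81 the and-split gives 0.9 per test (0.9 * 0.9 = 0.81); misusing the or-split formula for a both-must-pass guard is what made the vector path look far less likely than intended.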
2023-07-07Fix epilogue loop profileJan Hubicka1-0/+2
Fix two bugs in scale_loop_profile which crept in during my cleanups and, curiously enough, did not show on the testcases we have so far. The patch also adds the missing call to cap the iteration count of vectorized loop epilogues. The vectorizer profile needs more work, but I am trying to chase out the obvious bugs first so the profile quality statistics become meaningful and we can try to improve on them. Now we get: Pass dump id and name |static mismatch |dynamic mismatch |in count |in count 107t cunrolli | 3 +3| 17251 +17251 116t vrp | 5 +2| 30908 +16532 118t dce | 3 -2| 17251 -13657 127t ch | 13 +10| 17251 131t dom | 39 +26| 17251 133t isolate-paths | 47 +8| 17251 134t reassoc | 49 +2| 17251 136t forwprop | 53 +4| 202501 +185250 159t cddce | 61 +8| 216211 +13710 161t ldist | 62 +1| 216211 172t ifcvt | 66 +4| 373711 +157500 173t vect | 143 +77| 9801947 +9428236 176t cunroll | 149 +6| 12006408 +2204461 183t loopdone | 146 -3| 11944469 -61939 195t fre | 142 -4| 11944469 197t dom | 141 -1| 13038435 +1093966 199t threadfull | 143 +2| 13246410 +207975 200t vrp | 145 +2| 13444579 +198169 204t dce | 143 -2| 13371315 -73264 206t sink | 141 -2| 13371315 211t cddce | 147 +6| 13372755 +1440 255t optimized | 145 -2| 13372755 256r expand | 141 -4| 13371197 -1558 258r into_cfglayout | 139 -2| 13371197 275r loop2_unroll | 143 +4| 16792056 +3420859 291r ce2 | 141 -2| 16811462 312r pro_and_epilogue | 161 +20| 16873400 +61938 315r jump2 | 167 +6| 20910158 +4036758 323r bbro | 160 -7| 16559844 -4350314 Vect still introduces 77 profile mismatches (same as without this patch); however, subsequent cunroll works much better, with 6 new mismatches compared to 78. Overall it reduces 229 mismatches to 160. Also the overall runtime estimate is now reduced by 6.9%. Previously the overall runtime estimate grew by 11%, which was a result of the fact that the epilogue profile was pretty much the same as the profile of the original loop. Bootstrapped/regtested x86_64-linux, committed. 
gcc/ChangeLog: * cfgloopmanip.cc (scale_loop_profile): Fix computation of count_in and scaling blocks after exit. * tree-vect-loop-manip.cc (vect_do_peeling): Scale loop profile of the epilogue if bound is known. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/vect-profile-upate.c: New test.
2023-07-06Improve scale_loop_profileJan Hubicka1-3/+3
The original scale_loop_profile was implemented to only handle the very simple loops produced by the vectorizer at that time (basically loops with only one exit and no subloops). It also had not been updated to the new profile-count API very carefully. The function does two things: 1) scales down the loop profile by a given probability. This is useful, for example, to scale down the profile after peeling, when the loop body is executed less often than before. 2) updates the profile to cap the iteration count by the ITERATION_BOUND parameter. I changed ITERATION_BOUND to be the actual bound on the number of iterations as used elsewhere (i.e. the number of executions of the latch edge) rather than the number of iterations + 1 as it was before. To do 2) one needs to do the following: a) scale the loop's own profile so the frequency of the header is at most the sum of in-edge counts * (iteration_bound + 1) b) update loop exit probabilities so their count is the same as before scaling. c) reduce frequencies of basic blocks after the loop exit. The old code did b) by setting the probability to 1 / iteration_bound, which is correct only if the basic block containing the exit executes precisely once per iteration (i.e. it is not inside another conditional or an inner loop). This is fixed now by using set_edge_probability_and_rescale_others. Also, c) was implemented only for the special case when the exit was just before the latch basic block. I now use dominance info to get some of the additional cases right. I still did not try to do anything for multiple-exit loops, though the implementation could be generalized. Bootstrapped/regtested x86_64-linux. Plan to commit it tonight if there are no complaints. gcc/ChangeLog: * cfgloopmanip.cc (scale_loop_profile): Rewrite exit edge probability update to be safe on loops with subloops. Make bound parameter to be iteration bound. * tree-ssa-loop-ivcanon.cc (try_peel_loop): Update call of scale_loop_profile. * tree-vect-loop-manip.cc (vect_do_peeling): Likewise.
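The capping step in a) above is simple arithmetic on profile counts. A minimal Python sketch (hypothetical names, not the cfgloopmanip.cc code) under the new convention that the bound counts latch-edge executions:

```python
def cap_header_count(entry_count, header_count, iteration_bound):
    # iteration_bound counts executions of the latch edge, so the
    # header can run at most iteration_bound + 1 times per loop entry.
    max_header = entry_count * (iteration_bound + 1)
    return min(header_count, max_header)
```

For instance, with 100 loop entries and a bound of 4 latch executions, a profiled header count of 100000 is capped to 100 * 5 = 500, while a header count already below the cap is left alone.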
2023-07-06tree-optimization/110563 - simplify epilogue VF checksRichard Biener1-2/+1
The following consolidates an assert that now hits for ppc64le with an earlier check we already do, simplifying vect_determine_partial_vectors_and_peeling and getting rid of its now redundant argument. PR tree-optimization/110563 * tree-vectorizer.h (vect_determine_partial_vectors_and_peeling): Remove second argument. * tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling): Remove for_epilogue_p argument. Merge assert ... (vect_analyze_loop_2): ... with check done before determining partial vectors by moving it after. * tree-vect-loop-manip.cc (vect_do_peeling): Adjust.
2023-07-04tree-optimization/110310 - move vector epilogue disabling to analysis phaseRichard Biener1-84/+20
The following moves the late decision to elide vectorized epilogues to the analysis phase and also avoids altering the epilogue's niter. The costing part from vect_determine_partial_vectors_and_peeling is moved to vect_analyze_loop_costing where we use the main loop analysis to constrain the epilogue scalar iterations. I have not tried to integrate this with vect_known_niters_smaller_than_vf. It seems the for_epilogue_p parameter in vect_determine_partial_vectors_and_peeling is largely useless and we could compute that in the function itself. PR tree-optimization/110310 * tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling): Move costing part ... (vect_analyze_loop_costing): ... here. Integrate better estimate for epilogues from ... (vect_analyze_loop_2): Call vect_determine_partial_vectors_and_peeling with actual epilogue status. * tree-vect-loop-manip.cc (vect_do_peeling): ... here and avoid cancelling epilogue vectorization. (vect_update_epilogue_niters): Remove. No longer update epilogue LOOP_VINFO_NITERS. * gcc.target/i386/pr110310.c: New testcase. * gcc.dg/vect/slp-perm-12.c: Disable epilogue vectorization.
2023-06-19vect: Restore aarch64 bootstrapRichard Sandiford1-1/+2
gcc/ * tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors): Handle null niters_skip.
2023-06-19AVX512 fully masked vectorizationRichard Biener1-6/+256
This implements fully masked vectorization or a masked epilog for AVX512 style masks, which set themselves apart by representing each lane with a single bit and by using integer modes for the mask (both much like GCN). AVX512 is also special in that it doesn't have any instruction to compute the mask from a scalar IV like SVE has with while_ult. Instead the masks are produced by vector compares and the loop control retains the scalar IV (mainly to avoid dependences on mask generation; a suitable mask test instruction is available). Like RVV, code generation prefers a decrementing IV, though IVOPTs messes things up in some cases, removing that IV to eliminate it with an incrementing one used for address generation. One of the motivating testcases is from PR108410, which in turn is extracted from x264, where large-size vectorization shows issues with small trip loops. Execution time there improves compared to classic AVX512 with AVX2 epilogues for the cases of fewer than 32 iterations.

size  scalar    128    256    512   512e   512f
   1    9.42  11.32   9.35  11.17  15.13  16.89
   2    5.72   6.53   6.66   6.66   7.62   8.56
   3    4.49   5.10   5.10   5.74   5.08   5.73
   4    4.10   4.33   4.29   5.21   3.79   4.25
   6    3.78   3.85   3.86   4.76   2.54   2.85
   8    3.64   1.89   3.76   4.50   1.92   2.16
  12    3.56   2.21   3.75   4.26   1.26   1.42
  16    3.36   0.83   1.06   4.16   0.95   1.07
  20    3.39   1.42   1.33   4.07   0.75   0.85
  24    3.23   0.66   1.72   4.22   0.62   0.70
  28    3.18   1.09   2.04   4.20   0.54   0.61
  32    3.16   0.47   0.41   0.41   0.47   0.53
  34    3.16   0.67   0.61   0.56   0.44   0.50
  38    3.19   0.95   0.95   0.82   0.40   0.45
  42    3.09   0.58   1.21   1.13   0.36   0.40

'size' specifies the number of actual iterations; 512e is for a masked epilog and 512f for the fully masked loop. From 4 scalar iterations on, the AVX512 masked epilog code is clearly the winner; the fully masked variant is clearly worse and its size benefit is also tiny. This patch does not enable using fully masked loops or masked epilogues by default. More work on cost modeling and vectorization kind selection on x86_64 is necessary for this.
Implementation wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE which could be exploited further to unify some of the flags we have right now but there didn't seem to be many easy things to merge, so I'm leaving this for followups. Mask requirements as registered by vect_record_loop_mask are kept in their original form and recorded in a hash_set now instead of being processed to a vector of rgroup_controls. Instead that's now left to the final analysis phase which tries forming the rgroup_controls vector using while_ult and if that fails now tries AVX512 style which needs a different organization and instead fills a hash_map with the relevant info. vect_get_loop_mask now has two implementations, one for the two mask styles we then have. I have decided against interweaving vect_set_loop_condition_partial_vectors with conditions to do AVX512 style masking and instead opted to "duplicate" this to vect_set_loop_condition_partial_vectors_avx512. Likewise for vect_verify_full_masking vs vect_verify_full_masking_avx512. The vect_prepare_for_masked_peels hunk might run into issues with SVE, I didn't check yet but using LOOP_VINFO_RGROUP_COMPARE_TYPE looked odd. Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run the testsuite with --param vect-partial-vector-usage=2 with and without -fno-vect-cost-model and filed two bugs, one ICE (PR110221) and one latent wrong-code (PR110237). * tree-vectorizer.h (enum vect_partial_vector_style): New. (_loop_vec_info::partial_vector_style): Likewise. (LOOP_VINFO_PARTIAL_VECTORS_STYLE): Likewise. (rgroup_controls::compare_type): Add. (vec_loop_masks): Change from a typedef to auto_vec<> to a structure. * tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors): Adjust. Convert niters_skip to compare_type. (vect_set_loop_condition_partial_vectors_avx512): New function implementing the AVX512 partial vector codegen. 
(vect_set_loop_condition): Dispatch to the correct vect_set_loop_condition_partial_vectors_* function based on LOOP_VINFO_PARTIAL_VECTORS_STYLE. (vect_prepare_for_masked_peels): Compute LOOP_VINFO_MASK_SKIP_NITERS in the original niter type. * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize partial_vector_style. (can_produce_all_loop_masks_p): Adjust. (vect_verify_full_masking): Produce the rgroup_controls vector here. Set LOOP_VINFO_PARTIAL_VECTORS_STYLE on success. (vect_verify_full_masking_avx512): New function implementing verification of AVX512 style masking. (vect_verify_loop_lens): Set LOOP_VINFO_PARTIAL_VECTORS_STYLE. (vect_analyze_loop_2): Also try AVX512 style masking. Adjust condition. (vect_estimate_min_profitable_iters): Implement AVX512 style mask producing cost. (vect_record_loop_mask): Do not build the rgroup_controls vector here but record masks in a hash-set. (vect_get_loop_mask): Implement AVX512 style mask query, complementing the existing while_ult style.
2023-06-10VECT: Add SELECT_VL supportJu-Zhe Zhong1-9/+23
This patch addresses comments from Richard && Richi and rebases to trunk. This patch adds SELECT_VL middle-end support to allow targets to have target-dependent optimization of the length calculation. This patch is inspired by the RVV ISA and LLVM: https://reviews.llvm.org/D99750 SELECT_VL has the same behavior as LLVM "get_vector_length" with the following properties: 1. Only applies to a single rgroup. 2. Non-SLP. 3. Adjusts the loop control IV. 4. Adjusts the data reference IV. 5. Allows processing fewer than VF elements in a non-final iteration. Code:

# void vvaddint32(size_t n, const int*x, const int*y, int*z)
# { for (size_t i=0; i<n; i++) { z[i]=x[i]+y[i]; } }

Take RVV codegen for example. Before this patch:

vvaddint32:
    ble a0,zero,.L6
    csrr a4,vlenb
    srli a6,a4,2
.L4:
    mv a5,a0
    bleu a0,a6,.L3
    mv a5,a6
.L3:
    vsetvli zero,a5,e32,m1,ta,ma
    vle32.v v2,0(a1)
    vle32.v v1,0(a2)
    vsetvli a7,zero,e32,m1,ta,ma
    sub a0,a0,a5
    vadd.vv v1,v1,v2
    vsetvli zero,a5,e32,m1,ta,ma
    vse32.v v1,0(a3)
    add a2,a2,a4
    add a3,a3,a4
    add a1,a1,a4
    bne a0,zero,.L4
.L6:
    ret

After this patch:

vvaddint32:
    vsetvli t0, a0, e32, ta, ma  # Set vector length based on 32-bit vectors
    vle32.v v0, (a1)             # Get first vector
    sub a0, a0, t0               # Decrement number done
    slli t0, t0, 2               # Multiply number done by 4 bytes
    add a1, a1, t0               # Bump pointer
    vle32.v v1, (a2)             # Get second vector
    add a2, a2, t0               # Bump pointer
    vadd.vv v2, v0, v1           # Sum vectors
    vse32.v v2, (a3)             # Store result
    add a3, a3, t0               # Bump pointer
    bnez a0, vvaddint32          # Loop back
    ret                          # Finished

Co-authored-by: Richard Sandiford <richard.sandiford@arm.com> Co-authored-by: Richard Biener <rguenther@suse.de> gcc/ChangeLog: * doc/md.texi: Add SELECT_VL support. * internal-fn.def (SELECT_VL): Ditto. * optabs.def (OPTAB_D): Ditto. * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Ditto. * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Ditto. * tree-vect-stmts.cc (get_select_vl_data_ref_ptr): Ditto. (vectorizable_store): Ditto. (vectorizable_load): Ditto. 
* tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): Ditto.
2023-06-02VECT: Change flow of decrement IVJu-Zhe Zhong1-11/+25
Following Richi's suggestion, I change the current decrement IV flow from:

do { remain -= MIN (vf, remain); } while (remain != 0);

into:

do { old_remain = remain; len = MIN (vf, remain); remain -= vf; } while (old_remain >= vf);

to enhance SCEV. Includes fixes from Kewen. This patch will need to wait for Kewen's test feedback. Testing on X86 is ongoing. Co-Authored by: Kewen Lin <linkw@linux.ibm.com> PR tree-optimization/109971 gcc/ChangeLog: * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Change decrement IV flow. (vect_set_loop_condition_partial_vectors): Ditto.
2023-05-25VECT: Add decrement IV iteration loop control by variable amount supportJu-Zhe Zhong1-12/+124
This patch supports decrement IV by following the flow designed by Richard: (1) In vect_set_loop_condition_partial_vectors, for the first iteration, call vect_set_loop_controls_directly. (2) vect_set_loop_controls_directly calculates "step" as in your patch. If rgc has 1 control, this step is the SSA name created for that control. Otherwise the step is a fresh SSA name, as in your patch. (3) vect_set_loop_controls_directly stores this step somewhere for later use, probably in LOOP_VINFO. Let's use "S" to refer to this stored step. (4) After the vect_set_loop_controls_directly call above, and outside the "if" statement that now contains vect_set_loop_controls_directly, check whether rgc->controls.length () > 1. If so, use vect_adjust_loop_lens_control to set the controls based on S. Then the only caller of vect_adjust_loop_lens_control is vect_set_loop_condition_partial_vectors, and the starting step for vect_adjust_loop_lens_control is always S. This patch has been well tested for single-rgroup and multiple-rgroup (SLP) and passed all testcases in the RISC-V port. Signed-off-by: Ju-Zhe Zhong <juzhe.zhong@rivai.ai> Co-Authored-By: Richard Sandiford <richard.sandiford@arm.com> gcc/ChangeLog: * tree-vect-loop-manip.cc (vect_adjust_loop_lens_control): New function. (vect_set_loop_controls_directly): Add decrement IV support. (vect_set_loop_condition_partial_vectors): Ditto. * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): New variable. * tree-vectorizer.h (LOOP_VINFO_USING_DECREMENTING_IV_P): New macro. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c: New test. * gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c: New test. * gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c: New test. * gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c: New test.
2023-05-11VECT: Add tree_code into "create_iv" and allow it to handle MINUS_EXPR IVPan Li1-3/+4
This patch is going to be committed after bootstrap && regression on X86 PASSED. Thanks Richards. gcc/ChangeLog: * cfgloopmanip.cc (create_empty_loop_on_edge): Add PLUS_EXPR. * gimple-loop-interchange.cc (tree_loop_interchange::map_inductions_to_loop): Ditto. * tree-ssa-loop-ivcanon.cc (create_canonical_iv): Ditto. * tree-ssa-loop-ivopts.cc (create_new_iv): Ditto. * tree-ssa-loop-manip.cc (create_iv): Ditto. (tree_transform_and_unroll_loop): Ditto. (canonicalize_loop_ivs): Ditto. * tree-ssa-loop-manip.h (create_iv): Ditto. * tree-vect-data-refs.cc (vect_create_data_ref_ptr): Ditto. * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Ditto. (vect_set_loop_condition_normal): Ditto. * tree-vect-loop.cc (vect_create_epilog_for_reduction): Ditto. * tree-vect-stmts.cc (vectorizable_store): Ditto. (vectorizable_load): Ditto. Signed-off-by: Juzhe Zhong <juzhe.zhong@rivai.ai>
2023-04-24This replaces uses of last_stmt where we do not require debug skippingRichard Biener1-1/+1
There are quite a few cases that want to access the control stmt ending a basic block. Since there cannot be debug stmts after such a stmt, there's no point in using last_stmt, which skips debug stmts and can be a compile-time hog for larger testcases. * gimple-ssa-split-paths.cc (is_feasible_trace): Avoid last_stmt. * graphite-scop-detection.cc (single_pred_cond_non_loop_exit): Likewise. * ipa-fnsummary.cc (set_cond_stmt_execution_predicate): Likewise. (set_switch_stmt_execution_predicate): Likewise. (phi_result_unknown_predicate): Likewise. * ipa-prop.cc (compute_complex_ancestor_jump_func): Likewise. (ipa_analyze_indirect_call_uses): Likewise. * predict.cc (predict_iv_comparison): Likewise. (predict_extra_loop_exits): Likewise. (predict_loops): Likewise. (tree_predict_by_opcode): Likewise. * gimple-predicate-analysis.cc (predicate::init_from_control_deps): Likewise. * gimple-pretty-print.cc (dump_implicit_edges): Likewise. * tree-ssa-phiopt.cc (tree_ssa_phiopt_worker): Likewise. (replace_phi_edge_with_variable): Likewise. (two_value_replacement): Likewise. (value_replacement): Likewise. (minmax_replacement): Likewise. (spaceship_replacement): Likewise. (cond_removal_in_builtin_zero_pattern): Likewise. * tree-ssa-reassoc.cc (maybe_optimize_range_tests): Likewise. * tree-ssa-sccvn.cc (vn_phi_eq): Likewise. (vn_phi_lookup): Likewise. (vn_phi_insert): Likewise. * tree-ssa-structalias.cc (compute_points_to_sets): Likewise. * tree-ssa-threadbackward.cc (back_threader::maybe_thread_block): Likewise. (back_threader_profitability::possibly_profitable_path_p): Likewise. * tree-ssa-threadedge.cc (jump_threader::thread_outgoing_edges): Likewise. * tree-switch-conversion.cc (pass_convert_switch::execute): Likewise. (pass_lower_switch<O0>::execute): Likewise. * tree-tailcall.cc (tree_optimize_tail_calls_1): Likewise. * tree-vect-loop-manip.cc (vect_loop_versioning): Likewise. * tree-vect-slp.cc (vect_slp_function): Likewise. * tree-vect-stmts.cc (cfun_returns): Likewise. 
* tree-vectorizer.cc (vect_loop_vectorized_call): Likewise. (vect_loop_dist_alias_call): Likewise.
2023-03-10[PATCH v2] vect: Check that vector factor is a compile-time constantMichael Collison1-1/+1
* tree-vect-loop-manip.cc (vect_do_peeling): Use result of constant_lower_bound instead of vf for the lower bound of the epilog loop trip count.