path: root/gcc/tree-vect-loop.cc
Age  Commit message  Author  Files  Lines
2024-03-06tree-optimization/114239 - rework reduction epilogue drivingRichard Biener1-81/+24
The following reworks vectorizable_live_operation to pass the live stmt to vect_create_epilog_for_reduction also for early breaks and a peeled main exit. This is to be able to figure the scalar definition to replace. This reverts the PR114192 fix as it is subsumed by this cleanup. PR tree-optimization/114239 * tree-vect-loop.cc (vect_get_vect_def): Remove. (vect_create_epilog_for_reduction): The passed in stmt_info should now be the live stmt that produces the scalar reduction result. Revert PR114192 fix. Base reduction info off info_for_reduction. Remove special handling of early-break/peeled, restore original vector def gathering. Make sure to pick the correct exit PHIs. (vectorizable_live_operation): Pass in the proper stmt_info for early break exits. * gcc.dg/vect/vect-early-break_122-pr114239.c: New testcase.
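For orientation, a minimal sketch in C (hypothetical, not the new vect-early-break_122-pr114239.c testcase) of the kind of loop these early-break reduction-epilogue changes deal with: a reduction whose scalar result is live after an early exit and therefore needs the right definition picked in the epilogue.

  #define N 1024
  int a[N], b[N];

  int
  f (int x)
  {
    int sum = 0;
    for (int i = 0; i < N; i++)
      {
        sum += a[i];      /* reduction */
        if (b[i] > x)     /* early break exit */
          break;
      }
    return sum;           /* scalar reduction result used past the exits */
  }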
2024-03-04tree-optimization/114192 - scalar reduction kept live with early break vectRichard Biener1-14/+26
The following fixes a missing replacement of the reduction value used in the epilog, causing the scalar reduction to be kept live across the early break exit path. PR tree-optimization/114192 * tree-vect-loop.cc (vect_create_epilog_for_reduction): Use the appropriate def for the live out stmt in case of an alternate exit.
2024-02-22tree-optimization/114027 - conditional reduction chainRichard Biener1-5/+6
When we classify a conditional reduction chain as CONST_COND_REDUCTION we fail to verify all involved conditionals have the same constant. That's quite an unlikely situation, so the following simply disables such classification when there's more than one reduction statement. PR tree-optimization/114027 * tree-vect-loop.cc (vectorizable_reduction): Use optimized condition reduction classification only for single-element chains. * gcc.dg/vect/pr114027.c: New testcase.
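An illustrative example (hypothetical, not the actual pr114027.c testcase) of a conditional reduction chain where the conditionals assign different constants, i.e. the situation in which the optimized CONST_COND_REDUCTION classification must not be used:

  int
  f (int *a, int *b, int n)
  {
    int res = 0;
    for (int i = 0; i < n; i++)
      {
        if (a[i] > 10)
          res = 3;
        if (b[i] > 20)
          res = 5;   /* different constant than the first conditional */
      }
    return res;
  }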
2024-02-15tree-optimization/111156 - properly dissolve SLP only groupsRichard Biener1-1/+2
The following fixes the omission of failing to look at pattern stmts when we need to dissolve SLP only groups. PR tree-optimization/111156 * tree-vect-loop.cc (vect_dissolve_slp_only_groups): Look at the pattern stmt if any.
2024-02-13tree-optimization/113902 - fix VUSE update in move_early_exit_stmtsRichard Biener1-5/+7
The following adjusts move_early_exit_stmts to track the last seen VUSE instead of getting it from the last store which could be a PHI where gimple_vuse doesn't work. PR tree-optimization/113902 * tree-vect-loop.cc (move_early_exit_stmts): Track last_seen_vuse for VUSE updating. * gcc.dg/vect/pr113902.c: New testcase.
2024-02-13middle-end: update vector loop upper bounds when early break vect [PR113734]Tamar Christina1-1/+2
When doing early break vectorization we should treat the final iteration as possibly being partial. This is so that when we calculate the vector loop upper bounds we take into account that the final iteration could have done some work. The attached testcase shows that if we don't, cunroll may unroll the loop, and if the upper bound is wrong we lose a vector iteration. This is similar to how we adjust the scalar loop bounds for the PEELED case. gcc/ChangeLog: PR tree-optimization/113734 * tree-vect-loop.cc (vect_transform_loop): Treat the final iteration of an early break loop as partial. gcc/testsuite/ChangeLog: PR tree-optimization/113734 * gcc.dg/vect/vect-early-break_117-pr113734.c: New test.
2024-02-12tree-optimization/113863 - elide degenerate virtual PHIs when moving ee storesRichard Biener1-0/+19
This makes sure to elide degenerate virtual PHIs when moving stores across early exits. PR tree-optimization/113863 * tree-vect-data-refs.cc (vect_analyze_early_break_dependences): Record crossed virtual PHIs. * tree-vect-loop.cc (move_early_exit_stmts): Elide crossed virtual PHIs. * gcc.dg/vect/pr113863.c: New testcase.
2024-02-08middle-end: don't cache restart_loop in vectorizable_live_operations [PR113808]Tamar Christina1-3/+2
There's a bug in vectorizable_live_operation where restart_loop is defined outside the loop. This variable is supposed to indicate whether we are doing a first or last index reduction. The problem is that by defining it outside the loop it becomes dependent on the order in which we visit the USE/DEFs. In the given example, the loop isn't PEELED, but we visit the early exit uses first. This then sets the boolean to true and it can't get back to false again. So when we visit the main exit we still treat it as an early exit for that SSA name. This cleans it up and renames the variables to something that hopefully makes their intention clearer. gcc/ChangeLog: PR tree-optimization/113808 * tree-vect-loop.cc (vectorizable_live_operation): Don't cache the value cross iterations. gcc/testsuite/ChangeLog: PR tree-optimization/113808 * gfortran.dg/vect/vect-early-break_1-PR113808.f90: New test.
2024-02-07middle-end: fix ICE when destination BB for stores starts with a label [PR113750]Tamar Christina1-1/+1
The report shows that if the FE leaves a label as the first thing in the dest BB then we ICE because we move the stores before the label. This is easy to fix if we know that there's still only one way into the BB. We would have already rejected the loop if there were multiple paths into the BB; however I added an additional check, with an explanation, just for early break in case the other constraints are relaxed later. After that we fix the issue simply by getting the GSI after the labels, and I add a bunch of testcases for the different positions the label can be added. Only the vect-early-break_112-pr113750.c one results in the label being kept. gcc/ChangeLog: PR tree-optimization/113750 * tree-vect-data-refs.cc (vect_analyze_early_break_dependences): Check for single predecessor when doing early break vect. * tree-vect-loop.cc (move_early_exit_stmts): Get gsi at the start but after labels. gcc/testsuite/ChangeLog: PR tree-optimization/113750 * gcc.dg/vect/vect-early-break_112-pr113750.c: New test. * gcc.dg/vect/vect-early-break_113-pr113750.c: New test. * gcc.dg/vect/vect-early-break_114-pr113750.c: New test. * gcc.dg/vect/vect-early-break_115-pr113750.c: New test. * gcc.dg/vect/vect-early-break_116-pr113750.c: New test.
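A hypothetical illustration (the committed testcases differ) of how a front end can end up leaving a label at the head of a block inside an early-break loop; the fix makes the store-motion code take the insertion point after any such labels:

  int a[64], b[64];

  void
  f (int x)
  {
    for (int i = 0; i < 64; i++)
      {
        if (b[i] > x)
          break;
        if (x & 1)
          goto skip;      /* the goto target leaves a label in the loop body */
        a[i] = x;
  skip:;
      }
  }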
2024-02-07middle-end: fix ICE when moving statements to empty BB [PR113731]Tamar Christina1-2/+1
We use gsi_move_before (&stmt_gsi, &dest_gsi); to request that the new statement be placed before any other statement. Typically this then moves the current pointer to be after the statement we just inserted. However it looks like when the BB is empty, this does not happen and the CUR pointer stays NULL. There's a comment in the source of gsi_insert_before that explains: /* If CUR is NULL, we link at the end of the sequence (this case happens This adds a default parameter to gsi_move_before to allow us to control where the insertion happens. gcc/ChangeLog: PR tree-optimization/113731 * gimple-iterator.cc (gsi_move_before): Take new parameter for update method. * gimple-iterator.h (gsi_move_before): Default new param to GSI_SAME_STMT. * tree-vect-loop.cc (move_early_exit_stmts): Call gsi_move_before with GSI_NEW_STMT. gcc/testsuite/ChangeLog: PR tree-optimization/113731 * gcc.dg/vect/vect-early-break_111-pr113731.c: New test.
2024-01-25tree-optimization/113576 - non-empty latch and may_be_zero vectorizationRichard Biener1-2/+7
We can't support niters with may_be_zero when we end up with a non-empty latch due to early exit peeling. At least not in the simplistic way the vectorizer handles this now. Disallow it again for exits that are not the last one. PR tree-optimization/113576 * tree-vect-loop.cc (vec_init_loop_exit_info): Only allow exits with may_be_zero niters when it's the last one. * gcc.dg/vect/pr113576.c: New testcase.
2024-01-24middle-end: rename main_exit_p in reduction code.Tamar Christina1-9/+10
This renames main_exit_p to last_val_reduc_p to more accurately reflect what the value represents. gcc/ChangeLog: * tree-vect-loop.cc (vect_get_vect_def, vect_create_epilog_for_reduction): Rename main_exit_p to last_val_reduc_p.
2024-01-24middle-end: fix epilog reductions when vector iters peeled [PR113364]Tamar Christina1-1/+2
This fixes a bug where vect_create_epilog_for_reduction does not handle the case where all exits are early exits. In this case we should do what the induction handling code does and not have a main exit. This shows that some new miscompiles are happening (stage3 is likely miscompiled) but that's unrelated to this patch and I'll look at it next. gcc/ChangeLog: PR tree-optimization/113364 * tree-vect-loop.cc (vect_create_epilog_for_reduction): If all exits are early exits then we must reduce from the first offset for all of them. gcc/testsuite/ChangeLog: PR tree-optimization/113364 * gcc.dg/vect/vect-early-break_107-pr113364.c: New test.
2024-01-22tree-optimization/113373 - add missing LC PHIs for live operationsRichard Biener1-34/+12
The following makes reduction epilogue code generation happy by properly adding LC PHIs to the exit blocks for multiple exit vectorized loops. Some refactoring might make the flow easier to follow but I've refrained from doing that with this patch. I've kept some fixes in reduction epilogue generation from the earlier attempt fixing this PR. PR tree-optimization/113373 * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Create LC PHIs in the exit blocks where necessary. * tree-vect-loop.cc (vectorizable_live_operation): Do not try to handle missing LC PHIs. (find_connected_edge): Remove. (vect_create_epilog_for_reduction): Cleanup use of auto_vec. * gcc.dg/vect/vect-early-break_104-pr113373.c: New testcase.
2024-01-18Fix memory leak in vect_analyze_loop_formRichard Biener1-26/+19
The following fixes a memory leak in vect_analyze_loop_form which fails to free the loop body it gets. It also allows more countable exits, matching what we can handle later, when we decide which exit to use as main exit. Finally some no longer applying comments are adjusted. * tree-vect-loop.cc (vec_init_loop_exit_info): Adjust comment, prefer all later exits we can handle. (vect_analyze_loop_form): Free the allocated loop body. Adjust comments.
2024-01-18tree-optimization/113374 - early break vect and virtual operandsRichard Biener1-0/+6
The following fixes wrong virtual operands being used for peeled early breaks where we can have different live ones, and for multiple exits it makes sure to update the correct PHI arguments. I've introduced SET_PHI_ARG_DEF_ON_EDGE so we can avoid using a wrong edge to compute the PHI arg index from. I've taken the liberty to understand the code again and refactor and comment it a bit differently. The main functional change is that we preserve the live virtual operand on all exits. PR tree-optimization/113374 * tree-ssa-operands.h (SET_PHI_ARG_DEF_ON_EDGE): New. * tree-vect-loop.cc (move_early_exit_stmts): Update virtual LC PHIs. * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Refactor. Preserve virtual LC PHIs on all exits. * gcc.dg/vect/vect-early-break_106-pr113374.c: New testcase.
2024-01-12middle-end: remove more usages of single_exitTamar Christina1-2/+1
This replaces two more usages of single_exit that I had missed before. They both seem to happen when we re-use the ifcvt scalar loop for versioning. The condition in versioning is the same as the one for when we don't re-use the scalar loop. gcc/ChangeLog: * tree-vect-loop-manip.cc (vect_loop_versioning): Replace single_exit. * tree-vect-loop.cc (vect_transform_loop): Likewise.
2024-01-12middle-end: fill in reduction PHI for all alt exits [PR113178]Tamar Christina1-1/+7
When we have a loop with more than 2 exits and a reduction I forgot to fill in the PHI value for all alternate exits. All alternate exits use the same PHI value so we should loop over the new PHI elements and copy the value across since we call the reduction calculation code only once for all exits. This was normally covered up by earlier parts of the compiler rejecting loops incorrectly (which has been fixed now). Note that while I can use the loop in all cases, the reason I separated out the main and alt exit is so that if you pass the wrong edge the macro will assert. gcc/ChangeLog: PR tree-optimization/113178 * tree-vect-loop.cc (vect_create_epilog_for_reduction): Fill in all alternate exits. gcc/testsuite/ChangeLog: PR tree-optimization/113178 * gcc.dg/vect/vect-early-break_101-pr113178.c: New test. * gcc.dg/vect/vect-early-break_102-pr113178.c: New test.
2024-01-12middle-end: maintain LCSSA form when peeled vector iterations have virtual operandsTamar Christina1-1/+5
This patch fixes several interconnected issues.

1. When picking an exit we wanted to check for niter_desc.may_be_zero not being true, i.e. we want to pick an exit which we know will iterate at least once. However niter_desc.may_be_zero is not a boolean; it is a tree that encodes a boolean value. !niter_desc.may_be_zero just checks whether we have some information, not what the information is. This led us to pick a more difficult to vectorize exit more often than we should.

2. Because we had this bug, we used to pick an alternative exit much more often, which showed one issue: when the loop accesses memory and we "invert it" we would corrupt the VUSE chain. This is because on a peeled vector iteration every exit restarts the loop (i.e. they're all early), BUT since we may have performed a store, the VUSE would need to be updated. This version maintains virtual PHIs correctly in these cases. Note that we can't simply remove all of them and recreate them because we need the PHI nodes still in the right order for the skip_vector case.

3. Since we're moving the stores to a safe location, I don't think we actually need to analyze whether the store is in range of the memref, because if we ever get there, we know that the loads must be in range, and if the loads are in range and we get to the store we know the early breaks were not taken and so the scalar loop would have done the VF stores too.

4. Instead of searching for where to move stores to, they should always be in the exit belonging to the latch. We can only ever delay stores, and even if we pick a different exit than the latch one as the main one, effects still happen in program order when vectorized. If we don't move the stores to the latch exit but instead to wherever we pick as the "main" exit then we can perform incorrect memory accesses (luckily these are trapped by verify_ssa).

5. We only used to analyze loads inside the same BB as an early break, and also we'd never analyze the ones inside the block where we'd be moving memory references to. This is obviously bogus, and to fix it this patch splits apart the two constraints. We first validate that all load memory references are in bounds and only after that do we perform the alias checks for the writes. This makes the code simpler to understand and more trivially correct.

gcc/ChangeLog: PR tree-optimization/113137 PR tree-optimization/113136 PR tree-optimization/113172 PR tree-optimization/113178 * tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Maintain PHIs on inverted loops. (vect_do_peeling): Maintain virtual PHIs on inverted loops. * tree-vect-loop.cc (vec_init_loop_exit_info): Pick the exit closest to the latch. (vect_create_loop_vinfo): Record all conds instead of only alt ones. gcc/testsuite/ChangeLog: PR tree-optimization/113137 PR tree-optimization/113136 PR tree-optimization/113172 PR tree-optimization/113178 * g++.dg/vect/vect-early-break_4-pr113137.cc: New test. * g++.dg/vect/vect-early-break_5-pr113137.cc: New test. * gcc.dg/vect/vect-early-break_95-pr113137.c: New test. * gcc.dg/vect/vect-early-break_96-pr113136.c: New test. * gcc.dg/vect/vect-early-break_97-pr113172.c: New test.
2024-01-11tree-optimization/112505 - bit-precision induction vectorizationRichard Biener1-0/+9
Vectorization of bit-precision inductions isn't implemented but we don't check this, instead we ICE during transform. PR tree-optimization/112505 * tree-vect-loop.cc (vectorizable_induction): Reject bit-precision induction. * gcc.dg/vect/pr112505.c: New testcase.
2024-01-10tree-optimization/113078 - conditional subtraction reduction vectorizationRichard Biener1-0/+7
When if-conversion was changed to use .COND_ADD/SUB for conditional reduction it was forgotten to update reduction path handling to canonicalize .COND_SUB to .COND_ADD for vectorizable_reduction similar to what we do for MINUS_EXPR. The following adds this and testcases exercising this at runtime and looking for the appropriate masked subtraction in the vectorized code on x86. PR tree-optimization/113078 * tree-vect-loop.cc (check_reduction_path): Canonicalize .COND_SUB to .COND_ADD. * gcc.dg/vect/vect-reduc-cond-sub.c: New testcase. * gcc.target/i386/vect-pr113078.c: Likewise.
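A sketch (not the committed testcase) of a conditional subtraction reduction of the kind this fixes; after if-conversion the body becomes a .COND_SUB, which the reduction path handling now canonicalizes like .COND_ADD:

  double
  f (double *a, int *c, int n)
  {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
      if (c[i])
        sum -= a[i];   /* conditional subtraction feeding the reduction */
    return sum;
  }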
2024-01-09middle-end: removed unused variable in vectorizable_live_operation_1Tamar Christina1-6/+4
It looks like the previous patch had an unused variable. It's odd that my bootstrap didn't catch it (I'm assuming -Werror is still on for O3 bootstraps) but this fixes it. gcc/ChangeLog: * tree-vect-loop.cc (vectorizable_live_operation_1): Drop unused restart_loop. (vectorizable_live_operation): Likewise.
2024-01-09middle-end: check if target can do extract first for early breaks [PR113199]Tamar Christina1-33/+11
I was generating the vector reverse mask without checking if the target actually supported such an operation. This patch changes it to if the bitstart is 0 then use BIT_FIELD_REF instead to extract the first element since this is supported by all targets. This is good for now since masks always come from whilelo. But in the future when masks can come from other sources we will need the old code back. gcc/ChangeLog: PR tree-optimization/113199 * tree-vect-loop.cc (vectorizable_live_operation_1): Use BIT_FIELD_REF. gcc/testsuite/ChangeLog: PR tree-optimization/113199 * gcc.target/gcn/pr113199.c: New test.
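The fallback relies on extracting element 0 of a vector, which every target can do; in GNU C terms the operation amounts to no more than the following (illustrative sketch):

  typedef unsigned int v4si __attribute__ ((vector_size (16)));

  unsigned int
  first_element (v4si v)
  {
    return v[0];   /* what a BIT_FIELD_REF at bit offset 0 extracts */
  }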
2024-01-09vect: Ensure both NITERSM1 and NITERS are INTEGER_CSTs or neither of them [PR113210]Jakub Jelinek1-0/+13
On the following testcase e.g. on riscv64 or aarch64 (the latter with -O3 -march=armv8-a+sve) we ICE, because while NITERS is INTEGER_CST, NITERSM1 is a complex expression like

  (short unsigned int) (a.0_1 + 255) + 1 > 256 ? ~(short unsigned int) (a.0_1 + 255) : 0

where a.0_1 is unsigned char. The condition is never true, so the above is equivalent to just 0, but only when trying to fold the above with PLUS_EXPR 1 do we manage to simplify it: first ~(short unsigned int) (a.0_1 + 255) to -(short unsigned int) (a.0_1 + 255), then

  (short unsigned int) (a.0_1 + 255) + 1 > 256 ? -(short unsigned int) (a.0_1 + 255) : 1

to

  (short unsigned int) (a.0_1 + 255) >= 256 ? -(short unsigned int) (a.0_1 + 255) : 1

and only at this point do we fold the condition to be false. But the vectorizer seems to assume that if NITERS is known (i.e. a suitable INTEGER_CST) then NITERSM1 also is, so the following hack ensures that if NITERS folds into INTEGER_CST, NITERSM1 will be one as well.

2024-01-09 Jakub Jelinek <jakub@redhat.com> PR tree-optimization/113210 * tree-vect-loop.cc (vect_get_loop_niters): If non-INTEGER_CST value in *number_of_iterationsm1 PLUS_EXPR 1 is folded into INTEGER_CST, recompute *number_of_iterationsm1 as the INTEGER_CST minus 1. * gcc.c-torture/compile/pr113210.c: New test.
2024-01-08tree-optimization/113026 - avoid vector epilog in more casesRichard Biener1-1/+5
The following avoids creating a niter peeling epilog more consistently, matching what peeling later uses for the skip_vector condition, in particular when versioning is required which then also ensures the vector loop is entered unless the epilog is vectorized. This should ideally match LOOP_VINFO_VERSIONING_THRESHOLD which is only computed later, some refactoring could make that better matching. The patch also makes sure to adjust the upper bound of the epilogues when we do not have a skip edge around the vector loop. PR tree-optimization/113026 * tree-vect-loop.cc (vect_need_peeling_or_partial_vectors_p): Avoid an epilog in more cases. * tree-vect-loop-manip.cc (vect_do_peeling): Adjust the epilogues niter upper bounds and estimates. * gcc.dg/torture/pr113026-1.c: New testcase. * gcc.dg/torture/pr113026-2.c: Likewise.
2024-01-03Update copyright years.Jakub Jelinek1-1/+1
2023-12-25middle-end: explicitly initialize vec_stmts [PR113132]Tamar Christina1-1/+1
when configured with --enable-checking=release we get a false positive on the use of vec_stmts as the compiler seems unable to notice it gets initialized through the pass-by-reference. This explicitly initializes the local. gcc/ChangeLog: PR bootstrap/113132 * tree-vect-loop.cc (vect_create_epilog_for_reduction): Initialize vec_stmts;
2023-12-24middle-end: Support vectorization of loops with multiple exits.Tamar Christina1-129/+353
Hi All,

This patch adds initial support for early break vectorization in GCC. In other words it implements support for vectorization of loops with multiple exits. The support is added for any target that implements a vector cbranch optab; this includes both fully masked and non-masked targets. Depending on the operation, the vectorizer may also require support for boolean mask reductions using Inclusive OR/Bitwise AND. This is however only checked when the comparison would produce multiple statements.

This also fully decouples the vectorizer's notion of exit from the existing loop infrastructure's exit. Before this patch the vectorizer always picked the natural loop latch connected exit as the main exit. After this patch the vectorizer is free to choose any exit it deems appropriate as the main exit. This means that even if the main exit is not countable (i.e. the termination condition could not be determined) we might still be able to vectorize should one of the other exits be countable. In such situations the loop is reflowed, which enables vectorization of many other loop forms.

Concretely the kind of loops supported are of the form:

  for (int i = 0; i < N; i++)
    {
      <statements1>
      if (<condition>)
        {
          ...
          <action>;
        }
      <statements2>
    }

where <action> can be:
 - break
 - return
 - goto

Any number of statements can be used before the <action> occurs.

Since this is an initial version for GCC 14 it has the following limitations and features:

- Only fixed sized iterations and buffers are supported. That is to say any vectors loaded or stored must be to statically allocated arrays with known sizes. N must also be known. This limitation is because our primary target for this optimization is SVE. For VLA SVE we can't easily do cross-page iteration checks. The result is likely to also not be beneficial. For that reason we punt support for variable buffers till we have First-Faulting support in GCC 15.
- Any stores in <statements1> should not be to the same objects as in <condition>. Loads are fine as long as they don't have the possibility to alias. More concretely, we block RAW dependencies when the intermediate value can't be separated from the store, or the store itself can't be moved.
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported.
- Any number of loop early exits are supported.
- No support for epilogue vectorization. The only epilogue supported is the scalar final one. Peeling code supports it but the code motion code cannot find instructions to make the move in the epilog.
- Early breaks are only supported for inner loop vectorization.

With the help of IPA and LTO this still gets hit quite often. During bootstrap it hit rather frequently. Additionally TSVC s332, s481 and s482 all pass now since these are tests for support for early exit vectorization.

This implementation does not support completely handling the early break inside the vector loop itself but instead supports adding checks such that if we know that we have to exit in the current iteration then we branch to scalar code to actually do the final VF iterations which handles all the code in <action>. For the scalar loop we know that whatever exit you take you have to perform at most VF iterations. For vector code we only care about the state of fully performed iterations and reset the scalar code to the (partially) remaining loop. That is to say, the first vector loop executes so long as the early exit isn't needed. Once the exit is taken, the scalar code will perform at most VF extra iterations. The exact number depends on peeling and iteration start and which exit was taken (natural or early). For this scalar loop, all early exits are treated the same.

When we vectorize we move any statement not related to the early break itself and that would be incorrect to execute before the break (i.e. has side effects) to after the break. If this is not possible we decline to vectorize. The analysis and code motion also take into account that they must not introduce a RAW dependency after the move of the stores. This means that we check at the start of iterations whether we are going to exit or not. During the analysis phase we check whether we are allowed to do this moving of statements. Also note that we only move the scalar statements, but only do so after peeling but just before we start transforming statements.

With this the vector flow no longer necessarily needs to match that of the scalar code. In addition most of the infrastructure is in place to support general control flow safely, however we are punting this to GCC 15.

Codegen:

for e.g.

  unsigned vect_a[N];
  unsigned vect_b[N];

  unsigned test4(unsigned x)
  {
    unsigned ret = 0;
    for (int i = 0; i < N; i++)
      {
        vect_b[i] = x + i;
        if (vect_a[i] > x)
          break;
        vect_a[i] = x;
      }
    return ret;
  }

We generate for Adv. SIMD:

  test4:
          adrp    x2, .LC0
          adrp    x3, .LANCHOR0
          dup     v2.4s, w0
          add     x3, x3, :lo12:.LANCHOR0
          movi    v4.4s, 0x4
          add     x4, x3, 3216
          ldr     q1, [x2, #:lo12:.LC0]
          mov     x1, 0
          mov     w2, 0
          .p2align 3,,7
  .L3:
          ldr     q0, [x3, x1]
          add     v3.4s, v1.4s, v2.4s
          add     v1.4s, v1.4s, v4.4s
          cmhi    v0.4s, v0.4s, v2.4s
          umaxp   v0.4s, v0.4s, v0.4s
          fmov    x5, d0
          cbnz    x5, .L6
          add     w2, w2, 1
          str     q3, [x1, x4]
          str     q2, [x3, x1]
          add     x1, x1, 16
          cmp     w2, 200
          bne     .L3
          mov     w7, 3
  .L2:
          lsl     w2, w2, 2
          add     x5, x3, 3216
          add     w6, w2, w0
          sxtw    x4, w2
          ldr     w1, [x3, x4, lsl 2]
          str     w6, [x5, x4, lsl 2]
          cmp     w0, w1
          bcc     .L4
          add     w1, w2, 1
          str     w0, [x3, x4, lsl 2]
          add     w6, w1, w0
          sxtw    x1, w1
          ldr     w4, [x3, x1, lsl 2]
          str     w6, [x5, x1, lsl 2]
          cmp     w0, w4
          bcc     .L4
          add     w4, w2, 2
          str     w0, [x3, x1, lsl 2]
          sxtw    x1, w4
          add     w6, w1, w0
          ldr     w4, [x3, x1, lsl 2]
          str     w6, [x5, x1, lsl 2]
          cmp     w0, w4
          bcc     .L4
          str     w0, [x3, x1, lsl 2]
          add     w2, w2, 3
          cmp     w7, 3
          beq     .L4
          sxtw    x1, w2
          add     w2, w2, w0
          ldr     w4, [x3, x1, lsl 2]
          str     w2, [x5, x1, lsl 2]
          cmp     w0, w4
          bcc     .L4
          str     w0, [x3, x1, lsl 2]
  .L4:
          mov     w0, 0
          ret
          .p2align 2,,3
  .L6:
          mov     w7, 4
          b       .L2

and for SVE:

  test4:
          adrp    x2, .LANCHOR0
          add     x2, x2, :lo12:.LANCHOR0
          add     x5, x2, 3216
          mov     x3, 0
          mov     w1, 0
          cntw    x4
          mov     z1.s, w0
          index   z0.s, #0, #1
          ptrue   p1.b, all
          ptrue   p0.s, all
          .p2align 3,,7
  .L3:
          ld1w    z2.s, p1/z, [x2, x3, lsl 2]
          add     z3.s, z0.s, z1.s
          cmplo   p2.s, p0/z, z1.s, z2.s
          b.any   .L2
          st1w    z3.s, p1, [x5, x3, lsl 2]
          add     w1, w1, 1
          st1w    z1.s, p1, [x2, x3, lsl 2]
          add     x3, x3, x4
          incw    z0.s
          cmp     w3, 803
          bls     .L3
  .L5:
          mov     w0, 0
          ret
          .p2align 2,,3
  .L2:
          cntw    x5
          mul     w1, w1, w5
          cbz     w5, .L5
          sxtw    x1, w1
          sub     w5, w5, #1
          add     x5, x5, x1
          add     x6, x2, 3216
          b       .L6
          .p2align 2,,3
  .L14:
          str     w0, [x2, x1, lsl 2]
          cmp     x1, x5
          beq     .L5
          mov     x1, x4
  .L6:
          ldr     w3, [x2, x1, lsl 2]
          add     w4, w0, w1
          str     w4, [x6, x1, lsl 2]
          add     x4, x1, 1
          cmp     w0, w3
          bcs     .L14
          mov     w0, 0
          ret

On the workloads this work is based on we see between 2-3x performance uplift using this patch.

Follow up plan:
 - Boolean vectorization has several shortcomings. I've filed PR110223 with the bigger ones that cause vectorization to fail with this patch.
 - SLP support. This is planned for GCC 15 as for the majority of the cases build SLP itself fails. This means I'll need to spend time in making this more robust first. Additionally it requires:
     * Adding support for vectorizing CFG (gconds)
     * Support for CFG to differ between vector and scalar loops.
   Both of which would be disruptive to the tree and I suspect I'll be handling fallouts from this patch for a while. So I plan to work on the surrounding building blocks first for the remainder of the year.

Additionally it also contains reduced cases from issues found running over various codebases.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. Also regtested with: -march=armv8.3-a+sve -march=armv8.3-a+nosve -march=armv9-a -mcpu=neoverse-v1 -mcpu=neoverse-n2. Bootstrapped Regtested x86_64-pc-linux-gnu and no issues. Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.

gcc/ChangeLog: * tree-if-conv.cc (idx_within_array_bound): Expose. * tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New. (vect_analyze_data_ref_dependences): Use it. * tree-vect-loop-manip.cc (vect_iv_increment_position): New. (vect_set_loop_controls_directly, vect_set_loop_condition_partial_vectors, vect_set_loop_condition_partial_vectors_avx512, vect_set_loop_condition_normal): Support multiple exits. (slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSSA peeling for multiple exits. (slpeel_can_duplicate_loop_p): Change vectorizer from looking at BB count and instead look at loop shape. (vect_update_ivs_after_vectorizer): Drop asserts. (vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations. (vect_do_peeling): Support multiple exits. (vect_loop_versioning): Likewise. * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise early_breaks. (vect_analyze_loop_form): Support loop flows with more than single BB loop body. (vect_create_loop_vinfo): Support niters analysis for multiple exits. (vect_analyze_loop): Likewise. (vect_get_vect_def): New. (vect_create_epilog_for_reduction): Support early exit reductions. (vectorizable_live_operation_1): New. (find_connected_edge): New. (vectorizable_live_operation): Support early exit live operations. (move_early_exit_stmts): New. (vect_transform_loop): Use it. * tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond. (vect_recog_bitfield_ref_pattern): Support gconds and bools. (vect_recog_gcond_pattern): New. (possible_vector_mask_operation_p): Support gcond masks. (vect_determine_mask_precision): Likewise. (vect_mark_pattern_stmts): Set gcond def type. (can_vectorize_live_stmts): Force early break inductions to be live. * tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for early breaks. (vect_mark_stmts_to_be_vectorized): Process gcond usage. (perm_mask_for_reverse): Expose. (vectorizable_comparison_1): New. (vectorizable_early_exit): New. (vect_analyze_stmt): Support early break and gcond. (vect_transform_stmt): Likewise. (vect_is_simple_use): Likewise. (vect_get_vector_types_for_stmt): Likewise. * tree-vectorizer.cc (pass_vectorize::execute): Update exits for value numbering. * tree-vectorizer.h (enum vect_def_type): Add vect_condition_def. (LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES, LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB, LOOP_VINFO_EARLY_BRK_VUSES): New. (is_loop_header_bb_p): Drop assert. (class loop): Add early_breaks, early_break_stores, early_break_dest_bb, early_break_vuses. (vect_iv_increment_position, perm_mask_for_reverse, ref_within_array_bound): New. (slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
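As a further illustration of the supported forms (hypothetical, not one of the new testcases), an early exit through a return over a statically sized array:

  #define N 1024
  unsigned int a[N];

  int
  find (unsigned int x)
  {
    for (int i = 0; i < N; i++)
      if (a[i] == x)
        return i;     /* early exit via return */
    return -1;
  }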
2023-12-15Middle-end: Do not model address cost for SELECT_VL style vectorizationJuzhe-Zhong1-6/+4
Following Richard's suggestions, we should not model address cost in the loop vectorizer for select_vl or decrement IV, since other styles of vectorization don't do that, to keep the cost model comparison apples to apples. This patch sets the COST from 2 to 1, which turns out to have better codegen in various cases for RVV. Ok for trunk ? PR target/111153 gcc/ChangeLog: * tree-vect-loop.cc (vect_estimate_min_profitable_iters): Remove address cost for select_vl/decrement IV. gcc/testsuite/ChangeLog: * gcc.dg/vect/costmodel/riscv/rvv/pr111153.c: Moved to... * gcc.dg/vect/costmodel/riscv/rvv/pr11153-2.c: ...here. * gcc.dg/vect/costmodel/riscv/rvv/pr111153-1.c: New test.
2023-12-13Middle-end: Adjust decrement IV style partial vectorization COST modelJuzhe-Zhong1-3/+14
Hi, before this patch, a simple conversion case for RVV codegen:

  foo:
          ble     a2,zero,.L8
          addiw   a5,a2,-1
          li      a4,6
          bleu    a5,a4,.L6
          srliw   a3,a2,3
          slli    a3,a3,3
          add     a3,a3,a0
          mv      a5,a0
          mv      a4,a1
          vsetivli zero,8,e16,m1,ta,ma
  .L4:
          vle8.v  v2,0(a5)
          addi    a5,a5,8
          vzext.vf2 v1,v2
          vse16.v v1,0(a4)
          addi    a4,a4,16
          bne     a3,a5,.L4
          andi    a5,a2,-8
          beq     a2,a5,.L10
  .L3:
          slli    a4,a5,32
          srli    a4,a4,32
          subw    a2,a2,a5
          slli    a2,a2,32
          slli    a5,a4,1
          srli    a2,a2,32
          add     a0,a0,a4
          add     a1,a1,a5
          vsetvli zero,a2,e16,m1,ta,ma
          vle8.v  v2,0(a0)
          vzext.vf2 v1,v2
          vse16.v v1,0(a1)
  .L8:
          ret
  .L10:
          ret
  .L6:
          li      a5,0
          j       .L3

This vectorization goes through the first loop:

          vsetivli zero,8,e16,m1,ta,ma
  .L4:
          vle8.v  v2,0(a5)
          addi    a5,a5,8
          vzext.vf2 v1,v2
          vse16.v v1,0(a4)
          addi    a4,a4,16
          bne     a3,a5,.L4

Each iteration processes 8 elements. For scalable vectorization on a CPU with VLEN > 128 bits this is ok when VLEN = 128. But as long as VLEN > 128 bits, it will waste CPU resources. That is, with e.g. VLEN = 256 bits, only half of the vector units are working and the other half is idle.

After investigation, I realize that I forgot to adjust the COST for SELECT_VL. So, adjust the COST for SELECT_VL style length vectorization. We adjust the COST from 3 to 2, since after this patch:

  foo:
          ble     a2,zero,.L5
  .L3:
          vsetvli a5,a2,e16,m1,ta,ma    -----> SELECT_VL cost.
          vle8.v  v2,0(a0)
          slli    a4,a5,1               -----> additional shift of the SELECT_VL outcome for memory address calculation.
          vzext.vf2 v1,v2
          sub     a2,a2,a5
          vse16.v v1,0(a1)
          add     a0,a0,a5
          add     a1,a1,a4
          bne     a2,zero,.L3
  .L5:
          ret

This patch is a simple fix that I previously forgot. Ok for trunk ? If not, I am going to adjust the cost in the backend cost model. PR target/111317 gcc/ChangeLog: * tree-vect-loop.cc (vect_estimate_min_profitable_iters): Adjust for COST for decrement IV. gcc/testsuite/ChangeLog: * gcc.dg/vect/costmodel/riscv/rvv/pr111317.c: New test.
2023-12-08tree-optimization/112774: extend the SCEV CHREC tree with a nonwrapping flagHao Liu1-0/+4
The flag is defined as CHREC_NOWRAP(tree), and will be dumped from "{offset, +, 1}_1" to "{offset, +, 1}<nw>_1" (nw is short for nonwrapping). Two SCEV interfaces record_nonwrapping_chrec and nonwrapping_chrec_p are added to set and check the flag respectively. As resetting the SCEV cache (i.e., the chrec trees) may not reset the loop->estimate_state, free_numbers_of_iterations_estimates is called explicitly in loop vectorization to make sure the flag can be calculated properly by niter. gcc/ChangeLog: PR tree-optimization/112774 * tree-pretty-print.cc: If the nonwrapping flag is set, chrec will be printed with additional <nw> info. * tree-scalar-evolution.cc: Add record_nonwrapping_chrec and nonwrapping_chrec_p to set and check the new flag respectively. * tree-scalar-evolution.h: Likewise. * tree-ssa-loop-niter.cc (idx_infer_loop_bounds, infer_loop_bounds_from_pointer_arith, infer_loop_bounds_from_signedness, scev_probably_wraps_p): Call record_nonwrapping_chrec before record_nonwrapping_iv, call nonwrapping_chrec_p to check the flag is set and return false from scev_probably_wraps_p. * tree-vect-loop.cc (vect_analyze_loop): Call free_numbers_of_iterations_estimates explicitly. * tree-core.h: Document the nothrow_flag usage in CHREC_NOWRAP. * tree.h: Add CHREC_NOWRAP(NODE); base.nothrow_flag is used to represent the nonwrapping info. gcc/testsuite/ChangeLog: * gcc.dg/tree-ssa/scev-16.c: New test.
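An illustrative loop (not the scev-16.c testcase) producing an evolution of the dumped shape: the index below has the scalar evolution {offset, +, 1}_1, and when niter analysis proves it cannot wrap the chrec is flagged and dumped as {offset, +, 1}<nw>_1:

  int a[1000];

  void
  f (unsigned int offset, unsigned int n)
  {
    for (unsigned int i = 0; i < n; i++)
      a[offset + i] = 0;   /* index has evolution {offset, +, 1}_1 */
  }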
2023-12-01Fix ambiguity between vect_get_vec_defs with/without vectypeRichard Biener1-3/+3
When querying a single set of vector defs with the overloaded vect_get_vec_defs API then when you try to use the overload with the vector type specified the call will be ambiguous with the variant without the vector type. The following fixes this by re-ordering the vector type argument to come before the output def vector argument. I've changed vectorizable_conversion as that triggered this so it has coverage showing this works. The motivation is to reduce the number of (redundant) get_vectype_for_scalar_type calls. * tree-vectorizer.h (vect_get_vec_defs): Re-order arguments. * tree-vect-stmts.cc (vect_get_vec_defs): Likewise. (vectorizable_condition): Update caller. (vectorizable_comparison_1): Likewise. (vectorizable_conversion): Specify the vector type to be used for invariant/external defs. * tree-vect-loop.cc (vect_transform_reduction): Update caller.
2023-11-24tree-optimization/112677 - stack corruption with .COND_* reductionRichard Biener1-1/+1
The following makes sure to allocate enough space for vectype_op in vectorizable_reduction. PR tree-optimization/112677 * tree-vect-loop.cc (vectorizable_reduction): Use alloca to allocate vectype_op.
2023-11-21vect: Allow reduc_index != 1 for COND_OPs.Robin Dapp1-7/+11
In PR112406 Tamar found another problem with COND_OP reductions. I wrongly assumed that the reduction variable will always remain in operand 1, just as we create the COND_OP in ifcvt. But of course, addition being commutative, we are free to swap operand 1 and 2 and we end up with e.g. _ifc__60 = .COND_ADD (_2, _6, MADPictureC1_lsm.10_25, MADPictureC1_lsm.10_25); which does not pass the asserts I put in place. This patch removes this restriction and allows the reduction index to be 2 as well. gcc/ChangeLog: PR middle-end/112406 * tree-vect-loop.cc (vectorize_fold_left_reduction): Allow reduction index != 1. (vect_transform_reduction): Handle reduction index != 1. gcc/testsuite/ChangeLog: * gcc.target/aarch64/pr112406-2.c: New test.
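A hypothetical source shape (not the pr112406-2.c testcase) where, addition being commutative, folding may leave the reduction variable in operand 2 rather than operand 1 of the generated COND_ADD, which is the case the patch now accepts:

  double
  f (double *x, int *c, int n)
  {
    double m = 0.0;
    for (int i = 0; i < n; i++)
      if (c[i])
        m = x[i] + m;   /* reduction variable written on the right of the + */
    return m;
  }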
2023-11-21Move VF based dependence checkRichard Biener1-3/+4
The following moves the check whether the maximum vectorization factor determined by data dependence analysis is in conflict with the chosen vectorization factor to after the point where we applied both the SLP and the unrolling adjustment to the vectorization factor. We check the latter before applying unrolling, but the SLP adjustment can result in both missed optimization and wrong-code. * tree-vect-loop.cc (vect_analyze_loop_2): Move check of VF against max_vf until VF is final.
2023-11-20tree-optimization/112618 - unused .MASK_CALLRichard Biener1-1/+10
We have to make sure to remove unused .MASK_CALL internal function calls after vectorization. PR tree-optimization/112618 * tree-vect-loop.cc (vect_transform_loop_stmt): For not relevant and unused .MASK_CALL make sure we remove the scalar stmt. * gcc.dg/pr112618.c: New testcase.
2023-11-17vect: Pass truth type to vect_get_vec_defs.Robin Dapp1-9/+22
For conditional operations the mask is loop invariant and cannot be stored explicitly. By default, for reductions, we deduce the vectype from the statement or the loop but this does not work for conditional operations. Therefore this patch passes the truth type of the reduction input vectype for the mask operand instead. This will override the other choices and make sure we have the proper mask vectype. gcc/ChangeLog: PR middle-end/112406 PR middle-end/112552 * tree-vect-loop.cc (vect_transform_reduction): Pass truth vectype for mask operand. gcc/testsuite/ChangeLog: * gcc.target/aarch64/pr112406.c: New test. * gcc.target/riscv/rvv/autovec/pr112552.c: New test.
2023-11-17vect: Fix check_reduction_path [PR112374]Jakub Jelinek1-2/+2
As mentioned in the PR, the intent of the r14-5076 changes was that it doesn't count one of the uses on the use_stmt, but what actually got implemented is that it does this processing on any op_use_stmt, even if it is not the use_stmt statement, which means that it can increase count even on debug stmts (-fcompare-debug failures), or if there would be some other use stmt with 2+ uses it could count that as a single use. Though, because it fails whenever cnt != 1 and I believe use_stmt must be one of the uses, it would probably fail in the latter case anyway. The following patch fixes that by doing this extra processing only when op_use_stmt is use_stmt, and using the normal processing otherwise (so ignore debug stmts, and increase on any uses on the stmt). 2023-11-17 Jakub Jelinek <jakub@redhat.com> PR tree-optimization/112374 * tree-vect-loop.cc (check_reduction_path): Perform the cond_fn_p special case only if op_use_stmt == use_stmt, use as_a rather than dyn_cast in that case. * gcc.dg/pr112374-1.c: New test. * gcc.dg/pr112374-2.c: New test. * g++.dg/opt/pr112374.C: New test.
2023-11-16VECT: Clear LOOP_VINFO_USING_SELECT_VL_P when loop is not partial vectorizedJuzhe-Zhong1-0/+13
This patch fixes the ICE: https://godbolt.org/z/z8T6o6qov

  <source>: In function 'b':
  <source>:2:6: error: missing definition
      2 | void b() {
        |      ^
  for SSA_NAME: loop_len_8 in statement:
  _1 = -loop_len_8;
  during GIMPLE pass: vect
  <source>:2:6: internal compiler error: verify_ssa failed
  0x7f1b56331082 __libc_start_main
          ???:0
  Please submit a full bug report, with preprocessed source (by using -freport-bug).
  Please include the complete backtrace with any bug report.
  See <https://gcc.gnu.org/bugs/> for instructions.
  Compiler returned: 1

The root cause is that we generate such IR in vectorization:

  _1 = -loop_len_8;
  vect_cst__11 = {_1, _1};
  _18 = vect_vec_iv_.6_14 + vect_cst__11;

loop_len_8 is an uninitialized value. The IR

  _18 = vect_vec_iv_.6_14 + vect_cst__11;

is generated because we are adding the induction variable with the result of SELECT_VL instead of VF. The code is:

  else if (LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
    {
      /* When we're using loop_len produced by SELEC_VL, the non-final
         iterations are not always processing VF elements.  So vectorize
         induction variable instead of

           _21 = vect_vec_iv_.6_22 + { VF, ... };

         We should generate:

           _35 = .SELECT_VL (ivtmp_33, VF);
           vect_cst__22 = [vec_duplicate_expr] _35;
           _21 = vect_vec_iv_.6_22 + vect_cst__22;  */
      gcc_assert (!slp_node);
      gimple_seq seq = NULL;
      vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
      tree len = vect_get_loop_len (loop_vinfo, NULL, lens, 1, vectype, 0, 0);
      expr = force_gimple_operand (fold_convert (TREE_TYPE (step_expr),
                                                 unshare_expr (len)),
                                   &seq, true, NULL_TREE);
      new_name = gimple_build (&seq, MULT_EXPR, TREE_TYPE (step_expr),
                               expr, step_expr);
      gsi_insert_seq_before (&si, seq, GSI_SAME_STMT);
      step_iv_si = &si;
    }

LOOP_VINFO_USING_SELECT_VL_P is set before loop vectorization analysis, so we don't know whether it is partial vectorization or not, but the induction variable handling depends on SELECT_VL_P being true. So update SELECT_VL_P to false when it is not partial vectorization.

PR middle-end/112554 gcc/ChangeLog: * tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling): Clear SELECT_VL_P for non-partial vectorization. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/pr112554.c: New test.
2023-11-14Fix ICE in vectorizable_nonlinear_induction with bitfield.liuhongt1-3/+10
The ICE happens in vectorizable_nonlinear_induction, which has:

  if (TREE_CODE (init_expr) == INTEGER_CST)
    init_expr = fold_convert (TREE_TYPE (vectype), init_expr);
  else
    gcc_assert (tree_nop_conversion_p (TREE_TYPE (vectype),
                                       TREE_TYPE (init_expr)));

and init_expr is a 24-bit integer type while vectype has 32-bit components. The "fix" is to bail out instead of asserting. gcc/ChangeLog: PR tree-optimization/112496 * tree-vect-loop.cc (vectorizable_nonlinear_induction): Return false when !tree_nop_conversion_p (TREE_TYPE (vectype), TREE_TYPE (init_expr)). gcc/testsuite/ChangeLog: * gcc.target/i386/pr112496.c: New test.
2023-11-10Middle-end: Fix bug of induction variable vectorization for RVVJuzhe-Zhong1-1/+29
PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112438

1. The SELECT_VL result is not necessarily always VF in a non-final iteration, so the current GIMPLE IR is wrong:

     ...
     _35 = .SELECT_VL (ivtmp_33, VF);
     _21 = vect_vec_iv_.8_22 + { VF, ... };

   E.g. consider total iterations N = 6 and VF = 4. The SELECT_VL output is defined as not always being VF in non-final iterations; it depends on the hardware implementation. Suppose we have an RVV CPU core with vsetvl doing even-distribution workload optimization. It may process 3 elements in the 1st iteration and 3 elements in the last iteration. Then the induction variable update here:

     _21 = vect_vec_iv_.8_22 + { POLY_INT_CST [4, 4], ... };

   is wrong: it is adding VF, which is 4, but actually we didn't process 4 elements. It should be adding 3 elements, which is the result of SELECT_VL. So here the correct IR should be:

     _36 = .SELECT_VL (ivtmp_34, VF);
     _22 = (int) _36;
     vect_cst__21 = [vec_duplicate_expr] _22;

2. This issue only happens for non-SLP vectorization with a single rgroup since:

     if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
       {
         tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
         if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
                                             OPTIMIZE_FOR_SPEED)
             && LOOP_VINFO_LENS (loop_vinfo).length () == 1
             && LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1
             && !slp
             && (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
                 || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
           LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
       }

3. This issue doesn't appear on nested loops, no matter whether LOOP_VINFO_USING_SELECT_VL_P is true or false, since:

     # vect_vec_iv_.6_5 = PHI <_19(3), { 0, ... }(5)>
     # vect_diff_15.7_20 = PHI <vect_diff_9.8_22(3), vect_diff_18.5_11(5)>
     _19 = vect_vec_iv_.6_5 + { 1, ... };
     vect_diff_9.8_22 = .COND_LEN_ADD ({ -1, ... }, vect_vec_iv_.6_5,
                                       vect_diff_15.7_20, vect_diff_15.7_20,
                                       _28, 0);
     ivtmp_1 = ivtmp_4 + 4294967295;
     ....
     <bb 5> [local count: 6549826]:
     # vect_diff_18.5_11 = PHI <vect_diff_9.8_22(4), { 0, ... }(2)>
     # ivtmp_26 = PHI <ivtmp_27(4), 40(2)>
     _28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
     goto <bb 3>; [100.00%]

   Note the induction variable IR:

     _21 = vect_vec_iv_.8_22 + { POLY_INT_CST [4, 4], ... };

   updates the induction variable independently of VF (i.e. it doesn't care how many elements are processed in the iteration); the update is loop invariant. So it won't be a problem even if LOOP_VINFO_USING_SELECT_VL_P is true.

Testing passed. Ok for trunk ? PR tree-optimization/112438 gcc/ChangeLog: * tree-vect-loop.cc (vectorizable_induction): Bugfix when LOOP_VINFO_USING_SELECT_VL_P. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/pr112438.c: New test.
2023-11-10vect: Look through pattern stmt in fold_left_reduction.Robin Dapp1-1/+1
It appears as if we "look through" a statement pattern in vect_finish_replace_stmt but not before when we replace the newly created vector statement's lhs. Then the lhs is the statement pattern's lhs while in vect_finish_replace_stmt we assert that it's from the statement the pattern replaced. This patch uses vect_orig_stmt on the scalar destination's definition so the replaced statement is used everywhere. gcc/ChangeLog: PR tree-optimization/112464 * tree-vect-loop.cc (vectorize_fold_left_reduction): Use vect_orig_stmt on scalar_dest_def_info. gcc/testsuite/ChangeLog: * gcc.target/i386/pr112464.c: New test.
2023-11-09tree-optimization/112450 - avoid AVX512 style masking for BImode masksRichard Biener1-1/+4
The following avoids running into the AVX512 style masking code for RVV which would theoretically be able to handle it if I were not relying on integer mode maskness in vect_get_loop_mask. While that's easy to fix (patch in PR), the preference is to not have AVX512 style masking for RVV, thus the following. * tree-vect-loop.cc (vect_verify_full_masking_avx512): Check we have integer mode masks as required by vect_get_loop_mask.
2023-11-07vect/ifcvt: Add vec_cond fallback and check for vector versioning.Robin Dapp1-2/+20
This restricts tree-ifcvt to only create COND_OPs when we versioned the loop for vectorization. Apart from that it re-creates a VEC_COND_EXPR in vect_expand_fold_left if we emitted a COND_OP. gcc/ChangeLog: PR tree-optimization/112361 PR target/112359 PR middle-end/112406 * tree-if-conv.cc (convert_scalar_cond_reduction): Remember if loop was versioned and only then create COND_OPs. (predicate_scalar_phi): Do not create COND_OP when not vectorizing. * tree-vect-loop.cc (vect_expand_fold_left): Re-create VEC_COND_EXPR. (vectorize_fold_left_reduction): Pass mask to vect_expand_fold_left. gcc/testsuite/ChangeLog: * gcc.dg/pr112359.c: New test.
2023-11-06tree-optimization/112404 - two issues with SLP of .MASK_LOADRichard Biener1-21/+26
The following fixes an oversight in vect_check_scalar_mask when the mask is external or constant. When doing BB vectorization we need to provide a group_size, best via an overload accepting the SLP node as argument. When fixed we then run into the issue that we have not analyzed alignment of the .MASK_LOADs because they were not identified as loads by vect_gather_slp_loads. Fixed by reworking the detection. PR tree-optimization/112404 * tree-vectorizer.h (get_mask_type_for_scalar_type): Declare overload with SLP node argument. * tree-vect-stmts.cc (get_mask_type_for_scalar_type): Implement it. (vect_check_scalar_mask): Use it. * tree-vect-slp.cc (vect_gather_slp_loads): Properly identify loads also for nodes with children, like .MASK_LOAD. * tree-vect-loop.cc (vect_analyze_loop_2): Look at the representative for load nodes and check whether it is a grouped access before looking for load-lanes support. * gfortran.dg/pr112404.f90: New testcase.
2023-11-03Cleanup vectorizable_live_operationRichard Biener1-36/+17
During analyzing PR111950 I found the loop live operation code-gen odd, in particular only replacing a single PHI but then adjusting possibly remaining PHIs afterwards where there shouldn't really be any out-of-loop uses of the scalar in-loop def left. * tree-vect-loop.cc (vectorizable_live_operation): Simplify LC PHI replacement.
2023-11-03tree-optimization/112366 - remove assert for failed live lane code genRichard Biener1-6/+1
The following removes a bogus assert constraining the uses that could appear when a built from scalar defs SLP node constrains code generation in a way so earlier uses of the vector CTOR components fail to get vectorized. We can't really constrain the operation such use appears in. PR tree-optimization/112366 * tree-vect-loop.cc (vectorizable_live_operation): Remove assert.
2023-11-02ifcvt/vect: Emit COND_OP for conditional scalar reduction.Robin Dapp1-37/+156
As described in PR111401 we currently emit a COND and a PLUS expression for conditional reductions. This makes it difficult to combine both into a masked reduction statement later. This patch improves that by directly emitting a COND_ADD/COND_OP during ifcvt and adjusting some vectorizer code to handle it. It also makes neutral_op_for_reduction return -0 if HONOR_SIGNED_ZEROS is true. gcc/ChangeLog: PR middle-end/111401 * internal-fn.cc (internal_fn_else_index): New function. * internal-fn.h (internal_fn_else_index): Define. * tree-if-conv.cc (convert_scalar_cond_reduction): Emit COND_OP if supported. (predicate_scalar_phi): Add whitespace. * tree-vect-loop.cc (fold_left_reduction_fn): Add IFN_COND_OP. (neutral_op_for_reduction): Return -0 for PLUS. (check_reduction_path): Don't count else operand in COND_OP. (vect_is_simple_reduction): Ditto. (vect_create_epilog_for_reduction): Fix whitespace. (vectorize_fold_left_reduction): Add COND_OP handling. (vectorizable_reduction): Don't count else operand in COND_OP. (vect_transform_reduction): Add COND_OP handling. * tree-vectorizer.h (neutral_op_for_reduction): Add default parameter. gcc/testsuite/ChangeLog: * gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c: New test. * gcc.target/riscv/rvv/autovec/cond/pr111401.c: New test. * gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c: Adjust. * gcc.target/riscv/rvv/autovec/reduc/reduc_call-4.c: Ditto.
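An illustrative conditional reduction of the kind this targets (a sketch, not one of the added testcases); ifcvt can now emit a single .COND_ADD for the body, and when signed zeros are honored the neutral value the vectorizer uses for masked-out lanes is -0.0 rather than 0.0:

  double
  f (double *a, int *c, int n)
  {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
      if (c[i])
        sum += a[i];   /* formerly a COND_EXPR plus a PLUS_EXPR; now one .COND_ADD */
    return sum;
  }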
2023-10-23Avoid compile time hog on vect_peel_nonlinear_iv_init for nonlinear induction vec_step_op_mul when iteration count is too big.liuhongt1-4/+9
There's a loop in vect_peel_nonlinear_iv_init to get init_expr * pow (step_expr, skip_niters). When skip_niters is too big, compile time hogs. To avoid that, optimize init_expr * pow (step_expr, skip_niters) to init_expr << (exact_log2 (step_expr) * skip_niters) when step_expr is a power of 2; otherwise give up vectorization when skip_niters >= TYPE_PRECISION (TREE_TYPE (init_expr)). Also give up vectorization when niters_skip is negative, which will be used for fully masked loops. gcc/ChangeLog: PR tree-optimization/111820 PR tree-optimization/111833 * tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): Give up vectorization for nonlinear iv vect_step_op_mul when step_expr is not exact_log2 and niters is greater than TYPE_PRECISION (TREE_TYPE (step_expr)). Also don't vectorize for negative niters_skip which will be used by fully masked loop. (vect_can_advance_ivs_p): Pass whole phi_info to vect_can_peel_nonlinear_iv_p. * tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Optimize init_expr * pow (step_expr, skipn) to init_expr << (log2 (step_expr) * skipn) when step_expr is exact_log2. gcc/testsuite/ChangeLog: * gcc.target/i386/pr111820-1.c: New test. * gcc.target/i386/pr111820-2.c: New test. * gcc.target/i386/pr111820-3.c: New test. * gcc.target/i386/pr103144-mul-1.c: Adjust testcase. * gcc.target/i386/pr103144-mul-2.c: Adjust testcase.
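A sketch of the identity the optimization relies on (hypothetical helper, not code from the patch): when step_expr is a power of two, init_expr * pow (step_expr, skip_niters) collapses to a shift, so no loop over skip_niters is needed:

  /* Illustrative only.  Assumes step is a power of two and that
     lg * skip stays below the precision of init, which mirrors the
     condition under which the vectorizer now takes this path.  */
  unsigned int
  peel_init (unsigned int init, unsigned int step, unsigned int skip)
  {
    unsigned int lg = __builtin_ctz (step);   /* exact_log2 (step) */
    return init << (lg * skip);               /* init * step**skip */
  }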
2023-10-20Rewrite more refs for epilogue vectorizationRichard Biener1-5/+6
The following makes sure to rewrite all gather/scatter detected by dataref analysis plus stmts classified as VMAT_GATHER_SCATTER. Maybe we need to rewrite all refs, the following covers the cases I've run into now. * tree-vect-loop.cc (update_epilogue_loop_vinfo): Rewrite both STMT_VINFO_GATHER_SCATTER_P and VMAT_GATHER_SCATTER stmt refs.