|
When AVX512 uses a fully masked loop together with peeling, we fail in some
cases to create the correct initial loop mask when the mask is composed of
multiple components. The following fixes this by properly applying the
per-component bias to the shift amount.
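As an illustrative sketch of the fix's shape (the names nmasks, nlanes and
peel_niters below are assumptions for illustration, not the patch's variables):
/* When the loop mask spans several components, each component's shift
   must be rebased by the lanes covered by the preceding components.  */
for (unsigned int i = 0; i < nmasks; i++)
  {
    /* Component i covers lanes [i * nlanes, (i + 1) * nlanes).  */
    HOST_WIDE_INT bias = (HOST_WIDE_INT) i * nlanes;
    HOST_WIDE_INT shift = peel_niters - bias;   /* was just peel_niters */
    /* ... clamp shift to [0, nlanes] and build component i's mask.  */
  }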
PR tree-optimization/115843
* tree-vect-loop-manip.cc
(vect_set_loop_condition_partial_vectors_avx512): Properly
bias the shift of the initial mask for alignment peeling.
* gcc.dg/vect/pr115843.c: New testcase.
|
|
Now that pointers and integers have been disambiguated from irange,
and all the pointer range temporaries use prange, we can reclaim
value_range as a general purpose range container.
This patch removes the typedef, in favor of int_range_max, thus
providing slightly better ranges in places. I have also used
int_range<1> or <2> when it's known ahead of time how big the range will
be, thus saving a few words.
In a follow-up patch I will rename the Value_Range temporary to
value_range.
No change in performance.
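As a rough sketch of what the replacement looks like at a use site (the
surrounding query code here is an illustration, not lifted from the patch):
int_range_max r;   /* was: value_range r; */
if (get_range_query (cfun)->range_of_expr (r, expr, stmt)
    && !r.undefined_p ()
    && !r.varying_p ())
  {
    /* When the number of subranges is known up front, a fixed-size
       range saves a few words.  */
    int_range<2> fixed;
    fixed = r;   /* may conservatively truncate to two subranges */
  }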
gcc/ChangeLog:
* builtins.cc (expand_builtin_strnlen): Replace value_range use
with int_range_max or irange when appropriate.
(determine_block_size): Same.
* fold-const.cc (minmax_from_comparison): Same.
* gimple-array-bounds.cc (check_out_of_bounds_and_warn): Same.
(array_bounds_checker::check_array_ref): Same.
* gimple-fold.cc (size_must_be_zero_p): Same.
* gimple-predicate-analysis.cc (find_var_cmp_const): Same.
* gimple-ssa-sprintf.cc (get_int_range): Same.
(format_integer): Same.
(try_substitute_return_value): Same.
(handle_printf_call): Same.
* gimple-ssa-warn-restrict.cc
(builtin_memref::extend_offset_range): Same.
* graphite-sese-to-poly.cc (add_param_constraints): Same.
* internal-fn.cc (get_min_precision): Same.
* match.pd: Same.
* pointer-query.cc (get_size_range): Same.
* range-op.cc (get_shift_range): Same.
(operator_trunc_mod::op1_range): Same.
(operator_trunc_mod::op2_range): Same.
* range.cc (range_negatives): Same.
* range.h (range_positives): Same.
(range_negatives): Same.
* tree-affine.cc (expr_to_aff_combination): Same.
* tree-data-ref.cc (compute_distributive_range): Same.
(nop_conversion_for_offset_p): Same.
(split_constant_offset): Same.
(split_constant_offset_1): Same.
(dr_step_indicator): Same.
* tree-dfa.cc (get_ref_base_and_extent): Same.
* tree-scalar-evolution.cc (iv_can_overflow_p): Same.
* tree-ssa-math-opts.cc (optimize_spaceship): Same.
* tree-ssa-pre.cc (insert_into_preds_of_block): Same.
* tree-ssa-reassoc.cc (optimize_range_tests_to_bit_test): Same.
* tree-ssa-strlen.cc (compare_nonzero_chars): Same.
(dump_strlen_info): Same.
(get_range_strlen_dynamic): Same.
(set_strlen_range): Same.
(maybe_diag_stxncpy_trunc): Same.
(strlen_pass::get_len_or_size): Same.
(strlen_pass::handle_builtin_string_cmp): Same.
(strlen_pass::count_nonzero_bytes_addr): Same.
(strlen_pass::handle_integral_assign): Same.
* tree-switch-conversion.cc (bit_test_cluster::emit): Same.
* tree-vect-loop-manip.cc (vect_gen_vector_loop_niters): Same.
(vect_do_peeling): Same.
* tree-vect-patterns.cc (vect_get_range_info): Same.
(vect_recog_divmod_pattern): Same.
* tree.cc (get_range_pos_neg): Same.
* value-range.cc (debug): Remove value_range variants.
* value-range.h (value_range): Remove typedef.
* vr-values.cc
(simplify_using_ranges::op_with_boolean_value_range_p): Replace
value_range use with int_range_max or irange when appropriate.
(check_for_binary_op_overflow): Same.
(simplify_using_ranges::legacy_fold_cond_overflow): Same.
(find_case_label_ranges): Same.
(simplify_using_ranges::simplify_abs_using_ranges): Same.
(test_for_singularity): Same.
(simplify_using_ranges::simplify_compare_using_ranges_1): Same.
(simplify_using_ranges::simplify_casted_compare): Same.
(simplify_using_ranges::simplify_switch_using_ranges): Same.
(simplify_conversion_using_ranges): Same.
(simplify_using_ranges::two_valued_val_range_p): Same.
|
|
When we update the dominator of the redirected exit after peeling
we check whether the immediate dominator was the loop header rather
than the exit source when we later want to just update it to the
new source. The following fixes this oversight.
PR tree-optimization/114832
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Fix dominance check.
* gcc.dg/vect/pr114832.c: New testcase.
|
|
We can't use vect_update_ivs_after_vectorizer for partial vectors;
the following fixes vect_can_peel_nonlinear_iv_p accordingly.
PR tree-optimization/114485
* tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p):
vect_step_op_neg isn't OK for partial vectors but only
for unknown niter.
* gcc.dg/vect/pr114485.c: New testcase.
|
|
r14-7036-gcbf569486b2dec added an epilogue vectorization guard for early
break but PR114196 shows that we also run into the problem without early
break. Therefore merge the condition into the topmost vectorization
guard.
gcc/ChangeLog:
PR middle-end/114196
* tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): Merge
vectorization guards.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/pr114196.c: New test.
* gcc.target/riscv/rvv/autovec/pr114196.c: New test.
|
|
The following implements manual update for multi-exit loop prologue
peeling during vectorization.
PR tree-optimization/114081
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Perform manual dominator update for prologue peeling.
(vect_do_peeling): Properly update dominators after adding the
prologue-around guard.
* gcc.dg/vect/vect-early-break_121-pr114081.c: New testcase.
|
|
In some cases exits can lack LC PHI nodes for the virtual operand.
We have to create them when the epilog loop requires them, which also
allows us to remove some only halfway correct fixups. This is the
variant triggering for alternate exits.
PR tree-optimization/114099
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Create and fill in a needed virtual LC PHI for the alternate
exits. Remove code dealing with that missing.
* gcc.dg/vect/vect-early-break_120-pr114099.c: New testcase.
|
|
When we choose the IV exit to be one leading to no virtual use we
fail to have a virtual LC PHI even though we need it for the epilog
entry. The following makes sure to create it so that later updating
works.
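A sketch of what such a helper might look like (the body below is an
assumption based on the description, not the committed implementation):
/* Find the virtual operand live on edge E.  */
static tree
get_live_virtual_operand_on_edge (edge e)
{
  /* The live virtual operand is the vdef or vuse of the last statement
     in E->src that has one, else the result of a virtual PHI.  */
  for (gimple_stmt_iterator gsi = gsi_last_bb (e->src);
       !gsi_end_p (gsi); gsi_prev (&gsi))
    {
      gimple *stmt = gsi_stmt (gsi);
      if (gimple_vdef (stmt))
        return gimple_vdef (stmt);
      if (gimple_vuse (stmt))
        return gimple_vuse (stmt);
    }
  if (gphi *vphi = get_virtual_phi (e->src))
    return gimple_phi_result (vphi);
  return NULL_TREE;
}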
PR tree-optimization/114068
* tree-vect-loop-manip.cc (get_live_virtual_operand_on_edge):
New function.
(slpeel_tree_duplicate_loop_to_edge_cfg): Add a virtual LC PHI
on the main exit if needed. Remove band-aid for the case
it was missing.
* gcc.dg/vect/vect-early-break_118-pr114068.c: New testcase.
* gcc.dg/vect/vect-early-break_119-pr114068.c: Likewise.
|
|
The following handles the case of the main exit going to a path without
virtual use and handles it similar to the alternate exit handling.
PR tree-optimization/113659
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Handle main exit without virtual use.
* gcc.dg/pr113659.c: New testcase.
|
|
This refactors the handling of PHIs in between the main and the
epilogue loop. Instead of trying to handle the multiple exit
and original single exit cases together, the following separates
these cases, resulting in much easier to understand code.
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Separate single and multi-exit case when creating PHIs between
the main and epilogue.
|
|
The following makes reduction epilogue code generation happy by properly
adding LC PHIs to the exit blocks for multiple exit vectorized loops.
Some refactoring might make the flow easier to follow but I've refrained
from doing that with this patch.
I've kept some fixes in reduction epilogue generation from the earlier
attempt fixing this PR.
PR tree-optimization/113373
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Create LC PHIs in the exit blocks where necessary.
* tree-vect-loop.cc (vectorizable_live_operation): Do not try
to handle missing LC PHIs.
(find_connected_edge): Remove.
(vect_create_epilog_for_reduction): Cleanup use of auto_vec.
* gcc.dg/vect/vect-early-break_104-pr113373.c: New testcase.
|
|
The following handles the situation where we lack a loop-closed
PHI for a virtual operand because a loop exit goes to a code
region not having any virtual use (an endless loop). It also
handles the situation of edge redirection re-allocating a PHI node
in the destination block, so we have to re-look it up before
populating the new PHI argument.
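A minimal sketch of the re-lookup pattern (an assumed shape, not the
patch's exact code):
/* redirect_edge_and_branch may reallocate PHIs in the destination,
   so fetch the PHI again afterwards instead of reusing a pointer
   obtained before the redirection.  */
edge e = redirect_edge_and_branch (exit_edge, new_dest);
gphi *vphi = get_virtual_phi (new_dest);   /* fresh lookup */
if (vphi)
  add_phi_arg (vphi, vuse, e, UNKNOWN_LOCATION);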
PR tree-optimization/113494
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Handle endless loop on exit. Handle re-allocated PHI.
|
|
The following fixes wrong virtual operands being used for peeled
early breaks where we can have different live ones and for multiple
exits it makes sure to update the correct PHI arguments.
I've introduced SET_PHI_ARG_DEF_ON_EDGE so we can avoid using
a wrong edge to compute the PHI arg index from.
I've taken the liberty of working through the code again and refactoring
and commenting it a bit differently. The main functional change
is that we preserve the live virtual operand on all exits.
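The macro plausibly reduces to indexing by the edge's own destination
index; a sketch (the exact definition is an assumption):
/* Set the PHI argument for edge E, computing the argument index from
   E itself rather than from a separately supplied edge.  */
#define SET_PHI_ARG_DEF_ON_EDGE(PHI, E, DEF) \
  SET_PHI_ARG_DEF (PHI, (E)->dest_idx, DEF)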
PR tree-optimization/113374
* tree-ssa-operands.h (SET_PHI_ARG_DEF_ON_EDGE): New.
* tree-vect-loop.cc (move_early_exit_stmts): Update
virtual LC PHIs.
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Refactor. Preserve virtual LC PHIs on all exits.
* gcc.dg/vect/vect-early-break_106-pr113374.c: New testcase.
|
|
The following avoids prologue peeling when doing early exit
vectorization with the IV exit before the early exit. That's because
it invalidates the invariant that the effective latch of the loop
is empty, causing wrong continuation to the main loop. In particular
this is prone to break virtual SSA form.
PR tree-optimization/113371
* tree-vect-data-refs.cc (vect_enhance_data_refs_alignment):
Do not peel when LOOP_VINFO_EARLY_BREAKS_VECT_PEELED.
* tree-vect-loop-manip.cc (vect_do_peeling): Assert we do
not perform prologue peeling when LOOP_VINFO_EARLY_BREAKS_VECT_PEELED.
* gcc.dg/vect/pr113371.c: New testcase.
|
|
The following avoids splitting an edge before redirecting it. This
allows the loop father of the new block to be correct in the first
place.
PR tree-optimization/113385
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
First redirect, then split the exit edge.
|
|
This replaces two more usages of single_exit that I had missed before.
They both seem to happen when we re-use the ifcvt scalar loop for versioning.
The condition in versioning is the same as the one for when we don't re-use the
scalar loop.
gcc/ChangeLog:
* tree-vect-loop-manip.cc (vect_loop_versioning): Replace single_exit.
* tree-vect-loop.cc (vect_transform_loop): Likewise.
|
|
[PR113237]
Building on top of the previous patch: similar to the single exit case, if
we have a case where all exits are considered early exits and there are
existing non-virtual PHIs, then in order to maintain LCSSA we have to use
the existing PHI variables. We can't simply clear them and just rebuild them
because the order of the PHIs in the main exit must match the original exit
for when we add the skip_epilog guard.
But the infrastructure is already in place to maintain them; we just have to
use the right value.
gcc/ChangeLog:
PR tree-optimization/113237
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Use
existing LCSSA variable for exit when all exits are early break.
gcc/testsuite/ChangeLog:
PR tree-optimization/113237
* gcc.dg/vect/vect-early-break_98-pr113237.c: New test.
|
|
operands
This patch fixes several interconnected issues.
1. When picking an exit we wanted to check that niter_desc.may_be_zero is not
   true, i.e. we want to pick an exit which we know will iterate at least once.
   However niter_desc.may_be_zero is not a boolean. It is a tree that encodes
   a boolean value. !niter_desc.may_be_zero just checks whether we have some
   information, not what the information is (see the sketch after this list).
   This led us to pick a more difficult to vectorize exit more often than we
   should.
2. Because we had this bug, we used to pick an alternative exit much more
   often, which exposed another issue: when the loop accesses memory and we
   "invert it" we would corrupt the VUSE chain. This is because on a peeled
   vector iteration every exit restarts the loop (i.e. they're all early) BUT
   since we may have performed a store, the vUSE would need to be updated.
   This version maintains virtual PHIs correctly in these cases. Note that we
   can't simply remove all of them and recreate them because we need the PHI
   nodes still in the right order for if skip_vector.
3. Since we're moving the stores to a safe location I don't think we actually
need to analyze whether the store is in range of the memref, because if we
ever get there, we know that the loads must be in range, and if the loads are
in range and we get to the store we know the early breaks were not taken and
so the scalar loop would have done the VF stores too.
4. Instead of searching for where to move stores to, they should always be
   moved to the exit belonging to the latch. We can only ever delay stores,
   and even if we pick a different exit than the latch one as the main one,
   effects still happen in program order when vectorized. If we don't move
   the stores to the latch exit but instead to wherever we pick as the "main"
   exit then we can perform incorrect memory accesses (luckily these are
   trapped by verify_ssa).
5. We only used to analyze loads inside the same BB as an early break, and also
we'd never analyze the ones inside the block where we'd be moving memory
references to. This is obviously bogus and to fix it this patch splits apart
the two constraints. We first validate that all load memory references are
in bounds and only after that do we perform the alias checks for the writes.
This makes the code simpler to understand and more trivially correct.
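For point 1, the corrected check looks roughly like this (a sketch;
integer_zerop is the usual GCC predicate for tree-encoded booleans):
/* may_be_zero is a tree encoding a boolean, so test its value,
   not its presence.  */
if (!niter_desc.may_be_zero)                  /* wrong: "any info?"  */
  ...
if (integer_zerop (niter_desc.may_be_zero))   /* right: "known false?"  */
  ...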
gcc/ChangeLog:
PR tree-optimization/113137
PR tree-optimization/113136
PR tree-optimization/113172
PR tree-optimization/113178
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Maintain PHIs on inverted loops.
(vect_do_peeling): Maintain virtual PHIs on inverted loops.
* tree-vect-loop.cc (vec_init_loop_exit_info): Pick exit closes to
latch.
(vect_create_loop_vinfo): Record all conds instead of only alt ones.
gcc/testsuite/ChangeLog:
PR tree-optimization/113137
PR tree-optimization/113136
PR tree-optimization/113172
PR tree-optimization/113178
* g++.dg/vect/vect-early-break_4-pr113137.cc: New test.
* g++.dg/vect/vect-early-break_5-pr113137.cc: New test.
* gcc.dg/vect/vect-early-break_95-pr113137.c: New test.
* gcc.dg/vect/vect-early-break_96-pr113136.c: New test.
* gcc.dg/vect/vect-early-break_97-pr113172.c: New test.
|
|
When we peel at_exit we are moving the new loop to the exit of the previous
loop. This means that the blocks outside the loop that the previous loop used
to dominate are no longer being dominated by it.
The new dominators however are hard to predict, since if the loop has multiple
exits and all the exits are "early" ones then we always execute the scalar
loop. In this case the scalar loop can completely dominate the new loop.
If we later have skip_vector then there's an additional skip edge added that
might change the dominators.
The previous patch would force an update of all blocks reachable from the new
exits. This one updates *only* blocks that we know the scalar exits dominated.
For the examples this reduces the blocks to update from 18 to 3.
gcc/ChangeLog:
PR tree-optimization/113144
PR tree-optimization/113145
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Update all BB that the original exits dominated.
gcc/testsuite/ChangeLog:
PR tree-optimization/113144
PR tree-optimization/113145
* gcc.dg/vect/vect-early-break_94-pr113144.c: New test.
|
|
The late amendment with a limit based on VF was redundant and wrong
for peeled early exits. The following moves the adjustment done
when we don't have a skip edge down to the place where the already
existing VF based max iter check is done and removes the amendment.
PR tree-optimization/113026
* tree-vect-loop-manip.cc (vect_do_peeling): Remove
redundant and wrong niter bound setting. Move niter
bound adjustment down.
|
|
We can't support nonlinear inductions other than neg when vectorizing
early breaks and the iteration count is known.
For early break we currently require a peeled epilog, but in these cases
we can't compute the remaining values.
gcc/ChangeLog:
PR middle-end/113163
* tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p):
Reject non-linear inductions that aren't supported.
gcc/testsuite/ChangeLog:
PR middle-end/113163
* gcc.target/gcn/pr113163.c: New test.
|
|
The following avoids creating a niter peeling epilog more consistently,
matching what peeling later uses for the skip_vector condition, in
particular when versioning is required which then also ensures the
vector loop is entered unless the epilog is vectorized. This should
ideally match LOOP_VINFO_VERSIONING_THRESHOLD, which is only computed
later; some refactoring could make that match better.
The patch also makes sure to adjust the upper bound of the epilogues
when we do not have a skip edge around the vector loop.
PR tree-optimization/113026
* tree-vect-loop.cc (vect_need_peeling_or_partial_vectors_p):
Avoid an epilog in more cases.
* tree-vect-loop-manip.cc (vect_do_peeling): Adjust the
epilogues niter upper bounds and estimates.
* gcc.dg/torture/pr113026-1.c: New testcase.
* gcc.dg/torture/pr113026-2.c: Likewise.
|
|
This patch adds initial support for early break vectorization in GCC. In other
words it implements support for vectorization of loops with multiple exits.
The support is added for any target that implements a vector cbranch optab,
this includes both fully masked and non-masked targets.
Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR / Bitwise AND. This is however only checked
when the comparison would produce multiple statements.
This also fully decouples the vectorizer's notion of exit from the existing loop
infrastructure's exit. Before this patch the vectorizer always picked the
natural loop latch connected exit as the main exit.
After this patch the vectorizer is free to choose any exit it deems appropriate
as the main exit. This means that even if the main exit is not countable (i.e.
the termination condition could not be determined) we might still be able to
vectorize should one of the other exits be countable.
In such situations the loop is reflowed, which enables vectorization of many
other loop forms.
Concretely, the kinds of loops supported are of the form:
for (int i = 0; i < N; i++)
{
<statements1>
if (<condition>)
{
...
<action>;
}
<statements2>
}
where <action> can be:
- break
- return
- goto
Any number of statements can be used before the <action> occurs.
Since this is an initial version for GCC 14 it has the following limitations and
features:
- Only fixed sized iterations and buffers are supported. That is to say any
vectors loaded or stored must be to statically allocated arrays with known
sizes. N must also be known. This limitation is because our primary target
for this optimization is SVE. For VLA SVE we can't easily do cross page
iteration checks. The result is likely to also not be beneficial. For that
reason we punt support for variable buffers till we have First-Faulting
support in GCC 15.
- Any stores in <statements1> should not be to the same objects as in
<condition>. Loads are fine as long as they don't have the possibility to
alias. More concretely, we block RAW dependencies when the intermediate value
can't be separated from the store, or the store itself can't be moved (see
the sketch after this list).
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported.
- Any number of loop early exits are supported.
- No support for epilogue vectorization. The only epilogue supported is the
scalar final one. Peeling code supports it but the code motion code cannot
find instructions to make the move in the epilog.
- Early breaks are only supported for inner loop vectorization.
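A sketch of the blocked RAW shape mentioned above (a hypothetical example,
not taken from the testsuite):
/* The store in <statements1> writes the object the break condition
   later loads, so the store cannot be delayed past the break without
   changing what the load sees.  */
for (int i = 0; i < N; i++)
  {
    vect_a[i] = x + i;    /* store ...                        */
    if (vect_a[i] > x)    /* ... read back by the condition.  */
      break;
  }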
With the help of IPA and LTO this still gets hit quite often. During bootstrap
it hit rather frequently. Additionally TSVC s332, s481 and s482 all pass now
since these are tests for support for early exit vectorization.
This implementation does not support completely handling the early break inside
the vector loop itself but instead supports adding checks such that if we know
that we have to exit in the current iteration then we branch to scalar code to
actually do the final VF iterations which handles all the code in <action>.
For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations. For vector code we only care about the state of fully
performed iterations and reset the scalar code to the (partially) remaining
loop. That is to say, the first vector loop executes so long as the early exit
isn't needed. Once the exit is taken, the scalar code will perform at most VF
extra iterations. The exact number depends on peeling and iteration start and
which exit was taken (natural or early). For this scalar loop, all early exits
are treated the same.
When we vectorize we move any statement not related to the early break itself
and that would be incorrect to execute before the break (i.e. has side effects)
to after the break. If this is not possible we decline to vectorize. The
analysis and code motion also take into account that they don't introduce a RAW
dependency after the move of the stores.
This means that we check at the start of iterations whether we are going to
exit or not. During the analysis phase we check whether we are allowed to do
this moving of statements. Also note that we move only the scalar statements,
and we do so after peeling but just before we start transforming statements.
With this the vector flow no longer necessarily needs to match that of the
scalar code. In addition most of the infrastructure is in place to support
general control flow safely, however we are punting this to GCC 15.
Codegen for e.g.:
unsigned vect_a[N];
unsigned vect_b[N];
unsigned test4(unsigned x)
{
unsigned ret = 0;
for (int i = 0; i < N; i++)
{
vect_b[i] = x + i;
if (vect_a[i] > x)
break;
vect_a[i] = x;
}
return ret;
}
We generate for Adv. SIMD:
test4:
adrp x2, .LC0
adrp x3, .LANCHOR0
dup v2.4s, w0
add x3, x3, :lo12:.LANCHOR0
movi v4.4s, 0x4
add x4, x3, 3216
ldr q1, [x2, #:lo12:.LC0]
mov x1, 0
mov w2, 0
.p2align 3,,7
.L3:
ldr q0, [x3, x1]
add v3.4s, v1.4s, v2.4s
add v1.4s, v1.4s, v4.4s
cmhi v0.4s, v0.4s, v2.4s
umaxp v0.4s, v0.4s, v0.4s
fmov x5, d0
cbnz x5, .L6
add w2, w2, 1
str q3, [x1, x4]
str q2, [x3, x1]
add x1, x1, 16
cmp w2, 200
bne .L3
mov w7, 3
.L2:
lsl w2, w2, 2
add x5, x3, 3216
add w6, w2, w0
sxtw x4, w2
ldr w1, [x3, x4, lsl 2]
str w6, [x5, x4, lsl 2]
cmp w0, w1
bcc .L4
add w1, w2, 1
str w0, [x3, x4, lsl 2]
add w6, w1, w0
sxtw x1, w1
ldr w4, [x3, x1, lsl 2]
str w6, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
add w4, w2, 2
str w0, [x3, x1, lsl 2]
sxtw x1, w4
add w6, w1, w0
ldr w4, [x3, x1, lsl 2]
str w6, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
str w0, [x3, x1, lsl 2]
add w2, w2, 3
cmp w7, 3
beq .L4
sxtw x1, w2
add w2, w2, w0
ldr w4, [x3, x1, lsl 2]
str w2, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
str w0, [x3, x1, lsl 2]
.L4:
mov w0, 0
ret
.p2align 2,,3
.L6:
mov w7, 4
b .L2
and for SVE:
test4:
adrp x2, .LANCHOR0
add x2, x2, :lo12:.LANCHOR0
add x5, x2, 3216
mov x3, 0
mov w1, 0
cntw x4
mov z1.s, w0
index z0.s, #0, #1
ptrue p1.b, all
ptrue p0.s, all
.p2align 3,,7
.L3:
ld1w z2.s, p1/z, [x2, x3, lsl 2]
add z3.s, z0.s, z1.s
cmplo p2.s, p0/z, z1.s, z2.s
b.any .L2
st1w z3.s, p1, [x5, x3, lsl 2]
add w1, w1, 1
st1w z1.s, p1, [x2, x3, lsl 2]
add x3, x3, x4
incw z0.s
cmp w3, 803
bls .L3
.L5:
mov w0, 0
ret
.p2align 2,,3
.L2:
cntw x5
mul w1, w1, w5
cbz w5, .L5
sxtw x1, w1
sub w5, w5, #1
add x5, x5, x1
add x6, x2, 3216
b .L6
.p2align 2,,3
.L14:
str w0, [x2, x1, lsl 2]
cmp x1, x5
beq .L5
mov x1, x4
.L6:
ldr w3, [x2, x1, lsl 2]
add w4, w0, w1
str w4, [x6, x1, lsl 2]
add x4, x1, 1
cmp w0, w3
bcs .L14
mov w0, 0
ret
On the workloads this work is based on we see between 2-3x performance uplift
using this patch.
Follow up plan:
- Boolean vectorization has several shortcomings. I've filed PR110223 with the
bigger ones that cause vectorization to fail with this patch.
- SLP support. This is planned for GCC 15 as for the majority of the cases
building SLP itself fails. This means I'll need to spend time making this
more robust first. Additionally it requires:
* Adding support for vectorizing CFG (gconds)
* Support for CFG to differ between vector and scalar loops.
Both of which would be disruptive to the tree and I suspect I'll be handling
fallouts from this patch for a while. So I plan to work on the surrounding
building blocks first for the remainder of the year.
Additionally it also contains reduced cases from issues found running over
various codebases.
Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
Also regtested with:
-march=armv8.3-a+sve
-march=armv8.3-a+nosve
-march=armv9-a
-mcpu=neoverse-v1
-mcpu=neoverse-n2
Bootstrapped Regtested x86_64-pc-linux-gnu and no issues.
Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.
gcc/ChangeLog:
* tree-if-conv.cc (idx_within_array_bound): Expose.
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
(vect_analyze_data_ref_dependences): Use it.
* tree-vect-loop-manip.cc (vect_iv_increment_position): New.
(vect_set_loop_controls_directly,
vect_set_loop_condition_partial_vectors,
vect_set_loop_condition_partial_vectors_avx512,
vect_set_loop_condition_normal): Support multiple exits.
(slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSSA peeling for
multiple exits.
(slpeel_can_duplicate_loop_p): Change vectorizer from looking at BB
count and instead look at loop shape.
(vect_update_ivs_after_vectorizer): Drop asserts.
(vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations.
(vect_do_peeling): Support multiple exits.
(vect_loop_versioning): Likewise.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
early_breaks.
(vect_analyze_loop_form): Support loop flows with more than single BB
loop body.
(vect_create_loop_vinfo): Support niters analysis for multiple exits.
(vect_analyze_loop): Likewise.
(vect_get_vect_def): New.
(vect_create_epilog_for_reduction): Support early exit reductions.
(vectorizable_live_operation_1): New.
(find_connected_edge): New.
(vectorizable_live_operation): Support early exit live operations.
(move_early_exit_stmts): New.
(vect_transform_loop): Use it.
* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
(vect_recog_bitfield_ref_pattern): Support gconds and bools.
(vect_recog_gcond_pattern): New.
(possible_vector_mask_operation_p): Support gcond masks.
(vect_determine_mask_precision): Likewise.
(vect_mark_pattern_stmts): Set gcond def type.
(can_vectorize_live_stmts): Force early break inductions to be live.
* tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for
early breaks.
(vect_mark_stmts_to_be_vectorized): Process gcond usage.
(perm_mask_for_reverse): Expose.
(vectorizable_comparison_1): New.
(vectorizable_early_exit): New.
(vect_analyze_stmt): Support early break and gcond.
(vect_transform_stmt): Likewise.
(vect_is_simple_use): Likewise.
(vect_get_vector_types_for_stmt): Likewise.
* tree-vectorizer.cc (pass_vectorize::execute): Update exits for value
numbering.
* tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
LOOP_VINFO_EARLY_BRK_VUSES): New.
(is_loop_header_bb_p): Drop assert.
(class loop): Add early_breaks, early_break_stores, early_break_dest_bb,
early_break_vuses.
(vect_iv_increment_position, perm_mask_for_reverse,
ref_within_array_bound): New.
(slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
|
|
Before my refactoring, if the loop->latch was incorrect then find_loop_location
skipped checking the edges and would eventually return a dummy location.
It turns out that a loop can have
loops_state_satisfies_p (LOOPS_HAVE_RECORDED_EXITS) but also not have a latch,
in which case get_loop_exit_edges traps.
This restores the old behavior.
gcc/ChangeLog:
PR tree-optimization/111878
* tree-vect-loop-manip.cc (find_loop_location): Skip edges check if
latch incorrect.
gcc/testsuite/ChangeLog:
PR tree-optimization/111878
* gcc.dg/graphite/pr111878.c: New test.
|
|
The following simplifies LC-PHI arg population during epilog peeling,
thereby fixing the testcase in this PR.
PR tree-optimization/111950
* tree-vect-loop-manip.cc (slpeel_duplicate_current_defs_from_edges):
Remove.
(find_guard_arg): Likewise.
(slpeel_update_phi_nodes_for_guard2): Likewise.
(slpeel_tree_duplicate_loop_to_edge_cfg): Remove calls to
slpeel_duplicate_current_defs_from_edges, do not elide
LC-PHIs for invariant values.
(vect_do_peeling): Materialize PHI arguments for the edge
around the epilog from the PHI defs of the main loop exit.
* gcc.dg/torture/pr111950.c: New testcase.
|
|
[PR111860]
The previous patch tried to remove PHI nodes that dominated the first loop;
however, the correct fix is to only remove .MEM nodes.
This patch thus makes the condition a bit stricter and only tries to remove
.MEM PHI nodes.
I couldn't figure out a way to easily determine if a particular PHI is vUSE
related, so the patch does:
1. check if the definition is a vDEF and not defined in main loop.
2. check if the definition is a PHI and not defined in main loop.
3. check if the definition is a default definition.
For nos. 2 and 3 we may misidentify the PHI; in both cases the value is defined
outside of the loop version block, which also makes it OK to remove.
gcc/ChangeLog:
PR tree-optimization/111860
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Drop .MEM nodes only.
gcc/testsuite/ChangeLog:
PR tree-optimization/111860
* gcc.dg/vect/pr111860-2.c: New test.
* gcc.dg/vect/pr111860-3.c: New test.
|
|
induction vec_step_op_mul when iteration count is too big.
There's a loop in vect_peel_nonlinear_iv_init computing init_expr *
pow (step_expr, skip_niters). When skip_niters is too big, this hogs compile
time. To avoid that, optimize init_expr * pow (step_expr, skip_niters) to
init_expr << (exact_log2 (step_expr) * skip_niters) when step_expr is a
power of 2; otherwise give up on vectorization when skip_niters >=
TYPE_PRECISION (TREE_TYPE (init_expr)).
Also give up on vectorization when niters_skip is negative, which will be
used for fully masked loops.
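The identity being exploited, as a standalone sketch (an illustration, not
the patch's code; it assumes the shift stays within the type's precision,
which is what the patch checks):
/* init * step^k == init << (log2 (step) * k) when step is a power
   of two.  */
unsigned int
peeled_init (unsigned int init, unsigned int step, unsigned int k)
{
  int logstep = __builtin_ctz (step);   /* plays the exact_log2 role */
  return init << (logstep * k);         /* == init * pow (step, k) */
}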
gcc/ChangeLog:
PR tree-optimization/111820
PR tree-optimization/111833
* tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): Give
up vectorization for nonlinear iv vect_step_op_mul when
step_expr is not exact_log2 and niters is greater than
TYPE_PRECISION (TREE_TYPE (step_expr)). Also don't vectorize
for negative niters_skip which will be used by fully masked
loop.
(vect_can_advance_ivs_p): Pass whole phi_info to
vect_can_peel_nonlinear_iv_p.
* tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Optimize
init_expr * pow (step_expr, skipn) to init_expr
<< (log2 (step_expr) * skipn) when step_expr is exact_log2.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr111820-1.c: New test.
* gcc.target/i386/pr111820-2.c: New test.
* gcc.target/i386/pr111820-3.c: New test.
* gcc.target/i386/pr103144-mul-1.c: Adjust testcase.
* gcc.target/i386/pr103144-mul-2.c: Adjust testcase.
|
|
peeling
During the refactoring I had passed loop_vinfo on to vect_set_loop_condition
during prolog peeling. This parameter is unused in most cases except in
vect_set_loop_condition_partial_vectors, where its behaviour depends on whether
loop_vinfo is NULL or not. Apparently this code expects it to be NULL, in which
case it reads the structures from a different location.
This fixes the failing testcase, which was not using the lens values determined
earlier in vectorizable_store because it was looking them up in the given
loop_vinfo instead.
gcc/ChangeLog:
PR tree-optimization/111866
* tree-vect-loop-manip.cc (vect_do_peeling): Pass null as vinfo to
vect_set_loop_condition during prolog peeling.
|
|
As the testcase shows, when a PHI node dominates the loop there is no new
definition inside the loop. As such there would be no PHI nodes to update.
When we maintain LCSSA form we create an intermediate node in between the two
loops to thread along the value. However later on when we update the second
loop we don't have any PHI nodes to update and so adjust_phi_and_debug_stmts
does nothing. This leaves us with an incorrect phi node. Normally this does
nothing and just gets ignored. But in the case of the vUSE chain we end up
corrupting the chain.
As such whenever a PHI node's argument dominates the loop, we should remove
the newly created PHI node after edge redirection.
The one exception to this is when the loop has been versioned. In such cases
the versioned loop may not use the value but the second loop can.
When this happens and we add the loop guard, then unless the join block has the
PHI it can't find the original value for use inside the guard block.
The next refactoring in the series moves the formation of the guard block
inside peeling itself. Here we have all the information and wouldn't
need to re-create it later.
gcc/ChangeLog:
PR tree-optimization/111860
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Remove PHI nodes that dominate loop.
gcc/testsuite/ChangeLog:
PR tree-optimization/111860
* gcc.dg/vect/pr111860.c: New test.
|
|
This final patch updates peeling to maintain LCSSA all the way through.
It's significantly easier to maintain it during peeling while we still know
where all new edges connect rather than touching it up later as is currently
being done.
This allows us to remove many of the helper functions that touch up the loops
at various parts. The only complication is for loop distribution where we
should be able to use the same, however ldist depending on whether
redirect_lc_phi_defs is true or not will either try to maintain a limited LCSSA
form itself or removes are non-virtual phis.
The problem here is that if we maintain LCSSA then in some cases the blocks
connecting the two loops get PHIs to keep the loop IV up to date.
However there is no loop: the guard condition is rewritten as 0 != 0, so the
"loop" always exits. Due to the PHI nodes the probabilities get completely
wrong; it seems to think that the impossible exit is the likely edge. This
causes incorrect warnings, and the presence of the PHIs prevents the blocks
from being simplified.
While it may be possible to make ldist work with LCSSA form, doing so seems more
work than not. For that reason the peeling code has an additional parameter
used only by ldist to not connect the two loops during peeling.
This preserves the current behaviour from ldist until I can dive into the
implementation more. Hopefully that's ok for now.
gcc/ChangeLog:
* tree-loop-distribution.cc (copy_loop_before): Request no LCSSA.
* tree-vect-loop-manip.cc (adjust_phi_and_debug_stmts): Add additional
asserts.
(slpeel_tree_duplicate_loop_to_edge_cfg): Keep LCSSA during peeling.
(find_guard_arg): Look value up through explicit edge and original defs.
(vect_do_peeling): Use it.
(slpeel_update_phi_nodes_for_guard2): Take explicit exit edge.
(slpeel_update_phi_nodes_for_lcssa, slpeel_update_phi_nodes_for_loops):
Remove.
* tree-vect-loop.cc (vect_create_epilog_for_reduction): Initialize phi.
* tree-vectorizer.h (slpeel_tree_duplicate_loop_to_edge_cfg): Add
optional param to turn off LCSSA mode.
|
|
This second part updates niters analysis to be able to analyze any number of
exits. If we have multiple exits we determine the main exit by finding the
first counting IV.
The change allows the vectorizer to pass analysis for loops with multiple exits,
but we later gracefully reject them. It does however allow us to test if the exit
handling is using the right exit everywhere.
Additionally since we analyze all exits, we now return all conditions for them
and determine which condition belongs to the main exit.
The main condition is needed because the vectorizer needs to ignore the main IV
condition during vectorization as it will replace it during codegen.
To track versioned loops we extend the contract between ifcvt and the vectorizer
to store the exit number in aux so that we can match it up again during peeling.
gcc/ChangeLog:
* tree-if-conv.cc (tree_if_conversion): Record exits in aux.
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg): Use
it.
* tree-vect-loop.cc (vect_get_loop_niters): Determine main exit.
(vec_init_loop_exit_info): Extend analysis when multiple exits.
(vect_analyze_loop_form): Record conds and determine main cond.
(vect_create_loop_vinfo): Extend bookkeeping of conds.
(vect_analyze_loop): Release conds.
* tree-vectorizer.h (LOOP_VINFO_LOOP_CONDS,
LOOP_VINFO_LOOP_IV_COND): New.
(struct vect_loop_form_info): Add conds, alt_loop_conds;
(struct loop_vec_info): Add conds, loop_iv_cond.
|
|
variables
This is extracted out of the patch series to support early break vectorization
in order to simplify the review of that patch series.
The goal of this one is to separate out the refactoring from the new
functionality.
This first patch separates out the vectorizer's definition of an exit into its
own values inside loop_vinfo. During vectorization we can have three separate
copies of each loop: scalar, vectorized, epilogue. The scalar loop can also be
the versioned loop before peeling.
Because of this we track three different exits inside loop_vinfo, corresponding
to each of these loops. Additionally, each function that uses an exit, when it
is not obviously clear which exit is needed, now takes the exit explicitly as
an argument.
This is because oftentimes the callers switch the loops being passed around.
While the caller knows which loops it is, the callee does not.
For now the loop exits are simply initialized to the same value as before,
determined by single_exit (..).
No change in functionality is expected throughout this patch series.
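Usage then looks like the following sketch (the accessor names are from the
ChangeLog below; the surrounding code is illustrative):
/* Each loop copy has its own recorded exit.  */
edge main_exit   = LOOP_VINFO_IV_EXIT (loop_vinfo);
edge epilog_exit = LOOP_VINFO_EPILOGUE_IV_EXIT (loop_vinfo);
edge scalar_exit = LOOP_VINFO_SCALAR_IV_EXIT (loop_vinfo);
/* Initially all three correspond to single_exit (loop).  */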
gcc/ChangeLog:
* tree-loop-distribution.cc (copy_loop_before): Pass exit explicitly.
(loop_distribution::distribute_loop): Bail out if not single exit.
* tree-scalar-evolution.cc (get_loop_exit_condition): New.
* tree-scalar-evolution.h (get_loop_exit_condition): New.
* tree-vect-data-refs.cc (vect_enhance_data_refs_alignment): Pass exit
explicitly.
* tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors,
vect_set_loop_condition_partial_vectors_avx512,
vect_set_loop_condition_normal, vect_set_loop_condition): Explicitly
take exit.
(slpeel_tree_duplicate_loop_to_edge_cfg): Explicitly take exit and
return the corresponding new peeled exit.
(slpeel_can_duplicate_loop_p): Explicitly take exit.
(find_loop_location): Handle not knowing an explicit exit.
(vect_update_ivs_after_vectorizer, vect_gen_vector_loop_niters_mult_vf,
find_guard_arg, slpeel_update_phi_nodes_for_loops,
slpeel_update_phi_nodes_for_guard2): Use new exits.
(vect_do_peeling): Update bookkeeping to keep track of exits.
* tree-vect-loop.cc (vect_get_loop_niters): Explicitly take exit to
analyze.
(vec_init_loop_exit_info): New.
(_loop_vec_info::_loop_vec_info): Initialize vec_loop_iv,
vec_epilogue_loop_iv, scalar_loop_iv.
(vect_analyze_loop_form): Initialize exits.
(vect_create_loop_vinfo): Set main exit.
(vect_create_epilog_for_reduction, vectorizable_live_operation,
vect_transform_loop): Use it.
(scale_profile_for_vect_loop): Explicitly take exit to scale.
* tree-vectorizer.cc (set_uid_loop_bbs): Initialize loop exit.
* tree-vectorizer.h (LOOP_VINFO_IV_EXIT, LOOP_VINFO_EPILOGUE_IV_EXIT,
LOOP_VINFO_SCALAR_IV_EXIT): New.
(struct loop_vec_info): Add vec_loop_iv, vec_epilogue_loop_iv,
scalar_loop_iv.
(vect_set_loop_condition, slpeel_can_duplicate_loop_p,
slpeel_tree_duplicate_loop_to_edge_cfg): Take explicit exits.
(vec_init_loop_exit_info): New.
(struct vect_loop_form_info): Add loop_exit.
|
|
Verifier checks have recently been strengthened to check that
all counts and probabilities are initialized. The checks fired
during an autoprofiledbootstrap build, and this patch fixes that.
Tested on x86_64-pc-linux-gnu.
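The guard plausibly takes the following shape (an assumption about the fix,
not the committed hunk; count_in/count_out are illustrative names):
/* Only derive a scaling probability when the denominator count is
   nonzero, so the verifier sees initialized values.  */
if (count_in.nonzero_p ())
  scale_loop_profile (loop, count_out.probability_in (count_in), -1);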
gcc/ChangeLog:
* auto-profile.cc (afdo_calculate_branch_prob): Fix count comparisons.
* tree-vect-loop-manip.cc (vect_do_peeling): Guard against zero count
when scaling loop profile.
|
|
If a loop is if-converted and later versioned by the vectorizer, the vectorizer
will reuse the scalar loop produced by ifcvt. Curiously enough it does not seem
to do so for versions produced by loop distribution, even though for loop
distribution this matters (since both ldist versions survive to final code)
while after ifcvt it does not (since we remove the non-vectorized path).
This patch fixes the associated profile update. Here it is necessary to scale
both arms of the conditional according to the runtime checks inserted. We got
the loop body partly right, but not the preheader block and the block after
exit. The first is particularly bad since it changes loop iteration estimates.
So we now turn 4 original loops:
loop 1: iterations by profile: 473.497707 (reliable) entry count:84821 (precise, freq 0.9979)
loop 2: iterations by profile: 100.000000 (reliable) entry count:39848881 (precise, freq 468.8104)
loop 3: iterations by profile: 100.000000 (reliable) entry count:39848881 (precise, freq 468.8104)
loop 4: iterations by profile: 100.999596 (reliable) entry count:84167 (precise, freq 0.9902)
into the following loops:
iterations by profile: 5.312499 (unreliable, maybe flat) entry count:12742188 (guessed, freq 149.9081)
vectorized and split loop 1, peeled
iterations by profile: 0.009496 (unreliable, maybe flat) entry count:374798 (guessed, freq 4.4094)
split loop 1 (last iteration), peeled
iterations by profile: 100.000008 (unreliable) entry count:3945039 (guessed, freq 46.4122)
scalar version of loop 1
iterations by profile: 100.000007 (unreliable) entry count:7101070 (guessed, freq 83.5420)
redundant scalar version of loop 1 which we could eliminate if vectorizer understood ldist
iterations by profile: 100.000000 (unreliable) entry count:35505353 (guessed, freq 417.7100)
unvectorized loop 2
iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, freq 300.7512)
vectorized loop 2, not peeled (hits max-peel-insns)
iterations by profile: 100.000007 (unreliable) entry count:7101070 (guessed, freq 83.5420)
unvectorized loop 3
iterations by profile: 5.312500 (unreliable) entry count:25563855 (guessed, freq 300.7512)
vectorized loop 3, not peeled (hits max-peel-insns)
iterations by profile: 473.497707 (reliable) entry count:84821 (precise, freq 0.9979)
loop 1
iterations by profile: 100.999596 (reliable) entry count:84167 (precise, freq 0.9902)
loop 4
With this change we are at 0 profile errors on the hmmer benchmark:
Pass dump id |dynamic mismatch |overall |
|in count |size |time |
172t ch_vect | 0 | 996 | 385812023346 |
173t ifcvt | 71010686 +71010686| 1021 +2.5%| 468361969416 +21.4%|
174t vect | 210830784 +139820098| 1497 +46.6%| 216073467874 -53.9%|
175t dce | 210830784 | 1387 -7.3%| 205273170281 -5.0%|
176t pcom | 210830784 | 1387 | 201722634966 -1.7%|
177t cunroll | 0 -210830784| 1443 +4.0%| 180441501289 -10.5%|
182t ivopts | 0 | 1385 -4.0%| 136412345683 -24.4%|
183t lim | 0 | 1389 +0.3%| 135093950836 -1.0%|
192t reassoc | 0 | 1381 -0.6%| 134778347700 -0.2%|
193t slsr | 0 | 1380 -0.1%| 134738100330 -0.0%|
195t tracer | 0 | 1521 +10.2%| 134738179146 +0.0%|
196t fre | 2680654 +2680654| 1489 -2.1%| 134659672725 -0.1%|
198t dom | 5361308 +2680654| 1473 -1.1%| 134449553658 -0.2%|
201t vrp | 5361308 | 1474 +0.1%| 134489004050 +0.0%|
202t ccp | 5361308 | 1472 -0.1%| 134440752274 -0.0%|
204t dse | 5361308 | 1444 -1.9%| 133802300525 -0.5%|
206t forwprop| 5361308 | 1433 -0.8%| 133542828370 -0.2%|
207t sink | 5361308 | 1431 -0.1%| 133542658728 -0.0%|
211t store-me| 5361308 | 1430 -0.1%| 133542573728 -0.0%|
212t cddce | 5361308 | 1428 -0.1%| 133541776728 -0.0%|
258r expand | 5361308 |----------------|--------------------|
260r into_cfg| 5361308 | 9334 -0.8%| 885820707913 -0.6%|
261r jump | 5361308 | 9330 -0.0%| 885820367913 -0.0%|
265r fwprop1 | 5361308 | 9206 -1.3%| 876756504385 -1.0%|
267r rtl pre | 5361308 | 9210 +0.0%| 876914305953 +0.0%|
269r cprop | 5361308 | 9202 -0.1%| 876756165101 -0.0%|
271r cse_loca| 5361308 | 9198 -0.0%| 876727760821 -0.0%|
272r ce1 | 5361308 | 9126 -0.8%| 875726815885 -0.1%|
276r loop2_in| 5361308 | 9167 +0.4%| 873573110570 -0.2%|
282r cprop | 5361308 | 9095 -0.8%| 871937317262 -0.2%|
284r cse2 | 5361308 | 9091 -0.0%| 871936977978 -0.0%|
285r dse1 | 5361308 | 9067 -0.3%| 871437031602 -0.1%|
290r combine | 5361308 | 9071 +0.0%| 869206278202 -0.3%|
292r stv | 5361308 | 17157 +89.1%| 2111071925708+142.9%|
295r bbpart | 5361308 | 17161 +0.0%| 2111071925708 |
296r outof_cf| 5361308 | 17233 +0.4%| 2111655121000 +0.0%|
297r split1 | 5361308 | 17245 +0.1%| 2111656138852 +0.0%|
306r ira | 5361308 | 19189 +11.3%| 2136098398308 +1.2%|
307r reload | 5361308 | 12101 -36.9%| 981091222830 -54.1%|
309r postrelo| 5361308 | 12019 -0.7%| 978750345475 -0.2%|
310r gcse2 | 5361308 | 12027 +0.1%| 978329108320 -0.0%|
311r split2 | 5361308 | 12023 -0.0%| 978507631352 +0.0%|
312r ree | 5361308 | 12027 +0.0%| 978505414244 -0.0%|
313r cmpelim | 5361308 | 11979 -0.4%| 977531601988 -0.1%|
314r pro_and_| 5361308 | 12091 +0.9%| 977541801988 +0.0%|
315r dse2 | 5361308 | 12091 | 977541801988 |
316r csa | 5361308 | 12087 -0.0%| 977541461988 -0.0%|
317r jump2 | 5361308 | 12039 -0.4%| 977683176572 +0.0%|
318r compgoto| 5361308 | 12039 | 977683176572 |
320r peephole| 5361308 | 12047 +0.1%| 977362727612 -0.0%|
321r ce3 | 5361308 | 12047 | 977362727612 |
323r cprop_ha| 5361308 | 11907 -1.2%| 968751076676 -0.9%|
324r rtl_dce | 5361308 | 11903 -0.0%| 968593274820 -0.0%|
325r bbro | 5361308 | 11883 -0.2%| 967964046644 -0.1%|
Bootstrapped/regtested x86_64-linux, plan to commit it tomorrow if there are no
complaints.
gcc/ChangeLog:
PR tree-optimization/106293
* tree-vect-loop-manip.cc (vect_loop_versioning): Fix profile update.
* tree-vect-loop.cc (vect_transform_loop): Likewise.
gcc/testsuite/ChangeLog:
PR tree-optimization/106293
* gcc.dg/vect/vect-cond-11.c: Check profile consistency.
* gcc.dg/vect/vect-widen-mult-extern-1.c: Check profile consistency.
|
|
Epilogue peeling expects the scalar loop to have the same number of executions
as the vector loop, which is true at the beginning of vectorization. However if
the epilogues are vectorized, this is no longer the case. In this situation the
loop preheader is replaced by new guard code with correct profile, however the
loop body is left unscaled. This leads to a loop that exits more often than
it is entered.
This patch adds logic to scale the frequencies down and also to fix the profile
of the original preheader where necessary.
Bootstrapped/regtested x86_64-linux, committed.
gcc/ChangeLog:
* tree-vect-loop-manip.cc (vect_do_peeling): Fix profile update of peeled epilogues.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-bitfield-read-1.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-2.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-3.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-4.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-5.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-6.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-read-7.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-write-1.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-write-2.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-write-3.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-write-4.c: Check profile consistency.
* gcc.dg/vect/vect-bitfield-write-5.c: Check profile consistency.
* gcc.dg/vect/vect-epilogues-2.c: Check profile consistency.
* gcc.dg/vect/vect-epilogues.c: Check profile consistency.
* gcc.dg/vect/vect-mask-store-move-1.c: Check profile consistency.
|
|
This patch fixes the profile update after constant peeling in the prologue. We
now reach 0 profile update bugs on tramp3d vectorization and also on quite a
few testcases, so I am enabling the testsuite checks so we do not regress again.
gcc/ChangeLog:
* tree-vect-loop-manip.cc (vect_do_peeling): Fix profile update after
constant prologue peeling.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-1-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-1.c: Check profile consistency.
* gcc.dg/vect/vect-10-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-10.c: Check profile consistency.
* gcc.dg/vect/vect-100.c: Check profile consistency.
* gcc.dg/vect/vect-103.c: Check profile consistency.
* gcc.dg/vect/vect-104.c: Check profile consistency.
* gcc.dg/vect/vect-105-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-105.c: Check profile consistency.
* gcc.dg/vect/vect-106.c: Check profile consistency.
* gcc.dg/vect/vect-107.c: Check profile consistency.
* gcc.dg/vect/vect-108.c: Check profile consistency.
* gcc.dg/vect/vect-109.c: Check profile consistency.
* gcc.dg/vect/vect-11.c: Check profile consistency.
* gcc.dg/vect/vect-110.c: Check profile consistency.
* gcc.dg/vect/vect-112-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-112.c: Check profile consistency.
* gcc.dg/vect/vect-113.c: Check profile consistency.
* gcc.dg/vect/vect-114.c: Check profile consistency.
* gcc.dg/vect/vect-115.c: Check profile consistency.
* gcc.dg/vect/vect-116.c: Check profile consistency.
* gcc.dg/vect/vect-117.c: Check profile consistency.
* gcc.dg/vect/vect-118.c: Check profile consistency.
* gcc.dg/vect/vect-119.c: Check profile consistency.
* gcc.dg/vect/vect-11a.c: Check profile consistency.
* gcc.dg/vect/vect-12.c: Check profile consistency.
* gcc.dg/vect/vect-120.c: Check profile consistency.
* gcc.dg/vect/vect-121.c: Check profile consistency.
* gcc.dg/vect/vect-122.c: Check profile consistency.
* gcc.dg/vect/vect-123.c: Check profile consistency.
* gcc.dg/vect/vect-124.c: Check profile consistency.
* gcc.dg/vect/vect-126.c: Check profile consistency.
* gcc.dg/vect/vect-13.c: Check profile consistency.
* gcc.dg/vect/vect-14.c: Check profile consistency.
* gcc.dg/vect/vect-15-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-15.c: Check profile consistency.
* gcc.dg/vect/vect-17.c: Check profile consistency.
* gcc.dg/vect/vect-18.c: Check profile consistency.
* gcc.dg/vect/vect-19.c: Check profile consistency.
* gcc.dg/vect/vect-2-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-2.c: Check profile consistency.
* gcc.dg/vect/vect-20.c: Check profile consistency.
* gcc.dg/vect/vect-21.c: Check profile consistency.
* gcc.dg/vect/vect-22.c: Check profile consistency.
* gcc.dg/vect/vect-23.c: Check profile consistency.
* gcc.dg/vect/vect-24.c: Check profile consistency.
* gcc.dg/vect/vect-25.c: Check profile consistency.
* gcc.dg/vect/vect-26.c: Check profile consistency.
* gcc.dg/vect/vect-27.c: Check profile consistency.
* gcc.dg/vect/vect-28.c: Check profile consistency.
* gcc.dg/vect/vect-29.c: Check profile consistency.
* gcc.dg/vect/vect-3.c: Check profile consistency.
* gcc.dg/vect/vect-30.c: Check profile consistency.
* gcc.dg/vect/vect-31-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-31.c: Check profile consistency.
* gcc.dg/vect/vect-32-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-32-chars.c: Check profile consistency.
* gcc.dg/vect/vect-32.c: Check profile consistency.
* gcc.dg/vect/vect-33-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-33.c: Check profile consistency.
* gcc.dg/vect/vect-34-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-34.c: Check profile consistency.
* gcc.dg/vect/vect-35-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-35.c: Check profile consistency.
* gcc.dg/vect/vect-36-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-36.c: Check profile consistency.
* gcc.dg/vect/vect-38.c: Check profile consistency.
* gcc.dg/vect/vect-4.c: Check profile consistency.
* gcc.dg/vect/vect-40.c: Check profile consistency.
* gcc.dg/vect/vect-42.c: Check profile consistency.
* gcc.dg/vect/vect-44.c: Check profile consistency.
* gcc.dg/vect/vect-46.c: Check profile consistency.
* gcc.dg/vect/vect-48.c: Check profile consistency.
* gcc.dg/vect/vect-5.c: Check profile consistency.
* gcc.dg/vect/vect-50.c: Check profile consistency.
* gcc.dg/vect/vect-52.c: Check profile consistency.
* gcc.dg/vect/vect-54.c: Check profile consistency.
* gcc.dg/vect/vect-56.c: Check profile consistency.
* gcc.dg/vect/vect-58.c: Check profile consistency.
* gcc.dg/vect/vect-6-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-6.c: Check profile consistency.
* gcc.dg/vect/vect-60.c: Check profile consistency.
* gcc.dg/vect/vect-62.c: Check profile consistency.
* gcc.dg/vect/vect-63.c: Check profile consistency.
* gcc.dg/vect/vect-64.c: Check profile consistency.
* gcc.dg/vect/vect-65.c: Check profile consistency.
* gcc.dg/vect/vect-66.c: Check profile consistency.
* gcc.dg/vect/vect-67.c: Check profile consistency.
* gcc.dg/vect/vect-68.c: Check profile consistency.
* gcc.dg/vect/vect-7.c: Check profile consistency.
* gcc.dg/vect/vect-70.c: Check profile consistency.
* gcc.dg/vect/vect-71.c: Check profile consistency.
* gcc.dg/vect/vect-72.c: Check profile consistency.
* gcc.dg/vect/vect-73-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-73.c: Check profile consistency.
* gcc.dg/vect/vect-74-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-74.c: Check profile consistency.
* gcc.dg/vect/vect-75-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-75.c: Check profile consistency.
* gcc.dg/vect/vect-76-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-76.c: Check profile consistency.
* gcc.dg/vect/vect-77-alignchecks.c: Check profile consistency.
* gcc.dg/vect/vect-77-global.c: Check profile consistency.
* gcc.dg/vect/vect-77.c: Check profile consistency.
* gcc.dg/vect/vect-78-alignchecks.c: Check profile consistency.
* gcc.dg/vect/vect-78-global.c: Check profile consistency.
* gcc.dg/vect/vect-78.c: Check profile consistency.
* gcc.dg/vect/vect-8.c: Check profile consistency.
* gcc.dg/vect/vect-80-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-80.c: Check profile consistency.
* gcc.dg/vect/vect-82.c: Check profile consistency.
* gcc.dg/vect/vect-82_64.c: Check profile consistency.
* gcc.dg/vect/vect-83.c: Check profile consistency.
* gcc.dg/vect/vect-83_64.c: Check profile consistency.
* gcc.dg/vect/vect-85-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-85.c: Check profile consistency.
* gcc.dg/vect/vect-86.c: Check profile consistency.
* gcc.dg/vect/vect-87.c: Check profile consistency.
* gcc.dg/vect/vect-88.c: Check profile consistency.
* gcc.dg/vect/vect-89-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-89.c: Check profile consistency.
* gcc.dg/vect/vect-9.c: Check profile consistency.
* gcc.dg/vect/vect-91.c: Check profile consistency.
* gcc.dg/vect/vect-92.c: Check profile consistency.
* gcc.dg/vect/vect-93.c: Check profile consistency.
* gcc.dg/vect/vect-95.c: Check profile consistency.
* gcc.dg/vect/vect-96.c: Check profile consistency.
* gcc.dg/vect/vect-97-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-97.c: Check profile consistency.
* gcc.dg/vect/vect-98-big-array.c: Check profile consistency.
* gcc.dg/vect/vect-98.c: Check profile consistency.
* gcc.dg/vect/vect-99.c: Check profile consistency.
|
|
While loop versioning, the vectorizer produces a versioned loop
guarded by two conditionals of the form
if (cond1)
goto scalar_loop
else
goto next_bb
next_bb:
if (cond2)
goto scalar_loop
else
goto vector_loop
It wants the combined test to have probability prob (which is set to likely)
and uses profile_probability::split to determine the probabilities
of cond1 and cond2.
However splitting turns:
if (cond)
goto lab; // ORIG probability
into
if (cond1)
goto lab; // FIRST = ORIG * CPROB probability
if (cond2)
goto lab; // SECOND probability
which is an OR instead of an AND. As a result we get a pretty low probability
of entering the vectorized loop.
This patch fixes it by introducing sqrt on profile_probability (which is the
correct way to split this) and also adding pow, which is needed elsewhere.
While loop versioning, I now produce code as if there was only one combined
conditional and then update the probability of the conditional produced
(containing cond1). Later the edge is split and a new conditional is added. At
that time it is necessary to update the probability of the BB containing the
second conditional so everything matches.
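Concretely the math: reaching the vector loop requires both guards to fall
through, so with per-guard fall-through probability p the vector path gets
p * p; to hit a target combined probability each guard must use its square
root. A usage sketch (member names are from the ChangeLog below; the exact
signatures are assumptions):
/* Split a desired combined pass-through probability across two
   chained guards.  */
profile_probability combined = prob.invert ();     /* P(vector path) */
profile_probability per_guard = combined.sqrt ();  /* each guard */
/* per_guard.pow (2) recovers combined (up to rounding).  */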
gcc/ChangeLog:
* profile-count.cc (profile_probability::sqrt): New member function.
(profile_probability::pow): Likewise.
* profile-count.h: (profile_probability::sqrt): Declare
(profile_probability::pow): Likewise.
* tree-vect-loop-manip.cc (vect_loop_versioning): Fix profile update.
|
|
Fix two bugs in scale_loop_profile which crept in during my cleanups
and curiously enough did not show on the testcases we have so far.
The patch also adds the missing call to cap the iteration count of
vectorized loop epilogues.
The vectorizer profile needs more work, but I am trying to chase out
obvious bugs first so the profile quality statistics become meaningful
and we can try to improve on them.
Now we get:
Pass dump id and name |static mismatch|dynamic mismatch
                      |in count       |in count
107t cunrolli | 3 +3| 17251 +17251
116t vrp | 5 +2| 30908 +16532
118t dce | 3 -2| 17251 -13657
127t ch | 13 +10| 17251
131t dom | 39 +26| 17251
133t isolate-paths | 47 +8| 17251
134t reassoc | 49 +2| 17251
136t forwprop | 53 +4| 202501 +185250
159t cddce | 61 +8| 216211 +13710
161t ldist | 62 +1| 216211
172t ifcvt | 66 +4| 373711 +157500
173t vect | 143 +77| 9801947 +9428236
176t cunroll | 149 +6| 12006408 +2204461
183t loopdone | 146 -3| 11944469 -61939
195t fre | 142 -4| 11944469
197t dom | 141 -1| 13038435 +1093966
199t threadfull | 143 +2| 13246410 +207975
200t vrp | 145 +2| 13444579 +198169
204t dce | 143 -2| 13371315 -73264
206t sink | 141 -2| 13371315
211t cddce | 147 +6| 13372755 +1440
255t optimized | 145 -2| 13372755
256r expand | 141 -4| 13371197 -1558
258r into_cfglayout | 139 -2| 13371197
275r loop2_unroll | 143 +4| 16792056 +3420859
291r ce2 | 141 -2| 16811462
312r pro_and_epilogue | 161 +20| 16873400 +61938
315r jump2 | 167 +6| 20910158 +4036758
323r bbro | 160 -7| 16559844 -4350314
Vect still introduces 77 profile mismatches (the same as without this
patch); however, subsequent cunroll works much better, with 6 new
mismatches compared to 78.  Overall the patch reduces 229 mismatches
to 160.
Also the overall runtime estimate is now reduced by 6.9%.  Previously
the overall runtime estimate grew by 11%, which was a result of the
fact that the epilogue profile was pretty much the same as the profile
of the original loop.
Bootstrapped/regtested x86_64-linux, committed.
gcc/ChangeLog:
* cfgloopmanip.cc (scale_loop_profile): Fix computation of count_in
and scaling of blocks after the exit.
* tree-vect-loop-manip.cc (vect_do_peeling): Scale loop profile of the epilogue if bound
is known.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/vect-profile-upate.c: New test.
|
|
The original scale_loop_profile was implemented to only handle very
simple loops produced by the vectorizer at that time (basically loops
with only one exit and no subloops).  It also was not updated to the
new profile-count API very carefully.
The function does two things:
1) scales down the loop profile by a given probability.
   This is useful, for example, to scale down the profile after
   peeling, when the loop body is executed less often than before.
2) updates the profile to cap the iteration count by the
   ITERATION_BOUND parameter.
I changed ITERATION_BOUND to be the actual bound on the number of
iterations as used elsewhere (i.e. the number of executions of the
latch edge) rather than the number of iterations + 1 as it was before.
To do 2) one needs to do the following:
a) scale the loop's own profile so the frequency of the header is at
   most the sum of in-edge counts * (iteration_bound + 1)
b) update loop exit probabilities so their count is the same as
   before scaling
c) reduce the frequencies of basic blocks after the loop exit
The old code did b) by setting the probability to 1 / iteration_bound,
which is correct only if the basic block containing the exit executes
precisely once per iteration (i.e. it is not inside another
conditional or an inner loop).  This is now fixed by using
set_edge_probability_and_rescale_others.
Also, c) was implemented only for the special case when the exit was
just before the latch basic block.  I now use dominance info to get
some of the additional cases right.
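A minimal numeric sketch of the capping in a), with plain integers
standing in for profile counts (values hypothetical):
  #include <cassert>

  int main ()
  {
    int count_in = 100;  // sum of in-edge counts into the header
    int bound = 3;       // iteration bound = latch-edge executions
    // The header runs at most once per entry plus once per latch
    // execution, so its count is capped at count_in * (bound + 1).
    int max_header_count = count_in * (bound + 1);
    assert (max_header_count == 400);
    return 0;
  }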
I still did not try to do anything for multiple-exit loops, though
the implementation could be generalized.
Bootstrapped/regtested x86_64-linux.  Plan to commit it tonight if
there are no complaints.
gcc/ChangeLog:
* cfgloopmanip.cc (scale_loop_profile): Rewrite exit edge
probability update to be safe on loops with subloops.
Make bound parameter to be iteration bound.
* tree-ssa-loop-ivcanon.cc (try_peel_loop): Update call
of scale_loop_profile.
* tree-vect-loop-manip.cc (vect_do_peeling): Likewise.
|
|
The following consolidates an assert that now hits for ppc64le
with an earlier check we already do, simplifying
vect_determine_partial_vectors_and_peeling and getting rid of
its now redundant argument.
PR tree-optimization/110563
* tree-vectorizer.h (vect_determine_partial_vectors_and_peeling):
Remove second argument.
* tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling):
Remove for_epilogue_p argument. Merge assert ...
(vect_analyze_loop_2): ... with check done before determining
partial vectors by moving it after.
* tree-vect-loop-manip.cc (vect_do_peeling): Adjust.
|
|
The following moves the late decision to elide vectorized epilogues
to the analysis phase and also avoids altering the epilogue's niter.
The costing part from vect_determine_partial_vectors_and_peeling is
moved to vect_analyze_loop_costing where we use the main loop
analysis to constrain the epilogue scalar iterations.
I have not tried to integrate this with vect_known_niters_smaller_than_vf.
It seems the for_epilogue_p parameter in
vect_determine_partial_vectors_and_peeling is largely useless and
we could compute that in the function itself.
PR tree-optimization/110310
* tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling):
Move costing part ...
(vect_analyze_loop_costing): ... here. Integrate better
estimate for epilogues from ...
(vect_analyze_loop_2): Call vect_determine_partial_vectors_and_peeling
with actual epilogue status.
* tree-vect-loop-manip.cc (vect_do_peeling): ... here and
avoid cancelling epilogue vectorization.
(vect_update_epilogue_niters): Remove. No longer update
epilogue LOOP_VINFO_NITERS.
* gcc.target/i386/pr110310.c: New testcase.
* gcc.dg/vect/slp-perm-12.c: Disable epilogue vectorization.
|
|
gcc/
* tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors):
Handle null niters_skip.
|
|
This implements fully masked vectorization or a masked epilog for
AVX512 style masks, which set themselves apart by representing
each lane with a single bit and by using integer modes for the mask
(both much like GCN).
AVX512 is also special in that it doesn't have any instruction
to compute the mask from a scalar IV like SVE has with while_ult.
Instead the masks are produced by vector compares and the loop
control retains the scalar IV (mainly to avoid dependences on
mask generation; a suitable mask test instruction is available).
Like RVV, code generation prefers a decrementing IV, though IVOPTs
messes things up in some cases, removing that IV in favor of an
incrementing one used for address generation.
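A hedged conceptual sketch of the mask computation (plain C++, not
the emitted code): the bit for each lane is set by comparing the
lane's index against the trip count.
  #include <cassert>

  // 16-lane AVX512-style integer-mode mask for the iteration starting
  // at scalar IV value I, with N total iterations.
  unsigned short
  lane_mask (long i, long n)
  {
    unsigned short mask = 0;
    for (int lane = 0; lane < 16; lane++)
      if (i + lane < n)                      // the vector compare
        mask |= (unsigned short) 1 << lane;  // one bit per lane
    return mask;
  }

  int main ()
  {
    assert (lane_mask (0, 20) == 0xffff);  // full vector iteration
    assert (lane_mask (16, 20) == 0x000f); // 4 lanes left active
    return 0;
  }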
One of the motivating testcases is from PR108410, which in turn
is extracted from x264, where large-size vectorization shows
issues with small trip-count loops.  Execution time there improves
compared to classic AVX512 with AVX2 epilogues for the cases
of fewer than 32 iterations.
size scalar 128 256 512 512e 512f
1 9.42 11.32 9.35 11.17 15.13 16.89
2 5.72 6.53 6.66 6.66 7.62 8.56
3 4.49 5.10 5.10 5.74 5.08 5.73
4 4.10 4.33 4.29 5.21 3.79 4.25
6 3.78 3.85 3.86 4.76 2.54 2.85
8 3.64 1.89 3.76 4.50 1.92 2.16
12 3.56 2.21 3.75 4.26 1.26 1.42
16 3.36 0.83 1.06 4.16 0.95 1.07
20 3.39 1.42 1.33 4.07 0.75 0.85
24 3.23 0.66 1.72 4.22 0.62 0.70
28 3.18 1.09 2.04 4.20 0.54 0.61
32 3.16 0.47 0.41 0.41 0.47 0.53
34 3.16 0.67 0.61 0.56 0.44 0.50
38 3.19 0.95 0.95 0.82 0.40 0.45
42 3.09 0.58 1.21 1.13 0.36 0.40
'size' specifies the number of actual iterations; 512e is for
a masked epilog and 512f for the fully masked loop.  From
4 scalar iterations on, the AVX512 masked epilog code is clearly
the winner; the fully masked variant is clearly worse and
its size benefit is also tiny.
This patch does not enable using fully masked loops or
masked epilogues by default. More work on cost modeling
and vectorization kind selection on x86_64 is necessary
for this.
Implementation-wise this introduces LOOP_VINFO_PARTIAL_VECTORS_STYLE,
which could be exploited further to unify some of the flags we have
right now, but there didn't seem to be many easy things to merge,
so I'm leaving this for followups.
Mask requirements as registered by vect_record_loop_mask are kept in
their original form and recorded in a hash_set now instead of being
processed into a vector of rgroup_controls.  That is now left to the
final analysis phase, which tries forming the rgroup_controls vector
using while_ult and, if that fails, tries AVX512 style, which needs
a different organization and instead fills a hash_map with the
relevant info.  vect_get_loop_mask now has two implementations, one
for each of the two mask styles we then have.
I have decided against interweaving vect_set_loop_condition_partial_vectors
with conditions to do AVX512 style masking and instead opted to
"duplicate" it as vect_set_loop_condition_partial_vectors_avx512.
Likewise for vect_verify_full_masking vs. vect_verify_full_masking_avx512.
The vect_prepare_for_masked_peels hunk might run into issues with
SVE; I didn't check yet, but using LOOP_VINFO_RGROUP_COMPARE_TYPE
looked odd.
Bootstrapped and tested on x86_64-unknown-linux-gnu. I've run
the testsuite with --param vect-partial-vector-usage=2 with and
without -fno-vect-cost-model and filed two bugs, one ICE (PR110221)
and one latent wrong-code (PR110237).
* tree-vectorizer.h (enum vect_partial_vector_style): New.
(_loop_vec_info::partial_vector_style): Likewise.
(LOOP_VINFO_PARTIAL_VECTORS_STYLE): Likewise.
(rgroup_controls::compare_type): Add.
(vec_loop_masks): Change from a typedef to auto_vec<>
to a structure.
* tree-vect-loop-manip.cc (vect_set_loop_condition_partial_vectors):
Adjust. Convert niters_skip to compare_type.
(vect_set_loop_condition_partial_vectors_avx512): New function
implementing the AVX512 partial vector codegen.
(vect_set_loop_condition): Dispatch to the correct
vect_set_loop_condition_partial_vectors_* function based on
LOOP_VINFO_PARTIAL_VECTORS_STYLE.
(vect_prepare_for_masked_peels): Compute LOOP_VINFO_MASK_SKIP_NITERS
in the original niter type.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize
partial_vector_style.
(can_produce_all_loop_masks_p): Adjust.
(vect_verify_full_masking): Produce the rgroup_controls vector
here. Set LOOP_VINFO_PARTIAL_VECTORS_STYLE on success.
(vect_verify_full_masking_avx512): New function implementing
verification of AVX512 style masking.
(vect_verify_loop_lens): Set LOOP_VINFO_PARTIAL_VECTORS_STYLE.
(vect_analyze_loop_2): Also try AVX512 style masking.
Adjust condition.
(vect_estimate_min_profitable_iters): Implement AVX512 style
mask producing cost.
(vect_record_loop_mask): Do not build the rgroup_controls
vector here but record masks in a hash-set.
(vect_get_loop_mask): Implement AVX512 style mask query,
complementing the existing while_ult style.
|
|
This patch addresses comments from Richard and Richi and rebases to trunk.
This patch adds SELECT_VL middle-end support to allow targets to
apply target-dependent optimizations to the length calculation.
This patch is inspired by the RVV ISA and LLVM:
https://reviews.llvm.org/D99750
SELECT_VL has the same behavior as LLVM's "get_vector_length" with
the following properties:
1. Applies only to single-rgroup loops.
2. Non-SLP only.
3. Adjusts the loop control IV.
4. Adjusts the data reference IVs.
5. Allows processing a non-vf number of elements in non-final iterations.
Code:
# void vvaddint32(size_t n, const int*x, const int*y, int*z)
# { for (size_t i=0; i<n; i++) { z[i]=x[i]+y[i]; } }
Take RVV codegen for example:
Before this patch:
vvaddint32:
ble a0,zero,.L6
csrr a4,vlenb
srli a6,a4,2
.L4:
mv a5,a0
bleu a0,a6,.L3
mv a5,a6
.L3:
vsetvli zero,a5,e32,m1,ta,ma
vle32.v v2,0(a1)
vle32.v v1,0(a2)
vsetvli a7,zero,e32,m1,ta,ma
sub a0,a0,a5
vadd.vv v1,v1,v2
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a3)
add a2,a2,a4
add a3,a3,a4
add a1,a1,a4
bne a0,zero,.L4
.L6:
ret
After this patch:
vvaddint32:
vsetvli t0, a0, e32, ta, ma # Set vector length based on 32-bit vectors
vle32.v v0, (a1) # Get first vector
sub a0, a0, t0 # Decrement number done
slli t0, t0, 2 # Multiply number done by 4 bytes
add a1, a1, t0 # Bump pointer
vle32.v v1, (a2) # Get second vector
add a2, a2, t0 # Bump pointer
vadd.vv v2, v0, v1 # Sum vectors
vse32.v v2, (a3) # Store result
add a3, a3, t0 # Bump pointer
bnez a0, vvaddint32 # Loop back
ret # Finished
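A hedged C++ sketch of the loop flow SELECT_VL enables (select_vl
below is a hypothetical stand-in; a target may return any value in
[1, MIN (vf, remain)], even in non-final iterations):
  #include <cassert>
  #include <cstddef>

  // Hypothetical stand-in for the target's length choice.
  static size_t
  select_vl (size_t remain, size_t vf)
  {
    return remain < vf ? remain : vf;  // simplest legal choice
  }

  int main ()
  {
    size_t n = 10, vf = 4, done = 0;
    for (size_t remain = n; remain != 0; )
      {
        size_t len = select_vl (remain, vf); // adjusts loop control IV
        done += len;                         // data-ref IVs step by len
        remain -= len;
      }
    assert (done == n);
    return 0;
  }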
Co-authored-by: Richard Sandiford <richard.sandiford@arm.com>
Co-authored-by: Richard Biener <rguenther@suse.de>
gcc/ChangeLog:
* doc/md.texi: Add SELECT_VL support.
* internal-fn.def (SELECT_VL): Ditto.
* optabs.def (OPTAB_D): Ditto.
* tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Ditto.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Ditto.
* tree-vect-stmts.cc (get_select_vl_data_ref_ptr): Ditto.
(vectorizable_store): Ditto.
(vectorizable_load): Ditto.
* tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): Ditto.
|
|
Following Richi's suggestion, I changed the current decrement IV flow from:
do {
remain -= MIN (vf, remain);
} while (remain != 0);
into:
do {
old_remain = remain;
len = MIN (vf, remain);
remain -= vf;
} while (old_remain >= vf);
to enhance SCEV.
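A minimal compilable check (hedged sketch) that both flows produce
the same sequence of lens, e.g. 4 4 2 for vf = 4 and remain = 10,
while the new flow steps remain by the loop-invariant -vf, giving
SCEV a simple affine IV to work with:
  #include <cassert>

  int main ()
  {
    const int vf = 4;
    int old_lens[3], new_lens[3], i = 0;
    // old flow: the step MIN (vf, remain) varies in the last iteration
    for (int remain = 10; remain != 0; )
      {
        int len = remain < vf ? remain : vf;
        old_lens[i++] = len;
        remain -= len;
      }
    // new flow: remain steps by -vf, an affine IV
    i = 0;
    int remain = 10, old_remain;
    do
      {
        old_remain = remain;
        int len = remain < vf ? remain : vf;
        new_lens[i++] = len;
        remain -= vf;
      }
    while (old_remain >= vf);
    for (int j = 0; j < 3; j++)
      assert (old_lens[j] == new_lens[j]);  // both yield 4 4 2
    return 0;
  }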
Includes fixes from Kewen.
This patch will need to wait for Kewen's test feedback.
Testing on X86 is ongoing.
Co-Authored-By: Kewen Lin <linkw@linux.ibm.com>
PR tree-optimization/109971
gcc/ChangeLog:
* tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Change decrement IV flow.
(vect_set_loop_condition_partial_vectors): Ditto.
|
|
This patch supports the decrement IV by following the flow designed
by Richard:
(1) In vect_set_loop_condition_partial_vectors, for the first
iteration, call vect_set_loop_controls_directly.
call vect_set_loop_controls_directly.
(2) vect_set_loop_controls_directly calculates "step" as in your patch.
If rgc has 1 control, this step is the SSA name created for that
control. Otherwise the step is a fresh SSA name, as in your patch.
(3) vect_set_loop_controls_directly stores this step somewhere for later
use, probably in LOOP_VINFO. Let's use "S" to refer to this stored
step.
(4) After the vect_set_loop_controls_directly call above, and outside
the "if" statement that now contains vect_set_loop_controls_directly,
check whether rgc->controls.length () > 1. If so, use
vect_adjust_loop_lens_control to set the controls based on S.
Then the only caller of vect_adjust_loop_lens_control is
vect_set_loop_condition_partial_vectors. And the starting
step for vect_adjust_loop_lens_control is always S.
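A hedged sketch of how several controls could be carved out of the
shared step S (the clamping below is an assumption modeled on the
description above, not a quote of vect_adjust_loop_lens_control):
  #include <cassert>

  int main ()
  {
    const int vf = 4, nV = 2;  // an rgroup with two controls
    int S = 6;                 // shared step stored in step (3)
    int control[nV];
    for (int j = 0; j < nV; j++)
      {
        int c = S - j * vf;    // carve control J out of S
        control[j] = c < 0 ? 0 : (c > vf ? vf : c);
      }
    assert (control[0] == 4 && control[1] == 2);  // 4 + 2 == S
    return 0;
  }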
This patch has been well tested for single-rgroup and multiple-rgroup
(SLP) cases and passed all testcases in the RISC-V port.
Signed-off-by: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
Co-Authored-By: Richard Sandiford <richard.sandiford@arm.com>
gcc/ChangeLog:
* tree-vect-loop-manip.cc (vect_adjust_loop_lens_control): New
function.
(vect_set_loop_controls_directly): Add decrement IV support.
(vect_set_loop_condition_partial_vectors): Ditto.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): New
variable.
* tree-vectorizer.h (LOOP_VINFO_USING_DECREMENTING_IV_P): New
macro.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c: New test.
* gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c: New test.
* gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c: New test.
* gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c: New test.
|
|
This patch is going to be committed after bootstrap and regression
on X86 have passed.
Thanks, Richards.
gcc/ChangeLog:
* cfgloopmanip.cc (create_empty_loop_on_edge): Add PLUS_EXPR.
* gimple-loop-interchange.cc
(tree_loop_interchange::map_inductions_to_loop): Ditto.
* tree-ssa-loop-ivcanon.cc (create_canonical_iv): Ditto.
* tree-ssa-loop-ivopts.cc (create_new_iv): Ditto.
* tree-ssa-loop-manip.cc (create_iv): Ditto.
(tree_transform_and_unroll_loop): Ditto.
(canonicalize_loop_ivs): Ditto.
* tree-ssa-loop-manip.h (create_iv): Ditto.
* tree-vect-data-refs.cc (vect_create_data_ref_ptr): Ditto.
* tree-vect-loop-manip.cc (vect_set_loop_controls_directly):
Ditto.
(vect_set_loop_condition_normal): Ditto.
* tree-vect-loop.cc (vect_create_epilog_for_reduction): Ditto.
* tree-vect-stmts.cc (vectorizable_store): Ditto.
(vectorizable_load): Ditto.
Signed-off-by: Juzhe Zhong <juzhe.zhong@rivai.ai>
|
|
There are quite a few cases that want to access the control stmt
ending a basic block.  Since there cannot be debug stmts after
such a stmt, there is no point in using last_stmt, which skips debug
stmts and can be a compile-time hog for larger testcases.
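A hedged before/after sketch of the pattern (GCC-internal names; the
exact replacement is an assumption based on the rationale above):
  /* Before: last_stmt skips over trailing debug stmts.  */
  gcond *cond = safe_dyn_cast <gcond *> (last_stmt (bb));
  /* After: the control stmt ends the block, so take it directly.  */
  gcond *cond = safe_dyn_cast <gcond *> (*gsi_last_bb (bb));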
* gimple-ssa-split-paths.cc (is_feasible_trace): Avoid
last_stmt.
* graphite-scop-detection.cc (single_pred_cond_non_loop_exit):
Likewise.
* ipa-fnsummary.cc (set_cond_stmt_execution_predicate): Likewise.
(set_switch_stmt_execution_predicate): Likewise.
(phi_result_unknown_predicate): Likewise.
* ipa-prop.cc (compute_complex_ancestor_jump_func): Likewise.
(ipa_analyze_indirect_call_uses): Likewise.
* predict.cc (predict_iv_comparison): Likewise.
(predict_extra_loop_exits): Likewise.
(predict_loops): Likewise.
(tree_predict_by_opcode): Likewise.
* gimple-predicate-analysis.cc (predicate::init_from_control_deps):
Likewise.
* gimple-pretty-print.cc (dump_implicit_edges): Likewise.
* tree-ssa-phiopt.cc (tree_ssa_phiopt_worker): Likewise.
(replace_phi_edge_with_variable): Likewise.
(two_value_replacement): Likewise.
(value_replacement): Likewise.
(minmax_replacement): Likewise.
(spaceship_replacement): Likewise.
(cond_removal_in_builtin_zero_pattern): Likewise.
* tree-ssa-reassoc.cc (maybe_optimize_range_tests): Likewise.
* tree-ssa-sccvn.cc (vn_phi_eq): Likewise.
(vn_phi_lookup): Likewise.
(vn_phi_insert): Likewise.
* tree-ssa-structalias.cc (compute_points_to_sets): Likewise.
* tree-ssa-threadbackward.cc (back_threader::maybe_thread_block):
Likewise.
(back_threader_profitability::possibly_profitable_path_p):
Likewise.
* tree-ssa-threadedge.cc (jump_threader::thread_outgoing_edges):
Likewise.
* tree-switch-conversion.cc (pass_convert_switch::execute):
Likewise.
(pass_lower_switch<O0>::execute): Likewise.
* tree-tailcall.cc (tree_optimize_tail_calls_1): Likewise.
* tree-vect-loop-manip.cc (vect_loop_versioning): Likewise.
* tree-vect-slp.cc (vect_slp_function): Likewise.
* tree-vect-stmts.cc (cfun_returns): Likewise.
* tree-vectorizer.cc (vect_loop_vectorized_call): Likewise.
(vect_loop_dist_alias_call): Likewise.
|
|
* tree-vect-loop-manip.cc (vect_do_peeling): Use
result of constant_lower_bound instead of vf for the lower
bound of the epilog loop trip count.
|