|
The following reworks vectorizable_live_operation to pass the
live stmt to vect_create_epilog_for_reduction also for early breaks
and a peeled main exit. This is to be able to figure the scalar
definition to replace. This reverts the PR114192 fix as it is
subsumed by this cleanup.
PR tree-optimization/114239
* tree-vect-loop.cc (vect_get_vect_def): Remove.
(vect_create_epilog_for_reduction): The passed in stmt_info
should now be the live stmt that produces the scalar reduction
result. Revert PR114192 fix. Base reduction info off
info_for_reduction. Remove special handling of
early-break/peeled, restore original vector def gathering.
Make sure to pick the correct exit PHIs.
(vectorizable_live_operation): Pass in the proper stmt_info
for early break exits.
* gcc.dg/vect/vect-early-break_122-pr114239.c: New testcase.
|
|
The following fixes a missing replacement of the reduction value
used in the epilog, causing the scalar reduction to be kept live
across the early break exit path.
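A minimal sketch of the affected shape (illustrative only, not the actual
testcase): a reduction whose scalar value is also live on the early exit:
unsigned a[1024];
unsigned f (unsigned x)
{
  unsigned r = 0;
  for (int i = 0; i < 1024; i++)
    {
      r += a[i];        /* scalar reduction */
      if (a[i] == x)
        break;          /* early break: r is live on this exit path too */
    }
  return r;
}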
PR tree-optimization/114192
* tree-vect-loop.cc (vect_create_epilog_for_reduction): Use the
appropriate def for the live out stmt in case of an alternate
exit.
|
|
When we classify a conditional reduction chain as CONST_COND_REDUCTION
we fail to verify all involved conditionals have the same constant.
That's quite an unlikely situation, so the following simply disables
such classification when there's more than one reduction statement.
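As an illustration (a hedged sketch, not the actual pr114027.c), a chain of
two conditional assignments with different constants must not be classified
as CONST_COND_REDUCTION:
int f (int *a, int *b, int n)
{
  int r = -1;
  for (int i = 0; i < n; i++)
    {
      if (a[i])
        r = 3;   /* one constant ... */
      if (b[i])
        r = 5;   /* ... and a different one in the same reduction chain */
    }
  return r;
}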
PR tree-optimization/114027
* tree-vect-loop.cc (vectorizable_reduction): Use optimized
condition reduction classification only for single-element
chains.
* gcc.dg/vect/pr114027.c: New testcase.
|
|
The following fixes the failure to look at pattern stmts when we
need to dissolve SLP-only groups.
PR tree-optimization/111156
* tree-vect-loop.cc (vect_dissolve_slp_only_groups): Look
at the pattern stmt if any.
|
|
The following adjusts move_early_exit_stmts to track the last seen
VUSE instead of getting it from the last store which could be a PHI
where gimple_vuse doesn't work.
PR tree-optimization/113902
* tree-vect-loop.cc (move_early_exit_stmts): Track
last_seen_vuse for VUSE updating.
* gcc.dg/vect/pr113902.c: New testcase.
|
|
When doing early break vectorization we should treat the final iteration as
possibly being partial. This is so that when we calculate the vector loop upper
bounds we take into account that the final iteration could have done some work.
The attached testcase shows that if we don't, cunroll may unroll the loop and
if the upper bound is wrong we lose a vector iteration.
This is similar to how we adjust the scalar loop bounds for the PEELED case.
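A hedged sketch of the situation (not the actual testcase): with a trip count
that is not a multiple of the VF, the final vector iteration does partial work
which the loop upper bound seen by cunroll must account for:
unsigned a[64], b[64];
int f (unsigned x)
{
  for (int i = 0; i < 50; i++)   /* 50 iterations; with VF = 4 the last one is partial */
    {
      if (a[i] > x)
        return i;                /* early break */
      b[i] = x + i;
    }
  return -1;
}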
gcc/ChangeLog:
PR tree-optimization/113734
* tree-vect-loop.cc (vect_transform_loop): Treat the final iteration of
an early break loop as partial.
gcc/testsuite/ChangeLog:
PR tree-optimization/113734
* gcc.dg/vect/vect-early-break_117-pr113734.c: New test.
|
|
This makes sure to elide degenerate virtual PHIs when moving stores
across early exits.
PR tree-optimization/113863
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences):
Record crossed virtual PHIs.
* tree-vect-loop.cc (move_early_exit_stmts): Elide crossed
virtual PHIs.
* gcc.dg/vect/pr113863.c: New testcase.
|
|
There's a bug in vectorizable_live_operation where restart_loop is defined
outside the loop.
This variable is supposed to indicate whether we are doing a first or last
index reduction. The problem is that by defining it outside the loop it becomes
dependent on the order we visit the USE/DEFs.
In the given example, the loop isn't PEELED, but we visit the early exit uses
first. This then sets the boolean to true and it can't get to false again.
So when we visit the main exit we still treat it as an early exit for that
SSA name.
This cleans it up and renames the variables to something that's hopefully
clearer about their intention.
gcc/ChangeLog:
PR tree-optimization/113808
* tree-vect-loop.cc (vectorizable_live_operation): Don't cache the
value cross iterations.
gcc/testsuite/ChangeLog:
PR tree-optimization/113808
* gfortran.dg/vect/vect-early-break_1-PR113808.f90: New test.
|
|
The report shows that if the FE leaves a label as the first thing in the dest
BB then we ICE because we move the stores before the label.
This is easy to fix if we know that there's still only one way into the BB.
We would have already rejected the loop if there were multiple paths into the
BB; however, I added an additional check (with an explanation) just for early
break in case the other constraints are relaxed later.
After that we fix the issue just by getting the GSI after the labels and I add
a bunch of testcases for different positions the label can be added. Only the
vect-early-break_112-pr113750.c one results in the label being kept.
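A minimal sketch of the problematic shape (illustrative; see the actual
vect-early-break_112..116 testcases): the destination BB of the moved stores
starts with a user label:
unsigned a[1024], b[1024];
int f (unsigned x)
{
  for (int i = 0; i < 1024; i++)
    {
      if (a[i] > x)
        goto done;      /* the break destination ...                     */
      b[i] = x + i;
    }
done:                   /* ... starts with a label; get the GSI after it */
  return 0;
}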
gcc/ChangeLog:
PR tree-optimization/113750
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): Check
for single predecessor when doing early break vect.
* tree-vect-loop.cc (move_early_exit_stmts): Get gsi at the start but
after labels.
gcc/testsuite/ChangeLog:
PR tree-optimization/113750
* gcc.dg/vect/vect-early-break_112-pr113750.c: New test.
* gcc.dg/vect/vect-early-break_113-pr113750.c: New test.
* gcc.dg/vect/vect-early-break_114-pr113750.c: New test.
* gcc.dg/vect/vect-early-break_115-pr113750.c: New test.
* gcc.dg/vect/vect-early-break_116-pr113750.c: New test.
|
|
We use gsi_move_before (&stmt_gsi, &dest_gsi); to request that the new statement
be placed before any other statement. Typically this then moves the current
pointer to be after the statement we just inserted.
However it looks like when the BB is empty, this does not happen and the CUR
pointer stays NULL. There's a comment in the source of gsi_insert_before that
explains:
/* If CUR is NULL, we link at the end of the sequence (this case happens
This adds a default parameter to gsi_move_before to allow us to control where
the insertion happens.
gcc/ChangeLog:
PR tree-optimization/113731
* gimple-iterator.cc (gsi_move_before): Take new parameter for update
method.
* gimple-iterator.h (gsi_move_before): Default new param to
GSI_SAME_STMT.
* tree-vect-loop.cc (move_early_exit_stmts): Call gsi_move_before with
GSI_NEW_STMT.
gcc/testsuite/ChangeLog:
PR tree-optimization/113731
* gcc.dg/vect/vect-early-break_111-pr113731.c: New test.
|
|
We can't support niters with may_be_zero when we end up with a
non-empty latch due to early exit peeling. At least not in
the simplistic way the vectorizer handles this now. Disallow
it again for exits that are not the last one.
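A hedged sketch of a loop where a non-latch exit has may_be_zero niters
(illustrative, not the actual pr113576.c):
int a[8];
int f (int n)
{
  int i = 0, r = 0;
  do
    {
      if (i >= n)       /* may trigger with zero iterations done: may_be_zero */
        break;
      r += a[i];
      i++;
    }
  while (i < 8);        /* the latch exit, i.e. the last one */
  return r;
}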
PR tree-optimization/113576
* tree-vect-loop.cc (vec_init_loop_exit_info): Only allow
exits with may_be_zero niters when it's the last one.
* gcc.dg/vect/pr113576.c: New testcase.
|
|
This renames main_exit_p to last_val_reduc_p to more accurately
reflect what the value indicates.
gcc/ChangeLog:
* tree-vect-loop.cc (vect_get_vect_def,
vect_create_epilog_for_reduction): Rename main_exit_p to
last_val_reduc_p.
|
|
This fixes a bug where vect_create_epilog_for_reduction does not handle the
case where all exits are early exits. In this case we should do what the
induction handling code does and not have a main exit.
This shows that some new miscompiles are happening (stage3 is likely miscompiled)
but that's unrelated to this patch and I'll look at it next.
gcc/ChangeLog:
PR tree-optimization/113364
* tree-vect-loop.cc (vect_create_epilog_for_reduction): If all exits are
early exits then we must reduce from the first offset for all of them.
gcc/testsuite/ChangeLog:
PR tree-optimization/113364
* gcc.dg/vect/vect-early-break_107-pr113364.c: New test.
|
|
The following makes reduction epilogue code generation happy by properly
adding LC PHIs to the exit blocks for multiple exit vectorized loops.
Some refactoring might make the flow easier to follow but I've refrained
from doing that with this patch.
I've kept some fixes in reduction epilogue generation from the earlier
attempt fixing this PR.
PR tree-optimization/113373
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Create LC PHIs in the exit blocks where necessary.
* tree-vect-loop.cc (vectorizable_live_operation): Do not try
to handle missing LC PHIs.
(find_connected_edge): Remove.
(vect_create_epilog_for_reduction): Cleanup use of auto_vec.
* gcc.dg/vect/vect-early-break_104-pr113373.c: New testcase.
|
|
The following fixes a memory leak in vect_analyze_loop_form which fails
to free the loop body it gets. It also allows more countable exits,
matching what we can handle later, when we decide which exit to use
as main exit. Finally, some comments that no longer apply are adjusted.
* tree-vect-loop.cc (vec_init_loop_exit_info): Adjust comment,
prefer all later exits we can handle.
(vect_analyze_loop_form): Free the allocated loop body.
Adjust comments.
|
|
The following fixes wrong virtual operands being used for peeled
early breaks, where we can have different live ones, and for multiple
exits it makes sure to update the correct PHI arguments.
I've introduced SET_PHI_ARG_DEF_ON_EDGE so we can avoid using
a wrong edge to compute the PHI arg index from.
I've taken the liberty of going through the code again and refactoring
and commenting it a bit differently. The main functional change
is that we preserve the live virtual operand on all exits.
PR tree-optimization/113374
* tree-ssa-operands.h (SET_PHI_ARG_DEF_ON_EDGE): New.
* tree-vect-loop.cc (move_early_exit_stmts): Update
virtual LC PHIs.
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Refactor. Preserve virtual LC PHIs on all exits.
* gcc.dg/vect/vect-early-break_106-pr113374.c: New testcase.
|
|
This replaces two more usages of single_exit that I had missed before.
They both seem to happen when we re-use the ifcvt scalar loop for versioning.
The condition in versioning is the same as the one for when we don't re-use the
scalar loop.
gcc/ChangeLog:
* tree-vect-loop-manip.cc (vect_loop_versioning): Replace single_exit.
* tree-vect-loop.cc (vect_transform_loop): Likewise.
|
|
When we have a loop with more than 2 exits and a reduction I forgot to fill in
the PHI value for all alternate exits.
All alternate exits use the same PHI value so we should loop over the new
PHI elements and copy the value across since we call the reduction calculation
code only once for all exits. This was normally covered up by earlier parts of
the compiler rejecting loops incorrectly (which has been fixed now).
Note that while I can use the loop in all cases, the reason I separated out the
main and alt exit is so that if you pass the wrong edge the macro will assert.
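A hedged sketch of such a loop (illustrative only): more than two exits and a
reduction, where every alternate exit needs the same PHI value filled in:
unsigned a[1024], b[1024];
unsigned f (unsigned x)
{
  unsigned r = 0;
  for (int i = 0; i < 1024; i++)
    {
      r += a[i];
      if (a[i] > x)     /* first early exit                        */
        break;
      if (b[i] > x)     /* second early exit: same reduction value */
        break;
    }
  return r;
}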
gcc/ChangeLog:
PR tree-optimization/113178
* tree-vect-loop.cc (vect_create_epilog_for_reduction): Fill in all
alternate exits.
gcc/testsuite/ChangeLog:
PR tree-optimization/113178
* gcc.dg/vect/vect-early-break_101-pr113178.c: New test.
* gcc.dg/vect/vect-early-break_102-pr113178.c: New test.
|
|
This patch fixes several interconnected issues.
1. When picking an exit we wanted to check that niter_desc.may_be_zero is not
true, i.e. we want to pick an exit which we know will iterate at least once.
However niter_desc.may_be_zero is not a boolean. It is a tree that encodes
a boolean value. !niter_desc.may_be_zero just checks whether we have some
information, not what the information is (see the analogy sketched after
this list). This leads us to pick a more difficult to vectorize exit more
often than we should.
2. Because we had this bug, we used to pick an alternative exit much more often,
which exposed another issue: when the loop accesses memory and we "invert it" we
would corrupt the VUSE chain. This is because on a peeled vector iteration
every exit restarts the loop (i.e. they're all early) BUT since we may have
performed a store, the VUSE would need to be updated. This version maintains
virtual PHIs correctly in these cases. Note that we can't simply remove all
of them and recreate them because we still need the PHI nodes in the right
order for the skip_vector case.
3. Since we're moving the stores to a safe location I don't think we actually
need to analyze whether the store is in range of the memref, because if we
ever get there, we know that the loads must be in range, and if the loads are
in range and we get to the store we know the early breaks were not taken and
so the scalar loop would have done the VF stores too.
4. Instead of searching for where to move stores to, they should always be in the
exit belonging to the latch. We can only ever delay stores and even if we
pick a different exit than the latch one as the main one, effects still
happen in program order when vectorized. If we don't move the stores to the
latch exit but instead to wherever we pick as the "main" exit then we can
perform incorrect memory accesses (luckily these are trapped by verify_ssa).
5. We only used to analyze loads inside the same BB as an early break, and also
we'd never analyze the ones inside the block where we'd be moving memory
references to. This is obviously bogus and to fix it this patch splits apart
the two constraints. We first validate that all load memory references are
in bounds and only after that do we perform the alias checks for the writes.
This makes the code simpler to understand and more trivially correct.
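A small self-contained C analogy of the bug in point 1 (hypothetical types;
the real code uses GCC's tree-valued niter_desc.may_be_zero): testing for the
presence of a condition is not the same as testing what it says:
#include <stdbool.h>

/* Stand-in for a tree-encoded condition: NULL means "no information",
   otherwise it carries an actual boolean value.  */
struct cond { bool value; };

/* Buggy: !may_be_zero only asks "is there a condition at all?".  */
static bool known_to_iterate_buggy (const struct cond *may_be_zero)
{
  return !may_be_zero;
}

/* Intended: the exit surely iterates when the condition is known false.  */
static bool known_to_iterate (const struct cond *may_be_zero)
{
  return may_be_zero && !may_be_zero->value;
}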
gcc/ChangeLog:
PR tree-optimization/113137
PR tree-optimization/113136
PR tree-optimization/113172
PR tree-optimization/113178
* tree-vect-loop-manip.cc (slpeel_tree_duplicate_loop_to_edge_cfg):
Maintain PHIs on inverted loops.
(vect_do_peeling): Maintain virtual PHIs on inverted loops.
* tree-vect-loop.cc (vec_init_loop_exit_info): Pick exit closest to
latch.
(vect_create_loop_vinfo): Record all conds instead of only alt ones.
gcc/testsuite/ChangeLog:
PR tree-optimization/113137
PR tree-optimization/113136
PR tree-optimization/113172
PR tree-optimization/113178
* g++.dg/vect/vect-early-break_4-pr113137.cc: New test.
* g++.dg/vect/vect-early-break_5-pr113137.cc: New test.
* gcc.dg/vect/vect-early-break_95-pr113137.c: New test.
* gcc.dg/vect/vect-early-break_96-pr113136.c: New test.
* gcc.dg/vect/vect-early-break_97-pr113172.c: New test.
|
|
Vectorization of bit-precision inductions isn't implemented but we
don't check this; instead we ICE during transform.
PR tree-optimization/112505
* tree-vect-loop.cc (vectorizable_induction): Reject
bit-precision induction.
* gcc.dg/vect/pr112505.c: New testcase.
|
|
When if-conversion was changed to use .COND_ADD/SUB for conditional
reduction we forgot to update the reduction path handling to
canonicalize .COND_SUB to .COND_ADD for vectorizable_reduction
similar to what we do for MINUS_EXPR. The following adds this
and testcases exercising this at runtime and looking for the
appropriate masked subtraction in the vectorized code on x86.
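A hedged sketch of the kind of loop exercised (not the actual testcase): a
conditional subtraction reduction that if-conversion turns into .COND_SUB,
which check_reduction_path now canonicalizes to .COND_ADD much like the
existing MINUS_EXPR handling:
double f (double *x, int *c, int n)
{
  double r = 0.0;
  for (int i = 0; i < n; i++)
    if (c[i])
      r -= x[i];   /* if-converted to a .COND_SUB reduction */
  return r;
}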
PR tree-optimization/113078
* tree-vect-loop.cc (check_reduction_path): Canonicalize
.COND_SUB to .COND_ADD.
* gcc.dg/vect/vect-reduc-cond-sub.c: New testcase.
* gcc.target/i386/vect-pr113078.c: Likewise.
|
|
It looks like the previous patch had an unused variable.
It's odd that my bootstrap didn't catch it (I'm assuming
-Werror is still on for O3 bootstraps) but this fixes it.
gcc/ChangeLog:
* tree-vect-loop.cc (vectorizable_live_operation_1): Drop unused
restart_loop.
(vectorizable_live_operation): Likewise.
|
|
I was generating the vector reverse mask without checking if the target
actually supported such an operation.
This patch changes it so that if the bitstart is 0 we use BIT_FIELD_REF
instead to extract the first element, since this is supported by all targets.
This is good for now since masks always come from whilelo. But in the future
when masks can come from other sources we will need the old code back.
gcc/ChangeLog:
PR tree-optimization/113199
* tree-vect-loop.cc (vectorizable_live_operation_1): Use
BIT_FIELD_REF.
gcc/testsuite/ChangeLog:
PR tree-optimization/113199
* gcc.target/gcn/pr113199.c: New test.
|
|
On the following testcase e.g. on riscv64 or aarch64 (latter with
-O3 -march=armv8-a+sve ) we ICE, because while NITERS is INTEGER_CST,
NITERSM1 is a complex expression like
(short unsigned int) (a.0_1 + 255) + 1 > 256 ? ~(short unsigned int) (a.0_1 + 255) : 0
where a.0_1 is unsigned char. The condition is never true, so the above
is equivalent to just 0, but only when trying to fold the above with
PLUS_EXPR 1 we manage to simplify it (first
~(short unsigned int) (a.0_1 + 255)
to
-(short unsigned int) (a.0_1 + 255)
and then
(short unsigned int) (a.0_1 + 255) + 1 > 256 ? -(short unsigned int) (a.0_1 + 255) : 1
to
(short unsigned int) (a.0_1 + 255) >= 256 ? -(short unsigned int) (a.0_1 + 255) : 1
and only at this point we fold the condition to be false.
But the vectorizer seems to assume that if NITERS is known (i.e. suitable
INTEGER_CST) then NITERSM1 also is, so the following hack ensures that if
NITERS folds into an INTEGER_CST then NITERSM1 will be an INTEGER_CST as well.
2024-01-09 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/113210
* tree-vect-loop.cc (vect_get_loop_niters): If non-INTEGER_CST
value in *number_of_iterationsm1 PLUS_EXPR 1 is folded into
INTEGER_CST, recompute *number_of_iterationsm1 as the INTEGER_CST
minus 1.
* gcc.c-torture/compile/pr113210.c: New test.
|
|
The following avoids creating a niter peeling epilog more consistently,
matching what peeling later uses for the skip_vector condition, in
particular when versioning is required which then also ensures the
vector loop is entered unless the epilog is vectorized. This should
ideally match LOOP_VINFO_VERSIONING_THRESHOLD which is only computed
later; some refactoring could make them match better.
The patch also makes sure to adjust the upper bound of the epilogues
when we do not have a skip edge around the vector loop.
PR tree-optimization/113026
* tree-vect-loop.cc (vect_need_peeling_or_partial_vectors_p):
Avoid an epilog in more cases.
* tree-vect-loop-manip.cc (vect_do_peeling): Adjust the
epilogues' niter upper bounds and estimates.
* gcc.dg/torture/pr113026-1.c: New testcase.
* gcc.dg/torture/pr113026-2.c: Likewise.
|
|
|
When configured with --enable-checking=release we get a false
positive on the use of vec_stmts, as the compiler seems unable
to notice it gets initialized through pass-by-reference.
This explicitly initializes the local.
gcc/ChangeLog:
PR bootstrap/113132
* tree-vect-loop.cc (vect_create_epilog_for_reduction): Initialize vec_stmts.
|
|
Hi All,
This patch adds initial support for early break vectorization in GCC. In other
words it implements support for vectorization of loops with multiple exits.
The support is added for any target that implements a vector cbranch optab,
this includes both fully masked and non-masked targets.
Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR/Bitwise AND. This is however only checked
when the comparison would produce multiple statements.
This also fully decouples the vectorizer's notion of exit from the existing loop
infrastructure's exit. Before this patch the vectorizer always picked the
natural loop latch connected exit as the main exit.
After this patch the vectorizer is free to choose any exit it deems appropriate
as the main exit. This means that even if the main exit is not countable (i.e.
the termination condition could not be determined) we might still be able to
vectorize should one of the other exits be countable.
In such situations the loop is reflowed, which enables vectorization of many
other loop forms.
Concretely the kind of loops supported are of the forms:
for (int i = 0; i < N; i++)
{
<statements1>
if (<condition>)
{
...
<action>;
}
<statements2>
}
where <action> can be:
- break
- return
- goto
Any number of statements can be used before the <action> occurs.
Since this is an initial version for GCC 14 it has the following limitations and
features:
- Only fixed-size iterations and buffers are supported. That is to say any
vectors loaded or stored must be to statically allocated arrays with known
sizes. N must also be known. This limitation is because our primary target
for this optimization is SVE. For VLA SVE we can't easily do cross-page
iteration checks. The result is likely to also not be beneficial. For that
reason we punt support for variable buffers till we have First-Faulting
support in GCC 15.
- any stores in <statements1> should not be to the same objects as in
<condition>. Loads are fine as long as they don't have the possibility to
alias. More concretely, we block RAW dependencies when the intermediate value
can't be separated from the store, or the store itself can't be moved (see
the sketch after this list).
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported.
- Any number of loop early exits are supported.
- No support for epilogue vectorization. The only epilogue supported is the
scalar final one. Peeling code supports it but the code motion code cannot
find instructions to make the move in the epilog.
- Early breaks are only supported for inner loop vectorization.
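As an illustration of the RAW restriction above (a hedged sketch, not from
the patch): the break condition reads back a value stored earlier in the same
iteration, so the store cannot be delayed past the break:
unsigned a[1024], b[1024];
void f (unsigned x)
{
  for (int i = 0; i < 1024; i++)
    {
      a[i] = b[i] + 1;   /* store in <statements1> ...           */
      if (a[i] > x)      /* ... read back by the break condition */
        break;
    }
}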
With the help of IPA and LTO this still gets hit quite often. During bootstrap
it hit rather frequently. Additionally TSVC s332, s481 and s482 all pass now
since these are tests for support for early exit vectorization.
This implementation does not support completely handling the early break inside
the vector loop itself but instead supports adding checks such that if we know
that we have to exit in the current iteration then we branch to scalar code to
actually do the final VF iterations which handles all the code in <action>.
For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations. For vector code we only care about the state of fully
performed iterations and reset the scalar code to the (partially) remaining loop.
That is to say, the first vector loop executes so long as the early exit isn't
needed. Once the exit is taken, the scalar code will perform at most VF extra
iterations. The exact number depends on peeling, the iteration start and which
exit was taken (natural or early). For this scalar loop, all early exits are
treated the same.
When we vectorize we move any statement not related to the early break itself
and that would be incorrect to execute before the break (i.e. has side effects)
to after the break. If this is not possible we decline to vectorize. The
analysis and code motion also take into account that they don't introduce a RAW
dependency after the move of the stores.
This means that we check at the start of iterations whether we are going to exit
or not. During the analysis phase we check whether we are allowed to do this
moving of statements. Also note that we only move the scalar statements, and
do so after peeling but just before we start transforming statements.
With this the vector flow no longer necessarily needs to match that of the
scalar code. In addition most of the infrastructure is in place to support
general control flow safely, however we are punting this to GCC 15.
Codegen:
for e.g.
unsigned vect_a[N];
unsigned vect_b[N];
unsigned test4(unsigned x)
{
unsigned ret = 0;
for (int i = 0; i < N; i++)
{
vect_b[i] = x + i;
if (vect_a[i] > x)
break;
vect_a[i] = x;
}
return ret;
}
We generate for Adv. SIMD:
test4:
adrp x2, .LC0
adrp x3, .LANCHOR0
dup v2.4s, w0
add x3, x3, :lo12:.LANCHOR0
movi v4.4s, 0x4
add x4, x3, 3216
ldr q1, [x2, #:lo12:.LC0]
mov x1, 0
mov w2, 0
.p2align 3,,7
.L3:
ldr q0, [x3, x1]
add v3.4s, v1.4s, v2.4s
add v1.4s, v1.4s, v4.4s
cmhi v0.4s, v0.4s, v2.4s
umaxp v0.4s, v0.4s, v0.4s
fmov x5, d0
cbnz x5, .L6
add w2, w2, 1
str q3, [x1, x4]
str q2, [x3, x1]
add x1, x1, 16
cmp w2, 200
bne .L3
mov w7, 3
.L2:
lsl w2, w2, 2
add x5, x3, 3216
add w6, w2, w0
sxtw x4, w2
ldr w1, [x3, x4, lsl 2]
str w6, [x5, x4, lsl 2]
cmp w0, w1
bcc .L4
add w1, w2, 1
str w0, [x3, x4, lsl 2]
add w6, w1, w0
sxtw x1, w1
ldr w4, [x3, x1, lsl 2]
str w6, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
add w4, w2, 2
str w0, [x3, x1, lsl 2]
sxtw x1, w4
add w6, w1, w0
ldr w4, [x3, x1, lsl 2]
str w6, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
str w0, [x3, x1, lsl 2]
add w2, w2, 3
cmp w7, 3
beq .L4
sxtw x1, w2
add w2, w2, w0
ldr w4, [x3, x1, lsl 2]
str w2, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
str w0, [x3, x1, lsl 2]
.L4:
mov w0, 0
ret
.p2align 2,,3
.L6:
mov w7, 4
b .L2
and for SVE:
test4:
adrp x2, .LANCHOR0
add x2, x2, :lo12:.LANCHOR0
add x5, x2, 3216
mov x3, 0
mov w1, 0
cntw x4
mov z1.s, w0
index z0.s, #0, #1
ptrue p1.b, all
ptrue p0.s, all
.p2align 3,,7
.L3:
ld1w z2.s, p1/z, [x2, x3, lsl 2]
add z3.s, z0.s, z1.s
cmplo p2.s, p0/z, z1.s, z2.s
b.any .L2
st1w z3.s, p1, [x5, x3, lsl 2]
add w1, w1, 1
st1w z1.s, p1, [x2, x3, lsl 2]
add x3, x3, x4
incw z0.s
cmp w3, 803
bls .L3
.L5:
mov w0, 0
ret
.p2align 2,,3
.L2:
cntw x5
mul w1, w1, w5
cbz w5, .L5
sxtw x1, w1
sub w5, w5, #1
add x5, x5, x1
add x6, x2, 3216
b .L6
.p2align 2,,3
.L14:
str w0, [x2, x1, lsl 2]
cmp x1, x5
beq .L5
mov x1, x4
.L6:
ldr w3, [x2, x1, lsl 2]
add w4, w0, w1
str w4, [x6, x1, lsl 2]
add x4, x1, 1
cmp w0, w3
bcs .L14
mov w0, 0
ret
On the workloads this work is based on we see a 2-3x performance uplift
using this patch.
Follow up plan:
- Boolean vectorization has several shortcomings. I've filed PR110223 with the
bigger ones that cause vectorization to fail with this patch.
- SLP support. This is planned for GCC 15 as for the majority of cases building
SLP itself fails. This means I'll need to spend time making this more
robust first. Additionally it requires:
* Adding support for vectorizing CFG (gconds)
* Support for CFG to differ between vector and scalar loops.
Both of which would be disruptive to the tree and I suspect I'll be handling
fallouts from this patch for a while. So I plan to work on the surrounding
building blocks first for the remainder of the year.
Additionally it also contains reduced cases from issues found running over
various codebases.
Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
Also regtested with:
-march=armv8.3-a+sve
-march=armv8.3-a+nosve
-march=armv9-a
-mcpu=neoverse-v1
-mcpu=neoverse-n2
Bootstrapped Regtested x86_64-pc-linux-gnu and no issues.
Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.
gcc/ChangeLog:
* tree-if-conv.cc (idx_within_array_bound): Expose.
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
(vect_analyze_data_ref_dependences): Use it.
* tree-vect-loop-manip.cc (vect_iv_increment_position): New.
(vect_set_loop_controls_directly,
vect_set_loop_condition_partial_vectors,
vect_set_loop_condition_partial_vectors_avx512,
vect_set_loop_condition_normal): Support multiple exits.
(slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSSA peeling for
multiple exits.
(slpeel_can_duplicate_loop_p): Change the vectorizer to look at the
loop shape instead of the BB count.
(vect_update_ivs_after_vectorizer): Drop asserts.
(vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations.
(vect_do_peeling): Support multiple exits.
(vect_loop_versioning): Likewise.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
early_breaks.
(vect_analyze_loop_form): Support loop flows with more than single BB
loop body.
(vect_create_loop_vinfo): Support niters analysis for multiple exits.
(vect_analyze_loop): Likewise.
(vect_get_vect_def): New.
(vect_create_epilog_for_reduction): Support early exit reductions.
(vectorizable_live_operation_1): New.
(find_connected_edge): New.
(vectorizable_live_operation): Support early exit live operations.
(move_early_exit_stmts): New.
(vect_transform_loop): Use it.
* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
(vect_recog_bitfield_ref_pattern): Support gconds and bools.
(vect_recog_gcond_pattern): New.
(possible_vector_mask_operation_p): Support gcond masks.
(vect_determine_mask_precision): Likewise.
(vect_mark_pattern_stmts): Set gcond def type.
(can_vectorize_live_stmts): Force early break inductions to be live.
* tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for
early breaks.
(vect_mark_stmts_to_be_vectorized): Process gcond usage.
(perm_mask_for_reverse): Expose.
(vectorizable_comparison_1): New.
(vectorizable_early_exit): New.
(vect_analyze_stmt): Support early break and gcond.
(vect_transform_stmt): Likewise.
(vect_is_simple_use): Likewise.
(vect_get_vector_types_for_stmt): Likewise.
* tree-vectorizer.cc (pass_vectorize::execute): Update exits for value
numbering.
* tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
LOOP_VINFO_EARLY_BRK_VUSES): New.
(is_loop_header_bb_p): Drop assert.
(class loop): Add early_breaks, early_break_stores, early_break_dest_bb,
early_break_vuses.
(vect_iv_increment_position, perm_mask_for_reverse,
ref_within_array_bound): New.
(slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
|
|
Following Richard's suggestions, we should not model address cost in the loop
vectorizer for select_vl or decrement IV since other styles of vectorization
don't do that, to keep the cost model comparison apples to apples.
This patch sets COST from 2 to 1, which turns out to have better codegen
in various cases for RVV.
Ok for trunk ?
PR target/111153
gcc/ChangeLog:
* tree-vect-loop.cc (vect_estimate_min_profitable_iters):
Remove address cost for select_vl/decrement IV.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/costmodel/riscv/rvv/pr111153.c: Moved to...
* gcc.dg/vect/costmodel/riscv/rvv/pr111153-2.c: ...here.
* gcc.dg/vect/costmodel/riscv/rvv/pr111153-1.c: New test.
|
|
Hi, before this patch, a simple conversion case for RVV codegen:
foo:
ble a2,zero,.L8
addiw a5,a2,-1
li a4,6
bleu a5,a4,.L6
srliw a3,a2,3
slli a3,a3,3
add a3,a3,a0
mv a5,a0
mv a4,a1
vsetivli zero,8,e16,m1,ta,ma
.L4:
vle8.v v2,0(a5)
addi a5,a5,8
vzext.vf2 v1,v2
vse16.v v1,0(a4)
addi a4,a4,16
bne a3,a5,.L4
andi a5,a2,-8
beq a2,a5,.L10
.L3:
slli a4,a5,32
srli a4,a4,32
subw a2,a2,a5
slli a2,a2,32
slli a5,a4,1
srli a2,a2,32
add a0,a0,a4
add a1,a1,a5
vsetvli zero,a2,e16,m1,ta,ma
vle8.v v2,0(a0)
vzext.vf2 v1,v2
vse16.v v1,0(a1)
.L8:
ret
.L10:
ret
.L6:
li a5,0
j .L3
This vectorization goes through the first loop:
vsetivli zero,8,e16,m1,ta,ma
.L4:
vle8.v v2,0(a5)
addi a5,a5,8
vzext.vf2 v1,v2
vse16.v v1,0(a4)
addi a4,a4,16
bne a3,a5,.L4
Each iteration processes 8 elements.
For scalable vectorization on a CPU with VLEN > 128 bits, this is OK when
VLEN = 128. But as soon as VLEN > 128 bits it wastes CPU resources: with e.g.
VLEN = 256 bits, only half of the vector unit is working and the other half is idle.
After investigation, I realize that I forgot to adjust COST for SELECT_VL.
So, adjust COST for SELECT_VL style length vectorization. We adjust COST from 3 to 2, since
after this patch:
foo:
ble a2,zero,.L5
.L3:
vsetvli a5,a2,e16,m1,ta,ma -----> SELECT_VL cost.
vle8.v v2,0(a0)
slli a4,a5,1 -----> additional shift of outcome SELECT_VL for memory address calculation.
vzext.vf2 v1,v2
sub a2,a2,a5
vse16.v v1,0(a1)
add a0,a0,a5
add a1,a1,a4
bne a2,zero,.L3
.L5:
ret
This patch is a simple fix that I previously forgot.
Ok for trunk ?
If not, I am going to adjust cost in backend cost model.
PR target/111317
gcc/ChangeLog:
* tree-vect-loop.cc (vect_estimate_min_profitable_iters): Adjust for COST for decrement IV.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/costmodel/riscv/rvv/pr111317.c: New test.
|
|
The flag is defined as CHREC_NOWRAP(tree), and will be dumped from
"{offset, +, 1}_1" to "{offset, +, 1}<nw>_1" (nw is short for nonwrapping).
Two SCEV interfaces record_nonwrapping_chrec and nonwrapping_chrec_p are
added to set and check the flag respectively.
As resetting the SCEV cache (i.e., the chrec trees) may not reset the
loop->estimate_state, free_numbers_of_iterations_estimates is called
explicitly in loop vectorization to make sure the flag can be
calculated appropriately by niter.
gcc/ChangeLog:
PR tree-optimization/112774
* tree-pretty-print.cc: If the nonwrapping flag is set, the chrec
will be printed with additional <nw> info.
* tree-scalar-evolution.cc: Add record_nonwrapping_chrec and
nonwrapping_chrec_p to set and check the new flag respectively.
* tree-scalar-evolution.h: Likewise.
* tree-ssa-loop-niter.cc (idx_infer_loop_bounds,
infer_loop_bounds_from_pointer_arith, infer_loop_bounds_from_signedness,
scev_probably_wraps_p): Call record_nonwrapping_chrec before
record_nonwrapping_iv, call nonwrapping_chrec_p to check the flag is
set and return false from scev_probably_wraps_p.
* tree-vect-loop.cc (vect_analyze_loop): Call
free_numbers_of_iterations_estimates explicitly.
* tree-core.h: Document the nothrow_flag usage in CHREC_NOWRAP.
* tree.h: Add CHREC_NOWRAP(NODE); base.nothrow_flag is used to
represent the nonwrapping info.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/scev-16.c: New test.
|
|
When querying a single set of vector defs with the overloaded
vect_get_vec_defs API then when you try to use the overload with
the vector type specified the call will be ambiguous with the
variant without the vector type. The following fixes this by
re-ordering the vector type argument to come before the output
def vector argument.
I've changed vectorizable_conversion as that triggered this
so it has coverage showing this works. The motivation is to
reduce the number of (redundant) get_vectype_for_scalar_type
calls.
* tree-vectorizer.h (vect_get_vec_defs): Re-order arguments.
* tree-vect-stmts.cc (vect_get_vec_defs): Likewise.
(vectorizable_condition): Update caller.
(vectorizable_comparison_1): Likewise.
(vectorizable_conversion): Specify the vector type to be
used for invariant/external defs.
* tree-vect-loop.cc (vect_transform_reduction): Update caller.
|
|
The following makes sure to allocate enough space for vectype_op
in vectorizable_reduction.
PR tree-optimization/112677
* tree-vect-loop.cc (vectorizable_reduction): Use alloca
to allocate vectype_op.
|
|
In PR112406 Tamar found another problem with COND_OP reductions.
I wrongly assumed that the reduction variable would always remain in
operand 1, just as we create the COND_OP in ifcvt. But of course,
addition being commutative, we are free to swap operand 1 and 2 and we
end up with e.g.
_ifc__60 = .COND_ADD (_2, _6, MADPictureC1_lsm.10_25, MADPictureC1_lsm.10_25);
which does not pass the asserts I put in place.
This patch removes this restriction and allows the reduction index to be
2 as well.
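A hedged C-level sketch of how the swap can arise (illustrative; the
.COND_ADD quoted above is the actual failing IR): since addition commutes,
nothing pins the reduction variable to operand 1:
double mad;   /* stand-in for the reduction variable */
void f (double *x, int n)
{
  for (int i = 0; i < n; i++)
    if (x[i] > 0.0)
      mad = x[i] + mad;   /* reduction may end up as operand 2 of .COND_ADD */
}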
gcc/ChangeLog:
PR middle-end/112406
* tree-vect-loop.cc (vectorize_fold_left_reduction): Allow
reduction index != 1.
(vect_transform_reduction): Handle reduction index != 1.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/pr112406-2.c: New test.
|
|
The following moves the check whether the maximum vectorization
factor determined by data dependence analysis is in conflict with
the chosen vectorization factor to after the point where we applied
both the SLP and the unrolling adjustment to the vectorization
factor. We check the latter before applying unrolling, but the
SLP adjustment can result in both missed optimization and wrong-code.
* tree-vect-loop.cc (vect_analyze_loop_2): Move check
of VF against max_vf until VF is final.
|
|
We have to make sure to remove unused .MASK_CALL internal function
calls after vectorization.
PR tree-optimization/112618
* tree-vect-loop.cc (vect_transform_loop_stmt): For not
relevant and unused .MASK_CALL make sure we remove the
scalar stmt.
* gcc.dg/pr112618.c: New testcase.
|
|
For conditional operations the mask is loop invariant and cannot be
stored explicitly. By default, for reductions, we deduce the vectype
from the statement or the loop but this does not work for conditional
operations. Therefore this patch passes the truth type of the reduction
input vectype for the mask operand instead. This will override the
other choices and make sure we have the proper mask vectype.
gcc/ChangeLog:
PR middle-end/112406
PR middle-end/112552
* tree-vect-loop.cc (vect_transform_reduction): Pass truth
vectype for mask operand.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/pr112406.c: New test.
* gcc.target/riscv/rvv/autovec/pr112552.c: New test.
|
|
As mentioned in the PR, the intent of the r14-5076 changes was that
it doesn't count one of the uses on the use_stmt, but what actually
got implemented is that it does this processing on any op_use_stmt,
even if it is not the use_stmt statement, which means that it
can increase count even on debug stmts (-fcompare-debug failures),
or if there would be some other use stmt with 2+ uses it could count
that as a single use. Though, because it fails whenever cnt != 1
and I believe use_stmt must be one of the uses, it would probably
fail in the latter case anyway.
The following patch fixes that by doing this extra processing only when
op_use_stmt is use_stmt, and using the normal processing otherwise
(so ignore debug stmts, and increase on any uses on the stmt).
2023-11-17 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/112374
* tree-vect-loop.cc (check_reduction_path): Perform the cond_fn_p
special case only if op_use_stmt == use_stmt, use as_a rather than
dyn_cast in that case.
* gcc.dg/pr112374-1.c: New test.
* gcc.dg/pr112374-2.c: New test.
* g++.dg/opt/pr112374.C: New test.
|
|
This patch fixes ICE:
https://godbolt.org/z/z8T6o6qov
<source>: In function 'b':
<source>:2:6: error: missing definition
2 | void b() {
| ^
for SSA_NAME: loop_len_8 in statement:
_1 = -loop_len_8;
during GIMPLE pass: vect
<source>:2:6: internal compiler error: verify_ssa failed
0x7f1b56331082 __libc_start_main
???:0
Please submit a full bug report, with preprocessed source (by using -freport-bug).
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.
Compiler returned: 1
The root cause is we generate such IR in vectorization:
_1 = -loop_len_8;
vect_cst__11 = {_1, _1};
_18 = vect_vec_iv_.6_14 + vect_cst__11;
loop_len_8 is an uninitialized value.
The IR _18 = vect_vec_iv_.6_14 + vect_cst__11; is generated because we are
adding the induction variable with the result of SELECT_VL instead of VF.
The code is:
else if (LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
{
/* When we're using loop_len produced by SELEC_VL, the non-final
iterations are not always processing VF elements. So vectorize
induction variable instead of
_21 = vect_vec_iv_.6_22 + { VF, ... };
We should generate:
_35 = .SELECT_VL (ivtmp_33, VF);
vect_cst__22 = [vec_duplicate_expr] _35;
_21 = vect_vec_iv_.6_22 + vect_cst__22; */
gcc_assert (!slp_node);
gimple_seq seq = NULL;
vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo);
tree len = vect_get_loop_len (loop_vinfo, NULL, lens, 1, vectype, 0, 0);
expr = force_gimple_operand (fold_convert (TREE_TYPE (step_expr),
unshare_expr (len)),
&seq, true, NULL_TREE);
new_name = gimple_build (&seq, MULT_EXPR, TREE_TYPE (step_expr), expr,
step_expr);
gsi_insert_seq_before (&si, seq, GSI_SAME_STMT);
step_iv_si = &si;
}
LOOP_VINFO_USING_SELECT_VL_P is set before loop vectorization analysis, so at that point
we don't know whether this is partial vectorization or not, but the induction variable
handling depends on SELECT_VL_P being true.
So update SELECT_VL_P to false when it is not partial vectorization.
PR middle-end/112554
gcc/ChangeLog:
* tree-vect-loop.cc (vect_determine_partial_vectors_and_peeling):
Clear SELECT_VL_P for non-partial vectorization.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/pr112554.c: New test.
|
|
We hit the assert in this code:
if (TREE_CODE (init_expr) == INTEGER_CST)
  init_expr = fold_convert (TREE_TYPE (vectype), init_expr);
else
  gcc_assert (tree_nop_conversion_p (TREE_TYPE (vectype),
                                     TREE_TYPE (init_expr)));
where init_expr is a 24-bit integer type while vectype has 32-bit components.
The "fix" is to bail out instead of asserting.
gcc/ChangeLog:
PR tree-optimization/112496
* tree-vect-loop.cc (vectorizable_nonlinear_induction): Return
false when !tree_nop_conversion_p (TREE_TYPE (vectype),
TREE_TYPE (init_expr)).
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr112496.c: New test.
|
|
PR: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112438
1. Since the SELECT_VL result is not necessarily always VF in a non-final
iteration, the current GIMPLE IR is wrong:
...
_35 = .SELECT_VL (ivtmp_33, VF);
_21 = vect_vec_iv_.8_22 + { VF, ... };
E.g. consider total iterations N = 6 and VF = 4.
The SELECT_VL output is defined as not always being VF in a non-final
iteration; it depends on the hardware implementation.
Suppose we have an RVV CPU core whose vsetvl does even-distribution workload
optimization. It may process 3 elements in the 1st iteration and 3 elements
in the last iteration.
Then the induction variable update here: _21 = vect_vec_iv_.8_22 + { POLY_INT_CST [4, 4], ... };
is wrong: it adds VF, which is 4, while we actually didn't process 4 elements.
It should add 3 elements, the result of SELECT_VL.
So, here the correct IR should be:
_36 = .SELECT_VL (ivtmp_34, VF);
_22 = (int) _36;
vect_cst__21 = [vec_duplicate_expr] _22;
2. This issue only happens for non-SLP vectorization with a single rgroup, since:
if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
{
tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
if (direct_internal_fn_supported_p (IFN_SELECT_VL, iv_type,
OPTIMIZE_FOR_SPEED)
&& LOOP_VINFO_LENS (loop_vinfo).length () == 1
&& LOOP_VINFO_LENS (loop_vinfo)[0].factor == 1 && !slp
&& (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
|| !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ()))
LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo) = true;
}
3. This issue doesn't appear on nested loops no matter whether
LOOP_VINFO_USING_SELECT_VL_P is true or false, since:
# vect_vec_iv_.6_5 = PHI <_19(3), { 0, ... }(5)>
# vect_diff_15.7_20 = PHI <vect_diff_9.8_22(3), vect_diff_18.5_11(5)>
_19 = vect_vec_iv_.6_5 + { 1, ... };
vect_diff_9.8_22 = .COND_LEN_ADD ({ -1, ... }, vect_vec_iv_.6_5, vect_diff_15.7_20, vect_diff_15.7_20, _28, 0);
ivtmp_1 = ivtmp_4 + 4294967295;
....
<bb 5> [local count: 6549826]:
# vect_diff_18.5_11 = PHI <vect_diff_9.8_22(4), { 0, ... }(2)>
# ivtmp_26 = PHI <ivtmp_27(4), 40(2)>
_28 = .SELECT_VL (ivtmp_26, POLY_INT_CST [4, 4]);
goto <bb 3>; [100.00%]
Note the induction variable IR: _21 = vect_vec_iv_.8_22 + { POLY_INT_CST [4, 4], ... }; updates the induction
variable independently of VF (it doesn't care how many elements are processed
in the iteration). The update is loop invariant, so it won't be a problem even
if LOOP_VINFO_USING_SELECT_VL_P is true.
Testing passed, Ok for trunk ?
PR tree-optimization/112438
gcc/ChangeLog:
* tree-vect-loop.cc (vectorizable_induction): Bugfix when
LOOP_VINFO_USING_SELECT_VL_P.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/pr112438.c: New test.
|
|
It appears as if we "look through" a statement pattern in
vect_finish_replace_stmt but not before when we replace the newly
created vector statement's lhs. Then the lhs is the statement pattern's
lhs while in vect_finish_replace_stmt we assert that it's from the
statement the pattern replaced.
This patch uses vect_orig_stmt on the scalar destination's definition so
the replaced statement is used everywhere.
gcc/ChangeLog:
PR tree-optimization/112464
* tree-vect-loop.cc (vectorize_fold_left_reduction): Use
vect_orig_stmt on scalar_dest_def_info.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr112464.c: New test.
|
|
The following avoids running into the AVX512 style masking code for
RVV which would theoretically be able to handle it if I were not
relying on integer mode maskness in vect_get_loop_mask. While that's
easy to fix (patch in PR), the preference is to not have AVX512 style
masking for RVV, thus the following.
* tree-vect-loop.cc (vect_verify_full_masking_avx512):
Check we have integer mode masks as required by
vect_get_loop_mask.
|
|
This restricts tree-ifcvt to only create COND_OPs when we versioned the
loop for vectorization. Apart from that it re-creates a VEC_COND_EXPR
in vect_expand_fold_left if we emitted a COND_OP.
gcc/ChangeLog:
PR tree-optimization/112361
PR target/112359
PR middle-end/112406
* tree-if-conv.cc (convert_scalar_cond_reduction): Remember if
loop was versioned and only then create COND_OPs.
(predicate_scalar_phi): Do not create COND_OP when not
vectorizing.
* tree-vect-loop.cc (vect_expand_fold_left): Re-create
VEC_COND_EXPR.
(vectorize_fold_left_reduction): Pass mask to
vect_expand_fold_left.
gcc/testsuite/ChangeLog:
* gcc.dg/pr112359.c: New test.
|
|
The following fixes an oversight in vect_check_scalar_mask when
the mask is external or constant. When doing BB vectorization
we need to provide a group_size, best via an overload accepting
the SLP node as argument.
When fixed we then run into the issue that we have not analyzed
alignment of the .MASK_LOADs because they were not identified
as loads by vect_gather_slp_loads. Fixed by reworking the
detection.
PR tree-optimization/112404
* tree-vectorizer.h (get_mask_type_for_scalar_type): Declare
overload with SLP node argument.
* tree-vect-stmts.cc (get_mask_type_for_scalar_type): Implement it.
(vect_check_scalar_mask): Use it.
* tree-vect-slp.cc (vect_gather_slp_loads): Properly identify
loads also for nodes with children, like .MASK_LOAD.
* tree-vect-loop.cc (vect_analyze_loop_2): Look at the
representative for load nodes and check whether it is a grouped
access before looking for load-lanes support.
* gfortran.dg/pr112404.f90: New testcase.
|
|
During analyzing PR111950 I found the loop live operation code-gen
odd, in particular only replacing a single PHI but then adjusting
possibly remaining PHIs afterwards where there shouldn't really
be any out-of-loop uses of the scalar in-loop def left.
* tree-vect-loop.cc (vectorizable_live_operation): Simplify
LC PHI replacement.
|
|
The following removes a bogus assert constraining the uses that
could appear when a built from scalar defs SLP node constrains
code generation in a way so earlier uses of the vector CTOR
components fail to get vectorized. We can't really constrain the
operation such use appears in.
PR tree-optimization/112366
* tree-vect-loop.cc (vectorizable_live_operation): Remove
assert.
|
|
As described in PR111401 we currently emit a COND and a PLUS expression
for conditional reductions. This makes it difficult to combine both
into a masked reduction statement later.
This patch improves that by directly emitting a COND_ADD/COND_OP during
ifcvt and adjusting some vectorizer code to handle it.
It also makes neutral_op_for_reduction return -0 if HONOR_SIGNED_ZEROS
is true.
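A small runnable demonstration of why -0 is the right neutral element under
HONOR_SIGNED_ZEROS (a sketch; masked-out lanes of the reduction contribute
the neutral value):
#include <stdio.h>

int main (void)
{
  double neg_zero = -0.0;
  /* +0.0 is not neutral: adding it to -0.0 flips the sign to +0.0.  */
  printf ("%g\n", neg_zero + 0.0);    /* prints 0  */
  /* -0.0 is neutral: x + -0.0 == x for every x, including +0.0.  */
  printf ("%g\n", 0.0 + neg_zero);    /* prints 0  */
  printf ("%g\n", neg_zero + -0.0);   /* prints -0 */
  return 0;
}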
gcc/ChangeLog:
PR middle-end/111401
* internal-fn.cc (internal_fn_else_index): New function.
* internal-fn.h (internal_fn_else_index): Define.
* tree-if-conv.cc (convert_scalar_cond_reduction): Emit COND_OP
if supported.
(predicate_scalar_phi): Add whitespace.
* tree-vect-loop.cc (fold_left_reduction_fn): Add IFN_COND_OP.
(neutral_op_for_reduction): Return -0 for PLUS.
(check_reduction_path): Don't count else operand in COND_OP.
(vect_is_simple_reduction): Ditto.
(vect_create_epilog_for_reduction): Fix whitespace.
(vectorize_fold_left_reduction): Add COND_OP handling.
(vectorizable_reduction): Don't count else operand in COND_OP.
(vect_transform_reduction): Add COND_OP handling.
* tree-vectorizer.h (neutral_op_for_reduction): Add default
parameter.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-cond-reduc-in-order-2-signed-zero.c: New test.
* gcc.target/riscv/rvv/autovec/cond/pr111401.c: New test.
* gcc.target/riscv/rvv/autovec/reduc/reduc_call-2.c: Adjust.
* gcc.target/riscv/rvv/autovec/reduc/reduc_call-4.c: Ditto.
|
|
Avoid a compile-time hog for the nonlinear induction vec_step_op_mul when the
iteration count is too big. There's a loop in vect_peel_nonlinear_iv_init
computing init_expr * pow (step_expr, skip_niters). When skip_niters is too
big, compile time explodes. To avoid that, optimize
init_expr * pow (step_expr, skip_niters) to
init_expr << (exact_log2 (step_expr) * skip_niters) when step_expr is a
power of 2; otherwise give up on vectorization when skip_niters >=
TYPE_PRECISION (TREE_TYPE (init_expr)).
Also give up on vectorization when niters_skip is negative, which will be
used for a fully masked loop.
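A small self-contained sketch of the identity used (hypothetical helper
names): for a power-of-two step, init * step^k equals init << (log2 (step) * k)
in unsigned arithmetic, turning the O(k) multiply loop into one shift:
unsigned mul_pow (unsigned init, unsigned step, unsigned k)
{
  unsigned r = init;
  for (unsigned i = 0; i < k; i++)   /* O(k): the compile-time hog for huge k */
    r *= step;
  return r;
}

unsigned shift_pow (unsigned init, unsigned log2_step, unsigned k)
{
  /* equivalent as long as log2_step * k stays below the type's precision */
  return init << (log2_step * k);
}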
gcc/ChangeLog:
PR tree-optimization/111820
PR tree-optimization/111833
* tree-vect-loop-manip.cc (vect_can_peel_nonlinear_iv_p): Give
up vectorization for nonlinear iv vect_step_op_mul when
step_expr is not exact_log2 and niters is greater than
TYPE_PRECISION (TREE_TYPE (step_expr)). Also don't vectorize
for negative niters_skip, which will be used by a fully masked
loop.
(vect_can_advance_ivs_p): Pass whole phi_info to
vect_can_peel_nonlinear_iv_p.
* tree-vect-loop.cc (vect_peel_nonlinear_iv_init): Optimize
init_expr * pow (step_expr, skipn) to init_expr
<< (log2 (step_expr) * skipn) when step_expr is exact_log2.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr111820-1.c: New test.
* gcc.target/i386/pr111820-2.c: New test.
* gcc.target/i386/pr111820-3.c: New test.
* gcc.target/i386/pr103144-mul-1.c: Adjust testcase.
* gcc.target/i386/pr103144-mul-2.c: Adjust testcase.
|
|
The following makes sure to rewrite all gather/scatter detected by
dataref analysis plus stmts classified as VMAT_GATHER_SCATTER. Maybe
we need to rewrite all refs; the following covers the cases I've
run into so far.
* tree-vect-loop.cc (update_epilogue_loop_vinfo): Rewrite
both STMT_VINFO_GATHER_SCATTER_P and VMAT_GATHER_SCATTER
stmt refs.
|