This builds on top of the vect_worthwhile_without_simd_p refactoring
done earlier. That refactoring was wrong to drop the apparently
redundant checks for operation support: the optab check can happen
with an integer vector emulation mode and thus succeed, but vector
lowering might not actually support the operation on word_mode.
The following patch adds a vect_emulated_vector_p helper and
re-instantiates the check where it was previously. It also adds
appropriate costing of the scalar stmts emitted by vector lowering
to vectorizable_operation which should be the only place such
operations are synthesized. I've also handled the case where
the vector mode is supported but the operation is not (though
I think this will be unlikely given we're talking about plus, minus
and negate).
This fixes the observed FAIL of gcc.dg/tree-ssa/gen-vect-11b.c
with -m32, where we ended up vectorizing a multiplication that was
then torn down to scalars again by vector lowering.
I'm not super happy about all the other places where we're now
and previously feeding scalar modes to optab checks when we
want to know whether we can vectorize something, but so be it.
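For reference, a minimal sketch of what such a helper can look like
(an assumption about its shape, not the verbatim patch; VECTOR_MODE_P
and TYPE_MODE are the usual GCC accessors):
  /* Return true if VECTYPE is emulated, i.e. operated on as integer
     words by vector lowering rather than in a native vector mode.  */
  bool
  vect_emulated_vector_p (tree vectype)
  {
    return !VECTOR_MODE_P (TYPE_MODE (vectype));
  }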
2021-09-08 Richard Biener <rguenther@suse.de>
PR tree-optimization/101801
PR tree-optimization/101819
* tree-vectorizer.h (vect_emulated_vector_p): Declare.
* tree-vect-loop.c (vect_emulated_vector_p): New function.
(vectorizable_reduction): Re-instantiate a check for emulated
operations.
* tree-vect-stmts.c (vectorizable_shift): Likewise.
(vectorizable_operation): Likewise. Cost emulated vector
operations according to the scalar sequence synthesized by
vector lowering.
|
|
This removes the cost part of vect_worthwhile_without_simd_p, retaining
only the correctness bits. The reason is that the cost heuristic
does not properly account for SLP; in addition, the check whether
"without simd" applies misfires for AVX512 mask vectors at the moment,
leading to missed vectorizations there.
Any costing decision should take place in the cost modeling; no
single stmt should disable all vectorization on its own.
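A sketch of the retained correctness check (assuming it simply
whitelists the tree codes vector lowering can open-code on word_mode):
  bool
  vect_can_vectorize_without_simd_p (tree_code code)
  {
    switch (code)
      {
      case PLUS_EXPR:
      case MINUS_EXPR:
      case NEGATE_EXPR:
      case BIT_AND_EXPR:
      case BIT_IOR_EXPR:
      case BIT_XOR_EXPR:
        return true;
      default:
        return false;
      }
  }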
2021-08-06 Richard Biener <rguenther@suse.de>
PR tree-optimization/101801
* tree-vectorizer.h (vect_worthwhile_without_simd_p): Rename...
(vect_can_vectorize_without_simd_p): ... to this.
* tree-vect-loop.c (vect_worthwhile_without_simd_p): Rename...
(vect_can_vectorize_without_simd_p): ... to this and fold
in vect_min_worthwhile_factor.
(vect_min_worthwhile_factor): Remove.
(vectorizable_reduction): Adjust and remove the cost part.
* tree-vect-stmts.c (vectorizable_shift): Likewise.
(vectorizable_operation): Likewise.
|
|
This patch uses a more accurate scalar iteration estimate when
comparing the epilogue of a constant-iteration loop with a candidate
replacement epilogue.
In the testcase, the patch prevents a 1-to-3-element SVE epilogue
from seeming better than a 64-bit Advanced SIMD epilogue.
gcc/
* tree-vect-loop.c (vect_better_loop_vinfo_p): Detect cases in
which old_loop_vinfo is an epilogue loop that handles a constant
number of iterations.
gcc/testsuite/
* gcc.target/aarch64/sve/cost_model_12.c: New test.
|
|
After vect_analyze_loop has successfully analysed a loop for
one base vector mode B1, it considers using following base vector
modes to vectorise an epilogue. However, for VECT_COMPARE_COSTS,
a later mode B2 might turn out to be better than B1 was. Initially
this comparison will be between an epilogue loop (for B2) and a main
loop (for B1). However, in r11-6458 I'd added code to reanalyse the
B2 epilogue loop as a main loop, partly for correctness and partly
for better costing.
This can lead to a situation in which we think that the B2 epilogue
loop was better than the B1 main loop, but that the B2 main loop is
not better than the B1 main loop. There was no dump message to say
that this had happened, which made it look like B2 had still won.
gcc/
* tree-vect-loop.c (vect_analyze_loop): Print a dump message
when a reanalyzed loop fails to be cheaper than the current
main loop.
|
|
This fixes the partial reduction of the reused reduction vector so
that it is carried out in the correct sign, and records the correctly
signed vector for the skip edge use.
2021-07-16 Richard Biener <rguenther@suse.de>
* tree-vect-loop.c (vect_transform_cycle_phi): Correct sign
conversion issues with the partial reduction of the reused
vector accumulator.
|
|
This patch adds support for a dot product where the signs of the
multiplication arguments differ, i.e. one is signed and one is unsigned
but the precisions are the same.
#define N 480
#define SIGNEDNESS_1 unsigned
#define SIGNEDNESS_2 signed
#define SIGNEDNESS_3 signed
#define SIGNEDNESS_4 unsigned
SIGNEDNESS_1 int __attribute__ ((noipa))
f (SIGNEDNESS_1 int res, SIGNEDNESS_3 char *restrict a,
SIGNEDNESS_4 char *restrict b)
{
for (__INTPTR_TYPE__ i = 0; i < N; ++i)
{
int av = a[i];
int bv = b[i];
SIGNEDNESS_2 short mult = av * bv;
res += mult;
}
return res;
}
The operations are performed as if the operands were extended to a 32-bit
value. As such this operation isn't valid if there is an intermediate
conversion to an unsigned value, i.e. if SIGNEDNESS_2 is unsigned.
Moreover, if the signs of SIGNEDNESS_3 and SIGNEDNESS_4 are flipped, the
same optab is used but the operands are flipped in the optab expansion.
To support this, the patch extends the dot-product detection to optionally
ignore operands with different signs and stores this information in the
optab subtype, which is now made a bitfield.
The subtype now additionally controls which optab an EXPR can expand to.
gcc/ChangeLog:
* optabs.def (usdot_prod_optab): New.
* doc/md.texi: Document it and clarify other dot prod optabs.
* optabs-tree.h (enum optab_subtype): Add optab_vector_mixed_sign.
* optabs-tree.c (optab_for_tree_code): Support usdot_prod_optab.
* optabs.c (expand_widen_pattern_expr): Likewise.
* tree-cfg.c (verify_gimple_assign_ternary): Likewise.
* tree-vect-loop.c (vectorizable_reduction): Query dot-product kind.
* tree-vect-patterns.c (vect_supportable_direct_optab_p): Take an
optional optab subtype.
(vect_widened_op_tree): Optionally ignore mismatched types.
(vect_recog_dot_prod_pattern): Support usdot_prod_optab.
|
|
The following adds support for re-using the vector reduction def
from the main loop in vectorized epilogue loops on architectures
which use different vector sizes for the epilogue. That's only
x86 as far as I am aware.
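As a sketch of the idea (hypothetical GIMPLE, assuming a V8SI main
loop accumulator reduced to a V4SI epilogue type), the wider
accumulator is split and its halves added:
  lo_4 = BIT_FIELD_REF <acc_8, 128, 0>;
  hi_4 = BIT_FIELD_REF <acc_8, 128, 128>;
  acc_4 = lo_4 + hi_4;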
2021-07-13 Richard Biener <rguenther@suse.de>
* tree-vect-loop.c (vect_find_reusable_accumulator): Handle
vector types where the old vector type has a multiple of
the new vector type elements.
(vect_create_partial_epilog): New function, split out from...
(vect_create_epilog_for_reduction): ... here.
(vect_transform_cycle_phi): Reduce the re-used accumulator
to the new vector type.
* gcc.target/i386/vect-reduc-1.c: New testcase.
|
|
This patch adds support for reusing a main loop's reduction accumulator
in an epilogue loop. This in turn lets the loops share a single piece
of vector->scalar reduction code.
The patch has the following restrictions:
(1) The epilogue reduction can only operate on a single vector
(e.g. ncopies must be 1 for non-SLP reductions, and the group size
must be <= the element count for SLP reductions).
(2) Both loops must use the same vector mode for their accumulators.
This means that the patch is restricted to targets that support
--param vect-partial-vector-usage=1.
(3) The reduction must be a standard “tree code” reduction.
However, these restrictions could be lifted in future. For example,
if the main loop operates on 128-bit vectors and the epilogue loop
operates on 64-bit vectors, we could in future reduce the 128-bit
vector by one stage and use the 64-bit result as the starting point
for the epilogue result.
The patch tries to handle chained SLP reductions, unchained SLP
reductions and non-SLP reductions. It also handles cases in which
the epilogue loop is entered directly (rather than via the main loop)
and cases in which the epilogue loop can be skipped.
vect_get_main_loop_result is a bit more general than the current
patch needs.
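As an illustration (hypothetical pseudo-GIMPLE, not taken from the
patch), both loops keep the reduction in vector form and share one
final vector->scalar step:
  main loop:     vec_acc = vec_acc + vec_data;   /* full vectors */
  epilogue loop: vec_acc = .COND_ADD (mask, vec_acc, vec_data, vec_acc);
  after loops:   res = .REDUC_PLUS (vec_acc);    /* single reduction */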
gcc/
* tree-vectorizer.h (vect_reusable_accumulator): New structure.
(_loop_vec_info::main_loop_edge): New field.
(_loop_vec_info::skip_main_loop_edge): Likewise.
(_loop_vec_info::skip_this_loop_edge): Likewise.
(_loop_vec_info::reusable_accumulators): Likewise.
(_stmt_vec_info::reduc_scalar_results): Likewise.
(_stmt_vec_info::reused_accumulator): Likewise.
(vect_get_main_loop_result): Declare.
* tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize
reduc_scalar_results.
(vec_info::free_stmt_vec_info): Free reduc_scalar_results.
* tree-vect-loop-manip.c (vect_get_main_loop_result): New function.
(vect_do_peeling): Fill an epilogue loop's main_loop_edge,
skip_main_loop_edge and skip_this_loop_edge fields.
* tree-vect-loop.c (INCLUDE_ALGORITHM): Define.
(vect_emit_reduction_init_stmts): New function.
(get_initial_def_for_reduction): Use it.
(get_initial_defs_for_reduction): Likewise. Change the vinfo
parameter to a loop_vec_info.
(vect_create_epilog_for_reduction): Store the scalar results
in the reduc_info. If an epilogue loop is reusing an accumulator
from the main loop, and if the epilogue loop can also be skipped,
try to place the reduction code in the join block. Record
accumulators that could potentially be reused by epilogue loops.
(vect_transform_cycle_phi): When vectorizing epilogue loops,
try to reuse accumulators from the main loop. Record the initial
value in reduc_info for non-SLP reductions too.
gcc/testsuite/
* gcc.target/aarch64/sve/reduc_9.c: New test.
* gcc.target/aarch64/sve/reduc_9_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_10.c: Likewise.
* gcc.target/aarch64/sve/reduc_10_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_11.c: Likewise.
* gcc.target/aarch64/sve/reduc_11_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_12.c: Likewise.
* gcc.target/aarch64/sve/reduc_12_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_13.c: Likewise.
* gcc.target/aarch64/sve/reduc_13_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_14.c: Likewise.
* gcc.target/aarch64/sve/reduc_14_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_15.c: Likewise.
* gcc.target/aarch64/sve/reduc_15_run.c: Likewise.
|
|
After previous patches, we can now easily provide the neutral op
as an argument to get_initial_def_for_reduction. This in turn
allows the adjustment calculation to be moved outside of
get_initial_def_for_reduction, which is the main motivation
of the patch.
gcc/
* tree-vect-loop.c (get_initial_def_for_reduction): Remove
adjustment handling. Take the neutral value as an argument,
in place of the code argument.
(vect_transform_cycle_phi): Update accordingly. Handle the
initial values of cond reductions separately from code reductions.
Choose the adjustment here rather than in
get_initial_def_for_reduction. Sink the splat of vec_initial_def.
|
|
This patch generalises the interface to neutral_op_for_slp_reduction
so that it can be used for non-SLP reductions too. This isn't much
of a win on its own, but it helps later patches.
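For reference, the usual neutral values involved (a summary, assuming
the generalized function keeps the SLP version's semantics):
  PLUS_EXPR / MINUS_EXPR / BIT_IOR_EXPR / BIT_XOR_EXPR -> 0
  MULT_EXPR -> 1
  BIT_AND_EXPR -> all-ones
  MIN_EXPR / MAX_EXPR -> the initial scalar value itself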
gcc/
* tree-vect-loop.c (neutral_op_for_slp_reduction): Replace with...
(neutral_op_for_reduction): ...this, providing a more general
interface.
(vect_create_epilog_for_reduction): Update accordingly.
(vectorizable_reduction): Likewise.
(vect_transform_cycle_phi): Likewise.
|
|
Similarly to the previous patch, this one passes the reduc_info
to get_initial_def_for_reduction, rather than a stmt_vec_info that
lacks the metadata. This again becomes useful later.
gcc/
* tree-vect-loop.c (get_initial_def_for_reduction): Take the
reduc_info instead of the original stmt_vec_info.
(vect_transform_cycle_phi): Update accordingly.
|
|
This patch passes the reduc_info to get_initial_defs_for_reduction,
so that the function can get general information from there rather
than from the first SLP statement. This isn't a win on its own,
but it becomes important with later patches.
gcc/
* tree-vect-loop.c (get_initial_defs_for_reduction): Take the
reduc_info as an additional parameter.
(vect_transform_cycle_phi): Update accordingly.
|
|
This patch adds a helper function called vect_phi_initial_value
for returning the incoming value of a given loop phi. The main
reason for adding it is to ensure that the right preheader edge
is used when vectorising nested loops. (PHI_ARG_DEF_FROM_EDGE
itself doesn't assert that the given edge is for the right block,
although I guess that would be good to add separately.)
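A minimal sketch of such a helper (an assumption about its shape,
keyed off the preheader edge of the phi's own loop):
  static inline tree
  vect_phi_initial_value (gphi *phi)
  {
    basic_block bb = gimple_bb (phi);
    edge pe = loop_preheader_edge (bb->loop_father);
    gcc_assert (pe->dest == bb);
    return PHI_ARG_DEF_FROM_EDGE (phi, pe);
  }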
gcc/
* tree-vectorizer.h: Include tree-ssa-operands.h.
(vect_phi_initial_value): New function.
* tree-vect-loop.c (neutral_op_for_slp_reduction): Use it.
(get_initial_defs_for_reduction, info_for_reduction): Likewise.
(vect_create_epilog_for_reduction, vectorizable_reduction): Likewise.
(vect_transform_cycle_phi, vectorizable_induction): Likewise.
|
|
Vector reduction accumulators can differ in signedness from the
final scalar result. The conversions to handle that case were
distributed through vect_create_epilog_for_reduction; this patch
does the conversion up-front instead.
gcc/
* tree-vect-loop.c (vect_create_epilog_for_reduction): Convert
the phi results to vectype after creating them. Remove later
conversion code that thus becomes redundant.
|
|
vect_create_epilog_for_reduction had a variable called new_phis.
It collected the statements that produce the exit block definitions
of the vector reduction accumulators. Although those statements
are indeed phis initially, they are often replaced with normal
statements later, leading to puzzling code like:
FOR_EACH_VEC_ELT (new_phis, i, new_phi)
{
int bit_offset;
if (gimple_code (new_phi) == GIMPLE_PHI)
vec_temp = PHI_RESULT (new_phi);
else
vec_temp = gimple_assign_lhs (new_phi);
Also, although the array collects statements, in practice all users want
the lhs instead.
This patch therefore replaces new_phis with a vector of gimple values
called “reduc_inputs”.
Also, reduction chains and ncopies>1 were handled with identical code
(and there was a comment saying so). The patch unites them into
a single “if”.
gcc/
* tree-vect-loop.c (vect_create_epilog_for_reduction): Replace
the new_phis vector with a reduc_inputs vector. Combine handling
of reduction chains and ncopies > 1.
|
|
This patch constructs an array_slice of the scalar statements that
produce live-out reduction results in the original unvectorised loop.
There are three cases:
- SLP reduction chains: the final SLP stmt is live-out
- full SLP reductions: all SLP stmts are live-out
- non-SLP reductions: the single scalar stmt is live-out
This is a slight simplification on its own, mostly because it means
“group_size” has a consistent meaning throughout the function.
The main justification though is that it helps with later patches.
gcc/
* tree-vect-loop.c (vect_create_epilog_for_reduction): Truncate
scalar_results to group_size elements after reducing down from
N*group_size elements. Construct an array_slice of the live-out
stmts and assert that there is one stmt per scalar result.
|
|
vect_create_epilog_for_reduction only handles two cases: single-loop
reductions and double reductions. “nested cycles” (i.e. reductions
in the inner loop when vectorising an outer loop) are handled elsewhere
and don't need a vector->scalar reduction.
The function had variables called nested_in_vect_loop and double_reduc
and asserted that nested_in_vect_loop implied double_reduc, but it
still had code to handle nested_in_vect_loop && !double_reduc.
This patch removes that and uses double_reduc everywhere.
gcc/
* tree-vect-loop.c (vect_create_epilog_for_reduction): Remove
nested_in_vect_loop and use double_reduc everywhere. Remove dead
assignment to "loop".
|
|
vectorizable_reduction had code guarded by:
if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
|| STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def)
But that's always true after:
if (STMT_VINFO_DEF_TYPE (stmt_info) != vect_reduction_def
&& STMT_VINFO_DEF_TYPE (stmt_info) != vect_double_reduction_def
&& STMT_VINFO_DEF_TYPE (stmt_info) != vect_nested_cycle)
return false;
if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_nested_cycle)
{
…
return true;
}
(I wasn't sure at first how the empty “else” for the first “if” above
was supposed to work.)
gcc/
* tree-vect-loop.c (vectorizable_reduction): Remove always-true
if condition.
|
|
This adds a simple reduction vectorization capability to the
non-loop vectorizer. Simple meaning it lacks any of the fancy
ways to generate the reduction epilogue but only supports
those we can handle via a direct internal function reducing
a vector to a scalar. One of the main reasons is to avoid
massive refactoring at this point but also that more complex
epilogue operations are hardly profitable.
Mixed-sign reductions are fended off for now, and I'm not finally
settled on whether we want an explicit SLP node for the
reduction epilogue operation. Handling mixed signs could be
done by multiplying with a { 1, -1, .. } vector. Also fended off
are reductions with non-internal operands (constants
or register parameters for example).
Costing is done by accounting the original participating scalar
stmts for the scalar cost, and log2 permutes and operations for
the vectorized epilogue.
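As a sketch (hypothetical IL), a four-lane associative chain like
r = a0 + a1 + a2 + a3 becomes:
  vec = {a0, a1, a2, a3};   /* vector built from the SLP lanes */
  r = .REDUC_PLUS (vec);    /* direct internal function epilogue */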
--
SPEC CPU 2017 FP with rate workload measurements show (picked
fastest runs of three) regressions for 507.cactuBSSN_r (1.5%),
508.namd_r (2.5%), 511.povray_r (2.5%), 526.blender_r (0.5%) and
527.cam4_r (2.5%) and improvements for 510.parest_r (5%) and
538.imagick_r (1.5%). This is with -Ofast -march=znver2 on a Zen2.
Statistics on CPU 2017 show that the overwhelming number of seeds
we find are reductions of two lanes (well - that's basically every
associative operation). That means we put a quite high pressure
on the SLP discovery process this way.
In total we find 583218 seeds we put to SLP discovery out of which
66205 pass that and only 6185 of those make it through
code generation checks. 796 of those are discarded because the reduction
is part of a larger SLP instance. 4195 of the remaining
are deemed not profitable to vectorize and 1194 are finally
vectorized. That's a poor 0.2% rate.
Of the 583218 seeds 486826 (83%) have two lanes, 60912 have three (10%),
28181 four (5%), 4808 five, 909 six and there are instances up to 120
lanes.
There's a set of 54086 candidate seeds we reject because
they contain a constant or invariant (not implemented yet) but still
have two or more lanes that could be put to SLP discovery.
2021-06-16 Richard Biener <rguenther@suse.de>
PR tree-optimization/54400
* tree-vectorizer.h (enum slp_instance_kind): Add
slp_inst_kind_bb_reduc.
(reduction_fn_for_scalar_code): Declare.
* tree-vect-data-refs.c (vect_slp_analyze_instance_dependence):
Check SLP_INSTANCE_KIND instead of looking at the
representative.
(vect_slp_analyze_instance_alignment): Likewise.
* tree-vect-loop.c (reduction_fn_for_scalar_code): Export.
* tree-vect-slp.c (vect_slp_linearize_chain): Split out
chain linearization from vect_build_slp_tree_2 and generalize
for the use of BB reduction vectorization.
(vect_build_slp_tree_2): Adjust accordingly.
(vect_optimize_slp): Elide permutes at the root of BB reduction
instances.
(vectorizable_bb_reduc_epilogue): New function.
(vect_slp_prune_covered_roots): Likewise.
(vect_slp_analyze_operations): Use them.
(vect_slp_check_for_constructors): Recognize associatable
chains for BB reduction vectorization.
(vectorize_slp_instance_root_stmt): Generate code for the
BB reduction epilogue.
* gcc.dg/vect/bb-slp-pr54400.c: New testcase.
|
|
The following fixes the SLP FMA patterns to preserve reduction
info and the reduction vectorization to consider internal function
call defs for the reduction stmt.
2021-06-09 Richard Biener <rguenther@suse.de>
PR tree-optimization/100981
gcc/
* tree-vect-loop.c (vect_create_epilog_for_reduction): Use
gimple_get_lhs to also handle calls.
* tree-vect-slp-patterns.c (complex_pattern::build): Transfer
reduction info.
gcc/testsuite/
* gfortran.dg/vect/pr100981-1.f90: New testcase.
libgomp/
* testsuite/libgomp.fortran/pr100981-2.f90: New testcase.
|
|
This patch uses the knowledge of the conditions to enter an epilogue
loop to help come up with a potentially more restrictive upper bound.
gcc/ChangeLog:
* tree-vect-loop.c (vect_transform_loop): Use the main loop's various
thresholds to narrow the upper bound on epilogue iterations.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/sve/part_vect_single_iter_epilog.c: New test.
|
|
This patch replaces the hardcoded weight factor 50, which the loop
vectorizer applies to the cost of statements in an inner loop
relative to the loop being vectorized, with a new member
inner_loop_cost_factor in the loop vinfo. It also introduces a
parameter vect-inner-loop-cost-factor, whose default value is 50 and
which is used to initialize the inner_loop_cost_factor member.
The motivation is that if targets want one unique function to gather
some information in each add_stmt_cost call, no matter whether it's
placed before or after the cost tweaking part for the inner loop,
they may need to adjust (expand or shrink) the gathered data by this
factor. With the factor hardcoded, that is not easily maintained.
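The factor can then be tuned per invocation rather than by patching
the source (hedged example; 20 is an arbitrary value):
  gcc -O3 --param vect-inner-loop-cost-factor=20 foo.c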
Bootstrapped/regtested on powerpc64le-linux-gnu P9,
x86_64-redhat-linux and aarch64-linux-gnu.
gcc/ChangeLog:
* doc/invoke.texi (vect-inner-loop-cost-factor): Document new
parameter.
* params.opt (vect-inner-loop-cost-factor): New.
* targhooks.c (default_add_stmt_cost): Replace hardcoded factor
50 with LOOP_VINFO_INNER_LOOP_COST_FACTOR, include head file
tree-vectorizer.h and its required ones.
* config/aarch64/aarch64.c (aarch64_add_stmt_cost): Replace
hardcoded factor 50 with LOOP_VINFO_INNER_LOOP_COST_FACTOR.
* config/arm/arm.c (arm_add_stmt_cost): Likewise.
* config/i386/i386.c (ix86_add_stmt_cost): Likewise.
* config/rs6000/rs6000.c (rs6000_add_stmt_cost): Likewise.
* tree-vect-loop.c (vect_compute_single_scalar_iteration_cost):
Likewise.
(_loop_vec_info::_loop_vec_info): Init inner_loop_cost_factor.
* tree-vectorizer.h (_loop_vec_info): Add inner_loop_cost_factor.
(LOOP_VINFO_INNER_LOOP_COST_FACTOR): New macro.
|
|
The rs6000 port function rs6000_density_test wants to know whether
the current cost model evaluation is for the scalar version of a loop
or block, or for the vector version. As Richi suggested, this patch
introduces a new parameter costing_for_scalar to the init_cost hook
to pass down this information explicitly.
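After the change the hook would be declared along these lines (a
sketch based on the ChangeLog, not the verbatim target.def entry):
  void *TARGET_VECTORIZE_INIT_COST (class loop *loop_info,
                                    bool costing_for_scalar);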
gcc/ChangeLog:
* doc/tm.texi: Regenerated.
* target.def (init_cost): Add new parameter costing_for_scalar.
* targhooks.c (default_init_cost): Adjust for new parameter.
* targhooks.h (default_init_cost): Likewise.
* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
(vect_compute_single_scalar_iteration_cost): Likewise.
(vect_analyze_loop_2): Likewise.
* tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
(vect_bb_vectorization_profitable_p): Likewise.
* tree-vectorizer.h (init_cost): Likewise.
* config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
* config/i386/i386.c (ix86_init_cost): Likewise.
* config/rs6000/rs6000.c (rs6000_init_cost): Likewise.
|
|
The following testcase ICEs because disabling DCE means there are dead
stmts in the loop (though, in theory they could become dead only shortly
before if-conv through some optimization). ifcvt, which goes through all
stmts in the loop, if-converts them into .COND_DIV etc. internal fn calls
in the copy of the loop meant for vectorization only. The loop is
successfully vectorized, but the particular .COND_* call is not, because
it isn't a live statement, and the scalar .COND_* remains in the IL until
expansion, where it ICEs because these ifns only support vectors and not
scalars.
These ifns are similar to .MASK_{LOAD,STORE} in this behavior.
One possible fix could be to expand scalar versions of them during
expansion, basically undoing what if-conv did to create them, i.e.
expand them as the lhs = else; if (mask) { lhs = statement; } or so.
For .MASK_LOAD we have code to replace them in vect_transform_loop already
though (not needed for .MASK_STORE, as stores should be always live
and thus always vectorized), so this patch instead replaces .COND_*
similarly to .MASK_LOAD in that loop, with the small difference
that lhs = .MASK_LOAD (...); is replaced by lhs = 0; while
lhs = .COND_* (..., else_arg); is replaced by lhs = else_arg.
The statement must be dead, otherwise it would be vectorized, so I think
it is not a big deal we don't turn it back into multiple basic blocks etc.
(and it might be not possible to do that at that point).
2021-04-16 Jakub Jelinek <jakub@redhat.com>
PR target/99767
* tree-vect-loop.c (vect_transform_loop): Don't remove just
dead scalar .MASK_LOAD calls, but also dead .COND_* calls - replace
them by their last argument.
* gcc.target/aarch64/pr99767.c: New test.
|
|
This avoids (again) the C++ pitfall of pushing a reference to
something that is being reallocated.
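The pitfall in miniature (illustrative only; GCC's vec::safe_push
takes a const reference):
  vec<tree> steps = vNULL;
  /* ... push some elements ... */
  steps.safe_push (steps[0]);  /* if safe_push reallocates, the
                                  reference to steps[0] is read from
                                  freed storage */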
2021-04-07 Richard Biener <rguenther@suse.de>
PR tree-optimization/99947
* tree-vect-loop.c (vectorizable_induction): Pre-allocate
steps vector to avoid pushing elements from the reallocated
vector.
* gcc.dg/torture/pr99947.c: New testcase.
|
|
This adds a relevancy check before trying to set the vector def of
a backedge in an unvectorized PHI.
2021-04-06 Richard Biener <rguenther@suse.de>
PR tree-optimization/99880
* tree-vect-loop.c (maybe_set_vectorized_backedge_value): Only
set vectorized defs of relevant PHIs.
* gcc.dg/torture/pr99880.c: New testcase.
|
|
This patch initializes inside_cost to zero, which avoids using an
uninitialized value on paths that don't assign it.
gcc/ChangeLog:
* tree-vect-loop.c (vect_model_reduction_cost): Init inside_cost.
|
|
This fixes an ordering problem with verifying that no intermediate
computations in a reduction path are used outside of the chain. The
check was disabled for value-preserving conversions at the tail
but whether a stmt was a conversion or not was only computed after
the first use. The following fixes this by re-ordering things
accordingly.
2021-02-25 Richard Biener <rguenther@suse.de>
PR tree-optimization/99253
* tree-vect-loop.c (check_reduction_path): First compute
code, then verify out-of-loop uses.
* gcc.dg/vect/pr99253.c: New testcase.
|
|
When we analyzed a loop as an epilogue but later during peeling
decide we're not going to use it, the DTOR clears the original
loop's ->aux, which causes us to leak the main loop vinfo.
Fixed by only clearing aux if it is associated with the vinfo
we're destroying.
2021-02-10 Richard Biener <rguenther@suse.de>
PR tree-optimization/99024
* tree-vect-loop.c (_loop_vec_info::~_loop_vec_info): Only
clear loop->aux if it is associated with the destroyed loop_vinfo.
|
|
This makes us cost vectorized bswap for SLP, and avoids biasing
towards the vectorized side when costing single-argument PHIs.
Instead we assume coalescing here and cost them with zero cost
for both the scalar and vectorized code.
This doesn't fix the PR on its own.
2021-02-04 Richard Biener <rguenther@suse.de>
PR tree-optimization/98855
* tree-vect-loop.c (vectorizable_phi): Do not cost
single-argument PHIs.
* tree-vect-slp.c (vect_bb_slp_scalar_cost): Likewise.
* tree-vect-stmts.c (vectorizable_bswap): Also perform
costing for SLP operation.
|
|
Previously the SLP pattern matcher was using STMT_VINFO_SLP_VECT_ONLY
as a way to dissolve the SLP-only patterns during SLP cancellation.
However it seems the semantics of STMT_VINFO_SLP_VECT_ONLY are slightly
different from what I expected.
Namely that the non-SLP path can still use a statement marked
STMT_VINFO_SLP_VECT_ONLY. One such example is masked loads which are used both
in the SLP and non-SLP path.
To fix this I now introduce a new flag STMT_VINFO_SLP_VECT_ONLY_PATTERN which is
used only by the pattern matcher.
gcc/ChangeLog:
PR tree-optimization/98928
* tree-vect-loop.c (vect_analyze_loop_2): Change
STMT_VINFO_SLP_VECT_ONLY to STMT_VINFO_SLP_VECT_ONLY_PATTERN.
* tree-vect-slp-patterns.c (complex_pattern::build): Likewise.
* tree-vectorizer.h (STMT_VINFO_SLP_VECT_ONLY_PATTERN): New.
(class _stmt_vec_info): Add slp_vect_pattern_only_p.
gcc/testsuite/ChangeLog:
PR tree-optimization/98928
* gcc.target/i386/pr98928.c: New test.
|
|
This fixes a double-counting in the reduction cost when vectorizing
the reduction through the regular vectorizable_* functions.
2021-01-11 Richard Biener <rguenther@suse.de>
PR tree-optimization/98526
* tree-vect-loop.c (vect_model_reduction_cost): Remove costing
of the actual reduction op for the regular case.
(vectorizable_reduction): Cost the stmts
vect_transform_reduction produces here.
|
|
This fixes extraction of live bool vector results for the case of
integer mode vectors.
2021-01-05 Richard Biener <rguenther@suse.de>
PR tree-optimization/98381
* tree.c (vector_element_bits): Properly compute bool vector
element size.
* tree-vect-loop.c (vectorizable_live_operation): Properly
compute the last lane bit offset.
|
|
On AArch64, the vectoriser tries various ways of vectorising with both
SVE and Advanced SIMD and picks the best one. All other things being
equal, it prefers earlier attempts over later attempts.
The way this works currently is that, once it has a successful
vectorisation attempt A, it analyses all other attempts as epilogue
loops of A:
/* When pick_lowest_cost_p is true, we should in principle iterate
over all the loop_vec_infos that LOOP_VINFO could replace and
try to vectorize LOOP_VINFO under the same conditions.
E.g. when trying to replace an epilogue loop, we should vectorize
LOOP_VINFO as an epilogue loop with the same VF limit. When trying
to replace the main loop, we should vectorize LOOP_VINFO as a main
loop too.
However, autovectorize_vector_modes is usually sorted as follows:
- Modes that naturally produce lower VFs usually follow modes that
naturally produce higher VFs.
- When modes naturally produce the same VF, maskable modes
usually follow unmaskable ones, so that the maskable mode
can be used to vectorize the epilogue of the unmaskable mode.
This order is preferred because it leads to the maximum
epilogue vectorization opportunities. Targets should only use
a different order if they want to make wide modes available while
disparaging them relative to earlier, smaller modes. The assumption
in that case is that the wider modes are more expensive in some
way that isn't reflected directly in the costs.
There should therefore be few interesting cases in which
LOOP_VINFO fails when treated as an epilogue loop, succeeds when
treated as a standalone loop, and ends up being genuinely cheaper
than FIRST_LOOP_VINFO. */
However, the vectoriser can normally elide alias checks for epilogue
loops, on the basis that the main loop should do them instead.
Converting an epilogue loop to a main loop can therefore cause the alias
checks to be skipped. (It probably also unfairly penalises the original
loop in the cost comparison, given that one loop will have alias checks
and the other won't.)
As the comment says, we should in principle analyse each vector mode
twice: once as a main loop and once as an epilogue. However, doing
that up-front would be quite expensive. This patch instead goes for a
compromise: if an epilogue loop for mode M2 seems better than a main
loop for mode M1, re-analyse with M2 as the main loop.
The patch fixes dg.torture.exp=pr69719.c when testing with
-msve-vector-bits=128.
gcc/
PR tree-optimization/98371
* tree-vect-loop.c (vect_reanalyze_as_main_loop): New function.
(vect_analyze_loop): If an epilogue loop appears to be cheaper
than the main loop, re-analyze it as a main loop before adopting
it as a main loop.
|
|
When the VF is one, an SLP reduction is in-order and thus we can
vectorize even when the reduction op is not associative.
2021-01-04 Richard Biener <rguenther@suse.de>
PR tree-optimization/98291
* tree-vect-loop.c (vectorizable_reduction): Bypass
associativity check for SLP reductions with VF 1.
* gcc.dg/vect/slp-reduc-11.c: New testcase.
* gcc.dg/vect/vect-reduc-in-order-4.c: Adjust.
|
|
The problem comes from using the wrong interface to get the index type for a
COND_REDUCTION. For fixed-length SVE we get a V2SI (a 64-bit Advanced
SIMD vector) instead of a VNx2SI (an SVE vector that stores SI elements
in DI containers).
Credits to Richard Sandiford for pointing out the issue's root cause.
The snippet originally proposed in the PR to reproduce the issue only
caused an ICE for the C++ compiler (see the pr98177-1 test cases). I've
slightly modified the original snippet in order to reproduce the issue
with both the C and C++ compilers. These are the pr98177-2 test cases.
gcc/ChangeLog:
PR target/98177
* tree-vect-loop.c (vect_create_epilog_for_reduction): Use
get_same_sized_vectype to obtain index type.
(vectorizable_reduction): Likewise.
gcc/testsuite/ChangeLog:
PR target/98177
* g++.target/aarch64/sve/pr98177-1.C: New test.
* g++.target/aarch64/sve/pr98177-2.C: New test.
* gcc.target/aarch64/sve/pr98177-1.c: New test.
* gcc.target/aarch64/sve/pr98177-2.c: New test.
|
|
While experimenting with some backend costs for Advanced SIMD and SVE I
hit many cases where GCC would pick SVE for VLA auto-vectorisation even when
the backend very clearly presented cheaper costs for Advanced SIMD.
For a simple float addition loop the SVE costs were:
vec.c:9:21: note: Cost model analysis:
Vector inside of loop cost: 28
Vector prologue cost: 2
Vector epilogue cost: 0
Scalar iteration cost: 10
Scalar outside cost: 0
Vector outside cost: 2
prologue iterations: 0
epilogue iterations: 0
Minimum number of vector iterations: 1
Calculated minimum iters for profitability: 4
and for Advanced SIMD (Neon) they're:
vec.c:9:21: note: Cost model analysis:
Vector inside of loop cost: 11
Vector prologue cost: 0
Vector epilogue cost: 0
Scalar iteration cost: 10
Scalar outside cost: 0
Vector outside cost: 0
prologue iterations: 0
epilogue iterations: 0
Calculated minimum iters for profitability: 0
vec.c:9:21: note: Runtime profitability threshold = 4
yet the SVE one was always picked. With guidance from Richard this seems
to be due to the vinfo comparisons in vect_better_loop_vinfo_p, in
particular the part with the big comment explaining the
estimated_rel_new * 2 <= estimated_rel_old heuristic.
This patch extends the comparisons by introducing a three-way estimate
kind for poly_int values that the backend can distinguish.
This allows vect_better_loop_vinfo_p to ask for minimum, maximum and
likely estimates and pick Advanced SIMD over SVE when it is clearly cheaper.
gcc/
* target.h (enum poly_value_estimate_kind): Define.
(estimated_poly_value): Take an estimate kind argument.
* target.def (estimated_poly_value): Update definition for the
above.
* doc/tm.texi: Regenerate.
* targhooks.c (estimated_poly_value): Update prototype.
* tree-vect-loop.c (vect_better_loop_vinfo_p): Use min, max and
likely estimates of VF to pick between vinfos.
* config/aarch64/aarch64.c (aarch64_cmp_autovec_modes): Use
estimated_poly_value instead of aarch64_estimated_poly_value.
(aarch64_estimated_poly_value): Take a kind argument and handle
it.
|
|
This patch adds support for
* Complex Addition with rotation of 90 and 270.
Addition with rotation of the second argument around the Argand plane.
Supported rotations are 90 and 270.
c = a + (b * I) and c = a + (b * I * I * I)
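On interleaved {real, imag} lanes the two rotations work out to
(standard complex arithmetic, shown for clarity):
  ROT90:  c.re = a.re - b.im;  c.im = a.im + b.re;
  ROT270: c.re = a.re + b.im;  c.im = a.im - b.re;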
gcc/ChangeLog:
* tree-vect-slp-patterns.c: New file.
* Makefile.in: Add it.
* doc/passes.texi: Document it.
* internal-fn.def (COMPLEX_ADD_ROT90, COMPLEX_ADD_ROT270): New.
* optabs.def (cadd90_optab, cadd270_optab): New.
* doc/md.texi: Document them.
* tree-vect-loop.c (vect_analyze_loop_2): Add dissolve code.
* tree-vect-slp.c:
(vect_free_slp_instance, vect_create_new_slp_node): Export.
(vect_match_slp_patterns_2, vect_match_slp_patterns): New.
(vect_analyze_slp): Use it.
* tree-vectorizer.h (vect_free_slp_tree): Export.
(enum _complex_operation): Forward declare.
(class vect_pattern): New.
gcc/testsuite/ChangeLog:
* lib/target-supports.exp
(check_effective_target_arm_v8_3a_complex_neon_ok_nocache): Fix it.
(check_effective_target_vect_complex_add_byte
,check_effective_target_vect_complex_add_int
,check_effective_target_vect_complex_add_short
,check_effective_target_vect_complex_add_long
,check_effective_target_vect_complex_add_half
,check_effective_target_vect_complex_add_float
,check_effective_target_vect_complex_add_double): New.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-byte.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-int.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-long.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-short.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-unsigned-byte.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-unsigned-int.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-unsigned-long.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-unsigned-short.c: New test.
* gcc.dg/vect/complex/complex-add-pattern-template.c: New test.
* gcc.dg/vect/complex/complex-add-template.c: New test.
* gcc.dg/vect/complex/complex-operations-run.c: New test.
* gcc.dg/vect/complex/complex-operations.c: New test.
* gcc.dg/vect/complex/complex.exp: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-double.c: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-float.c: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-half-float.c: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-pattern-double.c: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-pattern-float.c: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-pattern-half-float.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-double.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-float.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-half-float.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-pattern-double.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-pattern-float.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-byte.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-int.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-long.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-short.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-unsigned-byte.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-unsigned-int.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-unsigned-long.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-unsigned-short.c: New test.
|
|
This makes sure to not call maybe_set_vectorized_backedge_value when
we did not vectorize the latch def candidate.
2020-12-02 Richard Biener <rguenther@suse.de>
* tree-vect-loop.c (vect_transform_loop_stmt): Return whether
we vectorized a stmt.
(vect_transform_loop): Only call maybe_set_vectorized_backedge_value
when we vectorized the stmt.
|
|
This avoids breaking LC SSA when SLP codegen pulled an out-of-loop
def into a loop when merging with in-loop defs for an external def.
2020-11-30 Richard Biener <rguenther@suse.de>
PR tree-optimization/98064
* tree-vect-loop.c (vectorizable_live_operation): Avoid
breaking LC SSA for BB vectorization.
* g++.dg/vect/pr98064.cc: New testcase.
|
|
Currently we have three vector cost models: cheap, dynamic and
unlimited. -O2 -ftree-vectorize uses “cheap” by default, but that's
still relatively aggressive about peeling and aliasing checks,
and can lead to significant code size growth.
This patch adds an even more conservative choice, which for lack of
imagination I've called “very cheap”. It only allows vectorisation
if the vector code entirely replaces the scalar code. It also
requires one iteration of the vector loop to pay for itself,
regardless of how often the loop iterates. (If the vector loop
needs multiple iterations to be beneficial then things are
probably too close to call, and the conservative thing would
be to stick with the scalar code.)
The idea is that this should be suitable for -O2, although the patch
doesn't change any defaults itself.
I tested this by building and running a bunch of workloads for SVE,
with three options:
(1) -O2
(2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
(3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
All three builds used the default -msve-vector-bits=scalable and
ran with the minimum vector length of 128 bits, which should give
a worst-case bound for the performance impact.
The workloads included a mixture of microbenchmarks and full
applications. Because it's quite an eclectic mix, there's not
much point giving exact figures. The aim was more to get a general
impression.
Code size growth with (2) was much lower than with (3). Only a
handful of tests increased by more than 5%, and all of them were
microbenchmarks.
In terms of performance, (2) was significantly faster than (1)
on microbenchmarks (as expected) but also on some full apps.
Again, performance only regressed on a handful of tests.
As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
of a mixed bag. There are several significant improvements with (3)
over (2), but also some (smaller) regressions. That seems to be in
line with -O2 -ftree-vectorize being a kind of -O2.5.
The patch reorders vect_cost_model so that values are in order
of increasing aggressiveness, which makes it possible to use
range checks. The value 0 still represents “unlimited”,
so “if (flag_vect_cost_model)” is still a meaningful check.
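A sketch of the reordered enum (the exact values are an assumption;
the commit only guarantees that 0 stays “unlimited” and that values
increase with aggressiveness):
  enum vect_cost_model {
    VECT_COST_MODEL_VERY_CHEAP = -3,
    VECT_COST_MODEL_CHEAP = -2,
    VECT_COST_MODEL_DYNAMIC = -1,
    VECT_COST_MODEL_UNLIMITED = 0,
    VECT_COST_MODEL_DEFAULT = 1
  };
With such an ordering, “flag_vect_cost_model <= VECT_COST_MODEL_CHEAP”
covers both the cheap and very-cheap models in one range check.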
gcc/
* doc/invoke.texi (-fvect-cost-model): Add a very-cheap model.
* common.opt (fvect-cost-model=): Add very-cheap as a possible option.
(fsimd-cost-model=): Likewise.
(vect_cost_model): Add very-cheap.
* flag-types.h (vect_cost_model): Add VECT_COST_MODEL_VERY_CHEAP.
Put the values in order of increasing aggressiveness.
* tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Use
range checks when comparing against VECT_COST_MODEL_CHEAP.
(vect_prune_runtime_alias_test_list): Do not allow any alias
checks for the very-cheap cost model.
* tree-vect-loop.c (vect_analyze_loop_costing): Do not allow
any peeling for the very-cheap cost model. Also require one
iteration of the vector loop to pay for itself.
gcc/testsuite/
* gcc.dg/vect/vect-cost-model-1.c: New test.
* gcc.dg/vect/vect-cost-model-2.c: Likewise.
* gcc.dg/vect/vect-cost-model-3.c: Likewise.
* gcc.dg/vect/vect-cost-model-4.c: Likewise.
* gcc.dg/vect/vect-cost-model-5.c: Likewise.
* gcc.dg/vect/vect-cost-model-6.c: Likewise.
|
|
This makes vectorization properly assign vector types to PHI
nodes that copy from externals on loop exit edges.
2020-11-18 Richard Biener <rguenther@suse.de>
PR tree-optimization/97886
* tree-vect-loop.c (vectorizable_lc_phi): Properly assign
vector types to invariants for SLP.
|
|
This delays filling SLP_INSTANCE_LOADS.
2020-11-16 Richard Biener <rguenther@suse.de>
* tree-vectorizer.h (vect_gather_slp_loads): Declare.
* tree-vect-loop.c (vect_analyze_loop_2): Call
vect_gather_slp_loads.
* tree-vect-slp.c (vect_build_slp_instance): Do not gather
SLP loads here.
(vect_gather_slp_loads): Remove wrapper, new function.
(vect_slp_analyze_bb_1): Call it.
|
|
We're stripping conversions off the access functions of inductions and
thus the step can be of a different sign. Fix bogus step CTORs by
converting the elements rather than the whole vector.
2020-11-16 Richard Biener <rguenther@suse.de>
PR tree-optimization/97835
* tree-vect-loop.c (vectorizable_induction): Convert step
scalars rather than step vector.
* gcc.dg/vect/pr97835.c: New testcase.
|
|
This makes sure we reject reduction paths with a live stmt that
is not the last one altering the value. This is because we do not
handle this in the epilogue unless there's a scalar epilogue loop.
2020-11-09 Richard Biener <rguenther@suse.de>
PR tree-optimization/97760
* tree-vect-loop.c (check_reduction_path): Reject
reduction paths we do not handle in epilogue generation.
* gcc.dg/vect/pr97760.c: New testcase.
|
|
This fixes updating of the step vectors when filling up to group_size.
2020-11-09 Richard Biener <rguenther@suse.de>
PR tree-optimization/97753
* tree-vect-loop.c (vectorizable_induction): Fill vec_steps
when CSEing inside the group.
* gcc.dg/vect/pr97753.c: New testcase.
|
|
This PR exposes two issues - one that the vector builder treats
&x as eligible for VECTOR_CST elements and one that SLP induction
vectorization forgets to convert init elements to the vector
component type, which makes a difference for pointer vs. integer.
2020-11-06 Richard Biener <rguenther@suse.de>
PR tree-optimization/97732
* tree-vect-loop.c (vectorizable_induction): Convert the
init elements to the vector component type.
* gimple-fold.c (gimple_build_vector): Use CONSTANT_CLASS_P
rather than TREE_CONSTANT to determine if elements are
eligible for VECTOR_CSTs.
* gcc.dg/vect/bb-slp-pr97732.c: New testcase.
|
|
This patch stores the SLP instance kind in the SLP instance so that we can use
it later when detecting load/store lanes support.
This also changes the load/store-lanes support check to only check
whether the SLP kind is a store. This means that in order for
load/store-lanes to be used, all instances must be of kind store.
gcc/ChangeLog:
* tree-vect-loop.c (vect_analyze_loop_2): Check kind.
* tree-vect-slp.c (vect_build_slp_instance): New.
(enum slp_instance_kind): Move to...
* tree-vectorizer.h (enum slp_instance_kind): .. Here
(SLP_INSTANCE_KIND): New.
|
|
This moves the code that checks for load/store lanes further in the pipeline and
places it after slp_optimize. This would allow us to perform optimizations on
the SLP tree and only bail out if we really have a permute.
With this change it allows us to handle permutes such as {1,1,1,1} which should
be handled by a load and replicate.
This change however makes it all or nothing. Either all instances can be handled
or none at all. This is why some of the test cases have been adjusted.
gcc/ChangeLog:
* tree-vect-slp.c (vect_analyze_slp_instance): Moved load/store lanes
check to ...
* tree-vect-loop.c (vect_analyze_loop_2): ..Here
gcc/testsuite/ChangeLog:
* gcc.dg/vect/slp-11b.c: Update output scan.
* gcc.dg/vect/slp-perm-6.c: Likewise.
|