vectorizable_reduction had code guarded by:

  if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_reduction_def
      || STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def)

But that's always true after:

  if (STMT_VINFO_DEF_TYPE (stmt_info) != vect_reduction_def
      && STMT_VINFO_DEF_TYPE (stmt_info) != vect_double_reduction_def
      && STMT_VINFO_DEF_TYPE (stmt_info) != vect_nested_cycle)
    return false;

  if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_nested_cycle)
    {
      …
      return true;
    }
(I wasn't sure at first how the empty “else” for the first “if” above
was supposed to work.)
gcc/
* tree-vect-loop.c (vectorizable_reduction): Remove always-true
if condition.
---
This adds a simple reduction vectorization capability to the
non-loop vectorizer.  "Simple" meaning it lacks any of the fancy
ways to generate the reduction epilogue and only supports those
reductions we can handle via a direct internal function reducing
a vector to a scalar.  One of the main reasons is to avoid
massive refactoring at this point, but also that more complex
epilogue operations are hardly profitable.
Mixed-sign reductions are fended off for now, and I have not
settled yet on whether we want an explicit SLP node for the
reduction epilogue operation.  Handling mixed signs could be
done by multiplying with a { 1, -1, .. } vector.  Also fended
off are reductions with non-internal operands (constants
or register parameters, for example).
Costing is done by accounting the originally participating scalar
stmts for the scalar cost, and log2 permutes and operations for
the vectorized epilogue.
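For illustration, a minimal sketch (mine, not the added testcase) of the
kind of basic-block reduction this targets - an associatable chain the
vectorizer can now cover with a single vector-to-scalar reduction such
as .REDUC_PLUS:

  /* Hypothetical example: the four loads become SLP lanes and the
     scalar addition chain is replaced by one vector reduction.  */
  float
  f (const float *a)
  {
    return a[0] + a[1] + a[2] + a[3];
  }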
--
SPEC CPU 2017 FP rate workload measurements (picking the fastest
of three runs) show regressions for 507.cactuBSSN_r (1.5%),
508.namd_r (2.5%), 511.povray_r (2.5%), 526.blender_r (0.5%) and
527.cam4_r (2.5%), and improvements for 510.parest_r (5%) and
538.imagick_r (1.5%).  This is with -Ofast -march=znver2 on a Zen2.
Statistics on CPU 2017 show that the overwhelming number of seeds
we find are reductions of two lanes (well - that's basically every
associative operation).  That means we put quite high pressure
on the SLP discovery process this way.
In total we find 583218 seeds that we put to SLP discovery, out of
which 66205 pass that and only 6185 of those make it through
code generation checks.  796 of those are discarded because the reduction
is part of a larger SLP instance, 4195 of the remaining
are deemed not profitable to vectorize, and 1194 are finally
vectorized.  That's a poor 0.2% rate.
Of the 583218 seeds 486826 (83%) have two lanes, 60912 have three (10%),
28181 four (5%), 4808 five, 909 six and there are instances up to 120
lanes.
There's a set of 54086 candidate seeds we reject because
they contain a constant or invariant (not implemented yet) but still
have two or more lanes that could be put to SLP discovery.
2021-06-16 Richard Biener <rguenther@suse.de>
PR tree-optimization/54400
* tree-vectorizer.h (enum slp_instance_kind): Add
slp_inst_kind_bb_reduc.
(reduction_fn_for_scalar_code): Declare.
* tree-vect-data-refs.c (vect_slp_analyze_instance_dependence):
Check SLP_INSTANCE_KIND instead of looking at the
representative.
(vect_slp_analyze_instance_alignment): Likewise.
* tree-vect-loop.c (reduction_fn_for_scalar_code): Export.
* tree-vect-slp.c (vect_slp_linearize_chain): Split out
chain linearization from vect_build_slp_tree_2 and generalize
for the use of BB reduction vectorization.
(vect_build_slp_tree_2): Adjust accordingly.
(vect_optimize_slp): Elide permutes at the root of BB reduction
instances.
(vectorizable_bb_reduc_epilogue): New function.
(vect_slp_prune_covered_roots): Likewise.
(vect_slp_analyze_operations): Use them.
(vect_slp_check_for_constructors): Recognize associatable
chains for BB reduction vectorization.
(vectorize_slp_instance_root_stmt): Generate code for the
BB reduction epilogue.
* gcc.dg/vect/bb-slp-pr54400.c: New testcase.
---
The following fixes the SLP FMA patterns to preserve reduction
info and the reduction vectorization to consider internal function
call defs for the reduction stmt.
2021-06-09 Richard Biener <rguenther@suse.de>
PR tree-optimization/100981
gcc/
* tree-vect-loop.c (vect_create_epilog_for_reduction): Use
gimple_get_lhs to also handle calls.
* tree-vect-slp-patterns.c (complex_pattern::build): Transfer
reduction info.
gcc/testsuite/
* gfortran.dg/vect/pr100981-1.f90: New testcase.
libgomp/
* testsuite/libgomp.fortran/pr100981-2.f90: New testcase.
---
This patch uses the knowledge of the conditions to enter an epilogue loop to
help come up with a potentially more restrictive upper bound.
gcc/ChangeLog:
* tree-vect-loop.c (vect_transform_loop): Use the main loop's various
thresholds to narrow the upper bound on epilogue iterations.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/sve/part_vect_single_iter_epilog.c: New test.
---
This patch replaces the hardcoded weight factor 50, which the
loop vectorizer applies to the cost of statements in an inner
loop relative to the loop being vectorized, with a new member
inner_loop_cost_factor in the loop vinfo.  It also introduces a
parameter vect-inner-loop-cost-factor, whose default value is 50
and which is used to initialize the inner_loop_cost_factor member.
The motivation is that if targets want one unique function to
gather some information in each add_stmt_cost call, no matter
whether it's placed before or after the cost tweaking part for
the inner loop, they may need to adjust (expand or shrink) the
gathered data by the factor.  While the factor is hardcoded,
that's not easily maintained.
Bootstrapped/regtested on powerpc64le-linux-gnu P9,
x86_64-redhat-linux and aarch64-linux-gnu.
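As a sketch (my example, not from the patch) of where the factor
applies: when vectorizing the outer loop below, the statements of the
inner loop have their cost weighted by vect-inner-loop-cost-factor
(default 50) relative to the outer loop's statements.

  /* Hypothetical outer-loop vectorization candidate; the inner
     loop's stmt costs are scaled by the inner loop cost factor.  */
  void
  f (float *restrict a, const float *restrict b, int n)
  {
    for (int i = 0; i < n; i++)      /* loop being vectorized */
      for (int j = 0; j < 8; j++)    /* inner loop, costs weighted */
        a[i] += b[i * 8 + j];
  }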
gcc/ChangeLog:
* doc/invoke.texi (vect-inner-loop-cost-factor): Document new
parameter.
* params.opt (vect-inner-loop-cost-factor): New.
* targhooks.c (default_add_stmt_cost): Replace hardcoded factor
50 with LOOP_VINFO_INNER_LOOP_COST_FACTOR, include head file
tree-vectorizer.h and its required ones.
* config/aarch64/aarch64.c (aarch64_add_stmt_cost): Replace
hardcoded factor 50 with LOOP_VINFO_INNER_LOOP_COST_FACTOR.
* config/arm/arm.c (arm_add_stmt_cost): Likewise.
* config/i386/i386.c (ix86_add_stmt_cost): Likewise.
* config/rs6000/rs6000.c (rs6000_add_stmt_cost): Likewise.
* tree-vect-loop.c (vect_compute_single_scalar_iteration_cost):
Likewise.
(_loop_vec_info::_loop_vec_info): Init inner_loop_cost_factor.
* tree-vectorizer.h (_loop_vec_info): Add inner_loop_cost_factor.
(LOOP_VINFO_INNER_LOOP_COST_FACTOR): New macro.
---
The rs6000 port function rs6000_density_test wants to know whether
the current cost model is for the scalar version of a loop or block,
or for the vector version.  As Richi suggested, this patch introduces
a new parameter costing_for_scalar to the init_cost hook to pass this
information down explicitly.
gcc/ChangeLog:
* doc/tm.texi: Regenerated.
* target.def (init_cost): Add new parameter costing_for_scalar.
* targhooks.c (default_init_cost): Adjust for new parameter.
* targhooks.h (default_init_cost): Likewise.
* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Likewise.
(vect_compute_single_scalar_iteration_cost): Likewise.
(vect_analyze_loop_2): Likewise.
* tree-vect-slp.c (_bb_vec_info::_bb_vec_info): Likewise.
(vect_bb_vectorization_profitable_p): Likewise.
* tree-vectorizer.h (init_cost): Likewise.
* config/aarch64/aarch64.c (aarch64_init_cost): Likewise.
* config/i386/i386.c (ix86_init_cost): Likewise.
* config/rs6000/rs6000.c (rs6000_init_cost): Likewise.
---
The following testcase ICEs because disabling DCE means there are dead
stmts in the loop (though, in theory, they could become dead only
shortly before if-conversion through some optimization).  if-conversion,
which goes through all stmts in the loop, if-converts them into
.COND_DIV etc. internal fn calls in the copy of the loop meant for
vectorization only.  The loop is successfully vectorized, but the
particular .COND_* call is not, because it isn't a live statement, and
the scalar .COND_* remains in the IL until expansion, where it ICEs
because these ifns only support vectors and not scalars.
These ifns are similar to .MASK_{LOAD,STORE} in this behavior.
One possible fix would be to expand scalar versions of them during
expansion, basically undoing what if-conversion did to create them, i.e.
expand them as lhs = else; if (mask) { lhs = statement; } or so.
For .MASK_LOAD we already have code to replace them in vect_transform_loop
though (not needed for .MASK_STORE, as stores should always be live
and thus always vectorized), so this patch instead replaces .COND_*
similarly to .MASK_LOAD in that loop, with the small difference
that lhs = .MASK_LOAD (...); is replaced by lhs = 0; while
lhs = .COND_* (..., else_arg); is replaced by lhs = else_arg;.
The statement must be dead, otherwise it would be vectorized, so I think
it is not a big deal that we don't turn it back into multiple basic
blocks etc. (and it might not be possible to do that at that point).
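As a sketch of the scalar semantics involved (plain C, assuming a call
of the form lhs = .COND_DIV (mask, a, b, else_val)), the if-converted
statement stands for the code below, which is why replacing a dead
scalar call by its last argument preserves behavior:

  /* Hypothetical scalar equivalent of
     lhs = .COND_DIV (mask, a, b, else_val).  */
  int
  cond_div (int mask, int a, int b, int else_val)
  {
    int lhs = else_val;
    if (mask)
      lhs = a / b;
    return lhs;
  }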
2021-04-16 Jakub Jelinek <jakub@redhat.com>
PR target/99767
* tree-vect-loop.c (vect_transform_loop): Don't remove just
dead scalar .MASK_LOAD calls, but also dead .COND_* calls - replace
them by their last argument.
* gcc.target/aarch64/pr99767.c: New test.
---
This avoids (again) the C++ pitfall of pushing a reference to
something being reallocated.
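A minimal C analogue of the pitfall (a sketch of mine; the GCC code
pushes elements of the very vec<> being grown):

  #include <stdlib.h>

  /* Taking a pointer into the buffer and then growing the buffer
     leaves the pointer dangling; reading through it afterwards is a
     use-after-free.  Pre-allocating (as the patch does for the steps
     vector) avoids the reallocation entirely.  */
  void
  push_first_again (void)
  {
    size_t n = 4, cap = 4;
    int *v = malloc (cap * sizeof *v);
    for (size_t i = 0; i < n; i++)
      v[i] = (int) i;
    int *elt = &v[0];                         /* "reference" into v */
    v = realloc (v, (cap *= 2) * sizeof *v);  /* elt now dangles */
    v[n++] = *elt;                            /* reads freed memory */
    free (v);
  }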
2021-04-07 Richard Biener <rguenther@suse.de>
PR tree-optimization/99947
* tree-vect-loop.c (vectorizable_induction): Pre-allocate
steps vector to avoid pushing elements from the reallocated
vector.
* gcc.dg/torture/pr99947.c: New testcase.
---
This adds a relevancy check before trying to set the vector def of
a backedge in an unvectorized PHI.
2021-04-06 Richard Biener <rguenther@suse.de>
PR tree-optimization/99880
* tree-vect-loop.c (maybe_set_vectorized_backedge_value): Only
set vectorized defs of relevant PHIs.
* gcc.dg/torture/pr99880.c: New testcase.
---
This patch initializes inside_cost to zero, which avoids using
an uninitialized value when some path doesn't assign it.
gcc/ChangeLog:
* tree-vect-loop.c (vect_model_reduction_cost): Init inside_cost.
---
This fixes an ordering problem with verifying that no intermediate
computations in a reduction path are used outside of the chain. The
check was disabled for value-preserving conversions at the tail
but whether a stmt was a conversion or not was only computed after
the first use. The following fixes this by re-ordering things
accordingly.
2021-02-25 Richard Biener <rguenther@suse.de>
PR tree-optimization/99253
* tree-vect-loop.c (check_reduction_path): First compute
code, then verify out-of-loop uses.
* gcc.dg/vect/pr99253.c: New testcase.
---
When we analyze a loop as an epilogue but later during peeling decide
we're not going to use it, the destructor clears the original
loop's ->aux, which causes us to leak the main loop vinfo.
Fixed by only clearing aux if it is associated with the vinfo
we're destroying.
2021-02-10 Richard Biener <rguenther@suse.de>
PR tree-optimization/99024
* tree-vect-loop.c (_loop_vec_info::~_loop_vec_info): Only
clear loop->aux if it is associated with the destroyed loop_vinfo.
---
This fixes the failure to cost vectorized bswap for SLP, and also
avoids biasing toward the vectorized side when costing single-argument
PHIs.  Instead we assume coalescing here and cost them with zero cost
for both the scalar and vectorized code.
This doesn't fix the PR on its own.
2021-02-04 Richard Biener <rguenther@suse.de>
PR tree-optimization/98855
* tree-vect-loop.c (vectorizable_phi): Do not cost
single-argument PHIs.
* tree-vect-slp.c (vect_bb_slp_scalar_cost): Likewise.
* tree-vect-stmts.c (vectorizable_bswap): Also perform
costing for SLP operation.
---
Previously the SLP pattern matcher used STMT_VINFO_SLP_VECT_ONLY as a
way to dissolve the SLP-only patterns during SLP cancellation.  However,
the semantics of STMT_VINFO_SLP_VECT_ONLY turn out to be slightly
different from what I expected: the non-SLP path can still use a
statement marked STMT_VINFO_SLP_VECT_ONLY.  One such example is masked
loads, which are used in both the SLP and non-SLP paths.
To fix this I now introduce a new flag STMT_VINFO_SLP_VECT_ONLY_PATTERN
which is used only by the pattern matcher.
gcc/ChangeLog:
PR tree-optimization/98928
* tree-vect-loop.c (vect_analyze_loop_2): Change
STMT_VINFO_SLP_VECT_ONLY to STMT_VINFO_SLP_VECT_ONLY_PATTERN.
* tree-vect-slp-patterns.c (complex_pattern::build): Likewise.
* tree-vectorizer.h (STMT_VINFO_SLP_VECT_ONLY_PATTERN): New.
(class _stmt_vec_info): Add slp_vect_pattern_only_p.
gcc/testsuite/ChangeLog:
PR tree-optimization/98928
* gcc.target/i386/pr98928.c: New test.
---
This fixes a double-counting in the reduction cost when vectorizing
the reduction through the regular vectorizable_* functions.
2021-01-11 Richard Biener <rguenther@suse.de>
PR tree-optimization/98526
* tree-vect-loop.c (vect_model_reduction_cost): Remove costing
of the actual reduction op for the regular case.
(vectorizable_reduction): Cost the stmts
vect_transform_reduction produces here.
---
This fixes extraction of live bool vector results for the case of
integer mode vectors.
2021-01-05 Richard Biener <rguenther@suse.de>
PR tree-optimization/98381
* tree.c (vector_element_bits): Properly compute bool vector
element size.
* tree-vect-loop.c (vectorizable_live_operation): Properly
compute the last lane bit offset.
---
On AArch64, the vectoriser tries various ways of vectorising with both
SVE and Advanced SIMD and picks the best one. All other things being
equal, it prefers earlier attempts over later attempts.
The way this works currently is that, once it has a successful
vectorisation attempt A, it analyses all other attempts as epilogue
loops of A:
  /* When pick_lowest_cost_p is true, we should in principle iterate
     over all the loop_vec_infos that LOOP_VINFO could replace and
     try to vectorize LOOP_VINFO under the same conditions.
     E.g. when trying to replace an epilogue loop, we should vectorize
     LOOP_VINFO as an epilogue loop with the same VF limit.  When trying
     to replace the main loop, we should vectorize LOOP_VINFO as a main
     loop too.

     However, autovectorize_vector_modes is usually sorted as follows:
     - Modes that naturally produce lower VFs usually follow modes that
       naturally produce higher VFs.
     - When modes naturally produce the same VF, maskable modes
       usually follow unmaskable ones, so that the maskable mode
       can be used to vectorize the epilogue of the unmaskable mode.

     This order is preferred because it leads to the maximum
     epilogue vectorization opportunities.  Targets should only use
     a different order if they want to make wide modes available while
     disparaging them relative to earlier, smaller modes.  The assumption
     in that case is that the wider modes are more expensive in some
     way that isn't reflected directly in the costs.

     There should therefore be few interesting cases in which
     LOOP_VINFO fails when treated as an epilogue loop, succeeds when
     treated as a standalone loop, and ends up being genuinely cheaper
     than FIRST_LOOP_VINFO.  */
However, the vectoriser can normally elide alias checks for epilogue
loops, on the basis that the main loop should do them instead.
Converting an epilogue loop to a main loop can therefore cause the alias
checks to be skipped. (It probably also unfairly penalises the original
loop in the cost comparison, given that one loop will have alias checks
and the other won't.)
As the comment says, we should in principle analyse each vector mode
twice: once as a main loop and once as an epilogue. However, doing
that up-front would be quite expensive. This patch instead goes for a
compromise: if an epilogue loop for mode M2 seems better than a main
loop for mode M1, re-analyse with M2 as the main loop.
The patch fixes dg.torture.exp=pr69719.c when testing with
-msve-vector-bits=128.
gcc/
PR tree-optimization/98371
* tree-vect-loop.c (vect_reanalyze_as_main_loop): New function.
(vect_analyze_loop): If an epilogue loop appears to be cheaper
than the main loop, re-analyze it as a main loop before adopting
it as a main loop.
---
When the VF is one, an SLP reduction is in-order and thus we can
vectorize even when the reduction op is not associative.
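A sketch (my example) of the shape this enables: without -ffast-math
the FP additions below are non-associative, but as a two-lane SLP
reduction at VF 1 each vector lane keeps its original evaluation order,
so the reduction is in-order.

  /* Hypothetical two-lane SLP reduction; at VF 1 each lane is its
     own in-order chain, so associativity is not required.  */
  double
  f (const double *a, int n)
  {
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i < n; i++)
      {
        s0 += a[2 * i];
        s1 += a[2 * i + 1];
      }
    return s0 - s1;
  }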
2021-01-04 Richard Biener <rguenther@suse.de>
PR tree-optimization/98291
* tree-vect-loop.c (vectorizable_reduction): Bypass
associativity check for SLP reductions with VF 1.
* gcc.dg/vect/slp-reduc-11.c: New testcase.
* gcc.dg/vect/vect-reduc-in-order-4.c: Adjust.
|
---
The problem comes from using the wrong interface to get the index type
for a COND_REDUCTION.  For fixed-length SVE we get a V2SI (a 64-bit
Advanced SIMD vector) instead of a VNx2SI (an SVE vector that stores SI
elements in DI containers).
Credit to Richard Sandiford for pointing out the issue's root cause.
The original snippet proposed in the PR to reproduce the issue only
caused an ICE with the C++ compiler (see the pr98177-1 test cases).
I've slightly modified the original snippet in order to reproduce the
issue with both the C and C++ compilers; these are the pr98177-2 test
cases.
gcc/ChangeLog:
PR target/98177
* tree-vect-loop.c (vect_create_epilog_for_reduction): Use
get_same_sized_vectype to obtain index type.
(vectorizable_reduction): Likewise.
gcc/testsuite/ChangeLog:
PR target/98177
* g++.target/aarch64/sve/pr98177-1.C: New test.
* g++.target/aarch64/sve/pr98177-2.C: New test.
* gcc.target/aarch64/sve/pr98177-1.c: New test.
* gcc.target/aarch64/sve/pr98177-2.c: New test.
---
vect_better_loop_vinfo_p
While experimenting with some backend costs for Advanced SIMD and SVE I
hit many cases where GCC would pick SVE for VLA auto-vectorisation even when
the backend very clearly presented cheaper costs for Advanced SIMD.
For a simple float addition loop the SVE costs were:

  vec.c:9:21: note: Cost model analysis:
    Vector inside of loop cost: 28
    Vector prologue cost: 2
    Vector epilogue cost: 0
    Scalar iteration cost: 10
    Scalar outside cost: 0
    Vector outside cost: 2
    prologue iterations: 0
    epilogue iterations: 0
    Minimum number of vector iterations: 1
    Calculated minimum iters for profitability: 4

and for Advanced SIMD (Neon) they're:

  vec.c:9:21: note: Cost model analysis:
    Vector inside of loop cost: 11
    Vector prologue cost: 0
    Vector epilogue cost: 0
    Scalar iteration cost: 10
    Scalar outside cost: 0
    Vector outside cost: 0
    prologue iterations: 0
    epilogue iterations: 0
    Calculated minimum iters for profitability: 0
  vec.c:9:21: note: Runtime profitability threshold = 4
yet the SVE one was always picked. With guidance from Richard this seems
to be due to the vinfo comparisons in vect_better_loop_vinfo_p, in
particular the part with the big comment explaining the
estimated_rel_new * 2 <= estimated_rel_old heuristic.
This patch extends the comparisons by introducing a three-way estimate
kind for poly_int values that the backend can distinguish.
This allows vect_better_loop_vinfo_p to ask for minimum, maximum and
likely estimates, and to pick Advanced SIMD over SVE when it is clearly
cheaper.
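A sketch of the new three-way estimate kind (names as added to
target.h):

  /* Minimum, maximum and likely estimates of a poly_int value,
     e.g. of the number of elements in an SVE vector.  */
  enum poly_value_estimate_kind
  {
    POLY_VALUE_MIN,
    POLY_VALUE_MAX,
    POLY_VALUE_LIKELY
  };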
gcc/
* target.h (enum poly_value_estimate_kind): Define.
(estimated_poly_value): Take an estimate kind argument.
* target.def (estimated_poly_value): Update definition for the
above.
* doc/tm.texi: Regenerate.
* targhooks.c (estimated_poly_value): Update prototype.
* tree-vect-loop.c (vect_better_loop_vinfo_p): Use min, max and
likely estimates of VF to pick between vinfos.
* config/aarch64/aarch64.c (aarch64_cmp_autovec_modes): Use
estimated_poly_value instead of aarch64_estimated_poly_value.
(aarch64_estimated_poly_value): Take a kind argument and handle
it.
---
This patch adds support for:
* Complex addition with rotation of 90 and 270 degrees.
Addition with rotation of the second argument around the Argand plane.
Supported rotations are 90 and 270:
c = a + (b * I) and c = a + (b * I * I * I)
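A minimal example (mine, not one of the new testcases) of the loop
shape the ROT90 pattern matches:

  #include <complex.h>

  /* c = a + b * I: complex addition with the second operand rotated
     by 90 degrees around the Argand plane (cadd90).  */
  void
  f (_Complex float *restrict a, _Complex float *restrict b,
     _Complex float *restrict c, int n)
  {
    for (int i = 0; i < n; i++)
      c[i] = a[i] + b[i] * I;
  }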
gcc/ChangeLog:
* tree-vect-slp-patterns.c: New file.
* Makefile.in: Add it.
* doc/passes.texi: Document it.
* internal-fn.def (COMPLEX_ADD_ROT90, COMPLEX_ADD_ROT270): New.
* optabs.def (cadd90_optab, cadd270_optab): New.
* doc/md.texi: Document them.
* tree-vect-loop.c (vect_analyze_loop_2): Add dissolve code.
* tree-vect-slp.c:
(vect_free_slp_instance, vect_create_new_slp_node): Export.
(vect_match_slp_patterns_2, vect_match_slp_patterns): New.
(vect_analyze_slp): Use it.
* tree-vectorizer.h (vect_free_slp_tree): Export.
(enum _complex_operation): Forward declare.
(class vect_pattern): New
gcc/testsuite/ChangeLog:
* lib/target-supports.exp
(check_effective_target_arm_v8_3a_complex_neon_ok_nocache): Fix it.
(check_effective_target_vect_complex_add_byte
,check_effective_target_vect_complex_add_int
,check_effective_target_vect_complex_add_short
,check_effective_target_vect_complex_add_long
,check_effective_target_vect_complex_add_half
,check_effective_target_vect_complex_add_float
,check_effective_target_vect_complex_add_double): New.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-byte.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-int.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-long.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-short.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-unsigned-byte.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-unsigned-int.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-unsigned-long.c: New test.
* gcc.dg/vect/complex/bb-slp-complex-add-pattern-unsigned-short.c: New test.
* gcc.dg/vect/complex/complex-add-pattern-template.c: New test.
* gcc.dg/vect/complex/complex-add-template.c: New test.
* gcc.dg/vect/complex/complex-operations-run.c: New test.
* gcc.dg/vect/complex/complex-operations.c: New test.
* gcc.dg/vect/complex/complex.exp: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-double.c: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-float.c: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-half-float.c: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-pattern-double.c: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-pattern-float.c: New test.
* gcc.dg/vect/complex/fast-math-bb-slp-complex-add-pattern-half-float.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-double.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-float.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-half-float.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-pattern-double.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-pattern-float.c: New test.
* gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-byte.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-int.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-long.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-short.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-unsigned-byte.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-unsigned-int.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-unsigned-long.c: New test.
* gcc.dg/vect/complex/vect-complex-add-pattern-unsigned-short.c: New test.
---
This makes sure to not call maybe_set_vectorized_backedge_value when
we did not vectorize the latch def candidate.
2020-12-02 Richard Biener <rguenther@suse.de>
* tree-vect-loop.c (vect_transform_loop_stmt): Return whether
we vectorized a stmt.
(vect_transform_loop): Only call maybe_set_vectorized_backedge_value
when we vectorized the stmt.
---
This avoids breaking LC SSA when SLP codegen pulled an out-of-loop
def into a loop when merging with in-loop defs for an external def.
2020-11-30 Richard Biener <rguenther@suse.de>
PR tree-optimization/98064
* tree-vect-loop.c (vectorizable_live_operation): Avoid
breaking LC SSA for BB vectorization.
* g++.dg/vect/pr98064.cc: New testcase.
---
Currently we have three vector cost models: cheap, dynamic and
unlimited. -O2 -ftree-vectorize uses “cheap” by default, but that's
still relatively aggressive about peeling and aliasing checks,
and can lead to significant code size growth.
This patch adds an even more conservative choice, which for lack of
imagination I've called “very cheap”. It only allows vectorisation
if the vector code entirely replaces the scalar code. It also
requires one iteration of the vector loop to pay for itself,
regardless of how often the loop iterates. (If the vector loop
needs multiple iterations to be beneficial then things are
probably too close to call, and the conservative thing would
be to stick with the scalar code.)
The idea is that this should be suitable for -O2, although the patch
doesn't change any defaults itself.
I tested this by building and running a bunch of workloads for SVE,
with three options:
(1) -O2
(2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
(3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
All three builds used the default -msve-vector-bits=scalable and
ran with the minimum vector length of 128 bits, which should give
a worst-case bound for the performance impact.
The workloads included a mixture of microbenchmarks and full
applications. Because it's quite an eclectic mix, there's not
much point giving exact figures. The aim was more to get a general
impression.
Code size growth with (2) was much lower than with (3). Only a
handful of tests increased by more than 5%, and all of them were
microbenchmarks.
In terms of performance, (2) was significantly faster than (1)
on microbenchmarks (as expected) but also on some full apps.
Again, performance only regressed on a handful of tests.
As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
of a mixed bag. There are several significant improvements with (3)
over (2), but also some (smaller) regressions. That seems to be in
line with -O2 -ftree-vectorize being a kind of -O2.5.
The patch reorders vect_cost_model so that values are in order
of increasing aggressiveness, which makes it possible to use
range checks. The value 0 still represents “unlimited”,
so “if (flag_vect_cost_model)” is still a meaningful check.
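A sketch (my example) of a loop even the very-cheap model can allow:
with a compile-time trip count that the vectorization factor divides,
the vector loop fully replaces the scalar loop, with no peeling and no
epilogue.

  /* With e.g. -O2 -ftree-vectorize -fvect-cost-model=very-cheap a
     loop like this can still be vectorized, since no scalar loop
     needs to remain.  */
  void
  f (int *restrict a, const int *restrict b)
  {
    for (int i = 0; i < 1024; i++)
      a[i] += b[i];
  }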
gcc/
* doc/invoke.texi (-fvect-cost-model): Add a very-cheap model.
* common.opt (fvect-cost-model=): Add very-cheap as a possible option.
(fsimd-cost-model=): Likewise.
(vect_cost_model): Add very-cheap.
* flag-types.h (vect_cost_model): Add VECT_COST_MODEL_VERY_CHEAP.
Put the values in order of increasing aggressiveness.
* tree-vect-data-refs.c (vect_enhance_data_refs_alignment): Use
range checks when comparing against VECT_COST_MODEL_CHEAP.
(vect_prune_runtime_alias_test_list): Do not allow any alias
checks for the very-cheap cost model.
* tree-vect-loop.c (vect_analyze_loop_costing): Do not allow
any peeling for the very-cheap cost model. Also require one
iteration of the vector loop to pay for itself.
gcc/testsuite/
* gcc.dg/vect/vect-cost-model-1.c: New test.
* gcc.dg/vect/vect-cost-model-2.c: Likewise.
* gcc.dg/vect/vect-cost-model-3.c: Likewise.
* gcc.dg/vect/vect-cost-model-4.c: Likewise.
* gcc.dg/vect/vect-cost-model-5.c: Likewise.
* gcc.dg/vect/vect-cost-model-6.c: Likewise.
---
This makes vectorization properly assign vector types to PHI
nodes that copy from externals on loop exit edges.
2020-11-18 Richard Biener <rguenther@suse.de>
PR tree-optimization/97886
* tree-vect-loop.c (vectorizable_lc_phi): Properly assign
vector types to invariants for SLP.
---
This delays filling SLP_INSTANCE_LOADS.
2020-11-16 Richard Biener <rguenther@suse.de>
* tree-vectorizer.h (vect_gather_slp_loads): Declare.
* tree-vect-loop.c (vect_analyze_loop_2): Call
vect_gather_slp_loads.
* tree-vect-slp.c (vect_build_slp_instance): Do not gather
SLP loads here.
(vect_gather_slp_loads): Remove wrapper, new function.
(vect_slp_analyze_bb_1): Call it.
---
We're stripping conversions off access functions of inductions and
thus the step can be of different sign. Fix bogus step CTORs by
converting the elements rather than the whole vector.
2020-11-16 Richard Biener <rguenther@suse.de>
PR tree-optimization/97835
* tree-vect-loop.c (vectorizable_induction): Convert step
scalars rather than step vector.
* gcc.dg/vect/pr97835.c: New testcase.
---
This makes sure we reject reduction paths with a live stmt that
is not the last one altering the value. This is because we do not
handle this in the epilogue unless there's a scalar epilogue loop.
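A sketch (my example, not the PR testcase) of a rejected shape: t is
part of the reduction path for s but is not the last statement altering
the value, and it is live after the loop.

  /* Hypothetical reduction path s -> t -> s where the intermediate
     value t is used outside the loop; the epilogue cannot recover it
     without a scalar epilogue loop.  */
  int
  f (const int *a, int n, int *out)
  {
    int s = 0, t = 0;
    for (int i = 0; i < n; i++)
      {
        t = s + a[i];   /* intermediate value of the chain */
        s = t | 1;      /* last stmt altering the value */
      }
    *out = t;           /* t is live outside the loop */
    return s;
  }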
2020-11-09 Richard Biener <rguenther@suse.de>
PR tree-optimization/97760
* tree-vect-loop.c (check_reduction_path): Reject
reduction paths we do not handle in epilogue generation.
* gcc.dg/vect/pr97760.c: New testcase.
---
This fixes updating of the step vectors when filling up to group_size.
2020-11-09 Richard Biener <rguenther@suse.de>
PR tree-optimization/97753
* tree-vect-loop.c (vectorizable_induction): Fill vec_steps
when CSEing inside the group.
* gcc.dg/vect/pr97753.c: New testcase.
---
This PR exposes two issues - one that the vector builder treats
&x as eligible for VECTOR_CST elements and one that SLP induction
vectorization forgets to convert init elements to the vector
component type which makes a difference for pointer vs. integer.
2020-11-06 Richard Biener <rguenther@suse.de>
PR tree-optimization/97732
* tree-vect-loop.c (vectorizable_induction): Convert the
init elements to the vector component type.
* gimple-fold.c (gimple_build_vector): Use CONSTANT_CLASS_P
rather than TREE_CONSTANT to determine if elements are
eligible for VECTOR_CSTs.
* gcc.dg/vect/bb-slp-pr97732.c: New testcase.
---
This patch stores the SLP instance kind in the SLP instance so that we
can use it later when detecting load/store lanes support.
This also changes the load/store lanes support check to only check
whether the SLP kind is a store.  This means that in order for
load/store lanes to be used, all instances must be of kind store.
gcc/ChangeLog:
* tree-vect-loop.c (vect_analyze_loop_2): Check kind.
* tree-vect-slp.c (vect_build_slp_instance): New.
(enum slp_instance_kind): Move to...
* tree-vectorizer.h (enum slp_instance_kind): .. Here
(SLP_INSTANCE_KIND): New.
---
This moves the code that checks for load/store lanes further down the
pipeline, placing it after slp_optimize.  This allows us to perform
optimizations on the SLP tree and only bail out if we really have a
permute.
With this change we can handle permutes such as {1,1,1,1}, which should
be handled by a load and replicate.
This change however makes it all or nothing: either all instances can
be handled, or none at all.  This is why some of the test cases have
been adjusted.
gcc/ChangeLog:
* tree-vect-slp.c (vect_analyze_slp_instance): Moved load/store lanes
check to ...
* tree-vect-loop.c (vect_analyze_loop_2): ..Here
gcc/testsuite/ChangeLog:
* gcc.dg/vect/slp-11b.c: Update output scan.
* gcc.dg/vect/slp-perm-6.c: Likewise.
---
I forgot to cost vectorized PHIs. Scalar PHIs are just costed
as scalar_stmt so the following costs vector PHIs as vector_stmt.
2020-11-04 Richard Biener <rguenther@suse.de>
* tree-vectorizer.h (vectorizable_phi): Adjust prototype.
* tree-vect-stmts.c (vect_transform_stmt): Adjust.
(vect_analyze_stmt): Pass cost_vec to vectorizable_phi.
* tree-vect-loop.c (vectorizable_phi): Do costing.
---
This properly sets the abnormal flag when vectorizing live lanes
when the original scalar was live across an abnormal edge.
2020-11-04 Richard Biener <rguenther@suse.de>
PR tree-optimization/97709
* tree-vect-loop.c (vectorizable_live_operation): Set
SSA_NAME_OCCURS_IN_ABNORMAL_PHI when necessary.
* gcc.dg/vect/bb-slp-pr97709.c: New testcase.
---
This re-instantiates the previously removed CSE, fixing the
FAIL of gcc.dg/vect/costmodel/x86_64/costmodel-pr30843.c
It turns out the previous approach still works.
2020-11-04 Richard Biener <rguenther@suse.de>
* tree-vect-loop.c (vectorizable_induction): Re-instantiate
previously removed CSE of SLP IVs.
---
This adds SLP vectorization of nested inductions.
2020-11-03 Richard Biener <rguenther@suse.de>
PR tree-optimization/80928
* tree-vect-loop.c (vectorizable_induction): SLP vectorize
nested inductions.
* gcc.dg/vect/vect-outer-slp-2.c: New testcase.
* gcc.dg/vect/vect-outer-slp-3.c: Likewise.
---
This restores not tracking SLP nodes for induction initial values
in non-nested contexts, because this interferes with peeling and
epilogue vectorization.
2020-11-03 Richard Biener <rguenther@suse.de>
PR tree-optimization/97678
* tree-vect-slp.c (vect_build_slp_tree_2): Do not track
the initial values of inductions when not nested.
* tree-vect-loop.c (vectorizable_induction): Look at
PHI node initial values again for SLP and not nested
inductions. Handle LOOP_VINFO_MASK_SKIP_NITERS and cost
invariants.
* gcc.dg/vect/pr97678.c: New testcase.
---
This rewrites SLP induction vectorization to handle different
inductions in the different SLP lanes. It also changes SLP
build to represent the initial value (but not the cycle) so
it can be enhanced to handle outer loop vectorization later.
Note this FAILs gcc.dg/vect/costmodel/x86_64/costmodel-pr30843.c
because it removes one CSE optimization that no longer works
with non-uniform initial value and step.  I'll see about recovering
from this after outer loop vectorization of inductions works.
It might be a bit friendlier to variable-size vectors now,
but we're now building the step vector from scalars ...
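A sketch (my example) of what this enables: two SLP lanes carrying
different inductions, with different initial values and steps.

  /* Hypothetical two-lane SLP group where each lane is a distinct
     induction (start 0, step 1 vs. start 10, step 3).  */
  void
  f (int *a, int n)
  {
    int x = 0, y = 10;
    for (int i = 0; i < n; i++)
      {
        a[2 * i] = x;
        a[2 * i + 1] = y;
        x += 1;
        y += 3;
      }
  }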
2020-11-02 Richard Biener <rguenther@suse.de>
* tree.h (build_real_from_wide): Declare.
* tree.c (build_real_from_wide): New function.
* tree-vect-slp.c (vect_build_slp_tree_2): Remove
restriction on induction vectorization, represent
the initial value.
* tree-vect-loop.c (vect_model_induction_cost): Inline ...
(vectorizable_induction): ... here. Rewrite SLP
code generation.
* gcc.dg/vect/slp-49.c: New testcase.
---
This makes sure to compute the vector type for invariant SLP children
of nested cycles.
2020-11-02 Richard Biener <rguenther@suse.de>
PR tree-optimization/97558
* tree-vect-loop.c (vectorizable_reduction): For nested SLP
cycles compute invariant operands vector type.
* gcc.dg/vect/pr97558-2.c: New testcase.
---
This avoids analyzing reductions that are not relevant (thus dead),
which eventually will lead to crashes because the participating
stmts' meta is not analyzed.  For this to work the patch also
properly removes reduction groups that are not uniformly recognized
as patterns.
2020-11-02 Richard Biener <rguenther@suse.de>
PR tree-optimization/97558
* tree-vect-loop.c (vect_fixup_scalar_cycles_with_patterns):
Check for any mismatch in pattern vs. non-pattern and dissolve
the group if there is one.
* tree-vect-slp.c (vect_analyze_slp_instance): Avoid
analyzing not relevant reductions.
(vect_analyze_slp): Avoid analyzing not relevant reduction
groups.
* gcc.dg/vect/pr97558.c: New testcase.
---
This fixes some memleaks, one older, one recently introduced.
2020-10-29 Richard Biener <rguenther@suse.de>
* tree-ssa-pre.c (compute_avail): Free operands consistently.
* tree-vect-loop.c (vectorizable_phi): Make sure all operand
defs vectors are released.
---
This makes SLP discovery detect backedges by seeding the bst_map with
the node to be analyzed so it can be picked up from recursive calls.
This removes the need to discover backedges in a separate walk.
This enables SLP build to handle PHI nodes in full, continuing
the SLP build to non-backedges. For loop vectorization this
enables outer loop vectorization of nested SLP cycles and for
BB vectorization this enables vectorization of PHIs at CFG merges.
It also turns code generation into a SCC discovery walk to handle
irreducible regions and nodes only reachable via backedges where
we now also fill in vectorized backedge defs.
This requires sanitizing the SLP tree for SLP reduction chains even
more, manually filling the backedge SLP def.
This also exposes the fact that CFG copying (and edge splitting
until I fixed that) ends up with different edge order in the
copy which doesn't play well with the desired 1:1 mapping of
SLP PHI node children and edges for epilogue vectorization.
I've tried to fix up CFG copying here, but this really looks
like a dead (or expensive) end, so I've done the fixup in
slpeel_tree_duplicate_loop_to_edge_cfg instead for the cases
we can run into.
There are still NULLs in the SLP_TREE_CHILDREN vectors and I'm
not sure it's possible to eliminate them all this stage1, so the
patch has quite some checks for this case all over the place.
Bootstrapped and tested on x86_64-unknown-linux-gnu. SPEC CPU 2017
and SPEC CPU 2006 successfully built and tested.
2020-10-27 Richard Biener <rguenther@suse.de>
* gimple.h (gimple_expr_type): For PHIs return the type
of the result.
* tree-vect-loop-manip.c (slpeel_tree_duplicate_loop_to_edge_cfg):
Make sure edge order into copied loop headers line up with the
originals.
* tree-vect-loop.c (vect_transform_cycle_phi): Handle nested
loops with SLP.
(vectorizable_phi): New function.
(vectorizable_live_operation): For BB vectorization compute insert
location here.
* tree-vect-slp.c (vect_free_slp_tree): Deal with NULL
SLP_TREE_CHILDREN entries.
(vect_create_new_slp_node): Add overloads with pre-existing node
argument.
(vect_print_slp_graph): Likewise.
(vect_mark_slp_stmts): Likewise.
(vect_mark_slp_stmts_relevant): Likewise.
(vect_gather_slp_loads): Likewise.
(vect_optimize_slp): Likewise.
(vect_slp_analyze_node_operations): Likewise.
(vect_bb_slp_scalar_cost): Likewise.
(vect_remove_slp_scalar_calls): Likewise.
(vect_get_and_check_slp_defs): Handle PHIs.
(vect_build_slp_tree_1): Handle PHIs.
(vect_build_slp_tree_2): Continue SLP build, following PHI
arguments. Fix memory leak.
(vect_build_slp_tree): Put stub node into the hash-map so
we can discover cycles directly.
(vect_build_slp_instance): Set the backedge SLP def for
reduction chains.
(vect_analyze_slp_backedges): Remove.
(vect_analyze_slp): Do not call it.
(vect_slp_convert_to_external): Release SLP_TREE_LOAD_PERMUTATION.
(vect_slp_analyze_node_operations): Handle stray failed
backedge defs by failing.
(vect_slp_build_vertices): Adjust leaf condition.
(vect_bb_slp_mark_live_stmts): Handle PHIs, use visited
hash-set to handle cycles.
(vect_slp_analyze_operations): Adjust.
(vect_bb_partition_graph_r): Likewise.
(vect_slp_function): Adjust split condition to allow CFG
merges.
(vect_schedule_slp_instance): Rename to ...
(vect_schedule_slp_node): ... this. Move DFS walk to ...
(vect_schedule_scc): ... this new function.
(vect_schedule_slp): Call it. Remove ad-hoc vectorized
backedge fill code.
* tree-vect-stmts.c (vect_analyze_stmt): Call
vectorizable_phi.
(vect_transform_stmt): Likewise.
(vect_is_simple_use): Handle vect_backedge_def.
* tree-vectorizer.c (vec_info::new_stmt_vec_info): Only
set loop header PHIs to vect_unknown_def_type for loop
vectorization.
* tree-vectorizer.h (enum vect_def_type): Add vect_backedge_def.
(enum stmt_vec_info_type): Add phi_info_type.
(vectorizable_phi): Declare.
* gcc.dg/vect/bb-slp-54.c: New test.
* gcc.dg/vect/bb-slp-55.c: Likewise.
* gcc.dg/vect/bb-slp-56.c: Likewise.
* gcc.dg/vect/bb-slp-57.c: Likewise.
* gcc.dg/vect/bb-slp-58.c: Likewise.
* gcc.dg/vect/bb-slp-59.c: Likewise.
* gcc.dg/vect/bb-slp-60.c: Likewise.
* gcc.dg/vect/bb-slp-61.c: Likewise.
* gcc.dg/vect/bb-slp-62.c: Likewise.
* gcc.dg/vect/bb-slp-63.c: Likewise.
* gcc.dg/vect/bb-slp-64.c: Likewise.
* gcc.dg/vect/bb-slp-65.c: Likewise.
* gcc.dg/vect/bb-slp-66.c: Likewise.
* gcc.dg/vect/vect-outer-slp-1.c: Likewise.
* gfortran.dg/vect/O3-bb-slp-1.f: Likewise.
* gfortran.dg/vect/O3-bb-slp-2.f: Likewise.
* g++.dg/vect/simd-11.cc: Likewise.
---
Remove a redundant LOOP_VINFO_FULLY_MASKED_P condition check,
which is already checked in vect_use_loop_mask_for_alignment_p.
gcc/ChangeLog:
* tree-vect-loop.c (vect_transform_loop): Remove the redundant
LOOP_VINFO_FULLY_MASKED_P check.
---
We were using the wrong loop to figure out the latch arg of a
double-reduction PHI.  That isn't a problem in case ->dest_idx
matches up with the outer loop edges - but that's of course not
guaranteed.
2020-10-20 Richard Biener <rguenther@suse.de>
* tree-vect-loop.c (vectorizable_reduction): Use the correct
loops latch edge for the PHI arg lookup.
---
This fixes the case where the insertion iterator for the live stmt
is the end of a BB by adjusting the dominance query to the definition
of the def we're substituting.
2020-10-15 Richard Biener <rguenther@suse.de>
* tree-vect-loop.c (vectorizable_live_operation): Adjust
dominance query.
* gcc.dg/vect/bb-slp-52.c: New testcase.
---
This fixes leaks discovered while checking whether I introduced
new ones with the last vectorizer changes.
2020-10-09 Richard Biener <rguenther@suse.de>
* cgraphunit.c (expand_all_functions): Free tp_first_run_order.
* ipa-modref.c (pass_ipa_modref::execute): Free order.
* tree-ssa-loop-niter.c (estimate_numbers_of_iterations): Free
loop body.
* tree-vect-data-refs.c (vect_find_stmt_data_reference): Free
data references upon failure.
* tree-vect-loop.c (update_epilogue_loop_vinfo): Free BBs
array of the original loop.
* tree-vect-slp.c (vect_slp_bbs): Use an auto_vec for
dataref_groups to release its memory.
---
The following moves an ad-hoc attempt at discovering the SLP node
for a stmt to the place where we can find it in lock-step when
we find the stmt itself.
2020-09-29 Richard Biener <rguenther@suse.de>
PR tree-optimization/97241
* tree-vect-loop.c (vectorizable_reduction): Move finding
the SLP node for the reduction stmt to a better place.
* gcc.dg/vect/pr97241.c: New testcase.
---
This patch fixes the fallout that Kewen reported on Power after
the recent change to avoid unnecessary use of partial vectors.
As Kewen said, the problem is that vect_analyze_loop_2 doesn't
know how many epilogue iterations there will be, and so it
cannot make a final decision about whether the number of
iterations forces an epilogue loop to use partial vectors.
This is similar to the current situation for peeling: we don't know
during initial analysis whether an epilogue loop will itself require
peeling. Instead we decide that during vect_do_peeling, where the
final number of epilogue loop iterations is known.
The patch takes a similar approach for the decision about whether
to use partial vectors. As the comments in the patch say, the
idea is that vect_analyze_loop_2 should make peeling and partial-
vector decisions based on the assumption that the loop_vinfo will
be used as the main loop, while vect_do_peeling should make them
in the knowledge that the loop_vinfo will be used as an epilogue loop.
This allows the same analysis to be used for both cases, which we
rely on for implementing VECT_COMPARE_COSTS; see the big comment
in vect_analyze_loop for details.
I hope the patch makes the (mostly preexisting) structure a bit
more obvious. It isn't what anyone would design from scratch,
but that's the nature of working with a mature vector framework.
Arranging things this way means that vect_verify_full_masking
and vect_verify_loop_lens now become part of the “can” rather
than “will” test for partial vectors.
Also, while splitting out the logic that handles epilogues with
constant iterations, I added a check to make sure that we don't
try to use partial vectors to vectorise a single-scalar loop.
This required some changes to the Power tests.
gcc/
* tree-vectorizer.h (determine_peel_for_niter): Delete in favor of...
(vect_determine_partial_vectors_and_peeling): ...this new function.
* tree-vect-loop-manip.c (vect_update_epilogue_niters): New function.
Reject using vector epilogue loops for single iterations. Install
the constant number of epilogue loop iterations in the associated
loop_vinfo. Rely on vect_determine_partial_vectors_and_peeling
to do the main part of the test.
(vect_do_peeling): Use vect_update_epilogue_niters to handle
epilogue loops with a known number of iterations. Skip recomputing
the number of iterations later in that case. Otherwise, use
vect_determine_partial_vectors_and_peeling to decide whether the
epilogue loop needs to use partial vectors or peeling.
* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Set the
default can_use_partial_vectors_p to false if partial-vector-usage=0.
(determine_peel_for_niter): Remove in favor of...
(vect_determine_partial_vectors_and_peeling): ...this new function,
split out from...
(vect_analyze_loop_2): ...here. Reflect the vect_verify_full_masking
and vect_verify_loop_lens results in CAN_USE_PARTIAL_VECTORS_P
rather than USING_PARTIAL_VECTORS_P.
gcc/testsuite/
* gcc.target/powerpc/p9-vec-length-epil-1.c: Do not expect the
single-iteration epilogues of the 64-bit loops to be vectorized.
* gcc.target/powerpc/p9-vec-length-epil-7.c: Likewise.
* gcc.target/powerpc/p9-vec-length-epil-8.c: Likewise.
---
The condition we're expecting to eventually run into isn't fully
captured by checking for CTORs; instead we can also run into the
CTOR element conversion.
2020-09-23 Richard Biener <rguenther@suse.de>
PR tree-optimization/97173
* tree-vect-loop.c (vectorizable_live_operation): Extend
assert to also cover element conversions.
* gcc.dg/vect/pr97173.c: New testcase.