The reduction guard isn't correct: STMT_VINFO_REDUC_DEF also exists
for nested cycles not part of reductions, but there's no reduction
info for them.
PR tree-optimization/121754
* tree-vectorizer.h (vect_reduc_type): Simplify to not ICE
on nested cycles.
* gcc.dg/vect/pr121754.c: New testcase.
* gcc.target/aarch64/vect-pr121754.c: Likewise.
|
|
The strided-store path needs to have the SLP tree's vector type, so
the following patch passes down the vector type to be used to
vect_check_gather_scatter and adjusts all other callers. This
removes one of the last pieces requiring STMT_VINFO_VECTYPE
during SLP stmt analysis.
* tree-vectorizer.h (vect_check_gather_scatter): Add
vectype parameter.
* tree-vect-data-refs.cc (vect_check_gather_scatter): Get
vectype as parameter.
(vect_analyze_data_refs): Adjust.
* tree-vect-patterns.cc (vect_recog_gather_scatter_pattern): Likewise.
* tree-vect-slp.cc (vect_get_and_check_slp_defs): Get vectype
as parameter, pass down.
(vect_build_slp_tree_2): Adjust.
* tree-vect-stmts.cc (vect_mark_stmts_to_be_vectorized): Likewise.
(vect_use_strided_gather_scatters_p): Likewise.
|
|
While we already have the accessor info_for_reduction, its result
is a plain stmt_vec_info. The following turns that into a class
for the purpose of changing accesses to reduction info to a new
set of accessors prefixed with VECT_REDUC_INFO and removes
the corresponding STMT_VINFO prefixed accessors where possible.
There are a few reduction-related things that are used by scalar
cycle detection and thus have to stay as-is for now, and as
copies in the future.
This also separates reduction info into one object per reduction
and associates it with SLP nodes, splitting it out from
stmt_vec_info, retaining (and duplicating) parts used by scalar
cycle analysis. The data is then associated with SLP nodes
forming reduction cycles and accessible via info_for_reduction.
The data is created at SLP discovery time as we look at it even
pre-vectorizable_reduction analysis, but most of the data is
only populated by the latter. There is no reduction info for
nested cycles that are not part of an outer reduction.
In the process this adds cycle info to each SLP tree, notably
the reduc-idx and a way to identify the reduction info.
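As a rough sketch of the new accessor style (the accessor names are
from the ChangeLog below; the field names here are assumptions):

  #define VECT_REDUC_INFO_TYPE(I) (I)->reduc_type
  #define VECT_REDUC_INFO_CODE(I) (I)->reduc_code
  /* Looked up per reduction cycle, e.g.:
     vect_reduc_info ri = info_for_reduction (loop_vinfo, slp_node);  */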
* tree-vectorizer.h (vect_reduc_info): New.
(create_info_for_reduction): Likewise.
(VECT_REDUC_INFO_TYPE): Likewise.
(VECT_REDUC_INFO_CODE): Likewise.
(VECT_REDUC_INFO_FN): Likewise.
(VECT_REDUC_INFO_SCALAR_RESULTS): Likewise.
(VECT_REDUC_INFO_INITIAL_VALUES): Likewise.
(VECT_REDUC_INFO_REUSED_ACCUMULATOR): Likewise.
(VECT_REDUC_INFO_INDUC_COND_INITIAL_VAL): Likewise.
(VECT_REDUC_INFO_EPILOGUE_ADJUSTMENT): Likewise.
(VECT_REDUC_INFO_FORCE_SINGLE_CYCLE): Likewise.
(VECT_REDUC_INFO_RESULT_POS): Likewise.
(VECT_REDUC_INFO_VECTYPE): Likewise.
(STMT_VINFO_VEC_INDUC_COND_INITIAL_VAL): Remove.
(STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT): Likewise.
(STMT_VINFO_FORCE_SINGLE_CYCLE): Likewise.
(STMT_VINFO_REDUC_FN): Likewise.
(STMT_VINFO_REDUC_VECTYPE): Likewise.
(vect_reusable_accumulator::reduc_info): Adjust.
(vect_reduc_type): Adjust.
(_slp_tree::cycle_info): New member.
(SLP_TREE_REDUC_IDX): Likewise.
(vect_reduc_info_s): Move/copy data from ...
(_stmt_vec_info): ... here.
(_loop_vec_info::reduc_infos): New member.
(info_for_reduction): Adjust to take SLP node.
(vect_reduc_type): Adjust.
(vect_is_reduction): Add overload for SLP node.
* tree-vectorizer.cc (vec_info::new_stmt_vec_info):
Do not initialize removed members.
(vec_info::free_stmt_vec_info): Do not release them.
* tree-vect-stmts.cc (vectorizable_condition): Adjust.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize
cycle info.
(vect_build_slp_tree_2): Compute SLP reduc_idx and store
it. Create, populate and propagate reduction info.
(vect_print_slp_tree): Print cycle info.
(vect_analyze_slp_reduc_chain): Set cycle info on the
manual added conversion node.
(vect_optimize_slp_pass::start_choosing_layouts): Adjust.
* tree-vect-loop.cc (_loop_vec_info::~_loop_vec_info):
Release reduction infos.
(info_for_reduction): Get the reduction info from
the vector in the loop_vinfo.
(vect_create_epilog_for_reduction): Adjust.
(vectorizable_reduction): Likewise.
(vect_transform_reduction): Likewise.
(vect_transform_cycle_phi): Likewise, deal with nested
cycles not part of a double reduction having no reduction info.
* config/aarch64/aarch64.cc (aarch64_force_single_cycle):
Use VECT_REDUC_INFO_FORCE_SINGLE_CYCLE, get SLP node and use
that.
(aarch64_vector_costs::count_ops): Adjust.
|
|
The following changes the vect_reduc_type API to work on the SLP node.
The API is only used from the aarch64 backend, so all changes are there.
In particular I noticed aarch64_force_single_cycle is invoked even
for scalar costing (where the flag tested isn't computed yet), I
figured in scalar costing all reductions are a single cycle.
* tree-vectorizer.h (vect_reduc_type): Get SLP node as argument.
* config/aarch64/aarch64.cc (aarch64_sve_in_loop_reduction_latency):
Take SLP node as argument and adjust.
(aarch64_in_loop_reduction_latency): Likewise.
(aarch64_detect_vector_stmt_subtype): Adjust.
(aarch64_vector_costs::count_ops): Likewise. Treat reductions
during scalar costing as single-cycle.
|
|
This was added when invariants/externals outside of SLP didn't have
an easily accessible vector type. Now it's redundant so the
following removes it.
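A minimal sketch of the replacement pattern (the child index is an
assumption; it depends on which operand is the reduction input):

  /* Instead of the removed STMT_VINFO_REDUC_VECTYPE_IN, read the
     input vector type off the SLP child.  */
  tree vectype_in = SLP_TREE_VECTYPE (SLP_TREE_CHILDREN (slp_node)[0]);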
* tree-vectorizer.h (_stmt_vec_info::reduc_vectype_in): Remove.
(STMT_VINFO_REDUC_VECTYPE_IN): Likewise.
* tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Get
at the input vectype via the SLP node child.
(vectorizable_lane_reducing): Likewise.
(vect_transform_reduction): Likewise.
(vectorizable_reduction): Do not set STMT_VINFO_REDUC_VECTYPE_IN.
|
|
We now have common patterns for most of the vectorizable_* calls, so
merge them. This also avoids calling vectorizable_early_exit for BB
vect and clarifies signatures of it and vectorizable_phi.
* tree-vectorizer.h (vectorizable_phi): Take bb_vec_info.
(vectorizable_early_exit): Take loop_vec_info.
* tree-vect-loop.cc (vectorizable_phi): Adjust.
* tree-vect-slp.cc (vect_slp_analyze_operations): Likewise.
(vectorize_slp_instance_root_stmt): Likewise.
* tree-vect-stmts.cc (vectorizable_early_exit): Likewise.
(vect_transform_stmt): Likewise.
(vect_analyze_stmt): Merge the sequences of vectorizable_*
where common.
|
|
The following is a patch to make us record the get_load_store_info
results from load/store analysis and re-use them during transform.
In particular this moves where SLP_TREE_MEMORY_ACCESS_TYPE is stored.
A major hassle was (and still is, to some extent) gather/scatter
handling with its accompanying gather_scatter_info. As
get_load_store_info no longer fully re-analyzes them but parts of
the information is recorded in the SLP tree during SLP build the
following goes and eliminates the use of this data in
vectorizable_load/store, instead recording the other relevant
part in the load-store info (namely the IFN or decl chosen).
Strided load handling keeps the re-analysis but populates the
data back to the SLP tree and the load-store info. That's something
for further improvement. This also shows that early classifying
a SLP tree as load/store and allocating the load-store data might
be a way to move back all of the gather/scatter auxiliary data
into one place.
Rather than mass-replacing references to variables I've kept the
locals but made them read-only, only adjusting a few elsval setters
and adding a FIXME to strided SLP handling of alignment (allowing
local override there).
The FIXME shows that while a lot of analysis is done in
get_load_store_type that's far from all of it. There's also
a possibility that splitting up the transform phase into
separate load/store def types, based on the VMAT chosen, will make
the code more maintainable.
* tree-vectorizer.h (vect_load_store_data): New.
(_slp_tree::memory_access_type): Remove.
(SLP_TREE_MEMORY_ACCESS_TYPE): Turn into inline function.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Do not
initialize SLP_TREE_MEMORY_ACCESS_TYPE.
* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
Remove gather_scatter_info pointer argument, instead get
info from the SLP node.
(vect_build_one_gather_load_call): Get SLP node and builtin
decl as argument and remove uses of gather_scatter_info.
(vect_build_one_scatter_store_call): Likewise.
(vect_get_gather_scatter_ops): Remove uses of gather_scatter_info.
(vect_get_strided_load_store_ops): Get SLP node and remove
uses of gather_scatter_info.
(get_load_store_type): Take pointer to vect_load_store_data
instead of individual pointers.
(vectorizable_store): Adjust. Re-use get_load_store_type
result from analysis time.
(vectorizable_load): Likewise.
|
|
The following wraps SLP_TREE_CODE checks against VEC_PERM_EXPR
(the only relevant code) in a new SLP_TREE_PERMUTE_P predicate.
Most places guard against SLP_TREE_REPRESENTATIVE being NULL.
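A plausible definition of the predicate (a sketch; whether the
NULL-representative guard folds into the macro is an assumption):

  #define SLP_TREE_PERMUTE_P(NODE) \
    (SLP_TREE_CODE (NODE) == VEC_PERM_EXPR)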
* tree-vectorizer.h (SLP_TREE_PERMUTE_P): New.
* tree-vect-slp-patterns.cc (linear_loads_p): Adjust.
(vect_detect_pair_op): Likewise.
(addsub_pattern::recognize): Likewise.
* tree-vect-slp.cc (vect_print_slp_tree): Likewise.
(vect_gather_slp_loads): Likewise.
(vect_is_slp_load_node): Likewise.
(optimize_load_redistribution_1): Likewise.
(vect_optimize_slp_pass::is_cfg_latch_edge): Likewise.
(vect_optimize_slp_pass::internal_node_cost): Likewise.
(vect_optimize_slp_pass::start_choosing_layouts): Likewise.
(vect_optimize_slp_pass::backward_cost): Likewise.
(vect_optimize_slp_pass::forward_pass): Likewise.
(vect_optimize_slp_pass::get_result_with_layout): Likewise.
(vect_optimize_slp_pass::materialize): Likewise.
(vect_optimize_slp_pass::dump): Likewise.
(vect_optimize_slp_pass::decide_masked_load_lanes): Likewise.
(vect_update_slp_vf_for_node): Likewise.
(vect_slp_analyze_node_operations_1): Likewise.
(vect_schedule_slp_node): Likewise.
(vect_schedule_scc): Likewise.
* tree-vect-stmts.cc (vect_analyze_stmt): Likewise.
(vect_transform_stmt): Likewise.
(vect_is_simple_use): Likewise.
|
|
The following splits up VMAT_GATHER_SCATTER into
VMAT_GATHER_SCATTER_LEGACY, VMAT_GATHER_SCATTER_IFN and
VMAT_GATHER_SCATTER_EMULATED. The main motivation is to reduce
the uses of (full) gs_info, but it also makes the kind representable
by a single entry rather than the ifn and decl tristate.
The strided load with gather case gets to use VMAT_GATHER_SCATTER_IFN,
since that's what we end up checking.
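Sketched, the split and a covering predicate might look like this
(an assumption about the exact spelling):

  enum vect_memory_access_type {
    /* ... other access kinds elided ...  */
    VMAT_GATHER_SCATTER_LEGACY,
    VMAT_GATHER_SCATTER_IFN,
    VMAT_GATHER_SCATTER_EMULATED
  };

  inline bool
  mat_gather_scatter_p (vect_memory_access_type mat)
  {
    return (mat == VMAT_GATHER_SCATTER_LEGACY
            || mat == VMAT_GATHER_SCATTER_IFN
            || mat == VMAT_GATHER_SCATTER_EMULATED);
  }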
* tree-vectorizer.h (vect_memory_access_type): Replace
VMAT_GATHER_SCATTER with three separate access types,
VMAT_GATHER_SCATTER_LEGACY, VMAT_GATHER_SCATTER_IFN and
VMAT_GATHER_SCATTER_EMULATED.
(mat_gather_scatter_p): New predicate.
(GATHER_SCATTER_LEGACY_P): Remove.
(GATHER_SCATTER_IFN_P): Likewise.
(GATHER_SCATTER_EMULATED_P): Likewise.
* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
Adjust.
(get_load_store_type): Likewise.
(vect_get_loop_variant_data_ptr_increment): Likewise.
(vectorizable_store): Likewise.
(vectorizable_load): Likewise.
* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
Likewise.
* config/riscv/riscv-vector-costs.cc
(costs::need_additional_vector_vars_p): Likewise.
* config/aarch64/aarch64.cc (aarch64_detect_vector_stmt_subtype):
Likewise.
(aarch64_vector_costs::count_ops): Likewise.
(aarch64_vector_costs::add_stmt_cost): Likewise.
|
|
The gather_scatter_info pointer is only used as a flag, so pass down
a flag.
* tree-vectorizer.h (vect_supportable_dr_alignment): Pass
a bool instead of a pointer to gather_scatter_info.
* tree-vect-data-refs.cc (vect_supportable_dr_alignment):
Likewise.
* tree-vect-stmts.cc (get_load_store_type): Adjust.
|
|
This patch extends the support for peeling and versioning for alignment
from VLS modes to VLA modes. The key change is allowing the DR target
alignment to be set to a non-constant poly_int. Since the value must be
a power-of-two, for variable VFs, the power-of-two check is deferred to
runtime through loop versioning. The vectorizable check for speculative
loads is also refactored in this patch to handle both constant and
variable target alignment values.
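The deferred check itself can be the usual bit trick; a sketch, with
x standing for the runtime alignment value:

  /* True iff x is a non-zero power of two; usable as the loop
     versioning condition (illustrative only).  */
  bool pow2_p = x != 0 && (x & (x - 1)) == 0;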
Additional changes for VLA modes include:
1) Peeling
In VLA modes, we use peeling with masking - using a partial vector in
the first iteration of the vectorized loop to ensure aligned DRs in
subsequent iterations. It was already enabled for VLS modes to avoid
scalar peeling. This patch reuses most of the existing logic and just
fixes a small issue of incorrect IV offset in VLA code path. This also
removes a power-of-two rounding when computing the number of iterations
to peel, as power-of-two VF has been guaranteed by a new runtime check.
2) Versioning
The type of the mask for runtime alignment check is updated to poly_int
to support variable VFs. After this change, both standalone versioning
and peeling with versioning are available in VLA modes. This patch also
introduces another runtime check for speculative read amount, to ensure
that all speculative loads remain within current valid memory page. We
plan to remove these runtime checks in the future by introducing capped
VF - using partial vectors to limit the actual VF value at runtime.
3) Speculative read flag
DRs whose scalar accesses are known to be in-bounds would be treated
as unsupported unaligned accesses with a variable target alignment.
speculative reads can be naturally avoided for in-bounds DRs as long as
partial vectors are used. Therefore, this patch clears the speculative
flags and sets the "must use partial vectors" flag for these cases.
This patch is bootstrapped and regression-tested on x86_64-linux-gnu,
arm-linux-gnueabihf and aarch64-linux-gnu with bootstrap-O3.
gcc/ChangeLog:
* tree-vect-data-refs.cc (vect_compute_data_ref_alignment):
Allow DR target alignment to be a poly_int.
(vect_enhance_data_refs_alignment): Support peeling and
versioning for VLA modes.
* tree-vect-loop-manip.cc (get_misalign_in_elems): Remove
power-of-two rounding in peeling.
(vect_create_cond_for_align_checks): Update alignment check
logic for poly_int mask.
(vect_create_cond_for_vla_spec_read): New runtime checks.
(vect_loop_versioning): Support new runtime checks.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Add a new
loop_vinfo field.
(vectorizable_induction): Fix wrong IV offset issue.
* tree-vect-stmts.cc (get_load_store_type): Refactor
vectorizable checks for speculative loads.
* tree-vectorizer.h (LOOP_VINFO_MAX_SPEC_READ_AMOUNT): New
macro for new runtime checks.
(LOOP_REQUIRES_VERSIONING_FOR_SPEC_READ): Likewise.
(LOOP_REQUIRES_VERSIONING): Update macro for new runtime checks.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/sve/peel_ind_11.c: New test.
* gcc.target/aarch64/sve/peel_ind_11_run.c: New test.
* gcc.target/aarch64/sve/peel_ind_12.c: New test.
* gcc.target/aarch64/sve/peel_ind_12_run.c: New test.
* gcc.target/aarch64/sve/peel_ind_13.c: New test.
* gcc.target/aarch64/sve/peel_ind_13_run.c: New test.
|
|
The main gather/scatter discovery happens at SLP discovery time,
the base address and the offset scale are currently not explicitly
represented in the SLP tree. This requires re-discovery of them
during vectorizable_store/load. The following fixes this by
recording this info into the SLP tree. This allows the main
vect_check_gather_scatter call to be elided from get_load_store_type
and replaced with target support checks for IFN/decl or fallback
emulated mode.
There's vect_check_gather_scatter left in the path using gather/scatter
for strided load/store. I hope to deal with this later.
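Sketched, the recorded info and its accessors (exact field types are
assumptions; the names are from the ChangeLog):

  struct _slp_tree {
    /* ... existing members ...  */
    tree gs_base;    /* gather/scatter base address  */
    int gs_scale;    /* scale applied to the offset  */
  };
  #define SLP_TREE_GS_BASE(S)  (S)->gs_base
  #define SLP_TREE_GS_SCALE(S) (S)->gs_scale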
* tree-vectorizer.h (_slp_tree::gs_scale): New.
(_slp_tree::gs_base): Likewise.
(SLP_TREE_GS_SCALE): Likewise.
(SLP_TREE_GS_BASE): Likewise.
(vect_describe_gather_scatter_call): Declare.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize
new members.
(vect_build_slp_tree_2): Record gather/scatter base and scale.
(vect_get_and_check_slp_defs): For gather/scatter IFNs
describe the call to first_gs_info.
* tree-vect-data-refs.cc (vect_gather_scatter_fn_p): Add
mode of operation with fixed offset vector type.
(vect_describe_gather_scatter_call): Export.
* tree-vect-stmts.cc (get_load_store_type): Do not call
vect_check_gather_scatter to fill gs_info, instead populate
from the SLP tree. Check which of IFN, decl or fallback
is supported and record that decision.
|
|
The following removes hybrid SLP detection - it existed as a sanity
check that all stmts are covered by SLP, but it proved itself
incomplete at that. Its job is taken by early terminating SLP
build when SLP discovery fails for one root and the hope that
we now do catch all of them.
* tree-vectorizer.h (vect_relevant::hybrid): Remove.
* tree-vect-loop.cc (vect_analyze_loop_2): Do not call
vect_detect_hybrid_slp.
* tree-vect-slp.cc (maybe_push_to_hybrid_worklist): Remove.
(vect_detect_hybrid_slp): Likewise.
|
|
The following records the alternate SLP instance entries coming from
stmts with stores that have no SSA def, like OMP SIMD calls without LHS.
There's a bit of fallout with having a SLP tree with a NULL vectype,
but nothing too gross.
PR tree-optimization/121395
* tree-vectorizer.h (_loop_vec_info::alternate_defs): New member.
(LOOP_VINFO_ALTERNATE_DEFS): New.
* tree-vect-stmts.cc (vect_stmt_relevant_p): Populate it.
(vectorizable_simd_clone_call): Do not register a SLP def
when there is none.
* tree-vect-slp.cc (vect_build_slp_tree_1): Allow a NULL
vectype when there's no LHS. Allow all calls w/o LHS.
(vect_analyze_slp): Process LOOP_VINFO_ALTERNATE_DEFS as
SLP graph entries.
(vect_make_slp_decision): Handle a NULL SLP_TREE_VECTYPE.
(vect_slp_analyze_node_operations_1): Likewise.
(vect_schedule_slp_node): Likewise.
* gcc.dg/vect/pr59984.c: Adjust.
|
|
The following renames loop_vect to not_vect, removes the unused
HYBRID_SLP_STMT macro and rewords the slp_vect_type docs to clarify
STMT_SLP_TYPE is mainly used for BB vectorization, tracking what is
vectorized and what not.
* tree-vectorizer.h (enum slp_vect_type): Rename loop_vect
to not_vect, clarify docs.
(HYBRID_SLP_STMT): Remove.
* tree-vectorizer.cc (vec_info::new_stmt_vec_info): Adjust.
* tree-vect-loop.cc (vect_analyze_loop_2): Likewise.
|
|
We're using VMAT_INVARIANT as the default, but we should simply have
an uninitialized state.
* tree-vectorizer.h (VMAT_UNINITIALIZED): New
vect_memory_access_type.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Use it.
|
|
The following adds vect_simd_clone_data as a container for vect
type specific data for vectorizable_simd_clone_call and moves
SLP_TREE_SIMD_CLONE_INFO there.
* tree-vectorizer.h (vect_simd_clone_data): New.
(_slp_tree::simd_clone_info): Remove.
(SLP_TREE_SIMD_CLONE_INFO): Likewise.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Adjust.
(_slp_tree::~_slp_tree): Likewise.
* tree-vect-stmts.cc (vectorizable_simd_clone_call): Use
type specific data to store SLP_TREE_SIMD_CLONE_INFO.
|
|
The following turns the union into a class hierarchy. Once
completed, SLP_TREE_TYPE could move into the base class.
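A minimal sketch of the resulting shape (the bodies are assumptions):

  struct vect_data { virtual ~vect_data () {} };

  struct _slp_tree {
    /* ... */
    vect_data *data;   /* replaces the former union u  */
    template <typename T> T *get_data () { return static_cast<T *> (data); }
  };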
* tree-vect-slp.cc (_slp_tree::_slp_tree): Adjust.
(_slp_tree::~_slp_tree): Likewise.
* tree-vectorizer.h (vect_data): New base class.
(_slp_tree::u): Remove.
(_slp_tree::data): Add pointer to vect_data.
(_slp_tree::get_data): New helper template.
|
|
This should be present only on SLP nodes now. The RISC-V changes
are mechanical along the line of the SLP_TREE_TYPE changes.
* tree-vectorizer.h (_stmt_vec_info::memory_access_type): Remove.
(STMT_VINFO_MEMORY_ACCESS_TYPE): Likewise.
(vect_mem_access_type): Likewise.
* tree-vect-stmts.cc (vectorizable_store): Do not set
STMT_VINFO_MEMORY_ACCESS_TYPE. Fix SLP_TREE_MEMORY_ACCESS_TYPE
usage.
* tree-vect-loop.cc (update_epilogue_loop_vinfo): Remove
checking of memory access type.
* config/riscv/riscv-vector-costs.cc (costs::compute_local_live_ranges):
Use SLP_TREE_MEMORY_ACCESS_TYPE.
(costs::need_additional_vector_vars_p): Likewise.
(segment_loadstore_group_size): Get SLP node as argument,
use SLP_TREE_MEMORY_ACCESS_TYPE.
(costs::adjust_stmt_cost): Pass down SLP node.
* config/aarch64/aarch64.cc (aarch64_ld234_st234_vectors): Use
SLP_TREE_MEMORY_ACCESS_TYPE instead of vect_mem_access_type.
(aarch64_detect_vector_stmt_subtype): Likewise.
(aarch64_vector_costs::count_ops): Likewise.
(aarch64_vector_costs::add_stmt_cost): Likewise.
|
|
The following makes sure to not leak a set vectype on a stmt when
doing scalar IL costing as this can confuse vector cost models
which do not look at m_costing_for_scalar most of the time.
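A sketch of the guard in add_stmt_cost, assuming it is a simple
override plus assert:

  if (costing_for_scalar ())
    {
      gcc_assert (!node);    /* no SLP node during scalar costing  */
      vectype = NULL_TREE;   /* do not leak a set vectype  */
    }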
* tree-vectorizer.h (vector_costs::costing_for_scalar): New
accessor.
(add_stmt_cost): For scalar costing force vectype to NULL.
Verify we do not pass in a SLP node.
|
|
This fixes a miscompilation issue introduced by the enablement of
combined loop peeling and versioning. A test case that reproduces the
issue is included in the patch.
When performing loop peeling, GCC usually inserts a skip-vector check.
This ensures that after peeling, there are enough remaining iterations
to enter the main vectorized loop. Previously, the check was omitted if
loop versioning for alignment was applied. It was safe before because
versioning and peeling for alignment were mutually exclusive.
However, with combined peeling and versioning enabled, this is not safe
any more. A loop may be peeled and versioned at the same time. Without
the skip-vector check, the main vectorized loop can be entered even if
its iteration count is zero. This can cause the loop running many more
iterations than needed, resulting in incorrect results.
To fix this, the patch updates the condition of omitting the skip-vector
check to when versioning is performed alone without peeling.
gcc/ChangeLog:
PR tree-optimization/121020
* tree-vect-loop-manip.cc (vect_do_peeling): Update the
condition of omitting the skip-vector check.
* tree-vectorizer.h (LOOP_VINFO_USE_VERSIONING_WITHOUT_PEELING):
Add a helper macro.
gcc/testsuite/ChangeLog:
PR tree-optimization/121020
* gcc.dg/vect/vect-early-break_138-pr121020.c: New test.
|
|
The following removes this only-set (never read) member. Slightly complicated
by the hoops get_group_load_store_type jumps through. I've simplified
that, noting the offset vector type that's relevant is that of the
actual offset SLP node, not of what vect_check_gather_scatter (re-)computes.
* tree-vectorizer.h (gather_scatter_info::offset_dt): Remove.
* tree-vect-data-refs.cc (vect_describe_gather_scatter_call):
Do not set it.
(vect_check_gather_scatter): Likewise.
* tree-vect-stmts.cc (vect_truncate_gather_scatter_offset):
Likewise.
(get_group_load_store_type): Use the vector type of the offset
SLP child. Do not re-check vect_is_simple_use validated by
SLP build.
|
|
I am at a point where I want to store additional information from
analysis (from loads and stores) to re-use them at transform stage
without repeating the analysis. I do not want to add to
stmt_vec_info at this point, so this starts adding kind specific
sub-structures by moving the STMT_VINFO_TYPE field to the SLP
tree and adding a (dummy for now) union tagged by it to receive
such data.
The change is largely mechanical after RISC-V has been prepared
to have a SLP node around.
I have settled on a union (supposed to get pointers to data).
As a followup this enables getting rid of SLP_TREE_CODE and making
VEC_PERM therein a separate type, unifying its handling.
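Roughly, per the description above (the union content is a
placeholder, matching the "dummy for now" state):

  struct _slp_tree {
    /* ... */
    enum stmt_vec_info_type type;   /* moved from _stmt_vec_info  */
    union {
      char dummy;   /* kind-specific data to come, tagged by TYPE  */
    } u;
  };
  #define SLP_TREE_TYPE(S) (S)->type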
* tree-vectorizer.h (_slp_tree::type): Add.
(_slp_tree::u): Likewise.
(_stmt_vec_info::type): Remove.
(STMT_VINFO_TYPE): Likewise.
(SLP_TREE_TYPE): New.
* tree-vectorizer.cc (vec_info::new_stmt_vec_info): Do not
initialize type.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize type.
(vect_slp_analyze_node_operations): Adjust.
(vect_schedule_slp_node): Likewise.
* tree-vect-patterns.cc (vect_init_pattern_stmt): Do not
copy STMT_VINFO_TYPE.
* tree-vect-loop.cc: Set SLP_TREE_TYPE instead of
STMT_VINFO_TYPE everywhere.
(vect_create_loop_vinfo): Do not set STMT_VINFO_TYPE on
loop conditions.
* tree-vect-stmts.cc: Set SLP_TREE_TYPE instead of
STMT_VINFO_TYPE everywhere.
(vect_analyze_stmt): Adjust.
(vect_transform_stmt): Likewise.
* config/aarch64/aarch64.cc (aarch64_vector_costs::count_ops):
Access SLP_TREE_TYPE instead of STMT_VINFO_TYPE.
* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
Remove non-SLP element-wise load/store matching.
* config/rs6000/rs6000.cc
(rs6000_cost_data::update_target_cost_per_stmt): Pass in
the SLP node. Use that to get at the memory access
kind and type.
(rs6000_cost_data::add_stmt_cost): Pass down SLP node.
* config/riscv/riscv-vector-costs.cc (variable_vectorized_p):
Use SLP_TREE_TYPE.
(costs::need_additional_vector_vars_p): Likewise.
(costs::update_local_live_ranges): Likewise.
|
|
The following removes the vect_get_vec_defs overload receiving
a vector type to be used for the possibly constant/invariant
operand. This was used for non-SLP code generation as there
constants/invariants are generated on the fly. It also elides
the stmt_vec_info and ncopies arguments, which are not required
for SLP.
* tree-vectorizer.h (vect_get_vec_defs): Remove overload
with operand vector type. Remove stmt_vec_info and
ncopies argument.
* tree-vect-stmts.cc (vect_get_vec_defs): Likewise.
(vectorizable_conversion): Adjust by not passing in
vector types, stmt_vec_info and ncopies.
(vectorizable_bswap): Likewise.
(vectorizable_assignment): Likewise.
(vectorizable_shift): Likewise.
(vectorizable_operation): Likewise.
(vectorizable_scan_store): Likewise.
(vectorizable_store): Likewise.
(vectorizable_condition): Likewise.
(vectorizable_comparison_1): Likewise.
* tree-vect-loop.cc (vect_transform_reduction): Likewise.
(vect_transform_lc_phi): Likewise.
|
|
The following removes one vect_is_simple_use overload that shouldn't
be used anymore after removing the single remaining use related
to gather handling in get_group_load_store_type. It also removes
the dual-purpose of the overload getting both SLP node and
stmt_vec_info and removes the latter argument.
That leaves us with a SLP overload handling vector code and the
stmt_info overload handling scalar code. In theory the former
is only a convenience and it should never fail given SLP build
checks the constraint already, but there's the 'op' argument
we have to get rid of first.
* tree-vectorizer.h (vect_is_simple_use): Remove stmt-info
with vectype output overload and remove stmt-info argument
from SLP based API.
* tree-vect-loop.cc (vectorizable_lane_reducing): Remove
unused def_stmt_info output argument to vect_is_simple_use.
Adjust.
* tree-vect-stmts.cc (get_group_load_store_type): Get
the gather/scatter offset vector type from the SLP child.
(vect_check_scalar_mask): Remove stmt_info argument. Adjust.
(vect_check_store_rhs): Likewise.
(vectorizable_call): Likewise.
(vectorizable_simd_clone_call): Likewise.
(vectorizable_conversion): Likewise.
(vectorizable_assignment): Likewise.
(vectorizable_shift): Likewise.
(vectorizable_operation): Likewise.
(vectorizable_load): Likewise.
(vect_is_simple_cond): Remove stmt_info argument. Adjust.
(vectorizable_condition): Likewise.
(vectorizable_comparison_1): Likewise.
(vectorizable_store): Likewise.
(vect_is_simple_use): Remove overload and non-SLP path.
|
|
The following removes the last uses of STMT_VINFO_VEC_STMTS and
the vector itself. Vector stmts are recorded in SLP nodes now.
The last use is a bit strange - it was introduced by
Richard S. in r8-6064-ga57776a1136962 and affects only
power7 and below (the re-align optimized load path). The
check should have never been true since vect_vfa_access_size
is only ever invoked before stmt transform. I have done
the "conservative" change of making it always true now
(so the code is now entered). I can as well remove it, but
I wonder if you remember anything about this ...
* tree-vectorizer.h (_stmt_vec_info::vec_stmts): Remove.
(STMT_VINFO_VEC_STMTS): Likewise.
* tree-vectorizer.cc (vec_info::new_stmt_vec_info): Do not
initialize it.
(vec_info::free_stmt_vec_info): Nor free it.
* tree-vect-data-refs.cc (vect_vfa_access_size): Remove
check on STMT_VINFO_VEC_STMTS.
|
|
The following removes the non-SLP load interleaving code which was
almost unused.
* tree-vectorizer.h (vect_transform_grouped_load): Remove.
(vect_record_grouped_load_vectors): Likewise.
* tree-vect-data-refs.cc (vect_permute_load_chain): Likewise.
(vect_shift_permute_load_chain): Likewise.
(vect_transform_grouped_load): Likewise.
(vect_record_grouped_load_vectors): Likewise.
* tree-vect-stmts.cc (vectorizable_load): Remove comments
about load interleaving.
|
|
The following removes the non-SLP store interleaving support which
was already almost unused.
* tree-vectorizer.h (vect_permute_store_chain): Remove.
* tree-vect-data-refs.cc (vect_permute_store_chain): Likewise.
* tree-vect-stmts.cc (vectorizable_store): Remove comment
about store interleaving.
|
|
This removes vect_get_vec_defs_for_operand and its remaining uses.
It also removes some remaining non-SLP paths in preparation to
elide STMT_VINFO_VEC_STMTS.
* tree-vectorizer.h (vect_get_vec_defs_for_operand): Remove.
* tree-vect-stmts.cc (vect_get_vec_defs_for_operand): Likewise.
(vect_get_vec_defs): Remove non-SLP path.
(check_load_store_for_partial_vectors): We always have an
SLP node.
(vect_check_store_rhs): Likewise.
(vect_get_gather_scatter_ops): Likewise.
(vect_create_vectorized_demotion_stmts): Likewise.
(vectorizable_store): Adjust.
(vectorizable_load): Likewise.
|
|
This VMAT was used for interleaving which was non-SLP only. The
following removes code gated by it (code selecting it is already gone).
* tree-vectorizer.h (VMAT_CONTIGUOUS_PERMUTE): Remove.
* tree-vect-stmts.cc (check_load_store_for_partial_vectors):
Remove checks on VMAT_CONTIGUOUS_PERMUTE.
(vectorizable_load): Likewise.
(vectorizable_store): Likewise. Prune dead code.
|
|
The following removes the non-SLP gimple **vec_stmt argument from
the vectorizable_* functions API. Checks on it can be replaced
by an inverted check on the passed cost_vec vector pointer.
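In other words, the analysis/transform distinction becomes (a sketch):

  /* Formerly: analysis when !vec_stmt.  Now:  */
  bool analyzing_p = (cost_vec != NULL);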
* tree-vectorizer.h (vectorizable_induction): Remove
gimple **vec_stmt argument.
(vectorizable_phi): Likewise.
(vectorizable_recurr): Likewise.
(vectorizable_early_exit): Likewise.
* tree-vect-loop.cc (vectorizable_phi): Likewise and adjust.
(vectorizable_recurr): Likewise.
(vectorizable_nonlinear_induction): Likewise.
(vectorizable_induction): Likewise.
* tree-vect-stmts.cc (vectorizable_bswap): Likewise.
(vectorizable_call): Likewise.
(vectorizable_simd_clone_call): Likewise.
(vectorizable_conversion): Likewise.
(vectorizable_assignment): Likewise.
(vectorizable_shift): Likewise.
(vectorizable_operation): Likewise.
(vectorizable_store): Likewise.
(vectorizable_load): Likewise.
(vectorizable_condition): Likewise.
(vectorizable_comparison_1): Likewise.
(vectorizable_comparison): Likewise.
(vectorizable_early_exit): Likewise.
(vect_analyze_stmt): Adjust.
(vect_transform_stmt): Likewise.
* tree-vect-slp.cc (vect_slp_analyze_operations): Adjust.
(vectorize_slp_instance_root_stmt): Likewise.
|
|
This removes the non-SLP path from vectorizable_simd_clone_call and
the then unused simd_clone_info from the stmt_vec_info structure.
* tree-vectorizer.h (_stmt_vec_info::simd_clone_info): Remove.
(STMT_VINFO_SIMD_CLONE_INFO): Likewise.
* tree-vectorizer.cc (vec_info::free_stmt_vec_info): Do not
release it.
* tree-vect-stmts.cc (vectorizable_simd_clone_call): Remove
non-SLP path.
|
|
This patch adds simple misalignment checks for gather/scatter
operations. Previously, we assumed that those perform element accesses
internally so alignment does not matter. The riscv vector spec however
explicitly states that vector operations are allowed to fault on
element-misaligned accesses. Reasonable uarchs won't, but...
For gather/scatter we have two paths in the vectorizer:
(1) Regular analysis based on datarefs. Here we can also create
strided loads.
(2) Non-affine access where each gather index is relative to the
initial address.
The assumption this patch works on is that once the alignment for the
first scalar is correct, all others will fall in line, as the index is
always a multiple of the first element's size.
For (1) we have a dataref and can check it for alignment as in other
cases. For (2) this patch checks the object alignment of BASE and
compares it against the natural alignment of the current vectype's unit.
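A sketch of the check for (2), using GCC's existing alignment queries
(illustrative; the exact condition in the patch may differ):

  /* Known alignment of the gather base vs. the natural alignment
     of one vector element, both in bits.  */
  unsigned int base_align = get_object_alignment (base);
  unsigned int elem_align = TYPE_ALIGN (TREE_TYPE (vectype));
  bool element_aligned_p = base_align >= elem_align;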
The patch also adds a pointer argument to the gather/scatter IFNs that
contains the necessary alignment. Most of the patch is thus mechanical
in that it merely adjusts indices.
I tested the riscv version with a custom qemu version that faults on
element-misaligned vector accesses. With this patch applied, there is
just a single fault left, which is due to PR120782 and which will be
addressed separately.
Bootstrapped and regtested on x86 and aarch64. Regtested on
rv64gcv_zvl512b with and without unaligned vector support.
gcc/ChangeLog:
* internal-fn.cc (internal_fn_len_index): Adjust indices for new
alias_ptr param.
(internal_fn_else_index): Ditto.
(internal_fn_mask_index): Ditto.
(internal_fn_stored_value_index): Ditto.
(internal_fn_alias_ptr_index): Ditto.
(internal_fn_offset_index): Ditto.
(internal_fn_scale_index): Ditto.
(internal_gather_scatter_fn_supported_p): Ditto.
* internal-fn.h (internal_fn_alias_ptr_index): Ditto.
* optabs-query.cc (supports_vec_gather_load_p): Ditto.
* tree-vect-data-refs.cc (vect_check_gather_scatter): Add alias
pointer.
* tree-vect-patterns.cc (vect_recog_gather_scatter_pattern): Add
alias pointer.
* tree-vect-slp.cc (vect_get_operand_map): Adjust for alias
pointer.
* tree-vect-stmts.cc (vect_truncate_gather_scatter_offset): Add
alias pointer and misalignment handling.
(get_load_store_type): Move from here...
(get_group_load_store_type): ...to here.
(vectorizable_store): Add alias pointer.
(vectorizable_load): Ditto.
* tree-vectorizer.h (struct gather_scatter_info): Ditto.
|
|
This encapsulates the IFN and the builtin-function way of handling
gather/scatter via three defines:
GATHER_SCATTER_IFN_P
GATHER_SCATTER_LEGACY_P
GATHER_SCATTER_EMULATED_P
and introduces a helper define for SLP operand handling as well.
gcc/ChangeLog:
* tree-vect-slp.cc (GATHER_SCATTER_OFFSET): New define.
(vect_get_and_check_slp_defs): Use.
* tree-vectorizer.h (GATHER_SCATTER_LEGACY_P): New define.
(GATHER_SCATTER_IFN_P): Ditto.
(GATHER_SCATTER_EMULATED_P): Ditto.
* tree-vect-stmts.cc (vectorizable_store): Use.
(vectorizable_load): Use.
|
|
The following removes the minimum VF compute from dataref analysis
which does not take into account SLP at all, leaving the testcase
vectorized with V2SImode instead of V4SImode on x86. With SLP
the only minimum VF we can compute this early is 1.
* tree-vectorizer.h (vect_analyze_data_refs): Remove min_vf
output.
* tree-vect-data-refs.cc (vect_analyze_data_refs): Likewise.
* tree-vect-loop.cc (vect_analyze_loop_2): Remove early
out based on bogus min_vf.
* tree-vect-slp.cc (vect_slp_analyze_bb_1): Adjust.
* gcc.dg/vect/vect-127.c: New testcase.
|
|
After vect_analyze_loop_operations is gone we can clean up
vect_analyze_stmt as it is no longer called out of SLP context.
* tree-vectorizer.h (vect_analyze_stmt): Remove stmt-info
and need_to_vectorize arguments.
* tree-vect-slp.cc (vect_slp_analyze_node_operations_1):
Adjust.
* tree-vect-stmts.cc (can_vectorize_live_stmts): Remove
stmt_info argument and remove non-SLP path.
(vect_analyze_stmt): Remove stmt_info and need_to_vectorize
argument and prune paths no longer reachable.
(vect_transform_stmt): Adjust.
|
|
The following removes the VF determining step from non-SLP stmts.
For now we keep setting STMT_VINFO_VECTYPE for all stmts; there are
too many places to fix, including some more complicated ones, so
this is deferred to a followup.
Along the way this removes vect_update_vf_for_slp, merging the check
for present hybrid SLP stmts into vect_detect_hybrid_slp and failing
analysis early. This also removes the need to essentially duplicate
this check in the stmt walk of vect_analyze_loop_operations. Getting
rid of that, and performing some other checks earlier, is also
deferred to a followup.
* tree-vect-loop.cc (vect_determine_vf_for_stmt_1): Rename
to ...
(vect_determine_vectype_for_stmt_1): ... this and only set
STMT_VINFO_VECTYPE. Fail for single-element vector types.
(vect_determine_vf_for_stmt): Rename to ...
(vect_determine_vectype_for_stmt): ... this and only set
STMT_VINFO_VECTYPE. Fail for single-element vector types.
(vect_determine_vectorization_factor): Rename to ...
(vect_set_stmts_vectype): ... this and only set STMT_VINFO_VECTYPE.
(vect_update_vf_for_slp): Remove.
(vect_analyze_loop_operations): Remove walk over stmts.
(vect_analyze_loop_2): Call vect_set_stmts_vectype instead of
vect_determine_vectorization_factor. Set vectorization factor
from LOOP_VINFO_SLP_UNROLLING_FACTOR. Fail if vect_detect_hybrid_slp
detects hybrid stmts or when vect_make_slp_decision finds
nothing to SLP.
* tree-vect-slp.cc (vect_detect_hybrid_slp): Move check
whether we have any hybrid stmts here from vect_update_vf_for_slp.
* tree-vect-stmts.cc (vect_analyze_stmt): Remove loop over
stmts.
* tree-vectorizer.h (vect_detect_hybrid_slp): Update.
|
|
Targets recently got the ability to request the vector mode to be
used for a vector epilogue (or the epilogue of a vector epilogue). The
following adds the ability for it to indicate the epilogue should use
loop masking, irrespective of the --param vect-partial-vector-usage
default setting.
The patch below uses a separate flag from the epilogue mode, not
addressing the issue that on x86 the vector_modes mode iteration
hook would not allow for both masked and unmasked variants to be
tried and costed given this doesn't naturally map to modes on
that target. That's left for a future exercise - turning on
cost comparison for the x86 backend would be a prerequisite there.
* tree-vectorizer.h (vector_costs::suggested_epilogue_mode):
Add masked output parameter and return m_masked_epilogue.
(vector_costs::m_masked_epilogue): New tristate flag.
(vector_costs::vector_costs): Initialize m_masked_epilogue.
* tree-vect-loop.cc (vect_analyze_loop_1): Pass in masked
flag to optionally initialize can_use_partial_vectors_p.
(vect_analyze_loop): For epilogues also get whether to use
a masked epilogue for this loop from the target and use
that for the first epilogue mode we try.
|
|
The following avoids re-analyzing the loop as epilogue when not
using partial vectors and the mode is the same as the autodetected
vector mode and that has a too high VF for a non-predicated loop.
This situation occurs almost always on x86 and saves us one
re-analysis unless --param vect-partial-vector-usage is non-default.
* tree-vectorizer.h (vect_chooses_same_modes_p): New
overload.
* tree-vect-stmts.cc (vect_chooses_same_modes_p): Likewise.
* tree-vect-loop.cc (vect_analyze_loop): Prune epilogue
analysis further when not using partial vectors.
|
|
The following allows SLP build to succeed when mixing .FMA/.FMS
in different lanes like we handle mixed plus/minus. This does not
yet address SLP pattern matching not being able to form
a FMADDSUB from this.
PR tree-optimization/120808
* tree-vectorizer.h (compatible_calls_p): Add flag to
indicate a FMA/FMS pair is allowed.
* tree-vect-slp.cc (compatible_calls_p): Likewise.
(vect_build_slp_tree_1): Allow mixed .FMA/.FMS as two-operator.
(vect_build_slp_tree_2): Handle calls in two-operator SLP build.
* tree-vect-slp-patterns.cc (compatible_complex_nodes_p):
Adjust.
* gcc.dg/vect/bb-slp-pr120808.c: New testcase.
|
|
Consider the loop
void f1 (int *restrict a, int n)
{
#pragma GCC unroll 4 requested
for (int i = 0; i < n; i++)
a[i] *= 2;
}
Today this loop is vectorized and then unrolled 3x by the RTL unroller due to the
use of the pragma. This is unfortunate because the pragma was intended for the
scalar loop but we end up with an unrolled vector loop and a longer path to the
entry which has a low enough VF requirement to enter.
This patch instead seeds the suggested_unroll_factor with the value the user
requested and instead uses it to maintain the total VF that the user wanted the
scalar loop to maintain.
In effect it applies the unrolling inside the vector loop itself. This has the
benefits for things like reductions, as it allows us to split the accumulator
and so the unrolled loop is more efficient. For early-break it allows the
cbranch call to be shared between the unrolled elements, giving you more
effective unrolling because it doesn't need the repeated cbranch which can be
expensive.
The target can then choose to create multiple epilogues to deal with the "rest".
The example above now generates:
.L4:
ldr q31, [x2]
add v31.4s, v31.4s, v31.4s
str q31, [x2], 16
cmp x2, x3
bne .L4
as V4SI maintains the requested VF, but e.g. pragma unroll 8 generates:
.L4:
ldp q30, q31, [x2]
add v30.4s, v30.4s, v30.4s
add v31.4s, v31.4s, v31.4s
stp q30, q31, [x2], 32
cmp x3, x2
bne .L4
gcc/ChangeLog:
* doc/extend.texi: Document pragma unroll interaction with vectorizer.
* tree-vectorizer.h (LOOP_VINFO_USER_UNROLL): New.
(class _loop_vec_info): Add user_unroll.
* tree-vect-loop.cc (vect_analyze_loop_1): Set
suggested_unroll_factor and retry.
(_loop_vec_info::_loop_vec_info): Initialize user_unroll.
(vect_transform_loop): Clear the loop->unroll value if the pragma was
used.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/unroll-vect.c: New test.
|
|
Current GCC uses either peeling or versioning, but not in combination,
to handle unaligned data references (DRs) during vectorization. This
limitation causes some loops with early break to fall back to scalar
code at runtime.
Consider the following loop with DRs in its early break condition:
for (int i = start; i < end; i++) {
if (a[i] == b[i])
break;
count++;
}
In the loop, references to a[] and b[] need to be strictly aligned for
vectorization because speculative reads that may cross page boundaries
are not allowed. Current GCC does versioning for this loop by creating a
runtime check like:
((&a[start] | &b[start]) & mask) == 0
to see if two initial addresses both have lower bits zeros. If above
runtime check fails, the loop will fall back to scalar code. However,
it's often possible that DRs are all unaligned at the beginning but they
become all aligned after a few loop iterations. We call this situation
DRs being "mutually aligned".
This patch enables combined peeling and versioning to avoid loops with
mutually aligned DRs falling back to scalar code. Specifically, the
function vect_peeling_supportable is updated in this patch to return a
three-state enum indicating how peeling can make all unsupportable DRs
aligned. In addition to previous true/false return values, a new state
peeling_maybe_supported is used to indicate that peeling may be able to
make these DRs aligned but we are not sure about it at compile time. In
this case, peeling should be combined with versioning so that a runtime
check will be generated to guard the peeled vectorized loop.
A new type of runtime check is also introduced for combined peeling and
versioning. It's enabled when LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT is true.
The new check tests if all DRs recorded in LOOP_VINFO_MAY_MISALIGN_STMTS
have the same lower address bits. For above loop case, the new test will
generate an XOR between two addresses, like:
((&a[start] ^ &b[start]) & mask) == 0
Therefore, if a and b have the same alignment step (element size) and
the same offset from an alignment boundary, a peeled vectorized loop
will run. This new runtime check also works for >2 DRs, with the LHS
expression being:
((a1 ^ a2) | (a2 ^ a3) | (a3 ^ a4) | ... | (an-1 ^ an)) & mask
where ai is the address of i'th DR.
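For illustration, the two checks side by side (plain C-level sketch,
not the IR the vectorizer emits):

  /* Old check: both start addresses individually aligned.  */
  bool both_aligned
    = (((uintptr_t) &a[start] | (uintptr_t) &b[start]) & mask) == 0;
  /* New check: equal low bits, i.e. mutually aligned.  */
  bool mutually_aligned
    = (((uintptr_t) &a[start] ^ (uintptr_t) &b[start]) & mask) == 0;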
This patch is bootstrapped and regression tested on x86_64-linux-gnu,
arm-linux-gnueabihf and aarch64-linux-gnu.
gcc/ChangeLog:
* tree-vect-data-refs.cc (vect_peeling_supportable): Return new
enum values to indicate if combined peeling and versioning can
potentially support vectorization.
(vect_enhance_data_refs_alignment): Support combined peeling and
versioning in vectorization analysis.
* tree-vect-loop-manip.cc (vect_create_cond_for_align_checks):
Add a new type of runtime check for mutually aligned DRs.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Set
default value of allow_mutual_alignment in the initializer list.
* tree-vectorizer.h (enum peeling_support): Define type of
peeling support for function vect_peeling_supportable.
(LOOP_VINFO_ALLOW_MUTUAL_ALIGNMENT): New access macro.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-early-break_133_pfa6.c: Adjust test.
|
|
The following changes the record_stmt_cost calls in
vectorizable_load/store to only pass the SLP node when costing
vector stmts. For now we'll still pass the stmt_vec_info,
determined from SLP_TREE_REPRESENTATIVE, so this merely cleans up
the API.
* tree-vectorizer.h (record_stmt_cost): Remove mixed
stmt_vec_info/SLP node inline overload.
* tree-vect-stmts.cc (vectorizable_store): For costing
vector stmts only pass SLP node to record_stmt_cost.
(vectorizable_load): Likewise.
|
|
As part of the vector cost API cleanup this transitions
vect_model_simple_cost to only record costs with SLP node.
For this to work the patch adds an overload to record_stmt_cost
only passing in the SLP node.
The vect_prologue_cost_for_slp adjustment is one spot that
needs an eye with regard to re-doing the whole thing.
* tree-vectorizer.h (record_stmt_cost): Add overload with
only SLP node and no vector type.
* tree-vect-stmts.cc (record_stmt_cost): Use
SLP_TREE_REPRESENTATIVE for stmt_vec_info.
(vect_model_simple_cost): Do not get stmt_vec_info argument
and adjust.
(vectorizable_call): Adjust.
(vectorizable_simd_clone_call): Likewise.
(vectorizable_conversion): Likewise.
(vectorizable_assignment): Likewise.
(vectorizable_shift): Likewise.
(vectorizable_operation): Likewise.
(vectorizable_condition): Likewise.
(vectorizable_comparison_1): Likewise.
* tree-vect-slp.cc (vect_prologue_cost_for_slp): Use
full-blown record_stmt_cost.
|
|
Fold slp_node to TRUE and clean up vectorizable_reduction and related functions.
Also split up vectorizable_lc_phi and create vect_transform_lc_phi.
gcc/ChangeLog:
* tree-vect-loop.cc (get_initial_def_for_reduction): Remove.
(vect_create_epilog_for_reduction): Remove non-SLP path.
(vectorize_fold_left_reduction): Likewise.
(vectorizable_lane_reducing): Likewise.
(vectorizable_reduction): Likewise.
(vect_transform_reduction): Likewise.
(vect_transform_cycle_phi): Likewise.
(vectorizable_lc_phi): Remove non-SLP path and split into...
(vect_transform_lc_phi): ... this.
(update_epilogue_loop_vinfo): Update comment.
* tree-vect-stmts.cc (vect_analyze_stmt): Update call to
vectorizable_lc_phi.
(vect_transform_stmt): Update calls to vect_transform_reduction and
vect_transform_cycle_phi. Rename call from vectorizable_lc_phi to
vect_transform_lc_phi.
* tree-vectorizer.h (vect_transform_reduction): Update declaration.
(vect_transform_cycle_phi): Likewise.
(vectorizable_lc_phi): Likewise.
(vect_transform_lc_phi): New.
|
|
The following tries to address the case of us BB vectorizing a loop
body that swaps consecutive elements of an array, as in bubble-sort. This
causes the vector store in the previous iteration to fail to forward
to the vector load in the current iteration since there's a partial
overlap.
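The classic offender is a bubble-sort style inner loop (sketch); the
two-element store in one iteration partially overlaps the two-element
load in the next:

  for (int i = 0; i < n - 1; ++i)
    if (a[i] > a[i + 1])
      {
        int tmp = a[i + 1];   /* BB SLP turns the swap into one      */
        a[i + 1] = a[i];      /* two-element load, a permute and one */
        a[i] = tmp;           /* two-element store per iteration.    */
      }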
We try to detect this situation by looking for a load to store
data dependence and analyze this with respect to the containing loop
for a proven problematic access. Currently the search for a
problematic pair is limited to loads and stores in the same SLP
instance which means the problematic load happens in the next
loop iteration and larger dependence distances are not considered.
On x86 with generic costing this avoids vectorizing the loop body,
but once you do core-specific tuning the saved cost for the vector
store vs. the scalar stores makes vectorization still profitable,
but at least the STLF issue is avoided.
For example on my Zen4 machine with -O2 -march=znver4 the testcase in
the PR is improving from
insertion_sort => 2327
to
insertion_sort => 997
but plain -O2 (or -fno-tree-slp-vectorize) gives
insertion_sort => 183
In the end a better target-side cost model for small vector
vectorization is needed to reject this vectorization from this side.
I'll note this is a machine-independent heuristic (similar to the
avoid-store-forwarding RTL optimization pass); I expect that uarchs
implementing vectors will suffer from this kind of issue. I know
some aarch64 uarchs can forward from upper/lower part stores, this
isn't considered at the moment. The actual vector size/overlap
distance check could be moved to a target hook if it turns out
necessary.
There might be the chance to use a smaller vector size for the loads
avoiding the penalty rather than falling back to elementwise accesses,
that's not implemented either.
PR tree-optimization/115777
* tree-vectorizer.h (_slp_tree::avoid_stlf_fail): New member.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize it.
(vect_print_slp_tree): Dump it.
* tree-vect-data-refs.cc (vect_slp_analyze_instance_dependence):
For dataflow dependent loads of a store check whether there's
a cross-iteration data dependence that for sure prohibits
store-to-load forwarding and mark involved loads.
* tree-vect-stmts.cc (get_group_load_store_type): For avoid_stlf_fail
marked loads use VMAT_ELEMENTWISE.
* gcc.dg/vect/bb-slp-pr115777.c: New testcase.
|
|
r15-9859-ga6cfde60d8c added a call to dominated_by_p to tree-vectorizer.h
but dominance.h is not always included, so you get a build failure on
riscv when building riscv-vector-costs.cc.
Let's add the include of dominance.h to tree-vectorizer.h
Pushed as obvious after builds for riscv and x86_64.
gcc/ChangeLog:
PR target/120042
* tree-vectorizer.h: Include dominance.h.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
|
|
The following makes get_later_stmt handle stmts from different
basic-blocks in the case they are ordered and otherwise asserts.
* tree-vectorizer.h (get_later_stmt): Robustify against
stmts in different BBs, assert when they are unordered.
|
|
The following example:
#define N 512
#define START 2
#define END 505
int x[N] __attribute__((aligned(32)));
int __attribute__((noipa))
foo (void)
{
for (signed int i = START; i < END; ++i)
{
if (x[i] == 0)
return i;
}
return -1;
}
generates incorrect code with fixed length SVE because for early break we need
to know which value to start the scalar loop with if we take an early exit.
Historically this means that we take the first element of every induction.
this is because there's an assumption in place, that even with masked loops the
masks come from a whilel* instruction.
As such we reduce using a BIT_FIELD_REF <, 0>.
When PFA was added this assumption was correct for non-masked loops;
however, we assumed that PFA for VLA wouldn't work for now, and disabled
it using the alignment requirement checks. We also expected VLS to PFA
using scalar loops.
However as this PR shows, for VLS the vectorizer can, and in some
circumstances does, choose to peel using masks by masking the first
iteration of the loop with an additional alignment mask.
When this is done, the first elements of the predicate can be inactive.
In this example element 1 is inactive based on the calculated
misalignment, hence the -1 value in the first vector IV element.
When we reduce using BIT_FIELD_REF we get the wrong value.
This patch updates it by creating a new scalar PHI that keeps track of whether
we are the first iteration of the loop (with the additional masking) or whether
we have taken a loop iteration already.
The generated sequence:
pre-header:
bb1:
i_1 = <number of leading inactive elements>
header:
bb2:
i_2 = PHI <i_1(bb1), 0(latch)>
…
early-exit:
bb3:
i_3 = iv_step * i_2 + PHI<vector-iv>
Which eliminates the need to do an expensive mask based reduction.
This fixes gromacs with one OpenMP thread. But with > 1 there is still an issue.
gcc/ChangeLog:
PR tree-optimization/119351
* tree-vectorizer.h (LOOP_VINFO_MASK_NITERS_PFA_OFFSET,
LOOP_VINFO_NON_LINEAR_IV): New.
(class _loop_vec_info): Add mask_skip_niters_pfa_offset and
nonlinear_iv.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialize them.
(vect_analyze_scalar_cycles_1): Record non-linear inductions.
(vectorizable_induction): If early break and PFA using masking create a
new phi which tracks where the scalar code needs to start...
(vectorizable_live_operation): ...and generate the adjustments here.
(vect_use_loop_mask_for_alignment_p): Reject non-linear inductions and
early break needing peeling.
gcc/testsuite/ChangeLog:
PR tree-optimization/119351
* gcc.target/aarch64/sve/peel_ind_10.c: New test.
* gcc.target/aarch64/sve/peel_ind_10_run.c: New test.
* gcc.target/aarch64/sve/peel_ind_5.c: New test.
* gcc.target/aarch64/sve/peel_ind_5_run.c: New test.
* gcc.target/aarch64/sve/peel_ind_6.c: New test.
* gcc.target/aarch64/sve/peel_ind_6_run.c: New test.
* gcc.target/aarch64/sve/peel_ind_7.c: New test.
* gcc.target/aarch64/sve/peel_ind_7_run.c: New test.
* gcc.target/aarch64/sve/peel_ind_8.c: New test.
* gcc.target/aarch64/sve/peel_ind_8_run.c: New test.
* gcc.target/aarch64/sve/peel_ind_9.c: New test.
* gcc.target/aarch64/sve/peel_ind_9_run.c: New test.
|
|
This fixes two PRs on early-break vectorization by delaying the safety checks to
vectorizable_load when the VF, VMAT and vectype are all known.
This patch does add two new restrictions:
1. On LOAD_LANES targets, where the buffer size is known, we reject
non-power-of-two group sizes, as they are unaligned every other iteration and so may
cross a page unwittingly. For those cases require partial masking support.
2. On LOAD_LANES targets when the buffer is unknown, we reject vectorization if
we cannot peel for alignment, as the alignment requirement is quite large at
GROUP_SIZE * vectype_size. This is unlikely to ever be beneficial so we
don't support it for now.
There are other steps documented inside the code itself so that the reasoning
is next to the code.
As a fall-back, when the alignment fails we require partial vector support.
For VLA targets like SVE we return element alignment as the desired
vector alignment. This means that the loads are never misaligned and
so, annoyingly, it won't ever need to peel.
So what I think needs to happen in GCC 16 is the following:
1. during vect_compute_data_ref_alignment we need to take the max of
POLY_VALUE_MIN and vector_alignment.
2. vect_do_peeling define skip_vector when PFA for VLA, and in the guard add a
check that ncopies * vectype does not exceed POLY_VALUE_MAX which we use as a
proxy for pagesize.
3. Force LOOP_VINFO_USING_PARTIAL_VECTORS_P to be true in
vect_determine_partial_vectors_and_peeling since the first iteration has to
be partial. Require LOOP_VINFO_MUST_USE_PARTIAL_VECTORS_P otherwise we have
to fail to vectorize.
4. Create a default mask to be used, so that vect_use_loop_mask_for_alignment_p
becomes true and we generate the peeled check through loop control for
partial loops. From what I can tell this won't work for
LOOP_VINFO_FULLY_WITH_LENGTH_P since they don't have any peeling support at
all in the compiler. That would need to be done independently from the
above.
In any case, not GCC 15 material so I've kept the WIP patches I have downstream.
Bootstrapped Regtested on aarch64-none-linux-gnu,
arm-none-linux-gnueabihf, x86_64-pc-linux-gnu
-m32, -m64 and no issues.
gcc/ChangeLog:
PR tree-optimization/118464
PR tree-optimization/116855
* doc/invoke.texi (min-pagesize): Update docs with vectorizer use.
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): Delay
checks.
(vect_compute_data_ref_alignment): Remove alignment checks and move to
get_load_store_type, increase group access alignment.
(vect_enhance_data_refs_alignment): Add note to comment needing
investigating.
(vect_analyze_data_refs_alignment): Likewise.
(vect_supportable_dr_alignment): For group loads look at first DR.
* tree-vect-stmts.cc (get_load_store_type):
Perform safety checks for early break pfa.
* tree-vectorizer.h (dr_set_safe_speculative_read_required,
dr_safe_speculative_read_required, DR_SCALAR_KNOWN_BOUNDS): New.
(need_peeling_for_alignment): Renamed to...
(safe_speculative_read_required): .. This
(class dr_vec_info): Add scalar_access_known_in_bounds.
gcc/testsuite/ChangeLog:
PR tree-optimization/118464
PR tree-optimization/116855
* gcc.dg/vect/bb-slp-pr65935.c: Update, it now vectorizes because the
load type is relaxed later.
* gcc.dg/vect/vect-early-break_121-pr114081.c: Update.
* gcc.dg/vect/vect-early-break_22.c: Require partial vectors.
* gcc.dg/vect/vect-early-break_128.c: Likewise.
* gcc.dg/vect/vect-early-break_26.c: Likewise.
* gcc.dg/vect/vect-early-break_43.c: Likewise.
* gcc.dg/vect/vect-early-break_44.c: Likewise.
* gcc.dg/vect/vect-early-break_2.c: Require load_lanes.
* gcc.dg/vect/vect-early-break_7.c: Likewise.
* gcc.dg/vect/vect-early-break_132-pr118464.c: New test.
* gcc.dg/vect/vect-early-break_133_pfa1.c: New test.
* gcc.dg/vect/vect-early-break_133_pfa11.c: New test.
* gcc.dg/vect/vect-early-break_133_pfa10.c: New test.
* gcc.dg/vect/vect-early-break_133_pfa2.c: New test.
* gcc.dg/vect/vect-early-break_133_pfa3.c: New test.
* gcc.dg/vect/vect-early-break_133_pfa4.c: New test.
* gcc.dg/vect/vect-early-break_133_pfa5.c: New test.
* gcc.dg/vect/vect-early-break_133_pfa6.c: New test.
* gcc.dg/vect/vect-early-break_133_pfa7.c: New test.
* gcc.dg/vect/vect-early-break_133_pfa8.c: New test.
* gcc.dg/vect/vect-early-break_133_pfa9.c: New test.
* gcc.dg/vect/vect-early-break_39.c: Update testcase for misalignment.
* gcc.dg/vect/vect-early-break_18.c: Likewise.
* gcc.dg/vect/vect-early-break_20.c: Likewise.
* gcc.dg/vect/vect-early-break_21.c: Likewise.
* gcc.dg/vect/vect-early-break_38.c: Likewise.
* gcc.dg/vect/vect-early-break_6.c: Likewise.
* gcc.dg/vect/vect-early-break_53.c: Likewise.
* gcc.dg/vect/vect-early-break_56.c: Likewise.
* gcc.dg/vect/vect-early-break_57.c: Likewise.
* gcc.dg/vect/vect-early-break_81.c: Likewise.
|