diff options
author | Richard Sandiford <richard.sandiford@arm.com> | 2021-07-13 10:17:43 +0100 |
---|---|---|
committer | Richard Sandiford <richard.sandiford@arm.com> | 2021-07-13 10:17:43 +0100 |
commit | 1583b8bff0be7e41aa721dde79f90ca0763bd4e2 (patch) | |
tree | d2f95d2ac56f2508877f03cd54c3be109b1fed4d /gcc/tree-vectorizer.h | |
parent | 7670b6633e51afbbc4b3c8a5775accf7f2d887af (diff) | |
download | gcc-1583b8bff0be7e41aa721dde79f90ca0763bd4e2.zip gcc-1583b8bff0be7e41aa721dde79f90ca0763bd4e2.tar.gz gcc-1583b8bff0be7e41aa721dde79f90ca0763bd4e2.tar.bz2 |
vect: Reuse reduction accumulators between loops
This patch adds support for reusing a main loop's reduction accumulator
in an epilogue loop. This in turn lets the loops share a single piece
of vector->scalar reduction code.
The patch has the following restrictions:
(1) The epilogue reduction can only operate on a single vector
(e.g. ncopies must be 1 for non-SLP reductions, and the group size
must be <= the element count for SLP reductions).
(2) Both loops must use the same vector mode for their accumulators.
This means that the patch is restricted to targets that support
--param vect-partial-vector-usage=1.
(3) The reduction must be a standard “tree code” reduction.
However, these restrictions could be lifted in future. For example,
if the main loop operates on 128-bit vectors and the epilogue loop
operates on 64-bit vectors, we could in future reduce the 128-bit
vector by one stage and use the 64-bit result as the starting point
for the epilogue result.
The patch tries to handle chained SLP reductions, unchained SLP
reductions and non-SLP reductions. It also handles cases in which
the epilogue loop is entered directly (rather than via the main loop)
and cases in which the epilogue loop can be skipped.
vect_get_main_loop_result is a bit more general than the current
patch needs.
gcc/
* tree-vectorizer.h (vect_reusable_accumulator): New structure.
(_loop_vec_info::main_loop_edge): New field.
(_loop_vec_info::skip_main_loop_edge): Likewise.
(_loop_vec_info::skip_this_loop_edge): Likewise.
(_loop_vec_info::reusable_accumulators): Likewise.
(_stmt_vec_info::reduc_scalar_results): Likewise.
(_stmt_vec_info::reused_accumulator): Likewise.
(vect_get_main_loop_result): Declare.
* tree-vectorizer.c (vec_info::new_stmt_vec_info): Initialize
reduc_scalar_inputs.
(vec_info::free_stmt_vec_info): Free reduc_scalar_inputs.
* tree-vect-loop-manip.c (vect_get_main_loop_result): New function.
(vect_do_peeling): Fill an epilogue loop's main_loop_edge,
skip_main_loop_edge and skip_this_loop_edge fields.
* tree-vect-loop.c (INCLUDE_ALGORITHM): Define.
(vect_emit_reduction_init_stmts): New function.
(get_initial_def_for_reduction): Use it.
(get_initial_defs_for_reduction): Likewise. Change the vinfo
parameter to a loop_vec_info.
(vect_create_epilog_for_reduction): Store the scalar results
in the reduc_info. If an epilogue loop is reusing an accumulator
from the main loop, and if the epilogue loop can also be skipped,
try to place the reduction code in the join block. Record
accumulators that could potentially be reused by epilogue loops.
(vect_transform_cycle_phi): When vectorizing epilogue loops,
try to reuse accumulators from the main loop. Record the initial
value in reduc_info for non-SLP reductions too.
gcc/testsuite/
* gcc.target/aarch64/sve/reduc_9.c: New test.
* gcc.target/aarch64/sve/reduc_9_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_10.c: Likewise.
* gcc.target/aarch64/sve/reduc_10_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_11.c: Likewise.
* gcc.target/aarch64/sve/reduc_11_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_12.c: Likewise.
* gcc.target/aarch64/sve/reduc_12_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_13.c: Likewise.
* gcc.target/aarch64/sve/reduc_13_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_14.c: Likewise.
* gcc.target/aarch64/sve/reduc_14_run.c: Likewise.
* gcc.target/aarch64/sve/reduc_15.c: Likewise.
* gcc.target/aarch64/sve/reduc_15_run.c: Likewise.
Diffstat (limited to 'gcc/tree-vectorizer.h')
-rw-r--r-- | gcc/tree-vectorizer.h | 56 |
1 files changed, 50 insertions, 6 deletions
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index e2fd360..d825b0c 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -551,6 +551,18 @@ typedef auto_vec<rgroup_controls> vec_loop_lens; typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec; +/* Information about a reduction accumulator from the main loop that could + conceivably be reused as the input to a reduction in an epilogue loop. */ +struct vect_reusable_accumulator { + /* The final value of the accumulator, which forms the input to the + reduction operation. */ + tree reduc_input; + + /* The stmt_vec_info that describes the reduction (i.e. the one for + which is_reduc_info is true). */ + stmt_vec_info reduc_info; +}; + /*-----------------------------------------------------------------*/ /* Info on vectorized loops. */ /*-----------------------------------------------------------------*/ @@ -588,6 +600,26 @@ public: /* Unrolling factor */ poly_uint64 vectorization_factor; + /* If this loop is an epilogue loop whose main loop can be skipped, + MAIN_LOOP_EDGE is the edge from the main loop to this loop's + preheader. SKIP_MAIN_LOOP_EDGE is then the edge that skips the + main loop and goes straight to this loop's preheader. + + Both fields are null otherwise. */ + edge main_loop_edge; + edge skip_main_loop_edge; + + /* If this loop is an epilogue loop that might be skipped after executing + the main loop, this edge is the one that skips the epilogue. */ + edge skip_this_loop_edge; + + /* The vectorized form of a standard reduction replaces the original + scalar code's final result (a loop-closed SSA PHI) with the result + of a vector-to-scalar reduction operation. After vectorization, + this variable maps these vector-to-scalar results to information + about the reductions that generated them. */ + hash_map<tree, vect_reusable_accumulator> reusable_accumulators; + /* Maximum runtime vectorization factor, or MAX_VECTORIZATION_FACTOR if there is no particular limit. */ unsigned HOST_WIDE_INT max_vectorization_factor; @@ -1186,6 +1218,23 @@ public: /* The vector type for performing the actual reduction. */ tree reduc_vectype; + /* If IS_REDUC_INFO is true and if the vector code is performing + N scalar reductions in parallel, this variable gives the initial + scalar values of those N reductions. */ + vec<tree> reduc_initial_values; + + /* If IS_REDUC_INFO is true and if the vector code is performing + N scalar reductions in parallel, this variable gives the vectorized code's + final (scalar) result for each of those N reductions. In other words, + REDUC_SCALAR_RESULTS[I] replaces the original scalar code's loop-closed + SSA PHI for reduction number I. */ + vec<tree> reduc_scalar_results; + + /* Only meaningful if IS_REDUC_INFO. If non-null, the reduction is + being performed by an epilogue loop and we have decided to reuse + this accumulator from the main loop. */ + vect_reusable_accumulator *reused_accumulator; + /* Whether we force a single cycle PHI during reduction vectorization. */ bool force_single_cycle; @@ -1382,12 +1431,6 @@ vect_phi_initial_value (gphi *phi) return PHI_ARG_DEF_FROM_EDGE (phi, pe); } -static inline tree -vect_phi_initial_value (stmt_vec_info stmt_info) -{ - return vect_phi_initial_value (as_a <gphi *> (stmt_info->stmt)); -} - /* Return true if STMT_INFO should produce a vector mask type rather than a normal nonmask type. */ @@ -1818,6 +1861,7 @@ class loop *vect_loop_versioning (loop_vec_info, gimple *); extern class loop *vect_do_peeling (loop_vec_info, tree, tree, tree *, tree *, tree *, int, bool, bool, tree *); +extern tree vect_get_main_loop_result (loop_vec_info, tree, tree); extern void vect_prepare_for_masked_peels (loop_vec_info); extern dump_user_location_t find_loop_location (class loop *); extern bool vect_can_advance_ivs_p (loop_vec_info); |