author	Tamar Christina <tamar.christina@arm.com>	2023-12-24 19:18:12 +0000
committer	Tamar Christina <tamar.christina@arm.com>	2023-12-24 19:29:32 +0000
commit	01f4251b8775c832a92d55e2df57c9ac72eaceef (patch)
tree	d422c6ed538b9f7f09d216409abaf768b6196fa4 /gcc/tree-vect-patterns.cc
parent	f1dcc0fe371e3cb10d2cbe3f6c88db6f72edddda (diff)
middle-end: Support vectorization of loops with multiple exits.
Hi All,
This patch adds initial support for early break vectorization in GCC. In other
words it implements support for vectorization of loops with multiple exits.
The support is added for any target that implements a vector cbranch optab;
this includes both fully masked and non-masked targets.
Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR/Bitwise AND. This is however only checked
when the comparison would produce multiple statements.
This also fully decouples the vectorizer's notion of exit from the existing loop
infrastructure's exit. Before this patch the vectorizer always picked the
natural loop latch-connected exit as the main exit.
After this patch the vectorizer is free to choose any exit it deems appropriate
as the main exit. This means that even if the main exit is not countable (i.e.
the termination condition could not be determined) we might still be able to
vectorize should one of the other exits be countable.
In such situations the loop is reflowed, which enables vectorization of many
other loop forms.
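As a purely illustrative, hand-written example of this (the names and bounds
are mine, not from the patch): the latch-connected exit below depends on
loaded data and is not countable, but the early break is a simple counted
condition the vectorizer can now pick as the main exit:

  unsigned a[1024], b[1024];
  unsigned
  f (unsigned x)
  {
    for (int i = 0; a[i] > x; i++)   /* latch exit: not countable */
      {
        if (i == 512)                /* early exit: countable */
          break;
        b[i] = a[i] + x;
      }
    return 0;
  }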
Concretely the kind of loops supported are of the forms:
  for (int i = 0; i < N; i++)
    {
      <statements1>
      if (<condition>)
        {
          ...
          <action>;
        }
      <statements2>
    }
where <action> can be:
- break
- return
- goto
Any number of statements can be used before the <action> occurs.
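For instance, the 'return' form (a hand-written illustration, not taken from
the patch or its testsuite) looks like:

  unsigned a[1024];
  int
  first_above (unsigned x)
  {
    for (int i = 0; i < 1024; i++)
      {
        if (a[i] > x)
          return i;   /* early exit via return */
      }
    return -1;
  }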
Since this is an initial version for GCC 14 it has the following limitations and
features:
- Only fixed sized iterations and buffers are supported. That is to say any
vectors loaded or stored must be to statically allocated arrays with known
sizes. N must also be known. This limitation is because our primary target
for this optimization is SVE. For VLA SVE we can't easily do cross page
iteration checks. The result is likely to also not be beneficial. For that
reason we punt support for variable buffers till we have First-Faulting
support in GCC 15.
- any stores in <statements1> should not be to the same objects as in
<condition>. Loads are fine as long as they don't have the possibility to
alias. More concretely, we block RAW dependencies when the intermediate value
can't be separated from the store, or the store itself can't be moved (see the
sketch after this list).
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported.
- Any number of loop early exits are supported.
- No support for epilogue vectorization. The only epilogue supported is the
scalar final one. Peeling code supports it but the code motion code cannot
find instructions to make the move in the epilogue.
- Early breaks are only supported for inner loop vectorization.
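As a sketch of the RAW restriction mentioned above (a hand-written example,
not from the patch): here the store in <statements1> writes a value that a
later iteration's <condition> reads, so the store cannot be moved below the
break check and we decline to vectorize:

  unsigned a[1024];
  void
  g (unsigned x)
  {
    for (int i = 1; i < 1024; i++)
      {
        a[i] = x + i;       /* store in <statements1> ...                 */
        if (a[i - 1] > x)   /* ... <condition> reads the previous          */
          break;            /* iteration's store: RAW blocks the motion.   */
      }
  }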
Even with these limitations, with the help of IPA and LTO this still gets hit
quite often; during bootstrap it is hit rather frequently. Additionally TSVC
s332, s481 and s482 all pass now since they test support for early exit
vectorization.
This implementation does not support completely handling the early break inside
the vector loop itself but instead supports adding checks such that if we know
that we have to exit in the current iteration then we branch to scalar code to
actually do the final VF iterations which handles all the code in <action>.
For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations. For vector code we only care about the state of fully
performed iterations and reset the scalar code to the (partially) remaining loop.
That is to say, the first vector loop executes so long as the early exit isn't
needed. Once the exit is taken, the scalar code will perform at most VF extra
iterations. The exact number depends on peeling, the iteration start, and which
exit was taken (natural or early). For this scalar loop, all early exits are
treated the same.
When we vectorize we move any statement not related to the early break itself
and that would be incorrect to execute before the break (i.e. has side effects)
to after the break. If this is not possible we decline to vectorize. The
analysis and code motion also ensure that moving the stores does not introduce
a RAW dependency.
This means that we check at the start of iterations whether we are going to exit
or not. During the analysis phase we check whether we are allowed to do this
moving of statements. Also note that we only move the scalar statements, and
only do so after peeling but just before we start transforming statements.
With this the vector flow no longer necessarily needs to match that of the
scalar code. In addition most of the infrastructure is in place to support
general control flow safely; however, we are punting this to GCC 15.
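To make the shape of the transformation concrete, here is a rough hand-written
C sketch (not actual vectorizer output; VF, N and the function name are
illustrative placeholders) of the control flow generated for the test4 example
below: the exit condition is evaluated for a whole vector first, the stores
that precede the scalar break are only issued when no lane exits, and
otherwise we fall back to the scalar loop which performs at most VF further
iterations plus any remainder:

  #define N 800   /* placeholder value; the original N is not shown */
  #define VF 4

  unsigned vect_a[N];
  unsigned vect_b[N];

  unsigned
  test4_sketch (unsigned x)
  {
    int i = 0;
    for (; i + VF <= N; i += VF)
      {
        int exit_now = 0;
        for (int l = 0; l < VF; l++)      /* vectorized <condition> */
          exit_now |= (vect_a[i + l] > x);
        if (exit_now)                     /* early break checked first */
          break;
        for (int l = 0; l < VF; l++)      /* statements moved after the check */
          {
            vect_b[i + l] = x + i + l;
            vect_a[i + l] = x;
          }
      }
    for (; i < N; i++)                    /* scalar loop: redoes at most VF
                                             iterations plus any remainder */
      {
        vect_b[i] = x + i;
        if (vect_a[i] > x)
          break;
        vect_a[i] = x;
      }
    return 0;
  }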
Codegen:
for example:
unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
  unsigned ret = 0;
  for (int i = 0; i < N; i++)
    {
      vect_b[i] = x + i;
      if (vect_a[i] > x)
        break;
      vect_a[i] = x;
    }
  return ret;
}
We generate for Adv. SIMD:
test4:
adrp x2, .LC0
adrp x3, .LANCHOR0
dup v2.4s, w0
add x3, x3, :lo12:.LANCHOR0
movi v4.4s, 0x4
add x4, x3, 3216
ldr q1, [x2, #:lo12:.LC0]
mov x1, 0
mov w2, 0
.p2align 3,,7
.L3:
ldr q0, [x3, x1]
add v3.4s, v1.4s, v2.4s
add v1.4s, v1.4s, v4.4s
cmhi v0.4s, v0.4s, v2.4s
umaxp v0.4s, v0.4s, v0.4s
fmov x5, d0
cbnz x5, .L6
add w2, w2, 1
str q3, [x1, x4]
str q2, [x3, x1]
add x1, x1, 16
cmp w2, 200
bne .L3
mov w7, 3
.L2:
lsl w2, w2, 2
add x5, x3, 3216
add w6, w2, w0
sxtw x4, w2
ldr w1, [x3, x4, lsl 2]
str w6, [x5, x4, lsl 2]
cmp w0, w1
bcc .L4
add w1, w2, 1
str w0, [x3, x4, lsl 2]
add w6, w1, w0
sxtw x1, w1
ldr w4, [x3, x1, lsl 2]
str w6, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
add w4, w2, 2
str w0, [x3, x1, lsl 2]
sxtw x1, w4
add w6, w1, w0
ldr w4, [x3, x1, lsl 2]
str w6, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
str w0, [x3, x1, lsl 2]
add w2, w2, 3
cmp w7, 3
beq .L4
sxtw x1, w2
add w2, w2, w0
ldr w4, [x3, x1, lsl 2]
str w2, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
str w0, [x3, x1, lsl 2]
.L4:
mov w0, 0
ret
.p2align 2,,3
.L6:
mov w7, 4
b .L2
and for SVE:
test4:
adrp x2, .LANCHOR0
add x2, x2, :lo12:.LANCHOR0
add x5, x2, 3216
mov x3, 0
mov w1, 0
cntw x4
mov z1.s, w0
index z0.s, #0, #1
ptrue p1.b, all
ptrue p0.s, all
.p2align 3,,7
.L3:
ld1w z2.s, p1/z, [x2, x3, lsl 2]
add z3.s, z0.s, z1.s
cmplo p2.s, p0/z, z1.s, z2.s
b.any .L2
st1w z3.s, p1, [x5, x3, lsl 2]
add w1, w1, 1
st1w z1.s, p1, [x2, x3, lsl 2]
add x3, x3, x4
incw z0.s
cmp w3, 803
bls .L3
.L5:
mov w0, 0
ret
.p2align 2,,3
.L2:
cntw x5
mul w1, w1, w5
cbz w5, .L5
sxtw x1, w1
sub w5, w5, #1
add x5, x5, x1
add x6, x2, 3216
b .L6
.p2align 2,,3
.L14:
str w0, [x2, x1, lsl 2]
cmp x1, x5
beq .L5
mov x1, x4
.L6:
ldr w3, [x2, x1, lsl 2]
add w4, w0, w1
str w4, [x6, x1, lsl 2]
add x4, x1, 1
cmp w0, w3
bcs .L14
mov w0, 0
ret
On the workloads this work is based on we see a 2-3x performance uplift using
this patch.
Follow up plan:
- Boolean vectorization has several shortcomings. I've filed PR110223 with the
bigger ones that cause vectorization to fail with this patch.
- SLP support. This is planned for GCC 15 as for the majority of the cases build
SLP itself fails. This means I'll need to spend time making this more
robust first. Additionally it requires:
* Adding support for vectorizing CFG (gconds)
* Support for CFG to differ between vector and scalar loops.
Both of which would be disruptive to the tree and I suspect I'll be handling
fallouts from this patch for a while. So I plan to work on the surrounding
building blocks first for the remainder of the year.
Additionally it contains reduced test cases from issues found running over
various codebases.
Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
Also regtested with:
-march=armv8.3-a+sve
-march=armv8.3-a+nosve
-march=armv9-a
-mcpu=neoverse-v1
-mcpu=neoverse-n2
Bootstrapped Regtested x86_64-pc-linux-gnu and no issues.
Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.
gcc/ChangeLog:
* tree-if-conv.cc (idx_within_array_bound): Expose.
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
(vect_analyze_data_ref_dependences): Use it.
* tree-vect-loop-manip.cc (vect_iv_increment_position): New.
(vect_set_loop_controls_directly,
vect_set_loop_condition_partial_vectors,
vect_set_loop_condition_partial_vectors_avx512,
vect_set_loop_condition_normal): Support multiple exits.
(slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSSA peeling for
multiple exits.
(slpeel_can_duplicate_loop_p): Change vectorizer from looking at BB
count to instead looking at loop shape.
(vect_update_ivs_after_vectorizer): Drop asserts.
(vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations.
(vect_do_peeling): Support multiple exits.
(vect_loop_versioning): Likewise.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
early_breaks.
(vect_analyze_loop_form): Support loop flows with more than single BB
loop body.
(vect_create_loop_vinfo): Support niters analysis for multiple exits.
(vect_analyze_loop): Likewise.
(vect_get_vect_def): New.
(vect_create_epilog_for_reduction): Support early exit reductions.
(vectorizable_live_operation_1): New.
(find_connected_edge): New.
(vectorizable_live_operation): Support early exit live operations.
(move_early_exit_stmts): New.
(vect_transform_loop): Use it.
* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
(vect_recog_bitfield_ref_pattern): Support gconds and bools.
(vect_recog_gcond_pattern): New.
(possible_vector_mask_operation_p): Support gcond masks.
(vect_determine_mask_precision): Likewise.
(vect_mark_pattern_stmts): Set gcond def type.
(can_vectorize_live_stmts): Force early break inductions to be live.
* tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for
early breaks.
(vect_mark_stmts_to_be_vectorized): Process gcond usage.
(perm_mask_for_reverse): Expose.
(vectorizable_comparison_1): New.
(vectorizable_early_exit): New.
(vect_analyze_stmt): Support early break and gcond.
(vect_transform_stmt): Likewise.
(vect_is_simple_use): Likewise.
(vect_get_vector_types_for_stmt): Likewise.
* tree-vectorizer.cc (pass_vectorize::execute): Update exits for value
numbering.
* tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
LOOP_VINFO_EARLY_BRK_VUSES): New.
(is_loop_header_bb_p): Drop assert.
(class loop): Add early_breaks, early_break_stores, early_break_dest_bb,
early_break_vuses.
(vect_iv_increment_position, perm_mask_for_reverse,
ref_within_array_bound): New.
(slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
Diffstat (limited to 'gcc/tree-vect-patterns.cc')
 gcc/tree-vect-patterns.cc | 167
 1 file changed, 145 insertions(+), 22 deletions(-)
diff --git a/gcc/tree-vect-patterns.cc b/gcc/tree-vect-patterns.cc
index 696b70b..2053673 100644
--- a/gcc/tree-vect-patterns.cc
+++ b/gcc/tree-vect-patterns.cc
@@ -132,6 +132,7 @@ vect_init_pattern_stmt (vec_info *vinfo, gimple *pattern_stmt,
   if (!STMT_VINFO_VECTYPE (pattern_stmt_info))
     {
       gcc_assert (!vectype
+		  || is_a <gcond *> (pattern_stmt)
 		  || (VECTOR_BOOLEAN_TYPE_P (vectype)
 		      == vect_use_mask_type_p (orig_stmt_info)));
       STMT_VINFO_VECTYPE (pattern_stmt_info) = vectype;
@@ -2786,15 +2787,30 @@ vect_recog_bitfield_ref_pattern (vec_info *vinfo, stmt_vec_info stmt_info,
 
   if (!lhs)
     {
+      if (!vectype)
+	return NULL;
+
       append_pattern_def_seq (vinfo, stmt_info, pattern_stmt, vectype);
+      vectype = truth_type_for (vectype);
+
+      /* FIXME: This part extracts the boolean value out of the bitfield in the
+		same way as vect_recog_gcond_pattern does.  However because
+		patterns cannot match the same root twice, when we handle and
+		lower the bitfield in the gcond, vect_recog_gcond_pattern can't
+		apply anymore.  We should really fix it so that we don't need to
+		duplicate transformations like these.  */
+      tree new_lhs = vect_recog_temp_ssa_var (boolean_type_node, NULL);
       gcond *cond_stmt = dyn_cast <gcond *> (stmt_info->stmt);
       tree cond_cst = gimple_cond_rhs (cond_stmt);
+      gimple *new_stmt
+	= gimple_build_assign (new_lhs, gimple_cond_code (cond_stmt),
+			       gimple_get_lhs (pattern_stmt),
+			       fold_convert (container_type, cond_cst));
+      append_pattern_def_seq (vinfo, stmt_info, new_stmt, vectype, container_type);
       pattern_stmt
-	= gimple_build_cond (gimple_cond_code (cond_stmt),
-			     gimple_get_lhs (pattern_stmt),
-			     fold_convert (ret_type, cond_cst),
-			     gimple_cond_true_label (cond_stmt),
-			     gimple_cond_false_label (cond_stmt));
+	= gimple_build_cond (NE_EXPR, new_lhs,
+			     build_zero_cst (TREE_TYPE (new_lhs)),
+			     NULL_TREE, NULL_TREE);
     }
 
   *type_out = STMT_VINFO_VECTYPE (stmt_info);
@@ -5553,6 +5569,78 @@ integer_type_for_mask (tree var, vec_info *vinfo)
   return build_nonstandard_integer_type (def_stmt_info->mask_precision, 1);
 }
 
+/* Function vect_recog_gcond_pattern
+
+   Try to find pattern like following:
+
+   if (a op b)
+
+   where operator 'op' is not != and convert it to an adjusted boolean pattern
+
+   mask = a op b
+   if (mask != 0)
+
+   and set the mask type on MASK.
+
+   Input:
+
+   * STMT_VINFO: The stmt at the end from which the pattern
+		 search begins, i.e. cast of a bool to
+		 an integer type.
+
+   Output:
+
+   * TYPE_OUT: The type of the output of this pattern.
+
+   * Return value: A new stmt that will be used to replace the pattern.  */
+
+static gimple *
+vect_recog_gcond_pattern (vec_info *vinfo,
+			  stmt_vec_info stmt_vinfo, tree *type_out)
+{
+  /* Currently we only support this for loop vectorization and when multiple
+     exits.  */
+  loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo);
+  if (!loop_vinfo || !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    return NULL;
+
+  gimple *last_stmt = STMT_VINFO_STMT (stmt_vinfo);
+  gcond* cond = NULL;
+  if (!(cond = dyn_cast <gcond *> (last_stmt)))
+    return NULL;
+
+  auto lhs = gimple_cond_lhs (cond);
+  auto rhs = gimple_cond_rhs (cond);
+  auto code = gimple_cond_code (cond);
+
+  tree scalar_type = TREE_TYPE (lhs);
+  if (VECTOR_TYPE_P (scalar_type))
+    return NULL;
+
+  if (code == NE_EXPR
+      && zerop (rhs)
+      && VECT_SCALAR_BOOLEAN_TYPE_P (scalar_type))
+    return NULL;
+
+  tree vecitype = get_vectype_for_scalar_type (vinfo, scalar_type);
+  if (vecitype == NULL_TREE)
+    return NULL;
+
+  tree vectype = truth_type_for (vecitype);
+
+  tree new_lhs = vect_recog_temp_ssa_var (boolean_type_node, NULL);
+  gimple *new_stmt = gimple_build_assign (new_lhs, code, lhs, rhs);
+  append_pattern_def_seq (vinfo, stmt_vinfo, new_stmt, vectype, scalar_type);
+
+  gimple *pattern_stmt
+    = gimple_build_cond (NE_EXPR, new_lhs,
+			 build_int_cst (TREE_TYPE (new_lhs), 0),
+			 NULL_TREE, NULL_TREE);
+  *type_out = vectype;
+  vect_pattern_detected ("vect_recog_gcond_pattern", last_stmt);
+  return pattern_stmt;
+}
+
 /* Function vect_recog_bool_pattern
 
    Try to find pattern like following:
@@ -6591,15 +6679,26 @@ static bool
 possible_vector_mask_operation_p (stmt_vec_info stmt_info)
 {
   tree lhs = gimple_get_lhs (stmt_info->stmt);
+  tree_code code = ERROR_MARK;
+  gassign *assign = NULL;
+  gcond *cond = NULL;
+
+  if ((assign = dyn_cast <gassign *> (stmt_info->stmt)))
+    code = gimple_assign_rhs_code (assign);
+  else if ((cond = dyn_cast <gcond *> (stmt_info->stmt)))
+    {
+      lhs = gimple_cond_lhs (cond);
+      code = gimple_cond_code (cond);
+    }
+
   if (!lhs
       || TREE_CODE (lhs) != SSA_NAME
       || !VECT_SCALAR_BOOLEAN_TYPE_P (TREE_TYPE (lhs)))
     return false;
 
-  if (gassign *assign = dyn_cast <gassign *> (stmt_info->stmt))
+  if (code != ERROR_MARK)
     {
-      tree_code rhs_code = gimple_assign_rhs_code (assign);
-      switch (rhs_code)
+      switch (code)
 	{
 	CASE_CONVERT:
 	case SSA_NAME:
@@ -6610,7 +6709,7 @@ possible_vector_mask_operation_p (stmt_vec_info stmt_info)
 	  return true;
 
 	default:
-	  return TREE_CODE_CLASS (rhs_code) == tcc_comparison;
+	  return TREE_CODE_CLASS (code) == tcc_comparison;
 	}
     }
   else if (is_a <gphi *> (stmt_info->stmt))
@@ -6657,12 +6756,35 @@ vect_determine_mask_precision (vec_info *vinfo, stmt_vec_info stmt_info)
      The number of operations are equal, but M16 would have given
      a shorter dependency chain and allowed more ILP.  */
   unsigned int precision = ~0U;
-  if (gassign *assign = dyn_cast <gassign *> (stmt_info->stmt))
+  gimple *stmt = STMT_VINFO_STMT (stmt_info);
+
+  /* If the statement compares two values that shouldn't use vector masks,
+     try comparing the values as normal scalars instead.  */
+  tree_code code = ERROR_MARK;
+  tree op0_type;
+  unsigned int nops = -1;
+  unsigned int ops_start = 0;
+
+  if (gassign *assign = dyn_cast <gassign *> (stmt))
+    {
+      code = gimple_assign_rhs_code (assign);
+      op0_type = TREE_TYPE (gimple_assign_rhs1 (assign));
+      nops = gimple_num_ops (assign);
+      ops_start = 1;
+    }
+  else if (gcond *cond = dyn_cast <gcond *> (stmt))
     {
-      unsigned int nops = gimple_num_ops (assign);
-      for (unsigned int i = 1; i < nops; ++i)
+      code = gimple_cond_code (cond);
+      op0_type = TREE_TYPE (gimple_cond_lhs (cond));
+      nops = 2;
+      ops_start = 0;
+    }
+
+  if (code != ERROR_MARK)
+    {
+      for (unsigned int i = ops_start; i < nops; ++i)
 	{
-	  tree rhs = gimple_op (assign, i);
+	  tree rhs = gimple_op (stmt, i);
 	  if (!VECT_SCALAR_BOOLEAN_TYPE_P (TREE_TYPE (rhs)))
 	    continue;
@@ -6679,19 +6801,15 @@ vect_determine_mask_precision (vec_info *vinfo, stmt_vec_info stmt_info)
 	    }
 	}
 
-      /* If the statement compares two values that shouldn't use vector masks,
-	 try comparing the values as normal scalars instead.  */
-      tree_code rhs_code = gimple_assign_rhs_code (assign);
       if (precision == ~0U
-	  && TREE_CODE_CLASS (rhs_code) == tcc_comparison)
+	  && TREE_CODE_CLASS (code) == tcc_comparison)
 	{
-	  tree rhs1_type = TREE_TYPE (gimple_assign_rhs1 (assign));
 	  scalar_mode mode;
 	  tree vectype, mask_type;
-	  if (is_a <scalar_mode> (TYPE_MODE (rhs1_type), &mode)
-	      && (vectype = get_vectype_for_scalar_type (vinfo, rhs1_type))
-	      && (mask_type = get_mask_type_for_scalar_type (vinfo, rhs1_type))
-	      && expand_vec_cmp_expr_p (vectype, mask_type, rhs_code))
+	  if (is_a <scalar_mode> (TYPE_MODE (op0_type), &mode)
+	      && (vectype = get_vectype_for_scalar_type (vinfo, op0_type))
+	      && (mask_type = get_mask_type_for_scalar_type (vinfo, op0_type))
+	      && expand_vec_cmp_expr_p (vectype, mask_type, code))
 	    precision = GET_MODE_BITSIZE (mode);
 	}
     }
@@ -6870,6 +6988,7 @@ static vect_recog_func vect_vect_recog_func_ptrs[] = {
   { vect_recog_divmod_pattern, "divmod" },
   { vect_recog_mult_pattern, "mult" },
   { vect_recog_mixed_size_cond_pattern, "mixed_size_cond" },
+  { vect_recog_gcond_pattern, "gcond" },
   { vect_recog_bool_pattern, "bool" },
   /* This must come before mask conversion, and includes the parts
      of mask conversion that are needed for gather and scatter
@@ -6957,6 +7076,10 @@ vect_mark_pattern_stmts (vec_info *vinfo,
     vect_set_pattern_stmt (vinfo, pattern_stmt, orig_stmt_info,
			    pattern_vectype);
 
+  /* For any conditionals mark them as vect_condition_def.  */
+  if (is_a <gcond *> (pattern_stmt))
+    STMT_VINFO_DEF_TYPE (STMT_VINFO_RELATED_STMT (orig_stmt_info)) = vect_condition_def;
+
   /* Transfer reduction path info to the pattern.  */
   if (STMT_VINFO_REDUC_IDX (orig_stmt_info_saved) != -1)
     {