author     Tamar Christina <tamar.christina@arm.com>   2023-12-24 19:18:12 +0000
committer  Tamar Christina <tamar.christina@arm.com>   2023-12-24 19:29:32 +0000
commit     01f4251b8775c832a92d55e2df57c9ac72eaceef (patch)
tree       d422c6ed538b9f7f09d216409abaf768b6196fa4 /gcc/tree-vect-loop.cc
parent     f1dcc0fe371e3cb10d2cbe3f6c88db6f72edddda (diff)
middle-end: Support vectorization of loops with multiple exits.
Hi All,
This patch adds initial support for early break vectorization in GCC. In other
words it implements support for vectorization of loops with multiple exits.
The support is added for any target that implements a vector cbranch optab;
this includes both fully masked and non-masked targets.
Depending on the operation, the vectorizer may also require support for boolean
mask reductions using Inclusive OR/Bitwise AND. This is however only checked
when the comparison would produce multiple statements.
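To illustrate, the exit test the vectorizer needs to emit looks roughly like
the following GIMPLE (a hand-written sketch for a 4-lane vector, not actual
compiler output; the SSA names are made up):

  mask_1 = vect_a_2 > vect_x_3;     /* vector comparison producing a mask */
  if (mask_1 != { 0, 0, 0, 0 })     /* vector cbranch: did any lane exit? */
    goto early_exit_bb;
  else
    goto loop_body_bb;

The cbranch optab lets the target branch on "any lane set" directly; when the
comparison expands to multiple statements, the mask reduction via Inclusive
OR/Bitwise AND mentioned above is what combines the partial masks.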
This also fully decouples the vectorizer's notion of exit from the existing loop
infrastructure's exit. Before this patch the vectorizer always picked the
natural loop latch connected exit as the main exit.
After this patch the vectorizer is free to choose any exit it deems appropriate
as the main exit. This means that even if the main exit is not countable (i.e.
the termination condition could not be determined) we might still be able to
vectorize should one of the other exits be countable.
In such situations the loop is reflowed, which enables vectorization of many
other loop forms.
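As a hypothetical illustration (not taken from the patch), consider a loop
whose latch exit is data-dependent but whose early exit is countable:

  for (int i = 0; vect_a[i] > x; i++)   /* latch exit: not countable */
    {
      if (i == N - 1)                   /* early exit: countable */
        break;
      vect_a[i] = x;
    }

Here the vectorizer can select the countable early exit as the main exit and
still vectorize the loop.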
Concretely the kind of loops supported are of the forms:
for (int i = 0; i < N; i++)
{
<statements1>
if (<condition>)
{
...
<action>;
}
<statements2>
}
where <action> can be:
- break
- return
- goto
Any number of statements can be used before the <action> occurs.
Since this is an initial version for GCC 14 it has the following limitations and
features:
- Only fixed-size iterations and buffers are supported. That is to say any
vectors loaded or stored must be to statically allocated arrays with known
sizes. N must also be known. This limitation exists because our primary target
for this optimization is SVE. For VLA SVE we can't easily do cross-page
iteration checks, and the result is likely not to be beneficial. For that
reason we punt support for variable buffers until we have First-Faulting
support in GCC 15.
- Any stores in <statements1> should not be to the same objects as in
<condition>. Loads are fine as long as they don't have the possibility to
alias. More concretely, we block RAW dependencies when the intermediate value
can't be separated from the store, or the store itself can't be moved (see
the sketch after this list).
- Prologue peeling, alignment peeling and loop versioning are supported.
- Fully masked loops, unmasked loops and partially masked loops are supported.
- Any number of loop early exits are supported.
- No support for epilogue vectorization. The only epilogue supported is the
scalar final one. The peeling code supports it, but the code motion pass
cannot find the instructions to move into the epilogue.
- Early breaks are only supported for inner loop vectorization.
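As a sketch of the blocked RAW case mentioned in the list above (an
illustrative example, not from the patch):

  for (int i = 0; i < N; i++)
    {
      vect_a[i] = x + i;   /* store in <statements1>...                 */
      if (vect_a[i] > x)   /* ...read back by <condition>: a RAW        */
        break;             /* dependence; the store can't be moved past
                              the break, so we decline to vectorize.    */
    }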
With the help of IPA and LTO this still gets hit quite often. During bootstrap
it was hit rather frequently. Additionally TSVC s332, s481 and s482, which
test early exit vectorization support, all pass now.
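For reference, s332 is a search loop of roughly the following shape
(paraphrased from the TSVC suite; the exact benchmark source differs):

  /* Find the first element greater than threshold t.  */
  int index = -1;
  float value = -1.f;
  for (int i = 0; i < LEN_1D; i++)
    if (a[i] > t)
      {
        index = i;
        value = a[i];
        break;
      }

Loops of this shape were previously rejected because of the early break.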
This implementation does not handle the early break completely inside the
vector loop itself. Instead it adds checks such that if we know we have to
exit in the current iteration, we branch to scalar code to actually do the
final VF iterations, which handles all the code in <action>.
For the scalar loop we know that whatever exit you take you have to perform at
most VF iterations. For vector code we only care about the state of fully
performed iterations and reset the scalar code to the (partially) remaining loop.
That is to say, the first vector loop executes so long as the early exit isn't
needed. Once the exit is taken, the scalar code will perform at most VF extra
iterations. The exact number depends on peeling, the iteration start, and which
exit was taken (natural or early). For this scalar loop, all early exits are
treated the same.
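Conceptually the resulting control flow looks like the following C sketch
(hand-written to show the shape of the transformation; any_lane_exits stands
in for the vector cbranch and is not a real helper):

  int i = 0;
  /* Vector main loop: runs only while no lane takes an early exit.  */
  for (; i + VF <= N; i += VF)
    {
      if (any_lane_exits (i))   /* vector cbranch on <condition> */
        break;                  /* defer to the scalar loop      */
      /* ... vectorized <statements1> and <statements2> ... */
    }
  /* Scalar loop: at most VF iterations of the original body,
     including <action>, from the first unfinished iteration.  */
  for (; i < N; i++)
    {
      /* original scalar body */
    }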
When we vectorize we move any statement not related to the early break itself
and that would be incorrect to execute before the break (i.e. has side effects)
to after the break. If this is not possible we decline to vectorize. The
analysis and code motion also ensure that moving the stores does not introduce
a RAW dependency.
This means that we check at the start of each iteration whether we are going to
exit or not. During the analysis phase we check whether we are allowed to do
this moving of statements. Note that we move only the scalar statements, and
do so after peeling but just before we start transforming statements.
With this, the vector flow no longer necessarily needs to match that of the
scalar code. In addition, most of the infrastructure is in place to support
general control flow safely; however, we are punting this to GCC 15.
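To illustrate the motion on the example loop used below (schematic, not actual
GIMPLE):

  /* Before: the store precedes the early break.  */
  vect_b[i] = x + i;
  if (vect_a[i] > x)
    break;
  vect_a[i] = x;

  /* After: the exit check comes first, so a vector iteration only
     commits its stores once we know no lane needs to exit.  */
  if (vect_a[i] > x)
    break;
  vect_b[i] = x + i;
  vect_a[i] = x;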
Codegen:
For example:

unsigned vect_a[N];
unsigned vect_b[N];

unsigned test4(unsigned x)
{
  unsigned ret = 0;
  for (int i = 0; i < N; i++)
    {
      vect_b[i] = x + i;
      if (vect_a[i] > x)
        break;
      vect_a[i] = x;
    }
  return ret;
}
We generate for Adv. SIMD:
test4:
adrp x2, .LC0
adrp x3, .LANCHOR0
dup v2.4s, w0
add x3, x3, :lo12:.LANCHOR0
movi v4.4s, 0x4
add x4, x3, 3216
ldr q1, [x2, #:lo12:.LC0]
mov x1, 0
mov w2, 0
.p2align 3,,7
.L3:
ldr q0, [x3, x1]
add v3.4s, v1.4s, v2.4s
add v1.4s, v1.4s, v4.4s
cmhi v0.4s, v0.4s, v2.4s
umaxp v0.4s, v0.4s, v0.4s
fmov x5, d0
cbnz x5, .L6
add w2, w2, 1
str q3, [x1, x4]
str q2, [x3, x1]
add x1, x1, 16
cmp w2, 200
bne .L3
mov w7, 3
.L2:
lsl w2, w2, 2
add x5, x3, 3216
add w6, w2, w0
sxtw x4, w2
ldr w1, [x3, x4, lsl 2]
str w6, [x5, x4, lsl 2]
cmp w0, w1
bcc .L4
add w1, w2, 1
str w0, [x3, x4, lsl 2]
add w6, w1, w0
sxtw x1, w1
ldr w4, [x3, x1, lsl 2]
str w6, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
add w4, w2, 2
str w0, [x3, x1, lsl 2]
sxtw x1, w4
add w6, w1, w0
ldr w4, [x3, x1, lsl 2]
str w6, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
str w0, [x3, x1, lsl 2]
add w2, w2, 3
cmp w7, 3
beq .L4
sxtw x1, w2
add w2, w2, w0
ldr w4, [x3, x1, lsl 2]
str w2, [x5, x1, lsl 2]
cmp w0, w4
bcc .L4
str w0, [x3, x1, lsl 2]
.L4:
mov w0, 0
ret
.p2align 2,,3
.L6:
mov w7, 4
b .L2
and for SVE:
test4:
adrp x2, .LANCHOR0
add x2, x2, :lo12:.LANCHOR0
add x5, x2, 3216
mov x3, 0
mov w1, 0
cntw x4
mov z1.s, w0
index z0.s, #0, #1
ptrue p1.b, all
ptrue p0.s, all
.p2align 3,,7
.L3:
ld1w z2.s, p1/z, [x2, x3, lsl 2]
add z3.s, z0.s, z1.s
cmplo p2.s, p0/z, z1.s, z2.s
b.any .L2
st1w z3.s, p1, [x5, x3, lsl 2]
add w1, w1, 1
st1w z1.s, p1, [x2, x3, lsl 2]
add x3, x3, x4
incw z0.s
cmp w3, 803
bls .L3
.L5:
mov w0, 0
ret
.p2align 2,,3
.L2:
cntw x5
mul w1, w1, w5
cbz w5, .L5
sxtw x1, w1
sub w5, w5, #1
add x5, x5, x1
add x6, x2, 3216
b .L6
.p2align 2,,3
.L14:
str w0, [x2, x1, lsl 2]
cmp x1, x5
beq .L5
mov x1, x4
.L6:
ldr w3, [x2, x1, lsl 2]
add w4, w0, w1
str w4, [x6, x1, lsl 2]
add x4, x1, 1
cmp w0, w3
bcs .L14
mov w0, 0
ret
On the workloads this work is based on, we see a 2-3x performance uplift with
this patch.
Follow up plan:
- Boolean vectorization has several shortcomings. I've filed PR110223 with the
bigger ones that cause vectorization to fail with this patch.
- SLP support. This is planned for GCC 15, as for the majority of cases
building SLP itself fails. This means I'll need to spend time making this
more robust first. Additionally it requires:
* Adding support for vectorizing CFG (gconds)
* Support for CFG to differ between vector and scalar loops.
Both of these would be disruptive to the tree, and I suspect I'll be handling
fallout from this patch for a while. So I plan to work on the surrounding
building blocks first for the remainder of the year.
The patch additionally contains reduced test cases from issues found while
running it over various codebases.
Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
Also regtested with:
-march=armv8.3-a+sve
-march=armv8.3-a+nosve
-march=armv9-a
-mcpu=neoverse-v1
-mcpu=neoverse-n2
Bootstrapped Regtested x86_64-pc-linux-gnu and no issues.
Bootstrap and Regtest on arm-none-linux-gnueabihf and no issues.
gcc/ChangeLog:
* tree-if-conv.cc (ref_within_array_bound): Expose.
* tree-vect-data-refs.cc (vect_analyze_early_break_dependences): New.
(vect_analyze_data_ref_dependences): Use it.
* tree-vect-loop-manip.cc (vect_iv_increment_position): New.
(vect_set_loop_controls_directly,
vect_set_loop_condition_partial_vectors,
vect_set_loop_condition_partial_vectors_avx512,
vect_set_loop_condition_normal): Support multiple exits.
(slpeel_tree_duplicate_loop_to_edge_cfg): Support LCSSA peeling for
multiple exits.
(slpeel_can_duplicate_loop_p): Change the check from looking at BB
count to looking at loop shape.
(vect_update_ivs_after_vectorizer): Drop asserts.
(vect_gen_vector_loop_niters_mult_vf): Support peeled vector iterations.
(vect_do_peeling): Support multiple exits.
(vect_loop_versioning): Likewise.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Initialise
early_breaks.
(vect_analyze_loop_form): Support loop flows with more than single BB
loop body.
(vect_create_loop_vinfo): Support niters analysis for multiple exits.
(vect_analyze_loop): Likewise.
(vect_get_vect_def): New.
(vect_create_epilog_for_reduction): Support early exit reductions.
(vectorizable_live_operation_1): New.
(find_connected_edge): New.
(vectorizable_live_operation): Support early exit live operations.
(move_early_exit_stmts): New.
(vect_transform_loop): Use it.
* tree-vect-patterns.cc (vect_init_pattern_stmt): Support gcond.
(vect_recog_bitfield_ref_pattern): Support gconds and bools.
(vect_recog_gcond_pattern): New.
(possible_vector_mask_operation_p): Support gcond masks.
(vect_determine_mask_precision): Likewise.
(vect_mark_pattern_stmts): Set gcond def type.
(can_vectorize_live_stmts): Force early break inductions to be live.
* tree-vect-stmts.cc (vect_stmt_relevant_p): Add relevancy analysis for
early breaks.
(vect_mark_stmts_to_be_vectorized): Process gcond usage.
(perm_mask_for_reverse): Expose.
(vectorizable_comparison_1): New.
(vectorizable_early_exit): New.
(vect_analyze_stmt): Support early break and gcond.
(vect_transform_stmt): Likewise.
(vect_is_simple_use): Likewise.
(vect_get_vector_types_for_stmt): Likewise.
* tree-vectorizer.cc (pass_vectorize::execute): Update exits for value
numbering.
* tree-vectorizer.h (enum vect_def_type): Add vect_condition_def.
(LOOP_VINFO_EARLY_BREAKS, LOOP_VINFO_EARLY_BRK_STORES,
LOOP_VINFO_EARLY_BREAKS_VECT_PEELED, LOOP_VINFO_EARLY_BRK_DEST_BB,
LOOP_VINFO_EARLY_BRK_VUSES): New.
(is_loop_header_bb_p): Drop assert.
(class loop_vec_info): Add early_breaks, early_break_stores, early_break_dest_bb,
early_break_vuses.
(vect_iv_increment_position, perm_mask_for_reverse,
ref_within_array_bound): New.
(slpeel_tree_duplicate_loop_to_edge_cfg): Update for early breaks.
Diffstat (limited to 'gcc/tree-vect-loop.cc')
 gcc/tree-vect-loop.cc | 482 ++++++++++++++++++++++++++++++++----------------
 1 file changed, 353 insertions(+), 129 deletions(-)
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 7a3db5f..88261a3 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -1040,6 +1040,7 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     partial_load_store_bias (0),
     peeling_for_gaps (false),
     peeling_for_niter (false),
+    early_breaks (false),
     no_data_dependencies (false),
     has_mask_store (false),
     scalar_loop_scaling (profile_probability::uninitialized ()),
@@ -1694,12 +1695,12 @@ vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo)
   loop_vinfo->scalar_costs->finish_cost (nullptr);
 }
 
-
 /* Function vect_analyze_loop_form.
 
    Verify that certain CFG restrictions hold, including:
    - the loop has a pre-header
-   - the loop has a single entry and exit
+   - the loop has a single entry
+   - nested loops can have only a single exit.
    - the loop exit condition is simple enough
    - the number of iterations can be analyzed, i.e, a countable loop.  The
      niter could be analyzed under some assumptions.  */
@@ -1721,6 +1722,17 @@ vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
                      "using as main loop exit: %d -> %d [AUX: %p]\n",
                      exit_e->src->index, exit_e->dest->index, exit_e->aux);
 
+  /* Check if we have any control flow that doesn't leave the loop.  */
+  class loop *v_loop = loop->inner ? loop->inner : loop;
+  basic_block *bbs = get_loop_body (v_loop);
+  for (unsigned i = 0; i < v_loop->num_nodes; i++)
+    if (EDGE_COUNT (bbs[i]->succs) != 1
+        && (EDGE_COUNT (bbs[i]->succs) != 2
+            || !loop_exits_from_bb_p (bbs[i]->loop_father, bbs[i])))
+      return opt_result::failure_at (vect_location,
+                                     "not vectorized:"
+                                     " unsupported control flow in loop.\n");
+
   /* Different restrictions apply when we are considering an inner-most loop,
      vs. an outer (nested) loop.
      (FORNOW. May want to relax some of these restrictions in the future).  */
@@ -1740,11 +1752,6 @@ vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
                            | (exit-bb)  */
 
-      if (loop->num_nodes != 2)
-        return opt_result::failure_at (vect_location,
-                                       "not vectorized:"
-                                       " control flow in loop.\n");
-
       if (empty_block_p (loop->header))
         return opt_result::failure_at (vect_location,
                                        "not vectorized: empty loop.\n");
@@ -1776,11 +1783,6 @@ vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
                                        "not vectorized:"
                                        " multiple nested loops.\n");
 
-      if (loop->num_nodes != 5)
-        return opt_result::failure_at (vect_location,
-                                       "not vectorized:"
-                                       " control flow in loop.\n");
-
       entryedge = loop_preheader_edge (innerloop);
       if (entryedge->src != loop->header
           || !single_exit (innerloop)
@@ -1817,9 +1819,6 @@ vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
       info->inner_loop_cond = inner.conds[0];
     }
 
-  if (!single_exit (loop))
-    return opt_result::failure_at (vect_location,
-                                   "not vectorized: multiple exits.\n");
   if (EDGE_COUNT (loop->header->preds) != 2)
     return opt_result::failure_at (vect_location,
                                    "not vectorized:"
@@ -1835,10 +1834,14 @@ vect_analyze_loop_form (class loop *loop, vect_loop_form_info *info)
                                    "not vectorized: latch block not empty.\n");
 
   /* Make sure the exit is not abnormal.  */
-  if (exit_e->flags & EDGE_ABNORMAL)
-    return opt_result::failure_at (vect_location,
-                                   "not vectorized:"
-                                   " abnormal loop exit edge.\n");
+  auto_vec<edge> exits = get_loop_exit_edges (loop);
+  for (edge e : exits)
+    {
+      if (e->flags & EDGE_ABNORMAL)
+        return opt_result::failure_at (vect_location,
+                                       "not vectorized:"
+                                       " abnormal loop exit edge.\n");
+    }
 
   info->conds
     = vect_get_loop_niters (loop, exit_e, &info->assumptions,
@@ -1906,6 +1909,8 @@ vect_create_loop_vinfo (class loop *loop, vec_info_shared *shared,
     {
       stmt_vec_info loop_cond_info = loop_vinfo->lookup_stmt (cond);
      STMT_VINFO_TYPE (loop_cond_info) = loop_exit_ctrl_vec_info_type;
+      /* Mark the statement as a condition.  */
+      STMT_VINFO_DEF_TYPE (loop_cond_info) = vect_condition_def;
     }
 
   for (unsigned i = 1; i < info->conds.length (); i ++)
@@ -1914,6 +1919,10 @@ vect_create_loop_vinfo (class loop *loop, vec_info_shared *shared,
 
   LOOP_VINFO_IV_EXIT (loop_vinfo) = info->loop_exit;
 
+  /* Check to see if we're vectorizing multiple exits.  */
+  LOOP_VINFO_EARLY_BREAKS (loop_vinfo)
+    = !LOOP_VINFO_LOOP_CONDS (loop_vinfo).is_empty ();
+
   if (info->inner_loop_cond)
     {
       stmt_vec_info inner_loop_cond_info
@@ -3167,7 +3176,8 @@ start_over:
 
   /* If an epilogue loop is required make sure we can create one.  */
   if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo))
+      || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)
+      || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     {
       if (dump_enabled_p ())
        dump_printf_loc (MSG_NOTE, vect_location, "epilog loop required\n");
@@ -3677,6 +3687,9 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
                      && loop->inner == NULL
                      && param_vect_epilogues_nomask
                      && LOOP_VINFO_PEELING_FOR_NITER (first_loop_vinfo)
+                     /* No code motion support for multiple epilogues so for now
+                        not supported when multiple exits.  */
+                     && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
                      && !loop->simduid);
   if (!vect_epilogues)
     return first_loop_vinfo;
@@ -5857,6 +5870,34 @@ vect_create_partial_epilog (tree vec_def, tree vectype, code_helper code,
   return new_temp;
 }
 
+/* Retrieves the definining statement to be used for a reduction.
+   For MAIN_EXIT_P we use the current VEC_STMTs and otherwise we look at
+   the reduction definitions.  */
+
+tree
+vect_get_vect_def (stmt_vec_info reduc_info, slp_tree slp_node,
+                   slp_instance slp_node_instance, bool main_exit_p, unsigned i,
+                   vec <gimple *> &vec_stmts)
+{
+  tree def;
+
+  if (slp_node)
+    {
+      if (!main_exit_p)
+        slp_node = slp_node_instance->reduc_phis;
+      def = vect_get_slp_vect_def (slp_node, i);
+    }
+  else
+    {
+      if (!main_exit_p)
+        reduc_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (reduc_info));
+      vec_stmts = STMT_VINFO_VEC_STMTS (reduc_info);
+      def = gimple_get_lhs (vec_stmts[0]);
+    }
+
+  return def;
+}
+
 /* Function vect_create_epilog_for_reduction
 
    Create code at the loop-epilog to finalize the result of a reduction
@@ -5868,6 +5909,8 @@ vect_create_partial_epilog (tree vec_def, tree vectype, code_helper code,
    SLP_NODE_INSTANCE is the SLP node instance containing SLP_NODE
    REDUC_INDEX says which rhs operand of the STMT_INFO is the reduction phi
      (counting from 0)
+   LOOP_EXIT is the edge to update in the merge block.  In the case of a single
+     exit this edge is always the main loop exit.
 
    This function:
    1. Completes the reduction def-use cycles.
@@ -5908,7 +5951,8 @@ static void
 vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
                                   stmt_vec_info stmt_info,
                                   slp_tree slp_node,
-                                  slp_instance slp_node_instance)
+                                  slp_instance slp_node_instance,
+                                  edge loop_exit)
 {
   stmt_vec_info reduc_info = info_for_reduction (loop_vinfo, stmt_info);
   gcc_assert (reduc_info->is_reduc_info);
@@ -5917,6 +5961,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
      loop-closed PHI of the inner loop which we remember as def for the
      reduction PHI generation.  */
   bool double_reduc = false;
+  bool main_exit_p = LOOP_VINFO_IV_EXIT (loop_vinfo) == loop_exit;
   stmt_vec_info rdef_info = stmt_info;
   if (STMT_VINFO_DEF_TYPE (stmt_info) == vect_double_reduction_def)
     {
@@ -6079,7 +6124,7 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
   /* Create an induction variable.  */
   gimple_stmt_iterator incr_gsi;
   bool insert_after;
-  standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+  vect_iv_increment_position (loop_exit, &incr_gsi, &insert_after);
   create_iv (series_vect, PLUS_EXPR, vec_step, NULL_TREE, loop, &incr_gsi,
              insert_after, &indx_before_incr, &indx_after_incr);
@@ -6158,23 +6203,23 @@ vect_create_epilog_for_reduction (loop_vec_info loop_vinfo,
          Store them in NEW_PHIS.  */
   if (double_reduc)
     loop = outer_loop;
-  exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
+  /* We need to reduce values in all exits.  */
+  exit_bb = loop_exit->dest;
   exit_gsi = gsi_after_labels (exit_bb);
   reduc_inputs.create (slp_node ? vec_num : ncopies);
+  vec <gimple *> vec_stmts;
   for (unsigned i = 0; i < vec_num; i++)
     {
       gimple_seq stmts = NULL;
-      if (slp_node)
-        def = vect_get_slp_vect_def (slp_node, i);
-      else
-        def = gimple_get_lhs (STMT_VINFO_VEC_STMTS (rdef_info)[0]);
+      def = vect_get_vect_def (rdef_info, slp_node, slp_node_instance,
+                               main_exit_p, i, vec_stmts);
       for (j = 0; j < ncopies; j++)
        {
          tree new_def = copy_ssa_name (def);
          phi = create_phi_node (new_def, exit_bb);
          if (j)
-           def = gimple_get_lhs (STMT_VINFO_VEC_STMTS (rdef_info)[j]);
-         SET_PHI_ARG_DEF (phi, LOOP_VINFO_IV_EXIT (loop_vinfo)->dest_idx, def);
+           def = gimple_get_lhs (vec_stmts[j]);
+         SET_PHI_ARG_DEF (phi, loop_exit->dest_idx, def);
          new_def = gimple_convert (&stmts, vectype, new_def);
          reduc_inputs.quick_push (new_def);
        }
@@ -10524,6 +10569,146 @@ vectorizable_induction (loop_vec_info loop_vinfo,
   return true;
 }
 
+/* Function vectorizable_live_operation_1.
+
+   helper function for vectorizable_live_operation.  */
+
+tree
+vectorizable_live_operation_1 (loop_vec_info loop_vinfo,
+                               stmt_vec_info stmt_info, basic_block exit_bb,
+                               tree vectype, int ncopies, slp_tree slp_node,
+                               tree bitsize, tree bitstart, tree vec_lhs,
+                               tree lhs_type, bool restart_loop,
+                               gimple_stmt_iterator *exit_gsi)
+{
+  gcc_assert (single_pred_p (exit_bb) || LOOP_VINFO_EARLY_BREAKS (loop_vinfo));
+
+  tree vec_lhs_phi = copy_ssa_name (vec_lhs);
+  gimple *phi = create_phi_node (vec_lhs_phi, exit_bb);
+  for (unsigned i = 0; i < gimple_phi_num_args (phi); i++)
+    SET_PHI_ARG_DEF (phi, i, vec_lhs);
+
+  gimple_seq stmts = NULL;
+  tree new_tree;
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      /* Emit:
+
+           SCALAR_RES = VEC_EXTRACT <VEC_LHS, LEN + BIAS - 1>
+
+         where VEC_LHS is the vectorized live-out result and MASK is
+         the loop mask for the final iteration.  */
+      gcc_assert (ncopies == 1 && !slp_node);
+      gimple_seq tem = NULL;
+      gimple_stmt_iterator gsi = gsi_last (tem);
+      tree len = vect_get_loop_len (loop_vinfo, &gsi,
+                                    &LOOP_VINFO_LENS (loop_vinfo),
+                                    1, vectype, 0, 0);
+
+      /* BIAS - 1.  */
+      signed char biasval = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
+      tree bias_minus_one
+        = int_const_binop (MINUS_EXPR,
+                           build_int_cst (TREE_TYPE (len), biasval),
+                           build_one_cst (TREE_TYPE (len)));
+
+      /* LAST_INDEX = LEN + (BIAS - 1).  */
+      tree last_index = gimple_build (&stmts, PLUS_EXPR, TREE_TYPE (len),
+                                      len, bias_minus_one);
+
+      /* This needs to implement extraction of the first index, but not sure
+         how the LEN stuff works.  At the moment we shouldn't get here since
+         there's no LEN support for early breaks.  But guard this so there's
+         no incorrect codegen.  */
+      gcc_assert (!LOOP_VINFO_EARLY_BREAKS (loop_vinfo));
+
+      /* SCALAR_RES = VEC_EXTRACT <VEC_LHS, LEN + BIAS - 1>.  */
+      tree scalar_res
+        = gimple_build (&stmts, CFN_VEC_EXTRACT, TREE_TYPE (vectype),
+                        vec_lhs_phi, last_index);
+
+      /* Convert the extracted vector element to the scalar type.  */
+      new_tree = gimple_convert (&stmts, lhs_type, scalar_res);
+    }
+  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+    {
+      /* Emit:
+
+           SCALAR_RES = EXTRACT_LAST <VEC_LHS, MASK>
+
+         where VEC_LHS is the vectorized live-out result and MASK is
+         the loop mask for the final iteration.  */
+      gcc_assert (!slp_node);
+      tree scalar_type = TREE_TYPE (STMT_VINFO_VECTYPE (stmt_info));
+      gimple_seq tem = NULL;
+      gimple_stmt_iterator gsi = gsi_last (tem);
+      tree mask = vect_get_loop_mask (loop_vinfo, &gsi,
+                                      &LOOP_VINFO_MASKS (loop_vinfo),
+                                      1, vectype, 0);
+      tree scalar_res;
+
+      /* For an inverted control flow with early breaks we want EXTRACT_FIRST
+         instead of EXTRACT_LAST.  Emulate by reversing the vector and mask. */
+      if (restart_loop && LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+        {
+          /* First create the permuted mask.  */
+          tree perm_mask = perm_mask_for_reverse (TREE_TYPE (mask));
+          tree perm_dest = copy_ssa_name (mask);
+          gimple *perm_stmt
+                = gimple_build_assign (perm_dest, VEC_PERM_EXPR, mask,
+                                       mask, perm_mask);
+          vect_finish_stmt_generation (loop_vinfo, stmt_info, perm_stmt,
+                                       &gsi);
+          mask = perm_dest;
+
+          /* Then permute the vector contents.  */
+          tree perm_elem = perm_mask_for_reverse (vectype);
+          perm_dest = copy_ssa_name (vec_lhs_phi);
+          perm_stmt
+                = gimple_build_assign (perm_dest, VEC_PERM_EXPR, vec_lhs_phi,
+                                       vec_lhs_phi, perm_elem);
+          vect_finish_stmt_generation (loop_vinfo, stmt_info, perm_stmt,
+                                       &gsi);
+          vec_lhs_phi = perm_dest;
+        }
+
+      gimple_seq_add_seq (&stmts, tem);
+
+      scalar_res = gimple_build (&stmts, CFN_EXTRACT_LAST, scalar_type,
+                                 mask, vec_lhs_phi);
+
+      /* Convert the extracted vector element to the scalar type.  */
+      new_tree = gimple_convert (&stmts, lhs_type, scalar_res);
+    }
+  else
+    {
+      tree bftype = TREE_TYPE (vectype);
+      if (VECTOR_BOOLEAN_TYPE_P (vectype))
+        bftype = build_nonstandard_integer_type (tree_to_uhwi (bitsize), 1);
+      new_tree = build3 (BIT_FIELD_REF, bftype, vec_lhs_phi, bitsize, bitstart);
+      new_tree = force_gimple_operand (fold_convert (lhs_type, new_tree),
+                                       &stmts, true, NULL_TREE);
+    }
+
+  *exit_gsi = gsi_after_labels (exit_bb);
+  if (stmts)
+    gsi_insert_seq_before (exit_gsi, stmts, GSI_SAME_STMT);
+
+  return new_tree;
+}
+
+/* Find the edge that's the final one in the path from SRC to DEST and
+   return it.  This edge must exist in at most one forwarder edge between.  */
+
+static edge
+find_connected_edge (edge src, basic_block dest)
+{
+  if (src->dest == dest)
+    return src;
+
+  return find_edge (src->dest, dest);
+}
+
 /* Function vectorizable_live_operation.
 
    STMT_INFO computes a value that is used outside the loop.  Check if
@@ -10544,11 +10729,13 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
   poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
   int ncopies;
   gimple *use_stmt;
+  use_operand_p use_p;
   auto_vec<tree> vec_oprnds;
   int vec_entry = 0;
   poly_uint64 vec_index = 0;
 
-  gcc_assert (STMT_VINFO_LIVE_P (stmt_info));
+  gcc_assert (STMT_VINFO_LIVE_P (stmt_info)
+              || LOOP_VINFO_EARLY_BREAKS (loop_vinfo));
 
   /* If a stmt of a reduction is live, vectorize it via
      vect_create_epilog_for_reduction.  vectorizable_reduction assessed
@@ -10573,8 +10760,25 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
       if (STMT_VINFO_REDUC_TYPE (reduc_info) == FOLD_LEFT_REDUCTION
          || STMT_VINFO_REDUC_TYPE (reduc_info) == EXTRACT_LAST_REDUCTION)
        return true;
+
       vect_create_epilog_for_reduction (loop_vinfo, stmt_info, slp_node,
-                                       slp_node_instance);
+                                       slp_node_instance,
+                                       LOOP_VINFO_IV_EXIT (loop_vinfo));
+
+      /* If early break we only have to materialize the reduction on the merge
+        block, but we have to find an alternate exit first.  */
+      if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+       {
+         for (auto exit : get_loop_exit_edges (LOOP_VINFO_LOOP (loop_vinfo)))
+           if (exit != LOOP_VINFO_IV_EXIT (loop_vinfo))
+             {
+               vect_create_epilog_for_reduction (loop_vinfo, stmt_info,
+                                                 slp_node, slp_node_instance,
+                                                 exit);
+               break;
+             }
+       }
+
       return true;
     }
 
@@ -10685,8 +10889,8 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
   bitsize = vector_element_bits_tree (vectype);
 
   /* Get the vectorized lhs of STMT and the lane to use (counted in bits).  */
-  tree vec_lhs, bitstart;
-  gimple *vec_stmt;
+  tree vec_lhs, vec_lhs0, bitstart;
+  gimple *vec_stmt, *vec_stmt0;
   if (slp_node)
     {
       gcc_assert (!loop_vinfo
@@ -10697,6 +10901,10 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
       vec_lhs = SLP_TREE_VEC_DEFS (slp_node)[vec_entry];
       vec_stmt = SSA_NAME_DEF_STMT (vec_lhs);
 
+      /* In case we need to early break vectorize also get the first stmt.  */
+      vec_lhs0 = SLP_TREE_VEC_DEFS (slp_node)[0];
+      vec_stmt0 = SSA_NAME_DEF_STMT (vec_lhs0);
+
       /* Get entry to use.  */
       bitstart = bitsize_int (vec_index);
       bitstart = int_const_binop (MULT_EXPR, bitsize, bitstart);
@@ -10707,6 +10915,10 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
       vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info).last ();
       vec_lhs = gimple_get_lhs (vec_stmt);
 
+      /* In case we need to early break vectorize also get the first stmt.  */
+      vec_stmt0 = STMT_VINFO_VEC_STMTS (stmt_info)[0];
+      vec_lhs0 = gimple_get_lhs (vec_stmt0);
+
       /* Get the last lane in the vector.  */
       bitstart = int_const_binop (MULT_EXPR, bitsize, bitsize_int (nunits - 1));
     }
@@ -10726,103 +10938,60 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
         lhs' = new_tree;  */
 
      class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
-     basic_block exit_bb = LOOP_VINFO_IV_EXIT (loop_vinfo)->dest;
-     gcc_assert (single_pred_p (exit_bb));
-
-     tree vec_lhs_phi = copy_ssa_name (vec_lhs);
-     gimple *phi = create_phi_node (vec_lhs_phi, exit_bb);
-     SET_PHI_ARG_DEF (phi, LOOP_VINFO_IV_EXIT (loop_vinfo)->dest_idx, vec_lhs);
-
-     gimple_seq stmts = NULL;
-     tree new_tree;
-     if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
-       {
-         /* Emit:
-
-              SCALAR_RES = VEC_EXTRACT <VEC_LHS, LEN + BIAS - 1>
-
-            where VEC_LHS is the vectorized live-out result and MASK is
-            the loop mask for the final iteration.  */
-         gcc_assert (ncopies == 1 && !slp_node);
-         gimple_seq tem = NULL;
-         gimple_stmt_iterator gsi = gsi_last (tem);
-         tree len
-           = vect_get_loop_len (loop_vinfo, &gsi,
-                                &LOOP_VINFO_LENS (loop_vinfo),
-                                1, vectype, 0, 0);
-
-         /* BIAS - 1.  */
-         signed char biasval = LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo);
-         tree bias_minus_one
-           = int_const_binop (MINUS_EXPR,
-                              build_int_cst (TREE_TYPE (len), biasval),
-                              build_one_cst (TREE_TYPE (len)));
-
-         /* LAST_INDEX = LEN + (BIAS - 1).  */
-         tree last_index = gimple_build (&stmts, PLUS_EXPR, TREE_TYPE (len),
-                                         len, bias_minus_one);
-
-         /* SCALAR_RES = VEC_EXTRACT <VEC_LHS, LEN + BIAS - 1>.  */
-         tree scalar_res
-           = gimple_build (&stmts, CFN_VEC_EXTRACT, TREE_TYPE (vectype),
-                           vec_lhs_phi, last_index);
-
-         /* Convert the extracted vector element to the scalar type.  */
-         new_tree = gimple_convert (&stmts, lhs_type, scalar_res);
-       }
-     else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
-       {
-         /* Emit:
-
-              SCALAR_RES = EXTRACT_LAST <VEC_LHS, MASK>
-
-            where VEC_LHS is the vectorized live-out result and MASK is
-            the loop mask for the final iteration.  */
-         gcc_assert (ncopies == 1 && !slp_node);
-         tree scalar_type = TREE_TYPE (STMT_VINFO_VECTYPE (stmt_info));
-         gimple_seq tem = NULL;
-         gimple_stmt_iterator gsi = gsi_last (tem);
-         tree mask = vect_get_loop_mask (loop_vinfo, &gsi,
-                                         &LOOP_VINFO_MASKS (loop_vinfo),
-                                         1, vectype, 0);
-         gimple_seq_add_seq (&stmts, tem);
-         tree scalar_res = gimple_build (&stmts, CFN_EXTRACT_LAST, scalar_type,
-                                         mask, vec_lhs_phi);
-
-         /* Convert the extracted vector element to the scalar type.  */
-         new_tree = gimple_convert (&stmts, lhs_type, scalar_res);
-       }
-     else
-       {
-         tree bftype = TREE_TYPE (vectype);
-         if (VECTOR_BOOLEAN_TYPE_P (vectype))
-           bftype = build_nonstandard_integer_type (tree_to_uhwi (bitsize), 1);
-         new_tree = build3 (BIT_FIELD_REF, bftype,
-                            vec_lhs_phi, bitsize, bitstart);
-         new_tree = force_gimple_operand (fold_convert (lhs_type, new_tree),
-                                          &stmts, true, NULL_TREE);
-       }
+     /* Check if we have a loop where the chosen exit is not the main exit,
+        in these cases for an early break we restart the iteration the vector code
+        did.  For the live values we want the value at the start of the iteration
+        rather than at the end.  */
+     edge main_e = LOOP_VINFO_IV_EXIT (loop_vinfo);
+     bool restart_loop = LOOP_VINFO_EARLY_BREAKS_VECT_PEELED (loop_vinfo);
+     FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
+       if (!is_gimple_debug (use_stmt)
+           && !flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
+         FOR_EACH_IMM_USE_ON_STMT (use_p, imm_iter)
+           {
+             edge e = gimple_phi_arg_edge (as_a <gphi *> (use_stmt),
+                                           phi_arg_index_from_use (use_p));
+             bool main_exit_edge = e == main_e
+                                   || find_connected_edge (main_e, e->src);
+
+             /* Early exits have an merge block, we want the merge block itself
+                so use ->src.  For main exit the merge block is the
+                destination.  */
+             basic_block dest = main_exit_edge ? main_e->dest : e->src;
+             tree tmp_vec_lhs = vec_lhs;
+             tree tmp_bitstart = bitstart;
+
+             /* For early exit where the exit is not in the BB that leads
+                to the latch then we're restarting the iteration in the
+                scalar loop.  So get the first live value.  */
+             restart_loop = restart_loop || !main_exit_edge;
+             if (restart_loop
+                 && STMT_VINFO_DEF_TYPE (stmt_info) == vect_induction_def)
+               {
+                 tmp_vec_lhs = vec_lhs0;
+                 tmp_bitstart = build_zero_cst (TREE_TYPE (bitstart));
+               }
 
-     gimple_stmt_iterator exit_gsi = gsi_after_labels (exit_bb);
-     if (stmts)
-       gsi_insert_seq_before (&exit_gsi, stmts, GSI_SAME_STMT);
+             gimple_stmt_iterator exit_gsi;
+             tree new_tree
+               = vectorizable_live_operation_1 (loop_vinfo, stmt_info,
+                                                dest, vectype, ncopies,
+                                                slp_node, bitsize,
+                                                tmp_bitstart, tmp_vec_lhs,
+                                                lhs_type, restart_loop,
+                                                &exit_gsi);
 
-     /* Remove existing phis that copy from lhs and create copies
-        from new_tree.  */
-     gimple_stmt_iterator gsi;
-     for (gsi = gsi_start_phis (exit_bb); !gsi_end_p (gsi);)
-       {
-         gimple *phi = gsi_stmt (gsi);
-         if ((gimple_phi_arg_def (phi, 0) == lhs))
-           {
-             remove_phi_node (&gsi, false);
-             tree lhs_phi = gimple_phi_result (phi);
-             gimple *copy = gimple_build_assign (lhs_phi, new_tree);
-             gsi_insert_before (&exit_gsi, copy, GSI_SAME_STMT);
-           }
-         else
-           gsi_next (&gsi);
-       }
+             if (gimple_phi_num_args (use_stmt) == 1)
+               {
+                 auto gsi = gsi_for_stmt (use_stmt);
+                 remove_phi_node (&gsi, false);
+                 tree lhs_phi = gimple_phi_result (use_stmt);
+                 gimple *copy = gimple_build_assign (lhs_phi, new_tree);
+                 gsi_insert_before (&exit_gsi, copy, GSI_SAME_STMT);
+               }
+             else
+               SET_PHI_ARG_DEF (use_stmt, e->dest_idx, new_tree);
+           }
 
   /* There a no further out-of-loop uses of lhs by LC-SSA construction.  */
   FOR_EACH_IMM_USE_STMT (use_stmt, imm_iter, lhs)
@@ -11601,6 +11770,56 @@ update_epilogue_loop_vinfo (class loop *epilogue, tree advance)
 
   epilogue_vinfo->shared->save_datarefs ();
 }
 
+/* When vectorizing early break statements instructions that happen before
+   the early break in the current BB need to be moved to after the early
+   break.  This function deals with that and assumes that any validity
+   checks has already been performed.
+
+   While moving the instructions if it encounters a VUSE or VDEF it then
+   corrects the VUSES as it moves the statements along.  GDEST is the location
+   in which to insert the new statements.  */
+
+static void
+move_early_exit_stmts (loop_vec_info loop_vinfo)
+{
+  DUMP_VECT_SCOPE ("move_early_exit_stmts");
+
+  if (LOOP_VINFO_EARLY_BRK_STORES (loop_vinfo).is_empty ())
+    return;
+
+  /* Move all stmts that need moving.  */
+  basic_block dest_bb = LOOP_VINFO_EARLY_BRK_DEST_BB (loop_vinfo);
+  gimple_stmt_iterator dest_gsi = gsi_start_bb (dest_bb);
+
+  for (gimple *stmt : LOOP_VINFO_EARLY_BRK_STORES (loop_vinfo))
+    {
+      /* Check to see if statement is still required for vect or has been
+         elided.  */
+      auto stmt_info = loop_vinfo->lookup_stmt (stmt);
+      if (!stmt_info)
+        continue;
+
+      if (dump_enabled_p ())
+        dump_printf_loc (MSG_NOTE, vect_location, "moving stmt %G", stmt);
+
+      gimple_stmt_iterator stmt_gsi = gsi_for_stmt (stmt);
+      gsi_move_before (&stmt_gsi, &dest_gsi);
+      gsi_prev (&dest_gsi);
+    }
+
+  /* Update all the stmts with their new reaching VUSES.  */
+  tree vuse
+    = gimple_vuse (LOOP_VINFO_EARLY_BRK_STORES (loop_vinfo).last ());
+  for (auto p : LOOP_VINFO_EARLY_BRK_VUSES (loop_vinfo))
+    {
+      if (dump_enabled_p ())
+        dump_printf_loc (MSG_NOTE, vect_location,
+                         "updating vuse to %T for load %G", vuse, p);
+      gimple_set_vuse (p, vuse);
+      update_stmt (p);
+    }
+}
+
 /* Function vect_transform_loop.
 
    The analysis phase has determined that the loop is vectorizable.
@@ -11648,7 +11867,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   /* Make sure there exists a single-predecessor exit bb.  Do this before
      versioning.  */
   edge e = LOOP_VINFO_IV_EXIT (loop_vinfo);
-  if (! single_pred_p (e->dest))
+  if (! single_pred_p (e->dest) && !LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
     {
       split_loop_exit_edge (e, true);
      if (dump_enabled_p ())
@@ -11742,6 +11961,11 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   /* This will deal with any possible peeling.  */
   vect_prepare_for_masked_peels (loop_vinfo);
 
+  /* Handle any code motion that we need to for early-break vectorization after
+     we've done peeling but just before we start vectorizing.  */
+  if (LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
+    move_early_exit_stmts (loop_vinfo);
+
   /* Schedule the SLP instances first, then handle loop vectorization
      below.  */
   if (!loop_vinfo->slp_instances.is_empty ())