riscv-gnu-toolchain/gcc.git

author	Jakub Jelinek <jakub@redhat.com>	2026-02-07 18:45:13 +0100
committer	Jakub Jelinek <jakub@gcc.gnu.org>	2026-02-07 18:45:13 +0100
commit	7f4476239b1f8337a88844fb6dd98a9b1906c1d7 (patch)
tree	ee5387b5c7a6d879d5d4f708eb5e3c2de27375f7 /libjava/classpath/vm/reference/java
parent	f3f7e7514a794f34a0db1cda63cbaa0f1eb899f5 (diff)

forwprop: Fix up calc_perm_vec_perm_simplify_seqs [PR123672]

Since r15-5563-g1c4d39ada we have an optimization that tries to blend two sequences of 2x VEC_PERM_EXPR + 2x binop + 1x VEC_PERM_EXPR, where the first two VEC_PERM_EXPRs each permute a single input and the last one permutes the results of the two binops, into 2 VEC_PERM_EXPRs from 2 inputs, 2 binops and 2 final VEC_PERM_EXPRs.

On the following testcase, the intended change (i.e. after the patch) is (including the DCE after it, which the optimization relies on):

    a_7 = *x_6(D);
    b_9 = *y_8(D);
  - c_10 = VEC_PERM_EXPR <a_7, a_7, { 0, 2, 0, 2 }>;
  - d_11 = VEC_PERM_EXPR <a_7, a_7, { 1, 3, 1, 3 }>;
  - e_12 = VEC_PERM_EXPR <b_9, b_9, { 0, 2, 0, 2 }>;
  - f_13 = VEC_PERM_EXPR <b_9, b_9, { 1, 3, 1, 3 }>;
  + c_10 = VEC_PERM_EXPR <a_7, b_9, { 0, 2, 4, 6 }>;
  + d_11 = VEC_PERM_EXPR <a_7, b_9, { 1, 3, 5, 7 }>;
    _1 = c_10 + d_11;
    _2 = c_10 - d_11;
    g_14 = VEC_PERM_EXPR <_1, _2, { 0, 4, 1, 5 }>;
  - _3 = e_12 + f_13;
  - _4 = e_12 - f_13;
  - h_15 = VEC_PERM_EXPR <_3, _4, { 0, 4, 1, 5 }>;
  + h_15 = VEC_PERM_EXPR <_1, _2, { 2, 6, 3, 7 }>;
    *x_6(D) = g_14;
    *y_8(D) = h_15;

This works by first identifying the two sequences, attempting to use vector element redundancies so that at most half of the vector elements are used (in this testcase a nop, because the { 0, 4, 1, 5 } perms already use only half of the vector elts), remembering details of such sequences, and later, if there are at least two of them (up to 8 I think), comparing them and trying to merge them. The optimization is meant to improve SPEC x264.

Anyway, in r15-6387-geee289131 the optimization was changed to fix some regressions, but that regressed this testcase: instead of the desirable { 0, 2, 4, 6 } and { 1, 3, 5, 7 } selectors for the first 2 VEC_PERM_EXPRs, the 15 branch and trunk use { 0, 2, 4, 4 } and { 1, 3, 5, 5 }, and on this testcase that means computing an incorrect result.

On this testcase, it identifies the two sequences, one ending with g_14 and one with h_15, with no changes (see above).
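To make the difference between the two selector pairs concrete, here is a minimal standalone sketch that models the 4-lane GIMPLE above with plain arrays. It is not GCC code: vec_perm, add, sub, original, blended and the Result struct are hypothetical helpers invented for this illustration, and the element type (int) is an assumption, since the message does not state the vector type of the testcase.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec = std::array<int, 4>;
using Sel = std::array<unsigned, 4>;

// Model of VEC_PERM_EXPR <a, b, sel> for 4 lanes (illustration only):
// selector values 0..3 pick from a, 4..7 pick from b.
Vec vec_perm(const Vec &a, const Vec &b, const Sel &sel) {
  Vec r{};
  for (std::size_t i = 0; i < 4; ++i) {
    assert(sel[i] < 8);
    r[i] = sel[i] < 4 ? a[sel[i]] : b[sel[i] - 4];
  }
  return r;
}

Vec add(const Vec &x, const Vec &y) {
  Vec r{};
  for (std::size_t i = 0; i < 4; ++i) r[i] = x[i] + y[i];
  return r;
}

Vec sub(const Vec &x, const Vec &y) {
  Vec r{};
  for (std::size_t i = 0; i < 4; ++i) r[i] = x[i] - y[i];
  return r;
}

struct Result { Vec g; Vec h; };

// The two unblended sequences, as in the '-' lines of the dump.
Result original(const Vec &a, const Vec &b) {
  Vec c = vec_perm(a, a, Sel{0, 2, 0, 2});
  Vec d = vec_perm(a, a, Sel{1, 3, 1, 3});
  Vec e = vec_perm(b, b, Sel{0, 2, 0, 2});
  Vec f = vec_perm(b, b, Sel{1, 3, 1, 3});
  Vec g = vec_perm(add(c, d), sub(c, d), Sel{0, 4, 1, 5});
  Vec h = vec_perm(add(e, f), sub(e, f), Sel{0, 4, 1, 5});
  return Result{g, h};
}

// The blended form; s1 and s2 are the selectors of the first two perms.
Result blended(const Vec &a, const Vec &b, const Sel &s1, const Sel &s2) {
  Vec c = vec_perm(a, b, s1);
  Vec d = vec_perm(a, b, s2);
  Vec p = add(c, d);
  Vec m = sub(c, d);
  return Result{vec_perm(p, m, Sel{0, 4, 1, 5}),
                vec_perm(p, m, Sel{2, 6, 3, 7})};
}
```

With a = { 1, 2, 3, 4 } and b = { 10, 20, 30, 40 }, original gives g = { 3, -1, 7, -1 } and h = { 30, -10, 70, -10 }. blended with { 0, 2, 4, 6 } and { 1, 3, 5, 7 } reproduces both, while blended with { 0, 2, 4, 4 } and { 1, 3, 5, 5 } still gets g right but yields h = { 30, -10, 30, -10 }.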
The first one (it has some code to attempt to swap them if needed, but here the first one remains g_14) keeps using the final VEC_PERM_EXPR as is (or with whatever simplification recognise_vec_perm_simplify_seq performed on just that to reduce it to at most half of nelts), and the second one is modified so that it uses the other elts of the two vectors. So, we have { 0, 4, 1, 5 } (i.e. twice the first lane and twice the second lane) from the first sequence, look up the unused lanes (third and fourth) to transform the other { 0, 4, 1, 5 } to, and find that is { 2, 6, 3, 7 }. So far good.

But the next operation is to compute the new selectors for the first 2 VEC_PERM_EXPRs, which are changed from single-input to two-input ones. For that, the code correctly uses the VECTOR_CST elts unmodified for the lanes used by the first sequence (in this testcase the first/second lanes), so { 0, 2, X, X } and { 1, 3, X, X }, and then needs to find out what to use for the needs of the second sequence. Here is what it does currently:

  for (i = 0; i < nelts; i++)
    {
      bool use_seq1 = lane_assignment[i] != 2;
      unsigned int l1, l2;

      if (use_seq1)
        {
          /* Just reuse the selector indices.  */
          tree s1 = gimple_assign_rhs3 (seq1->v_1_stmt);
          tree s2 = gimple_assign_rhs3 (seq1->v_2_stmt);
          l1 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s1, i));
          l2 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s2, i));
        }
      else
        {
          /* We moved the lanes for seq2, so we need to adjust for that.  */
          tree s1 = gimple_assign_rhs3 (seq2->v_1_stmt);
          tree s2 = gimple_assign_rhs3 (seq2->v_2_stmt);

          unsigned int j = 0;
          for (; j < i; j++)
            {
              unsigned int sel_new;
              sel_new = seq2_stmt_sel_perm[j].to_constant ();
              sel_new %= nelts;
              if (sel_new == i)
                break;
            }

          /* This should not happen.  Test anyway to guarantee correctness.  */
          if (j == i)
            return false;

          l1 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s1, j));
          l2 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s2, j));
        }

      seq1_v_1_stmt_sel_perm.quick_push (l1 + (use_seq1 ? 0 : nelts));
      seq1_v_2_stmt_sel_perm.quick_push (l2 + (use_seq1 ? 0 : nelts));
    }

seq2_stmt_sel_perm is the newly computed { 2, 6, 3, 7 } selector, seq1->v_{1,2}_stmt are the def stmts of {c_10,d_11} and seq2->v_{1,2}_stmt are the def stmts of {e_12,f_13}. For i 0 and 1 it is use_seq1 and correct. For i 2 the loop first checks seq2_stmt_sel_perm[0], which is 2 % 4, equal to i, so it picks VECTOR_CST_ELT (s{1,2}, 0), which happens to be correct in this case. For i 3 the loop iterates until seq2_stmt_sel_perm[2], which is 3 % 4, stops and picks the wrong VECTOR_CST_ELT (s{1,2}, 2), which has the same value as VECTOR_CST_ELT (s{1,2}, 0), when the correct value would in this case be either element 1 or element 3 (due to the duplication).

What the loop should do for !use_seq1 is take the lane transformations into account. We've changed { 0, 4, 1, 5 } to { 2, 6, 3, 7 }, so instead of using lanes 0, 0, 1, 1 we now use lanes 2, 2, 3, 3 (x / 4 says which input the element is picked from, here the + or the - result). So, for 2, which got remapped from 0, we want to use 0, and for 3, which got remapped from 1, we want to use 1.

The function uses an auto_vec lane_assignment with values 0 (unused lane, so far or altogether), 1 (used by the first sequence) and 2 (used by the second sequence). When we store 2 in there, we know exactly which lane we are remapping to which lane, so instead of computing it again the following patch stores 2 + l_orig there, such that a value >= 2 means the second sequence and lane_assignment[i] - 2 in that case is the lane that got remapped to i. The last loop then doesn't need to recompute anything and can just use the remembered transformation.

The rest of the changes (hunks 1-5 and 7) are just random small fixes I've noticed while trying to understand the code. The real fix is

  - lane_assignment[lane] = 2;
  + lane_assignment[lane] = 2 + l_orig;

and

  - bool use_seq1 = lane_assignment[i] != 2;
  + bool use_seq1 = lane_assignment[i] < 2;

and the rest of the last hunk.
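The remembered-transformation idea can be sketched outside of GCC as follows. This is a simplified model under stated assumptions, not the code in tree-ssa-forwprop.cc: selectors are plain std::vector<unsigned> rather than VECTOR_CSTs, blend_selectors and the Blend struct are hypothetical names, and the way a free lane is chosen for the second sequence (reuse the lane already holding the same l_orig, otherwise take the first unused one) is an assumption made so that the example reproduces the { 2, 6, 3, 7 } remapping described above.

```cpp
#include <cassert>
#include <vector>

using Sel = std::vector<unsigned>;

struct Blend {
  Sel seq2_final;  // remapped final selector of the second sequence
  Sel v1;          // new selector for the first leading perm
  Sel v2;          // new selector for the second leading perm
};

// Hypothetical standalone model of the fixed selector computation.
// Returns false when the second sequence does not fit in the unused lanes.
bool blend_selectors(const Sel &seq1_final, const Sel &seq2_final,
                     const Sel &seq1_v1, const Sel &seq1_v2,
                     const Sel &seq2_v1, const Sel &seq2_v2, Blend &out) {
  assert(seq1_final.size() == seq2_final.size());
  const unsigned nelts = static_cast<unsigned>(seq1_final.size());

  // 0: unused lane, 1: used by seq1,
  // 2 + l_orig: used by seq2 and remapped from lane l_orig.
  std::vector<unsigned> lane_assignment(nelts, 0);
  for (unsigned s : seq1_final)
    lane_assignment[s % nelts] = 1;

  for (unsigned s : seq2_final) {
    unsigned l_orig = s % nelts;
    unsigned lane = 0;
    // Assumption: reuse a lane that already holds l_orig ...
    for (; lane < nelts; ++lane)
      if (lane_assignment[lane] == 2 + l_orig) break;
    // ... otherwise take the first lane nobody uses yet.
    if (lane == nelts)
      for (lane = 0; lane < nelts; ++lane)
        if (lane_assignment[lane] == 0) break;
    if (lane == nelts) return false;
    lane_assignment[lane] = 2 + l_orig;  // remember where the lane came from
    out.seq2_final.push_back(lane + (s >= nelts ? nelts : 0));
  }

  for (unsigned i = 0; i < nelts; ++i) {
    bool use_seq1 = lane_assignment[i] < 2;
    // For seq2 lanes, look up the original lane instead of searching for it.
    unsigned j = use_seq1 ? i : lane_assignment[i] - 2;
    unsigned l1 = (use_seq1 ? seq1_v1 : seq2_v1)[j] % nelts;
    unsigned l2 = (use_seq1 ? seq1_v2 : seq2_v2)[j] % nelts;
    out.v1.push_back(l1 + (use_seq1 ? 0 : nelts));
    out.v2.push_back(l2 + (use_seq1 ? 0 : nelts));
  }
  return true;
}
```

Fed the selectors of this testcase (final selectors { 0, 4, 1, 5 } twice, leading selectors { 0, 2, 0, 2 } and { 1, 3, 1, 3 } for both sequences), the model ends with lane_assignment = { 1, 1, 2, 3 } and produces { 2, 6, 3, 7 }, { 0, 2, 4, 6 } and { 1, 3, 5, 7 }.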
Also, the last loop was kind of assuming that VEC_PERM_EXPR canonicalization had happened and that for a single-input perm the selector elts are never >= nelts; I've added %= nelts just to be sure.

2026-02-07  Jakub Jelinek  <jakub@redhat.com>

	PR tree-optimization/123672
	* tree-ssa-forwprop.cc (recognise_vec_perm_simplify_seq): Use std::swap instead of fetching gimple_assign_rhs{1,2} again. Change type of lanes vector from auto_vec<unsigned int> to auto_vec<bool> and store true instead of 1 into it. Fix comment typo and formatting.
	(can_blend_vec_perm_simplify_seqs_p): Put end of comment on the same line as the last sentence in it.
	(calc_perm_vec_perm_simplify_seqs): Change lane_assignment type from auto_vec<int> to auto_vec<unsigned> and store 2 + l_orig into it instead of 2. Fix comment typo and formatting. Set use_seq1 to lane_assignment[i] < 2 instead of lane_assignment[i] != 2. Replace bogus computation of index for !use_seq1 with using lane_assignment[i] - 2. Set l1 to l1 % nelts and similarly for l2.

	* gcc.dg/pr123672.c: New test.
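As a footnote to the %= nelts change described in the message: for a single-input permutation, where both operands are the same vector, a selector element k and k % nelts pick the same value, so folding the selector into the range 0 .. nelts-1 cannot change the result. The following small model illustrates that; vec_perm and fold_single_input are hypothetical helpers written for this note, not GCC functions, and the 4-lane int vector is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec = std::array<int, 4>;
using Sel = std::array<unsigned, 4>;

// Model of VEC_PERM_EXPR <a, b, sel> for 4 lanes (illustration only).
Vec vec_perm(const Vec &a, const Vec &b, const Sel &sel) {
  Vec r{};
  for (std::size_t i = 0; i < 4; ++i) {
    assert(sel[i] < 8);
    r[i] = sel[i] < 4 ? a[sel[i]] : b[sel[i] - 4];
  }
  return r;
}

// Fold every selector element into 0..3; only meaningful when both
// operands of the permutation are the same vector.
Sel fold_single_input(const Sel &sel) {
  Sel r{};
  for (std::size_t i = 0; i < 4; ++i) r[i] = sel[i] % 4;
  return r;
}
```

For example, with a = { 1, 2, 3, 4 } the non-canonical selector { 4, 2, 6, 0 } applied to <a, a> and its folded form { 0, 2, 2, 0 } both give { 1, 3, 3, 1 }.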

Diffstat (limited to 'libjava/classpath/vm/reference/java')

0 files changed, 0 insertions, 0 deletions

