diff options
| author | Jakub Jelinek <jakub@redhat.com> | 2026-02-07 18:45:13 +0100 |
|---|---|---|
| committer | Jakub Jelinek <jakub@gcc.gnu.org> | 2026-02-07 18:45:13 +0100 |
| commit | 7f4476239b1f8337a88844fb6dd98a9b1906c1d7 (patch) | |
| tree | ee5387b5c7a6d879d5d4f708eb5e3c2de27375f7 /libjava/classpath/vm/reference/java | |
| parent | f3f7e7514a794f34a0db1cda63cbaa0f1eb899f5 (diff) | |
| download | gcc-7f4476239b1f8337a88844fb6dd98a9b1906c1d7.zip gcc-7f4476239b1f8337a88844fb6dd98a9b1906c1d7.tar.gz gcc-7f4476239b1f8337a88844fb6dd98a9b1906c1d7.tar.bz2 | |
forwprop: Fix up calc_perm_vec_perm_simplify_seqs [PR123672]
Since r15-5563-g1c4d39ada we have an optimization to try to blend 2
sequences of 2xVEC_PERM_EXPR + 2x binop + 1x VEC_PERM where the first two
VEC_PERMs are permuting a single input and the last one permutes result from
those 2 binops into 2 VEC_PERM_EXPRs from 2 inputs, 2 binops and 2 final
VEC_PERMs.
On the following testcase, the intended change (i.e. after patch) is
(including DCE after it which the optimizations relies on):
a_7 = *x_6(D);
b_9 = *y_8(D);
- c_10 = VEC_PERM_EXPR <a_7, a_7, { 0, 2, 0, 2 }>;
- d_11 = VEC_PERM_EXPR <a_7, a_7, { 1, 3, 1, 3 }>;
- e_12 = VEC_PERM_EXPR <b_9, b_9, { 0, 2, 0, 2 }>;
- f_13 = VEC_PERM_EXPR <b_9, b_9, { 1, 3, 1, 3 }>;
+ c_10 = VEC_PERM_EXPR <a_7, b_9, { 0, 2, 4, 6 }>;
+ d_11 = VEC_PERM_EXPR <a_7, b_9, { 1, 3, 5, 7 }>;
_1 = c_10 + d_11;
_2 = c_10 - d_11;
g_14 = VEC_PERM_EXPR <_1, _2, { 0, 4, 1, 5 }>;
- _3 = e_12 + f_13;
- _4 = e_12 - f_13;
- h_15 = VEC_PERM_EXPR <_3, _4, { 0, 4, 1, 5 }>;
+ h_15 = VEC_PERM_EXPR <_1, _2, { 2, 6, 3, 7 }>;
*x_6(D) = g_14;
*y_8(D) = h_15;
This works by first identifying the two sequences, attempting to use vect
elem redundancies to only use at most half of the vector elements
(in this testcase a nop because 0, 4, 1, 5 perms already use only half of
the vector elts), remembering details of such sequences and later comparing
them if there are at least two (up to 8 I think) and trying to merge them.
The optimization is meant to improve SPEC x264.
Anyway, in r15-6387-geee289131 the optimization was changed to fix some
regressions but regressed this testcase, instead of the desirable
{ 0, 2, 4, 6 } and { 1, 3, 5, 7 } first 2 VEC_PERMs 15 branch and trunk
uses { 0, 2, 4, 4 } and { 1, 3, 5, 5 } and on this testcase that means
computing incorrect result.
On this testcase, it identified the two sequences (one ending with g_14
and one with h_15 with no changes (see above). The first one (it has
some code to attempt to swap them if needed, but here the first one remains
g_14) keeps using the final VEC_PERM_EXPR as is (or with whatever
simplification recognise_vec_perm_simplify_seq performed on just that to
reduce to at most half of nelts) and the second one is modified so that
it uses the other elts of the two vectors.
So, we have { 0, 4, 1, 5 } (i.e. twice first lanes and twice second lanes)
from the first sequence and look up unused lanes (third and fourth) to
transform the other { 0, 4, 1, 5 } to, and find that is { 2, 6, 3, 7 }.
So far good. But the next operation is to compute the new selectors
for the first 2 VEC_PERM_EXPRs, which are changed from single input to
two input ones. For that, the code correctly uses the VECTOR_CST elts
unmodified for the lanes used by the first sequence (in this
testcase first/second lanes), so { 0, 2, X, X } and { 1, 3, X, X }
and then need to find out what to use for the needs of the second sequence.
Here is what it does currently:
for (i = 0; i < nelts; i++)
{
bool use_seq1 = lane_assignment[i] != 2;
unsigned int l1, l2;
if (use_seq1)
{
/* Just reuse the selector indices. */
tree s1 = gimple_assign_rhs3 (seq1->v_1_stmt);
tree s2 = gimple_assign_rhs3 (seq1->v_2_stmt);
l1 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s1, i));
l2 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s2, i));
}
else
{
/* We moved the lanes for seq2, so we need to adjust for that. */
tree s1 = gimple_assign_rhs3 (seq2->v_1_stmt);
tree s2 = gimple_assign_rhs3 (seq2->v_2_stmt);
unsigned int j = 0;
for (; j < i; j++)
{
unsigned int sel_new;
sel_new = seq2_stmt_sel_perm[j].to_constant ();
sel_new %= nelts;
if (sel_new == i)
break;
}
/* This should not happen. Test anyway to guarantee correctness. */
if (j == i)
return false;
l1 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s1, j));
l2 = TREE_INT_CST_LOW (VECTOR_CST_ELT (s2, j));
}
seq1_v_1_stmt_sel_perm.quick_push (l1 + (use_seq1 ? 0 : nelts));
seq1_v_2_stmt_sel_perm.quick_push (l2 + (use_seq1 ? 0 : nelts));
}
seq2_stmt_sel_perm is the newly computed { 2, 6, 3, 7 } selector and
seq1->v_{1,2}_stmt are def stmts of {c_10,d_11} and seq2->v_{1,2}_stmt
are def stmts of {e_12,f_13}. For i 0 and 1 it is use_seq1 and
correct, then for i 2 the loop checks first seq2_stmt_sel_perm[0],
it is 2 % 4, equal to i, so picks up VECTOR_CST_ELTS (s{1,2}, 2),
which happens to be correct in this case, for i 3 the loop loops until
seq2_stmt_sel_perm[2] which is 3 % 4, stops and picks the wrong
VECTOR_CST_ELTS (s{1,2}, 2) which has the same value as
VECTOR_CST_ELTS (s{1,2}, 0), when the correct value would be in this
case either 1 or 3 (due to the duplication).
What the loop should do for !use_seq1 is to take the lane transformations
into account, we've changed { 0, 4, 1, 5 } to { 2, 6, 3, 7 }, so instead
of using lanes 0, 0, 1, 1 we now use lanes 2, 2, 3, 3 (x / 4 is about
which input it is picked from, here + or -). So, for 2 which got remapped
from 0 we want to use 0 and for 3 which got remapped from 1 we want to use
1.
The function uses an auto_vec lane_assignment with values 0 (unused lane,
so far or altogether), 1 (used by first sequence) and 2 (used by second
sequence). When we store in there 2, we know exactly which lane we are
remapping to which lane, so instead of computing it again the following
patch stores there 2 + l_orig, such that value >= 2 means second lane
and lane_assignment[i] - 2 in that case is the lane that got remapped to i.
And then the last loop doesn't need to recompute anything and can just use
the remembered transformation.
The rest of the changes (hunks 1-5 and 7) are just random small fixes I've
noticed while trying to understand the code. The real fix is
- lane_assignment[lane] = 2;
+ lane_assignment[lane] = 2 + l_orig;
and
- bool use_seq1 = lane_assignment[i] != 2;
+ bool use_seq1 = lane_assignment[i] < 2;
and the rest of the last hunk. Also, the last loop was kind of assuming
VEC_PERM_EXPR canonicalization happened and for single input perm the
selector elts are never >= nelts, I've added %= nelts just to be sure.
2026-02-07 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/123672
* tree-ssa-forwprop.cc (recognise_vec_perm_simplify_seq): Use std::swap
instead of fetching gimple_assign_rhs{1,2} again. Change type of lanes
vector from auto_vec<unsigned int> to auto_vec<bool> and store true
instead of 1 into it. Fix comment typo and formatting fix.
(can_blend_vec_perm_simplify_seqs_p): Put end of comment on the same
line as the last sentence in it.
(calc_perm_vec_perm_simplify_seqs): Change lane_assignment type from
auto_vec<int> to auto_vec<unsigned> and store 2 + l_orig into it
instead of true. Fix comment typo and formatting fix. Set use_seq1
to line_assignment[i] < 2 instead of line_assignment[i] != 2. Replace
bogus computation of index for !use_seq with using
line_assignment[i] - 2. Set l1 to l1 % nelts and similarly for l2.
* gcc.dg/pr123672.c: New test.
Diffstat (limited to 'libjava/classpath/vm/reference/java')
0 files changed, 0 insertions, 0 deletions
