|
The following testcase, reduced from newlib, ICEs on powerpc-linux
with -O2 -m32 -mpowerpc64 since r12-6433, when the PR102239
optimization was added; the original testcase ICEs since some ranger
improvements in GCC 13 made the bug no longer latent in newlib.
The problem is that the *branch_anddi3_dot define_insn_and_split
relies on the *rotldi3_mask_dot define_insn_and_split being recognized
during splitting. The rs6000_is_valid_rotate_dot_mask function checks whether
the mask is a CONST_INT which is a valid mask, but *rotl<mode>3_mask_dot in
addition to checking that it is a valid mask also has the
(<MODE>mode == Pmode || UINTVAL (operands[3]) <= 0x7fffffff)
test in its condition. For TARGET_64BIT that doesn't add any further
requirements, but for !TARGET_64BIT && TARGET_POWERPC64 if the AND
second operand is larger than INT_MAX it will not be recognized.
The rs6000_is_valid_rotate_dot_mask function is used in exactly one
spot, the condition of *branch_anddi3_dot, so the following patch
adjusts it to check for that as well.
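A minimal sketch of the kind of construct involved (hypothetical, not
the actual pr109566.c; it assumes the mask below is one that
rs6000_is_valid_rotate_dot_mask accepts):
  /* Hypothetical reproducer shape: with -O2 -m32 -mpowerpc64 the AND
     feeding the branch has a mask larger than INT_MAX, for which
     *rotldi3_mask_dot could not be recognized during splitting.  */
  unsigned long long x;
  void bar (void);
  void
  foo (void)
  {
    if (x & 0xffffffff00000000ULL)
      bar ();
  }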
2023-04-25 Jakub Jelinek <jakub@redhat.com>
PR target/109566
* config/rs6000/rs6000.cc (rs6000_is_valid_rotate_dot_mask): For
!TARGET_64BIT, don't return true if UINTVAL (mask) << (63 - nb)
is larger than signed int maximum.
* gcc.target/powerpc/pr109566.c: New test.
(cherry picked from commit 97f8f2d0a0384d377ca46da88495f9a3d18d4415)
|
|
gcc/
PR target/108812
* config/rs6000/vsx.md (vsx_sign_extend_qi_<mode>): Rename to...
(vsx_sign_extend_v16qi_<mode>): ... this.
(vsx_sign_extend_hi_<mode>): Rename to...
(vsx_sign_extend_v8hi_<mode>): ... this.
(vsx_sign_extend_si_v2di): Rename to...
(vsx_sign_extend_v4si_v2di): ... this.
(vsignextend_qi_<mode>): Remove.
(vsignextend_hi_<mode>): Remove.
(vsignextend_si_v2di): Remove.
(vsignextend_v2di_v1ti): Remove.
(*xxspltib_<mode>_split): Replace gen_vsx_sign_extend_qi_v2di with
gen_vsx_sign_extend_v16qi_v2di and gen_vsx_sign_extend_qi_v4si
with gen_vsx_sign_extend_v16qi_v4si.
* config/rs6000/rs6000.md (split for DI constant generation):
Replace gen_vsx_sign_extend_qi_si with gen_vsx_sign_extend_v16qi_si.
(split for HSDI constant generation): Replace gen_vsx_sign_extend_qi_di
with gen_vsx_sign_extend_v16qi_di and gen_vsx_sign_extend_qi_si
with gen_vsx_sign_extend_v16qi_si.
* config/rs6000/rs6000-builtins.def (__builtin_altivec_vsignextsb2d):
Set bif-pattern to vsx_sign_extend_v16qi_v2di.
(__builtin_altivec_vsignextsb2w): Set bif-pattern to
vsx_sign_extend_v16qi_v4si.
(__builtin_altivec_visgnextsh2d): Set bif-pattern to
vsx_sign_extend_v8hi_v2di.
(__builtin_altivec_vsignextsh2w): Set bif-pattern to
vsx_sign_extend_v8hi_v4si.
(__builtin_altivec_vsignextsw2d): Set bif-pattern to
vsx_sign_extend_si_v2di.
(__builtin_altivec_vsignext): Set bif-pattern to
vsx_sign_extend_v2di_v1ti.
* config/rs6000/rs6000-builtin.cc (lxvrse_expand_builtin): Replace
gen_vsx_sign_extend_qi_v2di with gen_vsx_sign_extend_v16qi_v2di,
gen_vsx_sign_extend_hi_v2di with gen_vsx_sign_extend_v8hi_v2di and
gen_vsx_sign_extend_si_v2di with gen_vsx_sign_extend_v4si_v2di.
gcc/testsuite/
PR target/108812
* gcc.target/powerpc/p9-sign_extend-runnable.c: Set corresponding
expected vectors for Big Endian.
* gcc.target/powerpc/int_128bit-runnable.c: Likewise.
(cherry picked from commit a213e2c965382c24fe391ee5798effeba8da0fdf)
|
|
-mpreferred-stack-boundary=2 DImode temporaries [PR109276]
The following testcase ICEs since r11-2259 because assign_386_stack_local
-> assign_stack_local -> ix86_local_alignment now uses 64-bit alignment
for DImode temporaries rather than 32-bit as before.
Most of the spots in the backend which ask for DImode temporaries do so
during expansion, and those are apparently handled fine with -m32
-mpreferred-stack-boundary=2: we dynamically realign the stack in that
case (most of the spots don't actually need that alignment, but at
least one does). Two further spots are in STV, which I assume also
work correctly.
But during splitting we can create a DImode slot which doesn't need
64-bit alignment (it is nicer for performance though), at a point where
we apparently can't detect it for dynamic stack realignment purposes.
The following patch just makes the slot 32-bit aligned in that rare
case.
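A hedged sketch of the shape of the fix, paraphrasing the ChangeLog
entry below (the real code lives in assign_386_stack_local in
config/i386/i386.cc; this is not the actual patch text):
  /* Sketch only: DImode slots created for SLOT_FLOATxFDI_387 don't
     need 64-bit alignment; with -m32 -mpreferred-stack-boundary=2
     request 32-bit alignment instead of the default.  */
  int align = 0;
  if (mode == DImode
      && !TARGET_64BIT
      && n == SLOT_FLOATxFDI_387
      && ix86_preferred_stack_boundary < GET_MODE_ALIGNMENT (DImode))
    align = 32;
  /* ... */ assign_stack_local (mode, GET_MODE_SIZE (mode), align);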
2023-03-28 Jakub Jelinek <jakub@redhat.com>
PR target/109276
* config/i386/i386.cc (assign_386_stack_local): For DImode
with SLOT_FLOATxFDI_387 and -m32 -mpreferred-stack-boundary=2 pass
align 32 rather than 0 to assign_stack_local.
* gcc.target/i386/pr109276.c: New test.
(cherry picked from commit 4b5ef857f5faf09f274c0a95c67faaa80d198124)
|
|
The following testcase ICEs on aarch64-linux, because
expand_vector_condition attempts to piecewise lower SVE
d_3 = a_1(D) < b_2(D);
_5 = VEC_COND_EXPR <d_3, c_4(D), d_3>;
which isn't possible - nunits_for_known_piecewise_op ICEs but
the rest of the code assumes constant number of elements too.
expand_vector_condition checks whether a (rhs1) is an SSA_NAME defined
by a comparison and calls expand_vec_cond_expr_p (type, TREE_TYPE (a1),
code), where a1 is one of the operands of the comparison and code is
the comparison code. That combination indeed isn't supported here, but
what aarch64 SVE supports
are the individual statements, comparison (expand_vec_cmp_expr_p) and
expand_vec_cond_expr_p (type, TREE_TYPE (a), SSA_NAME), the latter because
that function starts with
  if (VECTOR_BOOLEAN_TYPE_P (cmp_op_type)
      && get_vcond_mask_icode (TYPE_MODE (value_type),
                               TYPE_MODE (cmp_op_type)) != CODE_FOR_nothing)
    return true;
In an earlier version of the patch (in the PR), we did this
  if (VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (a))
      && expand_vec_cond_expr_p (type, TREE_TYPE (a), ERROR_MARK))
    return true;
before the code == SSA_NAME handling plus some further tweaks later.
While that fixed the ICE, it broke quite a few tests on x86 and some on
aarch64 too. The problem is that expand_vector_comparison doesn't lower
unsupported comparisons that only feed the first operand of a
VEC_COND_EXPR, because expand_vector_condition succeeds for those; so
with the above-mentioned change we'd verify that the VEC_COND_EXPR is
implementable using an optab alone, but nothing would verify the
tcc_comparison, which relied on expand_vector_condition for
verification.
So, the following patch instead queries whether optabs can handle the
comparison and VEC_COND_EXPR together (if a (rhs1) is a comparison;
otherwise as before it checks only the VEC_COND_EXPR) and if that fails,
also checks whether the two operations could be supported individually
and only if even that fails does the piecewise lowering.
2023-03-23 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/109176
* tree-vect-generic.cc (expand_vector_condition): If a has
vector boolean type and is a comparison, also check if both
the comparison and VEC_COND_EXPR could be successfully expanded
individually.
* gcc.target/aarch64/sve/pr109176.c: New test.
(cherry picked from commit 484c41c747d95f9cee15a33b75b32ae2e7eb45f3)
|
|
This adds a check for REG_P on SET_DEST for the new idiom recognizer
for AARCH64_FUSE_ADDSUB_2REG_CONST1. The reported ICE is only
observable with checking=rtl.
Bootstrapped/regtested aarch64-linux, committed.
PR target/108589
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch_macro_fusion_pair_p): Check
REG_P on SET_DEST.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/pr108589.c: New test.
(cherry picked from commit a39c6ec97906766ad65d15d4856fd41121ee7a45)
|
|
AmpereOne (-mcpu=ampere1) breaks LDP instructions into two uops.
Given the chance that this causes instructions to slip into the next
decoding cycle, and the additional overhead when handling
cacheline-crossing LDP instructions, we disable the generation of LDP
instructions through the tuning structure for instruction combining
(such as in peephole2).
Given the code-density benefits in builtins and prologue/epilogue
expansion, we allow LDPs there.
This commit:
* adds a new tuning option AARCH64_EXTRA_TUNE_NO_LDP_COMBINE
* allows -moverride=tune=... to override this
These changes are benchmark-driven, yielding the following changes
(with a net-overall improvement):
503.bwaves_r     -0.88%
507.cactuBSSN_r   0.35%
508.namd_r        3.09%
510.parest_r     -2.99%
511.povray_r      5.54%
519.lbm_r        15.83%
521.wrf_r         0.56%
526.blender_r     2.47%
527.cam4_r        0.70%
538.imagick_r     0.00%
544.nab_r        -0.33%
549.fotonik3d_r  -0.42%
554.roms_r        0.00%
-------------------------
= total 1.79%
Signed-off-by: Philipp Tomsich <philipp.tomsich@vrull.eu>
Co-Authored-By: Di Zhao <di.zhao@amperecomputing.com>
gcc/ChangeLog:
* config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
Add AARCH64_EXTRA_TUNE_NO_LDP_COMBINE.
* config/aarch64/aarch64.cc (aarch64_operands_ok_for_ldpstp):
Check for the above tuning option when processing loads.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/ampere1-no_ldp_combine.c: New test.
(cherry picked from commit f200c56787f2c6f93ffb739d57d01a294ab72f68)
|
|
The failures on the originally failing case builtin-bitops-1.c
and the associated test case pr108699.c here show that the
current support of the parity vector modes is wrong on Power.
The hardware insns vprtyb[wdq] operate on the least significant
bit of each byte per element; they don't match what the RTL
opcode parity needs, but the current implementation wrongly
expands it with them.
This patch fixes the handling with one more insn, vpopcntb.
PR target/108699
gcc/ChangeLog:
* config/rs6000/altivec.md (*p9v_parity<mode>2): Rename to ...
(rs6000_vprtyb<mode>2): ... this.
* config/rs6000/rs6000-builtins.def (VPRTYBD): Replace parityv2di2 with
rs6000_vprtybv2di2.
(VPRTYBW): Replace parityv4si2 with rs6000_vprtybv4si2.
(VPRTYBQ): Replace parityv1ti2 with rs6000_vprtybv1ti2.
* config/rs6000/vector.md (parity<mode>2 with VEC_IP): Expand with
popcountv16qi2 and the corresponding rs6000_vprtyb<mode>2.
gcc/testsuite/ChangeLog:
* gcc.target/powerpc/p9-vparity.c: Add scan-assembler-not for vpopcntb
to distinguish parity byte from parity.
* gcc.target/powerpc/pr108699.c: New test.
(cherry picked from commit cdd2d6643f7fef40e335a7027edfea7276cde608)
|
|
2023-04-10 Michael Meissner <meissner@linux.ibm.com>
gcc/
PR target/109067
* config/rs6000/rs6000.cc (create_complex_muldiv): Delete.
(init_float128_ieee): Delete code to switch complex multiply and divide
for long double. Backport from master, 3/20/2023.
(complex_multiply_builtin_code): New helper function.
(complex_divide_builtin_code): Likewise.
(rs6000_mangle_decl_assembler_name): Add support for mangling the name
of complex 128-bit multiply and divide built-in functions.
gcc/testsuite/
PR target/109067
* gcc.target/powerpc/divic3-1.c: New test. Backport from master,
3/20/2023.
* gcc.target/powerpc/divic3-2.c: Likewise.
* gcc.target/powerpc/mulic3-1.c: Likewise.
* gcc.target/powerpc/mulic3-2.c: Likewise.
|
|
PR96373 points out that a predicated SVE loop currently converts
trapping unconditional ops into unpredicated vector ops. Doing
the operation on inactive lanes can then raise an exception.
As discussed in the PR trail, we aren't 100% consistent about
whether we preserve traps or not. But the direction of travel
is clearly to improve that rather than live with it. This patch
tries to do that for the SVE case.
Doing this regresses gcc.target/aarch64/sve/fabd_1.c. I've added
-fno-trapping-math for now and filed PR108571 to track it.
A similar problem applies to fsubr_1.c.
I think this is likely to regress Power 10, since conditional
operations are only available for masked loops. I think we'll
need to add -fno-trapping-math to any affected testcases,
but I don't have a Power 10 system to test on.
gcc/
PR tree-optimization/96373
PR tree-optimization/108979
* tree-vect-stmts.cc (vectorizable_operation): Predicate trapping
operations on the loop mask. Reject partial vectors if this isn't
possible. Don't mask operations on invariants.
gcc/testsuite/
PR tree-optimization/96373
PR tree-optimization/108571
PR tree-optimization/108979
* gcc.target/aarch64/sve/fabd_1.c: Add -fno-trapping-math.
* gcc.target/aarch64/sve/fsubr_1.c: Likewise.
* gcc.target/aarch64/sve/fmul_1.c: Expect predicate ops.
* gcc.target/aarch64/sve/fp_arith_1.c: Likewise.
* gfortran.dg/vect/pr108979.f90: New test.
|
|
Before GCC 12, we would vectorize:
int32_t arr[] = { x, x, x, x };
at -O3. Vectorizing the store on its own is often a loss, particularly
for integers, so g:4963079769c99c4073adfd799885410ad484cbbe suppressed it.
This was necessary to fix regressions from enabling vectorisation at
-O2. However, the vectorisation is important if the code subsequently
loads
from the array using vld1:
return vld1q_s32 (arr);
This approach of initialising an array and loading from it is the
recommended endian-agnostic way of constructing an ACLE vector.
As discussed in the PR notes, the general fix would be to fold the
store and load-back to a constructor (preferably before vectorisation).
But that's clearly not stage 4 material.
This patch instead delays folding vld1 until after inlining and
records which decls a vld1 loads from. It then treats vector
stores to those decls as free, on the optimistic assumption that
they will be removed later. The patch also brute-forces
vectorization of plain constructor+store sequences, since some
of the CPU costs make that (dubiously) expensive even when the
store is discounted.
Delaying folding showed that we were failing to update the vops.
The patch fixes that too.
Thanks to Tamar for discussion & help with testing.
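Putting the snippets above together, the idiom looks like this (a
sketch assembled from the fragments quoted in this message):
  #include <arm_neon.h>

  int32x4_t
  make_vec (int32_t x)
  {
    /* Initialise an array and load it back: the recommended
       endian-agnostic way of constructing an ACLE vector.  */
    int32_t arr[] = { x, x, x, x };
    return vld1q_s32 (arr);
  }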
gcc/
PR target/109072
* config/aarch64/aarch64-protos.h (aarch64_vector_load_decl): Declare.
* config/aarch64/aarch64.h (machine_function::vector_load_decls): New
variable.
* config/aarch64/aarch64-builtins.cc (aarch64_record_vector_load_arg):
New function.
(aarch64_general_gimple_fold_builtin): Delay folding of vld1 until
after inlining. Record which decls are loaded from. Fix handling
of vops for loads and stores.
* config/aarch64/aarch64.cc (aarch64_vector_load_decl): New function.
(aarch64_accesses_vector_load_decl_p): Likewise.
(aarch64_vector_costs::m_stores_to_vector_load_decl): New member
variable.
(aarch64_vector_costs::add_stmt_cost): If the function has a vld1
that loads from a decl, treat vector stores to those decls as
zero cost.
(aarch64_vector_costs::finish_cost): ...and in that case,
if the vector code does nothing more than a store, give the
prologue a zero cost as well.
gcc/testsuite/
PR target/109072
* gcc.target/aarch64/pr109072_1.c: New test.
* gcc.target/aarch64/pr109072_2.c: Likewise.
(cherry picked from commit fcb411564a655a01f759eea3bb16bfd1bc879bfd)
|
|
In this PR we had a write to one vector of a 4-vector tuple.
The vector had mode V1DI, and the target doesn't provide V1DI
moves, so this was converted into:
(clobber (subreg:V1DI (reg/v:V4x1DI 92 [ b ]) 24))
followed by a DImode move. (The clobber isn't really necessary
or helpful for a single word, but would be for wider moves.)
The subreg in the clobber survived until after RA:
(clobber (subreg:V1DI (reg/v:V4x1DI 34 v2 [orig:92 b ] [92]) 24))
IMO this isn't well-formed. If a subreg of a hard register simplifies
to a hard register, it should be replaced by the hard register. If the
subreg doesn't simplify, then target-independent code can't be sure
which parts of the register are affected and which aren't. A clobber
of such a subreg isn't useful and (again IMO) should just be removed.
Conversely, a use of such a subreg is effectively a use of the whole
inner register.
LRA has code to simplify subregs of hard registers, but it didn't
handle bare uses and clobbers. The patch extends it to do that.
One question was whether the final_p argument to alter_subregs
should be true or false. True is IMO dangerous, since it forces
replacements that might not be valid from a dataflow perspective,
and uses and clobbers only exist for dataflow. As said above,
I think the correct way of handling a failed simplification would
be to delete clobbers and replace uses of subregs with uses of
the inner register. But I didn't want to write untested code
to do that.
In the PR, the clobber caused an infinite loop in DCE, because
of a disagreement about what effect the clobber had. But for
the reasons above, I think that was GIGO rather than a bug in
DF or DCE.
gcc/
PR rtl-optimization/108681
* lra-spills.cc (lra_final_code_change): Extend subreg replacement
code to handle bare uses and clobbers.
gcc/testsuite/
PR rtl-optimization/108681
* gcc.target/aarch64/pr108681.c: New test.
(cherry picked from commit 3cac06d84f334705ed0bce12fbc3a4cec4a8fd3b)
|
|
The patch that added support for fmin/fmax reductions didn't
handle single def-use cycles. In some ways, this seems like
going out of our way to make things slower, but that's a
discussion for another day.
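For illustration, a reduction of the affected kind might look like this
(hypothetical; the actual tests are the pr108608 ones below). The
reduction statement is an internal-function call rather than a tree
code:
  double
  fmax_reduc (const double *x, int n)
  {
    double r = x[0];
    for (int i = 1; i < n; i++)
      r = __builtin_fmax (r, x[i]);
    return r;
  }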
gcc/
PR tree-optimization/108608
* tree-vect-loop.cc (vect_transform_reduction): Handle single
def-use cycles that involve function calls rather than tree codes.
gcc/testsuite/
PR tree-optimization/108608
* gcc.dg/vect/pr108608.c: New test.
* gcc.target/aarch64/sve/pr108608-1.c: Likewise.
(cherry picked from commit 2bb444787ed17a9e786f544cdf51ee2ac6779ab2)
|
|
convert_memory_address_addr_space_1 has two modes: one in which it
tries to create a self-contained RTL expression (which might fail)
and one in which it can emit new instructions where necessary.
When handling a CONST, the function recurses into the CONST's
operand and then constifies the result. But that's only valid if
the result is still a self-contained expression. If new instructions
have been emitted, the expression will refer to the (non-constant)
results of those instructions.
In the PR, this caused us to emit a nonsensical (const (reg ...))
REG_EQUAL note.
gcc/
PR tree-optimization/108603
* explow.cc (convert_memory_address_addr_space_1): Only wrap
the result of a recursive call in a CONST if no instructions
were emitted.
gcc/testsuite/
PR tree-optimization/108603
* gcc.target/aarch64/sve/pr108603.c: New test.
(cherry picked from commit b09dc74801cf4e19bdf5fcd18a5cd53fc9e9ca9a)
|
|
Since rtl-ssa isn't a real/native SSA representation, it has
to honour the constraints of the underlying rtl representation.
Part of this involves maintaining an rpo list of definitions
for each rtl register, backed by a splay tree where necessary
for quick lookup/insertion.
However, clobbers of a register don't act as barriers to
other clobbers of a register. E.g. it's possible to move one
flag-clobbering instruction across an arbitrary number of other
flag-clobbering instructions. In order to allow passes to do
that without quadratic complexity, the splay tree groups all
consecutive clobbers into groups, with only the group being
entered into the splay tree. These groups in turn have an
internal splay tree of clobbers where necessary.
This means that, if we insert a new definition and use into
the middle of a sea of clobbers, we need to split the clobber
group into two groups. This was quite a difficult condition
to trigger during development, and the PR shows that the code
to handle it had (at least) two bugs.
First, the process involves searching the clobber tree for
the split point. This search can give either the previous
clobber (which will belong to the first of the split groups)
or the next clobber (which will belong to the second of the
split groups). The code for the former case handled the
split correctly but the code for the latter case didn't.
Second, I'd forgotten to add the second clobber group to the
main splay tree. :-(
gcc/
PR rtl-optimization/108508
* rtl-ssa/accesses.cc (function_info::split_clobber_group): When
the splay tree search gives the first clobber in the second group,
make sure that the root of the first clobber group is updated
correctly. Enter the new clobber group into the definition splay
tree.
gcc/testsuite/
PR rtl-optimization/108508
* gcc.target/aarch64/pr108508.c: New test.
(cherry picked from commit f4e1b46618ef3bd7933992ab79f663ab9112bb80)
|
|
vectorizable_condition checks whether a COND_EXPR condition is used
elsewhere with a loop mask. If so, it applies the loop mask to the
COND_EXPR too, to reduce the number of live masks and to increase the
chance of combining the AND with the comparison.
There is also code to do this for inverted conditions. E.g. if
we have a < b ? c : d and something else is conditional on !(a < b)
(such as a load in d), we use !(a < b) ? d : c and apply the loop
mask to !(a < b).
This inversion relied on the function's bitop1/bitop2 mechanism.
However, that mechanism is skipped if the condition is split out of
the COND_EXPR as a separate statement. This meant that we could end
up using the inverse of the intended condition.
There is a separate way of negating the condition when a mask
is being applied (which is also used for EXTRACT_LAST reductions).
This patch uses that instead.
As well as the testcase, this fixes aarch64/sve/vcond_{4,17}_run.c.
gcc/
PR tree-optimization/108430
* tree-vect-stmts.cc (vectorizable_condition): Fix handling
of inverted condition.
gcc/testsuite/
PR tree-optimization/108430
* gcc.target/aarch64/sve/pr108430.c: New test.
(cherry picked from commit 2a8ce4b52f5892a10a02b94d7be689e59a444ff6)
|
|
This is the 2nd attempt to fix PR90706. IRA calculates wrong AVR
costs for moving general hard regs of SFmode. This was the reason for
spilling a pseudo in the PR. In this patch we use the smaller of the
move costs of a hard reg in its natural and operand modes.
PR rtl-optimization/90706
gcc/ChangeLog:
* ira-costs.cc: Include print-rtl.h.
(record_reg_classes, scan_one_insn): Add code to print debug info.
(record_operand_costs): Find and use smaller cost for hard reg
move.
gcc/testsuite/ChangeLog:
* gcc.target/avr/pr90706.c: New.
|
|
gcc/testsuite/
* gcc.target/sparc/20230328-1.c: New test.
* gcc.target/sparc/20230328-2.c: Likewise.
* gcc.target/sparc/20230328-3.c: Likewise.
* gcc.target/sparc/20230328-4.c: Likewise.
|
|
This is a regression present on the mainline and 12 branch at -O2, but the
issue is related to vectorization so was present at -O3 in earlier versions.
The vcondu expander that was added for VIS 3 more than a decade ago does not
fully work, because it does not filter out the unsigned condition codes (the
instruction is an UNSPEC that accepts only signed condition codes).
While I was at it, I also added the missing vcond and vcondu expanders for
the new comparison instructions that were added in VIS 4.
gcc/
PR target/109140
* config/sparc/sparc.cc (sparc_expand_vcond): Call signed_condition
on operand #3 to get the final condition code. Use std::swap.
* config/sparc/sparc.md (vcondv8qiv8qi): New VIS 4 expander.
(fucmp<gcond:code>8<P:mode>_vis): Move around.
(fpcmpu<gcond:code><GCM:gcm_name><P:mode>_vis): Likewise.
(vcondu<GCM:mode><GCM:mode>): New VIS 4 expander.
gcc/testsuite/
* gcc.target/sparc/20230328-1.c: New test.
* gcc.target/sparc/20230328-2.c: Likewise.
* gcc.target/sparc/20230328-3.c: Likewise.
* gcc.target/sparc/20230328-4.c: Likewise.
|
|
When we expand the __builtin_vec_xst_trunc built-in, we use the wrong
mode for the MEM operand, which causes an unrecognizable insn ICE. The
solution is to use the correct TMODE.
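A sketch of the built-in's use (hypothetical, not the actual
pr109178.c; assumes the vec_xst_trunc overload for vector __int128 and
a Power10 target):
  #include <altivec.h>

  void
  store_truncated (vector signed __int128 v, signed char *p)
  {
    /* Truncating store; the MEM operand must use TMODE (here QImode
       for the char destination), not the full vector mode.  */
    vec_xst_trunc (v, 0, p);
  }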
2023-03-20 Peter Bergner <bergner@linux.ibm.com>
gcc/
PR target/109178
* config/rs6000/rs6000-builtin.cc (stv_expand_builtin): Use tmode.
gcc/testsuite/
PR target/109178
* gcc.target/powerpc/pr109178.c: New test.
(cherry picked from commit fbd50e867e6a782c7b56c9727bf7e1e74dae4b94)
|
|
The following testcase ICEs because we call tree_function_versioning
with an old_decl whose target attributes don't support V4DImode, so the
DECL_MODE of its DECL_ARGUMENTS is BLKmode, while new_decl supports
those modes.
tree_function_versioning initially copies DECL_RESULT and DECL_ARGUMENTS
from old_decl to new_decl, then calls initialize_cfun to create cfun
and only when the cfun is created it can later actually remap_decl
DECL_RESULT and DECL_ARGUMENTS etc.
The problem is that initialize_cfun -> push_struct_function ->
allocate_struct_function calls relayout_decl on DECL_RESULT and
DECL_ARGUMENTS, which clobbers DECL_MODE of old_decl and we then ICE because
of it.
In particular, allocate_struct_function does:
  if (!abstract_p)
    {
      /* Now that we have activated any function-specific attributes
         that might affect layout, particularly vector modes, relayout
         each of the parameters and the result.  */
      relayout_decl (result);
      for (tree parm = DECL_ARGUMENTS (fndecl); parm;
           parm = DECL_CHAIN (parm))
        relayout_decl (parm);
      /* Similarly relayout the function decl.  */
      targetm.target_option.relayout_function (fndecl);
    }
  if (!abstract_p && aggregate_value_p (result, fndecl))
    {
#ifdef PCC_STATIC_STRUCT_RETURN
      cfun->returns_pcc_struct = 1;
#endif
      cfun->returns_struct = 1;
    }
Now, in the case of tree_function_versioning, I believe all that we need
from these is possibly the
targetm.target_option.relayout_function (fndecl);
call (arm only); we will remap DECL_RESULT and DECL_ARGUMENTS later on,
and copy_decl_for_dup_finish in that case will handle all we need:
  /* For vector typed decls make sure to update DECL_MODE according
     to the new function context.  */
  if (VECTOR_TYPE_P (TREE_TYPE (copy)))
    SET_DECL_MODE (copy, TYPE_MODE (TREE_TYPE (copy)));
We don't need the cfun->returns_*struct either, because we override it
in initialize_cfun a few lines later:
  /* Copy items we preserve during cloning.  */
  ...
  cfun->returns_struct = src_cfun->returns_struct;
  cfun->returns_pcc_struct = src_cfun->returns_pcc_struct;
So, to avoid the clobbering of DECL_RESULT/DECL_ARGUMENTS of old_decl,
the following patch arranges for allocate_struct_function to be called with
abstract_p true and calls targetm.target_option.relayout_function (fndecl);
by hand.
The DECL_RESULT/DECL_ARGUMENTS copying at the start of
initialize_cfun is removed because the only caller,
tree_function_versioning, does that unconditionally before.
2023-03-17 Jakub Jelinek <jakub@redhat.com>
PR target/105554
* function.h (push_struct_function): Add ABSTRACT_P argument defaulted
to false.
* function.cc (push_struct_function): Add ABSTRACT_P argument, pass it
to allocate_struct_function instead of false.
* tree-inline.cc (initialize_cfun): Don't copy DECL_ARGUMENTS
nor DECL_RESULT here. Pass true as ABSTRACT_P to
push_struct_function. Call targetm.target_option.relayout_function
after it.
(tree_function_versioning): Formatting fix.
* gcc.target/i386/pr105554.c: New test.
(cherry picked from commit 24c06560a7fa39049911eeb8777325d112e0deb9)
|
|
verification [PR108934]
In the following testcase we try to std::bit_cast a (pair of) integral
value(s) which has some non-zero bits in the place of x86 long double
(for 64-bit 16 byte type with 10 bytes actually loaded/stored by hw,
for 32-bit 12 byte) and starting with my PR104522 change we reject that
as native_interpret_expr fails on it. The PR104522 change extends what
has been done before for MODE_COMPOSITE_P (but those don't have any padding
bits) to all floating point types, because e.g. the exact x86 long double
has various bit combinations we don't support, like
pseudo-(denormals,infinities,NaNs) or unnormals. The HW handles some of
those as exceptional cases and others similarly to the non-pseudo ones.
But for the padding bits it actually doesn't load/store those bits at all,
it loads/stores 10 bytes. So, I think we should exempt the padding bits
from the reverse comparison (the native_encode_expr bits for the padding
will be all zeros), which the following patch does. For bit_cast it is
similar to e.g. ignoring padding bits if the destination is a structure
which has padding bits in there.
The change reverts auto-init-4.c to how it behaved before the
PR105259 change, as some more VCEs can now be done.
2023-03-02 Jakub Jelinek <jakub@redhat.com>
PR c++/108934
* fold-const.cc (native_interpret_expr) <case REAL_CST>: Before the
memcmp comparison, copy the bytes from ptr to a temporary buffer and
clear the padding bits in there.
* gcc.target/i386/auto-init-4.c: Revert PR105259 change.
* g++.target/i386/pr108934.C: New test.
(cherry picked from commit cc88366a80e35b3e53141f49d3071010ff3c2ef8)
|
|
The builtins used in avx512bf16vlintrin.h implementation need both
avx512bf16 and avx512vl ISAs, which the header ensures for them, but
the builtins weren't actually requiring avx512vl, so when used by hand
with just -mavx512bf16 -mno-avx512vl it resulted in ICEs.
Fixed by adding OPTION_MASK_ISA_AVX512VL to their BDESC.
2023-02-24 Jakub Jelinek <jakub@redhat.com>
PR target/108881
* config/i386/i386-builtin.def (__builtin_ia32_cvtne2ps2bf16_v16hi,
__builtin_ia32_cvtne2ps2bf16_v16hi_mask,
__builtin_ia32_cvtne2ps2bf16_v16hi_maskz,
__builtin_ia32_cvtne2ps2bf16_v8hi,
__builtin_ia32_cvtne2ps2bf16_v8hi_mask,
__builtin_ia32_cvtne2ps2bf16_v8hi_maskz,
__builtin_ia32_cvtneps2bf16_v8sf_mask,
__builtin_ia32_cvtneps2bf16_v8sf_maskz,
__builtin_ia32_cvtneps2bf16_v4sf_mask,
__builtin_ia32_cvtneps2bf16_v4sf_maskz,
__builtin_ia32_dpbf16ps_v8sf, __builtin_ia32_dpbf16ps_v8sf_mask,
__builtin_ia32_dpbf16ps_v8sf_maskz, __builtin_ia32_dpbf16ps_v4sf,
__builtin_ia32_dpbf16ps_v4sf_mask,
__builtin_ia32_dpbf16ps_v4sf_maskz): Require also
OPTION_MASK_ISA_AVX512VL.
* gcc.target/i386/avx512bf16-pr108881.c: New test.
(cherry picked from commit 0ccfa3884f638816af0f5a3f0ee2695e0771ef6d)
|
|
This fixes an oversight from when the hard limits on using
generic vectors for the vectorizer were removed to enable both SLP
and BB vectorization to use them. The vectorizer relies on vector
lowering to expand plus, minus and negate to bit operations, but
vector lowering has a hard limit on the minimum number of elements
per work item. Vectorizer costs for the testcase at hand work out
to vectorize a loop with just two work items per vector, and that
causes element-wise expansion and spilling.
The fix for now is to re-instantiate the hard limit, matching what
vector lowering does. For the future the way to go is to emit the
lowered sequence directly from the vectorizer instead.
PR tree-optimization/108724
* tree-vect-stmts.cc (vectorizable_operation): Avoid
using word_mode vectors when vector lowering will
decompose them to elementwise operations.
* gcc.target/i386/pr108724.c: New testcase.
(cherry picked from commit dc87e1391c55c666c7ff39d4f0dea87666f25468)
|
|
In the toolchain convention, we describe -mfpu= as:
"Selects the allowed set of basic floating-point instructions and
registers. This option should not change the FP calling convention
unless it's necessary."
Though not explicitly stated, the rationale of this rule is to allow
combinations like "-mabi=lp64s -mfpu=64". This will be useful for
running applications with the LP64S/F ABI on double-float-capable
LoongArch hardware, using a math library built for the LP64S/F ABI but
with native double-float HW instructions, for better performance.
And now a case in the Linux kernel has again proven the usefulness of
this kind of combination. The AMDGPU DCN kernel driver needs to perform
some floating-point operations, but the entire kernel uses the LP64S
ABI. So the
translation units of the AMDGPU DCN driver need to be compiled with
-mfpu=64 (the kernel lacks soft-FP routines in libgcc), but -mabi=lp64s
(or you can't link it with the other part of the kernel).
Unfortunately, GCC currently uses TARGET_{HARD,SOFT,DOUBLE}_FLOAT to
determine the floating-point calling convention. This causes "-mfpu=64"
to silently allow using $fa* to pass parameters and return values EVEN
IF -mabi=lp64s is used. To make things worse, the generated object file
has SOFT-FLOAT set in the eflags field, so the linker will happily link
it with other LP64S ABI object files, but obviously this will lead to
bad results at runtime. And for now all loongarch64 CPU models (-march
settings) imply -mfpu=64 by default, so the issue makes a single
"-mabi=lp64s" option basically broken (fortunately, most projects, e.g.
the Linux kernel, have used -msoft-float, which implies both
-mabi=lp64s and -mfpu=none, as we've recommended in the toolchain
convention doc).
The fix is simple: use TARGET_*_FLOAT_ABI instead.
I consider this a bug fix: the behavior difference from the toolchain
convention doc is a bug, and generating object files with SOFT-FLOAT
flag but parameters/return values passed through FPRs is definitely a
bug.
Bootstrapped and regtested on loongarch64-linux-gnu. Ok for trunk and
release/gcc-12 branch?
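As a concrete illustration (a sketch, not one of the flt-abi-isa-*.c
tests below): compiled with -mabi=lp64s -mfpu=64, the function below
must pass a and b and return the result in GPRs, per the soft-float
ABI, even though the compiler may use FPRs internally for the addition.
  double
  add (double a, double b)
  {
    return a + b;
  }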
gcc/ChangeLog:
PR target/109000
* config/loongarch/loongarch.h (FP_RETURN): Use
TARGET_*_FLOAT_ABI instead of TARGET_*_FLOAT.
(UNITS_PER_FP_ARG): Likewise.
gcc/testsuite/ChangeLog:
PR target/109000
* gcc.target/loongarch/flt-abi-isa-1.c: New test.
* gcc.target/loongarch/flt-abi-isa-2.c: New test.
* gcc.target/loongarch/flt-abi-isa-3.c: New test.
* gcc.target/loongarch/flt-abi-isa-4.c: New test.
(cherry picked from commit 75eccddef5784bc5e262af31f535267a9c4e993e)
|
|
As Richard pointed out in [1] and the testing on Power10, the
proposed fix for PR96373 requires some updates on a few rs6000
test cases which adopt partial vector. This patch is to fix
all of them with one extra option "-fno-trapping-math" as
Richard suggested.
Besides, the original test case also failed on Power10 without
Richard's proposed fix, this patch adds it together for a bit
better testing coverage.
[1] https://gcc.gnu.org/pipermail/gcc-patches/2023-January/610728.html
PR target/96373
gcc/testsuite/ChangeLog:
* gcc.target/powerpc/p9-vec-length-epil-1.c: Add -fno-trapping-math.
* gcc.target/powerpc/p9-vec-length-epil-2.c: Likewise.
* gcc.target/powerpc/p9-vec-length-epil-3.c: Likewise.
* gcc.target/powerpc/p9-vec-length-epil-4.c: Likewise.
* gcc.target/powerpc/p9-vec-length-epil-5.c: Likewise.
* gcc.target/powerpc/p9-vec-length-epil-6.c: Likewise.
* gcc.target/powerpc/p9-vec-length-epil-8.c: Likewise.
* gcc.target/powerpc/p9-vec-length-full-1.c: Likewise.
* gcc.target/powerpc/p9-vec-length-full-2.c: Likewise.
* gcc.target/powerpc/p9-vec-length-full-3.c: Likewise.
* gcc.target/powerpc/p9-vec-length-full-4.c: Likewise.
* gcc.target/powerpc/p9-vec-length-full-5.c: Likewise.
* gcc.target/powerpc/p9-vec-length-full-6.c: Likewise.
* gcc.target/powerpc/p9-vec-length-full-8.c: Likewise.
* gcc.target/powerpc/pr96373.c: New test.
(cherry picked from commit 4f5a1198065dc078f8099db628da7b06a2666f34)
|
|
As the testcase shows, this pattern had an incorrect constraint leading
to GCC's output getting rejected by the assembler.
This patch fixes the constraint accordingly.
The test is split into two: one that can run without bf16 support from
the assembler and another that checks that the output actually assembles
when such support is available.
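A sketch of the kind of call that exercises the lane form
(hypothetical; the real reproducers are the pr104921 tests): the lane
index is operand 3 in the pattern.
  #include <arm_neon.h>

  float32x4_t
  f (float32x4_t r, bfloat16x8_t a, bfloat16x4_t b)
  {
    /* Requires bf16 support, e.g. -march=armv8.2-a+bf16.  */
    return vbfmlalbq_lane_f32 (r, a, b, 0);
  }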
gcc/ChangeLog:
PR target/104921
* config/aarch64/aarch64-simd.md (aarch64_bfmlal<bt>_lane<q>v4sf):
Use correct constraint for operand 3.
gcc/testsuite/ChangeLog:
PR target/104921
* gcc.target/aarch64/pr104921-1.c: New test.
* gcc.target/aarch64/pr104921-2.c: New test.
* gcc.target/aarch64/pr104921.x: Include file for new tests.
(cherry picked from commit 277e1f30a5e4e634304a7b8a532825119f0ea47f)
|
|
As Andrew pointed out in PR108396, there is one typo in
rs6000-overload.def for the built-in function vec_vsubcuq:
[VEC_VSUBCUQ, vec_vsubcuqP, __builtin_vec_vsubcuq]
"vec_vsubcuqP" should be "vec_vsubcuq"; this typo caused
us to define vec_vsubcuqP in rs6000-vecdefines.h instead
of vec_vsubcuq, so the compiler was no longer able to
recognize the built-in function name vec_vsubcuq.
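After the fix, a use like the following resolves again (a sketch; the
actual test is pr108396.c):
  #include <altivec.h>

  vector unsigned __int128
  f (vector unsigned __int128 a, vector unsigned __int128 b)
  {
    /* Quadword subtract carry-out; this name was previously not
       defined because of the typo.  */
    return vec_vsubcuq (a, b);
  }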
Co-authored-By: Andrew Pinski <apinski@marvell.com>
PR target/108396
gcc/ChangeLog:
* config/rs6000/rs6000-overload.def (VEC_VSUBCUQ): Fix typo
vec_vsubcuqP with vec_vsubcuq.
gcc/testsuite/ChangeLog:
* gcc.target/powerpc/pr108396.c: New test.
(cherry picked from commit aaf29ae6cdbaad58b709a77784375d15138174b3)
|
|
PR108348 shows one special case where MMA opaque types are
used in function arguments and treated as pass by reference;
this results in a copy from the argument to a temp variable.
Since this copying happens before the rs6000_function_arg
check, it can cause an ICE without MMA support. This patch
teaches the function rs6000_opaque_type_invalid_use_p to check
whether any function argument in a gcall stmt makes an invalid
use of MMA opaque types.
Btw, I checked the handling of the return value; it doesn't have
this kind of issue, as its checking and error emission happen
quite early, so this patch doesn't handle function return values.
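A sketch of the problematic shape (hypothetical, in the spirit of the
pr108348 tests): compiled without MMA support, the by-value argument
copy below used to ICE and is now diagnosed.
  void bar (__vector_pair);

  void
  foo (__vector_pair *p)
  {
    bar (*p);
  }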
PR target/108348
gcc/ChangeLog:
* config/rs6000/rs6000.cc (rs6000_opaque_type_invalid_use_p): Add the
support for invalid uses of MMA opaque type in function arguments.
gcc/testsuite/ChangeLog:
* gcc.target/powerpc/pr108348-1.c: New test.
* gcc.target/powerpc/pr108348-2.c: New test.
(cherry picked from commit 5d9529687deb9ed009361a16c02a7f6c3e2ebbf3)
|
|
As PR108272 shows, there are some invalid uses of MMA opaque
types in inline asm statements. This patch teaches the
function rs6000_opaque_type_invalid_use_p about inline asm,
checking and reporting an error for any invalid use of MMA
opaque types in input and output operands.
PR target/108272
gcc/ChangeLog:
* config/rs6000/rs6000.cc (rs6000_opaque_type_invalid_use_p): Add the
support for invalid uses in inline asm, factor out the checking and
erroring to lambda function check_and_error_invalid_use.
gcc/testsuite/ChangeLog:
* gcc.target/powerpc/pr108272-1.c: New test.
* gcc.target/powerpc/pr108272-2.c: New test.
* gcc.target/powerpc/pr108272-3.c: New test.
* gcc.target/powerpc/pr108272-4.c: New test.
(cherry picked from commit 074b0c03eabeb8e9c8de813c81bf87a1f88fdb65)
|
|
As reported in the PR, there are some -Wuninitialized warnings in
avx512erintrin.h. One can see that by compiling sse-23.c testcase with
-Wuninitialized (or when actually using those intrinsics).
Those 6 spots use an uninitialized variable and pass it as one of the
argument to a builtin with constant mask -1, because there is no unmasked
builtin. It is true that expansion of those builtins into RTL will see
mask is all ones and ignore the unneeded argument, but -Wuninitialized
is diagnosed on GIMPLE and on GIMPLE these builtins are just builtin calls.
avx512fintrin.h and other headers use in these cases the _mm*_undefined_* ()
intrinsics, like:
  return (__m512i) __builtin_ia32_psrav8di_mask ((__v8di) __X,
                                                 (__v8di) __Y,
                                                 (__v8di)
                                                 _mm512_undefined_epi32 (),
                                                 (__mmask8) -1);
etc. and the following patch does the same for avx512erintrin.h.
With the recent changes in the C++ FE and the _mm*_undefined_*
intrinsics, we don't emit -Wuninitialized warnings for those
(previously we avoided them only in C, due to self-initialization).
Of course we could also
just self-initialize these uninitialized vars and add the #pragma GCC
diagnostic dances around it, but using the intrinsics is consistent with
the rest and IMHO cleaner.
2023-01-31 Jakub Jelinek <jakub@redhat.com>
PR c++/105593
* config/i386/avx512erintrin.h (_mm512_exp2a23_round_pd,
_mm512_exp2a23_round_ps, _mm512_rcp28_round_pd, _mm512_rcp28_round_ps,
_mm512_rsqrt28_round_pd, _mm512_rsqrt28_round_ps): Use
_mm512_undefined_pd () or _mm512_undefined_ps () instead of using
uninitialized automatic variable __W.
* gcc.target/i386/sse-23.c: Add -Wuninitialized to dg-options.
(cherry picked from commit 41602390456901c14ecdfa2fa64c3cebd5b6ff09)
|
|
The following testcase is miscompiled. The problem is that during
RTL DSE we see a V4DI register being loaded with the value
{ 16, 16, 0, 0 }; DSE mostly works in terms of scalar modes, so it
calls movoi to set an OImode REG to (const_wide_int 0x100000000000000010)
and ix86_convert_const_wide_int_to_broadcast thinks it can compute
that value by broadcasting DImode 0x10. While it is true that
for a TImode result the broadcast could be used, for OImode/XImode
it can't be, because all but the lowest 2 HOST_WIDE_INTs aren't
present (so are 0 or -1 depending on sign), not 0x10 in this case.
The function checks if the least significant HOST_WIDE_INT elt
of the CONST_WIDE_INT is broadcastable from QI/HI/SI/DImode and then
  /* Check if OP can be broadcasted from VAL.  */
  for (int i = 1; i < CONST_WIDE_INT_NUNITS (op); i++)
    if (val != CONST_WIDE_INT_ELT (op, i))
      return nullptr;
That is needed of course, but nothing checks that
CONST_WIDE_INT_NUNITS (op) isn't too small for the mode in question.
I think if op were 0 or -1, it ought never to be a CONST_WIDE_INT
but a CONST_INT, so we can just punt whenever the number of
CONST_WIDE_INT elts is not the expected one.
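A minimal sketch of the trigger (hypothetical, not the actual
avx2-pr108599.c): DSE may turn the V4DI store below into an OImode
constant move, which must not be emitted as a DImode broadcast of 16.
  typedef long long v4di __attribute__ ((vector_size (32)));

  v4di v;

  void
  f (void)
  {
    v = (v4di) { 16, 16, 0, 0 };
  }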
2023-01-31 Jakub Jelinek <jakub@redhat.com>
PR target/108599
* config/i386/i386-expand.cc
(ix86_convert_const_wide_int_to_broadcast): Return nullptr if
CONST_WIDE_INT_NUNITS (op) times HOST_BITS_PER_WIDE_INT isn't
equal to bitsize of mode.
* gcc.target/i386/avx2-pr108599.c: New test.
(cherry picked from commit 963315a922e228c4f6853826666151fc540f111a)
|
|
2022-09-28 Tejas Joshi <TejasSanjay.Joshi@amd.com>
gcc/ChangeLog:
* common/config/i386/cpuinfo.h (get_amd_cpu): Recognize znver4.
* common/config/i386/i386-common.cc (processor_names): Add znver4.
(processor_alias_table): Add znver4 and modularize old znvers.
* common/config/i386/i386-cpuinfo.h (processor_subtypes):
AMDFAM19H_ZNVER4.
* config.gcc (x86_64-*-* |...): Likewise.
* config/i386/driver-i386.cc (host_detect_local_cpu): Let
-march=native recognize znver4 cpus.
* config/i386/i386-c.cc (ix86_target_macros_internal): Add znver4.
* config/i386/i386-options.cc (m_ZNVER4): New definition.
(m_ZNVER): Include m_ZNVER4.
(processor_cost_table): Add znver4.
* config/i386/i386.cc (ix86_reassociation_width): Likewise.
* config/i386/i386.h (processor_type): Add PROCESSOR_ZNVER4.
(PTA_ZNVER1): New definition.
(PTA_ZNVER2): Likewise.
(PTA_ZNVER3): Likewise.
(PTA_ZNVER4): Likewise.
* config/i386/i386.md (define_attr "cpu"): Add znver4 and rename
md file.
* config/i386/x86-tune-costs.h (znver4_cost): New definition.
* config/i386/x86-tune-sched.cc (ix86_issue_rate): Add znver4.
(ix86_adjust_cost): Likewise.
* config/i386/znver1.md: Rename to znver.md.
* config/i386/znver.md: Add new reservations for znver4.
* doc/extend.texi: Add details about znver4.
* doc/invoke.texi: Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/i386/funcspec-56.inc: Handle new march.
* g++.target/i386/mv29.C: Likewise.
(cherry picked from commit bf3b532b524ecacb3202ab2c8af419ffaaab7cff)
|
|
This patch surrounds the scalar operand of the MVE vcmp patterns with a
vec_duplicate to ensure both operands of the comparison operator have
the same (vector) mode.
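A sketch of an affected comparison (hypothetical; the actual test is
pr107987.c): the scalar operand b now gets wrapped in a vec_duplicate
so both sides of the comparison share the vector mode.
  #include <arm_mve.h>

  mve_pred16_t
  f (int32x4_t a, int32_t b)
  {
    return vcmpeqq_n_s32 (a, b);
  }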
gcc/ChangeLog:
PR target/107987
* config/arm/mve.md (mve_vcmp<mve_cmp_op>q_n_<mode>,
@mve_vcmp<mve_cmp_op>q_n_f<mode>): Apply vec_duplicate to scalar
operand.
gcc/testsuite/ChangeLog:
* gcc.target/arm/mve/pr107987.c: New test.
(cherry picked from commit ed34c3bc3428bce663d42e9eeda10bc0c5d56d5c)
|
|
While looking at PR 105549, which is about fixing the ABI break
introduced in GCC 9.1 in parameter alignment with bit-fields, we
noticed that the GCC 9.1 warning is not emitted in all the cases where
it should be. This patch fixes that and the next patch in the series
fixes the GCC 9.1 break.
We split this into two patches since patch #2 introduces a new ABI
break starting with GCC 13.1. This way, patch #1 can be back-ported
to release branches if needed to fix the GCC 9.1 warning issue.
The main idea is to add a new global boolean that indicates whether
we're expanding the start of a function, so that aarch64_layout_arg
can emit warnings for callees as well as callers. This removes the
need for aarch64_function_arg_boundary to warn (with its incomplete
information). However, in the first patch there are still cases where
we emit warnings were we should not; this is fixed in patch #2 where
we can distinguish between GCC 9.1 and GCC.13.1 ABI breaks properly.
The fix in aarch64_function_arg_boundary (replacing & with &&) looks
like an oversight of a previous commit in this area which changed
'abi_break' from a boolean to an integer.
We also take the opportunity to fix the comment above
aarch64_function_arg_alignment since the value of the abi_break
parameter was changed in a previous commit, no longer matching the
description.
2022-11-28 Christophe Lyon <christophe.lyon@arm.com>
Richard Sandiford <richard.sandiford@arm.com>
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_function_arg_alignment): Fix
comment.
(aarch64_layout_arg): Factorize warning conditions.
(aarch64_function_arg_boundary): Fix typo.
* function.cc (currently_expanding_function_start): New variable.
(expand_function_start): Handle
currently_expanding_function_start.
* function.h (currently_expanding_function_start): Declare.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/bitfield-abi-warning-align16-O2.c: New test.
* gcc.target/aarch64/bitfield-abi-warning-align16-O2-extra.c: New
test.
* gcc.target/aarch64/bitfield-abi-warning-align32-O2.c: New test.
* gcc.target/aarch64/bitfield-abi-warning-align32-O2-extra.c: New
test.
* gcc.target/aarch64/bitfield-abi-warning-align8-O2.c: New test.
* gcc.target/aarch64/bitfield-abi-warning.h: New test.
* g++.target/aarch64/bitfield-abi-warning-align16-O2.C: New test.
* g++.target/aarch64/bitfield-abi-warning-align16-O2-extra.C: New
test.
* g++.target/aarch64/bitfield-abi-warning-align32-O2.C: New test.
* g++.target/aarch64/bitfield-abi-warning-align32-O2-extra.C: New
test.
* g++.target/aarch64/bitfield-abi-warning-align8-O2.C: New test.
* g++.target/aarch64/bitfield-abi-warning.h: New test.
(cherry picked from commit 3df1a115be22caeab3ffe7afb12e71adb54ff132)
|
|
In the M-Class Arm-ARM:
https://developer.arm.com/documentation/ddi0553/bu/?lang=en
these MVE instructions only have '!' writeback variant and at:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107714
we found that the Um constraint would also allow through a
register offset writeback, resulting in an assembler error.
Here I have added a new constraint and predicate for these
instructions, which (uniquely, AFAICT) only support a `!` writeback
increment by the data size (inside the compiler this is a POST_INC).
No regressions in arm-none-eabi with MVE and MVE.FP.
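For reference, a sketch of one of the affected intrinsics (not one of
the new tests; the writeback addressing is generated internally):
  #include <arm_mve.h>

  void
  f (uint8_t *p, uint8x16x2_t v)
  {
    /* vst2q only supports the `!' writeback form (a POST_INC by the
       data size), hence the new constraint and predicate.  */
    vst2q_u8 (p, v);
  }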
gcc/ChangeLog:
PR target/107714
* config/arm/arm-protos.h (mve_struct_mem_operand): New prototype.
* config/arm/arm.cc (mve_struct_mem_operand): New function.
* config/arm/constraints.md (Ug): New constraint.
* config/arm/mve.md (mve_vst4q<mode>): Change constraint.
(mve_vst2q<mode>): Likewise.
(mve_vld4q<mode>): Likewise.
(mve_vld2q<mode>): Likewise.
* config/arm/predicates.md (mve_struct_operand): New predicate.
gcc/testsuite/ChangeLog:
PR target/107714
* gcc.target/arm/mve/intrinsics/vldst24q_reg_offset.c: New test.
(cherry picked from commit 4269a6567eb991e6838f40bda5be9e3a7972530c)
|
|
In this PR we ICE when expanding the __rbit builtin with a NULL target
rtx. I *think* that only happens when the result is unused, and hence
maybe we shouldn't be expanding any RTL at all, but the ICE here is
easily fixed by deriving the mode from the type of the expression
rather than from the target.
This patch does that.
Bootstrapped and tested on aarch64-none-linux-gnu.
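A sketch of the trigger (hypothetical, not the actual pr108140.c;
assumes the ACLE data intrinsics from arm_acle.h): the result of
__rbit is unused, so expansion was handed a NULL target rtx.
  #include <arm_acle.h>

  void
  f (unsigned int x)
  {
    /* Result deliberately unused.  */
    __rbit (x);
  }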
gcc/ChangeLog:
PR target/108140
* config/aarch64/aarch64-builtins.cc
(aarch64_expand_builtin_data_intrinsic): Handle NULL target.
gcc/testsuite/ChangeLog:
PR target/108140
* gcc.target/aarch64/acle/pr108140.c: New test.
(cherry picked from commit 98756bcbe27647f263f2b312d1d933d70cf56ba9)
|
|
As PR106736 shows, it's unexpected to use the __vector_quad and
__vector_pair types without MMA support; it would cause an ICE
when expanding the corresponding assignment. We can't guard the
registering of these built-in types under MMA support, as Peter
pointed out in that PR, because the registering is global and
doesn't work for target pragma/attribute support with MMA
enabled. The existing verify_type_context mentioned in [2]
can help to make the diagnostics for invalid built-in type uses
better, but as Richard pointed out in [4], it can't deal with
all cases. Following the discussions in [1][3], this patch
checks for invalid uses of the built-in types __vector_quad and
__vector_pair in the mov patterns for OOmode and XOmode, on the
gimple assignment statement currently being expanded. It
still puts an assertion in the else arm rather than just letting
it go through, to ensure we can catch any other possible
unexpected cases in time if there are any.
[1] https://gcc.gnu.org/pipermail/gcc/2022-December/240218.html
[2] https://gcc.gnu.org/pipermail/gcc/2022-December/240220.html
[3] https://gcc.gnu.org/pipermail/gcc/2022-December/240223.html
[4] https://gcc.gnu.org/pipermail/gcc-patches/2022-December/608083.html
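A sketch of the kind of assignment involved (hypothetical, in the
spirit of the pr106736 tests): without MMA enabled, expanding the copy
below goes through the movoo pattern and is now diagnosed instead of
ICEing.
  void
  foo (__vector_pair *dst, __vector_pair *src)
  {
    *dst = *src;
  }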
PR target/106736
gcc/ChangeLog:
* config/rs6000/mma.md (define_expand movoo): Call function
rs6000_opaque_type_invalid_use_p to check and emit error message for
the invalid use of opaque type.
(define_expand movxo): Likewise.
* config/rs6000/rs6000-protos.h
(rs6000_opaque_type_invalid_use_p): New function declaration.
(currently_expanding_gimple_stmt): New extern declaration.
* config/rs6000/rs6000.cc (rs6000_opaque_type_invalid_use_p): New
function.
gcc/testsuite/ChangeLog:
* gcc.target/powerpc/pr106736-1.c: New test.
* gcc.target/powerpc/pr106736-2.c: Likewise.
* gcc.target/powerpc/pr106736-3.c: Likewise.
* gcc.target/powerpc/pr106736-4.c: Likewise.
* gcc.target/powerpc/pr106736-5.c: Likewise.
|
|
Currently the patchable area is emitted at the wrong place on AArch64:
immediately after the function label, before .cfi_startproc. This
patch adds UNSPECV_PATCHABLE_AREA as a pseudo patchable-area
instruction and modifies aarch64_print_patchable_function_entry to
avoid placing the patchable area before .cfi_startproc.
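For reference, a sketch of a function with a patchable area (not one
of the new tests); with the patch, the area is emitted after
.cfi_startproc:
  __attribute__ ((patchable_function_entry (2)))
  void
  f (void)
  {
  }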
gcc/
PR target/98776
* config/aarch64/aarch64-protos.h (aarch64_output_patchable_area):
Declared.
* config/aarch64/aarch64.cc (aarch64_print_patchable_function_entry):
Emit an UNSPECV_PATCHABLE_AREA pseudo instruction.
(aarch64_output_patchable_area): New.
* config/aarch64/aarch64.md (UNSPECV_PATCHABLE_AREA): New.
(patchable_area): Define.
gcc/testsuite/
PR target/98776
* gcc.target/aarch64/pr98776.c: New.
* gcc.target/aarch64/pr92424-2.c: Adjust pattern.
* gcc.target/aarch64/pr92424-3.c: Adjust pattern.
|
|
Only with -ffp-contract=fast can we synthesize FMA operations like
vfmaddsub231ps, so properly guard the transform in SLP pattern
detection.
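A sketch of an addsub-style loop SLP may recognize (hypothetical; the
actual test is pr107647.c): without -ffp-contract=fast the multiplies
and the add/sub below must stay separate rather than fuse into
vfmaddsub*.
  void
  f (float *r, const float *a, const float *b, const float *c)
  {
    for (int i = 0; i < 1024; i += 2)
      {
        r[i + 0] = a[i + 0] * b[i + 0] - c[i + 0];
        r[i + 1] = a[i + 1] * b[i + 1] + c[i + 1];
      }
  }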
PR tree-optimization/107647
* tree-vect-slp-patterns.cc (addsub_pattern::recognize): Only
allow FMA generation with -ffp-contract=fast for FP types.
(complex_mul_pattern::matches): Likewise.
* gcc.target/i386/pr107647.c: New testcase.
(cherry picked from commit c5df8392c5848c0462558f41cdf6eab31db301cf)
|
|
Similar story as PR103661: we again return a negative number
for __builtin_cpu_supports.
The documentation says:
  int __builtin_cpu_supports (const char *feature)
  This function returns a positive integer if the run-time CPU
  supports feature and returns 0 otherwise,
while we return -2147483648.
Moreover, I noticed "x86-64" is not a valid argument for
__builtin_cpu_is, only for __builtin_cpu_supports.
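A sketch of the documented contract (the added checks live in
builtin_target.c):
  int
  have_x86_64 (void)
  {
    /* Documented to return a positive integer or 0, never a negative
       value such as -2147483648.  */
    return __builtin_cpu_supports ("x86-64");
  }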
PR target/107551
gcc/ChangeLog:
* config/i386/i386-builtins.cc (fold_builtin_cpu): Use same path
as for PR103661.
* doc/extend.texi: Fix "x86-64" use.
gcc/testsuite/ChangeLog:
* gcc.target/i386/builtin_target.c: Add more checks.
(cherry picked from commit d71b20fc30965ba8326ad9363d0aca9d61eb4ed3)
|
|
According to the architecture pseudocode, the FEAT_MOPS sequences
overwrite the NZCV flags as part of their operation, so GCC needs to
model that in the relevant RTL patterns.
For the testcase:
  void g();
  void foo (int a, size_t N, char *__restrict__ in,
            char *__restrict__ out)
  {
    if (a != 3)
      __builtin_memcpy (out, in, N);
    if (a > 3)
      g ();
  }
we will currently generate:
  foo:
          cmp     w0, 3
          bne     .L6
  .L1:
          ret
  .L6:
          cpyfp   [x3]!, [x2]!, x1!
          cpyfm   [x3]!, [x2]!, x1!
          cpyfe   [x3]!, [x2]!, x1!
          ble     .L1     // Flags reused after CPYF* sequence
          b       g
This is wrong, as the result of the cmp needs to be recalculated after the MOPS sequence.
With this patch we'll insert a "cmp w0, 3" before the ble, similar to what clang does.
Bootstrapped and tested on aarch64-none-linux-gnu.
Pushing to trunk and to the GCC 12 branch after some baking time.
gcc/ChangeLog:
* config/aarch64/aarch64.md (aarch64_cpymemdi): Specify clobber of CC reg.
(*aarch64_cpymemdi): Likewise.
(aarch64_movmemdi): Likewise.
(aarch64_setmemdi): Likewise.
(*aarch64_setmemdi): Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/mops_5.c: New test.
* gcc.target/aarch64/mops_6.c: Likewise.
* gcc.target/aarch64/mops_7.c: Likewise.
(cherry picked from commit cbdffae5745327b0e5eb887afc512daf34b049b1)
|
|
QImode.
For __builtin_ia32_vec_set_v16qi (a, -1, 2) with
!flag_signed_char, the call is transformed to
__builtin_ia32_vec_set_v16qi (_4, 255, 2) in the gimple
and expanded to (const_int 255) in the rtl. But immediate_operand
expects (const_int 255) to be sign extended to
(const_int -1). The mismatch caused an unrecognizable insn error.
The patch converts (const_int 255) to (const_int -1) in the backend
expander.
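The call from the description, as a compilable sketch (assumes e.g.
-msse4.1 for the builtin and -funsigned-char to get the 255 form):
  typedef char v16qi __attribute__ ((vector_size (16)));

  v16qi
  f (v16qi a)
  {
    return __builtin_ia32_vec_set_v16qi (a, -1, 2);
  }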
gcc/ChangeLog:
PR target/107863
* config/i386/i386-expand.cc (ix86_expand_vec_set_builtin):
Convert op1 to the target mode whenever the modes mismatch.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr107863.c: New test.
|
|
We used to expand atomic_exchange_n(ptr, new, mem_order) for subword types
into something like:
  {
    __typeof__(*ptr) t = atomic_load_n (ptr, mem_order);
    atomic_compare_exchange_n (ptr, &t, new, true, mem_order, mem_order);
    return t;
  }
It's incorrect because another thread may store a different value into *ptr
after atomic_load_n. Then atomic_compare_exchange_n will not store into
*ptr, but atomic_exchange_n should always perform the store.
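The affected operation, as a minimal compilable sketch (a subword
type; not one of the pr107713 tests):
  #include <stdint.h>

  uint8_t
  swap_byte (uint8_t *p, uint8_t v)
  {
    /* Must store v unconditionally, even if *p changes between the
       load and the CAS of the old expansion.  */
    return __atomic_exchange_n (p, v, __ATOMIC_SEQ_CST);
  }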
gcc/ChangeLog:
PR target/107713
* config/loongarch/sync.md
(atomic_cas_value_exchange_7_<mode>): New define_insn.
(atomic_exchange): Use atomic_cas_value_exchange_7_si instead of
atomic_cas_value_cmp_and_7_si.
gcc/testsuite/ChangeLog:
PR target/107713
* gcc.target/loongarch/pr107713-1.c: New test.
* gcc.target/loongarch/pr107713-2.c: New test.
(cherry picked from commit f0024bfb228f94e60e06dc32a4983e40a9b90be5)
|
|
e034c5c8957 re PR target/78643 (ICE in convert_move, at expr.c:230)
fixed the case where the DECL_MODE of a vector field is BLKmode and its
TYPE_MODE is a vector mode because of a target attribute. Remove the
BLKmode check for the case where the DECL_MODE of a vector field is a
vector mode and its TYPE_MODE isn't a vector mode because of a target
attribute.
gcc/
PR target/107304
* expr.cc (get_inner_reference): Always use TYPE_MODE for vector
field with vector raw mode.
gcc/testsuite/
PR target/107304
* gcc.target/i386/pr107304.c: New test.
(cherry picked from commit 1c64aba8cdf6509533f554ad86640f274cdbe37f)
|
|
Fixes: 341573406b39
Don't subtract one from the result of strnlen() when trying to point
to the first character after the current string. This issue would
cause individual characters (where the 128-byte buffers are stitched
together) to be lost.
gcc/ChangeLog:
* config/aarch64/driver-aarch64.cc (readline): Fix off-by-one.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/cpunative/info_18: New test.
* gcc.target/aarch64/cpunative/native_cpu_18.c: New test.
(cherry picked from commit b1cfbccc41de6aec950c0f662e7e85ab34bfff8a)
|
|
For a parameter with BLKmode we cannot use REG_NREGS in order to
determine the number of consecutive registers. Streamline this with
the implementation of s390_function_arg.
Fix some indentation whitespace, too.
gcc/ChangeLog:
PR target/106355
* config/s390/s390.cc (s390_call_saved_register_used): For a
parameter with BLKmode fix determining number of consecutive
registers.
gcc/testsuite/ChangeLog:
* gcc.target/s390/pr106355.h: Common code for new tests.
* gcc.target/s390/pr106355-1.c: New test.
* gcc.target/s390/pr106355-2.c: New test.
* gcc.target/s390/pr106355-3.c: New test.
(cherry picked from commit cb994acc08b67f26a54e7c5dc1f4995a2ce24d98)
|
|
Bit of a brown-paper-bag bug, but: GCC was generating
non-existent merging forms of BRKAS and BRKBS. Those
instructions only support zero predication (although
BRKA and BRKB support both).
gcc/
* config/aarch64/aarch64-sve.md (*aarch64_brk<brk_op>_cc): Remove
merging alternative.
(*aarch64_brk<brk_op>_ptest): Likewise.
gcc/testsuite/
* gcc.target/aarch64/sve/acle/general/brka_1.c: Expect a separate
PTEST instruction.
* gcc.target/aarch64/sve/acle/general/brkb_1.c: Likewise.
(cherry picked from commit 57675c7f92a3bd3ca8dae1faac7f2f51d40e0f9e)
|
|
Unlike other flag-setting SVE instructions, BRKNS sets the flags
based on an all-true governing predicate, rather than the GP operand.
gcc/
* config/aarch64/iterators.md (SVE_BRKP): New iterator.
* config/aarch64/aarch64-sve.md (*aarch64_brkn_cc): New pattern.
(*aarch64_brkn_ptest): Likewise.
(*aarch64_brk<brk_op>_cc): Restrict to SVE_BRKP.
(*aarch64_brk<brk_op>_ptest): Likewise.
gcc/testsuite/
* gcc.target/aarch64/sve/acle/general/brkn_1.c: Expect separate
PTEST instructions.
* gcc.target/aarch64/sve/acle/general/brkn_2.c: New test.
(cherry picked from commit 6bec66640597e2604f51fc1642c7d279164cd442)
|
|
https://github.com/ARM-software/acle/pull/199 adds a new feature
macro for RCPC, for use in things like inline assembly. This patch
adds the associated support to GCC.
Also, RCPC is required for Armv8.3-A and later, but the armv8.3-a
entry didn't include it. This was probably harmless in practice
since GCC simply ignored the extension until now. (The GAS
definition is OK.)
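A sketch of the intended use of the new macro:
  #ifdef __ARM_FEATURE_RCPC
  /* RCPC instructions such as LDAPR can be used, e.g. from inline
     assembly.  */
  #endif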
gcc/
* config/aarch64/aarch64.h (AARCH64_FL_FOR_ARCH8_3): Add
AARCH64_FL_RCPC.
(AARCH64_ISA_RCPC): New macro.
* config/aarch64/aarch64-cores.def (thunderx3t110, zeus, neoverse-v1)
(neoverse-512tvb, saphira): Remove RCPC from these Armv8.3-A+ cores.
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins): Define
__ARM_FEATURE_RCPC when appropriate.
gcc/testsuite/
* gcc.target/aarch64/pragma_cpp_predefs_1.c: Add RCPC tests.
|
|
As PR96072 shows, the code adding the REG_CFA_DEF_CFA reg note
assumes that we have previously emitted an insn which
restores the frame pointer. That part of the code was
guarded with the flag frame_pointer_needed before, which was
consistent, but it was replaced with the flag
frame_pointer_needed_indeed in commit r10-7981. This
caused an ICE due to an unexpected NULL insn.
PR target/96072
gcc/ChangeLog:
* config/rs6000/rs6000-logue.cc (rs6000_emit_epilogue): Update the
condition for adding REG_CFA_DEF_CFA reg note with
frame_pointer_needed_indeed.
gcc/testsuite/ChangeLog:
* gcc.target/powerpc/pr96072.c: New test.
(cherry picked from commit 5be0950d22209f5ba69d244387228e12389a8470)
|