|
Currently, for a signbit operation the instructions tc{f,d,x}b + ipm + srl
are emitted. If the source operand is a MEM, then a load precedes the
sequence. A faster implementation is to issue a load, either from a REG
or a MEM, into a GPR, followed by a shift.
In the spirit of the signbit function of the C standard, the signbit optab
only guarantees that the resulting value is nonzero if the sign bit is
set. The common code implementation computes a value where the sign bit
is stored in the most significant bit, i.e., all other bits are just
masked out, whereas the current implementation for s390 results in a
value where the sign bit is stored in the least significant bit.
Although there is no guarantee where the sign bit is stored, keep the
current behaviour and, therefore, implement the signbit optab manually.
Since z10, the instruction lgdr can be used effectively for a 64-bit
FPR-to-GPR load. However, there exists no 32-bit counterpart. Thus, for
target z10, make use of post-reload splitters which emit either a 64-bit
or a 32-bit load, depending on whether the source operand is a REG or a
MEM, and a corresponding 63-bit or 31-bit shift. We can do without a
post-reload splitter in the case of vector extensions, since there we also
have a 32-bit VR-to-GPR load via the instruction vlgvf.
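For reference, a minimal example of the kind of code this affects (signbit
here is the C standard macro that the optab implements; only the
nonzero-ness of the result is guaranteed):
#include <math.h>
int sign (double x)
{
  /* May be expanded via the signbit optab; any nonzero value
     indicates a set sign bit.  */
  return signbit (x);
}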
gcc/ChangeLog:
* config/s390/s390.md (signbit_tdc): Rename expander.
(signbit<mode>2): New expander.
(signbit<mode>2_z10): New expander.
gcc/testsuite/ChangeLog:
* gcc.target/s390/isfinite-isinf-isnormal-signbit-2.c: Adapt
scan assembler directives.
* gcc.target/s390/isfinite-isinf-isnormal-signbit-3.c: Ditto.
* gcc.target/s390/signbit-1.c: New test.
* gcc.target/s390/signbit-2.c: New test.
* gcc.target/s390/signbit-3.c: New test.
* gcc.target/s390/signbit-4.c: New test.
* gcc.target/s390/signbit-5.c: New test.
* gcc.target/s390/signbit.h: New test.
|
|
Moving between GPRs and VRs in any mode of size less than or equal to
8 bytes becomes available with vector extensions. Without adapting the
costs for those moves, we typically go through memory.
gcc/ChangeLog:
* config/s390/s390.cc (s390_register_move_cost): Add costing for
vlvg/vlgv.
|
|
Exploit the fact that the instruction VLGV zeros the excess bits of a GPR.
gcc/ChangeLog:
* config/s390/vector.md (bhfgq): Add scalar modes.
(*movdi<mode>_zero_extend_A): New insn.
(*movsi<mode>_zero_extend_A): New insn.
(*movdi<mode>_zero_extend_B): New insn.
(*movsi<mode>_zero_extend_B): New insn.
gcc/testsuite/ChangeLog:
* gcc.target/s390/vector/vlgv-zero-extend-1.c: New test.
|
|
[PR121064]
When TARGET_VECTORIZE_VEC_PERM_CONST is called, target may be the
same pseudo as op0 and/or op1. Loading the selector into target
would clobber the input, producing wrong code like
vld $vr0, $t0
vshuf.w $vr0, $vr0, $vr1
So don't load the selector into d->target; use a new pseudo to hold the
selector instead. The reload pass will load the pseudo for the selector and
the pseudo for the target into the same hard register anyway (following our
constraint '0' on the shuf instructions).
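A minimal sketch of the idea (variable names hypothetical; the actual patch
uses the '@'-generated mode-aware gen helper):
/* Use a fresh pseudo for the selector so loading it cannot clobber
   d->target, which may alias d->op0 or d->op1.  */
rtx sel = gen_reg_rtx (sel_mode);
emit_move_insn (sel, sel_const);
gen_reg_rtx and emit_move_insn are the usual middle-end helpers; reload
later ties the selector and target pseudos via the '0' constraint.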
gcc/ChangeLog:
PR target/121064
* config/loongarch/lsx.md (lsx_vshuf_<lsxfmt_f>): Add '@' to
generate a mode-aware helper. Use <VIMODE> as the mode of the
operand 1 (selector).
* config/loongarch/lasx.md (lasx_xvshuf_<lasxfmt_f>): Likewise.
* config/loongarch/loongarch.cc
(loongarch_try_expand_lsx_vshuf_const): Create a new pseudo for
the selector. Use the mode-aware helper to simplify the code.
(loongarch_expand_vec_perm_const): Likewise.
gcc/testsuite/ChangeLog:
PR target/121064
* gcc.target/loongarch/pr121064.c: New test.
|
|
For MMX 16-bit, 32-bit and 64-bit constant vector loads from the constant
vector pool:
(insn 6 2 7 2 (set (reg:V1SI 5 di)
(mem/u/c:V1SI (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S4 A32])) "pr121062-2.c":10:3 2036 {*movv1si_internal}
(expr_list:REG_EQUAL (const_vector:V1SI [
(const_int -1 [0xffffffffffffffff])
])
(nil)))
we can convert it to
(insn 12 2 7 2 (set (reg:SI 5 di)
(const_int -1 [0xffffffffffffffff])) "pr121062-2.c":10:3 100 {*movsi_internal}
(nil))
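The kind of source presumably involved (a hypothetical reduction; the actual
testcase is pr121062-2.c):
typedef int v1si __attribute__ ((vector_size (4)));
v1si
foo (void)
{
  /* The V1SI all-ones constant can be loaded as the SImode
     immediate -1 instead of from the constant pool.  */
  return (v1si) { -1 };
}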
Co-Developed-by: H.J. Lu <hjl.tools@gmail.com>
gcc/
PR target/121062
* config/i386/i386.cc (ix86_convert_const_vector_to_integer):
Handle E_V1SImode and E_V1DImode.
* config/i386/mmx.md (V_16_32_64): Add V1SI, V2BF and V1DI.
(mmxinsnmode): Add V1DI and V1SI.
Add V_16_32_64 splitter for constant vector loads from constant
vector pool.
(V_16_32_64:*mov<mode>_imm): Moved after V_16_32_64 splitter.
Replace lowpart_subreg with adjust_address.
gcc/testsuite/
PR target/121062
* gcc.target/i386/pr121062-1.c: New test.
* gcc.target/i386/pr121062-2.c: Likewise.
* gcc.target/i386/pr121062-3a.c: Likewise.
* gcc.target/i386/pr121062-3b.c: Likewise.
* gcc.target/i386/pr121062-3c.c: Likewise.
* gcc.target/i386/pr121062-4.c: Likewise.
* gcc.target/i386/pr121062-5.c: Likewise.
* gcc.target/i386/pr121062-6.c: Likewise.
* gcc.target/i386/pr121062-7.c: Likewise.
|
|
Since only glibc targets support -mfentry, warn about -pg without -mfentry
only on glibc targets.
gcc/
PR target/120881
PR testsuite/121078
* config/i386/i386-options.cc (ix86_option_override_internal):
Warn -pg without -mfentry only on glibc targets.
gcc/testsuite/
PR target/120881
PR testsuite/121078
* gcc.dg/20021014-1.c (dg-additional-options): Add -mfentry
-fno-pic only on gnu/x86 targets.
* gcc.dg/aru-2.c (dg-additional-options): Likewise.
* gcc.dg/nest.c (dg-additional-options): Likewise.
* gcc.dg/pr32450.c (dg-additional-options): Likewise.
* gcc.dg/pr43643.c (dg-additional-options): Likewise.
* gcc.target/i386/pr104447.c (dg-additional-options): Likewise.
* gcc.target/i386/pr113122-3.c (dg-additional-options): Likewise.
* gcc.target/i386/pr119386-1.c (dg-additional-options): Add
-mfentry only on gnu targets.
* gcc.target/i386/pr119386-2.c (dg-additional-options): Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
No functional changes.
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_move):
Use MEM_P predicate instead of open coding it.
(ix86_erase_embedded_rounding):
Use NONJUMP_INSN_P predicate instead of open coding it.
* config/i386/i386-features.cc (convertible_comparison_p):
Use REG_P predicate instead of open coding it.
* config/i386/i386.cc (ix86_rtx_costs):
Use SUBREG_P predicate instead of open coding it.
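For reference, the pattern of this and the following two cleanups (MEM_P
and friends are the standard rtl.h predicate macros, e.g. MEM_P (x) is
defined as GET_CODE (x) == MEM):
if (GET_CODE (op) == MEM)   /* before: open-coded check */
if (MEM_P (op))             /* after: rtl.h predicate */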
|
|
No functional changes.
gcc/ChangeLog:
* config/i386/i386.cc (symbolic_reference_mentioned_p):
Use LABEL_REF_P predicate instead of open coding it.
(ix86_legitimate_constant_p): Ditto.
(legitimate_pic_address_disp_p): Ditto.
(ix86_legitimate_address_p): Ditto.
(legitimize_pic_address): Ditto.
(ix86_print_operand): Ditto.
(ix86_print_operand_address_as): Ditto.
(ix86_rip_relative_addr_p): Ditto.
* config/i386/i386.h (SYMBOLIC_CONST): Ditto.
* config/i386/i386.md (*anddi_1 to *andsi_1_zext splitter): Ditto.
* config/i386/predicates.md (symbolic_operand): Ditto.
(local_symbolic_operand): Ditto.
(vsib_address_operand): Ditto.
|
|
No functional changes.
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_move):
Use SYMBOL_REF_P predicate instead of open coding it.
(ix86_split_long_move): Ditto.
(construct_plt_address): Ditto.
(ix86_expand_call): Ditto.
(ix86_notrack_prefixed_insn_p): Ditto.
* config/i386/i386-features.cc
(rest_of_insert_endbr_and_patchable_area): Ditto.
* config/i386/i386.cc (symbolic_reference_mentioned_p): Ditto.
(ix86_force_load_from_GOT_p): Ditto.
(ix86_legitimate_constant_p): Ditto.
(legitimate_pic_operand_p): Ditto.
(legitimate_pic_address_disp_p): Ditto.
(ix86_legitimate_address_p): Ditto.
(legitimize_pic_address): Ditto.
(ix86_legitimize_address): Ditto.
(ix86_delegitimize_tls_address): Ditto.
(ix86_print_operand): Ditto.
(ix86_print_operand_address_as): Ditto.
(ix86_rip_relative_addr_p): Ditto.
(symbolic_base_address_p): Ditto.
* config/i386/i386.h (SYMBOLIC_CONST): Ditto.
* config/i386/i386.md (*anddi_1 to *andsi_1_zext splitter): Ditto.
* config/i386/predicates.md (symbolic_operand): Ditto.
(local_symbolic_operand): Ditto.
(local_func_symbolic_operand): Ditto.
|
|
No functional changes.
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_vector_logical_operator):
Use CONST_VECTOR_P instead of open coding it.
(ix86_expand_int_sse_cmp): Ditto.
(ix86_extract_perm_from_pool_constant): Ditto.
(ix86_split_to_parts): Ditto.
(const_vector_equal_evenodd_p): Ditto.
* config/i386/i386.cc (ix86_print_operand): Ditto.
* config/i386/predicates.md (zero_extended_scalar_load_operand): Ditto.
(float_vector_all_ones_operand): Ditto.
* config/i386/sse.md (avx512vl_vextractf128<mode>): Ditto.
|
|
The patterns did not accept inline immediate constants, even though the
hardware instructions do, which has led to some errors in some patches I'm
working on.
Also, the VCC update RTL was using the wrong operands in the wrong places. This
appears to have been harmless(?), but is definitely not intended.
gcc/ChangeLog:
* config/gcn/gcn-valu.md (add<mode>3_vcc_dup<exec_vcc>): Change
operand 2 to allow gcn_alu_operand. Swap the operands in the VCC
update RTL.
(add<mode>3_vcc_zext_dup): Likewise.
(add<mode>3_vcc_zext_dup_exec): Likewise.
(add<mode>3_vcc_zext_dup2): Likewise.
(add<mode>3_vcc_zext_dup2_exec): Likewise.
|
|
Add a fold at gimple_fold_builtin to prefer the highpart variant of a builtin
if at least one argument is a vector highpart and all others are VECTOR_CSTs
that we can extend to 128 bits.
For example, we prefer to duplicate f0 and use UMULL2 here over DUP+UMULL:
uint16x8_t
foo (const uint8x16_t s)
{
const uint8x8_t f0 = vdup_n_u8 (4);
return vmull_u8 (vget_high_u8 (s), f0);
}
gcc/ChangeLog:
PR target/117850
* config/aarch64/aarch64-builtins.cc (LO_HI_PAIRINGS): New, group the
lo/hi pairs from aarch64-builtin-pairs.def.
(aarch64_get_highpart_builtin): New function.
(aarch64_v128_highpart_ref): New function, helper to look for vector
highparts.
(aarch64_build_vector_cst): New function, helper to build duplicated
VECTOR_CSTs.
(aarch64_fold_lo_call_to_hi): New function.
(aarch64_general_gimple_fold_builtin): Add cases for the lo builtins
in aarch64-builtin-pairs.def.
* config/aarch64/aarch64-builtin-pairs.def: New file, declare the
pairs of lowpart-operating and highpart-operating builtins.
gcc/testsuite/ChangeLog:
PR target/117850
* gcc.target/aarch64/simd/vabal_combine.c: Removed. This is
covered by fold_to_highpart_1.c.
* gcc.target/aarch64/simd/fold_to_highpart_1.c: New test.
* gcc.target/aarch64/simd/fold_to_highpart_2.c: Likewise.
* gcc.target/aarch64/simd/fold_to_highpart_3.c: Likewise.
* gcc.target/aarch64/simd/fold_to_highpart_4.c: Likewise.
* gcc.target/aarch64/simd/fold_to_highpart_5.c: Likewise.
* gcc.target/aarch64/simd/fold_to_highpart_6.c: Likewise.
|
|
In PR120297 we fuse
vsetvl e8,mf2,...
vsetvl e64,m1,...
into
vsetvl e64,m4,...
Individually that's OK, but we also change the new vsetvl's demand to
"SEW only", even though the first original one demanded SEW >= 8 and
ratio = 16 (the ratio being SEW/LMUL). As we forget the ratio after the
merge, we find that the vsetvl following the merged one demands
ratio = 64, and we fuse into
vsetvl e64,m1,...
which obviously doesn't have ratio = 16 any more.
Regtested on rv64gcv_zvl512b.
PR target/120297
gcc/ChangeLog:
* config/riscv/riscv-vsetvl.def: Do not forget ratio demand of
previous vsetvl.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/pr120297.c: New test.
|
|
SVE2 BSL2N (x, y, z) = (x & z) | (~y & ~z). When x == y this computes
(x & z) | (~x & ~z), which is ~(x ^ z).
Thus, we can use it to match the RTL pattern (not (xor (...) (...))) for both
Advanced SIMD and SVE modes when TARGET_SVE2.
This patch does that.
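The EON helper used in the examples below is presumably defined along
these lines:
/* EON: exclusive-OR NOT, i.e. XNOR.  */
#define EON(a, b) (~((a) ^ (b)))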
For code like:
uint64x2_t eon_q(uint64x2_t a, uint64x2_t b) { return EON(a, b); }
svuint64_t eon_z(svuint64_t a, svuint64_t b) { return EON(a, b); }
We now generate:
eon_q:
bsl2n z0.d, z0.d, z0.d, z1.d
ret
eon_z:
bsl2n z0.d, z0.d, z0.d, z1.d
ret
instead of the previous:
eon_q:
eor v0.16b, v0.16b, v1.16b
not v0.16b, v0.16b
ret
eon_z:
eor z0.d, z0.d, z1.d
ptrue p3.b, all
not z0.d, p3/m, z0.d
ret
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
* config/aarch64/aarch64-sve2.md (*aarch64_sve2_bsl2n_eon<mode>):
New pattern.
(*aarch64_sve2_eon_bsl2n_unpred<mode>): Likewise.
gcc/testsuite/
* gcc.target/aarch64/sve2/eon_bsl2n.c: New test.
|
|
We already have patterns to use the NBSL instruction to implement vector
NOR and NAND operations for SVE types and modes. It is straightforward to
have similar patterns for the fixed-width Advanced SIMD modes as well, though
it requires combine patterns without the predicate operand and an explicit 'Z'
output modifier. This patch does so.
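The NAND and NOR helpers in the examples are presumably:
#define NAND(a, b) (~((a) & (b)))
#define NOR(a, b)  (~((a) | (b)))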
So now for example we generate for:
uint64x2_t nand_q(uint64x2_t a, uint64x2_t b) { return NAND(a, b); }
uint64x2_t nor_q(uint64x2_t a, uint64x2_t b) { return NOR(a, b); }
nand_q:
nbsl z0.d, z0.d, z1.d, z1.d
ret
nor_q:
nbsl z0.d, z0.d, z1.d, z0.d
ret
instead of the previous:
nand_q:
and v0.16b, v0.16b, v1.16b
not v0.16b, v0.16b
ret
nor_q:
orr v0.16b, v0.16b, v1.16b
not v0.16b, v0.16b
ret
The tied operand requirements for NBSL mean that we can generate the MOVPRFX
when the operands fall that way, but I guess having a 2-insn MOVPRFX form is
not worse than the current 2-insn codegen at least, and the MOVPRFX can be
fused by many cores.
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
* config/aarch64/aarch64-sve2.md (*aarch64_sve2_unpred_nor<mode>):
New define_insn.
(*aarch64_sve2_nand_unpred<mode>): Likewise.
gcc/testsuite/
* gcc.target/aarch64/sve2/nbsl_nor_nand_neon.c: New test.
|
|
The avg3_floor pattern leverages the add and shift RTL with the
DOUBLE_TRUNC mode iterator. That is, the RVVDImode iterator will
generate avg3rvvsimode_floor, so only the element sizes QI, HI and SI
are allowed. Thus, this patch adds support for DImode via the standard
name, with the iterator V_VLSI_D.
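For reference, the floor-average semantics for 64-bit elements, computed in
a wider type to avoid intermediate overflow (a sketch matching the new
i64-from-i128 tests; __int128 requires xlen == 64):
#include <stdint.h>
int64_t
avg_floor (int64_t a, int64_t b)
{
  /* Widen, add, then floor-shift back down.  */
  return (int64_t) (((__int128) a + b) >> 1);
}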
The below test suites are passed for this patch series:
* The rv64gcv full regression test.
gcc/ChangeLog:
* config/riscv/autovec.md (avg<mode>3_floor): Add new
pattern of avg3_floor for rvv DImode.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/avg.h: Add int128 type when
xlen == 64.
* gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i16-from-i32.c:
Suppress __int128 warning for run test.
* gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i16-from-i64.c: Ditto.
* gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i32-from-i64.c: Ditto.
* gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i8-from-i16.c: Ditto.
* gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i8-from-i32.c: Ditto.
* gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i8-from-i64.c: Ditto.
* gcc.target/riscv/rvv/autovec/avg_data.h: Fix one piece of
incorrect test data.
* gcc.target/riscv/rvv/autovec/avg_floor-run-1-i16-from-i32.c: Ditto.
* gcc.target/riscv/rvv/autovec/avg_floor-run-1-i16-from-i64.c: Ditto.
* gcc.target/riscv/rvv/autovec/avg_floor-run-1-i32-from-i64.c: Ditto.
* gcc.target/riscv/rvv/autovec/avg_floor-run-1-i8-from-i16.c: Ditto.
* gcc.target/riscv/rvv/autovec/avg_floor-run-1-i8-from-i32.c: Ditto.
* gcc.target/riscv/rvv/autovec/avg_floor-run-1-i8-from-i64.c: Ditto.
* gcc.target/riscv/rvv/autovec/avg_floor-1-i64-from-i128.c: New test.
* gcc.target/riscv/rvv/autovec/avg_floor-run-1-i64-from-i128.c: New test.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
Updated the test for rv32 accordingly; no regressions found for runs like
"runtest --tool gcc --target_board='riscv-sim/-march=rv32gc_zba_zbb_zbc_zbs/-mabi=ilp32d/-mcmodel=medlow' riscv.exp" and
"runtest --tool gcc --target_board='riscv-sim/-march=rv64gc_zba_zbb_zbc_zbs/-mabi=lp64d/-mcmodel=medlow' riscv.exp".
Lint warnings can be ignored for riscv-cores.def and riscv-ext-mips.def.
gcc/ChangeLog:
* config/riscv/riscv-cores.def (RISCV_CORE): Updated the supported march.
* config/riscv/riscv-ext-mips.def (DEFINE_RISCV_EXT):
New file added for the MIPS conditional move extension.
* config/riscv/riscv-ext.def: Likewise.
* config/riscv/t-riscv: Generate riscv-ext.opt.
* config/riscv/riscv-ext.opt: Generated file.
* config/riscv/riscv.cc (riscv_expand_conditional_move): Updated for MIPS
CCMOV and outlined some code that handles the arch conditional move.
* config/riscv/riscv.md (mov<mode>cc): Updated expand for MIPS CCMOV.
* config/riscv/mips-insn.md: New file for the mips-p8700 CCMOV insn.
* doc/riscv-ext.texi: Updated for MIPS CCMOV.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/mipscondmov.c: Test file for the MIPS CCMOV insn.
|
|
This patch adds the ability to fold the address computation into the addressing
mode for LDAPR instructions using LDAPUR when RCPC2 is available.
LDAPUR emission is enabled by default when RCPC2 is available, but can be
disabled using the avoid_ldapur tune flag on a per-core basis.
Currently, it is disabled for neoverse-v2, neoverse-v3, cortex-x925, and
architectures before armv8.8-a.
Earlier, the following code:
uint64_t
foo (std::atomic<uint64_t> *x)
{
return x[1].load(std::memory_order_acquire);
}
would generate:
foo(std::atomic<unsigned long>*):
add x0, x0, 8
ldapr x0, [x0]
ret
but now generates:
foo(std::atomic<unsigned long>*):
ldapur x0, [x0, 8]
ret
The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?
Signed-off-by: Soumya AR <soumyaa@nvidia.com>
gcc/ChangeLog:
* config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
Add AVOID_LDAPUR tuning flag.
* config/aarch64/aarch64.cc (aarch64_adjust_generic_arch_tuning):
Set AVOID_LDAPUR for architectures before armv8.8-a.
(aarch64_override_options_internal): Apply generic tuning adjustments
to generic_armv8_a_tunings and generic_armv9_a_tunings.
* config/aarch64/aarch64.h (TARGET_ENABLE_LDAPUR): New macro to
control LDAPUR usage based on RCPC2 and tuning flags.
* config/aarch64/aarch64.md: Add enable_ldapur attribute.
* config/aarch64/atomics.md (aarch64_atomic_load<mode>_rcpc): Modify
to emit LDAPUR for cores with RCPC2.
(*aarch64_atomic_load<ALLX:mode>_rcpc_zext): Likewise.
(*aarch64_atomic_load<ALLX:mode>_rcpc_sext): Update constraint to Ust.
* config/aarch64/tuning_models/cortexx925.h: Add AVOID_LDAPUR flag.
* config/aarch64/tuning_models/neoversev2.h: Likewise.
* config/aarch64/tuning_models/neoversev3.h: Likewise.
* config/aarch64/tuning_models/neoversev3ae.h: Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/ldapr-sext.c: Update expected output to include
offsets.
* gcc.target/aarch64/ldapur.c: New test for LDAPUR.
* gcc.target/aarch64/ldapur_avoid.c: New test for AVOID_LDAPUR flag.
|
|
Fixup to the SME2+FAMINMAX intrinsics commit.
gcc/ChangeLog:
* config/aarch64/aarch64-sme.md (@aarch64_sme_<faminmax_uns_op><mode>):
Change gating and comment.
|
|
This reverts commit cfa827188dc236ba905b12ef06ccc517b9f2de39.
|
|
This patch extends the splitting patterns for combining FP comparisons
with predicated logical operations such that they cover all of SVE_F.
gcc/ChangeLog:
* config/aarch64/aarch64-sve.md (*fcm<cmp_op><mode>_and_combine):
Extend from SVE_FULL_F to SVE_F.
(*fcmuo<mode>_and_combine): Likewise.
(*fcm<cmp_op><mode>_bic_combine): Likewise.
(*fcm<cmp_op><mode>_nor_combine): Likewise.
(*fcmuo<mode>_bic_combine): Likewise.
(*fcmuo<mode>_nor_combine): Likewise. Move the comment here to
above fcmuo<mode>_bic_combine, since it applies to both patterns.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/sve/unpacked_fcm_combines_1.c: New test.
* gcc.target/aarch64/sve/unpacked_fcm_combines_2.c: Likewise.
|
|
I suppose this pattern doesn't get used much! The unsigned compare was meant to
be defined using the signed compare pattern, but actually ended up trying to
recursively call itself. This patch fixes the issue in the obvious way.
gcc/ChangeLog:
* config/gcn/gcn-valu.md (vec_cmpu<mode>di_exec): Call gen_vec_cmp*,
not gen_vec_cmpu*.
|
|
Implementation and tests for the standard reduction optabs.
Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>
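The reduc_plus optab, for instance, lets the vectorizer reduce a whole
vector to a scalar for loops such as (illustrative):
int
sum (int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += a[i];  /* final vector-to-scalar step, e.g. reduc_plus_scal_v4si */
  return s;
}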
gcc/ChangeLog:
* config/s390/vector.md (reduc_plus_scal_<mode>): Implement.
(reduc_plus_scal_v2df): Implement.
(reduc_plus_scal_v4sf): Implement.
(REDUC_FMINMAX): New int iterator.
(reduc_fminmax_name): New int attribute.
(reduc_minmax): New code iterator.
(reduc_minmax_name): New code attribute.
(reduc_<reduc_fminmax_name>_scal_v2df): Implement.
(reduc_<reduc_fminmax_name>_scal_v4sf): Implement.
(reduc_<reduc_minmax_name>_scal_v2df): Implement.
(reduc_<reduc_minmax_name>_scal_v4sf): Implement.
(REDUCBIN): New code iterator.
(reduc_bin_insn): New code attribute.
(reduc_<reduc_bin_insn>_scal_v2di): Implement.
(reduc_<reduc_bin_insn>_scal_v4si): Implement.
(reduc_<reduc_bin_insn>_scal_v8hi): Implement.
(reduc_<reduc_bin_insn>_scal_v16qi): Implement.
gcc/testsuite/ChangeLog:
* lib/target-supports.exp: Add s390 to vect_logical_reduc targets.
* gcc.target/s390/vector/reduc-binops-1.c: New test.
* gcc.target/s390/vector/reduc-minmax-1.c: New test.
* gcc.target/s390/vector/reduc-plus-1.c: New test.
|
|
The default setting of s390 for the parameter min-vect-loop-bound was
set to 2 to prevent certain epilogue loop vectorizations in the past.
Reevaluation of this parameter shows that this setting is no longer
needed and is sometimes even harmful. Remove the overwrite to
align s390 with other backends.
Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>
gcc/ChangeLog:
* config/s390/s390.cc (s390_option_override_internal): Remove override.
|
|
This is a hold-over from GCN3 where v_add always wrote to the condition
register, whether you wanted it or not. This hasn't been true since GCN5, and
we dropped support for GCN3 a little while ago, so let's fix it.
There was actually a latent bug here because some other post-reload splitters
were generating v_add instructions without declaring the VCC clobber (at least
mul did this), so this should fix some wrong-code bugs also.
gcc/ChangeLog:
* config/gcn/gcn-valu.md (add<mode>3<exec_clobber>): Rename ...
(add<mode>3<exec>): ... to this, remove the clobber, and change the
instruction from v_add_co_u32 to v_add_u32.
(add<mode>3_dup<exec_clobber>): Rename ...
(add<mode>3_dup<exec>): ... to this, and likewise.
(sub<mode>3<exec_clobber>): Rename ...
(sub<mode>3<exec>): ... to this, and likewise
* config/gcn/gcn.md (addsi3): Remove the DI clobber, and change the
instruction from v_add_co_u32 to v_add_u32.
(addsi3_scc): Likewise.
(subsi3): Likewise, but for v_sub_co_u32.
(muldi3): Likewise.
|
|
commit 77473a27bae04da99d6979d43e7bd0a8106f4557
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Jun 26 06:08:51 2025 +0800
x86: Also handle all 1s float vector constant
replaces
(insn 29 28 30 5 (set (reg:V2SF 107)
(mem/u/c:V2SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S8 A64])) 2031 {*movv2sf_internal}
(expr_list:REG_EQUAL (const_vector:V2SF [
(const_double:SF -QNaN [-QNaN]) repeated x2
])
(nil)))
with
(insn 98 13 14 3 (set (reg:V8QI 112)
(const_vector:V8QI [
(const_int -1 [0xffffffffffffffff]) repeated x8
])) -1
(nil))
...
(insn 29 28 30 5 (set (reg:V2SF 107)
(subreg:V2SF (reg:V8QI 112) 0)) 2031 {*movv2sf_internal}
(expr_list:REG_EQUAL (const_vector:V2SF [
(const_double:SF -QNaN [-QNaN]) repeated x2
])
(nil)))
which leads to
pr121015.c: In function ‘render_result_from_bake_h’:
pr121015.c:34:1: error: unrecognizable insn:
34 | }
| ^
(insn 98 13 14 3 (set (reg:V8QI 112)
(const_vector:V8QI [
(const_int -1 [0xffffffffffffffff]) repeated x8
])) -1
(expr_list:REG_EQUIV (const_vector:V8QI [
(const_int -1 [0xffffffffffffffff]) repeated x8
])
(nil)))
during RTL pass: ira
Check all 0s/1s vectors with standard_sse_constant_p to avoid unsupported
all 1s vectors.
Co-Developed-by: H.J. Lu <hjl.tools@gmail.com>
gcc/
PR target/121015
* config/i386/i386-features.cc (ix86_broadcast_inner): Check all
0s/1s vectors with standard_sse_constant_p.
gcc/testsuite/
PR target/121015
* gcc.target/i386/pr121015.c: New test.
|
|
When profiling is enabled with shrink wrapping, the mcount call may not
be placed at the function entry after
pushq %rbp
movq %rsp,%rbp
As a result, the profile data may be skewed, which makes PGO less
effective.
Add --enable-x86-64-mfentry to enable -mfentry by default to use
__fentry__, added to glibc in 2010 by:
commit d22e4cc9397ed41534c9422d0b0ffef8c77bfa53
Author: Andi Kleen <ak@linux.intel.com>
Date: Sat Aug 7 21:24:05 2010 -0700
x86: Add support for frame pointer less mcount
instead of mcount, which is placed before the prologue, so that -pg can
be used with -fshrink-wrap-separate enabled at -O1. This option is
64-bit only because __fentry__ doesn't support PIC in 32-bit mode. The
default is to enable -mfentry when targeting glibc.
Also warn about -pg without -mfentry when shrink wrapping is enabled. The
warning is disabled for PIC in 32-bit mode.
gcc/
PR target/120881
* config.in: Regenerated.
* configure: Likewise.
* configure.ac: Add --enable-x86-64-mfentry.
* config/i386/i386-options.cc (ix86_option_override_internal):
Enable __fentry__ in 64-bit mode if ENABLE_X86_64_MFENTRY is set
to 1. Warn -pg without -mfentry with shrink wrapping enabled.
* doc/install.texi: Document --enable-x86-64-mfentry.
gcc/testsuite/
PR target/120881
* gcc.dg/20021014-1.c: Add additional -mfentry -fno-pic options
for x86.
* gcc.dg/aru-2.c: Likewise.
* gcc.dg/nest.c: Likewise.
* gcc.dg/pr32450.c: Likewise.
* gcc.dg/pr43643.c: Likewise.
* gcc.target/i386/pr104447.c: Likewise.
* gcc.target/i386/pr113122-3.c: Likewise.
* gcc.target/i386/pr119386-1.c: Add additional -mfentry if not
ia32.
* gcc.target/i386/pr119386-2.c: Likewise.
* gcc.target/i386/pr120881-1a.c: New test.
* gcc.target/i386/pr120881-1b.c: Likewise.
* gcc.target/i386/pr120881-1c.c: Likewise.
* gcc.target/i386/pr120881-1d.c: Likewise.
* gcc.target/i386/pr120881-2a.c: Likewise.
* gcc.target/i386/pr120881-2b.c: Likewise.
* gcc.target/i386/pr82699-1.c: Add additional -mfentry.
* lib/target-supports.exp (check_effective_target_fentry): New.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
darwin25 will be named macOS 26 (codename Tahoe). This is a change from
darwin24, which was macOS 15. We need to adapt the driver to this new
numbering scheme.
2025-07-14 François-Xavier Coudert <fxcoudert@gcc.gnu.org>
gcc/ChangeLog:
PR target/120645
* config/darwin-driver.cc: Account for latest macOS numbering
scheme.
gcc/testsuite/ChangeLog:
* gcc.dg/darwin-minversion-link.c: Account for macOS 26.
|
|
[PR119100]
This pattern enables the combine pass (or late-combine, depending on the case)
to merge a float_extend'ed vec_duplicate into a plus-mult or minus-mult RTL
instruction.
Before this patch, we have three instructions, e.g.:
fcvt.s.h fa5,fa5
vfmv.v.f v24,fa5
vfmadd.vv v8,v24,v16
After, we get only one:
vfwmacc.vf v8,fa5,v16
PR target/119100
gcc/ChangeLog:
* config/riscv/autovec-opt.md (*vfwmacc_vf_<mode>): New pattern to
handle both vfwmacc and vfwmsac.
(*extend_vf_<mode>): New pattern that serves as an intermediate combine
step.
* config/riscv/vector-iterators.md (vsubel): New mode attribute. This is
just the lower-case version of VSUBEL.
* config/riscv/vector.md (@pred_widen_mul_<optab><mode>_scalar): Reorder
and swap operands to match the RTL emitted by expand, i.e. first
float_extend then vec_duplicate.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfwmacc and
vfwmsac.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise. Also check
for fcvt and vfmv.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Add vfwmacc and
vfwmsac.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise. Also check
for fcvt and vfmv.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop.h: Add support for
widening variants.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop_widen_run.h: New test
helper.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmacc-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmacc-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmsac-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmsac-run-1-f32.c: New test.
|
|
Implements the sme2+faminmax svamin and svamax intrinsics.
gcc/ChangeLog:
* config/aarch64/aarch64-sme.md (@aarch64_sme_<faminmax_uns_op><mode>):
New patterns.
* config/aarch64/aarch64-sve-builtins-sme.def (svamin): New intrinsics.
(svamax): New intrinsics.
* config/aarch64/aarch64-sve-builtins-sve2.cc (class faminmaximpl): New
class.
(svamin): New function.
(svamax): New function.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/sme2/acle-asm/amax_f16_x2.c: New test.
* gcc.target/aarch64/sme2/acle-asm/amax_f16_x4.c: New test.
* gcc.target/aarch64/sme2/acle-asm/amax_f32_x2.c: New test.
* gcc.target/aarch64/sme2/acle-asm/amax_f32_x4.c: New test.
* gcc.target/aarch64/sme2/acle-asm/amax_f64_x2.c: New test.
* gcc.target/aarch64/sme2/acle-asm/amax_f64_x4.c: New test.
* gcc.target/aarch64/sme2/acle-asm/amin_f16_x2.c: New test.
* gcc.target/aarch64/sme2/acle-asm/amin_f16_x4.c: New test.
* gcc.target/aarch64/sme2/acle-asm/amin_f32_x2.c: New test.
* gcc.target/aarch64/sme2/acle-asm/amin_f32_x4.c: New test.
* gcc.target/aarch64/sme2/acle-asm/amin_f64_x2.c: New test.
* gcc.target/aarch64/sme2/acle-asm/amin_f64_x4.c: New test.
|
|
According to the July 2025 SDM, Key Locker will no longer be supported on
hardware from 2025 onwards. This means that for Panther Lake and Clearwater
Forest the feature will not be enabled. Remove it from those two platforms.
gcc/ChangeLog:
* config/i386/i386.h (PTA_PANTHERLAKE): Remove KL and WIDEKL.
(PTA_CLEARWATERFOREST): Ditto.
* doc/invoke.texi: Revise documentation.
|
|
MMX allows only direct moves from zero, so correct the V_32:mode and v2qi
move patterns to allow only nonimm_or_0_operand as their input operand.
gcc/ChangeLog:
* config/i386/mmx.md (mov<V_32:mode>):
Use nonimm_or_0_operand predicate for operand 1.
(*mov<V_32:mode>_internal): Ditto.
(movv2qi): Ditto.
(*movv2qi_internal): Ditto. Use ix86_hardreg_mov_ok
in insn condition.
|
|
This PR is partly about a code quality regression that was triggered
by g:caa7a99a052929d5970677c5b639e1fa5166e334. That patch taught the
gimple optimisers to fold two VEC_PERM_EXPRs into one, conditional
upon either (a) the original permutations not being "native" operations
or (b) the combined permutation being a "native" operation.
Whether something is a "native" operation is tested by calling
can_vec_perm_const_p with allow_variable_p set to false. This requires
the permutation to be supported directly by TARGET_VECTORIZE_VEC_PERM_CONST,
rather than falling back to the general vec_perm optab.
This exposed a problem with the way that we handled general 2-input
permutations for SVE. Unlike Advanced SIMD, base SVE does not have
an instruction to do general 2-input permutations. We do still implement
the vec_perm optab for SVE, but only when the vector length is known at
compile time. The general expansion is pretty expensive: an AND, a SUB,
two TBLs, and an ORR. It certainly couldn't be considered a "native"
operation.
However, if a VEC_PERM_EXPR has a constant selector, the indices can
be wider than the elements being permuted. This is not true for the
vec_perm optab, where the indices and permuted elements must have the
same precision.
This leads to one case where we cannot leave a general 2-input permutation
to be handled by the vec_perm optab: when permuting bytes on a target
with 2048-bit vectors. In that case, the indices of the elements in
the second vector are in the range [256, 511], which cannot be stored
in a byte index.
TARGET_VECTORIZE_VEC_PERM_CONST therefore has to handle 2-input SVE
permutations for one specific case. Rather than check for that
specific case, the code went ahead and used the vec_perm expansion
whenever it worked. But that undermines the !allow_variable_p
handling in can_vec_perm_const_p; it becomes impossible for
target-independent code to distinguish "native" operations from
the worst-case fallback.
This patch instead limits TARGET_VECTORIZE_VEC_PERM_CONST to the
cases that it has to handle. It fixes the PR for all vector lengths
except 2048 bits.
A better fix would be to introduce some sort of costing mechanism,
which would allow us to reject the new VEC_PERM_EXPR even for
2048-bit targets. But that would be a significant amount of work
and would not be backportable.
gcc/
PR target/121027
* config/aarch64/aarch64.cc (aarch64_evpc_sve_tbl): Punt on 2-input
operations that can be handled by vec_perm.
gcc/testsuite/
PR target/121027
* gcc.target/aarch64/sve/acle/general/perm_1.c: New test.
|
|
Similar to BCAX, we can use EOR3 for DImode, but we have to be careful
not to force GP<->SIMD moves unnecessarily, so add a splitter for that case.
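The EOR3 and BCAX helpers used in this and the related patches below are
presumably:
#define EOR3(a, b, c) ((a) ^ (b) ^ (c))
#define BCAX(a, b, c) ((a) ^ ((b) & ~(c)))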
So for input:
uint64_t eor3_d_gp (uint64_t a, uint64_t b, uint64_t c) { return EOR3 (a, b, c); }
uint64x1_t eor3_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return EOR3 (a, b, c); }
We generate the desired:
eor3_d_gp:
eor x1, x1, x2
eor x0, x1, x0
ret
eor3_d:
eor3 v0.16b, v0.16b, v1.16b, v2.16b
ret
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
* config/aarch64/aarch64-simd.md (*eor3qdi4): New
define_insn_and_split.
gcc/testsuite/
* gcc.target/aarch64/simd/eor3_d.c: Add tests for DImode operands.
|
|
To handle DImode BCAX operations we want to do them on the SIMD side only if
the incoming arguments don't require a cross-bank move.
This means we need to split back the combination to separate GP BIC+EOR
instructions if the operands are expected to be in GP regs through reload.
The split happens pre-reload if we already know that the destination will be
a GP reg. Otherwise, if reload decides to use the "=r,r" alternative, we
ensure operand 0 is early-clobber.
This scheme is similar to how we handle the BSL operations elsewhere in
aarch64-simd.md.
Thus, for the functions:
uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, b, c); }
uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return BCAX (a, b, c); }
we now generate the desired:
bcax_d_gp:
bic x1, x1, x2
eor x0, x1, x0
ret
bcax_d:
bcax v0.16b, v0.16b, v1.16b, v2.16b
ret
When the inputs are in SIMD regs we use BCAX and when they are in GP regs we
don't force them to SIMD with extra moves.
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
* config/aarch64/aarch64-simd.md (*bcaxqdi4): New
define_insn_and_split.
gcc/testsuite/
* gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode arguments.
|
|
Similar to the BCAX patch, we can also use EOR3 for 64-bit modes,
just by adjusting the mode iterator used.
Thus for input:
uint32x2_t
bcax_s (uint32x2_t a, uint32x2_t b, uint32x2_t c)
{
return EOR3 (a, b, c);
}
we now generate:
bcax_s:
eor3 v0.16b, v0.16b, v1.16b, v2.16b
ret
instead of:
bcax_s:
eor v1.8b, v1.8b, v2.8b
eor v0.8b, v1.8b, v0.8b
ret
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
* config/aarch64/aarch64-simd.md (eor3q<mode>4): Use VDQ_I mode
iterator.
gcc/testsuite/
* gcc.target/aarch64/simd/eor3_d.c: New test.
|
|
The BCAX instruction from TARGET_SHA3 only operates on the full .16b form
of the inputs, but as it's a pure bitwise operation we can use it for the
64-bit modes as well, since there we don't care about the upper 64 bits. This
patch extends the relevant pattern in aarch64-simd.md to accept the 64-bit
vector modes.
Thus, for the input:
uint32x2_t
bcax_s (uint32x2_t a, uint32x2_t b, uint32x2_t c)
{
return BCAX (a, b, c);
}
we can now generate:
bcax_s:
bcax v0.16b, v0.16b, v1.16b, v2.16b
ret
instead of the current:
bcax_s:
bic v1.8b, v1.8b, v2.8b
eor v0.8b, v1.8b, v0.8b
ret
This patch doesn't cover the DI/V1DI modes, as that would require extending
the bcaxqdi4 pattern with =r,r alternatives and adding splitting logic to
handle the cases where the operands arrive in GP regs. That is doable, but can
be a separate patch. This patch as-is should be a straightforward improvement
in all cases.
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
* config/aarch64/aarch64-simd.md (bcaxq<mode>4): Use VDQ_I mode
iterator.
gcc/testsuite/
* gcc.target/aarch64/simd/bcax_d.c: New test.
|
|
gcc/ChangeLog:
PR target/91384
* config/i386/i386.md: Add new peephole2 to optimize *negsi_1
followed by *cmpsi_ccno_1 with APX_F.
gcc/testsuite/ChangeLog:
PR target/91384
* gcc.target/i386/pr91384-1.c: New test.
|
|
The x86 add_stmt_cost hook relies on the passed vectype to determine
the mode and whether it is FP for a scalar operation. This is
unreliable now for stmts involving patterns and in the future when
there is no vector type passed for scalar operations.
there is no vector type passed for scalar operations.
To be least disruptive I've kept using the vector type if it is passed.
* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Use
the LHS of a scalar stmt to determine mode and whether it is FP.
|
|
g:4b47acfe2b626d1276e229a0cf165e934813df6c caused a segfault
in aarch64_vector_costs::analyze_loop_vinfo when costing scalar
code, since we'd end up dividing by a zero VF.
Much of the structure of the aarch64 costing code dates from
a stage 4 patch, when we had to work within the bounds of what
the target-independent code did. Some of it could do with a
rework now that we're not so constrained.
This patch is therefore an emergency fix rather than the best
long-term solution. I'll revisit when I have more time to think
about it.
gcc/
* config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
Guard VF-based costing with !m_costing_for_scalar.
|
|
LD1Q gathers and ST1Q scatters are unusual in that they operate
on 128-bit blocks (effectively VNx1TI). However, we don't have
modes or ACLE types for 128-bit integers, and 128-bit integers
are not the intended use case. Instead, the instructions are
intended to be used in "hybrid VLA" operations, where each 128-bit
block is an Advanced SIMD vector.
The normal SVE modes therefore capture the intended use case better
than VNx1TI would. For example, VNx2DI is effectively N copies
of V2DI, VNx4SI N copies of V4SI, etc.
Since there is only one LD1Q instruction and one ST1Q instruction,
the ACLE support used a single pattern for each, with the loaded or
stored data having mode VNx2DI. The ST1Q pattern was generated by:
rtx data = e.args.last ();
e.args.last () = force_lowpart_subreg (VNx2DImode, data, GET_MODE (data));
e.prepare_gather_address_operands (1, false);
return e.use_exact_insn (CODE_FOR_aarch64_scatter_st1q);
where the force_lowpart_subreg bitcast the stored data to VNx2DI.
But such subregs require an element reverse on big-endian targets
(see the comment at the head of aarch64-sve.md), which wasn't the
intention. The code should have used aarch64_sve_reinterpret instead.
The LD1Q pattern was used as follows:
e.prepare_gather_address_operands (1, false);
return e.use_exact_insn (CODE_FOR_aarch64_gather_ld1q);
which always returns a VNx2DI value, leaving the caller to bitcast
that to the correct mode. That bitcast again uses subregs and has
the same problem as above.
However, for the reasons explained in the comment, using
aarch64_sve_reinterpret does not work well for LD1Q. The patch
instead parameterises the LD1Q based on the required data mode.
gcc/
* config/aarch64/aarch64-sve2.md (aarch64_gather_ld1q): Replace with...
(@aarch64_gather_ld1q<mode>): ...this, parameterizing based on mode.
* config/aarch64/aarch64-sve-builtins-sve2.cc
(svld1q_gather_impl::expand): Update accordingly.
(svst1q_scatter_impl::expand): Use aarch64_sve_reinterpret
instead of force_lowpart_subreg.
|
|
This patch makes the zero-stride load broadcast idiom dependent on a
uarch-tunable "use_zero_stride_load". Right now we have quite a few
paths that reach a strided load and some of them are not exactly
straightforward.
While broadcast is relatively rare on rv64 targets it is more common on
rv32 targets that want to vectorize 64-bit elements.
While the patch is more involved than I would have liked it could have
even touched more places. The whole broadcast-like insn path feels a
bit hackish due to the several optimizations we employ. Some of the
complications stem from the fact that we lump together real broadcasts,
vector single-element sets, and strided broadcasts. The strided-load
alternatives currently require a memory_constraint to work properly
which causes more complications when trying to disable just these.
In short, the whole pred_broadcast handling in combination with the
sew64_scalar_helper could use work in the future. I was about to start
with it in this patch but soon realized that it would only distract from
the original intent. What can help in the future is split strided and
non-strided broadcast entirely, as well as the single-element sets.
Yet unclear is whether we need to pay special attention for misaligned
strided loads (PR120782).
I regtested on rv32 and rv64 with strided_load_broadcast_p forced to
true and false. With either I didn't observe any new execution failures
but obviously there are new scan failures with strided broadcast turned
off.
PR target/118734
gcc/ChangeLog:
* config/riscv/constraints.md (Wdm): Use tunable for Wdm
constraint.
* config/riscv/riscv-protos.h (emit_avltype_insn): Declare.
(can_be_broadcasted_p): Rename to...
(can_be_broadcast_p): ...this.
(strided_load_broadcast_p): Declare.
* config/riscv/predicates.md: Use renamed function.
* config/riscv/riscv-selftests.cc (run_broadcast_selftests):
Only run broadcast selftest if strided broadcasts are OK.
* config/riscv/riscv-v.cc (emit_avltype_insn): New function.
(sew64_scalar_helper): Only emit a pred_broadcast if the new
tunable says so.
(can_be_broadcasted_p): Rename to...
(can_be_broadcast_p): ...this and use new tunable.
* config/riscv/riscv.cc (struct riscv_tune_param): Add strided
broadcast tunable.
(strided_load_broadcast_p): Implement.
* config/riscv/vector.md: Use strided_load_broadcast_p () and
work around 64-bit broadcast on rv32 targets.
|
|
This is primarily Daniel's work... He's chasing things in QEMU & LLVM right
now so I'm doing a bit of clean-up and shepherding this patch forward.
--
Instruction fusion is a reasonably common way to improve the performance of
code on many architectures/designs. A few years ago we submitted (via VRULL I
suspect) fusion support for a number of cases in the RISC-V space.
We made each type of fusion selectable independently in the tuning structure so
that designs which implemented some particular set of fusions could select just
the ones their design implemented. This patch adds to that generic
infrastructure.
In particular we're introducing additional load fusions, store pair fusions,
bitfield extractions and a few B extension related fusions.
Conceptually, for the new load fusions we're adding the ability to fuse most
add/shNadd instructions with a subsequent load. There are a couple of
exceptions, but in general the expectation is that if we have add/shNadd for
address computation, then they can potentially fuse with the load where the
address gets used.
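For example, an address-compute plus load pair that is now fusible (Zba
sh2add feeding the load that uses its result):
sh2add a5, a1, a0    # a5 = a0 + (a1 << 2)
ld     a4, 0(a5)     # fused with the address computation above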
We've had limited forms of store pair fusion for a while. Essentially we
required both stores to be 64 bits wide and land on opposite sides of a 128 bit
cache line. That was enough to help prologues and a few other things, but was
fairly restrictive. The new cases capture store pairs where the two stores
have the same size and hit consecutive memory locations. For example, storing
consecutive bytes with sb+sb is fusible.
For bitfield extractions we can fuse together a shift left followed by a shift
right, for arbitrary shift counts, whereas previously we restricted the shift
counts to those implementing sign/zero extensions of 8- and 16-bit objects.
Finally some B extension fusions. orc.b+not which shows up in string
comparisons, ctz+andi (deepsjeng?), neg+max (synthesized abs).
I hope these prove to be useful to other RISC-V designs. I wouldn't be
surprised if we have to break down the new load fusions further for some
designs. If we need to do that it wouldn't be hard.
FWIW, our data indicates the generalized store fusions followed by the expanded
load fusions are the most important cases for the new code.
These have been tested with crosses and bootstrapped on the BPI.
Waiting on pre-commit CI before moving forward (though it has been failing to
pick up some patches recently...)
gcc/
* config/riscv/riscv.cc (riscv_fusion_pairs): Add new cases.
(riscv_set_is_add): New function.
(riscv_set_is_addi, riscv_set_is_adduw, riscv_set_is_shNadd): Likewise.
(riscv_set_is_shNadduw): Likewise.
(riscv_macro_fusion_pair_p): Add new fusion cases.
Co-authored-by: Jeff Law <jlaw@ventanamicro.com>
|
|
This concerns the NBSL implementation of NOR. While the SVE2 NBSL instruction
accepts MOVPRFX to add more flexibility due to its tied operands, the
destination of the movprfx cannot also be a source operand. But the offending
pattern in aarch64-sve2.md tries to do exactly that for the "=?&w,w,w"
alternative, and gas warns for the attached testcase.
This patch adjusts that alternative to avoid taking operand 0 as an input
to the NBSL again.
So for the testcase in the patch we now generate:
nor_z:
movprfx z0, z1
nbsl z0.d, z0.d, z2.d, z1.d
ret
instead of the previous:
nor_z:
movprfx z0, z1
nbsl z0.d, z0.d, z2.d, z0.d
ret
which generated a gas warning.
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
PR target/120999
* config/aarch64/aarch64-sve2.md (*aarch64_sve2_nor<mode>):
Adjust movprfx alternative.
gcc/testsuite/
PR target/120999
* gcc.target/aarch64/sve2/pr120999.c: New test.
|
|
TARGET_VECTORIZE_VEC_PERM_CONST has code to match the SVE2.1
"hybrid VLA" DUPQ, EXTQ, UZPQ{1,2}, and ZIPQ{1,2} instructions.
This matching was conditional on !BYTES_BIG_ENDIAN.
The ACLE code also lowered the associated SVE2.1 intrinsics into
suitable VEC_PERM_EXPRs. This lowering was not conditional on
!BYTES_BIG_ENDIAN.
The mismatch led to lots of ICEs in the ACLE tests on big-endian
targets: we lowered to VEC_PERM_EXPRs that are not supported.
I think the !BYTES_BIG_ENDIAN restriction was unnecessary.
SVE maps the first memory element to the least significant end of
the register for both endiannesses, so no endian correction or lane
number adjustment is necessary.
This is in some ways a bit counterintuitive. ZIPQ1 is conceptually
"apply Advanced SIMD ZIP1 to each 128-bit block" and endianness does
matter when choosing between Advanced SIMD ZIP1 and ZIP2. For example,
the V4SI permute selector { 0, 4, 1, 5 } corresponds to ZIP1 for little-
endian and ZIP2 for big-endian. But the difference between the hybrid
VLA and Advanced SIMD permute selectors is a consequence of the
difference between the SVE and Advanced SIMD element orders.
The same thing applies to ACLE intrinsics. The current lowering of
svzipq1 etc. is correct for both endiannesses. If ACLE code does:
2x svld1_s32 + svzipq1_s32 + svst1_s32
then the byte-for-byte result is the same for both endiannesses.
On big-endian targets, this is different from using the Advanced SIMD
sequence below for each 128-bit block:
2x LDR + ZIP1 + STR
In contrast, the byte-for-byte result of:
2x svld1q_gather_s32 + svzipq1_s32 + svst1q_scatter_s32
depends on endianness, since the quadword gathers and scatters use
Advanced SIMD byte ordering for each 128-bit block. This gather/scatter
sequence behaves in the same way as the Advanced SIMD LDR+ZIP1+STR
sequence for both endiannesses.
Programmers writing ACLE code have to be aware of this difference
if they want to support both endiannesses.
The patch includes some new execution tests to verify the expansion
of the VEC_PERM_EXPRs.
gcc/
* doc/sourcebuild.texi (aarch64_sve2_hw, aarch64_sve2p1_hw): Document.
* config/aarch64/aarch64.cc (aarch64_evpc_hvla): Extend to
BYTES_BIG_ENDIAN.
gcc/testsuite/
* lib/target-supports.exp (check_effective_target_aarch64_sve2p1_hw):
New proc.
* gcc.target/aarch64/sve2/dupq_1.c: Extend to big-endian. Add
noipa attributes.
* gcc.target/aarch64/sve2/extq_1.c: Likewise.
* gcc.target/aarch64/sve2/uzpq_1.c: Likewise.
* gcc.target/aarch64/sve2/zipq_1.c: Likewise.
* gcc.target/aarch64/sve2/dupq_1_run.c: New test.
* gcc.target/aarch64/sve2/extq_1_run.c: Likewise.
* gcc.target/aarch64/sve2/uzpq_1_run.c: Likewise.
* gcc.target/aarch64/sve2/zipq_1_run.c: Likewise.
|
|
Kyrylo noticed another spelling bug and, as usual, the same mistake
appears in multiple places.
2025-07-10 Jakub Jelinek <jakub@redhat.com>
* config/i386/x86-tune.def: Change "Tunning the" to "tuning" in
comment and use semicolon instead of dot in comment.
* loop-unroll.cc (decide_unroll_stupid): Comment spelling fix,
tunning -> tuning.
|
|
While I'm not a native English speaker, I believe all the uses
of bellow (roar/bark/...) in comments in gcc are meant to be
below (beneath/under/...).
2025-07-10 Jakub Jelinek <jakub@redhat.com>
gcc/
* tree-vect-loop.cc (scale_profile_for_vect_loop): Comment
spelling fix: bellow -> below.
* ipa-polymorphic-call.cc (record_known_type): Likewise.
* config/i386/x86-tune.def: Likewise.
* config/riscv/vector.md (*vsetvldi_no_side_effects_si_extend):
Likewise.
* tree-scalar-evolution.cc (iv_can_overflow_p): Likewise.
* ipa-devirt.cc (add_type_duplicate): Likewise.
* tree-ssa-loop-niter.cc (maybe_lower_iteration_bound): Likewise.
* gimple-ssa-sccopy.cc: Likewise.
* cgraphunit.cc: Likewise.
* graphite.h (struct poly_dr): Likewise.
* ipa-reference.cc (ignore_edge_p): Likewise.
* tree-ssa-alias.cc (ao_compare::compare_ao_refs): Likewise.
* profile-count.h (profile_probability::probably_reliable_p):
Likewise.
* ipa-inline-transform.cc (inline_call): Likewise.
gcc/ada/
* par-load.adb: Comment spelling fix: bellow -> below.
* libgnarl/s-taskin.ads: Likewise.
gcc/testsuite/
* gfortran.dg/g77/980310-3.f: Comment spelling fix: bellow -> below.
* jit.dg/test-debuginfo.c: Likewise.
libstdc++-v3/
* testsuite/22_locale/codecvt/codecvt_unicode.h
(ucs2_to_utf8_out_error): Comment spelling fix: bellow -> below.
(utf16_to_ucs2_in_error): Likewise.
|
|
aarch64_simd_valid_imm tries to decompose a constant into a repeating
series of 64 bits, since most Advanced SIMD and SVE immediate forms
require that. (The exceptions are handled first.) It does this by
building up a byte-level register image, lsb first. If the image does
turn out to repeat every 64 bits, it loads the first 64 bits into an
integer.
At this point, endianness has mostly been dealt with. Endianness
applies to transfers between registers and memory, whereas at this
point we're dealing purely with register values.
However, one of the things we try is to bitcast the value to a float
and use FMOV. This involves splitting the value into 32-bit chunks
(stored as longs) and passing them to real_from_target. The problem
being fixed by this patch is that, when a value spans multiple 32-bit
chunks, real_from_target expects them to be in memory rather than
register order. Thus index 0 is the most significant chunk if
FLOAT_WORDS_BIG_ENDIAN and the least significant chunk otherwise.
This fixes aarch64/sve/cond_fadd_1.c and various other tests
for aarch64_be-elf.
gcc/
* config/aarch64/aarch64.cc (aarch64_simd_valid_imm): Account
for FLOAT_WORDS_BIG_ENDIAN when building a floating-point value.
|
|
When using SVE INDEX to load an Advanced SIMD vector, we need to
take account of the different element ordering for big-endian
targets. For example, when big-endian targets store the V4SI
constant { 0, 1, 2, 3 } in registers, 0 becomes the most
significant element, whereas INDEX always operates from the
least significant element. A big-endian target would therefore
load V4SI { 0, 1, 2, 3 } using:
INDEX Z0.S, #3, #-1
rather than little-endian's:
INDEX Z0.S, #0, #1
While there, I noticed that we would only check the first vector
in a multi-vector SVE constant, which would trigger an ICE if the
other vectors turned out to be invalid. This is pretty difficult to
trigger at the moment, since we only allow single-register modes to be
used as frontend & middle-end vector modes, but it can be seen using
the RTL frontend.
gcc/
* config/aarch64/aarch64.cc (aarch64_sve_index_series_p): New
function, split out from...
(aarch64_simd_valid_imm): ...here. Account for the different
SVE and Advanced SIMD element orders on big-endian targets.
Check each vector in a structure mode.
gcc/testsuite/
* gcc.dg/rtl/aarch64/vec-series-1.c: New test.
* gcc.dg/rtl/aarch64/vec-series-2.c: Likewise.
* gcc.target/aarch64/sve/acle/general/dupq_2.c: Fix expected
output for this big-endian test.
* gcc.target/aarch64/sve/acle/general/dupq_4.c: Likewise.
* gcc.target/aarch64/sve/vec_init_3.c: Restrict to little-endian
targets and add more tests.
* gcc.target/aarch64/sve/vec_init_4.c: New big-endian version
of vec_init_3.c.
|
|
This patch would like to combine the vec_duplicate + vssub.vv into
vssub.vx, as in the example below. The related pattern will depend
on the cost of vec_duplicate from GR2VR: the late-combine pass will
take action if the cost of GR2VR is zero, and reject the combination
if the GR2VR cost is greater than zero.
Assume we have the example code below, with a GR2VR cost of 0.
#define DEF_SAT_S_ADD(T, UT, MIN, MAX) \
T \
test_##T##_sat_add (T x, T y) \
{ \
T sum = (UT)x + (UT)y; \
return (x ^ y) < 0 \
? sum \
: (sum ^ x) >= 0 \
? sum \
: x < 0 ? MIN : MAX; \
}
DEF_SAT_S_ADD(int32_t, uint32_t, INT32_MIN, INT32_MAX)
DEF_VX_BINARY_CASE_2_WRAP(T, SAT_S_ADD_FUNC(T), sat_add)
Before this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vssub.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vssub.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_vec_dup): Add
new case SS_MINUS.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op ss_minus.
Signed-off-by: Pan Li <pan2.li@intel.com>
|