path: root/gcc/config
2025-07-17  s390: Rework signbit optab  (Stefan Schulze Frielinghaus; 1 file; -1/+82)
Currently for a signbit operation instructions tc{f,d,x}b + ipm + srl are emitted. If the source operand is a MEM, then a load precedes the sequence. A faster implementation is by issuing a load either from a REG or MEM into a GPR followed by a shift. In spirit of the signbit function of the C standard, the signbit optab only guarantees that the resulting value is nonzero if the signbit is set. The common code implementation computes a value where the signbit is stored in the most significant bit, i.e., all other bits are just masked out, whereas the current implementation of s390 results in a value where the signbit is stored in the least significant bit. Although, there is no guarantee where the signbit is stored, keep the current behaviour and, therefore, implement the signbit optab manually. Since z10, instruction lgdr can be effectively used for a 64-bit FPR-to-GPR load. However, there exists no 32-bit pendant. Thus, for target z10 make use of post-reload splitters which emit either a 64-bit or a 32-bit load depending on whether the source operand is a REG or a MEM and a corresponding 63 or 31-bit shift. We can do without post-reload splitter in case of vector extensions since there we also have a 32-bit VR-to-GPR load via instruction vlgvf. gcc/ChangeLog: * config/s390/s390.md (signbit_tdc): Rename expander. (signbit<mode>2): New expander. (signbit<mode>2_z10): New expander. gcc/testsuite/ChangeLog: * gcc.target/s390/isfinite-isinf-isnormal-signbit-2.c: Adapt scan assembler directives. * gcc.target/s390/isfinite-isinf-isnormal-signbit-3.c: Ditto. * gcc.target/s390/signbit-1.c: New test. * gcc.target/s390/signbit-2.c: New test. * gcc.target/s390/signbit-3.c: New test. * gcc.target/s390/signbit-4.c: New test. * gcc.target/s390/signbit-5.c: New test. * gcc.target/s390/signbit.h: New test.
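As a rough illustration of the load-plus-shift approach described above, here is a minimal C sketch (my own, not part of the patch): it assumes a 64-bit double, moves the bit pattern into an integer via memcpy, and shifts by 63 so that the sign bit ends up in the least significant bit, which is the behaviour the s390 expander preserves.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical scalar model of the new s390 signbit expansion:
       load the value into a GPR and shift, leaving the sign bit in
       the least significant bit.  Nonzero iff the sign bit is set.  */
    static int
    signbit_via_shift (double x)
    {
      uint64_t bits;
      memcpy (&bits, &x, sizeof bits);  /* FPR/MEM -> GPR style load.  */
      return (int) (bits >> 63);        /* 63-bit shift for 64-bit data.  */
    }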
2025-07-17  s390: Adapt GPR<->VR costs  (Stefan Schulze Frielinghaus; 1 file; -1/+15)
Moving between GPRs and VRs in any mode with size less than or equal to 8 bytes becomes available with vector extensions. Without adapting costs for those loads, we typically go over memory. gcc/ChangeLog: * config/s390/s390.cc (s390_register_move_cost): Add costing for vlvg/vlgv.
2025-07-17  s390: Add implicit zero extend for VLGV  (Stefan Schulze Frielinghaus; 1 file; -6/+54)
Exploit the fact that instruction VLGV zeros excessive bits of a GPR. gcc/ChangeLog: * config/s390/vector.md (bhfgq): Add scalar modes. (*movdi<mode>_zero_extend_A): New insn. (*movsi<mode>_zero_extend_A): New insn. (*movdi<mode>_zero_extend_B): New insn. (*movsi<mode>_zero_extend_B): New insn. gcc/testsuite/ChangeLog: * gcc.target/s390/vector/vlgv-zero-extend-1.c: New test.
2025-07-17  LoongArch: Fix wrong code generated by TARGET_VECTORIZE_VEC_PERM_CONST [PR121064]  (Xi Ruoyao; 3 files; -99/+35)
[PR121064] When TARGET_VECTORIZE_VEC_PERM_CONST is called, target may be the same pseudo as op0 and/or op1. Loading the selector into target would clobber the input, producing wrong code like vld $vr0, $t0 vshuf.w $vr0, $vr0, $vr1 So don't load the selector into d->target, use a new pseudo to hold the selector instead. The reload pass will load the pseudo for selector and the pseudo for target into the same hard register (following our constraint '0' on the shuf instructions) anyway. gcc/ChangeLog: PR target/121064 * config/loongarch/lsx.md (lsx_vshuf_<lsxfmt_f>): Add '@' to generate a mode-aware helper. Use <VIMODE> as the mode of the operand 1 (selector). * config/loongarch/lasx.md (lasx_xvshuf_<lasxfmt_f>): Likewise. * config/loongarch/loongarch.cc (loongarch_try_expand_lsx_vshuf_const): Create a new pseudo for the selector. Use the mode-aware helper to simplify the code. (loongarch_expand_vec_perm_const): Likewise. gcc/testsuite/ChangeLog: PR target/121064 * gcc.target/loongarch/pr121064.c: New test.
2025-07-16  x86: Convert MMX integer loads from constant vector pool  (Uros Bizjak; 2 files; -19/+45)
For MMX 16-bit, 32-bit and 64-bit constant vector loads from constant vector pool: (insn 6 2 7 2 (set (reg:V1SI 5 di) (mem/u/c:V1SI (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S4 A32])) "pr121062-2.c":10:3 2036 {*movv1si_internal} (expr_list:REG_EQUAL (const_vector:V1SI [ (const_int -1 [0xffffffffffffffff]) ]) (nil))) we can convert it to (insn 12 2 7 2 (set (reg:SI 5 di) (const_int -1 [0xffffffffffffffff])) "pr121062-2.c":10:3 100 {*movsi_internal} (nil)) Co-Developed-by: H.J. Lu <hjl.tools@gmail.com> gcc/ PR target/121062 * config/i386/i386.cc (ix86_convert_const_vector_to_integer): Handle E_V1SImode and E_V1DImode. * config/i386/mmx.md (V_16_32_64): Add V1SI, V2BF and V1DI. (mmxinsnmode): Add V1DI and V1SI. Add V_16_32_64 splitter for constant vector loads from constant vector pool. (V_16_32_64:*mov<mode>_imm): Moved after V_16_32_64 splitter. Replace lowpart_subreg with adjust_address. gcc/testsuite/ PR target/121062 * gcc.target/i386/pr121062-1.c: New test. * gcc.target/i386/pr121062-2.c: Likewise. * gcc.target/i386/pr121062-3a.c: Likewise. * gcc.target/i386/pr121062-3b.c: Likewise. * gcc.target/i386/pr121062-3c.c: Likewise. * gcc.target/i386/pr121062-4.c: Likewise. * gcc.target/i386/pr121062-5.c: Likewise. * gcc.target/i386/pr121062-6.c: Likewise. * gcc.target/i386/pr121062-7.c: Likewise.
2025-07-16  x86: Warn -pg without -mfentry only on glibc targets  (H.J. Lu; 1 file; -0/+4)
Since only glibc targets support -mfentry, warn -pg without -mfentry only on glibc targets. gcc/ PR target/120881 PR testsuite/121078 * config/i386/i386-options.cc (ix86_option_override_internal): Warn -pg without -mfentry only on glibc targets. gcc/testsuite/ PR target/120881 PR testsuite/121078 * gcc.dg/20021014-1.c (dg-additional-options): Add -mfentry -fno-pic only on gnu/x86 targets. * gcc.dg/aru-2.c (dg-additional-options): Likewise. * gcc.dg/nest.c (dg-additional-options): Likewise. * gcc.dg/pr32450.c (dg-additional-options): Likewise. * gcc.dg/pr43643.c (dg-additional-options): Likewise. * gcc.target/i386/pr104447.c (dg-additional-options): Likewise. * gcc.target/i386/pr113122-3.c(dg-additional-options): Likewise. * gcc.target/i386/pr119386-1.c (dg-additional-options): Add -mfentry only on gnu targets. * gcc.target/i386/pr119386-2.c (dg-additional-options): Likewise. Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
2025-07-16  i386: Use various predicates instead of open coding them  (Uros Bizjak; 3 files; -4/+4)
No functional changes. gcc/ChangeLog: * config/i386/i386-expand.cc (ix86_expand_move): Use MEM_P predicate instead of open coding it. (ix86_erase_embedded_rounding): Use NONJUMP_INSN_P predicate instead of open coding it. * config/i386/i386-features.cc (convertible_comparison_p): Use REG_P predicate instead of open coding it. * config/i386/i386.cc (ix86_rtx_costs): Use SUBREG_P predicate instead of open coding it.
2025-07-16  i386: Use LABEL_REF_P predicate instead of open coding it  (Uros Bizjak; 4 files; -21/+21)
No functional changes. gcc/ChangeLog: * config/i386/i386.cc (symbolic_reference_mentioned_p): Use LABEL_REF_P predicate instead of open coding it. (ix86_legitimate_constant_p): Ditto. (legitimate_pic_address_disp_p): Ditto. (ix86_legitimate_address_p): Ditto. (legitimize_pic_address): Ditto. (ix86_print_operand): Ditto. (ix86_print_operand_address_as): Ditto. (ix86_rip_relative_addr_p): Ditto. * config/i386/i386.h (SYMBOLIC_CONST): Ditto. * config/i386/i386.md (*anddi_1 to *andsi_1_zext splitter): Ditto. * config/i386/predicates.md (symbolic_operand): Ditto. (local_symbolic_operand): Ditto. (vsib_address_operand): Ditto.
2025-07-16  i386: Use SYMBOL_REF_P predicate instead of open coding it  (Uros Bizjak; 6 files; -46/+45)
No functional changes. gcc/ChangeLog: * config/i386/i386-expand.cc (ix86_expand_move): Use SYMBOL_REF_P predicate instead of open coding it. (ix86_split_long_move): Ditto. (construct_plt_address): Ditto. (ix86_expand_call): Ditto. (ix86_notrack_prefixed_insn_p): Ditto. * config/i386/i386-features.cc (rest_of_insert_endbr_and_patchable_area): Ditto. * config/i386/i386.cc (symbolic_reference_mentioned_p): Ditto. (ix86_force_load_from_GOT_p): Ditto. (ix86_legitimate_constant_p): Ditto. (legitimate_pic_operand_p): Ditto. (legitimate_pic_address_disp_p): Ditto. (ix86_legitimate_address_p): Ditto. (legitimize_pic_address): Ditto. (ix86_legitimize_address): Ditto. (ix86_delegitimize_tls_address): Ditto. (ix86_print_operand): Ditto. (ix86_print_operand_address_as): Ditto. (ix86_rip_relative_addr_p): Ditto. (symbolic_base_address_p): Ditto. * config/i386/i386.h (SYMBOLIC_CONST): Ditto. * config/i386/i386.md (*anddi_1 to *andsi_1_zext splitter): Ditto. * config/i386/predicates.md (symbolic_operand): Ditto. (local_symbolic_operand): Ditto. (local_func_symbolic_operand): Ditto.
2025-07-16  i386: Use CONST_VECTOR_P predicate instead of open coding it  (Uros Bizjak; 4 files; -22/+22)
No functional changes. gcc/ChangeLog: * config/i386/i386-expand.cc (ix86_expand_vector_logical_operator): Use CONST_VECTOR_P instead of open coding it. (ix86_expand_int_sse_cmp): Ditto. (ix86_extract_perm_from_pool_constant): Ditto. (ix86_split_to_parts): Ditto. (const_vector_equal_evenodd_p): Ditto. * config/i386/i386.cc (ix86_print_operand): Ditto. * config/i386/predicates.md (zero_extended_scalar_load_operand): Ditto. (float_vector_all_ones_operand): Ditto. * config/i386/sse.md (avx512vl_vextractf128<mode>): Ditto.
2025-07-16  amdgcn: Fix various unrecognized pattern issues with add<mode>3_vcc_dup  (Andrew Stubbs; 1 file; -11/+11)
The patterns did not accept inline immediate constants, even though the hardware instructions do, which has led to some errors in some patches I'm working on. Also the VCC update RTL was using the wrong operands in the wrong places. This appears to have been harmless(?) but is definitely not intended. gcc/ChangeLog: * config/gcn/gcn-valu.md (add<mode>3_vcc_dup<exec_vcc>): Change operand 2 to allow gcn_alu_operand. Swap the operands in the VCC update RTL. (add<mode>3_vcc_zext_dup): Likewise. (add<mode>3_vcc_zext_dup_exec): Likewise. (add<mode>3_vcc_zext_dup2): Likewise. (add<mode>3_vcc_zext_dup2_exec): Likewise.
2025-07-16  aarch64: Fold builtins with highpart args to highpart equivalent [PR117850]  (Spencer Abson; 2 files; -0/+249)
Add a fold at gimple_fold_builtin to prefer the highpart variant of a builtin if at least one argument is a vector highpart and all others are VECTOR_CSTs that we can extend to 128-bits. For example, we prefer to duplicate f0 and use UMULL2 here over DUP+UMULL: uint16x8_t foo (const uint8x16_t s) { const uint8x8_t f0 = vdup_n_u8 (4); return vmull_u8 (vget_high_u8 (s), f0); } gcc/ChangeLog: PR target/117850 * config/aarch64/aarch64-builtins.cc (LO_HI_PAIRINGS): New, group the lo/hi pairs from aarch64-builtin-pairs.def. (aarch64_get_highpart_builtin): New function. (aarch64_v128_highpart_ref): New function, helper to look for vector highparts. (aarch64_build_vector_cst): New function, helper to build duplicated VECTOR_CSTs. (aarch64_fold_lo_call_to_hi): New function. (aarch64_general_gimple_fold_builtin): Add cases for the lo builtins in aarch64-builtin-pairs.def. * config/aarch64/aarch64-builtin-pairs.def: New file, declare the pairs of lowpart-operating and highpart-operating builtins. gcc/testsuite/ChangeLog: PR target/117850 * gcc.target/aarch64/simd/vabal_combine.c: Removed. This is covered by fold_to_highpart_1.c. * gcc.target/aarch64/simd/fold_to_highpart_1.c: New test. * gcc.target/aarch64/simd/fold_to_highpart_2.c: Likewise. * gcc.target/aarch64/simd/fold_to_highpart_3.c: Likewise. * gcc.target/aarch64/simd/fold_to_highpart_4.c: Likewise. * gcc.target/aarch64/simd/fold_to_highpart_5.c: Likewise. * gcc.target/aarch64/simd/fold_to_highpart_6.c: Likewise.
2025-07-16  RISC-V: Fix vsetvl merge rule.  (Robin Dapp; 1 file; -3/+3)
In PR120297 we fuse vsetvl e8,mf2,... vsetvl e64,m1,... into vsetvl e64,m4,... Individually, that's ok but we also change the new vsetvl's demand to "SEW only" even though the first original one demanded SEW >= 8 and ratio = 16. As we forget the ratio after the merge we find that the vsetvl following the merged one has ratio = 64 demand and we fuse into vsetvl e64,m1,.. which obviously doesn't have ratio = 16 any more. Regtested on rv64gcv_zvl512b. PR target/120297 gcc/ChangeLog: * config/riscv/riscv-vsetvl.def: Do not forget ratio demand of previous vsetvl. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/pr120297.c: New test.
2025-07-16  aarch64: Use SVE2 BSL2N for vector EON  (Kyrylo Tkachov; 1 file; -0/+34)
SVE2 BSL2N (x, y, z) = (x & z) | (~y & ~z). When x == y this computes: (x & z) | (~x & ~z) which is ~(x ^ z). Thus, we can use it to match RTL patterns (not (xor (...) (...))) for both Advanced SIMD and SVE modes when TARGET_SVE2. This patch does that. For code like: uint64x2_t eon_q(uint64x2_t a, uint64x2_t b) { return EON(a, b); } svuint64_t eon_z(svuint64_t a, svuint64_t b) { return EON(a, b); } We now generate: eon_q: bsl2n z0.d, z0.d, z0.d, z1.d ret eon_z: bsl2n z0.d, z0.d, z0.d, z1.d ret instead of the previous: eon_q: eor v0.16b, v0.16b, v1.16b not v0.16b, v0.16b ret eon_z: eor z0.d, z0.d, z1.d ptrue p3.b, all not z0.d, p3/m, z0.d ret Bootstrapped and tested on aarch64-none-linux-gnu. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> gcc/ * config/aarch64/aarch64-sve2.md (*aarch64_sve2_bsl2n_eon<mode>): New pattern. (*aarch64_sve2_eon_bsl2n_unpred<mode>): Likewise. gcc/testsuite/ * gcc.target/aarch64/sve2/eon_bsl2n.c: New test.
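The identity quoted above, (x & z) | (~x & ~z) == ~(x ^ z), can be checked exhaustively on a small type; a throwaway C check (my own, not part of the patch):

    #include <assert.h>
    #include <stdint.h>

    int
    main (void)
    {
      /* BSL2N with the first two operands equal computes EON:
         (x & z) | (~x & ~z) == ~(x ^ z) for every pair of bytes.  */
      for (unsigned x = 0; x < 256; x++)
        for (unsigned z = 0; z < 256; z++)
          {
            uint8_t bsl2n = (x & z) | (~x & ~z);
            uint8_t eon = ~(x ^ z);
            assert (bsl2n == eon);
          }
      return 0;
    }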
2025-07-16  aarch64: Use SVE2 NBSL for vector NOR and NAND for Advanced SIMD modes  (Kyrylo Tkachov; 1 file; -0/+29)
We already have patterns to use the NBSL instruction to implement vector NOR and NAND operations for SVE types and modes. It is straightforward to have similar patterns for the fixed-width Advanced SIMD modes as well, though it requires combine patterns without the predicate operand and an explicit 'Z' output modifier. This patch does so. So now for example we generate for: uint64x2_t nand_q(uint64x2_t a, uint64x2_t b) { return NAND(a, b); } uint64x2_t nor_q(uint64x2_t a, uint64x2_t b) { return NOR(a, b); } nand_q: nbsl z0.d, z0.d, z1.d, z1.d ret nor_q: nbsl z0.d, z0.d, z1.d, z0.d ret instead of the previous: nand_q: and v0.16b, v0.16b, v1.16b not v0.16b, v0.16b ret nor_q: orr v0.16b, v0.16b, v1.16b not v0.16b, v0.16b ret The tied operand requirements for NBSL mean that we can generate the MOVPRFX when the operands fall that way, but I guess having a 2-insn MOVPRFX form is not worse than the current 2-insn codegen at least, and the MOVPRFX can be fused by many cores. Bootstrapped and tested on aarch64-none-linux-gnu. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> gcc/ * config/aarch64/aarch64-sve2.md (*aarch64_sve2_unpred_nor<mode>): New define_insn. (*aarch64_sve2_nand_unpred<mode>): Likewise. gcc/testsuite/ * gcc.target/aarch64/sve2/nbsl_nor_nand_neon.c: New test.
2025-07-16  RISC-V: Support RVVDImode for avg3_floor auto vect  (Pan Li; 1 file; -0/+13)
The avg3_floor pattern leverage the add and shift rtl with the DOUBLE_TRUNC mode iterator. Aka, RVVDImode iterator will generate avg3rvvsimode_floor, only the element size QI, HI and SI are allowed. Thus, this patch would like to support the DImode by the standard name, with the iterator V_VLSI_D. The below test suites are passed for this patch series. * The rv64gcv fully regression test. gcc/ChangeLog: * config/riscv/autovec.md (avg<mode>3_floor): Add new pattern of avg3_floor for rvv DImode. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/avg.h: Add int128 type when xlen == 64. * gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i16-from-i32.c: Suppress __int128 warning for run test. * gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i16-from-i64.c: Ditto. * gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i32-from-i64.c: Ditto. * gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i8-from-i16.c: Ditto. * gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i8-from-i32.c: Ditto. * gcc.target/riscv/rvv/autovec/avg_ceil-run-1-i8-from-i64.c: Ditto. * gcc.target/riscv/rvv/autovec/avg_data.h: Fix one incorrect test data. * gcc.target/riscv/rvv/autovec/avg_floor-run-1-i16-from-i32.c: Ditto. * gcc.target/riscv/rvv/autovec/avg_floor-run-1-i16-from-i64.c: Ditto. * gcc.target/riscv/rvv/autovec/avg_floor-run-1-i32-from-i64.c: Ditto. * gcc.target/riscv/rvv/autovec/avg_floor-run-1-i8-from-i16.c: Ditto. * gcc.target/riscv/rvv/autovec/avg_floor-run-1-i8-from-i32.c: Ditto. * gcc.target/riscv/rvv/autovec/avg_floor-run-1-i8-from-i64.c: Ditto. * gcc.target/riscv/rvv/autovec/avg_floor-1-i64-from-i128.c: New test. * gcc.target/riscv/rvv/autovec/avg_floor-run-1-i64-from-i128.c: New test. Signed-off-by: Pan Li <pan2.li@intel.com>
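For reference, a scalar sketch of the operation being vectorized here (my own illustration; the helper name is an assumption): avg_floor on 64-bit elements forms the sum in a 128-bit intermediate before shifting, which is why the test harness needs __int128 when xlen == 64.

    #include <stdint.h>

    /* Scalar model of avg3_floor for 64-bit elements: widen, add,
       then arithmetic shift right by one (floor of the average).  */
    static int64_t
    avg_floor_i64 (int64_t a, int64_t b)
    {
      return (int64_t) (((__int128) a + b) >> 1);
    }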
2025-07-15  [PATCH v5] RISC-V: Mips P8700 Conditional Move Support.  (Umesh Kalappa; 8 files; -37/+154)
Updated the test for rv32 accordingly and no regress found for runs like "runtest --tool gcc --target_board='riscv-sim/-march=rv32gc_zba_zbb_zbc_zbs/-mabi=ilp32d/-mcmodel=medlow' riscv.exp" and "runtest --tool gcc --target_board='riscv-sim/-march=rv64gc_zba_zbb_zbc_zbs/-mabi=lp64d/-mcmodel=medlow' riscv.exp" lint warnings can be ignored for riscv-cores.def and riscv-ext-mips.def gcc/ChangeLog: * config/riscv/riscv-cores.def (RISCV_CORE): Updated the supported march. * config/riscv/riscv-ext-mips.def (DEFINE_RISCV_EXT): New file added for mips conditional mov extension. * config/riscv/riscv-ext.def: Likewise. * config/riscv/t-riscv: Generates riscv-ext.opt * config/riscv/riscv-ext.opt: Generated file. * config/riscv/riscv.cc (riscv_expand_conditional_move): Updated for mips cmov and outlined some code that handle arch cond move. * config/riscv/riscv.md (mov<mode>cc): updated expand for MIPS CCMOV. * config/riscv/mips-insn.md: New file for mips-p8700 ccmov insn. * doc/riscv-ext.texi: Updated for mips cmov. gcc/testsuite/ChangeLog: * gcc.target/riscv/mipscondmov.c: Test file for mips.ccmov insn.
2025-07-15  aarch64: Enable selective LDAPUR generation for cores with RCPC2  (Soumya AR; 9 files; -13/+44)
This patch adds the ability to fold the address computation into the addressing mode for LDAPR instructions using LDAPUR when RCPC2 is available. LDAPUR emission is enabled by default when RCPC2 is available, but can be disabled using the avoid_ldapur tune flag on a per-core basis. Currently, it is disabled for neoverse-v2, neoverse-v3, cortex-x925, and architectures before armv8.8-a. Earlier, the following code: uint64_t foo (std::atomic<uint64_t> *x) { return x[1].load(std::memory_order_acquire); } would generate: foo(std::atomic<unsigned long>*): add x0, x0, 8 ldapr x0, [x0] ret but now generates: foo(std::atomic<unsigned long>*): ldapur x0, [x0, 8] ret The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Soumya AR <soumyaa@nvidia.com> gcc/ChangeLog: * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION): Add AVOID_LDAPUR tuning flag. * config/aarch64/aarch64.cc (aarch64_adjust_generic_arch_tuning): Set AVOID_LDAPUR for architectures before armv8.8-a. (aarch64_override_options_internal): Apply generic tuning adjustments to generic_armv8_a_tunings and generic_armv9_a_tunings. * config/aarch64/aarch64.h (TARGET_ENABLE_LDAPUR): New macro to control LDAPUR usage based on RCPC2 and tuning flags. * config/aarch64/aarch64.md: Add enable_ldapur attribute. * config/aarch64/atomics.md (aarch64_atomic_load<mode>_rcpc): Modify to emit LDAPUR for cores with RCPC2. (*aarch64_atomic_load<ALLX:mode>_rcpc_zext): Likewise. (*aarch64_atomic_load<ALLX:mode>_rcpc_sext): Update constraint to Ust. * config/aarch64/tuning_models/cortexx925.h: Add AVOID_LDAPUR flag. * config/aarch64/tuning_models/neoversev2.h: Likewise. * config/aarch64/tuning_models/neoversev3.h: Likewise. * config/aarch64/tuning_models/neoversev3ae.h: Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/ldapr-sext.c: Update expected output to include offsets. * gcc.target/aarch64/ldapur.c: New test for LDAPUR. * gcc.target/aarch64/ldapur_avoid.c: New test for AVOID_LDAPUR flag.
2025-07-15  aarch64: fixup: Implement sme2+faminmax extension.  (Alfie Richards; 1 file; -3/+3)
Fixup to the SME2+FAMINMAX intrinsics commit. gcc/ChangeLog: * config/aarch64/aarch64-sme.md (@aarch64_sme_<faminmax_uns_op><mode>): Change gating and comment.
2025-07-15  Revert "aarch64: Use EOR3 for DImode values"  (Kyrylo Tkachov; 1 file; -25/+0)
This reverts commit cfa827188dc236ba905b12ef06ccc517b9f2de39.
2025-07-15  aarch64: AND/BIC combines for unpacked SVE FP comparisons  (Spencer Abson; 1 file; -13/+13)
This patch extends the splitting patterns for combining FP comparisons with predicated logical operations such that they cover all of SVE_F. gcc/ChangeLog: * config/aarch64/aarch64-sve.md (*fcm<cmp_op><mode>_and_combine): Extend from SVE_FULL_F to SVE_F. (*fcmuo<mode>_and_combine): Likewise. (*fcm<cmp_op><mode>_bic_combine): Likewise. (*fcm<cmp_op><mode>_nor_combine): Likewise. (*fcmuo<mode>_bic_combine): Likewise. (*fcmuo<mode>_nor_combine): Likewise. Move the comment here to above fcmuo<mode>_bic_combine, since it applies to both patterns. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/unpacked_fcm_combines_1.c: New test. * gcc.target/aarch64/sve/unpacked_fcm_combines_2.c: Likewise.
2025-07-14  amdgcn: fix vec_ucmp infinite recursion  (Andrew Stubbs; 1 file; -3/+3)
I suppose this pattern doesn't get used much! The unsigned compare was meant to be defined using the signed compare pattern, but actually ended up trying to recursively call itself. This patch fixes the issue in the obvious way. gcc/ChangeLog: * config/gcn/gcn-valu.md (vec_cmpu<mode>di_exec): Call gen_vec_cmp*, not gen_vec_cmpu*.
2025-07-14  s390: Implement reduction optabs  (Juergen Christ; 1 file; -5/+288)
Implementation and tests for the standard reduction optabs. Signed-off-by: Juergen Christ <jchrist@linux.ibm.com> gcc/ChangeLog: * config/s390/vector.md (reduc_plus_scal_<mode>): Implement. (reduc_plus_scal_v2df): Implement. (reduc_plus_scal_v4sf): Implement. (REDUC_FMINMAX): New int iterator. (reduc_fminmax_name): New int attribute. (reduc_minmax): New code iterator. (reduc_minmax_name): New code attribute. (reduc_<reduc_fminmax_name>_scal_v2df): Implement. (reduc_<reduc_fminmax_name>_scal_v4sf): Implement. (reduc_<reduc_minmax_name>_scal_v2df): Implement. (reduc_<reduc_minmax_name>_scal_v4sf): Implement. (REDUCBIN): New code iterator. (reduc_bin_insn): New code attribute. (reduc_<reduc_bin_insn>_scal_v2di): Implement. (reduc_<reduc_bin_insn>_scal_v4si): Implement. (reduc_<reduc_bin_insn>_scal_v8hi): Implement. (reduc_<reduc_bin_insn>_scal_v16qi): Implement. gcc/testsuite/ChangeLog: * lib/target-supports.exp: Add s390 to vect_logical_reduc targets. * gcc.target/s390/vector/reduc-binops-1.c: New test. * gcc.target/s390/vector/reduc-minmax-1.c: New test. * gcc.target/s390/vector/reduc-plus-1.c: New test.
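As a reminder of what these optabs compute, a scalar reference for one of them (illustrative only; the vector implementation is free to reassociate and follows the element order of the underlying instructions):

    /* Scalar reference for reduc_plus_scal_v4sf: reduce the four
       elements of a V4SF vector to a single scalar sum.  */
    static float
    reduc_plus_v4sf_ref (const float v[4])
    {
      return v[0] + v[1] + v[2] + v[3];
    }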
2025-07-14  s390: Remove min-vect-loop-bound override  (Juergen Christ; 1 file; -3/+0)
The default setting of s390 for the parameter min-vect-loop-bound was set to 2 to prevent certain epilogue loop vectorizations in the past. Reevaluation of this parameter shows that this setting now is not needed anymore and sometimes even harmful. Remove the overwrite to align s390 with other backends. Signed-off-by: Juergen Christ <jchrist@linux.ibm.com> gcc/ChangeLog: * config/s390/s390.cc (s390_option_override_internal): Remove override.
2025-07-14  amdgcn: Don't clobber VCC if we don't need to  (Andrew Stubbs; 2 files; -30/+21)
This is a hold-over from GCN3 where v_add always wrote to the condition register, whether you wanted it or not. This hasn't been true since GCN5, and we dropped support for GCN3 a little while ago, so let's fix it. There was actually a latent bug here because some other post-reload splitters were generating v_add instructions without declaring the VCC clobber (at least mul did this), so this should fix some wrong-code bugs also. gcc/ChangeLog: * config/gcn/gcn-valu.md (add<mode>3<exec_clobber>): Rename ... (add<mode>3<exec>): ... to this, remove the clobber, and change the instruction from v_add_co_u32 to v_add_u32. (add<mode>3_dup<exec_clobber>): Rename ... (add<mode>3_dup<exec>): ... to this, and likewise. (sub<mode>3<exec_clobber>): Rename ... (sub<mode>3<exec>): ... to this, and likewise * config/gcn/gcn.md (addsi3): Remove the DI clobber, and change the instruction from v_add_co_u32 to v_add_u32. (addsi3_scc): Likewise. (subsi3): Likewise, but for v_sub_co_u32. (muldi3): Likewise.
2025-07-14  x86: Check all 0s/1s vectors with standard_sse_constant_p  (Uros Bizjak; 1 file; -7/+5)
commit 77473a27bae04da99d6979d43e7bd0a8106f4557 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Jun 26 06:08:51 2025 +0800 x86: Also handle all 1s float vector constant replaces (insn 29 28 30 5 (set (reg:V2SF 107) (mem/u/c:V2SF (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S8 A64])) 2031 {*movv2sf_internal} (expr_list:REG_EQUAL (const_vector:V2SF [ (const_double:SF -QNaN [-QNaN]) repeated x2 ]) (nil))) with (insn 98 13 14 3 (set (reg:V8QI 112) (const_vector:V8QI [ (const_int -1 [0xffffffffffffffff]) repeated x8 ])) -1 (nil)) ... (insn 29 28 30 5 (set (reg:V2SF 107) (subreg:V2SF (reg:V8QI 112) 0)) 2031 {*movv2sf_internal} (expr_list:REG_EQUAL (const_vector:V2SF [ (const_double:SF -QNaN [-QNaN]) repeated x2 ]) (nil))) which leads to pr121015.c: In function ‘render_result_from_bake_h’: pr121015.c:34:1: error: unrecognizable insn: 34 | } | ^ (insn 98 13 14 3 (set (reg:V8QI 112) (const_vector:V8QI [ (const_int -1 [0xffffffffffffffff]) repeated x8 ])) -1 (expr_list:REG_EQUIV (const_vector:V8QI [ (const_int -1 [0xffffffffffffffff]) repeated x8 ]) (nil))) during RTL pass: ira Check all 0s/1s vectors with standard_sse_constant_p to avoid unsupported all 1s vectors. Co-Developed-by: H.J. Lu <hjl.tools@gmail.com> gcc/ PR target/121015 * config/i386/i386-features.cc (ix86_broadcast_inner): Check all 0s/1s vectors with standard_sse_constant_p. gcc/testsuite/ PR target/121015 * gcc.target/i386/pr121015.c: New test.
2025-07-14  x86-64: Add --enable-x86-64-mfentry  (H.J. Lu; 1 file; -1/+10)
When profiling is enabled with shrink wrapping, the mcount call may not be placed at the function entry after pushq %rbp movq %rsp,%rbp As a result, the profile data may be skewed which makes PGO less effective. Add --enable-x86-64-mfentry to enable -mfentry by default to use __fentry__, added to glibc in 2010 by: commit d22e4cc9397ed41534c9422d0b0ffef8c77bfa53 Author: Andi Kleen <ak@linux.intel.com> Date: Sat Aug 7 21:24:05 2010 -0700 x86: Add support for frame pointer less mcount instead of mcount, which is placed before the prologue so that -pg can be used with -fshrink-wrap-separate enabled at -O1. This option is 64-bit only because __fentry__ doesn't support PIC in 32-bit mode. The default is to enable -mfentry when targeting glibc. Also warn -pg without -mfentry with shrink wrapping enabled. The warning is disabled for PIC in 32-bit mode. gcc/ PR target/120881 * config.in: Regenerated. * configure: Likewise. * configure.ac: Add --enable-x86-64-mfentry. * config/i386/i386-options.cc (ix86_option_override_internal): Enable __fentry__ in 64-bit mode if ENABLE_X86_64_MFENTRY is set to 1. Warn -pg without -mfentry with shrink wrapping enabled. * doc/install.texi: Document --enable-x86-64-mfentry. gcc/testsuite/ PR target/120881 * gcc.dg/20021014-1.c: Add additional -mfentry -fno-pic options for x86. * gcc.dg/aru-2.c: Likewise. * gcc.dg/nest.c: Likewise. * gcc.dg/pr32450.c: Likewise. * gcc.dg/pr43643.c: Likewise. * gcc.target/i386/pr104447.c: Likewise. * gcc.target/i386/pr113122-3.c: Likewise. * gcc.target/i386/pr119386-1.c: Add additional -mfentry if not ia32. * gcc.target/i386/pr119386-2.c: Likewise. * gcc.target/i386/pr120881-1a.c: New test. * gcc.target/i386/pr120881-1b.c: Likewise. * gcc.target/i386/pr120881-1c.c: Likewise. * gcc.target/i386/pr120881-1d.c: Likewise. * gcc.target/i386/pr120881-2a.c: Likewise. * gcc.target/i386/pr120881-2b.c: Likewise. * gcc.target/i386/pr82699-1.c: Add additional -mfentry. * lib/target-supports.exp (check_effective_target_fentry): New. Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
2025-07-14  Darwin: account for macOS 26  (Francois-Xavier Coudert; 1 file; -10/+12)
darwin25 will be named macOS 26 (codename Tahoe). This is a change from darwin24, which was macOS 15. We need to adapt the driver to this new numbering scheme. 2025-07-14 François-Xavier Coudert <fxcoudert@gcc.gnu.org> gcc/ChangeLog: PR target/120645 * config/darwin-driver.cc: Account for latest macOS numbering scheme. gcc/testsuite/ChangeLog: * gcc.dg/darwin-minversion-link.c: Account for macOS 26.
2025-07-14  [PATCH v2] RISC-V: Vector-scalar widening multiply-(subtract-)accumulate [PR119100]  (Paul-Antoine Arras; 3 files; -4/+93)
[PR119100] This pattern enables the combine pass (or late-combine, depending on the case) to merge a float_extend'ed vec_duplicate into a plus-mult or minus-mult RTL instruction. Before this patch, we have three instructions, e.g.: fcvt.s.h fa5,fa5 vfmv.v.f v24,fa5 vfmadd.vv v8,v24,v16 After, we get only one: vfwmacc.vf v8,fa5,v16 PR target/119100 gcc/ChangeLog: * config/riscv/autovec-opt.md (*vfwmacc_vf_<mode>): New pattern to handle both vfwmacc and vfwmsac. (*extend_vf_<mode>): New pattern that serves as an intermediate combine step. * config/riscv/vector-iterators.md (vsubel): New mode attribute. This is just the lower-case version of VSUBEL. * config/riscv/vector.md (@pred_widen_mul_<optab><mode>_scalar): Reorder and swap operands to match the RTL emitted by expand, i.e. first float_extend then vec_duplicate. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfwmacc and vfwmsac. * gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise. * gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise. Also check for fcvt and vfmv. * gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise. * gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Add vfwmacc and vfwmsac. * gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise. * gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise. Also check for fcvt and vfmv. * gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise. * gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop.h: Add support for widening variants. * gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop_widen_run.h: New test helper. * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmacc-run-1-f16.c: New test. * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmacc-run-1-f32.c: New test. * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmsac-run-1-f16.c: New test. * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfwmsac-run-1-f32.c: New test.
2025-07-14  aarch64: Implement sme2+faminmax extension.  (Alfie Richards; 3 files; -4/+63)
Implements the sme2+faminmax svamin and svamax intrinsics. gcc/ChangeLog: * config/aarch64/aarch64-sme.md (@aarch64_sme_<faminmax_uns_op><mode>): New patterns. * config/aarch64/aarch64-sve-builtins-sme.def (svamin): New intrinsics. (svamax): New intrinsics. * config/aarch64/aarch64-sve-builtins-sve2.cc (class faminmaximpl): New class. (svamin): New function. (svamax): New function. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sme2/acle-asm/amax_f16_x2.c: New test. * gcc.target/aarch64/sme2/acle-asm/amax_f16_x4.c: New test. * gcc.target/aarch64/sme2/acle-asm/amax_f32_x2.c: New test. * gcc.target/aarch64/sme2/acle-asm/amax_f32_x4.c: New test. * gcc.target/aarch64/sme2/acle-asm/amax_f64_x2.c: New test. * gcc.target/aarch64/sme2/acle-asm/amax_f64_x4.c: New test. * gcc.target/aarch64/sme2/acle-asm/amin_f16_x2.c: New test. * gcc.target/aarch64/sme2/acle-asm/amin_f16_x4.c: New test. * gcc.target/aarch64/sme2/acle-asm/amin_f32_x2.c: New test. * gcc.target/aarch64/sme2/acle-asm/amin_f32_x4.c: New test. * gcc.target/aarch64/sme2/acle-asm/amin_f64_x2.c: New test. * gcc.target/aarch64/sme2/acle-asm/amin_f64_x4.c: New test.
2025-07-14  i386: Remove KEYLOCKER related feature since Panther Lake and Clearwater Forest  (Haochen Jiang; 1 file; -4/+5)
According to the July 2025 SDM, Key Locker will no longer be supported on hardware from 2025 onwards. This means that for Panther Lake and Clearwater Forest the feature will not be enabled. Remove it from those two platforms. gcc/ChangeLog: * config/i386/i386.h (PTA_PANTHERLAKE): Remove KL and WIDEKL. (PTA_CLEARWATERFOREST): Ditto. * doc/invoke.texi: Revise documentation.
2025-07-12  i386: Robustify MMX move patterns  (Uros Bizjak; 1 file; -5/+6)
MMX allows only direct moves from zero, so correct V_32:mode and v2qi move patterns to allow only nonimm_or_0_operand as their input operand. gcc/ChangeLog: * config/i386/mmx.md (mov<V_32:mode>): Use nonimm_or_0_operand predicate for operand 1. (*mov<V_32:mode>_internal): Ditto. (movv2qi): Ditto. (*movv2qi_internal): Ditto. Use ix86_hardreg_mov_ok in insn condition.
2025-07-11  aarch64: Tweak handling of general SVE permutes [PR121027]  (Richard Sandiford; 1 file; -5/+16)
This PR is partly about a code quality regression that was triggered by g:caa7a99a052929d5970677c5b639e1fa5166e334. That patch taught the gimple optimisers to fold two VEC_PERM_EXPRs into one, conditional upon either (a) the original permutations not being "native" operations or (b) the combined permutation being a "native" operation. Whether something is a "native" operation is tested by calling can_vec_perm_const_p with allow_variable_p set to false. This requires the permutation to be supported directly by TARGET_VECTORIZE_VEC_PERM_CONST, rather than falling back to the general vec_perm optab. This exposed a problem with the way that we handled general 2-input permutations for SVE. Unlike Advanced SIMD, base SVE does not have an instruction to do general 2-input permutations. We do still implement the vec_perm optab for SVE, but only when the vector length is known at compile time. The general expansion is pretty expensive: an AND, a SUB, two TBLs, and an ORR. It certainly couldn't be considered a "native" operation. However, if a VEC_PERM_EXPR has a constant selector, the indices can be wider than the elements being permuted. This is not true for the vec_perm optab, where the indices and permuted elements must have the same precision. This leads to one case where we cannot leave a general 2-input permutation to be handled by the vec_perm optab: when permuting bytes on a target with 2048-bit vectors. In that case, the indices of the elements in the second vector are in the range [256, 511], which cannot be stored in a byte index. TARGET_VECTORIZE_VEC_PERM_CONST therefore has to handle 2-input SVE permutations for one specific case. Rather than check for that specific case, the code went ahead and used the vec_perm expansion whenever it worked. But that undermines the !allow_variable_p handling in can_vec_perm_const_p; it becomes impossible for target-independent code to distinguish "native" operations from the worst-case fallback. This patch instead limits TARGET_VECTORIZE_VEC_PERM_CONST to the cases that it has to handle. It fixes the PR for all vector lengths except 2048 bits. A better fix would be to introduce some sort of costing mechanism, which would allow us to reject the new VEC_PERM_EXPR even for 2048-bit targets. But that would be a significant amount of work and would not be backportable. gcc/ PR target/121027 * config/aarch64/aarch64.cc (aarch64_evpc_sve_tbl): Punt on 2-input operations that can be handled by vec_perm. gcc/testsuite/ PR target/121027 * gcc.target/aarch64/sve/acle/general/perm_1.c: New test.
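A rough scalar model of the generic two-input permute expansion mentioned above (the AND, SUB, two TBLs, and ORR); this is my own illustration for byte elements, not the RTL the backend emits, and it models TBL's out-of-range behaviour with explicit range checks rather than unsigned wraparound (it also assumes a power-of-two element count).

    #include <stdint.h>

    static void
    vec_perm2_ref (uint8_t *out, const uint8_t *in0, const uint8_t *in1,
                   const unsigned *sel, unsigned nelts)
    {
      for (unsigned i = 0; i < nelts; i++)
        {
          unsigned idx = sel[i] & (2 * nelts - 1);             /* AND  */
          uint8_t from0 = idx < nelts ? in0[idx] : 0;           /* TBL #1: 0 if out of range  */
          uint8_t from1 = idx >= nelts ? in1[idx - nelts] : 0;  /* SUB + TBL #2  */
          out[i] = from0 | from1;                               /* ORR  */
        }
    }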
2025-07-11  aarch64: Use EOR3 for DImode values  (Kyrylo Tkachov; 1 file; -0/+25)
Similar to BCAX, we can use EOR3 for DImode, but we have to be careful not to force GP<->SIMD moves unnecessarily, so add a splitter for that case. So for input: uint64_t eor3_d_gp (uint64_t a, uint64_t b, uint64_t c) { return EOR3 (a, b, c); } uint64x1_t eor3_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return EOR3 (a, b, c); } We generate the desired: eor3_d_gp: eor x1, x1, x2 eor x0, x1, x0 ret eor3_d: eor3 v0.16b, v0.16b, v1.16b, v2.16b ret Bootstrapped and tested on aarch64-none-linux-gnu. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> gcc/ * config/aarch64/aarch64-simd.md (*eor3qdi4): New define_insn_and_split. gcc/testsuite/ * gcc.target/aarch64/simd/eor3_d.c: Add tests for DImode operands.
2025-07-11  aarch64: Handle DImode BCAX operations  (Kyrylo Tkachov; 1 file; -0/+29)
To handle DImode BCAX operations we want to do them on the SIMD side only if the incoming arguments don't require a cross-bank move. This means we need to split the combination back into separate GP BIC+EOR instructions if the operands are expected to be in GP regs through reload. The split happens pre-reload if we already know that the destination will be a GP reg. Otherwise, if reload decides to use the "=r,r" alternative we ensure operand 0 is early-clobber. This scheme is similar to how we handle the BSL operations elsewhere in aarch64-simd.md. Thus, for the functions: uint64_t bcax_d_gp (uint64_t a, uint64_t b, uint64_t c) { return BCAX (a, b, c); } uint64x1_t bcax_d (uint64x1_t a, uint64x1_t b, uint64x1_t c) { return BCAX (a, b, c); } we now generate the desired: bcax_d_gp: bic x1, x1, x2 eor x0, x1, x0 ret bcax_d: bcax v0.16b, v0.16b, v1.16b, v2.16b ret When the inputs are in SIMD regs we use BCAX and when they are in GP regs we don't force them to SIMD with extra moves. Bootstrapped and tested on aarch64-none-linux-gnu. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> gcc/ * config/aarch64/aarch64-simd.md (*bcaxqdi4): New define_insn_and_split. gcc/testsuite/ * gcc.target/aarch64/simd/bcax_d.c: Add tests for DImode arguments.
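For clarity, the GP-register split shown above computes a ^ (b & ~c); a scalar sketch (assuming the BCAX macro used in the tests expands to this, which matches the BIC/EOR sequence in the generated code):

    #include <stdint.h>

    static uint64_t
    bcax_ref (uint64_t a, uint64_t b, uint64_t c)
    {
      uint64_t bic = b & ~c;  /* BIC x1, x1, x2  */
      return a ^ bic;         /* EOR x0, x1, x0  */
    }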
2025-07-11  aarch64: Use EOR3 for 64-bit vector modes  (Kyrylo Tkachov; 1 file; -6/+6)
Similar to the BCAX patch, we can also use EOR3 for 64-bit modes, just by adjusting the mode iterator used. Thus for input: uint32x2_t bcax_s (uint32x2_t a, uint32x2_t b, uint32x2_t c) { return EOR3 (a, b, c); } we now generate: bcax_s: eor3 v0.16b, v0.16b, v1.16b, v2.16b ret instead of: bcax_s: eor v1.8b, v1.8b, v2.8b eor v0.8b, v1.8b, v0.8b ret Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> gcc/ * config/aarch64/aarch64-simd.md (eor3q<mode>4): Use VDQ_I mode iterator. gcc/testsuite/ * gcc.target/aarch64/simd/eor3_d.c: New test.
2025-07-11  aarch64: Allow 64-bit vector modes in pattern for BCAX instruction  (Kyrylo Tkachov; 1 file; -6/+6)
The BCAX instruction from TARGET_SHA3 only operates on the full .16b form of the inputs but as it's a pure bitwise operation we can use it for the 64-bit modes as well as there we don't care about the upper 64 bits. This patch extends the relevant pattern in aarch64-simd.md to accept the 64-bit vector modes. Thus, for the input: uint32x2_t bcax_s (uint32x2_t a, uint32x2_t b, uint32x2_t c) { return BCAX (a, b, c); } we can now generate: bcax_s: bcax v0.16b, v0.16b, v1.16b, v2.16b ret instead of the current: bcax_s: bic v1.8b, v1.8b, v2.8b eor v0.8b, v1.8b, v0.8b ret This patch doesn't cover the DI/V1DI modes as that would require extending the bcaxqdi4 pattern with =r,r alternatives and adding splitting logic to handle the cases where the operands arrive in GP regs. It is doable, but can be a separate patch. This patch as is should be a straightforward improvement always. Bootstrapped and tested on aarch64-none-linux-gnu. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> gcc/ * config/aarch64/aarch64-simd.md (bcaxq<mode>4): Use VDQ_I mode iterator. gcc/testsuite/ * gcc.target/aarch64/simd/bcax_d.c: New test.
2025-07-11  i386: Add a new peephole2 for PR91384 under APX_F  (Hu, Lin1; 1 file; -0/+11)
gcc/ChangeLog: PR target/91384 * config/i386/i386.md: Add a new peephole2 to optimize *negsi_1 followed by *cmpsi_ccno_1 with APX_F. gcc/testsuite/ChangeLog: PR target/91384 * gcc.target/i386/pr91384-1.c: New test.
2025-07-11  properly compute fp/mode for scalar ops for vectorizer costing  (Richard Biener; 1 file; -0/+8)
The x86 add_stmt_hook relies on the passed vectype to determine the mode and whether it is FP for a scalar operation. This is unreliable now for stmts involving patterns and in the future when there is no vector type passed for scalar operations. To be least disruptive I've kept using the vector type if it is passed. * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Use the LHS of a scalar stmt to determine mode and whether it is FP.
2025-07-10  aarch64: Guard VF-based costing with !m_costing_for_scalar  (Richard Sandiford; 1 file; -1/+1)
g:4b47acfe2b626d1276e229a0cf165e934813df6c caused a segfault in aarch64_vector_costs::analyze_loop_vinfo when costing scalar code, since we'd end up dividing by a zero VF. Much of the structure of the aarch64 costing code dates from a stage 4 patch, when we had to work within the bounds of what the target-independent code did. Some of it could do with a rework now that we're not so constrained. This patch is therefore an emergency fix rather than the best long-term solution. I'll revisit when I have more time to think about it. gcc/ * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost): Guard VF-based costing with !m_costing_for_scalar.
2025-07-10  aarch64: Fix LD1Q and ST1Q failures for big-endian  (Richard Sandiford; 2 files; -8/+18)
LD1Q gathers and ST1Q scatters are unusual in that they operate on 128-bit blocks (effectively VNx1TI). However, we don't have modes or ACLE types for 128-bit integers, and 128-bit integers are not the intended use case. Instead, the instructions are intended to be used in "hybrid VLA" operations, where each 128-bit block is an Advanced SIMD vector. The normal SVE modes therefore capture the intended use case better than VNx1TI would. For example, VNx2DI is effectively N copies of V2DI, VNx4SI N copies of V4SI, etc. Since there is only one LD1Q instruction and one ST1Q instruction, the ACLE support used a single pattern for each, with the loaded or stored data having mode VNx2DI. The ST1Q pattern was generated by: rtx data = e.args.last (); e.args.last () = force_lowpart_subreg (VNx2DImode, data, GET_MODE (data)); e.prepare_gather_address_operands (1, false); return e.use_exact_insn (CODE_FOR_aarch64_scatter_st1q); where the force_lowpart_subreg bitcast the stored data to VNx2DI. But such subregs require an element reverse on big-endian targets (see the comment at the head of aarch64-sve.md), which wasn't the intention. The code should have used aarch64_sve_reinterpret instead. The LD1Q pattern was used as follows: e.prepare_gather_address_operands (1, false); return e.use_exact_insn (CODE_FOR_aarch64_gather_ld1q); which always returns a VNx2DI value, leaving the caller to bitcast that to the correct mode. That bitcast again uses subregs and has the same problem as above. However, for the reasons explained in the comment, using aarch64_sve_reinterpret does not work well for LD1Q. The patch instead parameterises the LD1Q based on the required data mode. gcc/ * config/aarch64/aarch64-sve2.md (aarch64_gather_ld1q): Replace with... (@aarch64_gather_ld1q<mode>): ...this, parameterizing based on mode. * config/aarch64/aarch64-sve-builtins-sve2.cc (svld1q_gather_impl::expand): Update accordingly. (svst1q_scatter_impl::expand): Use aarch64_sve_reinterpret instead of force_lowpart_subreg.
2025-07-10  RISC-V: Make zero-stride load broadcast a tunable.  (Robin Dapp; 7 files; -34/+133)
This patch makes the zero-stride load broadcast idiom dependent on a uarch-tunable "use_zero_stride_load". Right now we have quite a few paths that reach a strided load and some of them are not exactly straightforward. While broadcast is relatively rare on rv64 targets it is more common on rv32 targets that want to vectorize 64-bit elements. While the patch is more involved than I would have liked it could have even touched more places. The whole broadcast-like insn path feels a bit hackish due to the several optimizations we employ. Some of the complications stem from the fact that we lump together real broadcasts, vector single-element sets, and strided broadcasts. The strided-load alternatives currently require a memory_constraint to work properly which causes more complications when trying to disable just these. In short, the whole pred_broadcast handling in combination with the sew64_scalar_helper could use work in the future. I was about to start with it in this patch but soon realized that it would only distract from the original intent. What can help in the future is split strided and non-strided broadcast entirely, as well as the single-element sets. Yet unclear is whether we need to pay special attention for misaligned strided loads (PR120782). I regtested on rv32 and rv64 with strided_load_broadcast_p forced to true and false. With either I didn't observe any new execution failures but obviously there are new scan failures with strided broadcast turned off. PR target/118734 gcc/ChangeLog: * config/riscv/constraints.md (Wdm): Use tunable for Wdm constraint. * config/riscv/riscv-protos.h (emit_avltype_insn): Declare. (can_be_broadcasted_p): Rename to... (can_be_broadcast_p): ...this. * config/riscv/predicates.md: Use renamed function. (strided_load_broadcast_p): Declare. * config/riscv/riscv-selftests.cc (run_broadcast_selftests): Only run broadcast selftest if strided broadcasts are OK. * config/riscv/riscv-v.cc (emit_avltype_insn): New function. (sew64_scalar_helper): Only emit a pred_broadcast if the new tunable says so. (can_be_broadcasted_p): Rename to... (can_be_broadcast_p): ...this and use new tunable. * config/riscv/riscv.cc (struct riscv_tune_param): Add strided broad tunable. (strided_load_broadcast_p): Implement. * config/riscv/vector.md: Use strided_load_broadcast_p () and work around 64-bit broadcast on rv32 targets.
2025-07-10  [RISC-V] Detect new fusions for RISC-V  (Daniel Barboza; 1 file; -1/+382)
This is primarily Daniel's work... He's chasing things in QEMU & LLVM right now so I'm doing a bit of clean-up and shepherding this patch forward. -- Instruction fusion is a reasonably common way to improve the performance of code on many architectures/designs. A few years ago we submitted (via VRULL I suspect) fusion support for a number of cases in the RISC-V space. We made each type of fusion selectable independently in the tuning structure so that designs which implemented some particular set of fusions could select just the ones their design implemented. This patch adds to that generic infrastructure. In particular we're introducing additional load fusions, store pair fusions, bitfield extractions and a few B extension related fusions. Conceptually for the new load fusions we're adding the ability to fuse most add/shNadd instructions with a subsequent load. There's a couple of exceptions, but in general the expectation is that if we have add/shNadd for address computation, then they can potentially use with the load where the address gets used. We've had limited forms of store pair fusion for a while. Essentially we required both stores to be 64 bits wide and land on opposite sides of a 128 bit cache line. That was enough to help prologues and a few other things, but was fairly restrictive. The new cases capture store pairs where the two stores have the same size and hit consecutive memory locations. For example, storing consecutive bytes with sb+sb is fusible. For bitfield extractions we can fuse together a shift left followed by a shift right for arbitrary shift counts where as previously we restricted the shift counts to those implementing sign/zero extensions of 8, and 16 bit objects. Finally some B extension fusions. orc.b+not which shows up in string comparisons, ctz+andi (deepsjeng?), neg+max (synthesized abs). I hope these prove to be useful to other RISC-V designs. I wouldn't be surprised if we have to break down the new load fusions further for some designs. If we need to do that it wouldn't be hard. FWIW, our data indicates the generalized store fusions followed by the expanded load fusions are the most important cases for the new code. These have been tested with crosses and bootstrapped on the BPI. Waiting on pre-commit CI before moving forward (though it has been failing to pick up some patches recently...) gcc/ * config/riscv/riscv.cc (riscv_fusion_pairs): Add new cases. (riscv_set_is_add): New function. (riscv_set_is_addi, riscv_set_is_adduw, riscv_set_is_shNadd): Likewise. (riscv_set_is_shNadduw): Likewise. (riscv_macro_fusion_pair_p): Add new fusion cases. Co-authored-by: Jeff Law <jlaw@ventanamicro.com>
2025-07-10  aarch64: PR target/120999: Adjust operands for movprfx alternative of NBSL implementation of NOR  (Kyrylo Tkachov; 1 file; -1/+1)
While the SVE2 NBSL instruction accepts MOVPRFX to add more flexibility due to its tied operands, the destination of the movprfx cannot also be a source operand. But the offending pattern in aarch64-sve2.md tries to do exactly that for the "=?&w,w,w" alternative and gas warns for the attached testcase. This patch adjusts that alternative to avoid taking operand 0 as an input in the NBSL again. So for the testcase in the patch we now generate: nor_z: movprfx z0, z1 nbsl z0.d, z0.d, z2.d, z1.d ret instead of the previous: nor_z: movprfx z0, z1 nbsl z0.d, z0.d, z2.d, z0.d ret which generated a gas warning. Bootstrapped and tested on aarch64-none-linux-gnu. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> gcc/ PR target/120999 * config/aarch64/aarch64-sve2.md (*aarch64_sve2_nor<mode>): Adjust movprfx alternative. gcc/testsuite/ PR target/120999 * gcc.target/aarch64/sve2/pr120999.c: New test.
2025-07-10  aarch64: Extend HVLA permutations to big-endian  (Richard Sandiford; 1 file; -1/+0)
TARGET_VECTORIZE_VEC_PERM_CONST has code to match the SVE2.1 "hybrid VLA" DUPQ, EXTQ, UZPQ{1,2}, and ZIPQ{1,2} instructions. This matching was conditional on !BYTES_BIG_ENDIAN. The ACLE code also lowered the associated SVE2.1 intrinsics into suitable VEC_PERM_EXPRs. This lowering was not conditional on !BYTES_BIG_ENDIAN. The mismatch led to lots of ICEs in the ACLE tests on big-endian targets: we lowered to VEC_PERM_EXPRs that are not supported. I think the !BYTES_BIG_ENDIAN restriction was unnecessary. SVE maps the first memory element to the least significant end of the register for both endiannesses, so no endian correction or lane number adjustment is necessary. This is in some ways a bit counterintuitive. ZIPQ1 is conceptually "apply Advanced SIMD ZIP1 to each 128-bit block" and endianness does matter when choosing between Advanced SIMD ZIP1 and ZIP2. For example, the V4SI permute selector { 0, 4, 1, 5 } corresponds to ZIP1 for little- endian and ZIP2 for big-endian. But the difference between the hybrid VLA and Advanced SIMD permute selectors is a consequence of the difference between the SVE and Advanced SIMD element orders. The same thing applies to ACLE intrinsics. The current lowering of svzipq1 etc. is correct for both endiannesses. If ACLE code does: 2x svld1_s32 + svzipq1_s32 + svst1_s32 then the byte-for-byte result is the same for both endiannesses. On big-endian targets, this is different from using the Advanced SIMD sequence below for each 128-bit block: 2x LDR + ZIP1 + STR In contrast, the byte-for-byte result of: 2x svld1q_gather_s32 + svzipq1_s32 + svst11_scatter_s32 depends on endianness, since the quadword gathers and scatters use Advanced SIMD byte ordering for each 128-bit block. This gather/scatter sequence behaves in the same way as the Advanced SIMD LDR+ZIP1+STR sequence for both endiannesses. Programmers writing ACLE code have to be aware of this difference if they want to support both endiannesses. The patch includes some new execution tests to verify the expansion of the VEC_PERM_EXPRs. gcc/ * doc/sourcebuild.texi (aarch64_sve2_hw, aarch64_sve2p1_hw): Document. * config/aarch64/aarch64.cc (aarch64_evpc_hvla): Extend to BYTES_BIG_ENDIAN. gcc/testsuite/ * lib/target-supports.exp (check_effective_target_aarch64_sve2p1_hw): New proc. * gcc.target/aarch64/sve2/dupq_1.c: Extend to big-endian. Add noipa attributes. * gcc.target/aarch64/sve2/extq_1.c: Likewise. * gcc.target/aarch64/sve2/uzpq_1.c: Likewise. * gcc.target/aarch64/sve2/zipq_1.c: Likewise. * gcc.target/aarch64/sve2/dupq_1_run.c: New test. * gcc.target/aarch64/sve2/extq_1_run.c: Likewise. * gcc.target/aarch64/sve2/uzpq_1_run.c: Likewise. * gcc.target/aarch64/sve2/zipq_1_run.c: Likewise.
2025-07-10  Comment spelling fix: tunning -> tuning  (Jakub Jelinek; 1 file; -1/+1)
Kyrylo noticed another spelling bug and like usually, the same mistake happens in multiple places. 2025-07-10 Jakub Jelinek <jakub@redhat.com> * config/i386/x86-tune.def: Change "Tunning the" to "tuning" in comment and use semicolon instead of dot in comment. * loop-unroll.cc (decide_unroll_stupid): Comment spelling fix, tunning -> tuning.
2025-07-10  Change bellow in comments to below  (Jakub Jelinek; 2 files; -2/+2)
While I'm not a native English speaker, I believe all the uses of bellow (roar/bark/...) in comments in gcc are meant to be below (beneath/under/...). 2025-07-10 Jakub Jelinek <jakub@redhat.com> gcc/ * tree-vect-loop.cc (scale_profile_for_vect_loop): Comment spelling fix: bellow -> below. * ipa-polymorphic-call.cc (record_known_type): Likewise. * config/i386/x86-tune.def: Likewise. * config/riscv/vector.md (*vsetvldi_no_side_effects_si_extend): Likewise. * tree-scalar-evolution.cc (iv_can_overflow_p): Likewise. * ipa-devirt.cc (add_type_duplicate): Likewise. * tree-ssa-loop-niter.cc (maybe_lower_iteration_bound): Likewise. * gimple-ssa-sccopy.cc: Likewise. * cgraphunit.cc: Likewise. * graphite.h (struct poly_dr): Likewise. * ipa-reference.cc (ignore_edge_p): Likewise. * tree-ssa-alias.cc (ao_compare::compare_ao_refs): Likewise. * profile-count.h (profile_probability::probably_reliable_p): Likewise. * ipa-inline-transform.cc (inline_call): Likewise. gcc/ada/ * par-load.adb: Comment spelling fix: bellow -> below. * libgnarl/s-taskin.ads: Likewise. gcc/testsuite/ * gfortran.dg/g77/980310-3.f: Comment spelling fix: bellow -> below. * jit.dg/test-debuginfo.c: Likewise. libstdc++-v3/ * testsuite/22_locale/codecvt/codecvt_unicode.h (ucs2_to_utf8_out_error): Comment spelling fix: bellow -> below. (utf16_to_ucs2_in_error): Likewise.
2025-07-09  aarch64: Fix endianness of DFmode vector constants  (Richard Sandiford; 1 file; -0/+2)
aarch64_simd_valid_imm tries to decompose a constant into a repeating series of 64 bits, since most Advanced SIMD and SVE immediate forms require that. (The exceptions are handled first.) It does this by building up a byte-level register image, lsb first. If the image does turn out to repeat every 64 bits, it loads the first 64 bits into an integer. At this point, endianness has mostly been dealt with. Endianness applies to transfers between registers and memory, whereas at this point we're dealing purely with register values. However, one of things we try is to bitcast the value to a float and use FMOV. This involves splitting the value into 32-bit chunks (stored as longs) and passing them to real_from_target. The problem being fixed by this patch is that, when a value spans multiple 32-bit chunks, real_from_target expects them to be in memory rather than register order. Thus index 0 is the most significant chunk if FLOAT_WORDS_BIG_ENDIAN and the least significant chunk otherwise. This fixes aarch64/sve/cond_fadd_1.c and various other tests for aarch64_be-elf. gcc/ * config/aarch64/aarch64.cc (aarch64_simd_valid_imm): Account for FLOAT_WORDS_BIG_ENDIAN when building a floating-point value.
2025-07-09  aarch64: Some fixes for SVE INDEX constants  (Richard Sandiford; 1 file; -6/+53)
When using SVE INDEX to load an Advanced SIMD vector, we need to take account of the different element ordering for big-endian targets. For example, when big-endian targets store the V4SI constant { 0, 1, 2, 3 } in registers, 0 becomes the most significant element, whereas INDEX always operates from the least significant element. A big-endian target would therefore load V4SI { 0, 1, 2, 3 } using: INDEX Z0.S, #3, #-1 rather than little-endian's: INDEX Z0.S, #0, #1 While there, I noticed that we would only check the first vector in a multi-vector SVE constant, which would trigger an ICE if the other vectors turned out to be invalid. This is pretty difficult to trigger at the moment, since we only allow single-register modes to be used as frontend & middle-end vector modes, but it can be seen using the RTL frontend. gcc/ * config/aarch64/aarch64.cc (aarch64_sve_index_series_p): New function, split out from... (aarch64_simd_valid_imm): ...here. Account for the different SVE and Advanced SIMD element orders on big-endian targets. Check each vector in a structure mode. gcc/testsuite/ * gcc.dg/rtl/aarch64/vec-series-1.c: New test. * gcc.dg/rtl/aarch64/vec-series-2.c: Likewise. * gcc.target/aarch64/sve/acle/general/dupq_2.c: Fix expected output for this big-endian test. * gcc.target/aarch64/sve/acle/general/dupq_4.c: Likewise. * gcc.target/aarch64/sve/vec_init_3.c: Restrict to little-endian targets and add more tests. * gcc.target/aarch64/sve/vec_init_4.c: New big-endian version of vec_init_3.c.
2025-07-09  RISC-V: Combine vec_duplicate + vssub.vv to vssub.vx on GR2VR cost  (Pan Li; 3 files; -1/+4)
This patch would like to combine the vec_duplicate + vssub.vv to the vssub.vx. From example as below code. The related pattern will depend on the cost of vec_duplicate from GR2VR. Then the late-combine will take action if the cost of GR2VR is zero, and reject the combination if the GR2VR cost is greater than zero. Assume we have example code like below, GR2VR cost is 0. #define DEF_SAT_S_ADD(T, UT, MIN, MAX) \ T \ test_##T##_sat_add (T x, T y) \ { \ T sum = (UT)x + (UT)y; \ return (x ^ y) < 0 \ ? sum \ : (sum ^ x) >= 0 \ ? sum \ : x < 0 ? MIN : MAX; \ } DEF_SAT_S_ADD(int32_t, uint32_t, INT32_MIN, INT32_MAX) DEF_VX_BINARY_CASE_2_WRAP(T, SAT_S_ADD_FUNC(T), sat_add) Before this patch: 10 │ test_vx_binary_or_int32_t_case_0: 11 │ beq a3,zero,.L8 12 │ vsetvli a5,zero,e32,m1,ta,ma 13 │ vmv.v.x v2,a2 14 │ slli a3,a3,32 15 │ srli a3,a3,32 16 │ .L3: 17 │ vsetvli a5,a3,e32,m1,ta,ma 18 │ vle32.v v1,0(a1) 19 │ slli a4,a5,2 20 │ sub a3,a3,a5 21 │ add a1,a1,a4 22 │ vssub.vv v1,v1,v2 23 │ vse32.v v1,0(a0) 24 │ add a0,a0,a4 25 │ bne a3,zero,.L3 After this patch: 10 │ test_vx_binary_or_int32_t_case_0: 11 │ beq a3,zero,.L8 12 │ slli a3,a3,32 13 │ srli a3,a3,32 14 │ .L3: 15 │ vsetvli a5,a3,e32,m1,ta,ma 16 │ vle32.v v1,0(a1) 17 │ slli a4,a5,2 18 │ sub a3,a3,a5 19 │ add a1,a1,a4 20 │ vssub.vx v1,v1,a2 21 │ vse32.v v1,0(a0) 22 │ add a0,a0,a4 23 │ bne a3,zero,.L3 gcc/ChangeLog: * config/riscv/riscv-v.cc (expand_vx_binary_vec_vec_dup): Add new case SS_MINUS. * config/riscv/riscv.cc (riscv_rtx_costs): Ditto. * config/riscv/vector-iterators.md: Add new op ss_minus. Signed-off-by: Pan Li <pan2.li@intel.com>
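For completeness, the vssub.vx instruction formed here performs a signed saturating subtract of the scalar from each vector element; a scalar reference of that operation (my own sketch, separate from the saturating-add macro quoted in the log):

    #include <stdint.h>

    static int32_t
    sat_sub_i32 (int32_t x, int32_t y)
    {
      int64_t d = (int64_t) x - y;          /* widen to avoid overflow  */
      if (d > INT32_MAX) return INT32_MAX;  /* saturate high  */
      if (d < INT32_MIN) return INT32_MIN;  /* saturate low  */
      return (int32_t) d;
    }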