path: root/gcc/config
12 days  RISC-V: Add support for the XAndesvpackfph ISA extension.  [Kuan-Lin Chen; 9 files, -0/+113]

This extension defines vector instructions that extract a pair of FP16
values from a floating-point register, multiply the top FP16 value with
the FP16 vector elements, and add the bottom FP16 value to the result
(a scalar model follows below).

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc: Turn on VECTOR_ELEN_FP_16 for XAndesvpackfph.
* config/riscv/andes-vector-builtins-bases.cc (nds_vfpmad): New class.
* config/riscv/andes-vector-builtins-bases.h: New def.
* config/riscv/andes-vector-builtins-functions.def (nds_vfpmadt): Ditto.
(nds_vfpmadb): Ditto.
(nds_vfpmadt_frm): Ditto.
(nds_vfpmadb_frm): Ditto.
* config/riscv/andes-vector.md (@pred_nds_vfpmad<nds_tb><mode>): New pattern.
* config/riscv/riscv-vector-builtins-types.def (DEF_RVV_F16_OPS): New def.
* config/riscv/riscv-vector-builtins.cc (f16_ops): Ditto.
* config/riscv/riscv-vector-builtins.def (float32_type_node): Ditto.
* config/riscv/riscv-vector-builtins.h (XANDESVPACKFPH_EXT): Ditto.
(required_ext_to_isa_name): Add case XANDESVPACKFPH_EXT.
(required_extensions_specified): Ditto.
* config/riscv/vector-iterators.md (VHF): New iterator.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/xandesvector/non-policy/non-overloaded/nds_vfpmadb.c: New test.
* gcc.target/riscv/rvv/xandesvector/non-policy/non-overloaded/nds_vfpmadt.c: New test.
* gcc.target/riscv/rvv/xandesvector/non-policy/overloaded/nds_vfpmadb.c: New test.
* gcc.target/riscv/rvv/xandesvector/non-policy/overloaded/nds_vfpmadt.c: New test.
* gcc.target/riscv/rvv/xandesvector/policy/non-overloaded/nds_vfpmadb.c: New test.
* gcc.target/riscv/rvv/xandesvector/policy/non-overloaded/nds_vfpmadt.c: New test.
* gcc.target/riscv/rvv/xandesvector/policy/overloaded/nds_vfpmadb.c: New test.
* gcc.target/riscv/rvv/xandesvector/policy/overloaded/nds_vfpmadt.c: New test.

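As a rough scalar model of the operation described above (a sketch
only: the real instruction operates on whole vector registers, the
function name is hypothetical, and the role of the "b" form is an
assumption):

  /* Hypothetical per-element model of vfpmadt: the scalar FP register
     holds two FP16 values, "top" and "bottom".  Each vector element is
     multiplied by the top half and the bottom half is added; the "b"
     form presumably swaps the roles of top and bottom.  */
  _Float16 nds_vfpmadt_model (_Float16 elem, _Float16 top, _Float16 bottom)
  {
    return (_Float16) (elem * top + bottom);
  }
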
13 days  AVR: ad target/121794 - Invoke zero_reg less.  [Georg-Johann Lay; 1 file, -5/+5]

gcc/
PR target/121794
* config/avr/avr.md (cmpqi3): Use cpi R,0 if possible.

13 days  RISC-V: Combine vec_duplicate + vnmsub.vv to vnmsub.vx on GR2VR cost  [Pan Li; 2 files, -27/+31]

This patch combines vec_duplicate + vnmsub.vv into vnmsub.vx, as in the
example below. The related pattern depends on the cost of the
vec_duplicate from GR2VR: late-combine performs the combination if the
GR2VR cost is zero and rejects it if the cost is greater than zero.

Assume we have example code like below, with a GR2VR cost of 0.

Before this patch:

  11 │ beq a3,zero,.L8
  12 │ vsetvli a5,zero,e32,m1,ta,ma
  13 │ vmv.v.x v2,a2
  ...
  16 │ .L3:
  17 │ vsetvli a5,a3,e32,m1,ta,ma
  ...
  22 │ vnmsub.vv v1,v2,v3
  ...
  25 │ bne a3,zero,.L3

After this patch:

  11 │ beq a3,zero,.L8
  ...
  14 │ .L3:
  15 │ vsetvli a5,a3,e32,m1,ta,ma
  ...
  20 │ vnmsub.vx v1,a2,v3
  ...
  23 │ bne a3,zero,.L3

gcc/ChangeLog:

* config/riscv/autovec-opt.md (*vnmsac_vx_<mode>): Rename from.
(*mul_minus_vx_<mode>): Rename to and add nmsub support.
* config/riscv/vector.md (@pred_vnmsac_vx_<mode>): Rename from.
(@pred_mul_minus_vx_<mode>): Rename to and add nmsub support.
(*pred_nmsac_<mode>_scalar_undef): Rename from.
(*pred_mul_minus_vx<mode>_undef): Rename to and add nmsub support.

Signed-off-by: Pan Li <pan2.li@intel.com>

13 days  RISC-V: Add support for the XAndesvsintload ISA extension.  [Kuan-Lin Chen; 9 files, -1/+136]

This extension defines vector load instructions to move sign-extended
or zero-extended INT4 data into 8-bit vector register elements (a
scalar model follows below).

gcc/ChangeLog:

* config/riscv/andes-vector-builtins-bases.cc (nds_nibbleload): New class.
* config/riscv/andes-vector-builtins-bases.h (nds_vln8): New def.
(nds_vlnu8): Ditto.
* config/riscv/andes-vector-builtins-functions.def (nds_vln8): Ditto.
(nds_vlnu8): Ditto.
* config/riscv/andes-vector.md (@pred_intload_mov<su><mode>): New pattern.
* config/riscv/riscv-vector-builtins-types.def (DEF_RVV_Q_OPS): New def.
(DEF_RVV_QU_OPS): Ditto.
* config/riscv/riscv-vector-builtins.cc (q_v_void_const_ptr_ops): New operand information.
(qu_v_void_const_ptr_ops): Ditto.
* config/riscv/riscv-vector-builtins.def (void_const_ptr): New def.
* config/riscv/riscv-vector-builtins.h (enum required_ext): Ditto.
(required_ext_to_isa_name): Add case XANDESVSINTLOAD_EXT.
(required_extensions_specified): Ditto.
* config/riscv/vector-iterators.md (NDS_QVI): New iterator.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/xandesvector/non-policy/non-overloaded/nds_vln8.c: New test.
* gcc.target/riscv/rvv/xandesvector/non-policy/overloaded/nds_vln8.c: New test.
* gcc.target/riscv/rvv/xandesvector/policy/non-overloaded/nds_vln8.c: New test.
* gcc.target/riscv/rvv/xandesvector/policy/overloaded/nds_vln8.c: New test.

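A per-element scalar model of the load (a sketch; the low-nibble-first
in-byte order is an assumption, not taken from the specification):

  #include <stdint.h>

  /* Sign-extend a 4-bit value to 8 bits.  */
  static int8_t sext4 (uint8_t nib) { return (int8_t) (nib << 4) >> 4; }

  /* Model of the signed variant (nds_vln8): each source byte carries
     two INT4 elements; the unsigned variant (nds_vlnu8) would
     zero-extend instead.  */
  void vln8_model (int8_t *dst, const uint8_t *src, int n)
  {
    for (int i = 0; i < n; i++)
      {
        uint8_t b = src[i / 2];
        dst[i] = sext4 ((i & 1) ? (b >> 4) : (b & 0xF));
      }
  }
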
13 days  RISC-V: Add support for the XAndesvbfhcvt ISA extension.  [Kuan-Lin Chen; 12 files, -1/+344]

This patch adds support for the XAndesvbfhcvt ISA extension. This
extension defines instructions to perform vector floating-point
conversion between the BFLOAT16 floating-point data and the IEEE-754
32-bit single-precision floating-point (SP) data in a vector register.

gcc/ChangeLog:

* common/config/riscv/riscv-common.cc: Turn on VECTOR_ELEN_BF_16 for XAndesvbfhcvt.
* config.gcc: Add extra_objs andes-vector-builtins-bases.o and extra_headers andes_vector.h.
* config/riscv/riscv-vector-builtins-shapes.cc (BASE_NAME_MAX_LEN): Increase size to 20.
* config/riscv/riscv-vector-builtins.cc (f32_to_bf16_nf_w_ops): New operand information.
(f32_to_bf16_nf_w_ops): New operand information.
(DEF_RVV_FUNCTION): New def.
* config/riscv/riscv-vector-builtins.def (bf16): Ditto.
* config/riscv/riscv-vector-builtins.h (enum required_ext): Ditto.
(required_ext_to_isa_name): Add case XANDESVBFHCVT_EXT.
(required_extensions_specified): Ditto.
* config/riscv/t-riscv: Add andes-vector-builtins-functions.def, andes-vector-builtins-bases.h and andes-vector-builtins-bases.o.
* config/riscv/vector-iterators.md (NDS_VWEXTBF): New iterator.
(NDS_V_DOUBLE_TRUNC_BF): New attr.
* config/riscv/andes-vector-builtins-bases.cc: New file.
* config/riscv/andes-vector-builtins-bases.h: New file.
* config/riscv/andes-vector-builtins-functions.def: New file.
* config/riscv/andes_vector.h: New file.
* config/riscv/andes-vector.md: New file.
* config/riscv/vector.md: Include andes_vector.md.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/rvv.exp: Add regression for xandesvector.
* gcc.target/riscv/rvv/xandesvector/non-policy/non-overloaded/nds_vfncvtbf16s.c: New test.
* gcc.target/riscv/rvv/xandesvector/non-policy/non-overloaded/nds_vfwcvtsbf16.c: New test.
* gcc.target/riscv/rvv/xandesvector/non-policy/overloaded/nds_vfncvtbf16s.c: New test.
* gcc.target/riscv/rvv/xandesvector/non-policy/overloaded/nds_vfwcvtsbf16.c: New test.
* gcc.target/riscv/rvv/xandesvector/policy/non-overloaded/nds_vfncvtbf16s.c: New test.
* gcc.target/riscv/rvv/xandesvector/policy/non-overloaded/nds_vfwcvtsbf16.c: New test.
* gcc.target/riscv/rvv/xandesvector/policy/overloaded/nds_vfncvtbf16s.c: New test.
* gcc.target/riscv/rvv/xandesvector/policy/overloaded/nds_vfwcvtsbf16.c: New test.

13 days  RISC-V: Add tt-ascalon-d8 pipeline description  [Anton Blanchard; 4 files, -2/+357]

Add a pipeline description for the Tenstorrent Ascalon 8-wide CPU.

gcc/ChangeLog

* config/riscv/riscv-cores.def (RISCV_TUNE): Update.
* config/riscv/riscv-opts.h (enum riscv_microarchitecture_type): Add tt_ascalon_d8.
* config/riscv/riscv.md: Update tune attribute and include tt-ascalon-d8.md.
* config/riscv/tt-ascalon-d8.md: New file.

2025-09-05  RISC-V: Check if we can vec_extract [PR121510].  [Robin Dapp; 1 file, -1/+2]

For Zvfhmin a vector mode exists but the corresponding vec_extract does
not. This patch checks that a vec_extract is available and otherwise
falls back to standard handling.

PR target/121510

gcc/ChangeLog:

* config/riscv/riscv.cc (riscv_legitimize_move): Check if we can vec_extract.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr121510.c: New test.

2025-09-05  AVR: target/121794 - Invoke zero_reg less.  [Georg-Johann Lay; 1 file, -27/+47]

There are some cases where invoking zero_reg is not needed and where
there are other sequences with the same efficiency. An example is to
use SBCI R,0 instead of SBC R,__zero_reg__ when R >= R16. This may
turn out to be better for small ISRs.

PR target/121794

gcc/
* config/avr/avr.cc (avr_out_compare): Only use zero_reg when there is no other sequence of the same length.
(avr_out_plus_ext): Same.
(avr_out_plus_1): Same.

2025-09-05  aarch64: Use SVE for V2DImode integer min/max operations  [Kyrylo Tkachov; 3 files, -6/+19]

Unlike Advanced SIMD, SVE has instructions to perform smin, smax, umin,
umax on 64-bit elements. Thus, we can use them with the fixed-width
V2DImode expander. Most of the machinery is already there on the
define_insn side, supporting V2DImode operands of the SVE pattern. We
just need to wire up the RTL emission to the v2di standard names for
the TARGET_SVE case.

So for the smin case we now generate:

min_di:
  ldr   q30, [x0]
  ptrue p7.b, all
  ldr   q31, [x1]
  smin  z30.d, p7/m, z30.d, z31.d
  str   q30, [x2]
  ret

min_imm_di:
  ldr   q31, [x0]
  smin  z31.d, z31.d, #5
  str   q31, [x2]
  ret

instead of the previous:

min_di:
  ldr   q30, [x0]
  ldr   q31, [x1]
  cmgt  v29.2d, v30.2d, v31.2d
  bsl   v29.16b, v31.16b, v30.16b
  str   q29, [x2]
  ret

min_imm_di:
  ldr   q31, [x0]
  mov   z30.d, #5
  cmgt  v29.2d, v30.2d, v31.2d
  bsl   v29.16b, v31.16b, v30.16b
  str   q29, [x2]
  ret

The register operand case is the same length, though the new ptrue can
now be shared and moved away. But the immediate operand case is
obviously better, as the SVE immediate form doesn't require a predicate
operand.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>

gcc/
* config/aarch64/iterators.md (sve_di_suf): New mode attribute.
* config/aarch64/aarch64-sve.md (<optab><mode>3 SVE_INT_BINARY_MULTI): Rename to...
(<optab><mode>3<sve_di_suf>): ... This. Use SVE_I_SIMD_DI mode iterator.
* config/aarch64/aarch64-simd.md (<su><maxmin>v2di3): Use the above for TARGET_SVE.

gcc/testsuite/
* gcc.target/aarch64/sve/usminmax_di.c: New test.

2025-09-05  RISC-V: Combine vec_duplicate + vmadd.vv to vmadd.vx on GR2VR cost  [Pan Li; 2 files, -87/+91]

This patch combines vec_duplicate + vmadd.vv into vmadd.vx, as in the
example below. The related pattern depends on the cost of the
vec_duplicate from GR2VR: late-combine performs the combination if the
GR2VR cost is zero and rejects it if the cost is greater than zero.

Assume we have example code like below, with a GR2VR cost of 0.

Before this patch:

  11 │ beq a3,zero,.L8
  12 │ vsetvli a5,zero,e32,m1,ta,ma
  13 │ vmv.v.x v2,a2
  ...
  16 │ .L3:
  17 │ vsetvli a5,a3,e32,m1,ta,ma
  ...
  22 │ vmadd.vv v1,v2,v3
  ...
  25 │ bne a3,zero,.L3

After this patch:

  11 │ beq a3,zero,.L8
  ...
  14 │ .L3:
  15 │ vsetvli a5,a3,e32,m1,ta,ma
  ...
  20 │ vmadd.vx v1,a2,v3
  ...
  23 │ bne a3,zero,.L3

gcc/ChangeLog:

* config/riscv/autovec-opt.md (*vmacc_vx_<mode>): Rename to handle both the macc and madd.
(*mul_plus_vx_<mode>): Add madd pattern.
* config/riscv/vector.md (@pred_mul_plus_vx_<mode>): Rename to handle both the macc and madd.
(*pred_macc_<mode>_scalar_undef): Remove.
(*pred_nmsac_<mode>_scalar_undef): Remove.
(*pred_mul_plus_vx<mode>_undef): Add new pattern to handle both the vmacc and vmadd.
(@pred_mul_plus_vx<mode>): Ditto.

Signed-off-by: Pan Li <pan2.li@intel.com>

2025-09-04  arm: wrong code from vset_lane_* [PR121775]  [Richard Earnshaw; 1 file, -3/+10]

Insufficient validation of the operands in vec_set_<mode>_internal
means that the optimizers can transform the expanded code into
something that is invalid. We then emit code based on the incorrect
RTL, assuming that it is still valid. A valid pattern can only have a
single bit set in the immediate operand, representing the lane to be
written.

gcc/ChangeLog:

PR target/121775
* config/arm/neon.md (vec_set<mode>_internal, all variants): Validate the immediate operand that indicates the lane to modify.

gcc/testsuite/ChangeLog:

PR target/121775
* gcc.target/arm/simd/vset_lane_u8.c: New test.

2025-09-04  RISC-V: Use correct target in expand_vec_perm [PR121780].  [Robin Dapp; 1 file, -1/+1]

This fixes a glaring mistake in yesterday's change to the expansion of
vec_perm. We should of course move tmp_target into the real target and
not the other way around. I wonder why my testing hasn't caught this...

PR target/121742
PR target/121780
PR target/121781

gcc/ChangeLog:

* config/riscv/riscv-v.cc (expand_vec_perm): Swap target and tmp_target.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr121780.c: New test.
* gcc.target/riscv/rvv/autovec/pr121781.c: New test.

2025-09-04  RISC-V: Always register vector built-in functions during LTO [PR110812]  [Kito Cheng; 3 files, -184/+121]

Previously, vector built-in functions were not properly registered
during the LTO pipeline, causing link failures when vector intrinsics
were used in LTO builds with mixed architecture options. This patch
ensures all vector built-in functions are always registered during LTO
compilation.

The key changes include:

- Moving pragma intrinsic flag manipulation from riscv-c.cc to
  riscv-vector-builtins.cc for better encapsulation
- Registering all vector built-in functions regardless of current ISA
  extensions, deferring the actual extension checking to expansion time
- Adding proper support for built-in type registration during LTO

This approach is safe because we already perform extension requirement
checking at expansion time. The trade-off is a slight increase in
bootstrap time for LTO builds due to registering more built-in
functions.

PR target/110812

gcc/ChangeLog:

* config/riscv/riscv-c.cc (pragma_intrinsic_flags): Remove struct.
(riscv_pragma_intrinsic_flags_pollute): Remove function.
(riscv_pragma_intrinsic_flags_restore): Remove function.
(riscv_pragma_intrinsic): Simplify to only call handle_pragma_vector.
* config/riscv/riscv-vector-builtins.cc (pragma_intrinsic_flags): Move struct definition here from riscv-c.cc.
(riscv_pragma_intrinsic_flags_pollute): Move and adapt from riscv-c.cc, add zvfbfmin, zvfhmin and vector_elen_bf_16 support.
(riscv_pragma_intrinsic_flags_restore): Move from riscv-c.cc.
(rvv_switcher::rvv_switcher): Add pollute_flags parameter to control flag manipulation.
(rvv_switcher::~rvv_switcher): Restore flags conditionally.
(register_builtin_types): Use rvv_switcher without polluting flags.
(get_required_extensions): Remove function.
(check_required_extensions): Simplify to only check type validity.
(function_instance::function_returns_void_p): Move implementation from header.
(function_builder::add_function): Register placeholder for LTO.
(init_builtins): Simplify and handle LTO case.
(reinit_builtins): Remove function.
(handle_pragma_vector): Remove extension checking.
* config/riscv/riscv-vector-builtins.h (function_instance::function_returns_void_p): Add declaration.
(function_call_info::function_returns_void_p): Remove inline implementation.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/lto/pr110812_0.c: New test.
* gcc.target/riscv/lto/pr110812_1.c: New test.
* gcc.target/riscv/lto/riscv-lto.exp: New test driver.
* gcc.target/riscv/lto/riscv_vector.h: New header wrapper.

2025-09-03  RISC-V: Add support for the XAndesbfhcvt ISA extension.  [Kuan-Lin Chen; 4 files, -4/+23]

This extension defines instructions to perform scalar floating-point
conversion between the BFLOAT16 floating-point data and the IEEE-754
32-bit single-precision floating-point (SP) data in a scalar floating
point register (a C-level sketch follows below).

gcc/ChangeLog:

* config/riscv/andes.def: Add nds_fcvt_s_bf16 and nds_fcvt_bf16_s.
* config/riscv/riscv.md (truncsfbf2): Add TARGET_XANDESBFHCVT support.
(extendbfsf2): Ditto.
* config/riscv/riscv-builtins.cc: New AVAIL andesbfhcvt. Add new define RISCV_ATYPE_BF and RISCV_ATYPE_SF.
* config/riscv/riscv-ftypes.def: New DEF_RISCV_FTYPE.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/xandes/xandesbfhcvt-1.c: New test.
* gcc.target/riscv/xandes/xandesbfhcvt-2.c: New test.

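In C terms, the two instructions back the standard conversion patterns
named in the ChangeLog, roughly as follows (a sketch, assuming __bf16
support is enabled for the target):

  /* truncsfbf2: float -> BFLOAT16, i.e. nds_fcvt_bf16_s.  */
  __bf16 to_bf16 (float f) { return (__bf16) f; }

  /* extendbfsf2: BFLOAT16 -> float, i.e. nds_fcvt_s_bf16.  */
  float to_float (__bf16 b) { return (float) b; }
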
2025-09-03  RISC-V: Add support for the XAndesperf ISA extension.  [Kuan-Lin Chen; 10 files, -7/+557]

This patch adds support for the XAndesperf ISA extension. The 32-bit
AndeStar V5 extension includes branch instructions, load effective
address instructions, and string processing instructions for
performance improvement. New INSN patterns are added into the new file
andes.md as a separate vendor extension (example source shapes follow
below).

gcc/ChangeLog:

* config/riscv/constraints.md (Ou07): New constraint.
(ads_Bext): New constraint.
* config/riscv/iterators.md (ANYLE32): New iterator.
(sizen): New iterator.
(sh_limit): New iterator.
(sh_bit): New iterator.
(cs): New iterator.
* config/riscv/predicates.md (ads_branch_bbcs_operand): New predicate.
(ads_branch_bimm_operand): New predicate.
(ads_imm_extract_operand): New predicate.
(ads_extract_size_imm_si): New predicate.
(ads_extract_size_imm_di): New predicate.
(const_int5_operand): New predicate.
* config/riscv/riscv-builtins.cc: Add new AVAIL andesperf32 and andesperf64. Add new define RISCV_ATYPE_DI.
* config/riscv/riscv-ftypes.def: New DEF_RISCV_FTYPE.
* config/riscv/riscv.cc (riscv_extend_cost): Cost for pattern 'bfo'.
(riscv_rtx_costs): Cost for XAndesperf extension.
* config/riscv/riscv.md: Add support for XAndesperf to patterns zero_extendsidi2_internal, zero_extendhi2, extendsidi2_internal, extend<SHORT:mode><SUPERQI:mode>2, <any_extract:optab><GPR:mode>3 and branch_on_bit.
* config/riscv/vector-iterators.md (sz): Add sign_extract and zero_extract.
* config/riscv/andes.def: New file for vendor Andes.
* config/riscv/andes.md: New file for vendor Andes.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/riscv.exp: Add runtest for subdir xandes.
* gcc.target/riscv/xandes/xandesperf-1.c: New test.
* gcc.target/riscv/xandes/xandesperf-10.c: New test.
* gcc.target/riscv/xandes/xandesperf-2.c: New test.
* gcc.target/riscv/xandes/xandesperf-3.c: New test.
* gcc.target/riscv/xandes/xandesperf-4.c: New test.
* gcc.target/riscv/xandes/xandesperf-5.c: New test.
* gcc.target/riscv/xandes/xandesperf-6.c: New test.
* gcc.target/riscv/xandes/xandesperf-7.c: New test.
* gcc.target/riscv/xandes/xandesperf-8.c: New test.
* gcc.target/riscv/xandes/xandesperf-9.c: New test.

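Source shapes the new patterns can map to single instructions, as a
sketch (which instruction is selected for each is an assumption here;
the function names are illustrative):

  extern void g (void);

  /* Bit-field extract, matching the <any_extract:optab><GPR:mode>3
     patterns (zero_extract shown; sign_extract is analogous).  */
  unsigned long bfo_like (unsigned long x)
  {
    return (x >> 5) & 0x3f;
  }

  /* Branch on a single bit, matching the branch_on_bit patterns.  */
  void bbc_like (long x)
  {
    if (x & (1L << 7))
      g ();
  }
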
2025-09-03  RISC-V: Add basic XAndes vendor extension support.  [Kuan-Lin Chen; 4 files, -1/+118]

This patch adds basic support for the following XAndes ISA extensions:

  XANDESPERF
  XANDESBFHCVT
  XANDESVBFHCVT
  XANDESVSINTLOAD
  XANDESVPACKFPH
  XANDESVDOT

gcc/ChangeLog:

* config/riscv/riscv-ext.def: Include riscv-ext-andes.def.
* config/riscv/riscv-ext.opt (riscv_xandes_subext): New variable.
(XANDESPERF): New mask.
(XANDESBFHCVT): Ditto.
(XANDESVBFHCVT): Ditto.
(XANDESVSINTLOAD): Ditto.
(XANDESVPACKFPH): Ditto.
(XANDESVDOT): Ditto.
* config/riscv/t-riscv: Add riscv-ext-andes.def.
* doc/riscv-ext.texi: Regenerated.
* config/riscv/riscv-ext-andes.def: New file.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/xandes/xandes-predef-1.c: New test.
* gcc.target/riscv/xandes/xandes-predef-2.c: New test.
* gcc.target/riscv/xandes/xandes-predef-3.c: New test.
* gcc.target/riscv/xandes/xandes-predef-4.c: New test.
* gcc.target/riscv/xandes/xandes-predef-5.c: New test.
* gcc.target/riscv/xandes/xandes-predef-6.c: New test.

Co-author: Lino Hsing-Yu Peng (linopeng@andestech.com)
Co-author: Kai Kai-Yi Weng (kaiweng@andestech.com)

2025-09-03  RISC-V: Add pattern for vector-scalar floating-point max  [Paul-Antoine Arras; 1 file, -4/+4]

This pattern enables the combine pass (or late-combine, depending on
the case) to merge a vec_duplicate into an smax RTL instruction.

Before this patch, we have two instructions, e.g.:
  vfmv.v.f v2,fa0
  vfmax.vv v1,v1,v2

After, we get only one:
  vfmax.vf v1,v1,fa0

In some cases, it also shaves off one vsetvli (a loop shape that
triggers this follows below).

gcc/ChangeLog:

* config/riscv/autovec-opt.md (*vfmax_vf_<mode>): Rename into...
(*vf<optab>_vf_<mode>): New pattern to combine vec_duplicate + vf{min,max}.vv into vf{max,min}.vf.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls/floating-point-max-2.c: Adjust scan dump.
* gcc.target/riscv/rvv/autovec/vls/floating-point-max-4.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfmax. Also add missing scan-dump for vfmul.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Add vfmax.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop.h: Add max functions.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop_data.h: Add data for vfmax.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmax-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmax-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmax-run-1-f64.c: New test.

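A loop shape that exercises the combination, as a sketch (the exact
flags are an assumption; something like -ffinite-math-only is needed so
the ternary is recognized as smax rather than the IEEE variant):

  /* The scalar operand f is broadcast; late-combine now folds the
     vec_duplicate into vfmax.vf.  */
  void clamp_below (float *x, float f, int n)
  {
    for (int i = 0; i < n; i++)
      x[i] = x[i] > f ? x[i] : f;
  }
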
2025-09-03  [RISC-V][PR target/121213] Avoid unnecessary sign extension in amoswap sequence  [Austin Law; 1 file, -1/+26]

This is Austin's work to remove the redundant sign extension seen in
pr121213.

--

The .w form of amoswap will sign extend its result from 32 to 64 bits,
thus any explicit sign extension insn doing the same is redundant (a C
shape that shows this follows below).

This uses Jivan's approach of allocating a DI temporary for an extended
result and using a promoted subreg extraction to get that result into
the final destination.

Tested with no regressions on riscv32-elf and riscv64-elf and
bootstrapped on the BPI and pioneer systems.

PR target/121213

gcc/
* config/riscv/sync.md (amo_atomic_exchange_extended<mode>): Separate insn with sign extension for 64 bit targets.

gcc/testsuite
* gcc.target/riscv/amo/pr121213.c: Remove xfail.

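The kind of source this affects, roughly (a sketch; the actual test
lives in gcc.target/riscv/amo/pr121213.c):

  #include <stdint.h>

  /* On rv64, amoswap.w already sign-extends its 32-bit result, so the
     implicit widening in the return needs no separate sext.w after
     this patch.  */
  int64_t swap_then_widen (int32_t *p, int32_t v)
  {
    return __atomic_exchange_n (p, v, __ATOMIC_SEQ_CST);
  }
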
2025-09-03  aarch64: PR target/121749: Use correct predicate for narrowing shift amounts  [Kyrylo Tkachov; 2 files, -12/+13]

With g:d20b2ad845876eec0ee80a3933ad49f9f6c4ee30 the narrowing shift
instructions are now represented with standard RTL and more merging
optimisations occur. This exposed a wrong predicate for the shift
amount operand. The shift amount is the number of bits of the narrow
destination, not the input sources. Correct this by using the vn_mode
attribute when specifying the predicate, which exists for this purpose.

I've spotted a few more narrowing shift patterns that need the
restriction, so they are updated as well.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>

gcc/
PR target/121749
* config/aarch64/aarch64-simd.md (aarch64_<shrn_op>shrn_n<mode>): Use aarch64_simd_shift_imm_offset_<vn_mode> instead of aarch64_simd_shift_imm_offset_<ve_mode> predicate.
(aarch64_<shrn_op>shrn_n<mode> VQN define_expand): Likewise.
(*aarch64_<shrn_op>rshrn_n<mode>_insn): Likewise.
(aarch64_<shrn_op>rshrn_n<mode>): Likewise.
(aarch64_<shrn_op>rshrn_n<mode> VQN define_expand): Likewise.
(aarch64_sqshrun_n<mode>_insn): Likewise.
(aarch64_sqshrun_n<mode>): Likewise.
(aarch64_sqshrun_n<mode> VQN define_expand): Likewise.
(aarch64_sqrshrun_n<mode>_insn): Likewise.
(aarch64_sqrshrun_n<mode>): Likewise.
(aarch64_sqrshrun_n<mode>): Likewise.
* config/aarch64/iterators.md (vn_mode): Handle DI, SI, HI modes.

gcc/testsuite/
PR target/121749
* gcc.target/aarch64/simd/pr121749.c: New test.

2025-09-02  RISC-V: Fix is_vlmax_len_p and use for strided ops.  [Robin Dapp; 2 files, -10/+30]

This patch changes is_vlmax_len_p to handle VLS modes properly.
Before, we would check if len == GET_MODE_NUNITS (mode). This works
for VLA modes but not necessarily for VLS modes. We regularly have
e.g. small VLS modes where LEN equals their number of units but which
do not span a full vector. Therefore, now check if
len * GET_MODE_UNIT_SIZE (mode) equals
BYTES_PER_RISCV_VECTOR * TARGET_MAX_LMUL (restated as a sketch below).

Changing this uncovered an oversight in avlprop where we used
GET_MODE_NUNITS as AVL when GET_MODE_NUNITS / NF would be correct.

The testsuite is unchanged. I didn't bother to add a dedicated test
because we would have seen the fallout anyway once the gather patch
lands.

gcc/ChangeLog:

* config/riscv/riscv-v.cc (is_vlmax_len_p): Properly handle VLS modes.
(imm_avl_p): Fix VLS length check.
(expand_strided_load): Use is_vlmax_len_p.
(expand_strided_store): Ditto.
* config/riscv/riscv-avlprop.cc (pass_avlprop::execute): Use GET_MODE_NUNITS / NF as avl.

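The corrected check, restated as a sketch in GCC-internal terms
(simplified; not the verbatim implementation in riscv-v.cc):

  /* A length is VLMAX when the bytes it covers fill a whole vector
     group at the maximum LMUL, not merely when it equals the mode's
     unit count (which small VLS modes can satisfy without spanning a
     full vector).  */
  static bool
  is_vlmax_len_p_sketch (machine_mode mode, poly_int64 len)
  {
    return known_eq (len * GET_MODE_UNIT_SIZE (mode),
                     BYTES_PER_RISCV_VECTOR * TARGET_MAX_LMUL);
  }
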
2025-09-02  RISC-V: Handle overlap in expand_vec_perm PR121742.  [Robin Dapp; 1 file, -3/+8]

In a two-source gather we unconditionally overwrite target with the
first gather's result already. If op1 == target this clobbers the
source operand for the second gather. This patch uses a temporary in
that case.

PR target/121742

gcc/ChangeLog:

* config/riscv/riscv-v.cc (expand_vec_perm): Use temporary if op1 and target overlap.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/pr121742.c: New test.

2025-09-02  RISC-V: Remove unused print_ext_doc_entry function [NFC]  [Kito Cheng; 1 file, -44/+0]

The print_ext_doc_entry function and associated version_t struct in
gen-riscv-ext-opt.cc were not being used anywhere in the codebase.
Remove them to clean up the code.

gcc/
* config/riscv/gen-riscv-ext-opt.cc (version_t): Remove unused struct.
(print_ext_doc_entry): Remove unused function.

2025-09-01  PR target/89828 Internal compiler error on "-fno-omit-frame-pointer"  [Yoshinori Sato; 1 file, -32/+17]

The problem was caused by an erroneous note about creating a stack
frame, which made the assert that cur_cfa's reg is the frame pointer
fail. This fix generates notes that correctly update cur_cfa.

v2 changes: add testcase.

All tests that failed with "internal compiler error: in
dwarf2out_frame_debug_adjust_cfa, at dwarf2cfi.cc" now pass.

PR target/89828

gcc
* config/rx/rx.cc (add_pop_cfi_notes): Release the frame pointer if it is used.
(rx_expand_prologue): Redesigned stack pointer and frame pointer update process.

gcc/testsuite/
* gcc.dg/pr89828.c: New.

2025-09-01  Introduce abstraction for vect reduction info, tracked from SLP nodes  [Richard Biener; 1 file, -6/+5]

While we already have the accessor info_for_reduction, its result is a
plain stmt_vec_info. The following turns that into a class for the
purpose of changing accesses to reduction info to a new set of
accessors prefixed with VECT_REDUC_INFO and removes the corresponding
STMT_VINFO prefixed accessors where possible. There are a few
reduction-related things that are used by scalar cycle detection and
thus have to stay as-is for now, and as copies in the future.

This also separates reduction info into one object per reduction and
associates it with SLP nodes, splitting it out from stmt_vec_info,
retaining (and duplicating) parts used by scalar cycle analysis. The
data is then associated with SLP nodes forming reduction cycles and
accessible via info_for_reduction. The data is created at SLP
discovery time as we look at it even pre-vectorizable_reduction
analysis, but most of the data is only populated by the latter. There
is no reduction info with nested cycles that are not part of an outer
reduction.

In the process this adds cycle info to each SLP tree, notably the
reduc-idx and a way to identify the reduction info.

* tree-vectorizer.h (vect_reduc_info): New.
(create_info_for_reduction): Likewise.
(VECT_REDUC_INFO_TYPE): Likewise.
(VECT_REDUC_INFO_CODE): Likewise.
(VECT_REDUC_INFO_FN): Likewise.
(VECT_REDUC_INFO_SCALAR_RESULTS): Likewise.
(VECT_REDUC_INFO_INITIAL_VALUES): Likewise.
(VECT_REDUC_INFO_REUSED_ACCUMULATOR): Likewise.
(VECT_REDUC_INFO_INDUC_COND_INITIAL_VAL): Likewise.
(VECT_REDUC_INFO_EPILOGUE_ADJUSTMENT): Likewise.
(VECT_REDUC_INFO_FORCE_SINGLE_CYCLE): Likewise.
(VECT_REDUC_INFO_RESULT_POS): Likewise.
(VECT_REDUC_INFO_VECTYPE): Likewise.
(STMT_VINFO_VEC_INDUC_COND_INITIAL_VAL): Remove.
(STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT): Likewise.
(STMT_VINFO_FORCE_SINGLE_CYCLE): Likewise.
(STMT_VINFO_REDUC_FN): Likewise.
(STMT_VINFO_REDUC_VECTYPE): Likewise.
(vect_reusable_accumulator::reduc_info): Adjust.
(vect_reduc_type): Adjust.
(_slp_tree::cycle_info): New member.
(SLP_TREE_REDUC_IDX): Likewise.
(vect_reduc_info_s): Move/copy data from ...
(_stmt_vec_info): ... here.
(_loop_vec_info::redcu_infos): New member.
(info_for_reduction): Adjust to take SLP node.
(vect_reduc_type): Adjust.
(vect_is_reduction): Add overload for SLP node.
* tree-vectorizer.cc (vec_info::new_stmt_vec_info): Do not initialize removed members.
(vec_info::free_stmt_vec_info): Do not release them.
* tree-vect-stmts.cc (vectorizable_condition): Adjust.
* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize cycle info.
(vect_build_slp_tree_2): Compute SLP reduc_idx and store it. Create, populate and propagate reduction info.
(vect_print_slp_tree): Print cycle info.
(vect_analyze_slp_reduc_chain): Set cycle info on the manually added conversion node.
(vect_optimize_slp_pass::start_choosing_layouts): Adjust.
* tree-vect-loop.cc (_loop_vec_info::~_loop_vec_info): Release reduction infos.
(info_for_reduction): Get the reduction info from the vector in the loop_vinfo.
(vect_create_epilog_for_reduction): Adjust.
(vectorizable_reduction): Likewise.
(vect_transform_reduction): Likewise.
(vect_transform_cycle_phi): Likewise; deal with nested cycles not part of a double reduction having no reduction info.
* config/aarch64/aarch64.cc (aarch64_force_single_cycle): Use VECT_REDUC_INFO_FORCE_SINGLE_CYCLE, get SLP node and use that.
(aarch64_vector_costs::count_ops): Adjust.

2025-08-31  Fix ICE due to wrong operand passed to ix86_vgf2p8affine_shift_matrix.  [liuhongt; 2 files, -4/+7]

1) Fix the predicate of operands[3] in cond_<insn><mode>, since only
   const_vec_dup_operand is expected for masked operations, and pass
   the real count to ix86_vgf2p8affine_shift_matrix.
2) Pass operands[2] instead of operands[1] to
   gen_vgf2p8affineqb_<mode>_mask, which expects the operand to be
   shifted; operands[1] is the mask operand in cond_<insn><mode>.

gcc/ChangeLog:

PR target/121699
* config/i386/predicates.md (const_vec_dup_operand): New predicate.
* config/i386/sse.md (cond_<insn><mode>): Fix predicate of operands[3], and fix wrong operands passed to ix86_vgf2p8affine_shift_matrix and gen_vgf2p8affineqb_<mode>_mask.

gcc/testsuite/ChangeLog:

* gcc.target/i386/pr121699.c: New test.

2025-08-31  xtensa: Optimize branch whether (reg:SI) is within/out the range handled by CLAMPS instruction  [Takayuki 'January June' Suwa; 2 files, -0/+39]

The CLAMPS instruction in the Xtensa ISA, provided when the
TARGET_CLAMPS configuration is enabled (and also requires
TARGET_MINMAX), returns the number in the specified register clamped
to between -(1<<N) and (1<<N)-1 inclusive, where N is an immediate
value from 7 to 22.

Therefore, when the above configurations are met, by comparing the
clamped result with the original value for equality, branching on
whether the value is within the range mentioned above is implemented
with fewer instructions, especially when the upper and lower bounds of
the range are too large to fit into a single immediate assignment.

/* example (TARGET_MINMAX and TARGET_CLAMPS) */
extern void foo(void);
void test0(int a) {
  if (a >= -(1 << 9) && a < (1 << 9))
    foo();
}
void test1(int a) {
  if (a < -(1 << 20) || a >= (1 << 20))
    foo();
}

;; before
test0:
  entry   sp, 32
  addmi   a2, a2, 0x200
  movi    a8, 0x3ff
  bltu    a8, a2, .L1
  call8   foo
.L1:
  retw.n
test1:
  entry   sp, 32
  movi.n  a9, 1
  movi.n  a8, -1
  slli    a9, a9, 20
  srli    a8, a8, 11
  add.n   a2, a2, a9
  bgeu    a8, a2, .L4
  call8   foo
.L4:
  retw.n

;; after
test0:
  entry   sp, 32
  clamps  a8, a2, 9
  bne     a2, a8, .L1
  call8   foo
.L1:
  retw.n
test1:
  entry   sp, 32
  clamps  a8, a2, 20
  beq     a2, a8, .L4
  call8   foo
.L4:
  retw.n

(note: Currently, in the RTL instruction combination pass, the
possible const_int values are fundamentally constrained by
TARGET_LEGITIMATE_CONSTANT_P() if no bare large constant assignments
are possible (i.e., neither -mconst16 nor -mauto-litpools), limiting N
to a range of 7 to only 10 instead of to 22. A series of forthcoming
patches will introduce an entirely new "xt_largeconst" pass that will
solve several issues, including this one.)

gcc/ChangeLog:

* config/xtensa/predicates.md (alt_ubranch_operator): New predicate.
* config/xtensa/xtensa.md (*eqne_in_range): New insn_and_split pattern.

2025-08-31  [RISC-V] Improve initial RTL generation for SImode adds on rv64  [Shreya Munnangi; 3 files, -13/+101]

So this is the next chunk of Shreya's work to adjust our add
expanders. In this patch we're adding support for adding a 2*s12
immediate in SI for rv64 (an example constant follows below).

To recap, the basic idea is to reduce our reliance on the
define_insn_and_split that was added a year or so ago by synthesizing
the more efficient sequence at expansion time. By handling this early
rather than late, the synthesized sequence participates in the various
optimizer passes in the natural way. In contrast, using the
define_insn_and_split bypasses the cost modeling in combine and hides
the synthesis until after reload has completed (which in turn leads to
the problems seen in pr120811).

This doesn't solve pr120811, but it is the last prerequisite patch
before directly tackling pr120811.

This has been bootstrapped & regression tested on the pioneer & bpi
and been through the usual testing on riscv32-elf and riscv64-elf.
Waiting on pre-commit CI before moving forward.

gcc/
* config/riscv/riscv-protos.h (synthesize_add_extended): Prototype.
* config/riscv/riscv.cc (synthesize_add_extended): New function.
* config/riscv/riscv.md (addsi3): For RV64, try synthesize_add_extended.

gcc/testsuite/
* gcc.target/riscv/add-synthesis-2.c: New test.

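An illustration of the constants this targets (a sketch; the exact
split chosen by synthesize_add_extended is an assumption):

  /* 3000 does not fit in a signed 12-bit immediate, but it is within
     2 * 2047, so it can be added as two immediate adds instead of
     materializing the constant in a temporary register.  */
  int add3000 (int x)
  {
    return x + 3000;   /* e.g. addiw a0,a0,2047 ; addiw a0,a0,953 */
  }
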
2025-08-30  phiopt, math-opts: Adjust spaceship_replacement and optimize_spaceship for recent libstdc++ changes [PR121698]  [Jakub Jelinek; 3 files, -6/+6]

libstdc++ changed its ABI in <compare> for C++20 recently (under the
"C++20 is still experimental" rule). In addition to the -1, 0, 1
values for less, equal, greater it now uses -128 for unordered instead
of the former 2, and changes some of the operators: instead of checks
like (_M_value & ~1) == _M_value, in some cases it now uses
_M_reverse(), which is negation in unsigned char type + conversion
back to the original type. _M_reverse() thus turns the -1, 0, 1, -128
values into 1, 0, -1, -128 (a standalone model follows below). Note
libc++ uses value -127 instead of 2/-128.

Now, the middle-end has some optimizations which rely on the
particular implementation and don't optimize if it doesn't match. One
is optimize_spaceship, which on some targets (currently x86, aarch64
and s390) attempts to use better comparison instructions (ideally just
one floating point comparison to get all 4 possible outcomes plus some
flag tests or magic instead of 2 or 3 floating point comparisons).
This one can actually handle arbitrary int non-[-1,1] values for
unordered but still has a default of 2. The patch changes that default
to -128 so that even if something is expanded as branches, if it is
later during RTL optimizations determined to convert that into
partial_ordering, we get better code.

The other optimization (the phiopt one) is about optimizing
(x <=> y) < 0 etc. into just x < y. This one actually relies on the
exact unordered value (2) and has code to deal with that
(_M_value & ~1) == _M_value kind of tests and whatever match.pd lowers
them to. So, this patch partially rewrites it to look for -128 instead
of 2, drops those (_M_value & ~1) == _M_value pattern recognitions and
instead introduces pattern recognition of _M_reverse(), i.e. cast to
unsigned char, negation in that type and cast back to the original
signed type.

With all these changes we get back the desired optimizations for all
the cases we could optimize previously. (Note, for the HONOR_NANS case
we don't try to optimize say (x <=> y) == 0 because the original will
raise an exception if either x or y is a NaN, while turning it into
x == y will not; but (x <=> y) <= 0 is fine as x <= y, because it does
raise those exceptions.)

2025-08-30  Jakub Jelinek  <jakub@redhat.com>

PR tree-optimization/121698
* tree-ssa-phiopt.cc (spaceship_replacement): Adjust to handle spaceship unordered value -128 rather than 2 and stmts from the new std::partial_ordering::_M_reverse() instead of (_M_value & ~1) == _M_value etc.
* doc/md.texi (spaceship@var{m}4): Use -128 instead of 2.
* tree-ssa-math-opts.cc (optimize_spaceship): Adjust comments that libstdc++ unordered value is -128 rather than 2 and use that as the default unordered value.
* config/i386/i386-expand.cc (ix86_expand_fp_spaceship): Use GEN_INT (-128) instead of const2_rtx and adjust comment accordingly.
* config/aarch64/aarch64.cc (aarch64_expand_fp_spaceship): Likewise.
* config/s390/s390.cc (s390_expand_fp_spaceship): Likewise.
* gcc.dg/pr94589-2.c: Adjust for expected unordered value -128 rather than 2 and negations in unsigned char instead of and with ~1 and comparison against original value.
* gcc.dg/pr94589-4.c: Likewise.
* gcc.dg/pr94589-5.c: Likewise.
* gcc.dg/pr94589-6.c: Likewise.

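The value mapping of _M_reverse() as a standalone model (a sketch of
the pattern the pass now recognizes, not libstdc++'s actual code):

  /* Negate in unsigned char, then convert back: -1, 0, 1, -128 map to
     1, 0, -1, -128, so less/greater swap while equal and unordered
     are preserved.  */
  signed char reverse_model (signed char v)
  {
    return (signed char) (unsigned char) -(unsigned char) v;
  }
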
2025-08-30  x86-64: Use UNSPEC_DTPOFF to check source operand in TLS64_COMBINE  [H.J. Lu; 1 file, -22/+10]

Since the first operand of PLUS in the source of the TLS64_COMBINE
pattern:

(set (reg/f:DI 128)
     (plus:DI (unspec:DI [
                  (symbol_ref:DI ("_TLS_MODULE_BASE_") [flags 0x10])
                  (reg:DI 126)
                  (reg/f:DI 7 sp)
                ] UNSPEC_TLSDESC)
              (const:DI (unspec:DI [
                  (symbol_ref:DI ("bfd_error") [flags 0x1a] <var_decl 0x7fffe99d6e40 bfd_error>)
                ] UNSPEC_DTPOFF))))

is unused, use the second operand of PLUS:

(const:DI (unspec:DI [
    (symbol_ref:DI ("bfd_error") [flags 0x1a] <var_decl 0x7fffe99d6e40 bfd_error>)
  ] UNSPEC_DTPOFF))

to check if 2 TLS_COMBINE patterns have the same source.

gcc/
PR target/121725
* config/i386/i386-features.cc (pass_x86_cse::candidate_gnu2_tls_p): Use the UNSPEC_DTPOFF operand to check source operand in TLS64_COMBINE pattern.

gcc/testsuite/
PR target/121725
* gcc.target/i386/pr121725-1a.c: New test.
* gcc.target/i386/pr121725-1b.c: Likewise.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>

2025-08-29  xtensa: Rewrite bswapsi2_internal with compact syntax  [Takayuki 'January June' Suwa; 3 files, -29/+97]

Also, the omission of the instruction that sets the shift amount
register (SAR) to 8 is now more efficient: it is omitted if there was
a previous bswapsi2 in the same BB, but not omitted if no bswapsi2 is
found or another insn that modifies SAR is found first (see below).

Note that the five instructions for writing to SAR are as follows,
along with the insns that use them (except for bswapsi2_internal
itself):

- SSA8B  *shift_per_byte, *shlrd_per_byte
- SSA8L  *shift_per_byte, *shlrd_per_byte
- SSR    ashrsi3 (alt 1), lshrsi3 (alt 1), *shlrd_reg, rotrsi3 (alt 1)
- SSL    ashlsi3_internal (alt 1), *shlrd_reg, rotlsi3 (alt 1)
- SSAI   *shlrd_const, rotlsi3 (alt 0), rotrsi3 (alt 0)

gcc/ChangeLog:

* config/xtensa/xtensa-protos.h (xtensa_bswapsi2_output): New function prototype.
* config/xtensa/xtensa.cc (xtensa_bswapsi2_output_1, xtensa_bswapsi2_output): New functions.
* config/xtensa/xtensa.md (bswapsi2_internal): Rewrite in compact syntax and use xtensa_bswapsi2_output() as asm output.

gcc/testsuite/ChangeLog:

* gcc.target/xtensa/bswap-SSAI8.c: New.

2025-08-29  [RISC-V][PR target/121548] Avoid bogus index into recog operand cache  [Jeff Law; 2 files, -0/+7]

So the RISC-V port has attributes which indicate the index within the
recog_data where certain operands will be found. For this BZ the
default value for the merge_op_idx attribute on the given insn is "2".
But the insn only has operands 0 & 1. So we do an out of bounds array
access and boom, the ICE/valgrind failure.

As we discussed in the patchwork meeting, this is all a bit clunky and
has been fairly error prone. This doesn't add any massive checking,
but does introduce some asserts to help catch problems a bit earlier
and more clearly.

In particular, in cases where we're already asserting that the
returned index is valid (!= INVALID_ATTRIBUTE) we also assert that the
index is less than the total number of operands.

In the get_vlmax_ta_preferred_avl routine it appears like we need to
handle these two cases more gracefully, as we apparently legitimately
query for the merge_op_idx on a fairly arbitrary insn. We just have to
make sure to not *use* the result if it's INVALID_ATTRIBUTE. So for
that code we assert that merge_op_idx is either INVALID_ATTRIBUTE or
smaller than the number of operands.

This patch also adds overrides for 3 patterns to return
INVALID_ATTRIBUTE for merge_op_idx, similar to how they already do for
mode_idx and avl_type_idx.

This has been bootstrapped and regression tested on the bpi & pioneer
systems and regression tested for riscv32-elf and riscv64-elf. Waiting
on CI before pushing.

PR target/121548

gcc/
* config/riscv/riscv-avlprop.cc (get_insn_vtype_mode): Assert MODE_IDX is smaller than the number of operands.
(simplify_replace_vlmax_avl): Similarly.
(pass_avlprop::get_vlmax_ta_preferred_avl): Similarly.
* config/riscv/vector.md: Override merge_op_idx computation for simple moves, just like is done for avl_type_idx and mode_idx.

2025-08-29  RISC-V: Add patterns for vector-scalar IEEE floating-point min  [Paul-Antoine Arras; 2 files, -14/+52]

This pattern enables the combine pass (or late-combine, depending on
the case) to merge a vec_duplicate into an unspec_vfmin RTL
instruction.

Before this patch, we have two instructions, e.g.:
  vfmv.v.f v2,fa0
  vfmin.vv v1,v1,v2

After, we get only one:
  vfmin.vf v1,v1,fa0

gcc/ChangeLog:

* config/riscv/autovec-opt.md (*vfmin_vf_ieee_<mode>): Add new patterns to combine vec_duplicate + vfmin.vv (unspec) into vfmin.vf.
(*vfmul_vf_<mode>, *vfrdiv_vf_<mode>, *vfmin_vf_<mode>): Fix attribute types.
* config/riscv/vector.md (@pred_<ieee_fmaxmin_op><mode>_scalar): Allow VLS modes.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Add vfmin.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-5-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-5-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-5-f64.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-6-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-6-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-6-f64.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-7-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-7-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-7-f64.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-8-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-8-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-8-f64.c: New test.

2025-08-29  x86: Allow by_pieces op when expanding memcpy/memset epilogue  [H.J. Lu; 3 files, -0/+28]

Commit

  401199377c50045ede560daf3f6e8b51749c2a87
  Author: H.J. Lu <hjl.tools@gmail.com>
  Date:   Tue Jun 17 10:17:17 2025 +0800

      x86: Improve vector_loop/unrolled_loop for memset/memcpy

uses move_by_pieces and store_by_pieces to expand the memcpy/memset
epilogue with vector_loop even when
targetm.use_by_pieces_infrastructure_p returns false, which triggers

  gcc_assert (targetm.use_by_pieces_infrastructure_p
              (len, align,
               memsetp ? SET_BY_PIECES : STORE_BY_PIECES,
               optimize_insn_for_speed_p ()));

in store_by_pieces. Fix it by:

1. Add by_pieces_in_use to machine_function to indicate that a
   by_pieces op is currently in use.
2. Set and clear by_pieces_in_use when expanding the memcpy/memset
   epilogue with move_by_pieces and store_by_pieces.
3. Define TARGET_USE_BY_PIECES_INFRASTRUCTURE_P to return true if
   by_pieces_in_use is true.

gcc/
PR target/121096
* config/i386/i386-expand.cc (expand_cpymem_epilogue): Set and clear by_pieces_in_use when using by_pieces op.
(expand_setmem_epilogue): Likewise.
* config/i386/i386.cc (ix86_use_by_pieces_infrastructure_p): New.
(TARGET_USE_BY_PIECES_INFRASTRUCTURE_P): Likewise.
* config/i386/i386.h (machine_function): Add by_pieces_in_use.

gcc/testsuite/
PR target/121096
* gcc.target/i386/memcpy-strategy-14.c: New test.
* gcc.target/i386/memcpy-strategy-15.c: Likewise.
* gcc.target/i386/memset-strategy-10.c: Likewise.
* gcc.target/i386/memset-strategy-11.c: Likewise.
* gcc.target/i386/memset-strategy-12.c: Likewise.
* gcc.target/i386/memset-strategy-13.c: Likewise.
* gcc.target/i386/memset-strategy-14.c: Likewise.
* gcc.target/i386/memset-strategy-15.c: Likewise.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>

2025-08-29  x86: Handle constant in any modes in setmem_epilogue_gen_val  [H.J. Lu; 1 file, -8/+5]

Since the constant passed to setmem_epilogue_gen_val may not be in
word_mode, update setmem_epilogue_gen_val to handle any integer modes.

gcc/
PR target/121108
* config/i386/i386-expand.cc (setmem_epilogue_gen_val): Don't assert op_mode == word_mode and handle any integer modes.

gcc/testsuite/
PR target/121108
* gcc.target/i386/memset-strategy-16.c: New test.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>

2025-08-29  x86-64: Improve source operand check for TLS_CALL  [H.J. Lu; 1 file, -78/+123]

Source operands of 2 TLS_CALL patterns in

(insn 10 9 11 3 (set (reg:DI 100)
        (unspec:DI [
                (symbol_ref:DI ("caml_state") [flags 0x10] <var_decl 0x7fe10e1d9e40 caml_state>)
            ] UNSPEC_TLSDESC)) "x.c":7:16 1674 {*tls_dynamic_gnu2_lea_64_di}
     (nil))
(insn 11 10 12 3 (parallel [
            (set (reg:DI 99)
                (unspec:DI [
                        (symbol_ref:DI ("caml_state") [flags 0x10] <var_decl 0x7fe10e1d9e40 caml_state>)
                        (reg:DI 100)
                        (reg/f:DI 7 sp)
                    ] UNSPEC_TLSDESC))
            (clobber (reg:CC 17 flags))
        ]) "x.c":7:16 1676 {*tls_dynamic_gnu2_call_64_di}
     (expr_list:REG_DEAD (reg:DI 100)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

and

(insn 19 17 20 4 (set (reg:DI 104)
        (unspec:DI [
                (symbol_ref:DI ("caml_state") [flags 0x10] <var_decl 0x7fe10e1d9e40 caml_state>)
            ] UNSPEC_TLSDESC)) "x.c":6:10 discrim 1 1674 {*tls_dynamic_gnu2_lea_64_di}
     (nil))
(insn 20 19 21 4 (parallel [
            (set (reg:DI 103)
                (unspec:DI [
                        (symbol_ref:DI ("caml_state") [flags 0x10] <var_decl 0x7fe10e1d9e40 caml_state>)
                        (reg:DI 104)
                        (reg/f:DI 7 sp)
                    ] UNSPEC_TLSDESC))
            (clobber (reg:CC 17 flags))
        ]) "x.c":6:10 discrim 1 1676 {*tls_dynamic_gnu2_call_64_di}
     (expr_list:REG_DEAD (reg:DI 104)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

are the same even though rtx_equal_p returns false, since (reg:DI 100)
and (reg:DI 104) are set from the same symbol. Use the UNSPEC_TLSDESC
symbol

(unspec:DI [(symbol_ref:DI ("caml_state") [flags 0x10])] UNSPEC_TLSDESC)

to check if 2 TLS_CALL patterns have the same source. For
TLS64_COMBINE, use both UNSPEC_TLSDESC and UNSPEC_DTPOFF unspecs to
check if 2 TLS64_COMBINE patterns have the same source.

gcc/
PR target/121694
* config/i386/i386-features.cc (redundant_pattern): Add tlsdesc_val.
(pass_x86_cse): Likewise.
(pass_x86_cse::tls_set_insn_from_symbol): New member function.
(pass_x86_cse::candidate_gnu2_tls_p): Set tlsdesc_val. For TLS64_COMBINE, match both UNSPEC_TLSDESC and UNSPEC_DTPOFF symbols. For TLS64_CALL, match the UNSPEC_TLSDESC symbol.
(pass_x86_cse::x86_cse): Initialize the tlsdesc_val field in load. Pass the tlsdesc_val field to ix86_place_single_tls_call for X86_CSE_TLSDESC.

gcc/testsuite/
PR target/121694
* gcc.target/i386/pr121668-1b.c: New test.
* gcc.target/i386/pr121694-1a.c: Likewise.
* gcc.target/i386/pr121694-1b.c: Likewise.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>

2025-08-29  RISC-V: Combine vec_duplicate + vnmsac.vv to vnmsac.vx on GR2VR cost  [Pan Li; 2 files, -0/+118]

This patch combines vec_duplicate + vnmsac.vv into vnmsac.vx, as in
the example below. The related pattern depends on the cost of the
vec_duplicate from GR2VR: late-combine performs the combination if the
GR2VR cost is zero and rejects it if the cost is greater than zero.

Assume we have example code like below, with a GR2VR cost of 0.

#define DEF_VX_TERNARY_CASE_0(T, OP_1, OP_2, NAME)                 \
  void                                                             \
  test_vx_ternary_##NAME##_##T##_case_0 (T * restrict vd,          \
                                         T * restrict vs2,         \
                                         T rs1, unsigned n)        \
  {                                                                \
    for (unsigned i = 0; i < n; i++)                               \
      vd[i] = vd[i] OP_2 vs2[i] OP_1 rs1;                          \
  }
DEF_VX_TERNARY_CASE_0(int32_t, *, +, macc)

Before this patch:

  11 │ beq a3,zero,.L8
  12 │ vsetvli a5,zero,e32,m1,ta,ma
  13 │ vmv.v.x v2,a2
  ...
  16 │ .L3:
  17 │ vsetvli a5,a3,e32,m1,ta,ma
  ...
  22 │ vnmsac.vv v1,v2,v3
  ...
  25 │ bne a3,zero,.L3

After this patch:

  11 │ beq a3,zero,.L8
  ...
  14 │ .L3:
  15 │ vsetvli a5,a3,e32,m1,ta,ma
  ...
  20 │ vnmsac.vx v1,a2,v3
  ...
  23 │ bne a3,zero,.L3

gcc/ChangeLog:

* config/riscv/autovec-opt.md (*vnmsac_vx_<mode>): Add new pattern to combine to vx.
* config/riscv/vector.md (@pred_vnmsac_vx_<mode>): Add new pattern to generate rtl.
(*pred_nmsac_<mode>_scalar_undef): Ditto.

Signed-off-by: Pan Li <pan2.li@intel.com>

2025-08-28  RISC-V: Add pattern for vector-scalar floating-point min  [Paul-Antoine Arras; 2 files, -12/+31]

This pattern enables the combine pass (or late-combine, depending on
the case) to merge a vec_duplicate into an smin RTL instruction.

Before this patch, we have two instructions, e.g.:
  vfmv.v.f v2,fa0
  vfmin.vv v1,v1,v2

After, we get only one:
  vfmin.vf v1,v1,fa0

gcc/ChangeLog:

* config/riscv/autovec-opt.md (*vfmin_vf_<mode>): Add new pattern to combine vec_duplicate + vfmin.vv into vfmin.vf.
* config/riscv/vector.md (@pred_<optab><mode>_scalar): Allow VLS modes.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vls/floating-point-min-2.c: Adjust scan dump.
* gcc.target/riscv/rvv/autovec/vls/floating-point-min-4.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfmin.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop.h: Add support for function variants.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop_data.h: Add data for vfmin.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmin-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmin-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmin-run-1-f64.c: New test.

2025-08-28  AArch64: Add isinf expander [PR 66462]  [Wilco Dijkstra; 2 files, -1/+19]

Add an expander for isinf using integer arithmetic. This is typically
faster and avoids generating spurious exceptions on signaling NaNs.
This fixes part of PR66462 (the underlying integer test is written out
below).

int isinf1 (float x) { return __builtin_isinf (x); }

Before:
  fabs  s0, s0
  mov   w0, 2139095039
  fmov  s31, w0
  fcmp  s0, s31
  cset  w0, le
  eor   w0, w0, 1
  ret

After:
  fmov  w1, s0
  mov   w0, -16777216
  cmp   w0, w1, lsl 1
  cset  w0, eq
  ret

gcc:
PR middle-end/66462
* config/aarch64/aarch64.md (isinf<mode>2): Add new expander.
* config/aarch64/iterators.md (mantissa_bits): Add new mode_attr.

gcc/testsuite:
PR middle-end/66462
* gcc.target/aarch64/pr66462.c: Add new test.

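The integer test behind the new sequence, written out in C (a sketch;
the expander of course works at the RTL level):

  #include <stdint.h>
  #include <string.h>

  /* After shifting out the sign bit, a binary32 value is infinite iff
     the remaining bits equal 0xFF000000 (all-ones exponent, zero
     mantissa) -- exactly the "cmp w0, w1, lsl 1" against -16777216
     in the generated code above.  */
  int isinf_bits (float x)
  {
    uint32_t u;
    memcpy (&u, &x, sizeof u);
    return (u << 1) == 0xFF000000u;
  }
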
2025-08-26  x86-64: Emit the TLS call after debug marker  [H.J. Lu; 1 file, -5/+21]

For a basic block with only a debug marker:

(note 3 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(note 2 3 5 2 NOTE_INSN_FUNCTION_BEG)
(debug_insn 5 2 16 2 (debug_marker) "x.c":6:3 -1
     (nil))

emit the TLS call after the debug marker.

gcc/
PR target/121668
* config/i386/i386-features.cc (ix86_emit_tls_call): Emit the TLS call after debug marker.

gcc/testsuite/
PR target/121668
* gcc.target/i386/pr121668-1a.c: New test.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>

2025-08-26  More RISC-V testsuite hygiene  [Jeff Law; 1 file, -4/+4]

More testsuite hygiene. Some of the thead tests are expecting to find
xtheadvdot in the extension set, but it's not defined as a valid
extension anywhere. I'm just removing xtheadvdot. Someone more
familiar with these cores can add it back properly if they're so
inclined.

Second, there's a space after the zifencei in a couple of the thead
arch strings. Naturally that causes failures as well. That's a trivial
fix, just remove the bogus whitespace.

That gets us clean on riscv.exp on the pioneer system. The pioneer is
happy, as is riscv32-elf and riscv64-elf. Pushing to the trunk.

gcc/
* config/riscv/riscv-cores.def (xt-c908v): Drop xtheadvdot.
(xt-c910v2): Remove extraneous whitespace.
(xt-c920v2): Drop xtheadvdot and remove extraneous whitespace.

gcc/testsuite/
* gcc.target/riscv/mcpu-xt-c908v.c: Drop xtheadvdot.
* gcc.target/riscv/mcpu-xt-c920v2.c: Drop xtheadvdot.

2025-08-26  Enable unroll in the vectorizer when there's reduction for FMA/DOT_PROD_EXPR/SAD_EXPR  [liuhongt; 4 files, -3/+374]

The patch tries to unroll the vectorized loop when there are
FMA/DOT_PROD_EXPR/SAD_EXPR reductions; it breaks the cross-iteration
dependence and enables more parallelism (since vectorization will also
enable partial sums). When there's gather/scatter or scalarization in
the loop, don't do the unroll, since the performance bottleneck is not
at the reduction (a minimal loop of the targeted shape follows below).

The unroll factor is set according to the FMA/DOT_PROD_EXPR/SAD_EXPR
CEIL ((latency * throughput), num_of_reduction), i.e. for FMA, latency
is 4 and throughput is 2, so if there's 1 FMA for reduction then the
unroll factor is 2 * 4 / 1 = 8. There's also a vect_unroll_limit; the
final suggested_unroll_factor is set as MIN (vect_unroll_limit, 8).
The vect_unroll_limit is mainly for register pressure, to avoid too
many spills.

Ideally, all instructions in the vectorized loop should be used to
determine the unroll factor with their (latency * throughput) /
number, but that would be too much for this patch and may just be
GIGO, so the patch only considers 3 kinds of instructions: FMA,
DOT_PROD_EXPR and SAD_EXPR. Note when DOT_PROD_EXPR is not natively
supported, m_num_reduction += 3 * count, which almost prevents unroll.

There's a performance boost for a simple benchmark with a
DOT_PROD_EXPR/FMA chain, and a slight improvement in SPEC2017
performance.

gcc/ChangeLog:

* config/i386/i386.cc (ix86_vector_costs::ix86_vector_costs): Add new members m_num_reduc, m_prefer_unroll.
(ix86_vector_costs::add_stmt_cost): Set m_prefer_unroll and m_num_reduc.
(ix86_vector_costs::finish_cost): Determine m_suggested_unroll_vector with consideration of reduc_lat_mult_thr, m_num_reduction and ix86_vect_unroll_limit.
* config/i386/i386.h (enum ix86_reduc_unroll_factor): New enum.
(processor_costs): Add reduc_lat_mult_thr and vect_unroll_limit.
* config/i386/x86-tune-costs.h: Initialize reduc_lat_mult_thr and vect_unroll_limit.
* config/i386/i386.opt: Add -param=ix86-vect-unroll-limit.

gcc/testsuite/ChangeLog:

* gcc.target/i386/vect_unroll-1.c: New test.
* gcc.target/i386/vect_unroll-2.c: New test.
* gcc.target/i386/vect_unroll-3.c: New test.
* gcc.target/i386/vect_unroll-4.c: New test.
* gcc.target/i386/vect_unroll-5.c: New test.

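A minimal loop of the targeted shape (illustrative):

  /* One FMA reduction chain per iteration: with FMA latency 4 and
     throughput 2, the suggested unroll factor is CEIL (4 * 2, 1) = 8,
     capped by ix86-vect-unroll-limit.  */
  float dot (const float *a, const float *b, int n)
  {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
      sum += a[i] * b[i];
    return sum;
  }
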
2025-08-26  [PATCH] RISC-V: Add pattern for reverse floating-point divide  [Paul-Antoine Arras; 2 files, -6/+25]

This pattern enables the combine pass (or late-combine, depending on
the case) to merge a vec_duplicate into a div RTL instruction. The
vec_duplicate is the dividend operand (a loop shape follows below).

Before this patch, we have two instructions, e.g.:
  vfmv.v.f v2,fa0
  vfdiv.vv v1,v2,v1

After, we get only one:
  vfrdiv.vf v1,v1,fa0

gcc/ChangeLog:

* config/riscv/autovec-opt.md (*vfrdiv_vf_<mode>): Add new pattern to combine vec_duplicate + vfdiv.vv into vfrdiv.vf.
* config/riscv/vector.md (@pred_<optab><mode>_reverse_scalar): Allow VLS modes.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfrdiv.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop.h: Add support for reverse variants.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop_data.h: Add data for reverse variants.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfrdiv-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfrdiv-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfrdiv-run-1-f64.c: New test.

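A loop shape that produces the reverse-divide form, as a sketch (the
scalar is the dividend):

  /* f is broadcast as the dividend, so combine can now emit
     vfrdiv.vf instead of vfmv.v.f + vfdiv.vv.  */
  void reciprocal_scale (float *x, float f, int n)
  {
    for (int i = 0; i < n; i++)
      x[i] = f / x[i];
  }
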
2025-08-26  AArch64: extend cost model to cost outer loop vect where the inner loop is invariant [PR121290]  [Tamar Christina; 1 file, -2/+43]

Consider the example:

void f (int *restrict x, int *restrict y, int *restrict z, int n)
{
  for (int i = 0; i < 4; ++i)
    {
      int res = 0;
      for (int j = 0; j < 100; ++j)
        res += y[j] * z[i];
      x[i] = res;
    }
}

we currently vectorize as

f:
  movi  v30.4s, 0
  ldr   q31, [x2]
  add   x2, x1, 400
.L2:
  ld1r  {v29.4s}, [x1], 4
  mla   v30.4s, v29.4s, v31.4s
  cmp   x2, x1
  bne   .L2
  str   q30, [x0]
  ret

which is not useful because by doing outer-loop vectorization we're
performing less work per iteration than we would had we done
inner-loop vectorization and simply unrolled the inner loop.

This patch teaches the cost model that if all your leafs are
invariant, then adjust the loop cost by * VF, since every vector
iteration has at least one lane really just doing 1 scalar.

There are a couple of ways we could have solved this. One is to
increase the unroll factor to process more iterations of the inner
loop. This removes the need for the broadcast; however, we don't
support unrolling the inner loop within the outer loop. We only
support unrolling by increasing the VF, which would affect the outer
loop as well as the inner loop.

We also don't directly support costing inner-loop vs outer-loop
vectorization, and as such we're left trying to predict/steer the cost
model ahead of time to what we think should be profitable. This patch
attempts to do so using a heuristic which penalizes the outer-loop
vectorization.

We now cost the loop as

note: Cost model analysis:
  Vector inside of loop cost: 2000
  Vector prologue cost: 4
  Vector epilogue cost: 0
  Scalar iteration cost: 300
  Scalar outside cost: 0
  Vector outside cost: 4
  prologue iterations: 0
  epilogue iterations: 0
missed: cost model: the vector iteration cost = 2000 divided by the scalar iteration cost = 300 is greater or equal to the vectorization factor = 4.
missed: not vectorized: vectorization not profitable.
missed: not vectorized: vector version will never be profitable.
missed: Loop costings may not be worthwhile.

And subsequently generate:

.L5:
  add   w4, w4, w7
  ld1w  z24.s, p6/z, [x0, #1, mul vl]
  ld1w  z23.s, p6/z, [x0, #2, mul vl]
  ld1w  z22.s, p6/z, [x0, #3, mul vl]
  ld1w  z29.s, p6/z, [x0]
  mla   z26.s, p6/m, z24.s, z30.s
  add   x0, x0, x8
  mla   z27.s, p6/m, z23.s, z30.s
  mla   z28.s, p6/m, z22.s, z30.s
  mla   z25.s, p6/m, z29.s, z30.s
  cmp   w4, w6
  bls   .L5

and avoids the load and replicate if it knows it has enough vector
pipes to do so.

gcc/ChangeLog:

PR target/121290
* config/aarch64/aarch64.cc (class aarch64_vector_costs): Add m_loop_fully_scalar_dup.
(aarch64_vector_costs::add_stmt_cost): Detect invariant inner loops.
(adjust_body_cost): Adjust final costing if m_loop_fully_scalar_dup.

gcc/testsuite/ChangeLog:

PR target/121290
* gcc.target/aarch64/pr121290.c: New test.

2025-08-26[PATCH] RISC-V: Add pattern for vector-scalar single-width floating-point multiplyPaul-Antoine Arras2-6/+25
This pattern enables the combine pass (or late-combine, depending on the case) to merge a vec_duplicate into a mult RTL instruction.

Before this patch, we have two instructions, e.g.:
  vfmv.v.f    v2,fa0
  vfmul.vv    v1,v1,v2

After, we get only one:
  vfmul.vf    v2,v2,fa0

(A hypothetical C loop that triggers this combination is sketched after the ChangeLog entries below.)

gcc/ChangeLog:

* config/riscv/autovec-opt.md (*vfmul_vf_<mode>): Add new pattern to combine vec_duplicate + vfmul.vv into vfmul.vf.
* config/riscv/vector.md (@pred_<optab><mode>_scalar): Allow VLS modes.

gcc/testsuite/ChangeLog:

* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfmul.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop.h: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop_data.h: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop_run.h: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmul-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmul-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmul-run-1-f64.c: New test.
* gcc.target/riscv/rvv/autovec/vls/floating-point-mul-2.c: Adjust scan dump.
* gcc.target/riscv/rvv/autovec/vls/floating-point-mul-3.c: Likewise.
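For illustration (a hypothetical function, not part of the patch), a vector-times-scalar loop of this shape produces the vec_duplicate feeding vfmul.vv that the new pattern folds into vfmul.vf:

  /* Hypothetical trigger: `x' is broadcast into a vector register
     (vec_duplicate); with the new pattern the multiply consumes the
     scalar operand directly instead.  */
  void
  mul_by_scalar (float *restrict out, const float *restrict in,
                 float x, unsigned n)
  {
    for (unsigned i = 0; i < n; i++)
      out[i] = in[i] * x;
  }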
2025-08-26Fix RISC-V bootstrapJeff Law1-1/+1
Recent changes from Kito introduced an unused parameter. On the assumption that he's likely going to want it as part of the API, I've simply removed the parameter's name until such time as Kito needs it.

This should restore bootstrapping to the RISC-V port. Committing now rather than waiting for the CI system, given bootstrap builds currently fail.

* config/riscv/riscv.cc (riscv_arg_partial_bytes): Remove name from unused parameter.
2025-08-26Compute vect_reduc_type off SLP node instead of stmt-infoRichard Biener1-7/+14
The following changes the vect_reduc_type API to work on the SLP node. The API is only used from the aarch64 backend, so all changes are there. In particular, I noticed aarch64_force_single_cycle is invoked even for scalar costing (where the flag tested isn't computed yet); I figured that in scalar costing all reductions are a single cycle (a simplified sketch of that assumption follows the ChangeLog entries below).

* tree-vectorizer.h (vect_reduc_type): Get SLP node as argument.
* config/aarch64/aarch64.cc (aarch64_sve_in_loop_reduction_latency): Take SLP node as argument and adjust.
(aarch64_in_loop_reduction_latency): Likewise.
(aarch64_detect_vector_stmt_subtype): Adjust.
(aarch64_vector_costs::count_ops): Likewise. Treat reductions during scalar costing as single-cycle.
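A hedged sketch of the single-cycle assumption (illustrative names and types; not the actual GCC implementation):

  #include <stdbool.h>

  /* Illustrative only: during scalar costing the force-single-cycle
     flag has not been computed yet, so reductions are simply treated
     as having a latency of one cycle.  */
  static int
  reduction_latency (bool costing_scalar, bool force_single_cycle,
                     int vector_latency)
  {
    if (costing_scalar || force_single_cycle)
      return 1;
    return vector_latency;
  }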
2025-08-26i386: Fix up recent changes to use GFNI for rotates/shifts [PR121658]Jakub Jelinek2-9/+11
The vgf2p8affineqb_<mode><mask_name> pattern uses the "register_operand" predicate for the first input operand, so using "general_operand" for the rotate operand passed to it leads to ICEs, and so does the "nonimmediate_operand" in the <insn>v16qi3 define_expand.

The following patch fixes it by using "register_operand" in the former case (that pattern is TARGET_GFNI only) and using force_reg in the latter case (the pattern is TARGET_XOP || TARGET_GFNI, and for XOP we can handle a MEM operand).

The rest of the changes are small formatting tweaks or use of const0_rtx instead of GEN_INT (0).

2025-08-26  Jakub Jelinek  <jakub@redhat.com>

	PR target/121658
	* config/i386/sse.md (<insn><mode>3 any_shift): Use const0_rtx
	instead of GEN_INT (0).
	(cond_<insn><mode> any_shift): Likewise. Formatting fix.
	(<insn><mode>3 any_rotate): Use register_operand predicate instead
	of general_operand for match_operand 1. Use const0_rtx instead of
	GEN_INT (0).
	(<insn>v16qi3 any_rotate): Use force_reg on operands[1]. Formatting
	fix.
	* config/i386/i386.cc (ix86_shift_rotate_cost): Comment formatting
	fixes.
	* gcc.target/i386/pr121658.c: New test.
2025-08-26RISC-V: Combine vec_duplicate + vmacc.vv to vmacc.vx on GR2VR costPan Li2-0/+119
This patch combines vec_duplicate + vmacc.vv into vmacc.vx for code like the example below. The related pattern depends on the cost of a vec_duplicate from GR2VR: late-combine performs the combination if the GR2VR cost is zero and rejects it if the GR2VR cost is greater than zero.

Assume we have example code like below, with a GR2VR cost of 0 (a hand-expanded instance of the macro is shown after the ChangeLog entries below):

  #define DEF_VX_TERNARY_CASE_0(T, OP_1, OP_2, NAME)                        \
  void                                                                      \
  test_vx_ternary_##NAME##_##T##_case_0 (T * restrict vd, T * restrict vs2, \
                                         T rs1, unsigned n)                 \
  {                                                                         \
    for (unsigned i = 0; i < n; i++)                                        \
      vd[i] = vd[i] OP_2 vs2[i] OP_1 rs1;                                   \
  }

  DEF_VX_TERNARY_CASE_0(int32_t, *, +, macc)

Before this patch:
  11 │ beq a3,zero,.L8
  12 │ vsetvli a5,zero,e32,m1,ta,ma
  13 │ vmv.v.x v2,a2
  ...
  16 │ .L3:
  17 │ vsetvli a5,a3,e32,m1,ta,ma
  ...
  22 │ vmacc.vv v1,v2,v3
  ...
  25 │ bne a3,zero,.L3

After this patch:
  11 │ beq a3,zero,.L8
  ...
  14 │ .L3:
  15 │ vsetvli a5,a3,e32,m1,ta,ma
  ...
  20 │ vmacc.vx v1,a2,v3
  ...
  23 │ bne a3,zero,.L3

gcc/ChangeLog:

* config/riscv/vector.md (@pred_mul_plus_vx_<mode>): Add new pattern to generate vmacc rtl.
(*pred_macc_<mode>_scalar_undef): Ditto.
* config/riscv/autovec-opt.md (*vmacc_vx_<mode>): Add new pattern to match the vmacc vx combine.

Signed-off-by: Pan Li <pan2.li@intel.com>
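For reference, the DEF_VX_TERNARY_CASE_0 instantiation above expands to roughly the following function (expansion written out by hand; the include is added to make it self-contained):

  #include <stdint.h>

  /* OP_1 = '*', OP_2 = '+': a multiply-accumulate with a scalar
     multiplicand, which maps onto vmacc.vx once rs1 stays in a GPR.  */
  void
  test_vx_ternary_macc_int32_t_case_0 (int32_t * restrict vd,
                                       int32_t * restrict vs2,
                                       int32_t rs1, unsigned n)
  {
    for (unsigned i = 0; i < n; i++)
      vd[i] = vd[i] + vs2[i] * rs1;
  }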
2025-08-25xtensa: Make use of compact insn definition syntax for insns that have multiple alternativesTakayuki 'January June' Suwa1-174/+151
The use of compact syntax makes the relationship between asm output, operand constraints, and insn attributes easier to understand and modify, especially for "mov<mode>_internal".

gcc/ChangeLog:

* config/xtensa/xtensa.md (addsi3, <u>mulhisi3, andsi3, zero_extend<mode>si2, extendhisi2_internal, movsi_internal, movhi_internal, movqi_internal, movsf_internal, ashlsi3_internal, ashrsi3, lshrsi3, rotlsi3, rotrsi3): Rewrite in compact syntax.
2025-08-25xtensa: Simplify "*masktrue_const_bitcmpl" insn patternTakayuki 'January June' Suwa1-1/+1
gcc/ChangeLog:

* config/xtensa/xtensa.md (the auxiliary define_split for *masktrue_const_bitcmpl): Use a more concise function call, i.e., (1 << GET_MODE_BITSIZE (mode)) - 1 is equivalent to GET_MODE_MASK (mode).
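The equivalence is easy to sanity-check outside of GCC; a small standalone program (illustrative only, independent of GCC internals):

  #include <stdio.h>

  /* For an N-bit mode, (1 << N) - 1 is the all-ones mask of that
     width, i.e. exactly the value GET_MODE_MASK (mode) denotes.  */
  int
  main (void)
  {
    for (int bits = 1; bits <= 16; bits++)
      printf ("%2d bits: mask = 0x%llx\n", bits, (1ULL << bits) - 1);
    return 0;
  }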