path: root/gcc/config
Age | Commit message | Author | Files/Lines
2025-06-19 | RISC-V: Adding cost model for zilsd | Kito Cheng | 1 file, -0/+41
The motivation for this patch is that we want to use ld/sd where possible when zilsd is enabled. However, the subreg pass may split such a move into two lw/sw instructions because of the cost model, which only checks the cost of a 64-bit register move; that is why we need to adjust the cost for 64-bit register moves as well.

Even with the adjusted cost model, a 64-bit shift still uses 32-bit loads because it has already been split at expand time. This may need a fix on the expander side, which apparently needs more time to investigate, so for now a testcase with XFAIL documents the current behavior, and we can fix that when we have time. Longer term, we may add a new field to riscv_tune_param to control this cost model.

gcc/ChangeLog:

    * config/riscv/riscv.cc (riscv_cost_model): Add cost model for zilsd.

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/zilsd-code-gen-split-subreg-1.c: New test.
    * gcc.target/riscv/zilsd-code-gen-split-subreg-2.c: New test.
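As a rough illustration (a hypothetical sketch, not taken from the patch; the -march spelling is an assumption), the kind of 64-bit copy that should become a single ld/sd pair on rv32 with zilsd:

    /* Compile for rv32 with zilsd enabled, e.g. -march=rv32imac_zilsd
       (assumed spelling).  A 64-bit copy like this is a candidate for
       one ld/sd pair; without the cost adjustment, the subreg pass may
       split it into two lw/sw pairs.  */
    #include <stdint.h>

    void copy64 (uint64_t *dst, const uint64_t *src)
    {
      *dst = *src;
    }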
2025-06-19 | x86: Fix shrink wrap separate ICE under -fstack-clash-protection [PR120697] | Lili Cui | 1 file, -13/+1
gcc/ChangeLog:

    PR target/120697
    * config/i386/i386.cc (ix86_expand_prologue): Remove 3 assertions and associated code.

gcc/testsuite/ChangeLog:

    PR target/120697
    * gcc.target/i386/stack-clash-protection.c: New test.
2025-06-18 | RISC-V: Combine vec_duplicate + vmin.vv to vmin.vx on GR2VR cost | Pan Li | 3 files, -2/+5
This patch would like to combine the vec_duplicate + vmin.vv into vmin.vx, as in the example below. The related pattern depends on the cost of vec_duplicate from GR2VR: the late-combine pass will take action if the GR2VR cost is zero, and reject the combination if the GR2VR cost is greater than zero.

Assume we have example code like below, with a GR2VR cost of 0:

    #define DEF_VX_BINARY(T, FUNC)                                        \
    void                                                                  \
    test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n)   \
    {                                                                     \
      for (unsigned i = 0; i < n; i++)                                    \
        out[i] = FUNC (in[i], x);                                         \
    }

    int32_t min (int32_t a, int32_t b)
    {
      return a > b ? b : a;
    }

    DEF_VX_BINARY(int32_t, min)

Before this patch:

    test_vx_binary_or_int32_t_case_0:
            beq     a3,zero,.L8
            vsetvli a5,zero,e32,m1,ta,ma
            vmv.v.x v2,a2
            slli    a3,a3,32
            srli    a3,a3,32
    .L3:
            vsetvli a5,a3,e32,m1,ta,ma
            vle32.v v1,0(a1)
            slli    a4,a5,2
            sub     a3,a3,a5
            add     a1,a1,a4
            vmin.vv v1,v1,v2
            vse32.v v1,0(a0)
            add     a0,a0,a4
            bne     a3,zero,.L3

After this patch:

    test_vx_binary_or_int32_t_case_0:
            beq     a3,zero,.L8
            slli    a3,a3,32
            srli    a3,a3,32
    .L3:
            vsetvli a5,a3,e32,m1,ta,ma
            vle32.v v1,0(a1)
            slli    a4,a5,2
            sub     a3,a3,a5
            add     a1,a1,a4
            vmin.vx v1,v1,a2
            vse32.v v1,0(a0)
            add     a0,a0,a4
            bne     a3,zero,.L3

gcc/ChangeLog:

    * config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add new case SMIN.
    (expand_vx_binary_vec_vec_dup): Ditto.
    * config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
    * config/riscv/vector-iterators.md: Add new op smin.

Signed-off-by: Pan Li <pan2.li@intel.com>
2025-06-18 | x86: Enable separate shrink wrapping | Lili Cui | 4 files, -45/+320
This commit implements the target macros (TARGET_SHRINK_WRAP_*) that enable separate shrink wrapping for function prologues/epilogues on x86.

When performing separate shrink wrapping, we choose to use mov instead of push/pop, because push/pop makes it more complicated to handle the rsp adjustment and may lose performance. Using mov has a small impact on code size but preserves performance. Using mov means we need sub/add to maintain the stack frame. In some special cases we need lea to avoid affecting EFLAGS; for example, to avoid inserting a sub that would change EFLAGS between test-je-jle, lea is used instead:

    foo:
            xorl    %eax, %eax
            testl   %edi, %edi
            je      .L11
            sub     $16, %rsp      ------>   leaq    -16(%rsp), %rsp
            movq    %r13, 8(%rsp)
            movl    $1, %r13d
            jle     .L4

Tested against SPEC CPU 2017, this change always has a net-positive effect on the dynamic instruction count. See the following table for the breakdown of how this reduces the number of dynamic instructions per workload on a like-for-like comparison (without/with this commit):

    instruction count    base           with commit    (commit-base)/commit
    502.gcc_r            98666845943    96891561634    -1.80%
    526.blender_r        6.21226E+11    6.12992E+11    -1.33%
    520.omnetpp_r        1.1241E+11     1.11093E+11    -1.17%
    500.perlbench_r      1271558717     1263268350     -0.65%
    523.xalancbmk_r      2.20103E+11    2.18836E+11    -0.58%
    531.deepsjeng_r      2.73591E+11    2.72114E+11    -0.54%
    500.perlbench_r      64195557393    63881512409    -0.49%
    541.leela_r          2.99097E+11    2.98245E+11    -0.29%
    548.exchange2_r      1.27976E+11    1.27784E+11    -0.15%
    527.cam4_r           88981458425    88887334679    -0.11%
    554.roms_r           2.60072E+11    2.59809E+11    -0.10%

Collected SPEC2017 performance on ZNVER5, EMR and ICELAKE; no performance regression was observed. For O2 multi-copy:

    511.povray_r improved by 2.8% on ZNVER5.
    511.povray_r improved by 4% on EMR.
    511.povray_r improved by 3.3% ~ 4.6% on ICELAKE.

gcc/ChangeLog:

    * config/i386/i386-protos.h (ix86_get_separate_components): New function.
    (ix86_components_for_bb): Likewise.
    (ix86_disqualify_components): Likewise.
    (ix86_emit_prologue_components): Likewise.
    (ix86_emit_epilogue_components): Likewise.
    (ix86_set_handled_components): Likewise.
    * config/i386/i386.cc (save_regs_using_push_pop): Split from ix86_compute_frame_layout.
    (ix86_compute_frame_layout): Use save_regs_using_push_pop.
    (pro_epilogue_adjust_stack): Use gen_pro_epilogue_adjust_stack_add_nocc.
    (ix86_expand_prologue): Add some assertions and adjust the stack frame at the beginning of the prologue for shrink wrapping separate.
    (ix86_emit_save_regs_using_mov): Skip registers that are wrapped separately.
    (ix86_emit_restore_regs_using_mov): Likewise.
    (ix86_expand_epilogue): Add some assertions and set restore_regs_via_mov to true for shrink wrapping separate.
    (ix86_get_separate_components): New function.
    (ix86_components_for_bb): Likewise.
    (ix86_disqualify_components): Likewise.
    (ix86_emit_prologue_components): Likewise.
    (ix86_emit_epilogue_components): Likewise.
    (ix86_set_handled_components): Likewise.
    (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define.
    (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise.
    (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise.
    (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise.
    (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise.
    (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise.
    * config/i386/i386.h (struct machine_function): Add reg_is_wrapped_separately array for register wrapping information.
    * config/i386/i386.md (@pro_epilogue_adjust_stack_add_nocc<mode>): New.

gcc/testsuite/ChangeLog:

    * gcc.target/x86_64/abi/callabi/leaf-2.c: Adjust the test.
    * gcc.target/i386/interrupt-16.c: Likewise.
    * gfortran.dg/guality/arg1.f90: Likewise.
    * gcc.target/i386/avx10_2-comibf-1.c: Likewise.
    * g++.target/i386/shrink_wrap_separate.C: New test.
    * gcc.target/i386/shrink_wrap_separate_check_lea.c: Likewise.

Co-authored-by: Michael Matz <matz@suse.de>
2025-06-17 | [PATCH v1] RISC-V: Use scratch reg for loop control | Umesh Kalappa | 1 file, -6/+5
By using the scratch register for loop control rather than the output of the lr instruction, we can avoid an unnecessary "mv" instruction.

V2: Testcase updated; no regressions found with these changes.

gcc/ChangeLog:

    * config/riscv/sync.md (lrsc_atomic_exchange<mode>): Use scratch register for loop control rather than lr output.

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/zalrsc.c: New test.
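For reference, a minimal sketch (not from the patch) of source that exercises this LR/SC exchange loop on a target with only Zalrsc (no Zaamo):

    #include <stdint.h>

    /* On such a target, __atomic_exchange_n expands to an lr.w/sc.w
       retry loop; the patch makes the loop branch test the sc.w scratch
       result directly instead of first copying the lr.w output with an
       extra mv.  */
    uint32_t exchange (uint32_t *p, uint32_t v)
    {
      return __atomic_exchange_n (p, v, __ATOMIC_SEQ_CST);
    }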
2025-06-17 | aarch64: Add vec_set/extract for tuple modes [PR113027] | Richard Sandiford | 4 files, -0/+109
We generated inefficient code for bitfield references to Advanced SIMD structure modes. In RTL, these modes are just extra-long vectors, and so inserting and extracting an element is simply a vec_set or vec_extract operation.

For the record, I don't think these modes should ever become fully fledged vector modes. We shouldn't provide add, etc. for them. But vec_set and vec_extract are the vector equivalent of insv and extv. From that point of view, they seem closer to moves than to arithmetic.

gcc/
    PR target/113027
    * config/aarch64/aarch64-protos.h (aarch64_decompose_vec_struct_index): Declare.
    * config/aarch64/aarch64.cc (aarch64_decompose_vec_struct_index): New function.
    * config/aarch64/iterators.md (VEL, Vel): Add Advanced SIMD structure modes.
    * config/aarch64/aarch64-simd.md (vec_set<VSTRUCT_QD:mode>)
    (vec_extract<VSTRUCT_QD:mode>): New patterns.

gcc/testsuite/
    PR target/113027
    * gcc.target/aarch64/pr113027-1.c: New test.
    * gcc.target/aarch64/pr113027-2.c: Likewise.
    * gcc.target/aarch64/pr113027-3.c: Likewise.
    * gcc.target/aarch64/pr113027-4.c: Likewise.
    * gcc.target/aarch64/pr113027-5.c: Likewise.
    * gcc.target/aarch64/pr113027-6.c: Likewise.
    * gcc.target/aarch64/pr113027-7.c: Likewise.
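As a hypothetical illustration (assumed, not from the patch) of such a bitfield reference, using an ACLE tuple type:

    #include <arm_neon.h>

    /* int32x4x2_t maps to an Advanced SIMD structure mode; extracting
       a lane of one of its vectors can now go through vec_extract
       instead of a round trip through memory.  */
    int32_t get_lane (int32x4x2_t t)
    {
      return vgetq_lane_s32 (t.val[1], 2);
    }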
2025-06-16 | aarch64: Add support for unpacked SVE FP conversions | Spencer Abson | 5 files, -35/+242
This patch introduces expanders for FP<-FP conversions that leverage partial vector modes. We also extend the INT<-FP and FP<-INT conversions using the same approach.

The ACLE enables vectorized conversions like the following:

    fcvt    z0.h, p7/m, z1.s

modelling the source vector as VNx4SF:

    ... | SF | SF | SF | SF |

and the destination as a VNx8HF, where this operation would yield:

    ... | 0 | HF | 0 | HF | 0 | HF | 0 | HF |

hence the useful results are stored unpacked, i.e.

    ... | X | HF | X | HF | X | HF | X | HF |   (VNx4HF)

This patch allows the vectorizer to use this variant of fcvt as a conversion from VNx4SF to VNx4HF. The same idea applies to widening conversions, and between vectors with FP and integer base types.

If the source itself had been unpacked, e.g.

    ... | X | SF | X | SF |   (VNx2SF)

the result would yield

    ... | X | X | X | HF | X | X | X | HF |   (VNx2HF)

The upper bits of each container here are undefined; it's important to avoid interpreting them during FP operations, since doing so could introduce spurious traps. The obvious route we've taken here is to mask undefined lanes using the operation's predicate if we have flag_trapping_math.

The VPRED predicate mode (e.g. VNx2BI here) cannot do this; to ensure correct behavior, we need a predicate mode that can control the data as if it were fully packed (VNx4BI). Both VNx2BI and VNx4BI must be recognised as legal governing predicate modes by the corresponding FP insns. In general, the governing predicate mode for an insn could be any such mode with at least as many significant lanes as the data mode. For example, addvnx4hf3 could be controlled by any of VNx{4,8,16}BI. We implement 'aarch64_predicate_operand', a new define_special_predicate, to achieve this.

gcc/ChangeLog:

    * config/aarch64/aarch64-protos.h (aarch64_sve_valid_pred_p): Declare helper for aarch64_predicate_operand.
    (aarch64_sve_packed_pred): Declare helper for new expanders.
    (aarch64_sve_fp_pred): Likewise.
    * config/aarch64/aarch64-sve.md (<optab><mode><v_int_equiv>2): Extend into...
    (<optab><SVE_HSF:mode><SVE_HSDI:mode>2): New expander for converting vectors of HF,SF to vectors of HI,SI,DI.
    (<optab><VNx2DF_ONLY:mode><SVE_2SDI:mode>2): New expander for converting vectors of SI,DI to vectors of DF.
    (*aarch64_sve_<optab>_nontrunc<SVE_PARTIAL_F:mode><SVE_HSDI:mode>): New pattern to match those we've added here.
    (@aarch64_sve_<optab>_trunc<VNx2DF_ONLY:mode><VNx4SI_ONLY:mode>): Extend into...
    (@aarch64_sve_<optab>_trunc<VNx2DF_ONLY:mode><SVE_SI:mode>): Match both VNx2SI<-VNx2DF and VNx4SI<-VNx4DF.
    (<optab><v_int_equiv><mode>2): Extend into...
    (<optab><SVE_HSDI:mode><SVE_F:mode>2): New expander for converting vectors of HI,SI,DI to vectors of HF,SF,DF.
    (*aarch64_sve_<optab>_nonextend<SVE_HSDI:mode><SVE_PARTIAL_F:mode>): New pattern to match those we've added here.
    (trunc<SVE_SDF:mode><SVE_PARTIAL_HSF:mode>2): New expander to handle narrowing ('truncating') FP<-FP conversions.
    (*aarch64_sve_<optab>_trunc<SVE_SDF:mode><SVE_PARTIAL_HSF:mode>): New pattern to handle those we've added here.
    (extend<SVE_PARTIAL_HSF:mode><SVE_SDF:mode>2): New expander to handle widening ('extending') FP<-FP conversions.
    (*aarch64_sve_<optab>_nontrunc<SVE_PARTIAL_HSF:mode><SVE_SDF:mode>): New pattern to handle those we've added here.
    * config/aarch64/aarch64.cc (aarch64_sve_packed_pred): New function.
    (aarch64_sve_fp_pred): Likewise.
    (aarch64_sve_valid_pred_p): Likewise.
    * config/aarch64/iterators.md (SVE_PARTIAL_HSF): New mode iterator.
    (SVE_HSF): Likewise.
    (SVE_SDF): Likewise.
    (SVE_SI): Likewise.
    (SVE_2SDI): Likewise.
    (self_mask): Extend to all integer/FP vector modes.
    (narrower_mask): Likewise (excluding QI).
    * config/aarch64/predicates.md (aarch64_predicate_operand): New special predicate to handle narrower predicate modes.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/sve/pack_fcvt_signed_1.c: Disable the aarch64 vector cost model to preserve this test.
    * gcc.target/aarch64/sve/pack_fcvt_unsigned_1.c: Likewise.
    * gcc.target/aarch64/sve/pack_float_1.c: Likewise.
    * gcc.target/aarch64/sve/unpack_float_1.c: Likewise.
    * gcc.target/aarch64/sve/unpacked_cvtf_1.c: New test.
    * gcc.target/aarch64/sve/unpacked_cvtf_2.c: Likewise.
    * gcc.target/aarch64/sve/unpacked_cvtf_3.c: Likewise.
    * gcc.target/aarch64/sve/unpacked_fcvt_1.c: Likewise.
    * gcc.target/aarch64/sve/unpacked_fcvt_2.c: Likewise.
    * gcc.target/aarch64/sve/unpacked_fcvtz_1.c: Likewise.
    * gcc.target/aarch64/sve/unpacked_fcvtz_2.c: Likewise.
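As a rough sketch (an assumed example, not from the patch) of a loop that can now use the unpacked fcvt variant when vectorized for SVE:

    /* float -> _Float16 narrowing; with partial vector modes the
       vectorizer can convert VNx4SF data directly to (unpacked) VNx4HF
       using fcvt z0.h, p/m, z1.s.  */
    void narrow (_Float16 *restrict dst, const float *restrict src, int n)
    {
      for (int i = 0; i < n; i++)
        dst[i] = (_Float16) src[i];
    }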
2025-06-16 | aarch64: Extend iterator support for partial SVE FP modes | Spencer Abson | 2 files, -121/+136
Define new iterators for partial floating-point modes, and cover these in some existing mode_attrs. This patch serves as a starting point for an effort to extend support for unpacked floating-point operations.

To differentiate between BFloat mode iterators that need to test TARGET_SSVE_B16B16 and those that don't (see LOGICALF), this patch enforces the following naming convention:

- _BF: BF16 modes will not test TARGET_SSVE_B16B16.
- _B16B16: BF16 modes will test TARGET_SSVE_B16B16.

gcc/ChangeLog:

    * config/aarch64/aarch64-sve.md: Replace uses of SVE_FULL_F_BF with SVE_FULL_F_B16B16. Replace use of SVE_F with SVE_F_BF.
    * config/aarch64/iterators.md (SVE_PARTIAL_F): New iterator for partial SVE FP modes.
    (SVE_FULL_F_BF): Rename to SVE_FULL_F_B16B16.
    (SVE_PARTIAL_F_B16B16): New iterator (BF16 included) for partial SVE FP modes.
    (SVE_F_B16B16): New iterator for all SVE FP modes.
    (SVE_BF): New iterator for all SVE BF16 modes.
    (SVE_F): Redefine to exclude BF16 modes.
    (SVE_F_BF): New iterator to replace the previous SVE_F.
    (VPRED): Describe the VPRED mapping for partial vector modes.
    (b): Cover partial FP modes.
    (is_bf16): Likewise.
2025-06-16 | aarch64: add support for AEABI Build Attributes | Matthieu Longo | 3 files, -9/+247
GCS (Guarded Control Stack, an Armv9.4-A extension) requires some caution at runtime. The runtime linker needs to reason about the compatibility of a set of relocatable object files that might not have been compiled with the same compiler.

Up until now, this metadata, used for the previously mentioned runtime checks, has been provided to the runtime linker via GNU properties, which are stored in the ELF section ".note.gnu.property". However, GNU properties are limited in their expressibility, and a long-term commitment was made in the ABI for the Arm architecture [1] to provide Build Attributes (a.k.a. BAs).

This patch adds support for emitting AArch64 Build Attributes. This support includes generating two new assembler directives: .aeabi_subsection and .aeabi_attribute. These directives are generated as per the syntax mentioned in the spec "Build Attributes for the Arm® 64-bit Architecture (AArch64)" available at [1].

gcc/configure.ac now includes a new check to test whether the assembler being used to build the toolchain supports these new directives.

Two behaviors can be observed when -mbranch-protection=[standard|...] is passed:

- If the assembler supports BAs, GCC emits the BA directives and no GNU properties. Note: the static linker will derive the values of the GNU properties from the BAs, and will emit both BAs and GNU properties into the output object.
- If the assembler does not support them, only the .note.gnu.property section will contain the relevant information.

Bootstrapped on aarch64-none-linux-gnu, and no regression found.

[1]: https://github.com/ARM-software/abi-aa/pull/230

gcc/ChangeLog:

    * config.in: Regenerate.
    * config/aarch64/aarch64-elf-metadata.h (class aeabi_subsection): New class for BAs.
    * config/aarch64/aarch64-protos.h (aarch64_pacret_enabled): New function.
    * config/aarch64/aarch64.cc (HAVE_AS_AEABI_BUILD_ATTRIBUTES): New definition.
    (aarch64_file_end_indicate_exec_stack): Emit BAs.
    (aarch64_pacret_enabled): New function.
    (aarch64_start_file): Indent.
    * configure: Regenerate.
    * configure.ac: New configure check for BA support in binutils.

gcc/testsuite/ChangeLog:

    * lib/target-supports.exp: (check_effective_target_aarch64_gas_has_build_attributes): New checker.
    * gcc.target/aarch64/build-attributes/aarch64-build-attributes.exp: New DejaGNU file.
    * gcc.target/aarch64/build-attributes/build-attribute-bti.c: New test.
    * gcc.target/aarch64/build-attributes/build-attribute-gcs.c: New test.
    * gcc.target/aarch64/build-attributes/build-attribute-pac.c: New test.
    * gcc.target/aarch64/build-attributes/build-attribute-standard.c: New test.
    * gcc.target/aarch64/build-attributes/no-build-attribute-bti.c: New test.
    * gcc.target/aarch64/build-attributes/no-build-attribute-gcs.c: New test.
    * gcc.target/aarch64/build-attributes/no-build-attribute-pac.c: New test.
    * gcc.target/aarch64/build-attributes/no-build-attribute-standard.c: New test.

Co-Authored-By: Srinath Parvathaneni <srinath.parvathaneni@arm.com>
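For illustration, the emitted assembly looks roughly like the following sketch; the exact subsection name, tag spellings, and values here are assumptions based on the spec, not copied from the patch:

    .aeabi_subsection   aeabi_feature_and_bits, optional, uleb128
    .aeabi_attribute    Tag_Feature_BTI, 1    // assumed tag names/values
    .aeabi_attribute    Tag_Feature_PAC, 1
    .aeabi_attribute    Tag_Feature_GCS, 1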
2025-06-16 | aarch64: encapsulate note.gnu.property emission into a class | Matthieu Longo | 4 files, -75/+210
The code emitting the GNU properties was moved to a separate file to improve modularity and "relieve" the 31000-line-long aarch64.cc file of a few lines. It introduces a new namespace "aarch64::" for the AArch64 backend, which reduces the length of function names by not prepending 'aarch64_' to each of them.

gcc/ChangeLog:

    * Makefile.in: Add missing declaration of BACKEND_H.
    * config.gcc: Add aarch64-elf-metadata.o to extra_objs.
    * config/aarch64/aarch64-elf-metadata.h: New file.
    * config/aarch64/aarch64-elf-metadata.cc: New file.
    * config/aarch64/aarch64.cc (GNU_PROPERTY_AARCH64_FEATURE_1_AND): Removed.
    (GNU_PROPERTY_AARCH64_FEATURE_1_BTI): Likewise.
    (GNU_PROPERTY_AARCH64_FEATURE_1_PAC): Likewise.
    (GNU_PROPERTY_AARCH64_FEATURE_1_GCS): Likewise.
    (aarch64_file_end_indicate_exec_stack): Move GNU properties code to aarch64-elf-metadata.cc.
    * config/aarch64/t-aarch64: Declare target aarch64-elf-metadata.o.
2025-06-16 | aarch64: add debug comments to feature properties in .note.gnu.property | Matthieu Longo | 1 file, -2/+33
GNU properties are emitted to provide some information about the features used in the generated code, like BTI, GCS, or PAC. However, no debug comments are emitted in the generated assembly even if -dA is provided. This makes understanding the information stored in the .note.gnu.property section more difficult than necessary.

This patch adds assembly comments (if -dA is provided) next to the GNU properties. For instance, if BTI and PAC are enabled, it will emit:

    .word   0x3  // GNU_PROPERTY_AARCH64_FEATURE_1_AND (BTI, PAC)

gcc/ChangeLog:

    * config/aarch64/aarch64.cc (aarch64_file_end_indicate_exec_stack): Emit assembly comments.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/bti-1.c: Emit assembly comments, and update test assertion.
2025-06-15xtensa: Revert "xtensa: Eliminate unwanted reg-reg moves during DFmode input ↵Takayuki 'January June' Suwa2-43/+0
reloads" Since there are no unwanted reg-reg moves during DFmode input reloads in recent GCCs, the previously committed patch "xtensa: eliminate unwanted reg-reg moves during DFmode input reloads" (commit cfad4856fa46abc878934a9433d0bfc2482ccf00) is no longer necessary and is therefore being reverted. gcc/ChangeLog: * config/xtensa/predicates.md (reload_operand): Remove. * config/xtensa/xtensa.md: Remove the peephole2 pattern that was previously added.
2025-06-15 | xtensa: Revert "xtensa: Eliminate unnecessary general-purpose reg-reg moves" | Takayuki 'January June' Suwa | 1 file, -46/+0
Due to improved register allocation for GP registers whose modes have been changed by paradoxical SUBREGs, the previously committed patch "xtensa: eliminate unnecessary general-purpose reg-reg moves" (commit f83e76c3f998c8708fe2ddca16ae3f317c39c37a) is no longer necessary and is therefore reverted.

gcc/ChangeLog:

    * config/xtensa/xtensa.md: Remove the peephole2 pattern that was previously added.

gcc/testsuite/ChangeLog:

    * gcc.target/xtensa/elim_GP_regmove_0.c: Remove.
    * gcc.target/xtensa/elim_GP_regmove_1.c: Remove.
2025-06-15 | RISC-V: Combine vec_duplicate + vmaxu.vv to vmaxu.vx on GR2VR cost | Pan Li | 3 files, -2/+5
This patch would like to combine the vec_duplicate + vmaxu.vv into vmaxu.vx, as in the example below. The related pattern depends on the cost of vec_duplicate from GR2VR: the late-combine pass will take action if the GR2VR cost is zero, and reject the combination if the GR2VR cost is greater than zero.

Assume we have example code like below, with a GR2VR cost of 0:

    #define DEF_VX_BINARY(T, OP)                                          \
    void                                                                  \
    test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n)   \
    {                                                                     \
      for (unsigned i = 0; i < n; i++)                                    \
        out[i] = in[i] OP x;                                              \
    }

    DEF_VX_BINARY(int32_t, /)

Before this patch:

    test_vx_binary_or_int32_t_case_0:
            beq      a3,zero,.L8
            vsetvli  a5,zero,e32,m1,ta,ma
            vmv.v.x  v2,a2
            slli     a3,a3,32
            srli     a3,a3,32
    .L3:
            vsetvli  a5,a3,e32,m1,ta,ma
            vle32.v  v1,0(a1)
            slli     a4,a5,2
            sub      a3,a3,a5
            add      a1,a1,a4
            vmaxu.vv v1,v1,v2
            vse32.v  v1,0(a0)
            add      a0,a0,a4
            bne      a3,zero,.L3

After this patch:

    test_vx_binary_or_int32_t_case_0:
            beq      a3,zero,.L8
            slli     a3,a3,32
            srli     a3,a3,32
    .L3:
            vsetvli  a5,a3,e32,m1,ta,ma
            vle32.v  v1,0(a1)
            slli     a4,a5,2
            sub      a3,a3,a5
            add      a1,a1,a4
            vmaxu.vx v1,v1,a2
            vse32.v  v1,0(a0)
            add      a0,a0,a4
            bne      a3,zero,.L3

gcc/ChangeLog:

    * config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add new case UMAX.
    (expand_vx_binary_vec_vec_dup): Ditto.
    * config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
    * config/riscv/vector-iterators.md: Add new op umax.

Signed-off-by: Pan Li <pan2.li@intel.com>
2025-06-14 | AVR: Fix PR120423 / PR116389. | Georg-Johann Lay | 1 file, -0/+35
The problem with PR120423 and PR116389 is that reload might assign an invalid hard register to a paradoxical subreg. For example, with the test case from the PR, it assigns (REG:QI 31) to the inner of (subreg:HI (QI) 0), which is valid, but the subreg will be turned into (REG:HI 31), which is invalid and triggers an ICE in postreload. The problem only occurs with the old reload pass.

The patch maps the paradoxical subregs to zero-extends, which will be allocated correctly. For the 120423 test cases, the code is the same as with -mlra (which doesn't implement the fix), so the patch doesn't even introduce a performance penalty.

The patch is only needed for v15: v14 is not affected, and in v16 reload will be removed.

    PR rtl-optimization/120423
    PR rtl-optimization/116389

gcc/
    * config/avr/avr.md [-mno-lra]: Add pre-reload split to transform (left shift of) a paradoxical subreg to a (left shift of) zero-extend.

gcc/testsuite/
    * gcc.target/avr/torture/pr120423-1.c: New test.
    * gcc.target/avr/torture/pr120423-2.c: New test.
    * gcc.target/avr/torture/pr120423-116389.c: New test.

(cherry picked from commit 61789b5abec3079d02ee9eaa7468015ab1f6f701)
2025-06-13 | aarch64: Fold NOT+PTEST to NOTS [PR118150] | Spencer Abson | 1 file, -0/+37
Add combiner patterns for folding NOT+PTEST to NOTS when they share the same governing predicate.

gcc/ChangeLog:

    PR target/118150
    * config/aarch64/aarch64-sve.md (*one_cmpl<mode>3_cc): New combiner pattern.
    (*one_cmpl<mode>3_ptest): Likewise.

gcc/testsuite/ChangeLog:

    PR target/118150
    * gcc.target/aarch64/sve/acle/general/not_1.c: New test.
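A hypothetical ACLE-level view (assumed, not from the patch) of the NOT+PTEST shape that can now fold into a single flag-setting NOTS:

    #include <arm_sve.h>

    /* svnot_z computes the NOT under pg; testing its result under the
       same governing predicate pg lets combine emit NOTS instead of a
       NOT followed by a PTEST.  */
    int any_clear (svbool_t pg, svbool_t x)
    {
      return svptest_any (pg, svnot_z (pg, x));
    }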
2025-06-13 | mcore: Don't use gen_rtx_MEM on __attribute__((dllimport)) | H.J. Lu | 1 file, -3/+1
On mcore-elf, mcore_mark_dllimport generated

    (gdb) call debug_tree (decl)
     <function_decl 0x7fffe9941200 f1
        type <function_type 0x7fffe981f000
            type <void_type 0x7fffe98180a8 void VOID
                align:8 warn_if_not_align:0 symtab:0 alias-set -1
                canonical-type 0x7fffe98180a8
                pointer_to_this <pointer_type 0x7fffe9818150>>
            HI
            size <integer_cst 0x7fffe9802738 constant 16>
            unit-size <integer_cst 0x7fffe9802750 constant 2>
            align:16 warn_if_not_align:0 symtab:0 alias-set -1
            canonical-type 0x7fffe981f000
            arg-types <tree_list 0x7fffe980b988 value <void_type 0x7fffe98180a8 void>>
            pointer_to_this <pointer_type 0x7fffe991b0a8>>
        addressable used public external decl_5 SI /tmp/x.c:1:40
        align:16 warn_if_not_align:0
        context <translation_unit_decl 0x7fffe9955080 /tmp/x.c>
        attributes <tree_list 0x7fffe9932708
            purpose <identifier_node 0x7fffe9954000 dllimport>>
        (mem:SI (mem:SI (symbol_ref:SI ("@i.__imp_f1")) [0 S4 A32]) [0 S4 A32])
        chain <function_decl 0x7fffe9941300 f2>>

which caused:

    (gdb) bt
    ... file=0x2c0f1c8 "/export/gnu/import/git/sources/gcc-test/gcc/calls.cc",
        line=3746, function=0x2c0f747 "expand_call")
        at /export/gnu/import/git/sources/gcc-test/gcc/diagnostic.cc:1780
    ... target=0x0, ignore=1)
        at /export/gnu/import/git/sources/gcc-test/gcc/calls.cc:3746
    ...
    (gdb) call debug_rtx (datum)
    (mem:SI (symbol_ref:SI ("@i.__imp_f1")) [0 S4 A32])
    (gdb)

Don't use gen_rtx_MEM in mcore_mark_dllimport, so that it generates

    (gdb) call debug_tree (fndecl)
     <function_decl 0x7fffe9941200 f1
        ... (same as above, except for the single mem) ...
        (mem:SI (symbol_ref:SI ("@i.__imp_f1")) [0 S4 A32])
        chain <function_decl 0x7fffe9941300 f2>>

instead. This fixes:

    gcc.c-torture/compile/dll.c -O0 (internal compiler error: in assemble_variable, at varasm.cc:2544)
    gcc.dg/visibility-12.c (internal compiler error: in expand_call, at calls.cc:3744)

for mcore-elf.

    PR target/120589
    * config/mcore/mcore.cc (mcore_mark_dllimport): Don't use gen_rtx_MEM.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
2025-06-12 | or1k: Fix ICE in libgcc caused by recent validate_subreg changes | Stafford Horne | 1 file, -1/+2
After commit eb2ea476db2 ("emit-rtl: Allow extra checks for paradoxical subregs [PR119966]"), paradoxical subregs of the OpenRISC condition flag register (reg:BI sr_f) are no longer allowed. This causes an ICE in the ce1 pass, which tries to get the or1k flag register into an SI register, which is no longer possible.

Adjust or1k_can_change_mode_class to allow changing the or1k flag reg from BI to SI mode, which in turn allows paradoxical subregs to be generated again.

gcc/ChangeLog:

    PR target/120587
    * config/or1k/or1k.cc (or1k_can_change_mode_class): Allow changing flags mode from BI to SI to allow for paradoxical subregs.
2025-06-12 | i386: Fix signed integer overflow in ix86_expand_int_movcc, part 2 [PR120604] | Uros Bizjak | 1 file, -9/+35
Make sure we can represent the difference between two 64-bit DImode immediate values in a 64-bit HOST_WIDE_INT and return false if this is not the case. ix86_expand_int_movcc is used in the mov<mode>cc expander. The expander will FAIL when the function returns false, and the middle-end will retry expansion with the values forced to registers.

    PR target/120604

gcc/ChangeLog:

    * config/i386/i386-expand.cc (ix86_expand_int_movcc): Make sure we can represent the difference between two 64-bit DImode immediate values in 64-bit HOST_WIDE_INT.
2025-06-12 | RISC-V: Combine vec_duplicate + vmax.vv to vmax.vx on GR2VR cost | Pan Li | 3 files, -2/+5
This patch would like to combine the vec_duplicate + vmax.vv into vmax.vx, as in the example below. The related pattern depends on the cost of vec_duplicate from GR2VR: the late-combine pass will take action if the GR2VR cost is zero, and reject the combination if the GR2VR cost is greater than zero.

Assume we have example code like below, with a GR2VR cost of 0:

    #define DEF_VX_BINARY(T, OP)                                          \
    void                                                                  \
    test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n)   \
    {                                                                     \
      for (unsigned i = 0; i < n; i++)                                    \
        out[i] = in[i] OP x;                                              \
    }

    DEF_VX_BINARY(int32_t, /)

Before this patch:

    test_vx_binary_or_int32_t_case_0:
            beq     a3,zero,.L8
            vsetvli a5,zero,e32,m1,ta,ma
            vmv.v.x v2,a2
            slli    a3,a3,32
            srli    a3,a3,32
    .L3:
            vsetvli a5,a3,e32,m1,ta,ma
            vle32.v v1,0(a1)
            slli    a4,a5,2
            sub     a3,a3,a5
            add     a1,a1,a4
            vmax.vv v1,v1,v2
            vse32.v v1,0(a0)
            add     a0,a0,a4
            bne     a3,zero,.L3

After this patch:

    test_vx_binary_or_int32_t_case_0:
            beq     a3,zero,.L8
            slli    a3,a3,32
            srli    a3,a3,32
    .L3:
            vsetvli a5,a3,e32,m1,ta,ma
            vle32.v v1,0(a1)
            slli    a4,a5,2
            sub     a3,a3,a5
            add     a1,a1,a4
            vmax.vx v1,v1,a2
            vse32.v v1,0(a0)
            add     a0,a0,a4
            bne     a3,zero,.L3

gcc/ChangeLog:

    * config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add new case SMAX.
    (expand_vx_binary_vec_vec_dup): Ditto.
    * config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
    * config/riscv/vector-iterators.md: Add new op smax.

Signed-off-by: Pan Li <pan2.li@intel.com>
2025-06-12 | aarch64: Incorrect removal of ZA restore [PR120624] | Richard Sandiford | 2 files, -0/+10
The PCS defines a lazy save scheme for managing ZA across normal "private-ZA" functions. GCC currently uses this scheme for calls to all private-ZA functions (rather than using caller-save). Therefore, before a sequence of calls to private-ZA functions, GCC emits code to set up a lazy save. After the sequence of calls, GCC emits code to check whether the lazy save was committed, and to restore the ZA contents if so. These sequences are emitted by the mode-switching pass, in an attempt to reduce the number of redundant saves and restores.

The lazy save scheme also means that, before a function can use ZA, it must first conditionally store the old contents of ZA to the caller's lazy save buffer, if any.

This all creates some relatively complex dependencies between setup code, save/restore code, and normal reads from and writes to ZA. These dependencies are modelled using special fake hard registers:

    ;; Sometimes we use placeholder instructions to mark where later
    ;; ABI-related lowering is needed.  These placeholders read and
    ;; write this register.  Instructions that depend on the lowering
    ;; read the register.
    (LOWERING_REGNUM 87)

    ;; Represents the contents of the current function's TPIDR2 block,
    ;; in abstract form.
    (TPIDR2_BLOCK_REGNUM 88)

    ;; Holds the value that the current function wants PSTATE.ZA to be.
    ;; The actual value can sometimes vary, because it does not track
    ;; changes to PSTATE.ZA that happen during a lazy save and restore.
    ;; Those effects are instead tracked by ZA_SAVED_REGNUM.
    (SME_STATE_REGNUM 89)

    ;; Instructions write to this register if they set TPIDR2_EL0 to a
    ;; well-defined value.  Instructions read from the register if they
    ;; depend on the result of such writes.
    ;;
    ;; The register does not model the architected TPIDR2_EL0, just the
    ;; current function's management of it.
    (TPIDR2_SETUP_REGNUM 90)

    ;; Represents the property "has an incoming lazy save been committed?".
    (ZA_FREE_REGNUM 91)

    ;; Represents the property "are the current function's ZA contents
    ;; stored in the lazy save buffer, rather than in ZA itself?".
    (ZA_SAVED_REGNUM 92)

    ;; Represents the contents of the current function's ZA state in
    ;; abstract form.  At various times in the function, these contents
    ;; might be stored in ZA itself, or in the function's lazy save buffer.
    ;;
    ;; The contents persist even when the architected ZA is off.  Private-ZA
    ;; functions have no effect on its contents.
    (ZA_REGNUM 93)

Every normal read from ZA and write to ZA depends on SME_STATE_REGNUM, in order to sequence the code with the initial setup of ZA and with the lazy save scheme.

The code to restore ZA after a call involves several instructions, including conditional control flow. It is initially represented as a single define_insn and is split late, after shrink-wrapping and prologue/epilogue insertion. The split form of the restore instruction includes a conditional call to __arm_tpidr2_restore:

    (define_insn "aarch64_tpidr2_restore"
      [(set (reg:DI ZA_SAVED_REGNUM)
            (unspec:DI [(reg:DI R0_REGNUM)] UNSPEC_TPIDR2_RESTORE))
       (set (reg:DI SME_STATE_REGNUM)
            (unspec:DI [(reg:DI SME_STATE_REGNUM)] UNSPEC_TPIDR2_RESTORE))
       ...)

The write to SME_STATE_REGNUM indicates the end of the region where ZA_REGNUM might differ from the real contents of ZA. In other words, it is the point at which normal reads from ZA and writes to ZA can safely take place.

To finally get to the point: the problem in this PR was that the unsplit aarch64_restore_za pattern was missing this change to SME_STATE_REGNUM. It could therefore be deleted as dead before it had a chance to be split. The split form had the correct dataflow, but the unsplit form didn't.

Unfortunately, the tests for this code tended to use calls and asms to model regions of ZA usage, and those don't seem to be affected in the same way.

gcc/
    PR target/120624
    * config/aarch64/aarch64.md (SME_STATE_REGNUM): Expand on comments.
    * config/aarch64/aarch64-sme.md (aarch64_restore_za): Also set SME_STATE_REGNUM.

gcc/testsuite/
    PR target/120624
    * gcc.target/aarch64/sme/za_state_7.c: New test.
2025-06-12 | Refactor record_function_versions. | Alfie Richards | 4 files, -117/+24
Rename record_function_versions to add_function_version, and make it explicit that it adds a single version to the function structure. Additionally, change the insertion point to always maintain priority ordering of the versions. This allows removing the logic for moving the default to the first position, which was duplicated across target-specific code, and enables easier reasoning about function sets.

gcc/ChangeLog:

    * cgraph.cc (cgraph_node::record_function_versions): Refactor and rename to...
    (cgraph_node::add_function_version): ...new function.
    * cgraph.h (cgraph_node::record_function_versions): Refactor and rename to...
    (cgraph_node::add_function_version): ...new function.
    * config/aarch64/aarch64.cc (aarch64_get_function_versions_dispatcher): Remove reordering.
    * config/i386/i386-features.cc (ix86_get_function_versions_dispatcher): Remove reordering.
    * config/riscv/riscv.cc (riscv_get_function_versions_dispatcher): Remove reordering.
    * config/rs6000/rs6000.cc (rs6000_get_function_versions_dispatcher): Remove reordering.

gcc/cp/ChangeLog:

    * decl.cc (maybe_version_functions): Change record_function_versions call to add_function_version.
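For context, a minimal function-multiversioning example (illustrative only; x86 target attributes) of the kind of version set whose ordering this affects:

    /* With priority-ordered insertion, target code no longer has to
       move the default version back to the front of the version list
       after registering the others.  */
    __attribute__ ((target ("default")))
    int f (void) { return 0; }

    __attribute__ ((target ("sse4.2")))
    int f (void) { return 1; }

    __attribute__ ((target ("avx2")))
    int f (void) { return 2; }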
2025-06-12 | i386: Set SRF, GRR, CWF, GNR, DMR, ARL and PTL issue rate | Hu, Lin1 | 1 file, -0/+9
This patch sets the SRF issue rate to 4 and the GNR issue rate to 6. According to SPEC2017 tests, the patch has little effect on performance. For GRR, CWF, DMR, ARL and PTL, the patch sets their issue rate to 6, awaiting more information before further updates.

Bootstrapped and regtested on x86_64-pc-linux-gnu.

gcc/ChangeLog:

    * config/i386/x86-tune-sched.cc (ix86_issue_rate): Set 4 for SRF, 6 for GRR, GNR, CWF, DMR, ARL, PTL.
2025-06-11 | RISC-V: Prevent speculative vsetvl insn scheduling | Edwin Lu | 1 file, -0/+35
The instruction scheduler appears to be speculatively hoisting vsetvl insns outside of their basic block without checking for data dependencies. This resulted in a situation where the following occurs:

    vsetvli   a5,a1,e32,m1,tu,ma
    vle32.v   v2,0(a0)
    sub       a1,a1,a5            <-- a1 potentially set to 0
    sh2add    a0,a5,a0
    vfmacc.vv v1,v2,v2
    vsetvli   a5,a1,e32,m1,tu,ma  <-- incompatible vinfo. update vl to 0
    beq       a1,zero,.L12        <-- check if avl is 0

This patch essentially delays the vsetvl update to after the branch, to prevent unnecessarily updating the vinfo at the end of a basic block.

    PR/117974

gcc/ChangeLog:

    * config/riscv/riscv.cc (struct riscv_tune_param): Add tune param.
    (riscv_sched_can_speculate_insn): Implement.
    (TARGET_SCHED_CAN_SPECULATE_INSN): Ditto.

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/rvv/vsetvl/pr117974.c: New test.

Signed-off-by: Edwin Lu <ewlu@rivosinc.com>
2025-06-11 | i386: Fix signed integer overflow in ix86_expand_int_movcc [PR120604] | Uros Bizjak | 1 file, -6/+12
The patch for PR120553 enabled full 64-bit DImode immediates in ix86_expand_int_movcc. However, the function calculates the difference between two immediate arguments using signed 64-bit HOST_WIDE_INT subtractions that can cause signed integer overflow. Avoid the overflow by casting the operands of the subtractions to (unsigned HOST_WIDE_INT).

    PR target/120604

gcc/ChangeLog:

    * config/i386/i386-expand.cc (ix86_expand_int_movcc): Cast operands of signed 64-bit HOST_WIDE_INT subtractions to (unsigned HOST_WIDE_INT).
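A small sketch (illustrative only) of why the casts matter:

    #include <stdint.h>

    /* With ct = INT64_MAX and cf = -1, the signed subtraction ct - cf
       overflows int64_t (undefined behavior); done in uint64_t it wraps
       modulo 2^64, which is the intended two's-complement difference.  */
    uint64_t diff (int64_t ct, int64_t cf)
    {
      return (uint64_t) ct - (uint64_t) cf;
    }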
2025-06-11 | RISC-V: Add patterns for vector-scalar negate-(multiply-add/sub) [PR119100] | Paul-Antoine Arras | 2 files, -4/+33
This pattern enables the combine pass (or late-combine, depending on the case) to merge a vec_duplicate into a (possibly negated) minus-mult RTL instruction.

Before this patch, we have two instructions, e.g.:

    vfmv.v.f   v6,fa0
    vfnmadd.vv v2,v6,v4

After, we get only one:

    vfnmadd.vf v2,fa0,v4

This also fixes a sign mistake in the handling of vfmsub.

    PR target/119100

gcc/ChangeLog:

    * config/riscv/autovec-opt.md (*<optab>_vf_<mode>): Only handle vfmadd and vfmsub.
    (*vfnmsub_<mode>): New pattern.
    (*vfnmadd_<mode>): New pattern.
    * config/riscv/riscv.cc (get_vector_binary_rtx_cost): Add cost model for NEG and VEC_DUPLICATE.

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfnmadd and vfnmsub.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f64.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f64.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f64.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f64.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop.h: Add support for neg variants. Fix sign for sub.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop_data.h: Add data for neg variants. Fix data for sub.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop_run.h: Rename x to f.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmadd-run-1-f16.c: Add neg argument.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmadd-run-1-f32.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmadd-run-1-f64.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsub-run-1-f16.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsub-run-1-f32.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsub-run-1-f64.c: Likewise.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmadd-run-1-f16.c: New test.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmadd-run-1-f32.c: New test.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmadd-run-1-f64.c: New test.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsub-run-1-f16.c: New test.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsub-run-1-f32.c: New test.
    * gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsub-run-1-f64.c: New test.
2025-06-10 | i386: Handle ZERO_EXTEND like SIGN_EXTEND in bsr patterns [PR120434] | Jakub Jelinek | 1 file, -7/+8
The just-posted second PR120434 patch causes:

    +FAIL: gcc.target/i386/pr78103-3.c scan-assembler \\m(leaq|addq|incq)\\M
    +FAIL: gcc.target/i386/pr78103-3.c scan-assembler-not \\mmovl\\M
    +FAIL: gcc.target/i386/pr78103-3.c scan-assembler-not \\msubq\\M
    +FAIL: gcc.target/i386/pr78103-3.c scan-assembler-not \\mxor[lq]\\M

While the patch generally improves code generation by often using ZERO_EXTEND instead of SIGN_EXTEND (the former is often free on x86_64, while the latter requires an extra instruction or a larger instruction than one with just zero extend), the PR78103 combine patterns and splitters were written only with SIGN_EXTEND in mind. As CLZ is UB on 0 and otherwise returns just [0,63] and is xored with 63, ZERO_EXTEND does the same thing there as SIGN_EXTEND.

2025-06-10  Jakub Jelinek  <jakub@redhat.com>

    PR middle-end/120434
    * config/i386/i386.md (*bsr_rex64_2): Rename to ...
    (*bsr_rex64<u>_2): ... this. Use any_extend instead of sign_extend.
    (*bsr_2): Rename to ...
    (*bsr<u>_2): ... this. Use any_extend instead of sign_extend.
    (bsr splitters after those): Use any_extend instead of sign_extend.
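A short illustration (assumed, not from the patch) of why the two extensions agree here:

    /* For nonzero x, __builtin_clzll (x) is in [0, 63], so clz ^ 63 is
       also in [0, 63]; its sign bit is clear, and sign-extending and
       zero-extending it to 64 bits give the same value.  */
    unsigned long long bit_index (unsigned long long x)
    {
      return __builtin_clzll (x) ^ 63;  /* the bsr pattern shape */
    }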
2025-06-10 | gcn: Add experimental MI300 (gfx942) support | Tobias Burnus | 7 files, -68/+173
As gfx942 and gfx950 belong to gfx9-4-generic, the latter two are also added. Note that there are no specific optimizations for MI300 yet.

No multilib is built by default for any of the mentioned devices; use '--with-multilib-list=' when configuring GCC to build them alongside. gfx942 was added in LLVM (and its mc assembler, used by GCC) in version 18, generic support in LLVM 19, and gfx950 in LLVM 20.

gcc/ChangeLog:

    * config/gcn/gcn-devices.def: Add gfx942, gfx950 and gfx9-4-generic.
    * config/gcn/gcn-opts.h (TARGET_CDNA3, TARGET_CDNA3_PLUS, TARGET_GLC_NAME, TARGET_TARGET_SC_CACHE): Define.
    (TARGET_ARCHITECTED_FLAT_SCRATCH): Use also for CDNA3.
    * config/gcn/gcn.h (gcn_isa): Add ISA_CDNA3 to the enum.
    * config/gcn/gcn.cc (print_operand): Update 'g' to use TARGET_GLC_NAME; add 'G' to print TARGET_GLC_NAME unconditionally.
    * config/gcn/gcn-valu.md (scatter, gather): Use TARGET_GLC_NAME.
    * config/gcn/gcn.md: Use %G<num> instead of glc; use 'buffer_inv sc1' for TARGET_TARGET_SC_CACHE.
    * doc/invoke.texi (march): Add gfx942, gfx950 and gfx9-4-generic.
    * doc/install.texi (amdgcn*-*-*): Add gfx942, gfx950 and gfx9-4-generic.
    * config/gcn/gcn-tables.opt: Regenerate.

libgomp/ChangeLog:

    * testsuite/libgomp.c/declare-variant-4.h (gfx942): New variant function.
    * testsuite/libgomp.c/declare-variant-4-gfx942.c: New test.
2025-06-10 | [RISC-V] Fix ICE due to splitter emitting constant loads directly | Jeff Law | 1 file, -5/+13
This is a fix for a bug found internally at Ventana using the cf3 testsuite. cf3 looks to be dead as a project and likely subsumed by modern fuzzers; in fact, internally we tripped another issue with cf3 that had already been reported by Edwin with the fuzzer he runs.

Anyway, the splitter in question blindly emits the second adjusted constant into a register. That's not valid if the constant requires any kind of synthesis, and it well could, since we're mostly focused on the first constant turning into something that can be loaded via LUI without increasing the cost of the second constant.

Instead of using the split RTL template, this just emits the code we want directly, using riscv_move_insn to synthesize the constant into the provided temporary register.

Tested on my system. Waiting on upstream CI's verdict before moving forward.

gcc/
    * config/riscv/riscv.md (lui-constraint<X:mode>and_to_or): Do not use the RTL template for split code. Emit it directly, taking care to avoid emitting a constant load that needed synthesis. Fix formatting.

gcc/testsuite/
    * gcc.target/riscv/ventana-16122.c: New test.
2025-06-10 | RISC-V: Combine vec_duplicate + vremu.vv to vremu.vx on GR2VR cost | Pan Li | 3 files, -1/+3
This patch would like to combine the vec_duplicate + vremu.vv into vremu.vx, as in the example below. The related pattern depends on the cost of vec_duplicate from GR2VR: the late-combine pass will take action if the GR2VR cost is zero, and reject the combination if the GR2VR cost is greater than zero.

Assume we have example code like below, with a GR2VR cost of 0:

    #define DEF_VX_BINARY(T, OP)                                          \
    void                                                                  \
    test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n)   \
    {                                                                     \
      for (unsigned i = 0; i < n; i++)                                    \
        out[i] = in[i] OP x;                                              \
    }

    DEF_VX_BINARY(int32_t, /)

Before this patch:

    test_vx_binary_or_int32_t_case_0:
            beq      a3,zero,.L8
            vsetvli  a5,zero,e32,m1,ta,ma
            vmv.v.x  v2,a2
            slli     a3,a3,32
            srli     a3,a3,32
    .L3:
            vsetvli  a5,a3,e32,m1,ta,ma
            vle32.v  v1,0(a1)
            slli     a4,a5,2
            sub      a3,a3,a5
            add      a1,a1,a4
            vremu.vv v1,v1,v2
            vse32.v  v1,0(a0)
            add      a0,a0,a4
            bne      a3,zero,.L3

After this patch:

    test_vx_binary_or_int32_t_case_0:
            beq      a3,zero,.L8
            slli     a3,a3,32
            srli     a3,a3,32
    .L3:
            vsetvli  a5,a3,e32,m1,ta,ma
            vle32.v  v1,0(a1)
            slli     a4,a5,2
            sub      a3,a3,a5
            add      a1,a1,a4
            vremu.vx v1,v1,a2
            vse32.v  v1,0(a0)
            add      a0,a0,a4
            bne      a3,zero,.L3

gcc/ChangeLog:

    * config/riscv/riscv-v.cc (expand_vx_binary_vec_vec_dup): Add new case UMOD.
    * config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
    * config/riscv/vector-iterators.md: Add new op umod.

Signed-off-by: Pan Li <pan2.li@intel.com>
2025-06-09 | [RISC-V] Enable more if-conversion on RISC-V | Jeff Law | 1 file, -7/+5
Another czero-related adjustment, this time in the costing of conditional move sequences. Essentially, a copy from a promoted subreg can and should be ignored from a costing standpoint. We had some code to do this, but its conditions were too strict.

No real surprises evaluating SPEC. This should be a minor, but probably not measurable, improvement in x264 and xz. It is if-converting more in some particularly hot routines, but not necessarily in the hot parts of those routines.

It's been tested on riscv32-elf and riscv64-elf. Versions of this have bootstrapped and regression tested as well, though perhaps not this exact version. Waiting on pre-commit testing.

gcc/
    * config/riscv/riscv.cc (riscv_noce_conversion_profitable_p): Relax condition for adjustments due to copies from promoted SUBREGs.
2025-06-09 | Also handle avx512 kmask & immediate 15 or 3 when VF is 4/2. | liuhongt | 2 files, -1/+49
Like r16-105-g599bca27dc37b3, this patch handles redundant clean-up of the upper bits for maskload, i.e. it successfully matches this instruction:

    (set (reg:V4DF 175)
        (vec_merge:V4DF
            (unspec:V4DF [
                    (mem:V4DF (plus:DI (reg/v/f:DI 155 [ b ])
                            (reg:DI 143 [ ivtmp.56 ])) [1 S32 A64])
                ] UNSPEC_MASKLOAD)
            (const_vector:V4DF [
                    (const_double:DF 0.0 [0x0.0p+0]) repeated x4
                ])
            (and:QI (reg:QI 125 [ mask__29.16 ])
                (const_int 15 [0xf]))))

For maskstore, the code looks already optimal (at least I can't construct a testcase), so the patch only handles maskload.

gcc/ChangeLog:

    PR target/103750
    * config/i386/i386.cc (ix86_rtx_costs): Adjust rtx_cost for maskload.
    * config/i386/sse.md (*<avx512>_load<mode>mask_and15): New define_insn_and_split.
    (*<avx512>_load<mode>mask_and3): Ditto.

gcc/testsuite/ChangeLog:

    * gcc.target/i386/avx512f-pr103750-3.c: New test.
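A rough sketch (an assumed testcase, not from the patch) of a masked 256-bit load where the upper kmask bits are already irrelevant:

    #include <immintrin.h>

    /* V4DF uses only the low 4 bits of the 8-bit mask, so the explicit
       (and:QI mask 15) feeding the maskload can be dropped.  */
    __m256d load_masked (const double *p, __mmask8 m)
    {
      return _mm256_maskz_loadu_pd (m & 15, p);
    }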
2025-06-09 | RISC-V: Combine vec_duplicate + vrem.vv to vrem.vx on GR2VR cost | Pan Li | 3 files, -1/+3
This patch would like to combine the vec_duplicate + vrem.vv into vrem.vx, as in the example below. The related pattern depends on the cost of vec_duplicate from GR2VR: the late-combine pass will take action if the GR2VR cost is zero, and reject the combination if the GR2VR cost is greater than zero.

Assume we have example code like below, with a GR2VR cost of 0:

    #define DEF_VX_BINARY(T, OP)                                          \
    void                                                                  \
    test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n)   \
    {                                                                     \
      for (unsigned i = 0; i < n; i++)                                    \
        out[i] = in[i] OP x;                                              \
    }

    DEF_VX_BINARY(int32_t, /)

Before this patch:

    test_vx_binary_or_int32_t_case_0:
            beq     a3,zero,.L8
            vsetvli a5,zero,e32,m1,ta,ma
            vmv.v.x v2,a2
            slli    a3,a3,32
            srli    a3,a3,32
    .L3:
            vsetvli a5,a3,e32,m1,ta,ma
            vle32.v v1,0(a1)
            slli    a4,a5,2
            sub     a3,a3,a5
            add     a1,a1,a4
            vrem.vv v1,v1,v2
            vse32.v v1,0(a0)
            add     a0,a0,a4
            bne     a3,zero,.L3

After this patch:

    test_vx_binary_or_int32_t_case_0:
            beq     a3,zero,.L8
            slli    a3,a3,32
            srli    a3,a3,32
    .L3:
            vsetvli a5,a3,e32,m1,ta,ma
            vle32.v v1,0(a1)
            slli    a4,a5,2
            sub     a3,a3,a5
            add     a1,a1,a4
            vrem.vx v1,v1,a2
            vse32.v v1,0(a0)
            add     a0,a0,a4
            bne     a3,zero,.L3

gcc/ChangeLog:

    * config/riscv/riscv-v.cc (expand_vx_binary_vec_vec_dup): Add new case MOD.
    * config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
    * config/riscv/vector-iterators.md: Add new op mod.

Signed-off-by: Pan Li <pan2.li@intel.com>
2025-06-08 | RISC-V: frm/mode-switch: robustify call_insn backtracking [PR120203] | Vineet Gupta | 1 file, -26/+16
As described in prior patches of this series, the FRM mode-switching state machine has special handling around calls. After a call_insn, if in the DYN_CALL state, it needs to transition back to DYN, which requires checking back whether the previous insn was indeed a call. Deferring/delaying this could lead to unnecessary final transitions, leading to extraneous FRM saves/restores.

However, the current back-checking of call_insn was too coarse-grained. It used prev_nonnote_nondebug_insn_bb (), which implies the current insn is in the same BB as the call_insn, which need not always be true. The problem is not with the API, but with the use thereof.

Fix this by tracking call_insn more explicitly in TARGET_MODE_NEEDED:

- On seeing a call_insn, record a "call note".
- On subsequent insns, if a "call note" is seen, do the needed state switch and clear the note.
- Remove the old BB-based search.

The number of FRM reads/writes across SPEC2017 -Ofast -mrv64gcv improves:

                     Before                After
                  frrm fsrmi fsrm      frrm fsrmi fsrm
    perlbench_r     17    0     1        17    0     1
    cpugcc_r        11    0     0        11    0     0
    bwaves_r        16    0     1        16    0     1
    mcf_r           11    0     0        11    0     0
    cactusBSSN_r    19    0     1        19    0     1
    namd_r          14    0     1        14    0     1
    parest_r        24    0     1        24    0     1
    povray_r        26    1     6        26    1     6
    lbm_r            6    0     0         6    0     0
    omnetpp_r       17    0     1        17    0     1
    wrf_r         1268   13  1603       613   13    82
    cpuxalan_r      17    0     1        17    0     1
    ldecod_r        11    0     0        11    0     0
    x264_r          11    0     0        11    0     0
    blender_r       61   12    42        39   12    16
    cam4_r          45   13    20        40   13    17
    deepsjeng_r     11    0     0        11    0     0
    imagick_r      132   16    25        33   16    18
    leela_r         12    0     0        12    0     0
    nab_r           13    0     1        13    0     1
    exchange2_r     16    0     1        16    0     1
    fotonik3d_r     19    0     1        19    0     1
    roms_r          21    0     1        21    0     1
    xz_r             6    0     0         6    0     0
                 -----------------    ----------------
                  1804   55  1707      1023   55   150
                 -----------------    ----------------
                          3566                  1228

While this was a missed-optimization exercise, testing exposed a latent bug as an additional testsuite failure, captured as PR120203. The existing test float-point-dynamic-frm-74.c was missing an FRM save after a call, which this fixes (as a side effect of robust call-state tracking):

    frrm     a5
    fsrmi    1

    vfadd.vv v1,v8,v9
    fsrm     a5
    beq      a1,zero,.L2

    call     normalize_vl_1
    frrm     a5

    .L3:
    fsrmi    3
    vfadd.vv v8,v8,v9
    fsrm     a5
    jr       ra

    .L2:
    call     normalize_vl_2
    frrm     a5    <-- missing
    j        .L3

    PR target/120203

gcc/ChangeLog:

    * config/riscv/riscv.cc (CFUN_IN_CALL): New macro.
    (struct mode_switching_info): Add new field.
    (riscv_frm_adjust_mode_after_call): Remove.
    (riscv_frm_mode_needed): Track call_insn.

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/rvv/base/float-point-dynamic-frm-74.c: Expect an additional FRRM.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
2025-06-08 | RISC-V: frm/mode-switch: Reduce FRM restores on DYN transition [PR119164] | Vineet Gupta | 1 file, -1/+1
The FRM mode-switching state machine has DYN as its default state, which it also falls back to after transitioning to other states such as DYN_CALL. Currently TARGET_MODE_EMIT generates an FRM restore on any transition to DYN, leading to spurious/extraneous FRM restores. Only do this if an interim static rounding mode was observed in the state machine.

This fixes the extraneous FRM read/write in PR119164 (and also PR119832, without needing TARGET_MODE_CONFLUENCE). It also significantly reduces the number of FRM writes in a SPEC2017 -Ofast -mrv64gcv build:

                     Before                After
                  frrm fsrmi fsrm      frrm fsrmi fsrm
    perlbench_r     42    0     4        17    0     1
    cpugcc_r       167    0    17        11    0     0
    bwaves_r        16    0     1        16    0     1
    mcf_r           11    0     0        11    0     0
    cactusBSSN_r    76    0    27        19    0     1
    namd_r         119    0    63        14    0     1
    parest_r       168    0   114        24    0     1
    povray_r       123    1    17        26    1     6
    lbm_r            6    0     0         6    0     0
    omnetpp_r       17    0     1        17    0     1
    wrf_r         2287   13  1956      1268   13  1603
    cpuxalan_r      17    0     1        17    0     1
    ldecod_r        11    0     0        11    0     0
    x264_r          14    0     1        11    0     0
    blender_r      724   12   182        61   12    42
    cam4_r         324   13   169        45   13    20
    deepsjeng_r     11    0     0        11    0     0
    imagick_r      265   16    34       132   16    25
    leela_r         12    0     0        12    0     0
    nab_r           13    0     1        13    0     1
    exchange2_r     16    0     1        16    0     1
    fotonik3d_r     20    0    11        19    0     1
    roms_r          33    0    23        21    0     1
    xz_r             6    0     0         6    0     0
                 -----------------    ----------------
                  4498   55  2623      1804   55  1707
                 -----------------    ----------------
                          7176                  3566

    PR target/119164

gcc/ChangeLog:

    * config/riscv/riscv.cc (riscv_emit_frm_mode_set): Check STATIC_FRM_P for transition to DYN.

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/rvv/base/pr119164.c: New test.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
2025-06-08 | RISC-V: frm/mode-switch: remove dubious frm edge insertion before call_insn | Vineet Gupta | 1 file, -44/+0
This showed up when debugging the testcase for PR119164.

The RISC-V FRM mode-switching state machine has special handling for transitions to and from a call_insn, as FRM needs to be saved/restored around calls despite not being a callee-saved reg; rather, it's a "global" reg which can be temporarily modified "locally" with a static RM. Thus a call needs to see the prior global state, hence the restore (from a prior backup) before the call. Corollarily, any call can potentially clobber the FRM, so post-call it needs to be re-read/saved.

The following example demonstrates this:

- insns 2, 4, 6 correspond to actual user code,
- the rest (1, 3, 5, 7) are frm save/restore insns generated by mode switching for the ABI semantics described above.

    test_float_point_frm_static:
    1:  frrm    a5    <--
    2:  fsrmi   2
    3:  fsrm    a5    <--
    4:  call    normalize_vl
    5:  frrm    a5    <--
    6:  fsrmi   3
    7:  fsrm    a5    <--

The current implementation of the RISC-V TARGET_MODE_NEEDED has special handling if the call_insn is the last insn of a BB, to ensure FRM saves/reads are emitted on all the edges. However, it doesn't work as intended and is borderline bogus, for the following reasons:

- It fails to detect a call_insn as the last in a BB (the PR119164 test) if the next BB starts with a code label (say, due to the call being conditional). Granted, this is a deficiency of the API next_nonnote_nondebug_insn_bb (), which incorrectly returns the next BB's code_label as opposed to returning NULL (and this behavior is kind of relied upon by much of gcc). This causes a missed/delayed state transition to DYN.

- If the code is tightened to actually detect the above, such as:

    -  rtx_insn *insn = next_nonnote_nondebug_insn_bb (cur_insn);
    -  if (!insn)
    +  if (BB_END (BLOCK_FOR_INSN (cur_insn)) == cur_insn)

  edge insertion happens but ends up splitting the BB, which generic mode switching doesn't expect, and it ends up hitting an ICE.

- TARGET_MODE_NEEDED hooks typically don't modify the CFG.

- For abnormal edges, insert_insn_end_basic_block () is called, which by design, on encountering a call_insn as the last in a BB, inserts the new insn BEFORE the call, not after.

So this is just all wrong and ripe for removal. Moreover, there seems to be no testsuite coverage for this code path at all; results don't change at all if it is removed.

The total number of FRM reads/writes emitted (static count) across all benchmarks of a SPEC2017 -Ofast -march=rv64gcv build decreases slightly, so it's a net win, even if minimal, but the real gain is reduced complexity and maintenance:

                     Before                Patch
                  frrm fsrmi fsrm      frrm fsrmi fsrm
    perlbench_r     42    0     4        42    0     4
    cpugcc_r       167    0    17       167    0    17
    bwaves_r        16    0     1        16    0     1
    mcf_r           11    0     0        11    0     0
    cactusBSSN_r    79    0    27        76    0    27
    namd_r         119    0    63       119    0    63
    parest_r       218    0   114       168    0   114   <--
    povray_r       123    1    17       123    1    17
    lbm_r            6    0     0         6    0     0
    omnetpp_r       17    0     1        17    0     1
    wrf_r         2287   13  1956      2287   13  1956
    cpuxalan_r      17    0     1        17    0     1
    ldecod_r        11    0     0        11    0     0
    x264_r          14    0     1        14    0     1
    blender_r      724   12   182       724   12   182
    cam4_r         324   13   169       324   13   169
    deepsjeng_r     11    0     0        11    0     0
    imagick_r      265   16    34       265   16    34
    leela_r         12    0     0        12    0     0
    nab_r           13    0     1        13    0     1
    exchange2_r     16    0     1        16    0     1
    fotonik3d_r     20    0    11        20    0    11
    roms_r          33    0    23        33    0    23
    xz_r             6    0     0         6    0     0
                 -----------------    ----------------
                  4551   55  2623      4498   55  2623

gcc/ChangeLog:

    * config/riscv/riscv.cc (riscv_frm_emit_after_bb_end): Delete.
    (riscv_frm_mode_needed): Remove call to riscv_frm_emit_after_bb_end.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
2025-06-08 | RISC-V: frm/mode-switch: remove TARGET_MODE_CONFLUENCE | Vineet Gupta | 1 file, -37/+0
This effectively reverts e5d1f538bb7d ("RISC-V: Allow different dynamic floating point mode to be merged") while retaining the testcase. The change itself is valid; however, it obfuscates the deficiencies in the current frm mode-switching code. Also, for a SPEC2017 -Ofast -march=rv64gcv build, it ends up generating net more FRM restores (writes) vs. the rest of this changeset.

gcc/ChangeLog:

    * config/riscv/riscv.cc (riscv_dynamic_frm_mode_p): Remove.
    (riscv_mode_confluence): Ditto.
    (TARGET_MODE_CONFLUENCE): Ditto.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
2025-06-08[RISC-V] Handle 32bit operands in condition for conditional movesShreya Munnangi1-62/+79
So here's the next chunk of conditional move work from Shreya. It's been a long-standing wart that the conditional move expander does not support sub-word operands in the comparison, particularly since we have support routines to handle the necessary extensions for that case. This patch adjusts the expander to use riscv_extend_comparands rather than fail for that case.

I've built spec2017 before/after this and we definitely get more conditional moves, and they look sensible from a performance standpoint. None are likely hitting terribly hot code, so I wouldn't expect any performance jumps.

Waiting on pre-commit testing to do its thing.

gcc/
	* config/riscv/riscv.cc (riscv_expand_conditional_move): Use
	riscv_extend_comparands to extend sub-word comparison arguments.

Co-authored-by: Jeff Law <jlaw@ventanamicro.com>
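As a hedged illustration (hypothetical code, not from the patch), this is the shape that previously made the expander FAIL: the comparison operands are 32-bit while the moved values are word-sized, so riscv_extend_comparands must extend a and b first.

/* rv64: the int comparison operands are sub-word and now get extended.  */
long
cmov_subword (int a, int b, long x, long y)
{
  return a < b ? x : y;
}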
2025-06-08xtensa: Implement l(ceil|floor)sfsi2 insn patterns and their scaled variantsTakayuki 'January June' Suwa2-12/+54
By using the previously unused CEIL.S|FLOOR.S floating-point coprocessor instructions. In addition, two instruction operand format codes are added to output the scale value in the assembler source.

/* example */
int test0(float a) { return __builtin_lceilf(a); }
int test1(float a) { return __builtin_lceilf(a * 2); }
int test2(float a) { return __builtin_lfloorf(a); }
int test3(float a) { return __builtin_lfloorf(a * 32768); }

;; result
test0:
	entry	sp, 32
	wfr	f0, a2
	ceil.s	a2, f0, 0
	retw.n
test1:
	entry	sp, 32
	wfr	f0, a2
	ceil.s	a2, f0, 1
	retw.n
test2:
	entry	sp, 32
	wfr	f0, a2
	floor.s	a2, f0, 0
	retw.n
test3:
	entry	sp, 32
	wfr	f0, a2
	floor.s	a2, f0, 15
	retw.n

However, the rounding-half behavior of the two (e.g., the rule that determines whether 1.5 rounds to 1 or 2) is inconsistent: the lroundsfsi2 pattern explicitly specifies rounding to the nearest integer, away from zero, whereas the rounding behavior of the ROUND.S instruction is not specified by the ISA and is implementation-dependent. Therefore lroundsfsi2 cannot be implemented with ROUND.S.

gcc/ChangeLog:
	* config/xtensa/xtensa.cc (printx, print_operand): Add two
	instruction operand format codes 'U' and 'V', which represent
	scale factors of the 0th to 15th positive/negative powers of two.
	* config/xtensa/xtensa.md (c_enum "unspec"): Add UNSPEC_CEIL and
	UNSPEC_FLOOR.
	(int_iterator ANY_ROUND, int_attr m_round): New integer iterator
	and its attribute.
	(fix<s_fix>_truncsfsi2, *fix<s_fix>_truncsfsi2_2x,
	*fix<s_fix>_truncsfsi2_scaled, float<s_float>sisf2,
	*float<s_float>sisf2_scaled): Use output templates with the
	operand formats added above, instead of individual output
	statements.
	(l<m_round>sfsi2, *l<m_round>sfsi2_2x, *l<m_round>sfsi2_scaled):
	New insn patterns.
2025-06-07[to-be-committed][RISC-V] Handle 32bit operands in condition for conditional movesJeff Law1-5/+10
So here's the next chunk of conditional move work from Shreya. It's been a long-standing wart that the conditional move expander does not support sub-word operands in the comparison, particularly since we have support routines to handle the necessary extensions for that case. This patch adjusts the expander to use riscv_extend_comparands rather than fail for that case.

I've built spec2017 before/after this and we definitely get more conditional moves, and they look sensible from a performance standpoint. None are likely hitting terribly hot code, so I wouldn't expect any performance jumps.

Waiting on pre-commit testing to do its thing.

gcc/
	* config/riscv/riscv.cc (riscv_expand_conditional_move): Use
	riscv_extend_comparands to extend sub-word comparison arguments.

Co-authored-by: Jeff Law <jlaw@ventanamicro.com>
2025-06-06RISC-V: Combine vec_duplicate + vdivu.vv to vdivu.vx on GR2VR costPan Li3-1/+3
This patch combines the vec_duplicate + vdivu.vv into vdivu.vx, as in the example code below. The related pattern depends on the cost of the vec_duplicate from GR2VR: late-combine will take action if the GR2VR cost is zero, and reject the combination if the GR2VR cost is greater than zero.

Assume we have example code like below, with GR2VR cost 0.

#define DEF_VX_BINARY(T, OP)                                         \
void                                                                 \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n)  \
{                                                                    \
  for (unsigned i = 0; i < n; i++)                                   \
    out[i] = in[i] OP x;                                             \
}

DEF_VX_BINARY(int32_t, /)

Before this patch:

test_vx_binary_or_int32_t_case_0:
	beq	a3,zero,.L8
	vsetvli	a5,zero,e32,m1,ta,ma
	vmv.v.x	v2,a2
	slli	a3,a3,32
	srli	a3,a3,32
.L3:
	vsetvli	a5,a3,e32,m1,ta,ma
	vle32.v	v1,0(a1)
	slli	a4,a5,2
	sub	a3,a3,a5
	add	a1,a1,a4
	vdivu.vv	v1,v1,v2
	vse32.v	v1,0(a0)
	add	a0,a0,a4
	bne	a3,zero,.L3

After this patch:

test_vx_binary_or_int32_t_case_0:
	beq	a3,zero,.L8
	slli	a3,a3,32
	srli	a3,a3,32
.L3:
	vsetvli	a5,a3,e32,m1,ta,ma
	vle32.v	v1,0(a1)
	slli	a4,a5,2
	sub	a3,a3,a5
	add	a1,a1,a4
	vdivu.vx	v1,v1,a2
	vse32.v	v1,0(a0)
	add	a0,a0,a4
	bne	a3,zero,.L3

The below test suites are passed for this patch:
* The rv64gcv full regression test.

gcc/ChangeLog:
	* config/riscv/riscv-v.cc (expand_vx_binary_vec_vec_dup): Add new
	case UDIV.
	* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
	* config/riscv/vector-iterators.md: Add new op divu.

Signed-off-by: Pan Li <pan2.li@intel.com>
2025-06-06Remove create_tmp_reg_or_ssa_nameRichard Biener1-12/+12
Now that create_tmp_reg_or_ssa_name just calls make_ssa_name, replace all of its uses.

	* gimple-fold.h (create_tmp_reg_or_ssa_name): Remove.
	* gimple-fold.cc (create_tmp_reg_or_ssa_name): Likewise.
	(gimple_fold_builtin_memory_op): Use make_ssa_name.
	(gimple_fold_builtin_strchr): Likewise.
	(gimple_fold_builtin_strcat): Likewise.
	(gimple_load_first_char): Likewise.
	(gimple_fold_builtin_string_compare): Likewise.
	(gimple_build): Likewise.
	* tree-inline.cc (copy_bb): Likewise.
	* config/rs6000/rs6000-builtin.cc (fold_build_vec_cmp): Likewise.
	(rs6000_gimple_fold_mma_builtin): Likewise.
	(rs6000_gimple_fold_builtin): Likewise.
2025-06-06RISC-V: Support -mcpu for XiangShan Kunminghu cpu.Jiawei1-0/+14
This patch adds support for the XiangShan Kunminghu CPU in GCC, allowing the use of the `-mcpu=xiangshan-kunminghu` option.

XiangShan-KunMingHu is the third-generation open-source high-performance RISC-V processor.[1] You can find the corresponding ISA extensions in the XiangShan Github repository.[2] The latest news about KunMingHu can be found in the XiangShan Biweekly.[3]

[1] https://github.com/OpenXiangShan/XiangShan-User-Guide/releases
[2] https://github.com/OpenXiangShan/XiangShan/blob/master/src/main/scala/xiangshan/Parameters.scala
[3] https://docs.xiangshan.cc/zh-cn/latest/blog

A dedicated scheduling model for KunMingHu's hybrid pipeline will be proposed in a subsequent PR.

gcc/ChangeLog:
	* config/riscv/riscv-cores.def (RISCV_TUNE): New cpu tune.
	(RISCV_CORE): New cpu.
	* doc/invoke.texi: Ditto.

gcc/testsuite/ChangeLog:
	* gcc.target/riscv/mcpu-xiangshan-kunminghu.c: New test.

Co-Authored-By: Jiawei Chen <jiawei@iscas.ac.cn>
Co-Authored-By: Yangyu Chen <cyy@cyyself.name>
Co-Authored-By: Tang Haojin <tanghaojin@outlook.com>
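A hedged usage sketch (the cross-compiler name is an assumption; only the -mcpu value comes from this patch):

  $ riscv64-unknown-elf-gcc -mcpu=xiangshan-kunminghu -O2 foo.c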
2025-06-05[RISC-V] Improve signed division by 2^nJeff Law2-0/+52
So another class of cases where we can do better than a zicond sequence. Like the prior patch, this came up while evaluating some code from Shreya to detect more conditional move cases.

This patch allows us to use the "splat the sign bit" idiom to efficiently select between 0 and 2^n-1. That's particularly important for signed division by a power of two: you conditionally add 2^n-1 to the numerator, then right shift that result.

Using zicond somewhat naively you get something like this (for n / 4096):

> li	a5,4096
> addi	a5,a5,-1
> slti	a4,a0,0
> add	a5,a0,a5
> czero.eqz	a5,a5,a4
> czero.nez	a0,a0,a4
> add	a0,a0,a5
> srai	a0,a0,12

After this patch you get this instead:

> srai	a5,a0,63
> srli	a5,a5,52
> add	a0,a5,a0
> srai	a0,a0,12

It's not *that* much faster, but it's certainly shorter.

So the trick here is that after splatting the sign bit we have 0 or -1, so a subsequent logical shift right generates 0 or 2^n-1. Yes, there's a nice variety of other constant pairs we can select between; some notes have been added to the PR I opened yesterday.

The first thing we need to do is throttle back zicond generation. Unfortunately we don't see the constants from the division-by-2^n algorithm, so we have to disable it for all lt/ge 0 cases. This can have small negative impacts; I looked at this across spec and didn't see anything I was particularly worried about, and saw numerous small improvements from that alone.

With that in place, we need to recognize the form seen by combine. Essentially it sees the splat of the sign bit feeding a logical AND. We split that into two right shifts.

This has survived in my tester. Waiting on upstream pre-commit before moving forward.

gcc/
	* config/riscv/riscv.cc (riscv_expand_conditional_move): Avoid
	zicond in some cases involving sign bit tests.
	* config/riscv/riscv.md: Split a splat of the sign bit feeding
	a masking off of the high bits into a pair of right shifts.

gcc/testsuite
	* gcc.target/riscv/nozicond-3.c: New test.
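A hedged sketch of the kind of source that produces the sequences above (the function name is hypothetical; the commit only shows the n / 4096 expression):

long
div_pow2 (long n)
{
  /* Conditionally add 2^12 - 1 = 4095 to the numerator, then shift:
     srai splats the sign bit, srli turns it into 0 or 4095.  */
  return n / 4096;
}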
2025-06-05[i386] Improve "mov<mode>cc" expander for DImode immediates [PR120553]Uros Bizjak1-2/+2
"mov<mode>cc" expander uses x86_64_general_operand predicate that limits the range of immediate operands to 32-bit size. The usage of this predicate causes ifcvt to force out-of-range immediates to registers when converting through noce_try_cmove. The testcase: long long foo (long long c) { return c >= 0 ? 0x400000000ll : -1ll; } compiles (-O2) to: foo: testq %rdi, %rdi movq $-1, %rax movabsq $0x400000000, %rdx cmovns %rdx, %rax ret The above testcase can be compiled to a more optimized code without problematic CMOV instruction if 64-bit immediates are allowed in "mov<mode>cc" expander: foo: movq %rdi, %rax sarq $63, %rax btsq $34, %rax ret The expander calls the ix86_expand_int_movcc function which internally sanitizes arguments of emitted logical insns using expand_simple_binop. The out-of-range immediates are forced to a temporary register just before the instruction, so the instruction combiner is then able to synthesize 64-bit BTS instruction. The code improves even for non-exact-log2 64-bit immediates, e.g. long long foo (long long c) { return c >= 0 ? 0x400001234ll : -1ll; } that now compiles to: foo: movabsq $0x400001234, %rdx movq %rdi, %rax sarq $63, %rax orq %rdx, %rax ret again avoiding problematic CMOV instruction. PR target/120553 gcc/ChangeLog: * config/i386/i386.md (mov<mode>cc): Use "general_operand" predicate for operands 2 and 3 for all modes. gcc/testsuite/ChangeLog: * gcc.target/i386/pr120553.c: New test.
2025-06-05aarch64:sve: Use make_ssa_name instead of create_tmp_var in the folderAndrew Pinski1-2/+5
Currently gimple_folder::convert_and_fold calls create_tmp_var; that means that, while in ssa form, the pass which calls fold_stmt will always have to update the ssa (via TODO_update_ssa or otherwise). This is not very useful, since we know the result will always be an ssa name; using make_ssa_name instead is better and doesn't need to depend on the ssa updater. Plus, this should bring a small compile-time performance and memory-usage improvement, since it uses an anonymous ssa name rather than creating a full decl.

Changes since v1:
* Use make_ssa_name instead of create_tmp_reg_or_ssa_name; anonymous ssa names are allowed early on in gimple too.

Built and tested on aarch64-linux-gnu.

gcc/ChangeLog:
	* config/aarch64/aarch64-sve-builtins.cc: Include value-range.h
	and tree-ssanames.h.
	(gimple_folder::convert_and_fold): Use make_ssa_name instead of
	create_tmp_var for the temporary. Add comment about callback
	argument.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
2025-06-05RISC-V: Fix ICE for gcc.dg/graphite/pr33576.c with rv32gcvPan Li2-3/+7
RVV has no such insn as v2 = div (vec_dup (x), v1), so generated rtl of that shape hits the unreachable assert when expanding the insn. This patch removes the div op from the binary op form (vec_dup (x), v) to avoid matching the pattern by mistake.

No new test is introduced, as pr33576.c already covers this.

The below test suites are passed for this patch:
* The rv64gcv full regression test.

gcc/ChangeLog:
	* config/riscv/autovec-opt.md: Leverage the vdup_v and v_vdup
	binary op iterators for the different patterns.
	* config/riscv/vector-iterators.md: Add vdup_v and v_vdup binary
	op iterators.

Signed-off-by: Pan Li <pan2.li@intel.com>
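A hedged sketch (hypothetical, not the pr33576.c testcase) of a source shape whose vectorization yields div (vec_dup (x), v1), i.e. a broadcast scalar dividend over a vector divisor:

void
scalar_by_vec (int *out, int *in, int x, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = x / in[i];
}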
2025-06-05[RISC-V] Improve sequences to generate -1, 1 in some cases.Jeff Law1-1/+35
This patch has a minor improvement to if-converted sequences based on observations I found while evaluating another patch from Shreya to handle more cases with zicond insns. Specifically there is a smaller/faster way than zicond to generate a -1,1 result when the condition is testing the sign bit. So let's consider these two tests (rv64): long foo1 (long c, long a) { return c >= 0 ? 1 : -1; } long foo2 (long c, long a) { return c < 0 ? -1 : 1; } So if we right arithmetic shift c by 63 bits, that splats the sign bit across a register giving us 0, -1 for the first test and -1, 0 for the second test. We then unconditionally turn on the LSB resulting in 1, -1 for the first case and -1, 1 for the second. This is implemented as a 4->2 splitter. There's another pair of cases we don't handle because we don't have 4->3 splitters. Specifically if the true/false values are reversed in the above examples without reversing the condition. Raphael is playing a bit in the gimple space to see what opportunities might exist to recognize more idioms in phiopt and generate better code earlier. No idea how that's likely to pan out. This is a pretty consistent small win. It's been through the rounds in my tester. Just waiting on a green light from pre-commit testing. gcc/ * config/riscv/zicond.md: Add new splitters to select 1, -1 or -1, 1 based on a sign bit test. gcc/testsuite/ * gcc.target/riscv/nozicond-1.c: New test. * gcc.target/riscv/nozicond-2.c: New test.
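A hedged sketch of the expected rv64 output for foo1 under the new splitters (the register allocation here is an assumption):

	srai	a0,a0,63	# splat the sign bit: 0 if c >= 0, -1 if c < 0
	ori	a0,a0,1		# set the LSB: 0 -> 1, -1 stays -1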
2025-06-05RISC-V: Support Ssu64xl extension.Jiawei2-0/+15
Support the Ssu64xl extension, which requires UXLEN to be 64. gcc/ChangeLog: * config/riscv/riscv-ext.def: New extension definition. * config/riscv/riscv-ext.opt: New extension mask. * doc/riscv-ext.texi: Document the new extension. gcc/testsuite/ChangeLog: * gcc.target/riscv/arch-ssu64xl.c: New test. Signed-off-by: Jiawei <jiawei@iscas.ac.cn>
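A hedged usage sketch matching the new test (the base ISA string and compiler name are assumptions; only the ssu64xl extension name comes from this patch):

  $ riscv64-unknown-elf-gcc -march=rv64gc_ssu64xl -mabi=lp64d -c foo.c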
2025-06-05RISC-V: Support Sstvecd extension.Jiawei2-0/+15
Support the Sstvecd extension, which allows the Supervisor Trap Vector Base Address register (stvec) to support Direct mode.

gcc/ChangeLog:
	* config/riscv/riscv-ext.def: New extension definition.
	* config/riscv/riscv-ext.opt: New extension mask.
	* doc/riscv-ext.texi: Document the new extension.

gcc/testsuite/ChangeLog:
	* gcc.target/riscv/arch-sstvecd.c: New test.

Signed-off-by: Jiawei <jiawei@iscas.ac.cn>