path: root/gcc/config
Age | Commit message | Author | Files | Lines
2025-07-09  [PATCH] [PR target/109286] H8/300: Fix warnings about initfini sections missing attributes  (Jan Dubiec, 1 file, -13/+8)
The patch changes the order of inclusions, i.e. elfos.h is included before the target-specific h8300/h8300.h, in a way similar to a few other targets.  Thanks to this change it is possible to override macros from elfos.h in h8300/h8300.h, in particular the .init/.fini section definitions.

    PR target/109286

gcc/ChangeLog:

    * config.gcc: Include elfos.h before h8300/h8300.h.
    * config/h8300/h8300.h (INIT_SECTION_ASM_OP): Override default
    version from elfos.h.
    (FINI_SECTION_ASM_OP): Ditto.
    (ASM_DECLARE_FUNCTION_NAME): Ditto.
    (ASM_GENERATE_INTERNAL_LABEL): Macro removed because it was being
    overridden in elfos.h anyway.
    (ASM_OUTPUT_SKIP): Ditto.
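As an illustration of why the include order matters (a sketch under assumptions, not the actual h8300.h contents): a target header included after elfos.h can override the generic definitions directly, for example:

    /* Hypothetical override in h8300/h8300.h; the exact section flags
       shown here are assumed for illustration only.  */
    #undef  INIT_SECTION_ASM_OP
    #define INIT_SECTION_ASM_OP "\t.section .init,\"ax\",@progbits"
    #undef  FINI_SECTION_ASM_OP
    #define FINI_SECTION_ASM_OP "\t.section .fini,\"ax\",@progbits"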
2025-07-09  [RISC-V][PR target/120642] Avoid propagating constant AVL for theadvector  (Jeff Law, 1 file, -1/+1)
AVL propagation currently assumes that it can propagate a constant AVL into any vector insn and trips an assert if the insn fails to recognize after such a propagation.  However, for xtheadvector that is not a correct assumption; xtheadvector does not allow the vector length to be a constant integer (other than zero, which is allowed via x0).

After consulting with Jin Ma (thanks!) we agree the right fix is to avoid creating the immediate AVL for xtheadvector.

This has been tested in my tester, just waiting for the pre-commit tester to spin it.

    PR target/120642

gcc/
    * config/riscv/riscv-avlprop.cc (pass_avlprop::execute): Do not do
    constant AVL propagation for xtheadvector.

gcc/testsuite/
    * gcc.target/riscv/rvv/xtheadvector/pr120642.c: New test.
2025-07-09arm: remove useless push/pop pragmas in arm_neon.hChristophe Lyon1-5/+0
Remove #pragma GCC target ("arch=armv8.2-a+bf16") since it matches the preceding pragma GCC target and is thus useless. gcc/ChangeLog: * config/arm/arm_neon.h: Remove useless push/pop pragmas.
2025-07-08  xtensa: Fix B[GE/LT]UI instructions with immediate values of 32768 or 65536 not being emitted  (Takayuki 'January June' Suwa, 1 file, -12/+5)
This is because in canonicalize_comparison() in gcc/expmed.cc, the COMPARE rtx_cost() for the immediate values in the title does not change between the old and new versions.  This patch fixes that.

(note: Currently, this patch only works if some constant propagation optimizations are enabled (-O2 or higher) or if bare large constant assignments are possible (-mconst16 or -mauto-litpools).  In the future I hope to make it work at -O1...)

gcc/ChangeLog:

    * config/xtensa/xtensa.cc (xtensa_b4const_or_zero): Remove.
    (xtensa_b4const): Add a case where the value is 0, and rename to
    xtensa_b4const_or_zero.
    (xtensa_rtx_costs): Fix to also consider the result of
    xtensa_b4constu().

gcc/testsuite/ChangeLog:

    * gcc.target/xtensa/BGEUI-BLTUI-32k-64k.c: New.
2025-07-08s390: Always compute address of stack protector guardStefan Schulze Frielinghaus1-4/+43
Computing the address of the thread pointer on s390 involves multiple instructions and therefore bears the risk that the address of the canary or intermediate values of it are spilled after prologue in order to be reloaded for the epilogue. Since there exists no mechanism to ensure that a value is not coming from stack, as a precaution compute the address always twice, i.e., one time for the prologue and one time for the epilogue. Note, even if there were such a mechanism, emitting optimal code is non-trivial since there exist cases with opposing requirements as e.g. if the thread pointer is not only computed for the TLS guard but also for other TLS objects. For the latter accesses it is desired to spill and reload the thread pointer instead of recomputing it whereas for the former it is not. gcc/ChangeLog: * config/s390/s390.md (stack_protect_get_tpsi): New insn. (stack_protect_get_tpdi): New insn. (stack_protect_set): Use new insn. (stack_protect_test): Use new insn. gcc/testsuite/ChangeLog: * gcc.target/s390/stack-protector-guard-tls-1.c: New test.
2025-07-08RISC-V: Do not use vsetivli for THeadVector.Robin Dapp1-1/+1
In emit_vlmax_insn_lra we use a vsetivli for an immediate AVL. XTHeadVector does not support this, so guard appropriately. PR target/120461 gcc/ChangeLog: * config/riscv/riscv-v.cc (emit_vlmax_insn_lra): Do not emit vsetivli for XTHeadVector. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/xtheadvector/pr120461.c: New test.
2025-07-08  RISC-V: Ignore non-types in builtin function hash.  (Robin Dapp, 1 file, -0/+6)
If a user passes a string that doesn't represent a variable we still try to compute a hash for its type.  Its tree does not represent a type, though, but just an exceptional node.  This patch just ignores it, leaving the error to the checking code later.

    PR target/113829

gcc/ChangeLog:

    * config/riscv/riscv-vector-builtins.cc
    (registered_function::overloaded_hash): Skip non-type arguments.

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/rvv/base/pr113829.c: New test.
2025-07-08[PATCH] riscv: allow zero in zacas subword atomic casAndreas Schwab1-1/+1
gcc: PR target/120995 * config/riscv/sync.md (zacas_atomic_cas_value_strong<mode>): Allow op3 to be zero. gcc/testsuite: PR target/120995 * gcc.target/riscv/amo/zabha-zacas-atomic-cas.c: New test.
2025-07-08  add masked-epilogue tuning  (Richard Biener, 2 files, -0/+64)
The following adds an x86 tuning to enable the use of AVX512 masked epilogues in cases we heuristically determine it to be not detrimental by high chance.  Basically problematic cases are when there are data streams that are both stored and loaded from and an outer loop could end up executing only the inner loop masked epilogue, and with unlucky data stream advancement from the outer loop end up needing to forward from masked stores to masked loads.  This isn't very well handled, esp. for the case where unmasked operations would not need to forward at all - that is, when forwarding completely from the masked out portion of the store (like the AVX upper half to the AVX lower half of a load).

There's also the case where the number of iterations is known at compile time; only with cost comparing would we consider a non-masked epilog - as we are not doing that we have to add heuristics to avoid masking when a single vector epilog iteration would cover all scalar iterations left (this is exercised by gcc.target/i386/pr110310.c).

SPEC CPU 2017 shows 3% text size savings over not using masked epilogues with performance impact in the noise.  Masking all vector epilogues gets that to 4% text size savings with some major runtime regressions in 503.bwaves_r and 527.cam4_r (measured on a Zen4 system); we're leaving a 5% improvement for 549.fotonik3d_r unrealized with the implemented heuristic.

With the heuristics we turn 22513 vector epilogues + up to 12305 scalar epilogues into 12305 masked vector epilogues, of which 574 are for AVX vector sizes, 79 for SSE vector sizes and the rest for AVX512.  When masking all epilogues we get 14567 of them from 29467 vector + up to 14567 scalar epilogues, so the heuristics disable an additional 20% of masked epilogues.

    * config/i386/x86-tune.def (X86_TUNE_AVX512_MASKED_EPILOGUES): New
    tunable, default on for m_ZNVER4 and m_ZNVER5.
    * config/i386/i386.cc (ix86_vector_costs::finish_cost): With
    X86_TUNE_AVX512_MASKED_EPILOGUES and when the main loop had a
    vectorization factor > 2 use a masked epilogue when possible and
    when not obviously problematic.

    * gcc.target/i386/vect-mask-epilogue-1.c: New testcase.
    * gcc.target/i386/vect-mask-epilogue-2.c: Likewise.
    * gcc.target/i386/vect-epilogues-3.c: Adjust.
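For a concrete picture (an assumed illustration, not one of the commit's testcases), a plain loop like the one below is the kind of candidate where this tuning decides whether the remainder iterations get an AVX512 masked epilogue or a scalar tail:

    /* With e.g. -O3 -march=znver5 (assumed flags), the tail of this loop
       can be handled by a single masked vector iteration instead of a
       scalar epilogue loop.  */
    void
    saxpy (float *a, const float *b, float x, int n)
    {
      for (int i = 0; i < n; i++)
        a[i] += x * b[i];
    }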
2025-07-07[vxworks] [x86] disable vxworks6 PIC on vxworks7Alexandre Oliva5-8/+33
VxWorks6 used symbols __GOTT_BASE__ and __GOTT_INDEX__ to obtain the address of the global offset table. Starting with VxWorks7, that is no longer the case, but we've still issued these symbols in output_set_got. Do that only with VxWorks<7. Switching to the call-based PIC register sequence, we have to set the flag that prevents the use of the red zone, and AFAICT the reasons that ruled out GOTOFF and other relative addressing no longer apply to VxWorks7+. for gcc/ChangeLog * config/vxworks-dummy.h (TARGET_VXWORKS_VAROFF): New. (TARGET_VXWORKS_GOTTPIC): New. * config/vxworks.h (TARGET_VXWORKS_VAROFF): Override. (TARGET_VXWORKS_GOTTPIC): Likewise. * config/i386/i386.cc (output_set_got): Disable VxWorks6 GOT sequence on VxWorks7. (legitimize_pic_address): Accept relative addressing of labels on VxWorks7. (ix86_delegitimize_address_1): Likewise. (ix86_output_addr_diff_elt): Likewise. * config/i386/i386.md (tablejump): Likewise. (set_got, set_got_labelled): Set no-red-zone flag on VxWorks7. * config/i386/predicates.md (gotoff_operand): Test TARGET_VXWORKS_VAROFF.
2025-07-07  xtensa: Remove TARGET_PROMOTE_PROTOTYPES  (H.J. Lu, 1 file, -3/+18)
xtensa ABI requires sign extension of signed 8/16-bit arguments to 32 bits and zero extension of unsigned 8/16-bit arguments to 32 bits.  TARGET_PROMOTE_PROTOTYPES is an optimization, not an ABI requirement.  Remove TARGET_PROMOTE_PROTOTYPES and define xtensa_promote_function_mode to properly extend 8/16-bit arguments to 32 bits.

gcc/
    PR target/120888
    * config/xtensa/xtensa.cc (xtensa_promote_function_mode): New.
    (TARGET_PROMOTE_FUNCTION_MODE): Use.
    (TARGET_PROMOTE_PROTOTYPES): Removed.

gcc/testsuite/
    PR target/120888
    * gcc.target/xtensa/pr120888-1.c: New test.
    * gcc.target/xtensa/pr120888-2.c: Likewise.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
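A tiny illustration of the ABI rule in question (an assumed example, not from the patch): a callee like the one below may rely on its narrow argument already arriving extended to 32 bits, so the extension must be guaranteed by argument promotion rather than by a prototype-driven optimization.

    /* Assumed example: the xtensa ABI requires c to arrive sign-extended
       to 32 bits in its argument register.  */
    int
    use_schar (signed char c)
    {
      return c + 1;
    }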
2025-07-07  s390: Optimize fmin/fmax.  (Juergen Christ, 3 files, -14/+38)
On VXE targets, we can directly use the fp min/max instruction instead of calling into libm for fmin/fmax etc.  Provide fmin/fmax versions also for vectors even though they cannot be called directly.  This will be exploited with a follow-up patch when reductions are introduced.

gcc/ChangeLog:

    * config/s390/s390.md: Update UNSPECs.
    * config/s390/vector.md (fmax<mode>3): New expander.
    (fmin<mode>3): New expander.
    * config/s390/vx-builtins.md (*fmin<mode>): New insn.
    (vfmin<mode>): Redefined to use new insn.
    (*fmax<mode>): New insn.
    (vfmax<mode>): Redefined to use new insn.

gcc/testsuite/ChangeLog:

    * gcc.target/s390/fminmax-1.c: New test.
    * gcc.target/s390/fminmax-2.c: New test.

Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>
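A minimal example (assumed, not the actual fminmax-1.c testcase) of code that can now map to the VXE min/max instructions instead of a libm call:

    /* On a VXE-capable s390x target with suitable flags (assumed, e.g.
       -O2 -march=z14), these should expand inline rather than call
       fmax/fmin from libm.  */
    double dmax (double a, double b) { return __builtin_fmax (a, b); }
    double dmin (double a, double b) { return __builtin_fmin (a, b); }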
2025-07-07  aarch64: Improve popcountti2 with SVE  (Kyrylo Tkachov, 1 file, -0/+13)
The TImode popcount sequence can be slightly improved with SVE.  If we generate:

    ldr     q31, [x0]
    ptrue   p7.b, vl16
    cnt     z31.d, p7/m, z31.d
    addp    d31, v31.2d
    fmov    x0, d31
    ret

instead of:

h128:
    ldr     q31, [x0]
    cnt     v31.16b, v31.16b
    addv    b31, v31.16b
    fmov    w0, s31
    ret

we use the ADDP instruction for reduction, which is cheaper on all CPUs AFAIK, as it is only a single 64-bit addition vs the tree of additions for ADDV.  For example, on a CPU like Grace we get a latency and throughput of 2,4 vs 4,1 for ADDV.  We do generate one more instruction due to the PTRUE being materialised, but that is cheap itself and can be scheduled away from the critical path or even CSE'd with other PTRUE constants.

As this sequence is larger code size-wise it is avoided for -Os.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>

gcc/
    * config/aarch64/aarch64.md (popcountti2): Add TARGET_SVE path.

gcc/testsuite/
    * gcc.target/aarch64/popcnt9.c: Add +nosve to target pragma.
    * gcc.target/aarch64/popcnt13.c: New test.
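For reference, a source function of roughly this shape (a guess at what the h128 routine above corresponds to; the actual testcase may differ) exercises the popcountti2 expander:

    /* __builtin_popcountg is type-generic and accepts unsigned __int128
       on recent GCC; the TImode popcount pattern handles the expansion.  */
    unsigned
    h128 (unsigned __int128 *p)
    {
      return __builtin_popcountg (*p);
    }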
2025-07-07  RISC-V: Implement unsigned scalar SAT_MUL from uint128_t  (Pan Li, 3 files, -0/+94)
This patch would like to implement the SAT_MUL scalar unsigned from uint128_t, aka:

    NT __attribute__((noinline))
    sat_u_mul_##NT##_fmt_1 (NT a, NT b)
    {
      uint128_t x = (uint128_t)a * (uint128_t)b;
      NT max = -1;
      if (x > (uint128_t)(max))
        return max;
      else
        return (NT)x;
    }

Take uint64_t and uint8_t as example:

Before this patch for uint8_t:

    sat_u_mul_uint8_t_from_uint128_t_fmt_1:
        mulhu   a5,a0,a1
        mul     a0,a0,a1
        bne     a5,zero,.L3
        li      a5,255
        bleu    a0,a5,.L4
    .L3:
        li      a0,255
    .L4:
        andi    a0,a0,0xff
        ret

After this patch for uint8_t:

    sat_u_mul_uint8_t_from_uint128_t_fmt_1:
        mul     a0,a0,a1
        li      a5,255
        sltu    a5,a5,a0
        neg     a5,a5
        or      a0,a0,a5
        andi    a0,a0,0xff
        ret

Before this patch for uint64_t:

    sat_u_mul_uint64_t_from_uint128_t_fmt_1:
        mulhu   a5,a0,a1
        mul     a0,a0,a1
        beq     a5,zero,.L4
        li      a0,-1
    .L4:
        ret

After this patch for uint64_t:

    sat_u_mul_uint64_t_from_uint128_t_fmt_1:
        mulhsu  a5,a1,a0
        mul     a0,a0,a1
        snez    a5,a5
        neg     a5,a5
        or      a0,a0,a5
        ret

gcc/ChangeLog:

    * config/riscv/riscv-protos.h (riscv_expand_usmul): Add new func decl.
    * config/riscv/riscv.cc (riscv_expand_xmode_usmul): Add new func to
    expand Xmode SAT_MUL.
    (riscv_expand_non_xmode_usmul): Ditto but for non-Xmode.
    (riscv_expand_usmul): Add new func to implement SAT_MUL.
    * config/riscv/riscv.md (usmul<mode>3): Add new pattern to match
    standard name usmul.

Signed-off-by: Pan Li <pan2.li@intel.com>
2025-07-07  s390: Add some missing vector patterns.  (Juergen Christ, 3 files, -11/+32)
Some patterns that are detected by the autovectorizer can be supported by s390.  Add expanders such that autovectorization of these patterns works.

RTL for the builtins used unspec to represent highpart multiplication.  Replace this by the correct RTL to allow further simplification.

gcc/ChangeLog:

    * config/s390/s390.md: Removed unused unspecs.
    * config/s390/vector.md (avg<mode>3_ceil): New expander.
    (uavg<mode>3_ceil): New expander.
    (smul<mode>3_highpart): New expander.
    (umul<mode>3_highpart): New expander.
    * config/s390/vx-builtins.md (vec_umulh<mode>): Remove unspec.
    (vec_smulh<mode>): Remove unspec.

gcc/testsuite/ChangeLog:

    * gcc.target/s390/vector/pattern-avg-1.c: New test.
    * gcc.target/s390/vector/pattern-mulh-1.c: New test.

Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>
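As an illustration (an assumption, not the actual pattern-mulh-1.c testcase), the kind of loop the new highpart-multiply expanders let the autovectorizer handle looks like:

    /* Each element takes the high 16 bits of a 16x16->32 product, which
       the vectorizer can recognize as a highpart multiplication.  */
    void
    mulh (short *restrict r, const short *restrict a,
          const short *restrict b, int n)
    {
      for (int i = 0; i < n; i++)
        r[i] = ((int) a[i] * b[i]) >> 16;
    }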
2025-07-07  aarch64: Add support for unpacked SVE FP comparisons  (Spencer Abson, 2 files, -25/+49)
This patch extends our vec_cmp expander to support partial FP modes.  We use a predicate mode that is narrower than the operation's VPRED to govern unpacked FP operations under flag_trapping_math, so the expansion must handle cases where the comparison's target and governing predicates have different modes.

While such predicates enable all of the defined part of the operation, they are not all-true.  Their false bits contribute to the (trapping) behavior of the operation, so we cannot have SVE_KNOWN_PTRUE.

gcc/ChangeLog:

    * config/aarch64/aarch64-sve.md (vec_cmp<mode><vpred>): Extend to
    handle partial FP modes.
    (@aarch64_pred_fcm<cmp_op><mode>): Likewise.
    (@aarch64_pred_fcmuo<mode>): Likewise.
    (*one_cmpl<mode>3): Rename to...
    (@aarch64_pred_one_cmpl<mode>_z): ... this.
    * config/aarch64/aarch64.cc (aarch64_emit_sve_fp_cond): Allow the
    target and governing predicates to have different modes.
    (aarch64_emit_sve_or_fp_conds): Likewise.
    (aarch64_emit_sve_invert_fp_cond): Likewise.
    (aarch64_expand_sve_vec_cmp_float): Likewise.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/sve/unpacked_fcm_1.c: New test.
    * gcc.target/aarch64/sve/unpacked_fcm_2.c: Likewise.
2025-07-07aarch64: Fix neon-sve-bridge.c failures for big-endianRichard Sandiford1-10/+6
Lowpart subregs are generally disallowed on big-endian SVE vector registers, since the first memory element is stored at the least significant end of the register, rather than the most significant end. (See the comment at the head of aarch64-sve.md for details, and aarch64_modes_compatible_p for the implementation.) This means that arm_sve_neon_bridge.h needs to use custom define_insns for big-endian targets, in lieu of using lowpart subregs. However, one of those define_insns relied on the prohibited lowparts internally, to convert an Advanced SIMD register to an SVE register. Since the lowpart is not allowed, the lowpart_subreg would return null, leading to a later ICE. The simplest fix seems to be to use %Z instead, to force the Advanced SIMD register to be written as an SVE register. gcc/ * config/aarch64/aarch64-sve.md (@aarch64_sve_set_neonq_<mode>): Use %Z instead of lowpart_subreg. Tweak formatting.
2025-07-07  aarch64: Fix ZIP1 order in aarch64_expand_vector_init [PR118891]  (Richard Sandiford, 1 file, -0/+7)
aarch64_expand_vector_init contains some divide-and-conquer code that tries to load the odd and even elements into 64-bit registers and then ZIP them together.  On big-endian targets, the even elements are more significant than the odd elements and so should come second in the ZIP.

This fixes many execution failures on aarch64_be-elf, including gcc.c-torture/execute/pr28982a.c.

gcc/
    PR target/118891
    * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Fix the
    ZIP1 operand order for big-endian targets.
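For illustration only (an assumed example, not one of the failing torture tests), an initialization from four variable elements is the kind of case that goes through the divide-and-conquer ZIP1 path:

    #include <arm_neon.h>

    /* A non-constant vector init; aarch64_expand_vector_init may build it
       by loading two 64-bit halves and zipping them together.  */
    int32x4_t
    make_vec (int32_t a, int32_t b, int32_t c, int32_t d)
    {
      return (int32x4_t) { a, b, c, d };
    }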
2025-07-07  x86: Improve vector_loop/unrolled_loop for memset/memcpy  (H.J. Lu, 1 file, -61/+210)
1. Don't generate the loop if the loop count is 1.
2. For memset with vector on small size, use vector if small size supports vector, otherwise use the scalar value.
3. Always expand vector-version of memset for vector_loop.
4. Always duplicate the promoted scalar value for vector_loop if not 0 nor -1.
5. Use misaligned prologue if alignment isn't needed.  When misaligned prologue is used, check if destination is actually aligned and update destination alignment if aligned.
6. Use move_by_pieces and store_by_pieces for memcpy and memset epilogues with the fixed epilogue size to enable overlapping moves and stores.

The included tests show that codegen of vector_loop/unrolled_loop for memset/memcpy is significantly improved.  For

    void
    foo (void *p1, size_t len)
    {
      __builtin_memset (p1, 0, len);
    }

with

    -O2 -minline-all-stringops -mmemset-strategy=vector_loop:256:noalign,libcall:-1:noalign -march=x86-64

we used to generate

    foo:
    .LFB0:
        .cfi_startproc
        movq    %rdi, %rax
        pxor    %xmm0, %xmm0
        cmpq    $64, %rsi
        jnb     .L18
    .L2:
        andl    $63, %esi
        je      .L1
        xorl    %edx, %edx
        testb   $1, %sil
        je      .L5
        movl    $1, %edx
        movb    $0, (%rax)
        cmpq    %rsi, %rdx
        jnb     .L19
    .L5:
        movb    $0, (%rax,%rdx)
        movb    $0, 1(%rax,%rdx)
        addq    $2, %rdx
        cmpq    %rsi, %rdx
        jb      .L5
    .L1:
        ret
        .p2align 4,,10
        .p2align 3
    .L18:
        movq    %rsi, %rdx
        xorl    %eax, %eax
        andq    $-64, %rdx
    .L3:
        movups  %xmm0, (%rdi,%rax)
        movups  %xmm0, 16(%rdi,%rax)
        movups  %xmm0, 32(%rdi,%rax)
        movups  %xmm0, 48(%rdi,%rax)
        addq    $64, %rax
        cmpq    %rdx, %rax
        jb      .L3
        addq    %rdi, %rax
        jmp     .L2
    .L19:
        ret
        .cfi_endproc

with very poor prologue/epilogue.  With this patch, we now generate:

    foo:
    .LFB0:
        .cfi_startproc
        pxor    %xmm0, %xmm0
        cmpq    $64, %rsi
        jnb     .L2
        testb   $32, %sil
        jne     .L19
        testb   $16, %sil
        jne     .L20
        testb   $8, %sil
        jne     .L21
        testb   $4, %sil
        jne     .L22
        testq   %rsi, %rsi
        jne     .L23
    .L1:
        ret
        .p2align 4,,10
        .p2align 3
    .L2:
        movups  %xmm0, -64(%rdi,%rsi)
        movups  %xmm0, -48(%rdi,%rsi)
        movups  %xmm0, -32(%rdi,%rsi)
        movups  %xmm0, -16(%rdi,%rsi)
        subq    $1, %rsi
        cmpq    $64, %rsi
        jb      .L1
        andq    $-64, %rsi
        xorl    %eax, %eax
    .L9:
        movups  %xmm0, (%rdi,%rax)
        movups  %xmm0, 16(%rdi,%rax)
        movups  %xmm0, 32(%rdi,%rax)
        movups  %xmm0, 48(%rdi,%rax)
        addq    $64, %rax
        cmpq    %rsi, %rax
        jb      .L9
        ret
        .p2align 4,,10
        .p2align 3
    .L23:
        movb    $0, (%rdi)
        testb   $2, %sil
        je      .L1
        xorl    %eax, %eax
        movw    %ax, -2(%rdi,%rsi)
        ret
        .p2align 4,,10
        .p2align 3
    .L19:
        movups  %xmm0, (%rdi)
        movups  %xmm0, 16(%rdi)
        movups  %xmm0, -32(%rdi,%rsi)
        movups  %xmm0, -16(%rdi,%rsi)
        ret
        .p2align 4,,10
        .p2align 3
    .L20:
        movups  %xmm0, (%rdi)
        movups  %xmm0, -16(%rdi,%rsi)
        ret
        .p2align 4,,10
        .p2align 3
    .L21:
        movq    $0, (%rdi)
        movq    $0, -8(%rdi,%rsi)
        ret
        .p2align 4,,10
        .p2align 3
    .L22:
        movl    $0, (%rdi)
        movl    $0, -4(%rdi,%rsi)
        ret
        .cfi_endproc

gcc/
    PR target/120670
    PR target/120683
    * config/i386/i386-expand.cc (expand_set_or_cpymem_via_loop): Don't
    generate the loop if the loop count is 1.
    (expand_cpymem_epilogue): Use move_by_pieces.
    (setmem_epilogue_gen_val): New.
    (expand_setmem_epilogue): Use store_by_pieces.
    (expand_small_cpymem_or_setmem): Choose cpymem mode from MOVE_MAX.
    For memset with vector and the size is smaller than the vector size,
    first try the narrower vector, otherwise, use the scalar value.
    (promote_duplicated_reg): Duplicate the scalar value for vector.
    (ix86_expand_set_or_cpymem): Always expand vector-version of memset
    for vector_loop.  Use misaligned prologue if alignment isn't needed
    and destination isn't aligned.  Always initialize vec_promoted_val
    from the promoted scalar value for vector_loop.

gcc/testsuite/
    PR target/120670
    PR target/120683
    * gcc.target/i386/auto-init-padding-9.c: Updated.
    * gcc.target/i386/memcpy-strategy-12.c: Likewise.
    * gcc.target/i386/memset-strategy-25.c: Likewise.
    * gcc.target/i386/memset-strategy-29.c: Likewise.
    * gcc.target/i386/memset-strategy-30.c: Likewise.
    * gcc.target/i386/memset-strategy-31.c: Likewise.
    * gcc.target/i386/memcpy-pr120683-1.c: New test.
    * gcc.target/i386/memcpy-pr120683-2.c: Likewise.
    * gcc.target/i386/memcpy-pr120683-3.c: Likewise.
    * gcc.target/i386/memcpy-pr120683-4.c: Likewise.
    * gcc.target/i386/memcpy-pr120683-5.c: Likewise.
    * gcc.target/i386/memcpy-pr120683-6.c: Likewise.
    * gcc.target/i386/memcpy-pr120683-7.c: Likewise.
    * gcc.target/i386/memset-pr120683-1.c: Likewise.
    * gcc.target/i386/memset-pr120683-2.c: Likewise.
    * gcc.target/i386/memset-pr120683-3.c: Likewise.
    * gcc.target/i386/memset-pr120683-4.c: Likewise.
    * gcc.target/i386/memset-pr120683-5.c: Likewise.
    * gcc.target/i386/memset-pr120683-6.c: Likewise.
    * gcc.target/i386/memset-pr120683-7.c: Likewise.
    * gcc.target/i386/memset-pr120683-8.c: Likewise.
    * gcc.target/i386/memset-pr120683-9.c: Likewise.
    * gcc.target/i386/memset-pr120683-10.c: Likewise.
    * gcc.target/i386/memset-pr120683-11.c: Likewise.
    * gcc.target/i386/memset-pr120683-12.c: Likewise.
    * gcc.target/i386/memset-pr120683-13.c: Likewise.
    * gcc.target/i386/memset-pr120683-14.c: Likewise.
    * gcc.target/i386/memset-pr120683-15.c: Likewise.
    * gcc.target/i386/memset-pr120683-16.c: Likewise.
    * gcc.target/i386/memset-pr120683-17.c: Likewise.
    * gcc.target/i386/memset-pr120683-18.c: Likewise.
    * gcc.target/i386/memset-pr120683-19.c: Likewise.
    * gcc.target/i386/memset-pr120683-20.c: Likewise.
    * gcc.target/i386/memset-pr120683-21.c: Likewise.
    * gcc.target/i386/memset-pr120683-22.c: Likewise.
    * gcc.target/i386/memset-pr120683-23.c: Likewise.
2025-07-06AVR: Fix a typo in avr-mcus.def.Georg-Johann Lay1-11/+11
gcc/ * config/avr/avr-mcus.def: -mmcu= takes lower case MCU names. * doc/avr-mmcu.texi: Rebuild.
2025-07-06AVR: Add support for AVR32DAxxS, AVR64DAxxS, AVR128DAxxS devices.Georg-Johann Lay1-0/+11
gcc/ * config/avr/avr-mcus.def (avr32da28S, avr32da32S, avr32da48S) (avr64da28S, avr64da32S, avr64da48S avr64da64S) (avr128da28S, avr128da32S, avr128da48S, avr128da64S): Add devices. * doc/avr-mmcu.texi: Rebuild.
2025-07-05[vxworks] [ppc] match TARGET_VXWORKS64 to TARGET_64BITAlexandre Oliva1-0/+15
Configuring gcc for --target=powerpc-wrs-vxworks7r2 sets things up for a 64-bit compiler, just like powerpc64-wrs-vxworks7r2, except that TARGET_VXWORKS64 is only defined as 1 for targets that match *64-*-vxworks*. With !TARGET_VXWORKS64, we get a 64-bit toolchain that defines SIZE_TYPE, PTRDIFF_TYPE, and WCHAR_TYPE as 32-bit types, and that breaks GCC passes that expect SIZE_TYPE and PTRDIFF_TYPE to be as wide as pointers. Arrange for TARGET_VXWORKS64 on ppc to match TARGET_64BIT, after using it to select the default word size with driver self specs. for gcc/ChangeLog * config/rs6000/vxworks.h (SUBTARGET_DRIVER_SELF_SPECS): Redefine to select word size matching TARGET_VXWORKS64. (TARGET_VXWORKS64): Redefine in terms of TARGET_64BIT.
2025-07-04  RISC-V: prefetch: fix LRA failing to allocate reg [PR118241]  (Vineet Gupta, 1 file, -1/+1)
prefetch was recently fixed/tightened (with the Q reg constraint) to only support the right address patterns (REG or REG+D with the lower 5 bits clear).  However, in some cases that's too restrictive for LRA and it fails to allocate a reg, resulting in the following ICE...

    | gcc/testsuite/gcc.target/riscv/pr118241-b.cc:31:19: error: unable to generate reloads for:
    |    31 | void m() { a.l(); }
    |       |                 ^
    | (insn 26 25 27 7 (prefetch (mem/f:DI (plus:DI (reg/f:DI 143 [ _5 ])
    |                 (const_int 56 [0x38])) [5 _5->batch[6]+0 S8 A64])
    |         (const_int 0 [0])
    |         (const_int 3 [0x3])) "gcc/testsuite/gcc.target/riscv/pr118241-b.cc":18:29 498 {prefetch}
    |      (expr_list:REG_DEAD (reg/f:DI 142 [ _5->batch[6] ])
    |         (nil)))
    | during RTL pass: reload

Fix that by providing a fallback alternative register constraint to reload the address.

    PR target/118241

gcc/ChangeLog:

    * config/riscv/riscv.md (prefetch): Add alternative "r".

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/pr118241-b.cc: New test.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
2025-07-04  RISC-V: prefetch: const offset needs to have 5 bits zero, not 4  (Vineet Gupta, 1 file, -2/+2)
Spotted this by chance as I saw a similar fixup in a comment.  From the comments, I think this is needed, but I've not hit any issues due to this.

gcc/ChangeLog:

    * config/riscv/predicates.md (prefetch_operand): Mask 5 bits.

Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
2025-07-04  sh: Recognize >> 31 in treg_set_expr_not_const01  (Raphael Moreira Zinsly, 3 files, -3/+19)
A right shift of 31 will become 0 or 1; this can be checked for in treg_set_expr_not_const01 to avoid matching addc_t_r, as that can expand to a 3 insn sequence instead.  This improves tests 023 to 026 from gcc.target/sh/pr54236-2.c, e.g.:

    test_023:
        shll    r5
        mov     #0,r1
        mov     r4,r0
        rts
        addc    r1,r0

With this change:

    test_023:
        shll    r5
        movt    r0
        rts
        add     r4,r0

We noticed this while evaluating a patch to improve how we handle selecting between two constants based on the output of a LT/GE 0 test.

gcc/ChangeLog:

    * config/sh/predicates.md (treg_set_expr_not_const01): Call
    sh_recog_treg_set_expr_not_01.
    * config/sh/sh-protos.h (sh_recog_treg_set_expr_not_01): New function.
    * config/sh/sh.cc (sh_recog_treg_set_expr_not_01): Likewise.

gcc/testsuite/ChangeLog:

    * gcc.target/sh/pr54236-2.c: Fix comments and expected output.
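The asm above corresponds to a source pattern of roughly this shape (reconstructed as a guess from the test name and the register usage, not copied from pr54236-2.c):

    /* On SH, r4 and r5 hold the first two arguments; the result adds the
       sign bit of the second argument to the first.  */
    int
    test_023 (int x, int y)
    {
      return x + ((unsigned int) y >> 31);
    }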
2025-07-04  RISC-V: Combine vec_duplicate + vsadd.vv to vsadd.vx on GR2VR cost  (Pan Li, 3 files, -2/+5)
This patch would like to combine the vec_duplicate + vsadd.vv to the vsadd.vx.  From example as below code.  The related pattern will depend on the cost of vec_duplicate from GR2VR.  Then the late-combine will take action if the cost of GR2VR is zero, and reject the combination if the GR2VR cost is greater than zero.

Assume we have example code like below, GR2VR cost is 0.

    #define DEF_SAT_S_ADD(T, UT, MIN, MAX) \
    T                                      \
    test_##T##_sat_add (T x, T y)          \
    {                                      \
      T sum = (UT)x + (UT)y;               \
      return (x ^ y) < 0                   \
        ? sum                              \
        : (sum ^ x) >= 0                   \
          ? sum                            \
          : x < 0 ? MIN : MAX;             \
    }

    DEF_SAT_S_ADD(int32_t, uint32_t, INT32_MIN, INT32_MAX)
    DEF_VX_BINARY_CASE_2_WRAP(T, SAT_S_ADD_FUNC(T), sat_add)

Before this patch:

    test_vx_binary_or_int32_t_case_0:
        beq       a3,zero,.L8
        vsetvli   a5,zero,e32,m1,ta,ma
        vmv.v.x   v2,a2
        slli      a3,a3,32
        srli      a3,a3,32
    .L3:
        vsetvli   a5,a3,e32,m1,ta,ma
        vle32.v   v1,0(a1)
        slli      a4,a5,2
        sub       a3,a3,a5
        add       a1,a1,a4
        vsadd.vv  v1,v1,v2
        vse32.v   v1,0(a0)
        add       a0,a0,a4
        bne       a3,zero,.L3

After this patch:

    test_vx_binary_or_int32_t_case_0:
        beq       a3,zero,.L8
        slli      a3,a3,32
        srli      a3,a3,32
    .L3:
        vsetvli   a5,a3,e32,m1,ta,ma
        vle32.v   v1,0(a1)
        slli      a4,a5,2
        sub       a3,a3,a5
        add       a1,a1,a4
        vsadd.vx  v1,v1,a2
        vse32.v   v1,0(a0)
        add       a0,a0,a4
        bne       a3,zero,.L3

gcc/ChangeLog:

    * config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add new
    case SS_PLUS.
    (expand_vx_binary_vec_vec_dup): Ditto.
    * config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
    * config/riscv/vector-iterators.md: Add new op ss_plus.

Signed-off-by: Pan Li <pan2.li@intel.com>
2025-07-04LoongArch: Prevent subreg of subreg in CRCXi Ruoyao1-1/+2
The register_operand predicate can match subreg, then we'd have a subreg of subreg and it's invalid. Use lowpart_subreg to avoid the nested subreg. gcc/ChangeLog: * config/loongarch/loongarch.md (crc_combine): Avoid nested subreg. gcc/testsuite/ChangeLog: * gcc.c-torture/compile/pr120708.c: New test.
2025-07-03[RISC-V] Add basic instrumentation to fusion detectionShreya Munnangi1-16/+64
We were looking to evaluate some changes from Artemiy that improve GCC's ability to discover fusible instruction pairs. There was no good way to get any static data out of the compiler about what kinds of fusions were happening. Yea, you could grub around the .sched dumps looking for the magic '+' annotation, then look around at the slim RTL representation and make an educated guess about what fused. But boy that was inconvenient. All we really needed was a quick note in the dump file that the target hook found a fusion pair and what kind was discovered. That made it easy to spot invalid fusions, evaluate the effectiveness of Artemiy's work, write/discover testcases for existing fusions and implement new fusions. So from a codegen standpoint this is NFC, it only affects dump file output. It's gone through the usual testing and I'll wait for pre-commit CI to churn through it before moving forward. gcc/ * config/riscv/riscv.cc (riscv_macro_fusion_pair_p): Add basic instrumentation to all cases where fusion is detected. Fix minor formatting goofs found in the process.
2025-07-03s390: More vec-perm-const cases.Juergen Christ1-2/+167
s390 missed constant vector permutation cases based on the vector pack instruction or changing the size of the vector elements during vector merge. This enables some more patterns that do not need to load a constant vector for permutation. gcc/ChangeLog: * config/s390/s390.cc (expand_perm_with_merge): Add size change cases. (expand_perm_with_pack): New function. (vectorize_vec_perm_const_1): Wire up new function. gcc/testsuite/ChangeLog: * gcc.target/s390/vector/vec-perm-merge-1.c: New test. * gcc.target/s390/vector/vec-perm-pack-1.c: New test. Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>
2025-07-03  [RISC-V][PR target/118886] Refine when two insns are signaled as fusion candidates  (Jeff Law, 1 file, -57/+80)
A number of folks have had their fingers in this code and it's going to take a few submissions to do everything we want to do.

This patch is primarily concerned with avoiding signaling that fusion can occur in cases where it obviously should not be signaling fusion.

Every DEC based fusion I'm aware of requires the first instruction to set a destination register that is both used and set again by the second instruction.  If the two instructions set different registers, then the destination of the first instruction was not dead and would need to have a result produced.  This is complicated by the fact that we have pseudo registers prior to reload.  So the approach we take is to signal fusion prior to reload even if the destination registers don't match.  Post reload we require them to match.  That allows us to clean up the code ever-so-slightly.

Second, we sometimes signaled fusion into loads that weren't scalar integer loads.  I'm not aware of a design that's fusing into FP loads or vector loads.  So those get rejected explicitly.

Third, the store pair "fusion" code is cleaned up a little.  We use fusion to model store pair commits since the basic properties for detection are the same.  The point where they "fuse" is different.  Also this code liked to "return false" at each step along the way if fusion wasn't possible.  Future work for additional fusion cases makes that behavior undesirable.  So the logic gets reworked a little bit to be more friendly to future work.

Fourth, if we already fused the previous instruction, then we can't fuse it again.  Signaling fusion in that case is, umm, bad as it creates an atomic blob of code from a scheduling standpoint.

Hopefully I got everything correct with extracting this work out of a larger set of changes 🙂 We will contribute some instrumentation & testing code so if I botched things in a major way we'll soon have a way to test that and I'll be on the hook to fix any goofs.

From a correctness standpoint this should be a big fat nop.  We've seen this make measurable differences in pico benchmarks, but obviously as you scale up to bigger stuff the gains largely disappear into the noise.

This has been through Ventana's internal CI and my tester.  I'll obviously wait for a verdict from the pre-commit tester.

    PR target/118886

gcc/
    * config/riscv/riscv.cc (riscv_macro_fusion_pair_p): Check for
    fusion being disabled earlier.  If PREV is already fused, then it
    can't be fused again.  Be more selective about fusing when the
    destination registers do not match.  Don't fuse into loads that
    aren't scalar integer modes.  Revamp store pair commit support.

Co-authored-by: Daniel Barboza <dbarboza@ventanamicro.com>
Co-authored-by: Shreya Munnangi <smunnangi1@ventanamicro.com>
2025-07-03AArch64: make rules for CBZ/TBZ higher priorityKarl Meakin1-74/+87
Move the rules for CBZ/TBZ to be above the rules for CBB<cond>/CBH<cond>/CB<cond>. We want them to have higher priority because they can express larger displacements. gcc/ChangeLog: * config/aarch64/aarch64.md (aarch64_cbz<optab><mode>1): Move above rules for CBB<cond>/CBH<cond>/CB<cond>. (*aarch64_tbz<optab><mode>1): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/cmpbr.c: Update tests.
2025-07-03AArch64: rules for CMPBR instructionsKarl Meakin5-4/+174
Add rules for lowering `cbranch<mode>4` to CBB<cond>/CBH<cond>/CB<cond> when CMPBR extension is enabled. gcc/ChangeLog: * config/aarch64/aarch64-protos.h (aarch64_cb_rhs): New function. * config/aarch64/aarch64.cc (aarch64_cb_rhs): Likewise. * config/aarch64/aarch64.md (cbranch<mode>4): Rename to ... (cbranch<GPI:mode>4): ...here, and emit CMPBR if possible. (cbranch<SHORT:mode>4): New expand rule. (aarch64_cb<INT_CMP:code><GPI:mode>): New insn rule. (aarch64_cb<INT_CMP:code><SHORT:mode>): Likewise. * config/aarch64/constraints.md (Uc0): New constraint. (Uc1): Likewise. (Uc2): Likewise. * config/aarch64/iterators.md (cmpbr_suffix): New mode attr. (INT_CMP): New code iterator. (cmpbr_imm_constraint): New code attr. gcc/testsuite/ChangeLog: * gcc.target/aarch64/cmpbr.c:
2025-07-03AArch64: recognize `+cmpbr` optionKarl Meakin2-0/+5
Add the `+cmpbr` option to enable the FEAT_CMPBR architectural extension. gcc/ChangeLog: * config/aarch64/aarch64-option-extensions.def (cmpbr): New option. * config/aarch64/aarch64.h (TARGET_CMPBR): New macro. * doc/invoke.texi (cmpbr): New option.
2025-07-03AArch64: make `far_branch` attribute a booleanKarl Meakin1-12/+10
The `far_branch` attribute only ever takes the values 0 or 1, so make it a `no/yes` valued string attribute instead. gcc/ChangeLog: * config/aarch64/aarch64.md (far_branch): Replace 0/1 with no/yes. (aarch64_bcond): Handle rename. (aarch64_cbz<optab><mode>1): Likewise. (*aarch64_tbz<optab><mode>1): Likewise. (@aarch64_tbz<optab><ALLI:mode><GPI:mode>): Likewise.
2025-07-03AArch64: add constants for branch displacementsKarl Meakin1-16/+44
Extract the hardcoded values for the minimum PC-relative displacements into named constants and document them. gcc/ChangeLog: * config/aarch64/aarch64.md (BRANCH_LEN_P_1MiB): New constant. (BRANCH_LEN_N_1MiB): Likewise. (BRANCH_LEN_P_32KiB): Likewise. (BRANCH_LEN_N_32KiB): Likewise.
2025-07-03AArch64: rename branch instruction rulesKarl Meakin4-15/+18
Give the `define_insn` rules used in lowering `cbranch<mode>4` to RTL more descriptive and consistent names: from now on, each rule is named after the AArch64 instruction that it generates. Also add comments to document each rule. gcc/ChangeLog: * config/aarch64/aarch64.md (condjump): Rename to ... (aarch64_bcond): ...here. (*compare_condjump<GPI:mode>): Rename to ... (*aarch64_bcond_wide_imm<GPI:mode>): ...here. (aarch64_cb<optab><mode>): Rename to ... (aarch64_cbz<optab><mode>1): ...here. (*cb<optab><mode>1): Rename to ... (*aarch64_tbz<optab><mode>1): ...here. (@aarch64_tb<optab><ALLI:mode><GPI:mode>): Rename to ... (@aarch64_tbz<optab><ALLI:mode><GPI:mode>): ...here. (restore_stack_nonlocal): Handle rename. (stack_protect_combined_test): Likewise. * config/aarch64/aarch64-simd.md (cbranch<mode>4): Likewise. * config/aarch64/aarch64-sme.md (aarch64_restore_za): Likewise. * config/aarch64/aarch64.cc (aarch64_gen_test_and_branch): Likewise.
2025-07-03AArch64: reformat branch instruction rulesKarl Meakin1-42/+42
Make the formatting of the RTL templates in the rules for branch instructions more consistent with each other. gcc/ChangeLog: * config/aarch64/aarch64.md (cbranch<mode>4): Reformat. (cbranchcc4): Likewise. (condjump): Likewise. (*compare_condjump<GPI:mode>): Likewise. (aarch64_cb<optab><mode>1): Likewise. (*cb<optab><mode>1): Likewise. (tbranch_<code><mode>3): Likewise. (@aarch64_tb<optab><ALLI:mode><GPI:mode>): Likewise.
2025-07-03AArch64: place branch instruction rules togetherKarl Meakin1-186/+201
The rules for conditional branches were spread throughout `aarch64.md`. Group them together so it is easier to understand how `cbranch<mode>4` is lowered to RTL. gcc/ChangeLog: * config/aarch64/aarch64.md (condjump): Move. (*compare_condjump<GPI:mode>): Likewise. (aarch64_cb<optab><mode>1): Likewise. (*cb<optab><mode>1): Likewise. (tbranch_<code><mode>3): Likewise. (@aarch64_tb<optab><ALLI:mode><GPI:mode>): Likewise.
2025-07-03  x86: Emit label only for __mcount_loc section  (H.J. Lu, 1 file, -21/+31)
commit ecc81e33123d7ac9c11742161e128858d844b99d
Author: Andi Kleen <ak@linux.intel.com>
Date:   Fri Sep 26 04:06:40 2014 +0000

    Add direct support for Linux kernel __fentry__ patching

emitted a label, 1, for __mcount_loc section:

    1:  call    mcount
    .section __mcount_loc, "a",@progbits
    .quad 1b
    .previous

If __mcount_loc wasn't used, we got an unused label.  Update x86_function_profiler to emit label only when __mcount_loc section is used.

gcc/
    PR target/120936
    * config/i386/i386.cc (x86_print_call_or_nop): Add a label argument
    and use it to print label.
    (x86_function_profiler): Emit label only when __mcount_loc section
    is used.

gcc/testsuite/
    PR target/120936
    * gcc.target/i386/pr120936-1.c: New test
    * gcc.target/i386/pr120936-2.c: Likewise.
    * gcc.target/i386/pr120936-3.c: Likewise.
    * gcc.target/i386/pr120936-4.c: Likewise.
    * gcc.target/i386/pr120936-5.c: Likewise.
    * gcc.target/i386/pr120936-6.c: Likewise.
    * gcc.target/i386/pr120936-7.c: Likewise.
    * gcc.target/i386/pr120936-8.c: Likewise.
    * gcc.target/i386/pr120936-9.c: Likewise.
    * gcc.target/i386/pr120936-10.c: Likewise.
    * gcc.target/i386/pr120936-11.c: Likewise.
    * gcc.target/i386/pr120936-12.c: Likewise.
    * gcc.target/i386/pr93492-3.c: Updated.
    * gcc.target/i386/pr93492-5.c: Likewise.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
2025-07-03aarch64: Drop const_int from aarch64_maskload_else_operandAlex Coplan2-3/+3
The "else operand" to maskload should always be a const_vector, never a const_int. This was just an issue I noticed while looking through the code, I don't have a testcase which shows a concrete problem due to this. Testing of that change alone showed ICEs with load lanes vectorization and SVE. That turned out to be because the backend pattern was missing a mode for the else operand (causing the middle-end to choose a const_int during expansion), fixed thusly. That in turn exposed an issue with the unpredicated load lanes expander which was using the wrong mode for the else operand, so fixed that too. gcc/ChangeLog: * config/aarch64/aarch64-sve.md (vec_load_lanes<mode><vsingle>): Expand else operand in subvector mode, as per optab documentation. (vec_mask_load_lanes<mode><vsingle>): Add missing mode for operand 3. * config/aarch64/predicates.md (aarch64_maskload_else_operand): Remove const_int.
2025-07-03  x86-64: Add RDI clobber to 64-bit dynamic TLS patterns  (H.J. Lu, 2 files, -8/+13)
*tls_global_dynamic_64_largepic, *tls_local_dynamic_64_<mode> and *tls_local_dynamic_base_64_largepic use RDI as the __tls_get_addr argument.  Add RDI clobber to these patterns to show it.

gcc/
    PR target/120908
    * config/i386/i386.cc (legitimize_tls_address): Pass RDI to
    gen_tls_local_dynamic_64.
    * config/i386/i386.md (*tls_global_dynamic_64_largepic): Add RDI
    clobber and use it to generate LEA.
    (*tls_local_dynamic_64_<mode>): Likewise.
    (*tls_local_dynamic_base_64_largepic): Likewise.
    (@tls_local_dynamic_64_<mode>): Add a clobber.

gcc/testsuite/
    PR target/120908
    * gcc.target/i386/pr120908.c: New test.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
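For context, a minimal sketch (assumed, not the actual pr120908.c testcase) of the kind of TLS access whose expansion ends up calling __tls_get_addr with its argument in RDI:

    /* Build with -fPIC; the dynamic TLS models route this access through
       __tls_get_addr, whose single argument is passed in %rdi.  */
    static __thread int counter;

    int
    bump (void)
    {
      return ++counter;
    }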
2025-07-03x86-64: Add RDI clobber to tls_global_dynamic_64 patternsH.J. Lu2-4/+7
*tls_global_dynamic_64_<mode> uses RDI as the __tls_get_addr argument. Add RDI clobber to tls_global_dynamic_64 patterns to show it. PR target/120908 * config/i386/i386.cc (legitimize_tls_address): Pass RDI to gen_tls_global_dynamic_64. * config/i386/i386.md (*tls_global_dynamic_64_<mode>): Add RDI clobber and use it to generate LEA. (@tls_global_dynamic_64_<mode>): Add a clobber. Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
2025-07-02  [PATCH] [RISC-V] Fix shift type for RVV interleaved stepped patterns [PR120356]  (Alexey Merzlyakov, 1 file, -1/+1)
It corrects the shift type of interleaved stepped patterns for const vector expanding in LRA.  The shift instruction was initially LSHIFTRT, and it seems it should still be the same type for both LRA and other cases.

    PR target/120356

gcc/ChangeLog:

    * config/riscv/riscv-v.cc
    (expand_const_vector_interleaved_stepped_npatterns): Fix ASHIFT to
    LSHIFTRT insn.

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/rvv/autovec/pr120356.c: New test.
2025-07-02  i386: Change Diamond Rapids feature detect when model number could not be distinguished  (Haochen Jiang, 1 file, -1/+1)
We will use AMX-FP8 for DMR since it is a smaller and more unique feature.

gcc/ChangeLog:

    * config/i386/driver-i386.cc (host_detect_local_cpu): Change to
    AMX-FP8 for Diamond Rapids.
2025-07-01  AArch64 SIMD: convert mvn+shrn into mvni+subhn  (Remi Machet, 1 file, -0/+30)
Add an optimization to aarch64 SIMD converting mvn+shrn into mvni+subhn when possible, which allows for better optimization when the code is inside a loop by using a constant.

The conversion is based on the fact that for an unsigned integer:
    -x = ~x + 1  =>  ~x = -1 - x
thus '(u8)(~x >> imm)' is equivalent to '(u8)(((u16)-1 - x) >> imm)'.

For the following function:

    uint8x8_t neg_narrow_v8hi(uint16x8_t a) {
      uint16x8_t b = vmvnq_u16(a);
      return vshrn_n_u16(b, 8);
    }

Without this patch the assembly looks like:

    not     v0.16b, v0.16b
    shrn    v0.8b, v0.8h, 8

After the patch it becomes:

    mvni    v31.4s, 0
    subhn   v0.8b, v31.8h, v0.8h

Bootstrapped and regtested on aarch64-linux-gnu.

Signed-off-by: Remi Machet <rmachet@nvidia.com>

gcc/ChangeLog:

    * config/aarch64/aarch64-simd.md (*shrn_to_subhn_<mode>): Add
    pattern converting mvn+shrn into mvni+subhn.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/simd/shrn2subhn.c: New test.
2025-07-01aarch64: Sync `aarch64-sys-regs.def' with Binutils.Ezra Sitorus1-8/+8
This patch updates `aarch64-sys-regs.def', bringing it into sync with the Binutils source after this change: https://sourceware.org/pipermail/binutils/2025-March/139894.html gcc/ChangeLog: * config/aarch64/aarch64-sys-regs.def: Copy from Binutils.
2025-06-30  [RISC-V] Correct CFA notes for stack-clash protection [PR120714]  (Alexey Merzlyakov, 1 file, -2/+11)
Fixes incorrect SP addresses used in CFA notes for the stack probes that are not expressed relative to the frame's top.  It applies to RISC-V target code generation when stack-clash protection is enabled.

    PR target/120714

gcc/ChangeLog:

    * config/riscv/riscv.cc (riscv_allocate_and_probe_stack_space):
    Fix SP-addresses in REG_CFA_DEF_CFA notes for stack-clash case.

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/pr120714.c: New test.
2025-06-30  RISC-V: Combine vec_duplicate + vssubu.vv to vssubu.vx on GR2VR cost  (Pan Li, 3 files, -1/+3)
This patch would like to combine the vec_duplicate + vssubu.vv to the vssubu.vx.  From example as below code.  The related pattern will depend on the cost of vec_duplicate from GR2VR.  Then the late-combine will take action if the cost of GR2VR is zero, and reject the combination if the GR2VR cost is greater than zero.

Assume we have example code like below, GR2VR cost is 0.

    #define DEF_VX_BINARY(T, FUNC)                                        \
    void                                                                  \
    test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n)   \
    {                                                                     \
      for (unsigned i = 0; i < n; i++)                                    \
        out[i] = FUNC (in[i], x);                                         \
    }

    T sat_sub(T a, T b)
    {
      return (a - b) & (-(T)(a >= b));
    }

    DEF_VX_BINARY(uint32_t, sat_sub)

Before this patch:

    test_vx_binary_or_int32_t_case_0:
        beq        a3,zero,.L8
        vsetvli    a5,zero,e32,m1,ta,ma
        vmv.v.x    v2,a2
        slli       a3,a3,32
        srli       a3,a3,32
    .L3:
        vsetvli    a5,a3,e32,m1,ta,ma
        vle32.v    v1,0(a1)
        slli       a4,a5,2
        sub        a3,a3,a5
        add        a1,a1,a4
        vssubu.vv  v1,v1,v2
        vse32.v    v1,0(a0)
        add        a0,a0,a4
        bne        a3,zero,.L3

After this patch:

    test_vx_binary_or_int32_t_case_0:
        beq        a3,zero,.L8
        slli       a3,a3,32
        srli       a3,a3,32
    .L3:
        vsetvli    a5,a3,e32,m1,ta,ma
        vle32.v    v1,0(a1)
        slli       a4,a5,2
        sub        a3,a3,a5
        add        a1,a1,a4
        vssubu.vx  v1,v1,a2
        vse32.v    v1,0(a0)
        add        a0,a0,a4
        bne        a3,zero,.L3

gcc/ChangeLog:

    * config/riscv/riscv-v.cc (expand_vx_binary_vec_vec_dup): Add new
    case US_MINUS.
    * config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
    * config/riscv/vector-iterators.md: Add new op us_minus.

Signed-off-by: Pan Li <pan2.li@intel.com>
2025-06-30rs6000: Disassemble opaque modes using subregs to allow optimizationsPeter Bergner1-47/+4
PR109116 reveals missed optimizations when using unspecs to extract vector components from opaque-mode variables. Since RTL optimizers do not understand unspecs, this leads to redundant register copies. Replace unspecs with subregs, which are well understood by RTL passes, allowing optimizations to take place. 2025-06-30 Peter Bergner <bergner@linux.ibm.com> gcc/ PR target/109116 * config/rs6000/mma.md (unspec): Delete UNSPEC_MMA_EXTRACT. (vsx_disassemble_pair): Expand into a vector register sized subreg. (mma_disassemble_acc): Likewise. (*vsx_disassemble_pair): Delete. (*mma_disassemble_acc): Likewise.
2025-06-30  RISC-V: Primary vector pipeline model for sifive 7 series  (Kito Cheng, 1 file, -1/+136)
This commit introduces a primary vector pipeline model for the SiFive 7 series.  That pipeline model is a simplified version: it only defines the vector command queue, the arithmetic unit, and the vector load/store unit.

The latency of the real hardware is LMUL-aware, but I realize that would complicate the model a lot, so I just use a simplified version in which all LMULs use the same latency.  We may improve it later once we have found meaningful performance differences.

gcc/ChangeLog:

    * config/riscv/sifive-7.md: Add primary vector pipeline model for
    SiFive 7 series.