aboutsummaryrefslogtreecommitdiff
path: root/gcc/config
AgeCommit message (Collapse)AuthorFilesLines
2025-05-06Implement Windows TLSJulian Waters7-1/+87
This patch implements native Thread Local Storage access on Windows, as motivated by PR80881. Currently, Thread Local Storage accesses on Windows relies on emulation, which is detrimental to performance in certain applications, notably the Python Interpreter and the gcc port of the Java Virtual Machine. This patch was heavily inspired by Daniel Green's original work on native Windows Thread Local Storage from over a decade ago, which can be found at https://github.com/venix1/MinGW-GDC/blob/master/patches/mingw-tls-gcc-4.8.patch as a reference. Co-authored-by: Eric Botcazou <botcazou@adacore.com> Co-authored-by: Uroš Bizjak <ubizjak@gmail.com> Co-authored-by: Liu Hao <lh_mouse@126.com> Signed-off-by: Julian Waters <tanksherman27@gmail.com> Signed-off-by: Jonathan Yong <10walls@gmail.com> gcc/ChangeLog: * config/i386/i386.cc (ix86_legitimate_constant_p): Handle new UNSPEC. (legitimate_pic_operand_p): Handle new UNSPEC. (legitimate_pic_address_disp_p): Handle new UNSPEC. (ix86_legitimate_address_p): Handle new UNSPEC. (ix86_tls_index_symbol): New symbol for _tls_index. (ix86_tls_index): Handle creation of _tls_index symbol. (legitimize_tls_address): Create thread local access sequence. (output_pic_addr_const): Handle new UNSPEC. (i386_output_dwarf_dtprel): Handle new UNSPEC. (i386_asm_output_addr_const_extra): Handle new UNSPEC. * config/i386/i386.h (TARGET_WIN32_TLS): Define. * config/i386/i386.md: New UNSPEC. * config/i386/predicates.md: Handle new UNSPEC. * config/mingw/mingw32.h (TARGET_WIN32_TLS): Define. (TARGET_ASM_SELECT_SECTION): Define. (DEFAULT_TLS_SEG_REG): Define. * config/mingw/winnt.cc (mingw_pe_select_section): Select proper TLS section. (mingw_pe_unique_section): Handle TLS section. * config/mingw/winnt.h (mingw_pe_select_section): Declare. * configure: Regenerate. * configure.ac: New check for broken linker thread local support
2025-05-05[RISC-V][PR target/119971] Avoid losing shift count maskingJeff Law2-44/+34
As is outlined in the PR, we have a few define_insn_and_split patterns which optimize away explicit masking of shift/bit positions when the masking matches what the hardware's behavior. A small number of those define_insn_and_split patterns generate a single instruction. It's fairly elegant in that we were essentially just rewriting the RTL to match an existing pattern. In one case we'd do the rewriting and later turn a 32bit shift into a bset. That's not safe because the masking of a 32bit shift uses 0x1f while masking on bset uses 0x3f on rv64. The net was incorrect code as seen in the BZ entry. The fix is pretty simple. There's no real reason we need to use a define_insn_and_split. It was just convenient. Instead we can use a simple define_insn. That avoids a change in the masking behavior for the shift count/bit position and the masking stays in the RTL. I quickly scanned the entire port and didn't see any additional define_insn_and_splits that obviously generated a single instruction outside the shift/rotate space, though in the vector space that's nontrivial to ascertain. This was been run through my tester for the cross configurations, but not the native bootstrap/regression test (yet). PR target/119971 gcc/ * config/riscv/bitmanip.md (rotation with masked count): Rewrite as define_insn patterns. Fix formatting. * config/riscv/riscv.md (shift with masked count): Similarly. gcc/testsuite * gcc.target/riscv/pr119971.c: New test. * gcc.target/riscv/zbb-rol-ror-03.c: Adjust test slightly.
2025-05-05i386: Do not use explicit operands for MOVS instructions [PR120019]Uros Bizjak2-9/+33
Some assemblers do not support MOVS instructions with explicit operands. Emit instruction with implicit operands, but prefix the instruction with a segment override prefix if the memory operand refers to ADDR_SPACE_SEG_FS or ADDR_SPACE_SEG_GS named address space. PR target/120019 gcc/ChangeLog: * config/i386/i386.cc (ix86_print_operand): Handle 'v' operand modifier to emit segment override prefix. * config/i386/i386.md (*strmovdi_rex_1): Use %v operand modifier to emit segment override prefix. (*strmovsi_1): Ditto. (*strmovhi_1): Ditto. (*strmovqi_1): Ditto. (*rep_movdi_rex64): Ditto. (*rep_movsi): Ditto. (*rep_movqi): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/pr111657-1.c (dg-do): Change to "assemble". (dg-options): Remove -masm=att and add -save-temps. (dg-final): Update scan-assembler and scan-assembler-not strings. Co-authored-by: Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE>
2025-05-05i386: Fix comment typo on truncsfbf2 patternJakub Jelinek1-1/+1
I've noticed a typo on the flag name, fixed thusly. 2025-05-05 Jakub Jelinek <jakub@redhat.com> * config/i386/i386.md (truncsfbf2): Fix comment typo, unsafte -> unsafe.
2025-05-05RISC-V: Apply clang-format to genrvv-type-indexer.cc [NFC]Kito Cheng1-8/+15
Tweak the formatting of the genrvv-type-indexer.cc file to conform to the style used by clang-format. This is a no-functional-change commit that only modifies the formatting of the code. gcc/Changelog: * config/riscv/genrvv-type-indexer.cc: Apply clang-format to the file.
2025-05-04[V2][RISC-V] Trivial permutation constant derivationJeff Law4-0/+311
This is a patch from late 2024 (just before stage1 freeze), but I never pushed hard to the change, and thus never integrated it. It's mostly unchanged except for updating insn in the hash table after finding an optimizable case. We were holding the deleted insn in the hash table rather than the new insn. Just something I noticed recently. Bootstrapped and regression tested on my BPI and regression tested riscv32-elf and riscv64-elf configurations. We've used this since November internally, so it's well exercised on spec as well. gcc/ * config.gcc (riscv): Add riscv-vect-permcost.o to extra_objs. * config/riscv/riscv-passes.def (pass_vector_permcost): Add new pass. * config/riscv/riscv-protos.h (make_pass_vector_permconst): Declare. * config/riscv/riscv-vect-permconst.cc: New file. * config/riscv/t-riscv: Add build rule for riscv-vect-permcost.o
2025-05-04[PATCH] RISC-V: Implment H modifier for printing the next register nameJin Ma1-0/+22
For RV32 inline assembly, when handling 64-bit integer data, it is often necessary to process the lower and upper 32 bits separately. Unfortunately, we can only output the current register name (lower 32 bits) but not the next register name (upper 32 bits). To address this, the modifier 'H' has been added to allow users to handle the upper 32 bits of the data. While I believe the modifier 'N' (representing the next register name) might be more suitable for this functionality, 'N' is already in use. Therefore, 'H' (representing the high register) was chosen instead. Co-Authored-By: Dimitar Dimitrov <dimitar@dinux.eu> gcc/ChangeLog: * config/riscv/riscv.cc (riscv_print_operand): Add H. * doc/extend.texi: Document for H. gcc/testsuite/ChangeLog: * gcc.target/riscv/modifier-H-error-1.c: New test. * gcc.target/riscv/modifier-H-error-2.c: New test. * gcc.target/riscv/modifier-H.c: New test.
2025-05-04[to-be-committed][RISC-V] Adjust testcases and finish register move costing fixJeff Law1-4/+4
The recent adjustment to more correctly cost register moves tripped a few testsuite regressions. I'm pretty torn on the thead test adjustments. But in reality they only worked because the register move costing was broken. So I've reverted the scan-asm part of those to a prior state for two of those tests. The other was only failing at -Og/-Oz which was added to the exclude list. The other Zfa test is similar, but we can make the test behave with a suitable -mtune option and thus preserve the test. While investigating I also noted that vector moves aren't being handled correctly for subclasses of the integer/fp register files. So I fixed those while I was in there. Note this may have an impact on some of your work Pan. I haven't followed the changes from the last week or so due to illness. Waiting on pre-commit's verdict, though it did spin through my tester successfully, though not all of the regressions related to that change are addressed (there's still one for rv32 I'll look at shortly). gcc/ * config/riscv/riscv.cc (riscv_register_move_cost): Handle subclasses with vector registers as well. gcc/testsuite/ * gcc.target/riscv/xtheadfmemidx-xtheadfmv-medany.c: Adjust expected output. * gcc.target/riscv/xtheadfmemidx-zfa-medany.c: Likewise. * gcc.target/riscv/xtheadfmv-fmv.c: Skip for -Os and -Oz. * gcc.target/riscv/zfa-fmovh-fmovp.c: Use sifive-p400 tuning.
2025-05-04RISC-V: Remove unnecessary frm restore volatile define_insnPan Li3-34/+23
After we add the frm register to the global_regs, we may not need to define_insn that volatile to emit the frm restore insns. The cooperatively-managed global register will help to handle this, instead of emit the volatile define_insn explicitly. gcc/ChangeLog: * config/riscv/riscv.cc (riscv_emit_frm_mode_set): Refactor the frm mode set by removing fsrmsi_restore_volatile. * config/riscv/vector-iterators.md (unspecv): Remove as unnecessary. * config/riscv/vector.md (fsrmsi_restore_volatile): Ditto. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/base/float-point-dynamic-frm-49.c: Adjust the asm dump check times. * gcc.target/riscv/rvv/base/float-point-dynamic-frm-50.c: Ditto. * gcc.target/riscv/rvv/base/float-point-dynamic-frm-52.c: Ditto. * gcc.target/riscv/rvv/base/float-point-dynamic-frm-74.c: Ditto. * gcc.target/riscv/rvv/base/float-point-dynamic-frm-75.c: Ditto. Signed-off-by: Pan Li <pan2.li@intel.com>
2025-05-03Improve ix86 VEC_MERGE costsJan Hubicka1-5/+77
ix86_rtx_costs VEC_MERGE by special casing AVX512 mask operations and otherwise returning cost->sse_op completely ignoring costs of the operands. Since VEC_MERGE is also used to represent scalar variant of SSE/AVX operation, this means that many instructions (such as SSE converisions) are often costed as sse_op instead of their real cost. This patch adds pattern matching for the VEC_MERGE pattern which also forced me to add special cases for masked versions and vcmp otherwise combine is confused by the default cost compred to the cost of recognized version of the instruction. Since now the important cases should be handled, I also added recursion to the remaining cases so substituting constants and memory is adequately costed. gcc/ChangeLog: * config/i386/i386.cc (unspec_pcmp_p): New function. (ix86_rtx_costs): Cost VEC_MERGE more realistically.
2025-05-02Revert "[PATCH 30/61] MSA: Make MSA and microMIPS R5 unsupported"Jeff Law1-3/+0
This reverts commit 727a43e0a66052235706379239359807230054e0.
2025-05-02Make ix86 cost of VEC_SELECT equivalent to SUBREG cost 1Jan Hubicka1-3/+36
This patch fixes regression of imagick with PGO and AVX512 where correcting size cost of SSE operations (to be 4 instead of 2 originally cut&pasted from x87) made late combine to eliminate zero registers introduced by rapd. The problem is that cost-model mistakely accounts VEC_SELECT as real instruction while it is optimized to nothing if src==dest (which is the case of these testcases). This register is used to eliminate false dependency between source and destination of int->fp conversions. While ix86_insn_cost hook already contains logic to incrase cost of the zero-extend the costs was not enough. gcc/ChangeLog: PR target/119900 * config/i386/i386.cc (ix86_can_change_mode_class): Add TODO comment. (ix86_rtx_costs): Make VEC_SELECT equivalent to SUBREG cost 1.
2025-05-02i386: -Wabi false positive with indirect call [PR60336]Jason Merrill1-3/+8
This warning relies on the TRANSLATION_UNIT_WARN_EMPTY_P flag (set in cxx_init_decl_processing) to decide whether we want to warn about the GCC 8 empty class parameter passing fix, but in a call through a function pointer we don't have a translation unit and so complain for any -Wabi flag, even now long after this was likely to be relevant. In that situation, let's check the TU for current_function_decl instead. And if we still can't come up with a TU, default to not warning. PR c++/60336 gcc/ChangeLog: * config/i386/i386.cc (ix86_warn_parameter_passing_abi): If no target, check the current TU. gcc/testsuite/ChangeLog: * g++.dg/abi/pr60336-8a.C: New test.
2025-05-02Remove TARGET_LRA_P override when defining to hook_bool_void_trueRichard Biener2-4/+0
Two targets were converted but retain the default. * config/arc/arc.cc (TARGET_LRA_P): Remove define. * config/gcn/gcn.cc (TARGET_LRA_P): Likewise.
2025-05-02aarch64: Optimize SVE extract last for VLS.Jennifer Schmitz1-3/+4
For the test case int32_t foo (svint32_t x) { svbool_t pg = svpfalse (); return svlastb_s32 (pg, x); } compiled with -O3 -mcpu=grace -msve-vector-bits=128, GCC produced: foo: pfalse p3.b lastb w0, p3, z0.s ret when it could use a Neon lane extract instead: foo: umov w0, v0.s[3] ret Similar optimizations can be made for VLS with other vector widths. We implemented this optimization by guarding the emission of pfalse+lastb in the pattern vec_extract<mode><Vel> by !val.is_constant (). Thus, for last-extract operations with VLS, the patterns *vec_extract<mode><Vel>_v128, *vec_extract<mode><Vel>_dup, or *vec_extract<mode><Vel>_ext are used instead. We added tests for 128-bit VLS and adjusted the tests for the other vector widths. The patch was bootstrapped and tested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com> gcc/ * config/aarch64/aarch64-sve.md (vec_extract<mode><Vel>): Prevent the emission of pfalse+lastb for VLS. gcc/testsuite/ * gcc.target/aarch64/sve/extract_last_128.c: New test. * gcc.target/aarch64/sve/extract_1.c: Adjust expected outcome. * gcc.target/aarch64/sve/extract_2.c: Likewise. * gcc.target/aarch64/sve/extract_3.c: Likewise. * gcc.target/aarch64/sve/extract_4.c: Likewise.
2025-05-01Aarch64: Add __sqrt and __sqrtf intrinsics and corresponding testsAyan Shafqat1-0/+14
This patch introduces two new inline functions, __sqrt and __sqrtf, in arm_acle.h for Aarch64 targets. These functions wrap the new builtins __builtin_aarch64_sqrtdf and __builtin_aarch64_sqrtsf, respectively, providing direct access to hardware instructions without relying on the standard math library or optimization levels. This patch also introduces acle_sqrt.c in the AArch64 testsuite, verifying that the new __sqrt and __sqrtf intrinsics emit the expected fsqrt instructions for double and float arguments. Coverage for new intrinsics ensures that __sqrt and __sqrtf are correctly expanded to hardware instructions and do not fall back to library calls, regardless of optimization levels. gcc/ChangeLog: * config/aarch64/arm_acle.h (__sqrt, __sqrtf): New function. gcc/testsuite/ChangeLog: * gcc.target/aarch64/acle/acle_sqrt.c: New test. Signed-off-by: Ayan Shafqat <ayan.x.shafqat@gmail.com>
2025-05-01Aarch64: Use BUILTIN_VHSDF_HSDF for vector and scalar sqrt builtinsAyan Shafqat1-4/+1
This patch changes the `sqrt` builtin definition from `BUILTIN_VHSDF_DF` to `BUILTIN_VHSDF_HSDF` in `aarch64-simd-builtins.def`, ensuring the builtin covers half, single, and double precision variants. The redundant `VAR1 (UNOP, sqrt, 2, FP, hf)` lines are removed, as they are no longer needed now that `BUILTIN_VHSDF_HSDF` handles those cases. gcc/ChangeLog: * config/aarch64/aarch64-simd-builtins.def: Change BUILTIN_VHSDF_DF to BUILTIN_VHSDF_HSDF. Signed-off-by: Ayan Shafqat <ayan.x.shafqat@gmail.com> Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
2025-04-30AVR: target/119989 - Add missing clobbers to xload_<mode>_libgcc.Georg-Johann Lay1-0/+4
libgcc's __xload_1...4 is clobbering Z (and also R21 is some cases), but avr.md had clobbers of respective GPRs only up to reload. Outcome was that code reading from the same __memx address twice could be wrong. This patch adds respective clobbers. Forward-port from 2025-04-30 r14-11703 PR target/119989 gcc/ * config/avr/avr.md (xload_<mode>_libgcc): Clobber R21, Z. gcc/testsuite/ * gcc.target/avr/torture/pr119989.h: New file. * gcc.target/avr/torture/pr119989-memx-1.c: New test. * gcc.target/avr/torture/pr119989-memx-2.c: New test. * gcc.target/avr/torture/pr119989-memx-3.c: New test. * gcc.target/avr/torture/pr119989-memx-4.c: New test. * gcc.target/avr/torture/pr119989-flashx-1.c: New test. * gcc.target/avr/torture/pr119989-flashx-2.c: New test. * gcc.target/avr/torture/pr119989-flashx-3.c: New test. * gcc.target/avr/torture/pr119989-flashx-4.c: New test. (cherry picked from commit 1ca1c1fc3b58ae5e1d3db4f5a2014132fe69f82a)
2025-04-30RISC-V: Allow different dynamic floating point mode to be merged [PR119832]Kito Cheng1-0/+37
Although we already try to set the mode needed to FRM_DYN after a function call, there are still some corner cases where both FRM_DYN and FRM_DYN_CALL may appear on incoming edges. Therefore, we use TARGET_MODE_CONFLUENCE to tell GCC that FRM_DYN, FRM_DYN_CALL, and FRM_DYN_EXIT modes are compatible. gcc/ChangeLog: PR target/119832 * config/riscv/riscv.cc (riscv_dynamic_frm_mode_p): New. (riscv_mode_confluence): New. (TARGET_MODE_CONFLUENCE): Define to riscv_mode_confluence. gcc/testsuite/ChangeLog: PR target/119832 * g++.target/riscv/pr119832.C: New test.
2025-04-30AArch64: Fold LD1/ST1 with ptrue to LDR/STR for 128-bit VLSJennifer Schmitz1-6/+23
If -msve-vector-bits=128, SVE loads and stores (LD1 and ST1) with a ptrue predicate can be replaced by neon instructions (LDR and STR), thus avoiding the predicate altogether. This also enables formation of LDP/STP pairs. For example, the test cases svfloat64_t ptrue_load (float64_t *x) { svbool_t pg = svptrue_b64 (); return svld1_f64 (pg, x); } void ptrue_store (float64_t *x, svfloat64_t data) { svbool_t pg = svptrue_b64 (); return svst1_f64 (pg, x, data); } were previously compiled to (with -O2 -march=armv8.2-a+sve -msve-vector-bits=128): ptrue_load: ptrue p3.b, vl16 ld1d z0.d, p3/z, [x0] ret ptrue_store: ptrue p3.b, vl16 st1d z0.d, p3, [x0] ret Now the are compiled to: ptrue_load: ldr q0, [x0] ret ptrue_store: str q0, [x0] ret The implementation includes the if-statement if (known_eq (GET_MODE_SIZE (mode), 16) && aarch64_classify_vector_mode (mode) == VEC_SVE_DATA) which checks for 128-bit VLS and excludes partial modes with a mode size < 128 (e.g. VNx2QI). The patch was bootstrapped and tested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com> gcc/ * config/aarch64/aarch64.cc (aarch64_emit_sve_pred_move): Fold LD1/ST1 with ptrue to LDR/STR for 128-bit VLS. gcc/testsuite/ * gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c: New test. * gcc.target/aarch64/sve/cond_arith_6.c: Adjust expected outcome. * gcc.target/aarch64/sve/pcs/return_4_128.c: Likewise. * gcc.target/aarch64/sve/pcs/return_5_128.c: Likewise. * gcc.target/aarch64/sve/pcs/struct_3_128.c: Likewise.
2025-04-30RISC-V: Add intrinsics support for SiFive Xsfvcp extensions.yulong17-8/+1562
This version is same as v5, but rebase to trunk, send out to trigger CI. This commit adds intrinsics support for Xsfvcp extension. Diff with V4: Delete the sifive_vector.h file. Co-Authored by: Jiawei Chen <jiawei@iscas.ac.cn> Co-Authored by: Shihua Liao <shihua@iscas.ac.cn> Co-Authored by: Yixuan Chen <chenyixuan@iscas.ac.cn> gcc/ChangeLog: * config/riscv/constraints.md (Ou01): New constraint. (Ou02): Ditto. * config/riscv/generic-vector-ooo.md (vec_sf_vcp): New reservation. * config/riscv/genrvv-type-indexer.cc (main): New type. * config/riscv/riscv-c.cc (riscv_pragma_intrinsic): Add xsfvcp strings. * config/riscv/riscv-vector-builtins-shapes.cc (struct sf_vcix_se_def): New function. (struct sf_vcix_def): Ditto. (SHAPE): Ditto. * config/riscv/riscv-vector-builtins-shapes.h: Ditto. * config/riscv/riscv-vector-builtins-types.def (DEF_RVV_X2_U_OPS): New type. (DEF_RVV_X2_WU_OPS): Ditto. (vuint8mf8_t): Ditto. (vuint8mf4_t): Ditto. (vuint8mf2_t): Ditto. (vuint8m1_t): Ditto. (vuint8m2_t): Ditto. (vuint8m4_t): Ditto. (vuint16mf4_t): Ditto. (vuint16mf2_t): Ditto. (vuint16m1_t): Ditto. (vuint16m2_t): Ditto. (vuint16m4_t): Ditto. (vuint32mf2_t): Ditto. (vuint32m1_t): Ditto. (vuint32m2_t): Ditto. (vuint32m4_t): Ditto. * config/riscv/riscv-vector-builtins.cc (DEF_RVV_X2_U_OPS): New builtins def. (DEF_RVV_X2_WU_OPS): Ditto. (rvv_arg_type_info::get_scalar_float_type): Ditto. (function_instance::modifies_global_state_p): Ditto. * config/riscv/riscv-vector-builtins.def (v_x): New base type. (i): Ditto. (v_i): Ditto. (xv): Ditto. (iv): Ditto. (fv): Ditto. (vvv): Ditto. (xvv): Ditto. (ivv): Ditto. (fvv): Ditto. (vvw): Ditto. (xvw): Ditto. (ivw): Ditto. (fvw): Ditto. (v_vv): Ditto. (v_xv): Ditto. (v_iv): Ditto. (v_fv): Ditto. (v_vvv): Ditto. (v_xvv): Ditto. (v_ivv): Ditto. (v_fvv): Ditto. (v_vvw): Ditto. (v_xvw): Ditto. (v_ivw): Ditto. (v_fvw): Ditto. (x2_vector): Ditto. (scalar_float): Ditto. * config/riscv/riscv-vector-builtins.h (enum required_ext): New extension. (required_ext_to_isa_name): Ditto. (required_extensions_specified): Ditto. (struct rvv_arg_type_info): Ditto. (struct function_group_info): Ditto. * config/riscv/riscv.md: New attr. * config/riscv/sifive-vector-builtins-bases.cc (class sf_vc): New function. (BASE): New base_name. * config/riscv/sifive-vector-builtins-bases.h: New function_base. * config/riscv/sifive-vector-builtins-functions.def (REQUIRED_EXTENSIONS): New intrinsics def. (sf_vc): Ditto. * config/riscv/sifive-vector.md (@sf_vc_x_se<mode>): New RTL mode. (@sf_vc_v_x_se<mode>): Ditto. (@sf_vc_v_x<mode>): Ditto. (@sf_vc_i_se<mode>): Ditto. (@sf_vc_v_i_se<mode>): Ditto. (@sf_vc_v_i<mode>): Ditto. (@sf_vc_vv_se<mode>): Ditto. (@sf_vc_v_vv_se<mode>): Ditto. (@sf_vc_v_vv<mode>): Ditto. (@sf_vc_xv_se<mode>): Ditto. (@sf_vc_v_xv_se<mode>): Ditto. (@sf_vc_v_xv<mode>): Ditto. (@sf_vc_iv_se<mode>): Ditto. (@sf_vc_v_iv_se<mode>): Ditto. (@sf_vc_v_iv<mode>): Ditto. (@sf_vc_fv_se<mode>): Ditto. (@sf_vc_v_fv_se<mode>): Ditto. (@sf_vc_v_fv<mode>): Ditto. (@sf_vc_vvv_se<mode>): Ditto. (@sf_vc_v_vvv_se<mode>): Ditto. (@sf_vc_v_vvv<mode>): Ditto. (@sf_vc_xvv_se<mode>): Ditto. (@sf_vc_v_xvv_se<mode>): Ditto. (@sf_vc_v_xvv<mode>): Ditto. (@sf_vc_ivv_se<mode>): Ditto. (@sf_vc_v_ivv_se<mode>): Ditto. (@sf_vc_v_ivv<mode>): Ditto. (@sf_vc_fvv_se<mode>): Ditto. (@sf_vc_v_fvv_se<mode>): Ditto. (@sf_vc_v_fvv<mode>): Ditto. (@sf_vc_vvw_se<mode>): Ditto. (@sf_vc_v_vvw_se<mode>): Ditto. (@sf_vc_v_vvw<mode>): Ditto. (@sf_vc_xvw_se<mode>): Ditto. (@sf_vc_v_xvw_se<mode>): Ditto. (@sf_vc_v_xvw<mode>): Ditto. (@sf_vc_ivw_se<mode>): Ditto. (@sf_vc_v_ivw_se<mode>): Ditto. (@sf_vc_v_ivw<mode>): Ditto. (@sf_vc_fvw_se<mode>): Ditto. (@sf_vc_v_fvw_se<mode>): Ditto. (@sf_vc_v_fvw<mode>): Ditto. * config/riscv/vector-iterators.md: New iterator. * config/riscv/vector.md: New vtype.
2025-04-29i386: Disable string insn from non-default AS for Pmode != word_mode [PR111657]Uros Bizjak4-24/+42
0x67 prefix is applied before segment register. That is in rep movsq %gs:(%esi), (%edi) the address is %gs + %esi. In case Pmode != word_mode (x32 with a default -maddress-mode=short) instructions should not allow segment override prefixes. Also, remove explicit addr32 prefix from asm templates because address mode can be determined from explicit instruction operands. Also note that Pmode != word_mode only with TARGET_64BIT, so the check in ix86_print_operand is not needed. PR target/111657 gcc/ChangeLog: * config/i386/i386-expand.cc (alg_usable_p): For Pmode != word_mode reject rep_prefix_{1,4,8}_byte algorithms with src_as in the non-default address space. * config/i386/i386-protos.h (ix86_check_movs): New prototype. * config/i386/i386.cc (ix86_check_movs): New function. (ix86_print_operand) [case '^']: Remove excess check for TARGET_64BIT. * config/i386/i386.md (strmov): For Pmode != word_mode expand with gen_strmov_single only when operands[3] (source) is in the default address space. (*strmovdi_rex_1) Use ix86_check_movs. Remove %^ from asm template. (*strmovsi_1): Ditto. (*strmovhi_1): DItto. (*strmovqi_1): Ditto. (*rep_movdi_rex64): Ditto. (*rep_movsi): Ditto. (*rep_movqi): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/pr111657-1.c: Check that segment override is not generated for "rep movsq" for x32 target.
2025-04-29RISC-V: Fix register move cost for SIBCALL_REGS/JALR_REGSZhijin Zeng1-4/+4
SIBCALL_REGS/JALR_REGS are also subset of GR_REGS and need to be taken into acount in riscv_register_move_cost, otherwise it will get a incorrect cost. gcc/ChangeLog: * config/riscv/riscv.cc (riscv_register_move_cost): Use reg_class_subset_p to check the reg class.
2025-04-29i386: Allow string instructions from non-default address space [PR111657]Uros Bizjak2-52/+120
MOVS instructions allow segment override of their source operand, e.g.: rep movsq %gs:(%rsi), (%rdi) where %rsi is the address of the source location (with %gs segment override) and %rdi is the address of the destination location. The testcase improves from (-O2 -mno-sse -mtune=generic): xorl %eax, %eax .L2: movl %eax, %edx addl $8, %eax movq %gs:m(%rdx), %rcx movq %rcx, (%rdi,%rdx) cmpl $240, %eax jb .L2 ret to: movl $m, %esi movl $30, %ecx rep movsq %gs:(%rsi), (%rdi) ret PR target/111657 gcc/ChangeLog: * config/i386/i386-expand.cc (alg_usable_p): Remove have_as bool argument and add dst_as and src_as address space arguments. Reject libcall algorithm with dst_as and src_as in the non-default address spaces. Reject rep_prefix_{1,4,8}_byte algorithms with dst_as in the non-default address space. (decide_alg): Remove have_as bool argument and add dst_as and src_as address space arguments. Update calls to alg_usable_p. (ix86_expand_set_or_cpymem): Update call to decide_alg. * config/i386/i386.md (strmov): Do not fail if operand[3] (source) is in the non-default address space. Expand with gen_strmov_singleop only when operand[1] (destination) is in the default address space. (*strmovdi_rex_1): Determine memory operands from insn pattern. Allow only when destination is in the default address space. Rewrite asm template to use explicit operands. (*strmovsi_1): Ditto. (*strmovhi_1): DItto. (*strmovqi_1): Ditto. (*rep_movdi_rex64): Ditto. (*rep_movsi): Ditto. (*rep_movqi): Ditto. (*strsetdi_rex_1): Determine memory operands from insn pattern. Allow only when destination is in the default address space. (*strsetsi_1): Ditto. (*strsethi_1): Ditto. (*strsetqi_1): Ditto. (*rep_stosdi_rex64): Ditto. (*rep_stossi): Ditto. (*rep_stosqi): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/pr111657-1.c: New test.
2025-04-29i386: Skip sub-RTXes of memory operand in ix86_update_stack_alignmentUros Bizjak1-5/+10
Skip sub-RTXes of the memory operand if stack access register is not mentioned in the operand. gcc/ChangeLog: * config/i386/i386.cc (ix86_update_stack_alignment): Skip sub-RTXes of the memory operand if stack access register is not mentioned in the operand.
2025-04-29x86: Add a pass to remove redundant all 0s/1s vector loadH.J. Lu4-41/+290
For all different modes of all 0s/1s vectors, we can use the single widest all 0s/1s vector register for all 0s/1s vector uses in the whole function. Add a pass to generate a single widest all 0s/1s vector set instruction at entry of the nearest common dominator for basic blocks with all 0s/1s vector uses. On Linux/x86-64, in cc1plus, this patch reduces the number of vector xor instructions from 4803 to 4714 and pcmpeq instructions from 144 to 142. NB: PR target/92080 and PR target/117839 aren't same. PR target/117839 is for vectors of all 0s and all 1s with different sizes and different components. PR target/92080 is for broadcast of the same component to different vector sizes. This patch covers only all 0s and all 1s cases of PR target/92080. gcc/ PR target/92080 PR target/117839 * config/i386/i386-features.cc (ix86_place_single_vector_set): New function. (remove_partial_avx_dependency): Use it. (ix86_get_vector_load_mode): New function. (replace_vector_const): Likewise. (remove_redundant_vector_load): Likewise. (pass_data_remove_redundant_vector_load): Likewise. (pass_remove_redundant_vector_load): Likewise. (make_pass_remove_redundant_vector_load): Likewise. * config/i386/i386-passes.def: Add pass_remove_redundant_vector_load after pass_remove_partial_avx_dependency. * config/i386/i386-protos.h (make_pass_remove_redundant_vector_load): New. * config/i386/i386.cc (ix86_modes_tieable_p): Return true for narrower non-scalar-integer modes in SSE registers. gcc/testsuite/ PR target/92080 PR target/117839 * gcc.target/i386/pr117839-1a.c: New test. * gcc.target/i386/pr117839-1b.c: Likewise. * gcc.target/i386/pr117839-2.c: Likewise. * gcc.target/i386/pr92080-1.c: Likewise. * gcc.target/i386/pr92080-2.c: Likewise. * gcc.target/i386/pr92080-3.c: Likewise. Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
2025-04-29i386: Add ix86_expand_unsigned_small_int_cst_argumentH.J. Lu1-5/+53
When passing 0xff as an unsigned char function argument with the C frontend promotion, expand_normal used to get <integer_cst 0x7fffe6aa23a8 type <integer_type 0x7fffe98225e8 int> constant 255> and returned the rtx value using the sign-extended representation: (const_int 255 [0xff]) But after commit a670ebde3995481225ec62b29686ec07a21e5c10 Author: H.J. Lu <hjl.tools@gmail.com> Date: Thu Nov 21 07:54:35 2024 +0800 Drop targetm.promote_prototypes from C, C++ and Ada frontends expand_normal now gets <integer_cst 0x7fffe9824018 type <integer_type 0x7fffe9822348 unsigned char > constant 255> and returns (const_int -1 [0xffffffffffffffff]) which doesn't work with the predicates nor the instruction templates which expect the unsigned expanded value. Extract the unsigned char and short integer constants to return (const_int 255 [0xff]) so that the expanded value is always unsigned, without the C frontend promotion. PR target/117547 * config/i386/i386-expand.cc (ix86_expand_unsigned_small_int_cst_argument): New function. (ix86_expand_args_builtin): Call ix86_expand_unsigned_small_int_cst_argument to expand the argument before calling fixup_modeless_constant. (ix86_expand_round_builtin): Likewise. (ix86_expand_special_args_builtin): Likewise. (ix86_expand_builtin): Likewise. Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
2025-04-29Remove other processors from X86_TUNE_DEST_FALSE_DEP_FOR_GLC except GLCliuhongt1-3/+1
Since the tune if only for GLC(sapphirerapids and alderlake-P). gcc/ChangeLog: * config/i386/x86-tune.def (X86_TUNE_DEST_FALSE_DEP_FOR_GLC): Remove other processor except for GLC since this one is only for GLC.
2025-04-28x86: Properly find the maximum stack slot alignmentH.J. Lu1-21/+174
Don't assume that stack slots can only be accessed by stack or frame registers. We first find all registers defined by stack or frame registers. Then check memory accesses by such registers, including stack and frame registers. gcc/ PR target/109780 PR target/109093 * config/i386/i386.cc (stack_access_data): New. (ix86_update_stack_alignment): Likewise. (ix86_find_all_reg_use_1): Likewise. (ix86_find_all_reg_use): Likewise. (ix86_find_max_used_stack_alignment): Also check memory accesses from registers defined by stack or frame registers. gcc/testsuite/ PR target/109780 PR target/109093 * g++.target/i386/pr109780-1.C: New test. * gcc.target/i386/pr109093-1.c: Likewise. * gcc.target/i386/pr109780-1.c: Likewise. * gcc.target/i386/pr109780-2.c: Likewise. * gcc.target/i386/pr109780-3.c: Likewise. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Co-Authored-By: Uros Bizjak <ubizjak@gmail.com>
2025-04-28gcc: For Windows x86-32, always attempt to realign stack regardless of SSELIU Hao1-1/+1
For Windows x86-32 targets, the Microsoft ABI only guarantees that the stack is aligned to 4-byte boundaries. GCC knows about the default alignment of the stack. However, before this commit, it did not realign the stack unless SSE was also enabled. When a stricter (larger) alignment is requested, it's always necessary to realign the stack, as what Solaris does. Reference: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111107#c14 Signed-off-by: LIU Hao <lh_mouse@126.com> Signed-off-by: Jonathan Yong <10walls@gmail.com> gcc/ChangeLog: PR target/111107 * config/i386/cygming.h (STACK_REALIGN_DEFAULT): Copy from sol2.h.
2025-04-27RISC-V: Extract vector stepped for expand_const_vector [NFC]Pan Li1-291/+299
Consider the expand_const_vector is quit long (about 500 lines) and complicated, we would like to extract the different case into different functions. For example, the const vector stepped will be extracted into expand_const_vector_stepped. The below test suites are passed for this patch. * The rv64gcv fully regression test. gcc/ChangeLog: * config/riscv/riscv-v.cc (expand_const_vector): Extract const vector stepped into separated func. (expand_const_vector_single_step_npatterns): Add new func to take care of single step. (expand_const_vector_interleaved_stepped_npatterns): Add new func to take care of interleaved step. (expand_const_vector_stepped): Add new func to take care of const vector stepped. Signed-off-by: Pan Li <pan2.li@intel.com>
2025-04-27RISC-V: Extract vector duplicate for expand_const_vector [NFC]Pan Li1-76/+104
Consider the expand_const_vector is quit long (about 500 lines) and complicated, we would like to extract the different case into different functions. For example, the const vector duplicate will be extracted into expand_const_vector_duplicate, and then expand_const_vector_duplicate_repeating and expand_const_vector_duplicate_default for the underlying function. The below test suites are passed for this patch. * The rv64gcv fully regression test. gcc/ChangeLog: * config/riscv/riscv-v.cc (expand_const_vector_duplicate_repeating): Add new func to take care of vector duplicate with repeating. (expand_const_vector_duplicate_default): Add new func to take care of default const vector duplicate. (expand_const_vector_duplicate): Add new func to take care of all const vector duplicate. (expand_const_vector): Extract const vector duplicate into separated function. Signed-off-by: Pan Li <pan2.li@intel.com>
2025-04-27RISC-V: Extract vec_series for expand_const_vector [NFC]Pan Li1-7/+13
Consider the expand_const_vector is quit long (about 500 lines) and complicated, we would like to extract the different case into different functions. For example, the const vec_series will be extracted into expand_const_vec_series. The below test suites are passed for this patch. * The rv64gcv fully regression test. gcc/ChangeLog: * config/riscv/riscv-v.cc (expand_const_vec_series): Add new func to take care of the const vec_series. (expand_const_vector): Extract const vec_series into separated function. Signed-off-by: Pan Li <pan2.li@intel.com>
2025-04-27RISC-V: Extract vec_duplicate for expand_const_vector [NFC]Pan Li1-42/+50
Consider the expand_const_vector is quit long (about 500 lines) and complicated, we would like to extract the different case into different functions. For example, the const vec_duplicate will be extracted into expand_const_vec_duplicate. The below test suites are passed for this patch. * The rv64gcv fully regression test. gcc/ChangeLog: * config/riscv/riscv-v.cc (expand_const_vector): Extract const vec_duplicate into separated function. (expand_const_vec_duplicate): Add new func to take care of the const vec_duplicate. Signed-off-by: Pan Li <pan2.li@intel.com>
2025-04-26Refactor msse4 and mno-sse4.liuhongt2-12/+1
gcc/ChangeLog: PR target/119549 * common/config/i386/i386-common.cc (ix86_handle_option): Refactor msse4 and mno-sse4. * config/i386/i386.opt (msse4): Remove RejectNegative. (mno-sse4): Remove the entry. * config/i386/i386-options.cc (ix86_valid_target_attribute_inner_p): Remove special code which handles mno-sse4.
2025-04-26Fix i386 vectorizer cost of FP scalar MAX_EXPR and MIN_EXPRJan Hubicka1-2/+4
I introduced a bug by last minute cleanups unifying the scalar and vector SSE conditional. This patch fixes it and restores cost of 1 of SSE scalar MIN/MAX Bootstrapped/regtested x86_64-linux, comitted. gcc/ChangeLog: PR target/105275 * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Fix cost of FP scalar MAX_EXPR and MIN_EXPR
2025-04-25s390: Allow 5+ argument tail-calls in some -m31 -mzarch special cases [PR119873]Jakub Jelinek1-1/+11
Here is a patch to handle the PARALLEL case too. I think we can just use rtx_equal_p there, because it will always use SImode in the EXPR_LIST REGs in that case. 2025-04-25 Jakub Jelinek <jakub@redhat.com> PR target/119873 * config/s390/s390.cc (s390_call_saved_register_used): Don't return true if default definition of PARM_DECL SSA_NAME of the same register is passed in call saved register in the PARALLEL case either. * gcc.target/s390/pr119873-5.c: New test.
2025-04-25GCN: Properly switch sections in 'gcn_hsa_declare_function_name' [PR119737]Andrew Pinski1-3/+3
There are GCN/C++ target as well as offloading codes, where the hard-coded section names in 'gcn_hsa_declare_function_name' do not fit, and assembly thus fails: LLVM ERROR: Size expression must be absolute. This commit progresses GCN target: [-FAIL: g++.dg/init/call1.C -std=gnu++17 (internal compiler error: Aborted signal terminated program as)-] [-FAIL:-]{+PASS:+} g++.dg/init/call1.C -std=gnu++17 (test for excess errors) [-UNRESOLVED:-]{+PASS:+} g++.dg/init/call1.C -std=gnu++17 [-compilation failed to produce executable-]{+execution test+} [-FAIL: g++.dg/init/call1.C -std=gnu++26 (internal compiler error: Aborted signal terminated program as)-] [-FAIL:-]{+PASS:+} g++.dg/init/call1.C -std=gnu++26 (test for excess errors) [-UNRESOLVED:-]{+PASS:+} g++.dg/init/call1.C -std=gnu++26 [-compilation failed to produce executable-]{+execution test+} UNSUPPORTED: g++.dg/init/call1.C -std=gnu++98: exception handling not supported ..., and GCN offloading: [-XFAIL: libgomp.c++/target-exceptions-throw-1.C (internal compiler error: Aborted signal terminated program as)-] [-XFAIL: libgomp.c++/target-exceptions-throw-1.C PR119737 at line 7 (test for bogus messages, line )-] [-XFAIL:-]{+PASS:+} libgomp.c++/target-exceptions-throw-1.C (test for excess errors) [-UNRESOLVED:-]{+PASS:+} libgomp.c++/target-exceptions-throw-1.C [-compilation failed to produce executable-]{+execution test+} {+PASS: libgomp.c++/target-exceptions-throw-1.C output pattern test+} [-XFAIL: libgomp.c++/target-exceptions-throw-2.C (internal compiler error: Aborted signal terminated program as)-] [-XFAIL: libgomp.c++/target-exceptions-throw-2.C PR119737 at line 7 (test for bogus messages, line )-] [-XFAIL:-]{+PASS:+} libgomp.c++/target-exceptions-throw-2.C (test for excess errors) [-UNRESOLVED:-]{+PASS:+} libgomp.c++/target-exceptions-throw-2.C [-compilation failed to produce executable-]{+execution test+} {+PASS: libgomp.c++/target-exceptions-throw-2.C output pattern test+} [-XFAIL: libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 (internal compiler error: Aborted signal terminated program as)-] [-XFAIL: libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 PR119737 at line 7 (test for bogus messages, line )-] [-XFAIL:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 (test for excess errors) [-UNRESOLVED:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 [-compilation failed to produce executable-]{+execution test+} {+PASS: libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 output pattern test+} [-XFAIL: libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 (internal compiler error: Aborted signal terminated program as)-] [-XFAIL: libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 PR119737 at line 9 (test for bogus messages, line )-] [-XFAIL:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 (test for excess errors) [-UNRESOLVED:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 [-compilation failed to produce executable-]{+execution test+} {+PASS: libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 output pattern test+} PR target/119737 gcc/ * config/gcn/gcn.cc (gcn_hsa_declare_function_name): Properly switch sections. libgomp/ * testsuite/libgomp.c++/target-exceptions-throw-1.C: Remove PR119737 XFAILing. * testsuite/libgomp.c++/target-exceptions-throw-2.C: Likewise. * testsuite/libgomp.oacc-c++/exceptions-throw-1.C: Likewise. * testsuite/libgomp.oacc-c++/exceptions-throw-2.C: Likewise. Co-authored-by: Thomas Schwinge <tschwinge@baylibre.com>
2025-04-24s390: Allow 5+ argument tail-calls in some special cases [PR119873]Jakub Jelinek1-3/+18
protobuf (and therefore firefox too) currently doesn't build on s390*-linux. The problem is that it uses [[clang::musttail]] attribute heavily, and in llvm (IMHO llvm bug) [[clang::musttail]] calls with 5+ arguments on s390*-linux are silently accepted and result in a normal non-tail call. In GCC we just reject those because the target hook refuses to tail call it (IMHO the right behavior). Now, the reason why that happens is as s390_function_ok_for_sibcall attempts to explain, the 5th argument (assuming normal <= wordsize integer or pointer arguments, nothing that needs 2+ registers) is passed in %r6 which is not call clobbered, so we can't do tail call when we'd have to change content of that register and then caller would assume %r6 content didn't change and use it again. In the protobuf case though, the 5th argument is always passed through from the caller to the musttail callee unmodified, so one can actually emit just jg tail_called_function or perhaps tweak some registers but keep %r6 untouched, and in that case I think it is just fine to tail call it (at least unless the stack slots used for 6+ argument can't be modified by the callee in the ABI and nothing checks for that). So, the following patch checks for this special case, where the argument which uses %r6 is passed in a single register and it is passed default definition of SSA_NAME of a PARM_DECL with the same DECL_INCOMING_RTL. It won't really work at -O0 but should work for -O1 and above, at least when one doesn't really try to modify the parameter conditionally and hope it will be optimized away in the end. 2025-04-24 Jakub Jelinek <jakub@redhat.com> Stefan Schulze Frielinghaus <stefansf@gcc.gnu.org> PR target/119873 * config/s390/s390.cc (s390_call_saved_register_used): Don't return true if default definition of PARM_DECL SSA_NAME of the same register is passed in call saved register. (s390_function_ok_for_sibcall): Adjust comment. * gcc.target/s390/pr119873-1.c: New test. * gcc.target/s390/pr119873-2.c: New test. * gcc.target/s390/pr119873-3.c: New test. * gcc.target/s390/pr119873-4.c: New test.
2025-04-24Fix i386 vectorizer cost of COND_EXPR and MIN_MAX with one of parameters 0 or -1Jan Hubicka1-8/+35
gcc/ChangeLog: PR target/119919 * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Account correctly cond_expr and min/max when one of operands is 0 or -1. gcc/testsuite/ChangeLog: * gcc.target/i386/pr119919.c: New test.
2025-04-24aarch64: Fix CFA offsets in non-initial stack probes [PR119610]Richard Sandiford1-26/+40
PR119610 is about incorrect CFI output for a stack probe when that probe is not the initial allocation. The main aarch64 stack probe function, aarch64_allocate_and_probe_stack_space, implicitly assumed that the incoming stack pointer pointed to the top of the frame, and thus held the CFA. aarch64_save_callee_saves and aarch64_restore_callee_saves use a parameter called bytes_below_sp to track how far the stack pointer is above the base of the static frame. This patch does the same thing for aarch64_allocate_and_probe_stack_space. Also, I noticed that the SVE path was attaching the first CFA note to the wrong instruction: it was attaching the note to the calculation of the stack size, rather than to the r11<-sp copy. gcc/ PR target/119610 * config/aarch64/aarch64.cc (aarch64_allocate_and_probe_stack_space): Add a bytes_below_sp parameter and use it to calculate the CFA offsets. Attach the first SVE CFA note to the move into the associated temporary register. (aarch64_allocate_and_probe_stack_space): Update calls accordingly. Start out with bytes_per_sp set to the frame size and decrement it after each allocation. gcc/testsuite/ PR target/119610 * g++.dg/torture/pr119610.C: New test. * g++.target/aarch64/sve/pr119610-sve.C: Likewise.
2025-04-23target: [PR103750] Also handle avx512 kmask & immediate 15 or 3 when VF is 4/2.liuhongt1-4/+80
Since the upper bits are already cleared by the comparison instructions. gcc/ChangeLog: PR target/103750 * config/i386/sse.md (*<avx512>_cmp<mode>3_and15): New define_insn. (*<avx512>_ucmp<mode>3_and15): Ditto. (*<avx512>_cmp<mode>3_and3): Ditto. (*avx512vl_ucmpv2di3_and3): Ditto. (*<avx512>_cmp<V48H_AVX512VL:mode>3_zero_extend<SWI248x:mode>): Change operands[3] predicate to <cmp_imm_predicate>. (*<avx512>_cmp<V48H_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2): Ditto. (*<avx512>_cmp<mode>3): Add GET_MODE_NUNITS (<MODE>mode) >= 8 to the condition. (*<avx512>_ucmp<mode>3): Ditto. (V48_AVX512VL_4): New mode iterator. (VI48_AVX512VL_4): Ditto. (V8_AVX512VL_2): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/avx512vl-pr103750-1.c: New test. * gcc.target/i386/avx512f-pr96891-3.c: Adjust testcase. * gcc.target/i386/avx512f-vpcmpgtuq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpeqq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpequq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpgeq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpgeuq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpgtq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpgtuq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpleq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpleuq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpltq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpltuq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpneqq-1.c: Ditto. * gcc.target/i386/avx512vl-vpcmpnequq-1.c: Ditto.
2025-04-23Cost truth_value exprs in i386 vectorizer costs.Jan Hubicka1-0/+18
this patch implements costing of truth_value exprs. I.e. a = b < c; Those seems to be now the most common operations that goes to the addss path except for in->fp and fp->int conversions. For integer we use setcc, for FP there is CMccSS and variants which sets the destination register a s a mast (i.e. -1 on true and 0 on false). Technically these needs res&1 to get into 1 on true, 0 on false, but looking on examples where this is used, it is common that the resulting code is optimized avoiding need for this (except for cases wehre result is directly saved to memory). For this reason I am accounting only one sse_op (CMccSS) itself. gcc/ChangeLog: * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Cost truth_value exprs.
2025-04-22Accept allones or 0 operand for vcond_mask op1.liuhongt4-7/+27
Since ix86_expand_sse_movcc will simplify them into a simple vmov, vpand or vpandn. gcc/ChangeLog: * config/i386/predicates.md (vector_or_0_or_1s_operand): New predicate. (nonimm_or_0_or_1s_operand): Ditto. * config/i386/sse.md (vcond_mask_<mode><sseintvecmodelower>): Extend the predicate of operands1 to accept 0 or allones operands. (vcond_mask_<mode><sseintvecmodelower>): Ditto. (vcond_mask_v1tiv1ti): Ditto. (vcond_mask_<mode><sseintvecmodelower>): Ditto. * config/i386/i386.md (mov<mode>cc): Ditto for operands[2] and operands[3]. * config/i386/i386-expand.cc (ix86_expand_sse_fp_minmax): Force immediate_operand to register. gcc/testsuite/ChangeLog: * gcc.target/i386/blendv-to-maxmin.c: New test. * gcc.target/i386/blendv-to-pand.c: New test.
2025-04-22Fix vectorizer costs of COND_EXPR, MIN_EXPR, MAX_EXPR, ABS_EXPR, ABSU_EXPRJan Hubicka1-9/+86
this patch adds special cases for vectorizer costs in COND_EXPR, MIN_EXPR, MAX_EXPR, ABS_EXPR and ABSU_EXPR. We previously costed ABS_EXPR and ABSU_EXPR but it was only correct for FP variant (wehre it corresponds to andss clearing sign bit). Integer abs/absu is open coded as conditinal move for SSE2 and SSE3 instroduced an instruction. MIN_EXPR/MAX_EXPR compiles to minss/maxss for FP and accroding to Agner Fog tables they costs same as sse_op on all targets. Integer translated to single instruction since SSE3. COND_EXPR translated to open-coded conditional move for SSE2, SSE4.1 simplified the sequence and AVX512 introduced masked registers. gcc/ChangeLog: * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Add special cases for COND_EXPR; make MIN_EXPR, MAX_EXPR, ABS_EXPR and ABSU_EXPR more realistic. gcc/testsuite/ChangeLog: * gcc.target/i386/pr89618-2.c: XFAIL.
2025-04-22rs6000: Ignore OPTION_MASK_SAVE_TOC_INDIRECT differences in inlining ↵Jakub Jelinek1-4/+7
decisions [PR119327] The following testcase FAILs because the always_inline function can't be inlined. The rs6000 backend has similarly to other targets a hook which rejects inlining which would bring in new ISAs which aren't there in the caller. And this hook rejects this because of OPTION_MASK_SAVE_TOC_INDIRECT differences. This flag is set if explicitly requested or by default depending on whether the current function looks hot (or at least not cold): if ((rs6000_isa_flags_explicit & OPTION_MASK_SAVE_TOC_INDIRECT) == 0 && flag_shrink_wrap_separate && optimize_function_for_speed_p (cfun)) rs6000_isa_flags |= OPTION_MASK_SAVE_TOC_INDIRECT; The target nodes that are being compared here are actually the default target node (which was created when cfun was NULL) vs. one that was created for the always_inline function when it wasn't NULL, so one doesn't have it, the other does. In any case, this flag feels like a tuning decision rather than hard ISA requirement and I see no problems why we couldn't inline even explicit -msave-toc-indirect function into -mno-save-toc-indirect or vice versa. We already ignore OPTION_MASK_P{8,10}_FUSION which are also more like tuning flags. 2025-04-22 Jakub Jelinek <jakub@redhat.com> PR target/119327 * config/rs6000/rs6000.cc (rs6000_can_inline_p): Ignore also OPTION_MASK_SAVE_TOC_INDIRECT differences. * g++.dg/opt/pr119327.C: New test.
2025-04-22aarch64: Define __ARM_FEATURE_FAMINMAXRichard Sandiford1-0/+1
We implemented FAMINMAX ACLE support but failed to define the associated feature macro. gcc/ * config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins): Define __ARM_FEATURE_FAMINMAX. gcc/testsuite/ * gcc.target/aarch64/pragma_cpp_predefs_4.c: Test __ARM_FEATURE_FAMINMAX.
2025-04-22AArch64: Emit half-precision FCMP/FCMPESpencer Abson1-13/+16
Enable a target with FEAT_FP16 to emit the half-precision variants of FCMP/FCMPE. gcc/ChangeLog: * config/aarch64/aarch64.md: Update cbranch, cstore, fcmp and fcmpe to use the GPF_F16 iterator for floating-point modes. gcc/testsuite/ChangeLog: * gcc.target/aarch64/_Float16_cmp_1.c: New test. * gcc.target/aarch64/_Float16_cmp_2.c: New (negative) test.
2025-04-22AArch64: Define the spaceship optab [PR117013]Spencer Abson3-0/+117
This expansion ensures that exactly one comparison is emitted for spacesip-like sequences on floating-point operands, including when the result of such sequences are compared against members of std::<some_ordering>::<some_value>. For both integer and floating-point types, we optimize for the case in which the result of a spaceship-like operation is written to a GPR. The PR highlights this issue for floating-point operands, but we also make an improvement for integers, preferring: cmp w0, w1 cset w1, gt csinv w0, w1, wzr, ge over: cmp w0, w1 mov w0, 1 csinv w0, w0, wzr, ge csel w0, w0, wzr, ne to compute: auto test(int a, int b) { return a <=> b;} gcc/ChangeLog: PR target/117013 * config/aarch64/aarch64-protos.h (aarch64_expand_fp_spaceship): Declare optab expander function for floating-point types. * config/aarch64/aarch64.cc (aarch64_expand_fp_spaceship): Define optab expansion for floating-point types (new function). * config/aarch64/aarch64.md (spaceship<mode>4): Add define_expands for spaceship<mode>4 on integer and floating-point types. gcc/testsuite/ChangeLog: PR target/117013 * g++.target/aarch64/spaceship_1.C: New test. * g++.target/aarch64/spaceship_2.C: New test. * g++.target/aarch64/spaceship_3.C: New test.
2025-04-22aarch64: Update FP8 dependencies for -mcpu=olympusKyrylo Tkachov1-1/+1
We had not noticed that after g:299a8e2dc667e795991bc439d2cad5ea5bd379e2 the FP8FMA and FP8DOT4 features aren't implied by FP8FMA. The intent is for -mcpu=olympus to support all of them. Fix the definition to include the relevant sub-features explicitly. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> gcc/ * config/aarch64/aarch64-cores.def (olympus): Add fp8fma, fp8dot4 explicitly.