path: root/gcc/config/aarch64
Age | Commit message | Author | Files changed, lines (-/+)
2 hours | aarch64: Fix general permutes of svbfloat16_ts | Richard Sandiford | 2 files changed, -18/+17
Testing gcc.target/aarch64/sve/permute_2.c without the associated GCC patch triggered an unrecognisable insn ICE for the svbfloat16_t tests. This was because the implementation of general two-vector permutes requires two TBLs and an ORR, with the ORR being represented as an unspec for floating-point modes. The associated pattern did not cover VNx8BF. gcc/ * config/aarch64/iterators.md (SVE_I): Move further up file. (SVE_F): New mode iterator. (SVE_ALL): Redefine in terms of SVE_I and SVE_F. * config/aarch64/aarch64-sve.md (*<LOGICALF:optab><mode>3): Extend to all SVE_F. gcc/testsuite/ * gcc.target/aarch64/sve/permute_5.c: New test.
2 hours | aarch64: Handle SVE modes in aarch64_evpc_reencode [PR116583] | Richard Sandiford | 1 file changed, -9/+46
For Advanced SIMD modes, aarch64_evpc_reencode tests whether a permute in a narrow element mode can be done more cheaply in a wider mode. For example, { 0, 1, 8, 9, 4, 5, 12, 13 } on V8HI is a natural TRN1 on V4SI ({ 0, 4, 2, 6 }). This patch extends the code to handle SVE data and predicate modes as well. This is a prerequisite to getting good results for PR116583. gcc/ PR target/116583 * config/aarch64/aarch64.cc (aarch64_coalesce_units): New function, extending the Advanced SIMD handling from... (aarch64_evpc_reencode): ...here to SVE data and predicate modes. gcc/testsuite/ PR target/116583 * gcc.target/aarch64/sve/permute_1.c: New test. * gcc.target/aarch64/sve/permute_2.c: Likewise. * gcc.target/aarch64/sve/permute_3.c: Likewise. * gcc.target/aarch64/sve/permute_4.c: Likewise.
3 days | aarch64: Fix bug with max/min (PR116934) | Saurabh Jha | 1 file changed, -4/+4
In ac4cdf5cb43c0b09e81760e2a1902ceebcf1a135, I introduced a bug where I put the new unspecs, UNSPEC_COND_SMAX and UNSPEC_COND_SMIN, into the wrong iterator. I should have put the new unspecs in SVE_COND_FP_MAXMIN but put them in SVE_COND_FP_BINARY_REG instead. That was incorrect because SVE_COND_FP_MAXMIN, not SVE_COND_FP_BINARY_REG, is the iterator used for predicated floating-point maximum/minimum. Also added a testcase to validate the new change. Regression tested on aarch64-unknown-linux-gnu and found no regressions. There are some test cases with "libitm" in their directory names which appear in compare_tests output as changed tests, but it looks like they are in the output only because of changed build directories, like from build-patched/aarch64-unknown-linux-gnu/./libitm/* to build-pristine/aarch64-unknown-linux-gnu/./libitm/*. I didn't think this was a cause for concern and have pushed this for review. gcc/ChangeLog: PR target/116934 * config/aarch64/iterators.md: Move UNSPEC_COND_SMAX and UNSPEC_COND_SMIN to correct iterators. gcc/testsuite/ChangeLog: PR target/116934 * gcc.target/aarch64/sve2/pr116934.c: New test.
3 days | aarch64: Set Armv9-A generic L1 cache line size to 64 bytes | Kyrylo Tkachov | 1 file changed, -1/+13
I'd like to use a value of 64 bytes for the L1 cache size for Armv9-A generic tuning. As described in g:9a99559a478111f7fbeec29bd78344df7651c707 this value is used to set the std::hardware_destructive_interference_size value which we want to be not overly large when running concurrent applications on large core-count systems. The generic value for Armv8-A systems and the port baseline is 256 bytes because that's what the A64FX CPU has, as set de-facto in aarch64_override_options_internal. But for Armv9-A CPUs as far as I know there isn't anything larger than 64 bytes, so we should be able to use the smaller value here and reduce the size of concurrent structs that use std::hardware_destructive_interference_size to pad their fields. Bootstrapped and tested on aarch64-none-linux-gnu. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> * config/aarch64/tuning_models/generic_armv9_a.h (generic_armv9a_prefetch_tune): Define. (generic_armv9_a_tunings): Use the above.
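For illustration only (not from the commit): the tuning value feeds the interference-size constants, which GCC exposes to C code as the predefined macro __GCC_DESTRUCTIVE_SIZE; assuming that macro tracks this change, a layout like the sketch below would shrink from 256-byte to 64-byte padding under generic Armv9-A tuning.

    #include <stdatomic.h>

    /* Hypothetical layout: keep two hot counters on separate cache lines.
       With __GCC_DESTRUCTIVE_SIZE at 64 instead of 256, the struct is four
       times smaller while still avoiding false sharing.  */
    struct counters
    {
      _Alignas (__GCC_DESTRUCTIVE_SIZE) atomic_long reads;
      _Alignas (__GCC_DESTRUCTIVE_SIZE) atomic_long writes;
    };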
4 days | Aarch64: Define WIDEST_HARDWARE_FP_SIZE | Eric Botcazou | 1 file changed, -0/+2
The macro is documented like this in the internal manual: -- Macro: WIDEST_HARDWARE_FP_SIZE A C expression for the size in bits of the widest floating-point format supported by the hardware. If you define this macro, you must specify a value less than or equal to mode precision of the mode used for C type 'long double' (from hook 'targetm.c.mode_for_floating_type' with argument 'TI_LONG_DOUBLE_TYPE'). If you do not define this macro, mode precision of the mode used for C type 'long double' is the default. AArch64 uses 128-bit TFmode for long double but, as far as I know, no FPU implemented in hardware supports it. gcc/ * config/aarch64/aarch64.h (WIDEST_HARDWARE_FP_SIZE): Define to 64. gcc/testsuite/ * gnat.dg/specs/size_clause6.ads: New test.
4 days | aarch64: Fix early ra for -fno-delete-dead-exceptions [PR116927] | Andrew Pinski | 1 file changed, -0/+6
Early-RA was considering throwing instructions as being dead and removing them even if -fno-delete-dead-exceptions was in use. This fixes that oversight. Built and tested for aarch64-linux-gnu. PR target/116927 gcc/ChangeLog: * config/aarch64/aarch64-early-ra.cc (early_ra::is_dead_insn): Insns that throw are not dead with -fno-delete-dead-exceptions. gcc/testsuite/ChangeLog: * g++.dg/torture/pr116927-1.C: New test. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
6 days | aarch64: Introduce new unspecs for smax/smin | Saurabh Jha | 2 files changed, -61/+45
Introduce two new unspecs, UNSPEC_COND_SMAX and UNSPEC_COND_SMIN, corresponding to rtl operators smax and smin. UNSPEC_COND_SMAX is used to generate fmaxnm instruction and UNSPEC_COND_SMIN is used to generate fminnm instruction. With these new unspecs, we can generate SVE2 max/min instructions using existing generic unpredicated and predicated instruction patterns that use optab attribute. Thus, we have removed specialised instruction patterns for max/min instructions that were using SVE_COND_FP_MAXMIN_PUBLIC iterator. No new test cases as the existing test cases should be enough to test this refactoring. gcc/ChangeLog: * config/aarch64/aarch64-sve.md (<fmaxmin><mode>3): Remove this instruction pattern. (cond_<fmaxmin><mode>): Remove this instruction pattern. * config/aarch64/iterators.md: New unspecs and changes to iterators and attrs to use the new unspecs
6 days | aarch64: Add fp8 scalar types | Claudio Bantaloukas | 4 files changed, -2/+79
The ACLE defines a new scalar type, __mfp8. This is an opaque 8-bit type that can only be used by fp8 intrinsics. Additionally, the mfloat8_t type is made available in arm_neon.h and arm_sve.h as an alias of the same. This implementation uses an unsigned INTEGER_TYPE with precision 8 to represent __mfp8. Conversions to int and other types are disabled via the TARGET_INVALID_CONVERSION hook. Additionally, operations that are typically available to integer types are disabled via TARGET_INVALID_UNARY_OP and TARGET_INVALID_BINARY_OP hooks. gcc/ChangeLog: * config/aarch64/aarch64-builtins.cc (aarch64_mfp8_type_node): Add node for __mfp8 type. (aarch64_mfp8_ptr_type_node): Add node for __mfp8 pointer type. (aarch64_init_fp8_types): New function to initialise fp8 types and register with language backends. * config/aarch64/aarch64.cc (aarch64_mangle_type): Add ABI mangling for new type. (aarch64_invalid_conversion): Add function implementing TARGET_INVALID_CONVERSION hook that blocks conversion to and from the __mfp8 type. (aarch64_invalid_unary_op): Add function implementing TARGET_UNARY_OP hook that blocks operations on __mfp8 other than &. (aarch64_invalid_binary_op): Extend TARGET_BINARY_OP hook to disallow operations on __mfp8 type. (TARGET_INVALID_CONVERSION): Add define. (TARGET_INVALID_UNARY_OP): Likewise. * config/aarch64/aarch64.h (aarch64_mfp8_type_node): Add node for __mfp8 type. (aarch64_mfp8_ptr_type_node): Add node for __mfp8 pointer type. * config/aarch64/arm_private_fp8.h (mfloat8_t): Add typedef. gcc/testsuite/ChangeLog: * g++.target/aarch64/fp8_mangling.C: New tests exercising mangling. * g++.target/aarch64/fp8_scalar_typecheck_2.C: New tests in C++. * gcc.target/aarch64/fp8_scalar_1.c: New tests in C. * gcc.target/aarch64/fp8_scalar_typecheck_1.c: Likewise.
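A rough sketch of the intended usage rules (the exact diagnostics and required -march flags are assumptions, not taken from the commit): copies and address-taking of the new type work, while conversions and arithmetic are rejected.

    #include <arm_neon.h>

    /* OK: plain copy of the opaque type.  */
    mfloat8_t keep (mfloat8_t x) { return x; }

    /* OK: taking the address is the one unary operation left enabled.  */
    const mfloat8_t *addr (const mfloat8_t *p) { return p; }

    /* Expected to be rejected via the new hooks:
       float widen (mfloat8_t x) { return x; }                       // invalid conversion
       mfloat8_t add (mfloat8_t a, mfloat8_t b) { return a + b; }    // invalid operands  */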
7 days | aarch64: Fix aarch64 backend-use of (u|s|us)dot_prod patterns | Victor Do Nascimento | 8 files changed, -17/+51
Given recent changes to the dot_prod standard pattern name, this patch fixes the aarch64 back-end by implementing the following changes: 1. Add 2nd mode to all (u|s|us)dot_prod patterns in .md files. 2. Rewrite initialization and function expansion mechanism for simd builtins. 3. Fix all direct calls to back-end `dot_prod' patterns in SVE builtins. Finally, given that it is now possible for the compiler to differentiate between the two- and four-way dot product, we add a test to ensure that autovectorization picks up on dot-product patterns where the result is twice the width of the operands. gcc/ChangeLog: * config/aarch64/aarch64-simd.md (<sur>dot_prod<vsi2qi><vczle><vczbe>): Renamed to... (<sur>dot_prod<mode><vsi2qi><vczle><vczbe>): ...this. (usdot_prod<vsi2qi><vczle><vczbe>): Renamed to... (usdot_prod<mode><vsi2qi><vczle><vczbe>): ...this. (<su>sadv16qi): Adjust call to gen_udot_prod take second mode. (popcount<mode2>): fix use of `udot_prod_optab'. * config/aarch64/aarch64-sve.md (<sur>dot_prod<vsi2qi>): Renamed to... (<sur>dot_prod<mode><vsi2qi>): ...this. (@<sur>dot_prod<vsi2qi>): Renamed to... (@<sur>dot_prod<mode><vsi2qi>): ...this. (<su>sad<vsi2qi>): Adjust call to gen_udot_prod take second mode. * config/aarch64/aarch64-sve2.md (@aarch64_sve_<sur>dotvnx4sivnx8hi): Renamed to... (<sur>dot_prodvnx4sivnx8hi): ...this. * config/aarch64/aarch64-simd-builtins.def: Modify macro expansion-based initialization and expansion of (u|s|us)dot_prod builtins. * config/aarch64/aarch64-builtins.cc (CODE_FOR_aarch64_sdot_prodv8qi): Define as alias to new CODE_FOR_sdot_prodv2siv8qi. (CODE_FOR_aarch64_udot_prodv8qi): Define as alias to new CODE_FOR_udot_prodv2siv8qi. (CODE_FOR_aarch64_usdot_prodv8qi): Define as alias to new CODE_FOR_usdot_prodv2siv8qi. (CODE_FOR_aarch64_sdot_prodv16qi): Define as alias to new CODE_FOR_sdot_prodv4siv16qi. (CODE_FOR_aarch64_udot_prodv16qi): Define as alias to new CODE_FOR_udot_prodv4siv16qi. (CODE_FOR_aarch64_usdot_prodv16qi): Define as alias to new CODE_FOR_usdot_prodv4siv16qi. * config/aarch64/aarch64-sve-builtins-base.cc (svdot_impl::expand): s/direct/convert/ in `convert_optab_handler_for_sign' function call. (svusdot_impl::expand): add second mode argument in call to `code_for_dot_prod'. * config/aarch64/aarch64-sve-builtins.cc (function_expander::convert_optab_handler_for_sign): New class method. * config/aarch64/aarch64-sve-builtins.h (class function_expander): Add prototype for new `convert_optab_handler_for_sign' method. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sme/vect-dotprod-twoway.c (udot2): New.
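A sketch of the kind of loop the new two-way test targets (the committed testcase may differ): the accumulator is twice the width of the multiplied operands, so the vectorizer can now pick the two-way dot-product pattern.

    #include <stdint.h>

    uint32_t
    udot_twoway (const uint16_t *a, const uint16_t *b, int n)
    {
      uint32_t sum = 0;
      for (int i = 0; i < n; i++)
        sum += (uint32_t) a[i] * b[i];   /* 32-bit result from 16-bit operands */
      return sum;
    }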
14 days | aarch64: Add codegen support for AdvSIMD faminmax | Saurabh Jha | 2 files changed, -0/+12
The AArch64 FEAT_FAMINMAX extension is optional from Armv9.2-a and mandatory from Armv9.5-a. It introduces instructions for computing the floating point absolute maximum and minimum of the two vectors element-wise. This patch adds code generation support for famax and famin in terms of existing RTL operators. famax/famin is equivalent to first taking abs of the operands and then taking smax/smin on the results of abs. famax/famin (a, b) = smax/smin (abs (a), abs (b)) This fusion of operators is only possible when -march=armv9-a+faminmax flags are passed. We also need to pass -ffast-math flag; if we don't, then a statement like c[i] = __builtin_fmaxf16 (a[i], b[i]); is RTL expanded to UNSPEC_FMAXNM instead of smax (likewise for smin). This code generation is only available on -O2 or -O3 as that is when auto-vectorization is enabled. gcc/ChangeLog: * config/aarch64/aarch64-simd.md (*aarch64_faminmax_fused): Instruction pattern for faminmax codegen. * config/aarch64/iterators.md: Attribute for faminmax codegen. gcc/testsuite/ChangeLog: * gcc.target/aarch64/simd/faminmax-codegen-no-flag.c: New test. * gcc.target/aarch64/simd/faminmax-codegen.c: New test. * gcc.target/aarch64/simd/faminmax-no-codegen.c: New test.
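A sketch of source code the fusion is aimed at (illustrative, not the committed testcase); per the text above it would need something like -O3 -march=armv9-a+faminmax -ffast-math to become FAMAX:

    #include <math.h>

    void
    famax_loop (float *restrict c, const float *restrict a,
                const float *restrict b, int n)
    {
      for (int i = 0; i < n; i++)
        c[i] = fmaxf (fabsf (a[i]), fabsf (b[i]));   /* smax (abs, abs) -> famax */
    }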
14 days | aarch64: Add AdvSIMD faminmax intrinsics | Saurabh Jha | 6 files changed, -0/+167
The AArch64 FEAT_FAMINMAX extension is optional from Armv9.2-a and mandatory from Armv9.5-a. It introduces instructions for computing the floating point absolute maximum and minimum of the two vectors element-wise. This patch introduces AdvSIMD faminmax intrinsics. The intrinsics of this extension are implemented as the following builtin functions: * vamax_f16 * vamaxq_f16 * vamax_f32 * vamaxq_f32 * vamaxq_f64 * vamin_f16 * vaminq_f16 * vamin_f32 * vaminq_f32 * vaminq_f64 We are defining a new way to add AArch64 AdvSIMD intrinsics by listing all the intrinsics in a .def file and then using that .def file to initialise various data structures. This would lead to more concise code and easier addition of the new AdvSIMD intrinsics in future. The faminmax intrinsics are defined using the new approach. gcc/ChangeLog: * config/aarch64/aarch64-builtins.cc (ENTRY): Macro to parse the contents of aarch64-simd-pragma-builtins.def. (ENTRY_VHSDF): Macro to parse the contents of aarch64-simd-pragma-builtins.def. (enum aarch64_builtins): New enum values for faminmax builtins via aarch64-simd-pragma-builtins.def. (enum class aarch64_builtin_signatures): Enum class to specify the number of operands a builtin will take. (struct aarch64_pragma_builtins_data): Struct to hold data from aarch64-simd-pragma-builtins.def. (aarch64_fntype): New function to define function types of intrinsics given an object of type aarch64_pragma_builtins_data. (aarch64_init_pragma_builtins): New function to define pragma builtins. (aarch64_get_pragma_builtin): New function to get a row of aarch64_pragma_builtins, given code. (handle_arm_neon_h): Modify to call aarch64_init_pragma_builtins. (aarch64_general_check_builtin_call): Modify to check whether required flag is being used for pragma builtins. (aarch64_expand_pragma_builtin): New function to emit instructions of pragma_builtin. (aarch64_general_expand_builtin): Modify to call aarch64_expand_pragma_builtin. * config/aarch64/aarch64-option-extensions.def (AARCH64_OPT_EXTENSION): Introduce new flag for this extension. * config/aarch64/aarch64-simd.md (@aarch64_<faminmax_uns_op><mode>): Instruction pattern for faminmax intrinsics. * config/aarch64/aarch64.h (TARGET_FAMINMAX): Introduce new flag for this extension. * config/aarch64/iterators.md: New iterators and unspecs. * doc/invoke.texi: Document extension in AArch64 Options. * config/aarch64/aarch64-simd-pragma-builtins.def: New file to list pragma builtins. gcc/testsuite/ChangeLog: * gcc.target/aarch64/simd/faminmax-builtins-no-flag.c: New test. * gcc.target/aarch64/simd/faminmax-builtins.c: New test.
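An illustrative use of two of the intrinsics listed above (prototypes assumed to follow the usual AdvSIMD conventions; compile with a -march value that enables +faminmax):

    #include <arm_neon.h>

    float32x4_t
    abs_range (float32x4_t a, float32x4_t b)
    {
      float32x4_t hi = vamaxq_f32 (a, b);   /* element-wise absolute maximum */
      float32x4_t lo = vaminq_f32 (a, b);   /* element-wise absolute minimum */
      return vsubq_f32 (hi, lo);
    }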
14 days | dwarf2: add hooks for architecture-specific CFIs | Matthieu Longo | 1 file changed, -0/+33
Architecture-specific CFI directives are currently declared and processed among other architecture-independent CFI directives in gcc/dwarf2* files. This approach creates confusion, specifically in the case of DWARF instructions in the vendor space that use the same instruction code. Such a clash currently happens between DW_CFA_GNU_window_save (used on SPARC) and DW_CFA_AARCH64_negate_ra_state (used on AArch64), both having the same instruction code 0x2d. As a result, AArch64 compilers generate a SPARC CFI directive (.cfi_window_save) instead of .cfi_negate_ra_state, contrary to what is expected in [DWARF for the Arm 64-bit Architecture (AArch64)](https://github.com/ARM-software/abi-aa/blob/main/aadwarf64/aadwarf64.rst). This refactoring does not completely solve the problem, but improves the situation by moving some of the processing of those directives (more specifically their output in the assembly) to the backend via 2 target hooks: - DW_CFI_OPRND1_DESC: parse the first operand of the directive (if any). - OUTPUT_CFI_DIRECTIVE: output the CFI directive as a string. Additionally, this patch also contains a renaming of an enum used for return address mangling on AArch64. gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_output_cfi_directive): New hook for CFI directives. (aarch64_dw_cfi_oprnd1_desc): Same. (TARGET_OUTPUT_CFI_DIRECTIVE): Hook for output_cfi_directive. (TARGET_DW_CFI_OPRND1_DESC): Hook for dw_cfi_oprnd1_desc. * config/sparc/sparc.cc (sparc_output_cfi_directive): New hook for CFI directives. (sparc_dw_cfi_oprnd1_desc): Same. (TARGET_OUTPUT_CFI_DIRECTIVE): Hook for output_cfi_directive. (TARGET_DW_CFI_OPRND1_DESC): Hook for dw_cfi_oprnd1_desc. * coretypes.h (struct dw_cfi_node): Forward declaration of CFI type from gcc/dwarf2out.h. (enum dw_cfi_oprnd_type): Same. (enum dwarf_call_frame_info): Same. * doc/tm.texi: Regenerated from doc/tm.texi.in. * doc/tm.texi.in: Add doc for new target hooks. * dwarf2cfi.cc (struct dw_cfi_row): Update the description for window_save and ra_mangled. (dwarf2out_frame_debug_cfa_negate_ra_state): Use AArch64 CFI directive instead of the SPARC one. (change_cfi_row): Use the right CFI directive's name for RA mangling. (output_cfi): Remove explicit architecture-specific CFI directive DW_CFA_GNU_window_save that falls into default case. (output_cfi_directive): Use target hook as default. * dwarf2out.cc (dw_cfi_oprnd1_desc): Use target hook as default. * dwarf2out.h (enum dw_cfi_oprnd_type): Specify underlying type of enum to allow forward declaration. (dw_cfi_oprnd1_desc): Call target hook. (output_cfi_directive): Use dw_cfi_ref instead of struct dw_cfi_node *. * hooks.cc (hook_bool_dwcfi_dwcfioprndtyperef_false): New. (hook_bool_FILEptr_dwcfiptr_false): New. * hooks.h (hook_bool_dwcfi_dwcfioprndtyperef_false): New. (hook_bool_FILEptr_dwcfiptr_false): New. * target.def: Documentation for new hooks. include/ChangeLog: * dwarf2.h (enum dwarf_call_frame_info): Specify underlying type of enum to allow forward declaration. libffi/ChangeLog: * include/ffi_cfi.h (cfi_negate_ra_state): Declare AArch64 cfi directive. libgcc/ChangeLog: * config/aarch64/aarch64-asm.h (PACIASP): Replace SPARC CFI directive by AArch64 one. (AUTIASP): Same. libitm/ChangeLog: * config/aarch64/sjlj.S: Replace SPARC CFI directive by AArch64 one. gcc/testsuite/ChangeLog: * g++.target/aarch64/pr94515-1.C: Replace SPARC CFI directive by AArch64 one. * g++.target/aarch64/pr94515-2.C: Same.
14 days | Rename REG_CFA_TOGGLE_RA_MANGLE to REG_CFA_NEGATE_RA_STATE | Matthieu Longo | 1 file changed, -2/+2
The current name REG_CFA_TOGGLE_RA_MANGLE is not representative of what it really is, i.e. a register to represent several states, not only a binary one. Same for dwarf2out_frame_debug_cfa_toggle_ra_mangle. gcc/ChangeLog: * combine-stack-adj.cc (no_unhandled_cfa): Rename. * config/aarch64/aarch64.cc (aarch64_expand_prologue): Rename. (aarch64_expand_epilogue): Rename. * dwarf2cfi.cc (dwarf2out_frame_debug_cfa_toggle_ra_mangle): Rename this... (dwarf2out_frame_debug_cfa_negate_ra_state): To this. (dwarf2out_frame_debug): Rename. * reg-notes.def (REG_CFA_NOTE): Rename REG_CFA_TOGGLE_RA_MANGLE.
2024-09-22 | aarch64: Take into account when VF is higher than known scalar iters | Tamar Christina | 1 file changed, -0/+13
Consider low overhead loops like: void foo (char *restrict a, int *restrict b, int *restrict c, int n) { for (int i = 0; i < 9; i++) { int res = c[i]; int t = b[i]; if (a[i] != 0) res = t; c[i] = res; } } For such loops we use latency only costing since the loop bounds is known and small. The current costing however does not consider the case where niters < VF. So when comparing the scalar vs vector costs it doesn't keep in mind that the scalar code can't perform VF iterations. This makes it overestimate the cost for the scalar loop and we incorrectly vectorize. This patch takes the minimum of the VF and niters in such cases. Before the patch we generate: note: Original vector body cost = 46 note: Vector loop iterates at most 1 times note: Scalar issue estimate: note: load operations = 2 note: store operations = 1 note: general operations = 1 note: reduction latency = 0 note: estimated min cycles per iteration = 1.000000 note: estimated cycles per vector iteration (for VF 32) = 32.000000 note: SVE issue estimate: note: load operations = 5 note: store operations = 4 note: general operations = 11 note: predicate operations = 12 note: reduction latency = 0 note: estimated min cycles per iteration without predication = 5.500000 note: estimated min cycles per iteration for predication = 12.000000 note: estimated min cycles per iteration = 12.000000 note: Low iteration count, so using pure latency costs note: Cost model analysis: vs after: note: Original vector body cost = 46 note: Known loop bounds, capping VF to 9 for analysis note: Vector loop iterates at most 1 times note: Scalar issue estimate: note: load operations = 2 note: store operations = 1 note: general operations = 1 note: reduction latency = 0 note: estimated min cycles per iteration = 1.000000 note: estimated cycles per vector iteration (for VF 9) = 9.000000 note: SVE issue estimate: note: load operations = 5 note: store operations = 4 note: general operations = 11 note: predicate operations = 12 note: reduction latency = 0 note: estimated min cycles per iteration without predication = 5.500000 note: estimated min cycles per iteration for predication = 12.000000 note: estimated min cycles per iteration = 12.000000 note: Increasing body cost to 1472 because the scalar code could issue within the limit imposed by predicate operations note: Low iteration count, so using pure latency costs note: Cost model analysis: gcc/ChangeLog: * config/aarch64/aarch64.cc (adjust_body_cost): Cap VF for low iteration loops. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/asrdiv_4.c: Update bounds. * gcc.target/aarch64/sve/cond_asrd_2.c: Likewise. * gcc.target/aarch64/sve/cond_uxt_6.c: Likewise. * gcc.target/aarch64/sve/cond_uxt_7.c: Likewise. * gcc.target/aarch64/sve/cond_uxt_8.c: Likewise. * gcc.target/aarch64/sve/miniloop_1.c: Likewise. * gcc.target/aarch64/sve/spill_6.c: Likewise. * gcc.target/aarch64/sve/sve_iters_low_1.c: New test. * gcc.target/aarch64/sve/sve_iters_low_2.c: New test.
2024-09-20 | AArch64: Define VECTOR_STORE_FLAG_VALUE. | Tamar Christina | 1 file changed, -0/+10
This defines VECTOR_STORE_FLAG_VALUE to CONST1_RTX for AArch64 so we simplify vector comparisons in AArch64. With this enabled res: movi v0.4s, 0 cmeq v0.4s, v0.4s, v0.4s ret is simplified to: res: mvni v0.4s, 0 ret gcc/ChangeLog: * config/aarch64/aarch64.h (VECTOR_STORE_FLAG_VALUE): New. gcc/testsuite/ChangeLog: * gcc.dg/rtl/aarch64/vector-eq.c: New test.
2024-09-19 | SVE intrinsics: Fold svmul with all-zero operands to zero vector | Jennifer Schmitz | 1 file changed, -1/+16
As recently implemented for svdiv, this patch folds svmul to a zero vector if one of the operands is a zero vector. This transformation is applied if at least one of the following conditions is met: - the first operand is all zeros or - the second operand is all zeros, and the predicate is ptrue or the predication is _x or _z. In contrast to constant folding, which was implemented in a previous patch, this transformation is applied as soon as one of the operands is a zero vector, while the other operand can be a variable. The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com> gcc/ * config/aarch64/aarch64-sve-builtins-base.cc (svmul_impl::fold): Add folding of all-zero operands to zero vector. gcc/testsuite/ * gcc.target/aarch64/sve/const_fold_mul_1.c: Adjust expected outcome. * gcc.target/aarch64/sve/fold_mul_zero.c: New test.
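A sketch of the folding opportunity described above (illustrative; whether a given call folds is decided by the compiler at gimple time):

    #include <arm_sve.h>

    svint32_t
    mul_by_zero (svbool_t pg, svint32_t x)
    {
      /* _x predication with an all-zero operand: foldable to a zero vector.  */
      return svmul_x (pg, x, svdup_s32 (0));
    }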
2024-09-19 | aarch64: Define l1_cache_line_size for -mcpu=neoverse-v2 | Kyrylo Tkachov | 1 file changed, -1/+14
This is a small patch that sets the L1 cache line size for Neoverse V2. Unlike the other cache-related constants in there, this value is not used just for SW prefetch generation (which we want to avoid for Neoverse V2 presently). It's also used to set std::hardware_destructive_interference_size. See the links and recent discussions in PR116662 for reference. Some CPU tunings in aarch64 set this value to something useful, but for generic tuning we use the conservative 256, which forces 256-byte alignment in such atomic structures. Using a smaller value can decrease the size of such structs during layout and should not present an ABI problem as std::hardware_destructive_interference_size is not intended to be used for structs in an external interface, and GCC warns about such uses. Another place where the L1 cache line size is used is in phiopt for -fhoist-adjacent-loads where conditional accesses to adjacent struct members can be speculatively loaded as long as they are within the same L1 cache line. e.g. struct S { int i; int j; }; int bar (struct S *x, int y) { int r; if (y) r = x->i; else r = x->j; return r; } The Neoverse V2 L1 cache line is 64 bytes according to the TRM, so set it to that. The rest of the prefetch parameters inherit from the generic tuning so we don't do anything extra for software prefetches. Bootstrapped and tested on aarch64-none-linux-gnu. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> * config/aarch64/tuning_models/neoversev2.h (neoversev2_prefetch_tune): Define. (neoversev2_tunings): Use it.
2024-09-17 | SVE intrinsics: Fold svdiv with all-zero operands to zero vector | Jennifer Schmitz | 1 file changed, -9/+20
This patch folds svdiv where one of the operands is all-zeros to a zero vector, if one of the following conditions holds: - the dividend is all zeros or - the divisor is all zeros, and the predicate is ptrue or the predication is _x or _z. This case was not covered by the recent patch that implemented constant folding, because that covered only cases where both operands are constant vectors. Here, the operation is folded as soon as one of the operands is a constant zero vector. Folding of division by 0 to return 0 is in accordance with the semantics of sdiv and udiv. The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com> gcc/ * config/aarch64/aarch64-sve-builtins-base.cc (svdiv_impl::fold): Add folding of all-zero operands to zero vector. gcc/testsuite/ * gcc.target/aarch64/sve/fold_div_zero.c: New test. * gcc.target/aarch64/sve/const_fold_div_1.c: Adjust expected outcome.
2024-09-16 | aarch64: Improve vector constant generation using SVE INDEX instruction [PR113328] | Pengxuan Zheng | 1 file changed, -1/+12
SVE's INDEX instruction can be used to populate vectors by values starting from "base" and incremented by "step" for each subsequent value. We can take advantage of it to generate vector constants if TARGET_SVE is available and the base and step values are within [-16, 15]. For example, with the following function: typedef int v4si __attribute__ ((vector_size (16))); v4si f_v4si (void) { return (v4si){ 0, 1, 2, 3 }; } GCC currently generates: f_v4si: adrp x0, .LC4 ldr q0, [x0, #:lo12:.LC4] ret .LC4: .word 0 .word 1 .word 2 .word 3 With this patch, we generate an INDEX instruction instead if TARGET_SVE is available. f_v4si: index z0.s, #0, #1 ret PR target/113328 gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_simd_valid_immediate): Improve handling of some ADVSIMD vectors by using SVE's INDEX if TARGET_SVE is available. (aarch64_output_simd_mov_immediate): Likewise. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/acle/general/dupq_1.c: Update test to use SVE's INDEX instruction. * gcc.target/aarch64/sve/acle/general/dupq_2.c: Likewise. * gcc.target/aarch64/sve/acle/general/dupq_3.c: Likewise. * gcc.target/aarch64/sve/acle/general/dupq_4.c: Likewise. * gcc.target/aarch64/sve/vec_init_3.c: New test. Signed-off-by: Pengxuan Zheng <quic_pzheng@quicinc.com>
2024-09-16 | aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for SVE instructions. | Soumya AR | 1 file changed, -3/+15
On Neoverse V2, SVE ADD instructions have a throughput of 4, while shift instructions like SHL have a throughput of 2. We can lean on that to emit code like: add z31.b, z31.b, z31.b instead of: lsl z31.b, z31.b, #1 The implementation of this change for SVE vectors is similar to a prior patch <https://gcc.gnu.org/pipermail/gcc-patches/2024-August/659958.html> that adds the above functionality for Neon vectors. Here, the machine descriptor pattern is split up to separately accommodate left and right shifts, so we can specifically emit an add for all left shifts by 1. The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Soumya AR <soumyaa@nvidia.com> gcc/ChangeLog: * config/aarch64/aarch64-sve.md (*post_ra_v<optab><mode>3): Split pattern to accomodate left and right shifts separately. (*post_ra_v_ashl<mode>3): Matches left shifts with additional constraint to check for shifts by 1. (*post_ra_v_<optab><mode>3): Matches right shifts. gcc/testsuite/ChangeLog: * gcc.target/aarch64/sve/acle/asm/lsl_s16.c: Updated instances of lsl-1 with corresponding add. * gcc.target/aarch64/sve/acle/asm/lsl_s32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_s64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_s8.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_u16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_u32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_u64.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_u8.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_wide_s16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_wide_s32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_wide_s8.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_wide_u16.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_wide_u32.c: Likewise. * gcc.target/aarch64/sve/acle/asm/lsl_wide_u8.c: Likewise. * gcc.target/aarch64/sve/adr_1.c: Likewise. * gcc.target/aarch64/sve/adr_6.c: Likewise. * gcc.target/aarch64/sve/cond_mla_7.c: Likewise. * gcc.target/aarch64/sve/cond_mla_8.c: Likewise. * gcc.target/aarch64/sve/shift_2.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_s64.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/ldnt1sh_gather_u64.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_s64.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/ldnt1uh_gather_u64.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/rshl_s16.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/rshl_s32.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/rshl_s64.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/rshl_s8.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/rshl_u16.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/rshl_u32.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/rshl_u64.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/rshl_u8.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_s64.c: Likewise. * gcc.target/aarch64/sve2/acle/asm/stnt1h_scatter_u64.c: Likewise. * gcc.target/aarch64/sve/sve_shl_add.c: New test.
2024-09-10 | Pass host specific ABI opts from mkoffload. | Prathamesh Kulkarni | 1 file changed, -2/+2
The patch adds an option -foffload-abi-host-opts, which is set by host in TARGET_OFFLOAD_OPTIONS, and mkoffload then passes its value to host_compiler. gcc/ChangeLog: PR target/96265 * common.opt (foffload-abi-host-opts): New option. * config/aarch64/aarch64.cc (aarch64_offload_options): Pass -foffload-abi-host-opts. * config/i386/i386-options.cc (ix86_offload_options): Likewise. * config/rs6000/rs6000.cc (rs6000_offload_options): Likewise. * config/nvptx/mkoffload.cc (offload_abi_host_opts): Define. (compile_native): Append offload_abi_host_opts to argv_obstack. (main): Handle option -foffload-abi-host-opts. * config/gcn/mkoffload.cc (offload_abi_host_opts): Define. (compile_native): Append offload_abi_host_opts to argv_obstack. (main): Handle option -foffload-abi-host-opts. * lto-wrapper.cc (merge_and_complain): Handle -foffload-abi-host-opts. (append_compiler_options): Likewise. * opts.cc (common_handle_option): Likewise. Signed-off-by: Prathamesh Kulkarni <prathameshk@nvidia.com>
2024-09-06 | aarch64: Use is_attribute_namespace_p and get_attribute_name inside aarch64_lookup_shared_state_flags [PR116598] | Andrew Pinski | 1 file changed, -6/+2
The code in aarch64_lookup_shared_state_flags assumed that all C++11 attributes on the function type had a namespace associated with them. But with the addition of reproducible/unsequenced, this is not true. This fixes the issue by using is_attribute_namespace_p instead of manually checking whether the namespace is named "arm", and by using get_attribute_name instead of manually grabbing the attribute name. Built and tested for aarch64-linux-gnu. gcc/ChangeLog: PR target/116598 * config/aarch64/aarch64.cc (aarch64_lookup_shared_state_flags): Use is_attribute_namespace_p and get_attribute_name instead of manually grabbing the namespace and name of the attribute. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
2024-09-03 | SVE intrinsics: Fold constant operands for svmul. | Jennifer Schmitz | 1 file changed, -1/+14
This patch implements constant folding for svmul by calling gimple_folder::fold_const_binary with tree_code MULT_EXPR. Tests were added to check the produced assembly for different predicates, signed and unsigned integers, and the svmul_n_* case. The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com> gcc/ * config/aarch64/aarch64-sve-builtins-base.cc (svmul_impl::fold): Try constant folding. gcc/testsuite/ * gcc.target/aarch64/sve/const_fold_mul_1.c: New test.
2024-09-03 | SVE intrinsics: Fold constant operands for svdiv. | Jennifer Schmitz | 3 files changed, -3/+52
This patch implements constant folding for svdiv: The new function aarch64_const_binop was created, which - in contrast to int_const_binop - does not treat operations as overflowing. This function is passed as callback to vector_const_binop from the new gimple_folder method fold_const_binary, if the predicate is ptrue or predication is _x. From svdiv_impl::fold, fold_const_binary is called with TRUNC_DIV_EXPR as tree_code. In aarch64_const_binop, a case was added for TRUNC_DIV_EXPR to return 0 for division by 0, as defined in the semantics for svdiv. Tests were added to check the produced assembly for different predicates, signed and unsigned integers, and the svdiv_n_* case. The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com> gcc/ * config/aarch64/aarch64-sve-builtins-base.cc (svdiv_impl::fold): Try constant folding. * config/aarch64/aarch64-sve-builtins.h: Declare gimple_folder::fold_const_binary. * config/aarch64/aarch64-sve-builtins.cc (aarch64_const_binop): New function to fold binary SVE intrinsics without overflow. (gimple_folder::fold_const_binary): New helper function for constant folding of SVE intrinsics. gcc/testsuite/ * gcc.target/aarch64/sve/const_fold_div_1.c: New test.
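A sketch of calls that should now be foldable at compile time (illustrative, not the committed tests); note the division-by-zero case folds to 0 per the svdiv semantics described above:

    #include <arm_sve.h>

    svint64_t
    const_div (void)
    {
      return svdiv_x (svptrue_b64 (), svdup_s64 (10), svdup_s64 (2));  /* expected to fold to 5 */
    }

    svint64_t
    div_by_zero (void)
    {
      return svdiv_x (svptrue_b64 (), svdup_s64 (7), svdup_s64 (0));   /* expected to fold to 0 */
    }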
2024-08-29 | Use std::unique_ptr for optinfo_item | David Malcolm | 1 file changed, -0/+1
As preliminary work towards an overhaul of how optinfo_items interact with dump_pretty_printer, replace uses of optinfo_item * with std::unique_ptr<optinfo_item> to make ownership clearer. No functional change intended. gcc/ChangeLog: * config/aarch64/aarch64.cc: Define INCLUDE_MEMORY. * config/arm/arm.cc: Likewise. * config/i386/i386.cc: Likewise. * config/loongarch/loongarch.cc: Likewise. * config/riscv/riscv-vector-costs.cc: Likewise. * config/riscv/riscv.cc: Likewise. * config/rs6000/rs6000.cc: Likewise. * dump-context.h (dump_context::emit_item): Convert "item" param from * to const &. (dump_pretty_printer::stash_item): Convert "item" param from optinfo_ * to std::unique_ptr<optinfo_item>. (dump_pretty_printer::emit_item): Likewise. * dumpfile.cc: Include "make-unique.h". (make_item_for_dump_gimple_stmt): Replace uses of optinfo_item * with std::unique_ptr<optinfo_item>. (dump_context::dump_gimple_stmt): Likewise. (make_item_for_dump_gimple_expr): Likewise. (dump_context::dump_gimple_expr): Likewise. (make_item_for_dump_generic_expr): Likewise. (dump_context::dump_generic_expr): Likewise. (make_item_for_dump_symtab_node): Likewise. (dump_pretty_printer::emit_items): Likewise. (dump_pretty_printer::emit_any_pending_textual_chunks): Likewise. (dump_pretty_printer::emit_item): Likewise. (dump_pretty_printer::stash_item): Likewise. (dump_pretty_printer::decode_format): Likewise. (dump_context::dump_printf_va): Fix overlong line. (make_item_for_dump_dec): Replace uses of optinfo_item * with std::unique_ptr<optinfo_item>. (dump_context::dump_dec): Likewise. (dump_context::dump_symtab_node): Likewise. (dump_context::begin_scope): Likewise. (dump_context::emit_item): Likewise. * gimple-loop-interchange.cc: Define INCLUDE_MEMORY. * gimple-loop-jam.cc: Likewise. * gimple-loop-versioning.cc: Likewise. * graphite-dependences.cc: Likewise. * graphite-isl-ast-to-gimple.cc: Likewise. * graphite-optimize-isl.cc: Likewise. * graphite-poly.cc: Likewise. * graphite-scop-detection.cc: Likewise. * graphite-sese-to-poly.cc: Likewise. * graphite.cc: Likewise. * opt-problem.cc: Likewise. * optinfo.cc (optinfo::add_item): Convert "item" param from optinfo_ * to std::unique_ptr<optinfo_item>. (optinfo::emit_for_opt_problem): Update for change to dump_context::emit_item. * optinfo.h: Add #error to fail immediately if INCLUDE_MEMORY wasn't defined, rather than fail to find std::unique_ptr. (optinfo::add_item): Convert "item" param from optinfo_ * to std::unique_ptr<optinfo_item>. * sese.cc: Define INCLUDE_MEMORY. * targhooks.cc: Likewise. * tree-data-ref.cc: Likewise. * tree-if-conv.cc: Likewise. * tree-loop-distribution.cc: Likewise. * tree-parloops.cc: Likewise. * tree-predcom.cc: Likewise. * tree-ssa-live.cc: Likewise. * tree-ssa-loop-ivcanon.cc: Likewise. * tree-ssa-loop-ivopts.cc: Likewise. * tree-ssa-loop-prefetch.cc: Likewise. * tree-ssa-loop-unswitch.cc: Likewise. * tree-ssa-phiopt.cc: Likewise. * tree-ssa-threadbackward.cc: Likewise. * tree-ssa-threadupdate.cc: Likewise. * tree-vect-data-refs.cc: Likewise. * tree-vect-generic.cc: Likewise. * tree-vect-loop-manip.cc: Likewise. * tree-vect-loop.cc: Likewise. * tree-vect-patterns.cc: Likewise. * tree-vect-slp-patterns.cc: Likewise. * tree-vect-slp.cc: Likewise. * tree-vect-stmts.cc: Likewise. * tree-vectorizer.cc: Likewise. gcc/testsuite/ChangeLog: * gcc.dg/plugin/dump_plugin.c: Define INCLUDE_MEMORY. Signed-off-by: David Malcolm <dmalcolm@redhat.com>
2024-08-28 | aarch64: Assume zero gather/scatter set-up cost for -mtune=generic | Richard Sandiford | 1 file changed, -2/+2
generic_vector_cost is not currently used by any SVE target by default; it has to be specifically selected by -mtune=generic. Its SVE costing has historically been somewhat idealised, since it predated any actual SVE cores. This seems like a useful tradition to continue, at least for testing purposes. The ideal case is that gathers and scatters do not induce a specific one-off overhead. This patch therefore sets the gather/scatter init costs to zero. This patch is necessary to switch -mtune=generic over to the "new" vector costs. gcc/ * config/aarch64/tuning_models/generic.h (generic_sve_vector_cost): Set gather_load_x32_init_cost and gather_load_x64_init_cost to 0.
2024-08-28 | aarch64: Fix gather x32/x64 selection | Richard Sandiford | 1 file changed, -2/+5
The SVE gather and scatter costs are classified based on whether they do 4 loads per 128 bits (x32) or 2 loads per 128 bits (x64). The number after the "x" refers to the number of bits in each "container". However, the test for which to use was based on the element size rather than the container size. This meant that we'd use the overly conservative x32 costs for VNx2SI gathers. VNx2SI gathers are really .D gathers in which the upper half of each extension result is ignored. This patch is necessary to switch -mtune=generic over to the "new" vector costs. gcc/ * config/aarch64/aarch64.cc (aarch64_detect_vector_stmt_subtype) (aarch64_vector_costs::add_stmt_cost): Use the x64 cost rather than x32 cost for all VNx2 modes.
2024-08-23 | optabs-query: Use opt_machine_mode for smallest_int_mode_for_size [PR115495]. | Robin Dapp | 1 file changed, -2/+4
In get_best_extraction_insn we use smallest_int_mode_for_size with struct_bits as size argument. PR115495 has struct_bits = 256 and we don't have a mode for that. This patch makes smallest_mode_for_size and smallest_int_mode_for_size return opt modes so we can just skip over the loop when there is no mode. PR middle-end/115495 gcc/ChangeLog: * cfgexpand.cc (expand_debug_expr): Require mode. * combine.cc (make_extraction): Ditto. * config/aarch64/aarch64.cc (aarch64_expand_cpymem): Ditto. (aarch64_expand_setmem): Ditto. * config/arc/arc.cc (arc_expand_cpymem): Ditto. * config/arm/arm.cc (arm_expand_divmod_libfunc): Ditto. * config/i386/i386.cc (ix86_get_mask_mode): Ditto. * config/rs6000/predicates.md: Ditto. * config/rs6000/rs6000.cc (vspltis_constant): Ditto. * config/s390/s390.cc (s390_expand_insv): Ditto. * config/sparc/sparc.cc (assign_int_registers): Ditto. * coverage.cc (get_gcov_type): Ditto. (get_gcov_unsigned_t): Ditto. * dse.cc (find_shift_sequence): Ditto. * expmed.cc (store_integral_bit_field): Ditto. * expr.cc (convert_mode_scalar): Ditto. (op_by_pieces_d::smallest_fixed_size_mode_for_size): Ditto. (emit_block_move_via_oriented_loop): Ditto. (copy_blkmode_to_reg): Ditto. (store_field): Ditto. * internal-fn.cc (expand_arith_overflow): Ditto. * machmode.h (HAVE_MACHINE_MODES): Ditto. (smallest_mode_for_size): Use opt_machine_mode. (smallest_int_mode_for_size): Use opt_scalar_int_mode. * optabs-query.cc (get_best_extraction_insn): Require mode. * optabs.cc (expand_twoval_binop_libfunc): Ditto. * stor-layout.cc (smallest_mode_for_size): Return opt_machine_mode. (layout_type): Require mode. (initialize_sizetypes): Ditto. * tree-ssa-loop-manip.cc (canonicalize_loop_ivs): Ditto. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/pr115495.c: New test. gcc/ada/ChangeLog: * gcc-interface/utils2.cc (fast_modulo_reduction): Require mode. (nonbinary_modular_operation): Ditto.
2024-08-22 | PR target/116365: Add user-friendly arguments to --param aarch64-autovec-preference=N | Jennifer Schmitz | 3 files changed, -8/+45
The param aarch64-autovec-preference=N is a useful tool for testing auto-vectorisation in GCC as it allows the user to force a particular strategy. So far, N could be a numerical value between 0 and 4. This patch replaces the numerical values by more user-friendly names to distinguish the options. The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. Ok for mainline? Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com> gcc/ PR target/116365 * config/aarch64/aarch64-opts.h (enum aarch64_autovec_preference_enum): New enum. * config/aarch64/aarch64.cc (aarch64_cmp_autovec_modes): Change numerical to enum values. (aarch64_autovectorize_vector_modes): Change numerical to enum values. (aarch64_vector_costs::record_potential_advsimd_unrolling): Change numerical to enum values. * config/aarch64/aarch64.opt: Change param type to enum. * doc/invoke.texi: Update documentation. gcc/testsuite/ PR target/116365 * gcc.target/aarch64/autovec_param_asimd-only.c: New test. * gcc.target/aarch64/autovec_param_default.c: Likewise. * gcc.target/aarch64/autovec_param_prefer-asimd.c: Likewise. * gcc.target/aarch64/autovec_param_prefer-sve.c: Likewise. * gcc.target/aarch64/autovec_param_sve-only.c: Likewise. * gcc.target/aarch64/neoverse_v1_2.c: Update parameter value. * gcc.target/aarch64/neoverse_v1_3.c: Likewise. * gcc.target/aarch64/sve/cond_asrd_1.c: Likewise. * gcc.target/aarch64/sve/cond_cnot_4.c: Likewise. * gcc.target/aarch64/sve/cond_unary_5.c: Likewise. * gcc.target/aarch64/sve/cond_uxt_5.c: Likewise. * gcc.target/aarch64/sve/cond_xorsign_2.c: Likewise. * gcc.target/aarch64/sve/pr98268-1.c: Likewise. * gcc.target/aarch64/sve/pr98268-2.c: Likewise.
2024-08-21 | aarch64: Fix caller saves of VNx2QI [PR116238] | Richard Sandiford | 1 file changed, -3/+4
The testcase contains a VNx2QImode pseudo that is live across a call and that cannot be allocated a call-preserved register. LRA quite reasonably tried to save it before the call and restore it afterwards. Unfortunately, the target told it to do that in SImode, even though punning between SImode and VNx2QImode is disallowed by both TARGET_CAN_CHANGE_MODE_CLASS and TARGET_MODES_TIEABLE_P. The natural class to use for SImode is GENERAL_REGS, so this led to an unsalvageable situation in which we had: (set (subreg:VNx2QI (reg:SI A) 0) (reg:VNx2QI B)) where A needed GENERAL_REGS and B needed FP_REGS. We therefore ended up in a reload loop. The hooks above should ensure that this situation can never occur for incoming subregs. It only happened here because the target explicitly forced it. The decision to use SImode for modes smaller than 4 bytes dates back to the beginning of the port, before 16-bit floating-point modes existed. I'm not sure whether promoting to SImode really makes sense for any FPR, but that's a separate performance/QoI discussion. For now, this patch just disallows using SImode when it is wrong for correctness reasons, since that should be safer to backport. gcc/ PR testsuite/116238 * config/aarch64/aarch64.cc (aarch64_hard_regno_caller_save_mode): Only return SImode if we can convert to and from it. gcc/testsuite/ PR testsuite/116238 * gcc.target/aarch64/sve/pr116238.c: New test.
2024-08-21 | aarch64: Implement popcountti2 pattern [PR113042] | Andrew Pinski | 1 file changed, -0/+13
When CSSC is not enabled, 128bit popcount can be implemented just via the vector (v16qi) cnt instruction followed by a reduction, like how the 64bit one is currently implemented instead of splitting into 2 64bit popcount. Changes since v1: * v2: Make operand 0 be DImode instead of TImode and simplify. Build and tested for aarch64-linux-gnu. PR target/113042 gcc/ChangeLog: * config/aarch64/aarch64.md (popcountti2): New define_expand. gcc/testsuite/ChangeLog: * gcc.target/aarch64/popcnt10.c: New test. * gcc.target/aarch64/popcnt9.c: New test. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
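One way to reach a 128-bit popcount from C (an assumption on my part; the committed tests may use a different form) is the type-generic builtin available in recent GCC:

    unsigned
    popcount_u128 (unsigned __int128 x)
    {
      return __builtin_popcountg (x);   /* candidate for the new popcountti2 expander */
    }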
2024-08-19 | aarch64: Fix ls64 intrinsic availability | Andrew Carlotti | 2 files changed, -4/+8
The availability of ls64 intrinsics and data types were determined solely by the globally specified architecture features, which did not reflect any changes specified in target pragmas or attributes. This patch removes the initialisation-time guards for the intrinsics, and replaces them with checks at use time. We also get better error messages when ls64 is not available (matching the existing error messages for SVE intrinsics). The data512_t type is made always available; this is consistent with the present behaviour for Neon fp16/bf16 types. gcc/ChangeLog: PR target/112108 * config/aarch64/aarch64-builtins.cc (handle_arm_acle_h): Remove feature check at initialisation. (aarch64_general_check_builtin_call): Check ls64 intrinsics. * config/aarch64/arm_acle.h: (data512_t) Make always available. gcc/testsuite/ChangeLog: PR target/112108 * gcc.target/aarch64/acle/ls64_guard-1.c: New test. * gcc.target/aarch64/acle/ls64_guard-2.c: New test. * gcc.target/aarch64/acle/ls64_guard-3.c: New test. * gcc.target/aarch64/acle/ls64_guard-4.c: New test.
2024-08-19 | aarch64: Fix memtag intrinsic availability | Andrew Carlotti | 2 files changed, -33/+13
The availability of memtag intrinsics and data types were determined solely by the globally specified architecture features, which did not reflect any changes specified in target pragmas or attributes. This patch removes the initialisation-time guards for the intrinsics, and replaces them with checks at use time. It also removes the macro indirection from the header file - this simplifies the header, and allows the missing extension error reporting to find the user-facing intrinsic names. gcc/ChangeLog: PR target/112108 * config/aarch64/aarch64-builtins.cc (aarch64_init_memtag_builtins): Define intrinsic names directly. (aarch64_general_init_builtins): Move memtag intialisation... (handle_arm_acle_h): ...to here, and remove feature check. (aarch64_general_check_builtin_call): Check memtag intrinsics. * config/aarch64/arm_acle.h (__arm_mte_create_random_tag) (__arm_mte_exclude_tag, __arm_mte_ptrdiff) (__arm_mte_increment_tag, __arm_mte_set_tag, __arm_mte_get_tag): Remove. gcc/testsuite/ChangeLog: PR target/112108 * gcc.target/aarch64/acle/memtag_guard-1.c: New test. * gcc.target/aarch64/acle/memtag_guard-2.c: New test. * gcc.target/aarch64/acle/memtag_guard-3.c: New test. * gcc.target/aarch64/acle/memtag_guard-4.c: New test.
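A sketch of what the use-time check enables (the attribute string and prototype reflect my reading of the ACLE and may need adjusting): the intrinsic becomes usable in a function that turns on MTE locally, even if the global -march does not include +memtag.

    #include <arm_acle.h>

    __attribute__ ((target ("arch=armv8.5-a+memtag")))
    void *
    tag_pointer (void *p)
    {
      return __arm_mte_create_random_tag (p, 0);   /* no tags excluded */
    }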
2024-08-19 | aarch64: Fix tme intrinsic availability | Andrew Carlotti | 2 files changed, -59/+34
The availability of tme intrinsics was previously gated at both initialisation time (using global target options) and usage time (accounting for function-specific target options). This patch removes the check at initialisation time, and also moves the intrinsics out of the header file to allow for better error messages (matching the existing error messages for SVE intrinsics). gcc/ChangeLog: PR target/112108 * config/aarch64/aarch64-builtins.cc (aarch64_init_tme_builtins): Define intrinsic names directly. (aarch64_general_init_builtins): Move tme initialisation... (handle_arm_acle_h): ...to here, and remove feature check. (aarch64_general_check_builtin_call): Check tme intrinsics. * config/aarch64/arm_acle.h (__tstart, __tcommit, __tcancel) (__ttest): Remove. (_TMFAILURE_*): Define unconditionally. gcc/testsuite/ChangeLog: PR target/112108 * gcc.target/aarch64/acle/tme_guard-1.c: New test. * gcc.target/aarch64/acle/tme_guard-2.c: New test. * gcc.target/aarch64/acle/tme_guard-3.c: New test. * gcc.target/aarch64/acle/tme_guard-4.c: New test.
2024-08-19 | aarch64: Move check_required_extensions | Andrew Carlotti | 3 files changed, -103/+106
Move SVE extension checking functionality to aarch64-builtins.cc, so that it can be shared by non-SVE intrinsics. gcc/ChangeLog: * config/aarch64/aarch64-sve-builtins.cc (check_builtin_call) (expand_builtin): Update calls to the below. (report_missing_extension, report_missing_registers) (check_required_extensions): Move out of aarch64_sve namespace, rename, and move into... * config/aarch64/aarch64-builtins.cc (aarch64_report_missing_extension) (aarch64_report_missing_registers) (aarch64_check_required_extensions) ...here. * config/aarch64/aarch64-protos.h (aarch64_check_required_extensions): Add prototype.
2024-08-19 | aarch64: Refactor check_required_extensions | Andrew Carlotti | 1 file changed, -18/+20
Replace TARGET_GENERAL_REGS_ONLY check with an explicit check that aarch64_isa_flags enables all required extensions. This will be more flexible when repurposing this function for non-SVE intrinsics. gcc/ChangeLog: * config/aarch64/aarch64-sve-builtins.cc (check_required_registers): Remove target check and rename to... (report_missing_registers): ...this. (check_required_extensions): Refactor.
2024-08-19 | aarch64: Reduce FP reassociation width for Neoverse V2 and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA | Kyrylo Tkachov | 1 file changed, -3/+4
The fp reassociation width for Neoverse V2 was set to 6 since its introduction and I guess it was empirically tuned. But since AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA was added the tree reassociation pass seems to be more deliberate in forming FMAs and when that flag is used it seems to more properly evaluate the FMA vs non-FMA reassociation widths. According to the Neoverse V2 SWOG the core has a throughput of 4 for most FP operations, so the value 6 is not accurate anyway. Also, the SWOG does state that FMADD operations are pipelined and the results can be forwarded from FP multiplies to the accumulation operands of FMADD instructions, which seems to be what AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA expresses. This patch sets the fp_reassoc_width field to 4 and enables AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA for -mcpu=neoverse-v2. On SPEC2017 fprate I see the following changes on a Grace system: 503.bwaves_r 0.16% 507.cactuBSSN_r -0.32% 508.namd_r 3.04% 510.parest_r 0.00% 511.povray_r 0.78% 519.lbm_r 0.35% 521.wrf_r 0.69% 526.blender_r -0.53% 527.cam4_r 0.84% 538.imagick_r 0.00% 544.nab_r -0.97% 549.fotonik3d_r -0.45% 554.roms_r 0.97% Geomean 0.35% with -Ofast -mcpu=grace -flto. So slight overall improvement with a meaningful improvement in 508.namd_r. I think other tunings in aarch64 should look into AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA as well, but I'll leave the benchmarking to someone else. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> gcc/ChangeLog: * config/aarch64/tuning_models/neoversev2.h (fp_reassoc_width): Set to 4. (tune_flags): Add AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.
2024-08-19 | aarch64: Implement 16-byte vector mode const0 store by TImode | Haochen Gui | 1 file changed, -1/+10
gcc/ * config/aarch64/aarch64-simd.md (mov<mode> for VSTRUCT_QD): Expand 16-byte vector mode const0 store by TImode.
2024-08-15 | aarch64: Improve popcount for bytes [PR113042] | Andrew Pinski | 1 file changed, -13/+24
For popcount for bytes, we don't need the reduction addition after the vector cnt instruction as we are only counting one byte's popcount. This changes the popcount extend to cover all ALLI rather than GPI. Changes since v1: * v2 - Use ALLI iterator and combine all into one pattern. Add new testcases popcnt[6-8].c. * v3 - Simplify TARGET_CSSC path. Use convert_to_mode instead of gen_zero_extend* directly. Some other small cleanups. Bootstrapped and tested on aarch64-linux-gnu with no regressions. PR target/113042 gcc/ChangeLog: * config/aarch64/aarch64.md (popcount<mode>2): Update pattern to support ALLI modes. gcc/testsuite/ChangeLog: * gcc.target/aarch64/popcnt5.c: New test. * gcc.target/aarch64/popcnt6.c: New test. * gcc.target/aarch64/popcnt7.c: New test. * gcc.target/aarch64/popcnt8.c: New test. Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
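The narrow case this targets, for illustration (not the committed testcase): the input is a single byte, so the vector CNT result needs no reduction step.

    unsigned
    popcount_u8 (unsigned char x)
    {
      return __builtin_popcount (x);   /* at most 8 bits set */
    }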
2024-08-15 | aarch64: Rename svpext to svpext_lane [PR116371] | Richard Sandiford | 3 files changed, -4/+4
When implementing the SME2 ACLE, I somehow missed off the _lane suffix on svpext. gcc/ PR target/116371 * config/aarch64/aarch64-sve-builtins-sve2.h (svpext): Rename to... (svpext_lane): ...this. * config/aarch64/aarch64-sve-builtins-sve2.cc (svpext_impl): Rename to... (svpext_lane_impl): ...this and update instantiation accordingly. * config/aarch64/aarch64-sve-builtins-sve2.def (svpext): Rename to... (svpext_lane): ...this. gcc/testsuite/ PR target/116371 * gcc.target/aarch64/sme2/acle-asm/pext_c16.c, gcc.target/aarch64/sme2/acle-asm/pext_c16_x2.c, gcc.target/aarch64/sme2/acle-asm/pext_c32.c, gcc.target/aarch64/sme2/acle-asm/pext_c32_x2.c, gcc.target/aarch64/sme2/acle-asm/pext_c64.c, gcc.target/aarch64/sme2/acle-asm/pext_c64_x2.c, gcc.target/aarch64/sme2/acle-asm/pext_c8.c, gcc.target/aarch64/sme2/acle-asm/pext_c8_x2.c: Replace with... * gcc.target/aarch64/sme2/acle-asm/pext_lane_c16.c, gcc.target/aarch64/sme2/acle-asm/pext_lane_c16_x2.c, gcc.target/aarch64/sme2/acle-asm/pext_lane_c32.c, gcc.target/aarch64/sme2/acle-asm/pext_lane_c32_x2.c, gcc.target/aarch64/sme2/acle-asm/pext_lane_c64.c, gcc.target/aarch64/sme2/acle-asm/pext_lane_c64_x2.c, gcc.target/aarch64/sme2/acle-asm/pext_lane_c8.c, gcc.target/aarch64/sme2/acle-asm/pext_lane_c8_x2.c: ...these new tests, testing for svpext_lane instead of svpext.
2024-08-12 | aarch64: Emit ADD X, Y, Y instead of SHL X, Y, #1 for Advanced SIMD | Kyrylo Tkachov | 2 files changed, -5/+13
On many cores, including Neoverse V2, the throughput of vector ADD instructions is higher than vector shifts like SHL. We can lean on that to emit code like: add v0.4s, v0.4s, v0.4s instead of: shl v0.4s, v0.4s, 1 LLVM already does this trick. In RTL the code gets canonicalised from (plus x x) to (ashift x 1) so I opted to instead do this at the final assembly printing stage, similar to how we emit CMLT instead of SSHR elsewhere in the backend. I'd like to also do this for SVE shifts, but those will have to be separate patches. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> gcc/ChangeLog: * config/aarch64/aarch64-simd.md (aarch64_simd_imm_shl<mode><vczle><vczbe>): Rewrite to new syntax. Add =w,w,vs1 alternative. * config/aarch64/constraints.md (vs1): New constraint. gcc/testsuite/ChangeLog: * gcc.target/aarch64/advsimd_shl_add.c: New test.
2024-08-09 | aarch64: Check CONSTM1_RTX in definition of Dm constraint | Kyrylo Tkachov | 1 file changed, -1/+1
The constraint Dm is intended to match vectors of minus 1, but actually checks for CONST1_RTX. This doesn't have a bad effect in practice as its only use is in the aarch64_wrffr pattern for the setffr instruction, which is a VNx16BI operation where -1 and 1 are the same. That pattern can currently only be generated through intrinsics that create it with a CONSTM1_RTX constant anyway. Fix the constraint definition so that it doesn't become a footgun if it's used in some other pattern. Bootstrapped and tested on aarch64-none-linux-gnu. Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com> gcc/ChangeLog: * config/aarch64/constraints.md (Dm): Match CONSTM1_RTX rather than CONST1_RTX.
2024-08-08 | AArch64: Fix signbit mask creation after late combine [PR116229] | Tamar Christina | 3 files changed, -2/+12
The optimization to generate a DI signbit constant by using fneg was relying on nothing being able to push the constant into the negate. It's run quite late for this reason. However, late combine now runs after it and triggers RTL simplification based on the neg. With -fno-signed-zeros this ends up dropping the - from the -0.0 and thus producing incorrect code. This change adds a new unspec FNEG on DI mode which prevents this simplification. gcc/ChangeLog: PR target/116229 * config/aarch64/aarch64-simd.md (aarch64_fnegv2di2<vczle><vczbe>): New. * config/aarch64/aarch64.cc (aarch64_maybe_generate_simd_constant): Update call to gen_aarch64_fnegv2di2. * config/aarch64/iterators.md: New UNSPEC_FNEG. gcc/testsuite/ChangeLog: PR target/116229 * gcc.target/aarch64/pr116229.c: New test.
2024-08-06 | AArch64: take gather/scatter decode overhead into account | Tamar Christina | 14 files changed, -0/+60
Gather and scatters are not usually beneficial when the loop count is small. This is because there's not only a cost to their execution within the loop but there is also some cost to enter loops with them. As such this patch models this overhead. For generic tuning we however still prefer gathers/scatters when the loop costs work out. gcc/ChangeLog: * config/aarch64/aarch64-protos.h (struct sve_vec_cost): Add gather_load_x32_init_cost and gather_load_x64_init_cost. * config/aarch64/aarch64.cc (aarch64_vector_costs): Add m_sve_gather_scatter_init_cost. (aarch64_vector_costs::add_stmt_cost): Use them. (aarch64_vector_costs::finish_cost): Likewise. * config/aarch64/tuning_models/a64fx.h: Update. * config/aarch64/tuning_models/cortexx925.h: Update. * config/aarch64/tuning_models/generic.h: Update. * config/aarch64/tuning_models/generic_armv8_a.h: Update. * config/aarch64/tuning_models/generic_armv9_a.h: Update. * config/aarch64/tuning_models/neoverse512tvb.h: Update. * config/aarch64/tuning_models/neoversen2.h: Update. * config/aarch64/tuning_models/neoversen3.h: Update. * config/aarch64/tuning_models/neoversev1.h: Update. * config/aarch64/tuning_models/neoversev2.h: Update. * config/aarch64/tuning_models/neoversev3.h: Update. * config/aarch64/tuning_models/neoversev3ae.h: Update.
2024-08-05 | AArch64: Set instruction attribute of TST to logics_imm | Jennifer Schmitz | 1 file changed, -1/+1
As suggested in https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658249.html, this patch changes the instruction attribute of "*and<mode>_compare0" (TST) from alus_imm to logics_imm. The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. OK for mainline? Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com> gcc/ * config/aarch64/aarch64.md (*and<mode>_compare0): Change attribute.
2024-08-02 | AArch64: Fuse CMP+CSEL and CMP+CSET for -mcpu=neoverse-v2 | Jennifer Schmitz | 3 files changed, -1/+26
According to the Neoverse V2 Software Optimization Guide (section 4.14), the instruction pairs CMP+CSEL and CMP+CSET can be fused, which had not been implemented so far. This patch implements and tests the two fusion pairs. The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression. There was also no non-noise impact on SPEC CPU2017 benchmark. OK for mainline? Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com> gcc/ * config/aarch64/aarch64.cc (aarch_macro_fusion_pair_p): Implement fusion logic. * config/aarch64/aarch64-fusion-pairs.def (cmp+csel): New entry. (cmp+cset): Likewise. * config/aarch64/tuning_models/neoversev2.h: Enable logic in field fusible_ops. gcc/testsuite/ * gcc.target/aarch64/fuse_cmp_csel.c: New test. * gcc.target/aarch64/fuse_cmp_cset.c: Likewise.
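Illustrative sources for the two pairs (whether they actually fuse depends on -mcpu=neoverse-v2 and the surrounding code):

    int
    select_max (int a, int b)
    {
      return a > b ? a : b;   /* typically cmp + csel */
    }

    int
    is_positive (int a)
    {
      return a > 0;           /* typically cmp + cset */
    }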
2024-08-01 | aarch64: Improve Advanced SIMD popcount expansion by using SVE [PR113860] | Pengxuan Zheng | 3 files changed, -6/+21
This patch improves the Advanced SIMD popcount expansion by using SVE if available. For example, GCC currently generates the following code sequence for V2DI: cnt v31.16b, v31.16b uaddlp v31.8h, v31.16b uaddlp v31.4s, v31.8h uaddlp v31.2d, v31.4s However, by using SVE, we can generate the following sequence instead: ptrue p7.b, all cnt z31.d, p7/m, z31.d Similar improvements can be made for V4HI, V8HI, V2SI and V4SI too. The scalar popcount expansion can also be improved similarly by using SVE and those changes will be included in a separate patch. PR target/113860 gcc/ChangeLog: * config/aarch64/aarch64-simd.md (popcount<mode>2): Add TARGET_SVE support. * config/aarch64/aarch64-sve.md (@aarch64_pred_<optab><mode>): Use new iterator SVE_VDQ_I. * config/aarch64/iterators.md (SVE_VDQ_I): New mode iterator. (VPRED): Add V8QI, V16QI, V4HI, V8HI and V2SI. gcc/testsuite/ChangeLog: * gcc.target/aarch64/popcnt-sve.c: New test. Signed-off-by: Pengxuan Zheng <quic_pzheng@quicinc.com>
2024-08-01 | AArch64: Add Cortex-X925 core definition and cost model | Tamar Christina | 4 files changed, -1/+249
This adds a cost model and core definition for Cortex-X925. gcc/ChangeLog: * config/aarch64/aarch64-cores.def (cortex-x925): New. * config/aarch64/aarch64-tune.md: Regenerate. * config/aarch64/tuning_models/cortexx925.h: New file. * config/aarch64/aarch64.cc: Use it. * doc/invoke.texi: Document it.
2024-08-01 | AArch64: Update Neoverse N2 cost model to release costs | Tamar Christina | 1 file changed, -23/+23
This updates the cost for Neoverse N2 to reflect the updated Software Optimization Guide. gcc/ChangeLog: * config/aarch64/tuning_models/neoversen2.h: Update costs.
2024-08-01 | AArch64: Update Generic Armv9-a cost model to release costs | Tamar Christina | 1 file changed, -25/+25
This updates the costs for generic_armv9_a based on the updated costs for Neoverse V2 and Neoverse N2. gcc/ChangeLog: * config/aarch64/tuning_models/generic_armv9_a.h: Update costs.