2023-04-20  tree-vect-patterns: Pattern recognize ctz or ffs using clz, popcount or ctz [PR109011]
Jakub Jelinek | 6 files changed, +442/-24

The following patch allows vectorizing __builtin_ffs*/.FFS even if we just
have vector .CTZ support, or __builtin_ffs*/.FFS/__builtin_ctz*/.CTZ if we
just have vector .CLZ or .POPCOUNT support.  It uses various expansions from
the Hacker's Delight book as well as GCC's own expansion, in particular:

    .CTZ (X) = PREC - .CLZ ((X - 1) & ~X)
    .CTZ (X) = .POPCOUNT ((X - 1) & ~X)
    .CTZ (X) = (PREC - 1) - .CLZ (X & -X)
    .FFS (X) = PREC - .CLZ (X & -X)
    .CTZ (X) = PREC - .POPCOUNT (X | -X)
    .FFS (X) = (PREC + 1) - .POPCOUNT (X | -X)
    .FFS (X) = .CTZ (X) + 1

The first form can only be used if both CTZ and CLZ have a value defined at
zero (kind 2) and both have the value PREC there.  For the other forms, if
the original has a value defined at zero and the replacement doesn't, or
doesn't have the matching value for that case, a COND_EXPR is added
afterwards.  The patch also modifies vect_recog_popcount_clz_ctz_ffs_pattern
so that the two can work together.

2023-04-20  Jakub Jelinek  <jakub@redhat.com>

    PR tree-optimization/109011
    * tree-vect-patterns.cc (vect_recog_ctz_ffs_pattern): New function.
    (vect_recog_popcount_clz_ctz_ffs_pattern): Move vect_pattern_detected
    call later.  Don't punt for IFN_CTZ or IFN_FFS if it doesn't have
    direct optab support, but has instead IFN_CLZ, IFN_POPCOUNT or, for
    IFN_FFS, IFN_CTZ support; use vect_recog_ctz_ffs_pattern for that
    case.
    (vect_vect_recog_func_ptrs): Add ctz_ffs entry.
    * gcc.dg/vect/pr109011-1.c: Remove -mpower9-vector from
    dg-additional-options.
    (baz, qux): Remove functions and corresponding dg-final.
    * gcc.dg/vect/pr109011-2.c: New test.
    * gcc.dg/vect/pr109011-3.c: New test.
    * gcc.dg/vect/pr109011-4.c: New test.
    * gcc.dg/vect/pr109011-5.c: New test.

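To make the identities concrete, here is a scalar sketch of two of them (an
illustration, not part of the patch), assuming PREC == 32 and, for the CLZ
form, a nonzero input; the COND_EXPR fix-up described above covers the value
at zero:

    /* .CTZ (X) = .POPCOUNT ((X - 1) & ~X): for x != 0, (x - 1) & ~x has
       exactly ctz (x) bits set.  */
    int ctz_via_popcount (unsigned x) { return __builtin_popcount ((x - 1) & ~x); }

    /* .FFS (X) = PREC - .CLZ (X & -X): x & -x isolates the lowest set bit.  */
    int ffs_via_clz (unsigned x) { return 32 - __builtin_clz (x & -x); }
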
2023-04-20  Remove duplicate DFS walks from DF init
Richard Biener | 1 file changed, +0/-5

The following removes unused CFG order computes from
rest_of_handle_df_initialize; the CFG orders are computed from
df_analyze ().  This also removes code duplication that would have to be
kept in sync.

    * df-core.cc (rest_of_handle_df_initialize): Remove computation of
    df->postorder, df->postorder_inverted and df->n_blocks.

2023-04-20  testsuite: Fix up g++.dg/ext/int128-8.C testcase [PR109560]
Jakub Jelinek | 1 file changed, +1/-1

The testcase needs to be restricted to int128 effective targets; it
expectedly fails on i386 and other 32-bit targets.

2023-04-20  Jakub Jelinek  <jakub@redhat.com>

    PR c++/108099
    PR testsuite/109560
    * g++.dg/ext/int128-8.C: Require int128 effective target.

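For reference, such a restriction is typically spelled with an
effective-target selector in the test header; a sketch (the exact directive
used in the committed test may differ):

    // { dg-do compile { target int128 } }
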
2023-04-20  PR testsuite/106879 FAIL: gcc.dg/vect/bb-slp-layout-19.c on powerpc64
Jiufu Guo | 1 file changed, +6/-1

On P7 the option -mno-allow-movmisalign is added during testing, which
prevents SLP from happening on this case.  Like PR65484 and PR87306, this
patch uses vect_hw_misalign to guard the case on powerpc targets.

gcc/testsuite/ChangeLog:

    PR testsuite/106879
    * gcc.dg/vect/bb-slp-layout-19.c: Modify to guard the check with
    vect_hw_misalign on POWERs.

2023-04-20  i386: Share AES xmm intrin with VAES
Haochen Jiang | 9 files changed, +75/-63

Currently in GCC, the 128-bit intrinsics for the instructions
vaes{enc,dec}{last,} are under the AES ISA.  Because there is no dependency
between the ISA sets AES and VAES, the 128-bit intrinsics are not available
when we use the compiler flags -mvaes -mavx512vl, and there is no other way
to use them.  But they should be available, according to the Intel SDM.

Although VAES aims to be a VEX/EVEX promotion of AES, it is only part of it.
Therefore, we share the AES xmm intrinsics with VAES.  Also, since -mvaes
indicates that we could use VEX encoding for ymm, VAES should imply AVX.

gcc/ChangeLog:

    * common/config/i386/i386-common.cc (OPTION_MASK_ISA2_AVX_UNSET):
    Add OPTION_MASK_ISA2_VAES_UNSET.
    (ix86_handle_option): Set AVX flag for VAES.
    * config/i386/i386-builtins.cc (ix86_init_mmx_sse_builtins):
    Add OPTION_MASK_ISA2_VAES_UNSET.
    (def_builtin): Share builtin between AES and VAES.
    * config/i386/i386-expand.cc (ix86_check_builtin_isa_match): Ditto.
    * config/i386/i386.md (aes): New isa attribute.
    * config/i386/sse.md (aesenc): Add pattern for VAES with xmm.
    (aesenclast): Ditto.
    (aesdec): Ditto.
    (aesdeclast): Ditto.
    * config/i386/vaesintrin.h: Remove redundant avx target push.
    * config/i386/wmmintrin.h (_mm_aesdec_si128): Change to macro.
    (_mm_aesdeclast_si128): Ditto.
    (_mm_aesenc_si128): Ditto.
    (_mm_aesenclast_si128): Ditto.

gcc/testsuite/ChangeLog:

    * gcc.target/i386/avx512fvl-vaes-1.c: Add VAES xmm test.
    * gcc.target/i386/pr109117-1.c: Modify error message.

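A sketch of what the sharing enables, assuming a compiler with this change:
the xmm intrinsic is usable with -mvaes -mavx512vl alone, without -maes.

    #include <immintrin.h>

    /* One AES encryption round on an xmm register; expected to compile
       with just -mvaes -mavx512vl after this change.  */
    __m128i
    aes_round (__m128i block, __m128i round_key)
    {
      return _mm_aesenc_si128 (block, round_key);
    }
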
2023-04-20  Add reduce_*_ep[i|u][8|16] series intrinsics
Hu, Lin1 | 3 files changed, +809/-0

gcc/ChangeLog:

    * config/i386/avx2intrin.h (_MM_REDUCE_OPERATOR_BASIC_EPI16)
    (_MM_REDUCE_OPERATOR_MAX_MIN_EP16, _MM256_REDUCE_OPERATOR_BASIC_EPI16)
    (_MM256_REDUCE_OPERATOR_MAX_MIN_EP16, _MM_REDUCE_OPERATOR_BASIC_EPI8)
    (_MM_REDUCE_OPERATOR_MAX_MIN_EP8, _MM256_REDUCE_OPERATOR_BASIC_EPI8)
    (_MM256_REDUCE_OPERATOR_MAX_MIN_EP8): New macros.
    (_mm_reduce_add_epi16, _mm_reduce_mul_epi16, _mm_reduce_and_epi16)
    (_mm_reduce_or_epi16, _mm_reduce_max_epi16, _mm_reduce_max_epu16)
    (_mm_reduce_min_epi16, _mm_reduce_min_epu16, _mm256_reduce_add_epi16)
    (_mm256_reduce_mul_epi16, _mm256_reduce_and_epi16)
    (_mm256_reduce_or_epi16, _mm256_reduce_max_epi16)
    (_mm256_reduce_max_epu16, _mm256_reduce_min_epi16)
    (_mm256_reduce_min_epu16, _mm_reduce_add_epi8, _mm_reduce_mul_epi8)
    (_mm_reduce_and_epi8, _mm_reduce_or_epi8, _mm_reduce_max_epi8)
    (_mm_reduce_max_epu8, _mm_reduce_min_epi8, _mm_reduce_min_epu8)
    (_mm256_reduce_add_epi8, _mm256_reduce_mul_epi8)
    (_mm256_reduce_and_epi8, _mm256_reduce_or_epi8, _mm256_reduce_max_epi8)
    (_mm256_reduce_max_epu8, _mm256_reduce_min_epi8)
    (_mm256_reduce_min_epu8): New intrinsics.
    * config/i386/avx512vlbwintrin.h (_mm_mask_reduce_add_epi16)
    (_mm_mask_reduce_mul_epi16, _mm_mask_reduce_and_epi16)
    (_mm_mask_reduce_or_epi16, _mm_mask_reduce_max_epi16)
    (_mm_mask_reduce_max_epu16, _mm_mask_reduce_min_epi16)
    (_mm_mask_reduce_min_epu16, _mm256_mask_reduce_add_epi16)
    (_mm256_mask_reduce_mul_epi16, _mm256_mask_reduce_and_epi16)
    (_mm256_mask_reduce_or_epi16, _mm256_mask_reduce_max_epi16)
    (_mm256_mask_reduce_max_epu16, _mm256_mask_reduce_min_epi16)
    (_mm256_mask_reduce_min_epu16, _mm_mask_reduce_add_epi8)
    (_mm_mask_reduce_mul_epi8, _mm_mask_reduce_and_epi8)
    (_mm_mask_reduce_or_epi8, _mm_mask_reduce_max_epi8)
    (_mm_mask_reduce_max_epu8, _mm_mask_reduce_min_epi8)
    (_mm_mask_reduce_min_epu8, _mm256_mask_reduce_add_epi8)
    (_mm256_mask_reduce_mul_epi8, _mm256_mask_reduce_and_epi8)
    (_mm256_mask_reduce_or_epi8, _mm256_mask_reduce_max_epi8)
    (_mm256_mask_reduce_max_epu8, _mm256_mask_reduce_min_epi8)
    (_mm256_mask_reduce_min_epu8): Ditto.

gcc/testsuite/ChangeLog:

    * gcc.target/i386/avx512vlbw-reduce-op-1.c: New test.

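A usage sketch for one of the new intrinsics (names taken from the
ChangeLog above; the unmasked forms live in avx2intrin.h, the masked forms
in avx512vlbwintrin.h):

    #include <immintrin.h>

    /* Horizontal sum of the sixteen 16-bit lanes of a 256-bit vector.  */
    short
    sum_lanes (__m256i v)
    {
      return _mm256_reduce_add_epi16 (v);
    }
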
2023-04-20  i386: Add PCLMUL dependency for VPCLMULQDQ
Haochen Jiang | 5 files changed, +20/-11

Currently in GCC, the 128-bit intrinsic for the instruction vpclmulqdq is
under the PCLMUL ISA.  Because there is no dependency between the ISA sets
PCLMUL and VPCLMULQDQ, the 128-bit intrinsic is not available when we just
use the compiler flag -mvpclmulqdq.  But it should be, according to the
Intel SDM.

Since VPCLMULQDQ is a VEX/EVEX promotion of PCLMUL, it is natural to add a
dependency between them.  Also, with -mvpclmulqdq we can use ymm under VEX
encoding, so VPCLMULQDQ should imply AVX.

gcc/ChangeLog:

    * common/config/i386/i386-common.cc (OPTION_MASK_ISA_VPCLMULQDQ_SET):
    Add OPTION_MASK_ISA_PCLMUL_SET and OPTION_MASK_ISA_AVX_SET.
    (OPTION_MASK_ISA_AVX_UNSET): Add OPTION_MASK_ISA_VPCLMULQDQ_UNSET.
    (OPTION_MASK_ISA_PCLMUL_UNSET): Ditto.
    * config/i386/i386.md (vpclmulqdqvl): New.
    * config/i386/sse.md (pclmulqdq): Add evex encoding.
    * config/i386/vpclmulqdqintrin.h: Remove redundant avx target push.

gcc/testsuite/ChangeLog:

    * gcc.target/i386/vpclmulqdq.c: Add compile test for xmm.

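A sketch of the newly working combination, assuming this patch is applied:
-mvpclmulqdq alone now implies PCLMUL, so the xmm form compiles.

    #include <immintrin.h>

    /* Carry-less multiply of the low 64-bit halves of a and b.  */
    __m128i
    clmul_low (__m128i a, __m128i b)
    {
      return _mm_clmulepi64_si128 (a, b, 0x00);
    }
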
2023-04-20  i386: Fix vpblendm{b,w} intrins and insns
Haochen Jiang | 3 files changed, +115/-178

vpblendm{b,w} do not actually take constant parameters, so there is no need
for the intrinsics to be wrapped in __OPTIMIZE__.  Also, we should check
TARGET_AVX512VL for 128/256-bit vectors.

gcc/ChangeLog:

    * config/i386/avx512vlbwintrin.h (_mm_mask_blend_epi16): Remove
    __OPTIMIZE__ wrapper.
    (_mm_mask_blend_epi8): Ditto.
    (_mm256_mask_blend_epi16): Ditto.
    (_mm256_mask_blend_epi8): Ditto.
    * config/i386/avx512vlintrin.h (_mm256_mask_blend_pd): Ditto.
    (_mm256_mask_blend_ps): Ditto.
    (_mm256_mask_blend_epi64): Ditto.
    (_mm256_mask_blend_epi32): Ditto.
    (_mm_mask_blend_pd): Ditto.
    (_mm_mask_blend_ps): Ditto.
    (_mm_mask_blend_epi64): Ditto.
    (_mm_mask_blend_epi32): Ditto.
    * config/i386/sse.md (VF_AVX512BWHFBF16): Removed.
    (VF_AVX512HFBFVL): Move it before the first usage.
    (<avx512>_blendm<mode>): Change iterator from VF_AVX512BWHFBF16 to
    VF_AVX512HFBFVL.

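The point about parameters, sketched: the blend mask is an ordinary runtime
argument rather than an immediate, so no constant-checking
(__OPTIMIZE__-only) definition is needed.

    #include <immintrin.h>

    /* Blend 16-bit lanes of a and b under a runtime mask k
       (needs AVX512BW and AVX512VL).  */
    __m128i
    blend16 (__mmask8 k, __m128i a, __m128i b)
    {
      return _mm_mask_blend_epi16 (k, a, b);
    }
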
2023-04-20  i386: Add AVX512BW dependency to AVX512VBMI2
Haochen Jiang | 57 files changed, +106/-160

gcc/ChangeLog:

    * common/config/i386/i386-common.cc (OPTION_MASK_ISA_AVX512VBMI2_SET):
    Change OPTION_MASK_ISA_AVX512F_SET to OPTION_MASK_ISA_AVX512BW_SET.
    (OPTION_MASK_ISA_AVX512F_UNSET): Remove
    OPTION_MASK_ISA_AVX512VBMI2_UNSET.
    (OPTION_MASK_ISA_AVX512BW_UNSET): Add
    OPTION_MASK_ISA_AVX512VBMI2_UNSET.
    * config/i386/avx512vbmi2intrin.h: Do not push avx512bw.
    * config/i386/avx512vbmi2vlintrin.h: Ditto.
    * config/i386/i386-builtin.def: Remove OPTION_MASK_ISA_AVX512BW.
    * config/i386/sse.md (VI12_AVX512VLBW): Removed.
    (VI12_VI48F_AVX512VLBW): Rename to VI12_VI48F_AVX512VL.
    (compress<mode>_mask): Change iterator from VI12_AVX512VLBW to
    VI12_AVX512VL.
    (compressstore<mode>_mask): Ditto.
    (expand<mode>_mask): Ditto.
    (expand<mode>_maskz): Ditto.
    (*expand<mode>_mask): Change iterator from VI12_VI48F_AVX512VLBW to
    VI12_VI48F_AVX512VL.

gcc/testsuite/ChangeLog:

    * gcc.target/i386/avx512bw-pr100267-1.c: Remove avx512f and avx512bw.
    * gcc.target/i386/avx512bw-pr100267-b-2.c: Ditto.
    * gcc.target/i386/avx512bw-pr100267-d-2.c: Ditto.
    * gcc.target/i386/avx512bw-pr100267-q-2.c: Ditto.
    * gcc.target/i386/avx512bw-pr100267-w-2.c: Ditto.
    * gcc.target/i386/avx512f-vpcompressb-1.c: Ditto.
    * gcc.target/i386/avx512f-vpcompressb-2.c: Ditto.
    * gcc.target/i386/avx512f-vpcompressw-1.c: Ditto.
    * gcc.target/i386/avx512f-vpcompressw-2.c: Ditto.
    * gcc.target/i386/avx512f-vpexpandb-1.c: Ditto.
    * gcc.target/i386/avx512f-vpexpandb-2.c: Ditto.
    * gcc.target/i386/avx512f-vpexpandw-1.c: Ditto.
    * gcc.target/i386/avx512f-vpexpandw-2.c: Ditto.
    * gcc.target/i386/avx512f-vpshld-1.c: Ditto.
    * gcc.target/i386/avx512f-vpshldd-2.c: Ditto.
    * gcc.target/i386/avx512f-vpshldq-2.c: Ditto.
    * gcc.target/i386/avx512f-vpshldv-1.c: Ditto.
    * gcc.target/i386/avx512f-vpshldvd-2.c: Ditto.
    * gcc.target/i386/avx512f-vpshldvq-2.c: Ditto.
    * gcc.target/i386/avx512f-vpshldvw-2.c: Ditto.
    * gcc.target/i386/avx512f-vpshrdd-2.c: Ditto.
    * gcc.target/i386/avx512f-vpshrdq-2.c: Ditto.
    * gcc.target/i386/avx512f-vpshrdv-1.c: Ditto.
    * gcc.target/i386/avx512f-vpshrdvd-2.c: Ditto.
    * gcc.target/i386/avx512f-vpshrdvq-2.c: Ditto.
    * gcc.target/i386/avx512f-vpshrdvw-2.c: Ditto.
    * gcc.target/i386/avx512f-vpshrdw-2.c: Ditto.
    * gcc.target/i386/avx512vbmi2-vpshld-1.c: Ditto.
    * gcc.target/i386/avx512vbmi2-vpshrd-1.c: Ditto.
    * gcc.target/i386/avx512vl-vpcompressb-1.c: Ditto.
    * gcc.target/i386/avx512vl-vpcompressb-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpcompressw-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpexpandb-1.c: Ditto.
    * gcc.target/i386/avx512vl-vpexpandb-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpexpandw-1.c: Ditto.
    * gcc.target/i386/avx512vl-vpexpandw-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpshldd-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpshldq-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpshldv-1.c: Ditto.
    * gcc.target/i386/avx512vl-vpshldvd-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpshldvq-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpshldvw-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpshrdd-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpshrdq-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpshrdv-1.c: Ditto.
    * gcc.target/i386/avx512vl-vpshrdvd-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpshrdvq-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpshrdvw-2.c: Ditto.
    * gcc.target/i386/avx512vl-vpshrdw-2.c: Ditto.
    * gcc.target/i386/avx512vlbw-pr100267-1.c: Ditto.
    * gcc.target/i386/avx512vlbw-pr100267-b-2.c: Ditto.
    * gcc.target/i386/avx512vlbw-pr100267-w-2.c: Ditto.

2023-04-20  i386: Add AVX512BW dependency to AVX512BITALG
Haochen Jiang | 17 files changed, +32/-63

Since some of the AVX512BITALG intrinsics use 32/64-bit masks, AVX512BW
should be implied.

gcc/ChangeLog:

    * common/config/i386/i386-common.cc (OPTION_MASK_ISA_AVX512BITALG_SET):
    Change OPTION_MASK_ISA_AVX512F_SET to OPTION_MASK_ISA_AVX512BW_SET.
    (OPTION_MASK_ISA_AVX512F_UNSET): Remove
    OPTION_MASK_ISA_AVX512BITALG_SET.
    (OPTION_MASK_ISA_AVX512BW_UNSET): Add
    OPTION_MASK_ISA_AVX512BITALG_SET.
    * config/i386/avx512bitalgintrin.h: Do not push avx512bw.
    * config/i386/i386-builtin.def: Remove redundant
    OPTION_MASK_ISA_AVX512BW.
    * config/i386/sse.md (VI1_AVX512VLBW): Removed.
    (avx512vl_vpshufbitqmb<mode><mask_scalar_merge_name>): Change the
    iterator from VI1_AVX512VLBW to VI1_AVX512VL.

gcc/testsuite/ChangeLog:

    * gcc.target/i386/avx512bitalg-vpopcntb-1.c: Remove avx512bw.
    * gcc.target/i386/avx512bitalg-vpopcntb.c: Ditto.
    * gcc.target/i386/avx512bitalg-vpopcntbvl.c: Ditto.
    * gcc.target/i386/avx512bitalg-vpopcntw-1.c: Ditto.
    * gcc.target/i386/avx512bitalg-vpopcntw.c: Ditto.
    * gcc.target/i386/avx512bitalg-vpopcntwvl.c: Ditto.
    * gcc.target/i386/avx512bitalg-vpshufbitqmb-1.c: Ditto.
    * gcc.target/i386/avx512bitalg-vpshufbitqmb.c: Ditto.
    * gcc.target/i386/avx512bitalgvl-vpopcntb-1.c: Ditto.
    * gcc.target/i386/avx512bitalgvl-vpopcntw-1.c: Ditto.
    * gcc.target/i386/avx512bitalgvl-vpshufbitqmb-1.c: Ditto.
    * gcc.target/i386/pr93696-1.c: Ditto.
    * gcc.target/i386/pr93696-2.c: Ditto.

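An illustration of the mask-width point (a sketch): the 256-bit byte form
takes a 32-bit lane mask, and the __mmask32 type is an AVX512BW facility.

    #include <immintrin.h>

    /* Per-byte popcount under a 32-bit mask (AVX512BITALG + AVX512VL).  */
    __m256i
    popcnt_bytes (__m256i src, __mmask32 k, __m256i a)
    {
      return _mm256_mask_popcnt_epi8 (src, k, a);
    }
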
2023-04-20  i386: Use macro to wrap up shared builtin exceptions in builtin isa check
Haochen Jiang | 1 file changed, +24/-48

gcc/ChangeLog:

    * config/i386/i386-expand.cc (ix86_check_builtin_isa_match):
    Correct wrong comments.  Add a new macro SHARE_BUILTIN and refactor
    the current if clauses into it.

2023-04-20  Re-arrange sections of i386 cpuid
Mo, Zewei | 1 file changed, +32/-29

gcc/ChangeLog:

    * config/i386/cpuid.h: Open a new section for Extended Features
    Leaf (%eax == 7, %ecx == 0) and Extended Features Sub-leaf
    (%eax == 7, %ecx == 1).

2023-04-20  Optimize vshuf{i,f}{32x4,64x2} ymm and vperm{i,f}128 ymm
Hu, Lin1 | 8 files changed, +218/-8

vshuf{i,f}{32x4,64x2} ymm and vperm{i,f}128 ymm take 3 clock cycles.  We
can optimize them to vblend/vmovaps when there is no cross-lane shuffling.

gcc/ChangeLog:

    * config/i386/sse.md: Modify insn vperm{i,f} and vshuf{i,f}.

gcc/testsuite/ChangeLog:

    * gcc.target/i386/avx512vl-vshuff32x4-1.c: Modify test.
    * gcc.target/i386/avx512vl-vshuff64x2-1.c: Ditto.
    * gcc.target/i386/avx512vl-vshufi32x4-1.c: Ditto.
    * gcc.target/i386/avx512vl-vshufi64x2-1.c: Ditto.
    * gcc.target/i386/opt-vperm-vshuf-1.c: New test.
    * gcc.target/i386/opt-vperm-vshuf-2.c: Ditto.
    * gcc.target/i386/opt-vperm-vshuf-3.c: Ditto.

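A sketch of the no-cross-lane case this targets: with immediate 2 the
shuffle keeps the low 128-bit half of a and the high half of b, which a
cheaper blend or move can implement.

    #include <immintrin.h>

    /* imm = 2: result = { a[127:0], b[255:128] }; no lane crossing, so
       the 3-cycle shuffle can be replaced by a 1-cycle blend.  */
    __m256i
    pick_halves (__m256i a, __m256i b)
    {
      return _mm256_shuffle_i32x4 (a, b, 2);
    }
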
2023-04-20  Daily bump.
GCC Administrator | 6 files changed, +490/-1

2023-04-19  gcc: xtensa: add -m[no-]strict-align option
Max Filippov | 5 files changed, +53/-2

gcc/
    * config/xtensa/xtensa-opts.h: New header.
    * config/xtensa/xtensa.h (STRICT_ALIGNMENT): Redefine as
    xtensa_strict_align.
    * config/xtensa/xtensa.cc (xtensa_option_override): When
    -m[no-]strict-align is not specified in the command line, set
    xtensa_strict_align to 0 if the hardware supports both unaligned
    loads and stores, or to 1 otherwise.
    * config/xtensa/xtensa.opt (mstrict-align): New option.
    * doc/invoke.texi (Xtensa Options): Document -m[no-]strict-align.

2023-04-19  gcc: xtensa: add data alignment properties to dynconfig
Max Filippov | 2 files changed, +76/-1

gcc/
    * config/xtensa/xtensa-dynconfig.cc (xtensa_get_config_v4): New
    function.

include/
    * xtensa-dynconfig.h (xtensa_config_v4): New struct.
    (XCHAL_DATA_WIDTH, XCHAL_UNALIGNED_LOAD_EXCEPTION)
    (XCHAL_UNALIGNED_STORE_EXCEPTION, XCHAL_UNALIGNED_LOAD_HW)
    (XCHAL_UNALIGNED_STORE_HW, XTENSA_CONFIG_V4_ENTRY_LIST): New
    definitions.
    (XTENSA_CONFIG_INSTANCE_LIST): Add xtensa_config_v4 instance.
    (XTENSA_CONFIG_ENTRY_LIST): Add XTENSA_CONFIG_V4_ENTRY_LIST.

2023-04-19  c++: Define built-in for std::tuple_element [PR100157]
Patrick Palka | 13 files changed, +169/-20

This adds a new built-in to replace the recursive class template
instantiations done by traits such as std::tuple_element and
std::variant_alternative.  The purpose is to select the Nth type from a
list of types, e.g. __type_pack_element<1, char, int, float> is int.  We
implement it as a special kind of TRAIT_TYPE.

For a pathological example tuple_element_t<1000, tuple<2000 types...>>
the compilation time is reduced by more than 90% and the memory used by
the compiler is reduced by 97%.  In realistic examples the gains will be
much smaller, but still relevant.

Unlike the other built-in traits, __type_pack_element uses template-id
syntax instead of call syntax and is SFINAE-enabled, matching Clang's
implementation.  And like the other built-in traits, it's not mangleable,
so we can't use it directly in function signatures.

N.B. Clang seems to implement __type_pack_element as a first-class
template that can e.g. be used as a template-template argument.  For
simplicity we implement it in a more ad-hoc way.

Co-authored-by: Jonathan Wakely <jwakely@redhat.com>

    PR c++/100157

gcc/cp/ChangeLog:

    * cp-trait.def (TYPE_PACK_ELEMENT): Define.
    * cp-tree.h (finish_trait_type): Add complain parameter.
    * cxx-pretty-print.cc (pp_cxx_trait): Handle CPTK_TYPE_PACK_ELEMENT.
    * parser.cc (cp_parser_constant_expression): Document default
    arguments.
    (cp_parser_trait): Handle CPTK_TYPE_PACK_ELEMENT.  Pass
    tf_warning_or_error to finish_trait_type.
    * pt.cc (tsubst) <case TRAIT_TYPE>: Handle non-type first argument.
    Pass complain to finish_trait_type.
    * semantics.cc (finish_type_pack_element): Define.
    (finish_trait_type): Add complain parameter.  Handle
    CPTK_TYPE_PACK_ELEMENT.
    * tree.cc (strip_typedefs): Handle non-type first argument.  Pass
    tf_warning_or_error to finish_trait_type.
    * typeck.cc (structural_comptypes) <case TRAIT_TYPE>: Use
    cp_tree_equal instead of same_type_p for the first argument.

libstdc++-v3/ChangeLog:

    * include/bits/utility.h (_Nth_type): Conditionally define in
    terms of __type_pack_element if available.
    * testsuite/20_util/tuple/element_access/get_neg.cc: Prune
    additional errors from the new built-in.

gcc/testsuite/ChangeLog:

    * g++.dg/ext/type_pack_element1.C: New test.
    * g++.dg/ext/type_pack_element2.C: New test.
    * g++.dg/ext/type_pack_element3.C: New test.

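A sketch of the built-in in use (template-id syntax as described above;
__is_same is GCC's existing built-in comparison trait):

    #include <cstddef>

    // Selects the Nth type from a pack without any recursive
    // class template instantiation.
    template<std::size_t N, typename... Ts>
      using nth_type = __type_pack_element<N, Ts...>;

    static_assert(__is_same(nth_type<1, char, int, float>, int),
                  "picks the second type of the pack");
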
2023-04-19  c++: bad ggc_free in try_class_unification [PR109556]
Patrick Palka | 2 files changed, +18/-5

Aside from correcting how try_class_unification copies multi-dimensional
'targs', r13-377-g3e948d645bc908 also made it ggc_free this copy as an
optimization.  But this is wrong, since the call to unify within might've
captured the args in persistent memory such as the satisfaction cache
(as part of constrained auto deduction).

    PR c++/109556

gcc/cp/ChangeLog:

    * pt.cc (try_class_unification): Don't ggc_free the copy of
    'targs'.

gcc/testsuite/ChangeLog:

    * g++.dg/cpp2a/concepts-placeholder13.C: New test.

2023-04-19  testsuite: fix scan-tree-dump patterns [PR83904,PR100297]
Harald Anlauf | 2 files changed, +2/-2

Adjust scan-tree-dump patterns so that they do not accidentally match a
valid path.

gcc/testsuite/ChangeLog:

    PR testsuite/83904
    PR fortran/100297
    * gfortran.dg/allocatable_function_1.f90: Use "__builtin_free "
    instead of the naive "free".
    * gfortran.dg/reshape_8.f90: Extend pattern from a simple "data".

2023-04-19  i386: Add new pattern for zero-extend cmov
Andrew Pinski | 3 files changed, +36/-0

After a phiopt change, I got a failure of cmov9.c.  The RTL IR has the
zero_extend on the outside of the if_then_else rather than on the inside.
Both ways are considered canonical, as mentioned in PR 66588.  This fixes
the failure I got, and also adds a testcase which fails even before my
phiopt patch but passes with this patch.

OK?  Bootstrapped and tested on x86_64-linux-gnu with no regressions.

gcc/ChangeLog:

    * config/i386/i386.md (*movsicc_noc_zext_1): New pattern.

gcc/testsuite/ChangeLog:

    * gcc.target/i386/cmov10.c: New test.
    * gcc.target/i386/cmov11.c: New test.

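The shape of source this concerns, sketched for x86-64 (an assumption
about the style of the new tests, not their exact contents):

    /* A conditional select whose 32-bit result is zero-extended to
       64 bits: the zero_extend can end up outside the if_then_else.  */
    unsigned long long
    select_zext (unsigned a, unsigned b, int c)
    {
      return c ? a : b;
    }
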
2023-04-19  c++: fix 'unsigned __int128_t' semantics [PR108099]
Jason Merrill | 2 files changed, +28/-2

My earlier patch for 108099 made us accept this non-standard pattern but
messed up the semantics, so that e.g. unsigned __int128_t was not a
128-bit type.

    PR c++/108099

gcc/cp/ChangeLog:

    * decl.cc (grokdeclarator): Keep typedef_decl for __int128_t.

gcc/testsuite/ChangeLog:

    * g++.dg/ext/int128-8.C: New test.

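A sketch of the restored semantics (the static_assert is an assumption
about what such a test checks, not the testcase itself):

    // Non-standard but accepted: 'unsigned' applied to the __int128_t
    // typedef must again yield a 128-bit type.
    typedef unsigned __int128_t u128;
    static_assert (sizeof (u128) * __CHAR_BIT__ == 128, "128-bit type");
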
2023-04-19  RISC-V: Support 128 bit vector chunk
Juzhe-Zhong | 16 files changed, +783/-455

RISC-V provides different VLEN configurations via different ISA extensions
such as `zve32x`, `zve64x` and `v`: zve32x guarantees only that the minimal
VLEN is 32 bits, zve64x guarantees the minimal VLEN is 64 bits, and v
guarantees the minimal VLEN is 128 bits.

Current status (without this patch):

Zve32x: the mode for one vector register is VNx1SImode, and VNx1DImode is
an invalid mode
- one vector register could hold 1 + 1x SImode where x is 0~n, so it might
  hold just one SI

Zve64x: the mode for one vector register is VNx1DImode or VNx2SImode
- one vector register could hold 1 + 1x DImode where x is 0~n, so it might
  hold just one DI.
- one vector register could hold 2 + 2x SImode where x is 0~n, so it might
  hold just two SI.

However, the `v` extension guarantees the minimal VLEN is 128 bits, so we
introduce another type/mode mapping for this configuration:

v: the mode for one vector register is VNx2DImode or VNx4SImode
- one vector register could hold 2 + 2x DImode where x is 0~n, so it will
  hold at least two DI
- one vector register could hold 4 + 4x SImode where x is 0~n, so it will
  hold at least four SI

This patch models the modes more precisely for RVV and helps middle-end
optimizations that assume the number of elements is a multiple of two.

gcc/ChangeLog:

    * config/riscv/riscv-modes.def (FLOAT_MODE): Add chunk 128 support.
    (VECTOR_BOOL_MODE): Ditto.
    (ADJUST_NUNITS): Ditto.
    (ADJUST_ALIGNMENT): Ditto.
    (ADJUST_BYTESIZE): Ditto.
    (ADJUST_PRECISION): Ditto.
    (RVV_MODES): Ditto.
    (VECTOR_MODE_WITH_PREFIX): Ditto.
    * config/riscv/riscv-v.cc (ENTRY): Ditto.
    (get_vlmul): Ditto.
    (get_ratio): Ditto.
    * config/riscv/riscv-vector-builtins.cc (DEF_RVV_TYPE): Ditto.
    * config/riscv/riscv-vector-builtins.def (DEF_RVV_TYPE): Ditto.
    (vbool64_t, vbool32_t, vbool16_t, vbool8_t, vbool4_t, vbool2_t)
    (vbool1_t, vint8mf8_t, vuint8mf8_t, vint8mf4_t, vuint8mf4_t)
    (vint8mf2_t, vuint8mf2_t, vint8m1_t, vuint8m1_t, vint8m2_t)
    (vuint8m2_t, vint8m4_t, vuint8m4_t, vint8m8_t, vuint8m8_t)
    (vint16mf4_t, vuint16mf4_t, vint16mf2_t, vuint16mf2_t, vint16m1_t)
    (vuint16m1_t, vint16m2_t, vuint16m2_t, vint16m4_t, vuint16m4_t)
    (vint16m8_t, vuint16m8_t, vint32mf2_t, vuint32mf2_t, vint32m1_t)
    (vuint32m1_t, vint32m2_t, vuint32m2_t, vint32m4_t, vuint32m4_t)
    (vint32m8_t, vuint32m8_t, vint64m1_t, vuint64m1_t, vint64m2_t)
    (vuint64m2_t, vint64m4_t, vuint64m4_t, vint64m8_t, vuint64m8_t)
    (vfloat32mf2_t, vfloat32m1_t, vfloat32m2_t, vfloat32m4_t)
    (vfloat32m8_t, vfloat64m1_t, vfloat64m2_t, vfloat64m4_t)
    (vfloat64m8_t): Ditto.
    * config/riscv/riscv-vector-switch.def (ENTRY): Ditto.
    * config/riscv/riscv.cc (riscv_legitimize_poly_move): Ditto.
    (riscv_convert_vector_bits): Ditto.
    * config/riscv/riscv.md:
    * config/riscv/vector-iterators.md:
    * config/riscv/vector.md
    (@pred_indexed_<order>store<VNX32_QH:mode><VNX32_QHI:mode>): Ditto.
    (@pred_indexed_<order>store<VNX32_QHS:mode><VNX32_QHSI:mode>): Ditto.
    (@pred_indexed_<order>store<VNX64_Q:mode><VNX64_Q:mode>): Ditto.
    (@pred_indexed_<order>store<VNX64_QH:mode><VNX64_QHI:mode>): Ditto.
    (@pred_indexed_<order>store<VNX128_Q:mode><VNX128_Q:mode>): Ditto.
    (@pred_reduc_<reduc><mode><vlmul1_zve64>): Ditto.
    (@pred_widen_reduc_plus<v_su><mode><vwlmul1_zve64>): Ditto.
    (@pred_reduc_plus<order><mode><vlmul1_zve64>): Ditto.
    (@pred_widen_reduc_plus<order><mode><vwlmul1_zve64>): Ditto.

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/rvv/base/pr108185-4.c: Adapt testcase.
    * gcc.target/riscv/rvv/base/spill-1.c: Ditto.
    * gcc.target/riscv/rvv/base/spill-11.c: Ditto.
    * gcc.target/riscv/rvv/base/spill-2.c: Ditto.
    * gcc.target/riscv/rvv/base/spill-3.c: Ditto.
    * gcc.target/riscv/rvv/base/spill-5.c: Ditto.
    * gcc.target/riscv/rvv/base/spill-9.c: Ditto.

2023-04-19  RISC-V: Align IOR optimization MODE_CLASS condition to AND.
Pan Li | 3 files changed, +53/-4

This patch aligns the MODE_CLASS condition of the IOR simplification to
that of AND, so that mode classes besides SCALAR_INT can also perform the
optimization A | (~A) -> -1, similar to the AND operator.  Take the
following sample code:

    vbool32_t
    test_shortcut_for_riscv_vmorn_case_5 (vbool32_t v1, size_t vl)
    {
      return __riscv_vmorn_mm_b32 (v1, v1, vl);
    }

Before this patch:

    vsetvli  a5,zero,e8,mf4,ta,ma
    vlm.v    v24,0(a1)
    vsetvli  zero,a2,e8,mf4,ta,ma
    vmorn.mm v24,v24,v24
    vsetvli  a5,zero,e8,mf4,ta,ma
    vsm.v    v24,0(a0)
    ret

After this patch:

    vsetvli  zero,a2,e8,mf4,ta,ma
    vmset.m  v24
    vsetvli  a5,zero,e8,mf4,ta,ma
    vsm.v    v24,0(a0)
    ret

Or, from RTL's perspective, from:

    (ior:VNx2BI (reg/v:VNx2BI 137 [ v1 ])
                (not:VNx2BI (reg/v:VNx2BI 137 [ v1 ])))

to:

    (const_vector:VNx2BI repeat [ (const_int 1 [0x1]) ])

A similar optimization is already enabled for VMANDN.  There should be no
difference except the operator when comparing VMORN and VMANDN for this
kind of optimization, so the patch aligns the IOR MODE_CLASS condition of
the simplification to the AND operator.

gcc/ChangeLog:

    * simplify-rtx.cc (simplify_context::simplify_binary_operation_1):
    Align IOR (A | (~A) -> -1) optimization MODE_CLASS condition to
    AND.

gcc/testsuite/ChangeLog:

    * gcc.target/riscv/rvv/base/mask_insn_shortcut.c: Update check
    condition.
    * gcc.target/riscv/simplify_ior_optimization.c: New test.

Signed-off-by: Pan Li <pan2.li@intel.com>

2023-04-19  i386: Emit compares between high registers and memory
Uros Bizjak | 2 files changed, +124/-10

The following code:

    typedef __SIZE_TYPE__ size_t;

    struct S1s
    {
      char pad1;
      char val;
      short pad2;
    };

    extern char ts[256];

    _Bool foo (struct S1s a, size_t i)
    {
      return (ts[i] > a.val);
    }

compiles with -O2 to:

    movl    %edi, %eax
    movsbl  %ah, %edi
    cmpb    %dil, ts(%rsi)
    setg    %al
    ret

The compare could use the high register %ah instead of %dil:

    movl    %edi, %eax
    cmpb    ts(%rsi), %ah
    setl    %al
    ret

Use the any_extract code iterator to handle signed and unsigned extracts
from a high register, and introduce peephole2 patterns to propagate a
norex memory operand into the compare insn.

gcc/ChangeLog:

    PR target/78904
    PR target/78952
    * config/i386/i386.md (*cmpqi_ext<mode>_1_mem_rex64): New insn
    pattern.
    (*cmpqi_ext<mode>_1): Use nonimmediate_operand predicate for
    operand 0.  Use any_extract code iterator.
    (*cmpqi_ext<mode>_1 peephole2): New peephole2 pattern.
    (*cmpqi_ext<mode>_2): Use any_extract code iterator.
    (*cmpqi_ext<mode>_3_mem_rex64): New insn pattern.
    (*cmpqi_ext<mode>_1): Use general_operand predicate for operand 1.
    Use any_extract code iterator.
    (*cmpqi_ext<mode>_3 peephole2): New peephole2 pattern.
    (*cmpqi_ext<mode>_4): Use any_extract code iterator.

gcc/testsuite/ChangeLog:

    PR target/78904
    PR target/78952
    * gcc.target/i386/pr78952-3.c: New test.

2023-04-19  aarch64: Factorise widening add/sub high-half expanders with iterators
Kyrylo Tkachov | 1 file changed, +20/-46

I noticed these define_expands are almost identical modulo some string
substitutions.  This patch compresses them together with a couple of code
iterators.  No functional change intended.

Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/ChangeLog:

    * config/aarch64/aarch64-simd.md (aarch64_saddw2<mode>): Delete.
    (aarch64_uaddw2<mode>): Delete.
    (aarch64_ssubw2<mode>): Delete.
    (aarch64_usubw2<mode>): Delete.
    (aarch64_<ANY_EXTEND:su><ADDSUB:optab>w2<mode>): New define_expand.

2023-04-19  Use solve_add_graph_edge in more places
Richard Biener | 1 file changed, +4/-7

The following makes sure to use solve_add_graph_edge, honoring its
special-cases (especially edges from escaped), in the remaining places
where the solver adds edges.

    * tree-ssa-structalias.cc (do_ds_constraint): Use
    solve_add_graph_edge.

2023-04-19  Split out solve_add_graph_edge
Richard Biener | 1 file changed, +24/-11

Split out a worker with all the special-casing done when adding a graph
edge during solving.

    * tree-ssa-structalias.cc (solve_add_graph_edge): New function,
    split out from ...
    (do_sd_constraint): ... here.

2023-04-19  Remove odd code from gimple_can_merge_blocks_p
Richard Biener | 1 file changed, +0/-6

The following removes a special case that refused to merge a block
containing only a non-local label.  We restrict non-local labels to be
the first statement (and label) in a block, but otherwise nothing
special applies: if the last stmt of A is a non-local label, it will
still be the first statement of the combined A + B.  In particular we'd
happily merge when there's a stmt after that label.  The check
originates from the tree-ssa merge.

Bootstrapped and tested on x86_64-unknown-linux-gnu with all languages.

    * tree-cfg.cc (gimple_can_merge_blocks_p): Remove condition
    rejecting the merge when A contains only a non-local label.

2023-04-19  Introduce VIRTUAL_REGISTER_P and VIRTUAL_REGISTER_NUM_P predicates
Uros Bizjak | 4 files changed, +14/-12

These two predicates are similar to the existing HARD_REGISTER_P and
HARD_REGISTER_NUM_P predicates and return 1 if the given register
corresponds to a virtual register.

gcc/ChangeLog:

    * rtl.h (VIRTUAL_REGISTER_P): New predicate.
    (VIRTUAL_REGISTER_NUM_P): Ditto.
    (REGNO_PTR_FRAME_P): Use VIRTUAL_REGISTER_NUM_P predicate.
    * expr.cc (force_operand): Use VIRTUAL_REGISTER_P predicate.
    * function.cc (instantiate_decl_rtl): Ditto.
    * rtlanal.cc (rtx_addr_can_trap_p_1): Ditto.
    (nonzero_address_p): Ditto.
    (refers_to_regno_p): Use VIRTUAL_REGISTER_NUM_P predicate.

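A sketch of the new predicates by analogy with HARD_REGISTER_P (the exact
rtl.h definitions may differ; FIRST_/LAST_VIRTUAL_REGISTER and IN_RANGE
are existing GCC facilities):

    /* Nonzero if register number N corresponds to a virtual register.  */
    #define VIRTUAL_REGISTER_NUM_P(N) \
      IN_RANGE ((N), FIRST_VIRTUAL_REGISTER, LAST_VIRTUAL_REGISTER)

    /* Nonzero if X is a REG for a virtual register.  */
    #define VIRTUAL_REGISTER_P(X) \
      (REG_P (X) && VIRTUAL_REGISTER_NUM_P (REGNO (X)))
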
2023-04-19  Fix pointer sharing in Value_Range constructor.
Aldy Hernandez | 1 file changed, +1/-1

gcc/ChangeLog:

    * value-range.h (Value_Range::Value_Range): Avoid pointer sharing.

2023-04-19  Transform more gmp/mpfr uses to use RAII
Richard Biener | 6 files changed, +20/-46

The following picks up the coccinelle-generated patch from Bernhard,
leaving out the fortran frontend parts and fixing up the rest.  In
particular, both gmp.h and mpfr.h contain macros like

    #define mpfr_inf_p(_x)      ((_x)->_mpfr_exp == __MPFR_EXP_INF)

for which I add operator-> overloads to the auto_* classes.

    * system.h (auto_mpz::operator->()): New.
    * realmpfr.h (auto_mpfr::operator->()): New.
    * builtins.cc (do_mpfr_lgamma_r): Use auto_mpfr.
    * real.cc (real_from_string): Likewise.
    (dconst_e_ptr): Likewise.
    (dconst_sqrt2_ptr): Likewise.
    * tree-ssa-loop-niter.cc (refine_value_range_using_guard):
    Use auto_mpz.
    (bound_difference_of_offsetted_base): Likewise.
    (number_of_iterations_ne): Likewise.
    (number_of_iterations_lt_to_ne): Likewise.
    * ubsan.cc: Include realmpfr.h.
    (ubsan_instrument_float_cast): Use auto_mpfr.

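A simplified sketch of the operator-> addition (the real wrapper also runs
mpfr_init/mpfr_clear in its constructor and destructor, elided here):

    #include <mpfr.h>

    class auto_mpfr
    {
      mpfr_t m_mpfr;   /* one-element array, decays to mpfr_ptr */
    public:
      operator mpfr_ptr () { return m_mpfr; }
      /* The new overload: lets (_x)->_mpfr_exp-style macro expansions,
         e.g. mpfr_inf_p above, work on the wrapper directly.  */
      mpfr_ptr operator-> () { return m_mpfr; }
    };
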
2023-04-19Revert "libstdc++: Export global iostreams with GLIBCXX_3.4.31 symver ↵Jonathan Wakely14-176/+83
[PR108969]" This reverts commit b7c54e3f48086c29179f7765a35c381de5109a0a. libstdc++-v3/ChangeLog: * config/abi/post/aarch64-linux-gnu/baseline_symbols.txt: * config/abi/post/i486-linux-gnu/baseline_symbols.txt: * config/abi/post/m68k-linux-gnu/baseline_symbols.txt: * config/abi/post/powerpc64-linux-gnu/baseline_symbols.txt: * config/abi/post/riscv64-linux-gnu/baseline_symbols.txt: * config/abi/post/s390x-linux-gnu/baseline_symbols.txt: * config/abi/post/x86_64-linux-gnu/32/baseline_symbols.txt: * config/abi/post/x86_64-linux-gnu/baseline_symbols.txt: * config/abi/pre/gnu.ver: * src/Makefile.am: * src/Makefile.in: * src/c++98/Makefile.am: * src/c++98/Makefile.in: * src/c++98/globals_io.cc (defined): (_GLIBCXX_IO_GLOBAL):
2023-04-19Revert "libstdc++: Fix preprocessor condition in linker script [PR108969]"Jonathan Wakely1-2/+3
This reverts commit 6067ae4557a3a7e5b08359e78a29b8a9d5dfedce. libstdc++-v3/ChangeLog: * config/abi/pre/gnu.ver:
2023-04-19  Remove special-cased edges when solving copies
Richard Biener | 1 file changed, +14/-11

The following makes sure to remove the copy edges we ignore or need to
special-case only once.

    * tree-ssa-structalias.cc (solve_graph): Remove self-copy edges,
    remove edges from escaped after special-casing them.

2023-04-19  Fix do_sd_constraint escape special casing
Richard Biener | 1 file changed, +1/-1

The following fixes the escape special casing to test the proper
variable IDs.

    * tree-ssa-structalias.cc (do_sd_constraint): Fixup escape
    special casing.

2023-04-19  Remove senseless store in do_sd_constraint
Richard Biener | 1 file changed, +1/-4

    * tree-ssa-structalias.cc (do_sd_constraint): Do not write to
    the LHS varinfo solution member.

2023-04-19  Avoid non-unified nodes on the topological sorting for PTA solving
Richard Biener | 1 file changed, +3/-2

Since we do not update successor edges when merging nodes, we have to
deal with this in the users.  The following avoids putting those on the
topo order vector.

    * tree-ssa-structalias.cc (topo_visit): Look at the real
    destination of edges.

2023-04-19  tree-optimization/44794 - avoid excessive RTL unrolling on epilogues
Richard Biener | 1 file changed, +6/-0

The following adjusts tree_[transform_and_]unroll_loop to set an upper
bound on the number of iterations of the epilogue loop it creates.  For
the testcase at hand, which involves array prefetching, this avoids
applying RTL unrolling to the epilogues when -funroll-loops is specified.

Other users of this API include predictive commoning and unroll-and-jam.

    PR tree-optimization/44794
    * tree-ssa-loop-manip.cc (tree_transform_and_unroll_loop):
    If an epilogue loop is required, set its iteration upper bound.

2023-04-19  LoongArch: Improve cpymemsi expansion [PR109465]
Xi Ruoyao | 7 files changed, +91/-49

We'd been generating really bad block move sequences, which was recently
complained about by kernel developers who tried __builtin_memcpy.  To
improve it:

1. Take advantage of -mno-strict-align.  When it is set, set the mode
   size to UNITS_PER_WORD regardless of the alignment.
2. Halve the mode size when (block size) % (mode size) != 0, instead of
   falling back to ld.bu/st.b at once.  (See the sketch after this
   entry.)
3. Limit the length of the block move sequence considering the number of
   instructions, not the size of the block.  When -mstrict-align is set
   and the block is not aligned, the old size limit for the
   straight-line implementation (64 bytes) was definitely too large (we
   don't have 64 registers anyway).

Change since v1: add a comment about the calculation of num_reg.

gcc/ChangeLog:

    PR target/109465
    * config/loongarch/loongarch-protos.h
    (loongarch_expand_block_move): Add a parameter as alignment RTX.
    * config/loongarch/loongarch.h:
    (LARCH_MAX_MOVE_BYTES_PER_LOOP_ITER): Remove.
    (LARCH_MAX_MOVE_BYTES_STRAIGHT): Remove.
    (LARCH_MAX_MOVE_OPS_PER_LOOP_ITER): Define.
    (LARCH_MAX_MOVE_OPS_STRAIGHT): Define.
    (MOVE_RATIO): Use LARCH_MAX_MOVE_OPS_PER_LOOP_ITER instead of
    LARCH_MAX_MOVE_BYTES_PER_LOOP_ITER.
    * config/loongarch/loongarch.cc (loongarch_expand_block_move):
    Take the alignment from the parameter, but set it to
    UNITS_PER_WORD if !TARGET_STRICT_ALIGN.  Limit the length of
    straight-line implementation with LARCH_MAX_MOVE_OPS_STRAIGHT
    instead of LARCH_MAX_MOVE_BYTES_STRAIGHT.
    (loongarch_block_move_straight): When there are left-over bytes,
    halve the mode size instead of falling back to byte mode at once.
    (loongarch_block_move_loop): Limit the length of the loop body
    with LARCH_MAX_MOVE_OPS_PER_LOOP_ITER instead of
    LARCH_MAX_MOVE_BYTES_PER_LOOP_ITER.
    * config/loongarch/loongarch.md (cpymemsi): Pass the alignment
    to loongarch_expand_block_move.

gcc/testsuite/ChangeLog:

    PR target/109465
    * gcc.target/loongarch/pr109465-1.c: New test.
    * gcc.target/loongarch/pr109465-2.c: New test.
    * gcc.target/loongarch/pr109465-3.c: New test.

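A scalar model of point 2 (the helper name is hypothetical, not from the
patch): halve the access size until it fits the remainder, instead of
dropping straight to byte accesses.

    /* E.g. leftover == 6 with 8-byte words yields one 4-byte and one
       2-byte access (ld.w/st.w, ld.h/st.h) instead of six ld.bu/st.b.  */
    static unsigned
    next_access_size (unsigned leftover, unsigned word_size)
    {
      unsigned n = word_size;
      while (n > leftover)
        n /= 2;
      return n;
    }
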
2023-04-19  LoongArch: Improve GAR store for va_list
Xi Ruoyao | 2 files changed, +27/-1

The LoongArch backend used to save all GARs for a function with variable
arguments.  But sometimes a function only accepts variable arguments for
a purpose like C++ function overloading.  For example, POSIX defines
open() as:

    int open (const char *path, int oflag, ...);

but only two forms are actually used:

    int open (const char *pathname, int flags);
    int open (const char *pathname, int flags, mode_t mode);

So it's obviously a waste to save all 8 GARs in open().  We can use the
cfun->va_list_gpr_size field set by the stdarg pass to save only the GARs
that actually need to be saved.  If the va_list escapes (for example, in
fprintf() we pass it to vfprintf()), stdarg sets cfun->va_list_gpr_size
to 255, so we don't need a special case.

With this patch, only one GAR ($a2/$r6) is saved in open().  Ideally even
this stack store should be omitted too, but doing so is not trivial and
AFAIK no compilers (for any target) perform the "ideal" optimization
here; see https://godbolt.org/z/n1YqWq9c9.

Bootstrapped and regtested on loongarch64-linux-gnu.  Ok for trunk
(GCC 14 or now)?

gcc/ChangeLog:

    * config/loongarch/loongarch.cc
    (loongarch_setup_incoming_varargs): Don't save more GARs than
    cfun->va_list_gpr_size / UNITS_PER_WORD.

gcc/testsuite/ChangeLog:

    * gcc.target/loongarch/va_arg.c: New test.

2023-04-19  Avoid unnecessary epilogues from tree_unroll_loop
Richard Biener | 1 file changed, +1/-1

The following fixes the condition determining whether we need an
epilogue.

    * tree-ssa-loop-manip.cc (determine_exit_conditions): Fix
    no-epilogue condition.

2023-04-19  Simplify gimple_assign_load
Richard Biener | 2 files changed, +21/-17

The following simplifies and outlines gimple_assign_load.  In particular,
it is not necessary to get at the base of the possibly loaded expression;
it is enough to handle the case of a single handled component wrapping a
non-memory operand.

    * gimple.h (gimple_assign_load): Outline ...
    * gimple.cc (gimple_assign_load): ... here.  Avoid
    get_base_address and instead just strip the outermost handled
    component, treating a remaining handled component as a load.

2023-04-19  aarch64: Delete __builtin_aarch64_neg* builtins and their use
Kyrylo Tkachov | 2 files changed, +1/-4

I don't think we need to keep the __builtin_aarch64_neg* builtins around.
They are only used once, in the vnegh_f16 intrinsic in arm_fp16.h, and
AFAICT they were added this way only for the sake of orthogonality in
https://gcc.gnu.org/g:d7f33f07d88984cbe769047e3d07fc21067fbba9

We already use normal "-" negation in the other vneg* intrinsics, so do
so here as well.

Bootstrapped and tested on aarch64-none-linux-gnu.

gcc/ChangeLog:

    * config/aarch64/aarch64-simd-builtins.def (neg): Delete builtins
    definition.
    * config/aarch64/arm_fp16.h (vnegh_f16): Reimplement using normal
    negation.

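A sketch mirroring the new arm_fp16.h definition (attribute boilerplate
omitted): plain negation instead of a builtin call.

    #include <arm_fp16.h>

    static inline float16_t
    neg_f16 (float16_t a)
    {
      return -a;   /* was: a __builtin_aarch64_neg* call */
    }
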
2023-04-19  tree-vect-patterns: Improve __builtin_{clz,ctz,ffs}ll vectorization [PR109011]
Jakub Jelinek | 2 files changed, +171/-25

For __builtin_popcountll, tree-vect-patterns.cc has
vect_recog_popcount_pattern, which improves the vectorized code.  Without
it, the vectorization is always multi-type vectorization in the loop (at
least int and long long types), where we emit two .POPCOUNT calls with
long long arguments and int return value and then widen to long long, so
effectively after vectorization we do the V?DImode -> V?DImode popcount
twice, then pack the result into V?SImode and immediately unpack it.

The following patch extends that handling to the __builtin_{clz,ctz,ffs}ll
builtins as well (as long as there is an optab for them; more to come
later).

x86 can do __builtin_popcountll with -mavx512vpopcntdq and
__builtin_clzll with -mavx512cd; ppc can do __builtin_popcountll and
__builtin_clzll with -mpower8-vector and __builtin_ctzll with
-mpower9-vector; s390 can do __builtin_{popcount,clz,ctz}ll with
-march=z13 -mzarch (i.e. VX).

2023-04-19  Jakub Jelinek  <jakub@redhat.com>

    PR tree-optimization/109011
    * tree-vect-patterns.cc (vect_recog_popcount_pattern): Rename
    to ...
    (vect_recog_popcount_clz_ctz_ffs_pattern): ... this.  Handle
    also CLZ, CTZ and FFS.  Remove vargs variable, use
    gimple_build_call_internal rather than
    gimple_build_call_internal_vec.
    (vect_vect_recog_func_ptrs): Adjust popcount entry.
    * gcc.dg/vect/pr109011-1.c: New test.

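The shape of loop that now vectorizes with a single wide .CLZ per
iteration (a sketch; e.g. with -mavx512cd on x86):

    void
    clz_loop (long long *out, const unsigned long long *in, int n)
    {
      for (int i = 0; i < n; i++)
        out[i] = __builtin_clzll (in[i]);
    }
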
2023-04-19  dse: Use SUBREG_REG for copy_to_mode_reg in DSE replace_read for WORD_REGISTER_OPERATIONS targets [PR109040]
Jakub Jelinek | 1 file changed, +13/-1

While we've agreed this is not the right fix for the PR109040 bug, the
patch clearly improves the generated code (at least on the testcase from
the PR), so I'd like to propose this as an optimization heuristics
improvement for GCC 14.

2023-04-19  Jakub Jelinek  <jakub@redhat.com>

    PR target/109040
    * dse.cc (replace_read): If read_reg is a SUBREG of a word mode
    REG, for WORD_REGISTER_OPERATIONS copy SUBREG_REG of it into a
    new REG rather than the SUBREG.

2023-04-19  [aarch64] Use wzr/xzr for assigning 0 to vector element.
Prathamesh Kulkarni | 2 files changed, +54/-0

gcc/ChangeLog:

    * config/aarch64/aarch64-simd.md (aarch64_simd_vec_set_zero<mode>):
    New pattern.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/vec-set-zero.c: New test.

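A sketch of source that benefits (an illustration; inserting a constant
zero into one vector lane can now use wzr directly):

    #include <arm_neon.h>

    /* Can now be a single 'ins v0.s[2], wzr' style instruction instead
       of first materializing zero in a scratch register.  */
    int32x4_t
    zero_lane (int32x4_t v)
    {
      return vsetq_lane_s32 (0, v, 2);
    }
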
2023-04-19  aarch64: PR target/108840 Simplify register shift RTX costs and eliminate shift amount masking
Kyrylo Tkachov | 2 files changed, +49/-52

In this PR we fail to eliminate explicit &31 operations for variable
shifts, such as in:

    void
    bar (int x[3], int y)
    {
      x[0] <<= (y & 31);
      x[1] <<= (y & 31);
      x[2] <<= (y & 31);
    }

This is rejected by RTX costs, which end up giving too high a cost for:

    (set (reg:SI 96)
        (ashift:SI (reg:SI 98)
            (subreg:QI (and:SI (reg:SI 99)
                    (const_int 31 [0x1f])) 0)))

There is code to handle the AND-31 case in RTX costs, but it gets
confused by the subreg.  It's easy enough to fix by looking inside the
subreg when costing the expression.  While doing that I noticed that the
ASHIFT case and the other shift-like cases are almost identical and we
should just merge them.  This code will only be used for valid insns
anyway, so after this patch it should do the Right Thing (TM) for all
such shift cases.

With this patch there are no more "and wn, wn, 31" instructions left in
the testcase.

Bootstrapped and tested on aarch64-none-linux-gnu.

    PR target/108840

gcc/ChangeLog:

    * config/aarch64/aarch64.cc (aarch64_rtx_costs): Merge ASHIFT and
    ROTATE, ROTATERT, LSHIFTRT, ASHIFTRT cases.  Handle subregs in
    op1.

gcc/testsuite/ChangeLog:

    * gcc.target/aarch64/pr108840.c: New test.

2023-04-19  rtl-optimization/109237 - quadraticness in delete_trivially_dead_insns
Richard Biener | 1 file changed, +24/-15

The following addresses quadratic behavior when processing debug insns in
delete_trivially_dead_insns and insn_live_p, by using TREE_VISITED on the
INSN_VAR_LOCATION_DECL to indicate a later debug bind with the same decl
and no intervening real insn or debug marker.

That gets rid of the NEXT_INSN walk in insn_live_p in favor of first
clearing TREE_VISITED in the first loop over the insns, plus book-keeping
of the decls whose bit we set, since we need to clear those bits when
visiting a real or debug marker insn.

This improves the time spent in delete_trivially_dead_insns from 10.6s
to 2.2s for the testcase.

    PR rtl-optimization/109237
    * cse.cc (insn_live_p): Remove NEXT_INSN walk, instead check
    TREE_VISITED on INSN_VAR_LOCATION_DECL.
    (delete_trivially_dead_insns): Maintain TREE_VISITED on
    active debug bind INSN_VAR_LOCATION_DECL.

2023-04-19  rtl-optimization/109237 - speed up bb_is_just_return
Richard Biener | 1 file changed, +2/-2

For the testcase, bb_is_just_return is on top of the profile; changing it
to walk the BB insns backwards takes it off the profile.  That's because
in a forward walk you have to process possibly many debug insns, but in
a backward walk you very likely run into control insns first.

    PR rtl-optimization/109237
    * cfgcleanup.cc (bb_is_just_return): Walk insns backwards.

2023-04-19  testsuite: Fix up pr109524.C for -std=c++23 [PR109524]
Jakub Jelinek | 1 file changed, +1/-1

This testcase was reduced such that it isn't valid C++23, so with my
usual testing with GXX_TESTSUITE_STDS=98,11,14,17,20,2b it fails:

    FAIL: g++.dg/pr109524.C -std=gnu++2b (test for excess errors)
    .../gcc/testsuite/g++.dg/pr109524.C: In function 'nn hh(nn)':
    .../gcc/testsuite/g++.dg/pr109524.C:35:12: error: cannot bind
      non-const lvalue reference of type 'nn&' to an rvalue of type 'nn'
    .../gcc/testsuite/g++.dg/pr109524.C:17:6: note: initializing
      argument 1 of 'nn::nn(nn&)'

The following patch fixes that.  I've verified it doesn't change anything
about what the test was testing: it still ICEs in r13-7198 and passes in
r13-7203, now in all language modes (except for 98, where it is
intentionally UNSUPPORTED).

2023-04-19  Jakub Jelinek  <jakub@redhat.com>

    PR tree-optimization/109524
    * g++.dg/pr109524.C (nn::nn): Change argument type from nn & to
    const nn &.
This testcase was reduced such that it isn't valid C++23, so with my usual testing with GXX_TESTSUITE_STDS=98,11,14,17,20,2b it fails: FAIL: g++.dg/pr109524.C -std=gnu++2b (test for excess errors) .../gcc/testsuite/g++.dg/pr109524.C: In function 'nn hh(nn)': .../gcc/testsuite/g++.dg/pr109524.C:35:12: error: cannot bind non-const lvalue reference of type 'nn&' to an rvalue of type 'nn' .../gcc/testsuite/g++.dg/pr109524.C:17:6: note: initializing argument 1 of 'nn::nn(nn&)' The following patch fixes that and I've verified it doesn't change anything on what the test was testing, it still ICEs in r13-7198 and passes in r13-7203, now in all language modes (except for 98 where it is intentionally UNSUPPORTED). 2023-04-19 Jakub Jelinek <jakub@redhat.com> PR tree-optimization/109524 * g++.dg/pr109524.C (nn::nn): Change argument type from nn & to const nn &.