path: root/gcc/config/i386
2023-06-26  i386: New *ashl<dwi>3_doubleword_highpart define_insn_and_split.  (Roger Sayle; 1 file, +34/-0)

This patch contains a pair of (related) optimizations in i386.md that allow us to generate better code for the example below (this is a step towards fixing a bugzilla PR, but I've forgotten the number).

__int128 foo64(__int128 x, long long y)
{
  __int128 t = (__int128)y << 64;
  return x ^ t;
}

The hidden issue is that the RTL currently seen by reload contains the sign extension of y from DImode to TImode, even though this is dead (not required) for left shifts by more than WORD_SIZE bits.

(insn 11 8 12 2 (parallel [
            (set (reg:TI 0 ax [orig:91 y ] [91])
                 (sign_extend:TI (reg:DI 1 dx [97])))
            (clobber (reg:CC 17 flags))
            (clobber (scratch:DI))
        ]) {extendditi2}

What makes this particularly undesirable is that the sign-extension pattern above requires an additional DImode scratch register, indicated by the clobber, which unnecessarily increases register pressure.

The proposed solution is to add a define_insn_and_split for such left shifts (of sign or zero extensions) that only have a non-zero highpart, where the extension is redundant and eliminated, that can be split after reload, without scratch registers or early clobbers.

This (late split) exposes a second optimization opportunity where setting the lowpart to zero can sometimes be combined/simplified with the following instruction during peephole2.

For the test case above, we previously generated with -O2:

foo64:  xorl    %eax, %eax
        xorq    %rsi, %rdx
        xorq    %rdi, %rax
        ret

with this patch, we now generate:

foo64:  movq    %rdi, %rax
        xorq    %rsi, %rdx
        ret

Likewise for the related -m32 test case, we go from:

foo32:  movl    12(%esp), %eax
        movl    %eax, %edx
        xorl    %eax, %eax
        xorl    8(%esp), %edx
        xorl    4(%esp), %eax
        ret

to the improved:

foo32:  movl    12(%esp), %edx
        movl    4(%esp), %eax
        xorl    8(%esp), %edx
        ret

2023-06-26  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386.md (peephole2): Simplify zeroing a register followed by an IOR, XOR or PLUS operation on it, into a move.
        (*ashl<dwi>3_doubleword_highpart): New define_insn_and_split to eliminate (and hide from reload) unnecessary word to doubleword extensions that are followed by left shifts by sufficiently large, but valid, bit counts.

gcc/testsuite/ChangeLog
        * gcc.target/i386/ashldi3-1.c: New 32-bit test case.
        * gcc.target/i386/ashlti3-2.c: New 64-bit test case.

2023-06-26  i386: Sync tune_string with arch_string for target attribute arch=*  (Hongyu Wang; 1 file, +5/-1)

For a function with target attribute arch=*, the current logic sets its tune to the -mtune from the command line, so all target_clones get the same tuning flags, which can hurt the performance of individual clones.  Override tune with arch if tune was not explicitly specified, to get proper tuning flags for target_clones.

gcc/ChangeLog:
        * config/i386/i386-options.cc (ix86_valid_target_attribute_tree): Override tune_string with arch_string if tune_string is not explicitly specified.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/mvc17.c: New test.

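For illustration, a function of the kind this change affects (a hypothetical example, not the new mvc17.c test): with the patch, each clone is tuned for its own arch= value instead of inheriting the command-line -mtune:

    __attribute__((target_clones ("default", "arch=skylake", "arch=icelake-server")))
    int
    sum (const int *a, int n)
    {
      /* Each resolved clone now gets tuning flags matching its arch.  */
      int s = 0;
      for (int i = 0; i < n; i++)
        s += a[i];
      return s;
    }
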
2023-06-25  Refine maskloadmn pattern with UNSPEC_MASKLOAD.  (liuhongt; 1 file, +18/-14)

If mem_addr points to a memory region with less than whole-vector-size bytes of accessible memory, and k is a mask that would prevent reading the inaccessible bytes from mem_addr, add UNSPEC_MASKLOAD to prevent the load from being transformed into vpblendd.

gcc/ChangeLog:
        PR target/110309
        * config/i386/sse.md (maskload<mode><avx512fmaskmodelower>): Refine pattern with UNSPEC_MASKLOAD.
        (maskload<mode><avx512fmaskmodelower>): Ditto.
        (*<avx512>_load<mode>_mask): Extend mode iterator to VI12HFBF_AVX512VL.
        (*<avx512>_load<mode>): Ditto.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/pr110309.c: New test.

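As a sketch of the situation being protected (an assumed example, not the pr110309.c test case): a masked load whose inactive lanes may overlap an unmapped page must stay a masked load rather than become a full-width load feeding vpblendd:

    #include <immintrin.h>

    /* Load the first n < 8 ints of a buffer that may end exactly at a
       page boundary; the masked-off lanes must never be dereferenced.
       Requires AVX512F and AVX512VL.  */
    __m256i
    load_tail (const int *p, int n)
    {
      __mmask8 k = (__mmask8) ((1u << n) - 1);
      return _mm256_maskz_loadu_epi32 (k, p);
    }
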
2023-06-24  i386: Add alternate representation for {and,or,xor}b %ah,%dh.  (Roger Sayle; 1 file, +22/-0)

A patch that I'm working on to improve RTL simplifications in the middle-end results in the regression of pr78904-1b.c, due to changes in the canonical representation of high-byte (%ah, %bh, %ch, %dh) logic.  See also PR target/78904.  This patch avoids/prevents those failures by adding support for the alternate representation, duplicating the existing *<code>qi_ext<mode>_2 as *<code>qi_ext<mode>_3 (the new version also replacing any_or with any_logic to provide *andqi_ext<mode>_3 in the same pattern).  Removing the original pattern isn't trivial, as it's generated by define_split, but this can be investigated after the other pieces are approved.

The current representation of this instruction is:

(set (zero_extract:DI (reg/v:DI 87 [ aD.2763 ])
        (const_int 8 [0x8])
        (const_int 8 [0x8]))
     (subreg:DI (xor:QI (subreg:QI (zero_extract:DI (reg:DI 94)
                    (const_int 8 [0x8])
                    (const_int 8 [0x8])) 0)
            (subreg:QI (zero_extract:DI (reg/v:DI 87 [ aD.2763 ])
                    (const_int 8 [0x8])
                    (const_int 8 [0x8])) 0)) 0))

after my proposed middle-end improvement, we attempt to recognize:

(set (zero_extract:DI (reg/v:DI 87 [ aD.2763 ])
        (const_int 8 [0x8])
        (const_int 8 [0x8]))
     (zero_extract:DI (xor:DI (reg:DI 94)
            (reg/v:DI 87 [ aD.2763 ]))
        (const_int 8 [0x8])
        (const_int 8 [0x8])))

2023-06-24  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386.md (*<code>qi_ext<mode>_3): New define_insn.

2023-06-22  i386: Convert ptestz of pandn into ptestc.  (Roger Sayle; 3 files, +91/-8)

This patch is the next installment in a set of backend patches around improvements to ptest/vptest.  A previous patch optimized the sequence t=pand(x,y); ptestz(t,t) into the equivalent ptestz(x,y), using the property that ZF is set to (X&Y) == 0.  This patch performs a similar transformation, converting t=pandn(x,y); ptestz(t,t) into the (almost) equivalent ptestc(y,x), using the property that the CF flag is set to (~X&Y) == 0.  The tricky bit is that this sets the CF flag instead of the ZF flag, so we can only perform this transformation when we can also convert the flags consumer, as well as the producer.

For the test case:

int foo (__m128i x, __m128i y)
{
  __m128i a = x & ~y;
  return __builtin_ia32_ptestz128 (a, a);
}

With -O2 -msse4.1 we previously generated:

foo:    pandn   %xmm0, %xmm1
        xorl    %eax, %eax
        ptest   %xmm1, %xmm1
        sete    %al
        ret

with this patch we now generate:

foo:    xorl    %eax, %eax
        ptest   %xmm0, %xmm1
        setc    %al
        ret

At the same time, this patch also provides alternative fixes for PR target/109973 and PR target/110118, by recognizing that ptestc(x,x) always sets the carry flag (X&~X is always zero).  This is achieved both by recognizing the special case in ix86_expand_sse_ptest and with a splitter to convert an eligible ptest into an stc.

2023-06-22  Roger Sayle  <roger@nextmovesoftware.com>
            Uros Bizjak  <ubizjak@gmail.com>

gcc/ChangeLog
        * config/i386/i386-expand.cc (ix86_expand_sse_ptest): Recognize expansion of ptestc with equal operands as producing const1_rtx.
        * config/i386/i386.cc (ix86_rtx_costs): Provide accurate cost estimates of UNSPEC_PTEST, where the ptest performs the PAND or PANDN of its operands.
        * config/i386/sse.md (define_split): Transform CCCmode UNSPEC_PTEST of reg_equal_p operands into an x86_stc instruction.
        (define_split): Split pandn/ptestz/set{n?}e into ptestc/set{n?}c.
        (define_split): Similar to above for strict_low_part destinations.
        (define_split): Split pandn/ptestz/j{n?}e into ptestc/j{n?}c.

gcc/testsuite/ChangeLog
        * gcc.target/i386/avx-vptest-4.c: New test case.
        * gcc.target/i386/avx-vptest-5.c: Likewise.
        * gcc.target/i386/avx-vptest-6.c: Likewise.
        * gcc.target/i386/pr109973-1.c: Update test case.
        * gcc.target/i386/pr109973-2.c: Likewise.
        * gcc.target/i386/sse4_1-ptest-4.c: New test case.
        * gcc.target/i386/sse4_1-ptest-5.c: Likewise.
        * gcc.target/i386/sse4_1-ptest-6.c: Likewise.

2023-06-21  [i386] Reject too large vectors for partial vector vectorization  (Richard Biener; 1 file, +26/-0)

The following works around the lack of the x86 backend making the vectorizer compare the costs of the different possible vector sizes the backend advertises through the vector_modes hook.  When enabling masked epilogues or main loops, this means we will select the preferred vector mode, which is usually the largest, even for loops that do not iterate close to the number of lanes the vector has.  When not using masking the vectorizer would reject any mode resulting in a VF bigger than the number of iterations, but with masking those excess lanes are simply masked out.

So this overloads the finish_cost function and matches for the problematic case, forcing a high cost to make us try a smaller vector size.

        * config/i386/i386.cc (ix86_vector_costs::finish_cost): Overload.  For masked main loops make sure the vectorization factor isn't more than double the number of iterations.

        * gcc.target/i386/vect-partial-vectors-1.c: New testcase.
        * gcc.target/i386/vect-partial-vectors-2.c: Likewise.

2023-06-21  x86: make VPTERNLOG* usable on less than 512-bit operands with just AVX512F  (Jan Beulich; 2 files, +27/-11)

There's no reason to constrain this to AVX512VL, unless instructed so by -mprefer-vector-width=, as the wider operation is unusable for more narrow operands only when the possible memory source is a non-broadcast one.  This way even the scalar copysign<mode>3 can benefit from the operation being a single-insn one (leaving aside moves which the compiler decides to insert for unclear reasons, and leaving aside the fact that bcst_mem_operand() is too restrictive for broadcast to be embedded right into VPTERNLOG*).

While there, also bring *<avx512>_vternlog<mode>_all in sync with the three splitters.

Along with this, also request value duplication in ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating excess space allocation in .rodata.*, filled with zeros which are never read.

gcc/
        * config/i386/i386-expand.cc (ix86_expand_copysign): Request value duplication by ix86_build_signbit_mask() when AVX512F and not HFmode.
        * config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to 2-alternative form.  Adjust "mode" attribute.  Add "enabled" attribute.
        (*<avx512>_vpternlog<mode>_1): Also permit when TARGET_AVX512F && !TARGET_PREFER_AVX256.
        (*<avx512>_vpternlog<mode>_2): Likewise.
        (*<avx512>_vpternlog<mode>_3): Likewise.

gcc/testsuite/
        * gcc.target/i386/avx512f-copysign.c: New test.

2023-06-20  x86: correct and improve "*vec_dupv2di"  (Jan Beulich; 1 file, +22/-6)

The input constraint for the %vmovddup alternative was wrong, as the upper 16 XMM registers require AVX512VL to be used with this insn.  To compensate, introduce a new alternative permitting all 32 registers, by broadcasting to the full 512 bits in that case if AVX512VL is not available.

gcc/
        * config/i386/sse.md (vec_dupv2di): Correct %vmovddup input constraint.  Add new AVX512F alternative.

gcc/testsuite/
        * gcc.target/i386/avx512f-dupv2di.c: New test.

2023-06-19  Refined 256/512-bit vpacksswb/vpackssdw patterns.  (liuhongt; 1 file, +147/-18)

The packing in vpacksswb/vpackssdw is not a simple concat, it's an interleave from src1 and src2 for every 128 bits (or 64 bits for the ss_truncate result), i.e.

dst[192-255] = ss_truncate (src2[128-255])
dst[128-191] = ss_truncate (src1[128-255])
dst[64-127]  = ss_truncate (src2[0-127])
dst[0-63]    = ss_truncate (src1[0-127])

The patch refines those patterns with an extra vec_select for the interleave.

gcc/ChangeLog:
        PR target/110235
        * config/i386/sse.md (<sse2_avx2>_packsswb<mask_name>): Substitute with ..
        (sse2_packsswb<mask_name>): .. this, ..
        (avx2_packsswb<mask_name>): .. this and ..
        (avx512bw_packsswb<mask_name>): .. this.
        (<sse2_avx2>_packssdw<mask_name>): Substitute with ..
        (sse2_packssdw<mask_name>): .. this, ..
        (avx2_packssdw<mask_name>): .. this and ..
        (avx512bw_packssdw<mask_name>): .. this.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/avx512bw-vpackssdw-3.c: New test.
        * gcc.target/i386/avx512bw-vpacksswb-3.c: New test.

2023-06-19  Reimplement packuswb/packusdw with UNSPEC_US_TRUNCATE instead of original us_truncate.  (liuhongt; 4 files, +59/-30)

packuswb/packusdw does unsigned saturation for a signed source, but RTL us_truncate means unsigned saturation for an unsigned source.  So for the value -1, packuswb will produce 0, but us_truncate produces 255.  The patch reimplements those related patterns and functions with UNSPEC_US_TRUNCATE instead of us_truncate.

gcc/ChangeLog:
        PR target/110235
        * config/i386/i386-expand.cc (ix86_split_mmx_pack): Use UNSPEC_US_TRUNCATE instead of original us_truncate for packusdw/packuswb.
        * config/i386/mmx.md (mmx_pack<s_trunsuffix>swb): Substitute with ..
        (mmx_packsswb): .. this and ..
        (mmx_packuswb): .. this.
        (mmx_packusdw): Use UNSPEC_US_TRUNCATE instead of original us_truncate.
        (s_trunsuffix): Removed code iterator.
        (any_s_truncate): Ditto.
        * config/i386/sse.md (<sse2_avx2>_packuswb<mask_name>): Use UNSPEC_US_TRUNCATE instead of original us_truncate.
        (<sse4_1_avx2>_packusdw<mask_name>): Ditto.
        * config/i386/i386.md (UNSPEC_US_TRUNCATE): New unspec_c_enum.

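To make the semantic difference concrete, a minimal hedged example using the standard SSE2 intrinsic (not taken from the patch): packing words that are all -1 saturates the signed inputs to 0, whereas a literal us_truncate of the unsigned value 0xffff would yield 255:

    #include <emmintrin.h>
    #include <stdio.h>

    int
    main (void)
    {
      __m128i v = _mm_set1_epi16 (-1);
      __m128i r = _mm_packus_epi16 (v, v);      /* packuswb semantics */
      /* Signed -1 saturates to 0 under packuswb, so this prints 0.  */
      printf ("%d\n", (unsigned char) _mm_cvtsi128_si32 (r));
      return 0;
    }
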
2023-06-18  i386: Refactor new ix86_expand_carry to set the carry flag.  (Roger Sayle; 3 files, +19/-14)

This patch refactors the three places in the i386.md backend that we set the carry flag into a new ix86_expand_carry helper function, that allows Jakub's recently added uaddc<mode>5 and usubc<mode>5 expanders to take advantage of the recently added support for the stc instruction.

2023-06-18  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386-expand.cc (ix86_expand_carry): New helper function for setting the carry flag.
        (ix86_expand_builtin) <handlecarry>: Use it here.
        * config/i386/i386-protos.h (ix86_expand_carry): Prototype here.
        * config/i386/i386.md (uaddc<mode>5): Use ix86_expand_carry.
        (usubc<mode>5): Likewise.

2023-06-18  i386: Standardize shift amount constants as QImode in i386.md.  (Roger Sayle; 1 file, +6/-6)

This clean-up improves consistency within i386.md by using QImode for the constant shift count in patterns that specify a mode.

2023-06-18  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386.md (*concat<mode><dwi>3_1): Use QImode for the immediate constant shift count.
        (*concat<mode><dwi>3_2): Likewise.
        (*concat<mode><dwi>3_3): Likewise.
        (*concat<mode><dwi>3_4): Likewise.
        (*concat<mode><dwi>3_5): Likewise.
        (*concat<mode><dwi>3_6): Likewise.

2023-06-17  i386: Two minor tweaks to ix86_expand_move.  (Roger Sayle; 1 file, +6/-4)

This patch splits out two (independent) minor changes to i386-expand.cc's ix86_expand_move from a larger patch, given that it's better to review and commit these independent pieces separately from a more complex patch.

The first change is to test for CONST_WIDE_INT_P before calling ix86_convert_const_wide_int_to_broadcast.  Whilst stepping through this function in gdb, I was surprised that the code was continually jumping into this function with operands that obviously weren't appropriate.

The second change is to generalize the optimization for efficiently moving a TImode value to V1TImode (via V2DImode), to cover all 128-bit vector modes.  Hence for the test case:

typedef unsigned long uv2di __attribute__ ((__vector_size__ (16)));
uv2di foo2(__int128 x) { return (uv2di)x; }

we'd previously move via memory with:

foo2:   movq    %rdi, -24(%rsp)
        movq    %rsi, -16(%rsp)
        movdqa  -24(%rsp), %xmm0
        ret

with this patch we now generate with -O2 (the same as V1TImode):

foo2:   movq    %rdi, %xmm0
        movq    %rsi, %xmm1
        punpcklqdq      %xmm1, %xmm0
        ret

and with -O2 -msse4 the even better:

foo2:   movq    %rdi, %xmm0
        pinsrq  $1, %rsi, %xmm0
        ret

The new test case is unimaginatively called sse2-v1ti-mov-2.c given the original test case just for V1TI mode was called sse2-v1ti-mov-1.c.

2023-06-17  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386-expand.cc (ix86_expand_move): Check that OP1 is CONST_WIDE_INT_P before calling ix86_convert_const_wide_int_to_broadcast.  Generalize special case for converting TImode to V1TImode to handle all 128-bit vector conversions.

gcc/testsuite/ChangeLog
        * gcc.target/i386/sse2-v1ti-mov-2.c: New test case.

2023-06-16  PR target/31985: Improve memory operand use with doubleword add.  (Roger Sayle; 1 file, +30/-0)

This patch addresses the last remaining issue with PR target/31985, that GCC could make better use of memory addressing modes when implementing double word addition.  This is achieved by adding a define_insn_and_split that combines an *add<dwi>3_doubleword with a *concat<mode><dwi>3, so that the components of the concat can be used directly, without first being loaded into a double word register.

For test_c in the bugzilla PR:

Before:
        pushl   %ebx
        subl    $16, %esp
        movl    28(%esp), %eax
        movl    36(%esp), %ecx
        movl    32(%esp), %ebx
        movl    24(%esp), %edx
        addl    %ecx, %eax
        adcl    %ebx, %edx
        movl    %eax, 8(%esp)
        movl    %edx, 12(%esp)
        addl    $16, %esp
        popl    %ebx
        ret

After:
test_c:
        subl    $20, %esp
        movl    36(%esp), %eax
        movl    32(%esp), %edx
        addl    28(%esp), %eax
        adcl    24(%esp), %edx
        movl    %eax, 8(%esp)
        movl    %edx, 12(%esp)
        addl    $20, %esp
        ret

2023-06-16  Roger Sayle  <roger@nextmovesoftware.com>
            Uros Bizjak  <ubizjak@gmail.com>

gcc/ChangeLog
        PR target/31985
        * config/i386/i386.md (*add<dwi>3_doubleword_concat): New define_insn_and_split combine *add<dwi>3_doubleword with a *concat<mode><dwi>3 for more efficient lowering after reload.

gcc/testsuite/ChangeLog
        PR target/31985
        * gcc.target/i386/pr31985.c: New test case.

2023-06-16  Add MinGW option -mcrtdll= for choosing C RunTime DLL library  (Pali Rohár; 3 files, +49/-5)

It adjusts preprocessor, compile and link flags, which allows replacing the default -lmsvcrt library by another one provided by the MinGW runtime.

gcc/ChangeLog:
        * config/i386/mingw-w64.h (CPP_SPEC): Adjust for -mcrtdll=.
        (REAL_LIBGCC_SPEC): New define.
        * config/i386/mingw.opt: Add mcrtdll=.
        * config/i386/mingw32.h (CPP_SPEC): Adjust for -mcrtdll=.
        (REAL_LIBGCC_SPEC): Adjust for -mcrtdll=.
        (STARTFILE_SPEC): Adjust for -mcrtdll=.
        * doc/invoke.texi: Add mcrtdll= documentation.

Signed-off-by: Jonathan Yong <10walls@gmail.com>

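A usage sketch (the accepted CRT names are defined by the patch and documented in invoke.texi; ucrtbase is assumed here as one of them, and the cross-compiler name is illustrative):

    # assumed example: link against the UCRT instead of the default msvcrt
    i686-w64-mingw32-gcc -mcrtdll=ucrtbase -o hello.exe hello.c
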
2023-06-15  x86/AVX512: use VMOVDDUP for broadcast to V2DF  (Jan Beulich; 1 file, +2/-2)

As is already the case for the AVX/AVX2 form, VMOVDDUP - acting on double precision floating-point values - is more appropriate to use here, and it can also result in shorter insn encodings when the source is memory or %xmm0...%xmm7 and no masking is applied (in allowing a 2-byte VEX prefix then instead of a 3-byte one).

gcc/
        * config/i386/sse.md (<avx512>_vec_dup<mode><mask_name>): Use vmovddup.

2023-06-15  x86: add Bk and Br to comment list B's sub-chars  (Jan Beulich; 1 file, +2/-0)

gcc/
        * config/i386/constraints.md: Mention k and r for B.

2023-06-15  middle-end, i386: Pattern recognize add/subtract with carry [PR79173]  (Jakub Jelinek; 1 file, +69/-4)

The following patch introduces {add,sub}c5_optab and pattern recognizes various forms of add with carry and subtract with carry/borrow; see the pr79173-{1,2,3,4,5,6}.c tests for what is matched.  Primarily forms with 2 __builtin_add_overflow or __builtin_sub_overflow calls per limb (with just one for the least significant one), for add with carry even when it is hand written in C (for subtraction, reassoc seems to change it too much so that the pattern recognition doesn't work).  A sketch of the matched shape follows after the ChangeLog.  __builtin_{add,sub}_overflow are standardized in C23 under the ckd_{add,sub} names, so it isn't any longer a GNU-only extension.

Note, clang has for these (IMHO badly designed) __builtin_{add,sub}c{b,s,,l,ll} builtins which don't add/subtract just a single bit of carry, but basically add 3 unsigned values or subtract 2 unsigned values from one, and result in carry out of 0, 1, or 2 because of that.  If we wanted to introduce those for clang compatibility, we could, and lower them early to just two __builtin_{add,sub}_overflow calls and let the pattern matching in this patch recognize it later.

I've added expanders for this on ix86 and in addition to that added various peephole2s (in preparation patches for this patch) to make sure we get nice (and small) code for the common cases.  I think there are other PRs which request that e.g. for the _{addcarry,subborrow}_u{32,64} intrinsics, which the patch also improves.

Would be nice if support for these optabs was added to many other targets; arm/aarch64 and powerpc* certainly have such instructions, and I'd expect in fact that most targets do.  The _BitInt support I'm working on will also need this to emit reasonable code.

2023-06-15  Jakub Jelinek  <jakub@redhat.com>

        PR middle-end/79173
        * internal-fn.def (UADDC, USUBC): New internal functions.
        * internal-fn.cc (expand_UADDC, expand_USUBC): New functions.
        (commutative_ternary_fn_p): Return true also for IFN_UADDC.
        * optabs.def (uaddc5_optab, usubc5_optab): New optabs.
        * tree-ssa-math-opts.cc (uaddc_cast, uaddc_ne0, uaddc_is_cplxpart, match_uaddc_usubc): New functions.
        (math_opts_dom_walker::after_dom_children): Call match_uaddc_usubc for PLUS_EXPR, MINUS_EXPR, BIT_IOR_EXPR and BIT_XOR_EXPR unless other optimizations have been successful for those.
        * gimple-fold.cc (gimple_fold_call): Handle IFN_UADDC and IFN_USUBC.
        * fold-const-call.cc (fold_const_call): Likewise.
        * gimple-range-fold.cc (adjust_imagpart_expr): Likewise.
        * tree-ssa-dce.cc (eliminate_unnecessary_stmts): Likewise.
        * doc/md.texi (uaddc<mode>5, usubc<mode>5): Document new named patterns.
        * config/i386/i386.md (uaddc<mode>5, usubc<mode>5): New define_expand patterns.
        (*setcc_qi_addqi3_cconly_overflow_1_<mode>, *setccc): Split into NOTE_INSN_DELETED note rather than nop instruction.
        (*setcc_qi_negqi_ccc_1_<mode>, *setcc_qi_negqi_ccc_2_<mode>): Likewise.
        * gcc.target/i386/pr79173-1.c: New test.
        * gcc.target/i386/pr79173-2.c: New test.
        * gcc.target/i386/pr79173-3.c: New test.
        * gcc.target/i386/pr79173-4.c: New test.
        * gcc.target/i386/pr79173-5.c: New test.
        * gcc.target/i386/pr79173-6.c: New test.
        * gcc.target/i386/pr79173-7.c: New test.
        * gcc.target/i386/pr79173-8.c: New test.
        * gcc.target/i386/pr79173-9.c: New test.
        * gcc.target/i386/pr79173-10.c: New test.

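For illustration, a hand-written two-limb add-with-carry of the shape the new pattern recognition targets (an assumed sketch in the spirit of the pr79173 tests, not copied from them):

    void
    add2 (unsigned long xl, unsigned long xh,
          unsigned long yl, unsigned long yh,
          unsigned long *rl, unsigned long *rh)
    {
      unsigned long l, t, h;
      /* Least significant limb: a single __builtin_add_overflow.  */
      unsigned long c = __builtin_add_overflow (xl, yl, &l);
      /* Higher limb: two calls, adding the incoming carry bit; combining
         the two carries lets the whole thing match IFN_UADDC and expand
         via the new uaddc<mode>5 optab to an adc instruction.  */
      c = __builtin_add_overflow (xh, yh, &t)
          | __builtin_add_overflow (t, c, &h);
      (void) c;  /* final carry out, unused here */
      *rl = l;
      *rh = h;
    }
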
2023-06-15  i386: Add peephole2 patterns to improve subtract with borrow with memory destination [PR79173]  (Jakub Jelinek; 1 file, +151/-3)

This patch adds a subborrow<mode> alternative so that it can have a memory destination, and adds various peephole2s which help to match it.

2023-06-15  Jakub Jelinek  <jakub@redhat.com>

        PR middle-end/79173
        * config/i386/i386.md (subborrow<mode>): Add alternative with memory destination and add for it define_peephole2 TARGET_READ_MODIFY_WRITE/-Os patterns to prefer using memory destination in these patterns.

2023-06-15  i386: Add peephole2 patterns to improve add with carry or subtract with borrow with memory destination [PR79173]  (Jakub Jelinek; 1 file, +289/-0)

This patch adds various peephole2s which help to recognize add with carry or subtract with borrow with memory destination.

2023-06-14  Jakub Jelinek  <jakub@redhat.com>

        PR middle-end/79173
        * config/i386/i386.md (*sub<mode>_3, @add<mode>3_carry, addcarry<mode>, @sub<mode>3_carry, *add<mode>3_cc_overflow_1): Add define_peephole2 TARGET_READ_MODIFY_WRITE/-Os patterns to prefer using memory destination in these patterns.

2023-06-14  Use x instead of v for alternative 2 (v, BH) in mov<mode>_internal.  (liuhongt; 1 file, +1/-1)

Since there's no evex version for vpcmpeq ymm, ymm, ymm.

gcc/ChangeLog:
        PR target/110227
        * config/i386/sse.md (mov<mode>_internal): Use x instead of v for alternative 2 since there's no evex version for vpcmpeqd ymm, ymm, ymm.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/pr110227.c: New test.

2023-06-13  i386: Fix up whitespace in assembly  (Jakub Jelinek; 1 file, +3/-3)

I've noticed that standard_sse_constant_opcode emits some spurious whitespace around a tab, which isn't something that is done for any other instruction and looks wrong.

2023-06-13  Jakub Jelinek  <jakub@redhat.com>

        * config/i386/i386.cc (standard_sse_constant_opcode): Remove superfluous spaces around \t for vpcmpeqd.

2023-06-12  Update perf auto profile script  (Andi Kleen; 1 file, +8/-1)

- Fix gen_autofdo_event: The download URL for the Intel Perfmon Event list has changed, as well as the JSON format.  Also it now uses pattern matching to match CPUs.  Update the script to support all of this.
- Regenerate gcc-auto-profile with the latest published Intel model numbers, so it works with recent systems.
- So far it's still broken on hybrid systems.

contrib/ChangeLog:
        * gen_autofdo_event.py: Update for download server changes.

gcc/ChangeLog:
        * config/i386/gcc-auto-profile: Regenerate.

2023-06-12  Add missing vec_pack/unpacks patterns for _Float16 <-> int/float conversion.  (liuhongt; 1 file, +207/-9)

This patch only supports optabs for vector modes whose length >= 128.  32/64-bit vectors are handled by the BB vectorizer with truncmn2/extendmn2/fix{,uns}_truncmn2.

gcc/ChangeLog:
        * config/i386/sse.md (vec_pack<floatprefix>_float_<mode>): New expander.
        (vec_unpack_<fixprefix>fix_trunc_lo_<mode>): Ditto.
        (vec_unpack_<fixprefix>fix_trunc_hi_<mode>): Ditto.
        (vec_unpacks_lo_<mode>): Ditto.
        (vec_unpacks_hi_<mode>): Ditto.
        (sse_movlhps_<mode>): New define_insn.
        (ssse3_palignr<mode>_perm): Extend to V_128H.
        (V_128H): New mode iterator.
        (ssepackPHmode): New mode attribute.
        (vunpck_extract_mode): Ditto.
        (vpckfloat_concat_mode): Extend to VxSI/VxSF for _Float16.
        (vpckfloat_temp_mode): Ditto.
        (vpckfloat_op_mode): Ditto.
        (vunpckfixt_mode): Extend to VxHF.
        (vunpckfixt_model): Ditto.
        (vunpckfixt_extract_mode): Ditto.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/vec_pack_fp16-1.c: New test.
        * gcc.target/i386/vec_pack_fp16-2.c: New test.
        * gcc.target/i386/vec_pack_fp16-3.c: New test.

2023-06-09  Explicitly view_convert_expr mask to signed type when folding pblendvb builtins.  (liuhongt; 1 file, +3/-1)

Since mask < 0 will always be false for vector char when -funsigned-char, but vpblendvb needs to check the most significant bit, the patch explicitly VCEs the mask to vector signed char.

gcc/ChangeLog:
        PR target/110108
        * config/i386/i386.cc (ix86_gimple_fold_builtin): Explicitly view_convert_expr mask to signed type when folding pblendvb builtins.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/pr110108-2.c: New test.

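For context, a minimal example of the blend whose gimple fold needs the mask treated as signed (the standard SSE4.1 intrinsic, not the PR's test case): selection is by each mask byte's most significant bit, which a mask < 0 comparison only models for a signed element type:

    #include <smmintrin.h>

    /* Per byte: result = (top bit of mask byte set) ? b : a.  */
    __m128i
    blend (__m128i a, __m128i b, __m128i mask)
    {
      return _mm_blendv_epi8 (a, b, mask);
    }
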
2023-06-09  Fold _mm{,256,512}_abs_{epi8,epi16,epi32,epi64} into gimple ABSU_EXPR + VCE.  (liuhongt; 2 files, +23/-10)

r14-1145 folded the intrinsics into gimple ABS_EXPR, which has UB for TYPE_MIN, but PABSB will store the unsigned result into dst.  The patch uses ABSU_EXPR + VCE instead of ABS_EXPR.  Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT, since 64-bit vector absm2 is guarded with TARGET_MMX_WITH_SSE.

gcc/ChangeLog:
        PR target/110108
        * config/i386/i386.cc (ix86_gimple_fold_builtin): Fold _mm{,256,512}_abs_{epi8,epi16,epi32,epi64} into gimple ABSU_EXPR + VCE, don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT.
        * config/i386/i386-builtin.def: Replace CODE_FOR_nothing with real codename for __builtin_ia32_pabs{b,w,d}.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/pr110108.c: New test.
        * gcc.target/i386/pr110108-3.c: New test.
        * gcc.target/i386/pr109900.c: Adjust testcase.

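To see why plain ABS_EXPR is wrong here, consider the most negative byte (an assumed illustration, not the pr110108 test case): the instruction produces the unsigned value 128, while signed abs of -128 would be undefined:

    #include <tmmintrin.h>

    unsigned char
    abs_min (void)
    {
      __m128i v = _mm_set1_epi8 (-128);   /* INT8_MIN in every byte */
      __m128i r = _mm_abs_epi8 (v);       /* pabsb: each byte is 0x80 */
      return (unsigned char) _mm_cvtsi128_si32 (r);   /* 128, well defined */
    }
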
2023-06-08  i386: Fix endless recursion in ix86_expand_vector_init_general with MMX [PR110152]  (Jakub Jelinek; 1 file, +1/-1)

I'm getting

+FAIL: gcc.target/i386/3dnow-1.c (internal compiler error: Segmentation fault signal terminated program cc1)
+FAIL: gcc.target/i386/3dnow-1.c (test for excess errors)
+FAIL: gcc.target/i386/3dnow-2.c (internal compiler error: Segmentation fault signal terminated program cc1)
+FAIL: gcc.target/i386/3dnow-2.c (test for excess errors)
+FAIL: gcc.target/i386/mmx-1.c (internal compiler error: Segmentation fault signal terminated program cc1)
+FAIL: gcc.target/i386/mmx-1.c (test for excess errors)
+FAIL: gcc.target/i386/mmx-2.c (internal compiler error: Segmentation fault signal terminated program cc1)
+FAIL: gcc.target/i386/mmx-2.c (test for excess errors)

regressions on i686-linux since r14-1166.  The problem is when ix86_expand_vector_init_general is called with mmx_ok = true and mode = V4HImode, it newly recurses with mmx_ok = false and mode = V2SImode, but as mmx_ok is false and !TARGET_SSE, we recurse again with the same arguments (ok, fresh new tmp and vals) infinitely.  The following patch fixes that by passing mmx_ok to that recursive call.  For n_words == 4 it isn't needed, because we only care about mmx_ok for V2SImode or V2SFmode and no other modes.

2023-06-08  Jakub Jelinek  <jakub@redhat.com>

        PR target/110152
        * config/i386/i386-expand.cc (ix86_expand_vector_init_general): For n_words == 2 recurse with mmx_ok as first argument rather than false.

2023-06-07  Add support for stc and cmc instructions in i386.md  (Roger Sayle; 5 files, +126/-4)

This patch is the latest revision of my patch to add support for the STC (set carry flag) and CMC (complement carry flag) instructions to the i386 backend, incorporating Uros' previous feedback.  The significant changes are (i) the inclusion of CMC, (ii) the use of UNSPEC for the pattern, (iii) use of a new X86_TUNE_SLOW_STC tuning flag to use alternate implementations on pentium4 (which has a notoriously slow STC) when not optimizing for size.

An example of the use of the stc instruction is:

unsigned int foo (unsigned int a, unsigned int b, unsigned int *c)
{
  return __builtin_ia32_addcarryx_u32 (1, a, b, c);
}

which previously generated:

        movl    $1, %eax
        addb    $-1, %al
        adcl    %esi, %edi
        setc    %al
        movl    %edi, (%rdx)
        movzbl  %al, %eax
        ret

with this patch now generates:

        stc
        adcl    %esi, %edi
        setc    %al
        movl    %edi, (%rdx)
        movzbl  %al, %eax
        ret

An example of the use of the cmc instruction (where the carry from a first adc is inverted/complemented as input to a second adc) is:

unsigned int bar (unsigned int a, unsigned int b,
                  unsigned int c, unsigned int d)
{
  unsigned int c1 = __builtin_ia32_addcarryx_u32 (1, a, b, &o1);
  return __builtin_ia32_addcarryx_u32 (c1 ^ 1, c, d, &o2);
}

which previously generated:

        movl    $1, %eax
        addb    $-1, %al
        adcl    %esi, %edi
        setnc   %al
        movl    %edi, o1(%rip)
        addb    $-1, %al
        adcl    %ecx, %edx
        setc    %al
        movl    %edx, o2(%rip)
        movzbl  %al, %eax
        ret

and now generates:

        stc
        adcl    %esi, %edi
        cmc
        movl    %edi, o1(%rip)
        adcl    %ecx, %edx
        setc    %al
        movl    %edx, o2(%rip)
        movzbl  %al, %eax
        ret

This version implements Uros' suggestions/refinements: (i) Avoid the UNSPEC_CMC by using the canonical RTL idiom for *x86_cmc, (ii) Use peephole2s to convert x86_stc and *x86_cmc into alternate forms on TARGET_SLOW_STC CPUs (pentium4), when a suitable QImode register is available, (iii) Prefer the addqi_cconly_overflow idiom (addb $-1,%al) over negqi_ccc_1 (neg %al) for setting the carry from a QImode value.

These changes required two minor edits to i386.cc: ix86_cc_mode had to be tweaked to suggest CCCmode for the new *x86_cmc pattern, and *x86_cmc needed to be handled/parameterized in ix86_rtx_costs so that combine would appreciate that this complex RTL expression was actually a fast, single byte instruction [i.e. preferable].

2023-06-07  Roger Sayle  <roger@nextmovesoftware.com>
            Uros Bizjak  <ubizjak@gmail.com>

gcc/ChangeLog
        * config/i386/i386-expand.cc (ix86_expand_builtin) <handlecarry>: Use new x86_stc instruction when the carry flag must be set.
        * config/i386/i386.cc (ix86_cc_mode): Use CCCmode for *x86_cmc.
        (ix86_rtx_costs): Provide accurate rtx_costs for *x86_cmc.
        * config/i386/i386.h (TARGET_SLOW_STC): New define.
        * config/i386/i386.md (UNSPEC_STC): New UNSPEC for stc.
        (x86_stc): New define_insn.
        (define_peephole2): Convert x86_stc into alternate implementation on pentium4 without -Os when a QImode register is available.
        (*x86_cmc): New define_insn.
        (define_peephole2): Convert *x86_cmc into alternate implementation on pentium4 without -Os when a QImode register is available.
        (*setccc): New define_insn_and_split for a no-op CCCmode move.
        (*setcc_qi_negqi_ccc_1_<mode>): New define_insn_and_split to recognize (and eliminate) the carry flag being copied to itself.
        (*setcc_qi_negqi_ccc_2_<mode>): Likewise.
        * config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag.

gcc/testsuite/ChangeLog
        * gcc.target/i386/cmc-1.c: New test case.
        * gcc.target/i386/stc-1.c: Likewise.

2023-06-04  PR target/110083: Fix-up REG_EQUAL notes on COMPARE in STV.  (Roger Sayle; 1 file, +33/-0)

This patch fixes PR target/110083, an ICE-on-valid regression exposed by my recent PTEST improvements (to address PR target/109973).  The latent bug (admittedly mine) is that the scalar-to-vector (STV) pass doesn't update or delete REG_EQUAL notes attached to COMPARE instructions.  As a result the operands of COMPARE would be mismatched, with the register transformed to V1TImode, but the immediate operand left as const_wide_int, which is valid for TImode but not V1TImode.  This remained latent when the STV conversion converted the mode of the COMPARE to CCmode, with later passes recognizing the REG_EQUAL note is obviously invalid as the modes didn't match, but now that we (correctly) preserve the CCZmode on COMPARE, the mismatched operand modes trigger a sanity checking ICE downstream.

Fixed by updating (or deleting) any REG_EQUAL notes in convert_compare.

Before:
(expr_list:REG_EQUAL (compare:CCZ (reg:V1TI 119 [ ivin.29_38 ])
        (const_wide_int 0x80000000000000000000000000000000))

After:
(expr_list:REG_EQUAL (compare:CCZ (reg:V1TI 119 [ ivin.29_38 ])
        (const_vector:V1TI [
            (const_wide_int 0x80000000000000000000000000000000)
        ]))

2023-06-04  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR target/110083
        * config/i386/i386-features.cc (scalar_chain::convert_compare): Update or delete REG_EQUAL notes, converting CONST_INT and CONST_WIDE_INT immediate operands to a suitable CONST_VECTOR.

gcc/testsuite/ChangeLog
        PR target/110083
        * gcc.target/i386/pr110083.c: New test case.

2023-06-03  i386: Add missing vector truncate patterns [PR92658].  (liuhongt; 1 file, +21/-0)

Add missing insn patterns for v2si -> v2hi/v2qi and v2hi -> v2qi vector truncate.

gcc/ChangeLog:
        PR target/92658
        * config/i386/mmx.md (truncv2hiv2qi2): New define_insn.
        (truncv2si<mode>2): Ditto.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/pr92658-avx512bw-trunc-2.c: New test.

2023-06-01  PR target/109973: CCZmode and CCCmode variants of [v]ptest on x86.  (Roger Sayle; 6 files, +76/-28)

This is my proposed minimal fix for PR target/109973 (hopefully suitable for backporting) that follows Jakub Jelinek's suggestion that we introduce CCZmode and CCCmode variants of ptest and vptest, so that the i386 backend treats [v]ptest instructions similarly to testl instructions; using different CCmodes to indicate which condition flags are desired, and then relying on the RTL cmpelim pass to eliminate redundant tests.

This conveniently matches Intel's intrinsics, which provide different functions for retrieving different flags: _mm_testz_si128 tests the Z flag, _mm_testc_si128 tests the carry flag.  Currently we use the same instruction (pattern) for both, and unfortunately the *ptest<mode>_and optimization is only valid when the ptest/vptest instruction is used to set/test the Z flag.

The downside, as predicted by Jakub, is that GCC's cmpelim pass is currently COMPARE-centric and not able to merge the ptests from expressions such as _mm256_testc_si256 (a, b) + _mm256_testz_si256 (a, b), which is a known issue, PR target/80040.

2023-06-01  Roger Sayle  <roger@nextmovesoftware.com>
            Uros Bizjak  <ubizjak@gmail.com>

gcc/ChangeLog
        PR target/109973
        * config/i386/i386-builtin.def (__builtin_ia32_ptestz128): Use new CODE_FOR_sse4_1_ptestzv2di.
        (__builtin_ia32_ptestc128): Use new CODE_FOR_sse4_1_ptestcv2di.
        (__builtin_ia32_ptestz256): Use new CODE_FOR_avx_ptestzv4di.
        (__builtin_ia32_ptestc256): Use new CODE_FOR_avx_ptestcv4di.
        * config/i386/i386-expand.cc (ix86_expand_branch): Use CCZmode when expanding UNSPEC_PTEST to compare against zero.
        * config/i386/i386-features.cc (scalar_chain::convert_compare): Likewise generate CCZmode UNSPEC_PTESTs when converting comparisons.
        (general_scalar_chain::convert_insn): Use CCZmode for COMPARE result.
        (timode_scalar_chain::convert_insn): Use CCZmode for COMPARE result.
        * config/i386/i386-protos.h (ix86_match_ptest_ccmode): Prototype.
        * config/i386/i386.cc (ix86_match_ptest_ccmode): New predicate to check for suitable matching modes for the UNSPEC_PTEST pattern.
        * config/i386/sse.md (define_split): When splitting UNSPEC_MOVMSK to UNSPEC_PTEST, preserve the FLAG_REG mode as CCZ.
        (*<sse4_1>_ptest<mode>): Add asterisk to hide define_insn.  Remove ":CC" mode of FLAGS_REG, instead use ix86_match_ptest_ccmode.
        (<sse4_1>_ptestz<mode>): New define_expand to specify CCZ.
        (<sse4_1>_ptestc<mode>): New define_expand to specify CCC.
        (<sse4_1>_ptest<mode>): A define_expand using CC to preserve the current behavior.
        (*ptest<mode>_and): Specify CCZ to only perform this optimization when only the Z flag is required.

gcc/testsuite/ChangeLog
        PR target/109973
        * gcc.target/i386/pr109973-1.c: New test case.
        * gcc.target/i386/pr109973-2.c: Likewise.

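For reference, a minimal sketch of the two intrinsic flavours that now map to distinct CC modes (standard SSE4.1 intrinsics; the comments paraphrase the flag semantics described above):

    #include <smmintrin.h>

    int
    test_z (__m128i a, __m128i b)
    {
      return _mm_testz_si128 (a, b);   /* ZF: (a & b) == 0; now CCZmode */
    }

    int
    test_c (__m128i a, __m128i b)
    {
      return _mm_testc_si128 (a, b);   /* CF: (~a & b) == 0; now CCCmode */
    }
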
2023-05-30  i386: Fix misleading indentation in i386-expand.cc [PR110041]  (Uros Bizjak; 1 file, +12/-12)

gcc/ChangeLog:
        PR target/110041
        * config/i386/i386-expand.cc (ix86_expand_vecop_qihi2): Fix misleading indentation.

2023-05-29  i386: Also require TARGET_AVX512BW to generate truncv16hiv16qi2 [PR110021]  (Uros Bizjak; 1 file, +1/-1)

gcc/ChangeLog:
        PR target/110021
        * config/i386/i386-expand.cc (ix86_expand_vecop_qihi2): Also require TARGET_AVX512BW to generate truncv16hiv16qi2.

2023-05-27  Split notl + pbroadcast + pand to pbroadcast + pandn for more modes.  (liuhongt; 1 file, +6/-6)

r12-5595-gc39d77f252e895306ef88c1efb3eff04e4232554 added 2 splitters to transform notl + pbroadcast + pand into pbroadcast + pandn for VI124_AVX2, which leaves out all DI-element-size ones as well as all 512-bit ones.  This patch extends the splitters to VI_AVX2, which will handle DImode for AVX2, and V64QImode/V32HImode/V16SImode/V8DImode for AVX512.

gcc/ChangeLog:
        PR target/100711
        * config/i386/sse.md (*andnot<mode>3): Extend below splitter to VI_AVX2 to cover more modes.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/pr100711-2.c: Add v4di/v2di testcases.
        * gcc.target/i386/pr100711-3.c: New test.

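As an assumed illustration of the shape being split (this exact function is made up; the pr100711 tests exercise loops of this kind), masking with an inverted scalar lets the broadcast feed pandn directly, now also for 64-bit elements and 512-bit vectors:

    /* a[i] &= ~b: instead of notl + broadcast + pand, the splitter
       broadcasts b and folds the inversion into pandn.  */
    void
    mask_out (long long *a, long long b, int n)
    {
      for (int i = 0; i < n; i++)
        a[i] &= ~b;
    }
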
2023-05-27  Disable avoid_false_dep_for_bmi for atom and icelake (and later) core processors.  (liuhongt; 1 file, +2/-1)

lzcnt/tzcnt has been fixed since Skylake; popcnt has been fixed since Icelake.  At least for Icelake and later Intel Core processors, the errata tune is not needed.  And the tune isn't needed for ATOM either.

gcc/ChangeLog:
        * config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI): Remove ATOM and ICELAKE (and later) core processors.

2023-05-26  i386: Do not disable call to ix86_expand_vecop_qihi2  (Uros Bizjak; 1 file, +1/-1)

gcc/ChangeLog:
        * config/i386/i386-expand.cc (ix86_expand_vecop_qihi): Do not disable call to ix86_expand_vecop_qihi2.

2023-05-25  i386: Use 2x-wider modes when emulating QImode vector instructions  (Uros Bizjak; 2 files, +254/-169)

Rewrite ix86_expand_vecop_qihi2 to expand to 2x-wider (e.g. V16QI -> V16HImode) instructions when available.  Currently, the compiler generates the following assembly for V16QImode multiplication (-mavx2):

        vpunpcklbw      %xmm0, %xmm0, %xmm3
        vpunpcklbw      %xmm1, %xmm1, %xmm2
        vpunpckhbw      %xmm0, %xmm0, %xmm0
        movl    $255, %eax
        vpunpckhbw      %xmm1, %xmm1, %xmm1
        vpmullw %xmm3, %xmm2, %xmm2
        vmovd   %eax, %xmm3
        vpmullw %xmm0, %xmm1, %xmm1
        vpbroadcastw    %xmm3, %xmm3
        vpand   %xmm2, %xmm3, %xmm0
        vpand   %xmm1, %xmm3, %xmm3
        vpackuswb       %xmm3, %xmm0, %xmm0

and only with -mavx512bw -mavx512vl generates:

        vpmovzxbw       %xmm1, %ymm1
        vpmovzxbw       %xmm0, %ymm0
        vpmullw %ymm1, %ymm0, %ymm0
        vpmovwb %ymm0, %xmm0

The patched compiler generates more optimized code involving multiplication in a 2x-wider mode in cases where the missing truncate instruction has to be emulated with a permutation (-mavx2):

        vpmovzxbw       %xmm0, %ymm0
        vpmovzxbw       %xmm1, %ymm1
        movl    $255, %eax
        vpmullw %ymm1, %ymm0, %ymm1
        vmovd   %eax, %xmm0
        vpbroadcastw    %xmm0, %ymm0
        vpand   %ymm1, %ymm0, %ymm0
        vpackuswb       %ymm0, %ymm0, %ymm0
        vpermq  $216, %ymm0, %ymm0

The patch also adjusts the cost calculation of V*QImode emulations to account for the generation of 2x-wider mode instructions.

gcc/ChangeLog:
        * config/i386/i386-expand.cc (ix86_expand_vecop_qihi2): Rewrite to expand to 2x-wider (e.g. V16QI -> V16HImode) instructions when available.  Emulate truncation via ix86_expand_vec_perm_const_1 when native truncate insn is not available.
        (ix86_expand_vecop_qihi_partial) <case MULT>: Use pmovzx when available.  Trivially rename some variables.
        (ix86_expand_vecop_qihi): Unconditionally call ix86_expand_vecop_qihi2.
        * config/i386/i386.cc (ix86_multiplication_cost): Rewrite cost calculation of V*QImode emulations to account for generation of 2x-wider mode instructions.
        (ix86_shift_rotate_cost): Update cost calculation of V*QImode emulations to account for generation of 2x-wider mode instructions.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/avx512vl-pr95488-1.c: Revert 2023-05-18 change.

2023-05-25  i386: Fix incorrect intrinsic signature for AVX512 s{lli|rai|rli}  (Hu, Lin1; 4 files, +200/-167)

This patch aims to fix the incorrect intrinsic signatures for _mm{512|256|}_s{lli|rai|rli}_epi*.

gcc/ChangeLog:
        PR target/109173
        PR target/109174
        * config/i386/avx512bwintrin.h (_mm512_srli_epi16): Change type from int to const int or const int to const unsigned int.
        (_mm512_mask_srli_epi16): Ditto.
        (_mm512_slli_epi16): Ditto.
        (_mm512_mask_slli_epi16): Ditto.
        (_mm512_maskz_slli_epi16): Ditto.
        (_mm512_srai_epi16): Ditto.
        (_mm512_mask_srai_epi16): Ditto.
        (_mm512_maskz_srai_epi16): Ditto.
        * config/i386/avx512fintrin.h (_mm512_slli_epi64): Ditto.
        (_mm512_mask_slli_epi64): Ditto.
        (_mm512_maskz_slli_epi64): Ditto.
        (_mm512_srli_epi64): Ditto.
        (_mm512_mask_srli_epi64): Ditto.
        (_mm512_maskz_srli_epi64): Ditto.
        (_mm512_srai_epi64): Ditto.
        (_mm512_mask_srai_epi64): Ditto.
        (_mm512_maskz_srai_epi64): Ditto.
        (_mm512_slli_epi32): Ditto.
        (_mm512_mask_slli_epi32): Ditto.
        (_mm512_maskz_slli_epi32): Ditto.
        (_mm512_srli_epi32): Ditto.
        (_mm512_mask_srli_epi32): Ditto.
        (_mm512_maskz_srli_epi32): Ditto.
        (_mm512_srai_epi32): Ditto.
        (_mm512_mask_srai_epi32): Ditto.
        (_mm512_maskz_srai_epi32): Ditto.
        * config/i386/avx512vlbwintrin.h (_mm256_mask_srai_epi16): Ditto.
        (_mm256_maskz_srai_epi16): Ditto.
        (_mm_mask_srai_epi16): Ditto.
        (_mm_maskz_srai_epi16): Ditto.
        (_mm256_mask_slli_epi16): Ditto.
        (_mm256_maskz_slli_epi16): Ditto.
        (_mm_mask_slli_epi16): Ditto.
        (_mm_maskz_slli_epi16): Ditto.
        (_mm_maskz_srli_epi16): Ditto.
        * config/i386/avx512vlintrin.h (_mm256_mask_srli_epi32): Ditto.
        (_mm256_maskz_srli_epi32): Ditto.
        (_mm_mask_srli_epi32): Ditto.
        (_mm_maskz_srli_epi32): Ditto.
        (_mm256_mask_srli_epi64): Ditto.
        (_mm256_maskz_srli_epi64): Ditto.
        (_mm_mask_srli_epi64): Ditto.
        (_mm_maskz_srli_epi64): Ditto.
        (_mm256_mask_srai_epi32): Ditto.
        (_mm256_maskz_srai_epi32): Ditto.
        (_mm_mask_srai_epi32): Ditto.
        (_mm_maskz_srai_epi32): Ditto.
        (_mm256_srai_epi64): Ditto.
        (_mm256_mask_srai_epi64): Ditto.
        (_mm256_maskz_srai_epi64): Ditto.
        (_mm_srai_epi64): Ditto.
        (_mm_mask_srai_epi64): Ditto.
        (_mm_maskz_srai_epi64): Ditto.
        (_mm_mask_slli_epi32): Ditto.
        (_mm_maskz_slli_epi32): Ditto.
        (_mm_mask_slli_epi64): Ditto.
        (_mm_maskz_slli_epi64): Ditto.
        (_mm256_mask_slli_epi32): Ditto.
        (_mm256_maskz_slli_epi32): Ditto.
        (_mm256_mask_slli_epi64): Ditto.
        (_mm256_maskz_slli_epi64): Ditto.

gcc/testsuite/ChangeLog:
        PR target/109173
        PR target/109174
        * gcc.target/i386/pr109173-1.c: New test.
        * gcc.target/i386/pr109174-1.c: Ditto.

2023-05-24  i386: Add v<any_shift:insn>v4qi3 expander  (Uros Bizjak; 3 files, +27/-17)

Also, move the v<any_shift:insn>v8qi3 expander to a better place and enable it with TARGET_MMX_WITH_SSE.  Remove handling of V8QImode from ix86_expand_vecop_qihi2 since all partial QI->HI vector modes expand via ix86_expand_vecop_qihi_partial.

gcc/ChangeLog:
        * config/i386/i386-expand.cc (ix86_expand_vecop_qihi2): Remove handling of V8QImode.
        * config/i386/mmx.md (v<insn>v8qi3): Move from sse.md.  Call ix86_expand_vecop_qihi_partial.  Enable for TARGET_MMX_WITH_SSE.
        (v<insn>v4qi3): Ditto.
        * config/i386/sse.md (v<insn>v8qi3): Remove.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/vect-shiftv4qi.c (dg-options): Remove -ftree-vectorize.
        * gcc.target/i386/vect-shiftv8qi.c (dg-options): Ditto.
        * gcc.target/i386/vect-vshiftv4qi.c: New test.
        * gcc.target/i386/vect-vshiftv8qi.c: New test.

2023-05-24  target/109944 - avoid STLF fail for V16QImode CTOR expansion  (Richard Biener; 1 file, +6/-5)

The following dispatches to V2DImode CTOR expansion instead of using sets of (subreg:DI (reg:V16QI 146) [08]), which causes LRA to spill DImode and reload V16QImode.  The same applies for V8QImode or V4HImode construction from SImode parts, which happens during the 32bit libgcc build.

        PR target/109944
        * config/i386/i386-expand.cc (ix86_expand_vector_init_general): Perform final vector composition using ix86_expand_vector_init_general instead of setting the highpart and lowpart which causes spilling.

        * gcc.target/i386/pr109944-1.c: New testcase.
        * gcc.target/i386/pr109944-2.c: Likewise.

2023-05-24  Fold _mm{,256,512}_abs_{epi8,epi16,epi32,epi64} into gimple ABS_EXPR.  (liuhongt; 2 files, +71/-33)

Also for 64-bit vector abs intrinsics _mm_abs_{pi8,pi16,pi32}.

gcc/ChangeLog:
        PR target/109900
        * config/i386/i386.cc (ix86_gimple_fold_builtin): Fold _mm{,256,512}_abs_{epi8,epi16,epi32,epi64} and _mm_abs_{pi8,pi16,pi32} into gimple ABS_EXPR.
        (ix86_masked_all_ones): Handle 64-bit mask.
        * config/i386/i386-builtin.def: Replace icode of related non-mask simd abs builtins with CODE_FOR_nothing.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/pr109900.c: New test.

2023-05-23  Account for vector splat GPR->XMM move cost  (Richard Biener; 1 file, +4/-2)

The following also accounts for a GPR->XMM move cost for splat operations and properly guards eliding the cost when moving from memory only for SSE4.1 or HImode or larger operands.  This doesn't fix the PR fully yet.

        PR target/109944
        * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): For vector construction or splats apply GPR->XMM move costing.  QImode memory can be handled directly only with SSE4.1 pinsrb.

2023-05-23  i386: Add V8QI and V4QImode partial vector shift operations  (Uros Bizjak; 3 files, +69/-3)

Add V8QImode and V4QImode vector shift patterns that call into ix86_expand_vecop_qihi_partial.  Generate special sequences for constant count operands.

gcc/ChangeLog:
        * config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial): Call ix86_expand_vec_shift_qihi_constant for shifts with constant count operand.
        * config/i386/i386.cc (ix86_shift_rotate_cost): Handle V4QImode and V8QImode.
        * config/i386/mmx.md (<insn>v8qi3): New insn pattern.
        (<insn>v4qi3): Ditto.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/vect-shiftv4qi.c: New test.
        * gcc.target/i386/vect-shiftv8qi.c: New test.

2023-05-22  i386: Adjust emulated integer vector mode shift costs  (Uros Bizjak; 1 file, +64/-34)

The returned integer vector mode costs of emulated instructions in ix86_shift_rotate_cost are wrong and do not reflect the generated instruction sequences.  Rewrite the handling of different integer vector modes and different target ABIs to return real instruction counts in order to calculate better costs of various emulated modes.  Also add the cost of a memory read when the instruction in the sequence reads memory.

gcc/ChangeLog:
        * config/i386/i386.cc (ix86_shift_rotate_cost): Correct calculation of integer vector mode costs to reflect generated instruction sequences of different integer vector modes and different target ABIs.  Remove "speed" function argument.
        (ix86_rtx_costs): Update call for removed function argument.
        (ix86_vector_costs::add_stmt_cost): Ditto.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/sse2-shiftqihi-constant-1.c: Remove XFAILs.

2023-05-22  i386: Account for the memory read in V*QImode multiplication sequences  (Uros Bizjak; 1 file, +23/-8)

Add the cost of a memory read to the cost of V*QImode vector multiplication sequences.

gcc/ChangeLog:
        * config/i386/i386.cc (ix86_multiplication_cost): Add the cost of a memory read to the cost of V?QImode sequences.

2023-05-18  gcc/config/*: use _P() defines from tree.h  (Bernhard Reutner-Fischer; 4 files, +18/-22)

gcc/ChangeLog:
        * config/aarch64/aarch64.cc (aarch64_short_vector_p): Use _P defines from tree.h.  (aarch64_mangle_type): Ditto.
        * config/alpha/alpha.cc (alpha_in_small_data_p): Ditto.  (alpha_gimplify_va_arg_1): Ditto.
        * config/arc/arc.cc (arc_encode_section_info): Ditto.  (arc_is_aux_reg_p): Ditto.  (arc_is_uncached_mem_p): Ditto.  (arc_handle_aux_attribute): Ditto.
        * config/arm/arm.cc (arm_handle_isr_attribute): Ditto.  (arm_handle_cmse_nonsecure_call): Ditto.  (arm_set_default_type_attributes): Ditto.  (arm_is_segment_info_known): Ditto.  (arm_mangle_type): Ditto.
        * config/arm/unknown-elf.h (IN_NAMED_SECTION_P): Ditto.
        * config/avr/avr.cc (avr_lookup_function_attribute1): Ditto.  (avr_decl_absdata_p): Ditto.  (avr_insert_attributes): Ditto.  (avr_section_type_flags): Ditto.  (avr_encode_section_info): Ditto.
        * config/bfin/bfin.cc (bfin_handle_l2_attribute): Ditto.
        * config/bpf/bpf.cc (bpf_core_compute): Ditto.
        * config/c6x/c6x.cc (c6x_in_small_data_p): Ditto.
        * config/csky/csky.cc (csky_handle_isr_attribute): Ditto.  (csky_mangle_type): Ditto.
        * config/darwin-c.cc (darwin_pragma_unused): Ditto.
        * config/darwin.cc (is_objc_metadata): Ditto.
        * config/epiphany/epiphany.cc (epiphany_function_ok_for_sibcall): Ditto.
        * config/epiphany/epiphany.h (ROUND_TYPE_ALIGN): Ditto.
        * config/frv/frv.cc (frv_emit_movsi): Ditto.
        * config/gcn/gcn-tree.cc (gcn_lockless_update): Ditto.
        * config/gcn/gcn.cc (gcn_asm_output_symbol_ref): Ditto.
        * config/h8300/h8300.cc (h8300_encode_section_info): Ditto.
        * config/i386/i386-expand.cc: Ditto.
        * config/i386/i386.cc (type_natural_mode): Ditto.  (ix86_function_arg): Ditto.  (ix86_data_alignment): Ditto.  (ix86_local_alignment): Ditto.  (ix86_simd_clone_compute_vecsize_and_simdlen): Ditto.
        * config/i386/winnt-cxx.cc (i386_pe_type_dllimport_p): Ditto.  (i386_pe_type_dllexport_p): Ditto.  (i386_pe_adjust_class_at_definition): Ditto.
        * config/i386/winnt.cc (i386_pe_determine_dllimport_p): Ditto.  (i386_pe_binds_local_p): Ditto.  (i386_pe_section_type_flags): Ditto.
        * config/ia64/ia64.cc (ia64_encode_section_info): Ditto.  (ia64_gimplify_va_arg): Ditto.  (ia64_in_small_data_p): Ditto.
        * config/iq2000/iq2000.cc (iq2000_function_arg): Ditto.
        * config/lm32/lm32.cc (lm32_in_small_data_p): Ditto.
        * config/loongarch/loongarch.cc (loongarch_handle_model_attribute): Ditto.
        * config/m32c/m32c.cc (m32c_insert_attributes): Ditto.
        * config/mcore/mcore.cc (mcore_mark_dllimport): Ditto.  (mcore_encode_section_info): Ditto.
        * config/microblaze/microblaze.cc (microblaze_elf_in_small_data_p): Ditto.
        * config/mips/mips.cc (mips_output_aligned_decl_common): Ditto.
        * config/mmix/mmix.cc (mmix_encode_section_info): Ditto.
        * config/nvptx/nvptx.cc (nvptx_encode_section_info): Ditto.  (pass_in_memory): Ditto.  (nvptx_generate_vector_shuffle): Ditto.  (nvptx_lockless_update): Ditto.
        * config/pa/pa.cc (pa_function_arg_padding): Ditto.  (pa_function_value): Ditto.  (pa_function_arg): Ditto.
        * config/pa/pa.h (IN_NAMED_SECTION_P): Ditto.  (TEXT_SPACE_P): Ditto.
        * config/pa/som.h (MAKE_DECL_ONE_ONLY): Ditto.
        * config/pdp11/pdp11.cc (pdp11_return_in_memory): Ditto.
        * config/riscv/riscv.cc (riscv_in_small_data_p): Ditto.  (riscv_mangle_type): Ditto.
        * config/rl78/rl78.cc (rl78_insert_attributes): Ditto.  (rl78_addsi3_internal): Ditto.
        * config/rs6000/aix.h (ROUND_TYPE_ALIGN): Ditto.
        * config/rs6000/darwin.h (ROUND_TYPE_ALIGN): Ditto.
        * config/rs6000/freebsd64.h (ROUND_TYPE_ALIGN): Ditto.
        * config/rs6000/linux64.h (ROUND_TYPE_ALIGN): Ditto.
        * config/rs6000/rs6000-call.cc (rs6000_function_arg_boundary): Ditto.  (rs6000_function_arg_advance_1): Ditto.  (rs6000_function_arg): Ditto.  (rs6000_pass_by_reference): Ditto.
        * config/rs6000/rs6000-logue.cc (rs6000_function_ok_for_sibcall): Ditto.
        * config/rs6000/rs6000.cc (rs6000_data_alignment): Ditto.  (rs6000_set_default_type_attributes): Ditto.  (rs6000_elf_in_small_data_p): Ditto.  (IN_NAMED_SECTION): Ditto.  (rs6000_xcoff_encode_section_info): Ditto.  (rs6000_function_value): Ditto.  (invalid_arg_for_unprototyped_fn): Ditto.
        * config/s390/s390-c.cc (s390_fn_types_compatible): Ditto.  (s390_vec_n_elem): Ditto.
        * config/s390/s390.cc (s390_check_type_for_vector_abi): Ditto.  (s390_function_arg_integer): Ditto.  (s390_return_in_memory): Ditto.  (s390_encode_section_info): Ditto.
        * config/sh/sh.cc (sh_gimplify_va_arg_expr): Ditto.  (sh_function_value): Ditto.
        * config/sol2.cc (solaris_insert_attributes): Ditto.
        * config/sparc/sparc.cc (function_arg_slotno): Ditto.
        * config/sparc/sparc.h (ROUND_TYPE_ALIGN): Ditto.
        * config/stormy16/stormy16.cc (xstormy16_encode_section_info): Ditto.  (xstormy16_handle_below100_attribute): Ditto.
        * config/v850/v850.cc (v850_encode_section_info): Ditto.  (v850_insert_attributes): Ditto.
        * config/visium/visium.cc (visium_pass_by_reference): Ditto.  (visium_return_in_memory): Ditto.
        * config/xtensa/xtensa.cc (xtensa_multibss_section_type_flags): Ditto.

2023-05-18  i386: Add infrastructure for QImode partial vector mult and shift operations  (Uros Bizjak; 5 files, +143/-16)

QImode partial vector multiplications and shifts can be implemented using their HImode counterparts.  Add infrastructure to handle V8QImode and V4QImode vectors by extending (interleaving) their input operands to V8HImode, performing the V8HImode operation and truncating the output back to the original QImode vector.

The patch implements V8QImode and V4QImode multiplication for SSE2 targets, using a generic permutation to truncate the output operand, but still taking advantage of the VPMOVWB down-convert instruction, when available.

The patch also removes setting of the REG_EQUAL note to the last insn of the ix86_expand_vecop_qihi expander.  This is what generic code does automatically when a named pattern is expanded.

gcc/ChangeLog:
        * config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial): New.
        (ix86_expand_vecop_qihi): Add op2vec bool variable.  Do not set REG_EQUAL note.
        * config/i386/i386-protos.h (ix86_expand_vecop_qihi_partial): Add prototype.
        * config/i386/i386.cc (ix86_multiplication_cost): Handle V4QImode and V8QImode.
        * config/i386/mmx.md (mulv8qi3): New expander.
        (mulv4qi3): Ditto.
        * config/i386/sse.md (mulv8qi3): Remove.

gcc/testsuite/ChangeLog:
        * gcc.target/i386/avx512vl-pr95488-1.c: Adjust expected scan-assembler-times frequency and strings.
        * gcc.target/i386/vect-mulv4qi.c: New test.
        * gcc.target/i386/vect-mulv8qi.c: New test.

2023-05-17  i386: Fix up types in __builtin_{inf,huge_val,nan{,s},fabs,copysign}q builtins [PR109884]  (Jakub Jelinek; 1 file, +1/-1)

When _Float128 support was added to C++ for 13.1, the float128t_type_node tree was added - in C, float128_type_node and float128t_type_node are the same and represent both _Float128 and __float128, but in C++ they are distinct types which have different handling in the FEs.  When doing that change, I mistakenly forgot to change the FLOAT128 primitive type, which is used for the results of the __builtin_{inf,huge_val,nan{,s},fabs,copysign}q builtins and some of their arguments (and nothing else).  The following patch fixes that.

On ia64 we already use float128t_type_node for those builtins; on pa, while it has __float128, that type is the same as long double, so those builtins have long double types; and on powerpc it seems we don't have these builtins but instead define macros which map them to __builtin_*f128.  That will not work properly in C++; perhaps we should change those macros to be function-like and cast to __float128.

2023-05-17  Jakub Jelinek  <jakub@redhat.com>

        PR c++/109884
        * config/i386/i386-builtin-types.def (FLOAT128): Use float128t_type_node rather than float128_type_node.
        * c-c++-common/pr109884.c: New test.

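As a small sketch of the affected surface (these builtins are real; the point of the fix is that in C++ their type is now __float128, i.e. float128t_type_node, rather than _Float128):

    /* Compiles in C++ without a type mismatch after the fix:
       __builtin_infq returns __float128, which in C++ is a distinct
       type from _Float128 (in C they are the same type).  */
    __float128
    pos_inf (void)
    {
      return __builtin_infq ();
    }

    __float128
    mag_with_sign (__float128 x, __float128 s)
    {
      return __builtin_copysignq (__builtin_fabsq (x), s);
    }
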
2023-05-17  i386: Adjust emulated integer vector mode multiplication costs  (Uros Bizjak; 1 file, +44/-28)

The returned integer vector mode costs of emulated modes in ix86_multiplication_cost are wrong and do not reflect the generated instruction sequences.  Rewrite the handling of different integer vector modes and different target ABIs to return real instruction counts in order to calculate better costs of various emulated modes.

gcc/ChangeLog:
        * config/i386/i386.cc (ix86_multiplication_cost): Correct calculation of integer vector mode costs to reflect generated instruction sequences of different integer vector modes and different target ABIs.

2023-05-14  i386: Handle unsupported modes in ix86_widen_mult_cost [PR109807]  (Uros Bizjak; 1 file, +2/-3)

Revert my previous change that faked handling of V4HImode and V2SImode in ix86_widen_mult_cost, and rather return an arbitrary high value for unsupported modes.  This should prevent the cost estimator from selecting a non-existent vector widen multiply operation.

gcc/ChangeLog:
        PR target/109807
        * config/i386/i386.cc: Revert the 2023-05-11 change.
        (ix86_widen_mult_cost): Return high value instead of ICEing for unsupported modes.

gcc/testsuite/ChangeLog:
        PR target/109807
        * gcc.target/i386/pr109825.c: New test.