This patch contains a pair of (related) optimizations in i386.md that
allow us to generate better code for the example below (this is a step
towards fixing a bugzilla PR, but I've forgotten the number).
__int128 foo64(__int128 x, long long y)
{
__int128 t = (__int128)y << 64;
return x ^ t;
}
The hidden issue is that the RTL currently seen by reload contains
the sign extension of y from DImode to TImode, even though this is
dead (not required) for left shifts by more than WORD_SIZE bits.
(insn 11 8 12 2 (parallel [
(set (reg:TI 0 ax [orig:91 y ] [91])
(sign_extend:TI (reg:DI 1 dx [97])))
(clobber (reg:CC 17 flags))
(clobber (scratch:DI))
]) {extendditi2}
What makes this particularly undesirable is that the sign-extension
pattern above requires an additional DImode scratch register, indicated
by the clobber, which unnecessarily increases register pressure.
The proposed solution is to add a define_insn_and_split for such
left shifts (of sign or zero extensions) where only the highpart can be
non-zero; the redundant extension is eliminated, and the insn can
be split after reload, without scratch registers or early clobbers.
This (late split) exposes a second optimization opportunity where
setting the lowpart to zero can sometimes be combined/simplified with
the following instruction during peephole2.
For the test case above, we previously generated with -O2:
foo64: xorl %eax, %eax
xorq %rsi, %rdx
xorq %rdi, %rax
ret
with this patch, we now generate:
foo64: movq %rdi, %rax
xorq %rsi, %rdx
ret
Likewise for the related -m32 test case, we go from:
foo32: movl 12(%esp), %eax
movl %eax, %edx
xorl %eax, %eax
xorl 8(%esp), %edx
xorl 4(%esp), %eax
ret
to the improved:
foo32: movl 12(%esp), %edx
movl 4(%esp), %eax
xorl 8(%esp), %edx
ret
2023-06-26 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386.md (peephole2): Simplify zeroing a register
followed by an IOR, XOR or PLUS operation on it, into a move.
(*ashl<dwi>3_doubleword_highpart): New define_insn_and_split to
eliminate (and hide from reload) unnecessary word to doubleword
extensions that are followed by left shifts by sufficiently large,
but valid, bit counts.
gcc/testsuite/ChangeLog
* gcc.target/i386/ashldi3-1.c: New 32-bit test case.
* gcc.target/i386/ashlti3-2.c: New 64-bit test case.
|
|
For functions with the target attribute arch=*, the current logic sets their
tune to the -mtune value from the command line, so all target_clones get the
same tuning flags, which affects the performance of each clone. Override
tune with arch if tune was not explicitly specified, to get proper tuning
flags for target_clones.
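A minimal sketch of the affected situation (function and arch names made
up for illustration): each clone below should be tuned for its own arch
rather than inheriting -mtune from the command line.
__attribute__((target_clones ("arch=skylake", "arch=icelake-server", "default")))
int sum (const int *a, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++)
    s += a[i];   /* each resolved clone gets arch-appropriate tuning */
  return s;
}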
gcc/ChangeLog:
* config/i386/i386-options.cc (ix86_valid_target_attribute_tree):
Override tune_string with arch_string if tune_string is not
explicitly specified.
gcc/testsuite/ChangeLog:
* gcc.target/i386/mvc17.c: New test.
|
|
If mem_addr points to a memory region with less than the whole vector size
bytes of accessible memory and k is a mask that would prevent reading
the inaccessible bytes from mem_addr, add UNSPEC_MASKLOAD to prevent
the masked load from being transformed into vpblendd.
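An illustrative sketch (not the PR testcase), assuming only the first
five ints of buf are accessible:
#include <immintrin.h>
__m256i
load_first_five (const int *buf)
{
  /* The mask keeps vpmaskmovd from touching the inaccessible tail;
     rewriting this into a full-width load plus vpblendd could fault.  */
  __m256i mask = _mm256_setr_epi32 (-1, -1, -1, -1, -1, 0, 0, 0);
  return _mm256_maskload_epi32 (buf, mask);
}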
gcc/ChangeLog:
PR target/110309
* config/i386/sse.md (maskload<mode><avx512fmaskmodelower>):
Refine pattern with UNSPEC_MASKLOAD.
(maskload<mode><avx512fmaskmodelower>): Ditto.
(*<avx512>_load<mode>_mask): Extend mode iterator to
VI12HFBF_AVX512VL.
(*<avx512>_load<mode>): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr110309.c: New test.
|
|
A patch that I'm working on to improve RTL simplifications in the
middle-end results in the regression of pr78904-1b.c, due to changes in
the canonical representation of high-byte (%ah, %bh, %ch, %dh) logic.
See also PR target/78904.
This patch avoids/prevents those failures by adding support for the
alternate representation, duplicating the existing *<code>qi_ext<mode>_2
as *<code>qi_ext<mode>_3 (the new version also replacing any_or with
any_logic to provide *andqi_ext<mode>_3 in the same pattern). Removing
the original pattern isn't trivial, as it's generated by define_split,
but this can be investigated after the other pieces are approved.
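A source-level sketch of the kind of high-byte operation involved
(loosely modeled on pr78904-1b.c; the exact testcase may differ):
struct S { unsigned char lo, hi; };
struct S
xor_hi (struct S a, struct S b)
{
  a.hi ^= b.hi;   /* can be performed directly in %ah/%bh/%ch/%dh */
  return a;
}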
The current representation of this instruction is:
(set (zero_extract:DI (reg/v:DI 87 [ aD.2763 ])
(const_int 8 [0x8])
(const_int 8 [0x8]))
(subreg:DI (xor:QI (subreg:QI (zero_extract:DI (reg:DI 94)
(const_int 8 [0x8])
(const_int 8 [0x8])) 0)
(subreg:QI (zero_extract:DI (reg/v:DI 87 [ aD.2763 ])
(const_int 8 [0x8])
(const_int 8 [0x8])) 0)) 0))
After my proposed middle-end improvement, we attempt to recognize:
(set (zero_extract:DI (reg/v:DI 87 [ aD.2763 ])
(const_int 8 [0x8])
(const_int 8 [0x8]))
(zero_extract:DI (xor:DI (reg:DI 94)
(reg/v:DI 87 [ aD.2763 ]))
(const_int 8 [0x8])
(const_int 8 [0x8])))
2023-06-24 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386.md (*<code>qi_ext<mode>_3): New define_insn.
|
|
This patch is the next installment in a set of backend patches around
improvements to ptest/vptest. A previous patch optimized the sequence
t=pand(x,y); ptestz(t,t) into the equivalent ptestz(x,y), using the
property that ZF is set to (X&Y) == 0. This patch performs a similar
transformation, converting t=pandn(x,y); ptestz(t,t) into the (almost)
equivalent ptestc(y,x), using the property that the CF flag is set to
(~X&Y) == 0. The tricky bit is that this sets the CF flag instead of
the ZF flag, so we can only perform this transformation when we can
also convert the flags consumer, as well as the producer.
For the test case:
int foo (__m128i x, __m128i y)
{
__m128i a = x & ~y;
return __builtin_ia32_ptestz128 (a, a);
}
With -O2 -msse4.1 we previously generated:
foo: pandn %xmm0, %xmm1
xorl %eax, %eax
ptest %xmm1, %xmm1
sete %al
ret
with this patch we now generate:
foo: xorl %eax, %eax
ptest %xmm0, %xmm1
setc %al
ret
At the same time, this patch also provides alternative fixes for
PR target/109973 and PR target/110118, by recognizing that ptestc(x,x)
always sets the carry flag (X&~X is always zero). This is achieved
both by recognizing the special case in ix86_expand_sse_ptest and with
a splitter to convert an eligible ptest into an stc.
2023-06-22 Roger Sayle <roger@nextmovesoftware.com>
Uros Bizjak <ubizjak@gmail.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_sse_ptest): Recognize
expansion of ptestc with equal operands as producing const1_rtx.
* config/i386/i386.cc (ix86_rtx_costs): Provide accurate cost
estimates of UNSPEC_PTEST, where the ptest performs the PAND
or PANDN of its operands.
* config/i386/sse.md (define_split): Transform CCCmode UNSPEC_PTEST
of reg_equal_p operands into an x86_stc instruction.
(define_split): Split pandn/ptestz/set{n?}e into ptestc/set{n?}c.
(define_split): Similar to above for strict_low_part destinations.
(define_split): Split pandn/ptestz/j{n?}e into ptestc/j{n?}c.
gcc/testsuite/ChangeLog
* gcc.target/i386/avx-vptest-4.c: New test case.
* gcc.target/i386/avx-vptest-5.c: Likewise.
* gcc.target/i386/avx-vptest-6.c: Likewise.
* gcc.target/i386/pr109973-1.c: Update test case.
* gcc.target/i386/pr109973-2.c: Likewise.
* gcc.target/i386/sse4_1-ptest-4.c: New test case.
* gcc.target/i386/sse4_1-ptest-5.c: Likewise.
* gcc.target/i386/sse4_1-ptest-6.c: Likewise.
|
|
The following works around the x86 backend's failure to make the
vectorizer compare the costs of the different possible vector
sizes the backend advertises through the vector_modes hook. When
enabling masked epilogues or main loops this means we will
select the preferred vector mode, which is usually the largest, even
for loops that do not iterate anywhere near as many times as the
vector has lanes. When not using masking the vectorizer would reject
any mode resulting in a VF bigger than the number of iterations,
but with masking the excess lanes are simply masked out.
So this overloads the finish_cost function and matches for
the problematic case, forcing a high cost to make us try a
smaller vector size.
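A sketch of the problematic case (made-up example, not one of the new
testcases): with a masked 512-bit main loop, a short-trip-count loop
would leave most lanes masked out.
void
f (double *restrict a, const double *restrict b)
{
  /* Only 4 iterations: a VF of 8 (zmm) would mask out half the lanes
     on every iteration, so a smaller vector size is preferable.  */
  for (int i = 0; i < 4; i++)
    a[i] = b[i] * 2.0;
}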
* config/i386/i386.cc (ix86_vector_costs::finish_cost):
Overload. For masked main loops make sure the vectorization
factor isn't more than double the number of iterations.
* gcc.target/i386/vect-partial-vectors-1.c: New testcase.
* gcc.target/i386/vect-partial-vectors-2.c: Likewise.
|
|
There's no reason to constrain this to AVX512VL, unless instructed so by
-mprefer-vector-width=, as the wider operation is unusable for more
narrow operands only when the possible memory source is a non-broadcast
one. This way even the scalar copysign<mode>3 can benefit from the
operation being a single-insn one (leaving aside moves which the
compiler decides to insert for unclear reasons, and leaving aside the
fact that bcst_mem_operand() is too restrictive for broadcast to be
embedded right into VPTERNLOG*).
While there, also bring *<avx512>_vternlog<mode>_all's condition in sync
with that of the three splitters.
Along with this also request value duplication in
ix86_expand_copysign()'s call to ix86_build_signbit_mask(), eliminating
excess space allocation in .rodata.*, filled with zeros which are never
read.
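For illustration (a sketch, not the new testcase), the scalar copysign
that can now be a single VPTERNLOG:
double
csign (double x, double y)
{
  /* Copies y's sign bit onto x; with AVX512F this becomes one
     vpternlog using the sign-bit mask constant.  */
  return __builtin_copysign (x, y);
}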
gcc/
* config/i386/i386-expand.cc (ix86_expand_copysign): Request
value duplication by ix86_build_signbit_mask() when AVX512F and
not HFmode.
* config/i386/sse.md (*<avx512>_vternlog<mode>_all): Convert to
2-alternative form. Adjust "mode" attribute. Add "enabled"
attribute.
(*<avx512>_vpternlog<mode>_1): Also permit when TARGET_AVX512F
&& !TARGET_PREFER_AVX256.
(*<avx512>_vpternlog<mode>_2): Likewise.
(*<avx512>_vpternlog<mode>_3): Likewise.
gcc/testsuite/
* gcc.target/i386/avx512f-copysign.c: New test.
|
|
The input constraint for the %vmovddup alternative was wrong, as the
upper 16 XMM registers require AVX512VL to be used with this insn. To
compensate, introduce a new alternative permitting all 32 registers, by
broadcasting to the full 512 bits in that case if AVX512VL is not
available.
gcc/
* config/i386/sse.md (vec_dupv2di): Correct %vmovddup input
constraint. Add new AVX512F alternative.
gcc/testsuite/
* gcc.target/i386/avx512f-dupv2di.c: New test.
|
|
The packing in vpacksswb/vpackssdw is not a simple concat, it's an
interleave of src1 and src2 for every 128 bits (or 64 bits of the
ss_truncate result),
i.e.
dst[192-255] = ss_truncate (src2[128-255])
dst[128-191] = ss_truncate (src1[128-255])
dst[64-127] = ss_truncate (src2[0-127])
dst[0-63] = ss_truncate (src1[0-127])
The patch refines those patterns with an extra vec_select for the
interleave.
gcc/ChangeLog:
PR target/110235
* config/i386/sse.md (<sse2_avx2>_packsswb<mask_name>):
Substitute with ..
(sse2_packsswb<mask_name>): .. this, ..
(avx2_packsswb<mask_name>): .. this and ..
(avx512bw_packsswb<mask_name>): .. this.
(<sse2_avx2>_packssdw<mask_name>): Substitute with ..
(sse2_packssdw<mask_name>): .. this, ..
(avx2_packssdw<mask_name>): .. this and ..
(avx512bw_packssdw<mask_name>): .. this.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512bw-vpackssdw-3.c: New test.
* gcc.target/i386/avx512bw-vpacksswb-3.c: New test.
|
|
us_truncate.
packuswb/packusdw do unsigned saturation of a signed source, but RTL
us_truncate means unsigned saturation of an unsigned source.
So for the value -1, packuswb will produce 0, but us_truncate produces
255. The patch reimplements the related patterns and functions with
UNSPEC_US_TRUNCATE instead of us_truncate.
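An illustrative sketch of the semantic difference:
#include <emmintrin.h>
unsigned char
pack_minus_one (void)
{
  __m128i v = _mm_set1_epi16 (-1);
  /* packuswb saturates *signed* 16-bit inputs to the unsigned 8-bit
     range, so -1 packs to 0; us_truncate of the unsigned value 0xffff
     would yield 255 instead.  */
  __m128i p = _mm_packus_epi16 (v, v);
  return (unsigned char) _mm_cvtsi128_si32 (p);   /* 0, not 255 */
}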
gcc/ChangeLog:
PR target/110235
* config/i386/i386-expand.cc (ix86_split_mmx_pack): Use
UNSPEC_US_TRUNCATE instead of original us_truncate for
packusdw/packuswb.
* config/i386/mmx.md (mmx_pack<s_trunsuffix>swb): Substitute
with ..
(mmx_packsswb): .. this and ..
(mmx_packuswb): .. this.
(mmx_packusdw): Use UNSPEC_US_TRUNCATE instead of original
us_truncate.
(s_trunsuffix): Removed code iterator.
(any_s_truncate): Ditto.
* config/i386/sse.md (<sse2_avx2>_packuswb<mask_name>): Use
UNSPEC_US_TRUNCATE instead of original us_truncate.
(<sse4_1_avx2>_packusdw<mask_name>): Ditto.
* config/i386/i386.md (UNSPEC_US_TRUNCATE): New unspec_c_enum.
|
|
This patch refactors the three places in the i386.md backend where we
set the carry flag into a new ix86_expand_carry helper function, which
allows Jakub's recently added uaddc<mode>5 and usubc<mode>5 expanders
to take advantage of the recently added support for the stc instruction.
2023-06-18 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_carry): New helper
function for setting the carry flag.
(ix86_expand_builtin) <handlecarry>: Use it here.
* config/i386/i386-protos.h (ix86_expand_carry): Prototype here.
* config/i386/i386.md (uaddc<mode>5): Use ix86_expand_carry.
(usubc<mode>5): Likewise.
|
|
This clean-up improves consistency within i386.md by using QImode for
the constant shift count in patterns that specify a mode.
2023-06-18 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386.md (*concat<mode><dwi>3_1): Use QImode
for the immediate constant shift count.
(*concat<mode><dwi>3_2): Likewise.
(*concat<mode><dwi>3_3): Likewise.
(*concat<mode><dwi>3_4): Likewise.
(*concat<mode><dwi>3_5): Likewise.
(*concat<mode><dwi>3_6): Likewise.
|
|
This patch splits out two (independent) minor changes to i386-expand.cc's
ix86_expand_move from a larger patch, given that it's better to review
and commit these independent pieces separately from a more complex patch.
The first change is to test for CONST_WIDE_INT_P before calling
ix86_convert_const_wide_int_to_broadcast. Whilst stepping through
this function in gdb, I was surprised that the code was continually
jumping into this function with operands that obviously weren't
appropriate.
The second change is to generalize the optimization for efficiently
moving a TImode value to V1TImode (via V2DImode), to cover all 128-bit
vector modes.
Hence for the test case:
typedef unsigned long uv2di __attribute__ ((__vector_size__ (16)));
uv2di foo2(__int128 x) { return (uv2di)x; }
we'd previously move via memory with:
foo2: movq %rdi, -24(%rsp)
movq %rsi, -16(%rsp)
movdqa -24(%rsp), %xmm0
ret
with this patch we now generate with -O2 (the same as V1TImode):
foo2: movq %rdi, %xmm0
movq %rsi, %xmm1
punpcklqdq %xmm1, %xmm0
ret
and with -O2 -msse4 the even better:
foo2: movq %rdi, %xmm0
pinsrq $1, %rsi, %xmm0
ret
The new test case is unimaginatively called sse2-v1ti-mov-2.c given
the original test case just for V1TI mode was called sse2-v1ti-mov-1.c.
2023-06-17 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_move): Check that OP1 is
CONST_WIDE_INT_P before calling ix86_convert_const_wide_int_to_broadcast.
Generalize special case for converting TImode to V1TImode to handle
all 128-bit vector conversions.
gcc/testsuite/ChangeLog
* gcc.target/i386/sse2-v1ti-mov-2.c: New test case.
|
|
This patch addresses the last remaining issue with PR target/31985, that
GCC could make better use of memory addressing modes when implementing
double word addition. This is achieved by adding a define_insn_and_split
that combines an *add<dwi>3_doubleword with a *concat<mode><dwi>3, so
that the components of the concat can be used directly, without first
being loaded into a double word register.
For test_c in the bugzilla PR:
Before:
pushl %ebx
subl $16, %esp
movl 28(%esp), %eax
movl 36(%esp), %ecx
movl 32(%esp), %ebx
movl 24(%esp), %edx
addl %ecx, %eax
adcl %ebx, %edx
movl %eax, 8(%esp)
movl %edx, 12(%esp)
addl $16, %esp
popl %ebx
ret
After:
test_c:
subl $20, %esp
movl 36(%esp), %eax
movl 32(%esp), %edx
addl 28(%esp), %eax
adcl 24(%esp), %edx
movl %eax, 8(%esp)
movl %edx, 12(%esp)
addl $20, %esp
ret
2023-06-16 Roger Sayle <roger@nextmovesoftware.com>
Uros Bizjak <ubizjak@gmail.com>
gcc/ChangeLog
PR target/31985
* config/i386/i386.md (*add<dwi>3_doubleword_concat): New
define_insn_and_split combine *add<dwi>3_doubleword with
a *concat<mode><dwi>3 for more efficient lowering after reload.
gcc/testsuite/ChangeLog
PR target/31985
* gcc.target/i386/pr31985.c: New test case.
|
|
It adjusts the preprocess, compile and link flags, which allows replacing
the default -lmsvcrt library with another one provided by the MinGW runtime.
gcc/ChangeLog:
* config/i386/mingw-w64.h (CPP_SPEC): Adjust for -mcrtdll=.
(REAL_LIBGCC_SPEC): New define.
* config/i386/mingw.opt: Add mcrtdll=
* config/i386/mingw32.h (CPP_SPEC): Adjust for -mcrtdll=.
(REAL_LIBGCC_SPEC): Adjust for -mcrtdll=.
(STARTFILE_SPEC): Adjust for -mcrtdll=.
* doc/invoke.texi: Add mcrtdll= documentation.
Signed-off-by: Jonathan Yong <10walls@gmail.com>
|
|
Like is already the case for the AVX/AVX2 form, VMOVDDUP - acting on
double-precision floating-point values - is more appropriate to use here,
and it can also result in shorter insn encodings when the source is memory
or %xmm0...%xmm7 and no masking is applied (allowing a 2-byte VEX
prefix instead of a 3-byte one).
gcc/
* config/i386/sse.md (<avx512>_vec_dup<mode><mask_name>): Use
vmovddup.
|
|
gcc/
* config/i386/constraints.md: Mention k and r for B.
|
|
The following patch introduces {add,sub}c5_optab and pattern recognizes
various forms of add with carry and subtract with carry/borrow, see
pr79173-{1,2,3,4,5,6}.c tests on what is matched.
Primarily forms with 2 __builtin_add_overflow or __builtin_sub_overflow
calls per limb (with just one for the least significant one), for
add with carry even when it is hand written in C (for subtraction
reassoc seems to change it too much so that the pattern recognition
doesn't work). __builtin_{add,sub}_overflow are standardized in C23
under ckd_{add,sub} names, so it isn't any longer a GNU only extension.
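A sketch of the shape being matched (modeled on the pr79173-* tests, so
details may differ): one limb of a multi-word addition written with two
__builtin_add_overflow calls.
unsigned long
limb_add (unsigned long x, unsigned long y, unsigned long carry_in,
          unsigned long *carry_out)
{
  unsigned long r;
  unsigned long c1 = __builtin_add_overflow (x, y, &r);
  unsigned long c2 = __builtin_add_overflow (r, carry_in, &r);
  *carry_out = c1 + c2;   /* at most one of c1/c2 can be set */
  return r;
}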
Note, clang has for these (IMHO badly designed)
__builtin_{add,sub}c{b,s,,l,ll} builtins which don't add/subtract just
a single bit of carry, but basically add 3 unsigned values or
subtract 2 unsigned values from one, and result in carry out of 0, 1, or 2
because of that. If we wanted to introduce those for clang compatibility,
we could and lower them early to just two __builtin_{add,sub}_overflow
calls and let the pattern matching in this patch recognize it later.
I've added expanders for this on ix86 and in addition to that
added various peephole2s (in preparation patches for this patch) to make
sure we get nice (and small) code for the common cases. I think there are
other PRs which request that e.g. for the _{addcarry,subborrow}_u{32,64}
intrinsics, which the patch also improves.
Would be nice if support for these optabs was added to many other targets,
arm/aarch64 and powerpc* certainly have such instructions, I'd expect
in fact that most targets do.
The _BitInt support I'm working on will also need this to emit reasonable
code.
2023-06-15 Jakub Jelinek <jakub@redhat.com>
PR middle-end/79173
* internal-fn.def (UADDC, USUBC): New internal functions.
* internal-fn.cc (expand_UADDC, expand_USUBC): New functions.
(commutative_ternary_fn_p): Return true also for IFN_UADDC.
* optabs.def (uaddc5_optab, usubc5_optab): New optabs.
* tree-ssa-math-opts.cc (uaddc_cast, uaddc_ne0, uaddc_is_cplxpart,
match_uaddc_usubc): New functions.
(math_opts_dom_walker::after_dom_children): Call match_uaddc_usubc
for PLUS_EXPR, MINUS_EXPR, BIT_IOR_EXPR and BIT_XOR_EXPR unless
other optimizations have been successful for those.
* gimple-fold.cc (gimple_fold_call): Handle IFN_UADDC and IFN_USUBC.
* fold-const-call.cc (fold_const_call): Likewise.
* gimple-range-fold.cc (adjust_imagpart_expr): Likewise.
* tree-ssa-dce.cc (eliminate_unnecessary_stmts): Likewise.
* doc/md.texi (uaddc<mode>5, usubc<mode>5): Document new named
patterns.
* config/i386/i386.md (uaddc<mode>5, usubc<mode>5): New
define_expand patterns.
(*setcc_qi_addqi3_cconly_overflow_1_<mode>, *setccc): Split
into NOTE_INSN_DELETED note rather than nop instruction.
(*setcc_qi_negqi_ccc_1_<mode>, *setcc_qi_negqi_ccc_2_<mode>):
Likewise.
* gcc.target/i386/pr79173-1.c: New test.
* gcc.target/i386/pr79173-2.c: New test.
* gcc.target/i386/pr79173-3.c: New test.
* gcc.target/i386/pr79173-4.c: New test.
* gcc.target/i386/pr79173-5.c: New test.
* gcc.target/i386/pr79173-6.c: New test.
* gcc.target/i386/pr79173-7.c: New test.
* gcc.target/i386/pr79173-8.c: New test.
* gcc.target/i386/pr79173-9.c: New test.
* gcc.target/i386/pr79173-10.c: New test.
|
|
destination [PR79173]
This patch adds a subborrow<mode> alternative so that it can have a memory
destination and adds various peephole2s which help to match it.
2023-06-15 Jakub Jelinek <jakub@redhat.com>
PR middle-end/79173
* config/i386/i386.md (subborrow<mode>): Add alternative with
memory destination and add for it define_peephole2
TARGET_READ_MODIFY_WRITE/-Os patterns to prefer using memory
destination in these patterns.
|
|
borrow with memory destination [PR79173]
This patch adds various peephole2s which help to recognize add with
carry or subtract with borrow with memory destination.
2023-06-14 Jakub Jelinek <jakub@redhat.com>
PR middle-end/79173
* config/i386/i386.md (*sub<mode>_3, @add<mode>3_carry,
addcarry<mode>, @sub<mode>3_carry, *add<mode>3_cc_overflow_1): Add
define_peephole2 TARGET_READ_MODIFY_WRITE/-Os patterns to prefer
using memory destination in these patterns.
|
|
Since there's no evex version for vpcmpeq ymm, ymm, ymm.
gcc/ChangeLog:
PR target/110227
* config/i386/sse.md (mov<mode>_internal): Use x instead of v
for alternative 2 since there's no evex version for vpcmpeqd
ymm, ymm, ymm.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr110227.c: New test.
|
|
I've noticed that standard_sse_constant_opcode emits some spurious
whitespace around tab, that isn't something which is done for
any other instruction and looks wrong.
2023-06-13 Jakub Jelinek <jakub@redhat.com>
* config/i386/i386.cc (standard_sse_constant_opcode): Remove
superfluous spaces around \t for vpcmpeqd.
|
|
- Fix gen_autofdo_event: The download URL for the Intel Perfmon Event
list has changed, as well as the JSON format.
Also it now uses pattern matching to match CPUs. Update the script to support all of this.
- Regenerate gcc-auto-profile with the latest published Intel model
numbers, so it works with recent systems.
- So far it's still broken on hybrid systems
contrib/ChangeLog:
* gen_autofdo_event.py: Update for download server changes
gcc/ChangeLog
* config/i386/gcc-auto-profile: Regenerate.
|
|
This patch only supports optabs for vector modes whose length >= 128.
32/64-bit vectors are instead handled by the BB vectorizer with
truncmn2/extendmn2/fix{,uns}_truncmn2.
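An illustrative sketch (made-up example, not one of the new tests) of a
conversion loop the vec_pack_float expander lets the loop vectorizer
handle:
void
cvt (_Float16 *restrict dst, const int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (_Float16) src[i];   /* int -> _Float16 pack */
}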
gcc/ChangeLog:
* config/i386/sse.md (vec_pack<floatprefix>_float_<mode>): New expander.
(vec_unpack_<fixprefix>fix_trunc_lo_<mode>): Ditto.
(vec_unpack_<fixprefix>fix_trunc_hi_<mode>): Ditto.
(vec_unpacks_lo_<mode>): Ditto.
(vec_unpacks_hi_<mode>): Ditto.
(sse_movlhps_<mode>): New define_insn.
(ssse3_palignr<mode>_perm): Extend to V_128H.
(V_128H): New mode iterator.
(ssepackPHmode): New mode attribute.
(vunpck_extract_mode): Ditto.
(vpckfloat_concat_mode): Extend to VxSI/VxSF for _Float16.
(vpckfloat_temp_mode): Ditto.
(vpckfloat_op_mode): Ditto.
(vunpckfixt_mode): Extend to VxHF.
(vunpckfixt_model): Ditto.
(vunpckfixt_extract_mode): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/vec_pack_fp16-1.c: New test.
* gcc.target/i386/vec_pack_fp16-2.c: New test.
* gcc.target/i386/vec_pack_fp16-3.c: New test.
|
|
Since mask < 0 will always be false for vector char when
-funsigned-char, but vpblendvb needs to check the most significant
bit, the patch explicitly view-converts (VCE) the mask to vector signed char.
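An illustrative sketch of the folded builtin (not the PR testcase):
#include <immintrin.h>
__m128i
blend (__m128i a, __m128i b, __m128i mask)
{
  /* vpblendvb selects on the most significant bit of each mask byte;
     a folded "mask < 0" on a plain char vector would be always false
     under -funsigned-char without the view-convert to signed.  */
  return _mm_blendv_epi8 (a, b, mask);
}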
gcc/ChangeLog:
PR target/110108
* config/i386/i386.cc (ix86_gimple_fold_builtin): Explicitly
view_convert_expr mask to signed type when folding pblendvb
builtins.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr110108-2.c: New test.
|
|
r14-1145 folded the intrinsics into gimple ABS_EXPR, which has UB for
TYPE_MIN, but PABSB stores an unsigned result into dst. The patch
uses ABSU_EXPR + VCE instead of ABS_EXPR.
Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit
vector absm2 is guarded with TARGET_MMX_WITH_SSE.
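An illustrative sketch (not the PR testcase):
#include <immintrin.h>
__m128i
abs_bytes (__m128i x)
{
  /* pabsb of -128 stores the unsigned value 128 (0x80) into dst; a
     signed ABS_EXPR of -128 would be UB, hence ABSU_EXPR + VCE.  */
  return _mm_abs_epi8 (x);
}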
gcc/ChangeLog:
PR target/110108
* config/i386/i386.cc (ix86_gimple_fold_builtin): Fold
_mm{,256,512}_abs_{epi8,epi16,epi32,epi64} into gimple
ABSU_EXPR + VCE, don't fold _mm_abs_{pi8,pi16,pi32} w/o
TARGET_64BIT.
* config/i386/i386-builtin.def: Replace CODE_FOR_nothing with
real codename for __builtin_ia32_pabs{b,w,d}.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr110108.c: New test.
* gcc.target/i386/pr110108-3.c: New test.
* gcc.target/i386/pr109900.c: Adjust testcase.
|
|
[PR110152]
I'm getting
+FAIL: gcc.target/i386/3dnow-1.c (internal compiler error: Segmentation fault signal terminated program cc1)
+FAIL: gcc.target/i386/3dnow-1.c (test for excess errors)
+FAIL: gcc.target/i386/3dnow-2.c (internal compiler error: Segmentation fault signal terminated program cc1)
+FAIL: gcc.target/i386/3dnow-2.c (test for excess errors)
+FAIL: gcc.target/i386/mmx-1.c (internal compiler error: Segmentation fault signal terminated program cc1)
+FAIL: gcc.target/i386/mmx-1.c (test for excess errors)
+FAIL: gcc.target/i386/mmx-2.c (internal compiler error: Segmentation fault signal terminated program cc1)
+FAIL: gcc.target/i386/mmx-2.c (test for excess errors)
regressions on i686-linux since r14-1166. The problem is when
ix86_expand_vector_init_general is called with mmx_ok = true and
mode = V4HImode, it newly recurses with mmx_ok = false and mode = V2SImode,
but as mmx_ok is false and !TARGET_SSE, we recurse again with the same
arguments (ok, fresh new tmp and vals) infinitely.
The following patch fixes that by passing mmx_ok to that recursive call.
For n_words == 4 it isn't needed, because we only care about mmx_ok for
V2SImode or V2SFmode and no other modes.
2023-06-08 Jakub Jelinek <jakub@redhat.com>
PR target/110152
* config/i386/i386-expand.cc (ix86_expand_vector_init_general): For
n_words == 2 recurse with mmx_ok as first argument rather than false.
|
|
This patch is the latest revision of my patch to add support for the
STC (set carry flag) and CMC (complement carry flag) instructions to
the i386 backend, incorporating Uros' previous feedback. The significant
changes are (i) the inclusion of CMC, (ii) the use of an UNSPEC for the pattern,
(iii) Use of a new X86_TUNE_SLOW_STC tuning flag to use alternate
implementations on pentium4 (which has a notoriously slow STC) when
not optimizing for size.
An example of the use of the stc instruction is:
unsigned int foo (unsigned int a, unsigned int b, unsigned int *c) {
return __builtin_ia32_addcarryx_u32 (1, a, b, c);
}
which previously generated:
movl $1, %eax
addb $-1, %al
adcl %esi, %edi
setc %al
movl %edi, (%rdx)
movzbl %al, %eax
ret
with this patch now generates:
stc
adcl %esi, %edi
setc %al
movl %edi, (%rdx)
movzbl %al, %eax
ret
An example of the use of the cmc instruction (where the carry from
a first adc is inverted/complemented as input to a second adc) is:
unsigned int o1, o2;
unsigned int bar (unsigned int a, unsigned int b,
unsigned int c, unsigned int d)
{
unsigned int c1 = __builtin_ia32_addcarryx_u32 (1, a, b, &o1);
return __builtin_ia32_addcarryx_u32 (c1 ^ 1, c, d, &o2);
}
which previously generated:
movl $1, %eax
addb $-1, %al
adcl %esi, %edi
setnc %al
movl %edi, o1(%rip)
addb $-1, %al
adcl %ecx, %edx
setc %al
movl %edx, o2(%rip)
movzbl %al, %eax
ret
and now generates:
stc
adcl %esi, %edi
cmc
movl %edi, o1(%rip)
adcl %ecx, %edx
setc %al
movl %edx, o2(%rip)
movzbl %al, %eax
ret
This version implements Uros' suggestions/refinements. (i) Avoid the
UNSPEC_CMC by using the canonical RTL idiom for *x86_cmc, (ii) Use
peephole2s to convert x86_stc and *x86_cmc into alternate forms on
TARGET_SLOW_STC CPUs (pentium4), when a suitable QImode register is
available, (iii) Prefer the addqi_cconly_overflow idiom (addb $-1,%al)
over negqi_ccc_1 (neg %al) for setting the carry from a QImode value.
These changes required two minor edits to i386.cc: ix86_cc_mode had
to be tweaked to suggest CCCmode for the new *x86_cmc pattern, and
*x86_cmc needed to be handled/parameterized in ix86_rtx_costs so that
combine would appreciate that this complex RTL expression was actually
a fast, single byte instruction [i.e. preferable].
2023-06-07  Roger Sayle  <roger@nextmovesoftware.com>
Uros Bizjak <ubizjak@gmail.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_builtin) <handlecarry>:
Use new x86_stc instruction when the carry flag must be set.
* config/i386/i386.cc (ix86_cc_mode): Use CCCmode for *x86_cmc.
(ix86_rtx_costs): Provide accurate rtx_costs for *x86_cmc.
* config/i386/i386.h (TARGET_SLOW_STC): New define.
* config/i386/i386.md (UNSPEC_STC): New UNSPEC for stc.
(x86_stc): New define_insn.
(define_peephole2): Convert x86_stc into alternate implementation
on pentium4 without -Os when a QImode register is available.
(*x86_cmc): New define_insn.
(define_peephole2): Convert *x86_cmc into alternate implementation
on pentium4 without -Os when a QImode register is available.
(*setccc): New define_insn_and_split for a no-op CCCmode move.
(*setcc_qi_negqi_ccc_1_<mode>): New define_insn_and_split to
recognize (and eliminate) the carry flag being copied to itself.
(*setcc_qi_negqi_ccc_2_<mode>): Likewise.
* config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag.
gcc/testsuite/ChangeLog
* gcc.target/i386/cmc-1.c: New test case.
* gcc.target/i386/stc-1.c: Likewise.
|
|
This patch fixes PR target/110083, an ICE-on-valid regression exposed by
my recent PTEST improvements (to address PR target/109973). The latent
bug (admittedly mine) is that the scalar-to-vector (STV) pass doesn't update
or delete REG_EQUAL notes attached to COMPARE instructions. As a result
the operands of COMPARE would be mismatched, with the register transformed
to V1TImode, but the immediate operand left as const_wide_int, which is
valid for TImode but not V1TImode. This remained latent when the STV
conversion converted the mode of the COMPARE to CCmode, with later passes
recognizing the REG_EQUAL note is obviously invalid as the modes didn't
match, but now that we (correctly) preserve the CCZmode on COMPARE, the
mismatched operand modes trigger a sanity checking ICE downstream.
Fixed by updating (or deleting) any REG_EQUAL notes in convert_compare.
Before:
(expr_list:REG_EQUAL (compare:CCZ (reg:V1TI 119 [ ivin.29_38 ])
(const_wide_int 0x80000000000000000000000000000000))
After:
(expr_list:REG_EQUAL (compare:CCZ (reg:V1TI 119 [ ivin.29_38 ])
(const_vector:V1TI [
(const_wide_int 0x80000000000000000000000000000000)
]))
2023-06-04 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
PR target/110083
* config/i386/i386-features.cc (scalar_chain::convert_compare):
Update or delete REG_EQUAL notes, converting CONST_INT and
CONST_WIDE_INT immediate operands to a suitable CONST_VECTOR.
gcc/testsuite/ChangeLog
PR target/110083
* gcc.target/i386/pr110083.c: New test case.
|
|
Add missing insn patterns for v2si -> v2hi/v2qi and v2hi -> v2qi vector
truncations.
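An illustrative sketch of such a truncation, written with GCC's generic
vector extension (not the new testcase):
typedef unsigned int  v2si __attribute__ ((vector_size (8)));
typedef unsigned char v2qi __attribute__ ((vector_size (2)));
v2qi
trunc_v2si_v2qi (v2si x)
{
  return __builtin_convertvector (x, v2qi);   /* v2si -> v2qi */
}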
gcc/ChangeLog:
PR target/92658
* config/i386/mmx.md (truncv2hiv2qi2): New define_insn.
(truncv2si<mode>2): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr92658-avx512bw-trunc-2.c: New test.
|
|
This is my proposed minimal fix for PR target/109973 (hopefully suitable
for backporting) that follows Jakub Jelinek's suggestion that we introduce
CCZmode and CCCmode variants of ptest and vptest, so that the i386
backend treats [v]ptest instructions similarly to testl instructions;
using different CCmodes to indicate which condition flags are desired,
and then relying on the RTL cmpelim pass to eliminate redundant tests.
This conveniently matches Intel's intrinsics, that provide different
functions for retrieving different flags, _mm_testz_si128 tests the
Z flag, _mm_testc_si128 tests the carry flag. Currently we use the
same instruction (pattern) for both, and unfortunately the *ptest<mode>_and
optimization is only valid when the ptest/vptest instruction is used to
set/test the Z flag.
The downside, as predicted by Jakub, is that GCC's cmpelim pass is
currently COMPARE-centric and not able to merge the ptests from expressions
such as _mm256_testc_si256 (a, b) + _mm256_testz_si256 (a, b), which is a
known issue, PR target/80040.
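For illustration (a sketch, not the new testcases), the two intrinsics
read different flags of the same [v]ptest instruction, which the new
CCZmode/CCCmode patterns now model separately:
#include <immintrin.h>
int testz (__m128i a, __m128i b)
{
  return _mm_testz_si128 (a, b);   /* ZF: (a & b) == 0  */
}
int testc (__m128i a, __m128i b)
{
  return _mm_testc_si128 (a, b);   /* CF: (~a & b) == 0 */
}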
2023-06-01 Roger Sayle <roger@nextmovesoftware.com>
Uros Bizjak <ubizjak@gmail.com>
gcc/ChangeLog
PR target/109973
* config/i386/i386-builtin.def (__builtin_ia32_ptestz128): Use new
CODE_FOR_sse4_1_ptestzv2di.
(__builtin_ia32_ptestc128): Use new CODE_FOR_sse4_1_ptestcv2di.
(__builtin_ia32_ptestz256): Use new CODE_FOR_avx_ptestzv4di.
(__builtin_ia32_ptestc256): Use new CODE_FOR_avx_ptestcv4di.
* config/i386/i386-expand.cc (ix86_expand_branch): Use CCZmode
when expanding UNSPEC_PTEST to compare against zero.
* config/i386/i386-features.cc (scalar_chain::convert_compare):
Likewise generate CCZmode UNSPEC_PTESTs when converting comparisons.
(general_scalar_chain::convert_insn): Use CCZmode for COMPARE result.
(timode_scalar_chain::convert_insn): Use CCZmode for COMPARE result.
* config/i386/i386-protos.h (ix86_match_ptest_ccmode): Prototype.
* config/i386/i386.cc (ix86_match_ptest_ccmode): New predicate to
check for suitable matching modes for the UNSPEC_PTEST pattern.
* config/i386/sse.md (define_split): When splitting UNSPEC_MOVMSK
to UNSPEC_PTEST, preserve the FLAG_REG mode as CCZ.
(*<sse4_1>_ptest<mode>): Add asterisk to hide define_insn. Remove
":CC" mode of FLAGS_REG, instead use ix86_match_ptest_ccmode.
(<sse4_1>_ptestz<mode>): New define_expand to specify CCZ.
(<sse4_1>_ptestc<mode>): New define_expand to specify CCC.
(<sse4_1>_ptest<mode>): A define_expand using CC to preserve the
current behavior.
(*ptest<mode>_and): Specify CCZ to only perform this optimization
when only the Z flag is required.
gcc/testsuite/ChangeLog
PR target/109973
* gcc.target/i386/pr109973-1.c: New test case.
* gcc.target/i386/pr109973-2.c: Likewise.
|
|
gcc/ChangeLog:
PR target/110041
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi2):
Fix misleading indentation.
|
|
gcc/ChangeLog:
PR target/110021
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi2): Also require
TARGET_AVX512BW to generate truncv16hiv16qi2.
|
|
r12-5595-gc39d77f252e895306ef88c1efb3eff04e4232554 adds 2 splitters to
transform notl + pbroadcast + pand to pbroadcast + pandn for
VI124_AVX2, which leaves out all DI-element-size ones as
well as all 512-bit ones.
This patch extends the splitters to VI_AVX2, which will handle DImode for
AVX2, and V64QImode/V32HImode/V16SImode/V8DImode for AVX512.
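An illustrative sketch (modeled on the pr100711 tests, so details may
differ) of a now-covered DImode-element case:
typedef long long v4di __attribute__ ((vector_size (32)));
v4di
and_not_bcst (v4di x, long long y)
{
  return x & ~y;   /* ~y is broadcast, then combined via vpandn */
}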
gcc/ChangeLog:
PR target/100711
* config/i386/sse.md (*andnot<mode>3): Extend below splitter
to VI_AVX2 to cover more modes.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr100711-2.c: Add v4di/v2di testcases.
* gcc.target/i386/pr100711-3.c: New test.
|
|
lzcnt/tzcnt have been fixed since Skylake; popcnt has been fixed since
Icelake. At least for Icelake and later Intel Core processors, the
errata tune is not needed. The tune isn't needed for ATOM either.
gcc/ChangeLog:
* config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI):
Remove ATOM and ICELAKE(and later) core processors.
|
|
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi):
Do not disable call to ix86_expand_vecop_qihi2.
|
|
Rewrite ix86_expand_vecop_qihi2 to expand to 2x-wider (e.g. V16QI -> V16HImode)
instructions when available. Currently, the compiler generates the following
assembly for V16QImode multiplication (-mavx2):
vpunpcklbw %xmm0, %xmm0, %xmm3
vpunpcklbw %xmm1, %xmm1, %xmm2
vpunpckhbw %xmm0, %xmm0, %xmm0
movl $255, %eax
vpunpckhbw %xmm1, %xmm1, %xmm1
vpmullw %xmm3, %xmm2, %xmm2
vmovd %eax, %xmm3
vpmullw %xmm0, %xmm1, %xmm1
vpbroadcastw %xmm3, %xmm3
vpand %xmm2, %xmm3, %xmm0
vpand %xmm1, %xmm3, %xmm3
vpackuswb %xmm3, %xmm0, %xmm0
and only with -mavx512bw -mavx512vl generates:
vpmovzxbw %xmm1, %ymm1
vpmovzxbw %xmm0, %ymm0
vpmullw %ymm1, %ymm0, %ymm0
vpmovwb %ymm0, %xmm0
Patched compiler generates more optimized code involving multiplication
in 2x-wider mode in cases where missing truncate instruction has to be
emulated with a permutation (-mavx2):
vpmovzxbw %xmm0, %ymm0
vpmovzxbw %xmm1, %ymm1
movl $255, %eax
vpmullw %ymm1, %ymm0, %ymm1
vmovd %eax, %xmm0
vpbroadcastw %xmm0, %ymm0
vpand %ymm1, %ymm0, %ymm0
vpackuswb %ymm0, %ymm0, %ymm0
vpermq $216, %ymm0, %ymm0
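For reference, a source sketch that produces such a V16QImode
multiplication (illustrative, not the exact testcase):
typedef unsigned char v16qi __attribute__ ((vector_size (16)));
v16qi
mul (v16qi a, v16qi b)
{
  return a * b;
}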
The patch also adjusts cost calculation of V*QImode emulations to account
for generation of 2x-wider mode instructions.
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi2):
Rewrite to expand to 2x-wider (e.g. V16QI -> V16HImode)
instructions when available. Emulate truncation via
ix86_expand_vec_perm_const_1 when native truncate insn
is not available.
(ix86_expand_vecop_qihi_partial) <case MULT>: Use pmovzx
when available. Trivially rename some variables.
(ix86_expand_vecop_qihi): Unconditionally call ix86_expand_vecop_qihi2.
* config/i386/i386.cc (ix86_multiplication_cost): Rewrite cost
calculation of V*QImode emulations to account for generation of
2x-wider mode instructions.
(ix86_shift_rotate_cost): Update cost calculation of V*QImode
emulations to account for generation of 2x-wider mode instructions.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512vl-pr95488-1.c: Revert 2023-05-18 change.
|
|
This patch aims to fix the incorrect intrinsic signatures for
_mm{512|256|}_s{lli|rai|rli}_epi*.
gcc/ChangeLog:
PR target/109173
PR target/109174
* config/i386/avx512bwintrin.h (_mm512_srli_epi16): Change type from
int to const int or const int to const unsigned int.
(_mm512_mask_srli_epi16): Ditto.
(_mm512_slli_epi16): Ditto.
(_mm512_mask_slli_epi16): Ditto.
(_mm512_maskz_slli_epi16): Ditto.
(_mm512_srai_epi16): Ditto.
(_mm512_mask_srai_epi16): Ditto.
(_mm512_maskz_srai_epi16): Ditto.
* config/i386/avx512fintrin.h (_mm512_slli_epi64): Ditto.
(_mm512_mask_slli_epi64): Ditto.
(_mm512_maskz_slli_epi64): Ditto.
(_mm512_srli_epi64): Ditto.
(_mm512_mask_srli_epi64): Ditto.
(_mm512_maskz_srli_epi64): Ditto.
(_mm512_srai_epi64): Ditto.
(_mm512_mask_srai_epi64): Ditto.
(_mm512_maskz_srai_epi64): Ditto.
(_mm512_slli_epi32): Ditto.
(_mm512_mask_slli_epi32): Ditto.
(_mm512_maskz_slli_epi32): Ditto.
(_mm512_srli_epi32): Ditto.
(_mm512_mask_srli_epi32): Ditto.
(_mm512_maskz_srli_epi32): Ditto.
(_mm512_srai_epi32): Ditto.
(_mm512_mask_srai_epi32): Ditto.
(_mm512_maskz_srai_epi32): Ditto.
* config/i386/avx512vlbwintrin.h (_mm256_mask_srai_epi16): Ditto.
(_mm256_maskz_srai_epi16): Ditto.
(_mm_mask_srai_epi16): Ditto.
(_mm_maskz_srai_epi16): Ditto.
(_mm256_mask_slli_epi16): Ditto.
(_mm256_maskz_slli_epi16): Ditto.
(_mm_mask_slli_epi16): Ditto.
(_mm_maskz_slli_epi16): Ditto.
(_mm_maskz_srli_epi16): Ditto.
* config/i386/avx512vlintrin.h (_mm256_mask_srli_epi32): Ditto.
(_mm256_maskz_srli_epi32): Ditto.
(_mm_mask_srli_epi32): Ditto.
(_mm_maskz_srli_epi32): Ditto.
(_mm256_mask_srli_epi64): Ditto.
(_mm256_maskz_srli_epi64): Ditto.
(_mm_mask_srli_epi64): Ditto.
(_mm_maskz_srli_epi64): Ditto.
(_mm256_mask_srai_epi32): Ditto.
(_mm256_maskz_srai_epi32): Ditto.
(_mm_mask_srai_epi32): Ditto.
(_mm_maskz_srai_epi32): Ditto.
(_mm256_srai_epi64): Ditto.
(_mm256_mask_srai_epi64): Ditto.
(_mm256_maskz_srai_epi64): Ditto.
(_mm_srai_epi64): Ditto.
(_mm_mask_srai_epi64): Ditto.
(_mm_maskz_srai_epi64): Ditto.
(_mm_mask_slli_epi32): Ditto.
(_mm_maskz_slli_epi32): Ditto.
(_mm_mask_slli_epi64): Ditto.
(_mm_maskz_slli_epi64): Ditto.
(_mm256_mask_slli_epi32): Ditto.
(_mm256_maskz_slli_epi32): Ditto.
(_mm256_mask_slli_epi64): Ditto.
(_mm256_maskz_slli_epi64): Ditto.
gcc/testsuite/ChangeLog:
PR target/109173
PR target/109174
* gcc.target/i386/pr109173-1.c: New test.
* gcc.target/i386/pr109174-1.c: Ditto.
|
|
Also, move v<any_shift:insn>v8qi3 expander to a better place and enable
it with TARGET_MMX_WITH_SSE. Remove handling of V8QImode from
ix86_expand_vecop_qihi2 since all partial QI->HI vector modes expand
via ix86_expand_vecop_qihi_partial.
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi2):
Remove handling of V8QImode.
* config/i386/mmx.md (v<insn>v8qi3): Move from sse.md.
Call ix86_expand_vecop_qihi_partial. Enable for TARGET_MMX_WITH_SSE.
(v<insn>v4qi3): Ditto.
* config/i386/sse.md (v<insn>v8qi3): Remove.
gcc/testsuite/ChangeLog:
* gcc.target/i386/vect-shiftv4qi.c (dg-options):
Remove -ftree-vectorize.
* gcc.target/i386/vect-shiftv8qi.c (dg-options): Ditto.
* gcc.target/i386/vect-vshiftv4qi.c: New test.
* gcc.target/i386/vect-vshiftv8qi.c: New test.
|
|
The following dispatches to V2DImode CTOR expansion instead of
using sets of (subreg:DI (reg:V16QI 146) [08]) which causes
LRA to spill DImode and reload V16QImode. The same applies for
V8QImode or V4HImode construction from SImode parts which happens
during 32bit libgcc build.
PR target/109944
* config/i386/i386-expand.cc (ix86_expand_vector_init_general):
Perform final vector composition using
ix86_expand_vector_init_general instead of setting
the highpart and lowpart which causes spilling.
* gcc.target/i386/pr109944-1.c: New testcase.
* gcc.target/i386/pr109944-2.c: Likewise.
|
|
Also for 64-bit vector abs intrinsics _mm_abs_{pi8,pi16,pi32}.
gcc/ChangeLog:
PR target/109900
* config/i386/i386.cc (ix86_gimple_fold_builtin): Fold
_mm{,256,512}_abs_{epi8,epi16,epi32,epi64} and
_mm_abs_{pi8,pi16,pi32} into gimple ABS_EXPR.
(ix86_masked_all_ones): Handle 64-bit mask.
* config/i386/i386-builtin.def: Replace icode of related
non-mask simd abs builtins with CODE_FOR_nothing.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr109900.c: New test.
|
|
The following also accounts for a GPR->XMM move cost for splat
operations and properly guards eliding the cost when moving from
memory only for SSE4.1 or HImode or larger operands. This
doesn't fix the PR fully yet.
PR target/109944
* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
For vector construction or splats apply GPR->XMM move
costing. QImode memory can be handled directly only
with SSE4.1 pinsrb.
|
|
Add V8QImode and V4QImode vector shift patterns that call into
ix86_expand_vecop_qihi_partial. Generate special sequences
for constant count operands.
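An illustrative sketch (not the new testcases) of a V8QImode shift with
a constant count:
typedef unsigned char v8qi __attribute__ ((vector_size (8)));
v8qi
shl3 (v8qi a)
{
  /* Constant count: expanded via the special
     ix86_expand_vec_shift_qihi_constant sequence.  */
  return a << 3;
}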
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial):
Call ix86_expand_vec_shift_qihi_constant for shifts
with constant count operand.
* config/i386/i386.cc (ix86_shift_rotate_cost):
Handle V4QImode and V8QImode.
* config/i386/mmx.md (<insn>v8qi3): New insn pattern.
(<insn>v4qi3): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/vect-shiftv4qi.c: New test.
* gcc.target/i386/vect-shiftv8qi.c: New test.
|
|
Returned integer vector mode costs of emulated instructions in
ix86_shift_rotate_cost are wrong and do not reflect the generated
instruction sequences. Rewrite the handling of different integer vector
modes and different target ABIs to return real instruction
counts in order to calculate better costs of various emulated modes.
Also add the cost of a memory read, when the instruction in the
sequence reads memory.
gcc/ChangeLog:
* config/i386/i386.cc (ix86_shift_rotate_cost): Correct
calculation of integer vector mode costs to reflect generated
instruction sequences of different integer vector modes and
different target ABIs. Remove "speed" function argument.
(ix86_rtx_costs): Update call for removed function argument.
(ix86_vector_costs::add_stmt_cost): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/sse2-shiftqihi-constant-1.c: Remove XFAILs.
|
|
Add the cost of a memory read to the cost of V*QImode vector mult sequences.
gcc/ChangeLog:
* config/i386/i386.cc (ix86_multiplication_cost): Add
the cost of a memory read to the cost of V?QImode sequences.
|
|
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_short_vector_p): Use _P
defines from tree.h.
(aarch64_mangle_type): Ditto.
* config/alpha/alpha.cc (alpha_in_small_data_p): Ditto.
(alpha_gimplify_va_arg_1): Ditto.
* config/arc/arc.cc (arc_encode_section_info): Ditto.
(arc_is_aux_reg_p): Ditto.
(arc_is_uncached_mem_p): Ditto.
(arc_handle_aux_attribute): Ditto.
* config/arm/arm.cc (arm_handle_isr_attribute): Ditto.
(arm_handle_cmse_nonsecure_call): Ditto.
(arm_set_default_type_attributes): Ditto.
(arm_is_segment_info_known): Ditto.
(arm_mangle_type): Ditto.
* config/arm/unknown-elf.h (IN_NAMED_SECTION_P): Ditto.
* config/avr/avr.cc (avr_lookup_function_attribute1): Ditto.
(avr_decl_absdata_p): Ditto.
(avr_insert_attributes): Ditto.
(avr_section_type_flags): Ditto.
(avr_encode_section_info): Ditto.
* config/bfin/bfin.cc (bfin_handle_l2_attribute): Ditto.
* config/bpf/bpf.cc (bpf_core_compute): Ditto.
* config/c6x/c6x.cc (c6x_in_small_data_p): Ditto.
* config/csky/csky.cc (csky_handle_isr_attribute): Ditto.
(csky_mangle_type): Ditto.
* config/darwin-c.cc (darwin_pragma_unused): Ditto.
* config/darwin.cc (is_objc_metadata): Ditto.
* config/epiphany/epiphany.cc (epiphany_function_ok_for_sibcall): Ditto.
* config/epiphany/epiphany.h (ROUND_TYPE_ALIGN): Ditto.
* config/frv/frv.cc (frv_emit_movsi): Ditto.
* config/gcn/gcn-tree.cc (gcn_lockless_update): Ditto.
* config/gcn/gcn.cc (gcn_asm_output_symbol_ref): Ditto.
* config/h8300/h8300.cc (h8300_encode_section_info): Ditto.
* config/i386/i386-expand.cc: Ditto.
* config/i386/i386.cc (type_natural_mode): Ditto.
(ix86_function_arg): Ditto.
(ix86_data_alignment): Ditto.
(ix86_local_alignment): Ditto.
(ix86_simd_clone_compute_vecsize_and_simdlen): Ditto.
* config/i386/winnt-cxx.cc (i386_pe_type_dllimport_p): Ditto.
(i386_pe_type_dllexport_p): Ditto.
(i386_pe_adjust_class_at_definition): Ditto.
* config/i386/winnt.cc (i386_pe_determine_dllimport_p): Ditto.
(i386_pe_binds_local_p): Ditto.
(i386_pe_section_type_flags): Ditto.
* config/ia64/ia64.cc (ia64_encode_section_info): Ditto.
(ia64_gimplify_va_arg): Ditto.
(ia64_in_small_data_p): Ditto.
* config/iq2000/iq2000.cc (iq2000_function_arg): Ditto.
* config/lm32/lm32.cc (lm32_in_small_data_p): Ditto.
* config/loongarch/loongarch.cc (loongarch_handle_model_attribute): Ditto.
* config/m32c/m32c.cc (m32c_insert_attributes): Ditto.
* config/mcore/mcore.cc (mcore_mark_dllimport): Ditto.
(mcore_encode_section_info): Ditto.
* config/microblaze/microblaze.cc (microblaze_elf_in_small_data_p): Ditto.
* config/mips/mips.cc (mips_output_aligned_decl_common): Ditto.
* config/mmix/mmix.cc (mmix_encode_section_info): Ditto.
* config/nvptx/nvptx.cc (nvptx_encode_section_info): Ditto.
(pass_in_memory): Ditto.
(nvptx_generate_vector_shuffle): Ditto.
(nvptx_lockless_update): Ditto.
* config/pa/pa.cc (pa_function_arg_padding): Ditto.
(pa_function_value): Ditto.
(pa_function_arg): Ditto.
* config/pa/pa.h (IN_NAMED_SECTION_P): Ditto.
(TEXT_SPACE_P): Ditto.
* config/pa/som.h (MAKE_DECL_ONE_ONLY): Ditto.
* config/pdp11/pdp11.cc (pdp11_return_in_memory): Ditto.
* config/riscv/riscv.cc (riscv_in_small_data_p): Ditto.
(riscv_mangle_type): Ditto.
* config/rl78/rl78.cc (rl78_insert_attributes): Ditto.
(rl78_addsi3_internal): Ditto.
* config/rs6000/aix.h (ROUND_TYPE_ALIGN): Ditto.
* config/rs6000/darwin.h (ROUND_TYPE_ALIGN): Ditto.
* config/rs6000/freebsd64.h (ROUND_TYPE_ALIGN): Ditto.
* config/rs6000/linux64.h (ROUND_TYPE_ALIGN): Ditto.
* config/rs6000/rs6000-call.cc (rs6000_function_arg_boundary): Ditto.
(rs6000_function_arg_advance_1): Ditto.
(rs6000_function_arg): Ditto.
(rs6000_pass_by_reference): Ditto.
* config/rs6000/rs6000-logue.cc (rs6000_function_ok_for_sibcall): Ditto.
* config/rs6000/rs6000.cc (rs6000_data_alignment): Ditto.
(rs6000_set_default_type_attributes): Ditto.
(rs6000_elf_in_small_data_p): Ditto.
(IN_NAMED_SECTION): Ditto.
(rs6000_xcoff_encode_section_info): Ditto.
(rs6000_function_value): Ditto.
(invalid_arg_for_unprototyped_fn): Ditto.
* config/s390/s390-c.cc (s390_fn_types_compatible): Ditto.
(s390_vec_n_elem): Ditto.
* config/s390/s390.cc (s390_check_type_for_vector_abi): Ditto.
(s390_function_arg_integer): Ditto.
(s390_return_in_memory): Ditto.
(s390_encode_section_info): Ditto.
* config/sh/sh.cc (sh_gimplify_va_arg_expr): Ditto.
(sh_function_value): Ditto.
* config/sol2.cc (solaris_insert_attributes): Ditto.
* config/sparc/sparc.cc (function_arg_slotno): Ditto.
* config/sparc/sparc.h (ROUND_TYPE_ALIGN): Ditto.
* config/stormy16/stormy16.cc (xstormy16_encode_section_info): Ditto.
(xstormy16_handle_below100_attribute): Ditto.
* config/v850/v850.cc (v850_encode_section_info): Ditto.
(v850_insert_attributes): Ditto.
* config/visium/visium.cc (visium_pass_by_reference): Ditto.
(visium_return_in_memory): Ditto.
* config/xtensa/xtensa.cc (xtensa_multibss_section_type_flags): Ditto.
|
|
QImode partial vector multiplications and shifts can be implemented using
their HImode counterparts. Add infrastructure to handle V8QImode and
V4QImode vectors by extending (interleaving) their input operands to
V8HImode, performing V8HImode operation and truncating output back to
the original QImode vector.
The patch implements V8QImode and V4QImode multiplication for SSE2 targets,
using generic permutation to truncate output operand, but still taking
advantage of VPMOVWB down convert instruction, when available.
The patch also removes setting of the REG_EQUAL note on the last insn
of the ix86_expand_vecop_qihi expander. This is what generic code does
automatically when a named pattern is expanded.
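An illustrative sketch (not the new testcases) of a V8QImode
multiplication now expanded by interleaving to V8HImode, multiplying,
and truncating back:
typedef unsigned char v8qi __attribute__ ((vector_size (8)));
v8qi
mul8 (v8qi a, v8qi b)
{
  return a * b;
}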
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial): New.
(ix86_expand_vecop_qihi): Add op2vec bool variable.
Do not set REG_EQUAL note.
* config/i386/i386-protos.h (ix86_expand_vecop_qihi_partial):
Add prototype.
* config/i386/i386.cc (ix86_multiplication_cost): Handle
V4QImode and V8QImode.
* config/i386/mmx.md (mulv8qi3): New expander.
(mulv4qi3): Ditto.
* config/i386/sse.md (mulv8qi3): Remove.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512vl-pr95488-1.c: Adjust
expected scan-assembler-times frequency and strings.
* gcc.target/i386/vect-mulv4qi.c: New test.
* gcc.target/i386/vect-mulv8qi.c: New test.
|
|
builtins [PR109884]
When _Float128 support was added to C++ for GCC 13.1, the float128t_type_node
tree was added - in C, float128_type_node and float128t_type_node are
the same and represent both _Float128 and __float128, but in C++ they
are distinct types which have different handling in the FEs.
When doing that change, I mistakenly forgot to change FLOAT128 primitive
type, which is used for the __builtin_{inf,huge_val,nan{,s},fabs,copysign}q
builtins results and some of their arguments (and nothing else).
The following patch fixes that.
On ia64 we already use float128t_type_node for those builtins; on pa, while
it has __float128, that type is the same as long double and so those builtins
have long double types; and on powerpc it seems we don't have these builtins
but instead define macros which map them to __builtin_*f128. That will
not work properly in C++; perhaps we should change those macros to be
function-like and cast to __float128.
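A sketch using one of the affected builtins: in C++ its result must now
be __float128 (float128t_type_node), a type distinct from _Float128 there.
__float128
inf128 (void)
{
  return __builtin_infq ();
}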
2023-05-17 Jakub Jelinek <jakub@redhat.com>
PR c++/109884
* config/i386/i386-builtin-types.def (FLOAT128): Use
float128t_type_node rather than float128_type_node.
* c-c++-common/pr109884.c: New test.
|
|
Returned integer vector mode costs of emulated modes in
ix86_multiplication_cost are wrong and do not reflect the generated
instruction sequences. Rewrite the handling of different integer vector
modes and different target ABIs to return real instruction
counts in order to calculate better costs of various emulated modes.
gcc/ChangeLog:
* config/i386/i386.cc (ix86_multiplication_cost): Correct
calculation of integer vector mode costs to reflect generated
instruction sequences of different integer vector modes and
different target ABIs.
|
|
Revert my previous change that faked handling of V4HImode and V2SImode
in ix86_widen_mult_cost and rather return an arbitrary high value
for unsupported modes. This should prevent the cost estimator from
selecting a non-existent vector widen multiply operation.
gcc/ChangeLog:
PR target/109807
* config/i386/i386.cc: Revert the 2023-05-11 change.
(ix86_widen_mult_cost): Return high value instead of
ICEing for unsupported modes.
gcc/testsuite/ChangeLog:
PR target/109807
* gcc.target/i386/pr109825.c: New test.
|