|
This patch adds a new 2-instruction constant synthesis pattern:
- A non-negative value whose square root fits into a signed 12-bit immediate:
=> "MOVI(.N) Ax, simm12" + "MULL Ax, Ax, Ax"
Due to the execution cost of the integer multiply instruction (MULL), this
synthesis is used only when the 32-bit Integer Multiply Option is configured
and optimizing for size is specified.
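As a hypothetical illustration (the function below is made up, not taken from
the patch): 1522756 == 1234 * 1234, and 1234 fits into a signed 12-bit
immediate, so with -Os and the MUL32 option the constant can be synthesized as
"movi aN, 1234" + "mull aN, aN, aN" instead of a literal-pool load.
int
foo (void)
{
  return 1522756;
}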
gcc/ChangeLog:
* config/xtensa/xtensa.cc (xtensa_constantsynth_2insn):
Add new pattern for the abovementioned case.
|
|
It used to always return a constant 4, which is the same as the default
behavior but doesn't take into account the effects of secondary
reloads.
Therefore, the implementation of this target hook is removed.
gcc/ChangeLog:
* config/xtensa/xtensa.cc
(TARGET_MEMORY_MOVE_COST, xtensa_memory_move_cost): Remove.
|
|
There is a const_anchor facility in cse.cc. This const_anchor
supports generating new constants by adding small gaps/offsets to an
existing constant. For example:
void __attribute__ ((noinline)) foo (long long *a)
{
*a++ = 0x2351847027482577LL;
*a++ = 0x2351847027482578LL;
}
The second constant (0x2351847027482578LL) can be computed by adding '1'
to the first constant (0x2351847027482577LL).
This is profitable if more than one instruction is needed to build the
second constant.
* For rs6000, we can enable this functionality, as the 'addi'
instruction does exactly this when the gap is smaller than 0x8000.
* One potential side effect of this feature:
Compared with
"r101=0x2351847027482577LL
...
r201=0x2351847027482578LL"
the new r201 will be "r201=r101+1", so r101 will live longer, which
would increase pressure when allocating registers.
But I feel this would be acceptable for the const_anchor feature.
With this feature, for GCC source code and SPEC object files, the
significant change is the improvement of a single "addi" vs. "2 or more
insns: lis+or..". It also exposes some other optimization
opportunities, like combine/jump2. The side effect does occur in a few
cases, but it does not impact overall performance.
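A minimal sketch of what the rs6000.cc change presumably looks like; the
0x8000 value follows from the addi range mentioned above and is an
assumption, not a quote of the patch:
/* Enable const_anchor-based constant reuse; addi can bridge gaps
   below 0x8000.  */
#undef TARGET_CONST_ANCHOR
#define TARGET_CONST_ANCHOR 0x8000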
gcc/ChangeLog:
* config/rs6000/rs6000.cc (TARGET_CONST_ANCHOR): New define.
gcc/testsuite/ChangeLog:
* gcc.target/powerpc/const_anchors.c: New test.
* gcc.target/powerpc/try_const_anchors_ice.c: New test.
|
|
The packing in vpacksswb/vpackssdw is not a simple concat; it's an
interweave of src1 and src2 for every 128 bits (or 64 bits for the
ss_truncate result).
I.e.:
dst[192-255] = ss_truncate (src2[128-255])
dst[128-191] = ss_truncate (src1[128-255])
dst[64-127] = ss_truncate (src2[0-127])
dst[0-63] = ss_truncate (src1[0-127])
The patch refines those patterns with an extra vec_select for the
interweave.
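A scalar C sketch of the 256-bit vpacksswb lane layout described above
(function and parameter names are illustrative only):
#include <stdint.h>

static int8_t
ss8 (int16_t x)   /* ss_truncate of one element */
{
  return x < -128 ? -128 : x > 127 ? 127 : (int8_t) x;
}

/* Each 128-bit half of dst packs the matching half of src1 followed by
   the matching half of src2, rather than all of src1 then all of src2.  */
void
vpacksswb_256 (int8_t dst[32], const int16_t src1[16], const int16_t src2[16])
{
  for (int half = 0; half < 2; half++)
    for (int i = 0; i < 8; i++)
      {
        dst[half * 16 + i]     = ss8 (src1[half * 8 + i]);
        dst[half * 16 + 8 + i] = ss8 (src2[half * 8 + i]);
      }
}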
gcc/ChangeLog:
PR target/110235
* config/i386/sse.md (<sse2_avx2>_packsswb<mask_name>):
Substitute with ..
(sse2_packsswb<mask_name>): .. this, ..
(avx2_packsswb<mask_name>): .. this and ..
(avx512bw_packsswb<mask_name>): .. this.
(<sse2_avx2>_packssdw<mask_name>): Substitute with ..
(sse2_packssdw<mask_name>): .. this, ..
(avx2_packssdw<mask_name>): .. this and ..
(avx512bw_packssdw<mask_name>): .. this.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512bw-vpackssdw-3.c: New test.
* gcc.target/i386/avx512bw-vpacksswb-3.c: New test.
|
|
us_truncate.
packuswb/packusdw does unsigned saturation of a signed source, but RTL
us_truncate means unsigned saturation of an unsigned source.
So for the value -1, packuswb will produce 0, but us_truncate produces
255. The patch reimplements the related patterns and functions with
UNSPEC_US_TRUNCATE instead of us_truncate.
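A small scalar illustration of the semantic difference (helper names are made
up, not from the patch):
#include <stdint.h>
#include <stdio.h>

static uint8_t
packuswb_elt (int16_t x)      /* hardware: unsigned saturation of a signed source */
{
  return x < 0 ? 0 : x > 255 ? 255 : (uint8_t) x;
}

static uint8_t
us_truncate_elt (uint16_t x)  /* RTL us_truncate: unsigned saturation of an unsigned source */
{
  return x > 255 ? 255 : (uint8_t) x;
}

int
main (void)
{
  /* Same bit pattern 0xffff: 0 under packuswb, 255 under us_truncate.  */
  printf ("%u %u\n", packuswb_elt (-1), us_truncate_elt (0xffff));
  return 0;
}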
gcc/ChangeLog:
PR target/110235
* config/i386/i386-expand.cc (ix86_split_mmx_pack): Use
UNSPEC_US_TRUNCATE instead of original us_truncate for
packusdw/packuswb.
* config/i386/mmx.md (mmx_pack<s_trunsuffix>swb): Substitute
with ..
(mmx_packsswb): .. this and ..
(mmx_packuswb): .. this.
(mmx_packusdw): Use UNSPEC_US_TRUNCATE instead of original
us_truncate.
(s_trunsuffix): Removed code iterator.
(any_s_truncate): Ditto.
* config/i386/sse.md (<sse2_avx2>_packuswb<mask_name>): Use
UNSPEC_US_TRUNCATE instead of original us_truncate.
(<sse4_1_avx2>_packusdw<mask_name>): Ditto.
* config/i386/i386.md (UNSPEC_US_TRUNCATE): New unspec_c_enum.
|
|
This patch fixes one typo when taking GET_MODE_CLASS of a mode.
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins-bases.cc: Fix one typo.
|
|
Testing the V2 version of Manolis's fold-mem-offsets patch exposed a minor bug
in the arc backend.
The movsf_insn pattern has constraints which allow storing certain constants
to memory. reload/lra will target those alternatives under the right
circumstances. However the insn's condition requires that one of the two
operands must be a register.
Thus if a pass were to force re-recognition of the pattern we could get
an unrecognized insn failure.
This patch adjusts the conditions to more closely match movsi_insn. More
specifically it allows storing a constant into a limited set of memory
operands (as defined by the Usc constraint). movqi_insn has the same
core problem and gets the same solution.
Committed after the tester validated there are no regressions.
gcc/
* config/arc/arc.md (movqi_insn): Allow certain constants to
be stored into memory in the pattern's condition.
(movsf_insn): Similarly.
|
|
This patch refactors the three places in the i386.md backend that we
set the carry flag into a new ix86_expand_carry helper function, that
allows Jakub's recently added uaddc<mode>5 and usubc<mode>5 expanders
to take advantage of the recently added support for the stc instruction.
2023-06-18 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_carry): New helper
function for setting the carry flag.
(ix86_expand_builtin) <handlecarry>: Use it here.
* config/i386/i386-protos.h (ix86_expand_carry): Prototype here.
* config/i386/i386.md (uaddc<mode>5): Use ix86_expand_carry.
(usubc<mode>5): Likewise.
|
|
This clean-up improves consistency within i386.md by using QImode for
the constant shift count in patterns that specify a mode.
2023-06-18 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386.md (*concat<mode><dwi>3_1): Use QImode
for the immediate constant shift count.
(*concat<mode><dwi>3_2): Likewise.
(*concat<mode><dwi>3_3): Likewise.
(*concat<mode><dwi>3_4): Likewise.
(*concat<mode><dwi>3_5): Likewise.
(*concat<mode><dwi>3_6): Likewise.
|
|
This patch adds support for the float16 tuple type.
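A hedged usage sketch (a hypothetical test requiring the Zvfh extension; the
patch itself only adds the types, modes and mode adjustments listed below):
#include <riscv_vector.h>

/* The new float16 tuple types can now be declared, passed and returned
   like the existing integer and float tuple types.  */
vfloat16m1x2_t
copy_tuple (vfloat16m1x2_t t)
{
  return t;
}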
gcc/ChangeLog:
* config/riscv/genrvv-type-indexer.cc (valid_type): Enable FP16 tuple.
* config/riscv/riscv-modes.def (RVV_TUPLE_MODES): New macro.
(ADJUST_ALIGNMENT): Ditto.
(RVV_TUPLE_PARTIAL_MODES): Ditto.
(ADJUST_NUNITS): Ditto.
* config/riscv/riscv-vector-builtins-types.def (vfloat16mf4x2_t):
New types.
(vfloat16mf4x3_t): Ditto.
(vfloat16mf4x4_t): Ditto.
(vfloat16mf4x5_t): Ditto.
(vfloat16mf4x6_t): Ditto.
(vfloat16mf4x7_t): Ditto.
(vfloat16mf4x8_t): Ditto.
(vfloat16mf2x2_t): Ditto.
(vfloat16mf2x3_t): Ditto.
(vfloat16mf2x4_t): Ditto.
(vfloat16mf2x5_t): Ditto.
(vfloat16mf2x6_t): Ditto.
(vfloat16mf2x7_t): Ditto.
(vfloat16mf2x8_t): Ditto.
(vfloat16m1x2_t): Ditto.
(vfloat16m1x3_t): Ditto.
(vfloat16m1x4_t): Ditto.
(vfloat16m1x5_t): Ditto.
(vfloat16m1x6_t): Ditto.
(vfloat16m1x7_t): Ditto.
(vfloat16m1x8_t): Ditto.
(vfloat16m2x2_t): Ditto.
(vfloat16m2x3_t): Ditto.
(vfloat16m2x4_t): Ditto.
(vfloat16m4x2_t): Ditto.
* config/riscv/riscv-vector-builtins.def (vfloat16mf4x2_t): New macro.
(vfloat16mf4x3_t): Ditto.
(vfloat16mf4x4_t): Ditto.
(vfloat16mf4x5_t): Ditto.
(vfloat16mf4x6_t): Ditto.
(vfloat16mf4x7_t): Ditto.
(vfloat16mf4x8_t): Ditto.
(vfloat16mf2x2_t): Ditto.
(vfloat16mf2x3_t): Ditto.
(vfloat16mf2x4_t): Ditto.
(vfloat16mf2x5_t): Ditto.
(vfloat16mf2x6_t): Ditto.
(vfloat16mf2x7_t): Ditto.
(vfloat16mf2x8_t): Ditto.
(vfloat16m1x2_t): Ditto.
(vfloat16m1x3_t): Ditto.
(vfloat16m1x4_t): Ditto.
(vfloat16m1x5_t): Ditto.
(vfloat16m1x6_t): Ditto.
(vfloat16m1x7_t): Ditto.
(vfloat16m1x8_t): Ditto.
(vfloat16m2x2_t): Ditto.
(vfloat16m2x3_t): Ditto.
(vfloat16m2x4_t): Ditto.
(vfloat16m4x2_t): Ditto.
* config/riscv/riscv-vector-switch.def (TUPLE_ENTRY): New.
* config/riscv/riscv.md: New.
* config/riscv/vector-iterators.md: New.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/tuple-28.c: New test.
* gcc.target/riscv/rvv/base/tuple-29.c: New test.
* gcc.target/riscv/rvv/base/tuple-30.c: New test.
* gcc.target/riscv/rvv/base/tuple-31.c: New test.
* gcc.target/riscv/rvv/base/tuple-32.c: New test.
|
|
This patch splits out two (independent) minor changes to i386-expand.cc's
ix86_expand_move from a larger patch, given that it's better to review
and commit these independent pieces separately from a more complex patch.
The first change is to test for CONST_WIDE_INT_P before calling
ix86_convert_const_wide_int_to_broadcast. Whilst stepping through
this function in gdb, I was surprised that the code was continually
jumping into this function with operands that obviously weren't
appropriate.
The second change is to generalize the optimization for efficiently
moving a TImode value to V1TImode (via V2DImode), to cover all 128-bit
vector modes.
Hence for the test case:
typedef unsigned long uv2di __attribute__ ((__vector_size__ (16)));
uv2di foo2(__int128 x) { return (uv2di)x; }
we'd previously move via memory with:
foo2: movq %rdi, -24(%rsp)
movq %rsi, -16(%rsp)
movdqa -24(%rsp), %xmm0
ret
with this patch we now generate with -O2 (the same as V1TImode):
foo2: movq %rdi, %xmm0
movq %rsi, %xmm1
punpcklqdq %xmm1, %xmm0
ret
and with -O2 -msse4 the even better:
foo2: movq %rdi, %xmm0
pinsrq $1, %rsi, %xmm0
ret
The new test case is unimaginatively called sse2-v1ti-mov-2.c given
the original test case just for V1TI mode was called sse2-v1ti-mov-1.c.
2023-06-17 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_expand_move): Check that OP1 is
CONST_WIDE_INT_P before calling ix86_convert_wide_int_to_broadcast.
Generalize special case for converting TImode to V1TImode to handle
all 128-bit vector conversions.
gcc/testsuite/ChangeLog
* gcc.target/i386/sse2-v1ti-mov-2.c: New test case.
|
|
The RVV integer reduction has 3 different patterns for zve128+, zve64
and zve32. They take the same iterator with different attributes.
However, we need the generated function code_for_reduc (code, mode1, mode2).
The implementation of code_for_reduc may look like below.
code_for_reduc (code, mode1, mode2)
{
if (code == max && mode1 == VNx1QI && mode2 == VNx1QI)
return CODE_FOR_pred_reduc_maxvnx1qivnx16qi; // ZVE128+
if (code == max && mode1 == VNx1QI && mode2 == VNx1QI)
return CODE_FOR_pred_reduc_maxvnx1qivnx8qi; // ZVE64
if (code == max && mode1 == VNx1QI && mode2 == VNx1QI)
return CODE_FOR_pred_reduc_maxvnx1qivnx4qi; // ZVE32
}
Thus there will be a problem here. For example, for zve32 we will have
code_for_reduc (max, VNx1QI, VNx1QI), which will logically return the code
for ZVE128+ instead of ZVE32.
This patch merges the 3 patterns into one pattern, and passes both the
input_vector and the ret_vector modes to code_for_reduc. For example, ZVE32
will use code_for_reduc (max, VNx1QI, VNx8QI), and then the correct code for
ZVE32 will be returned as expected.
Please note both GCC 13 and 14 are impacted by this issue.
Signed-off-by: Pan Li <pan2.li@intel.com>
Co-Authored by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
PR target/110265
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins-bases.cc: Add ret_mode for
integer reduction expand.
* config/riscv/vector-iterators.md: Add VQI, VHI, VSI and VDI,
and the LMUL1 attr respectively.
* config/riscv/vector.md
(@pred_reduc_<reduc><mode><vlmul1>): Removed.
(@pred_reduc_<reduc><mode><vlmul1_zve64>): Likewise.
(@pred_reduc_<reduc><mode><vlmul1_zve32>): Likewise.
(@pred_reduc_<reduc><VQI:mode><VQI_LMUL1:mode>): New pattern.
(@pred_reduc_<reduc><VHI:mode><VHI_LMUL1:mode>): Likewise.
(@pred_reduc_<reduc><VSI:mode><VSI_LMUL1:mode>): Likewise.
(@pred_reduc_<reduc><VDI:mode><VDI_LMUL1:mode>): Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/pr110265-1.c: New test.
* gcc.target/riscv/rvv/base/pr110265-1.h: New test.
* gcc.target/riscv/rvv/base/pr110265-2.c: New test.
* gcc.target/riscv/rvv/base/pr110265-2.h: New test.
* gcc.target/riscv/rvv/base/pr110265-3.c: New test.
|
|
This patch fixes an issue that happens on both GCC-13 and GCC-14.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110264
The testcase is too big and I failed to reduce it, so I didn't append a
test to this patch.
This patch should not only land in GCC-14 but also be backported to GCC-13.
PR target/110264
gcc/ChangeLog:
* config/riscv/riscv-vsetvl.cc (insert_vsetvl): Fix bug.
|
|
This patch addresses the last remaining issue with PR target/31985, that
GCC could make better use of memory addressing modes when implementing
double word addition. This is achieved by adding a define_insn_and_split
that combines an *add<dwi>3_doubleword with a *concat<mode><dwi>3, so
that the components of the concat can be used directly, without first
being loaded into a double word register.
For test_c in the bugzilla PR:
Before:
pushl %ebx
subl $16, %esp
movl 28(%esp), %eax
movl 36(%esp), %ecx
movl 32(%esp), %ebx
movl 24(%esp), %edx
addl %ecx, %eax
adcl %ebx, %edx
movl %eax, 8(%esp)
movl %edx, 12(%esp)
addl $16, %esp
popl %ebx
ret
After:
test_c:
subl $20, %esp
movl 36(%esp), %eax
movl 32(%esp), %edx
addl 28(%esp), %eax
adcl 24(%esp), %edx
movl %eax, 8(%esp)
movl %edx, 12(%esp)
addl $20, %esp
ret
2023-06-16 Roger Sayle <roger@nextmovesoftware.com>
Uros Bizjak <ubizjak@gmail.com>
gcc/ChangeLog
PR target/31985
* config/i386/i386.md (*add<dwi>3_doubleword_concat): New
define_insn_and_split combine *add<dwi>3_doubleword with
a *concat<mode><dwi>3 for more efficient lowering after reload.
gcc/testsuite/ChangeLog
PR target/31985
* gcc.target/i386/pr31985.c: New test case.
|
|
Similar to the low-half patterns, we want to match both ashiftrt and
lshiftrt with the truncate for SHRN2. We reuse the SHIFTRT iterator
and the AARCH64_VALID_SHRN_OP check to help, but because we expand the
high-half patterns by their gen_* names we need to disambiguate all the
different trunc+shift combinations in the pattern name, which leads to a
slight renaming of the builtins. The AARCH64_VALID_SHRN_OP check on the
expander and the define_insns ensures that no invalid combination ends
up getting matched.
Bootstrapped and tested on aarch64-none-linux-gnu and
aarch64_be-none-elf.
gcc/ChangeLog:
* config/aarch64/aarch64-simd-builtins.def (shrn2_n): Rename builtins to...
(ushrn2_n): ... This.
(sqshrn2_n): Rename builtins to...
(ssqshrn2_n): ... This.
(uqshrn2_n): Rename builtins to...
(uqushrn2_n): ... This.
* config/aarch64/arm_neon.h (vqshrn_high_n_s16): Adjust for the above.
(vqshrn_high_n_s32): Likewise.
(vqshrn_high_n_s64): Likewise.
(vqshrn_high_n_u16): Likewise.
(vqshrn_high_n_u32): Likewise.
(vqshrn_high_n_u64): Likewise.
(vshrn_high_n_s16): Likewise.
(vshrn_high_n_s32): Likewise.
(vshrn_high_n_s64): Likewise.
(vshrn_high_n_u16): Likewise.
(vshrn_high_n_u32): Likewise.
(vshrn_high_n_u64): Likewise.
* config/aarch64/aarch64-simd.md (aarch64_<shrn_op>shrn2_n<mode>_insn_le):
Rename to...
(aarch64_<shrn_op><sra_op>shrn2_n<mode>_insn_le): ... This.
Use SHIFTRT iterator and AARCH64_VALID_SHRN_OP check.
(aarch64_<shrn_op>shrn2_n<mode>_insn_be): Rename to...
(aarch64_<shrn_op><sra_op>shrn2_n<mode>_insn_be): ... This.
Use SHIFTRT iterator and AARCH64_VALID_SHRN_OP check.
(aarch64_<shrn_op>shrn2_n<mode>): Rename to...
(aarch64_<shrn_op><sra_op>shrn2_n<mode>): ... This.
Update expander for the above.
|
|
This patch is large in lines of code, but it is a fairly regular
extension of the first patch as it converts the high-half patterns
to standard RTL codes in the same fashion as the first patch did for the
low-half ones.
This now allows us to remove the unspec codes for these instructions as
there are no more uses of them left.
Bootstrapped and tested on aarch64-none-linux-gnu and
aarch64_be-none-elf.
gcc/ChangeLog:
* config/aarch64/aarch64-simd-builtins.def (shrn2): Rename builtins to...
(shrn2_n): ... This.
(rshrn2): Rename builtins to...
(rshrn2_n): ... This.
* config/aarch64/arm_neon.h (vrshrn_high_n_s16): Adjust for the above.
(vrshrn_high_n_s32): Likewise.
(vrshrn_high_n_s64): Likewise.
(vrshrn_high_n_u16): Likewise.
(vrshrn_high_n_u32): Likewise.
(vrshrn_high_n_u64): Likewise.
(vshrn_high_n_s16): Likewise.
(vshrn_high_n_s32): Likewise.
(vshrn_high_n_s64): Likewise.
(vshrn_high_n_u16): Likewise.
(vshrn_high_n_u32): Likewise.
(vshrn_high_n_u64): Likewise.
* config/aarch64/aarch64-simd.md (*aarch64_<srn_op>shrn<mode>2_vect_le):
Delete.
(*aarch64_<srn_op>shrn<mode>2_vect_be): Likewise.
(aarch64_shrn2<mode>_insn_le): Likewise.
(aarch64_shrn2<mode>_insn_be): Likewise.
(aarch64_shrn2<mode>): Likewise.
(aarch64_rshrn2<mode>_insn_le): Likewise.
(aarch64_rshrn2<mode>_insn_be): Likewise.
(aarch64_rshrn2<mode>): Likewise.
(aarch64_<sur>q<r>shr<u>n2_n<mode>_insn_le): Likewise.
(aarch64_<shrn_op>shrn2_n<mode>_insn_le): New define_insn.
(aarch64_<sur>q<r>shr<u>n2_n<mode>_insn_be): Delete.
(aarch64_<shrn_op>shrn2_n<mode>_insn_be): New define_insn.
(aarch64_<sur>q<r>shr<u>n2_n<mode>): Delete.
(aarch64_<shrn_op>shrn2_n<mode>): New define_expand.
(aarch64_<shrn_op>rshrn2_n<mode>_insn_le): New define_insn.
(aarch64_<shrn_op>rshrn2_n<mode>_insn_be): New define_insn.
(aarch64_<shrn_op>rshrn2_n<mode>): New define_expand.
(aarch64_sqshrun2_n<mode>_insn_le): New define_insn.
(aarch64_sqshrun2_n<mode>_insn_be): New define_insn.
(aarch64_sqshrun2_n<mode>): New define_expand.
(aarch64_sqrshrun2_n<mode>_insn_le): New define_insn.
(aarch64_sqrshrun2_n<mode>_insn_be): New define_insn.
(aarch64_sqrshrun2_n<mode>): New define_expand.
* config/aarch64/iterators.md (UNSPEC_SQSHRUN, UNSPEC_SQRSHRUN,
UNSPEC_SQSHRN, UNSPEC_UQSHRN, UNSPEC_SQRSHRN, UNSPEC_UQRSHRN):
Delete unspec values.
(VQSHRN_N): Delete int iterator.
|
|
The first patch in the series has some fallout in the testsuite,
particularly gcc.target/aarch64/shrn-combine-2.c.
Our previous patterns for SHRN matched both
(truncate (ashiftrt (x) (N))) and (truncate (lshiftrt (x) (N))
as these are equivalent for the shift amounts involved.
In our refactoring, however, we mapped shrn to truncate+lshiftrt.
The fix here is to iterate over ashiftrt,lshiftrt in the pattern for it.
However, we don't want to allow ashiftrt for us_truncate or lshiftrt for
ss_truncate from the ALL_TRUNC iterator.
This patch adds an AARCH64_VALID_SHRN_OP helper to gate the valid
combinations of truncations and shifts.
Bootstrapped and tested on aarch64-none-linux-gnu and
aarch64_be-none-elf.
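A sketch of what such a gating helper could look like, derived from the
combinations described above (the exact macro body is an assumption, not a
quote of the patch):
/* Plain truncate accepts both shift codes; the saturating forms only
   accept the shift matching their signedness.  */
#define AARCH64_VALID_SHRN_OP(T, S)              \
  ((T) == TRUNCATE                               \
   || ((T) == SS_TRUNCATE && (S) == ASHIFTRT)    \
   || ((T) == US_TRUNCATE && (S) == LSHIFTRT))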
gcc/ChangeLog:
* config/aarch64/aarch64.h (AARCH64_VALID_SHRN_OP): Define.
* config/aarch64/aarch64-simd.md
(*aarch64_<shrn_op>shrn_n<mode>_insn<vczle><vczbe>): Rename to...
(*aarch64_<shrn_op><shrn_s>shrn_n<mode>_insn<vczle><vczbe>): ... This.
Use SHIFTRT iterator and add AARCH64_VALID_SHRN_OP to condition.
* config/aarch64/iterators.md (shrn_s): New code attribute.
|
|
Some instructions from the previous patch have scalar forms:
SQSHRN,SQRSHRN,UQSHRN,UQRSHRN,SQSHRUN,SQRSHRUN.
This patch converts the patterns for these to use standard RTL codes.
Their MD patterns deviate slightly from the vector forms mostly due to
things like operands being scalar rather than vectors.
One nuance is in the SQSHRUN,SQRSHRUN patterns. These end in a truncate
to the scalar narrow mode e.g. SI -> QI. This gets simplified by the
RTL passes to a subreg rather than keeping it as a truncate.
So we end up representing these without the truncate and, in the expander,
reading the narrow subreg in order to comply with the expected width of the
intrinsic.
Bootstrapped and tested on aarch64-none-linux-gnu and
aarch64_be-none-elf.
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (aarch64_<sur>q<r>shr<u>n_n<mode>):
Rename to...
(aarch64_<shrn_op>shrn_n<mode>): ... This. Reimplement with RTL codes.
(*aarch64_<shrn_op>rshrn_n<mode>_insn): New define_insn.
(aarch64_sqrshrun_n<mode>_insn): Likewise.
(aarch64_sqshrun_n<mode>_insn): Likewise.
(aarch64_<shrn_op>rshrn_n<mode>): New define_expand.
(aarch64_sqshrun_n<mode>): Likewise.
(aarch64_sqrshrun_n<mode>): Likewise.
* config/aarch64/iterators.md (V2XWIDE): Add HI and SI modes.
|
|
This patch reimplements the MD patterns for the instructions that
perform narrowing right shifts with optional rounding and saturation
using standard RTL codes rather than unspecs.
There are four groups of patterns involved:
* Simple narrowing shifts with optional signed or unsigned truncation:
SHRN, SQSHRN, UQSHRN. These are expressed as a truncation operation of
a right shift. The matrix of valid combinations looks like this:
            | ashiftrt | lshiftrt |
------------+----------+----------+
ss_truncate | SQSHRN   | X        |
us_truncate | X        | UQSHRN   |
truncate    | X        | SHRN     |
------------+----------+----------+
* Narrowing shifts with rounding with optional signed or unsigned
truncation: RSHRN, SQRSHRN, UQRSHRN. These follow the same
combinations of truncation and shift codes as above, but also perform
intermediate widening of the results in order to represent the addition
of the rounding constant. This group also corrects an existing
inaccuracy for RSHRN where we don't currently model the intermediate
widening for rounding.
* The somewhat special "Signed saturating Shift Right Unsigned Narrow":
SQSHRUN. Similar to the SQXTUN instructions, these perform a
saturating truncation that isn't represented by US_TRUNCATE or
SS_TRUNCATE but needs to use a clamping operation followed by a
TRUNCATE.
* The rounding version of the above: SQRSHRUN. It needs the special
clamping truncate representation but with an intermediate widening and
rounding addition.
Besides using standard RTL codes for all of the above instructions, this
patch allows us to get rid of the explicit define_insns and
define_expands for SHRN and RSHRN.
Bootstrapped and tested on aarch64-none-linux-gnu and
aarch64_be-none-elf. We've got pretty thorough execute tests in
advsimd-intrinsics.exp that exercise these; many instances of these
instructions get constant-folded away during optimisation and the
validation still passes (during development, while I was figuring out the
details of the semantics, those tests were catching failures), so I'm fairly
confident in the representation.
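A scalar sketch of the rounding narrowing shift semantics modelled by the
second group above (names are made up; the intermediate widening carries the
rounding addend):
#include <stdint.h>

/* RSHRN-style element operation for shift in [1, 8]: widen, add
   1 << (shift - 1), shift right, then truncate to the narrow type.  */
static uint8_t
rshrn_elt (uint16_t x, unsigned shift)
{
  uint32_t widened = (uint32_t) x + (1u << (shift - 1));
  return (uint8_t) (widened >> shift);
}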
gcc/ChangeLog:
* config/aarch64/aarch64-simd-builtins.def (shrn): Rename builtins to...
(shrn_n): ... This.
(rshrn): Rename builtins to...
(rshrn_n): ... This.
* config/aarch64/arm_neon.h (vshrn_n_s16): Adjust for the above.
(vshrn_n_s32): Likewise.
(vshrn_n_s64): Likewise.
(vshrn_n_u16): Likewise.
(vshrn_n_u32): Likewise.
(vshrn_n_u64): Likewise.
(vrshrn_n_s16): Likewise.
(vrshrn_n_s32): Likewise.
(vrshrn_n_s64): Likewise.
(vrshrn_n_u16): Likewise.
(vrshrn_n_u32): Likewise.
(vrshrn_n_u64): Likewise.
* config/aarch64/aarch64-simd.md
(*aarch64_<srn_op>shrn<mode><vczle><vczbe>): Delete.
(aarch64_shrn<mode>): Likewise.
(aarch64_rshrn<mode><vczle><vczbe>_insn): Likewise.
(aarch64_rshrn<mode>): Likewise.
(aarch64_<sur>q<r>shr<u>n_n<mode>_insn<vczle><vczbe>): Likewise.
(aarch64_<sur>q<r>shr<u>n_n<mode>): Likewise.
(*aarch64_<shrn_op>shrn_n<mode>_insn<vczle><vczbe>): New define_insn.
(*aarch64_<shrn_op>rshrn_n<mode>_insn<vczle><vczbe>): Likewise.
(*aarch64_sqshrun_n<mode>_insn<vczle><vczbe>): Likewise.
(*aarch64_sqrshrun_n<mode>_insn<vczle><vczbe>): Likewise.
(aarch64_<shrn_op>shrn_n<mode>): New define_expand.
(aarch64_<shrn_op>rshrn_n<mode>): Likewise.
(aarch64_sqshrun_n<mode>): Likewise.
(aarch64_sqrshrun_n<mode>): Likewise.
* config/aarch64/iterators.md (ALL_TRUNC): New code iterator.
(TRUNCEXTEND): New code attribute.
(TRUNC_SHIFT): Likewise.
(shrn_op): Likewise.
* config/aarch64/predicates.md (aarch64_simd_umax_quarter_mode):
New predicate.
|
|
This patch would like to fix one maybe-uninitialized warning. Aka:
riscv-vsetvl.cc:4354:3: error: 'vsetvl_rinsn' may be used uninitialized [-Werror=maybe-uninitialized]
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv-vsetvl.cc
(pass_vsetvl::global_eliminate_vsetvl_insn): Initialize var by NULL.
|
|
It adjusts the preprocess, compile and link flags, which allows changing
the default -lmsvcrt library to another one provided by the MinGW runtime.
gcc/ChangeLog:
* config/i386/mingw-w64.h (CPP_SPEC): Adjust for -mcrtdll=.
(REAL_LIBGCC_SPEC): New define.
* config/i386/mingw.opt: Add mcrtdll=
* config/i386/mingw32.h (CPP_SPEC): Adjust for -mcrtdll=.
(REAL_LIBGCC_SPEC): Adjust for -mcrtdll=.
(STARTFILE_SPEC): Adjust for -mcrtdll=.
* doc/invoke.texi: Add mcrtdll= documentation.
Signed-off-by: Jonathan Yong <10walls@gmail.com>
|
|
Support for __attribute__ ((code_readable)). Takes up to one argument of
"yes", "no", "pcrel". This will change the code readability setting for just
that function. If no argument is supplied, then the setting is 'yes'.
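A hedged usage sketch of the attribute as described above (function names are
illustrative):
void __attribute__ ((code_readable ("no"))) no_code_reads (void);
void __attribute__ ((code_readable ("pcrel"))) pcrel_code_reads (void);
void __attribute__ ((code_readable)) default_reads (void);  /* same as "yes" */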
gcc/ChangeLog:
* config/mips/mips.cc (enum mips_code_readable_setting): New enum.
(mips_handle_code_readable_attr): New static function.
(mips_get_code_readable_attr): New static enum function.
(mips_set_current_function): Set the code_readable mode.
(mips_option_override): Same as above.
* doc/extend.texi: Document code_readable.
gcc/testsuite/ChangeLog:
* gcc.target/mips/code-readable-attr-1.c: New test.
* gcc.target/mips/code-readable-attr-2.c: New test.
* gcc.target/mips/code-readable-attr-3.c: New test.
* gcc.target/mips/code-readable-attr-4.c: New test.
* gcc.target/mips/code-readable-attr-5.c: New test.
|
|
As is already the case for the AVX/AVX2 form, VMOVDDUP - acting on
double precision floating-point values - is more appropriate to use here, and
it can also result in shorter insn encodings when the source is memory or
%xmm0...%xmm7 and no masking is applied (allowing a 2-byte VEX
prefix then instead of a 3-byte one).
gcc/
* config/i386/sse.md (<avx512>_vec_dup<mode><mask_name>): Use
vmovddup.
|
|
gcc/
* config/i386/constraints.md: Mention k and r for B.
|
|
The micro-architecture unconditionally treats a "jr $ra" as "return from subroutine",
hence using "jr $ra" for other purposes would interfere with both subroutine return
prediction and the more general indirect branch prediction.
Therefore, a problem like PR110136 can cause a significant increase in the branch
misprediction rate and hurt performance. The same problem exists with "indirect_jump".
gcc/ChangeLog:
PR target/110136
* config/loongarch/loongarch.md: Modify the register constraints for template
"jumptable" and "indirect_jump" from "r" to "e".
Co-authored-by: Andrew Pinski <apinski@marvell.com>
|
|
The LA464 micro-architecture is sensitive to the alignment of code. The
Loongson team has benchmarked various combinations of function and label
alignment; the results [1] show that 16-byte label alignment together with
32-byte function alignment gives the best results in terms of SPEC score.
Add an mtune-based, table-driven mechanism to set the defaults of
-falign-{functions,labels}. As LA464 is the first (and for now the only)
uarch supported by GCC, the same setting is also used for
the "generic" -mtune=loongarch64. In the future we may use different
settings for LA{2,3,6}64 once we add support for them.
Bootstrapped and regtested on loongarch64-linux-gnu. Ok for trunk?
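A minimal sketch of the kind of per-uarch table entry the ChangeLog below
implies (member names are assumptions, not quoted from the patch):
/* Per-mtune default alignments; the override hook feeds these into
   -falign-functions=/-falign-labels= when the options are enabled
   without an explicit value.  */
struct loongarch_align
{
  const char *function;  /* default for -falign-functions  */
  const char *label;     /* default for -falign-labels     */
};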
gcc/ChangeLog:
* config/loongarch/loongarch-tune.h (loongarch_align): New
struct.
* config/loongarch/loongarch-def.h (loongarch_cpu_align): New
array.
* config/loongarch/loongarch-def.c (loongarch_cpu_align): Define
the array.
* config/loongarch/loongarch.cc
(loongarch_option_override_internal): Set the value of
-falign-functions= if -falign-functions is enabled but no value
is given. Likewise for -falign-labels=.
|
|
The following patch introduces {add,sub}c5_optab and pattern recognizes
various forms of add with carry and subtract with carry/borrow, see
pr79173-{1,2,3,4,5,6}.c tests on what is matched.
Primarily forms with 2 __builtin_add_overflow or __builtin_sub_overflow
calls per limb (with just one for the least significant one), for
add with carry even when it is hand written in C (for subtraction
reassoc seems to change it too much so that the pattern recognition
doesn't work). __builtin_{add,sub}_overflow are standardized in C23
under ckd_{add,sub} names, so it isn't any longer a GNU only extension.
Note, clang has for these (IMHO badly designed)
__builtin_{add,sub}c{b,s,,l,ll} builtins which don't add/subtract just
a single bit of carry, but basically add 3 unsigned values or
subtract 2 unsigned values from one, and result in carry out of 0, 1, or 2
because of that. If we wanted to introduce those for clang compatibility,
we could and lower them early to just two __builtin_{add,sub}_overflow
calls and let the pattern matching in this patch recognize it later.
I've added expanders for this on ix86 and in addition to that
added various peephole2s (in preparation patches for this patch) to make
sure we get nice (and small) code for the common cases. I think there are
other PRs which request that e.g. for the _{addcarry,subborrow}_u{32,64}
intrinsics, which the patch also improves.
Would be nice if support for these optabs was added to many other targets,
arm/aarch64 and powerpc* certainly have such instructions, I'd expect
in fact that most targets do.
The _BitInt support I'm working on will also need this to emit reasonable
code.
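A sketch of the two-__builtin_add_overflow-per-limb shape the new matching
aims at (loosely modeled on the pr79173 tests referenced below, not copied
from them):
static unsigned long
uaddc (unsigned long x, unsigned long y, unsigned long carry_in,
       unsigned long *carry_out)
{
  unsigned long r;
  unsigned long c1 = __builtin_add_overflow (x, y, &r);
  unsigned long c2 = __builtin_add_overflow (r, carry_in, &r);
  *carry_out = c1 + c2;
  return r;
}

void
add3 (unsigned long *p, const unsigned long *q)
{
  unsigned long carry;
  p[0] = uaddc (p[0], q[0], 0, &carry);   /* least significant limb */
  p[1] = uaddc (p[1], q[1], carry, &carry);
  p[2] = uaddc (p[2], q[2], carry, &carry);
}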
2023-06-15 Jakub Jelinek <jakub@redhat.com>
PR middle-end/79173
* internal-fn.def (UADDC, USUBC): New internal functions.
* internal-fn.cc (expand_UADDC, expand_USUBC): New functions.
(commutative_ternary_fn_p): Return true also for IFN_UADDC.
* optabs.def (uaddc5_optab, usubc5_optab): New optabs.
* tree-ssa-math-opts.cc (uaddc_cast, uaddc_ne0, uaddc_is_cplxpart,
match_uaddc_usubc): New functions.
(math_opts_dom_walker::after_dom_children): Call match_uaddc_usubc
for PLUS_EXPR, MINUS_EXPR, BIT_IOR_EXPR and BIT_XOR_EXPR unless
other optimizations have been successful for those.
* gimple-fold.cc (gimple_fold_call): Handle IFN_UADDC and IFN_USUBC.
* fold-const-call.cc (fold_const_call): Likewise.
* gimple-range-fold.cc (adjust_imagpart_expr): Likewise.
* tree-ssa-dce.cc (eliminate_unnecessary_stmts): Likewise.
* doc/md.texi (uaddc<mode>5, usubc<mode>5): Document new named
patterns.
* config/i386/i386.md (uaddc<mode>5, usubc<mode>5): New
define_expand patterns.
(*setcc_qi_addqi3_cconly_overflow_1_<mode>, *setccc): Split
into NOTE_INSN_DELETED note rather than nop instruction.
(*setcc_qi_negqi_ccc_1_<mode>, *setcc_qi_negqi_ccc_2_<mode>):
Likewise.
* gcc.target/i386/pr79173-1.c: New test.
* gcc.target/i386/pr79173-2.c: New test.
* gcc.target/i386/pr79173-3.c: New test.
* gcc.target/i386/pr79173-4.c: New test.
* gcc.target/i386/pr79173-5.c: New test.
* gcc.target/i386/pr79173-6.c: New test.
* gcc.target/i386/pr79173-7.c: New test.
* gcc.target/i386/pr79173-8.c: New test.
* gcc.target/i386/pr79173-9.c: New test.
* gcc.target/i386/pr79173-10.c: New test.
|
|
destination [PR79173]
This patch adds subborrow<mode> alternative so that it can have memory
destination and adds various peephole2s which help to match it.
2023-06-15 Jakub Jelinek <jakub@redhat.com>
PR middle-end/79173
* config/i386/i386.md (subborrow<mode>): Add alternative with
memory destination and add for it define_peephole2
TARGET_READ_MODIFY_WRITE/-Os patterns to prefer using memory
destination in these patterns.
|
|
borrow with memory destination [PR79173]
This patch adds various peephole2s which help to recognize add with
carry or subtract with borrow with memory destination.
2023-06-14 Jakub Jelinek <jakub@redhat.com>
PR middle-end/79173
* config/i386/i386.md (*sub<mode>_3, @add<mode>3_carry,
addcarry<mode>, @sub<mode>3_carry, *add<mode>3_cc_overflow_1): Add
define_peephole2 TARGET_READ_MODIFY_WRITE/-Os patterns to prefer
using memory destination in these patterns.
|
|
This patch adds new RTL and tests for sabd and uabd
PR tree-optimization/109156
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (aarch64_<su>abd<mode>):
Rename to <su>abd<mode>3.
* config/aarch64/aarch64-sve.md (<su>abd<mode>_3): Rename
to <su>abd<mode>3.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/abd.h: New file.
* gcc.target/aarch64/abd_2.c: New test.
* gcc.target/aarch64/abd_3.c: New test.
* gcc.target/aarch64/abd_4.c: New test.
* gcc.target/aarch64/abd_none_2.c: New test.
* gcc.target/aarch64/abd_none_3.c: New test.
* gcc.target/aarch64/abd_none_4.c: New test.
* gcc.target/aarch64/abd_run_1.c: New test.
* gcc.target/aarch64/sve/abd_1.c: New test.
* gcc.target/aarch64/sve/abd_none_1.c: New test.
* gcc.target/aarch64/sve/abd_2.c: New test.
* gcc.target/aarch64/sve/abd_none_2.c: New test.
|
|
During regression testing of GCC for the LoongArch architecture, it was found
that the tests in the pr90883.C file failed. The problem was analyzed and
found to be caused by setting the macro LARCH_CALL_RATIO to too large a
value. Taking the actual LoongArch architecture into account, different
thresholds for meeting the test conditions were evaluated empirically
(using SPEC CPU 2006), and the results showed that the optimal threshold
should be set to 6.
gcc/ChangeLog:
* config/loongarch/loongarch.h (LARCH_CALL_RATIO): Modify the value
of macro LARCH_CALL_RATIO on LoongArch to make it perform optimally.
|
|
This patch optimizes the permutation cases that are suitable for the
merge approach.
Consider the following case:
typedef int8_t vnx16qi __attribute__((vector_size (16)));
void __attribute__ ((noipa))
merge0 (vnx16qi x, vnx16qi y, vnx16qi *out)
{
vnx16qi v = __builtin_shufflevector ((vnx16qi) x, (vnx16qi) y, MASK_16);
*(vnx16qi*)out = v;
}
The gimple IR:
v_3 = VEC_PERM_EXPR <x_1(D), y_2(D), { 0, 17, 2, 19, 4, 21, 6, 23, 8, 9, 10, 27, 12, 29, 14, 31 }>;
Selector = { 0, 17, 2, 19, 4, 21, 6, 23, 8, 9, 10, 27, 12, 29, 14, 31 }, the common expression:
{ 0, nunits + 1, 2, nunits + 3, 4, nunits + 5, ... }
For this selector, we can use vmsltu + vmerge to optimize the codegen.
Before this patch:
merge0:
addi a5,sp,16
vl1re8.v v3,0(a5)
li a5,31
vsetivli zero,16,e8,m1,ta,mu
vmv.v.x v2,a5
lui a5,%hi(.LANCHOR0)
addi a5,a5,%lo(.LANCHOR0)
vl1re8.v v1,0(a5)
vl1re8.v v4,0(sp)
vand.vv v1,v1,v2
vmsgeu.vi v0,v1,16
vrgather.vv v2,v4,v1
vadd.vi v1,v1,-16
vrgather.vv v2,v3,v1,v0.t
vs1r.v v2,0(a0)
ret
After this patch:
merge0:
addi a5,sp,16
vl1re8.v v1,0(a5)
lui a5,%hi(.LANCHOR0)
addi a5,a5,%lo(.LANCHOR0)
vsetivli zero,16,e8,m1,ta,ma
vl1re8.v v0,0(a5)
vl1re8.v v2,0(sp)
vmsltu.vi v0,v0,16
vmerge.vvm v1,v1,v2,v0
vs1r.v v1,0(a0)
ret
The key of this optimization is that:
1. mask = vmsltu (selector, nunits)
2. result = vmerge (op0, op1, mask)
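A scalar sketch of those two steps (names are illustrative; it assumes the
selector has the { 0, nunits + 1, 2, nunits + 3, ... } shape shown above, so
element i always comes from lane i of one of the two inputs):
#include <stdint.h>

void
merge_perm (int8_t *dst, const int8_t *op0, const int8_t *op1,
            const int8_t *sel, int nunits)
{
  for (int i = 0; i < nunits; i++)
    {
      int take_op0 = sel[i] < nunits;       /* 1. mask = vmsltu (selector, nunits) */
      dst[i] = take_op0 ? op0[i] : op1[i];  /* 2. result = vmerge (op0, op1, mask) */
    }
}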
gcc/ChangeLog:
* config/riscv/riscv-v.cc (shuffle_merge_patterns): New pattern.
(expand_vec_perm_const_1): Add merge optimization.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-3.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-4.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-5.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-6.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge-7.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-1.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-2.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-3.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-4.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-5.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-6.c: New test.
* gcc.target/riscv/rvv/autovec/vls-vlmax/merge_run-7.c: New test.
|
|
The V2 patch addresses comments from Juzhe, thanks.
Hi,
The reason for this bug is that in the case where the vector register is set
to a fixed length (with `--param=riscv-autovec-preference=fixed-vlmax` option),
TARGET_PASS_BY_REFERENCE thinks that variables of type vint32m1 can be passed
through two scalar registers, but when GCC calls FUNCTION_VALUE (which calls
riscv_get_arg_info inside) it returns NULL_RTX. These two functions are not
unified. The current treatment is to pass all vector arguments and returns
through the function stack, and a new calling convention for vector registers
will be added in the future.
https://github.com/riscv-non-isa/riscv-elf-psabi-doc/
https://github.com/palmer-dabbelt/riscv-elf-psabi-doc/commit/126fa719972ff998a8a239c47d506c7809aea363
Best,
Lehua
gcc/ChangeLog:
PR target/110119
* config/riscv/riscv.cc (riscv_get_arg_info): Return NULL_RTX for vector mode.
(riscv_pass_by_reference): Return true for vector mode.
gcc/testsuite/ChangeLog:
PR target/110119
* gcc.target/riscv/rvv/base/pr110119-1.c: New test.
* gcc.target/riscv/rvv/base/pr110119-2.c: New test.
|
|
This patch is considered a follow-up of the PATCH below.
https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621347.html
We align the predicate style of the define_insn_and_split patterns as
suggested by Kito, to avoid potential issues before we hit them.
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/autovec-opt.md: Align the predicate style.
* config/riscv/autovec.md: Ditto.
|
|
When constructing a vector mask from individual elements we wrongly
assumed that we can broadcast BITS_PER_WORD (i.e. XLEN). The maximum is
actually the vector element length (i.e. ELEN). This patch fixes this.
After this patch, below failures on RV32 will be fixed.
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-2.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-2.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-2.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-2.c execution test
FAIL: gcc.target/riscv/rvv/autovec/vls-vlmax/repeat_run-3.c -std=c99 -O3 -ftree-vectorize --param riscv-autovec-preference=fixed-vlmax execution test
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/riscv-v.cc (rvv_builder::get_merge_scalar_mask):
Take elen instead of scalar BITS_PER_WORD.
(expand_vector_init_merge_repeating_sequence): Use inner_bits_size
instead of scalar BITS_PER_WORD.
|
|
This patch removes a remnant of mudflap.
gcc/ChangeLog:
* config/moxie/uclinux.h (MFWRAP_SPEC): Remove.
|
|
Pushing to fix bootstrap.
gcc/ChangeLog:
* config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold):
Fix signed comparison warning in loop from npats to enelts.
|
|
..., so that users don't manually need to specify
'-foffload-options=-lgfortran', '-foffload-options=-lm' in addition to
'-lgfortran', '-lm' (specified manually, or implicitly by the driver).
gcc/
* gcc.cc (driver_handle_option): Forward host '-lgfortran', '-lm'
to offloading compilation.
* config/gcn/mkoffload.cc (main): Adjust.
* config/nvptx/mkoffload.cc (main): Likewise.
* doc/invoke.texi (foffload-options): Update example.
libgomp/
* testsuite/libgomp.fortran/fortran.exp (lang_link_flags): Don't
set.
* testsuite/libgomp.oacc-fortran/fortran.exp (lang_link_flags):
Likewise.
* testsuite/libgomp.c/simd-math-1.c: Remove
'-foffload-options=-lm'.
* testsuite/libgomp.fortran/fortran-torture_execute_math.f90:
Likewise.
* testsuite/libgomp.oacc-fortran/fortran-torture_execute_math.f90:
Likewise.
|
|
Since there's no evex version for vpcmpeq ymm, ymm, ymm.
gcc/ChangeLog:
PR target/110227
* config/i386/sse.md (mov<mode>_internal>): Use x instead of v
for alternative 2 since there's no evex version for vpcmpeqd
ymm, ymm, ymm.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr110227.c: New test.
|
|
Spurred by Akari Takahashi's patch to config/sh/divtab.cc, this removes
divtab.cc completely.
divtab.cc was used to calculate a division table for the sh5 media
processor. GCC dropped support for that (unmanufactured) chip back
in 2016 and this file simply got missed AFAICT.
gcc/
* config/sh/divtab.cc: Remove.
|
|
I've noticed that standard_sse_constant_opcode emits some spurious
whitespace around a tab; that isn't done for any other instruction
and looks wrong.
2023-06-13 Jakub Jelinek <jakub@redhat.com>
* config/i386/i386.cc (standard_sse_constant_opcode): Remove
superfluous spaces around \t for vpcmpeqd.
|
|
Hi,
This patch removes the duplicate `#include "riscv-vector-switch.def"` statement
and adds #undef for the ENTRY and TUPLE_ENTRY macros afterwards.
Best,
Lehua
gcc/ChangeLog:
* config/riscv/riscv-v.cc (struct mode_vtype_group): Remove duplicate
#include.
(ENTRY): Undef.
(TUPLE_ENTRY): Undef.
|
|
gcc/ChangeLog:
* config/riscv/riscv-v.cc (rvv_builder::single_step_npatterns_p): Add comment.
(shuffle_generic_patterns): Ditto.
(expand_vec_perm_const_1): Ditto.
|
|
Sorry for producing bugs in the previous VLA SLP patch.
Consider this following permutation:
_85 = VEC_PERM_EXPR <{ 99, 17, ... }, { 11, 80, ... }, { 0, POLY_INT_CST [4, 4], 1, POLY_INT_CST [5, 4], 2, POLY_INT_CST [6, 4], ... }>;
The correct result should be:
_85 = { 99, 11, 17, 80, ... }
However, I did wrong in the previous patch.
Code sequence before this patch:
set mask = { 0, 1, 0, 1, ... }
set v0 = { 99, 17, 99, 17, ... }
set v1 = { 11, 80, 11, 80, ... }
set index = viota (mask) = { 0, 0, 1, 1, 2, 2, ... }
set result = vrgather_mu (v0, v1, index, mask) = { 99, 11, 99, 80 }
The result is incorrect.
After this patch:
set mask = { 0, 1, 0, 1, ... }
set index = viota (mask) = { 0, 0, 1, 1, 2, 2, ... }
set v0 = vrgather ({ 99, 17, 99, 17, ... }, index) = { 99, 99, 17, 17, ... }
set v1 = { 11, 80, 11, 80, ... }
set result = vrgather_mu (v0, v1, index, mask) = { 99, 11, 17, 80 }
The result is what we expected.
This issue was discovered in the test I appended in this patch with --param=riscv-autovec-lmul=2.
gcc/ChangeLog:
* config/riscv/riscv-v.cc (emit_vlmax_decompress_insn): Fix bug.
(shuffle_decompress_patterns): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/partial/slp-12.c: New test.
* gcc.target/riscv/rvv/autovec/partial/slp_run-12.c: New test.
|
|
This patch adds a check for whether a function's argument or return value is a
vector type, and emits a warning if so.
There are two exceptions:
- The vector_size attribute.
- The intrinsic functions.
Some test cases need -Wno-psabi added to ignore the warning.
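A hedged example of the kind of code the new check warns about (the exact
diagnostic wording is not quoted from the patch):
#include <riscv_vector.h>

/* A user (non-intrinsic) function with an RVV vector argument and return
   value: its calling convention is not yet standardized, so it now gets a
   psabi warning unless -Wno-psabi is given.  */
vint32m1_t
vec_add (vint32m1_t a, vint32m1_t b)
{
  return __riscv_vadd_vv_i32m1 (a, b, __riscv_vsetvlmax_e32m1 ());
}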
gcc/ChangeLog:
* config/riscv/riscv-protos.h (riscv_init_cumulative_args): Set
warning flag if func is not a builtin.
* config/riscv/riscv.cc
(riscv_scalable_vector_type_p): Determine whether the type is scalable vector.
(riscv_arg_has_vector): Determine whether the arg is vector type.
(riscv_pass_in_vector_p): Check the vector type param is passed by value.
(riscv_init_cumulative_args): The same as header.
(riscv_get_arg_info): Add the checking.
(riscv_function_value): Check the func return and set the warning flag.
* config/riscv/riscv.h (INIT_CUMULATIVE_ARGS): Add a flag to
determine whether warning psabi or not.
gcc/testsuite/ChangeLog:
* g++.target/riscv/rvv/base/pr109244.C: Add the -Wno-psabi.
* g++.target/riscv/rvv/base/pr109535.C: Same
* gcc.target/riscv/rvv/base/binop_vx_constraint-120.c: Same
* gcc.target/riscv/rvv/base/integer_compare_insn_shortcut.c: Same
* gcc.target/riscv/rvv/base/mask_insn_shortcut.c: Same
* gcc.target/riscv/rvv/base/misc_vreinterpret_vbool_vint.c: Same
* gcc.target/riscv/rvv/base/pr110109-2.c: Same
* gcc.target/riscv/rvv/base/scalar_move-9.c: Same
* gcc.target/riscv/rvv/base/spill-10.c: Same
* gcc.target/riscv/rvv/base/spill-11.c: Same
* gcc.target/riscv/rvv/base/spill-9.c: Same
* gcc.target/riscv/rvv/base/vlmul_ext-1.c: Same
* gcc.target/riscv/rvv/base/zero_base_load_store_optimization.c: Same
* gcc.target/riscv/rvv/base/zvfh-intrinsic.c: Same
* gcc.target/riscv/rvv/base/zvfh-over-zvfhmin.c: Same
* gcc.target/riscv/rvv/base/zvfhmin-intrinsic.c: Same
* gcc.target/riscv/rvv/vsetvl/vsetvl-1.c: Same
* gcc.target/riscv/vector-abi-1.c: New test.
* gcc.target/riscv/vector-abi-2.c: New test.
* gcc.target/riscv/vector-abi-3.c: New test.
* gcc.target/riscv/vector-abi-4.c: New test.
* gcc.target/riscv/vector-abi-5.c: New test.
* gcc.target/riscv/vector-abi-6.c: New test.
Signed-off-by: Yanzhang Wang <yanzhang.wang@intel.com>
Co-authored-by: Kito Cheng <kito.cheng@sifive.com>
|
|
After discussing the -mtp= option with Arm's LLVM developers we'd like to extend
the functionality of the option somewhat.
There are actually 3 system registers that can be accessed for the thread pointer
in aarch32: tpidrurw, tpidruro, tpidrprw. They are all read through the CP15 co-processor
mechanism. The current -mtp=cp15 option reads the tpidruro register.
This patch extends -mtp to allow for the above three explicit tpidr names and
keeps -mtp=cp15 as an alias of -mtp=tpidruro for backwards compatibility.
Bootstrapped and tested on arm-none-linux-gnueabihf.
gcc/ChangeLog:
* config/arm/arm-opts.h (enum arm_tp_type): Remove TP_CP15.
Add TP_TPIDRURW, TP_TPIDRURO, TP_TPIDRPRW values.
* config/arm/arm-protos.h (arm_output_load_tpidr): Declare prototype.
* config/arm/arm.cc (arm_option_reconfigure_globals): Replace TP_CP15
with TP_TPIDRURO.
(arm_output_load_tpidr): Define.
* config/arm/arm.h (TARGET_HARD_TP): Define in terms of TARGET_SOFT_TP.
* config/arm/arm.md (load_tp_hard): Call arm_output_load_tpidr to output
assembly.
(reload_tp_hard): Likewise.
* config/arm/arm.opt (tpidrurw, tpidruro, tpidrprw): New values for
arm_tp_type.
* doc/invoke.texi (Arm Options, mtp): Document new values.
gcc/testsuite/ChangeLog:
* gcc.target/arm/mtp.c: New test.
* gcc.target/arm/mtp_1.c: New test.
* gcc.target/arm/mtp_2.c: New test.
* gcc.target/arm/mtp_3.c: New test.
* gcc.target/arm/mtp_4.c: New test.
|
|
After discussing the -mtp= option with Arm's LLVM developers we'd like to extend
the functionality of the option somewhat.
First of all, there is another TPIDR register that can be used to read the thread pointer:
TPIDRRO_EL0 (which can also be accessed by AArch32 under another name), so it makes sense
to add -mtp=tpidrro_el0. This makes the existing arguments el0, el1, el2, el3 somewhat
inconsistent in their naming, so this patch introduces the more "full" names
tpidr_el0, tpidr_el1, tpidr_el2, tpidr_el3 and makes the above short names aliases of these new ones.
Long story short, we preserve backwards compatibility and add a new TPIDR register to access through
-mtp that wasn't available previously.
There is more relevant discussion of the options at https://reviews.llvm.org/D152433 if you're interested.
Bootstrapped and tested on aarch64-none-linux-gnu.
gcc/ChangeLog:
PR target/108779
* config/aarch64/aarch64-opts.h (enum aarch64_tp_reg): Add
AARCH64_TPIDRRO_EL0 value.
* config/aarch64/aarch64.cc (aarch64_output_load_tp): Define.
* config/aarch64/aarch64.opt (tpidr_el0, tpidr_el1, tpidr_el2,
tpidr_el3, tpidrro_el0): New accepted values for -mtp=.
* doc/invoke.texi (AArch64 Options): Document new -mtp= options.
gcc/testsuite/ChangeLog:
PR target/108779
* gcc.target/aarch64/mtp_5.c: New test.
* gcc.target/aarch64/mtp_6.c: New test.
* gcc.target/aarch64/mtp_7.c: New test.
* gcc.target/aarch64/mtp_8.c: New test.
* gcc.target/aarch64/mtp_9.c: New test.
|
|
This patch optimizes an SVE intrinsics sequence such as
svlasta (svptrue_pat_b8 (SV_VL1), x)
where a scalar is selected based on a constant predicate and a variable vector.
This sequence is optimized to return the corresponding element of a NEON
vector. For example,
svlasta (svptrue_pat_b8 (SV_VL1), x)
returns
umov w0, v0.b[1]
Likewise,
svlastb (svptrue_pat_b8 (SV_VL1), x)
returns
umov w0, v0.b[0]
This optimization only works provided the constant predicate maps to a range
that is within the bounds of a 128-bit NEON register.
gcc/ChangeLog:
PR target/96339
* config/aarch64/aarch64-sve-builtins-base.cc (svlast_impl::fold): Fold sve
calls that have a constant input predicate vector.
(svlast_impl::is_lasta): Query to check if intrinsic is svlasta.
(svlast_impl::is_lastb): Query to check if intrinsic is svlastb.
(svlast_impl::vect_all_same): Check if all vector elements are equal.
gcc/testsuite/ChangeLog:
PR target/96339
* gcc.target/aarch64/sve/acle/general-c/svlast.c: New.
* gcc.target/aarch64/sve/acle/general-c/svlast128_run.c: New.
* gcc.target/aarch64/sve/acle/general-c/svlast256_run.c: New.
* gcc.target/aarch64/sve/pcs/return_4.c (caller_bf16): Fix asm
to expect optimized code for function body.
* gcc.target/aarch64/sve/pcs/return_4_128.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_256.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_512.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_1024.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_4_2048.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_128.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_256.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_512.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_1024.c (caller_bf16): Likewise.
* gcc.target/aarch64/sve/pcs/return_5_2048.c (caller_bf16): Likewise.
|
|
- Fix gen_autofdo_event: The download URL for the Intel Perfmon Event
list has changed, as well as the JSON format.
Also it now uses pattern matching to match CPUs. Update the script to support all of this.
- Regenerate gcc-auto-profile with the latest published Intel model
numbers, so it works with recent systems.
- So far it's still broken on hybrid systems
contrib/ChangeLog:
* gen_autofdo_event.py: Update for download server changes.
gcc/ChangeLog:
* config/i386/gcc-auto-profile: Regenerate.
|
|
This patch fixes the requirement of V_WHOLE and V_FRACT.
E.g. VNx8QI in V_WHOLE has no requirement, which is incorrect.
Actually, VNx8QI should be a whole (full) mode only when TARGET_MIN_VLEN < 128,
since when TARGET_MIN_VLEN == 128, VNx8QI is e8mf2, which is a fractional
vector.
Co-Authored by: Robin Dapp <rdapp@ventanamicro.com>
gcc/ChangeLog:
* config/riscv/vector-iterators.md: Fix requirement.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vls-vlmax/full-vec-move1.c: New test.
|