|
Fix a bug in vector.md which generated incorrect information for the
VSETVL pass when testing FMA auto-vectorization (ternop-3.c).
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
gcc/ChangeLog:
* config/riscv/vector.md: Fix vimuladd instruction bug.
|
|
Currently, mode switching produces incorrect codegen for the following case:
void fn (void);
void f (void * in, void *out, int32_t x, int n, int m)
{
for (int i = 0; i < n; i++) {
vint32m1_t v = __riscv_vle32_v_i32m1 (in + i, 4);
vint32m1_t v2 = __riscv_vle32_v_i32m1_tu (v, in + 100 + i, 4);
vint32m1_t v3 = __riscv_vaadd_vx_i32m1 (v2, 0, VXRM_RDN, 4);
fn ();
v3 = __riscv_vaadd_vx_i32m1 (v3, 3, VXRM_RDN, 4);
__riscv_vse32_v_i32m1 (out + 100 + i, v3, 4);
}
}
Before this patch:
Preheader:
...
csrwi vxrm,2
Loop Body:
... (no csrwi vxrm,2)
vaadd.vx
...
vaadd.vx
...
This codegen is incorrect.
After this patch:
Preheader:
...
csrwi vxrm,2
Loop Body:
...
vaadd.vx
...
csrwi vxrm,2
...
vaadd.vx
...
cross-compile build PASS and regression PASS.
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
gcc/ChangeLog:
* config/riscv/riscv.cc (global_state_unknown_p): New function.
(riscv_mode_after): Fix incorrect VXRM.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/vxrm-11.c: New test.
* gcc.target/riscv/rvv/base/vxrm-12.c: New test.
|
|
This patch adds a new sub-extension (aka ZVFHMIN) to the
-march= option. To keep it simple, only the sub-extension itself is
handled in this patch; the underlying FP16-related RVV intrinsic
API will depend on TARGET_ZVFHMIN.
The Zvfhmin extension depends on the Zve32f extension. More information
about ZVFHMIN can be found in the spec document below.
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#zvfhmin-vector-extension-for-minimal-half-precision-floating-point
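A minimal sketch of the kind of test this enables; the -march string and the
__riscv_zvfhmin feature-test macro are assumptions here, not necessarily what
arch-20.c/predef-26.c actually contain:
/* { dg-do compile } */
/* { dg-options "-march=rv64gc_zvfhmin -mabi=lp64d" } */
#ifndef __riscv_zvfhmin
#error "__riscv_zvfhmin is not defined"
#endif
int main (void) { return 0; }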
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* common/config/riscv/riscv-common.cc:
(riscv_implied_info): Add zvfhmin item.
(riscv_ext_version_table): Ditto.
(riscv_ext_flag_table): Ditto.
* config/riscv/riscv-opts.h (MASK_ZVFHMIN): New macro.
(TARGET_ZFHMIN): Align indent.
(TARGET_ZFH): Ditto.
(TARGET_ZVFHMIN): New macro.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/arch-20.c: New test.
* gcc.target/riscv/predef-26.c: New test.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
r12-5595-gc39d77f252e895306ef88c1efb3eff04e4232554 adds 2 splitters to
transform notl + pbroadcast + pand into pbroadcast + pandn for
VI124_AVX2, which leaves out all DI-element-size modes as
well as all 512-bit ones.
This patch extends the splitters to VI_AVX2, which handles DImode for
AVX2, and V64QImode, V32HImode, V16SImode and V8DImode for AVX512.
gcc/ChangeLog:
PR target/100711
* config/i386/sse.md (*andnot<mode>3): Extend below splitter
to VI_AVX2 to cover more modes.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr100711-2.c: Add v4di/v2di testcases.
* gcc.target/i386/pr100711-3.c: New test.
|
|
The lzcnt/tzcnt false dependency has been fixed since Skylake; popcnt has been
fixed since Icelake. At least for Icelake and later Intel Core processors, the
errata tune is not needed, and the tune isn't needed for ATOM either.
gcc/ChangeLog:
* config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_BMI):
Remove ATOM and ICELAKE(and later) core processors.
|
|
This patch implements abs<mode>2, vneg<mode>2 and vnot<mode>2
expanders for integer vector registers and adds tests for them.
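For illustration, loops of roughly this shape are what the new expanders let
the vectorizer handle (a sketch, not the contents of the added tests):
/* Unary negation, bitwise not and absolute value.  */
void vneg (int *restrict dst, int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = -src[i];
}
void vnot (int *restrict dst, int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = ~src[i];
}
void vabs (int *restrict dst, int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i] < 0 ? -src[i] : src[i];
}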
gcc/ChangeLog:
* config/riscv/autovec.md (<optab><mode>2): Add vneg/vnot.
(abs<mode>2): Add.
* config/riscv/riscv-protos.h (emit_vlmax_masked_mu_insn):
Declare.
* config/riscv/riscv-v.cc (emit_vlmax_masked_mu_insn): New
function.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/rvv.exp: Add unop tests.
* gcc.target/riscv/rvv/autovec/unop/abs-run.c: New test.
* gcc.target/riscv/rvv/autovec/unop/abs-rv32gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/abs-rv64gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/abs-template.h: New test.
* gcc.target/riscv/rvv/autovec/unop/vneg-run.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vneg-rv32gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vneg-rv64gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vneg-template.h: New test.
* gcc.target/riscv/rvv/autovec/unop/vnot-run.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vnot-rv32gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vnot-rv64gcv.c: New test.
* gcc.target/riscv/rvv/autovec/unop/vnot-template.h: New test.
|
|
This patch implements the autovec expanders for sign and zero extension
patterns as well as the accompanying truncations. In order to use them
additional mode_attr iterators as well as vectorizer hooks are required.
Using these hooks we can e.g. vectorize with VNx4QImode as base mode
and extend VNx4SI to VNx4DI. They are still going to be expanded in the
future.
vf4 and vf8 truncations are emulated by truncating two and three times
respectively.
The patch also adds tests and changes some expectations for already
existing ones.
Combine does not yet handle binary operations of two widened operands
as we are missing the necessary split/rewrite patterns. These will be
added at a later time.
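A sketch of the kinds of conversions these expanders cover (not taken from
the added tests):
/* Zero extension, QI -> DI elements.  */
void zext (unsigned long long *restrict dst, unsigned char *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
}
/* vf8 truncation, DI -> QI elements, emulated by truncating stepwise.  */
void trunc (unsigned char *restrict dst, unsigned long long *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (unsigned char) src[i];
}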
Co-authored-by: Juzhe Zhong <juzhe.zhong@rivai.ai>
gcc/ChangeLog:
* config/riscv/autovec.md (<optab><v_double_trunc><mode>2): New
expander.
(<optab><v_quad_trunc><mode>2): Dito.
(<optab><v_oct_trunc><mode>2): Dito.
(trunc<mode><v_double_trunc>2): Dito.
(trunc<mode><v_quad_trunc>2): Dito.
(trunc<mode><v_oct_trunc>2): Dito.
* config/riscv/riscv-protos.h (vectorize_related_mode): Define.
(autovectorize_vector_modes): Define.
* config/riscv/riscv-v.cc (vectorize_related_mode): Implement
hook.
(autovectorize_vector_modes): Implement hook.
* config/riscv/riscv.cc (riscv_autovectorize_vector_modes):
Implement target hook.
(riscv_vectorize_related_mode): Implement target hook.
(TARGET_VECTORIZE_AUTOVECTORIZE_VECTOR_MODES): Define.
(TARGET_VECTORIZE_RELATED_MODE): Define.
* config/riscv/vector-iterators.md: Add lowercase versions of
mode_attr iterators.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/binop/shift-rv32gcv.c: Adjust
expectation.
* gcc.target/riscv/rvv/autovec/binop/shift-rv64gcv.c: Dito.
* gcc.target/riscv/rvv/autovec/binop/vdiv-run.c: Dito.
* gcc.target/riscv/rvv/autovec/binop/vdiv-rv32gcv.c: Dito.
* gcc.target/riscv/rvv/autovec/binop/vdiv-rv64gcv.c: Dito.
* gcc.target/riscv/rvv/autovec/binop/vdiv-template.h: Dito.
* gcc.target/riscv/rvv/autovec/binop/vrem-rv32gcv.c: Dito.
* gcc.target/riscv/rvv/autovec/binop/vrem-rv64gcv.c: Dito.
* gcc.target/riscv/rvv/autovec/zve32f_zvl128b-2.c: Dito.
* gcc.target/riscv/rvv/autovec/zve32x_zvl128b-2.c: Dito.
* gcc.target/riscv/rvv/autovec/zve64d-2.c: Dito.
* gcc.target/riscv/rvv/autovec/zve64f-2.c: Dito.
* gcc.target/riscv/rvv/autovec/zve64x-2.c: Dito.
* gcc.target/riscv/rvv/rvv.exp: Add new conversion tests.
* gcc.target/riscv/rvv/vsetvl/avl_single-38.c: Do not vectorize.
* gcc.target/riscv/rvv/vsetvl/avl_single-47.c: Dito.
* gcc.target/riscv/rvv/vsetvl/avl_single-48.c: Dito.
* gcc.target/riscv/rvv/vsetvl/avl_single-49.c: Dito.
* gcc.target/riscv/rvv/vsetvl/imm_switch-8.c: Dito.
* gcc.target/riscv/rvv/autovec/conversions/vncvt-run.c: New test.
* gcc.target/riscv/rvv/autovec/conversions/vncvt-rv32gcv.c: New test.
* gcc.target/riscv/rvv/autovec/conversions/vncvt-rv64gcv.c: New test.
* gcc.target/riscv/rvv/autovec/conversions/vncvt-template.h: New test.
* gcc.target/riscv/rvv/autovec/conversions/vsext-run.c: New test.
* gcc.target/riscv/rvv/autovec/conversions/vsext-rv32gcv.c: New test.
* gcc.target/riscv/rvv/autovec/conversions/vsext-rv64gcv.c: New test.
* gcc.target/riscv/rvv/autovec/conversions/vsext-template.h: New test.
* gcc.target/riscv/rvv/autovec/conversions/vzext-run.c: New test.
* gcc.target/riscv/rvv/autovec/conversions/vzext-rv32gcv.c: New test.
* gcc.target/riscv/rvv/autovec/conversions/vzext-rv64gcv.c: New test.
* gcc.target/riscv/rvv/autovec/conversions/vzext-template.h: New test.
|
|
Since code object target ID V4, xnack has the values unspecified, '+' and '-',
which with this commit are represented in GCC as 'any', 'on', and 'off',
following the precedent of 'sram(-)ecc' and -msram-ecc=.
The previous default was 'no' and is now 'off'; however, once XNACK is
implemented, the default should probably be 'any'.
This commit updates the commandline options to permit the new tristate and
updates the documentation. As the feature itself is currently not really
supported in GCC, the change should not affect real-world users.
The XNACK feature allows memory load instructions to restart safely following
a page-miss interrupt. This is useful for shared-memory devices, like APUs,
and to implement OpenMP Unified Shared Memory.
2023-05-26 Andrew Stubbs <ams@codesourcery.com>
Tobias Burnus <tobias@codesourcery.com>
* config/gcn/gcn-hsa.h (XNACKOPT): New macro.
(ASM_SPEC): Use XNACKOPT.
* config/gcn/gcn-opts.h (enum sram_ecc_type): Rename to ...
(enum hsaco_attr_type): ... this, and generalize the names.
(TARGET_XNACK): New macro.
* config/gcn/gcn.cc (gcn_option_override): Update to sorry for all
but -mxnack=off.
(output_file_start): Update xnack handling.
(gcn_hsa_declare_function_name): Use TARGET_XNACK.
* config/gcn/gcn.opt (-mxnack): Add the "on/off/any" syntax.
(sram_ecc_type): Rename to ...
(hsaco_attr_type): ... this.
* config/gcn/mkoffload.cc (SET_XNACK_ANY): New macro.
(TEST_XNACK): Delete.
(TEST_XNACK_ANY): New macro.
(TEST_XNACK_ON): New macro.
(main): Support the new -mxnack=on/off/any syntax.
* doc/invoke.texi (-mxnack): Update for new syntax.
|
|
In order to do away with voodoo estimation logic full of magic numbers,
this patch revises the code to measure the costs of the three memset
methods based on the actual emitted size of the insn sequence
corresponding to each method, and to choose the smallest one.
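For example, a plain block set such as the one below now gets whichever
method produces the smallest measured insn sequence; the constant size is
just an illustration:
#include <string.h>
void clear_buffer (char *p)
{
  memset (p, 0, 64);
}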
gcc/ChangeLog:
* config/xtensa/xtensa-protos.h
(xtensa_expand_block_set_unrolled_loop,
xtensa_expand_block_set_small_loop): Remove.
(xtensa_expand_block_set): New prototype.
* config/xtensa/xtensa.cc
(xtensa_expand_block_set_libcall): New subfunction.
(xtensa_expand_block_set_unrolled_loop,
xtensa_expand_block_set_small_loop): Rewrite as subfunctions.
(xtensa_expand_block_set): New function that calls the above
subfunctions.
* config/xtensa/xtensa.md (memsetsi): Change to invoke only
xtensa_expand_block_set().
|
|
This patch tries to eliminate the use of a temporary pseudo for
'(minus:SI (const_int) (reg:SI))' if the addition of the negated constant
value can be emitted in a single machine instruction.
/* example */
int test0(int x) {
return 1 - x;
}
int test1(int x) {
return 100 - x;
}
int test2(int x) {
return 25600 - x;
}
;; before
test0:
movi.n a9, 1
sub a2, a9, a2
ret.n
test1:
movi a9, 0x64
sub a2, a9, a2
ret.n
test2:
movi.n a9, 0x19
slli a9, a9, 10
sub a2, a9, a2
ret.n
;; after
test0:
addi.n a2, a2, -1
neg a2, a2
ret.n
test1:
addi a2, a2, -100
neg a2, a2
ret.n
test2:
addmi a2, a2, -0x6400
neg a2, a2
ret.n
gcc/ChangeLog:
* config/xtensa/xtensa-protos.h (xtensa_m1_or_1_thru_15):
New prototype.
* config/xtensa/xtensa.cc (xtensa_m1_or_1_thru_15):
New function.
* config/xtensa/constraints.md (O):
Change to use the above function.
* config/xtensa/xtensa.md (*subsi3_from_const):
New insn_and_split pattern.
|
|
gcc/ChangeLog:
* config/xtensa/xtensa.md (*extzvsi-1bit_ashlsi3):
Retract excessive line folding, and correct the value of
the "length" insn attribute related to TARGET_DENSITY.
(*extzvsi-1bit_addsubx): Ditto.
|
|
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi):
Do not disable call to ix86_expand_vecop_qihi2.
|
|
gcc/ChangeLog:
* config/riscv/riscv.cc (vector_zero_call_used_regs): Add
explicit VL and drop VL in ops.
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
|
|
Rewrite ix86_expand_vecop_qihi2 to expand to 2x-wider (e.g. V16QI -> V16HImode)
instructions when available. Currently, the compiler generates the following
assembly for V16QImode multiplication (-mavx2):
vpunpcklbw %xmm0, %xmm0, %xmm3
vpunpcklbw %xmm1, %xmm1, %xmm2
vpunpckhbw %xmm0, %xmm0, %xmm0
movl $255, %eax
vpunpckhbw %xmm1, %xmm1, %xmm1
vpmullw %xmm3, %xmm2, %xmm2
vmovd %eax, %xmm3
vpmullw %xmm0, %xmm1, %xmm1
vpbroadcastw %xmm3, %xmm3
vpand %xmm2, %xmm3, %xmm0
vpand %xmm1, %xmm3, %xmm3
vpackuswb %xmm3, %xmm0, %xmm0
and only with -mavx512bw -mavx512vl generates:
vpmovzxbw %xmm1, %ymm1
vpmovzxbw %xmm0, %ymm0
vpmullw %ymm1, %ymm0, %ymm0
vpmovwb %ymm0, %xmm0
The patched compiler generates more optimized code involving multiplication
in the 2x-wider mode, even in cases where the missing truncate instruction has
to be emulated with a permutation (-mavx2):
vpmovzxbw %xmm0, %ymm0
vpmovzxbw %xmm1, %ymm1
movl $255, %eax
vpmullw %ymm1, %ymm0, %ymm1
vmovd %eax, %xmm0
vpbroadcastw %xmm0, %ymm0
vpand %ymm1, %ymm0, %ymm0
vpackuswb %ymm0, %ymm0, %ymm0
vpermq $216, %ymm0, %ymm0
The patch also adjusts cost calculation of V*QImode emulations to account
for generation of 2x-wider mode instructions.
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi2):
Rewrite to expand to 2x-wider (e.g. V16QI -> V16HImode)
instructions when available. Emulate truncation via
ix86_expand_vec_perm_const_1 when native truncate insn
is not available.
(ix86_expand_vecop_qihi_partial) <case MULT>: Use pmovzx
when available. Trivially rename some variables.
(ix86_expand_vecop_qihi): Unconditionally call ix86_expand_vecop_qihi2.
* config/i386/i386.cc (ix86_multiplication_cost): Rewrite cost
calculation of V*QImode emulations to account for generation of
2x-wider mode instructions.
(ix86_shift_rotate_cost): Update cost calculation of V*QImode
emulations to account for generation of 2x-wider mode instructions.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512vl-pr95488-1.c: Revert 2023-05-18 change.
|
|
avr-common.cc introduces the following options that are set depending
on optimization level: -mgas-isr-prologues, -mmain-is-OS-task and
-fsplit-wide-types-early. The inliner thinks that different options
disallow cross-optimization inlining, so provide can_inline_p.
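A hypothetical reproducer sketch (not taken from the PR) of the kind of
rejection this addresses; the exact failure mode is an assumption:
static inline __attribute__((always_inline)) int add1 (int x) { return x + 1; }
int __attribute__((optimize("O2")))
caller (int x)
{
  /* Before the hook: inlining could fail with a target option mismatch,
     because the optimization level changed the avr-common.cc options.  */
  return add1 (x);
}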
gcc/
PR target/104327
* config/avr/avr.cc (avr_can_inline_p): New static function.
(TARGET_CAN_INLINE_P): Define to that function.
|
|
There is already a pattern in avr.md that matches single-bit transfers
from one register to another, but it only handled bit 0 of 8-bit
registers. This change makes that pattern more generic so that it matches
more such single-bit transfers.
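For illustration, a transfer like the following (bit 3 of one register into
bit 5 of another; the positions are arbitrary) is what the generalized
pattern can now match:
/* Copy bit 3 of X into bit 5 of Y, leaving the other bits of Y alone.  */
unsigned char move_bit (unsigned char x, unsigned char y)
{
  return (y & ~(1u << 5)) | (((x >> 3) & 1u) << 5);
}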
gcc/
PR target/82931
* config/avr/avr.md (*movbitqi.0): Rename to *movbit<mode>.0-6.
Handle any bit position and use mode QISI.
* config/avr/avr.cc (avr_rtx_costs_1) [IOR]: Return a cost
of 2 insns for bit-transfer of respective style.
gcc/testsuite/
PR target/82931
* gcc.target/avr/pr82931.c: New test.
|
|
MVE_5 and MVE_6 iterators are the same: this patch replaces MVE_6 with
MVE_5 everywhere in mve.md and removes MVE_6 from iterators.md.
2023-05-25 Christophe Lyon <christophe.lyon@linaro.org>
gcc/
* config/arm/iterators.md (MVE_6): Remove.
* config/arm/mve.md: Replace MVE_6 with MVE_5.
|
|
This patch annotates the complex add and mla patterns for vec-concat-zero.
Testing showed an interesting bug in our MD patterns where they were defined to match:
(plus:VHSDF (match_operand:VHSDF 1 "register_operand" "0")
(unspec:VHSDF [(match_operand:VHSDF 2 "register_operand" "w")
(match_operand:VHSDF 3 "register_operand" "w")
(match_operand:SI 4 "const_int_operand" "n")]
FCMLA))
but the canonicalisation rules for PLUS require the more "complex" operand to
come first, so during combine, when the new substituted patterns were formed,
combine/recog would try to match:
(plus:V2SF (unspec:V2SF [
(reg:V2SF 100)
(reg:V2SF 101)
(const_int 0 [0])
] UNSPEC_FCMLA270)
(reg:V2SF 99))
instead. This patch fixes the operands of the PLUS RTX in these patterns.
Similar patterns for the dot-product instructions already used the right order.
Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
gcc/ChangeLog:
PR target/99195
* config/aarch64/aarch64-simd.md (aarch64_fcadd<rot><mode>): Rename to...
(aarch64_fcadd<rot><mode><vczle><vczbe>): ... This.
Fix canonicalization of PLUS operands.
(aarch64_fcmla<rot><mode>): Rename to...
(aarch64_fcmla<rot><mode><vczle><vczbe>): ... This.
Fix canonicalization of PLUS operands.
(aarch64_fcmla_lane<rot><mode>): Rename to...
(aarch64_fcmla_lane<rot><mode><vczle><vczbe>): ... This.
Fix canonicalization of PLUS operands.
(aarch64_fcmla_laneq<rot>v4hf): Rename to...
(aarch64_fcmla_laneq<rot>v4hf<vczle><vczbe>): ... This.
Fix canonicalization of PLUS operands.
(aarch64_fcmlaq_lane<rot><mode>): Fix canonicalization of PLUS operands.
gcc/testsuite/ChangeLog:
PR target/99195
* gcc.target/aarch64/simd/pr99195_9.c: New test.
|
|
This patch implements a number of scalar data processing intrinsics from ACLE
that were requested by some users. Some of these have fast single-instruction
sequences for Armv6 and later, but even for earlier versions they can still emit
an inline sequence or a call to libgcc (and ACLE recommends them being unconditionally
available).
Chris Sidebottom wrote most of the patch, I just cleaned it up, wired up some builtins
and adjusted the tests.
Bootstrapped and tested on arm-none-linux-gnueabihf.
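A small usage example of some of the intrinsics wired up here (codegen
depends on the targeted architecture version):
#include <arm_acle.h>
/* Bit reversal, halfword byte swap and rotate right, all from ACLE.  */
unsigned int mix (unsigned int x)
{
  return __rbit (x) ^ __rev16 (x) ^ __ror (x, 8);
}
unsigned int leading_zeros (unsigned int x)
{
  return __clz (x);
}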
Co-authored-by: Chris Sidebottom <chris.sidebottom@arm.com>
gcc/ChangeLog:
* config/arm/arm.md (rbitsi2): Rename to...
(arm_rbit): ... This.
(ctzsi2): Adjust for the above.
(arm_rev16si2): Convert to define_expand.
(arm_rev16si2_alt1): New pattern.
(arm_rev16si2_alt): Rename to...
(*arm_rev16si2_alt2): ... This.
* config/arm/arm_acle.h (__ror, __rorl, __rorll, __clz, __clzl, __clzll,
__cls, __clsl, __clsll, __revsh, __rev, __revl, __revll, __rev16,
__rev16l, __rev16ll, __rbit, __rbitl, __rbitll): Define intrinsics.
* config/arm/arm_acle_builtins.def (rbit, rev16si2): Define builtins.
gcc/testsuite/ChangeLog:
* gcc.target/arm/acle/data-intrinsics-armv6.c: New test.
* gcc.target/arm/acle/data-intrinsics-assembly.c: New test.
* gcc.target/arm/acle/data-intrinsics-rbit.c: New test.
* gcc.target/arm/acle/data-intrinsics.c: New test.
|
|
In r11-966-g9a182ef9ee011935d827ab5c6c9a7cd8e22257d8 we introduced a
simplification to emit_move_insn that attempts to simplify moves of the form:
(set (subreg:M1 (reg:M2 ...)) (constant C))
where M1 and M2 are of equal mode size. That is problematic for the splitter
vfp.md:no_literal_pool_df_immediate in the arm backend, which tries to pun an
lvalue DFmode pseudo into DImode and assign a constant to it with
emit_move_insn, as the new transformation simply undoes this, and we end up
splitting indefinitely.
This patch changes things around in the arm backend so that we use a
DImode temporary (instead of DFmode) and first load the DImode constant
into the pseudo, and then pun the pseudo into DFmode as an rvalue in a
reg -> reg move. I believe this should be semantically equivalent but
avoids the pathological behaviour seen in the PR.
gcc/ChangeLog:
PR target/109800
* config/arm/arm.md (movdf): Generate temporary pseudo in DImode
instead of DFmode.
* config/arm/vfp.md (no_literal_pool_df_immediate): Rather than punning an
lvalue DFmode pseudo into DImode, use a DImode pseudo and pun it into
DFmode as an rvalue.
gcc/testsuite/ChangeLog:
PR target/109800
* gcc.target/arm/pure-code/pr109800.c: New test.
|
|
ARC's current TLS Local Dynamic model uses two anchors to access
data, namely `.tdata` and `.tbss`. This implementation is unnecessarily
complicated. The TLS Local Dynamic model gets better results by
using the Global Dynamic model together with anchors.
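For reference, a simple thread-local access of the kind affected (for
example when built with -ftls-model=local-dynamic):
/* Previously accessed through the separate .tdata/.tbss anchors.  */
static __thread int counter;
int bump (void)
{
  return ++counter;
}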
gcc/ChangeLog:
* config/arc/arc.cc (arc_call_tls_get_addr): Simplify access using
TLS Local Dynamic.
Signed-off-by: Claudiu Zissulescu <claziss@gmail.com>
|
|
gcc/ChangeLog:
* config/aarch64/aarch64.cc (scalar_move_insn_p): New function.
(seq_cost_ignoring_scalar_moves): Likewise.
(aarch64_expand_vector_init): Call seq_cost_ignoring_scalar_moves.
|
|
While optimising some vector math library code with intrinsics we stumbled upon the issue in the testcase.
The compiler should be generating a FACGT instruction but instead we generate:
foo(__Float32x4_t, __Float32x4_t, __Float32x4_t):
fabs v0.4s, v0.4s
adrp x0, .LC0
ldr q31, [x0, #:lo12:.LC0]
fcmgt v0.4s, v0.4s, v31.4s
ret
This is because the vcagtq_f32 intrinsic is open-coded in arm_neon.h as
return vabsq_f32 (__a) > vabsq_f32 (__b)
thus relying on the optimisers to merge it back together. But since one of the arms of the comparison
is a vector constant the combine pass optimises the abs into it and tries matching:
(set (reg:V4SI 101)
(neg:V4SI (gt:V4SI (reg:V4SF 100)
(const_vector:V4SF [
(const_double:SF 1.0e+2 [0x0.c8p+7]) repeated x4
]))))
and
(set (reg:V4SI 101)
(neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 104))
(reg:V4SF 103))))
instead of what we want:
(insn 13 9 14 2 (set (reg/i:V4SI 32 v0)
(neg:V4SI (gt:V4SI (abs:V4SF (reg:V4SF 98))
(abs:V4SF (reg:V4SF 96)))))
I don't really see a good way around that with our current implementation of these intrinsics.
Therefore this patch reimplements these intrinsics with aarch64 builtins that generate the RTL for these
instructions directly. Apparently we already had them defined in aarch64-simd-builtins.def and have been
using them for the fp16 case already.
I realise that this approach is against the general principle of expressing intrinsics in the higher-level constructs,
so I'm willing to listen to counter-arguments.
That said, the FACGT/FACGE instructions are as fast as the non-ABS comparison instructions on all microarchitectures that I know of
so it should always be a win to have them in the merged form rather than split the fabs step separately or try to hoist it.
And the testcase does come from real library code that we're trying to optimise.
With this patch for the testcase we generate:
foo:
adrp x0, .LC0
ldr q31, [x0, #:lo12:.LC0]
facgt v0.4s, v0.4s, v31.4s
ret
gcc/ChangeLog:
* config/aarch64/arm_neon.h (vcage_f64): Reimplement with builtins.
(vcage_f32): Likewise.
(vcages_f32): Likewise.
(vcageq_f32): Likewise.
(vcaged_f64): Likewise.
(vcageq_f64): Likewise.
(vcagts_f32): Likewise.
(vcagt_f32): Likewise.
(vcagt_f64): Likewise.
(vcagtq_f32): Likewise.
(vcagtd_f64): Likewise.
(vcagtq_f64): Likewise.
(vcale_f32): Likewise.
(vcale_f64): Likewise.
(vcaled_f64): Likewise.
(vcales_f32): Likewise.
(vcaleq_f32): Likewise.
(vcaleq_f64): Likewise.
(vcalt_f32): Likewise.
(vcalt_f64): Likewise.
(vcaltd_f64): Likewise.
(vcaltq_f32): Likewise.
(vcaltq_f64): Likewise.
(vcalts_f32): Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/simd/facgt_constpool_1.c: New test.
|
|
This patch fixes the incorrect intrinsic signatures for
_mm{512|256|}_s{lli|rai|rli}_epi*.
gcc/ChangeLog:
PR target/109173
PR target/109174
* config/i386/avx512bwintrin.h (_mm512_srli_epi16): Change type from
int to const int or const int to const unsigned int.
(_mm512_mask_srli_epi16): Ditto.
(_mm512_slli_epi16): Ditto.
(_mm512_mask_slli_epi16): Ditto.
(_mm512_maskz_slli_epi16): Ditto.
(_mm512_srai_epi16): Ditto.
(_mm512_mask_srai_epi16): Ditto.
(_mm512_maskz_srai_epi16): Ditto.
* config/i386/avx512fintrin.h (_mm512_slli_epi64): Ditto.
(_mm512_mask_slli_epi64): Ditto.
(_mm512_maskz_slli_epi64): Ditto.
(_mm512_srli_epi64): Ditto.
(_mm512_mask_srli_epi64): Ditto.
(_mm512_maskz_srli_epi64): Ditto.
(_mm512_srai_epi64): Ditto.
(_mm512_mask_srai_epi64): Ditto.
(_mm512_maskz_srai_epi64): Ditto.
(_mm512_slli_epi32): Ditto.
(_mm512_mask_slli_epi32): Ditto.
(_mm512_maskz_slli_epi32): Ditto.
(_mm512_srli_epi32): Ditto.
(_mm512_mask_srli_epi32): Ditto.
(_mm512_maskz_srli_epi32): Ditto.
(_mm512_srai_epi32): Ditto.
(_mm512_mask_srai_epi32): Ditto.
(_mm512_maskz_srai_epi32): Ditto.
* config/i386/avx512vlbwintrin.h (_mm256_mask_srai_epi16): Ditto.
(_mm256_maskz_srai_epi16): Ditto.
(_mm_mask_srai_epi16): Ditto.
(_mm_maskz_srai_epi16): Ditto.
(_mm256_mask_slli_epi16): Ditto.
(_mm256_maskz_slli_epi16): Ditto.
(_mm_mask_slli_epi16): Ditto.
(_mm_maskz_slli_epi16): Ditto.
(_mm_maskz_srli_epi16): Ditto.
* config/i386/avx512vlintrin.h (_mm256_mask_srli_epi32): Ditto.
(_mm256_maskz_srli_epi32): Ditto.
(_mm_mask_srli_epi32): Ditto.
(_mm_maskz_srli_epi32): Ditto.
(_mm256_mask_srli_epi64): Ditto.
(_mm256_maskz_srli_epi64): Ditto.
(_mm_mask_srli_epi64): Ditto.
(_mm_maskz_srli_epi64): Ditto.
(_mm256_mask_srai_epi32): Ditto.
(_mm256_maskz_srai_epi32): Ditto.
(_mm_mask_srai_epi32): Ditto.
(_mm_maskz_srai_epi32): Ditto.
(_mm256_srai_epi64): Ditto.
(_mm256_mask_srai_epi64): Ditto.
(_mm256_maskz_srai_epi64): Ditto.
(_mm_srai_epi64): Ditto.
(_mm_mask_srai_epi64): Ditto.
(_mm_maskz_srai_epi64): Ditto.
(_mm_mask_slli_epi32): Ditto.
(_mm_maskz_slli_epi32): Ditto.
(_mm_mask_slli_epi64): Ditto.
(_mm_maskz_slli_epi64): Ditto.
(_mm256_mask_slli_epi32): Ditto.
(_mm256_maskz_slli_epi32): Ditto.
(_mm256_mask_slli_epi64): Ditto.
(_mm256_maskz_slli_epi64): Ditto.
gcc/testsuite/ChangeLog:
PR target/109173
PR target/109174
* gcc.target/i386/pr109173-1.c: New test.
* gcc.target/i386/pr109174-1.c: Ditto.
|
|
According to RVV ISA:
The conversions use the dynamic rounding mode in frm, except for the rtz
variants, which round towards zero.
So rtz conversion patterns should not have FRM dependency.
We can't support mode switching for FRM yet since the RVV intrinsic doc is
not updated, but I think this patch is correct.
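For example, an rtz conversion intrinsic such as the one below always rounds
towards zero, so its pattern need not depend on FRM (the intrinsic name
follows the usual RVV naming and is shown here as an assumption):
#include <riscv_vector.h>
/* Rounds towards zero regardless of the dynamic rounding mode in frm.  */
vint32m1_t to_int_rtz (vfloat32m1_t v, size_t vl)
{
  return __riscv_vfcvt_rtz_x_f_v_i32m1 (v, vl);
}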
gcc/ChangeLog:
* config/riscv/vector.md: Remove FRM_REGNUM dependency in rtz
instructions.
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
|
|
One of the supplied argument strings is unnecessarily long (C-SKY, using
basically the same code, fixed it with a shorter length). This fixes overflow
warnings, as GCC fails to deduce that the full 256 bytes of load_op[] are
never all used.
gcc/ChangeLog:
* config/mcore/mcore.cc (output_inline_const): Make buffer smaller to
silence overflow warnings later on.
|
|
Also, move v<any_shift:insn>v8qi3 expander to a better place and enable
it with TARGET_MMX_WITH_SSE. Remove handling of V8QImode from
ix86_expand_vecop_qihi2 since all partial QI->HI vector modes expand
via ix86_expand_vecop_qihi_partial.
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi2):
Remove handling of V8QImode.
* config/i386/mmx.md (v<insn>v8qi3): Move from sse.md.
Call ix86_expand_vecop_qihi_partial. Enable for TARGET_MMX_WITH_SSE.
(v<insn>v4qi3): Ditto.
* config/i386/sse.md (v<insn>v8qi3): Remove.
gcc/testsuite/ChangeLog:
* gcc.target/i386/vect-shiftv4qi.c (dg-options):
Remove -ftree-vectorize.
* gcc.target/i386/vect-shiftv8qi.c (dg-options): Ditto.
* gcc.target/i386/vect-vshiftv4qi.c: New test.
* gcc.target/i386/vect-vshiftv8qi.c: New test.
|
|
Continuing the series of straightforward annotations, this one handles the normal (not widening or narrowing) vector shifts.
Tests included.
Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf.
gcc/ChangeLog:
PR target/99195
* config/aarch64/aarch64-simd.md (aarch64_simd_lshr<mode>): Rename to...
(aarch64_simd_lshr<mode><vczle><vczbe>): ... This.
(aarch64_simd_ashr<mode>): Rename to...
(aarch64_simd_ashr<mode><vczle><vczbe>): ... This.
(aarch64_simd_imm_shl<mode>): Rename to...
(aarch64_simd_imm_shl<mode><vczle><vczbe>): ... This.
(aarch64_simd_reg_sshl<mode>): Rename to...
(aarch64_simd_reg_sshl<mode><vczle><vczbe>): ... This.
(aarch64_simd_reg_shl<mode>_unsigned): Rename to...
(aarch64_simd_reg_shl<mode>_unsigned<vczle><vczbe>): ... This.
(aarch64_simd_reg_shl<mode>_signed): Rename to...
(aarch64_simd_reg_shl<mode>_signed<vczle><vczbe>): ... This.
(vec_shr_<mode>): Rename to...
(vec_shr_<mode><vczle><vczbe>): ... This.
(aarch64_<sur>shl<mode>): Rename to...
(aarch64_<sur>shl<mode><vczle><vczbe>): ... This.
(aarch64_<sur>q<r>shl<mode>): Rename to...
(aarch64_<sur>q<r>shl<mode><vczle><vczbe>): ... This.
gcc/testsuite/ChangeLog:
PR target/99195
* gcc.target/aarch64/simd/pr99195_1.c: Add testing for shifts.
* gcc.target/aarch64/simd/pr99195_6.c: Likewise.
* gcc.target/aarch64/simd/pr99195_8.c: New test.
|
|
The following dispatches to V2DImode CTOR expansion instead of
using sets of (subreg:DI (reg:V16QI 146) [08]) which causes
LRA to spill DImode and reload V16QImode. The same applies for
V8QImode or V4HImode construction from SImode parts which happens
during 32bit libgcc build.
PR target/109944
* config/i386/i386-expand.cc (ix86_expand_vector_init_general):
Perform final vector composition using
ix86_expand_vector_init_general instead of setting
the highpart and lowpart which causes spilling.
* gcc.target/i386/pr109944-1.c: New testcase.
* gcc.target/i386/pr109944-2.c: Likewise.
|
|
An obvious fix to make all enum naming consistent.
gcc/ChangeLog:
* config/riscv/riscv-protos.h (enum frm_field_enum): Add FRM_
prefix.
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
|
|
As the PR says we shouldn't be using qualifier_unsigned for the return type of the __ssat intrinsics.
UNSIGNED_SAT_BINOP_UNSIGNED_IMM_QUALIFIERS already exists for that.
This was just a thinko.
This patch fixes this and the warning with -Wconversion goes away.
Bootstrapped and tested on arm-none-linux-gnueabihf.
gcc/ChangeLog:
PR target/109939
* config/arm/arm-builtins.cc (SAT_BINOP_UNSIGNED_IMM_QUALIFIERS): Use
qualifier_none for the return operand.
gcc/testsuite/ChangeLog:
PR target/109939
* gcc.target/arm/pr109939.c: New test.
|
|
This patch adds mask-logic auto-vectorization, defining the patterns
as "define_insn_and_split" to allow the combine pass to easily combine
series of instructions.
For example:
combine vmxor.mm + vmnot.m into vmxnor.mm
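A hypothetical loop whose vectorized condition needs combined mask logic
(here a NOT applied to the AND of two comparison masks):
/* The two comparisons produce masks; combine can fuse the and/not pair
   into a single combined mask instruction.  */
void select (int *restrict r, int *restrict a, int *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    r[i] = (a[i] > 0 && b[i] > 0) ? 0 : 1;
}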
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
gcc/ChangeLog:
* config/riscv/autovec.md (<optab><mode>3): New pattern.
(one_cmpl<mode>2): Ditto.
(*<optab>not<mode>): Ditto.
(*n<optab><mode>): Ditto.
* config/riscv/riscv-v.cc (expand_vec_cmp_float): Change to
one_cmpl.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/cmp/vcond-4.c: New test.
* gcc.target/riscv/rvv/autovec/cmp/vcond_run-4.c: New test.
|
|
This patch enables RVV auto-vectorization, including floating-point
unordered and ordered comparisons.
The testcases are leveraged from Richard. So include Richard as co-author.
And this patch is the prerequisite patch for my current middle-end work.
Without this patch, I can't support len_mask_xxx middle-end pattern
since the mask is generated by comparison.
For example,
for (int i...; i < n.)
if (cond[i])
a[i] = b[i]
We need len_mask_load/len_mask_store for such code, and I am going to
support them in the middle-end after this patch is merged.
Both integer && floating (order and unorder) are tested.
built && regression passed.
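A floating-point comparison loop of the kind now supported (the mask for the
conditional assignment comes from the vectorized comparison):
void fcmp (float *restrict a, float *restrict b, int *restrict r, int n)
{
  for (int i = 0; i < n; i++)
    r[i] = a[i] < b[i] ? -1 : 0;
}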
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
Co-Authored-By: Richard Sandiford <richard.sandiford@arm.com>
gcc/ChangeLog:
* config/riscv/autovec.md (@vcond_mask_<mode><vm>): New pattern.
(vec_cmp<mode><vm>): New pattern.
(vec_cmpu<mode><vm>): New pattern.
(vcond<V:mode><VI:mode>): New pattern.
(vcondu<V:mode><VI:mode>): New pattern.
* config/riscv/riscv-protos.h (enum insn_type): Add new enum.
(emit_vlmax_merge_insn): New function.
(emit_vlmax_cmp_insn): Ditto.
(emit_vlmax_cmp_mu_insn): Ditto.
(expand_vec_cmp): Ditto.
(expand_vec_cmp_float): Ditto.
(expand_vcond): Ditto.
* config/riscv/riscv-v.cc (emit_vlmax_merge_insn): Ditto.
(emit_vlmax_cmp_insn): Ditto.
(emit_vlmax_cmp_mu_insn): Ditto.
(get_cmp_insn_code): Ditto.
(expand_vec_cmp): Ditto.
(expand_vec_cmp_float): Ditto.
(expand_vcond): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/rvv.exp:
* gcc.target/riscv/rvv/autovec/cmp/vcond-1.c: New test.
* gcc.target/riscv/rvv/autovec/cmp/vcond-2.c: New test.
* gcc.target/riscv/rvv/autovec/cmp/vcond-3.c: New test.
* gcc.target/riscv/rvv/autovec/cmp/vcond_run-1.c: New test.
* gcc.target/riscv/rvv/autovec/cmp/vcond_run-2.c: New test.
* gcc.target/riscv/rvv/autovec/cmp/vcond_run-3.c: New test.
|
|
This patch supports the RVV VREINTERPRET from vbool*_t to
vuint*m1_t. Aka:
vuint*m1_t __riscv_vreinterpret_x_x(vbool*_t);
These APIs help the users convert the vbool*_t vector to the LMUL=1
unsigned integer vuint*m1_t. According to the RVV intrinsic SPEC below,
the reinterpret intrinsics only change the types of the underlying contents.
https://github.com/riscv-non-isa/rvv-intrinsic-doc/blob/master/rvv-intrinsic-rfc.md#reinterpret-vbool-o-vintm1
For example, given below code.
vuint8m1_t test_vreinterpret_v_b1_vuint8m1 (vbool1_t src) {
return __riscv_vreinterpret_v_b1_u8m1 (src);
}
It will generate assembly code similar to the below:
vsetvli a5,zero,e8,m8,ta,ma
vlm.v v1,0(a1)
vs1r.v v1,0(a0)
ret
Please NOTE the test files don't cover all the possible combinations
of the intrinsic APIs introduced by this PATCH because there are too many.
This is the last PATCH for the reinterpret between the signed/unsigned
and the bool vector types.
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/genrvv-type-indexer.cc (main): Add
unsigned_eew*_lmul1_interpret for indexer.
* config/riscv/riscv-vector-builtins-functions.def (vreinterpret):
Register vuint*m1_t interpret function.
* config/riscv/riscv-vector-builtins-types.def (DEF_RVV_UNSIGNED_EEW8_LMUL1_INTERPRET_OPS):
New macro for vuint8m1_t.
(DEF_RVV_UNSIGNED_EEW16_LMUL1_INTERPRET_OPS): Likewise.
(DEF_RVV_UNSIGNED_EEW32_LMUL1_INTERPRET_OPS): Likewise.
(DEF_RVV_UNSIGNED_EEW64_LMUL1_INTERPRET_OPS): Likewise.
(vbool1_t): Add to unsigned_eew*_interpret_ops.
(vbool2_t): Likewise.
(vbool4_t): Likewise.
(vbool8_t): Likewise.
(vbool16_t): Likewise.
(vbool32_t): Likewise.
(vbool64_t): Likewise.
* config/riscv/riscv-vector-builtins.cc (DEF_RVV_UNSIGNED_EEW8_LMUL1_INTERPRET_OPS):
New macro for vuint*m1_t.
(DEF_RVV_UNSIGNED_EEW16_LMUL1_INTERPRET_OPS): Likewise.
(DEF_RVV_UNSIGNED_EEW32_LMUL1_INTERPRET_OPS): Likewise.
(DEF_RVV_UNSIGNED_EEW64_LMUL1_INTERPRET_OPS): Likewise.
(required_extensions_p): Add vuint*m1_t interpret case.
* config/riscv/riscv-vector-builtins.def (unsigned_eew8_lmul1_interpret):
Add vuint*m1_t interpret to base type.
(unsigned_eew16_lmul1_interpret): Likewise.
(unsigned_eew32_lmul1_interpret): Likewise.
(unsigned_eew64_lmul1_interpret): Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/misc_vreinterpret_vbool_vint.c:
Enrich test cases.
|
|
This patch supports the RVV VREINTERPRET from vbool*_t to
vint*m1_t. Aka:
vint*m1_t __riscv_vreinterpret_x_x(vbool*_t);
These APIs help the users convert the vbool*_t vector to the LMUL=1
signed integer vint*m1_t. According to the RVV intrinsic SPEC below,
the reinterpret intrinsics only change the types of the underlying contents.
https://github.com/riscv-non-isa/rvv-intrinsic-doc/blob/master/rvv-intrinsic-rfc.md#reinterpret-vbool-o-vintm1
For example, given below code.
vint8m1_t test_vreinterpret_v_b1_vint8m1 (vbool1_t src) {
return __riscv_vreinterpret_v_b1_i8m1 (src);
}
It will generate assembly code similar to the below:
vsetvli a5,zero,e8,m8,ta,ma
vlm.v v1,0(a1)
vs1r.v v1,0(a0)
ret
Please NOTE the test files don't cover all the possible combinations
of the intrinsic APIs introduced by this PATCH because there are too many.
The reinterpret from vbool*_t to vuint*m1_t with lmul=1 will be covered
in another PATCH.
Signed-off-by: Pan Li <pan2.li@intel.com>
gcc/ChangeLog:
* config/riscv/genrvv-type-indexer.cc (EEW_SIZE_LIST): New macro
for the eew size list.
(LMUL1_LOG2): New macro for the log2 value of lmul=1.
(main): Add signed_eew*_lmul1_interpret for indexer.
* config/riscv/riscv-vector-builtins-functions.def (vreinterpret):
Register vint*m1_t interpret function.
* config/riscv/riscv-vector-builtins-types.def (DEF_RVV_SIGNED_EEW8_LMUL1_INTERPRET_OPS):
New macro for vint8m1_t.
(DEF_RVV_SIGNED_EEW16_LMUL1_INTERPRET_OPS): Likewise.
(DEF_RVV_SIGNED_EEW32_LMUL1_INTERPRET_OPS): Likewise.
(DEF_RVV_SIGNED_EEW64_LMUL1_INTERPRET_OPS): Likewise.
(vbool1_t): Add to signed_eew*_interpret_ops.
(vbool2_t): Likewise.
(vbool4_t): Likewise.
(vbool8_t): Likewise.
(vbool16_t): Likewise.
(vbool32_t): Likewise.
(vbool64_t): Likewise.
* config/riscv/riscv-vector-builtins.cc (DEF_RVV_SIGNED_EEW8_LMUL1_INTERPRET_OPS):
New macro for vint*m1_t.
(DEF_RVV_SIGNED_EEW16_LMUL1_INTERPRET_OPS): Likewise.
(DEF_RVV_SIGNED_EEW32_LMUL1_INTERPRET_OPS): Likewise.
(DEF_RVV_SIGNED_EEW64_LMUL1_INTERPRET_OPS): Likewise.
(required_extensions_p): Add vint8m1_t interpret case.
* config/riscv/riscv-vector-builtins.def (signed_eew8_lmul1_interpret):
Add vint*m1_t interpret to base type.
(signed_eew16_lmul1_interpret): Likewise.
(signed_eew32_lmul1_interpret): Likewise.
(signed_eew64_lmul1_interpret): Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/misc_vreinterpret_vbool_vint.c:
Enrich the test cases.
|
|
To fix this issue, we separate the VL operand from the normal operands.
gcc/ChangeLog:
* config/riscv/autovec.md: Adjust for new interface.
* config/riscv/riscv-protos.h (emit_vlmax_insn): Add VL operand.
(emit_nonvlmax_insn): Add AVL operand.
* config/riscv/riscv-v.cc (emit_vlmax_insn): Add VL operand.
(emit_nonvlmax_insn): Add AVL operand.
(sew64_scalar_helper): Adjust for new interface.
(expand_tuple_move): Ditto.
* config/riscv/vector.md: Ditto.
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
|
|
This simple patch fixes the magic numbers; removing them makes the code
more reasonable.
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vec_series): Remove magic number.
(expand_const_vector): Ditto.
(legitimize_move): Ditto.
(sew64_scalar_helper): Ditto.
(expand_tuple_move): Ditto.
(expand_vector_init_insert_elems): Ditto.
* config/riscv/riscv.cc (vector_zero_call_used_regs): Ditto.
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
|
|
Also for 64-bit vector abs intrinsics _mm_abs_{pi8,pi16,pi32}.
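A minimal illustration of intrinsics now folded to a gimple ABS_EXPR
(these two require SSSE3):
#include <immintrin.h>
__m128i abs_epi32 (__m128i x) { return _mm_abs_epi32 (x); }
__m64   abs_pi16  (__m64 x)   { return _mm_abs_pi16 (x); }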
gcc/ChangeLog:
PR target/109900
* config/i386/i386.cc (ix86_gimple_fold_builtin): Fold
_mm{,256,512}_abs_{epi8,epi16,epi32,epi64} and
_mm_abs_{pi8,pi16,pi32} into gimple ABS_EXPR.
(ix86_masked_all_ones): Handle 64-bit mask.
* config/i386/i386-builtin.def: Replace icode of related
non-mask simd abs builtins with CODE_FOR_nothing.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr109900.c: New test.
|
|
By making use of the 'addsub_operator' added in the last patch.
gcc/ChangeLog:
* config/xtensa/xtensa.md (*addsubx): Rename from '*addx',
and change to also accept '*subx' pattern.
(*subx): Remove.
|
|
This patch removes one machine instruction from the "single-bit extraction
with shifting" operation, and tries to eliminate the conditional
branch if CST2_POW2 doesn't fit into signed 12 bits, with the help
of the ifcvt optimization.
/* example #1 */
int test0(int x) {
return (x & 1048576) != 0 ? 1024 : 0;
}
extern int foo(void);
int test1(void) {
return (foo() & 1048576) != 0 ? 16777216 : 0;
}
;; before
test0:
movi a9, 0x400
srai a2, a2, 10
and a2, a2, a9
ret.n
test1:
addi sp, sp, -16
s32i.n a0, sp, 12
call0 foo
extui a2, a2, 20, 1
slli a2, a2, 20
beqz.n a2, .L2
movi.n a2, 1
slli a2, a2, 24
.L2:
l32i.n a0, sp, 12
addi sp, sp, 16
ret.n
;; after
test0:
extui a2, a2, 20, 1
slli a2, a2, 10
ret.n
test1:
addi sp, sp, -16
s32i.n a0, sp, 12
call0 foo
l32i.n a0, sp, 12
extui a2, a2, 20, 1
slli a2, a2, 24
addi sp, sp, 16
ret.n
In addition, if the left shift amount ('exact_log2(CST2_POW2)') is
between 1 and 3 and either an addition or a subtraction with another
register follows, emit an ADDX[248] or SUBX[248] machine instruction
instead of separate left-shift and add/subtract ones.
/* example #2 */
int test2(int x, int y) {
return ((x & 1048576) != 0 ? 4 : 0) + y;
}
int test3(int x, int y) {
return ((x & 2) != 0 ? 8 : 0) - y;
}
;; before
test2:
movi.n a9, 4
srai a2, a2, 18
and a2, a2, a9
add.n a2, a2, a3
ret.n
test3:
movi.n a9, 8
slli a2, a2, 2
and a2, a2, a9
sub a2, a2, a3
ret.n
;; after
test2:
extui a2, a2, 20, 1
addx4 a2, a2, a3
ret.n
test3:
extui a2, a2, 1, 1
subx8 a2, a2, a3
ret.n
gcc/ChangeLog:
* config/xtensa/predicates.md (addsub_operator): New.
* config/xtensa/xtensa.md (*extzvsi-1bit_ashlsi3,
*extzvsi-1bit_addsubx): New insn_and_split patterns.
* config/xtensa/xtensa.cc (xtensa_rtx_costs):
Add a special case about ifcvt 'noce_try_cmove()' to handle
constant loads that do not fit into signed 12 bits in the
patterns added above.
|
|
Some miscomputation of rtx_costs led to sub-optimal code for
single-bit insertions. This patch implements TARGET_INSN_COST,
which has a chance to see the whole insn during insn combination;
in particular the SET_DEST of (set (zero_extract (...) ...)).
gcc/
* config/avr/avr.cc (avr_insn_cost): New static function.
(TARGET_INSN_COST): Define to that function.
|
|
The following also accounts for a GPR->XMM move cost for splat
operations and properly guards eliding the cost when moving from
memory only for SSE4.1 or HImode or larger operands. This
doesn't fix the PR fully yet.
PR target/109944
* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
For vector construction or splats apply GPR->XMM move
costing. QImode memory can be handled directly only
with SSE4.1 pinsrb.
|
|
Add V8QImode and V4QImode vector shift patterns that call into
ix86_expand_vecop_qihi_partial. Generate special sequences
for constant count operands.
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_vecop_qihi_partial):
Call ix86_expand_vec_shift_qihi_constant for shifts
with constant count operand.
* config/i386/i386.cc (ix86_shift_rotate_cost):
Handle V4QImode and V8QImode.
* config/i386/mmx.md (<insn>v8qi3): New insn pattern.
(<insn>v4qi3): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/vect-shiftv4qi.c: New test.
* gcc.target/i386/vect-shiftv8qi.c: New test.
|
|
I just noticed the warning:
../../../riscv-gcc/gcc/config/riscv/vector.md:618:1: warning: source
missing a mode?
gcc/ChangeLog:
* config/riscv/vector.md: Add mode.
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
|
|
At -O2, and so with SLP vectorisation enabled:
struct complx_t { float re, im; };
complx_t add(complx_t a, complx_t b) {
return {a.re + b.re, a.im + b.im};
}
generates:
fmov w3, s1
fmov x0, d0
fmov x1, d2
fmov w2, s3
bfi x0, x3, 32, 32
fmov d31, x0
bfi x1, x2, 32, 32
fmov d30, x1
fadd v31.2s, v31.2s, v30.2s
fmov x1, d31
lsr x0, x1, 32
fmov s1, w0
lsr w0, w1, 0
fmov s0, w0
ret
This is because complx_t is passed and returned in FPRs, but GCC gives
it DImode. We therefore “need” to assemble a DImode pseudo from the
two individual floats, bitcast it to a vector, do the arithmetic,
bitcast it back to a DImode pseudo, then extract the individual floats.
There are many problems here. The most basic is that we shouldn't
use SLP for such a trivial example. But SLP should in principle be
beneficial for more complicated examples, so preventing SLP for the
example above just changes the reproducer needed. A more fundamental
problem is that it doesn't make sense to use single DImode pseudos in a
testcase like this. I have a WIP patch to allow re and im to be stored
in individual SFmode pseudos instead, but it's quite an invasive change
and might end up going nowhere.
A simpler problem to tackle is that we allow DImode pseudos to be stored
in FPRs, but we don't provide any patterns for inserting values into
them, even though INS makes that easy for element-like insertions.
This patch adds some patterns for that.
Doing that showed that aarch64_modes_tieable_p was too strict:
it didn't allow SFmode and DImode values to be tied, even though
both of them occupy a single GPR and FPR, and even though we allow
both classes to change between the modes.
The *aarch64_bfidi<ALLX:mode>_subreg_<SUBDI_BITS> pattern is
especially ugly, but it's not clear what target-independent
code ought to simplify it to, if it was going to simplify it.
We should probably do the same thing for extractions, but that's left
as future work.
After the patch we generate:
ins v0.s[1], v1.s[0]
ins v2.s[1], v3.s[0]
fadd v0.2s, v0.2s, v2.2s
fmov x0, d0
ushr d1, d0, 32
lsr w0, w0, 0
fmov s0, w0
ret
which seems like a step in the right direction.
All in all, there's nothing elegant about this patch. It just
seems like the least worst option.
gcc/
PR target/109632
* config/aarch64/aarch64.cc (aarch64_modes_tieable_p): Allow
subregs between any scalars that are 64 bits or smaller.
* config/aarch64/iterators.md (SUBDI_BITS): New int iterator.
(bits_etype): New int attribute.
* config/aarch64/aarch64.md (*insv_reg<mode>_<SUBDI_BITS>)
(*aarch64_bfi<GPI:mode><ALLX:mode>_<SUBDI_BITS>): New patterns.
(*aarch64_bfidi<ALLX:mode>_subreg_<SUBDI_BITS>): Likewise.
gcc/testsuite/
* gcc.target/aarch64/ins_bitfield_1.c: New test.
* gcc.target/aarch64/ins_bitfield_2.c: Likewise.
* gcc.target/aarch64/ins_bitfield_3.c: Likewise.
* gcc.target/aarch64/ins_bitfield_4.c: Likewise.
* gcc.target/aarch64/ins_bitfield_5.c: Likewise.
* gcc.target/aarch64/ins_bitfield_6.c: Likewise.
|
|
This patch refactors the framework of RVV auto-vectorization.
We found that we keep adding helpers && wrappers when implementing
auto-vectorization, which makes the RVV auto-vectorization very messy.
After double-checking my downstream RVV GCC and assembling all the
auto-vectorization patterns we are going to have, I refactored the
RVV framework based on that information to make it easier and more
flexible for future use.
For example, we will definitely implement len_mask_load/len_mask_store
patterns which have both length && mask operands and use an undefined merge
operand. len_cond_div or cond_div will have a length or mask operand and use
a real merge operand instead of an undefined merge operand.
Also, some patterns will use tail undisturbed and mask any, etc.
We will definitely have various features.
Based on these circumstances, we add the following private members:
int m_op_num;
/* It's true when the pattern has a dest operand. Most of the patterns have
a dest operand, whereas some patterns like STOREs do not have a dest
operand.
*/
bool m_has_dest_p;
bool m_fully_unmasked_p;
bool m_use_real_merge_p;
bool m_has_avl_p;
bool m_vlmax_p;
bool m_has_tail_policy_p;
bool m_has_mask_policy_p;
enum tail_policy m_tail_policy;
enum mask_policy m_mask_policy;
machine_mode m_dest_mode;
machine_mode m_mask_mode;
I believe these variables can cover all potential situations.
The instruction generator wrapper is "emit_insn", which adds operands and
emits the instruction according to the variables mentioned above.
After this is done, we can easily add helpers without changing the base
class "insn_expander".
Currently, we have "emit_vlmax_tany_many" and "emit_nonvlmax_tany_many".
For example, when we want to emit a binary operation, we just use
emit_vlmax_tany_many (...RVV_BINOP_NUM...)
So, if we support ternary operations in the future, it's quite simple:
emit_vlmax_tany_many (...RVV_BINOP_NUM...)
"*_tany_many" means we are using tail any and mask any.
We will definitely need tail undisturbed or mask undisturbed when we
support these patterns
in middle-end. It's very simple to extend such helper base on current
framework:
we can do that in the future like this:
void
emit_nonvlmax_tu_mu (unsigned icode, int op_num, rtx *ops)
{
machine_mode data_mode = GET_MODE (ops[0]);
machine_mode mask_mode = get_mask_mode (data_mode).require ();
/* The number = 11 is because we have maximum 11 operands for
RVV instruction patterns according to vector.md. */
insn_expander<11> e (/*OP_NUM*/ op_num,
/*HAS_DEST_P*/ true,
/*USE_ALL_TRUES_MASK_P*/ true,
/*USE_UNDEF_MERGE_P*/ true,
/*HAS_AVL_P*/ true,
/*VLMAX_P*/ false,
/*HAS_TAIL_POLICY_P*/ true,
/*HAS_MASK_POLICY_P*/ true,
/*TAIL_POLICY*/ TAIL_UNDISTURBED,
/*MASK_POLICY*/ MASK_UNDISTURBED,
/*DEST_MODE*/ data_mode,
/*MASK_MODE*/ mask_mode);
e.emit_insn ((enum insn_code) icode, ops);
}
That's enough (I have tested it fully in my downstream RVV GCC).
I didn't add it in this patch.
Thanks.
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
gcc/ChangeLog:
* config/riscv/autovec.md: Refactor the framework of RVV auto-vectorization.
* config/riscv/riscv-protos.h (RVV_MISC_OP_NUM): Ditto.
(RVV_UNOP_NUM): New macro.
(RVV_BINOP_NUM): Ditto.
(legitimize_move): Refactor the framework of RVV auto-vectorization.
(emit_vlmax_op): Ditto.
(emit_vlmax_reg_op): Ditto.
(emit_len_op): Ditto.
(emit_len_binop): Ditto.
(emit_vlmax_tany_many): Ditto.
(emit_nonvlmax_tany_many): Ditto.
(sew64_scalar_helper): Ditto.
(expand_tuple_move): Ditto.
* config/riscv/riscv-v.cc (emit_pred_op): Ditto.
(emit_pred_binop): Ditto.
(emit_vlmax_op): Ditto.
(emit_vlmax_tany_many): New function.
(emit_len_op): Remove.
(emit_nonvlmax_tany_many): New function.
(emit_vlmax_reg_op): Remove.
(emit_len_binop): Ditto.
(emit_index_op): Ditto.
(expand_vec_series): Refactor the framework of RVV auto-vectorization.
(expand_const_vector): Ditto.
(legitimize_move): Ditto.
(sew64_scalar_helper): Ditto.
(expand_tuple_move): Ditto.
(expand_vector_init_insert_elems): Ditto.
* config/riscv/riscv.cc (vector_zero_call_used_regs): Ditto.
* config/riscv/vector.md: Ditto.
|
|
In this PR we ICE because the substituted pattern for mla "lost" its predicate and constraint for operand 0
because the define_subst template:
[(set (match_operand:<VDBL> 0)
(vec_concat:<VDBL>
(match_dup 1)
(match_operand:VDZ 2 "aarch64_simd_or_scalar_imm_zero")))])
Uses match_operand instead of match_dup for operand 0. We can't use match_dup 0 for it because we need to specify the widened mode.
The problem is fixed by adding a "register_operand" predicate and "=w" constraint to the match_operand.
This makes sense conceptually too as the transformation we're targeting only applies to instructions that write a "w" register.
With this change the mddump pattern that ICEs goes from:
(define_insn ("aarch64_mlav4hi_vec_concatz_le")
[
(set (match_operand:V8HI 0 ("") ("")) <<------ Missing constraint!
(vec_concat:V8HI (plus:V4HI (mult:V4HI (match_operand:V4HI 2 ("register_operand") ("w"))
(match_operand:V4HI 3 ("register_operand") ("w")))
(match_operand:V4HI 1 ("register_operand") ("0")))
(match_operand:V4HI 4 ("aarch64_simd_or_scalar_imm_zero") (""))))
] ("(!BYTES_BIG_ENDIAN) && (TARGET_SIMD)") ("mla\t%0.4h, %2.4h, %3.4h")
to the proper:
(define_insn ("aarch64_mlav4hi_vec_concatz_le")
[
(set (match_operand:V8HI 0 ("register_operand") ("=w")) <<-------- Constraint in the right place
(vec_concat:V8HI (plus:V4HI (mult:V4HI (match_operand:V4HI 2 ("register_operand") ("w"))
(match_operand:V4HI 3 ("register_operand") ("w")))
(match_operand:V4HI 1 ("register_operand") ("0")))
(match_operand:V4HI 4 ("aarch64_simd_or_scalar_imm_zero") (""))))
] ("(!BYTES_BIG_ENDIAN) && (TARGET_SIMD)") ("mla\t%0.4h, %2.4h, %3.4h")
This seems to do the right thing for multi-alternative patterns as well; the annotated pattern for aarch64_cmltv8qi is:
(define_insn ("aarch64_cmltv8qi")
[
(set (match_operand:V8QI 0 ("register_operand") ("=w,w"))
(neg:V8QI (lt:V8QI (match_operand:V8QI 1 ("register_operand") ("w,w"))
(match_operand:V8QI 2 ("aarch64_simd_reg_or_zero") ("w,ZDz")))))
]
whereas the substituted version now looks like:
(define_insn ("aarch64_cmltv8qi_vec_concatz_le")
[
(set (match_operand:V16QI 0 ("register_operand") ("=w,w"))
(vec_concat:V16QI (neg:V8QI (lt:V8QI (match_operand:V8QI 1 ("register_operand") ("w,w"))
(match_operand:V8QI 2 ("aarch64_simd_reg_or_zero") ("w,ZDz"))))
(match_operand:V8QI 3 ("aarch64_simd_or_scalar_imm_zero") (""))))
]
Bootstrapped and tested on aarch64-none-linux-gnu.
gcc/ChangeLog:
PR target/109855
* config/aarch64/aarch64-simd.md (add_vec_concat_subst_le): Add predicate
and constraint for operand 0.
(add_vec_concat_subst_be): Likewise.
gcc/testsuite/ChangeLog:
PR target/109855
* gcc.target/aarch64/pr109855.c: New test.
|
|
The returned integer vector mode costs of emulated instructions in
ix86_shift_rotate_cost are wrong and do not reflect the generated
instruction sequences. Rewrite the handling of different integer vector
modes and different target ABIs to return real instruction
counts in order to calculate better costs for the various emulated modes.
Also add the cost of a memory read, when the instruction in the
sequence reads memory.
gcc/ChangeLog:
* config/i386/i386.cc (ix86_shift_rotate_cost): Correct
calculation of integer vector mode costs to reflect generated
instruction sequences of different integer vector modes and
different target ABIs. Remove "speed" function argument.
(ix86_rtx_costs): Update call for removed function argument.
(ix86_vector_costs::add_stmt_cost): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/sse2-shiftqihi-constant-1.c: Remove XFAILs.
|
|
Add the cost of a memory read to the cost of V*QImode vector mult sequences.
gcc/ChangeLog:
* config/i386/i386.cc (ix86_multiplication_cost): Add
the cost of a memory read to the cost of V?QImode sequences.
|
|
The current framework is hard to maintain and hard to use for the possible
future auto-vectorization patterns: we would need to keep adding more helpers
and arguments while supporting auto-vectorization. We should refactor the
framework now, for future use, since we don't support too many
auto-vectorization patterns yet.
Start with this simple patch: it adds an "m_" prefix to the private members.
gcc/ChangeLog:
* config/riscv/riscv-v.cc: Add "m_" prefix.
Signed-off-by: Juzhe-Zhong <juzhe.zhong@rivai.ai>
|