path: root/gcc/config
Age  Commit message  Author  Files/Lines
2023-06-13  RISC-V: Enhance RVV VLA SLP auto-vectorization with decompress operation  Juzhe-Zhong  (1 file, -0/+111)
According to RVV ISA: https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc We can enhance VLA SLP auto-vectorization with (16.5.1. Synthesizing vdecompress) Decompress operation. Case 1 (nunits = POLY_INT_CST [16, 16]): _48 = VEC_PERM_EXPR <_37, _35, { 0, POLY_INT_CST [16, 16], 1, POLY_INT_CST [17, 16], 2, POLY_INT_CST [18, 16], ... }>; We can optimize such VLA SLP permuation pattern into: _48 = vdecompress (_37, _35, mask = { 0, 1, 0, 1, ... }; Case 2 (nunits = POLY_INT_CST [16, 16]): _23 = VEC_PERM_EXPR <_46, _44, { POLY_INT_CST [1, 1], POLY_INT_CST [3, 3], POLY_INT_CST [2, 1], POLY_INT_CST [4, 3], POLY_INT_CST [3, 1], POLY_INT_CST [5, 3], ... }>; We can optimize such VLA SLP permuation pattern into: _48 = vdecompress (slidedown(_46, 1/2 nunits), slidedown(_44, 1/2 nunits), mask = { 0, 1, 0, 1, ... }; For example: void __attribute__ ((noinline, noclone)) vec_slp (uint64_t *restrict a, uint64_t b, uint64_t c, int n) { for (int i = 0; i < n; ++i) { a[i * 2] += b; a[i * 2 + 1] += c; } } ASM: ... vid.v v0 vand.vi v0,v0,1 vmseq.vi v0,v0,1 ===> mask = { 0, 1, 0, 1, ... } vdecompress: viota.m v3,v0 vrgather.vv v2,v1,v3,v0.t Loop: vsetvli zero,a5,e64,m1,ta,ma vle64.v v1,0(a0) vsetvli a6,zero,e64,m1,ta,ma vadd.vv v1,v2,v1 vsetvli zero,a5,e64,m1,ta,ma mv a5,a3 vse64.v v1,0(a0) add a3,a3,a1 add a0,a0,a2 bgtu a5,a4,.L4 gcc/ChangeLog: * config/riscv/riscv-v.cc (emit_vlmax_decompress_insn): New function. (shuffle_decompress_patterns): New function. (expand_vec_perm_const_1): Add decompress optimization. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/partial/slp-8.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp-9.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp_run-8.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp_run-9.c: New test.
2023-06-12  [aarch64] Improve code-gen for vector initialization with single constant element.  Prathamesh Kulkarni  (1 file, -8/+30)
gcc/ChangeLog: * config/aarch64/aarch64.cc (aarch64_expand_vector_init): Tweak condition if (n_var == n_elts && n_elts <= 16) to allow a single constant, and if maxv == 1, use constant element for duplicating into register. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vec-init-single-const.c: New test. * gcc.target/aarch64/vec-init-single-const-be.c: Likewise. * gcc.target/aarch64/vec-init-single-const-2.c: Likewise.
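A minimal illustration of the kind of initializer this change targets (a hypothetical example, not the committed testcase): with exactly one constant element, the constant can be broadcast into a register first and the variable lanes inserted afterwards.

#include <arm_neon.h>

/* Hypothetical example: three variable elements and one constant element.
   With this change the constant (1) can be duplicated into a register and
   the variable lanes inserted, instead of going through memory.  */
int32x4_t
make_vec (int32_t a, int32_t b, int32_t c)
{
  return (int32x4_t) { a, b, c, 1 };
}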
2023-06-12  RISC-V: Support RVV FP16 MISC vget/vset intrinsic API  Pan Li  (1 file, -0/+3)
This patch supports the intrinsic API of FP16 ZVFHMIN vget/vset. From the user's perspective, it is reasonable to do some get/set operations for the vfloat16*_t types when only ZVFHMIN is enabled. Signed-off-by: Pan Li <pan2.li@intel.com> gcc/ChangeLog: * config/riscv/riscv-vector-builtins-types.def (vfloat16m1_t): Add type to lmul1 ops. (vfloat16m2_t): Likewise. (vfloat16m4_t): Likewise. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/base/zvfh-over-zvfhmin.c: Add new test cases. * gcc.target/riscv/rvv/base/zvfhmin-intrinsic.c: Likewise.
2023-06-12  RISC-V: Add RVV narrow shift right lowering auto-vectorization  Juzhe-Zhong  (2 files, -14/+75)
Optimize the following auto-vectorization codes: void foo (int16_t * __restrict a, int32_t * __restrict b, int32_t c, int n) { for (int i = 0; i < n; i++) a[i] = b[i] >> c; } Before this patch: foo: ble a3,zero,.L5 .L3: vsetvli a5,a3,e32,m1,ta,ma vle32.v v1,0(a1) vsetvli a4,zero,e32,m1,ta,ma vsra.vx v1,v1,a2 vsetvli zero,zero,e16,mf2,ta,ma slli a7,a5,2 vncvt.x.x.w v1,v1 slli a6,a5,1 vsetvli zero,a5,e16,mf2,ta,ma sub a3,a3,a5 vse16.v v1,0(a0) add a1,a1,a7 add a0,a0,a6 bne a3,zero,.L3 .L5: ret After this patch: foo: ble a3,zero,.L5 .L3: vsetvli a5,a3,e32,m1,ta,ma vle32.v v1,0(a1) vsetvli a7,zero,e16,mf2,ta,ma slli a6,a5,2 vnsra.wx v1,v1,a2 slli a4,a5,1 vsetvli zero,a5,e16,mf2,ta,ma sub a3,a3,a5 vse16.v v1,0(a0) add a1,a1,a6 add a0,a0,a4 bne a3,zero,.L3 .L5: ret gcc/ChangeLog: * config/riscv/autovec-opt.md (*v<any_shiftrt:optab><any_extend:optab>trunc<mode>): New pattern. (*<any_shiftrt:optab>trunc<mode>): Ditto. * config/riscv/autovec.md (<optab><mode>3): Change to define_insn_and_split. (v<optab><mode>3): Ditto. (trunc<mode><v_double_trunc>2): Ditto. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/binop/narrow-1.c: New test. * gcc.target/riscv/rvv/autovec/binop/narrow-2.c: New test. * gcc.target/riscv/rvv/autovec/binop/narrow-3.c: New test. * gcc.target/riscv/rvv/autovec/binop/narrow_run-1.c: New test. * gcc.target/riscv/rvv/autovec/binop/narrow_run-2.c: New test. * gcc.target/riscv/rvv/autovec/binop/narrow_run-3.c: New test.
2023-06-12  Add missing vec_pack/unpacks patterns for _Float16 <-> int/float conversion.  liuhongt  (1 file, -9/+207)
This patch only supports optabs for vector modes whose length is >= 128 bits. 32/64-bit vectors are instead handled by the BB vectorizer with truncmn2/extendmn2/fix{,uns}_truncmn2. gcc/ChangeLog: * config/i386/sse.md (vec_pack<floatprefix>_float_<mode>): New expander. (vec_unpack_<fixprefix>fix_trunc_lo_<mode>): Ditto. (vec_unpack_<fixprefix>fix_trunc_hi_<mode>): Ditto. (vec_unpacks_lo_<mode>): Ditto. (vec_unpacks_hi_<mode>): Ditto. (sse_movlhps_<mode>): New define_insn. (ssse3_palignr<mode>_perm): Extend to V_128H. (V_128H): New mode iterator. (ssepackPHmode): New mode attribute. (vunpck_extract_mode): Ditto. (vpckfloat_concat_mode): Extend to VxSI/VxSF for _Float16. (vpckfloat_temp_mode): Ditto. (vpckfloat_op_mode): Ditto. (vunpckfixt_mode): Extend to VxHF. (vunpckfixt_model): Ditto. (vunpckfixt_extract_mode): Ditto. gcc/testsuite/ChangeLog: * gcc.target/i386/vec_pack_fp16-1.c: New test. * gcc.target/i386/vec_pack_fp16-2.c: New test. * gcc.target/i386/vec_pack_fp16-3.c: New test.
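A sketch of the kind of loops these optabs let the loop vectorizer handle (hypothetical code, not the committed vec_pack_fp16-*.c tests; the -mavx512fp16 option is an assumption):

/* Hypothetical example: conversions between int and _Float16 that need the
   vec_pack/vec_unpack conversion optabs once vectorized with 128-bit or
   wider vectors (e.g. -O3 -mavx512fp16).  */
void
int_to_fp16 (_Float16 *restrict dst, const int *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (_Float16) src[i];   /* int -> _Float16: pack-float direction  */
}

void
fp16_to_int (int *restrict dst, const _Float16 *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (int) src[i];        /* _Float16 -> int: unpack-fix-trunc direction  */
}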
2023-06-12  rs6000: Guard __builtin_{un,}pack_vector_int128 with vsx [PR109932]  Kewen Lin  (1 file, -7/+7)
As PR109932 shows, builtins __builtin_{un,}pack_vector_int128 should be guarded under vsx rather than power7, as their corresponding bif patterns have the conditions TARGET_VSX and VECTOR_MEM_ALTIVEC_OR_VSX_P (V1TImode). This patch is to move __builtin_{un,}pack_vector_int128 to stanza vsx to ensure their supports. PR target/109932 gcc/ChangeLog: * config/rs6000/rs6000-builtins.def (__builtin_pack_vector_int128, __builtin_unpack_vector_int128): Move from stanza power7 to vsx. gcc/testsuite/ChangeLog: * gcc.target/powerpc/pr109932-1.c: New test. * gcc.target/powerpc/pr109932-2.c: New test.
2023-06-12  rs6000: Don't use TFmode for 128 bits fp constant in toc [PR110011]  Kewen Lin  (1 file, -1/+1)
As PR110011 shows, when encoding 128 bits fp constant into toc, we adopts REAL_VALUE_TO_TARGET_LONG_DOUBLE which is to find the first float mode with LONG_DOUBLE_TYPE_SIZE bits of precision, it would be TFmode here. But the 128 bits fp constant can be with mode IFmode or KFmode, which doesn't necessarily have the same underlying float format as the one of TFmode, like this PR exposes, with option -mabi=ibmlongdouble TFmode has ibm_extended_format while KFmode has ieee_quad_format, mixing up the formats (the encoding/decoding ways) would cause unexpected results. This patch is to make it use constant's own mode instead of TFmode for real_to_target call. PR target/110011 gcc/ChangeLog: * config/rs6000/rs6000.cc (output_toc): Use the mode of the 128-bit floating constant itself for real_to_target call. gcc/testsuite/ChangeLog: * gcc.target/powerpc/pr110011.c: New test.
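A hedged sketch of the situation being fixed (hypothetical code, not the pr110011.c testcase; the 'q' literal suffix and option are assumptions):

/* Hypothetical reduction: with -mabi=ibmlongdouble, long double (TFmode)
   uses IBM double-double, while __float128 (KFmode) is IEEE binary128.
   Encoding this constant into the TOC using TFmode's format would produce
   wrong bits; the fix encodes it with the constant's own mode.  */
__float128
get_constant (void)
{
  return 1.2345678901234567890123456789012345q;
}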
2023-06-12  RISC-V: Support RVV FP16 MISC vlmul ext intrinsic API  Pan Li  (1 file, -0/+15)
This patch supports the intrinsic API of FP16 ZVFHMIN vlmul ext, i.e. conversions between the different vfloat16*_t LMUL types. From the user's perspective, it is reasonable to do such type conversions between vfloat16*_t types when only ZVFHMIN is enabled. Signed-off-by: Pan Li <pan2.li@intel.com> gcc/ChangeLog: * config/riscv/riscv-vector-builtins-types.def (vfloat16mf4_t): Add type to X2/X4/X8/X16/X32 vlmul ext ops. (vfloat16mf2_t): Ditto. (vfloat16m1_t): Ditto. (vfloat16m2_t): Ditto. (vfloat16m4_t): Ditto. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/base/zvfh-over-zvfhmin.c: Add new test cases. * gcc.target/riscv/rvv/base/zvfhmin-intrinsic.c: Add new test cases.
2023-06-11  aix: Debugging does not require a stack frame.  David Edelsohn  (1 file, -3/+0)
The rs6000 port has allocated a stack frame when debugging is enabled on AIX since the earliest versions of the port. Apparently the earliest versions of the debuggers for AIX had difficulty with stackless frames. Both AIX DBX and GDB support stackless frames on AIX, and IBM XLC, OpenXL and LLVM for AIX do not generate an extraneous stack frame when debugging is enabled. This patch updates the rs6000 stack info function to not set the stack frame flag when debugging is enabled for AIX. gcc/ChangeLog: * config/rs6000/rs6000-logue.cc (rs6000_stack_info): Do not require a stack frame when debugging is enabled for AIX. Signed-off-by: David Edelsohn <dje.gcc@gmail.com>
2023-06-11  Use canonical form for reversed single-bit insertions after reload.  Georg-Johann Lay  (3 files, -111/+41)
We now split almost all insns after reload in order to add clobber of REG_CC. If insns are coming from insn combiner and there is no canonical form for the respective arithmetic (like for reversed bit insertions), there is no need to keep all these different representations after reload: Instead of splitting such patterns to their clobber-REG_CC-analogon, we can split to a canonical representation, which is insv_notbit for the present case. This is a no-op change. gcc/ * config/avr/avr.md (adjust_len) [insv_notbit_0, insv_notbit_7]: Remove attribute values. (insv_notbit): New post-reload insn. (*insv.not-shiftrt_split, *insv.xor1-bit.0_split) (*insv.not-bit.0_split, *insv.not-bit.7_split) (*insv.xor-extract_split): Split to insv_notbit. (*insv.not-shiftrt, *insv.xor1-bit.0, *insv.not-bit.0, *insv.not-bit.7) (*insv.xor-extract): Remove post-reload insns. * config/avr/avr.cc (avr_out_insert_notbit) [bitno]: Remove parameter. (avr_adjust_insn_length): Adjust call of avr_out_insert_notbit. [ADJUST_LEN_INSV_NOTBIT_0, ADJUST_LEN_INSV_NOTBIT_7]: Remove cases. * config/avr/avr-protos.h (avr_out_insert_notbit): Adjust prototype.
2023-06-11  target/109907: Overhaul bit extractions.  Georg-Johann Lay  (5 files, -114/+519)
o Logical right shift that shifts the MSB to position 0 can be performed in such a way that the input operand constraint can be relaxed from "0" to "r". This results in less register pressure. Moreover, no scratch register is required in that case. o The deprecated "extzv" pattern is replaced by "extzv<mode>" that allows inputs of scalar integer modes of different sizes (1 up to 4 bytes). o Existing patterns are adjusted to the more generic "extzv<mode>" pattern. Some patterns are added as the middle-end has been reworked to spot more bit-extraction opportunities. o A C function is used to print the asm for bit extractions, which is more convenient for complex output logic. The generated code is still not optimal because RTL optimizers might still prefer arithmetic like shift over bit-extractions. For test cases see also PR36884 and PR55181. gcc/ PR target/109907 * config/avr/avr.md (adjust_len) [extr, extr_not]: New elements. (MSB, SIZE): New mode attributes. (any_shift): New code iterator. (*lshr<mode>3_split, *lshr<mode>3, lshr<mode>3) (*lshr<mode>3_const_split): Add constraint alternative for the case of shift-offset = MSB. Ditch "length" attribute. (extzv<mode): New. replaces extzv. Adjust following patterns. Use avr_out_extr, avr_out_extr_not to print asm. (*extzv.subreg.<mode>, *extzv.<mode>.subreg, *extzv.xor) (*extzv<mode>.ge, *neg.ashiftrt<mode>.msb, *extzv.io.lsr7): New. * config/avr/constraints.md (C15, C23, C31, Yil): New * config/avr/predicates.md (reg_or_low_io_operand) (const7_operand, reg_or_low_io_operand) (const15_operand, const_0_to_15_operand) (const23_operand, const_0_to_23_operand) (const31_operand, const_0_to_31_operand): New. * config/avr/avr-protos.h (avr_out_extr, avr_out_extr_not): New. * config/avr/avr.cc (avr_out_extr, avr_out_extr_not): New funcs. (lshrqi3_out, lshrhi3_out, lshrpsi3_out, lshrsi3_out): Adjust MSB case to new insn constraint "r" for operands[1]. (avr_adjust_insn_length) [ADJUST_LEN_EXTR_NOT, ADJUST_LEN_EXTR]: Handle these cases. (avr_rtx_costs_1): Adjust cost for a new pattern. gcc/testsuite/ PR target/109907 * gcc.target/avr/pr109907.c: New test. * gcc.target/avr/torture/pr109907-1.c: New test. * gcc.target/avr/torture/pr109907-2.c: New test.
2023-06-11  RISC-V: Rework Phase 5 && Phase 6 of VSETVL PASS  Juzhe-Zhong  (2 files, -150/+288)
Address comments from Jeff. This patch is to rework Phase 5 && Phase 6 of VSETVL PASS since Phase 5 && Phase 6 are quite messy and cause some bugs discovered by my downstream auto-vectorization test-generator. Before this patch. Phase 5 is cleanup_insns is the function remove AVL operand dependency from each RVV instruction. E.g. vadd.vv (use a5), after Phase 5, ====> vadd.vv (use const_int 0). Since "a5" is used in "vsetvl" instructions and after the correct "vsetvl" instructions are inserted, each RVV instruction doesn't need AVL operand "a5" anymore. Then, we remove this operand dependency helps for the following scheduling PASS. Phase 6 is propagate_avl do the following 2 things: 1. Local && Global user vsetvl instructions optimization. E.g. vsetvli a2, a2, e8, mf8 ======> Change it into vsetvli a2, a2, e32, mf2 vsetvli zero,a2, e32, mf2 ======> eliminate 2. Optimize user vsetvl from "vsetvl a2,a2" into "vsetvl zero,a2" if "a2" is not used by any instructions. Since from Phase 1 ~ Phase 4 which inserts "vsetvli" instructions base on LCM which change the CFG, I re-new a new RTL_SSA framework (which is more expensive than just using DF) for Phase 6 and optmize user vsetvli base on the new RTL_SSA. There are 2 issues in Phase 5 && Phase 6: 1. local_eliminate_vsetvl_insn was introduced by @kito which can do better local user vsetvl optimizations better than Phase 6 do, such approach doesn't need to re-new the RTL_SSA framework. So the local user vsetvli instructions optimizaiton in Phase 6 is redundant and should be removed. 2. A bug discovered by my downstream auto-vectorization test-generator (I can't put the test in this patch since we are missing autovec patterns for it so we can't use the upstream GCC directly reproduce such issue but I will remember put it back after I support the necessary autovec patterns). Such bug is causing by using RTL_SSA re-new framework. The issue description is this: Before Phase 6: ... insn1: vsetlvi a3, 17 <========== generated by SELECT_VL auto-vec pattern. slli a4,a3,3 ... insn2: vsetvli zero, a3, ... load (use const_int 0, before Phase 5, it's using a3, but the use of "a3" is removed in Phase 5) ... In Phase 6, we iterate to insn2, then get the def of "a3" which is the insn1. insn2 is the vsetvli instruction inserted in Phase 4 which is not included in the RLT_SSA framework even though we renew it (I didn't take a look at it and I don't think we need to now). Base on this situation, the def_info of insn2 has the information "set->single_nondebug_insn_use ()" which return true. Obviously, this information is not correct, since insn1 has aleast 2 uses: 1). slli a4,a3,3 2).insn2: vsetvli zero, a3, ... Then, the test generated by my downstream test-generator execution test failed. Conclusion of RTL_SSA framework: Before this patch, we initialize RTL_SSA 2 times. One is at the beginning of the VSETVL PASS which is absolutely correct, the other is re-new after Phase 4 (LCM) has incorrect information that causes bugs. Besides, we don't like to initialize RTL_SSA second time it seems to be a waste since we just need to do a little optimization. Base on all circumstances I described above, I rework and reorganize Phase 5 && Phase 6 as follows: 1. Phase 5 is called ssa_post_optimization which is doing the optimization base on the RTL_SSA information (The RTL_SSA is initialized at the beginning of the VSETVL PASS, no need to re-new it again). This phase includes 3 optimizaitons: 1). local_eliminate_vsetvl_insn we already have (no change). 2). 
global_eliminate_vsetvl_insn ---> new optimizaiton splitted from orignal Phase 6 but with more powerful and reliable implementation. E.g. void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) { size_t avl; if (m > 100) avl = __riscv_vsetvl_e16mf4(vl << 4); else avl = __riscv_vsetvl_e32mf2(vl >> 8); for (size_t i = 0; i < m; i++) { vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl); v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl); __riscv_vse8_v_i8mf8(out + i, v0, avl); } } This example failed to global user vsetvl optimize before this patch: f: li a5,100 bleu a3,a5,.L2 slli a2,a2,4 vsetvli a4,a2,e16,mf4,ta,mu .L3: li a5,0 vsetvli zero,a4,e8,mf8,ta,ma .L5: add a6,a0,a5 add a2,a1,a5 vle8.v v1,0(a6) addi a5,a5,1 vadd.vv v1,v1,v1 vse8.v v1,0(a2) bgtu a3,a5,.L5 .L10: ret .L2: beq a3,zero,.L10 srli a2,a2,8 vsetvli a4,a2,e32,mf2,ta,mu j .L3 With this patch: f: li a5,100 bleu a3,a5,.L2 slli a2,a2,4 vsetvli zero,a2,e8,mf8,ta,ma .L3: li a5,0 .L5: add a6,a0,a5 add a2,a1,a5 vle8.v v1,0(a6) addi a5,a5,1 vadd.vv v1,v1,v1 vse8.v v1,0(a2) bgtu a3,a5,.L5 .L10: ret .L2: beq a3,zero,.L10 srli a2,a2,8 vsetvli zero,a2,e8,mf8,ta,ma j .L3 3). Remove AVL operand dependency of each RVV instructions. 2. Phase 6 is called df_post_optimization: Optimize "vsetvl a3,a2...." into Optimize "vsetvl zero,a2...." base on dataflow analysis of new CFG (new CFG is created by LCM). The reason we need to do use new CFG and after Phase 5: ... vsetvl a3, a2... vadd.vv (use a3) If we don't have Phase 5 which removes the "a3" use in vadd.vv, we will fail to optimize vsetvl a3,a2 into vsetvl zero,a2. This patch passed all tests in rvv.exp with ONLY peformance && codegen improved (no performance decline and no bugs including my downstream tests). gcc/ChangeLog: * config/riscv/riscv-vsetvl.cc (available_occurrence_p): Enhance user vsetvl optimization. (vector_insn_info::parse_insn): Add rtx_insn parse. (pass_vsetvl::local_eliminate_vsetvl_insn): Enhance user vsetvl optimization. (get_first_vsetvl): New function. (pass_vsetvl::global_eliminate_vsetvl_insn): Ditto. (pass_vsetvl::cleanup_insns): Remove it. (pass_vsetvl::ssa_post_optimization): New function. (has_no_uses): Ditto. (pass_vsetvl::propagate_avl): Remove it. (pass_vsetvl::df_post_optimization): New function. (pass_vsetvl::lazy_vsetvl): Rework Phase 5 && Phase 6. * config/riscv/riscv-vsetvl.h: Adapt declaration. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/vsetvl/vsetvl-16.c: Adapt test. * gcc.target/riscv/rvv/vsetvl/vsetvl-2.c: Ditto. * gcc.target/riscv/rvv/vsetvl/vsetvl-3.c: Ditto. * gcc.target/riscv/rvv/vsetvl/vsetvl-21.c: New test. * gcc.target/riscv/rvv/vsetvl/vsetvl-22.c: New test. * gcc.target/riscv/rvv/vsetvl/vsetvl-23.c: New test.
2023-06-10  target/109650: Fix wrong code after cc0 -> CCmode transition.  Georg-Johann Lay  (7 files, -777/+1318)
This patch fixes a wrong-code bug in the wake of PR92729, the transition that turned the AVR backend from cc0 to CCmode. In cc0, the insn that uses cc0 like a conditional branch always follows the cc0 setter, which is no more the case with CCmode where set and use of REG_CC might be in different basic blocks. This patch removes the machine-dependent reorg pass in avr_reorg entirely. It is replaced by a new, AVR specific mini-pass that runs prior to split2. Canonicalization of comparisons away from the "difficult" codes GT[U] and LE[U] is now mostly performed by implementing TARGET_CANONICALIZE_COMPARISON. Moreover: * Text peephole conditions get "dead_or_set_regno_p (*, REG_CC)" as needed. * RTL peephole conditions get "peep2_regno_dead_p (*, REG_CC)" as needed. * Conditional branches no more clobber REG_CC. * insn output for compares looks ahead to determine the branch mode in use. This needs also "dead_or_set_regno_p (*, REG_CC)". * Add RTL peepholes for decrement-and-branch detection. * Some of the patterns like "*cmphi.zero-extend.0" lost their combine-ational part wit PR92729. Restore them. Finally, it fixes some of the many indentation glitches left over from PR92729. gcc/ PR target/109650 PR target/92729 * config/avr/avr-passes.def (avr_pass_ifelse): Insert new pass. * config/avr/avr.cc (avr_pass_ifelse): New RTL pass. (avr_pass_data_ifelse): New pass_data for it. (make_avr_pass_ifelse, avr_redundant_compare, avr_cbranch_cost) (avr_canonicalize_comparison, avr_out_plus_set_ZN) (avr_out_cmp_ext): New functions. (compare_condtition): Make sure REG_CC dies in the branch insn. (avr_rtx_costs_1): Add computation of cbranch costs. (avr_adjust_insn_length) [ADJUST_LEN_ADD_SET_ZN, ADJUST_LEN_CMP_ZEXT]: [ADJUST_LEN_CMP_SEXT]Handle them. (TARGET_CANONICALIZE_COMPARISON): New define. (avr_simplify_comparison_p, compare_diff_p, avr_compare_pattern) (avr_reorg_remove_redundant_compare, avr_reorg): Remove functions. (TARGET_MACHINE_DEPENDENT_REORG): Remove define. * config/avr/avr-protos.h (avr_simplify_comparison_p): Remove proto. (make_avr_pass_ifelse, avr_out_plus_set_ZN, cc_reg_rtx) (avr_out_cmp_zext): New Protos * config/avr/avr.md (branch, difficult_branch): Don't split insns. (*cbranchhi.zero-extend.0", *cbranchhi.zero-extend.1") (*swapped_tst<mode>, *add.for.eqne.<mode>): New insns. (*cbranch<mode>4): Rename to cbranch<mode>4_insn. (define_peephole): Add dead_or_set_regno_p(insn,REG_CC) as needed. (define_deephole2): Add peep2_regno_dead_p(*,REG_CC) as needed. Add new RTL peepholes for decrement-and-branch and *swapped_tst<mode>. Rework signtest-and-branch peepholes for *sbrx_branch<mode>. (adjust_len) [add_set_ZN, cmp_zext]: New. (QIPSI): New mode iterator. (ALLs1, ALLs2, ALLs4, ALLs234): New mode iterators. (gelt): New code iterator. (gelt_eqne): New code attribute. (rvbranch, *rvbranch, difficult_rvbranch, *difficult_rvbranch) (branch_unspec, *negated_tst<mode>, *reversed_tst<mode>) (*cmpqi_sign_extend): Remove insns. (define_c_enum "unspec") [UNSPEC_IDENTITY]: Remove. * config/avr/avr-dimode.md (cbranch<mode>4): Canonicalize comparisons. * config/avr/predicates.md (scratch_or_d_register_operand): New. * config/avr/constraints.md (Yxx): New constraint. gcc/testsuite/ PR target/109650 * gcc.target/avr/torture/pr109650-1.c: New test. * gcc.target/avr/torture/pr109650-2.c: New test.
2023-06-10  RISC-V: Enable select_vl for RVV auto-vectorization  Juzhe-Zhong  (3 files, -0/+27)
Consider this following example: void vec_add(int32_t *restrict c, int32_t *restrict a, int32_t *restrict b, int N) { for (long i = 0; i < N; i++) { c[i] = a[i] + b[i]; } } After this patch: vec_add: ble a3,zero,.L5 .L3: vsetvli a5,a3,e32,m1,ta,ma vle32.v v2,0(a1) vle32.v v1,0(a2) vsetvli a6,zero,e32,m1,ta,ma ===> redundant vsetvl. slli a4,a5,2 vadd.vv v1,v1,v2 sub a3,a3,a5 vsetvli zero,a5,e32,m1,ta,ma ===> redundant vsetvl. vse32.v v1,0(a0) add a1,a1,a4 add a2,a2,a4 add a0,a0,a4 bne a3,zero,.L3 .L5: ret We can get close-to-optimal codegen but with some redundant vsetvls. This is not the big issue which will be easily addressed in RISC-V backend. I am going to add a standalone PASS "AVL propagation" (avlprop) to addresse such issue. gcc/ChangeLog: * config/riscv/autovec.md (select_vl<mode>): New pattern. * config/riscv/riscv-protos.h (expand_select_vl): New function. * config/riscv/riscv-v.cc (expand_select_vl): Ditto. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/ternop/ternop-2.c: Adapt test. * gcc.target/riscv/rvv/autovec/ternop/ternop-5.c: Ditto. * gcc.target/riscv/rvv/autovec/partial/select_vl-1.c: New test.
2023-06-09  RISC-V: Refactor requirement of ZVFH and ZVFHMIN.  Pan Li  (2 files, -16/+46)
This patch refactors the requirements of both ZVFH and ZVFHMIN. By default, ZVFHMIN enables FP16 for all the RVV iterators, and ZVFH then uses one define attr as the gate for whether FP16 is supported. Please note that ZVFH covers the ZVFHMIN instructions. This patch adds one test for this. Signed-off-by: Pan Li <pan2.li@intel.com> Co-Authored by: Juzhe-Zhong <juzhe.zhong@rivai.ai> Co-Authored by: Kito Cheng <kito.cheng@sifive.com> gcc/ChangeLog: * config/riscv/riscv.md (enabled): Move to another place, and add fp_vector_disabled to the cond. (fp_vector_disabled): New attr defined for disabling fp. * config/riscv/vector-iterators.md: Fix V_WHOLE and V_FRACT. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/base/zvfhmin-intrinsic.c: Add vle16 test for ZVFHMIN.
2023-06-09  RISC-V: Fix one warning of frm enum.  Pan Li  (1 file, -7/+10)
This patch fixes a warning like the one below, and adds a link to where the values come from. ./gcc/config/riscv/riscv-protos.h:260:13: warning: binary constants are a C++14 feature or GCC extension FRM_RNE = 0b000, ^~~~~ Signed-off-by: Pan Li <pan2.li@intel.com> gcc/ChangeLog: * config/riscv/riscv-protos.h (enum frm_field_enum): Adjust literal to int.
2023-06-09  Explicitly view_convert_expr mask to signed type when folding pblendvb builtins.  liuhongt  (1 file, -1/+3)
Since mask < 0 will be always false for vector char when -funsigned-char, but vpblendvb needs to check the most significant bit. The patch explicitly VCE to vector signed char. gcc/ChangeLog: PR target/110108 * config/i386/i386.cc (ix86_gimple_fold_builtin): Explicitly view_convert_expr mask to signed type when folding pblendvb builtins. gcc/testsuite/ChangeLog: * gcc.target/i386/pr110108-2.c: New test.
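A short illustration of why the sign of the mask matters (hypothetical example, not the pr110108-2.c testcase): VPBLENDVB selects on the most significant bit of each mask byte, so the gimple fold must compare a signed vector against zero even under -funsigned-char.

#include <smmintrin.h>

/* Hypothetical example: select bytes of B where the mask byte has its MSB
   set, else bytes of A.  Folding this to "mask < 0 ? b : a" is only correct
   when the comparison is done on signed chars; with plain vector char under
   -funsigned-char the comparison would always be false.  */
__m128i
blend (__m128i a, __m128i b, __m128i mask)
{
  return _mm_blendv_epi8 (a, b, mask);
}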
2023-06-09  Fold _mm{,256,512}_abs_{epi8,epi16,epi32,epi64} into gimple ABSU_EXPR + VCE.  liuhongt  (2 files, -10/+23)
r14-1145 fold the intrinsics into gimple ABS_EXPR which has UB for TYPE_MIN, but PABSB will store unsigned result into dst. The patch uses ABSU_EXPR + VCE instead of ABS_EXPR. Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit vector absm2 is guarded with TARGET_MMX_WITH_SSE. gcc/ChangeLog: PR target/110108 * config/i386/i386.cc (ix86_gimple_fold_builtin): Fold _mm{,256,512}_abs_{epi8,epi16,epi32,epi64} into gimple ABSU_EXPR + VCE, don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT. * config/i386/i386-builtin.def: Replace CODE_FOR_nothing with real codename for __builtin_ia32_pabs{b,w,d}. gcc/testsuite/ChangeLog: * gcc.target/i386/pr110108.c: New test. * gcc.target/i386/pr110108-3.c: New test. * gcc.target/i386/pr109900.c: Adjust testcase.
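A small example of the corner case that makes a plain ABS_EXPR unsafe (hypothetical code, not the committed tests; assumes compilation with -mssse3): for the most negative element PABSB produces the unsigned value 128, whereas signed abs(-128) is undefined.

#include <tmmintrin.h>
#include <stdio.h>

int
main (void)
{
  /* Hypothetical example: |-128| does not fit in int8_t, so the fold must
     use ABSU_EXPR (unsigned result) plus a view-convert rather than a
     signed ABS_EXPR, which would be undefined for this input.  */
  __m128i v = _mm_set1_epi8 (-128);
  __m128i r = _mm_abs_epi8 (v);            /* every byte becomes 0x80 */
  unsigned char out[16];
  _mm_storeu_si128 ((__m128i *) out, r);
  printf ("%u\n", out[0]);                 /* prints 128 */
  return 0;
}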
2023-06-08  i386: Fix endless recursion in ix86_expand_vector_init_general with MMX [PR110152]  Jakub Jelinek  (1 file, -1/+1)
I'm getting +FAIL: gcc.target/i386/3dnow-1.c (internal compiler error: Segmentation fault signal terminated program cc1) +FAIL: gcc.target/i386/3dnow-1.c (test for excess errors) +FAIL: gcc.target/i386/3dnow-2.c (internal compiler error: Segmentation fault signal terminated program cc1) +FAIL: gcc.target/i386/3dnow-2.c (test for excess errors) +FAIL: gcc.target/i386/mmx-1.c (internal compiler error: Segmentation fault signal terminated program cc1) +FAIL: gcc.target/i386/mmx-1.c (test for excess errors) +FAIL: gcc.target/i386/mmx-2.c (internal compiler error: Segmentation fault signal terminated program cc1) +FAIL: gcc.target/i386/mmx-2.c (test for excess errors) regressions on i686-linux since r14-1166. The problem is when ix86_expand_vector_init_general is called with mmx_ok = true and mode = V4HImode, it newly recurses with mmx_ok = false and mode = V2SImode, but as mmx_ok is false and !TARGET_SSE, we recurse again with the same arguments (ok, fresh new tmp and vals) infinitely. The following patch fixes that by passing mmx_ok to that recursive call. For n_words == 4 it isn't needed, because we only care about mmx_ok for V2SImode or V2SFmode and no other modes. 2023-06-08 Jakub Jelinek <jakub@redhat.com> PR target/110152 * config/i386/i386-expand.cc (ix86_expand_vector_init_general): For n_words == 2 recurse with mmx_ok as first argument rather than false.
2023-06-07  Add support for stc and cmc instructions in i386.md  Roger Sayle  (5 files, -4/+126)
This patch is the latest revision of my patch to add support for the STC (set carry flag) and CMC (complement carry flag) instructions to the i386 backend, incorporating Uros' previous feedback. The significant changes are (i) the inclusion of CMC, (ii) the use of UNSPEC for pattern, (iii) Use of a new X86_TUNE_SLOW_STC tuning flag to use alternate implementations on pentium4 (which has a notoriously slow STC) when not optimizing for size. An example of the use of the stc instruction is: unsigned int foo (unsigned int a, unsigned int b, unsigned int *c) { return __builtin_ia32_addcarryx_u32 (1, a, b, c); } which previously generated: movl $1, %eax addb $-1, %al adcl %esi, %edi setc %al movl %edi, (%rdx) movzbl %al, %eax ret with this patch now generates: stc adcl %esi, %edi setc %al movl %edi, (%rdx) movzbl %al, %eax ret An example of the use of the cmc instruction (where the carry from a first adc is inverted/complemented as input to a second adc) is: unsigned int bar (unsigned int a, unsigned int b, unsigned int c, unsigned int d) { unsigned int c1 = __builtin_ia32_addcarryx_u32 (1, a, b, &o1); return __builtin_ia32_addcarryx_u32 (c1 ^ 1, c, d, &o2); } which previously generated: movl $1, %eax addb $-1, %al adcl %esi, %edi setnc %al movl %edi, o1(%rip) addb $-1, %al adcl %ecx, %edx setc %al movl %edx, o2(%rip) movzbl %al, %eax ret and now generates: stc adcl %esi, %edi cmc movl %edi, o1(%rip) adcl %ecx, %edx setc %al movl %edx, o2(%rip) movzbl %al, %eax ret This version implements Uros' suggestions/refinements. (i) Avoid the UNSPEC_CMC by using the canonical RTL idiom for *x86_cmc, (ii) Use peephole2s to convert x86_stc and *x86_cmc into alternate forms on TARGET_SLOW_STC CPUs (pentium4), when a suitable QImode register is available, (iii) Prefer the addqi_cconly_overflow idiom (addb $-1,%al) over negqi_ccc_1 (neg %al) for setting the carry from a QImode value, These changes required two minor edits to i386.cc: ix86_cc_mode had to be tweaked to suggest CCCmode for the new *x86_cmc pattern, and *x86_cmc needed to be handled/parameterized in ix86_rtx_costs so that combine would appreciate that this complex RTL expression was actually a fast, single byte instruction [i.e. preferable]. 2022-06-07 Roger Sayle <roger@nextmovesoftware.com> Uros Bizjak <ubizjak@gmail.com> gcc/ChangeLog * config/i386/i386-expand.cc (ix86_expand_builtin) <handlecarry>: Use new x86_stc instruction when the carry flag must be set. * config/i386/i386.cc (ix86_cc_mode): Use CCCmode for *x86_cmc. (ix86_rtx_costs): Provide accurate rtx_costs for *x86_cmc. * config/i386/i386.h (TARGET_SLOW_STC): New define. * config/i386/i386.md (UNSPEC_STC): New UNSPEC for stc. (x86_stc): New define_insn. (define_peephole2): Convert x86_stc into alternate implementation on pentium4 without -Os when a QImode register is available. (*x86_cmc): New define_insn. (define_peephole2): Convert *x86_cmc into alternate implementation on pentium4 without -Os when a QImode register is available. (*setccc): New define_insn_and_split for a no-op CCCmode move. (*setcc_qi_negqi_ccc_1_<mode>): New define_insn_and_split to recognize (and eliminate) the carry flag being copied to itself. (*setcc_qi_negqi_ccc_2_<mode>): Likewise. * config/i386/x86-tune.def (X86_TUNE_SLOW_STC): New tuning flag. gcc/testsuite/ChangeLog * gcc.target/i386/cmc-1.c: New test case. * gcc.target/i386/stc-1.c: Likewise.
2023-06-07  RISC-V: Eliminate extension after for *w instructions  Jeff Law  (4 files, -31/+177)
This patch tries to prevent generating unnecessary sign extension after *w instructions like "addiw" or "divw". The main idea of it is to add SUBREG_PROMOTED fields during expanding. I have tested on SPEC2017 there is no regression. Only gcc.dg/pr30957-1.c test failed. To solve that I did some changes in loop-iv.cc, but not sure that it is suitable. gcc/ChangeLog: * config/riscv/bitmanip.md (rotrdi3, rotrsi3, rotlsi3): New expanders. (rotrsi3_sext): Expose generator. (rotlsi3 pattern): Hide generator. * config/riscv/riscv-protos.h (riscv_emit_binary): New function declaration. * config/riscv/riscv.cc (riscv_emit_binary): Removed static * config/riscv/riscv.md (addsi3, subsi3, negsi2): Hide generator. (mulsi3, <optab>si3): Likewise. (addsi3, subsi3, negsi2, mulsi3, <optab>si3): New expanders. (addv<mode>4, subv<mode>4, mulv<mode>4): Use riscv_emit_binary. (<u>mulsidi3): Likewise. (addsi3_extended, subsi3_extended, negsi2_extended): Expose generator. (mulsi3_extended, <optab>si3_extended): Likewise. (splitter for shadd feeding divison): Update RTL pattern to account for changes in how 32 bit ops are expanded for TARGET_64BIT. * loop-iv.cc (get_biv_step_1): Process src of extension when it PLUS. gcc/testsuite/ChangeLog: * gcc.target/riscv/shift-and-2.c: New tests. * gcc.target/riscv/shift-shift-2.c: Adjust expected output. * gcc.target/riscv/sign-extend.c: New test. * gcc.target/riscv/zbb-rol-ror-03.c: Adjust expected output. Co-authored-by: Jeff Law <jlaw@ventanamicro.com>
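A tiny example of the redundant extension this avoids (hypothetical code, related in spirit to the new sign-extend.c test): on rv64, addw already sign-extends its 32-bit result, so a following sext.w should be unnecessary once the result is marked SUBREG_PROMOTED.

#include <stdint.h>

/* Hypothetical example for rv64: "addw" sign-extends the 32-bit sum to
   64 bits, so returning it as int64_t should not need an extra sext.w.  */
int64_t
add32 (int32_t a, int32_t b)
{
  int32_t sum = a + b;
  return (int64_t) sum;
}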
2023-06-07  riscv: Fix scope for memory model calculation  Dimitar Dimitrov  (1 file, -4/+9)
During libgcc configure stage for riscv32-none-elf, when "--enable-checking=yes,rtl" has been activated, the following error is observed: during RTL pass: final conftest.c: In function 'main': conftest.c:16:1: internal compiler error: RTL check: expected code 'const_int', have 'reg' in riscv_print_operand, at config/riscv/riscv.cc:4462 16 | } | ^ 0x843c4d rtl_check_failed_code1(rtx_def const*, rtx_code, char const*, int, char const*) /mnt/nvme/dinux/local-workspace/gcc/gcc/rtl.cc:916 0x8ea823 riscv_print_operand /mnt/nvme/dinux/local-workspace/gcc/gcc/config/riscv/riscv.cc:4462 0xde84b5 output_operand(rtx_def*, int) /mnt/nvme/dinux/local-workspace/gcc/gcc/final.cc:3632 0xde8ef8 output_asm_insn(char const*, rtx_def**) /mnt/nvme/dinux/local-workspace/gcc/gcc/final.cc:3544 0xded33b output_asm_insn(char const*, rtx_def**) /mnt/nvme/dinux/local-workspace/gcc/gcc/final.cc:3421 0xded33b final_scan_insn_1 /mnt/nvme/dinux/local-workspace/gcc/gcc/final.cc:2841 0xded6cb final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*) /mnt/nvme/dinux/local-workspace/gcc/gcc/final.cc:2887 0xded8b7 final_1 /mnt/nvme/dinux/local-workspace/gcc/gcc/final.cc:1979 0xdee518 rest_of_handle_final /mnt/nvme/dinux/local-workspace/gcc/gcc/final.cc:4240 0xdee518 execute /mnt/nvme/dinux/local-workspace/gcc/gcc/final.cc:4318 Fix by moving the calculation of memmodel to the cases where it is used. Regression tested for riscv32-none-elf. No changes in gcc.sum and g++.sum. PR target/109725 gcc/ChangeLog: * config/riscv/riscv.cc (riscv_print_operand): Calculate memmodel only when it is valid. Signed-off-by: Dimitar Dimitrov <dimitar@dinux.eu>
2023-06-07  riscv: Fix insn cost calculation  Dimitar Dimitrov  (1 file, -1/+1)
When building riscv32-none-elf with "--enable-checking=yes,rtl", the following ICE is observed: cc1: internal compiler error: RTL check: expected code 'const_int', have 'const_double' in riscv_const_insns, at config/riscv/riscv.cc:1313 0x843c4d rtl_check_failed_code1(rtx_def const*, rtx_code, char const*, int, char const*) /mnt/nvme/dinux/local-workspace/gcc/gcc/rtl.cc:916 0x8eab61 riscv_const_insns(rtx_def*) /mnt/nvme/dinux/local-workspace/gcc/gcc/config/riscv/riscv.cc:1313 0x15443bb riscv_legitimate_constant_p /mnt/nvme/dinux/local-workspace/gcc/gcc/config/riscv/riscv.cc:826 0xdd3c71 emit_move_insn(rtx_def*, rtx_def*) /mnt/nvme/dinux/local-workspace/gcc/gcc/expr.cc:4310 0x15f28e5 run_const_vector_selftests /mnt/nvme/dinux/local-workspace/gcc/gcc/config/riscv/riscv-selftests.cc:285 0x15f37bd selftest::riscv_run_selftests() /mnt/nvme/dinux/local-workspace/gcc/gcc/config/riscv/riscv-selftests.cc:364 0x1f6fba9 selftest::run_tests() /mnt/nvme/dinux/local-workspace/gcc/gcc/selftest-run-tests.cc:111 0x11d1f39 toplev::run_self_tests() /mnt/nvme/dinux/local-workspace/gcc/gcc/toplev.cc:2185 Fix by following the spirit of the adjacent comment, and using the dedicated riscv_const_insns() function to calculate cost for loading a constant element. Infinite recursion is not possible because the first invocation is on a CONST_VECTOR, whereas the second is on a single element of the vector (e.g. CONST_INT or CONST_DOUBLE). Regression tested for riscv32-none-elf. No changes in gcc.sum and g++.sum. gcc/ChangeLog: * config/riscv/riscv.cc (riscv_const_insns): Recursively call for constant element of a vector. Signed-off-by: Dimitar Dimitrov <dimitar@dinux.eu>
2023-06-07  aarch64: Allow compiler to define ls64 builtins [PR110132]  Alex Coplan  (2 files, -38/+19)
This patch refactors the ls64 builtins to allow the compiler to define them directly instead of having wrapper functions in arm_acle.h. This should be not only easier to maintain, but it makes two important correctness fixes: - It fixes PR110132, where the builtins ended up getting declared with invisible bindings in the C FE, so the FE ended up synthesizing incompatible implicit definitions for these builtins. - It allows the builtins to be used with LTO, which didn't work previously. We also take the opportunity to add test coverage from C++ for these builtins. gcc/ChangeLog: PR target/110132 * config/aarch64/aarch64-builtins.cc (aarch64_general_simulate_builtin): New. Use it ... (aarch64_init_ls64_builtins): ... here. Switch to declaring public ACLE names for builtins. (aarch64_general_init_builtins): Ensure we invoke the arm_acle.h setup if in_lto_p, just like we do for SVE. * config/aarch64/arm_acle.h: (__arm_ld64b): Delete. (__arm_st64b): Delete. (__arm_st64bv): Delete. (__arm_st64bv0): Delete. gcc/testsuite/ChangeLog: PR target/110132 * lib/target-supports.exp (check_effective_target_aarch64_asm_FUNC_ok): Extend to ls64. * g++.target/aarch64/acle/acle.exp: New. * g++.target/aarch64/acle/ls64.C: New test. * g++.target/aarch64/acle/ls64_lto.C: New test. * gcc.target/aarch64/acle/ls64_lto.c: New test. * gcc.target/aarch64/acle/pr110132.c: New test.
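A sketch of how the builtins are used from C (hypothetical usage; the __arm_ld64b/__arm_st64b names come from the commit, while the data512_t type and the +ls64 architecture flag are assumptions based on arm_acle.h):

#include <arm_acle.h>

/* Hypothetical example (requires a target with FEAT_LS64, e.g.
   -march=armv8.7-a+ls64): copy one 64-byte granule between two locations
   using the single-copy-atomic LD64B/ST64B operations.  */
void
copy_granule (void *dst, const void *src)
{
  data512_t data = __arm_ld64b (src);
  __arm_st64b (dst, data);
}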
2023-06-07  aarch64: Fix wrong code with st64b builtin [PR110100]  Alex Coplan  (2 files, -2/+2)
The st64b pattern incorrectly had an output constraint on the register operand containing the destination address for the store, leading to wrong code. This patch fixes that. gcc/ChangeLog: PR target/110100 * config/aarch64/aarch64-builtins.cc (aarch64_expand_builtin_ls64): Use input operand for the destination address. * config/aarch64/aarch64.md (st64b): Fix constraint on address operand. gcc/testsuite/ChangeLog: PR target/110100 * gcc.target/aarch64/acle/pr110100.c: New test.
2023-06-07  aarch64: Fix whitespace in ls64 builtin implementation [PR110100]  Alex Coplan  (2 files, -43/+43)
The ls64 builtin code was using incorrect GNU style with eight spaces where there should be a tab. Fixed thusly. gcc/ChangeLog: PR target/110100 * config/aarch64/aarch64-builtins.cc (aarch64_init_ls64_builtins_types): Replace eight consecutive spaces with tabs. (aarch64_init_ls64_builtins): Likewise. (aarch64_expand_builtin_ls64): Likewise. * config/aarch64/aarch64.md (ld64b): Likewise. (st64b): Likewise. (st64bv): Likewise (st64bv0): Likewise.
2023-06-07  aarch64: Represent SQXTUN with RTL operations  Kyrylo Tkachov  (3 files, -14/+56)
This patch removes UNSPEC_SQXTUN and uses organic RTL codes to represent the operation. SQXTUN is an odd one. It's described in the architecture as "Signed saturating extract Unsigned Narrow". It's not a straightforward ss_truncate nor a us_truncate. It is a sort of truncating signed clamp operation with limits derived from the unsigned extrema of the narrow mode: (truncate:N (smin:M (smax:M (reg:M) (const_int 0)) (const_int <unsigned-max-for-mode-N>))) This patch implements these semantics. I've checked that the vqmovun tests in advsimd-intrinsics.exp now get constant-folded and still pass validation, so I'm pretty confident in the semantics. Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf. gcc/ChangeLog: * config/aarch64/aarch64-simd.md (aarch64_sqmovun<mode><vczle><vczbe>): Rename to... (*aarch64_sqmovun<mode>_insn<vczle><vczbe>): ... This. Reimplement with RTL codes. (aarch64_sqmovun<mode> [SD_HSDI]): Reimplement with RTL codes. (aarch64_sqxtun2<mode>_le): Likewise. (aarch64_sqxtun2<mode>_be): Likewise. (aarch64_sqxtun2<mode>): Adjust for the above. (aarch64_sqmovun<mode>): New define_expand. * config/aarch64/iterators.md (UNSPEC_SQXTUN): Delete. (half_mask): New mode attribute. * config/aarch64/predicates.md (aarch64_simd_umax_half_mode): New predicate.
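The clamping semantics described above can be seen directly from the intrinsic (a hypothetical example, not the advsimd-intrinsics vqmovun test itself):

#include <arm_neon.h>

/* Hypothetical example: SQXTUN clamps each signed 16-bit element into the
   unsigned 8-bit range [0, 255] and then truncates, i.e.
   (uint8_t) min (max (x, 0), 255).  With the RTL representation above a
   constant input like this can be folded at compile time.  */
uint8x8_t
clamp_to_u8 (void)
{
  int16x8_t x = { -5, 0, 1, 127, 128, 255, 256, 32767 };
  return vqmovun_s16 (x);   /* { 0, 0, 1, 127, 128, 255, 255, 255 } */
}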
2023-06-07  aarch64: Improve RTL representation of ADDP instructions  Kyrylo Tkachov  (1 file, -7/+63)
Similar to the ADDLP instructions the non-widening ADDP ones can be represented by adding the odd lanes with the even lanes of a vector. These instructions take two vector inputs and the architecture spec describes the operation as concatenating them together before going through it with pairwise additions. This patch chooses to represent ADDP on 64-bit and 128-bit input vectors slightly differently, reasons explained in the comments in aarhc64-simd.md. Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf. gcc/ChangeLog: * config/aarch64/aarch64-simd.md (aarch64_addp<mode><vczle><vczbe>): Reimplement as... (aarch64_addp<mode>_insn): ... This... (aarch64_addp<mode><vczle><vczbe>_insn): ... And this. (aarch64_addp<mode>): New define_expand.
2023-06-07  RISC-V: Support RVV VLA SLP auto-vectorization  Juzhe-Zhong  (3 files, -23/+394)
This patch enables basic VLA SLP auto-vectorization. Consider this following case: void f (uint8_t *restrict a, uint8_t *restrict b) { for (int i = 0; i < 100; ++i) { a[i * 8 + 0] = b[i * 8 + 7] + 1; a[i * 8 + 1] = b[i * 8 + 7] + 2; a[i * 8 + 2] = b[i * 8 + 7] + 8; a[i * 8 + 3] = b[i * 8 + 7] + 4; a[i * 8 + 4] = b[i * 8 + 7] + 5; a[i * 8 + 5] = b[i * 8 + 7] + 6; a[i * 8 + 6] = b[i * 8 + 7] + 7; a[i * 8 + 7] = b[i * 8 + 7] + 3; } } To enable VLA SLP auto-vectorization, we should be able to handle this following const vector: 1. NPATTERNS = 8, NELTS_PER_PATTERN = 3. { 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16, ... } 2. NPATTERNS = 8, NELTS_PER_PATTERN = 1. { 1, 2, 8, 4, 5, 6, 7, 3, ... } And these vector can be generated at prologue. After this patch, we end up with this following codegen: Prologue: ... vsetvli a7,zero,e16,m2,ta,ma vid.v v4 vsrl.vi v4,v4,3 li a3,8 vmul.vx v4,v4,a3 ===> v4 = { 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 16, 16, 16, 16, 16, 16, 16, 16, ... } ... li t1,67633152 addi t1,t1,513 li a3,50790400 addi a3,a3,1541 slli a3,a3,32 add a3,a3,t1 vsetvli t1,zero,e64,m1,ta,ma vmv.v.x v3,a3 ===> v3 = { 1, 2, 8, 4, 5, 6, 7, 3, ... } ... LoopBody: ... min a3,... vsetvli zero,a3,e8,m1,ta,ma vle8.v v2,0(a6) vsetvli a7,zero,e8,m1,ta,ma vrgatherei16.vv v1,v2,v4 vadd.vv v1,v1,v3 vsetvli zero,a3,e8,m1,ta,ma vse8.v v1,0(a2) add a6,a6,a4 add a2,a2,a4 mv a3,a5 add a5,a5,t1 bgtu a3,a4,.L3 ... Note: we need to use "vrgatherei16.vv" instead of "vrgather.vv" for SEW = 8 since "vrgatherei16.vv" can cover larger range than "vrgather.vv" (which only can maximum element index = 255). Epilogue: lbu a5,799(a1) addiw a4,a5,1 sb a4,792(a0) addiw a4,a5,2 sb a4,793(a0) addiw a4,a5,8 sb a4,794(a0) addiw a4,a5,4 sb a4,795(a0) addiw a4,a5,5 sb a4,796(a0) addiw a4,a5,6 sb a4,797(a0) addiw a4,a5,7 sb a4,798(a0) addiw a5,a5,3 sb a5,799(a0) ret There is one more last thing we need to do is the "Epilogue auto-vectorization" which needs VLS modes support. I will support VLS modes for "Epilogue auto-vectorization" in the future. gcc/ChangeLog: * config/riscv/riscv-protos.h (expand_vec_perm_const): New function. * config/riscv/riscv-v.cc (rvv_builder::can_duplicate_repeating_sequence_p): Support POLY handling. (rvv_builder::single_step_npatterns_p): New function. (rvv_builder::npatterns_all_equal_p): Ditto. (const_vec_all_in_range_p): Support POLY handling. (gen_const_vector_dup): Ditto. (emit_vlmax_gather_insn): Add vrgatherei16. (emit_vlmax_masked_gather_mu_insn): Ditto. (expand_const_vector): Add VLA SLP const vector support. (expand_vec_perm): Support POLY. (struct expand_vec_perm_d): New struct. (shuffle_generic_patterns): New function. (expand_vec_perm_const_1): Ditto. (expand_vec_perm_const): Ditto. * config/riscv/riscv.cc (riscv_vectorize_vec_perm_const): Ditto. (TARGET_VECTORIZE_VEC_PERM_CONST): New targethook. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/scalable-1.c: Adapt testcase for VLA vectorizer. * gcc.target/riscv/rvv/autovec/v-1.c: Ditto. * gcc.target/riscv/rvv/autovec/zve32f_zvl128b-1.c: Ditto. * gcc.target/riscv/rvv/autovec/zve32x_zvl128b-1.c: Ditto. * gcc.target/riscv/rvv/autovec/zve64d-1.c: Ditto. * gcc.target/riscv/rvv/autovec/zve64d_zvl128b-1.c: Ditto. * gcc.target/riscv/rvv/autovec/zve64f-1.c: Ditto. * gcc.target/riscv/rvv/autovec/zve64f_zvl128b-1.c: Ditto. * gcc.target/riscv/rvv/autovec/zve64x_zvl128b-1.c: Ditto. * gcc.target/riscv/rvv/autovec/partial/slp-1.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp-2.c: New test. 
* gcc.target/riscv/rvv/autovec/partial/slp-3.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp-4.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp-5.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp-6.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp-7.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp_run-1.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp_run-2.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp_run-3.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp_run-4.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp_run-5.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp_run-6.c: New test. * gcc.target/riscv/rvv/autovec/partial/slp_run-7.c: New test.
2023-06-07  RISC-V: Fix ICE when include riscv_vector.h with rv64gcv  Pan Li  (1 file, -33/+33)
This patch would like to fix the incorrect requirement of the vector builtin types for the ZVFH/ZVFHMIN extension. The incorrect requirement will result in the ops mismatch with iterators, and then ICE will be triggered if ZVFH/ZVFHMIN is not given. Sorry for inconviensient. Signed-off-by: Pan Li <pan2.li@intel.com> gcc/ChangeLog: * config/riscv/riscv-vector-builtins-types.def (vfloat32mf2_t): Take RVV_REQUIRE_ELEN_FP_16 as requirement. (vfloat32m1_t): Ditto. (vfloat32m2_t): Ditto. (vfloat32m4_t): Ditto. (vfloat32m8_t): Ditto. (vint16mf4_t): Ditto. (vint16mf2_t): Ditto. (vint16m1_t): Ditto. (vint16m2_t): Ditto. (vint16m4_t): Ditto. (vint16m8_t): Ditto. (vuint16mf4_t): Ditto. (vuint16mf2_t): Ditto. (vuint16m1_t): Ditto. (vuint16m2_t): Ditto. (vuint16m4_t): Ditto. (vuint16m8_t): Ditto. (vint32mf2_t): Ditto. (vint32m1_t): Ditto. (vint32m2_t): Ditto. (vint32m4_t): Ditto. (vint32m8_t): Ditto. (vuint32mf2_t): Ditto. (vuint32m1_t): Ditto. (vuint32m2_t): Ditto. (vuint32m4_t): Ditto. (vuint32m8_t): Ditto.
2023-06-06  RISC-V: Add RVV vwmacc/vwmaccu/vwmaccsu combine lowering optimization  Juzhe-Zhong  (2 files, -0/+161)
Fix according to comments from Robin of V1 patch. This patch add combine optimization for following case: __attribute__ ((noipa)) void vwmaccsu (int16_t *__restrict dst, int8_t *__restrict a, uint8_t *__restrict b, int n) { for (int i = 0; i < n; i++) dst[i] += (int16_t) a[i] * (int16_t) b[i]; } Before this patch: ... vsext.vf2 vzext.vf2 vmadd.vv .. After this patch: ... vwmaccsu.vv ... gcc/ChangeLog: * config/riscv/autovec-opt.md (*<optab>_fma<mode>): New pattern. (*single_<optab>mult_plus<mode>): Ditto. (*double_<optab>mult_plus<mode>): Ditto. (*sign_zero_extend_fma): Ditto. (*zero_sign_extend_fma): Ditto. * config/riscv/riscv-protos.h (enum insn_type): New enum. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/autovec/widen/widen-8.c: New test. * gcc.target/riscv/rvv/autovec/widen/widen-9.c: New test. * gcc.target/riscv/rvv/autovec/widen/widen-complicate-5.c: New test. * gcc.target/riscv/rvv/autovec/widen/widen-complicate-6.c: New test. * gcc.target/riscv/rvv/autovec/widen/widen_run-8.c: New test. * gcc.target/riscv/rvv/autovec/widen/widen_run-9.c: New test.
2023-06-06  rs6000: genfusion: Delete dead code  Segher Boessenkool  (1 file, -3/+0)
2023-06-06 Segher Boessenkool <segher@kernel.crashing.org> * config/rs6000/genfusion.pl: Delete some dead code.
2023-06-06  rs6000: genfusion: Rewrite load/compare code  Segher Boessenkool  (1 file, -82/+103)
This makes the code more readable, more digestible, more maintainable, more extensible. That kind of thing. It does that by pulling things apart a bit, but also making what stays together more cohesive lumps. The original function was a bunch of loops and early-outs, and then quite a bit of stuff done per iteration, with the iterations essentially independent of each other. This patch moves the stuff done for one iteration to a new _one function. The second big thing is the stuff printed to the .md file is done in "here documents" now, which is a lot more readable than having to quote and escape and double-escape pieces of text. Whitespace inside the here-document is significant (will be printed as-is), which is a bit awkward sometimes, or might take some getting used to, but it is also one of the benefits of using them. Local variables are declared at first use (or close to first use). There also shouldn't be many at all, often you can write easier to read and manage code by omitting to name something that is hard to name in the first place. Finally some things are done in more typical, more modern, and tighter Perl style, for example REs in "if"s or "qw" for lists of constants. 2023-06-06 Segher Boessenkool <segher@kernel.crashing.org> * config/rs6000/genfusion.pl (gen_ld_cmpi_p10_one): New, rewritten and split out from... (gen_ld_cmpi_p10): ... this.
2023-06-06  rs6000: Remove duplicate expression [PR106907]  Jeevitha Palanisamy  (1 file, -1/+0)
PR106907 reports a few warnings spotted by cppcheck; this patch addresses the duplicate-expression issue. The same expression is used twice in a logical AND (&&), which yields the same result, so one copy is removed. 2023-06-06 Jeevitha Palanisamy <jeevitha@linux.ibm.com> gcc/ PR target/106907 * config/rs6000/rs6000.cc (vec_const_128bit_to_bytes): Remove duplicate expression.
2023-06-06  aarch64: Improve representation of vpaddd intrinsics  Kyrylo Tkachov  (4 files, -14/+3)
The aarch64_addpdi pattern is redundant as the reduc_plus_scal_<mode> pattern can already generate the required form of the ADDP instruction, and is mostly folded to GIMPLE early on so can benefit from more optimisations. Though it turns out that we were missing the folding for the unsigned variants. This patch adds that and wires up the vpaddd_u64 and vpaddd_s64 intrinsics through the above pattern instead so that we can remove a redundant pattern and get more optimisation earlier. Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf. gcc/ChangeLog: * config/aarch64/aarch64-builtins.cc (aarch64_general_gimple_fold_builtin): Handle unsigned reduc_plus_scal_ builtins. * config/aarch64/aarch64-simd-builtins.def (addp): Delete DImode instances. * config/aarch64/aarch64-simd.md (aarch64_addpdi): Delete. * config/aarch64/arm_neon.h (vpaddd_s64): Reimplement with __builtin_aarch64_reduc_plus_scal_v2di. (vpaddd_u64): Reimplement with __builtin_aarch64_reduc_plus_scal_v2di_uu.
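For reference, a hypothetical use of the affected intrinsics; with this change both go through the reduc_plus_scal_v2di pattern and can be folded early in GIMPLE:

#include <arm_neon.h>

/* Hypothetical examples: pairwise add of the two 64-bit lanes, i.e. a
   two-element reduction.  */
int64_t
sum_s64 (int64x2_t v)
{
  return vpaddd_s64 (v);
}

uint64_t
sum_u64 (uint64x2_t v)
{
  return vpaddd_u64 (v);
}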
2023-06-06  aarch64: Reimplement URSHR,SRSHR patterns with standard RTL codes  Kyrylo Tkachov  (1 file, -7/+37)
Having converted the patterns for the URSRA,SRSRA instructions to standard RTL codes we can also easily convert the non-accumulating forms URSHR,SRSHR. This patch does that, reusing the various helpers and predicates from that patch in a straightforward way. This allows GCC to perform the optimisations in the testcase, matching what Clang does. Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf. gcc/ChangeLog: * config/aarch64/aarch64-simd.md (aarch64_<sur>shr_n<mode>): Delete. (aarch64_<sra_op>rshr_n<mode><vczle><vczbe>_insn): New define_insn. (aarch64_<sra_op>rshr_n<mode>): New define_expand. gcc/testsuite/ChangeLog: * gcc.target/aarch64/simd/vrshr_1.c: New test.
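For context, the rounding shift these patterns now express with standard RTL computes, per lane, (x + (1 << (n - 1))) >> n evaluated without intermediate overflow; a hypothetical intrinsic example:

#include <arm_neon.h>

/* Hypothetical example: signed rounding shift right by 3, i.e. for each
   lane (x + 4) >> 3.  Expressed with standard RTL codes this can now be
   simplified and combined like ordinary arithmetic.  */
int16x8_t
round_shift (int16x8_t x)
{
  return vrshrq_n_s16 (x, 3);
}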
2023-06-06  aarch64: Simplify SHRN, RSHRN expanders and patterns  Kyrylo Tkachov  (1 file, -80/+11)
Now that we've got the <vczle><vczbe> annotations we can get rid of explicit !BYTES_BIG_ENDIAN and BYTES_BIG_ENDIAN patterns for the narrowing shift instructions. This allows us to clean up the expanders as well. Bootstrapped and tested on aarch64-none-linux-gnu and aarch64_be-none-elf. gcc/ChangeLog: * config/aarch64/aarch64-simd.md (aarch64_shrn<mode>_insn_le): Delete. (aarch64_shrn<mode>_insn_be): Delete. (*aarch64_<srn_op>shrn<mode>_vect): Rename to... (*aarch64_<srn_op>shrn<mode><vczle><vczbe>): ... This. (aarch64_shrn<mode>): Remove reference to the above deleted patterns. (aarch64_rshrn<mode>_insn_le): Delete. (aarch64_rshrn<mode>_insn_be): Delete. (aarch64_rshrn<mode><vczle><vczbe>_insn): New define_insn. (aarch64_rshrn<mode>): Remove references to the above deleted patterns. gcc/testsuite/ChangeLog: * gcc.target/aarch64/simd/pr99195_5.c: Add testing for shrn_n, rshrn_n intrinsics.
2023-06-06  aarch64: Improve representation of ADDLV instructions  Kyrylo Tkachov  (5 files, -11/+125)
We've received requests to optimise the attached intrinsics testcase. We currently generate: foo_1: uaddlp v0.4s, v0.8h uaddlv d31, v0.4s fmov x0, d31 ret foo_2: uaddlp v0.4s, v0.8h addv s31, v0.4s fmov w0, s31 ret foo_3: saddlp v0.4s, v0.8h addv s31, v0.4s fmov w0, s31 ret The widening pair-wise addition addlp instructions can be omitted if we're just doing an ADDV afterwards. Making this optimisation would be quite simple if we had a standard RTL PLUS vector reduction code. As we don't, we can use UNSPEC_ADDV as a stand in. This patch expresses the SADDLV and UADDLV instructions as an UNSPEC_ADDV over a widened input, thus removing the need for separate UNSPEC_SADDLV and UNSPEC_UADDLV codes. To optimise the testcases involved we add two splitters that match a vector addition where all participating elements are taken and widened from the same vector and then fed into an UNSPEC_ADDV. In that case we can just remove the vector PLUS and just emit the simple RTL for SADDLV/UADDLV. Bootstrapped and tested on aarch64-none-linux-gnu. gcc/ChangeLog: * config/aarch64/aarch64-protos.h (aarch64_parallel_select_half_p): Define prototype. (aarch64_pars_overlap_p): Likewise. * config/aarch64/aarch64-simd.md (aarch64_<su>addlv<mode>): Express in terms of UNSPEC_ADDV. (*aarch64_<su>addlv<VDQV_L:mode>_ze<GPI:mode>): Likewise. (*aarch64_<su>addlv<mode>_reduction): Define. (*aarch64_uaddlv<mode>_reduction_2): Likewise. * config/aarch64/aarch64.cc (aarch64_parallel_select_half_p): Define. (aarch64_pars_overlap_p): Likewise. * config/aarch64/iterators.md (UNSPEC_SADDLV, UNSPEC_UADDLV): Delete. (VQUADW): New mode attribute. (VWIDE2X_S): Likewise. (USADDLV): Delete. (su): Delete handling of UNSPEC_SADDLV, UNSPEC_UADDLV. * config/aarch64/predicates.md (vect_par_cnst_select_half): Define. gcc/testsuite/ChangeLog: * gcc.target/aarch64/simd/addlv_1.c: New test.
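A plausible shape of the intrinsics code behind foo_2 above (a hypothetical reconstruction, not the committed addlv_1.c test): a widening pairwise add whose result is only consumed by an across-vector add, which the new splitters collapse into a single UADDLV.

#include <arm_neon.h>

/* Hypothetical reconstruction: uaddlp followed by addv should now be
   recognised and emitted as a single uaddlv.  */
uint32_t
foo_2 (uint16x8_t x)
{
  return vaddvq_u32 (vpaddlq_u16 (x));
}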
2023-06-06  RISC-V: Support RVV FP16 ZVFH Reduction floating-point intrinsic API  Pan Li  (2 files, -0/+19)
This patch supports the intrinsic API of FP16 ZVFH floating-point reductions, i.e. SEW=16 for the instructions below: vfredosum vfredusum vfredmax vfredmin vfwredosum vfwredusum Users can then leverage the intrinsic APIs to perform the FP16-related reduction operations. Please note that not all the intrinsic APIs are covered in the test files; only some typical ones are picked because there are too many. We will test the FP16-related intrinsic APIs entirely soon. Signed-off-by: Pan Li <pan2.li@intel.com> gcc/ChangeLog: * config/riscv/riscv-vector-builtins-types.def (vfloat16mf4_t): Add vfloat16mf4_t to WF operations. (vfloat16mf2_t): Likewise. (vfloat16m1_t): Likewise. (vfloat16m2_t): Likewise. (vfloat16m4_t): Likewise. (vfloat16m8_t): Likewise. * config/riscv/vector-iterators.md: Add FP=16 to VWF, VWF_ZVE64, VWLMUL1, VWLMUL1_ZVE64, vwlmul1 and vwlmul1_zve64. gcc/testsuite/ChangeLog: * gcc.target/riscv/rvv/base/zvfh-intrinsic.c: Add new test cases.
2023-06-05  [RISC-V] correct machine mode in save-restore cfi RTL.  Fei Gao  (1 file, -5/+5)
gcc/ChangeLog: * config/riscv/riscv.cc (riscv_adjust_libcall_cfi_prologue): Use Pmode for cfi reg/mem machmode (riscv_adjust_libcall_cfi_epilogue): Use Pmode for cfi reg machmode gcc/testsuite/ChangeLog: * gcc.target/riscv/save-restore-cfi-2.c: New test to check machmode for cfi reg/mem.
2023-06-06  RISC-V: Fix 'REQUIREMENT' for machine_mode 'MODE' in vector-iterators.md.  Li Xu  (2 files, -16/+16)
gcc/ChangeLog: * config/riscv/vector-iterators.md: Fix 'REQUIREMENT' for machine_mode 'MODE'. * config/riscv/vector.md (@pred_indexed_<order>store<VNX16_QHS:mode> <VNX16_QHSI:mode>): change VNX16_QHSI to VNX16_QHSDI. (@pred_indexed_<order>store<VNX16_QHS:mode><VNX16_QHSDI:mode>): Ditto.
2023-06-06  RISC-V: Fix some typos in vector-iterators.md  Pan Li  (1 file, -4/+4)
This patch fixes some typos in vector-iterators.md, i.e.: [-"vnx1DI")-]{+"vnx1di")+} [-"vnx2SI")-]{+"vnx2si")+} [-"vnx1SI")-]{+"vnx1si")+} Signed-off-by: Pan Li <pan2.li@intel.com> gcc/ChangeLog: * config/riscv/vector-iterators.md: Fix typos in mode attrs.
2023-06-05  internal-fn,vect: Refactor widen_plus as internal_fn  Andre Vieira  (1 file, -4/+4)
DEF_INTERNAL_WIDENING_OPTAB_FN and DEF_INTERNAL_NARROWING_OPTAB_FN are like DEF_INTERNAL_SIGNED_OPTAB_FN and DEF_INTERNAL_OPTAB_FN respectively. With the exception that they provide convenience wrappers for a single vector to vector conversion, a hi/lo split or an even/odd split. Each definition for <NAME> will require either signed optabs named <UOPTAB> and <SOPTAB> (for widening) or a single <OPTAB> (for narrowing) for each of the five functions it creates. For example, for widening addition the DEF_INTERNAL_WIDENING_OPTAB_FN will create five internal functions: IFN_VEC_WIDEN_PLUS, IFN_VEC_WIDEN_PLUS_HI, IFN_VEC_WIDEN_PLUS_LO, IFN_VEC_WIDEN_PLUS_EVEN and IFN_VEC_WIDEN_PLUS_ODD. Each requiring two optabs, one for signed and one for unsigned. Aarch64 implements the hi/lo split optabs: IFN_VEC_WIDEN_PLUS_HI -> vec_widen_<su>add_hi_<mode> -> (u/s)addl2 IFN_VEC_WIDEN_PLUS_LO -> vec_widen_<su>add_lo_<mode> -> (u/s)addl This gives the same functionality as the previous WIDEN_PLUS/WIDEN_MINUS tree codes which are expanded into VEC_WIDEN_PLUS_LO, VEC_WIDEN_PLUS_HI. 2023-06-05 Andre Vieira <andre.simoesdiasvieira@arm.com> Joel Hutton <joel.hutton@arm.com> Tamar Christina <tamar.christina@arm.com> gcc/ChangeLog: * config/aarch64/aarch64-simd.md (vec_widen_<su>addl_lo_<mode>): Rename this ... (vec_widen_<su>add_lo_<mode>): ... to this. (vec_widen_<su>addl_hi_<mode>): Rename this ... (vec_widen_<su>add_hi_<mode>): ... to this. (vec_widen_<su>subl_lo_<mode>): Rename this ... (vec_widen_<su>sub_lo_<mode>): ... to this. (vec_widen_<su>subl_hi_<mode>): Rename this ... (vec_widen_<su>sub_hi_<mode>): ...to this. * doc/generic.texi: Document new IFN codes. * internal-fn.cc (lookup_hilo_internal_fn): Add lookup function. (commutative_binary_fn_p): Add widen_plus fn's. (widening_fn_p): New function. (narrowing_fn_p): New function. (direct_internal_fn_optab): Change visibility. * internal-fn.def (DEF_INTERNAL_WIDENING_OPTAB_FN): Macro to define an internal_fn that expands into multiple internal_fns for widening. (IFN_VEC_WIDEN_PLUS, IFN_VEC_WIDEN_PLUS_HI, IFN_VEC_WIDEN_PLUS_LO, IFN_VEC_WIDEN_PLUS_EVEN, IFN_VEC_WIDEN_PLUS_ODD, IFN_VEC_WIDEN_MINUS, IFN_VEC_WIDEN_MINUS_HI, IFN_VEC_WIDEN_MINUS_LO, IFN_VEC_WIDEN_MINUS_ODD, IFN_VEC_WIDEN_MINUS_EVEN): Define widening plus,minus functions. * internal-fn.h (direct_internal_fn_optab): Declare new prototype. (lookup_hilo_internal_fn): Likewise. (widening_fn_p): Likewise. (Narrowing_fn_p): Likewise. * optabs.cc (commutative_optab_p): Add widening plus optabs. * optabs.def (OPTAB_D): Define widen add, sub optabs. * tree-vect-patterns.cc (vect_recog_widen_op_pattern): Support patterns with a hi/lo or even/odd split. (vect_recog_sad_pattern): Refactor to use new IFN codes. (vect_recog_widen_plus_pattern): Likewise. (vect_recog_widen_minus_pattern): Likewise. (vect_recog_average_pattern): Likewise. * tree-vect-stmts.cc (vectorizable_conversion): Add support for _HILO IFNs. (supportable_widening_operation): Likewise. * tree.def (WIDEN_SUM_EXPR): Update example to use new IFNs. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vect-widen-add.c: Test that new IFN_VEC_WIDEN_PLUS is being used. * gcc.target/aarch64/vect-widen-sub.c: Test that new IFN_VEC_WIDEN_MINUS is being used.
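A minimal example of the kind of loop IFN_VEC_WIDEN_PLUS covers (a hypothetical sketch; the committed tests are gcc.target/aarch64/vect-widen-add.c and vect-widen-sub.c):

#include <stdint.h>

/* Hypothetical example: each 8-bit input is widened before the add, so the
   vectorizer can use the hi/lo widening-add patterns
   (vec_widen_<su>add_{lo,hi}_<mode>, i.e. UADDL/UADDL2 on AArch64).  */
void
widen_add (uint16_t *restrict dst, const uint8_t *restrict a,
           const uint8_t *restrict b, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (uint16_t) a[i] + (uint16_t) b[i];
}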
2023-06-05RISC-V: Support RVV FP16 ZVFH floating-point intrinsic APIPan Li2-0/+53
This patch supports the intrinsic API of FP16 ZVFH floating-point, i.e.
SEW=16 for the instructions below:

vfadd vfsub vfrsub vfwadd vfwsub vfmul vfdiv vfrdiv vfwmul
vfmacc vfnmacc vfmsac vfnmsac vfmadd vfnmadd vfmsub vfnmsub
vfwmacc vfwnmacc vfwmsac vfwnmsac vfsqrt vfrsqrt7 vfrec7
vfmin vfmax vfsgnj vfsgnjn vfsgnjx vmfeq vmfne vmflt vmfle
vmfgt vmfge vfclass vfmerge vfmv vfcvt vfwcvt vfncvt

Users can then leverage the intrinsic APIs to perform the FP16-related
operations.  Please note that not all the intrinsic APIs are covered in the
test files; only some typical ones are picked because there are too many.
We will test the FP16-related intrinsic APIs entirely soon.

Signed-off-by: Pan Li <pan2.li@intel.com>

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins-types.def (vfloat32mf2_t): New
	type for DEF_RVV_WEXTF_OPS.
	(vfloat32m1_t): Ditto.
	(vfloat32m2_t): Ditto.
	(vfloat32m4_t): Ditto.
	(vfloat32m8_t): Ditto.
	(vint16mf4_t): New type for DEF_RVV_CONVERT_I_OPS.
	(vint16mf2_t): Ditto.
	(vint16m1_t): Ditto.
	(vint16m2_t): Ditto.
	(vint16m4_t): Ditto.
	(vint16m8_t): Ditto.
	(vuint16mf4_t): New type for DEF_RVV_CONVERT_U_OPS.
	(vuint16mf2_t): Ditto.
	(vuint16m1_t): Ditto.
	(vuint16m2_t): Ditto.
	(vuint16m4_t): Ditto.
	(vuint16m8_t): Ditto.
	(vint32mf2_t): New type for DEF_RVV_WCONVERT_I_OPS.
	(vint32m1_t): Ditto.
	(vint32m2_t): Ditto.
	(vint32m4_t): Ditto.
	(vint32m8_t): Ditto.
	(vuint32mf2_t): New type for DEF_RVV_WCONVERT_U_OPS.
	(vuint32m1_t): Ditto.
	(vuint32m2_t): Ditto.
	(vuint32m4_t): Ditto.
	(vuint32m8_t): Ditto.
	* config/riscv/vector-iterators.md: Add FP16 support for V,
	VWCONVERTI, VCONVERT, VNCONVERT, VMUL1 and vlmul1.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/zvfh-intrinsic.c: New test.

Signed-off-by: Pan Li <pan2.li@intel.com>
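For reference, a minimal usage sketch of one of the newly enabled FP16
intrinsics (the vfadd variant), assuming a toolchain with the zvfh extension;
the function name and loop structure are illustrative and not taken from the
patch's test files:

#include <riscv_vector.h>

/* Element-wise FP16 add via the ZVFH intrinsic API.  */
void
fp16_add (_Float16 *restrict out, const _Float16 *restrict a,
          const _Float16 *restrict b, size_t n)
{
  for (size_t vl; n > 0; n -= vl, a += vl, b += vl, out += vl)
    {
      vl = __riscv_vsetvl_e16m1 (n);
      vfloat16m1_t va = __riscv_vle16_v_f16m1 (a, vl);
      vfloat16m1_t vb = __riscv_vle16_v_f16m1 (b, vl);
      __riscv_vse16_v_f16m1 (out, __riscv_vfadd_vv_f16m1 (va, vb, vl), vl);
    }
}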
2023-06-05MIPS: Add speculation_barrier supportYunQiang Su3-0/+26
speculation_barrier for MIPS needs sync+jr.hb (r2+), so we implement
__speculation_barrier in libgcc, like arm32 does.

gcc/ChangeLog:

	* config/mips/mips-protos.h (mips_emit_speculation_barrier): New
	prototype.
	* config/mips/mips.cc (speculation_barrier_libfunc): New static
	variable.
	(mips_init_libfuncs): Initialize it.
	(mips_emit_speculation_barrier): New function.
	* config/mips/mips.md (speculation_barrier): Call
	mips_emit_speculation_barrier.

libgcc/ChangeLog:

	* config/mips/lib1funcs.S: New file.  Define __speculation_barrier
	and include mips16.S.
	* config/mips/t-mips: Define LIB1ASMSRC as mips/lib1funcs.S.  Define
	LIB1ASMFUNCS as _speculation_barrier.  Set version info for
	__speculation_barrier.
	* config/mips/libgcc-mips.ver: New file.
	* config/mips/t-mips16: Don't define LIB1ASMSRC as mips16.S; it is
	included in lib1funcs.S now.
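For context, the target-independent way to reach a speculation barrier from C
is GCC's pre-existing __builtin_speculation_safe_value builtin, whose
expansion typically relies on the target's speculation_barrier pattern; a
minimal, illustrative sketch (not part of the patch) looks like:

/* Constrain speculative execution before using idx as an array index;
   on MIPS this now lowers to the sync + jr.hb sequence in libgcc.  */
int
load_guarded (const int *array, unsigned int idx, unsigned int bound)
{
  if (idx < bound)
    return array[__builtin_speculation_safe_value (idx, 0)];
  return 0;
}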
2023-06-05RISC-V: Reorganize riscv-v.ccJuzhe-Zhong1-248/+249
This patch just reorganizes the functions for the following patch.  I moved
the rvv_builder class and the emit_* functions before expand_const_vector,
since the following patch will use them in expand_const_vector.

gcc/ChangeLog:

	* config/riscv/riscv-v.cc (class rvv_builder): Reorganize functions.
	(rvv_builder::can_duplicate_repeating_sequence_p): Ditto.
	(rvv_builder::repeating_sequence_use_merge_profitable_p): Ditto.
	(rvv_builder::get_merged_repeating_sequence): Ditto.
	(rvv_builder::get_merge_scalar_mask): Ditto.
	(emit_scalar_move_insn): Ditto.
	(emit_vlmax_integer_move_insn): Ditto.
	(emit_nonvlmax_integer_move_insn): Ditto.
	(emit_vlmax_gather_insn): Ditto.
	(emit_vlmax_masked_gather_mu_insn): Ditto.
	(get_repeating_sequence_dup_machine_mode): Ditto.
2023-06-05RISC-V: Split arguments of expand_vec_permJuzhe-Zhong3-7/+4
Since the following patch will call expand_vec_perm with split arguments,
change the expand_vec_perm interface in this patch.

gcc/ChangeLog:

	* config/riscv/autovec.md: Split arguments.
	* config/riscv/riscv-protos.h (expand_vec_perm): Ditto.
	* config/riscv/riscv-v.cc (expand_vec_perm): Ditto.
2023-06-04Convert H8 port to LRAJeff Law3-34/+1
With Vlad's recent LRA fix to the elimination code, the H8 can be converted
to LRA.

This patch has two changes of note.  First, this turns Zz into a standard
constraint.  This helps reloading for the H8/SX movqi pattern.  Second, this
drops the whole pattern for the SX bit memory operations.  I can't see why
those exist to begin with.  They should be handled by the standard bit
manipulation patterns.  If someone wants to try and improve SX bit support,
that'd be great and they can do so within the LRA framework :-)

Pushed to the trunk...

gcc/
	* config/h8300/constraints.md (Zz): Make this a normal constraint.
	* config/h8300/h8300.cc (TARGET_LRA_P): Remove.
	* config/h8300/logical.md (H8/SX bit patterns): Remove.
2023-06-04xtensa: Optimize boolean evaluation or branching when EQ/NE to INT_MINTakayuki 'January June' Suwa1-0/+65
This patch optimizes both the boolean evaluation of and the branching on
EQ/NE against INT_MIN (-2147483648), by taking advantage of the
specification that the ABS machine instruction on Xtensa returns INT_MIN
iff the input is INT_MIN, and a non-negative value otherwise.

/* example */
int test0(int x) {
  return (x == -2147483648);
}
int test1(int x) {
  return (x != -2147483648);
}
extern void foo(void);
void test2(int x) {
  if(x == -2147483648)
    foo();
}
void test3(int x) {
  if(x != -2147483648)
    foo();
}

;; before
test0:
	movi.n	a9, -1
	slli	a9, a9, 31
	add.n	a2, a2, a9
	nsau	a2, a2
	srli	a2, a2, 5
	ret.n
test1:
	movi.n	a9, -1
	slli	a9, a9, 31
	add.n	a9, a2, a9
	movi.n	a2, 1
	moveqz	a2, a9, a9
	ret.n
test2:
	movi.n	a9, -1
	slli	a9, a9, 31
	bne	a2, a9, .L3
	j.l	foo, a9
.L3:
	ret.n
test3:
	movi.n	a9, -1
	slli	a9, a9, 31
	beq	a2, a9, .L5
	j.l	foo, a9
.L5:
	ret.n

;; after
test0:
	abs	a2, a2
	extui	a2, a2, 31, 1
	ret.n
test1:
	abs	a2, a2
	srai	a2, a2, 31
	addi.n	a2, a2, 1
	ret.n
test2:
	abs	a2, a2
	bbci	a2, 31, .L3
	j.l	foo, a9
.L3:
	ret.n
test3:
	abs	a2, a2
	bbsi	a2, 31, .L5
	j.l	foo, a9
.L5:
	ret.n

gcc/ChangeLog:

	* config/xtensa/xtensa.md (*btrue_INT_MIN, *eqne_INT_MIN):
	New insn_and_split patterns.
2023-06-04RISC-V: Remove redundant vlmul_ext_* patterns to fix PR110109Juzhe-Zhong2-144/+4
This patch fixes PR110109.  The issue happens because of this pattern:

(define_insn_and_split "*vlmul_extx2<mode>"
  [(set (match_operand:<VLMULX2> 0 "register_operand" "=vr, ?&vr")
        (subreg:<VLMULX2>
          (match_operand:VLMULEXT2 1 "register_operand" " 0, vr") 0))]
  "TARGET_VECTOR"
  "#"
  "&& reload_completed"
  [(const_int 0)]
{
  emit_insn (gen_rtx_SET (gen_lowpart (<MODE>mode, operands[0]),
                          operands[1]));
  DONE;
})

Such a pattern generates the following code in insn-recog.cc:

static int
pattern57 (rtx x1)
{
  rtx * const operands ATTRIBUTE_UNUSED = &recog_data.operand[0];
  rtx x2;
  int res ATTRIBUTE_UNUSED;
  if (maybe_ne (SUBREG_BYTE (x1).to_constant (), 0))
    return -1;
  ...

PR110109 ICEs at maybe_ne (SUBREG_BYTE (x1).to_constant (), 0), since the
SUBREG_BYTE of scalable RVV modes can not be accessed with
SUBREG_BYTE (x1).to_constant ().

I created those patterns to optimize the following test:

vfloat32m2_t test_vlmul_ext_v_f32mf2_f32m2(vfloat32mf2_t op1)
{
  return __riscv_vlmul_ext_v_f32mf2_f32m2(op1);
}

codegen:

test_vlmul_ext_v_f32mf2_f32m2:
	vsetvli	a5,zero,e32,m2,ta,ma
	vmv.v.i	v2,0
	vsetvli	a5,zero,e32,mf2,ta,ma
	vle32.v	v2,0(a1)
	vs2r.v	v2,0(a0)
	ret

There is a redundant 'vmv.v.i' here, since GCC doesn't have undefined IR
values (unlike LLVM, which has undef/poison).  For the vlmul_ext_* RVV
intrinsics, GCC will initialize the register with all zeros.  However, I
think it's not a big issue once we support subreg liveness tracking.

	PR target/110109

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins-bases.cc: Change expand approach.
	* config/riscv/vector.md (@vlmul_extx2<mode>): Remove it.
	(@vlmul_extx4<mode>): Ditto.
	(@vlmul_extx8<mode>): Ditto.
	(@vlmul_extx16<mode>): Ditto.
	(@vlmul_extx32<mode>): Ditto.
	(@vlmul_extx64<mode>): Ditto.
	(*vlmul_extx2<mode>): Ditto.
	(*vlmul_extx4<mode>): Ditto.
	(*vlmul_extx8<mode>): Ditto.
	(*vlmul_extx16<mode>): Ditto.
	(*vlmul_extx32<mode>): Ditto.
	(*vlmul_extx64<mode>): Ditto.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/pr110109-1.c: New test.
	* gcc.target/riscv/rvv/base/pr110109-2.c: New test.