|
This patch removes the remaining traces of the vcond{,u,eq} optabs.
Earlier patches removed the target-independent uses and I couldn't
find any direct references to either the *_optabs or the ifns
in target-specific code.
gcc/
* doc/md.texi (vcond@var{m}@var{n}, vcondu@var{m}@var{n})
(vcondeq@var{m}@var{n}): Delete.
(vcond_mask_@var{m}@var{n}): Redocument in standalone form.
* internal-fn.def (VCOND, VCONDU, VCONDEQ): Delete.
* internal-fn.cc (expand_vec_cond_optab_fn): Delete.
* optabs.def (vcond_optab, vcondu_optab, vcondeq_optab): Delete.
|
|
Add two new internal functions (IFN_CRC, IFN_CRC_REV), to provide faster
CRC generation.
One performs bit-forward and the other bit-reversed CRC computation.
If CRC optabs are supported, they are used for the CRC computation.
Otherwise, table-based CRC is generated.
The supported data and CRC sizes are 8, 16, 32, and 64 bits.
The polynomial is represented without the leading 1.
A table with 256 elements is used to store precomputed CRCs.
For the reflection of inputs and the output, a simple algorithm involving
SHIFT, AND, and OR operations is used.
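For reference, a minimal C sketch of the table-based scheme for one
size combination (the helper names are illustrative, not the ones
emitted by expr.cc):
#include <stdint.h>

/* Bit-forward table-based CRC: 8-bit data, 16-bit CRC, polynomial
   POLY given without the leading 1.  */
static uint16_t crc_table[256];

static void
init_crc_table (uint16_t poly)
{
  for (int i = 0; i < 256; i++)
    {
      uint16_t crc = (uint16_t) (i << 8);
      for (int j = 0; j < 8; j++)
        crc = (crc & 0x8000) ? (uint16_t) ((crc << 1) ^ poly)
                             : (uint16_t) (crc << 1);
      crc_table[i] = crc;
    }
}

static uint16_t
crc_update (uint16_t crc, uint8_t data)
{
  return (uint16_t) ((crc << 8) ^ crc_table[(crc >> 8) ^ data]);
}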
gcc/
* doc/md.texi (crc@var{m}@var{n}4, crc_rev@var{m}@var{n}4): Document.
* expr.cc (calculate_crc): New function.
(assemble_crc_table): Likewise.
(generate_crc_table): Likewise.
(calculate_table_based_CRC): Likewise.
(expand_crc_table_based): Likewise.
(gen_common_operation_to_reflect): Likewise.
(reflect_64_bit_value): Likewise.
(reflect_32_bit_value): Likewise.
(reflect_16_bit_value): Likewise.
(reflect_8_bit_value): Likewise.
(generate_reflecting_code_standard): Likewise.
(expand_reversed_crc_table_based): Likewise.
* expr.h (generate_reflecting_code_standard): New function declaration.
(expand_crc_table_based): Likewise.
(expand_reversed_crc_table_based): Likewise.
* internal-fn.cc (crc_direct): Define.
(direct_crc_optab_supported_p): Likewise.
(expand_crc_optab_fn): New function.
* internal-fn.def (CRC, CRC_REV): New internal functions.
* optabs.def (crc_optab, crc_rev_optab): New optabs.
Signed-off-by: Mariam Arutunian <mariamarutunian@gmail.com>
Co-authored-by: Joern Rennecke <joern.rennecke@embecosm.com>
Co-authored-by: Jeff Law <jlaw@ventanamicro.com>
|
|
This patch adds middle-end support for the `dispatch` construct and the
`adjust_args` clause. The heavy lifting is done in `gimplify_omp_dispatch` and
`gimplify_call_expr` respectively. For `adjust_args`, this mostly consists of
emitting a call to `omp_get_mapped_ptr` for the relevant device.
For dispatch, the following steps are performed:
* Handle the device clause, if any: set the default-device ICV at the top of the
dispatch region and restore its previous value at the end.
* Handle novariants and nocontext clauses, if any. Evaluate compile-time
constants and select a variant, if possible. Otherwise, emit code to handle all
possible cases at run time.
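For illustration, a minimal source-level sketch of what is being
gimplified here (variant_fn/base_fn are made-up names):
/* The adjust_args clause requests translation of p to a device
   pointer when base_fn is called inside a dispatch region.  */
void variant_fn (int *p);
#pragma omp declare variant (variant_fn) match (construct={dispatch}) \
    adjust_args (need_device_ptr : p)
void base_fn (int *p);

void
caller (int *p, int dev)
{
  /* gimplify_omp_dispatch sets the default-device ICV to dev and
     restores it at the end; gimplify_call_expr wraps p in a call
     to omp_get_mapped_ptr because of adjust_args.  */
  #pragma omp dispatch device(dev)
  base_fn (p);
}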
gcc/ChangeLog:
* builtins.cc (builtin_fnspec): Handle BUILT_IN_OMP_GET_MAPPED_PTR.
* gimple-low.cc (lower_stmt): Handle GIMPLE_OMP_DISPATCH.
* gimple-pretty-print.cc (dump_gimple_omp_dispatch): New function.
(pp_gimple_stmt_1): Handle GIMPLE_OMP_DISPATCH.
* gimple-walk.cc (walk_gimple_stmt): Likewise.
* gimple.cc (gimple_build_omp_dispatch): New function.
(gimple_copy): Handle GIMPLE_OMP_DISPATCH.
* gimple.def (GIMPLE_OMP_DISPATCH): Define.
* gimple.h (gimple_build_omp_dispatch): Declare.
(gimple_has_substatements): Handle GIMPLE_OMP_DISPATCH.
(gimple_omp_dispatch_clauses): New function.
(gimple_omp_dispatch_clauses_ptr): Likewise.
(gimple_omp_dispatch_set_clauses): Likewise.
(gimple_return_set_retval): Handle GIMPLE_OMP_DISPATCH.
* gimplify.cc (enum omp_region_type): Add ORT_DISPATCH.
(struct gimplify_omp_ctx): Add in_call_args.
(gimplify_call_expr): Handle need_device_ptr arguments.
(is_gimple_stmt): Handle OMP_DISPATCH.
(gimplify_scan_omp_clauses): Handle OMP_CLAUSE_DEVICE in a dispatch
construct. Handle OMP_CLAUSE_NOVARIANTS and OMP_CLAUSE_NOCONTEXT.
(omp_has_novariants): New function.
(omp_has_nocontext): Likewise.
(omp_construct_selector_matches): Handle OMP_DISPATCH with nocontext
clause.
(find_ifn_gomp_dispatch): New function.
(gimplify_omp_dispatch): Likewise.
(gimplify_expr): Handle OMP_DISPATCH.
* gimplify.h (omp_has_novariants): Declare.
* internal-fn.cc (expand_GOMP_DISPATCH): New function.
* internal-fn.def (GOMP_DISPATCH): Define.
* omp-builtins.def (BUILT_IN_OMP_GET_MAPPED_PTR): Define.
(BUILT_IN_OMP_GET_DEFAULT_DEVICE): Define.
(BUILT_IN_OMP_SET_DEFAULT_DEVICE): Define.
* omp-general.cc (omp_construct_traits_to_codes): Add OMP_DISPATCH.
(struct omp_ts_info): Add dispatch.
(omp_resolve_declare_variant): Handle novariants. Adjust
DECL_ASSEMBLER_NAME.
* omp-low.cc (scan_omp_1_stmt): Handle GIMPLE_OMP_DISPATCH.
(lower_omp_dispatch): New function.
(lower_omp_1): Call it.
* tree-inline.cc (remap_gimple_stmt): Handle GIMPLE_OMP_DISPATCH.
(estimate_num_insns): Handle GIMPLE_OMP_DISPATCH.
|
|
This patch uses the FSCALE instruction provided by SVE to implement the
standard ldexp family of functions.
Currently, with '-Ofast -mcpu=neoverse-v2', GCC generates libcalls for the
following code:
float
test_ldexpf (float x, int i)
{
return __builtin_ldexpf (x, i);
}
double
test_ldexp (double x, int i)
{
return __builtin_ldexp(x, i);
}
GCC Output:
test_ldexpf:
b ldexpf
test_ldexp:
b ldexp
Since SVE has support for an FSCALE instruction, we can use it to process
scalar floats by moving them to a vector register and performing an FSCALE,
similar to how LLVM handles the ldexp builtin.
New Output:
test_ldexpf:
fmov s31, w0
ptrue p7.b, vl4
fscale z0.s, p7/m, z0.s, z31.s
ret
test_ldexp:
sxtw x0, w0
ptrue p7.b, vl8
fmov d31, x0
fscale z0.d, p7/m, z0.d, z31.d
ret
This is a revision of an earlier patch, and now uses the extended definition of
aarch64_ptrue_reg to generate predicate registers with the appropriate set bits.
The patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.
OK for mainline?
Signed-off-by: Soumya AR <soumyaa@nvidia.com>
gcc/ChangeLog:
PR target/111733
* config/aarch64/aarch64-sve.md
(ldexp<mode>3): Add new pattern to match ldexp calls with scalar
floating modes and expand to the existing pattern for FSCALE.
* config/aarch64/iterators.md
(SVE_FULL_F_SCALAR): Add iterator to match all FP SVE modes as well
as their scalar equivalents.
(VPRED): Extend the attribute to handle GPF_HF modes.
* internal-fn.def (LDEXP): Change macro to incorporate ldexpf16.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/sve/fscale.c: New test.
|
|
Delay omp_max_vf call until after the host and device compilers have diverged
so that the max_vf value can be tuned exactly right on both variants.
This change means that the ompdevlow pass must be enabled for functions that
use OpenMP directives with both "simd" and "schedule" enabled.
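For illustration, the kind of directive affected (the chunk size must
be rounded up to a multiple of max_vf, which may now differ between
host and device):
#pragma omp parallel for simd schedule (simd: static, 7)
for (int i = 0; i < n; i++)
  a[i] += b[i];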
gcc/ChangeLog:
* internal-fn.cc (expand_GOMP_MAX_VF): New function.
* internal-fn.def (GOMP_MAX_VF): New internal function.
* omp-expand.cc (omp_adjust_chunk_size): Emit IFN_GOMP_MAX_VF when
called in offload context, otherwise assume host context.
* omp-offload.cc (execute_omp_device_lower): Expand IFN_GOMP_MAX_VF.
|
|
This patch would like to introduce new IFNs for strided load and store.
LOAD: v = MASK_LEN_STRIDED_LOAD (ptr, stride, mask, len, bias)
STORE: MASK_LEN_STRIDED_STORE (ptr, stride, v, mask, len, bias)
The IFNs target code similar to the example below:
void foo (int * a, int * b, int stride, int n)
{
for (int i = 0; i < n; i++)
a[i * stride] = b[i * stride];
}
The following test suites passed for this patch:
* The rv64gcv full regression tests.
* The x86 bootstrap test.
* The x86 full regression tests.
gcc/ChangeLog:
* internal-fn.cc (strided_load_direct): Add new direct define
for strided load.
(strided_store_direct): Ditto but for store.
(expand_strided_load_optab_fn): Add new function to expand the
IFN MASK_LEN_STRIDED_LOAD in the middle-end.
(expand_strided_store_optab_fn): Ditto but for store.
(direct_strided_load_optab_supported_p): Add define for strided
load optab support.
(direct_strided_store_optab_supported_p): Ditto but for store.
(internal_fn_len_index): Add strided load/store len index.
(internal_fn_mask_index): Ditto but for mask.
(internal_fn_stored_value_index): Add strided store value index.
* internal-fn.def (MASK_LEN_STRIDED_LOAD): Add new IFN for
strided load.
(MASK_LEN_STRIDED_STORE): Ditto but for store.
* optabs.def (mask_len_strided_load_optab): Add strided load optab.
(mask_len_strided_store_optab): Add strided store optab.
Signed-off-by: Pan Li <pan2.li@intel.com>
Co-Authored-By: Juzhe-Zhong <juzhe.zhong@rivai.ai>
|
|
We rely on .CO_YIELD calls being optionally followed by an assignment
and then a switch/if in the same basic block. This implies that a
.CO_YIELD can never end a block. However, since a call to .CO_YIELD is
still a call, if the function containing it calls setjmp, GCC thinks
that the .CO_YIELD can introduce abnormal control flow, and generates an
edge for the call.
We know this is not the case; .CO_YIELD calls are removed quite early on,
have no effect, and result in no other calls, so .CO_YIELD can be
considered a leaf function, which prevents generating an abnormal edge
for calls to it.
PR c++/106973 - coroutine generator and setjmp
PR c++/106973
gcc/ChangeLog:
* internal-fn.def (CO_YIELD): Mark as ECF_LEAF.
gcc/testsuite/ChangeLog:
* g++.dg/coroutines/pr106973.C: New test.
|
|
When I was trying to add a scalar version of iorc and andc, the optab that
got matched was the one for and/ior with the complex integer modes csi and
cdi, instead of the iorc and andc optabs with si and di modes. Since
csi/cdi are the complex integer modes, we need to rename the optabs so
they don't end in c. This changes the c to n, which is neutral and known
not to be the first letter of any mode.
Bootstrapped and tested on x86_64 and powerpc64le.
gcc/ChangeLog:
* config/rs6000/rs6000-builtins.def: s/iorc/iorn/. s/andc/andn/
for the code.
* config/rs6000/rs6000-string.cc (expand_cmp_vec_sequence): Update
to iorn.
* config/rs6000/rs6000.md (andc<mode>3): Rename to ...
(andn<mode>3): This.
(iorc<mode>3): Rename to ...
(iorn<mode>3): This.
* doc/md.texi: Update documentation for the rename.
* internal-fn.def (BIT_ANDC): Rename to ...
(BIT_ANDN): This.
(BIT_IORC): Rename to ...
(BIT_IORN): This.
* optabs.def (andc_optab): Rename to ...
(andn_optab): This.
(iorc_optab): Rename to ...
(iorn_optab): This.
* gimple-isel.cc (gimple_expand_vec_cond_expr): Update for the
renamed internal functions, ANDC/IORC to ANDN/IORN.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
|
|
As PR115659 shows, assuming c = x CMP y, there are some
folding opportunities for patterns r = c ? 0/z : z/-1:
- for r = c ? 0 : z, it can be folded into r = ~c & z.
- for r = c ? z : -1, it can be folded into r = ~c | z.
But BIT_AND/BIT_IOR applied to one BIT_NOT operand is a
compound operation, so it's arguable whether it beats
vector selection. This patch therefore introduces new
optabs andc, iorc and their corresponding internal functions
BIT_{ANDC,IORC}; if a target defines such optabs for
vector modes, it means the target supports these hardware
insns and they should be no worse than vector selection.
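A scalar sketch of the two equivalences, with c an all-ones/all-zeros
condition mask (c in {0, -1}):
int fold1 (int c, int z) { return ~c & z; }  /* same as c ? 0 : z */
int fold2 (int c, int z) { return ~c | z; }  /* same as c ? z : -1 */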
PR tree-optimization/115659
gcc/ChangeLog:
* doc/md.texi: Document andcm3 and iorcm3.
* gimple-isel.cc (gimple_expand_vec_cond_expr): Add more foldings for
patterns x CMP y ? 0 : z and x CMP y ? z : -1.
* internal-fn.def (BIT_ANDC): New internal function.
(BIT_IORC): Likewise.
* optabs.def (andc, iorc): New optab.
|
|
This patch would like to add the middle-end representation for the
saturation truncation, i.e. set the truncated result to the maximum
value of the narrower type on overflow. It matches patterns similar
to the one below.
Form 1:
#define DEF_SAT_U_TRUC_FMT_1(WT, NT) \
NT __attribute__((noinline)) \
sat_u_truc_##T##_fmt_1 (WT x) \
{ \
bool overflow = x > (WT)(NT)(-1); \
return ((NT)x) | (NT)-overflow; \
}
For example, truncating uint16_t to uint8_t, we have
* SAT_TRUNC (254) => 254
* SAT_TRUNC (255) => 255
* SAT_TRUNC (256) => 255
* SAT_TRUNC (65536) => 255
Given the SAT_TRUNC below, from uint64_t to uint32_t:
DEF_SAT_U_TRUC_FMT_1 (uint64_t, uint32_t)
Before this patch:
__attribute__((noinline))
uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
{
_Bool overflow;
unsigned int _1;
unsigned int _2;
unsigned int _3;
uint32_t _6;
;; basic block 2, loop depth 0
;; pred: ENTRY
overflow_5 = x_4(D) > 4294967295;
_1 = (unsigned int) x_4(D);
_2 = (unsigned int) overflow_5;
_3 = -_2;
_6 = _1 | _3;
return _6;
;; succ: EXIT
}
After this patch:
__attribute__((noinline))
uint32_t sat_u_truc_T_fmt_1 (uint64_t x)
{
uint32_t _6;
;; basic block 2, loop depth 0
;; pred: ENTRY
_6 = .SAT_TRUNC (x_4(D)); [tail call]
return _6;
;; succ: EXIT
}
The following tests passed for this patch:
*. The rv64gcv full regression tests.
*. The rv64gcv build with glibc.
*. The x86 bootstrap tests.
*. The x86 full regression tests.
gcc/ChangeLog:
* internal-fn.def (SAT_TRUNC): Add new signed IFN sat_trunc as
unary_convert.
* match.pd: Add new matching pattern for unsigned int sat_trunc.
* optabs.def (OPTAB_CL): Add unsigned and signed optab.
* tree-ssa-math-opts.cc (gimple_unsigned_integer_sat_trunc): Add
new decl for the matching pattern generated func.
(match_unsigned_saturation_trunc): Add new func impl to match
the .SAT_TRUNC.
(math_opts_dom_walker::after_dom_children): Add .SAT_TRUNC match
function under BIT_IOR_EXPR case.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
This patch would like to add the middle-end representation for the
saturation sub, i.e. set the result of the subtraction to the minimum
value on underflow. It matches patterns similar to the one below.
SAT_SUB (x, y) => (x - y) & (-(TYPE)(x >= y));
For example for uint8_t, we have
* SAT_SUB (255, 0) => 255
* SAT_SUB (1, 2) => 0
* SAT_SUB (254, 255) => 0
* SAT_SUB (0, 255) => 0
Given the SAT_SUB below for uint64_t (TYPE stands for the unsigned operand type, here uint64_t):
uint64_t sat_sub_u64 (uint64_t x, uint64_t y)
{
return (x - y) & (-(TYPE)(x >= y));
}
Before this patch:
uint64_t sat_sub_u_0_uint64_t (uint64_t x, uint64_t y)
{
_Bool _1;
long unsigned int _3;
uint64_t _6;
;; basic block 2, loop depth 0
;; pred: ENTRY
_1 = x_4(D) >= y_5(D);
_3 = x_4(D) - y_5(D);
_6 = _1 ? _3 : 0;
return _6;
;; succ: EXIT
}
After this patch:
uint64_t sat_sub_u_0_uint64_t (uint64_t x, uint64_t y)
{
uint64_t _6;
;; basic block 2, loop depth 0
;; pred: ENTRY
_6 = .SAT_SUB (x_4(D), y_5(D)); [tail call]
return _6;
;; succ: EXIT
}
The following tests were run for this patch:
*. The riscv full regression tests.
*. The x86 bootstrap tests.
*. The x86 full regression tests.
PR target/51492
PR target/112600
gcc/ChangeLog:
* internal-fn.def (SAT_SUB): Add new IFN define for SAT_SUB.
* match.pd: Add new match for SAT_SUB.
* optabs.def (OPTAB_NL): Remove fixed-point for ussub/ssub.
* tree-ssa-math-opts.cc (gimple_unsigned_integer_sat_sub): Add
new decl for generated in match.pd.
(build_saturation_binary_arith_call): Add new helper function
to build the gimple call to binary SAT alu.
(match_saturation_arith): Rename to ...
(match_unsigned_saturation_add): ... this.
(match_unsigned_saturation_sub): Add new func to match the
unsigned sat sub.
(math_opts_dom_walker::after_dom_children): Add SAT_SUB matching
try when COND_EXPR.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
Including the following changes:
* The definition of the new internal function .ACCESS_WITH_SIZE
in internal-fn.def.
* C FE converts every reference to a FAM with a "counted_by" attribute
to a call to the internal function .ACCESS_WITH_SIZE.
(build_component_ref in c-typeck.cc)
This includes the case when the object is statically allocated and
initialized.
In order to make this work, the routine digest_init in c-typeck.cc
is updated to fold calls to .ACCESS_WITH_SIZE to their first argument
when require_constant is TRUE.
However, for the reference inside "offsetof", the "counted_by" attribute is
ignored since it's not useful at all.
(c_parser_postfix_expression in c/c-parser.cc)
In addition to "offsetof", for references inside the operators "typeof"
and "alignof", we ignore the counted_by attribute too.
When building ADDR_EXPR for the .ACCESS_WITH_SIZE in C FE,
replace the call with its first argument.
* Convert every call to .ACCESS_WITH_SIZE to its first argument.
(expand_ACCESS_WITH_SIZE in internal-fn.cc)
* Provide the utility routines to check the call is .ACCESS_WITH_SIZE and
get the reference from the call to .ACCESS_WITH_SIZE.
(is_access_with_size_p and get_ref_from_access_with_size in tree.cc)
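For illustration, a minimal use of the attribute this work targets
(the structure is made up):
struct annotated
{
  int count;
  char buf[] __attribute__ ((counted_by (count)));
};

/* Every reference to p->buf outside of offsetof/typeof/alignof is
   converted by the C FE into a call to
   .ACCESS_WITH_SIZE (p->buf, &p->count, ...), from which the object
   size of the FAM can later be derived.  */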
gcc/c/ChangeLog:
* c-parser.cc (c_parser_postfix_expression): Ignore the counted-by
attribute when build_component_ref inside offsetof operator.
* c-tree.h (build_component_ref): Add one more parameter.
* c-typeck.cc (build_counted_by_ref): New function.
(build_access_with_size_for_counted_by): New function.
(build_component_ref): Check the counted-by attribute and build
call to .ACCESS_WITH_SIZE.
(build_unary_op): When building ADDR_EXPR for
.ACCESS_WITH_SIZE, use its first argument.
(lvalue_p): Accept call to .ACCESS_WITH_SIZE.
(digest_init): Fold call to .ACCESS_WITH_SIZE to its first
argument when require_constant is TRUE.
gcc/ChangeLog:
* internal-fn.cc (expand_ACCESS_WITH_SIZE): New function.
* internal-fn.def (ACCESS_WITH_SIZE): New internal function.
* tree.cc (is_access_with_size_p): New function.
(get_ref_from_access_with_size): New function.
* tree.h (is_access_with_size_p): New prototype.
(get_ref_from_access_with_size): New prototype.
gcc/testsuite/ChangeLog:
* gcc.dg/flex-array-counted-by-2.c: New test.
|
|
This patch would like to add the middle-end representation for the
saturation add, i.e. set the result of the addition to the maximum
value on overflow. It matches patterns similar to the one below.
SAT_ADD (x, y) => (x + y) | (-(TYPE)((TYPE)(x + y) < x))
Taking uint8_t as an example, we have:
* SAT_ADD (1, 254) => 255.
* SAT_ADD (1, 255) => 255.
* SAT_ADD (2, 255) => 255.
* SAT_ADD (255, 255) => 255.
Given the example below for the unsigned scalar integer uint64_t:
uint64_t sat_add_u64 (uint64_t x, uint64_t y)
{
return (x + y) | (- (uint64_t)((uint64_t)(x + y) < x));
}
Before this patch:
uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
{
long unsigned int _1;
_Bool _2;
long unsigned int _3;
long unsigned int _4;
uint64_t _7;
long unsigned int _10;
__complex__ long unsigned int _11;
;; basic block 2, loop depth 0
;; pred: ENTRY
_11 = .ADD_OVERFLOW (x_5(D), y_6(D));
_1 = REALPART_EXPR <_11>;
_10 = IMAGPART_EXPR <_11>;
_2 = _10 != 0;
_3 = (long unsigned int) _2;
_4 = -_3;
_7 = _1 | _4;
return _7;
;; succ: EXIT
}
After this patch:
uint64_t sat_add_uint64_t (uint64_t x, uint64_t y)
{
uint64_t _7;
;; basic block 2, loop depth 0
;; pred: ENTRY
_7 = .SAT_ADD (x_5(D), y_6(D)); [tail call]
return _7;
;; succ: EXIT
}
The following tests passed for this patch:
1. The riscv full regression tests.
2. The x86 bootstrap tests.
3. The x86 full regression tests.
PR target/51492
PR target/112600
gcc/ChangeLog:
* internal-fn.cc (commutative_binary_fn_p): Add type IFN_SAT_ADD
to the return true switch case(s).
* internal-fn.def (SAT_ADD): Add new signed optab SAT_ADD.
* match.pd: Add unsigned SAT_ADD match(es).
* optabs.def (OPTAB_NL): Remove fixed-point limitation for
us/ssadd.
* tree-ssa-math-opts.cc (gimple_unsigned_integer_sat_add): New
extern func decl generated in match.pd match.
(match_saturation_arith): New func impl to match the saturation arith.
(math_opts_dom_walker::after_dom_children): Try match saturation
arith when IOR expr.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
While many places in various optimization passes fold
bit query internal functions with INTEGER_CST arguments to INTEGER_CST
when there is a lhs, when the lhs is missing, all the removals of such dead
stmts are guarded with -ftree-dce, so with -fno-tree-dce those unfolded
ifn calls remain in the IL until expansion. If they have large/huge
BITINT_TYPE arguments, there is no BLKmode optab and so expansion ICEs,
and bitint lowering doesn't touch such calls because it doesn't know they
need touching: functions only containing those will not even be further
processed by the pass because there are no non-small BITINT_TYPE SSA_NAMEs
plus the 2 exceptions (stores of BITINT_TYPE INTEGER_CSTs and conversions
from BITINT_TYPE INTEGER_CSTs to floating point SSA_NAMEs), and when walking
there is no special case for calls with BITINT_TYPE INTEGER_CSTs either;
those are for normal calls normally handled at expansion time.
So, the following patch adjusts the expansion of these 6 ifns by doing
nothing if there is no lhs, and, just in case the user disabled all
possible passes that would fold this, also handles the case of setting
the lhs to an ifn call with an INTEGER_CST argument.
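A sketch of the previously ICEing kind of testcase (illustrative;
cf. the new bitint-95.c test), compiled with -fno-tree-dce:
void
f (unsigned _BitInt(512) x)
{
  __builtin_popcountg (x);	/* dead call without lhs */
}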
2024-02-27 Jakub Jelinek <jakub@redhat.com>
PR rtl-optimization/114044
gcc/
* internal-fn.def (CLRSB, CLZ, CTZ, FFS, PARITY): Use
DEF_INTERNAL_INT_EXT_FN macro rather than DEF_INTERNAL_INT_FN.
* internal-fn.h (expand_CLRSB, expand_CLZ, expand_CTZ, expand_FFS,
expand_PARITY): Declare.
* internal-fn.cc (expand_bitquery, expand_CLRSB, expand_CLZ,
expand_CTZ, expand_FFS, expand_PARITY): New functions.
(expand_POPCOUNT): Use expand_bitquery.
gcc/testsuite/
* gcc.dg/bitint-95.c: New test.
|
|
optimization for direct optab [PR90693]
On Fri, Nov 17, 2023 at 03:01:04PM +0100, Jakub Jelinek wrote:
> As a follow-up, I'm considering changing in this routine the popcount
> call to IFN_POPCOUNT with 2 arguments and during expansion test costs.
Here is the follow-up which does the rtx costs testing.
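The affected idiom, for illustration:
int
single_bit_p (unsigned long x)
{
  /* Recognized (sketch) as a two-argument .POPCOUNT; expansion then
     compares the rtx cost of a popcount insn against the
     x != 0 && (x & (x - 1)) == 0 form.  */
  return __builtin_popcountl (x) == 1;
}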
2023-11-20 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/90693
* tree-ssa-math-opts.cc (match_single_bit_test): Mark a POPCOUNT whose
result is only used in an equality comparison against 1 and which has
direct optab support as a .POPCOUNT call with 2 arguments.
* internal-fn.h (expand_POPCOUNT): Declare.
* internal-fn.def (DEF_INTERNAL_INT_EXT_FN): New macro, document it,
undefine at the end.
(POPCOUNT): Use it instead of DEF_INTERNAL_INT_FN.
* internal-fn.cc (DEF_INTERNAL_INT_EXT_FN): Define to nothing before
inclusion to define expanders.
(expand_POPCOUNT): New function.
|
|
I have noticed we are inconsistent: some DEF_INTERNAL*
macros (most of them) were undefined at the end of internal-fn.def (but in
some cases uselessly undefined again after inclusion), while others were not
(and were sometimes undefined after the inclusion). I've changed it to always
undefine them at the end of internal-fn.def.
2023-11-20 Jakub Jelinek <jakub@redhat.com>
* internal-fn.def: Document missing DEF_INTERNAL* macros and make sure
they are all undefined at the end.
* internal-fn.cc (lookup_hilo_internal_fn, lookup_evenodd_internal_fn,
widening_fn_p, get_len_internal_fn): Don't undef DEF_INTERNAL_*FN
macros after inclusion of internal-fn.def.
|
|
The following patch implements
CWG 2406 - [[fallthrough]] attribute and iteration statements
The genericization of some loops leaves nothing at all, or just a label,
after the body of a loop, so if the loop is later followed by a
case or default label in a switch, the fallthrough statement isn't
diagnosed.
The following patch implements it by marking the IFN_FALLTHROUGH call
in such a case, such that during gimplification it can be pedantically
diagnosed even if it is followed by case or default label or some normal
labels followed by case/default labels.
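An illustrative instance of the newly diagnosed case (g is a made-up
function):
bool g ();
void
f (int n)
{
  switch (n)
    {
    case 1:
      while (g ())
        [[fallthrough]];	/* now pedantically diagnosed (CWG 2406) */
    case 2:
      break;
    }
}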
While looking into this, I've discovered other problems.
expand_FALLTHROUGH_r removes the IFN_FALLTHROUGH calls from the IL,
but wasn't telling walk_gimple_stmt/walk_gimple_seq_mod about that, so
the callers would then skip the next statement after it, and it would
return non-NULL if the removed stmt was the last in the sequence. This
could lead to wi->callback_result being set even if the call didn't
appear at the very end of the switch sequence.
The patch makes use of wi->removed_stmt such that the callers properly
know what happened, and uses a different way to handle the
end-of-switch-sequence case.
That change discovered a bug in the gimple-walk handling of
wi->removed_stmt. If that flag is set, the callback is telling the callers
that the current statement has been removed and so the innermost
walk_gimple_seq_mod shouldn't gsi_next. The problem is that
wi->removed_stmt is only reset at the start of a walk_gimple_stmt, but that
can be too late for some cases. If we have two nested gimple sequences,
say GIMPLE_BIND as the last stmt of some gimple seq, and we remove the last
statement inside of that GIMPLE_BIND, we set wi->removed_stmt there and
correctly don't do gsi_next because gsi_remove already moved us to the next
stmt; there is no next stmt, so we return back to the caller, but wi->removed_stmt
is still set and so we don't do gsi_next even in the outer sequence, despite
the GIMPLE_BIND (etc.) not being removed. That means we walk the
GIMPLE_BIND with its whole sequence again.
The patch fixes that by resetting wi->removed_stmt after we've used that
flag in walk_gimple_seq_mod. Nothing really uses that flag after the
outermost walk_gimple_seq_mod, it is just a private notification that
the stmt callback has removed a stmt.
2023-11-17 Jakub Jelinek <jakub@redhat.com>
PR c++/107571
gcc/
* gimplify.cc (expand_FALLTHROUGH_r): Use wi->removed_stmt after
gsi_remove, change the way of passing fallthrough stmt at the end
of sequence to expand_FALLTHROUGH. Diagnose IFN_FALLTHROUGH
with GF_CALL_NOTHROW flag.
(expand_FALLTHROUGH): Change loc into array of 2 location_t elts,
don't test wi.callback_result, instead check whether first
elt is not UNKNOWN_LOCATION and in that case pedwarn with the
second location.
* gimple-walk.cc (walk_gimple_seq_mod): Clear wi->removed_stmt
after the flag has been used.
* internal-fn.def (FALLTHROUGH): Mention in comment the special
meaning of the TREE_NOTHROW/GF_CALL_NOTHROW flag on the calls.
gcc/c-family/
* c-gimplify.cc (genericize_c_loop): For C++ mark IFN_FALLTHROUGH
call at the end of loop body as TREE_NOTHROW.
gcc/testsuite/
* g++.dg/DRs/dr2406.C: New test.
|
|
The functions defined via DEF_EXT_LIB_FLOATN_NX_BUILTINS should also
use DEF_INTERNAL_FLT_FLOATN_FN instead of DEF_INTERNAL_FLT_FN for
FLOATN support. According to the glibc API and the gcc builtins, the
table below shows whether FLOATN is supported:
+---------+-------+-------------------------------------+
|         | glibc | gcc: DEF_EXT_LIB_FLOATN_NX_BUILTINS |
+---------+-------+-------------------------------------+
| iceil   |   N   |                  N                  |
| ifloor  |   N   |                  N                  |
| irint   |   N   |                  N                  |
| iround  |   N   |                  N                  |
| lceil   |   N   |                  N                  |
| lfloor  |   N   |                  N                  |
| lrint   |   Y   |                  Y                  |
| lround  |   Y   |                  Y                  |
| llceil  |   N   |                  N                  |
| llfloor |   N   |                  N                  |
| llrint  |   Y   |                  Y                  |
| llround |   Y   |                  Y                  |
+---------+-------+-------------------------------------+
This patch would like to support FLOATN for:
1. lrint
2. lround
3. llrint
4. llround
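For example (illustrative, on targets that support _Float128):
long
f (_Float128 x)
{
  return __builtin_lrintf128 (x);	/* can now use the internal fn path */
}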
The following tests passed with this patch:
1. x86 bootstrap and regression test.
2. aarch64 regression test.
3. riscv regression tests.
PR target/112432
gcc/ChangeLog:
* internal-fn.def (LRINT): Add FLOATN support.
(LROUND): Ditto.
(LLRINT): Ditto.
(LLROUND): Ditto.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
This adds a masked variant of copysign. Nothing very exciting, just the
general machinery to define and use a new masked IFN.
Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
Note: This patch is part of a test series; tests for it are added in the
AArch64 patch that adds support for the optab.
gcc/ChangeLog:
PR tree-optimization/109154
* internal-fn.def (COPYSIGN): New.
* match.pd (UNCOND_BINARY, COND_BINARY): Map IFN_COPYSIGN to
IFN_COND_COPYSIGN.
* optabs.def (cond_copysign_optab, cond_len_copysign_optab): New.
|
|
In order to prevent simplification of a COND_OP with degenerate mask
(CONSTM1_RTX) into just an OP in the presence of length masking this
patch introduces a length-masked analog to VEC_COND_EXPR:
IFN_VCOND_MASK_LEN.
It also adds new match patterns that allow the combination of
unconditional unary, binary and ternary operations with the
VCOND_MASK_LEN into a conditional operation if the target supports it.
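A GIMPLE-like sketch of the combination the new match patterns
perform (names illustrative):
_1 = a_2 + b_3;
_4 = .VCOND_MASK_LEN (mask_5, _1, else_6, len_7, bias_8);
is combined into
_4 = .COND_LEN_ADD (mask_5, a_2, b_3, else_6, len_7, bias_8);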
gcc/ChangeLog:
PR tree-optimization/111760
* config/riscv/autovec.md (vcond_mask_len_<mode><vm>): Add
expander.
* config/riscv/riscv-protos.h (enum insn_type): Add.
* config/riscv/riscv-v.cc (needs_fp_rounding): Add !pred_mov.
* doc/md.texi: Add vcond_mask_len.
* gimple-match-exports.cc (maybe_resimplify_conditional_op):
Create VCOND_MASK_LEN when length masking.
* gimple-match.h (gimple_match_op::gimple_match_op): Always
initialize len and bias.
* internal-fn.cc (vec_cond_mask_len_direct): Add.
(direct_vec_cond_mask_len_optab_supported_p): Add.
(internal_fn_len_index): Add VCOND_MASK_LEN.
(internal_fn_mask_index): Ditto.
* internal-fn.def (VCOND_MASK_LEN): New internal function.
* match.pd: Combine unconditional unary, binary and ternary
operations into the respective COND_LEN operations.
* optabs.def (OPTAB_D): Add vcond_mask_len optab.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-cond-arith-2.c: No vect cost model for
riscv_v.
|
|
The following patch introduces the middle-end part of the _BitInt
support, a new BITINT_TYPE, handling it where needed, except the lowering
pass and sanitizer support.
2023-09-06 Jakub Jelinek <jakub@redhat.com>
PR c/102989
* tree.def (BITINT_TYPE): New type.
* tree.h (TREE_CHECK6, TREE_NOT_CHECK6): Define.
(NUMERICAL_TYPE_CHECK, INTEGRAL_TYPE_P): Include
BITINT_TYPE.
(BITINT_TYPE_P): Define.
(CONSTRUCTOR_BITFIELD_P): Return true even for BLKmode bit-fields if
they have BITINT_TYPE type.
(tree_check6, tree_not_check6): New inline functions.
(any_integral_type_check): Include BITINT_TYPE.
(build_bitint_type): Declare.
* tree.cc (tree_code_size, wide_int_to_tree_1, cache_integer_cst,
build_zero_cst, type_hash_canon_hash, type_cache_hasher::equal,
type_hash_canon): Handle BITINT_TYPE.
(bitint_type_cache): New variable.
(build_bitint_type): New function.
(signed_or_unsigned_type_for, verify_type_variant, verify_type):
Handle BITINT_TYPE.
(tree_cc_finalize): Free bitint_type_cache.
* builtins.cc (type_to_class): Handle BITINT_TYPE.
(fold_builtin_unordered_cmp): Handle BITINT_TYPE like INTEGER_TYPE.
* cfgexpand.cc (expand_debug_expr): Punt on BLKmode BITINT_TYPE
INTEGER_CSTs.
* convert.cc (convert_to_pointer_1, convert_to_real_1,
convert_to_complex_1): Handle BITINT_TYPE like INTEGER_TYPE.
(convert_to_integer_1): Likewise. For BITINT_TYPE don't check
GET_MODE_PRECISION (TYPE_MODE (type)).
* doc/generic.texi (BITINT_TYPE): Document.
* doc/tm.texi.in (TARGET_C_BITINT_TYPE_INFO): New.
* doc/tm.texi: Regenerated.
* dwarf2out.cc (base_type_die, is_base_type, modified_type_die,
gen_type_die_with_usage): Handle BITINT_TYPE.
(rtl_for_decl_init): Punt on BLKmode BITINT_TYPE INTEGER_CSTs or
handle those which fit into shwi.
* expr.cc (expand_expr_real_1): Define EXTEND_BITINT macro, reduce
to bitfield precision reads from BITINT_TYPE vars, parameters or
memory locations. Expand large/huge BITINT_TYPE INTEGER_CSTs into
memory.
* fold-const.cc (fold_convert_loc, make_range_step): Handle
BITINT_TYPE.
(extract_muldiv_1): For BITINT_TYPE use TYPE_PRECISION rather than
GET_MODE_SIZE (SCALAR_INT_TYPE_MODE).
(native_encode_int, native_interpret_int, native_interpret_expr):
Handle BITINT_TYPE.
* gimple-expr.cc (useless_type_conversion_p): Make BITINT_TYPE
to some other integral type or vice versa conversions non-useless.
* gimple-fold.cc (gimple_fold_builtin_memset): Punt for BITINT_TYPE.
(clear_padding_unit): Mention in comment that _BitInt types don't need
to fit either.
(clear_padding_bitint_needs_padding_p): New function.
(clear_padding_type_may_have_padding_p): Handle BITINT_TYPE.
(clear_padding_type): Likewise.
* internal-fn.cc (expand_mul_overflow): For unsigned non-mode
precision operands force pos_neg? to 1.
(expand_MULBITINT, expand_DIVMODBITINT, expand_FLOATTOBITINT,
expand_BITINTTOFLOAT): New functions.
* internal-fn.def (MULBITINT, DIVMODBITINT, FLOATTOBITINT,
BITINTTOFLOAT): New internal functions.
* internal-fn.h (expand_MULBITINT, expand_DIVMODBITINT,
expand_FLOATTOBITINT, expand_BITINTTOFLOAT): Declare.
* match.pd (non-equality compare simplifications from fold_binary):
Punt if TYPE_MODE (arg1_type) is BLKmode.
* pretty-print.h (pp_wide_int): Handle printing of large precision
wide_ints which would overflow digit_buffer.
* stor-layout.cc (finish_bitfield_representative): For bit-fields
with BITINT_TYPE, prefer representatives with precisions in
multiple of limb precision.
(layout_type): Handle BITINT_TYPE. Handle COMPLEX_TYPE with BLKmode
element type and assert it is BITINT_TYPE.
* target.def (bitint_type_info): New C target hook.
* target.h (struct bitint_info): New type.
* targhooks.cc (default_bitint_type_info): New function.
* targhooks.h (default_bitint_type_info): Declare.
* tree-pretty-print.cc (dump_generic_node): Handle BITINT_TYPE.
Handle printing large wide_ints which would overflow
digit_buffer.
* tree-ssa-sccvn.cc: Include target.h.
(eliminate_dom_walker::eliminate_stmt): Punt for large/huge
BITINT_TYPE.
* tree-switch-conversion.cc (jump_table_cluster::emit): For more than
64-bit BITINT_TYPE subtract low bound from expression and cast to
64-bit integer type both the controlling expression and case labels.
* typeclass.h (enum type_class): Add bitint_type_class enumerator.
* varasm.cc (output_constant): Handle BITINT_TYPE INTEGER_CSTs.
* vr-values.cc (check_for_binary_op_overflow): Use widest2_int rather
than widest_int.
(simplify_using_ranges::simplify_internal_call_using_ranges): Use
unsigned_type_for rather than build_nonstandard_integer_type.
|
|
Hi, Richard and Richi.
This is the last autovec pattern I want to add for RVV (length loop control).
This patch is supposed to handle the following case:
int __attribute__ ((noinline, noclone))
condition_reduction (int *a, int min_v, int n)
{
int last = 66; /* High start value. */
for (int i = 0; i < n; i++)
if (a[i] < min_v)
last = i;
return last;
}
ARM SVE IR:
...
mask__7.11_39 = vect__4.10_37 < vect_cst__38;
_40 = loop_mask_36 & mask__7.11_39;
last_5 = .FOLD_EXTRACT_LAST (last_15, _40, vect_vec_iv_.7_32);
...
RVV IR, we want to see:
...
loop_len = SELECT_VL
mask__7.11_39 = vect__4.10_37 < vect_cst__38;
last_5 = .LEN_FOLD_EXTRACT_LAST (last_15, _40, vect_vec_iv_.7_32, loop_len, bias);
...
gcc/ChangeLog:
* doc/md.texi: Add LEN_FOLD_EXTRACT_LAST pattern.
* internal-fn.cc (fold_len_extract_direct): Ditto.
(expand_fold_len_extract_optab_fn): Ditto.
(direct_fold_len_extract_optab_supported_p): Ditto.
* internal-fn.def (LEN_FOLD_EXTRACT_LAST): Ditto.
* optabs.def (OPTAB_D): Ditto.
|
|
Like the support for conditional neg (r12-4470-g20dcda98ed376cb61c74b2c71),
this just adds conditional not too.
Also we should be able to turn `(a ? -1 : 0) ^ b` into a conditional
not.
OK? Bootstrapped and tested on x86_64-linux-gnu and aarch64-linux-gnu.
gcc/ChangeLog:
* internal-fn.def (COND_NOT): New internal function.
* match.pd (UNCOND_UNARY, COND_UNARY): Add bit_not/not
to the lists.
(`vec (a ? -1 : 0) ^ b`): New pattern to convert
into conditional not.
* optabs.def (cond_one_cmpl): New optab.
(cond_len_one_cmpl): Likewise.
gcc/testsuite/ChangeLog:
PR target/110986
* gcc.target/aarch64/sve/cond_unary_9.c: New test.
|
|
This patch adds the vec_mask_len_{load_lanes,store_lanes} autovectorization patterns.
Here we want to support the following autovectorization:
void
foo (int8_t *__restrict a,
int8_t *__restrict b,
int8_t *__restrict cond,
int n)
{
for (intptr_t i = 0; i < n; ++i)
{
if (cond[i])
a[i] = b[i * 2] + b[i * 2 + 1];
}
}
ARM SVE IR:
https://godbolt.org/z/cro1Eqc6a
# loop_mask_60 = PHI <next_mask_82(4), max_mask_81(3)>
...
mask__39.12_63 = vect__3.11_61 != { 0, ... };
vec_mask_and_66 = loop_mask_60 & mask__39.12_63;
...
vect_array.15 = .MASK_LOAD_LANES (_57, 8B, vec_mask_and_66);
...
For RVV, we would like to see IR:
loop_len = SELECT_VL;
...
mask__39.12_63 = vect__3.11_61 != { 0, ... };
...
vect_array.15 = .MASK_LEN_LOAD_LANES (_57, 8B, mask__39.12_63, loop_len, bias);
...
Bootstrap and Regression on X86 passed.
Ok for trunk?
gcc/ChangeLog:
* doc/md.texi: Add vec_mask_len_{load_lanes,store_lanes} patterns.
* internal-fn.cc (expand_partial_load_optab_fn): Ditto.
(expand_partial_store_optab_fn): Ditto.
* internal-fn.def (MASK_LEN_LOAD_LANES): Ditto.
(MASK_LEN_STORE_LANES): Ditto.
* optabs.def (OPTAB_CD): Ditto.
|
|
Both .VEC_SET and .VEC_EXTRACT and the various .VCOND internal functions
operate on registers only and are not supposed to raise
any exceptions. The following makes them const/nothrow. I've
verified this avoids useless SSA updates in ISEL.
* internal-fn.def (VCOND, VCONDU, VCONDEQ, VCOND_MASK,
VEC_SET, VEC_EXTRACT): Make ECF_CONST | ECF_NOTHROW.
|
|
Hi, Richard and Richi.
Based on previous discussions, we should make COND_* and COND_LEN_*
consistent.
So, this patch defines these internal functions together via these 2
wrappers:
#define DEF_INTERNAL_COND_FN(NAME, FLAGS, OPTAB, TYPE) \
  DEF_INTERNAL_OPTAB_FN (COND_##NAME, FLAGS, cond_##OPTAB, cond_##TYPE) \
  DEF_INTERNAL_OPTAB_FN (COND_LEN_##NAME, FLAGS, cond_len_##OPTAB, \
                         cond_len_##TYPE)
#define DEF_INTERNAL_SIGNED_COND_FN(NAME, FLAGS, SELECTOR, SIGNED_OPTAB, \
                                    UNSIGNED_OPTAB, TYPE) \
  DEF_INTERNAL_SIGNED_OPTAB_FN (COND_##NAME, FLAGS, SELECTOR, \
                                cond_##SIGNED_OPTAB, cond_##UNSIGNED_OPTAB, \
                                cond_##TYPE) \
  DEF_INTERNAL_SIGNED_OPTAB_FN (COND_LEN_##NAME, FLAGS, SELECTOR, \
                                cond_len_##SIGNED_OPTAB, \
                                cond_len_##UNSIGNED_OPTAB, cond_len_##TYPE)
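So, for example (illustrative), a single line now defines both the
COND_ and COND_LEN_ variants:
DEF_INTERNAL_COND_FN (ADD, ECF_CONST, add, binary)
/* defines both COND_ADD (cond_add optab) and COND_LEN_ADD
   (cond_len_add optab).  */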
Bootstrap and Regression on X86 passed.
Ok for trunk?
gcc/ChangeLog:
* internal-fn.def (DEF_INTERNAL_COND_FN): New macro.
(DEF_INTERNAL_SIGNED_COND_FN): Ditto.
(COND_ADD): Remove.
(COND_SUB): Ditto.
(COND_MUL): Ditto.
(COND_DIV): Ditto.
(COND_MOD): Ditto.
(COND_RDIV): Ditto.
(COND_MIN): Ditto.
(COND_MAX): Ditto.
(COND_FMIN): Ditto.
(COND_FMAX): Ditto.
(COND_AND): Ditto.
(COND_IOR): Ditto.
(COND_XOR): Ditto.
(COND_SHL): Ditto.
(COND_SHR): Ditto.
(COND_FMA): Ditto.
(COND_FMS): Ditto.
(COND_FNMA): Ditto.
(COND_FNMS): Ditto.
(COND_NEG): Ditto.
(COND_LEN_ADD): Ditto.
(COND_LEN_SUB): Ditto.
(COND_LEN_MUL): Ditto.
(COND_LEN_DIV): Ditto.
(COND_LEN_MOD): Ditto.
(COND_LEN_RDIV): Ditto.
(COND_LEN_MIN): Ditto.
(COND_LEN_MAX): Ditto.
(COND_LEN_FMIN): Ditto.
(COND_LEN_FMAX): Ditto.
(COND_LEN_AND): Ditto.
(COND_LEN_IOR): Ditto.
(COND_LEN_XOR): Ditto.
(COND_LEN_SHL): Ditto.
(COND_LEN_SHR): Ditto.
(COND_LEN_FMA): Ditto.
(COND_LEN_FMS): Ditto.
(COND_LEN_FNMA): Ditto.
(COND_LEN_FNMS): Ditto.
(COND_LEN_NEG): Ditto.
(ADD): New macro define.
(SUB): Ditto.
(MUL): Ditto.
(DIV): Ditto.
(MOD): Ditto.
(RDIV): Ditto.
(MIN): Ditto.
(MAX): Ditto.
(FMIN): Ditto.
(FMAX): Ditto.
(AND): Ditto.
(IOR): Ditto.
(XOR): Ditto.
(SHL): Ditto.
(SHR): Ditto.
(FMA): Ditto.
(FMS): Ditto.
(FNMA): Ditto.
(FNMS): Ditto.
(NEG): Ditto.
|
|
Hi.
Starting from LEN_MASK_GATHER_LOAD/LEN_MASK_SCATTER_STORE and the COND_LEN_*
patterns, the order of len and mask is {mask, len, bias}.
The reason we make the "mask" argument come before "len" is that we want to
keep the "mask" location the same as in the mask_* and cond_* patterns, to
reuse the current code flow of mask_* and cond_*. Otherwise, we would need
to change much more code and make it hard to maintain.
Now that we already have COND_LEN_*, it is natural to rename "LEN_MASK" into
"MASK_LEN" to keep the naming scheme consistent.
This patch only changes the name "LEN_MASK" into "MASK_LEN".
No code functionality change.
gcc/ChangeLog:
* config/riscv/autovec.md (len_maskload<mode><vm>): Change LEN_MASK into MASK_LEN.
(mask_len_load<mode><vm>): Ditto.
(len_maskstore<mode><vm>): Ditto.
(mask_len_store<mode><vm>): Ditto.
(len_mask_gather_load<RATIO64:mode><RATIO64I:mode>): Ditto.
(mask_len_gather_load<RATIO64:mode><RATIO64I:mode>): Ditto.
(len_mask_gather_load<RATIO32:mode><RATIO32I:mode>): Ditto.
(mask_len_gather_load<RATIO32:mode><RATIO32I:mode>): Ditto.
(len_mask_gather_load<RATIO16:mode><RATIO16I:mode>): Ditto.
(mask_len_gather_load<RATIO16:mode><RATIO16I:mode>): Ditto.
(len_mask_gather_load<RATIO8:mode><RATIO8I:mode>): Ditto.
(mask_len_gather_load<RATIO8:mode><RATIO8I:mode>): Ditto.
(len_mask_gather_load<RATIO4:mode><RATIO4I:mode>): Ditto.
(mask_len_gather_load<RATIO4:mode><RATIO4I:mode>): Ditto.
(len_mask_gather_load<RATIO2:mode><RATIO2I:mode>): Ditto.
(mask_len_gather_load<RATIO2:mode><RATIO2I:mode>): Ditto.
(len_mask_gather_load<RATIO1:mode><RATIO1:mode>): Ditto.
(mask_len_gather_load<RATIO1:mode><RATIO1:mode>): Ditto.
(len_mask_scatter_store<RATIO64:mode><RATIO64I:mode>): Ditto.
(mask_len_scatter_store<RATIO64:mode><RATIO64I:mode>): Ditto.
(len_mask_scatter_store<RATIO32:mode><RATIO32I:mode>): Ditto.
(mask_len_scatter_store<RATIO32:mode><RATIO32I:mode>): Ditto.
(len_mask_scatter_store<RATIO16:mode><RATIO16I:mode>): Ditto.
(mask_len_scatter_store<RATIO16:mode><RATIO16I:mode>): Ditto.
(len_mask_scatter_store<RATIO8:mode><RATIO8I:mode>): Ditto.
(mask_len_scatter_store<RATIO8:mode><RATIO8I:mode>): Ditto.
(len_mask_scatter_store<RATIO4:mode><RATIO4I:mode>): Ditto.
(mask_len_scatter_store<RATIO4:mode><RATIO4I:mode>): Ditto.
(len_mask_scatter_store<RATIO2:mode><RATIO2I:mode>): Ditto.
(mask_len_scatter_store<RATIO2:mode><RATIO2I:mode>): Ditto.
(len_mask_scatter_store<RATIO1:mode><RATIO1:mode>): Ditto.
(mask_len_scatter_store<RATIO1:mode><RATIO1:mode>): Ditto.
* doc/md.texi: Ditto.
* genopinit.cc (main): Ditto.
(CMP_NAME): Ditto.
* gimple-fold.cc (arith_overflowed_p): Ditto.
(gimple_fold_partial_load_store_mem_ref): Ditto.
(gimple_fold_call): Ditto.
* internal-fn.cc (len_maskload_direct): Ditto.
(mask_len_load_direct): Ditto.
(len_maskstore_direct): Ditto.
(mask_len_store_direct): Ditto.
(expand_call_mem_ref): Ditto.
(expand_len_maskload_optab_fn): Ditto.
(expand_mask_len_load_optab_fn): Ditto.
(expand_len_maskstore_optab_fn): Ditto.
(expand_mask_len_store_optab_fn): Ditto.
(direct_len_maskload_optab_supported_p): Ditto.
(direct_mask_len_load_optab_supported_p): Ditto.
(direct_len_maskstore_optab_supported_p): Ditto.
(direct_mask_len_store_optab_supported_p): Ditto.
(internal_load_fn_p): Ditto.
(internal_store_fn_p): Ditto.
(internal_gather_scatter_fn_p): Ditto.
(internal_fn_len_index): Ditto.
(internal_fn_mask_index): Ditto.
(internal_fn_stored_value_index): Ditto.
(internal_len_load_store_bias): Ditto.
* internal-fn.def (LEN_MASK_GATHER_LOAD): Ditto.
(MASK_LEN_GATHER_LOAD): Ditto.
(LEN_MASK_LOAD): Ditto.
(MASK_LEN_LOAD): Ditto.
(LEN_MASK_SCATTER_STORE): Ditto.
(MASK_LEN_SCATTER_STORE): Ditto.
(LEN_MASK_STORE): Ditto.
(MASK_LEN_STORE): Ditto.
* optabs-query.cc (supports_vec_gather_load_p): Ditto.
(supports_vec_scatter_store_p): Ditto.
* optabs-tree.cc (target_supports_mask_load_store_p): Ditto.
(target_supports_len_load_store_p): Ditto.
* optabs.def (OPTAB_CD): Ditto.
* tree-ssa-alias.cc (ref_maybe_used_by_call_p_1): Ditto.
(call_may_clobber_ref_p_1): Ditto.
* tree-ssa-dse.cc (initialize_ao_ref_for_dse): Ditto.
(dse_optimize_stmt): Ditto.
* tree-ssa-loop-ivopts.cc (get_mem_type_for_internal_fn): Ditto.
(get_alias_ptr_type_for_ptr_address): Ditto.
* tree-vect-data-refs.cc (vect_gather_scatter_fn_p): Ditto.
* tree-vect-patterns.cc (vect_recog_gather_scatter_pattern): Ditto.
* tree-vect-stmts.cc (check_load_store_for_partial_vectors): Ditto.
(vect_get_strided_load_store_ops): Ditto.
(vectorizable_store): Ditto.
(vectorizable_load): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-1.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-10.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-11.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-12.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-3.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-4.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-5.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-6.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-7.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-8.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/gather_load-9.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-1.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-10.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-11.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-3.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-4.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-5.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-6.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-7.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-8.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_gather_load-9.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-1.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-10.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-3.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-4.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-5.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-6.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-7.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-8.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/mask_scatter_store-9.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-1.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-10.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-3.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-4.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-5.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-6.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-7.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-8.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/scatter_store-9.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_load-1.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_load-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_store-1.c: Ditto.
* gcc.target/riscv/rvv/autovec/gather-scatter/strided_store-2.c: Ditto.
* gcc.target/riscv/rvv/autovec/partial/gimple_fold-1.c: Ditto.
|
|
Hi, Richard and Richi.
This patch adds the mask_len_fold_left_plus pattern to support in-order
floating-point reduction for targets that support length-based loop control.
Consider this following case:
double
foo2 (double *__restrict a,
double init,
int *__restrict cond,
int n)
{
for (int i = 0; i < n; i++)
if (cond[i])
init += a[i];
return init;
}
ARM SVE:
...
vec_mask_and_60 = loop_mask_54 & mask__23.33_57;
vect__ifc__35.37_64 = .VCOND_MASK (vec_mask_and_60, vect__8.36_61, { 0.0, ... });
_36 = .MASK_FOLD_LEFT_PLUS (init_20, vect__ifc__35.37_64, loop_mask_54);
...
For RVV, we want to see:
...
_36 = .MASK_LEN_FOLD_LEFT_PLUS (init_20, vect__ifc__35.37_64, control_mask, loop_len, bias);
...
gcc/ChangeLog:
* doc/md.texi: Add mask_len_fold_left_plus.
* internal-fn.cc (mask_len_fold_left_direct): Ditto.
(expand_mask_len_fold_left_optab_fn): Ditto.
(direct_mask_len_fold_left_optab_supported_p): Ditto.
* internal-fn.def (MASK_LEN_FOLD_LEFT_PLUS): Ditto.
* optabs.def (OPTAB_D): Ditto.
|
|
Hi, Richard and Richi.
This patch adds cond_len_* operation patterns for targets that support loop control with length.
These patterns will be used in these following case:
1. Integer division:
void
f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int n)
{
for (int i = 0; i < n; ++i)
{
a[i] = b[i] / c[i];
}
}
ARM SVE IR:
...
max_mask_36 = .WHILE_ULT (0, bnd.5_32, { 0, ... });
Loop:
...
# loop_mask_29 = PHI <next_mask_37(4), max_mask_36(3)>
...
vect__4.8_28 = .MASK_LOAD (_33, 32B, loop_mask_29);
...
vect__6.11_25 = .MASK_LOAD (_20, 32B, loop_mask_29);
vect__8.12_24 = .COND_DIV (loop_mask_29, vect__4.8_28, vect__6.11_25, vect__4.8_28);
...
.MASK_STORE (_1, 32B, loop_mask_29, vect__8.12_24);
...
next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
...
For targets like RVV that support loop control with length, we want to see IR as follows:
Loop:
...
# loop_len_29 = SELECT_VL
...
vect__4.8_28 = .LEN_MASK_LOAD (_33, 32B, loop_len_29);
...
vect__6.11_25 = .LEN_MASK_LOAD (_20, 32B, loop_len_29);
vect__8.12_24 = .COND_LEN_DIV (dummy_mask, vect__4.8_28, vect__6.11_25, vect__4.8_28, loop_len_29, bias);
...
.LEN_MASK_STORE (_1, 32B, loop_len_29, vect__8.12_24);
...
next_mask_37 = .WHILE_ULT (_2, bnd.5_32, { 0, ... });
...
Notice that here we use dummy_mask = { -1, -1, ..., -1 }
2. Integer conditional division:
Similar to case (1) but with a condition:
void
f (int32_t *restrict a, int32_t *restrict b, int32_t *restrict c, int32_t * cond, int n)
{
for (int i = 0; i < n; ++i)
{
if (cond[i])
a[i] = b[i] / c[i];
}
}
ARM SVE:
...
max_mask_76 = .WHILE_ULT (0, bnd.6_52, { 0, ... });
Loop:
...
# loop_mask_55 = PHI <next_mask_77(5), max_mask_76(4)>
...
vect__4.9_56 = .MASK_LOAD (_51, 32B, loop_mask_55);
mask__29.10_58 = vect__4.9_56 != { 0, ... };
vec_mask_and_61 = loop_mask_55 & mask__29.10_58;
...
vect__6.13_62 = .MASK_LOAD (_24, 32B, vec_mask_and_61);
...
vect__8.16_66 = .MASK_LOAD (_1, 32B, vec_mask_and_61);
vect__10.17_68 = .COND_DIV (vec_mask_and_61, vect__6.13_62, vect__8.16_66, vect__6.13_62);
...
.MASK_STORE (_2, 32B, vec_mask_and_61, vect__10.17_68);
...
next_mask_77 = .WHILE_ULT (_3, bnd.6_52, { 0, ... });
Here, ARM SVE uses vec_mask_and_61 = loop_mask_55 & mask__29.10_58; to guarantee the correct result.
However, targets with length control cannot perform this elegant flow; for RVV, we would expect:
Loop:
...
loop_len_55 = SELECT_VL
...
mask__29.10_58 = vect__4.9_56 != { 0, ... };
...
vect__10.17_68 = .COND_LEN_DIV (mask__29.10_58, vect__6.13_62, vect__8.16_66, vect__6.13_62, loop_len_55, bias);
...
Here we expect COND_LEN_DIV predicated by a real mask, which is the outcome of the comparison mask__29.10_58 = vect__4.9_56 != { 0, ... },
and a real length, which is produced by the loop control: loop_len_55 = SELECT_VL
3. Conditional floating-point operations (no -ffast-math):
void
f (float *restrict a, float *restrict b, int32_t *restrict cond, int n)
{
for (int i = 0; i < n; ++i)
{
if (cond[i])
a[i] = b[i] + a[i];
}
}
ARM SVE IR:
max_mask_70 = .WHILE_ULT (0, bnd.6_46, { 0, ... });
...
# loop_mask_49 = PHI <next_mask_71(4), max_mask_70(3)>
...
mask__27.10_52 = vect__4.9_50 != { 0, ... };
vec_mask_and_55 = loop_mask_49 & mask__27.10_52;
...
vect__9.17_62 = .COND_ADD (vec_mask_and_55, vect__6.13_56, vect__8.16_60, vect__6.13_56);
...
next_mask_71 = .WHILE_ULT (_22, bnd.6_46, { 0, ... });
...
For RVV, we would expect IR:
...
loop_len_49 = SELECT_VL
...
mask__27.10_52 = vect__4.9_50 != { 0, ... };
...
vect__9.17_62 = .COND_LEN_ADD (mask__27.10_52, vect__6.13_56, vect__8.16_60, vect__6.13_56, loop_len_49, bias);
...
4. Conditional unordered reduction:
int32_t
f (int32_t *restrict a,
int32_t *restrict cond, int n)
{
int32_t result = 0;
for (int i = 0; i < n; ++i)
{
if (cond[i])
result += a[i];
}
return result;
}
ARM SVE IR:
Loop:
# vect_result_18.7_37 = PHI <vect__33.16_51(4), { 0, ... }(3)>
...
# loop_mask_40 = PHI <next_mask_58(4), max_mask_57(3)>
...
mask__17.11_43 = vect__4.10_41 != { 0, ... };
vec_mask_and_46 = loop_mask_40 & mask__17.11_43;
...
vect__33.16_51 = .COND_ADD (vec_mask_and_46, vect_result_18.7_37, vect__7.14_47, vect_result_18.7_37);
...
next_mask_58 = .WHILE_ULT (_15, bnd.6_36, { 0, ... });
...
Epilogue:
_53 = .REDUC_PLUS (vect__33.16_51); [tail call]
For RVV, we expect:
Loop:
# vect_result_18.7_37 = PHI <vect__33.16_51(4), { 0, ... }(3)>
...
loop_len_40 = SELECT_VL
...
mask__17.11_43 = vect__4.10_41 != { 0, ... };
...
vect__33.16_51 = .COND_LEN_ADD (mask__17.11_43, vect_result_18.7_37, vect__7.14_47, vect_result_18.7_37, loop_len_40, bias);
...
next_mask_58 = .WHILE_ULT (_15, bnd.6_36, { 0, ... });
...
Epilogue:
_53 = .REDUC_PLUS (vect__33.16_51); [tail call]
I name these patterns "cond_len_*" since I want the length operand to come after the mask operand, and all other operands except the length operand
to be in the same order as in the "cond_*" patterns. Such an order will make life easier in the following loop vectorizer support.
gcc/ChangeLog:
* doc/md.texi: Add COND_LEN_* operations for loop control with length.
* internal-fn.cc (cond_len_unary_direct): Ditto.
(cond_len_binary_direct): Ditto.
(cond_len_ternary_direct): Ditto.
(expand_cond_len_unary_optab_fn): Ditto.
(expand_cond_len_binary_optab_fn): Ditto.
(expand_cond_len_ternary_optab_fn): Ditto.
(direct_cond_len_unary_optab_supported_p): Ditto.
(direct_cond_len_binary_optab_supported_p): Ditto.
(direct_cond_len_ternary_optab_supported_p): Ditto.
* internal-fn.def (COND_LEN_ADD): Ditto.
(COND_LEN_SUB): Ditto.
(COND_LEN_MUL): Ditto.
(COND_LEN_DIV): Ditto.
(COND_LEN_MOD): Ditto.
(COND_LEN_RDIV): Ditto.
(COND_LEN_MIN): Ditto.
(COND_LEN_MAX): Ditto.
(COND_LEN_FMIN): Ditto.
(COND_LEN_FMAX): Ditto.
(COND_LEN_AND): Ditto.
(COND_LEN_IOR): Ditto.
(COND_LEN_XOR): Ditto.
(COND_LEN_SHL): Ditto.
(COND_LEN_SHR): Ditto.
(COND_LEN_FMA): Ditto.
(COND_LEN_FMS): Ditto.
(COND_LEN_FNMA): Ditto.
(COND_LEN_FNMS): Ditto.
(COND_LEN_NEG): Ditto.
* optabs.def (OPTAB_D): Ditto.
|
|
In gimple-isel we already deduce a vec_set pattern from an
ARRAY_REF(VIEW_CONVERT_EXPR). This patch does the same for a
vec_extract.
The code is largely similar to the vec_set one
including the addition of a can_vec_extract_var_idx_p function
in optabs.cc to check if the backend can handle a register
operand as index. We already have can_vec_extract in
optabs-query but that one checks whether we can extract
specific modes.
With the introduction of an internal function for vec_extract
the expander must not FAIL. For vec_set this has already been
the case so adjust the documentation accordingly.
Additionally, clarify the wording of the vector-vector case for
vec_extract.
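For illustration, the kind of access that now becomes a .VEC_EXTRACT
call when the backend supports a register index:
typedef int v4si __attribute__ ((vector_size (16)));

int
f (v4si v, int i)
{
  return v[i];	/* ARRAY_REF of a VIEW_CONVERT_EXPR, variable index */
}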
gcc/ChangeLog:
* doc/md.texi: Document that vec_set and vec_extract must not
fail.
* gimple-isel.cc (gimple_expand_vec_set_expr): Rename this...
(gimple_expand_vec_set_extract_expr): ...to this.
(gimple_expand_vec_exprs): Call renamed function.
* internal-fn.cc (vec_extract_direct): Add.
(expand_vec_extract_optab_fn): New function to expand
vec_extract optab.
(direct_vec_extract_optab_supported_p): Add.
* internal-fn.def (VEC_EXTRACT): Add.
* optabs.cc (can_vec_extract_var_idx_p): New function.
* optabs.h (can_vec_extract_var_idx_p): Declare.
|
|
Hi, Richi and Richard.
Based on the review comments from Richard:
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/623405.html
I change the len_mask_gather_load/len_mask_scatter_store operand order into:
{len,bias,mask}
We adjust adding the len and mask arguments using add_len_and_mask_args,
the same as for partial_load/partial_store.
Now the code is more reasonable and easier to maintain.
This patch adds LEN_MASK_{GATHER_LOAD,SCATTER_STORE} to allow targets to
handle flow control by mask and loop control by length on gather/scatter
memory operations. Consider the following case:
void
f (uint8_t *restrict a,
uint8_t *restrict b, int n,
int base, int step,
int *restrict cond)
{
for (int i = 0; i < n; ++i)
{
if (cond[i])
a[i * step + base] = b[i * step + base];
}
}
We hope RVV can vectorize such case into following IR:
loop_len = SELECT_VL
control_mask = comparison
v = LEN_MASK_GATHER_LOAD (.., loop_len, bias, control_mask)
LEN_MASK_SCATTER_STORE (... v, ..., loop_len, bias, control_mask)
This patch doesn't apply such patterns in the vectorizer; it just adds the
patterns and updates the documentation.
A patch applying such patterns in the vectorizer will be sent soon after
this patch is approved.
Ok for trunk?
gcc/ChangeLog:
* doc/md.texi: Add len_mask_gather_load/len_mask_scatter_store.
* internal-fn.cc (expand_scatter_store_optab_fn): Ditto.
(expand_gather_load_optab_fn): Ditto.
(internal_load_fn_p): Ditto.
(internal_store_fn_p): Ditto.
(internal_gather_scatter_fn_p): Ditto.
(internal_fn_len_index): Ditto.
(internal_fn_mask_index): Ditto.
(internal_fn_stored_value_index): Ditto.
* internal-fn.def (LEN_MASK_GATHER_LOAD): Ditto.
(LEN_MASK_SCATTER_STORE): Ditto.
* optabs.def (OPTAB_CD): Ditto.
|
|
This updates vect_recog_abd_pattern to recognize the widening
variant of absolute difference (ABDL, ABDL2).
gcc/ChangeLog:
* internal-fn.def (VEC_WIDEN_ABD): New internal hilo optab.
* optabs.def (vec_widen_sabd_optab,
vec_widen_sabd_hi_optab, vec_widen_sabd_lo_optab,
vec_widen_sabd_odd_optab, vec_widen_sabd_even_optab,
vec_widen_uabd_optab,
vec_widen_uabd_hi_optab, vec_widen_uabd_lo_optab,
vec_widen_uabd_odd_optab, vec_widen_uabd_even_optab):
New optabs.
* doc/md.texi: Document them.
* tree-vect-patterns.cc (vect_recog_abd_pattern): Update to
build a VEC_WIDEN_ABD call if the input precision is smaller
than the precision of the output.
(vect_recog_widen_abd_pattern): Should an ABD expression be
found preceding an extension, replace the two with a
VEC_WIDEN_ABD.
|
|
This patch adds LEN_MASK_ LOAD/STORE to support flow control for targets
like RISC-V that use a length in loop control.
Loads/stores are normalized into LEN_MASK_ LOAD/STORE as long as either
the length or the mask is valid.  The length is the outcome of SELECT_VL
or MIN_EXPR.  The mask is the outcome of a comparison.
The LEN_MASK_ LOAD/STORE format is defined as follows:
1). LEN_MASK_LOAD (ptr, align, length, mask).
2). LEN_MASK_STORE (ptr, align, length, mask, vec).
Consider the following 4 cases:
VLA: Variable-length auto-vectorization
VLS: Specific-length auto-vectorization
Case 1 (VLS), -mrvv-vector-bits=128, IR does not use LEN_MASK_*:
Code:
  for (int i = 0; i < 4; i++)
    a[i] = b[i] + c[i];
IR:
  v1 = MEM (...)
  v2 = MEM (...)
  v3 = v1 + v2
  MEM[...] = v3
Case 2 (VLS), -mrvv-vector-bits=128, LEN_MASK_* with length = VF, mask = comparison:
Code:
  for (int i = 0; i < 4; i++)
    if (cond[i])
      a[i] = b[i] + c[i];
IR:
  mask = comparison
  v1 = LEN_MASK_LOAD (length = VF, mask)
  v2 = LEN_MASK_LOAD (length = VF, mask)
  v3 = v1 + v2
  LEN_MASK_STORE (length = VF, mask, v3)
Case 3 (VLA):
Code:
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];
IR:
  loop_len = SELECT_VL or MIN
  v1 = LEN_MASK_LOAD (length = loop_len, mask = {-1,-1,...})
  v2 = LEN_MASK_LOAD (length = loop_len, mask = {-1,-1,...})
  v3 = v1 + v2
  LEN_MASK_STORE (length = loop_len, mask = {-1,-1,...}, v3)
Case 4 (VLA):
Code:
  for (int i = 0; i < n; i++)
    if (cond[i])
      a[i] = b[i] + c[i];
IR:
  loop_len = SELECT_VL or MIN
  mask = comparison
  v1 = LEN_MASK_LOAD (length = loop_len, mask)
  v2 = LEN_MASK_LOAD (length = loop_len, mask)
  v3 = v1 + v2
  LEN_MASK_STORE (length = loop_len, mask, v3)
Co-authored-by: Robin Dapp <rdapp.gcc@gmail.com>
gcc/ChangeLog:
* doc/md.texi: Add len_mask{load,store}.
* genopinit.cc (main): Ditto.
(CMP_NAME): Ditto.
* internal-fn.cc (len_maskload_direct): Ditto.
(len_maskstore_direct): Ditto.
(expand_call_mem_ref): Ditto.
(expand_partial_load_optab_fn): Ditto.
(expand_len_maskload_optab_fn): Ditto.
(expand_partial_store_optab_fn): Ditto.
(expand_len_maskstore_optab_fn): Ditto.
(direct_len_maskload_optab_supported_p): Ditto.
(direct_len_maskstore_optab_supported_p): Ditto.
* internal-fn.def (LEN_MASK_LOAD): Ditto.
(LEN_MASK_STORE): Ditto.
* optabs.def (OPTAB_CD): Ditto.
|
|
The following patch introduces {add,sub}c5_optab and pattern-recognizes
various forms of add with carry and subtract with carry/borrow; see the
pr79173-{1,2,3,4,5,6}.c tests for what is matched.
Primarily forms with 2 __builtin_add_overflow or __builtin_sub_overflow
calls per limb (with just one for the least significant limb), for
add with carry even when it is hand-written in C (for subtraction,
reassoc seems to change it too much for the pattern recognition
to work).  __builtin_{add,sub}_overflow are standardized in C23
under the ckd_{add,sub} names, so they are no longer a GNU-only extension.
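As an illustrative sketch of the per-limb idiom described above (a
hypothetical helper, not lifted from the pr79173 tests; carry_in is
assumed to be 0 or 1):
  unsigned long
  uaddc (unsigned long x, unsigned long y, unsigned long carry_in,
         unsigned long *carry_out)
  {
    unsigned long r;
    /* Two __builtin_add_overflow calls per limb: one for the
       operands, one to fold in the incoming carry.  */
    unsigned long c1 = __builtin_add_overflow (x, y, &r);
    unsigned long c2 = __builtin_add_overflow (r, carry_in, &r);
    /* At most one of c1/c2 can be set, so + works as the carry-out.  */
    *carry_out = c1 + c2;
    return r;
  }
With the patch, match_uaddc_usubc can replace this with a single
.UADDC call when the target provides the uaddc<mode>5 pattern.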
Note, clang has for these (IMHO badly designed)
__builtin_{add,sub}c{b,s,,l,ll} builtins which don't add/subtract just
a single bit of carry, but basically add 3 unsigned values or
subtract 2 unsigned values from one, and result in carry out of 0, 1, or 2
because of that. If we wanted to introduce those for clang compatibility,
we could and lower them early to just two __builtin_{add,sub}_overflow
calls and let the pattern matching in this patch recognize it later.
I've added expanders for this on ix86 and in addition to that
added various peephole2s (in preparation patches for this patch) to make
sure we get nice (and small) code for the common cases. I think there are
other PRs which request that e.g. for the _{addcarry,subborrow}_u{32,64}
intrinsics, which the patch also improves.
It would be nice if support for these optabs were added to many other
targets; arm/aarch64 and powerpc* certainly have such instructions, and
in fact I'd expect that most targets do.
The _BitInt support I'm working on will also need this to emit reasonable
code.
2023-06-15 Jakub Jelinek <jakub@redhat.com>
PR middle-end/79173
* internal-fn.def (UADDC, USUBC): New internal functions.
* internal-fn.cc (expand_UADDC, expand_USUBC): New functions.
(commutative_ternary_fn_p): Return true also for IFN_UADDC.
* optabs.def (uaddc5_optab, usubc5_optab): New optabs.
* tree-ssa-math-opts.cc (uaddc_cast, uaddc_ne0, uaddc_is_cplxpart,
match_uaddc_usubc): New functions.
(math_opts_dom_walker::after_dom_children): Call match_uaddc_usubc
for PLUS_EXPR, MINUS_EXPR, BIT_IOR_EXPR and BIT_XOR_EXPR unless
other optimizations have been successful for those.
* gimple-fold.cc (gimple_fold_call): Handle IFN_UADDC and IFN_USUBC.
* fold-const-call.cc (fold_const_call): Likewise.
* gimple-range-fold.cc (adjust_imagpart_expr): Likewise.
* tree-ssa-dce.cc (eliminate_unnecessary_stmts): Likewise.
* doc/md.texi (uaddc<mode>5, usubc<mode>5): Document new named
patterns.
* config/i386/i386.md (uaddc<mode>5, usubc<mode>5): New
define_expand patterns.
(*setcc_qi_addqi3_cconly_overflow_1_<mode>, *setccc): Split
into NOTE_INSN_DELETED note rather than nop instruction.
(*setcc_qi_negqi_ccc_1_<mode>, *setcc_qi_negqi_ccc_2_<mode>):
Likewise.
* gcc.target/i386/pr79173-1.c: New test.
* gcc.target/i386/pr79173-2.c: New test.
* gcc.target/i386/pr79173-3.c: New test.
* gcc.target/i386/pr79173-4.c: New test.
* gcc.target/i386/pr79173-5.c: New test.
* gcc.target/i386/pr79173-6.c: New test.
* gcc.target/i386/pr79173-7.c: New test.
* gcc.target/i386/pr79173-8.c: New test.
* gcc.target/i386/pr79173-9.c: New test.
* gcc.target/i386/pr79173-10.c: New test.
|
|
This adds a recognition pattern for the non-widening
absolute difference (ABD).
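As a hedged illustration of the idiom (not one of the patch's
testcases), the shape the vectorizer can now turn into an ABD call:
  void
  f (int *restrict out, const int *restrict a,
     const int *restrict b, int n)
  {
    for (int i = 0; i < n; i++)
      {
        int d = a[i] - b[i];
        out[i] = d < 0 ? -d : d;  /* abs (a - b) */
      }
  }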
gcc/ChangeLog:
* doc/md.texi (sabd, uabd): Document them.
* internal-fn.def (ABD): Use new optab.
* optabs.def (sabd_optab, uabd_optab): New optabs.
* tree-vect-patterns.cc (vect_recog_absolute_difference):
Recognize the idiom abs (a - b).
(vect_recog_sad_pattern): Refactor to use
vect_recog_absolute_difference.
(vect_recog_abd_pattern): Use patterns found by
vect_recog_absolute_difference to build a new ABD
internal call.
|
|
This patch addresses comments from Richard && Richi and rebases onto trunk.
This patch adds SELECT_VL middle-end support to
allow targets to apply target-dependent optimizations when
calculating the length.
This patch is inspired by the RVV ISA and LLVM:
https://reviews.llvm.org/D99750
SELECT_VL has the same behavior as LLVM's "get_vector_length" with
the following properties:
1. Only applies to single-rgroup loops.
2. Non-SLP only.
3. Adjusts the loop control IV.
4. Adjusts the data reference IVs.
5. Allows processing fewer than vf elements in non-final iterations.
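A sketch of how properties 3-5 show up in the vectorized IR
(illustrative pseudo-gimple, not actual compiler output):
  loop:
    loop_len = .SELECT_VL (remaining, VF);
    vx = .LEN_LOAD (x, align, loop_len, 0);
    vy = .LEN_LOAD (y, align, loop_len, 0);
    vz = vx + vy;
    .LEN_STORE (z, align, loop_len, 0, vz);
    remaining -= loop_len;      /* loop control IV adjusted by loop_len */
    x += loop_len; y += loop_len; z += loop_len;  /* data reference IVs */
    if (remaining != 0) goto loop;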
Code
# void vvaddint32(size_t n, const int*x, const int*y, int*z)
# { for (size_t i=0; i<n; i++) { z[i]=x[i]+y[i]; } }
Take RVV codegen for example:
Before this patch:
vvaddint32:
ble a0,zero,.L6
csrr a4,vlenb
srli a6,a4,2
.L4:
mv a5,a0
bleu a0,a6,.L3
mv a5,a6
.L3:
vsetvli zero,a5,e32,m1,ta,ma
vle32.v v2,0(a1)
vle32.v v1,0(a2)
vsetvli a7,zero,e32,m1,ta,ma
sub a0,a0,a5
vadd.vv v1,v1,v2
vsetvli zero,a5,e32,m1,ta,ma
vse32.v v1,0(a3)
add a2,a2,a4
add a3,a3,a4
add a1,a1,a4
bne a0,zero,.L4
.L6:
ret
After this patch:
vvaddint32:
vsetvli t0, a0, e32, ta, ma # Set vector length based on 32-bit vectors
vle32.v v0, (a1) # Get first vector
sub a0, a0, t0 # Decrement number done
slli t0, t0, 2 # Multiply number done by 4 bytes
add a1, a1, t0 # Bump pointer
vle32.v v1, (a2) # Get second vector
add a2, a2, t0 # Bump pointer
vadd.vv v2, v0, v1 # Sum vectors
vse32.v v2, (a3) # Store result
add a3, a3, t0 # Bump pointer
bnez a0, vvaddint32 # Loop back
ret # Finished
Co-authored-by: Richard Sandiford <richard.sandiford@arm.com>
Co-authored-by: Richard Biener <rguenther@suse.de>
gcc/ChangeLog:
* doc/md.texi: Add SELECT_VL support.
* internal-fn.def (SELECT_VL): Ditto.
* optabs.def (OPTAB_D): Ditto.
* tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Ditto.
* tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): Ditto.
* tree-vect-stmts.cc (get_select_vl_data_ref_ptr): Ditto.
(vectorizable_store): Ditto.
(vectorizable_load): Ditto.
* tree-vectorizer.h (LOOP_VINFO_USING_SELECT_VL_P): Ditto.
|
|
DEF_INTERNAL_WIDENING_OPTAB_FN and DEF_INTERNAL_NARROWING_OPTAB_FN
are like DEF_INTERNAL_SIGNED_OPTAB_FN and DEF_INTERNAL_OPTAB_FN
respectively, except that they provide convenience wrappers
for a single vector-to-vector conversion, a hi/lo split or an even/odd
split.  Each definition for <NAME> requires either optabs
named <UOPTAB> and <SOPTAB> (for widening) or a single <OPTAB> (for
narrowing) for each of the five functions it creates.
For example, for widening addition the
DEF_INTERNAL_WIDENING_OPTAB_FN will create five internal functions:
IFN_VEC_WIDEN_PLUS, IFN_VEC_WIDEN_PLUS_HI, IFN_VEC_WIDEN_PLUS_LO,
IFN_VEC_WIDEN_PLUS_EVEN and IFN_VEC_WIDEN_PLUS_ODD.  Each requires two
optabs, one for signed and one for unsigned.
Aarch64 implements the hi/lo split optabs:
IFN_VEC_WIDEN_PLUS_HI -> vec_widen_<su>add_hi_<mode> -> (u/s)addl2
IFN_VEC_WIDEN_PLUS_LO -> vec_widen_<su>add_lo_<mode> -> (u/s)addl
This gives the same functionality as the previous
WIDEN_PLUS/WIDEN_MINUS tree codes which are expanded into
VEC_WIDEN_PLUS_LO, VEC_WIDEN_PLUS_HI.
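A minimal sketch of a widening definition in internal-fn.def (the
argument order shown here is an assumption for illustration; the
authoritative form is the macro added by this patch):
  /* Creates VEC_WIDEN_PLUS plus its _HI/_LO/_EVEN/_ODD variants,
     tying them to the signed and unsigned optab stems.  */
  DEF_INTERNAL_WIDENING_OPTAB_FN (VEC_WIDEN_PLUS,
                                  ECF_CONST | ECF_NOTHROW, first,
                                  vec_widen_sadd, vec_widen_uadd,
                                  binary)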
2023-06-05 Andre Vieira <andre.simoesdiasvieira@arm.com>
Joel Hutton <joel.hutton@arm.com>
Tamar Christina <tamar.christina@arm.com>
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (vec_widen_<su>addl_lo_<mode>): Rename
this ...
(vec_widen_<su>add_lo_<mode>): ... to this.
(vec_widen_<su>addl_hi_<mode>): Rename this ...
(vec_widen_<su>add_hi_<mode>): ... to this.
(vec_widen_<su>subl_lo_<mode>): Rename this ...
(vec_widen_<su>sub_lo_<mode>): ... to this.
(vec_widen_<su>subl_hi_<mode>): Rename this ...
(vec_widen_<su>sub_hi_<mode>): ...to this.
* doc/generic.texi: Document new IFN codes.
* internal-fn.cc (lookup_hilo_internal_fn): Add lookup function.
(commutative_binary_fn_p): Add widen_plus fn's.
(widening_fn_p): New function.
(narrowing_fn_p): New function.
(direct_internal_fn_optab): Change visibility.
* internal-fn.def (DEF_INTERNAL_WIDENING_OPTAB_FN): Macro to define an
internal_fn that expands into multiple internal_fns for widening.
(IFN_VEC_WIDEN_PLUS, IFN_VEC_WIDEN_PLUS_HI, IFN_VEC_WIDEN_PLUS_LO,
IFN_VEC_WIDEN_PLUS_EVEN, IFN_VEC_WIDEN_PLUS_ODD,
IFN_VEC_WIDEN_MINUS, IFN_VEC_WIDEN_MINUS_HI,
IFN_VEC_WIDEN_MINUS_LO, IFN_VEC_WIDEN_MINUS_ODD,
IFN_VEC_WIDEN_MINUS_EVEN): Define widening plus,minus functions.
* internal-fn.h (direct_internal_fn_optab): Declare new prototype.
(lookup_hilo_internal_fn): Likewise.
(widening_fn_p): Likewise.
(narrowing_fn_p): Likewise.
* optabs.cc (commutative_optab_p): Add widening plus optabs.
* optabs.def (OPTAB_D): Define widen add, sub optabs.
* tree-vect-patterns.cc (vect_recog_widen_op_pattern): Support
patterns with a hi/lo or even/odd split.
(vect_recog_sad_pattern): Refactor to use new IFN codes.
(vect_recog_widen_plus_pattern): Likewise.
(vect_recog_widen_minus_pattern): Likewise.
(vect_recog_average_pattern): Likewise.
* tree-vect-stmts.cc (vectorizable_conversion): Add support for
_HILO IFNs.
(supportable_widening_operation): Likewise.
* tree.def (WIDEN_SUM_EXPR): Update example to use new IFNs.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/vect-widen-add.c: Test that new
IFN_VEC_WIDEN_PLUS is being used.
* gcc.target/aarch64/vect-widen-sub.c: Test that new
IFN_VEC_WIDEN_MINUS is being used.
|
|
There has been support for generating "inbranch" SIMD clones for a long time,
but nothing actually uses them (as far as I can see).
This patch adds support for a subset of the possible cases (those using
mask_mode == VOIDmode).  The other cases fail to vectorize, just as before,
so there should be no regressions.
This subset should cover all cases currently needed by amdgcn.
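As an illustrative example (not one of the new testcases) of an
"inbranch" SIMD clone call the vectorizer can now handle:
  #pragma omp declare simd inbranch
  int f (int x);

  void
  g (int *restrict a, const int *restrict b, int n)
  {
    #pragma omp simd
    for (int i = 0; i < n; i++)
      if (b[i] > 0)
        a[i] = f (b[i]);  /* predicated call -> IFN_MASK_CALL */
  }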
gcc/ChangeLog:
* internal-fn.cc (expand_MASK_CALL): New.
* internal-fn.def (MASK_CALL): New.
* internal-fn.h (expand_MASK_CALL): New prototype.
* omp-simd-clone.cc (simd_clone_adjust_argument_types): Set vector_type
for mask arguments also.
* tree-if-conv.cc: Include cgraph.h.
(if_convertible_stmt_p): Do if conversions for calls to SIMD functions.
(predicate_statements): Convert functions to IFN_MASK_CALL.
* tree-vect-loop.cc (vect_get_datarefs_in_loop): Recognise
IFN_MASK_CALL as a SIMD function call.
* tree-vect-stmts.cc (vectorizable_simd_clone_call): Handle
IFN_MASK_CALL as an inbranch SIMD function call.
Generate the mask vector arguments.
gcc/testsuite/ChangeLog:
* gcc.dg/vect/vect-simd-clone-16.c: New test.
* gcc.dg/vect/vect-simd-clone-16b.c: New test.
* gcc.dg/vect/vect-simd-clone-16c.c: New test.
* gcc.dg/vect/vect-simd-clone-16d.c: New test.
* gcc.dg/vect/vect-simd-clone-16e.c: New test.
* gcc.dg/vect/vect-simd-clone-16f.c: New test.
* gcc.dg/vect/vect-simd-clone-17.c: New test.
* gcc.dg/vect/vect-simd-clone-17b.c: New test.
* gcc.dg/vect/vect-simd-clone-17c.c: New test.
* gcc.dg/vect/vect-simd-clone-17d.c: New test.
* gcc.dg/vect/vect-simd-clone-17e.c: New test.
* gcc.dg/vect/vect-simd-clone-17f.c: New test.
* gcc.dg/vect/vect-simd-clone-18.c: New test.
* gcc.dg/vect/vect-simd-clone-18b.c: New test.
* gcc.dg/vect/vect-simd-clone-18c.c: New test.
* gcc.dg/vect/vect-simd-clone-18d.c: New test.
* gcc.dg/vect/vect-simd-clone-18e.c: New test.
* gcc.dg/vect/vect-simd-clone-18f.c: New test.
|
|
For PR106099 I've added IFN_TRAP as an alternative to __builtin_trap
meant for __builtin_unreachable purposes (e.g. with -funreachable-traps
or some sanitizers) which doesn't need vops because __builtin_unreachable
doesn't need them either. This works in various cases, but unfortunately
IPA likes to decide on the redirection to unreachable just by tweaking
the cgraph edge to point to a different FUNCTION_DECL. As internal
functions don't have a decl, this causes problems like in the following
testcase.
The following patch fixes it by removing IFN_TRAP again and replacing
it with user inaccessible BUILT_IN_UNREACHABLE_TRAP, so that e.g.
builtin_decl_unreachable can return it directly and we don't need to tweak
it later wherever we actually replace the call stmt.
2023-02-02 Jakub Jelinek <jakub@redhat.com>
PR ipa/107300
* builtins.def (BUILT_IN_UNREACHABLE_TRAP): New builtin.
* internal-fn.def (TRAP): Remove.
* internal-fn.cc (expand_TRAP): Remove.
* tree.cc (build_common_builtin_nodes): Define
BUILT_IN_UNREACHABLE_TRAP if not yet defined.
(builtin_decl_unreachable): Use BUILT_IN_UNREACHABLE_TRAP
instead of BUILT_IN_TRAP.
* gimple.cc (gimple_build_builtin_unreachable): Remove
emitting internal function for BUILT_IN_TRAP.
* asan.cc (maybe_instrument_call): Handle BUILT_IN_UNREACHABLE_TRAP.
* cgraph.cc (cgraph_edge::verify_corresponds_to_fndecl): Handle
BUILT_IN_UNREACHABLE_TRAP instead of BUILT_IN_TRAP.
* ipa-devirt.cc (possible_polymorphic_call_target_p): Handle
BUILT_IN_UNREACHABLE_TRAP.
* builtins.cc (expand_builtin, is_inexpensive_builtin): Likewise.
* tree-cfg.cc (verify_gimple_call,
pass_warn_function_return::execute): Likewise.
* attribs.cc (decl_attributes): Don't report exclusions on
BUILT_IN_UNREACHABLE_TRAP either.
* gcc.dg/pr107300.c: New test.
|
|
The following patch implements C++23 P1774R8 - Portable assumptions
paper, by introducing support for [[assume (cond)]]; attribute for C++.
In addition to that the patch adds [[gnu::assume (cond)]]; and
__attribute__((assume (cond))); support to both C and C++.
As described in C++23, the attribute argument is a conditional-expression
rather than the usual assignment-expression for attribute arguments.
The condition is contextually converted to bool (for C, truthvalue
conversion is done on it) and is never evaluated at runtime.
For C++ constant expression evaluation, I only check the simplest conditions
for undefined behavior, because otherwise I'd need to undo changes to
*ctx->global which happened during the evaluation (but I believe the spec
allows that and we can further improve later).
The patch uses a new internal function, .ASSUME, to hold the condition
in the FEs. At gimplification time, if the condition is simple/without
side-effects, it is gimplified as if (cond) ; else __builtin_unreachable ();
and otherwise for now dropped on the floor. The intent is to incrementally
outline the conditions into separate artificial functions and use
.ASSUME further to tell the ranger and perhaps other optimization passes
about the assumptions, as detailed in the PR.
When implementing it, I found that an assume entry hasn't been added to
https://eel.is/c++draft/cpp.cond#6
Jonathan said he'll file a NB comment about it; this patch assumes it
has been added to the table as 202207L, the date the paper was voted in.
With the attributes for both C/C++, I'd say we don't need to add
__builtin_assume with similar purpose, especially when __builtin_assume
in LLVM is just weird.  It is strange for side-effects in a function call's
argument not to be evaluated, and LLVM in that case (annoyingly) warns
and ignores the side-effects (but then doesn't do anything with the call);
if there are no side-effects, it works like our
if (!cond) __builtin_unreachable ();
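An illustrative use of the new spellings (hedged; not taken from the
testcases added by the patch):
  int
  f (int x)
  {
    [[assume (x > 0)]];                     /* C++23 attribute */
    __attribute__((assume (x % 16 == 0)));  /* GNU spelling, C and C++ */
    return x / 16;   /* optimizers may rely on both conditions */
  }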
2022-10-06 Jakub Jelinek <jakub@redhat.com>
PR c++/106654
gcc/
* internal-fn.def (ASSUME): New internal function.
* internal-fn.h (expand_ASSUME): Declare.
* internal-fn.cc (expand_ASSUME): Define.
* gimplify.cc (gimplify_call_expr): Gimplify IFN_ASSUME.
* fold-const.h (simple_condition_p): Declare.
* fold-const.cc (simple_operand_p_2): Rename to ...
(simple_condition_p): ... this. Remove forward declaration.
No longer static. Adjust function comment and fix a typo in it.
Adjust recursive call.
(simple_operand_p): Adjust function comment.
(fold_truth_andor): Adjust simple_operand_p_2 callers to call
simple_condition_p.
* doc/extend.texi: Document assume attribute. Move fallthrough
attribute example to its section.
gcc/c-family/
* c-attribs.cc (handle_assume_attribute): New function.
(c_common_attribute_table): Add entry for assume attribute.
* c-lex.cc (c_common_has_attribute): Handle
__have_cpp_attribute (assume).
gcc/c/
* c-parser.cc (handle_assume_attribute): New function.
(c_parser_declaration_or_fndef): Handle assume attribute.
(c_parser_attribute_arguments): Add assume_attr argument,
if true, parse first argument as conditional expression.
(c_parser_gnu_attribute, c_parser_std_attribute): Adjust
c_parser_attribute_arguments callers.
(c_parser_statement_after_labels) <case RID_ATTRIBUTE>: Handle
assume attribute.
gcc/cp/
* cp-tree.h (process_stmt_assume_attribute): Implement C++23
P1774R8 - Portable assumptions. Declare.
(diagnose_failing_condition): Declare.
(find_failing_clause): Likewise.
* parser.cc (assume_attr): New enumerator.
(cp_parser_parenthesized_expression_list): Handle assume_attr.
Remove identifier variable, for id_attr push the identifier into
expression_list right away instead of inserting it before all the
others at the end.
(cp_parser_conditional_expression): New function.
(cp_parser_constant_expression): Use it.
(cp_parser_statement): Handle assume attribute.
(cp_parser_expression_statement): Likewise.
(cp_parser_gnu_attribute_list): Use assume_attr for assume
attribute.
(cp_parser_std_attribute): Likewise. Handle standard assume
attribute like gnu::assume.
* cp-gimplify.cc (process_stmt_assume_attribute): New function.
* constexpr.cc: Include fold-const.h.
(find_failing_clause_r, find_failing_clause): New functions,
moved from semantics.cc with ctx argument added and if non-NULL,
call cxx_eval_constant_expression rather than fold_non_dependent_expr.
(cxx_eval_internal_function): Handle IFN_ASSUME.
(potential_constant_expression_1): Likewise.
* pt.cc (tsubst_copy_and_build): Likewise.
* semantics.cc (diagnose_failing_condition): New function.
(find_failing_clause_r, find_failing_clause): Moved to constexpr.cc.
(finish_static_assert): Use it. Add auto_diagnostic_group.
gcc/testsuite/
* gcc.dg/attr-assume-1.c: New test.
* gcc.dg/attr-assume-2.c: New test.
* gcc.dg/attr-assume-3.c: New test.
* g++.dg/cpp2a/feat-cxx2a.C: Add colon to C++20 features
comment, add C++20 attributes comment and move C++20
new features after the attributes before them.
* g++.dg/cpp23/feat-cxx2b.C: Likewise. Test
__has_cpp_attribute(assume).
* g++.dg/cpp23/attr-assume1.C: New test.
* g++.dg/cpp23/attr-assume2.C: New test.
* g++.dg/cpp23/attr-assume3.C: New test.
* g++.dg/cpp23/attr-assume4.C: New test.
|
|
gcc/ChangeLog:
* internal-fn.cc (expand_GOMP_TARGET_REV): New.
* internal-fn.def (GOMP_TARGET_REV): New.
* lto-cgraph.cc (lto_output_node, verify_node_partition): Mark
'omp target device_ancestor_host' as in_other_partition and don't
error if absent.
* omp-low.cc (create_omp_child_function): Mark as 'noclone'.
* omp-expand.cc (expand_omp_target): For reverse offload, remove
sorry, use device = GOMP_DEVICE_HOST_FALLBACK and create
empty-body nohost function.
* omp-offload.cc (execute_omp_device_lower): Handle
IFN_GOMP_TARGET_REV.
(pass_omp_target_link::execute): For ACCEL_COMPILER, don't
nullify fn argument for reverse offload.
libgomp/ChangeLog:
* libgomp.texi (OpenMP 5.0): Mark 'ancestor' as implemented but
refer to 'requires'.
* testsuite/libgomp.c-c++-common/reverse-offload-1-aux.c: New test.
* testsuite/libgomp.c-c++-common/reverse-offload-1.c: New test.
* testsuite/libgomp.fortran/reverse-offload-1-aux.f90: New test.
* testsuite/libgomp.fortran/reverse-offload-1.f90: New test.
gcc/testsuite/ChangeLog:
* c-c++-common/gomp/reverse-offload-1.c: Remove dg-sorry.
* c-c++-common/gomp/target-device-ancestor-4.c: Likewise.
* gfortran.dg/gomp/target-device-ancestor-4.f90: Likewise.
* gfortran.dg/gomp/target-device-ancestor-5.f90: Likewise.
* c-c++-common/goacc/classify-kernels-parloops.c: Add 'noclone' to
scan-tree-dump-times.
* c-c++-common/goacc/classify-kernels-unparallelized-parloops.c:
Likewise.
* c-c++-common/goacc/classify-kernels-unparallelized.c: Likewise.
* c-c++-common/goacc/classify-kernels.c: Likewise.
* c-c++-common/goacc/classify-parallel.c: Likewise.
* c-c++-common/goacc/classify-serial.c: Likewise.
* c-c++-common/goacc/kernels-counter-vars-function-scope.c: Likewise.
* c-c++-common/goacc/kernels-loop-2.c: Likewise.
* c-c++-common/goacc/kernels-loop-3.c: Likewise.
* c-c++-common/goacc/kernels-loop-data-2.c: Likewise.
* c-c++-common/goacc/kernels-loop-data-enter-exit-2.c: Likewise.
* c-c++-common/goacc/kernels-loop-data-enter-exit.c: Likewise.
* c-c++-common/goacc/kernels-loop-data-update.c: Likewise.
* c-c++-common/goacc/kernels-loop-data.c: Likewise.
* c-c++-common/goacc/kernels-loop-g.c: Likewise.
* c-c++-common/goacc/kernels-loop-mod-not-zero.c: Likewise.
* c-c++-common/goacc/kernels-loop-n.c: Likewise.
* c-c++-common/goacc/kernels-loop-nest.c: Likewise.
* c-c++-common/goacc/kernels-loop.c: Likewise.
* c-c++-common/goacc/kernels-one-counter-var.c: Likewise.
* c-c++-common/goacc/kernels-parallel-loop-data-enter-exit.c: Likewise.
* gfortran.dg/goacc/classify-kernels-parloops.f95: Likewise.
* gfortran.dg/goacc/classify-kernels-unparallelized-parloops.f95:
Likewise.
* gfortran.dg/goacc/classify-kernels-unparallelized.f95: Likewise.
* gfortran.dg/goacc/classify-kernels.f95: Likewise.
* gfortran.dg/goacc/classify-parallel.f95: Likewise.
* gfortran.dg/goacc/classify-serial.f95: Likewise.
* gfortran.dg/goacc/kernels-loop-2.f95: Likewise.
* gfortran.dg/goacc/kernels-loop-data-2.f95: Likewise.
* gfortran.dg/goacc/kernels-loop-data-enter-exit-2.f95: Likewise.
* gfortran.dg/goacc/kernels-loop-data-enter-exit.f95: Likewise.
* gfortran.dg/goacc/kernels-loop-data-update.f95: Likewise.
* gfortran.dg/goacc/kernels-loop-data.f95: Likewise.
* gfortran.dg/goacc/kernels-loop-n.f95: Likewise.
* gfortran.dg/goacc/kernels-loop.f95: Likewise.
* gfortran.dg/goacc/kernels-parallel-loop-data-enter-exit.f95: Likewise.
|
|
issue [PR106099]
This patch fixes two __builtin_unreachable/__builtin_trap related issues.
One (first hunk) is that CDDCE happily removes calls to the .TRAP ()
internal-fn as useless.  The problem is that the internal-fn is
ECF_CONST | ECF_NORETURN and doesn't have an lhs, so DCE thinks it doesn't
have side-effects and removes it.  __builtin_unreachable, which has
the same ECF_* flags, works fine, since as of PR44485 we implicitly add
ECF_LOOPING_CONST_OR_PURE to ECF_CONST | ECF_NORETURN builtins, but
do so in flags_from_decl_or_type, which isn't called for internal-fns.
As IFN_TRAP is the only ifn with such flags, it seems easier to
add it explicitly.
The other issue (which on the testcase can be seen only with the
first bug unfixed) is that execute_fixup_cfg can add a __builtin_trap
which needs vops, but nothing adds them, and it can appear in many passes
which don't have corresponding TODO_update_ssa_only_virtuals etc.
Fixed similarly to last time, but by emitting the ifn there instead.
2022-08-26 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/106099
* internal-fn.def (TRAP): Add ECF_LOOPING_CONST_OR_PURE flag.
* tree-cfg.cc (execute_fixup_cfg): Add IFN_TRAP instead of
__builtin_trap to avoid the need of vops.
* gcc.dg/pr106099.c: New test.
|
|
[PR106099]
__builtin_unreachable and __ubsan_handle_builtin_unreachable don't
use vops; they are marked const/leaf/noreturn/nothrow/cold.
But __builtin_trap uses vops and isn't const, just leaf/noreturn/nothrow/cold.
I believe this is so that when users explicitly use __builtin_trap in their
sources, they get stores visible at the trap side.
-fsanitize=unreachable -fsanitize-undefined-trap-on-error used to transform
__builtin_unreachable to __builtin_trap even in the past, but the sanopt pass
has TODO_update_ssa, so it worked fine.
Now that gimple_build_builtin_unreachable can build a __builtin_trap call
right away, we can run into problems: whenever we need it, we would have
to ensure, either manually or through TODO_update*, that the vops are
updated.
Though, as it is originally a __builtin_unreachable which is just implemented
as a trap, I think for this case it is fine to avoid vops.  For this the
patch introduces IFN_TRAP, which has ECF_* flags like __builtin_unreachable
and is expanded as __builtin_trap.
2022-07-28 Jakub Jelinek <jakub@redhat.com>
PR tree-optimization/106099
* internal-fn.def (TRAP): New internal fn.
* internal-fn.h (expand_TRAP): Declare.
* internal-fn.cc (expand_TRAP): Define.
* gimple.cc (gimple_build_builtin_unreachable): For BUILT_IN_TRAP,
use internal fn rather than builtin.
* gcc.dg/ubsan/pr106099.c: New test.
|
|
The PR is about the aarch64 port using an ACLE built-in function
to vectorise a scalar function call, even though the ECF_* flags for
the ACLE function didn't match the ECF_* flags for the scalar call.
To some extent that kind of difference is inevitable, since the
ACLE intrinsics are supposed to follow the behaviour of the
underlying instruction as closely as possible. Also, using
target-specific builtins has the drawback of limiting further
gimple optimisation, since the gimple optimisers won't know what
the function does.
We handle several other maths functions, including round, floor
and ceil, by defining directly-mapped internal functions that
are linked to the associated built-in functions. This has two
main advantages:
- it means that, internally, we are not restricted to the set of
scalar types that happen to have associated C/C++ functions
- the functions (and thus the underlying optabs) extend naturally
to vectors
This patch takes the same approach for the remaining functions
handled by aarch64_builtin_vectorized_function.
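As an illustrative example (hedged; the patch's own tests are listed
below), a loop that can now vectorize through the directly-mapped
IFN_LROUND instead of a target builtin hook:
  #include <math.h>

  void
  f (long *restrict out, const double *restrict in, int n)
  {
    for (int i = 0; i < n; i++)
      out[i] = lround (in[i]);  /* lround -> IFN_LROUND -> lround optab */
  }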
gcc/
PR target/106253
* predict.h (insn_optimization_type): Declare.
* predict.cc (insn_optimization_type): New function.
* internal-fn.def (IFN_ICEIL, IFN_IFLOOR, IFN_IRINT, IFN_IROUND)
(IFN_LCEIL, IFN_LFLOOR, IFN_LRINT, IFN_LROUND, IFN_LLCEIL)
(IFN_LLFLOOR, IFN_LLRINT, IFN_LLROUND): New internal functions.
* internal-fn.cc (unary_convert_direct): New macro.
(expand_convert_optab_fn): New function.
(expand_unary_convert_optab_fn): New macro.
(direct_unary_convert_optab_supported_p): Likewise.
* optabs.cc (expand_sfix_optab): Pass insn_optimization_type to
convert_optab_handler.
* config/aarch64/aarch64-protos.h
(aarch64_builtin_vectorized_function): Delete.
* config/aarch64/aarch64-builtins.cc
(aarch64_builtin_vectorized_function): Delete.
* config/aarch64/aarch64.cc
(TARGET_VECTORIZE_BUILTIN_VECTORIZED_FUNCTION): Delete.
* config/i386/i386.cc (ix86_optab_supported_p): Handle lround_optab.
* config/i386/i386.md (lround<X87MODEF:mode><SWI248x:mode>2): Remove
optimize_insn_for_size_p test.
gcc/testsuite/
PR target/106253
* gcc.target/aarch64/vect_unary_1.c: Add tests for iroundf,
llround, iceilf, llceil, ifloorf, llfloor, irintf and llrint.
* gfortran.dg/vect/pr106253.f: New test.
|
|
The recent internal-fn “clean-ups” triggered problems on nvptx
because some of the omp_simt_* patterns had modeless operands.
I wondered about adapting expand_fn_using_insn to cope with that,
but then the problem becomes: what should the mode of operand 0
be when there is no lhs? The answer depends on the target insn.
For GOMP_SIMT_ENTER_ALLOC the answer was: use Pmode.
For GOMP_SIMT_ORDERED_PRED and others the answer was: elide the call.
(However, GOMP_SIMT_ORDERED_PRED doesn't seem to have ECF_* flags
that would normally allow it to be dropped at the gimple level.)
So these instructions seem to be special enough that they need
their own code after all. This patch reverts the second patch
and most of the first. The only part retained from the first
is splitting expand_fn_using_insn out of expand_direct_optab_fn,
since I think expand_fn_using_insn could still be useful in future.
gcc/
PR middle-end/105975
Revert everything apart from the expand_fn_using_insn and
expand_direct_optab_fn changes from:
* internal-fn.def (DEF_INTERNAL_INSN_FN): New macro.
(GOMP_SIMT_ENTER_ALLOC, GOMP_SIMT_EXIT, GOMP_SIMT_LANE)
(GOMP_SIMT_LAST_LANE, GOMP_SIMT_ORDERED_PRED, GOMP_SIMT_VOTE_ANY)
(GOMP_SIMT_XCHG_BFLY, GOMP_SIMT_XCHG_IDX): Use it.
* internal-fn.h (direct_internal_fn_info::directly_mapped): New
member variable.
(direct_internal_fn_info::vectorizable): Reduce to 1 bit.
(direct_internal_fn_p): Also return true for internal functions
that map directly to instructions defined target-insns.def.
(direct_internal_fn): Adjust comment accordingly.
* internal-fn.cc (direct_insn, optab1, optab2, vectorizable_optab1)
(vectorizable_optab2): New local macros.
(not_direct): Initialize directly_mapped.
(mask_load_direct, load_lanes_direct, mask_load_lanes_direct)
(gather_load_direct, len_load_direct, mask_store_direct)
(store_lanes_direct, mask_store_lanes_direct, vec_cond_mask_direct)
(vec_cond_direct, scatter_store_direct, len_store_direct)
(vec_set_direct, unary_direct, binary_direct, ternary_direct)
(cond_unary_direct, cond_binary_direct, cond_ternary_direct)
(while_direct, fold_extract_direct, fold_left_direct)
(mask_fold_left_direct, check_ptrs_direct): Use the macros above.
(expand_GOMP_SIMT_ENTER_ALLOC, expand_GOMP_SIMT_EXIT): Delete.
(expand_GOMP_SIMT_LANE, expand_GOMP_SIMT_LAST_LANE): Likewise.
(expand_GOMP_SIMT_ORDERED_PRED, expand_GOMP_SIMT_VOTE_ANY): Likewise.
(expand_GOMP_SIMT_XCHG_BFLY, expand_GOMP_SIMT_XCHG_IDX): Likewise.
(direct_internal_fn_types): Handle functions that map to instructions
defined in target-insns.def.
(direct_internal_fn_types): Likewise.
(direct_internal_fn_supported_p): Likewise.
(internal_fn_expanders): Likewise.
(expand_fn_using_insn): New function,
split out and adapted from...
(expand_direct_optab_fn): ...here.
(expand_GOMP_SIMT_ENTER_ALLOC): Use it.
(expand_GOMP_SIMT_EXIT): Likewise.
(expand_GOMP_SIMT_LANE): Likewise.
(expand_GOMP_SIMT_LAST_LANE): Likewise.
(expand_GOMP_SIMT_ORDERED_PRED): Likewise.
(expand_GOMP_SIMT_VOTE_ANY): Likewise.
(expand_GOMP_SIMT_XCHG_BFLY): Likewise.
(expand_GOMP_SIMT_XCHG_IDX): Likewise.
|
|
Several existing internal functions map directly to an instruction
defined in target-insns.def. This patch makes it easier to define
more such functions in future.
This should help to reduce cut-&-paste, but more importantly, it allows
the difference between optab functions and target-insns.def functions
to be abstracted away; both are now treated as “directly-mapped”.
gcc/
* internal-fn.def (DEF_INTERNAL_INSN_FN): New macro.
(GOMP_SIMT_ENTER_ALLOC, GOMP_SIMT_EXIT, GOMP_SIMT_LANE)
(GOMP_SIMT_LAST_LANE, GOMP_SIMT_ORDERED_PRED, GOMP_SIMT_VOTE_ANY)
(GOMP_SIMT_XCHG_BFLY, GOMP_SIMT_XCHG_IDX): Use it.
* internal-fn.h (direct_internal_fn_info::directly_mapped): New
member variable.
(direct_internal_fn_info::vectorizable): Reduce to 1 bit.
(direct_internal_fn_p): Also return true for internal functions
that map directly to instructions defined target-insns.def.
(direct_internal_fn): Adjust comment accordingly.
* internal-fn.cc (direct_insn, optab1, optab2, vectorizable_optab1)
(vectorizable_optab2): New local macros.
(not_direct): Initialize directly_mapped.
(mask_load_direct, load_lanes_direct, mask_load_lanes_direct)
(gather_load_direct, len_load_direct, mask_store_direct)
(store_lanes_direct, mask_store_lanes_direct, vec_cond_mask_direct)
(vec_cond_direct, scatter_store_direct, len_store_direct)
(vec_set_direct, unary_direct, binary_direct, ternary_direct)
(cond_unary_direct, cond_binary_direct, cond_ternary_direct)
(while_direct, fold_extract_direct, fold_left_direct)
(mask_fold_left_direct, check_ptrs_direct): Use the macros above.
(expand_GOMP_SIMT_ENTER_ALLOC, expand_GOMP_SIMT_EXIT): Delete.
(expand_GOMP_SIMT_LANE, expand_GOMP_SIMT_LAST_LANE): Likewise.
(expand_GOMP_SIMT_ORDERED_PRED, expand_GOMP_SIMT_VOTE_ANY): Likewise.
(expand_GOMP_SIMT_XCHG_BFLY, expand_GOMP_SIMT_XCHG_IDX): Likewise.
(direct_internal_fn_types): Handle functions that map to instructions
defined in target-insns.def.
(direct_internal_fn_types): Likewise.
(direct_internal_fn_supported_p): Likewise.
(internal_fn_expanders): Likewise.
|
|
C++20:
#include <compare>
auto cmp4way(double a, double b)
{
return a <=> b;
}
expands to:
ucomisd %xmm1, %xmm0
jp .L8
movl $0, %eax
jne .L8
.L2:
ret
.p2align 4,,10
.p2align 3
.L8:
comisd %xmm0, %xmm1
movl $-1, %eax
ja .L2
ucomisd %xmm1, %xmm0
setbe %al
addl $1, %eax
ret
That is 3 comparisons of the same operands.
The following patch improves it to just one comparison:
comisd %xmm1, %xmm0
jp .L4
seta %al
movl $0, %edx
leal -1(%rax,%rax), %eax
cmove %edx, %eax
ret
.L4:
movl $2, %eax
ret
While a <=> b expands to a == b ? 0 : a < b ? -1 : a > b ? 1 : 2,
where the first comparison is equality and shouldn't raise
exceptions on qNaN operands, if the operands aren't equal (which
includes unordered cases), then it immediately performs a < or >
comparison and that raises exceptions even on qNaNs, so we can just
perform a single comparison that raises exceptions on qNaN.
As the 4 different cases are encoded as
ZF CF PF
1 1 1 a unordered b
0 0 0 a > b
0 1 0 a < b
1 0 0 a == b
we can emit an optimal sequence of comparisons: first jp
for the unordered case, then je for the == case and finally jb
for the < case.
The patch pattern-recognizes spaceship-like comparisons during
widening_mul if the spaceship optab is implemented, and replaces
those comparisons with comparisons of the .SPACESHIP ifn result, which is
-1/0/1/2 based on the comparison.  This seems to work well both for the
case of just returning the -1/0/1/2 (when we have just a common
successor with a PHI) or when the different cases are handled with
various other basic blocks. The testcases cover both of those cases,
the latter with different function calls in those.
2022-01-17 Jakub Jelinek <jakub@redhat.com>
PR target/103973
* tree-cfg.h (cond_only_block_p): Declare.
* tree-ssa-phiopt.c (cond_only_block_p): Move function to ...
* tree-cfg.c (cond_only_block_p): ... here. No longer static.
* optabs.def (spaceship_optab): New optab.
* internal-fn.def (SPACESHIP): New internal function.
* internal-fn.h (expand_SPACESHIP): Declare.
* internal-fn.c (expand_PHI): Formatting fix.
(expand_SPACESHIP): New function.
* tree-ssa-math-opts.c (optimize_spaceship): New function.
(math_opts_dom_walker::after_dom_children): Use it.
* config/i386/i386.md (spaceship<mode>3): New define_expand.
* config/i386/i386-protos.h (ix86_expand_fp_spaceship): Declare.
* config/i386/i386-expand.c (ix86_expand_fp_spaceship): New function.
* doc/md.texi (spaceship@var{m}3): Document.
* gcc.target/i386/pr103973-1.c: New test.
* gcc.target/i386/pr103973-2.c: New test.
* gcc.target/i386/pr103973-3.c: New test.
* gcc.target/i386/pr103973-4.c: New test.
* gcc.target/i386/pr103973-5.c: New test.
* gcc.target/i386/pr103973-6.c: New test.
* gcc.target/i386/pr103973-7.c: New test.
* gcc.target/i386/pr103973-8.c: New test.
* gcc.target/i386/pr103973-9.c: New test.
* gcc.target/i386/pr103973-10.c: New test.
* gcc.target/i386/pr103973-11.c: New test.
* gcc.target/i386/pr103973-12.c: New test.
* gcc.target/i386/pr103973-13.c: New test.
* gcc.target/i386/pr103973-14.c: New test.
* gcc.target/i386/pr103973-15.c: New test.
* gcc.target/i386/pr103973-16.c: New test.
* gcc.target/i386/pr103973-17.c: New test.
* gcc.target/i386/pr103973-18.c: New test.
* gcc.target/i386/pr103973-19.c: New test.
* gcc.target/i386/pr103973-20.c: New test.
* g++.target/i386/pr103973-1.C: New test.
* g++.target/i386/pr103973-2.C: New test.
* g++.target/i386/pr103973-3.C: New test.
* g++.target/i386/pr103973-4.C: New test.
* g++.target/i386/pr103973-5.C: New test.
* g++.target/i386/pr103973-6.C: New test.
* g++.target/i386/pr103973-7.C: New test.
* g++.target/i386/pr103973-8.C: New test.
* g++.target/i386/pr103973-9.C: New test.
* g++.target/i386/pr103973-10.C: New test.
* g++.target/i386/pr103973-11.C: New test.
* g++.target/i386/pr103973-12.C: New test.
* g++.target/i386/pr103973-13.C: New test.
* g++.target/i386/pr103973-14.C: New test.
* g++.target/i386/pr103973-15.C: New test.
* g++.target/i386/pr103973-16.C: New test.
* g++.target/i386/pr103973-17.C: New test.
* g++.target/i386/pr103973-18.C: New test.
* g++.target/i386/pr103973-19.C: New test.
* g++.target/i386/pr103973-20.C: New test.
|
|
{==,!=,<,<=,>,>=} 0 [PR98737]
On Wed, Jan 27, 2021 at 12:27:13PM +0100, Ulrich Drepper via Gcc-patches wrote:
> On 1/27/21 11:37 AM, Jakub Jelinek wrote:
> > Would equality comparison against 0 handle the most common cases.
> >
> > The user can write it as
> > __atomic_sub_fetch (x, y, z) == 0
> > or
> > __atomic_fetch_sub (x, y, z) - y == 0
> > thouch, so the expansion code would need to be able to cope with both.
>
> Please also keep !=0, <0, <=0, >0, and >=0 in mind. They all can be
> useful and can be handled with the flags.
<= 0 and > 0 don't really work well with lock {add,sub,inc,dec}; x86 doesn't
have comparisons that would look solely at both SF and ZF and not at other
flags (and emitting two separate conditional jumps or two setcc insns and
oring them together looks awful).
But the rest can work.
Here is a patch that adds internal functions and optabs for these,
recognizes them at the same spot as e.g. the .ATOMIC_BIT_TEST_AND* internal
functions (the fold-all-builtins pass) and expands them appropriately (or,
for the <= 0 and > 0 cases of +/-, FAILs and lets the middle-end fall back).
So far I have handled just the op_fetch builtins; IMHO, instead of also
handling __atomic_fetch_sub (x, y, z) - y == 0 etc., we should canonicalize
__atomic_fetch_sub (x, y, z) - y to __atomic_sub_fetch (x, y, z) (and vice
versa).
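An illustrative shape of the recognized pattern (hedged; the real
testcases are the gcc.target/i386/pr98737-*.c files below):
  int
  dec_and_test (unsigned long *counter)
  {
    /* Folded to .ATOMIC_SUB_FETCH_CMP_0 and expandable as a lock sub
       followed by sete, without re-reading the updated value.  */
    return __atomic_sub_fetch (counter, 1, __ATOMIC_ACQ_REL) == 0;
  }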
2022-01-03 Jakub Jelinek <jakub@redhat.com>
PR target/98737
* internal-fn.def (ATOMIC_ADD_FETCH_CMP_0, ATOMIC_SUB_FETCH_CMP_0,
ATOMIC_AND_FETCH_CMP_0, ATOMIC_OR_FETCH_CMP_0, ATOMIC_XOR_FETCH_CMP_0):
New internal fns.
* internal-fn.h (ATOMIC_OP_FETCH_CMP_0_EQ, ATOMIC_OP_FETCH_CMP_0_NE,
ATOMIC_OP_FETCH_CMP_0_LT, ATOMIC_OP_FETCH_CMP_0_LE,
ATOMIC_OP_FETCH_CMP_0_GT, ATOMIC_OP_FETCH_CMP_0_GE): New enumerators.
* internal-fn.c (expand_ATOMIC_ADD_FETCH_CMP_0,
expand_ATOMIC_SUB_FETCH_CMP_0, expand_ATOMIC_AND_FETCH_CMP_0,
expand_ATOMIC_OR_FETCH_CMP_0, expand_ATOMIC_XOR_FETCH_CMP_0): New
functions.
* optabs.def (atomic_add_fetch_cmp_0_optab,
atomic_sub_fetch_cmp_0_optab, atomic_and_fetch_cmp_0_optab,
atomic_or_fetch_cmp_0_optab, atomic_xor_fetch_cmp_0_optab): New
direct optabs.
* builtins.h (expand_ifn_atomic_op_fetch_cmp_0): Declare.
* builtins.c (expand_ifn_atomic_op_fetch_cmp_0): New function.
* tree-ssa-ccp.c: Include internal-fn.h.
(optimize_atomic_bit_test_and): Add . before internal fn call
in function comment. Change return type from void to bool and
return true only if successfully replaced.
(optimize_atomic_op_fetch_cmp_0): New function.
(pass_fold_builtins::execute): Use optimize_atomic_op_fetch_cmp_0
for BUILT_IN_ATOMIC_{ADD,SUB,AND,OR,XOR}_FETCH_{1,2,4,8,16} and
BUILT_IN_SYNC_{ADD,SUB,AND,OR,XOR}_AND_FETCH_{1,2,4,8,16},
for *XOR* ones only if optimize_atomic_bit_test_and failed.
* config/i386/sync.md (atomic_<plusminus_mnemonic>_fetch_cmp_0<mode>,
atomic_<logic>_fetch_cmp_0<mode>): New define_expand patterns.
(atomic_add_fetch_cmp_0<mode>_1, atomic_sub_fetch_cmp_0<mode>_1,
atomic_<logic>_fetch_cmp_0<mode>_1): New define_insn patterns.
* doc/md.texi (atomic_add_fetch_cmp_0<mode>,
atomic_sub_fetch_cmp_0<mode>, atomic_and_fetch_cmp_0<mode>,
atomic_or_fetch_cmp_0<mode>, atomic_xor_fetch_cmp_0<mode>): Document
new named patterns.
* gcc.target/i386/pr98737-1.c: New test.
* gcc.target/i386/pr98737-2.c: New test.
* gcc.target/i386/pr98737-3.c: New test.
* gcc.target/i386/pr98737-4.c: New test.
* gcc.target/i386/pr98737-5.c: New test.
* gcc.target/i386/pr98737-6.c: New test.
* gcc.target/i386/pr98737-7.c: New test.
|