Age | Commit message | Author | Files | Lines |
|
The motivation of this patch is to use ld/sd where possible when zilsd
is enabled. However, the subreg pass may split such an access into two lw/sw
instructions because of the cost model, which only checks the cost of a
64-bit register move; that is why we need to adjust the cost for 64-bit
register moves as well.
Even with the adjusted cost model, a 64-bit shift still uses 32-bit loads,
because it has already been split at expand time. This probably needs a fix
on the expander side and will take some more time to investigate, so for now
I have added a testcase with XFAIL to document the current behavior; we can
fix it when we have time.
Longer term, we may add a new field to riscv_tune_param to control
the cost model for this.
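As a rough illustration (a hypothetical sketch, not necessarily the new
testcases), the kind of access affected is a plain 64-bit copy on rv32 with
the Zilsd extension enabled, which should ideally become a single ld/sd pair
rather than two lw/sw pairs:
/* Hypothetical sketch: with Zilsd enabled on rv32, the 64-bit copy below
   should use one ld and one sd; without the cost adjustment the subreg
   pass could split it into two lw/sw pairs.  */
void
copy64 (unsigned long long *dst, unsigned long long *src)
{
  *dst = *src;
}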
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_cost_model): Add cost model for
zilsd.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/zilsd-code-gen-split-subreg-1.c: New test.
* gcc.target/riscv/zilsd-code-gen-split-subreg-2.c: New test.
|
|
gcc/ChangeLog:
PR target/120697
* config/i386/i386.cc (ix86_expand_prologue):
Remove 3 assertions and associated code.
gcc/testsuite/ChangeLog:
PR target/120697
* gcc.target/i386/stack-clash-protection.c: New test.
|
|
This patch combines vec_duplicate + vmin.vv into vmin.vx; see the example
code below. The related pattern depends on the cost of a vec_duplicate from
GR2VR: late-combine performs the combination if the GR2VR cost is zero and
rejects it if the GR2VR cost is greater than zero.
Assume we have the example code below, with a GR2VR cost of 0.
#define DEF_VX_BINARY(T, FUNC) \
void \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = FUNC (in[i], x); \
}
int32_t min(int32_t a, int32_t b)
{
return a > b ? b : a;
}
DEF_VX_BINARY(int32_t, min)
Before this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vmin.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vmin.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add
new case SMIN.
(expand_vx_binary_vec_vec_dup): Ditto.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op smin.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
This commit implements the target macros (TARGET_SHRINK_WRAP_*) that
enable separate shrink wrapping for function prologues/epilogues in
x86.
When performing separate shrink wrapping, we use mov instead of push/pop:
with push/pop the rsp adjustment is more complicated to handle and may cost
performance, whereas mov has a small impact on code size but preserves
performance.
Using mov means we need sub/add to maintain the stack frame, and in some
special cases we must use lea instead so that EFLAGS is not clobbered.
For example, inserting a sub between a test-je-jle sequence would change
EFLAGS, so lea is used there:
foo:
xorl %eax, %eax
testl %edi, %edi
je .L11
sub $16, %rsp ------> leaq -16(%rsp), %rsp
movq %r13, 8(%rsp)
movl $1, %r13d
jle .L4
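As a hedged sketch (hypothetical code, not taken from SPEC), functions of the
following shape are the intended beneficiaries: the early return needs no
callee-saved registers, so their saves are emitted only on the slow path,
much like in the assembly above.
extern int bar (int);

int
foo (int x)
{
  if (x == 0)
    return 0;            /* fast path: no register saves, no stack adjustment */
  int acc = 1;           /* slow path: likely needs a callee-saved register */
  for (int i = 0; i < x; i++)
    acc += bar (acc);
  return acc;
}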
Tested against SPEC CPU 2017, this change always has a net-positive
effect on the dynamic instruction count. See the following table for
the breakdown on how this reduces the number of dynamic instructions
per workload on a like-for-like basis (with/without this commit):
instruction count base with commit (commit-base)/commit
502.gcc_r 98666845943 96891561634 -1.80%
526.blender_r 6.21226E+11 6.12992E+11 -1.33%
520.omnetpp_r 1.1241E+11 1.11093E+11 -1.17%
500.perlbench_r 1271558717 1263268350 -0.65%
523.xalancbmk_r 2.20103E+11 2.18836E+11 -0.58%
531.deepsjeng_r 2.73591E+11 2.72114E+11 -0.54%
500.perlbench_r 64195557393 63881512409 -0.49%
541.leela_r 2.99097E+11 2.98245E+11 -0.29%
548.exchange2_r 1.27976E+11 1.27784E+11 -0.15%
527.cam4_r 88981458425 88887334679 -0.11%
554.roms_r 2.60072E+11 2.59809E+11 -0.10%
Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No performance regression was observed.
For O2 multi-copy:
511.povray_r improved by 2.8% on ZNVER5.
511.povray_r improved by 4% on EMR.
511.povray_r improved by 3.3%~4.6% on ICELAKE.
gcc/ChangeLog:
* config/i386/i386-protos.h (ix86_get_separate_components):
New function.
(ix86_components_for_bb): Likewise.
(ix86_disqualify_components): Likewise.
(ix86_emit_prologue_components): Likewise.
(ix86_emit_epilogue_components): Likewise.
(ix86_set_handled_components): Likewise.
* config/i386/i386.cc (save_regs_using_push_pop):
Split from ix86_compute_frame_layout.
(ix86_compute_frame_layout):
Use save_regs_using_push_pop.
(pro_epilogue_adjust_stack):
Use gen_pro_epilogue_adjust_stack_add_nocc.
(ix86_expand_prologue): Add some assertions and adjust
the stack frame at the beginning of the prolog for shrink
wrapping separate.
(ix86_emit_save_regs_using_mov):
Skip registers that are wrapped separately.
(ix86_emit_restore_regs_using_mov): Likewise.
(ix86_expand_epilogue): Add some assertions and set
restore_regs_via_mov to true for shrink wrapping separate.
(ix86_get_separate_components): New function.
(ix86_components_for_bb): Likewise.
(ix86_disqualify_components): Likewise.
(ix86_emit_prologue_components): Likewise.
(ix86_emit_epilogue_components): Likewise.
(ix86_set_handled_components): Likewise.
(TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define.
(TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise.
(TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise.
(TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise.
(TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise.
(TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise.
* config/i386/i386.h (struct machine_function): Add
reg_is_wrapped_separately array for register wrapping
information.
* config/i386/i386.md
(@pro_epilogue_adjust_stack_add_nocc<mode>): New.
gcc/testsuite/ChangeLog:
* gcc.target/x86_64/abi/callabi/leaf-2.c: Adjust the test.
* gcc.target/i386/interrupt-16.c: Likewise.
* gfortran.dg/guality/arg1.f90: Likewise.
* gcc.target/i386/avx10_2-comibf-1.c: Likewise.
* g++.target/i386/shrink_wrap_separate.C: New test.
* gcc.target/i386/shrink_wrap_separate_check_lea.c: Likewise.
Co-authored-by: Michael Matz <matz@suse.de>
|
|
By using the scratch register for loop control rather than the output
of the lr instruction we can avoid an unnecessary "mv" instruction.
--
V2: Testcase updated; no regressions found with the following changes.
gcc/ChangeLog:
* config/riscv/sync.md (lrsc_atomic_exchange<mode>): Use scratch
register for loop control rather than lr output.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/zalrsc.c: New test.
|
|
We generated inefficient code for bitfield references to Advanced
SIMD structure modes. In RTL, these modes are just extra-long
vectors, and so inserting and extracting an element is simply
a vec_set or vec_extract operation.
For the record, I don't think these modes should ever become fully
fledged vector modes. We shouldn't provide add, etc. for them.
But vec_set and vec_extract are the vector equivalent of insv
and extv. From that point of view, they seem closer to moves
than to arithmetic.
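One plausible source of such bitfield references (a hypothetical example, not
one of the new tests) is extracting a single lane from an Advanced SIMD tuple
type:
#include <arm_neon.h>

/* v has an Advanced SIMD structure mode; reading one lane of one of its
   vectors can now be expressed as a simple vec_extract.  */
float32_t
get_lane (float32x4x2_t v)
{
  return vgetq_lane_f32 (v.val[1], 2);
}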
gcc/
PR target/113027
* config/aarch64/aarch64-protos.h (aarch64_decompose_vec_struct_index):
Declare.
* config/aarch64/aarch64.cc (aarch64_decompose_vec_struct_index): New
function.
* config/aarch64/iterators.md (VEL, Vel): Add Advanced SIMD
structure modes.
* config/aarch64/aarch64-simd.md (vec_set<VSTRUCT_QD:mode>)
(vec_extract<VSTRUCT_QD:mode>): New patterns.
gcc/testsuite/
PR target/113027
* gcc.target/aarch64/pr113027-1.c: New test.
* gcc.target/aarch64/pr113027-2.c: Likewise.
* gcc.target/aarch64/pr113027-3.c: Likewise.
* gcc.target/aarch64/pr113027-4.c: Likewise.
* gcc.target/aarch64/pr113027-5.c: Likewise.
* gcc.target/aarch64/pr113027-6.c: Likewise.
* gcc.target/aarch64/pr113027-7.c: Likewise.
|
|
This patch introduces expanders for FP<-FP conversions that leverage
partial vector modes. We also extend the INT<-FP and FP<-INT conversions
using the same approach.
The ACLE enables vectorized conversions like the following:
fcvt z0.h, p7/m, z1.s
modelling the source vector as VNx4SF:
... | SF| SF| SF| SF|
and the destination as a VNx8HF, where this operation would yield:
... | 0 | HF| 0 | HF| 0 | HF| 0 | HF|
hence the useful results are stored unpacked, i.e.
... | X | HF| X | HF| X | HF| X | HF| (VNx4HF)
This patch allows the vectorizer to use this variant of fcvt as a
conversion from VNx4SF to VNx4HF. The same idea applies to widening
conversions, and between vectors with FP and integer base types.
If the source itself had been unpacked, e.g.
... | X | SF| X | SF| (VNx2SF)
The result would yield
... | X | X | X | HF| X | X | X | HF| (VNx2HF)
The upper bits of each container here are undefined; it is important to
avoid interpreting them during FP operations, since doing so could introduce
spurious traps. The obvious route, taken here, is to mask the undefined
lanes using the operation's predicate when flag_trapping_math is set.
The VPRED predicate mode (e.g. VNx2BI here) cannot do this; to ensure
correct behavior, we need a predicate mode that can control the data as if
it were fully-packed (VNx4BI).
Both VNx2BI and VNx4BI must be recognised as legal governing predicate modes
by the corresponding FP insns. In general, the governing predicate mode for
an insn could be any such with at least as many significant lanes as the data
mode. For example, addvnx4hf3 could be controlled by any of VNx{4,8,16}BI.
We implement 'aarch64_predicate_operand', a new define_special_predicate, to
achieve this.
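As a hedged example (assumed, not one of the new tests), a simple narrowing
loop like the one below is the kind of conversion the vectorizer can now
implement with the unpacked fcvt form when targeting SVE:
/* float -> _Float16 narrowing; with SVE the results can stay in the
   unpacked VNx4HF layout described above.  */
void
narrow (_Float16 *restrict dst, float *restrict src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = src[i];
}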
gcc/ChangeLog:
* config/aarch64/aarch64-protos.h (aarch64_sve_valid_pred_p):
Declare helper for aarch64_predicate_operand.
(aarch64_sve_packed_pred): Declare helper for new expanders.
(aarch64_sve_fp_pred): Likewise.
* config/aarch64/aarch64-sve.md (<optab><mode><v_int_equiv>2):
Extend into...
(<optab><SVE_HSF:mode><SVE_HSDI:mode>2): New expander for converting
vectors of HF,SF to vectors of HI,SI,DI.
(<optab><VNx2DF_ONLY:mode><SVE_2SDI:mode>2): New expander for converting
vectors of SI,DI to vectors of DF.
(*aarch64_sve_<optab>_nontrunc<SVE_PARTIAL_F:mode><SVE_HSDI:mode>):
New pattern to match those we've added here.
(@aarch64_sve_<optab>_trunc<VNx2DF_ONLY:mode><VNx4SI_ONLY:mode>): Extend
into...
(@aarch64_sve_<optab>_trunc<VNx2DF_ONLY:mode><SVE_SI:mode>): Match both
VNx2SI<-VNx2DF and VNx4SI<-VNx4DF.
(<optab><v_int_equiv><mode>2): Extend into...
(<optab><SVE_HSDI:mode><SVE_F:mode>2): New expander for converting vectors
of HI,SI,DI to vectors of HF,SF,DF.
(*aarch64_sve_<optab>_nonextend<SVE_HSDI:mode><SVE_PARTIAL_F:mode>): New
pattern to match those we've added here.
(trunc<SVE_SDF:mode><SVE_PARTIAL_HSF:mode>2): New expander to handle
narrowing ('truncating') FP<-FP conversions.
(*aarch64_sve_<optab>_trunc<SVE_SDF:mode><SVE_PARTIAL_HSF:mode>): New
pattern to handle those we've added here.
(extend<SVE_PARTIAL_HSF:mode><SVE_SDF:mode>2): New expander to handle
widening ('extending') FP<-FP conversions.
(*aarch64_sve_<optab>_nontrunc<SVE_PARTIAL_HSF:mode><SVE_SDF:mode>): New
pattern to handle those we've added here.
* config/aarch64/aarch64.cc (aarch64_sve_packed_pred): New function.
(aarch64_sve_fp_pred): Likewise.
(aarch64_sve_valid_pred_p): Likewise.
* config/aarch64/iterators.md (SVE_PARTIAL_HSF): New mode iterator.
(SVE_HSF): Likewise.
(SVE_SDF): Likewise.
(SVE_SI): Likewise.
(SVE_2SDI): Likewise.
(self_mask): Extend to all integer/FP vector modes.
(narrower_mask): Likewise (excluding QI).
* config/aarch64/predicates.md (aarch64_predicate_operand): New special
predicate to handle narrower predicate modes.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/sve/pack_fcvt_signed_1.c: Disable the aarch64 vector
cost model to preserve this test.
* gcc.target/aarch64/sve/pack_fcvt_unsigned_1.c: Likewise.
* gcc.target/aarch64/sve/pack_float_1.c: Likewise.
* gcc.target/aarch64/sve/unpack_float_1.c: Likewise.
* gcc.target/aarch64/sve/unpacked_cvtf_1.c: New test.
* gcc.target/aarch64/sve/unpacked_cvtf_2.c: Likewise.
* gcc.target/aarch64/sve/unpacked_cvtf_3.c: Likewise.
* gcc.target/aarch64/sve/unpacked_fcvt_1.c: Likewise.
* gcc.target/aarch64/sve/unpacked_fcvt_2.c: Likewise.
* gcc.target/aarch64/sve/unpacked_fcvtz_1.c: Likewise.
* gcc.target/aarch64/sve/unpacked_fcvtz_2.c: Likewise.
|
|
Define new iterators for partial floating-point modes, and cover these
in some existing mode_attrs. This patch serves as a starting point for
an effort to extend support for unpacked floating-point operations.
To differentiate between BFloat mode iterators that need to test
TARGET_SSVE_B16B16, and those that don't (see LOGICALF), this patch
enforces the following naming convention:
- _BF: BF16 modes will not test TARGET_SSVE_B16B16.
- _B16B16: BF16 modes will test TARGET_SSVE_B16B16.
gcc/ChangeLog:
* config/aarch64/aarch64-sve.md: Replace uses of SVE_FULL_F_BF
with SVE_FULL_F_B16B16.
Replace use of SVE_F with SVE_F_BF.
* config/aarch64/iterators.md (SVE_PARTIAL_F): New iterator for
partial SVE FP modes.
(SVE_FULL_F_BF): Rename to SVE_FULL_F_B16B16.
(SVE_PARTIAL_F_B16B16): New iterator (BF16 included) for partial
SVE FP modes.
(SVE_F_B16B16): New iterator for all SVE FP modes.
(SVE_BF): New iterator for all SVE BF16 modes.
(SVE_F): Redefine to exclude BF16 modes.
(SVE_F_BF): New iterator to replace the previous SVE_F.
(VPRED): Describe the VPRED mapping for partial vector modes.
(b): Cover partial FP modes.
(is_bf16): Likewise.
|
|
GCS (Guarded Control Stack, an Armv9.4-A extension) requires some
caution at runtime. The runtime linker needs to reason about the
compatibility of a set of relocatable object files that might not
have been compiled with the same compiler.
Up until now, this metadata, used for the previously mentioned
runtime checks, has been provided to the runtime linker via GNU
properties, which are stored in the ELF section ".note.gnu.property".
However, GNU properties are limited in their expressiveness, and a
long-term commitment was made in the ABI for the Arm architecture
[1] to provide Build Attributes (a.k.a. BAs).
This patch adds the support for emitting AArch64 Build Attributes.
This support includes generating two new assembler directives:
.aeabi_subsection and .aeabi_attribute. These directives are generated
as per the syntax mentioned in spec "Build Attributes for the Arm®
64-bit Architecture (AArch64)" available at [1].
gcc/configure.ac now includes a new check to test whether the
assembler being used to build the toolchain supports these new
directives.
Two behaviors can be observed when -mbranch-protection=[standard|...]
is passed:
- If the assembler supports BAs, GCC emits the BA directives and
no GNU properties. Note: the static linker will derive the values
of the GNU properties from the BAs, and will emit both BAs and GNU
properties into the output object.
- If the assembler does not support them, only the .note.gnu.property
section will contain the relevant information.
Bootstrapped on aarch64-none-linux-gnu, and no regression found.
[1]: https://github.com/ARM-software/abi-aa/pull/230
gcc/ChangeLog:
* config.in: Regenerate.
* config/aarch64/aarch64-elf-metadata.h
(class aeabi_subsection): New class for BAs.
* config/aarch64/aarch64-protos.h
(aarch64_pacret_enabled): New function.
* config/aarch64/aarch64.cc
(HAVE_AS_AEABI_BUILD_ATTRIBUTES): New definition.
(aarch64_file_end_indicate_exec_stack): Emit BAs.
(aarch64_pacret_enabled): New function.
(aarch64_start_file): Indent.
* configure: Regenerate.
* configure.ac: New configure check for BAs support in binutils.
gcc/testsuite/ChangeLog:
* lib/target-supports.exp:
(check_effective_target_aarch64_gas_has_build_attributes): New checker.
* gcc.target/aarch64/build-attributes/aarch64-build-attributes.exp: New DejaGNU file.
* gcc.target/aarch64/build-attributes/build-attribute-bti.c: New test.
* gcc.target/aarch64/build-attributes/build-attribute-gcs.c: New test.
* gcc.target/aarch64/build-attributes/build-attribute-pac.c: New test.
* gcc.target/aarch64/build-attributes/build-attribute-standard.c: New test.
* gcc.target/aarch64/build-attributes/no-build-attribute-bti.c: New test.
* gcc.target/aarch64/build-attributes/no-build-attribute-gcs.c: New test.
* gcc.target/aarch64/build-attributes/no-build-attribute-pac.c: New test.
* gcc.target/aarch64/build-attributes/no-build-attribute-standard.c: New test.
Co-Authored-By: Srinath Parvathaneni <srinath.parvathaneni@arm.com>
|
|
The code emitting the GNU properties is moved to a separate file to
improve modularity and relieve the 31000-line aarch64.cc file
of a few lines.
It introduces a new "aarch64::" namespace for the AArch64 backend, which
reduces the length of function names by not prepending 'aarch64_' to
each of them.
gcc/ChangeLog:
* Makefile.in: Add missing declaration of BACKEND_H.
* config.gcc: Add aarch64-elf-metadata.o to extra_objs.
* config/aarch64/aarch64-elf-metadata.h: New file.
* config/aarch64/aarch64-elf-metadata.cc: New file.
* config/aarch64/aarch64.cc
(GNU_PROPERTY_AARCH64_FEATURE_1_AND): Removed.
(GNU_PROPERTY_AARCH64_FEATURE_1_BTI): Likewise.
(GNU_PROPERTY_AARCH64_FEATURE_1_PAC): Likewise.
(GNU_PROPERTY_AARCH64_FEATURE_1_GCS): Likewise.
(aarch64_file_end_indicate_exec_stack): Move GNU properties code to
aarch64-elf-metadata.cc
* config/aarch64/t-aarch64: Declare target aarch64-elf-metadata.o
|
|
GNU properties are emitted to provide some information about the features
used in the generated code, like BTI, GCS, or PAC. However, no debug
comments are emitted in the generated assembly even if -dA is provided,
which makes understanding the information stored in the .note.gnu.property
section more difficult than needed.
This patch adds assembly comments (if -dA is provided) next to the GNU
properties. For instance, if BTI and PAC are enabled, it will emit:
.word 0x3 // GNU_PROPERTY_AARCH64_FEATURE_1_AND (BTI, PAC)
gcc/ChangeLog:
* config/aarch64/aarch64.cc
(aarch64_file_end_indicate_exec_stack): Emit assembly comments.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/bti-1.c: Emit assembly comments, and update
test assertion.
|
|
reloads"
Since there are no unwanted reg-reg moves during DFmode input reloads in
recent GCCs, the previously committed patch
"xtensa: eliminate unwanted reg-reg moves during DFmode input reloads"
(commit cfad4856fa46abc878934a9433d0bfc2482ccf00) is no longer necessary
and is therefore being reverted.
gcc/ChangeLog:
* config/xtensa/predicates.md (reload_operand):
Remove.
* config/xtensa/xtensa.md:
Remove the peephole2 pattern that was previously added.
|
|
Due to improved register allocation for GP registers whose modes have been
changed by paradoxical SUBREGs, the previously committed patch
"xtensa: eliminate unnecessary general-purpose reg-reg moves"
(commit f83e76c3f998c8708fe2ddca16ae3f317c39c37a) is no longer necessary
and is therefore reverted.
gcc/ChangeLog:
* config/xtensa/xtensa.md:
Remove the peephole2 pattern that was previously added.
gcc/testsuite/ChangeLog:
* gcc.target/xtensa/elim_GP_regmove_0.c: Remove.
* gcc.target/xtensa/elim_GP_regmove_1.c: Remove.
|
|
This patch combines vec_duplicate + vmaxu.vv into vmaxu.vx; see the example
code below. The related pattern depends on the cost of a vec_duplicate from
GR2VR: late-combine performs the combination if the GR2VR cost is zero and
rejects it if the GR2VR cost is greater than zero.
Assume we have the example code below, with a GR2VR cost of 0.
#define DEF_VX_BINARY(T, OP) \
void \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = in[i] OP x; \
}
DEF_VX_BINARY(int32_t, /)
Before this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vmaxu.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vmaxu.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add new
case UMAX.
(expand_vx_binary_vec_vec_dup): Ditto.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op umax.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
The problem with PR120423 and PR116389 is that reload might assign an invalid
hard register to a paradoxical subreg. For example with the test case from
the PR, it assigns (REG:QI 31) to the inner of (subreg:HI (QI) 0) which is
valid, but the subreg will be turned into (REG:HI 31) which is invalid
and triggers an ICE in postreload.
The problem only occurs with the old reload pass.
The patch maps the paradoxical subregs to zero-extends, which will be
allocated correctly. For the 120423 testcases, the code is the same as
with -mlra (which doesn't implement the fix), so the patch doesn't even
introduce a performance penalty.
The patch is only needed for v15: v14 is not affected, and in v16 reload
will be removed.
PR rtl-optimization/120423
PR rtl-optimization/116389
gcc/
* config/avr/avr.md [-mno-lra]: Add pre-reload split to transform
(left shift of) a paradoxical subreg to a (left shift of) zero-extend.
gcc/testsuite/
* gcc.target/avr/torture/pr120423-1.c: New test.
* gcc.target/avr/torture/pr120423-2.c: New test.
* gcc.target/avr/torture/pr120423-116389.c: New test.
(cherry picked from commit 61789b5abec3079d02ee9eaa7468015ab1f6f701)
|
|
Add combiner patterns for folding NOT+PTEST to NOTS when they share
the same governing predicate.
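A hedged ACLE-level illustration (hypothetical, only similar in spirit to the
new not_1.c test): the NOT result feeds only a PTEST under the same governing
predicate, so the pair can fold into NOTS, which sets the flags directly.
#include <arm_sve.h>

int
any_clear (svbool_t pg, svbool_t x)
{
  /* svnot_z + svptest_any with the same pg: candidate for NOTS.  */
  return svptest_any (pg, svnot_z (pg, x));
}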
gcc/ChangeLog:
PR target/118150
* config/aarch64/aarch64-sve.md (*one_cmpl<mode>3_cc): New
combiner pattern.
(*one_cmpl<mode>3_ptest): Likewise.
gcc/testsuite/ChangeLog:
PR target/118150
* gcc.target/aarch64/sve/acle/general/not_1.c: New test.
|
|
On mcore-elf, mcore_mark_dllimport generated
(gdb) call debug_tree (decl)
<function_decl 0x7fffe9941200 f1
type <function_type 0x7fffe981f000
type <void_type 0x7fffe98180a8 void VOID
align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffe98180a8
pointer_to_this <pointer_type 0x7fffe9818150>>
HI
size <integer_cst 0x7fffe9802738 constant 16>
unit-size <integer_cst 0x7fffe9802750 constant 2>
align:16 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffe981f000
arg-types <tree_list 0x7fffe980b988 value <void_type 0x7fffe98180a8 void>>
pointer_to_this <pointer_type 0x7fffe991b0a8>>
addressable used public external decl_5 SI /tmp/x.c:1:40 align:16 warn_if_not_align:0 context <translation_unit_decl 0x7fffe9955080 /tmp/x.c>
attributes <tree_list 0x7fffe9932708
purpose <identifier_node 0x7fffe9954000 dllimport>>
(mem:SI (mem:SI (symbol_ref:SI ("@i.__imp_f1")) [0 S4 A32]) [0 S4 A32]) chain <function_decl 0x7fffe9941300 f2>>
which caused:
(gdb) bt
file=0x2c0f1c8 "/export/gnu/import/git/sources/gcc-test/gcc/calls.cc",
line=3746, function=0x2c0f747 "expand_call")
at /export/gnu/import/git/sources/gcc-test/gcc/diagnostic.cc:1780
target=0x0, ignore=1)
at /export/gnu/import/git/sources/gcc-test/gcc/calls.cc:3746
...
(gdb) call debug_rtx (datum)
(mem:SI (symbol_ref:SI ("@i.__imp_f1")) [0 S4 A32])
(gdb)
Don't use gen_rtx_MEM in mcore_mark_dllimport to generate
(gdb) call debug_tree (fndecl)
<function_decl 0x7fffe9941200 f1
type <function_type 0x7fffe981f000
type <void_type 0x7fffe98180a8 void VOID
align:8 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffe98180a8
pointer_to_this <pointer_type 0x7fffe9818150>>
HI
size <integer_cst 0x7fffe9802738 constant 16>
unit-size <integer_cst 0x7fffe9802750 constant 2>
align:16 warn_if_not_align:0 symtab:0 alias-set -1 canonical-type 0x7fffe981f000
arg-types <tree_list 0x7fffe980b988 value <void_type 0x7fffe98180a8 void>>
pointer_to_this <pointer_type 0x7fffe991b0a8>>
addressable used public external decl_5 SI /tmp/x.c:1:40 align:16 warn_if_not_align:0 context <translation_unit_decl 0x7fffe9955080 /tmp/x.c>
attributes <tree_list 0x7fffe9932708
purpose <identifier_node 0x7fffe9954000 dllimport>>
(mem:SI (symbol_ref:SI ("@i.__imp_f1")) [0 S4 A32]) chain <function_decl 0x7fffe9941300 f2>>
(gdb)
instead. This fixes:
gcc.c-torture/compile/dll.c -O0 (internal compiler error: in assemble_variable, at varasm.cc:2544)
gcc.dg/visibility-12.c (internal compiler error: in expand_call, at calls.cc:3744)
for mcore-elf.
PR target/120589
* config/mcore/mcore.cc (mcore_mark_dllimport): Don't use
gen_rtx_MEM.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
After commit eb2ea476db2 ("emit-rtl: Allow extra checks for
paradoxical subregs [PR119966]"), paradoxical subregs of the OpenRISC
condition flag register (reg:BI sr_f) are no longer allowed.
This causes an ICE in the ce1 pass, which tries to get the or1k flag
register into an SI register, which is no longer possible.
Adjust or1k_can_change_mode_class to allow changing the or1k flag reg to
SI mode, which in turn allows paradoxical subregs to be generated again.
gcc/ChangeLog:
PR target/120587
* config/or1k/or1k.cc (or1k_can_change_mode_class): Allow
changing flags mode from BI to SI to allow for paradoxical
subregs.
|
|
Make sure we can represent the difference between two 64-bit DImode immediate
values in a 64-bit HOST_WIDE_INT and return false if this is not the case.
ix86_expand_int_movcc is used in the mov<mode>cc expander. The expander FAILs
when the function returns false, and the middle end then retries expansion
with the values forced to registers.
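A hedged worked example of the representability problem (illustrative only,
with int64_t standing in for HOST_WIDE_INT): with ct = INT64_MAX and
cf = INT64_MIN, the mathematical difference is 2^64 - 1, which does not fit
in a signed 64-bit value, so the expander must return false and FAIL.
#include <stdint.h>
#include <stdbool.h>

/* Sketch of the check, not the exact GCC code.  */
static bool
diff_fits_int64_p (int64_t ct, int64_t cf)
{
  __int128 diff = (__int128) ct - (__int128) cf;   /* exact difference */
  return diff >= INT64_MIN && diff <= INT64_MAX;
}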
PR target/120604
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_int_movcc): Make sure
we can represent the difference between two 64-bit DImode
immediate values in 64-bit HOST_WIDE_INT.
|
|
This patch combines vec_duplicate + vmax.vv into vmax.vx; see the example
code below. The related pattern depends on the cost of a vec_duplicate from
GR2VR: late-combine performs the combination if the GR2VR cost is zero and
rejects it if the GR2VR cost is greater than zero.
Assume we have the example code below, with a GR2VR cost of 0.
#define DEF_VX_BINARY(T, OP) \
void \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = in[i] OP x; \
}
DEF_VX_BINARY(int32_t, /)
Before this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vmax.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vmax.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add new
case SMAX.
(expand_vx_binary_vec_vec_dup): Ditto.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op smax.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
The PCS defines a lazy save scheme for managing ZA across normal
"private-ZA" functions. GCC currently uses this scheme for calls
to all private-ZA functions (rather than using caller-save).
Therefore, before a sequence of calls to private-ZA functions, GCC emits
code to set up a lazy save. After the sequence of calls, GCC emits code
to check whether lazy save was committed and restore the ZA contents
if so.
These sequences are emitted by the mode-switching pass, in an attempt
to reduce the number of redundant saves and restores.
The lazy save scheme also means that, before a function can use ZA,
it must first conditionally store the old contents of ZA to the caller's
lazy save buffer, if any.
This all creates some relatively complex dependencies between
setup code, save/restore code, and normal reads from and writes to ZA.
These dependencies are modelled using special fake hard registers:
;; Sometimes we use placeholder instructions to mark where later
;; ABI-related lowering is needed. These placeholders read and
;; write this register. Instructions that depend on the lowering
;; read the register.
(LOWERING_REGNUM 87)
;; Represents the contents of the current function's TPIDR2 block,
;; in abstract form.
(TPIDR2_BLOCK_REGNUM 88)
;; Holds the value that the current function wants PSTATE.ZA to be.
;; The actual value can sometimes vary, because it does not track
;; changes to PSTATE.ZA that happen during a lazy save and restore.
;; Those effects are instead tracked by ZA_SAVED_REGNUM.
(SME_STATE_REGNUM 89)
;; Instructions write to this register if they set TPIDR2_EL0 to a
;; well-defined value. Instructions read from the register if they
;; depend on the result of such writes.
;;
;; The register does not model the architected TPIDR2_EL0, just the
;; current function's management of it.
(TPIDR2_SETUP_REGNUM 90)
;; Represents the property "has an incoming lazy save been committed?".
(ZA_FREE_REGNUM 91)
;; Represents the property "are the current function's ZA contents
;; stored in the lazy save buffer, rather than in ZA itself?".
(ZA_SAVED_REGNUM 92)
;; Represents the contents of the current function's ZA state in
;; abstract form. At various times in the function, these contents
;; might be stored in ZA itself, or in the function's lazy save buffer.
;;
;; The contents persist even when the architected ZA is off. Private-ZA
;; functions have no effect on its contents.
(ZA_REGNUM 93)
Every normal read from ZA and write to ZA depends on SME_STATE_REGNUM,
in order to sequence the code with the initial setup of ZA and
with the lazy save scheme.
The code to restore ZA after a call involves several instructions,
including conditional control flow. It is initially represented as
a single define_insn and is split late, after shrink-wrapping and
prologue/epilogue insertion.
The split form of the restore instruction includes a conditional call
to __arm_tpidr2_restore:
(define_insn "aarch64_tpidr2_restore"
[(set (reg:DI ZA_SAVED_REGNUM)
(unspec:DI [(reg:DI R0_REGNUM)] UNSPEC_TPIDR2_RESTORE))
(set (reg:DI SME_STATE_REGNUM)
(unspec:DI [(reg:DI SME_STATE_REGNUM)] UNSPEC_TPIDR2_RESTORE))
...
)
The write to SME_STATE_REGNUM indicates the end of the region where
ZA_REGNUM might differ from the real contents of ZA. In other words,
it is the point at which normal reads from ZA and writes to ZA
can safely take place.
To finally get to the point, the problem in this PR was that the
unsplit aarch64_restore_za pattern was missing this change to
SME_STATE_REGNUM. It could therefore be deleted as dead before
it had a chance to be split. The split form had the correct dataflow,
but the unsplit form didn't.
Unfortunately, the tests for this code tended to use calls and asms
to model regions of ZA usage, and those don't seem to be affected
in the same way.
gcc/
PR target/120624
* config/aarch64/aarch64.md (SME_STATE_REGNUM): Expand on comments.
* config/aarch64/aarch64-sme.md (aarch64_restore_za): Also set
SME_STATE_REGNUM.
gcc/testsuite/
PR target/120624
* gcc.target/aarch64/sme/za_state_7.c: New test.
|
|
Rename record_function_versions to add_function_version, making it
explicit that it adds a single version to the function structure.
Additionally, change the insertion point to always maintain priority ordering
of the versions.
This allows removing the logic for moving the default to the first
position, which was duplicated across target-specific code, and enables
easier reasoning about function sets.
gcc/ChangeLog:
* cgraph.cc (cgraph_node::record_function_versions): Refactor and
rename to...
(cgraph_node::add_function_version): new function.
* cgraph.h (cgraph_node::record_function_versions): Refactor and
rename to...
(cgraph_node::add_function_version): new function.
* config/aarch64/aarch64.cc (aarch64_get_function_versions_dispatcher):
Remove reordering.
* config/i386/i386-features.cc (ix86_get_function_versions_dispatcher):
Remove reordering.
* config/riscv/riscv.cc (riscv_get_function_versions_dispatcher):
Remove reordering.
* config/rs6000/rs6000.cc (rs6000_get_function_versions_dispatcher):
Remove reordering.
gcc/cp/ChangeLog:
* decl.cc (maybe_version_functions): Change record_function_versions
call to add_function_version.
|
|
This patch sets the SRF issue rate to 4 and the GNR issue rate to 6.
According to SPEC2017 tests, the patch has little effect on performance.
For GRR, CWF, DMR, ARL and PTL, the patch sets their issue rate to 6,
pending more information to update them.
Bootstrapped and regtested on x86_64-pc-linux-gnu, OK for trunk.
gcc/ChangeLog:
* config/i386/x86-tune-sched.cc (ix86_issue_rate): Set 4 for SRF,
6 for GRR, GNR, CWF, DMR, ARL, PTL.
|
|
The instruction scheduler appears to be speculatively hoisting vsetvl
insns outside of their basic block without checking for data
dependencies. This resulted in a situation where the following occurs
vsetvli a5,a1,e32,m1,tu,ma
vle32.v v2,0(a0)
sub a1,a1,a5 <-- a1 potentially set to 0
sh2add a0,a5,a0
vfmacc.vv v1,v2,v2
vsetvli a5,a1,e32,m1,tu,ma <-- incompatible vinfo. update vl to 0
beq a1,zero,.L12 <-- check if avl is 0
This patch essentially delays the vsetvl update until after the branch
to prevent unnecessarily updating the vinfo at the end of a basic block.
PR/117974
gcc/ChangeLog:
* config/riscv/riscv.cc (struct riscv_tune_param): Add tune
param.
(riscv_sched_can_speculate_insn): Implement.
(TARGET_SCHED_CAN_SPECULATE_INSN): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/vsetvl/pr117974.c: New test.
Signed-off-by: Edwin Lu <ewlu@rivosinc.com>
|
|
Patch for PR120553 enabled full 64-bit DImode immediates in
ix86_expand_int_movcc. However, the function calculates the difference
between two immediate arguments using signed 64-bit HOST_WIDE_INT
subtractions that can cause signed integer overflow.
Avoid the overflow by casting operands of subtractions to
(unsigned HOST_WIDE_INT).
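In other words (a minimal sketch with int64_t/uint64_t standing in for
HOST_WIDE_INT and its unsigned counterpart), the subtraction is done in the
unsigned type, where wraparound is well defined and still yields the intended
two's-complement difference:
#include <stdint.h>

static inline int64_t
wrapping_diff (int64_t ct, int64_t cf)
{
  /* Unsigned subtraction cannot overflow; the conversion back to
     int64_t is implementation-defined but well behaved in GCC.  */
  return (int64_t) ((uint64_t) ct - (uint64_t) cf);
}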
PR target/120604
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_int_movcc): Cast operands of
signed 64-bit HOST_WIDE_INT subtractions to (unsigned HOST_WIDE_INT).
|
|
This pattern enables the combine pass (or late-combine, depending on the case)
to merge a vec_duplicate into a (possibly negated) minus-mult RTL instruction.
Before this patch, we have two instructions, e.g.:
vfmv.v.f v6,fa0
vfnmadd.vv v2,v6,v4
After, we get only one:
vfnmadd.vf v2,fa0,v4
This also fixes a sign mistake in the handling of vfmsub.
PR target/119100
gcc/ChangeLog:
* config/riscv/autovec-opt.md (*<optab>_vf_<mode>): Only handle vfmadd
and vfmsub.
(*vfnmsub_<mode>): New pattern.
(*vfnmadd_<mode>): New pattern.
* config/riscv/riscv.cc (get_vector_binary_rtx_cost): Add cost model for
NEG and VEC_DUPLICATE.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfnmadd and
vfnmsub.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop.h: Add support for neg
variants. Fix sign for sub.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop_data.h: Add data for neg
variants. Fix data for sub.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop_run.h: Rename x to f.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmadd-run-1-f16.c: Add neg
argument.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmadd-run-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmadd-run-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsub-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsub-run-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsub-run-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmadd-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmadd-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmadd-run-1-f64.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsub-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsub-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsub-run-1-f64.c: New test.
|
|
The just posted second PR120434 patch causes
+FAIL: gcc.target/i386/pr78103-3.c scan-assembler \\\\m(leaq|addq|incq)\\\\M
+FAIL: gcc.target/i386/pr78103-3.c scan-assembler-not \\\\mmovl\\\\M+
+FAIL: gcc.target/i386/pr78103-3.c scan-assembler-not \\\\msubq\\\\M
+FAIL: gcc.target/i386/pr78103-3.c scan-assembler-not \\\\mxor[lq]\\\\M
While the patch generally improves code generation by often using
ZERO_EXTEND instead of SIGN_EXTEND (the former is often free
on x86_64, while the latter requires an extra instruction, or a larger
instruction than one with just a zero extend), the PR78103 combine patterns
and splitters were written only with SIGN_EXTEND in mind. As CLZ is UB
on 0, otherwise returns just [0,63], and is xored with 63, ZERO_EXTEND
does the same thing there as SIGN_EXTEND.
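For reference, a hedged example in the spirit of the pr78103 tests (not the
exact testcase): 63 - clz is the pattern these combine rules turn into a
single bsr, and with this change the extension of the result may be either
kind.
unsigned long
bit_index (unsigned long x)
{
  /* Caller must ensure x != 0, since __builtin_clzl (0) is undefined.  */
  return 63 - __builtin_clzl (x);
}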
2025-06-10 Jakub Jelinek <jakub@redhat.com>
PR middle-end/120434
* config/i386/i386.md (*bsr_rex64_2): Rename to ...
(*bsr_rex64<u>_2): ... this. Use any_extend instead of sign_extend.
(*bsr_2): Rename to ...
(*bsr<u>_2): ... this. Use any_extend instead of sign_extend.
(bsr splitters after those): Use any_extend instead of sign_extend.
|
|
As gfx942 and gfx950 belong to gfx9-4-generic, the latter two are also added.
Note that there are no specific optimizations for MI300, yet.
No multilib is built by default for any of the mentioned devices; use
'--with-multilib-list=' when configuring GCC to build them alongside.
gfx942 was added in LLVM (and its mc assembler, used by GCC) in version 18,
generic support in LLVM 19 and gfx950 in LLVM 20.
gcc/ChangeLog:
* config/gcn/gcn-devices.def: Add gfx942, gfx950 and gfx9-4-generic.
* config/gcn/gcn-opts.h (TARGET_CDNA3, TARGET_CDNA3_PLUS,
TARGET_GLC_NAME, TARGET_TARGET_SC_CACHE): Define.
(TARGET_ARCHITECTED_FLAT_SCRATCH): Use also for CDNA3.
* config/gcn/gcn.h (gcn_isa): Add ISA_CDNA3 to the enum.
* config/gcn/gcn.cc (print_operand): Update 'g' to use
TARGET_GLC_NAME; add 'G' to print TARGET_GLC_NAME unconditionally.
* config/gcn/gcn-valu.md (scatter, gather): Use TARGET_GLC_NAME.
* config/gcn/gcn.md: Use %G<num> instead of glc; use 'buffer_inv sc1'
for TARGET_TARGET_SC_CACHE.
* doc/invoke.texi (march): Add gfx942, gfx950 and gfx9-4-generic.
* doc/install.texi (amdgcn*-*-*): Add gfx942, gfx950 and gfx9-4-generic.
* config/gcn/gcn-tables.opt: Regenerate.
libgomp/ChangeLog:
* testsuite/libgomp.c/declare-variant-4.h (gfx942): New variant function.
* testsuite/libgomp.c/declare-variant-4-gfx942.c: New test.
|
|
This is a fix for a bug found internally in Ventana using the cf3 testsuite.
cf3 looks to be dead as a project and likely subsumed by modern fuzzers. In
fact internally we tripped another issue with cf3 that had already been
reported by Edwin with the fuzzer he runs.
Anyway, the splitter in question blindly emits the 2nd adjusted constant into
a register. That's not valid if the constant requires any kind of synthesis,
and it well could, since we're mostly focused on the first constant turning
into something that can be loaded via LUI without increasing the cost of the
second constant.
Instead of using the split RTL template, this just emits the code we want
directly, using riscv_move_insn to synthesize the constant into the provided
temporary register.
Tested on my system. Waiting on upstream CI's verdict before moving forward.
gcc/
* config/riscv/riscv.md (lui-constraint<X:mode>and_to_or): Do not use
the RTL template for split code. Emit it directly taking care to avoid
emitting a constant load that needed synthesis. Fix formatting.
gcc/testsuite/
* gcc.target/riscv/ventana-16122.c: New test.
|
|
This patch combines vec_duplicate + vremu.vv into vremu.vx; see the example
code below. The related pattern depends on the cost of a vec_duplicate from
GR2VR: late-combine performs the combination if the GR2VR cost is zero and
rejects it if the GR2VR cost is greater than zero.
Assume we have the example code below, with a GR2VR cost of 0.
#define DEF_VX_BINARY(T, OP) \
void \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = in[i] OP x; \
}
DEF_VX_BINARY(int32_t, /)
Before this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vremu.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vremu.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_vec_dup): Add new
case UMOD.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op umod.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
Another czero related adjustment. This time in costing of conditional move
sequences. Essentially a copy from a promoted subreg can and should be ignored
from a costing standpoint. We had some code to do this, but its conditions
were too strict.
No real surprises evaluating spec. This should be a minor, but probably not
measurable, improvement in x264 and xz. It is if-converting more in some
particularly hot routines, but not necessarily in the hot parts of those
routines.
It's been tested on riscv32-elf and riscv64-elf. Versions of this have
bootstrapped and regression tested as well, though perhaps not this exact
version.
Waiting on pre-commit testing.
gcc/
* config/riscv/riscv.cc (riscv_noce_conversion_profitable_p): Relax
condition for adjustments due to copies from promoted SUBREGs.
|
|
Like r16-105-g599bca27dc37b3, the patch handles redundant clean-up of the
upper bits for maskload.
Successfully matched this instruction:
(set (reg:V4DF 175)
(vec_merge:V4DF (unspec:V4DF [
(mem:V4DF (plus:DI (reg/v/f:DI 155 [ b ])
(reg:DI 143 [ ivtmp.56 ])) [1 S32 A64])
] UNSPEC_MASKLOAD)
(const_vector:V4DF [
(const_double:DF 0.0 [0x0.0p+0]) repeated x4
])
(and:QI (reg:QI 125 [ mask__29.16 ])
(const_int 15 [0xf]))))
For maskstore, it looks like the code is already optimal (at least I can't
construct a testcase), so the patch only handles maskload.
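A hedged sketch (hypothetical source, not the new testcase) of a loop whose
vectorized form produces a zero-merged maskload like the RTL above, e.g. at
-O3 with AVX512 masking enabled:
void
foo (double *restrict a, double *restrict b, int *restrict c, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = c[i] ? b[i] : 0.0;
}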
gcc/ChangeLog:
PR target/103750
* config/i386/i386.cc (ix86_rtx_costs): Adjust rtx_cost for
maskload.
* config/i386/sse.md (*<avx512>_load<mode>mask_and15): New
define_insn_and_split.
(*<avx512>_load<mode>mask_and3): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512f-pr103750-3.c: New test.
|
|
This patch combines vec_duplicate + vrem.vv into vrem.vx; see the example
code below. The related pattern depends on the cost of a vec_duplicate from
GR2VR: late-combine performs the combination if the GR2VR cost is zero and
rejects it if the GR2VR cost is greater than zero.
Assume we have the example code below, with a GR2VR cost of 0.
#define DEF_VX_BINARY(T, OP) \
void \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = in[i] OP x; \
}
DEF_VX_BINARY(int32_t, /)
Before this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vrem.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vrem.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_vec_dup): Add new
case MOD.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op mod.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
As described in prior patches of this series, the FRM mode switching state
machine has special handling around calls. After a call_insn, if in the
DYN_CALL state, it needs to transition back to DYN, which requires checking
back whether the previous insn was indeed a call.
Deferring/delaying this could lead to unnecessary final transitions and thus
to extraneous FRM save/restores.
However, the current back-checking for the call_insn was too coarse-grained.
It used prev_nonnote_nondebug_insn_bb (), which assumes the current insn is
in the same BB as the call_insn, which need not always be true.
The problem is not with the API, but with how it is used.
Fix this by tracking call_insn more explicitly in TARGET_MODE_NEEDED.
- On seeing a call_insn, record a "call note".
- On subsequent insns if a "call note" is seen, do the needed state switch
and clear the note.
- Remove the old BB based search.
The number of FRM read/writes across SPEC2017 -Ofast -mrv64gcv improves.
Before After
------------- ---------------
frrm fsrmi fsrm frrm fsrmi fsrm
perlbench_r 17 0 1 17 0 1
cpugcc_r 11 0 0 11 0 0
bwaves_r 16 0 1 16 0 1
mcf_r 11 0 0 11 0 0
cactusBSSN_r 19 0 1 19 0 1
namd_r 14 0 1 14 0 1
parest_r 24 0 1 24 0 1
povray_r 26 1 6 26 1 6
lbm_r 6 0 0 6 0 0
omnetpp_r 17 0 1 17 0 1
wrf_r 1268 13 1603 613 13 82
cpuxalan_r 17 0 1 17 0 1
ldecod_r 11 0 0 11 0 0
x264_r 11 0 0 11 0 0
blender_r 61 12 42 39 12 16
cam4_r 45 13 20 40 13 17
deepsjeng_r 11 0 0 11 0 0
imagick_r 132 16 25 33 16 18
leela_r 12 0 0 12 0 0
nab_r 13 0 1 13 0 1
exchange2_r 16 0 1 16 0 1
fotonik3d_r 19 0 1 19 0 1
roms_r 21 0 1 21 0 1
xz_r 6 0 0 6 0 0
----------------- --------------
1804 55 1707 1023 55 150
----------------- --------------
3566 1228
----------------- --------------
While this was a missed-optimization exercise, testing exposed a latent
bug as an additional testsuite failure, captured as PR120203. The existing
test float-point-dynamic-frm-74.c was missing an FRM save after a call,
which this fixes (as a side effect of robust call state tracking).
| frrm a5
| fsrmi 1
|
| vfadd.vv v1,v8,v9
| fsrm a5
| beq a1,zero,.L2
|
| call normalize_vl_1
| frrm a5
|
| .L3:
| fsrmi 3
| vfadd.vv v8,v8,v9
| fsrm a5
| jr ra
|
| .L2:
| call normalize_vl_2
| frrm a5 <-- missing
| j .L3
PR target/120203
gcc/ChangeLog:
* config/riscv/riscv.cc (CFUN_IN_CALL): New macro.
(struct mode_switching_info): Add new field.
(riscv_frm_adjust_mode_after_call): Remove.
(riscv_frm_mode_needed): Track call_insn.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-74.c: Expect
an additional FRRM.
Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
|
|
The FRM mode switching state machine has DYN as its default state, which it
also falls back to after transitioning to other states such as DYN_CALL.
Currently TARGET_MODE_EMIT generates an FRM restore on any transition to
DYN, leading to spurious/extraneous FRM restores.
Only do this if an interim static rounding mode was observed in the state
machine.
This fixes the extraneous FRM read/write in PR119164 (and also PR119832,
without needing TARGET_MODE_CONFLUENCE). It also significantly reduces the
number of FRM writes in a SPEC2017 -Ofast -mrv64gcv build.
Before After
------------- -------------
frrm fsrmi fsrm frrm fsrmi fsrm
perlbench_r 42 0 4 17 0 1
cpugcc_r 167 0 17 11 0 0
bwaves_r 16 0 1 16 0 1
mcf_r 11 0 0 11 0 0
cactusBSSN_r 76 0 27 19 0 1
namd_r 119 0 63 14 0 1
parest_r 168 0 114 24 0 1
povray_r 123 1 17 26 1 6
lbm_r 6 0 0 6 0 0
omnetpp_r 17 0 1 17 0 1
wrf_r 2287 13 1956 1268 13 1603
cpuxalan_r 17 0 1 17 0 1
ldecod_r 11 0 0 11 0 0
x264_r 14 0 1 11 0 0
blender_r 724 12 182 61 12 42
cam4_r 324 13 169 45 13 20
deepsjeng_r 11 0 0 11 0 0
imagick_r 265 16 34 132 16 25
leela_r 12 0 0 12 0 0
nab_r 13 0 1 13 0 1
exchange2_r 16 0 1 16 0 1
fotonik3d_r 20 0 11 19 0 1
roms_r 33 0 23 21 0 1
xz_r 6 0 0 6 0 0
--------------- --------------
4498 55 2623 1804 55 1707
--------------- --------------
7176 3566
--------------- --------------
PR target/119164
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_emit_frm_mode_set): Check
STATIC_FRM_P for transition to DYN.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/pr119164.c: New test.
Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
|
|
This showed up when debugging the testcase for PR119164.
The RISC-V FRM mode-switching state machine has special handling for
transitions to and from a call_insn, as FRM needs to be saved/restored around
calls despite not being a callee-saved reg; rather, it's a "global" reg
which can be temporarily modified "locally" with a static rounding mode.
Thus a call needs to see the prior global state, hence the restore (from a
prior backup) before the call. Correspondingly, any call can potentially
clobber FRM, so post-call it needs to be re-read/saved.
The following example demonstrates this:
- insns 2, 4 and 6 correspond to actual user code,
- insns 1, 3, 5 and 7 are FRM save/restore insns generated by mode switching
for the ABI semantics described above.
test_float_point_frm_static:
1: frrm a5 <--
2: fsrmi 2
3: fsrm a5 <--
4: call normalize_vl
5: frrm a5 <--
6: fsrmi 3
7: fsrm a5 <--
The current implementation of the RISC-V TARGET_MODE_NEEDED hook has special
handling when the call_insn is the last insn of a BB, to ensure FRM
saves/reads are emitted on all the edges. However, it doesn't work as
intended and is borderline bogus for the following reasons:
- It fails to detect the call_insn as the last of its BB (PR119164 test) if
the next BB starts with a code label (say, due to the call being conditional).
Granted, this is a deficiency of the API next_nonnote_nondebug_insn_bb (),
which incorrectly returns the next BB's code_label as opposed to returning
NULL (and this behavior is kind of relied upon by much of gcc).
This causes a missed/delayed state transition to DYN.
- If code is tightened to actually detect above such as:
- rtx_insn *insn = next_nonnote_nondebug_insn_bb (cur_insn);
- if (!insn)
+ if (BB_END (BLOCK_FOR_INSN (cur_insn)) == cur_insn)
edge insertion happens but ends up splitting the BB, which generic
mode-sw doesn't expect, and ends up hitting an ICE.
- TARGET_MODE_NEEDED hook typically don't modify the CFG.
- For abnormal edges, insert_insn_end_basic_block () is called, which
by design on encountering call_insn as last in BB, inserts new insn
BEFORE the call, not after.
So this is just all wrong and ripe for removal. Moreover there seems to
be no testsuite coverage for this code path at all. Results don't change
at all if this is removed.
The total number of FRM reads/writes emitted (static count) across all
benchmarks of a SPEC2017 -Ofast -march=rv64gcv build decreases slightly,
so it's a net win, even if minimal, but the real gain is reduced complexity
and maintenance.
Before Patch
---------------- ---------------
frrm fsrmi fsrm frrm fsrmi fsrm
perlbench_r 42 0 4 42 0 4
cpugcc_r 167 0 17 167 0 17
bwaves_r 16 0 1 16 0 1
mcf_r 11 0 0 11 0 0
cactusBSSN_r 79 0 27 76 0 27
namd_r 119 0 63 119 0 63
parest_r 218 0 114 168 0 114 <--
povray_r 123 1 17 123 1 17
lbm_r 6 0 0 6 0 0
omnetpp_r 17 0 1 17 0 1
wrf_r 2287 13 1956 2287 13 1956
cpuxalan_r 17 0 1 17 0 1
ldecod_r 11 0 0 11 0 0
x264_r 14 0 1 14 0 1
blender_r 724 12 182 724 12 182
cam4_r 324 13 169 324 13 169
deepsjeng_r 11 0 0 11 0 0
imagick_r 265 16 34 265 16 34
leela_r 12 0 0 12 0 0
nab_r 13 0 1 13 0 1
exchange2_r 16 0 1 16 0 1
fotonik3d_r 20 0 11 20 0 11
roms_r 33 0 23 33 0 23
xz_r 6 0 0 6 0 0
---------------- ---------------
4551 55 2623 4498 55 2623
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_frm_emit_after_bb_end): Delete.
(riscv_frm_mode_needed): Remove call riscv_frm_emit_after_bb_end.
Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
|
|
This is effectively reverting e5d1f538bb7d
"(RISC-V: Allow different dynamic floating point mode to be merged)"
while retaining the testcase.
The change itself is valid; however, it obfuscates the deficiencies in the
current FRM mode switching code.
Also, for a SPEC2017 -Ofast -march=rv64gcv build, it ends up generating a
net increase in FRM restores (writes) vs. the rest of this changeset.
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_dynamic_frm_mode_p): Remove.
(riscv_mode_confluence): Ditto.
(TARGET_MODE_CONFLUENCE): Ditto.
Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
|
|
So here's the next chunk of conditional move work from Shreya.
It's been a long-standing wart that the conditional move expander does not
support sub-word operands in the comparison, particularly since we have
support routines to handle the necessary extensions for that case.
This patch adjusts the expander to use riscv_extend_comparands rather than fail
for that case. I've built spec2017 before/after this and we definitely get
more conditional moves and they look sensible from a performance standpoint.
None are likely hitting terribly hot code, so I wouldn't expect any performance
jumps.
Waiting on pre-commit testing to do its thing.
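A hedged example of the newly supported shape (hypothetical, not from SPEC):
the comparison operands are sub-word, so the expander now extends them via
riscv_extend_comparands instead of FAILing.
int
csel (short a, short b, int x, int y)
{
  return a < b ? x : y;   /* sub-word comparison feeding a conditional move */
}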
gcc/
* config/riscv/riscv.cc (riscv_expand_conditional_move): Use
riscv_extend_comparands to extend sub-word comparison arguments.
Co-authored-by: Jeff Law <jlaw@ventanamicro.com>
|
|
By using the previously unused CEIL|FLOOR.S floating-point coprocessor
instructions. In addition, two instruction operand format codes are added
to output the scale value as assembler source.
/* example */
int test0(float a) {
return __builtin_lceilf(a);
}
int test1(float a) {
return __builtin_lceilf(a * 2);
}
int test2(float a) {
return __builtin_lfloorf(a);
}
int test3(float a) {
return __builtin_lfloorf(a * 32768);
}
;; result
test0:
entry sp, 32
wfr f0, a2
ceil.s a2, f0, 0
retw.n
test1:
entry sp, 32
wfr f0, a2
ceil.s a2, f0, 1
retw.n
test2:
entry sp, 32
wfr f0, a2
floor.s a2, f0, 0
retw.n
test3:
entry sp, 32
wfr f0, a2
floor.s a2, f0, 15
retw.n
However, the rounding-half behavior (e.g., the rule that determines
whether 1.5 should be rounded to 1 or to 2) of the two is inconsistent:
the lroundsfsi2 pattern is explicitly specified to round to the nearest
integer, away from zero, whereas the behavior of the ROUND.S instruction is
not specified by the ISA and is implementation-dependent.
Therefore lroundsfsi2 cannot be implemented with ROUND.S.
gcc/ChangeLog:
* config/xtensa/xtensa.cc (printx, print_operand):
Add two instruction operand format codes 'U' and 'V',
whose represent scale factors of 0 to 15th positive/negative
power of two.
* config/xtensa/xtensa.md (c_enum "unspec"):
Add UNSPEC_CEIL and UNSPEC_FLOOR.
(int_iterator ANY_ROUND, int_attr m_round):
New integer iterator and its attribute.
(fix<s_fix>_truncsfsi2, *fix<s_fix>_truncsfsi2_2x,
*fix<s_fix>_truncsfsi2_scaled, float<s_float>sisf2,
*float<s_float>sisf2_scaled):
Use output templates with the operand formats added above,
instead of individual output statements.
(l<m_round>sfsi2, *l<m_round>sfsi2_2x, *l<m_round>sfsi2_scaled):
New insn patterns.
|
|
So here's the next chunk of conditional move work from Shreya.
It's been a long-standing wart that the conditional move expander does not
support sub-word operands in the comparison, particularly since we have
support routines to handle the necessary extensions for that case.
This patch adjusts the expander to use riscv_extend_comparands rather than fail
for that case. I've built spec2017 before/after this and we definitely get
more conditional moves and they look sensible from a performance standpoint.
None are likely hitting terribly hot code, so I wouldn't expect any performance
jumps.
Waiting on pre-commit testing to do its thing.
* config/riscv/riscv.cc (riscv_expand_conditional_move): Use
riscv_extend_comparands to extend sub-word comparison arguments.
Co-authored-by: Jeff Law <jlaw@ventanamicro.com>
|
|
This patch would like to combine vec_duplicate + vdivu.vv into vdivu.vx,
as in the example code below.  The related pattern will depend on the cost
of vec_duplicate from GR2VR: the late-combine pass will take action if the
GR2VR cost is zero, and reject the combination if the GR2VR cost is
greater than zero.
Assume we have example code like below, with GR2VR cost 0.
#define DEF_VX_BINARY(T, OP) \
void \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = in[i] OP x; \
}
DEF_VX_BINARY(int32_t, /)
Before this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vdivu.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vdivu.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
The below test suites are passed for this patch.
* The rv64gcv full regression test.
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_vec_dup): Add new
case UDIV.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op divu.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
Now that create_tmp_reg_or_ssa_name just calls make_ssa_name, replace
all of its uses.
* gimple-fold.h (create_tmp_reg_or_ssa_name): Remove.
* gimple-fold.cc (create_tmp_reg_or_ssa_name): Likewise.
(gimple_fold_builtin_memory_op): Use make_ssa_name.
(gimple_fold_builtin_strchr): Likewise.
(gimple_fold_builtin_strcat): Likewise.
(gimple_load_first_char): Likewise.
(gimple_fold_builtin_string_compare): Likewise.
(gimple_build): Likewise.
* tree-inline.cc (copy_bb): Likewise.
* config/rs6000/rs6000-builtin.cc (fold_build_vec_cmp): Likewise.
(rs6000_gimple_fold_mma_builtin): Likewise.
(rs6000_gimple_fold_builtin): Likewise.
|
|
This patch adds support for the XiangShan Kunminghu CPU in GCC, allowing
the use of the `-mcpu=xiangshan-kunminghu` option.
XiangShan-KunMingHu is the third-generation open-source high-performance
RISC-V processor.[1] You can find the corresponding ISA extension from the
XiangShan Github repository.[2] The latest news of KunMingHu can be found
in the XiangShan Biweekly.[3]
[1] https://github.com/OpenXiangShan/XiangShan-User-Guide/releases.
[2] https://github.com/OpenXiangShan/XiangShan/blob/master/src/main/scala/xiangshan/Parameters.scala
[3] https://docs.xiangshan.cc/zh-cn/latest/blog
A dedicated scheduling model for KunMingHu's hybrid pipeline will be
proposed in a subsequent PR.
gcc/ChangeLog:
* config/riscv/riscv-cores.def (RISCV_TUNE): New cpu tune.
(RISCV_CORE): New cpu.
* doc/invoke.texi: Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/mcpu-xiangshan-kunminghu.c: New test.
Co-Authored-By: Jiawei Chen <jiawei@iscas.ac.cn>
Co-Authored-By: Yangyu Chen <cyy@cyyself.name>
Co-Authored-By: Tang Haojin <tanghaojin@outlook.com>
|
|
So another class of cases where we can do better than a zicond sequence.
Like the prior patch, this came up while evaluating some code from Shreya
to detect more conditional move cases.
This patch allows us to use the "splat the sign bit" idiom to efficiently
select between 0 and 2^n-1. That's particularly important for signed division
by a power of two.
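As a point of reference, the kind of source that exercises this (a small
sketch matching the n / 4096 case shown below):
long sdiv_pow2 (long n)
{
  return n / 4096;
}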
For signed division by a power of 2, you conditionally add 2^n-1 to the
numerator, then right shift that result. Using zicond somewhat naively you get
something like this (for n / 4096):
> li a5,4096
> addi a5,a5,-1
> slti a4,a0,0
> add a5,a0,a5
> czero.eqz a5,a5,a4
> czero.nez a0,a0,a4
> add a0,a0,a5
> srai a0,a0,12
After this patch you get this instead:
> srai a5,a0,63
> srli a5,a5,52
> add a0,a5,a0
> srai a0,a0,12
It's not *that* much faster, but it's certainly shorter.
So the trick here is that after splatting the sign bit we have 0, -1. So a
subsequent logical shift right would generate 0 or 2^n-1.
Yes, there's a nice variety of other constant pairs we can select between.
Some notes have been added to the PR I opened yesterday.
The first thing we need to do is throttle back zicond generation. Unfortunately
we don't see the constants from the division-by-2^n algorithm, so we have to
disable for all lt/ge 0 cases.  This can have small negative impacts.  I
looked at this across spec and didn't see anything I was particularly
worried about, and saw numerous small improvements from that alone.
With that in place we need to recognize the form seen by combine. Essentially
it sees the splat of the sign bit feeding a logical AND. We split that into two
right shifts.
This has survived in my tester. Waiting on upstream pre-commit before moving
forward.
gcc/
* config/riscv/riscv.cc (riscv_expand_conditional_move): Avoid
zicond in some cases involving sign bit tests.
* config/riscv/riscv.md: Split a splat of the sign bit feeding a
masking off high bits into a pair of right shifts.
gcc/testsuite
* gcc.target/riscv/nozicond-3.c: New test.
|
|
"mov<mode>cc" expander uses x86_64_general_operand predicate that limits the
range of immediate operands to 32-bit size. The usage of this predicate
causes ifcvt to force out-of-range immediates to registers when converting
through noce_try_cmove. The testcase:
long long foo (long long c) { return c >= 0 ? 0x400000000ll : -1ll; }
compiles (-O2) to:
foo:
testq %rdi, %rdi
movq $-1, %rax
movabsq $0x400000000, %rdx
cmovns %rdx, %rax
ret
The above testcase can be compiled to a more optimized code without
problematic CMOV instruction if 64-bit immediates are allowed in
"mov<mode>cc" expander:
foo:
movq %rdi, %rax
sarq $63, %rax
btsq $34, %rax
ret
The expander calls the ix86_expand_int_movcc function which internally
sanitizes arguments of emitted logical insns using expand_simple_binop.
The out-of-range immediates are forced to a temporary register just
before the instruction, so the instruction combiner is then able to
synthesize a 64-bit BTS instruction.
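In C terms, the emitted sequence computes the equivalent of (a sketch,
not code from the patch):
long long foo_equiv (long long c)
{
  /* sarq $63 splats the sign bit across the register; btsq $34 then sets
     bit 34 (0x400000000), giving 0x400000000 for c >= 0 and -1 otherwise.  */
  return (c >> 63) | 0x400000000ll;
}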
The code improves even for non-exact-log2 64-bit immediates, e.g.
long long foo (long long c) { return c >= 0 ? 0x400001234ll : -1ll; }
that now compiles to:
foo:
movabsq $0x400001234, %rdx
movq %rdi, %rax
sarq $63, %rax
orq %rdx, %rax
ret
again avoiding problematic CMOV instruction.
PR target/120553
gcc/ChangeLog:
* config/i386/i386.md (mov<mode>cc): Use "general_operand"
predicate for operands 2 and 3 for all modes.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr120553.c: New test.
|
|
Currently gimple_folder::convert_and_fold calls create_tmp_var; that
means while in ssa form, the pass which calls fold_stmt will always
have to update the ssa (via TODO_update_ssa or otherwise).  This seems
not very useful since we know that this will always be an ssa name; using
make_ssa_name instead is better and doesn't need to depend on the ssa
updater.
Plus this should give a small compile-time performance and memory-usage
improvement too, since it uses an anonymous ssa name rather than creating
a full decl for this.
Changes since v1:
* Use make_ssa_name instead of create_tmp_reg_or_ssa_name, anonymous ssa
names are allowed early on in gimple too.
Built and tested on aarch64-linux-gnu.
gcc/ChangeLog:
* config/aarch64/aarch64-sve-builtins.cc: Include value-range.h and tree-ssanames.h.
(gimple_folder::convert_and_fold): Use make_ssa_name
instead of create_tmp_var for the temporary. Add comment about callback argument.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
|
|
RVV has no such insn as v2 = div (vec_dup (x), v1), thus generated RTL of
that form hits the unreachable assert when expanding the insn.  This patch
would like to remove op div from the binary op form (vec_dup (x), v) to
avoid matching the pattern by mistake.
No new test is introduced, as pr33576.c already covers this.
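For reference, the kind of source that produces the scalar-dividend form
div (vec_dup (x), v) (an illustrative sketch only; the existing coverage
is pr33576.c):
void scalar_div_vec (int *out, int *in, int x, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = x / in[i];   /* scalar dividend, vector divisor */
}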
The below test suites are passed for this patch series.
* The rv64gcv full regression test.
gcc/ChangeLog:
* config/riscv/autovec-opt.md: Leverage vdup_v and v_vdup
binary op for different patterns.
* config/riscv/vector-iterators.md: Add vdup_v and v_vdup
binary op iterators.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
This patch has a minor improvement to if-converted sequences based on
observations I found while evaluating another patch from Shreya to handle more
cases with zicond insns.
Specifically there is a smaller/faster way than zicond to generate a -1,1
result when the condition is testing the sign bit.
So let's consider these two tests (rv64):
long foo1 (long c, long a) { return c >= 0 ? 1 : -1; }
long foo2 (long c, long a) { return c < 0 ? -1 : 1; }
So if we right arithmetic shift c by 63 bits, that splats the sign bit across a
register giving us 0, -1 for the first test and -1, 0 for the second test. We
then unconditionally turn on the LSB resulting in 1, -1 for the first case and
-1, 1 for the second.
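To put the idiom in C terms (a sketch of the transform, not code from the
patch):
long foo1_equiv (long c)
{
  /* With a 64-bit long, the arithmetic right shift splats the sign bit
     (0 or -1); OR-ing in the low bit then yields 1 or -1.  Relies on >>
     of a negative signed value being an arithmetic shift, as GCC
     implements it.  */
  return (c >> 63) | 1;
}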
This is implemented as a 4->2 splitter. There's another pair of cases we don't
handle because we don't have 4->3 splitters. Specifically if the true/false
values are reversed in the above examples without reversing the condition.
Raphael is playing a bit in the gimple space to see what opportunities might
exist to recognize more idioms in phiopt and generate better code earlier. No
idea how that's likely to pan out.
This is a pretty consistent small win. It's been through the rounds in my
tester. Just waiting on a green light from pre-commit testing.
gcc/
* config/riscv/zicond.md: Add new splitters to select
1, -1 or -1, 1 based on a sign bit test.
gcc/testsuite/
* gcc.target/riscv/nozicond-1.c: New test.
* gcc.target/riscv/nozicond-2.c: New test.
|
|
Support the Ssu64xl extension, which requires UXLEN to be 64.
gcc/ChangeLog:
* config/riscv/riscv-ext.def: New extension definition.
* config/riscv/riscv-ext.opt: New extension mask.
* doc/riscv-ext.texi: Document the new extension.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/arch-ssu64xl.c: New test.
Signed-off-by: Jiawei <jiawei@iscas.ac.cn>
|
|
Support the Sstvecd extension, which allows Supervisor Trap Vector
Base Address register (stvec) to support Direct mode.
gcc/ChangeLog:
* config/riscv/riscv-ext.def: New extension definition.
* config/riscv/riscv-ext.opt: New extension mask.
* doc/riscv-ext.texi: Document the new extension.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/arch-sstvecd.c: New test.
Signed-off-by: Jiawei <jiawei@iscas.ac.cn>
|