This patch modifies the shift expander to immediately lower constant
shifts without an unspec. It also modifies the ADR, SRA and ADDHNB patterns
to match the lowered forms of the shifts, as the predicate register is
not required for these instructions.
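As a rough sketch of the difference (simplified RTL, assuming the usual SVE
predication idiom; not copied from the patch), a constant shift that
previously expanded as a predicated unspec:

  ;; before: wrapped in UNSPEC_PRED_X with an all-true predicate
  (set (reg:VNx4SI 0)
       (unspec:VNx4SI [(ptrue)
                       (ashift:VNx4SI (reg:VNx4SI 1)
                                      (const_vector:VNx4SI [2 ...]))]
                      UNSPEC_PRED_X))
  ;; after: lowered immediately to the plain rtx
  (set (reg:VNx4SI 0)
       (ashift:VNx4SI (reg:VNx4SI 1) (const_vector:VNx4SI [2 ...])))

This is what lets the ADR, SRA and ADDHNB patterns match the shift directly.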
Bootstrapped and regtested on aarch64-linux-gnu.
Signed-off-by: Dhruv Chawla <dhruvc@nvidia.com>
Co-authored-by: Richard Sandiford <richard.sandiford@arm.com>
gcc/ChangeLog:
* config/aarch64/aarch64-sve.md (@aarch64_adr<mode>_shift):
Match lowered form of ashift.
(*aarch64_adr<mode>_shift): Likewise.
(*aarch64_adr_shift_sxtw): Likewise.
(*aarch64_adr_shift_uxtw): Likewise.
(<ASHIFT:optab><mode>3): Check amount instead of operands[2] in
aarch64_sve_<lr>shift_operand.
(v<optab><mode>3): Generate unpredicated shifts for constant
operands.
(@aarch64_pred_<optab><mode>): Convert to a define_expand.
(*aarch64_pred_<optab><mode>): Create define_insn_and_split pattern
from @aarch64_pred_<optab><mode>.
(*post_ra_v_ashl<mode>3): Rename to ...
(aarch64_vashl<mode>3_const): ... this and remove reload requirement.
(*post_ra_v_<optab><mode>3): Rename to ...
(aarch64_v<optab><mode>3_const): ... this and remove reload
requirement.
* config/aarch64/aarch64-sve2.md
(@aarch64_sve_add_<sve_int_op><mode>): Match lowered form of
SHIFTRT.
(*aarch64_sve2_sra<mode>): Likewise.
(*bitmask_shift_plus<mode>): Match lowered form of lshiftrt.
|
|
[aarch64] [vxworks] mark x18 as fixed, adjust tests
VxWorks uses x18 as the TCB, so STATIC_CHAIN_REGNUM has long been set
(in gcc/config/aarch64/aarch64-vxworks.h) to use x9 instead.
This patch marks x18 as fixed if the newly-introduced
TARGET_OS_USES_R18 is defined, so that it is not chosen by the
register allocator, rejects -fsanitize=shadow-call-stack due to the
register conflict, and adjusts tests that depend on x18 or on the
static chain register.
for gcc/ChangeLog
* config/aarch64/aarch64-vxworks.h (TARGET_OS_USES_R18): Define.
Update comments.
* config/aarch64/aarch64.cc (aarch64_conditional_register_usage):
Mark x18 as fixed on VxWorks.
(aarch64_override_options_internal): Issue sorry message on
-fsanitize=shadow-call-stack if TARGET_OS_USES_R18.
for gcc/testsuite/ChangeLog
* gcc.dg/cwsc1.c (CHAIN, aarch64): x9 instead of x18 for __vxworks.
* gcc.target/aarch64/reg-alloc-4.c: Drop x18-assigned asm
operand on vxworks.
* gcc.target/aarch64/shadow_call_stack_1.c: Don't expect
-ffixed-x18 error on vxworks, but rather the sorry message.
* gcc.target/aarch64/shadow_call_stack_2.c: Skip on vxworks.
* gcc.target/aarch64/shadow_call_stack_3.c: Likewise.
* gcc.target/aarch64/shadow_call_stack_4.c: Likewise.
* gcc.target/aarch64/shadow_call_stack_5.c: Likewise.
* gcc.target/aarch64/shadow_call_stack_6.c: Likewise.
* gcc.target/aarch64/shadow_call_stack_7.c: Likewise.
* gcc.target/aarch64/shadow_call_stack_8.c: Likewise.
* gcc.target/aarch64/stack-check-prologue-19.c: Likewise.
* gcc.target/aarch64/stack-check-prologue-20.c: Likewise.
|
|
So the next step in Shreya's work. In the prior patch we used two shifts to
clear bits at the high or low end of an object. In this patch we use 3 shifts
to clear bits on both ends.
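For instance (illustrative constants, not taken from the patch), on rv64 an
AND with 0x00ffffffffffff00, which clears the top eight and bottom eight
bits, can be done as:

  slli a0,a0,8    # drop the upper 8 bits
  srli a0,a0,16   # realign, dropping the lower 8 bits
  slli a0,a0,8    # put the surviving bits back in place

rather than synthesizing the mask into a temporary register first.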
Nothing really special here. With mvconst_internal still in the tree it's of
marginal value, though Shreya and I have confirmed the code coming out of
expand looks good. It's just that combine reconstitutes the operation via
mvconst_internal+and which looks cheaper.
When I was playing in this space earlier I definitely saw testsuite cases that
need this case handled to not regress with mvconst_internal removed.
This has spun in my tester on rv32 and rv64 and it's bootstrap + testing on my
BPI with a mere 23 hours to go. Waiting on pre-commit testing to render a
verdict before moving forward.
gcc/
* config/riscv/riscv.cc (synthesize_and): When profitable, use a three
shift sequence to clear bits at both the upper and lower ends rather
than synthesizing the constant mask.
|
|
Patch is originally from Siarhei Volkau <lis8215@gmail.com>.
RISC-V has a zero register (x0) which we can use to store zero into memory
without loading the constant into a distinct register. Adjust the constraints
of the 32-bit movdi_32bit pattern to recognize that we can store 0.0 into
memory using x0 as the source register.
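A minimal illustration (a sketch; the before/after assembly is assumed, not
taken from the patch):

  void store_zero (double *p) { *p = 0.0; }

  /* rv32: with the "J" constraint the two word stores can come
     straight from x0, e.g.
         sw zero,0(a0)
         sw zero,4(a0)
     instead of first loading the constant into a register pair.  */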
This patch only affects RISC-V. It has been regression tested on riscv64-elf.
Jeff has also tested this in his tester (riscv64-elf and riscv32-elf) with no
regressions.
PR target/70557
gcc/
* config/riscv/riscv.md (movdi_32bit): Add "J" constraint to allow storing 0
directly to memory.
|
|
The middle-end uses rtx_cost on constants with the outer rtx code being COMPARE
to find out the cost of forming a constant for a comparison instruction.
For the aarch64 backend, we would previously just return the cost of constant
formation in general. We can improve this by checking whether the outer code
is COMPARE and, if the constant fits the constraints of the cmp instruction,
setting the cost to one instruction.
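For example (an illustrative sketch of the intent, not code from the patch):

  int f (long x) { return x == 4096; }

  /* 4096 fits cmp's shifted 12-bit immediate (cmp x0, #1, lsl #12),
     so costing the constant as general constant formation would
     overstate its cost in this comparison context.  */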
Built and tested for aarch64-linux-gnu.
PR target/120372
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_rtx_costs <case CONST_INT>): Handle
the case where the outer code is COMPARE and the constant can be handled
by the cmp instruction.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/imm_choice_comparison-2.c: New test.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
|
|
[PR120360]
As mentioned by Linus, we fail to optimize a comparison against zero of the
otherwise unused result of a plus with a CONST_INT second operand.
This can be done using just a cmp instruction with the negated constant and
js/jns/je/jne etc. conditional jumps (or setcc).
We already have the *cmp<mode>_minus_1 instruction which handles it when
(as shown in foo in the testcase) the IL has MINUS rather than PLUS,
but for all constants except the minimum value the canonical form is
with PLUS.
The following patch adds a new pattern and predicate to handle this.
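As a sketch of the idea (illustrative function, not the testcase itself):

  int g (long x) { return x + 5 == 0; }

  /* x + 5 == 0 is equivalent to x == -5, so this can be emitted as
         cmpq $-5, %rdi
         sete %al
     without computing the otherwise unused sum.  */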
2025-05-22 Jakub Jelinek <jakub@redhat.com>
PR target/120360
* config/i386/predicates.md (x86_64_neg_const_int_operand): New
predicate.
* config/i386/i386.md (*cmp<mode>_plus_1): New pattern.
* gcc.target/i386/pr120360.c: New test.
|
|
So the first special case of clearing bits from Shreya's work. We can clear an
arbitrary number of high bits by shifting left by the number of bits to clear,
then logically shifting right to put everything in place. Similarly we can
clear an arbitrary number of low bits with a right logical shift followed by a
left shift. Naturally this only applies when the constant synthesis budget is
2+ insns.
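For example (illustrative constants): on rv64, clearing the upper 16 bits,
i.e. x & 0x0000ffffffffffff, becomes

  slli a0,a0,16
  srli a0,a0,16

and clearing the low 16 bits uses the same pair in the opposite order.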
Even with mvconst_internal still enabled this does consistently show various
small code generation improvements.
I have seen a notable regression. The two shift form to wipe out high bits
isn't handled well by ext-dce. Essentially it looks like we don't recognize
the sequence as wiping upper bits, instead it makes bits live and as a result
we're unable to remove a prior zero extension. I've opened a bug for this
issue.
The other case I've seen is CSE related. If we had a number of masking
operations with the same mask, we might have previously CSE'd the constant. In
that scenario each instance of masking would be a single AND using the CSE'd
register holding the constant, whereas with this patch it'll be a pair of
shifts. But on a good uarch design the pair of shifts would be fused into a
single op. Given this is relatively rare and on the margins from a performance
standpoint I'm not going to worry about it.
This has spun in my tester for riscv32-elf and riscv64-elf. Bootstrap and
regression test is in flight and due in an hour or so. Waiting on the
upstream pre-commit tester and the bootstrap test before moving forward.
gcc/
* config/riscv/riscv.cc (synthesize_and): When profitable, use two
shift combinations to clear high or low bits rather than synthesizing
the constant.
|
|
There was a bug in aarch64_evpc_reencode which could leave zero_op0_p and
zero_op1_p of the struct "newd" uninitialized. r16-701-gd77c3bc1c35e303 fixed
the issue by zero initializing "newd." This patch provides an alternative fix
as suggested by Richard Sandiford based on the fact that the zeroness is
preserved by aarch64_evpc_reencode.
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_evpc_reencode): Copy zero_op0_p and
zero_op1_p from d to newd.
Signed-off-by: Pengxuan Zheng <quic_pzheng@quicinc.com>
|
|
I wrote this a couple months ago to fix an instruction count regression in
505.mcf on risc-v, but I don't have a trivial little testcase to add to the
suite.
There were two problems with the pattern.
First, the code was generating a shift followed by an add after reload.
Naturally combine doesn't run after reload and the code stayed in that form
rather than using shadd when available.
Second, the splitter was just over-active. We need to make sure that the
shifted form of the constant operand has a cost > 1 to synthesize. It's
useless to split if the shifted constant can be synthesized in a single
instruction.
This has been in my tester since March. So it's been through numerous
riscv64-elf and riscv32-elf test cycles as well as multiple rv64 bootstrap
tests. Waiting on the upstream CI system to render a verdict before moving
forward.
Looking further out I'm hoping this pattern will transform into a simpler and
always active define_split.
gcc/
* config/riscv/riscv.md ((x << C1) + C2): Tighten split condition
and generate more efficient code when splitting.
|
|
So a followup to last week's bugfix. In last week's change we stopped using
define_insn_and_split to rewrite instructions. That change was done to avoid
dropping a masking instruction out of the RTL.
As a result the pattern(s) were changed into simple define_insns, which is
good. One of them uses the GPR iterator since it's supposed to work for both
32-bit and 64-bit shifts on rv64.
But we failed to emit the right opcode for a 32-bit shift on rv64. Thankfully
the fix is trivial. If the mode is anything but word_mode, then we must be
doing a 32-bit shift on rv64, i.e. the various "w" shift instructions.
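A sketch of the kind of case involved (hypothetical test, assuming the usual
masked shift count idiom):

  unsigned f (unsigned x, unsigned n) { return x << (n & 31); }

  /* On rv64 this shift is SImode, not word_mode, so the pattern must
     emit sllw a0,a0,a1 (which masks the count to 5 bits itself)
     rather than the 64-bit sll.  */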
It's run through my tester. Just waiting on the upstream CI system to spin it.
PR target/120368
gcc/
* config/riscv/riscv.md (shift with masked shift count): Fix
opcode when generating an SImode shift on rv64.
gcc/testsuite/
* gcc.target/riscv/pr120368.c: New test.
|
|
This patch combines the vec_duplicate + vand.vv into vand.vx, as shown in
the example code below. The related pattern depends on the cost of
vec_duplicate from GR2VR. Late-combine will then take action if the cost
of GR2VR is zero, and reject the combination if the GR2VR cost is greater
than zero.
Assume we have example code like below, GR2VR cost is 0.
#define DEF_VX_BINARY(T, OP) \
void \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = in[i] OP x; \
}
DEF_VX_BINARY(int32_t, &)
Before this patch:
10 │ test_vx_binary_and_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vand.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_and_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vand.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
The below test suites are passed for this patch:
* The rv64gcv full regression test.
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add new
case for rtx code AND.
(expand_vx_binary_vec_vec_dup): Ditto.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op and to no_shift_vx_ops.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
The automatically-generated gen_* routines take their operands as
individual arguments, named "operand0" upwards. These arguments are
stored into an "operands" array before invoking the expander's C++
code, which can then modify the operands by writing to the array.
However, the SPARC sign-extend and zero-extend expanders used the
operandN variables directly, rather than operands[N]. That's a
correct usage in context, since the code goes on to expand the
pattern manually and invoke DONE.
But it's also easy for code to accidentally write to operandN instead
of operands[N] when trying to set up something like a match_dup.
It sounds like Jeff had seen an instance of this.
A later patch is therefore going to mark the operandN arguments
as const. This patch makes way for that by using operands[N]
instead of operandN for the SPARC expanders.
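A generic sketch of the idiom (not the actual SPARC patterns): in a
define_expand's preparation statement, changes must go through the operands
array so the rest of the expansion can see them:

  (define_expand "extendhisi2"
    [(set (match_operand:SI 0 "register_operand")
          (sign_extend:SI (match_operand:HI 1 "general_operand")))]
    ""
  {
    /* Correct: updates the array that the generated code reads back.  */
    operands[1] = force_reg (HImode, operands[1]);
    /* Writing "operand1 = ..." instead would only change the local
       argument, which is exactly what marking it const will catch.  */
  })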
gcc/
* config/sparc/sparc.md (zero_extendhisi2, zero_extendhidi2)
(extendhisi2, extendqihi2, extendqisi2, extendqidi2)
(extendhidi2): Use operands[0] and operands[1] instead of
operand0 and operand1.
|
|
The negsi2 C++ code writes to operands[2] even though the pattern
has no operand 2.
gcc/
* config/stormy16/stormy16.md (negsi2): Remove unused assignment.
|
|
This pattern used operands[2] to hold the shift amount, even though
the pattern doesn't have an operand 2 (not even as a match_dup).
This caused a build failure with -Werror:
array subscript 2 is above array bounds of ‘rtx_def* [2]’
gcc/
PR target/100837
* config/nds32/nds32-intrinsic.md (unspec_get_pending_int): Use
a local variable instead of operands[2].
|
|
So this is the next step on the path to mvconst_internal removal and is work
from Shreya and myself.
This puts in the infrastructure to allow us to synthesize logical AND much like
we're doing with logical IOR/XOR.
Unlike IOR/XOR, AND has many more special cases that can be profitable. For
example, you can use shifts to clear many bits. You can use zero extension to
clear bits, you can use rotate+andi+rotate, shift pairs, etc.
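For instance, the rotate trick (an illustrative sketch with made-up
constants, not code from this patch) can clear a field in the middle of a
register, say x & ~0xf00 on rv64 with Zbb:

  rori a0,a0,8    # rotate bits 11..8 down to 3..0
  andi a0,a0,-16  # clear them
  rori a0,a0,56   # rotate back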
So to make potential bisecting easy the plan is to drop in the work on logical
AND in several steps, essentially one new case at a time.
This step just puts the basics of the operation synthesis in place. It still
uses the same code generation strategies as we are currently using.
I'd like to say this is NFC, but unfortunately that's not true. While the code
generation strategy is the same, this does indirectly introduce new REG_EQUAL
notes. Those additional notes in turn can impact how various optimizers behave
in very minor ways.
As usual, this has survived my tester on riscv32-elf and riscv64-elf.
Waiting on pre-commit to do its thing. And I'll start queuing up the
additional cases we want to handle while waiting 😉
gcc/
* config/riscv/riscv-protos.h (synthesize_and): Prototype.
* config/riscv/riscv.cc (synthesize_and): New function.
* config/riscv/riscv.md (and<mode>3): Use it.
Co-Authored-By: Jeff Law <jlaw@ventanamicro.com>
|
|
Add a dummies reservation for the remaining insn types.
The RISC-V backend requires all types to map to a reservation in the
scheduler model. This adds all the types not currently handled by the
p8700 model to a dummy reservation.
gcc/
* config/riscv/mips-p8700.md (mips_p8700_dummies): New
reservation.
(mips_p8700_unknown): Reservation for all the dummies.
|
|
P8700 is a high-performance processor from MIPS that extends RISC-V with
custom instructions. Add support for the p8700 design from MIPS.
gcc/
* config/riscv/mips-p8700.md: New scheduler model.
* config/riscv/riscv-cores.def (mips-p8700): New tuning model
and core architecture.
* config/riscv/riscv-opts.h (riscv_microarchitecture_type): Add
mips-p8700.
* config/riscv/riscv.cc (mips_p8700_tune_info): New uarch
tuning parameters.
* config/riscv/riscv.md (tune): Add mips_p8700.
Include mips-p8700.md.
* doc/invoke.texi: Document tune/cpu options for the MIPS P8700.
Co-authored-by: Jeff Law <jlaw@ventanamicro.com>
|
|
This is the next batch of changes to reduce multiple assignments to an output
object. This time I'm focused on splitters in bitmanip.md.
This doesn't convert every case. For example there is one case that is very
clearly dependent on eliminating mvconst_internal and adjustment of a splitter
for andn and until those things happen it would clearly be a QOI implementation
regression.
There are cases where we set a scratch register more than once. It may be
possible to use an additional scratch. I haven't tried that yet.
I've seen one failure to if-convert a sequence after this patch, but it should
be resolved once the logical AND changes are merged. Otherwise I'm primarily
seeing slight differences in register allocation and scheduling. Nothing
concerning to me.
This has run through my tester, but I obviously want to see how it behaves in
the upstream CI system as that tests slightly different multilibs than mine (on
purpose).
gcc/
* config/riscv/bitmanip.md (various splits): Avoid writing the output
more than once when trivially possible.
|
|
This patch combines the vec_duplicate + vsub.vv into vrsub.vx, as shown in
the example code below. The related pattern depends on the cost of
vec_duplicate from GR2VR. Late-combine will then take action if the cost
of GR2VR is zero, and reject the combination if the GR2VR cost is greater
than zero.
Assume we have example code like below, GR2VR cost is 0.
#define DEF_VX_BINARY_REVERSE_CASE_0(T, OP, NAME) \
void \
test_vx_binary_reverse_##NAME##_##T##_case_0 (T * restrict out, \
T * restrict in, T x, \
unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = x OP in[i]; \
}
DEF_VX_BINARY_REVERSE_CASE_0(int32_t, -)
Before this patch:
54 │ test_vx_binary_reverse_rsub_int32_t_case_0:
55 │ beq a3,zero,.L27
56 │ vsetvli a5,zero,e32,m1,ta,ma
57 │ vmv.v.x v2,a2
58 │ slli a3,a3,32
59 │ srli a3,a3,32
60 │ .L22:
61 │ vsetvli a5,a3,e32,m1,ta,ma
62 │ vle32.v v1,0(a1)
63 │ slli a4,a5,2
64 │ sub a3,a3,a5
65 │ add a1,a1,a4
66 │ vsub.vv v1,v2,v1
67 │ vse32.v v1,0(a0)
68 │ add a0,a0,a4
69 │ bne a3,zero,.L22
After this patch:
50 │ test_vx_binary_reverse_rsub_int32_t_case_0:
51 │ beq a3,zero,.L27
52 │ slli a3,a3,32
53 │ srli a3,a3,32
54 │ .L22:
55 │ vsetvli a5,a3,e32,m1,ta,ma
56 │ vle32.v v1,0(a1)
57 │ slli a4,a5,2
58 │ sub a3,a3,a5
59 │ add a1,a1,a4
60 │ vrsub.vx v1,v1,a2
61 │ vse32.v v1,0(a0)
62 │ add a0,a0,a4
63 │ bne a3,zero,.L22
The below test suites are passed for this patch:
* The rv64gcv full regression test.
gcc/ChangeLog:
* config/riscv/autovec-opt.md: Leverage the new add func to
expand the vx insn.
* config/riscv/riscv-protos.h (expand_vx_binary_vec_dup_vec): Add
new func decl to expand format v = vop(vec_dup(x), v).
(expand_vx_binary_vec_vec_dup): Ditto but for format
v = vop(v, vec_dup(x)).
* config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add new
func impl to expand vx for v = vop(vec_dup(x), v).
(expand_vx_binary_vec_vec_dup): Ditto but for another format
v = vop(v, vec_dup(x)).
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
I goof'd when doing analysis of missed bext cases. For the shift into the sign
bit, then shift into the low bit case (thankfully the least common), I got it
in my brain that the field is at the left shift count. It's actually at
word_size - 1 - left shift count.
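Concretely (a sketch of the arithmetic, not from the patch):

  unsigned long f (unsigned long x) { return (x << 10) >> 63; }

  /* On rv64 this extracts bit 64 - 1 - 10 = 53, not bit 10, so a
     bext replacement would first need to compute 53 with a
     subtraction.  */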
Once the subtraction is included, it's no longer profitable to turn those cases
into bext. Best case scenario would be sub+bext, but we can just as easily use
sll+srl which fuses in some designs into a single op.
So this patch removes those two patterns, adjusts the existing testcase and
adds the new execution test.
Given it's a partial reversion and has passed in my tester, I'm going to go
ahead and push it to the trunk rather than waiting for upstream CI.
PR target/120333
gcc/
* config/riscv/bitmanip.md: Remove bext formed from left+right
shift patterns.
gcc/testsuite/
* gcc.target/riscv/pr114512.c: Update expected output.
* gcc.target/riscv/pr120333.c: New test.
|
|
The pa target lacks atomic sync compare and swap instructions.
These are implemented as libcalls and in libatomic. As on linux,
we lie about their availability.
This fixes the gcov-30.c test on hppa64-hpux11.
2025-05-19 John David Anglin <danglin@gcc.gnu.org>
gcc/ChangeLog:
* config/pa/pa-hpux.h (TARGET_HAVE_LIBATOMIC): Define.
(HAVE_sync_compare_and_swapqi): Likewise.
(HAVE_sync_compare_and_swaphi): Likewise.
(HAVE_sync_compare_and_swapsi): Likewise.
(HAVE_sync_compare_and_swapdi): Likewise.
|
|
As Mark and I independently tripped, there's a Wuninitialized issue in the
RISC-V backend. While *I* know the value would always be properly initialized,
it'd be somewhat painful to either eliminate the infeasible paths or do deep
enough analysis to suppress the false positive.
So this initializes OUTPUT and verifies it's got a reasonable value before
using it for the final copy into operands[0].
Bootstrapped on the BPI (regression testing still has ~12hrs to go).
gcc/
* config/riscv/riscv.cc (synthesize_ior_xor): Initialize OUTPUT and
verify it's non-null before emitting the final copy insn.
|
|
It's not enough to just check that a memory operand is of the form
mem(reg); after RA we also need to validate the register being used.
The safest way to do this is to call memory_operand.
PR target/120351
gcc/ChangeLog:
* config/arm/predicates.md (mem_noofs_operand): Also check the op
is a valid memory_operand.
gcc/testsuite/ChangeLog:
* gcc.target/arm/pr120351.c: New test.
|
|
The variables `major` and `minor` in `gen-riscv-ext-texi.cc`
conflict with the macros of the same name defined in `<sys/sysmacros.h>`,
which are exposed when building with newer versions of GCC on older
Linux distributions (e.g., Ubuntu 18.04). To resolve this, we rename them
to `major_version` and `minor_version` respectively. This aligns with the
GCC community's recommended practice [1] and improves code clarity.
[1] https://gcc.gnu.org/pipermail/gcc-patches/2025-May/683881.html
gcc/ChangeLog:
* config/riscv/gen-riscv-ext-texi.cc (struct version_t): Rename
major/minor to major_version/minor_version.
Signed-off-by: Songhe Zhu <zhusonghe@eswincomputing.com>
|
|
This commit adds the code gen support for Zilsd, which is a
newly added extension for RISC-V. The Zilsd extension allows
for loading and storing 64-bit values using even-odd register
pairs.
We only try to do minimal code gen support for that, which means we only
use the new instructions when the load/store operates on 64-bit data. We
could use them to optimize the code gen of memcpy/memset/memmove and also
the prologue and epilogue of functions, but I think that should probably
be done in a follow up patch.
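A sketch of the effect (hypothetical example, assuming rv32 with Zilsd):

  void copy64 (unsigned long long *d, unsigned long long *s) { *d = *s; }

  /* Before: two lw/sw pairs through separate registers.
     After: one load and one store of an even-odd pair, e.g.
         ld a4,0(a1)
         sd a4,0(a0)  */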
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_legitimize_move): Handle
load/store with odd-even reg pair.
(riscv_split_64bit_move_p): Don't split load/store if zilsd enabled.
(riscv_hard_regno_mode_ok): Only allow even registers to be used for
64-bit modes with zilsd.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/zilsd-code-gen.c: New test.
|
|
This commit introduces a new operand constraint `cR` for the RISC-V
architecture, which allows the use of an even-odd RVC general purpose register
(x8-x15) in inline asm.
Ref: https://github.com/riscv-non-isa/riscv-c-api-doc/pull/102
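A sketch of the intended use (hypothetical asm, assuming rv32 with Zilsd so
a 64-bit value needs an even-odd pair):

  unsigned long long
  load_pair (const unsigned long long *p)
  {
    unsigned long long v;
    /* "cR" forces v into an even-odd pair from x8-x15, as the
       compressed encodings require.  */
    asm ("ld %0, 0(%1)" : "=cR" (v) : "r" (p));
    return v;
  }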
gcc/ChangeLog:
* config/riscv/constraints.md (cR): New constraint.
* doc/md.texi (Machine Constraints::RISC-V): Document the new cR
constraint.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/constraint-cR-pair.c: New test case.
|
|
Since we use a single avx10.2 option to enable everything, there is
no need to split the intrinsics into two files.
gcc/ChangeLog:
* config.gcc: Remove 512 intrin file.
* config/i386/avx10_2-512bf16intrin.h:
Removed and combined to ...
* config/i386/avx10_2bf16intrin.h: ... this.
* config/i386/avx10_2-512convertintrin.h:
Removed and combined to ...
* config/i386/avx10_2convertintrin.h: ... this.
* config/i386/avx10_2-512mediaintrin.h:
Removed and combined to ...
* config/i386/avx10_2mediaintrin.h: ... this.
* config/i386/avx10_2-512minmaxintrin.h:
Removed and combined to ...
* config/i386/avx10_2minmaxintrin.h: ... this.
* config/i386/avx10_2-512satcvtintrin.h:
Removed and combined to ...
* config/i386/avx10_2satcvtintrin.h: ... this.
* config/i386/immintrin.h: Remove 512 intrin file.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx-1.c: Combine tests and change
intrin file name.
* gcc.target/i386/sse-13.c: Ditto.
* gcc.target/i386/sse-14.c: Ditto.
* gcc.target/i386/sse-22.c: Ditto.
* gcc.target/i386/sse-23.c: Ditto.
|
|
Several iterators in the md files are no longer needed since, after the
AVX10 refactor, patterns can directly use the legacy AVX512 ones. Remove
those duplicate iterators.
gcc/ChangeLog:
* config/i386/sse.md (VF1_VF2_AVX10_2): Removed.
(VF2_AVX10_2): Ditto.
(VI1248_AVX10_2): Ditto.
(VFH_AVX10_2): Ditto.
(VF1_AVX10_2): Ditto.
(VHF_AVX10_2): Ditto.
(VBF_AVX10_2): Ditto.
(VI8_AVX10_2): Ditto.
(VI2_AVX10_2): Ditto.
(VBF): New.
(div<mode>3): Use VBF instead of AVX10.2 ones.
(vec_cmp<mode><avx512fmaskmodelower>): Ditto.
(avx10_2_cvt2ps2phx_<mode><mask_name><round_name>):
Use VHF_AVX512VL instead of AVX10.2 ones.
(vcvt<convertfp8_pack><mode><mask_name>): Ditto.
(vcvthf82ph<mode><mask_name>): Ditto.
(VHF_AVX10_2_2): Remove not needed TARGET_AVX10_2.
(usdot_prod<sseunpackmodelower><mode>): Use VI2_AVX512F
instead of AVX10.2 ones.
(vdpphps_<mode>): Use VF1_AVX512VL instead of AVX10.2 ones.
(vdpphps_<mode>_mask): Ditto.
(vdpphps_<mode>_maskz): Ditto.
(vdpphps_<mode>_maskz_1): Ditto.
(avx10_2_scalefbf16_<mode><mask_name>): Use VBF instead of
AVX10.2 ones.
(<code><mode>3): Ditto.
(avx10_2_<code>bf16_<mode><mask_name>): Ditto.
(avx10_2_fmaddbf16_<mode>_maskz): Ditto.
(avx10_2_fmaddbf16_<mode><sd_maskz_name>): Ditto.
(avx10_2_fmaddbf16_<mode>_mask): Ditto.
(avx10_2_fmaddbf16_<mode>_mask3): Ditto.
(avx10_2_fnmaddbf16_<mode>_maskz): Ditto.
(avx10_2_fnmaddbf16_<mode><sd_maskz_name>): Ditto.
(avx10_2_fnmaddbf16_<mode>_mask): Ditto.
(avx10_2_fnmaddbf16_<mode>_mask3): Ditto.
(avx10_2_fmsubbf16_<mode>_maskz): Ditto.
(avx10_2_fmsubbf16_<mode><sd_maskz_name>): Ditto.
(avx10_2_fmsubbf16_<mode>_mask): Ditto.
(avx10_2_fmsubbf16_<mode>_mask3): Ditto.
(avx10_2_fnmsubbf16_<mode>_maskz): Ditto.
(avx10_2_fnmsubbf16_<mode><sd_maskz_name>): Ditto.
(avx10_2_fnmsubbf16_<mode>_mask): Ditto.
(avx10_2_fnmsubbf16_<mode>_mask3): Ditto.
(avx10_2_rsqrtbf16_<mode><mask_name>): Ditto.
(avx10_2_sqrtbf16_<mode><mask_name>): Ditto.
(avx10_2_rcpbf16_<mode><mask_name>): Ditto.
(avx10_2_getexpbf16_<mode><mask_name>): Ditto.
(avx10_2_<bf16immop>bf16_<mode><mask_name>): Ditto.
(avx10_2_fpclassbf16_<mode><mask_scalar_merge_name>): Ditto.
(avx10_2_cmpbf16_<mode><mask_scalar_merge_name>): Ditto.
(avx10_2_cvt<sat_cvt_trunc_prefix>bf162i<sat_cvt_sign_prefix>bs<mode><mask_name>):
Ditto.
(avx10_2_cvtph2i<sat_cvt_sign_prefix>bs<mode><mask_name><round_name>):
Use VHF_AVX512VL instead of AVX10.2 ones.
(avx10_2_cvttph2i<sat_cvt_sign_prefix>bs<mode><mask_name><round_saeonly_name>):
Ditto.
(avx10_2_cvtps2i<sat_cvt_sign_prefix>bs<mode><mask_name><round_name>):
Use VF1_AVX512VL instead of AVX10.2 ones.
(avx10_2_cvttps2i<sat_cvt_sign_prefix>bs<mode><mask_name><round_saeonly_name>):
Ditto.
(avx10_2_vcvtt<castmode>2<sat_cvt_sign_prefix>dqs<mode><mask_name><round_saeonly_name>):
Use VF instead of AVX10.2 ones.
(avx10_2_vcvttpd2<sat_cvt_sign_prefix>qqs<mode><mask_name><round_saeonly_name>):
Use VF2 instead of AVX10.2 ones.
(avx10_2_vcvttps2<sat_cvt_sign_prefix>qqs<mode><mask_name><round_saeonly_name>):
Use VI8 instead of AVX10.2 ones.
(avx10_2_minmaxbf16_<mode><mask_name>): Use VBF instead of
AVX10.2 ones.
(avx10_2_minmaxp<mode><mask_name><round_saeonly_name>):
Use VFH_AVX512VL instead of AVX10.2 ones.
(avx10_2_vmovrs<ssemodesuffix><mode><mask_name>):
Use VI1248_AVX512VLBW instead of AVX10.2 ones.
|
|
As we mentioned in GCC 15, we will remove avx10.1-256/512 and evex512
in GCC 16. The combination of AVX10 and AVX512 option behavior will also
be simplified in GCC 16 since AVX10.1 now implies AVX512, making the
behavior match everyone else.
gcc/ChangeLog:
* common/config/i386/cpuinfo.h
(get_available_features): Remove feature set for AVX10_1_256.
* common/config/i386/i386-common.cc
(OPTION_MASK_ISA2_EVEX512_SET): Removed.
(OPTION_MASK_ISA2_AVX10_1_256_SET): Removed.
(OPTION_MASK_ISA_AVX10_1_SET): Imply all AVX512 features.
(OPTION_MASK_ISA2_AVX10_1_SET): Ditto.
(OPTION_MASK_ISA2_AVX2_UNSET): Remove AVX10_1_UNSET.
(OPTION_MASK_ISA2_EVEX512_UNSET): Removed.
(OPTION_MASK_ISA2_AVX10_1_UNSET): Remove AVX10_1_256.
(OPTION_MASK_ISA2_AVX512F_UNSET): Unset AVX10_1.
(ix86_handle_option): Remove special handling for AVX512/AVX10.1
options, evex512 and avx10_1_256. Modify ISA set for AVX10 options.
* common/config/i386/i386-cpuinfo.h
(enum feature_priority): Remove P_AVX10_1_256.
(enum processor_features): Remove FEATURE_AVX10_1_256.
* common/config/i386/i386-isas.h: Remove avx10.1-256/512.
* config/i386/avx512bf16intrin.h: Rollback target push before
evex512 is introduced.
* config/i386/avx512bf16vlintrin.h: Ditto.
* config/i386/avx512bitalgintrin.h: Ditto.
* config/i386/avx512bitalgvlintrin.h: Ditto.
* config/i386/avx512bwintrin.h: Ditto.
* config/i386/avx512cdintrin.h: Ditto.
* config/i386/avx512dqintrin.h: Ditto.
* config/i386/avx512fintrin.h: Ditto.
* config/i386/avx512fp16intrin.h: Ditto.
* config/i386/avx512fp16vlintrin.h: Ditto.
* config/i386/avx512ifmaintrin.h: Ditto.
* config/i386/avx512ifmavlintrin.h: Ditto.
* config/i386/avx512vbmi2intrin.h: Ditto.
* config/i386/avx512vbmi2vlintrin.h: Ditto.
* config/i386/avx512vbmiintrin.h: Ditto.
* config/i386/avx512vbmivlintrin.h: Ditto.
* config/i386/avx512vlbwintrin.h: Ditto.
* config/i386/avx512vldqintrin.h: Ditto.
* config/i386/avx512vlintrin.h: Ditto.
* config/i386/avx512vnniintrin.h: Ditto.
* config/i386/avx512vnnivlintrin.h: Ditto.
* config/i386/avx512vp2intersectintrin.h: Ditto.
* config/i386/avx512vp2intersectvlintrin.h: Ditto.
* config/i386/avx512vpopcntdqintrin.h: Ditto.
* config/i386/avx512vpopcntdqvlintrin.h: Ditto.
* config/i386/gfniintrin.h: Ditto.
* config/i386/vaesintrin.h: Ditto.
* config/i386/vpclmulqdqintrin.h: Ditto.
* config/i386/driver-i386.cc (check_avx512_features): Removed.
(host_detect_local_cpu): Remove -march=native special handling.
* config/i386/i386-builtins.cc
(ix86_vectorize_builtin_gather): Remove TARGET_EVEX512.
* config/i386/i386-c.cc
(ix86_target_macros_internal): Remove EVEX512 and AVX10_1_256.
* config/i386/i386-expand.cc
(ix86_valid_mask_cmp_mode): Remove TARGET_EVEX512.
(ix86_expand_int_sse_cmp): Ditto.
(ix86_vector_duplicate_simode_const): Ditto.
(ix86_expand_vector_init_duplicate): Ditto.
(ix86_expand_vector_init_one_nonzero): Ditto.
(ix86_emit_swsqrtsf): Ditto.
(ix86_vectorize_vec_perm_const): Ditto.
(ix86_expand_vecop_qihi2): Ditto.
(ix86_expand_sse2_mulvxdi3): Ditto.
(ix86_gen_bcst_mem): Ditto.
* config/i386/i386-isa.def (EVEX512): Removed.
(AVX10_1_256): Ditto.
* config/i386/i386-options.cc
(isa2_opts): Remove evex512 and avx10.1-256.
(ix86_function_specific_save): Remove no_avx512_explicit and
no_avx10_1_explicit.
(ix86_function_specific_restore): Ditto.
(ix86_valid_target_attribute_inner_p): Remove evex512 and
avx10.1-256/512.
(ix86_valid_target_attribute_tree): Remove special handling
to rerun ix86_option_override_internal for AVX10.1-256.
(ix86_option_override_internal): Remove warning handling.
(ix86_simd_clone_adjust): Remove evex512.
* config/i386/i386.cc
(type_natural_mode): Remove TARGET_EVEX512.
(ix86_return_in_memory): Ditto.
(standard_sse_constant_p): Ditto.
(standard_sse_constant_opcode): Ditto.
(ix86_get_ssemov): Ditto.
(ix86_legitimate_constant_p): Ditto.
(ix86_vectorize_builtin_scatter): Ditto.
(ix86_hard_regno_mode_ok): Ditto.
(ix86_set_reg_reg_cost): Ditto.
(ix86_rtx_costs): Ditto.
(ix86_vector_mode_supported_p): Ditto.
(ix86_preferred_simd_mode): Ditto.
(ix86_autovectorize_vector_modes): Ditto.
(ix86_get_mask_mode): Ditto.
(ix86_simd_clone_compute_vecsize_and_simdlen): Ditto.
(ix86_simd_clone_usable): Ditto.
* config/i386/i386.h (BIGGEST_ALIGNMENT): Ditto.
(MOVE_MAX): Ditto.
(STORE_MAX_PIECES): Ditto.
(PTA_SKYLAKE_AVX512): Remove PTA_EVEX512.
(PTA_CANNONLAKE): Ditto.
(PTA_ZNVER4): Ditto.
(PTA_GRANITERAPIDS): Use PTA_AVX10_1.
(PTA_DIAMONDRAPIDS): Use PTA_GRANITERAPIDS.
* config/i386/i386.md: Remove TARGET_EVEX512, avx512f_512
and avx512bw_512.
* config/i386/i386.opt: Remove ix86_no_avx512_explicit,
ix86_no_avx10_1_explicit, mevex512, mavx10.1-256/512 and
warning for mavx10.1. Modify option comment.
* config/i386/i386.opt.urls: Remove evex512 and avx10.1-256/512.
* config/i386/predicates.md: Remove TARGET_EVEX512.
* config/i386/sse.md: Ditto.
* doc/extend.texi: Remove avx10.1-256/512. Modify avx10.1 doc.
* doc/invoke.texi: Remove avx10.1-256/512 and evex512.
* doc/sourcebuild.texi: Remove avx10.1-256/512.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx10_1-1.c: Remove warning.
* gcc.target/i386/avx10_1-2.c: Ditto.
* gcc.target/i386/avx10_1-3.c: Ditto.
* gcc.target/i386/avx10_1-4.c: Ditto.
* gcc.target/i386/pr111068.c: Ditto.
* gcc.target/i386/pr117946.c: Ditto.
* gcc.target/i386/pr117240_avx512f.c: Remove -mevex512 and
warning.
* gcc.target/i386/avx10_1-11.c: Rename to ...
* gcc.target/i386/avx10_1-5.c: ... this. Remove warning.
* gcc.target/i386/avx10_1-12.c: Rename to ...
* gcc.target/i386/avx10_1-6.c: ... this. Remove warning.
* gcc.target/i386/avx10_1-26.c: Rename to ...
* gcc.target/i386/avx10_1-7.c: ... this. Remove warning.
The original avx10_1-7.c is removed.
* gcc.target/i386/avx10_1-10.c: Removed.
* gcc.target/i386/avx10_1-13.c: Removed.
* gcc.target/i386/avx10_1-14.c: Removed.
* gcc.target/i386/avx10_1-15.c: Removed.
* gcc.target/i386/avx10_1-16.c: Removed.
* gcc.target/i386/avx10_1-17.c: Removed.
* gcc.target/i386/avx10_1-18.c: Removed.
* gcc.target/i386/avx10_1-19.c: Removed.
* gcc.target/i386/avx10_1-20.c: Removed.
* gcc.target/i386/avx10_1-21.c: Removed.
* gcc.target/i386/avx10_1-22.c: Removed.
* gcc.target/i386/avx10_1-23.c: Removed.
* gcc.target/i386/avx10_1-8.c: Removed.
* gcc.target/i386/avx10_1-9.c: Removed.
* gcc.target/i386/noevex512-1.c: Removed.
* gcc.target/i386/noevex512-2.c: Removed.
* gcc.target/i386/noevex512-3.c: Removed.
* gcc.target/i386/pr111889.c: Removed.
* gcc.target/i386/pr111907.c: Removed.
|
|
As we mentioned in GCC 15, we will remove evex512 in GCC 16 since it
is not useful anymore now that we have 512 bit support directly. This
patch first unpushes evex512 in the builtins.
gcc/ChangeLog:
* config/i386/i386-builtin.def
(BDESC): Remove OPTION_MASK_ISA2_EVEX512.
* config/i386/i386-builtins.cc
(ix86_init_mmx_sse_builtins): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr90096.c: Adjust error message.
* gcc.target/i386/pr117304-1.c: Removed.
|
|
Found this while setting up the risc-v coordination branch off of gcc-15. Not
sure why I didn't use rtvec_alloc directly here since we're going to initialize
the whole vector ourselves. Using gen_rtvec was just wrong as it's walking
down a non-existent varargs list. Under the "right" circumstances it can walk
off a page and fault.
This was seen with a test already in the testsuite (I forget which test), so no
new regression test.
Tested in my tester and verified the failure on the coordination branch is
resolved as well. Waiting on pre-commit CI to render a verdict.
gcc/
* config/riscv/riscv-vect-permconst.cc (vector_permconst::process_bb):
Use rtvec_alloc, not gen_rtvec, since we don't want/need to initialize
the vector.
|
|
While evaluating Shreya's logical AND synthesis work on spec2017 I ran into a
code quality regression where combine was failing to eliminate a redundant sign
extension.
I had a hunch the problem would be with the multiple sets of the same pseudo
register in the AND synthesis path. I was right that the problem was multiple
sets of the same pseudo, but it was actually some of the splitters in the
RISC-V backend that were the culprit. Those multiple sets caused the sign bit
tracking code to need to make conservative assumptions thus resulting in
failure to eliminate the unnecessary sign extension.
So before we start moving on the logical AND patch we're going to do some
cleanups.
There are multiple moving parts in play. For example, we have splitters which do
multiple sets of the output register. Fixing some of those independently would
result in a code quality regression. Instead they need some adjustments to or
removal of mvconst_internal. Of course getting rid of mvconst_internal will
trigger all kinds of code quality regressions right now which ultimately lead
back to the need to revamp the logical AND expander. Point being we've got
some circular dependencies and breaking them may result in short term code
quality regressions. I'll obviously try to avoid those as much as possible.
So to start the process this patch adjusts the recently added XOR/IOR synthesis
to avoid re-using the destination register. While the reuse was clearly safe
from a semantic standpoint, various parts of the compiler can do a better job
for pseudos that are only set once.
Given this synthesis path should only be active during initial RTL generation,
we can create new pseudos at will, so we create a new one for each insn. At
the end of the sequence we copy from the last set into the final destination.
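Schematically (made-up pseudo numbers and constants, not dumped from the
compiler):

  ;; before: the destination pseudo was written twice
  (set (reg 137) (ior (reg 135) (const_int 4096)))
  (set (reg 137) (ior (reg 137) (const_int 1)))
  ;; after: a fresh pseudo per step, plus the final copy
  (set (reg 140) (ior (reg 135) (const_int 4096)))
  (set (reg 141) (ior (reg 140) (const_int 1)))
  (set (reg 137) (reg 141))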
This has various trivial impacts on the code generation, but the resulting code
looks no better or worse to me across spec2017.
This has been tested in my tester and is currently bootstrapping on my BPI.
Waiting on data from the pre-commit tester before moving forward...
gcc/
* config/riscv/riscv.cc (synthesize_ior_xor): Avoid writing
operands[0] more than once, use new pseudos instead.
|
|
Ensure the loop in riscv_gpr_save_operation_p can never execute more than once.
Reported-by: huangcunjian <huangcunjian.huang@alibaba-inc.com>
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_gpr_save_operation_p): Remove the
break and fix the bug in the elt index.
|
|
We can optimize an AND with certain vectors of immediates as FMOV if the
result of the AND is as if the upper lane of the input vector is set to zero
and the lower lane remains unchanged.
For example, at present:
v4hi
f_v4hi (v4hi x)
{
return x & (v4hi){ 0xffff, 0xffff, 0, 0 };
}
generates:
f_v4hi:
movi d31, 0xffffffff
and v0.8b, v0.8b, v31.8b
ret
With this patch, it generates:
f_v4hi:
fmov s0, s0
ret
Changes since v1:
* v2: Simplify the mask checking logic by using native_decode_int and address a
few other review comments.
PR target/100165
gcc/ChangeLog:
* config/aarch64/aarch64-protos.h (aarch64_output_fmov): New prototype.
(aarch64_simd_valid_and_imm_fmov): Likewise.
* config/aarch64/aarch64-simd.md (and<mode>3<vczle><vczbe>): Allow FMOV
codegen.
* config/aarch64/aarch64.cc (aarch64_simd_valid_and_imm_fmov): New.
(aarch64_output_fmov): Likewise.
* config/aarch64/constraints.md (Df): New constraint.
* config/aarch64/predicates.md (aarch64_reg_or_and_imm): Update
predicate to support FMOV codegen.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/fmov-1-be.c: New test.
* gcc.target/aarch64/fmov-1-le.c: New test.
* gcc.target/aarch64/fmov-2-be.c: New test.
* gcc.target/aarch64/fmov-2-le.c: New test.
Signed-off-by: Pengxuan Zheng <quic_pzheng@quicinc.com>
|
|
[PR100165]
Certain permute that blends a vector with zero can be interpreted as an AND of a
mask. This idea was suggested by Richard Sandiford when he was reviewing my
patch which tries to optimizes certain vector permute with the FMOV instruction
for the aarch64 target.
For example, for the aarch64 target, at present:
v4hi
f_v4hi (v4hi x)
{
return __builtin_shuffle (x, (v4hi){ 0, 0, 0, 0 }, (v4hi){ 4, 1, 6, 3 });
}
generates:
f_v4hi:
uzp1 v0.2d, v0.2d, v0.2d
adrp x0, .LC0
ldr d31, [x0, #:lo12:.LC0]
tbl v0.8b, {v0.16b}, v31.8b
ret
.LC0:
.byte -1
.byte -1
.byte 2
.byte 3
.byte -1
.byte -1
.byte 6
.byte 7
With this patch, it generates:
f_v4hi:
mvni v31.2s, 0xff, msl 8
and v0.8b, v0.8b, v31.8b
ret
This patch also provides a target-independent routine for detecting vector
permute patterns which can be interpreted as AND.
Changes since v1:
* v2: Rework the patch to only perform the optimization for aarch64 by calling
the target independent routine vec_perm_and_mask.
PR target/100165
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_evpc_and): New.
(aarch64_expand_vec_perm_const_1): Call aarch64_evpc_and.
* optabs.cc (vec_perm_and_mask): New.
* optabs.h (vec_perm_and_mask): New prototype.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/and-be.c: New test.
* gcc.target/aarch64/and-le.c: New test.
Signed-off-by: Pengxuan Zheng <quic_pzheng@quicinc.com>
|
|
Some fields (e.g., zero_op0_p and zero_op1_p) of the struct "newd" may be left
uninitialized in aarch64_evpc_reencode. This can cause reading of uninitialized
data. I found this oversight when testing my patches on and/fmov
optimizations. This patch fixes the bug by zero initializing the struct.
Pushed as obvious after bootstrap/test on aarch64-linux-gnu.
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_evpc_reencode): Zero initialize
newd.
|
|
Since the AARCH64_CORE defines in aarch64-cores.def all use -1 for
the variant, it is just easier to add the cast to unsigned in the usage
in driver-aarch64.cc.
Built and tested on aarch64-linux-gnu.
gcc/ChangeLog:
PR target/118603
* config/aarch64/driver-aarch64.cc (aarch64_cpu_data): Add cast to unsigned
to VARIANT of the define AARCH64_CORE.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
|
|
There is a narrowing warning in aarch64_detect_vector_stmt_subtype
about gather_load_x32_cost and gather_load_x64_cost converting from int to unsigned.
These fields are always unsigned and even the constructor for sve_vec_cost takes
an unsigned. So let's just move the fields over to unsigned.
Built and tested for aarch64-linux-gnu.
gcc/ChangeLog:
* config/aarch64/aarch64-protos.h (struct sve_vec_cost): Change gather_load_x32_cost
and gather_load_x64_cost fields to unsigned.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
|
|
This patch mops up obvious redundancies that weren't caught by the
automatic regexp replacements in earlier patches. It doesn't do
anything with genemit.cc, since that will be part of a later series.
gcc/
* config/arm/arm.cc (arm_gen_load_multiple_1): Simplify use of
end_sequence.
(arm_gen_store_multiple_1): Likewise.
* expr.cc (gen_move_insn): Likewise.
* gentarget-def.cc (main): Likewise.
|
|
This is the result of using a regexp to replace:
rtx( |_insn *)<stuff> = end_sequence ();
return <stuff>;
with:
return end_sequence ();
gcc/
* asan.cc (asan_emit_allocas_unpoison): Directly return the
result of end_sequence.
(hwasan_emit_untag_frame): Likewise.
* config/aarch64/aarch64-speculation.cc
(aarch64_speculation_clobber_sp): Likewise.
(aarch64_speculation_establish_tracker): Likewise.
* config/arm/arm.cc (arm_call_tls_get_addr): Likewise.
* config/avr/avr-passes.cc (avr_parallel_insn_from_insns): Likewise.
* config/sh/sh_treg_combine.cc
(sh_treg_combine::make_not_reg_insn): Likewise.
* tree-outof-ssa.cc (emit_partition_copy): Likewise.
|
|
This is the result of using a regexp to replace instances of:
<stuff> = get_insns ();
end_sequence ();
with:
<stuff> = end_sequence ();
where the indentation is the same for both lines, and where there
might be blank lines in between.
gcc/
* asan.cc (asan_clear_shadow): Use the return value of end_sequence,
rather than calling get_insns separately.
(asan_emit_stack_protection, asan_emit_allocas_unpoison): Likewise.
(hwasan_frame_base, hwasan_emit_untag_frame): Likewise.
* auto-inc-dec.cc (attempt_change): Likewise.
* avoid-store-forwarding.cc (process_store_forwarding): Likewise.
* bb-reorder.cc (fix_crossing_unconditional_branches): Likewise.
* builtins.cc (expand_builtin_apply_args): Likewise.
(expand_builtin_return, expand_builtin_mathfn_ternary): Likewise.
(expand_builtin_mathfn_3, expand_builtin_int_roundingfn): Likewise.
(expand_builtin_int_roundingfn_2, expand_builtin_saveregs): Likewise.
(inline_string_cmp): Likewise.
* calls.cc (expand_call): Likewise.
* cfgexpand.cc (expand_asm_stmt, pass_expand::execute): Likewise.
* cfgloopanal.cc (init_set_costs): Likewise.
* cfgrtl.cc (insert_insn_on_edge, prepend_insn_to_edge): Likewise.
(rtl_lv_add_condition_to_bb): Likewise.
* config/aarch64/aarch64-speculation.cc
(aarch64_speculation_clobber_sp): Likewise.
(aarch64_speculation_establish_tracker): Likewise.
(aarch64_do_track_speculation): Likewise.
* config/aarch64/aarch64.cc (aarch64_load_symref_appropriately)
(aarch64_expand_vector_init, aarch64_gen_ccmp_first): Likewise.
(aarch64_gen_ccmp_next, aarch64_mode_emit): Likewise.
(aarch64_md_asm_adjust): Likewise.
(aarch64_switch_pstate_sm_for_landing_pad): Likewise.
(aarch64_switch_pstate_sm_for_jump): Likewise.
(aarch64_switch_pstate_sm_for_call): Likewise.
* config/alpha/alpha.cc (alpha_legitimize_address_1): Likewise.
(alpha_emit_xfloating_libcall, alpha_gp_save_rtx): Likewise.
* config/arc/arc.cc (hwloop_optimize): Likewise.
* config/arm/aarch-common.cc (arm_md_asm_adjust): Likewise.
* config/arm/arm-builtins.cc: Likewise.
* config/arm/arm.cc (require_pic_register): Likewise.
(arm_call_tls_get_addr, arm_gen_load_multiple_1): Likewise.
(arm_gen_store_multiple_1, cmse_clear_registers): Likewise.
(cmse_nonsecure_call_inline_register_clear): Likewise.
(arm_attempt_dlstp_transform): Likewise.
* config/avr/avr-passes.cc (bbinfo_t::optimize_one_block): Likewise.
(avr_parallel_insn_from_insns): Likewise.
* config/avr/avr.cc (avr_prologue_setup_frame): Likewise.
(avr_expand_epilogue): Likewise.
* config/bfin/bfin.cc (hwloop_optimize): Likewise.
* config/c6x/c6x.cc (c6x_expand_compare): Likewise.
* config/cris/cris.cc (cris_split_movdx): Likewise.
* config/cris/cris.md: Likewise.
* config/csky/csky.cc (csky_call_tls_get_addr): Likewise.
* config/epiphany/resolve-sw-modes.cc
(pass_resolve_sw_modes::execute): Likewise.
* config/fr30/fr30.cc (fr30_move_double): Likewise.
* config/frv/frv.cc (frv_split_scc, frv_split_cond_move): Likewise.
(frv_split_minmax, frv_split_abs): Likewise.
* config/frv/frv.md: Likewise.
* config/gcn/gcn.cc (move_callee_saved_registers): Likewise.
(gcn_expand_prologue, gcn_restore_exec, gcn_md_reorg): Likewise.
* config/i386/i386-expand.cc
(ix86_expand_carry_flag_compare, ix86_expand_int_movcc): Likewise.
(ix86_vector_duplicate_value, expand_vec_perm_interleave2): Likewise.
(expand_vec_perm_vperm2f128_vblend): Likewise.
(expand_vec_perm_2perm_interleave): Likewise.
(expand_vec_perm_2perm_pblendv): Likewise.
(expand_vec_perm2_vperm2f128_vblend, ix86_gen_ccmp_first): Likewise.
(ix86_gen_ccmp_next): Likewise.
* config/i386/i386-features.cc
(scalar_chain::make_vector_copies): Likewise.
(scalar_chain::convert_reg, scalar_chain::convert_op): Likewise.
(timode_scalar_chain::convert_insn): Likewise.
* config/i386/i386.cc (ix86_init_pic_reg, ix86_va_start): Likewise.
(ix86_get_drap_rtx, legitimize_tls_address): Likewise.
(ix86_md_asm_adjust): Likewise.
* config/ia64/ia64.cc (ia64_expand_tls_address): Likewise.
(ia64_expand_compare, spill_restore_mem): Likewise.
(expand_vec_perm_interleave_2): Likewise.
* config/loongarch/loongarch.cc
(loongarch_call_tls_get_addr): Likewise.
* config/m32r/m32r.cc (gen_split_move_double): Likewise.
* config/m32r/m32r.md: Likewise.
* config/m68k/m68k.cc (m68k_call_tls_get_addr): Likewise.
(m68k_call_m68k_read_tp, m68k_sched_md_init_global): Likewise.
* config/m68k/m68k.md: Likewise.
* config/microblaze/microblaze.cc
(microblaze_call_tls_get_addr): Likewise.
* config/mips/mips.cc (mips_call_tls_get_addr): Likewise.
(mips_ls2_init_dfa_post_cycle_insn): Likewise.
(mips16_split_long_branches): Likewise.
* config/nvptx/nvptx.cc (nvptx_gen_shuffle): Likewise.
(nvptx_gen_shared_bcast, nvptx_propagate): Likewise.
(workaround_uninit_method_1, workaround_uninit_method_2): Likewise.
(workaround_uninit_method_3): Likewise.
* config/or1k/or1k.cc (or1k_init_pic_reg): Likewise.
* config/pa/pa.cc (legitimize_tls_address): Likewise.
* config/pru/pru.cc (pru_expand_fp_compare, pru_reorg_loop): Likewise.
* config/riscv/riscv-shorten-memrefs.cc
(pass_shorten_memrefs::transform): Likewise.
* config/riscv/riscv-vsetvl.cc (pre_vsetvl::emit_vsetvl): Likewise.
* config/riscv/riscv.cc (riscv_call_tls_get_addr): Likewise.
(riscv_frm_emit_after_bb_end): Likewise.
* config/rl78/rl78.cc (rl78_emit_libcall): Likewise.
* config/rs6000/rs6000.cc (rs6000_debug_legitimize_address): Likewise.
* config/s390/s390.cc (legitimize_tls_address): Likewise.
(s390_two_part_insv, s390_load_got, s390_va_start): Likewise.
* config/sh/sh_treg_combine.cc
(sh_treg_combine::make_not_reg_insn): Likewise.
* config/sparc/sparc.cc (sparc_legitimize_tls_address): Likewise.
(sparc_output_mi_thunk, sparc_init_pic_reg): Likewise.
* config/stormy16/stormy16.cc (xstormy16_split_cbranch): Likewise.
* config/xtensa/xtensa.cc (xtensa_copy_incoming_a7): Likewise.
(xtensa_expand_block_set_libcall): Likewise.
(xtensa_expand_block_set_unrolled_loop): Likewise.
(xtensa_expand_block_set_small_loop, xtensa_call_tls_desc): Likewise.
* dse.cc (emit_inc_dec_insn_before, find_shift_sequence): Likewise.
(replace_read): Likewise.
* emit-rtl.cc (reorder_insns, gen_clobber, gen_use): Likewise.
* except.cc (dw2_build_landing_pads, sjlj_mark_call_sites): Likewise.
(sjlj_emit_function_enter, sjlj_emit_function_exit): Likewise.
(sjlj_emit_dispatch_table): Likewise.
* expmed.cc (expmed_mult_highpart_optab, expand_sdiv_pow2): Likewise.
* expr.cc (convert_mode_scalar, emit_move_multi_word): Likewise.
(gen_move_insn, expand_cond_expr_using_cmove): Likewise.
(expand_expr_divmod, expand_expr_real_2): Likewise.
(maybe_optimize_pow2p_mod_cmp, maybe_optimize_mod_cmp): Likewise.
* function.cc (emit_initial_value_sets): Likewise.
(instantiate_virtual_regs_in_insn, expand_function_end): Likewise.
(get_arg_pointer_save_area, make_split_prologue_seq): Likewise.
(make_prologue_seq, gen_call_used_regs_seq): Likewise.
(thread_prologue_and_epilogue_insns): Likewise.
(match_asm_constraints_1): Likewise.
* gcse.cc (prepare_copy_insn): Likewise.
* ifcvt.cc (noce_emit_store_flag, noce_emit_move_insn): Likewise.
(noce_emit_cmove): Likewise.
* init-regs.cc (initialize_uninitialized_regs): Likewise.
* internal-fn.cc (expand_POPCOUNT): Likewise.
* ira-emit.cc (emit_move_list): Likewise.
* ira.cc (ira): Likewise.
* loop-doloop.cc (doloop_modify): Likewise.
* loop-unroll.cc (compare_and_jump_seq): Likewise.
(unroll_loop_runtime_iterations, insert_base_initialization): Likewise.
(split_iv, insert_var_expansion_initialization): Likewise.
(combine_var_copies_in_loop_exit): Likewise.
* lower-subreg.cc (resolve_simple_move,resolve_shift_zext): Likewise.
* lra-constraints.cc (match_reload, check_and_process_move): Likewise.
(process_addr_reg, insert_move_for_subreg): Likewise.
(process_address_1, curr_insn_transform): Likewise.
(inherit_reload_reg, process_invariant_for_inheritance): Likewise.
(inherit_in_ebb, remove_inheritance_pseudos): Likewise.
* lra-remat.cc (do_remat): Likewise.
* mode-switching.cc (commit_mode_sets): Likewise.
(optimize_mode_switching): Likewise.
* optabs.cc (expand_binop, expand_twoval_binop_libfunc): Likewise.
(expand_clrsb_using_clz, expand_doubleword_clz_ctz_ffs): Likewise.
(expand_doubleword_popcount, expand_ctz, expand_ffs): Likewise.
(expand_absneg_bit, expand_unop, expand_copysign_bit): Likewise.
(prepare_float_lib_cmp, expand_float, expand_fix): Likewise.
(expand_fixed_convert, gen_cond_trap): Likewise.
(expand_atomic_fetch_op): Likewise.
* ree.cc (combine_reaching_defs): Likewise.
* reg-stack.cc (compensate_edge): Likewise.
* reload1.cc (emit_input_reload_insns): Likewise.
* sel-sched-ir.cc (setup_nop_and_exit_insns): Likewise.
* shrink-wrap.cc (emit_common_heads_for_components): Likewise.
(emit_common_tails_for_components): Likewise.
(insert_prologue_epilogue_for_components): Likewise.
* tree-outof-ssa.cc (emit_partition_copy): Likewise.
(insert_value_copy_on_edge): Likewise.
* tree-ssa-loop-ivopts.cc (computation_cost): Likewise.
|
|
This patch combines the vec_duplicate + vsub.vv into vsub.vx, as shown in
the example code below. The related pattern depends on the cost of
vec_duplicate from GR2VR. Late-combine will then take action if the cost
of GR2VR is zero, and reject the combination if the GR2VR cost is greater
than zero.
Assume we have example code like below, GR2VR cost is 0.
#define DEF_VX_BINARY(T, OP) \
void \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = in[i] OP x; \
}
DEF_VX_BINARY(int32_t, -)
Before this patch:
10 │ test_binary_vx_sub:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma // Deleted if GR2VR cost zero
13 │ vmv.v.x v2,a2 // Ditto.
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vsub.vv v1,v2,v1
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_binary_vx_sub:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vsub.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
The below test suites are passed for this patch:
* The rv64gcv full regression test.
gcc/ChangeLog:
* config/riscv/autovec-opt.md (*<optab>_vx_<mode>): Add new
pattern to convert vec_duplicate + vsub.vv to vsub.vx.
* config/riscv/riscv.cc (riscv_rtx_costs): Add minus as plus op.
* config/riscv/vector-iterators.md: Add minus to iterator
any_int_binop_no_shift_vx.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
Thead has the XTHEADBB extension which has a lot of overlap with Zbb. I made
the incorrect assumption that XTHEADBS would largely be like Zbs when
generalizing Shreya's work.
As a result we can't use the operation synthesis code for IOR/XOR because we
don't have binv/bset like capabilities. I should have double checked on
XTHEADBS, my bad.
Anyway, the fix is trivial. Don't allow bset/binv based on XTHEADBS.
Already spun in my tester. Spinning in the pre-commit CI system now.
PR target/120223
gcc/
* config/riscv/riscv.cc (synthesize_ior_xor): XTHEADBS does not have
single bit manipulations.
gcc/testsuite/
* gcc.target/riscv/pr120223.c: New test.
|
|
With the avx512_two_epilogues tuning enabled for zen4 and zen5
the gcc.target/i386/vect-epilogues-5.c testcase below regresses
and ends up using AVX2 sized vectors for the masked epilogue
rather than AVX512 sized vectors. The following patch rectifies
this and adds coverage for the intended behavior.
* config/i386/i386.cc (ix86_vector_costs::finish_cost):
Do not suggest a first epilogue mode for AVX512 sized
main loops with X86_TUNE_AVX512_TWO_EPILOGUES as that
interferes with using a masked epilogue.
* gcc.target/i386/vect-epilogues-1.c: New testcase.
* gcc.target/i386/vect-epilogues-2.c: Likewise.
* gcc.target/i386/vect-epilogues-3.c: Likewise.
* gcc.target/i386/vect-epilogues-4.c: Likewise.
* gcc.target/i386/vect-epilogues-5.c: Likewise.
|
|
The augmented hypervisor series extension 'sha'[1] is a new profile-defined
extension that captures the full set of features that are mandated to
be supported along with the 'H' extension.
[1] https://github.com/riscv/riscv-profiles/blob/main/src/rva23-profile.adoc#rva23s64-profile
Version log: Update implements, fix testcase format.
gcc/ChangeLog:
* config/riscv/riscv-ext.def: New extension defs.
* config/riscv/riscv-ext.opt: Ditto.
* doc/riscv-ext.texi: Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/arch-55.c: New test.
|
|
gcc/ChangeLog:
* config/riscv/t-riscv: Drop duplicate build rule for
riscv-ext.opt.
|
|
gcc/ChangeLog:
* config/riscv/riscv-ext.opt.urls: Regenerate.
|
|
In some benchmarks, I noticed STV failing due to the cost model deeming it
unprofitable, but the gain is inside the loop while the sse<->integer
conversion is outside the loop; the current cost model doesn't consider the
frequency of those gains/costs.
The patch weights those costs with the block frequency.
gcc/ChangeLog:
PR target/120215
* config/i386/i386-features.cc
(scalar_chain::mark_dual_mode_def): Weight
cost of integer<->sse move with bb frequency when it's
optimized_for_speed_p.
(general_scalar_chain::compute_convert_gain): Ditto, and
adjust function prototype to return true/false when cost model
is profitable or not.
(timode_scalar_chain::compute_convert_gain): Ditto.
(convert_scalars_to_vector): Adjust after the upper two
function prototype are changed.
* config/i386/i386-features.h (class scalar_chain): Change
n_integer_to_sse/n_sse_to_integer to cost_sse_integer, and add
weighted_cost_sse_integer.
(class general_scalar_chain): Adjust prototype to return bool
instead of int.
(class timode_scalar_chain): Ditto.
|
|
Insn tf_to_fprx2 moves a TF value into a floating-point register pair.
For alternative 0, the input is a vector register; however, in the else
case the instruction ldr is emitted, which expects floating-point register
operands only. Thus, this works only for vector registers which overlap
with floating-point registers. Replace ldr with vlr so that the
remaining vector registers are dealt with, too. Emitting a vlr instead
of a ldr is fine since the destination register %v0 is part of a
floating-point register pair which means that the low half of %v0 is
ignored in the end anyway and therefore may be clobbered.
gcc/ChangeLog:
* config/s390/vector.md: Fix tf_to_fprx2 by using vlr instead of
ldr.
|
|
Define a new riscv_ext_info_t struct to aggregate all ISA extension fields
(name, version, flags, implied extensions, bitmask and extra flags) generated
from riscv-ext.def.
Also adjust riscv_ext_flag_table_t and riscv_implied_info_t so that they
are able to not hold the extension name; this part will be refactored in
later patches.
gcc/ChangeLog:
* common/config/riscv/riscv-common.cc (riscv_ext_info_t): New
struct.
(opt_var_ref_t): Adjust order.
(cl_opt_var_ref_t): Ditto.
(riscv_ext_flag_table_t): Adjust order, and add a new constructor
that does not hold the extension name.
(riscv_version_t): New struct.
(riscv_implied_info_t): Adjust order, and add a new constructor that
does not hold the extension name.
(apply_extra_extension_flags): New function.
(riscv_ext_infos): New.
(riscv_implied_info): Adjust.
* config/riscv/riscv-opts.h (EXT_FLAG_MACRO): New macro.
(BITMASK_NOT_YET_ALLOCATED): New macro.
|