|
missing attributes
The patch changes the order of inclusions, i.e. elfos.h is included before
the target-specific h8300/h8300.h, in a way similar to a few other targets.
Thanks to this change it is possible to override macros from elfos.h in
h8300/h8300.h, in particular .init/.fini section definitions.
PR target/109286
gcc/ChangeLog:
* config.gcc: Include elfos.h before h8300/h8300.h.
* config/h8300/h8300.h (INIT_SECTION_ASM_OP): Override
default version from elfos.h.
(FINI_SECTION_ASM_OP): Ditto.
(ASM_DECLARE_FUNCTION_NAME): Ditto.
(ASM_GENERATE_INTERNAL_LABEL): Macro removed because it was
being overridden in elfos.h anyway.
(ASM_OUTPUT_SKIP): Ditto.
|
|
AVL propagation currently assumes that it can propagate a constant AVL into any
vector insn and trips an assert if the insn fails to recognize after such a
propagation.
However, for xtheadvector that is not a correct assumption; xtheadvector does
not allow the vector length to be a constant integer (other than zero, which
is allowed via x0).
After consulting with Jin Ma (thanks!) we agree the right fix is to avoid
creating the immediate AVL for xtheadvector.
This has been tested in my tester, just waiting for the pre-commit tester to
spin it.
PR target/120642
gcc/
* config/riscv/riscv-avlprop.cc (pass_avlprop::execute): Do not do
constant AVL propagation for xtheadvector.
gcc/testsuite/
* gcc.target/riscv/rvv/xtheadvector/pr120642.c: New test.
|
|
Remove #pragma GCC target ("arch=armv8.2-a+bf16") since it matches the
preceding pragma GCC target and is thus useless.
gcc/ChangeLog:
* config/arm/arm_neon.h: Remove useless push/pop pragmas.
|
|
not being emitted
This is because in canonicalize_comparison() in gcc/expmed.cc, the COMPARE
rtx_cost() does not differ between the old and new forms of the immediate
values mentioned in the title. This patch fixes that.
(note: Currently, this patch only works if some constant propagation
optimizations are enabled (-O2 or higher) or if bare large constant
assignments are possible (-mconst16 or -mauto-litpools). In the future
I hope to make it work at -O1...)
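As a rough guess at what the new test exercises (the actual BGEUI-BLTUI-32k-64k.c source is not shown here), comparisons against the 32k/64k boundaries that canonicalize onto the b4constu values 32768 and 65536:
int ge_32k (unsigned int x)
{
  return x > 0x7fff;   /* canonicalized to x >= 0x8000, matching BGEUI ..., 32768 */
}
int lt_64k (unsigned int x)
{
  return x <= 0xffff;  /* canonicalized to x < 0x10000, matching BLTUI ..., 65536 */
}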
gcc/ChangeLog:
* config/xtensa/xtensa.cc (xtensa_b4const_or_zero):
Remove.
(xtensa_b4const): Add a case where the value is 0, and rename
to xtensa_b4const_or_zero.
(xtensa_rtx_costs): Fix to also consider the result of
xtensa_b4constu().
gcc/testsuite/ChangeLog:
* gcc.target/xtensa/BGEUI-BLTUI-32k-64k.c: New.
|
|
Computing the address of the thread pointer on s390 involves multiple
instructions and therefore bears the risk that the address of the canary,
or intermediate values of it, is spilled after the prologue in order to be
reloaded for the epilogue. Since there exists no mechanism to ensure
that a value does not come from the stack, as a precaution always compute
the address twice, i.e., once for the prologue and once for
the epilogue. Note, even if there were such a mechanism, emitting
optimal code would be non-trivial, since there exist cases with opposing
requirements, e.g. if the thread pointer is computed not only for the
TLS guard but also for other TLS objects. For the latter accesses it is
desirable to spill and reload the thread pointer instead of recomputing it,
whereas for the former it is not.
gcc/ChangeLog:
* config/s390/s390.md (stack_protect_get_tpsi): New insn.
(stack_protect_get_tpdi): New insn.
(stack_protect_set): Use new insn.
(stack_protect_test): Use new insn.
gcc/testsuite/ChangeLog:
* gcc.target/s390/stack-protector-guard-tls-1.c: New test.
|
|
In emit_vlmax_insn_lra we use a vsetivli for an immediate AVL.
XTHeadVector does not support this, so guard appropriately.
PR target/120461
gcc/ChangeLog:
* config/riscv/riscv-v.cc (emit_vlmax_insn_lra): Do not emit
vsetivli for XTHeadVector.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/xtheadvector/pr120461.c: New test.
|
|
If a user passes a string that doesn't represent a variable, we still try
to compute a hash for its type. Its tree does not represent a type but
an exceptional node, though. This patch just ignores such arguments,
leaving the error to the later checking code.
PR target/113829
gcc/ChangeLog:
* config/riscv/riscv-vector-builtins.cc (registered_function::overloaded_hash):
Skip non-type arguments.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/pr113829.c: New test.
|
|
gcc:
PR target/120995
* config/riscv/sync.md (zacas_atomic_cas_value_strong<mode>):
Allow op3 to be zero.
gcc/testsuite:
PR target/120995
* gcc.target/riscv/amo/zabha-zacas-atomic-cas.c: New test.
|
|
The following adds an x86 tuning to enable the use of AVX512 masked
epilogues in cases we heuristically determine to be unlikely to be
detrimental. The problematic cases are basically when there are
data streams that are both stored to and loaded from, and an outer loop
could end up executing only the inner loop's masked epilogue; with
unlucky data stream advancement from the outer loop we then end up needing
to forward from masked stores to masked loads. This isn't handled very
well, esp. for the case where unmasked operations would
not need to forward at all - that is, when forwarding completely
from the masked-out portion of the store (like the AVX upper half
to the AVX lower half of a load). There's also the case where
the number of iterations is known at compile time: only with
cost comparison would we consider a non-masked epilogue, and as we are not
doing that we have to add heuristics to avoid masking when a
single vector epilogue iteration would cover all scalar iterations
left (this is exercised by gcc.target/i386/pr110310.c).
SPEC CPU 2017 shows 3% text size savings over not using masked
epilogues, with performance impact in the noise. Masking all vector
epilogues gets that to 4% text size savings but with some major
runtime regressions in 503.bwaves_r and 527.cam4_r
(measured on a Zen4 system); with the implemented heuristic we leave
a 5% improvement for 549.fotonik3d_r unrealized.
With the heuristics we turn 22513 vector epilogues + up to 12305 scalar
epilogues into 12305 masked vector epilogues of which 574 are for
AVX vector sizes, 79 for SSE vector sizes and the rest for AVX512.
When masking all epilogues we get 14567 of them from
29467 vector + up to 14567 scalar epilogues, so the heuristics disable
an additional 20% of masked epilogues.
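For illustration only (not from the patch), this is the kind of loop whose leftover iterations can now be handled by a single AVX512 masked epilogue iteration instead of a scalar remainder loop; the flags are assumed, e.g. -O3 -march=znver4:
/* Only the destination stream is stored to, so the store-to-load
   forwarding concern above does not apply here.  */
void
scale (float *restrict dst, const float *restrict src, float a, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = a * src[i];
}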
* config/i386/x86-tune.def (X86_TUNE_AVX512_MASKED_EPILOGUES):
New tunable, default on for m_ZNVER4 and m_ZNVER5.
* config/i386/i386.cc (ix86_vector_costs::finish_cost): With
X86_TUNE_AVX512_MASKED_EPILOGUES and when the main loop
had a vectorization factor > 2 use a masked epilogue when
possible and when not obviously problematic.
* gcc.target/i386/vect-mask-epilogue-1.c: New testcase.
* gcc.target/i386/vect-mask-epilogue-2.c: Likewise.
* gcc.target/i386/vect-epilogues-3.c: Adjust.
|
|
VxWorks6 used symbols __GOTT_BASE__ and __GOTT_INDEX__ to obtain the
address of the global offset table. Starting with VxWorks7, that is
no longer the case, but we've still issued these symbols in
output_set_got. Do that only with VxWorks<7.
Switching to the call-based PIC register sequence, we have to set the
flag that prevents the use of the red zone, and AFAICT the reasons
that ruled out GOTOFF and other relative addressing no longer apply to
VxWorks7+.
for gcc/ChangeLog
* config/vxworks-dummy.h (TARGET_VXWORKS_VAROFF): New.
(TARGET_VXWORKS_GOTTPIC): New.
* config/vxworks.h (TARGET_VXWORKS_VAROFF): Override.
(TARGET_VXWORKS_GOTTPIC): Likewise.
* config/i386/i386.cc (output_set_got): Disable VxWorks6 GOT
sequence on VxWorks7.
(legitimize_pic_address): Accept relative addressing of
labels on VxWorks7.
(ix86_delegitimize_address_1): Likewise.
(ix86_output_addr_diff_elt): Likewise.
* config/i386/i386.md (tablejump): Likewise.
(set_got, set_got_labelled): Set no-red-zone flag on VxWorks7.
* config/i386/predicates.md (gotoff_operand): Test
TARGET_VXWORKS_VAROFF.
|
|
The xtensa ABI requires sign extension of signed 8/16-bit arguments to 32
bits and zero extension of unsigned 8/16-bit arguments to 32 bits.
TARGET_PROMOTE_PROTOTYPES is an optimization, not an ABI requirement.
Remove TARGET_PROMOTE_PROTOTYPES and define xtensa_promote_function_mode
to properly extend 8/16-bit arguments to 32 bits.
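A hypothetical illustration of the ABI rule (not from the patch): both sides of the call must agree that the narrow arguments are extended to 32 bits, which TARGET_PROMOTE_PROTOTYPES alone does not guarantee:
/* The caller must pass the char/short arguments extended to 32 bits,
   and the callee may rely on that.  */
extern int consume (signed char c, unsigned short u);
int
pass_through (int x, unsigned int y)
{
  return consume ((signed char) x, (unsigned short) y);
}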
gcc/
PR target/120888
* config/xtensa/xtensa.cc (xtensa_promote_function_mode): New.
(TARGET_PROMOTE_FUNCTION_MODE): Use.
(TARGET_PROMOTE_PROTOTYPES): Removed.
gcc/testsuite/
PR target/120888
* gcc.target/xtensa/pr120888-1.c: New test.
* gcc.target/xtensa/pr120888-2.c: Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
On VXE targets, we can directly use the fp min/max instruction instead of
calling into libm for fmin/fmax etc.
Also provide fmin/fmax versions for vectors, even though they cannot be
called directly. This will be exploited by a follow-up patch when
reductions are introduced.
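A minimal sketch of the scalar case, assuming a VXE-capable -march such as z14; with this change these calls can expand to the fp min/max instructions instead of libm calls:
#include <math.h>
double dmin (double a, double b) { return fmin (a, b); }  /* fp min insn, no libm call */
double dmax (double a, double b) { return fmax (a, b); }  /* fp max insn, no libm call */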
gcc/ChangeLog:
* config/s390/s390.md: Update UNSPECs.
* config/s390/vector.md (fmax<mode>3): New expander.
(fmin<mode>3): New expander.
* config/s390/vx-builtins.md (*fmin<mode>): New insn.
(vfmin<mode>): Redefined to use new insn.
(*fmax<mode>): New insn.
(vfmax<mode>): Redefined to use new insn.
gcc/testsuite/ChangeLog:
* gcc.target/s390/fminmax-1.c: New test.
* gcc.target/s390/fminmax-2.c: New test.
Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>
|
|
The TImode popcount sequence can be slightly improved with SVE.
If we generate:
ldr q31, [x0]
ptrue p7.b, vl16
cnt z31.d, p7/m, z31.d
addp d31, v31.2d
fmov x0, d31
ret
instead of:
h128:
ldr q31, [x0]
cnt v31.16b, v31.16b
addv b31, v31.16b
fmov w0, s31
ret
we use the ADDP instruction for reduction, which is cheaper on all CPUs AFAIK,
as it is only a single 64-bit addition vs the tree of additions for ADDV.
For example, on a CPU like Grace we get a latency/throughput of 2/4 for ADDP
vs 4/1 for ADDV.
We do generate one more instruction due to the PTRUE being materialised, but that
is cheap itself and can be scheduled away from the critical path or even CSE'd
with other PTRUE constants.
As this sequence is larger code size-wise it is avoided for -Os.
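Presumably the h128 example corresponds to source along these lines (the exact test source is assumed, not taken from the commit); __builtin_popcountg requires a recent GCC:
unsigned int
h128 (unsigned __int128 *p)
{
  return __builtin_popcountg (*p);  /* expands via the popcountti2 pattern */
}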
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
* config/aarch64/aarch64.md (popcountti2): Add TARGET_SVE path.
gcc/testsuite/
* gcc.target/aarch64/popcnt9.c: Add +nosve to target pragma.
* gcc.target/aarch64/popcnt13.c: New test.
|
|
This patch would like to implement scalar unsigned SAT_MUL from
uint128_t, aka:
NT __attribute__((noinline))
sat_u_mul_##NT##_fmt_1 (NT a, NT b)
{
uint128_t x = (uint128_t)a * (uint128_t)b;
NT max = -1;
if (x > (uint128_t)(max))
return max;
else
return (NT)x;
}
Take uint64_t and uint8_t as example:
Before this patch for uint8_t:
10 │ sat_u_mul_uint8_t_from_uint128_t_fmt_1:
11 │ mulhu a5,a0,a1
12 │ mul a0,a0,a1
13 │ bne a5,zero,.L3
14 │ li a5,255
15 │ bleu a0,a5,.L4
16 │ .L3:
17 │ li a0,255
18 │ .L4:
19 │ andi a0,a0,0xff
20 │ ret
After this patch for uint8_t:
10 │ sat_u_mul_uint8_t_from_uint128_t_fmt_1:
11 │ mul a0,a0,a1
12 │ li a5,255
13 │ sltu a5,a5,a0
14 │ neg a5,a5
15 │ or a0,a0,a5
16 │ andi a0,a0,0xff
17 │ ret
Before this patch for uint64_t:
10 │ sat_u_mul_uint64_t_from_uint128_t_fmt_1:
11 │ mulhu a5,a0,a1
12 │ mul a0,a0,a1
13 │ beq a5,zero,.L4
14 │ li a0,-1
15 │ .L4:
16 │ ret
After this patch for uint64_t:
10 │ sat_u_mul_uint64_t_from_uint128_t_fmt_1:
11 │ mulhsu a5,a1,a0
12 │ mul a0,a0,a1
13 │ snez a5,a5
14 │ neg a5,a5
15 │ or a0,a0,a5
16 │ ret
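The branchless sequence above (mulhu/sltu or snez, then neg and or) amounts to OR-ing in an all-ones mask on overflow; a C sketch of the uint8_t case:
#include <stdint.h>
uint8_t
sat_mul_u8 (uint8_t a, uint8_t b)
{
  uint16_t x = (uint16_t) a * b;
  uint16_t mask = -(uint16_t) (x > 0xff);  /* all-ones if the product overflows */
  return (uint8_t) (x | mask);             /* saturates to 0xff on overflow */
}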
gcc/ChangeLog:
* config/riscv/riscv-protos.h (riscv_expand_usmul): Add new func
decl.
* config/riscv/riscv.cc (riscv_expand_xmode_usmul): Add new func
to expand Xmode SAT_MUL.
(riscv_expand_non_xmode_usmul): Ditto but for non-Xmode.
(riscv_expand_usmul): Add new func to implement SAT_MUL.
* config/riscv/riscv.md (usmul<mode>3): Add new pattern to match
standard name usmul.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
Some patterns that are detected by the autovectorizer can be supported by
s390. Add expanders such that autovectorization of these patterns works.
The RTL for the builtins used an unspec to represent highpart multiplication.
Replace it with the correct RTL to allow further simplification.
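Illustrative examples (not the committed tests) of loops the new expanders let the autovectorizer handle, a rounding (ceiling) average and an unsigned highpart multiply:
#include <stdint.h>
void
avg_ceil (uint8_t *r, const uint8_t *a, const uint8_t *b, int n)
{
  for (int i = 0; i < n; i++)
    r[i] = (a[i] + b[i] + 1) >> 1;          /* rounding-up average */
}
void
mulh_u32 (uint32_t *r, const uint32_t *a, const uint32_t *b, int n)
{
  for (int i = 0; i < n; i++)
    r[i] = ((uint64_t) a[i] * b[i]) >> 32;  /* unsigned highpart multiply */
}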
gcc/ChangeLog:
* config/s390/s390.md: Removed unused unspecs.
* config/s390/vector.md (avg<mode>3_ceil): New expander.
(uavg<mode>3_ceil): New expander.
(smul<mode>3_highpart): New expander.
(umul<mode>3_highpart): New expander.
* config/s390/vx-builtins.md (vec_umulh<mode>): Remove unspec.
(vec_smulh<mode>): Remove unspec.
gcc/testsuite/ChangeLog:
* gcc.target/s390/vector/pattern-avg-1.c: New test.
* gcc.target/s390/vector/pattern-mulh-1.c: New test.
Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>
|
|
This patch extends our vec_cmp expander to support partial FP modes.
We use a predicate mode that is narrower than the operation's VPRED to govern
unpacked FP operations under flag_trapping_math, so the expansion must
handle cases where the comparison's target and governing predicates have
different modes.
While such predicates enable all of the defined part of the operation, they
are not all-true. Their false bits contribute to the (trapping) behavior of
the operation, so we cannot have SVE_KNOWN_PTRUE.
gcc/ChangeLog:
* config/aarch64/aarch64-sve.md (vec_cmp<mode><vpred>): Extend
to handle partial FP modes.
(@aarch64_pred_fcm<cmp_op><mode>): Likewise.
(@aarch64_pred_fcmuo<mode>): Likewise.
(*one_cmpl<mode>3): Rename to...
(@aarch64_pred_one_cmpl<mode>_z): ... this.
* config/aarch64/aarch64.cc (aarch64_emit_sve_fp_cond): Allow the
target and governing predicates to have different modes.
(aarch64_emit_sve_or_fp_conds): Likewise.
(aarch64_emit_sve_invert_fp_cond): Likewise.
(aarch64_expand_sve_vec_cmp_float): Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/sve/unpacked_fcm_1.c: New test.
* gcc.target/aarch64/sve/unpacked_fcm_2.c: Likewise.
|
|
Lowpart subregs are generally disallowed on big-endian SVE vector
registers, since the first memory element is stored at the least
significant end of the register, rather than the most significant end.
(See the comment at the head of aarch64-sve.md for details,
and aarch64_modes_compatible_p for the implementation.)
This means that arm_sve_neon_bridge.h needs to use custom define_insns
for big-endian targets, in lieu of using lowpart subregs. However,
one of those define_insns relied on the prohibited lowparts internally,
to convert an Advanced SIMD register to an SVE register. Since the
lowpart is not allowed, the lowpart_subreg would return null, leading
to a later ICE.
The simplest fix seems to be to use %Z instead, to force the Advanced
SIMD register to be written as an SVE register.
gcc/
* config/aarch64/aarch64-sve.md (@aarch64_sve_set_neonq_<mode>):
Use %Z instead of lowpart_subreg. Tweak formatting.
|
|
aarch64_expand_vector_init contains some divide-and-conquer code
that tries to load the odd and even elements into 64-bit registers
and then ZIP them together. On big-endian targets, the even elements
are more significant than the odd elements and so should come second
in the ZIP.
This fixes many execution failures on aarch64_be-elf, including
gcc.c-torture/execute/pr28982a.c.
gcc/
PR target/118891
* config/aarch64/aarch64.cc (aarch64_expand_vector_init): Fix the
ZIP1 operand order for big-endian targets.
|
|
1. Don't generate the loop if the loop count is 1.
2. For memset with a vector at small sizes, use a vector if the small size
supports one, otherwise use the scalar value.
3. Always expand the vector version of memset for vector_loop.
4. Always duplicate the promoted scalar value for vector_loop if it is
neither 0 nor -1.
5. Use a misaligned prologue if alignment isn't needed. When the misaligned
prologue is used, check whether the destination is actually aligned and,
if so, update the destination alignment.
6. Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
with a fixed epilogue size to enable overlapping moves and stores (see the
sketch after the generated code below).
The included tests show that codegen of vector_loop/unrolled_loop for
memset/memcpy is significantly improved. For
void
foo (void *p1, size_t len)
{
__builtin_memset (p1, 0, len);
}
with
-O2 -minline-all-stringops -mmemset-strategy=vector_loop:256:noalign,libcall:-1:noalign -march=x86-64
we used to generate
foo:
.LFB0:
.cfi_startproc
movq %rdi, %rax
pxor %xmm0, %xmm0
cmpq $64, %rsi
jnb .L18
.L2:
andl $63, %esi
je .L1
xorl %edx, %edx
testb $1, %sil
je .L5
movl $1, %edx
movb $0, (%rax)
cmpq %rsi, %rdx
jnb .L19
.L5:
movb $0, (%rax,%rdx)
movb $0, 1(%rax,%rdx)
addq $2, %rdx
cmpq %rsi, %rdx
jb .L5
.L1:
ret
.p2align 4,,10
.p2align 3
.L18:
movq %rsi, %rdx
xorl %eax, %eax
andq $-64, %rdx
.L3:
movups %xmm0, (%rdi,%rax)
movups %xmm0, 16(%rdi,%rax)
movups %xmm0, 32(%rdi,%rax)
movups %xmm0, 48(%rdi,%rax)
addq $64, %rax
cmpq %rdx, %rax
jb .L3
addq %rdi, %rax
jmp .L2
.L19:
ret
.cfi_endproc
with very poor prologue/epilogue. With this patch, we now generate:
foo:
.LFB0:
.cfi_startproc
pxor %xmm0, %xmm0
cmpq $64, %rsi
jnb .L2
testb $32, %sil
jne .L19
testb $16, %sil
jne .L20
testb $8, %sil
jne .L21
testb $4, %sil
jne .L22
testq %rsi, %rsi
jne .L23
.L1:
ret
.p2align 4,,10
.p2align 3
.L2:
movups %xmm0, -64(%rdi,%rsi)
movups %xmm0, -48(%rdi,%rsi)
movups %xmm0, -32(%rdi,%rsi)
movups %xmm0, -16(%rdi,%rsi)
subq $1, %rsi
cmpq $64, %rsi
jb .L1
andq $-64, %rsi
xorl %eax, %eax
.L9:
movups %xmm0, (%rdi,%rax)
movups %xmm0, 16(%rdi,%rax)
movups %xmm0, 32(%rdi,%rax)
movups %xmm0, 48(%rdi,%rax)
addq $64, %rax
cmpq %rsi, %rax
jb .L9
ret
.p2align 4,,10
.p2align 3
.L23:
movb $0, (%rdi)
testb $2, %sil
je .L1
xorl %eax, %eax
movw %ax, -2(%rdi,%rsi)
ret
.p2align 4,,10
.p2align 3
.L19:
movups %xmm0, (%rdi)
movups %xmm0, 16(%rdi)
movups %xmm0, -32(%rdi,%rsi)
movups %xmm0, -16(%rdi,%rsi)
ret
.p2align 4,,10
.p2align 3
.L20:
movups %xmm0, (%rdi)
movups %xmm0, -16(%rdi,%rsi)
ret
.p2align 4,,10
.p2align 3
.L21:
movq $0, (%rdi)
movq $0, -8(%rdi,%rsi)
ret
.p2align 4,,10
.p2align 3
.L22:
movl $0, (%rdi)
movl $0, -4(%rdi,%rsi)
ret
.cfi_endproc
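The .L19-.L22 tails above use the overlapping-store technique from point 6: the remainder is covered by stores from the start of the region plus stores ending exactly at its end, which may overlap. A rough C equivalent for an 8..16-byte memset tail (illustrative only, assuming 8 <= n <= 16):
#include <stdint.h>
#include <string.h>
static void
memset_tail_8_16 (unsigned char *p, size_t n, uint64_t v)
{
  memcpy (p, &v, 8);          /* covers bytes [0, 8) */
  memcpy (p + n - 8, &v, 8);  /* covers bytes [n - 8, n), may overlap the first store */
}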
gcc/
PR target/120670
PR target/120683
* config/i386/i386-expand.cc (expand_set_or_cpymem_via_loop):
Don't generate the loop if the loop count is 1.
(expand_cpymem_epilogue): Use move_by_pieces.
(setmem_epilogue_gen_val): New.
(expand_setmem_epilogue): Use store_by_pieces.
(expand_small_cpymem_or_setmem): Choose cpymem mode from MOVE_MAX.
For memset with a vector when the size is smaller than the vector
size, first try a narrower vector, otherwise use the scalar value.
(promote_duplicated_reg): Duplicate the scalar value for vector.
(ix86_expand_set_or_cpymem): Always expand vector-version of
memset for vector_loop. Use misaligned prologue if alignment
isn't needed and destination isn't aligned. Always initialize
vec_promoted_val from the promoted scalar value for vector_loop.
gcc/testsuite/
PR target/120670
PR target/120683
* gcc.target/i386/auto-init-padding-9.c: Updated.
* gcc.target/i386/memcpy-strategy-12.c: Likewise.
* gcc.target/i386/memset-strategy-25.c: Likewise.
* gcc.target/i386/memset-strategy-29.c: Likewise.
* gcc.target/i386/memset-strategy-30.c: Likewise.
* gcc.target/i386/memset-strategy-31.c: Likewise.
* gcc.target/i386/memcpy-pr120683-1.c: New test.
* gcc.target/i386/memcpy-pr120683-2.c: Likewise.
* gcc.target/i386/memcpy-pr120683-3.c: Likewise.
* gcc.target/i386/memcpy-pr120683-4.c: Likewise.
* gcc.target/i386/memcpy-pr120683-5.c: Likewise.
* gcc.target/i386/memcpy-pr120683-6.c: Likewise.
* gcc.target/i386/memcpy-pr120683-7.c: Likewise.
* gcc.target/i386/memset-pr120683-1.c: Likewise.
* gcc.target/i386/memset-pr120683-2.c: Likewise.
* gcc.target/i386/memset-pr120683-3.c: Likewise.
* gcc.target/i386/memset-pr120683-4.c: Likewise.
* gcc.target/i386/memset-pr120683-5.c: Likewise.
* gcc.target/i386/memset-pr120683-6.c: Likewise.
* gcc.target/i386/memset-pr120683-7.c: Likewise.
* gcc.target/i386/memset-pr120683-8.c: Likewise.
* gcc.target/i386/memset-pr120683-9.c: Likewise.
* gcc.target/i386/memset-pr120683-10.c: Likewise.
* gcc.target/i386/memset-pr120683-11.c: Likewise.
* gcc.target/i386/memset-pr120683-12.c: Likewise.
* gcc.target/i386/memset-pr120683-13.c: Likewise.
* gcc.target/i386/memset-pr120683-14.c: Likewise.
* gcc.target/i386/memset-pr120683-15.c: Likewise.
* gcc.target/i386/memset-pr120683-16.c: Likewise.
* gcc.target/i386/memset-pr120683-17.c: Likewise.
* gcc.target/i386/memset-pr120683-18.c: Likewise.
* gcc.target/i386/memset-pr120683-19.c: Likewise.
* gcc.target/i386/memset-pr120683-20.c: Likewise.
* gcc.target/i386/memset-pr120683-21.c: Likewise.
* gcc.target/i386/memset-pr120683-22.c: Likewise.
* gcc.target/i386/memset-pr120683-23.c: Likewise.
|
|
gcc/
* config/avr/avr-mcus.def: -mmcu= takes lower case MCU names.
* doc/avr-mmcu.texi: Rebuild.
|
|
gcc/
* config/avr/avr-mcus.def (avr32da28S, avr32da32S, avr32da48S)
(avr64da28S, avr64da32S, avr64da48S, avr64da64S)
(avr128da28S, avr128da32S, avr128da48S, avr128da64S): Add devices.
* doc/avr-mmcu.texi: Rebuild.
|
|
Configuring gcc for --target=powerpc-wrs-vxworks7r2 sets things up for
a 64-bit compiler, just like powerpc64-wrs-vxworks7r2, except that
TARGET_VXWORKS64 is only defined as 1 for targets that match
*64-*-vxworks*.
With !TARGET_VXWORKS64, we get a 64-bit toolchain that defines
SIZE_TYPE, PTRDIFF_TYPE, and WCHAR_TYPE as 32-bit types, and that
breaks GCC passes that expect SIZE_TYPE and PTRDIFF_TYPE to be as wide
as pointers.
Arrange for TARGET_VXWORKS64 on ppc to match TARGET_64BIT, after using
it to select the default word size with driver self specs.
for gcc/ChangeLog
* config/rs6000/vxworks.h (SUBTARGET_DRIVER_SELF_SPECS):
Redefine to select word size matching TARGET_VXWORKS64.
(TARGET_VXWORKS64): Redefine in terms of TARGET_64BIT.
|
|
The prefetch pattern was recently fixed/tightened (with the Q reg constraint)
to only support the right address patterns (REG or REG+D with the lower 5
bits clear).
However, in some cases that's too restrictive for LRA and it fails to
allocate a reg, resulting in the following ICE...
| gcc/testsuite/gcc.target/riscv/pr118241-b.cc:31:19: error: unable to generate reloads for:
| 31 | void m() { a.l(); }
| | ^
|(insn 26 25 27 7 (prefetch (mem/f:DI (plus:DI (reg/f:DI 143 [ _5 ])
| (const_int 56 [0x38])) [5 _5->batch[6]+0 S8 A64])
| (const_int 0 [0])
| (const_int 3 [0x3])) "gcc/testsuite/gcc.target/riscv/pr118241-b.cc":18:29 498 {prefetch}
| (expr_list:REG_DEAD (reg/f:DI 142 [ _5->batch[6] ])
| (nil)))
|during RTL pass: reload
Fix that by providing a fallback alternative register constraint to reload the address.
PR target/118241
gcc/ChangeLog:
* config/riscv/riscv.md (prefetch): Add alternative "r".
gcc/testsuite/ChangeLog:
* gcc.target/riscv/pr118241-b.cc: New test.
Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
|
|
Spotted this by chance as I saw a similar fixup in a comment.
From the comments, I think this is needed, but I've not hit any issues due
to it.
gcc/ChangeLog:
* config/riscv/predicates.md (prefetch_operand): Mask 5 bits.
Signed-off-by: Vineet Gupta <vineetg@rivosinc.com>
|
|
A right shift by 31 will become 0 or 1; this can be checked for by
treg_set_expr_not_const01 to avoid matching addc_t_r, as that can
expand to a 3-insn sequence instead.
This improves tests 023 to 026 from gcc.target/sh/pr54236-2.c, e.g.:
test_023:
shll r5
mov #0,r1
mov r4,r0
rts
addc r1,r0
With this change:
test_023:
shll r5
movt r0
rts
add r4,r0
We noticed this while evaluating a patch to improve how we handle
selecting between two constants based on the output of a LT/GE 0
test.
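A hypothetical example in the spirit of tests 023-026 from gcc.target/sh/pr54236-2.c (the exact test source is not shown here): adding the sign bit of one operand, i.e. a right shift by 31 whose result can only be 0 or 1:
int
add_sign_bit (int a, int b)
{
  return a + ((unsigned int) b >> 31);  /* shift result is only ever 0 or 1 */
}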
gcc/ChangeLog:
* config/sh/predicates.md (treg_set_expr_not_const01): Call
sh_recog_treg_set_expr_not_01.
* config/sh/sh-protos.h (sh_recog_treg_set_expr_not_01): New
function.
* config/sh/sh.cc (sh_recog_treg_set_expr_not_01): Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/sh/pr54236-2.c: Fix comments and expected output.
|
|
This patch would like to combine vec_duplicate + vsadd.vv into
vsadd.vx, as in the example code below. The related pattern depends
on the cost of vec_duplicate from GR2VR: late-combine takes
action if the cost of GR2VR is zero, and rejects the combination
if the GR2VR cost is greater than zero.
Assume we have the example code below, with a GR2VR cost of 0.
#define DEF_SAT_S_ADD(T, UT, MIN, MAX) \
T \
test_##T##_sat_add (T x, T y) \
{ \
T sum = (UT)x + (UT)y; \
return (x ^ y) < 0 \
? sum \
: (sum ^ x) >= 0 \
? sum \
: x < 0 ? MIN : MAX; \
}
DEF_SAT_S_ADD(int32_t, uint32_t, INT32_MIN, INT32_MAX)
DEF_VX_BINARY_CASE_2_WRAP(T, SAT_S_ADD_FUNC(T), sat_add)
Before this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vsadd.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vsadd.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add
new case SS_PLUS.
(expand_vx_binary_vec_vec_dup): Ditto.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op ss_plus.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
The register_operand predicate can match a subreg; then we'd have a subreg
of a subreg, which is invalid. Use lowpart_subreg to avoid the nested
subreg.
gcc/ChangeLog:
* config/loongarch/loongarch.md (crc_combine): Avoid nested
subreg.
gcc/testsuite/ChangeLog:
* gcc.c-torture/compile/pr120708.c: New test.
|
|
We were looking to evaluate some changes from Artemiy that improve GCC's
ability to discover fusible instruction pairs. There was no good way to get
any static data out of the compiler about what kinds of fusions were happening.
Yea, you could grub around the .sched dumps looking for the magic '+'
annotation, then look around at the slim RTL representation and make an
educated guess about what fused. But boy that was inconvenient.
All we really needed was a quick note in the dump file that the target hook
found a fusion pair and what kind was discovered. That made it easy to spot
invalid fusions, evaluate the effectiveness of Artemiy's work, write/discover
testcases for existing fusions and implement new fusions.
So from a codegen standpoint this is NFC; it only affects dump file output.
It's gone through the usual testing and I'll wait for pre-commit CI to churn
through it before moving forward.
gcc/
* config/riscv/riscv.cc (riscv_macro_fusion_pair_p): Add basic
instrumentation to all cases where fusion is detected. Fix
minor formatting goofs found in the process.
|
|
s390 missed constant vector permutation cases based on the vector pack
instruction or changing the size of the vector elements during vector
merge. This enables some more patterns that do not need to load a
constant vector for permutation.
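An illustrative constant permutation (not a committed test) that a pack instruction can implement, selecting the low halfword of every word of the two inputs; this uses GNU vector extensions and the element indices assume big-endian order:
typedef unsigned int   v4si __attribute__ ((vector_size (16)));
typedef unsigned short v8hi __attribute__ ((vector_size (16)));
v8hi
pack_low_halves (v4si a, v4si b)
{
  /* Keep the low halfword of each 32-bit element of both inputs.  */
  return __builtin_shufflevector ((v8hi) a, (v8hi) b,
                                  1, 3, 5, 7, 9, 11, 13, 15);
}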
gcc/ChangeLog:
* config/s390/s390.cc (expand_perm_with_merge): Add size change cases.
(expand_perm_with_pack): New function.
(vectorize_vec_perm_const_1): Wire up new function.
gcc/testsuite/ChangeLog:
* gcc.target/s390/vector/vec-perm-merge-1.c: New test.
* gcc.target/s390/vector/vec-perm-pack-1.c: New test.
Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>
|
|
candidates
A number of folks have had their fingers in this code and it's going to take a
few submissions to do everything we want to do.
This patch is primarily concerned with avoiding signaling that fusion can occur
in cases where fusion obviously should not be signaled.
Every DEC based fusion I'm aware of requires the first instruction to set a
destination register that is both used and set again by the second instruction.
If the two instructions set different registers, then the destination of the
first instruction was not dead and would need to have a result produced.
This is complicated by the fact that we have pseudo registers prior to reload.
So the approach we take is to signal fusion prior to reload even if the
destination registers don't match. Post reload we require them to match.
That allows us to clean up the code ever-so-slightly.
Second, we sometimes signaled fusion into loads that weren't scalar integer
loads. I'm not aware of a design that's fusing into FP loads or vector loads.
So those get rejected explicitly.
Third, the store pair "fusion" code is cleaned up a little. We use fusion to
model store pair commits since the basic properties for detection are the same.
The point where they "fuse" is different. Also this code liked to "return
false" at each step along the way if fusion wasn't possible. Future work for
additional fusion cases makes that behavior undesirable. So the logic gets
reworked a little bit to be more friendly to future work.
Fourth, if we already fused the previous instruction, then we can't fuse it
again. Signaling fusion in that case is, umm, bad as it creates an atomic blob
of code from a scheduling standpoint.
Hopefully I got everything correct with extracting this work out of a larger
set of changes 🙂 We will contribute some instrumentation & testing code so if
I botched things in a major way we'll soon have a way to test that and I'll be
on the hook to fix any goofs.
From a correctness standpoint this should be a big fat nop. We've seen this
make measurable differences in pico benchmarks, but obviously as you scale up
to bigger stuff the gains largely disappear into the noise.
This has been through Ventana's internal CI and my tester. I'll obviously wait
for a verdict from the pre-commit tester.
PR target/118886
gcc/
* config/riscv/riscv.cc (riscv_macro_fusion_pair_p): Check
for fusion being disabled earlier. If PREV is already fused,
then it can't be fused again. Be more selective about fusing
when the destination registers do not match. Don't fuse into
loads that aren't scalar integer modes. Revamp store pair
commit support.
Co-authored-by: Daniel Barboza <dbarboza@ventanamicro.com>
Co-authored-by: Shreya Munnangi <smunnangi1@ventanamicro.com>
|
|
Move the rules for CBZ/TBZ to be above the rules for
CBB<cond>/CBH<cond>/CB<cond>. We want them to have higher priority
because they can express larger displacements.
gcc/ChangeLog:
* config/aarch64/aarch64.md (aarch64_cbz<optab><mode>1): Move
above rules for CBB<cond>/CBH<cond>/CB<cond>.
(*aarch64_tbz<optab><mode>1): Likewise.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/cmpbr.c: Update tests.
|
|
Add rules for lowering `cbranch<mode>4` to CBB<cond>/CBH<cond>/CB<cond> when
CMPBR extension is enabled.
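Illustrative only (exact codegen not taken from the commit): with FEAT_CMPBR a compare-and-branch like the one below can be emitted as a single CB-class instruction instead of CMP followed by B.cond:
void
maybe_call (int x, void (*fn) (void))
{
  if (x == 42)  /* small immediate, a candidate for a compare-and-branch insn */
    fn ();
}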
gcc/ChangeLog:
* config/aarch64/aarch64-protos.h (aarch64_cb_rhs): New function.
* config/aarch64/aarch64.cc (aarch64_cb_rhs): Likewise.
* config/aarch64/aarch64.md (cbranch<mode>4): Rename to ...
(cbranch<GPI:mode>4): ...here, and emit CMPBR if possible.
(cbranch<SHORT:mode>4): New expand rule.
(aarch64_cb<INT_CMP:code><GPI:mode>): New insn rule.
(aarch64_cb<INT_CMP:code><SHORT:mode>): Likewise.
* config/aarch64/constraints.md (Uc0): New constraint.
(Uc1): Likewise.
(Uc2): Likewise.
* config/aarch64/iterators.md (cmpbr_suffix): New mode attr.
(INT_CMP): New code iterator.
(cmpbr_imm_constraint): New code attr.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/cmpbr.c:
|
|
Add the `+cmpbr` option to enable the FEAT_CMPBR architectural
extension.
gcc/ChangeLog:
* config/aarch64/aarch64-option-extensions.def (cmpbr): New
option.
* config/aarch64/aarch64.h (TARGET_CMPBR): New macro.
* doc/invoke.texi (cmpbr): New option.
|
|
The `far_branch` attribute only ever takes the values 0 or 1, so make it
a `no/yes` valued string attribute instead.
gcc/ChangeLog:
* config/aarch64/aarch64.md (far_branch): Replace 0/1 with
no/yes.
(aarch64_bcond): Handle rename.
(aarch64_cbz<optab><mode>1): Likewise.
(*aarch64_tbz<optab><mode>1): Likewise.
(@aarch64_tbz<optab><ALLI:mode><GPI:mode>): Likewise.
|
|
Extract the hardcoded values for the minimum PC-relative displacements
into named constants and document them.
gcc/ChangeLog:
* config/aarch64/aarch64.md (BRANCH_LEN_P_1MiB): New constant.
(BRANCH_LEN_N_1MiB): Likewise.
(BRANCH_LEN_P_32KiB): Likewise.
(BRANCH_LEN_N_32KiB): Likewise.
|
|
Give the `define_insn` rules used in lowering `cbranch<mode>4` to RTL
more descriptive and consistent names: from now on, each rule is named
after the AArch64 instruction that it generates. Also add comments to
document each rule.
gcc/ChangeLog:
* config/aarch64/aarch64.md (condjump): Rename to ...
(aarch64_bcond): ...here.
(*compare_condjump<GPI:mode>): Rename to ...
(*aarch64_bcond_wide_imm<GPI:mode>): ...here.
(aarch64_cb<optab><mode>): Rename to ...
(aarch64_cbz<optab><mode>1): ...here.
(*cb<optab><mode>1): Rename to ...
(*aarch64_tbz<optab><mode>1): ...here.
(@aarch64_tb<optab><ALLI:mode><GPI:mode>): Rename to ...
(@aarch64_tbz<optab><ALLI:mode><GPI:mode>): ...here.
(restore_stack_nonlocal): Handle rename.
(stack_protect_combined_test): Likewise.
* config/aarch64/aarch64-simd.md (cbranch<mode>4): Likewise.
* config/aarch64/aarch64-sme.md (aarch64_restore_za): Likewise.
* config/aarch64/aarch64.cc (aarch64_gen_test_and_branch): Likewise.
|
|
Make the formatting of the RTL templates in the rules for branch
instructions more consistent with each other.
gcc/ChangeLog:
* config/aarch64/aarch64.md (cbranch<mode>4): Reformat.
(cbranchcc4): Likewise.
(condjump): Likewise.
(*compare_condjump<GPI:mode>): Likewise.
(aarch64_cb<optab><mode>1): Likewise.
(*cb<optab><mode>1): Likewise.
(tbranch_<code><mode>3): Likewise.
(@aarch64_tb<optab><ALLI:mode><GPI:mode>): Likewise.
|
|
The rules for conditional branches were spread throughout `aarch64.md`.
Group them together so it is easier to understand how `cbranch<mode>4`
is lowered to RTL.
gcc/ChangeLog:
* config/aarch64/aarch64.md (condjump): Move.
(*compare_condjump<GPI:mode>): Likewise.
(aarch64_cb<optab><mode>1): Likewise.
(*cb<optab><mode>1): Likewise.
(tbranch_<code><mode>3): Likewise.
(@aarch64_tb<optab><ALLI:mode><GPI:mode>): Likewise.
|
|
commit ecc81e33123d7ac9c11742161e128858d844b99d
Author: Andi Kleen <ak@linux.intel.com>
Date: Fri Sep 26 04:06:40 2014 +0000
Add direct support for Linux kernel __fentry__ patching
emitted a label, 1, for __mcount_loc section:
1: call mcount
.section __mcount_loc, "a",@progbits
.quad 1b
.previous
If __mcount_loc wasn't used, we got an unused label. Update
x86_function_profiler to emit the label only when the __mcount_loc section
is used.
gcc/
PR target/120936
* config/i386/i386.cc (x86_print_call_or_nop): Add a label
argument and use it to print label.
(x86_function_profiler): Emit label only when __mcount_loc
section is used.
gcc/testsuite/
PR target/120936
* gcc.target/i386/pr120936-1.c: New test.
* gcc.target/i386/pr120936-2.c: Likewise.
* gcc.target/i386/pr120936-3.c: Likewise.
* gcc.target/i386/pr120936-4.c: Likewise.
* gcc.target/i386/pr120936-5.c: Likewise.
* gcc.target/i386/pr120936-6.c: Likewise.
* gcc.target/i386/pr120936-7.c: Likewise.
* gcc.target/i386/pr120936-8.c: Likewise.
* gcc.target/i386/pr120936-9.c: Likewise.
* gcc.target/i386/pr120936-10.c: Likewise.
* gcc.target/i386/pr120936-11.c: Likewise.
* gcc.target/i386/pr120936-12.c: Likewise.
* gcc.target/i386/pr93492-3.c: Updated.
* gcc.target/i386/pr93492-5.c: Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
The "else operand" to maskload should always be a const_vector, never a
const_int.
This was just an issue I noticed while looking through the code, I don't
have a testcase which shows a concrete problem due to this.
Testing of that change alone showed ICEs with load lanes vectorization
and SVE. That turned out to be because the backend pattern was missing
a mode for the else operand (causing the middle-end to choose a
const_int during expansion), fixed thusly. That in turn exposed an
issue with the unpredicated load lanes expander which was using the
wrong mode for the else operand, so fixed that too.
gcc/ChangeLog:
* config/aarch64/aarch64-sve.md
(vec_load_lanes<mode><vsingle>): Expand else operand in
subvector mode, as per optab documentation.
(vec_mask_load_lanes<mode><vsingle>): Add missing mode for
operand 3.
* config/aarch64/predicates.md (aarch64_maskload_else_operand):
Remove const_int.
|
|
*tls_global_dynamic_64_largepic, *tls_local_dynamic_64_<mode> and
*tls_local_dynamic_base_64_largepic use RDI as the __tls_get_addr
argument. Add RDI clobber to these patterns to show it.
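For background (not from the patch), a global-dynamic TLS access compiled with -fPIC on x86-64 goes through __tls_get_addr, whose single argument is passed in RDI, which is the register these patterns now clobber explicitly:
extern __thread int counter;
int
bump (void)
{
  return ++counter;  /* address obtained via __tls_get_addr under -fPIC */
}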
gcc/
PR target/120908
* config/i386/i386.cc (legitimize_tls_address): Pass RDI to
gen_tls_local_dynamic_64.
* config/i386/i386.md (*tls_global_dynamic_64_largepic): Add
RDI clobber and use it to generate LEA.
(*tls_local_dynamic_64_<mode>): Likewise.
(*tls_local_dynamic_base_64_largepic): Likewise.
(@tls_local_dynamic_64_<mode>): Add a clobber.
gcc/testsuite/
PR target/120908
* gcc.target/i386/pr120908.c: New test.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
*tls_global_dynamic_64_<mode> uses RDI as the __tls_get_addr argument.
Add RDI clobber to tls_global_dynamic_64 patterns to show it.
PR target/120908
* config/i386/i386.cc (legitimize_tls_address): Pass RDI to
gen_tls_global_dynamic_64.
* config/i386/i386.md (*tls_global_dynamic_64_<mode>): Add RDI
clobber and use it to generate LEA.
(@tls_global_dynamic_64_<mode>): Add a clobber.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
This corrects the shift type of interleaved stepped patterns for const vector
expansion in LRA. The shift instruction was initially LSHIFTRT, and it should
still be the same type for both the LRA and non-LRA cases.
PR target/120356
gcc/ChangeLog:
* config/riscv/riscv-v.cc
(expand_const_vector_interleaved_stepped_npatterns):
Fix ASHIFT to LSHIFTRT insn.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/pr120356.c: New test.
|
|
distinguished
We will use AMX-FP8 to detect DMR since it is a smaller feature and more specific to it.
gcc/ChangeLog:
* config/i386/driver-i386.cc (host_detect_local_cpu): Change
to AMX-FP8 for Diamond Rapids.
|
|
Add an optimization to aarch64 SIMD that converts mvn+shrn into mvni+subhn when
possible, which allows for better optimization when the code is inside a loop
because a constant is used.
The conversion is based on the fact that for an unsigned integer:
-x = ~x + 1 => ~x = -1 - x
thus '(u8)(~x >> imm)' is equivalent to '(u8)(((u16)-1 - x) >> imm)'.
For the following function:
uint8x8_t neg_narrow_v8hi(uint16x8_t a) {
uint16x8_t b = vmvnq_u16(a);
return vshrn_n_u16(b, 8);
}
Without this patch the assembly looks like:
not v0.16b, v0.16b
shrn v0.8b, v0.8h, 8
After the patch it becomes:
mvni v31.4s, 0
subhn v0.8b, v31.8h, v0.8h
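As a quick sanity check of the identity (not part of the patch), a C comparison of the two narrowing computations for uint16_t:
#include <assert.h>
#include <stdint.h>
static void
check (uint16_t x)
{
  uint8_t via_not = (uint8_t) ((uint16_t) ~x >> 8);            /* mvn + shrn path */
  uint8_t via_sub = (uint8_t) ((uint16_t) (0xffff - x) >> 8);  /* mvni + subhn path */
  assert (via_not == via_sub);
}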
Bootstrapped and regtested on aarch64-linux-gnu.
Signed-off-by: Remi Machet <rmachet@nvidia.com>
gcc/ChangeLog:
* config/aarch64/aarch64-simd.md (*shrn_to_subhn_<mode>): Add pattern
converting mvn+shrn into mvni+subhn.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/simd/shrn2subhn.c: New test.
|
|
This patch updates `aarch64-sys-regs.def', bringing it into sync with
the Binutils source after this change:
https://sourceware.org/pipermail/binutils/2025-March/139894.html
gcc/ChangeLog:
* config/aarch64/aarch64-sys-regs.def: Copy from Binutils.
|
|
Fix incorrect SP addresses used in CFA notes for stack probes that are not
relative to the frame's top. This applies to RISC-V target code generation
when stack-clash protection is enabled.
PR target/120714
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_allocate_and_probe_stack_space):
Fix SP-addresses in REG_CFA_DEF_CFA notes for stack-clash case.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/pr120714.c: New test.
|
|
This patch would like to combine vec_duplicate + vssubu.vv into
vssubu.vx, as in the example code below. The related pattern depends
on the cost of vec_duplicate from GR2VR: late-combine takes
action if the cost of GR2VR is zero, and rejects the combination
if the GR2VR cost is greater than zero.
Assume we have the example code below, with a GR2VR cost of 0.
#define DEF_VX_BINARY(T, FUNC) \
void \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = FUNC (in[i], x); \
}
T sat_sub(T a, T b)
{
return (a - b) & (-(T)(a >= b));
}
DEF_VX_BINARY(uint32_t, sat_sub)
Before this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vssubu.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vssubu.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_vec_dup): Add
new case US_MINUS.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op us_minus.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
PR109116 reveals missed optimizations when using unspecs to extract
vector components from opaque-mode variables. Since RTL optimizers do
not understand unspecs, this leads to redundant register copies. Replace
unspecs with subregs, which are well understood by RTL passes, allowing
optimizations to take place.
2025-06-30 Peter Bergner <bergner@linux.ibm.com>
gcc/
PR target/109116
* config/rs6000/mma.md (unspec): Delete UNSPEC_MMA_EXTRACT.
(vsx_disassemble_pair): Expand into a vector register sized subreg.
(mma_disassemble_acc): Likewise.
(*vsx_disassemble_pair): Delete.
(*mma_disassemble_acc): Likewise.
|
|
This commit introduces a primary vector pipeline model for the SiFive 7
series. That pipeline model is a somewhat simplified version; it only
defines the vector command queue, the arithmetic unit, and the vector
load/store unit.
The latency of the real hardware is LMUL-aware, but I realized that would
complicate the model a lot, so I just use a simplified version in which
all LMULs use the same latency; we may improve it later once we find a
meaningful performance difference.
gcc/ChangeLog:
* config/riscv/sifive-7.md: Add primary vector pipeline model
for SiFive 7 series.
|