Age | Commit message (Collapse) | Author | Files | Lines |
|
pipeline model [PR120659]
gcc/ChangeLog:
PR target/120659
* config/riscv/sifive-7.md: Add B extension, fp16 and missing
scalar instruction type for sifive-7 pipeline model.
gcc/testsuite/ChangeLog:
PR target/120659
* gcc.target/riscv/pr120659.c: New test.
|
|
This pattern enables the combine pass (or late-combine, depending on the case)
to merge a vec_duplicate into a (possibly negated) minus-mult RTL instruction.
Before this patch, we have two instructions, e.g.:
vfmv.v.f v6,fa0
vfnmacc.vv v2,v6,v4
After, we get only one:
vfnmacc.vf v2,fa0,v4
PR target/119100
gcc/ChangeLog:
* config/riscv/autovec-opt.md (*vfnmsub_<mode>,*vfnmadd_<mode>): Handle
both add and acc variants.
* config/riscv/vector.md (*pred_mul_neg_<optab><mode>_scalar_undef): New
pattern.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfnmacc and
vfnmsac.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop.h (DEF_VF_MULOP_CASE_1):
Fix return type.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmacc-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmacc-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmacc-run-1-f64.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsac-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsac-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsac-run-1-f64.c: New test.
|
|
This adds support for -mcpu=gb10. This is a big.LITTLE configuration
involving Cortex-X925 and Cortex-A725 cores. The appropriate MIDR numbers
are added to detect them in -mcpu=native. We did not add an
-mcpu=cortex-x925.cortex-a725 option because GB10 does include the crypto
instructions which we want on by default, and the current convention is to not
enable such extensions for Arm Cortex cores in -mcpu where they are optional
in the IP.
Bootstrapped and tested on aarch64-none-linux-gnu.
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
* config/aarch64/aarch64-cores.def (gb10): New entry.
* config/aarch64/aarch64-tune.md: Regenerate.
* doc/invoke.texi (AArch64 Options): Document the above.
|
|
Update functions with no_callee_saved_registers/preserve_none attribute
to preserve frame pointer since caller may use it to save the current
stack:
pushq %rbp
movq %rsp, %rbp
...
call function
...
leave
ret
If callee changes frame pointer without restoring it, caller will fail
to restore its stack after callee returns as LEAVE does
mov %rbp, %rsp
pop %rbp
The corrupted frame pointer will corrupt stack pointer in caller.
There are no regressions on Linux/x86-64. Also tested with
https://github.com/python/cpython
configured with "./configure --with-tail-call-interp".
gcc/
PR target/120840
* config/i386/i386-expand.cc (ix86_expand_call): Don't mark
hard frame pointer as clobber.
* config/i386/i386-options.cc (ix86_set_func_type): Use
TYPE_NO_CALLEE_SAVED_REGISTERS instead of
TYPE_NO_CALLEE_SAVED_REGISTERS_EXCEPT_BP.
* config/i386/i386.cc (ix86_function_ok_for_sibcall): Remove the
TYPE_NO_CALLEE_SAVED_REGISTERS_EXCEPT_BP check.
(ix86_save_reg): Merge TYPE_NO_CALLEE_SAVED_REGISTERS and
TYPE_PRESERVE_NONE with TYPE_NO_CALLEE_SAVED_REGISTERS_EXCEPT_BP.
* config/i386/i386.h (call_saved_registers_type): Remove
TYPE_NO_CALLEE_SAVED_REGISTERS_EXCEPT_BP.
* doc/extend.texi: Update no_callee_saved_registers documentation.
gcc/testsuite/
PR target/120840
* gcc.target/i386/no-callee-saved-1.c: Updated.
* gcc.target/i386/no-callee-saved-2.c: Likewise.
* gcc.target/i386/no-callee-saved-7.c: Likewise.
* gcc.target/i386/no-callee-saved-8.c: Likewise.
* gcc.target/i386/no-callee-saved-9.c: Likewise.
* gcc.target/i386/no-callee-saved-10.c: Likewise.
* gcc.target/i386/no-callee-saved-18.c: Likewise.
* gcc.target/i386/no-callee-saved-19a.c: Likewise.
* gcc.target/i386/no-callee-saved-19c.c: Likewise.
* gcc.target/i386/no-callee-saved-19d.c: Likewise.
* gcc.target/i386/pr119784a.c: Likewise.
* gcc.target/i386/preserve-none-6.c: Likewise.
* gcc.target/i386/preserve-none-7.c: Likewise.
* gcc.target/i386/preserve-none-12.c: Likewise.
* gcc.target/i386/preserve-none-13.c: Likewise.
* gcc.target/i386/preserve-none-14.c: Likewise.
* gcc.target/i386/preserve-none-15.c: Likewise.
* gcc.target/i386/preserve-none-23.c: Likewise.
* gcc.target/i386/pr120840-1a.c: New test.
* gcc.target/i386/pr120840-1b.c: Likewise.
* gcc.target/i386/pr120840-1c.c: Likewise.
* gcc.target/i386/pr120840-1d.c: Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
The current implementation of this function is somewhat difficult to
understand, as it uses a direct break statement within the for loop,
rendering the loop meaningless. Additionally, during the Coverity check
on the for loop, a warning appeared: "unreachable: Since the loop
increment ix++; is unreachable, the loop body will never execute more
than once." Therefore, I have made some simple refactoring to address
these issues.
gcc/ChangeLog:
* config/riscv/riscv-vsetvl.cc (bitmap_union_of_preds_with_entry):
Refactor.
Signed-off-by: Jin Ma <jinma@linux.alibaba.com>
|
|
Pipeline checker utility for RISC-V architecture that validates processor
pipeline models. This tool analyzes machine description files to ensure all
instruction types are properly handled by pipeline scheduling models.
I write this tool since I am implment vector pipeline stuff for SiFive
core, but it's hard to find which instruction type is not handled by
pipeline scheduling models. This tool will help me to find out which
instruction type is not handled by pipeline scheduling models, so I can
fix them.
And I think it may be useful for other RISC-V core developers, so I
decided to upstream that :)
Usage:
```
./pipeline-checker <your-pipeline-model>
```
Example:
```
$ ./pipeline-checker sifive-7.md
Error: Some types are not consumed by the pipemodel
Missing types:
{'vfclass', 'vimovxv', 'vmov', 'rdfrm', 'wrfrm', 'ghost', 'wrvxrm', 'crypto', 'vwsll', 'vfmovfv', 'vimovvx', 'sf_vc', 'vfmovvf', 'sf_vc_se', 'rdvlenb', 'vbrev', 'vrev8', 'sf_vqmacc', 'sf_vfnrclip', 'vsetvl_pre', 'rdvl', 'vsetvl'}
```
gcc/ChangeLog:
* config/riscv/pipeline-checker: New file.
|
|
This fixes an ICE with -mno-lra when split2 tries to split the following
zero_extendsidi2 insn:
(set (reg:DI 24)
(zero_extend:DI (reg:SI **)))
The ICE is because avr_hard_regno_mode_ok allows R24:DI but disallows
R28:SI when Reload is used. R28:SI is a result of zero_extendsidi2.
This ICE only occurs with Reload (which will die before very long),
but it occurs when building libgcc.
gcc/
PR target/120856
* config/avr/avr.cc (avr_hard_regno_mode_ok) [-mno-lra]: Deny
hard regs >= 4 bytes that overlap Y.
|
|
Now that the patches for PR120424 are upstream, the last known bug
associated with avr+lra has been fixed: PR118591. So we can pull the
switch that turns on LRA per default.
This patch only sets -mlra per default. It doesn't do any Reload related
cleanup or removal from the avr backend, hence -mno-lra still works.
The only new problem is that gcc.dg/torture/pr64088.c fails with LRA
but not with Reload. Though that test case is awkward since it is UB
but expects the compiler to behave in a specific way which avr-gcc
doesn't do: PR116780.
This patch also avoids a relative recent ICE that breaks building libgcc:
R24:DI is allowed per hard_regno_mode_ok, but R26:SI is disallowed
for Reload for old reasons. Outcome is that a split2 pattern for
R24:DI = zero_extend:DI (R22:SI) runs into an ICE.
AVR-LibC builds fine with this patch.
The AVR-LibC testsuite passes without errors.
gcc/
PR target/113934
* config/avr/avr.opt (-mlra): Turn on per default.
|
|
Use the inner scalar mode of vector broadcast source in:
(set (reg:V8DF 394)
(vec_duplicate:V8DF (reg:V2DF 190 [ alpha ])))
to compute the vector mode for broadcast from vector source.
gcc/
PR target/120830
* config/i386/i386-features.cc (ix86_get_vector_cse_mode): Handle
vector broadcast source.
gcc/testsuite/
PR target/120830
* g++.target/i386/pr120830.C: New test.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
The 64-bit register-to-register moves on PRU are implemented with two
instructions moving 32-bit registers. Defining a split for the 64-bit
moves allows this to be described in RTL, and thus one of the 32-bit
moves to be eliminated if the destination register is dead.
Also, split the loading of non-trivial 64-bit integer constants. The
resulting 32-bit integer constants have better chance to be loaded with
something more optimal than an "ldi32".
For now do the splits only after register allocation, because LRA does
not yet efficiently handle subregs. See
https://gcc.gnu.org/pipermail/gcc-patches/2024-May/651366.html
This patch shows slight improvement for wikisort benchmark from
embench-iot:
Benchmark size-before size-after difference
--------- ----------- ---------- ----------
aha-mont64 1,648 1,648 0
crc32 104 104 0
depthconv 1,172 1,172 0
edn 3,040 3,040 0
huffbench 1,616 1,616 0
matmult-int 748 748 0
md5sum 700 700 0
nettle-aes 2,664 2,664 0
nettle-sha256 5,732 5,732 0
nsichneu 21,372 21,372 0
picojpeg 9,716 9,716 0
qrduino 8,556 8,556 0
sglib-combined 3,724 3,724 0
slre 3,488 3,488 0
statemate 1,132 1,132 0
tarfind 652 652 0
ud 1,004 1,004 0
wikisort 18,120 18,092 -28
xgboost 300 300 0
gcc/ChangeLog:
* config/pru/pru.md (reg move splitter): New splitter for 64-bit
register moves into two 32-bit moves.
(const_int move splitter): New splitter for 64-bit constant
integer moves into two 32-bit moves.
gcc/testsuite/ChangeLog:
* gcc.target/pru/mov64-subreg-1.c: New test.
* gcc.target/pru/mov64-subreg-2.c: New test.
Signed-off-by: Dimitar Dimitrov <dimitar@dinux.eu>
|
|
This is a followup to 92e1893e0 "RISC-V: Add patterns for vector-scalar
multiply-(subtract-)accumulate" that caused an ICE in some cases where the mult
operands were wrongly swapped.
This patch ensures that operands are not swapped in the vector-scalar case.
PR target/120828
gcc/ChangeLog:
* config/riscv/riscv-v.cc (prepare_ternary_operands): Handle the
vector-scalar case.
|
|
Introduce crc_rev<mode>si4 expanders to generate CRC32 instruction when using
__builtin_rev_crc32_data* builtins with 0x1EDC6F41 poylnomial and -mcrc32.
PR target/120719
gcc/ChangeLog:
* config/i386/i386.md (crc_rev<SWI124:mode>si4): New expander.
gcc/testsuite/ChangeLog:
* gcc.target/i386/crc-builtin-crc32.c: New test.
|
|
Apparently I forgot to squash this fix into the previous commit before I
push...
gcc/ChangeLog:
* config/riscv/riscv.md: Fix build issue.
|
|
This patch adds a comment to the riscv.md file to clarify the purpose of
the file and reorders the include files for better organization.
gcc/ChangeLog:
* config/riscv/riscv.md: Add comment and reorder include
files.
|
|
Since float vector constant
(const_vector:V4SF [(const_double:SF -QNaN [-QNaN]) repeated x4])
is an all 1s float vector constant, update the remove_redundant_vector
pass to replace
(insn 20 18 21 2 (set (reg:V4SF 124)
(const_vector:V4SF [
(const_double:SF -QNaN [-QNaN]) repeated x4
])) "x.cc":26:5 2426 {movv4sf_internal}
(nil))
with
(insn 49 2 5 2 (set (reg:V16QI 135)
(const_vector:V16QI [
(const_int -1 [0xffffffffffffffff]) repeated x16
])) -1
(nil))
...
(insn 20 18 21 2 (set (reg:V4SF 124)
(subreg:V4SF (reg:V16QI 135) 0)) "x.cc":26:5 2426 {movv4sf_internal}
(nil))
gcc/
PR target/120819
* config/i386/i386-features.cc (ix86_broadcast_inner): Also handle
all 1s float vector constant.
gcc/testsuite/
PR target/120819
* g++.target/i386/pr120819.C: New test.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
For tcpsock_test.go in libgo tests,
commit aba3b9d3a48a0703fd565f7c5f0caf604f59970b
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Fri May 9 07:17:07 2025 +0800
x86: Extend the remove_redundant_vector pass
added an instruction:
(insn 501 101 102 21 (set (reg:V2DI 234)
(vec_duplicate:V2DI (reg:DI 111 [ _46 ]))) "tcpsock_test.go":691:12 discrim 1 -1
(nil))
after
(insn 101 100 501 21 (set (reg:DI 111 [ _46 ])
(mem:DI (reg/f:DI 110 [ _45 ]) [5 *_45+0 S8 A64])) "tcpsock_test.go":691:12 discrim 1 99 {*movdi_internal}
(expr_list:REG_DEAD (reg/f:DI 110 [ _45 ])
(expr_list:REG_EH_REGION (const_int 1 [0x1])
(nil))))
which resulted in
(insn 101 100 501 21 (set (reg:DI 111 [ _46 ])
(mem:DI (reg/f:DI 110 [ _45 ]) [5 *_45+0 S8 A64])) "tcpsock_test.go":691:12 discrim 1 99 {*movdi_internal}
(expr_list:REG_DEAD (reg/f:DI 110 [ _45 ])
(expr_list:REG_EH_REGION (const_int 1 [0x1])
(nil))))
(insn 501 101 102 21 (set (reg:V2DI 234)
(vec_duplicate:V2DI (reg:DI 111 [ _46 ]))) "tcpsock_test.go":691:12 discrim 1 -1
(nil))
and caused:
tcpsock_test.go: In function 'net.TestTCPBig..func2':
tcpsock_test.go:684:28: error: in basic block 21:
684 | go func() {
| ^
tcpsock_test.go:684:28: error: flow control insn inside a basic block
(insn 101 100 501 21 (set (reg:DI 111 [ _46 ])
(mem:DI (reg/f:DI 110 [ _45 ]) [5 *_45+0 S8 A64])) "tcpsock_test.go":691:12 discrim 1 99 {*movdi_internal}
(expr_list:REG_DEAD (reg/f:DI 110 [ _45 ])
(expr_list:REG_EH_REGION (const_int 1 [0x1])
(nil))))
during RTL pass: rrvl
tcpsock_test.go:684:28: internal compiler error: in rtl_verify_bb_insns, at cfgrtl.cc:2834
Copy the REG_EH_REGION note to the newly added instruction and split the
block after the previous instruction.
PR target/120816
* config/i386/i386-features.cc (remove_redundant_vector_load):
Handle REG_EH_REGION note in DEF_INSN.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Add preserve_none attribute which is similar to no_callee_saved_registers
attribute, except on x86-64, r12, r13, r14, r15, rdi and rsi registers are
used for integer parameter passing. This can be used in an interpreter
to avoid saving/restoring the registers in functions which process byte
codes. It improved the pystones benchmark by 6-7%:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119628#c15
Remove -mgeneral-regs-only restriction on no_caller_saved_registers
attribute. Only SSE is allowed since SSE XMM register load preserves
the upper bits in YMM/ZMM register while YMM register load zeros the
upper 256 bits of ZMM register, and preserving 32 ZMM registers can
be quite expensive.
gcc/
PR target/119628
* config/i386/i386-expand.cc (ix86_expand_call): Call
ix86_type_no_callee_saved_registers_p instead of looking up
no_callee_saved_registers attribute.
* config/i386/i386-options.cc (ix86_set_func_type): Look up
preserve_none attribute. Check preserve_none attribute for
interrupt attribute. Don't check no_caller_saved_registers nor
no_callee_saved_registers conflicts here.
(ix86_set_func_type): Check no_callee_saved_registers before
checking no_caller_saved_registers attribute.
(ix86_set_current_function): Allow SSE with
no_caller_saved_registers attribute.
(ix86_handle_call_saved_registers_attribute): Check preserve_none,
no_callee_saved_registers and no_caller_saved_registers conflicts.
(ix86_gnu_attributes): Add preserve_none attribute.
* config/i386/i386-protos.h (ix86_type_no_callee_saved_registers_p):
New.
* config/i386/i386.cc
(x86_64_preserve_none_int_parameter_registers): New.
(ix86_using_red_zone): Don't use red-zone when there are no
caller-saved registers with SSE.
(ix86_type_no_callee_saved_registers_p): New.
(ix86_function_ok_for_sibcall): Also check TYPE_PRESERVE_NONE
and call ix86_type_no_callee_saved_registers_p instead of looking
up no_callee_saved_registers attribute.
(ix86_comp_type_attributes): Call
ix86_type_no_callee_saved_registers_p instead of looking up
no_callee_saved_registers attribute. Return 0 if preserve_none
attribute doesn't match in 64-bit mode.
(ix86_function_arg_regno_p): For cfun with TYPE_PRESERVE_NONE,
use x86_64_preserve_none_int_parameter_registers.
(init_cumulative_args): Set preserve_none_abi.
(function_arg_64): Use x86_64_preserve_none_int_parameter_registers
with preserve_none attribute.
(setup_incoming_varargs_64): Use
x86_64_preserve_none_int_parameter_registers with preserve_none
attribute.
(ix86_save_reg): Treat TYPE_PRESERVE_NONE like
TYPE_NO_CALLEE_SAVED_REGISTERS.
(ix86_nsaved_sseregs): Allow saving XMM registers for
no_caller_saved_registers attribute.
(ix86_compute_frame_layout): Likewise.
(x86_this_parameter): Use
x86_64_preserve_none_int_parameter_registers with preserve_none
attribute.
* config/i386/i386.h (ix86_args): Add preserve_none_abi.
(call_saved_registers_type): Add TYPE_PRESERVE_NONE.
(machine_function): Change call_saved_registers to 3 bits.
* doc/extend.texi: Add preserve_none attribute. Update
no_caller_saved_registers attribute to remove -mgeneral-regs-only
restriction.
gcc/testsuite/
PR target/119628
* gcc.target/i386/no-callee-saved-3.c: Adjust error location.
* gcc.target/i386/no-callee-saved-19a.c: New test.
* gcc.target/i386/no-callee-saved-19b.c: Likewise.
* gcc.target/i386/no-callee-saved-19c.c: Likewise.
* gcc.target/i386/no-callee-saved-19d.c: Likewise.
* gcc.target/i386/no-callee-saved-19e.c: Likewise.
* gcc.target/i386/preserve-none-1.c: Likewise.
* gcc.target/i386/preserve-none-2.c: Likewise.
* gcc.target/i386/preserve-none-3.c: Likewise.
* gcc.target/i386/preserve-none-4.c: Likewise.
* gcc.target/i386/preserve-none-5.c: Likewise.
* gcc.target/i386/preserve-none-6.c: Likewise.
* gcc.target/i386/preserve-none-7.c: Likewise.
* gcc.target/i386/preserve-none-8.c: Likewise.
* gcc.target/i386/preserve-none-9.c: Likewise.
* gcc.target/i386/preserve-none-10.c: Likewise.
* gcc.target/i386/preserve-none-11.c: Likewise.
* gcc.target/i386/preserve-none-12.c: Likewise.
* gcc.target/i386/preserve-none-13.c: Likewise.
* gcc.target/i386/preserve-none-14.c: Likewise.
* gcc.target/i386/preserve-none-15.c: Likewise.
* gcc.target/i386/preserve-none-16.c: Likewise.
* gcc.target/i386/preserve-none-17.c: Likewise.
* gcc.target/i386/preserve-none-18.c: Likewise.
* gcc.target/i386/preserve-none-19.c: Likewise.
* gcc.target/i386/preserve-none-20.c: Likewise.
* gcc.target/i386/preserve-none-21.c: Likewise.
* gcc.target/i386/preserve-none-22.c: Likewise.
* gcc.target/i386/preserve-none-23.c: Likewise.
* gcc.target/i386/preserve-none-24.c: Likewise.
* gcc.target/i386/preserve-none-25.c: Likewise.
* gcc.target/i386/preserve-none-26.c: Likewise.
* gcc.target/i386/preserve-none-27.c: Likewise.
* gcc.target/i386/preserve-none-28.c: Likewise.
* gcc.target/i386/preserve-none-29.c: Likewise.
* gcc.target/i386/preserve-none-30a.c: Likewise.
* gcc.target/i386/preserve-none-30b.c: Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Add debug dump for the remove_redundant_vector pass with the following
output:
Replace:
(insn 7 4 8 2 (set (reg:V2DI 103)
(const_vector:V2DI [
(const_int 0 [0]) repeated x2
])) "x.c":8:13 2406 {movv2di_internal}
(nil))
with:
(insn 7 4 8 2 (set (reg:V2DI 103)
(subreg:V2DI (reg:V32QI 109) 0)) "x.c":8:13 2406 {movv2di_internal}
(nil))
...
Replace:
(insn 16 15 17 3 (set (reg:V4DI 105)
(const_vector:V4DI [
(const_int 0 [0]) repeated x4
])) "x.c":13:28 2405 {movv4di_internal}
(nil))
with:
(insn 16 15 17 3 (set (reg:V4DI 105)
(subreg:V4DI (reg:V32QI 109) 0)) "x.c":13:28 2405 {movv4di_internal}
(nil))
...
Place:
(insn 25 5 23 2 (set (reg:V32QI 109)
(const_vector:V32QI [
(const_int 0 [0]) repeated x32
])) -1
(nil))
after:
(insn 23 25 24 2 (set (reg/f:DI 107 [ mem1 ])
(reg:DI 5 di [ mem1 ])) "x.c":5:1 95 {*movdi_internal}
(expr_list:REG_DEAD (reg:DI 5 di [ mem1 ])
(nil)))
in the *.309r.rrvl debug dump.
* config/i386/i386-features.cc (ix86_place_single_vector_set):
Add debug dump.
(replace_vector_const): Likewise.
(remove_redundant_vector_load): Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
This patch handles both signed and unsigned builtin multiplication
overflow.
Uses the "mpy.f" instruction to set the condition codes based on the
result. In the event of an overflow, the V flag is set, triggering a
conditional move depending on the V flag status.
For example, set "1" to "r0" in case of overflow:
mov_s r0,1
mpy.f r0,r0,r1
j_s.d [blink]
mov.nv r0,0
gcc/ChangeLog:
* config/arc/arc.md (<su_optab>mulvsi4): New define_expand.
(<su_optab>mulsi3_Vcmp): New define_insn.
Signed-off-by: Luis Silva <luiss@synopsys.com>
|
|
This patch introduces two new instruction patterns:
`*mulsi3_cmp0`: This pattern performs a multiplication and sets
the CC_Z register based on the result, while also storing the
result of the multiplication in a general-purpose register.
`*mulsi3_cmp0_noout`: This pattern performs a multiplication and
sets the CC_Z register based on the result without storing the
result in a general-purpose register.
These patterns are optimized to generate code using the `mpy.f`
instruction, specifically used where the result is compared to zero.
In addition, the previous commutative multiplication implementation
was removed. It incorrectly took into account the negative flag,
which is wrong. This new implementation only considers the zero flag.
A test case has been added to verify the correctness of these changes.
gcc/ChangeLog:
* config/arc/arc.cc (arc_select_cc_mode): Handle multiplication
results compared against zero, selecting CC_Zmode.
* config/arc/arc.md (*mulsi3_cmp0): New define_insn.
(*mulsi3_cmp0_noout): New define_insn.
gcc/testsuite/ChangeLog:
* gcc.target/arc/mult-cmp0.c: New test.
Signed-off-by: Luis Silva <luiss@synopsys.com>
|
|
This patch covers signed and unsigned subtractions. The generated code
would be something along these lines:
signed:
sub.f r0, r1, r2
b.v @label
unsigned:
sub.f r0, r1, r2
b.c @label
gcc/
* config/arc/arc.md (subsi3_v, subvsi4, subsi3_c): New patterns.
gcc/testsuite/
* gcc.target/arc/overflow-2.c: New file.
|
|
This patch covers signed and unsigned additions. The generated code
would be something along these lines:
signed:
add.f r0, r1, r2
b.v @label
unsigned:
add.f r0, r1, r2
b.c @label
gcc/
* config/arc/arc-modes.def (CC_V): New mode.
* config/arc/arc-protos.h (arc_gen_unlikely_cbranch): New
function declaration.
* config/arc/arc.cc (arc_gen_unlikely_cbranch): New
function.
(get_arc_condition_code): Handle new mode.
* config/arc/arc.md (addvsi3_v, addvsi4, addsi3_c, uaddvsi4): New
patterns.
* config/arc/predicates.md (proper_comparison_operator): Handel
the new V_mode.
(equality_comparison_operator): Likewise.
gcc/testsuite/
* gcc.target/arc/overflow-1.c: New file
|
|
-mtune=intel is used to generate a single binary to run well on both big
core and small core, similar to hybrid CPUs. Update -mtune=intel to tune
for Diamond Rapids and Clearwater Forest, instead of Silvermont.
PR target/120815
* common/config/i386/i386-common.cc (processor_alias_table):
Replace CPU_SLM/PTA_NEHALEM with CPU_HASWELL/PTA_HASWELL for
PROCESSOR_INTEL.
* config/i386/i386-options.cc (processor_cost_table): Replace
intel_cost with alderlake_cost.
* config/i386/x86-tune-costs.h (intel_cost): Removed.
* config/i386/x86-tune-sched.cc (ix86_issue_rate): Treat
PROCESSOR_INTEL like PROCESSOR_ALDERLAKE.
(ix86_adjust_cost): Likewise.
* doc/invoke.texi: Update -mtune=intel for Diamond Rapids and
Clearwater Forest.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
CLDEMOTE is not enabled on clients according to SDM. SDM only mentioned
it will be enabled on Xeon and Atom servers, not clients. Remove them
since Alder Lake (where it is introduced).
gcc/ChangeLog:
* config/i386/i386.h (PTA_ALDERLAKE): Use PTA_GOLDMONT_PLUS
as base to remove PTA_CLDEMOTE.
(PTA_SIERRAFOREST): Add PTA_CLDEMOTE since PTA_ALDERLAKE
does not include that anymore.
* doc/invoke.texi: Update texi file.
|
|
gfx942 still uses glc for scalar access ('s_...') and only uses
sc0/nt/sc1 for vector access.
gcc/ChangeLog:
* config/gcn/gcn-opts.h (TARGET_GLC_NAME): Fix and extend the
description in the comment.
* config/gcn/gcn.cc (print_operand): Extend the comment about
'G' and 'g'.
* config/gcn/gcn.md: Use 'glc' instead of %G where appropriate.
|
|
This pattern enables the combine pass (or late-combine, depending on the case)
to merge a vec_duplicate into a plus-mult or minus-mult RTL instruction.
Before this patch, we have two instructions, e.g.:
vfmv.v.f v6,fa0
vfmacc.vv v2,v6,v4
After, we get only one:
vfmacc.vf v2,fa0,v4
PR target/119100
gcc/ChangeLog:
* config/riscv/autovec-opt.md (*<optab>_vf_<mode>): Handle both add and
acc FMA variants.
* config/riscv/vector.md (*pred_mul_<optab><mode>_scalar_undef): New.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfmacc and vfmsac.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop.h: Add support for acc
variants.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_mulop_run.h: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmadd-run-1-f16.c: Define
TEST_OUT.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmadd-run-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmadd-run-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsub-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsub-run-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsub-run-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmadd-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmadd-run-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmadd-run-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsub-run-1-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsub-run-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfnmsub-run-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmacc-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmacc-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmacc-run-1-f64.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsac-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsac-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmsac-run-1-f64.c: New test.
|
|
ADD/SUB is faster than LEA for most processors. Also, there are
several peephole2 patterns available that convert prologue esp
subtractions to pushes (at the end of i386.md). These process only
patterns with flags reg clobber, so they are ineffective
with clobber-less stack ptr adjustments, introduced by r16-1551
("x86: Enable separate shrink wrapping").
Introduce a peephole2 pattern that adds a clobber to a clobber-less
stack ptr adjustments when FLAGS_REG is dead.
gcc/ChangeLog:
* config/i386/i386.md
(@pro_epilogue_adjust_stack_add_nocc<mode>): Add type attribute.
(pro_epilogue_adjust_stack_add_nocc peephole2 pattern):
Convert pro_epilogue_adjust_stack_add_nocc variant to
pro_epilogue_adjust_stack_add when FLAGS_REG is dead.
|
|
Also provide the vec_extract patterns for floats on pre-z13 machines
to prevent ICEing in those cases.
gcc/ChangeLog:
* config/s390/vector.md (VF): Don't restrict modes.
(VEC_SET_SINGLEFLOAT): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/s390/vector/vec-extract-1.c: Fix test on arch11.
* gcc.target/s390/vector/vec-set-1.c: Run test on arch11.
* gcc.target/s390/vector/vec-extract-2.c: New test.
Signed-off-by: Juergen Christ <jchrist@linux.ibm.com>
|
|
As requested in my patch for -mmax-vectorization this promotes the parameter
--param aarch64-autovec-preference to a first class top target flag.
If both the parameter and the flag is specified the parameter takes precedence
with the reasoning that it may already be embedded in build systems.
gcc/ChangeLog:
* config/aarch64/aarch64.cc (aarch64_override_options_internal): Set
value of parameter based on option.
* config/aarch64/aarch64.opt (autovec-preference): New.
* doc/invoke.texi (autovec-preference): Document it.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/autovec_param_asimd-only_2.c: New test.
* gcc.target/aarch64/autovec_param_default_2.c: New test.
* gcc.target/aarch64/autovec_param_prefer-asimd_2.c: New test.
* gcc.target/aarch64/autovec_param_prefer-sve_2.c: New test.
* gcc.target/aarch64/autovec_param_sve-only_2.c: New test.
|
|
With the middle-end providing a way to make vectorization more profitable by
scaling vect-scalar-cost-multiplier this makes a more user friendly option
to make it easier to use.
I propose making it an actual -m option that we document and retain vs using
the parameter name. In the future I would like to extend this option to modify
additional costing in the AArch64 backend itself.
This can be used together with --param aarch64-autovec-preference to get the
vectorizer to say, always vectorize with SVE. I did consider making this an
additional enum to --param aarch64-autovec-preference but I also think this is
a useful thing to be able to set with pragmas and attributes, but am open to
suggestions.
Note that as a follow up I plan on extending -fdump-tree-vect to support -stats
which is then intended to be usable with this flag.
gcc/ChangeLog:
* config/aarch64/aarch64.opt (max-vectorization): New.
* config/aarch64/aarch64.cc (aarch64_override_options_internal): Save
and restore option.
Implement it through vect-scalar-cost-multiplier.
(aarch64_attributes): Default to off.
* common/config/aarch64/aarch64-common.cc (aarch64_handle_option):
Initialize option.
* doc/extend.texi (max-vectorization): Document attribute.
* doc/invoke.texi (max-vectorization): Document flag.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/sve/cost_model_17.c: New test.
* gcc.target/aarch64/sve/cost_model_18.c: New test.
|
|
Extend the remove_redundant_vector pass to handle vector broadcasts from
constant and variable scalars. When broadcasting from constants and
function arguments, we can place a single widest vector broadcast at
entry of the nearest common dominator for basic blocks with all uses
since constants and function arguments aren't changed. For broadcast
from variables with a single definition, the single definition is
replaced with the widest broadcast.
gcc/
PR target/92080
* config/i386/i386-expand.cc (ix86_expand_call): Set
recursive_function to true for recursive call.
* config/i386/i386-features.cc (ix86_place_single_vector_set):
Add an argument for inner scalar, default to nullptr. Set the
source from inner scalar if not nullptr.
(ix86_get_vector_load_mode): Renamed to ...
(ix86_get_vector_cse_mode): This. Add an argument for scalar mode
and handle integer and float scalar modes.
(replace_vector_const): Add an argument for scalar mode and pass
it to ix86_get_vector_load_mode.
(x86_cse_kind): New.
(redundant_load): Likewise.
(ix86_broadcast_inner): Likewise.
(remove_redundant_vector_load): Also support const0_rtx and
constm1_rtx broadcasts. Handle vector broadcasts from constant
and variable scalars.
* config/i386/i386.h (machine_function): Add recursive_function.
gcc/testsuite/
* gcc.target/i386/keylocker-aesdecwide128kl.c: Updated to expect
movdqa instead pxor.
* gcc.target/i386/keylocker-aesdecwide256kl.c: Likewise.
* gcc.target/i386/keylocker-aesencwide128kl.c: Likewise.
* gcc.target/i386/keylocker-aesencwide256kl.c: Likewise.
* gcc.target/i386/pr92080-4.c: New test.
* gcc.target/i386/pr92080-5.c: Likewise.
* gcc.target/i386/pr92080-6.c: Likewise.
* gcc.target/i386/pr92080-7.c: Likewise.
* gcc.target/i386/pr92080-8.c: Likewise.
* gcc.target/i386/pr92080-9.c: Likewise.
* gcc.target/i386/pr92080-10.c: Likewise.
* gcc.target/i386/pr92080-11.c: Likewise.
* gcc.target/i386/pr92080-12.c: Likewise.
* gcc.target/i386/pr92080-13.c: Likewise.
* gcc.target/i386/pr92080-14.c: Likewise.
* gcc.target/i386/pr92080-15.c: Likewise.
* gcc.target/i386/pr92080-16.c: Likewise.
* gcc.target/i386/pr92080-17.c: Likewise.
* gcc.target/i386/pr92080-18.c: Likewise.
* gcc.target/i386/pr92080-19.c: Likewise.
* gcc.target/i386/pr92080-20.c: Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Update memcpy and memset inline strategies for -mtune=generic:
1. Don't align memory.
2. For known sizes, prefer vector loop, unroll loop with 4 moves or
stores per iteration without aligning the loop, up to 256 bytes.
3. For unknown sizes, use memcpy/memset.
4. Since each loop iteration has 4 stores and 8 stores for zeroing with
unroll loop may be needed, change CLEAR_RATIO to 10 so that zeroing
up to 72 bytes are fully unrolled with 9 stores without SSE.
gcc/
PR target/70308
PR target/101366
PR target/102294
PR target/108585
PR target/118276
PR target/119596
PR target/119703
PR target/119704
* config/i386/x86-tune-costs.h (generic_memcpy): Updated.
(generic_memset): Likewise.
(generic_cost): Change CLEAR_RATIO to 10.
gcc/testsuite/
PR target/70308
PR target/101366
PR target/102294
PR target/108585
PR target/118276
PR target/119596
PR target/119703
PR target/119704
* g++.target/i386/memset-pr101366-1.C: New test.
* g++.target/i386/memset-pr101366-2.C: Likewise.
* g++.target/i386/memset-pr108585-1a.C: Likewise.
* g++.target/i386/memset-pr108585-1b.C: Likewise.
* g++.target/i386/memset-pr118276-1a.C: Likewise.
* g++.target/i386/memset-pr118276-1b.C: Likewise.
* g++.target/i386/memset-pr118276-1c.C: Likewise.
* gcc.target/i386/memcpy-strategy-12.c: Likewise.
* gcc.target/i386/memcpy-strategy-13.c: Likewise.
* gcc.target/i386/memset-pr70308-1a.c: Likewise.
* gcc.target/i386/memset-pr70308-1b.c: Likewise.
* gcc.target/i386/memset-strategy-25.c: Likewise.
* gcc.target/i386/memset-strategy-26.c: Likewise.
* gcc.target/i386/memset-strategy-27.c: Likewise.
* gcc.target/i386/memset-strategy-28.c: Likewise.
* gcc.target/i386/memset-strategy-29.c: Likewise.
* gcc.target/i386/memset-strategy-30.c: Likewise.
* gcc.target/i386/memset-strategy-31.c: Likewise.
* gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
* gcc.target/i386/auto-init-padding-9.c: Likewise.
* gcc.target/i386/mvc17.c: Fail with "rep mov"
* gcc.target/i386/pr111657-1.c: Scan for unrolled loop. Fail
with "rep mov".
* gcc.target/i386/shrink_wrap_1.c: Also pass
-mmemset-strategy=rep_8byte:-1:align.
* gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
gcc/ChangeLog:
PR target/120741
* config/i386/i386.cc (ix86_expand_prologue):
Remove 1 assertion.
gcc/testsuite/ChangeLog:
PR target/120741
* gcc.target/i386/pr120741.c: New test.
* gcc.target/i386/shrink-wrap-separate-mingw.c: Likewise.
|
|
Fix typo in comment spotted by Peter B.
PR target/118241
gcc/
* config/riscv/predicates.md: Fix comment typo in recent change.
|
|
This patch would like to combine the vec_duplicate + vsaddu.vv to the
vsaddu.vx. From example as below code. The related pattern will depend
on the cost of vec_duplicate from GR2VR. Then the late-combine will
take action if the cost of GR2VR is zero, and reject the combination
if the GR2VR cost is greater than zero.
Assume we have example code like below, GR2VR cost is 0.
#define DEF_VX_BINARY(T, FUNC) \
void \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = FUNC (in[i], x); \
}
T sat_add(T a, T b)
{
return (a + b) | (-(T)((T)(a + b) < a));
}
DEF_VX_BINARY(uint32_t, sat_add)
Before this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vsaddu.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vsaddu.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add
new case US_PLUS.
(expand_vx_binary_vec_vec_dup): Ditto.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op us_plus.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
Don't use vmovdqu16/vmovdqu8 with non-EVEX register operands just because
AVX512BW is available.
gcc/
PR target/120728
* config/i386/i386.cc (ix86_get_ssemov): Use vmovdqu16/vmovdqu8
only with EVEX register operands.
gcc/testsuite/
PR target/120728
* gcc.target/i386/avx512bw-vmovdqu16-1.c: Scan vmovdqu for
non-EVEX register operands.
* gcc.target/i386/avx512bw-vmovdqu8-1.c: Likewise.
* gcc.target/i386/avx512fp16-13.c: Likewise.
* gcc.target/i386/pr100865-10b.c: Likewise.
* gcc.target/i386/pr100865-3.c: Likewise.
* gcc.target/i386/pr100865-4b.c: Likewise.
* gcc.target/i386/pr100865-5b.c: Likewise.
* gcc.target/i386/pr90773-15.c: Likewise.
* gcc.target/i386/pr90773-16.c: Likewise.
* gcc.target/i386/pr90773-17.c: Likewise.
* gcc.target/i386/pr95483-5.c: Likewise.
* gcc.target/i386/pr120728.c: New test.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Add a PROCESSOR_XXX comment to each entry in processor_cost_table to
describe which processor the cost enry is applied to.
* config/i386/i386-options.cc (processor_cost_table): Add a
PROCESSOR_XXX comment to each entry.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
So this is Andrew's patch from the PR. We weren't clean for a 32bit host in
some of the arithmetic for constant synthesis.
I confirmed the bug on a 32bit linux host, then confirmed that Andrew's patch
from the PR fixes the problem, then ran Andrew's patch through my tester
successfully.
Naturally I'll wait for pre-commit testing, but I'm not expecting problems.
PR target/119830
gcc/
* config/riscv/riscv.cc (riscv_build_integer_1): Make arithmetic in bclr case
clean for 32 bit hosts.
gcc/testsuite/
* gcc.target/riscv/pr119830.c: New test.
|
|
This patch implements bitfield insertion MD pattern using the DEPBITS
machine instruction, the counterpart of the EXTUI instruction, if
available.
/* example */
struct foo {
unsigned int b:10;
unsigned int r:11;
unsigned int g:11;
};
void test(struct foo *p) {
p->g >>= 1;
}
;; result (endianness: little)
test:
entry sp, 32
l32i.n a8, a2, 0
extui a9, a8, 1, 10
depbits a8, a9, 0, 11
s32i.n a8, a2, 0
retw.n
gcc/ChangeLog:
* config/xtensa/xtensa.h (TARGET_DEPBITS): New macro.
* config/xtensa/xtensa.md (insvsi): New insn pattern.
|
|
This patch implements the target-specific ZERO_CALL_USED_REGS hook, since
if -fzero-call-used-regs=all the default hook tries to assign 0 to B0
(bit 0 of the BR register) and the ICE will be thrown.
gcc/ChangeLog:
* config/xtensa/xtensa.cc (xtensa_zero_call_used_regs):
New prototype and function.
(TARGET_ZERO_CALL_USED_REGS): Define macro.
|
|
The RISC-V prefetch support is broken in a few ways. This addresses the data
side prefetch problems. I'd mistakenly thought this BZ was a prefetch.i
related (which has deeper problems).
The basic problem is we were accepting any valid address when in fact there are
restrictions. This patch more precisely defines the predicate such that we
allow
REG
REG+D
Where D must have the low 5 bits clear. Note that absolute addresses fall into
the REG+D form using the x0 for the register operand since it always has the
value zero. The test verifies REG, REG+D, ABS addressing modes that are valid
as well as REG+D and ABS which must be reloaded into a REG because the
displacement has low bits set.
An earlier version of this patch has gone through testing in my tester on rv32
and rv64. Obviously I'll wait for pre-commit CI to do its thing before moving
forward.
This is a good backport candidate after simmering on the trunk for a bit.
PR target/118241
gcc/
* config/riscv/predicates.md (prefetch_operand): New predicate.
* config/riscv/constraints.md (Q): New constraint.
* config/riscv/riscv.md (prefetch): Use new predicate and constraint.
(riscv_prefetchi_<mode>): Similarly.
gcc/testsuite/
* gcc.target/riscv/pr118241.c: New test.
|
|
The will be one ICE when expand pass, the bt similar as below.
during RTL pass: expand
red.c: In function 'main':
red.c:20:5: internal compiler error: in require, at machmode.h:323
20 | int main() {
| ^~~~
0x2e0b1d6 internal_error(char const*, ...)
../../../gcc/gcc/diagnostic-global-context.cc:517
0xd0d3ed fancy_abort(char const*, int, char const*)
../../../gcc/gcc/diagnostic.cc:1803
0xc3da74 opt_mode<machine_mode>::require() const
../../../gcc/gcc/machmode.h:323
0xc3de2f opt_mode<machine_mode>::require() const
../../../gcc/gcc/poly-int.h:1383
0xc3de2f riscv_vector::expand_select_vl(rtx_def**)
../../../gcc/gcc/config/riscv/riscv-v.cc:4218
0x21c7d22 gen_select_vldi(rtx_def*, rtx_def*, rtx_def*)
../../../gcc/gcc/config/riscv/autovec.md:1344
0x134db6c maybe_expand_insn(insn_code, unsigned int, expand_operand*)
../../../gcc/gcc/optabs.cc:8257
0x134db6c expand_insn(insn_code, unsigned int, expand_operand*)
../../../gcc/gcc/optabs.cc:8288
0x11b21d3 expand_fn_using_insn
../../../gcc/gcc/internal-fn.cc:318
0xef32cf expand_call_stmt
../../../gcc/gcc/cfgexpand.cc:3097
0xef32cf expand_gimple_stmt_1
../../../gcc/gcc/cfgexpand.cc:4264
0xef32cf expand_gimple_stmt
../../../gcc/gcc/cfgexpand.cc:4411
0xef95b6 expand_gimple_basic_block
../../../gcc/gcc/cfgexpand.cc:6472
0xefb66f execute
../../../gcc/gcc/cfgexpand.cc:7223
The select_vl op_1 and op_2 may be the same const_int like (const_int 32).
And then maybe_legitimize_operands will:
1. First mov the const op_1 to a reg.
2. Resue the reg of op_1 for op_2 as the op_1 and op_2 is equal.
That will break the assumption that the op_2 of select_vl is immediate,
or something like CONST_INT_POLY.
The below test suites are passed for this patch series.
* The rv64gcv fully regression test.
PR target/120652
gcc/ChangeLog:
* config/riscv/autovec.md: Add immediate_operand for
select_vl operand 2.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/pr120652-1.c: New test.
* gcc.target/riscv/rvv/autovec/pr120652-2.c: New test.
* gcc.target/riscv/rvv/autovec/pr120652-3.c: New test.
* gcc.target/riscv/rvv/autovec/pr120652.h: New test.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
This patch isn't fully tested yet, but it fixes the build failure, so that
will do for now. SImode was not allowed in VCC_HI because there were issues,
way back before the port went upstream, so it's possible we'll find out what
those issues were again soon.
gcc/ChangeLog:
PR target/120722
* config/gcn/gcn.cc (gcn_hard_regno_mode_ok): Allow SImode in VCC_HI.
|
|
Since MOVE_MAX defines the maximum number of bytes that an instruction
can move quickly between memory and registers, use it to get the widest
vector mode in vector loop when inlining memcpy and memset.
gcc/
PR target/120708
* config/i386/i386-expand.cc (ix86_expand_set_or_cpymem): Use
MOVE_MAX to get the widest vector mode in vector loop.
gcc/testsuite/
PR target/120708
* gcc.target/i386/memcpy-pr120708-1.c: New test.
* gcc.target/i386/memcpy-pr120708-2.c: Likewise.
* gcc.target/i386/memcpy-pr120708-3.c: Likewise.
* gcc.target/i386/memcpy-pr120708-4.c: Likewise.
* gcc.target/i386/memcpy-pr120708-5.c: Likewise.
* gcc.target/i386/memcpy-pr120708-6.c: Likewise.
* gcc.target/i386/memset-pr120708-1.c: Likewise.
* gcc.target/i386/memset-pr120708-2.c: Likewise.
* gcc.target/i386/memcpy-strategy-1.c: Drop dg-skip-if. Replace
-march=atom with -mno-avx -msse2 -mtune=generic
-mtune-ctrl=^sse_typeless_stores.
* gcc.target/i386/memcpy-strategy-2.c: Likewise.
* gcc.target/i386/memcpy-vector_loop-1.c: Likewise.
* gcc.target/i386/memcpy-vector_loop-2.c: Likewise.
* gcc.target/i386/memset-vector_loop-1.c: Likewise.
* gcc.target/i386/memset-vector_loop-2.c: Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
When working on PR120587 I found that the ce1 pass was not able to
properly optimize branches on OpenRISC. This is because of the early
splitting of "compare" and "branch" instructions during the expand pass.
Convert the cbranch* instructions from define_expand to
define_insn_and_split. This dalays the instruction split until after
the ce1 pass is done giving ce1 the best opportunity to perform the
optimizations on the original form of cbranch<mode>4 instructions.
gcc/ChangeLog:
* config/or1k/or1k.cc (or1k_noce_conversion_profitable_p): New
function.
(or1k_is_cmov_insn): New function.
(TARGET_NOCE_CONVERSION_PROFITABLE_P): Define macro.
* config/or1k/or1k.md (cbranchsi4): Convert to insn_and_split.
(cbranch<mode>4): Convert to insn_and_split.
Signed-off-by: Stafford Horne <shorne@gmail.com>
|
|
After commit 2dcc6dbd8a0 ("emit-rtl: Use simplify_subreg_regno to
validate hardware subregs [PR119966]") the OpenRISC port is broken
again.
Add extend* iinstruction patterns for the SR_F pseudo registers to avoid
having to use the subreg conversions which no longer work.
gcc/ChangeLog:
PR target/120587
* config/or1k/or1k.md (zero_extendbisi2_sr_f): New expand.
(extendbisi2_sr_f): New expand.
* config/or1k/predicates.md (sr_f_reg_operand): New predicate.
Signed-off-by: Stafford Horne <shorne@gmail.com>
|
|
This patch would like to combine the vec_duplicate + vminu.vv to the
vminu.vx. From example as below code. The related pattern will depend
on the cost of vec_duplicate from GR2VR. Then the late-combine will
take action if the cost of GR2VR is zero, and reject the combination
if the GR2VR cost is greater than zero.
Assume we have example code like below, GR2VR cost is 0.
#define DEF_VX_BINARY(T, FUNC) \
void \
test_vx_binary (T * restrict out, T * restrict in, T x, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
out[i] = FUNC (in[i], x); \
}
uint32_t min(uint32 a, uint32 b)
{
return a > b ? b : a;
}
DEF_VX_BINARY(uint32_t, min)
Before this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
14 │ slli a3,a3,32
15 │ srli a3,a3,32
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
18 │ vle32.v v1,0(a1)
19 │ slli a4,a5,2
20 │ sub a3,a3,a5
21 │ add a1,a1,a4
22 │ vminu.vv v1,v1,v2
23 │ vse32.v v1,0(a0)
24 │ add a0,a0,a4
25 │ bne a3,zero,.L3
After this patch:
10 │ test_vx_binary_or_int32_t_case_0:
11 │ beq a3,zero,.L8
12 │ slli a3,a3,32
13 │ srli a3,a3,32
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
16 │ vle32.v v1,0(a1)
17 │ slli a4,a5,2
18 │ sub a3,a3,a5
19 │ add a1,a1,a4
20 │ vminu.vx v1,v1,a2
21 │ vse32.v v1,0(a0)
22 │ add a0,a0,a4
23 │ bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_vx_binary_vec_dup_vec): Add
new case UMIN.
(expand_vx_binary_vec_vec_dup): Ditto.
* config/riscv/riscv.cc (riscv_rtx_costs): Ditto.
* config/riscv/vector-iterators.md: Add new op umin.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
commit ef26c151c14a87177d46fd3d725e7f82e040e89f
Author: Roger Sayle <roger@nextmovesoftware.com>
Date: Thu Dec 23 12:33:07 2021 +0000
x86: PR target/103773: Fix wrong-code with -Oz from pop to memory.
added "*mov<mode>_and" and extended "*mov<mode>_or" to transform
"mov $0,mem" to the shorter "and $0,mem" and "mov $-1,mem" to the shorter
"or $-1,mem" for -Oz. But the new pattern:
(define_insn "*mov<mode>_and"
[(set (match_operand:SWI248 0 "memory_operand" "=m")
(match_operand:SWI248 1 "const0_operand"))
(clobber (reg:CC FLAGS_REG))]
"reload_completed"
"and{<imodesuffix>}\t{%1, %0|%0, %1}"
[(set_attr "type" "alu1")
(set_attr "mode" "<MODE>")
(set_attr "length_immediate" "1")])
and the extended pattern:
(define_insn "*mov<mode>_or"
[(set (match_operand:SWI248 0 "nonimmediate_operand" "=rm")
(match_operand:SWI248 1 "constm1_operand"))
(clobber (reg:CC FLAGS_REG))]
"reload_completed"
"or{<imodesuffix>}\t{%1, %0|%0, %1}"
[(set_attr "type" "alu1")
(set_attr "mode" "<MODE>")
(set_attr "length_immediate" "1")])
aren't guarded for -Oz. As a result, "and $0,mem" and "or $-1,mem" are
generated without -Oz.
1. Change *mov<mode>_and" to define_insn_and_split and split it to
"mov $0,mem" if not -Oz.
2. Change "*mov<mode>_or" to define_insn_and_split and split it to
"mov $-1,mem" if not -Oz.
3. Don't transform "mov $-1,reg" to "push $-1; pop reg" for -Oz since it
should be transformed to "or $-1,reg".
gcc/
PR target/120427
* config/i386/i386.md (*mov<mode>_and): Changed to
define_insn_and_split. Split it to "mov $0,mem" if not -Oz.
(*mov<mode>_or): Changed to define_insn_and_split. Split it
to "mov $-1,mem" if not -Oz.
(peephole2): Don't transform "mov $-1,reg" to "push $-1; pop reg"
for -Oz since it will be transformed to "or $-1,reg".
gcc/testsuite/
PR target/120427
* gcc.target/i386/cold-attribute-4.c: Compile with -Oz.
* gcc.target/i386/pr120427-1.c: New test.
* gcc.target/i386/pr120427-2.c: Likewise.
* gcc.target/i386/pr120427-3.c: Likewise.
* gcc.target/i386/pr120427-4.c: Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
According to the discussion in
https://gcc.gnu.org/pipermail/gcc-patches/2025-June/686893.html, by creating
a -mtune=generic may be a good idea to slove the question regarding the branch
cost.
Changes for v2:
- Delete the code about -mcpu=generic.
gcc/ChangeLog:
* config/riscv/riscv-cores.def (RISCV_TUNE): Add "generic" tune.
* config/riscv/riscv.cc: Add generic_tune_info.
* config/riscv/riscv.h (RISCV_TUNE_STRING_DEFAULT): Change default tune.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/zicond-primitiveSemantics_compare_reg_reg_return_reg_reg.c: New test.
|
|
Use riscv_v_ext_mode_p to check the mode size is 2x XLEN, instead of
using "(GET_MODE_UNIT_SIZE (mode) == (UNITS_PER_WORD * 2))".
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_legitimize_move): Use
riscv_2x_xlen_mode_p.
(riscv_binary_cost): Ditto.
(riscv_hard_regno_mode_ok): Ditto.
|