Age | Commit message (Collapse) | Author | Files | Lines |
|
This patch implements native Thread Local Storage access on Windows, as motivated by
PR80881. Currently, Thread Local Storage accesses on Windows relies on emulation, which
is detrimental to performance in certain applications, notably the Python Interpreter
and the gcc port of the Java Virtual Machine. This patch was heavily inspired by Daniel
Green's original work on native Windows Thread Local Storage from over a decade ago, which
can be found at https://github.com/venix1/MinGW-GDC/blob/master/patches/mingw-tls-gcc-4.8.patch
as a reference.
Co-authored-by: Eric Botcazou <botcazou@adacore.com>
Co-authored-by: Uroš Bizjak <ubizjak@gmail.com>
Co-authored-by: Liu Hao <lh_mouse@126.com>
Signed-off-by: Julian Waters <tanksherman27@gmail.com>
Signed-off-by: Jonathan Yong <10walls@gmail.com>
gcc/ChangeLog:
* config/i386/i386.cc (ix86_legitimate_constant_p): Handle new UNSPEC.
(legitimate_pic_operand_p): Handle new UNSPEC.
(legitimate_pic_address_disp_p): Handle new UNSPEC.
(ix86_legitimate_address_p): Handle new UNSPEC.
(ix86_tls_index_symbol): New symbol for _tls_index.
(ix86_tls_index): Handle creation of _tls_index symbol.
(legitimize_tls_address): Create thread local access sequence.
(output_pic_addr_const): Handle new UNSPEC.
(i386_output_dwarf_dtprel): Handle new UNSPEC.
(i386_asm_output_addr_const_extra): Handle new UNSPEC.
* config/i386/i386.h (TARGET_WIN32_TLS): Define.
* config/i386/i386.md: New UNSPEC.
* config/i386/predicates.md: Handle new UNSPEC.
* config/mingw/mingw32.h (TARGET_WIN32_TLS): Define.
(TARGET_ASM_SELECT_SECTION): Define.
(DEFAULT_TLS_SEG_REG): Define.
* config/mingw/winnt.cc (mingw_pe_select_section): Select proper TLS section.
(mingw_pe_unique_section): Handle TLS section.
* config/mingw/winnt.h (mingw_pe_select_section): Declare.
* configure: Regenerate.
* configure.ac: New check for broken linker thread local support
|
|
As is outlined in the PR, we have a few define_insn_and_split patterns which
optimize away explicit masking of shift/bit positions when the masking matches
what the hardware's behavior.
A small number of those define_insn_and_split patterns generate a single
instruction. It's fairly elegant in that we were essentially just rewriting
the RTL to match an existing pattern.
In one case we'd do the rewriting and later turn a 32bit shift into a bset.
That's not safe because the masking of a 32bit shift uses 0x1f while masking on
bset uses 0x3f on rv64. The net was incorrect code as seen in the BZ entry.
The fix is pretty simple. There's no real reason we need to use a
define_insn_and_split. It was just convenient. Instead we can use a simple
define_insn. That avoids a change in the masking behavior for the shift
count/bit position and the masking stays in the RTL.
I quickly scanned the entire port and didn't see any additional
define_insn_and_splits that obviously generated a single instruction outside
the shift/rotate space, though in the vector space that's nontrivial to
ascertain.
This was been run through my tester for the cross configurations, but not the
native bootstrap/regression test (yet).
PR target/119971
gcc/
* config/riscv/bitmanip.md (rotation with masked count): Rewrite
as define_insn patterns. Fix formatting.
* config/riscv/riscv.md (shift with masked count): Similarly.
gcc/testsuite
* gcc.target/riscv/pr119971.c: New test.
* gcc.target/riscv/zbb-rol-ror-03.c: Adjust test slightly.
|
|
Some assemblers do not support MOVS instructions with explicit operands.
Emit instruction with implicit operands, but prefix the instruction with a
segment override prefix if the memory operand refers to ADDR_SPACE_SEG_FS
or ADDR_SPACE_SEG_GS named address space.
PR target/120019
gcc/ChangeLog:
* config/i386/i386.cc (ix86_print_operand): Handle 'v' operand
modifier to emit segment override prefix.
* config/i386/i386.md (*strmovdi_rex_1): Use %v operand modifier
to emit segment override prefix.
(*strmovsi_1): Ditto.
(*strmovhi_1): Ditto.
(*strmovqi_1): Ditto.
(*rep_movdi_rex64): Ditto.
(*rep_movsi): Ditto.
(*rep_movqi): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr111657-1.c (dg-do): Change to "assemble".
(dg-options): Remove -masm=att and add -save-temps.
(dg-final): Update scan-assembler and scan-assembler-not strings.
Co-authored-by: Rainer Orth <ro@CeBiTec.Uni-Bielefeld.DE>
|
|
I've noticed a typo on the flag name, fixed thusly.
2025-05-05 Jakub Jelinek <jakub@redhat.com>
* config/i386/i386.md (truncsfbf2): Fix comment typo,
unsafte -> unsafe.
|
|
Tweak the formatting of the genrvv-type-indexer.cc file to conform to
the style used by clang-format. This is a no-functional-change commit
that only modifies the formatting of the code.
gcc/Changelog:
* config/riscv/genrvv-type-indexer.cc: Apply clang-format to the
file.
|
|
This is a patch from late 2024 (just before stage1 freeze), but I never pushed
hard to the change, and thus never integrated it.
It's mostly unchanged except for updating insn in the hash table after finding
an optimizable case. We were holding the deleted insn in the hash table rather
than the new insn. Just something I noticed recently.
Bootstrapped and regression tested on my BPI and regression tested riscv32-elf
and riscv64-elf configurations. We've used this since November internally, so
it's well exercised on spec as well.
gcc/
* config.gcc (riscv): Add riscv-vect-permcost.o to extra_objs.
* config/riscv/riscv-passes.def (pass_vector_permcost): Add new pass.
* config/riscv/riscv-protos.h (make_pass_vector_permconst): Declare.
* config/riscv/riscv-vect-permconst.cc: New file.
* config/riscv/t-riscv: Add build rule for riscv-vect-permcost.o
|
|
For RV32 inline assembly, when handling 64-bit integer data, it is
often necessary to process the lower and upper 32 bits separately.
Unfortunately, we can only output the current register name
(lower 32 bits) but not the next register name (upper 32 bits).
To address this, the modifier 'H' has been added to allow users
to handle the upper 32 bits of the data. While I believe the
modifier 'N' (representing the next register name) might be more
suitable for this functionality, 'N' is already in use.
Therefore, 'H' (representing the high register) was chosen instead.
Co-Authored-By: Dimitar Dimitrov <dimitar@dinux.eu>
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_print_operand): Add H.
* doc/extend.texi: Document for H.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/modifier-H-error-1.c: New test.
* gcc.target/riscv/modifier-H-error-2.c: New test.
* gcc.target/riscv/modifier-H.c: New test.
|
|
The recent adjustment to more correctly cost register moves tripped a few
testsuite regressions.
I'm pretty torn on the thead test adjustments. But in reality they only worked
because the register move costing was broken. So I've reverted the scan-asm
part of those to a prior state for two of those tests. The other was only
failing at -Og/-Oz which was added to the exclude list.
The other Zfa test is similar, but we can make the test behave with a suitable
-mtune option and thus preserve the test.
While investigating I also noted that vector moves aren't being handled
correctly for subclasses of the integer/fp register files. So I fixed those
while I was in there.
Note this may have an impact on some of your work Pan. I haven't followed the
changes from the last week or so due to illness.
Waiting on pre-commit's verdict, though it did spin through my tester
successfully, though not all of the regressions related to that change are
addressed (there's still one for rv32 I'll look at shortly).
gcc/
* config/riscv/riscv.cc (riscv_register_move_cost): Handle
subclasses with vector registers as well.
gcc/testsuite/
* gcc.target/riscv/xtheadfmemidx-xtheadfmv-medany.c: Adjust expected
output.
* gcc.target/riscv/xtheadfmemidx-zfa-medany.c: Likewise.
* gcc.target/riscv/xtheadfmv-fmv.c: Skip for -Os and -Oz.
* gcc.target/riscv/zfa-fmovh-fmovp.c: Use sifive-p400 tuning.
|
|
After we add the frm register to the global_regs, we may not need to
define_insn that volatile to emit the frm restore insns. The
cooperatively-managed global register will help to handle this, instead
of emit the volatile define_insn explicitly.
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_emit_frm_mode_set): Refactor
the frm mode set by removing fsrmsi_restore_volatile.
* config/riscv/vector-iterators.md (unspecv): Remove as
unnecessary.
* config/riscv/vector.md (fsrmsi_restore_volatile): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-49.c: Adjust
the asm dump check times.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-50.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-52.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-74.c: Ditto.
* gcc.target/riscv/rvv/base/float-point-dynamic-frm-75.c: Ditto.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
ix86_rtx_costs VEC_MERGE by special casing AVX512 mask operations and otherwise
returning cost->sse_op completely ignoring costs of the operands. Since
VEC_MERGE is also used to represent scalar variant of SSE/AVX operation, this
means that many instructions (such as SSE converisions) are often costed as
sse_op instead of their real cost.
This patch adds pattern matching for the VEC_MERGE pattern which also forced me
to add special cases for masked versions and vcmp otherwise combine is confused
by the default cost compred to the cost of recognized version of the
instruction.
Since now the important cases should be handled, I also added recursion to the
remaining cases so substituting constants and memory is adequately costed.
gcc/ChangeLog:
* config/i386/i386.cc (unspec_pcmp_p): New function.
(ix86_rtx_costs): Cost VEC_MERGE more realistically.
|
|
This reverts commit 727a43e0a66052235706379239359807230054e0.
|
|
This patch fixes regression of imagick with PGO and AVX512 where correcting size
cost of SSE operations (to be 4 instead of 2 originally cut&pasted from x87)
made late combine to eliminate zero registers introduced by rapd. The problem
is that cost-model mistakely accounts VEC_SELECT as real instruction while it is
optimized to nothing if src==dest (which is the case of these testcases).
This register is used to eliminate false dependency between source and destination
of int->fp conversions.
While ix86_insn_cost hook already contains logic to incrase cost of the zero-extend
the costs was not enough.
gcc/ChangeLog:
PR target/119900
* config/i386/i386.cc (ix86_can_change_mode_class): Add TODO
comment.
(ix86_rtx_costs): Make VEC_SELECT equivalent to SUBREG cost 1.
|
|
This warning relies on the TRANSLATION_UNIT_WARN_EMPTY_P flag (set in
cxx_init_decl_processing) to decide whether we want to warn about the GCC 8
empty class parameter passing fix, but in a call through a function pointer
we don't have a translation unit and so complain for any -Wabi flag, even
now long after this was likely to be relevant.
In that situation, let's check the TU for current_function_decl instead.
And if we still can't come up with a TU, default to not warning.
PR c++/60336
gcc/ChangeLog:
* config/i386/i386.cc (ix86_warn_parameter_passing_abi):
If no target, check the current TU.
gcc/testsuite/ChangeLog:
* g++.dg/abi/pr60336-8a.C: New test.
|
|
Two targets were converted but retain the default.
* config/arc/arc.cc (TARGET_LRA_P): Remove define.
* config/gcn/gcn.cc (TARGET_LRA_P): Likewise.
|
|
For the test case
int32_t foo (svint32_t x)
{
svbool_t pg = svpfalse ();
return svlastb_s32 (pg, x);
}
compiled with -O3 -mcpu=grace -msve-vector-bits=128, GCC produced:
foo:
pfalse p3.b
lastb w0, p3, z0.s
ret
when it could use a Neon lane extract instead:
foo:
umov w0, v0.s[3]
ret
Similar optimizations can be made for VLS with other vector widths.
We implemented this optimization by guarding the emission of
pfalse+lastb in the pattern vec_extract<mode><Vel> by
!val.is_constant ().
Thus, for last-extract operations with VLS, the patterns
*vec_extract<mode><Vel>_v128, *vec_extract<mode><Vel>_dup, or
*vec_extract<mode><Vel>_ext are used instead.
We added tests for 128-bit VLS and adjusted the tests for the other vector
widths.
The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
OK for mainline?
Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com>
gcc/
* config/aarch64/aarch64-sve.md (vec_extract<mode><Vel>):
Prevent the emission of pfalse+lastb for VLS.
gcc/testsuite/
* gcc.target/aarch64/sve/extract_last_128.c: New test.
* gcc.target/aarch64/sve/extract_1.c: Adjust expected outcome.
* gcc.target/aarch64/sve/extract_2.c: Likewise.
* gcc.target/aarch64/sve/extract_3.c: Likewise.
* gcc.target/aarch64/sve/extract_4.c: Likewise.
|
|
This patch introduces two new inline functions, __sqrt and __sqrtf, in
arm_acle.h for Aarch64 targets. These functions wrap the new builtins
__builtin_aarch64_sqrtdf and __builtin_aarch64_sqrtsf, respectively,
providing direct access to hardware instructions without relying on the
standard math library or optimization levels.
This patch also introduces acle_sqrt.c in the AArch64 testsuite,
verifying that the new __sqrt and __sqrtf intrinsics emit the expected
fsqrt instructions for double and float arguments.
Coverage for new intrinsics ensures that __sqrt and __sqrtf are
correctly expanded to hardware instructions and do not fall back to
library calls, regardless of optimization levels.
gcc/ChangeLog:
* config/aarch64/arm_acle.h (__sqrt, __sqrtf): New function.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/acle/acle_sqrt.c: New test.
Signed-off-by: Ayan Shafqat <ayan.x.shafqat@gmail.com>
|
|
This patch changes the `sqrt` builtin definition from `BUILTIN_VHSDF_DF`
to `BUILTIN_VHSDF_HSDF` in `aarch64-simd-builtins.def`, ensuring the
builtin covers half, single, and double precision variants. The redundant
`VAR1 (UNOP, sqrt, 2, FP, hf)` lines are removed, as they are no longer
needed now that `BUILTIN_VHSDF_HSDF` handles those cases.
gcc/ChangeLog:
* config/aarch64/aarch64-simd-builtins.def: Change
BUILTIN_VHSDF_DF to BUILTIN_VHSDF_HSDF.
Signed-off-by: Ayan Shafqat <ayan.x.shafqat@gmail.com>
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
|
|
libgcc's __xload_1...4 is clobbering Z (and also R21 is some cases),
but avr.md had clobbers of respective GPRs only up to reload.
Outcome was that code reading from the same __memx address twice
could be wrong. This patch adds respective clobbers.
Forward-port from 2025-04-30 r14-11703
PR target/119989
gcc/
* config/avr/avr.md (xload_<mode>_libgcc): Clobber R21, Z.
gcc/testsuite/
* gcc.target/avr/torture/pr119989.h: New file.
* gcc.target/avr/torture/pr119989-memx-1.c: New test.
* gcc.target/avr/torture/pr119989-memx-2.c: New test.
* gcc.target/avr/torture/pr119989-memx-3.c: New test.
* gcc.target/avr/torture/pr119989-memx-4.c: New test.
* gcc.target/avr/torture/pr119989-flashx-1.c: New test.
* gcc.target/avr/torture/pr119989-flashx-2.c: New test.
* gcc.target/avr/torture/pr119989-flashx-3.c: New test.
* gcc.target/avr/torture/pr119989-flashx-4.c: New test.
(cherry picked from commit 1ca1c1fc3b58ae5e1d3db4f5a2014132fe69f82a)
|
|
Although we already try to set the mode needed to FRM_DYN after a function call,
there are still some corner cases where both FRM_DYN and FRM_DYN_CALL may appear
on incoming edges.
Therefore, we use TARGET_MODE_CONFLUENCE to tell GCC that FRM_DYN, FRM_DYN_CALL,
and FRM_DYN_EXIT modes are compatible.
gcc/ChangeLog:
PR target/119832
* config/riscv/riscv.cc (riscv_dynamic_frm_mode_p): New.
(riscv_mode_confluence): New.
(TARGET_MODE_CONFLUENCE): Define to riscv_mode_confluence.
gcc/testsuite/ChangeLog:
PR target/119832
* g++.target/riscv/pr119832.C: New test.
|
|
If -msve-vector-bits=128, SVE loads and stores (LD1 and ST1) with a
ptrue predicate can be replaced by neon instructions (LDR and STR),
thus avoiding the predicate altogether. This also enables formation of
LDP/STP pairs.
For example, the test cases
svfloat64_t
ptrue_load (float64_t *x)
{
svbool_t pg = svptrue_b64 ();
return svld1_f64 (pg, x);
}
void
ptrue_store (float64_t *x, svfloat64_t data)
{
svbool_t pg = svptrue_b64 ();
return svst1_f64 (pg, x, data);
}
were previously compiled to
(with -O2 -march=armv8.2-a+sve -msve-vector-bits=128):
ptrue_load:
ptrue p3.b, vl16
ld1d z0.d, p3/z, [x0]
ret
ptrue_store:
ptrue p3.b, vl16
st1d z0.d, p3, [x0]
ret
Now the are compiled to:
ptrue_load:
ldr q0, [x0]
ret
ptrue_store:
str q0, [x0]
ret
The implementation includes the if-statement
if (known_eq (GET_MODE_SIZE (mode), 16)
&& aarch64_classify_vector_mode (mode) == VEC_SVE_DATA)
which checks for 128-bit VLS and excludes partial modes with a
mode size < 128 (e.g. VNx2QI).
The patch was bootstrapped and tested on aarch64-linux-gnu, no regression.
OK for mainline?
Signed-off-by: Jennifer Schmitz <jschmitz@nvidia.com>
gcc/
* config/aarch64/aarch64.cc (aarch64_emit_sve_pred_move):
Fold LD1/ST1 with ptrue to LDR/STR for 128-bit VLS.
gcc/testsuite/
* gcc.target/aarch64/sve/ldst_ptrue_128_to_neon.c: New test.
* gcc.target/aarch64/sve/cond_arith_6.c: Adjust expected outcome.
* gcc.target/aarch64/sve/pcs/return_4_128.c: Likewise.
* gcc.target/aarch64/sve/pcs/return_5_128.c: Likewise.
* gcc.target/aarch64/sve/pcs/struct_3_128.c: Likewise.
|
|
This version is same as v5, but rebase to trunk, send out to trigger CI.
This commit adds intrinsics support for Xsfvcp extension.
Diff with V4: Delete the sifive_vector.h file.
Co-Authored by: Jiawei Chen <jiawei@iscas.ac.cn>
Co-Authored by: Shihua Liao <shihua@iscas.ac.cn>
Co-Authored by: Yixuan Chen <chenyixuan@iscas.ac.cn>
gcc/ChangeLog:
* config/riscv/constraints.md (Ou01): New constraint.
(Ou02): Ditto.
* config/riscv/generic-vector-ooo.md (vec_sf_vcp): New reservation.
* config/riscv/genrvv-type-indexer.cc (main): New type.
* config/riscv/riscv-c.cc (riscv_pragma_intrinsic): Add xsfvcp strings.
* config/riscv/riscv-vector-builtins-shapes.cc (struct sf_vcix_se_def):
New function.
(struct sf_vcix_def): Ditto.
(SHAPE): Ditto.
* config/riscv/riscv-vector-builtins-shapes.h: Ditto.
* config/riscv/riscv-vector-builtins-types.def (DEF_RVV_X2_U_OPS): New type.
(DEF_RVV_X2_WU_OPS): Ditto.
(vuint8mf8_t): Ditto.
(vuint8mf4_t): Ditto.
(vuint8mf2_t): Ditto.
(vuint8m1_t): Ditto.
(vuint8m2_t): Ditto.
(vuint8m4_t): Ditto.
(vuint16mf4_t): Ditto.
(vuint16mf2_t): Ditto.
(vuint16m1_t): Ditto.
(vuint16m2_t): Ditto.
(vuint16m4_t): Ditto.
(vuint32mf2_t): Ditto.
(vuint32m1_t): Ditto.
(vuint32m2_t): Ditto.
(vuint32m4_t): Ditto.
* config/riscv/riscv-vector-builtins.cc (DEF_RVV_X2_U_OPS): New builtins
def.
(DEF_RVV_X2_WU_OPS): Ditto.
(rvv_arg_type_info::get_scalar_float_type): Ditto.
(function_instance::modifies_global_state_p): Ditto.
* config/riscv/riscv-vector-builtins.def (v_x): New base type.
(i): Ditto.
(v_i): Ditto.
(xv): Ditto.
(iv): Ditto.
(fv): Ditto.
(vvv): Ditto.
(xvv): Ditto.
(ivv): Ditto.
(fvv): Ditto.
(vvw): Ditto.
(xvw): Ditto.
(ivw): Ditto.
(fvw): Ditto.
(v_vv): Ditto.
(v_xv): Ditto.
(v_iv): Ditto.
(v_fv): Ditto.
(v_vvv): Ditto.
(v_xvv): Ditto.
(v_ivv): Ditto.
(v_fvv): Ditto.
(v_vvw): Ditto.
(v_xvw): Ditto.
(v_ivw): Ditto.
(v_fvw): Ditto.
(x2_vector): Ditto.
(scalar_float): Ditto.
* config/riscv/riscv-vector-builtins.h (enum required_ext): New extension.
(required_ext_to_isa_name): Ditto.
(required_extensions_specified): Ditto.
(struct rvv_arg_type_info): Ditto.
(struct function_group_info): Ditto.
* config/riscv/riscv.md: New attr.
* config/riscv/sifive-vector-builtins-bases.cc (class sf_vc): New function.
(BASE): New base_name.
* config/riscv/sifive-vector-builtins-bases.h: New function_base.
* config/riscv/sifive-vector-builtins-functions.def
(REQUIRED_EXTENSIONS): New intrinsics def.
(sf_vc): Ditto.
* config/riscv/sifive-vector.md (@sf_vc_x_se<mode>): New RTL mode.
(@sf_vc_v_x_se<mode>): Ditto.
(@sf_vc_v_x<mode>): Ditto.
(@sf_vc_i_se<mode>): Ditto.
(@sf_vc_v_i_se<mode>): Ditto.
(@sf_vc_v_i<mode>): Ditto.
(@sf_vc_vv_se<mode>): Ditto.
(@sf_vc_v_vv_se<mode>): Ditto.
(@sf_vc_v_vv<mode>): Ditto.
(@sf_vc_xv_se<mode>): Ditto.
(@sf_vc_v_xv_se<mode>): Ditto.
(@sf_vc_v_xv<mode>): Ditto.
(@sf_vc_iv_se<mode>): Ditto.
(@sf_vc_v_iv_se<mode>): Ditto.
(@sf_vc_v_iv<mode>): Ditto.
(@sf_vc_fv_se<mode>): Ditto.
(@sf_vc_v_fv_se<mode>): Ditto.
(@sf_vc_v_fv<mode>): Ditto.
(@sf_vc_vvv_se<mode>): Ditto.
(@sf_vc_v_vvv_se<mode>): Ditto.
(@sf_vc_v_vvv<mode>): Ditto.
(@sf_vc_xvv_se<mode>): Ditto.
(@sf_vc_v_xvv_se<mode>): Ditto.
(@sf_vc_v_xvv<mode>): Ditto.
(@sf_vc_ivv_se<mode>): Ditto.
(@sf_vc_v_ivv_se<mode>): Ditto.
(@sf_vc_v_ivv<mode>): Ditto.
(@sf_vc_fvv_se<mode>): Ditto.
(@sf_vc_v_fvv_se<mode>): Ditto.
(@sf_vc_v_fvv<mode>): Ditto.
(@sf_vc_vvw_se<mode>): Ditto.
(@sf_vc_v_vvw_se<mode>): Ditto.
(@sf_vc_v_vvw<mode>): Ditto.
(@sf_vc_xvw_se<mode>): Ditto.
(@sf_vc_v_xvw_se<mode>): Ditto.
(@sf_vc_v_xvw<mode>): Ditto.
(@sf_vc_ivw_se<mode>): Ditto.
(@sf_vc_v_ivw_se<mode>): Ditto.
(@sf_vc_v_ivw<mode>): Ditto.
(@sf_vc_fvw_se<mode>): Ditto.
(@sf_vc_v_fvw_se<mode>): Ditto.
(@sf_vc_v_fvw<mode>): Ditto.
* config/riscv/vector-iterators.md: New iterator.
* config/riscv/vector.md: New vtype.
|
|
0x67 prefix is applied before segment register. That is in
rep movsq %gs:(%esi), (%edi)
the address is %gs + %esi. In case Pmode != word_mode (x32 with a default
-maddress-mode=short) instructions should not allow segment override prefixes.
Also, remove explicit addr32 prefix from asm templates because address
mode can be determined from explicit instruction operands. Also note that
Pmode != word_mode only with TARGET_64BIT, so the check in ix86_print_operand
is not needed.
PR target/111657
gcc/ChangeLog:
* config/i386/i386-expand.cc (alg_usable_p): For Pmode != word_mode
reject rep_prefix_{1,4,8}_byte algorithms with src_as in the
non-default address space.
* config/i386/i386-protos.h (ix86_check_movs): New prototype.
* config/i386/i386.cc (ix86_check_movs): New function.
(ix86_print_operand) [case '^']: Remove excess check for TARGET_64BIT.
* config/i386/i386.md (strmov): For Pmode != word_mode expand with
gen_strmov_single only when operands[3] (source) is in the default
address space.
(*strmovdi_rex_1) Use ix86_check_movs. Remove %^ from asm template.
(*strmovsi_1): Ditto.
(*strmovhi_1): DItto.
(*strmovqi_1): Ditto.
(*rep_movdi_rex64): Ditto.
(*rep_movsi): Ditto.
(*rep_movqi): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr111657-1.c: Check that segment override is not
generated for "rep movsq" for x32 target.
|
|
SIBCALL_REGS/JALR_REGS are also subset of GR_REGS and need to
be taken into acount in riscv_register_move_cost, otherwise it
will get a incorrect cost.
gcc/ChangeLog:
* config/riscv/riscv.cc (riscv_register_move_cost): Use
reg_class_subset_p to check the reg class.
|
|
MOVS instructions allow segment override of their source operand, e.g.:
rep movsq %gs:(%rsi), (%rdi)
where %rsi is the address of the source location (with %gs segment override)
and %rdi is the address of the destination location.
The testcase improves from (-O2 -mno-sse -mtune=generic):
xorl %eax, %eax
.L2:
movl %eax, %edx
addl $8, %eax
movq %gs:m(%rdx), %rcx
movq %rcx, (%rdi,%rdx)
cmpl $240, %eax
jb .L2
ret
to:
movl $m, %esi
movl $30, %ecx
rep movsq %gs:(%rsi), (%rdi)
ret
PR target/111657
gcc/ChangeLog:
* config/i386/i386-expand.cc (alg_usable_p): Remove have_as bool
argument and add dst_as and src_as address space arguments. Reject
libcall algorithm with dst_as and src_as in the non-default address
spaces. Reject rep_prefix_{1,4,8}_byte algorithms with dst_as in
the non-default address space.
(decide_alg): Remove have_as bool argument and add dst_as and src_as
address space arguments. Update calls to alg_usable_p.
(ix86_expand_set_or_cpymem): Update call to decide_alg.
* config/i386/i386.md (strmov): Do not fail if operand[3] (source)
is in the non-default address space. Expand with gen_strmov_singleop
only when operand[1] (destination) is in the default address space.
(*strmovdi_rex_1): Determine memory operands from insn pattern.
Allow only when destination is in the default address space.
Rewrite asm template to use explicit operands.
(*strmovsi_1): Ditto.
(*strmovhi_1): DItto.
(*strmovqi_1): Ditto.
(*rep_movdi_rex64): Ditto.
(*rep_movsi): Ditto.
(*rep_movqi): Ditto.
(*strsetdi_rex_1): Determine memory operands from insn pattern.
Allow only when destination is in the default address space.
(*strsetsi_1): Ditto.
(*strsethi_1): Ditto.
(*strsetqi_1): Ditto.
(*rep_stosdi_rex64): Ditto.
(*rep_stossi): Ditto.
(*rep_stosqi): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr111657-1.c: New test.
|
|
Skip sub-RTXes of the memory operand if stack access register is
not mentioned in the operand.
gcc/ChangeLog:
* config/i386/i386.cc (ix86_update_stack_alignment): Skip sub-RTXes
of the memory operand if stack access register is not mentioned in
the operand.
|
|
For all different modes of all 0s/1s vectors, we can use the single widest
all 0s/1s vector register for all 0s/1s vector uses in the whole function.
Add a pass to generate a single widest all 0s/1s vector set instruction at
entry of the nearest common dominator for basic blocks with all 0s/1s
vector uses. On Linux/x86-64, in cc1plus, this patch reduces the number
of vector xor instructions from 4803 to 4714 and pcmpeq instructions from
144 to 142.
NB: PR target/92080 and PR target/117839 aren't same. PR target/117839
is for vectors of all 0s and all 1s with different sizes and different
components. PR target/92080 is for broadcast of the same component to
different vector sizes. This patch covers only all 0s and all 1s cases
of PR target/92080.
gcc/
PR target/92080
PR target/117839
* config/i386/i386-features.cc (ix86_place_single_vector_set):
New function.
(remove_partial_avx_dependency): Use it.
(ix86_get_vector_load_mode): New function.
(replace_vector_const): Likewise.
(remove_redundant_vector_load): Likewise.
(pass_data_remove_redundant_vector_load): Likewise.
(pass_remove_redundant_vector_load): Likewise.
(make_pass_remove_redundant_vector_load): Likewise.
* config/i386/i386-passes.def: Add
pass_remove_redundant_vector_load after
pass_remove_partial_avx_dependency.
* config/i386/i386-protos.h
(make_pass_remove_redundant_vector_load): New.
* config/i386/i386.cc (ix86_modes_tieable_p): Return true for
narrower non-scalar-integer modes in SSE registers.
gcc/testsuite/
PR target/92080
PR target/117839
* gcc.target/i386/pr117839-1a.c: New test.
* gcc.target/i386/pr117839-1b.c: Likewise.
* gcc.target/i386/pr117839-2.c: Likewise.
* gcc.target/i386/pr92080-1.c: Likewise.
* gcc.target/i386/pr92080-2.c: Likewise.
* gcc.target/i386/pr92080-3.c: Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
When passing 0xff as an unsigned char function argument with the C frontend
promotion, expand_normal used to get
<integer_cst 0x7fffe6aa23a8 type <integer_type 0x7fffe98225e8 int> constant 255>
and returned the rtx value using the sign-extended representation:
(const_int 255 [0xff])
But after
commit a670ebde3995481225ec62b29686ec07a21e5c10
Author: H.J. Lu <hjl.tools@gmail.com>
Date: Thu Nov 21 07:54:35 2024 +0800
Drop targetm.promote_prototypes from C, C++ and Ada frontends
expand_normal now gets
<integer_cst 0x7fffe9824018 type <integer_type 0x7fffe9822348 unsigned char > constant 255>
and returns
(const_int -1 [0xffffffffffffffff])
which doesn't work with the predicates nor the instruction templates which
expect the unsigned expanded value. Extract the unsigned char and short
integer constants to return
(const_int 255 [0xff])
so that the expanded value is always unsigned, without the C frontend
promotion.
PR target/117547
* config/i386/i386-expand.cc (ix86_expand_unsigned_small_int_cst_argument):
New function.
(ix86_expand_args_builtin): Call
ix86_expand_unsigned_small_int_cst_argument to expand the argument
before calling fixup_modeless_constant.
(ix86_expand_round_builtin): Likewise.
(ix86_expand_special_args_builtin): Likewise.
(ix86_expand_builtin): Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Since the tune if only for GLC(sapphirerapids and alderlake-P).
gcc/ChangeLog:
* config/i386/x86-tune.def (X86_TUNE_DEST_FALSE_DEP_FOR_GLC):
Remove other processor except for GLC since this one is only
for GLC.
|
|
Don't assume that stack slots can only be accessed by stack or frame
registers. We first find all registers defined by stack or frame
registers. Then check memory accesses by such registers, including
stack and frame registers.
gcc/
PR target/109780
PR target/109093
* config/i386/i386.cc (stack_access_data): New.
(ix86_update_stack_alignment): Likewise.
(ix86_find_all_reg_use_1): Likewise.
(ix86_find_all_reg_use): Likewise.
(ix86_find_max_used_stack_alignment): Also check memory accesses
from registers defined by stack or frame registers.
gcc/testsuite/
PR target/109780
PR target/109093
* g++.target/i386/pr109780-1.C: New test.
* gcc.target/i386/pr109093-1.c: Likewise.
* gcc.target/i386/pr109780-1.c: Likewise.
* gcc.target/i386/pr109780-2.c: Likewise.
* gcc.target/i386/pr109780-3.c: Likewise.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Co-Authored-By: Uros Bizjak <ubizjak@gmail.com>
|
|
For Windows x86-32 targets, the Microsoft ABI only guarantees that the stack
is aligned to 4-byte boundaries. GCC knows about the default alignment of the
stack. However, before this commit, it did not realign the stack unless SSE
was also enabled.
When a stricter (larger) alignment is requested, it's always necessary to
realign the stack, as what Solaris does.
Reference: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111107#c14
Signed-off-by: LIU Hao <lh_mouse@126.com>
Signed-off-by: Jonathan Yong <10walls@gmail.com>
gcc/ChangeLog:
PR target/111107
* config/i386/cygming.h (STACK_REALIGN_DEFAULT): Copy from sol2.h.
|
|
Consider the expand_const_vector is quit long (about 500 lines)
and complicated, we would like to extract the different case
into different functions. For example, the const vector stepped
will be extracted into expand_const_vector_stepped.
The below test suites are passed for this patch.
* The rv64gcv fully regression test.
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_const_vector): Extract
const vector stepped into separated func.
(expand_const_vector_single_step_npatterns): Add new func
to take care of single step.
(expand_const_vector_interleaved_stepped_npatterns): Add new
func to take care of interleaved step.
(expand_const_vector_stepped): Add new func to take care of
const vector stepped.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
Consider the expand_const_vector is quit long (about 500 lines)
and complicated, we would like to extract the different case
into different functions. For example, the const vector duplicate
will be extracted into expand_const_vector_duplicate, and then
expand_const_vector_duplicate_repeating and
expand_const_vector_duplicate_default for the underlying function.
The below test suites are passed for this patch.
* The rv64gcv fully regression test.
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_const_vector_duplicate_repeating):
Add new func to take care of vector duplicate with repeating.
(expand_const_vector_duplicate_default): Add new func to take
care of default const vector duplicate.
(expand_const_vector_duplicate): Add new func to take care
of all const vector duplicate.
(expand_const_vector): Extract const vector duplicate into
separated function.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
Consider the expand_const_vector is quit long (about 500 lines)
and complicated, we would like to extract the different case
into different functions. For example, the const vec_series
will be extracted into expand_const_vec_series.
The below test suites are passed for this patch.
* The rv64gcv fully regression test.
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_const_vec_series): Add new
func to take care of the const vec_series.
(expand_const_vector): Extract const vec_series into separated
function.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
Consider the expand_const_vector is quit long (about 500 lines)
and complicated, we would like to extract the different case
into different functions. For example, the const vec_duplicate
will be extracted into expand_const_vec_duplicate.
The below test suites are passed for this patch.
* The rv64gcv fully regression test.
gcc/ChangeLog:
* config/riscv/riscv-v.cc (expand_const_vector): Extract
const vec_duplicate into separated function.
(expand_const_vec_duplicate): Add new func to take care
of the const vec_duplicate.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
gcc/ChangeLog:
PR target/119549
* common/config/i386/i386-common.cc (ix86_handle_option):
Refactor msse4 and mno-sse4.
* config/i386/i386.opt (msse4): Remove RejectNegative.
(mno-sse4): Remove the entry.
* config/i386/i386-options.cc
(ix86_valid_target_attribute_inner_p): Remove special code
which handles mno-sse4.
|
|
I introduced a bug by last minute cleanups unifying the scalar and vector SSE conditional.
This patch fixes it and restores cost of 1 of SSE scalar MIN/MAX
Bootstrapped/regtested x86_64-linux, comitted.
gcc/ChangeLog:
PR target/105275
* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Fix cost of FP scalar
MAX_EXPR and MIN_EXPR
|
|
Here is a patch to handle the PARALLEL case too.
I think we can just use rtx_equal_p there, because it will always use
SImode in the EXPR_LIST REGs in that case.
2025-04-25 Jakub Jelinek <jakub@redhat.com>
PR target/119873
* config/s390/s390.cc (s390_call_saved_register_used): Don't return
true if default definition of PARM_DECL SSA_NAME of the same register
is passed in call saved register in the PARALLEL case either.
* gcc.target/s390/pr119873-5.c: New test.
|
|
There are GCN/C++ target as well as offloading codes, where the hard-coded
section names in 'gcn_hsa_declare_function_name' do not fit, and assembly thus
fails:
LLVM ERROR: Size expression must be absolute.
This commit progresses GCN target:
[-FAIL: g++.dg/init/call1.C -std=gnu++17 (internal compiler error: Aborted signal terminated program as)-]
[-FAIL:-]{+PASS:+} g++.dg/init/call1.C -std=gnu++17 (test for excess errors)
[-UNRESOLVED:-]{+PASS:+} g++.dg/init/call1.C -std=gnu++17 [-compilation failed to produce executable-]{+execution test+}
[-FAIL: g++.dg/init/call1.C -std=gnu++26 (internal compiler error: Aborted signal terminated program as)-]
[-FAIL:-]{+PASS:+} g++.dg/init/call1.C -std=gnu++26 (test for excess errors)
[-UNRESOLVED:-]{+PASS:+} g++.dg/init/call1.C -std=gnu++26 [-compilation failed to produce executable-]{+execution test+}
UNSUPPORTED: g++.dg/init/call1.C -std=gnu++98: exception handling not supported
..., and GCN offloading:
[-XFAIL: libgomp.c++/target-exceptions-throw-1.C (internal compiler error: Aborted signal terminated program as)-]
[-XFAIL: libgomp.c++/target-exceptions-throw-1.C PR119737 at line 7 (test for bogus messages, line )-]
[-XFAIL:-]{+PASS:+} libgomp.c++/target-exceptions-throw-1.C (test for excess errors)
[-UNRESOLVED:-]{+PASS:+} libgomp.c++/target-exceptions-throw-1.C [-compilation failed to produce executable-]{+execution test+}
{+PASS: libgomp.c++/target-exceptions-throw-1.C output pattern test+}
[-XFAIL: libgomp.c++/target-exceptions-throw-2.C (internal compiler error: Aborted signal terminated program as)-]
[-XFAIL: libgomp.c++/target-exceptions-throw-2.C PR119737 at line 7 (test for bogus messages, line )-]
[-XFAIL:-]{+PASS:+} libgomp.c++/target-exceptions-throw-2.C (test for excess errors)
[-UNRESOLVED:-]{+PASS:+} libgomp.c++/target-exceptions-throw-2.C [-compilation failed to produce executable-]{+execution test+}
{+PASS: libgomp.c++/target-exceptions-throw-2.C output pattern test+}
[-XFAIL: libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 (internal compiler error: Aborted signal terminated program as)-]
[-XFAIL: libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 PR119737 at line 7 (test for bogus messages, line )-]
[-XFAIL:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 (test for excess errors)
[-UNRESOLVED:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 [-compilation failed to produce executable-]{+execution test+}
{+PASS: libgomp.oacc-c++/exceptions-throw-1.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 output pattern test+}
[-XFAIL: libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 (internal compiler error: Aborted signal terminated program as)-]
[-XFAIL: libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 PR119737 at line 9 (test for bogus messages, line )-]
[-XFAIL:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 (test for excess errors)
[-UNRESOLVED:-]{+PASS:+} libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 [-compilation failed to produce executable-]{+execution test+}
{+PASS: libgomp.oacc-c++/exceptions-throw-2.C -DACC_DEVICE_TYPE_radeon=1 -DACC_MEM_SHARED=0 -foffload=amdgcn-amdhsa -O2 output pattern test+}
PR target/119737
gcc/
* config/gcn/gcn.cc (gcn_hsa_declare_function_name): Properly
switch sections.
libgomp/
* testsuite/libgomp.c++/target-exceptions-throw-1.C: Remove
PR119737 XFAILing.
* testsuite/libgomp.c++/target-exceptions-throw-2.C: Likewise.
* testsuite/libgomp.oacc-c++/exceptions-throw-1.C: Likewise.
* testsuite/libgomp.oacc-c++/exceptions-throw-2.C: Likewise.
Co-authored-by: Thomas Schwinge <tschwinge@baylibre.com>
|
|
protobuf (and therefore firefox too) currently doesn't build on s390*-linux.
The problem is that it uses [[clang::musttail]] attribute heavily, and in
llvm (IMHO llvm bug) [[clang::musttail]] calls with 5+ arguments on
s390*-linux are silently accepted and result in a normal non-tail call.
In GCC we just reject those because the target hook refuses to tail call it
(IMHO the right behavior).
Now, the reason why that happens is as s390_function_ok_for_sibcall attempts
to explain, the 5th argument (assuming normal <= wordsize integer or pointer
arguments, nothing that needs 2+ registers) is passed in %r6 which is not
call clobbered, so we can't do tail call when we'd have to change content
of that register and then caller would assume %r6 content didn't change and
use it again.
In the protobuf case though, the 5th argument is always passed through
from the caller to the musttail callee unmodified, so one can actually
emit just jg tail_called_function or perhaps tweak some registers but
keep %r6 untouched, and in that case I think it is just fine to tail call
it (at least unless the stack slots used for 6+ argument can't be modified
by the callee in the ABI and nothing checks for that).
So, the following patch checks for this special case, where the argument
which uses %r6 is passed in a single register and it is passed default
definition of SSA_NAME of a PARM_DECL with the same DECL_INCOMING_RTL.
It won't really work at -O0 but should work for -O1 and above, at least when
one doesn't really try to modify the parameter conditionally and hope it will
be optimized away in the end.
2025-04-24 Jakub Jelinek <jakub@redhat.com>
Stefan Schulze Frielinghaus <stefansf@gcc.gnu.org>
PR target/119873
* config/s390/s390.cc (s390_call_saved_register_used): Don't return
true if default definition of PARM_DECL SSA_NAME of the same register
is passed in call saved register.
(s390_function_ok_for_sibcall): Adjust comment.
* gcc.target/s390/pr119873-1.c: New test.
* gcc.target/s390/pr119873-2.c: New test.
* gcc.target/s390/pr119873-3.c: New test.
* gcc.target/s390/pr119873-4.c: New test.
|
|
gcc/ChangeLog:
PR target/119919
* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Account
correctly cond_expr and min/max when one of operands is 0 or -1.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr119919.c: New test.
|
|
PR119610 is about incorrect CFI output for a stack probe when that
probe is not the initial allocation. The main aarch64 stack probe
function, aarch64_allocate_and_probe_stack_space, implicitly assumed
that the incoming stack pointer pointed to the top of the frame,
and thus held the CFA.
aarch64_save_callee_saves and aarch64_restore_callee_saves use a
parameter called bytes_below_sp to track how far the stack pointer
is above the base of the static frame. This patch does the same
thing for aarch64_allocate_and_probe_stack_space.
Also, I noticed that the SVE path was attaching the first CFA note
to the wrong instruction: it was attaching the note to the calculation
of the stack size, rather than to the r11<-sp copy.
gcc/
PR target/119610
* config/aarch64/aarch64.cc (aarch64_allocate_and_probe_stack_space):
Add a bytes_below_sp parameter and use it to calculate the CFA
offsets. Attach the first SVE CFA note to the move into the
associated temporary register.
(aarch64_allocate_and_probe_stack_space): Update calls accordingly.
Start out with bytes_per_sp set to the frame size and decrement
it after each allocation.
gcc/testsuite/
PR target/119610
* g++.dg/torture/pr119610.C: New test.
* g++.target/aarch64/sve/pr119610-sve.C: Likewise.
|
|
Since the upper bits are already cleared by the comparison
instructions.
gcc/ChangeLog:
PR target/103750
* config/i386/sse.md (*<avx512>_cmp<mode>3_and15): New define_insn.
(*<avx512>_ucmp<mode>3_and15): Ditto.
(*<avx512>_cmp<mode>3_and3): Ditto.
(*avx512vl_ucmpv2di3_and3): Ditto.
(*<avx512>_cmp<V48H_AVX512VL:mode>3_zero_extend<SWI248x:mode>):
Change operands[3] predicate to <cmp_imm_predicate>.
(*<avx512>_cmp<V48H_AVX512VL:mode>3_zero_extend<SWI248x:mode>_2):
Ditto.
(*<avx512>_cmp<mode>3): Add GET_MODE_NUNITS (<MODE>mode) >= 8
to the condition.
(*<avx512>_ucmp<mode>3): Ditto.
(V48_AVX512VL_4): New mode iterator.
(VI48_AVX512VL_4): Ditto.
(V8_AVX512VL_2): Ditto.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512vl-pr103750-1.c: New test.
* gcc.target/i386/avx512f-pr96891-3.c: Adjust testcase.
* gcc.target/i386/avx512f-vpcmpgtuq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpeqq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpequq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpgeq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpgeuq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpgtq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpgtuq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpleq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpleuq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpltq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpltuq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpneqq-1.c: Ditto.
* gcc.target/i386/avx512vl-vpcmpnequq-1.c: Ditto.
|
|
this patch implements costing of truth_value exprs. I.e.
a = b < c;
Those seems to be now the most common operations that goes to the addss path
except for in->fp and fp->int conversions.
For integer we use setcc, for FP there is CMccSS and variants which sets the
destination register a s a mast (i.e. -1 on true and 0 on false). Technically
these needs res&1 to get into 1 on true, 0 on false, but looking on examples
where this is used, it is common that the resulting code is optimized avoiding
need for this (except for cases wehre result is directly saved to memory).
For this reason I am accounting only one sse_op (CMccSS) itself.
gcc/ChangeLog:
* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Cost truth_value
exprs.
|
|
Since ix86_expand_sse_movcc will simplify them into a simple vmov, vpand
or vpandn.
gcc/ChangeLog:
* config/i386/predicates.md (vector_or_0_or_1s_operand): New predicate.
(nonimm_or_0_or_1s_operand): Ditto.
* config/i386/sse.md (vcond_mask_<mode><sseintvecmodelower>):
Extend the predicate of operands1 to accept 0 or allones
operands.
(vcond_mask_<mode><sseintvecmodelower>): Ditto.
(vcond_mask_v1tiv1ti): Ditto.
(vcond_mask_<mode><sseintvecmodelower>): Ditto.
* config/i386/i386.md (mov<mode>cc): Ditto for operands[2] and
operands[3].
* config/i386/i386-expand.cc (ix86_expand_sse_fp_minmax):
Force immediate_operand to register.
gcc/testsuite/ChangeLog:
* gcc.target/i386/blendv-to-maxmin.c: New test.
* gcc.target/i386/blendv-to-pand.c: New test.
|
|
this patch adds special cases for vectorizer costs in COND_EXPR, MIN_EXPR,
MAX_EXPR, ABS_EXPR and ABSU_EXPR. We previously costed ABS_EXPR and ABSU_EXPR
but it was only correct for FP variant (wehre it corresponds to andss clearing
sign bit). Integer abs/absu is open coded as conditinal move for SSE2 and
SSE3 instroduced an instruction.
MIN_EXPR/MAX_EXPR compiles to minss/maxss for FP and accroding to Agner Fog
tables they costs same as sse_op on all targets. Integer translated to single
instruction since SSE3.
COND_EXPR translated to open-coded conditional move for SSE2, SSE4.1 simplified
the sequence and AVX512 introduced masked registers.
gcc/ChangeLog:
* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): Add special cases
for COND_EXPR; make MIN_EXPR, MAX_EXPR, ABS_EXPR and ABSU_EXPR more realistic.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr89618-2.c: XFAIL.
|
|
decisions [PR119327]
The following testcase FAILs because the always_inline function can't
be inlined.
The rs6000 backend has similarly to other targets a hook which rejects
inlining which would bring in new ISAs which aren't there in the caller.
And this hook rejects this because of OPTION_MASK_SAVE_TOC_INDIRECT
differences.
This flag is set if explicitly requested or by default depending on
whether the current function looks hot (or at least not cold):
if ((rs6000_isa_flags_explicit & OPTION_MASK_SAVE_TOC_INDIRECT) == 0
&& flag_shrink_wrap_separate
&& optimize_function_for_speed_p (cfun))
rs6000_isa_flags |= OPTION_MASK_SAVE_TOC_INDIRECT;
The target nodes that are being compared here are actually the default
target node (which was created when cfun was NULL) vs. one that was
created for the always_inline function when it wasn't NULL, so one
doesn't have it, the other does.
In any case, this flag feels like a tuning decision rather than hard
ISA requirement and I see no problems why we couldn't inline
even explicit -msave-toc-indirect function into -mno-save-toc-indirect
or vice versa.
We already ignore OPTION_MASK_P{8,10}_FUSION which are also more
like tuning flags.
2025-04-22 Jakub Jelinek <jakub@redhat.com>
PR target/119327
* config/rs6000/rs6000.cc (rs6000_can_inline_p): Ignore also
OPTION_MASK_SAVE_TOC_INDIRECT differences.
* g++.dg/opt/pr119327.C: New test.
|
|
We implemented FAMINMAX ACLE support but failed to define the
associated feature macro.
gcc/
* config/aarch64/aarch64-c.cc (aarch64_update_cpp_builtins): Define
__ARM_FEATURE_FAMINMAX.
gcc/testsuite/
* gcc.target/aarch64/pragma_cpp_predefs_4.c: Test
__ARM_FEATURE_FAMINMAX.
|
|
Enable a target with FEAT_FP16 to emit the half-precision variants
of FCMP/FCMPE.
gcc/ChangeLog:
* config/aarch64/aarch64.md: Update cbranch, cstore, fcmp
and fcmpe to use the GPF_F16 iterator for floating-point
modes.
gcc/testsuite/ChangeLog:
* gcc.target/aarch64/_Float16_cmp_1.c: New test.
* gcc.target/aarch64/_Float16_cmp_2.c: New (negative) test.
|
|
This expansion ensures that exactly one comparison is emitted for
spacesip-like sequences on floating-point operands, including when
the result of such sequences are compared against members of
std::<some_ordering>::<some_value>.
For both integer and floating-point types, we optimize for the case
in which the result of a spaceship-like operation is written to a GPR.
The PR highlights this issue for floating-point operands, but we also
make an improvement for integers, preferring:
cmp w0, w1
cset w1, gt
csinv w0, w1, wzr, ge
over:
cmp w0, w1
mov w0, 1
csinv w0, w0, wzr, ge
csel w0, w0, wzr, ne
to compute:
auto test(int a, int b) { return a <=> b;}
gcc/ChangeLog:
PR target/117013
* config/aarch64/aarch64-protos.h (aarch64_expand_fp_spaceship):
Declare optab expander function for floating-point types.
* config/aarch64/aarch64.cc (aarch64_expand_fp_spaceship):
Define optab expansion for floating-point types (new function).
* config/aarch64/aarch64.md (spaceship<mode>4):
Add define_expands for spaceship<mode>4 on integer and
floating-point types.
gcc/testsuite/ChangeLog:
PR target/117013
* g++.target/aarch64/spaceship_1.C: New test.
* g++.target/aarch64/spaceship_2.C: New test.
* g++.target/aarch64/spaceship_3.C: New test.
|
|
We had not noticed that after g:299a8e2dc667e795991bc439d2cad5ea5bd379e2 the
FP8FMA and FP8DOT4 features aren't implied by FP8FMA. The intent is for
-mcpu=olympus to support all of them.
Fix the definition to include the relevant sub-features explicitly.
Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>
gcc/
* config/aarch64/aarch64-cores.def (olympus): Add fp8fma, fp8dot4
explicitly.
|