|
A common idiom for implementing an integer division that rounds upwards is
to write (x + y - 1) / y. Conveniently on x86, the two additions to form
the numerator can be performed by a single lea instruction, and indeed gcc
currently generates a lea when both x and y are registers.
int foo(int x, int y) {
return (x+y-1)/y;
}
generates with -O2:
foo: leal -1(%rsi,%rdi), %eax // 4 bytes
cltd
idivl %esi
ret
Oddly, however, if x is a memory operand, gcc currently uses two instructions:
int m;
int bar(int y) {
return (m+y-1)/y;
}
generates:
bar: movl m(%rip), %eax
addl %edi, %eax // 2 bytes
subl $1, %eax // 3 bytes
cltd
idivl %edi
ret
This discrepancy is caused by the late decision (in peephole2) to split
an addition with a memory operand, into a load followed by a reg-reg
addition. This patch improves this situation by adding a peephole2
to recognize consecutive additions and transform them into lea if
profitable.
My first attempt at fixing this was to use a define_insn_and_split:
(define_insn_and_split "*lea<mode>3_reg_mem_imm"
[(set (match_operand:SWI48 0 "register_operand")
(plus:SWI48 (plus:SWI48 (match_operand:SWI48 1 "register_operand")
(match_operand:SWI48 2 "memory_operand"))
(match_operand:SWI48 3 "x86_64_immediate_operand")))]
"ix86_pre_reload_split ()"
"#"
"&& 1"
[(set (match_dup 4) (match_dup 2))
(set (match_dup 0) (plus:SWI48 (plus:SWI48 (match_dup 1) (match_dup 4))
(match_dup 3)))]
"operands[4] = gen_reg_rtx (<MODE>mode);")
relying on combine to merge the instructions. Unfortunately, this approach
interferes with (reload's) subtle balance of deciding when to use/avoid lea,
which can be observed as a code size regression in CSiBE. The peephole2
approach (proposed here) uniformly improves CSiBE results.
2024-07-01 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386.md (peephole2): Transform two consecutive
additions into a 3-component lea if !TARGET_AVOID_LEA_FOR_ADDR.
gcc/testsuite/ChangeLog
* gcc.target/i386/lea-3.c: New test case.
|
|
PR target/88236
PR target/115726
gcc/
* config/avr/avr.md (mov<mode>) [avr_mem_memx_p]: Expand in such a
way that the destination does not overlap with any hard register
clobbered / used by xload8qi_A resp. xload<mode>_A.
* config/avr/avr.cc (avr_out_xload): Avoid early-clobber
situation for Z by executing just one load when the output register
overlaps with Z.
gcc/testsuite/
* gcc.target/avr/torture/pr88236-pr115726.c: New test.
|
|
PR testsuite/52641
gcc/testsuite/
* gcc.dg/analyzer/pr109577.c: Use __SIZE_TYPE__ instead of "unsigned long".
* gcc.dg/analyzer/pr93032-mztools-signed-char.c: Requires int32plus.
* gcc.dg/analyzer/pr93032-mztools-unsigned-char.c: Requires int32plus.
* gcc.dg/analyzer/putenv-1.c: Skip on avr.
* gcc.dg/torture/type-generic-1.c: Skip on avr.
|
|
This creates a new predefined allocator as a shortcut for using pinned
memory with OpenMP. This is not in the OpenMP standard so it uses the "ompx"
namespace and an independent enum baseline of 200 (selected to not clash with
other known implementations).
The allocator is equivalent to using a custom allocator with the pinned
trait and the null fallback trait. One motivation for having this feature is
for use by the (planned) -foffload-memory=pinned feature.
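As a usage sketch (illustration only, not part of the patch; it assumes
just the documented omp_alloc/omp_free API plus the new allocator handle,
and the function name is hypothetical):
#include <omp.h>
int main (void)
{
  /* Pinned host memory; the null fallback trait means a failed pinned
     allocation returns NULL instead of falling back to another allocator.  */
  double *buf = omp_alloc (1024 * sizeof (double), ompx_gnu_pinned_mem_alloc);
  if (!buf)
    return 1;
  buf[0] = 1.0;
  omp_free (buf, ompx_gnu_pinned_mem_alloc);
  return 0;
}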
gcc/fortran/ChangeLog:
* openmp.cc (is_predefined_allocator): Update valid ranges to
incorporate ompx_gnu_pinned_mem_alloc.
libgomp/ChangeLog:
* allocator.c (ompx_gnu_min_predefined_alloc): New.
(ompx_gnu_max_predefined_alloc): New.
(predefined_alloc_mapping): Rename to ...
(predefined_omp_alloc_mapping): ... this.
(predefined_ompx_gnu_alloc_mapping): New.
(_Static_assert): Adjust for the new name, and add a new assert for the
new table.
(predefined_allocator_p): New.
(predefined_alloc_mapping): New.
(omp_aligned_alloc): Support ompx_gnu_pinned_mem_alloc.
Use predefined_allocator_p and predefined_alloc_mapping.
(omp_free): Likewise.
(omp_aligned_calloc): Likewise.
(omp_realloc): Likewise.
* env.c (parse_allocator): Add ompx_gnu_pinned_mem_alloc.
* libgomp.texi: Document ompx_gnu_pinned_mem_alloc.
* omp.h.in (omp_allocator_handle_t): Add ompx_gnu_pinned_mem_alloc.
* omp_lib.f90.in: Add ompx_gnu_pinned_mem_alloc.
* omp_lib.h.in: Add ompx_gnu_pinned_mem_alloc.
* testsuite/libgomp.c/alloc-pinned-5.c: New test.
* testsuite/libgomp.c/alloc-pinned-6.c: New test.
* testsuite/libgomp.fortran/alloc-pinned-1.f90: New test.
gcc/testsuite/ChangeLog:
* gfortran.dg/gomp/allocate-pinned-1.f90: New test.
Co-Authored-By: Thomas Schwinge <thomas@codesourcery.com>
|
|
The following fixes an ICE with a .COND_ADD discovered as a reduction
even though its else value isn't the reduction chain link but a
constant. This would be wrong code with --disable-checking, I think.
PR tree-optimization/115723
* tree-vect-loop.cc (check_reduction_path): For a .COND_ADD
verify the else value also refers to the reduction chain op.
* gcc.dg/vect/pr115723.c: New testcase.
|
|
The following adds a missed check when forwprop attempts to rewrite
a complex store.
PR tree-optimization/115694
* tree-ssa-forwprop.cc (pass_forwprop::execute): Check the
store is complex before rewriting it.
* g++.dg/torture/pr115694.C: New testcase.
|
|
gcc/ChangeLog:
PR target/115517
* config/i386/mmx.md (vcond<mode>v2sf): Removed.
(vcond<MMXMODE124:mode><MMXMODEI:mode>): Ditto.
(vcond<mode><mode>): Ditto.
(vcondu<MMXMODE124:mode><MMXMODEI:mode>): Ditto.
(vcondu<mode><mode>): Ditto.
* config/i386/sse.md (vcond<V_512:mode><VF_512:mode>): Ditto.
(vcond<V_256:mode><VF_256:mode>): Ditto.
(vcond<V_128:mode><VF_128:mode>): Ditto.
(vcond<VI2HFBF_AVX512VL:mode><VHF_AVX512VL:mode>): Ditto.
(vcond<V_512:mode><VI_AVX512BW:mode>): Ditto.
(vcond<V_256:mode><VI_256:mode>): Ditto.
(vcond<V_128:mode><VI124_128:mode>): Ditto.
(vcond<VI8F_128:mode>v2di): Ditto.
(vcondu<V_512:mode><VI_AVX512BW:mode>): Ditto.
(vcondu<V_256:mode><VI_256:mode>): Ditto.
(vcondu<V_128:mode><VI124_128:mode>): Ditto.
(vcondu<VI8F_128:mode>v2di): Ditto.
(vcondeq<VI8F_128:mode>v2di): Ditto.
|
|
Try to optimize x < 0 ? -1 : 0 into (signed) x >> 31
and x < 0 ? 1 : 0 into (unsigned) x >> 31.
Add a define_insn_and_split for the optimization done in
ix86_expand_int_vcond.
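A scalar sketch of the two equivalences (illustration only; the patch
handles the vector forms, gcc defines >> on a signed int as an arithmetic
shift, and the function names are hypothetical):
int all_ones_if_neg (int x) { return x < 0 ? -1 : 0; }  /* x >> 31 */
int one_if_neg (int x) { return x < 0 ? 1 : 0; }  /* (unsigned) x >> 31 */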
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md ("*ashr<mode>3_1"): New
define_insn_and_split.
(*avx512_ashr<mode>3_1): Ditto.
(*avx2_lshr<mode>3_1): Ditto.
(*avx2_lshr<mode>3_2): Ditto, and add 2 combine splitters after
it.
* config/i386/mmx.md (mmxscalarsize): New mode attribute.
(*mmx_ashr<mode>3_1): New define_insn_and_split.
("mmx_<insn><mode>3"): Add a combine splitter after it.
(*mmx_ashrv2hi3_1): New define_insn_and_split, also add a
combine splitter after it.
gcc/testsuite/ChangeLog:
* gcc.target/i386/pr111023-2.c: Adjust testcase.
* gcc.target/i386/vect-div-1.c: Ditto.
|
|
> Richard suggests that we implement the "obvious" transforms like
> inversion in the middle-end but if for example unsigned compares
> are not supported the us_minus + eq + negative trick isn't on
> that list.
>
> The main reason to restrict vec_cmp would be to avoid
> a <= b ? c : d going with an unsupported vec_cmp but instead
> do a > b ? d : c - the alternative is trying to fix this
> on the RTL side via combine. I understand the non-native
Yes, I have a patch which can fix most regressions via pattern matching
in combine.
Still there is a situation that is difficult to deal with, mainly the
optimization without SSE4.1. Because pblendvb/blendvps/blendvpd only
exist under SSE4.1, without it the vcond_mask takes 3 instructions
(pand, pandn, por) to simulate, and combine matches at most 4
instructions, which currently makes it impossible for combine to
recover those optimizations in vcond{,u,eq}, i.e. min/max.
With SSE4.1 and above, there is basically no regression anymore.
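A scalar sketch of that three-instruction emulation (illustration only;
each mask lane is assumed all-ones or all-zeros, as a vector compare
produces, and the function name is hypothetical):
unsigned
select_by_mask (unsigned mask, unsigned a, unsigned b)
{
  return (mask & a) | (~mask & b);  /* pand + pandn + por */
}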
The regression testcases without SSE4.1:
FAIL: g++.target/i386/pr100637-1b.C -std=gnu++14 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1b.C -std=gnu++17 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1b.C -std=gnu++20 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1b.C -std=gnu++98 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr100637-1w.C -std=gnu++14 scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr100637-1w.C -std=gnu++17 scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr100637-1w.C -std=gnu++20 scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr100637-1w.C -std=gnu++98 scan-assembler-times pcmpeqw 2
FAIL: g++.target/i386/pr103861-1.C -std=gnu++14 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr103861-1.C -std=gnu++17 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr103861-1.C -std=gnu++20 scan-assembler-times pcmpeqb 2
FAIL: g++.target/i386/pr103861-1.C -std=gnu++98 scan-assembler-times pcmpeqb 2
FAIL: gcc.target/i386/pr88540.c scan-assembler minpd
gcc/testsuite/ChangeLog:
PR target/115517
* g++.target/i386/pr100637-1b.C: Add xfail and -mno-sse4.1.
* g++.target/i386/pr100637-1w.C: Ditto.
* g++.target/i386/pr103861-1.C: Ditto.
* gcc.target/i386/pr88540.c: Ditto.
* gcc.target/i386/pr103941-2.c: Add -mno-avx512f.
* g++.target/i386/sse4_1-pr100637-1b.C: New test.
* g++.target/i386/sse4_1-pr100637-1w.C: New test.
* g++.target/i386/sse4_1-pr103861-1.C: New test.
* gcc.target/i386/sse4_1-pr88540.c: New test.
|
|
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md
(*<sse>_movmsk<ssemodesuffix><avxsizesuffix>_lt_avx512): New
define_insn_and_split.
(*<sse>_movmsk<ssemodesuffix><avxsizesuffix>_<u>ext_lt_avx512):
Ditto.
(*<sse2_avx2>_pmovmskb_lt_avx512): Ditto.
(*<sse2_avx2>_pmovmskb_zext_lt_avx512): Ditto.
(*sse2_pmovmskb_ext_lt_avx512): Ditto.
(*pmovsk_kmask_v16qi_avx512): Ditto.
(*pmovsk_mask_v32qi_avx512): Ditto.
(*pmovsk_mask_cmp_<mode>_avx512): Ditto.
(*pmovsk_ptest_<mode>_avx512): Ditto.
|
|
These versions of the min/max patterns implement exactly the operations
min = (op1 < op2 ? op1 : op2)
max = (!(op1 < op2) ? op1 : op2)
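A source-level sketch of code that should map to these patterns
(hypothetical shape, not one of the actual testcases):
void
minmax (float *restrict lo, float *restrict hi,
        const float *a, const float *b, int n)
{
  for (int i = 0; i < n; i++)
    {
      lo[i] = a[i] < b[i] ? a[i] : b[i];  /* min */
      hi[i] = a[i] < b[i] ? b[i] : a[i];  /* max: !(a < b) ? a : b */
    }
}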
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md (*minmax<mode>3_1): New pre_reload
define_insn_and_split.
(*minmax<mode>3_2): Ditto.
|
|
is vector -1/0.
gcc/ChangeLog
PR target/115517
* config/i386/sse.md
(*<avx512>_cvtmask2<ssemodesuffix><mode>_not): New pre_reload
splitter.
(*<avx512>_cvtmask2<ssemodesuffix><mode>_not): Ditto.
(*avx2_pcmp<mode>3_6): Ditto.
(*avx2_pcmp<mode>3_7): Ditto.
|
|
UNSPEC_BLENDV)
These define_insn_and_split are needed after vcond{,u,eq} is obsolete.
gcc/ChangeLog:
PR target/115517
* config/i386/sse.md
(*<sse4_1>_blendv<ssemodesuffix><avxsizesuffix>_gt): New
define_insn_and_split.
(*<sse4_1>_blendv<ssefltmodesuffix><avxsizesuffix>_gtint):
Ditto.
(*<sse4_1>_blendv<ssefltmodesuffix><avxsizesuffix>_not_gtint):
Ditto.
(*<sse4_1_avx2>_pblendvb_gt): Ditto.
(*<sse4_1_avx2>_pblendvb_gt_subreg_not): Ditto.
|
|
Move pass_stv2 and pass_rpad after the pre_reload pass_late_combine; also
define TARGET_INSN_COST to prevent the post_reload pass_late_combine
from reverting the optimization done in pass_rpad.
Adjust testcases since pass_late_combine generates better code but
breaks scan-assembler patterns.
E.g. under a 32-bit target, gcc used to generate a broadcast from the
stack and then do the real operation. After late_combine, these are
combined into embedded broadcast operations.
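A sketch of the affected shape (hypothetical example, not one of the
adjusted testcases; with -m32 -mavx512f the scalar multiplier c is a
candidate for an embedded broadcast):
void
scale (float *restrict a, const float *restrict b, float c, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] * c;  /* e.g. vmulps with a {1to16} memory operand */
}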
gcc/ChangeLog:
* config/i386/i386-features.cc (ix86_rpad_gate): New function.
* config/i386/i386-options.cc (ix86_override_options_after_change):
Don't disable late_combine.
* config/i386/i386-passes.def: Move pass_stv2 and pass_rpad
after pre_reload pass_late_combine.
* config/i386/i386-protos.h (ix86_rpad_gate): New declaration.
* config/i386/i386.cc (ix86_insn_cost): New function.
(TARGET_INSN_COST): Define.
gcc/testsuite/ChangeLog:
* gcc.target/i386/avx512f-broadcast-pr87767-1.c: Adjust
testcase.
* gcc.target/i386/avx512f-broadcast-pr87767-5.c: Ditto.
* gcc.target/i386/avx512f-fmadd-sf-zmm-7.c: Ditto.
* gcc.target/i386/avx512f-fmsub-sf-zmm-7.c: Ditto.
* gcc.target/i386/avx512f-fnmadd-sf-zmm-7.c: Ditto.
* gcc.target/i386/avx512f-fnmsub-sf-zmm-7.c: Ditto.
* gcc.target/i386/avx512vl-broadcast-pr87767-1.c: Ditto.
* gcc.target/i386/avx512vl-broadcast-pr87767-5.c: Ditto.
* gcc.target/i386/pr91333.c: Ditto.
* gcc.target/i386/vect-strided-4.c: Ditto.
|
|
late_combine will combine lshiftrt + zero_extend into *lshifrtsi3_1_zext,
which causes an extra mov between gpr and kmask; add ?k to the pattern.
gcc/ChangeLog:
PR target/115610
* config/i386/i386.md (*<insn>si3_zext): Add alternative ?k,
enable it only for lshiftrt and under avx512bw.
* config/i386/sse.md (*klshrsi3_1_zext): New define_insn, and
add corresponding define_split after it.
|
|
The testcases are supposed to scan for vpopcnt{b,w,d,q} operations
with a k mask, but the mask is defined as an uninitialized local
variable, which is set to 0 at RTL expand time.
The masking is then simplified away by late_combine, which caused the
scan-assembler failures.
Move the definition of the mask outside the function to make the
testcases more stable.
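A sketch mirroring the described change (hypothetical testcase shape,
assuming the AVX512BITALG intrinsics; the function name is made up):
#include <immintrin.h>
extern __mmask64 msk64;  /* was an uninitialized local, folded to 0 */
__m512i
popcnt_masked (__m512i src, __m512i a)
{
  return _mm512_mask_popcnt_epi8 (src, msk64, a);  /* keeps the masked vpopcntb */
}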
gcc/testsuite/ChangeLog:
PR target/115610
* gcc.target/i386/avx512bitalg-vpopcntb.c: Define mask as
extern instead of an uninitialized local variable.
* gcc.target/i386/avx512bitalg-vpopcntbvl.c: Ditto.
* gcc.target/i386/avx512bitalg-vpopcntw.c: Ditto.
* gcc.target/i386/avx512bitalg-vpopcntwvl.c: Ditto.
* gcc.target/i386/avx512vpopcntdq-vpopcntd.c: Ditto.
* gcc.target/i386/avx512vpopcntdq-vpopcntq.c: Ditto.
|
|
2024-06-30 John David Anglin <danglin@gcc.gnu.org>
gcc/ChangeLog:
PR target/115691
* config/pa/pa.md: Remove incorrect xmpyu patterns.
|
|
The following restricts copying of points-to info from defs that
might be in regions invoking UB and are never executed.
PR tree-optimization/115701
* tree-ssanames.cc (maybe_duplicate_ssa_info_at_copy):
Only copy info from within the same BB.
* gcc.dg/torture/pr115701.c: New testcase.
|
|
The following factors out the code that preserves SSA info of the LHS
of a SSA copy LHS = RHS when LHS is about to be eliminated to RHS.
PR tree-optimization/115701
* tree-ssanames.h (maybe_duplicate_ssa_info_at_copy): Declare.
* tree-ssanames.cc (maybe_duplicate_ssa_info_at_copy): New
function, split out from ...
* tree-ssa-copy.cc (fini_copy_prop): ... here.
* tree-ssa-sccvn.cc (eliminate_dom_walker::eliminate_stmt): ...
and here.
|
|
The following makes sure that for SLP reductions all lanes have
the same STMT_VINFO_REDUC_IDX. Once we move that info and can adjust
it we can implement swapping. It also makes the existing protection
against operand swapping trigger for all stmts participating in a
reduction, not just the final one marked as reduction-def.
* tree-vect-slp.cc (vect_build_slp_tree_1): Compare
STMT_VINFO_REDUC_IDX.
(vect_build_slp_tree_2): Prevent operand swapping for
all stmts participating in a reduction.
|
|
The input vectype of a reduction PHI statement must be determined before
vect cost computation for the reduction. Since a lane-reducing operation
has a different input vectype from a normal one, we need to traverse all
reduction statements to find the input vectype with the least lanes, and
set that on the PHI statement.
2024-06-16 Feng Xue <fxue@os.amperecomputing.com>
gcc/
* tree-vect-loop.cc (vectorizable_reduction): Determine input vectype
during traversal of reduction statements.
|
|
Allow shift-by-induction for an SLP node when it is single-lane, which
is aligned with the original loop-based handling.
2024-06-26 Feng Xue <fxue@os.amperecomputing.com>
gcc/
* tree-vect-stmts.cc (vectorizable_shift): Allow shift-by-induction
for single-lane slp node.
gcc/testsuite/
* gcc.dg/vect/vect-shift-6.c: New testcase.
* gcc.dg/vect/vect-shift-7.c: New testcase.
|
|
Use INT_MIN rather than -1 in `comparison_qty' where a comparison is not
with a register, because the value of -1 is actually a valid reference
to register 0 in the case where it has not been assigned a quantity.
Using -1 makes the `REG_QTY (REGNO (folded_arg1)) == ent->comparison_qty'
comparison in `fold_rtx' incorrectly trigger in rare circumstances
and return true for a memory reference, making CSE consider a comparison
operation to evaluate to a constant expression and consequently make the
resulting code incorrectly execute or fail to execute conditional
blocks.
This has caused a miscompilation of rwlock.c from LinuxThreads for the
`alpha-linux-gnu' target, where `rwlock->__rw_writer != thread_self ()'
expression (where `thread_self' returns the thread pointer via a PALcode
call) has been decided to be always true (with `ent->comparison_qty'
using -1 for a reference to `rwlock->__rw_writer', while register 0
holding the thread pointer retrieved by `thread_self') and code for the
false case has been optimized away where it mustn't have, causing
program lockups.
The issue has been observed as a regression from commit 08a692679fb8
("Undefined cse.c behaviour causes 3.4 regression on HPUX"),
<https://gcc.gnu.org/ml/gcc-patches/2004-10/msg02027.html>, and up to
commit 932ad4d9b550 ("Make CSE path following use the CFG"),
<https://gcc.gnu.org/ml/gcc-patches/2006-12/msg00431.html>, where CSE
has been restructured sufficiently for the issue not to trigger with the
original reproducer anymore. However, the original bug remains and can
trigger, because `comparison_qty' will still be assigned -1 for a memory
reference and the `reg_qty' member of a `cse_reg_info_table' entry will
still be assigned -1 for register 0 where the entry has not been
assigned a quantity, e.g. at initialization.
Use INT_MIN then, as noted above, so that the value remains negative, for
consistency with the REGNO_QTY_VALID_P macro (even though it is not used
on `comparison_qty'), and so that it can never match a valid negated
register number, fixing the regression with commit 08a692679fb8.
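A sketch of the shape of the fix in record_jump_cond (illustrative only,
not the exact cse.cc code):
if (REG_P (op1))
  ent->comparison_qty = REG_QTY (REGNO (op1));
else
  ent->comparison_qty = INT_MIN;  /* was -1, which collides with the
                                     unassigned reg_qty of register 0 */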
gcc/
PR rtl-optimization/115565
* cse.cc (record_jump_cond): Use INT_MIN rather than -1 for
`comparison_qty' if !REG_P.
|
|
I hadn't updated my repo on the host where I handle email, so it picked
up the older version of this patch without the testsuite fix. So, V4
with the testsuite option for lmul fixed.
--
And Sergei's movmem patch. Just trivial testsuite adjustment for an
option name change and a whitespace fix from me.
I've spun this in my tester for rv32 and rv64. I'll wait for pre-commit
CI before taking further action.
Just a reminder, this patch is designed to handle the case where we can
issue a single vector load/store which avoids all the complexities of
determining which direction to copy.
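A sketch of the kind of copy covered (hypothetical example; the function
name is made up): a small fixed-size memmove that fits in a single vector
load/store pair, so no direction check is needed.
#include <string.h>
void
move16 (char *dst, const char *src)
{
  memmove (dst, src, 16);  /* expands via movmem<mode> to one load + one store */
}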
--
gcc/ChangeLog
* config/riscv/riscv.md (movmem<mode>): New expander.
gcc/testsuite/ChangeLog
PR target/112109
* gcc.target/riscv/rvv/base/movmem-1.c: New test.
|
|
gcc/fortran/ChangeLog:
PR fortran/114019
* trans-stmt.cc (gfc_trans_allocate): Fix handling of case of
scalar character expression being used for SOURCE.
gcc/testsuite/ChangeLog:
PR fortran/114019
* gfortran.dg/allocate_with_source_33.f90: New test.
|
|
This patch adds support for the form of unsigned scalar .SAT_ADD
where one of the operands is an immediate. For example, as below:
Form IMM:
#define DEF_SAT_U_ADD_IMM_FMT_1(T) \
T __attribute__((noinline)) \
sat_u_add_imm_##T##_fmt_1 (T x) \
{ \
return (T)(x + 9) >= x ? (x + 9) : -1; \
}
DEF_SAT_U_ADD_IMM_FMT_1(uint64_t)
Before this patch:
__attribute__((noinline))
uint64_t sat_u_add_imm_uint64_t_fmt_1 (uint64_t x)
{
long unsigned int _1;
uint64_t _3;
;; basic block 2, loop depth 0
;; pred: ENTRY
_1 = MIN_EXPR <x_2(D), 18446744073709551606>;
_3 = _1 + 9;
return _3;
;; succ: EXIT
}
After this patch:
__attribute__((noinline))
uint64_t sat_u_add_imm_uint64_t_fmt_1 (uint64_t x)
{
uint64_t _3;
;; basic block 2, loop depth 0
;; pred: ENTRY
_3 = .SAT_ADD (x_2(D), 9); [tail call]
return _3;
;; succ: EXIT
}
The below test suites are passed for this patch:
1. The rv64gcv fully regression test with newlib.
2. The x86 bootstrap test.
3. The x86 fully regression test.
gcc/ChangeLog:
* match.pd: Add imm form for .SAT_ADD matching.
* tree-ssa-math-opts.cc (math_opts_dom_walker::after_dom_children):
Add .SAT_ADD matching under PLUS_EXPR.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
r15-1699-g445c62ee492 contains changes that trigger two maybe-uninitialized
warnings on Darwin, which result in a bootstrap failure.
Note that the warnings are false positives; in fact, the variables are
always initialized in the cases of a switch (all values of the switch
condition are covered).
Fixed here by providing default initializations for the relevant variables.
gcc/jit/ChangeLog:
* jit-recording.cc
(recording::memento_of_typeinfo::make_debug_string): Default the value
of ident.
(recording::memento_of_typeinfo::write_reproducer): Default the value
of type.
Signed-off-by: Iain Sandoe <iains@gcc.gnu.org>
|
|
So the recent IRA change exposed a bug in the mcore backend.
The mcore has a special instruction (xtrb3) which can zero extend a GPR into
R1. It's useful because zextb requires a matching source/destination.
Unfortunately xtrb3 modifies CC.
The IRA changes twiddle register allocation such that we want to use xtrb3.
Unfortunately CC is live at the point where we want to use xtrb3 and clobbering
CC causes the test to fail.
Exposing the clobber in the expander and insn seems like the best path forward.
We could also drop the xtrb3 alternative, but that seems like it would hurt
codegen more than exposing the clobber.
The bitfield extraction patterns using xtrb look problematic as well, but I
didn't try to fix those.
This fixes the builtin-arith-overflow regressions and appears to fix
20010122-1.c as a side effect.
gcc/
* config/mcore/mcore.md (zero_extendqihi2): Clobber CC in expander
and matching insn.
(zero_extendqisi2): Likewise.
|
|
Here we notice the 'this' conversion for the call f<void>() is bad, so
we correctly defer deduction for the template candidate, but we end up
never adding it to 'bad_cands' since missing_conversion_p for it returns
false (its only argument is 'this' which has already been determined to
be bad). This is not a huge deal, but it causes us to no longer accept the
call with -fpermissive in release builds, and causes a tree-checking ICE in
checking builds.
So if we have a non-strictly viable template candidate that has not been
instantiated, then we need to add it to 'bad_cands' even if no argument
conversion is missing.
PR c++/106760
gcc/cp/ChangeLog:
* call.cc (add_candidates): Relax test for adding a candidate
to 'bad_cands' to also accept an uninstantiated template candidate
that has no missing conversions.
gcc/testsuite/ChangeLog:
* g++.dg/ext/conv3.C: New test.
Reviewed-by: Jason Merrill <jason@redhat.com>
|
|
Allow the ssa_lazy_cache to allocate bitmaps from a client-provided
obstack if so desired.
* gimple-range-cache.cc (ssa_lazy_cache::ssa_lazy_cache): Relocate here.
Check for provided obstack.
(ssa_lazy_cache::~ssa_lazy_cache): Relocate here. Free bitmap or obstack.
* gimple-range-cache.h (ssa_lazy_cache::ssa_lazy_cache): Move.
(ssa_lazy_cache::~ssa_lazy_cache): Move.
(ssa_lazy_cache::m_ob): New.
* gimple-range.cc (dom_ranger::dom_ranger): Initialize obstack.
(dom_ranger::~dom_ranger): Release obstack.
(dom_ranger::pre_bb): Create ssa_lazy_cache using obstack.
* gimple-range.h (m_bitmaps): New.
|
|
Remove extra assignment, extra temp variable and variable shadowing.
No functional changes intended.
gcc/ChangeLog:
* config/i386/i386-expand.cc (ix86_expand_move): Remove extra
assignment to tmp variable, reuse tmp variable instead of
declaring new temporary variable and remove tmp variable shadowing.
|
|
Using auto_vec rather than vec means the vectors are released
automatically upon return, stopping the leak. The problem seems to be
that auto_vec<T, N> is not really move-aware; only the <T, 0>
specialization is.
gcc/ChangeLog:
* tree-profile.cc (find_conditions): Use auto_vec without
embedded storage.
|
|
The following addresses the corner case of an outer loop with an empty
header, where we would end up asking for the BB of a NULL stmt, by
special-casing this situation.
PR tree-optimization/115652
* tree-vect-slp.cc (vect_schedule_slp_node): Handle the case
where the outer loop header block is empty.
|
|
ix86_GOT_alias_set and PE_COFF_LEGITIMIZE_EXTERN_DECL [PR115635]
This patch fixes 3 bugs reported after merging the "Add DLL
import/export implementation to AArch64" series.
https://gcc.gnu.org/pipermail/gcc-patches/2024-June/653955.html The
series refactors the i386 codebase to reuse it in AArch64, which
triggers some bugs.
Bug 115661 - [15 Regression] wrong code at -O{2,3} on x86_64-linux-gnu
since r15-1599-g63512c72df09b4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115661
Bug 115635 - [15 regression] Bootstrap fails with failed self-test
with the rust fe (diagnostic-path.cc:1153: test_empty_path: FAIL:
ASSERT_FALSE ((path.interprocedural_p ()))) since
r15-1599-g63512c72df09b4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115635
Issue 1. In some code, i386 relies on calling legitimize_pe_coff_symbol
on all platforms, so it should return NULL_RTX if PECOFF is not
supported.
Fix: NULL_RTX handling has been added when the target does not support
PECOFF.
Issue 2. ix86_GOT_alias_set is used on all platforms and cannot be
extracted to mingw.
Fix: ix86_GOT_alias_set has been restored as it was and is used on all
platforms for i386.
Bug 115643 - [15 regression] aarch64-w64-mingw32 support today breaks
x86_64-w64-mingw32 build cannot represent relocation type BFD_RELOC_64
since r15-1602-ged20feebd9ea31
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115643
Issue 3. PE_COFF_EXTERN_DECL_SHOULD_BE_LEGITIMIZED has been added and
used with a negative operator for a complex expression without braces.
Fix: Braces have been added, and
PE_COFF_EXTERN_DECL_SHOULD_BE_LEGITIMIZED has been renamed to
PE_COFF_LEGITIMIZE_EXTERN_DECL.
2024-06-28 Evgeny Karpov <Evgeny.Karpov@microsoft.com>
gcc/ChangeLog:
PR bootstrap/115635
PR target/115643
PR target/115661
* config/aarch64/cygming.h
(PE_COFF_EXTERN_DECL_SHOULD_BE_LEGITIMIZED): Rename to
PE_COFF_LEGITIMIZE_EXTERN_DECL.
(PE_COFF_LEGITIMIZE_EXTERN_DECL): Likewise.
* config/i386/cygming.h (GOT_ALIAS_SET): Remove the definition to
reuse the one from i386.h.
(PE_COFF_EXTERN_DECL_SHOULD_BE_LEGITIMIZED): Rename to
PE_COFF_LEGITIMIZE_EXTERN_DECL.
(PE_COFF_LEGITIMIZE_EXTERN_DECL): Likewise.
* config/i386/i386-expand.cc (ix86_expand_move): Return
ix86_GOT_alias_set.
* config/i386/i386-expand.h (ix86_GOT_alias_set): Likewise.
* config/i386/i386.cc (ix86_GOT_alias_set): Likewise.
* config/i386/i386.h (GOT_ALIAS_SET): Likewise.
* config/mingw/winnt-dll.cc (get_dllimport_decl): Use
GOT_ALIAS_SET.
(legitimize_pe_coff_symbol): Rename to
PE_COFF_LEGITIMIZE_EXTERN_DECL.
* config/mingw/winnt-dll.h (ix86_GOT_alias_set): Declare
ix86_GOT_alias_set.
|
|
gcc/ChangeLog:
* range-op-ptr.cc (class hybrid_and_operator): Remove.
(class hybrid_or_operator): Same.
(class hybrid_min_operator): Same.
(class hybrid_max_operator): Same.
|
|
The following fixes wrong-code when using outer loop vectorization
and an inner loop SLP access with permutation. A wrong adjustment
to the IV increment is then applied on GCN.
PR tree-optimization/115640
* tree-vect-stmts.cc (vectorizable_load): With an inner
loop SLP access do not apply a gap adjustment.
|
|
There was an off-by-one error in the RDNA validation check, plus I forgot to
allow for two-to-one permute-and-merge operations.
PR target/115640
gcc/ChangeLog:
* config/gcn/gcn.cc (gcn_vectorize_vec_perm_const): Modify RDNA checks.
|
|
First step toward adding a general routine that assigns all of a class
type's data members. Having a general routine prevents forgetting to tackle the
edge cases, e.g. setting _len.
gcc/fortran/ChangeLog:
* trans-expr.cc (gfc_class_set_vptr): Add setting of _vptr
member.
* trans-intrinsic.cc (conv_intrinsic_move_alloc): First use
of gfc_class_set_vptr and refactor very similar code.
* trans.h (gfc_class_set_vptr): Declare the new function.
gcc/testsuite/ChangeLog:
* gfortran.dg/unlimited_polymorphic_11.f90: Remove unnecessary
casts in gd-final expression.
|
|
The vptr for a class type is set in various ways in different
locations. Refactor the use and simplify code.
gcc/fortran/ChangeLog:
* trans-array.cc (structure_alloc_comps): Use reset_vptr.
* trans-decl.cc (gfc_trans_deferred_vars): Same.
(gfc_generate_function_code): Same.
* trans-expr.cc (gfc_reset_vptr): Allow supplying the class
type.
(gfc_conv_procedure_call): Use reset_vptr.
* trans-intrinsic.cc (gfc_conv_intrinsic_transfer): Same.
|
|
This patch generalizes some of the patterns in i386.md that recognize
double word concatenation, so they handle sign_extend the same way that
they handle zero_extend in appropriate contexts.
As a motivating example consider the following function:
__int128 foo(long long x, unsigned long long y)
{
return ((__int128)x<<64) | y;
}
when compiled with -O2, x86_64 currently generates:
foo: movq %rdi, %rdx
xorl %eax, %eax
xorl %edi, %edi
orq %rsi, %rax
orq %rdi, %rdx
ret
with this patch we now generate (the same as if x is unsigned):
foo: movq %rsi, %rax
movq %rdi, %rdx
ret
Treating both extensions the same way using any_extend is valid as
the top (extended) bits are "unused" after the shift by 64 (or more).
In theory, the RTL optimizers might consider canonicalizing the form
of extension used in these cases, but zero_extend is faster on some
machines, whereas sign extension is supported via addressing modes on
others, so handling both in the machine description is probably best.
2024-06-28 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386.md (*concat<mode><dwi>3_3): Change zero_extend
to any_extend in first operand to left shift by mode precision.
(*concat<mode><dwi>3_4): Likewise.
(*concat<mode><dwi>3_6): Likewise.
gcc/testsuite/ChangeLog
* gcc.target/i386/concatditi-1.c: New test case.
|
|
This patch is another round of refinements to fine tune the new ternlog
infrastructure in i386's sse.md. This patch tweaks ix86_ternlog_idx
to allow multiple MEM/CONST_VECTOR/VEC_DUPLICATE operands prior to
splitting (before reload), when force_register is called on all but
one of these operands. Conceptually during the dynamic programming,
registers fill the args slots in the order 0, 1, 2, and mem-like
operands fill the slots in the order 2, 0, 1 [preferring the memory
operand to come last].
This patch allows us to remove some of the legacy ternlog patterns
in sse.md without regressions [which is left to the next and final
patch in this series]. An indication that these patterns are no
longer required is shown by the necessary testsuite tweaks below,
where the assembler output for the legacy instructions used hexadecimal,
while the new ternlog infrastructure now consistently uses decimal.
2024-06-28 Roger Sayle <roger@nextmovesoftware.com>
gcc/ChangeLog
* config/i386/i386-expand.cc (ix86_ternlog_idx) <case VEC_DUPLICATE>:
Add a "goto do_mem_operand" as this need not match memory_operand.
<case CONST_VECTOR>: Only args[2] may be volatile memory operand.
Allow MEM/VEC_DUPLICATE/CONST_VECTOR as args[0] and args[1].
gcc/testsuite/ChangeLog
* gcc.target/i386/avx512f-andn-di-zmm-2.c: Match decimal instead
of hexadecimal immediate operand to ternlog.
* gcc.target/i386/avx512f-andn-si-zmm-2.c: Likewise.
* gcc.target/i386/avx512f-orn-si-zmm-1.c: Likewise.
* gcc.target/i386/avx512f-orn-si-zmm-2.c: Likewise.
* gcc.target/i386/pr100711-3.c: Likewise.
* gcc.target/i386/pr100711-4.c: Likewise.
* gcc.target/i386/pr100711-5.c: Likewise.
|
|
gcc/jit/ChangeLog:
* docs/topics/compatibility.rst (LIBGCCJIT_ABI_28): New ABI tag.
* docs/topics/expressions.rst: Document gcc_jit_context_new_alignof.
* jit-playback.cc (new_alignof): New method.
* jit-playback.h: New method.
* jit-recording.cc (recording::context::new_alignof): New
method.
(recording::memento_of_sizeof::replay_into,
recording::memento_of_typeinfo::replay_into,
recording::memento_of_sizeof::make_debug_string,
recording::memento_of_typeinfo::make_debug_string,
recording::memento_of_sizeof::write_reproducer,
recording::memento_of_typeinfo::write_reproducer): Rename.
* jit-recording.h (enum type_info_type): New enum.
(class memento_of_sizeof, class memento_of_typeinfo): Rename.
* libgccjit.cc (gcc_jit_context_new_alignof): New function.
* libgccjit.h (gcc_jit_context_new_alignof): New function.
* libgccjit.map: New function.
gcc/testsuite/ChangeLog:
* jit.dg/all-non-failing-tests.h: New test.
* jit.dg/test-alignof.c: New test.
|
|
Add an explicit error message when C99's static is
used without a size expression in an array declarator.
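For illustration (C99 requires the size expression together with static
in an array declarator; the declarations are hypothetical):
void ok (int a[static 10]);  /* valid: a points to at least 10 ints */
void bad (int a[static]);    /* invalid: now diagnosed explicitly */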
gcc/c:
* c-parser.cc (c_parser_direct_declarator_inner): Add
error message.
gcc/testsuite:
* gcc.dg/c99-arraydecl-4.c: New test.
|
|
late-combine relies on df, which for -O0 is only initialised late
(pass_df_initialize_no_opt, after split1). Other df-based passes
cope with this by requiring optimize > 0, so this patch does the
same for late-combine.
gcc/
PR rtl-optimization/115677
* late-combine.cc (pass_late_combine::gate): New function.
|
|
An explicit check for address registers was not required so far since
during register allocation the processing of address constraints was
sufficient. However, address constraints themselves do not check for
REGNO_OK_FOR_{BASE,INDEX}_P. Thus, with the newly introduced
late-combine pass in r15-1579-g792f97b44ffc5e we generate new insns with
invalid address registers which aren't fixed up afterwards.
Fixed by explicitly checking for address registers in
s390_decompose_addrstyle_without_index such that those new insns are
rejected.
gcc/ChangeLog:
PR target/115634
* config/s390/s390.cc (s390_decompose_addrstyle_without_index):
Check for ADDR_REGS in s390_decompose_addrstyle_without_index.
|
|
The following avoids associating a reduction path as that might
get STMT_VINFO_REDUC_IDX out-of-sync with the SLP operand order.
This is a latent issue with SLP reductions but now easily exposed
as we're doing single-lane SLP reductions.
Once we have achieved SLP-only we can move and update this meta-data.
PR tree-optimization/115669
* tree-vect-slp.cc (vect_build_slp_tree_2): Do not reassociate
chains that participate in a reduction.
* gcc.dg/vect/pr115669.c: New testcase.
|