|
When we apply a divmod pattern, this will break reductions by introducing
multiple uses of the reduction var, so avoid this pattern in reductions.
* tree-vect-patterns.cc (vect_recog_divmod_pattern): Avoid the
pattern for stmts participating in a reduction.
|
|
The following patch adds a readelf fallback if neither objdump nor otool
exist. All of GNU binutils readelf, eu-readelf and llvm-readelf can
handle it with those options.
2025-08-28 Jakub Jelinek <jakub@redhat.com>
PR debug/119367
* configure.ac (gcc_cv_as_leb128): Add fallback using readelf.
Grammar fix in comment.
* configure: Regenerate.
|
|
possible [PR119367]
In the usual case we use .loc directives and don't emit the line table
manually, and the assembler usually uses DW_LNS_advance_pc, which has a
uleb128 argument and in most cases will have just a single byte operand.
But if we do emit the table for whatever reason (old or buggy assembler or
the -gno-as-loc{,view}-support option), we use DW_LNS_fixed_advance_pc
instead, which has a fixed 2 byte operand. That is both wasteful
in the usual case of very small advances, and more importantly will
just result in assembler errors if we need to advance over more than 65535
bytes.
The following patch uses DW_LNS_advance_pc instead if assembler supports
.uleb128 directive with a difference of two labels in the same section.
This is only possible if Minimum Instruction Length in the .debug_line
header is 1 (otherwise DW_LNS_advance_pc operand is multiplied by that
value and DW_LNS_fixed_advance_pc is not), but we emit 1 for that
on all targets.
Looking at dwarf2out.o (from dwarf2out.cc with this patch)
compiled with compilers before/after this change with additional -fpic
-gno-as-loc{,view}-support options, I see the .debug_line section shrink
from 878067 bytes to 773381 bytes, i.e. by 12%.
Admittedly the gas-generated .debug_line is even smaller, 501374 bytes (with
-fpic and without the -gno-as-loc{,view}-support options).
2025-08-28 Jakub Jelinek <jakub@redhat.com>
PR debug/119367
* dwarf2out.cc (output_one_line_info_table) <case LI_adv_address>: If
HAVE_AS_LEB128, use DW_LNS_advance_pc with dw2_asm_output_delta_uleb128
instead of DW_LNS_fixed_advance_pc with dw2_asm_output_delta.
|
|
2025-08-28 Paul Thomas <pault@gcc.gnu.org>
gcc/fortran
PR fortran/82843
* intrinsic.cc (gfc_convert_type_warn): If the 'from_ts' is a
PDT instance, copy the derived type to the target ts.
* resolve.cc (gfc_resolve_ref): A PDT component in a component
reference can be that of the pdt_template. Unconditionally use the
component of the PDT instance to ensure that the backend_decl
is set during translation. Likewise, if a component is
encountered that is of PDT template type, use the component
parameters to convert to the correct PDT instance.
gcc/testsuite/
PR fortran/82843
* gfortran.dg/pdt_40.f03: New test.
|
|
2025-08-28 Paul Thomas <pault@gcc.gnu.org>
gcc/fortran
PR fortran/82205
* decl.cc (gfc_get_pdt_instance): Copy the default initializer
for components that are not PDT parameters or parameterized. If
any component is a pointer or allocatable set the attributes
'pointer_comp' or 'alloc_comp' of the new PDT instance.
* primary.cc (gfc_match_rvalue): Implement the correct form of
PDT constructors with 'name (type parms)(component values)'.
* trans-array.cc (structure_alloc_comps): Apply scalar default
initializers. Array initializers await the coming change in PDT
representation.
* trans-io.cc (transfer_expr): Do not output the type parms of
a PDT in list directed output.
gcc/testsuite/
PR fortran/82205
* gfortran.dg/pdt_22.f03: Use the correct form of PDT constructors.
* gfortran.dg/pdt_23.f03: Likewise.
* gfortran.dg/pdt_3.f03: Likewise.
|
|
So yet another testsuite hygiene patch. This time turning XPASS -> PASS. My
tester treats those cases the same so I didn't get notified that nozicond-2.c
was passing after some recent changes.
This removes the xfail marker on that test and thus the test is expected to
pass now.
Pushing to the trunk momentarily.
gcc/testsuite/
* gcc.target/riscv/nozicond-2.c: Remove xfails.
|
|
PR fortran/114611
gcc/fortran/ChangeLog:
* io.cc: Issue an error on use of the H descriptor in
a format with -std=f95 or higher. Otherwise, issue a
warning.
gcc/testsuite/ChangeLog:
* gfortran.dg/aliasing_dummy_1.f90: Accommodate errors
and warnings as needed.
* gfortran.dg/eoshift_8.f90: Likewise.
* gfortran.dg/g77/f77-edit-h-out.f: Likewise.
* gfortran.dg/hollerith_1.f90: Likewise.
* gfortran.dg/io_constraints_1.f90: Likewise.
* gfortran.dg/io_constraints_2.f90: Likewise.
* gfortran.dg/longline.f: Likewise.
* gfortran.dg/pr20086.f90: Likewise.
* gfortran.dg/unused_artificial_dummies_1.f90: Likewise.
* gfortran.dg/x_slash_1.f: Likewise.
|
|
r16-2648-gaebbc90d8c7c70 had a copy-and-pasto where
the second statement was supposed to set
operand 1 of the phi but was setting operand 0 instead.
This fixes the typo.
Pushed as obvious after a quick build test for x86_64-linux-gnu.
PR tree-optimization/121695
gcc/ChangeLog:
* tree-if-conv.cc (factor_out_operators): Fix typo
in assignment of the phi.
gcc/testsuite/ChangeLog:
* gcc.dg/torture/pr121695-1.c: New test.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
|
|
Fix type and remove useless DejaGnu directives.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmul-run-1-f64.c: Fix type.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfrdiv-run-1-f32.c: Remove
useless dg directives.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfrdiv-run-1-f64.c: Likewise.
|
|
The following removes trivially dead code.
* tree-vect-loop.cc (vect_transform_cycle_phi): Remove
unused reduc_stmt_info.
|
|
I got the cpp_warn on __STDCPP_FLOAT*_T__ for the case where we aren't
predefining those macros wrong, so e.g. on powerpc64le we don't diagnose
#undef __STDCPP_FLOAT16_T__.
I had added it as an else if on the
if (c_dialect_cxx () && cxx_dialect > cxx20 && !floatn_nx_types[i].extended)
condition, which means that when a target supports some extended type
like _Float32x, cpp_warn is called on __STDCPP_FLOAT32_T__ (even though,
since it supported _Float32 as well, it already did cpp_define_warn
(pfile, "__STDCPP_FLOAT32_T__=1") earlier).
On targets where the types aren't supported the earlier
if (FLOATN_NX_TYPE_NODE (i) == NULL_TREE) continue;
path is taken.
This patch fixes it to cpp_warn on the non-extended types for C++23
if the target doesn't support them and cpp_define_warn as before if it does.
2025-08-27 Jakub Jelinek <jakub@redhat.com>
PR target/121520
* c-cppbuiltin.cc (c_cpp_builtins): Properly call cpp_warn
for __STDCPP_FLOAT<NN>_T__ if FLOATN_NX_TYPE_NODE (i) is NULL
for C++23 for non-extended types and don't call cpp_warn for
extended types.
|
|
The following adjusts the SLP build for only-live stmts to consider
not only vect_induction_def and vect_internal_def defs that are not
part of a reduction, but all non-reduction defs, specifically in this
case a recurrence def; see the sketch below. This is also a missed
optimization on the gcc-15 branch (but IMO a very minor one).
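A hypothetical shape combining the two (not the committed testcase):
x is a first-order recurrence and last is an only-live def that SLP
discovery previously skipped:
int
f (int *a, int n)
{
  int x = 0, last = 0;
  for (int i = 0; i < n; i++)
    {
      last = x;    /* only used after the loop: an only-live def */
      x = a[i];    /* first-order recurrence */
    }
  return last;
}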
PR tree-optimization/121686
* tree-vect-slp.cc (vect_analyze_slp): Consider all only-live
non-reduction defs for discovery.
* gcc.dg/vect/pr121686.c: New testcase.
|
|
The problem here is that after r16-101, the 2 functions containing alloca/VLA
started to be cloned, and then the un-VLA transformation happens in
using_vararray, so the test is no longer testing what it should be testing.
The obvious fix is to mark using_vararray and using_alloca as noclone too;
a sketch of the shape follows.
Pushed as obvious after a quick test to make sure it is now working.
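A minimal sketch of the fix, with a hypothetical helper (the real test's
functions differ):
extern void use (char *, int);

/* noinline alone no longer suffices: a constant-propagated clone could
   have its alloca/VLA turned into an ordinary array, defeating the
   hwasan check.  */
__attribute__((noinline, noclone)) void
using_alloca (int n)
{
  char *p = __builtin_alloca (n);
  use (p, n);
}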
gcc/testsuite/ChangeLog:
PR testsuite/121684
* c-c++-common/hwasan/unprotected-allocas-0.c: Mark
using_vararray and using_alloca as noclone too.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
|
|
For a basic block with only a debug marker:
(note 3 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(note 2 3 5 2 NOTE_INSN_FUNCTION_BEG)
(debug_insn 5 2 16 2 (debug_marker) "x.c":6:3 -1 (nil))
emit the TLS call after the debug marker.
gcc/
PR target/121668
* config/i386/i386-features.cc (ix86_emit_tls_call): Emit the
TLS call after debug marker.
gcc/testsuite/
PR target/121668
* gcc.target/i386/pr121668-1a.c: New test.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Move pr121656.c to gcc.dg/torture and replace weak attribute with noipa
attribute. Verified by reverting
56ca14c4c4f Fix invalid right shift count with recent ifcvt changes
to trigger
FAIL: gcc.dg/torture/pr121656.c -O1 execution test
FAIL: gcc.dg/torture/pr121656.c -O2 execution test
FAIL: gcc.dg/torture/pr121656.c -O3 -g execution test
FAIL: gcc.dg/torture/pr121656.c -O2 -flto -fno-use-linker-plugin -flto-partition=none execution test
on Linux/x86-64.
PR tree-optimization/121656
* gcc.dg/pr121656.c: Moved to ...
* gcc.dg/torture/pr121656.c: Here.
(dg-options): Removed.
(foo): Replace weak attribute with noipa attribute.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
More testsuite hygiene. Some of the thead tests are expecting to find
xtheadvdot in the extension set, but it's not defined as a valid extension
anywhere. I'm just removing xtheadvdot. Someone more familiar with these
cores can add it back properly if they're so inclined.
Second, there's a space after the zifencei in a couple of the thead arch
strings. Naturally that causes failures as well. That's a trivial fix, just
remove the bogus whitespace.
That gets us clean on riscv.exp on the pioneer system.
The pioneer is happy, as is riscv32-elf and riscv64-elf. Pushing to the trunk.
gcc/
* config/riscv/riscv-cores.def (xt-c908v): Drop xtheadvdot.
(xt-c910v2): Remove extraneous whitespace.
(xt-c920v2): Drop xtheadvdot and remove extraneous whitespace.
gcc/testsuite/
* gcc.target/riscv/mcpu-xt-c908v.c: Drop xtheadvdot.
* gcc.target/riscv/mcpu-xt-c920v2.c: Drop xtheadvdot.
|
|
As noted in the issue, the C++ front end has deeper problems: it's
supposed to do the name lookup of the variant at the call site but is
instead doing it when parsing the "declare variant" construct, before
registering the decl for the base function. The C++ part of the
patch is a band-aid to catch the case where there is a previous declaration
of the function, so that it doesn't give an undefined symbol error instead.
A real solution ought to be included as part of fixing PR118791.
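A minimal sketch of the now-diagnosed shape (hypothetical; compiled with
-fopenmp), where the variant clause names the base function itself:
int foo (int x);

#pragma omp declare variant (foo) match (construct={parallel})
int foo (int x) { return x; }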
gcc/c/
PR middle-end/118839
* c-parser.cc (c_finish_omp_declare_variant): Error if variant
is the same as base.
gcc/cp/
PR middle-end/118839
* decl.cc (omp_declare_variant_finalize_one): Error if variant
is the same as base.
gcc/fortran/
PR middle-end/118839
* trans-openmp.cc (gfc_trans_omp_declare_variant): Error if variant
is the same as base.
gcc/testsuite/
PR middle-end/118839
* gcc.dg/gomp/declare-variant-3.c: New.
* gfortran.dg/gomp/declare-variant-22.f90: New.
|
|
This patch fixes a number of problems with parser error checking of
"declare variant", especially in the C front end.
The new C testcase unprototyped-variant.c added by this patch used to
ICE when gimplifying the call site, at least in part because the
variant was being recorded even after it was diagnosed as invalid.
There was also a large block of dead code in the C front end that was
supposed to fix up an unprototyped declaration of a variant function
to match the base function declaration, that was never executed because
it was nested in a conditional that could never be true. I've fixed those
problems by rearranging the code and only recording the variant if it
passes the correctness checks. I also tried to add some comments and
re-work some particularly confusing bits of code, so that it's easier to
understand.
The OpenMP specification doesn't say what the behavior of "declare
variant" with the "append_args" clause should be when the base
function is unprototyped. The additional arguments are supposed to be
inserted between the last fixed argument of the base function and any
varargs, but without a prototype, for any given call we have no idea
which arguments are fixed and which are varargs, and therefore no idea
where to insert the additional arguments. This used to trigger some
other diagnostics (which one depending on whether the variant was also
unprototyped), but I thought it was better to just reject this with an
explicit "sorry".
Finally, I also observed that a missing "match" clause was only
rejected if "append_args" or "adjust_args" was present. Per the spec,
"match" has the "required" property, so if it's missing it should be
diagnosed unconditionally. The C++ and Fortran front ends had the same
issue so I fixed this one there too.
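A sketch of the case that now gets the explicit sorry (hypothetical code;
clause spelling per my reading of the spec, not the committed test):
#include <omp.h>

void variant_fn (int x, omp_interop_t i);

#pragma omp declare variant (variant_fn) \
    match (construct={dispatch}) append_args (interop (targetsync))
void base_fn ();   /* unprototyped: () is not (void) before C23, so the
                      fixed-argument/vararg split is unknowable */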
gcc/c/ChangeLog
* c-parser.cc (c_finish_omp_declare_variant): Rework diagnostic
code. Do not record variant if there are errors. Make check for
a missing "match" clause unconditional.
gcc/cp/ChangeLog
* parser.cc (cp_finish_omp_declare_variant): Structure diagnostic
code similarly to C front end. Make check for a missing "match"
clause unconditional.
gcc/fortran/ChangeLog
* openmp.cc (gfc_match_omp_declare_variant): Make check for a
missing "match" clause unconditional.
gcc/testsuite/ChangeLog
* c-c++-common/gomp/append-args-1.c: Adjust expected output.
* g++.dg/gomp/adjust-args-1.C: Likewise.
* g++.dg/gomp/adjust-args-3.C: Likewise.
* gcc.dg/gomp/adjust-args-1.c: Likewise.
* gcc.dg/gomp/append-args-1.c: Likewise.
* gcc.dg/gomp/unprototyped-variant.c: New.
* gfortran.dg/gomp/adjust-args-1.f90: Adjust expected output.
* gfortran.dg/gomp/append_args-1.f90: Likewise.
|
|
Shreya and I were working through some testsuite failures and noticed that many
of the current failures on the pioneer were just silly. We have tests that
expect to see full architecture strings in their expected output when the bulk
(some might say all) of the architecture string is irrelevant.
Worse yet, we'd have different matching lines, i.e. one that would
match rv64gc_blah_blah and another for rv64imfa_blah_blah. Judicious
wildcard usage cleans this up considerably.
This fixes ~80 failures in the riscv.exp testsuite. Pushing to the trunk as
it's happy on the pioneer native, riscv32-elf and riscv64-elf.
gcc/testsuite/
* gcc.target/riscv/arch-25.c: Use wildcards to simplify/eliminate
dg-error directives.
* gcc.target/riscv/arch-ss-2.c: Similarly.
* gcc.target/riscv/arch-zilsd-2.c: Similarly.
* gcc.target/riscv/arch-zilsd-3.c: Similarly.
|
|
The test fails to compile on 32-bit targets because the arrays are too
large. Restrict to targets where the array index type is 64-bits.
Also note the relevant PR in the test comment.
PR debug/121411
gcc/testsuite/
* gcc.dg/debug/ctf/ctf-array-7.c: Restrict to lp64,llp64
targets.
|
|
Disable sched2 and sched3 to only have one order of instructions to
consider.
gcc/testsuite/ChangeLog:
* gcc.target/arm/unsigned-extend-2.c: Disable sched2 and sched3
and update function body to match.
Signed-off-by: Torbjörn SVENSSON <torbjorn.svensson@foss.st.com>
|
|
FMA/DOT_PROD_EXPR/SAD_EXPR
The patch tries to unroll the vectorized loop when there are
FMA/DOT_PROD_EXPR/SAD_EXPR reductions; this breaks the cross-iteration
dependence and enables more parallelism (since vectorization will also
enable partial sums).
When there's a gather/scatter or scalarization in the loop, don't do the
unroll, since the performance bottleneck is not at the reduction.
The unroll factor is set according to the FMA/DOT_PROD_EXPR/SAD_EXPR
reductions as
CEIL (latency * throughput, num_of_reduction),
i.e. for FMA, latency is 4 and throughput is 2, so if there's 1 FMA for the
reduction then the unroll factor is 2 * 4 / 1 = 8.
There's also a vect_unroll_limit; the final suggested_unroll_factor is
set as MIN (vect_unroll_limit, 8).
The vect_unroll_limit is mainly for register pressure, to avoid too many
spills.
Ideally, all instructions in the vectorized loop should be used to
determine the unroll factor with their (latency * throughput) / number,
but that would be too much for this patch, and may just be GIGO, so the
patch only considers 3 kinds of instructions: FMA, DOT_PROD_EXPR and
SAD_EXPR.
Note that when DOT_PROD_EXPR is not natively supported,
m_num_reduc += 3 * count, which almost prevents unrolling.
There's a performance boost for a simple benchmark with a
DOT_PROD_EXPR/FMA chain, and a slight improvement in SPEC2017 performance.
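As a hedged illustration (hypothetical source, not one of the committed
tests): with a single FMA in the reduction chain below, the formula above
gives CEIL (4 * 2, 1) = 8, so the vectorized loop would be unrolled 8 times
(subject to vect_unroll_limit), accumulating 8 partial sums in parallel:
float
dot (const float *a, const float *b, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += a[i] * b[i];   /* one FMA in the reduction chain */
  return s;
}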
gcc/ChangeLog:
* config/i386/i386.cc (ix86_vector_costs::ix86_vector_costs):
Add new members m_num_reduc and m_prefer_unroll.
(ix86_vector_costs::add_stmt_cost): Set m_prefer_unroll and
m_num_reduc.
(ix86_vector_costs::finish_cost): Determine
m_suggested_unroll_factor with consideration of
reduc_lat_mult_thr, m_num_reduc and
ix86_vect_unroll_limit.
* config/i386/i386.h (enum ix86_reduc_unroll_factor): New
enum.
(processor_costs): Add reduc_lat_mult_thr and
vect_unroll_limit.
* config/i386/x86-tune-costs.h: Initialize
reduc_lat_mult_thr and vect_unroll_limit.
* config/i386/i386.opt: Add -param=ix86-vect-unroll-limit.
gcc/testsuite/ChangeLog:
* gcc.target/i386/vect_unroll-1.c: New test.
* gcc.target/i386/vect_unroll-2.c: New test.
* gcc.target/i386/vect_unroll-3.c: New test.
* gcc.target/i386/vect_unroll-4.c: New test.
* gcc.target/i386/vect_unroll-5.c: New test.
|
|
This pattern enables the combine pass (or late-combine, depending on the case)
to merge a vec_duplicate into a div RTL instruction. The vec_duplicate is the
dividend operand.
Before this patch, we have two instructions, e.g.:
vfmv.v.f v2,fa0
vfdiv.vv v1,v2,v1
After, we get only one:
vfrdiv.vf v1,v1,fa0
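For reference, a loop of roughly this shape produces the sequence above
(hypothetical source, not one of the committed tests):
void
f (float *x, float a, int n)
{
  for (int i = 0; i < n; i++)
    x[i] = a / x[i];   /* scalar dividend, vector divisor */
}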
gcc/ChangeLog:
* config/riscv/autovec-opt.md (*vfrdiv_vf_<mode>): Add new pattern to
combine vec_duplicate + vfdiv.vv into vfrdiv.vf.
* config/riscv/vector.md (@pred_<optab><mode>_reverse_scalar): Allow VLS
modes.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfrdiv.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop.h: Add support for reverse
variants.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop_data.h: Add data for
reverse variants.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfrdiv-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfrdiv-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfrdiv-run-1-f64.c: New test.
|
|
invariant [PR121290]
Consider the example:
void
f (int *restrict x, int *restrict y, int *restrict z, int n)
{
for (int i = 0; i < 4; ++i)
{
int res = 0;
for (int j = 0; j < 100; ++j)
res += y[j] * z[i];
x[i] = res;
}
}
we currently vectorize as
f:
movi v30.4s, 0
ldr q31, [x2]
add x2, x1, 400
.L2:
ld1r {v29.4s}, [x1], 4
mla v30.4s, v29.4s, v31.4s
cmp x2, x1
bne .L2
str q30, [x0]
ret
which is not useful because by doing outer-loop vectorization we're performing
less work per iteration than we would had we done inner-loop vectorization and
simply unrolled the inner loop.
This patch teaches the cost model that if all the leaves are invariant, then
it should adjust the loop cost by multiplying it by the VF, since every vector
iteration has at least one lane that is really just doing one scalar's worth
of work.
There are a couple of ways we could have solved this, one is to increase the
unroll factor to process more iterations of the inner loop. This removes the
need for the broadcast, however we don't support unrolling the inner loop within
the outer loop. We only support unrolling by increasing the VF, which would
affect the outer loop as well as the inner loop.
We also don't directly support costing inner-loop vs outer-loop vectorization,
and as such we're left trying to predict/steer the cost model ahead of time to
what we think should be profitable. This patch attempts to do so using a
heuristic which penalizes the outer-loop vectorization.
We now cost the loop as
note: Cost model analysis:
Vector inside of loop cost: 2000
Vector prologue cost: 4
Vector epilogue cost: 0
Scalar iteration cost: 300
Scalar outside cost: 0
Vector outside cost: 4
prologue iterations: 0
epilogue iterations: 0
missed: cost model: the vector iteration cost = 2000 divided by the scalar iteration cost = 300 is greater or equal to the vectorization factor = 4.
missed: not vectorized: vectorization not profitable.
missed: not vectorized: vector version will never be profitable.
missed: Loop costings may not be worthwhile.
And subsequently generate:
.L5:
add w4, w4, w7
ld1w z24.s, p6/z, [x0, #1, mul vl]
ld1w z23.s, p6/z, [x0, #2, mul vl]
ld1w z22.s, p6/z, [x0, #3, mul vl]
ld1w z29.s, p6/z, [x0]
mla z26.s, p6/m, z24.s, z30.s
add x0, x0, x8
mla z27.s, p6/m, z23.s, z30.s
mla z28.s, p6/m, z22.s, z30.s
mla z25.s, p6/m, z29.s, z30.s
cmp w4, w6
bls .L5
and avoids the load and replicate if it knows it has enough vector pipes to do
so.
gcc/ChangeLog:
PR target/121290
* config/aarch64/aarch64.cc
(class aarch64_vector_costs): Add m_loop_fully_scalar_dup.
(aarch64_vector_costs::add_stmt_cost): Detect invariant inner loops.
(adjust_body_cost): Adjust final costing if m_loop_fully_scalar_dup.
gcc/testsuite/ChangeLog:
PR target/121290
* gcc.target/aarch64/pr121290.c: New test.
|
|
multiply
This pattern enables the combine pass (or late-combine, depending on the case)
to merge a vec_duplicate into a mult RTL instruction.
Before this patch, we have two instructions, e.g.:
vfmv.v.f v2,fa0
vfmul.vv v1,v1,v2
After, we get only one:
vfmul.vf v2,v2,fa0
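For reference, a loop of roughly this shape produces the sequence above
(hypothetical source, not one of the committed tests):
void
f (float *x, float a, int n)
{
  for (int i = 0; i < n; i++)
    x[i] = x[i] * a;   /* vec_duplicate of 'a' feeds the multiply */
}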
gcc/ChangeLog:
* config/riscv/autovec-opt.md (*vfmul_vf_<mode>): Add new pattern to
combine vec_duplicate + vfmul.vv into vfmul.vf.
* config/riscv/vector.md (@pred_<optab><mode>_scalar): Allow VLS modes.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f16.c: Add vfmul.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-1-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-2-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-3-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f16.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f32.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf-4-f64.c: Likewise.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop.h: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop_data.h: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_binop_run.h: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmul-run-1-f16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmul-run-1-f32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vf_vfmul-run-1-f64.c: New test.
* gcc.target/riscv/rvv/autovec/vls/floating-point-mul-2.c: Adjust scan
dump.
* gcc.target/riscv/rvv/autovec/vls/floating-point-mul-3.c: Likewise.
|
|
Recent changes from Kito left an unused parameter. On the assumption that he's
likely going to want it as part of the API, I've simply removed the parameter's
name until such time as Kito needs it.
This should restore bootstrapping to the RISC-V port. Committing now rather
than waiting for the CI system given bootstrap builds currently fail.
* config/riscv/riscv.cc (riscv_arg_partial_bytes): Remove name
from unused parameter.
|
|
The compiler is getting too smart! But this test is really intended
to test that we generate BICS instead of BIC+CMP, so make the test use
something that we can't subsequently fold away into a bit manipulation
of a store-flag value.
I've also added a couple of extra tests, so we now cover both the
cases where we fold the result away and where that cannot be done.
Also add a test that we don't generate a compare against 0, since
that's really part of what this test is covering.
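A hedged sketch of the kind of test added (the committed tests may differ):
the flags result feeds a branch with a side effect, so it cannot be folded
into a bit manipulation of a store-flag value:
extern void g (void);

void
f (unsigned a, unsigned b)
{
  if (a & ~b)   /* expect bics, not bic followed by cmp #0 */
    g ();
}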
gcc/testsuite:
* gcc.target/arm/bics_3.c: Add some additional tests that
cannot be folded to a bit manipulation.
|
|
The following changes the vect_reduc_type API to work on the SLP node.
The API is only used from the aarch64 backend, so all changes are there.
In particular I noticed aarch64_force_single_cycle is invoked even
for scalar costing (where the flag tested isn't computed yet); I
figured that in scalar costing all reductions are single-cycle.
* tree-vectorizer.h (vect_reduc_type): Get SLP node as argument.
* config/aarch64/aarch64.cc (aarch64_sve_in_loop_reduction_latency):
Take SLP node as argument and adjust.
(aarch64_in_loop_reduction_latency): Likewise.
(aarch64_detect_vector_stmt_subtype): Adjust.
(aarch64_vector_costs::count_ops): Likewise. Treat reductions
during scalar costing as single-cycle.
|
|
The following addresses a bogus swapping of SLP operands of a
reduction operation which gets STMT_VINFO_REDUC_IDX out of sync
with the SLP operand order. In fact the most obvious mistake is
that we simply swap operands on the first stmt even when
there's no difference in the comparison operators (for == and !=
at least). But there are more latent issues that I noticed and
fixed up in the process.
PR tree-optimization/121659
* tree-vect-slp.cc (vect_build_slp_tree_1): Do not allow
matching up comparison operators by swapping if that would
disturb STMT_VINFO_REDUC_IDX. Make sure to only
actually mark operands for swapping when there was a
mismatch and we're not processing the first stmt.
* gcc.dg/vect/pr121659.c: New testcase.
|
|
The following makes sure to read from the lanes_ifn member only
when necessary (and thus only when it has been set).
* tree-vect-stmts.cc (vectorizable_store): Access lanes_ifn
only when VMAT_LOAD_STORE_LANES.
(vectorizable_load): Likewise.
|
|
This was added when invariants/externals outside of SLP didn't have
an easily accessible vector type. Now it's redundant so the
following removes it.
* tree-vectorizer.h (_stmt_vec_info::reduc_vectype_in): Remove.
(STMT_VINFO_REDUC_VECTYPE_IN): Likewise.
* tree-vect-loop.cc (vect_is_emulated_mixed_dot_prod): Get
at the input vectype via the SLP node child.
(vectorizable_lane_reducing): Likewise.
(vect_transform_reduction): Likewise.
(vectorizable_reduction): Do not set STMT_VINFO_REDUC_VECTYPE_IN.
|
|
The vgf2p8affineqb_<mode><mask_name> pattern uses "register_operand"
predicate for the first input operand, so using "general_operand"
for the rotate operand passed to it leads to ICEs, and so does
the "nonimmediate_operand" in the <insn>v16qi3 define_expand.
The following patch fixes it by using "register_operand" in the former
case (that pattern is TARGET_GFNI only) and using force_reg in
the latter case (the pattern is TARGET_XOP || TARGET_GFNI and for XOP
we can handle MEM operand).
The rest of the changes are small formatting tweaks or use of const0_rtx
instead of GEN_INT (0).
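A hypothetical shape (not the committed pr121658.c) that exercises the
v16qi rotate path when compiled with -mgfni:
typedef unsigned char v16qi __attribute__((vector_size (16)));

v16qi
rot3 (v16qi *p)
{
  v16qi x = *p;                  /* memory source for the rotate */
  return (x << 3) | (x >> 5);   /* per-byte rotate by 3 */
}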
2025-08-26 Jakub Jelinek <jakub@redhat.com>
PR target/121658
* config/i386/sse.md (<insn><mode>3 any_shift): Use const0_rtx
instead of GEN_INT (0).
(cond_<insn><mode> any_shift): Likewise. Formatting fix.
(<insn><mode>3 any_rotate): Use register_operand predicate instead of
general_operand for match_operand 1. Use const0_rtx instead of
GEN_INT (0).
(<insn>v16qi3 any_rotate): Use force_reg on operands[1]. Formatting
fix.
* config/i386/i386.cc (ix86_shift_rotate_cost): Comment formatting
fixes.
* gcc.target/i386/pr121658.c: New test.
|
|
cost 0, 1 and 15
Add asm dump checks and run tests for the vec_duplicate + vmacc.vv
combine to vmacc.vx, with the GR2VR cost being 0, 2 and 15.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vx_vf/vx-1-u16.c: Add asm check
for vx combine.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-1-u32.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-1-u64.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-1-u8.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-2-u16.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-2-u32.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-2-u64.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-2-u8.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-3-u16.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-3-u32.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-3-u64.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-3-u8.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx_vmacc-run-1-u16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vx_vmacc-run-1-u32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vx_vmacc-run-1-u64.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vx_vmacc-run-1-u8.c: New test.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
0, 1 and 15
Add asm dump checks and run tests for the vec_duplicate + vmacc.vv
combine to vmacc.vx, with the GR2VR cost being 0, 2 and 15.
gcc/testsuite/ChangeLog:
* gcc.target/riscv/rvv/autovec/vx_vf/vx-1-i16.c: Add asm check
for vx combine.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-1-i32.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-1-i64.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-1-i8.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-2-i16.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-2-i32.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-2-i64.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-2-i8.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-3-i16.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-3-i32.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-3-i64.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx-3-i8.c: Ditto.
* gcc.target/riscv/rvv/autovec/vx_vf/vx_ternary.h: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vx_ternary_data.h: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vx_ternary_run.h: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vx_vmacc-run-1-i16.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vx_vmacc-run-1-i32.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vx_vmacc-run-1-i64.c: New test.
* gcc.target/riscv/rvv/autovec/vx_vf/vx_vmacc-run-1-i8.c: New test.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
This patch would like to combine the vec_duplicate + vmacc.vv into the
vmacc.vx, as in the example code below. The related pattern will depend
on the cost of the vec_duplicate from GR2VR: the late-combine pass will
take action if the cost of GR2VR is zero, and reject the combination
if the GR2VR cost is greater than zero.
Assume we have the example code below, with a GR2VR cost of 0.
#define DEF_VX_TERNARY_CASE_0(T, OP_1, OP_2, NAME) \
void \
test_vx_ternary_##NAME##_##T##_case_0 (T * restrict vd, T * restrict vs2, \
T rs1, unsigned n) \
{ \
for (unsigned i = 0; i < n; i++) \
vd[i] = vd[i] OP_2 vs2[i] OP_1 rs1; \
}
DEF_VX_TERNARY_CASE_0(int32_t, *, +, macc)
Before this patch:
11 │ beq a3,zero,.L8
12 │ vsetvli a5,zero,e32,m1,ta,ma
13 │ vmv.v.x v2,a2
...
16 │ .L3:
17 │ vsetvli a5,a3,e32,m1,ta,ma
...
22 │ vmacc.vv v1,v2,v3
...
25 │ bne a3,zero,.L3
After this patch:
11 │ beq a3,zero,.L8
...
14 │ .L3:
15 │ vsetvli a5,a3,e32,m1,ta,ma
...
20 │ vmacc.vx v1,a2,v3
...
23 │ bne a3,zero,.L3
gcc/ChangeLog:
* config/riscv/vector.md (@pred_mul_plus_vx_<mode>): Add new pattern to
generate vmacc rtl.
(*pred_macc_<mode>_scalar_undef): Ditto.
* config/riscv/autovec-opt.md (*vmacc_vx_<mode>): Add new
pattern to match the vmacc vx combine.
Signed-off-by: Pan Li <pan2.li@intel.com>
|
|
When expand_omp_for_init_counts is called from expand_omp_for_generic,
zero_iter1_bb is NULL and the code always creates a new bb in which it
clears the fd->loop.n2 var (if it is a var), because it can dominate code
with lastprivate guards that use the var.
When called from other places, zero_iter1_bb is non-NULL and so we don't
insert the clearing (and can't, because the same bb is also used for the
non-zero iterations exit, and in that case we need to preserve the iteration
count). Clearing is also not necessary when e.g. the outermost collapsed
loop has a constant non-zero number of iterations; in that case we initialize
the var to something already earlier. The following patch makes sure to clear
it before the first check for zero iterations if it hasn't been initialized yet.
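A hypothetical shape (not the committed pr121453.c) where the cleared
var matters: when a * b is zero, the lastprivate handling after the loop
must not read a stale iteration count from fd->loop.n2:
void
f (int a, int b, int *out)
{
  int last = 0;
  #pragma omp parallel for collapse(2) lastprivate(last)
  for (int i = 0; i < a; i++)
    for (int j = 0; j < b; j++)
      last = i * b + j;
  *out = last;
}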
2025-08-26 Jakub Jelinek <jakub@redhat.com>
PR middle-end/121453
* omp-expand.cc (expand_omp_for_init_counts): Clear fd->loop.n2
before first zero count check if zero_iter1_bb is non-NULL upon
entry and fd->loop.n2 has not been written yet.
* gcc.dg/gomp/pr121453.c: New test.
|
|
PR tree-optimization/121656
* gcc.dg/pr121656.c: New file.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
|
|
CTF array encoding uses uint32 for number of elements. This means there
is a hard upper limit on array types which the format can represent.
GCC internally was also using a uint32_t for this, which would overflow
when translating from DWARF for arrays with more than UINT32_MAX
elements. Use an unsigned HOST_WIDE_INT instead to fetch the array
bound, and fall back to CTF_K_UNKNOWN if the array cannot be
represented in CTF.
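A minimal sketch of the overflowing case (the committed ctf-array-7.c may
differ; needs -gctf and a 64-bit size_t, hence the lp64/llp64 restriction
in the follow-up test fix):
/* 2^32 + 1 elements: more than the uint32 element count can hold, so
   the type now degrades to CTF_K_UNKNOWN instead of overflowing.  */
static char big[0x100000001ULL];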
PR debug/121411
gcc/
* dwarf2ctf.cc (gen_ctf_subrange_type): Use unsigned HWI for
array_num_elements. Fall back to CTF_K_UNKNOWN if the array
type has too many elements for CTF to represent.
gcc/testsuite/
* gcc.dg/debug/ctf/ctf-array-7.c: New test.
|
|
After the return type of remove_prop_source_from_use was changed to void,
simplify_permutation only returns 1 or 0 so it can be boolified.
Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
* tree-ssa-forwprop.cc (simplify_permutation): Boolify.
(pass_forwprop::execute): No longer handle 2 as the return
from simplify_permutation.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
|
|
After changing the return type of remove_prop_source_from_use,
forward_propagate_into_comparison will never return 2. So boolify
forward_propagate_into_comparison.
Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
* tree-ssa-forwprop.cc (forward_propagate_into_comparison): Boolify.
(pass_forwprop::execute): Don't handle return of 2 from
forward_propagate_into_comparison.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
|
|
Since r5-4705-ga499aac5dfa5d9, remove_prop_source_from_use has always
returned false. This removes the return type of remove_prop_source_from_use
and cleans up its usage.
Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
* tree-ssa-forwprop.cc (remove_prop_source_from_use): Remove
return type.
(forward_propagate_into_comparison): Update dealing with
no return type of remove_prop_source_from_use.
(forward_propagate_into_gimple_cond): Likewise.
(simplify_permutation): Likewise.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
|
|
While looking at this code I noticed that we don't remove
the old switch index assignment when its only use was the switch,
after the switch is modified in simplify_gimple_switch.
This fixes that by marking the old switch index for the DCE worklist.
Bootstrapped and tested on x86_64-linux-gnu.
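The shape involved, roughly (hypothetical): once the switch is rewritten
to use c directly, the widening cast becomes dead and is now queued on the
DCE worklist instead of lingering:
int
f (short c)
{
  int i = c;   /* cast feeding only the switch */
  switch (i)
    {
    case 1:  return 10;
    case 2:  return 20;
    default: return 0;
    }
}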
gcc/ChangeLog:
* tree-ssa-forwprop.cc (simplify_gimple_switch): Add simple_dce_worklist
argument. Mark the old index when doing the replacement.
(pass_forwprop::execute): Update call to simplify_gimple_switch.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
|
|
Just like r16-465-gf2bb7ffe84840d8, but this time
instead of a VCE there is a full-on load from a boolean.
This showed up when trying to remove the extra copy
in the testcase from the revision mentioned above (pr120122-1.c).
So when moving a load from a boolean type from being conditional
to non-conditional, the load needs to become a full load and then
be cast into a bool so that the upper bits are correct.
Bitfield loads will always do the truncation so they don't need to
be rewritten, and non-boolean types always do the truncation too.
What we do is wrap the original reference with a VCE, which causes
the full load, and then add a cast to do the truncation. Using
fold_build1 with VCE will do the correct thing if there is a secondary
VCE and will also fold if this was just a plain MEM_REF, so there is
no need to handle those 2 cases specially either.
Changes since v1:
* v2: Use VIEW_CONVERT_EXPR instead of doing a manual load.
Accept all non mode precision loads rather than just
boolean ones.
* v3: Move back to checking boolean type. Don't handle BIT_FIELD_REF.
Add asserts for IMAG/REAL_PART_EXPR.
Bootstrapped and tested on x86_64-linux-gnu.
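A minimal sketch of the kind of load involved (hypothetical, not the
committed pr121279-1.c):
_Bool b;

int
f (int c)
{
  int r = 0;
  if (c)
    r = b;   /* hoisting this load out of the conditional must not
                expose padding bits: the rewrite loads the full byte
                via a VCE and casts the result back to _Bool */
  return r;
}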
PR tree-optimization/121279
gcc/ChangeLog:
* gimple-fold.cc (gimple_needing_rewrite_undefined): Return
true for non mode precision boolean loads.
(rewrite_to_defined_unconditional): Handle non mode precision loads.
gcc/testsuite/ChangeLog:
* gcc.dg/torture/pr121279-1.c: New test.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
|
|
When working on PR121279, I noticed that lim
would create an uninitialized decl and mark
it with a suppression for the uninitialized warning.
This is fine, but then the rewrite into SSA would just call
get_or_create_ssa_default_def on that new decl, which
could in theory take some extra compile time to figure
that out.
Plus, when doing the rewriting for undefinedness, there
would now be a VCE around the decl. This means the ssa
name is kept around and not propagated in some cases.
So instead this patch manually calls get_or_create_ssa_default_def
to get the "uninitialized" ssa name for this decl, and
no longer needs the rewrite into SSA nor the rewrite for undefinedness.
Bootstrapped and tested on x86_64-linux-gnu.
gcc/ChangeLog:
* tree-ssa-loop-im.cc (execute_sm): Call
get_or_create_ssa_default_def for the new uninitialized
decl.
Signed-off-by: Andrew Pinski <andrew.pinski@oss.qualcomm.com>
|
|
multiple alternatives
The use of compact syntax makes the relationship between asm output,
operand constraints, and insn attributes easier to understand and modify,
especially for "mov<mode>_internal".
gcc/ChangeLog:
* config/xtensa/xtensa.md (addsi3, <u>mulhisi3, andsi3,
zero_extend<mode>si2, extendhisi2_internal, movsi_internal,
movhi_internal, movqi_internal, movsf_internal, ashlsi3_internal,
ashrsi3, lshrsi3, rotlsi3, rotrsi3):
Rewrite in compact syntax.
|
|
gcc/ChangeLog:
* config/xtensa/xtensa.md
(The auxiliary define_split for *masktrue_const_bitcmpl):
Use a more concise function call, i.e.,
(1 << GET_MODE_BITSIZE (mode)) - 1 is equivalent to
GET_MODE_MASK (mode).
|
|
gcc/ChangeLog:
* config/xtensa/xtensa.md (mode_bits):
New mode attribute.
(zero_extend<mode>si2): Use the appropriate mode iterator and
attribute to unify "zero_extend[hq]isi2" to this description.
|