path: root/gcc
Age | Commit message | Author | Files | Lines
2024-01-21 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-20 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-19 | Daily bump. | GCC Administrator | 3 | -1/+98
2024-01-18 | Fix handling of X86_TUNE_AVOID_512FMA_CHAINS | Jan Hubicka | 1 | -1/+1

gcc/ChangeLog:

* config/i386/i386-options.c (ix86_option_override_internal): Fix handling of X86_TUNE_AVOID_512FMA_CHAINS.
2024-01-18 | Zen4 tuning part 2 | Jan Hubicka | 5 | -5/+24

Adds tunes needed for the zen4 microarchitecture. I added two new knobs.

TARGET_AVX512_SPLIT_REGS specifies that 512-bit vectors are internally split into 256-bit vectors. This affects vectorization costs and reassociation width. It probably should also affect RTX costs, but I doubt that would be very useful since RTL optimizers usually do not choose between 256-bit and 512-bit vectors.

I also added X86_TUNE_AVOID_256FMA_CHAINS. Since FMA has improved in zen4, this flag may not be a win except for very specific benchmarks. I am still doing some more detailed testing here.

Otherwise I disabled gathers on zen4 for 2 parts and 4 parts. We can open-code them, and since the latencies have only increased since zen3, open-coding is better than the actual instruction. This shows in 4 tsvc benchmarks.

I ended up setting AVX256_OPTIMAL. This is a compromise. Some tsvc benchmarks increase noticeably (up to 250%), but there are also a few regressions. Most of these can be solved by increasing vec_perm cost in the vectorizer. However, this does not cure a regression of about 14% on x264, which is quite important. Here we produce vectorized loops for avx512 that would probably be faster if the loops in question had a high enough iteration count. We hit this problem with avx256 too: since the loop iterates only a few times, only prologues/epilogues are used. Adding another round of prologue/epilogue code does not make it better.

Finally, I enabled avx stores for constant-sized memcpy and memset. I am not sure why this is an opt-in feature. I think this is a win for most hardware.

gcc/ChangeLog:

2022-12-22 Jan Hubicka <hubicka@ucw.cz>

* config/i386/i386-expand.c (ix86_expand_set_or_cpymem): Add TARGET_AVX512_SPLIT_REGS.
* config/i386/i386-options.c (ix86_option_override_internal): Honor X86_TUNE_AVOID_256FMA_CHAINS.
* config/i386/i386.c (ix86_vec_cost): Honor TARGET_AVX512_SPLIT_REGS.
  (ix86_reassociation_width): Likewise.
* config/i386/i386.h (TARGET_AVX512_SPLIT_REGS): New tune.
* config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Disable for znver4.
  (X86_TUNE_USE_GATHER_4PARTS): Likewise.
  (X86_TUNE_AVOID_256FMA_CHAINS): Set for znver4.
  (X86_TUNE_AVOID_512FMA_CHAINS): New tune; set for znver4.
  (X86_TUNE_AVX256_OPTIMAL): Add znver4.
  (X86_TUNE_AVX512_SPLIT_REGS): New tune.
  (X86_TUNE_AVX256_MOVE_BY_PIECES): Add znver1-3.
  (X86_TUNE_AVX256_STORE_BY_PIECES): Add znver1-3.
  (X86_TUNE_AVX512_MOVE_BY_PIECES): Add znver4.
  (X86_TUNE_AVX512_STORE_BY_PIECES): Add znver4.

(cherry picked from commit eef81eefcdc2a58111e50eb2162ea1f5becc8004)
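The effect of a split-register tuning on vectorization cost can be pictured with a small sketch. This is not the code in ix86_vec_cost: `Tuning`, `scaled_vec_cost`, and the linear scaling by the number of 256-bit pieces are illustrative assumptions only.

```cpp
#include <cassert>

// Hypothetical stand-in for a per-CPU tuning flag (not GCC's real macro).
struct Tuning {
    bool avx512_split_regs;  // 512-bit vectors execute as two 256-bit halves
};

// Assumed model: an operation on a vector wider than 256 bits costs as much
// as the number of 256-bit pieces the hardware issues internally.
inline int scaled_vec_cost(const Tuning &tune, int base_cost, int vector_bits) {
    assert(base_cost >= 0 && vector_bits > 0);
    if (tune.avx512_split_regs && vector_bits > 256)
        return base_cost * (vector_bits / 256);
    return base_cost;
}
```

Under this model a 512-bit operation is charged twice its base cost when the flag is set, which is one way a vectorizer could come to prefer 256-bit vectors for loops with small trip counts.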
2024-01-18 | Add AMD znver4 instruction reservations | Tejas Joshi | 3 | -1/+1070

This adds znver4 automata units and reservations separately from other znver automata, avoiding the insn-automata.cc size blow-up.

gcc/ChangeLog:

* common/config/i386/i386-common.c (processor_alias_table): Use CPU_ZNVER4 for znver4.
* config/i386/i386.md: Add znver4.md.
* config/i386/znver4.md: New.

(cherry picked from commit 72ce780a497eb3e5efe7a79ea5f21f8dd6858f7f)
2024-01-18 | Remove znver4 instruction reservations | Tejas Joshi | 2 | -814/+37

This reverts the changes made to znver.md in:
commit bf3b532b524ecacb3202ab2c8af419ffaaab7cff

2022-10-21 Tejas Joshi <TejasSanjay.Joshi@amd.com>

gcc/ChangeLog:

* common/config/i386/i386-common.c (processor_alias_table): Use CPU_ZNVER3 for znver4.
* config/i386/znver.md: Remove znver4 reservations.

(cherry picked from commit d93171509aa7ca23148508b96f1c1f70b941d808)
2024-01-18 | Enable AMD znver4 support and add instruction reservations | Tejas Joshi | 16 | -70/+903

2022-09-28 Tejas Joshi <TejasSanjay.Joshi@amd.com>

gcc/ChangeLog:

* common/config/i386/cpuinfo.h (get_amd_cpu): Recognize znver4.
* common/config/i386/i386-common.c (processor_names): Add znver4.
  (processor_alias_table): Add znver4 and modularize old znvers.
* common/config/i386/i386-cpuinfo.h (processor_subtypes): AMDFAM19H_ZNVER4.
* config.gcc (x86_64-*-* |...): Likewise.
* config/i386/driver-i386.c (host_detect_local_cpu): Let -march=native recognize znver4 cpus.
* config/i386/i386-c.c (ix86_target_macros_internal): Add znver4.
* config/i386/i386-options.c (m_ZNVER4): New definition.
  (m_ZNVER): Include m_ZNVER4.
  (processor_cost_table): Add znver4.
* config/i386/i386.c (ix86_reassociation_width): Likewise.
* config/i386/i386.h (processor_type): Add PROCESSOR_ZNVER4.
  (PTA_ZNVER1): New definition.
  (PTA_ZNVER2): Likewise.
  (PTA_ZNVER3): Likewise.
  (PTA_ZNVER4): Likewise.
* config/i386/i386.md (define_attr "cpu"): Add znver4 and rename md file.
* config/i386/x86-tune-sched.c (ix86_issue_rate): Add znver4.
  (ix86_adjust_cost): Likewise.
* config/i386/znver1.md: Rename to znver.md.
* config/i386/znver.md: Add new reservations for znver4.
* doc/extend.texi: Add details about znver4.
* doc/invoke.texi: Likewise.

gcc/testsuite/ChangeLog:

* gcc.target/i386/funcspec-56.inc: Handle new march.
* g++.target/i386/mv29.C: Likewise.

(cherry picked from commit bf3b532b524ecacb3202ab2c8af419ffaaab7cff)
2024-01-18 | Update znver4 costs | Jan Hubicka | 1 | -0/+134

Update costs of znver4, mostly based on data measured by Agner Fog. Compared to previous generations, x87 became a bit slower, which is probably not a big deal (and we have minimal benchmarking coverage for it). One interesting improvement is the reduction of FMA cost. I also updated costs of AVX256 loads/stores based on latencies (not throughput, which is twice that of avx256).

Overall, AVX512 vectorization seems to noticeably improve some of the TSVC benchmarks, but since 512-bit vectors are internally split into 256-bit vectors it is somewhat risky and does not win in SPEC scores (mostly by regressing benchmarks with loops that have small trip counts, like x264 and exchange), so for now I am going to set the AVX256_OPTIMAL tune, but I am still playing with it. We have improved since ZNVER1 at choosing vectorization size and also have vectorized prologues/epilogues, so it may be possible to make avx512 a small win overall.

2022-12-22 Jan Hubicka <hubicka@ucw.cz>

* config/i386/x86-tune-costs.h (znver4_cost): Update costs of FP and SSE moves, division, multiplication, gathers, L2 cache size, and more complex FP instructions.

(cherry picked from commit bbe04bade0cc3b17e62c2af3d89b899367e7d2d1)
2024-01-18 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-17 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-16 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-15 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-14 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-13 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-12 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-11 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-10 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-09 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-08 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-07 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-06 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-05 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-04 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-03 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-02 | Daily bump. | GCC Administrator | 1 | -1/+1
2024-01-01 | Daily bump. | GCC Administrator | 1 | -1/+1
2023-12-31 | Daily bump. | GCC Administrator | 1 | -1/+1
2023-12-30 | Daily bump. | GCC Administrator | 1 | -1/+1
2023-12-29 | Daily bump. | GCC Administrator | 1 | -1/+1
2023-12-28 | Daily bump. | GCC Administrator | 1 | -1/+1
2023-12-27 | Daily bump. | GCC Administrator | 1 | -1/+1
2023-12-26 | Daily bump. | GCC Administrator | 1 | -1/+1
2023-12-25 | Daily bump. | GCC Administrator | 3 | -1/+21
2023-12-24 | c++: constraint rewriting during ttp coercion [PR111485] | Patrick Palka | 3 | -2/+44

In order to compare the constraints of a ttp with that of its argument, we rewrite the ttp's constraints in terms of the argument template's template parameters. The substitution to achieve this currently uses a single level of template arguments, but that never does the right thing because a ttp's template parameters always have level >= 2. This patch fixes this by including the outer template arguments in the substitution, which ought to match the depth of the ttp.

The second testcase demonstrates it's better to substitute the concrete outer template arguments instead of generic ones since a ttp's constraints could depend on outer parameters.

PR c++/111485

gcc/cp/ChangeLog:

* pt.c (is_compatible_template_arg): New parameter 'args'. Add the outer template arguments 'args' to 'new_args'.
  (convert_template_argument): Pass 'args' to is_compatible_template_arg.

gcc/testsuite/ChangeLog:

* g++.dg/cpp2a/concepts-ttp5.C: New test.
* g++.dg/cpp2a/concepts-ttp6.C: New test.

(cherry picked from commit 6f902a42b0afe3f3145bcb864695fc290b5acc3e)
2023-12-24 | Daily bump. | GCC Administrator | 1 | -1/+1
2023-12-23 | Daily bump. | GCC Administrator | 1 | -1/+1
2023-12-22 | Daily bump. | GCC Administrator | 1 | -1/+1
2023-12-21 | Daily bump. | GCC Administrator | 3 | -1/+54
2023-12-20 | c++: -Wdeprecated-copy and using operator= [PR92145] | Jason Merrill | 2 | -1/+37

For the purpose of [depr.impldec] "if the class has a user-declared copy assignment operator", an operator= brought in from a base class with 'using' may be a copy-assignment operator, but it isn't a copy-assignment operator for the derived class.

gcc/cp/ChangeLog:

PR c++/92145
* class.c (classtype_has_depr_implicit_copy): Check DECL_CONTEXT of operator=.

gcc/testsuite/ChangeLog:

PR c++/92145
* g++.dg/cpp0x/depr-copy3.C: New test.

(cherry picked from commit 37846c42f1f5ac4d9ba190d49c4373673c89c8b5)
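The distinction drawn in this commit message can be illustrated with a small sketch. It is not the depr-copy3.C testcase; `Base`, `Derived`, `copy_of` and the counter member are hypothetical names chosen for illustration.

```cpp
#include <type_traits>

struct Base {
    int assigned_from_base = 0;  // counts calls to Base::operator=
    Base() = default;
    Base(const Base &) = default;
    // User-declared copy assignment operator of Base.
    Base &operator=(const Base &) {
        ++assigned_from_base;
        return *this;
    }
};

struct Derived : Base {
    // Brings Base::operator=(const Base &) into scope. It is a copy
    // assignment operator of Base, not of Derived, so Derived's own copy
    // operations remain implicitly declared.
    using Base::operator=;
};

static_assert(std::is_copy_constructible<Derived>::value,
              "Derived keeps its implicit copy constructor");

// Copies a Derived through its implicit copy constructor. Per the commit
// message, this use should not count as deprecated for -Wdeprecated-copy.
inline Derived copy_of(const Derived &d) { return Derived(d); }
```

Assigning a `Base` to a `Derived` selects the operator made visible by the using-declaration, while assigning a `Derived` to a `Derived` still goes through the implicit copy assignment operator of `Derived`.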
2023-12-20 | c++: NRV and goto [PR92407] | Jason Merrill | 4 | -0/+36

Here our named return value optimization was breaking the required destructor when the goto takes 'a' out of scope. A simple fix for the release branches is to disable the optimization in the presence of backward goto.

We could do better by disabling the optimization only if there is a backward goto across the variable declaration, but we don't track that, and in GCC 14 we instead make the goto work with NRV.

PR c++/92407

gcc/cp/ChangeLog:

* cp-tree.h (struct language_function): Add backward_goto.
* decl.c (check_goto): Set it.
* typeck.c (check_return_expr): Prevent NRV if set.

gcc/testsuite/ChangeLog:

* g++.dg/opt/nrv22.C: New test.

(cherry picked from commit a645347c19b07cc7abd7bf276c6769fc41afc932)
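The kind of code affected might look like the sketch below. It is not the nrv22.C testcase; `Tracker` and `make` are hypothetical names, and the destructor counter is only there to make the required destruction observable.

```cpp
struct Tracker {
    static int destroyed;  // number of destructor calls so far
    int id;
    explicit Tracker(int i) : id(i) {}
    Tracker(const Tracker &other) : id(other.id) {}
    ~Tracker() { ++destroyed; }
};
int Tracker::destroyed = 0;

Tracker make(bool retry) {
again:
    Tracker a(retry ? 1 : 2);
    if (retry) {
        retry = false;
        goto again;  // backward jump: 'a' leaves scope and must be destroyed
    }
    return a;  // 'a' is a candidate for the named return value optimization
}
```

Calling `make(true)` constructs `a` twice. The language requires the first object to be destroyed when the `goto` jumps back above its declaration, whether or not the second one is constructed directly in the return slot.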
2023-12-19 | c++: value dependence of by-ref lambda capture [PR108975] | Patrick Palka | 2 | -3/+25

We are still ICEing on the generic lambda version of the testcase from this PR, even after r13-6743-g6f90de97634d6f, due to the by-ref capture of the constant local variable 'dim' being considered value-dependent when regenerating the lambda (at which point processing_template_decl is set since the lambda is generic), which prevents us from constant folding its uses. Later during prune_lambda_captures we end up not thoroughly walking the body of the lambda and overlook the (non-folded) uses of 'dim' within the array bound and using-decls.

We could fix this by making prune_lambda_captures walk the body of the lambda more thoroughly so that it finds these uses of 'dim', but ideally we should be able to constant fold all uses of 'dim' ahead of time and prune the implicit capture after all.

To that end this patch makes value_dependent_expression_p return false for such by-ref captures of constant local variables, allowing their uses to get constant folded ahead of time. It seems we just need to disable the predicate's conservative early exit for reference variables (added by r5-5022-g51d72abe5ea04e) when DECL_HAS_VALUE_EXPR_P. This effectively makes us treat by-value and by-ref captures more consistently when it comes to value dependence.

PR c++/108975

gcc/cp/ChangeLog:

* pt.c (value_dependent_expression_p) <case VAR_DECL>: Suppress conservative early exit for reference variables when DECL_HAS_VALUE_EXPR_P.

gcc/testsuite/ChangeLog:

* g++.dg/cpp0x/lambda/lambda-const11a.C: New test.

(cherry picked from commit 3d674e29d7f89bf93fcfcc963ff0248c6347586d)
2023-12-20 | Daily bump. | GCC Administrator | 3 | -1/+17
2023-12-19 | i386: Fix mmx.md signbit expanders [PR112816] | Jakub Jelinek | 2 | -1/+20

Apparently when looking for "signbit<mode>2" vector expanders, I've only looked at sse.md and forgot mmx.md, which has another one and the following patch still ICEd.

2023-12-19 Jakub Jelinek <jakub@redhat.com>

PR target/112816
* config/i386/mmx.md (signbitv2sf2): Force operands[1] into a REG.
* gcc.target/i386/sse2-pr112816-2.c: New test.

(cherry picked from commit 80e1375ed7a7a05a5a60a57e72c5ad5eba005798)
2023-12-19 | Daily bump. | GCC Administrator | 1 | -1/+1
2023-12-18 | Daily bump. | GCC Administrator | 4 | -1/+137
2023-12-17 | c++: Unshare folded SAVE_EXPR arguments during cp_fold [PR112727] | Jakub Jelinek | 2 | -1/+25

The following testcase is miscompiled because two ubsan instrumentations run into each other. The first one is the shift instrumentation. Before the C++ FE calls it, it wraps the 2 shift arguments with cp_save_expr, so that side-effects in them aren't evaluated multiple times. And, ubsan_instrument_shift itself uses unshare_expr on any uses of the operands to make sure further modifications in them don't affect other copies of them (the only not unshared ones are the ones the caller then uses for the actual operation after the instrumentation, which means there is no tree sharing).

Now, if there are side-effects in the first operand like, say, a function call, cp_save_expr wraps it into a SAVE_EXPR, and ubsan_instrument_shift in this mode emits something like
  if (..., SAVE_EXPR <foo ()>, SAVE_EXPR <op1> > const)
    __ubsan_handle_shift_out_of_bounds (..., SAVE_EXPR <foo ()>, ...);
and caller adds
  SAVE_EXPR <foo ()> << SAVE_EXPR <op1>
after it in a COMPOUND_EXPR. So far so good.

If there are no side-effects and cp_save_expr doesn't create SAVE_EXPR, everything is ok as well because of the unshare_expr. We have
  if (..., SAVE_EXPR <op1> > const)
    __ubsan_handle_shift_out_of_bounds (..., ptr->something[i], ...);
and
  ptr->something[i] << SAVE_EXPR <op1>
where ptr->something[i] is unshared.

In the testcase below, the !x->s[j] ? 1 : 0 expression is wrapped initially into a SAVE_EXPR though, and unshare_expr doesn't unshare SAVE_EXPRs nor anything used in them for obvious reasons, so we end up with:
  if (..., SAVE_EXPR <!(bool) VIEW_CONVERT_EXPR<const struct S *>(x)->s[j] ? 1 : 0>, SAVE_EXPR <op1> > const)
    __ubsan_handle_shift_out_of_bounds (..., SAVE_EXPR <!(bool) VIEW_CONVERT_EXPR<const struct S *>(x)->s[j] ? 1 : 0>, ...);
and
  SAVE_EXPR <!(bool) VIEW_CONVERT_EXPR<const struct S *>(x)->s[j] ? 1 : 0> << SAVE_EXPR <op1>
So far good as well.

But later during cp_fold of the SAVE_EXPR we find out that VIEW_CONVERT_EXPR<const struct S *>(x)->s[j] ? 0 : 1 is actually invariant (has TREE_READONLY set) and so cp_fold simplifies the above to
  if (..., SAVE_EXPR <op1> > const)
    __ubsan_handle_shift_out_of_bounds (..., (bool) VIEW_CONVERT_EXPR<const struct S *>(x)->s[j] ? 0 : 1, ...);
and
  ((bool) VIEW_CONVERT_EXPR<const struct S *>(x)->s[j] ? 0 : 1) << SAVE_EXPR <op1>
with the s[j] ARRAY_REFs and other expressions shared in between the two uses (and obviously the expression optimized away from the COMPOUND_EXPR in the if condition).

Then comes another ubsan instrumentation at genericization time, this time to instrument the ARRAY_REFs with strict bounds checking, and replaces the s[j] in there with
  s[.UBSAN_BOUNDS (0B, SAVE_EXPR<j>, 8), SAVE_EXPR<j>]
As the trees are shared, it does that just once though. And as the if body is gimplified first, the SAVE_EXPR<j> is evaluated inside of the if body and when it is used again after the if, it uses a potentially uninitialized value of j.1 (always uninitialized if the shift count isn't out of bounds).

The following patch fixes that by unshare_expr unsharing the folded argument of a SAVE_EXPR if we've folded the SAVE_EXPR into an invariant and it is used more than once.

2023-12-08 Jakub Jelinek <jakub@redhat.com>

PR sanitizer/112727
* cp-gimplify.c (cp_fold): If SAVE_EXPR has been previously folded, unshare_expr what is returned.
* c-c++-common/ubsan/pr112727.c: New test.

(cherry picked from commit 6ddaf06e375e1c15dcda338697ab6ea457e6f497)
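Source code of roughly the following shape exercises both instrumentations described above, when compiled with shift and strict bounds sanitizing enabled. This is a sketch and not the pr112727.c testcase; `S` mirrors the struct named in the dumps, while `inverted_mask` is a hypothetical function.

```cpp
#include <cstdint>

struct S {
    bool s[8];
};

// Builds a bitmask with bit j set when x->s[j] is false. The left operand of
// the shift is a conditional on an array element, so the shift check and the
// array bounds check both instrument the same subexpression.
inline std::uint32_t inverted_mask(const S *x) {
    std::uint32_t n = 0;
    for (int j = 0; j < 8; ++j)
        n |= (!x->s[j] ? 1 : 0) << j;
    return n;
}
```

Without sanitizers the function is ordinary C++; the miscompilation described in the commit message only concerns the instrumented form.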
2023-12-17 | fold-const: Fix up multiple_of_p [PR112733] | Jakub Jelinek | 2 | -1/+17

We ICE on the following testcase when wi::multiple_of_p is called on widest_int 1 and -128 with UNSIGNED. I still need to work on the actual wide-int.cc issue, the latest patch attached to the PR regressed bitint-{38,39}.c, so will need to debug that, but there is a clear bug on the fold-const.cc side as well - widest_int is a signed representation by definition, using UNSIGNED with it certainly doesn't match what was intended, because -128 as the second operand effectively means unsigned 131072 bit 0xfffff............ffff80 integer, not the signed char -128 that appeared in the source.

In the INTEGER_CST case a few lines above this we already use
  case INTEGER_CST:
    if (TREE_CODE (bottom) != INTEGER_CST || integer_zerop (bottom))
      return false;
    return wi::multiple_of_p (wi::to_widest (top), wi::to_widest (bottom), SIGNED);
so I think using SIGNED with widest_int is best there (compared to the other choices in the PR).

2023-11-29 Jakub Jelinek <jakub@redhat.com>

PR middle-end/112733
* fold-const.c (multiple_of_p): Pass SIGNED rather than UNSIGNED for wi::multiple_of_p on widest_int arguments.
* gcc.dg/pr112733.c: New test.

(cherry picked from commit 5c95bf945c632925efba86dd5dceccdb9da8884c)
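The difference between the two signedness choices can be shown on 64-bit integers instead of widest_int. This is a minimal sketch under that assumption; `multiple_of_signed` and `multiple_of_unsigned` are hypothetical helpers and do not use GCC's wide-int API.

```cpp
#include <cstdint>

// Signed interpretation: -128 really is the value -128.
inline bool multiple_of_signed(std::int64_t top, std::int64_t bottom) {
    if (bottom == 0)
        return false;
    if (bottom == -1)
        return true;  // avoids the overflowing INT64_MIN % -1
    return top % bottom == 0;
}

// Unsigned interpretation of the same bit patterns: -128 becomes
// 0xffffffffffffff80, a huge positive number.
inline bool multiple_of_unsigned(std::int64_t top, std::int64_t bottom) {
    const std::uint64_t t = static_cast<std::uint64_t>(top);
    const std::uint64_t b = static_cast<std::uint64_t>(bottom);
    return b != 0 && t % b == 0;
}
```

For example, 256 is a multiple of -128 under the signed reading but not under the unsigned one, which is why the signedness passed alongside a signed representation matters.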
2023-12-17 | i386: Fix -fcf-protection -Os ICE due to movabsq peephole2 [PR112845] | Jakub Jelinek | 2 | -1/+13

The following testcase ICEs in the movabsq $(i32 << shift), r64 peephole2 I've added a while back to use smaller code than movabsq if possible. If i32 is 0xfa1e0ff3 and shift is not divisible by 8, then it creates an invalid insn (as 0xfa1e0ff3 CONST_INT is not allowed as x86_64_immediate_operand nor x86_64_zext_immediate_operand), the peephole2 even triggers on it again and again (this time with shift 0) until it gives up.

The following patch fixes that. As ix86_endbr_immediate_operand needs a CONST_INT and it is hopefully rare, I chose to use FAIL rather than handling it in the condition (where I'd probably need to call ctz_hwi again etc.).

2023-12-05 Jakub Jelinek <jakub@redhat.com>

PR target/112845
* config/i386/i386.md (movabsq $(i32 << shift), r64 peephole2): FAIL if the new immediate is ix86_endbr_immediate_operand.

(cherry picked from commit e0786ca9a18c50ad08c40936b228e325193664b8)
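The decomposition and the new bail-out can be sketched as follows. This is not the peephole2 from i386.md: `ShiftedImm`, `split_as_shifted_imm32`, and the simple equality test against the ENDBR64 byte pattern are illustrative assumptions, and the real predicate ix86_endbr_immediate_operand may check more than this.

```cpp
#include <cstdint>

// ENDBR64 is encoded as f3 0f 1e fa; read as a little-endian 32-bit
// immediate that is 0xfa1e0ff3.
constexpr std::uint64_t kEndbr64Imm = 0xfa1e0ff3u;

struct ShiftedImm {
    std::uint32_t imm32;
    unsigned shift;
};

// Tries to express 'value' as (imm32 << shift). Returns false when that is
// not possible, or when the 32-bit immediate would be the ENDBR64 pattern,
// mirroring the FAIL described in the commit message.
inline bool split_as_shifted_imm32(std::uint64_t value, ShiftedImm &out) {
    if (value == 0)
        return false;
    unsigned shift = 0;
    while ((value & 1u) == 0) {  // strip trailing zero bits
        value >>= 1;
        ++shift;
    }
    if (value > 0xffffffffu)
        return false;            // does not fit a 32-bit immediate
    if (value == kEndbr64Imm)
        return false;            // would embed an ENDBR64 byte sequence
    out.imm32 = static_cast<std::uint32_t>(value);
    out.shift = shift;
    return true;
}
```

A constant such as `0xfa1e0ff3 << 4` is rejected here, while `0x12345 << 40` splits into a 32-bit immediate and a shift count of 40.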
2023-12-17 | i386: Fix rtl checking ICE in ix86_elim_entry_set_got [PR112837] | Jakub Jelinek | 2 | -4/+16

The following testcase ICEs with RTL checking, because it checks whether XINT (SET_SRC (set), 1) is UNSPEC_SET_GOT without checking if SET_SRC (set) is actually an UNSPEC, so any time we see any other insn with PARALLEL and a SET in it which is not an UNSPEC we ICE during RTL checking, or access some other union member there as if it were an rt_int.

The rest is just small cleanup.

2023-12-04 Jakub Jelinek <jakub@redhat.com>

PR target/112837
* config/i386/i386.c (ix86_elim_entry_set_got): Before checking for UNSPEC_SET_GOT check that SET_SRC is UNSPEC. Use SET_SRC and SET_DEST macros instead of XEXP, rename vec variable to set.
* gcc.dg/pr112837.c: New test.

(cherry picked from commit 4586d7d0a92e9b60d0c01043e0ae262b1e06f337)
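The general pattern behind this fix, checking the discriminating code before reading a union member, can be sketched outside of GCC's RTL types. `Code`, `Node`, `kSetGot` and `is_set_got` are hypothetical stand-ins, not GCC definitions.

```cpp
enum class Code { Unspec, Reg, ConstInt };

struct Node {
    Code code;  // says which union member is currently meaningful
    union {
        int unspec_kind;  // valid only when code == Code::Unspec
        unsigned regno;   // valid only when code == Code::Reg
        long value;       // valid only when code == Code::ConstInt
    };
};

constexpr int kSetGot = 7;  // hypothetical stand-in for UNSPEC_SET_GOT

// Check the code first; only then is reading unspec_kind meaningful.
inline bool is_set_got(const Node &src) {
    return src.code == Code::Unspec && src.unspec_kind == kSetGot;
}
```

Without the first half of the condition, a register node whose number happens to equal `kSetGot` would be misread as the special unspec.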