path: root/gcc/rust/backend/rust-compile-struct-field-expr.h
author     Richard Biener <rguenther@suse.de>  2025-05-25 19:29:04 +0200
committer  Richard Biener <rguenth@gcc.gnu.org>  2025-07-08 10:12:30 +0200
commit     e9079e4f43d13579c41110ce1871051a43c577b6 (patch)
tree       569926f1da2c5842757bd0be9bb67b229610d62f /gcc/rust/backend/rust-compile-struct-field-expr.h
parent     df64d099faf843d90e8fe29aec17d84277986ee9 (diff)
add masked-epilogue tuning
The following adds an x86 tuning to enable the use of AVX512 masked epilogues in cases where we heuristically determine it to be unlikely to be detrimental.

The problematic cases are, basically, data streams that are both stored to and loaded from, where an outer loop could end up executing only the masked epilogue of the inner loop and, with unlucky data stream advancement from the outer loop, end up needing to forward from masked stores to masked loads. This isn't handled very well, especially for the case where unmasked operations would not need to forward at all, that is, when forwarding entirely from the masked-out portion of the store (like the AVX upper half of a store to the AVX lower half of a load).

There is also the case where the number of iterations is known at compile time; only with cost comparison would we consider a non-masked epilogue, and since we are not doing that we have to add a heuristic that avoids masking when a single vector epilogue iteration would cover all remaining scalar iterations (this is exercised by gcc.target/i386/pr110310.c).

SPEC CPU 2017 shows 3% text size savings over not using masked epilogues, with the performance impact in the noise. Masking all vector epilogues gets that to 4% text size savings, but with major runtime regressions in 503.bwaves_r and 527.cam4_r (measured on a Zen4 system); with the implemented heuristic we also leave a 5% improvement for 549.fotonik3d_r unrealized.

With the heuristics we turn 22513 vector epilogues plus up to 12305 scalar epilogues into 12305 masked vector epilogues, of which 574 are for AVX vector sizes, 79 for SSE vector sizes and the rest for AVX512. When masking all epilogues we get 14567 masked epilogues from 29467 vector plus up to 14567 scalar epilogues, so the heuristics disable an additional 20% of masked epilogues.

	* config/i386/x86-tune.def (X86_TUNE_AVX512_MASKED_EPILOGUES): New
	tunable, default on for m_ZNVER4 and m_ZNVER5.
	* config/i386/i386.cc (ix86_vector_costs::finish_cost): With
	X86_TUNE_AVX512_MASKED_EPILOGUES and when the main loop had a
	vectorization factor > 2, use a masked epilogue when possible and
	when not obviously problematic.

	* gcc.target/i386/vect-mask-epilogue-1.c: New testcase.
	* gcc.target/i386/vect-mask-epilogue-2.c: Likewise.
	* gcc.target/i386/vect-epilogues-3.c: Adjust.
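
As a point of reference, the tunable entry described in the ChangeLog could look roughly like the sketch below in config/i386/x86-tune.def; the comment wording and the -mtune-ctrl= string name are assumptions made for illustration, only the tunable name and the Zen 4/Zen 5 default are taken from the ChangeLog above.

  /* X86_TUNE_AVX512_MASKED_EPILOGUES: Use AVX512 masked vector epilogues
     where the heuristic does not consider them problematic.  Sketch only;
     the string name "avx512_masked_epilogues" is an assumption.  */
  DEF_TUNE (X86_TUNE_AVX512_MASKED_EPILOGUES, "avx512_masked_epilogues",
            m_ZNVER4 | m_ZNVER5)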
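
As a concrete illustration of the common case (this is not the contents of the new testcases), a loop with a runtime trip count such as the one below, built with -O3 -march=znver4, now has the leftover iterations that do not fill a whole vector handled by a single masked vector epilogue iteration instead of a scalar epilogue loop.

  /* Illustrative only; assumed flags -O3 -march=znver4.  The tail
     iterations are covered by one AVX512 masked epilogue iteration
     rather than by a scalar remainder loop.  */
  void
  saxpy (float *restrict a, const float *restrict b, float x, int n)
  {
    for (int i = 0; i < n; i++)
      a[i] += x * b[i];
  }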
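
The known-trip-count heuristic can be pictured with a loop like the following (again illustrative; gcc.target/i386/pr110310.c is not reproduced here). Assuming the main loop handles 16 floats per vector iteration, 20 iterations leave a remainder of 4, and one unmasked SSE-sized epilogue iteration covers those 4 completely, so under the heuristic that epilogue is not masked.

  /* Illustrative only; assumes a 16-element main vectorization factor,
     so the 4 remaining iterations fit one unmasked narrower vector
     epilogue iteration and masking is avoided.  */
  void
  scale20 (float *restrict a, const float *restrict b)
  {
    for (int i = 0; i < 20; i++)
      a[i] = b[i] * 2.0f;
  }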
Diffstat (limited to 'gcc/rust/backend/rust-compile-struct-field-expr.h')
0 files changed, 0 insertions, 0 deletions