author: Jan Hubicka <jh@suse.cz> 2023-02-07 05:23:00 +0100
committer: Jan Hubicka <jh@suse.cz> 2023-02-07 05:24:15 +0100
commit: a7502c4a614238ac3f80271886b217b156bdf923 (patch)
tree: 0c559825291ae69f4641981ee44114a4576a94d0
parent: f0e73dd031135695249f87589a0d250bf2f334b6 (diff)
Enable 512 bit vector for zen4
While internally 512-bit registers are split into two 256-bit halves, 512-bit vectors reduce the number of instructions to retire and have a chance to improve parallelism. There are a few tsvc benchmarks that improve significantly:

             runtime
benchmark    256bit    512bit
s2275        48.57     20.67    -58%
s311         32.29     16.06    -50%
s312         32.30     16.07    -50%
vsumr        32.30     16.07    -50%
s314         10.77      5.42    -50%
s313         21.52     10.85    -50%
vdotr        43.05     21.69    -50%
s316         10.80      5.64    -48%
s235         61.72     33.91    -45%
s161         15.91      9.95    -38%
s3251        32.13     20.31    -36%

There are no benchmarks with regressions beyond noise. The basic matrix multiplication loop improves by 32%.

It is also expected that 512-bit vectors are more power efficient (I can't measure that).

The downside is that loops with low trip counts may get slower, because the unvectorized prologue and epilogue are hit more often. With SPECfp this problem happens with x264 (12% regression) and bwaves (6% regression). This is tracked in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108410 and will need more work on the vectorizer to support masked epilogues.

After some additional testing it seems that using 512-bit vectors by default is now the better choice overall.

Bootstrapped/regtested x86_64-linux. Plan to commit it tomorrow.

	* config/i386/x86-tune.def (X86_TUNE_AVX256_OPTIMAL): Turn off for znver4.
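
The low-trip-count downside described above comes down to simple arithmetic, which the following minimal sketch illustrates. It is a simplified model, not GCC's vectorizer logic: it assumes no alignment prologue and a purely scalar epilogue, and the names `loop_split` and `split_trip_count` are hypothetical.

```c
#include <stddef.h>

/* How a loop of n elements splits between the vectorized body and a
   scalar epilogue when no masked epilogue is available.  Simplified
   model: ignores any alignment prologue and any narrower vectorized
   epilogue a real compiler might emit.  */
struct loop_split {
    size_t vector_iters;  /* iterations of the vectorized loop body */
    size_t scalar_iters;  /* leftover elements processed one at a time */
};

/* vector_bits is the vector width (e.g. 256 or 512); elem_bits is the
   element size (e.g. 32 for float).  Both names are illustrative.  */
static struct loop_split
split_trip_count(size_t n, unsigned vector_bits, unsigned elem_bits)
{
    size_t lanes = vector_bits / elem_bits;  /* elements per vector */
    struct loop_split s = { n / lanes, n % lanes };
    return s;
}
```

Under these assumptions, a 15-element float loop runs 1 vector iteration plus 7 scalar ones with 256-bit vectors, but 0 vector iterations plus 15 scalar ones with 512-bit vectors. Larger trip counts amortize the epilogue, which is consistent with the regressions being limited to loops with low trip counts.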
-rw-r--r-- gcc/config/i386/x86-tune.def 2
1 files changed, 1 insertions, 1 deletions
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index c78dad0..3054656 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -551,7 +551,7 @@ DEF_TUNE (X86_TUNE_AVX128_OPTIMAL, "avx128_optimal", m_BDVER | m_BTVER2
/* X86_TUNE_AVX256_OPTIMAL: Use 256-bit AVX instructions instead of 512-bit AVX
instructions in the auto-vectorizer. */
-DEF_TUNE (X86_TUNE_AVX256_OPTIMAL, "avx256_optimal", m_CORE_AVX512 | m_ZNVER4)
+DEF_TUNE (X86_TUNE_AVX256_OPTIMAL, "avx256_optimal", m_CORE_AVX512)
/* X86_TUNE_AVX256_SPLIT_REGS: if true, AVX512 ops are split into two AVX256 ops. */
DEF_TUNE (X86_TUNE_AVX512_SPLIT_REGS, "avx512_split_regs", m_ZNVER4)
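
The effect of the one-line change can be pictured as a bitmask test: the third argument of DEF_TUNE is a selector of processors, and a tuning applies when the target processor is in that selector. The sketch below is a simplified model under that assumption, not GCC's actual code. The bit values, the `M_*` macros, and `tune_enabled` are hypothetical, and GCC's real m_CORE_AVX512 covers several processors rather than a single bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical processor bits, for illustration only.  GCC's real m_*
   masks are defined elsewhere and have different values.  */
#define M_CORE_AVX512 (UINT64_C(1) << 0)
#define M_ZNVER4      (UINT64_C(1) << 1)

/* Selector for the avx256_optimal tuning before and after the patch.  */
static const uint64_t avx256_optimal_before = M_CORE_AVX512 | M_ZNVER4;
static const uint64_t avx256_optimal_after  = M_CORE_AVX512;

/* A tuning is active when the target processor's bit is set in the
   tuning's selector mask.  */
static bool
tune_enabled(uint64_t selector, uint64_t target_cpu)
{
    return (selector & target_cpu) != 0;
}
```

In this model, `tune_enabled(avx256_optimal_before, M_ZNVER4)` is true and `tune_enabled(avx256_optimal_after, M_ZNVER4)` is false, so the auto-vectorizer is no longer told to prefer 256-bit vectors when tuning for znver4. Code that suffers from the low-trip-count regressions can still ask for 256-bit vectors explicitly with GCC's `-mprefer-vector-width=256` option.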