riscv-gnu-toolchain/gcc.git - Unnamed repository; edit this file 'description' to name the repository.

diff options

author	liuhongt <hongtao.liu@intel.com>	2024-09-25 13:11:11 +0800
committer	liuhongt <hongtao.liu@intel.com>	2024-10-10 10:21:29 +0800
commit	9eaecce3d8c1d9349adbf8c2cdaf8d87672ed29c (patch)
tree	c40f7be375c491dc7cb7bbc3699ee1f1beb5bfcf /gcc/fortran
parent	9c8cea8feb6cd54ef73113a0b74f1df7b60d09dc (diff)
download	gcc-9eaecce3d8c1d9349adbf8c2cdaf8d87672ed29c.zip gcc-9eaecce3d8c1d9349adbf8c2cdaf8d87672ed29c.tar.gz gcc-9eaecce3d8c1d9349adbf8c2cdaf8d87672ed29c.tar.bz2

Add a new tune avx256_avoid_vec_perm for SRF.

According to Intel SOM[1], For Crestmont, most 256-bit Intel AVX2 instructions can be decomposed into two independent 128-bit micro-operations, except for a subset of Intel AVX2 instructions, known as cross-lane operations, can only compute the result for an element by utilizing one or more sources belonging to other elements. The 256-bit instructions listed below use more operand sources than can be natively supported by a single reservation station within these microarchitectures. They are decomposed into two μops, where the first μop resolves a subset of operand dependencies across two cycles. The dependent second μop executes the 256-bit operation by using a single 128-bit execution port for two consecutive cycles with a five-cycle latency for a total latency of seven cycles. VPERM2I128 ymm1, ymm2, ymm3/m256, imm8 VPERM2F128 ymm1, ymm2, ymm3/m256, imm8 VPERMPD ymm1, ymm2/m256, imm8 VPERMPS ymm1, ymm2, ymm3/m256 VPERMD ymm1, ymm2, ymm3/m256 VPERMQ ymm1, ymm2/m256, imm8 Instead of setting tune avx128_optimal for SRF, the patch add a new tune avx256_avoid_vec_perm for it. so by default, vectorizer still uses 256-bit VF if cost is profitable, but lowers to 128-bit whenever 256-bit vec_perm is needed for auto-vectorization. w/o vec_perm, performance of 256-bit vectorization should be similar as 128-bit ones(some benchmark results show it's even better than 128-bit vectorization since it enables more parallelism for convert cases.) [1] https://www.intel.com/content/www/us/en/content-details/814198/intel-64-and-ia-32-architectures-optimization-reference-manual-volume-1.html gcc/ChangeLog: * config/i386/i386.cc (ix86_vector_costs::ix86_vector_costs): Add new member m_num_avx256_vec_perm. (ix86_vector_costs::add_stmt_cost): Record 256-bit vec_perm. (ix86_vector_costs::finish_cost): Prevent vectorization for TAREGT_AVX256_AVOID_VEC_PERM when there's 256-bit vec_perm instruction. * config/i386/i386.h (TARGET_AVX256_AVOID_VEC_PERM): New Macro. * config/i386/x86-tune.def (X86_TUNE_AVX256_SPLIT_REGS): Add m_CORE_ATOM. (X86_TUNE_AVX256_AVOID_VEC_PERM): New tune. gcc/testsuite/ChangeLog: * gcc.target/i386/avx256_avoid_vec_perm.c: New test.

Diffstat (limited to 'gcc/fortran')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: