author | MayShao-oc <MayShao-oc@zhaoxin.com> | 2024-06-29 11:58:27 +0800 |
---|---|---|
committer | H.J. Lu <hjl.tools@gmail.com> | 2024-06-30 06:26:43 -0700 |
commit | c19457aec67da28a3f78badef53556cd55640a6e (patch) | |
tree | 41d5357bb9426f975e15ad88e30196c35e0cfc49 /sysdeps | |
parent | 44d757eb9f4484dbc3aa32042ab64cdf9374e093 (diff) | |
x86_64: Optimize large size copy in memmove-ssse3
This patch optimizes large copies by using normal (cacheable) stores
when src > dst and the buffers overlap, matching the logic in
memmove-vec-unaligned-erms.S.

memmove-ssse3 currently uses '__x86_shared_cache_size_half' as the
non-temporal threshold; this patch changes it to
'__x86_shared_non_temporal_threshold'. The latter is CPU-specific:
different CPUs get different values based on the related nt-benchmark
results. Using '__x86_shared_cache_size_half' as the non-temporal
threshold in memmove-ssse3 therefore seems unreasonable.

Performance does not change drastically, but the results show an
overall improvement with no major regressions.

Results on Zhaoxin KX-7000:
bench-memcpy geometric_mean(N=20) New / Original: 0.999
bench-memcpy-random geometric_mean(N=20) New / Original: 0.999
bench-memcpy-large geometric_mean(N=20) New / Original: 0.978
bench-memmove geometric_mean(N=20) New / Original: 1.000
bench-memmove-large geometric_mean(N=20) New / Original: 0.962
Results on Intel Core i5-6600K:
bench-memcpy geometric_mean(N=20) New / Original: 1.001
bench-memcpy-random geometric_mean(N=20) New / Original: 0.999
bench-memcpy-large geometric_mean(N=20) New / Original: 1.001
bench-memmove geometric_mean(N=20) New / Original: 0.995
bench-memmove-large geometric_mean(N=20) New / Original: 0.936
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
Diffstat (limited to 'sysdeps')
-rw-r--r-- | sysdeps/x86_64/multiarch/memmove-ssse3.S | 14 |
1 files changed, 9 insertions, 5 deletions
diff --git a/sysdeps/x86_64/multiarch/memmove-ssse3.S b/sysdeps/x86_64/multiarch/memmove-ssse3.S
index 048d015..01008fd 100644
--- a/sysdeps/x86_64/multiarch/memmove-ssse3.S
+++ b/sysdeps/x86_64/multiarch/memmove-ssse3.S
@@ -151,13 +151,10 @@ L(more_2x_vec):
 	   loop.  */
 	movups	%xmm0, (%rdi)
 
-# ifdef SHARED_CACHE_SIZE_HALF
-	cmp	$SHARED_CACHE_SIZE_HALF, %RDX_LP
-# else
-	cmp	__x86_shared_cache_size_half(%rip), %rdx
-# endif
+	cmp	__x86_shared_non_temporal_threshold(%rip), %rdx
 	ja	L(large_memcpy)
 
+L(loop_fwd):
 	leaq	-64(%rdi, %rdx), %r8
 	andq	$-16, %rdi
 	movl	$48, %edx
@@ -199,6 +196,13 @@ L(large_memcpy):
 	movups	-64(%r9, %rdx), %xmm10
 	movups	-80(%r9, %rdx), %xmm11
 
+	/* Check if src and dst overlap. If they do use cacheable
+	   writes to potentially gain positive interference between
+	   the loads during the memmove.  */
+	subq	%rdi, %r9
+	cmpq	%rdx, %r9
+	jb	L(loop_fwd)
+
 	sall	$5, %ecx
 	leal	(%rcx, %rcx, 2), %r8d
 	leaq	-96(%rdi, %rdx), %rcx
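
For readers less familiar with the assembly, the forward-copy path selection after this patch can be sketched in C roughly as follows. This is an illustrative model only, not glibc code: the names `choose_forward_path`, `PATH_LOOP_FWD`, `PATH_LARGE_MEMCPY` and the `nt_threshold` parameter are hypothetical, and the sketch assumes that `%r9` holds the source pointer, `%rdi` the destination and `%rdx` the length at the point of the new check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the two forward-copy paths in the diff.  */
enum copy_path { PATH_LOOP_FWD, PATH_LARGE_MEMCPY };

/* Model of the branch logic only; nt_threshold stands in for
   __x86_shared_non_temporal_threshold.  */
static enum copy_path
choose_forward_path (const void *dst, const void *src, size_t len,
                     size_t nt_threshold)
{
  /* cmp __x86_shared_non_temporal_threshold(%rip), %rdx
     ja  L(large_memcpy)
     Sizes at or below the threshold stay on the cacheable loop.  */
  if (len <= nt_threshold)
    return PATH_LOOP_FWD;

  /* subq %rdi, %r9 ; cmpq %rdx, %r9 ; jb L(loop_fwd)
     Unsigned (src - dst) < len means src lies inside [dst, dst + len),
     i.e. src >= dst and the buffers overlap.  If src < dst the
     subtraction wraps to a large value and the test fails.  */
  uintptr_t diff = (uintptr_t) src - (uintptr_t) dst;
  if (diff < len)
    return PATH_LOOP_FWD;

  /* No such overlap: take the non-temporal large-copy path.  */
  return PATH_LARGE_MEMCPY;
}
```

The single unsigned comparison covers both conditions (src at or above dst, and distance smaller than the length), which is why the assembly needs only one `subq`/`cmpq`/`jb` sequence.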