commit 7e1c440bc84c02e67b1cf338579a3274cdc337e0
Author:    Roger Sayle <roger@nextmovesoftware.com>  2023-12-19 11:24:36 +0000
Committer: Roger Sayle <roger@nextmovesoftware.com>  2023-12-19 11:24:36 +0000
Tree:      f9312aa153a0a9e62369c1f72b8f3fab7e08f03d
Parent:    cf840a7f7c14242ab7018071310851486a557d4f
i386: Improved TImode (128-bit) integer constants on x86_64.
This patch fixes two issues with the handling of 128-bit TImode integer
constants in the x86_64 backend. The main issue is that GCC always
tries to load 128-bit integer constants via broadcasts to vector SSE
registers, even if the result is required in general registers. This
is seen in the two closely related functions below:
__int128 m;
void foo() { m &= CONST; }
void bar() { m = CONST; }
When compiled with -O2 -mavx, we currently generate:
foo:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm0
        vpunpcklqdq %xmm0, %xmm0, %xmm0
        vmovq   %xmm0, %rax
        vpextrq $1, %xmm0, %rdx
        andq    %rax, m(%rip)
        andq    %rdx, m+8(%rip)
        ret
bar:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm1
        vpunpcklqdq %xmm1, %xmm1, %xmm0
        vpextrq $1, %xmm0, %rdx
        vmovq   %xmm0, m(%rip)
        movq    %rdx, m+8(%rip)
        ret
With this patch we defer the decision to use a vector broadcast for
TImode until we know that we actually want an SSE register result,
by moving the call to ix86_convert_const_wide_int_to_broadcast from
the RTL expansion pass to the scalar-to-vector (STV) pass. With
this change (and a minor tweak described below) we now generate:
foo:    movabsq $81985529216486895, %rax
        andq    %rax, m(%rip)
        andq    %rax, m+8(%rip)
        ret
bar:    movabsq $81985529216486895, %rax
        vmovq   %rax, %xmm0
        vpunpcklqdq %xmm0, %xmm0, %xmm0
        vmovdqa %xmm0, m(%rip)
        ret
showing that we now correctly use vector mode broadcasts (only)
where appropriate.
The one minor tweak mentioned above is to enable the un-cprop hi/lo
optimization, which I originally contributed back in September 2004
(https://gcc.gnu.org/pipermail/gcc-patches/2004-September/148756.html),
even when not optimizing for size. Without this (and currently with
just -O2) the function foo above generates:
foo:    movabsq $81985529216486895, %rax
        movabsq $81985529216486895, %rdx
        andq    %rax, m(%rip)
        andq    %rdx, m+8(%rip)
        ret
I'm not sure why (back in 2004) I thought that avoiding the implicit
"movq %rax, %rdx" in favor of a second load was faster, perhaps avoiding
a dependency to allow better scheduling, but nowadays "movq %rax, %rdx"
is either eliminated by GCC's hardreg cprop pass, or special cased by
modern hardware, making the first foo preferable, being not only shorter
but also faster.
2023-12-19  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * config/i386/i386-expand.cc
        (ix86_convert_const_wide_int_to_broadcast): Remove static.
        (ix86_expand_move): Don't attempt to convert wide constants
        to SSE using ix86_convert_const_wide_int_to_broadcast here.
        (ix86_split_long_move): Always un-cprop multi-word constants.
        * config/i386/i386-expand.h
        (ix86_convert_const_wide_int_to_broadcast): Prototype here.
        * config/i386/i386-features.cc: Include i386-expand.h.
        (timode_scalar_chain::convert_insn): When converting TImode to
        V1TImode, try ix86_convert_const_wide_int_to_broadcast.

gcc/testsuite/ChangeLog
        * gcc.target/i386/movti-2.c: New test case.
        * gcc.target/i386/movti-3.c: Likewise.