path: root/gcc/testsuite/gcc.target/i386/avx10_2-vmovd-1.c
author: Andi Kleen <ak@gcc.gnu.org> 2025-04-01 11:48:11 -0700
committer: Andi Kleen <ak@gcc.gnu.org> 2025-04-01 22:22:13 -0700
commit: 063fbd5a10d47d4957d605ca917480d02e054249
tree: d6ae14fce7db8a52b6ecad704725ef2e4d1c7d5a /gcc/testsuite/gcc.target/i386/avx10_2-vmovd-1.c
parent: 12533c0c8b27dcbb1b83517bf4eb09faa98bf814
PR119482: Avoid mispredictions in bitmap_set_bit

bitmap_set_bit checks the original value of the bit to return it to the
caller and then only writes the new value back if it changes. Most
callers of bitmap_set_bit don't need the return value, but with the
conditional store the CPU still has to predict the branch correctly,
since GCC doesn't know how to emit the conditional store without a
branch unless APX is available on x86 (even though CMOV could do it
with a dummy target). Really if-conversion should handle this case, but
for now we can fix it here.

This simple patch improves runtime by 15% for the test case in the PR.
That is more than I expected given that the function only accounts for
~1.44% of the cycles, but I guess the mispredicts caused some
downstream effects.

  cc1plus-bitmap -std=gnu++20 -O2 pr119482.cc -quiet ran
    1.15 ± 0.01 times faster than cc1plus -std=gnu++20 -O2 pr119482.cc -quiet

At least with this test case the total number of branches decreases
drastically. Even though the mispredict rate goes up slightly, it is
still a big win.

$ perf stat -e branches,branch-misses,uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ \
    -a ../obj-fast/gcc/cc1plus -std=gnu++20 -O2 pr119482.cc -quiet -w

 Performance counter stats for 'system wide':

    41,932,957,091      branches
       686,117,623      branch-misses    #    1.64% of all branches
         43,690.47 MiB  uncore_imc/cas_count_read/
         12,362.56 MiB  uncore_imc/cas_count_write/

      49.328633365 seconds time elapsed

$ perf stat -e branches,branch-misses,uncore_imc/cas_count_read/,uncore_imc/cas_count_write/ \
    -a ../obj-fast/gcc/cc1plus-bitmap -std=gnu++20 -O2 pr119482.cc -quiet -w

 Performance counter stats for 'system wide':

    37,092,113,179      branches
       663,641,708      branch-misses    #    1.79% of all branches
         43,196.52 MiB  uncore_imc/cas_count_read/
         12,369.33 MiB  uncore_imc/cas_count_write/

      42.632458350 seconds time elapsed

gcc/ChangeLog:

	PR middle-end/119482
	* bitmap.cc (bitmap_set_bit): Write back value unconditionally
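
For illustration only, the difference between a conditional and an unconditional write-back might be sketched as follows. This is not the code from bitmap.cc: `word_t`, `set_bit_conditional` and `set_bit_unconditional` are hypothetical names, the sketch operates on a single machine word rather than GCC's multi-element bitmap, and the return convention (true when the bit was previously clear) is an assumption made for the example.

```cpp
#include <cstdint>

// Hypothetical stand-in for one word of a bitmap; GCC's real bitmap
// is a multi-element structure and is not modelled here.
using word_t = std::uint64_t;

// Conditional store: the write happens only when the bit was clear,
// so the store is guarded by a data-dependent branch that the CPU
// has to predict even when the caller ignores the return value.
bool set_bit_conditional(word_t *word, unsigned bit)
{
  const word_t mask = word_t{1} << bit;
  const bool was_clear = (*word & mask) == 0;
  if (was_clear)
    *word |= mask;
  return was_clear;
}

// Unconditional store: the word is always written back, so no branch
// guards the store. The return value is still derived from the old
// value, but a caller that ignores it leaves nothing to predict.
bool set_bit_unconditional(word_t *word, unsigned bit)
{
  const word_t mask = word_t{1} << bit;
  const word_t old = *word;
  *word = old | mask;
  return (old & mask) == 0;
}
```

Both functions leave the word in the same state and return the same value; the second simply trades a possibly redundant store for the removal of the branch, which is the trade-off the commit message describes.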
Diffstat (limited to 'gcc/testsuite/gcc.target/i386/avx10_2-vmovd-1.c')
0 files changed, 0 insertions, 0 deletions