Age | Commit message (Collapse) | Author | Files | Lines |
|
The experimental PSTL's current dispatching mechanism was designed with
flexibility in mind. However, while reviewing the in-progress OpenMP
backend, I realized that the dispatching mechanism based on ADL and
default definitions in the frontend had several downsides. To name a
few:
1. The dispatching of an algorithm to the back-end and its default
implementation is bundled together via `_LIBCPP_PSTL_CUSTOMIZATION_POINT`.
This makes the dispatching really confusing and leads to annoyances
such as variable shadowing and weird lambda captures in the front-end.
2. The distinction between back-end functions and front-end algorithms
is not as clear as it could be, which led us to call one where we meant
the other in a few cases. This is bad due to the exception requirements
of the PSTL: calling a front-end algorithm inside the implementation of
a back-end is incorrect for exception-safety.
3. There are two levels of back-end dispatching in the PSTL, which treat
CPU backends as a special case. This was confusing and not as flexible
as we'd like. For example, there was no straightforward way to dispatch
all uses of `unseq` to a specific back-end from the OpenMP backend,
or for CPU backends to fall back on each other.
This patch rewrites the backend dispatching mechanism to solve these
problems, but doesn't touch any of the actual implementation of
algorithms. Specifically, this rewrite has the following
characteristics:
- There is a single level of backend dispatching, however partial backends can
be stacked to provide a full implementation of the PSTL. The two-level dispatching
that was used for CPU-based backends is handled by providing CPU-based basis
operations as simple helpers that can easily be reused when defining any PSTL
backend.
- The default definitions for algorithms are separated from their dispatching logic.
- The front-end is thus simplified a whole lot and made very consistent
for all algorithms, which makes it easier to audit the front-end for
things like exception-correctness, appropriate forwarding, etc.
Fixes #70718
|
|
Instead of using FOO_TEST_DEPS global variables that don't get updated
properly from subdirectories, use targets to propagate the dependencies
across directories.
|
|
The binary version is four times faster than current implementation in
my setup, and generally considered a better implementation.
Code inspired by https://en.algorithmica.org/hpc/algorithms/gcd/ which
itself is inspired by
https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/
Fix #77648
|
|
This is a preparation to measure the performance impact of
- P2713R1 Escaping improvements in std::format
and its performance improving followup patch.
|
|
The overhead of taking a std::mutex is much lower than taking a reader
lock on a shared mutex, even under heavy contention.
The benefit of shared_mutex only occurs as the amount of
time spent in the critical sections grows large enough.
In our case all we do is read a pointer and return the lock.
As a result, using a shared lock can be ~50%-100% slower
Here are the results for the provided benchmark on my machine:
```
2024-04-07T12:48:51-04:00
Running ./libcxx/benchmarks/shared_mutex_vs_mutex.libcxx.out
Run on (12 X 400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 1024 KiB (x6)
L3 Unified 32768 KiB (x1)
Load Average: 2.70, 2.70, 1.63
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_shared_mutex/threads:1 13.9 ns 13.9 ns 50533700
BM_shared_mutex/threads:2 34.5 ns 68.9 ns 9957784
BM_shared_mutex/threads:4 38.4 ns 137 ns 4987772
BM_shared_mutex/threads:8 51.1 ns 358 ns 1974160
BM_shared_mutex/threads:32 57.1 ns 682 ns 1043648
BM_mutex/threads:1 5.54 ns 5.53 ns 125867422
BM_mutex/threads:2 15.5 ns 30.9 ns 21830116
BM_mutex/threads:4 15.4 ns 57.2 ns 12136920
BM_mutex/threads:8 19.3 ns 140 ns 4997080
BM_mutex/threads:32 20.8 ns 252 ns 2859808
```
|
|
This allows Clang to vectorize the loop.
```
---------------------------------------------------------------------
Benchmark old new
---------------------------------------------------------------------
BM_std_minmax<char>/1 0.659 ns 1.41 ns
BM_std_minmax<char>/2 1.08 ns 2.16 ns
BM_std_minmax<char>/3 2.16 ns 2.96 ns
BM_std_minmax<char>/4 2.82 ns 3.81 ns
BM_std_minmax<char>/5 3.43 ns 4.69 ns
BM_std_minmax<char>/6 4.08 ns 5.63 ns
BM_std_minmax<char>/7 4.75 ns 6.51 ns
BM_std_minmax<char>/8 5.42 ns 7.41 ns
BM_std_minmax<char>/9 6.05 ns 8.34 ns
BM_std_minmax<char>/10 6.68 ns 9.29 ns
BM_std_minmax<char>/11 7.47 ns 10.6 ns
BM_std_minmax<char>/12 7.95 ns 11.4 ns
BM_std_minmax<char>/13 8.64 ns 12.4 ns
BM_std_minmax<char>/14 9.35 ns 13.4 ns
BM_std_minmax<char>/15 10.1 ns 14.4 ns
BM_std_minmax<char>/16 10.6 ns 2.25 ns
BM_std_minmax<char>/17 11.3 ns 2.82 ns
BM_std_minmax<char>/18 11.8 ns 3.71 ns
BM_std_minmax<char>/19 12.6 ns 4.52 ns
BM_std_minmax<char>/20 13.2 ns 5.47 ns
BM_std_minmax<char>/21 14.1 ns 6.67 ns
BM_std_minmax<char>/22 14.5 ns 7.78 ns
BM_std_minmax<char>/23 15.1 ns 8.67 ns
BM_std_minmax<char>/24 15.7 ns 9.68 ns
BM_std_minmax<char>/25 16.4 ns 10.7 ns
BM_std_minmax<char>/26 17.1 ns 11.7 ns
BM_std_minmax<char>/27 17.8 ns 12.8 ns
BM_std_minmax<char>/28 18.4 ns 14.1 ns
BM_std_minmax<char>/29 19.0 ns 15.0 ns
BM_std_minmax<char>/30 19.6 ns 16.0 ns
BM_std_minmax<char>/31 20.2 ns 17.0 ns
BM_std_minmax<char>/32 20.8 ns 2.46 ns
BM_std_minmax<char>/64 41.5 ns 2.97 ns
BM_std_minmax<char>/512 340 ns 6.05 ns
BM_std_minmax<char>/1024 667 ns 8.83 ns
BM_std_minmax<char>/4000 2571 ns 28.6 ns
BM_std_minmax<char>/4096 2632 ns 25.8 ns
BM_std_minmax<char>/5500 3554 ns 51.1 ns
BM_std_minmax<char>/64000 41175 ns 480 ns
BM_std_minmax<char>/65536 42039 ns 490 ns
BM_std_minmax<char>/70000 44931 ns 528 ns
BM_std_minmax<short>/1 0.708 ns 1.20 ns
BM_std_minmax<short>/2 1.18 ns 1.78 ns
BM_std_minmax<short>/3 1.98 ns 2.42 ns
BM_std_minmax<short>/4 2.47 ns 3.05 ns
BM_std_minmax<short>/5 3.09 ns 3.72 ns
BM_std_minmax<short>/6 3.49 ns 4.37 ns
BM_std_minmax<short>/7 4.24 ns 5.03 ns
BM_std_minmax<short>/8 4.65 ns 2.12 ns
BM_std_minmax<short>/9 5.34 ns 2.51 ns
BM_std_minmax<short>/10 5.82 ns 3.18 ns
BM_std_minmax<short>/11 6.36 ns 3.97 ns
BM_std_minmax<short>/12 6.73 ns 4.68 ns
BM_std_minmax<short>/13 7.59 ns 5.49 ns
BM_std_minmax<short>/14 7.77 ns 6.45 ns
BM_std_minmax<short>/15 8.54 ns 7.55 ns
BM_std_minmax<short>/16 8.74 ns 2.38 ns
BM_std_minmax<short>/17 9.59 ns 2.76 ns
BM_std_minmax<short>/18 9.88 ns 3.37 ns
BM_std_minmax<short>/19 10.7 ns 4.17 ns
BM_std_minmax<short>/20 10.9 ns 4.88 ns
BM_std_minmax<short>/21 12.1 ns 5.70 ns
BM_std_minmax<short>/22 12.6 ns 6.64 ns
BM_std_minmax<short>/23 13.5 ns 7.72 ns
BM_std_minmax<short>/24 13.2 ns 2.87 ns
BM_std_minmax<short>/25 14.2 ns 3.10 ns
BM_std_minmax<short>/26 14.2 ns 3.59 ns
BM_std_minmax<short>/27 15.4 ns 4.35 ns
BM_std_minmax<short>/28 15.3 ns 5.10 ns
BM_std_minmax<short>/29 16.2 ns 5.87 ns
BM_std_minmax<short>/30 16.2 ns 6.88 ns
BM_std_minmax<short>/31 17.0 ns 7.78 ns
BM_std_minmax<short>/32 17.2 ns 3.45 ns
BM_std_minmax<short>/64 34.1 ns 3.35 ns
BM_std_minmax<short>/512 279 ns 8.37 ns
BM_std_minmax<short>/1024 549 ns 14.2 ns
BM_std_minmax<short>/4000 2111 ns 50.1 ns
BM_std_minmax<short>/4096 2167 ns 47.9 ns
BM_std_minmax<short>/5500 2895 ns 69.7 ns
BM_std_minmax<short>/64000 33454 ns 953 ns
BM_std_minmax<short>/65536 34474 ns 970 ns
BM_std_minmax<short>/70000 36691 ns 1037 ns
BM_std_minmax<int>/1 0.664 ns 1.17 ns
BM_std_minmax<int>/2 1.11 ns 1.69 ns
BM_std_minmax<int>/3 2.36 ns 2.29 ns
BM_std_minmax<int>/4 2.53 ns 2.91 ns
BM_std_minmax<int>/5 3.23 ns 3.56 ns
BM_std_minmax<int>/6 3.56 ns 4.23 ns
BM_std_minmax<int>/7 4.28 ns 4.91 ns
BM_std_minmax<int>/8 4.60 ns 5.60 ns
BM_std_minmax<int>/9 5.38 ns 6.31 ns
BM_std_minmax<int>/10 5.69 ns 7.03 ns
BM_std_minmax<int>/11 6.41 ns 7.70 ns
BM_std_minmax<int>/12 6.73 ns 8.39 ns
BM_std_minmax<int>/13 7.38 ns 9.07 ns
BM_std_minmax<int>/14 7.74 ns 9.79 ns
BM_std_minmax<int>/15 8.53 ns 10.5 ns
BM_std_minmax<int>/16 8.79 ns 11.2 ns
BM_std_minmax<int>/17 9.63 ns 12.0 ns
BM_std_minmax<int>/18 9.84 ns 12.7 ns
BM_std_minmax<int>/19 10.6 ns 13.5 ns
BM_std_minmax<int>/20 11.0 ns 14.3 ns
BM_std_minmax<int>/21 11.7 ns 15.0 ns
BM_std_minmax<int>/22 12.0 ns 15.7 ns
BM_std_minmax<int>/23 13.1 ns 16.5 ns
BM_std_minmax<int>/24 13.0 ns 17.3 ns
BM_std_minmax<int>/25 13.7 ns 17.9 ns
BM_std_minmax<int>/26 14.0 ns 18.6 ns
BM_std_minmax<int>/27 14.8 ns 19.4 ns
BM_std_minmax<int>/28 15.1 ns 20.3 ns
BM_std_minmax<int>/29 15.8 ns 20.9 ns
BM_std_minmax<int>/30 16.1 ns 21.7 ns
BM_std_minmax<int>/31 16.9 ns 22.5 ns
BM_std_minmax<int>/32 17.2 ns 3.40 ns
BM_std_minmax<int>/64 33.9 ns 4.04 ns
BM_std_minmax<int>/512 275 ns 14.6 ns
BM_std_minmax<int>/1024 541 ns 27.5 ns
BM_std_minmax<int>/4000 2093 ns 96.3 ns
BM_std_minmax<int>/4096 2146 ns 98.3 ns
BM_std_minmax<int>/5500 2866 ns 157 ns
BM_std_minmax<int>/64000 33619 ns 1954 ns
BM_std_minmax<int>/65536 34252 ns 2009 ns
BM_std_minmax<int>/70000 36618 ns 2125 ns
BM_std_minmax<long long>/1 0.709 ns 1.19 ns
BM_std_minmax<long long>/2 1.01 ns 1.65 ns
BM_std_minmax<long long>/3 2.14 ns 2.21 ns
BM_std_minmax<long long>/4 2.45 ns 2.83 ns
BM_std_minmax<long long>/5 3.09 ns 3.47 ns
BM_std_minmax<long long>/6 3.44 ns 4.11 ns
BM_std_minmax<long long>/7 4.16 ns 4.79 ns
BM_std_minmax<long long>/8 4.54 ns 5.47 ns
BM_std_minmax<long long>/9 5.37 ns 6.20 ns
BM_std_minmax<long long>/10 5.71 ns 6.93 ns
BM_std_minmax<long long>/11 6.00 ns 7.60 ns
BM_std_minmax<long long>/12 6.43 ns 8.27 ns
BM_std_minmax<long long>/13 7.01 ns 8.94 ns
BM_std_minmax<long long>/14 7.45 ns 9.65 ns
BM_std_minmax<long long>/15 8.16 ns 10.4 ns
BM_std_minmax<long long>/16 8.46 ns 5.22 ns
BM_std_minmax<long long>/17 9.16 ns 5.22 ns
BM_std_minmax<long long>/18 9.53 ns 5.52 ns
BM_std_minmax<long long>/19 10.2 ns 6.02 ns
BM_std_minmax<long long>/20 10.5 ns 6.89 ns
BM_std_minmax<long long>/21 11.3 ns 7.83 ns
BM_std_minmax<long long>/22 11.6 ns 8.59 ns
BM_std_minmax<long long>/23 12.3 ns 9.91 ns
BM_std_minmax<long long>/24 12.6 ns 10.1 ns
BM_std_minmax<long long>/25 13.2 ns 12.0 ns
BM_std_minmax<long long>/26 13.6 ns 13.5 ns
BM_std_minmax<long long>/27 14.2 ns 14.8 ns
BM_std_minmax<long long>/28 14.7 ns 15.9 ns
BM_std_minmax<long long>/29 15.3 ns 16.6 ns
BM_std_minmax<long long>/30 15.8 ns 17.3 ns
BM_std_minmax<long long>/31 16.3 ns 18.2 ns
BM_std_minmax<long long>/32 16.7 ns 7.18 ns
BM_std_minmax<long long>/64 33.1 ns 11.5 ns
BM_std_minmax<long long>/512 268 ns 71.0 ns
BM_std_minmax<long long>/1024 532 ns 138 ns
BM_std_minmax<long long>/4000 2056 ns 533 ns
BM_std_minmax<long long>/4096 2112 ns 539 ns
BM_std_minmax<long long>/5500 2823 ns 749 ns
BM_std_minmax<long long>/64000 32956 ns 8590 ns
BM_std_minmax<long long>/65536 33795 ns 8791 ns
BM_std_minmax<long long>/70000 36084 ns 9442 ns
BM_std_minmax<unsigned char>/1 0.714 ns 1.41 ns
BM_std_minmax<unsigned char>/2 0.955 ns 1.96 ns
BM_std_minmax<unsigned char>/3 1.90 ns 2.63 ns
BM_std_minmax<unsigned char>/4 2.40 ns 3.34 ns
BM_std_minmax<unsigned char>/5 2.87 ns 4.10 ns
BM_std_minmax<unsigned char>/6 3.47 ns 4.88 ns
BM_std_minmax<unsigned char>/7 4.04 ns 5.66 ns
BM_std_minmax<unsigned char>/8 4.65 ns 6.45 ns
BM_std_minmax<unsigned char>/9 5.18 ns 7.24 ns
BM_std_minmax<unsigned char>/10 5.80 ns 8.05 ns
BM_std_minmax<unsigned char>/11 6.24 ns 8.86 ns
BM_std_minmax<unsigned char>/12 6.78 ns 9.70 ns
BM_std_minmax<unsigned char>/13 7.30 ns 10.6 ns
BM_std_minmax<unsigned char>/14 7.86 ns 11.4 ns
BM_std_minmax<unsigned char>/15 8.46 ns 12.3 ns
BM_std_minmax<unsigned char>/16 9.00 ns 2.12 ns
BM_std_minmax<unsigned char>/17 9.58 ns 2.83 ns
BM_std_minmax<unsigned char>/18 10.1 ns 3.37 ns
BM_std_minmax<unsigned char>/19 10.7 ns 4.11 ns
BM_std_minmax<unsigned char>/20 11.2 ns 4.85 ns
BM_std_minmax<unsigned char>/21 11.9 ns 5.69 ns
BM_std_minmax<unsigned char>/22 12.3 ns 6.77 ns
BM_std_minmax<unsigned char>/23 13.1 ns 7.56 ns
BM_std_minmax<unsigned char>/24 13.5 ns 8.40 ns
BM_std_minmax<unsigned char>/25 14.2 ns 9.30 ns
BM_std_minmax<unsigned char>/26 14.4 ns 10.1 ns
BM_std_minmax<unsigned char>/27 15.0 ns 11.1 ns
BM_std_minmax<unsigned char>/28 15.3 ns 11.9 ns
BM_std_minmax<unsigned char>/29 16.2 ns 12.9 ns
BM_std_minmax<unsigned char>/30 16.5 ns 13.9 ns
BM_std_minmax<unsigned char>/31 17.2 ns 14.8 ns
BM_std_minmax<unsigned char>/32 17.6 ns 2.36 ns
BM_std_minmax<unsigned char>/64 35.6 ns 3.21 ns
BM_std_minmax<unsigned char>/512 288 ns 6.00 ns
BM_std_minmax<unsigned char>/1024 573 ns 8.80 ns
BM_std_minmax<unsigned char>/4000 2222 ns 28.6 ns
BM_std_minmax<unsigned char>/4096 2265 ns 25.9 ns
BM_std_minmax<unsigned char>/5500 3047 ns 48.8 ns
BM_std_minmax<unsigned char>/64000 35059 ns 480 ns
BM_std_minmax<unsigned char>/65536 35941 ns 491 ns
BM_std_minmax<unsigned char>/70000 38922 ns 525 ns
BM_std_minmax<unsigned short>/1 0.711 ns 1.18 ns
BM_std_minmax<unsigned short>/2 0.957 ns 1.65 ns
BM_std_minmax<unsigned short>/3 2.13 ns 2.21 ns
BM_std_minmax<unsigned short>/4 2.14 ns 2.78 ns
BM_std_minmax<unsigned short>/5 3.06 ns 3.29 ns
BM_std_minmax<unsigned short>/6 2.89 ns 3.87 ns
BM_std_minmax<unsigned short>/7 3.80 ns 4.55 ns
BM_std_minmax<unsigned short>/8 3.68 ns 2.02 ns
BM_std_minmax<unsigned short>/9 4.53 ns 2.40 ns
BM_std_minmax<unsigned short>/10 4.60 ns 2.94 ns
BM_std_minmax<unsigned short>/11 5.67 ns 3.67 ns
BM_std_minmax<unsigned short>/12 5.39 ns 4.22 ns
BM_std_minmax<unsigned short>/13 6.58 ns 4.78 ns
BM_std_minmax<unsigned short>/14 6.33 ns 5.54 ns
BM_std_minmax<unsigned short>/15 7.34 ns 6.30 ns
BM_std_minmax<unsigned short>/16 7.17 ns 2.25 ns
BM_std_minmax<unsigned short>/17 8.19 ns 2.61 ns
BM_std_minmax<unsigned short>/18 8.02 ns 3.19 ns
BM_std_minmax<unsigned short>/19 9.03 ns 3.72 ns
BM_std_minmax<unsigned short>/20 8.89 ns 4.36 ns
BM_std_minmax<unsigned short>/21 9.77 ns 5.10 ns
BM_std_minmax<unsigned short>/22 9.70 ns 5.55 ns
BM_std_minmax<unsigned short>/23 10.8 ns 6.29 ns
BM_std_minmax<unsigned short>/24 10.6 ns 2.41 ns
BM_std_minmax<unsigned short>/25 11.6 ns 2.75 ns
BM_std_minmax<unsigned short>/26 11.4 ns 3.26 ns
BM_std_minmax<unsigned short>/27 12.4 ns 3.86 ns
BM_std_minmax<unsigned short>/28 12.3 ns 4.45 ns
BM_std_minmax<unsigned short>/29 13.2 ns 5.07 ns
BM_std_minmax<unsigned short>/30 13.1 ns 5.77 ns
BM_std_minmax<unsigned short>/31 13.9 ns 6.65 ns
BM_std_minmax<unsigned short>/32 13.9 ns 2.72 ns
BM_std_minmax<unsigned short>/64 27.8 ns 3.25 ns
BM_std_minmax<unsigned short>/512 220 ns 8.30 ns
BM_std_minmax<unsigned short>/1024 435 ns 14.1 ns
BM_std_minmax<unsigned short>/4000 1703 ns 49.8 ns
BM_std_minmax<unsigned short>/4096 1746 ns 47.9 ns
BM_std_minmax<unsigned short>/5500 2350 ns 69.9 ns
BM_std_minmax<unsigned short>/64000 27388 ns 953 ns
BM_std_minmax<unsigned short>/65536 28040 ns 975 ns
BM_std_minmax<unsigned short>/70000 29967 ns 1040 ns
BM_std_minmax<unsigned int>/1 0.712 ns 1.18 ns
BM_std_minmax<unsigned int>/2 0.965 ns 1.65 ns
BM_std_minmax<unsigned int>/3 2.13 ns 2.14 ns
BM_std_minmax<unsigned int>/4 2.09 ns 2.64 ns
BM_std_minmax<unsigned int>/5 3.02 ns 3.21 ns
BM_std_minmax<unsigned int>/6 2.94 ns 3.81 ns
BM_std_minmax<unsigned int>/7 3.91 ns 4.38 ns
BM_std_minmax<unsigned int>/8 3.75 ns 4.93 ns
BM_std_minmax<unsigned int>/9 4.71 ns 5.60 ns
BM_std_minmax<unsigned int>/10 4.59 ns 6.26 ns
BM_std_minmax<unsigned int>/11 5.57 ns 6.80 ns
BM_std_minmax<unsigned int>/12 5.43 ns 7.47 ns
BM_std_minmax<unsigned int>/13 6.45 ns 8.10 ns
BM_std_minmax<unsigned int>/14 6.32 ns 8.69 ns
BM_std_minmax<unsigned int>/15 7.29 ns 9.37 ns
BM_std_minmax<unsigned int>/16 7.12 ns 9.99 ns
BM_std_minmax<unsigned int>/17 8.24 ns 10.6 ns
BM_std_minmax<unsigned int>/18 8.00 ns 11.2 ns
BM_std_minmax<unsigned int>/19 8.94 ns 12.0 ns
BM_std_minmax<unsigned int>/20 8.91 ns 12.6 ns
BM_std_minmax<unsigned int>/21 9.73 ns 17.2 ns
BM_std_minmax<unsigned int>/22 9.75 ns 13.8 ns
BM_std_minmax<unsigned int>/23 10.6 ns 14.5 ns
BM_std_minmax<unsigned int>/24 10.6 ns 15.1 ns
BM_std_minmax<unsigned int>/25 11.5 ns 15.7 ns
BM_std_minmax<unsigned int>/26 11.4 ns 16.3 ns
BM_std_minmax<unsigned int>/27 12.3 ns 17.0 ns
BM_std_minmax<unsigned int>/28 12.3 ns 17.6 ns
BM_std_minmax<unsigned int>/29 13.2 ns 18.3 ns
BM_std_minmax<unsigned int>/30 13.2 ns 19.0 ns
BM_std_minmax<unsigned int>/31 14.0 ns 19.6 ns
BM_std_minmax<unsigned int>/32 14.0 ns 3.39 ns
BM_std_minmax<unsigned int>/64 27.6 ns 4.05 ns
BM_std_minmax<unsigned int>/512 221 ns 14.2 ns
BM_std_minmax<unsigned int>/1024 439 ns 25.5 ns
BM_std_minmax<unsigned int>/4000 1720 ns 96.3 ns
BM_std_minmax<unsigned int>/4096 1762 ns 97.8 ns
BM_std_minmax<unsigned int>/5500 2364 ns 146 ns
BM_std_minmax<unsigned int>/64000 27874 ns 1905 ns
BM_std_minmax<unsigned int>/65536 28012 ns 1961 ns
BM_std_minmax<unsigned int>/70000 29899 ns 2087 ns
BM_std_minmax<unsigned long long>/1 0.707 ns 1.18 ns
BM_std_minmax<unsigned long long>/2 0.909 ns 1.65 ns
BM_std_minmax<unsigned long long>/3 1.65 ns 2.70 ns
BM_std_minmax<unsigned long long>/4 1.93 ns 2.69 ns
BM_std_minmax<unsigned long long>/5 2.45 ns 3.34 ns
BM_std_minmax<unsigned long long>/6 2.78 ns 3.81 ns
BM_std_minmax<unsigned long long>/7 3.28 ns 4.43 ns
BM_std_minmax<unsigned long long>/8 3.70 ns 4.92 ns
BM_std_minmax<unsigned long long>/9 4.12 ns 5.64 ns
BM_std_minmax<unsigned long long>/10 4.44 ns 6.15 ns
BM_std_minmax<unsigned long long>/11 4.91 ns 6.81 ns
BM_std_minmax<unsigned long long>/12 5.31 ns 7.41 ns
BM_std_minmax<unsigned long long>/13 5.72 ns 7.96 ns
BM_std_minmax<unsigned long long>/14 6.05 ns 8.66 ns
BM_std_minmax<unsigned long long>/15 6.55 ns 9.37 ns
BM_std_minmax<unsigned long long>/16 6.89 ns 7.98 ns
BM_std_minmax<unsigned long long>/17 7.34 ns 8.13 ns
BM_std_minmax<unsigned long long>/18 7.73 ns 8.42 ns
BM_std_minmax<unsigned long long>/19 8.26 ns 8.63 ns
BM_std_minmax<unsigned long long>/20 8.54 ns 8.96 ns
BM_std_minmax<unsigned long long>/21 9.14 ns 9.37 ns
BM_std_minmax<unsigned long long>/22 9.39 ns 9.67 ns
BM_std_minmax<unsigned long long>/23 10.1 ns 10.1 ns
BM_std_minmax<unsigned long long>/24 10.4 ns 10.6 ns
BM_std_minmax<unsigned long long>/25 11.0 ns 11.3 ns
BM_std_minmax<unsigned long long>/26 11.3 ns 12.1 ns
BM_std_minmax<unsigned long long>/27 11.8 ns 14.2 ns
BM_std_minmax<unsigned long long>/28 12.1 ns 15.8 ns
BM_std_minmax<unsigned long long>/29 12.6 ns 17.4 ns
BM_std_minmax<unsigned long long>/30 13.1 ns 18.1 ns
BM_std_minmax<unsigned long long>/31 13.4 ns 18.8 ns
BM_std_minmax<unsigned long long>/32 13.8 ns 10.4 ns
BM_std_minmax<unsigned long long>/64 27.3 ns 15.5 ns
BM_std_minmax<unsigned long long>/512 222 ns 80.6 ns
BM_std_minmax<unsigned long long>/1024 443 ns 156 ns
BM_std_minmax<unsigned long long>/4000 1731 ns 591 ns
BM_std_minmax<unsigned long long>/4096 1752 ns 609 ns
BM_std_minmax<unsigned long long>/5500 2340 ns 819 ns
BM_std_minmax<unsigned long long>/64000 27166 ns 9652 ns
BM_std_minmax<unsigned long long>/65536 27869 ns 9876 ns
BM_std_minmax<unsigned long long>/70000 29920 ns 10680 ns
```
|
|
```
-----------------------------------------------------------------------------
Benchmark old new
-----------------------------------------------------------------------------
bm_mismatch_two_range_overload<char>/1 0.941 ns 1.88 ns
bm_mismatch_two_range_overload<char>/2 1.43 ns 2.15 ns
bm_mismatch_two_range_overload<char>/3 1.95 ns 2.55 ns
bm_mismatch_two_range_overload<char>/4 2.58 ns 2.90 ns
bm_mismatch_two_range_overload<char>/5 3.75 ns 3.31 ns
bm_mismatch_two_range_overload<char>/6 5.00 ns 3.83 ns
bm_mismatch_two_range_overload<char>/7 5.59 ns 4.35 ns
bm_mismatch_two_range_overload<char>/8 6.37 ns 4.84 ns
bm_mismatch_two_range_overload<char>/16 11.8 ns 6.72 ns
bm_mismatch_two_range_overload<char>/64 45.5 ns 2.59 ns
bm_mismatch_two_range_overload<char>/512 366 ns 12.6 ns
bm_mismatch_two_range_overload<char>/4096 2890 ns 91.6 ns
bm_mismatch_two_range_overload<char>/32768 23038 ns 758 ns
bm_mismatch_two_range_overload<char>/262144 142813 ns 6573 ns
bm_mismatch_two_range_overload<char>/1048576 366679 ns 26710 ns
bm_mismatch_two_range_overload<short>/1 0.934 ns 1.88 ns
bm_mismatch_two_range_overload<short>/2 1.30 ns 2.58 ns
bm_mismatch_two_range_overload<short>/3 1.76 ns 3.28 ns
bm_mismatch_two_range_overload<short>/4 2.24 ns 3.98 ns
bm_mismatch_two_range_overload<short>/5 2.80 ns 4.92 ns
bm_mismatch_two_range_overload<short>/6 3.58 ns 6.01 ns
bm_mismatch_two_range_overload<short>/7 4.29 ns 7.03 ns
bm_mismatch_two_range_overload<short>/8 4.67 ns 7.39 ns
bm_mismatch_two_range_overload<short>/16 9.86 ns 13.1 ns
bm_mismatch_two_range_overload<short>/64 38.9 ns 4.55 ns
bm_mismatch_two_range_overload<short>/512 348 ns 27.7 ns
bm_mismatch_two_range_overload<short>/4096 2881 ns 225 ns
bm_mismatch_two_range_overload<short>/32768 23111 ns 1715 ns
bm_mismatch_two_range_overload<short>/262144 184846 ns 14416 ns
bm_mismatch_two_range_overload<short>/1048576 742885 ns 57264 ns
bm_mismatch_two_range_overload<int>/1 0.838 ns 1.19 ns
bm_mismatch_two_range_overload<int>/2 1.19 ns 1.65 ns
bm_mismatch_two_range_overload<int>/3 1.83 ns 2.06 ns
bm_mismatch_two_range_overload<int>/4 2.38 ns 2.42 ns
bm_mismatch_two_range_overload<int>/5 3.60 ns 2.47 ns
bm_mismatch_two_range_overload<int>/6 3.68 ns 3.05 ns
bm_mismatch_two_range_overload<int>/7 4.32 ns 3.36 ns
bm_mismatch_two_range_overload<int>/8 5.18 ns 3.58 ns
bm_mismatch_two_range_overload<int>/16 10.6 ns 2.84 ns
bm_mismatch_two_range_overload<int>/64 39.0 ns 7.78 ns
bm_mismatch_two_range_overload<int>/512 247 ns 53.9 ns
bm_mismatch_two_range_overload<int>/4096 1927 ns 429 ns
bm_mismatch_two_range_overload<int>/32768 15569 ns 3393 ns
bm_mismatch_two_range_overload<int>/262144 125413 ns 28504 ns
bm_mismatch_two_range_overload<int>/1048576 504549 ns 112729 ns
```
|
|
This adds vectorization to the last 0-3 vectors and, if the range is
large enough, the remaining elements that don't fill a vector
completely.
```
-----------------------------------------------------------------------
Benchmark old full vectors partial vector
-----------------------------------------------------------------------
bm_mismatch<char>/1 1.40 ns 1.62 ns 2.09 ns
bm_mismatch<char>/2 1.88 ns 2.10 ns 2.33 ns
bm_mismatch<char>/3 2.67 ns 2.56 ns 2.72 ns
bm_mismatch<char>/4 3.01 ns 3.20 ns 3.70 ns
bm_mismatch<char>/5 3.51 ns 3.73 ns 3.64 ns
bm_mismatch<char>/6 4.71 ns 4.85 ns 4.37 ns
bm_mismatch<char>/7 5.12 ns 5.33 ns 4.37 ns
bm_mismatch<char>/8 5.79 ns 6.02 ns 4.75 ns
bm_mismatch<char>/15 9.20 ns 10.5 ns 7.23 ns
bm_mismatch<char>/16 10.2 ns 10.1 ns 7.46 ns
bm_mismatch<char>/17 10.2 ns 10.8 ns 7.57 ns
bm_mismatch<char>/31 17.6 ns 17.1 ns 10.8 ns
bm_mismatch<char>/32 17.4 ns 1.64 ns 1.64 ns
bm_mismatch<char>/33 23.3 ns 2.10 ns 2.33 ns
bm_mismatch<char>/63 31.8 ns 16.9 ns 2.33 ns
bm_mismatch<char>/64 32.6 ns 2.10 ns 2.10 ns
bm_mismatch<char>/65 33.6 ns 2.57 ns 2.80 ns
bm_mismatch<char>/127 67.3 ns 18.1 ns 3.27 ns
bm_mismatch<char>/128 2.17 ns 2.14 ns 2.57 ns
bm_mismatch<char>/129 2.36 ns 2.80 ns 3.27 ns
bm_mismatch<char>/255 67.5 ns 19.6 ns 4.68 ns
bm_mismatch<char>/256 3.76 ns 3.71 ns 3.97 ns
bm_mismatch<char>/257 3.77 ns 4.04 ns 4.43 ns
bm_mismatch<char>/511 70.8 ns 22.1 ns 7.47 ns
bm_mismatch<char>/512 7.27 ns 7.30 ns 6.95 ns
bm_mismatch<char>/513 7.11 ns 7.05 ns 6.96 ns
bm_mismatch<char>/1023 75.9 ns 27.4 ns 13.3 ns
bm_mismatch<char>/1024 13.9 ns 13.8 ns 12.4 ns
bm_mismatch<char>/1025 13.6 ns 13.6 ns 12.8 ns
bm_mismatch<char>/2047 87.3 ns 37.5 ns 25.4 ns
bm_mismatch<char>/2048 26.8 ns 27.4 ns 24.0 ns
bm_mismatch<char>/2049 26.7 ns 27.3 ns 25.5 ns
bm_mismatch<char>/4095 112 ns 64.7 ns 48.7 ns
bm_mismatch<char>/4096 53.0 ns 54.2 ns 46.8 ns
bm_mismatch<char>/4097 52.7 ns 54.2 ns 48.4 ns
bm_mismatch<char>/8191 160 ns 118 ns 98.4 ns
bm_mismatch<char>/8192 107 ns 108 ns 96.0 ns
bm_mismatch<char>/8193 106 ns 108 ns 97.2 ns
bm_mismatch<char>/16383 283 ns 234 ns 215 ns
bm_mismatch<char>/16384 227 ns 223 ns 217 ns
bm_mismatch<char>/16385 221 ns 221 ns 215 ns
bm_mismatch<char>/32767 547 ns 499 ns 488 ns
bm_mismatch<char>/32768 495 ns 492 ns 492 ns
bm_mismatch<char>/32769 491 ns 489 ns 488 ns
bm_mismatch<char>/65535 1028 ns 979 ns 971 ns
bm_mismatch<char>/65536 976 ns 970 ns 974 ns
bm_mismatch<char>/65537 970 ns 965 ns 971 ns
bm_mismatch<char>/131071 2031 ns 1948 ns 2005 ns
bm_mismatch<char>/131072 1973 ns 1955 ns 1974 ns
bm_mismatch<char>/131073 1989 ns 1932 ns 2001 ns
bm_mismatch<char>/262143 4469 ns 4244 ns 4223 ns
bm_mismatch<char>/262144 4443 ns 4183 ns 4243 ns
bm_mismatch<char>/262145 4400 ns 4232 ns 4246 ns
bm_mismatch<char>/524287 10169 ns 9733 ns 9592 ns
bm_mismatch<char>/524288 10154 ns 9664 ns 9843 ns
bm_mismatch<char>/524289 10113 ns 9641 ns 10003 ns
bm_mismatch<short>/1 1.86 ns 2.53 ns 2.32 ns
bm_mismatch<short>/2 2.57 ns 2.77 ns 2.55 ns
bm_mismatch<short>/3 3.26 ns 3.00 ns 2.79 ns
bm_mismatch<short>/4 3.95 ns 3.39 ns 3.15 ns
bm_mismatch<short>/5 4.83 ns 3.97 ns 3.72 ns
bm_mismatch<short>/6 5.43 ns 4.34 ns 4.03 ns
bm_mismatch<short>/7 6.11 ns 4.73 ns 4.44 ns
bm_mismatch<short>/8 6.84 ns 5.02 ns 4.79 ns
bm_mismatch<short>/15 11.5 ns 7.12 ns 6.50 ns
bm_mismatch<short>/16 13.9 ns 1.87 ns 2.11 ns
bm_mismatch<short>/17 14.0 ns 3.00 ns 2.47 ns
bm_mismatch<short>/31 23.1 ns 7.87 ns 2.47 ns
bm_mismatch<short>/32 23.8 ns 2.57 ns 2.81 ns
bm_mismatch<short>/33 24.5 ns 3.70 ns 2.94 ns
bm_mismatch<short>/63 44.8 ns 9.37 ns 3.46 ns
bm_mismatch<short>/64 2.32 ns 2.57 ns 2.64 ns
bm_mismatch<short>/65 2.52 ns 3.02 ns 3.51 ns
bm_mismatch<short>/127 45.6 ns 9.97 ns 5.18 ns
bm_mismatch<short>/128 3.85 ns 3.93 ns 3.94 ns
bm_mismatch<short>/129 3.82 ns 4.20 ns 4.70 ns
bm_mismatch<short>/255 50.4 ns 12.6 ns 8.07 ns
bm_mismatch<short>/256 7.23 ns 6.91 ns 6.98 ns
bm_mismatch<short>/257 7.24 ns 7.19 ns 7.55 ns
bm_mismatch<short>/511 52.3 ns 17.8 ns 14.0 ns
bm_mismatch<short>/512 13.6 ns 13.7 ns 13.6 ns
bm_mismatch<short>/513 13.9 ns 13.8 ns 18.5 ns
bm_mismatch<short>/1023 60.9 ns 30.9 ns 26.3 ns
bm_mismatch<short>/1024 26.7 ns 27.7 ns 25.7 ns
bm_mismatch<short>/1025 27.7 ns 27.6 ns 25.3 ns
bm_mismatch<short>/2047 88.4 ns 58.0 ns 51.6 ns
bm_mismatch<short>/2048 52.8 ns 55.3 ns 50.6 ns
bm_mismatch<short>/2049 55.2 ns 54.8 ns 48.7 ns
bm_mismatch<short>/4095 153 ns 113 ns 102 ns
bm_mismatch<short>/4096 105 ns 110 ns 101 ns
bm_mismatch<short>/4097 110 ns 110 ns 99.1 ns
bm_mismatch<short>/8191 277 ns 219 ns 206 ns
bm_mismatch<short>/8192 226 ns 214 ns 250 ns
bm_mismatch<short>/8193 226 ns 207 ns 208 ns
bm_mismatch<short>/16383 519 ns 492 ns 488 ns
bm_mismatch<short>/16384 494 ns 492 ns 492 ns
bm_mismatch<short>/16385 492 ns 488 ns 489 ns
bm_mismatch<short>/32767 1007 ns 968 ns 964 ns
bm_mismatch<short>/32768 977 ns 972 ns 970 ns
bm_mismatch<short>/32769 972 ns 962 ns 967 ns
bm_mismatch<short>/65535 1978 ns 1918 ns 1956 ns
bm_mismatch<short>/65536 1940 ns 1927 ns 1970 ns
bm_mismatch<short>/65537 1937 ns 1922 ns 1959 ns
bm_mismatch<short>/131071 4524 ns 4193 ns 4304 ns
bm_mismatch<short>/131072 4445 ns 4196 ns 4306 ns
bm_mismatch<short>/131073 4452 ns 4278 ns 4311 ns
bm_mismatch<short>/262143 9801 ns 10188 ns 9634 ns
bm_mismatch<short>/262144 9738 ns 10151 ns 9651 ns
bm_mismatch<short>/262145 9716 ns 10171 ns 9715 ns
bm_mismatch<short>/524287 19944 ns 20718 ns 20044 ns
bm_mismatch<short>/524288 21139 ns 20647 ns 20008 ns
bm_mismatch<short>/524289 21162 ns 19512 ns 20068 ns
bm_mismatch<int>/1 1.40 ns 1.84 ns 1.87 ns
bm_mismatch<int>/2 1.87 ns 2.08 ns 2.09 ns
bm_mismatch<int>/3 2.36 ns 2.31 ns 2.87 ns
bm_mismatch<int>/4 3.06 ns 2.72 ns 2.95 ns
bm_mismatch<int>/5 3.66 ns 3.37 ns 3.42 ns
bm_mismatch<int>/6 4.55 ns 3.65 ns 3.73 ns
bm_mismatch<int>/7 5.03 ns 3.93 ns 3.94 ns
bm_mismatch<int>/8 5.67 ns 1.86 ns 1.87 ns
bm_mismatch<int>/15 9.89 ns 4.41 ns 2.34 ns
bm_mismatch<int>/16 10.1 ns 2.33 ns 2.34 ns
bm_mismatch<int>/17 10.2 ns 3.34 ns 2.86 ns
bm_mismatch<int>/31 17.2 ns 5.54 ns 3.28 ns
bm_mismatch<int>/32 2.16 ns 2.15 ns 2.58 ns
bm_mismatch<int>/33 2.36 ns 3.01 ns 3.28 ns
bm_mismatch<int>/63 17.7 ns 6.50 ns 4.93 ns
bm_mismatch<int>/64 3.81 ns 3.58 ns 3.90 ns
bm_mismatch<int>/65 3.74 ns 4.36 ns 4.45 ns
bm_mismatch<int>/127 19.5 ns 9.56 ns 7.74 ns
bm_mismatch<int>/128 7.30 ns 6.41 ns 6.85 ns
bm_mismatch<int>/129 7.09 ns 7.04 ns 7.06 ns
bm_mismatch<int>/255 24.7 ns 14.8 ns 13.3 ns
bm_mismatch<int>/256 14.0 ns 12.1 ns 12.3 ns
bm_mismatch<int>/257 13.8 ns 12.7 ns 12.8 ns
bm_mismatch<int>/511 34.3 ns 26.3 ns 24.8 ns
bm_mismatch<int>/512 27.6 ns 23.6 ns 23.9 ns
bm_mismatch<int>/513 27.3 ns 24.4 ns 25.1 ns
bm_mismatch<int>/1023 62.5 ns 50.9 ns 48.3 ns
bm_mismatch<int>/1024 54.4 ns 46.1 ns 46.6 ns
bm_mismatch<int>/1025 54.2 ns 48.4 ns 47.5 ns
bm_mismatch<int>/2047 116 ns 97.8 ns 94.1 ns
bm_mismatch<int>/2048 108 ns 92.6 ns 92.4 ns
bm_mismatch<int>/2049 108 ns 104 ns 94.0 ns
bm_mismatch<int>/4095 233 ns 222 ns 205 ns
bm_mismatch<int>/4096 226 ns 223 ns 225 ns
bm_mismatch<int>/4097 221 ns 219 ns 210 ns
bm_mismatch<int>/8191 499 ns 485 ns 488 ns
bm_mismatch<int>/8192 496 ns 490 ns 495 ns
bm_mismatch<int>/8193 491 ns 485 ns 488 ns
bm_mismatch<int>/16383 982 ns 962 ns 964 ns
bm_mismatch<int>/16384 974 ns 971 ns 971 ns
bm_mismatch<int>/16385 971 ns 961 ns 968 ns
bm_mismatch<int>/32767 2003 ns 1959 ns 1920 ns
bm_mismatch<int>/32768 1996 ns 1947 ns 1928 ns
bm_mismatch<int>/32769 1990 ns 1945 ns 1926 ns
bm_mismatch<int>/65535 4434 ns 4275 ns 4312 ns
bm_mismatch<int>/65536 4437 ns 4267 ns 4321 ns
bm_mismatch<int>/65537 4442 ns 4261 ns 4321 ns
bm_mismatch<int>/131071 9673 ns 9648 ns 9465 ns
bm_mismatch<int>/131072 9667 ns 9671 ns 9465 ns
bm_mismatch<int>/131073 9661 ns 9653 ns 9464 ns
bm_mismatch<int>/262143 20595 ns 19605 ns 19064 ns
bm_mismatch<int>/262144 19894 ns 19572 ns 19009 ns
bm_mismatch<int>/262145 19851 ns 19656 ns 18999 ns
bm_mismatch<int>/524287 39556 ns 39364 ns 38131 ns
bm_mismatch<int>/524288 39678 ns 39573 ns 38183 ns
bm_mismatch<int>/524289 40168 ns 39301 ns 38121 ns
```
|
|
```
---------------------------------------------------
Benchmark old new
---------------------------------------------------
bm_mismatch<char>/1 0.835 ns 2.37 ns
bm_mismatch<char>/2 1.44 ns 2.60 ns
bm_mismatch<char>/3 2.06 ns 2.83 ns
bm_mismatch<char>/4 2.60 ns 3.29 ns
bm_mismatch<char>/5 3.15 ns 3.77 ns
bm_mismatch<char>/6 3.82 ns 4.17 ns
bm_mismatch<char>/7 4.29 ns 4.52 ns
bm_mismatch<char>/8 4.78 ns 4.86 ns
bm_mismatch<char>/16 9.06 ns 7.54 ns
bm_mismatch<char>/64 31.7 ns 19.1 ns
bm_mismatch<char>/512 249 ns 8.16 ns
bm_mismatch<char>/4096 1956 ns 44.2 ns
bm_mismatch<char>/32768 15498 ns 501 ns
bm_mismatch<char>/262144 123965 ns 4479 ns
bm_mismatch<char>/1048576 495668 ns 21306 ns
bm_mismatch<short>/1 0.710 ns 2.12 ns
bm_mismatch<short>/2 1.03 ns 2.66 ns
bm_mismatch<short>/3 1.29 ns 3.56 ns
bm_mismatch<short>/4 1.68 ns 4.29 ns
bm_mismatch<short>/5 1.96 ns 5.18 ns
bm_mismatch<short>/6 2.59 ns 5.91 ns
bm_mismatch<short>/7 2.86 ns 6.63 ns
bm_mismatch<short>/8 3.19 ns 7.33 ns
bm_mismatch<short>/16 5.48 ns 13.0 ns
bm_mismatch<short>/64 16.6 ns 4.06 ns
bm_mismatch<short>/512 130 ns 13.8 ns
bm_mismatch<short>/4096 985 ns 93.8 ns
bm_mismatch<short>/32768 7846 ns 1002 ns
bm_mismatch<short>/262144 63217 ns 10637 ns
bm_mismatch<short>/1048576 251782 ns 42471 ns
bm_mismatch<int>/1 0.716 ns 1.91 ns
bm_mismatch<int>/2 1.21 ns 2.49 ns
bm_mismatch<int>/3 1.38 ns 3.46 ns
bm_mismatch<int>/4 1.71 ns 4.04 ns
bm_mismatch<int>/5 2.00 ns 4.98 ns
bm_mismatch<int>/6 2.43 ns 5.67 ns
bm_mismatch<int>/7 3.05 ns 6.38 ns
bm_mismatch<int>/8 3.22 ns 7.09 ns
bm_mismatch<int>/16 5.18 ns 12.8 ns
bm_mismatch<int>/64 16.6 ns 5.28 ns
bm_mismatch<int>/512 129 ns 25.2 ns
bm_mismatch<int>/4096 1009 ns 201 ns
bm_mismatch<int>/32768 7776 ns 2144 ns
bm_mismatch<int>/262144 62371 ns 20551 ns
bm_mismatch<int>/1048576 254750 ns 90097 ns
```
|
|
```
------------------------------------------------------
Benchmark old new
------------------------------------------------------
bm_ranges_fill_n/1 1.64 ns 3.06 ns
bm_ranges_fill_n/2 3.45 ns 3.06 ns
bm_ranges_fill_n/3 4.88 ns 3.06 ns
bm_ranges_fill_n/4 6.46 ns 3.06 ns
bm_ranges_fill_n/5 8.03 ns 3.06 ns
bm_ranges_fill_n/6 9.65 ns 3.07 ns
bm_ranges_fill_n/7 11.5 ns 3.06 ns
bm_ranges_fill_n/8 13.0 ns 3.06 ns
bm_ranges_fill_n/16 25.9 ns 3.06 ns
bm_ranges_fill_n/64 103 ns 4.62 ns
bm_ranges_fill_n/512 711 ns 4.40 ns
bm_ranges_fill_n/4096 5642 ns 9.86 ns
bm_ranges_fill_n/32768 45135 ns 33.6 ns
bm_ranges_fill_n/262144 360818 ns 243 ns
bm_ranges_fill_n/1048576 1442828 ns 982 ns
bm_ranges_fill/1 1.63 ns 3.17 ns
bm_ranges_fill/2 3.43 ns 3.28 ns
bm_ranges_fill/3 4.97 ns 3.31 ns
bm_ranges_fill/4 6.53 ns 3.27 ns
bm_ranges_fill/5 8.12 ns 3.33 ns
bm_ranges_fill/6 9.76 ns 3.32 ns
bm_ranges_fill/7 11.6 ns 3.29 ns
bm_ranges_fill/8 13.2 ns 3.26 ns
bm_ranges_fill/16 26.3 ns 3.26 ns
bm_ranges_fill/64 104 ns 4.92 ns
bm_ranges_fill/512 716 ns 4.47 ns
bm_ranges_fill/4096 5772 ns 8.21 ns
bm_ranges_fill/32768 45778 ns 33.1 ns
bm_ranges_fill/262144 351422 ns 241 ns
bm_ranges_fill/1048576 1404710 ns 965 ns
```
|
|
For the mutex vs atomic test:
Old: `unique_lock<mutex>`
New: a lock implemented with `atomic::wait`
On 10 years old Intel Macbook, `atomic::wait` is 50% slower than `mutex`
```
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------------------------------
BM_multi_thread_lock_unlock/1024 +0.3735 +2.4497 1724726 2368935 153159 528354
BM_multi_thread_lock_unlock/2048 +0.4174 +1.2487 3410538 4834012 435062 978311
BM_multi_thread_lock_unlock/4096 +0.5256 +1.9824 6903783 10532681 590266 1760405
BM_multi_thread_lock_unlock/8192 +0.5415 +0.4578 14536391 22408399 1456328 2123075
BM_multi_thread_lock_unlock/16384 +0.5663 +0.0513 30181991 47275023 3316850 3486950
BM_multi_thread_lock_unlock/32768 +0.5635 -0.2081 62027663 96977726 6477076 5129190
BM_multi_thread_lock_unlock/65536 +0.5228 -0.3273 129637761 197408739 11341630 7628955
BM_multi_thread_lock_unlock/131072 +0.4825 -0.1070 266256295 394712193 10379800 9269200
BM_multi_thread_lock_unlock/262144 +0.4793 +0.2795 539732340 798409253 10802200 13821100
BM_multi_thread_lock_unlock/524288 +0.5272 +0.2847 1070035132 1634124353 14523000 18657800
BM_multi_thread_lock_unlock/1048576 +0.4799 +0.3353 2125510441 3145636119 13404200 17899000
OVERALL_GEOMEAN +0.4970 +0.3886 0 0 0 0
```
On Apple Arm, `atomic::wait` is 200% slower than `mutex`. And
`atomic::wait` is even slower than my 10 years old Intel CPU Macbook
```
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------------------------------
BM_multi_thread_lock_unlock/1024 +2.1811 +3.9854 2036726 6478993 119817 597334
BM_multi_thread_lock_unlock/2048 +1.6736 +1.4301 3162161 8454415 426201 1035727
BM_multi_thread_lock_unlock/4096 +1.1017 +0.6456 6620503 13914159 893019 1469578
BM_multi_thread_lock_unlock/8192 +0.6688 +0.2148 12089392 20174635 1489000 1808799
BM_multi_thread_lock_unlock/16384 +1.4217 -0.2436 19365999 46899345 2068266 1564530
BM_multi_thread_lock_unlock/32768 +2.6161 -0.4927 31371052 113440165 3715100 1884540
BM_multi_thread_lock_unlock/65536 +2.6286 -0.3967 54314581 197086847 5912764 3567410
BM_multi_thread_lock_unlock/131072 +2.3554 +0.4990 103176565 346201425 9260407 13880900
BM_multi_thread_lock_unlock/262144 +2.8780 +0.4995 182355400 707170733 16335852 24496000
BM_multi_thread_lock_unlock/524288 +3.0280 +0.3001 360953079 1453902595 32548700 42316364
BM_multi_thread_lock_unlock/1048576 +3.7480 +1.2374 714500462 3392470417 48603455 108747000
OVERALL_GEOMEAN +2.0791 +0.3874 0 0 0 0
```
For the atomic_wait test:
On my 2013 MacBook with Intel CPU
```
Run on (8 X 2300 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB (x1)
Load Average: 1.95, 3.77, 4.13
-----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------------
BM_atomic_wait_one_thread_one_atomic_wait/1024 184455 ns 183979 ns 3760
BM_atomic_wait_one_thread_one_atomic_wait/2048 361607 ns 360917 ns 1912
BM_atomic_wait_one_thread_one_atomic_wait/4096 709055 ns 708326 ns 929
BM_atomic_wait_one_thread_one_atomic_wait/8192 1469063 ns 1467430 ns 488
BM_atomic_wait_one_thread_one_atomic_wait/16384 2865332 ns 2863473 ns 237
BM_atomic_wait_one_thread_one_atomic_wait/32768 5839429 ns 5834708 ns 113
BM_atomic_wait_one_thread_one_atomic_wait/65536 11460822 ns 11453183 ns 60
BM_atomic_wait_one_thread_one_atomic_wait/131072 23052804 ns 23035000 ns 30
BM_atomic_wait_one_thread_one_atomic_wait/262144 46958743 ns 46712733 ns 15
BM_atomic_wait_one_thread_one_atomic_wait/524288 93151904 ns 92977429 ns 7
BM_atomic_wait_one_thread_one_atomic_wait/1048576 186100011 ns 185888500 ns 4
BM_atomic_wait_one_thread_one_atomic_wait/2097152 364548135 ns 364280000 ns 2
BM_atomic_wait_one_thread_one_atomic_wait/4194304 747181672 ns 745056000 ns 1
BM_atomic_wait_one_thread_one_atomic_wait/8388608 1473070400 ns 1471165000 ns 1
BM_atomic_wait_one_thread_one_atomic_wait/16777216 2950352547 ns 2947373000 ns 1
BM_atomic_wait_multi_thread_one_atomic_wait/1024 668544 ns 167233 ns 4496
BM_atomic_wait_multi_thread_one_atomic_wait/2048 1384668 ns 369750 ns 1941
BM_atomic_wait_multi_thread_one_atomic_wait/4096 2851627 ns 768559 ns 995
BM_atomic_wait_multi_thread_one_atomic_wait/8192 5797669 ns 1476876 ns 526
BM_atomic_wait_multi_thread_one_atomic_wait/16384 11597952 ns 2692792 ns 260
BM_atomic_wait_multi_thread_one_atomic_wait/32768 23528028 ns 5291465 ns 142
BM_atomic_wait_multi_thread_one_atomic_wait/65536 46287247 ns 8547713 ns 87
BM_atomic_wait_multi_thread_one_atomic_wait/131072 90315848 ns 13294492 ns 61
BM_atomic_wait_multi_thread_one_atomic_wait/262144 190722393 ns 16193917 ns 36
BM_atomic_wait_multi_thread_one_atomic_wait/524288 408456684 ns 23641600 ns 10
BM_atomic_wait_multi_thread_one_atomic_wait/1048576 708809670 ns 36361900 ns 10
BM_atomic_wait_multi_thread_wait_different_atomics/1024 2116444 ns 11669 ns 10000
BM_atomic_wait_multi_thread_wait_different_atomics/2048 12435259 ns 21905 ns 1000
BM_atomic_wait_multi_thread_wait_different_atomics/4096 6393816 ns 17819 ns 1000
BM_atomic_wait_multi_thread_wait_different_atomics/8192 11930400 ns 28637 ns 1000
BM_atomic_wait_multi_thread_wait_different_atomics/16384 20987224 ns 35272 ns 1000
BM_atomic_wait_multi_thread_wait_different_atomics/32768 44335820 ns 66660 ns 100
BM_atomic_wait_multi_thread_wait_different_atomics/65536 91395912 ns 129030 ns 100
BM_atomic_wait_multi_thread_wait_different_atomics/131072 145440007 ns 165960 ns 100
BM_atomic_wait_multi_thread_wait_different_atomics/262144 368219935 ns 420800 ns 10
BM_atomic_wait_multi_thread_wait_different_atomics/524288 630106863 ns 809500 ns 10
BM_atomic_wait_multi_thread_wait_different_atomics/1048576 1138174673 ns 1093000 ns 10
```
On apple arm
```
Run on (8 X 24.1208 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x8)
L1 Instruction 128 KiB (x8)
L2 Unified 4096 KiB (x2)
Load Average: 1.34, 1.58, 1.66
-----------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------------------------------------
BM_atomic_wait_one_thread_one_atomic_wait/1024 61602 ns 61602 ns 8701
BM_atomic_wait_one_thread_one_atomic_wait/2048 123148 ns 123146 ns 5688
BM_atomic_wait_one_thread_one_atomic_wait/4096 246248 ns 246249 ns 2888
BM_atomic_wait_one_thread_one_atomic_wait/8192 480373 ns 480359 ns 1455
BM_atomic_wait_one_thread_one_atomic_wait/16384 974725 ns 974721 ns 724
BM_atomic_wait_one_thread_one_atomic_wait/32768 1922185 ns 1922115 ns 355
BM_atomic_wait_one_thread_one_atomic_wait/65536 3940632 ns 3940608 ns 181
BM_atomic_wait_one_thread_one_atomic_wait/131072 7886302 ns 7886102 ns 88
BM_atomic_wait_one_thread_one_atomic_wait/262144 15393156 ns 15393000 ns 45
BM_atomic_wait_one_thread_one_atomic_wait/524288 30833221 ns 30832174 ns 23
BM_atomic_wait_one_thread_one_atomic_wait/1048576 62551936 ns 62551909 ns 11
BM_atomic_wait_one_thread_one_atomic_wait/2097152 123155625 ns 123155667 ns 6
BM_atomic_wait_one_thread_one_atomic_wait/4194304 252468180 ns 252458667 ns 3
BM_atomic_wait_one_thread_one_atomic_wait/8388608 505075604 ns 505075500 ns 2
BM_atomic_wait_one_thread_one_atomic_wait/16777216 992977209 ns 992935000 ns 1
BM_atomic_wait_multi_thread_one_atomic_wait/1024 531411 ns 239695 ns 2783
BM_atomic_wait_multi_thread_one_atomic_wait/2048 1030592 ns 484868 ns 1413
BM_atomic_wait_multi_thread_one_atomic_wait/4096 1951896 ns 922357 ns 631
BM_atomic_wait_multi_thread_one_atomic_wait/8192 3759893 ns 1952074 ns 390
BM_atomic_wait_multi_thread_one_atomic_wait/16384 7417929 ns 3458309 ns 233
BM_atomic_wait_multi_thread_one_atomic_wait/32768 14386361 ns 5590830 ns 100
BM_atomic_wait_multi_thread_one_atomic_wait/65536 29725536 ns 6521887 ns 115
BM_atomic_wait_multi_thread_one_atomic_wait/131072 60023797 ns 10766795 ns 73
BM_atomic_wait_multi_thread_one_atomic_wait/262144 120782267 ns 17532091 ns 44
BM_atomic_wait_multi_thread_one_atomic_wait/524288 242539333 ns 27506920 ns 25
BM_atomic_wait_multi_thread_one_atomic_wait/1048576 482833787 ns 53721600 ns 10
BM_atomic_wait_multi_thread_wait_different_atomics/1024 2230048 ns 626042 ns 1000
BM_atomic_wait_multi_thread_wait_different_atomics/2048 3931958 ns 837540 ns 884
BM_atomic_wait_multi_thread_wait_different_atomics/4096 6506887 ns 1127922 ns 586
BM_atomic_wait_multi_thread_wait_different_atomics/8192 10528008 ns 1651254 ns 456
BM_atomic_wait_multi_thread_wait_different_atomics/16384 18055829 ns 2066379 ns 317
BM_atomic_wait_multi_thread_wait_different_atomics/32768 29878496 ns 2875600 ns 100
BM_atomic_wait_multi_thread_wait_different_atomics/65536 50523799 ns 3193170 ns 100
BM_atomic_wait_multi_thread_wait_different_atomics/131072 85926943 ns 4121950 ns 100
BM_atomic_wait_multi_thread_wait_different_atomics/262144 154602296 ns 5879050 ns 100
BM_atomic_wait_multi_thread_wait_different_atomics/524288 279121754 ns 10063400 ns 10
BM_atomic_wait_multi_thread_wait_different_atomics/1048576 522796900 ns 12370300 ns 10
```
|
|
(#80558)
This reapplies #76657. Non-trivial elements didn't get destroyed
previously. This fixes the bug and adds tests for all the vector
insertion functions.
|
|
(#76657)"
Broke sanitizer bots: https://lab.llvm.org/buildbot/#/builders/5/builds/40641
This reverts commit 67eee4a029797c09129889c3655416d1be487cfe.
|
|
This patch introduces a new trait to represent whether a type is
trivially
relocatable, and uses that trait to optimize the growth of a std::vector
of trivially relocatable objects.
```
--------------------------------------------------
Benchmark old new
--------------------------------------------------
bm_grow<int> 1354 ns 1301 ns
bm_grow<std::string> 5584 ns 3370 ns
bm_grow<std::unique_ptr<int>> 3506 ns 1994 ns
bm_grow<std::deque<int>> 27114 ns 27209 ns
```
This also changes to order of moving and destroying the objects when
growing the vector. This should not affect our conformance.
|
|
std::make_exception_ptr (#65534)
This patch implements __cxa_init_primary_exception, an extension to the
Itanium C++ ABI. This extension is already present in both libsupc++ and
libcxxrt. This patch also starts making use of this function in
std::make_exception_ptr: instead of going through a full throw/catch
cycle, we are now able to initialize an exception directly, thus making
std::make_exception_ptr around 30x faster.
|
|
Differential Revision: https://reviews.llvm.org/D159232
```
Running ./ranges_contains.libcxx.out
Run on (10 X 24.121 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x10)
L1 Instruction 128 KiB (x10)
L2 Unified 4096 KiB (x5)
Load Average: 3.37, 6.77, 5.27
--------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------
bm_contains_char/16 1.88 ns 1.87 ns 371607095
bm_contains_char/256 7.48 ns 7.47 ns 93292285
bm_contains_char/4096 99.7 ns 99.6 ns 7013185
bm_contains_char/65536 1296 ns 1294 ns 540436
bm_contains_char/1048576 23887 ns 23860 ns 29302
bm_contains_char/16777216 389420 ns 389095 ns 1796
bm_contains_int/16 7.14 ns 7.14 ns 97776288
bm_contains_int/256 90.4 ns 90.3 ns 7558089
bm_contains_int/4096 1294 ns 1290 ns 543052
bm_contains_int/65536 20482 ns 20443 ns 34334
bm_contains_int/1048576 328817 ns 327965 ns 2147
bm_contains_int/16777216 5246279 ns 5239361 ns 133
bm_contains_bool/16 2.19 ns 2.19 ns 322565780
bm_contains_bool/256 3.42 ns 3.41 ns 205025467
bm_contains_bool/4096 22.1 ns 22.1 ns 31780479
bm_contains_bool/65536 333 ns 332 ns 2106606
bm_contains_bool/1048576 5126 ns 5119 ns 135901
bm_contains_bool/16777216 81656 ns 81574 ns 8569
```
---------
Co-authored-by: Nathan Gauër <brioche@google.com>
|
|
```
--------------------------------------------------------------------------
Benchmark old new
--------------------------------------------------------------------------
bm_find<std::deque<char>>/1 6.06 ns 10.6 ns
bm_find<std::deque<char>>/2 15.5 ns 10.6 ns
bm_find<std::deque<char>>/3 19.0 ns 10.6 ns
bm_find<std::deque<char>>/4 20.8 ns 10.6 ns
bm_find<std::deque<char>>/5 22.0 ns 10.6 ns
bm_find<std::deque<char>>/6 23.0 ns 10.5 ns
bm_find<std::deque<char>>/7 24.8 ns 10.7 ns
bm_find<std::deque<char>>/8 25.7 ns 10.6 ns
bm_find<std::deque<char>>/16 28.3 ns 10.6 ns
bm_find<std::deque<char>>/64 44.2 ns 27.0 ns
bm_find<std::deque<char>>/512 133 ns 37.6 ns
bm_find<std::deque<char>>/4096 867 ns 53.1 ns
bm_find<std::deque<char>>/32768 6838 ns 160 ns
bm_find<std::deque<char>>/262144 52897 ns 1495 ns
bm_find<std::deque<char>>/1048576 215621 ns 6077 ns
bm_find<std::deque<short>>/1 6.03 ns 6.28 ns
bm_find<std::deque<short>>/2 15.8 ns 15.8 ns
bm_find<std::deque<short>>/3 20.5 ns 20.3 ns
bm_find<std::deque<short>>/4 21.0 ns 21.0 ns
bm_find<std::deque<short>>/5 23.0 ns 22.1 ns
bm_find<std::deque<short>>/6 22.6 ns 23.0 ns
bm_find<std::deque<short>>/7 23.4 ns 23.7 ns
bm_find<std::deque<short>>/8 24.4 ns 24.9 ns
bm_find<std::deque<short>>/16 26.6 ns 27.2 ns
bm_find<std::deque<short>>/64 43.2 ns 40.9 ns
bm_find<std::deque<short>>/512 124 ns 90.7 ns
bm_find<std::deque<short>>/4096 845 ns 525 ns
bm_find<std::deque<short>>/32768 7273 ns 3194 ns
bm_find<std::deque<short>>/262144 53710 ns 24385 ns
bm_find<std::deque<short>>/1048576 216086 ns 96195 ns
bm_find<std::deque<int>>/1 6.03 ns 10.3 ns
bm_find<std::deque<int>>/2 15.6 ns 10.3 ns
bm_find<std::deque<int>>/3 19.1 ns 10.3 ns
bm_find<std::deque<int>>/4 22.3 ns 10.3 ns
bm_find<std::deque<int>>/5 23.5 ns 10.4 ns
bm_find<std::deque<int>>/6 23.1 ns 10.3 ns
bm_find<std::deque<int>>/7 23.7 ns 10.2 ns
bm_find<std::deque<int>>/8 24.5 ns 10.2 ns
bm_find<std::deque<int>>/16 27.9 ns 26.6 ns
bm_find<std::deque<int>>/64 42.6 ns 32.2 ns
bm_find<std::deque<int>>/512 123 ns 43.0 ns
bm_find<std::deque<int>>/4096 874 ns 93.5 ns
bm_find<std::deque<int>>/32768 7031 ns 751 ns
bm_find<std::deque<int>>/262144 57723 ns 6169 ns
bm_find<std::deque<int>>/1048576 230867 ns 35851 ns
bm_ranges_find<std::deque<char>>/1 5.97 ns 10.6 ns
bm_ranges_find<std::deque<char>>/2 16.0 ns 10.5 ns
bm_ranges_find<std::deque<char>>/3 19.5 ns 10.5 ns
bm_ranges_find<std::deque<char>>/4 21.1 ns 10.6 ns
bm_ranges_find<std::deque<char>>/5 22.8 ns 10.5 ns
bm_ranges_find<std::deque<char>>/6 22.8 ns 10.6 ns
bm_ranges_find<std::deque<char>>/7 23.4 ns 10.8 ns
bm_ranges_find<std::deque<char>>/8 24.1 ns 10.5 ns
bm_ranges_find<std::deque<char>>/16 26.9 ns 10.6 ns
bm_ranges_find<std::deque<char>>/64 50.2 ns 27.2 ns
bm_ranges_find<std::deque<char>>/512 126 ns 38.3 ns
bm_ranges_find<std::deque<char>>/4096 868 ns 53.8 ns
bm_ranges_find<std::deque<char>>/32768 6695 ns 161 ns
bm_ranges_find<std::deque<char>>/262144 54411 ns 1497 ns
bm_ranges_find<std::deque<char>>/1048576 241699 ns 6042 ns
bm_ranges_find<std::deque<short>>/1 6.39 ns 6.31 ns
bm_ranges_find<std::deque<short>>/2 15.8 ns 15.9 ns
bm_ranges_find<std::deque<short>>/3 19.0 ns 19.8 ns
bm_ranges_find<std::deque<short>>/4 20.8 ns 20.9 ns
bm_ranges_find<std::deque<short>>/5 21.8 ns 22.1 ns
bm_ranges_find<std::deque<short>>/6 23.0 ns 23.0 ns
bm_ranges_find<std::deque<short>>/7 23.2 ns 23.9 ns
bm_ranges_find<std::deque<short>>/8 23.7 ns 24.4 ns
bm_ranges_find<std::deque<short>>/16 26.6 ns 26.8 ns
bm_ranges_find<std::deque<short>>/64 43.4 ns 39.7 ns
bm_ranges_find<std::deque<short>>/512 131 ns 90.5 ns
bm_ranges_find<std::deque<short>>/4096 851 ns 523 ns
bm_ranges_find<std::deque<short>>/32768 7370 ns 3166 ns
bm_ranges_find<std::deque<short>>/262144 60778 ns 24814 ns
bm_ranges_find<std::deque<short>>/1048576 229288 ns 99273 ns
bm_ranges_find<std::deque<int>>/1 6.43 ns 10.2 ns
bm_ranges_find<std::deque<int>>/2 16.6 ns 10.2 ns
bm_ranges_find<std::deque<int>>/3 19.6 ns 10.2 ns
bm_ranges_find<std::deque<int>>/4 21.0 ns 10.2 ns
bm_ranges_find<std::deque<int>>/5 21.9 ns 10.4 ns
bm_ranges_find<std::deque<int>>/6 22.7 ns 10.2 ns
bm_ranges_find<std::deque<int>>/7 23.9 ns 10.2 ns
bm_ranges_find<std::deque<int>>/8 23.8 ns 10.2 ns
bm_ranges_find<std::deque<int>>/16 27.2 ns 27.1 ns
bm_ranges_find<std::deque<int>>/64 42.4 ns 32.4 ns
bm_ranges_find<std::deque<int>>/512 122 ns 43.0 ns
bm_ranges_find<std::deque<int>>/4096 895 ns 93.7 ns
bm_ranges_find<std::deque<int>>/32768 6890 ns 756 ns
bm_ranges_find<std::deque<int>>/262144 54025 ns 6102 ns
bm_ranges_find<std::deque<int>>/1048576 221558 ns 32783 ns
```
|
|
|
|
Locale objects use atomic reference counting, which may be very
expensive in parallel applications. The classic locale is used by
default by all streams and can be very contended. But it's never
destroyed, so the reference counting is also completely pointless on the
classic locale. Currently ~70% of time in the parallel stringstream
benchmarks is spent in locale ctor/dtor. And the execution radically
slows down with more threads.
Avoid reference counting on the classic locale. With this change
parallel benchmarks start to scale with threads.
This is a re-application of f8afc53d641c (aka PR #72112) which was
reverted in 4e0c48b907f1 because it broke the sanitizer builds due
to an initialization order fiasco. This issue has now been fixed by
ensuring that the locale is constinit'ed.
Co-authored-by: Dmitry Vyukov <dvyukov@google.com>
|
|
<filesystem> is a C++17 addition. In C++11 and C++14 modes, we actually
have all the code for <filesystem> but it is hidden behind a non-inline
namespace __fs so it is not accessible. Instead of doing this unusual
dance, just guard the code for filesystem behind a classic C++17 check
like we normally do.
|
|
Looks like it broke the ASAN build: https://lab.llvm.org/buildbot/#/builders/168/builds/17053/steps/9/logs/stdio
This reverts commit f8afc53d641ce9d4ad8565aae9e7b5911b572a02.
|
|
Locale objects use atomic reference counting, which may be very
expensive in parallel applications. The classic locale is used by
default by all streams and can be very contended. But it's never
destroyed, so the reference counting is also completely pointless on the
classic locale. Currently ~70% of time in the parallel stringstream
benchmarks is spent in locale ctor/dtor. And the execution radically
slows down with more threads.
Avoid reference counting on the classic locale. With this change
parallel benchmarks start to scale with threads.
Co-authored-by: Louis Dionne <ldionne.2@gmail.com>
```
│ baseline │ optimized │
│ sec/op │ sec/op vs base │
Istream_numbers/0/threads:1 4.672µ ± 0% 4.419µ ± 0% -5.42% (p=0.000 n=30+39)
Istream_numbers/0/threads:72 539.817µ ± 0% 9.842µ ± 1% -98.18% (p=0.000 n=30+40)
Istream_numbers/1/threads:1 4.890µ ± 0% 4.750µ ± 0% -2.85% (p=0.000 n=30+40)
Istream_numbers/1/threads:72 66.44µ ± 1% 10.14µ ± 1% -84.74% (p=0.000 n=30+40)
Istream_numbers/2/threads:1 4.888µ ± 0% 4.746µ ± 0% -2.92% (p=0.000 n=30+40)
Istream_numbers/2/threads:72 494.8µ ± 0% 410.2µ ± 1% -17.11% (p=0.000 n=30+40)
Istream_numbers/3/threads:1 4.697µ ± 0% 4.695µ ± 5% ~ (p=0.391 n=30+37)
Istream_numbers/3/threads:72 421.5µ ± 7% 421.9µ ± 9% ~ (p=0.665 n=30)
Ostream_number/0/threads:1 183.0n ± 0% 141.0n ± 2% -22.95% (p=0.000 n=30)
Ostream_number/0/threads:72 24196.5n ± 1% 343.5n ± 3% -98.58% (p=0.000 n=30)
Ostream_number/1/threads:1 250.0n ± 0% 196.0n ± 2% -21.60% (p=0.000 n=30)
Ostream_number/1/threads:72 16260.5n ± 0% 407.0n ± 2% -97.50% (p=0.000 n=30)
Ostream_number/2/threads:1 254.0n ± 0% 196.0n ± 1% -22.83% (p=0.000 n=30)
Ostream_number/2/threads:72 28.49µ ± 1% 18.89µ ± 5% -33.72% (p=0.000 n=30)
Ostream_number/3/threads:1 185.0n ± 0% 185.0n ± 0% 0.00% (p=0.017 n=30)
Ostream_number/3/threads:72 19.38µ ± 4% 19.33µ ± 5% ~ (p=0.425 n=30)
```
|
|
```
---------------------------------------------------
Benchmark old new
---------------------------------------------------
bm_for_each/1 3.00 ns 2.98 ns
bm_for_each/2 4.53 ns 4.57 ns
bm_for_each/3 5.82 ns 5.82 ns
bm_for_each/4 6.94 ns 6.91 ns
bm_for_each/5 7.55 ns 7.75 ns
bm_for_each/6 7.06 ns 7.45 ns
bm_for_each/7 6.69 ns 7.14 ns
bm_for_each/8 6.86 ns 4.06 ns
bm_for_each/16 11.5 ns 5.73 ns
bm_for_each/64 43.7 ns 4.06 ns
bm_for_each/512 356 ns 7.98 ns
bm_for_each/4096 2787 ns 53.6 ns
bm_for_each/32768 20836 ns 438 ns
bm_for_each/262144 195362 ns 4945 ns
bm_for_each/1048576 685482 ns 19822 ns
```
Reviewed By: ldionne, Mordante, #libc
Spies: bgraur, sberg, arichardson, libcxx-commits
Differential Revision: https://reviews.llvm.org/D151274
|
|
This is transforming the `stop_token` benchmark that Lewis Baker had
created into Google
Bench
https://reviews.llvm.org/D154702
|
|
We support GCC 13, which supports the flag. This allows simplifying the
CMake logic around the use of -nostdlib++. Note that there are other
places where we don't assume -nostdlib++ yet in the build, but this
patch is intentionally trying to be small because this part of our CMake
is pretty tricky.
|
|
```
---------------------------------------------------------------
Benchmark old new
---------------------------------------------------------------
bm_vector_bool_count/1 1.92 ns 1.92 ns
bm_vector_bool_count/2 1.92 ns 1.92 ns
bm_vector_bool_count/3 1.92 ns 1.92 ns
bm_vector_bool_count/4 1.92 ns 1.92 ns
bm_vector_bool_count/5 1.92 ns 1.92 ns
bm_vector_bool_count/6 1.92 ns 1.92 ns
bm_vector_bool_count/7 1.92 ns 1.92 ns
bm_vector_bool_count/8 1.92 ns 1.92 ns
bm_vector_bool_count/16 1.92 ns 1.92 ns
bm_vector_bool_count/64 2.24 ns 2.25 ns
bm_vector_bool_count/512 3.19 ns 3.20 ns
bm_vector_bool_count/4096 14.1 ns 12.3 ns
bm_vector_bool_count/32768 84.0 ns 83.6 ns
bm_vector_bool_count/262144 664 ns 661 ns
bm_vector_bool_count/1048576 2623 ns 2628 ns
bm_vector_bool_ranges_count/1 1.07 ns 1.92 ns
bm_vector_bool_ranges_count/2 1.65 ns 1.92 ns
bm_vector_bool_ranges_count/3 2.27 ns 1.92 ns
bm_vector_bool_ranges_count/4 2.68 ns 1.92 ns
bm_vector_bool_ranges_count/5 3.33 ns 1.92 ns
bm_vector_bool_ranges_count/6 3.99 ns 1.92 ns
bm_vector_bool_ranges_count/7 4.67 ns 1.92 ns
bm_vector_bool_ranges_count/8 5.19 ns 1.92 ns
bm_vector_bool_ranges_count/16 11.1 ns 1.92 ns
bm_vector_bool_ranges_count/64 52.2 ns 2.24 ns
bm_vector_bool_ranges_count/512 452 ns 3.20 ns
bm_vector_bool_ranges_count/4096 3577 ns 12.1 ns
bm_vector_bool_ranges_count/32768 28725 ns 83.7 ns
bm_vector_bool_ranges_count/262144 229676 ns 662 ns
bm_vector_bool_ranges_count/1048576 905574 ns 2625 ns
```
Reviewed By: #libc, ldionne
Spies: arichardson, ldionne, libcxx-commits
Differential Revision: https://reviews.llvm.org/D156956
|
|
pointer
Credits: this change is based on analysis and a proof of concept by
gerbens@google.com.
Before, the compiler loses track of end as 'this' and other references
possibly escape beyond the compiler's scope. This can be see in the
generated assembly:
16.28 │200c80: mov %r15d,(%rax)
60.87 │200c83: add $0x4,%rax
│200c87: mov %rax,-0x38(%rbp)
0.03 │200c8b: → jmpq 200d4e
...
...
1.69 │200d4e: cmp %r15d,%r12d
│200d51: → je 200c40
16.34 │200d57: inc %r15d
0.05 │200d5a: mov -0x38(%rbp),%rax
3.27 │200d5e: mov -0x30(%rbp),%r13
1.47 │200d62: cmp %r13,%rax
│200d65: → jne 200c80
We fix this by always explicitly storing the loaded local and pointer
back at the end of push back. This generates some slight source 'noise',
but creates nice and compact fast path code, i.e.:
32.64 │200760: mov %r14d,(%r12)
9.97 │200764: add $0x4,%r12
6.97 │200768: mov %r12,-0x38(%rbp)
32.17 │20076c: add $0x1,%r14d
2.36 │200770: cmp %r14d,%ebx
│200773: → je 200730
8.98 │200775: mov -0x30(%rbp),%r13
6.75 │200779: cmp %r13,%r12
│20077c: → jne 200760
Now there is a single store for the push_back value (as before), and a
single store for the end without a reload (dependency).
For fully local vectors, (i.e., not referenced elsewhere), the capacity
load and store inside the loop could also be removed, but this requires
more substantial refactoring inside vector.
Differential Revision: https://reviews.llvm.org/D80588
|
|
of `vector` and `deque`." (#67753)
This reverts commit 10edd5d9436153ace82009a04900ac67d3adc202 and guards
against older versions of GCC to work around the problem.
|
|
`vector` and `deque`."
This reverts commit 390ac823178fc1073612b4c8a38835f441138d9d.
It broke the sphinx publish bots for our documentation:
https://lab.llvm.org/buildbot/#/builders/242/builds/1130
because that machine has GCC 9.4.0 which does not know about C++23
|
|
Temporary workaround for the following CMake error:
```
CMake Error in /llvm/libcxx/benchmarks/CMakeLists.txt:
The compiler feature "cxx_std_23" is not known to CXX compiler
"IBMClang"
```
|
|
This patch is a melting pot of changes picked up from
https://llvm.org/D61878. It adds a few tests checking corner cases of
unordered containers comparison and adds benchmarks for a few
unordered_set operations.
|
|
Reviewed By: #libc, var-const
Differential Revision: https://reviews.llvm.org/D150831
|
|
This fixes some unused variable and "comparison of integers of different
signs" warnings in the benchmarks.
Differential Revision: https://reviews.llvm.org/D156613
|
|
|
|
This commit contains refactorings around __dynamic_cast without changing
its behavior. Some important changes include:
- Refactor __dynamic_cast into various small helper functions;
- Move dynamic_cast_stress.pass.cpp to libcxx/benchmarks and refactor
it into a benchmark. The benchmark performance numbers are updated
as well.
Differential Revision: https://reviews.llvm.org/D138006
|
|
`vector` and `deque`.
Differential Revision: https://reviews.llvm.org/D150747
|
|
Reviewed By: #libc, ldionne
Spies: ldionne, libcxx-commits
Differential Revision: https://reviews.llvm.org/D155136
|
|
In the event the internal function __init is called with an empty string the code will take unnecessary extra steps, in addition, the code generated might be overall greater because, to my understanding, when initializing a string with an empty `const char*` "" (like in this case), the compiler might be unable to deduce the string is indeed empty at compile time and more code is generated.
The goal of this patch is to make a new internal function that will accept just an error code skipping the empty string argument. It should skip the unnecessary steps and in the event `if (ec)` is `false`, it will return an empty string using the correct ctor, avoiding any extra code generation issues.
After the conversation about this patch matured in the libcxx channel on the LLVM Discord server, the patch was analyzed quickly with "Compiler Explorer" and other tools and it was discovered that it does indeed reduce the amount of code generated when using the latest stable clang version (16) which in turn produces faster code.
This patch targets LLVM 18 as it will break the ABI by addressing https://github.com/llvm/llvm-project/issues/63985
Benchmark tests run on other machines as well show in the best case, that the new version without the extra string as an argument performs 10 times faster.
On the buildkite CI run it shows the new code takes less CPU time as well.
In conclusion, the new code should also just appear cleaner because there are fewer checks to do when there is no message.
Reviewed By: #libc, philnik
Spies: emaste, nemanjai, philnik, libcxx-commits
Differential Revision: https://reviews.llvm.org/D155820
|
|
Benchmark results:
```
----------------------------------------------------------------
Benchmark old new
----------------------------------------------------------------
bm_vector_bool_ranges_find/1 5.64 ns 6.08 ns
bm_vector_bool_ranges_find/2 16.5 ns 6.03 ns
bm_vector_bool_ranges_find/3 20.3 ns 6.07 ns
bm_vector_bool_ranges_find/4 22.2 ns 6.08 ns
bm_vector_bool_ranges_find/5 23.5 ns 6.05 ns
bm_vector_bool_ranges_find/6 24.4 ns 6.10 ns
bm_vector_bool_ranges_find/7 26.7 ns 6.10 ns
bm_vector_bool_ranges_find/8 25.0 ns 6.08 ns
bm_vector_bool_ranges_find/16 27.9 ns 6.07 ns
bm_vector_bool_ranges_find/64 44.5 ns 5.35 ns
bm_vector_bool_ranges_find/512 243 ns 25.7 ns
bm_vector_bool_ranges_find/4096 1858 ns 35.6 ns
bm_vector_bool_ranges_find/32768 15461 ns 93.5 ns
bm_vector_bool_ranges_find/262144 126462 ns 571 ns
bm_vector_bool_ranges_find/1048576 497736 ns 2272 ns
```
Reviewed By: #libc, Mordante
Spies: var-const, Mordante, libcxx-commits
Differential Revision: https://reviews.llvm.org/D156039
|
|
There are couple of optimizations of `__hash_table::find` which are applicable
to other places like `__hash_table::__node_insert_unique_prepare` and
`__hash_table::__emplace_unique_key_args`.
```
for (__nd = __nd->__next_; __nd != nullptr &&
(__nd->__hash() == __hash
// ^^^^^^^^^^^^^^^^^^^^^^
// (1)
|| std::__constrain_hash(__nd->__hash(), __bc) == __chash);
__nd = __nd->__next_)
{
if ((__nd->__hash() == __hash)
// ^^^^^^^^^^^^^^^^^^^^^^^^^^
// (2)
&& key_eq()(__nd->__upcast()->__value_, __k))
return iterator(__nd, this);
}
```
(1): We can avoid expensive modulo operations from `std::__constrain_hash` if
hashes matched. This one is from commit 6a411472e3c4.
(2): We can avoid `key_eq` calls if hashes didn't match. Commit:
318d35a7bca6c4e5.
Both of them are applicable for insert and emplace methods.
Results of unordered_set_operations benchmark:
```
Comparing /tmp/main to /tmp/hashtable-hash-value-optimization
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------------------------------------------
BM_Hash/uint32_random_std_hash/1024 -0.0127 -0.0127 1511 1492 1508 1489
BM_Hash/uint32_random_custom_hash/1024 +0.0012 +0.0013 1370 1371 1367 1369
BM_Hash/uint32_top_std_hash/1024 -0.0027 -0.0028 1502 1497 1498 1494
BM_Hash/uint32_top_custom_hash/1024 +0.0033 +0.0032 1368 1373 1365 1370
BM_InsertValue/unordered_set_uint32/1024 +0.0267 +0.0266 36421 37392 36350 37318
BM_InsertValue/unordered_set_uint32_sorted/1024 +0.0230 +0.0229 28247 28897 28193 28837
BM_InsertValue/unordered_set_top_bits_uint32/1024 +0.0492 +0.0491 31012 32539 30952 32472
BM_InsertValueRehash/unordered_set_top_bits_uint32/1024 +0.0523 +0.0520 62905 66197 62780 66043
BM_InsertValue/unordered_set_string/1024 -0.0252 -0.0253 300762 293189 299805 292221
BM_InsertValueRehash/unordered_set_string/1024 -0.0932 -0.0920 332924 301882 331276 300810
BM_InsertValue/unordered_set_prefixed_string/1024 -0.0578 -0.0577 314239 296072 313222 295137
BM_InsertValueRehash/unordered_set_prefixed_string/1024 -0.0986 -0.0985 336091 302950 334982 301995
BM_Find/unordered_set_random_uint64/1024 -0.1416 -0.1417 16075 13798 16041 13769
BM_FindRehash/unordered_set_random_uint64/1024 -0.0105 -0.0105 5900 5838 5889 5827
BM_Find/unordered_set_sorted_uint64/1024 +0.0014 +0.0014 2813 2817 2807 2811
BM_FindRehash/unordered_set_sorted_uint64/1024 -0.0247 -0.0249 5863 5718 5851 5706
BM_Find/unordered_set_sorted_uint128/1024 +0.0113 +0.0112 15570 15746 15539 15713
BM_FindRehash/unordered_set_sorted_uint128/1024 +0.0438 +0.0441 6917 7220 6902 7206
BM_Find/unordered_set_sorted_uint32/1024 -0.0020 -0.0020 3098 3091 3092 3085
BM_FindRehash/unordered_set_sorted_uint32/1024 +0.0570 +0.0569 5377 5684 5368 5673
BM_Find/unordered_set_sorted_large_uint64/1024 +0.0081 +0.0081 3594 3623 3587 3616
BM_FindRehash/unordered_set_sorted_large_uint64/1024 -0.0542 -0.0540 6154 5820 6140 5808
BM_Find/unordered_set_top_bits_uint64/1024 -0.0061 -0.0061 10440 10377 10417 10353
BM_FindRehash/unordered_set_top_bits_uint64/1024 +0.0131 +0.0128 5852 5928 5840 5914
BM_Find/unordered_set_string/1024 -0.0352 -0.0349 189037 182384 188389 181809
BM_FindRehash/unordered_set_string/1024 -0.0309 -0.0311 180718 175142 180141 174532
BM_Find/unordered_set_prefixed_string/1024 -0.0559 -0.0557 190853 180177 190251 179659
BM_FindRehash/unordered_set_prefixed_string/1024 -0.0563 -0.0561 182396 172136 181797 171602
BM_Rehash/unordered_set_string_arg/1024 -0.0244 -0.0241 27052 26393 26989 26339
BM_Rehash/unordered_set_int_arg/1024 -0.0410 -0.0410 19582 18779 19539 18738
BM_InsertDuplicate/unordered_set_int/1024 +0.0023 +0.0025 12168 12196 12142 12173
BM_InsertDuplicate/unordered_set_string/1024 -0.0505 -0.0504 189238 179683 188648 179133
BM_InsertDuplicate/unordered_set_prefixed_string/1024 -0.0989 -0.0987 198893 179222 198263 178702
BM_EmplaceDuplicate/unordered_set_int/1024 -0.0175 -0.0173 12674 12452 12646 12427
BM_EmplaceDuplicate/unordered_set_string/1024 -0.0559 -0.0557 190104 179481 189492 178934
BM_EmplaceDuplicate/unordered_set_prefixed_string/1024 -0.1111 -0.1110 201233 178870 200608 178341
BM_InsertDuplicate/unordered_set_int_insert_arg/1024 -0.0747 -0.0745 12993 12022 12964 11997
BM_InsertDuplicate/unordered_set_string_insert_arg/1024 -0.0584 -0.0583 191489 180311 190864 179731
BM_EmplaceDuplicate/unordered_set_int_insert_arg/1024 -0.0807 -0.0804 35946 33047 35866 32982
BM_EmplaceDuplicate/unordered_set_string_arg/1024 -0.0312 -0.0310 321623 311601 320559 310637
OVERALL_GEOMEAN -0.0276 -0.0275 0 0 0 0
```
Time differences looks more like noise to me. But if we want to have this
optimizations in `find`, we probably want them in `insert` and `emplace` as
well.
Reviewed By: #libc, Mordante
Differential Revision: https://reviews.llvm.org/D140779
|
|
This commit does a pass of clang-format over files in libc++ that
don't require major changes to conform to our style guide, or for
which we're not overly concerned about conflicting with in-flight
patches or hindering the git blame.
This roughly covers:
- benchmarks
- range algorithms
- concepts
- type traits
I did a manual verification of all the changes, and in particular I
applied clang-format on/off annotations in a few places where the
result was less readable after than before. This was not necessary
in a lot of places, however I did find that clang-format had pretty
bad taste when it comes to formatting concepts.
Differential Revision: https://reviews.llvm.org/D153140
|
|
CMake older than 3.20.0 is no longer supported.
This removes work-arounds for no longer supported versions.
Reviewed By: #libc, jloser, philnik
Differential Revision: https://reviews.llvm.org/D152099
|
|
This reverts commit b1dc43aa3a05c2f14725e2e6428544208ccbe161.
|
|
```
---------------------------------------------------
Benchmark old new
---------------------------------------------------
bm_for_each/1 3.00 ns 2.98 ns
bm_for_each/2 4.53 ns 4.57 ns
bm_for_each/3 5.82 ns 5.82 ns
bm_for_each/4 6.94 ns 6.91 ns
bm_for_each/5 7.55 ns 7.75 ns
bm_for_each/6 7.06 ns 7.45 ns
bm_for_each/7 6.69 ns 7.14 ns
bm_for_each/8 6.86 ns 4.06 ns
bm_for_each/16 11.5 ns 5.73 ns
bm_for_each/64 43.7 ns 4.06 ns
bm_for_each/512 356 ns 7.98 ns
bm_for_each/4096 2787 ns 53.6 ns
bm_for_each/32768 20836 ns 438 ns
bm_for_each/262144 195362 ns 4945 ns
bm_for_each/1048576 685482 ns 19822 ns
```
Reviewed By: ldionne, Mordante, #libc
Spies: arichardson, libcxx-commits
Differential Revision: https://reviews.llvm.org/D151274
|
|
Reviewed By: #libc, ldionne
Spies: Mordante, libcxx-commits, ldionne, mikhail.ramalho
Differential Revision: https://reviews.llvm.org/D144394
|
|
This is an ongoing series of commits that are reformatting our
Python code.
Reformatting is done with `black`.
If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run git checkout --ours <yourfile> and then reformat it
with black.
If you run into any problems, post to discourse about it and
we will try to help.
RFC Thread below:
https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style
Reviewed By: #libc, kwk, Mordante
Differential Revision: https://reviews.llvm.org/D150763
|
|
__uninitialized_allocator_copy for trivial types
See: https://github.com/llvm/llvm-project/issues/61987
Fix suggested by: @philnik and @var-const
Reviewers: philnik, ldionne, EricWF, var-const
Differential Revision: https://reviews.llvm.org/D147741
Testing:
ninja check-cxx check-clang check-llvm
Benchmark Testcases (BM_CopyConstruct, and BM_Assignment) added.
performance improvement:
Run on (8 X 4800 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 1280 KiB (x4)
L3 Unified 12288 KiB (x1)
Load Average: 1.66, 3.02, 2.43
Comparing build-runtimes-base/libcxx/benchmarks/vector_operations.libcxx.out to build-runtimes/libcxx/benchmarks/vector_operations.libcxx.out
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------------------------------------
BM_ConstructSize/vector_byte/5140480 +0.0362 +0.0362 116906 121132 116902 121131
BM_CopyConstruct/vector_int/5140480 -0.4563 -0.4577 1755224 954241 1755330 951987
BM_Assignment/vector_int/5140480 -0.0222 -0.0220 990045 968095 989917 968125
BM_ConstructSizeValue/vector_byte/5140480 +0.0308 +0.0307 116970 120567 116977 120573
BM_ConstructIterIter/vector_char/1024 -0.0831 -0.0831 19 17 19 17
BM_ConstructIterIter/vector_size_t/1024 +0.0129 +0.0131 88 89 88 89
BM_ConstructIterIter/vector_string/1024 -0.0064 -0.0018 54455 54109 54208 54112
OVERALL_GEOMEAN -0.0845 -0.0842 0 0 0 0
FYI, the perf improvements for BM_CopyConstruct due to this patch is mostly subsumed by the https://reviews.llvm.org/D149826. However this patch still adds value by converting copy to memmove (the second testcase).
Before the patch:
```
define linkonce_odr dso_local void @_ZNSt3__16vectorIiNS_9allocatorIiEEE18__construct_at_endIPiS5_EEvT_T0_m(ptr noundef nonnull align 8 dereferenceable(24) %0, ptr noundef %1, ptr noundef %2, i64 noundef %3) local_unnamed_addr #4 comdat align 2 {
%5 = getelementptr inbounds %"class.std::__1::vector", ptr %0, i64 0, i32 1
%6 = load ptr, ptr %5, align 8, !tbaa !12
%7 = icmp eq ptr %1, %2
br i1 %7, label %16, label %8
8: ; preds = %4, %8
%9 = phi ptr [ %13, %8 ], [ %1, %4 ]
%10 = phi ptr [ %14, %8 ], [ %6, %4 ]
%11 = icmp ne ptr %10, null
tail call void @llvm.assume(i1 %11)
%12 = load i32, ptr %9, align 4, !tbaa !14
store i32 %12, ptr %10, align 4, !tbaa !14
%13 = getelementptr inbounds i32, ptr %9, i64 1
%14 = getelementptr inbounds i32, ptr %10, i64 1
%15 = icmp eq ptr %13, %2
br i1 %15, label %16, label %8, !llvm.loop !16
16: ; preds = %8, %4
%17 = phi ptr [ %6, %4 ], [ %14, %8 ]
store ptr %17, ptr %5, align 8, !tbaa !12
ret void
}
```
After the patch:
```
define linkonce_odr dso_local void @_ZNSt3__16vectorIiNS_9allocatorIiEEE18__construct_at_endIPiS5_EEvT_T0_m(ptr noundef nonnull align 8 dereferenceable(24) %0, ptr noundef %1, ptr noundef %2, i64 noundef %3) local_unnamed_addr #4 comdat align 2 {
%5 = getelementptr inbounds %"class.std::__1::vector", ptr %0, i64 0, i32 1
%6 = load ptr, ptr %5, align 8, !tbaa !12
%7 = ptrtoint ptr %2 to i64
%8 = ptrtoint ptr %1 to i64
%9 = sub i64 %7, %8
%10 = ashr exact i64 %9, 2
tail call void @llvm.memmove.p0.p0.i64(ptr align 4 %6, ptr align 4 %1, i64 %9, i1 false)
%11 = getelementptr inbounds i32, ptr %6, i64 %10
store ptr %11, ptr %5, align 8, !tbaa !12
ret void
}
```
This is due to the optimized version of uninitialized_allocator_copy function.
|
|
The original `__dynamic_cast` implementation does not use the ABI-provided `src2dst_offset` parameter which helps improve performance on the hot paths. This patch improves the performance on the hot paths in `__dynamic_cast` by leveraging hints provided by the `src2dst_offset` parameter. This patch also includes a performance benchmark suite for the `__dynamic_cast` implementation.
Reviewed By: philnik, ldionne, #libc, #libc_abi, avogelsgesang
Spies: mikhail.ramalho, avogelsgesang, xingxue, libcxx-commits
Differential Revision: https://reviews.llvm.org/D138005
|
|
Don't forward to `min_element` for small types that are trivially copyable, and instead use a naive loop that keeps track of the smallest element (as opposed to an iterator to the smallest element). This allows the compiler to vectorize the loop in some cases.
Reviewed By: #libc, ldionne
Spies: ldionne, libcxx-commits
Differential Revision: https://reviews.llvm.org/D143596
|
|
Reviewed By: #libc, ldionne
Spies: ldionne, Mordante, libcxx-commits
Differential Revision: https://reviews.llvm.org/D139554
|