rocket-tools/riscv-gnu-toolchain/llvm.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author	Files	Lines
2024-06-12	[libc++] Overhaul the PSTL dispatching mechanism (#88131)	Louis Dionne	1	-0/+1
	The experimental PSTL's current dispatching mechanism was designed with flexibility in mind. However, while reviewing the in-progress OpenMP backend, I realized that the dispatching mechanism based on ADL and default definitions in the frontend had several downsides. To name a few: 1. The dispatching of an algorithm to the back-end and its default implementation is bundled together via `_LIBCPP_PSTL_CUSTOMIZATION_POINT`. This makes the dispatching really confusing and leads to annoyances such as variable shadowing and weird lambda captures in the front-end. 2. The distinction between back-end functions and front-end algorithms is not as clear as it could be, which led us to call one where we meant the other in a few cases. This is bad due to the exception requirements of the PSTL: calling a front-end algorithm inside the implementation of a back-end is incorrect for exception-safety. 3. There are two levels of back-end dispatching in the PSTL, which treat CPU backends as a special case. This was confusing and not as flexible as we'd like. For example, there was no straightforward way to dispatch all uses of `unseq` to a specific back-end from the OpenMP backend, or for CPU backends to fall back on each other. This patch rewrites the backend dispatching mechanism to solve these problems, but doesn't touch any of the actual implementation of algorithms. Specifically, this rewrite has the following characteristics: - There is a single level of backend dispatching, however partial backends can be stacked to provide a full implementation of the PSTL. The two-level dispatching that was used for CPU-based backends is handled by providing CPU-based basis operations as simple helpers that can easily be reused when defining any PSTL backend. - The default definitions for algorithms are separated from their dispatching logic. - The front-end is thus simplified a whole lot and made very consistent for all algorithms, which makes it easier to audit the front-end for things like exception-correctness, appropriate forwarding, etc. Fixes #70718
2024-05-28	[runtimes][CMake] Simplify the propagation of test dependencies (#93558)	Louis Dionne	1	-5/+1
	Instead of using FOO_TEST_DEPS global variables that don't get updated properly from subdirectories, use targets to propagate the dependencies across directories.
2024-05-08	[libc++] Implement std::gcd using the binary version (#77747)	serge-sans-paille	2	-1/+55
	The binary version is four times faster than current implementation in my setup, and generally considered a better implementation. Code inspired by https://en.algorithmica.org/hpc/algorithms/gcd/ which itself is inspired by https://lemire.me/blog/2013/12/26/fastest-way-to-compute-the-greatest-common-divisor/ Fix #77648
2024-04-24	[libc++][format] Adds an escaped output benchmark. (#88020)	Mark de Wever	2	-0/+304
	This is a preparation to measure the performance impact of - P2713R1 Escaping improvements in std::format and its performance improving followup patch.
2024-04-13	[tzdb] Replace shared_mutex with mutex. (#87929)	Eric	2	-0/+42
	The overhead of taking a std::mutex is much lower than taking a reader lock on a shared mutex, even under heavy contention. The benefit of shared_mutex only occurs as the amount of time spent in the critical sections grows large enough. In our case all we do is read a pointer and return the lock. As a result, using a shared lock can be ~50%-100% slower Here are the results for the provided benchmark on my machine: ``` 2024-04-07T12:48:51-04:00 Running ./libcxx/benchmarks/shared_mutex_vs_mutex.libcxx.out Run on (12 X 400 MHz CPU s) CPU Caches: L1 Data 32 KiB (x6) L1 Instruction 32 KiB (x6) L2 Unified 1024 KiB (x6) L3 Unified 32768 KiB (x1) Load Average: 2.70, 2.70, 1.63 --------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------- BM_shared_mutex/threads:1 13.9 ns 13.9 ns 50533700 BM_shared_mutex/threads:2 34.5 ns 68.9 ns 9957784 BM_shared_mutex/threads:4 38.4 ns 137 ns 4987772 BM_shared_mutex/threads:8 51.1 ns 358 ns 1974160 BM_shared_mutex/threads:32 57.1 ns 682 ns 1043648 BM_mutex/threads:1 5.54 ns 5.53 ns 125867422 BM_mutex/threads:2 15.5 ns 30.9 ns 21830116 BM_mutex/threads:4 15.4 ns 57.2 ns 12136920 BM_mutex/threads:8 19.3 ns 140 ns 4997080 BM_mutex/threads:32 20.8 ns 252 ns 2859808 ```
2024-04-06	[libc++] Optimize ranges::minmax (#87335)	Nikolas Klauser	2	-0/+69
	This allows Clang to vectorize the loop. ``` --------------------------------------------------------------------- Benchmark old new --------------------------------------------------------------------- BM_std_minmax<char>/1 0.659 ns 1.41 ns BM_std_minmax<char>/2 1.08 ns 2.16 ns BM_std_minmax<char>/3 2.16 ns 2.96 ns BM_std_minmax<char>/4 2.82 ns 3.81 ns BM_std_minmax<char>/5 3.43 ns 4.69 ns BM_std_minmax<char>/6 4.08 ns 5.63 ns BM_std_minmax<char>/7 4.75 ns 6.51 ns BM_std_minmax<char>/8 5.42 ns 7.41 ns BM_std_minmax<char>/9 6.05 ns 8.34 ns BM_std_minmax<char>/10 6.68 ns 9.29 ns BM_std_minmax<char>/11 7.47 ns 10.6 ns BM_std_minmax<char>/12 7.95 ns 11.4 ns BM_std_minmax<char>/13 8.64 ns 12.4 ns BM_std_minmax<char>/14 9.35 ns 13.4 ns BM_std_minmax<char>/15 10.1 ns 14.4 ns BM_std_minmax<char>/16 10.6 ns 2.25 ns BM_std_minmax<char>/17 11.3 ns 2.82 ns BM_std_minmax<char>/18 11.8 ns 3.71 ns BM_std_minmax<char>/19 12.6 ns 4.52 ns BM_std_minmax<char>/20 13.2 ns 5.47 ns BM_std_minmax<char>/21 14.1 ns 6.67 ns BM_std_minmax<char>/22 14.5 ns 7.78 ns BM_std_minmax<char>/23 15.1 ns 8.67 ns BM_std_minmax<char>/24 15.7 ns 9.68 ns BM_std_minmax<char>/25 16.4 ns 10.7 ns BM_std_minmax<char>/26 17.1 ns 11.7 ns BM_std_minmax<char>/27 17.8 ns 12.8 ns BM_std_minmax<char>/28 18.4 ns 14.1 ns BM_std_minmax<char>/29 19.0 ns 15.0 ns BM_std_minmax<char>/30 19.6 ns 16.0 ns BM_std_minmax<char>/31 20.2 ns 17.0 ns BM_std_minmax<char>/32 20.8 ns 2.46 ns BM_std_minmax<char>/64 41.5 ns 2.97 ns BM_std_minmax<char>/512 340 ns 6.05 ns BM_std_minmax<char>/1024 667 ns 8.83 ns BM_std_minmax<char>/4000 2571 ns 28.6 ns BM_std_minmax<char>/4096 2632 ns 25.8 ns BM_std_minmax<char>/5500 3554 ns 51.1 ns BM_std_minmax<char>/64000 41175 ns 480 ns BM_std_minmax<char>/65536 42039 ns 490 ns BM_std_minmax<char>/70000 44931 ns 528 ns BM_std_minmax<short>/1 0.708 ns 1.20 ns BM_std_minmax<short>/2 1.18 ns 1.78 ns BM_std_minmax<short>/3 1.98 ns 2.42 ns BM_std_minmax<short>/4 2.47 ns 3.05 ns BM_std_minmax<short>/5 3.09 ns 3.72 ns BM_std_minmax<short>/6 3.49 ns 4.37 ns BM_std_minmax<short>/7 4.24 ns 5.03 ns BM_std_minmax<short>/8 4.65 ns 2.12 ns BM_std_minmax<short>/9 5.34 ns 2.51 ns BM_std_minmax<short>/10 5.82 ns 3.18 ns BM_std_minmax<short>/11 6.36 ns 3.97 ns BM_std_minmax<short>/12 6.73 ns 4.68 ns BM_std_minmax<short>/13 7.59 ns 5.49 ns BM_std_minmax<short>/14 7.77 ns 6.45 ns BM_std_minmax<short>/15 8.54 ns 7.55 ns BM_std_minmax<short>/16 8.74 ns 2.38 ns BM_std_minmax<short>/17 9.59 ns 2.76 ns BM_std_minmax<short>/18 9.88 ns 3.37 ns BM_std_minmax<short>/19 10.7 ns 4.17 ns BM_std_minmax<short>/20 10.9 ns 4.88 ns BM_std_minmax<short>/21 12.1 ns 5.70 ns BM_std_minmax<short>/22 12.6 ns 6.64 ns BM_std_minmax<short>/23 13.5 ns 7.72 ns BM_std_minmax<short>/24 13.2 ns 2.87 ns BM_std_minmax<short>/25 14.2 ns 3.10 ns BM_std_minmax<short>/26 14.2 ns 3.59 ns BM_std_minmax<short>/27 15.4 ns 4.35 ns BM_std_minmax<short>/28 15.3 ns 5.10 ns BM_std_minmax<short>/29 16.2 ns 5.87 ns BM_std_minmax<short>/30 16.2 ns 6.88 ns BM_std_minmax<short>/31 17.0 ns 7.78 ns BM_std_minmax<short>/32 17.2 ns 3.45 ns BM_std_minmax<short>/64 34.1 ns 3.35 ns BM_std_minmax<short>/512 279 ns 8.37 ns BM_std_minmax<short>/1024 549 ns 14.2 ns BM_std_minmax<short>/4000 2111 ns 50.1 ns BM_std_minmax<short>/4096 2167 ns 47.9 ns BM_std_minmax<short>/5500 2895 ns 69.7 ns BM_std_minmax<short>/64000 33454 ns 953 ns BM_std_minmax<short>/65536 34474 ns 970 ns BM_std_minmax<short>/70000 36691 ns 1037 ns BM_std_minmax<int>/1 0.664 ns 1.17 ns BM_std_minmax<int>/2 1.11 ns 1.69 ns BM_std_minmax<int>/3 2.36 ns 2.29 ns BM_std_minmax<int>/4 2.53 ns 2.91 ns BM_std_minmax<int>/5 3.23 ns 3.56 ns BM_std_minmax<int>/6 3.56 ns 4.23 ns BM_std_minmax<int>/7 4.28 ns 4.91 ns BM_std_minmax<int>/8 4.60 ns 5.60 ns BM_std_minmax<int>/9 5.38 ns 6.31 ns BM_std_minmax<int>/10 5.69 ns 7.03 ns BM_std_minmax<int>/11 6.41 ns 7.70 ns BM_std_minmax<int>/12 6.73 ns 8.39 ns BM_std_minmax<int>/13 7.38 ns 9.07 ns BM_std_minmax<int>/14 7.74 ns 9.79 ns BM_std_minmax<int>/15 8.53 ns 10.5 ns BM_std_minmax<int>/16 8.79 ns 11.2 ns BM_std_minmax<int>/17 9.63 ns 12.0 ns BM_std_minmax<int>/18 9.84 ns 12.7 ns BM_std_minmax<int>/19 10.6 ns 13.5 ns BM_std_minmax<int>/20 11.0 ns 14.3 ns BM_std_minmax<int>/21 11.7 ns 15.0 ns BM_std_minmax<int>/22 12.0 ns 15.7 ns BM_std_minmax<int>/23 13.1 ns 16.5 ns BM_std_minmax<int>/24 13.0 ns 17.3 ns BM_std_minmax<int>/25 13.7 ns 17.9 ns BM_std_minmax<int>/26 14.0 ns 18.6 ns BM_std_minmax<int>/27 14.8 ns 19.4 ns BM_std_minmax<int>/28 15.1 ns 20.3 ns BM_std_minmax<int>/29 15.8 ns 20.9 ns BM_std_minmax<int>/30 16.1 ns 21.7 ns BM_std_minmax<int>/31 16.9 ns 22.5 ns BM_std_minmax<int>/32 17.2 ns 3.40 ns BM_std_minmax<int>/64 33.9 ns 4.04 ns BM_std_minmax<int>/512 275 ns 14.6 ns BM_std_minmax<int>/1024 541 ns 27.5 ns BM_std_minmax<int>/4000 2093 ns 96.3 ns BM_std_minmax<int>/4096 2146 ns 98.3 ns BM_std_minmax<int>/5500 2866 ns 157 ns BM_std_minmax<int>/64000 33619 ns 1954 ns BM_std_minmax<int>/65536 34252 ns 2009 ns BM_std_minmax<int>/70000 36618 ns 2125 ns BM_std_minmax<long long>/1 0.709 ns 1.19 ns BM_std_minmax<long long>/2 1.01 ns 1.65 ns BM_std_minmax<long long>/3 2.14 ns 2.21 ns BM_std_minmax<long long>/4 2.45 ns 2.83 ns BM_std_minmax<long long>/5 3.09 ns 3.47 ns BM_std_minmax<long long>/6 3.44 ns 4.11 ns BM_std_minmax<long long>/7 4.16 ns 4.79 ns BM_std_minmax<long long>/8 4.54 ns 5.47 ns BM_std_minmax<long long>/9 5.37 ns 6.20 ns BM_std_minmax<long long>/10 5.71 ns 6.93 ns BM_std_minmax<long long>/11 6.00 ns 7.60 ns BM_std_minmax<long long>/12 6.43 ns 8.27 ns BM_std_minmax<long long>/13 7.01 ns 8.94 ns BM_std_minmax<long long>/14 7.45 ns 9.65 ns BM_std_minmax<long long>/15 8.16 ns 10.4 ns BM_std_minmax<long long>/16 8.46 ns 5.22 ns BM_std_minmax<long long>/17 9.16 ns 5.22 ns BM_std_minmax<long long>/18 9.53 ns 5.52 ns BM_std_minmax<long long>/19 10.2 ns 6.02 ns BM_std_minmax<long long>/20 10.5 ns 6.89 ns BM_std_minmax<long long>/21 11.3 ns 7.83 ns BM_std_minmax<long long>/22 11.6 ns 8.59 ns BM_std_minmax<long long>/23 12.3 ns 9.91 ns BM_std_minmax<long long>/24 12.6 ns 10.1 ns BM_std_minmax<long long>/25 13.2 ns 12.0 ns BM_std_minmax<long long>/26 13.6 ns 13.5 ns BM_std_minmax<long long>/27 14.2 ns 14.8 ns BM_std_minmax<long long>/28 14.7 ns 15.9 ns BM_std_minmax<long long>/29 15.3 ns 16.6 ns BM_std_minmax<long long>/30 15.8 ns 17.3 ns BM_std_minmax<long long>/31 16.3 ns 18.2 ns BM_std_minmax<long long>/32 16.7 ns 7.18 ns BM_std_minmax<long long>/64 33.1 ns 11.5 ns BM_std_minmax<long long>/512 268 ns 71.0 ns BM_std_minmax<long long>/1024 532 ns 138 ns BM_std_minmax<long long>/4000 2056 ns 533 ns BM_std_minmax<long long>/4096 2112 ns 539 ns BM_std_minmax<long long>/5500 2823 ns 749 ns BM_std_minmax<long long>/64000 32956 ns 8590 ns BM_std_minmax<long long>/65536 33795 ns 8791 ns BM_std_minmax<long long>/70000 36084 ns 9442 ns BM_std_minmax<unsigned char>/1 0.714 ns 1.41 ns BM_std_minmax<unsigned char>/2 0.955 ns 1.96 ns BM_std_minmax<unsigned char>/3 1.90 ns 2.63 ns BM_std_minmax<unsigned char>/4 2.40 ns 3.34 ns BM_std_minmax<unsigned char>/5 2.87 ns 4.10 ns BM_std_minmax<unsigned char>/6 3.47 ns 4.88 ns BM_std_minmax<unsigned char>/7 4.04 ns 5.66 ns BM_std_minmax<unsigned char>/8 4.65 ns 6.45 ns BM_std_minmax<unsigned char>/9 5.18 ns 7.24 ns BM_std_minmax<unsigned char>/10 5.80 ns 8.05 ns BM_std_minmax<unsigned char>/11 6.24 ns 8.86 ns BM_std_minmax<unsigned char>/12 6.78 ns 9.70 ns BM_std_minmax<unsigned char>/13 7.30 ns 10.6 ns BM_std_minmax<unsigned char>/14 7.86 ns 11.4 ns BM_std_minmax<unsigned char>/15 8.46 ns 12.3 ns BM_std_minmax<unsigned char>/16 9.00 ns 2.12 ns BM_std_minmax<unsigned char>/17 9.58 ns 2.83 ns BM_std_minmax<unsigned char>/18 10.1 ns 3.37 ns BM_std_minmax<unsigned char>/19 10.7 ns 4.11 ns BM_std_minmax<unsigned char>/20 11.2 ns 4.85 ns BM_std_minmax<unsigned char>/21 11.9 ns 5.69 ns BM_std_minmax<unsigned char>/22 12.3 ns 6.77 ns BM_std_minmax<unsigned char>/23 13.1 ns 7.56 ns BM_std_minmax<unsigned char>/24 13.5 ns 8.40 ns BM_std_minmax<unsigned char>/25 14.2 ns 9.30 ns BM_std_minmax<unsigned char>/26 14.4 ns 10.1 ns BM_std_minmax<unsigned char>/27 15.0 ns 11.1 ns BM_std_minmax<unsigned char>/28 15.3 ns 11.9 ns BM_std_minmax<unsigned char>/29 16.2 ns 12.9 ns BM_std_minmax<unsigned char>/30 16.5 ns 13.9 ns BM_std_minmax<unsigned char>/31 17.2 ns 14.8 ns BM_std_minmax<unsigned char>/32 17.6 ns 2.36 ns BM_std_minmax<unsigned char>/64 35.6 ns 3.21 ns BM_std_minmax<unsigned char>/512 288 ns 6.00 ns BM_std_minmax<unsigned char>/1024 573 ns 8.80 ns BM_std_minmax<unsigned char>/4000 2222 ns 28.6 ns BM_std_minmax<unsigned char>/4096 2265 ns 25.9 ns BM_std_minmax<unsigned char>/5500 3047 ns 48.8 ns BM_std_minmax<unsigned char>/64000 35059 ns 480 ns BM_std_minmax<unsigned char>/65536 35941 ns 491 ns BM_std_minmax<unsigned char>/70000 38922 ns 525 ns BM_std_minmax<unsigned short>/1 0.711 ns 1.18 ns BM_std_minmax<unsigned short>/2 0.957 ns 1.65 ns BM_std_minmax<unsigned short>/3 2.13 ns 2.21 ns BM_std_minmax<unsigned short>/4 2.14 ns 2.78 ns BM_std_minmax<unsigned short>/5 3.06 ns 3.29 ns BM_std_minmax<unsigned short>/6 2.89 ns 3.87 ns BM_std_minmax<unsigned short>/7 3.80 ns 4.55 ns BM_std_minmax<unsigned short>/8 3.68 ns 2.02 ns BM_std_minmax<unsigned short>/9 4.53 ns 2.40 ns BM_std_minmax<unsigned short>/10 4.60 ns 2.94 ns BM_std_minmax<unsigned short>/11 5.67 ns 3.67 ns BM_std_minmax<unsigned short>/12 5.39 ns 4.22 ns BM_std_minmax<unsigned short>/13 6.58 ns 4.78 ns BM_std_minmax<unsigned short>/14 6.33 ns 5.54 ns BM_std_minmax<unsigned short>/15 7.34 ns 6.30 ns BM_std_minmax<unsigned short>/16 7.17 ns 2.25 ns BM_std_minmax<unsigned short>/17 8.19 ns 2.61 ns BM_std_minmax<unsigned short>/18 8.02 ns 3.19 ns BM_std_minmax<unsigned short>/19 9.03 ns 3.72 ns BM_std_minmax<unsigned short>/20 8.89 ns 4.36 ns BM_std_minmax<unsigned short>/21 9.77 ns 5.10 ns BM_std_minmax<unsigned short>/22 9.70 ns 5.55 ns BM_std_minmax<unsigned short>/23 10.8 ns 6.29 ns BM_std_minmax<unsigned short>/24 10.6 ns 2.41 ns BM_std_minmax<unsigned short>/25 11.6 ns 2.75 ns BM_std_minmax<unsigned short>/26 11.4 ns 3.26 ns BM_std_minmax<unsigned short>/27 12.4 ns 3.86 ns BM_std_minmax<unsigned short>/28 12.3 ns 4.45 ns BM_std_minmax<unsigned short>/29 13.2 ns 5.07 ns BM_std_minmax<unsigned short>/30 13.1 ns 5.77 ns BM_std_minmax<unsigned short>/31 13.9 ns 6.65 ns BM_std_minmax<unsigned short>/32 13.9 ns 2.72 ns BM_std_minmax<unsigned short>/64 27.8 ns 3.25 ns BM_std_minmax<unsigned short>/512 220 ns 8.30 ns BM_std_minmax<unsigned short>/1024 435 ns 14.1 ns BM_std_minmax<unsigned short>/4000 1703 ns 49.8 ns BM_std_minmax<unsigned short>/4096 1746 ns 47.9 ns BM_std_minmax<unsigned short>/5500 2350 ns 69.9 ns BM_std_minmax<unsigned short>/64000 27388 ns 953 ns BM_std_minmax<unsigned short>/65536 28040 ns 975 ns BM_std_minmax<unsigned short>/70000 29967 ns 1040 ns BM_std_minmax<unsigned int>/1 0.712 ns 1.18 ns BM_std_minmax<unsigned int>/2 0.965 ns 1.65 ns BM_std_minmax<unsigned int>/3 2.13 ns 2.14 ns BM_std_minmax<unsigned int>/4 2.09 ns 2.64 ns BM_std_minmax<unsigned int>/5 3.02 ns 3.21 ns BM_std_minmax<unsigned int>/6 2.94 ns 3.81 ns BM_std_minmax<unsigned int>/7 3.91 ns 4.38 ns BM_std_minmax<unsigned int>/8 3.75 ns 4.93 ns BM_std_minmax<unsigned int>/9 4.71 ns 5.60 ns BM_std_minmax<unsigned int>/10 4.59 ns 6.26 ns BM_std_minmax<unsigned int>/11 5.57 ns 6.80 ns BM_std_minmax<unsigned int>/12 5.43 ns 7.47 ns BM_std_minmax<unsigned int>/13 6.45 ns 8.10 ns BM_std_minmax<unsigned int>/14 6.32 ns 8.69 ns BM_std_minmax<unsigned int>/15 7.29 ns 9.37 ns BM_std_minmax<unsigned int>/16 7.12 ns 9.99 ns BM_std_minmax<unsigned int>/17 8.24 ns 10.6 ns BM_std_minmax<unsigned int>/18 8.00 ns 11.2 ns BM_std_minmax<unsigned int>/19 8.94 ns 12.0 ns BM_std_minmax<unsigned int>/20 8.91 ns 12.6 ns BM_std_minmax<unsigned int>/21 9.73 ns 17.2 ns BM_std_minmax<unsigned int>/22 9.75 ns 13.8 ns BM_std_minmax<unsigned int>/23 10.6 ns 14.5 ns BM_std_minmax<unsigned int>/24 10.6 ns 15.1 ns BM_std_minmax<unsigned int>/25 11.5 ns 15.7 ns BM_std_minmax<unsigned int>/26 11.4 ns 16.3 ns BM_std_minmax<unsigned int>/27 12.3 ns 17.0 ns BM_std_minmax<unsigned int>/28 12.3 ns 17.6 ns BM_std_minmax<unsigned int>/29 13.2 ns 18.3 ns BM_std_minmax<unsigned int>/30 13.2 ns 19.0 ns BM_std_minmax<unsigned int>/31 14.0 ns 19.6 ns BM_std_minmax<unsigned int>/32 14.0 ns 3.39 ns BM_std_minmax<unsigned int>/64 27.6 ns 4.05 ns BM_std_minmax<unsigned int>/512 221 ns 14.2 ns BM_std_minmax<unsigned int>/1024 439 ns 25.5 ns BM_std_minmax<unsigned int>/4000 1720 ns 96.3 ns BM_std_minmax<unsigned int>/4096 1762 ns 97.8 ns BM_std_minmax<unsigned int>/5500 2364 ns 146 ns BM_std_minmax<unsigned int>/64000 27874 ns 1905 ns BM_std_minmax<unsigned int>/65536 28012 ns 1961 ns BM_std_minmax<unsigned int>/70000 29899 ns 2087 ns BM_std_minmax<unsigned long long>/1 0.707 ns 1.18 ns BM_std_minmax<unsigned long long>/2 0.909 ns 1.65 ns BM_std_minmax<unsigned long long>/3 1.65 ns 2.70 ns BM_std_minmax<unsigned long long>/4 1.93 ns 2.69 ns BM_std_minmax<unsigned long long>/5 2.45 ns 3.34 ns BM_std_minmax<unsigned long long>/6 2.78 ns 3.81 ns BM_std_minmax<unsigned long long>/7 3.28 ns 4.43 ns BM_std_minmax<unsigned long long>/8 3.70 ns 4.92 ns BM_std_minmax<unsigned long long>/9 4.12 ns 5.64 ns BM_std_minmax<unsigned long long>/10 4.44 ns 6.15 ns BM_std_minmax<unsigned long long>/11 4.91 ns 6.81 ns BM_std_minmax<unsigned long long>/12 5.31 ns 7.41 ns BM_std_minmax<unsigned long long>/13 5.72 ns 7.96 ns BM_std_minmax<unsigned long long>/14 6.05 ns 8.66 ns BM_std_minmax<unsigned long long>/15 6.55 ns 9.37 ns BM_std_minmax<unsigned long long>/16 6.89 ns 7.98 ns BM_std_minmax<unsigned long long>/17 7.34 ns 8.13 ns BM_std_minmax<unsigned long long>/18 7.73 ns 8.42 ns BM_std_minmax<unsigned long long>/19 8.26 ns 8.63 ns BM_std_minmax<unsigned long long>/20 8.54 ns 8.96 ns BM_std_minmax<unsigned long long>/21 9.14 ns 9.37 ns BM_std_minmax<unsigned long long>/22 9.39 ns 9.67 ns BM_std_minmax<unsigned long long>/23 10.1 ns 10.1 ns BM_std_minmax<unsigned long long>/24 10.4 ns 10.6 ns BM_std_minmax<unsigned long long>/25 11.0 ns 11.3 ns BM_std_minmax<unsigned long long>/26 11.3 ns 12.1 ns BM_std_minmax<unsigned long long>/27 11.8 ns 14.2 ns BM_std_minmax<unsigned long long>/28 12.1 ns 15.8 ns BM_std_minmax<unsigned long long>/29 12.6 ns 17.4 ns BM_std_minmax<unsigned long long>/30 13.1 ns 18.1 ns BM_std_minmax<unsigned long long>/31 13.4 ns 18.8 ns BM_std_minmax<unsigned long long>/32 13.8 ns 10.4 ns BM_std_minmax<unsigned long long>/64 27.3 ns 15.5 ns BM_std_minmax<unsigned long long>/512 222 ns 80.6 ns BM_std_minmax<unsigned long long>/1024 443 ns 156 ns BM_std_minmax<unsigned long long>/4000 1731 ns 591 ns BM_std_minmax<unsigned long long>/4096 1752 ns 609 ns BM_std_minmax<unsigned long long>/5500 2340 ns 819 ns BM_std_minmax<unsigned long long>/64000 27166 ns 9652 ns BM_std_minmax<unsigned long long>/65536 27869 ns 9876 ns BM_std_minmax<unsigned long long>/70000 29920 ns 10680 ns ```
2024-04-01	[libc++] Optimize the two range overload of mismatch (#86853)	Nikolas Klauser	1	-0/+16
	``` ----------------------------------------------------------------------------- Benchmark old new ----------------------------------------------------------------------------- bm_mismatch_two_range_overload<char>/1 0.941 ns 1.88 ns bm_mismatch_two_range_overload<char>/2 1.43 ns 2.15 ns bm_mismatch_two_range_overload<char>/3 1.95 ns 2.55 ns bm_mismatch_two_range_overload<char>/4 2.58 ns 2.90 ns bm_mismatch_two_range_overload<char>/5 3.75 ns 3.31 ns bm_mismatch_two_range_overload<char>/6 5.00 ns 3.83 ns bm_mismatch_two_range_overload<char>/7 5.59 ns 4.35 ns bm_mismatch_two_range_overload<char>/8 6.37 ns 4.84 ns bm_mismatch_two_range_overload<char>/16 11.8 ns 6.72 ns bm_mismatch_two_range_overload<char>/64 45.5 ns 2.59 ns bm_mismatch_two_range_overload<char>/512 366 ns 12.6 ns bm_mismatch_two_range_overload<char>/4096 2890 ns 91.6 ns bm_mismatch_two_range_overload<char>/32768 23038 ns 758 ns bm_mismatch_two_range_overload<char>/262144 142813 ns 6573 ns bm_mismatch_two_range_overload<char>/1048576 366679 ns 26710 ns bm_mismatch_two_range_overload<short>/1 0.934 ns 1.88 ns bm_mismatch_two_range_overload<short>/2 1.30 ns 2.58 ns bm_mismatch_two_range_overload<short>/3 1.76 ns 3.28 ns bm_mismatch_two_range_overload<short>/4 2.24 ns 3.98 ns bm_mismatch_two_range_overload<short>/5 2.80 ns 4.92 ns bm_mismatch_two_range_overload<short>/6 3.58 ns 6.01 ns bm_mismatch_two_range_overload<short>/7 4.29 ns 7.03 ns bm_mismatch_two_range_overload<short>/8 4.67 ns 7.39 ns bm_mismatch_two_range_overload<short>/16 9.86 ns 13.1 ns bm_mismatch_two_range_overload<short>/64 38.9 ns 4.55 ns bm_mismatch_two_range_overload<short>/512 348 ns 27.7 ns bm_mismatch_two_range_overload<short>/4096 2881 ns 225 ns bm_mismatch_two_range_overload<short>/32768 23111 ns 1715 ns bm_mismatch_two_range_overload<short>/262144 184846 ns 14416 ns bm_mismatch_two_range_overload<short>/1048576 742885 ns 57264 ns bm_mismatch_two_range_overload<int>/1 0.838 ns 1.19 ns bm_mismatch_two_range_overload<int>/2 1.19 ns 1.65 ns bm_mismatch_two_range_overload<int>/3 1.83 ns 2.06 ns bm_mismatch_two_range_overload<int>/4 2.38 ns 2.42 ns bm_mismatch_two_range_overload<int>/5 3.60 ns 2.47 ns bm_mismatch_two_range_overload<int>/6 3.68 ns 3.05 ns bm_mismatch_two_range_overload<int>/7 4.32 ns 3.36 ns bm_mismatch_two_range_overload<int>/8 5.18 ns 3.58 ns bm_mismatch_two_range_overload<int>/16 10.6 ns 2.84 ns bm_mismatch_two_range_overload<int>/64 39.0 ns 7.78 ns bm_mismatch_two_range_overload<int>/512 247 ns 53.9 ns bm_mismatch_two_range_overload<int>/4096 1927 ns 429 ns bm_mismatch_two_range_overload<int>/32768 15569 ns 3393 ns bm_mismatch_two_range_overload<int>/262144 125413 ns 28504 ns bm_mismatch_two_range_overload<int>/1048576 504549 ns 112729 ns ```
2024-03-29	[libc++] Optimize the std::mismatch tail (#83440)	Nikolas Klauser	1	-3/+12
	This adds vectorization to the last 0-3 vectors and, if the range is large enough, the remaining elements that don't fill a vector completely. ``` ----------------------------------------------------------------------- Benchmark old full vectors partial vector ----------------------------------------------------------------------- bm_mismatch<char>/1 1.40 ns 1.62 ns 2.09 ns bm_mismatch<char>/2 1.88 ns 2.10 ns 2.33 ns bm_mismatch<char>/3 2.67 ns 2.56 ns 2.72 ns bm_mismatch<char>/4 3.01 ns 3.20 ns 3.70 ns bm_mismatch<char>/5 3.51 ns 3.73 ns 3.64 ns bm_mismatch<char>/6 4.71 ns 4.85 ns 4.37 ns bm_mismatch<char>/7 5.12 ns 5.33 ns 4.37 ns bm_mismatch<char>/8 5.79 ns 6.02 ns 4.75 ns bm_mismatch<char>/15 9.20 ns 10.5 ns 7.23 ns bm_mismatch<char>/16 10.2 ns 10.1 ns 7.46 ns bm_mismatch<char>/17 10.2 ns 10.8 ns 7.57 ns bm_mismatch<char>/31 17.6 ns 17.1 ns 10.8 ns bm_mismatch<char>/32 17.4 ns 1.64 ns 1.64 ns bm_mismatch<char>/33 23.3 ns 2.10 ns 2.33 ns bm_mismatch<char>/63 31.8 ns 16.9 ns 2.33 ns bm_mismatch<char>/64 32.6 ns 2.10 ns 2.10 ns bm_mismatch<char>/65 33.6 ns 2.57 ns 2.80 ns bm_mismatch<char>/127 67.3 ns 18.1 ns 3.27 ns bm_mismatch<char>/128 2.17 ns 2.14 ns 2.57 ns bm_mismatch<char>/129 2.36 ns 2.80 ns 3.27 ns bm_mismatch<char>/255 67.5 ns 19.6 ns 4.68 ns bm_mismatch<char>/256 3.76 ns 3.71 ns 3.97 ns bm_mismatch<char>/257 3.77 ns 4.04 ns 4.43 ns bm_mismatch<char>/511 70.8 ns 22.1 ns 7.47 ns bm_mismatch<char>/512 7.27 ns 7.30 ns 6.95 ns bm_mismatch<char>/513 7.11 ns 7.05 ns 6.96 ns bm_mismatch<char>/1023 75.9 ns 27.4 ns 13.3 ns bm_mismatch<char>/1024 13.9 ns 13.8 ns 12.4 ns bm_mismatch<char>/1025 13.6 ns 13.6 ns 12.8 ns bm_mismatch<char>/2047 87.3 ns 37.5 ns 25.4 ns bm_mismatch<char>/2048 26.8 ns 27.4 ns 24.0 ns bm_mismatch<char>/2049 26.7 ns 27.3 ns 25.5 ns bm_mismatch<char>/4095 112 ns 64.7 ns 48.7 ns bm_mismatch<char>/4096 53.0 ns 54.2 ns 46.8 ns bm_mismatch<char>/4097 52.7 ns 54.2 ns 48.4 ns bm_mismatch<char>/8191 160 ns 118 ns 98.4 ns bm_mismatch<char>/8192 107 ns 108 ns 96.0 ns bm_mismatch<char>/8193 106 ns 108 ns 97.2 ns bm_mismatch<char>/16383 283 ns 234 ns 215 ns bm_mismatch<char>/16384 227 ns 223 ns 217 ns bm_mismatch<char>/16385 221 ns 221 ns 215 ns bm_mismatch<char>/32767 547 ns 499 ns 488 ns bm_mismatch<char>/32768 495 ns 492 ns 492 ns bm_mismatch<char>/32769 491 ns 489 ns 488 ns bm_mismatch<char>/65535 1028 ns 979 ns 971 ns bm_mismatch<char>/65536 976 ns 970 ns 974 ns bm_mismatch<char>/65537 970 ns 965 ns 971 ns bm_mismatch<char>/131071 2031 ns 1948 ns 2005 ns bm_mismatch<char>/131072 1973 ns 1955 ns 1974 ns bm_mismatch<char>/131073 1989 ns 1932 ns 2001 ns bm_mismatch<char>/262143 4469 ns 4244 ns 4223 ns bm_mismatch<char>/262144 4443 ns 4183 ns 4243 ns bm_mismatch<char>/262145 4400 ns 4232 ns 4246 ns bm_mismatch<char>/524287 10169 ns 9733 ns 9592 ns bm_mismatch<char>/524288 10154 ns 9664 ns 9843 ns bm_mismatch<char>/524289 10113 ns 9641 ns 10003 ns bm_mismatch<short>/1 1.86 ns 2.53 ns 2.32 ns bm_mismatch<short>/2 2.57 ns 2.77 ns 2.55 ns bm_mismatch<short>/3 3.26 ns 3.00 ns 2.79 ns bm_mismatch<short>/4 3.95 ns 3.39 ns 3.15 ns bm_mismatch<short>/5 4.83 ns 3.97 ns 3.72 ns bm_mismatch<short>/6 5.43 ns 4.34 ns 4.03 ns bm_mismatch<short>/7 6.11 ns 4.73 ns 4.44 ns bm_mismatch<short>/8 6.84 ns 5.02 ns 4.79 ns bm_mismatch<short>/15 11.5 ns 7.12 ns 6.50 ns bm_mismatch<short>/16 13.9 ns 1.87 ns 2.11 ns bm_mismatch<short>/17 14.0 ns 3.00 ns 2.47 ns bm_mismatch<short>/31 23.1 ns 7.87 ns 2.47 ns bm_mismatch<short>/32 23.8 ns 2.57 ns 2.81 ns bm_mismatch<short>/33 24.5 ns 3.70 ns 2.94 ns bm_mismatch<short>/63 44.8 ns 9.37 ns 3.46 ns bm_mismatch<short>/64 2.32 ns 2.57 ns 2.64 ns bm_mismatch<short>/65 2.52 ns 3.02 ns 3.51 ns bm_mismatch<short>/127 45.6 ns 9.97 ns 5.18 ns bm_mismatch<short>/128 3.85 ns 3.93 ns 3.94 ns bm_mismatch<short>/129 3.82 ns 4.20 ns 4.70 ns bm_mismatch<short>/255 50.4 ns 12.6 ns 8.07 ns bm_mismatch<short>/256 7.23 ns 6.91 ns 6.98 ns bm_mismatch<short>/257 7.24 ns 7.19 ns 7.55 ns bm_mismatch<short>/511 52.3 ns 17.8 ns 14.0 ns bm_mismatch<short>/512 13.6 ns 13.7 ns 13.6 ns bm_mismatch<short>/513 13.9 ns 13.8 ns 18.5 ns bm_mismatch<short>/1023 60.9 ns 30.9 ns 26.3 ns bm_mismatch<short>/1024 26.7 ns 27.7 ns 25.7 ns bm_mismatch<short>/1025 27.7 ns 27.6 ns 25.3 ns bm_mismatch<short>/2047 88.4 ns 58.0 ns 51.6 ns bm_mismatch<short>/2048 52.8 ns 55.3 ns 50.6 ns bm_mismatch<short>/2049 55.2 ns 54.8 ns 48.7 ns bm_mismatch<short>/4095 153 ns 113 ns 102 ns bm_mismatch<short>/4096 105 ns 110 ns 101 ns bm_mismatch<short>/4097 110 ns 110 ns 99.1 ns bm_mismatch<short>/8191 277 ns 219 ns 206 ns bm_mismatch<short>/8192 226 ns 214 ns 250 ns bm_mismatch<short>/8193 226 ns 207 ns 208 ns bm_mismatch<short>/16383 519 ns 492 ns 488 ns bm_mismatch<short>/16384 494 ns 492 ns 492 ns bm_mismatch<short>/16385 492 ns 488 ns 489 ns bm_mismatch<short>/32767 1007 ns 968 ns 964 ns bm_mismatch<short>/32768 977 ns 972 ns 970 ns bm_mismatch<short>/32769 972 ns 962 ns 967 ns bm_mismatch<short>/65535 1978 ns 1918 ns 1956 ns bm_mismatch<short>/65536 1940 ns 1927 ns 1970 ns bm_mismatch<short>/65537 1937 ns 1922 ns 1959 ns bm_mismatch<short>/131071 4524 ns 4193 ns 4304 ns bm_mismatch<short>/131072 4445 ns 4196 ns 4306 ns bm_mismatch<short>/131073 4452 ns 4278 ns 4311 ns bm_mismatch<short>/262143 9801 ns 10188 ns 9634 ns bm_mismatch<short>/262144 9738 ns 10151 ns 9651 ns bm_mismatch<short>/262145 9716 ns 10171 ns 9715 ns bm_mismatch<short>/524287 19944 ns 20718 ns 20044 ns bm_mismatch<short>/524288 21139 ns 20647 ns 20008 ns bm_mismatch<short>/524289 21162 ns 19512 ns 20068 ns bm_mismatch<int>/1 1.40 ns 1.84 ns 1.87 ns bm_mismatch<int>/2 1.87 ns 2.08 ns 2.09 ns bm_mismatch<int>/3 2.36 ns 2.31 ns 2.87 ns bm_mismatch<int>/4 3.06 ns 2.72 ns 2.95 ns bm_mismatch<int>/5 3.66 ns 3.37 ns 3.42 ns bm_mismatch<int>/6 4.55 ns 3.65 ns 3.73 ns bm_mismatch<int>/7 5.03 ns 3.93 ns 3.94 ns bm_mismatch<int>/8 5.67 ns 1.86 ns 1.87 ns bm_mismatch<int>/15 9.89 ns 4.41 ns 2.34 ns bm_mismatch<int>/16 10.1 ns 2.33 ns 2.34 ns bm_mismatch<int>/17 10.2 ns 3.34 ns 2.86 ns bm_mismatch<int>/31 17.2 ns 5.54 ns 3.28 ns bm_mismatch<int>/32 2.16 ns 2.15 ns 2.58 ns bm_mismatch<int>/33 2.36 ns 3.01 ns 3.28 ns bm_mismatch<int>/63 17.7 ns 6.50 ns 4.93 ns bm_mismatch<int>/64 3.81 ns 3.58 ns 3.90 ns bm_mismatch<int>/65 3.74 ns 4.36 ns 4.45 ns bm_mismatch<int>/127 19.5 ns 9.56 ns 7.74 ns bm_mismatch<int>/128 7.30 ns 6.41 ns 6.85 ns bm_mismatch<int>/129 7.09 ns 7.04 ns 7.06 ns bm_mismatch<int>/255 24.7 ns 14.8 ns 13.3 ns bm_mismatch<int>/256 14.0 ns 12.1 ns 12.3 ns bm_mismatch<int>/257 13.8 ns 12.7 ns 12.8 ns bm_mismatch<int>/511 34.3 ns 26.3 ns 24.8 ns bm_mismatch<int>/512 27.6 ns 23.6 ns 23.9 ns bm_mismatch<int>/513 27.3 ns 24.4 ns 25.1 ns bm_mismatch<int>/1023 62.5 ns 50.9 ns 48.3 ns bm_mismatch<int>/1024 54.4 ns 46.1 ns 46.6 ns bm_mismatch<int>/1025 54.2 ns 48.4 ns 47.5 ns bm_mismatch<int>/2047 116 ns 97.8 ns 94.1 ns bm_mismatch<int>/2048 108 ns 92.6 ns 92.4 ns bm_mismatch<int>/2049 108 ns 104 ns 94.0 ns bm_mismatch<int>/4095 233 ns 222 ns 205 ns bm_mismatch<int>/4096 226 ns 223 ns 225 ns bm_mismatch<int>/4097 221 ns 219 ns 210 ns bm_mismatch<int>/8191 499 ns 485 ns 488 ns bm_mismatch<int>/8192 496 ns 490 ns 495 ns bm_mismatch<int>/8193 491 ns 485 ns 488 ns bm_mismatch<int>/16383 982 ns 962 ns 964 ns bm_mismatch<int>/16384 974 ns 971 ns 971 ns bm_mismatch<int>/16385 971 ns 961 ns 968 ns bm_mismatch<int>/32767 2003 ns 1959 ns 1920 ns bm_mismatch<int>/32768 1996 ns 1947 ns 1928 ns bm_mismatch<int>/32769 1990 ns 1945 ns 1926 ns bm_mismatch<int>/65535 4434 ns 4275 ns 4312 ns bm_mismatch<int>/65536 4437 ns 4267 ns 4321 ns bm_mismatch<int>/65537 4442 ns 4261 ns 4321 ns bm_mismatch<int>/131071 9673 ns 9648 ns 9465 ns bm_mismatch<int>/131072 9667 ns 9671 ns 9465 ns bm_mismatch<int>/131073 9661 ns 9653 ns 9464 ns bm_mismatch<int>/262143 20595 ns 19605 ns 19064 ns bm_mismatch<int>/262144 19894 ns 19572 ns 19009 ns bm_mismatch<int>/262145 19851 ns 19656 ns 18999 ns bm_mismatch<int>/524287 39556 ns 39364 ns 38131 ns bm_mismatch<int>/524288 39678 ns 39573 ns 38183 ns bm_mismatch<int>/524289 40168 ns 39301 ns 38121 ns ```
2024-03-23	[libc++] Vectorize mismatch (#73255)	Nikolas Klauser	2	-0/+32
	``` --------------------------------------------------- Benchmark old new --------------------------------------------------- bm_mismatch<char>/1 0.835 ns 2.37 ns bm_mismatch<char>/2 1.44 ns 2.60 ns bm_mismatch<char>/3 2.06 ns 2.83 ns bm_mismatch<char>/4 2.60 ns 3.29 ns bm_mismatch<char>/5 3.15 ns 3.77 ns bm_mismatch<char>/6 3.82 ns 4.17 ns bm_mismatch<char>/7 4.29 ns 4.52 ns bm_mismatch<char>/8 4.78 ns 4.86 ns bm_mismatch<char>/16 9.06 ns 7.54 ns bm_mismatch<char>/64 31.7 ns 19.1 ns bm_mismatch<char>/512 249 ns 8.16 ns bm_mismatch<char>/4096 1956 ns 44.2 ns bm_mismatch<char>/32768 15498 ns 501 ns bm_mismatch<char>/262144 123965 ns 4479 ns bm_mismatch<char>/1048576 495668 ns 21306 ns bm_mismatch<short>/1 0.710 ns 2.12 ns bm_mismatch<short>/2 1.03 ns 2.66 ns bm_mismatch<short>/3 1.29 ns 3.56 ns bm_mismatch<short>/4 1.68 ns 4.29 ns bm_mismatch<short>/5 1.96 ns 5.18 ns bm_mismatch<short>/6 2.59 ns 5.91 ns bm_mismatch<short>/7 2.86 ns 6.63 ns bm_mismatch<short>/8 3.19 ns 7.33 ns bm_mismatch<short>/16 5.48 ns 13.0 ns bm_mismatch<short>/64 16.6 ns 4.06 ns bm_mismatch<short>/512 130 ns 13.8 ns bm_mismatch<short>/4096 985 ns 93.8 ns bm_mismatch<short>/32768 7846 ns 1002 ns bm_mismatch<short>/262144 63217 ns 10637 ns bm_mismatch<short>/1048576 251782 ns 42471 ns bm_mismatch<int>/1 0.716 ns 1.91 ns bm_mismatch<int>/2 1.21 ns 2.49 ns bm_mismatch<int>/3 1.38 ns 3.46 ns bm_mismatch<int>/4 1.71 ns 4.04 ns bm_mismatch<int>/5 2.00 ns 4.98 ns bm_mismatch<int>/6 2.43 ns 5.67 ns bm_mismatch<int>/7 3.05 ns 6.38 ns bm_mismatch<int>/8 3.22 ns 7.09 ns bm_mismatch<int>/16 5.18 ns 12.8 ns bm_mismatch<int>/64 16.6 ns 5.28 ns bm_mismatch<int>/512 129 ns 25.2 ns bm_mismatch<int>/4096 1009 ns 201 ns bm_mismatch<int>/32768 7776 ns 2144 ns bm_mismatch<int>/262144 62371 ns 20551 ns bm_mismatch<int>/1048576 254750 ns 90097 ns ```
2024-03-17	[libc++] Optimize ranges::fill{,_n} for vector<bool>::iterator (#84642)	Nikolas Klauser	2	-0/+50
	``` ------------------------------------------------------ Benchmark old new ------------------------------------------------------ bm_ranges_fill_n/1 1.64 ns 3.06 ns bm_ranges_fill_n/2 3.45 ns 3.06 ns bm_ranges_fill_n/3 4.88 ns 3.06 ns bm_ranges_fill_n/4 6.46 ns 3.06 ns bm_ranges_fill_n/5 8.03 ns 3.06 ns bm_ranges_fill_n/6 9.65 ns 3.07 ns bm_ranges_fill_n/7 11.5 ns 3.06 ns bm_ranges_fill_n/8 13.0 ns 3.06 ns bm_ranges_fill_n/16 25.9 ns 3.06 ns bm_ranges_fill_n/64 103 ns 4.62 ns bm_ranges_fill_n/512 711 ns 4.40 ns bm_ranges_fill_n/4096 5642 ns 9.86 ns bm_ranges_fill_n/32768 45135 ns 33.6 ns bm_ranges_fill_n/262144 360818 ns 243 ns bm_ranges_fill_n/1048576 1442828 ns 982 ns bm_ranges_fill/1 1.63 ns 3.17 ns bm_ranges_fill/2 3.43 ns 3.28 ns bm_ranges_fill/3 4.97 ns 3.31 ns bm_ranges_fill/4 6.53 ns 3.27 ns bm_ranges_fill/5 8.12 ns 3.33 ns bm_ranges_fill/6 9.76 ns 3.32 ns bm_ranges_fill/7 11.6 ns 3.29 ns bm_ranges_fill/8 13.2 ns 3.26 ns bm_ranges_fill/16 26.3 ns 3.26 ns bm_ranges_fill/64 104 ns 4.92 ns bm_ranges_fill/512 716 ns 4.47 ns bm_ranges_fill/4096 5772 ns 8.21 ns bm_ranges_fill/32768 45778 ns 33.1 ns bm_ranges_fill/262144 351422 ns 241 ns bm_ranges_fill/1048576 1404710 ns 965 ns ```
2024-02-21	[libc++][test] add benchmarks for `std::atomic::wait` (#70571)	Hui	3	-0/+265
	For the mutex vs atomic test: Old: `unique_lock<mutex>` New: a lock implemented with `atomic::wait` On 10 years old Intel Macbook, `atomic::wait` is 50% slower than `mutex` ``` Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------- BM_multi_thread_lock_unlock/1024 +0.3735 +2.4497 1724726 2368935 153159 528354 BM_multi_thread_lock_unlock/2048 +0.4174 +1.2487 3410538 4834012 435062 978311 BM_multi_thread_lock_unlock/4096 +0.5256 +1.9824 6903783 10532681 590266 1760405 BM_multi_thread_lock_unlock/8192 +0.5415 +0.4578 14536391 22408399 1456328 2123075 BM_multi_thread_lock_unlock/16384 +0.5663 +0.0513 30181991 47275023 3316850 3486950 BM_multi_thread_lock_unlock/32768 +0.5635 -0.2081 62027663 96977726 6477076 5129190 BM_multi_thread_lock_unlock/65536 +0.5228 -0.3273 129637761 197408739 11341630 7628955 BM_multi_thread_lock_unlock/131072 +0.4825 -0.1070 266256295 394712193 10379800 9269200 BM_multi_thread_lock_unlock/262144 +0.4793 +0.2795 539732340 798409253 10802200 13821100 BM_multi_thread_lock_unlock/524288 +0.5272 +0.2847 1070035132 1634124353 14523000 18657800 BM_multi_thread_lock_unlock/1048576 +0.4799 +0.3353 2125510441 3145636119 13404200 17899000 OVERALL_GEOMEAN +0.4970 +0.3886 0 0 0 0 ``` On Apple Arm, `atomic::wait` is 200% slower than `mutex`. And `atomic::wait` is even slower than my 10 years old Intel CPU Macbook ``` Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------- BM_multi_thread_lock_unlock/1024 +2.1811 +3.9854 2036726 6478993 119817 597334 BM_multi_thread_lock_unlock/2048 +1.6736 +1.4301 3162161 8454415 426201 1035727 BM_multi_thread_lock_unlock/4096 +1.1017 +0.6456 6620503 13914159 893019 1469578 BM_multi_thread_lock_unlock/8192 +0.6688 +0.2148 12089392 20174635 1489000 1808799 BM_multi_thread_lock_unlock/16384 +1.4217 -0.2436 19365999 46899345 2068266 1564530 BM_multi_thread_lock_unlock/32768 +2.6161 -0.4927 31371052 113440165 3715100 1884540 BM_multi_thread_lock_unlock/65536 +2.6286 -0.3967 54314581 197086847 5912764 3567410 BM_multi_thread_lock_unlock/131072 +2.3554 +0.4990 103176565 346201425 9260407 13880900 BM_multi_thread_lock_unlock/262144 +2.8780 +0.4995 182355400 707170733 16335852 24496000 BM_multi_thread_lock_unlock/524288 +3.0280 +0.3001 360953079 1453902595 32548700 42316364 BM_multi_thread_lock_unlock/1048576 +3.7480 +1.2374 714500462 3392470417 48603455 108747000 OVERALL_GEOMEAN +2.0791 +0.3874 0 0 0 0 ``` For the atomic_wait test: On my 2013 MacBook with Intel CPU ``` Run on (8 X 2300 MHz CPU s) CPU Caches: L1 Data 32 KiB (x4) L1 Instruction 32 KiB (x4) L2 Unified 256 KiB (x4) L3 Unified 6144 KiB (x1) Load Average: 1.95, 3.77, 4.13 ----------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------------------------------------------------------- BM_atomic_wait_one_thread_one_atomic_wait/1024 184455 ns 183979 ns 3760 BM_atomic_wait_one_thread_one_atomic_wait/2048 361607 ns 360917 ns 1912 BM_atomic_wait_one_thread_one_atomic_wait/4096 709055 ns 708326 ns 929 BM_atomic_wait_one_thread_one_atomic_wait/8192 1469063 ns 1467430 ns 488 BM_atomic_wait_one_thread_one_atomic_wait/16384 2865332 ns 2863473 ns 237 BM_atomic_wait_one_thread_one_atomic_wait/32768 5839429 ns 5834708 ns 113 BM_atomic_wait_one_thread_one_atomic_wait/65536 11460822 ns 11453183 ns 60 BM_atomic_wait_one_thread_one_atomic_wait/131072 23052804 ns 23035000 ns 30 BM_atomic_wait_one_thread_one_atomic_wait/262144 46958743 ns 46712733 ns 15 BM_atomic_wait_one_thread_one_atomic_wait/524288 93151904 ns 92977429 ns 7 BM_atomic_wait_one_thread_one_atomic_wait/1048576 186100011 ns 185888500 ns 4 BM_atomic_wait_one_thread_one_atomic_wait/2097152 364548135 ns 364280000 ns 2 BM_atomic_wait_one_thread_one_atomic_wait/4194304 747181672 ns 745056000 ns 1 BM_atomic_wait_one_thread_one_atomic_wait/8388608 1473070400 ns 1471165000 ns 1 BM_atomic_wait_one_thread_one_atomic_wait/16777216 2950352547 ns 2947373000 ns 1 BM_atomic_wait_multi_thread_one_atomic_wait/1024 668544 ns 167233 ns 4496 BM_atomic_wait_multi_thread_one_atomic_wait/2048 1384668 ns 369750 ns 1941 BM_atomic_wait_multi_thread_one_atomic_wait/4096 2851627 ns 768559 ns 995 BM_atomic_wait_multi_thread_one_atomic_wait/8192 5797669 ns 1476876 ns 526 BM_atomic_wait_multi_thread_one_atomic_wait/16384 11597952 ns 2692792 ns 260 BM_atomic_wait_multi_thread_one_atomic_wait/32768 23528028 ns 5291465 ns 142 BM_atomic_wait_multi_thread_one_atomic_wait/65536 46287247 ns 8547713 ns 87 BM_atomic_wait_multi_thread_one_atomic_wait/131072 90315848 ns 13294492 ns 61 BM_atomic_wait_multi_thread_one_atomic_wait/262144 190722393 ns 16193917 ns 36 BM_atomic_wait_multi_thread_one_atomic_wait/524288 408456684 ns 23641600 ns 10 BM_atomic_wait_multi_thread_one_atomic_wait/1048576 708809670 ns 36361900 ns 10 BM_atomic_wait_multi_thread_wait_different_atomics/1024 2116444 ns 11669 ns 10000 BM_atomic_wait_multi_thread_wait_different_atomics/2048 12435259 ns 21905 ns 1000 BM_atomic_wait_multi_thread_wait_different_atomics/4096 6393816 ns 17819 ns 1000 BM_atomic_wait_multi_thread_wait_different_atomics/8192 11930400 ns 28637 ns 1000 BM_atomic_wait_multi_thread_wait_different_atomics/16384 20987224 ns 35272 ns 1000 BM_atomic_wait_multi_thread_wait_different_atomics/32768 44335820 ns 66660 ns 100 BM_atomic_wait_multi_thread_wait_different_atomics/65536 91395912 ns 129030 ns 100 BM_atomic_wait_multi_thread_wait_different_atomics/131072 145440007 ns 165960 ns 100 BM_atomic_wait_multi_thread_wait_different_atomics/262144 368219935 ns 420800 ns 10 BM_atomic_wait_multi_thread_wait_different_atomics/524288 630106863 ns 809500 ns 10 BM_atomic_wait_multi_thread_wait_different_atomics/1048576 1138174673 ns 1093000 ns 10 ``` On apple arm ``` Run on (8 X 24.1208 MHz CPU s) CPU Caches: L1 Data 64 KiB (x8) L1 Instruction 128 KiB (x8) L2 Unified 4096 KiB (x2) Load Average: 1.34, 1.58, 1.66 ----------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------------------------------------------------------- BM_atomic_wait_one_thread_one_atomic_wait/1024 61602 ns 61602 ns 8701 BM_atomic_wait_one_thread_one_atomic_wait/2048 123148 ns 123146 ns 5688 BM_atomic_wait_one_thread_one_atomic_wait/4096 246248 ns 246249 ns 2888 BM_atomic_wait_one_thread_one_atomic_wait/8192 480373 ns 480359 ns 1455 BM_atomic_wait_one_thread_one_atomic_wait/16384 974725 ns 974721 ns 724 BM_atomic_wait_one_thread_one_atomic_wait/32768 1922185 ns 1922115 ns 355 BM_atomic_wait_one_thread_one_atomic_wait/65536 3940632 ns 3940608 ns 181 BM_atomic_wait_one_thread_one_atomic_wait/131072 7886302 ns 7886102 ns 88 BM_atomic_wait_one_thread_one_atomic_wait/262144 15393156 ns 15393000 ns 45 BM_atomic_wait_one_thread_one_atomic_wait/524288 30833221 ns 30832174 ns 23 BM_atomic_wait_one_thread_one_atomic_wait/1048576 62551936 ns 62551909 ns 11 BM_atomic_wait_one_thread_one_atomic_wait/2097152 123155625 ns 123155667 ns 6 BM_atomic_wait_one_thread_one_atomic_wait/4194304 252468180 ns 252458667 ns 3 BM_atomic_wait_one_thread_one_atomic_wait/8388608 505075604 ns 505075500 ns 2 BM_atomic_wait_one_thread_one_atomic_wait/16777216 992977209 ns 992935000 ns 1 BM_atomic_wait_multi_thread_one_atomic_wait/1024 531411 ns 239695 ns 2783 BM_atomic_wait_multi_thread_one_atomic_wait/2048 1030592 ns 484868 ns 1413 BM_atomic_wait_multi_thread_one_atomic_wait/4096 1951896 ns 922357 ns 631 BM_atomic_wait_multi_thread_one_atomic_wait/8192 3759893 ns 1952074 ns 390 BM_atomic_wait_multi_thread_one_atomic_wait/16384 7417929 ns 3458309 ns 233 BM_atomic_wait_multi_thread_one_atomic_wait/32768 14386361 ns 5590830 ns 100 BM_atomic_wait_multi_thread_one_atomic_wait/65536 29725536 ns 6521887 ns 115 BM_atomic_wait_multi_thread_one_atomic_wait/131072 60023797 ns 10766795 ns 73 BM_atomic_wait_multi_thread_one_atomic_wait/262144 120782267 ns 17532091 ns 44 BM_atomic_wait_multi_thread_one_atomic_wait/524288 242539333 ns 27506920 ns 25 BM_atomic_wait_multi_thread_one_atomic_wait/1048576 482833787 ns 53721600 ns 10 BM_atomic_wait_multi_thread_wait_different_atomics/1024 2230048 ns 626042 ns 1000 BM_atomic_wait_multi_thread_wait_different_atomics/2048 3931958 ns 837540 ns 884 BM_atomic_wait_multi_thread_wait_different_atomics/4096 6506887 ns 1127922 ns 586 BM_atomic_wait_multi_thread_wait_different_atomics/8192 10528008 ns 1651254 ns 456 BM_atomic_wait_multi_thread_wait_different_atomics/16384 18055829 ns 2066379 ns 317 BM_atomic_wait_multi_thread_wait_different_atomics/32768 29878496 ns 2875600 ns 100 BM_atomic_wait_multi_thread_wait_different_atomics/65536 50523799 ns 3193170 ns 100 BM_atomic_wait_multi_thread_wait_different_atomics/131072 85926943 ns 4121950 ns 100 BM_atomic_wait_multi_thread_wait_different_atomics/262144 154602296 ns 5879050 ns 100 BM_atomic_wait_multi_thread_wait_different_atomics/524288 279121754 ns 10063400 ns 10 BM_atomic_wait_multi_thread_wait_different_atomics/1048576 522796900 ns 12370300 ns 10 ```
2024-02-04	Reapply "[libc++] Optimize vector growing of trivially relocatable types" ↵	Nikolas Klauser	2	-2/+18
	(#80558) This reapplies #76657. Non-trivial elements didn't get destroyed previously. This fixes the bug and adds tests for all the vector insertion functions.
2024-02-02	Revert "[libc++] Optimize vector growing of trivially relocatable types ↵	Kirill Stoimenov	2	-18/+2
	(#76657)" Broke sanitizer bots: https://lab.llvm.org/buildbot/#/builders/5/builds/40641 This reverts commit 67eee4a029797c09129889c3655416d1be487cfe.
2024-02-02	[libc++] Optimize vector growing of trivially relocatable types (#76657)	Nikolas Klauser	2	-2/+18
	This patch introduces a new trait to represent whether a type is trivially relocatable, and uses that trait to optimize the growth of a std::vector of trivially relocatable objects. ``` -------------------------------------------------- Benchmark old new -------------------------------------------------- bm_grow<int> 1354 ns 1301 ns bm_grow<std::string> 5584 ns 3370 ns bm_grow<std::unique_ptr<int>> 3506 ns 1994 ns bm_grow<std::deque<int>> 27114 ns 27209 ns ``` This also changes to order of moving and destroying the objects when growing the vector. This should not affect our conformance.
2024-01-22	[libc++abi] Implement __cxa_init_primary_exception and use it to optimize ↵	itrofimow	2	-0/+20
	std::make_exception_ptr (#65534) This patch implements __cxa_init_primary_exception, an extension to the Itanium C++ ABI. This extension is already present in both libsupc++ and libcxxrt. This patch also starts making use of this function in std::make_exception_ptr: instead of going through a full throw/catch cycle, we are now able to initialize an exception directly, thus making std::make_exception_ptr around 30x faster.
2023-12-19	[libc++] Implement ranges::contains (#65148)	ZijunZhaoCCK	2	-0/+50
	Differential Revision: https://reviews.llvm.org/D159232 ``` Running ./ranges_contains.libcxx.out Run on (10 X 24.121 MHz CPU s) CPU Caches: L1 Data 64 KiB (x10) L1 Instruction 128 KiB (x10) L2 Unified 4096 KiB (x5) Load Average: 3.37, 6.77, 5.27 -------------------------------------------------------------------- Benchmark Time CPU Iterations -------------------------------------------------------------------- bm_contains_char/16 1.88 ns 1.87 ns 371607095 bm_contains_char/256 7.48 ns 7.47 ns 93292285 bm_contains_char/4096 99.7 ns 99.6 ns 7013185 bm_contains_char/65536 1296 ns 1294 ns 540436 bm_contains_char/1048576 23887 ns 23860 ns 29302 bm_contains_char/16777216 389420 ns 389095 ns 1796 bm_contains_int/16 7.14 ns 7.14 ns 97776288 bm_contains_int/256 90.4 ns 90.3 ns 7558089 bm_contains_int/4096 1294 ns 1290 ns 543052 bm_contains_int/65536 20482 ns 20443 ns 34334 bm_contains_int/1048576 328817 ns 327965 ns 2147 bm_contains_int/16777216 5246279 ns 5239361 ns 133 bm_contains_bool/16 2.19 ns 2.19 ns 322565780 bm_contains_bool/256 3.42 ns 3.41 ns 205025467 bm_contains_bool/4096 22.1 ns 22.1 ns 31780479 bm_contains_bool/65536 333 ns 332 ns 2106606 bm_contains_bool/1048576 5126 ns 5119 ns 135901 bm_contains_bool/16777216 81656 ns 81574 ns 8569 ``` --------- Co-authored-by: Nathan Gauër <brioche@google.com>
2023-12-15	[libc++] Optimize std::find for segmented iterators (#67224)	Nikolas Klauser	1	-10/+21
	``` -------------------------------------------------------------------------- Benchmark old new -------------------------------------------------------------------------- bm_find<std::deque<char>>/1 6.06 ns 10.6 ns bm_find<std::deque<char>>/2 15.5 ns 10.6 ns bm_find<std::deque<char>>/3 19.0 ns 10.6 ns bm_find<std::deque<char>>/4 20.8 ns 10.6 ns bm_find<std::deque<char>>/5 22.0 ns 10.6 ns bm_find<std::deque<char>>/6 23.0 ns 10.5 ns bm_find<std::deque<char>>/7 24.8 ns 10.7 ns bm_find<std::deque<char>>/8 25.7 ns 10.6 ns bm_find<std::deque<char>>/16 28.3 ns 10.6 ns bm_find<std::deque<char>>/64 44.2 ns 27.0 ns bm_find<std::deque<char>>/512 133 ns 37.6 ns bm_find<std::deque<char>>/4096 867 ns 53.1 ns bm_find<std::deque<char>>/32768 6838 ns 160 ns bm_find<std::deque<char>>/262144 52897 ns 1495 ns bm_find<std::deque<char>>/1048576 215621 ns 6077 ns bm_find<std::deque<short>>/1 6.03 ns 6.28 ns bm_find<std::deque<short>>/2 15.8 ns 15.8 ns bm_find<std::deque<short>>/3 20.5 ns 20.3 ns bm_find<std::deque<short>>/4 21.0 ns 21.0 ns bm_find<std::deque<short>>/5 23.0 ns 22.1 ns bm_find<std::deque<short>>/6 22.6 ns 23.0 ns bm_find<std::deque<short>>/7 23.4 ns 23.7 ns bm_find<std::deque<short>>/8 24.4 ns 24.9 ns bm_find<std::deque<short>>/16 26.6 ns 27.2 ns bm_find<std::deque<short>>/64 43.2 ns 40.9 ns bm_find<std::deque<short>>/512 124 ns 90.7 ns bm_find<std::deque<short>>/4096 845 ns 525 ns bm_find<std::deque<short>>/32768 7273 ns 3194 ns bm_find<std::deque<short>>/262144 53710 ns 24385 ns bm_find<std::deque<short>>/1048576 216086 ns 96195 ns bm_find<std::deque<int>>/1 6.03 ns 10.3 ns bm_find<std::deque<int>>/2 15.6 ns 10.3 ns bm_find<std::deque<int>>/3 19.1 ns 10.3 ns bm_find<std::deque<int>>/4 22.3 ns 10.3 ns bm_find<std::deque<int>>/5 23.5 ns 10.4 ns bm_find<std::deque<int>>/6 23.1 ns 10.3 ns bm_find<std::deque<int>>/7 23.7 ns 10.2 ns bm_find<std::deque<int>>/8 24.5 ns 10.2 ns bm_find<std::deque<int>>/16 27.9 ns 26.6 ns bm_find<std::deque<int>>/64 42.6 ns 32.2 ns bm_find<std::deque<int>>/512 123 ns 43.0 ns bm_find<std::deque<int>>/4096 874 ns 93.5 ns bm_find<std::deque<int>>/32768 7031 ns 751 ns bm_find<std::deque<int>>/262144 57723 ns 6169 ns bm_find<std::deque<int>>/1048576 230867 ns 35851 ns bm_ranges_find<std::deque<char>>/1 5.97 ns 10.6 ns bm_ranges_find<std::deque<char>>/2 16.0 ns 10.5 ns bm_ranges_find<std::deque<char>>/3 19.5 ns 10.5 ns bm_ranges_find<std::deque<char>>/4 21.1 ns 10.6 ns bm_ranges_find<std::deque<char>>/5 22.8 ns 10.5 ns bm_ranges_find<std::deque<char>>/6 22.8 ns 10.6 ns bm_ranges_find<std::deque<char>>/7 23.4 ns 10.8 ns bm_ranges_find<std::deque<char>>/8 24.1 ns 10.5 ns bm_ranges_find<std::deque<char>>/16 26.9 ns 10.6 ns bm_ranges_find<std::deque<char>>/64 50.2 ns 27.2 ns bm_ranges_find<std::deque<char>>/512 126 ns 38.3 ns bm_ranges_find<std::deque<char>>/4096 868 ns 53.8 ns bm_ranges_find<std::deque<char>>/32768 6695 ns 161 ns bm_ranges_find<std::deque<char>>/262144 54411 ns 1497 ns bm_ranges_find<std::deque<char>>/1048576 241699 ns 6042 ns bm_ranges_find<std::deque<short>>/1 6.39 ns 6.31 ns bm_ranges_find<std::deque<short>>/2 15.8 ns 15.9 ns bm_ranges_find<std::deque<short>>/3 19.0 ns 19.8 ns bm_ranges_find<std::deque<short>>/4 20.8 ns 20.9 ns bm_ranges_find<std::deque<short>>/5 21.8 ns 22.1 ns bm_ranges_find<std::deque<short>>/6 23.0 ns 23.0 ns bm_ranges_find<std::deque<short>>/7 23.2 ns 23.9 ns bm_ranges_find<std::deque<short>>/8 23.7 ns 24.4 ns bm_ranges_find<std::deque<short>>/16 26.6 ns 26.8 ns bm_ranges_find<std::deque<short>>/64 43.4 ns 39.7 ns bm_ranges_find<std::deque<short>>/512 131 ns 90.5 ns bm_ranges_find<std::deque<short>>/4096 851 ns 523 ns bm_ranges_find<std::deque<short>>/32768 7370 ns 3166 ns bm_ranges_find<std::deque<short>>/262144 60778 ns 24814 ns bm_ranges_find<std::deque<short>>/1048576 229288 ns 99273 ns bm_ranges_find<std::deque<int>>/1 6.43 ns 10.2 ns bm_ranges_find<std::deque<int>>/2 16.6 ns 10.2 ns bm_ranges_find<std::deque<int>>/3 19.6 ns 10.2 ns bm_ranges_find<std::deque<int>>/4 21.0 ns 10.2 ns bm_ranges_find<std::deque<int>>/5 21.9 ns 10.4 ns bm_ranges_find<std::deque<int>>/6 22.7 ns 10.2 ns bm_ranges_find<std::deque<int>>/7 23.9 ns 10.2 ns bm_ranges_find<std::deque<int>>/8 23.8 ns 10.2 ns bm_ranges_find<std::deque<int>>/16 27.2 ns 27.1 ns bm_ranges_find<std::deque<int>>/64 42.4 ns 32.4 ns bm_ranges_find<std::deque<int>>/512 122 ns 43.0 ns bm_ranges_find<std::deque<int>>/4096 895 ns 93.7 ns bm_ranges_find<std::deque<int>>/32768 6890 ns 756 ns bm_ranges_find<std::deque<int>>/262144 54025 ns 6102 ns bm_ranges_find<std::deque<int>>/1048576 221558 ns 32783 ns ```
2023-12-02	[libc++][test] add more benchmarks for `stop_token` (#69590)	Hui	1	-3/+78

2023-11-29	[libc++] Speed up classic locale (take 2) (#73533)	Louis Dionne	1	-3/+60
	Locale objects use atomic reference counting, which may be very expensive in parallel applications. The classic locale is used by default by all streams and can be very contended. But it's never destroyed, so the reference counting is also completely pointless on the classic locale. Currently ~70% of time in the parallel stringstream benchmarks is spent in locale ctor/dtor. And the execution radically slows down with more threads. Avoid reference counting on the classic locale. With this change parallel benchmarks start to scale with threads. This is a re-application of f8afc53d641c (aka PR #72112) which was reverted in 4e0c48b907f1 because it broke the sanitizer builds due to an initialization order fiasco. This issue has now been fixed by ensuring that the locale is constinit'ed. Co-authored-by: Dmitry Vyukov <dvyukov@google.com>
2023-11-28	[libc++] Properly guard std::filesystem with >= C++17 (#72701)	Louis Dionne	1	-1/+4
	<filesystem> is a C++17 addition. In C++11 and C++14 modes, we actually have all the code for <filesystem> but it is hidden behind a non-inline namespace __fs so it is not accessible. Instead of doing this unusual dance, just guard the code for filesystem behind a classic C++17 check like we normally do.
2023-11-27	Revert "[libc++] Speed up classic locale (#72112)"	Kirill Stoimenov	1	-60/+3
	Looks like it broke the ASAN build: https://lab.llvm.org/buildbot/#/builders/168/builds/17053/steps/9/logs/stdio This reverts commit f8afc53d641ce9d4ad8565aae9e7b5911b572a02.
2023-11-27	[libc++] Speed up classic locale (#72112)	Dmitry Vyukov	1	-3/+60
	Locale objects use atomic reference counting, which may be very expensive in parallel applications. The classic locale is used by default by all streams and can be very contended. But it's never destroyed, so the reference counting is also completely pointless on the classic locale. Currently ~70% of time in the parallel stringstream benchmarks is spent in locale ctor/dtor. And the execution radically slows down with more threads. Avoid reference counting on the classic locale. With this change parallel benchmarks start to scale with threads. Co-authored-by: Louis Dionne <ldionne.2@gmail.com> ``` │ baseline │ optimized │ │ sec/op │ sec/op vs base │ Istream_numbers/0/threads:1 4.672µ ± 0% 4.419µ ± 0% -5.42% (p=0.000 n=30+39) Istream_numbers/0/threads:72 539.817µ ± 0% 9.842µ ± 1% -98.18% (p=0.000 n=30+40) Istream_numbers/1/threads:1 4.890µ ± 0% 4.750µ ± 0% -2.85% (p=0.000 n=30+40) Istream_numbers/1/threads:72 66.44µ ± 1% 10.14µ ± 1% -84.74% (p=0.000 n=30+40) Istream_numbers/2/threads:1 4.888µ ± 0% 4.746µ ± 0% -2.92% (p=0.000 n=30+40) Istream_numbers/2/threads:72 494.8µ ± 0% 410.2µ ± 1% -17.11% (p=0.000 n=30+40) Istream_numbers/3/threads:1 4.697µ ± 0% 4.695µ ± 5% ~ (p=0.391 n=30+37) Istream_numbers/3/threads:72 421.5µ ± 7% 421.9µ ± 9% ~ (p=0.665 n=30) Ostream_number/0/threads:1 183.0n ± 0% 141.0n ± 2% -22.95% (p=0.000 n=30) Ostream_number/0/threads:72 24196.5n ± 1% 343.5n ± 3% -98.58% (p=0.000 n=30) Ostream_number/1/threads:1 250.0n ± 0% 196.0n ± 2% -21.60% (p=0.000 n=30) Ostream_number/1/threads:72 16260.5n ± 0% 407.0n ± 2% -97.50% (p=0.000 n=30) Ostream_number/2/threads:1 254.0n ± 0% 196.0n ± 1% -22.83% (p=0.000 n=30) Ostream_number/2/threads:72 28.49µ ± 1% 18.89µ ± 5% -33.72% (p=0.000 n=30) Ostream_number/3/threads:1 185.0n ± 0% 185.0n ± 0% 0.00% (p=0.017 n=30) Ostream_number/3/threads:72 19.38µ ± 4% 19.33µ ± 5% ~ (p=0.425 n=30) ```
2023-11-14	[libc++] Optimize for_each for segmented iterators	Nikolas Klauser	2	-0/+24
	``` --------------------------------------------------- Benchmark old new --------------------------------------------------- bm_for_each/1 3.00 ns 2.98 ns bm_for_each/2 4.53 ns 4.57 ns bm_for_each/3 5.82 ns 5.82 ns bm_for_each/4 6.94 ns 6.91 ns bm_for_each/5 7.55 ns 7.75 ns bm_for_each/6 7.06 ns 7.45 ns bm_for_each/7 6.69 ns 7.14 ns bm_for_each/8 6.86 ns 4.06 ns bm_for_each/16 11.5 ns 5.73 ns bm_for_each/64 43.7 ns 4.06 ns bm_for_each/512 356 ns 7.98 ns bm_for_each/4096 2787 ns 53.6 ns bm_for_each/32768 20836 ns 438 ns bm_for_each/262144 195362 ns 4945 ns bm_for_each/1048576 685482 ns 19822 ns ``` Reviewed By: ldionne, Mordante, #libc Spies: bgraur, sberg, arichardson, libcxx-commits Differential Revision: https://reviews.llvm.org/D151274
2023-10-16	[libc++][test] Add `stop_token` benchmark (#69117)	Hui	2	-0/+109
	This is transforming the `stop_token` benchmark that Lewis Baker had created into Google Bench https://reviews.llvm.org/D154702
2023-10-13	[libc++] Use -nostdlib++ on GCC unconditionally (#68832)	Louis Dionne	1	-1/+1
	We support GCC 13, which supports the flag. This allows simplifying the CMake logic around the use of -nostdlib++. Note that there are other places where we don't assume -nostdlib++ yet in the build, but this patch is intentionally trying to be small because this part of our CMake is pretty tricky.
2023-10-06	[libc++] Optimize ranges::count for __bit_iterators	Nikolas Klauser	2	-0/+36
	``` --------------------------------------------------------------- Benchmark old new --------------------------------------------------------------- bm_vector_bool_count/1 1.92 ns 1.92 ns bm_vector_bool_count/2 1.92 ns 1.92 ns bm_vector_bool_count/3 1.92 ns 1.92 ns bm_vector_bool_count/4 1.92 ns 1.92 ns bm_vector_bool_count/5 1.92 ns 1.92 ns bm_vector_bool_count/6 1.92 ns 1.92 ns bm_vector_bool_count/7 1.92 ns 1.92 ns bm_vector_bool_count/8 1.92 ns 1.92 ns bm_vector_bool_count/16 1.92 ns 1.92 ns bm_vector_bool_count/64 2.24 ns 2.25 ns bm_vector_bool_count/512 3.19 ns 3.20 ns bm_vector_bool_count/4096 14.1 ns 12.3 ns bm_vector_bool_count/32768 84.0 ns 83.6 ns bm_vector_bool_count/262144 664 ns 661 ns bm_vector_bool_count/1048576 2623 ns 2628 ns bm_vector_bool_ranges_count/1 1.07 ns 1.92 ns bm_vector_bool_ranges_count/2 1.65 ns 1.92 ns bm_vector_bool_ranges_count/3 2.27 ns 1.92 ns bm_vector_bool_ranges_count/4 2.68 ns 1.92 ns bm_vector_bool_ranges_count/5 3.33 ns 1.92 ns bm_vector_bool_ranges_count/6 3.99 ns 1.92 ns bm_vector_bool_ranges_count/7 4.67 ns 1.92 ns bm_vector_bool_ranges_count/8 5.19 ns 1.92 ns bm_vector_bool_ranges_count/16 11.1 ns 1.92 ns bm_vector_bool_ranges_count/64 52.2 ns 2.24 ns bm_vector_bool_ranges_count/512 452 ns 3.20 ns bm_vector_bool_ranges_count/4096 3577 ns 12.1 ns bm_vector_bool_ranges_count/32768 28725 ns 83.7 ns bm_vector_bool_ranges_count/262144 229676 ns 662 ns bm_vector_bool_ranges_count/1048576 905574 ns 2625 ns ``` Reviewed By: #libc, ldionne Spies: arichardson, ldionne, libcxx-commits Differential Revision: https://reviews.llvm.org/D156956
2023-10-02	[libc++] Optimize vector push_back to avoid continuous load and store of end ↵	Martijn Vels	2	-0/+15
	pointer Credits: this change is based on analysis and a proof of concept by gerbens@google.com. Before, the compiler loses track of end as 'this' and other references possibly escape beyond the compiler's scope. This can be see in the generated assembly: 16.28 │200c80: mov %r15d,(%rax) 60.87 │200c83: add $0x4,%rax │200c87: mov %rax,-0x38(%rbp) 0.03 │200c8b: → jmpq 200d4e ... ... 1.69 │200d4e: cmp %r15d,%r12d │200d51: → je 200c40 16.34 │200d57: inc %r15d 0.05 │200d5a: mov -0x38(%rbp),%rax 3.27 │200d5e: mov -0x30(%rbp),%r13 1.47 │200d62: cmp %r13,%rax │200d65: → jne 200c80 We fix this by always explicitly storing the loaded local and pointer back at the end of push back. This generates some slight source 'noise', but creates nice and compact fast path code, i.e.: 32.64 │200760: mov %r14d,(%r12) 9.97 │200764: add $0x4,%r12 6.97 │200768: mov %r12,-0x38(%rbp) 32.17 │20076c: add $0x1,%r14d 2.36 │200770: cmp %r14d,%ebx │200773: → je 200730 8.98 │200775: mov -0x30(%rbp),%r13 6.75 │200779: cmp %r13,%r12 │20077c: → jne 200760 Now there is a single store for the push_back value (as before), and a single store for the end without a reload (dependency). For fully local vectors, (i.e., not referenced elsewhere), the capacity load and store inside the loop could also be removed, but this requires more substantial refactoring inside vector. Differential Revision: https://reviews.llvm.org/D80588
2023-09-29	Reapply "[libc++][ranges] Add benchmarks for the `from_range` constructors ↵	Konstantin Varlamov	4	-1/+49
	of `vector` and `deque`." (#67753) This reverts commit 10edd5d9436153ace82009a04900ac67d3adc202 and guards against older versions of GCC to work around the problem.
2023-09-25	Revert "[libc++][ranges] Add benchmarks for the `from_range` constructors of ↵	Aaron Ballman	4	-42/+1
	`vector` and `deque`." This reverts commit 390ac823178fc1073612b4c8a38835f441138d9d. It broke the sphinx publish bots for our documentation: https://lab.llvm.org/buildbot/#/builders/242/builds/1130 because that machine has GCC 9.4.0 which does not know about C++23
2023-09-21	[AIX] cxx_std_23 is currently not a known feature to ibm-clang (#66952)	Jake Egan	1	-0/+4
	Temporary workaround for the following CMake error: ``` CMake Error in /llvm/libcxx/benchmarks/CMakeLists.txt: The compiler feature "cxx_std_23" is not known to CXX compiler "IBMClang" ```
2023-09-21	[libc++] Add test coverage for unordered containers comparison (#66692)	Louis Dionne	2	-0/+50
	This patch is a melting pot of changes picked up from https://llvm.org/D61878. It adds a few tests checking corner cases of unordered containers comparison and adds benchmarks for a few unordered_set operations.
2023-09-18	[libc++] Implement ranges::ends_with	Zijun Zhao	2	-0/+108
	Reviewed By: #libc, var-const Differential Revision: https://reviews.llvm.org/D150831
2023-09-13	[libc++] Fix minor warnings in libcxx benchmarks	Dmitry Ilvokhin	2	-3/+3
	This fixes some unused variable and "comparison of integers of different signs" warnings in the benchmarks. Differential Revision: https://reviews.llvm.org/D156613
2023-09-12	[libc++][NFC] Remove stray #if 1 that was probably a debugging leftover	Louis Dionne	1	-2/+0

2023-09-08	[libc++abi] Refactor around __dynamic_cast	Sirui Mu	2	-0/+78
	This commit contains refactorings around __dynamic_cast without changing its behavior. Some important changes include: - Refactor __dynamic_cast into various small helper functions; - Move dynamic_cast_stress.pass.cpp to libcxx/benchmarks and refactor it into a benchmark. The benchmark performance numbers are updated as well. Differential Revision: https://reviews.llvm.org/D138006
2023-09-05	[libc++][ranges] Add benchmarks for the `from_range` constructors of ↵	varconst	4	-1/+38
	`vector` and `deque`. Differential Revision: https://reviews.llvm.org/D150747
2023-08-15	[libc++][PSTL] Add a __parallel_sort implementation to libdispatch	Nikolas Klauser	3	-1/+40
	Reviewed By: #libc, ldionne Spies: ldionne, libcxx-commits Differential Revision: https://reviews.llvm.org/D155136
2023-08-11	[libc++] Optimize internal function in <system_error>	Edoardo Sanguineti	2	-0/+31
	In the event the internal function __init is called with an empty string the code will take unnecessary extra steps, in addition, the code generated might be overall greater because, to my understanding, when initializing a string with an empty `const char*` "" (like in this case), the compiler might be unable to deduce the string is indeed empty at compile time and more code is generated. The goal of this patch is to make a new internal function that will accept just an error code skipping the empty string argument. It should skip the unnecessary steps and in the event `if (ec)` is `false`, it will return an empty string using the correct ctor, avoiding any extra code generation issues. After the conversation about this patch matured in the libcxx channel on the LLVM Discord server, the patch was analyzed quickly with "Compiler Explorer" and other tools and it was discovered that it does indeed reduce the amount of code generated when using the latest stable clang version (16) which in turn produces faster code. This patch targets LLVM 18 as it will break the ABI by addressing https://github.com/llvm/llvm-project/issues/63985 Benchmark tests run on other machines as well show in the best case, that the new version without the extra string as an argument performs 10 times faster. On the buildkite CI run it shows the new code takes less CPU time as well. In conclusion, the new code should also just appear cleaner because there are fewer checks to do when there is no message. Reviewed By: #libc, philnik Spies: emaste, nemanjai, philnik, libcxx-commits Differential Revision: https://reviews.llvm.org/D155820
2023-08-01	[libc++] Optimize ranges::find for vector<bool>	Nikolas Klauser	1	-0/+28
	Benchmark results: ``` ---------------------------------------------------------------- Benchmark old new ---------------------------------------------------------------- bm_vector_bool_ranges_find/1 5.64 ns 6.08 ns bm_vector_bool_ranges_find/2 16.5 ns 6.03 ns bm_vector_bool_ranges_find/3 20.3 ns 6.07 ns bm_vector_bool_ranges_find/4 22.2 ns 6.08 ns bm_vector_bool_ranges_find/5 23.5 ns 6.05 ns bm_vector_bool_ranges_find/6 24.4 ns 6.10 ns bm_vector_bool_ranges_find/7 26.7 ns 6.10 ns bm_vector_bool_ranges_find/8 25.0 ns 6.08 ns bm_vector_bool_ranges_find/16 27.9 ns 6.07 ns bm_vector_bool_ranges_find/64 44.5 ns 5.35 ns bm_vector_bool_ranges_find/512 243 ns 25.7 ns bm_vector_bool_ranges_find/4096 1858 ns 35.6 ns bm_vector_bool_ranges_find/32768 15461 ns 93.5 ns bm_vector_bool_ranges_find/262144 126462 ns 571 ns bm_vector_bool_ranges_find/1048576 497736 ns 2272 ns ``` Reviewed By: #libc, Mordante Spies: var-const, Mordante, libcxx-commits Differential Revision: https://reviews.llvm.org/D156039
2023-07-04	Use hash value checks optimizations consistently	Dmitry Ilvokhin	2	-0/+30
	There are couple of optimizations of `__hash_table::find` which are applicable to other places like `__hash_table::__node_insert_unique_prepare` and `__hash_table::__emplace_unique_key_args`. ``` for (__nd = __nd->__next_; __nd != nullptr && (__nd->__hash() == __hash // ^^^^^^^^^^^^^^^^^^^^^^ // (1) \|\| std::__constrain_hash(__nd->__hash(), __bc) == __chash); __nd = __nd->__next_) { if ((__nd->__hash() == __hash) // ^^^^^^^^^^^^^^^^^^^^^^^^^^ // (2) && key_eq()(__nd->__upcast()->__value_, __k)) return iterator(__nd, this); } ``` (1): We can avoid expensive modulo operations from `std::__constrain_hash` if hashes matched. This one is from commit 6a411472e3c4. (2): We can avoid `key_eq` calls if hashes didn't match. Commit: 318d35a7bca6c4e5. Both of them are applicable for insert and emplace methods. Results of unordered_set_operations benchmark: ``` Comparing /tmp/main to /tmp/hashtable-hash-value-optimization Benchmark Time CPU Time Old Time New CPU Old CPU New ------------------------------------------------------------------------------------------------------------------------------------------------------ BM_Hash/uint32_random_std_hash/1024 -0.0127 -0.0127 1511 1492 1508 1489 BM_Hash/uint32_random_custom_hash/1024 +0.0012 +0.0013 1370 1371 1367 1369 BM_Hash/uint32_top_std_hash/1024 -0.0027 -0.0028 1502 1497 1498 1494 BM_Hash/uint32_top_custom_hash/1024 +0.0033 +0.0032 1368 1373 1365 1370 BM_InsertValue/unordered_set_uint32/1024 +0.0267 +0.0266 36421 37392 36350 37318 BM_InsertValue/unordered_set_uint32_sorted/1024 +0.0230 +0.0229 28247 28897 28193 28837 BM_InsertValue/unordered_set_top_bits_uint32/1024 +0.0492 +0.0491 31012 32539 30952 32472 BM_InsertValueRehash/unordered_set_top_bits_uint32/1024 +0.0523 +0.0520 62905 66197 62780 66043 BM_InsertValue/unordered_set_string/1024 -0.0252 -0.0253 300762 293189 299805 292221 BM_InsertValueRehash/unordered_set_string/1024 -0.0932 -0.0920 332924 301882 331276 300810 BM_InsertValue/unordered_set_prefixed_string/1024 -0.0578 -0.0577 314239 296072 313222 295137 BM_InsertValueRehash/unordered_set_prefixed_string/1024 -0.0986 -0.0985 336091 302950 334982 301995 BM_Find/unordered_set_random_uint64/1024 -0.1416 -0.1417 16075 13798 16041 13769 BM_FindRehash/unordered_set_random_uint64/1024 -0.0105 -0.0105 5900 5838 5889 5827 BM_Find/unordered_set_sorted_uint64/1024 +0.0014 +0.0014 2813 2817 2807 2811 BM_FindRehash/unordered_set_sorted_uint64/1024 -0.0247 -0.0249 5863 5718 5851 5706 BM_Find/unordered_set_sorted_uint128/1024 +0.0113 +0.0112 15570 15746 15539 15713 BM_FindRehash/unordered_set_sorted_uint128/1024 +0.0438 +0.0441 6917 7220 6902 7206 BM_Find/unordered_set_sorted_uint32/1024 -0.0020 -0.0020 3098 3091 3092 3085 BM_FindRehash/unordered_set_sorted_uint32/1024 +0.0570 +0.0569 5377 5684 5368 5673 BM_Find/unordered_set_sorted_large_uint64/1024 +0.0081 +0.0081 3594 3623 3587 3616 BM_FindRehash/unordered_set_sorted_large_uint64/1024 -0.0542 -0.0540 6154 5820 6140 5808 BM_Find/unordered_set_top_bits_uint64/1024 -0.0061 -0.0061 10440 10377 10417 10353 BM_FindRehash/unordered_set_top_bits_uint64/1024 +0.0131 +0.0128 5852 5928 5840 5914 BM_Find/unordered_set_string/1024 -0.0352 -0.0349 189037 182384 188389 181809 BM_FindRehash/unordered_set_string/1024 -0.0309 -0.0311 180718 175142 180141 174532 BM_Find/unordered_set_prefixed_string/1024 -0.0559 -0.0557 190853 180177 190251 179659 BM_FindRehash/unordered_set_prefixed_string/1024 -0.0563 -0.0561 182396 172136 181797 171602 BM_Rehash/unordered_set_string_arg/1024 -0.0244 -0.0241 27052 26393 26989 26339 BM_Rehash/unordered_set_int_arg/1024 -0.0410 -0.0410 19582 18779 19539 18738 BM_InsertDuplicate/unordered_set_int/1024 +0.0023 +0.0025 12168 12196 12142 12173 BM_InsertDuplicate/unordered_set_string/1024 -0.0505 -0.0504 189238 179683 188648 179133 BM_InsertDuplicate/unordered_set_prefixed_string/1024 -0.0989 -0.0987 198893 179222 198263 178702 BM_EmplaceDuplicate/unordered_set_int/1024 -0.0175 -0.0173 12674 12452 12646 12427 BM_EmplaceDuplicate/unordered_set_string/1024 -0.0559 -0.0557 190104 179481 189492 178934 BM_EmplaceDuplicate/unordered_set_prefixed_string/1024 -0.1111 -0.1110 201233 178870 200608 178341 BM_InsertDuplicate/unordered_set_int_insert_arg/1024 -0.0747 -0.0745 12993 12022 12964 11997 BM_InsertDuplicate/unordered_set_string_insert_arg/1024 -0.0584 -0.0583 191489 180311 190864 179731 BM_EmplaceDuplicate/unordered_set_int_insert_arg/1024 -0.0807 -0.0804 35946 33047 35866 32982 BM_EmplaceDuplicate/unordered_set_string_arg/1024 -0.0312 -0.0310 321623 311601 320559 310637 OVERALL_GEOMEAN -0.0276 -0.0275 0 0 0 0 ``` Time differences looks more like noise to me. But if we want to have this optimizations in `find`, we probably want them in `insert` and `emplace` as well. Reviewed By: #libc, Mordante Differential Revision: https://reviews.llvm.org/D140779
2023-06-19	[libc++][NFC] Apply clang-format on large parts of the code base	Louis Dionne	37	-1052/+744
	This commit does a pass of clang-format over files in libc++ that don't require major changes to conform to our style guide, or for which we're not overly concerned about conflicting with in-flight patches or hindering the git blame. This roughly covers: - benchmarks - range algorithms - concepts - type traits I did a manual verification of all the changes, and in particular I applied clang-format on/off annotations in a few places where the result was less readable after than before. This was not necessary in a lot of places, however I did find that clang-format had pretty bad taste when it comes to formatting concepts. Differential Revision: https://reviews.llvm.org/D153140
2023-06-06	[libc++] Removes CMake work-arounds.	Mark de Wever	1	-5/+0
	CMake older than 3.20.0 is no longer supported. This removes work-arounds for no longer supported versions. Reviewed By: #libc, jloser, philnik Differential Revision: https://reviews.llvm.org/D152099
2023-06-05	Revert "[libc++] Optimize for_each for segmented iterators"	Nikolas Klauser	2	-24/+0
	This reverts commit b1dc43aa3a05c2f14725e2e6428544208ccbe161.
2023-05-31	[libc++] Optimize for_each for segmented iterators	Nikolas Klauser	2	-0/+24
	``` --------------------------------------------------- Benchmark old new --------------------------------------------------- bm_for_each/1 3.00 ns 2.98 ns bm_for_each/2 4.53 ns 4.57 ns bm_for_each/3 5.82 ns 5.82 ns bm_for_each/4 6.94 ns 6.91 ns bm_for_each/5 7.55 ns 7.75 ns bm_for_each/6 7.06 ns 7.45 ns bm_for_each/7 6.69 ns 7.14 ns bm_for_each/8 6.86 ns 4.06 ns bm_for_each/16 11.5 ns 5.73 ns bm_for_each/64 43.7 ns 4.06 ns bm_for_each/512 356 ns 7.98 ns bm_for_each/4096 2787 ns 53.6 ns bm_for_each/32768 20836 ns 438 ns bm_for_each/262144 195362 ns 4945 ns bm_for_each/1048576 685482 ns 19822 ns ``` Reviewed By: ldionne, Mordante, #libc Spies: arichardson, libcxx-commits Differential Revision: https://reviews.llvm.org/D151274
2023-05-25	[libc++] Forward to std::{,w}memchr in std::find	Nikolas Klauser	2	-0/+50
	Reviewed By: #libc, ldionne Spies: Mordante, libcxx-commits, ldionne, mikhail.ramalho Differential Revision: https://reviews.llvm.org/D144394
2023-05-25	[NFC][Py Reformat] Reformat python files in libcxx/libcxxabi	Tobias Hieta	1	-7/+7
	This is an ongoing series of commits that are reformatting our Python code. Reformatting is done with `black`. If you end up having problems merging this commit because you have made changes to a python file, the best way to handle that is to run git checkout --ours <yourfile> and then reformat it with black. If you run into any problems, post to discourse about it and we will try to help. RFC Thread below: https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style Reviewed By: #libc, kwk, Mordante Differential Revision: https://reviews.llvm.org/D150763
2023-05-24	[libc++, std::vector] call the optimized version of ↵	AdityaK	2	-0/+30
	__uninitialized_allocator_copy for trivial types See: https://github.com/llvm/llvm-project/issues/61987 Fix suggested by: @philnik and @var-const Reviewers: philnik, ldionne, EricWF, var-const Differential Revision: https://reviews.llvm.org/D147741 Testing: ninja check-cxx check-clang check-llvm Benchmark Testcases (BM_CopyConstruct, and BM_Assignment) added. performance improvement: Run on (8 X 4800 MHz CPU s) CPU Caches: L1 Data 48 KiB (x4) L1 Instruction 32 KiB (x4) L2 Unified 1280 KiB (x4) L3 Unified 12288 KiB (x1) Load Average: 1.66, 3.02, 2.43 Comparing build-runtimes-base/libcxx/benchmarks/vector_operations.libcxx.out to build-runtimes/libcxx/benchmarks/vector_operations.libcxx.out Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------------- BM_ConstructSize/vector_byte/5140480 +0.0362 +0.0362 116906 121132 116902 121131 BM_CopyConstruct/vector_int/5140480 -0.4563 -0.4577 1755224 954241 1755330 951987 BM_Assignment/vector_int/5140480 -0.0222 -0.0220 990045 968095 989917 968125 BM_ConstructSizeValue/vector_byte/5140480 +0.0308 +0.0307 116970 120567 116977 120573 BM_ConstructIterIter/vector_char/1024 -0.0831 -0.0831 19 17 19 17 BM_ConstructIterIter/vector_size_t/1024 +0.0129 +0.0131 88 89 88 89 BM_ConstructIterIter/vector_string/1024 -0.0064 -0.0018 54455 54109 54208 54112 OVERALL_GEOMEAN -0.0845 -0.0842 0 0 0 0 FYI, the perf improvements for BM_CopyConstruct due to this patch is mostly subsumed by the https://reviews.llvm.org/D149826. However this patch still adds value by converting copy to memmove (the second testcase). Before the patch: ``` define linkonce_odr dso_local void @_ZNSt3__16vectorIiNS_9allocatorIiEEE18__construct_at_endIPiS5_EEvT_T0_m(ptr noundef nonnull align 8 dereferenceable(24) %0, ptr noundef %1, ptr noundef %2, i64 noundef %3) local_unnamed_addr #4 comdat align 2 { %5 = getelementptr inbounds %"class.std::__1::vector", ptr %0, i64 0, i32 1 %6 = load ptr, ptr %5, align 8, !tbaa !12 %7 = icmp eq ptr %1, %2 br i1 %7, label %16, label %8 8: ; preds = %4, %8 %9 = phi ptr [ %13, %8 ], [ %1, %4 ] %10 = phi ptr [ %14, %8 ], [ %6, %4 ] %11 = icmp ne ptr %10, null tail call void @llvm.assume(i1 %11) %12 = load i32, ptr %9, align 4, !tbaa !14 store i32 %12, ptr %10, align 4, !tbaa !14 %13 = getelementptr inbounds i32, ptr %9, i64 1 %14 = getelementptr inbounds i32, ptr %10, i64 1 %15 = icmp eq ptr %13, %2 br i1 %15, label %16, label %8, !llvm.loop !16 16: ; preds = %8, %4 %17 = phi ptr [ %6, %4 ], [ %14, %8 ] store ptr %17, ptr %5, align 8, !tbaa !12 ret void } ``` After the patch: ``` define linkonce_odr dso_local void @_ZNSt3__16vectorIiNS_9allocatorIiEEE18__construct_at_endIPiS5_EEvT_T0_m(ptr noundef nonnull align 8 dereferenceable(24) %0, ptr noundef %1, ptr noundef %2, i64 noundef %3) local_unnamed_addr #4 comdat align 2 { %5 = getelementptr inbounds %"class.std::__1::vector", ptr %0, i64 0, i32 1 %6 = load ptr, ptr %5, align 8, !tbaa !12 %7 = ptrtoint ptr %2 to i64 %8 = ptrtoint ptr %1 to i64 %9 = sub i64 %7, %8 %10 = ashr exact i64 %9, 2 tail call void @llvm.memmove.p0.p0.i64(ptr align 4 %6, ptr align 4 %1, i64 %9, i1 false) %11 = getelementptr inbounds i32, ptr %6, i64 %10 store ptr %11, ptr %5, align 8, !tbaa !12 ret void } ``` This is due to the optimized version of uninitialized_allocator_copy function.
2023-03-19	[libc++abi] Improve performance of __dynamic_cast	Sirui Mu	2	-0/+173
	The original `__dynamic_cast` implementation does not use the ABI-provided `src2dst_offset` parameter which helps improve performance on the hot paths. This patch improves the performance on the hot paths in `__dynamic_cast` by leveraging hints provided by the `src2dst_offset` parameter. This patch also includes a performance benchmark suite for the `__dynamic_cast` implementation. Reviewed By: philnik, ldionne, #libc, #libc_abi, avogelsgesang Spies: mikhail.ramalho, avogelsgesang, xingxue, libcxx-commits Differential Revision: https://reviews.llvm.org/D138005
2023-03-11	[libc++] Optimize std::ranges::{min, max} for types that are cheap to copy	Nikolas Klauser	2	-0/+71
	Don't forward to `min_element` for small types that are trivially copyable, and instead use a naive loop that keeps track of the smallest element (as opposed to an iterator to the smallest element). This allows the compiler to vectorize the loop in some cases. Reviewed By: #libc, ldionne Spies: ldionne, libcxx-commits Differential Revision: https://reviews.llvm.org/D143596
2023-02-21	[libc++] Forward to std::memcmp for trivially comparable types in equal	Nikolas Klauser	2	-1/+48
	Reviewed By: #libc, ldionne Spies: ldionne, Mordante, libcxx-commits Differential Revision: https://reviews.llvm.org/D139554