Age | Commit message (Collapse) | Author | Files | Lines |
|
Use tailcalls to avoid the overhead of a frame on the free fastpath.
Move tcache initialization to _int_free_chunk(). Add malloc_printerr_tail()
which can be tailcalled without forcing a frame like no-return functions.
Change tcache_double_free_verify() to retry via __libc_free() after clearing
the key.
Reviewed-by: Florian Weimer <fweimer@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Inline tcache_free since it's only used by __libc_free. Add __glibc_likely
for the tcache checks.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
The checks on size can be merged and use __builtin_add_overflow. Since
tcache only handles small sizes (and rejects sizes < MINSIZE), delay this
check until after tcache.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Inline _int_free_check since it is only used by __libc_free.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Inline _int_free since it is a small function and only really used by
__libc_free.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Currently __libc_free checks for a freed mmap chunk in the fast path.
Also errno is always saved and restored to preserve it. Since mmap chunks
are larger than the largest tcache chunk, it is safe to delay this and
handle tcache, smallbin and medium bin blocks first. Move saving of errno
to cases that actually need it. Remove a safety check that fails on mmap
chunks and a check that mmap chunks cannot be added to tcache.
Performance of bench-malloc-thread improves by 9.2% for 1 thread and
6.9% for 32 threads on Neoverse V2.
Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Improve performance of __libc_malloc by splitting it into 2 parts: first handle
the tcache fastpath, then do the rest in a separate tailcalled function.
This results in significant performance gains since __libc_malloc doesn't need
to setup a frame and we delay tcache initialization and setting of errno until
later.
On Neoverse V2, bench-malloc-simple improves by 6.7% overall (up to 8.5% for
ST case) and bench-malloc-thread improves by 20.3% for 1 thread and 14.4% for
32 threads.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
Use __always_inline for small helper functions that are critical for
performance. This ensures inlining always happens when expected.
Performance of bench-malloc-simple improves by 0.6% on average on
Neoverse V2.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
When splitting a chunk, release the tail part by calling int_free_chunk.
This avoids inserting random blocks into tcache that were never requested
by the user. Fragmentation will be worse if they are never used again.
Note if the tail is fairly small, we could avoid splitting it at all.
Also remove an oddly placed initialization of tcache in _libc_realloc.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
_mid_memalign includes tcache code but does not attempt to initialize
tcaches.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
Remove the alignment rounding up from csize2tidx - this makes no sense
since the input should be a chunk size. Removing it enables further
optimizations, for example chunksize_nomask can be safely used and
invalid sizes < MINSIZE are not mapped to a valid tidx.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Change heap_max_size() to improve performance of arena_for_chunk().
Instead of a complex calculation, using a simple mask operation to get the
arena base pointer. HEAP_MAX_SIZE should be larger than the huge page size,
otherwise heaps will use not huge pages.
On AArch64 this removes 6 instructions from arena_for_chunk(), and
bench-malloc-thread improves by 1.1% - 1.8%.
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
If attacker overwrites the bk_nextsize link in the first chunk of a
largebin that later has a smaller chunk inserted into it, malloc will
write a heap pointer into an attacker-controlled address [0].
This patch adds an integrity check to mitigate this attack.
[0]: https://github.com/shellphish/how2heap/blob/master/glibc_2.39/large_bin_attack.c
Signed-off-by: Ben Kallus <benjamin.p.kallus.gr@dartmouth.edu>
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
By overwriting a forward link in a fastbin chunk that is subsequently
moved into the tcache, it's possible to get malloc to return an
arbitrary address [0].
When a chunk is fetched from a fastbin, its size is checked against the
expected chunk size for that fastbin (see malloc.c:3991). This patch
adds a similar check for chunks being moved from a fastbin to tcache,
which renders obsolete the exploitation technique described above.
Now updated to use __glibc_unlikely instead of __builtin_expect, as
requested.
[0]: https://github.com/shellphish/how2heap/blob/master/glibc_2.39/fastbin_reverse_into_tcache.c
Signed-off-by: Ben Kallus <benjamin.p.kallus.gr@dartmouth.edu>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
The functions serve very similar purposes. The advantage of
__rtld_libc_freeres is that it is located within ld.so, so it is
more natural to poke at link map internals there.
This slightly regresses cleanup capabilities for statically linked
binaries. If that becomes a problem, we should start calling
__rtld_libc_freeres from __libc_freeres (perhaps after renaming it).
|
|
Followup to c3d1dac96bdd10250aa37bb367d5ef8334a093a1. As pointed out by
Maciej W. Rozycki, the casts are obviously useless now.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Similar to a9944a52c967ce76a5894c30d0274b824df43c7a and
f9493a15ea9cfb63a815c00c23142369ec09d8ce, we need to hide calloc use from
the compiler to accommodate GCC's r15-6566-g804e9d55d9e54c change.
First, include tst-malloc-aux.h, but then use `volatile` variables
for size.
The test passes without the tst-malloc-aux.h change but IMO we want
it there for consistency and to avoid future problems (possibly silent).
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
I've updated copyright dates in glibc for 2025. This is the patch for
the changes not generated by scripts/update-copyrights and subsequent
build / regeneration of generated files.
|
|
|
|
Add __attribute_optimization_barrier__ to disable inlining and cloning on a
function. For Clang, expand it to
__attribute__ ((optnone))
Otherwise, expand it to
__attribute__ ((noinline, clone))
Co-Authored-By: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
|
|
Reviewed-by: Sam James <sam@gentoo.org>
|
|
Since -1 isn't a power of two, compiler may reject it, hide memalign from
Clang 19 which issues an error:
tst-memalign.c:86:31: error: requested alignment is not a power of 2 [-Werror,-Wnon-power-of-two-alignment]
86 | p = memalign (-1, pagesize);
| ^~
tst-memalign.c:86:31: error: requested alignment must be 4294967296 bytes or smaller; maximum alignment assumed [-Werror,-Wbuiltin-assume-aligned-alignment]
86 | p = memalign (-1, pagesize);
| ^~
Update tst-malloc-aux.h to hide all malloc functions and include it in
all malloc tests to prevent compiler from optimizing out any malloc
functions.
Tested with Clang 19.1.5 and GCC 15 20241206 for BZ #32366.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Sam James <sam@gentoo.org>
|
|
This commit add tcache support in calloc() which can largely improve
the performance of small size allocation, especially in multi-thread
scenario. tcache_available() and tcache_try_malloc() are split out as
a helper function for better reusing the code.
Also fix tst-safe-linking failure after enabling tcache. In previous,
calloc() is used as a way to by-pass tcache in memory allocation and
trigger safe-linking check in fastbins path. With tcache enabled, it
needs extra workarounds to bypass tcache.
Result of bench-calloc-thread benchmark
Test Platform: Xeon-8380
Ratio: New / Original time_per_iteration (Lower is Better)
Threads# | Ratio
-----------|------
1 thread | 0.656
4 threads | 0.470
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
GCC 15 introduces allocation dead code removal (DCE) for PR117370 in
r15-5255-g7828dc070510f8. This breaks various glibc tests which want
to assert various properties of the allocator without doing anything
obviously useful with the allocated memory.
Alexander Monakov rightly pointed out that we can and should do better
than passing -fno-malloc-dce to paper over the problem. Not least because
GCC 14 already does such DCE where there's no testing of malloc's return
value against NULL, and LLVM has such optimisations too.
Handle this by providing malloc (and friends) wrappers with a volatile
function pointer to obscure that we're calling malloc (et. al) from the
compiler.
Reviewed-by: Paul Eggert <eggert@cs.ucla.edu>
|
|
Add calloc-clear-memory.h to clear memory size up to 36 bytes (72 bytes
on 64-bit targets) for calloc. Use repeated stores with 1 branch, instead
of up to 3 branches. On x86-64, it is faster than memset since calling
memset needs 1 indirect branch, 1 broadcast, and up to 4 branches.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
|
|
Large chunks get added to the unsorted bin since
sorting them takes time, for small chunks the
benefit of adding them to the unsorted bin is
non-existant, actually hurting performance.
Splitting and malloc_consolidate still add small
chunks to unsorted, but we can hint the compiler
that that is a relatively rare occurance.
Benchmarking shows this to be consistently good.
Authored-by: k4lizen <k4lizen@proton.me>
Signed-off-by: Aleksa Siriški <sir@tmina.org>
|
|
Tcache is an important optimzation to accelerate memory free(), things
within this code path should be kept as simple as possible. This commit
try to remove the function call when free() invokes tcache code path by
inlining _int_free().
Result of bench-malloc-thread benchmark
Test Platform: Xeon-8380
Ratio: New / Original time_per_iteration (Lower is Better)
Threads# | Ratio
-----------|------
1 thread | 0.879
4 threads | 0.874
The performance data shows it can improve bench-malloc-thread benchmark
by ~12% in both single thread and multi-thread scenario.
Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
|
|
Replace 0 by NULL and {0} by {}.
Omit a few cases that aren't so trivial to fix.
Link: <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117059>
Link: <https://software.codidact.com/posts/292718/292759#answer-292759>
Signed-off-by: Alejandro Colomar <alx@kernel.org>
|
|
Split _int_free() into 3 smaller functions for flexible combination:
* _int_free_check -- sanity check for free
* tcache_free -- free memory to tcache (quick path)
* _int_free_chunk -- free memory chunk (slow path)
|
|
Linux 6.11 has getrandom() in vDSO. It operates on a thread-local opaque
state allocated with mmap using flags specified by the vDSO.
Multiple states are allocated at once, as many as fit into a page, and
these are held in an array of available states to be doled out to each
thread upon first use, and recycled when a thread terminates. As these
states run low, more are allocated.
To make this procedure async-signal-safe, a simple guard is used in the
LSB of the opaque state address, falling back to the syscall if there's
reentrancy contention.
Also, _Fork() is handled by blocking signals on opaque state allocation
(so _Fork() always sees a consistent state even if it interrupts a
getrandom() call) and by iterating over the thread stack cache on
reclaim_stack. Each opaque state will be in the free states list
(grnd_alloc.states) or allocated to a running thread.
The cancellation is handled by always using GRND_NONBLOCK flags while
calling the vDSO, and falling back to the cancellable syscall if the
kernel returns EAGAIN (would block). Since getrandom is not defined by
POSIX and cancellation is supported as an extension, the cancellation is
handled as 'may occur' instead of 'shall occur' [1], meaning that if
vDSO does not block (the expected behavior) getrandom will not act as a
cancellation entrypoint. It avoids a pthread_testcancel call on the fast
path (different than 'shall occur' functions, like sem_wait()).
It is currently enabled for x86_64, which is available in Linux 6.11,
and aarch64, powerpc32, powerpc64, loongarch64, and s390x, which are
available in Linux 6.12.
Link: https://pubs.opengroup.org/onlinepubs/9799919799/nframe.html [1]
Co-developed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Tested-by: Jason A. Donenfeld <Jason@zx2c4.com> # x86_64
Tested-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> # x86_64, aarch64
Tested-by: Xi Ruoyao <xry111@xry111.site> # x86_64, aarch64, loongarch64
Tested-by: Stefan Liebler <stli@linux.ibm.com> # s390x
|
|
Fixes build failures on Hurd.
|
|
Fixes build failures on Hurd.
|
|
Improve aligned_alloc/calloc/malloc test coverage by adding
multi-threaded tests with random memory allocations and with/without
cross-thread memory deallocations.
Perform a number of memory allocation calls with random sizes limited
to 0xffff.
Use the existing DSO ('malloc/tst-aligned_alloc-lib.c') to randomize
allocator selection.
The multi-threaded allocation/deallocation is staged as described below:
- Stage 1: Half of the threads will be allocating memory and the
other half will be waiting for them to finish the allocation.
- Stage 2: Half of the threads will be allocating memory and the
other half will be deallocating memory.
- Stage 3: Half of the threads will be deallocating memory and the
second half waiting on them to finish.
Add 'malloc/tst-aligned-alloc-random-thread.c' where each thread will
deallocate only the memory that was previously allocated by itself.
Add 'malloc/tst-aligned-alloc-random-thread-cross.c' where each thread
will deallocate memory that was previously allocated by another thread.
The intention is to be able to utilize existing malloc testing to ensure
that similar allocation APIs are also exposed to the same rigors.
Reviewed-by: Arjun Shankar <arjun@redhat.com>
|
|
Make sure the DSO used by aligned_alloc/calloc/malloc tests does not get
a global lock on multithreaded tests.
Reviewed-by: Arjun Shankar <arjun@redhat.com>
|
|
Signed-off-by: John David Anglin <dave.anglin@bell.net>
Reviewed-By: Andreas K. Hüttel <dilfridge@gentoo.org>
|
|
Use the list form of the open function to avoid interpreting meta
characters in the arguments.
|
|
The previous version expanded $0 and $@ twice.
The new version defines a q no-op shell command. The Perl syntax
error is masked by the eval Perl function. The q { … } construct
is executed by the shell without errors because the q shell function
was defined, but treated as a non-expanding quoted string by Perl,
effectively hiding its context from the Perl interpreter. As before
the script is read by require instead of executed directly, to avoid
infinite recursion because the #! line contains /bin/sh.
Introduce the “fatal” function to produce diagnostics that are not
suppressed by “do”. Use “do” instead of “require” because it has
fewer requirements on the executed script than “require”.
Prefix relative paths with './' because “do” (and “require“ before)
searches for the script in @INC if the path is relative and does not
start with './'. Use $_ to make the trampoline shorter.
Add an Emacs mode marker to indentify the script as a Perl script.
|
|
Generation of the Perl script does not depend on Perl, so we can
always install it even if $(PERL) is not set during the build.
Change the malloc/mtrace.pl text substition not to rely on $(PERL).
Instead use PATH at run time to find the Perl interpreter. The Perl
interpreter cannot execute directly a script that starts with
“#! /bin/sh”: it always executes it with /bin/sh. There is no
perl command line switch to disable this behavior. Instead, use
the Perl require function to execute the script. The additional
shift calls remove the “.” shell arguments. Perl interprets the
“.” as a string concatenation operator, making the expression
syntactically valid.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
|
|
The test aims to ensure that malloc uses the alternate path to
allocate memory when sbrk() or brk() fails.To achieve this,
the test first creates an obstruction at current program break,
tests that obstruction with a failing sbrk(), then checks if malloc
is still returning a valid ptr thus inferring that malloc() used
mmap() instead of brk() or sbrk() to allocate the memory.
Reviewed-by: Arjun Shankar <arjun@redhat.com>
Reviewed-by: Zack Weinberg <zack@owlfolio.org>
|
|
Add a DSO (malloc/tst-aligned_alloc-lib.so) that can be used during
testing to interpose malloc with a call that randomly uses either
aligned_alloc, __libc_malloc, or __libc_calloc in the place of malloc.
Use LD_PRELOAD with the DSO to mirror malloc/tst-malloc.c testing as an
example in malloc/tst-malloc-random.c. Add malloc/tst-aligned-alloc-random.c
as another example that does a number of malloc calls with randomly sized,
but limited to 0xffff, requests.
The intention is to be able to utilize existing malloc testing to ensure
that similar allocation APIs are also exposed to the same rigors.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
Put each test on a separate line and sort tests.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
|
|
The __getrandom_nocancel function returns errors as negative values
instead of errno. This is inconsistent with other _nocancel functions
and it breaks "TEMP_FAILURE_RETRY (__getrandom_nocancel (p, n, 0))" in
__arc4random_buf. Use INLINE_SYSCALL_CALL instead of
INTERNAL_SYSCALL_CALL to fix this issue.
But __getrandom_nocancel has been avoiding from touching errno for a
reason, see BZ 29624. So add a __getrandom_nocancel_nostatus function
and use it in tcache_key_initialize.
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
|
|
I've updated copyright dates in glibc for 2024. This is the patch for
the changes not generated by scripts/update-copyrights and subsequent
build / regeneration of generated files.
|
|
|
|
Even for explicit large page support, allocation might use mmap without
the hugepage bit set if the requested size is smaller than
mmap_threshold. For this case where mmap is issued, MAP_HUGETLB is set
iff the allocation size is larger than the used large page.
To force such allocations to use large pages, also tune the mmap_threhold
(if it is not explicit set by a tunable). This forces allocation to
follow the sbrk path, which will fall back to mmap (which will try large
pages before galling back to default mmap).
Checked on x86_64-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
|
|
This restore the 2.33 semantic for arena_get2. It was changed by
11a02b035b46 to avoid arena_get2 call malloc (back when __get_nproc
was refactored to use an scratch_buffer - 903bc7dcc2acafc). The
__get_nproc was refactored over then and now it also avoid to call
malloc.
The 11a02b035b46 did not take in consideration any performance
implication, which should have been discussed properly. The
__get_nprocs_sched is still used as a fallback mechanism if procfs
and sysfs is not acessible.
Checked on x86_64-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
Add anonymous mmap annotations on loader malloc, malloc when it
allocates memory with mmap, and on malloc arena. The /proc/self/maps
will now print:
[anon: glibc: malloc arena]
[anon: glibc: malloc]
[anon: glibc: loader malloc]
On arena allocation, glibc annotates only the read/write mapping.
Checked on x86_64-linux-gnu and aarch64-linux-gnu.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
With gcc 13.1 with --enable-fortify-source=2, tst-tcfree3 fails to
build on csky-linux-gnuabiv2 with:
../string/bits/string_fortified.h: In function ‘do_test’:
../string/bits/string_fortified.h:26:8: error: inlining failed in call
to ‘always_inline’ ‘memcpy’: target specific option mismatch
26 | __NTH (memcpy (void *__restrict __dest, const void *__restrict
__src,
| ^~~~~~
../misc/sys/cdefs.h:81:62: note: in definition of macro ‘__NTH’
81 | # define __NTH(fct) __attribute__ ((__nothrow__ __LEAF)) fct
| ^~~
tst-tcfree3.c:45:3: note: called from here
45 | memcpy (c, a, 32);
| ^~~~~~~~~~~~~~~~~
Instead of relying on -O0 to avoid malloc/free to be optimized away,
disable the builtin.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
On the test workload (mpv --cache=yes with VP9 video decoding), the
bin scanning has a very poor success rate (less than 2%). The tcache
scanning has about 50% success rate, so keep that.
Update comments in malloc/tst-memalign-2 to indicate the purpose
of the tests. Even with the scanning removed, the additional
merging opportunities since commit 542b1105852568c3ebc712225ae78b
("malloc: Enable merging of remainders in memalign (bug 30723)")
are sufficient to pass the existing large bins test.
Remove leftover variables from _int_free from refactoring in the
same commit.
Reviewed-by: DJ Delorie <dj@redhat.com>
|
|
Previously, calling _int_free from _int_memalign could put remainders
into the tcache or into fastbins, where they are invisible to the
low-level allocator. This results in missed merge opportunities
because once these freed chunks become available to the low-level
allocator, further memalign allocations (even of the same size are)
likely obstructing merges.
Furthermore, during forwards merging in _int_memalign, do not
completely give up when the remainder is too small to serve as a
chunk on its own. We can still give it back if it can be merged
with the following unused chunk. This makes it more likely that
memalign calls in a loop achieve a compact memory layout,
independently of initial heap layout.
Drop some useless (unsigned long) casts along the way, and tweak
the style to more closely match GNU on changed lines.
Reviewed-by: DJ Delorie <dj@redhat.com>
|