aboutsummaryrefslogtreecommitdiff
path: root/nptl
AgeCommit message (Collapse)AuthorFilesLines
4 dayssupport: add check_mem_access functionYury Khrustalev1-47/+3
Add check_mem_access(addr) function to check if memory at addr can be written or read returning false if memory is not accessible. This function changes signal handler for SIGSEGV and SIGBUS signals when it is called first, and it is not thread-safe. Co-authored-by: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
11 daysRemove futex_supports_psharedAndreas Schwab5-26/+9
Both NPTL and HTL support PTHREAD_PROCESS_SHARED, and since the removal of the NaCL port there are no other pthread implementations.
2025-09-01nptl: Provide __pthread_rwlock_unlock compat symbol for versions before 2.43Xi Ruoyao2-1/+7
The symbol was unintentionally leaked on ports introduced after GLIBC_2.34, provide the compat symbol to avoid breaking ABI on them. Signed-off-by: Xi Ruoyao <xry111@xry111.site> Reviewed-by: Florian Weimer <fweimer@redhat.com>
2025-08-27elf: early conversion of elf p_flags to mprotect flagsCupertino Miranda1-9/+3
This patch replaces _dl_stack_flags global variable by _dl_stack_prot_flags. The advantage is that any convertion from p_flags to final used mprotect flags occurs at loading of p_flags. It avoids repeated spurious convertions of _dl_stack_flags, for example in allocate_thread_stack. This modification was suggested in: https://sourceware.org/pipermail/libc-alpha/2025-March/165537.html Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-08-01nptl: Fix SYSCALL_CANCEL for return values larger than INT_MAX (BZ 33245)Adhemerval Zanella1-2/+2
The SYSCALL_CANCEL calls __syscall_cancel, which in turn calls __internal_syscall_cancel with an 'int' return instead of the expected 'long int'. This causes issues with syscalls that return values larger than INT_MAX, such as copy_file_range [1]. Checked on x86_64-linux-gnu. [1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=79139 Reviewed-by: Andreas K. Huettel <dilfridge@gentoo.org>
2025-03-13nptl: Check if thread is already terminated in sigcancel_handler (BZ 32782)Adhemerval Zanella1-6/+8
The SIGCANCEL signal handler should not issue __syscall_do_cancel, which calls __do_cancel and __pthread_unwind, if the cancellation is already in proces (and libgcc unwind is not reentrant). Any cancellation signal received after is ignored. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Tested-by: Aurelien Jarno <aurelien@aurel32.net> Reviewed-by: Florian Weimer <fweimer@redhat.com>
2025-03-12Makefile: Clean up pthread_atfork integrationFlorian Weimer1-2/+0
Do not add the pthread_atfork routine again in nptl/Makefile, instead rely on sysdeps/pthread/Makefile for the integration (as this is the directory that contains the source file). In sysdeps/pthread/Makefile, add to static-only-routines. Reviewed-by: Joseph Myers <josmyers@redhat.com>
2025-03-12nptl: Include <stdbool.h> in tst-pthread_gettid_np.cFlorian Weimer1-0/+1
The test uses the while (true) construct.
2025-03-12Linux: Add the pthread_gettid_np function (bug 27880)Florian Weimer4-0/+114
Current Bionic has this function, with enhanced error checking (the undefined case terminates the process). Reviewed-by: Joseph Myers <josmyers@redhat.com>
2025-03-04Pass -Wl,--no-error-execstack for tests where -Wl,-z,execstack is used [PR32717]Sam James1-0/+3
When GNU Binutils is configured with --enable-error-execstack=yes, a handful of our tests which rely on -Wl,-z,execstack fail. Pass --Wl,--no-error-execstack to override the behaviour and get a warning instead. Bug: https://sourceware.org/PR32717 Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-02-13nptl: Remove unused __g_refs comment.Carlos O'Donell1-5/+0
In the block comment for __pthread_cond_wait_common we mention __g_refs, but the implementation no longer uses group references.
2025-01-30ld.so: Decorate BSS mappingsPetr Malat1-4/+0
Decorate BSS mappings with [anon: glibc: .bss <file>], for example [anon: glibc: .bss /lib/libc.so.6]. The string ".bss" is already used by bionic so use the same, but add the filename as well. If the name would be longer than what the kernel allows, drop the directory part of the path. Refactor glibc.mem.decorate_maps check to a separate function and use it to avoid assembling a name, which would not be used later. Signed-off-by: Petr Malat <oss@malat.biz> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-01-30nptl: Add support for setup guard pages with MADV_GUARD_INSTALLAdhemerval Zanella7-93/+556
Linux 6.13 (662df3e5c3766) added a lightweight way to define guard areas through madvise syscall. Instead of PROT_NONE the guard region through mprotect, userland can madvise the same area with a special flag, and the kernel ensures that accessing the area will trigger a SIGSEGV (as for PROT_NONE mapping). The madvise way has the advantage of less kernel memory consumption for the process page-table (one less VMA per guard area), and slightly less contention on kernel (also due to the fewer VMA areas being tracked). The pthread_create allocates a new thread stack in two ways: if a guard area is set (the default) it allocates the memory range required using PROT_NONE and then mprotect the usable stack area. Otherwise, if a guard page is not set it allocates the region with the required flags. For the MADV_GUARD_INSTALL support, the stack area region is allocated with required flags and then the guard region is installed. If the kernel does not support it, the usual way is used instead (and MADV_GUARD_INSTALL is disabled for future stack creations). The stack allocation strategy is recorded on the pthread struct, and it is used in case the guard region needs to be resized. To avoid needing an extra field, the 'user_stack' is repurposed and renamed to 'stack_mode'. This patch also adds a proper test for the pthread guard. I checked on x86_64, aarch64, powerpc64le, and hppa with kernel 6.13.0-rc7. Reviewed-by: DJ Delorie <dj@redhat.com>
2025-01-29nptl: Correct stack size attribute when stack grows up [BZ #32574]John David Anglin1-2/+2
Set stack size attribute to the size of the mmap'd region only when the size of the remaining stack space is less than the size of the mmap'd region. This was reversed. As a result, the initial stack size was only 135168 bytes. On architectures where the stack grows down, the initial stack size is approximately 8384512 bytes with the default rlimit settings. The small main stack size on hppa broke applications like ruby that check for stack overflows. Signed-off-by: John David Anglin <dave.anglin@bell.net>
2025-01-21nptl: Include <stdbool.h> in tst-skeleton-affinity-inheritance.cFlorian Weimer1-0/+1
The file uses the identifiers bool, false, true.
2025-01-17nptl: Use all of g1_start and g_signalsMalte Skarupke4-28/+18
The LSB of g_signals was unused. The LSB of g1_start was used to indicate which group is G2. This was used to always go to sleep in pthread_cond_wait if a waiter is in G2. A comment earlier in the file says that this is not correct to do: "Waiters cannot determine whether they are currently in G2 or G1 -- but they do not have to because all they are interested in is whether there are available signals" I either would have had to update the comment, or get rid of the check. I chose to get rid of the check. In fact I don't quite know why it was there. There will never be available signals for group G2, so we didn't need the special case. Even if there were, this would just be a spurious wake. This might have caught some cases where the count has wrapped around, but it wouldn't reliably do that, (and even if it did, why would you want to force a sleep in that case?) and we don't support that many concurrent waiters anyway. Getting rid of it allows us to use one more bit, making us more robust to wraparound. Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-17nptl: rename __condvar_quiesce_and_switch_g1Malte Skarupke4-30/+26
This function no longer waits for threads to leave g1, so rename it to __condvar_switch_g1 Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-17nptl: Fix indentationMalte Skarupke1-55/+55
In my previous change I turned a nested loop into a simple loop. I'm doing the resulting indentation changes in a separate commit to make the diff on the previous commit easier to review. Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-17nptl: Use a single loop in pthread_cond_wait instaed of a nested loopMalte Skarupke1-22/+19
The loop was a little more complicated than necessary. There was only one break statement out of the inner loop, and the outer loop was nearly empty. So just remove the outer loop, moving its code to the one break statement in the inner loop. This allows us to replace all gotos with break statements. Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-17nptl: Remove g_refs from condition variablesMalte Skarupke2-57/+7
This variable used to be needed to wait in group switching until all sleepers have confirmed that they have woken. This is no longer needed. Nothing waits on this variable so there is no need to track how many threads are currently asleep in each group. Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-17nptl: Remove unnecessary quadruple check in pthread_cond_waitMalte Skarupke1-49/+0
pthread_cond_wait was checking whether it was in a closed group no less than four times. Checking once is enough. Here are the four checks: 1. While spin-waiting. This was dead code: maxspin is set to 0 and has been for years. 2. Before deciding to go to sleep, and before incrementing grefs: I kept this 3. After incrementing grefs. There is no reason to think that the group would close while we do an atomic increment. Obviously it could close at any point, but that doesn't mean we have to recheck after every step. This check was equally good as check 2, except it has to do more work. 4. When we find ourselves in a group that has a signal. We only get here after we check that we're not in a closed group. There is no need to check again. The check would only have helped in cases where the compare_exchange in the next line would also have failed. Relying on the compare_exchange is fine. Removing the duplicate checks clarifies the code. Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-17nptl: Remove unnecessary catch-all-wake in condvar group switchMalte Skarupke1-30/+1
This wake is unnecessary. We only switch groups after every sleeper in a group has been woken. Sure, they may take a while to actually wake up and may still hold a reference, but waking them a second time doesn't speed that up. Instead this just makes the code more complicated and may hide problems. In particular this safety wake wouldn't even have helped with the bug that was fixed by Barrus' patch: The bug there was that pthread_cond_signal would not switch g1 when it should, so we wouldn't even have entered this code path. Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-17nptl: Update comments and indentation for new condvar implementationMalte Skarupke2-22/+22
Some comments were wrong after the most recent commit. This fixes that. Also fixing indentation where it was using spaces instead of tabs. Signed-off-by: Malte Skarupke <malteskarupke@fastmail.fm> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-17pthreads NPTL: lost wakeup fix 2Frank Barrus2-168/+81
This fixes the lost wakeup (from a bug in signal stealing) with a change in the usage of g_signals[] in the condition variable internal state. It also completely eliminates the concept and handling of signal stealing, as well as the need for signalers to block to wait for waiters to wake up every time there is a G1/G2 switch. This greatly reduces the average and maximum latency for pthread_cond_signal. The g_signals[] field now contains a signal count that is relative to the current g1_start value. Since it is a 32-bit field, and the LSB is still reserved (though not currently used anymore), it has a 31-bit value that corresponds to the low 31 bits of the sequence number in g1_start. (since g1_start also has an LSB flag, this means bits 31:1 in g_signals correspond to bits 31:1 in g1_start, plus the current signal count) By making the signal count relative to g1_start, there is no longer any ambiguity or A/B/A issue, and thus any checks before blocking, including the futex call itself, are guaranteed not to block if the G1/G2 switch occurs, even if the signal count remains the same. This allows initially safely blocking in G2 until the switch to G1 occurs, and then transitioning from G1 to a new G1 or G2, and always being able to distinguish the state change. This removes the race condition and A/B/A problems that otherwise ocurred if a late (pre-empted) waiter were to resume just as the futex call attempted to block on g_signal since otherwise there was no last opportunity to re-check things like whether the current G1 group was already closed. By fixing these issues, the signal stealing code can be eliminated, since there is no concept of signal stealing anymore. The code to block for all waiters to exit g_refs can also be removed, since any waiters that are still in the g_refs region can be guaranteed to safely wake up and exit. If there are still any left at this time, they are all sent one final futex wakeup to ensure that they are not blocked any longer, but there is no need for the signaller to block and wait for them to wake up and exit the g_refs region. The signal count is then effectively "zeroed" but since it is now relative to g1_start, this is done by advancing it to a new value that can be observed by any pending blocking waiters. Any late waiters can always tell the difference, and can thus just cleanly exit if they are in a stale G1 or G2. They can never steal a signal from the current G1 if they are not in the current G1, since the signal value that has to match in the cmpxchg has the low 31 bits of the g1_start value contained in it, and that's first checked, and then it won't match if there's a G1/G2 change. Note: the 31-bit sequence number used in g_signals is designed to handle wrap-around when checking the signal count, but if the entire 31-bit wraparound (2 billion signals) occurs while there is still a late waiter that has not yet resumed, and it happens to then match the current g1_start low bits, and the pre-emption occurs after the normal "closed group" checks (which are 64-bit) but then hits the futex syscall and signal consuming code, then an A/B/A issue could still result and cause an incorrect assumption about whether it should block. This particular scenario seems unlikely in practice. Note that once awake from the futex, the waiter would notice the closed group before consuming the signal (since that's still a 64-bit check that would not be aliased in the wrap-around in g_signals), so the biggest impact would be blocking on the futex until the next full wakeup from a G1/G2 switch. Signed-off-by: Frank Barrus <frankbarrus_sw@shaggy.cc> Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2025-01-14affinity-inheritance: Overallocate CPU setsStefan Liebler2-7/+15
Some kernels on S390 appear to return a CPU affinity mask based on configured processors rather than the ones online. Overallocate the CPU set to match that, but operate only on the ones online. Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org> Co-authored-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
2025-01-10nptl: Remove the rseq area from 'struct pthread'Michael Jeanson1-19/+3
The rseq extensible ABI implementation moved the rseq area to the 'extra TLS' block, remove the unused 'rseq_area' member of 'struct pthread'. Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Florian Weimer <fweimer@redhat.com>
2025-01-10nptl: Move the rseq area to the 'extra TLS' blockMichael Jeanson1-1/+1
Move the rseq area to the newly added 'extra TLS' block, this is the last step in adding support for the rseq extended ABI. The size of the rseq area is now dynamic and depends on the rseq features reported by the kernel through the elf auxiliary vector. This will allow applications to use rseq features past the 32 bytes of the original rseq ABI as they become available in future kernels. Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Florian Weimer <fweimer@redhat.com>
2025-01-09tests: Verify inheritance of cpu affinitySiddhesh Poyarekar3-0/+224
Add a couple of tests to verify that CPU affinity set using sched_setaffinity and pthread_setaffinity_np are inherited by a child process and child thread. Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-01-07Revert "nptl: More useful padding in struct pthread"Florian Weimer1-25/+31
This reverts commit 7c22dcda27743658b6b8ea479283b384ad56bd5a. The padding is required by Chromium's MaybeUpdateGlibcTidCache in sandbox/linux/services/namespace_sandbox.cc. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2025-01-01Update copyright dates with scripts/update-copyrightsPaul Eggert278-278/+278
2024-12-31elf: Do not change stack permission on dlopen/dlmopenAdhemerval Zanella1-19/+0
If some shared library loaded with dlopen/dlmopen requires an executable stack, either implicitly because of a missing GNU_STACK ELF header (where the ABI default flags implies in the executable bit) or explicitly because of the executable bit from GNU_STACK; the loader will try to set the both the main thread and all thread stacks (from the pthread cache) as executable. Besides the issue where any __nptl_change_stack_perm failure does not undo the previous executable transition (meaning that if the library fails to load, there can be thread stacks with executable stacks), this behavior was used on a CVE [1] as a vector for RCE. This patch changes that if a shared library requires an executable stack, and the current stack is not executable, dlopen fails. The change is done only for dynamically loaded modules, if the program or any dependency requires an executable stack, the loader will still change the main thread before program execution and any thread created with default stack configuration. [1] https://www.qualys.com/2023/07/19/cve-2023-38408/rce-openssh-forwarded-ssh-agent.txt Checked on x86_64-linux-gnu and i686-linux-gnu. Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-12-27nptl: More useful padding in struct pthreadFlorian Weimer1-31/+25
The previous use of padding within a union made it impossible to re-use the padding for GLIBC_PRIVATE ABI preservation because tcbhead_t could use up all of the padding (as was historically the case on x86-64). Allocating padding unconditionally addresses this issue. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-12-23include/sys/cdefs.h: Add __attribute_optimization_barrier__Adhemerval Zanella3-16/+16
Add __attribute_optimization_barrier__ to disable inlining and cloning on a function. For Clang, expand it to __attribute__ ((optnone)) Otherwise, expand it to __attribute__ ((noinline, clone)) Co-Authored-By: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Sam James <sam@gentoo.org>
2024-12-22Suppress -Wmaybe-uninitialized only for GCCH.J. Lu1-1/+1
Clang doesn't support -Wmaybe-uninitialized. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Sam James <sam@gentoo.org>
2024-12-22Enable execstack tests only if compiler supports trampolineH.J. Lu1-2/+7
Since trampoline is required to test execstack, enable execstack tests only if compiler supports trampoline. Signed-off-by: H.J. Lu <hjl.tools@gmail.com> Reviewed-by: Sam James <sam@gentoo.org>
2024-11-28pthread_getcpuclockid: Add descriptive comment to smoke testSiddhesh Poyarekar2-6/+7
Add a descriptive comment to the tst-pthread-cpuclockid-invalid test and also drop pthread_getcpuclockid from the TODO-testing list since it now has full coverage. Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
2024-11-25Silence most -Wzero-as-null-pointer-constant diagnosticsAlejandro Colomar1-1/+1
Replace 0 by NULL and {0} by {}. Omit a few cases that aren't so trivial to fix. Link: <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117059> Link: <https://software.codidact.com/posts/292718/292759#answer-292759> Signed-off-by: Alejandro Colomar <alx@kernel.org>
2024-11-22nptl: Add smoke test for pthread_getcpuclockid failureSiddhesh Poyarekar2-0/+51
Exercise the case where an exited thread will cause pthread_getcpuclockid to fail. Signed-off-by: Siddhesh Poyarekar <siddhesh@sourceware.org> Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-11-12linux: Add support for getrandom vDSOAdhemerval Zanella3-0/+10
Linux 6.11 has getrandom() in vDSO. It operates on a thread-local opaque state allocated with mmap using flags specified by the vDSO. Multiple states are allocated at once, as many as fit into a page, and these are held in an array of available states to be doled out to each thread upon first use, and recycled when a thread terminates. As these states run low, more are allocated. To make this procedure async-signal-safe, a simple guard is used in the LSB of the opaque state address, falling back to the syscall if there's reentrancy contention. Also, _Fork() is handled by blocking signals on opaque state allocation (so _Fork() always sees a consistent state even if it interrupts a getrandom() call) and by iterating over the thread stack cache on reclaim_stack. Each opaque state will be in the free states list (grnd_alloc.states) or allocated to a running thread. The cancellation is handled by always using GRND_NONBLOCK flags while calling the vDSO, and falling back to the cancellable syscall if the kernel returns EAGAIN (would block). Since getrandom is not defined by POSIX and cancellation is supported as an extension, the cancellation is handled as 'may occur' instead of 'shall occur' [1], meaning that if vDSO does not block (the expected behavior) getrandom will not act as a cancellation entrypoint. It avoids a pthread_testcancel call on the fast path (different than 'shall occur' functions, like sem_wait()). It is currently enabled for x86_64, which is available in Linux 6.11, and aarch64, powerpc32, powerpc64, loongarch64, and s390x, which are available in Linux 6.12. Link: https://pubs.opengroup.org/onlinepubs/9799919799/nframe.html [1] Co-developed-by: Jason A. Donenfeld <Jason@zx2c4.com> Tested-by: Jason A. Donenfeld <Jason@zx2c4.com> # x86_64 Tested-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> # x86_64, aarch64 Tested-by: Xi Ruoyao <xry111@xry111.site> # x86_64, aarch64, loongarch64 Tested-by: Stefan Liebler <stli@linux.ibm.com> # s390x
2024-11-07nptl: initialize rseq area prior to registrationMichael Jeanson1-0/+2
Per the rseq syscall documentation, 3 fields are required to be initialized by userspace prior to registration, they are 'cpu_id', 'rseq_cs' and 'flags'. Since we have no guarantee that 'struct pthread' is cleared on all architectures, explicitly set those 3 fields prior to registration. Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Reviewed-by: Florian Weimer <fweimer@redhat.com>
2024-10-29Add more tests of pthread attributes initial valuesJoseph Myers2-0/+63
There are various existing tests that call pthread_attr_init and then verify properties of the resulting initial values retrieved with pthread_attr_get* functions. However, those are missing coverage of the initial values retrieved with pthread_attr_getschedparam and pthread_attr_getstacksize. Add testing for initial values from those functions as well. (tst-attr2 covers pthread_attr_getdetachstate, pthread_attr_getguardsize, pthread_attr_getinheritsched, pthread_attr_getschedpolicy, pthread_attr_getscope. tst-attr3 covers some of those together with pthread_attr_getaffinity_np. tst-pthread-attr-sigmask covers pthread_attr_getsigmask_np. pthread_attr_getstack has unspecified results if called before the relevant attributes have been set, while pthread_attr_getstackaddr is deprecated.) Tested for x86_64.
2024-10-21Check time arguments to pthread_timedjoin_np and pthread_clockjoin_npJoseph Myers1-0/+6
The pthread_timedjoin_np and pthread_clockjoin_np functions do not check that a valid time has been specified. The documentation for these functions in the glibc manual isn't sufficiently detailed to say if they should, but consistency with POSIX functions such as pthread_mutex_timedlock and pthread_cond_timedwait strongly indicates that an EINVAL error is appropriate (even if there might be some ambiguity about exactly where such a check should go in relation to other checks for whether the thread exists, whether it's immediately joinable, etc.). Copy the logic for such a check used in pthread_rwlock_common.c. pthread_join_common had some logic calling valid_nanoseconds before commit 9e92278ffad441daf588ff1ff5bd8094aa33fbfd, "nptl: Remove clockwait_tid"; I haven't checked exactly what cases that detected. Tested for x86_64 and x86.
2024-10-08stdlib: Make abort/_Exit AS-safe (BZ 26275)Adhemerval Zanella1-0/+11
The recursive lock used on abort does not synchronize with a new process creation (either by fork-like interfaces or posix_spawn ones), nor it is reinitialized after fork(). Also, the SIGABRT unblock before raise() shows another race condition, where a fork or posix_spawn() call by another thread, just after the recursive lock release and before the SIGABRT signal, might create programs with a non-expected signal mask. With the default option (without POSIX_SPAWN_SETSIGDEF), the process can see SIG_DFL for SIGABRT, where it should be SIG_IGN. To fix the AS-safe, raise() does not change the process signal mask, and an AS-safe lock is used if a SIGABRT is installed or the process is blocked or ignored. With the signal mask change removal, there is no need to use a recursive loc. The lock is also taken on both _Fork() and posix_spawn(), to avoid the spawn process to see the abort handler as SIG_DFL. A read-write lock is used to avoid serialize _Fork and posix_spawn execution. Both sigaction (SIGABRT) and abort() requires to lock as writer (since both change the disposition). The fallback is also simplified: there is no need to use a loop of ABORT_INSTRUCTION after _exit() (if the syscall does not terminate the process, the system is broken). The proposed fix changes how setjmp works on a SIGABRT handler, where glibc does not save the signal mask. So usage like the below will now always abort. static volatile int chk_fail_ok; static jmp_buf chk_fail_buf; static void handler (int sig) { if (chk_fail_ok) { chk_fail_ok = 0; longjmp (chk_fail_buf, 1); } else _exit (127); } [...] signal (SIGABRT, handler); [....] chk_fail_ok = 1; if (! setjmp (chk_fail_buf)) { // Something that can calls abort, like a failed fortify function. chk_fail_ok = 0; printf ("FAIL\n"); } Such cases will need to use sigsetjmp instead. The _dl_start_profile calls sigaction through _profil, and to avoid pulling abort() on loader the call is replaced with __libc_sigaction. Checked on x86_64-linux-gnu and aarch64-linux-gnu. Reviewed-by: DJ Delorie <dj@redhat.com>
2024-09-24nptl: Prefer setresuid32 in tst-setuid2Florian Weimer1-0/+5
Use the setresuid32 system call if it is available, prefering it over setresuid. If both system calls exist, setresuid is the 16-bit variant. This fixes a build failure on sparcv9-linux-gnu.
2024-08-23nptl: Fix Race conditions in pthread cancellation [BZ#12683]Adhemerval Zanella12-122/+240
The current racy approach is to enable asynchronous cancellation before making the syscall and restore the previous cancellation type once the syscall returns, and check if cancellation has happen during the cancellation entrypoint. As described in BZ#12683, this approach shows 2 problems: 1. Cancellation can act after the syscall has returned from the kernel, but before userspace saves the return value. It might result in a resource leak if the syscall allocated a resource or a side effect (partial read/write), and there is no way to program handle it with cancellation handlers. 2. If a signal is handled while the thread is blocked at a cancellable syscall, the entire signal handler runs with asynchronous cancellation enabled. This can lead to issues if the signal handler call functions which are async-signal-safe but not async-cancel-safe. For the cancellation to work correctly, there are 5 points at which the cancellation signal could arrive: [ ... )[ ... )[ syscall ]( ... 1 2 3 4 5 1. Before initial testcancel, e.g. [*... testcancel) 2. Between testcancel and syscall start, e.g. [testcancel...syscall start) 3. While syscall is blocked and no side effects have yet taken place, e.g. [ syscall ] 4. Same as 3 but with side-effects having occurred (e.g. a partial read or write). 5. After syscall end e.g. (syscall end...*] And libc wants to act on cancellation in cases 1, 2, and 3 but not in cases 4 or 5. For the 4 and 5 cases, the cancellation will eventually happen in the next cancellable entrypoint without any further external event. The proposed solution for each case is: 1. Do a conditional branch based on whether the thread has received a cancellation request; 2. It can be caught by the signal handler determining that the saved program counter (from the ucontext_t) is in some address range beginning just before the "testcancel" and ending with the syscall instruction. 3. SIGCANCEL can be caught by the signal handler and determine that the saved program counter (from the ucontext_t) is in the address range beginning just before "testcancel" and ending with the first uninterruptable (via a signal) syscall instruction that enters the kernel. 4. In this case, except for certain syscalls that ALWAYS fail with EINTR even for non-interrupting signals, the kernel will reset the program counter to point at the syscall instruction during signal handling, so that the syscall is restarted when the signal handler returns. So, from the signal handler's standpoint, this looks the same as case 2, and thus it's taken care of. 5. For syscalls with side-effects, the kernel cannot restart the syscall; when it's interrupted by a signal, the kernel must cause the syscall to return with whatever partial result is obtained (e.g. partial read or write). 6. The saved program counter points just after the syscall instruction, so the signal handler won't act on cancellation. This is similar to 4. since the program counter is past the syscall instruction. So The proposed fixes are: 1. Remove the enable_asynccancel/disable_asynccancel function usage in cancellable syscall definition and instead make them call a common symbol that will check if cancellation is enabled (__syscall_cancel at nptl/cancellation.c), call the arch-specific cancellable entry-point (__syscall_cancel_arch), and cancel the thread when required. 2. Provide an arch-specific generic system call wrapper function that contains global markers. These markers will be used in SIGCANCEL signal handler to check if the interruption has been called in a valid syscall and if the syscalls has side-effects. A reference implementation sysdeps/unix/sysv/linux/syscall_cancel.c is provided. However, the markers may not be set on correct expected places depending on how INTERNAL_SYSCALL_NCS is implemented by the architecture. It is expected that all architectures add an arch-specific implementation. 3. Rewrite SIGCANCEL asynchronous handler to check for both canceling type and if current IP from signal handler falls between the global markers and act accordingly. 4. Adjust libc code to replace LIBC_CANCEL_ASYNC/LIBC_CANCEL_RESET to use the appropriate cancelable syscalls. 5. Adjust 'lowlevellock-futex.h' arch-specific implementations to provide cancelable futex calls. Some architectures require specific support on syscall handling: * On i386 the syscall cancel bridge needs to use the old int80 instruction because the optimized vDSO symbol the resulting PC value for an interrupted syscall points to an address outside the expected markers in __syscall_cancel_arch. It has been discussed in LKML [1] on how kernel could help userland to accomplish it, but afaik discussion has stalled. Also, sysenter should not be used directly by libc since its calling convention is set by the kernel depending of the underlying x86 chip (check kernel commit 30bfa7b3488bfb1bb75c9f50a5fcac1832970c60). * mips o32 is the only kABI that requires 7 argument syscall, and to avoid add a requirement on all architectures to support it, mips support is added with extra internal defines. Checked on aarch64-linux-gnu, arm-linux-gnueabihf, powerpc-linux-gnu, powerpc64-linux-gnu, powerpc64le-linux-gnu, i686-linux-gnu, and x86_64-linux-gnu. [1] https://lkml.org/lkml/2016/3/8/1105 Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2024-08-07nptl: Fix stray process left by tst-cancel7 blocking testingMaciej W. Rozycki1-0/+4
Fix an issue with commit b74121ae4bc5 ("Update.") and prevent a stray process from being left behind by tst-cancel7 (and also tst-cancelx7, which is the same test built with '-fexceptions' additionally supplied to the compiler), which then blocks remote testing until the process has been killed by hand. This test case creates a thread that runs an extra copy of the test via system(3) and using the '--direct' option so that the test wrapper does not interfere with this instance. This extra copy executes its business and calls sigsuspend(2) and then never terminates by itself. Instead it relies on being killed by the main test process directly via a thread cancellation request or, should that fail, by issuing SIGKILL either at the conclusion of 'do_test' or by the test driver via 'do_cleanup' where the test timeout has been hit or the test driver interrupted. However if the main test process has been instead killed by a signal, such as due to incorrect execution, before it had a chance to kill the extra copy of the test case, then the test wrapper will terminate without running 'do_cleanup' and consequently the extra copy of the test case will remain forever in its suspended state, and in the remote case in particular it means that the remote test wrapper will wait forever for the SSH command to complete. This has been observed with the 'alpha-linux-gnu' target, where the main test process triggers SIGSEGV and the test wrapper correctly records: Didn't expect signal from child: got `Segmentation fault' in nptl/tst-cancel7.out and terminates, but then the calling SSH command continues waiting for the remaining process started in the same session on the remote target to complete. Address this problem by also registering 'do_cleanup' via atexit(3), observing that 'support_delete_temp_files' is registered by the test wrapper before the test initializing function 'do_prepare' is called and that we call all the functions registered in the reverse of the order in which they were registered, so it is safe to refer to 'pidfilename' in 'do_cleanup' invoked by exit(3) because by that time temporary files have not yet been deleted. A minor inconvenience is that if 'signal_handler' is invoked in the test wrapper as a result of SIGALRM rather than SIGINT, then 'do_cleanup' will be called twice, once as a cleanup handler and again by exit(3). In reality it is harmless though, because issuing SIGKILL is guarded by a record lock, so if the first call has succeeded in killing the extra copy of the test case, then the subsequent call will do nothing. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-08-07nptl: Reorder semaphore release in tst-cancel7Maciej W. Rozycki1-4/+4
Move the release of the semaphore used to synchronize between an extra copy of the test run as a separate process and the main test process until after the PID file has been locked. It is so that if the cleanup function gets called by the test driver due to premature termination of the main test process, then the function does not get at the PID file before it has been locked and conclude that the extra copy of the test has already terminated. This won't usually happen due to a relatively high amount of time required to elapse before timeout triggers in the test driver, but it will change with the next change. There is still a small time window remaining with this change in place where the main test process gets killed for some reason between the extra copy of the test has been already started by pthread_create(3) and a successful return from the call to sem_wait(3), in which case the cleanup function can be reached before PID has been written to the PID file and the file locked. It seems that with the test case structured as it is now and PID-based process management we have no means to avoid it. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-08-05elf: Clarify and invert second argument of _dl_allocate_tls_initFlorian Weimer1-1/+1
Also remove an outdated comment: _dl_allocate_tls_init is called as part of pthread_create. Reviewed-by: Carlos O'Donell <carlos@redhat.com>
2024-07-12nptl: Convert tst-sem11 and tst-sem12 tests to use the test driverMaciej W. Rozycki2-4/+6
Fix an issue with commit 2af4e3e5668f ("Test of semaphores.") by making the tst-sem11 and tst-sem12 tests use the test driver, preventing them from ever causing testing to hang forever and never complete, such as currently happening with the 'mips-linux-gnu' (o32 ABI) target. Adjust the name of the PREPARE macro, which clashes with the interpretation of its presence by the test driver, by using a TF_ prefix in reference to the name of the 'tf' function. Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-07-12nptl: Add copyright notice tst-sem11 and tst-sem12 testsMaciej W. Rozycki2-0/+36
Add a copyright notice to the tst-sem11 and tst-sem12 tests, observing that they have been originally contributed back in 2007, with commit 2af4e3e5668f ("Test of semaphores."). Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>