Age | Commit message | Author | Files | Lines |
|
Unlike for BTI, the kernel does not process GCS properties, so update
GL(dl_aarch64_gcs) before the GCS status is set.
|
|
check_gcs is called for each dependency of a DSO, but the GNU property
of ld.so is not processed, so ldso->l_mach.gcs may not be correct.
Just assume ld.so is GCS compatible regardless of the ELF marking.
|
|
Allows using the same function for static executables.
TODO: it is not clear whether the two are always equivalent, apart from
the ordering and the static-linking case where dl-support.c leaves
l_initfini NULL.
|
|
|
|
Allows using the same function for static executables.
TODO: it is not clear whether the two are always equivalent, apart from
the ordering and the static-linking case where dl-support.c leaves
l_initfini NULL.
|
|
The policy sets how the gcs tunable and the gcs marking turn into the
gcs state (see the sketch below):
0: state = tunable
1: state = marking ? tunable : (tunable && dlopen ? err : 0)
2: state = marking ? tunable : (tunable ? err : 0)
TODO: state lock, default policy
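A minimal C sketch of the mapping above; the function and parameter
names are illustrative only, not the actual tunable plumbing:

  #include <stdbool.h>

  /* Returns the GCS state to use, or -1 for the error case.  */
  static int
  gcs_state_for (int policy, bool tunable, bool marked, bool is_dlopen)
  {
    switch (policy)
      {
      case 0:  /* state = tunable */
        return tunable;
      case 1:  /* error only when an unmarked object is brought in by dlopen */
        if (marked)
          return tunable;
        return (tunable && is_dlopen) ? -1 : 0;
      case 2:  /* error for any unmarked object when the tunable is on */
        if (marked)
          return tunable;
        return tunable ? -1 : 0;
      default:
        return 0;
      }
  }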
|
|
TODO: binutils config check
TODO: build attributes instead of gnu property
|
|
Use the dynamic linker start code to enable GCS in the dynamically linked
case after _dl_start returns and before _dl_start_user, which marks
the point after which user code may run.
As in the statically linked case, this ensures that GCS is enabled on a
top-level stack frame.
|
|
Use the ARCH_SETUP_TLS hook to enable GCS in the statically linked case.
The system call must be inlined, so that GCS is enabled on a top-level
stack frame that does not return and has no exception handlers
above it.
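For illustration only, the prctl that the hook ends up issuing could
look like the sketch below; in glibc the system call has to be inlined
rather than go through the prctl() wrapper, and the constants are taken
from recent Linux uapi headers (treat the exact values as assumptions):

  #include <sys/prctl.h>

  #ifndef PR_SET_SHADOW_STACK_STATUS
  # define PR_SET_SHADOW_STACK_STATUS 75
  # define PR_SHADOW_STACK_ENABLE (1UL << 0)
  #endif

  static void
  enable_gcs_sketch (void)
  {
    /* Enable the Guarded Control Stack for the calling thread.  */
    prctl (PR_SET_SHADOW_STACK_STATUS, PR_SHADOW_STACK_ENABLE, 0, 0, 0);
  }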
|
|
This tunable controls the GCS status. It is the argument to the
PR_SET_SHADOW_STACK_STATUS prctl; the default is 0, so GCS is disabled.
The status is stored into GL(dl_aarch64_gcs) early and only applied
later, since enabling GCS is tricky: it must happen on a top-level
stack frame. (GL is used instead of GLRO because the value may need
updates, depending on the loaded libraries, after read-only protection
is applied; however, library-marking-based GCS setting is not yet
implemented.)
|
|
Free the GCS after a makecontext start function returns and at thread
exit, so it is assumed that a context made with makecontext cannot
outlive the thread where it was created.
This is an attempt to bound the lifetime of the GCS allocated for
makecontext, but significant GCS leaks are still possible. New GCS-aware
APIs could solve that, but they would not allow using GCS with existing
code transparently.
|
|
Changed the makecontext logic: previously the first setcontext jumped
straight to the user callback function and the return address was set
to __startcontext. This does not work when GCS is enabled, as the
integrity of the return address is protected, so instead the context
is set up such that setcontext jumps to __startcontext, which calls the
user callback (passed in x20).
The map_shadow_stack syscall is used to allocate a suitably sized GCS
(which includes some reserved area to account for altstack signal
handlers and otherwise supports the maximum number of 16-byte-aligned
stack frames on the given stack); however, the GCS is never freed, as
the lifetime of the ucontext and the related stack is user managed.
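A rough sketch of the allocation described above; the syscall number,
flag, and sizing policy here are assumptions based on recent Linux
headers, not the exact glibc code:

  #include <stddef.h>
  #include <unistd.h>
  #include <sys/syscall.h>

  #ifndef __NR_map_shadow_stack
  # define __NR_map_shadow_stack 453
  #endif
  #ifndef SHADOW_STACK_SET_TOKEN
  # define SHADOW_STACK_SET_TOKEN (1ULL << 0)
  #endif

  static void *
  alloc_gcs_sketch (size_t stack_size)
  {
    /* One 8-byte GCS entry per possible 16-byte aligned stack frame,
       plus some slack for signal handlers on an alternate stack.  */
    size_t gcs_size = stack_size / 2 + 2 * 4096;
    return (void *) syscall (__NR_map_shadow_stack, 0, gcs_size,
                             SHADOW_STACK_SET_TOKEN);
  }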
|
|
|
|
The userspace ucontext needs to store GCSPR; it does not have to be
compatible with the kernel ucontext. For now we use the Linux
struct gcs_context layout, but only use the gcspr field from it.
The implementation is similar to the longjmp code: it supports switching
GCS if the target GCS is capped, and unwinding a continuous GCS to a
previous state.
|
|
|
|
This implementation ensures that longjmp across different stacks
works: it scans for a GCS cap token and switches GCS if necessary,
then the target GCSPR is restored with a GCSPOPM loop once the
current GCSPR is on the same GCS.
This makes longjmp linear time in the number of jumped-over stack
frames when GCS is enabled.
|
|
The target-specific internal __longjmp is called with a __jmp_buf
argument whose size is exposed in the ABI. On aarch64 this has
no space left, so GCSPR cannot be restored in longjmp in the usual
way, which is needed for the Guarded Control Stack (GCS) extension.
setjmp is implemented via __sigsetjmp, which has a jmp_buf argument;
however, it is also called with a __pthread_unwind_buf_t argument cast
to jmp_buf (in cancellation cleanup code built with -fno-exceptions).
The two types, jmp_buf and __pthread_unwind_buf_t, have common bits
beyond the __jmp_buf field, and there is unused space there which we
can use for saving GCSPR.
For this to work, some bits of those two generic types have to be
reserved for target-specific use, and the generic code in glibc has
to ensure that __longjmp is always called with a __jmp_buf that is
embedded into one of those two types. Morally __longjmp should be
changed to take jmp_buf as its argument, but that is an intrusive
change across targets.
Note: longjmp is never called with __pthread_unwind_buf_t from user
code; only the internal __libc_longjmp is called with that type, and
thus the two types could have separate longjmp implementations on a
target. We don't rely on this now (but might in the future, given that
cancellation unwind does not need to restore GCSPR).
Given the above, this patch finds an unused slot for GCSPR. This
placement is not exposed in the ABI, so it may change in the future.
It is also very target-ABI specific, so the generic types cannot
easily be changed to clearly mark the reserved fields.
|
|
The Guarded Control Stack instructions can be present even if the
hardware does not support the extension (it is a runtime-checked
feature), so the asm code should be backward compatible with old
assemblers.
|
|
|
|
And avoid a Hurd build failure.
Checked on x86_64-linux-gnu.
|
|
While working on a patch to add support for the extensible rseq ABI, we
came across an issue where a new 'const' variable would be merged with
the existing '__rseq_size' variable. We tracked this down to the use of
'-fmerge-all-constants', which allows the compiler to merge identical
constant variables. This means that all 'const' variables in a
compilation unit that have the same size and are initialized to the same
value can be merged.
In this specific case, on 32-bit systems 'unsigned int' and 'ptrdiff_t'
are both 4 bytes and initialized to 0, which should trigger the merge.
However, for reasons we haven't delved into, when the attribute 'section
(".data.rel.ro")' is added to the mix, only variables of the exact same
type are merged. As far as we know this behavior is not specified
anywhere and could change with a new compiler version, hence this patch.
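A small stand-alone illustration of the merging behaviour (the names are
just for the example): with -fmerge-all-constants on a 32-bit target both
objects below are 4 bytes and zero-initialized, so the compiler is
allowed to give them the same address.

  #include <stddef.h>
  #include <stdio.h>

  const unsigned int a = 0;
  const ptrdiff_t b = 0;

  int
  main (void)
  {
    /* With -fmerge-all-constants the two addresses may compare equal.  */
    printf ("&a = %p, &b = %p\n", (void *) &a, (void *) &b);
    return 0;
  }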
Move the definitions of these variables into an assembler file and add
hidden writable aliases for internal use. This has the added bonus of
removing the asm workaround to set the values on rseq registration.
Tested on Debian 12 with GCC 12.2.
Signed-off-by: Michael Jeanson <mjeanson@efficios.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Florian Weimer <fweimer@redhat.com>
|
|
Fixes 32 test failures.
|
|
Adhemerval noticed that the gettimeofday() and 32-bit clock_gettime()
vDSO calls won't be used by glibc on hppa, so there is no need to
declare them. Both syscalls will be emulated by utilizing return values
of the 64-bit clock_gettime() vDSO instead.
Signed-off-by: Helge Deller <deller@gmx.de>
Suggested-by: Adhemerval Zanella Netto <adhemerval.zanella@linaro.org>
|
|
MIPSr6 has the MADDF.s/MADDF.d instructions, which are fused.
In the MIPS ISA, double-precision support can be subsetted; only FMAF
is enabled in that case.
* sysdeps/mips/fpu/math-use-builtins-fma.h
Signed-off-by: YunQiang Su <syq@gcc.gnu.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
It turns out that quite a few applications use bundled mallocs that
have been built to use global-dynamic TLS (instead of the recommended
initial-exec TLS). The previous workaround from
commit afe42e935b3ee97bac9a7064157587777259c60e ("elf: Avoid some
free (NULL) calls in _dl_update_slotinfo") unfortunately does not fix
all encountered cases.
This change avoids the TLS generation update for recursive use
of TLS from a malloc that was called during a TLS update. This
is possible because an interposed malloc has a fixed module ID and
TLS slot. (It cannot be unloaded.) If an initially loaded module ID
is encountered in __tls_get_addr and the dynamic linker is already
in the middle of a TLS update, use the outdated DTV, thus avoiding
another call into malloc. It is still necessary to update the
DTV to the most recent generation, to get out of the slow path,
which is why the check for recursion is needed.
The bookkeeping is done using a global counter instead of a per-thread
flag because TLS access in the dynamic linker is tricky.
All this will go away once the dynamic linker stops using malloc
for TLS, likely as part of a change that pre-allocates all TLS
during pthread_create/dlopen.
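A schematic C model of the recursion guard described above; all names
are hypothetical and none of this is the actual glibc code:

  #include <stddef.h>

  static unsigned int dl_tls_update_active;  /* global counter, not per thread */

  static void *lookup_with_old_dtv (size_t mod) { (void) mod; return NULL; }
  static void *lookup_with_new_dtv (size_t mod) { (void) mod; return NULL; }
  static void update_dtv (void) { /* may call an interposed malloc */ }

  void *
  tls_get_addr_model (size_t module_id)
  {
    /* If malloc (called from update_dtv) re-enters TLS access, skip the
       DTV update and serve the request from the outdated DTV; this is
       safe because an interposed malloc has a fixed module ID and slot.  */
    if (dl_tls_update_active > 0)
      return lookup_with_old_dtv (module_id);

    ++dl_tls_update_active;
    update_dtv ();                  /* slow path, may call malloc */
    --dl_tls_update_active;
    return lookup_with_new_dtv (module_id);
  }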
Fixes commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow
tls access after dlopen [BZ #19924]").
Reviewed-by: Szabolcs Nagy <szabolcs.nagy@arm.com>
|
|
Currently 'non_temporal_threshold' is set to
'non_temporal_threshold_lowbound' on Zhaoxin processors without ERMS.
The default 'non_temporal_threshold_lowbound' is too small for the
KH-40000 and KX-7000 Zhaoxin processors, so this patch updates the value
to 'shared / cachesize_non_temporal_divisor'.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
|
|
This patch optimizes large-size copies to use normal stores when src > dst
and the buffers overlap, making the logic the same as in
memmove-vec-unaligned-erms.S.
The current memmove-ssse3 uses '__x86_shared_cache_size_half' as the
non-temporal threshold; this patch updates it to
'__x86_shared_non_temporal_threshold'. The
__x86_shared_non_temporal_threshold is cpu-specific, and different CPUs
have different values based on the related nt-benchmark results, so using
'__x86_shared_cache_size_half' as the non-temporal threshold in
memmove-ssse3 was unreasonable.
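A rough C model (not the actual asm) of the dispatch rule described
above; the names are illustrative:

  #include <stddef.h>

  static int
  use_nontemporal_stores (const char *dst, const char *src, size_t n,
                          size_t nt_threshold)
  {
    /* Below the CPU-specific non-temporal threshold, regular stores win.  */
    if (n < nt_threshold)
      return 0;
    /* When src > dst and the buffers overlap, the copy falls back to
       normal stores, matching memmove-vec-unaligned-erms.S.  */
    if (src > dst && src < dst + n)
      return 0;
    return 1;
  }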
Performance does not change drastically; the results show overall
improvements without any major regressions or gains.
Results on Zhaoxin KX-7000:
bench-memcpy geometric_mean(N=20) New / Original: 0.999
bench-memcpy-random geometric_mean(N=20) New / Original: 0.999
bench-memcpy-large geometric_mean(N=20) New / Original: 0.978
bench-memmove geometric_mean(N=20) New / Original: 1.000
bench-memmove-large geometric_mean(N=20) New / Original: 0.962
Results on Intel Core i5-6600K:
bench-memcpy geometric_mean(N=20) New / Original: 1.001
bench-memcpy-random geometric_mean(N=20) New / Original: 0.999
bench-memcpy-large geometric_mean(N=20) New / Original: 1.001
bench-memmove geometric_mean(N=20) New / Original: 0.995
bench-memmove-large geometric_mean(N=20) New / Original: 0.936
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
|
|
Fix code formatting under the Zhaoxin branch and add comments for
different Zhaoxin models.
Unaligned AVX loads are slower on KH-40000 and KX-7000, so disable
AVX_Fast_Unaligned_Load.
Enable the Prefer_No_VZEROUPPER and Fast_Unaligned_Load features to
use the sse2_unaligned version of memset, strcpy, and strcat.
Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
|
|
Qualcomm's new core, oryon-1, has different characteristics for
memset than the current versions of memset. For non-zero, larger
sizes, using GPRs rather than SIMD stores is ~30% faster.
For even larger sizes, non-temporal stores are needed so as not to
pollute the L1/L2 caches.
For zero values, `dc zva` should be used. Since we know the `dc zva`
block size will always be 64 bytes, we don't need to determine it at
run time.
I started with the emag memset and added back the `dc zva` code.
Changes since v1:
* v3: Fix comment formatting
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
Qualcomm's new core (oryon-1) has a different performance characteristic
than other cores. For memcpy, it is faster to use the GPRs to
do the copy for large sizes (2x faster). For even larger sizes,
it is better to use the nontemporal load/store instructions so
we don't pollute the L1/L2 caches.
For smaller sizes, the characteristics are very similar to
other cores.
I used the thunderx memcpy as a starting point and expanded from there.
Changes since v1:
* v2: Fix ordering in Makefile.
* v3: Fix comment grammar about the ldnp/stnp instructions.
Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
|
|
This recently came up during a cleanup to remove misaligned accesses
from the RISC-V port.
Link: https://sourceware.org/pipermail/libc-alpha/2022-June/139961.html
Suggested-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: Fangrui Song <maskray@google.com>
|
|
asm volatile ("movfcsr2gr $t0, $fcsr0" ::: "$t0");
asm volatile ("st.d $t0, %0" :"=m"(restore_fcsr));
generate to the following instructions with -Og flag:
movfcsr2gr $t0, $zero
addi.d $t0, $sp, 2047(0x7ff)
addi.d $t0, $t0, 77(0x4d)
st.w $t0, $t0, 0
fcsr0 register and restore_fcsr variable are both stored in t0 register.
Change to:
asm volatile ("movfcsr2gr %0, $fcsr0" :"=r"(restore_fcsr));
to avoid restore_fcsr address in t0.
Comparing float value using memcmp because float value cannot be
directly compared for equality.
Put LOAD_REGISTER_FCSR and SAVE_REGISTER_FCC after LOAD_REGISTER_FLOAT.
Some float instructions may change fcsr register.
|
|
If the pidfd_spawn/pidfd_spawnp helper process succeeds, but execve
fails for some reason (an invalid/non-existent binary, a memory
allocation failure, etc.), the resulting pidfd is never closed, nor
returned to the caller (so it could call close).
Since the process creation failed, it should be up to posix_spawn to
also close the file descriptor in this case (similar to what it
does to reap the process).
This patch also replaces waitpid with waitid (P_PIDFD) for the pidfd
case, to avoid possible PID reuse.
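A minimal illustration of the waitid (P_PIDFD) reaping mentioned above
(requires Linux 5.4+ and a glibc that defines P_PIDFD); error handling
is omitted:

  #include <signal.h>
  #include <sys/wait.h>

  static void
  reap_via_pidfd (int pidfd)
  {
    siginfo_t info;
    /* Waiting on the pidfd itself cannot be confused by PID reuse.  */
    waitid (P_PIDFD, pidfd, &info, WEXITED);
  }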
Checked on x86_64-linux-gnu.
Reviewed-by: Carlos O'Donell <carlos@redhat.com>
|
|
Apologies, I mistakenly interpreted this to be already accepted.
Reverting until v6 or later is reviewed and approved.
This reverts commit 9e06e4a43b58519991acbed1d7f33abc40249226.
|
|
The atomic_spin_nop() macro can be used to run arch-specific
code in the body of a spin loop to potentially improve efficiency.
RISC-V's Zihintpause extension includes a PAUSE instruction for
this use case; it is encoded as a HINT, which means that it
behaves like a NOP on systems that don't implement Zihintpause.
Binutils has supported Zihintpause since 2.36, so this patch uses
the ".insn" directive to keep the code compatible with older
toolchains.
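A sketch of the idea (the exact glibc macro may differ): emit the
Zihintpause PAUSE hint through ".insn" so that old assemblers accept it;
on cores without Zihintpause it executes as a plain NOP.

  static inline void
  cpu_pause_sketch (void)
  {
  #if defined __riscv
    /* PAUSE is a FENCE with pred=W, succ=0, i.e. encoding 0x0100000f.  */
    __asm__ volatile (".insn i 0x0F, 0, x0, x0, 0x010");
  #endif
  }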
Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Acked-by: Palmer Dabbelt <palmer@rivosinc.com>
|
|
MIPSr6 has the MADDF.s/MADDF.d instructions, which are fused.
In the MIPS ISA, double-precision support can be subsetted; only FMAF
is enabled in that case.
* sysdeps/mips/fpu/math-use-builtins-fma.h
Signed-off-by: YunQiang Su <syq@gcc.gnu.org>
Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
|
|
The upcoming parisc (hppa) v6.11 Linux kernel will include vDSO
support for gettimeofday(), clock_gettime() and clock_gettime64()
syscalls for 32- and 64-bit userspace.
The patch below adds the necessary glue code for glibc.
Signed-off-by: Helge Deller <deller@gmx.de>
Changes in v2:
- add vsyscalls for 64-bit too
|
|
|
|
|
|
For the exp10m1, exp2m1, log10p1 and log2p1 implementations.
Signed-off-by: Julian Zhu <jz531210@gmail.com>
|
|
Update mips32/mips64 ulps for the exp10m1, exp2m1, and log10p1 implementations.
Signed-off-by: Julian Zhu <jz531210@gmail.com>
|
|
This is from a -march=i686 -mtune=generic build with
--disable-multi-arch, running on a Cascade Lake CPU.
|
|
The test is not a run-time check, so update the description.
Also use readelf -W for a more stable output format and fix
an LC_ALL typo.
This avoids garbled configure messages:
checking for s390-specific static PIE requirements (runtime check)... 0x0000000000000017 (JMPREL) 0x280
yes
|
|
Results based on POWER8 and POWER9 machines running
powerpc64-linux-gnu, with and without --disable-multi-arch.
|
|
Based on a -march=x86-64-v4 -mfpmath=sse build, with and without
--disable-multi-arch, running on a Zen 4 CPU. Also used different
-march=x86-64-v… settings.
|
|
Add ulps for recently added C23 exp10m1, exp2m1, and log10p1 functions.
Signed-off-by: Xi Ruoyao <xry111@xry111.site>
|
|
Linux catbus 5.15.110-gentoo-r1 #1 SMP Fri Jun 9 17:53:23 PDT 2023 sparc64 sun4v UltraSparc T5 (Niagara5) GNU/Linux
Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
|
|
Needed due to:
- "Implement C23 log10p1"
commit ID 55eb99e9a9d840ba452b128be14d6529c2dde039
- "Implement C23 exp2m1, exp10m1"
commit ID 7ec903e028271d029818378fd60ddaf6b76b89ac
|
|
The HWCAP value is overwritten at the first comparison in the LASX case,
so the second comparison for LSX gets an incorrect result.
Change to use t0 to save the HWCAP value, and use t1 to save the
comparison result.
|
|
For the exp10m1, exp2m1, and log10p1 implementations.
|