riscv-gnu-toolchain/glibc.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author	Files	Lines
2024-11-01	math: Use exp10m1f from CORE-MATH	Adhemerval Zanella	1	-4/+0
	The CORE-MATH implementation is correctly rounded (for any rounding mode) and shows better performance compared to the generic exp10m1f. The code was adapted to glibc style and to use the definition of math_config.h (to handle errno, overflow, and underflow). I mostly fixed some small issues in corner cases (sNaN handling, -INFINITY, a specific overflow check). Benchtest on x64_64 (Ryzen 9 5900X, gcc 14.2.1), aarch64 (Neoverse-N1, gcc 13.3.1), and powerpc (POWER10, gcc 13.2.1): Latency master patched improvement x86_64 45.4690 49.5845 -9.05% x86_64v2 46.1604 36.2665 21.43% x86_64v3 37.8442 31.0359 17.99% i686 121.367 93.0079 23.37% aarch64 21.1126 15.0165 28.87% power10 12.7426 8.4929 33.35% reciprocal-throughput master patched improvement x86_64 19.6005 17.4005 11.22% x86_64v2 19.6008 11.1977 42.87% x86_64v3 17.5427 10.2898 41.34% i686 59.4215 60.9675 -2.60% aarch64 13.9814 7.9173 43.37% power10 6.7814 6.4258 5.24% The generic implementation calls __ieee754_exp10f which has an optimized version, although it is not correctly rounded, which is the main culprit of the the latency difference for x86_64 and throughp for i686. Signed-off-by: Alexei Sibidanov <sibid@uvic.ca> Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr> Signed-off-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Reviewed-by: DJ Delorie <dj@redhat.com>
2024-11-01	nptl: Add <thread_pointer.h> for LoongArch	Michael Jeanson	1	-0/+36
	This will be required by the rseq extensible ABI implementation on all Linux architectures exposing the '__rseq_size' and '__rseq_offset' symbols to set the initial value of the 'cpu_id' field which can be used by applications to test if rseq is available and registered. As long as the symbols are exposed it is valid for an application to perform this test even if rseq is not yet implemented in libc for this architecture. Both code paths are compile tested with build-many-glibcs.py but I don't have access to any hardware to run the tests. Signed-off-by: Michael Jeanson <mjeanson@efficios.com> Reviewed-by: Arjun Shankar <arjun@redhat.com>
2024-10-11	replace tgammaf by the CORE-MATH implementation	Paul Zimmermann	1	-4/+0
	The CORE-MATH implementation is correctly rounded (for any rounding mode). This can be checked by exhaustive tests in a few minutes since there are less than 2^32 values to check against for example GNU MPFR. This patch also adds some bench values for tgammaf. Tested on x86_64 and x86 (cfarm26). With the initial GNU libc code it gave on an Intel(R) Core(TM) i7-8700: "tgammaf": { "": { "duration": 3.50188e+09, "iterations": 2e+07, "max": 602.891, "min": 65.1415, "mean": 175.094 } } With the new code: "tgammaf": { "": { "duration": 3.30825e+09, "iterations": 5e+07, "max": 211.592, "min": 32.0325, "mean": 66.1649 } } With the initial GNU libc code it gave on cfarm26 (i686): "tgammaf": { "": { "duration": 3.70505e+09, "iterations": 6e+06, "max": 2420.23, "min": 243.154, "mean": 617.509 } } With the new code: "tgammaf": { "": { "duration": 3.24497e+09, "iterations": 1.8e+07, "max": 1238.15, "min": 101.155, "mean": 180.276 } } Signed-off-by: Alexei Sibidanov <sibid@uvic.ca> Signed-off-by: Paul Zimmermann <Paul.Zimmermann@inria.fr> Changes in v2: - include <math.h> (fix the linknamespace failures) - restored original benchtests/strcoll-inputs/filelist#en_US.UTF-8 file - restored original wrapper code (math/w_tgammaf_compat.c), except for the dealing with the sign - removed the tgammaf/float entries in all libm-test-ulps files - address other comments from Joseph Myers (https://sourceware.org/pipermail/libc-alpha/2024-July/158736.html) Changes in v3: - pass NULL argument for signgam from w_tgammaf_compat.c - use of math_narrow_eval - added more comments Changes in v4: - initialize local_signgam to 0 in math/w_tgamma_template.c - replace sysdeps/ieee754/dbl-64/gamma_productf.c by dummy file Changes in v5: - do not mention local_signgam any more in math/w_tgammaf_compat.c - initialize local_signgam to 1 instead of 0 in w_tgamma_template.c and added comment Changes in v6: - pass NULL as 2nd argument of __ieee754_gammaf_r in w_tgammaf_compat.c, and check for NULL in e_gammaf_r.c Changes in v7: - added Signed-off-by line for Alexei Sibidanov (author of the code) Changes in v8: - added Signed-off-by line for Paul Zimmermann (submitted of the patch) Changes in v9: - address comments from review by Adhemerval Zanella Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-09-06	LoongArch: Fix macro redefined warning in tls-desc.S	mengqinggang	2	-6/+15
	Undef macro to avoid redefined warning.
2024-08-09	LoongArch: Add cfi instructions for _dl_tlsdesc_dynamic	mengqinggang	5	-373/+258
	In _dl_tlsdesc_dynamic, there are three 'addi.d sp, sp, -size' instructions to allocate stack size for Float/LSX/LASX registers. Every 'addi.d sp, sp, -size' needs a cfi_adjust_cfa_offset because of sp is used to compute CFA. But only one 'addi.d sp, sp, -size' will be run according to HWCAP value. And all cfi_adjust_cfa_offset will be executed in stack unwinding, it result in incorrect CFA. Change _dl_tlsdesc_dynamic to _dl_tlsdesc_dynamic, _dl_tlsdesc_dynamic_lsx and _dl_tlsdesc_dynamic_lasx. Conflicting cfi instructions can be distributed to the three functions. And cfi instructions can correspond to stack down instructions.
2024-08-09	LoongArch: Regenerate ULPs	caiyinyu	1	-32/+32
	From new tests added by 07972839108495245d8b93ca546462b3f4dad47f. Signed-off-by: caiyinyu <caiyinyu@loongson.cn>
2024-08-06	LoongArch: Update Ulps.	caiyinyu	1	-4/+4
	From new tests added by 4dc22baa84bdb4111c0ac0db7139bf9ab953bf61. Signed-off-by: caiyinyu <caiyinyu@loongson.cn>
2024-07-17	Revert "LoongArch: Add cfi instructions for _dl_tlsdesc_dynamic"	Andreas K. Hüttel	5	-258/+373
	We're in freeze for the 2.40 release. This reverts commit 43224b1379d60b1ad98d29ef3d7905d55f828a9f. Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org>
2024-07-17	LoongArch: Add cfi instructions for _dl_tlsdesc_dynamic	mengqinggang	5	-373/+258
	In _dl_tlsdesc_dynamic, there are three 'addi.d sp, sp, -size' instructions to allocate stack size for Float/LSX/LASX registers. Every 'addi.d sp, sp, -size' needs a cfi_adjust_cfa_offset because of sp is used to compute CFA. But only one 'addi.d sp, sp, -size' will be run according to HWCAP value. And all cfi_adjust_cfa_offset will be executed in stack unwinding, it result in incorrect CFA. Change _dl_tlsdesc_dynamic to _dl_tlsdesc_dynamic, _dl_tlsdesc_dynamic_lsx and _dl_tlsdesc_dynamic_lasx. Conflicting cfi instructions can be distributed to the three functions. And cfi instructions can correspond to stack down instructions.
2024-06-26	LoongArch: Fix tst-gnu2-tls2 test case	mengqinggang	1	-143/+153
	asm volatile ("movfcsr2gr $t0, $fcsr0" ::: "$t0"); asm volatile ("st.d $t0, %0" :"=m"(restore_fcsr)); generate to the following instructions with -Og flag: movfcsr2gr $t0, $zero addi.d $t0, $sp, 2047(0x7ff) addi.d $t0, $t0, 77(0x4d) st.w $t0, $t0, 0 fcsr0 register and restore_fcsr variable are both stored in t0 register. Change to: asm volatile ("movfcsr2gr %0, $fcsr0" :"=r"(restore_fcsr)); to avoid restore_fcsr address in t0. Comparing float value using memcmp because float value cannot be directly compared for equality. Put LOAD_REGISTER_FCSR and SAVE_REGISTER_FCC after LOAD_REGISTER_FLOAT. Some float instructions may change fcsr register.
2024-06-19	LoongArch: Update ulps	Xi Ruoyao	1	-0/+60
	Add ulps for recently added C23 exp10m1, exp2m1, and log10p1 functions. Signed-off-by: Xi Ruoyao <xry111@xry111.site>
2024-06-19	LoongArch: Fix _dl_tlsdesc_dynamic in LSX case	mengqinggang	1	-9/+9
	HWCAP value is overwritten at the first comparison of the LASX case. The second comparison at LSX get incorrect result. Change to use t0 to save HWCAP value, and use t1 to save comparison result.
2024-06-17	Convert to autoconf 2.72 (vanilla release, no distribution patches)	Andreas K. Hüttel	1	-14/+19
	As discussed at the patch review meeting Signed-off-by: Andreas K. Hüttel <dilfridge@gentoo.org> Reviewed-by: Simon Chopin <simon.chopin@canonical.com>
2024-06-17	Implement C23 logp1	Joseph Myers	1	-0/+20
	C23 adds various <math.h> function families originally defined in TS 18661-4. Add the logp1 functions (aliases for log1p functions - the name is intended to be more consistent with the new log2p1 and log10p1, where clearly it would have been very confusing to name those functions log21p and log101p). As aliases rather than new functions, the content of this patch is somewhat different from those actually adding new functions. Tests are shared with log1p, so this patch does mechanically update all affected libm-test-ulps files to expect the same errors for both functions. The vector versions of log1p on aarch64 and x86_64 are not updated to have logp1 aliases (and thus there are no corresponding header, tests, abilist or ulps changes for vector functions either). It would be reasonable for such vector aliases and corresponding changes to other files to be made separately. For now, the log1p tests instead avoid testing logp1 in the vector case (a Makefile change is needed to avoid problems with grep, used in generating the .c files for vector function tests, matching more than one ALL_RM_TEST line in a file testing multiple functions with the same inputs, when it assumes that the .inc file only has a single such line). Tested for x86_64 and x86, and with build-many-glibcs.py.
2024-06-14	LoongArch: Ensure sp 16-byte aligned for tlsdesc	Xi Ruoyao	2	-7/+4
	"ADDI sp, sp, 24" and "ADDI sp, sp, SZFCSREG" (SZFCSREG = 4) are misaligning the stack: the ABI mandates a 16-byte alignment. Fix it by changing the first one to "ADDI sp, sp, 32", and reuse the spare 4th slot for saving fcsr. Reported-by: Jinyang He <hejinyang@loongson.cn> Signed-off-by: Xi Ruoyao <xry111@xry111.site>
2024-05-28	LoongArch: Use "$fcsr0" instead of "$r0" in _FPU_{GET,SET}CW	Xi Ruoyao	1	-2/+2
	Clang inline-asm parser does not allow using "$r0" in movfcsr2gr/movgr2fcsr, so everything using _FPU_{GET,SET}CW is now failing to build with Clang on LoongArch. As we now requires Binutils >= 2.41 which supports using "$fcsr0" here, use it instead of "$r0" to fix the issue. Link: https://github.com/loongson-community/discussions/issues/53#issuecomment-2081507390 Link: https://sourceware.org/git/?p=binutils-gdb.git;a=commit;h=4142b2368353 Signed-off-by: Xi Ruoyao <xry111@xry111.site>
2024-05-23	loongarch: Remove duplicate strnlen in libc.a (BZ 31785)	Adhemerval Zanella	1	-0/+2
	The generic version provides weak definitions of strnlen, which are already provided by the ifunc resolver. Reviewed-by: H.J. Lu <hjl.tools@gmail.com>
2024-05-21	LoongArch: Update ulps	caiyinyu	1	-0/+20
	For the log2p1 implementation.
2024-05-21	LoongArch: Fix tst-gnu2-tls2 compiler error	mengqinggang	3	-2/+8
	Add -mno-lsx to tst-gnu2-tlsmod*.c if gcc support -mno-lsx. Add escape character '\' in vector support test function.
2024-05-15	LoongArch: Add support for TLS Descriptors	mengqinggang	14	-8/+1071
	This is mostly based on AArch64 and RISC-V implementation. Add R_LARCH_TLS_DESC32 and R_LARCH_TLS_DESC64 relocations. For _dl_tlsdesc_dynamic function slow path, temporarily save and restore all vector registers.
2024-04-24	LoongArch: Add glibc.cpu.hwcap support.	caiyinyu	7	-5/+312
	The current IFUNC selection is always using the most recent features which are available via AT_HWCAP. But in some scenarios it is useful to adjust this selection. The environment variable: GLIBC_TUNABLES=glibc.cpu.hwcaps=-xxx,yyy,zzz,.... can be used to enable HWCAP feature yyy, disable HWCAP feature xxx, where the feature name is case-sensitive and has to match the ones used in sysdeps/loongarch/cpu-tunables.c. Signed-off-by: caiyinyu <caiyinyu@loongson.cn>
2024-03-12	LoongArch: Correct {__ieee754, _}_scalb -> {__ieee754, _}_scalbf	caiyinyu	1	-1/+1

2024-02-15	Apply the Makefile sorting fix	H.J. Lu	1	-40/+40
	Apply the Makefile sorting fix generated by sort-makefile-lines.py.
2024-02-05	LoongArch: Use builtins for ffs and ffsll	Xi Ruoyao	1	-0/+2
	On LoongArch GCC compiles __builtin_ffs{,ll} to basically `(x ? __builtin_ctz (x) : -1) + 1`. Since a hardware ctz instruction is available, this is much better than the table-driven generic implementation. Tested on loongarch64. Signed-off-by: Xi Ruoyao <xry111@xry111.site> Reviewed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org>
2024-02-01	Refer to C23 in place of C2X in glibc	Joseph Myers	1	-1/+1
	WG14 decided to use the name C23 as the informal name of the next revision of the C standard (notwithstanding the publication date in 2024). Update references to C2X in glibc to use the C23 name. This is intended to update everything except where it involves renaming files (the changes involving renaming tests are intended to be done separately). In the case of the _ISOC2X_SOURCE feature test macro - the only user-visible interface involved - support for that macro is kept for backwards compatibility, while adding _ISOC23_SOURCE. Tested for x86_64.
2024-01-01	Update copyright dates with scripts/update-copyrights	Paul Eggert	171	-171/+171

2023-11-21	elf: Remove LD_PROFILE for static binaries	Adhemerval Zanella	2	-2/+6
	The _dl_non_dynamic_init does not parse LD_PROFILE, which does not enable profile for dlopen objects. Since dlopen is deprecated for static objects, it is better to remove the support. It also allows to trim down libc.a of profile support. Checked on x86_64-linux-gnu. Reviewed-by: Siddhesh Poyarekar <siddhesh@sourceware.org>
2023-10-26	LoongArch: Delete excessively allocated memory.	caiyinyu	1	-34/+34

2023-10-26	LoongArch: Unify Register Names.	caiyinyu	2	-19/+19

2023-09-21	Revert "LoongArch: Add glibc.cpu.hwcap support."	caiyinyu	6	-171/+4
	This reverts commit a53451559dc9cce765ea5bcbb92c4007e058e92b.
2023-09-19	LoongArch: Add glibc.cpu.hwcap support.	caiyinyu	6	-4/+171
	Key Points: 1. On lasx & lsx platforms, We must use _dl_runtime_{profile, resolve}_{lsx, lasx} to save vector registers. 2. Via "tunables", users can choose str/mem_{lasx,lsx,unaligned} functions with `export GLIBC_TUNABLES=glibc.cpu.hwcaps=LASX,...`. Note: glibc.cpu.hwcaps doesn't affect _dl_runtime_{profile, resolve}_{lsx, lasx} selection. Usage Notes: 1. Only valid inputs: LASX, LSX, UAL. Case-sensitive, comma-separated, no spaces. 2. Example: `export GLIBC_TUNABLES=glibc.cpu.hwcaps=LASX,UAL` turns on LASX & UAL. Unmentioned features turn off. With default ifunc: lasx > lsx > unaligned > aligned > generic, effect is: lasx > unaligned > aligned > generic; lsx off. 3. Incorrect GLIBC_TUNABLES settings will show error messages. For example: On lsx platforms, you cannot enable lasx features. If you do that, you will get error messages. 4. Valid input examples: - GLIBC_TUNABLES=glibc.cpu.hwcaps=LASX: lasx > aligned > generic. - GLIBC_TUNABLES=glibc.cpu.hwcaps=LSX,UAL: lsx > unaligned > aligned > generic. - GLIBC_TUNABLES=glibc.cpu.hwcaps=LASX,UAL,LASX,UAL,LSX,LASX,UAL: Repetitions allowed but not recommended. Results in: lasx > lsx > unaligned > aligned > generic.
2023-09-15	LoongArch: Change to put magic number to .rodata section	dengjianbo	1	-10/+10
	Change to put magic number to .rodata section in memmove-lsx, and use pcalau12i and %pc_lo12 with vld to get the data.
2023-09-15	LoongArch: Add ifunc support for strrchr{aligned, lsx, lasx}	dengjianbo	7	-0/+578
	According to glibc strrchr microbenchmark test results, this implementation could reduce the runtime time as following: Name Percent of rutime reduced strrchr-lasx 10%-50% strrchr-lsx 0%-50% strrchr-aligned 5%-50% Generic strrchr is implemented by function strlen + memrchr, the lasx version will compare with generic strrchr implemented by strlen-lasx + memrchr-lasx, the lsx version will compare with generic strrchr implemented by strlen-lsx + memrchr-lsx, the aligned version will compare with generic strrchr implemented by strlen-aligned + memrchr-generic.
2023-09-15	LoongArch: Add ifunc support for strcpy, stpcpy{aligned, unaligned, lsx, lasx}	dengjianbo	12	-0/+963
	According to glibc strcpy and stpcpy microbenchmark test results(changed to use generic_strcpy and generic_stpcpy instead of strlen + memcpy), comparing with the generic version, this implementation could reduce the runtime as following: Name Percent of rutime reduced strcpy-aligned 8%-45% strcpy-unaligned 8%-48%, comparing with the aligned version, unaligned version takes less instructions to copy the tail of data which length is less than 8. it also has better performance in case src and dest cannot be both aligned with 8bytes strcpy-lsx 20%-80% strcpy-lasx 15%-86% stpcpy-aligned 6%-43% stpcpy-unaligned 8%-48% stpcpy-lsx 10%-80% stpcpy-lasx 10%-87%
2023-09-15	LoongArch: Replace deprecated $v0 with $a0 to eliminate 'as' Warnings.	caiyinyu	1	-1/+1

2023-09-15	LoongArch: Add lasx/lsx support for _dl_runtime_profile.	caiyinyu	7	-179/+331

2023-08-29	LoongArch: Change loongarch to LoongArch in comments	dengjianbo	24	-24/+24

2023-08-29	LoongArch: Add ifunc support for memcmp{aligned, lsx, lasx}	dengjianbo	7	-0/+861
	According to glibc memcmp microbenchmark test results(Add generic memcmp), this implementation have performance improvement except the length is less than 3, details as below: Name Percent of time reduced memcmp-lasx 16%-74% memcmp-lsx 20%-50% memcmp-aligned 5%-20%
2023-08-29	LoongArch: Add ifunc support for memset{aligned, unaligned, lsx, lasx}	dengjianbo	8	-0/+688
	According to glibc memset microbenchmark test results, for LSX and LASX versions, A few cases with length less than 8 experience performace degradation, overall, the LASX version could reduce the runtime about 15% - 75%, LSX version could reduce the runtime about 15%-50%. The unaligned version uses unaligned memmory access to set data which length is less than 64 and make address aligned with 8. For this part, the performace is better than aligned version. Comparing with the generic version, the performance is close when the length is larger than 128. When the length is 8-128, the unaligned version could reduce the runtime about 30%-70%, the aligned version could reduce the runtime about 20%-50%.
2023-08-29	LoongArch: Add ifunc support for memrchr{lsx, lasx}	dengjianbo	7	-0/+335
	According to glibc memrchr microbenchmark, this implementation could reduce the runtime as following: Name Percent of rutime reduced memrchr-lasx 20%-83% memrchr-lsx 20%-64%
2023-08-29	LoongArch: Add ifunc support for memchr{aligned, lsx, lasx}	dengjianbo	7	-0/+401
	According to glibc memchr microbenchmark, this implementation could reduce the runtime as following: Name Percent of runtime reduced memchr-lasx 37%-83% memchr-lsx 30%-66% memchr-aligned 0%-15%
2023-08-29	LoongArch: Add ifunc support for rawmemchr{aligned, lsx, lasx}	dengjianbo	7	-0/+365
	According to glibc rawmemchr microbenchmark, A few cases tested with char '\0' experience performance degradation due to the lasx and lsx versions don't handle the '\0' separately. Overall, rawmemchr-lasx implementation could reduce the runtime about 40%-80%, rawmemchr-lsx implementation could reduce the runtime about 40%-66%, rawmemchr-aligned implementation could reduce the runtime about 20%-40%.
2023-08-29	LoongArch: Remove support code for old linker in start.S	Xi Ruoyao	1	-16/+3
	We are requiring Binutils >= 2.41, so la.pcrel always works here. Signed-off-by: Xi Ruoyao <xry111@xry111.site>
2023-08-29	LoongArch: Simplify the autoconf check for static PIE	Xi Ruoyao	2	-50/+16
	We are strictly requiring GAS >= 2.41 now, so we don't need to check assembler capability anymore. Signed-off-by: Xi Ruoyao <xry111@xry111.site>
2023-08-24	LoongArch: Add ifunc support for strncmp{aligned, lsx}	dengjianbo	6	-0/+508
	Based on the glibc microbenchmark, only a few short inputs with this strncmp-aligned and strncmp-lsx implementation experience performance degradation, overall, strncmp-aligned could reduce the runtime 0%-10% for aligned comparision, 10%-25% for unaligend comparision, strncmp-lsx could reduce the runtime about 0%-60%.
2023-08-24	LoongArch: Add ifunc support for strcmp{aligned, lsx}	dengjianbo	6	-0/+426
	Based on the glibc microbenchmark, strcmp-aligned implementation could reduce the runtime 0%-10% for aligned comparison, 10%-20% for unaligned comparison, strcmp-lsx implemenation could reduce the runtime 0%-50%.
2023-08-24	LoongArch: Add ifunc support for strnlen{aligned, lsx, lasx}	dengjianbo	7	-0/+382
	Based on the glibc microbenchmark, strnlen-aligned implementation could reduce the runtime more than 10%, strnlen-lsx implementation could reduce the runtime about 50%-78%, strnlen-lasx implementation could reduce the runtime about 50%-88%.
2023-08-17	Loongarch: Add ifunc support for memcpy{aligned, unaligned, lsx, lasx} and ↵	dengjianbo	13	-0/+2435
	memmove{aligned, unaligned, lsx, lasx} These implementations improve the time to copy data in the glibc microbenchmark as below: memcpy-lasx reduces the runtime about 8%-76% memcpy-lsx reduces the runtime about 8%-72% memcpy-unaligned reduces the runtime of unaligned data copying up to 40% memcpy-aligned reduece the runtime of unaligned data copying up to 25% memmove-lasx reduces the runtime about 20%-73% memmove-lsx reduces the runtime about 50% memmove-unaligned reduces the runtime of unaligned data moving up to 40% memmove-aligned reduces the runtime of unaligned data moving up to 25%
2023-08-17	Loongarch: Add ifunc support for strchr{aligned, lsx, lasx} and ↵	dengjianbo	12	-0/+581
	strchrnul{aligned, lsx, lasx} These implementations improve the time to run strchr{nul} microbenchmark in glibc as below: strchr-lasx reduces the runtime about 50%-83% strchr-lsx reduces the runtime about 30%-67% strchr-aligned reduces the runtime about 10%-20% strchrnul-lasx reduces the runtime about 50%-83% strchrnul-lsx reduces the runtime about 36%-65% strchrnul-aligned reduces the runtime about 6%-10%
2023-08-14	Loongarch: Add ifunc support and add different versions of strlen	dengjianbo	8	-0/+416
	strlen-lasx is implemeted by LASX simd instructions(256bit) strlen-lsx is implemeted by LSX simd instructions(128bit) strlen-align is implemented by LA basic instructions and never use unaligned memory acess