path: root/libc/startup
Age | Commit message | Author | Files | Lines
2024-01-04 | [libc] major refactor of startup library (#76092) | Schrodinger ZHU Yifan | 12 | -671/+503
* separate initialization routines into _start and do_start for all architectures.
* lift do_start as a separate object library to avoid code duplication.
* (additionally) address the problem of building hermetic libc with -fstack-protector-*

The `crt1.o` is now a merged result of three components:
```
___
 |___ x86_64
 |      |_______ start.cpp.o     <- _start (loads process initial stack and aligns stack pointer)
 |      |_______ tls.cpp.o       <- init_tls, cleanup_tls, set_thread_pointer (TLS related routines)
 |___ do_start.cpp.o             <- do_start (sets up global variables and invokes the main function)
```
2023-12-20 | [libc] suppress stdlib explicitly for crt1.a (#76079) | Schrodinger ZHU Yifan | 1 | -1/+1
[nd: updated oneline]
2023-12-20 | [libc] [startup] add cmake function to merge separated crt1 objects (#75413) | Schrodinger ZHU Yifan | 1 | -11/+60
As part of startup refactoring, this patch adds a function to merge multiple objects into a single relocatable object:

    cc -r obj1.o obj2.o -o obj.o

A relocatable object is an object file that is not fully linked into an executable or a shared library. It is an intermediate file format that can be passed into the linker. A crt object can have arch-specific code and arch-agnostic code. To reduce code duplication, the implementation is split into multiple units. As a result, we need to merge them into a single relocatable object.
2023-12-19 | [libc] move __stack_chk_fail to src/ from startup/ (#75863) | Nick Desaulniers | 1 | -5/+0
__stack_chk_fail should be provided by libc.a, not startup files. Add __stack_chk_fail to existing linux and arm entrypoints. On Windows (when not targeting MinGW), it seems that the corresponding function identifier is __security_check_cookie, so no entrypoint is added for Windows. Baremetal targets also ought to be compilable with `-fstack-protector*`. There is no common header for this prototype, since calls to __stack_chk_fail are meant to be inserted by the compiler upon function return when compiled with `-fstack-protector*`.
2023-12-18 | [libc] expose aux vector (#75806) | Schrodinger ZHU Yifan | 3 | -27/+12
This patch lifts aux vector related definitions to app.h. Because startup's refactoring is in progress, this patch still contains duplicated changes. This problem will be addressed very soon in an incoming patch.
2023-12-12 | [libc] fix issues around stack protector (#74567) | Schrodinger ZHU Yifan | 4 | -18/+25
If a function is declared with stack-protector, the compiler may try to load the TLS. However, inside certain runtime functions, TLS may not be available. This patch disables stack protectors for such routines to fix the problem. Closes #74487.
2023-12-04 | [libc][NFC] unify startup library's code style with the rest (#74041) | Schrodinger ZHU Yifan | 3 | -44/+45
This PR unifies the startup library's code style with the rest of libc.
2023-11-20 | [libc] Remove the optional arguments for NVPTX constructors (#69536) | Joseph Huber | 1 | -3/+5
Summary: We call the global constructors by function pointer. For whatever reason the NVPTX architecture relies very specifically on the arguments to the function pointer invocation matching what the function is implemented as. This is problematic as most of these constructors are generated with no arguments. This patch removes the extended arguments that GNU and LLVM use for the constructors optionally so that it can support the common case.
2023-11-09 | [libc][fix] Call GPU destructors in the correct order | Joseph Huber | 2 | -4/+4
Summary: I was mistakenly iterating the list backwards. Regular semantics puts both arrays in priority order but the destructors are called backwards.
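The ordering described in this fix can be sketched on the host as follows. This is only an illustration under assumptions, not the GPU startup code: `init_array`, `fini_array`, and the helper names are hypothetical stand-ins for the linker-provided tables, and the callbacks take no arguments, matching the common case mentioned for NVPTX.

```cpp
#include <string>

// Hypothetical stand-in for the callback type stored in the tables.
using Callback = void (*)();

static std::string trace; // records the order in which callbacks ran

static void ctor_a() { trace += "A"; }
static void ctor_b() { trace += "B"; }
static void dtor_a() { trace += "a"; }
static void dtor_b() { trace += "b"; }

// Assumption: both tables are stored in ascending priority order.
static Callback init_array[] = {ctor_a, ctor_b};
static Callback fini_array[] = {dtor_a, dtor_b};

// Constructors run front to back.
void call_init_array(Callback *begin, Callback *end) {
  for (Callback *cb = begin; cb != end; ++cb)
    (*cb)();
}

// Destructors run back to front, mirroring the construction order.
void call_fini_array(Callback *begin, Callback *end) {
  for (Callback *cb = end; cb != begin;)
    (*--cb)();
}

// Runs both phases and returns the recorded order.
std::string run_startup_and_shutdown() {
  trace.clear();
  call_init_array(init_array, init_array + 2);
  call_fini_array(fini_array, fini_array + 2);
  return trace;
}
```

Calling `run_startup_and_shutdown()` records the constructors in table order and the destructors in the opposite order.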
2023-10-19 | [libc] Fix accidental LIBC_NAMESPACE_clock_freq (#69620) | alfredfo | 1 | -1/+1
See-also: https://github.com/llvm/llvm-project/pull/69548
2023-10-04 | [libc] Add x86-64 stack protector support. | tnv01 | 2 | -1/+21
2023-09-26 | [libc] Start to refactor riscv platform abstraction to support both 32 and 64 bits versions | Mikhail R. Gadelha | 2 | -0/+0
This patch enables the compilation of libc for rv32 by unifying the current rv64 and rv32 implementation into a single rv implementation. We updated the cmake file to match the new riscv32 arch and force LIBC_TARGET_ARCHITECTURE to be "riscv" whenever we find "riscv32" or "riscv64". This is required as LIBC_TARGET_ARCHITECTURE is used in the path for several platform specific implementations. Reviewed By: michaelrj Differential Revision: https://reviews.llvm.org/D148797
2023-09-26 | [libc] Mass replace enclosing namespace (#67032) | Guillaume Chatelet | 5 | -86/+86
This is step 4 of https://discourse.llvm.org/t/rfc-customizable-namespace-to-allow-testing-the-libc-when-the-system-libc-is-also-llvms-libc/73079
2023-09-21 | [libc] Remove the 'rpc_reset' routine from the RPC implementation (#66700) | Joseph Huber | 2 | -12/+2
Summary: This patch removes the `rpc_reset` function. This was previously used to initialize the RPC client on the device by setting up the pointers to communicate with the server. The purpose of this was to make it easier to initialize the device for testing. However, this prevented us from enforcing an invariant that the buffers are all read-only from the client side. The expected way to initialize the server is now to copy it from the host runtime. This will allow us to maintain that the RPC client is in the constant address space on the GPU, potentially through inference, and improving caching behaviour.
2023-09-14 | [libc] Fix start up crash on 32 bit systems (#66210) | Mikhail R. Gadelha | 1 | -7/+17
This patch changes the default types of argc/argv so they are no longer uint64_t on all systems; instead, they are now uintptr_t, which fixes crashes on 32-bit systems that expect 32-bit types. This patch also adds two uintptr_t types (EnvironType and AuxEntryType) for the same reason. The patch also adds a PgrHdrTableType type behind an ifdef that's Elf64_Phdr in 64-bit systems and Elf32_Phdr in 32-bit systems.
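To illustrate why pointer-sized words matter here, the following sketch walks a fake initial stack using `uintptr_t`. The layout (argc, the argv entries, a null word, the envp entries, a null word) is the conventional one, but `StartupView`, `parse_initial_stack`, and `count_env` are hypothetical names and not the types used in libc.

```cpp
#include <cstdint>

// Hypothetical view of the initial process stack as pointer-sized words:
//   [argc][argv[0]] ... [argv[argc-1]][0][envp[0]] ... [0]
struct StartupView {
  uintptr_t argc;
  const uintptr_t *argv; // argc entries followed by a null word
  const uintptr_t *envp; // null-terminated
};

// Interprets the words at `sp` using the native pointer width, so the same
// code reads 32-bit words on 32-bit targets and 64-bit words on 64-bit ones.
StartupView parse_initial_stack(const uintptr_t *sp) {
  StartupView view;
  view.argc = sp[0];
  view.argv = sp + 1;
  // Skip the argv entries and the null word that terminates them.
  view.envp = view.argv + view.argc + 1;
  return view;
}

// Counts the environment entries up to the terminating null word.
uintptr_t count_env(const StartupView &view) {
  uintptr_t n = 0;
  while (view.envp[n] != 0)
    ++n;
  return n;
}
```

Reading the same memory with a fixed `uint64_t` stride on a 32-bit target would fold two adjacent words into each value, which is the kind of mismatch the patch describes.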
2023-09-11 | [libc] Manually set the AMDGPU code object version (#65986) | Joseph Huber | 1 | -0/+2
Summary: There is currently effort to change over the default AMDGPU code object version https://github.com/llvm/llvm-project/pull/65410. However, this unfortunately causes problems in the LLVM LibC test suite that leads to a hang while executing. This is most likely a bug to do with indirect call optimization, as it can be avoided without optimizations or with manually preventing inlining in the AMDGPU startup code. This patch sets the AMDGPU code object version to be four explicitly on the LibC test suite. This should unblock the efforts to move the default to 5 without breaking the test suite. This isn't a great solution, but there is currently some time pressure to get COV5 landed and this seems to be the easiest solution.
2023-08-30 | [libc] Fix set_thread_ptr call in rv32 start up code | Mikhail R. Gadelha | 1 | -1/+1
This patch changes the instruction in set_thread_ptr from ld to mv, as rv32 doesn't have the ld instruction, and mv is supported by both rv32 and rv64. Reviewed By: sivachandra Differential Revision: https://reviews.llvm.org/D159110
2023-08-07 | [libc][cleanup] Fix most conversion warnings | Michael Jones | 3 | -25/+31
This patch is large, but is almost entirely just adding casts to calls to syscall_impl. Much of the work was done programmatically, with human checking when the syntax or types got confusing. Reviewed By: mcgrathr Differential Revision: https://reviews.llvm.org/D156950
2023-07-21 | [libc] Treat the locks array as a bitfield | Joseph Huber | 2 | -2/+2
Currently we keep an internal buffer of device memory that is used to indicate ownership of a port. Since we only use this as a single bit we can simply turn this into a bitfield. I did this manually rather than having a separate type as we need very special handling of the masks used to interact with the locks. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D155511
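A bit-per-port lock of the kind described can be sketched with atomic read-modify-write operations on packed words. This is a host-side illustration under assumptions: `try_lock`, `unlock`, and the 32-bit word size are hypothetical choices and not the RPC implementation's actual interface.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch: one lock bit per port, packed into 32-bit words.
constexpr uint32_t BITS_PER_WORD = 32;

// Returns true if the caller acquired the lock for `port`.
bool try_lock(std::atomic<uint32_t> *locks, uint32_t port) {
  const uint32_t mask = uint32_t(1) << (port % BITS_PER_WORD);
  // fetch_or returns the previous word; the lock was free if our bit was 0.
  const uint32_t old =
      locks[port / BITS_PER_WORD].fetch_or(mask, std::memory_order_acquire);
  return (old & mask) == 0;
}

// Clears only the bit belonging to `port`, leaving its neighbors untouched.
void unlock(std::atomic<uint32_t> *locks, uint32_t port) {
  const uint32_t mask = uint32_t(1) << (port % BITS_PER_WORD);
  locks[port / BITS_PER_WORD].fetch_and(~mask, std::memory_order_release);
}
```

The masks are the delicate part: every operation has to touch exactly one bit so that ports sharing a word do not disturb each other.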
2023-07-19 | Revert "[libc] Treat the locks array as a bitfield" | Joseph Huber | 2 | -2/+2
Summary: This caused test failures on the gfx90a buildbot. This works on my gfx1030 and the Nvidia buildbots, so we'll need to investigate what is going wrong here. For now revert it to get the bots green. This reverts commit 05abcc579244b68162b847a6780d27b22bd58f74.
2023-07-19 | [libc][NFC] Rename files | Guillaume Chatelet | 6 | -6/+6
This patch mostly renames files so they better reflect the functions they declare. Reviewed By: michaelrj Differential Revision: https://reviews.llvm.org/D155607
2023-07-18 | [libc] Treat the locks array as a bitfield | Joseph Huber | 2 | -2/+2
Currently we keep an internal buffer of device memory that is used to indicate ownership of a port. Since we only use this as a single bit we can simply turn this into a bitfield. I did this manually rather than having a separate type as we need very special handling of the masks used to interact with the locks. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D155511
2023-07-05 | [libc] Support timing information in libc tests | Joseph Huber | 1 | -0/+6
This patch adds the necessary support to provide timing information in `libc` tests. This is useful for determining which tests take what amount of time. We also can use this as a test basis for providing more fine-grained timing when implementing things on the GPU. The main difficulty with this is the fact that the AMDGPU fixed frequency clock operates at an unknown frequency. We need to read this on a per-card basis from the driver and then copy it in. NVPTX on the other hand has a fixed clock at a resolution of 1ns. I have also increased the resolution of the print-outs as the majority of these are below a millisecond for me. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D154446
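Once the clock frequency is known, converting a tick delta into nanoseconds is simple arithmetic. The helper below is a hypothetical sketch (`ticks_to_ns` is not a libc function), and any frequency passed to it is an arbitrary example value, since the real AMDGPU frequency has to be read from the driver.

```cpp
#include <cstdint>

// Hypothetical helper: convert a tick delta from a fixed-frequency counter
// into nanoseconds, given the counter frequency in Hz.
uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq_hz) {
  constexpr uint64_t NS_PER_SEC = 1000000000ull;
  // Handle whole seconds and the remainder separately so that long
  // intervals do not overflow the intermediate product.
  const uint64_t secs = ticks / freq_hz;
  const uint64_t rem = ticks % freq_hz;
  return secs * NS_PER_SEC + rem * NS_PER_SEC / freq_hz;
}
```

With a 1 GHz counter, which corresponds to the 1ns resolution mentioned for NVPTX, the function returns its input unchanged.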
2023-07-05 | [libc] Initialize the global pointer in riscv startup code. | Siva Chandra | 1 | -0/+4
Reviewed By: mikhail.ramalho Differential Revision: https://reviews.llvm.org/D151539
2023-06-23 | [libc][NFC] Simplify return value logic in set_thread_ptr() | Jun Zhang | 1 | -3/+1
Signed-off-by: Jun Zhang <jun@junz.org> Differential Revision: https://reviews.llvm.org/D153572
2023-06-20 | [libc] Remove disabled pass after performance improvement | Joseph Huber | 1 | -3/+0
This pass used to cause huge compile time regressions. That has been addressed, so the pass can now be re-added. Differential Revision: https://reviews.llvm.org/D153374
2023-06-20 | [libc] Remove flexible array and replace with a template | Joseph Huber | 2 | -2/+0
Currently the implementation of the RPC interface requires a flexible struct. This caused problems when compiling the RPC server with GCC as would be required if trying to export the RPC server interface. This required that we either move to the `x[1]` workaround or make it a template parameter. While just using `x[1]` would be much less noisy, this is technically undefined behavior. For this reason I elected to use templates. The downside to using templates is that the server code must now be able to handle multiple different types at runtime. I was unable to find a good solution that didn't rely on type erasure so I simply branch off of the given value. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D153304
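The trade-off described here, a compile-time size parameter plus a runtime branch to pick the instantiation, can be sketched as follows. `Packet`, `packet_bytes`, and the supported sizes 1, 32, and 64 are assumptions made for illustration and are not the RPC types.

```cpp
#include <cstdint>

// Hypothetical packet whose payload size is a template parameter instead of
// a flexible array member.
template <uint32_t LaneSize> struct Packet {
  uint64_t slots[LaneSize];
};

// Size in bytes of one packet for a lane size known at compile time.
template <uint32_t LaneSize> uint64_t packet_bytes() {
  return sizeof(Packet<LaneSize>);
}

// The lane size is only known at runtime on the server side, so dispatch to
// the matching instantiation with a plain branch.
uint64_t packet_bytes_for(uint32_t lane_size) {
  switch (lane_size) {
  case 1:
    return packet_bytes<1>();
  case 32:
    return packet_bytes<32>();
  case 64:
    return packet_bytes<64>();
  default:
    return 0; // unsupported size in this sketch
  }
}
```

Every supported size needs its own case, which is the noise the commit message accepts in exchange for avoiding the `x[1]` workaround.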
2023-06-19 | [libc] Disable atomic optimizations for `libc` AMDGPU builds | Joseph Huber | 1 | -0/+3
Recently the AMDGPU backend automatically enables a pass to optimize atomics. This results in the LTO build taking about 10x longer in all cases. For now we disable this by default as was the case before the patch in D152649. Reviewed By: lntue Differential Revision: https://reviews.llvm.org/D153232
2023-05-23 | [libc][AMDGPU] Disable the AMDGPU backend's ctor/dtor lowering for libc | Joseph Huber | 1 | -0/+1
The AMDGPU backend has a built-in pass to lower constructors. We do this manually in the `start.cpp` implementation so we can disable this to keep the binaries smaller. Differential Revision: https://reviews.llvm.org/D151213
2023-05-11 | [libc][obvious] Fix undefined variable after name change | Joseph Huber | 2 | -2/+2
I forgot that we still used these variables in the loaders. Differential Revision: https://reviews.llvm.org/D150362
2023-05-11 | [libc][rpc] Allocate a single block of shared memory instead of three | Jon Chesterfield | 2 | -6/+6
Allows moving the pointer swap between server and client into reset. Single allocation simplifies whatever allocates the client/server, currently the libc loaders. Reviewed By: jhuber6 Differential Revision: https://reviews.llvm.org/D150337
2023-05-11 | [libc][rpc] Allocate locks array within process | Jon Chesterfield | 2 | -8/+4
Replaces the globals currently used. Worth changing to a bitmap before allowing runtime number of ports >> 64. One bit per port is likely to be cheap enough that sizing for the worst case is always fine, otherwise in the future we can change to dynamically allocating it. Reviewed By: jhuber6 Differential Revision: https://reviews.llvm.org/D150309
2023-05-05 | [libc] Support concurrent RPC port access on the GPU | Joseph Huber | 2 | -4/+6
Previously we used a single port to implement the RPC. This was sufficient for single threaded tests but can potentially cause deadlocks when using multiple threads. The reason for this is that GPUs make no forward progress guarantees. Therefore one group of threads waiting on another group of threads can spin forever because there is no guarantee that the other threads will continue executing. The typical workaround for this is to allocate enough memory that a sufficiently large number of work groups can make progress. As long as this number is somewhat close to the amount of total concurrency we can obtain reliable execution around a shared resource. This patch enables using multiple ports by widening the arrays to a predetermined size and indexing into them. Empty ports are currently obtained via a trivial linear scan. This should be improved in the future for performance reasons. Portions of D148191 were applied to achieve parallel support. Depends on D149581 Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D149598
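Obtaining a free port by scanning a fixed-size array front to back can be sketched like this. The names `open_port` and `close_port`, the constant `NUM_PORTS`, and the use of one atomic word per port are all assumptions for the sake of a small host-side example.

```cpp
#include <atomic>
#include <cstdint>

constexpr uint32_t NUM_PORTS = 8;          // hypothetical array size
constexpr uint32_t NO_PORT = ~uint32_t(0); // returned when every port is busy

// Scans the ports in order and returns the first one whose lock could be
// taken, or NO_PORT if all of them are currently in use.
uint32_t open_port(std::atomic<uint32_t> (&locks)[NUM_PORTS]) {
  for (uint32_t i = 0; i < NUM_PORTS; ++i) {
    uint32_t expected = 0;
    if (locks[i].compare_exchange_strong(expected, 1,
                                         std::memory_order_acquire))
      return i;
  }
  return NO_PORT;
}

// Releases a port so that a later scan can hand it out again.
void close_port(std::atomic<uint32_t> (&locks)[NUM_PORTS], uint32_t port) {
  locks[port].store(0, std::memory_order_release);
}
```

A scan like this always starts at index 0, so contention concentrates on the low-numbered ports, which is one reason such a scheme may need improving for performance.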
2023-05-04 | [libc] Change GPU startup and loader to use multiple kernels | Joseph Huber | 2 | -107/+47
The GPU has a different execution model to standard `_start` implementations. On the GPU, all threads are active at the start of a kernel. In order to correctly initialize and call the constructors we want single threaded semantics. Previously, this was done using a makeshift global barrier with atomics. However, it should be easier to simply put the portions of the code that must be single threaded in separate kernels and then call those with only one thread. Generally, mixing global state between kernel launches makes optimizations more difficult, similarly to calling a function outside of the TU, but for testing it is better to be correct. Depends on D149527 D148943 Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D149581
2023-05-04 | [libc] Enable multiple threads to use RPC on the GPU | Joseph Huber | 2 | -2/+2
The execution model of the GPU expects that groups of threads will execute in lock-step in SIMD fashion. It's both important for performance and correctness that we treat this as the smallest possible granularity for an RPC operation. Thus, we map multiple threads to a single larger buffer and ship that across the wire. This patch makes the necessary changes to support executing the RPC on the GPU with multiple threads. This requires some workarounds to mimic the model when handling the protocol from the CPU. I'm not completely happy with some of the workarounds required, but I think it should work. Uses some of the implementation details from D148191. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D148943
2023-05-04 | [libc] Support global constructors and destructors on NVPTX | Joseph Huber | 2 | -8/+72
This patch adds the necessary hacks to support global constructors and destructors. This is an incredibly hacky process caused by the primary fact that Nvidia does not provide any binary tools and very little linker support. We first had to emit references to these functions and their priority in D149451. Then we dig them out of the module once it's loaded to manually create the list that the linker should have made for us. This patch also contains a few Nvidia specific hacks, but it passes the test, albeit with a stack size warning from `ptxas` for the callback. But this should be fine given the resource usage of a common test. This also adds a dependency on LLVM to the NVPTX loader, which hopefully doesn't cause problems with our CUDA buildbot. Depends on D149451 Reviewed By: tra Differential Revision: https://reviews.llvm.org/D149527
2023-04-29 | [libc] Add support for global ctors / dtors for AMDGPU | Joseph Huber | 2 | -8/+65
This patch makes the necessary changes to support calling global constructors and destructors on the GPU. The patch in D149340 allows the `lld` linker to create the symbols pointing us to these globals. These should be executed by a single thread, which is more difficult on the GPU because all threads are active. I chose to use an atomic counter to sync every thread on the GPU. This is very slow if you use more than a few thousand threads, but for testing purposes it should be sufficient. Depends on D149340 D149363 Reviewed By: sivachandra Differential Revision: https://reviews.llvm.org/D149398
2023-04-24 | [libc] Add more utility functions for the GPU | Joseph Huber | 4 | -7/+53
This patch adds extra intrinsics for the GPU. Some of these are unused for now but will be used later. We use these currently to update the `RPC` handling. Currently, every thread can update the RPC client, which isn't correct. This patch adds code necessary to allow a single thread to perform the write while the others wait. Feedback is welcome for the naming of these functions. I'm copying the OpenMP nomenclature where we call an AMD `wavefront` or NVIDIA `warp` a `lane`. Reviewed By: tra Differential Revision: https://reviews.llvm.org/D148810
2023-04-19 | [libc] Update RPC interface for system utilities on the GPU | Joseph Huber | 2 | -2/+6
This patch reworks the RPC interface to allow more generic memory operations using the shared buffer. This patch decomposes the entire RPC interface into opening a port and calling `send` or `recv` on it. The `send` function sends a single packet of the length of the buffer. The `recv` function is paired with the `send` call to then use the data. So, any arbitrary combination of sending packets is possible. The only restriction is that the client initiates the exchange with a `send` while the server consumes it with a `recv`. The operation of this is driven by two independent state machines that track the buffer ownership during loads / stores. We keep track of two so that we can transition between a send state and a recv state without an extra wait. State transitions are observed via bit toggling. This interface supports an efficient `send -> ack -> send -> ack -> send` interface and allows for the last send to be ignored without checking the ack. A following patch will add some more comprehensive testing to this interface. I informally made an RPC call that simply incremented an integer and it took roughly 10 microseconds to complete an RPC call. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D148288
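Ownership tracking by bit toggling can be modelled in a few lines. The sketch below is a single-threaded approximation under stated assumptions: it models only one of the two state machines, the ownership rule (the client owns the buffer when both bits agree) is an assumption rather than something taken from the patch, and `Channel`, `client_send`, and `server_recv_and_ack` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical single-threaded model of a toggled-bit mailbox pair.
struct Channel {
  uint32_t client_out = 0; // written by the client, read by the server
  uint32_t server_out = 0; // written by the server, read by the client
  uint32_t payload = 0;    // stands in for the shared buffer
};

// Assumed ownership rule: the client owns the buffer when both bits agree.
bool client_owns(const Channel &c) { return c.client_out == c.server_out; }
bool server_owns(const Channel &c) { return !client_owns(c); }

// Writes the payload and toggles the client's bit to hand the buffer over.
bool client_send(Channel &c, uint32_t value) {
  if (!client_owns(c))
    return false; // previous send has not been acknowledged yet
  c.payload = value;
  c.client_out ^= 1;
  return true;
}

// Reads the payload and toggles the server's bit, which acts as the ack.
bool server_recv_and_ack(Channel &c, uint32_t &value) {
  if (!server_owns(c))
    return false; // nothing has been sent
  value = c.payload;
  c.server_out ^= 1;
  return true;
}
```

Because each side only ever writes its own bit, neither side needs to reset shared state between exchanges. A real implementation would use atomic loads and stores with appropriate ordering on memory visible to both processors.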
2023-04-17 | [libc] Add special handling for CUDA PTX features | Joseph Huber | 1 | -2/+2
The NVIDIA compilation path requires some special options. This is mostly because compilation is dependent on having a valid CUDA toolchain. We don't actually need the CUDA toolchain to create the exported `libcgpu.a` library because it's pure LLVM-IR. However, for some language features we need the PTX version to be set. This is normally set by checking the CUDA version, but without one installed it will fail to build. We instead choose a minimum set of features on the desired target, inferred from https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#release-notes and the PTX reference for functions like `nanosleep`. Reviewed By: tianshilei1992 Differential Revision: https://reviews.llvm.org/D148532
2023-04-17 | [libc][NFC] Standardize missing syscalls error messages. | Mikhail R. Gadelha | 3 | -3/+3
This patch standardizes the error messages when a syscall is not available to be in the format: "ABC and DEF syscalls are not available." Reviewed By: sivachandra Differential Revision: https://reviews.llvm.org/D148373
2023-04-05 | [libc] Search for the CUDA path explicitly when testing | Joseph Huber | 1 | -0/+2
The packaged version of the `libc` library does not depend on the CUDA installation because it only uses `clang` and emits LLVM-IR. However, for testing we directly need the CUDA toolkit to emit and execute the files. This patch explicitly passes `--cuda-path` to the relevant compilations for NVPTX testing. Reviewed By: tra Differential Revision: https://reviews.llvm.org/D147653
2023-03-24 | [libc] Implement the RPC client / server for NVPTX | Joseph Huber | 2 | -1/+7
This patch adds the necessary code to implement the existing RPC client / server interface when targeting NVPTX GPUs. This follows closely to the implementation in the AMDGPU version. This does not yet enable unit testing as the `nvlink` linker does not support static libraries. So that will need to be worked around. I am ignoring the RPC duplication between the AMDGPU and NVPTX loaders. This will be changed completely later so there's no point unifying the code at this stage. The implementation was tested manually with the following file and compilation flags.
```
namespace __llvm_libc {
void write_to_stderr(const char *msg);
void quick_exit(int);
} // namespace __llvm_libc

using namespace __llvm_libc;

int main(int argc, char **argv, char **envp) {
  for (int i = 0; i < argc; ++i) {
    write_to_stderr(argv[i]);
    write_to_stderr("\n");
  }
  quick_exit(255);
}
```
```
$ clang++ crt1.o rpc_client.o quick_exit.o io.o main.cpp --target=nvptx64-nvidia-cuda -march=sm_70 -o image
$ ./nvptx_loader image 1 2 3
image
1
2
3
$ echo $?
255
```
Depends on D146681
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D146846
2023-03-24 | [libc] Use `nvptx_kernel` attribute in NVPTX startup code | Joseph Huber | 2 | -8/+4
Summary: A recent patch allowed us to emit a callable kernel from freestanding NVPTX code. This allows us to move away from using the CUDA language. This has several advantages in that it works around an entire assortment of errors I was seeing while implementing RPC for Nvidia.
2023-03-22 | [libc] Adjust NVPTX startup code | Joseph Huber | 2 | -4/+10
Summary: The startup code needs to include the environment pointer so we add this to the arguments. Also we need to ensure that the `crt1.o` file is made with `-fgpu-rdc` set so we can actually use it without undefined reference errors.
2023-03-21 | [libc] Don't install the GPU startup code for now | Joseph Huber | 1 | -4/+0
Summary: This startup code is only intended to be used internally, we shouldn't export it under a conflicting name. In the future we may package this in an exportable format.
2023-03-20 | [libc] Add environment variables to GPU libc test for AMDGPU | Joseph Huber | 1 | -3/+4
This patch copies over the `envp` array using the same operation that is used to copy over the `argv` array. This allows the GPU tests to use environment variables. Reviewed By: sivachandra Differential Revision: https://reviews.llvm.org/D146322
2023-03-17 | [libc] Enable integration tests targeting the GPU | Joseph Huber | 1 | -2/+7
This patch enables integration tests running on the GPU. This uses the RPC interface implemented in D145913 to compile the necessary dependencies for the integration test object. We can then use this to compile the objects for the GPU directly and execute them using the AMD HSA loader combined with its RPC server. For example, the compiler is performing the following actions to execute the integration tests.
```
$ clang++ --target=amdgcn-amd-amdhsa -mcpu=gfx1030 -nostdlib -flto -ffreestanding \
    crt1.o io.o quick_exit.o test.o rpc_client.o args_test.o -o image
$ ./amdhsa_loader image 1 2 5
args_test.cpp:24: Expected 'my_streq(argv[3], "3")' to be true, but is false
```
This currently only works with a single threaded client implementation running on AMDGPU. Further work will implement multiple clients for AMD and the ability to run on NVPTX as well.
Depends on D145913
Reviewed By: sivachandra, JonChesterfield
Differential Revision: https://reviews.llvm.org/D146256
2023-03-17 | [libc] Add initial support for an RPC mechanism for the GPU | Joseph Huber | 2 | -2/+7
This patch adds initial support for an RPC client / server architecture. The GPU is unable to perform several system utilities on its own, so in order to implement features like printing or memory allocation we need to be able to communicate with the executing process. This is done via a buffer of "sharable" memory. That is, a buffer with a unified pointer that both the client and server can use to communicate. The implementation here is based off of Jon Chesterfield's minimal RPC example in his work. We use an `inbox` and `outbox` to communicate whether there is an RPC request and to signify when work is done. We use a fixed-size buffer for the communication channel. This is fixed size so that we can ensure that there is enough space for all compute-units on the GPU to issue work to any of the ports. Right now the implementation is single threaded so there is only a single buffer that is not shared. This implementation is still missing several features, such as multi-threaded support and asynchronous calls. Depends on D145912 Reviewed By: sivachandra Differential Revision: https://reviews.llvm.org/D145913
2023-03-16 | [libc] Add missing dependencies to RISC-V startup implementation | Joseph Huber | 1 | -0/+2
Summary: Just like the last patch, the threads and environ dependencies were missing. This led to linker failures when building the tests.