Age | Commit message (Collapse) | Author | Files | Lines |
|
Fixed a typo:
- auto Section = (Prefix + "llvm_offload_entries").str();
+ auto Section = (Prefix + "_offload_entries").str();
which broke buildbot e.g.
https://lab.llvm.org/buildbot/#/builders/208/builds/1948
|
|
(#132869) (#143964)"
This reverts commit 22f9b4aa1dad597d908be77be1e10ba4c77330ce.
|
|
Fixed two issues:
1. assertion with -flto. the linker wrapper action is missing for
wrapping the device binary. Added it for -flto.
2. when there are two HIP files, the kernels in the second file were not
found. This is because the -r option of linker wrapper assumes offload
entries section of HIP to be hip_offloading_entries but it is actually
llvm_offload_entries, causing the offload entries sections not made
unique for different object files. Fixed and tested working for both
-fgpu-rdc and -fno-gpu-rdc case with and without -r
|
|
(#143432)
This breaks a lot of new driver HIP compilation. We should probably
revert this for now until we can make a fixed version.
```c++
static __global__ void print() { printf("%s\n", "foo"); }
void b();
int main() {
hipLaunchKernelGGL(print, dim3(1), dim3(1), 0, 0);
auto y = hipDeviceSynchronize();
b();
}
```
```c++
static __global__ void print() { printf("%s\n", "bar"); }
void b() {
hipLaunchKernelGGL(print, dim3(1), dim3(1), 0, 0);
auto y = hipDeviceSynchronize();
}
```
```console
$ clang++ a.hip b.hip --offload-arch=gfx1030 --offload-new-driver
$ ./a.out
foo
foo
```
```console
$ clang++ a.hip b.hip --offload-arch=gfx1030 --offload-new-driver -flto
<crash>
```
This reverts commit d54c28b9c1396fa92d9347ac1135da7907121cb8.
|
|
|
|
Currently HIP still uses offload bundler for non-rdc mode for the new
offload driver.
This patch switches to use offload wrapper for non-device-only non-rdc
mode when new offload driver is enabled.
This makes the rdc and non-rdc compilation more consistent and speeds up
compilation since the offload wrapper supports parallel compilation for
different GPU arch's.
It is implemented by adding a linker wrapper action for each assemble
action of input file. Linker wrapper action differentiates this special
type of work vs normal linker wrapper work by the fle type. This type of
work results in object instead of image. The linker wrapper adds "-r"
for it and only includes the object file as input, not the host
libraries.
For device-only non-RDC mode, the new driver keeps the original
behavior.
|
|
Summary:
This patch unifies the existing offloading entires into a single section
called `llvm_offload_entires`. This lets us use a more unified
offloading infrastructure so that all targets share the same handling.
The effect is that people in the runtimes now need to check if the kind
is what they expect, but the expectation is that you can combine
multiple potential providers into a compile job. Doesn't fully work
yet because of other runtime issues, but some day. Mostly this helps the
future of liboffload where we want to handle different languages than
OpenMP.
|
|
Summary:
The previous offloading entry type did not fit the current use-cases
very well. This widens it and adds a version to prevent further
annoyances. It also includes the kind to better sort who's using it.
The first 64-bytes are reserved as zero so the OpenMP runtime can detect
the old format for binary compatibilitry.
|
|
Summary:
Previously, managed variables didn't work in rdc mode using the new
driver because we just didn't register them. This was previously ignored
because we didn't have enough space in the current struct format. This
patch amends that by just emitting a struct pair for the two variables
and using the single pointer.
In the future, a more extensible entry format would be nice, but that
can be done later.
|
|
Remove '_' in fatbin and gpubin symbol suffixes when missing TU hash ID.
Internalize gpubin symbol so that it is not unresolved at link-time when
symbol is not relocatable.
|
|
Through the new `-foffload-via-llvm` flag, CUDA kernels can now be
lowered to the LLVM/Offload API. On the Clang side, this is simply done
by using the OpenMP offload toolchain and emitting calls to `llvm*`
functions to orchestrate the kernel launch rather than `cuda*`
functions. These `llvm*` functions are implemented on top of the
existing LLVM/Offload API.
As we are about to redefine the Offload API, this wil help us in the
design process as a second offload language.
We do not support any CUDA APIs yet, however, we could:
https://www.osti.gov/servlets/purl/1892137
For proper host execution we need to resurrect/rebase
https://tianshilei.me/wp-content/uploads/2021/12/llpp-2021.pdf
(which was designed for debugging).
```
❯❯❯ cat test.cu
extern "C" {
void *llvm_omp_target_alloc_shared(size_t Size, int DeviceNum);
void llvm_omp_target_free_shared(void *DevicePtr, int DeviceNum);
}
__global__ void square(int *A) { *A = 42; }
int main(int argc, char **argv) {
int DevNo = 0;
int *Ptr = reinterpret_cast<int *>(llvm_omp_target_alloc_shared(4, DevNo));
*Ptr = 7;
printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
square<<<1, 1>>>(Ptr);
printf("Ptr %p, *Ptr %i\n", Ptr, *Ptr);
llvm_omp_target_free_shared(Ptr, DevNo);
}
❯❯❯ clang++ test.cu -O3 -o test123 -foffload-via-llvm --offload-arch=native
❯❯❯ llvm-objdump --offloading test123
test123: file format elf64-x86-64
OFFLOADING IMAGE [0]:
kind elf
arch gfx90a
triple amdgcn-amd-amdhsa
producer openmp
❯❯❯ LIBOMPTARGET_INFO=16 ./test123
Ptr 0x155448ac8000, *Ptr 7
Ptr 0x155448ac8000, *Ptr 42
```
|
|
As part of the LLVM effort to eliminate debug-info intrinsics, we're
moving to a world where only iterators should be used to insert
instructions. This isn't a problem in clang when instructions get
generated before any debug-info is inserted, however we're planning on
deprecating and removing the instruction-pointer insertion routines.
Scatter some calls to getIterator in a few places, remove a
deref-then-addrof on another iterator, and add an overload for the
createLoadInstBefore utility. Some callers passes a null insertion
point, which we need to handle explicitly now.
|
|
Remove zero-index constant expression GEPs, which are not needed
with opaque pointers and will get folded away.
|
|
MSVC linker merges functions having comdat which have identical set of
instructions. CUDA uses kernel stub function as key to look up kernels
in device executables. If kernel stub function for different kernels are
merged by ICF, incorrect kernels will be launched.
To prevent ICF from merging kernel stub functions, an unique global
variable is created for each kernel stub function having comdat and a
store is added to the kernel stub function. This makes the set of
instructions in each kernel function unique.
Fixes: https://github.com/llvm/llvm-project/issues/88883
|
|
The IdentifierInfo isn't typically modified. Use 'const' wherever
possible.
|
|
to authenticate signed pointers (#86923)
To authenticate pointers, CodeGen needs access to the key and
discriminators that were used to sign the pointer. That information is
sometimes known from the context, but not always, which is why `Address`
needs to hold that information.
This patch adds methods and data members to `Address`, which will be
needed in subsequent patches to authenticate signed pointers, and uses
the newly added methods throughout CodeGen. Although this patch isn't
strictly NFC as it causes CodeGen to use different code paths in some
cases (e.g., `mergeAddressesInConditionalExpr`), it doesn't cause any
changes in functionality as it doesn't add any information needed for
authentication.
In addition to the changes mentioned above, this patch introduces class
`RawAddress`, which contains a pointer that we know is unsigned, and
adds several new functions for creating `Address` and `LValue` objects.
This reapplies d9a685a9dd589486e882b722e513ee7b8c84870c, which was
reverted because it broke ubsan bots. There seems to be a bug in
coroutine code-gen, which is causing EmitTypeCheck to use the wrong
alignment. For now, pass alignment zero to EmitTypeCheck so that it can
compute the correct alignment based on the passed type (see function
EmitCXXMemberOrOperatorMemberCallExpr).
|
|
needed to authenticate signed pointers (#86721)" (#86898)
This reverts commit d9a685a9dd589486e882b722e513ee7b8c84870c.
The commit broke ubsan bots.
|
|
to authenticate signed pointers (#86721)
To authenticate pointers, CodeGen needs access to the key and
discriminators that were used to sign the pointer. That information is
sometimes known from the context, but not always, which is why `Address`
needs to hold that information.
This patch adds methods and data members to `Address`, which will be
needed in subsequent patches to authenticate signed pointers, and uses
the newly added methods throughout CodeGen. Although this patch isn't
strictly NFC as it causes CodeGen to use different code paths in some
cases (e.g., `mergeAddressesInConditionalExpr`), it doesn't cause any
changes in functionality as it doesn't add any information needed for
authentication.
In addition to the changes mentioned above, this patch introduces class
`RawAddress`, which contains a pointer that we know is unsigned, and
adds several new functions for creating `Address` and `LValue` objects.
This reapplies 8bd1f9116aab879183f34707e6d21c7051d083b6. The commit
broke msan bots because LValue::IsKnownNonNull was uninitialized.
|
|
needed to authenticate signed pointers (#67454)" (#86674)
This reverts commit 8bd1f9116aab879183f34707e6d21c7051d083b6.
It appears that the commit broke msan bots.
|
|
to authenticate signed pointers (#67454)
To authenticate pointers, CodeGen needs access to the key and
discriminators that were used to sign the pointer. That information is
sometimes known from the context, but not always, which is why `Address`
needs to hold that information.
This patch adds methods and data members to `Address`, which will be
needed in subsequent patches to authenticate signed pointers, and uses
the newly added methods throughout CodeGen. Although this patch isn't
strictly NFC as it causes CodeGen to use different code paths in some
cases (e.g., `mergeAddressesInConditionalExpr`), it doesn't cause any
changes in functionality as it doesn't add any information needed for
authentication.
In addition to the changes mentioned above, this patch introduces class
`RawAddress`, which contains a pointer that we know is unsigned, and
adds several new functions for creating `Address` and `LValue` objects.
|
|
Refactor managed variable handling in codegen so that the transformation
is done separately from registration.
This will allow the new driver to register the managed var in the linker
wrapper.
|
|
insert (#84737)"
Fixes a build error caused by an unupdated getAsInstruction callsite in clang.
This reverts commit ab851f7fe946e7eed700ef9d82082eb721860189.
|
|
`-fgpu-rdc` mode allows device functions call device functions in
different TU. However, currently all device objects have to be linked
together since only one fat binary is supported. This is time consuming
for AMDGPU backend since it only supports LTO.
There are use cases that objects can be divided into groups in which
device functions are self-contained but host functions are not. It is
desirable to link/optimize/codegen the device code and generate a fatbin
for each group, whereas partially link the host code with `ld -r` or
generate a static library by using the `--emit-static-lib` option of
clang. This avoids linking all device code together, therefore decreases
the linking time for `-fgpu-rdc`.
Previously, clang emits an external symbol `__hip_fatbin` for all
objects for `-fgpu-rdc`. With this patch, clang emits an unique external
symbol `__hip_fatbin_{cuid}` for the fat binary for each object. When a
group of objects are linked together to generate a fatbin, the symbols
are merged by alias and point to the same fat binary. Each group has its
own fat binary. One executable or shared library can have multiple fat
binaries. Device linking is done for undefined fab binary symbols only
to avoid repeated linking. `__hip_gpubin_handle` is also uniquefied and
merged to avoid repeated registering. Symbol `__hip_cuid_{cuid}` is
introduced to facilitate debugging and tooling.
Fixes: https://github.com/llvm/llvm-project/issues/77018
|
|
|
|
Unify 3 `Pointertype *` members which all refer to the same llvm type.
Opaque pointer clean-up effort.
|
|
Summary:
This patch adds support for registering texture / surface variables from
CUDA / HIP. Additionally, we now properly track the `extern` and `const`
flags that are also used in these runtime functions.
This does not implement the `managed` variables yet as those seem to
require some extra handling I'm not familiar with. The issue is that the
current offload entry isn't large enough to carry size and alignment
information along with an extra global.
|
|
Remove redundant bitcasts performed on instances of
`llvm::GlobalVariable`, which are pointers.
Opaque pointer cleanup effort. NFC.
|
|
Summary:
This patch is a first step to remove dependencies on the OpenMPIRBuilder
for creating generic offloading entries. This patch changes no
functionality and merely moves the code around. In the future the
interface will be changed to allow for more code re-use in the
registration and creation of offloading entries as well as a more
generic interface for CUDA, HIP, OpenMP, and SYCL(?). Doing this as a
first step to reduce the noise involved in the functional changes.
|
|
- Update CodeGenTypeCache to use a single union for all pointers in
address space zero.
- Introduce a UnqualPtrTy in CodeGenTypeCache, and use that (for
example instead of llvm::PointerType::getUnqual) in some places.
- Drop some redundant bit/pointers casts from ptr to ptr.
|
|
Currently, clang emits LLVM IR that fails verifier for the following
code:
```
template<typename T>
__global__ void foo(T x);
void bar() {
foo<<<1, 1>>>(0);
}
```
This is due to clang putting the kernel handle for foo into comdat,
which is not allowed, since the kernel handle is a declaration.
The siutation is similar to calling a declaration-only template
function. The callee will be a declaration in LLVM IR and won't be put
into comdat. This is in contrast to calling a template function with
body, which will be put into comdat.
Fixes: SWDEV-419769
|
|
I'm using clang to compile CUDA code. And just found that clang doesn't support the per-thread stream option for NV CUDA. I don't know if there is another solution.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D154822
|
|
Partial progress towards replacing in-tree uses of `Type::getPointerTo()`.
This needs to be done before deprecating the API.
Reviewed By: nikic, barannikov88
Differential Revision: https://reviews.llvm.org/D152321
|
|
CUDA support can be enabled in clang-repl with --cuda flag.
Device code linking is not yet supported. inline must be used with all
__device__ functions.
Differential Revision: https://reviews.llvm.org/D146389
|
|
This reverts commit 80e7eed6a610ab3c7289e6f9b7ec006bc7d7ae31.
|
|
CUDA support can be enabled in clang-repl with --cuda flag.
Device code linking is not yet supported. inline must be used with all
__device__ functions.
Differential Revision: https://reviews.llvm.org/D146389
|
|
Fixes clang crash caused by a stale function pointer.
The bug has been present for a pretty long time, but we were lucky not to
trigger it until D140663.
Differential Revision: https://reviews.llvm.org/D146448
|
|
Currently CGCUDANV uses an llvm::Function as a key to map kernels to a
symbol in host code. HIP adds one level of indirection and uses the
llvm::Function to map to a global variable that will be initialized to
the kernel stub ptr.
Unfortunately there is no garantee that the llvm::Function created
by GetOrCreateLLVMFunction will be the same. In fact, the first
time we encounter GetOrCrateLLVMFunction for a kernel, the type
might not be completed yet, and the type of llvm::Function will be
a generic {}, since the complete type is not required to get a symbol
to a function. In this case we end up creating two global variables,
one for the llvm::Function with the incomplete type and one for the
function with the complete type. The first global variable will be
declared by not defined, resulting in a linking error.
This change uses the mangled name of the llvm::Function as key in the
KernelHandles map, in this way the same llvm::Function will be
associated to the same kernel handle even if they types are different.
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D140663
|
|
Extra NUL does not impact functionality of the generated code, but it confuses
various NVIDIA tools used to examine embedded GPU binaries.
Differential Revision: https://reviews.llvm.org/D135832
|
|
The new driver primarily allows us to support RDC-mode compilations with
proper linking. This is not needed for non-RDC mode compilation, but we
still would like the new driver to be able to handle this mode so we can
transition away from the old driver in the future. This patch adds the
necessary code to support creating a fatbinary for CUDA code generation
as well as removing old assumptions and errors about RDC-mode with the
new driver.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D129655
|
|
This patch adds the small change required to output offloading entried
for HIP instead of CUDA. These should be placed in different sections so
because they need to be distinct to the offloading toolchain, otherwise
we'd have HIP trying to register CUDA kernels or vice-versa. This patch will
precede support for HIP in the linker wrapper.
Reviewed By: yaxunl, tra
Differential Revision: https://reviews.llvm.org/D128850
|
|
CUDA requires that static variables be visible to the host when
offloading. However, The standard semantics of a stiatc variable dictate
that it should not be visible outside of the current file. In order to
access it from the host we need to perform "externalization" on the
static variable on the device. This requires generating a semi-unique
name that can be affixed to the variable as to not cause linker errors.
This is currently done using the CUID functionality, an MD5 hash value
set up by the clang driver. This allows us to achieve is mostly unique
ID that is unique even between multiple compilations of the same file.
However, this is not always availible. Instead, this patch uses the
unique ID from the file to generate a unique symbol name. This will
create a unique name that is consistent between the host and device side
compilations without requiring the CUID to be entered by the driver. The
one downside to this is that we are no longer stable under multiple
compilations of the same file. However, this is a very niche use-case
and is not supported by Nvidia's CUDA compiler so it likely to be good
enough.
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D125904
|
|
The changes made in D123460 generalized the code generation for OpenMP's
offloading entries. We can use the same scheme to register globals for
CUDA code. This patch adds the code generation to create these
offloading entries when compiling using the new offloading driver mode.
The offloading entries are simple structs that contain the information
necessary to register the global. The struct used is as follows:
```
Type struct __tgt_offload_entry {
void *addr; // Pointer to the offload entry info.
// (function or global)
char *name; // Name of the function or global.
size_t size; // Size of the entry info (0 if it a function).
int32_t flags;
int32_t reserved;
};
```
Currently CUDA handles RDC code generation by deferring the registration
of globals in the current TU to a callback function containing the
modules ID. Later all the module IDs will be used to register all of the
globals at once. Rather than mimic this, offloading entries allow us to
mimic the way OpenMP registers globals. That is, we create a simple
global struct for each device global to be registered. These are placed
at a special section `cuda_offloading_entires`. Because this section is
a valid C-identifier, the linker will profide a `__start` and `__stop`
pointer that we can use to iterate and register all globals at runtime.
the registration requires a flag variable to indicate which registration
function to use. I have assigned the flags somewhat arbitrarily, but
these use the following values.
Kernel: 0
Variable: 0
Managed: 1
Surface: 2
Texture: 3
Depends on D120272
Reviewed By: tra
Differential Revision: https://reviews.llvm.org/D123471
|
|
CUDA/HIP needs to mangle for aux target. When mangling for aux target,
the mangler should use mangling number for aux target. Previously
in https://reviews.llvm.org/D122734 a state was introduced in
ASTContext to let the mangler get mangling number for aux target
from ASTContext. This patch removes that state from ASTConext
and add an IsAux member to MangleContext to indicate that
the mangle context is for aux target. This reflects the reality that
the mangle context is created for mangling aux target and makes
ASTContext cleaner.
Reviewed by: Artem Belevich, Reid Kleckner
Differential Revision: https://reviews.llvm.org/D124842
|
|
MSVC and Itanium mangling use different mangling numbers
for function-scope structs, which causes inconsistent
mangled kernel names in device and host compilations.
This patch uses Itanium mangling number for structs
in for mangling device side names in CUDA/HIP host
compilation on Windows to fix this issue.
A state is added to ASTContext to indicate whether the
current name mangling is for device side names in host
compilation. Device and host mangling number
are encoded/decoded as upper and lower half of 32 bit
unsigned integer to fit into the original mangling number
field for AST. Diagnostic will be emitted if a manglining
number exceeds limit.
Reviewed by: Artem Belevich, Reid Kleckner
Differential Revision: https://reviews.llvm.org/D122734
Fixes: SWDEV-328515
|
|
kernels in anonymous name space needs to have unique name
to avoid duplicate symbols.
Fixes: https://github.com/llvm/llvm-project/issues/54560
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D123353
|
|
In theory, constructors can take arguments when called via .init_array
where at least glibc passes in (argc, argv, envp). This isn't used in
the generated code and if it was, the first argument should be an
integer, not a pointer. For destructors registered via atexit, the
function should never take an argument.
Differential Revision: https://reviews.llvm.org/D123370
|
|
Remove two stray uses in CodeGenModule and CGCUDANV.
|
|
Introduce -fgpu-default-stream={legacy|per-thread} option to
support per-thread default stream for HIP runtime.
When -fgpu-default-stream=per-thread, HIP kernels are
launched through hipLaunchKernel_spt instead of
hipLaunchKernel. Also HIP_API_PER_THREAD_DEFAULT_STREAM=1
is defined by the preprocessor to enable other per-thread stream
API's.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D120298
|
|
To make uses of the deprecated constructor easier to spot, and to
ensure that no new uses are introduced, rename it to
Address::deprecated().
While doing the rename, I've filled in element types in cases
where it was relatively obvious, but we're still left with 135
calls to the deprecated constructor.
|
|
This patch fixes issues for -fgpu-rdc for Windows MSVC
toolchain:
Fix COFF specific section flags and remove section types
in llvm-mc input file for Windows.
Escape fatbin path in llvm-mc input file.
Add -triple option to llvm-mc.
Put __hip_gpubin_handle in comdat when it has linkonce_odr
linkage.
Reviewed by: Artem Belevich
Differential Revision: https://reviews.llvm.org/D115039
|