path: root/openmp
Age, Commit message, Author, Files, Lines
2021-10-27[openmp] [elf_common] Fix linking against LLVM dylibMichał Górny1-0/+3
The hand-rolled linking logic in elf_common does not account for the possibility of using LLVM dylib rather than a dozen static libraries. Since it does not seem to be easily convertible to add_llvm_library, just hand-roll support for LLVM_LINK_LLVM_DYLIB. This is necessary to support stand-alone builds against installed LLVM. Differential Revision: https://reviews.llvm.org/D111038 (cherry picked from commit 0873b9bef4e03b4cfc44a4946c11103c763055df)
2021-09-03[libomptarget][amdcgn] Only add opt/llvm-link dependency if TARGET is availableJoachim Protze1-1/+9
In some build configurations, the target we depend on is not available for declaring the build dependency. We only need to declare the build dependency if the build target is available in the same build. Fixes the issue raised in https://reviews.llvm.org/D107156#2969862. This patch should go into release/13 together with D108404. Differential Revision: https://reviews.llvm.org/D108868 (cherry picked from commit 5ea1c37118699f0ed1da17e0d8562011d0002edd)
2021-09-03[libomptarget][amdcgn] Add build dependency for llvm-link and optJoachim Protze2-1/+3
D107156 and D107320 are not sufficient when OpenMP is built as an LLVM runtime (LLVM_ENABLE_RUNTIMES=openmp) because dependencies only work within the same cmake instance. We could limit the dependency to cases where libomptarget/plugins are really built. But compared to the whole LLVM project, building the OpenMP runtime is negligible, and postponing the build of the OpenMP runtime until the dependencies are ready seems reasonable. The direct dependency introduced in D107156 and D107320 is necessary for the case where OpenMP is built as an LLVM project (LLVM_ENABLE_PROJECTS=openmp). Differential Revision: https://reviews.llvm.org/D108404 (cherry picked from commit 4bb36df144127c5bee6ea2607bc544c003aae446)
2021-08-31[libomptarget][amdgpu] don't declare Elf_Note on FreeBSDDimitry Andric1-0/+3
On FreeBSD, the system `<libelf.h>` already declares `struct Elf_Note` indirectly (via `<sys/elf_common.h>`). This results in compile errors when building the libomptarget amdgpu plugin. Avoid redeclaring `struct Elf_Note` on FreeBSD to fix the errors. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D107661 (cherry picked from commit 71ae2e0221a99958ed82175781d92a73ea05597c)
2021-08-23[libomptarget] Apply D106710 to amdgcn devicertlJon Chesterfield1-1/+1
(cherry picked from commit f420939b82766e371695e54abca4a7fadda6f801)
2021-08-23[libomptarget][amdcgn] Add build dependency for optJoachim Protze1-1/+1
This patch should fix the build failure we observe when building LLVM from scratch. Differential Revision: https://reviews.llvm.org/D107156 (cherry picked from commit 4ffa1478fd1bbfdea9382786c0afc4e1303bbd06)
2021-08-10[libomptarget][amdgpu] use --allow-shlib-undefined to link on FreeBSDDimitry Andric1-1/+10
On FreeBSD, the `environ` symbol is undefined at link time for shared libraries, but resolved by the dynamic linker at runtime. Therefore, allow the symbol to be undefined when creating a shared library, by using the `--allow-shlib-undefined` linker flag, instead of `-z defs` (a.k.a `--no-undefined`). Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D107698 (cherry picked from commit 400cd6d2f0496e913e25285615a86f9c29811171)
2021-08-10[OpenMP] Fix performance regression reported in bug #51235Shilei Tian1-0/+1
This patch fixes the "performance regression" reported in https://bugs.llvm.org/show_bug.cgi?id=51235. In fact, it has nothing to do with performance. The root cause is that the stolen task is not allowed to be executed by another thread because, by default, it is a tied task. Since a hidden helper task will always be executed by hidden helper threads, it should be untied. Reviewed By: protze.joachim Differential Revision: https://reviews.llvm.org/D107121 (cherry picked from commit 9f5d6ea52eb120ba370bf16ee0537602c6fc727e)
2021-08-05[OpenMP] libomp: taskwait depend implementation fixed.AndreyChurbanov3-9/+89
Fix for https://bugs.llvm.org/show_bug.cgi?id=49723. Eliminated references from the task dependency hash to a node allocated on the stack, thus eliminating accesses to stale memory. The node is therefore never freed now. Uncommented the assertion that triggered when stale memory was accessed. Removed the unneeded ref count increment for the stack-allocated node. Differential Revision: https://reviews.llvm.org/D106705 (cherry picked from commit 8e29b4b323b87f3855dc71abf1e3f3d48952a4e4)
2021-08-02[OpenMP] Fixing llvm-omp-device-info compilation with runtimesJose M Monsalve Diaz1-1/+0
When using `-DLLVM_ENABLE_RUNTIMES` instead of `-DLLVM_ENABLE_PROJECTS`, the `llvm-omp-device-info` tool is not compiled or installed. In general, no LLVM tool would be built with runtimes, because the `-DLLVM_BUILD_TOOLS` flag is dropped when the runtimes compilation calls cmake again. This patch is simple: it forwards the value of this flag to the runtime cmake command. It also removes an unnecessary comment in the compilation of the tool. Differential Revision: https://reviews.llvm.org/D107177 (cherry picked from commit 5424ceeda0534ab382e2a6cb192099f76ee8b12c)
2021-07-27[OpenMP] Fixing missing variables when CUDA SDK not in systemJose M Monsalve Diaz1-2/+131
This patch fixes the error reported in D106751. When there is no CUDA SDK installed on the system, the build fails due to missing `CU_DEVICE_ATTRIBUTE` variables. Uses the fix suggested by @zsrkmyn. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106933
2021-07-27[OpenMP][Tool] Introducing the `llvm-omp-device-info` toolJose M Monsalve Diaz10-26/+119
This patch introduces the `llvm-omp-device-info` tool, which uses the omptarget library and interface to query the device info from all the available devices as seen by OpenMP. This is inspired by PGI's `pgaccelinfo`. Since omptarget usually requires a description structure with executable kernels, I split the initialization of the RTLs and Devices to be able to initialize all possible devices and query each of them. This revision relies on the patch that introduces the print device info. A limitation is that the order in which the devices are initialized, and the corresponding device ID, is not necessarily the one seen by OpenMP. The changes are as follows:
1. Separate the RTL initialization that was performed in `RegisterLib` to its own `initRTLonce` function
2. Create an `initAllRTLs` method that initializes all available RTLs at runtime
3. Created the `llvm-deviceinfo.cpp` tool that uses `omptarget` to query each device and prints its information.
Example Output:
```
Device (0): print_device_info not implemented
Device (1): print_device_info not implemented
Device (2): print_device_info not implemented
Device (3): print_device_info not implemented
Device (4):
    CUDA Driver Version: 11000
    CUDA Device Number: 0
    Device Name: Quadro P1000
    Global Memory Size: 4236312576 bytes
    Number of Multiprocessors: 5
    Concurrent Copy and Execution: Yes
    Total Constant Memory: 65536 bytes
    Max Shared Memory per Block: 49152 bytes
    Registers per Block: 65536
    Warp Size: 32 Threads
    Maximum Threads per Block: 1024
    Maximum Block Dimensions: 1024, 1024, 64
    Maximum Grid Dimensions: 2147483647 x 65535 x 65535
    Maximum Memory Pitch: 2147483647 bytes
    Texture Alignment: 512 bytes
    Clock Rate: 1480500 kHz
    Execution Timeout: Yes
    Integrated Device: No
    Can Map Host Memory: Yes
    Compute Mode: DEFAULT
    Concurrent Kernels: Yes
    ECC Enabled: No
    Memory Clock Rate: 2505000 kHz
    Memory Bus Width: 128 bits
    L2 Cache Size: 1048576 bytes
    Max Threads Per SMP: 2048
    Async Engines: Yes (2)
    Unified Addressing: Yes
    Managed Memory: Yes
    Concurrent Managed Memory: Yes
    Preemption Supported: Yes
    Cooperative Launch: Yes
    Multi-Device Boars: No
    Compute Capabilities: 61
```
Reviewed By: tianshilei1992 Differential Revision: https://reviews.llvm.org/D106752
2021-07-27[OpenMP][Libomptarget] Adding `print_device_info` to RTL and `omptarget`Jose M Monsalve Diaz12-0/+216
This patch introduces a function in the device's plugin to print the device information. This patch relates to another patch that introduces a CLI tool to obtain the device information from the omplibrary directly. It is inspired by PGI's pgaccelinfo. The modifications are as follows: 1. Introduce the optional `void __tgt_rtl_print_device_info(RTLdevID)` function into the RTL. 2. Introduce the `bool __tgt_print_device_info(devID)` function into `omptarget` interface. Returns false if the RTL is not implemented 3. Added `bool printDeviceInfo(RTLDevID)` to the `DeviceTy` 4. Implement the `__tgt_rtl_print_device_info` for CUDA. Added additional CUDA Runtime calls. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106751
2021-07-27[OpenMP] Folding threadLimit and numThreads when single value in kernelsJose M Monsalve Diaz1-2/+2
The device runtime contains several calls to `__kmpc_get_hardware_num_threads_in_block` and `__kmpc_get_hardware_num_blocks`. If the thread_limit and the num_teams are constant, these calls can be folded to the constant value. In this patch we use the already introduced `AAFoldRuntimeCall` and the `NumTeams` and `NumThreads` kernel attributes (to be introduced in a different patch) to fold these functions. The code checks all the kernels, and if their attributes match, the functions are folded. In the future we will explore specializing for multiple values of NumThreads and NumTeams. Depends on D106390 Reviewed By: jdoerfert, JonChesterfield Differential Revision: https://reviews.llvm.org/D106033
2021-07-27[OpenMP] Improve alignment handling in the new device runtimeJohannes Doerfert2-4/+13
2021-07-27[Libomptarget] Revert new variable sharing to use the old methodJoseph Huber3-9/+21
The new method of sharing variables introduces a `__kmpc_alloc_shared` call that cannot be removed in the middle end because of its non-constant argument and unconnected free. This patch reverts this to the old method that used a static amount of shared memory for sharing variables. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106905
2021-07-28[OpenMP][Tests] Fix test compatibilityJoachim Protze1-1/+1
gcc and clang disagree in how the event handle needs to be handled. According to OpenMP LC, gcc is right. Will open clang bug report
2021-07-28[OpenMP] Fix deadlock for detachable task with child tasksJoachim Protze2-3/+67
This patch fixes https://bugs.llvm.org/show_bug.cgi?id=49066. For detachable tasks, the assumption that the proxy task cannot have remaining child tasks when the proxy completes no longer holds. Instead of incrementing/decrementing the incomplete task count, a high-order bit is flipped to mark and wait for the incomplete proxy task. Differential Revision: https://reviews.llvm.org/D101082
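The bit-flip idea can be illustrated with a small sketch. Everything here (`TaskCounter`, `kProxyPendingBit`, the chosen bit position) is a hypothetical stand-in and not the libomp data structure: the low bits count outstanding children, while one high-order bit records that a detached proxy task has not completed yet.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical names and bit position; not taken from libomp.
constexpr std::uint32_t kProxyPendingBit = 0x40000000u;

struct TaskCounter {
  std::atomic<std::uint32_t> incomplete_children{0};

  void child_created() { incomplete_children.fetch_add(1); }
  void child_finished() { incomplete_children.fetch_sub(1); }

  // Mark that a detached (proxy) task has not completed yet.
  void mark_proxy_pending() { incomplete_children.fetch_or(kProxyPendingBit); }
  // Clear the mark once the proxy task completes.
  void clear_proxy_pending() { incomplete_children.fetch_and(~kProxyPendingBit); }

  // Number of outstanding children, ignoring the flag bit.
  std::uint32_t children() const {
    return incomplete_children.load() & ~kProxyPendingBit;
  }
  // A waiter may proceed only when no children remain and the flag is clear.
  bool all_done() const { return incomplete_children.load() == 0; }
};
```

Because the flag lives in the same word as the count, a waiter that spins until the word is zero waits for both conditions with a single load.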
2021-07-27Convert the error to warning for enabling OMPD in non-Linux platformVignesh Balasubramanian1-1/+2
OMPD is enabled by default on Linux machines and disabled on others. However, if it is explicitly enabled on another platform, configuration throws an error and exits. This is reported in Bug: https://bugs.llvm.org/show_bug.cgi?id=51121. Instead of throwing an error, this patch disables OMPD support with a warning message, so configuration can continue. Reviewed By: @protze.joachim Differential Revision: https://reviews.llvm.org/D106682
2021-07-27[OpenMP] Prototype opt-in new GPU device RTLJohannes Doerfert23-0/+4323
The "old" OpenMP GPU device runtime (D14254) has served us well for many years but modernizing it has caused some pain recently. This patch introduces an alternative which is mostly written from scratch embracing OpenMP 5.X, C++, LLVM coding style (where applicable), and conceptual interfaces. This new runtime is opt-in through a clang flag (D106793). The new runtime is currently only built for nvptx and has "-new" in its name. The design is tailored towards middle-end optimizations rather than front-end code generation choices, a trend we already started in the old runtime a while back. In contrast to the old one, state is organized in a simple manner rather than a "smart" one. While this can induce costs, it helps optimizations. Our expectation is that the majority of codes can be optimized and a "simple" design is therefore preferable. The new runtime also avoids making users pay for things they do not use, especially with respect to memory. The unlikely case of nested parallelism is supported but costly, so that the more likely case uses fewer resources. The worksharing and reduction implementations have been taken from the old runtime and will be rewritten in the future if necessary. Documentation and debug features are still mostly missing and will be added over time. All external symbols start with `__kmpc` for legacy reasons but should be renamed once we switch over to a single runtime. All internal symbols are placed in appropriate namespaces (anonymous or `_OMP`) to avoid name clashes with user symbols. Differential Revision: https://reviews.llvm.org/D106803
2021-07-26[AbstractAttributor] Fold __kmpc_parallel_level if possibleShilei Tian2-11/+12
Similar to D105787, this patch tries to fold `__kmpc_parallel_level` if possible. Note that `__kmpc_parallel_level` doesn't take activeness into consideration, based on current `deviceRTLs`, its return value can be such as 0, 1, 2, instead of 0, 129, 130, etc. that also indicate activeness. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106154
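The distinction between a plain nesting level (0, 1, 2) and a value that also encodes activeness (129, 130) can be sketched as follows. This assumes, based only on the example values in the message, that activeness is carried in the bit with value 128; the helper names are hypothetical and are not part of `deviceRTLs`.

```cpp
#include <cstdint>

// Assumption: the bit with value 128 marks an active level, inferred from
// the 129/130 examples above. All names here are hypothetical.
constexpr std::uint8_t kActiveFlag = 128;

// Combine a nesting depth with an activeness flag.
constexpr std::uint8_t encode_level(std::uint8_t depth, bool active) {
  return active ? static_cast<std::uint8_t>(depth | kActiveFlag) : depth;
}

// Strip the activeness flag to recover the plain nesting depth (0, 1, 2, ...).
constexpr std::uint8_t nesting_depth(std::uint8_t encoded) {
  return encoded & static_cast<std::uint8_t>(~kActiveFlag);
}

constexpr bool is_active(std::uint8_t encoded) {
  return (encoded & kActiveFlag) != 0;
}
```

Under this assumption, `encode_level(1, true)` yields 129, and `nesting_depth` recovers the plain level that the message says `__kmpc_parallel_level` reports.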
2021-07-26[OpenMP][NFC] Fix a few typos in OpenMP documentationJoseph Huber7-33/+45
Summary: Fixes some typos in the OpenMP documentation.
2021-07-26[libomptarget] Build amdgpu plugin without hsaJon Chesterfield1-5/+1
Default to building the amdgpu plugin to use dlopen when hsa is not found instead of disabling it. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106600
2021-07-26[libomptarget][nfc] Squash unused variable warningJon Chesterfield1-0/+1
Suppress only current warning on openmp-clang-x86_64-linux-debian Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106777
2021-07-25[libomptarget][amdgpu] More robust handling of failure to init HSAJon Chesterfield2-6/+9
If hsa_init fails, subsequent calls into hsa are not safe, except for hsa_init itself, but we don't retry on failure. This patch:
- deletes a print that called into hsa to ask why it can't call into hsa
- drops a merge conflict block next to that print
- reliably initializes number of devices to zero
- skips the plugin destructor contents if the constructor failed to init hsa
Tested by making hsa_init return an error, and by forcing use of the dynamic library, which was then deleted from disk. Before this patch, both cases segfaulted. After it, both print a friendly message about offloading being unavailable. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106774
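A minimal sketch of the pattern described above, using a hypothetical `FakeRuntime` in place of HSA (none of these names come from the plugin): the constructor records whether initialization succeeded, the device count defaults to zero, and the destructor does nothing if initialization failed.

```cpp
#include <cstdio>

// Hypothetical stand-in for a runtime with init/shutdown entry points.
enum class Status { Success, Error };

struct FakeRuntime {
  bool fail_init = false;
  int shutdown_calls = 0;
  Status init() { return fail_init ? Status::Error : Status::Success; }
  void shutdown() { ++shutdown_calls; }
};

class Plugin {
public:
  explicit Plugin(FakeRuntime &rt) : rt_(rt) {
    if (rt_.init() != Status::Success) {
      std::fprintf(stderr, "offloading unavailable: runtime failed to initialize\n");
      return; // num_devices_ stays zero, initialized_ stays false
    }
    initialized_ = true;
    num_devices_ = 1; // placeholder device count for this sketch
  }

  ~Plugin() {
    if (!initialized_)
      return; // never call into a runtime that failed to initialize
    rt_.shutdown();
  }

  int num_devices() const { return num_devices_; }

private:
  FakeRuntime &rt_;
  bool initialized_ = false;
  int num_devices_ = 0;
};
```

With a failing runtime the plugin reports zero devices and its destructor never touches the runtime, which mirrors the "friendly message" behaviour the commit describes.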
2021-07-25Revert "[libomptarget] Build amdgpu plugin without hsa"Jon Chesterfield1-1/+5
Inaccurate error handling around hsa_init This reverts commit e30b3b23a4eddbc08b5648e643f0a0b456a57832.
2021-07-25[libomptarget] Build amdgpu plugin without hsaJon Chesterfield1-5/+1
Default to building the amdgpu plugin to use dlopen when hsa is not found instead of disabling it. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106600
2021-07-25[OpenMP][tests][NFC] Update test status for gcc 11 and 12Joachim Protze6-13/+23
gcc 11 introduced support for depend clause, but the gomp interface of libomp does not yet handle the information. Also remove -fopenmp-version=50, which is no longer needed for clang, but not supported by gcc.
2021-07-25[OpenMP][NVPTX] Disable OpenMPOpt when building deviceRTLsShilei Tian1-1/+1
We build `deviceRTLs` with `-O1` by default, which also triggers OpenMPOpt. When the info cache is created, some attributes are removed. As a result, although we mark a few functions `noinline`, they are still inlined when the bitcode library is generated. This can cause an issue in middle end optimization. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106710
2021-07-23[OpenMP] always compile with c++14 instead of gnu++14Ye Luo2-5/+1
Fixes PR 51174. c++14 should be a more portable option than gnu++14. Reviewed By: tianshilei1992 Differential Revision: https://reviews.llvm.org/D106632
2021-07-23[OpenMP] Fix bug 50022Shilei Tian2-4/+44
Bug 50022 [0] reports that target nowait fails in a certain case, which is added in this patch. The root cause of the failure is that, when the second task is created, its parent's `td_incomplete_child_tasks` is not incremented because there is no parallel region here, so its team is serialized. Therefore, when the initial thread is waiting for its unfinished child tasks, it thinks there is only one, the first task, which is tracked because it is a hidden helper task. The second task will only be pushed to the queue when the first task is finished. However, when the first task finishes, it first decrements the counter of its parent and then releases dependences. Once the counter is decremented, the thread moves on because its counter is reset, but the second task has not been executed at all. As a result, since in this case the main function finishes, `libomp` starts to destroy itself. When the second task is then pushed somewhere, some of the structures might already have been destroyed, and anything could happen. This patch simply moves `__kmp_release_deps` ahead of the decrement of the counter. In this way, we can make sure that the initial thread is aware of the existence of other task(s), so it will not move on. In addition, in order to tackle a dependence chain starting with a hidden helper thread, when a hidden helper task is encountered, we force the task to release dependences. Reference: [0] https://bugs.llvm.org/show_bug.cgi?id=50022 Reviewed By: AndreyChurbanov Differential Revision: https://reviews.llvm.org/D106519
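The ordering problem can be shown with a deliberately simplified, single-threaded model. It assumes that releasing a dependence is what makes a successor visible to the parent's counter; `Parent`, `Task`, `release_deps`, and `finish` are hypothetical names and do not correspond to libomp's functions. The model records the lowest counter value a waiting thread could have observed.

```cpp
#include <vector>

// Simplified model; all names are hypothetical.
struct Parent {
  int incomplete_children = 0;
  int min_seen = 1 << 30; // lowest counter value a waiter could have observed
  void observe() {
    if (incomplete_children < min_seen)
      min_seen = incomplete_children;
  }
};

struct Task {
  Parent *parent = nullptr;
  std::vector<Task *> successors; // tasks that depend on this one
  bool queued = false;
};

// Releasing dependences queues each successor and, in this model,
// makes it count as an incomplete child of its parent.
void release_deps(Task &t) {
  for (Task *s : t.successors) {
    s->queued = true;
    ++s->parent->incomplete_children;
    s->parent->observe();
  }
}

void finish(Task &t, bool release_first) {
  if (release_first) {
    release_deps(t);
    --t.parent->incomplete_children;
    t.parent->observe();
  } else {
    --t.parent->incomplete_children;
    t.parent->observe(); // a waiter may see zero here and move on too early
    release_deps(t);
  }
}
```

With one tracked task and one successor, decrementing first lets the counter touch zero before the successor is queued, while releasing first keeps it at one or above until the successor itself finishes.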
2021-07-23[Libomptarget] Add unroll flag to shared variables loopJoseph Huber1-0/+1
Unrolling this loop provides better performance in practice because it is executed on the device and is likely to be very small. Reviewed By: tianshilei1992 Differential Revision: https://reviews.llvm.org/D106692
2021-07-23[OpenMP][Offloading] Fix data race in data mapping by using two locksShilei Tian3-53/+95
This patch tries to partially fix one of the two data race issues reported in [1] by introducing a per-entry mutex. Additional discussion can also be found in D104418, which will also be refined to fix another data race problem. Here is how it works. Like before, `DataMapMtx` is still being used for mapping table lookup and update. In any case, we will get a table entry. If we need to make a data transfer (update the data on the device), we need to lock the entry right before releasing `DataMapMtx`, and the issue of data transfer should be after releasing `DataMapMtx`, and the entry is unlocked afterwards. This can guarantee that: 1) issue of data movement is not in critical region, which will not affect performance too much, and also will not affect other threads that don't touch the same entry; 2) if another thread accesses the same entry, the state of data movement is consistent (which requires that a thread must first get the update lock before getting data movement information). For a target that doesn't support async data transfer, issue of data movement is data transfer. This two-lock design can potentially improve concurrency compared with the design that guards data movement with `DataMapMtx` as well. For a target that supports async data movement, we could simply attach the event between the issue of data movement and unlock the entry. For a thread that wants to get the event, it must first get the lock. This can also get rid of the busy wait until the event pointer is valid. Reference: [1] https://bugs.llvm.org/show_bug.cgi?id=49940 Reviewed By: grokos Differential Revision: https://reviews.llvm.org/D104555
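The two-lock scheme can be sketched roughly as below. This is an illustration under stated assumptions and not the libomptarget implementation: `MappingTable`, `Entry`, and the integer "transfer" are hypothetical, and only `DataMapMtx` (modelled here by `table_mtx_`) is named in the message. The key step is taking the per-entry lock before the table lock is released, then doing the transfer outside the table lock.

```cpp
#include <map>
#include <memory>
#include <mutex>

// Hypothetical types; a sketch of the locking order only.
struct Entry {
  std::mutex update_mtx; // per-entry lock guarding data movement state
  bool transferred = false;
  int device_value = 0;
};

class MappingTable {
public:
  void map_and_transfer(int host_key, int host_value) {
    std::unique_lock<std::mutex> table_lock(table_mtx_);
    std::shared_ptr<Entry> &slot = table_[host_key];
    if (!slot)
      slot = std::make_shared<Entry>();
    std::shared_ptr<Entry> entry = slot;
    // Lock the entry right before releasing the table lock.
    std::unique_lock<std::mutex> entry_lock(entry->update_mtx);
    table_lock.unlock();
    // The "transfer" runs outside the table's critical region.
    entry->device_value = host_value;
    entry->transferred = true;
  }

  // Readers must take the entry lock before looking at transfer state.
  bool read(int host_key, int &out) {
    std::shared_ptr<Entry> entry;
    {
      std::lock_guard<std::mutex> table_lock(table_mtx_);
      auto it = table_.find(host_key);
      if (it == table_.end())
        return false;
      entry = it->second;
    }
    std::lock_guard<std::mutex> entry_lock(entry->update_mtx);
    if (!entry->transferred)
      return false;
    out = entry->device_value;
    return true;
  }

private:
  std::mutex table_mtx_; // plays the role of DataMapMtx in this sketch
  std::map<int, std::shared_ptr<Entry>> table_;
};
```

Threads working on different entries only contend on the short table lookup, while threads touching the same entry serialize on that entry's lock and therefore see a consistent transfer state.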
2021-07-23[OpenMP] Fix CUDA plugin build after 3817ba13aea3.Abhinav Gaba2-0/+17
The build was broken on machines that don't have Cuda SDK installed. See https://reviews.llvm.org/D106627 for the original discussion.
2021-07-22[OpenMP] Simplify the ThreadStackTy for globalization fallbackJohannes Doerfert1-75/+31
With D106496 we can make the globalization fallback stack much simpler and this version doesn't seem to experience the spurious failures and deadlocks we have seen before. Differential Revision: https://reviews.llvm.org/D106576
2021-07-22[OpenMP][NFC] Fix formatting in CUDA pluginJoseph Huber1-3/+4
2021-07-22[OpenMP] Add environment variables to change stack / heap size in the CUDA pluginJoseph Huber2-0/+40
This patch adds support for two environment variables to configure the device. ``LIBOMPTARGET_STACK_SIZE`` sets the amount of memory in bytes that each thread has for its stack. ``LIBOMPTARGET_HEAP_SIZE`` sets the amount of heap memory that can be allocated using malloc / free on the device. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106627
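Reading a byte count from such a variable could look like the sketch below. The helpers `parse_size` and `size_from_env` are hypothetical and are not the plugin's code; only the two variable names come from the message. Malformed or unset values are rejected so that the caller can keep the device default.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: parse a non-negative decimal byte count.
// Returns false if the text is missing or not purely numeric.
bool parse_size(const char *text, std::size_t &out) {
  if (text == nullptr || !std::isdigit(static_cast<unsigned char>(*text)))
    return false;
  char *end = nullptr;
  unsigned long long value = std::strtoull(text, &end, 10);
  if (*end != '\0')
    return false; // trailing characters such as "12k" are rejected
  out = static_cast<std::size_t>(value);
  return true;
}

// Hypothetical helper: look the variable up in the environment.
bool size_from_env(const char *name, std::size_t &out) {
  return parse_size(std::getenv(name), out);
}
```

A caller would try `size_from_env("LIBOMPTARGET_STACK_SIZE", bytes)` and only override the device setting when it returns true.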
2021-07-22[OpenMP] Refined the logic to give a regular task from a hidden helper taskShilei Tian2-19/+30
In current implementation, if a regular task depends on a hidden helper task, and when the hidden helper task is releasing its dependences, it directly calls `__kmp_omp_task`. This could cause a problem that if `__kmp_push_task` returns `TASK_NOT_PUSHED`, the task will be executed immediately. However, the hidden helper threads are assumed to only execute hidden helper tasks. This could cause problems because when calling `__kmp_omp_task`, the encountering gtid, which is not the real one of the thread, is passed. This patch uses `__kmp_give_task`, but because it is a static function, a new wrapper `__kmpc_give_task` is added. Reviewed By: AndreyChurbanov Differential Revision: https://reviews.llvm.org/D106572
2021-07-22[OpenMP] Renaming RT functions `GetNumberOfBlocksInKernel` and `GetNumberOfThreadsInBlock`Jose M Monsalve Diaz7-15/+19
These functions should follow the camel case convention. These are really easy to change and are needed for D106033. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D106390
2021-07-22[libomptarget][amdgpu][nfc] Normalise license headersJon Chesterfield20-64/+123
Reviewed By: gregrodgers, jdoerfert Differential Revision: https://reviews.llvm.org/D106581
2021-07-22[libomptarget][amdgpu][nfc] Replace use of gelf.h with libelf.hJon Chesterfield2-11/+3
AMDGPU can assume Elf64 so doesn't need to abstract over Elf32 Drop a few other unused headers at the same time. Now only llvm elf and libelf are used by the plugin. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106579
2021-07-22[libomptarget][amdgpu] Implement dlopen of libhsaJon Chesterfield4-4/+490
AMDGPU plugin equivalent of D95155, build without HSA installed locally Compiles a new file, plugins/amdgpu/dynamic_hsa/hsa.cpp, to an object file that exposes the same symbols that the plugin presently uses from hsa. The object file contains dlopen of hsa and cached dlsym calls. Also provides header files corresponding to the subset that is used. This is behind a feature flag, LIBOMPTARGET_FORCE_DLOPEN_LIBHSA, default off. That allows developers to build against the dlopen/dlsym implementation, e.g. while testing this mode. Enabling by default will cause this plugin to build on a wider variety of machines than it does at present so may break some CI builds. That risk can be minimised by reviewing the header dependencies of the library and ensuring it doesn't use any libraries that are not already used by libomptarget. Separating the implementation from enabling by default in case the latter needs to be rolled back after wider CI results. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106559
2021-07-22[libomptarget][nfc] Improve static assert message in dlwrapJon Chesterfield1-1/+5
Revision of D102858. Raise dlwrap arity argument to template argument so the correct value is given in the error message. E.g. '2 == 1' instead of '2 == trait<>::nargs'.

Arity higher than it should be:

Before diff
```
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: error: static_assert failed due to requirement '2 == trait<cudaError_enum (*)(unsigned int)>::nargs' "Arity Error"
DLWRAP_INTERNAL(cuInit, 2);
^~~~~~~~~~~~~~~~~~~~~~~~~~
...
$/include/dlwrap.h:166:3: note: expanded from macro 'DLWRAP_COMMON'
  static_assert(ARITY == trait<decltype(&SYMBOL)>::nargs, "Arity Error"); \
```

After diff
```
In file included from $/plugins/cuda/dynamic_cuda/cuda.cpp:16:
$/include/dlwrap.h:131:3: error: static_assert failed due to requirement '2UL == 1UL' "Arity Error"
  static_assert(Requested == Required, "Arity Error");
  ^ ~~~~~~~~~~~~~~~~~~~~~
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: in instantiation of function template specialization 'dlwrap::verboseAssert<2UL, 1UL>' requested here
DLWRAP_INTERNAL(cuInit, 2);
```

Arity lower than it should be:

Before diff
```
$/plugins/cuda/dynamic_cuda/cuda.cpp:131:10: error: no matching function for call to 'dlwrap_cuInit'
  return dlwrap_cuInit(X);
         ^~~~~~~~~~~~~
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: candidate function not viable: requires 0 arguments, but 1 was provided
DLWRAP_INTERNAL(cuInit, 0);
```

After diff
```
In file included from $/plugins/cuda/dynamic_cuda/cuda.cpp:16:
$/include/dlwrap.h:131:3: error: static_assert failed due to requirement '0UL == 1UL' "Arity Error"
  static_assert(Requested == Required, "Arity Error");
  ^ ~~~~~~~~~~~~~~~~~~~~~
$/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: in instantiation of function template specialization 'dlwrap::verboseAssert<0UL, 1UL>' requested here
DLWRAP_INTERNAL(cuInit, 0);
```

Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106543
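The technique of lifting both arities into template arguments can be sketched independently of `dlwrap.h`. The names `trait`, `nargs`, and `verboseAssert` appear in the diagnostics above, but the definitions below are a reconstruction under assumptions and not the header's actual contents; `fakeInit` is a hypothetical function standing in for a wrapped symbol.

```cpp
#include <cstddef>

// Count the parameters of a function pointer type.
template <typename F> struct trait;
template <typename R, typename... Args> struct trait<R (*)(Args...)> {
  static constexpr std::size_t nargs = sizeof...(Args);
};

// Both values are template arguments, so a failing static_assert prints
// the concrete numbers (for example '2UL == 1UL') in the diagnostic.
template <std::size_t Requested, std::size_t Required>
constexpr bool verboseAssert() {
  static_assert(Requested == Required, "Arity Error");
  return true;
}

// Hypothetical function standing in for a wrapped symbol.
int fakeInit(unsigned int) { return 0; }

// Compiles because the requested arity (1) matches the real one.
static_assert(verboseAssert<1, trait<decltype(&fakeInit)>::nargs>(),
              "arity matches");
```

Changing the first template argument to 2 would fail to compile, and the compiler would report the two concrete values in the failed requirement instead of the unevaluated `trait<...>::nargs` expression.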
2021-07-22[OpenMP] Fix warnings for uninitialized block countsJoseph Huber2-1/+4
Summary: Fixes some warnings given for uninitialized block counts if the execution mode is not recognized. This shouldn't happen in practice because the execution mode is checked when it's read from the device.
2021-07-22[libomptarget][amdgpu][nfc] Drop dead signal pool setupJon Chesterfield1-15/+1
This class is instantiated once in rtl.cpp before hsa_init is called. The hsa_signal_create call therefore fails leaving the pool empty. This signal pool is a legacy from ATMI where it was constructed after hsa_init. Moving the state into the rtl.cpp global class disabled the initial populating of the pool without noticeably changing performance. Just rechecked with a fix that allocates the signals after hsa_init and that also doesn't noticeably change performance. This patch therefore drops the initialisation. Only change from main is to drop a DEBUG_PRINT statement that would say the pool initial size is zero. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106515
2021-07-21[OpenMP] Add an option to disable function internalizationJoseph Huber2-0/+2
Function internalization can sometimes occur in situations where we want to keep the call sites intact. This patch adds an option to disable function internalization and prevents the device runtime from being internalized while creating the bitcode library. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106438
2021-07-21[Libomptarget] Introduce new main thread ID runtime functionJoseph Huber2-1/+9
This patch introduces `__kmpc_is_generic_main_thread_id` which splits the old comparison into its own runtime function. The purpose of this is so we can fold this part independently, so when both this and `is_spmd_mode` are folded the final function will be folded as well. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106437
2021-07-21[OpenMP] Add new execution mode for SPMD execution with Generic semanticsJoseph Huber2-11/+29
Qualified kernels can be transformed from generic-mode to SPMD mode using an optimization in OpenMPOpt. This patch introduces a new execution mode to indicate kernels that have been transformed from generic-mode to SPMD-mode. These kernels have SPMD-mode execution, but need generic-mode semantics for scheduling the blocks and threads. Without this far too few blocks will be scheduled for a generic region as SPMD mode expects the trip count to be divided by the number of threads. Reviewed By: ggeorgakoudis Differential Revision: https://reviews.llvm.org/D106460
2021-07-21[OpenMP] Change `__kmpc_free_shared` to include the paired allocation sizeJoseph Huber2-3/+4
This patch changes `__kmpc_free_shared` to take an additional argument corresponding to the associated allocation's size. This makes it easier to implement the allocator in the runtime. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106496
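One way to see why a size argument simplifies the allocator is a stack (bump) allocator, sketched below. This is an assumption about the kind of allocator that benefits, not the runtime's implementation; `SharedStack`, `alloc_shared`, and `free_shared` are hypothetical names, and alignment is ignored for brevity.

```cpp
#include <cstddef>

// Hypothetical fixed-capacity stack allocator; alignment is ignored.
class SharedStack {
public:
  void *alloc_shared(std::size_t bytes) {
    if (bytes > kCapacity - top_)
      return nullptr; // out of space; a real runtime would need a fallback
    void *ptr = storage_ + top_;
    top_ += bytes;
    return ptr;
  }

  // Frees must mirror allocations in reverse order. Because the caller
  // passes the size, no per-allocation header is needed to pop the stack.
  void free_shared(void *ptr, std::size_t bytes) {
    (void)ptr;
    top_ -= bytes;
  }

  std::size_t used() const { return top_; }

private:
  static constexpr std::size_t kCapacity = 1024;
  unsigned char storage_[kCapacity];
  std::size_t top_ = 0;
};
```

Without the size, the allocator would have to store bookkeeping next to each allocation or search for it, both of which cost memory or time on the device.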
2021-07-21[OpenMP] Expose libomptarget function to get HW thread idGiorgis Georgakoudis11-47/+63
The patch exposes the libomptarget runtime function that gets the hardware thread id through the kmpc API. This is to be used in SPMDization for checking the thread id to execute regions by a single thread in a block. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D106323