Age | Commit message (Collapse) | Author | Files | Lines |
|
PR libgomp/120444
include/ChangeLog:
* cuda/cuda.h (cuMemsetD8, cuMemsetD8Async): Declare.
libgomp/ChangeLog:
* libgomp-plugin.h (GOMP_OFFLOAD_memset): Declare.
* libgomp.h (struct gomp_device_descr): Add memset_func.
* libgomp.map (GOMP_6.0.1): Add omp_target_memset{,_async}.
* libgomp.texi (Device Memory Routines): Document them.
* omp.h.in (omp_target_memset, omp_target_memset_async): Declare.
* omp_lib.f90.in (omp_target_memset, omp_target_memset_async):
Add interfaces.
* omp_lib.h.in (omp_target_memset, omp_target_memset_async): Likewise.
* plugin/cuda-lib.def: Add cuMemsetD8.
* plugin/plugin-gcn.c (struct hsa_runtime_fn_info): Add
hsa_amd_memory_fill_fn.
(init_hsa_runtime_functions): DLSYM_OPT_FN load it.
(GOMP_OFFLOAD_memset): New.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_memset): New.
* target.c (omp_target_memset_int, omp_target_memset,
omp_target_memset_async_helper, omp_target_memset_async): New.
(gomp_load_plugin_for_device): Add DLSYM (memset).
* testsuite/libgomp.c-c++-common/omp_target_memset.c: New test.
* testsuite/libgomp.c-c++-common/omp_target_memset-2.c: New test.
* testsuite/libgomp.c-c++-common/omp_target_memset-3.c: New test.
* testsuite/libgomp.fortran/omp_target_memset.f90: New test.
* testsuite/libgomp.fortran/omp_target_memset-2.f90: New test.
|
|
libgomp/ChangeLog:
PR libgomp/93226
* libgomp-plugin.h (GOMP_OFFLOAD_openacc_async_dev2dev): New
prototype.
* libgomp.h (struct acc_dispatch_t): Add dev2dev_func.
(gomp_copy_dev2dev): New prototype.
* libgomp.map (OACC_2.6.1): New; add acc_memcpy_device{,_async}.
* libgomp.texi (acc_memcpy_device): New.
* oacc-mem.c (memcpy_tofrom_device): Change to take from/to
device boolean; use memcpy not memmove; add early return if
size == 0 or same device + same ptr.
(acc_memcpy_to_device, acc_memcpy_to_device_async,
acc_memcpy_from_device, acc_memcpy_from_device_async): Update.
(acc_memcpy_device, acc_memcpy_device_async): New.
* openacc.f90 (acc_memcpy_device, acc_memcpy_device_async):
Add interface.
* openacc_lib.h (acc_memcpy_device, acc_memcpy_device_async):
Likewise.
* openacc.h (acc_memcpy_device, acc_memcpy_device_async): Add
prototype.
* plugin/plugin-gcn.c (GOMP_OFFLOAD_openacc_async_host2dev):
Update comment.
(GOMP_OFFLOAD_openacc_async_dev2host): Update call.
(GOMP_OFFLOAD_openacc_async_dev2dev): New.
* plugin/plugin-nvptx.c (cuda_memcpy_dev_sanity_check): New.
(GOMP_OFFLOAD_dev2dev): Call it.
(GOMP_OFFLOAD_openacc_async_dev2dev): New.
* target.c (gomp_copy_dev2dev): New.
(gomp_load_plugin_for_device): Load dev2dev and async_dev2dev.
* testsuite/libgomp.oacc-c-c++-common/acc_memcpy_device-1.c: New test.
* testsuite/libgomp.oacc-fortran/acc_memcpy_device-1.f90: New test.
|
|
libgomp/ChangeLog:
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_interop): Set context for
stream creation to use the specified device.
|
|
The interop directive operates on an opaque object that represents a
foreign runtime. This commit adds support for
this to the two offloading plugins.
For nvptx, it supports cuda, cuda_driver and hip; the latter is AMD's
version of CUDA which for Nvidia devices boils down to normal CUDA.
Thus, at the end for this limited use, cuda/cuda_driver/hip are all
the same - and for plugin-nvptx.c, the they differ only in terms of
what gets fr_id, fr_name and get_interop_type_desc return.
For gcn, it supports hip and hsa.
Regarding get-mapped-ptr-1.c: That's actually a fix for the
GOMP_interop commit r15-8654-g99e2906ae255fc that added
GOMP_DEVICE_DEFAULT_OMP_61 alias omp_default_device, which is
a conforming device number. But that test used -5 as check for a
non-conforming device number.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (_LIBGOMP_PLUGIN_INCLUDE): Define.
(struct hsa_runtime_fn_info): Add two queue functions.
(hipError_t, hipCtx_t, hipStream_s, hipStream_t): New types.
(struct hip_runtime_fn_info): New.
(hip_runtime_lib, hip_fns): New global vars.
(init_environment_variables): Handle hip_runtime_lib.
(init_hsa_runtime_functions): Load the two queue functions.
(init_hip_runtime_functions, GOMP_OFFLOAD_interop,
GOMP_OFFLOAD_get_interop_int, GOMP_OFFLOAD_get_interop_ptr,
GOMP_OFFLOAD_get_interop_str,
GOMP_OFFLOAD_get_interop_type_desc): New.
* plugin/plugin-nvptx.c (_LIBGOMP_PLUGIN_INCLUDE): Define.
(GOMP_OFFLOAD_interop, GOMP_OFFLOAD_get_interop_int,
GOMP_OFFLOAD_get_interop_ptr, GOMP_OFFLOAD_get_interop_str,
GOMP_OFFLOAD_get_interop_type_desc): New.
* testsuite/libgomp.c/interop-fr-1.c: New test.
* testsuite/libgomp.c-c++-common/get-mapped-ptr-1.c: Use -6
not -5 as non-conforming device number.
|
|
libgomp/ChangeLog:
* plugin/plugin-gcn.c (ELFABIVERSION_AMDGPU_HSA_V6,
EF_AMDGPU_GENERIC_VERSION_V, EF_AMDGPU_GENERIC_VERSION_OFFSET,
GET_GENERIC_VERSION): New #define.
(elf_gcn_isa_is_generic): New.
(isa_matches_agent): Accept all generic code objects on the first
go; extend the diagnostic and handle runtime-failed case.
(create_and_finalize_hsa_program): Call it also after loading
the code failed, pass the status.
|
|
|
|
Follow up to r15-5392-g884637b6362391. As the name implies,
GOMP_OFFLOAD_openacc_async_construct is also externally called.
Hence, partially revert previous commit to permit unlocking handling
in oacc-async.c's lookup_goacc_asyncqueue by not failing fatally.
Hence, also the other (indirect) callers had to be updated:
GOMP_OFFLOAD_dev2dev fails now with 'false' and
GOMP_OFFLOAD_async_run fatally.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (GOMP_OFFLOAD_dev2dev, GOMP_OFFLOAD_async_run):
Handle omp_async_queue == NULL after call to maybe_init_omp_async.
(GOMP_OFFLOAD_openacc_async_construct): Use error not fatal error,
partially reverting r15-5392.
|
|
fail fatally
libgomp/ChangeLog:
* plugin/plugin-gcn.c (GOMP_OFFLOAD_openacc_async_construct): In
case of an error, call GOMP_PLUGIN_fatal not ..._error; use NULL
not false in return.
|
|
wrong-return-type error [PR117626]
libgomp/ChangeLog:
PR libgomp/117626
* plugin/plugin-nvptx.c (nvptx_open_device): Use 'CUDA_CALL_ERET'
with 'NULL' as error return instead of 'CUDA_CALL' that returns false.
|
|
libgomp/ChangeLog:
* plugin/plugin-gcn.c (isa_matches_agent): Mention the device number
and ROCR_VISIBLE_DEVICES when reporting an ISA mismatch error.
|
|
Almost all device-specific settings are now centralised into gcn-devices.def
for the compiler, mkoffload, and libgomp. No longer will we have to touch 10
files in multiple places just to add another device without any exotic
features. (New ISAs and devices with incompatible metadata will continue to
need a bit more.)
In order to remove the device-specific conditionals in the code a new value
HSACO_ATTR_UNSUPPORTED has been added, indicating that the assembler will
reject any setting of that option.
This incorporates some of Tobias's patch from March 2024.
Co-Authored-By: Tobias Burnus <tburnus@baylibre.com>
gcc/ChangeLog:
* config.gcc (amdgcn): Add gcn-device-macros.h to tm_file.
Add gcn-tables.opt to extra_options.
* config/gcn/gcn-hsa.h (NO_XNACK): Delete.
(NO_SRAM_ECC): Delete.
(SRAMOPT): Move definition to generated file gcn-device-macros.h.
(XNACKOPT): Likewise.
(ASM_SPEC): Redefine using generated values from gcn-device-macros.h.
* config/gcn/gcn-opts.h
(enum processor_type): Generate from gcn-devices.def.
(TARGET_VEGA10): Delete.
(TARGET_VEGA20): Delete.
(TARGET_GFX908): Delete.
(TARGET_GFX90a): Delete.
(TARGET_GFX90c): Delete.
(TARGET_GFX1030): Delete.
(TARGET_GFX1036): Delete.
(TARGET_GFX1100): Delete.
(TARGET_GFX1103): Delete.
(TARGET_XNACK): Redefine to allow for HSACO_ATTR_UNSUPPORTED.
(enum hsaco_attr_type): Add HSACO_ATTR_UNSUPPORTED.
(TARGET_TGSPLIT): New define.
* config/gcn/gcn.cc (gcn_devices): New constant table.
(gcn_option_override): Rework to use gcn_devices table.
(gcn_omp_device_kind_arch_isa): Likewise.
(output_file_start): Likewise.
(gcn_hsa_declare_function_name): Rework using TARGET_* macros.
* config/gcn/gcn.h (gcn_devices): Declare struct and table.
(TARGET_CPU_CPP_BUILTINS): Rework using gcn_devices.
* config/gcn/gcn.opt: Move enum data to generated file gcn-tables.opt.
Use new names for the default values.
* config/gcn/mkoffload.cc (EF_AMDGPU_MACH_AMDGCN_GFX900): Delete.
(EF_AMDGPU_MACH_AMDGCN_GFX906): Delete.
(EF_AMDGPU_MACH_AMDGCN_GFX908): Delete.
(EF_AMDGPU_MACH_AMDGCN_GFX90a): Delete.
(EF_AMDGPU_MACH_AMDGCN_GFX90c): Delete.
(EF_AMDGPU_MACH_AMDGCN_GFX1030): Delete.
(EF_AMDGPU_MACH_AMDGCN_GFX1036): Delete.
(EF_AMDGPU_MACH_AMDGCN_GFX1100): Delete.
(EF_AMDGPU_MACH_AMDGCN_GFX1103): Delete.
(enum elf_arch_code): Define using gcn-devices.def.
(get_arch): Rework using gcn-devices.def.
(main): Rework using gcn-devices.def
* config/gcn/t-gcn-hsa (gcn-tables.opt): Generate file.
(gcn-device-macros.h): Generate file.
* config/gcn/t-omp-device: Generate isa list from gcn-devices.def.
* config/gcn/gcn-devices.def: New file.
* config/gcn/gcn-tables.opt: New file.
* config/gcn/gcn-tables.opt.urls: New file.
* config/gcn/gen-gcn-device-macros.awk: New file.
* config/gcn/gen-opt-tables.awk: New file.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (EF_AMDGPU_MACH): Generate from gcn-devices.def.
(gcn_gfx803_s): Delete.
(gcn_gfx900_s): Delete.
(gcn_gfx906_s): Delete.
(gcn_gfx908_s): Delete.
(gcn_gfx90a_s): Delete.
(gcn_gfx90c_s): Delete.
(gcn_gfx1030_s): Delete.
(gcn_gfx1036_s): Delete.
(gcn_gfx1100_s): Delete.
(gcn_gfx1103_s): Delete.
(gcn_isa_name_len): Delete.
(isa_hsa_name): Rename ...
(isa_name): ... to this, and rework using gcn-devices.def.
(isa_gcc_name): Delete.
(isa_code): Rework using gcn-devices.def.
(max_isa_vgprs): Rework using gcn-devices.def.
(isa_matches_agent): Update isa_name usage.
(GOMP_OFFLOAD_init_device): Improve diagnostic using the name.
|
|
'self_maps' implies 'unified_shared_memory', except that the latter
also permits that explicit maps copy data to device memory while
self_maps does not. In GCC, currently, both are handled identical.
gcc/c/ChangeLog:
* c-parser.cc (c_parser_omp_requires): Handle self_maps clause.
gcc/cp/ChangeLog:
* parser.cc (cp_parser_omp_requires): Handle self_maps clause.
gcc/fortran/ChangeLog:
* gfortran.h (enum gfc_omp_requires_kind): Add OMP_REQ_SELF_MAPS.
(gfc_namespace): Enlarge omp_requires bitfield.
* module.cc (enum ab_attribute, attr_bits): Add AB_OMP_REQ_SELF_MAPS.
(mio_symbol_attribute): Handle it.
* openmp.cc (gfc_check_omp_requires, gfc_match_omp_requires): Handle
self_maps clause.
* parse.cc (gfc_parse_file): Handle self_maps clause.
gcc/ChangeLog:
* lto-cgraph.cc (output_offload_tables, omp_requires_to_name): Handle
self_maps clause.
* omp-general.cc (struct omp_ts_info, omp_context_selector_matches):
Likewise for the associated trait.
* omp-general.h (enum omp_requires): Add OMP_REQUIRES_SELF_MAPS.
* omp-selectors.h (enum omp_ts_code): Add
OMP_TRAIT_IMPLEMENTATION_SELF_MAPS.
include/ChangeLog:
* gomp-constants.h (GOMP_REQUIRES_SELF_MAPS): #define.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (GOMP_OFFLOAD_get_num_devices):
Accept self_maps clause.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_num_devices):
Likewise.
* libgomp.texi (TR13 Impl. Status): Set to 'Y'.
* target.c (gomp_requires_to_name, GOMP_offload_register_ver,
gomp_target_init): Handle self_maps clause.
* testsuite/libgomp.fortran/self_maps.f90: New test.
gcc/testsuite/ChangeLog:
* c-c++-common/gomp/declare-variant-1.c: Add self_maps test.
* c-c++-common/gomp/requires-4.c: Likewise.
* gfortran.dg/gomp/declare-variant-3.f90: Likewise.
* c-c++-common/gomp/requires-2.c: Update dg-error msg.
* gfortran.dg/gomp/requires-2.f90: Likewise.
* gfortran.dg/gomp/requires-self-maps-aux.f90: New.
* gfortran.dg/gomp/requires-self-maps.f90: New.
|
|
In Fortran, omp_get_device_from_uid can also accept substrings, which are
then not NUL terminated. Fixed by introducing a fortran.c wrapper function.
Additionally, in case of a fail the plugin functions now return NULL instead
of failing fatally such that a fall-back UID is generated.
gcc/ChangeLog:
* omp-general.cc (omp_runtime_api_procname): Strip "omp_" from
string; move get_device_from_uid as now a '_' suffix exists.
libgomp/ChangeLog:
* fortran.c (omp_get_device_from_uid_): New function.
* libgomp.map (GOMP_6.0): Add it.
* oacc-host.c (host_dispatch): Init '.uid' and '.get_uid_func'.
* omp_lib.f90.in: Make it used by removing bind(C).
* omp_lib.h.in: Likewise.
* target.c (omp_get_device_from_uid): Ensure the device is initialized.
* plugin/plugin-gcn.c (GOMP_OFFLOAD_get_uid): Add function comment;
return NULL in case of an error.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_uid): Likewise.
* testsuite/libgomp.fortran/device_uid.f90: Update to test substrings.
|
|
Those TR13/OpenMP 6.0 routines permit a reproducible offloading to
a specific device by mapping an OpenMP device number to a
unique ID (UID). The GPU device UIDs should be universally unique,
the one for the host is not.
gcc/ChangeLog:
* omp-general.cc (omp_runtime_api_procname): Add
get_device_from_uid and omp_get_uid_from_device routines.
include/ChangeLog:
* cuda/cuda.h (cuDeviceGetUuid): Declare.
(cuDeviceGetUuid_v2): Add prototype.
libgomp/ChangeLog:
* config/gcn/target.c (omp_get_uid_from_device,
omp_get_device_from_uid): Add stub implementation.
* config/nvptx/target.c (omp_get_uid_from_device,
omp_get_device_from_uid): Likewise.
* fortran.c (omp_get_uid_from_device_,
omp_get_uid_from_device_8_): New functions.
* libgomp-plugin.h (GOMP_OFFLOAD_get_uid): Add prototype.
* libgomp.h (struct gomp_device_descr): Add 'uid' and 'get_uid_func'.
* libgomp.map (GOMP_6.0): New, includind the new UID routines.
* libgomp.texi (OpenMP Technical Report 13): Mark UID routines as 'Y'.
(Device Information Routines): Document new UID routines.
(Offload-Target Specifics): Document UID format.
* omp.h.in (omp_get_device_from_uid, omp_get_uid_from_device):
New prototype.
* omp_lib.f90.in (omp_get_device_from_uid, omp_get_uid_from_device):
New interface.
* omp_lib.h.in: Likewise.
* plugin/cuda-lib.def: Add cuDeviceGetUuid and cuDeviceGetUuid_v2 via
CUDA_ONE_CALL_MAYBE_NULL.
* plugin/plugin-gcn.c (GOMP_OFFLOAD_get_uid): New.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_uid): New.
* target.c (str_omp_initial_device): New static var.
(STR_OMP_DEV_PREFIX): Define.
(gomp_get_uid_for_device, omp_get_uid_from_device,
omp_get_device_from_uid): New.
(gomp_load_plugin_for_device): DLSYM_OPT the function 'get_uid'.
(gomp_target_init): Set the device's 'uid' field to NULL.
* testsuite/libgomp.c/device_uid.c: New test.
* testsuite/libgomp.fortran/device_uid.f90: New test.
|
|
variable [PR97384, PR105274]
... as a means to manually set the "native" GPU thread stack size.
PR libgomp/97384
PR libgomp/105274
libgomp/
* plugin/cuda-lib.def (cuCtxSetLimit): Add.
* plugin/plugin-nvptx.c (nvptx_open_device): Handle
'GOMP_NVPTX_NATIVE_GPU_THREAD_STACK_SIZE' environment variable.
|
|
This extends commit d9c90c82d900fdae95df4499bf5f0a4ecb903b53
"nvptx target: Global constructor, destructor support, via nvptx-tools 'ld'"
for offloading.
libgcc/
* config/nvptx/gbl-ctors.c ["mgomp"]
(__do_global_ctors__entry__mgomp)
(__do_global_dtors__entry__mgomp): New.
[!"mgomp"] (__do_global_ctors__entry, __do_global_dtors__entry):
New.
libgomp/
* plugin/plugin-nvptx.c (nvptx_do_global_cdtors): New.
(nvptx_close_device, GOMP_OFFLOAD_load_image)
(GOMP_OFFLOAD_unload_image): Call it.
|
|
If HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT is true,
all GPUs on the system support unified shared memory. That's
the case for APUs and MI200 devices when XNACK is enabled.
XNACK can be enabled by setting HSA_XNACK=1 as env var for
supported devices; otherwise, if disable, USM code will
use host fallback.
gcc/ChangeLog:
* config/gcn/gcn-hsa.h (gcn_local_sym_hash): Fix typo.
include/ChangeLog:
* hsa.h (HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT): Add
enum value.
libgomp/ChangeLog:
* libgomp.texi (gcn): Update USM handling
* plugin/plugin-gcn.c (GOMP_OFFLOAD_get_num_devices): Handle
USM if HSA_AMD_SYSTEM_INFO_SVM_ACCESSIBLE_BY_DEFAULT is true.
|
|
A few high-end nvptx devices support the attribute
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS; for those, unified shared
memory is supported in hardware. This patch enables support for those -
if all installed nvptx devices have this feature (as the capabilities
are per device type).
This exposes a bug in gomp_copy_back_icvs as it did before use
omp_get_mapped_ptr to find mapped variables, but that returns
the unchanged pointer in cased of shared memory. But in this case,
we have a few actually mapped pointers - like the ICV variables.
Additionally, there was a mismatch with regards to '-1' for the
device number as gomp_copy_back_icvs and omp_get_mapped_ptr count
differently. Hence, do the lookup manually.
include/ChangeLog:
* cuda/cuda.h (CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS): Add.
libgomp/ChangeLog:
* libgomp.texi (nvptx): Update USM description.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_num_devices):
Claim support when requesting USM and all devices support
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS.
* target.c (gomp_copy_back_icvs): Fix device ptr lookup.
(gomp_target_init): Set GOMP_OFFLOAD_CAP_SHARED_MEM is the
devices supports USM.
|
|
Add support for gfx90c GCN5 APU integrated graphics devices.
The LLVM AMDGPU documentation does not list those devices as supported
by rocm-amdhsa, but it passes most libgomp offloading tests.
Although they are constrainted compared to dGPUs, they might be
interesting for learning, experimentation, and testing.
gcc/ChangeLog:
* config.gcc: Add gfx90c.
* config/gcn/gcn-hsa.h (NO_SRAM_ECC): Likewise.
* config/gcn/gcn-opts.h (enum processor_type): Likewise.
(TARGET_GFX90c): New macro.
* config/gcn/gcn.cc (gcn_option_override): Handle gfx90c.
(gcn_omp_device_kind_arch_isa): Likewise.
(output_file_start): Likewise.
* config/gcn/gcn.h: Add gfx90c.
* config/gcn/gcn.opt: Likewise.
* config/gcn/mkoffload.cc (EF_AMDGPU_MACH_AMDGCN_GFX90c): New macro.
(get_arch): Handle gfx90c.
(main): Handle EF_AMDGPU_MACH_AMDGCN_GFX90c
* config/gcn/t-omp-device: Add gfx90c.
* doc/install.texi: Likewise.
* doc/invoke.texi: Likewise.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (isa_hsa_name): Handle EF_AMDGPU_MACH_AMDGCN_GFX90c.
(isa_code): Handle gfx90c.
(max_isa_vgprs): Handle EF_AMDGPU_MACH_AMDGCN_GFX90c.
Signed-off-by: Frederik Harwath <frederik@harwath.name>
|
|
Currently, we silently disable libgomp GCN and nvptx plugins/devices in
presence of certain error conditions during device probing, thus typically
silently resorting to host-fallback execution. Make such errors fatal, similar
as for any other device access later on, so that we early and reliably notice
when things go wrong. (Keep just two cases non-fatal: (a) libgomp GCN or nvptx
plugins are available but 'libhsa-runtime64.so.1' or 'libcuda.so.1' are not,
and (b) those are available, but the corresponding devices are not.)
This resolves the issue that we've got execution test cases unexpectedly
PASSing, despite:
libgomp: GCN fatal error: Run-time could not be initialized
Runtime message: HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
..., and therefore they were not offloaded to the GCN device, but ran in
host-fallback execution mode. What happend in that scenario is that in
'init_hsa_context' during the initial 'GOMP_OFFLOAD_get_num_devices' we ran
into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', but it wasn't fatal, but just
silently disabled the libgomp plugin/device.
Especially "entertaining" were cases where such unintended host-fallback
execution happened during effective-target checks like
'offload_device_available' (host-fallback execution there meaning: no offload
device available), but actual test cases then were running with an offload
device available, and therefore mis-configured.
include/
* cuda/cuda.h (CUresult): Add 'CUDA_ERROR_NO_DEVICE'.
libgomp/
* plugin/plugin-gcn.c (init_hsa_context): Add and handle
'bool probe' parameter. Adjust all users; errors during device
probing are fatal.
* plugin/plugin-nvptx.c (nvptx_get_num_devices): Aside from
'CUDA_ERROR_NO_DEVICE', errors during device probing are fatal.
|
|
Add support for the gfx1036 RDNA2 APU integrated graphics devices. The ROCm
documentation warns that these may not be supported, but it seems to work
at least partially.
gcc/ChangeLog:
* config.gcc (amdgcn): Add gfx1036 entries.
* config/gcn/gcn-hsa.h (NO_XNACK): Likewise.
(gcn_local_sym_hash): Likewise.
* config/gcn/gcn-opts.h (enum processor_type): Likewise.
(TARGET_GFX1036): New macro.
* config/gcn/gcn.cc (gcn_option_override): Handle gfx1036.
(gcn_omp_device_kind_arch_isa): Likewise.
(output_file_start): Likewise.
* config/gcn/gcn.h (TARGET_CPU_CPP_BUILTINS): Add __gfx1036__.
(TARGET_CPU_CPP_BUILTINS): Rename __gfx1030 to __gfx1030__.
* config/gcn/gcn.opt: Add gfx1036.
* config/gcn/mkoffload.cc (EF_AMDGPU_MACH_AMDGCN_GFX1036): New.
(main): Handle gfx1036.
* config/gcn/t-omp-device: Add gfx1036 isa.
* doc/install.texi (amdgcn): Add gfx1036.
* doc/invoke.texi (-march): Likewise.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (EF_AMDGPU_MACH): GFX1036.
(gcn_gfx1103_s): New.
(isa_hsa_name): Handle gfx1036.
(isa_code): Likewise.
(max_isa_vgprs): Likewise.
|
|
Add support for the gfx1103 RDNA3 APU integrated graphics devices. The ROCm
documentation warns that these may not be supported, but it seems to work
at least partially.
gcc/ChangeLog:
* config.gcc (amdgcn): Add gfx1103 entries.
* config/gcn/gcn-hsa.h (NO_XNACK): Likewise.
(gcn_local_sym_hash): Likewise.
* config/gcn/gcn-opts.h (enum processor_type): Likewise.
(TARGET_GFX1103): New macro.
* config/gcn/gcn.cc (gcn_option_override): Handle gfx1103.
(gcn_omp_device_kind_arch_isa): Likewise.
(output_file_start): Likewise.
(gcn_hsa_declare_function_name): Use TARGET_RDNA3, not just gfx1100.
* config/gcn/gcn.h (TARGET_CPU_CPP_BUILTINS): Add __gfx1103__.
* config/gcn/gcn.opt: Add gfx1103.
* config/gcn/mkoffload.cc (EF_AMDGPU_MACH_AMDGCN_GFX1103): New.
(main): Handle gfx1103.
* config/gcn/t-omp-device: Add gfx1103 isa.
* doc/install.texi (amdgcn): Add gfx1103.
* doc/invoke.texi (-march): Likewise.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (EF_AMDGPU_MACH): GFX1103.
(gcn_gfx1103_s): New.
(isa_hsa_name): Handle gfx1103.
(isa_code): Likewise.
(max_isa_vgprs): Likewise.
|
|
(non-shared memory system)
'GCN_SUPPRESS_HOST_FALLBACK' originated as 'HSA_SUPPRESS_HOST_FALLBACK' in the
libgomp HSA plugin, where the idea was -- in my understanding -- that you
wouldn't have device code available for all functions that may be called, and
in that case transparently (shared memory system!) do host-fallback execution.
Or, with 'HSA_SUPPRESS_HOST_FALLBACK' set, you'd get those diagnosed.
This has then been copied into the libgomp GCN plugin as
'GCN_SUPPRESS_HOST_FALLBACK'. However, the original meaning isn't applicable
for the libgomp GCN plugin anymore: we assume that we're generating device code
for all relevant functions, and we're implementing a non-shared memory system,
where we cannot transparently do host-fallback execution for individual
functions.
However, 'GCN_SUPPRESS_HOST_FALLBACK' has gained an additional meaning, to
enforce a fatal error in case that 'libhsa-runtime64.so.1' can't be dynamically
loaded; keep that meaning.
libgomp/
* plugin/plugin-gcn.c (GOMP_OFFLOAD_can_run): Don't consider
'GCN_SUPPRESS_HOST_FALLBACK' anymore (assume always-'true').
(init_hsa_context): Adjust 'GCN_SUPPRESS_HOST_FALLBACK' error
message.
|
|
Per commit 683f11843974f0bdf42f79cdcbb0c2b43c7b81b0
"OpenMP: Move omp requires checks to libgomp", we're now using 'return -1'
from 'GOMP_OFFLOAD_get_num_devices' for 'omp_requires_mask' purposes. This
missed that via 'nvptx_get_num_devices', we could also 'return -1' for
'cuDeviceGetCount' failure. Before, this meant (in 'gomp_target_init') to
silently ignore the plugin/device -- which also has been doubtful behavior.
Let's instead turn 'cuDeviceGetCount' failure into a fatal error, similar to
other errors during device initialization.
libgomp/
* plugin/plugin-nvptx.c (nvptx_get_num_devices):
'cuDeviceGetCount' failure is fatal.
|
|
'libcuda.so.1'
If 'libhsa-runtime64.so.1', 'libcuda.so.1' are not available, the corresponding
libgomp plugin/device gets disabled, as before. But if they are available,
report any inconsistencies such as missing symbols, similar to how we fail in
presence of other issues during device initialization.
libgomp/
* plugin/plugin-gcn.c (init_hsa_runtime_functions): Fatal error
for missing symbols.
* plugin/plugin-nvptx.c (init_cuda_lib): Likewise.
|
|
The following avoids registering unsupported GCN offload devices
when iterating over available ones. With a Zen4 desktop CPU
you will have an IGPU (unspported) which will otherwise be made
available. This causes testcases like
libgomp.c-c++-common/non-rect-loop-1.c which iterate over all
decives to FAIL.
libgomp/
* plugin/plugin-gcn.c (suitable_hsa_agent_p): Filter out
agents with unsupported ISA.
|
|
The following makes the existing architecture support check work
instead of being optimized away (enum vs. -1). This avoids
later asserts when we assume such devices are never actually
used.
libgomp/
* plugin/plugin-gcn.c
(EF_AMDGPU_MACH::EF_AMDGPU_MACH_UNSUPPORTED): Add.
(isa_code): Return that instead of -1.
(GOMP_OFFLOAD_init_device): Adjust.
|
|
This is enough to get gfx1030 and gfx1100 working; there are still some test
failures to investigate, and probably some tuning to do.
gcc/ChangeLog:
* config/gcn/gcn-opts.h (TARGET_PACKED_WORK_ITEMS): Add TARGET_RDNA3.
* config/gcn/gcn-valu.md (all_convert): New iterator.
(<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>): New
define_expand, and rename the old one to ...
(*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>): ... this.
(extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>2<exec>): Likewise, to ...
(extend<V_INT_1REG_ALT:mode><V_INT_1REG:mode>_sdwa<exec>): .. this.
(*<convop><V_INT_1REG_ALT:mode><V_INT_1REG:mode>_shift<exec>): New.
* config/gcn/gcn.cc (gcn_global_address_p): Use "offsetbits" correctly.
(gcn_hsa_declare_function_name): Update the vgpr counting for gfx1100.
* config/gcn/gcn.md (<u>mulhisi3): Disable on RDNA3.
(<u>mulqihi3_scalar): Likewise.
libgcc/ChangeLog:
* config/gcn/amdgcn_veclib.h (CDNA3_PLUS): Handle RDNA3.
libgomp/ChangeLog:
* config/gcn/time.c (RTC_TICKS): Configure RDNA3.
(omp_get_wtime): Add RDNA3-compatible variant.
* plugin/plugin-gcn.c (max_isa_vgprs): Tune for gfx1030 and gfx1100.
Signed-off-by: Andrew Stubbs <ams@baylibre.com>
|
|
../../../source-gcc/libgomp/plugin/plugin-gcn.c: In function ‘isa_hsa_name’:
../../../source-gcc/libgomp/plugin/plugin-gcn.c:1666:10: error: ‘EF_AMDGPU_MACH_AMDGCN_GFX1100’ undeclared (first use in this function); did you mean ‘EF_AMDGPU_MACH_AMDGCN_GFX1030’?
1666 | case EF_AMDGPU_MACH_AMDGCN_GFX1100:
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| EF_AMDGPU_MACH_AMDGCN_GFX1030
../../../source-gcc/libgomp/plugin/plugin-gcn.c:1666:10: note: each undeclared identifier is reported only once for each function it appears in
../../../source-gcc/libgomp/plugin/plugin-gcn.c: In function ‘isa_code’:
../../../source-gcc/libgomp/plugin/plugin-gcn.c:1711:12: error: ‘EF_AMDGPU_MACH_AMDGCN_GFX1100’ undeclared (first use in this function); did you mean ‘EF_AMDGPU_MACH_AMDGCN_GFX1030’?
1711 | return EF_AMDGPU_MACH_AMDGCN_GFX1100;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| EF_AMDGPU_MACH_AMDGCN_GFX1030
../../../source-gcc/libgomp/plugin/plugin-gcn.c: In function ‘max_isa_vgprs’:
../../../source-gcc/libgomp/plugin/plugin-gcn.c:1728:10: error: ‘EF_AMDGPU_MACH_AMDGCN_GFX1100’ undeclared (first use in this function); did you mean ‘EF_AMDGPU_MACH_AMDGCN_GFX1030’?
1728 | case EF_AMDGPU_MACH_AMDGCN_GFX1100:
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| EF_AMDGPU_MACH_AMDGCN_GFX1030
make[4]: *** [Makefile:813: libgomp_plugin_gcn_la-plugin-gcn.lo] Error 1
Fix-up for commit 52a2c659ae6c21f84b6acce0afcb9b93b9dc71a0
"GCN: Add pre-initial support for gfx1100".
libgomp/
* plugin/plugin-gcn.c (EF_AMDGPU_MACH): Add
'EF_AMDGPU_MACH_AMDGCN_GFX1100'.
|
|
This patch adds support for 2D/3D memory copies for omp_target_memcpy_rect
using AMD extensions to the HSA API. This is just the AMD GCN-specific
part of the following patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631001.html
2024-01-04 Julian Brown <julian@codesourcery.com>
libgomp/
* plugin/plugin-gcn.c (hsa_runtime_fn_info): Add
hsa_amd_memory_lock_fn, hsa_amd_memory_unlock_fn,
hsa_amd_memory_async_copy_rect_fn function pointers.
(init_hsa_runtime_functions): Add above functions, with
DLSYM_OPT_FN.
(GOMP_OFFLOAD_memcpy2d, GOMP_OFFLOAD_memcpy3d): New functions.
|
|
ROCm since 5.7.1 supports gfx1100 (RDNA3) cards. This commit adds support
for it, mostly by assuming gfx1100 behaves identical to gfx1030. Like gfx1030,
gfx1100 support is neither documented nor the build of the multilib enabled by
default.
But contrary to gfx1030, gfx1100 has a known issue causing some libraries not
to build, including newlib: The sdwa variant of v_mov_b32_sdwa is not supported
by the hardware but GCC current does generates this instruction.
This will be addressed in a later commit.
gcc/ChangeLog:
* config.gcc (amdgcn-*-amdhsa): Accept --with-arch=gfx1100.
* config/gcn/gcn-hsa.h (NO_XNACK): Add gfx1100:
(ASM_SPEC): Handle gfx1100.
* config/gcn/gcn-opts.h (enum processor_type): Add PROCESSOR_GFX1100.
(enum gcn_isa): Add ISA_RDNA3.
(TARGET_GFX1100, TARGET_RDNA2_PLUS, TARGET_RDNA3): Define.
* config/gcn/gcn-valu.md: Change TARGET_RDNA2 to TARGET_RDNA2_PLUS.
* config/gcn/gcn.cc (gcn_option_override,
gcn_omp_device_kind_arch_isa, output_file_start): Handle gfx1100.
(gcn_global_address_p, gcn_addr_space_legitimate_address_p): Change
TARGET_RDNA2 to TARGET_RDNA2_PLUS.
(gcn_hsa_declare_function_name): Don't use '.amdhsa_reserve_flat_scratch'
with gfx1100.
* config/gcn/gcn.h (ASSEMBLER_DIALECT): Likewise.
(TARGET_CPU_CPP_BUILTINS): Define __RDNA3__, __gfx1030__ and
__gfx1100__.
* config/gcn/gcn.md: Change TARGET_RDNA2 to TARGET_RDNA2_PLUS.
* config/gcn/gcn.opt (Enum gpu_type): Add gfx1100.
* config/gcn/mkoffload.cc (EF_AMDGPU_MACH_AMDGCN_GFX1100): Define.
(isa_has_combined_avgprs, main): Handle gfx1100.
* config/gcn/t-omp-device (isa): Add gfx1100.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (gcn_gfx1100_s): New const string.
(gcn_isa_name_len): Fix length.
(isa_hsa_name, isa_code, max_isa_vgprs): Handle gfx1100.
|
|
|
|
This patch works around behaviour of the 2D and 3D memcpy operations in
the CUDA driver runtime. Particularly in Fortran, the "base pointer"
of an array (used for either source or destination of a host/device copy)
may lie outside of data that is actually stored on the device. The fix
is to make sure that we use the first element of data to be transferred
instead, and adjust parameters accordingly.
2023-10-02 Julian Brown <julian@codesourcery.com>
libgomp/
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_memcpy2d): Adjust parameters to
avoid out-of-bounds array checks in CUDA runtime.
(GOMP_OFFLOAD_memcpy3d): Likewise.
* testsuite/libgomp.c-c++-common/memcpyxd-bias-1.c: New test.
|
|
This implements the OpenMP low-latency memory allocator for AMD GCN using the
small per-team LDS memory (Local Data Store).
Since addresses can now refer to LDS space, the "Global" address space is
no-longer compatible. This patch therefore switches the backend to use
entirely "Flat" addressing (which supports both memories). A future patch
will re-enable "global" instructions for cases where it is known to be safe
to do so.
gcc/ChangeLog:
* config/gcn/gcn-builtins.def (DISPATCH_PTR): New built-in.
* config/gcn/gcn.cc (gcn_init_machine_status): Disable global
addressing.
(gcn_expand_builtin_1): Implement GCN_BUILTIN_DISPATCH_PTR.
libgomp/ChangeLog:
* config/gcn/libgomp-gcn.h (TEAM_ARENA_START): Move to here.
(TEAM_ARENA_FREE): Likewise.
(TEAM_ARENA_END): Likewise.
(GCN_LOWLAT_HEAP): New.
* config/gcn/team.c (LITTLEENDIAN_CPU): New, and import hsa.h.
(__gcn_lowlat_init): New prototype.
(gomp_gcn_enter_kernel): Initialize the low-latency heap.
* libgomp.h (TEAM_ARENA_START): Move to libgomp.h.
(TEAM_ARENA_FREE): Likewise.
(TEAM_ARENA_END): Likewise.
* plugin/plugin-gcn.c (lowlat_size): New variable.
(print_kernel_dispatch): Label the group_segment_size purpose.
(init_environment_variables): Read GOMP_GCN_LOWLAT_POOL.
(create_kernel_dispatch): Pass low-latency head allocation to kernel.
(run_kernel): Use shadow; don't assume values.
* testsuite/libgomp.c/omp_alloc-traits.c: Enable for amdgcn.
* config/gcn/allocator.c: New file.
* libgomp.texi: Document low-latency implementation details.
|
|
This patch adds support for allocating low-latency ".shared" memory on
NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc. The memory
can be allocated, reallocated, and freed using a basic but fast algorithm,
is thread safe and the size of the low-latency heap can be configured using
the GOMP_NVPTX_LOWLAT_POOL environment variable.
The use of the PTX dynamic_smem_size feature means that low-latency allocator
will not work with the PTX 3.1 multilib.
For now, the omp_low_lat_mem_alloc allocator also works, but that will change
when I implement the access traits.
libgomp/ChangeLog:
* allocator.c (MEMSPACE_ALLOC): New macro.
(MEMSPACE_CALLOC): New macro.
(MEMSPACE_REALLOC): New macro.
(MEMSPACE_FREE): New macro.
(predefined_alloc_mapping): New array. Add _Static_assert to match.
(ARRAY_SIZE): New macro.
(omp_aligned_alloc): Use MEMSPACE_ALLOC.
Implement fall-backs for predefined allocators. Simplify existing
fall-backs.
(omp_free): Use MEMSPACE_FREE.
(omp_calloc): Use MEMSPACE_CALLOC. Implement fall-backs for
predefined allocators. Simplify existing fall-backs.
(omp_realloc): Use MEMSPACE_REALLOC, MEMSPACE_ALLOC, and MEMSPACE_FREE.
Implement fall-backs for predefined allocators. Simplify existing
fall-backs.
* config/nvptx/team.c (__nvptx_lowlat_pool): New asm variable.
(__nvptx_lowlat_init): New prototype.
(gomp_nvptx_main): Call __nvptx_lowlat_init.
* libgomp.texi: Update memory space table.
* plugin/plugin-nvptx.c (lowlat_pool_size): New variable.
(GOMP_OFFLOAD_init_device): Read the GOMP_NVPTX_LOWLAT_POOL envvar.
(GOMP_OFFLOAD_run): Apply lowlat_pool_size.
* basic-allocator.c: New file.
* config/nvptx/allocator.c: New file.
* testsuite/libgomp.c/omp_alloc-1.c: New test.
* testsuite/libgomp.c/omp_alloc-2.c: New test.
* testsuite/libgomp.c/omp_alloc-3.c: New test.
* testsuite/libgomp.c/omp_alloc-4.c: New test.
* testsuite/libgomp.c/omp_alloc-5.c: New test.
* testsuite/libgomp.c/omp_alloc-6.c: New test.
Co-authored-by: Kwok Cheung Yeung <kcy@codesourcery.com>
Co-Authored-By: Thomas Schwinge <thomas@codesourcery.com>
|
|
Add the new CDNA register file. We don't support any of the specialized
instructions that use these registers, but they're useful to relieve
register pressure without spilling to stack.
Co-authored-by: Andrew Jenner <andrew@codesourcery.com>
gcc/ChangeLog:
* config/gcn/constraints.md: Add "a" AVGPR constraint.
* config/gcn/gcn-valu.md (*mov<mode>): Add AVGPR alternatives.
(*mov<mode>_4reg): Likewise.
(@mov<mode>_sgprbase): Likewise.
(gather<mode>_insn_1offset<exec>): Likewise.
(gather<mode>_insn_1offset_ds<exec>): Likewise.
(gather<mode>_insn_2offsets<exec>): Likewise.
(scatter<mode>_expr<exec_scatter>): Likewise.
(scatter<mode>_insn_1offset_ds<exec_scatter>): Likewise.
(scatter<mode>_insn_2offsets<exec_scatter>): Likewise.
* config/gcn/gcn.cc (MAX_NORMAL_AVGPR_COUNT): Define.
(gcn_class_max_nregs): Handle AVGPR_REGS and ALL_VGPR_REGS.
(gcn_hard_regno_mode_ok): Likewise.
(gcn_regno_reg_class): Likewise.
(gcn_spill_class): Allow spilling to AVGPRs on TARGET_CDNA1_PLUS.
(gcn_sgpr_move_p): Handle AVGPRs.
(gcn_secondary_reload): Reload AVGPRs via VGPRs.
(gcn_conditional_register_usage): Handle AVGPRs.
(gcn_vgpr_equivalent_register_operand): New function.
(gcn_valid_move_p): Check for validity of AVGPR moves.
(gcn_compute_frame_offsets): Handle AVGPRs.
(gcn_memory_move_cost): Likewise.
(gcn_register_move_cost): Likewise.
(gcn_vmem_insn_p): Handle TYPE_VOP3P_MAI.
(gcn_md_reorg): Handle AVGPRs.
(gcn_hsa_declare_function_name): Likewise.
(print_reg): Likewise.
(gcn_dwarf_register_number): Likewise.
* config/gcn/gcn.h (FIRST_AVGPR_REG): Define.
(AVGPR_REGNO): Define.
(LAST_AVGPR_REG): Define.
(SOFT_ARG_REG): Update.
(FRAME_POINTER_REGNUM): Update.
(DWARF_LINK_REGISTER): Update.
(FIRST_PSEUDO_REGISTER): Update.
(AVGPR_REGNO_P): Define.
(enum reg_class): Add AVGPR_REGS and ALL_VGPR_REGS.
(REG_CLASS_CONTENTS): Add new register classes and add entries for
AVGPRs to all classes.
(REGISTER_NAMES): Add AVGPRs.
* config/gcn/gcn.md (FIRST_AVGPR_REG, LAST_AVGPR_REG): Define.
(AP_REGNUM, FP_REGNUM): Update.
(define_attr "type"): Add vop3p_mai.
(define_attr "unit"): Handle vop3p_mai.
(define_attr "gcn_version"): Add "cdna2".
(define_attr "enabled"): Handle cdna2.
(*mov<mode>_insn): Add AVGPR alternatives.
(*movti_insn): Likewise.
* config/gcn/mkoffload.cc (isa_has_combined_avgprs): New.
(process_asm): Process avgpr_count.
* config/gcn/predicates.md (gcn_avgpr_register_operand): New.
(gcn_avgpr_hard_register_operand): New.
* doc/md.texi: Document the "a" constraint.
gcc/testsuite/ChangeLog:
* gcc.target/gcn/avgpr-mem-double.c: New test.
* gcc.target/gcn/avgpr-mem-int.c: New test.
* gcc.target/gcn/avgpr-mem-long.c: New test.
* gcc.target/gcn/avgpr-mem-short.c: New test.
* gcc.target/gcn/avgpr-spill-double.c: New test.
* gcc.target/gcn/avgpr-spill-int.c: New test.
* gcc.target/gcn/avgpr-spill-long.c: New test.
* gcc.target/gcn/avgpr-spill-short.c: New test.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (max_isa_vgprs): New.
(run_kernel): CDNA2 devices have more VGPRs.
|
|
This adds support for the 'indirect' clause in the 'declare target'
directive. Functions declared as indirect may be called via function
pointers passed from the host in offloaded code.
Virtual calls to member functions via the object pointer in C++ are
currently not supported in target regions.
2023-11-07 Kwok Cheung Yeung <kcy@codesourcery.com>
gcc/c-family/
* c-attribs.cc (c_common_attribute_table): Add attribute for
indirect functions.
* c-pragma.h (enum parma_omp_clause): Add entry for indirect clause.
gcc/c/
* c-decl.cc (c_decl_attributes): Add attribute for indirect
functions.
* c-lang.h (c_omp_declare_target_attr): Add indirect field.
* c-parser.cc (c_parser_omp_clause_name): Handle indirect clause.
(c_parser_omp_clause_indirect): New.
(c_parser_omp_all_clauses): Handle indirect clause.
(OMP_DECLARE_TARGET_CLAUSE_MASK): Add indirect clause to mask.
(c_parser_omp_declare_target): Handle indirect clause. Emit error
message if device_type or indirect clauses used alone. Emit error
if indirect clause used with device_type that is not 'any'.
(OMP_BEGIN_DECLARE_TARGET_CLAUSE_MASK): Add indirect clause to mask.
(c_parser_omp_begin): Handle indirect clause.
* c-typeck.cc (c_finish_omp_clauses): Handle indirect clause.
gcc/cp/
* cp-tree.h (cp_omp_declare_target_attr): Add indirect field.
* decl2.cc (cplus_decl_attributes): Add attribute for indirect
functions.
* parser.cc (cp_parser_omp_clause_name): Handle indirect clause.
(cp_parser_omp_clause_indirect): New.
(cp_parser_omp_all_clauses): Handle indirect clause.
(handle_omp_declare_target_clause): Add extra parameter. Add
indirect attribute for indirect functions.
(OMP_DECLARE_TARGET_CLAUSE_MASK): Add indirect clause to mask.
(cp_parser_omp_declare_target): Handle indirect clause. Emit error
message if device_type or indirect clauses used alone. Emit error
if indirect clause used with device_type that is not 'any'.
(OMP_BEGIN_DECLARE_TARGET_CLAUSE_MASK): Add indirect clause to mask.
(cp_parser_omp_begin): Handle indirect clause.
* semantics.cc (finish_omp_clauses): Handle indirect clause.
gcc/
* lto-cgraph.cc (enum LTO_symtab_tags): Add tag for indirect
functions.
(output_offload_tables): Write indirect functions.
(input_offload_tables): read indirect functions.
* lto-section-names.h (OFFLOAD_IND_FUNC_TABLE_SECTION_NAME): New.
* omp-builtins.def (BUILT_IN_GOMP_TARGET_MAP_INDIRECT_PTR): New.
* omp-offload.cc (offload_ind_funcs): New.
(omp_discover_implicit_declare_target): Add functions marked with
'omp declare target indirect' to indirect functions list.
(omp_finish_file): Add indirect functions to section for offload
indirect functions.
(execute_omp_device_lower): Redirect indirect calls on target by
passing function pointer to BUILT_IN_GOMP_TARGET_MAP_INDIRECT_PTR.
(pass_omp_device_lower::gate): Run pass_omp_device_lower if
indirect functions are present on an accelerator device.
* omp-offload.h (offload_ind_funcs): New.
* tree-core.h (omp_clause_code): Add OMP_CLAUSE_INDIRECT.
* tree.cc (omp_clause_num_ops): Add entry for OMP_CLAUSE_INDIRECT.
(omp_clause_code_name): Likewise.
* tree.h (OMP_CLAUSE_INDIRECT_EXPR): New.
* config/gcn/mkoffload.cc (process_asm): Process offload_ind_funcs
section. Count number of indirect functions.
(process_obj): Emit number of indirect functions.
* config/nvptx/mkoffload.cc (ind_func_ids, ind_funcs_tail): New.
(process): Emit offload_ind_func_table in PTX code. Emit indirect
function names and count in image.
* config/nvptx/nvptx.cc (nvptx_record_offload_symbol): Mark
indirect functions in PTX code with IND_FUNC_MAP.
gcc/testsuite/
* c-c++-common/gomp/declare-target-7.c: Update expected error message.
* c-c++-common/gomp/declare-target-indirect-1.c: New.
* c-c++-common/gomp/declare-target-indirect-2.c: New.
* g++.dg/gomp/attrs-21.C (v12): Update expected error message.
* g++.dg/gomp/declare-target-indirect-1.C: New.
* gcc.dg/gomp/attrs-21.c (v12): Update expected error message.
include/
* gomp-constants.h (GOMP_VERSION): Increment to 3.
(GOMP_VERSION_SUPPORTS_INDIRECT_FUNCS): New.
libgcc/
* offloadstuff.c (OFFLOAD_IND_FUNC_TABLE_SECTION_NAME): New.
(__offload_ind_func_table): New.
(__offload_ind_funcs_end): New.
(__OFFLOAD_TABLE__): Add entries for indirect functions.
libgomp/
* Makefile.am (libgomp_la_SOURCES): Add target-indirect.c.
* Makefile.in: Regenerate.
* libgomp-plugin.h (GOMP_INDIRECT_ADDR_MAP): New define.
(GOMP_OFFLOAD_load_image): Add extra argument.
* libgomp.h (struct indirect_splay_tree_key_s): New.
(indirect_splay_tree_node, indirect_splay_tree,
indirect_splay_tree_key): New.
(indirect_splay_compare): New.
* libgomp.map (GOMP_5.1.1): Add GOMP_target_map_indirect_ptr.
* libgomp.texi (OpenMP 5.1): Update documentation on indirect
calls in target region and on indirect clause.
(Other new OpenMP 5.2 features): Add entry for virtual function calls.
* libgomp_g.h (GOMP_target_map_indirect_ptr): Add prototype.
* oacc-host.c (host_load_image): Add extra argument.
* target.c (gomp_load_image_to_device): If the GOMP_VERSION is high
enough, read host indirect functions table and pass to
load_image_func.
* config/accel/target-indirect.c: New.
* config/linux/target-indirect.c: New.
* config/gcn/team.c (build_indirect_map): Add prototype.
(gomp_gcn_enter_kernel): Initialize support for indirect
function calls on GCN target.
* config/nvptx/team.c (build_indirect_map): Add prototype.
(gomp_nvptx_main): Initialize support for indirect function
calls on NVPTX target.
* plugin/plugin-gcn.c (struct gcn_image_desc): Add field for
indirect functions count.
(GOMP_OFFLOAD_load_image): Add extra argument. If the GOMP_VERSION
is high enough, build address translation table and copy it to target
memory.
* plugin/plugin-nvptx.c (nvptx_tdata): Add field for indirect
functions count.
(GOMP_OFFLOAD_load_image): Add extra argument. If the GOMP_VERSION
is high enough, Build address translation table and copy it to target
memory.
* testsuite/libgomp.c-c++-common/declare-target-indirect-1.c: New.
* testsuite/libgomp.c-c++-common/declare-target-indirect-2.c: New.
* testsuite/libgomp.c++/declare-target-indirect-1.C: New.
|
|
Accept the architecture configure option and resolve build failures. This is
enough to build binaries, but I've not got a device to test it on, so there
are probably runtime issues to fix. The cache control instructions might be
unsafe (or too conservative), and the kernel metadata might be off. Vector
reductions will need to be reworked for RDNA2. In principle, it would be
better to use wavefrontsize32 for this architecture, but that would mean
switching everything to allow SImode masks, so wavefrontsize64 it is.
The multilib is not included in the default configuration so either configure
--with-arch=gfx1030 or include it in --with-multilib-list=gfx1030,....
The majority of this patch has no effect on other devices, but changing from
using scalar writes for the exit value to vector writes means we don't need
the scalar cache write-back instruction anywhere (which doesn't exist in RDNA2).
gcc/ChangeLog:
* config.gcc: Allow --with-arch=gfx1030.
* config/gcn/gcn-hsa.h (NO_XNACK): gfx1030 does not support xnack.
(ASM_SPEC): gfx1030 needs -mattr=+wavefrontsize64 set.
* config/gcn/gcn-opts.h (enum processor_type): Add PROCESSOR_GFX1030.
(TARGET_GFX1030): New.
(TARGET_RDNA2): New.
* config/gcn/gcn-valu.md (@dpp_move<mode>): Disable for RDNA2.
(addc<mode>3<exec_vcc>): Add RDNA2 syntax variant.
(subc<mode>3<exec_vcc>): Likewise.
(<convop><mode><vndi>2_exec): Add RDNA2 alternatives.
(vec_cmp<mode>di): Likewise.
(vec_cmp<u><mode>di): Likewise.
(vec_cmp<mode>di_exec): Likewise.
(vec_cmp<u><mode>di_exec): Likewise.
(vec_cmp<mode>di_dup): Likewise.
(vec_cmp<mode>di_dup_exec): Likewise.
(reduc_<reduc_op>_scal_<mode>): Disable for RDNA2.
(*<reduc_op>_dpp_shr_<mode>): Likewise.
(*plus_carry_dpp_shr_<mode>): Likewise.
(*plus_carry_in_dpp_shr_<mode>): Likewise.
* config/gcn/gcn.cc (gcn_option_override): Recognise gfx1030.
(gcn_global_address_p): RDNA2 only allows smaller offsets.
(gcn_addr_space_legitimate_address_p): Likewise.
(gcn_omp_device_kind_arch_isa): Recognise gfx1030.
(gcn_expand_epilogue): Use VGPRs instead of SGPRs.
(output_file_start): Configure gfx1030.
* config/gcn/gcn.h (TARGET_CPU_CPP_BUILTINS): Add __RDNA2__;
(ASSEMBLER_DIALECT): New.
* config/gcn/gcn.md (rdna): New define_attr.
(enabled): Use "rdna" attribute.
(gcn_return): Remove s_dcache_wb.
(addcsi3_scalar): Add RDNA2 syntax variant.
(addcsi3_scalar_zero): Likewise.
(addptrdi3): Likewise.
(mulsi3): v_mul_lo_i32 should be v_mul_lo_u32 on all ISA.
(*memory_barrier): Add RDNA2 syntax variant.
(atomic_load<mode>): Add RDNA2 cache control variants, and disable
scalar atomics for RDNA2.
(atomic_store<mode>): Likewise.
(atomic_exchange<mode>): Likewise.
* config/gcn/gcn.opt (gpu_type): Add gfx1030.
* config/gcn/mkoffload.cc (EF_AMDGPU_MACH_AMDGCN_GFX1030): New.
(main): Recognise -march=gfx1030.
* config/gcn/t-omp-device: Add gfx1030 isa.
libgcc/ChangeLog:
* config/gcn/amdgcn_veclib.h (CDNA3_PLUS): Set false for __RDNA2__.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (EF_AMDGPU_MACH_AMDGCN_GFX1030): New.
(isa_hsa_name): Recognise gfx1030.
(isa_code): Likewise.
* team.c (defined): Remove s_endpgm.
|
|
Fixes for commit r14-2792-g25072a477a56a727b369bf9b20f4d18198ff5894
"OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect",
namely:
In that commit, the code was changed to handle shared-memory devices;
however, as pointed out, omp_target_memcpy_check already set the pointer
to NULL in that case. Hence, this commit reverts to the prior version.
In cuda.h, it adds cuMemcpyPeer{,Async} for symmetry for cuMemcpy3DPeer
(all currently unused) and in three structs, fixes reserved-member names
and remove a bogus 'const' in three structs.
And it changes a DLSYM to DLSYM_OPT as not all plugins support the new
functions, yet.
include/ChangeLog:
* cuda/cuda.h (CUDA_MEMCPY2D, CUDA_MEMCPY3D, CUDA_MEMCPY3D_PEER):
Remove bogus 'const' from 'const void *dst' and fix reserved-name
name in those structs.
(cuMemcpyPeer, cuMemcpyPeerAsync): Add.
libgomp/ChangeLog:
* target.c (omp_target_memcpy_rect_worker): Undo dim=1 change for
GOMP_OFFLOAD_CAP_SHARED_MEM.
(omp_target_memcpy_rect_copy): Likewise for lock condition.
(gomp_load_plugin_for_device): Use DLSYM_OPT not DLSYM for
memcpy3d/memcpy2d.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_memcpy2d,
GOMP_OFFLOAD_memcpy3d): Use memset 0 to nullify reserved and
unused src/dst fields for that mem type; remove '{src,dst}LOD = 0'.
|
|
When copying a 2D or 3D rectangular memmory block, the performance is
better when using CUDA's cuMemcpy2D/cuMemcpy3D instead of copying the
data one by one. That's what this commit does.
Additionally, it permits device-to-device copies, if neccessary using a
temporary variable on the host.
include/ChangeLog:
* cuda/cuda.h (CUlimit): Add CUDA_ERROR_NOT_INITIALIZED,
CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_INVALID_HANDLE.
(CUarray, CUmemorytype, CUDA_MEMCPY2D, CUDA_MEMCPY3D,
CUDA_MEMCPY3D_PEER): New typdefs.
(cuMemcpy2D, cuMemcpy2DAsync, cuMemcpy2DUnaligned,
cuMemcpy3D, cuMemcpy3DAsync, cuMemcpy3DPeer,
cuMemcpy3DPeerAsync): New prototypes.
libgomp/ChangeLog:
* libgomp-plugin.h (GOMP_OFFLOAD_memcpy2d,
GOMP_OFFLOAD_memcpy3d): New prototypes.
* libgomp.h (struct gomp_device_descr): Add memcpy2d_func
and memcpy3d_func.
* libgomp.texi (nvtpx): Document when cuMemcpy2D/cuMemcpy3D is used.
* oacc-host.c (memcpy2d_func, .memcpy3d_func): Init with NULL.
* plugin/cuda-lib.def (cuMemcpy2D, cuMemcpy2DUnaligned,
cuMemcpy3D): Invoke via CUDA_ONE_CALL.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_memcpy2d,
GOMP_OFFLOAD_memcpy3d): New.
* target.c (omp_target_memcpy_rect_worker):
(omp_target_memcpy_rect_check, omp_target_memcpy_rect_copy):
Permit all device-to-device copyies; invoke new plugins for
2D and 3D copying when available.
(gomp_load_plugin_for_device): DLSYM the new plugin functions.
* testsuite/libgomp.c/target-12.c: Fix dimension bug.
* testsuite/libgomp.fortran/target-12.f90: Likewise.
* testsuite/libgomp.fortran/target-memcpy-rect-1.f90: New test.
|
|
Effectively, for GCN (as for nvptx) there is a common address space between
host and device, whether being accessible or not. Thus, this commit
permits to use 'omp requires unified_address' with GCN devices.
(nvptx accepts this requirement since r13-3460-g131d18e928a3ea.)
libgomp/
* plugin/plugin-gcn.c (GOMP_OFFLOAD_get_num_devices): Regard
unified_address requirement as supported.
* libgomp.texi (OpenMP 5.0, AMD Radeon, nvptx): Remove
'unified_address' from the not-supported requirements.
|
|
implementation
... by using the existing 'goacc_asyncqueue' instead of re-coding parts of it.
Follow-up to commit 131d18e928a3ea1ab2d3bf61aa92d68a8a254609
"libgomp/nvptx: Prepare for reverse-offload callback handling",
and commit ea4b23d9c82d9be3b982c3519fe5e8e9d833a6a8
"libgomp: Handle OpenMP's reverse offloads".
libgomp/
* target.c (gomp_target_rev): Instead of 'dev_to_host_cpy',
'host_to_dev_cpy', 'token', take a single 'goacc_asyncqueue'.
* libgomp.h (gomp_target_rev): Adjust.
* libgomp-plugin.c (GOMP_PLUGIN_target_rev): Adjust.
* libgomp-plugin.h (GOMP_PLUGIN_target_rev): Adjust.
* plugin/plugin-gcn.c (process_reverse_offload): Adjust.
* plugin/plugin-nvptx.c (rev_off_dev_to_host_cpy)
(rev_off_host_to_dev_cpy): Remove.
(GOMP_OFFLOAD_run): Adjust.
|
|
Thereby considerably simplify the device plugins' 'GOMP_OFFLOAD_openacc_exec',
'GOMP_OFFLOAD_openacc_async_exec' functions: in terms of lines of code, but in
particular conceptually: no more device memory allocation, host to device data
copying, device memory deallocation -- 'GOMP_MAP_VARS_TARGET' does all that for
us.
This depends on commit 2b2340e236c0bba8aaca358ea25a5accd8249fbd
"Allow libgomp 'cbuf' buffering with OpenACC 'async' for 'ephemeral' data",
where I said that "a use will emerge later", which is this one here.
PR libgomp/90596
libgomp/
* target.c (gomp_map_vars_internal): Allow for
'param_kind == GOMP_MAP_VARS_OPENACC | GOMP_MAP_VARS_TARGET'.
* oacc-parallel.c (GOACC_parallel_keyed): Pass
'GOMP_MAP_VARS_TARGET' to 'goacc_map_vars'.
* plugin/plugin-gcn.c (alloc_by_agent, gcn_exec)
(GOMP_OFFLOAD_openacc_exec, GOMP_OFFLOAD_openacc_async_exec):
Adjust, simplify.
(gomp_offload_free): Remove.
* plugin/plugin-nvptx.c (nvptx_exec, GOMP_OFFLOAD_openacc_exec)
(GOMP_OFFLOAD_openacc_async_exec): Adjust, simplify.
(cuda_free_argmem): Remove.
* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c:
Adjust.
|
|
For 'OFFSET_INLINED', 'gomp_map_val' does the right thing, and we may then
simplify the device plugins accordingly.
This is a follow-up to
Subversion r279551 (Git commit a6163563f2ce502bd4ef444bd5de33570bb8eeb1)
"Add OpenACC 2.6's no_create",
Subversion r279622 (Git commit 5bcd470bf0749e1f56d05dd43aa9584ff2e3a090)
"Use gomp_map_val for OpenACC host-to-device address translation".
libgomp/
* target.c (gomp_map_vars_internal): Use 'OFFSET_INLINED' for
'GOMP_MAP_IF_PRESENT'.
* plugin/plugin-gcn.c (gcn_exec, GOMP_OFFLOAD_openacc_exec)
(GOMP_OFFLOAD_openacc_async_exec): Adjust.
* plugin/plugin-nvptx.c (nvptx_exec, GOMP_OFFLOAD_openacc_exec)
(GOMP_OFFLOAD_openacc_async_exec): Likewise.
* testsuite/libgomp.oacc-c-c++-common/no_create-1.c: Add 'async'
testing.
* testsuite/libgomp.oacc-c-c++-common/no_create-2.c: Likewise.
|
|
For an OpenACC compute construct, we've currently got:
- [...]
- acc_ev_enqueue_launch_start
- launch kernel
- free memory
- acc_ev_free
- acc_ev_enqueue_launch_end
This confused another thing that I'm working on, so I adjusted that to:
- [...]
- acc_ev_enqueue_launch_start
- launch kernel
- acc_ev_enqueue_launch_end
- free memory
- acc_ev_free
Correspondingly, verify 'acc_ev_alloc', 'acc_ev_free' in
'libgomp.oacc-c-c++-common/acc_prof-parallel-1.c'.
libgomp/
* plugin/plugin-gcn.c (gcn_exec): Fix 'acc_ev_enqueue_launch_end'
position.
* testsuite/libgomp.oacc-c-c++-common/acc_prof-parallel-1.c:
Verify 'acc_ev_alloc', 'acc_ev_free'.
|
|
libgomp/ChangeLog:
* libgomp.texi (5.0 Impl. Status, gcn specifics): Update for
reverse offload.
* plugin/plugin-gcn.c (GOMP_OFFLOAD_get_num_devices): Accept
reverse-offload requirement.
|
|
Switch from using stacks in the "private segment" to using a memory block
allocated on the host side. The primary reason is to permit the reverse
offload implementation to access values located on the device stack, but
there may also be performance benefits, especially with repeated kernel
invocations.
This implementation unifies the stacks with the "team arena" optimization
feature, and now allows both to have run-time configurable sizes.
A new ABI is needed, so all libraries must be rebuilt, and newlib must be
version 4.3.0.20230120 or newer.
gcc/ChangeLog:
* config/gcn/gcn-run.cc: Include libgomp-gcn.h.
(struct kernargs): Replace the common content with kernargs_abi.
(struct heap): Delete.
(main): Read GCN_STACK_SIZE envvar.
Allocate space for the device stacks.
Write the new kernargs fields.
* config/gcn/gcn.cc (gcn_option_override): Remove stack_size_opt.
(default_requested_args): Remove PRIVATE_SEGMENT_BUFFER_ARG and
PRIVATE_SEGMENT_WAVE_OFFSET_ARG.
(gcn_addr_space_convert): Mask the QUEUE_PTR_ARG content.
(gcn_expand_prologue): Move the TARGET_PACKED_WORK_ITEMS to the top.
Set up the stacks from the values in the kernargs, not private.
(gcn_expand_builtin_1): Match the stack configuration in the prologue.
(gcn_hsa_declare_function_name): Turn off the private segment.
(gcn_conditional_register_usage): Ensure QUEUE_PTR is fixed.
* config/gcn/gcn.h (FIXED_REGISTERS): Fix the QUEUE_PTR register.
* config/gcn/gcn.opt (mstack-size): Change the description.
include/ChangeLog:
* gomp-constants.h (GOMP_VERSION_GCN): Bump.
libgomp/ChangeLog:
* config/gcn/libgomp-gcn.h (DEFAULT_GCN_STACK_SIZE): New define.
(DEFAULT_TEAM_ARENA_SIZE): New define.
(struct heap): Move to this file.
(struct kernargs_abi): Likewise.
* config/gcn/team.c (gomp_gcn_enter_kernel): Use team arena size from
the kernargs.
* libgomp.h: Include libgomp-gcn.h.
(TEAM_ARENA_SIZE): Remove.
(team_malloc): Update the error message.
* plugin/plugin-gcn.c (struct kernargs): Move common content to
struct kernargs_abi.
(struct agent_info): Rename team arenas to ephemeral memories.
(struct team_arena_list): Rename ....
(struct ephemeral_memories_list): to this.
(struct heap): Delete.
(team_arena_size): New variable.
(stack_size): New variable.
(print_kernel_dispatch): Update debug messages.
(init_environment_variables): Read GCN_TEAM_ARENA_SIZE.
Read GCN_STACK_SIZE.
(get_team_arena): Rename ...
(configure_ephemeral_memories): ... to this, and set up stacks.
(release_team_arena): Rename ...
(release_ephemeral_memories): ... to this.
(destroy_team_arenas): Rename ...
(destroy_ephemeral_memories): ... to this.
(create_kernel_dispatch): Add num_threads parameter.
Adjust for kernargs_abi refactor and ephemeral memories.
(release_kernel_dispatch): Adjust for ephemeral memories.
(run_kernel): Pass thread-count to create_kernel_dispatch.
(GOMP_OFFLOAD_init_device): Adjust for ephemeral memories.
(GOMP_OFFLOAD_fini_device): Adjust for ephemeral memories.
gcc/testsuite/ChangeLog:
* gcc.c-torture/execute/pr47237.c: Xfail on amdgcn.
* gcc.dg/builtin-apply3.c: Xfail for amdgcn.
* gcc.dg/builtin-apply4.c: Xfail for amdgcn.
* gcc.dg/torture/stackalign/builtin-apply-3.c: Xfail for amdgcn.
* gcc.dg/torture/stackalign/builtin-apply-4.c: Xfail for amdgcn.
|
|
|
|
This commit enabled reverse offload for nvptx such that gomp_target_rev
actually gets called. And it fills the latter function to do all of
the following: finding the host function to the device func ptr and
copying the arguments to the host, processing the mapping/firstprivate,
calling the host function, copying back the data and freeing as needed.
The data handling is made easier by assuming that all host variables
either existed before (and are in the mapping) or that those are
devices variables not yet available on the host. Thus, the reverse
mapping can do without refcounts etc. Note that the spec disallows
inside a target region device-affecting constructs other than target
plus ancestor device-modifier and it also limits the clauses permitted
on this construct.
For the function addresses, an additional splay tree is used; for
the lookup of mapped variables, the existing splay-tree is used.
Unfortunately, its data structure requires a full walk of the tree;
Additionally, the just mapped variables are recorded in a separate
data structure an extra lookup. While the lookup is slow, assuming
that only few variables get mapped in each reverse offload construct
and that reverse offload is the exception and not performance critical,
this seems to be acceptable.
libgomp/ChangeLog:
* libgomp.h (struct target_mem_desc): Predeclare; move
below after 'reverse_splay_tree_node' and add rev_array
member.
(struct reverse_splay_tree_key_s, reverse_splay_compare): New.
(reverse_splay_tree_node, reverse_splay_tree,
reverse_splay_tree_key): New typedef.
(struct gomp_device_descr): Add mem_map_rev member.
* oacc-host.c (host_dispatch): NULL init .mem_map_rev.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_num_devices): Claim
support for GOMP_REQUIRES_REVERSE_OFFLOAD.
* splay-tree.h (splay_tree_callback_stop): New typedef; like
splay_tree_callback but returning int not void.
(splay_tree_foreach_lazy): Define; like splay_tree_foreach but
taking splay_tree_callback_stop as argument.
* splay-tree.c (splay_tree_foreach_internal_lazy,
splay_tree_foreach_lazy): New; but early exit if callback returns
nonzero.
* target.c: Instatiate splay_tree_c with splay_tree_prefix 'reverse'.
(gomp_map_lookup_rev): New.
(gomp_load_image_to_device): Handle reverse-offload function
lookup table.
(gomp_unload_image_from_device): Free devicep->mem_map_rev.
(struct gomp_splay_tree_rev_lookup_data, gomp_splay_tree_rev_lookup,
gomp_map_rev_lookup, struct cpy_data, gomp_map_cdata_lookup_int,
gomp_map_cdata_lookup): New auxiliary structs and functions for
gomp_target_rev.
(gomp_target_rev): Implement reverse offloading and its mapping.
(gomp_target_init): Init current_device.mem_map_rev.root.
* testsuite/libgomp.fortran/reverse-offload-2.f90: New test.
* testsuite/libgomp.fortran/reverse-offload-3.f90: New test.
* testsuite/libgomp.fortran/reverse-offload-4.f90: New test.
* testsuite/libgomp.fortran/reverse-offload-5.f90: New test.
* testsuite/libgomp.fortran/reverse-offload-5a.f90: New test without
mapping of on-device allocated variables.
|
|
omp_{gs}et_teams_thread_limit on offload devices
This patch adds support for omp_get_max_teams, omp_set_num_teams, and
omp_{gs}et_teams_thread_limit on offload devices. That includes the usage of
device-specific ICV values (specified as environment variables or changed on a
device). In order to reuse device-specific ICV values, a copy back mechanism is
implemented that copies ICV values back from device to the host.
Additionally, a limitation of the number of teams on gcn offload devices is
implemented. The number of teams is limited by twice the number of compute
units (one team is executed on one compute unit). This avoids queueing
unnessecary many teams and a corresponding allocation of large amounts of
memory. Without that limitation the memory allocation for a large number of
user-specified teams can result in an "memory access fault".
A limitation of the number of teams is already also implemented for nvptx
devices (see nvptx_adjust_launch_bounds in libgomp/plugin/plugin-nvptx.c).
gcc/ChangeLog:
* gimplify.cc (optimize_target_teams): Set initial num_teams_upper
to "-2" instead of "1" for non-existing num_teams clause in order to
disambiguate from the case of an existing num_teams clause with value 1.
libgomp/ChangeLog:
* config/gcn/icv-device.c (omp_get_teams_thread_limit): Added to
allow processing of device-specific values.
(omp_set_teams_thread_limit): Likewise.
(ialias): Likewise.
* config/nvptx/icv-device.c (omp_get_teams_thread_limit): Likewise.
(omp_set_teams_thread_limit): Likewise.
(ialias): Likewise.
* icv-device.c (omp_get_teams_thread_limit): Likewise.
(ialias): Likewise.
(omp_set_teams_thread_limit): Likewise.
* icv.c (omp_set_teams_thread_limit): Removed.
(omp_get_teams_thread_limit): Likewise.
(ialias): Likewise.
* libgomp.texi: Updated documentation for nvptx and gcn corresponding
to the limitation of the number of teams.
* plugin/plugin-gcn.c (limit_teams): New helper function that limits
the number of teams by twice the number of compute units.
(parse_target_attributes): Limit the number of teams on gcn offload
devices.
* target.c (get_gomp_offload_icvs): Added teams_thread_limit_var
handling.
(gomp_load_image_to_device): Added a size check for the ICVs struct
variable.
(gomp_copy_back_icvs): New function that is used in GOMP_target_ext to
copy back the ICV values from device to host.
(GOMP_target_ext): Update the number of teams and threads in the kernel
args also considering device-specific values.
* testsuite/libgomp.c-c++-common/icv-4.c: Fixed an error in the reading
of OMP_TEAMS_THREAD_LIMIT from the environment.
* testsuite/libgomp.c-c++-common/icv-5.c: Extended.
* testsuite/libgomp.c-c++-common/icv-6.c: Extended.
* testsuite/libgomp.c-c++-common/icv-7.c: Extended.
* testsuite/libgomp.c-c++-common/icv-9.c: New test.
* testsuite/libgomp.fortran/icv-5.f90: New test.
* testsuite/libgomp.fortran/icv-6.f90: New test.
gcc/testsuite/ChangeLog:
* c-c++-common/gomp/target-teams-1.c: Adapt expected values for
num_teams from "1" to "-2" in cases without num_teams clause.
* g++.dg/gomp/target-teams-1.C: Likewise.
* gfortran.dg/gomp/defaultmap-4.f90: Likewise.
* gfortran.dg/gomp/defaultmap-5.f90: Likewise.
* gfortran.dg/gomp/defaultmap-6.f90: Likewise.
|