aboutsummaryrefslogtreecommitdiff
path: root/libgomp/libgomp.texi
diff options
context:
space:
mode:
authorTobias Burnus <tobias@codesourcery.com>2023-07-26 16:22:35 +0200
committerTobias Burnus <tobias@codesourcery.com>2023-07-26 16:22:35 +0200
commit25072a477a56a727b369bf9b20f4d18198ff5894 (patch)
tree8dc40c0f128509b0e5c78ff32d5102c321bbaa4d /libgomp/libgomp.texi
parentc194a413369e9c9f92f1c9334556b359c7417742 (diff)
downloadgcc-25072a477a56a727b369bf9b20f4d18198ff5894.zip
gcc-25072a477a56a727b369bf9b20f4d18198ff5894.tar.gz
gcc-25072a477a56a727b369bf9b20f4d18198ff5894.tar.bz2
OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect
When copying a 2D or 3D rectangular memmory block, the performance is better when using CUDA's cuMemcpy2D/cuMemcpy3D instead of copying the data one by one. That's what this commit does. Additionally, it permits device-to-device copies, if neccessary using a temporary variable on the host. include/ChangeLog: * cuda/cuda.h (CUlimit): Add CUDA_ERROR_NOT_INITIALIZED, CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_INVALID_HANDLE. (CUarray, CUmemorytype, CUDA_MEMCPY2D, CUDA_MEMCPY3D, CUDA_MEMCPY3D_PEER): New typdefs. (cuMemcpy2D, cuMemcpy2DAsync, cuMemcpy2DUnaligned, cuMemcpy3D, cuMemcpy3DAsync, cuMemcpy3DPeer, cuMemcpy3DPeerAsync): New prototypes. libgomp/ChangeLog: * libgomp-plugin.h (GOMP_OFFLOAD_memcpy2d, GOMP_OFFLOAD_memcpy3d): New prototypes. * libgomp.h (struct gomp_device_descr): Add memcpy2d_func and memcpy3d_func. * libgomp.texi (nvtpx): Document when cuMemcpy2D/cuMemcpy3D is used. * oacc-host.c (memcpy2d_func, .memcpy3d_func): Init with NULL. * plugin/cuda-lib.def (cuMemcpy2D, cuMemcpy2DUnaligned, cuMemcpy3D): Invoke via CUDA_ONE_CALL. * plugin/plugin-nvptx.c (GOMP_OFFLOAD_memcpy2d, GOMP_OFFLOAD_memcpy3d): New. * target.c (omp_target_memcpy_rect_worker): (omp_target_memcpy_rect_check, omp_target_memcpy_rect_copy): Permit all device-to-device copyies; invoke new plugins for 2D and 3D copying when available. (gomp_load_plugin_for_device): DLSYM the new plugin functions. * testsuite/libgomp.c/target-12.c: Fix dimension bug. * testsuite/libgomp.fortran/target-12.f90: Likewise. * testsuite/libgomp.fortran/target-memcpy-rect-1.f90: New test.
Diffstat (limited to 'libgomp/libgomp.texi')
-rw-r--r--libgomp/libgomp.texi5
1 files changed, 5 insertions, 0 deletions
diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index 4ac01e9..073c181 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -5060,6 +5060,11 @@ The implementation remark:
list of available devices (``host fallback'').
@item The default per-warp stack size is 128 kiB; see also @code{-msoft-stack}
in the GCC manual.
+@item The OpenMP routines @code{omp_target_memcpy_rect} and
+ @code{omp_target_memcpy_rect_async} and the @code{target update}
+ directive for non-contiguous list items will use the 2D and 3D
+ memory-copy functions of the CUDA library. Higher dimensions will
+ call those functions in a loop and are therefore supported.
@end itemize