.. _amdgpu-async-operations:

===============================
AMDGPU Asynchronous Operations
===============================

.. contents::
   :local:

Introduction
============

Asynchronous operations are memory transfers (usually between the global memory
and LDS) that are completed independently at an unspecified scope. A thread that
requests one or more asynchronous transfers can use *async marks* to track
their completion. The thread waits for each mark to be *completed*, which
indicates that requests initiated in program order before this mark have also
completed.

Operations
==========

Memory Accesses
---------------

LDS DMA Operations
^^^^^^^^^^^^^^^^^^

.. code-block:: llvm

   ; "Legacy" LDS DMA operations
   void @llvm.amdgcn.load.async.to.lds(ptr %src, ptr %dst)
   void @llvm.amdgcn.global.load.async.lds(ptr %src, ptr %dst)
   void @llvm.amdgcn.raw.buffer.load.async.lds(ptr %src, ptr %dst)
   void @llvm.amdgcn.raw.ptr.buffer.load.async.lds(ptr %src, ptr %dst)
   void @llvm.amdgcn.struct.buffer.load.async.lds(ptr %src, ptr %dst)
   void @llvm.amdgcn.struct.ptr.buffer.load.async.lds(ptr %src, ptr %dst)

Request an async operation that copies the specified number of bytes from the
global/buffer pointer ``%src`` to the LDS pointer ``%dst``.

.. note::

   The above listing is *merely representative*. The actual function signatures
   are identical to their non-async variants, and supported only on the
   corresponding architectures (GFX9 and GFX10).

Async Mark Operations
---------------------

An *async mark* in the abstract machine tracks all the async operations that
are program ordered before that mark. A mark M is said to be *completed*
only when all async operations program ordered before M are reported by the
implementation as having finished, and it is said to be *outstanding* otherwise.


Thus we have the following sufficient condition:

  An async operation X is *completed* at a program point P if there exists a
  mark M such that X is program ordered before M, M is program ordered before
  P, and M is completed. X is said to be *outstanding* at P otherwise.

The abstract machine maintains a sequence of *async marks* during the
execution of a function body, which excludes any marks produced by calls to
other functions encountered in the currently executing function.

``@llvm.amdgcn.asyncmark()``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When executed, inserts an async mark in the sequence associated with the
currently executing function body.

``@llvm.amdgcn.wait.asyncmark(i16 %N)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Waits until there are at most N outstanding marks in the sequence associated
with the currently executing function body.

Memory Consistency Model
========================

Each asynchronous operation consists of a non-atomic read on the source and a
non-atomic write on the destination. Async "LDS DMA" intrinsics result in async
accesses that guarantee visibility relative to other memory operations as
follows:

  An asynchronous operation `A` program ordered before an overlapping memory
  operation `X` happens-before `X` only if `A` is completed before `X`.

  A memory operation `X` program ordered before an overlapping asynchronous
  operation `A` happens-before `A`.

.. note::

   The *only if* in the above wording implies that unlike the default LLVM
   memory model, certain program order edges are not automatically included in
   ``happens-before``.

Examples
========

Uneven blocks of async transfers
--------------------------------

.. code-block:: c++

   void foo(global int *g, local int *l) {
     // first block
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // second block; longer
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // third block; shorter
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // Wait for first block
     wait.asyncmark(2);
   }

Software pipeline
-----------------

.. code-block:: c++

   void foo(global int *g, local int *l) {
     // first block
     asyncmark();
     // second block
     asyncmark();
     // third block
     asyncmark();

     for (;;) {
       wait.asyncmark(2);
       // use data

       // next block
       asyncmark();
     }

     // flush one block
     wait.asyncmark(2);

     // flush one more block
     wait.asyncmark(1);

     // flush last block
     wait.asyncmark(0);
   }

Ordinary function call
----------------------

.. code-block:: c++

   extern void bar(); // may or may not make async calls

   void foo(global int *g, local int *l) {
     // first block
     asyncmark();
     // second block
     asyncmark();

     // function call
     bar();

     // third block
     asyncmark();

     wait.asyncmark(1); // will wait for at least the second block, possibly including bar()
     wait.asyncmark(0); // will wait for third block, including bar()
   }

Implementation notes
====================

[This section is informational.]

Optimization
------------

The implementation may eliminate async mark/wait intrinsics in the following
cases:

1. An ``asyncmark`` operation which is not included in the wait count of a
   later wait operation in the current function. In particular, an
   ``asyncmark`` which is not post-dominated by any ``wait.asyncmark``.
2. A ``wait.asyncmark`` whose wait count is more than the outstanding async
   marks at that point. In particular, a ``wait.asyncmark`` that is not
   dominated by any ``asyncmark``.

In general, at a function call, if the caller uses sufficient waits to track
its own async operations, the actions performed by the callee cannot affect
correctness.
But inlining such a call may result in redundant waits.

.. code-block:: c++

   void foo() {
     asyncmark(); // A
   }

   void bar() {
     asyncmark(); // B
     asyncmark(); // C
     foo();
     wait.asyncmark(1);
   }

Before inlining, the ``wait.asyncmark`` waits for mark B to be completed.

.. code-block:: c++

   void foo() {
   }

   void bar() {
     asyncmark(); // B
     asyncmark(); // C
     asyncmark(); // A from call to foo()
     wait.asyncmark(1);
   }

After inlining, the ``wait.asyncmark`` waits for mark C to complete, which is
longer than necessary. Ideally, the optimizer should have eliminated mark A in
the body of ``foo()`` itself.
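
The effect of inlining on the wait count can be checked with a small host-side
model of the per-function mark sequence. This is only a sketch: ``MarkSequence``,
``mark``, ``wait`` and the two helper functions are hypothetical names introduced
for illustration and are not part of LLVM or the AMDGPU backend. The model
pessimistically assumes that a mark completes only when a wait forces it to, and
that marks complete oldest first.

.. code-block:: c++

   #include <cstddef>
   #include <deque>
   #include <string>
   #include <vector>

   // Hypothetical model of the async mark sequence of one function body.
   struct MarkSequence {
     std::deque<std::string> outstanding; // oldest mark first

     // Models asyncmark(): append a mark to this function's sequence.
     void mark(const std::string &name) { outstanding.push_back(name); }

     // Models wait.asyncmark(n): retire the oldest marks until at most n
     // remain outstanding, and return the marks the wait had to cover.
     std::vector<std::string> wait(std::size_t n) {
       std::vector<std::string> completed;
       while (outstanding.size() > n) {
         completed.push_back(outstanding.front());
         outstanding.pop_front();
       }
       return completed;
     }
   };

   // bar() before inlining: mark A lives in foo()'s own sequence.
   std::vector<std::string> waited_before_inlining() {
     MarkSequence bar;
     bar.mark("B");
     bar.mark("C");
     MarkSequence foo; // the callee's marks are excluded from bar's sequence
     foo.mark("A");
     return bar.wait(1);
   }

   // bar() after inlining: mark A now joins bar()'s sequence.
   std::vector<std::string> waited_after_inlining() {
     MarkSequence bar;
     bar.mark("B");
     bar.mark("C");
     bar.mark("A");
     return bar.wait(1);
   }

In this model, ``waited_before_inlining()`` covers only mark B, while
``waited_after_inlining()`` covers both B and C, which matches the redundant
wait described above.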