


.. _amdgpu-async-operations:

===============================
 AMDGPU Asynchronous Operations
===============================

.. contents::
   :local:

Introduction
============

Asynchronous operations are memory transfers (usually between global memory
and LDS) that are completed independently at an unspecified scope. A thread that
requests one or more asynchronous transfers can use *async marks* to track
their completion. The thread waits for a mark to be *completed*; completion of
a mark indicates that all requests initiated in program order before that mark
have also completed.

Operations
==========

Memory Accesses
---------------

LDS DMA Operations
^^^^^^^^^^^^^^^^^^

.. code-block:: llvm

  ; "Legacy" LDS DMA operations
  void @llvm.amdgcn.load.async.to.lds(ptr %src, ptr %dst)
  void @llvm.amdgcn.global.load.async.lds(ptr %src, ptr %dst)
  void @llvm.amdgcn.raw.buffer.load.async.lds(ptr %src, ptr %dst)
  void @llvm.amdgcn.raw.ptr.buffer.load.async.lds(ptr %src, ptr %dst)
  void @llvm.amdgcn.struct.buffer.load.async.lds(ptr %src, ptr %dst)
  void @llvm.amdgcn.struct.ptr.buffer.load.async.lds(ptr %src, ptr %dst)

Request an async operation that copies the specified number of bytes from the
global/buffer pointer ``%src`` to the LDS pointer ``%dst``.

.. note::

   The above listing is *merely representative*. The actual function signatures
   are identical to those of the non-async variants, and the intrinsics are
   supported only on the corresponding architectures (GFX9 and GFX10).

Async Mark Operations
---------------------

An *async mark* in the abstract machine tracks all the async operations that
are program ordered before that mark. A mark M is said to be *completed*
only when all async operations program ordered before M are reported by the
implementation as having finished, and it is said to be *outstanding* otherwise.

Thus we have the following sufficient condition:

  An async operation X is *completed* at a program point P if there exists a
  mark M such that X is program ordered before M, M is program ordered before
  P, and M is completed. X is said to be *outstanding* at P otherwise.

The abstract machine maintains a sequence of *async marks* during the
execution of a function body, which excludes any marks produced by calls to
other functions encountered in the currently executing function.


``@llvm.amdgcn.asyncmark()``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When executed, inserts an async mark in the sequence associated with the
currently executing function body.

``@llvm.amdgcn.wait.asyncmark(i16 %N)``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Waits until there are at most ``%N`` outstanding marks in the sequence
associated with the currently executing function body.
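
The behaviour of the two intrinsics can be pictured with a small host-side
model. The following is a minimal C++ sketch, not LLVM code: the
``AsyncMarkModel`` type and all of its members are hypothetical names, and the
model assumes that a wait returns as soon as its condition is met, whereas a
real implementation may have completed more operations by then.

.. code-block:: c++

   #include <cassert>
   #include <cstddef>
   #include <deque>

   // Hypothetical model of the async mark sequence of one function body.
   struct AsyncMarkModel {
     std::size_t unmarkedOps = 0;         // async ops issued since the last mark
     std::deque<std::size_t> outstanding; // ops newly covered by each outstanding
                                          // mark, oldest mark first
     std::size_t completedOps = 0;        // ops known to be completed

     // Models issuing one async operation.
     void asyncOp() { ++unmarkedOps; }

     // Models asyncmark(): the new mark covers every op issued before it.
     void asyncmark() {
       outstanding.push_back(unmarkedOps);
       unmarkedOps = 0;
     }

     // Models wait.asyncmark(n): retire the oldest marks until at most n
     // marks remain outstanding.
     void waitAsyncmark(std::size_t n) {
       while (outstanding.size() > n) {
         completedOps += outstanding.front();
         outstanding.pop_front();
       }
       assert(outstanding.size() <= n);
     }
   };

In this model, issuing three operations and a mark, then five operations and a
mark, then two operations and a mark, and finally calling ``waitAsyncmark(2)``
leaves ``completedOps`` at 3: only the operations before the oldest mark are
guaranteed to have finished.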

Memory Consistency Model
========================

Each asynchronous operation consists of a non-atomic read on the source and a
non-atomic write on the destination. Async "LDS DMA" intrinsics result in async
accesses that guarantee visibility relative to other memory operations as
follows:

  An asynchronous operation `A` program ordered before an overlapping memory
  operation `X` happens-before `X` only if `A` is completed before `X`.

  A memory operation `X` program ordered before an overlapping asynchronous
  operation `A` happens-before `A`.

.. note::

   The *only if* in the above wording implies that unlike the default LLVM
   memory model, certain program order edges are not automatically included in
   ``happens-before``.
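
To make the *only if* concrete, the sketch below checks whether an async
operation is known to be completed at a given program point, which is the
condition under which a later overlapping access is ordered after it. This is
an illustrative C++ model of a straight-line function body under the sufficient
condition stated for async marks; ``Kind``, ``Event`` and ``completedAt`` are
hypothetical names and not part of LLVM.

.. code-block:: c++

   #include <cstddef>
   #include <deque>
   #include <vector>

   // Hypothetical event kinds for a straight-line function body.
   enum class Kind { AsyncOp, Mark, Wait };

   struct Event {
     Kind kind;
     std::size_t count; // wait count; only meaningful for Kind::Wait
   };

   // Returns true if the async op at index `op` is known to be completed at
   // program point `point`, that is, just before prog[point] would execute.
   bool completedAt(const std::vector<Event> &prog, std::size_t op,
                    std::size_t point) {
     std::deque<std::size_t> outstanding; // positions of outstanding marks
     bool anyCompleted = false;
     std::size_t newestCompleted = 0;     // position of newest completed mark
     for (std::size_t i = 0; i < point && i < prog.size(); ++i) {
       if (prog[i].kind == Kind::Mark) {
         outstanding.push_back(i);
       } else if (prog[i].kind == Kind::Wait) {
         // The wait retires the oldest marks until at most `count` remain.
         while (outstanding.size() > prog[i].count) {
           newestCompleted = outstanding.front();
           anyCompleted = true;
           outstanding.pop_front();
         }
       }
     }
     // The op is completed if some completed mark is program ordered after it.
     return anyCompleted && op < newestCompleted;
   }

An overlapping access placed at a point where ``completedAt`` returns false
has no happens-before edge from the async operation in this model, even though
it follows the operation in program order.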

Examples
========

Uneven blocks of async transfers
--------------------------------

.. code-block:: c++

   void foo(global int *g, local int *l) {
     // first block
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // second block; longer
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // third block; shorter
     async_load_to_lds(l, g);
     async_load_to_lds(l, g);
     asyncmark();

     // Wait for first block
     wait.asyncmark(2);
   }

Software pipeline
-----------------

.. code-block:: c++

   void foo(global int *g, local int *l) {
     // first block
     asyncmark();

     // second block
     asyncmark();

     // third block
     asyncmark();

     for (;;) {
       wait.asyncmark(2);
       // use data

       // next block
       asyncmark();
     }

     // flush one block
     wait.asyncmark(2);

     // flush one more block
     wait.asyncmark(1);

     // flush last block
     wait.asyncmark(0);
   }
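
The pipeline above can be simulated on the host to check that every block is
retired exactly once and in issue order. The following C++ sketch is an
illustration only: ``runPipeline`` and its integer block ids are hypothetical,
issuing a block stands in for the async loads plus ``asyncmark()``, and
retiring a mark stands in for the point where the block's data may be used.

.. code-block:: c++

   #include <cassert>
   #include <cstddef>
   #include <deque>
   #include <vector>

   // Simulates a pipeline that keeps `depth` blocks in flight over `total`
   // blocks and returns the order in which blocks were retired.
   std::vector<int> runPipeline(std::size_t depth, int total) {
     assert(depth >= 1 && static_cast<std::size_t>(total) >= depth);
     std::deque<int> outstanding; // blocks with an outstanding mark, oldest first
     std::vector<int> retired;
     int next = 0;

     // Stands in for the async loads of one block followed by asyncmark().
     auto issue = [&] { outstanding.push_back(next++); };
     // Stands in for wait.asyncmark(n); a retired block is safe to use.
     auto wait = [&](std::size_t n) {
       while (outstanding.size() > n) {
         retired.push_back(outstanding.front());
         outstanding.pop_front();
       }
     };

     for (std::size_t i = 0; i < depth; ++i) // prologue: fill the pipeline
       issue();
     while (next < total) {                  // steady state
       wait(depth - 1);
       issue();
     }
     for (std::size_t n = depth; n-- > 0;)   // epilogue: depth-1, ..., 1, 0
       wait(n);
     return retired;
   }

With ``depth`` set to 3, the steady-state wait count is 2 and the epilogue
issues waits with counts 2, 1 and 0, matching the example above.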

Ordinary function call
----------------------

.. code-block:: c++

   extern void bar(); // may or may not make async calls

   void foo(global int *g, local int *l) {
       // first block
       asyncmark();

       // second block
       asyncmark();

       // function call
       bar();

       // third block
       asyncmark();

       wait.asyncmark(1); // will wait for at least the second block, possibly including bar()
       wait.asyncmark(0); // will wait for third block, including bar()
   }

Implementation notes
====================

[This section is informational.]

Optimization
------------

The implementation may eliminate async mark/wait intrinsics in the following cases:

1. An ``asyncmark`` operation which is not included in the wait count of a later
   wait operation in the current function. In particular, an ``asyncmark`` which
   is not post-dominated by any ``wait.asyncmark``.
2. A ``wait.asyncmark`` whose wait count exceeds the number of outstanding
   async marks at that point. In particular, a ``wait.asyncmark`` that is not
   dominated by any ``asyncmark``.

In general, at a function call, if the caller uses sufficient waits to track
its own async operations, the actions performed by the callee cannot affect
correctness. But inlining such a call may result in redundant waits.

.. code-block:: c++

   void foo() {
     asyncmark(); // A
   }

   void bar() {
     asyncmark(); // B
     asyncmark(); // C
     foo();
     wait.asyncmark(1);
   }

Before inlining, the ``wait.asyncmark`` waits for mark B to be completed.

.. code-block:: c++

   void foo() {
   }

   void bar() {
     asyncmark(); // B
     asyncmark(); // C
     asyncmark(); // A from call to foo()
     wait.asyncmark(1);
   }

After inlining, the ``wait.asyncmark`` waits for mark C to be completed, which
is longer than necessary. Ideally, the optimizer should have eliminated mark A
in the body of ``foo()`` itself.
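
The effect of inlining on the wait can be checked with a short calculation.
The helper below is a hypothetical C++ sketch, not part of LLVM: given the
marks of one function body in program order, it returns the marks that a
``wait.asyncmark`` with count ``n`` forces to be completed.

.. code-block:: c++

   #include <cstddef>
   #include <string>
   #include <vector>

   // Returns the marks that wait.asyncmark(n) forces to be completed, given
   // the marks of the current function body in program order.
   std::vector<std::string> forcedComplete(const std::vector<std::string> &marks,
                                           std::size_t n) {
     if (marks.size() <= n)
       return {}; // the wait is already satisfied
     // All but the newest n marks must be completed.
     return std::vector<std::string>(
         marks.begin(), marks.end() - static_cast<std::ptrdiff_t>(n));
   }

For the sequence ``{"B", "C"}`` and a count of 1 the result is ``{"B"}``,
while for the inlined sequence ``{"B", "C", "A"}`` it is ``{"B", "C"}``.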