llvm/docs/ConvergentOperations.rst


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610

==============================
Convergent Operation Semantics
==============================

.. contents::
   :local:
   :depth: 4

Overview
========

Some parallel execution environments execute threads in groups that allow
efficient communication within the group using special primitives called
*convergent* operations. The outcome of a convergent operation is sensitive to
the set of threads that executes it "together", i.e., convergently. When control
flow :ref:`diverges <convergence-and-uniformity>`, i.e. threads of the same
group follow different
paths through the CFG, not all threads of the group may be available to
participate in this communication. This is the defining characteristic that
distinguishes convergent operations from other inter-thread communication:

  A convergent operation involves inter-thread communication or synchronization
  that occurs outside of the memory model, where the set of threads which
  participate in communication is implicitly affected by control flow.

For example, in the following GPU compute kernel, communication during the
convergent operation is expected to occur precisely among those threads of an
implementation-defined execution scope (such as workgroup or subgroup) for
which ``condition`` is true:

.. code-block:: c++

  void example_kernel() {
      ...
      if (condition)
          convergent_operation();
      ...
  }

In structured programming languages, there is often an intuitive and
unambiguous way of determining the threads that are expected to communicate.
However, this is not always the case even in structured programming languages,
and the intuition breaks down entirely in unstructured control flow. This
document describes the formal semantics in LLVM, i.e. how to determine the set
of communicating threads for convergent operations.

The definitions in this document leave many details open, such as how groups of
threads are formed in the first place. It focuses on the questions that are
relevant for deciding the correctness of generic program transforms and
convergence-related analyses such as :ref:`uniformity analysis
<convergence-and-uniformity>`.

.. _convergent_operations:

Convergent Operations
=====================

In LLVM IR, the only way to communicate between threads as described
above is by calling target-defined convergent intrinsics. Hence, only
a call-site in LLVM IR (a :ref:`call <i_call>`, :ref:`invoke
<i_invoke>`, or :ref:`callbr <i_callbr>` instruction) can result in a
convergent operation.

A function in LLVM IR is said to be *convergent* if it has the
:ref:`convergent <attr_convergent>` attribute.

A call-site in LLVM IR is said to be *convergent* if it is a direct
call to a convergent function or it has the :ref:`convergent
<attr_convergent>` attribute or a :ref:`convergencectrl operand bundle
<convergencectrl>`.

Informational notes:

  A function may have to be treated as convergent if that function, or
  transitively, any function called from it, contains a convergent call-site. A
  frontend generating the ``convergent`` attribute should take this into account
  when emitting functions and function calls. But this is not always the case:

  A non-convergent function may contain convergent operations; such operations
  do not directly depend on the set of threads that enter the function as a
  single communicating group. Instead, these operations depend on an
  implementation-defined subset of threads within the body of the function, as
  shown in :ref:`opportunistic_convergence`.

Examples of Convergent Operations
========================================

(This section is informative.)

Texture sampling in a pixel shader
----------------------------------

The following stylized pixel shader samples a texture at a given set of
coordinates, using the builtin function `textureSample`. Texture sampling
requires screen-space derivatives of the coordinates to determine the level of
detail (mipmap level) of the sample. They are commonly approximated by taking
the difference between neighboring pixels, which are computed by different
threads in the same group:

.. code-block:: c++

  void example_shader() {
    ...
    color = textureSample(texture, coordinates);
    if (condition) {
      use(color);
    }
    ...
  }

From a purely single-threaded perspective, sinking the `textureSample` into
the if-statement appears legal. However, if the condition is false for some
neighboring pixels, then their corresponding threads will not execute together
in the group, making it impossible to take the difference of coordinates as an
approximation of the screen-space derivative. In practice, the outcome will be
an undefined value.

That is, the `textureSample` operation fits our definition of a convergent
operation:

 1. It communicates with a set of threads that implicitly depends on control
    flow.
 2. Correctness depends on this set of threads.

The compiler frontend can emit IR that expresses the convergence constraints as
follows:

.. code-block:: llvm

  define void @example_shader() convergent {
    %entry = call token @llvm.experimental.convergence.entry()
    ...
    %color = call T @textureSample(U %texture, V %coordinates) [ "convergencectrl"(token %entry) ]
    br i1 %condition, label %then, label %end

  then:
    call void @use(T %color)
    br label %end

  end:
    ret void
  }

The :ref:`llvm.experimental.convergence.entry <llvm.experimental.convergence.entry>`
intrinsic is itself ``convergent``, and we expect it to communicate at least
among all threads of the same "quad" -- a group of 2x2 pixels that are
evaluated together for the purpose of approximating screen-space derivatives.
This fact is not part of the generic LLVM IR semantics; it would have to be
defined somewhere else, for example as part of target-specific ABI definitions
and/or in reference to some relevant API specs.

Since the ``@textureSample`` call then uses the token produced by the entry
intrinsic in its ``convergencectrl`` bundle, and has no additional control
dependencies, it must communicate among the same set of threads. This indicates
to generic program transforms that sinking the ``@textureSample`` call is
forbidden. (A program transform can still sink the call if it can prove somehow,
e.g. by leaning on target-specific callbacks that can analyze the program with
additional knowledge, that ``%condition`` is always uniform across the threads
referenced by the *convergence token* ``%entry``.)

.. _convergence_example_reductions:

Reductions inside divergent control flow
----------------------------------------

The following example shows that merging common code of branches can be
incorrect in the face of convergent operations:

.. code-block:: c++

  void example_kernel() {
    delta = ...
    if (delta > 0) {
      total_gains = subgroupAdd(delta);
      ...
    } else {
      total_losses = subgroupAdd(delta);
      ...
    }
  }

The ``subgroupAdd`` computing the ``total_gains`` will be executed by the
subset of threads with positive ``delta`` in a subgroup (wave), and so will sum
up all the ``delta`` values of those threads; and similarly for the
``subgroupAdd`` that computes the ``total_losses``.

If we were to hoist and merge the ``subgroupAdd`` above the if-statement, it
would sum up the ``delta`` across *all* threads instead.

The compiler frontend can emit IR that expresses the convergence constraints
as follows:

.. code-block:: llvm

  define void @example_kernel() convergent {
    %entry = call token @llvm.experimental.convergence.entry()
    %delta = ...
    %cc = icmp sgt i32 %delta, 0
    br i1 %cc, label %then, label %else

  then:
    %total_gains = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
    ...
    br label %end

  else:
    %total_losses = call i32 @subgroupAdd(i32 %delta) [ "convergencectrl"(token %entry) ]
    ...
    br label %end

  end:
    ...
  }

The entry intrinsic behaves like in the previous example: assuming that
``@example_kernel`` is an OpenCL kernel (as hinted at by the "subgroup"
terminology), we expect it to communicate among all threads within the
"subgroup". This typically maps to a SIMD vector on GPU hardware.

The calls to ``@subgroupAdd`` use the token produced by the entry intrinsic,
but they also have an additional control dependency. According to the rules
defined in this document, they only communicate among the subset of threads
that actually end up executing the respective (static) call site.

Hoisting them would remove the control dependency and cause them to communicate
among the full set of threads that the entry intrinsic communicated with.
Again, hoisting is allowed if it can be proven that ``%cc`` is always uniform
among the relevant set of threads: in that case, the ``@subgroupAdd`` already
communicates among the full set of threads in the original program.

Motivating Examples of Convergence Control
==========================================

(This section is informative.)

Unstructured control flow
-------------------------

Consider an example of how jump threading removes structure in a way that can
make semantics non-obvious without the convergence intrinsics described in this
document:

.. code-block:: llvm

  void example_original() {
  entry:
      ...
      br i1 %cond1, label %then1, label %mid

  then1:
      ...
      %cond2 = ...
      br label %mid

  mid:
      %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
      br i1 %flag, label %then2, label %end

  then2:
      ...
      call void @subgroupControlBarrier()
      ...
      br label %end

  end:
  }

  void example_jumpthreaded() {
  entry:
      ...
      br i1 %cond1, label %then1, label %then2

  then1:
      ...
      %cond2 = ...
      br i1 %cond2, label %then2, label %end

  then2:
      ...
      call void @subgroupControlBarrier()
      ...
      br label %end

  end:
  }

Is the control barrier guaranteed to synchronize among the same set of threads
in both cases? Different implementations in the literature may give different
answers to this question:

* In an implementation that reconverges at post-dominators, threads reconverge
  at ``mid`` in the first version, so that all threads (within a subgroup/wave)
  that execute the control barrier do so together. In the second version,
  threads that reach the control barrier via different paths synchronize
  separately: the first (and only) post-dominator is ``end``, so threads do not
  reconverge before then.

* An implementation that sorts basic blocks topologically and ensures maximal
  reconvergence for each basic block would behave the same way in both
  versions.

We generally take the stance that reconvergence in acyclic control flow must
be maximal. The compiler frontend could augment the original code as follows:

.. code-block:: llvm

  define void @example_original() convergent {
  entry:
    %entry = call token @llvm.experimental.convergence.entry()
    ...
    br i1 %cond1, label %then1, label %mid

  then1:
    ...
    %cond2 = ...
    br label %mid

  mid:
    %flag = phi i1 [ true, %entry ], [ %cond2, %then1 ]
    br i1 %flag, label %then2, label %end

  then2:
    ...
    call void @subgroupControlBarrier() [ "convergencectrl"(token %entry) ]
    ...
    br label %end

  end:
  }

If S is the set of threads that the entry intrinsic communicated with, then
the ``@subgroupControlBarrier`` call communicates with the subset of S that
actually reaches the call site. This set of threads doesn't change after
jump-threading, so the answer to the question posed above remains the same.

.. _opportunistic_convergence:

Opportunistic convergent operations
-----------------------------------

Some programs have local regions of code that contain a sequence of convergent
operations where the code does not care about the exact set of threads with
which it is executed, but only that the set of threads is the same for all the
operations within the sequence. (If a subset of the convergent operations in the
sequence have additional, non-uniform control dependencies, then this is not
possible. However, the code may still require that the sets of threads are
logically consistent with the conditions of those control dependencies.) In this
case, :ref:`llvm.experimental.convergence.anchor
<llvm.experimental.convergence.anchor>` can be used to express the desired
semantics.

The following example function could be part of a hypothetical "append buffer"
implementation, where threads conditionally write fixed-sized records
contiguously into a global buffer. The function ``@reserveSpaceInBuffer``
returns the index into the buffer at which the calling thread should store its
data.

This could be achieved by using a simple atomic operation in every thread to
bump an allocation counter.

However, the following implementation can be more performant on some hardware,
because it uses only a single atomic operation for an entire group of threads.
To do this, it first determines the total size of the group, which will be the
operand to the atomic operation, and then later broadcasts the result of the
atomic operation to all threads of the group, so that each thread can compute
its individual position in the buffer:

.. code-block:: llvm

  define i32 @reserveSpaceInBuffer() {    ; NOTE: _not_ a convergent function!
  entry:
    %anchor = call token @llvm.experimental.convergence.anchor()

    %ballot = call i64 @subgroupBallot(i1 true) [ "convergencectrl"(token %anchor) ]
    %numThreads.p = call i64 @llvm.ctpop.i64(i64 %ballot)
    %numThreads = trunc i64 %numThreads.p to i32

    %absoluteThreadIdx = call i32 @getSubgroupLocalInvocationId()
    %absoluteThreadIdx.ext = zext i32 %absoluteThreadIdx to i64
    %mask.p = shl i64 1, %absoluteThreadIdx.ext
    %mask = sub i64 %mask.p, 1

    %maskedBallot = and i64 %ballot, %mask
    %relativeThreadIdx.p = call i64 @llvm.ctpop.i64(i64 %maskedBallot)
    %relativeThreadIdx = trunc i64 %relativeThreadIdx.p to i32

    %isFirstThread = icmp eq i32 %relativeThreadIdx, 0
    br i1 %isFirstThread, label %then, label %end

  then:
    %baseOffset.1 = atomicrmw add ptr @bufferAllocationCount, i32 %numThreads monotonic
    br label %end

  end:
    %baseOffset.2 = phi i32 [ undef, %entry ], [ %baseOffset.1, %then ]
    %baseOffset = call i32 @subgroupBroadcastFirst(i32 %baseOffset.2) [ "convergencectrl"(token %anchor) ]
    %offset = add i32 %baseOffset, %relativeThreadIdx
    ret i32 %offset
  }

The key here is that the function really doesn't care which set of threads it
is being called with. It takes whatever set of threads it can get. What the
implementation of the function cares about is that the initial
``@subgroupBallot`` -- which is used to retrieve the bitmask of threads that
executed the anchor together -- executes with the same set of threads as the
final ``@subgroupBroadcastFirst``. Nothing else is required for correctness as
far as convergence is concerned.

The function ``@reserveSpaceInBuffer`` itself is _not_ ``convergent``: callers
are free to move call sites of the function as they see fit. This can change
the behavior in practice, by changing the sets of threads that are grouped
together for the atomic operation. This can be visible in the output of the
program, since the order in which outputs appear in the buffer is changed.
However, this does not break the overall contract that ``@reserveSpaceInBuffer``
has with its caller -- which makes sense: the order of outputs is
non-deterministic anyway because of the atomic operation that is involved.

If the function is inlined, the use of the anchor intrinsic similarly indicates
that certain transforms which are usually forbidden by the presence of
convergent operations are in fact allowed, as long as they don't break up the
region of code that is controlled by the anchor.

.. _convergence_high-level_break:

Extended Cycles: Divergent Exit from a Loop
-------------------------------------------

High-level languages typically provide a ``break`` statement that transfers
control out of a loop statement. In most cases, the loop is structured and hence
there is no ambiguity about convergence inside the loop. But an ambiguity arises
when a ``break`` is control dependent on a divergent condition inside the loop.
Consider the following example:

.. code-block:: c++

  void example() {
    // A
    ...
    for (...) {
      // B
      if (condition) { // divergent condition
        // C
        convergent_op();
        break;
      }
      // D
      ...
    }
    // E
  }

In this program, the call to convergent_op() is lexically "inside" the ``for``
loop. But when translated to LLVM IR, the basic block B is an exiting block
ending in a divergent branch, and the basic block C is an exit of the loop.
Thus, the call to convergent_op() is outside the loop. This causes a mismatch
between the programmer's expectation and the compiled program. The call should
be executed convergently on every iteration of the loop, by threads that
together take the branch to exit the loop. But when compiled, all threads that
take the divergent exit on different iterations first converge at the beginning
of basic block C and then together execute the call to convergent_op().

In this case, :ref:`llvm.experimental.convergence.loop
<llvm.experimental.convergence.loop>` can be used to express the desired
semantics. A call to this intrinsic is placed in the loop header, which tracks
each iteration of the loop. The token produced by this is used as a
``convergencectrl`` operand to the convergent call. The semantics of the
``loop`` intrinsic ensures that the convergent call is performed convergently
only by those threads that convergently exited the loop in a given iteration.

.. code-block:: llvm

  define void @example() convergent {
    %entry = call token @llvm.experimental.convergence.entry()
    br label %for

  for:
    %inner = call token @llvm.experimental.convergence.loop() ["convergencectrl"(token %entry)]
    %for.cond = i1 ...
    br i1 %for.cond, label %B, label %E

  B:
    ...
    %condition = i1 ...
    br i1 %condition, label %C, label %D

  C:
    call void @convergent_op() ["convergencectrl"(token %inner)]
    br label %E

  D:
    ...
    br label %for

  E:
    ...
    ret void
  }

The LLVM IR version of the same program shows a cycle consisting of the basic
blocks ``%for``, ``%B`` and ``%D``, while ``%C`` is an exit reached by the
divergent branch at the end of the exiting block ``%B``. But the use of
convergence control tokens makes it clear that block ``%C`` must be executed
convergently only by those threads that convergently take the exit edge from %B
to ``%C``. In other words, the convergent execution of ``%C`` is governed by the
call to the :ref:`llvm.experimental.convergence.loop
<llvm.experimental.convergence.loop>` intrinsic inside the cycle. The cycle is
effectively extended to include all uses of this token that lie outside the
cycle.

.. _dynamic_instances_and_convergence_tokens:

Dynamic Instances and Convergence Tokens
========================================

Every execution of an LLVM IR instruction occurs in a :ref:`dynamic instance
<convergence-dynamic-instances>` of the instruction. Dynamic instances are the
formal objects by which we talk about communicating threads in convergent
operations. Dynamic instances are defined for *all* operations in an LLVM
program, whether convergent or not. Convergence control is primarily about the
dynamic instances of convergent operations since they affect execution of the
program through inter-thread communication. The dynamic instances for
non-convergent operations are relevant for determining :ref:`uniformity
<convergence-and-uniformity>` of values.

Dynamic instances produced by the execution of the same *convergent operation*
by different threads may be :ref:`converged <convergence-definition>`. When
executing a convergent operation, the set of threads that execute converged
dynamic instances is the set of threads that communicate with each other.
*Convergence tokens* capture this convergence as described below.

*Convergence tokens* are values of ``token`` type, i.e. they cannot be used in
``phi`` or ``select`` instructions. A convergence token value represents the
dynamic instance of the instruction that produced it.

Convergent operations may have an optional ``convergencectrl`` operand bundle with
a convergence token operand to define the set of communicating threads relative
to the operation that defined the token.

   Let ``U`` be a convergent operation other than a call to a convergence
   control intrinsic, and ``D`` be the convergent operation that defines
   the token value used as the ``convergencectrl`` operand to ``U``. Two
   threads execute converged dynamic instances of ``U`` if and only if the
   token value in both threads was returned by converged dynamic
   instances of ``D``.

.. note::

   The text defines convergence token values as representing dynamic instances.
   But if we were to assume that converged dynamic instances produce the same
   token value, then we could almost think of the token value as representing a
   set of threads instead -- specifically, the set ``S`` of threads that
   executed converged dynamic instances of the defining instruction ``D``.

   In this intuitive picture, when a convergence token value ``T`` is used by a
   ``convergencectrl`` bundle on an instruction ``I``, then the set of threads that
   communicates in ``I`` is a subset of the set ``S`` represented by the token value.
   Specifically, it is the subset of threads that ends up executing ``I`` while
   using the token value.

   This by itself wouldn't quite work as a definition: what if ``I`` is executed
   multiple times by the same threads? Which execution of ``I`` in thread 1
   communicates with which execution of ``I`` in thread 2? Leaning on the notion
   of dynamic instances gives a robust answer to this question as long as ``D``
   and ``I`` are at the same loop (or cycle) nesting level.

   The case where ``D`` and ``I`` are at different loop nesting levels is
   forbidden by the :ref:`static rules <convergence_static_rules>` -- handling
   that case is the purpose of :ref:`llvm.experimental.convergence.loop
   <llvm.experimental.convergence.loop>`.

.. _convergence_control_intrinsics:

Convergence Control Intrinsics
==============================

This section describes target-independent intrinsics that can be used to
produce convergence tokens.

Behaviour is undefined if a convergence control intrinsic is called
indirectly.

.. _llvm.experimental.convergence.entry:

``llvm.experimental.convergence.entry``
----------------------------------------

.. code-block:: llvm

  token @llvm.experimental.convergence.entry() convergent readnone

This intrinsic is used to tie the dynamic instances inside of a function to
those in the caller.

1. If the function is called from outside the scope of LLVM, the convergence of
   dynamic instances of this intrinsic are environment-defined. For example:

   a. In an OpenCL *kernel launch*, the maximal set of threads that
      can communicate outside the memory model is a *workgroup*.
      Hence, a suitable choice is to specify that all the threads from
      a single workgroup in OpenCL execute converged dynamic instances
      of this intrinsic.
   b. In a C/C++ program, threads are launched independently and they can
      communicate only through the memory model. Hence the dynamic instances of
      this intrinsic in a C/C++ program are never converged.
2. If the function is called from a call-site in LLVM IR, then two
   threads execute converged dynamic instances of this intrinsic if and
   only if both threads entered the function by executing converged
   dynamic instances of the call-site.

This intrinsic can occur at most once in a function, and only in the entry
block of the function. If this intrinsic occurs in a basic block, then it must
precede any other convergent operation in the same basic block.

It is an error if this intrinsic appears in a non-convergent function.

It is an error to specify a ``convergencectrl`` operand bundle at a
call to this intrinsic.

Function inlining substitutes this intrinsic with the token from the operand
bundle. For example:

.. code-block:: c++

  // Before inlining:

  void callee() convergent {
    %tok = call token @llvm.experimental.convergence.entry()
    convergent_operation(...) [ "convergencectrl"(token %tok) ]
  }

  void main() {
    %outer = call token @llvm.experimental.convergence.anchor()
    for (...) {
      %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
      callee() [ "convergencectrl"(token %inner) ]
    }
  }

  // After inlining:

  void main() {
    %outer = call token @llvm.experimental.convergence.anchor()
    for (...) {
      %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
      convergent_operation(...) [ "convergencectrl"(token %inner) ]
    }
  }

.. _llvm.experimental.convergence.loop:

``llvm.experimental.convergence.loop``
--------------------------------------

.. code-block:: llvm

  token @llvm.experimental.convergence.loop() [ "convergencectrl"(token) ] convergent readnone

This intrinsic represents the place where an imaginary counter is incremented
for determining convergence inside a control flow cycle.

Let ``U`` be a call to this intrinsic and ``D`` be the convergent operation that
defines the token value used as the ``convergencectrl`` operand to ``U``. Two
threads execute converged dynamic instances of ``U`` if and only if:

1. The token value in both threads was returned by converged dynamic
   instances of ``D``, and,
2. There is an integer *n* such that both threads execute ``U`` for the *n*'th time
   with that token value.

It is an error to omit the ``convergencectrl`` operand bundle on a
call to this intrinsic.

If this intrinsic occurs in a basic block, then it must precede any other
convergent operation in the same basic block.

.. _convergence_cycle_heart:

**Heart of a Cycle:**

  If a :ref:`cycle <cycle-terminology>` ``C`` contains an occurrence ``H`` of
  this intrinsic whose token operand is defined outside ``C``, then ``H`` is
  called the heart of ``C``.

  .. note::

     The static rules for cycles imply that a heart can occur only in the header
     of a natural loop. This ensures that the heart closely represents the
     intuitive notion of a loop iteration. If this restriction is relaxed, the
     resulting semantics provides a new notion of "cycle iteration" even for
     irreducible cycles. But this allows a natural loop to have a heart in a
     node other than its header, which has interesting consequences on the
     meaning of a loop iteration in terms of convergence. For now, we disallow
     this situation since its practical application is very rare.

.. _llvm.experimental.convergence.anchor:

``llvm.experimental.convergence.anchor``
----------------------------------------

.. code-block:: llvm

  token @llvm.experimental.convergence.anchor() convergent readnone

This intrinsic produces an initial convergence token that is independent from
any "outer scope". The set of threads executing converged dynamic instances of
this intrinsic is implementation-defined.

It is an error to pass a ``convergencectrl`` operand bundle at a
call to this intrinsic.

.. note::

   The expectation is that all threads within a group that "happen to be active
   at the same time" will execute converged dynamic instances, so that programs
   can detect the maximal set of threads that can communicate efficiently within
   some local region of the program.

.. _convergence_uncontrolled:

Uncontrolled Convergent Operations
==================================

Convergent operations with an explicit ``convergencectrl`` operand bundle are
called *controlled convergent operations*. All other convergent operations are
said to be *uncontrolled*.

An uncontrolled convergent operation is said to have *implicit convergence
control* determined by the ``convergent`` attribute alone. The semantics of the
``convergent`` attribute as implemented in LLVM differs from the documented
semantics. The implementation tries to follow common intuition about convergent
operations, which remains under-specified. As such, it is not possible to fully
translate implicit convergence control into explicit convergence control tokens,
and these two modes cannot be mixed in the same function.

  If a function contains a controlled convergent operation, then all convergent
  operations in that function must either be controlled operations or calls to
  the convergence control intrinsics.

Inferring Tokens
----------------

(This section is informational)

Sometimes, it may be necessary to reinterpret the implicit convergence control
in terms of explicit convergence control tokens. For example, this may happen
when a function call is inlined, and either the caller or the callee contains
uncontrolled convergent operations.

Some uses of uncontrolled convergent operations may need to satisfy the
following property:

  For an environment-defined group of threads (such as an OpenCL workgroup or
  subgroup), if one thread in the group executes a convergent operation, then
  all threads in the group do so convergently with that thread.

In terms of explicit convergence control, this means that the
``convergencectrl`` operand on each convergent operation ``X`` must ultimately
originate from a call to the :ref:`llvm.experimental.convergence.entry
<llvm.experimental.convergence.entry>` intrinsic. This preserves the possibility
that the group of threads that converge on reaching ``X`` is the same group that
originally started executing the program in convergence. In comparison, the
:ref:`llvm.experimental.convergence.anchor
<llvm.experimental.convergence.anchor>` intrinsic captures an
implementation-defined group of threads, which is insufficient to support the
above property.

One way to approximate implicit convergence control in terms of explicit
convergence control tokens is the following procedure, which preserves the above
mentioned property:

1. Convert every irreducible cycle into a reducible cycle.
2. Insert a call to :ref:`llvm.experimental.convergence.entry
   <llvm.experimental.convergence.entry>` at the start of the entry block of the
   function.
3. Insert a call to :ref:`llvm.experimental.convergence.loop
   <llvm.experimental.convergence.loop>` at the start of every loop header. If
   this loop is an outermost loop, the ``convergencectrl`` operand is the call
   to :ref:`llvm.experimental.convergence.entry
   <llvm.experimental.convergence.entry>` in the entry block of the function.
   Otherwise, the ``convergencectrl`` operand is the call to
   :ref:`llvm.experimental.convergence.loop
   <llvm.experimental.convergence.loop>` in the parent loop's header.
4. For each uncontrolled convergent operation ``X``, add a ``convergencectrl``
   operand bundle using the token defined by a definition ``D`` that is a
   :ref:`sibling <cycle-sibling>` to this operation. ``D`` always dominates
   ``X`` --- if ``X`` is not in any cycle, then ``D`` is a call to
   :ref:`llvm.experimental.convergence.entry
   <llvm.experimental.convergence.entry>`; otherwise ``D`` is the heart of the
   parent cycle of ``X``.

.. _convergence_static_rules:

Static Rules
============

A *well-formed* program in LLVM IR must satisfy the following static
rules about cycles and convergence regions.

Closed Paths
------------

A :ref:`closed path <cycle-closed-path>` in a CFG is a connected sequence of
nodes and edges in the CFG whose start and end points are the same.

1. Every closed path in the CFG that contains a use of a convergence token T other
   than a use by
   :ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
   must also contain the definition of T.

2. Every closed path in the CFG that contains two different uses of a convergence
   token T must also contain the definition of T.

3. Every closed path in the CFG that contains uses of two different convergence tokens
   T1 and T2 must also contain the definition of at least one of them.

Taken together, these rules imply that for every closed path C, there can be at most
one convergence token T which is used in C but defined outside of it, and that
T can be used only once in C, and only by ``llvm.experimental.convergence.loop``.

4. In every closed path that contains a use U of a token T but not the
   definition of T, U must dominate all nodes in the closed path.

This implies that ``llvm.experimental.convergence.loop`` can appear as a heart
only in the header of a natural loop.

**Sufficient Conditions:** From the :ref:`properties of cycles
<cycle-closed-path>`, it is sufficient to prove the above properties
for cycles instead of closed paths. Briefly, any closed path that violates
one or more of the above static rules is contained in a cycle that also
violates the same rule(s).

.. _convergence_region:

Convergence Regions
-------------------

The *convergence region* of a convergence token T is the minimal region in
which T is live and used, i.e., the set of program points dominated by the
definition D of T from which a use of T can be reached.

The following static rule about convergence regions must be satisfied by
valid programs:

   If a convergence region R for a token T1 contains a use of a convergence
   token T2, then R must also contain the definition of T2. (In other words,
   convergence regions must be reasonably nested.)

.. note::

   For brevity, this document uses the term "convergence region of a token
   definition ``D``" to actually refer to the convergence region of the token
   ``T`` defined by ``D``.

.. _inferring_noconvergent:

Inferring non-convergence
=========================

When the target or the environment guarantees that threads do not
communicate using convergent operations or that threads never diverge,
the dynamic instances in the program are irrelevant and an optimizer
may remove any occurrence of the ``convergent`` attribute on a
call-site or a function and any explicit ``convergencectrl`` operand
bundle at a call-site.

An optimizer may remove the ``convergent`` attribute and any explicit
``convergencectrl`` operand bundle from a call-site if it can prove
that the execution of this call-site always results in a call to a
non-convergent function.

An optimizer may remove the ``convergent`` attribute on a function if it can
prove that the function does not contain a call to
:ref:`llvm.experimental.convergence.entry
<llvm.experimental.convergence.entry>`, or any uncontrolled convergent
operations.

Memory Model Non-Interaction
============================

The fact that an operation is convergent has no effect on how it is treated for
memory model purposes. In particular, an operation that is ``convergent`` and
``readnone`` does not introduce additional ordering constraints as far as the
memory model is concerned. There is no implied barrier, neither in the memory
barrier sense nor in the control barrier sense of synchronizing the execution
of threads.

Informational note: Threads that execute converged dynamic instances do not
necessarily do so at the same time.


Other Interactions
==================

A function can be both ``convergent`` and
``speculatable``, indicating that the function does not have undefined
behavior and has no effects besides calculating its result, but is still
affected by the set of threads executing this function. This typically
prevents speculation of calls to the function unless the constraint imposed
by ``convergent`` is further relaxed by some other means.

Controlled Maximal Convergence
==============================

The :ref:`converged-with relation <convergence-definition>` over dynamic
instances of each controlled convergent operation is completely defined by the
semantics of convergence tokens. But the implementation-defined convergence at a
call to :ref:`llvm.experimental.convergence.anchor
<llvm.experimental.convergence.anchor>` also depends on the cycle hierarchy
chosen if it occurs inside an irreducible cycle.

When the token defined by a convergent operation ``D`` is used at another
convergent operation ``U``, the implementation must ensure that the threads that
converge at ``U`` are all the threads that reached ``U`` after converging at
``D``. On most implementations, it is reasonable to assume that only these
threads are converged at every node they reach on any path from ``D`` to ``U``.
In other words, the converged-with relation at ``D`` produces groups of threads
that can converge only within each group, while inside the convergence region of
``D``.

All this affects the :ref:`maximal converged-with relation
<convergence-maximal>` over dynamic instances and in turn the :ref:`m-converged
property <uniformity-analysis>` of static instances in the convergence region of
``D``.

.. _controlled_maximal_converged_with:

  **Controlled Maximal converged-with Relation**

  1. Dynamic instances of a *convergent operation* are related in the controlled
     maximal converged-with relation according to the semantics of the convergence
     control tokens.
  2. Dynamic instances ``X1`` and ``X2`` produced by different threads for the
     same *non-convergent operation* ``X`` are related in the controlled maximal
     converged-with relation if and only if:

     1. Both threads executed converged dynamic instances of every token
        definition ``D`` such that ``X`` is in the convergence region of ``D``,
        and,
     2. Either ``X`` is not contained in any cycle, or, for every cycle ``C``
        with header ``H`` that contains ``X``:

        - every dynamic instance ``H1`` of ``H`` that precedes ``X1`` in the
          respective thread is convergence-before ``X2``, and,
        - every dynamic instance ``H2`` of ``H`` that precedes ``X2`` in the
          respective thread is convergence-before ``X1``,
        - without assuming that ``X1`` is converged with ``X2``.

.. _controlled_m_converged:

  **Controlled m-converged Static Instances**

  A node ``X`` in a given CFG is reported to be m-converged if and only if:

  1. For any token definition ``D`` such that ``X`` is inside the convergence region
     of ``D``, ``D`` itself is m-converged, and,
  2. Every cycle that contains ``X`` satisfies the following necessary
     conditions:

     a. Every divergent branch inside the cycle satisfies the :ref:`diverged
        entry criterion<convergence-diverged-entry>`, and,
     b. There are no :ref:`diverged paths reaching the
        cycle<convergence-diverged-outside>` from a divergent branch outside it.

Temporal Divergence at Cycle Exit
---------------------------------

When a cycle has a divergent exit, maximal convergence assumes that all threads
converge at the exit block. But if a controlled convergent operation outside the
cycle uses a token defined by an operation ``D`` inside the cycle, the
convergence region of ``D`` now extends outside the cycle. If two threads
executed converged dynamic instances of ``D`` before exiting the cycle, then
they continue to execute converged dynamic instances of nodes in the convergence
region of ``D`` outside the cycle. Thus, for a value ``V`` defined inside the
cycle, any use ``U`` of ``V`` within the convergence region of ``T`` uses the
output of converged dynamic instances of ``V``. If ``V`` is uniform, then its
use at such a ``U`` is also uniform. In other words, temporal divergence applies
only to a use of ``V`` that is outside the convergence region of ``D``.

Rationales for Static rules about cycles
========================================

(This section is informative.)

.. note::

   For convenience, we use the operator ``==`` to represent the relation
   ``converged-with`` and the operator ``!=`` to represent its negation.

Consider a loop with (incorrect!) convergence control as in the following
pseudocode:

.. code-block:: llvm

  ; WARNING: Example of incorrect convergence control!

  %anchor = call token @llvm.experimental.convergence.anchor()
  for (;;) {
    ...
    call void @convergent.op() [ "convergencectrl"(token %anchor) ]
    ...
  }

This code is forbidden by the first static rule about cycles.

A first formal argument why we have to do this is that the dynamic rule for
deciding whether two threads execute converged dynamic instances of
``@convergent.op`` leads to a logical contradiction in this code.
Assume two threads execute converged dynamic instances of the anchor
followed by two iterations of the loop. Thread 1 executes dynamic instances
I1 and I2 of ``@convergent.op``, thread 2 executes dynamic instances J1 and J2.
Using all the rules, we can deduce:

1. ``I1 != I2`` and ``J1 != J2`` by the basic rules of dynamic instances.

2. ``I1 == J1`` by the first dynamic rule about controlled convergent
   operations: both threads execute the same static instruction while using
   a convergence token value produced by converged dynamic instances of an
   instruction (the anchor).

3. ``I1 == J2`` by the same argument. Also, ``I2 == J1`` and ``I2 == J2``.

   The fact that one may be *intuitively* tempted to think of ``I1`` and ``J2``
   as being executed in different loop iterations is completely irrelevant for
   the *formal* argument. There is no mechanism in LLVM IR semantics for
   forming associations between loop iterations in different threads, *except*
   for the rules defined in this document -- and the rules in this document
   require a loop heart intrinsic for talking about loop iterations.

4. By transitivity, we have ``I1 == I2`` and ``J1 == J2``. That is a
   contradiction.

This problem goes away by inserting a loop heart intrinsic as follows, which
establishes a relationship between loop iterations across threads.

.. code-block:: llvm

  %anchor = call token @llvm.experimental.convergence.anchor()
  for (;;) {
    %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
    ...
    call void @convergent.op() [ "convergencectrl"(token %loop) ]
    ...
  }

In the same scenario of two threads executing converged dynamic instances of the
anchor and then two iterations of the loop, the dynamic rule about loop heart
intrinsics implies that both threads execute the converged dynamic instances of
the loop heart intrinsic in their respective first iterations and then again in
their respective second iterations of the loop.

This then implies that they execute converged dynamic instances ``I1 == J1`` of
the ``@convergent.op`` in their first iterations and then
``I2 == J2`` in their second iterations. The rule is an "if and only if" rule,
so it also implies that ``I1 != J2`` and ``I2 != J1``, because those executions
see token values of ``%loop`` originating from non-converged dynamic
instances of the loop intrinsic.

One may ask whether we could change the dynamic rule instead of adding the
static rule about cycles. That is impractical due to deeper difficulties.
Consider the following loop, again with incorrect convergence control:

.. code-block:: llvm

  ; WARNING: Example of incorrect convergence control!

  ; (A)
  %anchor = call token @llvm.experimental.convergence.anchor()
  for (;;) {
    ; (B)
    if (condition1) {
      ; (C)
      call void @convergent.op.1() [ "convergencectrl"(token %anchor) ]
    }
    ; (D)
    if (condition2) {
      ; (E)
      call void @convergent.op.2() [ "convergencectrl"(token %anchor) ]
    }
    ; (F)
  }
  ; (G)

Assume two threads execute converged dynamic instances of the anchor followed
by this sequence of basic blocks:

.. code-block:: text

  Thread 1: A B C D F B D E F G
  Thread 2: A B D E F B C D F G

That is, both threads execute two iterations of the loop, but they execute
the different convergent operations in different iterations. Without forming a
relation between loop iterations across the threads, there is no reasonable way
of defining which dynamic instances of the convergent operations should be the
same across the threads, if any.

Again, this can be addressed by adding a loop heart intrinsic, most naturally
as:

.. code-block:: llvm

  ; (A)
  %anchor = call token @llvm.experimental.convergence.anchor()
  for (;;) {
    ; (B)
    %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
    if (condition1) {
      ; (C)
      call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
    }
    ; (D)
    if (condition2) {
      ; (E)
      call void @convergent.op.2() [ "convergencectrl"(token %loop) ]
    }
    ; (F)
  }
  ; (G)

Let ``%loop(i;j)`` be the dynamic instance of ``j``-th execution of the loop
heart intrinsic by thread ``i``, and analogously ``@op.k(i)`` and ``@op.k(i)``
the dynamic instances of the execution of ``@convergent.op.k`` by thread ``i``.
Then we have:

1. ``%loop(1;j) == %loop(2;j)`` for ``j = 1, 2`` because of the dynamic rule
   about loop heart intrinsics.

2. ``%loop(i;1) != %loop(i;2)`` for ``i = 1, 2`` because of the basic rule that
   different executions by the same thread happen in different dynamic
   instances.

3. ``@op.1(1) != @op.1(2)``, since ``@op.1(1)`` uses the token value of ``%loop``
   referring to ``%loop(1;1)`` and ``@op.1(2)`` uses that
   referring to ``%loop(2;2) == %loop(1;2)``, which is different from
   ``%loop(1;1)``.

4. Similarly, ``@op.2(1) != @op.2(2)``.

However, loop heart intrinsics could be inserted differently, at the cost
of also inserting a free-standing anchor:

.. code-block:: llvm

  ; (A)
  %anchor = call token @llvm.experimental.convergence.anchor()
  for (;;) {
    ; (B)
    if (condition1) {
      ; (C)
      %loop = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
      call void @convergent.op.1() [ "convergencectrl"(token %loop) ]
    }
    ; (D)
    if (condition2) {
      ; (E)
      %free = call token @llvm.experimental.convergence.anchor()
      call void @convergent.op.2() [ "convergencectrl"(token %free) ]
    }
    ; (F)
  }
  ; (G)

This leads to the "unnatural counting of loop iterations" that is also mentioned
elsewhere. Let ``%loop(i)`` be the dynamic instance of the execution of the
loop heart intrinsic by thread ``i`` (each thread executes it only once), and
let ``@op.k(i)`` be as before. Then:

1. ``%loop(1) == %loop(2)`` because of the dynamic rule about loop heart
   intrinsics.

2. ``@op.1(1) == @op.1(2)`` because ``@op.1(i)`` uses the value of ``%loop``
   referring to ``%loop(i)``, and ``%loop(1) == %loop(2)``.

3. Whether ``@op.2(1) == @op.2(2)`` is implementation-defined because of the
   use of the ``%free`` anchor intrinsic.

   In practice, they almost certainly have to be non-converged dynamic
   instances. Consider that if an implementation strictly follows the order of
   instructions given in the program, the executions of the threads can be
   "aligned" as follows:

   .. code-block:: text

     Thread 1: A B         C D F B D E F G
     Thread 2: A B D E F B C D F         G

   So then ``@op.2(1)`` physically executes later than ``@op.2(2)`` and there
   can be no communication between the threads, which means they execute
   non-converged dynamic instances.

   That said, it is conceivable that there aren't actually any data or other
   dependencies that would enforce this execution order. In that case, a highly
   out-of-order implementation could potentially allow communication. That's
   why the rules defined in this document are silent about whether
   ``@op.2(1) == @op.2(2)`` or not.

This type of convergence control seems relatively unlikely to appear in real
programs. Its possibility is simply a logical consequence of the model.

An equivalent issue arises if the convergent operations are replaced by nested
loops with loop heart intrinsics that directly refer to ``%anchor``, hence
the variants of the static rules about cycles that apply to them:

.. code-block:: llvm

  ; WARNING: Example of incorrect convergence control!

  %anchor = call token @llvm.experimental.convergence.anchor()
  for (;;) {
    if (condition1) {
      for (;;) {
        %loop1 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
      }
    }
    if (condition2) {
      for (;;) {
        %loop2 = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %anchor) ]
      }
    }
  }

There is a cycle (closed walk in the CFG) that goes through both loop heart
intrinsics using ``%anchor`` but not through the definition of ``%anchor``,
so this code is invalid.


Examples for the Correctness of Program Transforms
==================================================

(This section is informative.)

As implied by the rules in the previous sections, program transforms are correct
with respect to convergent operations if they preserve or refine their
semantics. This means that the set of communicating threads in the transformed
program must have been possible in the original program.

Program transforms with a single-threaded focus are generally conservatively
correct if they do not sink or hoist convergent operations across a branch.
This applies even to program transforms that change the control flow graph.

For example, unrolling a loop that does not contain convergent operations
cannot break any of the guarantees required for convergent operations outside
of the loop.


Loop unrolling examples
-----------------------

We consider three kinds of loop unrolling here:

* Partial unrolling with no known trip multiple, so a "tail" is required to
  collect the remaining elements.
* Partial unrolling by a trip multiple, so no "tail" is required.
* Full unrolling, which eliminates the loop.

The first kind is forbidden when ``@llvm.experimental.convergence.loop`` is
used. We illustrate the reasoning with some examples.

First, an arbitrary loop that contains convergent operations *can* be unrolled
in all of these ways, even with "tail", if all convergent operations refer back
to an anchor inside the loop. For example (in pseudo-code):

.. code-block:: llvm

  while (counter > 0) {
    %tok = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
    counter--;
  }

This can be unrolled to:

.. code-block:: llvm

  while (counter >= 2) {
    %tok = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
    %tok = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
    counter -= 2;
  }
  while (counter > 0) {
    %tok = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
    counter--;
  }

This is likely to change the behavior of the convergent operation if there
are threads whose initial counter value is not a multiple of 2. In particular,
all threads with an odd trip count are now likely to execute the convergent
operation in their respective final iterations together because the
underlying implementation is likely to try to group as many threads together
as possible for the execution of the "tail".

This change is allowed because the anchor intrinsic has implementation-defined
convergence behavior and the loop unrolling transform is considered to be part
of the implementation. Another way of reasoning is that while the *likely*
behavior of the code has changed, the *guarantees* about its behavior have
remained the same.

If the loop contains uncontrolled convergent operations, this kind of unrolling
is forbidden.

Unrolling a loop with convergent operations that refer to tokens produced
outside the loop is forbidden when a "tail" or "remainder" would have to
be introduced. Consider:

.. code-block:: llvm

  ; (A)
  %outer = call token @llvm.experimental.convergence.anchor()
  while (counter > 0) {
    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    ; (B)
    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
    counter--;
  }
  ; (C)

To understand why unrolling is forbidden, consider two threads that execute
converged dynamic instances of the anchor and then proceed with 3 and 4 loop
iterations, respectively:

.. code-block:: text

  Thread 1: A B B B C
  Thread 2: A B B B B C

By the dynamic rule on loop heart intrinsics, these threads execute converged
dynamic instances of the loop intrinsic for the first 3 iterations, and then
thread 2 executes another dynamic instance by itself.

By the dynamic rule on general convergent operations, the threads execute
converged dynamic instances of the ``@convergent.operation`` in the first 3
iterations (that is, the dynamic instance executed by thread 1 in iteration
*n* is the same as that executed by thread 2 in iteration *n*, for *n = 1,2,3*;
the dynamic instance executed in iteration 1 is different from that in
iteration 2, etc.).

Now assume that the loop is unrolled by a factor of 2, which requires a
remainder as follows:

.. code-block:: llvm

  ; (A)
  %outer = call token @llvm.experimental.convergence.anchor()
  while (counter >= 2) {
    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    ; (B)
    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
    counter -= 2;
  }
  ; (C)
  if (counter > 0) {
    %remainder = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    ; (D)
    call void @convergent.operation() [ "convergencectrl"(token %remainder) ]
  }
  ; (E)

First of all, note some interesting problems surrounding the loop intrinsic:

1. It is *not* duplicated inside the unrolled loop. This is to comply with
   the :ref:`convergence_static_rules`.

2. It is unclear whether the loop intrinsic ought to be duplicated in the
   remainder, or whether the final ``@convergent.operation`` in D should just
   refer to either ``%inner`` (which is possible in SSA form) or directly to
   ``%outer``. The decision made here is arbitrary and doesn't change the
   argument that follows. Ultimately, it simply doesn't matter because the
   transform is incorrect either way.

The threads now execute the following sequences of blocks:

.. code-block:: text

  Thread 1: A B C D E
  Thread 2: A B B C D E

Analogous to the argument above, they execute converged dynamic instances of the
``%inner`` intrinsic and the ``@convergent.operation`` in the first iteration
of the unrolled loop, which corresponds to the first 2 iterations of the
original loop.

However, they execute different static calls to ``@convergent.operation`` for
the 3rd iteration of the original loop. In thread 1, that iteration corresponds
to the call in the remainder, while in thread 2 it corresponds to the first
call to ``@convergent.operation`` in the unrolled loop. Therefore, they execute
non-converged dynamic instances, which means that the set of communicating threads
for the 3rd iteration of the original loop is different. This is why the
unrolling is incorrect.

On the other hand, unrolling without "tail" is allowed. For example, assuming
that the trip counter is known to be a multiple of 2, we can unroll the loop
as follows:

.. code-block:: llvm

  %outer = call token @llvm.experimental.convergence.anchor()
  while (counter > 0) {
    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
    counter -= 2;
  }

Note again that the loop intrinsic is not duplicated.

The
:ref:`llvm.experimental.convergence.loop <llvm.experimental.convergence.loop>`
intrinsic is typically expected to appear in the header of a natural loop.
However, it can also appear in non-header blocks of a loop. In that case, the
loop can generally not be unrolled.


Hoisting and sinking
--------------------

In general, hoisting and sinking of convergent operations is forbidden. This is
because moving the operation to a different point in control flow generally
changes the set of threads that reach the operation and therefore, the set of
threads that execute converged dynamic instances of the operation. By
definition, this changes the set of threads that participate in the
communication of the convergent operation, which will typically change its
result.

There are a number of exceptions, though most of them require additional
knowledge.

For example, hoisting and sinking across *uniform* conditional branches -- i.e.,
conditional branches where within every possible relevant set of threads, all
threads will always take the same direction -- is generally allowed. See the end
of the :ref:`example of reductions inside control flow
<convergence_example_reductions>` for a brief discussion.

Some convergent operations can be hoisted but not sunk, or vice versa. A simple
example is the ``subgroupShuffle(data, id)`` operation. It returns the ``data``
operand of the thread identified by ``id``, where thread IDs are fixed and
assigned to each thread at launch. The result is undefined (or perhaps there is
UB, depending on the language and environment) if thread ``id`` is not in the
communicating set of threads. So hoisting is allowed in the following
pseudo-code example:

.. code-block:: llvm

  define void @example(...) convergent {
    %entry = call token @llvm.experimental.convergence.entry()
    %data = ...
    %id = ...
    if (condition) {
      %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
      ...
    } else {
      %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
      ...
    }
  }

After hoisting the calls to ``@subgroupShuffle``, the communicating set of
threads is the union of the two sets of threads in the original program, so
``%id`` can only go "out of range" after hoisting if it did so in the original
program.

However, speculative execution of ``@subgroupShuffle`` in the following program
may be forbidden:

.. code-block:: llvm

  define void @example(...) convergent {
    %entry = call token @llvm.experimental.convergence.entry()
    %data = ...
    %id = ...
    if (condition) {
      %shuffled = call i32 @subgroupShuffle(i32 %data, i32 %id) [ "convergencectrl"(token %entry) ]
      ...
    }
  }

There is no guarantee about the value of ``%id`` in the threads where
``condition`` is false. If ``@subgroupShuffle`` is defined to have UB when
``%id`` is outside of the set of communicating threads, then speculating and
hoisting ``@subgroupShuffle`` might introduce UB.

On the other hand, if ``@subgroupShuffle`` is defined such that it merely
produces an undefined value or poison as result when ``%id`` is "out of range",
then speculating is okay.

Even though
:ref:`llvm.experimental.convergence.anchor <llvm.experimental.convergence.anchor>`
is marked as ``convergent``, it can be sunk in some cases. For example, in
pseudo-code:

.. code-block:: llvm

  %tok = call token @llvm.experimental.convergence.anchor()
  if (condition) {
    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
  }

Assuming that ``%tok`` is only used inside the conditional block, the anchor can
be sunk. The rationale is two-fold. First, the anchor has implementation-defined
behavior, and the sinking is part of the implementation. Second, already in the
original program, the set of threads that communicates in the
``@convergent.operation`` is automatically subset to the threads for which
``condition`` is true.

Anchors can be hoisted in acyclic control flow. For example:

.. code-block:: llvm

  if (condition) {
    %tok1 = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok1) ]
  } else {
    %tok2 = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok2) ]
  }

The anchors can be hoisted, resulting in:

.. code-block:: llvm

  %tok = call token @llvm.experimental.convergence.anchor()
  if (condition) {
    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
  } else {
    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
  }

The behavior is unchanged, since each of the static convergent operations only
ever communicates with threads that have the same ``condition`` value.
By contrast, hoisting the convergent operations themselves is forbidden.

Hoisting and sinking anchors out of and into loops is forbidden. For example:

.. code-block:: llvm

  for (;;) {
    %tok = call token @llvm.experimental.convergence.anchor()
    call void @convergent.operation() [ "convergencectrl"(token %tok) ]
  }

Hoisting the anchor would make the program invalid according to the static
validity rules. Conversely:

.. code-block:: llvm

  %outer = call token @llvm.experimental.convergence.anchor()
  while (counter > 0) {
    %inner = call token @llvm.experimental.convergence.loop() [ "convergencectrl"(token %outer) ]
    call void @convergent.operation() [ "convergencectrl"(token %inner) ]
    counter--;
  }

The program would stay valid if the anchor was sunk into the loop, but its
behavior could end up being different. If the anchor is inside the loop, then
each loop iteration has a new dynamic instance of the anchor, and the set of
threads participating in those dynamic instances of the anchor could be
different in arbitrary implementation-defined ways. Via the dynamic rules about
dynamic instances of convergent operations, this then implies that the set of
threads executing ``@convergent.operation`` could be different in each loop
iteration in arbitrary implementation-defined ways.

Convergent operations can be sunk together with their anchor. Again in
pseudo-code:

.. code-block:: llvm

  %tok = call token @llvm.experimental.convergence.anchor()
  %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
  %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
  if (condition) {
    use(%a, %b)
  }

Assuming that ``%tok``, ``%a``, and ``%b`` are only used inside the conditional
block, all can be sunk together:

.. code-block:: llvm

  if (condition) {
    %tok = call token @llvm.experimental.convergence.anchor()
    %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
    %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
    use(%a, %b)
  }

The rationale is that the anchor intrinsic has implementation-defined behavior,
and the sinking transform is considered to be part of the implementation:
the sinking will restrict the set of communicating threads to those for which
``condition`` is true, but that could have happened in the original program
anyway for some arbitrary other reason.

However, sinking *only* the convergent operation producing ``%b`` would be
incorrect. That would allow threads for which ``condition`` is false to
communicate at ``%a``, but not at ``%b``, which the original program doesn't
allow.

Note that the entry intrinsic behaves differently. Sinking the convergent
operations is forbidden in the following snippet:

.. code-block:: llvm

  %tok = call token @llvm.experimental.convergence.entry()
  %a = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
  %b = call T @pure.convergent.operation(...) [ "convergencectrl"(token %tok) ]
  if (condition) {
    use(%a, %b)
  }