
\chapter{``V'' Standard Extension for Vector Operations, Version 0.4-DRAFT}
\label{sec:bits}

This chapter presents a proposal for the RISC-V base vector
instruction set extension.  The base vector extension is intended to
provide general support for data-parallel execution within the 32-bit
instruction encoding space, with later vector extensions supporting
richer functionality for certain domains.

\begin{commentary}
The vector extension is based on the style of vector register
architecture introduced by Seymour Cray in the 1970s, as opposed to
the earlier packed SIMD approach, introduced with the Lincoln Labs
TX-2 in 1957 and now adopted by most other commercial instruction
sets.
\end{commentary}

The base vector extension defines the components that must be included
when the ``V'' bit is set in the {\tt misa} register, and consequently
those that will be assumed to exist by software written for an ABI
specifying V.

\begin{commentary}
  This draft version of the chapter includes additional specifications
  of proposed extensions to the base vector extension to explain some
  of the encoding choices made for the base.
\end{commentary}

The vector extension supports a configurable vector unit, to enable
implementations to trade off the number of active architectural vector
registers and supported element widths against available maximum
vector length.  The vector extension is designed to allow the same
binary code to work efficiently across a variety of hardware
implementations varying in physical vector storage capacity and
datapath spatial and/or temporal parallelism.

\begin{commentary}
The vector instruction set contains many features developed in earlier
research projects, including the Berkeley T0~\cite{} and VIRAM~\cite{VIRAM}
vector microprocessors, the MIT Scale vector-thread processor~\cite{},
and the Berkeley Maven~\cite{} and Hwacha~\cite{} projects.
\end{commentary}

\section{Vector Unit State}

The additional vector unit architectural state includes 32 vector
registers ({\tt v0}--{\tt v31}), and an XLEN-bit WARL vector length
CSR, {\tt vl}.  Each vector register {\tt v}$n$ has an associated
16-bit configuration field {\tt vtype}$n$ described below. A 6-bit
global maximum element width register {\tt vmaxew} defines the maximum
number of bits of storage in every element of every active vector
register.

\begin{commentary}
  Future vector extensions using wider instruction encodings can
  support more architectural vector registers, for example, 256
  architectural vector registers in a 64-bit instruction encoding.
\end{commentary}

\begin{commentary}
  Future 2D shape extensions add two more vector length registers,
  {\tt vm} and {\tt vn}.
\end{commentary}

There is also a 3-bit fixed-point rounding mode CSR {\tt vxrm}, and a
single-bit fixed-point saturation status CSR {\tt vxsat}.  The {\tt
  vcs} CSR alias provides combined access to the {\tt vl}, {\tt vxrm},
{\tt vxsat} fields to reduce context switch time.  The {\tt vcs}
register also includes a configuration mode field to support future
extended configuration modes.

\begin{discussion}
The components of {\tt vcs} might not need separate CSR addresses,
depending on how they're accessed via other non-CSR instructions.
\end{discussion}

\section{Vector Unit Type Configuration Register ({\tt vtype}$n$)}

The vector unit must be configured before use.  Each architectural
vector register, {\tt v}$n$, is configured via 16 bits of vector type
configuration state {\tt vtype}$n$, which can be accessed via vector
configuration ({\tt vcfg}) CSRs and other rapid vector configuration
instructions as described below.  The vector register type
configuration encodes the overall organization, or {\em shape}, of the
elements in each vector register (e.g., scalar versus 1-D vector), as
well as the bitwidth and numeric representation of each element.  As
shown in Figure~\ref{fig:vtype}, the 16-bit {\tt vtype}$n$ encoding is
divided into a 5-bit current shape field {\tt vshape}$n$, a 5-bit
representation field {\tt verep}$n$, and a 6-bit element bit-width
field {\tt vew}$n$, held in the {\tt vcfg}$x$ CSRs.  The combination
of an element numeric representation and an element bitwidth is called
an element {\em format}.  Each vector register can also be disabled to
free physical vector storage for other architectural vector registers.

\begin{figure}[htb]
\begin{center}
\begin{tabular}{O@{}O@{}O}
\\
\instbitrange{15}{11} &
\instbitrange{10}{6} &
\instbitrange{5}{0} \\
\hline
\multicolumn{1}{|c|}{{\tt vshape}$n$} &
\multicolumn{1}{c|}{{\tt verep}$n$} &
\multicolumn{1}{c|}{{\tt vew}$n$} \\
\hline
5 & 5 & 6 \\
\end{tabular}
\end{center}
\caption{Location of subfields within a single {\tt vtype}$n$ field.}
\label{fig:vtype}
\end{figure}
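
\begin{commentary}
  As an illustrative worked example (not part of the proposal text),
  the field layout in Figure~\ref{fig:vtype} implies that a {\tt
    vtype}$n$ value can be assembled as
  \[ \mbox{\tt vtype}n = \mbox{\tt vshape}n \times 2^{11}
     + \mbox{\tt verep}n \times 2^{6} + \mbox{\tt vew}n \]
  Using the base encodings given in the following sections, a 1-D
  vector ({\tt vshape}=00100) of IEEE-754 floating-point elements
  ({\tt verep}=00011) that are 32 bits wide ({\tt vew}=011000) would
  be encoded as
  \[ 4 \times 2048 + 3 \times 64 + 24 = 8408 = \mbox{\tt 0x20D8} \]
\end{commentary}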

\begin{commentary}
  It was also common in earlier vector machines to support multiple
  precisions within the vector datapath.  In particular, the CDC
  STAR-100~\cite{cdcstar100} supported single-precision and
  double-precision floating-point operations and also bit, byte, and
  nibble operations in the vector unit; TI ASC~\cite{tiasc} designs
  supported dividing 64-bit vector lanes into two 32-bit lanes for
  double throughput.
\end{commentary}

\clearpage

\section{Shape Encoding}

The 5-bit shape field describes the structure of the elements within
the vector register.  In the base vector extension, the shape can be
set to either scalar or vector.

\begin{table}[hbt]
  \centering
  \begin{tabular}{|c|l|}
    \hline
        {\tt vshape} & Shape \\
        \hline
        00000  & scalar  \\
        00100  & 1-D vector, length controlled by {\tt vl}  \\
        \hline
        \multicolumn{2}{|c|}{All other encodings reserved}\\
        \hline
  \end{tabular}
  \caption{Base vector encoding of {\tt vshape}$n$ field.}
  \label{tab:vshape}
\end{table}

\begin{commentary}
  For the base vector ISA, only a single bit is required in each {\tt
    vshape} field to select between scalar and 1-D vector elements
  with the other bits hardwired to zero.
\end{commentary}
  
\begin{table}[hbt]
  \centering
  \begin{tabular}{|c|l|}
    \hline
        {\tt vshape} & Shape \\
        \hline
        00000  & scalar \\
        00001  & {\em Reserved} \\
        0001x  & {\em Reserved} \\
        \hline
        00100  & 1-D vector {\tt vl} \\
        01000  & 1-D vector {\tt vm} \\
        01100  & 1-D vector {\tt vn} \\
        \hline
        00101  & 2-D matrix {\tt vl} x {\tt vl} \\
        00110  & 2-D matrix {\tt vl} x {\tt vm} \\
        00111  & 2-D matrix {\tt vl} x {\tt vn} \\
        \hline
        01001  & 2-D matrix {\tt vm} x {\tt vl} \\
        01010  & 2-D matrix {\tt vm} x {\tt vm} \\
        01011  & 2-D matrix {\tt vm} x {\tt vn} \\
        \hline
        01101  & 2-D matrix {\tt vn} x {\tt vl} \\
        01110  & 2-D matrix {\tt vn} x {\tt vm} \\
        01111  & 2-D matrix {\tt vn} x {\tt vn} \\
        \hline
        1xxxx  & {\em Reserved}/{\em Custom} \\
        \hline
  \end{tabular}
  \caption{Extended encoding of per-vector-register {\tt vshape} field.}
  \label{tab:extvshape}
\end{table}

\begin{commentary}
  A sketch of the proposed encodings for the 2D shape extension is
  shown in Table~\ref{tab:extvshape}.
\end{commentary}
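
\begin{commentary}
  One way to read Table~\ref{tab:extvshape} (an observation about the
  sketched encoding, not a normative rule) is that, for encodings with
  bit 4 clear, bits 3--2 of {\tt vshape} select the first length
  register and bits 1--0 select an optional second length register,
  with 00 meaning ``none'', 01 meaning {\tt vl}, 10 meaning {\tt vm},
  and 11 meaning {\tt vn} in both positions.  For example, 01010
  splits into 10 and 10, giving a 2-D matrix {\tt vm} x {\tt vm},
  while 01100 splits into 11 and 00, giving a 1-D vector {\tt vn}.
  Encodings with bits 3--2 equal to 00 and bits 1--0 non-zero name a
  second dimension without a first, and correspond to the reserved
  entries 00001 and 0001x.
\end{commentary}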

\clearpage

\section{Representation Encoding}

The 5-bit {\tt verep}$n$ register sets the numeric representation of
each element of the vector register.  In the base vector
extension, the representation can be set to unsigned integer,
two's-complement signed integer, or floating-point.  The
floating-point representations follow the IEEE 754 standards.

\begin{table}[hbtp]
  \centering
  \begin{tabular}{|c|l|}
    \hline
    {\tt verep} & Representation \\
    \hline
    00000 & Unsigned integer \\
    00001 & Two's-complement signed integer \\
    00010 & {\em Reserved (unsigned floating-point?)}\\
    00011 & IEEE-754 floating-point \\
    \hline
    \multicolumn{2}{|c|}{All other encodings reserved}\\
    \hline
  \end{tabular}
  \caption{Base vector representation encoding.}
  \label{tab:verep}
\end{table}

\begin{table}[hbtp]
  \centering
  \begin{tabular}{|c|l|}
    \hline
    {\tt verep} & Representation \\
    \hline
    00000 & Unsigned integer \\
    00001 & Two's-complement signed integer \\
    00010 & {\em Reserved (unsigned floating-point)}\\
    00011 & IEEE-754 floating-point \\
    \hline
    001x0 & {\em Reserved} \\
    00101 & Complex signed integer \\
    00111 & Complex floating-point \\
    \hline
    01000 & Prime Galois field - integer representation \\
    01001 & Prime Galois field - Montgomery representation \\
    01100 & Binary extension Galois field - polynomial basis \\
    01101 & Binary extension Galois field - normal basis \\
    \hline
    01010 & UNORM \\
    01011 & SNORM \\
    01110 & {\em Reserved} \\
    01111 & {\em Reserved (complex SNORM?)} \\
    \hline
    10xxx & Custom representations \\
    \hline
    11xxx & {\em Reserved} \\
    \hline
  \end{tabular}
  \caption{Extended vector representation encoding.}
  \label{tab:extverep}
\end{table}

\begin{commentary}
  The complex representations split the element width given in {\tt
    vew}$n$ into two equal-sized real and imaginary fields, so an
  element width of 64 bits can hold a single complex value with a
  32-bit real and a 32-bit imaginary component.
\end{commentary}

\clearpage

\section{Element Bitwidth}

Each vector register, {\tt v}$n$, has a 6-bit element width
register, {\tt vew}$n$, to specify the number of bits for each element
of the current type in the vector register.

The largest element width supported is
termed ELEN, and is defined to be the larger of the supported integer
and floating-point type widths:
\[ \mbox{\em ELEN} = \max(\mbox{\em XLEN}, \mbox{\em FLEN}) \]
For the base vector ISA, the bit width can be set at any power of two
between 8 and ELEN.
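
\begin{commentary}
  As a worked example of this definition, an RV32 system with the D
  extension has XLEN=32 and FLEN=64, giving
  \[ \mbox{\em ELEN} = \max(32, 64) = 64 \]
  so element widths of 8, 16, 32, and 64 bits are available, whereas
  an RV64 system with the Q extension has
  $\mbox{\em ELEN} = \max(64, 128) = 128$.
\end{commentary}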

\begin{table}[hbt]
  \centering
  \begin{tabular}{|c|r|l|}
    \hline
        {\tt vew} & Width & Required in Base \\
        \hline
        000 000 & disabled & All \\
        001 000 & 8 & All \\
        010 000 & 16 & All \\
        011 000 & 32 & All \\
        100 000 & 64 & RV32D, RV64, RV128\\
        101 000 & 128 & RV64Q, RV128\\
        \hline
        \multicolumn{3}{|c|}{All other encodings reserved.}\\
        \hline
  \end{tabular}
  \caption{Base vector ISA encoding of vector element width ({\tt
      vew}$n$) register fields.}
  \label{tab:basevew}
\end{table}

\begin{table}[hbtp]
  \centering
  \begin{tabular}{|c|r|}
    \hline
        {\tt vew} & Width \\
        \hline
        000 000 & disabled \\
        000 001 & 1 \\
        000 xxx & \multicolumn{1}{r|}{steps of 1}\\
        000 111 & 7 \\
        \hline
        001 000 & 8 \\
        001 xxx & \multicolumn{1}{r|}{steps of 1}\\
        001 111 & 15 \\
        \hline
        010 000 & 16 \\
        010 xxx & \multicolumn{1}{r|}{steps of 2}\\
        010 111 & 30 \\
        \hline
        011 000 & 32 \\
        011 xxx & \multicolumn{1}{r|}{steps of 4}\\
        011 111 & 60 \\
        \hline
        100 000 & 64 \\
        100 xxx & \multicolumn{1}{r|}{steps of 8}\\
        100 111 & 120 \\
        \hline
        101 xxx & reserved \\
        \hline
        110 000 & 128 \\
        110 001 & 192 \\
        110 010 & 2048 \\
        110 011 & 3072 \\
        110 100 & 512 \\
        110 101 & 768 \\
        110 110 & 8192 \\
        110 111 & 12288 \\
        \hline
        111 000 & 256 \\
        111 001 & 384 \\
        111 010 & 4096 \\
        111 011 & 6144 \\
        111 100 & 1024 \\
        111 101 & 1536 \\
        111 110 & 16384 \\
        111 111 & 24576 \\
        \hline
  \end{tabular}

   \caption{Proposed extended encoding of vector element width ({\tt
       vew}$n$) register fields.  Every bit width between 1 and 16 is
     supported.  Between 16 and 32, bit widths are supported in steps
     of 2 (i.e., 16, 18, 20, \ldots).  Between 32 and 64, bit widths
     are supported in steps of 4 (i.e., 32, 36, 40, \ldots).  Between
     64 and 128, bit widths are supported in steps of 8 (i.e., 64, 72,
     80, \ldots).  For bit widths of 128 and greater, all powers of
     two up to 16384 and all widths 1.5$\times$ greater are supported
     (128, 192, 256, 384, \ldots).  }
   \label{tab:extvew}
\end{table}

\begin{commentary}
    The extended bit-width encoding is designed to minimize the number
    of state bits required to support useful subsets of widths. For
    example, an RV32 system only needs two bits of state per {\tt
      vew}$n$ field to represent {\em disabled}, 8, 16, and 32. An
    RV32 system with 3 bits of state can represent {\em disabled}, 4,
    8, 12, 16, 24, 32, and 48.  An RV64 system with 4 bits of state
    can represent {\em disabled}, 4, 8, 12, 16, 24, 32, 48, 64, 96,
    128, 256, 512, 1024.
\end{commentary}
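
\begin{commentary}
  The entries of Table~\ref{tab:extvew} can be summarized in closed
  form.  The following is a sketch derived from the table rather than
  a normative definition, and the names $g$, $s$, and $w$ are
  introduced here only for illustration.  Writing $g$ for the value
  of the upper three bits of {\tt vew}$n$, $s$ for the value of the
  lower three bits with individual bits $s_2 s_1 s_0$, and $w$ for the
  resulting width in bits:
  \[
  w = \left\{
  \begin{array}{ll}
    s & g = 0 \mbox{ (with $w=0$ meaning {\em disabled})} \\
    2^{g+2} + s \cdot 2^{g-1} & 1 \le g \le 4 \\
    2^{\,7 + (g-6) + 2 s_2 + 4 s_1} \cdot (1 + s_0/2) & g \in \{6, 7\}
  \end{array}
  \right.
  \]
  with $g = 5$ reserved.  For example, 011 111 has $g=3$ and $s=7$,
  so $w = 32 + 7 \cdot 4 = 60$, and 111 101 has $g=7$, $s_2=1$,
  $s_1=0$, $s_0=1$, so $w = 2^{10} \cdot 1.5 = 1536$.  Setting
  $s_1 = s_0 = 0$ reproduces the 4-bit subset of widths listed above.
\end{commentary}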

\clearpage

\section{Base Vector Extension Supported Types}

The types supported by the base V extension depend upon the base
scalar ISA and supported extensions.  When the base V extension is
added to a base scalar ISA, it must support the vector data element
types implied by the supported scalar types as defined by
Table~\ref{tab:velemtypes}.

\begin{table}[hbt]
  \centering
\begin{tabular}{|l|l|}
  \hline
  \multicolumn{2}{|c|}{Supported Fixed-Point Formats} \\
  \hline
  RV32I  & I8, U8, I16, U16, I32, U32 \\
  RV64I  & I8, U8, I16, U16, I32, U32, I64, U64 \\
  RV128I & I8, U8, I16, U16, I32, U32, I64, U64, I128, U128 \\
  \hline
  \hline
  \multicolumn{2}{|c|}{Supported Floating-Point Formats} \\
  \hline
  F      & F16, F32 \\
  FD     & F16, F32, F64 \\
  FDQ    & F16, F32, F64, F128 \\
  \hline
\end{tabular}
\caption{Supported data element formats depending on base integer ISA
  and supported floating-point extensions.  I$x$ indicates a signed
  integer of $x$ bits, U$x$ indicates an unsigned integer of $x$ bits,
  and F$x$ indicates an IEEE floating-point number of $x$ bits.}
\label{tab:velemtypes}
\end{table}

\begin{commentary}
  Future vector extensions might expand the set of supported
  datatypes, including custom application-specific datatypes.
\end{commentary}

\clearpage

\section{Maximum Vector Element Width ({\tt vmaxew})}

The global {\tt vmaxew} field is used to support more complex vector
runtime environments where the types to be held in each register of a
single configuration may vary dynamically, and may not even be known
at compile time due to separate compilation.

The global maximum element width register {\tt vmaxew} defines the
maximum number of bits of storage in every element of every active
architectural register, or if zero, defers to the per-vector-register
width field.

\begin{commentary}
  The VIRAM processor had a virtual processor width
  register similar to {\tt vmaxew}~\cite{VIRAM}.
\end{commentary}

If {\tt vmaxew} is zero, then the per-register vector element widths
{\tt vew}$n$ determine the minimum storage required for each element
of the associated vector register {\tt v}$n$.

If {\tt vmaxew} is non-zero, it sets the largest element width that
can be supported in any vector register element in the current
configuration.

\clearpage

\section{Vector Configuration Registers ({\tt vcfg0}--{\tt vcfg15})}

The vector type configuration requires 512 bits of state (32 vector
registers, each with a 16-bit {\tt vtype}$n$ field) that can be accessed
via the {\tt vcfg} CSRs.

RV128 uses four vector configuration CSRs: {\tt vcfg0} holds
configuration data for {\tt v0}--{\tt v7} with bits $16n$ to $16n+15$
holding {\tt vtype}$n$, while {\tt vcfg4}, {\tt vcfg8} and {\tt
  vcfg12} similarly hold configuration data for {\tt v8}--{\tt v15},
  {\tt v16}--{\tt v23}, and {\tt v24}--{\tt v31} respectively.

In RV64, the {\tt vcfg2} CSR provides access to the upper 64 bits of {\tt
  vcfg0} and {\tt vcfg6} provides access to the upper 64 bits of
{\tt vcfg4}.  In RV32, the {\tt vcfg1}, {\tt vcfg3}, {\tt vcfg5}
and {\tt vcfg7} CSRs provide access to the upper bits of {\tt
  vcfg0}, {\tt vcfg2}, {\tt vcfg4} and {\tt vcfg6} respectively.
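
\begin{commentary}
  As a sketch of how software might locate a given {\tt vtype}$n$
  field (the symbols $c$ and $b$ are introduced here for illustration,
  and extending the pattern described above to all 32 registers is an
  assumption), the CSR {\tt vcfg}$c$ and starting bit position $b$
  holding {\tt vtype}$n$ would be
  \[ c = \frac{\mbox{\em XLEN}}{32} \left\lfloor
     \frac{16n}{\mbox{\em XLEN}} \right\rfloor, \qquad
     b = 16n \bmod \mbox{\em XLEN} \]
  For example, {\tt vtype5} would occupy bits 95--80 of {\tt vcfg0} in
  RV128, bits 31--16 of {\tt vcfg2} in RV64, and bits 31--16 of {\tt
    vcfg2} in RV32.  In each case the total is $32 \times 16 = 512$
  bits, held in four 128-bit, eight 64-bit, or sixteen 32-bit CSRs.
\end{commentary}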

Any CSR write to a {\tt vcfg}$x$ register zeros all {\tt vcfg}$y$
registers, for $y>x$.  As a result, configuration data should be
written from the {\tt vcfg0} CSR upwards.

\begin{commentary}
  Zeroing higher-numbered {\tt vcfg}$y$ registers allows more rapid
  reconfiguration of the vector register file via CSR writes, and
  provides backward-compatibility for extensions that increase the
  number of possible architectural vector registers.  This choice does
  prevent the use of CSRRW instructions to swap the configuration
  context; an entire old configuration must be read out before a new
  configuration is written in.
\end{commentary}

Additional instructions are provided to support more rapid changes to
the vector unit configuration as described below.

\section{Legal Vector Unit Configurations}

To simplify hardware configuration calculations and to reduce software
context-switch complexity, vector unit configurations are constrained
to have non-disabled architectural vector registers numbered
contiguously starting at {\tt v0}.  An exception will be raised if an
instruction tries to change {\tt vtype}$n$ in a way that violates this
constraint.

\begin{commentary}
  During a software vector-context save, the software handler can stop
  searching for active architectural registers after encountering the
  first disabled vector register.  Hardware to calculate physical
  register allocation is also simplified with this constraint.
\end{commentary}

\clearpage

\section{Vector Unit CSRs}

\begin{table}[hbt]
  \centering
  \begin{tabular}{|l|c|l|l|}
    \hline
    CSR name & Number & Base ISA & Description\\
    \hline
    {\tt vcs}  & TBD & RV32, RV64, RV128 & Vector control-status register\\
    {\tt vl}    & TBD & RV32, RV64, RV128 & Active vector length\\
    {\tt vxrm}  & TBD & RV32, RV64, RV128 & Vector fixed-point rounding mode\\
    {\tt vxsat} & TBD & RV32, RV64, RV128 & Vector fixed-point
    saturation flag \\
    {\tt vmaxew} & TBD & RV32, RV64, RV128 & Global maximum vector element width \\
    \hline
    {\tt vcfg0} & TBD & RV32, RV64, RV128 & \multirow{16}{*}{Vector
      register configuration}\\
    {\tt vcfg1} & TBD & RV32 &\\
    {\tt vcfg2} & TBD & RV32, RV64 &\\
    {\tt vcfg3} & TBD & RV32 &\\
    {\tt vcfg4}  & TBD & RV32, RV64, RV128 &\\
    {\tt vcfg5} & TBD & RV32 &\\
    {\tt vcfg6} & TBD & RV32, RV64 &\\
    {\tt vcfg7} & TBD & RV32 &\\
    {\tt vcfg8} & TBD & RV32, RV64, RV128 & \\
    {\tt vcfg9} & TBD & RV32 &\\
    {\tt vcfg10} & TBD & RV32, RV64 &\\
    {\tt vcfg11} & TBD & RV32 &\\
    {\tt vcfg12}  & TBD & RV32, RV64, RV128 &\\
    {\tt vcfg13} & TBD & RV32 &\\
    {\tt vcfg14} & TBD & RV32, RV64 &\\
    {\tt vcfg15} & TBD & RV32 &\\
    \hline
  \end{tabular}
  \caption{Vector extension CSRs.}
  \label{tab:vcsrs}
\end{table}

\clearpage

\section{Maximum Vector Length (MVL)}

The implementation determines an available {\em maximum vector length}
(MVL) dependent on the current vector type configuration held in {\tt
  vcfg}$x$ and {\tt vmaxew}.  The available MVL depends on the
configuration setting and on the implementation's microarchitecture,
but MVL must always have the same value for the same configuration
parameters on a given hart.

\begin{commentary}
  Several earlier vector machines had the ability to configure
  physical vector register storage into a larger number of short
  vectors or a smaller number of long vectors.  In particular, the
  Fujitsu VP series~\cite{vp200} supported combining power-of-2 base
  vector registers into longer vector registers.

  The Scale~\cite{}, Maven~\cite{}, and Hwacha~\cite{} processors also
  support configuration-dependent MVL.
\end{commentary}

\begin{commentary}
  Previously, the specification imposed a minimum vector length (4) on
  all configurations to allow stripmining code to be removed for short
  vector lengths.  With the expanded scope of the vector unit types,
  this would be too onerous to support, and so the requirement is removed.
\end{commentary}

\begin{discussion}
  A separate mechanism for supporting fixed vector lengths should be
  designed, possibly as part of an optional extension.
\end{discussion}

Any change to the vector configuration that might change MVL causes the
entire vector unit state to be zeroed.  Any write to the global {\tt
  vmaxew} causes the entire vector unit state to be zeroed, even if
the value in {\tt vmaxew} is unchanged.

If {\tt vmaxew} is non-zero, any write to an individual {\tt vew}$n$
register that would set the width greater than {\tt vmaxew} raises an
illegal instruction exception and leaves the vector unit state
unchanged.

If {\tt vmaxew} is non-zero, any write to an individual {\tt vew}$n$
field with a value less than or equal to the value in {\tt vmaxew}
only zeros the associated vector register {\tt v}$n$ and leaves other
vector unit state unchanged.  The vector register data is zeroed even
if {\tt vew}$n$ would be unchanged by the write.

If {\tt vmaxew} is zero, then any write to an individual {\tt vew}$n$
register zeros the associated {\tt v}$n$ vector register.  In addition,
any write that changes the value in {\tt vew}$n$ zeros the entire vector
unit state.
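
\begin{commentary}
  The rules above for writes to a {\tt vew}$n$ field can be
  summarized as follows.  This is a restatement for convenience, not
  an additional requirement.

  \begin{center}
  \begin{tabular}{|l|l|p{2.3in}|}
    \hline
    {\tt vmaxew} & Value written to {\tt vew}$n$ & Effect \\
    \hline
    non-zero & greater than {\tt vmaxew} & illegal instruction
    exception, vector unit state unchanged \\
    non-zero & less than or equal to {\tt vmaxew} & only {\tt v}$n$
    zeroed \\
    zero & same as current {\tt vew}$n$ & only {\tt v}$n$ zeroed \\
    zero & different from current {\tt vew}$n$ & entire vector unit
    state zeroed \\
    \hline
  \end{tabular}
  \end{center}
\end{commentary}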

\begin{commentary}
  The state is zeroed to hide implementation-dependent bit mappings
  and to provide additional security when context swapping.  Zero is
  also a convenient initial value for some loops.

  In-order implementations will probably use a flag bit per register to
  mux in 0 instead of garbage values on each source until it is
  overwritten.  For in-order machines, vector lengths less than MVL
  complicate this zeroing, but these cases can be handled by adding a
  zero bit per element or element group.  Machines with vector
  register renaming can just initialize the rename table to point
  entries at a physical zero register.
\end{commentary}

Each vector register can be reconfigured dynamically to hold different
formats without zeroing the entire vector unit state provided that: if
{\tt vmaxew} is zero, the bit-width of the new format is the same as
the current {\tt vew}; or if {\tt vmaxew} is non-zero, the format does
not require more than {\tt vmaxew} bits.  Any change to a vector
register's format zeros the affected vector register.

If a vector register is disabled, then any vector instruction
that attempts to access that vector register will raise an
illegal instruction exception.  Attempting to write any {\tt
  vmaxew}$n$ with an unsupported value will raise an illegal
instruction exception.

\begin{commentary}
  Vector registers have both a maximum element width and a
  current element data type to allow the same vector register to
  be changed to different types during execution provided the
  maximum width is not exceeded.  This reduces register pressure and
  helps support vector function calls, where the caller does not know
  the types needed by the callee, as described below.
\end{commentary}

\begin{commentary}
  The set of supported types might be greatly increased with future
  extensions.  Examples include, but are not limited to, new scalar
  types in new number systems, a complex type with real and imaginary
  components, a key-value type, or an application-specific structure
  type with multiple constituent fields.  Auxiliary type
  configuration state might be required in these cases.
\end{commentary}

Attempting to write an unsupported type or a type that requires more
than the current {\tt vmaxew} width to a {\tt vetype} field will raise
an illegal instruction exception.

\begin{commentary}
Implementations must still raise an exception for a {\tt vetype}$n$
setting that is greater than the architectural {\tt vmaxew}$n$ width,
even if they internally implement a larger physical {\tt vmaxew}$n$
that could accommodate the {\tt vetype}$n$ request.
\end{commentary}

\begin{discussion}
There are three options: 1) implementations raise exceptions whenever
illegal values are written to {\tt vmaxew} and {\tt vetype} fields
(current design), 2) implementations raise exceptions at use if the
configuration holds illegal values, or 3) the fields are made WARL so
that they silently reduce to supported types with no exceptions.
Option 2 could complicate vector unit context switch code by having
more cases to check, while Option 3 could make debugging more
difficult by allowing code to run with reduced precision or incorrect
types.
\end{discussion}

\begin{commentary}
Three broad classes of implementation can be distinguished by how they
handle {\tt vmaxew} settings.

The simplest is {\em max-width-per-implementation} (MWPI), where the
vector unit is organized in fixed ELEN-width physical lanes, and
changes to {\tt vmaxew} settings simply cause portions of the
physical registers and datapath to be disabled for operations narrower
than ELEN bits.

The next most complex implementation, {\em
  max-width-per-configuration} (MWPC), uses the maximum width across
all {\tt vmaxew} settings in a dynamic configuration to divide the
physical register storage and datapaths.  For example, an MWPC machine
with ELEN=64 might subdivide physical lanes into 32-bit datapaths if
no {\tt vmaxew} setting is greater than 32.  Operations on
sub-32-bit quantities would disable appropriate portions of the
physical registers and functional units in each 32-bit lane.  Several
early vector supercomputers, including the CDC
STAR-100~\cite{cdcstar100}, provided a similar facility to divide
64-bit physical vector lanes into narrower 32-bit lanes.

The most complex implementations are {\em max-width-per-register}
(MWPR), which reduce wasted space in the physical register files by
packing elements in each vector register according to the individual
{\tt vmaxew} settings and which within one configuration can
execute instructions with narrower datatypes at higher rates than for
wider datatypes.  The Berkeley Hwacha vector
engine~\cite{hwachatr,mixedprecision} is an example microarchitecture
with this property.
\end{commentary}

\clearpage

{\bf The following sections are out of date.}

\section{Vector Instruction Formats}

\begin{commentary}
  The instruction encoding is a work in progress.

  An important design goal was that the base vector extension fit
  within a few major opcodes of the 32-bit encoding.  It is envisioned
  that future vector extensions will use 48-bit or 64-bit encodings to
  increase both the opcode space and the set of architectural
  registers.  The 64-bit vector encoding would support 256
  architectural vector registers and orthogonal specification of a
  predicate register in each instruction.
\end{commentary}

Vector arithmetic and vector memory instructions are encoded in new
variants of the R-format, shown in Figure~\ref{fig:vinstformats}.
Both new formats use one bit to hold a {\em vp} field, which usually
controls the predicate register in use, either {\tt vp0} or {\tt vp1}.
The VR4 form is used for fused multiply-add instructions.  The
existing RISC-V instruction formats are used for other vector-related
instructions, such as the vector configuration instructions.

\vspace{-0.2in}
\begin{figure}[h]
\begin{center}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{p{0.7in}@{}p{0.4in}@{}p{0.7in}@{}p{0.7in}@{}p{0.5in}@{}p{0.4in}@{}p{0.7in}@{}p{1in}l}
\\
\instbitrange{31}{27} &
\instbitrange{26}{25} &
\instbitrange{24}{20} &
\instbitrange{19}{15} &
\instbitrange{14}{13} &
\instbit{12} &
\instbitrange{11}{7} &
\instbitrange{6}{0} \\
\cline{1-8}
\multicolumn{2}{|c|}{funct7} &
\multicolumn{1}{c|}{rs2} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct2} &
\multicolumn{1}{c|}{vp} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} &
VR-type \\
\cline{1-8}
\\
\cline{1-8}
\multicolumn{1}{|c|}{rs3} &
\multicolumn{1}{c|}{fmt} &
\multicolumn{1}{c|}{rs2} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct2} &
\multicolumn{1}{c|}{vp} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} &
VR4-type \\
\cline{1-8}
\end{tabular}
\end{center}
\caption{New V extension instruction formats.  }
\label{fig:vinstformats}
\end{figure}
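
As an illustrative, non-normative sketch, the field layout in
Figure~\ref{fig:vinstformats} can be expressed as a small Python
decoder.  The function names {\tt decode\_vr} and {\tt decode\_vr4}
are hypothetical, and the bit positions are read directly off the
figure, with the two leftmost columns (bits 31--25) forming {\em
  funct7} in the VR-type.

\begin{verbatim}
def decode_vr(word):
    """Split a 32-bit VR-type instruction word into its fields."""
    return {
        "funct7": (word >> 25) & 0x7F,  # bits 31:25
        "rs2":    (word >> 20) & 0x1F,  # bits 24:20
        "rs1":    (word >> 15) & 0x1F,  # bits 19:15
        "funct2": (word >> 13) & 0x3,   # bits 14:13
        "vp":     (word >> 12) & 0x1,   # bit 12, selects vp0 or vp1
        "rd":     (word >> 7) & 0x1F,   # bits 11:7
        "opcode": word & 0x7F,          # bits 6:0
    }


def decode_vr4(word):
    """VR4-type: bits 31:25 hold rs3 (31:27) and fmt (26:25)."""
    fields = decode_vr(word)
    upper = fields.pop("funct7")
    fields["rs3"] = upper >> 2
    fields["fmt"] = upper & 0x3
    return fields
\end{verbatim}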

Most vector instructions are available in both vector-vector and
vector-scalar variants.  Vector-vector instructions take the first
operand from the vector register specified by {\em rs1} and the second
operand from the vector register specified by {\em rs2}.

For vector-scalar operations, the {\em rs1} field specifies the scalar
register to be accessed.  For most vector-scalar instructions, the
type of the vector operand specified by {\em rs2} indicates whether
the integer or floating-point scalar register file is accessed using
the {\em rs1} register specifier.

Some non-commutative vector-scalar instructions (such as sub) are
provided in two forms, so that the scalar value can be used as either
the first or the second operand.

\begin{commentary}
  The {\em rs1} field is used to provide the scalar operand because in
  the base encoding, whenever an instruction has a single scalar
  source operand, it is encoded in the {\tt rs1} field.
\end{commentary}

\section{Polymorphic Vector Instructions}

The vector extension uses a polymorphic instruction encoding where the
opcode is combined with the types of the source and destination
registers to determine the operation to be performed.  For example, an
ADD opcode will perform a 32-bit integer vector-vector add if both
vector source operands and the vector destination register are 32-bit
integers, but will perform a 16-bit floating-point vector-vector
operation if both vector source operands and the vector destination
are 16-bit floats.

The polymorphic encoding also naturally supports operations with mixed
precisions on the input and output, and also supports extending the
instruction set with new types without necessarily increasing the
opcode space.

Not all combinations of source and destination argument types need be
supported.  The base vector extension mandates only that
implementations provide a subset of combinations of types on inputs
and outputs.  Table~\ref{tab:vtypemix} shows the general rules for
integer and floating-point instructions, but the detailed instruction
listing should be consulted for accurate information.

\begin{table}
  \centering
  \begin{tabular}{|r|r|r|r|r|}
    \hline
    \multicolumn{1}{|c|}{Src1} &
    \multicolumn{1}{c|}{Src2} &
    \multicolumn{1}{c|}{Src3} &
    \multicolumn{1}{c|}{Dest} &
    \multicolumn{1}{c|}{Example} \\
    \hline
    \hline
    \multicolumn{5}{|c|}{Integer vector-scalar}\\
    \hline
    XLEN &   X & - &  X & 64b + 32b $\rightarrow$ 32b \\
    XLEN &   X & - & 2X & 64b + 8b  $\rightarrow$ 16b \\
    \hline
    \hline
    \multicolumn{5}{|c|}{Integer vector-vector}\\
    \hline
      X &  X & - &   X & 32b + 32b $\rightarrow$ 32b \\
      X &  X & - &  2X & 16b + 16b $\rightarrow$ 32b \\
     2X &  X & - &  2X & 64b + 32b $\rightarrow$ 64b \\
    \hline
    \hline
    \multicolumn{5}{|c|}{Floating-point vector-scalar}\\
    \hline
     F &  F & -  &  F &  64b + 64b $\rightarrow$ 64b \\
     F &  F & F  &  F &  32b $\times$ 32b + 32b $\rightarrow$ 32b \\
     F &  F & -  & 2F &  32b + 32b $\rightarrow$ 64b \\
     F &  F & 2F & 2F &  32b $\times$ 32b + 64b $\rightarrow$ 64b \\
    \hline
    \hline
    \multicolumn{5}{|c|}{Floating-point vector-vector}\\
    \hline
      F &  F  & - &   F & 32b + 32b $\rightarrow$ 32b \\
      F &  F  & - &  2F & 16b + 16b $\rightarrow$ 32b \\
     2F &  F  & - &  2F & 64b + 32b $\rightarrow$ 64b \\
      F &  F & F  &  F &  64b $\times$ 64b + 64b $\rightarrow$ 64b \\
      F &  F & 2F & 2F &  16b $\times$ 16b + 32b $\rightarrow$ 32b \\
    \hline
  \end{tabular}
  \caption{General rules for supported types per instruction in base
    vector extension.  X represents the number of bits in an integer
    type and F represents the number of bits in a floating-point type.
    Individual instruction types will provide more detailed listings.
    Note that the type of a scalar floating-point operand can never be
    different from that of the vector in Src2, hence the Src1=2F case
    is missing from vector-scalar operations.}
  \label{tab:vtypemix}
\end{table}

A general rule in the base vector instruction set is that the
destination precision is never less than any source operand, except
for explicit type-conversion instructions.  Another general rule is
that the input operands can only be the same width or half the width
of the destination operand except for the scalar operand in integer
vector-scalar instructions, which is always XLEN wide.  Also, src2 is
never larger than src1 or src3.
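
The general width rules for vector-vector instructions can be
sketched as a small check.  This is a non-normative illustration: the
function name {\tt widths\_allowed} is hypothetical, widths are given
in bits, and the integer vector-scalar case (where the scalar is
always XLEN wide) is deliberately left out.

\begin{verbatim}
def widths_allowed(dest, src1, src2, src3=None):
    """Check the general width rules for a vector-vector instruction.

    All arguments are element widths in bits; src3 is None if unused.
    """
    sources = [s for s in (src1, src2, src3) if s is not None]
    # Each source is the destination width or half of it, so the
    # destination is never narrower than any source.
    if any(s != dest and 2 * s != dest for s in sources):
        return False
    # src2 is never larger than src1 or src3.
    return all(src2 <= s for s in sources)
\end{verbatim}

Under these assumptions, {\tt widths\_allowed(64, 64, 32)} accepts the
64b + 32b $\rightarrow$ 64b row of Table~\ref{tab:vtypemix}, while
{\tt widths\_allowed(64, 32, 64)} is rejected because src2 is wider
than src1.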

Integer computations on mixed-precision values always align values by
their LSB, and sign- or zero-extend any smaller value according to its
type.  The result is truncated to fit in the destination type.  Note
that a scalar integer value is already XLEN bits wide, and so is as
wide as any possible integer vector value.
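
A minimal sketch of the integer rule, using an add as the example
operation.  The helper names {\tt extend} and {\tt mixed\_add} are
hypothetical; values are passed as raw bit patterns together with
their width and signedness.

\begin{verbatim}
def extend(value, bits, signed):
    """Interpret the low `bits` bits of value per its type."""
    value &= (1 << bits) - 1
    if signed and value >> (bits - 1):
        value -= 1 << bits          # sign-extend a negative value
    return value


def mixed_add(a, a_bits, a_signed, b, b_bits, b_signed, dest_bits):
    """Add two integers of different widths, LSB-aligned."""
    total = extend(a, a_bits, a_signed) + extend(b, b_bits, b_signed)
    return total & ((1 << dest_bits) - 1)   # truncate to destination
\end{verbatim}

For example, adding the signed 8-bit pattern {\tt 0x80} to a 16-bit
zero with a 16-bit destination gives {\tt 0xFF80}, whereas treating
the same pattern as unsigned gives {\tt 0x0080}.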

Floating-point computations on mixed-precision values act as if the
calculations are performed exactly and then rounded once to the
destination format.

\section{Rapid Configuration Instructions}

It can take several CSR instructions to set up the {\tt vcfg} and
{\tt vnp} CSRs for a given configuration.  Specialized configuration
instructions are provided to quickly set up common configurations in
the {\tt vcfg} and {\tt vnp} CSRs.

The {\tt vsetdcfg} instruction takes a scalar register value encoded as
shown in Figure~\ref{fig:vcfg}, and returns the corresponding MVL in
the destination register.  The {\tt vsetdcfg} and {\tt vsetdcfgi}
instructions also clear the {\tt vnp} register, so no predicate
registers are allocated.

\begin{discussion}
  For now, only a 32-bit value supporting up to three different vector
  data types is supported by the {\tt vsetdcfg} instruction.  RV64 and
  RV128 could support a larger number of types, though it's not clear if
  the hardware cost (area, latency) to support a larger number of
  different types is justified.
\end{discussion}

\begin{figure}[b]
  \centering
  \begin{tabular}{p{1cm}p{1cm}ccc|c|c|c|c|c|c|c|l}
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{} & 
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{mode} &
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{} &  \\
    \cline{6-12}
    & & & & &
    \tt type2 & \tt ntype2 &
    \tt type1 & \tt ntype1 &
    0 &
    \tt type0 & \tt ntype0 &  \\
    \cline{6-12}
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{} & 
    \multicolumn{1}{c}{} &
    \multicolumn{1}{c}{5} &
    \multicolumn{1}{c}{5} &
    \multicolumn{1}{c}{5} &
    \multicolumn{1}{c}{5} &
    \multicolumn{1}{c}{2} &
    \multicolumn{1}{c}{5} &
    \multicolumn{1}{c}{5} &  \\
    %% \cline{2-12}
    %% & \multicolumn{1}{|c|}{0} & F128 &
    %% \multicolumn{1}{c|}{type3} & \multicolumn{1}{c|}{\#type3} &
    %% type2 & \#type2 & type1 & \#type1 & 0 & type0 & \#type0 & RV64 \\
    %% \cline{2-12}
    %% & & &
    %% \multicolumn{1}{c}{} &
    %% \multicolumn{1}{c}{24} &
    %% \multicolumn{1}{c}{5} &
    %% \multicolumn{1}{c}{5} &
    %% \multicolumn{1}{c}{5} &
    %% \multicolumn{1}{c}{5} &
    %% \multicolumn{1}{c}{2} &
    %% \multicolumn{1}{c}{5} &
    %% \multicolumn{1}{c}{5} &  \\
    %% \cline{1-12}
    %% \multicolumn{1}{|c|}{0} & \multicolumn{1}{c|}{X128} &
    %% \multicolumn{1}{c|}{F128} & I64 & F64 & F32 & F16 & I32 & I16 & I8 & RV128 \\
    %% \cline{1-12}
    %% \multicolumn{1}{c}{83} &
    %% \multicolumn{1}{c}{5} &
    %% \multicolumn{1}{c}{5} &
    %% \multicolumn{1}{c}{5} &
    %% \multicolumn{1}{c}{5} &
    %% \multicolumn{1}{c}{2} &
    %% \multicolumn{1}{c}{5} &
    %% \multicolumn{1}{c}{5} &  \\
  \end{tabular}
  \caption{Format of the {\tt vsetdcfg} value.  The value contains
    three pairs of a 5-bit type and a 5-bit number of registers
    to create of that type. A value of 0 for the number of a type
    indicates that 32 registers should be allocated.  A value of 0 for
    the type indicates this pair should be skipped.  The types must be
    of monotonically increasing size from type0 to type2. }
  \label{fig:vcfg}
\end{figure}

The {\tt vsetdcfg} value specifies how many vector registers of each
datatype are allocated, and is divided into a 2-bit mode field and
pairs of 5-bit fields for each data type in the configuration.

The 2-bit mode field indicates the configuration mode of the vector
unit and is zero for the base vector extension.

\begin{commentary}
  The standard vector extension operating mode configures the vector
  unit into some number of vector registers, each with some number of
  elements of types supported by the scalar unit.

  At least one alternative mode is planned, where the vector unit is
  configured as some number of registers each holding a single large
  element, e.g., 256 bits.  This would be the base for cryptographic
  operations, or other coprocessors that operate on large structures.

  Other modes can be used to reconfigure the vector unit register file
  and functional units for other domain-specific purposes.
\end{commentary}

Each datatype pair contains a 5-bit {\tt type}$x$ value encoded as a
{\tt vetype}$n$ value, and a 5-bit {\tt ntype}$x$ for the number of
registers to allocate for that type. If the {\tt type0} field is
non-zero, the {\tt vsetdcfg} instruction will configure the first {\tt
  ntype0} vector data registers to have {\tt vetype}$n$ values of {\tt
  type0} with {\tt vmaxew}$n$ values set accordingly as shown in
Table~\ref{tab:vetype}.  If the {\tt type0} value is 0, the datatype
pair is skipped.  If the {\tt type1} field is non-zero, then the next
{\tt ntype1} vector registers are configured to be of the type given
in {\tt type1}.  Similarly for the {\tt type2} pair.

A value of zero in a {\tt type}$x$ field indicates this datatype pair
should be ignored.  A value of zero in a {\tt ntype}$x$ field
indicates 32 registers should be allocated for the corresponding type.
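
The allocation rules can be sketched as follows.  This is a
non-normative illustration that starts from already-extracted
(type, ntype) pairs, so it makes no claim about bit positions.  The
name {\tt allocate\_registers} is hypothetical, the type codes are
opaque placeholders, and treating over-allocation as an illegal
configuration is an assumption.

\begin{verbatim}
def allocate_registers(pairs, num_regs=32):
    """Return per-register type codes for (type, ntype) pairs.

    pairs is [(type0, ntype0), (type1, ntype1), (type2, ntype2)].
    """
    regs = []
    for type_code, count in pairs:
        if type_code == 0:
            continue              # a zero type skips the whole pair
        if count == 0:
            count = 32            # a zero count means 32 registers
        regs.extend([type_code] * count)
    if len(regs) > num_regs:
        # Assumption: over-allocation is one kind of illegal config.
        raise ValueError("illegal configuration")
    return regs
\end{verbatim}

With placeholder type codes 5 and 9, the pairs {\tt [(5, 2), (0, 7),
  (9, 3)]} configure registers 0--1 with type 5 and registers 2--4
with type 9; the middle pair is skipped because its type is zero.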

\begin{commentary}
Zero values are skipped to simplify setting a configuration with two
different data types, where a single LUI instruction can set the upper
20 bits leaving the low bits zero.

A single 12-bit immediate value is sufficient to create a
configuration with some number of vector registers with a single given
datatype.

A compressed C.LI with a zero-extended 5-bit immediate can create a
configuration with 32 vector registers of a given datatype.
\end{commentary}

A corresponding {\tt vsetdcfgi} instruction takes a 12-bit immediate
value to set the configuration instead of a scalar value, but
otherwise is identical to the {\tt vsetdcfg} instruction.

\begin{discussion}
It is not clear how many immediate bits will be made available for the
{\tt vsetdcfgi} instruction.  If encoding space is available for both
12 immediate bits and a source register specifier, then {\tt
  vsetdcfgi} can be defined to read the source register, OR in the
bits in the immediate, then create a configuration.  In this case,
there is no need for a separate {\tt vsetdcfg} instruction.
\end{discussion}

The configuration value given must result in a legal configuration or
else an illegal instruction exception will be raised.

If a zero argument is given to {\tt vsetdcfg} the vector unit will be
disabled and the value 0 will be returned for MVL.  This instruction
({\tt vsetdcfg x0, x0}) is given the assembly pseudo-instruction {\tt
  vdisable}.

Separate {\tt vsetpcfg} and {\tt vsetpcfgi} instructions are provided
that write the source value to the {\tt vnp} register and return the
new MVL.  These writes also clear the vector data registers, set all
bits in the allocated predicate registers, and set {\tt vl}=MVL. A
{\tt vsetpcfg} or {\tt vsetpcfgi} instruction can be used after a {\tt
  vsetdcfg} to complete a reconfiguration of the vector unit.

\begin{discussion}
  If {\tt vnp} is made accessible as a separate CSR, the {\tt vsetpcfg}
  and {\tt vsetpcfgi} instructions are less useful.  The only advantage
  over a CSR instruction is that they return MVL, which is rarely
  needed, and which can be obtained via the {\tt setvl} instruction.
\end{discussion}

\section{Vector-Type-Change Instructions}

To quickly change the individual types of a vector register, {\tt
  vetyperw} and {\tt vetyperwi} instructions are provided to change
the type of the specified vector data register to the given scalar
register value or 5-bit immediate value respectively, while returning
the previous type in the destination scalar register.

A vector convert instruction, described below, can simultaneously
convert a source vector register into a new type, and set that type in
the destination vector register.

\section{Vector Length}

The active vector length is held in the XLEN-bit WARL vector length
CSR {\tt vl}, which can only hold values between 0 and MVL inclusive.
Any writes to the configuration registers ({\tt vcfg}$x$ or {\tt
  vnp}) cause {\tt vl} to be initialized with MVL. Changes to {\tt
  vetype}$n$ via vector-type-change instructions do not affect {\tt
  vl}.

The active vector length is usually set via the {\tt setvl}
instruction.  The source argument to the {\tt setvl} is the requested
application vector length (AVL) as an unsigned XLEN-bit integer. The
{\tt setvl} instruction calculates the value to assign to {\tt vl}
according to Table~\ref{tab:vlcalc}.  The result of this calculation
is also returned as the result of the {\tt setvl} instruction.

\begin{commentary}
Earlier drafts encoded {\tt setvl} using a modified CSRRW instruction
whereas it is now encoded as a separate new instruction.
\end{commentary}

\begin{table}
  \centering
  \begin{tabular}{|c|c|}
    \hline
    AVL Value & {\tt vl} setting \\
    \hline
    AVL $\geq$ 2\,MVL & MVL \\
    2\,MVL $>$ AVL $>$ MVL & $\lceil$AVL$/2\rceil$ \\
    MVL $\geq$ AVL & AVL \\
    \hline
  \end{tabular}
  \caption{Operation of {\tt setvl} instruction to set vector
    length register {\tt vl} based on requested application vector
    length (AVL) and current maximum vector length (MVL).}
  \label{tab:vlcalc}
\end{table}
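
The calculation in Table~\ref{tab:vlcalc} can be written out as a
short sketch.  The Python function names {\tt setvl} and {\tt
  strip\_lengths} are illustrative only; the second function simply
replays a stripmined loop over {\tt n} elements and assumes a nonzero
MVL.

\begin{verbatim}
def setvl(avl, mvl):
    """Return the new vl for a requested application vector length."""
    if avl >= 2 * mvl:
        return mvl
    if avl > mvl:
        return (avl + 1) // 2     # ceil(avl / 2)
    return avl


def strip_lengths(n, mvl):
    """Lengths of successive strips when stripmining n elements.

    Assumes mvl > 0, so that every strip makes progress.
    """
    lengths = []
    while n > 0:
        vl = setvl(n, mvl)
        lengths.append(vl)
        n -= vl
    return lengths
\end{verbatim}

For 20 elements with MVL=8, this gives strips of 8, 6, and 6 elements
rather than 8, 8, and 4, so the final two iterations are balanced.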

\begin{commentary}
  The rules for setting the {\tt vl} register help keep vector
  pipelines full over the last two iterations of a stripmined loop.
  This version of the rules guarantees monotonically decreasing vector
  lengths. 
  Similar rules were previously used in Cray-designed machines~\cite{crayx1asm}.
\end{commentary}

\begin{discussion}
  There are multiple possible rules for setting VL, and we could give
  implementations freedom to use different VL setting rules.
\end{discussion}

\begin{commentary}
  The idea of having implementation-defined vector length dates back
  to at least the IBM 3090 Vector Facility~\cite{ibm370varch}, which
  used a special ``Load Vector Count and Update'' (VLVCU) instruction
  to control stripmine loops.  The {\tt setvl} instruction included
  here is based on the simpler {\tt setvlr} instruction introduced by
  Asanovi\'{c}~\cite{krstephd}.
\end{commentary}

The {\tt setvl} instruction is typically used at the start of every
iteration of a stripmined loop to set the number of vector elements to
operate on in the following loop iteration.  The current MVL can be
obtained from a vector configuration instruction, or by performing a
{\tt setvl} with a source argument that has all bits set (largest
unsigned integer).

When {\tt vl} is less than MVL, vector instructions will set all
elements in the range [{\tt vl}:MVL-1] in the destination vector
data register or destination vector predicate register to zero.

\begin{commentary}
  Requiring zeroing of elements past the current active vector length
  simplifies the design of units with renamed vector data registers.
  If the specification left destination elements unchanged, renaming
  implementations would have to copy the tail of the old destination
  register to the newly allocated destination register.
  Alternatively, specifying the tail to be undefined will expose
  implementation differences and possibly cause a security hole.

  Implementations that do not support renaming will have to zero the
  tail of a vector, but this can reuse the mechanism that is already
  required to initialize all vector data registers to zero on
  reconfiguration, for example, by having a zero bit on each element
  or element group.
\end{commentary}

No element operations are performed for any vector instruction when
{\tt vl}=0.
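
The tail-zeroing rule and the {\tt vl}=0 rule can be modelled with a
few lines of Python.  This is a sketch under simple assumptions:
vector registers are lists whose length stands in for MVL, predication
is ignored, and the function name {\tt vector\_op} is hypothetical.

\begin{verbatim}
def vector_op(op, dest, src1, src2, vl):
    """Apply op elementwise to the first vl elements.

    Returns the new contents of the destination register.
    """
    mvl = len(dest)
    if vl == 0:
        return list(dest)         # vl=0: no element operations at all
    result = [op(a, b) for a, b in zip(src1[:vl], src2[:vl])]
    return result + [0] * (mvl - vl)   # elements vl..MVL-1 are zeroed
\end{verbatim}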

\begin{commentary}
  Two possible choices are to 1) require destination registers to be
  completely zeroed when {\tt vl}=0, or 2) make no changes to the
  destination registers.  Option 2 is currently chosen as this
  prevents unnecessary work in some implementations, and option 1 does
  not provide a clear advantage beyond seeming more consistent with
  the {\tt vl}$>$0 case.
\end{commentary}

\begin{figure}[bt]
  \centering
\begin{verbatim}
                 # Vector-vector 32-bit add loop.
                 # a0 holds N
                 # a1 holds pointer to result vector
                 # a2 holds pointer to first source vector
                 # a3 holds pointer to second source vector
                 li t0, (2<<VNTYPE0|VREGF32)
                 vsetdcfg t0     # Configure with two 32-bit float vectors

          loop:  setvl t0, a0      # Set length, get how many elements in strip
                 vld v0, a2        # Load first vector
                 slli t1, t0, 2    # Multiply length by 4 to get bytes
                 add a2, a2, t1    # Bump pointer
                 vld v1, a3        # Load second vector
                 add a3, a3, t1    # Bump pointer
                 vadd v0, v0, v1   # Add elements
                 sub a0, a0, t0    # Decrement elements completed
                 vst  v0, a1       # Store result vector
                 add a1, a1, t1    # Bump pointer
                 bnez a0, loop     # Any more?

                 vdisable        # Turn off vector unit
\end{verbatim}
\caption{Example vector-vector add loop.}
\label{fig:vvadd}
\end{figure}

\section{Predicated Execution}


\begin{commentary}
  The 32-bit base encoding does not leave room for a fully orthogonal
  predicate register specifier.  A single bit is dedicated to the
  predicate register specification, and is used to select between two
  active predicate registers, {\tt vp0} or {\tt vp1}. An alternative
  scheme would have used the bit to select between {\tt vp0} and
  unpredicated (all elements active).  However, given the ease of
  setting all predicate bits in a vector predicate register with a
  single predicate instruction, the current scheme provides more
  flexibility.

  When there are no vector predicate registers enabled, {\tt vp0}
  returns all set bits when read.  So, the assembler convention is to
  assume {\tt vp0} as the predicate register when no predicate
  register is explicitly given.  The assembler can support a
  strict-operands option that requires the vector predicate register
  to be explicitly specified.
\end{commentary}

At element positions where the selected predicate register bit is
zero, the corresponding vector element operation has no effect (does
not change architectural state or generate exceptions), except to
write a zero to the element position in the destination vector
register.

\begin{discussion}
  The previous proposal (undisturb) left the destination vector
  unchanged at element positions where the predicate bit is false,
  whereas the current plan-of-record (zero) writes zero to the
  destination where the predicate bit is false.

  The advantage of the undisturb option is that it can require fewer
  instructions and fewer architectural registers for many common code
  sequences.  For in-order machines without register renaming, the
  undisturb operation simply disables writes to the destination
  elements, except for vector registers that have not been written
  since configuration time. Typically an extra zero bit per vector
  register or element group will be added to represent a zeroed
  register instead of actually zeroing state at configuration time.
  For predicated undisturb writes to these uninitialized registers,
  the predicated false elements must be explicitly written with zeros
  on each element group and the zero bit is then cleared.
  However, in a machine with vector register renaming, undisturb does
  imply an additional read of the original destination register to
  write the value into the new physical destination register when the
  predicate is false.  This additional read port will often be cheaper
  than in a scalar machine as vector machines often time-multiplex
  read ports, and the additional read can be skipped when the
  predicate registers are disabled ({\tt vnp}=0) or when the source is
  known to be zero after configuration, but still adds complexity to a
  design.

  The advantage of the zero option is that a machine with vector
  register renaming does not need to read the original destination
  vector register and so a read port is saved.  The disadvantage of
  the zero option is that more instructions and architectural
  registers are required for common code sequences, and simpler
  microarchitectures without register renaming are penalized by
  requiring longer code sequences and greater register pressure.  In
  particular, vector merge instructions are required to collect
  results from two divergent control paths, and each vector merge has
  to read two vector values and write a vector result.  Whether the
  zero option saves total register file traffic in a register-renamed
  microarchitecture depends on the ratio of a) internal temporary
  writes, to b) writes creating values that are live out of each basic
  block, and also to the frequency of control flow merges.

  Overall, the zero option removes significant complexity from the
  renamed machines while reducing efficiency somewhat for the
  non-renamed machines, and is the current plan-of-record.
\end{discussion}
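
The cost of the zero option for divergent control flow, as described
in the discussion above, can be seen in a small model.  This is a
non-normative sketch that assumes {\tt vl}=MVL; the function names
{\tt predicated\_op} and {\tt merge} are hypothetical and do not
correspond to instruction mnemonics.

\begin{verbatim}
def predicated_op(op, src1, src2, pred):
    """Elementwise op under a predicate, using the zero policy."""
    return [op(a, b) if p else 0 for a, b, p in zip(src1, src2, pred)]


def merge(pred, if_true, if_false):
    """Select per element from two vectors according to pred."""
    return [t if p else f for p, t, f in zip(pred, if_true, if_false)]


def if_else_example(a, b, pred):
    """Compute a+b where pred is set and a-b elsewhere."""
    inverse = [0 if p else 1 for p in pred]
    then_part = predicated_op(lambda x, y: x + y, a, b, pred)
    else_part = predicated_op(lambda x, y: x - y, a, b, inverse)
    # Each path zeroed the other path's elements, so the two
    # temporaries are combined with an explicit merge.
    return merge(pred, then_part, else_part)
\end{verbatim}

Under the undisturb option, the second predicated operation could
instead write directly into the first result, avoiding the extra
temporary and the merge.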

\section{Vector Load/Store Instructions}

Three vector load/store addressing modes are supported: unit-stride,
constant-stride, and indexed (scatter/gather).  Each addressing mode
has a 7-bit unsigned immediate offset that is scaled by the element
size.

The unit-stride address mode takes a scalar base byte address, adds
the scaled immediate, then generates a contiguous set of element
addresses for loads or stores.

\begin{commentary}
  The primary use of immediates in unit-stride loads is to generate
  overlapping unit-stride loads for convolution operations.
\end{commentary}

The constant-stride address mode takes a scalar base byte address, a
stride value encoded in bytes, and adds a scaled immediate value.

\begin{commentary}
  The stride value is in bytes to allow a single stride register to be
  used to support operations on arrays-of-structures, where not all
  elements in each structure have the same size.  The immediate value
  is still scaled by element size to increase reach, given that
  element types will be naturally aligned.
\end{commentary}

The indexed address mode takes a scalar base byte address and a vector
of byte offsets.  The scalar base address and the scaled immediate
value are added to each element of the offset vector to give a vector
of addresses used in a scatter/gather.
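
The three address calculations can be summarized in a short sketch.
The Python function names are hypothetical, all addresses and strides
are in bytes, and {\tt elem\_bytes} is the size of the element type
used to scale the 7-bit unsigned immediate.

\begin{verbatim}
def scaled_base(base, imm, elem_bytes):
    """Base address plus the immediate scaled by element size."""
    assert 0 <= imm < 128         # 7-bit unsigned immediate
    return base + imm * elem_bytes


def unit_stride_addrs(base, imm, elem_bytes, vl):
    start = scaled_base(base, imm, elem_bytes)
    return [start + i * elem_bytes for i in range(vl)]


def const_stride_addrs(base, stride_bytes, imm, elem_bytes, vl):
    start = scaled_base(base, imm, elem_bytes)
    return [start + i * stride_bytes for i in range(vl)]


def indexed_addrs(base, offsets, imm, elem_bytes):
    start = scaled_base(base, imm, elem_bytes)
    return [start + off for off in offsets]   # offsets are in bytes
\end{verbatim}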

Indexed stores are provided in three types: unordered, ordered, and
reverse-ordered.  The unordered indexed stores might update the same
memory location from two different elements in an unspecified order.
The ordered stores always update memory locations in increasing vector
element order.  The reverse-ordered stores always update memory
locations in decreasing vector element order.
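
The difference between the ordered variants only shows when two
elements target the same location.  The following sketch models memory
as a Python dictionary and assumes that reverse-ordered means
decreasing vector element order; the function name {\tt
  indexed\_store} is hypothetical.

\begin{verbatim}
def indexed_store(memory, addrs, values, order):
    """Scatter values into memory; order is 'ordered' or 'reverse'."""
    elements = list(range(len(addrs)))
    if order == "reverse":
        elements.reverse()        # highest element is written first
    for i in elements:
        memory[addrs[i]] = values[i]   # later writes overwrite earlier
    return memory
\end{verbatim}

If elements 0 and 2 both store to the same address, the ordered form
leaves the value from element 2 in memory, while the reverse-ordered
form leaves the value from element 0.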

\begin{commentary}
  The reverse-ordered stores support vectorization of software memory
  disambiguation techniques.  A reverse-ordered store of element id
  into a hash table indexed by a hash on a store access address,
  followed by a read of the hash table using a load access address and
  a comparison against the original element id, will indicate if
  there's a potential RAW hazard with an earlier loop iteration.
\end{commentary}
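
One way to read the disambiguation technique described above is
sketched below.  Several details are assumptions made for
illustration: the hash is the address modulo the table size, the table
is pre-filled with a sentinel equal to the vector length, and the name
{\tt raw\_hazards} is hypothetical.  Because of hash collisions the
result is conservative and may flag hazards that do not exist.

\begin{verbatim}
def raw_hazards(load_addrs, store_addrs, table_size=64):
    """Flag elements that may read what an earlier element writes."""
    vl = len(store_addrs)
    table = [vl] * table_size     # sentinel: slot not claimed
    # Reverse-ordered scatter of element ids, so that the lowest
    # element id is the one left in each slot.
    for i in reversed(range(vl)):
        table[store_addrs[i] % table_size] = i
    # Gather by load address and compare with the element's own id.
    return [table[load_addrs[i] % table_size] < i for i in range(vl)]
\end{verbatim}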

\begin{discussion}
  It is not clear whether there is sufficient realizable improvement
  from supporting unordered stores over ordered stores.
\end{discussion}

Vector loads/stores have a simple memory model, where each vector
load/store is observed to complete sequentially in program order on
the local hart, i.e., a vector load on a hart will observe all earlier
vector stores on the same hart, and no later vector stores.

Vector loads are available in a length-speculative form that writes
predicate register {\tt vp1} in addition to the destination vector
data register.  These instructions raise an illegal instruction
exception if {\tt vp1} is not configured.  For elements that do not
generate a permissions fault, the length-speculative vector loads
operate normally except that they also clear the corresponding bit in
{\tt vp1}.  If an element encounters a permissions fault, a zero is
written to the destination vector register element and the {\tt vp1}
bit is set to 1.  Implementations may treat elements past the first
faulting element
as also causing a fault even if they might not cause a permissions
fault when accessed alone.

Once software determines the active vector length, it should check if
any loads within the active vector length caused a fault, and in this
case, generate a non-length-speculative load to trigger reporting of
the error.
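
A sketch of how software might use a length-speculative load for a
vectorized {\tt strlen} is given below.  The model is an assumption
for illustration only: memory is a Python dictionary from byte address
to value, an address missing from the dictionary stands for a
permissions fault, and raising {\tt MemoryError} stands in for the
non-length-speculative load that reports the error.

\begin{verbatim}
def speculative_load(memory, base, vl):
    """Return (data, vp1) for a length-speculative byte load."""
    data, vp1 = [], []
    for addr in range(base, base + vl):
        if addr in memory:
            data.append(memory[addr])
            vp1.append(0)         # no fault: vp1 bit cleared
        else:
            data.append(0)        # fault: zero written to element
            vp1.append(1)         # fault: vp1 bit set
    return data, vp1


def strlen(memory, base, vl=4):
    """Count bytes before the terminating zero, one strip at a time."""
    n = 0
    while True:
        data, vp1 = speculative_load(memory, base + n, vl)
        for i in range(vl):
            if vp1[i]:
                # Fault inside the active length: report it.
                raise MemoryError(base + n + i)
            if data[i] == 0:
                return n + i      # faults past this point are ignored
        n += vl
\end{verbatim}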

\begin{commentary}
  Length-speculative vector loads are required to vectorize while
  loops with data-dependent exits (e.g., strlen).

  The only faults ignored by the length-speculative vector loads are
  ones that would have resulted in a permissions violation.  Page
  faults and other virtualization-related faults should be handled
  invisibly to the user thread by the execution environment.

  A malicious program can use length-speculative vector loads to probe
  accessible address space without fear of a fatal fault.
\end{commentary}

\section{Vector Register Gather}

A vector register gather produces a new result data vector by gathering
elements from one source data vector at the element locations
specified by a second source index vector.  Data source and
destination vector types must agree.  The index vector can have any
integer type.  Legal element indices can range from 0 to current
MAXVL.  Indices out of this range raise an illegal instruction
exception.

\begin{verbatim}
  # vindices holds values from 0..MAXVL
  vrgather  vdest, vsrc, vindices
\end{verbatim}
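
The element-level behaviour can be sketched in Python as follows.  The
function name is illustrative, vectors are modelled as lists, and the
sketch assumes that valid indices are those that select an existing
source element, with {\tt IndexError} standing in for the illegal
instruction exception.

\begin{verbatim}
def vrgather(src, indices):
    """Gather src elements at the positions given by indices."""
    out = []
    for idx in indices:
        if not 0 <= idx < len(src):
            raise IndexError("illegal element index")
        out.append(src[idx])
    return out
\end{verbatim}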

\section{Vector Slide}

Reductions (and convolutions) are supported via a vector slide
instruction that takes elements starting from the middle of one vector
and places these at the beginning of a second vector register.  This
supports a recursive-halving reduction approach for any binary
associative operator.
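
A recursive-halving reduction built on a slide can be sketched as
follows.  The names {\tt slide\_down} and {\tt reduce\_by\_halving}
are hypothetical, the zero fill of vacated elements is an assumption,
and the vector length is taken to be a power of two.  Because this
particular pairing combines element $i$ with element $i+n/2$, the
sketch also assumes the operator is commutative.

\begin{verbatim}
def slide_down(vec, offset):
    """Move elements offset.. to the front; zero-fill the tail."""
    return vec[offset:] + [0] * offset


def reduce_by_halving(op, vec):
    """Reduce a power-of-2-length list with a binary operator."""
    n = len(vec)
    while n > 1:
        n //= 2
        upper = slide_down(vec, n)     # upper half moves to the front
        # Combine the two halves with an active length of n.
        vec = [op(a, b) for a, b in zip(vec[:n], upper[:n])]
    return vec[0]
\end{verbatim}

An eight-element vector is reduced in three steps, using slide offsets
of 4, 2, and 1.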

\begin{commentary}
  A similar vector register extract instruction was added to the Cray
  C90 after memory latency grew too large for the memory-memory
  reductions used in earlier Crays.

  The vector unit microarchitecture can be optimized for the
  power-of-2 sized element offsets used for reductions.
\end{commentary}


\section{Fixed-Point Support}

A clip instruction supports scaling, rounding, and clipping to the
destination type.  Rounding is set by the CSR fixed-point rounding
mode (truncate, jam, round-up, round-nearest-even).  Clipping is set
by the CSR clip mode (wrap, saturate).
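
The following sketch shows one plausible reading of the rounding and
clip modes; the text above does not define them precisely, so every
detail here is an assumption.  Scaling is modelled as an arithmetic
right shift, ``jam'' as ORing any discarded bits into the result LSB,
``round-up'' as adding half an LSB before truncating, and the
destination as a signed integer.  The function names are hypothetical.

\begin{verbatim}
def round_shift(value, shift, mode):
    """Arithmetic right shift by shift bits with rounding."""
    if shift == 0:
        return value
    truncated = value >> shift               # rounds toward -infinity
    dropped = value & ((1 << shift) - 1)     # bits shifted out
    half = 1 << (shift - 1)
    if mode == "truncate":
        return truncated
    if mode == "jam":
        return truncated | (1 if dropped else 0)
    if mode == "round-up":
        return (value + half) >> shift       # ties round upward
    if mode == "round-nearest-even":
        if dropped > half or (dropped == half and truncated & 1):
            return truncated + 1
        return truncated
    raise ValueError(mode)


def clip(value, shift, bits, rounding, clip_mode):
    """Scale, round, and clip to a signed bits-wide destination."""
    result = round_shift(value, shift, rounding)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if clip_mode == "saturate":
        return max(lo, min(hi, result))
    # wrap: keep the low bits and reinterpret them as signed
    result &= (1 << bits) - 1
    return result - (1 << bits) if result > hi else result
\end{verbatim}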

An add with average is provided, with rounding set by the rounding
mode.

A multiply is provided with same-size source and destination types,
with some result scaling values (+1, 0, -1, -8?) and with rounding and
clipping according to the CSR modes.

An accumulate with carry into a predicate register supports larger
precise dot-products.

\section{Optional Transcendental Support}