-rw-r--r-- | src/a.tex | 524
-rw-r--r-- | src/b.tex | 19
-rw-r--r-- | src/c.tex | 1268
-rw-r--r-- | src/counters.tex | 252
-rw-r--r-- | src/csr.tex | 260
-rw-r--r-- | src/d.tex | 442
-rw-r--r-- | src/extensions.tex | 383
-rw-r--r-- | src/f.tex | 851
-rw-r--r-- | src/history.tex | 403
-rw-r--r-- | src/intro.tex | 770
-rw-r--r-- | src/j.tex | 13
-rw-r--r-- | src/m.tex | 188
-rw-r--r-- | src/memory-model-alloy.tex | 269
-rw-r--r-- | src/memory-model-herd.tex | 160
-rw-r--r-- | src/naming.tex | 189
-rw-r--r-- | src/p.tex | 14
16 files changed, 0 insertions, 6005 deletions
diff --git a/src/a.tex b/src/a.tex deleted file mode 100644 index 1600cc6..0000000 --- a/src/a.tex +++ /dev/null @@ -1,524 +0,0 @@ -\chapter{``A'' Standard Extension for Atomic Instructions, Version 2.1} -\label{atomics} - -The standard atomic-instruction extension, named ``A'', -contains instructions that atomically -read-modify-write memory to support synchronization between multiple -RISC-V harts running in the same memory space. The two forms of -atomic instruction provided are load-reserved/store-conditional -instructions and atomic fetch-and-op memory instructions. Both types -of atomic instruction support various memory consistency orderings -including unordered, acquire, release, and sequentially consistent -semantics. These instructions allow RISC-V to support the RCsc memory -consistency model~\cite{Gharachorloo90memoryconsistency}. - -\begin{commentary} -After much debate, the language community and architecture community -appear to have finally settled on release consistency as the standard -memory consistency model and so the RISC-V atomic support is built -around this model. -\end{commentary} - -\section{Specifying Ordering of Atomic Instructions} - -The base RISC-V ISA has a relaxed memory model, with the FENCE -instruction used to impose additional ordering constraints. The -address space is divided by the execution environment into memory and -I/O domains, and the FENCE instruction provides options to order -accesses to one or both of these two address domains. - -To provide more efficient support for release -consistency~\cite{Gharachorloo90memoryconsistency}, each atomic -instruction has two bits, {\em aq} and {\em rl}, used to specify -additional memory ordering constraints as viewed by other RISC-V -harts. The bits order accesses to one of the two address domains, -memory or I/O, depending on which address domain the atomic -instruction is accessing. 
No ordering constraint is implied to -accesses to the other domain, and a FENCE instruction should be used -to order across both domains. - -If both bits are clear, no additional ordering constraints are imposed -on the atomic memory operation. If only the {\em aq} bit is set, the -atomic memory operation is treated as an {\em acquire} access, i.e., -no following memory operations on this RISC-V hart can be observed -to take place before the acquire memory operation. If only the {\em - rl} bit is set, the atomic memory operation is treated as a {\em - release} access, i.e., the release memory operation cannot be -observed to take place before any earlier memory operations on this -RISC-V hart. If both the {\em aq} and {\em rl} bits are set, the -atomic memory operation is {\em sequentially consistent} and cannot be -observed to happen before any earlier memory operations or after any -later memory operations in the same RISC-V hart and to the same -address domain. - -\section{Load-Reserved/Store-Conditional Instructions} -\label{sec:lrsc} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}W@{}W@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbit{26} & -\instbit{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{aq} & -\multicolumn{1}{c|}{rl} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\ -LR.W/D & \multicolumn{2}{c}{ordering} & 0 & addr & width & dest & AMO \\ -SC.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ -\end{tabular} -\end{center} - -Complex atomic memory operations on a single memory word or doubleword are performed -with the load-reserved (LR) and store-conditional (SC) instructions. 
-LR.W loads a word from the address in {\em rs1}, places the sign-extended -value in {\em rd}, and registers a {\em reservation set}---a set of bytes -that subsumes the bytes in the addressed word. -SC.W conditionally writes a word in {\em rs2} to the address in {\em rs1}: the -SC.W succeeds only if the reservation is still valid and the reservation set -contains the bytes being written. -If the SC.W succeeds, the instruction writes the word in {\em rs2} to memory, -and it writes zero to {\em rd}. -If the SC.W fails, the instruction does not write to memory, and it writes -a nonzero value to {\em rd}. -Regardless of success or failure, executing an SC.W instruction invalidates -any reservation held by this hart. -LR.D and SC.D act analogously on doublewords and are only available on RV64. -For RV64, LR.W and SC.W sign-extend the value placed in {\em rd}. - -\begin{commentary} -Both compare-and-swap (CAS) and LR/SC can be used to build lock-free -data structures. After extensive discussion, we opted for LR/SC for -several reasons: 1) CAS suffers from the ABA problem, which LR/SC -avoids because it monitors all writes to the address rather than -only checking for changes in the data value; 2) CAS would also require -a new integer instruction format to support three source operands -(address, compare value, swap value) as well as a different memory -system message format, which would complicate microarchitectures; 3) -Furthermore, to avoid the ABA problem, other systems provide a -double-wide CAS (DW-CAS) to allow a counter to be tested and -incremented along with a data word. 
This requires reading five -registers and writing two in one instruction, and also a new larger -memory system message type, further complicating implementations; 4) -LR/SC provides a more efficient implementation of many primitives as -it only requires one load as opposed to two with CAS (one load before -the CAS instruction to obtain a value for speculative computation, -then a second load as part of the CAS instruction to check if value is -unchanged before updating). - -The main disadvantage of LR/SC over CAS is livelock, which we avoid, -under certain circumstances, -with an architected guarantee of eventual forward progress as -described below. Another concern is whether the influence of the -current x86 architecture, with its DW-CAS, will complicate porting of -synchronization libraries and other software that assumes DW-CAS is -the basic machine primitive. A possible mitigating factor is the -recent addition of transactional memory instructions to x86, which -might cause a move away from DW-CAS. - -More generally, a multi-word atomic primitive is desirable, but there is -still considerable debate about what form this should take, and -guaranteeing forward progress adds complexity to a system. -\end{commentary} - -The failure code with value 1 encodes an unspecified failure. -Other failure codes are reserved at this time. -Portable software should only assume the failure code will be non-zero. - -\begin{commentary} -We reserve a failure code of 1 to mean ``unspecified'' so that simple -implementations may return this value using the existing mux required -for the SLT/SLTU instructions. More specific failure codes might be -defined in future versions or extensions to the ISA. -\end{commentary} - -For LR and SC, the A extension requires that the address held in {\em - rs1} be naturally aligned to the size of the operand (i.e., -eight-byte aligned for 64-bit words and four-byte aligned for 32-bit -words). 
If the address is not naturally aligned, an address-misaligned -exception or an access-fault exception will be generated. The access-fault -exception can be generated for a memory access that would otherwise be -able to complete except for the misalignment, if the misaligned access -should not be emulated. - -\begin{commentary} -Emulating misaligned LR/SC sequences is impractical in most systems. - -Misaligned LR/SC sequences also raise the possibility of accessing multiple -reservation sets at once, which present definitions do not provide for. -\end{commentary} - -An implementation can register an arbitrarily large reservation set on each -LR, provided the reservation set includes all bytes of the addressed data word -or doubleword. -An SC can only pair with the most recent LR in program order. -An SC may succeed only if no store from another hart -to the reservation set can be observed to have occurred between the LR -and the SC, and if there is no other SC between the LR and itself in program -order. -An SC may succeed only if no write from a device other than a hart -to the bytes accessed by the LR instruction can be observed to have occurred -between the LR and SC. -Note this LR might have had a different effective address and data size, but -reserved the SC's address as part of the reservation set. -\begin{commentary} -Following this model, in systems with memory translation, an SC is allowed to -succeed if the earlier LR reserved the same location using an alias with -a different virtual address, but is also allowed to fail if the virtual -address is different. -\end{commentary} -\begin{commentary} -To accommodate legacy devices and buses, writes from devices other than RISC-V -harts are only required to invalidate reservations when they overlap the bytes -accessed by the LR. These writes are not required to invalidate the -reservation when they access other bytes in the reservation set. 
-\end{commentary} - -The SC must fail if the address is not within the reservation set of the most -recent LR in program order. -The SC must fail if a store to the reservation set from another hart can be -observed to occur between the LR and SC. -The SC must fail if a write from some other device to the bytes accessed by -the LR can be observed to occur between the LR and SC. -(If such a device writes the reservation set but does not write the bytes -accessed by the LR, the SC may or may not fail.) -An SC must fail if there is another SC (to any address) between the LR and the -SC in program order. -The precise statement of the atomicity requirements for successful LR/SC -sequences is defined by the Atomicity Axiom in Section~\ref{sec:rvwmo}. - -\begin{commentary} -The platform should provide a means to determine the size and shape of the -reservation set. - -A platform specification may constrain the size and shape of the reservation -set. -\end{commentary} - -\begin{commentary} -A store-conditional instruction to a scratch word of memory should be used -to forcibly invalidate any existing load reservation: -\begin{itemize} -\item during a preemptive context switch, and -\item if necessary when changing virtual to physical address mappings, - such as when migrating pages that might contain an active reservation. -\end{itemize} -\end{commentary} - -\begin{commentary} -The invalidation of a hart's reservation when it executes an LR or SC -imply that a hart can only hold one reservation at a time, and that -an SC can only pair with the most recent LR, and LR with the next -following SC, in program order. This is a restriction to the -Atomicity Axiom in Section~\ref{sec:rvwmo} that ensures software runs -correctly on expected common implementations that operate in this manner. -\end{commentary} - -An SC instruction can never be observed by another RISC-V hart -before the LR instruction that established the reservation. 
-The LR/SC -sequence can be given acquire semantics by setting the {\em aq} bit on -the LR instruction. The LR/SC sequence can be given release semantics -by setting the {\em rl} bit on the SC instruction. Setting the {\em - aq} bit on the LR instruction, and setting both the {\em aq} and the {\em - rl} bit on the SC instruction makes the LR/SC sequence sequentially -consistent, meaning that it cannot be reordered with earlier or -later memory operations from the same hart. - -If neither bit is set on both LR and SC, the LR/SC sequence can be -observed to occur before or after surrounding memory operations from -the same RISC-V hart. This can be appropriate when the LR/SC -sequence is used to implement a parallel reduction operation. - -Software should not set the {\em rl} bit on an LR instruction unless the {\em -aq} bit is also set, nor should software set the {\em aq} bit on an SC -instruction unless the {\em rl} bit is also set. LR.{\em rl} and SC.{\em aq} -instructions are not guaranteed to provide any stronger ordering than those -with both bits clear, but may result in lower performance. - -\begin{figure}[h!] -\begin{center} -\begin{verbatim} - # a0 holds address of memory location - # a1 holds expected value - # a2 holds desired value - # a0 holds return value, 0 if successful, !0 otherwise - cas: - lr.w t0, (a0) # Load original value. - bne t0, a1, fail # Doesn't match, so fail. - sc.w t0, a2, (a0) # Try to update. - bnez t0, cas # Retry if store-conditional failed. - li a0, 0 # Set return to success. - jr ra # Return. - fail: - li a0, 1 # Set return to failure. - jr ra # Return. -\end{verbatim} -\end{center} -\caption{Sample code for compare-and-swap function using LR/SC.} -\label{cas} -\end{figure} - -LR/SC can be used to construct lock-free data structures. An example -using LR/SC to implement a compare-and-swap function is shown in -Figure~\ref{cas}. If inlined, compare-and-swap functionality need -only take four instructions. 
- -\section{Eventual Success of Store-Conditional Instructions} -\label{sec:lrscseq} - -The standard A extension defines {\em constrained LR/SC loops}, which have -the following properties: -\vspace{-0.2in} -\begin{itemize} -\parskip 0pt -\itemsep 1pt -\item The loop comprises only an LR/SC sequence and code to retry the sequence - in the case of failure, and must comprise at most 16 instructions placed - sequentially in memory. -\item An LR/SC sequence begins with an LR instruction and ends with an SC - instruction. The dynamic code executed between the LR and SC instructions - can only contain instructions from the base ``I'' instruction set, excluding - loads, stores, backward jumps, taken backward branches, JALR, FENCE, - and SYSTEM instructions. - If the ``C'' extension is supported, then compressed - forms of the aforementioned ``I'' instructions are also permitted. -\item The code to retry a failing LR/SC sequence can contain backwards jumps - and/or branches to repeat the LR/SC sequence, but otherwise has the same - constraint as the code between the LR and SC. -\item The LR and SC addresses must lie within a memory region with the - {\em LR/SC eventuality} property. The execution environment is responsible - for communicating which regions have this property. -\item The SC must be to the same effective address and of the same data size as - the latest LR executed by the same hart. -\end{itemize} - -LR/SC sequences that do not lie within constrained LR/SC loops are {\em -unconstrained}. Unconstrained LR/SC sequences might succeed on some attempts -on some implementations, but might never succeed on other implementations. - -\begin{commentary} -We restricted the length of LR/SC loops to fit within 64 contiguous -instruction bytes in the base ISA to avoid undue restrictions on instruction -cache and TLB size and associativity. 
-Similarly, we disallowed other loads and stores within the loops to avoid -restrictions on data-cache associativity in simple implementations that track -the reservation within a private cache. -The restrictions on branches and jumps limit the time that -can be spent in the sequence. Floating-point operations and integer -multiply/divide were disallowed to simplify the operating system's emulation -of these instructions on implementations lacking appropriate hardware support. - -Software is not forbidden from using unconstrained LR/SC sequences, but -portable software must detect the case that the sequence repeatedly fails, -then fall back to an alternate code sequence that does not rely on an -unconstrained LR/SC sequence. Implementations are permitted to -unconditionally fail any unconstrained LR/SC sequence. -\end{commentary} - -If a hart {\em H} enters a constrained LR/SC loop, the execution environment -must guarantee that one of the following events eventually occurs: -\vspace{-0.2in} -\begin{itemize} -\parskip 0pt -\itemsep 1pt -\item {\em H} or some other hart executes a successful SC to the reservation - set of the LR instruction in {\em H}'s constrained LR/SC loops. -\item Some other hart executes an unconditional store or AMO instruction to - the reservation set of the LR instruction in {\em H}'s constrained LR/SC - loop, or some other device in the system writes to that reservation set. -\item {\em H} executes a branch or jump that exits the constrained LR/SC loop. -\item {\em H} traps. -\end{itemize} - -\begin{commentary} -Note that these definitions permit an implementation to fail an SC instruction -occasionally for any reason, provided the aforementioned guarantee is not -violated. 
-\end{commentary} - -\begin{commentary} -As a consequence of the eventuality guarantee, if some harts in an execution -environment are executing constrained LR/SC loops, and no other harts or -devices in the execution environment execute an unconditional store or AMO to -that reservation set, then at least one hart will eventually exit its -constrained LR/SC loop. -By contrast, if other harts or devices continue to write to that reservation -set, it is not guaranteed that any hart will exit its LR/SC loop. - -Loads and load-reserved instructions do not by themselves impede the progress -of other harts' LR/SC sequences. -We note this constraint implies, among other things, that loads and -load-reserved instructions executed by other harts (possibly within the same -core) cannot impede LR/SC progress indefinitely. -For example, cache evictions caused by another hart sharing the cache cannot -impede LR/SC progress indefinitely. -Typically, this implies reservations are tracked independently of -evictions from any shared cache. -Similarly, cache misses caused by speculative execution within a hart cannot -impede LR/SC progress indefinitely. - -These definitions admit the possibility that SC instructions may spuriously -fail for implementation reasons, provided progress is eventually made. -\end{commentary} - -\begin{commentary} -One advantage of CAS is that it guarantees that some hart eventually -makes progress, whereas an LR/SC atomic sequence could livelock -indefinitely on some systems. To avoid this concern, we added an -architectural guarantee of livelock freedom for certain LR/SC sequences. - -Earlier versions of this specification imposed a stronger starvation-freedom -guarantee. However, the weaker livelock-freedom guarantee is sufficient to -implement the C11 and C++11 languages, and is substantially easier to provide -in some microarchitectural styles. 
-\end{commentary} - -\section{Atomic Memory Operations} -\label{sec:amo} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{O@{}W@{}W@{}R@{}R@{}F@{}R@{}R} -\\ -\instbitrange{31}{27} & -\instbit{26} & -\instbit{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{aq} & -\multicolumn{1}{c|}{rl} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 1 & 1 & 5 & 5 & 3 & 5 & 7 \\ -AMOSWAP.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ -AMOADD.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ -AMOAND.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ -AMOOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ -AMOXOR.W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ -AMOMAX[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ -AMOMIN[U].W/D & \multicolumn{2}{c}{ordering} & src & addr & width & dest & AMO \\ -\end{tabular} -\end{center} - -\vspace{-0.1in} The atomic memory operation (AMO) instructions perform -read-modify-write operations for multiprocessor synchronization and -are encoded with an R-type instruction format. These AMO instructions -atomically load a data value from the address in {\em rs1}, place the -value into register {\em rd}, apply a binary operator to the loaded -value and the original value in {\em rs2}, then store the result back -to the original address in {\em rs1}. AMOs can either operate on 64-bit (RV64 -only) or 32-bit words in memory. For RV64, 32-bit AMOs always -sign-extend the value placed in {\em rd}, and ignore the upper 32 bits -of the original value of {\em rs2}. 
- -For AMOs, the A extension requires that the address held in {\em rs1} -be naturally aligned to the size of the operand (i.e., eight-byte -aligned for 64-bit words and four-byte aligned for 32-bit words). If -the address is not naturally aligned, an address-misaligned exception -or an access-fault exception will be generated. The access-fault exception can be -generated for a memory access that would otherwise be able to complete -except for the misalignment, if the misaligned access should not be -emulated. The ``Zam'' extension, described in Chapter~\ref{sec:zam}, -relaxes this requirement and specifies the semantics of misaligned -AMOs. - -The operations supported are swap, integer add, bitwise AND, bitwise -OR, bitwise XOR, and signed and unsigned integer maximum and minimum. -Without ordering constraints, these AMOs can be used to implement -parallel reduction operations, where typically the return value would -be discarded by writing to {\tt x0}. - -\begin{commentary} -We provided fetch-and-op style atomic primitives as they scale to -highly parallel systems better than LR/SC or CAS. -A simple microarchitecture can implement AMOs using the LR/SC primitives, -provided the implementation can guarantee the AMO eventually completes. -More complex implementations might also implement AMOs at memory -controllers, and can optimize away fetching the original value when -the destination is {\tt x0}. - -The set of AMOs was chosen to support the C11/C++11 atomic memory -operations efficiently, and also to support parallel reductions in -memory. Another use of AMOs is to provide atomic updates to -memory-mapped device registers (e.g., setting, clearing, or toggling -bits) in the I/O space. -\end{commentary} - -To help implement multiprocessor synchronization, the AMOs optionally -provide release consistency semantics. If the {\em aq} bit is set, -then no later memory operations in this RISC-V hart can be observed -to take place before the AMO. 
-Conversely, if the {\em rl} bit is set, then other -RISC-V harts will not observe the AMO before memory accesses -preceding the AMO in this RISC-V hart. Setting both the {\em aq} and the {\em -rl} bit on an AMO makes the sequence sequentially consistent, meaning that -it cannot be reordered with earlier or later memory operations from the same -hart. - -\begin{commentary} -The AMOs were designed to implement the C11 and C++11 memory models -efficiently. Although the FENCE R, RW instruction suffices to -implement the {\em acquire} operation and FENCE RW, W suffices to -implement {\em release}, both imply additional unnecessary ordering as -compared to AMOs with the corresponding {\em aq} or {\em rl} bit set. -\end{commentary} - -An example code sequence for a critical section guarded by a -test-and-test-and-set spinlock is shown in Figure~\ref{critical}. Note the -first AMO is marked {\em aq} to order the lock acquisition before the -critical section, and the second AMO is marked {\em rl} to order -the critical section before the lock relinquishment. - -\begin{figure}[h!] -\begin{center} -\begin{verbatim} - li t0, 1 # Initialize swap value. - again: - lw t1, (a0) # Check if lock is held. - bnez t1, again # Retry if held. - amoswap.w.aq t1, t0, (a0) # Attempt to acquire lock. - bnez t1, again # Retry if held. - # ... - # Critical section. - # ... - amoswap.w.rl x0, x0, (a0) # Release lock by storing 0. -\end{verbatim} -\end{center} -\caption{Sample code for mutual exclusion. {\tt a0} contains the address of the lock.} -\label{critical} -\end{figure} - -\begin{commentary} -We recommend the use of the AMO Swap idiom shown above for both lock -acquire and release to simplify the implementation of speculative lock -elision~\cite{Rajwar:2001:SLE}. -\end{commentary} - -The instructions in the ``A'' extension can also be used to provide -sequentially consistent loads and stores. 
A sequentially consistent load can -be implemented as an LR with both {\em aq} and {\em rl} set. A sequentially -consistent store can be implemented as an AMOSWAP that writes the old value to -x0 and has both {\em aq} and {\em rl} set. diff --git a/src/b.tex b/src/b.tex deleted file mode 100644 index 0c4e497..0000000 --- a/src/b.tex +++ /dev/null @@ -1,19 +0,0 @@ -\chapter{``B'' Standard Extension for Bit Manipulation, Version 0.0} -\label{sec:bits} - -This chapter is a placeholder for a future standard extension to -provide bit manipulation instructions, including instructions to -insert, extract, and test bit fields, and for rotations, funnel -shifts, and bit and byte permutations. - -\begin{commentary} -Although bit manipulation instructions are very effective in some -application domains, particularly when dealing with externally packed -data structures, we excluded them from the base ISAs as they are not -useful in all domains and can add additional complexity or instruction -formats to supply all needed operands. - -We anticipate the B extension will be a brownfield encoding within the -base 30-bit instruction space. -\end{commentary} - diff --git a/src/c.tex b/src/c.tex deleted file mode 100644 index ea63d22..0000000 --- a/src/c.tex +++ /dev/null @@ -1,1268 +0,0 @@ -\chapter{``C'' Standard Extension for Compressed Instructions, Version -2.0} -\label{compressed} - -This chapter describes the RISC-V -standard compressed instruction-set extension, named ``C'', which -reduces static and dynamic code size by adding short 16-bit -instruction encodings for common operations. The C extension can be -added to any of the base ISAs (RV32, RV64, RV128), and we use the -generic term ``RVC'' to cover any of these. Typically, 50\%--60\% of -the RISC-V instructions in a program can be replaced with RVC -instructions, resulting in a 25\%--30\% code-size reduction. 
- -\section{Overview} - -RVC uses a simple compression scheme that offers shorter 16-bit -versions of common 32-bit RISC-V instructions when: -\begin{tightlist} - \item the immediate or address offset is small, or - \item one of the registers is the zero register ({\tt x0}), the - ABI link register ({\tt x1}), or the ABI stack pointer ({\tt - x2}), or - \item the destination register and the first source register are - identical, or - \item the registers used are the 8 most popular ones. -\end{tightlist} - -The C extension is compatible with all other standard instruction -extensions. The C extension allows 16-bit instructions to be freely -intermixed with 32-bit instructions, with the latter now able to start -on any 16-bit boundary, i.e., IALIGN=16. With the addition of the C -extension, no instructions can raise instruction-address-misaligned -exceptions. - -\begin{commentary} -Removing the 32-bit alignment constraint on the original 32-bit -instructions allows significantly greater code density. -\end{commentary} - -The compressed instruction encodings are mostly common across RV32C, -RV64C, and RV128C, but as shown in Table~\ref{rvcopcodemap}, a few -opcodes are used for different purposes depending on base ISA. -For example, the wider address-space RV64C and RV128C variants require -additional opcodes to compress loads and stores of 64-bit integer -values, while RV32C uses the same opcodes to compress loads and stores -of single-precision floating-point values. Similarly, RV128C requires -additional opcodes to capture loads and stores of 128-bit integer -values, while these same opcodes are used for loads and stores of -double-precision floating-point values in RV32C and RV64C. If the C -extension is implemented, the appropriate compressed floating-point -load and store instructions must be provided whenever the relevant -standard floating-point extension (F and/or D) is also implemented. 
-In addition, RV32C includes a compressed jump and link instruction to -compress short-range subroutine calls, where the same opcode is used -to compress ADDIW for RV64C and RV128C. - -\begin{commentary} -Double-precision loads and stores are a significant fraction of static -and dynamic instructions, hence the motivation to include them in the -RV32C and RV64C encoding. - -Although single-precision loads and stores are not a significant -source of static or dynamic compression for benchmarks compiled for -the currently supported ABIs, for microcontrollers that only provide -hardware single-precision floating-point units and have an ABI that -only supports single-precision floating-point numbers, the -single-precision loads and stores will be used at least as frequently -as double-precision loads and stores in the measured benchmarks. -Hence, the motivation to provide compressed support for these in -RV32C. - -Short-range subroutine calls are more likely in small binaries for -microcontrollers, hence the motivation to include these in RV32C. - -Although reusing opcodes for different purposes for different base -ISAs adds some complexity to documentation, the impact on -implementation complexity is small even for designs that support -multiple base ISAs. The compressed floating-point load -and store variants use the same instruction format with the same -register specifiers as the wider integer loads and stores. -\end{commentary} - -RVC was designed under the constraint that each RVC instruction -expands into a single 32-bit instruction in either the base ISA -(RV32I/E, RV64I, or RV128I) or the F and D standard extensions where -present. Adopting this constraint has two main benefits: - -\begin{tightlist} -\item Hardware designs can simply expand RVC instructions during - decode, simplifying verification and minimizing modifications to - existing microarchitectures. 
-\item Compilers can be unaware of the RVC extension and leave code - compression to the assembler and linker, although a - compression-aware compiler will generally be able to produce better - results. -\end{tightlist} - -\begin{commentary} -We felt the multiple complexity reductions of a simple one-one mapping -between C and base IFD instructions far outweighed the potential gains -of a slightly denser encoding that added additional instructions only -supported in the C extension, or that allowed encoding of multiple IFD -instructions in one C instruction. -\end{commentary} - -It is important to note that the C extension is not designed to be a -stand-alone ISA, and is meant to be used alongside a base ISA. - -\begin{commentary} -Variable-length instruction sets have long been used to improve code -density. For example, the IBM Stretch~\cite{stretch}, developed in -the late 1950s, had an ISA with 32-bit and 64-bit instructions, where -some of the 32-bit instructions were compressed versions of the full -64-bit instructions. Stretch also employed the concept of limiting -the set of registers that were addressable in some of the shorter -instruction formats, with short branch instructions that could only -refer to one of the index registers. The later IBM 360 -architecture~\cite{ibm360} supported a simple variable-length -instruction encoding with 16-bit, 32-bit, or 48-bit instruction -formats. - -In 1963, CDC introduced the Cray-designed CDC 6600~\cite{cdc6600}, a -precursor to RISC architectures, that introduced a register-rich -load-store architecture with instructions of two lengths, 15-bits and -30-bits. The later Cray-1 design used a very similar instruction -format, with 16-bit and 32-bit instruction lengths. - -The initial RISC ISAs from the 1980s all picked performance over code -size, which was reasonable for a workstation environment, but not for -embedded systems. 
Hence, both ARM and MIPS subsequently made versions -of the ISAs that offered smaller code size by offering an alternative -16-bit wide instruction set instead of the standard 32-bit wide -instructions. The compressed RISC ISAs reduced code size relative to -their starting points by about 25--30\%, yielding code that was -significantly \emph{smaller} than 80x86. This result surprised some, -as their intuition was that the variable-length CISC ISA should be -smaller than RISC ISAs that offered only 16-bit and 32-bit formats. - -Since the original RISC ISAs did not leave sufficient opcode space -free to include these unplanned compressed instructions, they were -instead developed as complete new ISAs. This meant compilers needed -different code generators for the separate compressed ISAs. The first -compressed RISC ISA extensions (e.g., ARM Thumb and MIPS16) used only -a fixed 16-bit instruction size, which gave good reductions in static -code size but caused an increase in dynamic instruction count, which -led to lower performance compared to the original fixed-width 32-bit -instruction size. This led to the development of a second generation -of compressed RISC ISA designs with mixed 16-bit and 32-bit -instruction lengths (e.g., ARM Thumb2, microMIPS, PowerPC VLE), so -that performance was similar to pure 32-bit instructions but with -significant code size savings. Unfortunately, these different -generations of compressed ISAs are incompatible with each other and -with the original uncompressed ISA, leading to significant complexity -in documentation, implementations, and software tools support. - -Of the commonly used 64-bit ISAs, only PowerPC and microMIPS currently -support a compressed instruction format. It is surprising that the -most popular 64-bit ISA for mobile platforms (ARM v8) does not include -a compressed instruction format given that static code size and -dynamic instruction fetch bandwidth are important metrics. 
Although -static code size is not a major concern in larger systems, instruction -fetch bandwidth can be a major bottleneck in servers running -commercial workloads, which often have a large instruction working -set. - -Benefiting from 25 years of hindsight, RISC-V was designed to support -compressed instructions from the outset, leaving enough opcode -space for RVC to be added as a simple extension on top of the base ISA -(along with many other extensions). The philosophy of RVC is to -reduce code size for embedded applications \emph{and} to improve -performance and energy-efficiency for all applications due to fewer -misses in the instruction cache. Waterman shows that RVC fetches -25\%-30\% fewer instruction bits, which reduces instruction cache -misses by 20\%-25\%, or roughly the same performance impact as -doubling the instruction cache size~\cite{waterman-ms}. -\end{commentary} - - -\section{Compressed Instruction Formats} - -Table~\ref{rvc-formats} shows the nine compressed instruction -formats. CR, CI, and CSS can use any of the 32 RVI registers, but CIW, -CL, CS, CA, and CB are limited to just 8 of them. Table~\ref{registers} -lists these popular registers, which correspond to registers {\tt x8} -to {\tt x15}. Note that there is a -separate version of load and store instructions that use the stack -pointer as the base address register, since saving to and restoring -from the stack are so prevalent, and that they use the CI and CSS -formats to allow access to all 32 data registers. CIW supplies an -8-bit immediate for the ADDI4SPN instruction. - -\begin{commentary} -The RISC-V ABI was changed to make the frequently used registers map -to registers {\tt x8}--{\tt x15}. This simplifies the decompression -decoder by having a contiguous naturally aligned set of register -numbers, and is also compatible with the RV32E base ISA, -which only has 16 integer registers. 
-\end{commentary} - -Compressed register-based floating-point loads and stores also use the -CL and CS formats respectively, with the eight registers mapping to -{\tt f8} to {\tt f15}. - -\begin{commentary} -The standard RISC-V calling convention maps the most frequently used -floating-point registers to registers {\tt f8} to {\tt f15}, which -allows the same register decompression decoding as for integer -register numbers. -\end{commentary} - -The formats were designed to keep bits for the two register source -specifiers in the same place in all instructions, while the -destination register field can move. When the full 5-bit destination -register specifier is present, it is in the same place as in the -32-bit RISC-V encoding. Where immediates are -sign-extended, the sign-extension is always from bit 12. Immediate -fields have been scrambled, as in the base specification, to reduce -the number of immediate muxes required. - -\begin{commentary} -The immediate fields are scrambled in the instruction formats instead -of in sequential order so that as many bits as possible are in the -same position in every instruction, thereby simplifying -implementations. -\end{commentary} - -For many RVC instructions, zero-valued immediates are disallowed and -{\tt x0} is not a valid 5-bit register specifier. These restrictions -free up encoding space for other instructions requiring fewer operand -bits. 
- -\newcommand{\rdprime}{rd\,$'$} -\newcommand{\rsoneprime}{rs1\,$'$} -\newcommand{\rstwoprime}{rs2\,$'$} - -\begin{table}[h] -{ -\begin{small} -\begin{center} -\begin{tabular}{c c p{0in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}p{0.05in}} -& & & & & & & & & \\ -Format & Meaning & -\instbit{15} & -\instbit{14} & -\instbit{13} & -\multicolumn{1}{c}{\instbit{12}} & -\instbit{11} & -\instbit{10} & -\instbit{9} & -\instbit{8} & -\instbit{7} & -\instbit{6} & -\multicolumn{1}{r}{\instbit{5}} & -\instbit{4} & -\instbit{3} & -\instbit{2} & -\instbit{1} & -\instbit{0} \\ -\cline{3-18} - -CR & Register & -\multicolumn{4}{|c|}{funct4} & -\multicolumn{5}{c|}{rd/rs1} & -\multicolumn{5}{c|}{rs2} & -\multicolumn{2}{c|}{op} \\ -\cline{3-18} - -CI & Immediate & -\multicolumn{3}{|c|}{funct3} & -\multicolumn{1}{c|}{imm} & -\multicolumn{5}{c|}{rd/rs1} & -\multicolumn{5}{c|}{imm} & -\multicolumn{2}{c|}{op} \\ -\cline{3-18} - -CSS & Stack-relative Store & -\multicolumn{3}{|c|}{funct3} & -\multicolumn{6}{c|}{imm} & -\multicolumn{5}{c|}{rs2} & -\multicolumn{2}{c|}{op} \\ -\cline{3-18} - -CIW & Wide Immediate & -\multicolumn{3}{|c|}{funct3} & -\multicolumn{8}{c|}{imm} & -\multicolumn{3}{c|}{\rdprime} & -\multicolumn{2}{c|}{op} \\ -\cline{3-18} - -CL & Load & -\multicolumn{3}{|c|}{funct3} & -\multicolumn{3}{c|}{imm} & -\multicolumn{3}{c|}{\rsoneprime} & -\multicolumn{2}{c|}{imm} & -\multicolumn{3}{c|}{\rdprime} & -\multicolumn{2}{c|}{op} \\ -\cline{3-18} - -CS & Store & -\multicolumn{3}{|c|}{funct3} & -\multicolumn{3}{c|}{imm} & -\multicolumn{3}{c|}{\rsoneprime} & -\multicolumn{2}{c|}{imm} & -\multicolumn{3}{c|}{\rstwoprime} & -\multicolumn{2}{c|}{op} \\ -\cline{3-18} - -CA & Arithmetic & -\multicolumn{6}{|c|}{funct6} & -\multicolumn{3}{c|}{\rdprime/\rsoneprime} & -\multicolumn{2}{c|}{funct2} & -\multicolumn{3}{c|}{\rstwoprime} & -\multicolumn{2}{c|}{op} \\ -\cline{3-18} - -CB & Branch/Arithmetic & 
-\multicolumn{3}{|c|}{funct3} & -\multicolumn{3}{c|}{offset} & -\multicolumn{3}{c|}{\rdprime/\rsoneprime} & -\multicolumn{5}{c|}{offset} & -\multicolumn{2}{c|}{op} \\ -\cline{3-18} - -CJ & Jump & -\multicolumn{3}{|c|}{funct3} & -\multicolumn{11}{c|}{jump target} & -\multicolumn{2}{c|}{op} \\ -\cline{3-18} - -\end{tabular} -\end{center} -\end{small} -} -\caption{Compressed 16-bit RVC instruction formats.} -\label{rvc-formats} -\end{table} - - -\begin{table}[H] -{ -\begin{center} -\begin{tabular}{l|c|c|c|c|c|c|c|c|} -\cline{2-9} -RVC Register Number & 000 & 001 & 010 & 011 & 100 & 101 & 110 & 111 -\\ \cline{2-9} -Integer Register Number & {\tt x8} & {\tt x9} & {\tt x10} & {\tt x11} & {\tt x12} & {\tt x13} & {\tt x14} & {\tt x15} \\ \cline{2-9} -Integer Register ABI Name & {\tt s0} & {\tt s1} & {\tt a0} & {\tt a1} & {\tt a2} & {\tt a3} & {\tt a4} & {\tt a5} \\ \cline{2-9} -Floating-Point Register Number & {\tt f8} & {\tt f9} & {\tt f10} & {\tt f11} & {\tt f12} & {\tt f13} & {\tt f14} & {\tt f15} \\ \cline{2-9} -Floating-Point Register ABI Name & {\tt fs0} & {\tt fs1} & {\tt fa0} & {\tt fa1} & {\tt fa2} & {\tt fa3} & {\tt fa4} & {\tt fa5} \\ \cline{2-9} -\end{tabular} -\end{center} -} -\caption{Registers specified by the three-bit {\em \rsoneprime}, {\em \rstwoprime}, and {\em \rdprime} fields of the CIW, CL, CS, CA, and CB formats.} -\label{registers} -\end{table} - -\section{Load and Store Instructions} - -To increase the reach of 16-bit instructions, data-transfer -instructions use zero-extended immediates that are scaled by the size -of the data in bytes: $\times$4 for words, $\times$8 for double words, -and $\times$16 for quad words. - -RVC provides two variants of loads and stores. One uses the ABI stack -pointer, {\tt x2}, as the base address and can target any data register. The -other can reference one of 8 base address registers and one of 8 data -registers. 
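
As a rough illustration of the reach that scaling buys, the sketch below computes the largest byte offset an unsigned scaled immediate can express. The helper name max_offset is hypothetical and not part of the specification; the immediate widths used in the examples (6 bits for the stack-pointer-based forms, 5 bits for the register-based forms) are read off the format tables in this chapter.

```python
def max_offset(imm_bits: int, access_bytes: int) -> int:
    """Largest byte offset reachable by a zero-extended immediate of
    imm_bits bits that is scaled by the access size in bytes.

    Hypothetical helper for illustration only.
    """
    largest_unscaled = (1 << imm_bits) - 1  # all immediate bits set
    return largest_unscaled * access_bytes


if __name__ == "__main__":
    # Stack-pointer-based forms carry a 6-bit immediate.
    for name, size in (("word", 4), ("double word", 8), ("quad word", 16)):
        print(f"SP-based {name}: 0..{max_offset(6, size)} bytes")
    # Register-based forms carry a 5-bit immediate.
    for name, size in (("word", 4), ("double word", 8), ("quad word", 16)):
        print(f"register-based {name}: 0..{max_offset(5, size)} bytes")
```

Without scaling, the same 6-bit field would reach only 63 bytes, so scaling by the access size extends the reach by a factor of 4, 8, or 16 at the cost of requiring naturally aligned offsets.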
- -\subsection*{Stack-Pointer-Based Loads and Stores} - -\begin{center} -\begin{tabular}{S@{}W@{}T@{}T@{}Y} -\\ -\instbitrange{15}{13} & -\multicolumn{1}{c}{\instbit{12}} & -\instbitrange{11}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{imm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{imm} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 1 & 5 & 5 & 2 \\ -C.LWSP & offset[5] & dest$\neq$0 & offset[4:2$\vert$7:6] & C2 \\ -C.LDSP & offset[5] & dest$\neq$0 & offset[4:3$\vert$8:6] & C2 \\ -C.LQSP & offset[5] & dest$\neq$0 & offset[4$\vert$9:6] & C2 \\ -C.FLWSP& offset[5] & dest & offset[4:2$\vert$7:6] & C2 \\ -C.FLDSP& offset[5] & dest & offset[4:3$\vert$8:6] & C2 \\ -\end{tabular} -\end{center} -These instructions use the CI format. - -C.LWSP loads a 32-bit value from memory into register {\em rd}. It computes -an effective address by adding the {\em zero}-extended offset, scaled by 4, to -the stack pointer, {\tt x2}. It expands to {\tt lw rd, offset(x2)}. -C.LWSP is only valid when $\textit{rd}{\neq}\texttt{x0}$; -the code points with $\textit{rd}{=}\texttt{x0}$ are reserved. - - -C.LDSP is an RV64C/RV128C-only instruction that loads a 64-bit value from memory into -register {\em rd}. It computes its effective address by adding the -zero-extended offset, scaled by 8, to the stack pointer, {\tt x2}. -It expands to {\tt ld rd, offset(x2)}. -C.LDSP is only valid when $\textit{rd}{\neq}\texttt{x0}$; -the code points with $\textit{rd}{=}\texttt{x0}$ are reserved. - -C.LQSP is an RV128C-only instruction that loads a 128-bit value from memory -into register {\em rd}. It computes its effective address by adding the -zero-extended offset, scaled by 16, to the stack pointer, {\tt x2}. -It expands to {\tt lq rd, offset(x2)}. -C.LQSP is only valid when $\textit{rd}{\neq}\texttt{x0}$; -the code points with $\textit{rd}{=}\texttt{x0}$ are reserved. 
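
The C.LWSP field layout in the table above can be sketched as a small decoder. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the sketch assumes the caller has already matched funct3 and op, so those fields are not checked here.

```python
def bits(inst: int, hi: int, lo: int) -> int:
    """Return inst[hi:lo] as an unsigned integer (hypothetical helper)."""
    return (inst >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_c_lwsp(inst: int) -> tuple[int, int]:
    """Decode the rd and byte-offset fields of a 16-bit C.LWSP encoding.

    Layout (CI format): inst[12] = offset[5],
    inst[6:4] = offset[4:2], inst[3:2] = offset[7:6].
    Assumes funct3/op were validated by the caller.
    """
    rd = bits(inst, 11, 7)
    if rd == 0:
        # Code points with rd = x0 are reserved for C.LWSP.
        raise ValueError("reserved encoding: rd must not be x0")
    offset = (
        (bits(inst, 12, 12) << 5)
        | (bits(inst, 6, 4) << 2)
        | (bits(inst, 3, 2) << 6)
    )
    return rd, offset  # offset is already a byte offset (multiple of 4)


if __name__ == "__main__":
    rd, offset = decode_c_lwsp(0x4522)
    print(f"lw x{rd}, {offset}(x2)")
```

Because the two low bits of the offset are never encoded, the reassembled value is always a multiple of 4, which is the scaling described in the text.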
- -C.FLWSP is an RV32FC-only instruction that loads a single-precision -floating-point value from memory into floating-point register {\em rd}. It -computes its effective address by adding the {\em zero}-extended offset, -scaled by 4, to the stack pointer, {\tt x2}. It expands to {\tt flw rd, -offset(x2)}. - -C.FLDSP is an RV32DC/RV64DC-only instruction that loads a double-precision -floating-point value from memory into floating-point register {\em rd}. It -computes its effective address by adding the {\em zero}-extended offset, -scaled by 8, to the stack pointer, {\tt x2}. It expands to {\tt fld rd, -offset(x2)}. - -\begin{center} -\begin{tabular}{S@{}M@{}T@{}Y} -\\ -\instbitrange{15}{13} & -\instbitrange{12}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{imm} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 6 & 5 & 2 \\ -C.SWSP & offset[5:2$\vert$7:6] & src & C2 \\ -C.SDSP & offset[5:3$\vert$8:6] & src & C2 \\ -C.SQSP & offset[5:4$\vert$9:6] & src & C2 \\ -C.FSWSP& offset[5:2$\vert$7:6] & src & C2 \\ -C.FSDSP& offset[5:3$\vert$8:6] & src & C2 \\ -\end{tabular} -\end{center} -These instructions use the CSS format. - -C.SWSP stores a 32-bit value in register {\em rs2} to memory. It computes -an effective address by adding the {\em zero}-extended offset, scaled by 4, to -the stack pointer, {\tt x2}. -It expands to {\tt sw rs2, offset(x2)}. - -C.SDSP is an RV64C/RV128C-only instruction that stores a 64-bit value in register -{\em rs2} to memory. It computes an effective address by adding the {\em -zero}-extended offset, scaled by 8, to the stack pointer, {\tt x2}. -It expands to {\tt sd rs2, offset(x2)}. - -C.SQSP is an RV128C-only instruction that stores a 128-bit value in register -{\em rs2} to memory. It computes an effective address by adding the {\em -zero}-extended offset, scaled by 16, to the stack pointer, {\tt x2}. -It expands to {\tt sq rs2, offset(x2)}. 
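
The three integer CSS-format offset layouts follow one pattern, which the sketch below makes explicit. The function name and the scale_log2 parameter are hypothetical; the sketch assumes the instruction has already been identified as C.SWSP (scale_log2 = 2), C.SDSP (3), or C.SQSP (4).

```python
def decode_css_offset(inst: int, scale_log2: int) -> int:
    """Reassemble the byte offset of a stack-pointer-based store.

    inst[12:7] holds offset[5:s | 5+s:6], where s = scale_log2:
      s = 2 (C.SWSP): offset[5:2 | 7:6]
      s = 3 (C.SDSP): offset[5:3 | 8:6]
      s = 4 (C.SQSP): offset[5:4 | 9:6]
    Hypothetical helper; funct3/op are assumed to be checked elsewhere.
    """
    if scale_log2 not in (2, 3, 4):
        raise ValueError("scale_log2 must be 2, 3, or 4")
    field = (inst >> 7) & 0x3F                # the 6-bit immediate field
    low = field & ((1 << scale_log2) - 1)     # offset[5+s:6]
    high = field >> scale_log2                # offset[5:s]
    return (high << scale_log2) | (low << 6)


if __name__ == "__main__":
    inst = 0xC42A
    rs2 = (inst >> 2) & 0x1F
    print(f"sw x{rs2}, {decode_css_offset(inst, 2)}(x2)")
```

The low-order bits of the field always supply the offset bits above bit 5, so the same shift-and-mask structure covers every access size.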
- -C.FSWSP is an RV32FC-only instruction that stores a single-precision -floating-point value in floating-point register {\em rs2} to memory. It -computes an effective address by adding the {\em zero}-extended offset, scaled -by 4, to the stack pointer, {\tt x2}. It expands to {\tt fsw rs2, -offset(x2)}. - -C.FSDSP is an RV32DC/RV64DC-only instruction that stores a double-precision -floating-point value in floating-point register {\em rs2} to memory. It -computes an effective address by adding the {\em zero}-extended offset, scaled -by 8, to the stack pointer, {\tt x2}. It expands to {\tt fsd rs2, -offset(x2)}. - -\begin{commentary} -Register save/restore code at function entry/exit represents a -significant portion of static code size. The stack-pointer-based -compressed loads and stores in RVC are effective at reducing the -save/restore static code size by a factor of 2 while improving -performance by reducing dynamic instruction bandwidth. - -A common mechanism used in other ISAs to further reduce -save/restore code size is load-multiple and store-multiple -instructions. We considered adopting these for RISC-V but noted the -following drawbacks to these instructions: -\begin{itemize} -\item These instructions complicate processor implementations. -\item For virtual memory systems, some data accesses could be - resident in physical memory and some could not, which requires a - new restart mechanism for partially executed instructions. -\item Unlike the rest of the RVC instructions, there is no IFD - equivalent to Load Multiple and Store Multiple. -\item Unlike the rest of the RVC instructions, the compiler would - have to be aware of these instructions to both generate the - instructions and to allocate registers in an order to maximize - the chances of them being saved and restored, since they would - be saved and restored in sequential order. 
-\item Simple microarchitectural implementations will constrain how - other instructions can be scheduled around the load and store - multiple instructions, leading to a potential performance loss. -\item The desire for sequential register allocation might conflict with - the featured registers selected for the CIW, CL, CS, CA, and CB formats. -\end{itemize} -Furthermore, much of the gains can be realized in software by replacing -prologue and epilogue code with subroutine calls to common -prologue and epilogue code, a technique described in -Section 5.6 of~\cite{waterman-phd}. - -While reasonable architects might come to different conclusions, we -decided to omit load and store multiple and instead use the -software-only approach of calling save/restore millicode routines to -attain the greatest code size reduction. -\end{commentary} - -\subsection*{Register-Based Loads and Stores} - -\begin{center} -\begin{tabular}{S@{}S@{}S@{}Y@{}S@{}Y} -\\ -\instbitrange{15}{13} & -\instbitrange{12}{10} & -\instbitrange{9}{7} & -\instbitrange{6}{5} & -\instbitrange{4}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{imm} & -\multicolumn{1}{c|}{\rsoneprime} & -\multicolumn{1}{c|}{imm} & -\multicolumn{1}{c|}{\rdprime} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 3 & 3 & 2 & 3 & 2 \\ -C.LW & offset[5:3] & base & offset[2$\vert$6] & dest & C0 \\ -C.LD & offset[5:3] & base & offset[7:6] & dest & C0 \\ -C.LQ & offset[5$\vert$4$\vert$8] & base & offset[7:6] & dest & C0 \\ -C.FLW& offset[5:3] & base & offset[2$\vert$6] & dest & C0 \\ -C.FLD& offset[5:3] & base & offset[7:6] & dest & C0 \\ -\end{tabular} -\end{center} -These instructions use the CL format. - -C.LW loads a 32-bit value from memory into register {\em \rdprime}. It computes -an effective address by adding the {\em zero}-extended offset, scaled by 4, to -the base address in register {\em \rsoneprime}. -It expands to {\tt lw \rdprime, offset(\rsoneprime)}. 
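
The CL-format fields and the three-bit register mapping can be combined into a short expansion sketch for C.LW. The function name is hypothetical, and the sketch assumes funct3 and op have already been matched; it only illustrates how the fields map onto the expanded {\tt lw}.

```python
def expand_c_lw(inst: int) -> str:
    """Expand a 16-bit C.LW encoding into its base-ISA assembly text.

    Layout (CL format): inst[12:10] = offset[5:3], inst[9:7] = rs1',
    inst[6] = offset[2], inst[5] = offset[6], inst[4:2] = rd'.
    Three-bit register fields select x8..x15.
    Hypothetical helper; funct3/op are not checked here.
    """
    rs1 = 8 + ((inst >> 7) & 0x7)   # rs1' -> x8..x15
    rd = 8 + ((inst >> 2) & 0x7)    # rd'  -> x8..x15
    offset = (
        (((inst >> 10) & 0x7) << 3)  # offset[5:3]
        | (((inst >> 6) & 0x1) << 2)  # offset[2]
        | (((inst >> 5) & 0x1) << 6)  # offset[6]
    )
    return f"lw x{rd}, {offset}(x{rs1})"


if __name__ == "__main__":
    print(expand_c_lw(0x41C8))
```

Adding 8 to each three-bit field is all the register decompression requires, which is why the popular registers were made contiguous and naturally aligned.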
- -C.LD is an RV64C/RV128C-only instruction that loads a 64-bit value from memory into -register {\em \rdprime}. It computes an effective address by adding the {\em -zero}-extended offset, scaled by 8, to the base address in register {\em -\rsoneprime}. -It expands to {\tt ld \rdprime, offset(\rsoneprime)}. - -C.LQ is an RV128C-only instruction that loads a 128-bit value from memory into -register {\em \rdprime}. It computes an effective address by adding the {\em -zero}-extended offset, scaled by 16, to the base address in register {\em -\rsoneprime}. -It expands to {\tt lq \rdprime, offset(\rsoneprime)}. - -C.FLW is an RV32FC-only instruction that loads a single-precision -floating-point value from memory into floating-point register {\em \rdprime}. It -computes an effective address by adding the {\em zero}-extended offset, scaled -by 4, to the base address in register {\em \rsoneprime}. It expands to {\tt flw -\rdprime, offset(\rsoneprime)}. - -C.FLD is an RV32DC/RV64DC-only instruction that loads a double-precision -floating-point value from memory into floating-point register {\em \rdprime}. It -computes an effective address by adding the {\em zero}-extended offset, scaled -by 8, to the base address in register {\em \rsoneprime}. It expands to {\tt fld -\rdprime, offset(\rsoneprime)}. 
- -\begin{center} -\begin{tabular}{S@{}S@{}S@{}Y@{}S@{}Y} -\\ -\instbitrange{15}{13} & -\instbitrange{12}{10} & -\instbitrange{9}{7} & -\instbitrange{6}{5} & -\instbitrange{4}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{imm} & -\multicolumn{1}{c|}{\rsoneprime} & -\multicolumn{1}{c|}{imm} & -\multicolumn{1}{c|}{\rstwoprime} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 3 & 3 & 2 & 3 & 2 \\ -C.SW & offset[5:3] & base & offset[2$\vert$6] & src & C0 \\ -C.SD & offset[5:3] & base & offset[7:6] & src & C0 \\ -C.SQ & offset[5$\vert$4$\vert$8] & base & offset[7:6] & src & C0 \\ -C.FSW& offset[5:3] & base & offset[2$\vert$6] & src & C0 \\ -C.FSD& offset[5:3] & base & offset[7:6] & src & C0 \\ -\end{tabular} -\end{center} -These instructions use the CS format. - -C.SW stores a 32-bit value in register {\em \rstwoprime} to memory. It computes an -effective address by adding the {\em zero}-extended offset, scaled by 4, to -the base address in register {\em \rsoneprime}. -It expands to {\tt sw \rstwoprime, offset(\rsoneprime)}. - -C.SD is an RV64C/RV128C-only instruction that stores a 64-bit value in -register {\em \rstwoprime} to memory. It computes an effective address by adding -the {\em zero}-extended offset, scaled by 8, to the base address in register -{\em \rsoneprime}. -It expands to {\tt sd \rstwoprime, offset(\rsoneprime)}. - -C.SQ is an RV128C-only instruction that stores a 128-bit value in register -{\em \rstwoprime} to memory. It computes an effective address by adding the {\em -zero}-extended offset, scaled by 16, to the base address in register {\em -\rsoneprime}. -It expands to {\tt sq \rstwoprime, offset(\rsoneprime)}. - -C.FSW is an RV32FC-only instruction that stores a single-precision -floating-point value in floating-point register {\em \rstwoprime} to memory. It -computes an effective address by adding the {\em zero}-extended offset, scaled -by 4, to the base address in register {\em \rsoneprime}. 
It expands to {\tt fsw -\rstwoprime, offset(\rsoneprime)}. - -C.FSD is an RV32DC/RV64DC-only instruction that stores a double-precision -floating-point value in floating-point register {\em \rstwoprime} to memory. It -computes an effective address by adding the {\em zero}-extended offset, scaled -by 8, to the base address in register {\em \rsoneprime}. It expands to {\tt fsd -\rstwoprime, offset(\rsoneprime)}. - -\section{Control Transfer Instructions} - -RVC provides unconditional jump instructions and conditional branch -instructions. As with base RVI instructions, the offsets of all RVC -control transfer instructions are in multiples of 2 bytes. - -\begin{center} -\begin{tabular}{S@{}L@{}Y} -\\ -\instbitrange{15}{13} & -\instbitrange{12}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{imm} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 11 & 2 \\ -C.J & offset[11$\vert$4$\vert$9:8$\vert$10$\vert$6$\vert$7$\vert$3:1$\vert$5] & C1 \\ -C.JAL & offset[11$\vert$4$\vert$9:8$\vert$10$\vert$6$\vert$7$\vert$3:1$\vert$5] & C1 \\ -\end{tabular} -\end{center} -These instructions use the CJ format. - -C.J performs an unconditional control transfer. The offset is sign-extended and -added to the {\tt pc} to form the jump target address. C.J can therefore target -a $\pm$\wunits{2}{KiB} range. C.J expands to {\tt jal x0, offset}. - -C.JAL is an RV32C-only instruction that performs the same operation as C.J, -but additionally writes the address of the instruction following the jump -({\tt pc}+2) to the link register, {\tt x1}. C.JAL expands to {\tt jal x1, -offset}. 
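
The scrambled CJ-format immediate can be reassembled bit by bit, as in the sketch below. The function name is hypothetical and the sketch assumes the instruction is already known to be C.J or C.JAL; it shows only how the 11-bit field becomes a signed byte offset.

```python
def decode_cj_offset(inst: int) -> int:
    """Reassemble and sign-extend the C.J / C.JAL jump offset.

    inst[12:2] = offset[11 | 4 | 9:8 | 10 | 6 | 7 | 3:1 | 5].
    Hypothetical helper; funct3/op are assumed to be checked elsewhere.
    """
    def bit(n: int) -> int:
        return (inst >> n) & 1

    offset = (
        (bit(12) << 11)
        | (bit(11) << 4)
        | (((inst >> 9) & 0x3) << 8)   # inst[10:9] -> offset[9:8]
        | (bit(8) << 10)
        | (bit(7) << 6)
        | (bit(6) << 7)
        | (((inst >> 3) & 0x7) << 1)   # inst[5:3] -> offset[3:1]
        | (bit(2) << 5)
    )
    # offset[11] is the sign bit; offset[0] is implicitly zero.
    if offset & 0x800:
        offset -= 0x1000
    return offset


if __name__ == "__main__":
    print(decode_cj_offset(0x1000), decode_cj_offset(0x0FFC))
```

The extreme values, -2048 and +2046, correspond to the $\pm$2 KiB range stated above.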
- -\begin{center} -\begin{tabular}{E@{}T@{}T@{}Y} -\\ -\instbitrange{15}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct4} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{op} \\ -\hline -4 & 5 & 5 & 2 \\ -C.JR & src$\neq$0 & 0 & C2 \\ -C.JALR & src$\neq$0 & 0 & C2 \\ -\end{tabular} -\end{center} -These instructions use the CR format. - -C.JR (jump register) performs an unconditional control transfer to -the address in register {\em rs1}. C.JR expands to {\tt jalr x0, 0(rs1)}. -C.JR is only valid when $\textit{rs1}{\neq}\texttt{x0}$; the code point -with $\textit{rs1}{=}\texttt{x0}$ is reserved. - -C.JALR (jump and link register) performs the same operation as C.JR, -but additionally writes the address of the instruction following the -jump ({\tt pc}+2) to the link register, {\tt x1}. C.JALR expands to -{\tt jalr x1, 0(rs1)}. -C.JALR is only valid when $\textit{rs1}{\neq}\texttt{x0}$; the code point -with $\textit{rs1}{=}\texttt{x0}$ corresponds to the C.EBREAK instruction. - -\begin{commentary} -Strictly speaking, C.JALR does not expand exactly to a base RVI -instruction as the value added to the {\tt pc} to form the link address is 2 -rather than 4 as in the base ISA, but supporting both offsets of 2 and -4 bytes is only a very minor change to the base microarchitecture. -\end{commentary} - -\begin{center} -\begin{tabular}{S@{}S@{}S@{}T@{}Y} -\\ -\instbitrange{15}{13} & -\instbitrange{12}{10} & -\instbitrange{9}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{imm} & -\multicolumn{1}{c|}{\rsoneprime} & -\multicolumn{1}{c|}{imm} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 3 & 3 & 5 & 2 \\ -C.BEQZ & offset[8$\vert$4:3] & src & offset[7:6$\vert$2:1$\vert$5] & C1 \\ -C.BNEZ & offset[8$\vert$4:3] & src & offset[7:6$\vert$2:1$\vert$5] & C1 \\ -\end{tabular} -\end{center} -These instructions use the CB format. 
- -C.BEQZ performs conditional control transfers. The offset is sign-extended -and added to the {\tt pc} to form the branch target address. It can -therefore target a $\pm$\wunits{256}{B} range. C.BEQZ takes the branch if the -value in register {\em \rsoneprime} is zero. It expands to {\tt beq \rsoneprime, x0, -offset}. - -C.BNEZ is defined analogously, but it takes the branch if {\em \rsoneprime} contains -a nonzero value. It expands to {\tt bne \rsoneprime, x0, offset}. - -\section{Integer Computational Instructions} - -RVC provides several instructions for integer arithmetic and constant generation. - -\subsection*{Integer Constant-Generation Instructions} - -The two constant-generation instructions both use the CI instruction -format and can target any integer register. - -\vspace{-0.4in} -\begin{center} -\begin{tabular}{S@{}W@{}T@{}T@{}Y} -\\ -\instbitrange{15}{13} & -\multicolumn{1}{c}{\instbit{12}} & -\instbitrange{11}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{imm[5]} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{imm[4:0]} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 1 & 5 & 5 & 2 \\ -C.LI & imm[5] & dest$\neq$0 & imm[4:0] & C1 \\ -C.LUI & nzimm[17] & $\textrm{dest}{\neq}{\left\{0,2\right\}}$ & nzimm[16:12] & C1 \\ -\end{tabular} -\end{center} -C.LI loads the sign-extended 6-bit immediate, {\em imm}, into -register {\em rd}. -C.LI expands into {\tt addi rd, x0, imm}. -C.LI is only valid when {\em rd}$\neq${\tt x0}; -the code points with {\em rd}={\tt x0} encode HINTs. - -C.LUI loads the non-zero 6-bit immediate field into bits 17--12 of the -destination register, clears the bottom 12 bits, and sign-extends bit -17 into all higher bits of the destination. -C.LUI expands into {\tt lui rd, nzimm}. -C.LUI is only valid when -$\textit{rd}{\neq}{\left\{\texttt{x0},\texttt{x2}\right\}}$, -and when the immediate is not equal to zero. 
-The code points with {\em nzimm}=0 are reserved; the remaining code points -with {\em rd}={\tt x0} are HINTs; and the remaining code points with -{\em rd}={\tt x2} correspond to the C.ADDI16SP instruction. - -\subsection*{Integer Register-Immediate Operations} - -These integer register-immediate operations are encoded in the CI -format and perform operations on an integer register and -a 6-bit immediate. - -\vspace{-0.4in} -\begin{center} -\begin{tabular}{S@{}W@{}T@{}T@{}Y} -\\ -\instbitrange{15}{13} & -\multicolumn{1}{c}{\instbit{12}} & -\instbitrange{11}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{imm[5]} & -\multicolumn{1}{c|}{rd/rs1} & -\multicolumn{1}{c|}{imm[4:0]} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 1 & 5 & 5 & 2 \\ -C.ADDI & nzimm[5] & dest$\neq$0 & nzimm[4:0] & C1 \\ -C.ADDIW & imm[5] & dest$\neq$0 & imm[4:0] & C1 \\ -C.ADDI16SP & nzimm[9] & 2 & nzimm[4$\vert$6$\vert$8:7$\vert$5] & C1 \\ -\end{tabular} -\end{center} - -C.ADDI adds the non-zero sign-extended 6-bit immediate to the value in -register {\em rd} then writes the result to {\em rd}. C.ADDI expands -into {\tt addi rd, rd, nzimm}. -C.ADDI is only valid when {\em rd}$\neq${\tt x0} and {\em nzimm}$\neq$0. -The code points with {\em rd}={\tt x0} encode the C.NOP instruction; -the remaining code points with {\em nzimm}=0 encode HINTs. - -C.ADDIW is an RV64C/RV128C-only instruction that performs the same -computation but produces a 32-bit result, then sign-extends the result to -64 bits. C.ADDIW expands into {\tt addiw rd, rd, imm}. The -immediate can be zero for C.ADDIW, where this corresponds to {\tt -sext.w rd}. C.ADDIW is only valid when {\em rd}$\neq${\tt x0}; -the code points with {\em rd}={\tt x0} are reserved. - -C.ADDI16SP shares the opcode with C.LUI, but has a destination field -of {\tt x2}. 
C.ADDI16SP adds the non-zero sign-extended 6-bit immediate to -the value in the stack pointer ({\tt sp}={\tt x2}), where the -immediate is scaled to represent multiples of 16 in the range -(-512,496). C.ADDI16SP is used to adjust the stack pointer in procedure -prologues and epilogues. It expands into {\tt addi x2, x2, nzimm}. -C.ADDI16SP is only valid when {\em nzimm}$\neq$0; -the code point with {\em nzimm}=0 is reserved. - -\begin{commentary} -In the standard RISC-V calling convention, the stack pointer {\tt sp} -is always 16-byte aligned. -\end{commentary} - -\begin{center} -\begin{tabular}{@{}S@{}K@{}S@{}Y} -\\ -\instbitrange{15}{13} & -\instbitrange{12}{5} & -\instbitrange{4}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{imm} & -\multicolumn{1}{c|}{\rdprime} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 8 & 3 & 2 \\ -C.ADDI4SPN & nzuimm[5:4$\vert$9:6$\vert$2$\vert$3] & dest & C0 \\ -\end{tabular} -\end{center} - -C.ADDI4SPN is a CIW-format instruction that adds a {\em zero}-extended -non-zero immediate, scaled by 4, to the stack pointer, {\tt x2}, and -writes the result to {\tt \rdprime}. This instruction is used -to generate pointers to stack-allocated variables, and expands to -{\tt addi \rdprime, x2, nzuimm}. -C.ADDI4SPN is only valid when {\em nzuimm}$\neq$0; -the code points with {\em nzuimm}=0 are reserved. 
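
The CIW-format immediate of C.ADDI4SPN can be decoded as sketched below. The function name is hypothetical, and the sketch assumes funct3 and op have already been matched; it illustrates the field permutation and the reserved zero-immediate case.

```python
def expand_c_addi4spn(inst: int) -> str:
    """Expand a 16-bit C.ADDI4SPN encoding into base-ISA assembly text.

    inst[12:5] = nzuimm[5:4 | 9:6 | 2 | 3], inst[4:2] = rd' (x8..x15).
    Hypothetical helper; funct3/op are not checked here.
    """
    nzuimm = (
        (((inst >> 11) & 0x3) << 4)   # inst[12:11] -> nzuimm[5:4]
        | (((inst >> 7) & 0xF) << 6)  # inst[10:7]  -> nzuimm[9:6]
        | (((inst >> 6) & 0x1) << 2)  # inst[6]     -> nzuimm[2]
        | (((inst >> 5) & 0x1) << 3)  # inst[5]     -> nzuimm[3]
    )
    if nzuimm == 0:
        # Code points with nzuimm = 0 are reserved.
        raise ValueError("reserved encoding: nzuimm must be non-zero")
    rd = 8 + ((inst >> 2) & 0x7)
    return f"addi x{rd}, x2, {nzuimm}"


if __name__ == "__main__":
    print(expand_c_addi4spn(0x0808))
```

The reassembled immediate is always a multiple of 4 and at most 1020, so a single 16-bit instruction can form a pointer to any word-aligned slot in the first kibibyte above the stack pointer.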
- -\vspace{-0.4in} -\begin{center} -\begin{tabular}{S@{}W@{}T@{}T@{}Y} -\\ -\instbitrange{15}{13} & -\multicolumn{1}{c}{\instbit{12}} & -\instbitrange{11}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{shamt[5]} & -\multicolumn{1}{c|}{rd/rs1} & -\multicolumn{1}{c|}{shamt[4:0]} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 1 & 5 & 5 & 2 \\ -C.SLLI & shamt[5] & dest$\neq$0 & shamt[4:0] & C2 \\ -\end{tabular} -\end{center} - -C.SLLI is a CI-format instruction that performs a logical left shift -of the value in register {\em rd} then writes the result to {\em rd}. -The shift amount is encoded in the {\em shamt} field. -For RV128C, a shift amount of zero is used to encode a shift of 64. -C.SLLI expands into {\tt slli rd, rd, shamt}, except for -RV128C with {\tt shamt=0}, which expands to {\tt slli rd, rd, 64}. - -For RV32C, {\em shamt[5]} must be zero; the code points with {\em shamt[5]}=1 -are designated for custom extensions. For RV32C and RV64C, the shift -amount must be non-zero; the code points with {\em shamt}=0 are HINTs. For -all base ISAs, the code points with {\em rd}={\tt x0} are HINTs, except those -with {\em shamt[5]}=1 in RV32C. - -\vspace{-0.4in} -\begin{center} -\begin{tabular}{S@{}W@{}Y@{}S@{}T@{}Y} -\\ -\instbitrange{15}{13} & -\multicolumn{1}{c}{\instbit{12}} & -\instbitrange{11}{10} & -\instbitrange{9}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{shamt[5]} & -\multicolumn{1}{|c|}{funct2} & -\multicolumn{1}{c|}{\rdprime/\rsoneprime} & -\multicolumn{1}{c|}{shamt[4:0]} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 1 & 2 & 3 & 5 & 2 \\ -C.SRLI & shamt[5] & C.SRLI & dest & shamt[4:0] & C1 \\ -C.SRAI & shamt[5] & C.SRAI & dest & shamt[4:0] & C1 \\ -\end{tabular} -\end{center} - -C.SRLI is a CB-format instruction that performs a logical right shift -of the value in register {\em \rdprime} then writes the result to {\em \rdprime}. 
-The shift amount is encoded in the {\em shamt} field. -For RV128C, a shift amount of zero is used to encode a shift of 64. -Furthermore, the shift amount is sign-extended -for RV128C, and so the legal shift amounts are 1--31, 64, and 96--127. -C.SRLI expands into {\tt srli \rdprime, \rdprime, shamt}, -except for RV128C with {\tt shamt=0}, which expands to -{\tt srli \rdprime, \rdprime, 64}. - -For RV32C, {\em shamt[5]} must be zero; the code points with {\em shamt[5]}=1 -are designated for custom extensions. For RV32C and RV64C, the shift -amount must be non-zero; the code points with {\em shamt}=0 are HINTs. - -C.SRAI is defined analogously to C.SRLI, but instead performs an arithmetic -right shift. -C.SRAI expands to {\tt srai \rdprime, \rdprime, shamt}. - -\begin{commentary} -Left shifts are usually more frequent than right shifts, as left -shifts are frequently used to scale address values. Right shifts have -therefore been granted less encoding space and are placed in an -encoding quadrant where all other immediates are sign-extended. For -RV128, the decision was made to have the 6-bit shift-amount immediate -also be sign-extended. Apart from reducing the decode complexity, we -believe right-shift amounts of 96--127 will be more useful than 64--95, -to allow extraction of tags located in the high portions of 128-bit -address pointers. We note that RV128C will not be frozen at the same -point as RV32C and RV64C, to allow evaluation of typical usage of -128-bit address-space codes. 
-\end{commentary} - -\begin{center} -\begin{tabular}{S@{}W@{}Y@{}S@{}T@{}Y} -\\ -\instbitrange{15}{13} & -\multicolumn{1}{c}{\instbit{12}} & -\instbitrange{11}{10} & -\instbitrange{9}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{imm[5]} & -\multicolumn{1}{|c|}{funct2} & -\multicolumn{1}{c|}{\rdprime/\rsoneprime} & -\multicolumn{1}{c|}{imm[4:0]} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 1 & 2 & 3 & 5 & 2 \\ -C.ANDI & imm[5] & C.ANDI & dest & imm[4:0] & C1 \\ -\end{tabular} -\end{center} - -C.ANDI is a CB-format instruction that computes the bitwise AND of -the value in register {\em \rdprime} and the sign-extended 6-bit immediate, -then writes the result to {\em \rdprime}. -C.ANDI expands to {\tt andi \rdprime, \rdprime, imm}. - -\subsection*{Integer Register-Register Operations} -\vspace{-0.4in} -\begin{center} -\begin{tabular}{E@{}T@{}T@{}Y} -\\ -\instbitrange{15}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct4} & -\multicolumn{1}{c|}{rd/rs1} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{op} \\ -\hline -4 & 5 & 5 & 2 \\ -C.MV & dest$\neq$0 & src$\neq$0 & C2 \\ -C.ADD & dest$\neq$0 & src$\neq$0 & C2 \\ -\end{tabular} -\end{center} -These instructions use the CR format. - -C.MV copies the value in register {\em rs2} into register {\em rd}. C.MV -expands into {\tt add rd, x0, rs2}. -C.MV is only valid when $\textit{rs2}{\neq}\texttt{x0}$; the code points -with $\textit{rs2}{=}\texttt{x0}$ correspond to the C.JR instruction. -The code points with $\textit{rs2}{\neq}\texttt{x0}$ and -$\textit{rd}{=}\texttt{x0}$ are HINTs. - -\begin{commentary} -C.MV expands to a different instruction than the canonical MV -pseudoinstruction, which instead uses ADDI. Implementations that handle MV -specially, e.g. using register-renaming hardware, may find it more convenient -to expand C.MV to MV instead of ADD, at slight additional hardware cost. 
-\end{commentary} - -C.ADD adds the values in registers {\em rd} and {\em rs2} and writes the -result to register {\em rd}. C.ADD expands into {\tt add rd, rd, rs2}. -C.ADD is only valid when $\textit{rs2}{\neq}\texttt{x0}$; the code points -with $\textit{rs2}{=}\texttt{x0}$ correspond to the C.JALR and C.EBREAK instructions. -The code points with $\textit{rs2}{\neq}\texttt{x0}$ and -$\textit{rd}{=}\texttt{x0}$ are HINTs. - -\vspace{-0.4in} -\begin{center} -\begin{tabular}{M@{}S@{}Y@{}S@{}Y} -\\ -\instbitrange{15}{10} & -\instbitrange{9}{7} & -\instbitrange{6}{5} & -\instbitrange{4}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct6} & -\multicolumn{1}{c|}{\rdprime/\rsoneprime} & -\multicolumn{1}{c|}{funct2} & -\multicolumn{1}{c|}{\rstwoprime} & -\multicolumn{1}{c|}{op} \\ -\hline -6 & 3 & 2 & 3 & 2 \\ -C.AND & dest & C.AND & src & C1 \\ -C.OR & dest & C.OR & src & C1 \\ -C.XOR & dest & C.XOR & src & C1 \\ -C.SUB & dest & C.SUB & src & C1 \\ -C.ADDW & dest & C.ADDW & src & C1 \\ -C.SUBW & dest & C.SUBW & src & C1 \\ -\end{tabular} -\end{center} - -These instructions use the CA format. - -C.AND computes the bitwise AND of the values in registers {\em \rdprime} -and {\em \rstwoprime}, then writes the result to register {\em \rdprime}. -C.AND expands into {\tt and \rdprime, \rdprime, \rstwoprime}. - -C.OR computes the bitwise OR of the values in registers {\em \rdprime} -and {\em \rstwoprime}, then writes the result to register {\em \rdprime}. -C.OR expands into {\tt or \rdprime, \rdprime, \rstwoprime}. - -C.XOR computes the bitwise XOR of the values in registers {\em \rdprime} -and {\em \rstwoprime}, then writes the result to register {\em \rdprime}. -C.XOR expands into {\tt xor \rdprime, \rdprime, \rstwoprime}. - -C.SUB subtracts the value in register {\em \rstwoprime} from the value in -register {\em \rdprime}, then writes the result to register {\em \rdprime}. -C.SUB expands into {\tt sub \rdprime, \rdprime, \rstwoprime}. 
- -C.ADDW is an RV64C/RV128C-only instruction that adds the values in -registers {\em \rdprime} and {\em \rstwoprime}, then sign-extends the lower -32 bits of the sum before writing the result to register {\em \rdprime}. -C.ADDW expands into {\tt addw \rdprime, \rdprime, \rstwoprime}. - -C.SUBW is an RV64C/RV128C-only instruction that subtracts the value in -register {\em \rstwoprime} from the value in register {\em \rdprime}, then -sign-extends the lower 32 bits of the difference before writing the result -to register {\em \rdprime}. C.SUBW expands into {\tt subw \rdprime, \rdprime, \rstwoprime}. - -\begin{commentary} -This group of six instructions do not provide large savings -individually, but do not occupy much encoding space and are -straightforward to implement, and as a group provide a worthwhile -improvement in static and dynamic compression. -\end{commentary} - -\subsection*{Defined Illegal Instruction} -\vspace{-0.4in} -\begin{center} -\begin{tabular}{SW@{}T@{}T@{}Y} -\\ -\instbitrange{15}{13} & -\multicolumn{1}{c}{\instbit{12}} & -\instbitrange{11}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{0} & -\multicolumn{1}{c|}{0} & -\multicolumn{1}{c|}{0} & -\multicolumn{1}{c|}{0} & -\multicolumn{1}{c|}{0} \\ -\hline -3 & 1 & 5 & 5 & 2 \\ -0 & 0 & 0 & 0 & 0 \\ -\end{tabular} -\end{center} - -A 16-bit instruction with all bits zero is permanently reserved as an -illegal instruction. -\begin{commentary} -We reserve all-zero instructions to be illegal instructions to help -trap attempts to execute zero-ed or non-existent portions of the -memory space. The all-zero value should not be redefined in any -non-standard extension. Similarly, we reserve instructions with all -bits set to 1 (corresponding to very long instructions in the RISC-V -variable-length encoding scheme) as illegal to capture another common -value seen in non-existent memory regions. 
-\end{commentary} - -\subsection*{NOP Instruction} -\vspace{-0.4in} -\begin{center} -\begin{tabular}{SW@{}T@{}T@{}Y} -\\ -\instbitrange{15}{13} & -\multicolumn{1}{c}{\instbit{12}} & -\instbitrange{11}{7} & -\instbitrange{6}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct3} & -\multicolumn{1}{c|}{imm[5]} & -\multicolumn{1}{c|}{rd/rs1} & -\multicolumn{1}{c|}{imm[4:0]} & -\multicolumn{1}{c|}{op} \\ -\hline -3 & 1 & 5 & 5 & 2 \\ -C.NOP & 0 & 0 & 0 & C1 \\ -\end{tabular} -\end{center} - -C.NOP is a CI-format instruction that does not change any user-visible state, -except for advancing the {\tt pc} and incrementing any applicable performance -counters. C.NOP expands to {\tt nop}. C.NOP is only valid when {\em imm}=0; -the code points with {\em imm}$\neq$0 encode HINTs. - -\subsection*{Breakpoint Instruction} -\vspace{-0.4in} -\begin{center} -\begin{tabular}{E@{}U@{}Y} -\\ -\instbitrange{15}{12} & -\instbitrange{11}{2} & -\instbitrange{1}{0} \\ -\hline -\multicolumn{1}{|c|}{funct4} & -\multicolumn{1}{c|}{0} & -\multicolumn{1}{c|}{op} \\ -\hline -4 & 10 & 2 \\ -C.EBREAK & 0 & C2 \\ -\end{tabular} -\end{center} - -Debuggers can use the C.EBREAK instruction, which expands to {\tt ebreak}, -to cause control to be transferred back to the debugging environment. -C.EBREAK shares the opcode with the C.ADD instruction, but with {\em - rd} and {\em rs2} both zero, thus can also use the CR format. - -\section{Usage of C Instructions in LR/SC Sequences} - -On implementations that support the C extension, compressed forms of the -I instructions permitted inside constrained LR/SC sequences, as described in -Section~\ref{sec:lrscseq}, are also permitted inside constrained LR/SC -sequences. - -\begin{commentary} -The implication is that any implementation that claims to support both -the A and C extensions must ensure that LR/SC sequences containing -valid C instructions will eventually complete. 
-\end{commentary} - -\section{HINT Instructions} -\label{sec:rvc-hints} - -A portion of the RVC encoding space is reserved for microarchitectural HINTs. -Like the HINTs in the RV32I base ISA (see Section~\ref{sec:rv32i-hints}), -these instructions do not modify any architectural state, except for advancing -the {\tt pc} and any applicable performance counters. HINTs are -executed as no-ops on implementations that ignore them. - -RVC HINTs are encoded as computational instructions that do not modify the -architectural state, either because {\em rd}={\tt x0} -(e.g. \mbox{C.ADD {\em x0}, {\em t0}}), or because {\em rd} is overwritten -with a copy of itself (e.g. \mbox{C.ADDI {\em t0}, 0}). - -\begin{commentary} -This HINT encoding has been chosen so that simple implementations can ignore -HINTs altogether, and instead execute a HINT as a regular computational -instruction that happens not to mutate the architectural state. -\end{commentary} - -RVC HINTs do not necessarily expand to their RVI HINT counterparts. For -example, \mbox{C.ADD {\em x0}, {\em a0}} might not encode the same HINT -as \mbox{ADD {\em x0}, {\em x0}, {\em a0}}. - -\begin{commentary} -The primary reason to not require an RVC HINT to expand to an RVI HINT -is that HINTs are unlikely to be compressible in the same manner as -the underlying computational instruction. Also, decoupling the RVC -and RVI HINT mappings allows the scarce RVC HINT space to be allocated -to the most popular HINTs, and in particular, to HINTs that are -amenable to macro-op fusion. -\end{commentary} - -Table~\ref{tab:rvc-hints} lists all RVC HINT code points. For RV32C, 78\% of -the HINT space is reserved for standard HINTs. -The remainder of the HINT space is designated for custom HINTs: no standard -HINTs will ever be defined in this subspace. 
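As an illustration of the rd=x0 rule described above, the following Python sketch classifies the 16-bit code points that share the C.MV and C.ADD encodings. It is a minimal sketch, not a decoder: the function name `classify_cr` and the returned strings are hypothetical, the field positions follow the CR format shown earlier (funct4 in bits 15-12, rd/rs1 in bits 11-7, rs2 in bits 6-2, op in bits 1-0), and the funct4 values 0b1000 and 0b1001 are assumed from the RVC instruction listings rather than stated in this section.

```python
def classify_cr(insn):
    """Classify a 16-bit CR-format code point in quadrant C2.

    Hypothetical helper for illustration only. Assumes funct4 = 0b1000
    for the C.JR/C.MV group and 0b1001 for the C.EBREAK/C.JALR/C.ADD group.
    """
    op = insn & 0x3                # bits 1-0
    rs2 = (insn >> 2) & 0x1F       # bits 6-2
    rd = (insn >> 7) & 0x1F        # bits 11-7
    funct4 = (insn >> 12) & 0xF    # bits 15-12

    if op != 0b10 or funct4 not in (0b1000, 0b1001):
        return "other"

    if funct4 == 0b1000:
        if rs2 == 0:
            # rs2 = x0 code points belong to C.JR (rs1 = x0 is reserved)
            return "C.JR" if rd != 0 else "reserved"
        # rs2 != x0: C.MV, or a HINT when the destination is x0
        return "C.MV" if rd != 0 else "HINT"

    # funct4 == 0b1001
    if rs2 == 0:
        # rs2 = x0 code points belong to C.EBREAK (rd = x0) and C.JALR
        return "C.EBREAK" if rd == 0 else "C.JALR"
    # rs2 != x0: C.ADD, or a HINT when the destination is x0
    return "C.ADD" if rd != 0 else "HINT"
```

For example, `classify_cr(0x9016)` corresponds to C.ADD x0, t0 (the example used in the text) and is classified as a HINT, because writing x0 cannot change architectural state.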
- -\begin{table}[hbt] -\centering -\begin{tabular}{|l|l|r|l|} - \hline - Instruction & Constraints & Code Points & Purpose \\ \hline \hline - C.NOP & {\em nzimm}$\neq$0 & 63 & \multirow{6}{*}{\em Reserved for future standard use} \\ \cline{1-3} - C.ADDI & {\em rd}$\neq${\tt x0}, {\em nzimm}=0 & 31 & \\ \cline{1-3} - C.LI & {\em rd}={\tt x0} & 64 & \\ \cline{1-3} - C.LUI & {\em rd}={\tt x0}, {\em nzimm}$\neq$0 & 63 & \\ \cline{1-3} - C.MV & {\em rd}={\tt x0}, {\em rs2}$\neq${\tt x0} & 31 & \\ \cline{1-3} - C.ADD & {\em rd}={\tt x0}, {\em rs2}$\neq${\tt x0}, {\em rs2}$\neq${\tt x2}--{\tt x5} & 27 & \\ \hline - \multirow{4}{*}{C.ADD} & \multirow{4}{*}{{\em rd}={\tt x0}, {\em rs2}={\tt x2}--{\tt x5}} - & \multirow{4}{*}{$4$} - & ({\em rs2}={\tt x2}) C.NTL.P1 \\ - & & & ({\em rs2}={\tt x3}) C.NTL.PALL \\ - & & & ({\em rs2}={\tt x4}) C.NTL.S1 \\ - & & & ({\em rs2}={\tt x5}) C.NTL.ALL \\ \hline - \multirow{2}{*}{C.SLLI} & \multirow{2}{*}{{\em rd}={\tt x0}, {\em nzimm}$\neq$0} & 31 (RV32) & \multirow{6}{*}{\em Designated for custom use} \\ - & & 63 (RV64/128) & \\ \cline{1-3} - C.SLLI64 & {\em rd}={\tt x0} & 1 & \\ \cline{1-3} - C.SLLI64 & {\em rd}$\neq${\tt x0}, RV32 and RV64 only & 31 & \\ \cline{1-3} - C.SRLI64 & RV32 and RV64 only & 8 & \\ \cline{1-3} - C.SRAI64 & RV32 and RV64 only & 8 & \\ \hline -\end{tabular} -\caption{RVC HINT instructions.} -\label{tab:rvc-hints} -\end{table} - -\clearpage - -\section{RVC Instruction Set Listings} - -Table~\ref{rvcopcodemap} shows a map of the major opcodes for RVC. -Each row of the table corresponds to one quadrant of the encoding -space. The last quadrant, which has the two -least-significant bits set, corresponds to instructions wider -than 16 bits, including those in the base ISAs. 
Several instructions -are only valid for certain operands; when invalid, they are marked -either {\em RES} to indicate that the opcode is reserved for future -standard extensions; {\em Custom} to indicate that the opcode is designated -for custom extensions; or {\em HINT} to indicate that the opcode -is reserved for microarchitectural hints (see Section~\ref{sec:rvc-hints}). - -\input{rvc-opcode-map} - -Tables~\ref{rvc-instr-table0}--\ref{rvc-instr-table2} list the RVC instructions. -\input{rvc-instr-table} diff --git a/src/counters.tex b/src/counters.tex deleted file mode 100644 index 545804b..0000000 --- a/src/counters.tex +++ /dev/null @@ -1,252 +0,0 @@ -\chapter{``Zicntr'' and ``Zihpm'' Counters} -\label{counters} - -RISC-V ISAs provide a set of up to thirty-two 64-bit performance counters and -timers that are accessible via unprivileged XLEN-bit read-only CSR -registers {\tt 0xC00}--{\tt 0xC1F} (when XLEN=32, the upper 32 bits -are accessed via CSR registers {\tt 0xC80}--{\tt 0xC9F}). -These counters are divided between the ``Zicntr'' and ``Zihpm'' extensions. - -\section{``Zicntr'' Standard Extension for Base Counters and Timers} - -The Zicntr standard extension comprises the first three of these -counters (CYCLE, TIME, and INSTRET), which -have dedicated functions (cycle -count, real-time clock, and instructions retired, respectively). -The Zicntr extension depends on the Zicsr extension. - -\begin{commentary} -We recommend provision of these basic counters in implementations as -they are essential for basic performance analysis, adaptive and -dynamic optimization, and to allow an application to work with -real-time streams. Additional counters in the separate Zihpm extension can -help diagnose performance problems and these should be made accessible -from user-level application code with low overhead. - -Some execution environments might prohibit access to counters, for -example, to impede timing side-channel attacks. 
-\end{commentary} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{M@{}R@{}F@{}R@{}S} -\\ -\instbitrange{31}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{csr} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -12 & 5 & 3 & 5 & 7 \\ -RDCYCLE[H] & 0 & CSRRS & dest & SYSTEM \\ -RDTIME[H] & 0 & CSRRS & dest & SYSTEM \\ -RDINSTRET[H] & 0 & CSRRS & dest & SYSTEM \\ -\end{tabular} -\end{center} - -For base ISAs with XLEN$\geq$64, CSR instructions can access the full -64-bit CSRs directly. In particular, the RDCYCLE, RDTIME, and -RDINSTRET pseudoinstructions read the full 64 bits of the {\tt cycle}, -{\tt time}, and {\tt instret} counters. - -\begin{commentary} -The counter pseudoinstructions are mapped to the read-only {\tt csrrs - rd, counter, x0} canonical form, but the other read-only CSR -instruction forms (based on CSRRC/CSRRSI/CSRRCI) are also legal ways -to read these CSRs. -\end{commentary} - -For base ISAs with XLEN=32, the Zicntr extension enables the three -64-bit read-only counters to be accessed in 32-bit pieces. -The RDCYCLE, RDTIME, and RDINSTRET pseudoinstructions provide the lower 32 -bits, and the RDCYCLEH, RDTIMEH, and RDINSTRETH pseudoinstructions provide -the upper 32 bits of the respective counters. - -\begin{commentary} -We required the counters be 64 bits wide, even when XLEN=32, as otherwise -it is very difficult for software to determine if values have -overflowed. For a low-end implementation, the upper 32 bits of each -counter can be implemented using software counters incremented by a -trap handler triggered by overflow of the lower 32 bits. The sample -code given below shows how the full 64-bit width value can be -safely read using the individual 32-bit width pseudoinstructions. 
-\end{commentary} - -The RDCYCLE pseudoinstruction reads the low XLEN bits of the {\tt - cycle} CSR which holds a count of the number of clock cycles -executed by the processor core on which the hart is running from an -arbitrary start time in the past. RDCYCLEH is only present when -XLEN=32 and reads bits 63--32 of the same cycle -counter. The underlying 64-bit counter should never overflow in -practice. The rate at which the cycle counter advances will depend on -the implementation and operating environment. The execution -environment should provide a means to determine the current rate -(cycles/second) at which the cycle counter is incrementing. - -\begin{commentary} -RDCYCLE is intended to return the number of cycles executed by the -processor core, not the hart. Precisely defining what is a ``core'' is -difficult given some implementation choices (e.g., AMD Bulldozer). -Precisely defining what is a ``clock cycle'' is also difficult given the -range of implementations (including software emulations), but the -intent is that RDCYCLE is used for performance monitoring along with the -other performance counters. In particular, where there is one -hart/core, one would expect cycle-count/instructions-retired to -measure CPI for a hart. - -Cores don't have to be exposed to software at all, and an implementor -might choose to pretend multiple harts on one physical core are -running on separate cores with one hart/core, and provide separate -cycle counters for each hart. This might make sense in a simple -barrel processor (e.g., CDC 6600 peripheral processors) where -inter-hart timing interactions are non-existent or minimal. - -Where there is more than one hart/core and dynamic multithreading, it -is not generally possible to separate out cycles per hart (especially -with SMT). 
It might be possible to define a separate performance -counter that tried to capture the number of cycles a particular hart -was running, but this definition would have to be very fuzzy to cover -all the possible threading implementations. For example, should we -only count cycles for which any instruction was issued to execution -for this hart, and/or cycles any instruction retired, or include -cycles this hart was occupying machine resources but couldn't execute -due to stalls while other harts went into execution? Likely, ``all of -the above'' would be needed to have understandable performance stats. -This complexity of defining a per-hart cycle count, and also the need -in any case for a total per-core cycle count when tuning multithreaded -code led to just standardizing the per-core cycle counter, which also -happens to work well for the common single hart/core case. - -Standardizing what happens during ``sleep'' is not practical given -that what ``sleep'' means is not standardized across execution -environments, but if the entire core is paused (entirely clock-gated -or powered-down in deep sleep), then it is not executing clock cycles, -and the cycle count shouldn't be increasing per the spec. There are -many details, e.g., whether clock cycles required to reset a processor -after waking up from a power-down event should be counted, and these -are considered execution-environment-specific details. - -Even though there is no precise definition that works for all -platforms, this is still a useful facility for most platforms, and an -imprecise, common, ``usually correct'' standard here is better than no -standard. The intent of RDCYCLE was primarily performance -monitoring/tuning, and the specification was written with that goal in -mind. -\end{commentary} - -The RDTIME pseudoinstruction reads the low XLEN bits of the {\tt - time} CSR, which counts wall-clock real time that has passed from an -arbitrary start time in the past. 
-RDTIMEH is only present when XLEN=32 and reads bits 63--32 of the same -real-time counter. -The underlying 64-bit counter increments by one with each tick of the -real-time clock, and, for realistic real-time clock frequencies, should never -overflow in practice. -The execution environment should provide a means of determining the period of -a counter tick (seconds/tick). -The period should be constant within a small error bound. -The environment should provide a means to determine the accuracy of the clock -(i.e., the maximum relative error between the nominal and actual real-time -clock periods). - -\begin{commentary} -On some simple platforms, cycle count might represent a valid -implementation of RDTIME, in which case RDTIME and RDCYCLE may -return the same result. - -It is difficult to provide a strict mandate on clock period given the -wide variety of possible implementation platforms. The maximum error -bound should be set based on the requirements of the platform. -\end{commentary} - -The real-time clocks of all harts -must be synchronized to within one tick of the real-time clock. - -\begin{commentary} -As with other architectural mandates, it suffices to appear ``as if'' -harts are synchronized to within one tick of the real-time clock, -i.e., software is unable to observe that there is a greater delta -between the real-time clock values observed on two harts. -\end{commentary} - -The RDINSTRET pseudoinstruction reads the low XLEN bits of the {\tt - instret} CSR, which counts the number of instructions retired by -this hart from some arbitrary start point in the past. RDINSTRETH is -only present when XLEN=32 and reads bits 63--32 of the same -instruction counter. The underlying 64-bit counter should never -overflow in practice. - -\begin{commentary} -Instructions that cause synchronous exceptions, including ECALL and EBREAK, -are not considered to retire and hence do not increment the {\tt instret} CSR. 
-\end{commentary} - -The following code sequence will read a valid 64-bit cycle counter value into -{\tt x3}:{\tt x2}, even if the counter overflows its lower half between reading its upper -and lower halves. - -\begin{figure}[h!] -\begin{center} -\begin{verbatim} - again: - rdcycleh x3 - rdcycle x2 - rdcycleh x4 - bne x3, x4, again -\end{verbatim} -\end{center} -\caption{Sample code for reading the 64-bit cycle counter when XLEN=32.} -\label{rdcycle} -\end{figure} - -\section{``Zihpm'' Standard Extension for Hardware Performance Counters} - -The Zihpm extension comprises up to 29 additional unprivileged 64-bit -hardware performance counters, {\tt hpmcounter3}--{\tt hpmcounter31}. -When XLEN=32, the upper 32 bits of these performance counters are -accessible via additional CSRs {\tt hpmcounter3h}--{\tt - hpmcounter31h}. The Zihpm extension depends on the Zicsr extension. - -\begin{commentary} -In some applications, it is important to be able to read multiple -counters at the same instant in time. When run under a multitasking -environment, a user thread can suffer a context switch while -attempting to read the counters. One solution is for the user thread -to read the real-time counter before and after reading the other -counters to determine if a context switch occurred in the middle of the -sequence, in which case the reads can be retried. We considered -adding output latches to allow a user thread to snapshot the counter -values atomically, but this would increase the size of the user -context, especially for implementations with a richer set of counters. -\end{commentary} - -The implemented number and width of these additional counters, and the -set of events they count, is platform-specific. Accessing an -unimplemented or ill-configured counter may cause an illegal -instruction exception or may return a constant value. 
- -The execution environment should provide a means to determine the -number and width of the implemented counters, and an interface to -configure the events to be counted by each counter. - -\begin{commentary} - For execution environments implemented on RISC-V privileged - platforms, the privileged architecture manual describes privileged - CSRs controlling access by lower privileged modes to these counters, - and to set the events to be counted. - - Alternative execution environments (e.g., user-level-only software - performance models) may provide alternative mechanisms to configure - the events counted by the performance counters. - - It would be useful to eventually standardize event settings to count - ISA-level metrics, such as the number of floating-point instructions - executed for example, and possibly a few common microarchitectural - metrics, such as ``L1 instruction cache misses''. -\end{commentary} diff --git a/src/csr.tex b/src/csr.tex deleted file mode 100644 index b642567..0000000 --- a/src/csr.tex +++ /dev/null @@ -1,260 +0,0 @@ -\chapter{``Zicsr'', Control and Status Register (CSR) Instructions, Version 2.0} -\label{csrinsts} - -RISC-V defines a separate address space of 4096 Control and Status -registers associated with each hart. This chapter defines the full -set of CSR instructions that operate on these CSRs. - -\begin{commentary} - While CSRs are primarily used by the privileged architecture, there - are several uses in unprivileged code including for counters and - timers, and for floating-point status. - - The counters and timers are no longer considered mandatory parts of - the standard base ISAs, and so the CSR instructions required to - access them have been moved out of Chapter~\ref{rv32} into this - separate chapter. -\end{commentary} - -\section{CSR Instructions} - -All CSR instructions atomically read-modify-write a single CSR, whose -CSR specifier is encoded in the 12-bit {\em csr} field of the -instruction held in bits 31--20. 
The immediate forms use a 5-bit -zero-extended immediate encoded in the {\em rs1} field. - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{M@{}R@{}F@{}R@{}S} -\\ -\instbitrange{31}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{csr} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -12 & 5 & 3 & 5 & 7 \\ -source/dest & source & CSRRW & dest & SYSTEM \\ -source/dest & source & CSRRS & dest & SYSTEM \\ -source/dest & source & CSRRC & dest & SYSTEM \\ -source/dest & uimm[4:0] & CSRRWI & dest & SYSTEM \\ -source/dest & uimm[4:0] & CSRRSI & dest & SYSTEM \\ -source/dest & uimm[4:0] & CSRRCI & dest & SYSTEM \\ -\end{tabular} -\end{center} - -The CSRRW (Atomic Read/Write CSR) instruction atomically swaps values -in the CSRs and integer registers. CSRRW reads the old value of the -CSR, zero-extends the value to XLEN bits, then writes it to integer -register {\em rd}. The initial value in {\em rs1} is written to the -CSR. If {\em rd}={\tt x0}, then the instruction shall not read the CSR -and shall not cause any of the side effects that might occur on a CSR -read. - -The CSRRS (Atomic Read and Set Bits in CSR) instruction reads the -value of the CSR, zero-extends the value to XLEN bits, and writes it -to integer register {\em rd}. The initial value in integer register -{\em rs1} is treated as a bit mask that specifies bit positions to be -set in the CSR. Any bit that is high in {\em rs1} will cause the -corresponding bit to be set in the CSR, if that CSR bit is writable. -Other bits in the CSR are not explicitly written. - -The CSRRC (Atomic Read and Clear Bits in CSR) instruction reads the -value of the CSR, zero-extends the value to XLEN bits, and writes it -to integer register {\em rd}. 
The initial value in integer register -{\em rs1} is treated as a bit mask that specifies bit positions to be -cleared in the CSR. Any bit that is high in {\em rs1} will cause the -corresponding bit to be cleared in the CSR, if that CSR bit is writable. -Other bits in the CSR are not explicitly written. - -For both CSRRS and CSRRC, if {\em rs1}={\tt x0}, then the instruction -will not write to the CSR at all, and so shall not cause any of the -side effects that might otherwise occur on a CSR write, nor -raise illegal instruction exceptions on accesses to read-only CSRs. -Both CSRRS and CSRRC always read the addressed CSR and cause any read -side effects regardless of {\em rs1} and {\em rd} fields. Note that -if {\em rs1} specifies a register holding a zero value other than {\tt - x0}, the instruction will still attempt to write the unmodified -value back to the CSR and will cause any attendant side effects. A -CSRRW with {\em rs1}={\tt x0} will attempt to write zero to the -destination CSR. - -The CSRRWI, CSRRSI, and CSRRCI variants are similar to CSRRW, CSRRS, -and CSRRC respectively, except they update the CSR using an XLEN-bit -value obtained by zero-extending a 5-bit unsigned immediate (uimm[4:0]) field -encoded in the {\em rs1} field instead of a value from an integer -register. For CSRRSI and CSRRCI, if the uimm[4:0] field is zero, then -these instructions will not write to the CSR, and shall not cause any -of the side effects that might otherwise occur on a CSR write, nor raise -illegal instruction exceptions on accesses to read-only CSRs. -For CSRRWI, if {\em rd}={\tt x0}, then the instruction shall not read the -CSR and shall not cause any of the side effects that might occur on a -CSR read. Both CSRRSI and CSRRCI will always read the CSR and cause -any read side effects regardless of {\em rd} and {\em rs1} fields. 
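The rules above for when a CSR instruction reads or writes the addressed CSR can be condensed into a small predicate. The following Python sketch is illustrative only; the function name `csr_access` and its argument convention are hypothetical. Note that for the register forms the decision depends on the register index in the {\em rs1} field, not on the value held in that register.

```python
def csr_access(op, rd, rs1_or_uimm):
    """Return (reads_csr, writes_csr) for a Zicsr instruction.

    Hypothetical helper for illustration.
    op          -- "CSRRW", "CSRRS", "CSRRC", "CSRRWI", "CSRRSI" or "CSRRCI"
    rd          -- destination register index (0 means x0)
    rs1_or_uimm -- rs1 register index for register forms,
                   uimm[4:0] value for immediate forms
    """
    if op in ("CSRRW", "CSRRWI"):
        # Always writes; the read (and its side effects) is skipped when rd = x0
        return (rd != 0, True)
    if op in ("CSRRS", "CSRRC", "CSRRSI", "CSRRCI"):
        # Always reads; the write is skipped when rs1 = x0 or uimm = 0
        return (True, rs1_or_uimm != 0)
    raise ValueError(f"unknown CSR instruction: {op}")
```

For instance, `csr_access("CSRRS", 5, 0)` models the CSRR pseudoinstruction form (CSRRS with rs1=x0), which reads the CSR without writing it.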
- -\begin{table} - \centering - \begin{tabular}{|l|c|c|c|c|} - \hline - \multicolumn{5}{|c|}{Register operand} \\ - \hline - Instruction & \textit{rd} is \texttt{x0} - & \textit{rs1} is \texttt{x0} - & Reads CSR & Writes CSR \\ - \hline - CSRRW & Yes & -- & No & Yes \\ - CSRRW & No & -- & Yes & Yes \\ - CSRRS/CSRRC & -- & Yes & Yes & No \\ - CSRRS/CSRRC & -- & No & Yes & Yes \\ - \hline - \multicolumn{5}{|c|}{Immediate operand} \\ - \hline - Instruction & \textit{rd} is \texttt{x0} - & \textit{uimm}$=$0 - & Reads CSR & Writes CSR \\ - \hline - CSRRWI & Yes & -- & No & Yes \\ - CSRRWI & No & -- & Yes & Yes \\ - CSRRSI/CSRRCI & -- & Yes & Yes & No \\ - CSRRSI/CSRRCI & -- & No & Yes & Yes \\ - \hline - \end{tabular} - \caption{Conditions determining whether a CSR instruction reads or writes - the specified CSR.} - \label{tab:csrsideeffects} -\end{table} - -Table~\ref{tab:csrsideeffects} summarizes the behavior of the CSR -instructions with respect to whether they read and/or write the CSR. - -For any event or consequence that occurs due to a CSR having a particular -value, if a write to the CSR gives it that value, the resulting event or -consequence is said to be an \emph{indirect effect} of the write. -Indirect effects of a CSR write are not considered by the RISC-V ISA to -be side effects of that write. - -\begin{commentary} - An example of side effects for CSR accesses would be if reading from a - specific CSR causes a light bulb to turn on, while writing an odd value - to the same CSR causes the light to turn off. - Assume writing an even value has no effect. - In this case, both the read and write have side effects controlling - whether the bulb is lit, as this condition is not determined solely - from the CSR value. - (Note that after writing an odd value to the CSR to turn off the light, - then reading to turn the light on, writing again the same odd value - causes the light to turn off again. 
- Hence, on the last write, it is not a change in the CSR value that - turns off the light.) - - On the other hand, if a bulb is rigged to light whenever the value - of a particular CSR is odd, then turning the light on and off is not - considered a side effect of writing to the CSR but merely an indirect - effect of such writes. - - More concretely, the RISC-V privileged architecture defined in - Volume~II specifies that certain combinations of CSR values cause a - trap to occur. - When an explicit write to a CSR creates the conditions that trigger the - trap, the trap is not considered a side effect of the write but merely - an indirect effect. - - Standard CSRs do not have any side effects on reads. - Standard CSRs may have side effects on writes. - Custom extensions might add CSRs for which accesses have side effects - on either reads or writes. -\end{commentary} - -Some CSRs, such as the instructions-retired counter, {\tt instret}, -may be modified as side effects of instruction execution. In these -cases, if a CSR access instruction reads a CSR, it reads the value -prior to the execution of the instruction. If a CSR access -instruction writes such a CSR, the write is done instead of the -increment. In particular, a value written to {\tt instret} by one -instruction will be the value read by the following instruction. - -The assembler pseudoinstruction to read a CSR, CSRR {\em rd, csr}, is -encoded as CSRRS {\em rd, csr, x0}. The assembler pseudoinstruction -to write a CSR, CSRW {\em csr, rs1}, is encoded as CSRRW {\em x0, csr, - rs1}, while CSRWI {\em csr, uimm}, is encoded as CSRRWI {\em x0, - csr, uimm}. - -Further assembler pseudoinstructions are defined to set and clear -bits in the CSR when the old value is not required: CSRS/CSRC {\em - csr, rs1}; CSRSI/CSRCI {\em csr, uimm}. - - -\subsection*{CSR Access Ordering} - -Each RISC-V hart normally observes its own CSR accesses, including its -implicit CSR accesses, as performed in program order. 
-In particular, unless specified otherwise, a CSR access is performed -after the execution of any prior instructions in program order whose behavior -modifies or is modified by the CSR state and before the execution of any -subsequent instructions in program order whose behavior modifies or is modified -by the CSR state. -Furthermore, an explicit CSR read returns the -CSR state before the execution of the instruction, while an -explicit CSR write suppresses and overrides any implicit writes or -modifications to the same CSR by the same instruction. - -Likewise, any side effects from an explicit CSR access are normally -observed to occur synchronously in program order. -Unless specified otherwise, the full consequences of any such side -effects are observable by the very next instruction, and no consequences -may be observed out-of-order by preceding instructions. -(Note the distinction made earlier between side effects and indirect -effects of CSR writes.) - -For the RVWMO memory consistency model (Chapter~\ref{ch:memorymodel}), -CSR accesses are weakly ordered by default, -so other harts or devices may observe CSR accesses in an order -different from program order. In addition, CSR accesses are not ordered with -respect to explicit memory accesses, unless a CSR access modifies the execution -behavior of the instruction that performs the explicit memory access or unless -a CSR access and an explicit memory access are ordered by either the syntactic -dependencies defined by the memory model or the ordering requirements defined -by the Memory-Ordering PMAs section in Volume II of this manual. To enforce -ordering in all other cases, software should execute a FENCE instruction -between the relevant accesses. For the purposes of the FENCE instruction, CSR -read accesses are classified as device input (I), and CSR write accesses are -classified as device output (O). 
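As a rough illustration of the FENCE classification just described, the sketch below maps access kinds to FENCE predecessor and successor letters. The `Access` enum and the `fence_for` helper are hypothetical names introduced here. The only facts taken from the text are that CSR reads count as device input (I) and CSR writes as device output (O); the R and W letters for memory loads and stores are assumed from the base FENCE definition.

```python
from enum import Enum


class Access(Enum):
    """Kinds of access a FENCE can order (hypothetical model, not spec text)."""
    CSR_READ = "i"    # CSR reads are classified as device input (I)
    CSR_WRITE = "o"   # CSR writes are classified as device output (O)
    MEM_LOAD = "r"    # ordinary memory reads (R), assumed from the base ISA
    MEM_STORE = "w"   # ordinary memory writes (W), assumed from the base ISA


def fence_for(before, after):
    """Return assembly text for a FENCE ordering `before` ahead of `after`.

    `before` and `after` are iterables of Access values. The result names
    the smallest predecessor and successor sets that cover them.
    """
    order = "iorw"  # conventional letter order in assembler syntax
    pred_letters = {a.value for a in before}
    succ_letters = {a.value for a in after}
    pred = "".join(c for c in order if c in pred_letters)
    succ = "".join(c for c in order if c in succ_letters)
    if not pred or not succ:
        raise ValueError("both sides of the fence need at least one access")
    return f"fence {pred}, {succ}"
```

For example, `fence_for([Access.CSR_WRITE], [Access.MEM_LOAD, Access.MEM_STORE])` yields `fence o, rw`, which orders an earlier CSR write ahead of later memory accesses in the cases where no other ordering rule applies.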
- -\begin{commentary} -Informally, the CSR space acts as a weakly ordered memory-mapped I/O region, as -defined by the Memory-Ordering PMAs section in Volume II of this manual. As a -result, the order of CSR accesses with respect to all other accesses is -constrained by the same mechanisms that constrain the order of memory-mapped -I/O accesses to such a region. - -These CSR-ordering constraints are imposed to support ordering main -memory and memory-mapped I/O accesses with respect to CSR accesses that -are visible to, or affected by, devices or other harts. -Examples include the {\tt time}, {\tt cycle}, and {\tt mcycle} -CSRs, in addition to CSRs that reflect pending interrupts, like {\tt mip} and -{\tt sip}. -Note that implicit reads of such CSRs (e.g., taking an interrupt because of -a change in {\tt mip}) are also ordered as device input. - -Most CSRs (including, e.g., the {\tt fcsr}) are not visible to other harts; -their accesses can be freely reordered in the global memory order with respect -to FENCE instructions without violating this specification. -\end{commentary} - -The hardware platform may define that accesses to certain CSRs are -strongly ordered, as defined by the Memory-Ordering PMAs section in Volume II -of this manual. Accesses to strongly ordered CSRs have stronger ordering -constraints with respect to accesses to both weakly ordered CSRs and accesses -to memory-mapped I/O regions. - -\begin{commentary} -The rules for the reordering of CSR accesses in the global memory order -should probably be moved to Chapter~\ref{ch:memorymodel} concerning the -RVWMO memory consistency model. 
-\end{commentary} diff --git a/src/d.tex b/src/d.tex deleted file mode 100644 index 8119f47..0000000 --- a/src/d.tex +++ /dev/null @@ -1,442 +0,0 @@ -\chapter{``D'' Standard Extension for Double-Precision Floating-Point, -Version 2.2} - -This chapter describes the standard double-precision floating-point -instruction-set extension, which is named ``D'' and adds -double-precision floating-point computational instructions compliant -with the IEEE 754-2008 arithmetic standard. The D extension depends on -the base single-precision instruction subset F. - -\section{D Register State} - -The D extension widens the 32 floating-point registers, {\tt f0}--{\tt - f31}, to 64 bits (FLEN=64 in Figure~\ref{fprs}). The {\tt f} -registers can now hold either 32-bit or 64-bit floating-point values -as described below in Section~\ref{nanboxing}. - -\begin{commentary} -FLEN can be 32, 64, or 128 depending on which of the F, D, and Q -extensions are supported. There can be up to four different -floating-point precisions supported, including H, F, D, and Q. -\end{commentary} - -\section{NaN Boxing of Narrower Values} -\label{nanboxing} - -When multiple floating-point precisions are supported, then valid -values of narrower $n$-bit types, \mbox{$n<$ FLEN}, are represented in -the lower $n$ bits of an FLEN-bit NaN value, in a process termed -NaN-boxing. The upper bits of a valid NaN-boxed value must be all 1s. -Valid NaN-boxed $n$-bit values therefore appear as negative quiet NaNs -(qNaNs) when viewed as any wider $m$-bit value, \mbox{$n < m \leq$ - FLEN}. Any operation that writes a narrower result to an {\tt f} -register must write all 1s to the uppermost FLEN$-n$ bits to yield a -legal NaN-boxed value. - -\begin{commentary} -Software might not know the current type of data stored in a -floating-point register but has to be able to save and restore the -register values, hence the result of using wider operations to -transfer narrower values has to be defined. 
A common case is for -callee-saved registers, but a standard convention is also desirable for -features including varargs, user-level threading libraries, virtual -machine migration, and debugging. -\end{commentary} - -Floating-point $n$-bit transfer operations move external values held -in IEEE standard formats into and out of the {\tt f} registers, and -comprise floating-point loads and stores (FL$n$/FS$n$) and -floating-point move instructions (FMV.$n$.X/FMV.X.$n$). A narrower -$n$-bit transfer, \mbox{$n<$ FLEN}, into the {\tt f} registers will create a -valid NaN-boxed value. A narrower $n$-bit transfer out of -the floating-point registers will transfer the lower $n$ bits of the -register ignoring the upper FLEN$-n$ bits. - -Apart from transfer operations described in the previous paragraph, -all other floating-point operations on narrower $n$-bit operations, -\mbox{$n<$ FLEN}, check if the input operands are correctly NaN-boxed, -i.e., all upper FLEN$-n$ bits are 1. If so, the $n$ least-significant -bits of the input are used as the input value, otherwise the input -value is treated as an $n$-bit canonical NaN. - -\begin{commentary} -Earlier versions of this document did not define the behavior of -feeding the results of narrower or wider operands into an operation, -except to require that wider saves and restores would preserve the -value of a narrower operand. The new definition removes this -implementation-specific behavior, while still accommodating both -non-recoded and recoded implementations of the floating-point unit. -The new definition also helps catch software errors by propagating -NaNs if values are used incorrectly. - -Non-recoded implementations unpack and pack the operands to IEEE -standard format on the input and output of every floating-point -operation. 
The NaN-boxing cost to a non-recoded implementation is -primarily in checking if the upper bits of a narrower operation -represent a legal NaN-boxed value, and in writing all 1s to the upper -bits of a result. - -Recoded implementations use a more convenient internal format to -represent floating-point values, with an added exponent bit to allow -all values to be held normalized. The cost to the recoded -implementation is primarily the extra tagging needed to track the -internal types and sign bits, but this can be done without adding new -state bits by recoding NaNs internally in the exponent field. Small -modifications are needed to the pipelines used to transfer values in -and out of the recoded format, but the datapath and latency costs are -minimal. The recoding process has to handle shifting of input -subnormal values for wide operands in any case, and extracting the -NaN-boxed value is a similar process to normalization except for -skipping over leading-1 bits instead of skipping over leading-0 bits, -allowing the datapath muxing to be shared. -\end{commentary} - -\section{Double-Precision Load and Store Instructions} -\label{fld_fsd} - -The FLD instruction loads a double-precision floating-point value from -memory into floating-point register {\em rd}. FSD stores a double-precision -value from the floating-point registers to memory. -\begin{commentary} -The double-precision value may be a NaN-boxed single-precision value. 
-\end{commentary} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{M@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[11:0]} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{width} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -12 & 5 & 3 & 5 & 7 \\ -offset[11:0] & base & D & dest & LOAD-FP \\ -\end{tabular} -\end{center} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{O@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[11:5]} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{width} & -\multicolumn{1}{c|}{imm[4:0]} & -\multicolumn{1}{c|}{opcode} \\ -\hline -7 & 5 & 5 & 3 & 5 & 7 \\ -offset[11:5] & src & base & D & offset[4:0] & STORE-FP \\ -\end{tabular} -\end{center} - -FLD and FSD are only guaranteed to execute atomically if the effective address -is naturally aligned and XLEN$\geq$64. - -FLD and FSD do not modify the bits being transferred; in particular, the -payloads of non-canonical NaNs are preserved. - -\section{Double-Precision Floating-Point Computational Instructions} - -The double-precision floating-point computational instructions are -defined analogously to their single-precision counterparts, but operate on -double-precision operands and produce double-precision results. 
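The FLD and FSD formats tabulated above can be sketched as a pair of encoders. The field positions come from the two tables. The numeric values for LOAD-FP, STORE-FP, and the D width field are assumed from the standard opcode listings, which are not part of this section, and the function names are hypothetical.

```python
LOAD_FP = 0b0000111   # major opcode, assumed from the standard opcode map
STORE_FP = 0b0100111  # major opcode, assumed from the standard opcode map
WIDTH_D = 0b011       # width field value assumed to select 64-bit transfers


def _check(imm, *regs):
    """Validate the offset and register numbers before packing fields."""
    if not -2048 <= imm <= 2047:
        raise ValueError("offset must fit in a signed 12-bit immediate")
    if any(not 0 <= r <= 31 for r in regs):
        raise ValueError("register numbers must be in 0..31")


def encode_fld(rd, rs1, imm):
    """Encode FLD rd, imm(rs1): imm[11:0] | rs1 | width | rd | opcode."""
    _check(imm, rd, rs1)
    return (((imm & 0xFFF) << 20) | (rs1 << 15) | (WIDTH_D << 12)
            | (rd << 7) | LOAD_FP)


def encode_fsd(rs2, rs1, imm):
    """Encode FSD rs2, imm(rs1): the offset is split into imm[11:5] and imm[4:0]."""
    _check(imm, rs2, rs1)
    imm &= 0xFFF
    return (((imm >> 5) << 25) | (rs2 << 20) | (rs1 << 15) | (WIDTH_D << 12)
            | ((imm & 0x1F) << 7) | STORE_FP)
```

The only structural difference between the two is where the 12-bit offset lives: FLD keeps it contiguous in bits 31--20, while FSD splits it around the register fields so that rs1 and rs2 stay in fixed positions.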
-\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FADD/FSUB & D & src2 & src1 & RM & dest & OP-FP \\ -FMUL/FDIV & D & src2 & src1 & RM & dest & OP-FP \\ -FMIN-MAX & D & src2 & src1 & MIN/MAX & dest & OP-FP \\ -FSQRT & D & 0 & src & RM & dest & OP-FP \\ -\end{tabular} -\end{center} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{rs3} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -src3 & D & src2 & src1 & RM & dest & F[N]MADD/F[N]MSUB \\ -\end{tabular} -\end{center} - -\section{Double-Precision Floating-Point Conversion and Move Instructions} - -Floating-point-to-integer and integer-to-floating-point conversion -instructions are encoded in the OP-FP major opcode space. -FCVT.W.D or FCVT.L.D converts a double-precision floating-point number -in floating-point register {\em rs1} to a signed 32-bit or 64-bit -integer, respectively, in integer register {\em rd}. FCVT.D.W -or FCVT.D.L converts a 32-bit or 64-bit signed integer, -respectively, in integer register {\em rs1} into a -double-precision floating-point -number in floating-point register {\em rd}. 
FCVT.WU.D, -FCVT.LU.D, FCVT.D.WU, and FCVT.D.LU variants -convert to or from unsigned integer values. -For RV64, FCVT.W[U].D sign-extends the 32-bit result. -FCVT.L[U].D and FCVT.D.L[U] are RV64-only instructions. -The range of valid inputs for FCVT.{\em int}.D and -the behavior for invalid inputs are the same as for FCVT.{\em int}.S. - -All floating-point to integer and integer to floating-point conversion -instructions round according to the {\em rm} field. Note FCVT.D.W[U] always -produces an exact result and is unaffected by rounding mode. - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FCVT.{\em int}.D & D & W[U]/L[U] & src & RM & dest & OP-FP \\ -FCVT.D.{\em int} & D & W[U]/L[U] & src & RM & dest & OP-FP \\ -\end{tabular} -\end{center} - -The double-precision to single-precision and single-precision to -double-precision conversion instructions, FCVT.S.D and FCVT.D.S, are -encoded in the OP-FP major opcode space and both the source and -destination are floating-point registers. The {\em rs2} field -encodes the datatype of the source, and the {\em fmt} field encodes -the datatype of the destination. FCVT.S.D rounds according to the -RM field; FCVT.D.S will never round. 
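A small model can tie the two conversions to the NaN-boxing rules given earlier, assuming FLEN=64. This is only a sketch: the function names are hypothetical, only round-to-nearest-even is modelled (via Python's `struct` narrowing), NaN results are not canonicalized, and a finite double outside the single-precision range raises `OverflowError` here instead of producing infinity.

```python
import struct

CANONICAL_NAN_S = 0x7FC00000   # single-precision canonical NaN bit pattern
BOX_MASK = 0xFFFFFFFF00000000  # upper FLEN-n bits for FLEN=64, n=32


def nan_box(bits32):
    """Place a 32-bit value in a 64-bit register image with all-1s upper bits."""
    return BOX_MASK | (bits32 & 0xFFFFFFFF)


def unbox(reg64):
    """Return the low 32 bits if correctly NaN-boxed, else the canonical NaN."""
    if (reg64 & BOX_MASK) == BOX_MASK:
        return reg64 & 0xFFFFFFFF
    return CANONICAL_NAN_S


def fcvt_s_d(reg64):
    """Narrow the double held in reg64 to single and NaN-box the result."""
    value = struct.unpack("<d", struct.pack("<Q", reg64))[0]
    bits32 = struct.unpack("<I", struct.pack("<f", value))[0]
    return nan_box(bits32)


def fcvt_d_s(reg64):
    """Widen the NaN-boxed single held in reg64 to double; no rounding occurs."""
    value = struct.unpack("<f", struct.pack("<I", unbox(reg64)))[0]
    return struct.unpack("<Q", struct.pack("<d", value))[0]
```

Widening and then narrowing any finite single-precision value returns the original boxed register image, which is one way to see why FCVT.D.S never needs the rounding mode.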
- -\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FCVT.S.D & S & D & src & RM & dest & OP-FP \\ -FCVT.D.S & D & S & src & RM & dest & OP-FP \\ -\end{tabular} -\end{center} - -Floating-point to floating-point sign-injection instructions, FSGNJ.D, -FSGNJN.D, and FSGNJX.D are defined analogously to the single-precision -sign-injection instruction. - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FSGNJ & D & src2 & src1 & J[N]/JX & dest & OP-FP \\ -\end{tabular} -\end{center} - -For XLEN$\geq$64 only, instructions are provided to move bit patterns -between the floating-point and integer registers. FMV.X.D moves the -double-precision value in floating-point register {\em rs1} to a -representation in IEEE 754-2008 standard encoding in integer register -{\em rd}. FMV.D.X moves the double-precision value encoded in IEEE -754-2008 standard encoding from the integer register {\em rs1} to the -floating-point register {\em rd}. - -FMV.X.D and FMV.D.X do not modify the bits being transferred; in particular, the -payloads of non-canonical NaNs are preserved. 
- -\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FMV.X.D & D & 0 & src & 000 & dest & OP-FP \\ -FMV.D.X & D & 0 & src & 000 & dest & OP-FP \\ -\end{tabular} -\end{center} - -\begin{commentary} - Early versions of the RISC-V ISA had additional instructions to - allow RV32 systems to transfer between the upper and lower portions - of a 64-bit floating-point register and an integer register. - However, these would be the only instructions with partial register - writes and would add complexity in implementations with recoded - floating-point or register renaming, requiring a pipeline read-modify-write - sequence. Scaling up to handling quad-precision for RV32 and RV64 - would also require additional instructions if they were to follow - this pattern. The ISA was defined to reduce the number of explicit - int-float register moves, by having conversions and comparisons - write results to the appropriate register file, so we expect the - benefit of these instructions to be lower than for other ISAs. - - We note that for systems that implement a 64-bit floating-point unit - including fused multiply-add support and 64-bit floating-point loads - and stores, the marginal hardware cost of moving from a 32-bit to - a 64-bit integer datapath is low, and a software ABI supporting 32-bit - wide address-space and pointers can be used to avoid growth of - static data and dynamic memory traffic. 
-\end{commentary} - -\section{Double-Precision Floating-Point Compare Instructions} - -The double-precision floating-point compare instructions are -defined analogously to their single-precision counterparts, but operate on -double-precision operands. - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{S@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FCMP & D & src2 & src1 & EQ/LT/LE & dest & OP-FP \\ -\end{tabular} -\end{center} - -\section{Double-Precision Floating-Point Classify Instruction} - -The double-precision floating-point classify instruction, FCLASS.D, is -defined analogously to its single-precision counterpart, but operates on -double-precision operands. 
- -\vspace{-0.2in} -\begin{center} -\begin{tabular}{S@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FCLASS & D & 0 & src & 001 & dest & OP-FP \\ -\end{tabular} -\end{center} diff --git a/src/extensions.tex b/src/extensions.tex deleted file mode 100644 index 56cc912..0000000 --- a/src/extensions.tex +++ /dev/null @@ -1,383 +0,0 @@ -\chapter{Extending RISC-V} -\label{extensions} - -In addition to supporting standard general-purpose software -development, another goal of RISC-V is to provide a basis for more -specialized instruction-set extensions or more customized -accelerators. The instruction encoding spaces and optional -variable-length instruction encoding are designed to make it easier to -leverage software development effort for the standard ISA toolchain -when building more customized processors. For example, the intent is -to continue to provide full software support for implementations that -only use the standard I base, perhaps together with many non-standard -instruction-set extensions. - -This chapter describes various ways in which the base RISC-V ISA can -be extended, together with the scheme for managing instruction-set -extensions developed by independent groups. This volume only deals -with the unprivileged ISA, although the same approach and terminology is -used for supervisor-level extensions described in the second volume. - -\section{Extension Terminology} - -This section defines some standard terminology for describing RISC-V -extensions. 
-\vspace{-0.2in} -\subsection*{Standard versus Non-Standard Extension} - -Any RISC-V processor implementation must support a base integer ISA -(RV32I, RV32E, RV64I, or RV128I). In addition, an implementation may -support one or more extensions. We divide extensions into two broad -categories: {\em standard} versus {\em non-standard}. -\begin{itemize} -\item A standard extension is one that is generally useful and that is - designed to not conflict with any other standard extension. - Currently, ``MAFDQLCBTPV'', described in other chapters of this - manual, are either complete or planned standard extensions. -\item A non-standard extension may be highly specialized and may - conflict with other standard or non-standard extensions. We - anticipate a wide variety of non-standard extensions will be - developed over time, with some eventually being promoted to standard - extensions. -\end{itemize} - -\vspace{-0.2in} -\subsection*{Instruction Encoding Spaces and Prefixes} - -An instruction encoding space is some number of instruction bits -within which a base ISA or ISA extension is encoded. RISC-V supports -varying instruction lengths, but even within a single instruction -length, there are various sizes of encoding space available. For -example, the base ISAs are defined within a 30-bit encoding space (bits -31--2 of the 32-bit instruction), while the atomic extension ``A'' -fits within a 25-bit encoding space (bits 31--7). - -We use the term {\em prefix} to refer to the bits to the {\em right} -of an instruction encoding space (since instruction fetch in RISC-V is -little-endian, the -bits to the right are stored at earlier memory addresses, hence form a -prefix in instruction-fetch order). The prefix for the standard base -ISA encoding is the two-bit ``11'' field held in bits 1--0 of the -32-bit word, while the prefix for the standard atomic extension ``A'' -is the seven-bit ``0101111'' field held in bits 6--0 of the 32-bit -word representing the AMO major opcode. 
A quirk of the encoding -format is that the 3-bit funct3 field used to encode a minor opcode is -not contiguous with the major opcode bits in the 32-bit instruction -format, but is considered part of the prefix for 22-bit instruction -spaces. - -Although an instruction encoding space could be of any size, adopting -a smaller set of common sizes simplifies packing independently -developed extensions into a single global encoding. -Table~\ref{encodingspaces} gives the suggested sizes for RISC-V. - -\begin{table}[H] -\begin{center} -\begin{tabular}{|c|l|r|r|r|r|} -\hline -\multicolumn{1}{|c|}{Size} & \multicolumn{1}{|c|}{Usage} & -\multicolumn{4}{|c|}{\# Available in standard instruction length} \\ \cline{3-6} - & & -\multicolumn{1}{|c|}{16-bit} & -\multicolumn{1}{|c|}{32-bit} & -\multicolumn{1}{|c|}{48-bit} & -\multicolumn{1}{|c|}{64-bit} \\ \hline \hline -14-bit & Quadrant of compressed 16-bit encoding & 3 & & & \\ \hline \hline -22-bit & Minor opcode in base 32-bit encoding & & $2^{8}$ & $2^{20}$ & $2^{35}$ \\ \hline -25-bit & Major opcode in base 32-bit encoding & & 32 & $2^{17}$ & $2^{32}$ \\ \hline -30-bit & Quadrant of base 32-bit encoding & & 1 & $2^{12}$ & $2^{27}$ \\ \hline \hline -32-bit & Minor opcode in 48-bit encoding & & & $2^{10}$ & $2^{25}$ \\ \hline -37-bit & Major opcode in 48-bit encoding & & & 32 & $2^{20}$ \\ \hline -40-bit & Quadrant of 48-bit encoding & & & 4 & $2^{17}$ \\ \hline \hline -45-bit & Sub-minor opcode in 64-bit encoding & & & & $2^{12}$ \\ \hline -48-bit & Minor opcode in 64-bit encoding & & & & $2^{9}$ \\ \hline -52-bit & Major opcode in 64-bit encoding & & & & 32\\ \hline -\end{tabular} -\end{center} -\caption{Suggested standard RISC-V instruction encoding space sizes.} -\label{encodingspaces} -\end{table} - -\vspace{-0.2in} -\subsection*{Greenfield versus Brownfield Extensions} - -We use the term {\em greenfield extension} to describe an extension -that begins populating a new instruction encoding space, and hence can -only cause 
encoding conflicts at the prefix level. We use the term -{\em brownfield extension} to describe an extension that fits around -existing encodings in a previously defined instruction space. A -brownfield extension is necessarily tied to a particular greenfield -parent encoding, and there may be multiple brownfield extensions to -the same greenfield parent encoding. For example, the base ISAs are -greenfield encodings of a 30-bit instruction space, while the FDQ -floating-point extensions are all brownfield extensions adding to the -parent base ISA 30-bit encoding space. - -Note that we consider the standard A extension to have a greenfield -encoding as it defines a new previously empty 25-bit encoding space in -the leftmost bits of the full 32-bit base instruction encoding, even -though its standard prefix locates it within the 30-bit encoding space -of its parent base ISA. -Changing only its single 7-bit prefix could move the -A extension to a different 30-bit encoding space while only worrying -about conflicts at the prefix level, not within the encoding space -itself. - -\begin{table}[H] -{ -\begin{center} -\begin{tabular}{|r|c|c|} -\hline - & Adds state & No new state \\ \hline -Greenfield & RV32I(30), RV64I(30) & A(25) \\\hline -Brownfield & F(I), D(F), Q(D) & M(I) \\ -\hline -\end{tabular} -\end{center} -} -\caption{Two-dimensional characterization of standard instruction-set - extensions.} -\label{exttax} -\end{table} - -Table~\ref{exttax} shows the bases and standard extensions placed in a -simple two-dimensional taxonomy. One axis is whether the extension is -greenfield or brownfield, while the other axis is whether the -extension adds architectural state. For greenfield extensions, the -size of the instruction encoding space is given in parentheses. For -brownfield extensions, the name of the extension (greenfield or -brownfield) it builds upon is given in parentheses. 
Additional -user-level architectural state usually implies changes to the -supervisor-level system or possibly to the standard calling -convention. - -Note that RV64I is not considered an extension of RV32I, but a -different complete base encoding. - -\vspace{-0.2in} -\subsection*{Standard-Compatible Global Encodings} - -A complete or {\em global} encoding of an ISA for an actual RISC-V -implementation must allocate a unique non-conflicting prefix for every -included instruction encoding space. The bases and every standard -extension have each had a standard prefix allocated to ensure they can -all coexist in a global encoding. - -A {\em standard-compatible} global encoding is one where the base and -every included standard extension have their standard prefixes. A -standard-compatible global encoding can include non-standard -extensions that do not conflict with the included standard extensions. -A standard-compatible global encoding can also use standard prefixes -for non-standard extensions if the associated standard extensions are -not included in the global encoding. In other words, a standard -extension must use its standard prefix if included in a -standard-compatible global encoding, but otherwise its prefix is free -to be reallocated. These constraints allow a common toolchain to -target the standard subset of any RISC-V standard-compatible global -encoding. - -\vspace{-0.2in} -\subsection*{Guaranteed Non-Standard Encoding Space} - -To support development of proprietary custom extensions, portions of -the encoding space are guaranteed to never be used by standard -extensions. - -\section{RISC-V Extension Design Philosophy} - -We intend to support a large number of independently developed -extensions by encouraging extension developers to operate within -instruction encoding spaces, and by providing tools to pack these into -a standard-compatible global encoding by allocating unique prefixes. 
-Some extensions are more naturally implemented as brownfield -augmentations of existing extensions, and will share whatever prefix -is allocated to their parent greenfield extension. The standard -extension prefixes avoid spurious incompatibilities in the encoding of -core functionality, while allowing custom packing of more esoteric -extensions. - -This capability of repacking RISC-V extensions into different -standard-compatible global encodings can be used in a number of ways. - -One use-case is developing highly specialized custom accelerators, -designed to run kernels from important application domains. These -might want to drop all but the base integer ISA and add in only the -extensions that are required for the task in hand. The base ISAs have -been designed to place minimal requirements on a hardware -implementation, and has been encoded to use only a small fraction of a -32-bit instruction encoding space. - -Another use-case is to build a research prototype for a new type of -instruction-set extension. The researchers might not want to expend -the effort to implement a variable-length instruction-fetch unit, and -so would like to prototype their extension using a simple 32-bit -fixed-width instruction encoding. However, this new extension might -be too large to coexist with standard extensions in the 32-bit space. -If the research experiments do not need all of the standard -extensions, a standard-compatible global encoding might drop the -unused standard extensions and reuse their prefixes to place the -proposed extension in a non-standard location to simplify engineering -of the research prototype. Standard tools will still be able to -target the base and any standard extensions that are present to reduce -development time. Once the instruction-set extension has been -evaluated and refined, it could then be made available for packing -into a larger variable-length encoding space to avoid conflicts with -all standard extensions. 
- -The following sections describe increasingly sophisticated strategies -for developing implementations with new instruction-set extensions. -These are mostly intended for use in highly customized, educational, -or experimental architectures rather than for the main line of RISC-V -ISA development. - -\section{Extensions within fixed-width 32-bit instruction format} -\label{fix32b} - -In this section, we discuss adding extensions to implementations that -only support the base fixed-width 32-bit instruction format. - -\begin{commentary} -We anticipate the simplest fixed-width 32-bit encoding will be popular for -many restricted accelerators and research prototypes. -\end{commentary} - -\subsection*{Available 30-bit instruction encoding spaces} - -In the standard encoding, three of the available 30-bit instruction -encoding spaces (those with 2-bit prefixes 00, 01, and 10) are used to -enable the optional compressed instruction extension. However, if the -compressed instruction-set extension is not required, then these three -further 30-bit encoding spaces become available. This quadruples the -available encoding space within the 32-bit format. - -\subsection*{Available 25-bit instruction encoding spaces} - -A 25-bit instruction encoding space corresponds to a major opcode in -the base and standard extension encodings. - -There are four major opcodes expressly designated for custom extensions -(Table~\ref{opcodemap}), each of which represents a 25-bit encoding -space. Two of these are reserved for eventual use in the RV128 base -encoding (will be OP-IMM-64 and OP-64), but can be used for -non-standard extensions for RV32 and RV64. - -The two major opcodes reserved for RV64 (OP-IMM-32 and OP-32) can also be -used for non-standard extensions to RV32 only. 
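The counts of available spaces quoted in this chapter follow from simple arithmetic: a format with a given number of payload bits contains two to the power of (payload minus space size) disjoint spaces of that size. The sketch below reproduces that calculation. The length-encoding overheads of 2, 6, and 7 bits for the 32-, 48-, and 64-bit formats are an assumption taken from the standard length-encoding scheme rather than from this section, and the function names are hypothetical.

```python
# Bits consumed by the instruction-length encoding in each format.
# Assumed from the standard length-encoding figure, which is outside this section.
LENGTH_PREFIX_BITS = {32: 2, 48: 6, 64: 7}


def spaces_available(inst_bits, space_bits):
    """Number of disjoint `space_bits`-wide encoding spaces in one format."""
    payload = inst_bits - LENGTH_PREFIX_BITS[inst_bits]
    if space_bits > payload:
        return 0  # the space does not fit in this instruction length
    return 2 ** (payload - space_bits)


def spaces_without_compressed(space_bits):
    """Spaces in a 32-bit word when prefixes 00, 01, and 10 are also freed."""
    return 4 * spaces_available(32, space_bits)
```

With these assumptions, `spaces_available(32, 25)` gives the 32 major opcodes, `spaces_available(32, 22)` gives the 256 minor-opcode spaces, and `spaces_without_compressed(30)` gives the four 30-bit spaces that exist once the compressed extension is dropped.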
- -If an implementation does not require floating-point, then the seven -major opcodes reserved for standard floating-point extensions -(LOAD-FP, STORE-FP, MADD, MSUB, NMSUB, NMADD, OP-FP) can be reused for -non-standard extensions. Similarly, the AMO major opcode can be -reused if the standard atomic extensions are not required. - -If an implementation does not require instructions longer than -32-bits, then an additional four major opcodes are available (those -marked in gray in Table~\ref{opcodemap}). - -The base RV32I encoding uses only 11 major opcodes plus 3 reserved -opcodes, leaving up to 18 available for extensions. The base RV64I -encoding uses only 13 major opcodes plus 3 reserved opcodes, leaving -up to 16 available for extensions. - -\subsection*{Available 22-bit instruction encoding spaces} - -A 22-bit encoding space corresponds to a funct3 minor opcode space in -the base and standard extension encodings. Several major opcodes have -a funct3 field minor opcode that is not completely occupied, leaving -available several 22-bit encoding spaces. - -Usually a major opcode selects the format used to encode operands in -the remaining bits of the instruction, and ideally, an extension -should follow the operand format of the major opcode to simplify -hardware decoding. - -\subsection*{Other spaces} - -Smaller spaces are available under certain major opcodes, and not all -minor opcodes are entirely filled. - -\section{Adding aligned 64-bit instruction extensions} - -The simplest approach to provide space for extensions that are too -large for the base 32-bit fixed-width instruction format is to add -naturally aligned 64-bit instructions. The implementation must still -support the 32-bit base instruction format, but can require that -64-bit instructions are aligned on 64-bit boundaries to simplify -instruction fetch, with a 32-bit NOP instruction used as alignment -padding where necessary. 
- -To simplify use of standard tools, the 64-bit instructions should be -encoded as described in Figure~\ref{instlengthcode}. However, an -implementation might choose a non-standard instruction-length encoding -for 64-bit instructions, while retaining the standard encoding for -32-bit instructions. For example, if compressed instructions are not -required, then a 64-bit instruction could be encoded using one or more -zero bits in the first two bits of an instruction. - -\begin{commentary} -We anticipate processor generators that produce instruction-fetch -units capable of automatically handling any combination of supported -variable-length instruction encodings. -\end{commentary} - -\section{Supporting VLIW encodings} - -Although RISC-V was not designed as a base for a pure VLIW machine, -VLIW encodings can be added as extensions using several alternative -approaches. In all cases, the base 32-bit encoding has to be supported -to allow use of any standard software tools. - -\subsection*{Fixed-size instruction group} - -The simplest approach is to define a single large naturally aligned -instruction format (e.g., 128 bits) within which VLIW operations are -encoded. In a conventional VLIW, this approach would tend to waste -instruction memory to hold NOPs, but a RISC-V-compatible -implementation would have to also support the base 32-bit -instructions, confining the VLIW code size expansion to -VLIW-accelerated functions. - -\subsection*{Encoded-Length Groups} - -Another approach is to use the standard length encoding from -Figure~\ref{instlengthcode} to encode parallel instruction groups, -allowing NOPs to be compressed out of the VLIW instruction. For -example, a 64-bit instruction could hold two 28-bit operations, while -a 96-bit instruction could hold three 28-bit operations, and so on. -Alternatively, a 48-bit instruction could hold one 42-bit operation, -while a 96-bit instruction could hold two 42-bit operations, and so -on. 
- -This approach has the advantage of retaining the base ISA encoding for -instructions holding a single operation, but has the disadvantage of -requiring a new 28-bit or 42-bit encoding for operations within the -VLIW instructions, and misaligned instruction fetch for larger groups. -One simplification is to not allow VLIW instructions to straddle -certain microarchitecturally significant boundaries (e.g., cache lines -or virtual memory pages). - -\subsection*{Fixed-Size Instruction Bundles} - -Another approach, similar to Itanium, is to use a larger naturally -aligned fixed instruction bundle size (e.g., 128 bits) across which -parallel operation groups are encoded. This simplifies instruction -fetch, but shifts the complexity to the group execution engine. To -remain RISC-V compatible, the base 32-bit instruction would still have -to be supported. - -\subsection*{End-of-Group bits in Prefix} - -None of the above approaches retains the RISC-V encoding for the -individual operations within a VLIW instruction. Yet another approach -is to repurpose the two prefix bits in the fixed-width 32-bit -encoding. One prefix bit can be used to signal ``end-of-group'' if -set, while the second bit could indicate execution under a predicate -if clear. Standard RISC-V 32-bit instructions generated by tools -unaware of the VLIW extension would have both prefix bits set (11) and -thus have the correct semantics, with each instruction at the end of a -group and not predicated. - -The main disadvantage of this approach is that the base ISAs lack the -complex predication support usually required in an aggressive VLIW -system, and it is difficult to add space to specify more predicate -registers in the standard 30-bit encoding space. 
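The end-of-group prefix scheme described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say which of the two prefix bits carries which meaning, so the sketch arbitrarily takes bit 0 as "end-of-group if set" and bit 1 as "predicated if clear". The function names are hypothetical.

```python
END_OF_GROUP = 0b01   # assumed: bit 0 set -> last operation of its group
UNPREDICATED = 0b10   # assumed: bit 1 clear -> executes under a predicate


def decode_prefix(word):
    """Return (end_of_group, predicated) for one 32-bit instruction word."""
    prefix = word & 0b11
    return bool(prefix & END_OF_GROUP), not (prefix & UNPREDICATED)


def split_groups(words):
    """Split a list of 32-bit words into parallel-issue groups."""
    groups, current = [], []
    for word in words:
        current.append(word)
        end_of_group, _ = decode_prefix(word)
        if end_of_group:
            groups.append(current)
            current = []
    if current:  # trailing operations with no end-of-group marker
        groups.append(current)
    return groups
```

Under these assumptions, a standard 32-bit instruction (prefix 11) always forms the end of a group and is not predicated, which matches the compatibility argument in the text.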
diff --git a/src/f.tex b/src/f.tex deleted file mode 100644 index 4e2f723..0000000 --- a/src/f.tex +++ /dev/null @@ -1,851 +0,0 @@ -\chapter{``F'' Standard Extension for Single-Precision Floating-Point, -Version 2.2} -\label{sec:single-float} - -This chapter describes the standard instruction-set extension for -single-precision floating-point, which is named ``F'' and adds -single-precision floating-point computational instructions compliant -with the IEEE 754-2008 arithmetic standard~\cite{ieee754-2008}. -The F extension depends on the ``Zicsr'' extension for control -and status register access. - -\section{F Register State} - -The F extension adds 32 floating-point registers, {\tt f0}--{\tt f31}, -each 32 bits wide, and a floating-point control and status register -{\tt fcsr}, which contains the operating mode and exception status of the -floating-point unit. This additional state is shown in -Figure~\ref{fprs}. We use the term FLEN to describe the width of the -floating-point registers in the RISC-V ISA, and FLEN=32 for the F -single-precision floating-point extension. Most floating-point -instructions operate on values in the floating-point register file. -Floating-point load and store instructions transfer floating-point -values between registers and memory. Instructions to transfer values -to and from the integer register file are also provided. 
- -\begin{figure}[htbp] -{\footnotesize -\begin{center} -\begin{tabular}{p{2in}} -\instbitrange{FLEN-1}{0} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ f0\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ f1\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ f2\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ f3\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ f4\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ f5\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ f6\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ f7\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ f8\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ \ f9\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f10\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f11\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f12\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f13\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f14\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f15\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f16\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f17\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f18\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f19\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f20\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f21\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f22\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f23\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f24\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f25\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f26\ \ \ \ \ }} \\ \cline{1-1} 
-\multicolumn{1}{|c|}{\reglabel{\ \ \ f27\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f28\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f29\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f30\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{\ \ \ f31\ \ \ \ \ }} \\ \cline{1-1} -\multicolumn{1}{c}{FLEN} \\ - -\instbitrange{31}{0} \\ \cline{1-1} -\multicolumn{1}{|c|}{\reglabel{fcsr}} \\ \cline{1-1} -\multicolumn{1}{c}{32} \\ -\end{tabular} -\end{center} -} -\caption{RISC-V standard F extension single-precision floating-point state.} -\label{fprs} -\end{figure} - -\begin{commentary} -We considered a unified register file for both integer and -floating-point values as this simplifies software register allocation -and calling conventions, and reduces total user state. However, a -split organization increases the total number of registers accessible -with a given instruction width, simplifies provision of enough regfile -ports for wide superscalar issue, supports decoupled -floating-point-unit architectures, and simplifies use of internal -floating-point encoding techniques. Compiler support and calling -conventions for split register file architectures are well understood, -and using dirty bits on floating-point register file state can reduce -context-switch overhead. -\end{commentary} - -\clearpage - -\section{Floating-Point Control and Status Register} - -The floating-point control and status register, {\tt fcsr}, is a RISC-V -control and status register (CSR). It is a 32-bit read/write register that -selects the dynamic rounding mode for floating-point arithmetic operations and -holds the accrued exception flags, as shown in Figure~\ref{fcsr}. 
- -\begin{figure*}[h] -{\footnotesize -\begin{center} -\begin{tabular}{K@{}E@{}ccccc} -\instbitrange{31}{8} & -\instbitrange{7}{5} & -\instbit{4} & -\instbit{3} & -\instbit{2} & -\instbit{1} & -\instbit{0} \\ -\hline -\multicolumn{1}{|c|}{{\em Reserved}} & -\multicolumn{1}{c|}{Rounding Mode ({\tt frm})} & -\multicolumn{5}{c|}{Accrued Exceptions ({\tt fflags})} \\ -\hline -\multicolumn{1}{c}{} & -\multicolumn{1}{c|}{} & -\multicolumn{1}{c|}{NV} & -\multicolumn{1}{c|}{DZ} & -\multicolumn{1}{c|}{OF} & -\multicolumn{1}{c|}{UF} & -\multicolumn{1}{c|}{NX} \\ -\cline{3-7} -24 & 3 & 1 & 1 & 1 & 1 & 1 \\ -\end{tabular} -\end{center} -} -\vspace{-0.1in} -\caption{Floating-point control and status register.} -\label{fcsr} -\end{figure*} - -The {\tt fcsr} register can be read and written with the FRCSR and -FSCSR instructions, which are assembler pseudoinstructions built on the -underlying CSR access instructions. FRCSR reads {\tt fcsr} by copying -it into integer register {\em rd}. FSCSR swaps the value in {\tt - fcsr} by copying the original value into integer register {\em rd}, -and then writing a new value obtained from integer register {\em rs1} -into {\tt fcsr}. - -The fields within the {\tt fcsr} can also be accessed individually -through different CSR addresses, and separate assembler pseudoinstructions are -defined for these accesses. The FRRM instruction reads the Rounding -Mode field {\tt frm} and copies it into the least-significant three -bits of integer register {\em rd}, with zero in all other bits. FSRM -swaps the value in {\tt frm} by copying the original value into -integer register {\em rd}, and then writing a new value obtained from -the three least-significant bits of integer register {\em rs1} into -{\tt frm}. FRFLAGS and FSFLAGS are defined analogously for the -Accrued Exception Flags field {\tt fflags}. - -Bits 31--8 of the {\tt fcsr} are reserved for other standard extensions. 
If -these extensions are not present, implementations shall ignore writes to -these bits and supply a zero value when read. Standard software should -preserve the contents of these bits. - -Floating-point operations use either a static rounding mode encoded in -the instruction, or a dynamic rounding mode held in {\tt frm}. -Rounding modes are encoded as shown in Table~\ref{rm}. A value of 111 -in the instruction's {\em rm} field selects the dynamic rounding mode -held in {\tt frm}. The behavior of floating-point instructions that -depend on rounding mode when executed with a reserved rounding mode is -{\em reserved}, including both static reserved rounding modes (101--110) and -dynamic reserved rounding modes (101--111). Some instructions, including -widening conversions, have the {\em rm} field but are nevertheless -mathematically unaffected by the rounding mode; software should set their -{\em rm} field to RNE (000) but implementations must treat the {\em rm} -field as usual (in particular, with regard to decoding legal vs. reserved -encodings). 
- -\begin{table}[htp] -\begin{small} -\begin{center} -\begin{tabular}{ccl} -\hline -\multicolumn{1}{|c|}{Rounding Mode} & -\multicolumn{1}{c|}{Mnemonic} & -\multicolumn{1}{c|}{Meaning} \\ -\hline -\multicolumn{1}{|c|}{000} & -\multicolumn{1}{l|}{RNE} & -\multicolumn{1}{l|}{Round to Nearest, ties to Even}\\ -\hline -\multicolumn{1}{|c|}{001} & -\multicolumn{1}{l|}{RTZ} & -\multicolumn{1}{l|}{Round towards Zero}\\ -\hline -\multicolumn{1}{|c|}{010} & -\multicolumn{1}{l|}{RDN} & -\multicolumn{1}{l|}{Round Down (towards $-\infty$)}\\ -\hline -\multicolumn{1}{|c|}{011} & -\multicolumn{1}{l|}{RUP} & -\multicolumn{1}{l|}{Round Up (towards $+\infty$)}\\ -\hline -\multicolumn{1}{|c|}{100} & -\multicolumn{1}{l|}{RMM} & -\multicolumn{1}{l|}{Round to Nearest, ties to Max Magnitude}\\ -\hline -\multicolumn{1}{|c|}{101} & -\multicolumn{1}{l|}{} & -\multicolumn{1}{l|}{\em Reserved for future use.}\\ -\hline -\multicolumn{1}{|c|}{110} & -\multicolumn{1}{l|}{} & -\multicolumn{1}{l|}{\em Reserved for future use.}\\ -\hline -\multicolumn{1}{|c|}{111} & -\multicolumn{1}{l|}{DYN} & -\multicolumn{1}{l|}{In instruction's {\em rm} field, selects dynamic rounding mode;}\\ -\multicolumn{1}{|c|}{} & -\multicolumn{1}{l|}{} & -\multicolumn{1}{l|}{In Rounding Mode register, {\em reserved}.}\\ -\hline -\end{tabular} -\end{center} -\end{small} -\caption{Rounding mode encoding.} -\label{rm} -\end{table} - -\begin{commentary} -The C99 language standard effectively mandates the provision of a -dynamic rounding mode register. In typical implementations, writes to -the dynamic rounding mode CSR state will serialize the pipeline. -Static rounding modes are used to implement specialized arithmetic -operations that often have to switch frequently between different -rounding modes. - -The ratified version of the F spec mandated that an illegal -instruction exception was raised when an instruction was executed with -a reserved dynamic rounding mode. 
This has been weakened to reserved, -which matches the behavior of static rounding-mode instructions. -Raising an illegal instruction exception is still valid behavior when -encountering a reserved encoding, so implementations compatible with -the ratified spec are compatible with the weakened spec. -\end{commentary} - -The accrued exception flags indicate the exception conditions that -have arisen on any floating-point arithmetic instruction since the -field was last reset by software, as shown in Table~\ref{bitdef}. -The base RISC-V ISA -does not support generating a trap on the setting of a floating-point -exception flag. - -\begin{table}[htp] -\begin{small} -\begin{center} -\begin{tabular}{cl} -\hline -\multicolumn{1}{|c|}{Flag Mnemonic} & -\multicolumn{1}{c|}{Flag Meaning} \\ -\hline -\multicolumn{1}{|c|}{NV} & -\multicolumn{1}{c|}{Invalid Operation}\\ -\hline -\multicolumn{1}{|c|}{DZ} & -\multicolumn{1}{c|}{Divide by Zero}\\ -\hline -\multicolumn{1}{|c|}{OF} & -\multicolumn{1}{c|}{Overflow}\\ -\hline -\multicolumn{1}{|c|}{UF} & -\multicolumn{1}{c|}{Underflow}\\ -\hline -\multicolumn{1}{|c|}{NX} & -\multicolumn{1}{c|}{Inexact}\\ -\hline -\end{tabular} -\end{center} -\end{small} -\caption{Accrued exception flag encoding.} -\label{bitdef} -\end{table} - -\begin{commentary} -As allowed by the standard, we do not support traps on floating-point -exceptions in the F extension, but instead require explicit checks of the flags -in software. We considered adding branches controlled directly by the -contents of the floating-point accrued exception flags, but ultimately chose -to omit these instructions to keep the ISA simple. -\end{commentary} - -\section{NaN Generation and Propagation} - -Except when otherwise stated, if the result of a floating-point operation is -NaN, it is the canonical NaN. The canonical NaN has a positive sign and all -significand bits clear except the MSB, a.k.a. the quiet bit. 
For -single-precision floating-point, this corresponds to the pattern {\tt -0x7fc00000}. - -\begin{commentary} -We considered propagating NaN payloads, as is recommended by the standard, -but this decision would have increased hardware cost. Moreover, since this -feature is optional in the standard, it cannot be used in portable code. - -Implementors are free to provide a NaN payload propagation scheme as -a non-standard extension enabled by a non-standard operating mode. However, the -canonical NaN scheme described above must always be supported and should be -the default mode. -\end{commentary} - -\begin{commentary} -We require implementations to return the standard-mandated default -values in the case of exceptional conditions, without any further -intervention on the part of user-level software (unlike the Alpha ISA -floating-point trap barriers). We believe full hardware handling of -exceptional cases will become more common, and so wish to avoid -complicating the user-level ISA to optimize other approaches. -Implementations can always trap to machine-mode software handlers to -provide exceptional default values. -\end{commentary} - -\section{Subnormal Arithmetic} - -Operations on subnormal numbers are handled in accordance with the IEEE -754-2008 standard. - -In the parlance of the IEEE standard, tininess is detected after rounding. - -\begin{commentary} -Detecting tininess after rounding results in fewer spurious underflow signals. -\end{commentary} - -\section{Single-Precision Load and Store Instructions} - -Floating-point loads and stores use the same base+offset addressing -mode as the integer base ISAs, with a base address in register {\em - rs1} and a 12-bit signed byte offset. The FLW instruction loads a -single-precision floating-point value from memory into floating-point -register {\em rd}. FSW stores a single-precision value from -floating-point register {\em rs2} to memory. 
- -\vspace{-0.2in} -\begin{center} -\begin{tabular}{M@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[11:0]} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{width} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -12 & 5 & 3 & 5 & 7 \\ -offset[11:0] & base & W & dest & LOAD-FP \\ -\end{tabular} -\end{center} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{O@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{imm[11:5]} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{width} & -\multicolumn{1}{c|}{imm[4:0]} & -\multicolumn{1}{c|}{opcode} \\ -\hline -7 & 5 & 5 & 3 & 5 & 7 \\ -offset[11:5] & src & base & W & offset[4:0] & STORE-FP \\ -\end{tabular} -\end{center} - -FLW and FSW are only guaranteed to execute atomically if the effective address -is naturally aligned. - -FLW and FSW do not modify the bits being transferred; in particular, the -payloads of non-canonical NaNs are preserved. - -As described in Section~\ref{sec:rv32:ldst}, the EEI defines whether -misaligned floating-point loads and stores are handled invisibly or raise -a contained or fatal trap. - -\section{Single-Precision Floating-Point Computational Instructions} -\label{sec:single-float-compute} - -Floating-point arithmetic instructions with one or two source operands use the -R-type format with the OP-FP major opcode. FADD.S and FMUL.S perform -single-precision floating-point addition and multiplication respectively, -between {\em rs1} and {\em rs2}. FSUB.S performs the single-precision -floating-point subtraction of {\em rs2} from {\em rs1}. FDIV.S performs the -single-precision floating-point division of {\em rs1} by {\em rs2}. 
FSQRT.S -computes the square root of {\em rs1}. In each case, the result is written to -{\em rd}. - -The 2-bit floating-point format field {\em fmt} is encoded as shown in -Table~\ref{tab:fmt}. It is set to {\em S} (00) for all instructions in -the F extension. - -\begin{table}[htp] -\begin{small} -\begin{center} -\begin{tabular}{|c|c|l|} -\hline -{\em fmt} field & -Mnemonic & -Meaning \\ -\hline -00 & S & 32-bit single-precision \\ -01 & D & 64-bit double-precision \\ -10 & H & 16-bit half-precision \\ -11 & Q & 128-bit quad-precision \\ -\hline -\end{tabular} -\end{center} -\end{small} -\caption{Format field encoding.} -\label{tab:fmt} -\end{table} - -All floating-point operations that perform rounding can select the -rounding mode using the {\em rm} field with the encoding shown in -Table~\ref{rm}. - -Floating-point minimum-number and maximum-number instructions FMIN.S and -FMAX.S write, respectively, the smaller or larger of {\em rs1} and {\em rs2} -to {\em rd}. For the purposes of these instructions only, the value $-0.0$ is -considered to be less than the value $+0.0$. If both inputs are NaNs, the -result is the canonical NaN. If only one operand is a NaN, the result is the -non-NaN operand. Signaling NaN inputs set the invalid operation exception flag, -even when the result is not NaN. - -\begin{commentary} -Note that in version 2.2 of the F extension, the FMIN.S and FMAX.S -instructions were amended to implement the proposed IEEE 754-201x -minimumNumber and maximumNumber operations, rather than the IEEE 754-2008 -minNum and maxNum operations. These operations differ in their handling of -signaling NaNs. 
-\end{commentary} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FADD/FSUB & S & src2 & src1 & RM & dest & OP-FP \\ -FMUL/FDIV & S & src2 & src1 & RM & dest & OP-FP \\ -FSQRT & S & 0 & src & RM & dest & OP-FP \\ -FMIN-MAX & S & src2 & src1 & MIN/MAX & dest & OP-FP \\ -\end{tabular} -\end{center} - -Floating-point fused multiply-add instructions require a new standard -instruction format. R4-type instructions specify three source -registers ({\em rs1}, {\em rs2}, and {\em rs3}) and a destination -register ({\em rd}). This format is only used by the floating-point -fused multiply-add instructions. - -FMADD.S multiplies the values in {\em -rs1} and {\em rs2}, adds the value in {\em rs3}, and writes the final -result to {\em rd}. FMADD.S computes {\em (rs1$\times$rs2)+rs3}. - -FMSUB.S multiplies the values in {\em rs1} and {\em rs2}, subtracts -the value in {\em rs3}, and writes the final result to {\em rd}. -FMSUB.S computes {\em (rs1$\times$rs2)-rs3}. - -FNMSUB.S multiplies the -values in {\em rs1} and {\em rs2}, negates the product, adds the value -in {\em rs3}, and writes the final result to {\em rd}. FNMSUB.S -computes {\em -(rs1$\times$rs2)+rs3}. - -FNMADD.S multiplies the values -in {\em rs1} and {\em rs2}, negates the product, subtracts the value -in {\em rs3}, and writes the final result to {\em rd}. FNMADD.S -computes {\em -(rs1$\times$rs2)-rs3}. - -\begin{commentary} -The FNMSUB and FNMADD instructions are counterintuitively named, owing to the -naming of the corresponding instructions in MIPS-IV. 
The MIPS instructions -were defined to negate the sum, rather than negating the product as the -RISC-V instructions do, so the naming scheme was more rational at the time. -The two definitions differ with respect to signed-zero results. The RISC-V -definition matches the behavior of the x86 and ARM fused multiply-add -instructions, but unfortunately the RISC-V FNMSUB and FNMADD instruction -names are swapped compared to x86 and ARM. -\end{commentary} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{rs3} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -src3 & S & src2 & src1 & RM & dest & F[N]MADD/F[N]MSUB \\ -\end{tabular} -\end{center} - -\begin{commentary} - The fused multiply-add (FMA) instructions consume a large part of the - 32-bit instruction encoding space. Some alternatives considered were - to restrict FMA to only use dynamic rounding modes, but static - rounding modes are useful in code that exploits the lack of product - rounding. Another alternative would have been to use rd to provide - rs3, but this would require additional move instructions in some - common sequences. The current design still leaves a large portion of - the 32-bit encoding space open while avoiding having FMA be - non-orthogonal. -\end{commentary} - -The fused multiply-add instructions must set the invalid operation exception flag -when the multiplicands are $\infty$ and zero, even when the addend is a quiet -NaN. -\begin{commentary} -The IEEE 754-2008 standard permits, but does not require, raising the -invalid exception for the operation \mbox{$\infty\times 0\ +$ qNaN}. 
-\end{commentary} - -\section{Single-Precision Floating-Point Conversion and Move \mbox{Instructions}} - -Floating-point-to-integer and integer-to-floating-point conversion -instructions are encoded in the OP-FP major opcode space. -FCVT.W.S or FCVT.L.S converts a floating-point number -in floating-point register {\em rs1} to a signed 32-bit or 64-bit -integer, respectively, in integer register {\em rd}. FCVT.S.W -or FCVT.S.L converts a 32-bit or 64-bit signed integer, -respectively, in integer register {\em rs1} into a floating-point -number in floating-point register {\em rd}. FCVT.WU.S, -FCVT.LU.S, FCVT.S.WU, and FCVT.S.LU variants -convert to or from unsigned integer values. -For XLEN$>32$, FCVT.W[U].S sign-extends the 32-bit result to the -destination register width. -FCVT.L[U].S and FCVT.S.L[U] are RV64-only instructions. -If the rounded result is not representable in the destination format, -it is clipped to the nearest value and the invalid flag is set. -Table~\ref{tab:int_conv} gives the range of valid inputs for FCVT.{\em int}.S -and the behavior for invalid inputs. 
- -\begin{table}[htp] -\begin{small} -\begin{center} -\begin{tabular}{|l|r|r|r|r|} -\hline - & FCVT.W.S & FCVT.WU.S & FCVT.L.S & FCVT.LU.S \\ -\hline -Minimum valid input (after rounding) & $-2^{31}$ & 0 & $-2^{63}$ & 0 \\ -Maximum valid input (after rounding) & $2^{31}-1$ & $2^{32}-1$ & $2^{63}-1$ & $2^{64}-1$ \\ -\hline -Output for out-of-range negative input & $-2^{31}$ & 0 & $-2^{63}$ & 0 \\ -Output for $-\infty$ & $-2^{31}$ & 0 & $-2^{63}$ & 0 \\ -Output for out-of-range positive input & $2^{31}-1$ & $2^{32}-1$ & $2^{63}-1$ & $2^{64}-1$ \\ -Output for $+\infty$ or NaN & $2^{31}-1$ & $2^{32}-1$ & $2^{63}-1$ & $2^{64}-1$ \\ -\hline -\end{tabular} -\end{center} -\end{small} -\caption{Domains of float-to-integer conversions and behavior for invalid inputs.} -\label{tab:int_conv} -\end{table} - -All floating-point to integer and integer to floating-point conversion -instructions round according to the {\em rm} field. A floating-point register -can be initialized to floating-point positive zero using FCVT.S.W {\em rd}, -{\tt x0}, which will never set any exception flags. - -All floating-point conversion instructions set the Inexact exception flag if -the rounded result differs from the operand value and the Invalid exception -flag is not set. 
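The clipping and flag behavior of the float-to-integer conversions can be illustrated with a small Python model. It is a sketch under assumptions: a Python float stands in for the single-precision operand, the function names are hypothetical, and only the NV and NX flags are modeled.

```python
import math

NX, NV = 0x01, 0x10                      # inexact and invalid flag bits
RNE, RTZ, RDN, RUP, RMM = range(5)       # rounding-mode encodings 000..100


def round_to_int(x, rm):
    """Round a finite float to an integer under rounding mode rm."""
    if rm == RNE:
        return round(x)                  # Python rounds ties to even
    if rm == RTZ:
        return math.trunc(x)
    if rm == RDN:
        return math.floor(x)
    if rm == RUP:
        return math.ceil(x)
    if rm == RMM:                        # ties away from zero
        return int(math.copysign(math.floor(abs(x) + 0.5), x))
    raise ValueError("reserved rounding mode")


def fcvt_int(x, rm=RNE, bits=32, signed=True):
    """Model FCVT.{W,WU,L,LU}.S; return (integer result, flags)."""
    lo = -(1 << (bits - 1)) if signed else 0
    hi = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    if math.isnan(x) or x == math.inf:
        return hi, NV                    # NaN and +inf clip to the maximum
    if x == -math.inf:
        return lo, NV
    r = round_to_int(x, rm)
    if r < lo:
        return lo, NV                    # out of range after rounding
    if r > hi:
        return hi, NV
    return r, (NX if r != x else 0)      # inexact only when not invalid
```

Note that the range check is applied after rounding, so an unsigned conversion of -0.5 under RTZ yields 0 with only the inexact flag set, while -1.0 is clipped to 0 with the invalid flag set.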
- -\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FCVT.{\em int}.{\em fmt} & S & W[U]/L[U] & src & RM & dest & OP-FP \\ -FCVT.{\em fmt}.{\em int} & S & W[U]/L[U] & src & RM & dest & OP-FP \\ -\end{tabular} -\end{center} - -Floating-point to floating-point sign-injection instructions, FSGNJ.S, -FSGNJN.S, and FSGNJX.S, produce a result that takes all bits except -the sign bit from {\em rs1}. For FSGNJ, the result's sign bit is {\em - rs2}'s sign bit; for FSGNJN, the result's sign bit is the opposite -of {\em rs2}'s sign bit; and for FSGNJX, the sign bit is the XOR of -the sign bits of {\em rs1} and {\em rs2}. Sign-injection instructions -do not set floating-point exception flags, nor do they canonicalize -NaNs. Note, FSGNJ.S {\em rx, ry, - ry} moves {\em ry} to {\em rx} (assembler pseudoinstruction FMV.S {\em rx, - ry}); FSGNJN.S {\em rx, ry, ry} moves the negation of {\em ry} to -{\em rx} (assembler pseudoinstruction FNEG.S {\em rx, ry}); and FSGNJX.S {\em rx, - ry, ry} moves the absolute value of {\em ry} to {\em rx} (assembler -pseudoinstruction FABS.S {\em rx, ry}). 
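The sign-injection semantics described above reduce to simple bit operations on the 32-bit patterns. The following Python sketch models them on raw integers; the function names mirror the instruction mnemonics but are otherwise hypothetical helpers for illustration.

```python
SIGN = 0x80000000          # sign bit of a 32-bit single-precision pattern
MASK = 0xFFFFFFFF          # keep results within 32 bits


def fsgnj(rs1, rs2):
    """All bits except the sign from rs1, sign bit from rs2."""
    return (rs1 & ~SIGN & MASK) | (rs2 & SIGN)


def fsgnjn(rs1, rs2):
    """All bits except the sign from rs1, inverted sign bit of rs2."""
    return (rs1 & ~SIGN & MASK) | (~rs2 & SIGN)


def fsgnjx(rs1, rs2):
    """Sign bit is the XOR of the sign bits of rs1 and rs2."""
    return (rs1 ^ (rs2 & SIGN)) & MASK


def fmv_s(ry):
    return fsgnj(ry, ry)    # move


def fneg_s(ry):
    return fsgnjn(ry, ry)   # negate


def fabs_s(ry):
    return fsgnjx(ry, ry)   # absolute value
```

Because only the sign bit is touched, NaN payloads pass through unchanged, consistent with the statement that these instructions do not canonicalize NaNs.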
- -\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FSGNJ & S & src2 & src1 & J[N]/JX & dest & OP-FP \\ -\end{tabular} -\end{center} - -\begin{commentary} -The sign-injection instructions -provide floating-point MV, ABS, and NEG, -as well as supporting a few other operations, including the IEEE copySign -operation and sign manipulation in transcendental math function -libraries. Although MV, ABS, and NEG only need a single register -operand, whereas FSGNJ instructions need two, it is unlikely most -microarchitectures would add optimizations to benefit from the reduced -number of register reads for these relatively infrequent instructions. -Even in this case, a microarchitecture can simply detect when both -source registers are the same for FSGNJ instructions and only read a -single copy. -\end{commentary} - -Instructions are provided to move bit patterns between the -floating-point and integer registers. FMV.X.W moves the -single-precision value in floating-point register {\em rs1} -represented in IEEE 754-2008 encoding to the lower 32 bits of integer -register {\em rd}. The bits are not -modified in the transfer, and in particular, the payloads of -non-canonical NaNs are preserved. -For RV64, the higher 32 bits of the destination -register are filled with copies of the floating-point number's sign -bit. - -FMV.W.X moves the single-precision value encoded in IEEE -754-2008 standard encoding from the lower 32 bits of integer register -{\em rs1} to the floating-point register {\em rd}. 
The bits are not -modified in the transfer, and in particular, the payloads of -non-canonical NaNs are preserved. - -\begin{commentary} -The FMV.W.X and FMV.X.W instructions were previously called FMV.S.X -and FMV.X.S. The use of W is more consistent with their semantics as -an instruction that moves 32 bits without interpreting them. This -became clearer after defining NaN-boxing. To avoid disturbing -existing code, both the W and S versions will be supported by tools. -\end{commentary} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{R@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FMV.X.W & S & 0 & src & 000 & dest & OP-FP \\ -FMV.W.X & S & 0 & src & 000 & dest & OP-FP \\ -\end{tabular} -\end{center} - -\begin{commentary} -The base floating-point ISA was defined so as to allow implementations -to employ an internal recoding of the floating-point format in -registers to simplify handling of subnormal values and possibly to -reduce functional unit latency. To this end, the F extension avoids -representing integer values in the floating-point registers by -defining conversion and comparison operations that read and write the -integer register file directly. This also removes many of the common -cases where explicit moves between integer and floating-point -registers are required, reducing instruction count and critical paths -for common mixed-format code sequences. 
-\end{commentary} - -\section{Single-Precision Floating-Point Compare Instructions} - -Floating-point compare instructions (FEQ.S, FLT.S, FLE.S) perform the -specified comparison between floating-point registers ($\mbox{\em rs1} -= \mbox{\em rs2}$, $\mbox{\em rs1} < \mbox{\em rs2}$, $\mbox{\em rs1} \leq -\mbox{\em rs2}$) writing 1 to the integer register {\em rd} if the condition -holds, and 0 otherwise. - -FLT.S and FLE.S perform what the IEEE 754-2008 standard refers to as {\em -signaling} comparisons: that is, they set the invalid operation exception flag -if either input is NaN. FEQ.S performs a {\em quiet} comparison: it only -sets the invalid operation exception flag if either input is a signaling NaN. -For all three instructions, -the result is 0 if either operand is NaN. - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{S@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FCMP & S & src2 & src1 & EQ/LT/LE & dest & OP-FP \\ -\end{tabular} -\end{center} - -\begin{commentary} -The F extension provides a $\leq$ comparison, whereas the base ISAs provide -a $\geq$ branch comparison. Because $\leq$ can be synthesized from $\geq$ and -vice-versa, there is no performance implication to this inconsistency, but it -is nevertheless an unfortunate incongruity in the ISA. -\end{commentary} - -\section{Single-Precision Floating-Point Classify Instruction} - -The FCLASS.S instruction examines the value in floating-point register {\em -rs1} and writes to integer register {\em rd} a 10-bit mask that indicates -the class of the floating-point number. 
The format of the mask is -described in Table~\ref{tab:fclass}. The corresponding bit in {\em rd} will -be set if the property is true and clear otherwise. All other bits in -{\em rd} are cleared. Note that exactly one bit in {\em rd} will be set. -FCLASS.S does not set the floating-point exception flags. - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{S@{}F@{}R@{}R@{}F@{}R@{}O} -\\ -\instbitrange{31}{27} & -\instbitrange{26}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct5} & -\multicolumn{1}{c|}{fmt} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{rm} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -5 & 2 & 5 & 5 & 3 & 5 & 7 \\ -FCLASS & S & 0 & src & 001 & dest & OP-FP \\ -\end{tabular} -\end{center} - -\begin{table}[htp] -\begin{small} -\begin{center} -\begin{tabular}{|c|l|} -\hline -{\em rd} bit & -Meaning \\ -\hline -0 & {\em rs1} is $-\infty$. \\ -1 & {\em rs1} is a negative normal number. \\ -2 & {\em rs1} is a negative subnormal number. \\ -3 & {\em rs1} is $-0$. \\ -4 & {\em rs1} is $+0$. \\ -5 & {\em rs1} is a positive subnormal number. \\ -6 & {\em rs1} is a positive normal number. \\ -7 & {\em rs1} is $+\infty$. \\ -8 & {\em rs1} is a signaling NaN. \\ -9 & {\em rs1} is a quiet NaN. 
\\ -\hline -\end{tabular} -\end{center} -\end{small} -\caption{Format of result of FCLASS instruction.} -\label{tab:fclass} -\end{table} diff --git a/src/history.tex b/src/history.tex deleted file mode 100644 index 0e6e816..0000000 --- a/src/history.tex +++ /dev/null @@ -1,403 +0,0 @@ -\chapter{History and Acknowledgments} -\label{history} - -\section{``Why Develop a new ISA?'' Rationale from Berkeley Group} - -We developed RISC-V to support our own needs in research and -education, where our group is particularly interested in actual -hardware implementations of research ideas (we have completed eleven -different silicon fabrications of RISC-V since the first edition of -this specification), and in providing real implementations for -students to explore in classes (RISC-V processor RTL designs have been -used in multiple undergraduate and graduate classes at Berkeley). In -our current research, we are especially interested in the move towards -specialized and heterogeneous accelerators, driven by the power -constraints imposed by the end of conventional transistor scaling. We -wanted a highly flexible and extensible base ISA around which to build -our research effort. - -A question we have been repeatedly asked is ``Why develop a new ISA?'' -The biggest obvious benefit of using an existing commercial ISA is the -large and widely supported software ecosystem, both development tools -and ported applications, which can be leveraged in research and -teaching. Other benefits include the existence of large amounts of -documentation and tutorial examples. 
However, our experience of using -commercial instruction sets for research and teaching is that these -benefits are smaller in practice, and do not outweigh the -disadvantages: - -\begin{itemize} -\item {\bf Commercial ISAs are proprietary.} Except for SPARC V8, - which is an open IEEE standard~\cite{sparcieee1994}, most owners of - commercial ISAs carefully guard their intellectual property and do - not welcome freely available competitive implementations. This is - much less of an issue for academic research and teaching using only - software simulators, but has been a major concern for groups wishing - to share actual RTL implementations. It is also a major concern for - entities who do not want to trust the few sources of commercial ISA - implementations, but who are prohibited from creating their own - clean room implementations. We cannot guarantee that all RISC-V - implementations will be free of third-party patent infringements, - but we can guarantee we will not attempt to sue a RISC-V - implementor. - -\item {\bf Commercial ISAs are only popular in certain market - domains.} The most obvious examples at time of writing are that - the ARM architecture is not well supported in the server space, and - the Intel x86 architecture (or for that matter, almost every other - architecture) is not well supported in the mobile space, though both - Intel and ARM are attempting to enter each other's market segments. - Another example is ARC and Tensilica, which provide extensible cores - but are focused on the embedded space. This market segmentation - dilutes the benefit of supporting a particular commercial ISA as in - practice the software ecosystem only exists for certain domains, and - has to be built for others. - -\item {\bf Commercial ISAs come and go.} Previous research - infrastructures have been built around commercial ISAs that are no - longer popular (SPARC, MIPS) or even no longer in production - (Alpha). 
These lose the benefit of an active software ecosystem, - and the lingering intellectual property issues around the ISA and - supporting tools interfere with the ability of interested third - parties to continue supporting the ISA. An open ISA might also lose - popularity, but any interested party can continue using and - developing the ecosystem. - -\item {\bf Popular commercial ISAs are complex.} The dominant - commercial ISAs (x86 and ARM) are both very complex to implement in - hardware to the level of supporting common software stacks and - operating systems. Worse, nearly all the complexity is due to bad, - or at least outdated, ISA design decisions rather than features that - truly improve efficiency. - -\item {\bf Commercial ISAs alone are not enough to bring up - applications.} Even if we expend the effort to implement a - commercial ISA, this is not enough to run existing applications for - that ISA. Most applications need a complete ABI (application binary - interface) to run, not just the user-level ISA. Most ABIs rely on - libraries, which in turn rely on operating system support. To run an - existing operating system requires implementing the supervisor-level - ISA and device interfaces expected by the OS. These are usually - much less well-specified and considerably more complex to - implement than the user-level ISA. - -\item {\bf Popular commercial ISAs were not designed for extensibility.} The - dominant commercial ISAs were not particularly designed for - extensibility, and as a consequence have added considerable - instruction encoding complexity as their instruction sets have - grown. Companies such as Tensilica (acquired by Cadence) and ARC - (acquired by Synopsys) have built ISAs and toolchains around - extensibility, but have focused on embedded applications rather than - general-purpose computing systems. 
- -\item {\bf A modified commercial ISA is a new ISA.} One of our main - goals is to support architecture research, including major ISA - extensions. Even small extensions diminish the benefit of using a - standard ISA, as compilers have to be modified and applications - rebuilt from source code to use the extension. Larger extensions - that introduce new architectural state also require modifications to - the operating system. Ultimately, the modified commercial ISA - becomes a new ISA, but carries along all the legacy baggage of the - base ISA. -\end{itemize} - -Our position is that the ISA is perhaps the most important interface -in a computing system, and there is no reason that such an important -interface should be proprietary. The dominant commercial ISAs are -based on instruction-set concepts that were already well known over 30 -years ago. Software developers should be able to target an open -standard hardware target, and commercial processor designers should -compete on implementation quality. - -We are far from the first to contemplate an open ISA design suitable -for hardware implementation. We also considered other existing open -ISA designs, of which the closest to our goals was the OpenRISC -architecture~\cite{openriscarch}. We decided against adopting the -OpenRISC ISA for several technical reasons: - -\begin{itemize} -\item OpenRISC has condition codes and branch delay slots, which - complicate higher performance implementations. -\item OpenRISC uses a fixed 32-bit encoding and 16-bit immediates, - which precludes a denser instruction encoding and limits space for - later expansion of the ISA. -\item OpenRISC does not support the 2008 revision to the IEEE 754 - floating-point standard. -\item The OpenRISC 64-bit design had not been completed when we began. -\end{itemize} - -By starting from a clean slate, we could design an ISA that met all of -our goals, though of course, this took far more effort than we had -planned at the outset. 
We have now invested considerable effort in -building up the RISC-V ISA infrastructure, including documentation, -compiler tool chains, operating system ports, reference ISA -simulators, FPGA implementations, efficient ASIC implementations, -architecture test suites, and teaching materials. Since the last -edition of this manual, there has been considerable uptake of the -RISC-V ISA in both academia and industry, and we have created the -non-profit RISC-V Foundation to protect and promote the standard. The -RISC-V Foundation website at \url{https://riscv.org} contains the latest -information on the Foundation membership and various open-source -projects using RISC-V. - - -\section{History from Revision 1.0 of ISA manual} - -The RISC-V ISA and instruction-set manual builds upon several earlier -projects. Several aspects of the supervisor-level machine and the -overall format of the manual date back to the T0 (Torrent-0) vector -microprocessor project at UC Berkeley and ICSI, begun in 1992. T0 was -a vector processor based on the MIPS-II ISA, with Krste Asanovi\'{c} -as main architect and RTL designer, and Brian Kingsbury and Bertrand -Irrisou as principal VLSI implementors. David Johnson at ICSI was a -major contributor to the T0 ISA design, particularly supervisor mode, -and to the manual text. John Hauser also provided considerable -feedback on the T0 ISA design. - -The Scale (Software-Controlled Architecture for Low Energy) project at -MIT, begun in 2000, built upon the T0 project infrastructure, refined -the supervisor-level interface, and moved away from the MIPS scalar -ISA by dropping the branch delay slot. Ronny Krashinsky and -Christopher Batten were the principal architects of the Scale -Vector-Thread processor at MIT, while Mark Hampton ported the -GCC-based compiler infrastructure and tools for Scale. 
- -A lightly edited version of the T0 MIPS scalar processor specification -(MIPS-6371) was used in teaching a new version of the MIT 6.371 -Introduction to VLSI Systems class in the Fall 2002 semester, with -Chris Terman and Krste Asanovi\'{c} as lecturers. Chris Terman -contributed most of the lab material for the class (there was no -TA!). The 6.371 class evolved into the trial 6.884 Complex Digital -Design class at MIT, taught by Arvind and Krste Asanovi\'{c} in Spring -2005, which became a regular Spring class 6.375. A reduced version of -the Scale MIPS-based scalar ISA, named SMIPS, was used in 6.884/6.375. -Christopher Batten was the TA for the early offerings of these classes -and developed a considerable amount of documentation and lab material -based around the SMIPS ISA. This same SMIPS lab material was adapted -and enhanced by TA Yunsup Lee for the UC Berkeley Fall 2009 CS250 VLSI -Systems Design class taught by John Wawrzynek, Krste Asanovi\'{c}, and -John Lazzaro. - -The Maven (Malleable Array of Vector-thread ENgines) project was a -second-generation vector-thread architecture. Its design was led by -Christopher Batten when he was an Exchange Scholar at UC Berkeley starting -in summer 2007. Hidetaka Aoki, a visiting industrial fellow from -Hitachi, gave considerable feedback on the early Maven ISA and -microarchitecture design. The Maven infrastructure was based on the -Scale infrastructure but the Maven ISA moved further away from the -MIPS ISA variant defined in Scale, with a unified floating-point and -integer register file. Maven was designed to support experimentation -with alternative data-parallel accelerators. Yunsup Lee was the main -implementor of the various Maven vector units, while Rimas Avi\v{z}ienis -was the main implementor of the various Maven scalar units. -Yunsup Lee and Christopher Batten ported GCC to work with the new -Maven ISA. 
Christopher Celio provided the initial definition of a -traditional vector instruction set (``Flood'') variant of Maven. - -Based on experience with all these previous projects, the RISC-V ISA -definition was begun in Summer 2010, with Andrew Waterman, Yunsup Lee, -Krste Asanovi\'{c}, and David Patterson as principal designers. -An initial version of the RISC-V -32-bit instruction subset was used in the UC Berkeley Fall 2010 CS250 -VLSI Systems Design class, with Yunsup Lee as TA. RISC-V is a clean -break from the earlier MIPS-inspired designs. John Hauser contributed -to the floating-point ISA definition, including the sign-injection -instructions and a register encoding scheme that permits -internal recoding of floating-point values. - -\section{History from Revision 2.0 of ISA manual} - -Multiple implementations of RISC-V processors have been completed, -including several silicon fabrications, as shown in -Table~\ref{silicon}. - -\begin{table*}[!h] -\begin{center} -\begin{tabular}{|l|r|l|l|} -\hline -\multicolumn{1}{|c|}{Name} & \multicolumn{1}{|c|}{Tapeout Date} & \multicolumn{1}{|c|}{Process} & \multicolumn{1}{|c|}{ISA} \\ \hline -\hline -Raven-1 & May 29, 2011 & ST 28nm FDSOI & RV64G1\_Xhwacha1 \\ \hline -EOS14 & April 1, 2012 & IBM 45nm SOI & RV64G1p1\_Xhwacha2 \\ \hline -EOS16 & August 17, 2012 & IBM 45nm SOI & RV64G1p1\_Xhwacha2 \\ \hline -Raven-2 & August 22, 2012 & ST 28nm FDSOI & RV64G1p1\_Xhwacha2 \\ \hline -EOS18 & February 6, 2013 & IBM 45nm SOI & RV64G1p1\_Xhwacha2 \\ \hline -EOS20 & July 3, 2013 & IBM 45nm SOI & RV64G1p99\_Xhwacha2 \\ \hline -Raven-3 & September 26, 2013 & ST 28nm SOI & RV64G1p99\_Xhwacha2 \\ \hline -EOS22 & March 7, 2014 & IBM 45nm SOI & RV64G1p9999\_Xhwacha3 \\ \hline -\end{tabular} -\end{center} -\vspace{-0.15in} -\caption{Fabricated RISC-V testchips.} -\label{silicon} -\end{table*} - -The first RISC-V processors to be fabricated were written in Verilog and -manufactured in a pre-production \wunits{28}{nm} FDSOI technology from 
-ST as the Raven-1 testchip in 2011. Two cores were developed by Yunsup -Lee and Andrew Waterman, advised by Krste Asanovi\'{c}, and fabricated -together: 1) an RV64 scalar core with error-detecting flip-flops, and 2) -an RV64 core with an attached 64-bit floating-point vector unit. The -first microarchitecture was informally known as ``TrainWreck'', due to -the short time available to complete the design with immature design -libraries. - -Subsequently, a clean microarchitecture for an in-order decoupled RV64 -core was developed by Andrew Waterman, Rimas Avi\v{z}ienis, and Yunsup -Lee, advised by Krste Asanovi\'{c}, and, continuing the railway theme, -was codenamed ``Rocket'' after George Stephenson's successful steam -locomotive design. Rocket was written in Chisel, a new hardware -design language developed at UC Berkeley. The IEEE floating-point -units used in Rocket were developed by John Hauser, Andrew -Waterman, and Brian Richards. -Rocket has since been refined and developed further, and has been -fabricated two more times in \wunits{28}{nm} FDSOI (Raven-2, Raven-3), -and five times in IBM \wunits{45}{nm} SOI technology (EOS14, EOS16, -EOS18, EOS20, EOS22) for a photonics project. Work is ongoing to make -the Rocket design available as a parameterized RISC-V processor -generator. - -EOS14--EOS22 chips include early versions of Hwacha, a 64-bit IEEE -floating-point vector unit, developed by Yunsup Lee, Andrew Waterman, -Huy Vo, Albert Ou, Quan Nguyen, and Stephen Twigg, advised by Krste -Asanovi\'{c}. EOS16--EOS22 chips include dual cores with a -cache-coherence protocol developed by Henry Cook and Andrew Waterman, -advised by Krste Asanovi\'{c}. EOS14 silicon has successfully run at -\wunits{1.25}{GHz}. EOS16 silicon suffered from a bug in the IBM pad -libraries. EOS18 and EOS20 have successfully run at \wunits{1.35}{GHz}. 
- -Contributors to the Raven testchips include Yunsup Lee, Andrew Waterman, -Rimas Avi\v{z}ienis, Brian Zimmer, Jaehwa Kwak, Ruzica Jevti\'{c}, -Milovan Blagojevi\'{c}, Alberto Puggelli, Steven Bailey, Ben Keller, -Pi-Feng Chiu, Brian Richards, Borivoje Nikoli\'{c}, and Krste -Asanovi\'{c}. - -Contributors to the EOS testchips include Yunsup Lee, Rimas -Avi\v{z}ienis, Andrew Waterman, Henry Cook, Huy Vo, Daiwei Li, Chen Sun, -Albert Ou, Quan Nguyen, Stephen Twigg, Vladimir Stojanovi\'{c}, and -Krste Asanovi\'{c}. - -Andrew Waterman and Yunsup Lee developed the C++ ISA simulator -``Spike'', used as a golden model in development and named after the -golden spike used to celebrate completion of the US transcontinental -railway. Spike has been made available as a BSD open-source project. - -Andrew Waterman completed a Master's thesis with a preliminary design -of the RISC-V compressed instruction set~\cite{waterman-ms}. - -Various FPGA implementations of the RISC-V have been completed, -primarily as part of integrated demos for the Par Lab project research -retreats. The largest FPGA design has 3 cache-coherent RV64IMA -processors running a research operating system. Contributors to the -FPGA implementations include Andrew Waterman, Yunsup Lee, Rimas -Avi\v{z}ienis, and Krste Asanovi\'{c}. - -RISC-V processors have been used in several classes at UC Berkeley. -Rocket was used in the Fall 2011 offering of CS250 as a basis for class -projects, with Brian Zimmer as TA. For the undergraduate CS152 class in -Spring 2012, Christopher Celio used Chisel to write a suite of educational -RV32 processors, named ``Sodor'' after the island on which ``Thomas the -Tank Engine'' and friends live. The suite includes a microcoded core, -an unpipelined core, and 2, 3, and 5-stage pipelined cores, and is -publicly available under a BSD license. The suite was subsequently -updated and used again in CS152 in Spring 2013, with Yunsup Lee as TA, -and in Spring 2014, with Eric Love as TA. 
-Christopher Celio also developed an out-of-order RV64 design known as BOOM -(Berkeley Out-of-Order Machine), with accompanying pipeline -visualizations, that was used in the CS152 classes. The CS152 classes -also used cache-coherent versions of the Rocket core developed by Andrew -Waterman and Henry Cook. - -Over the summer of 2013, the RoCC (Rocket Custom Coprocessor) -interface was defined to simplify adding custom accelerators to the -Rocket core. Rocket and the RoCC interface were used extensively in -the Fall 2013 CS250 VLSI class taught by Jonathan Bachrach, with -several student accelerator projects built to the RoCC interface. The -Hwacha vector unit has been rewritten as a RoCC coprocessor. - -Two Berkeley undergraduates, Quan Nguyen and Albert Ou, -successfully ported Linux to run on RISC-V in Spring 2013. - -Colin Schmidt successfully completed an LLVM backend for RISC-V 2.0 in -January 2014. - -Darius Rad at Bluespec contributed soft-float ABI support to the GCC port in -March 2014. - -John Hauser contributed the definition of the floating-point classification -instructions. - -We are aware of several other RISC-V core implementations, including -one in Verilog by Tommy Thorn, and one in Bluespec by Rishiyur Nikhil. - -\section*{Acknowledgments} - -Thanks to Christopher F. Batten, Preston Briggs, Christopher Celio, David -Chisnall, Stefan Freudenberger, John Hauser, Ben Keller, Rishiyur -Nikhil, Michael Taylor, Tommy Thorn, and Robert Watson for comments on -the draft ISA version 2.0 specification. - -\section{History from Revision 2.1} - -Uptake of the RISC-V ISA has been very rapid since the introduction of -the frozen version 2.0 in May 2014, with too much activity to record -in a short history section such as this. Perhaps the most important -single event was the formation of the non-profit RISC-V Foundation in -August 2015. 
The Foundation will now take over stewardship of the -official RISC-V ISA standard, and the official website {\tt riscv.org} -is the best place to obtain news and updates on the RISC-V standard. - -\section*{Acknowledgments} - -Thanks to Scott Beamer, Allen J. Baum, Christopher Celio, David Chisnall, -Paul Clayton, Palmer Dabbelt, Jan Gray, Michael Hamburg, and John -Hauser for comments on the version 2.0 specification. - -\section{History from Revision 2.2} - - -\section*{Acknowledgments} - -Thanks to Jacob Bachmeyer, Alex Bradbury, David Horner, Stefan O'Rear, -and Joseph Myers for comments on the version 2.1 specification. - -\section{History for Revision 2.3} - -Uptake of RISC-V continues at breakneck pace. - -John Hauser and Andrew Waterman contributed a hypervisor ISA extension -based upon a proposal from Paolo Bonzini. - -Daniel Lustig, Arvind, Krste Asanovi\'{c}, Shaked Flur, Paul Loewenstein, Yatin -Manerkar, Luc Maranget, Margaret Martonosi, Vijayanand Nagarajan, Rishiyur -Nikhil, Jonas Oberhauser, Christopher Pulte, Jose Renau, Peter Sewell, Susmit -Sarkar, Caroline Trippel, Muralidaran Vijayaraghavan, Andrew Waterman, Derek -Williams, Andrew Wright, and Sizhuo Zhang contributed the memory consistency -model. - -\section{Funding} - -Development of the RISC-V architecture and implementations has been -partially funded by the following sponsors. -\begin{itemize} - -\item {\bf Par Lab:} Research supported by Microsoft (Award \#024263) and Intel (Award - \#024894) funding and by matching funding by U.C. Discovery - (Award \#DIG07-10227). Additional support came from Par Lab - affiliates Nokia, NVIDIA, Oracle, and Samsung. - -\item {\bf Project Isis:} DoE Award DE-SC0003624. - -\item {\bf ASPIRE Lab}: DARPA PERFECT program, Award - HR0011-12-2-0016. DARPA POEM program Award HR0011-11-C-0100. The - Center for Future Architectures Research (C-FAR), a STARnet center - funded by the Semiconductor Research Corporation. 
Additional - support from ASPIRE industrial sponsor, Intel, and ASPIRE - affiliates, Google, Hewlett Packard Enterprise, Huawei, Nokia, - NVIDIA, Oracle, and Samsung. - -\end{itemize} - -The content of this paper does not necessarily reflect the position or the -policy of the US government and no official endorsement should be -inferred. diff --git a/src/intro.tex b/src/intro.tex deleted file mode 100644 index 7a74ab7..0000000 --- a/src/intro.tex +++ /dev/null @@ -1,770 +0,0 @@ -\chapter{Introduction} - -RISC-V (pronounced ``risk-five'') is a new instruction-set -architecture (ISA) that was originally designed to support computer -architecture research and education, but which we now hope will also -become a standard free and open architecture for industry -implementations. Our goals in defining RISC-V include: -\vspace{-0.1in} -\begin{itemize} -\parskip 0pt -\itemsep 1pt -\item A completely {\em open} ISA that is freely available to - academia and industry. -\item A {\em real} ISA suitable for direct native hardware implementation, - not just simulation or binary translation. -\item An ISA that avoids ``over-architecting'' for a particular - microarchitecture style (e.g., microcoded, in-order, decoupled, - out-of-order) or implementation technology (e.g., full-custom, ASIC, - FPGA), but which allows efficient implementation in any of these. -\item An ISA separated into a {\em small} base integer ISA, usable by - itself as a base for customized accelerators or for educational - purposes, and optional standard extensions, to support - general-purpose software development. -\item Support for the revised 2008 IEEE-754 floating-point standard~\cite{ieee754-2008}. -\item An ISA supporting extensive ISA extensions and - specialized variants. -\item Both 32-bit and 64-bit address space variants for - applications, operating system kernels, and hardware implementations. 
-\item An ISA with support for highly parallel multicore - or manycore implementations, including heterogeneous multiprocessors. -\item Optional {\em variable-length instructions} to both expand available - instruction encoding space and to support an optional {\em dense - instruction encoding} for improved performance, static code size, - and energy efficiency. -\item A fully virtualizable ISA to ease hypervisor development. -\item An ISA that simplifies experiments with new privileged architecture designs. -\end{itemize} -\vspace{-0.1in} - -\begin{commentary} - Commentary on our design decisions is formatted as in this - paragraph. This non-normative text can be skipped if the reader is - only interested in the specification itself. -\end{commentary} -\begin{commentary} -The name RISC-V was chosen to represent the fifth major RISC ISA -design from UC Berkeley (RISC-I~\cite{riscI-isca1981}, -RISC-II~\cite{Katevenis:1983}, SOAR~\cite{Ungar:1984}, and -SPUR~\cite{spur-jsscc1989} were the first four). We also pun on the -use of the Roman numeral ``V'' to signify ``variations'' and -``vectors'', as support for a range of architecture research, -including various data-parallel accelerators, is an explicit goal of -the ISA design. -\end{commentary} - -The RISC-V ISA is defined avoiding implementation details as much as -possible (although commentary is included on implementation-driven -decisions) and should be read as the software-visible interface to a -wide variety of implementations rather than as the design of a -particular hardware artifact. The RISC-V manual is structured in two -volumes. This volume covers the design of the base {\em unprivileged} -instructions, including optional unprivileged ISA extensions. -Unprivileged instructions are those that are generally usable in all -privilege modes in all privileged architectures, though behavior might -vary depending on privilege mode and privilege architecture. 
The -second volume provides the design of the first (``classic'') -privileged architecture. The manuals use IEC 80000-13:2008 -conventions, with a byte of 8 bits. - -\begin{commentary} -In the unprivileged ISA design, we tried to remove any dependence on -particular microarchitectural features, such as cache line size, or on -privileged architecture details, such as page translation. This is -both for simplicity and to allow maximum flexibility for alternative -microarchitectures or alternative privileged architectures. -\end{commentary} - - -\section{RISC-V Hardware Platform Terminology} - -A RISC-V hardware platform can contain one or more RISC-V-compatible -processing cores together with other non-RISC-V-compatible cores, -fixed-function accelerators, various physical memory structures, I/O -devices, and an interconnect structure to allow the components to -communicate. - -A component is termed a {\em core} if it contains an independent -instruction fetch unit. A RISC-V-compatible core might support -multiple RISC-V-compatible hardware threads, or {\em harts}, through -multithreading. - -A RISC-V core might have additional specialized instruction-set -extensions or an added {\em coprocessor}. We use the term {\em - coprocessor} to refer to a unit that is attached to a RISC-V core -and is mostly sequenced by a RISC-V instruction stream, but which -contains additional architectural state and instruction-set -extensions, and possibly some limited autonomy relative to the -primary RISC-V instruction stream. - -We use the term {\em accelerator} to refer to either a -non-programmable fixed-function unit or a core that can operate -autonomously but is specialized for certain tasks. In RISC-V systems, -we expect many programmable accelerators will be RISC-V-based cores -with specialized instruction-set extensions and/or customized -coprocessors. 
An important class of RISC-V accelerators is I/O -accelerators, which offload I/O processing tasks from the main -application cores. - -The system-level organization of a RISC-V hardware platform can range -from a single-core microcontroller to a many-thousand-node cluster of -shared-memory manycore server nodes. Even small systems-on-a-chip -might be structured as a hierarchy of multicomputers and/or -multiprocessors to modularize development effort or to provide secure -isolation between subsystems. - -\section{RISC-V Software Execution Environments and Harts} - -The behavior of a RISC-V program depends on the execution environment -in which it runs. A RISC-V execution environment interface (EEI) -defines the initial state of the program, the number and type of harts -in the environment including the privilege modes supported by the -harts, the accessibility and attributes of memory and I/O regions, the -behavior of all legal instructions executed on each hart (i.e., the -ISA is one component of the EEI), and the handling of any interrupts -or exceptions raised during execution including environment calls. -Examples of EEIs include the Linux application binary interface (ABI) -and the RISC-V supervisor binary interface (SBI). The implementation -of a RISC-V execution environment can be pure hardware, pure software, -or a combination of hardware and software. For example, opcode traps -and software emulation can be used to implement functionality not -provided in hardware. Examples of execution environment -implementations include: -\begin{itemize} - \item ``Bare metal'' hardware platforms where harts are directly - implemented by physical processor threads and instructions have - full access to the physical address space. The hardware platform - defines an execution environment that begins at power-on reset. 
- \item RISC-V operating systems that provide multiple user-level - execution environments by multiplexing user-level harts onto - available physical processor threads and by controlling access to - memory via virtual memory. - \item RISC-V hypervisors that provide multiple supervisor-level - execution environments for guest operating systems. - \item RISC-V emulators, such as Spike, QEMU or rv8, which emulate - RISC-V harts on an underlying x86 system, and which can provide - either a user-level or a supervisor-level execution environment. -\end{itemize} - -\begin{commentary} - A bare hardware platform can be considered to define an EEI, where - the accessible harts, memory, and other devices populate the - environment, and the initial state is that at power-on reset. - Generally, most software is designed to use a more abstract - interface to the hardware, as more abstract EEIs provide greater - portability across different hardware platforms. Often EEIs are - layered on top of one another, where one higher-level EEI uses - another lower-level EEI. -\end{commentary} - -From the perspective of software running in a given execution -environment, a hart is a resource that autonomously fetches and -executes RISC-V instructions within that execution environment. In -this respect, a hart behaves like a hardware thread resource even if -time-multiplexed onto real hardware by the execution environment. -Some EEIs support the creation and destruction of additional harts, -for example, via environment calls to fork new harts. - -The execution environment is responsible for ensuring the eventual forward -progress of each of its harts. -For a given hart, that responsibility is suspended while the hart is -exercising a mechanism that explicitly waits for an event, such as the -wait-for-interrupt instruction defined in Volume II of this specification; and -that responsibility ends if the hart is terminated. 
-The following events constitute forward progress: -\vspace{-0.2in} -\begin{itemize} -\parskip 0pt -\itemsep 1pt -\item The retirement of an instruction. -\item A trap, as defined in Section~\ref{sec:trap-defn}. -\item Any other event defined by an extension to constitute forward progress. -\end{itemize} - -\begin{commentary} -The term hart was introduced in the work on -Lithe~\cite{lithe-pan-hotpar09,lithe-pan-pldi10} to provide a term to -represent an abstract execution resource as opposed to a software -thread programming abstraction. - -The important distinction between a hardware thread (hart) and a -software thread context is that the software running inside an -execution environment is not responsible for causing progress of each -of its harts; that is the responsibility of the outer execution -environment. So the environment's harts operate like hardware threads -from the perspective of the software inside the execution environment. - -An execution environment implementation might time-multiplex a set of -guest harts onto fewer host harts provided by its own execution -environment but must do so in a way that guest harts operate like -independent hardware threads. In particular, if there are more guest -harts than host harts then the execution environment must be able to -preempt the guest harts and must not wait indefinitely for guest -software on a guest hart to ``yield" control of the guest hart. -\end{commentary} - -\section{RISC-V ISA Overview} - -A RISC-V ISA is defined as a base integer ISA, which must be present -in any implementation, plus optional extensions to the base ISA. The -base integer ISAs are very similar to that of the early RISC processors -except with no branch delay slots and with support for optional -variable-length instruction encodings. 
A base is carefully -restricted to a minimal set of instructions sufficient to provide a -reasonable target for compilers, assemblers, linkers, and operating -systems (with additional privileged operations), and so provides -a convenient ISA and software toolchain ``skeleton'' around which more -customized processor ISAs can be built. - -Although it is convenient to speak of {\em the} RISC-V ISA, RISC-V is -actually a family of related ISAs, of which there are currently four -base ISAs. Each base integer instruction set is characterized by the -width of the integer registers and the corresponding size of the -address space and by the number of integer registers. There are two -primary base integer variants, RV32I and RV64I, described in -Chapters~\ref{rv32} and \ref{rv64}, which provide 32-bit or 64-bit -address spaces respectively. We use the term XLEN to refer to the -width of an integer register in bits (either 32 or 64). -Chapter~\ref{rv32e} describes the RV32E subset variant of the RV32I -base instruction set, which has been added to support small -microcontrollers, and which has half the number of integer registers. -Chapter~\ref{rv128} sketches a future RV128I variant of the base -integer instruction set supporting a flat 128-bit address space -(XLEN=128). The base integer instruction sets use a two's-complement -representation for signed integer values. - -\begin{commentary} -Although 64-bit address spaces are a requirement for larger systems, -we believe 32-bit address spaces will remain adequate for many -embedded and client devices for decades to come and will be desirable -to lower memory traffic and energy consumption. In addition, 32-bit -address spaces are sufficient for educational purposes. A larger flat -128-bit address space might eventually be required, so we ensured this -could be accommodated within the RISC-V ISA framework. -\end{commentary} - -\begin{commentary} -The four base ISAs in RISC-V are treated as distinct base ISAs. 
A -common question is why there is not a single ISA, and in particular, -why is RV32I not a strict subset of RV64I? Some earlier ISA designs -(SPARC, MIPS) adopted a strict superset policy when increasing address -space size to support running existing 32-bit binaries on new 64-bit -hardware. - -The main advantage of explicitly separating base ISAs is that each -base ISA can be optimized for its needs without having to support -all the operations needed for other base ISAs. For example, RV64I can -omit instructions and CSRs that are only needed to cope with the -narrower registers in RV32I. The RV32I variants can use encoding -space otherwise reserved for instructions only required by wider -address-space variants. - -The main disadvantage of not treating the design as a single ISA is -that it complicates the hardware needed to emulate one base ISA on -another (e.g., RV32I on RV64I). However, differences in addressing -and illegal instruction traps generally mean some mode switch would be -required in hardware in any case even with full superset instruction -encodings, and the different RISC-V base ISAs are similar enough that -supporting multiple versions is relatively low cost. Although some -have proposed that the strict superset design would allow legacy -32-bit libraries to be linked with 64-bit code, this is -impractical, even with compatible encodings, due to the differences in -software calling conventions and system-call interfaces. - -The RISC-V privileged architecture provides fields in {\tt - misa} to control the unprivileged ISA at each level to support emulating -different base ISAs on the same hardware. We note that newer SPARC -and MIPS ISA revisions have deprecated support for running 32-bit code -unchanged on 64-bit systems. - -A related question is why there is a different encoding for 32-bit -adds in RV32I (ADD) and RV64I (ADDW)?
The ADDW opcode could be used -for 32-bit adds in RV32I and ADDD for 64-bit adds in RV64I, instead of -the existing design which uses the same opcode ADD for 32-bit adds in -RV32I and 64-bit adds in RV64I with a different opcode ADDW for 32-bit -adds in RV64I. This would also be more consistent with the use of the -same LW opcode for 32-bit load in both RV32I and RV64I. The very -first versions of RISC-V ISA did have a variant of this alternate -design, but the RISC-V design was changed to the current choice in -January 2011. Our focus was on supporting 32-bit integers in the -64-bit ISA not on providing compatibility with the 32-bit ISA, and the -motivation was to remove the asymmetry that arose from having not all -opcodes in RV32I have a *W suffix (e.g., ADDW, but AND not ANDW). In -hindsight, this was perhaps not well-justified and a consequence of -designing both ISAs at the same time as opposed to adding one later to -sit on top of another, and also from a belief we had to fold platform -requirements into the ISA spec which would imply that all the RV32I -instructions would have been required in RV64I. It is too late to -change the encoding now, but this is also of little practical -consequence for the reasons stated above. - -It has been noted we could enable the *W variants as an extension to -RV32I systems to provide a common encoding across RV64I and a future -RV32 variant. -\end{commentary} - -RISC-V has been designed to support extensive customization and -specialization. Each base integer ISA can be extended with one or -more optional instruction-set extensions. An extension may be -categorized as either standard, custom, or non-conforming. -For this purpose, we divide each RISC-V -instruction-set encoding space (and related encoding spaces such as -the CSRs) into three disjoint categories: {\em standard}, {\em - reserved}, and {\em custom}. 
Standard extensions and encodings -are defined by RISC-V International; any extensions not defined by -RISC-V International are {\em non-standard}. -Each base ISA and its standard extensions use only standard encodings, -and shall not conflict with each other in their uses of these encodings. -Reserved encodings are currently not defined but are saved for future -standard extensions; once thus used, they become standard encodings. -Custom encodings shall never be used for standard extensions and are -made available for vendor-specific non-standard extensions. -Non-standard extensions are either custom extensions, that use only -custom encodings, or {\em non-conforming} extensions, that use any -standard or reserved encoding. -Instruction-set extensions are generally shared but may provide slightly different -functionality depending on the base ISA. Chapter~\ref{extensions} -describes various ways of extending the RISC-V ISA. We have also -developed a naming convention for RISC-V base instructions and -instruction-set extensions, described in detail in -Chapter~\ref{naming}. - -To support more general software development, a set of standard -extensions are defined to provide integer multiply/divide, atomic -operations, and single and double-precision floating-point arithmetic. -The base integer ISA is named ``I'' (prefixed by RV32 or RV64 -depending on integer register width), and contains integer -computational instructions, integer loads, integer stores, and -control-flow instructions. The standard integer multiplication and -division extension is named ``M'', and adds instructions to multiply -and divide values held in the integer registers. The standard atomic -instruction extension, denoted by ``A'', adds instructions that -atomically read, modify, and write memory for inter-processor -synchronization. 
The standard single-precision floating-point -extension, denoted by ``F'', adds floating-point registers, -single-precision computational instructions, and single-precision -loads and stores. The standard double-precision floating-point -extension, denoted by ``D'', expands the floating-point registers, and -adds double-precision computational instructions, loads, and stores. -The standard ``C'' compressed instruction extension -provides narrower 16-bit forms of common instructions. - -Beyond the base integer ISA and these standard extensions, we believe -it is rare that a new instruction will provide a significant benefit -for all applications, although it may be very beneficial for a certain -domain. As energy efficiency concerns are forcing greater -specialization, we believe it is important to simplify the required -portion of an ISA specification. Whereas other architectures usually -treat their ISA as a single entity, which changes to a new version as -instructions are added over time, RISC-V will endeavor to keep the -base and each standard extension constant over time, and instead layer -new instructions as further optional extensions. For example, the -base integer ISAs will continue as fully supported standalone ISAs, -regardless of any subsequent extensions. - -\section{Memory} - -A RISC-V hart has a single byte-addressable address space -of $2^{\text{XLEN}}$ bytes for all memory -accesses. A {\em word} of memory is defined as \wunits{32}{bits} -(\wunits{4}{bytes}). Correspondingly, a {\em halfword} is \wunits{16}{bits} -(\wunits{2}{bytes}), a {\em doubleword} is \wunits{64}{bits} -(\wunits{8}{bytes}), and a {\em quadword} is \wunits{128}{bits} -(\wunits{16}{bytes}). -The memory address space is circular, so that the byte at address -$2^{\text{XLEN}}-1$ is adjacent to the byte at address zero. Accordingly, memory -address computations done by the hardware ignore overflow and instead -wrap around modulo $2^{\text{XLEN}}$. 
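The wrap-around rule for address computations can be illustrated with a small software model. This is only a sketch: the function name `effective_address` and the default `XLEN = 32` are assumptions made for the example and are not part of the specification.

```python
XLEN = 32  # assumed register width for this illustration; 64 behaves the same way


def effective_address(base: int, offset: int, xlen: int = XLEN) -> int:
    """Return (base + offset) reduced modulo 2**xlen.

    Overflow is ignored, so the byte after address 2**xlen - 1
    is the byte at address zero.
    """
    return (base + offset) & ((1 << xlen) - 1)
```

For instance, adding 1 to the highest address in a 32-bit address space gives address zero, and a negative offset from address zero lands at the top of the address space.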
- - -The execution environment determines the mapping of hardware resources into -a hart's address space. -Different address ranges of a hart's address space may (1)~be vacant, or -(2)~contain {\em main memory}, or (3)~contain one or more {\em I/O devices}. -Reads and writes of I/O devices may have visible side effects, but accesses -to main memory cannot. -Although it is possible for the execution environment to call everything in -a hart's address space an I/O device, it is usually expected that some -portion will be specified as main memory. - -When a RISC-V platform has multiple harts, the address spaces of any two -harts may be entirely the same, or entirely different, or may be partly -different but sharing some subset of resources, mapped into the same or -different address ranges. - -\begin{commentary} -For a purely ``bare metal'' environment, all harts may see an identical -address space, accessed entirely by physical addresses. -However, when the execution environment includes an operating system -employing address translation, it is common for each hart to be given a -virtual address space that is largely or entirely its own. -\end{commentary} - -Executing each RISC-V machine instruction entails one or more memory -accesses, subdivided into {\em -implicit} and {\em explicit} accesses. For each instruction executed, an {\em -implicit} memory read (instruction fetch) is done to obtain the encoded -instruction to execute. Many RISC-V instructions perform no further memory -accesses beyond instruction fetch. Specific load and store instructions -perform an {\em explicit} read or write of memory at an address determined by -the instruction. The execution environment may dictate that instruction -execution performs other {\em implicit} memory accesses (such as to implement -address translation) beyond those documented for the unprivileged ISA. 
- -The execution environment determines what portions of the -non-vacant address space are -accessible for each kind of memory access. For example, the set of locations -that can be implicitly read for instruction fetch may or may not have any -overlap with the set of locations that can be explicitly read by a load -instruction; and the set of locations that can be explicitly written by -a store instruction may be only a subset of locations that can be read. -Ordinarily, if an instruction attempts to access memory at an inaccessible -address, an exception is raised for the instruction. -Vacant locations in the address space are never accessible. - -Except when specified otherwise, implicit reads that do not raise an exception -may occur arbitrarily early and speculatively, even before the machine could -possibly prove that the read will be needed. For instance, a valid -implementation could attempt to read all of main memory at the earliest -opportunity, cache as many fetchable (executable) bytes as possible for later -instruction fetches, and avoid reading main memory for instruction fetches ever -again. To ensure that certain implicit reads are ordered only after writes to -the same memory locations, software must execute specific fence or cache-control -instructions defined for this purpose (such as the FENCE.I instruction -defined in Chapter~\ref{chap:zifencei}). - -The memory accesses (implicit or explicit) made by a hart may appear to occur -in a different order as perceived by another hart or by any other agent that -can access the same memory. This perceived reordering of memory accesses is -always constrained, however, by the applicable memory consistency model. The -default memory consistency model for RISC-V is the RISC-V Weak Memory Ordering -(RVWMO), defined in Chapter~\ref{ch:memorymodel} and in appendices. -Optionally, an implementation may adopt the stronger model of Total Store -Ordering, as defined in Chapter~\ref{sec:ztso}. 
The execution environment may -also add constraints that further limit the perceived reordering of memory -accesses. -Since the RVWMO model is the weakest model allowed for any RISC-V -implementation, software written for this model is compatible with the -actual memory consistency rules of all RISC-V implementations. As with -implicit reads, software must execute fence or cache-control instructions to -ensure specific ordering of memory accesses beyond the requirements of the -assumed memory consistency model and execution environment. - -\section{Base Instruction-Length Encoding} - -The base RISC-V ISA has fixed-length 32-bit instructions that must be -naturally aligned on 32-bit boundaries. However, the standard RISC-V -encoding scheme is designed to support ISA extensions with -variable-length instructions, where each instruction can be any number -of 16-bit instruction {\em parcels} in length and parcels are -naturally aligned on 16-bit boundaries. The standard compressed ISA -extension described in Chapter~\ref{compressed} reduces code size by -providing compressed 16-bit instructions and relaxes the alignment -constraints to allow all instructions (16 bit and 32 bit) to be -aligned on any 16-bit boundary to improve code density. - -We use the term IALIGN (measured in bits) to refer to the instruction-address -alignment constraint the implementation enforces. IALIGN is 32 bits in the -base ISA, but some ISA extensions, including the compressed ISA extension, -relax IALIGN to 16 bits. IALIGN may not take on any value other than 16 or -32. - -We use the term ILEN (measured in bits) to refer to the maximum -instruction length supported by an implementation, and which is always -a multiple of IALIGN. For implementations supporting only a base -instruction set, ILEN is 32 bits. Implementations supporting longer -instructions have larger values of ILEN. - -Figure~\ref{instlengthcode} illustrates the standard RISC-V -instruction-length encoding convention. 
All the 32-bit instructions -in the base ISA have their lowest two bits set to {\tt 11}. The -optional compressed 16-bit instruction-set extensions have their -lowest two bits equal to {\tt 00}, {\tt 01}, or {\tt 10}. - -\subsection*{Expanded Instruction-Length Encoding} - -A portion of the 32-bit instruction-encoding space has been tentatively -allocated for instructions longer than 32 bits. The entirety of this space is -reserved at this time, and the following proposal for encoding instructions -longer than 32 bits is not considered frozen. - -Standard instruction-set extensions -encoded with more than 32 bits have additional low-order bits set to {\tt 1}, -with the conventions for 48-bit and 64-bit lengths shown in -Figure~\ref{instlengthcode}. Instruction lengths between 80 bits and 176 bits -are encoded using a 3-bit field in bits [14:12] giving the number of 16-bit -words in addition to the first 5$\times$16-bit words. The encoding with bits -[14:12] set to {\tt 111} is reserved for future longer instruction encodings. 
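The length-encoding convention described above can be sketched as a decoder that looks only at the first 16-bit parcel. The function name `instruction_length_bits` is hypothetical, and the branches for lengths above 32 bits follow the tentative proposal, which is not frozen. The sketch reports lengths only and does not check whether an encoding is legal (for example, it does not reject the all-zeros parcel).

```python
def instruction_length_bits(parcel: int):
    """Return the instruction length in bits implied by the first 16-bit
    parcel, or None for the encoding reserved for >=192-bit instructions."""
    parcel &= 0xFFFF
    if parcel & 0b11 != 0b11:
        return 16                      # aa != 11
    if (parcel >> 2) & 0b111 != 0b111:
        return 32                      # bbb != 111
    if parcel & 0b111111 == 0b011111:
        return 48
    if parcel & 0b1111111 == 0b0111111:
        return 64
    # Bits [6:0] are all ones here; bits [14:12] hold nnn.
    nnn = (parcel >> 12) & 0b111
    if nnn != 0b111:
        return 80 + 16 * nnn           # 80 to 176 bits
    return None                        # reserved for future longer encodings
```

Because every test uses only the low-order bits of the first parcel, a fetch unit modeled this way never needs to read further parcels to determine the length.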
- - -\begin{figure}[hbt] -{ -\begin{center} -\begin{tabular}{ccccl} -\cline{4-4} -& & & \multicolumn{1}{|c|}{\tt xxxxxxxxxxxxxxaa} & 16-bit ({\tt aa} -$\neq$ {\tt 11})\\ -\cline{4-4} -\\ -\cline{3-4} -& & \multicolumn{1}{|c|}{\tt xxxxxxxxxxxxxxxx} -& \multicolumn{1}{c|}{\tt xxxxxxxxxxxbbb11} & 32-bit ({\tt bbb} -$\neq$ {\tt 111}) \\ -\cline{3-4} -\\ -\cline{2-4} -\hspace{0.1in} -& \multicolumn{1}{c|}{$\cdot\cdot\cdot${\tt xxxx} } -& \multicolumn{1}{c|}{\tt xxxxxxxxxxxxxxxx} -& \multicolumn{1}{c|}{\tt xxxxxxxxxx011111} & 48-bit \\ -\cline{2-4} -\\ -\cline{2-4} -\hspace{0.1in} -& \multicolumn{1}{c|}{$\cdot\cdot\cdot${\tt xxxx} } -& \multicolumn{1}{c|}{\tt xxxxxxxxxxxxxxxx} -& \multicolumn{1}{c|}{\tt xxxxxxxxx0111111} & 64-bit \\ -\cline{2-4} -\\ -\cline{2-4} -\hspace{0.1in} -& \multicolumn{1}{c|}{$\cdot\cdot\cdot${\tt xxxx} } -& \multicolumn{1}{c|}{\tt xxxxxxxxxxxxxxxx} -& \multicolumn{1}{c|}{\tt xnnnxxxxx1111111} & (80+16*{\tt nnn})-bit, - {\tt nnn}$\neq${\tt 111} \\ -\cline{2-4} -\\ -\cline{2-4} -\hspace{0.1in} -& \multicolumn{1}{c|}{$\cdot\cdot\cdot${\tt xxxx} } -& \multicolumn{1}{c|}{\tt xxxxxxxxxxxxxxxx} -& \multicolumn{1}{c|}{\tt x111xxxxx1111111} & Reserved for $\geq$192-bits \\ -\cline{2-4} -\\ -Byte Address: & \multicolumn{1}{r}{base+4} & \multicolumn{1}{r}{base+2} & \multicolumn{1}{r}{base} & \\ - \end{tabular} -\end{center} -} -\caption{RISC-V instruction length encoding. Only the 16-bit and 32-bit encodings are considered frozen at this time.} -\label{instlengthcode} -\end{figure} - -\begin{commentary} -Given the code size and energy savings of a compressed format, we -wanted to build in support for a compressed format to the ISA encoding -scheme rather than adding this as an afterthought, but to allow -simpler implementations we didn't want to make the compressed format -mandatory. We also wanted to optionally allow longer instructions to -support experimentation and larger instruction-set extensions. 
-Although our encoding convention required a tighter encoding of the -core RISC-V ISA, this has several beneficial effects. - -An implementation of the standard IMAFD ISA need only hold the -most-significant 30 bits in instruction caches (a 6.25\% saving). On -instruction cache refills, any instructions encountered with either -low bit clear should be recoded into illegal 30-bit instructions -before storing in the cache to preserve illegal instruction exception -behavior. - -Perhaps more importantly, by condensing our base ISA into a subset of -the 32-bit instruction word, we leave more space available for -non-standard and custom extensions. In particular, the base RV32I ISA -uses less than 1/8 of the encoding space in the 32-bit instruction -word. As described in Chapter~\ref{extensions}, an implementation -that does not require support for the standard compressed instruction -extension can map 3 additional non-conforming 30-bit instruction -spaces into the 32-bit fixed-width format, while preserving support -for standard $\geq$32-bit instruction-set extensions. Further, if the -implementation also does not need instructions $>$32-bits in length, -it can recover a further four major opcodes for non-conforming extensions. -\end{commentary} - -Encodings with bits [15:0] all zeros are defined as illegal -instructions. These instructions are considered to be of minimal -length: 16 bits if any 16-bit instruction-set extension is present, -otherwise 32 bits. The encoding with bits [ILEN-1:0] all ones is also -illegal; this instruction is considered to be ILEN bits long. - -\begin{commentary} -We consider it a feature that any length of instruction containing all -zero bits is not legal, as this quickly traps erroneous jumps into -zeroed memory regions. 
Similarly, we also reserve the instruction -encoding containing all ones to be an illegal instruction, to catch -the other common pattern observed with unprogrammed non-volatile -memory devices, disconnected memory buses, or broken memory devices. - -Software can rely on a naturally aligned 32-bit word containing zero to -act as an illegal instruction on all RISC-V implementations, to be used -by software where an illegal instruction is explicitly desired. -Defining a corresponding known illegal value for all ones is more -difficult due to the variable-length encoding. Software cannot -generally use the illegal value of ILEN bits of all 1s, as software -might not know ILEN for the eventual target machine (e.g., if software -is compiled into a standard binary library used by many different -machines). Defining a 32-bit word of all ones as illegal was also -considered, as all machines must support a 32-bit instruction size, but -this requires the instruction-fetch unit on machines with ILEN$>$32 -report an illegal instruction exception rather than an access-fault -exception when such an instruction borders a protection boundary, -complicating variable-instruction-length fetch and decode. -\end{commentary} - -RISC-V base ISAs have either little-endian or big-endian memory systems, -with the privileged architecture further defining bi-endian operation. -Instructions are stored in memory as a sequence of 16-bit little-endian -parcels, regardless of memory system endianness. -Parcels forming one instruction are stored at increasing -halfword addresses, with the lowest-addressed parcel holding the -lowest-numbered bits in the instruction specification. - -\begin{commentary} -We originally chose little-endian byte ordering for the RISC-V memory system -because little-endian systems are currently dominant commercially (all -x86 systems; iOS, Android, and Windows for ARM). 
A minor point is -that we have also found little-endian memory systems to be more -natural for hardware designers. However, certain application areas, -such as IP networking, operate on big-endian data structures, and -certain legacy code bases have been built assuming big-endian -processors, so we have defined big-endian and bi-endian variants of RISC-V. - -We have to fix the order in which instruction parcels are stored in -memory, independent of memory system endianness, to ensure that the -length-encoding bits always appear first in halfword address -order. This allows the length of a variable-length instruction to be -quickly determined by an instruction-fetch unit by examining only the -first few bits of the first 16-bit instruction parcel. - -We further make the instruction parcels themselves little-endian to decouple -the instruction encoding from the memory system endianness altogether. -This design benefits both software tooling and bi-endian hardware. -Otherwise, for instance, a RISC-V assembler or disassembler would always need -to know the intended active endianness, despite that in bi-endian systems, the -endianness mode might change dynamically during execution. -In contrast, by giving instructions a fixed endianness, it is sometimes -possible for carefully written software to be endianness-agnostic even in -binary form, much like position-independent code. - -The choice to have instructions be only little-endian does have consequences, -however, for RISC-V software that encodes or decodes machine instructions. -Big-endian JIT compilers, for example, must swap the byte order when storing -to instruction memory. - -Once we had decided to fix on a little-endian instruction encoding, this -naturally led to placing the length-encoding bits in the LSB positions of the -instruction format to avoid breaking up opcode fields. 
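As an illustration of the fixed parcel ordering, the following sketch serializes an instruction into the byte sequence stored in memory. The helper name `encode_parcels` is an assumption made for this example. Since the result is a byte string, it is the same regardless of the endianness of the machine that runs the code.

```python
import struct


def encode_parcels(instruction: int, length_bits: int) -> bytes:
    """Split an instruction into 16-bit parcels, lowest-numbered bits first,
    and store each parcel in little-endian byte order."""
    parcels = [(instruction >> shift) & 0xFFFF
               for shift in range(0, length_bits, 16)]
    return struct.pack("<%dH" % len(parcels), *parcels)
```

For a 32-bit instruction such as `0x00000013`, the byte holding the length-encoding bits (`0x13`) comes first in memory.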
-\end{commentary} - -\section{Exceptions, Traps, and Interrupts} -\label{sec:trap-defn} - -We use the term {\em exception} to refer to an unusual condition -occurring at run time associated with an instruction in the current -RISC-V hart. We use the term {\em interrupt} to refer to an external -asynchronous event that may cause a RISC-V hart to experience an -unexpected transfer of control. We use the term {\em trap} to refer -to the transfer of control to a trap handler caused by either an -exception or an interrupt. - -The instruction descriptions in following chapters describe conditions -that can raise an exception during execution. The general behavior of -most RISC-V EEIs is that a trap to some handler occurs when an -exception is signaled on an instruction (except for floating-point -exceptions, which, in the standard floating-point extensions, do not -cause traps). The manner in which interrupts are generated, routed -to, and enabled by a hart depends on the EEI. - -\begin{commentary} -Our use of ``exception'' and ``trap'' is compatible with that in the IEEE-754 -floating-point standard. -\end{commentary} - -How traps are handled and made visible to software running on the hart -depends on the enclosing execution environment. From the perspective -of software running inside an execution environment, traps encountered -by a hart at runtime can have four different effects: -\begin{description} - \item[Contained Trap:] The trap is visible to, and handled by, - software running inside the execution environment. For example, - in an EEI providing both supervisor and user - mode on harts, an ECALL by a user-mode hart will generally result - in a transfer of control to a supervisor-mode handler running on - the same hart. Similarly, in the same environment, when a hart is - interrupted, an interrupt handler will be run in supervisor mode - on the hart. 
- \item[Requested Trap:] The trap is a synchronous exception that is - an explicit call to the execution environment requesting an action - on behalf of software inside the execution environment. An - example is a system call. In this case, execution may or may not - resume on the hart after the requested action is taken by the - execution environment. For example, a system call could remove the - hart or cause an orderly termination of the entire execution environment. - \item[Invisible Trap:] The trap is handled transparently by the - execution environment and execution resumes normally after the - trap is handled. Examples include emulating missing instructions, - handling non-resident page faults in a demand-paged virtual-memory - system, or handling device interrupts for a different job in a - multiprogrammed machine. In these cases, the software running - inside the execution environment is not aware of the trap (we - ignore timing effects in these definitions). - \item[Fatal Trap:] The trap represents a fatal failure and causes - the execution environment to terminate execution. Examples - include failing a virtual-memory page-protection check or allowing - a watchdog timer to expire. Each EEI should define how execution - is terminated and reported to an external environment. -\end{description} - -Table~\ref{table:trapcharacteristics} shows the characteristics of each -kind of trap. - -\begin{table}[hbt] - \centering - \begin{tabular}{|l|c|c|c|c|} - \hline - & Contained & Requested & Invisible & Fatal\\ - \hline - Execution terminates & No & No$^{1}$ & No & Yes \\ - Software is oblivious & No & No & Yes & Yes$^{2}$ \\ - Handled by environment & No & Yes & Yes & Yes \\ - \hline - \end{tabular} - \caption{Characteristics of traps. Notes: 1) Termination may be - requested. 
2) Imprecise fatal traps might be observable by software.} -\label{table:trapcharacteristics} -\end{table} - -The EEI defines for each trap whether it is handled precisely, though -the recommendation is to maintain preciseness where possible. -Contained and requested traps can be observed to be imprecise by -software inside the execution environment. Invisible traps, by -definition, cannot be observed to be precise or imprecise by software -running inside the execution environment. Fatal traps can be observed -to be imprecise by software running inside the execution environment, -if known-errorful instructions do not cause immediate termination. - -Because this document describes unprivileged instructions, traps are -rarely mentioned. Architectural means to handle contained traps are -defined in the privileged architecture manual, along with other -features to support richer EEIs. Unprivileged instructions that are -defined solely to cause requested traps are documented here. -Invisible traps are, by their nature, out of scope for this document. -Instruction encodings that are not defined here and not defined by -some other means may cause a fatal trap. - -\section{UNSPECIFIED Behaviors and Values} - -The architecture fully describes what implementations must do and any -constraints on what they may do. In cases where the architecture -intentionally does not constrain implementations, the term \unspecified\ -is explicitly used. - -The term \unspecified\ refers to a behavior or value that is -intentionally unconstrained. The definition of these behaviors or -values is open to extensions, platform standards, or implementations. -Extensions, platform standards, or implementation documentation may -provide normative content to further constrain cases that the base -architecture defines as \unspecified. 
- -Like the base architecture, extensions should fully describe allowable -behavior and values and use the term \unspecified\ for cases that are -intentionally unconstrained. These cases may be constrained or defined -by other extensions, platform standards, or implementations. diff --git a/src/j.tex b/src/j.tex deleted file mode 100644 index 37a8b37..0000000 --- a/src/j.tex +++ /dev/null @@ -1,13 +0,0 @@ -\chapter{``J'' Standard Extension for Dynamically Translated Languages, Version 0.0} -\label{sec:j} - -This chapter is a placeholder for a future standard extension to -support dynamically translated languages. - -\begin{commentary} - Many popular languages are usually implemented via dynamic - translation, including Java and Javascript. These languages can - benefit from additional ISA support for dynamic checks and garbage - collection. -\end{commentary} - diff --git a/src/m.tex b/src/m.tex deleted file mode 100644 index 3aceb8b..0000000 --- a/src/m.tex +++ /dev/null @@ -1,188 +0,0 @@ -\chapter{``M'' Standard Extension for Integer Multiplication and - Division, Version 2.0} - -This chapter describes the standard integer multiplication and -division instruction extension, which is named ``M'' and contains -instructions that multiply or divide values held in two integer -registers. - -\begin{commentary} -We separate integer multiply and divide out from the base to simplify -low-end implementations, or for applications where integer multiply -and divide operations are either infrequent or better handled in -attached accelerators. 
-\end{commentary} - -\section{Multiplication Operations} -\label{multiplication-operations} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{S@{}R@{}R@{}S@{}R@{}O} -\\ -\instbitrange{31}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct7} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -7 & 5 & 5 & 3 & 5 & 7 \\ -MULDIV & multiplier & multiplicand & MUL/MULH[[S]U] & dest & OP \\ -MULDIV & multiplier & multiplicand & MULW & dest & OP-32 \\ -\end{tabular} -\end{center} - -MUL performs an XLEN-bit$\times$XLEN-bit multiplication -of {\em rs1} by {\em rs2} and places the -lower XLEN bits in the destination register. MULH, MULHU, and MULHSU -perform the same multiplication but return the upper XLEN bits of the -full 2$\times$XLEN-bit product, for signed$\times$signed, -unsigned$\times$unsigned, and \wunits{signed}{\em rs1}$\times$\wunits{unsigned}{\em rs2} multiplication, -respectively. If both the high and low bits of the same product are -required, then the recommended code sequence is: MULH[[S]U] {\em rdh, - rs1, rs2}; MUL {\em rdl, rs1, rs2} (source register specifiers must -be in same order and {\em rdh} cannot be the same as {\em rs1} or {\em - rs2}). Microarchitectures can then fuse these into a single -multiply operation instead of performing two separate multiplies. - -\begin{commentary} -MULHSU is used in multi-word signed multiplication to multiply the -most-significant word of the multiplicand (which contains the sign bit) -with the less-significant words of the multiplier (which are unsigned). -\end{commentary} - -MULW is an RV64 instruction that multiplies the lower 32 bits of the source -registers, placing the sign-extension of the lower 32 bits of the result -into the destination register. 
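The multiplication semantics described above can be sketched in a few lines of Python. This is an illustrative model, not part of the specification: the helper names (`to_signed`, `mul`, `mulh`, `mulhu`, `mulhsu`, `mulw`) and the choice of XLEN = 64 are assumptions made for the example, and register values are represented as unsigned XLEN-bit patterns.

```python
# Illustrative model of the M-extension multiply instructions.
# Assumptions (not from the source text): XLEN is fixed at 64 and
# registers are passed around as unsigned XLEN-bit Python integers.
XLEN = 64
MASK = (1 << XLEN) - 1


def to_signed(x, bits=XLEN):
    """Interpret the low `bits` bits of x as a two's-complement value."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def mul(rs1, rs2):
    """MUL: lower XLEN bits of the product (same for signed and unsigned)."""
    return (rs1 * rs2) & MASK


def mulh(rs1, rs2):
    """MULH: upper XLEN bits of the signed x signed product."""
    return ((to_signed(rs1) * to_signed(rs2)) >> XLEN) & MASK


def mulhu(rs1, rs2):
    """MULHU: upper XLEN bits of the unsigned x unsigned product."""
    return (((rs1 & MASK) * (rs2 & MASK)) >> XLEN) & MASK


def mulhsu(rs1, rs2):
    """MULHSU: upper XLEN bits of signed rs1 x unsigned rs2."""
    return ((to_signed(rs1) * (rs2 & MASK)) >> XLEN) & MASK


def mulw(rs1, rs2):
    """MULW (RV64): low 32 bits of the product, sign-extended to XLEN."""
    low = (rs1 * rs2) & 0xFFFFFFFF
    return to_signed(low, 32) & MASK
```

With an all-ones register (the bit pattern of $-1$), `mulh` and `mulhu` return different upper halves for the same inputs, which is why the three high-half variants exist.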
- -\begin{commentary} -In RV64, MUL can be used to obtain the upper 32 bits of the 64-bit product, -but signed arguments must be proper 32-bit signed values, whereas unsigned -arguments must have their upper 32 bits clear. If the -arguments are not known to be sign- or zero-extended, an alternative is to -shift both arguments left by 32 bits, then use MULH[[S]U]. -\end{commentary} - -\section{Division Operations} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{S@{}R@{}R@{}O@{}R@{}O} -\\ -\instbitrange{31}{25} & -\instbitrange{24}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{funct7} & -\multicolumn{1}{c|}{rs2} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -7 & 5 & 5 & 3 & 5 & 7 \\ -MULDIV & divisor & dividend & DIV[U]/REM[U] & dest & OP \\ -MULDIV & divisor & dividend & DIV[U]W/REM[U]W & dest & OP-32 \\ -\end{tabular} -\end{center} - -DIV and DIVU perform an XLEN bits by XLEN bits signed and unsigned integer -division of {\em rs1} by {\em rs2}, rounding towards zero. -REM and REMU provide the remainder of the corresponding division operation. -For REM, the sign of the result equals the sign of the dividend. - -\begin{commentary} -For both signed and unsigned division, it holds that -\mbox{$\textrm{dividend} = \textrm{divisor} \times \textrm{quotient} + \textrm{remainder}$}. -\end{commentary} - -If both the quotient and remainder -are required from the same division, the recommended code sequence is: -DIV[U] {\em rdq, rs1, rs2}; REM[U] {\em rdr, rs1, rs2} ({\em rdq} -cannot be the same as {\em rs1} or {\em rs2}). Microarchitectures can -then fuse these into a single divide operation instead of performing -two separate divides. 
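The rounding rule for DIV and REM can be checked with a small Python sketch. The function names `div_trunc` and `rem_trunc` are hypothetical, the operands are ordinary Python signed integers rather than register bit patterns, and the sketch assumes a non-zero divisor and no overflow; those special cases are deliberately left out here.

```python
# Illustrative model of signed DIV/REM for ordinary operands.
# Assumptions (not from the source text): divisor != 0 and no overflow;
# operands are plain Python signed integers, not XLEN-bit patterns.


def div_trunc(dividend, divisor):
    """Quotient rounded towards zero, as DIV does."""
    q = abs(dividend) // abs(divisor)
    # The quotient is negative exactly when the operand signs differ.
    return -q if (dividend < 0) != (divisor < 0) else q


def rem_trunc(dividend, divisor):
    """Remainder matching div_trunc; its sign follows the dividend."""
    return dividend - divisor * div_trunc(dividend, divisor)
```

Python's own `//` operator rounds towards negative infinity (`-7 // 2` is `-4`), so it cannot be used directly to model DIV. By construction, `rem_trunc` satisfies the identity dividend = divisor $\times$ quotient + remainder.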
- -DIVW and DIVUW are RV64 instructions that divide the -lower 32 bits of {\em rs1} by the lower 32 bits of {\em rs2}, treating -them as signed and unsigned integers respectively, placing the 32-bit -quotient in {\em rd}, sign-extended to 64 bits. REMW and REMUW -are RV64 instructions that provide the corresponding -signed and unsigned remainder operations respectively. Both REMW and -REMUW always sign-extend the 32-bit result to 64 bits, including on a -divide by zero. - -The semantics for division by zero and division overflow are summarized in -Table~\ref{tab:divby0}. The quotient of division by zero has all bits set, and -the remainder of division by zero equals the dividend. Signed division overflow -occurs only when the most-negative integer is divided by $-1$. The quotient of -a signed division with overflow is equal to the dividend, and the remainder is -zero. Unsigned division overflow cannot occur. - -\begin{table}[h] -\center -\begin{tabular}{|l|c|c||c|c|c|c|} -\hline -Condition & Dividend & Divisor & DIVU[W] & REMU[W] & DIV[W] & REM[W] \\ \hline -Division by zero & $x$ & 0 & $2^{L}-1$ & $x$ & $-1$ & $x$ \\ -Overflow (signed only) & $-2^{L-1}$ & $-1$ & -- & -- & $-2^{L-1}$ & 0 \\ -\hline -\end{tabular} -\caption{Semantics for division by zero and division overflow. -L is the width of the operation in bits: XLEN for DIV[U] and REM[U], or -32 for DIV[U]W and REM[U]W.} -\label{tab:divby0} -\end{table} - -\begin{commentary} -We considered raising exceptions on integer divide by zero, with these -exceptions causing a trap in most execution environments. However, -this would be the only arithmetic trap in the standard ISA -(floating-point exceptions set flags and write default values, but do -not cause traps) and would require language implementors to interact -with the execution environment's trap handlers for this case. 
-Further, where language standards mandate that a divide-by-zero -exception must cause an immediate control flow change, only a single -branch instruction needs to be added to each divide operation, and -this branch instruction can be inserted after the divide and should -normally be very predictably not taken, adding little runtime -overhead. - -The value of all bits set is returned for both unsigned and signed -divide by zero to simplify the divider circuitry. The value of all 1s -is both the natural value to return for unsigned divide, representing -the largest unsigned number, and also the natural result for simple -unsigned divider implementations. Signed division is often -implemented using an unsigned division circuit and specifying the same -overflow result simplifies the hardware. -\end{commentary} - -\section{Zmmul Extension, Version 1.0} - -The Zmmul extension implements the multiplication subset of the M extension. -It adds all of the instructions defined in Section~\ref{multiplication-operations}, -namely: MUL, MULH, MULHU, MULHSU, and (for RV64 only) MULW. -The encodings are identical to those of the corresponding M-extension instructions. - -\begin{commentary} -The Zmmul extension enables low-cost implementations that require -multiplication operations but not division. -For many microcontroller applications, division operations are too -infrequent to justify the cost of divider hardware. -By contrast, multiplication operations are more frequent, making the cost of -multiplier hardware more justifiable. -Simple FPGA soft cores particularly benefit from eliminating division but -retaining multiplication, since many FPGAs provide hardwired multipliers -but require dividers be implemented in soft logic. 
-\end{commentary} diff --git a/src/memory-model-alloy.tex b/src/memory-model-alloy.tex deleted file mode 100644 index c584931..0000000 --- a/src/memory-model-alloy.tex +++ /dev/null @@ -1,269 +0,0 @@ -\section{Formal Axiomatic Specification in Alloy} -\label{sec:alloy} - -\lstdefinelanguage{alloy}{ - morekeywords={abstract, sig, extends, pred, fun, fact, no, set, one, lone, let, not, all, iden, some, run, for}, - morecomment=[l]{//}, - morecomment=[s]{/*}{*/}, - commentstyle=\color{green!40!black}, - keywordstyle=\color{blue!40!black}, - moredelim=**[is][\color{red}]{@}{@}, - escapeinside={!}{!}, -} -\lstset{language=alloy} -\lstset{aboveskip=0pt} -\lstset{belowskip=0pt} - -We present a formal specification of the RVWMO memory model in Alloy (\url{http://alloy.mit.edu}). -This model is available online at \url{https://github.com/daniellustig/riscv-memory-model}. - -The online material also contains some litmus tests and some examples of how Alloy can be used to model check some of the mappings in Section~\ref{sec:memory:porting}. - -\begin{figure}[h!] - { - \tt\bfseries\centering\footnotesize - \begin{lstlisting} -//////////////////////////////////////////////////////////////////////////////// -// =RVWMO PPO= - -// Preserved Program Order -fun ppo : Event->Event { - // same-address ordering - po_loc :> Store - + rdw - + (AMO + StoreConditional) <: rfi - - // explicit synchronization - + ppo_fence - + Acquire <: ^po :> MemoryEvent - + MemoryEvent <: ^po :> Release - + RCsc <: ^po :> RCsc - + pair - - // syntactic dependencies - + addrdep - + datadep - + ctrldep :> Store - - // pipeline dependencies - + (addrdep+datadep).rfi - + addrdep.^po :> Store -} - -// the global memory order respects preserved program order -fact { ppo in ^gmo } -\end{lstlisting}} - \caption{The RVWMO memory model formalized in Alloy (1/5: PPO)} - \label{fig:alloy1} -\end{figure} -\begin{figure}[h!] 
- { - \tt\bfseries\centering\footnotesize - \begin{lstlisting} -//////////////////////////////////////////////////////////////////////////////// -// =RVWMO axioms= - -// Load Value Axiom -fun candidates[r: MemoryEvent] : set MemoryEvent { - (r.~^gmo & Store & same_addr[r]) // writes preceding r in gmo - + (r.^~po & Store & same_addr[r]) // writes preceding r in po -} - -fun latest_among[s: set Event] : Event { s - s.~^gmo } - -pred LoadValue { - all w: Store | all r: Load | - w->r in rf <=> w = latest_among[candidates[r]] -} - -// Atomicity Axiom -pred Atomicity { - all r: Store.~pair | // starting from the lr, - no x: Store & same_addr[r] | // there is no store x to the same addr - x not in same_hart[r] // such that x is from a different hart, - and x in r.~rf.^gmo // x follows (the store r reads from) in gmo, - and r.pair in x.^gmo // and r follows x in gmo -} - -// Progress Axiom implicit: Alloy only considers finite executions - -pred RISCV_mm { LoadValue and Atomicity /* and Progress */ } - -\end{lstlisting}} - \caption{The RVWMO memory model formalized in Alloy (2/5: Axioms)} - \label{fig:alloy2} -\end{figure} -\begin{figure}[h!] 
- { - \tt\bfseries\centering\footnotesize - \begin{lstlisting} -//////////////////////////////////////////////////////////////////////////////// -// Basic model of memory - -sig Hart { // hardware thread - start : one Event -} -sig Address {} -abstract sig Event { - po: lone Event // program order -} - -abstract sig MemoryEvent extends Event { - address: one Address, - acquireRCpc: lone MemoryEvent, - acquireRCsc: lone MemoryEvent, - releaseRCpc: lone MemoryEvent, - releaseRCsc: lone MemoryEvent, - addrdep: set MemoryEvent, - ctrldep: set Event, - datadep: set MemoryEvent, - gmo: set MemoryEvent, // global memory order - rf: set MemoryEvent -} -sig LoadNormal extends MemoryEvent {} // l{b|h|w|d} -sig LoadReserve extends MemoryEvent { // lr - pair: lone StoreConditional -} -sig StoreNormal extends MemoryEvent {} // s{b|h|w|d} -// all StoreConditionals in the model are assumed to be successful -sig StoreConditional extends MemoryEvent {} // sc -sig AMO extends MemoryEvent {} // amo -sig NOP extends Event {} - -fun Load : Event { LoadNormal + LoadReserve + AMO } -fun Store : Event { StoreNormal + StoreConditional + AMO } - -sig Fence extends Event { - pr: lone Fence, // opcode bit - pw: lone Fence, // opcode bit - sr: lone Fence, // opcode bit - sw: lone Fence // opcode bit -} -sig FenceTSO extends Fence {} - -/* Alloy encoding detail: opcode bits are either set (encoded, e.g., - * as f.pr in iden) or unset (f.pr not in iden). The bits cannot be used for - * anything else */ -fact { pr + pw + sr + sw in iden } -// likewise for ordering annotations -fact { acquireRCpc + acquireRCsc + releaseRCpc + releaseRCsc in iden } -// don't try to encode FenceTSO via pr/pw/sr/sw; just use it as-is -fact { no FenceTSO.(pr + pw + sr + sw) } -\end{lstlisting}} - \caption{The RVWMO memory model formalized in Alloy (3/5: model of memory)} - \label{fig:alloy3} -\end{figure} - -\begin{figure}[h!] 
- { - \tt\bfseries\centering\footnotesize - \begin{lstlisting} -//////////////////////////////////////////////////////////////////////////////// -// =Basic model rules= - -// Ordering annotation groups -fun Acquire : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.acquireRCsc } -fun Release : MemoryEvent { MemoryEvent.releaseRCpc + MemoryEvent.releaseRCsc } -fun RCpc : MemoryEvent { MemoryEvent.acquireRCpc + MemoryEvent.releaseRCpc } -fun RCsc : MemoryEvent { MemoryEvent.acquireRCsc + MemoryEvent.releaseRCsc } - -// There is no such thing as store-acquire or load-release, unless it's both -fact { Load & Release in Acquire } -fact { Store & Acquire in Release } - -// FENCE PPO -fun FencePRSR : Fence { Fence.(pr & sr) } -fun FencePRSW : Fence { Fence.(pr & sw) } -fun FencePWSR : Fence { Fence.(pw & sr) } -fun FencePWSW : Fence { Fence.(pw & sw) } - -fun ppo_fence : MemoryEvent->MemoryEvent { - (Load <: ^po :> FencePRSR).(^po :> Load) - + (Load <: ^po :> FencePRSW).(^po :> Store) - + (Store <: ^po :> FencePWSR).(^po :> Load) - + (Store <: ^po :> FencePWSW).(^po :> Store) - + (Load <: ^po :> FenceTSO) .(^po :> MemoryEvent) - + (Store <: ^po :> FenceTSO) .(^po :> Store) -} - -// auxiliary definitions -fun po_loc : Event->Event { ^po & address.~address } -fun same_hart[e: Event] : set Event { e + e.^~po + e.^po } -fun same_addr[e: Event] : set Event { e.address.~address } - -// initial stores -fun NonInit : set Event { Hart.start.*po } -fun Init : set Event { Event - NonInit } -fact { Init in StoreNormal } -fact { Init->(MemoryEvent & NonInit) in ^gmo } -fact { all e: NonInit | one e.*~po.~start } // each event is in exactly one hart -fact { all a: Address | one Init & a.~address } // one init store per address -fact { no Init <: po and no po :> Init } -\end{lstlisting}} - \caption{The RVWMO memory model formalized in Alloy (4/5: Basic model rules)} - \label{fig:alloy4} -\end{figure} - -\begin{figure}[h!] 
- { - \tt\bfseries\centering\footnotesize - \begin{lstlisting} -// po -fact { acyclic[po] } - -// gmo -fact { total[^gmo, MemoryEvent] } // gmo is a total order over all MemoryEvents - -//rf -fact { rf.~rf in iden } // each read returns the value of only one write -fact { rf in Store <: address.~address :> Load } -fun rfi : MemoryEvent->MemoryEvent { rf & (*po + *~po) } - -//dep -fact { no StoreNormal <: (addrdep + ctrldep + datadep) } -fact { addrdep + ctrldep + datadep + pair in ^po } -fact { datadep in datadep :> Store } -fact { ctrldep.*po in ctrldep } -fact { no pair & (^po :> (LoadReserve + StoreConditional)).^po } -fact { StoreConditional in LoadReserve.pair } // assume all SCs succeed - -// rdw -fun rdw : Event->Event { - (Load <: po_loc :> Load) // start with all same_address load-load pairs, - - (~rf.rf) // subtract pairs that read from the same store, - - (po_loc.rfi) // and subtract out "fri-rfi" patterns -} - -// filter out redundant instances and/or visualizations -fact { no gmo & gmo.gmo } // keep the visualization uncluttered -fact { all a: Address | some a.~address } - -//////////////////////////////////////////////////////////////////////////////// -// =Optional: opcode encoding restrictions= - -// the list of blessed fences -fact { Fence in - Fence.pr.sr - + Fence.pw.sw - + Fence.pr.pw.sw - + Fence.pr.sr.sw - + FenceTSO - + Fence.pr.pw.sr.sw -} - -pred restrict_to_current_encodings { - no (LoadNormal + StoreNormal) & (Acquire + Release) -} - -//////////////////////////////////////////////////////////////////////////////// -// =Alloy shortcuts= -pred acyclic[rel: Event->Event] { no iden & ^rel } -pred total[rel: Event->Event, bag: Event] { - all disj e, e': bag | e->e' in rel + ~rel - acyclic[rel] -} -\end{lstlisting}} - \caption{The RVWMO memory model formalized in Alloy (5/5: Auxiliaries)} - \label{fig:alloy5} -\end{figure} - diff --git a/src/memory-model-herd.tex b/src/memory-model-herd.tex deleted file mode 100644 index de4a59e..0000000 --- 
a/src/memory-model-herd.tex +++ /dev/null @@ -1,160 +0,0 @@ -\section{Formal Axiomatic Specification in Herd} -\label{sec:herd} - -The tool \textsf{herd} takes a memory model and a litmus test as input and simulates the execution of the test on top of the memory model. Memory models are written in the domain-specific language \textsc{Cat}. This section provides two \textsc{Cat} memory models of RVWMO. The first model, Figure~\ref{fig:herd2}, follows the \emph{global memory order} definition of~RVWMO given in Chapter~\ref{ch:memorymodel}, as much as is possible for a \textsc{Cat} model. The second model, Figure~\ref{fig:herd3}, is an equivalent, more efficient, partial-order-based RVWMO model. - -The simulator~\textsf{herd} is part of the \textsf{diy} tool suite --- see \url{http://diy.inria.fr} for software and documentation. The models and more are available online at~\url{http://diy.inria.fr/cats7/riscv/}. - -\begin{figure}[h!] - { - \tt\bfseries\centering\footnotesize - \begin{lstlisting} -(*************) -(* Utilities *) -(*************) - -(* All fence relations *) -let fence.r.r = [R];fencerel(Fence.r.r);[R] -let fence.r.w = [R];fencerel(Fence.r.w);[W] -let fence.r.rw = [R];fencerel(Fence.r.rw);[M] -let fence.w.r = [W];fencerel(Fence.w.r);[R] -let fence.w.w = [W];fencerel(Fence.w.w);[W] -let fence.w.rw = [W];fencerel(Fence.w.rw);[M] -let fence.rw.r = [M];fencerel(Fence.rw.r);[R] -let fence.rw.w = [M];fencerel(Fence.rw.w);[W] -let fence.rw.rw = [M];fencerel(Fence.rw.rw);[M] -let fence.tso = - let f = fencerel(Fence.tso) in - ([W];f;[W]) | ([R];f;[M]) - -let fence = - fence.r.r | fence.r.w | fence.r.rw | - fence.w.r | fence.w.w | fence.w.rw | - fence.rw.r | fence.rw.w | fence.rw.rw | - fence.tso - -(* Same address, no W to the same address in-between *) -let po-loc-no-w = po-loc \ (po-loc?;[W];po-loc) -(* Read same write *) -let rsw = rf^-1;rf -(* Acquire, or stronger *) -let AQ = Acq|AcqRel -(* Release or stronger *) -and RL = Rel|AcqRel -(* All RCsc *) -let RCsc =
Acq|Rel|AcqRel -(* Amo events are both R and W, relation rmw relates paired lr/sc *) -let AMO = R & W -let StCond = range(rmw) - -(*************) -(* ppo rules *) -(*************) - -(* Overlapping-Address Orderings *) -let r1 = [M];po-loc;[W] -and r2 = ([R];po-loc-no-w;[R]) \ rsw -and r3 = [AMO|StCond];rfi;[R] -(* Explicit Synchronization *) -and r4 = fence -and r5 = [AQ];po;[M] -and r6 = [M];po;[RL] -and r7 = [RCsc];po;[RCsc] -and r8 = rmw -(* Syntactic Dependencies *) -and r9 = [M];addr;[M] -and r10 = [M];data;[W] -and r11 = [M];ctrl;[W] -(* Pipeline Dependencies *) -and r12 = [R];(addr|data);[W];rfi;[R] -and r13 = [R];addr;[M];po;[W] - -let ppo = r1 | r2 | r3 | r4 | r5 | r6 | r7 | r8 | r9 | r10 | r11 | r12 | r13 -\end{lstlisting} - } - \caption{{\tt riscv-defs.cat}, a herd definition of preserved program order (1/3)} - \label{fig:herd1} -\end{figure} - -\begin{figure}[ht!] - { - \tt\bfseries\centering\footnotesize - \begin{lstlisting} -Total - -(* Notice that herd has defined its own rf relation *) - -(* Define ppo *) -include "riscv-defs.cat" - -(********************************) -(* Generate global memory order *) -(********************************) - -let gmo0 = (* precursor: i.e., build gmo as a total order that includes gmo0 *) - loc & (W\FW) * FW | # Final write after any write to the same location - ppo | # ppo compatible - rfe # includes herd external rf (optimization) - -(* Walk over all linear extensions of gmo0 *) -with gmo from linearizations(M\IW,gmo0) - -(* Add initial writes upfront -- convenient for computing rfGMO *) -let gmo = gmo | loc & IW * (M\IW) - -(**********) -(* Axioms *) -(**********) - -(* Compute rf according to the load value axiom, aka rfGMO *) -let WR = loc & ([W];(gmo|po);[R]) -let rfGMO = WR \ (loc&([W];gmo);WR) - -(* Check equality of herd rf and of rfGMO *) -empty (rf\rfGMO)|(rfGMO\rf) as RfCons - -(* Atomicity axiom *) -let infloc = (gmo & loc)^-1 -let inflocext = infloc & ext -let winside = (infloc;rmw;inflocext) &
(infloc;rf;rmw;inflocext) & [W] -empty winside as Atomic -\end{lstlisting} - } - \caption{{\tt riscv.cat}, a herd version of the RVWMO memory model (2/3)} - \label{fig:herd2} -\end{figure} - -\begin{figure}[h!] - { - \tt\bfseries\centering\footnotesize - \begin{lstlisting} -Partial - -(***************) -(* Definitions *) -(***************) - -(* Define ppo *) -include "riscv-defs.cat" - -(* Compute coherence relation *) -include "cos-opt.cat" - -(**********) -(* Axioms *) -(**********) - -(* Sc per location *) -acyclic co|rf|fr|po-loc as Coherence - -(* Main model axiom *) -acyclic co|rfe|fr|ppo as Model - -(* Atomicity axiom *) -empty rmw & (fre;coe) as Atomic -\end{lstlisting} - } - \caption{{\tt riscv.cat}, an alternative herd presentation of the RVWMO memory model (3/3)} - \label{fig:herd3} -\end{figure} - diff --git a/src/naming.tex b/src/naming.tex deleted file mode 100644 index bfd67d4..0000000 --- a/src/naming.tex +++ /dev/null @@ -1,189 +0,0 @@ -\chapter{ISA Extension Naming Conventions} -\label{naming} - -This chapter describes the RISC-V ISA extension naming scheme that is -used to concisely describe the set of instructions present in a -hardware implementation, or the set of instructions used by an -application binary interface (ABI). - -\begin{commentary} -The RISC-V ISA is designed to support a wide variety of -implementations with various experimental instruction-set extensions. -We have found that an organized naming scheme simplifies software -tools and documentation. -\end{commentary} - -\section{Case Sensitivity} - -The ISA naming strings are case insensitive. - -\section{Base Integer ISA} -RISC-V ISA strings begin with either RV32I, RV32E, RV64I, or RV128I -indicating the supported address space size in bits for the base -integer ISA. - -\section{Instruction-Set Extension Names} - -Standard ISA extensions are given a name consisting of a single -letter. 
For example, the first four standard -extensions to the integer bases are: -``M'' for integer multiplication and division, -``A'' for atomic memory instructions, -``F'' for single-precision floating-point instructions, and -``D'' for double-precision floating-point instructions. -Any RISC-V instruction-set variant can be succinctly described by -concatenating the base integer prefix with the names of the included -extensions, e.g., ``RV64IMAFD''. - -We have also defined an abbreviation ``G'' to represent the ``IMAFDZicsr\_Zifencei'' -base and extensions, as this is intended to represent our standard -general-purpose ISA. - -Standard extensions to the RISC-V ISA are given other reserved -letters, e.g., ``Q'' for quad-precision floating-point, or -``C'' for the 16-bit compressed instruction format. - -Some ISA extensions depend on the presence of other extensions, e.g., ``D'' -depends on ``F'' and ``F'' depends on ``Zicsr''. These dependences may be -implicit in the ISA name: for example, RV32IF is equivalent to RV32IFZicsr, -and RV32ID is equivalent to RV32IFD and RV32IFDZicsr. - -\section{Version Numbers} -Recognizing that instruction sets may expand or alter over time, we -encode extension version numbers following the extension name. Version -numbers are divided into major and minor version numbers, separated by -a ``p''. If the minor version is ``0'', then ``p0'' can be omitted -from the version string. Changes in major version numbers imply a -loss of backwards compatibility, whereas changes in only the minor -version number must be backwards-compatible. For example, the -original 64-bit standard ISA defined in release 1.0 of this manual can -be written in full as ``RV64I1p0M1p0A1p0F1p0D1p0'', more concisely as -``RV64I1M1A1F1D1''. - -We introduced the version numbering scheme with the second release. Hence, we -define the default version of a standard extension to be the version present at that -time, e.g., ``RV32I'' is equivalent to ``RV32I2''. 
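The version-number convention above (major and minor separated by ``p'', with ``p0'' optional) can be illustrated with a short Python sketch. The function name `expand_versions` and both regular expressions are hypothetical. The sketch assumes an upper-case base prefix and single-letter extension names, each followed by an explicit major version, and a lower-case ``p'' separator; it does not handle default versions or multi-letter extensions.

```python
import re

# Hypothetical pattern: one upper-case extension letter, a major
# version, and an optional "p<minor>" suffix.
_VERSIONED_EXT = re.compile(r"([A-Z])(\d+)(?:p(\d+))?")

# Hypothetical pattern for the whole string: base prefix such as RV64,
# followed by one or more versioned single-letter extensions.
_ISA_STRING = re.compile(r"(RV\d+)((?:[A-Z]\d+(?:p\d+)?)+)")


def expand_versions(isa):
    """Rewrite every version in `isa` in full major-p-minor form."""
    match = _ISA_STRING.fullmatch(isa)
    if match is None:
        raise ValueError(f"unsupported ISA string: {isa!r}")
    base, extensions = match.groups()
    parts = []
    for letter, major, minor in _VERSIONED_EXT.findall(extensions):
        # An omitted minor version means "p0".
        parts.append(f"{letter}{major}p{minor or 0}")
    return base + "".join(parts)
```

Applied to the concise form ``RV64I1M1A1F1D1'', the function reproduces the full form quoted in the text.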
- -\section{Underscores} - -Underscores ``\_'' may be used to separate ISA extensions to -improve readability and to provide disambiguation, e.g., ``RV32I2\_M2\_A2''. - -Because the ``P'' extension for Packed SIMD can be confused for the decimal -point in a version number, it must be preceded by an underscore if it follows -a number. For example, ``rv32i2p2'' means version 2.2 of RV32I, whereas -``rv32i2\_p2'' means version 2.0 of RV32I with version 2.0 of the P extension. - -\section{Additional Standard Extension Names} - -Standard extensions can also be named using a single ``Z'' followed by an -alphabetical name and an optional version number. For example, -``Zifencei'' names the instruction-fetch fence extension described in -Chapter~\ref{chap:zifencei}; ``Zifencei2'' and ``Zifencei2p0'' name version -2.0 of same. - -The first letter following the ``Z'' conventionally indicates the most closely -related alphabetical extension category, IMAFDQCVH. For the ``Zam'' -extension for misaligned atomics, for example, the letter ``a'' indicates the -extension is related to the ``A'' standard extension. If multiple ``Z'' -extensions are named, they should be ordered first by category, then -alphabetically within a category---for example, ``Zicsr\_Zifencei\_Zam''. - -Extensions with the ``Z'' prefix must be separated -from other multi-letter extensions by an underscore, e.g., -``RV32IMACZicsr\_Zifencei''. - -\section{Supervisor-level Instruction-Set Extensions} - -Standard supervisor-level instruction-set extensions are defined in Volume II, -but are named using ``S'' as a prefix, followed by an alphabetical name and an -optional version number. Supervisor-level extensions must be separated from -other multi-letter extensions by an underscore. - -Standard supervisor-level extensions should be listed after standard -unprivileged extensions. If multiple supervisor-level extensions are listed, -they should be ordered alphabetically. 
- -\section{Machine-level Instruction-Set Extensions} - -Standard machine-level instruction-set extensions are prefixed with the three -letters ``Zxm''. - -Standard machine-level extensions should be listed after standard -lesser-privileged extensions. If multiple machine-level extensions are listed, -they should be ordered alphabetically. - -\section{Non-Standard Extension Names} - -Non-standard extensions are named using a single ``X'' followed by an -alphabetical name and an optional version number. -For example, ``Xhwacha'' names the Hwacha vector-fetch ISA extension; -``Xhwacha2'' and ``Xhwacha2p0'' name version 2.0 of same. - -Non-standard extensions must be listed after all standard extensions. -They must be separated from other multi-letter extensions -by an underscore. For example, an ISA with non-standard extensions -Argle and Bargle may be named ``RV64IZifencei\_Xargle\_Xbargle''. - -If multiple non-standard extensions are listed, they should be ordered -alphabetically. - -\section{Subset Naming Convention} -Table~\ref{isanametable} summarizes the standardized extension names. 
-~\\ -\begin{table}[h] -\center -\begin{tabular}{|l|c|c|} -\hline -Subset & Name & Implies \\ -\hline -\hline -\multicolumn{3}{|c|}{Base ISA}\\ -\hline -Integer & I & \\ -Reduced Integer & E & \\ -\hline -\hline -\multicolumn{3}{|c|}{Standard Unprivileged Extensions}\\ -\hline -Integer Multiplication and Division & M & \\ -Atomics & A & \\ -Single-Precision Floating-Point & F & Zicsr \\ -Double-Precision Floating-Point & D & F \\ -\hline -General & G & IMAFDZicsr\_Zifencei \\ -\hline -Quad-Precision Floating-Point & Q & D\\ -16-bit Compressed Instructions & C & \\ -Packed-SIMD Extensions & P & \\ -Vector Extension & V & D \\ -Hypervisor Extension & H & \\ -Control and Status Register Access & Zicsr & \\ -Instruction-Fetch Fence & Zifencei & \\ -Misaligned Atomics & Zam & A \\ -Total Store Ordering & Ztso & \\ -\hline -\hline -\multicolumn{3}{|c|}{Standard Supervisor-Level Extensions}\\ -\hline -Supervisor-level extension ``def'' & Sdef & \\ -\hline -\hline -\multicolumn{3}{|c|}{Standard Machine-Level Extensions}\\ -\hline -Machine-level extension ``jkl'' & Zxmjkl & \\ -\hline -\hline -\multicolumn{3}{|c|}{Non-Standard Extensions}\\ -\hline -Non-standard extension ``mno'' & Xmno & \\ -\hline -\end{tabular} -\caption{Standard ISA extension names. 
The table also defines the - canonical order in which extension names must appear in the name - string, with top-to-bottom in table indicating first-to-last in the - name string, e.g., RV32IMACV is legal, whereas RV32IMAVC is not.} -\label{isanametable} -\end{table} - - diff --git a/src/p.tex b/src/p.tex deleted file mode 100644 index dac4e4f..0000000 --- a/src/p.tex +++ /dev/null @@ -1,14 +0,0 @@ -\chapter{``P'' Standard Extension for Packed-SIMD Instructions, - Version 0.2} -\label{sec:packedsimd} - -\begin{commentary} - Discussions at the 5th RISC-V workshop indicated a desire to drop - this packed-SIMD proposal for floating-point registers in favor of - standardizing on the V extension for large floating-point SIMD - operations. However, there was interest in packed-SIMD fixed-point - operations for use in the integer registers of small RISC-V - implementations. A task group is working to define the new P - extension. -\end{commentary} - |