author Krste Asanovic <krste@eecs.berkeley.edu> 2018-08-07 23:01:00 -0700
committer Krste Asanovic <krste@eecs.berkeley.edu> 2018-08-07 23:01:00 -0700
commit 627495d8a6935b03f3c09164f27a67393ef3173d
tree af5a1bbb8ce1440ff3e7426d84fe779b984fa59d
parent 22a0383af387e761b52f521a63c6e45f7fff7f74
Broke out actual perf counters into separate chapter.
-rw-r--r-- src/counters.tex 184
-rw-r--r-- src/csr.tex 193
-rw-r--r-- src/preface.tex 2
-rw-r--r-- src/riscv-spec.tex 1
4 files changed, 201 insertions, 179 deletions
diff --git a/src/counters.tex b/src/counters.tex
new file mode 100644
index 0000000..adc8e0c
--- /dev/null
+++ b/src/counters.tex
@@ -0,0 +1,184 @@
+\chapter{Counters}
+\label{counters}
+
+RISC-V ISAs provide a set of up to 32$\times$64-bit performance counters and
+timers that are accessible via unprivileged XLEN read-only CSR
+registers {\tt 0xC00}--{\tt 0xC1F} (with the upper 32 bits accessed
+via CSR registers {\tt 0xC80}--{\tt 0xC9F} on RV32). The first three
+of these (CYCLE, TIME, and INSTRET) have dedicated functions (cycle
+count, real-time clock, and instructions-retired respectively), while
+the remaining counters, if implemented, provide programmable event
+counting.
+
+\section{Base Counters and Timers}
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{M@{}R@{}F@{}R@{}S}
+\\
+\instbitrange{31}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{csr} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+12 & 5 & 3 & 5 & 7 \\
+RDCYCLE[H] & 0 & CSRRS & dest & SYSTEM \\
+RDTIME[H] & 0 & CSRRS & dest & SYSTEM \\
+RDINSTRET[H] & 0 & CSRRS & dest & SYSTEM \\
+\end{tabular}
+\end{center}
+
+RV32I provides a number of 64-bit read-only user-level counters, which
+are mapped into the 12-bit CSR address space and accessed in 32-bit
+pieces using CSRRS instructions. In RV64I, the CSR instructions can
+manipulate 64-bit CSRs. In particular, the RDCYCLE, RDTIME, and
+RDINSTRET pseudoinstructions read the full 64 bits of the {\tt cycle},
+{\tt time}, and {\tt instret} counters. Hence, the RDCYCLEH, RDTIMEH,
+and RDINSTRETH instructions are not required in RV64I.
+
+\begin{commentary}
+Some execution environments might prohibit access to counters to
+impede timing side-channel attacks.
+\end{commentary}
+
+The RDCYCLE pseudoinstruction reads the low XLEN bits of the {\tt
+ cycle} CSR, which holds a count of the number of clock cycles
+executed by the processor core on which the hart is running from an
+arbitrary start time in the past. RDCYCLEH is an RV32I
+instruction that reads bits 63--32 of the same cycle counter. The
+underlying 64-bit counter should never overflow in practice. The rate
+at which the cycle counter advances will depend on the implementation
+and operating environment. The execution environment should provide a
+means to determine the current rate (cycles/second) at which the cycle
+counter is incrementing.
+
+\begin{commentary}
+RDCYCLE is intended to return the number of cycles executed by the
+processor core, not the hart. Precisely defining what is a ``core'' is
+difficult given some implementation choices (e.g., AMD Bulldozer).
+Precisely defining what is a ``clock cycle'' is also difficult given the
+range of implementations (including software emulations), but the
+intent is that RDCYCLE is used for performance monitoring along with the
+other performance counters. In particular, where there is one
+hart/core, one would expect cycle-count/instructions-retired to
+measure CPI for a hart.
+
+Cores don't have to be exposed to software at all, and an implementor
+might choose to pretend multiple harts on one physical core are
+running on separate cores with one hart/core, and provide separate
+cycle counters for each hart. This might make sense in a simple
+barrel processor (e.g., CDC 6600 peripheral processors) where
+inter-hart timing interactions are non-existent or minimal.
+
+Where there is more than one hart/core and dynamic multithreading, it
+is not generally possible to separate out cycles per hart (especially
+with SMT). It might be possible to define a separate performance
+counter that tried to capture the number of cycles a particular hart
+was running, but this definition would have to be very fuzzy to cover
+all the possible threading implementations. For example, should we
+only count cycles for which any instruction was issued to execution
+for this hart, and/or cycles any instruction retired, or include
+cycles this hart was occupying machine resources but couldn't execute
+due to stalls while other harts went into execution? Likely, ``all of
+the above'' would be needed to have understandable performance stats.
+This complexity of defining a per-hart cycle count, and also the need
+in any case for a total per-core cycle count when tuning multithreaded
+code led to just standardizing the per-core cycle counter, which also
+happens to work well for the common single hart/core case.
+
+Standardizing what happens during ``sleep'' is not practical given
+that what ``sleep'' means is not standardized across execution
+environments, but if the entire core is paused (entirely clock-gated
+or powered-down in deep sleep), then it is not executing clock cycles,
+and the cycle count shouldn't be increasing per the spec. There are
+many details, e.g., whether clock cycles required to reset a processor
+after waking up from a power-down event should be counted, and these
+are considered execution-environment-specific details.
+
+Even though there is no precise definition that works for all
+platforms, this is still a useful facility for most platforms, and an
+imprecise, common, ``usually correct'' standard here is better than no
+standard. The intent of RDCYCLE was primarily performance
+monitoring/tuning, and the specification was written with that goal in
+mind.
+\end{commentary}
+
+The RDTIME pseudoinstruction reads the low XLEN bits of the {\tt
+ time} CSR, which counts wall-clock real time that has passed from an
+arbitrary start time in the past. RDTIMEH is an RV32I-only instruction
+that reads bits 63--32 of the same real-time counter. The underlying 64-bit
+counter should never overflow in practice. The execution environment
+should provide a means of determining the period of the real-time
+counter (seconds/tick). The period must be constant. The
+real-time clocks of all harts in a single user application
+should be synchronized to within one tick of the real-time clock. The
+environment should provide a means to determine the accuracy of the
+clock.
+
+\begin{commentary}
+On some simple platforms, cycle count might represent a valid
+implementation of RDTIME, but in this case, platforms should implement
+the RDTIME instruction as an alias for RDCYCLE to make code more
+portable, rather than using RDCYCLE to measure wall-clock time.
+\end{commentary}
+
+The RDINSTRET pseudoinstruction reads the low XLEN bits of the {\tt
+ instret} CSR, which counts the number of instructions retired by
+this hart from some arbitrary start point in the past. RDINSTRETH is
+an RV32I-only instruction that reads bits 63--32 of the same
+instruction counter. The underlying 64-bit counter should never
+overflow in practice.
+
+The following code sequence will read a valid 64-bit cycle counter value into
+{\tt x3}:{\tt x2}, even if the low 32 bits of the counter overflow between
+the reads of its upper and lower halves.
+
+\begin{figure}[h!]
+\begin{center}
+\begin{verbatim}
+ again:
+ rdcycleh x3
+ rdcycle x2
+ rdcycleh x4
+ bne x3, x4, again
+\end{verbatim}
+\end{center}
+\caption{Sample code for reading the 64-bit cycle counter in RV32.}
+\label{rdcycle}
+\end{figure}
+
+\begin{commentary}
+We recommend provision of these basic counters in implementations
+as they are essential for basic performance analysis, adaptive and
+dynamic optimization, and to allow an application to work with
+real-time streams. Additional counters should be provided to help
+diagnose performance problems and these should be made accessible from
+user-level application code with low overhead.
+
+We required the counters be 64 bits wide, even on RV32, as otherwise
+it is very difficult for software to determine if values have
+overflowed. For a low-end implementation, the upper 32 bits of each
+counter can be implemented using software counters incremented by a
+trap handler triggered by overflow of the lower 32 bits. The sample
+code described above shows how the full 64-bit width value can be
+safely read using the individual 32-bit instructions.
+
+In some applications, it is important to be able to read multiple
+counters at the same instant in time. When run under a multitasking
+environment, a user thread can suffer a context switch while
+attempting to read the counters. One solution is for the user thread
+to read the real-time counter before and after reading the other
+counters to determine if a context switch occurred in the middle of the
+sequence, in which case the reads can be retried. We considered
+adding output latches to allow a user thread to snapshot the counter
+values atomically, but this would increase the size of the user
+context, especially for implementations with a richer set of counters.
+\end{commentary}
+
diff --git a/src/csr.tex b/src/csr.tex
index 16877ad..1d02710 100644
--- a/src/csr.tex
+++ b/src/csr.tex
@@ -1,19 +1,28 @@
\chapter{Control and Status Register (CSR) Instructions}
\label{csrinsts}
-This chapter defines the full set of CSR instructions, although the
-control and status registers are primarily used by the privileged
-architecture. There are several uses in unprivileged code including
-for counters and timers, and floating-point status.
+RISC-V defines a separate address space of 4096 Control and Status
+registers associated with each hart. This chapter defines the full
+set of CSR instructions that operate on these CSRs.
\begin{commentary}
+ While CSRs are primarily used by the privileged architecture, there
+ are several uses in unprivileged code including for counters and
+ timers, and for floating-point status.
+
The counters and timers are no longer considered mandatory parts of
- the standard base ISAs, and so have been moved into this separate
- chapter.
+ the standard base ISAs, and so the CSR instructions required to
+ access them have been moved out of the base ISA chapter into this
+ separate chapter.
\end{commentary}
\section{CSR Instructions}
+All CSR instructions atomically read-modify-write a single CSR, whose
+CSR specifier is encoded in the 12-bit {\em csr} field of the
+instruction held in bits 31--20. The immediate forms use a 5-bit
+zero-extended immediate encoded in the {\em rs1} field.
+
\vspace{-0.2in}
\begin{center}
\begin{tabular}{M@{}R@{}F@{}R@{}S}
@@ -103,175 +112,3 @@ Further assembler pseudoinstructions are defined to set and clear
bits in the CSR when the old value is not required: CSRS/CSRC {\em
csr, rs1}; CSRSI/CSRCI {\em csr, uimm}.
-\section{Timers and Counters}
-
-\vspace{-0.2in}
-\begin{center}
-\begin{tabular}{M@{}R@{}F@{}R@{}S}
-\\
-\instbitrange{31}{20} &
-\instbitrange{19}{15} &
-\instbitrange{14}{12} &
-\instbitrange{11}{7} &
-\instbitrange{6}{0} \\
-\hline
-\multicolumn{1}{|c|}{csr} &
-\multicolumn{1}{c|}{rs1} &
-\multicolumn{1}{c|}{funct3} &
-\multicolumn{1}{c|}{rd} &
-\multicolumn{1}{c|}{opcode} \\
-\hline
-12 & 5 & 3 & 5 & 7 \\
-RDCYCLE[H] & 0 & CSRRS & dest & SYSTEM \\
-RDTIME[H] & 0 & CSRRS & dest & SYSTEM \\
-RDINSTRET[H] & 0 & CSRRS & dest & SYSTEM \\
-\end{tabular}
-\end{center}
-
-RV32I provides a number of 64-bit read-only user-level counters, which
-are mapped into the 12-bit CSR address space and accessed in 32-bit
-pieces using CSRRS instructions. In RV64I, the CSR instructions can
-manipulate 64-bit CSRs. In particular, the RDCYCLE, RDTIME, and
-RDINSTRET pseudoinstructions read the full 64 bits of the {\tt cycle},
-{\tt time}, and {\tt instret} counters. Hence, the RDCYCLEH, RDTIMEH,
-and RDINSTRETH instructions are not required in RV64I.
-
-\begin{commentary}
-Some execution environments might prohibit access to counters to
-impede timing side-channel attacks.
-\end{commentary}
-
-The RDCYCLE pseudoinstruction reads the low XLEN bits of the {\tt
- cycle} CSR which holds a count of the number of clock cycles
-executed by the processor core on which the hart is running from
-an arbitrary start time in the past. RDCYCLEH is
-an RV32I-only instruction that reads bits 63--32 of the same cycle
-counter. The underlying 64-bit counter should never overflow in
-practice. The rate at which the cycle counter advances will depend on
-the implementation and operating environment. The execution
-environment should provide a means to determine the current rate
-(cycles/second) at which the cycle counter is incrementing.
-
-\begin{commentary}
-RDCYCLE is intended to return the number of cycles executed by the
-processor core, not the hart. Precisely defining what is a ``core'' is
-difficult given some implementation choices (e.g., AMD Bulldozer).
-Precisely defining what is a ``clock cycle'' is also difficult given the
-range of implementations (including software emulations), but the
-intent is that RDCYCLE is used for performance monitoring along with the
-other performance counters. In particular, where there is one
-hart/core, one would expect cycle-count/instructions-retired to
-measure CPI for a hart.
-
-Cores don't have to be exposed to software at all, and an implementor
-might choose to pretend multiple harts on one physical core are
-running on separate cores with one hart/core, and provide separate
-cycle counters for each hart. This might make sense in a simple
-barrel processor (e.g., CDC 6600 peripheral processors) where
-inter-hart timing interactions are non-existent or minimal.
-
-Where there is more than one hart/core and dynamic multithreading, it
-is not generally possible to separate out cycles per hart (especially
-with SMT). It might be possible to define a separate performance
-counter that tried to capture the number of cycles a particular hart
-was running, but this definition would have to be very fuzzy to cover
-all the possible threading implementations. For example, should we
-only count cycles for which any instruction was issued to execution
-for this hart, and/or cycles any instruction retired, or include
-cycles this hart was occupying machine resources but couldn't execute
-due to stalls while other harts went into execution? Likely, ``all of
-the above'' would be needed to have understandable performance stats.
-This complexity of defining a per-hart cycle count, and also the need
-in any case for a total per-core cycle count when tuning multithreaded
-code led to just standardizing the per-core cycle counter, which also
-happens to work well for the common single hart/core case.
-
-Standardizing what happens during ``sleep'' is not practical given
-that what ``sleep'' means is not standardized across execution
-environments, but if the entire core is paused (entirely clock-gated
-or powered-down in deep sleep), then it is not executing clock cycles,
-and the cycle count shouldn't be increasing per the spec. There are
-many details, e.g., whether clock cycles required to reset a processor
-after waking up from a power-down event should be counted, and these
-are considered execution-environment-specific details.
-
-Even though there is no precise definition that works for all
-platforms, this is still a useful facility for most platforms, and an
-imprecise, common, ``usually correct'' standard here is better than no
-standard. The intent of RDCYCLE was primarily performance
-monitoring/tuning, and the specification was written with that goal in
-mind.
-\end{commentary}
-
-The RDTIME pseudoinstruction reads the low XLEN bits of the {\tt
- time} CSR, which counts wall-clock real time that has passed from an
-arbitrary start time in the past. RDTIMEH is an RV32I-only instruction
-that reads bits 63--32 of the same real-time counter. The underlying 64-bit
-counter should never overflow in practice. The execution environment
-should provide a means of determining the period of the real-time
-counter (seconds/tick). The period must be constant. The
-real-time clocks of all harts in a single user application
-should be synchronized to within one tick of the real-time clock. The
-environment should provide a means to determine the accuracy of the
-clock.
-
-\begin{commentary}
-On some simple platforms, cycle count might represent a valid
-implementation of RDTIME, but in this case, platforms should implement
-the RDTIME instruction as an alias for RDCYCLE to make code more
-portable, rather than using RDCYCLE to measure wall-clock time.
-\end{commentary}
-
-The RDINSTRET pseudoinstruction reads the low XLEN bits of the {\tt
- instret} CSR, which counts the number of instructions retired by
-this hart from some arbitrary start point in the past. RDINSTRETH is
-an RV32I-only instruction that reads bits 63--32 of the same
-instruction counter. The underlying 64-bit counter that should never
-overflow in practice.
-
-The following code sequence will read a valid 64-bit cycle counter value into
-{\tt x3}:{\tt x2}, even if the counter overflows between reading its upper
-and lower halves.
-
-\begin{figure}[h!]
-\begin{center}
-\begin{verbatim}
- again:
- rdcycleh x3
- rdcycle x2
- rdcycleh x4
- bne x3, x4, again
-\end{verbatim}
-\end{center}
-\caption{Sample code for reading the 64-bit cycle counter in RV32.}
-\label{rdcycle}
-\end{figure}
-
-\begin{commentary}
-We would like these basic counters be provided in all implementations as
-they are essential for basic performance analysis, adaptive and
-dynamic optimization, and to allow an application to work with
-real-time streams. Additional counters should be provided to help
-diagnose performance problems and these should be made accessible from
-user-level application code with low overhead.
-
-We required the counters be 64 bits wide, even on RV32, as otherwise
-it is very difficult for software to determine if values have
-overflowed. For a low-end implementation, the upper 32 bits of each
-counter can be implemented using software counters incremented by a
-trap handler triggered by overflow of the lower 32 bits. The sample
-code described above shows how the full 64-bit width value can be
-safely read using the individual 32-bit instructions.
-
-In some applications, it is important to be able to read multiple
-counters at the same instant in time. When run under a multitasking
-environment, a user thread can suffer a context switch while
-attempting to read the counters. One solution is for the user thread
-to read the real-time counter before and after reading the other
-counters to determine if a context switch occurred in the middle of the
-sequence, in which case the reads can be retried. We considered
-adding output latches to allow a user thread to snapshot the counter
-values atomically, but this would increase the size of the user
-context, especially for implementations with a richer set of counters.
-\end{commentary}
-
diff --git a/src/preface.tex b/src/preface.tex
index ad1681d..06cb435 100644
--- a/src/preface.tex
+++ b/src/preface.tex
@@ -67,7 +67,7 @@ The major changes in this version of the document include:
produced illegal instruction exceptions in RV32E and RV64I chapters.
\item Counter/timer instructions are now not considered part of
mandatory base ISA, and so CSR instructions were moved into separate
- chapter.
+ chapter, with the unprivileged counters moved into another separate chapter.
\item Defined the signed-zero behavior of FMIN.{\em fmt} and FMAX.{\em fmt},
and changed their behavior on signaling-NaN inputs to conform to the
minimumNumber and maximumNumber operations in the proposed IEEE 754-201x
diff --git a/src/riscv-spec.tex b/src/riscv-spec.tex
index 02d77bf..9cfba79 100644
--- a/src/riscv-spec.tex
+++ b/src/riscv-spec.tex
@@ -77,6 +77,7 @@ Andrew Waterman and Krste Asanovi\'{c}, RISC-V Foundation, May 2017.
\input{m}
\input{a}
\input{csr}
+\input{counters}
\input{f}
\input{d}
\input{q}
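Reviewer's sketch (not part of the patch): the CSR address layout stated at the top of the new Counters chapter (XLEN-wide views at 0xC00 to 0xC1F, upper halves at 0xC80 to 0xC9F on RV32) can be written down as a small C helper. The function and constant names are hypothetical, and the sketch assumes counters are indexed 0 to 31 in address order, with CYCLE, TIME, and INSTRET as the first three.

```c
#include <assert.h>

/* Address ranges quoted in the chapter text. */
enum {
    COUNTER_CSR_BASE      = 0xC00,  /* XLEN-wide read-only view    */
    COUNTER_CSR_HIGH_BASE = 0xC80,  /* bits 63..32, used on RV32   */
    COUNTER_COUNT         = 32      /* up to 32 counters in total  */
};

/* Hypothetical helper: CSR number of counter `index`,
 * or -1 if the index is out of range. */
static int counter_csr(unsigned index) {
    return index < COUNTER_COUNT ? COUNTER_CSR_BASE + (int)index : -1;
}

/* Hypothetical helper: CSR number of the upper-half view of counter
 * `index` on RV32, or -1 if the index is out of range. */
static int counter_high_csr(unsigned index) {
    return index < COUNTER_COUNT ? COUNTER_CSR_HIGH_BASE + (int)index : -1;
}
```

Under that indexing, `counter_csr(0)`, `counter_csr(1)`, and `counter_csr(2)` give 0xC00, 0xC01, and 0xC02 for the cycle, time, and instret counters, and `counter_high_csr(0)` gives 0xC80 for the upper half of the cycle counter.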