From 627495d8a6935b03f3c09164f27a67393ef3173d Mon Sep 17 00:00:00 2001 From: Krste Asanovic Date: Tue, 7 Aug 2018 23:01:00 -0700 Subject: Broke out actual perf counters into separate chapter. --- src/counters.tex | 184 ++++++++++++++++++++++++++++++++++++++++++++++++++ src/csr.tex | 193 +++++------------------------------------------------ src/preface.tex | 2 +- src/riscv-spec.tex | 1 + 4 files changed, 201 insertions(+), 179 deletions(-) create mode 100644 src/counters.tex diff --git a/src/counters.tex b/src/counters.tex new file mode 100644 index 0000000..adc8e0c --- /dev/null +++ b/src/counters.tex @@ -0,0 +1,184 @@ +\chapter{Counters} +\label{counters} + +RISC-V ISAs provide a set of up to 32$\times$64-bit performance counters and +timers that are accessible via unprivileged XLEN read-only CSR +registers {\tt 0xC00}--{\tt 0xC1F} (with the upper 32 bits accessed +via CSR registers {\tt 0xC80}--{\tt 0xC9F} on RV32). The first three +of these (CYCLE, TIME, and INSTRET) have dedicated functions (cycle +count, real-time clock, and instructions-retired respectively), while +the remaining counters, if implemented, provide programmable event +counting. + +\section{Base Counters and Timers} + +\vspace{-0.2in} +\begin{center} +\begin{tabular}{M@{}R@{}F@{}R@{}S} +\\ +\instbitrange{31}{20} & +\instbitrange{19}{15} & +\instbitrange{14}{12} & +\instbitrange{11}{7} & +\instbitrange{6}{0} \\ +\hline +\multicolumn{1}{|c|}{csr} & +\multicolumn{1}{c|}{rs1} & +\multicolumn{1}{c|}{funct3} & +\multicolumn{1}{c|}{rd} & +\multicolumn{1}{c|}{opcode} \\ +\hline +12 & 5 & 3 & 5 & 7 \\ +RDCYCLE[H] & 0 & CSRRS & dest & SYSTEM \\ +RDTIME[H] & 0 & CSRRS & dest & SYSTEM \\ +RDINSTRET[H] & 0 & CSRRS & dest & SYSTEM \\ +\end{tabular} +\end{center} + +RV32I provides a number of 64-bit read-only user-level counters, which +are mapped into the 12-bit CSR address space and accessed in 32-bit +pieces using CSRRS instructions. In RV64I, the CSR instructions can +manipulate 64-bit CSRs. 
In particular, the RDCYCLE, RDTIME, and +RDINSTRET pseudoinstructions read the full 64 bits of the {\tt cycle}, +{\tt time}, and {\tt instret} counters. Hence, the RDCYCLEH, RDTIMEH, +and RDINSTRETH instructions are not required in RV64I. + +\begin{commentary} +Some execution environments might prohibit access to counters to +impede timing side-channel attacks. +\end{commentary} + +The RDCYCLE pseudoinstruction reads the low XLEN bits of the {\tt + cycle} CSR which holds a count of the number of clock cycles +executed by the processor core on which the hart is running from an +arbitrary start time in the past. RDCYCLEH is an RV32I +instruction that reads bits 63--32 of the same cycle counter. The +underlying 64-bit counter should never overflow in practice. The rate +at which the cycle counter advances will depend on the implementation +and operating environment. The execution environment should provide a +means to determine the current rate (cycles/second) at which the cycle +counter is incrementing. + +\begin{commentary} +RDCYCLE is intended to return the number of cycles executed by the +processor core, not the hart. Precisely defining what is a ``core'' is +difficult given some implementation choices (e.g., AMD Bulldozer). +Precisely defining what is a ``clock cycle'' is also difficult given the +range of implementations (including software emulations), but the +intent is that RDCYCLE is used for performance monitoring along with the +other performance counters. In particular, where there is one +hart/core, one would expect cycle-count/instructions-retired to +measure CPI for a hart. + +Cores don't have to be exposed to software at all, and an implementor +might choose to pretend multiple harts on one physical core are +running on separate cores with one hart/core, and provide separate +cycle counters for each hart. 
This might make sense in a simple +barrel processor (e.g., CDC 6600 peripheral processors) where +inter-hart timing interactions are non-existent or minimal. + +Where there is more than one hart/core and dynamic multithreading, it +is not generally possible to separate out cycles per hart (especially +with SMT). It might be possible to define a separate performance +counter that tried to capture the number of cycles a particular hart +was running, but this definition would have to be very fuzzy to cover +all the possible threading implementations. For example, should we +only count cycles for which any instruction was issued to execution +for this hart, and/or cycles any instruction retired, or include +cycles this hart was occupying machine resources but couldn't execute +due to stalls while other harts went into execution? Likely, ``all of +the above'' would be needed to have understandable performance stats. +This complexity of defining a per-hart cycle count, and also the need +in any case for a total per-core cycle count when tuning multithreaded +code led to just standardizing the per-core cycle counter, which also +happens to work well for the common single hart/core case. + +Standardizing what happens during ``sleep'' is not practical given +that what ``sleep'' means is not standardized across execution +environments, but if the entire core is paused (entirely clock-gated +or powered-down in deep sleep), then it is not executing clock cycles, +and the cycle count shouldn't be increasing per the spec. There are +many details, e.g., whether clock cycles required to reset a processor +after waking up from a power-down event should be counted, and these +are considered execution-environment-specific details. + +Even though there is no precise definition that works for all +platforms, this is still a useful facility for most platforms, and an +imprecise, common, ``usually correct'' standard here is better than no +standard. 
The intent of RDCYCLE was primarily +monitoring/tuning, and the specification was written with that goal in +mind. +\end{commentary} + +The RDTIME pseudoinstruction reads the low XLEN bits of the {\tt + time} CSR, which counts wall-clock real time that has passed from an +arbitrary start time in the past. RDTIMEH is an RV32I-only instruction +that reads bits 63--32 of the same real-time counter. The underlying 64-bit +counter should never overflow in practice. The execution environment +should provide a means of determining the period of the real-time +counter (seconds/tick). The period must be constant. The +real-time clocks of all harts in a single user application +should be synchronized to within one tick of the real-time clock. The +environment should provide a means to determine the accuracy of the +clock. + +\begin{commentary} +On some simple platforms, cycle count might represent a valid +implementation of RDTIME, but in this case, platforms should implement +the RDTIME instruction as an alias for RDCYCLE to make code more +portable, rather than using RDCYCLE to measure wall-clock time. +\end{commentary} + +The RDINSTRET pseudoinstruction reads the low XLEN bits of the {\tt + instret} CSR, which counts the number of instructions retired by +this hart from some arbitrary start point in the past. RDINSTRETH is +an RV32I-only instruction that reads bits 63--32 of the same +instruction counter. The underlying 64-bit counter should never +overflow in practice. + +The following code sequence will read a valid 64-bit cycle counter value into +{\tt x3}:{\tt x2}, even if the counter overflows its lower half between reading its upper +and lower halves. + +\begin{figure}[h!]
+\begin{center} +\begin{verbatim} + again: + rdcycleh x3 + rdcycle x2 + rdcycleh x4 + bne x3, x4, again +\end{verbatim} +\end{center} +\caption{Sample code for reading the 64-bit cycle counter in RV32.} +\label{rdcycle} +\end{figure} + +\begin{commentary} +We recommend provision of these basic counters in implementations +as they are essential for basic performance analysis, adaptive and +dynamic optimization, and to allow an application to work with +real-time streams. Additional counters should be provided to help +diagnose performance problems and these should be made accessible from +user-level application code with low overhead. + +We required the counters be 64 bits wide, even on RV32, as otherwise +it is very difficult for software to determine if values have +overflowed. For a low-end implementation, the upper 32 bits of each +counter can be implemented using software counters incremented by a +trap handler triggered by overflow of the lower 32 bits. The sample +code described above shows how the full 64-bit width value can be +safely read using the individual 32-bit instructions. + +In some applications, it is important to be able to read multiple +counters at the same instant in time. When run under a multitasking +environment, a user thread can suffer a context switch while +attempting to read the counters. One solution is for the user thread +to read the real-time counter before and after reading the other +counters to determine if a context switch occurred in the middle of the +sequence, in which case the reads can be retried. We considered +adding output latches to allow a user thread to snapshot the counter +values atomically, but this would increase the size of the user +context, especially for implementations with a richer set of counters. 
+\end{commentary} + diff --git a/src/csr.tex b/src/csr.tex index 16877ad..1d02710 100644 --- a/src/csr.tex +++ b/src/csr.tex @@ -1,19 +1,28 @@ \chapter{Control and Status Register (CSR) Instructions} \label{csrinsts} -This chapter defines the full set of CSR instructions, although the -control and status registers are primarily used by the privileged -architecture. There are several uses in unprivileged code including -for counters and timers, and floating-point status. +RISC-V defines a separate address space of 4096 Control and Status +registers associated with each hart. This chapter defines the full +set of CSR instructions that operate on these CSRs. \begin{commentary} + While CSRs are primarily used by the privileged architecture, there + are several uses in unprivileged code including for counters and + timers, and for floating-point status. + The counters and timers are no longer considered mandatory parts of - the standard base ISAs, and so have been moved into this separate - chapter. + the standard base ISAs, and so the CSR instructions required to + access them have been moved out of the base ISA chapter into this + separate chapter. \end{commentary} \section{CSR Instructions} +All CSR instructions atomically read-modify-write a single CSR, whose +CSR specifier is encoded in the 12-bit {\em csr} field of the +instruction held in bits 31--20. The immediate forms use a 5-bit +zero-extended immediate encoded in the {\em rs1} field. + \vspace{-0.2in} \begin{center} \begin{tabular}{M@{}R@{}F@{}R@{}S} @@ -103,175 +112,3 @@ Further assembler pseudoinstructions are defined to set and clear bits in the CSR when the old value is not required: CSRS/CSRC {\em csr, rs1}; CSRSI/CSRCI {\em csr, uimm}. 
-\section{Timers and Counters} - -\vspace{-0.2in} -\begin{center} -\begin{tabular}{M@{}R@{}F@{}R@{}S} -\\ -\instbitrange{31}{20} & -\instbitrange{19}{15} & -\instbitrange{14}{12} & -\instbitrange{11}{7} & -\instbitrange{6}{0} \\ -\hline -\multicolumn{1}{|c|}{csr} & -\multicolumn{1}{c|}{rs1} & -\multicolumn{1}{c|}{funct3} & -\multicolumn{1}{c|}{rd} & -\multicolumn{1}{c|}{opcode} \\ -\hline -12 & 5 & 3 & 5 & 7 \\ -RDCYCLE[H] & 0 & CSRRS & dest & SYSTEM \\ -RDTIME[H] & 0 & CSRRS & dest & SYSTEM \\ -RDINSTRET[H] & 0 & CSRRS & dest & SYSTEM \\ -\end{tabular} -\end{center} - -RV32I provides a number of 64-bit read-only user-level counters, which -are mapped into the 12-bit CSR address space and accessed in 32-bit -pieces using CSRRS instructions. In RV64I, the CSR instructions can -manipulate 64-bit CSRs. In particular, the RDCYCLE, RDTIME, and -RDINSTRET pseudoinstructions read the full 64 bits of the {\tt cycle}, -{\tt time}, and {\tt instret} counters. Hence, the RDCYCLEH, RDTIMEH, -and RDINSTRETH instructions are not required in RV64I. - -\begin{commentary} -Some execution environments might prohibit access to counters to -impede timing side-channel attacks. -\end{commentary} - -The RDCYCLE pseudoinstruction reads the low XLEN bits of the {\tt - cycle} CSR which holds a count of the number of clock cycles -executed by the processor core on which the hart is running from -an arbitrary start time in the past. RDCYCLEH is -an RV32I-only instruction that reads bits 63--32 of the same cycle -counter. The underlying 64-bit counter should never overflow in -practice. The rate at which the cycle counter advances will depend on -the implementation and operating environment. The execution -environment should provide a means to determine the current rate -(cycles/second) at which the cycle counter is incrementing. - -\begin{commentary} -RDCYCLE is intended to return the number of cycles executed by the -processor core, not the hart. 
Precisely defining what is a ``core'' is -difficult given some implementation choices (e.g., AMD Bulldozer). -Precisely defining what is a ``clock cycle'' is also difficult given the -range of implementations (including software emulations), but the -intent is that RDCYCLE is used for performance monitoring along with the -other performance counters. In particular, where there is one -hart/core, one would expect cycle-count/instructions-retired to -measure CPI for a hart. - -Cores don't have to be exposed to software at all, and an implementor -might choose to pretend multiple harts on one physical core are -running on separate cores with one hart/core, and provide separate -cycle counters for each hart. This might make sense in a simple -barrel processor (e.g., CDC 6600 peripheral processors) where -inter-hart timing interactions are non-existent or minimal. - -Where there is more than one hart/core and dynamic multithreading, it -is not generally possible to separate out cycles per hart (especially -with SMT). It might be possible to define a separate performance -counter that tried to capture the number of cycles a particular hart -was running, but this definition would have to be very fuzzy to cover -all the possible threading implementations. For example, should we -only count cycles for which any instruction was issued to execution -for this hart, and/or cycles any instruction retired, or include -cycles this hart was occupying machine resources but couldn't execute -due to stalls while other harts went into execution? Likely, ``all of -the above'' would be needed to have understandable performance stats. -This complexity of defining a per-hart cycle count, and also the need -in any case for a total per-core cycle count when tuning multithreaded -code led to just standardizing the per-core cycle counter, which also -happens to work well for the common single hart/core case. 
- -Standardizing what happens during ``sleep'' is not practical given -that what ``sleep'' means is not standardized across execution -environments, but if the entire core is paused (entirely clock-gated -or powered-down in deep sleep), then it is not executing clock cycles, -and the cycle count shouldn't be increasing per the spec. There are -many details, e.g., whether clock cycles required to reset a processor -after waking up from a power-down event should be counted, and these -are considered execution-environment-specific details. - -Even though there is no precise definition that works for all -platforms, this is still a useful facility for most platforms, and an -imprecise, common, ``usually correct'' standard here is better than no -standard. The intent of RDCYCLE was primarily performance -monitoring/tuning, and the specification was written with that goal in -mind. -\end{commentary} - -The RDTIME pseudoinstruction reads the low XLEN bits of the {\tt - time} CSR, which counts wall-clock real time that has passed from an -arbitrary start time in the past. RDTIMEH is an RV32I-only instruction -that reads bits 63--32 of the same real-time counter. The underlying 64-bit -counter should never overflow in practice. The execution environment -should provide a means of determining the period of the real-time -counter (seconds/tick). The period must be constant. The -real-time clocks of all harts in a single user application -should be synchronized to within one tick of the real-time clock. The -environment should provide a means to determine the accuracy of the -clock. - -\begin{commentary} -On some simple platforms, cycle count might represent a valid -implementation of RDTIME, but in this case, platforms should implement -the RDTIME instruction as an alias for RDCYCLE to make code more -portable, rather than using RDCYCLE to measure wall-clock time. 
-\end{commentary} - -The RDINSTRET pseudoinstruction reads the low XLEN bits of the {\tt - instret} CSR, which counts the number of instructions retired by -this hart from some arbitrary start point in the past. RDINSTRETH is -an RV32I-only instruction that reads bits 63--32 of the same -instruction counter. The underlying 64-bit counter that should never -overflow in practice. - -The following code sequence will read a valid 64-bit cycle counter value into -{\tt x3}:{\tt x2}, even if the counter overflows between reading its upper -and lower halves. - -\begin{figure}[h!] -\begin{center} -\begin{verbatim} - again: - rdcycleh x3 - rdcycle x2 - rdcycleh x4 - bne x3, x4, again -\end{verbatim} -\end{center} -\caption{Sample code for reading the 64-bit cycle counter in RV32.} -\label{rdcycle} -\end{figure} - -\begin{commentary} -We would like these basic counters be provided in all implementations as -they are essential for basic performance analysis, adaptive and -dynamic optimization, and to allow an application to work with -real-time streams. Additional counters should be provided to help -diagnose performance problems and these should be made accessible from -user-level application code with low overhead. - -We required the counters be 64 bits wide, even on RV32, as otherwise -it is very difficult for software to determine if values have -overflowed. For a low-end implementation, the upper 32 bits of each -counter can be implemented using software counters incremented by a -trap handler triggered by overflow of the lower 32 bits. The sample -code described above shows how the full 64-bit width value can be -safely read using the individual 32-bit instructions. - -In some applications, it is important to be able to read multiple -counters at the same instant in time. When run under a multitasking -environment, a user thread can suffer a context switch while -attempting to read the counters. 
One solution is for the user thread -to read the real-time counter before and after reading the other -counters to determine if a context switch occurred in the middle of the -sequence, in which case the reads can be retried. We considered -adding output latches to allow a user thread to snapshot the counter -values atomically, but this would increase the size of the user -context, especially for implementations with a richer set of counters. -\end{commentary} - diff --git a/src/preface.tex b/src/preface.tex index ad1681d..06cb435 100644 --- a/src/preface.tex +++ b/src/preface.tex @@ -67,7 +67,7 @@ The major changes in this version of the document include: produced illegal instruction exceptions in RV32E and RV64I chapters. \item Counter/timer instructions are now not considered part of mandatory base ISA, and so CSR instructions were moved into separate - chapter. + chapter, with the unprivileged counters moved into another separate chapter. \item Defined the signed-zero behavior of FMIN.{\em fmt} and FMAX.{\em fmt}, and changed their behavior on signaling-NaN inputs to conform to the minimumNumber and maximumNumber operations in the proposed IEEE 754-201x diff --git a/src/riscv-spec.tex b/src/riscv-spec.tex index 02d77bf..9cfba79 100644 --- a/src/riscv-spec.tex +++ b/src/riscv-spec.tex @@ -77,6 +77,7 @@ Andrew Waterman and Krste Asanovi\'{c}, RISC-V Foundation, May 2017. \input{m} \input{a} \input{csr} +\input{counters} \input{f} \input{d} \input{q} -- cgit v1.1
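
Note (not part of the patch): the new counters chapter reads the 64-bit cycle counter on RV32 with a high/low/high sequence that retries when the two high-half reads differ. The sketch below models that loop in Python so the retry behaviour can be checked off-target. It is only an illustration under stated assumptions: SimulatedCounter and read_counter64 are hypothetical names that do not appear in the specification, and advancing the counter by one step per read is a stand-in for the cycles that elapse between instructions.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


class SimulatedCounter:
    """Hypothetical 64-bit counter exposed as two 32-bit halves.

    Assumption for illustration: every read advances the counter by
    `step`, modelling time passing between consecutive instructions.
    """

    def __init__(self, start, step=1):
        self.value = start & MASK64
        self.step = step

    def _tick(self):
        self.value = (self.value + self.step) & MASK64

    def read_low(self):
        # Models rdcycle on RV32: bits 31..0 of the counter.
        self._tick()
        return self.value & MASK32

    def read_high(self):
        # Models rdcycleh on RV32: bits 63..32 of the counter.
        self._tick()
        return (self.value >> 32) & MASK32


def read_counter64(counter):
    """Return a consistent 64-bit value using the high/low/high retry loop."""
    while True:
        high = counter.read_high()          # rdcycleh x3
        low = counter.read_low()            # rdcycle  x2
        if counter.read_high() == high:     # rdcycleh x4; bne x3, x4, again
            return (high << 32) | low
        # The low half carried into the high half between the reads, so
        # the pair (high, low) may be inconsistent. Retry.
```

Starting the simulated counter just below a carry out of the low half (for example at 0x1FFFFFFFE) forces one retry. The value returned then lies between the starting value and the counter's value after the final read, whereas pairing the first high read with the first low read would give a value below the starting point.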