aboutsummaryrefslogtreecommitdiff
path: root/src/counters.tex
diff options
context:
space:
mode:
authorKrste Asanovic <krste@eecs.berkeley.edu>2018-08-07 23:01:00 -0700
committerKrste Asanovic <krste@eecs.berkeley.edu>2018-08-07 23:01:00 -0700
commit627495d8a6935b03f3c09164f27a67393ef3173d (patch)
treeaf5a1bbb8ce1440ff3e7426d84fe779b984fa59d /src/counters.tex
parent22a0383af387e761b52f521a63c6e45f7fff7f74 (diff)
downloadriscv-isa-manual-627495d8a6935b03f3c09164f27a67393ef3173d.zip
riscv-isa-manual-627495d8a6935b03f3c09164f27a67393ef3173d.tar.gz
riscv-isa-manual-627495d8a6935b03f3c09164f27a67393ef3173d.tar.bz2
Broke out actual perf counters into separate chapter.
Diffstat (limited to 'src/counters.tex')
-rw-r--r--src/counters.tex184
1 files changed, 184 insertions, 0 deletions
diff --git a/src/counters.tex b/src/counters.tex
new file mode 100644
index 0000000..adc8e0c
--- /dev/null
+++ b/src/counters.tex
@@ -0,0 +1,184 @@
+\chapter{Counters}
+\label{counters}
+
+RISC-V ISAs provide a set of up to 32$\times$64-bit performance counters and
+timers that are accessible via unprivileged XLEN read-only CSR
+registers {\tt 0xC00}--{\tt 0xC1F} (with the upper 32 bits accessed
+via CSR registers {\tt 0xC80}--{\tt 0xC9F} on RV32). The first three
+of these (CYCLE, TIME, and INSTRET) have dedicated functions (cycle
+count, real-time clock, and instructions-retired respectively), while
+the remaining counters, if implemented, provide programmable event
+counting.
+
+\section{Base Counters and Timers}
+
+\vspace{-0.2in}
+\begin{center}
+\begin{tabular}{M@{}R@{}F@{}R@{}S}
+\\
+\instbitrange{31}{20} &
+\instbitrange{19}{15} &
+\instbitrange{14}{12} &
+\instbitrange{11}{7} &
+\instbitrange{6}{0} \\
+\hline
+\multicolumn{1}{|c|}{csr} &
+\multicolumn{1}{c|}{rs1} &
+\multicolumn{1}{c|}{funct3} &
+\multicolumn{1}{c|}{rd} &
+\multicolumn{1}{c|}{opcode} \\
+\hline
+12 & 5 & 3 & 5 & 7 \\
+RDCYCLE[H] & 0 & CSRRS & dest & SYSTEM \\
+RDTIME[H] & 0 & CSRRS & dest & SYSTEM \\
+RDINSTRET[H] & 0 & CSRRS & dest & SYSTEM \\
+\end{tabular}
+\end{center}
+
+RV32I provides a number of 64-bit read-only user-level counters, which
+are mapped into the 12-bit CSR address space and accessed in 32-bit
+pieces using CSRRS instructions. In RV64I, the CSR instructions can
+manipulate 64-bit CSRs. In particular, the RDCYCLE, RDTIME, and
+RDINSTRET pseudoinstructions read the full 64 bits of the {\tt cycle},
+{\tt time}, and {\tt instret} counters. Hence, the RDCYCLEH, RDTIMEH,
+and RDINSTRETH instructions are not required in RV64I.
+
+\begin{commentary}
+Some execution environments might prohibit access to counters to
+impede timing side-channel attacks.
+\end{commentary}
+
+The RDCYCLE pseudoinstruction reads the low XLEN bits of the {\tt
+ cycle} CSR which holds a count of the number of clock cycles
+executed by the processor core on which the hart is running from an
+arbitrary start time in the past. RDCYCLEH is an RV32I
+instruction that reads bits 63--32 of the same cycle counter. The
+underlying 64-bit counter should never overflow in practice. The rate
+at which the cycle counter advances will depend on the implementation
+and operating environment. The execution environment should provide a
+means to determine the current rate (cycles/second) at which the cycle
+counter is incrementing.
+
+\begin{commentary}
+RDCYCLE is intended to return the number of cycles executed by the
+processor core, not the hart. Precisely defining what is a ``core'' is
+difficult given some implementation choices (e.g., AMD Bulldozer).
+Precisely defining what is a ``clock cycle'' is also difficult given the
+range of implementations (including software emulations), but the
+intent is that RDCYCLE is used for performance monitoring along with the
+other performance counters. In particular, where there is one
+hart/core, one would expect cycle-count/instructions-retired to
+measure CPI for a hart.
+
+Cores don't have to be exposed to software at all, and an implementor
+might choose to pretend multiple harts on one physical core are
+running on separate cores with one hart/core, and provide separate
+cycle counters for each hart. This might make sense in a simple
+barrel processor (e.g., CDC 6600 peripheral processors) where
+inter-hart timing interactions are non-existent or minimal.
+
+Where there is more than one hart/core and dynamic multithreading, it
+is not generally possible to separate out cycles per hart (especially
+with SMT). It might be possible to define a separate performance
+counter that tried to capture the number of cycles a particular hart
+was running, but this definition would have to be very fuzzy to cover
+all the possible threading implementations. For example, should we
+only count cycles for which any instruction was issued to execution
+for this hart, and/or cycles any instruction retired, or include
+cycles this hart was occupying machine resources but couldn't execute
+due to stalls while other harts went into execution? Likely, ``all of
+the above'' would be needed to have understandable performance stats.
+This complexity of defining a per-hart cycle count, and also the need
+in any case for a total per-core cycle count when tuning multithreaded
+code led to just standardizing the per-core cycle counter, which also
+happens to work well for the common single hart/core case.
+
+Standardizing what happens during ``sleep'' is not practical given
+that what ``sleep'' means is not standardized across execution
+environments, but if the entire core is paused (entirely clock-gated
+or powered-down in deep sleep), then it is not executing clock cycles,
+and the cycle count shouldn't be increasing per the spec. There are
+many details, e.g., whether clock cycles required to reset a processor
+after waking up from a power-down event should be counted, and these
+are considered execution-environment-specific details.
+
+Even though there is no precise definition that works for all
+platforms, this is still a useful facility for most platforms, and an
+imprecise, common, ``usually correct'' standard here is better than no
+standard. The intent of RDCYCLE was primarily performance
+monitoring/tuning, and the specification was written with that goal in
+mind.
+\end{commentary}
+
+The RDTIME pseudoinstruction reads the low XLEN bits of the {\tt
+ time} CSR, which counts wall-clock real time that has passed from an
+arbitrary start time in the past. RDTIMEH is an RV32I-only instruction
+that reads bits 63--32 of the same real-time counter. The underlying 64-bit
+counter should never overflow in practice. The execution environment
+should provide a means of determining the period of the real-time
+counter (seconds/tick). The period must be constant. The
+real-time clocks of all harts in a single user application
+should be synchronized to within one tick of the real-time clock. The
+environment should provide a means to determine the accuracy of the
+clock.
+
+\begin{commentary}
+On some simple platforms, cycle count might represent a valid
+implementation of RDTIME, but in this case, platforms should implement
+the RDTIME instruction as an alias for RDCYCLE to make code more
+portable, rather than using RDCYCLE to measure wall-clock time.
+\end{commentary}
+
+The RDINSTRET pseudoinstruction reads the low XLEN bits of the {\tt
+ instret} CSR, which counts the number of instructions retired by
+this hart from some arbitrary start point in the past. RDINSTRETH is
+an RV32I-only instruction that reads bits 63--32 of the same
+instruction counter. The underlying 64-bit counter that should never
+overflow in practice.
+
+The following code sequence will read a valid 64-bit cycle counter value into
+{\tt x3}:{\tt x2}, even if the counter overflows between reading its upper
+and lower halves.
+
+\begin{figure}[h!]
+\begin{center}
+\begin{verbatim}
+ again:
+ rdcycleh x3
+ rdcycle x2
+ rdcycleh x4
+ bne x3, x4, again
+\end{verbatim}
+\end{center}
+\caption{Sample code for reading the 64-bit cycle counter in RV32.}
+\label{rdcycle}
+\end{figure}
+
+\begin{commentary}
+We recommend provision of these basic counters in implementations
+as they are essential for basic performance analysis, adaptive and
+dynamic optimization, and to allow an application to work with
+real-time streams. Additional counters should be provided to help
+diagnose performance problems and these should be made accessible from
+user-level application code with low overhead.
+
+We required the counters be 64 bits wide, even on RV32, as otherwise
+it is very difficult for software to determine if values have
+overflowed. For a low-end implementation, the upper 32 bits of each
+counter can be implemented using software counters incremented by a
+trap handler triggered by overflow of the lower 32 bits. The sample
+code described above shows how the full 64-bit width value can be
+safely read using the individual 32-bit instructions.
+
+In some applications, it is important to be able to read multiple
+counters at the same instant in time. When run under a multitasking
+environment, a user thread can suffer a context switch while
+attempting to read the counters. One solution is for the user thread
+to read the real-time counter before and after reading the other
+counters to determine if a context switch occurred in the middle of the
+sequence, in which case the reads can be retried. We considered
+adding output latches to allow a user thread to snapshot the counter
+values atomically, but this would increase the size of the user
+context, especially for implementations with a richer set of counters.
+\end{commentary}
+