Diffstat (limited to 'src/counters-f.adoc')
-rw-r--r-- src/counters-f.adoc 167
1 files changed, 0 insertions, 167 deletions
diff --git a/src/counters-f.adoc b/src/counters-f.adoc
deleted file mode 100644
index 4678d78..0000000
--- a/src/counters-f.adoc
+++ /dev/null
@@ -1,167 +0,0 @@
-== Counters
-
-RISC-V ISAs provide a set of up to 32×64-bit
-performance counters and timers that are accessible via unprivileged
-XLEN-bit read-only CSRs `0xC00`–`0xC1F` (with the upper 32 bits
-accessed via CSRs `0xC80`–`0xC9F` on RV32). The first three of
-these (CYCLE, TIME, and INSTRET) have dedicated functions (cycle count,
-real-time clock, and instructions-retired respectively), while the
-remaining counters, if implemented, provide programmable event counting.
-
-=== Base Counters and Timers
-
-[cols="<,^,^,^,^",options="header"]
-|===
-|csr |rs1 |funct3 |rd |opcode
-|12 bits |5 bits |3 bits |5 bits |7 bits
-|RDCYCLE[H] |0 |CSRRS |dest |SYSTEM
-|RDTIME[H] |0 |CSRRS |dest |SYSTEM
-|RDINSTRET[H] |0 |CSRRS |dest |SYSTEM
-|===
-
-RV32I provides a number of 64-bit read-only user-level counters, which
-are mapped into the 12-bit CSR address space and accessed in 32-bit
-pieces using CSRRS instructions. In RV64I, the CSR instructions can
-manipulate 64-bit CSRs. In particular, the RDCYCLE, RDTIME, and
-RDINSTRET pseudoinstructions read the full 64 bits of the `cycle`,
-`time`, and `instret` counters. Hence, the RDCYCLEH, RDTIMEH, and
-RDINSTRETH instructions are RV32I-only.
-
-Some execution environments might prohibit access to counters to impede
-timing side-channel attacks.
-
-The RDCYCLE pseudoinstruction reads the low XLEN bits of the `cycle`
-CSR which holds a count of the number of clock cycles executed by the
-processor core on which the hart is running from an arbitrary start time
-in the past. RDCYCLEH is an RV32I-only instruction that reads bits 63–32
-of the same cycle counter. The underlying 64-bit counter should never
-overflow in practice. The rate at which the cycle counter advances will
-depend on the implementation and operating environment. The execution
-environment should provide a means to determine the current rate
-(cycles/second) at which the cycle counter is incrementing.
-
-RDCYCLE is intended to return the number of cycles executed by the
-processor core, not the hart. Precisely defining what is a "core" is
-difficult given some implementation choices (e.g., AMD Bulldozer).
-Precisely defining what is a "clock cycle" is also difficult given the
-range of implementations (including software emulations), but the intent
-is that RDCYCLE is used for performance monitoring along with the other
-performance counters. In particular, where there is one hart/core, one
-would expect cycle-count/instructions-retired to measure CPI for a hart.
-
-Cores don’t have to be exposed to software at all, and an implementor
-might choose to pretend multiple harts on one physical core are running
-on separate cores with one hart/core, and provide separate cycle
-counters for each hart. This might make sense in a simple barrel
-processor (e.g., CDC 6600 peripheral processors) where inter-hart timing
-interactions are non-existent or minimal.
-
-Where there is more than one hart/core and dynamic multithreading, it is
-not generally possible to separate out cycles per hart (especially with
-SMT). It might be possible to define a separate performance counter that
-tried to capture the number of cycles a particular hart was running, but
-this definition would have to be very fuzzy to cover all the possible
-threading implementations. For example, should we only count cycles for
-which any instruction was issued to execution for this hart, and/or
-cycles any instruction retired, or include cycles this hart was
-occupying machine resources but couldn’t execute due to stalls while
-other harts went into execution? Likely, "all of the above" would be
-needed to have understandable performance stats. This complexity of
-defining a per-hart cycle count, and also the need in any case for a
-total per-core cycle count when tuning multithreaded code, led to just
-standardizing the per-core cycle counter, which also happens to work
-well for the common single hart/core case.
-
-Standardizing what happens during "sleep" is not practical given that
-what "sleep" means is not standardized across execution environments,
-but if the entire core is paused (entirely clock-gated or powered-down
-in deep sleep), then it is not executing clock cycles, and the cycle
-count shouldn’t be increasing per the spec. There are many details,
-e.g., whether clock cycles required to reset a processor after waking up
-from a power-down event should be counted, and these are considered
-execution-environment-specific details.
-
-Even though there is no precise definition that works for all platforms,
-this is still a useful facility for most platforms, and an imprecise,
-common, "usually correct" standard here is better than no standard.
-The intent of RDCYCLE was primarily performance monitoring/tuning, and
-the specification was written with that goal in mind.
-
-The RDTIME pseudoinstruction reads the low XLEN bits of the `time` CSR,
-which counts wall-clock real time that has passed from an arbitrary
-start time in the past. RDTIMEH is an RV32I-only instruction that reads
-bits 63–32 of the same real-time counter. The underlying 64-bit counter
-should never overflow in practice. The execution environment should
-provide a means of determining the period of the real-time counter
-(seconds/tick). The period must be constant. The real-time clocks of all
-harts in a single user application should be synchronized to within one
-tick of the real-time clock. The environment should provide a means to
-determine the accuracy of the clock.
-
-On some simple platforms, cycle count might represent a valid
-implementation of RDTIME, but in this case, platforms should implement
-the RDTIME instruction as an alias for RDCYCLE to make code more
-portable, rather than using RDCYCLE to measure wall-clock time.
-
-The RDINSTRET pseudoinstruction reads the low XLEN bits of the
-`instret` CSR, which counts the number of instructions retired by this
-hart from some arbitrary start point in the past. RDINSTRETH is an
-RV32I-only instruction that reads bits 63–32 of the same instruction
-counter. The underlying 64-bit counter should never overflow in
-practice.
-
-The following code sequence will read a valid 64-bit cycle counter value
-into `x3`:`x2`, even if the counter overflows its lower half between
-reading its upper and lower halves.
-
-....
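-again:
-    rdcycleh x3             # read the upper 32 bits
-    rdcycle  x2             # read the lower 32 bits
-    rdcycleh x4             # reread the upper 32 bits
-    bne      x3, x4, again  # retry if the upper half changed in between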
-....
-
-We recommend provision of these basic counters in implementations as
-they are essential for basic performance analysis, adaptive and dynamic
-optimization, and to allow an application to work with real-time
-streams. Additional counters should be provided to help diagnose
-performance problems and these should be made accessible from user-level
-application code with low overhead.
-
-We required the counters be 64 bits wide, even on RV32, as otherwise it
-is very difficult for software to determine if values have overflowed.
-For a low-end implementation, the upper 32 bits of each counter can be
-implemented using software counters incremented by a trap handler
-triggered by overflow of the lower 32 bits. The sample code shown
-above demonstrates how the full 64-bit value can be safely read using
-the individual 32-bit instructions.
-
-In some applications, it is important to be able to read multiple
-counters at the same instant in time. When run under a multitasking
-environment, a user thread can suffer a context switch while attempting
-to read the counters. One solution is for the user thread to read the
-real-time counter before and after reading the other counters to
-determine if a context switch occurred in the middle of the sequence, in
-which case the reads can be retried. We considered adding output latches
-to allow a user thread to snapshot the counter values atomically, but
-this would increase the size of the user context, especially for
-implementations with a richer set of counters.
-
-=== Hardware Performance Counters
-
-There is CSR space allocated for 29 additional unprivileged 64-bit
-hardware performance counters, `hpmcounter3`–`hpmcounter31`. For RV32,
-the upper 32 bits of these performance counters are accessible via
-additional CSRs `hpmcounter3h`–`hpmcounter31h`. These counters count
-platform-specific events and are configured via additional privileged
-registers. The number and width of these additional counters, and the
-set of events they count, are platform-specific.
-
-The privileged architecture manual describes the privileged CSRs that
-control access to these counters and select the events to be
-counted.
-
-It would be useful to eventually standardize event settings to count
-ISA-level metrics, such as the number of floating-point instructions
-executed, and possibly a few common microarchitectural
-metrics, such as "L1 instruction cache misses".
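
The `again:` loop in the removed text reads the upper half, then the lower half, then the upper half again, and retries if the two upper-half reads differ. The following is a minimal C sketch of that same technique, not code from the specification. It assumes a simulated counter so that it runs on any host: `sim_cycle`, `read_cycle_lo`, `read_cycle_hi`, and `read_cycle64` are hypothetical names standing in for the `cycle` CSR, RDCYCLE, RDCYCLEH, and the loop respectively. Each simulated read advances the counter by one to mimic cycles elapsing between instructions, and the starting value is chosen so that the lower half wraps during the first attempt.

```c
#include <stdint.h>

/* Hypothetical stand-in for the 64-bit cycle counter. The start value is
 * chosen so that the low half overflows between the first two reads. */
static uint64_t sim_cycle = 0x00000001FFFFFFFEull;

/* Simulated RDCYCLE on RV32: advance the counter, return bits 31..0. */
static uint32_t read_cycle_lo(void)
{
    sim_cycle += 1;
    return (uint32_t)sim_cycle;
}

/* Simulated RDCYCLEH on RV32: advance the counter, return bits 63..32. */
static uint32_t read_cycle_hi(void)
{
    sim_cycle += 1;
    return (uint32_t)(sim_cycle >> 32);
}

/* Read hi, lo, hi again; retry while the two upper-half reads differ,
 * which means the lower half wrapped somewhere in the sequence. */
static uint64_t read_cycle64(void)
{
    uint32_t hi, lo, hi_again;
    do {
        hi = read_cycle_hi();
        lo = read_cycle_lo();
        hi_again = read_cycle_hi();
    } while (hi != hi_again);
    return ((uint64_t)hi << 32) | lo;
}
```

In this simulation the first attempt sees an upper half of 1 followed by an upper half of 2, so it is discarded. Without the recheck, the pair from that attempt would combine into a value roughly 2^32 cycles too small. On real RV32 hardware the two helper functions would be replaced by the actual counter-read instructions.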