aboutsummaryrefslogtreecommitdiff
path: root/src/counters.adoc
blob: f4a34af496c13eb5b3235a3723014acc801bb068 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
[[counters]]
== "Zicntr" and "Zihpm" Extensions for Counters, Version 2.0

RISC-V ISAs provide a set of up to thirty-two 64-bit performance
counters and timers that are accessible via unprivileged XLEN-bit
read-only CSR registers `0xC00`–`0xC1F` (when XLEN=32, the upper 32 bits
are accessed via CSR registers `0xC80`–`0xC9F`). These counters are
divided between the "Zicntr" and "Zihpm" extensions.

=== "Zicntr" Extension for Base Counters and Timers

The Zicntr standard extension comprises the first three of these
counters (CYCLE, TIME, and INSTRET), which have dedicated functions
(cycle count, real-time clock, and instructions retired, respectively).
The Zicntr extension depends on the Zicsr extension.

[TIP]
====
We recommend provision of these basic counters in implementations as
they are essential for basic performance analysis, adaptive and dynamic
optimization, and to allow an application to work with real-time
streams. Additional counters in the separate Zihpm extension can help
diagnose performance problems and these should be made accessible from
user-level application code with low overhead.

Some execution environments might prohibit access to counters, for
example, to impede timing side-channel attacks.
====

include::images/wavedrom/counters-diag.adoc[]


For base ISAs with XLEN≥64, CSR instructions can access
the full 64-bit CSRs directly. In particular, the RDCYCLE, RDTIME, and
RDINSTRET pseudoinstructions read the full 64 bits of the `cycle`,
`time`, and `instret` counters.

[TIP]
====
The counter pseudoinstructions are mapped to the read-only
`csrrs rd, counter, x0` canonical form, but the other read-only CSR
instruction forms (based on CSRRC/CSRRSI/CSRRCI) are also legal ways to
read these CSRs.
====
For base ISAs with XLEN=32, the Zicntr extension enables the three
64-bit read-only counters to be accessed in 32-bit pieces. The RDCYCLE,
RDTIME, and RDINSTRET pseudoinstructions provide the lower 32 bits, and
the RDCYCLEH, RDTIMEH, and RDINSTRETH pseudoinstructions provide the
upper 32 bits of the respective counters.
[TIP]
====
We required the counters be 64 bits wide, even when XLEN=32, as
otherwise it is very difficult for software to determine if values have
overflowed. For a low-end implementation, the upper 32 bits of each
counter can be implemented using software counters incremented by a trap
handler triggered by overflow of the lower 32 bits. The sample code
given below shows how the full 64-bit width value can be safely read
using the individual 32-bit width pseudoinstructions.
====

The RDCYCLE pseudoinstruction reads the low XLEN bits of the `cycle`
CSR which holds a count of the number of clock cycles executed by the
processor core on which the hart is running from an arbitrary start time
in the past. RDCYCLEH is only present when XLEN=32 and reads bits 63-32
of the same cycle counter. The underlying 64-bit counter should never
overflow in practice. The rate at which the cycle counter advances will
depend on the implementation and operating environment. The execution
environment should provide a means to determine the current rate
(cycles/second) at which the cycle counter is incrementing.
[TIP]
====
RDCYCLE is intended to return the number of cycles executed by the
processor core, not the hart. Precisely defining what is a "core" is
difficult given some implementation choices (e.g., AMD Bulldozer).
Precisely defining what is a "clock cycle" is also difficult given the
range of implementations (including software emulations), but the intent
is that RDCYCLE is used for performance monitoring along with the other
performance counters. In particular, where there is one hart/core, one
would expect cycle-count/instructions-retired to measure CPI for a hart.

Cores don't have to be exposed to software at all, and an implementor
might choose to pretend multiple harts on one physical core are running
on separate cores with one hart/core, and provide separate cycle
counters for each hart. This might make sense in a simple barrel
processor (e.g., CDC 6600 peripheral processors) where inter-hart timing
interactions are non-existent or minimal.

Where there is more than one hart/core and dynamic multithreading, it is
not generally possible to separate out cycles per hart (especially with
SMT). It might be possible to define a separate performance counter that
tried to capture the number of cycles a particular hart was running, but
this definition would have to be very fuzzy to cover all the possible
threading implementations. For example, should we only count cycles for
which any instruction was issued to execution for this hart, and/or
cycles any instruction retired, or include cycles this hart was
occupying machine resources but couldn't execute due to stalls while
other harts went into execution? Likely, "all of the above" would be
needed to have understandable performance stats. This complexity of
defining a per-hart cycle count, and also the need in any case for a
total per-core cycle count when tuning multithreaded code led to just
standardizing the per-core cycle counter, which also happens to work
well for the common single hart/core case.

Standardizing what happens during "sleep" is not practical given that
what "sleep" means is not standardized across execution environments,
but if the entire core is paused (entirely clock-gated or powered-down
in deep sleep), then it is not executing clock cycles, and the cycle
count shouldn't be increasing per the spec. There are many details,
e.g., whether clock cycles required to reset a processor after waking up
from a power-down event should be counted, and these are considered
execution-environment-specific details.

Even though there is no precise definition that works for all platforms,
this is still a useful facility for most platforms, and an imprecise,
common, "usually correct" standard here is better than no standard.
The intent of RDCYCLE was primarily performance monitoring/tuning, and
the specification was written with that goal in mind.
====
The RDTIME pseudoinstruction reads the low XLEN bits of the "time" CSR,
which counts wall-clock real time that has passed from an arbitrary
start time in the past. RDTIMEH is only present when XLEN=32 and reads
bits 63-32 of the same real-time counter. The underlying 64-bit counter
increments by one with each tick of the real-time clock, and, for
realistic real-time clock frequencies, should never overflow in
practice. The execution environment should provide a means of
determining the period of a counter tick (seconds/tick). The period
should be constant within a small error bound. The environment should
provide a means to determine the accuracy of the clock (i.e., the
maximum relative error between the nominal and actual real-time clock
periods).
[TIP]
====
On some simple platforms, cycle count might represent a valid
implementation of RDTIME, in which case RDTIME and RDCYCLE may return
the same result.

It is difficult to provide a strict mandate on clock period given the
wide variety of possible implementation platforms. The maximum error
bound should be set based on the requirements of the platform.
====

The real-time clocks of all harts must be synchronized to within one
tick of the real-time clock.
[TIP]
====
As with other architectural mandates, it suffices to appear "as if"
harts are synchronized to within one tick of the real-time clock, i.e.,
software is unable to observe that there is a greater delta between the
real-time clock values observed on two harts.
====
The RDINSTRET pseudoinstruction reads the low XLEN bits of the
`instret` CSR, which counts the number of instructions retired by this
hart from some arbitrary start point in the past. RDINSTRETH is only
present when XLEN=32 and reads bits 63-32 of the same instruction
counter. The underlying 64-bit counter should never overflow in
practice.
[TIP]
====
Instructions that cause synchronous exceptions, including ECALL and
EBREAK, are not considered to retire and hence do not increment the
`instret` CSR.
====
The following code sequence will read a valid 64-bit cycle counter value
into `x3:x2`, even if the counter overflows its lower half between
reading its upper and lower halves.

[source,asm.]
.Sample code for reading the 64-bit cycle counter when XLEN=32.
    again:
        rdcycleh     x3
        rdcycle      x2
        rdcycleh     x4
        bne          x3, x4, again


=== "Zihpm" Extension for Hardware Performance Counters

The Zihpm extension comprises up to 29 additional unprivileged 64-bit
hardware performance counters, `hpmcounter3-hpmcounter31`. When
XLEN=32, the upper 32 bits of these performance counters are accessible
via additional CSRs `hpmcounter3h- hpmcounter31h`. The Zihpm extension
depends on the Zicsr extension.
[TIP]
====
In some applications, it is important to be able to read multiple
counters at the same instant in time. When run under a multitasking
environment, a user thread can suffer a context switch while attempting
to read the counters. One solution is for the user thread to read the
real-time counter before and after reading the other counters to
determine if a context switch occurred in the middle of the sequence, in
which case the reads can be retried. We considered adding output latches
to allow a user thread to snapshot the counter values atomically, but
this would increase the size of the user context, especially for
implementations with a richer set of counters.
====

The implemented number and width of these additional counters, and the
set of events they count, is platform-specific. Accessing an
unimplemented or ill-configured counter may cause an illegal-instruction
exception or may return a constant value.

The execution environment should provide a means to determine the number
and width of the implemented counters, and an interface to configure the
events to be counted by each counter.
[TIP]
====
For execution environments implemented on RISC-V privileged platforms,
the privileged architecture manual describes privileged CSRs controlling
access by lower privileged modes to these counters, and to set the
events to be counted.

Alternative execution environments (e.g., user-level-only software
performance models) may provide alternative mechanisms to configure the
events counted by the performance counters.

It would be useful to eventually standardize event settings to count
ISA-level metrics, such as the number of floating-point instructions
executed for example, and possibly a few common microarchitectural
metrics, such as "L1 instruction cache misses".
====