src/csr.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277

\chapter{Control and Status Register (CSR) Instructions}
\label{csrinsts}

This chapter defines the full set of CSR instructions, although the
control and status registers are primarily used by the privileged
architecture. There are several uses in unprivileged code including
for counters and timers, and floating-point status.

\begin{commentary}
  The counters and timers are no longer considered mandatory parts of
  the standard base ISAs, and so have been moved into this separate
  chapter.
\end{commentary}

\section{CSR Instructions}

\vspace{-0.2in}
\begin{center}
\begin{tabular}{M@{}R@{}F@{}R@{}S}
\\
\instbitrange{31}{20} &
\instbitrange{19}{15} &
\instbitrange{14}{12} &
\instbitrange{11}{7} &
\instbitrange{6}{0} \\
\hline
\multicolumn{1}{|c|}{csr} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct3} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} \\
\hline
12 & 5 & 3 & 5 & 7 \\
source/dest  & source & CSRRW  & dest & SYSTEM \\
source/dest  & source & CSRRS  & dest & SYSTEM \\
source/dest  & source & CSRRC  & dest & SYSTEM \\
source/dest  & uimm[4:0]   & CSRRWI & dest & SYSTEM \\
source/dest  & uimm[4:0]   & CSRRSI & dest & SYSTEM \\
source/dest  & uimm[4:0]   & CSRRCI & dest & SYSTEM \\
\end{tabular}
\end{center}

The CSRRW (Atomic Read/Write CSR) instruction atomically swaps values
in the CSRs and integer registers. CSRRW reads the old value of the
CSR, zero-extends the value to XLEN bits, then writes it to integer
register {\em rd}.  The initial value in {\em rs1} is written to the
CSR.  If {\em rd}={\tt x0}, then the instruction shall not read the CSR
and shall not cause any of the side-effects that might occur on a CSR
read.

The CSRRS (Atomic Read and Set Bits in CSR) instruction reads the
value of the CSR, zero-extends the value to XLEN bits, and writes it
to integer register {\em rd}.  The initial value in integer register
{\em rs1} is treated as a bit mask that specifies bit positions to be
set in the CSR.  Any bit that is high in {\em rs1} will cause the
corresponding bit to be set in the CSR, if that CSR bit is writable.
Other bits in the CSR are unaffected (though CSRs might have side
effects when written).

The CSRRC (Atomic Read and Clear Bits in CSR) instruction reads the
value of the CSR, zero-extends the value to XLEN bits, and writes it
to integer register {\em rd}.  The initial value in integer register
{\em rs1} is treated as a bit mask that specifies bit positions to be
cleared in the CSR.  Any bit that is high in {\em rs1} will cause the
corresponding bit to be cleared in the CSR, if that CSR bit is
writable.  Other bits in the CSR are unaffected.

For both CSRRS and CSRRC, if {\em rs1}={\tt x0}, then the instruction
will not write to the CSR at all, and so shall not cause any of the
side effects that might otherwise occur on a CSR write, such as
raising illegal instruction exceptions on accesses to read-only CSRs.
Note that if {\em rs1} specifies a register holding a zero value other
than {\tt x0}, the instruction will still attempt to write the
unmodified value back to the CSR and will cause any attendant side effects.

The CSRRWI, CSRRSI, and CSRRCI variants are similar to CSRRW, CSRRS,
and CSRRC respectively, except they update the CSR using an XLEN-bit
value obtained by zero-extending a 5-bit unsigned immediate (uimm[4:0]) field
encoded in the {\em rs1} field instead of a value from an integer
register.  For CSRRSI and CSRRCI, if the uimm[4:0] field is zero, then
these instructions will not write to the CSR, and shall not cause any
of the side effects that might otherwise occur on a CSR write.  For
CSRRWI, if {\em rd}={\tt x0}, then the instruction shall not read the
CSR and shall not cause any of the side-effects that might occur on a
CSR read.

Some CSRs, such as the instructions retired counter, {\tt instret}, may be
modified as side effects of instruction execution.  In these cases, if a CSR
access instruction reads a CSR, it reads the value prior to the execution of
the instruction.  If a CSR access instruction writes a CSR, the update occurs
after the execution of the instruction.  In particular, a value written to
{\tt instret} by one instruction will be the value read by the following
instruction (i.e., the increment of {\tt instret} caused by the first
instruction retiring happens before the write of the new value).

The assembler pseudoinstruction to read a CSR, CSRR {\em rd, csr}, is
encoded as CSRRS {\em rd, csr, x0}.  The assembler pseudoinstruction
to write a CSR, CSRW {\em csr, rs1}, is encoded as CSRRW {\em x0, csr,
  rs1}, while CSRWI {\em csr, uimm}, is encoded as CSRRWI {\em x0,
  csr, uimm}.

Further assembler pseudoinstructions are defined to set and clear
bits in the CSR when the old value is not required: CSRS/CSRC {\em
  csr, rs1}; CSRSI/CSRCI {\em csr, uimm}.

\section{Timers and Counters}

\vspace{-0.2in}
\begin{center}
\begin{tabular}{M@{}R@{}F@{}R@{}S}
\\
\instbitrange{31}{20} &
\instbitrange{19}{15} &
\instbitrange{14}{12} &
\instbitrange{11}{7} &
\instbitrange{6}{0} \\
\hline
\multicolumn{1}{|c|}{csr} &
\multicolumn{1}{c|}{rs1} &
\multicolumn{1}{c|}{funct3} &
\multicolumn{1}{c|}{rd} &
\multicolumn{1}{c|}{opcode} \\
\hline
12 & 5 & 3 & 5 & 7 \\
RDCYCLE[H]   & 0 & CSRRS  & dest & SYSTEM \\
RDTIME[H]    & 0 & CSRRS  & dest & SYSTEM \\
RDINSTRET[H] & 0 & CSRRS  & dest & SYSTEM \\
\end{tabular}
\end{center}

RV32I provides a number of 64-bit read-only user-level counters, which
are mapped into the 12-bit CSR address space and accessed in 32-bit
pieces using CSRRS instructions.  In RV64I, the CSR instructions can
manipulate 64-bit CSRs.  In particular, the RDCYCLE, RDTIME, and
RDINSTRET pseudoinstructions read the full 64 bits of the {\tt cycle},
{\tt time}, and {\tt instret} counters.  Hence, the RDCYCLEH, RDTIMEH,
and RDINSTRETH instructions are not required in RV64I.

\begin{commentary}
Some execution environments might prohibit access to counters to
impede timing side-channel attacks.
\end{commentary}

The RDCYCLE pseudoinstruction reads the low XLEN bits of the {\tt
  cycle} CSR which holds a count of the number of clock cycles
executed by the processor core on which the hart is running from
an arbitrary start time in the past.  RDCYCLEH is
an RV32I-only instruction that reads bits 63--32 of the same cycle
counter.  The underlying 64-bit counter should never overflow in
practice.  The rate at which the cycle counter advances will depend on
the implementation and operating environment.  The execution
environment should provide a means to determine the current rate
(cycles/second) at which the cycle counter is incrementing.

\begin{commentary}
RDCYCLE is intended to return the number of cycles executed by the
processor core, not the hart.  Precisely defining what is a ``core'' is
difficult given some implementation choices (e.g., AMD Bulldozer).
Precisely defining what is a ``clock cycle'' is also difficult given the
range of implementations (including software emulations), but the
intent is that RDCYCLE is used for performance monitoring along with the
other performance counters.  In particular, where there is one
hart/core, one would expect cycle-count/instructions-retired to
measure CPI for a hart.

Cores don't have to be exposed to software at all, and an implementor
might choose to pretend multiple harts on one physical core are
running on separate cores with one hart/core, and provide separate
cycle counters for each hart.  This might make sense in a simple
barrel processor (e.g., CDC 6600 peripheral processors) where
inter-hart timing interactions are non-existent or minimal.

Where there is more than one hart/core and dynamic multithreading, it
is not generally possible to separate out cycles per hart (especially
with SMT).  It might be possible to define a separate performance
counter that tried to capture the number of cycles a particular hart
was running, but this definition would have to be very fuzzy to cover
all the possible threading implementations.  For example, should we
only count cycles for which any instruction was issued to execution
for this hart, and/or cycles any instruction retired, or include
cycles this hart was occupying machine resources but couldn't execute
due to stalls while other harts went into execution? Likely, ``all of
the above'' would be needed to have understandable performance stats.
This complexity of defining a per-hart cycle count, and also the need
in any case for a total per-core cycle count when tuning multithreaded
code led to just standardizing the per-core cycle counter, which also
happens to work well for the common single hart/core case.

Standardizing what happens during ``sleep'' is not practical given
that what ``sleep'' means is not standardized across execution
environments, but if the entire core is paused (entirely clock-gated
or powered-down in deep sleep), then it is not executing clock cycles,
and the cycle count shouldn't be increasing per the spec.  There are
many details, e.g., whether clock cycles required to reset a processor
after waking up from a power-down event should be counted, and these
are considered execution-environment-specific details.

Even though there is no precise definition that works for all
platforms, this is still a useful facility for most platforms, and an
imprecise, common, ``usually correct'' standard here is better than no
standard.  The intent of RDCYCLE was primarily performance
monitoring/tuning, and the specification was written with that goal in
mind.
\end{commentary}

The RDTIME pseudoinstruction reads the low XLEN bits of the {\tt
  time} CSR, which counts wall-clock real time that has passed from an
arbitrary start time in the past.  RDTIMEH is an RV32I-only instruction
that reads bits 63--32 of the same real-time counter.  The underlying 64-bit
counter should never overflow in practice.  The execution environment
should provide a means of determining the period of the real-time
counter (seconds/tick).  The period must be constant.  The
real-time clocks of all harts in a single user application
should be synchronized to within one tick of the real-time clock.  The
environment should provide a means to determine the accuracy of the
clock.

\begin{commentary}
On some simple platforms, cycle count might represent a valid
implementation of RDTIME, but in this case, platforms should implement
the RDTIME instruction as an alias for RDCYCLE to make code more
portable, rather than using RDCYCLE to measure wall-clock time.
\end{commentary}

The RDINSTRET pseudoinstruction reads the low XLEN bits of the {\tt
  instret} CSR, which counts the number of instructions retired by
this hart from some arbitrary start point in the past.  RDINSTRETH is
an RV32I-only instruction that reads bits 63--32 of the same
instruction counter. The underlying 64-bit counter that should never
overflow in practice.

The following code sequence will read a valid 64-bit cycle counter value into
{\tt x3}:{\tt x2}, even if the counter overflows between reading its upper
and lower halves.

\begin{figure}[h!]
\begin{center}
\begin{verbatim}
    again:
        rdcycleh     x3
        rdcycle      x2
        rdcycleh     x4
        bne          x3, x4, again
\end{verbatim}
\end{center}
\caption{Sample code for reading the 64-bit cycle counter in RV32.}
\label{rdcycle}
\end{figure}

\begin{commentary}
We would like these basic counters be provided in all implementations as
they are essential for basic performance analysis, adaptive and
dynamic optimization, and to allow an application to work with
real-time streams.  Additional counters should be provided to help
diagnose performance problems and these should be made accessible from
user-level application code with low overhead.

We required the counters be 64 bits wide, even on RV32, as otherwise
it is very difficult for software to determine if values have
overflowed.  For a low-end implementation, the upper 32 bits of each
counter can be implemented using software counters incremented by a
trap handler triggered by overflow of the lower 32 bits.  The sample
code described above shows how the full 64-bit width value can be
safely read using the individual 32-bit instructions.

In some applications, it is important to be able to read multiple
counters at the same instant in time.  When run under a multitasking
environment, a user thread can suffer a context switch while
attempting to read the counters.  One solution is for the user thread
to read the real-time counter before and after reading the other
counters to determine if a context switch occurred in the middle of the
sequence, in which case the reads can be retried.  We considered
adding output latches to allow a user thread to snapshot the counter
values atomically, but this would increase the size of the user
context, especially for implementations with a richer set of counters.
\end{commentary}