src/zc/pushpop.adoc


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349

<<<

[#insns-pushpop,reftext="PUSH/POP Register Instructions"]
== PUSH/POP register instructions

These instructions are collectively referred to as PUSH/POP: 

* <<#insns-cm_push>> 
* <<#insns-cm_pop>> 
* <<#insns-cm_popret>> 
* <<#insns-cm_popretz>> 

The term PUSH refers to _cm.push_.

The term POP refers to _cm.pop_.

The term POPRET refers to _cm.popret and cm.popretz_.

Common details for these instructions are in this section.

=== PUSH/POP functional overview

PUSH, POP, POPRET are used to reduce the size of function prologues and epilogues.

. The PUSH instruction
** adjusts the stack pointer to create the stack frame
** pushes (stores) the registers specified in the register list to the stack frame

. The POP instruction
** pops (loads) the registers in the register list from the stack frame
** adjusts the stack pointer to destroy the stack frame

. The POPRET instructions
** pop (load) the registers in the register list from the stack frame
** _cm.popretz_ also moves zero into _a0_ as the return value
** adjust the stack pointer  to destroy the stack frame
** execute a _ret_ instruction to return from the function

<<<
=== Example usage

This example gives an illustration of the use of PUSH and POPRET.

The function _processMarkers_ in the EMBench benchmark picojpeg in the following file on github: https://github.com/embench/embench-iot/blob/master/src/picojpeg/libpicojpeg.c[libpicojpeg.c]

The prologue and epilogue compile with GCC10 to:

[source,SAIL]
----

   0001098a <processMarkers>:
   1098a:       711d                    addi    sp,sp,-96 ;#cm.push(1)
   1098c:       c8ca                    sw      s2,80(sp) ;#cm.push(2)
   1098e:       c6ce                    sw      s3,76(sp) ;#cm.push(3)
   10990:       c4d2                    sw      s4,72(sp) ;#cm.push(4)
   10992:       ce86                    sw      ra,92(sp) ;#cm.push(5)
   10994:       cca2                    sw      s0,88(sp) ;#cm.push(6)
   10996:       caa6                    sw      s1,84(sp) ;#cm.push(7)
   10998:       c2d6                    sw      s5,68(sp) ;#cm.push(8)
   1099a:       c0da                    sw      s6,64(sp) ;#cm.push(9)
   1099c:       de5e                    sw      s7,60(sp) ;#cm.push(10)
   1099e:       dc62                    sw      s8,56(sp) ;#cm.push(11)
   109a0:       da66                    sw      s9,52(sp) ;#cm.push(12)
   109a2:       d86a                    sw      s10,48(sp);#cm.push(13)
   109a4:       d66e                    sw      s11,44(sp);#cm.push(14)
...
   109f4:       4501                    li      a0,0      ;#cm.popretz(1)
   109f6:       40f6                    lw      ra,92(sp) ;#cm.popretz(2)
   109f8:       4466                    lw      s0,88(sp) ;#cm.popretz(3)
   109fa:       44d6                    lw      s1,84(sp) ;#cm.popretz(4)
   109fc:       4946                    lw      s2,80(sp) ;#cm.popretz(5)
   109fe:       49b6                    lw      s3,76(sp) ;#cm.popretz(6)
   10a00:       4a26                    lw      s4,72(sp) ;#cm.popretz(7)
   10a02:       4a96                    lw      s5,68(sp) ;#cm.popretz(8)
   10a04:       4b06                    lw      s6,64(sp) ;#cm.popretz(9)
   10a06:       5bf2                    lw      s7,60(sp) ;#cm.popretz(10)
   10a08:       5c62                    lw      s8,56(sp) ;#cm.popretz(11)
   10a0a:       5cd2                    lw      s9,52(sp) ;#cm.popretz(12)
   10a0c:       5d42                    lw      s10,48(sp);#cm.popretz(13)
   10a0e:       5db2                    lw      s11,44(sp);#cm.popretz(14)
   10a10:       6125                    addi    sp,sp,96  ;#cm.popretz(15)
   10a12:       8082                    ret               ;#cm.popretz(16)
----

<<<

with the GCC option _-msave-restore_ the output is the following:

[source,SAIL]
----
0001080e <processMarkers>:
   1080e:       73a012ef                jal     t0,11f48 <__riscv_save_12>
   10812:       1101                    addi    sp,sp,-32
...
   10862:       4501                    li      a0,0
   10864:       6105                    addi    sp,sp,32
   10866:       71e0106f                j       11f84 <__riscv_restore_12>
----

with PUSH/POPRET this reduces to

[source,SAIL]
----
0001080e <processMarkers>:
   1080e:       b8fa                    cm.push    {ra,s0-s11},-96
...
   10866:       bcfa                    cm.popretz {ra,s0-s11}, 96
----

The prologue / epilogue reduce from 60-bytes in the original code, to 14-bytes with _-msave-restore_, 
and to 4-bytes with PUSH and POPRET. 
As well as reducing the code-size PUSH and POPRET eliminate the branches from 
calling the millicode _save/restore_ routines and so may also perform better. 
  
[NOTE]

  The calls to _<riscv_save_0>/<riscv_restore_0>_ become 64-bit when the target functions are out of the ±1MB range, increasing the prologue/epilogue size to 22-bytes.

[NOTE]

  POP is typically used in tail-calling sequences where _ret_ is not used to return to _ra_ after destroying the stack frame.

[#pushpop-areg-list]

==== Stack pointer adjustment handling

The instructions all automatically adjust the stack pointer by enough to cover the memory required for the registers being saved or restored. 
Additionally the _spimm_ field in the encoding allows the stack pointer to be adjusted in additional increments of 16-bytes. There is only a small restricted
range available in the encoding; if the range is insufficient then a separate _c.addi16sp_ can be used to increase the range.

==== Register list handling

There is no support for the _{ra, s0-s10}_ register list without also adding _s11_. Therefore the _{ra, s0-s11}_ register list must be used in this case.

[#pushpop-idempotent-memory]
=== PUSH/POP Fault handling

Correct execution requires that _sp_ refers to idempotent memory (also see <<pushpop_non-idem-mem>>), because the core must be able to 
handle traps detected during the sequence. 
The entire PUSH/POP sequence is re-executed after returning from the trap handler, and multiple traps are possible during the sequence.

If a trap occurs during the sequence then _xEPC_ is updated with the PC of the instruction, _xTVAL_ (if not read-only-zero) updated with the bad address if it was an access fault and _xCAUSE_ updated with the type of trap.

NOTE: It is implementation defined whether interrupts can also be taken during the sequence execution.

[#pushpop-software-view]
=== Software view of execution

==== Software view of the PUSH sequence

From a software perspective the PUSH sequence appears as:

* A sequence of stores writing the bytes required by the pseudo-code
** The bytes may be written in any order.
** The bytes may be grouped into larger accesses.
** Any of the bytes may be written multiple times.
* A stack pointer adjustment

NOTE: If an implementation allows interrupts during the sequence, and the interrupt handler uses _sp_ to allocate stack memory, then any stores which were executed before the interrupt may be overwritten by the handler. This is safe because the memory is idempotent and the stores will be re-executed when execution resumes.

The stack pointer adjustment must only be committed only when it is certain that the entire PUSH instruction will commit.

Stores may also return imprecise faults from the bus. 
It is platform defined whether the core implementation waits for the bus responses before continuing to the final stage of the sequence, 
or handles errors responses after completing the PUSH instruction.

<<<

For example:

[source,sail]
--
cm.push  {ra, s0-s5}, -64
--

Appears to software as:

[source,sail]
--
# any bytes from sp-1 to sp-28 may be written multiple times before 
# the instruction completes therefore these updates may be visible in 
# the interrupt/exception handler below the stack pointer
sw  s5, -4(sp)   
sw  s4, -8(sp)   
sw  s3,-12(sp)   
sw  s2,-16(sp)  
sw  s1,-20(sp)   
sw  s0,-24(sp)   
sw  ra,-28(sp)   

# this must only execute once, and will only execute after all stores
# completed without any precise faults, therefore this update is only 
# visible in the interrupt/exception handler if cm.push has completed
addi sp, sp, -64
--

==== Software view of the POP/POPRET sequence

From a software perspective the POP/POPRET sequence appears as:

* A sequence of loads reading the bytes required by the pseudo-code.
** The bytes may be loaded in any order.
** The bytes may be grouped into larger accesses.
** Any of the bytes may be loaded multiple times.
* A stack pointer adjustment
* An optional `li a0, 0`
* An optional `ret`

If a trap occurs during the sequence, then any loads which were executed before the trap may update architectural state. 
The loads will be re-executed once the trap handler completes, so the values will be overwritten. 
Therefore it is permitted for an implementation to update some of the destination registers before taking a fault.

The optional `li a0, 0`, stack pointer adjustment and optional `ret` must only be committed only when it is certain that the entire POP/POPRET instruction will commit.

For POPRET once the stack pointer adjustment has been committed the `ret` must execute.

<<<
For example:

[source,sail]
--
cm.popretz {ra, s0-s3}, 32;
--

Appears to software as:

[source,sail]
--
# any or all of these load instructions may execute multiple times
# therefore these updates may be visible in the interrupt/exception handler
lw   s3, 28(sp)
lw   s2, 24(sp)
lw   s1, 20(sp)
lw   s0, 16(sp)
lw   ra, 12(sp)

# these must only execute once, will only execute after all loads 
# complete successfully all instructions must execute atomically
# therefore these updates are not visible in the interrupt/exception handler
li a0, 0
addi sp, sp, 32
ret
--

[[pushpop_non-idem-mem]]
=== Non-idempotent memory handling

An implementation may have a requirement to issue a PUSH/POP instruction to non-idempotent memory. 

If the core implementation does not support PUSH/POP to non-idempotent memories, the core may use an idempotency PMA to detect it and take a 
load (POP/POPRET) or store (PUSH) access fault exception in order to avoid unpredictable results.

Software should only use these instructions on non-idempotent memory regions when software can tolerate the required memory accesses
being issued repeatedly in the case that they cause exceptions.

<<<

=== Example RV32I PUSH/POP sequences

The examples are included show the load/store series expansion and the stack adjustment. 
Examples of _cm.popret_ and _cm.popretz_ are not included, as the difference in the expanded sequence from _cm.pop_ is trivial in all cases.

==== cm.push  {ra, s0-s2}, -64

Encoding: _rlist_=7, _spimm_=3

expands to:

[source,sail]
--
sw  s2,  -4(sp);
sw  s1,  -8(sp);
sw  s0, -12(sp);
sw  ra, -16(sp);
addi sp, sp, -64;
--

==== cm.push {ra, s0-s11}, -112

Encoding: _rlist_=15, _spimm_=3

expands to:

[source,sail]
--
sw  s11,  -4(sp);
sw  s10,  -8(sp);
sw  s9,  -12(sp);
sw  s8,  -16(sp);
sw  s7,  -20(sp);
sw  s6,  -24(sp);
sw  s5,  -28(sp);
sw  s4,  -32(sp);
sw  s3,  -36(sp);
sw  s2,  -40(sp);
sw  s1,  -44(sp);
sw  s0,  -48(sp);
sw  ra,  -52(sp);
addi sp, sp, -112;
--

<<<

==== cm.pop   {ra}, 16

Encoding: _rlist_=4, _spimm_=0

expands to:

[source,sail]
--
lw   ra, 12(sp);
addi sp, sp, 16;
--

==== cm.pop {ra, s0-s3}, 48

Encoding: _rlist_=8, _spimm_=1

expands to:

[source,sail]
--
lw   s3, 44(sp);
lw   s2, 40(sp);
lw   s1, 36(sp);
lw   s0, 32(sp);
lw   ra, 28(sp);
addi sp, sp, 48;
--

==== cm.pop {ra, s0-s4}, 64

Encoding: _rlist_=9, _spimm_=2

expands to: 

[source,sail]
--
lw   s4, 60(sp);
lw   s3, 56(sp);
lw   s2, 52(sp);
lw   s1, 48(sp);
lw   s0, 44(sp);
lw   ra, 40(sp);
addi sp, sp, 64;
--

include::Zcmp_footer.adoc[]