
[[chap:zihintntl]]
== "Zihintntl" Extension for Non-Temporal Locality Hints, Version 1.0

The NTL instructions are HINTs that indicate that the explicit memory
accesses of the immediately subsequent instruction (henceforth "target
instruction") exhibit poor temporal locality of reference. The NTL
instructions do not change architectural state, nor do they alter the
architecturally visible effects of the target instruction. Four variants
are provided:

The NTL.P1 instruction indicates that the target instruction does not
exhibit temporal locality within the capacity of the innermost level of
private cache in the memory hierarchy. NTL.P1 is encoded as
ADD _x0, x0, x2_.

The NTL.PALL instruction indicates that the target instruction does not
exhibit temporal locality within the capacity of any level of private
cache in the memory hierarchy. NTL.PALL is encoded as ADD _x0, x0, x3_.

The NTL.S1 instruction indicates that the target instruction does not
exhibit temporal locality within the capacity of the innermost level of
shared cache in the memory hierarchy. NTL.S1 is encoded as
ADD _x0, x0, x4_.

The NTL.ALL instruction indicates that the target instruction does not
exhibit temporal locality within the capacity of any level of cache in
the memory hierarchy. NTL.ALL is encoded as ADD _x0, x0, x5_.
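
As an illustration of the encodings above, the following minimal sketch computes the 32-bit instruction word for each NTL variant from the standard R-type layout of ADD. The helper names `encode_add` and `encode_ntl` are hypothetical and chosen only for this example.

```python
# Major opcode OP (bits 6:0), shared by ADD; funct3 and funct7 are zero for ADD.
OP_OPCODE = 0b0110011

# rs2 register number that selects each NTL variant, as listed in the text.
NTL_RS2 = {"NTL.P1": 2, "NTL.PALL": 3, "NTL.S1": 4, "NTL.ALL": 5}


def encode_add(rd: int, rs1: int, rs2: int) -> int:
    """Return the 32-bit encoding of ADD rd, rs1, rs2."""
    for reg in (rd, rs1, rs2):
        if not 0 <= reg < 32:
            raise ValueError(f"register number out of range: {reg}")
    # R-type fields: rs2 in bits 24:20, rs1 in bits 19:15, rd in bits 11:7.
    return (rs2 << 20) | (rs1 << 15) | (rd << 7) | OP_OPCODE


def encode_ntl(variant: str) -> int:
    """Return the 32-bit encoding of the named NTL hint (ADD x0, x0, rs2)."""
    return encode_add(0, 0, NTL_RS2[variant])


if __name__ == "__main__":
    for name in NTL_RS2:
        print(f"{name:9s} {encode_ntl(name):#010x}")
```

Because the destination register is `x0`, each of these words is an ordinary ADD whose result is discarded, which is why an implementation that ignores the hint still executes it correctly.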

[NOTE]
====
The NTL instructions can be used to avoid cache pollution when streaming
data or traversing large data structures, or to reduce latency in
producer-consumer interactions.

A microarchitecture might use the NTL instructions to inform the cache
replacement policy, or to decide which cache to allocate into, or to
avoid cache allocation altogether. For example, NTL.P1 might indicate
that an implementation should not allocate a line in a private L1 cache,
but should allocate in L2 (whether private or shared). In another
implementation, NTL.P1 might allocate the line in L1, but in the
least-recently used state.

NTL.ALL will typically inform implementations not to allocate anywhere
in the cache hierarchy. Programmers should use NTL.ALL for accesses that
have no exploitable temporal locality.

Like any HINTs, these instructions may be freely ignored. Hence,
although they are described in terms of cache-based memory hierarchies,
they do not mandate the provision of caches.

Some implementations might respect these HINTs for some memory accesses
but not others: e.g., implementations that implement LR/SC by acquiring
a cache line in the exclusive state in L1 might ignore NTL instructions
on LR and SC, but might respect NTL instructions for AMOs and regular
loads and stores.
====

<<ntl-portable>> lists several software use cases and the recommended NTL variant that _portable_ software—i.e., software not tuned for any specific implementation's memory hierarchy—should use in each case.

[[ntl-portable]]
.Recommended NTL variant for portable software to employ in various scenarios.
[%autowidth,float="center",align="center",cols="<,<",options="header",]
|===
|Scenario |Recommended NTL variant
|Access to a working set between 64 KiB and 256 KiB in size |NTL.P1
|Access to a working set between 256 KiB and 1 MiB in size |NTL.PALL
|Access to a working set greater than 1 MiB in size |NTL.S1
|Access with no exploitable temporal locality (e.g., streaming) |NTL.ALL
|Access to a contended synchronization variable |NTL.PALL
|===

[NOTE]
====
The working-set sizes listed in <<ntl-portable>> are not meant to
constrain implementers' cache-sizing decisions.
Cache sizes will obviously vary between implementations, and so software
writers should only take these working-set sizes as rough guidelines.
====
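
The selection guidance for portable software could be captured in a small helper such as the sketch below. The function name `portable_ntl_variant` and its parameters are hypothetical. The byte thresholds are parameters whose defaults (64 KiB, 256 KiB, and 1 MiB) are assumptions of this sketch and should be read as rough guidelines only. How a size that falls exactly on a boundary is classified is an arbitrary choice made here.

```python
from typing import Optional

KIB = 1024
MIB = 1024 * KIB


def portable_ntl_variant(
    working_set_bytes: int = 0,
    *,
    streaming: bool = False,
    contended_sync: bool = False,
    p1_min: int = 64 * KIB,     # assumed lower bound for NTL.P1
    pall_min: int = 256 * KIB,  # assumed lower bound for NTL.PALL
    s1_min: int = 1 * MIB,      # sizes strictly above this use NTL.S1
) -> Optional[str]:
    """Pick an NTL variant for portable software, or None for no hint."""
    if streaming:
        # No exploitable temporal locality at any cache level.
        return "NTL.ALL"
    if contended_sync:
        # Contended synchronization variable.
        return "NTL.PALL"
    if working_set_bytes > s1_min:
        return "NTL.S1"
    if working_set_bytes >= pall_min:
        return "NTL.PALL"
    if working_set_bytes >= p1_min:
        return "NTL.P1"
    # Small working sets are assumed to fit in the innermost cache.
    return None
```

For example, `portable_ntl_variant(512 * KIB)` yields `"NTL.PALL"` under the default thresholds, while `portable_ntl_variant(streaming=True)` yields `"NTL.ALL"` regardless of size.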

<<ntl>> lists several sample memory hierarchies and
recommends how each NTL variant maps onto each cache level. The table
also recommends which NTL variant that implementation-tuned software
should use to avoid allocating in a particular cache level. For example,
for a system with a private L1 and a shared L2, it is recommended that
NTL.P1 and NTL.PALL indicate that temporal locality cannot be exploited
by the L1, and that NTL.S1 and NTL.ALL indicate that temporal locality
cannot be exploited by the L2. Furthermore, software tuned for such a
system should use NTL.P1 to indicate a lack of temporal locality
exploitable by the L1, or should use NTL.ALL to indicate a lack of temporal
locality exploitable by the L2.

If the C extension is provided, compressed variants of these HINTs are
also provided: C.NTL.P1 is encoded as C.ADD _x0, x2_; C.NTL.PALL is
encoded as C.ADD _x0, x3_; C.NTL.S1 is encoded as C.ADD _x0, x4_; and
C.NTL.ALL is encoded as C.ADD _x0, x5_.
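
The compressed encodings can be derived the same way. The sketch below assumes the standard CR-format layout of C.ADD (funct4 of `1001` in bits 15:12, rd/rs1 in bits 11:7, rs2 in bits 6:2, and quadrant `10` in bits 1:0). The helper name `encode_c_ntl` is hypothetical.

```python
# C.ADD with all register fields zeroed: funct4 = 0b1001, quadrant op = 0b10.
C_ADD_BASE = (0b1001 << 12) | 0b10

# rs2 register number that selects each compressed NTL variant.
C_NTL_RS2 = {"C.NTL.P1": 2, "C.NTL.PALL": 3, "C.NTL.S1": 4, "C.NTL.ALL": 5}


def encode_c_ntl(variant: str) -> int:
    """Return the 16-bit encoding of the named hint (C.ADD x0, rs2)."""
    rd = 0  # destination is x0, so the result is discarded
    rs2 = C_NTL_RS2[variant]
    return C_ADD_BASE | (rd << 7) | (rs2 << 2)


if __name__ == "__main__":
    for name in C_NTL_RS2:
        print(f"{name:11s} {encode_c_ntl(name):#06x}")
```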

The NTL instructions affect all memory-access instructions except the
cache-management instructions in the Zicbom extension.

[NOTE]
====
As of this writing, there are no other exceptions to this rule, and so
the NTL instructions affect all memory-access instructions defined in
the base ISAs and the A, F, D, Q, C, and V standard extensions, as well
as those defined within the hypervisor extension in Volume II.

The NTL instructions can affect cache-management operations other than
those in the Zicbom extension. For example, NTL.PALL followed by
CBO.ZERO might indicate that the line should be allocated in L3 and
zeroed, but not allocated in L1 or L2.
====

<<<

[[ntl]]
[%autowidth,float="center",align="center",cols="<,^,^,^,^,^,^,^,^",options="header"]
.Mapping of NTL variants to various memory hierarchies.
|===
| Memory hierarchy 4+| Recommended mapping of NTL +
variant to actual cache level 4+| Recommended NTL variant for +
explicit cache management  
|
|P1 |PALL |S1 |ALL
|L1 |L2 |L3 |L4/L5
 9+^| Common Scenarios
| No caches 4+|--- 4+|none                   
|Private L1 only |L1 |L1 |L1 |L1| ALL |--- |--- |--- 
|Private L1; shared L2 |L1  |L1  |L2  |L2 |P1|ALL|---|---   
|Private L1; shared L2/L3 |L1 | L1 | L2 | L3 |P1  |S1   |ALL |---
|Private L1/L2 |L1  |L2  |L2  |L2 | P1  |ALL  |--- |---
|Private L1/L2; shared L3 |L1 | L2 | L3 | L3 | P1 | PALL| ALL |---
|Private L1/L2; shared L3/L4 | L1 | L2|  L3 | L4 | P1 | PALL | S1 | ALL
 9+^| Uncommon Scenarios
|Private L1/L2/L3; shared L4 | L1 | L3 |L4 |L4 |P1 |P1 |PALL |ALL
|Private L1; shared L2/L3/L4 |L1 | L1 |L2 |L4 |P1 |S1 |ALL |ALL  
|Private L1/L2; shared L3/L4/L5  |L1 | L2 | L3 | L5 |P1 | PALL |S1 |ALL  
|Private L1/L2/L3; shared L4/L5  |L1 |L3 |L4 |L5 |P1 |P1 |PALL |ALL  
|===
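
The left half of the table follows a regular pattern, which the sketch below reproduces for hierarchies with at least one private level. This is an observation about the listed rows and not a rule stated by the specification: P1 maps to L1, PALL to the outermost private level, S1 to the innermost shared level (or the outermost private level when nothing is shared), and ALL to the outermost level. The function name `recommended_mapping` is hypothetical, and the right half of the table (explicit cache management) is not modeled.

```python
from typing import Dict, Optional


def recommended_mapping(num_private: int, num_shared: int) -> Optional[Dict[str, str]]:
    """Map each NTL variant to a cache level for a private/shared hierarchy.

    Levels are numbered from L1 (innermost); private levels come first.
    Returns None when there are no caches at all.
    """
    if num_private < 0 or num_shared < 0:
        raise ValueError("level counts must be non-negative")
    total = num_private + num_shared
    if total == 0:
        return None  # "No caches" row: no mapping applies
    if num_private == 0:
        raise ValueError("hierarchies without a private level are not in the table")
    # Innermost shared level, falling back to the outermost private level.
    s1_level = num_private + 1 if num_shared > 0 else num_private
    return {
        "P1": "L1",
        "PALL": f"L{num_private}",
        "S1": f"L{s1_level}",
        "ALL": f"L{total}",
    }
```

For instance, `recommended_mapping(1, 1)` gives P1 and PALL mapped to L1 and S1 and ALL mapped to L2, matching the "Private L1; shared L2" row.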

When an NTL instruction is applied to a prefetch hint in the Zicbop
extension, it indicates that a cache line should be prefetched into a
cache that is _outer_ from the level specified by the NTL.

[NOTE]
====
For example, in a system with a private L1 and shared L2, NTL.P1
followed by PREFETCH.R might prefetch into L2 with read intent.

To prefetch into the innermost level of cache, do not prefix the
prefetch instruction with an NTL instruction.

In some systems, NTL.ALL followed by a prefetch instruction might
prefetch into a cache or prefetch buffer internal to a memory
controller.
====

Software is discouraged from following an NTL instruction with an
instruction that does not explicitly access memory. Nonadherence to this
recommendation might reduce performance but otherwise has no
architecturally visible effect.
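
A toolchain or lint pass could flag NTL hints whose target does not access memory. The sketch below is deliberately simplified and rests on several assumptions: it examines only a list of 32-bit instruction words, it recognizes memory accesses solely by the five major opcodes LOAD, STORE, LOAD-FP, STORE-FP, and AMO, and it does not recognize compressed instructions, Zicbop prefetch hints, or CBO.ZERO, all of which it would report as non-memory targets. The function name `find_wasted_ntl_hints` is hypothetical.

```python
from typing import List

# 32-bit encodings of the NTL hints (ADD x0, x0, x2 through x5).
NTL_WORDS = {
    0x00200033: "NTL.P1",
    0x00300033: "NTL.PALL",
    0x00400033: "NTL.S1",
    0x00500033: "NTL.ALL",
}

# Major opcodes (bits 6:0) treated as explicit memory accesses in this sketch.
MEMORY_OPCODES = {
    0b0000011,  # LOAD
    0b0100011,  # STORE
    0b0000111,  # LOAD-FP (also used by vector loads)
    0b0100111,  # STORE-FP (also used by vector stores)
    0b0101111,  # AMO (LR, SC, and AMOs)
}


def find_wasted_ntl_hints(words: List[int]) -> List[int]:
    """Return indices of NTL hints not followed by a recognized memory access."""
    wasted = []
    for i, word in enumerate(words):
        if word not in NTL_WORDS:
            continue
        has_target = i + 1 < len(words)
        if not has_target or (words[i + 1] & 0x7F) not in MEMORY_OPCODES:
            wasted.append(i)
    return wasted
```

A flagged hint is harmless to program semantics, consistent with the text above; at worst it costs performance.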

In the event that a trap is taken on the target instruction,
implementations are discouraged from applying the NTL to the first
instruction in the trap handler. Instead, implementations are
recommended to ignore the HINT in this case.

[NOTE]
====
If an interrupt occurs between the execution of an NTL instruction and
its target instruction, execution will normally resume at the target
instruction. That the NTL instruction is not reexecuted does not change
the semantics of the program.

Some implementations might prefer not to process the NTL instruction
until the target instruction is seen (e.g., so that the NTL can be fused
with the memory access it modifies). Such implementations might
preferentially take the interrupt before the NTL, rather than between
the NTL and the memory access.
====
'''
[TIP]
====
Since the NTL instructions are encoded as ADDs, they can be used within
LR/SC loops without voiding the forward-progress guarantee. But, since
using other loads and stores within an LR/SC loop _does_ void the
forward-progress guarantee, the only reason to use an NTL within such a
loop is to modify the LR or the SC.
====