\chapter{``Zihintntl'' Non-Temporal Locality Hints, Version 0.2}
\label{chap:zihintntl}

The NTL instructions are HINTs that indicate that the explicit memory accesses of the immediately subsequent
instruction (henceforth ``target instruction'') exhibit poor temporal locality of reference.
The NTL instructions do not change architectural state, nor do they alter the
architecturally visible effects of the target instruction.
Four variants are provided:

The NTL.P1 instruction indicates that the target instruction
does not exhibit temporal locality within the capacity of the innermost level
of private cache in the memory hierarchy.
NTL.P1 is encoded as \mbox{ADD {\em x0, x0, x2}}.

The NTL.PALL instruction indicates that the target instruction
does not exhibit temporal locality within the capacity of any
level of private cache in the memory hierarchy.
NTL.PALL is encoded as \mbox{ADD {\em x0, x0, x3}}.

The NTL.S1 instruction indicates that the target instruction
does not exhibit temporal locality within the capacity of the innermost level
of shared cache in the memory hierarchy.
NTL.S1 is encoded as \mbox{ADD {\em x0, x0, x4}}.

The NTL.ALL instruction indicates that the target
instruction does not exhibit temporal locality within the capacity of any
level of cache in the memory hierarchy.
NTL.ALL is encoded as \mbox{ADD {\em x0, x0, x5}}.
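
\begin{commentary}
As a worked illustration of these encodings, assume the standard R-type layout
of ADD from the base ISA: opcode {\tt 0110011} ({\tt 0x33}) in bits 6--0 and
{\em rs2} in bits 24--20, with {\em funct7}, {\em rs1}, {\em funct3}, and
{\em rd} all zero for these HINTs.
The four variants then differ only in {\em rs2}:
\[
\mbox{encoding} = \mbox{\em rs2} \times 2^{20} + \mbox{\tt 0x33},
\qquad \mbox{\em rs2} \in \{2, 3, 4, 5\}
\]
Under that assumption, NTL.P1 is {\tt 0x00200033}, NTL.PALL is
{\tt 0x00300033}, NTL.S1 is {\tt 0x00400033}, and NTL.ALL is {\tt 0x00500033}.
\end{commentary}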

\begin{commentary}
The NTL instructions can be used to avoid cache pollution when streaming data
or traversing large data structures, or to reduce latency in producer-consumer
interactions.

A microarchitecture might use the NTL instructions to inform the cache
replacement policy, or to decide which cache to allocate into, or to avoid
cache allocation altogether.
For example, NTL.P1 might indicate that an implementation should not allocate
a line in a private L1 cache, but should allocate in L2 (whether private or
shared).
In another implementation, NTL.P1 might allocate the line in L1, but in
the least-recently used state.

NTL.ALL will typically inform implementations not to allocate anywhere in the
cache hierarchy.
Programmers should use NTL.ALL for accesses that have no exploitable temporal
locality.

Like all HINTs, these instructions may be freely ignored.
Hence, although they are described in terms of cache-based memory hierarchies,
they do not mandate the provision of caches.

Some implementations might respect these HINTs for some memory accesses but
not others: e.g., implementations that implement LR/SC by acquiring a
cache line in the exclusive state in L1 might ignore NTL instructions
on LR and SC, but might respect NTL instructions for
AMOs and regular loads and stores.
\end{commentary}
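
\begin{commentary}
The streaming use case above might look like the following minimal sketch of a
copy loop.
It assumes RV64 and spells each HINT as the ADD encoding given earlier; the
register assignments and the label {\tt copy\_loop} are hypothetical, chosen
only for illustration.
Each NTL.ALL applies only to the single memory access that immediately follows
it, so the load and the store each get their own HINT.

\begin{verbatim}
# Assumed inputs: a0 = destination, a1 = source,
#                 a2 = number of doublewords to copy (nonzero)
copy_loop:
    add   x0, x0, x5      # NTL.ALL; target is the load below
    ld    t0, 0(a1)
    add   x0, x0, x5      # NTL.ALL; target is the store below
    sd    t0, 0(a0)
    addi  a1, a1, 8       # advance source pointer
    addi  a0, a0, 8       # advance destination pointer
    addi  a2, a2, -1      # one fewer doubleword remaining
    bnez  a2, copy_loop
\end{verbatim}

An implementation that ignores the HINTs executes this loop with identical
architectural results.
\end{commentary}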

Table~\ref{tab:ntl-portable} lists several software use cases and the
recommended NTL variant that {\em portable} software---i.e., software not
tuned for any specific implementation's memory hierarchy---should use in each
case.

\begin{table}[h!]
\begin{center}
\begin{tabular}{|l|l|}
\hline
Scenario & Recommended NTL variant \\
\hline
Access to a working set between \wunits{64}{KiB} and \wunits{256}{KiB} in size & NTL.P1 \\
Access to a working set between \wunits{256}{KiB} and \wunits{1}{MiB} in size  & NTL.PALL \\
Access to a working set greater than \wunits{1}{MiB} in size                   & NTL.S1 \\
Access with no exploitable temporal locality (e.g., streaming)                 & NTL.ALL \\
Access to a contended synchronization variable                                 & NTL.PALL \\
\hline
\end{tabular}
\end{center}
\caption{Recommended NTL variant for portable software to employ in various scenarios.}
\label{tab:ntl-portable}
\end{table}

\begin{commentary}
Cache sizes will obviously vary between implementations, and so the working-set
sizes listed in Table~\ref{tab:ntl-portable} are merely rough guidelines.
\end{commentary}
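
\begin{commentary}
As a sketch of the last row of Table~\ref{tab:ntl-portable}, portable software
incrementing a contended counter might prefix the AMO with NTL.PALL.
The counter address in {\tt a0} is an assumption made for this example, and the
HINT is written as the ADD encoding given earlier.

\begin{verbatim}
# Assumed input: a0 = address of a contended 32-bit counter
    li        t0, 1
    add       x0, x0, x3        # NTL.PALL; target is the AMO below
    amoadd.w  zero, t0, (a0)    # atomically add 1, discarding the old value
\end{verbatim}
\end{commentary}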

Table~\ref{tab:ntl} lists several sample memory hierarchies and recommends
how each NTL variant maps onto each cache level.
The table also recommends which NTL variant implementation-tuned
software should use to avoid allocating in a particular cache level.
For example, for a system with a private L1 and a shared L2, it is recommended
that NTL.P1 and NTL.PALL indicate that temporal locality cannot be exploited by
the L1, and that NTL.S1 and NTL.ALL indicate that temporal locality cannot be
exploited by the L2.
Furthermore, software tuned for such a system should use NTL.P1 to indicate
a lack of temporal locality exploitable by the L1, or should use NTL.ALL
to indicate a lack of temporal locality exploitable by the L2.

\begin{table}[h!]
\begin{center}
\scalebox{0.95}{
\begin{tabular}{|l|WWWW|WWWW|}
\hline
Memory hierarchy & \multicolumn{4}{c|}{Recommended mapping of NTL}    & \multicolumn{4}{c|}{Recommended NTL variant for} \\
                 & \multicolumn{4}{c|}{variant to actual cache level} & \multicolumn{4}{c|}{explicit cache management} \\
\hline
                                 & P1  & PALL& S1  & ALL& L1  & L2  & L3  & L4/L5  \\
\hline
\multicolumn{9}{|c|}{Common Scenarios} \\
\hline
No caches             & \multicolumn{4}{c|}{---} & \multicolumn{4}{c|}{\em none} \\
\hline
Private L1 only                  & L1  & L1  & L1  & L1 & ALL & --- & --- & --- \\
Private L1; shared L2            & L1  & L1  & L2  & L2 & P1  & ALL & --- & --- \\
Private L1; shared L2/L3         & L1  & L1  & L2  & L3 & P1  & S1  & ALL & --- \\
Private L1/L2                    & L1  & L2  & L2  & L2 & P1  & ALL & --- & --- \\
Private L1/L2; shared L3         & L1  & L2  & L3  & L3 & P1  & PALL& ALL & --- \\
Private L1/L2; shared L3/L4      & L1  & L2  & L3  & L4 & P1  & PALL& S1  & ALL \\
\hline
\multicolumn{9}{|c|}{Uncommon Scenarios} \\
\hline
Private L1/L2/L3; shared L4      & L1  & L3  & L4  & L4 & P1  & P1  & PALL& ALL \\
Private L1; shared L2/L3/L4      & L1  & L1  & L2  & L4 & P1  & S1  & ALL & ALL \\
Private L1/L2; shared L3/L4/L5   & L1  & L2  & L3  & L5 & P1  & PALL& S1  & ALL \\
Private L1/L2/L3; shared L4/L5   & L1  & L3  & L4  & L5 & P1  & P1  & PALL& ALL \\
\hline
\end{tabular}}
\end{center}
\caption{Mapping of NTL variants to various memory hierarchies.}
\label{tab:ntl}
\end{table}

If the C extension is provided, compressed variants of these HINTs are also
provided:
C.NTL.P1 is encoded as \mbox{C.ADD {\em x0, x2}};
C.NTL.PALL is encoded as \mbox{C.ADD {\em x0, x3}};
C.NTL.S1 is encoded as \mbox{C.ADD {\em x0, x4}};
and C.NTL.ALL is encoded as \mbox{C.ADD {\em x0, x5}}.
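
\begin{commentary}
As a worked illustration, assume the standard CR-format layout of C.ADD from
the C extension: {\em funct4} {\tt 1001} in bits 15--12, {\em rd/rs1} in bits
11--7, {\em rs2} in bits 6--2, and {\em op} {\tt 10} in bits 1--0.
With {\em rd/rs1} equal to zero, the fixed bits contribute {\tt 0x9002}, and
the compressed variants differ only in {\em rs2}:
\[
\mbox{encoding} = \mbox{\tt 0x9002} + \mbox{\em rs2} \times 4,
\qquad \mbox{\em rs2} \in \{2, 3, 4, 5\}
\]
Under that assumption, C.NTL.P1 is {\tt 0x900A}, C.NTL.PALL is {\tt 0x900E},
C.NTL.S1 is {\tt 0x9012}, and C.NTL.ALL is {\tt 0x9016}.
\end{commentary}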

The NTL instructions affect all memory-access instructions except the
cache-management instructions in the Zicbom extension.

\begin{commentary}
As of this writing, there are no other exceptions to this rule, and so
the NTL instructions affect all memory-access instructions
defined in the base ISAs and the A, F, D, Q, C, and V standard extensions,
as well as those defined within the hypervisor extension in Volume II.

The NTL instructions can affect cache-management operations other than those
in the Zicbom extension.
For example, NTL.PALL followed by CBO.ZERO might indicate
that the line should be allocated in L3 and zeroed, but not allocated in
L1 or L2.
\end{commentary}

When an NTL instruction is applied to a prefetch hint in the Zicbop extension,
it indicates that a cache line should be prefetched into a cache that is
{\em outer} from the level specified by the NTL.

\begin{commentary}
For example, in a system with a private L1 and shared L2, NTL.P1 followed by
PREFETCH.R might prefetch into L2 with read intent.

To prefetch into the innermost level of cache, do not prefix the prefetch
instruction with an NTL instruction.

In some systems, NTL.ALL followed by a prefetch instruction might prefetch
into a cache or prefetch buffer internal to a memory controller.
\end{commentary}
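
\begin{commentary}
The two prefetch cases above can be contrasted in a short sketch.
It assumes the Zicbop assembly syntax {\tt prefetch.r offset(base)} and a
hypothetical buffer address in {\tt a0}; which cache actually receives the line
remains implementation-defined.

\begin{verbatim}
# Assumed input: a0 = address of data that will be read soon

# Prefetch toward the innermost cache: no NTL prefix.
    prefetch.r  0(a0)

# Prefetch into a cache outer from the innermost private level.
    add         x0, x0, x2      # NTL.P1; target is the prefetch below
    prefetch.r  0(a0)
\end{verbatim}
\end{commentary}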

Software is discouraged from following an NTL instruction with an
instruction that does not explicitly access memory.
Nonadherence to this recommendation might reduce performance but
otherwise has no architecturally visible effect.
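
\begin{commentary}
The following sketch illustrates this recommendation; the registers are chosen
arbitrarily, RV64 is assumed, and the HINT is written as the ADD encoding given
earlier.
In the first sequence the instruction after the NTL is an ADDI, which does not
access memory.
The second sequence performs the same computation with the NTL placed directly
before the load.

\begin{verbatim}
# Discouraged: the NTL is followed by an instruction without a memory access.
    add   x0, x0, x5      # NTL.ALL
    addi  a1, a1, 8
    ld    t0, 0(a1)

# Preferred: the NTL immediately precedes its target instruction.
    addi  a1, a1, 8
    add   x0, x0, x5      # NTL.ALL; target is the load below
    ld    t0, 0(a1)
\end{verbatim}
\end{commentary}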

In the event that a trap is taken on the target instruction,
implementations are discouraged from applying the NTL to the first instruction
in the trap handler.
Instead, implementations are recommended to ignore the HINT in this case.

\begin{commentary}
If an interrupt occurs between the execution of an NTL instruction and its
target instruction, execution will normally resume at the
target instruction.
That the NTL instruction is not reexecuted does not change the semantics of
the program.

Some implementations might prefer not to process the NTL instruction until the
target instruction is seen (e.g., so that the NTL can be
fused with the memory access it modifies).
Such implementations might preferentially take the interrupt before the NTL,
rather than between the NTL and the memory access.
\end{commentary}

\begin{commentary}
Since the NTL instructions are encoded as ADDs, they can be used within LR/SC
loops without voiding the forward-progress guarantee.
But, since using other loads and stores within an LR/SC loop {\em does}
void the forward-progress guarantee, the only reason to use an NTL
within such a loop is to modify the LR or the SC.
\end{commentary}
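
\begin{commentary}
A minimal sketch of such a loop follows: an atomic increment in which both the
LR and the SC carry an NTL.PALL.
The counter address in {\tt a0} and the label {\tt retry} are assumptions made
for this example.
The loop body contains only the LR, the SC, and base-ISA integer instructions,
and its only branch is the backward retry taken when the SC fails.

\begin{verbatim}
# Assumed input: a0 = address of a 32-bit counter
retry:
    add   x0, x0, x3      # NTL.PALL; target is the LR below
    lr.w  t0, (a0)
    addi  t0, t0, 1
    add   x0, x0, x3      # NTL.PALL; target is the SC below
    sc.w  t1, t0, (a0)    # t1 = 0 on success, nonzero on failure
    bnez  t1, retry
\end{verbatim}
\end{commentary}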