Notes on the external ABI presented by libgomp.  This ought to get
transformed into proper documentation at some point.

Implementing MASTER construct

	if (omp_get_thread_num () == 0)
	  block

	Alternatively, we could generate two copies of the parallel
	subfunction and include this block only in the version run by
	the master thread.  Surely that's not worthwhile, though...

Implementing CRITICAL construct

	Without a specified name,

	void GOMP_critical_start (void);
	void GOMP_critical_end (void);

	so that we don't get COPY relocations from libgomp to the main
	application.

	With a specified name, use omp_set_lock and omp_unset_lock,
	with the name transformed into a variable declared like

		omp_lock_t gomp_critical_user_<name>
			__attribute__((common))

	Ideally the ABI would specify that all zero is a valid unlocked
	state, and so we wouldn't actually need to initialize this at
	startup.
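
	For example, a critical section named "foo" might then expand
	to something like the following sketch:

		omp_lock_t gomp_critical_user_foo __attribute__((common));

		omp_set_lock (&gomp_critical_user_foo);
		/* body */
		omp_unset_lock (&gomp_critical_user_foo);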

Implementing ATOMIC construct

	The target should implement the __sync builtins.

	Failing that we could add

	void GOMP_atomic_enter (void)
	void GOMP_atomic_exit (void)

	which reuses the regular lock code, but with yet another lock
	object private to the library.
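
	For example, "#pragma omp atomic" applied to "x += 1" might
	expand either to the builtin form

		__sync_fetch_and_add (&x, 1);

	or, with the fallback entry points, to

		GOMP_atomic_enter ();
		x += 1;
		GOMP_atomic_exit ();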

Implementing FLUSH construct

	Expands to the __sync_synchronize builtin.
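
	That is,

		#pragma omp flush

		=>

		__sync_synchronize ();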

Implementing BARRIER construct

	void GOMP_barrier (void)

Implementing THREADPRIVATE construct

	In _most_ cases we can map this directly to __thread.  Except
	that OMP allows constructors for C++ objects.  We can either
	refuse to support this (how often is it used?) or we can 
	implement something akin to .ctors.

	Ideally, this ctor feature would be handled by extensions to
	the main pthreads library itself.  Failing that, we can provide
	a set of entry points to register ctor functions to be called.
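
	For example, in the simple non-ctor case,

		int x;
		#pragma omp threadprivate (x)

		=>

		__thread int x;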

Implementing PRIVATE clause

	In association with a PARALLEL, or within the lexical extent
	of a PARALLEL block, the variable becomes a local variable in
	the parallel subfunction.

	In association with FOR or SECTIONS blocks, create a new
	automatic variable within the current function.  This preserves
	the semantics of new-variable creation.
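
	A sketch of the FOR case:

		#pragma omp for private(x)
		for (i = 0; i < n; i++)
		  body;

		=>

		{
		  int x;	/* fresh automatic variable */

		  /* for stuff */
		}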

Implementing FIRSTPRIVATE, LASTPRIVATE, COPYIN, COPYPRIVATE clauses

	Seems simple enough for PARALLEL blocks.  Create a private
	struct for communicating between parent and subfunction.
	In the parent, copy in values for scalars and "small" structs;
	copy in addresses for other TREE_ADDRESSABLE types.  In the
	subfunction, copy the value into the local variable.
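
	A sketch for a PARALLEL block with firstprivate(a), where the
	struct and field names are illustrative only:

		struct omp_data_s { int a; };

		/* in the parent, before starting the team */
		struct omp_data_s data;
		data.a = a;

		/* in the subfunction */
		int a = ((struct omp_data_s *) data)->a;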

	Not clear at all what to do with bare FOR or SECTIONS blocks.
	The only thing I can figure is that we do something like


		#pragma omp for firstprivate(x) lastprivate(y)
		for (int i = 0; i < n; ++i)
		  body;

		=>

		{
		  int x = x, y;

		  // for stuff

		  if (i == n)
		    y = y;
		}

	where the "x=x" and "y=y" assignments actually have different
	uids for the two variables, i.e. not something you could write
	directly in C.  Presumably this only makes sense if the "outer"
	x and y are global variables.

	COPYPRIVATE would work the same way, except the structure 
	broadcast would have to happen via SINGLE machinery instead.

Implementing REDUCTION clause

	The private struct mentioned above should have a pointer to
	an array of the type of the variable, indexed by the thread's
	team_id.  The thread stores its final value into the array,
	and after the barrier the master thread iterates over the
	array to collect the values.
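
	A sketch for reduction(+:sum), where the RED member of the
	communication struct is illustrative only:

		/* each thread, at the end of the region */
		data->red[team_id] = local_sum;

		/* master thread, after the barrier */
		for (t = 0; t < nthreads; t++)
		  sum += data->red[t];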

Implementing PARALLEL construct

	#pragma omp parallel
	{
	  body;
	}

	=>

	void subfunction (void *data)
	{
	  use data;
	  body;
	}

	setup data;
	GOMP_parallel_start (subfunction, &data, num_threads);
	subfunction (&data);
	GOMP_parallel_end ();

  void GOMP_parallel_start (void (*fn)(void *), void *data,
			    unsigned num_threads)

	The FN argument is the subfunction to be run in parallel.

	The DATA argument is a pointer to a structure used to 
	communicate data in and out of the subfunction, as discussed
	above wrt FIRSTPRIVATE et al.

	The NUM_THREADS argument is 1 if an IF clause is present and
	evaluates to false; otherwise it is the value of the NUM_THREADS
	clause, if present, or 0.
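
	So, for example,

		#pragma omp parallel if (cond) num_threads (4)

	would pass "cond ? 4 : 1" as NUM_THREADS, and a bare
	"#pragma omp parallel" would pass 0.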

	The function needs to create the appropriate number of
	threads and/or launch them from the dock.  It needs to
	create the team structure and assign team ids.

  void GOMP_parallel_end (void)

	Tears down the team and returns us to the previous
	omp_in_parallel() state.

Implementing FOR construct

	#pragma omp parallel for
	for (i = lb; i <= ub; i++)
	  body;

	=>

	void subfunction (void *data)
	{
	  long _s0, _e0;
	  while (GOMP_loop_static_next (&_s0, &_e0))
	    {
	      long _e1 = _e0, i;
	      for (i = _s0; i < _e1; i++)
		body;
	    }
	  GOMP_loop_end_nowait ();
	}

	GOMP_parallel_loop_static (subfunction, NULL, 0, lb, ub+1, 1, 0);
	subfunction (NULL);
	GOMP_parallel_end ();

	#pragma omp for schedule(runtime)
	for (i = 0; i < n; i++)
	  body;

	=>

	{
	  long i, _s0, _e0;
	  if (GOMP_loop_runtime_start (0, n, 1, &_s0, &_e0))
	    do {
	      long _e1 = _e0;
	      for (i = _s0; i < _e1; i++)
	        body;
	    } while (GOMP_loop_runtime_next (&_s0, &_e0));
	  GOMP_loop_end ();
	}

	Note that while it looks like there is trickiness to propagating
	a non-constant STEP, there isn't really.  We're explicitly allowed
	to evaluate it as many times as we want, and any variables involved
	should automatically be handled as PRIVATE or SHARED like any other
	variables.  So the expression should remain evaluable in the
	subfunction.  We can pull it into a local variable if we like,
	but since it's supposed to remain unchanged, we can just as well
	leave it alone.

	If we have SCHEDULE(STATIC), and no ORDERED, then we ought to be
	able to get away with no work-sharing context at all, since we can
	simply perform the arithmetic directly in each thread to divide up
	the iterations.  Which would mean that we wouldn't need to call any
	of these routines.
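
	A sketch of that per-thread arithmetic for a block schedule of
	N iterations over NTHREADS threads, given our TEAM_ID:

		q = n / nthreads;
		r = n % nthreads;
		/* the first R threads each take one extra iteration */
		if (team_id < r)
		  {
		    _s0 = team_id * (q + 1);
		    _e0 = _s0 + q + 1;
		  }
		else
		  {
		    _s0 = team_id * q + r;
		    _e0 = _s0 + q;
		  }
		for (i = _s0; i < _e0; i++)
		  body;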

	There are separate routines for handling loops with an ORDERED
	clause.  Bookkeeping for that is non-trivial...

Implementing ORDERED construct

	void GOMP_ordered_start (void)
	void GOMP_ordered_end (void)
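
	Within a loop using the ORDERED variants of the loop routines,
	the ORDERED block itself would simply bracket its body:

		GOMP_ordered_start ();
		stmt;
		GOMP_ordered_end ();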

Implementing SECTIONS construct

	#pragma omp sections
	{
	  #pragma omp section
	  stmt1;
	  #pragma omp section
	  stmt2;
	  #pragma omp section
	  stmt3;
	}

	=>
	
	for (i = GOMP_sections_start (3); i != 0; i = GOMP_sections_next ())
	  switch (i)
	    {
	    case 1:
	      stmt1;
	      break;
	    case 2:
	      stmt2;
	      break;
	    case 3:
	      stmt3;
	      break;
	    }
	GOMP_barrier ();

Implementing SINGLE construct

	#pragma omp single
	{
	  body;
	}

	=>

	if (GOMP_single_start ())
	  body;
	GOMP_barrier ();


	#pragma omp single copyprivate(x)
	body;

	=>

	datap = GOMP_single_copy_start ();
	if (datap == NULL)
	  {
	    body;
	    data.x = x;
	    GOMP_single_copy_end (&data);
	  }
	else
	  x = datap->x;
	GOMP_barrier ();