author | Julian Brown <julian@codesourcery.com> | 2021-02-26 04:34:49 -0800 |
---|---|---|
committer | Thomas Schwinge <thomas@codesourcery.com> | 2021-05-21 18:58:07 +0200 |
commit | 29a2f51806c5b30e17a8d0e9ba7915a3c53c34ff (patch) | |
tree | 0e18184b3d50407c5100b2dcc182134227ceb053 /gcc/omp-offload.c | |
parent | 782e57f2c0900f3c3bbaec4b367568b6d05236b8 (diff) | |
download | gcc-29a2f51806c5b30e17a8d0e9ba7915a3c53c34ff.zip gcc-29a2f51806c5b30e17a8d0e9ba7915a3c53c34ff.tar.gz gcc-29a2f51806c5b30e17a8d0e9ba7915a3c53c34ff.tar.bz2 |
openacc: Add support for gang local storage allocation in shared memory [PR90115]

This patch implements a method to track the "private-ness" of
OpenACC variables declared in offload regions in gang-partitioned,
worker-partitioned or vector-partitioned modes. Variables declared
implicitly in scoped blocks and those declared "private" on enclosing
directives (e.g. "acc parallel") are both handled. Variables that are
e.g. gang-private can then be adjusted so they reside in GPU shared
memory.

The reason for doing this is twofold: correct implementation of OpenACC
semantics, and optimisation, since shared memory might be faster than
the main memory on a GPU. Handling of private variables is intimately
tied to the execution model for gangs/workers/vectors implemented by
a particular target: for current targets, we use (or on mainline, will
soon use) a broadcasting/neutering scheme.

That is sufficient for code that e.g. sets a variable in worker-single
mode and expects to use the value in worker-partitioned mode. The
difficulty (semantics-wise) comes when the user wants to do something like
an atomic operation in worker-partitioned mode and expects a worker-single
(gang private) variable to be shared across each partitioned worker.
Forcing use of shared memory for such variables makes that work properly.
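The problematic case described above can be sketched in a few lines of
OpenACC C. This is only an illustration loosely modelled on the kind of
test the commit adds; it is not copied from the testsuite, and the
function name count_in_gang_private is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper, not part of the commit.  W is declared inside the
   parallel region, so each gang has its own copy, but all workers of a
   gang are expected to update that one copy.  */
int
count_in_gang_private (int n)
{
  int result = -1;

  assert (n >= 0);

#pragma acc parallel num_gangs(1) num_workers(32) copyout(result)
  {
    int w = 0;  /* Gang-private: declared in the offload region.  */

#pragma acc loop worker
    for (int i = 0; i < n; i++)
      {
        /* Worker-partitioned mode: each increment must land in the
           single gang-private W, not in a per-worker copy.  */
#pragma acc atomic update
        w++;
      }

    /* Back in worker-single mode: all increments should be visible.  */
    result = w;
  }

  return result;
}
```

When built without OpenACC support the pragmas are ignored and the
function simply returns n. With offloading enabled, the same result is
only obtained if the workers really do share one copy of w, which is what
placing the variable in shared memory is meant to ensure.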
In terms of implementation, the parallelism level of a given loop is
not fixed until the oaccdevlow pass in the offload compiler, so the
patch delays fixing the parallelism level of variables declared on or
within such loops until the same point. This is done by adding a new
internal UNIQUE function (OACC_PRIVATE) that lists (the address of) each
private variable as an argument, and other arguments set so as to be able
to determine the correct parallelism level to use for the listed
variables. This new internal function fits into the existing scheme for
demarcating OpenACC loops, as described in comments in the patch.

Two new target hooks are introduced: TARGET_GOACC_ADJUST_PRIVATE_DECL and
TARGET_GOACC_EXPAND_VAR_DECL. The first can tweak a variable declaration
at oaccdevlow time, and the second at expand time. The first or both
of these target hooks can be used by a given offload target, depending
on its strategy for implementing private variables.
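As a rough mental model of how the first hook is consulted at oaccdevlow
time, the following toy sketch may help. It does not use GCC's tree or
target-hook types: every toy_* name is hypothetical, and the policy
"only gang-level variables go to shared memory" is an assumption made for
the example. The two details taken from the patch are that a level of -1
means the marker is skipped, and that an unset hook means nothing is
adjusted.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are GCC types.  */
enum toy_level { TOY_GANG = 0, TOY_WORKER = 1, TOY_VECTOR = 2 };
enum toy_addr_space { TOY_AS_GENERIC, TOY_AS_SHARED };

struct toy_decl
{
  const char *name;
  enum toy_addr_space as;
};

/* Shape of the hook in this model: given a declaration and its
   parallelism level, return the declaration to use from now on.  */
typedef struct toy_decl *(*toy_adjust_private_decl_fn) (struct toy_decl *,
                                                        int);

/* Example target policy (assumed): move gang-private variables to the
   shared address space and leave other levels alone.  */
struct toy_decl *
toy_target_adjust_private_decl (struct toy_decl *decl, int level)
{
  if (level == TOY_GANG)
    decl->as = TOY_AS_SHARED;
  return decl;
}

/* Process one private-variable marker listing N declarations.  Returns
   how many declarations the hook changed.  */
int
toy_lower_private_marker (toy_adjust_private_decl_fn hook, int level,
                          struct toy_decl **decls, size_t n)
{
  int adjusted = 0;

  assert (decls != NULL || n == 0);

  /* Level -1: parallelism level unknown, leave the marker alone.
     NULL hook: the target does not adjust private variables.  */
  if (level == -1 || hook == NULL)
    return 0;

  for (size_t i = 0; i < n; i++)
    {
      enum toy_addr_space old_as = decls[i]->as;
      struct toy_decl *new_decl = hook (decls[i], level);

      /* Record the decl as adjusted if the hook replaced it or changed
         its address space.  */
      if (new_decl != decls[i] || new_decl->as != old_as)
        {
          decls[i] = new_decl;
          adjusted++;
        }
    }

  return adjusted;
}
```

In this model a target that needs no special handling would simply leave
the hook unset, and a target that also needs expand-time handling would
provide a second, separate hook.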
This patch updates the TARGET_GOACC_ADJUST_PRIVATE_DECL target hook in
the AMD GCN backend to the current name and prototype. (An earlier
version of the hook was already present, but dormant.)

gcc/
PR middle-end/90115
* doc/tm.texi.in (TARGET_GOACC_EXPAND_VAR_DECL)
(TARGET_GOACC_ADJUST_PRIVATE_DECL): Add documentation hooks.
* doc/tm.texi: Regenerate.
* expr.c (expand_expr_real_1): Expand decls using the
expand_var_decl OpenACC hook if defined.
* internal-fn.c (expand_UNIQUE): Handle IFN_UNIQUE_OACC_PRIVATE.
* internal-fn.h (IFN_UNIQUE_CODES): Add OACC_PRIVATE.
* omp-low.c (omp_context): Add oacc_privatization_candidates
field.
(lower_oacc_reductions): Add PRIVATE_MARKER parameter. Insert
before fork.
(lower_oacc_head_tail): Add PRIVATE_MARKER parameter. Modify
private marker's gimple call arguments, and pass it to
lower_oacc_reductions.
(oacc_privatization_scan_clause_chain)
(oacc_privatization_scan_decl_chain, lower_oacc_private_marker):
New functions.
(lower_omp_for, lower_omp_target, lower_omp_1): Use these.
* omp-offload.c (convert.h): Include.
(oacc_loop_xform_head_tail): Treat private-variable markers like
fork/join when transforming head/tail sequences.
(struct var_decl_rewrite_info): Add struct.
(oacc_rewrite_var_decl, is_sync_builtin_call): New functions.
(execute_oacc_device_lower): Support rewriting gang-private
variables using target hook, and fix up addr_expr and var_decl
nodes afterwards.
* target.def (adjust_private_decl, expand_var_decl): New hooks.
* config/gcn/gcn-protos.h (gcn_goacc_adjust_gangprivate_decl):
Rename to...
(gcn_goacc_adjust_private_decl): ...this.
* config/gcn/gcn-tree.c (gcn_goacc_adjust_gangprivate_decl):
Rename to...
(gcn_goacc_adjust_private_decl): ...this. Add LEVEL parameter.
* config/gcn/gcn.c (TARGET_GOACC_ADJUST_GANGPRIVATE_DECL): Rename
definition using gcn_goacc_adjust_gangprivate_decl...
(TARGET_GOACC_ADJUST_PRIVATE_DECL): ...to this, using
gcn_goacc_adjust_private_decl.
* config/nvptx/nvptx.c (tree-pretty-print.h): Include.
(gang_private_shared_size): New global variable.
(gang_private_shared_align): Likewise.
(gang_private_shared_sym): Likewise.
(gang_private_shared_hmap): Likewise.
(nvptx_option_override): Initialize these.
(nvptx_file_end): Output gang_private_shared_sym.
(nvptx_goacc_adjust_private_decl, nvptx_goacc_expand_var_decl):
New functions.
(nvptx_set_current_function): Clear gang_private_shared_hmap.
(TARGET_GOACC_ADJUST_PRIVATE_DECL): Define hook.
(TARGET_GOACC_EXPAND_VAR_DECL): Likewise.

libgomp/
PR middle-end/90115
* testsuite/libgomp.oacc-c-c++-common/private-atomic-1-gang.c: New
test.
* testsuite/libgomp.oacc-fortran/private-atomic-1-gang.f90:
Likewise.
* testsuite/libgomp.oacc-fortran/private-atomic-1-worker.f90:
Likewise.

Co-Authored-By: Chung-Lin Tang <cltang@codesourcery.com>
Co-Authored-By: Thomas Schwinge <thomas@codesourcery.com>
Diffstat (limited to 'gcc/omp-offload.c')
-rw-r--r-- | gcc/omp-offload.c | 225 |
1 files changed, 224 insertions, 1 deletions
diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index 1612461..080bddd 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -53,6 +53,7 @@ along with GCC; see the file COPYING3. If not see
 #include "attribs.h"
 #include "cfgloop.h"
 #include "context.h"
+#include "convert.h"

 /* Describe the OpenACC looping structure of a function. The entire
    function is held in a 'NULL' loop. */
@@ -1357,7 +1358,9 @@ oacc_loop_xform_head_tail (gcall *from, int level)
             = ((enum ifn_unique_kind)
                TREE_INT_CST_LOW (gimple_call_arg (stmt, 0)));

-          if (k == IFN_UNIQUE_OACC_FORK || k == IFN_UNIQUE_OACC_JOIN)
+          if (k == IFN_UNIQUE_OACC_FORK
+              || k == IFN_UNIQUE_OACC_JOIN
+              || k == IFN_UNIQUE_OACC_PRIVATE)
             *gimple_call_arg_ptr (stmt, 2) = replacement;
           else if (k == kind && stmt != from)
             break;
@@ -1774,6 +1777,136 @@ default_goacc_reduction (gcall *call)
   gsi_replace_with_seq (&gsi, seq, true);
 }

+struct var_decl_rewrite_info
+{
+  gimple *stmt;
+  hash_map<tree, tree> *adjusted_vars;
+  bool avoid_pointer_conversion;
+  bool modified;
+};
+
+/* Helper function for execute_oacc_device_lower. Rewrite VAR_DECLs (by
+   themselves or wrapped in various other nodes) according to ADJUSTED_VARS in
+   the var_decl_rewrite_info pointed to via DATA. Used as part of coercing
+   gang-private variables in OpenACC offload regions to reside in GPU shared
+   memory. */
+
+static tree
+oacc_rewrite_var_decl (tree *tp, int *walk_subtrees, void *data)
+{
+  walk_stmt_info *wi = (walk_stmt_info *) data;
+  var_decl_rewrite_info *info = (var_decl_rewrite_info *) wi->info;
+
+  if (TREE_CODE (*tp) == ADDR_EXPR)
+    {
+      tree arg = TREE_OPERAND (*tp, 0);
+      tree *new_arg = info->adjusted_vars->get (arg);
+
+      if (new_arg)
+        {
+          if (info->avoid_pointer_conversion)
+            {
+              *tp = build_fold_addr_expr (*new_arg);
+              info->modified = true;
+              *walk_subtrees = 0;
+            }
+          else
+            {
+              gimple_stmt_iterator gsi = gsi_for_stmt (info->stmt);
+              tree repl = build_fold_addr_expr (*new_arg);
+              gimple *stmt1
+                = gimple_build_assign (make_ssa_name (TREE_TYPE (repl)), repl);
+              tree conv = convert_to_pointer (TREE_TYPE (*tp),
+                                              gimple_assign_lhs (stmt1));
+              gimple *stmt2
+                = gimple_build_assign (make_ssa_name (TREE_TYPE (*tp)), conv);
+              gsi_insert_before (&gsi, stmt1, GSI_SAME_STMT);
+              gsi_insert_before (&gsi, stmt2, GSI_SAME_STMT);
+              *tp = gimple_assign_lhs (stmt2);
+              info->modified = true;
+              *walk_subtrees = 0;
+            }
+        }
+    }
+  else if (TREE_CODE (*tp) == COMPONENT_REF || TREE_CODE (*tp) == ARRAY_REF)
+    {
+      tree *base = &TREE_OPERAND (*tp, 0);
+
+      while (TREE_CODE (*base) == COMPONENT_REF
+             || TREE_CODE (*base) == ARRAY_REF)
+        base = &TREE_OPERAND (*base, 0);
+
+      if (TREE_CODE (*base) != VAR_DECL)
+        return NULL;
+
+      tree *new_decl = info->adjusted_vars->get (*base);
+      if (!new_decl)
+        return NULL;
+
+      int base_quals = TYPE_QUALS (TREE_TYPE (*new_decl));
+      tree field = TREE_OPERAND (*tp, 1);
+
+      /* Adjust the type of the field. */
+      int field_quals = TYPE_QUALS (TREE_TYPE (field));
+      if (TREE_CODE (field) == FIELD_DECL && field_quals != base_quals)
+        {
+          tree *field_type = &TREE_TYPE (field);
+          while (TREE_CODE (*field_type) == ARRAY_TYPE)
+            field_type = &TREE_TYPE (*field_type);
+          field_quals |= base_quals;
+          *field_type = build_qualified_type (*field_type, field_quals);
+        }
+
+      /* Adjust the type of the component ref itself. */
+      tree comp_type = TREE_TYPE (*tp);
+      int comp_quals = TYPE_QUALS (comp_type);
+      if (TREE_CODE (*tp) == COMPONENT_REF && comp_quals != base_quals)
+        {
+          comp_quals |= base_quals;
+          TREE_TYPE (*tp)
+            = build_qualified_type (comp_type, comp_quals);
+        }
+
+      *base = *new_decl;
+      info->modified = true;
+    }
+  else if (TREE_CODE (*tp) == VAR_DECL)
+    {
+      tree *new_decl = info->adjusted_vars->get (*tp);
+      if (new_decl)
+        {
+          *tp = *new_decl;
+          info->modified = true;
+        }
+    }
+
+  return NULL_TREE;
+}
+
+/* Return TRUE if CALL is a call to a builtin atomic/sync operation. */
+
+static bool
+is_sync_builtin_call (gcall *call)
+{
+  tree callee = gimple_call_fndecl (call);
+
+  if (callee != NULL_TREE
+      && gimple_call_builtin_p (call, BUILT_IN_NORMAL))
+    switch (DECL_FUNCTION_CODE (callee))
+      {
+#undef DEF_SYNC_BUILTIN
+#define DEF_SYNC_BUILTIN(ENUM, NAME, TYPE, ATTRS) case ENUM:
+#include "sync-builtins.def"
+#undef DEF_SYNC_BUILTIN
+        return true;
+
+      default:
+        ;
+      }
+
+  return false;
+}
+
 /* Main entry point for oacc transformations which run on the device
    compiler after LTO, so we know what the target device is at this
    point (including the host fallback). */
@@ -1923,6 +2056,8 @@ execute_oacc_device_lower ()
      dominance information to update SSA. */
   calculate_dominance_info (CDI_DOMINATORS);

+  hash_map<tree, tree> adjusted_vars;
+
   /* Now lower internal loop functions to target-specific code
      sequences. */
   basic_block bb;
@@ -1999,6 +2134,45 @@ execute_oacc_device_lower ()
                   case IFN_UNIQUE_OACC_TAIL_MARK:
                     remove = true;
                     break;
+
+                  case IFN_UNIQUE_OACC_PRIVATE:
+                    {
+                      HOST_WIDE_INT level
+                        = TREE_INT_CST_LOW (gimple_call_arg (call, 2));
+                      if (level == -1)
+                        break;
+                      for (unsigned i = 3;
+                           i < gimple_call_num_args (call);
+                           i++)
+                        {
+                          tree arg = gimple_call_arg (call, i);
+                          gcc_checking_assert (TREE_CODE (arg) == ADDR_EXPR);
+                          tree decl = TREE_OPERAND (arg, 0);
+                          if (dump_file && (dump_flags & TDF_DETAILS))
+                            {
+                              static char const *const axes[] =
+                              /* Must be kept in sync with GOMP_DIM
+                                 enumeration. */
+                                { "gang", "worker", "vector" };
+                              fprintf (dump_file, "Decl UID %u has %s "
+                                       "partitioning:", DECL_UID (decl),
+                                       axes[level]);
+                              print_generic_decl (dump_file, decl, TDF_SLIM);
+                              fputc ('\n', dump_file);
+                            }
+                          if (targetm.goacc.adjust_private_decl)
+                            {
+                              tree oldtype = TREE_TYPE (decl);
+                              tree newdecl
+                                = targetm.goacc.adjust_private_decl (decl, level);
+                              if (TREE_TYPE (newdecl) != oldtype
+                                  || newdecl != decl)
+                                adjusted_vars.put (decl, newdecl);
+                            }
+                        }
+                      remove = true;
+                    }
+                    break;
                   }
                 break;
               }
@@ -2030,6 +2204,55 @@ execute_oacc_device_lower ()
           gsi_next (&gsi);
       }

+  /* Make adjustments to gang-private local variables if required by the
+     target, e.g. forcing them into a particular address space. Afterwards,
+     ADDR_EXPR nodes which have adjusted variables as their argument need to
+     be modified in one of two ways:
+
+       1. They can be recreated, making a pointer to the variable in the new
+          address space, or
+
+       2. The address of the variable in the new address space can be taken,
+          converted to the default (original) address space, and the result of
+          that conversion subsituted in place of the original ADDR_EXPR node.
+
+     Which of these is done depends on the gimple statement being processed.
+     At present atomic operations and inline asms use (1), and everything else
+     uses (2). At least on AMD GCN, there are atomic operations that work
+     directly in the LDS address space.
+
+     COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also rewritten to use
+     the new decl, adjusting types of appropriate tree nodes as necessary. */
+
+  if (targetm.goacc.adjust_private_decl)
+    {
+      FOR_ALL_BB_FN (bb, cfun)
+        for (gimple_stmt_iterator gsi = gsi_start_bb (bb);
+             !gsi_end_p (gsi);
+             gsi_next (&gsi))
+          {
+            gimple *stmt = gsi_stmt (gsi);
+            walk_stmt_info wi;
+            var_decl_rewrite_info info;
+
+            info.avoid_pointer_conversion
+              = (is_gimple_call (stmt)
+                 && is_sync_builtin_call (as_a <gcall *> (stmt)))
+                || gimple_code (stmt) == GIMPLE_ASM;
+            info.stmt = stmt;
+            info.modified = false;
+            info.adjusted_vars = &adjusted_vars;
+
+            memset (&wi, 0, sizeof (wi));
+            wi.info = &info;
+
+            walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi);
+
+            if (info.modified)
+              update_stmt (stmt);
+          }
+    }
+
   free_oacc_loop (loops);

   return 0;
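The is_sync_builtin_call helper in the diff above builds its case labels
by redefining DEF_SYNC_BUILTIN and including sync-builtins.def. For
readers unfamiliar with that idiom, here is a small self-contained sketch
of the same technique. It uses a list macro instead of a separate .def
file, and all toy_* and TOY_* names are hypothetical; only the rule
"atomic/sync operations and inline asm avoid the pointer conversion" is
taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical operation list standing in for "sync-builtins.def".  */
#define TOY_SYNC_OPS(X) \
  X (TOY_ATOMIC_LOAD)   \
  X (TOY_ATOMIC_STORE)  \
  X (TOY_ATOMIC_ADD)

enum toy_op
{
  TOY_PLAIN_LOAD,
  TOY_PLAIN_CALL,
  /* Expand the list once to declare the enumerators...  */
#define TOY_ENUM_ENTRY(NAME) NAME,
  TOY_SYNC_OPS (TOY_ENUM_ENTRY)
#undef TOY_ENUM_ENTRY
  TOY_OP_COUNT
};

/* ...and once more to generate a run of case labels that all fall
   through to the same "return true".  */
bool
toy_is_sync_op (enum toy_op op)
{
  switch (op)
    {
#define TOY_CASE_LABEL(NAME) case NAME:
      TOY_SYNC_OPS (TOY_CASE_LABEL)
#undef TOY_CASE_LABEL
      return true;

    default:
      return false;
    }
}

/* Mirrors the decision made per statement in the patch: sync/atomic
   operations and inline asm keep a pointer in the new address space;
   everything else gets a pointer converted back to the original one.  */
bool
toy_avoid_pointer_conversion (enum toy_op op, bool is_inline_asm)
{
  return toy_is_sync_op (op) || is_inline_asm;
}
```

The benefit of the idiom is that the list of operations lives in one
place, so a newly added sync operation is picked up by the predicate
without editing the switch.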