Diffstat (limited to 'llvm/docs/AMDGPUUsage.rst')
-rw-r--r-- | llvm/docs/AMDGPUUsage.rst | 59 |
1 files changed, 38 insertions, 21 deletions
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index c5b9bd9..d13f95b 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -677,7 +677,7 @@ the device used to execute the code match the features enabled when generating the code. A mismatch of features may result in incorrect execution, or a reduction in performance.
-The target features supported by each processor is listed in
+The target features supported by each processor are listed in
 :ref:`amdgpu-processors`. Target features are controlled by exactly one of the following Clang
@@ -783,7 +783,7 @@ description. The AMDGPU target specific information is: Is an AMDGPU processor or alternative processor name specified in :ref:`amdgpu-processor-table`. The non-canonical form target ID allows both the primary processor and alternative processor names. The canonical form
- target ID only allow the primary processor name.
+ target ID only allows the primary processor name.
 **target-feature** Is a target feature name specified in :ref:`amdgpu-target-features-table` that
@@ -793,7 +793,7 @@ description. The AMDGPU target specific information is: ``--offload-arch``. Each target feature must appear at most once in a target ID. The non-canonical form target ID allows the target features to be specified in any order. The canonical form target ID requires the target
- features to be specified in alphabetic order.
+ features to be specified in alphabetical order.
 .. _amdgpu-target-id-v2-v3:
@@ -886,7 +886,7 @@ supported for the ``amdgcn`` target. setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`). To convert between a private or group address space address (termed a segment
- address) and a flat address the base address of the corresponding aperture
+ address) and a flat address, the base address of the corresponding aperture
 can be used.
 For GFX7-GFX8 these are available in the :ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
@@ -1186,7 +1186,7 @@ The AMDGPU backend implements the following LLVM IR intrinsics. :ref:`llvm.stackrestore.p5 <int_stackrestore>` Implemented, must use the alloca address space. :ref:`llvm.get.fpmode.i32 <int_get_fpmode>` The natural floating-point mode type is i32. This
- implemented by extracting relevant bits out of the MODE
+ is implemented by extracting relevant bits out of the MODE
 register with s_getreg_b32. The first 10 bits are the core floating-point mode. Bits 12:18 are the exception mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not
@@ -1266,14 +1266,14 @@ The AMDGPU backend implements the following LLVM IR intrinsics. llvm.amdgcn.permlane16 Provides direct access to v_permlane16_b32. Performs arbitrary gather-style operation within a row (16 contiguous lanes) of the second input operand.
- The third and fourth inputs must be scalar values. these are combined into
+ The third and fourth inputs must be scalar values. These are combined into
 a single 64-bit value representing lane selects used to swizzle within each row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors. llvm.amdgcn.permlanex16 Provides direct access to v_permlanex16_b32. Performs arbitrary gather-style operation across two rows of the second input operand (each row is 16 contiguous
- lanes). The third and fourth inputs must be scalar values. these are combined
+ lanes). The third and fourth inputs must be scalar values. These are combined
 into a single 64-bit value representing lane selects used to swizzle within each row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.
@@ -1285,31 +1285,31 @@ The AMDGPU backend implements the following LLVM IR intrinsics. 32-bit vectors. llvm.amdgcn.udot2 Provides direct access to v_dot2_u32_u16 across targets which
- support such instructions. This performs unsigned dot product
+ support such instructions. This performs an unsigned dot product
 with two v2i16 operands, summed with the third i32 operand. The i1 fourth operand is used to clamp the output. llvm.amdgcn.udot4 Provides direct access to v_dot4_u32_u8 across targets which
- support such instructions. This performs unsigned dot product
+ support such instructions. This performs an unsigned dot product
 with two i32 operands (holding a vector of 4 8bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output. llvm.amdgcn.udot8 Provides direct access to v_dot8_u32_u4 across targets which
- support such instructions. This performs unsigned dot product
+ support such instructions. This performs an unsigned dot product
 with two i32 operands (holding a vector of 8 4bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output. llvm.amdgcn.sdot2 Provides direct access to v_dot2_i32_i16 across targets which
- support such instructions. This performs signed dot product
+ support such instructions. This performs a signed dot product
 with two v2i16 operands, summed with the third i32 operand. The i1 fourth operand is used to clamp the output. When applicable (e.g. no clamping), this is lowered into v_dot2c_i32_i16 for targets which support it. llvm.amdgcn.sdot4 Provides direct access to v_dot4_i32_i8 across targets which
- support such instructions. This performs signed dot product
+ support such instructions. This performs a signed dot product
 with two i32 operands (holding a vector of 4 8bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output.
@@ -1321,7 +1321,7 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
 of this instruction for gfx11 targets. llvm.amdgcn.sdot8 Provides direct access to v_dot8_u32_u4 across targets which
- support such instructions. This performs signed dot product
+ support such instructions. This performs a signed dot product
 with two i32 operands (holding a vector of 8 4bit values), summed with the third i32 operand. The i1 fourth operand is used to clamp the output.
@@ -1401,7 +1401,7 @@ The AMDGPU backend implements the following LLVM IR intrinsics. llvm.amdgcn.atomic.cond.sub.u32 Provides direct access to flat_atomic_cond_sub_u32, global_atomic_cond_sub_u32 and ds_cond_sub_u32 based on address space on gfx12 targets. This
- performs subtraction only if the memory value is greater than or
+ performs a subtraction only if the memory value is greater than or
 equal to the data value. llvm.amdgcn.s.barrier.signal.isfirst Provides access to the s_barrier_signal_first instruction;
@@ -1646,7 +1646,7 @@ The AMDGPU backend supports the following LLVM IR attributes. llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint attributes, the queue pointer may be required in situations where the intrinsic call does not directly appear in the program. Some subtargets
- require the queue pointer for to handle some addrspacecasts, as well
+ require the queue pointer to handle some addrspacecasts, as well
 as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and llvm.debug intrinsics.
@@ -1844,6 +1844,20 @@ The AMDGPU backend supports the following calling conventions: ..TODO:: Describe.
+ ``amdgpu_gfx_whole_wave`` Used for AMD graphics targets. Functions with this calling convention
+ cannot be used as entry points. They must have an i1 as the first argument,
+ which will be mapped to the value of EXEC on entry into the function. Other
+ arguments will contain poison in their inactive lanes. Similarly, the return
+ value for the inactive lanes is poison.
+
+ The function will run with all lanes enabled, i.e. EXEC will be set to -1 in the
+ prologue and restored to its original value in the epilogue. The inactive lanes
+ will be preserved for all the registers used by the function. Active lanes only
+ will only be preserved for the callee saved registers.
+
+ In all other respects, functions with this calling convention behave like
+ ``amdgpu_gfx`` functions.
+
 ``amdgpu_gs`` Used for Mesa/AMDPAL geometry shaders. ..TODO:: Describe.
@@ -1933,7 +1947,7 @@ The following describes all emitted function resource usage symbols: callees, contains an indirect call ===================================== ========= ========================================= ===============================================================================
-Futhermore, three symbols are additionally emitted describing the compilation
+Furthermore, three symbols are additionally emitted describing the compilation
 unit's worst case (i.e, maxima) ``num_vgpr``, ``num_agpr``, and ``numbered_sgpr`` which may be referenced and used by the aforementioned symbolic expressions. These three symbols are ``amdgcn.max_num_vgpr``,
@@ -6344,10 +6358,13 @@ also have to wait on all global memory operations, which is unnecessary. :doc:`Memory Model Relaxation Annotations <MemoryModelRelaxationAnnotations>` can be used as an optimization hint for fences to solve this problem.
-The AMDGPU backend recognizes the following tags on fences:
+The AMDGPU backend recognizes the following tags on fences to control which address
+space a fence can synchronize:
+
+- ``amdgpu-synchronize-as:local`` - for the local address space
+- ``amdgpu-synchronize-as:global``- for the global address space
-- ``amdgpu-as:local`` - fence only the local address space
-- ``amdgpu-as:global``- fence only the global address space
+Multiple tags can be used at the same time to synchronize with more than one address space.
 .. note::
@@ -17934,7 +17951,7 @@ set architecture (ISA) version of the assembly program.
 "AMD" and *arch* should always be equal to "AMDGPU". By default, the assembler will derive the ISA version, *vendor*, and *arch*
-from the value of the -mcpu option that is passed to the assembler.
+from the value of the ``-mcpu`` option that is passed to the assembler.
 .. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
@@ -17958,7 +17975,7 @@ default value for all keys is 0, with the following exceptions:
 - *amd_kernel_code_version_minor* defaults to 2.
 - *amd_machine_kind* defaults to 1.
 - *amd_machine_version_major*, *machine_version_minor*, and
- *amd_machine_version_stepping* are derived from the value of the -mcpu option
+ *amd_machine_version_stepping* are derived from the value of the ``-mcpu`` option
 that is passed to the assembler.
 - *kernel_code_entry_byte_offset* defaults to 256.
 - *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards