path: root/llvm/docs
Diffstat (limited to 'llvm/docs')
-rw-r--r--  llvm/docs/AMDGPUUsage.rst  50
-rw-r--r--  llvm/docs/CIBestPractices.rst  28
-rw-r--r--  llvm/docs/CommandGuide/llvm-objdump.rst  26
-rw-r--r--  llvm/docs/DirectX/RootSignatures.rst  245
-rw-r--r--  llvm/docs/DirectXUsage.rst  1
-rw-r--r--  llvm/docs/GettingStarted.rst  42
-rw-r--r--  llvm/docs/LangRef.rst  106
-rw-r--r--  llvm/docs/NVPTXUsage.rst  161
-rw-r--r--  llvm/docs/ProgrammersManual.rst  38
-rw-r--r--  llvm/docs/ReleaseNotes.md  11
-rw-r--r--  llvm/docs/TestingGuide.rst  4
-rw-r--r--  llvm/docs/tutorial/MyFirstLanguageFrontend/LangImpl04.rst  2
12 files changed, 580 insertions, 134 deletions
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index c5b9bd9..e46437a 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -677,7 +677,7 @@ the device used to execute the code match the features enabled when
generating the code. A mismatch of features may result in incorrect
execution, or a reduction in performance.
-The target features supported by each processor is listed in
+The target features supported by each processor are listed in
:ref:`amdgpu-processors`.
Target features are controlled by exactly one of the following Clang
@@ -783,7 +783,7 @@ description. The AMDGPU target specific information is:
Is an AMDGPU processor or alternative processor name specified in
:ref:`amdgpu-processor-table`. The non-canonical form target ID allows both
the primary processor and alternative processor names. The canonical form
- target ID only allow the primary processor name.
+ target ID only allows the primary processor name.
**target-feature**
Is a target feature name specified in :ref:`amdgpu-target-features-table` that
@@ -793,7 +793,7 @@ description. The AMDGPU target specific information is:
``--offload-arch``. Each target feature must appear at most once in a target
ID. The non-canonical form target ID allows the target features to be
specified in any order. The canonical form target ID requires the target
- features to be specified in alphabetic order.
+ features to be specified in alphabetical order.
.. _amdgpu-target-id-v2-v3:
@@ -886,7 +886,7 @@ supported for the ``amdgcn`` target.
setup (see :ref:`amdgpu-amdhsa-kernel-prolog-m0`).
To convert between a private or group address space address (termed a segment
- address) and a flat address the base address of the corresponding aperture
+ address) and a flat address, the base address of the corresponding aperture
can be used. For GFX7-GFX8 these are available in the
:ref:`amdgpu-amdhsa-hsa-aql-queue` the address of which can be obtained with
Queue Ptr SGPR (see :ref:`amdgpu-amdhsa-initial-kernel-execution-state`). For
@@ -1186,7 +1186,7 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
:ref:`llvm.stackrestore.p5 <int_stackrestore>` Implemented, must use the alloca address space.
:ref:`llvm.get.fpmode.i32 <int_get_fpmode>` The natural floating-point mode type is i32. This
- implemented by extracting relevant bits out of the MODE
+ is implemented by extracting relevant bits out of the MODE
register with s_getreg_b32. The first 10 bits are the
core floating-point mode. Bits 12:18 are the exception
mask. On gfx9+, bit 23 is FP16_OVFL. Bitfields not
@@ -1266,14 +1266,14 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
llvm.amdgcn.permlane16 Provides direct access to v_permlane16_b32. Performs arbitrary gather-style
operation within a row (16 contiguous lanes) of the second input operand.
- The third and fourth inputs must be scalar values. these are combined into
+ The third and fourth inputs must be scalar values. These are combined into
a single 64-bit value representing lane selects used to swizzle within each
row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>,
<2 x half>, <2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.
llvm.amdgcn.permlanex16 Provides direct access to v_permlanex16_b32. Performs arbitrary gather-style
operation across two rows of the second input operand (each row is 16 contiguous
- lanes). The third and fourth inputs must be scalar values. these are combined
+ lanes). The third and fourth inputs must be scalar values. These are combined
into a single 64-bit value representing lane selects used to swizzle within each
row. Currently implemented for i16, i32, float, half, bfloat, <2 x i16>, <2 x half>,
<2 x bfloat>, i64, double, pointers, multiples of the 32-bit vectors.
@@ -1285,31 +1285,31 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
32-bit vectors.
llvm.amdgcn.udot2 Provides direct access to v_dot2_u32_u16 across targets which
- support such instructions. This performs unsigned dot product
+ support such instructions. This performs an unsigned dot product
with two v2i16 operands, summed with the third i32 operand. The
i1 fourth operand is used to clamp the output.
llvm.amdgcn.udot4 Provides direct access to v_dot4_u32_u8 across targets which
- support such instructions. This performs unsigned dot product
+ support such instructions. This performs an unsigned dot product
with two i32 operands (holding a vector of 4 8bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp
the output.
llvm.amdgcn.udot8 Provides direct access to v_dot8_u32_u4 across targets which
- support such instructions. This performs unsigned dot product
+ support such instructions. This performs an unsigned dot product
with two i32 operands (holding a vector of 8 4bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp
the output.
llvm.amdgcn.sdot2 Provides direct access to v_dot2_i32_i16 across targets which
- support such instructions. This performs signed dot product
+ support such instructions. This performs a signed dot product
with two v2i16 operands, summed with the third i32 operand. The
i1 fourth operand is used to clamp the output.
When applicable (e.g. no clamping), this is lowered into
v_dot2c_i32_i16 for targets which support it.
llvm.amdgcn.sdot4 Provides direct access to v_dot4_i32_i8 across targets which
- support such instructions. This performs signed dot product
+ support such instructions. This performs a signed dot product
with two i32 operands (holding a vector of 4 8bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp
the output.
@@ -1321,7 +1321,7 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
of this instruction for gfx11 targets.
llvm.amdgcn.sdot8 Provides direct access to v_dot8_u32_u4 across targets which
- support such instructions. This performs signed dot product
+ support such instructions. This performs a signed dot product
with two i32 operands (holding a vector of 8 4bit values), summed
with the third i32 operand. The i1 fourth operand is used to clamp
the output.
@@ -1401,7 +1401,7 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
llvm.amdgcn.atomic.cond.sub.u32 Provides direct access to flat_atomic_cond_sub_u32, global_atomic_cond_sub_u32
and ds_cond_sub_u32 based on address space on gfx12 targets. This
- performs subtraction only if the memory value is greater than or
+ performs a subtraction only if the memory value is greater than or
equal to the data value.
llvm.amdgcn.s.barrier.signal.isfirst Provides access to the s_barrier_signal_first instruction;
@@ -1646,7 +1646,7 @@ The AMDGPU backend supports the following LLVM IR attributes.
llvm.amdgcn.queue.ptr intrinsic. Note that unlike the other ABI hint
attributes, the queue pointer may be required in situations where the
intrinsic call does not directly appear in the program. Some subtargets
- require the queue pointer for to handle some addrspacecasts, as well
+ require the queue pointer to handle some addrspacecasts, as well
as the llvm.amdgcn.is.shared, llvm.amdgcn.is.private, llvm.trap, and
llvm.debug intrinsics.
@@ -1844,6 +1844,20 @@ The AMDGPU backend supports the following calling conventions:
..TODO::
Describe.
+ ``amdgpu_gfx_whole_wave`` Used for AMD graphics targets. Functions with this calling convention
+ cannot be used as entry points. They must have an i1 as the first argument,
+ which will be mapped to the value of EXEC on entry into the function. Other
+ arguments will contain poison in their inactive lanes. Similarly, the return
+ value for the inactive lanes is poison.
+
+ The function will run with all lanes enabled, i.e. EXEC will be set to -1 in the
+ prologue and restored to its original value in the epilogue. The inactive lanes
+ will be preserved for all the registers used by the function. Active lanes
+ will only be preserved for the callee-saved registers.
+
+ In all other respects, functions with this calling convention behave like
+ ``amdgpu_gfx`` functions.
+
``amdgpu_gs`` Used for Mesa/AMDPAL geometry shaders.
..TODO::
Describe.
@@ -1933,7 +1947,7 @@ The following describes all emitted function resource usage symbols:
callees, contains an indirect call
===================================== ========= ========================================= ===============================================================================
-Futhermore, three symbols are additionally emitted describing the compilation
+Furthermore, three symbols are additionally emitted describing the compilation
unit's worst case (i.e, maxima) ``num_vgpr``, ``num_agpr``, and
``numbered_sgpr`` which may be referenced and used by the aforementioned
symbolic expressions. These three symbols are ``amdgcn.max_num_vgpr``,
@@ -17934,7 +17948,7 @@ set architecture (ISA) version of the assembly program.
"AMD" and *arch* should always be equal to "AMDGPU".
By default, the assembler will derive the ISA version, *vendor*, and *arch*
-from the value of the -mcpu option that is passed to the assembler.
+from the value of the ``-mcpu`` option that is passed to the assembler.
.. _amdgpu-amdhsa-assembler-directive-amdgpu_hsa_kernel:
@@ -17958,7 +17972,7 @@ default value for all keys is 0, with the following exceptions:
- *amd_kernel_code_version_minor* defaults to 2.
- *amd_machine_kind* defaults to 1.
- *amd_machine_version_major*, *machine_version_minor*, and
- *amd_machine_version_stepping* are derived from the value of the -mcpu option
+ *amd_machine_version_stepping* are derived from the value of the ``-mcpu`` option
that is passed to the assembler.
- *kernel_code_entry_byte_offset* defaults to 256.
- *wavefront_size* defaults 6 for all targets before GFX10. For GFX10 onwards
diff --git a/llvm/docs/CIBestPractices.rst b/llvm/docs/CIBestPractices.rst
index 71fdd12..8301b95 100644
--- a/llvm/docs/CIBestPractices.rst
+++ b/llvm/docs/CIBestPractices.rst
@@ -108,3 +108,31 @@ If specific jobs within the workflow need additional permissions, those
permissions should be added within the specific job. This practice locks down
all permissions by default and only enables them when needed, better enforcing
the principle of least privilege.
+
+Ensuring Workflows Run on the Correct Events
+--------------------------------------------
+
+GitHub allows workflows to run on a multitude of events, and it is important to
+configure a workflow so that it triggers on the correct events. There are
+two main best practices around events that trigger workflows:
+
+1. Workflows that are designed to run on pull requests should not be
+restricted by target branch. Unnecessarily restricting the target branch
+will prevent stacked PRs from being tested. ``pull_request`` events should
+not contain a ``branches`` key.
+
+2. Workflows that are designed to also trigger on push events (e.g., for
+testing on ``main`` or one of the release branches) need to be restricted by
+branch. While pushes to a fork will not trigger a workflow run due to the
+``push`` event if the workflow already has its jobs disabled in forks
+(described above), stacked PRs will end up running jobs twice if the ``push``
+event does not have any branch restrictions. ``push`` events should have
+their branches restricted at the very least to ``main`` and the release
+branches as follows:
+
+.. code-block:: yaml
+
+ push:
+ branches:
+ - main
+ - releases/*
diff --git a/llvm/docs/CommandGuide/llvm-objdump.rst b/llvm/docs/CommandGuide/llvm-objdump.rst
index c9f0379..aaf38f8 100644
--- a/llvm/docs/CommandGuide/llvm-objdump.rst
+++ b/llvm/docs/CommandGuide/llvm-objdump.rst
@@ -140,23 +140,29 @@ OPTIONS
debug information for stripped binaries. Multiple instances of this argument
are searched in the order given.
-.. option:: --debuginfod, --no-debuginfod
+.. option:: --debug-indent=<width>
- Whether or not to try debuginfod lookups for debug binaries. Unless specified,
- debuginfod is only enabled if libcurl was compiled in (``LLVM_ENABLE_CURL``)
- and at least one server URL was provided by the environment variable
- ``DEBUGINFOD_URLS``.
+ Distance to indent the source-level variable or inlined function display,
+ relative to the start of the disassembly. Defaults to 52 characters.
+
+.. option:: --debug-inlined-funcs[=<format>]
-.. option:: --debug-vars=<format>
+ Print the locations of inlined functions alongside disassembly.
+ ``format`` may be ``ascii``, ``limits-only``, or ``unicode``, defaulting to
+ ``unicode`` if omitted.
+
+.. option:: --debug-vars[=<format>]
Print the locations (in registers or memory) of source-level variables
- alongside disassembly. ``format`` may be ``unicode`` or ``ascii``, defaulting
+ alongside disassembly. ``format`` may be ``ascii`` or ``unicode``, defaulting
to ``unicode`` if omitted.
-.. option:: --debug-vars-indent=<width>
+.. option:: --debuginfod, --no-debuginfod
- Distance to indent the source-level variable display, relative to the start
- of the disassembly. Defaults to 52 characters.
+ Whether or not to try debuginfod lookups for debug binaries. Unless specified,
+ debuginfod is only enabled if libcurl was compiled in (``LLVM_ENABLE_CURL``)
+ and at least one server URL was provided by the environment variable
+ ``DEBUGINFOD_URLS``.
.. option:: -j, --section=<section1[,section2,...]>
diff --git a/llvm/docs/DirectX/RootSignatures.rst b/llvm/docs/DirectX/RootSignatures.rst
new file mode 100644
index 0000000..e328b4a
--- /dev/null
+++ b/llvm/docs/DirectX/RootSignatures.rst
@@ -0,0 +1,245 @@
+===============
+Root Signatures
+===============
+
+.. contents::
+ :local:
+
+.. toctree::
+ :hidden:
+
+Overview
+========
+
+A root signature is used to describe what resources a shader needs access to
+and how they're organized and bound in the pipeline. The DirectX Container
+(DXContainer) contains a root signature part (RTS0), which stores this
+information in a binary format. To assist with constructing and interacting
+with a root signature, it is represented as metadata
+(``dx.rootsignatures``) in the LLVM IR. The metadata can then be converted to
+its binary form, as defined in
+`llvm/include/llvm/Frontend/HLSL/RootSignatureMetadata.h
+<https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Frontend/HLSL/RootSignatureMetadata.h>`_.
+This document serves as a reference for the metadata representation of a root
+signature for users to interface with.
+
+Metadata Representation
+=======================
+
+Consider the following reference root signature; the subsequent sections
+describe its metadata representation and the corresponding operands.
+
+.. code-block:: HLSL
+
+ RootFlags(ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT),
+ RootConstants(b0, space = 1, num32Constants = 3),
+ CBV(b1, flags = 0),
+ StaticSampler(
+ filter = FILTER_MIN_MAG_POINT_MIP_LINEAR,
+ addressU = TEXTURE_ADDRESS_BORDER,
+ ),
+ DescriptorTable(
+ visibility = VISIBILITY_ALL,
+ SRV(t0, flags = DATA_STATIC_WHILE_SET_AT_EXECUTE),
+ UAV(
+ numDescriptors = 5, u1, space = 10, offset = 5,
+ flags = DATA_VOLATILE
+ )
+ )
+
+.. note::
+
+ A root signature does not necessarily have a unique metadata representation.
+ Further, a malformed root signature can be represented in the metadata format
+ (e.g., mixing Sampler and non-Sampler descriptor ranges), so it is the
+ user's responsibility to verify that it is a well-formed root signature.
+
+Named Root Signature Table
+==========================
+
+.. code-block:: LLVM
+
+ !dx.rootsignatures = !{!0}
+
+A named metadata node, ``dx.rootsignatures``, is used to identify the root
+signature table. The table itself is a list of references to function/root
+signature pairs.
+
+Function/Root Signature Pair
+============================
+
+.. code-block:: LLVM
+
+ !1 = !{ptr @main, !2, i32 2 }
+
+The function/root signature pair associates a function (the first operand) with
+a reference to a root signature (the second operand). The third operand is the
+root signature version, which is used for validation logic and the binary
+format.
+
+Root Signature
+==============
+
+.. code-block:: LLVM
+
+ !2 = !{ !3, !4, !5, !6, !7 }
+
+The root signature itself simply consists of a list of references to its root
+signature elements.
+
+Root Signature Element
+======================
+
+A root signature element is identified by the first operand, which is a string.
+The following root signature elements are defined:
+
+================= ======================
+Identifier String Root Signature Element
+================= ======================
+"RootFlags" Root Flags
+"RootConstants" Root Constants
+"RootCBV" Root Descriptor
+"RootSRV" Root Descriptor
+"RootUAV" Root Descriptor
+"StaticSampler" Static Sampler
+"DescriptorTable" Descriptor Table
+================= ======================
+
+The representation of each type of root signature element is listed below.
+
+Root Flags
+==========
+
+.. code-block:: LLVM
+
+ !3 = { !"RootFlags", i32 1 }
+
+======================= ====
+Description             Type
+======================= ====
+`Root Signature Flags`_ i32
+======================= ====
+
+.. _Root Signature Flags: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_root_signature_flags
+
+Root Constants
+==============
+
+.. code-block:: LLVM
+
+ !4 = { !"RootConstants", i32 0, i32 1, i32 2, i32 3 }
+
+==================== ====
+Description          Type
+==================== ====
+`Shader Visibility`_ i32
+Shader Register      i32
+Register Space       i32
+Number 32-bit Values i32
+==================== ====
+
+.. _Shader Visibility: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_shader_visibility
+
+Root Descriptor
+===============
+
+As noted in the table above, the first operand will denote the type of
+root descriptor.
+
+.. code-block:: LLVM
+
+ !5 = { !"RootCBV", i32 0, i32 1, i32 0, i32 0 }
+
+======================== ====
+Description              Type
+======================== ====
+`Shader Visibility`_     i32
+Shader Register          i32
+Register Space           i32
+`Root Descriptor Flags`_ i32
+======================== ====
+
+.. _Root Descriptor Flags: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_root_descriptor_flags
+
+Static Sampler
+==============
+
+.. code-block:: LLVM
+
+ !6 = !{ !"StaticSampler", i32 1, i32 4, ... }; remaining operands omitted for space
+
+==================== =====
+Description          Type
+==================== =====
+`Filter`_            i32
+`AddressU`_          i32
+`AddressV`_          i32
+`AddressW`_          i32
+MipLODBias           float
+MaxAnisotropy        i32
+`ComparisonFunc`_    i32
+`BorderColor`_       i32
+MinLOD               float
+MaxLOD               float
+ShaderRegister       i32
+RegisterSpace        i32
+`Shader Visibility`_ i32
+==================== =====
+
+.. _Filter: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_filter
+.. _AddressU: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_texture_address_mode
+.. _AddressV: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_texture_address_mode
+.. _AddressW: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_texture_address_mode
+.. _ComparisonFunc: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_comparison_func
+.. _BorderColor: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_static_border_color
+
+Descriptor Table
+================
+
+A descriptor table consists of a visibility (the first operand); the remaining
+operands are a list of references to its descriptor ranges.
+
+.. note::
+
+ The term Descriptor Table Clause is synonymous with Descriptor Range when
+ referencing the implementation details.
+
+.. code-block:: LLVM
+
+ !7 = { !"DescriptorTable", i32 0, !8, !9 }
+
+========================= ================
+Description               Type
+========================= ================
+`Shader Visibility`_      i32
+Descriptor Range Elements Descriptor Range
+========================= ================
+
+
+Descriptor Range
+================
+
+Similar to a root descriptor, the first operand will denote the type of
+descriptor range. It is one of the following types:
+
+- "CBV"
+- "SRV"
+- "UAV"
+- "Sampler"
+
+.. code-block:: LLVM
+
+ !8 = !{ !"SRV", i32 1, i32 0, i32 0, i32 -1, i32 4 }
+ !9 = !{ !"UAV", i32 5, i32 1, i32 10, i32 5, i32 2 }
+
+============================== ====
+Description                    Type
+============================== ====
+Number of Descriptors in Range i32
+Shader Register                i32
+Register Space                 i32
+`Offset`_                      i32
+`Descriptor Range Flags`_      i32
+============================== ====
+
+.. _Offset: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_descriptor_range
+.. _Descriptor Range Flags: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_descriptor_range_flags
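+
+Complete Example
+================
+
+Tying the elements together, the following sketch shows one possible metadata
+encoding of the reference root signature above. The metadata node numbering is
+arbitrary, the default shader visibility is assumed to encode as ``0``, and the
+static sampler's remaining operands are elided as in the earlier example.
+
+.. code-block:: LLVM
+
+  !dx.rootsignatures = !{!1}
+
+  !1 = !{ ptr @main, !2, i32 2 }
+  !2 = !{ !3, !4, !5, !6, !7 }
+  !3 = !{ !"RootFlags", i32 1 }                          ; ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT
+  !4 = !{ !"RootConstants", i32 0, i32 0, i32 1, i32 3 } ; b0, space = 1, num32Constants = 3
+  !5 = !{ !"RootCBV", i32 0, i32 1, i32 0, i32 0 }       ; b1, flags = 0
+  !6 = !{ !"StaticSampler", i32 1, i32 4, ... }          ; remaining operands omitted for space
+  !7 = !{ !"DescriptorTable", i32 0, !8, !9 }            ; visibility = VISIBILITY_ALL
+  !8 = !{ !"SRV", i32 1, i32 0, i32 0, i32 -1, i32 4 }   ; t0
+  !9 = !{ !"UAV", i32 5, i32 1, i32 10, i32 5, i32 2 }   ; u1, space = 10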
diff --git a/llvm/docs/DirectXUsage.rst b/llvm/docs/DirectXUsage.rst
index 4d8f49b..1d964e6 100644
--- a/llvm/docs/DirectXUsage.rst
+++ b/llvm/docs/DirectXUsage.rst
@@ -17,6 +17,7 @@ User Guide for the DirectX Target
DirectX/DXILArchitecture
DirectX/DXILOpTableGenDesign
DirectX/DXILResources
+ DirectX/RootSignatures
Introduction
============
diff --git a/llvm/docs/GettingStarted.rst b/llvm/docs/GettingStarted.rst
index 372fd40..3036dae 100644
--- a/llvm/docs/GettingStarted.rst
+++ b/llvm/docs/GettingStarted.rst
@@ -12,7 +12,7 @@ Welcome to the LLVM project!
The LLVM project has multiple components. The core of the project is
itself called "LLVM". This contains all of the tools, libraries, and header
-files needed to process intermediate representations and converts it into
+files needed to process intermediate representations and convert them into
object files. Tools include an assembler, disassembler, bitcode analyzer, and
bitcode optimizer. It also contains basic regression tests.
@@ -32,11 +32,11 @@ Getting the Source Code and Building LLVM
#. Check out LLVM (including subprojects like Clang):
* ``git clone https://github.com/llvm/llvm-project.git``
- * Or, on windows:
+ * Or, on Windows:
``git clone --config core.autocrlf=false
https://github.com/llvm/llvm-project.git``
- * To save storage and speed-up the checkout time, you may want to do a
+ * To save storage and speed up the checkout time, you may want to do a
`shallow clone <https://git-scm.com/docs/git-clone#Documentation/git-clone.txt---depthltdepthgt>`_.
For example, to get the latest revision of the LLVM project, use
@@ -71,7 +71,7 @@ Getting the Source Code and Building LLVM
Some common options:
- * ``-DLLVM_ENABLE_PROJECTS='...'`` --- semicolon-separated list of the LLVM
+ * ``-DLLVM_ENABLE_PROJECTS='...'`` --- A semicolon-separated list of the LLVM
subprojects you'd like to additionally build. Can include any of: clang,
clang-tools-extra, lldb, lld, polly, or cross-project-tests.
@@ -82,10 +82,10 @@ Getting the Source Code and Building LLVM
pathname of where you want the LLVM tools and libraries to be installed
(default ``/usr/local``).
- * ``-DCMAKE_BUILD_TYPE=type`` --- Controls optimization level and debug
+ * ``-DCMAKE_BUILD_TYPE=type`` --- Controls the optimization level and debug
information of the build. Valid options for *type* are ``Debug``,
``Release``, ``RelWithDebInfo``, and ``MinSizeRel``. For more detailed
- information see :ref:`CMAKE_BUILD_TYPE <cmake_build_type>`.
+ information, see :ref:`CMAKE_BUILD_TYPE <cmake_build_type>`.
* ``-DLLVM_ENABLE_ASSERTIONS=ON`` --- Compile with assertion checks enabled
(default is ON for Debug builds, OFF for all other build types).
@@ -124,7 +124,7 @@ Getting the Source Code and Building LLVM
``ninja -C build check-llvm``
- This will setup an LLVM build with debugging info, then compile LLVM and
+ This will set up an LLVM build with debugging info, then compile LLVM and
run LLVM tests.
* For more detailed information on CMake options, see `CMake <CMake.html>`__
@@ -150,7 +150,7 @@ page.
For stand-alone builds, you must have an llvm install that is configured
properly to be consumable by stand-alone builds of the other projects.
-This could be a distro provided LLVM install, or you can build it yourself,
+This could be a distro-provided LLVM install, or you can build it yourself,
like this:
.. code-block:: console
@@ -195,7 +195,7 @@ clang clang, cmake CLANG_INCLUDE_TESTS=ON (Required for check
lld lld, cmake
============ ======================== ======================
-Example for building stand-alone `clang`:
+Example of building stand-alone `clang`:
.. code-block:: console
@@ -224,7 +224,7 @@ Example for building stand-alone `clang`:
Requirements
============
-Before you begin to use the LLVM system, review the requirements given below.
+Before you begin to use the LLVM system, review the requirements below.
This may save you some trouble by knowing ahead of time what hardware and
software you will need.
@@ -265,7 +265,7 @@ Windows on Arm ARM64 Visual Studio, Clang\ :sup:`4`
#. Code generation supported for Pentium processors and up
#. Code generation supported for 32-bit ABI only
- #. To use LLVM modules on Win32-based system, you may configure LLVM
+ #. To use LLVM modules on a Win32-based system, you may configure LLVM
with ``-DBUILD_SHARED_LIBS=On``.
#. Visual Studio alone can compile LLVM. When using Clang, you
must also have Visual Studio installed.
@@ -309,7 +309,7 @@ Package Version Notes
#. Only needed if you want to run the automated test suite in the
``llvm/test`` directory, or if you plan to utilize any Python libraries,
utilities, or bindings.
- #. Optional, adds compression / uncompression capabilities to selected LLVM
+ #. Optional, adds compression/uncompression capabilities to selected LLVM
tools.
#. Optional, you can use any other build tool supported by CMake.
#. Only needed when building libc with New Headergen. Mainly used by libc.
@@ -401,11 +401,11 @@ Studio 2019 (or later), or a recent version of mingw64. FreeBSD 10.0 and newer
have a modern Clang as the system compiler.
However, some Linux distributions and some other or older BSDs sometimes have
-extremely old versions of GCC. These steps attempt to help you upgrade you
+extremely old versions of GCC. These steps attempt to help you upgrade your
compiler even on such a system. However, if at all possible, we encourage you
to use a recent version of a distribution with a modern system compiler that
meets these requirements. Note that it is tempting to install a prior
-version of Clang and libc++ to be the host compiler, however libc++ was not
+version of Clang and libc++ to be the host compiler; however, libc++ was not
well tested or set up to build on Linux until relatively recently. As
a consequence, this guide suggests just using libstdc++ and a modern GCC as the
initial host in a bootstrap, and then using Clang (and potentially libc++).
@@ -514,11 +514,11 @@ appropriate pathname on your local system. All these paths are absolute:
``SRC_ROOT``
- This is the top level directory of the LLVM source tree.
+ This is the top-level directory of the LLVM source tree.
``OBJ_ROOT``
- This is the top level directory of the LLVM object tree (i.e. the tree where
+ This is the top-level directory of the LLVM object tree (i.e. the tree where
object files and compiled programs will be placed. It can be the same as
SRC_ROOT).
@@ -666,7 +666,7 @@ cross-compiling CMake provides a variable ``CMAKE_TOOLCHAIN_FILE`` which can
define compiler flags and variables used during the CMake test operations.
The result of such a build is executables that are not runnable on the build
-host but can be executed on the target. As an example the following CMake
+host but can be executed on the target. As an example, the following CMake
invocation can generate build files targeting iOS. This will work on macOS
with the latest Xcode:
@@ -770,7 +770,7 @@ Generates system build files.
- Some simple examples showing how to use LLVM as a compiler for a custom
language - including lowering, optimization, and code generation.
-- Kaleidoscope Tutorial: Kaleidoscope language tutorial run through the
+- Kaleidoscope Tutorial: Kaleidoscope language tutorial runs through the
implementation of a nice little compiler for a non-trivial language
including a hand-written lexer, parser, AST, as well as code generation
support using LLVM- both static (ahead of time) and various approaches to
@@ -858,7 +858,7 @@ share code among the `tools`_.
``llvm/lib/Support/``
- Source code that corresponding to the header files in ``llvm/include/ADT/``
+ Source code that corresponds to the header files in ``llvm/include/ADT/``
and ``llvm/include/Support/``.
``llvm/bindings``
@@ -1051,7 +1051,7 @@ Example with clang
% lli hello.bc
- The second examples shows how to invoke the LLVM JIT, :doc:`lli
+ The second example shows how to invoke the LLVM JIT, :doc:`lli
<CommandGuide/lli>`.
#. Use the ``llvm-dis`` utility to take a look at the LLVM assembly code:
@@ -1163,7 +1163,7 @@ following options with cmake:
Consider setting this to ``ON`` if you require a debug build, as this will ease
memory pressure on the linker. This will make linking much faster, as the
- binaries will not contain any of the debug information. Instead the debug
+ binaries will not contain any of the debug information. Instead, the debug
information is in a separate DWARF object file (with the extension ``.dwo``).
This only applies to host platforms using ELF, such as Linux.
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 9a32f0c..bac13cc 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -280,9 +280,9 @@ linkage:
linkage are linked together, the two global arrays are appended
together. This is the LLVM, typesafe, equivalent of having the
system linker append together "sections" with identical names when
- .o files are linked.
+ ``.o`` files are linked.
- Unfortunately this doesn't correspond to any feature in .o files, so it
+ Unfortunately this doesn't correspond to any feature in ``.o`` files, so it
can only be used for variables like ``llvm.global_ctors`` which llvm
interprets specially.
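+
+ As an illustrative sketch (assuming a constructor function ``@ctor`` defined
+ elsewhere in the module), such a variable could look like:
+
+ .. code-block:: llvm
+
+     @llvm.global_ctors = appending global [1 x { i32, ptr, ptr }] [{ i32, ptr, ptr } { i32 65535, ptr @ctor, ptr null }]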
@@ -371,7 +371,7 @@ added in the future:
This calling convention supports `tail call
optimization <CodeGenerator.html#tail-call-optimization>`_ but requires
- both the caller and callee are using it.
+ both the caller and callee to use it.
"``cc 11``" - The HiPE calling convention
This calling convention has been implemented specifically for use by
the `High-Performance Erlang
@@ -447,7 +447,7 @@ added in the future:
R11. R11 can be used as a scratch register. Furthermore it also preserves
all floating-point registers (XMMs/YMMs).
- - On AArch64 the callee preserve all general purpose registers, except
+ - On AArch64 the callee preserves all general purpose registers, except
X0-X8 and X16-X18. Furthermore it also preserves lower 128 bits of V8-V31
SIMD floating point registers. Not allowed with ``nest``.
@@ -890,7 +890,7 @@ Syntax::
[gc] [prefix Constant] [prologue Constant] [personality Constant]
(!name !N)* { ... }
-The argument list is a comma separated sequence of arguments where each
+The argument list is a comma-separated sequence of arguments where each
argument is of the following form:
Syntax::
@@ -1011,7 +1011,7 @@ some can only be checked when producing an object file:
IFuncs
-------
-IFuncs, like as aliases, don't create any new data or func. They are just a new
+IFuncs, like aliases, don't create any new data or func. They are just a new
symbol that is resolved at runtime by calling a resolver function.
On ELF platforms, IFuncs are resolved by the dynamic linker at load time. On
@@ -1211,7 +1211,7 @@ Currently, only the following parameter attributes are defined:
the callee (for a return value).
``noext``
This indicates to the code generator that the parameter or return
- value has the high bits undefined, as for a struct in register, and
+ value has the high bits undefined, as for a struct in a register, and
therefore does not need to be sign or zero extended. This is the same
as default behavior and is only actually used (by some targets) to
validate that one of the attributes is always present.
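+
+ For illustration, declarations using this attribute might look like the
+ following (the function names are placeholders):
+
+ .. code-block:: llvm
+
+     declare void @takes_unextended(i16 noext %x)
+     declare noext i16 @returns_unextended()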
@@ -1252,7 +1252,7 @@ Currently, only the following parameter attributes are defined:
on the stack. This implies the pointer is dereferenceable up to
the storage size of the type.
- It is not generally permissible to introduce a write to an
+ It is not generally permissible to introduce a write to a
``byref`` pointer. The pointer may have any address space and may
be read only.
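+
+ For illustration (the struct and function names are placeholders):
+
+ .. code-block:: llvm
+
+     %struct.Big = type { [64 x i32] }
+     declare void @takes_struct(ptr byref(%struct.Big) %p)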
@@ -1393,7 +1393,7 @@ Currently, only the following parameter attributes are defined:
storage for any other object accessible to the caller.
``captures(...)``
- This attributes restrict the ways in which the callee may capture the
+ This attribute restricts the ways in which the callee may capture the
pointer. This is not a valid attribute for return values. This attribute
applies only to the particular copy of the pointer passed in this argument.
@@ -1615,7 +1615,7 @@ Currently, only the following parameter attributes are defined:
assigning this parameter or return value to a stack slot during calling
convention lowering. The enforcement of the specified alignment is
target-dependent, as target-specific calling convention rules may override
- this value. This attribute serves the purpose of carrying language specific
+ this value. This attribute serves the purpose of carrying language-specific
alignment information that is not mapped to base types in the backend (for
example, over-alignment specification through language attributes).
@@ -1993,7 +1993,7 @@ For example:
``cold``
This attribute indicates that this function is rarely called. When
computing edge weights, basic blocks post-dominated by a cold
- function call are also considered to be cold; and, thus, given low
+ function call are also considered to be cold and, thus, given a low
weight.
.. _attr_convergent:
@@ -2892,7 +2892,7 @@ site, these bundles may contain any values that are needed by the
generated code. For more details, see :ref:`GC Transitions
<gc_transition_args>`.
-The bundle contain an arbitrary list of Values which need to be passed
+The bundle contains an arbitrary list of Values which need to be passed
to GC transition code. They will be lowered and passed as operands to
the appropriate GC_TRANSITION nodes in the selection DAG. It is assumed
that these arguments must be available before and after (but not
@@ -2903,7 +2903,7 @@ necessarily during) the execution of the callee.
Assume Operand Bundles
^^^^^^^^^^^^^^^^^^^^^^
-Operand bundles on an :ref:`llvm.assume <int_assume>` allows representing
+Operand bundles on an :ref:`llvm.assume <int_assume>` allow representing
assumptions, such as that a :ref:`parameter attribute <paramattrs>` or a
:ref:`function attribute <fnattrs>` holds for a certain value at a certain
location. Operand bundles enable assumptions that are either hard or impossible
@@ -2922,11 +2922,11 @@ restricted form:
"<tag>"([ <holds for value> [, <attribute argument>] ])
-* The tag of the operand bundle is usually the name of attribute that can be
- assumed to hold. It can also be `ignore`, this tag doesn't contain any
+* The tag of the operand bundle is usually the name of the attribute that can be
+ assumed to hold. It can also be `ignore`; this tag doesn't contain any
information and should be ignored.
-* The first argument if present is the value for which the attribute hold.
-* The second argument if present is an argument of the attribute.
+* The first argument, if present, is the value for which the attribute holds.
+* The second argument, if present, is an argument of the attribute.
If there are no arguments the attribute is a property of the call location.
@@ -2968,7 +2968,7 @@ the behavior is undefined, unless one of the following exceptions applies:
dereferenceable at later pointers, e.g. because it could have been freed.
In addition to allowing operand bundles encoding function and parameter
-attributes, an assume operand bundle my also encode a ``separate_storage``
+attributes, an assume operand bundle may also encode a ``separate_storage``
operand bundle. This has the form:
.. code-block:: llvm
@@ -3115,7 +3115,7 @@ Note that the assembly string *must* be parseable by LLVM's integrated assembler
Data Layout
-----------
-A module may specify a target specific data layout string that specifies
+A module may specify a target-specific data layout string that specifies
how data is to be laid out in memory. The syntax for the data layout is
simply:
@@ -3356,6 +3356,19 @@ behavior is undefined:
- the size of all allocated objects must be non-negative and not exceed the
largest signed integer that fits into the index type.
+Allocated objects that are created with operations recognized by LLVM (such as
+:ref:`alloca <i_alloca>`, heap allocation functions marked as such, and global
+variables) may *not* change their size. (``realloc``-style operations do not
+change the size of an existing allocated object; instead, they create a new
+allocated object. Even if the object is at the same location as the old one, old
+pointers cannot be used to access this new object.) However, allocated objects
+can also be created by means not recognized by LLVM, e.g. by directly calling
+``mmap``. Those allocated objects are allowed to grow to the right (i.e.,
+keeping the same base address, but increasing their size) while maintaining the
+validity of existing pointers, as long as they always satisfy the properties
+described above. Currently, allocated objects are not permitted to grow to the
+left or to shrink, nor can they have holes.
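+
+As an illustration, assuming ``%base`` points to a mapping created by a direct
+``mmap`` call that the operating system later grows in place, an access past
+the original size becomes allowed once the object has grown to cover it:
+
+.. code-block:: llvm
+
+  %p = getelementptr i8, ptr %base, i64 8192
+  %v = load i8, ptr %p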
+
.. _objectlifetime:
Object Lifetime
@@ -3611,7 +3624,7 @@ operation may modify the memory at that address. A volatile operation
may not modify any other memory accessible by the module being compiled.
A volatile operation may not call any code in the current module.
-In general (without target specific context), the address space of a
+In general (without target-specific context), the address space of a
volatile operation may not be changed. Different address spaces may
have different trapping behavior when dereferencing an invalid
pointer.
@@ -3794,7 +3807,7 @@ If an atomic operation is marked ``syncscope("singlethread")``, it only
other operations running in the same thread (for example, in signal handlers).
If an atomic operation is marked ``syncscope("<target-scope>")``, where
-``<target-scope>`` is a target specific synchronization scope, then it is target
+``<target-scope>`` is a target-specific synchronization scope, then it is target
dependent if it *synchronizes with* and participates in the seq\_cst total
orderings of other operations.
@@ -3896,10 +3909,10 @@ Floating-Point Semantics
------------------------
This section defines the semantics for core floating-point operations on types
-that use a format specified by IEEE-745. These types are: ``half``, ``float``,
+that use a format specified by IEEE-754. These types are: ``half``, ``float``,
``double``, and ``fp128``, which correspond to the binary16, binary32, binary64,
and binary128 formats, respectively. The "core" operations are those defined in
-section 5 of IEEE-745, which all have corresponding LLVM operations.
+section 5 of IEEE-754, which all have corresponding LLVM operations.
The value returned by those operations matches that of the corresponding
IEEE-754 operation executed in the :ref:`default LLVM floating-point environment
@@ -8746,11 +8759,11 @@ framework::
The metadata encoding as lists of lists of options, as opposed to a collapsed
list of options, is chosen so that the IR encoding can use multiple option
strings to specify e.g., a single library, while still having that specifier be
-preserved as an atomic element that can be recognized by a target specific
+preserved as an atomic element that can be recognized by a target-specific
assembly writer or object file emitter.
Each individual option is required to be either a valid option for the target's
-linker, or an option that is reserved by the target specific assembly writer or
+linker, or an option that is reserved by the target-specific assembly writer or
object file emitter. No other aspect of these options is defined by the IR.
Dependent Libs Named Metadata
@@ -11928,6 +11941,9 @@ if the ``getelementptr`` has any non-zero indices, the following rules apply:
:ref:`based <pointeraliasing>` on. This means that it points into that
allocated object, or to its end. Note that the object does not have to be
live anymore; being in-bounds of a deallocated object is sufficient.
+ If the allocated object can grow, then the relevant size for being *in
+ bounds* is the maximal size the object could have while satisfying the
+ allocated object rules, not its current size.
* During the successive addition of offsets to the address, the resulting
pointer must remain *in bounds* of the allocated object at each step.
@@ -19508,7 +19524,7 @@ Semantics:
The '``llvm.set.loop.iterations.*``' intrinsics do not perform any arithmetic
on their operand. It's a hint to the backend that can use this to set up the
-hardware-loop count with a target specific instruction, usually a move of this
+hardware-loop count with a target-specific instruction, usually a move of this
value to a special register or a hardware-loop instruction.
@@ -19547,7 +19563,7 @@ Semantics:
The '``llvm.start.loop.iterations.*``' intrinsics do not perform any arithmetic
on their operand. It's a hint to the backend that can use this to set up the
-hardware-loop count with a target specific instruction, usually a move of this
+hardware-loop count with a target-specific instruction, usually a move of this
value to a special register or a hardware-loop instruction.
'``llvm.test.set.loop.iterations.*``' Intrinsic
@@ -19583,7 +19599,7 @@ Semantics:
The '``llvm.test.set.loop.iterations.*``' intrinsics do not perform any
arithmetic on their operand. It's a hint to the backend that can use this to
-set up the hardware-loop count with a target specific instruction, usually a
+set up the hardware-loop count with a target-specific instruction, usually a
move of this value to a special register or a hardware-loop instruction.
The result is the conditional value of whether the given count is not zero.
@@ -19621,7 +19637,7 @@ Semantics:
The '``llvm.test.start.loop.iterations.*``' intrinsics do not perform any
arithmetic on their operand. It's a hint to the backend that can use this to
-set up the hardware-loop count with a target specific instruction, usually a
+set up the hardware-loop count with a target-specific instruction, usually a
move of this value to a special register or a hardware-loop instruction.
The result is a pair of the input and a conditional value of whether the
given count is not zero.
@@ -26639,19 +26655,14 @@ Arguments:
The first argument is a constant integer representing the size of the
object, or -1 if it is variable sized. The second argument is a pointer
-to the object.
+to an ``alloca`` instruction.
Semantics:
""""""""""
-If ``ptr`` is a stack-allocated object and it points to the first byte of
-the object, the object is initially marked as dead.
-``ptr`` is conservatively considered as a non-stack-allocated object if
-the stack coloring algorithm that is used in the optimization pipeline cannot
-conclude that ``ptr`` is a stack-allocated object.
-
-After '``llvm.lifetime.start``', the stack object that ``ptr`` points is marked
-as alive and has an uninitialized value.
+The stack-allocated object that ``ptr`` points to is initially marked as dead.
+After '``llvm.lifetime.start``', the stack object is marked as alive and has an
+uninitialized value.
The stack object is marked as dead when either
:ref:`llvm.lifetime.end <int_lifeend>` to the alloca is executed or the
function returns.
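+
+A minimal sketch of the intended usage (the value names are placeholders):
+
+.. code-block:: llvm
+
+  %buf = alloca [32 x i8]
+  call void @llvm.lifetime.start.p0(i64 32, ptr %buf)  ; %buf becomes live here
+  ; ... use %buf ...
+  call void @llvm.lifetime.end.p0(i64 32, ptr %buf)    ; %buf is dead again here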
@@ -26661,11 +26672,6 @@ After :ref:`llvm.lifetime.end <int_lifeend>` is called,
The second '``llvm.lifetime.start``' call marks the object as alive, but it
does not change the address of the object.
-If ``ptr`` is a non-stack-allocated object, it does not point to the first
-byte of the object or it is a stack object that is already alive, it simply
-fills all bytes of the object with ``poison``.
-
-
.. _int_lifeend:
'``llvm.lifetime.end``' Intrinsic
@@ -26689,24 +26695,16 @@ Arguments:
The first argument is a constant integer representing the size of the
object, or -1 if it is variable sized. The second argument is a pointer
-to the object.
+to an ``alloca`` instruction.
Semantics:
""""""""""
-If ``ptr`` is a stack-allocated object and it points to the first byte of the
-object, the object is dead.
-``ptr`` is conservatively considered as a non-stack-allocated object if
-the stack coloring algorithm that is used in the optimization pipeline cannot
-conclude that ``ptr`` is a stack-allocated object.
+The stack-allocated object that ``ptr`` points to becomes dead after the call
+to this intrinsic.
Calling ``llvm.lifetime.end`` on an already dead alloca is no-op.
-If ``ptr`` is a non-stack-allocated object or it does not point to the first
-byte of the object, it is equivalent to simply filling all bytes of the object
-with ``poison``.
-
-
'``llvm.invariant.start``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/llvm/docs/NVPTXUsage.rst b/llvm/docs/NVPTXUsage.rst
index 11017fe..d28eb68 100644
--- a/llvm/docs/NVPTXUsage.rst
+++ b/llvm/docs/NVPTXUsage.rst
@@ -1072,6 +1072,8 @@ Syntax:
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.gather4.2d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
+
Overview:
"""""""""
@@ -1082,7 +1084,13 @@ global memory to shared::cluster memory (indicated by the ``g2s`` prefix)
in ``tile`` mode. In tile mode, the multi-dimensional layout of the
source tensor is preserved at the destination. The dimension of the
tensor data ranges from 1d to 5d with the coordinates specified
-by the ``i32 %d0 ... i32 %d4`` arguments.
+by the ``i32 %d0 ... i32 %d4`` arguments. In ``tile.gather4`` mode,
+four rows in a 2D tensor are combined to form a single 2D destination
+tensor. The first coordinate ``i32 %x0`` denotes the column index,
+followed by four coordinates indicating the four row indices.
+So, this mode takes a total of 5 coordinates as input arguments.
+For more information on ``gather4`` mode, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes>`_.
* The last three arguments to these intrinsics are flags
indicating support for multicast, cache_hint and cta_group::1/2
@@ -1116,10 +1124,18 @@ Syntax:
.. code-block:: llvm
- declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.3d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.3d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.128.3d(ptr addrspace(7) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch, i32 %flag_cta_group)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.128.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.w.128.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
Overview:
"""""""""
@@ -1131,10 +1147,105 @@ in ``im2col`` mode. In im2col mode, some dimensions of the source tensor
are unrolled into a single dimensional column at the destination. In this
mode, the tensor has to be at least three-dimensional. Along with the tensor
coordinates, im2col offsets are also specified (denoted by
-``i16 im2col0...i16 %im2col2``). The number of im2col offsets is two less
-than the number of dimensions of the tensor operation. The last three arguments
-to these intrinsics are flags, with the same functionality as described
-in the ``tile`` mode intrinsics above.
+``i16 %im2col0...i16 %im2col2``). For the ``im2col`` mode, the number of offsets
+is two less than the number of dimensions of the tensor operation. For the
+``im2col.w`` and ``im2col.w.128`` modes, the number of offsets is always 2,
+denoted by the ``i16 %wHalo`` and ``i16 %wOffset`` arguments. For more information
+on ``im2col.w`` and ``im2col.w.128`` modes, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-w-w128-modes>`_.
+
+The last three arguments to these intrinsics are flags, with the same functionality
+as described in the ``tile`` mode intrinsics above.
+
+For more information, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.
+
+'``llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.[1-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.1d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.2d(..., i32 %d0, i32 %d1, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.gather4.2d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i64 %ch, i1 %flag_ch)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.[1-5]d``' intrinsics
+correspond to the ``cp.async.bulk.tensor.[1-5]d.shared::cta.global.*``
+set of PTX instructions. These instructions initiate an asynchronous
+copy of tensor data from global memory to shared::cta memory in
+``tile`` mode. In tile mode, the multi-dimensional layout of the
+source tensor is preserved at the destination. The dimension of the
+tensor data ranges from 1d to 5d with the coordinates specified
+by the ``i32 %d0 ... i32 %d4`` arguments. In ``tile.gather4`` mode,
+four rows in a 2D tensor are combined to form a single 2D destination
+tensor. The first coordinate ``i32 %x0`` denotes the column index,
+followed by four coordinates indicating the four row indices.
+So, this mode takes a total of 5 coordinates as input arguments.
+For more information on ``gather4`` mode, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes>`_.
+
+* The last argument to these intrinsics is a boolean flag
+ indicating support for cache_hint. This flag argument must
+ be a compile-time constant. When set, it indicates a valid
+ cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
+ variant of the PTX instruction.
+
+For more information, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.
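+
+For illustration, a 2D ``tile`` mode copy with the cache_hint flag disabled
+might look like this (the SSA value names are placeholders):
+
+.. code-block:: llvm
+
+  call void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.tile.2d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tmap, i32 %d0, i32 %d1, i64 0, i1 0)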
+
+'``llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.[3-5]d``'
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+
+.. code-block:: llvm
+
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i64 %ch, i1 %flag_ch)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
+
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.128.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.128.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.128.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+Overview:
+"""""""""
+
+The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.[3-5]d``' intrinsics
+correspond to the ``cp.async.bulk.tensor.[1-5]d.shared::cta.global.*``
+set of PTX instructions. These instructions initiate an asynchronous copy
+of tensor data from global memory to shared::cta memory in ``im2col`` mode.
+In im2col mode, some dimensions of the source tensor are unrolled into a
+single dimensional column at the destination. In this mode, the tensor has
+to be at least three-dimensional. Along with the tensor coordinates, im2col
+offsets are also specified (denoted by ``i16 %im2col0...i16 %im2col2``).
+For the ``im2col`` mode, the number of offsets is two less than the number
+of dimensions of the tensor operation. For the ``im2col.w`` and ``im2col.w.128``
+modes, the number of offsets is always 2, denoted by the ``i16 %wHalo`` and
+``i16 %wOffset`` arguments. For more information on ``im2col.w`` and
+``im2col.w.128`` modes, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-w-w128-modes>`_.
+
+* The last argument to these intrinsics is a boolean flag
+ indicating support for cache_hint. This flag argument must
+ be a compile-time constant. When set, it indicates a valid
+ cache_hint (``i64 %ch``) and generates the ``.L2::cache_hint``
+ variant of the PTX instruction.
For more information, refer PTX ISA
`<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor>`_.
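+
+Similarly, an ``im2col.w`` copy of a 3D tensor with cache_hint disabled might
+look like the following (the value names are placeholders):
+
+.. code-block:: llvm
+
+  call void @llvm.nvvm.cp.async.bulk.tensor.g2s.cta.im2col.w.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tmap, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 0, i1 0)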
@@ -1153,6 +1264,8 @@ Syntax:
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.scatter4.2d(ptr addrspace(3) %src, ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i64 %ch, i1 %flag_ch)
+
Overview:
"""""""""
@@ -1162,6 +1275,12 @@ These instructions initiate an asynchronous copy of tensor data from
shared::cta to global memory (indicated by the ``s2g`` prefix)
in ``tile`` mode. The dimension of the tensor data ranges from 1d to 5d
with the coordinates specified by the ``i32 %d0 ... i32 %d4`` arguments.
+In ``tile.scatter4`` mode, a single 2D source tensor is divided into
+four rows in the 2D destination tensor. The first coordinate ``i32 %x0``
+denotes the column index, followed by four coordinates indicating the
+four row indices. So, this mode takes a total of five coordinates as input arguments.
+For more information on ``scatter4`` mode, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes>`_.
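+
+For illustration only (the value names are placeholders), a call to the
+``scatter4`` variant might look like:
+
+.. code-block:: llvm
+
+  ; One column index followed by four row indices; no cache hint (flag is false).
+  call void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.scatter4.2d(ptr addrspace(3) %src, ptr %tmap, i32 %col, i32 %row0, i32 %row1, i32 %row2, i32 %row3, i64 0, i1 false)
+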
* The last argument to these intrinsics is a boolean flag
indicating support for cache_hint. This flag argument must
@@ -1214,6 +1333,8 @@ Syntax:
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.gather4.2d(ptr %tensor_map, i32 %x0, i32 %y0, i32 %y1, i32 %y2, i32 %y3, i64 %ch, i1 %flag_ch)
+
Overview:
"""""""""
@@ -1225,6 +1346,13 @@ multi-dimensional layout of the source tensor is preserved at the destination.
The dimension of the tensor data ranges from 1d to 5d with the coordinates
specified by the ``i32 %d0 ... i32 %d4`` arguments.
+In ``tile.gather4`` mode, four rows in the two-dimensional source tensor are
+fetched to the L2 cache. The first coordinate ``i32 %x0`` denotes the column index,
+followed by four coordinates indicating the four row indices. So, this mode takes
+a total of five coordinates as input arguments.
+For more information on ``gather4`` mode, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-tiled-scatter4-gather4-modes>`_.
+
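+For illustration only (the value names are placeholders), a call to the
+``gather4`` variant might look like:
+
+.. code-block:: llvm
+
+  ; Prefetch four rows of a 2D tensor into the L2 cache; no cache hint.
+  call void @llvm.nvvm.cp.async.bulk.tensor.prefetch.tile.gather4.2d(ptr %tmap, i32 %col, i32 %row0, i32 %row1, i32 %row2, i32 %row3, i64 0, i1 false)
+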
* The last argument to these intrinsics is a boolean flag
indicating support for cache_hint. This flag argument must
be a compile-time constant. When set, it indicates a valid
@@ -1246,6 +1374,14 @@ Syntax:
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.3d(ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
+ declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.128.3d(ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %wHalo, i16 %wOffset, i64 %ch, i1 %flag_ch)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.128.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
+ declare void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.128.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
+
Overview:
"""""""""
@@ -1256,9 +1392,16 @@ of tensor data from global memory to the L2 cache. In im2col mode, some
dimensions of the source tensor are unrolled into a single dimensional
column at the destination. In this mode, the tensor has to be at least
three-dimensional. Along with the tensor coordinates, im2col offsets are
-also specified (denoted by ``i16 im2col0...i16 %im2col2``). The number
-of im2col offsets is two less than the number of dimensions of the tensor
-operation. The last argument to these intrinsics is a boolean flag, with
+also specified (denoted by ``i16 %im2col0 ... i16 %im2col2``). For the ``im2col``
+mode, the number of offsets is two less than the number of dimensions of
+the tensor operation. For the ``im2col.w`` and ``im2col.w.128`` modes,
+the number of offsets is always two, denoted by the ``i16 %wHalo`` and
+``i16 %wOffset`` arguments. For more information on the ``im2col.w`` and
+``im2col.w.128`` modes, refer to the PTX ISA
+`<https://docs.nvidia.com/cuda/parallel-thread-execution/#tensor-im2col-w-w128-modes>`_.
+
+The last argument to these intrinsics is a boolean flag, with
the same functionality as described in the ``tile`` mode intrinsics above.
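+
+For illustration only (the value names are placeholders), a call to the
+``im2col.w`` prefetch variant might look like:
+
+.. code-block:: llvm
+
+  ; Prefetch in im2col.w mode with a valid cache hint (the i1 flag is true).
+  call void @llvm.nvvm.cp.async.bulk.tensor.prefetch.im2col.w.3d(ptr %tmap, i32 %c0, i32 %c1, i32 %c2, i16 %whalo, i16 %woff, i64 %ch, i1 true)
+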
For more information, refer PTX ISA
diff --git a/llvm/docs/ProgrammersManual.rst b/llvm/docs/ProgrammersManual.rst
index d417de7..68490c8 100644
--- a/llvm/docs/ProgrammersManual.rst
+++ b/llvm/docs/ProgrammersManual.rst
@@ -135,7 +135,7 @@ rarely have to include this file directly).
return !L->contains(cast<Instruction>(V)->getParent());
}
- Note that you should **not** use an ``isa<>`` test followed by a ``cast<>``,
+ Note that you should **not** use an ``isa<>`` test followed by a ``cast<>``;
for that use the ``dyn_cast<>`` operator.
``dyn_cast<>``:
@@ -234,8 +234,8 @@ the ``str`` member function. See ``llvm/ADT/StringRef.h`` (`doxygen
<https://llvm.org/doxygen/StringRef_8h_source.html>`__) for more
information.
-You should rarely use the ``StringRef`` class directly, because it contains
-pointers to external memory it is not generally safe to store an instance of the
+You should rarely use the ``StringRef`` class directly. Because it contains
+pointers to external memory, it is not generally safe to store an instance of the
class (unless you know that the external storage will not be freed).
``StringRef`` is small and pervasive enough in LLVM that it should always be
passed by value.
@@ -416,14 +416,14 @@ to abort quickly at the point of failure (providing some basic diagnostic) when
invariants are broken at runtime.
The fundamental tools for handling programmatic errors are assertions and the
-llvm_unreachable function. Assertions are used to express invariant conditions,
+``llvm_unreachable`` function. Assertions are used to express invariant conditions,
and should include a message describing the invariant:
.. code-block:: c++
assert(isPhysReg(R) && "All virt regs should have been allocated already.");
-The llvm_unreachable function can be used to document areas of control flow
+The ``llvm_unreachable`` function can be used to document areas of control flow
that should never be entered if the program invariants hold:
.. code-block:: c++
@@ -598,7 +598,7 @@ semantics. For example:
}
This third form works with any type that can be assigned to from ``T&&``. This
-can be useful if the ``Expected<T>`` value needs to be stored an already-declared
+can be useful if the ``Expected<T>`` value needs to be stored in an already-declared
``std::optional<T>``. For example:
.. code-block:: c++
@@ -619,7 +619,7 @@ can be useful if the ``Expected<T>`` value needs to be stored an already-declare
All ``Error`` instances, whether success or failure, must be either checked or
moved from (via ``std::move`` or a return) before they are destructed.
-Accidentally discarding an unchecked error will cause a program abort at the
+Accidentally discarding an unchecked error will cause a program to abort at the
point where the unchecked value's destructor is run, making it easy to identify
and fix violations of this rule.
@@ -661,7 +661,7 @@ a variadic list of "handlers", each of which must be a callable type (a
function, lambda, or class with a call operator) with one argument. The
``handleErrors`` function will visit each handler in the sequence and check its
argument type against the dynamic type of the error, running the first handler
-that matches. This is the same decision process that is used decide which catch
+that matches. This is the same decision process that is used to decide which catch
clause to run for a C++ exception.
Since the list of handlers passed to ``handleErrors`` may not cover every error
@@ -869,10 +869,10 @@ T value:
}
Like the ExitOnError utility, cantFail simplifies control flow. Their treatment
-of error cases is very different however: Where ExitOnError is guaranteed to
+of error cases is very different, however: Where ExitOnError is guaranteed to
terminate the program on an error input, cantFail simply asserts that the result
is success. In debug builds this will result in an assertion failure if an error
-is encountered. In release builds the behavior of cantFail for failure values is
+is encountered. In release builds, the behavior of cantFail for failure values is
undefined. As such, care must be taken in the use of cantFail: clients must be
certain that a cantFail wrapped call really can not fail with the given
arguments.
@@ -928,7 +928,7 @@ well-formed Foo or an Error, never an object in an invalid state.
Propagating and consuming errors based on types
"""""""""""""""""""""""""""""""""""""""""""""""
-In some contexts, certain types of error are known to be benign. For example,
+In some contexts, certain types of errors are known to be benign. For example,
when walking an archive, some clients may be happy to skip over badly formatted
object files rather than terminating the walk immediately. Skipping badly
formatted objects could be achieved using an elaborate handler method, but the
@@ -956,7 +956,7 @@ type inspection method, ``isA``, and the ``consumeError`` function:
Concatenating Errors with joinErrors
""""""""""""""""""""""""""""""""""""
-In the archive walking example above ``BadFileFormat`` errors are simply
+In the archive walking example above, ``BadFileFormat`` errors are simply
consumed and ignored. If the client had wanted to report these errors after
completing the walk over the archive, they could use the ``joinErrors`` utility:
@@ -982,13 +982,13 @@ The ``joinErrors`` routine builds a special error type called ``ErrorList``,
which holds a list of user defined errors. The ``handleErrors`` routine
recognizes this type and will attempt to handle each of the contained errors in
order. If all contained errors can be handled, ``handleErrors`` will return
-``Error::success()``, otherwise ``handleErrors`` will concatenate the remaining
+``Error::success()``; otherwise, ``handleErrors`` will concatenate the remaining
errors and return the resulting ``ErrorList``.
Building fallible iterators and iterator ranges
"""""""""""""""""""""""""""""""""""""""""""""""
-The archive walking examples above retrieve archive members by index, however
+The archive walking examples above retrieve archive members by index; however,
this requires considerable boiler-plate for iteration and error checking. We can
clean this up by using the "fallible iterator" pattern, which supports the
following natural iteration idiom for fallible containers like Archive:
@@ -1039,7 +1039,7 @@ fallible_iterator utility which provides ``operator++`` and ``operator--``,
returning any errors via a reference passed in to the wrapper at construction
time. The fallible_iterator wrapper takes care of (a) jumping to the end of the
range on error, and (b) marking the error as checked whenever an iterator is
-compared to ``end`` and found to be inequal (in particular: this marks the
+compared to ``end`` and found to be inequal (in particular, this marks the
error as checked throughout the body of a range-based for loop), enabling early
exit from the loop without redundant error checking.
@@ -1068,7 +1068,7 @@ functions. E.g.:
Using the fallible_iterator utility allows for both natural construction of
fallible iterators (using failing ``inc`` and ``dec`` operations) and
-relatively natural use of c++ iterator/loop idioms.
+relatively natural use of C++ iterator/loop idioms.
.. _function_apis:
@@ -1175,7 +1175,7 @@ Then you can run your pass like this:
I am here!
Using the ``LLVM_DEBUG()`` macro instead of a home-brewed solution allows you to not
-have to create "yet another" command line option for the debug output for your
+have to create "yet another" command-line option for the debug output for your
pass. Note that ``LLVM_DEBUG()`` macros are disabled for non-asserts builds, so they
do not cause a performance impact at all (for the same reason, they should also
not contain side-effects!).
@@ -1349,7 +1349,7 @@ certain number of times.
The ``llvm/Support/DebugCounter.h`` (`doxygen
<https://llvm.org/doxygen/DebugCounter_8h_source.html>`__) file
provides a class named ``DebugCounter`` that can be used to create
-command line counter options that control execution of parts of your code.
+command-line counter options that control execution of parts of your code.
Define your DebugCounter like this:
@@ -1364,7 +1364,7 @@ is specified by the first argument. The name of the counter
argument, and the description used in the help is specified by the
third argument.
-Whatever code you want that control, use ``DebugCounter::shouldExecute`` to control it.
+Whatever code you want to control, use ``DebugCounter::shouldExecute`` to control it.
.. code-block:: c++
diff --git a/llvm/docs/ReleaseNotes.md b/llvm/docs/ReleaseNotes.md
index 5591ac6..8d11701 100644
--- a/llvm/docs/ReleaseNotes.md
+++ b/llvm/docs/ReleaseNotes.md
@@ -86,6 +86,8 @@ Changes to LLVM infrastructure
* Added the support for ``fmaximum`` and ``fminimum`` in ``atomicrmw`` instruction. The
comparison is expected to match the behavior of ``llvm.maximum.*`` and
``llvm.minimum.*`` respectively.
+* Removed the codegen pass ``finalize-mi-bundles``. The same functionality is
+ still available as an API function ``llvm::finalizeBundles``.
Changes to building LLVM
------------------------
@@ -96,6 +98,12 @@ Changes to TableGen
Changes to Interprocedural Optimizations
----------------------------------------
+Changes to Vectorizers
+----------------------------------------
+
+* Added initial support for copyable elements in SLP, which models copyable
+  elements as `add <element>, 0`, i.e., uses identity constants for missing lanes.
+
Changes to the AArch64 Backend
------------------------------
@@ -282,6 +290,9 @@ Changes to the LLVM tools
([#47468](https://github.com/llvm/llvm-project/issues/47468))
* llvm-addr2line now supports a `+` prefix when specifying an address.
* Support for `SHT_LLVM_BB_ADDR_MAP` versions 0 and 1 has been dropped.
+* llvm-objdump now supports the `--debug-inlined-funcs` flag, which prints the
+ locations of inlined functions alongside disassembly. The
+ `--debug-vars-indent` flag has also been renamed to `--debug-indent`.
Changes to LLDB
---------------------------------
diff --git a/llvm/docs/TestingGuide.rst b/llvm/docs/TestingGuide.rst
index b6dda6a..76b6b4e 100644
--- a/llvm/docs/TestingGuide.rst
+++ b/llvm/docs/TestingGuide.rst
@@ -152,12 +152,12 @@ can run the LLVM and Clang tests simultaneously using:
% make check-all
-To run the tests with Valgrind (Memcheck by default), use the ``LIT_ARGS`` make
+To run the tests with Valgrind (Memcheck by default), use the ``LIT_OPTS`` make
variable to pass the required options to lit. For example, you can use:
.. code-block:: bash
- % make check LIT_ARGS="-v --vg --vg-leak"
+ % make check LIT_OPTS="-v --vg --vg-leak"
to enable testing with valgrind and with leak checking enabled.
diff --git a/llvm/docs/tutorial/MyFirstLanguageFrontend/LangImpl04.rst b/llvm/docs/tutorial/MyFirstLanguageFrontend/LangImpl04.rst
index 5660802..5ebff3b 100644
--- a/llvm/docs/tutorial/MyFirstLanguageFrontend/LangImpl04.rst
+++ b/llvm/docs/tutorial/MyFirstLanguageFrontend/LangImpl04.rst
@@ -86,7 +86,7 @@ instead of computing "``x+3``" twice.
Unfortunately, no amount of local analysis will be able to detect and
correct this. This requires two transformations: reassociation of
-expressions (to make the add's lexically identical) and Common
+expressions (to make the adds lexically identical) and Common
Subexpression Elimination (CSE) to delete the redundant add instruction.
Fortunately, LLVM provides a broad range of optimizations that you can
use, in the form of "passes".