aboutsummaryrefslogtreecommitdiff
path: root/llvm/docs
diff options
context:
space:
mode:
Diffstat (limited to 'llvm/docs')
-rw-r--r--llvm/docs/CodingStandards.rst46
-rw-r--r--llvm/docs/CommandGuide/index.rst1
-rw-r--r--llvm/docs/CommandGuide/llvm-ir2vec.rst170
-rw-r--r--llvm/docs/Extensions.rst20
-rw-r--r--llvm/docs/LangRef.rst2
-rw-r--r--llvm/docs/MLGO.rst12
6 files changed, 226 insertions, 25 deletions
diff --git a/llvm/docs/CodingStandards.rst b/llvm/docs/CodingStandards.rst
index c614a6d..732227b 100644
--- a/llvm/docs/CodingStandards.rst
+++ b/llvm/docs/CodingStandards.rst
@@ -30,7 +30,7 @@ because the naming and other conventions are dictated by the C++ standard.
There are some conventions that are not uniformly followed in the code base
(e.g. the naming convention). This is because they are relatively new, and a
-lot of code was written before they were put in place. Our long term goal is
+lot of code was written before they were put in place. Our long-term goal is
for the entire codebase to follow the convention, but we explicitly *do not*
want patches that do large-scale reformatting of existing code. On the other
hand, it is reasonable to rename the methods of a class if you're about to
@@ -50,7 +50,7 @@ code imported into the tree. Generally, our preference is for standards
conforming, modern, and portable C++ code as the implementation language of
choice.
-For automation, build-systems and utility scripts Python is preferred and
+For automation, build-systems, and utility scripts, Python is preferred and
is widely used in the LLVM repository already.
C++ Standard Versions
@@ -92,7 +92,7 @@ LLVM support libraries (for example, `ADT
<https://github.com/llvm/llvm-project/tree/main/llvm/include/llvm/ADT>`_)
implement specialized data structures or functionality missing in the standard
library. Such libraries are usually implemented in the ``llvm`` namespace and
-follow the expected standard interface, when there is one.
+follow the expected standard interface when there is one.
When both C++ and the LLVM support libraries provide similar functionality, and
there isn't a specific reason to favor the C++ implementation, it is generally
@@ -325,8 +325,8 @@ implementation file. In any case, implementation files can include additional
comments (not necessarily in Doxygen markup) to explain implementation details
as needed.
-Don't duplicate function or class name at the beginning of the comment.
-For humans it is obvious which function or class is being documented;
+Don't duplicate the function or class name at the beginning of the comment.
+For humans, it is obvious which function or class is being documented;
automatic documentation processing tools are smart enough to bind the comment
to the correct declaration.
@@ -369,7 +369,7 @@ lower-case letter, and finish the last sentence without a period, if it would
end in one otherwise. Sentences which end with different punctuation, such as
"did you forget ';'?", should still do so.
-For example this is a good error message:
+For example, this is a good error message:
.. code-block:: none
@@ -443,7 +443,7 @@ Write your code to fit within 80 columns.
There must be some limit to the width of the code in
order to allow developers to have multiple files side-by-side in
windows on a modest display. If you are going to pick a width limit, it is
-somewhat arbitrary but you might as well pick something standard. Going with 90
+somewhat arbitrary, but you might as well pick something standard. Going with 90
columns (for example) instead of 80 columns wouldn't add any significant value
and would be detrimental to printing out code. Also many other projects have
standardized on 80 columns, so some people have already configured their editors
@@ -520,7 +520,7 @@ within each other and within function calls in order to build up aggregates
The historically common formatting of braced initialization of aggregate
variables does not mix cleanly with deep nesting, general expression contexts,
function arguments, and lambdas. We suggest new code use a simple rule for
-formatting braced initialization lists: act as-if the braces were parentheses
+formatting braced initialization lists: act as if the braces were parentheses
in a function call. The formatting rules exactly match those already well
understood for formatting nested function calls. Examples:
@@ -607,11 +607,11 @@ Static constructors and destructors (e.g., global variables whose types have a
constructor or destructor) should not be added to the code base, and should be
removed wherever possible.
-Globals in different source files are initialized in `arbitrary order
+Globals in different source files are initialized in an `arbitrary order
<https://yosefk.com/c++fqa/ctors.html#fqa-10.12>`_, making the code more
difficult to reason about.
-Static constructors have negative impact on launch time of programs that use
+Static constructors have a negative impact on the launch time of programs that use
LLVM as a library. We would really like for there to be zero cost for linking
in an additional LLVM target or other library into an application, but static
constructors undermine this goal.
@@ -698,7 +698,7 @@ If you use a braced initializer list when initializing a variable, use an equals
Use ``auto`` Type Deduction to Make Code More Readable
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Some are advocating a policy of "almost always ``auto``" in C++11, however LLVM
+Some are advocating a policy of "almost always ``auto``" in C++11; however, LLVM
uses a more moderate stance. Use ``auto`` if and only if it makes the code more
readable or easier to maintain. Don't "almost always" use ``auto``, but do use
``auto`` with initializers like ``cast<Foo>(...)`` or other places where the
@@ -783,14 +783,14 @@ guards, and might not include their prerequisites. Name such files with the
In general, a header should be implemented by one or more ``.cpp`` files. Each
of these ``.cpp`` files should include the header that defines their interface
-first. This ensures that all of the dependences of the header have been
+first. This ensures that all of the dependencies of the header have been
properly added to the header itself, and are not implicit. System headers
should be included after user headers for a translation unit.
Library Layering
^^^^^^^^^^^^^^^^
-A directory of header files (for example ``include/llvm/Foo``) defines a
+A directory of header files (for example, ``include/llvm/Foo``) defines a
library (``Foo``). One library (both
its headers and implementation) should only use things from the libraries
listed in its dependencies.
@@ -822,7 +822,7 @@ especially in header files.
But wait! Sometimes you need to have the definition of a class to use it, or to
inherit from it. In these cases go ahead and ``#include`` that header file. Be
-aware however that there are many cases where you don't need to have the full
+aware, however, that there are many cases where you don't need to have the full
definition of a class. If you are using a pointer or reference to a class, you
don't need the header file. If you are simply returning a class instance from a
prototyped function or method, you don't need it. In fact, for most cases, you
@@ -970,7 +970,7 @@ loops. A silly example is something like this:
When you have very, very small loops, this sort of structure is fine. But if it
exceeds more than 10-15 lines, it becomes difficult for people to read and
understand at a glance. The problem with this sort of code is that it gets very
-nested very quickly. Meaning that the reader of the code has to keep a lot of
+nested very quickly. This means that the reader of the code has to keep a lot of
context in their brain to remember what is going immediately on in the loop,
because they don't know if/when the ``if`` conditions will have ``else``\s etc.
It is strongly preferred to structure the loop like this:
@@ -988,7 +988,7 @@ It is strongly preferred to structure the loop like this:
...
}
-This has all the benefits of using early exits for functions: it reduces nesting
+This has all the benefits of using early exits for functions: it reduces the nesting
of the loop, it makes it easier to describe why the conditions are true, and it
makes it obvious to the reader that there is no ``else`` coming up that they
have to push context into their brain for. If a loop is large, this can be a
@@ -1149,12 +1149,12 @@ In general, names should be in camel case (e.g. ``TextFileReader`` and
nouns and start with an upper-case letter (e.g. ``TextFileReader``).
* **Variable names** should be nouns (as they represent state). The name should
- be camel case, and start with an upper case letter (e.g. ``Leader`` or
+ be camel case, and start with an upper-case letter (e.g. ``Leader`` or
``Boats``).
* **Function names** should be verb phrases (as they represent actions), and
command-like function should be imperative. The name should be camel case,
- and start with a lower case letter (e.g. ``openFile()`` or ``isFoo()``).
+ and start with a lower-case letter (e.g. ``openFile()`` or ``isFoo()``).
* **Enum declarations** (e.g. ``enum Foo {...}``) are types, so they should
follow the naming conventions for types. A common use for enums is as a
@@ -1207,7 +1207,7 @@ Assert Liberally
^^^^^^^^^^^^^^^^
Use the "``assert``" macro to its fullest. Check all of your preconditions and
-assumptions, you never know when a bug (not necessarily even yours) might be
+assumptions. You never know when a bug (not necessarily even yours) might be
caught early by an assertion, which reduces debugging time dramatically. The
"``<cassert>``" header file is probably already included by the header files you
are using, so it doesn't cost anything to use it.
@@ -1302,7 +1302,7 @@ preferred to write the code like this:
assert(NewToSet && "The value shouldn't be in the set yet");
In C code where ``[[maybe_unused]]`` is not supported, use ``void`` cast to
-suppress unused variable warning as follows:
+suppress an unused variable warning as follows:
.. code-block:: c
@@ -1546,7 +1546,7 @@ whenever possible.
The semantics of postincrement include making a copy of the value being
incremented, returning it, and then preincrementing the "work value". For
primitive types, this isn't a big deal. But for iterators, it can be a huge
-issue (for example, some iterators contains stack and set objects in them...
+issue (for example, some iterators contain stack and set objects in them...
copying an iterator could invoke the copy ctor's of these as well). In general,
get in the habit of always using preincrement, and you won't have a problem.
@@ -1663,7 +1663,7 @@ Don't Use Braces on Simple Single-Statement Bodies of if/else/loop Statements
When writing the body of an ``if``, ``else``, or for/while loop statement, we
prefer to omit the braces to avoid unnecessary line noise. However, braces
-should be used in cases where the omission of braces harm the readability and
+should be used in cases where the omission of braces harms the readability and
maintainability of the code.
We consider that readability is harmed when omitting the brace in the presence
@@ -1763,7 +1763,7 @@ would help to avoid running into a "dangling else" situation.
handleAttrOnDecl(D, A, i);
}
- // Use braces on the outer block because of a nested `if`; otherwise the
+ // Use braces on the outer block because of a nested `if`; otherwise, the
// compiler would warn: `add explicit braces to avoid dangling else`
if (auto *D = dyn_cast<FunctionDecl>(D)) {
if (shouldProcess(D))
diff --git a/llvm/docs/CommandGuide/index.rst b/llvm/docs/CommandGuide/index.rst
index 88fc1fd..f85f32a 100644
--- a/llvm/docs/CommandGuide/index.rst
+++ b/llvm/docs/CommandGuide/index.rst
@@ -27,6 +27,7 @@ Basic Commands
llvm-dis
llvm-dwarfdump
llvm-dwarfutil
+ llvm-ir2vec
llvm-lib
llvm-libtool-darwin
llvm-link
diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
new file mode 100644
index 0000000..13fe4996
--- /dev/null
+++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -0,0 +1,170 @@
+llvm-ir2vec - IR2Vec Embedding Generation Tool
+==============================================
+
+.. program:: llvm-ir2vec
+
+SYNOPSIS
+--------
+
+:program:`llvm-ir2vec` [*options*] *input-file*
+
+DESCRIPTION
+-----------
+
+:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
+generates IR2Vec embeddings for LLVM IR and supports triplet generation
+for vocabulary training. It provides two main operation modes:
+
+1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+ training from LLVM IR.
+
+2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+ at different granularity levels (instruction, basic block, or function).
+
+The tool is designed to facilitate machine learning applications that work with
+LLVM IR by converting the IR into numerical representations that can be used by
+ML models.
+
+.. note::
+
+ For information about using IR2Vec programmatically within LLVM passes and
+ the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
+ section in the MLGO documentation.
+
+OPERATION MODES
+---------------
+
+Triplet Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
+consisting of opcodes, types, and operands. These triplets can be used to train
+vocabularies for embedding generation.
+
+Usage:
+
+.. code-block:: bash
+
+ llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+
+Embedding Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
+generate numerical embeddings for LLVM IR at different levels of granularity.
+
+Example Usage:
+
+.. code-block:: bash
+
+ llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
+
+OPTIONS
+-------
+
+.. option:: --mode=<mode>
+
+ Specify the operation mode. Valid values are:
+
+ * ``triplets`` - Generate triplets for vocabulary training
+ * ``embeddings`` - Generate embeddings using trained vocabulary (default)
+
+.. option:: --level=<level>
+
+ Specify the embedding generation level. Valid values are:
+
+ * ``inst`` - Generate instruction-level embeddings
+ * ``bb`` - Generate basic block-level embeddings
+ * ``func`` - Generate function-level embeddings (default)
+
+.. option:: --function=<name>
+
+ Process only the specified function instead of all functions in the module.
+
+.. option:: --ir2vec-vocab-path=<path>
+
+ Specify the path to the vocabulary file (required for embedding mode).
+ The vocabulary file should be in JSON format and contain the trained
+ vocabulary for embedding generation. See `llvm/lib/Analysis/models`
+ for pre-trained vocabulary files.
+
+.. option:: --ir2vec-opc-weight=<weight>
+
+ Specify the weight for opcode embeddings (default: 1.0). This controls
+ the relative importance of instruction opcodes in the final embedding.
+
+.. option:: --ir2vec-type-weight=<weight>
+
+ Specify the weight for type embeddings (default: 0.5). This controls
+ the relative importance of type information in the final embedding.
+
+.. option:: --ir2vec-arg-weight=<weight>
+
+ Specify the weight for argument embeddings (default: 0.2). This controls
+ the relative importance of operand information in the final embedding.
+
+.. option:: -o <filename>
+
+ Specify the output filename. Use ``-`` to write to standard output (default).
+
+.. option:: --help
+
+ Print a summary of command line options.
+
+.. note::
+
+ ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
+ ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
+ mode. These options are ignored in triplet mode.
+
+INPUT FILE FORMAT
+-----------------
+
+:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files
+(``.ll``) as input. The input file should contain valid LLVM IR.
+
+OUTPUT FORMAT
+-------------
+
+Triplet Mode Output
+~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, the output consists of lines containing space-separated triplets:
+
+.. code-block:: text
+
+ <opcode> <type> <operand1> <operand2> ...
+
+Each line represents the information of one instruction, with the opcode, type,
+and operands.
+
+Embedding Mode Output
+~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, the output format depends on the specified level:
+
+* **Function Level**: One embedding vector per function
+* **Basic Block Level**: One embedding vector per basic block, grouped by function
+* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function
+
+Each embedding is represented as a floating point vector.
+
+EXIT STATUS
+-----------
+
+:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.
+
+Common failure cases include:
+
+* Invalid or missing input file
+* Missing or invalid vocabulary file (in embedding mode)
+* Specified function not found in the module
+* Invalid command line options
+
+SEE ALSO
+--------
+
+:doc:`../MLGO`
+
+For more information about the IR2Vec algorithm and approach, see:
+`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
diff --git a/llvm/docs/Extensions.rst b/llvm/docs/Extensions.rst
index bad72c6c..d8fb87b 100644
--- a/llvm/docs/Extensions.rst
+++ b/llvm/docs/Extensions.rst
@@ -581,6 +581,26 @@ This section stores pairs of (jump table address, number of entries).
This information is useful for tools that need to statically reconstruct
the control flow of executables.
+``SHT_LLVM_CFI_JUMP_TABLE`` Section (CFI jump table)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+This section contains the instructions that make up a `CFI jump table`_.
+It is expected to be ``SHF_ALLOC`` and may be laid out like a normal
+section. The ``SHT_LLVM_CFI_JUMP_TABLE`` section type gives the linker
+permission to modify the section in ways that would not normally be
+permitted, in order to optimize calls via the jump table.
+
+Each ``sh_entsize`` sized slice of a section of this type containing
+exactly one relocation may be considered to be a jump table entry
+that branches to the target of the relocation. This allows the linker
+to replace the jump table entry with the function body if it is small
+enough, or if the function is the last function in the jump table.
+
+A section of this type does not have to be placed according to its
+name. The linker may place the section in whichever output section it
+sees fit (generally the section that would provide the best locality).
+
+.. _CFI jump table: https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#forward-edge-cfi-for-indirect-function-calls
+
CodeView-Dependent
------------------
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 2759e18..371f356 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -4867,7 +4867,7 @@ to be eliminated. This is because '``poison``' is stronger than '``undef``'.
%D = undef
%E = icmp slt %D, 4
- %F = icmp gte %D, 4
+ %F = icmp sge %D, 4
Safe:
%A = undef
diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
index ed0769b..965a21b 100644
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -468,6 +468,13 @@ The core components are:
Using IR2Vec
------------
+.. note::
+
+ This section describes how to use IR2Vec within LLVM passes. A standalone
+ tool :doc:`CommandGuide/llvm-ir2vec` is available for generating the
+ embeddings and triplets from LLVM IR files, which can be useful for
+ training vocabularies and generating embeddings outside of compiler passes.
+
For generating embeddings, first the vocabulary should be obtained. Then, the
embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
@@ -524,6 +531,10 @@ Further Details
For more detailed information about the IR2Vec algorithm, its parameters, and
advanced usage, please refer to the original paper:
`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
+
+For information about using IR2Vec tool for generating embeddings and
+triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
+
The LLVM source code for ``IR2Vec`` can also be explored to understand the
implementation details.
@@ -595,4 +606,3 @@ optimizations that are currently MLGO-enabled, it may be used as follows:
where the ``name`` is a path fragment. We will expect to find 2 files,
``<name>.in`` (readable, data incoming from the managing process) and
``<name>.out`` (writable, the model runner sends data to the managing process)
-