6 files changed, 226 insertions, 25 deletions
diff --git a/llvm/docs/CodingStandards.rst b/llvm/docs/CodingStandards.rst
index c614a6d..732227b 100644
--- a/llvm/docs/CodingStandards.rst
+++ b/llvm/docs/CodingStandards.rst
@@ -30,7 +30,7 @@ because the naming and other conventions are dictated by the C++ standard.
 
 There are some conventions that are not uniformly followed in the code base
 (e.g. the naming convention).  This is because they are relatively new, and a
-lot of code was written before they were put in place.  Our long term goal is
+lot of code was written before they were put in place.  Our long-term goal is
 for the entire codebase to follow the convention, but we explicitly *do not*
 want patches that do large-scale reformatting of existing code.  On the other
 hand, it is reasonable to rename the methods of a class if you're about to
@@ -50,7 +50,7 @@ code imported into the tree. Generally, our preference is for standards
 conforming, modern, and portable C++ code as the implementation language of
 choice.
 
-For automation, build-systems and utility scripts Python is preferred and
+For automation, build systems, and utility scripts, Python is preferred and
 is widely used in the LLVM repository already.
 
 C++ Standard Versions
@@ -92,7 +92,7 @@ LLVM support libraries (for example, `ADT
 <https://github.com/llvm/llvm-project/tree/main/llvm/include/llvm/ADT>`_)
 implement specialized data structures or functionality missing in the standard
 library. Such libraries are usually implemented in the ``llvm`` namespace and
-follow the expected standard interface, when there is one.
+follow the expected standard interface when there is one.
 
 When both C++ and the LLVM support libraries provide similar functionality, and
 there isn't a specific reason to favor the C++ implementation, it is generally
@@ -325,8 +325,8 @@ implementation file.  In any case, implementation files can include additional
 comments (not necessarily in Doxygen markup) to explain implementation details
 as needed.
 
-Don't duplicate function or class name at the beginning of the comment.
-For humans it is obvious which function or class is being documented;
+Don't duplicate the function or class name at the beginning of the comment.
+For humans, it is obvious which function or class is being documented;
 automatic documentation processing tools are smart enough to bind the comment
 to the correct declaration.
 
@@ -369,7 +369,7 @@ lower-case letter, and finish the last sentence without a period, if it would
 end in one otherwise. Sentences which end with different punctuation, such as
 "did you forget ';'?", should still do so.
 
-For example this is a good error message:
+For example, this is a good error message:
 
 .. code-block:: none
 
@@ -443,7 +443,7 @@ Write your code to fit within 80 columns.
 There must be some limit to the width of the code in
 order to allow developers to have multiple files side-by-side in
 windows on a modest display.  If you are going to pick a width limit, it is
-somewhat arbitrary but you might as well pick something standard.  Going with 90
+somewhat arbitrary, but you might as well pick something standard.  Going with 90
 columns (for example) instead of 80 columns wouldn't add any significant value
 and would be detrimental to printing out code.  Also many other projects have
 standardized on 80 columns, so some people have already configured their editors
@@ -520,7 +520,7 @@ within each other and within function calls in order to build up aggregates
 The historically common formatting of braced initialization of aggregate
 variables does not mix cleanly with deep nesting, general expression contexts,
 function arguments, and lambdas. We suggest new code use a simple rule for
-formatting braced initialization lists: act as-if the braces were parentheses
+formatting braced initialization lists: act as if the braces were parentheses
 in a function call. The formatting rules exactly match those already well
 understood for formatting nested function calls. Examples:
 
@@ -607,11 +607,11 @@ Static constructors and destructors (e.g., global variables whose types have a
 constructor or destructor) should not be added to the code base, and should be
 removed wherever possible.
 
-Globals in different source files are initialized in `arbitrary order
+Globals in different source files are initialized in an `arbitrary order
 <https://yosefk.com/c++fqa/ctors.html#fqa-10.12>`_, making the code more
 difficult to reason about.
 
-Static constructors have negative impact on launch time of programs that use
+Static constructors have a negative impact on the launch time of programs that use
 LLVM as a library. We would really like for there to be zero cost for linking
 in an additional LLVM target or other library into an application, but static
 constructors undermine this goal.
@@ -698,7 +698,7 @@ If you use a braced initializer list when initializing a variable, use an equals
 Use ``auto`` Type Deduction to Make Code More Readable
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Some are advocating a policy of "almost always ``auto``" in C++11, however LLVM
+Some are advocating a policy of "almost always ``auto``" in C++11; however, LLVM
 uses a more moderate stance. Use ``auto`` if and only if it makes the code more
 readable or easier to maintain. Don't "almost always" use ``auto``, but do use
 ``auto`` with initializers like ``cast<Foo>(...)`` or other places where the
@@ -783,14 +783,14 @@ guards, and might not include their prerequisites. Name such files with the
 
 In general, a header should be implemented by one or more ``.cpp`` files.  Each
 of these ``.cpp`` files should include the header that defines their interface
-first.  This ensures that all of the dependences of the header have been
+first.  This ensures that all of the dependencies of the header have been
 properly added to the header itself, and are not implicit.  System headers
 should be included after user headers for a translation unit.
 
 Library Layering
 ^^^^^^^^^^^^^^^^
 
-A directory of header files (for example ``include/llvm/Foo``) defines a
+A directory of header files (for example, ``include/llvm/Foo``) defines a
 library (``Foo``). One library (both
 its headers and implementation) should only use things from the libraries
 listed in its dependencies.
@@ -822,7 +822,7 @@ especially in header files.
 
 But wait! Sometimes you need to have the definition of a class to use it, or to
 inherit from it.  In these cases go ahead and ``#include`` that header file.  Be
-aware however that there are many cases where you don't need to have the full
+aware, however, that there are many cases where you don't need to have the full
 definition of a class.  If you are using a pointer or reference to a class, you
 don't need the header file.  If you are simply returning a class instance from a
 prototyped function or method, you don't need it.  In fact, for most cases, you
@@ -970,7 +970,7 @@ loops.  A silly example is something like this:
 When you have very, very small loops, this sort of structure is fine. But if it
 exceeds more than 10-15 lines, it becomes difficult for people to read and
 understand at a glance. The problem with this sort of code is that it gets very
-nested very quickly. Meaning that the reader of the code has to keep a lot of
+nested very quickly. This means that the reader of the code has to keep a lot of
 context in their brain to remember what is going immediately on in the loop,
 because they don't know if/when the ``if`` conditions will have ``else``\s etc.
 It is strongly preferred to structure the loop like this:
@@ -988,7 +988,7 @@ It is strongly preferred to structure the loop like this:
     ...
   }
 
-This has all the benefits of using early exits for functions: it reduces nesting
+This has all the benefits of using early exits for functions: it reduces the nesting
 of the loop, it makes it easier to describe why the conditions are true, and it
 makes it obvious to the reader that there is no ``else`` coming up that they
 have to push context into their brain for.  If a loop is large, this can be a
@@ -1149,12 +1149,12 @@ In general, names should be in camel case (e.g. ``TextFileReader`` and
   nouns and start with an upper-case letter (e.g. ``TextFileReader``).
 
 * **Variable names** should be nouns (as they represent state).  The name should
-  be camel case, and start with an upper case letter (e.g. ``Leader`` or
+  be camel case, and start with an upper-case letter (e.g. ``Leader`` or
   ``Boats``).
 
 * **Function names** should be verb phrases (as they represent actions), and
   command-like function should be imperative.  The name should be camel case,
-  and start with a lower case letter (e.g. ``openFile()`` or ``isFoo()``).
+  and start with a lower-case letter (e.g. ``openFile()`` or ``isFoo()``).
 
 * **Enum declarations** (e.g. ``enum Foo {...}``) are types, so they should
   follow the naming conventions for types.  A common use for enums is as a
@@ -1207,7 +1207,7 @@ Assert Liberally
 ^^^^^^^^^^^^^^^^
 
 Use the "``assert``" macro to its fullest.  Check all of your preconditions and
-assumptions, you never know when a bug (not necessarily even yours) might be
+assumptions.  You never know when a bug (not necessarily even yours) might be
 caught early by an assertion, which reduces debugging time dramatically.  The
 "``<cassert>``" header file is probably already included by the header files you
 are using, so it doesn't cost anything to use it.
@@ -1302,7 +1302,7 @@ preferred to write the code like this:
   assert(NewToSet && "The value shouldn't be in the set yet");
 
 In C code where ``[[maybe_unused]]`` is not supported, use ``void`` cast to
-suppress unused variable warning as follows:
+suppress an unused variable warning as follows:
 
 .. code-block:: c
 
@@ -1546,7 +1546,7 @@ whenever possible.
 The semantics of postincrement include making a copy of the value being
 incremented, returning it, and then preincrementing the "work value".  For
 primitive types, this isn't a big deal. But for iterators, it can be a huge
-issue (for example, some iterators contains stack and set objects in them...
+issue (for example, some iterators contain stack and set objects in them...
 copying an iterator could invoke the copy ctor's of these as well).  In general,
 get in the habit of always using preincrement, and you won't have a problem.
 
@@ -1663,7 +1663,7 @@ Don't Use Braces on Simple Single-Statement Bodies of if/else/loop Statements
 
 When writing the body of an ``if``, ``else``, or for/while loop statement, we
 prefer to omit the braces to avoid unnecessary line noise. However, braces
-should be used in cases where the omission of braces harm the readability and
+should be used in cases where the omission of braces harms the readability and
 maintainability of the code.
 
 We consider that readability is harmed when omitting the brace in the presence
@@ -1763,7 +1763,7 @@ would help to avoid running into a "dangling else" situation.
         handleAttrOnDecl(D, A, i);
   }
 
-  // Use braces on the outer block because of a nested `if`; otherwise the
+  // Use braces on the outer block because of a nested `if`; otherwise, the
   // compiler would warn: `add explicit braces to avoid dangling else`
   if (auto *D = dyn_cast<FunctionDecl>(D)) {
     if (shouldProcess(D))
diff --git a/llvm/docs/CommandGuide/index.rst b/llvm/docs/CommandGuide/index.rst
index 88fc1fd..f85f32a 100644
--- a/llvm/docs/CommandGuide/index.rst
+++ b/llvm/docs/CommandGuide/index.rst
@@ -27,6 +27,7 @@ Basic Commands
    llvm-dis
    llvm-dwarfdump
    llvm-dwarfutil
+   llvm-ir2vec
    llvm-lib
    llvm-libtool-darwin
    llvm-link
diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
new file mode 100644
index 0000000..13fe4996
--- /dev/null
+++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -0,0 +1,170 @@
+llvm-ir2vec - IR2Vec Embedding Generation Tool
+==============================================
+
+.. program:: llvm-ir2vec
+
+SYNOPSIS
+--------
+
+:program:`llvm-ir2vec` [*options*] *input-file*
+
+DESCRIPTION
+-----------
+
+:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
+generates IR2Vec embeddings for LLVM IR and supports triplet generation
+for vocabulary training. It provides two main operation modes:
+
+1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+   training from LLVM IR.
+
+2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+   at different granularity levels (instruction, basic block, or function).
+
+The tool is designed to facilitate machine learning applications that work with
+LLVM IR by converting the IR into numerical representations that can be used by
+ML models.
+
+.. note::
+
+   For information about using IR2Vec programmatically within LLVM passes and
+   the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
+   section in the MLGO documentation.
+
+OPERATION MODES
+---------------
+
+Triplet Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
+consisting of opcodes, types, and operands. These triplets can be used to train
+vocabularies for embedding generation.
+
+Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+
+Embedding Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
+generate numerical embeddings for LLVM IR at different levels of granularity.
+
+Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
+
+OPTIONS
+-------
+
+.. option:: --mode=<mode>
+
+ Specify the operation mode. Valid values are:
+
+ * ``triplets`` - Generate triplets for vocabulary training
+ * ``embeddings`` - Generate embeddings using a trained vocabulary (default)
+
+.. option:: --level=<level>
+
+ Specify the embedding generation level. Valid values are:
+
+ * ``inst`` - Generate instruction-level embeddings
+ * ``bb`` - Generate basic block-level embeddings
+ * ``func`` - Generate function-level embeddings (default)
+
+.. option:: --function=<name>
+
+ Process only the specified function instead of all functions in the module.
+
+.. option:: --ir2vec-vocab-path=<path>
+
+ Specify the path to the vocabulary file (required for embedding mode).
+ The vocabulary file should be in JSON format and contain the trained
+ vocabulary for embedding generation. See ``llvm/lib/Analysis/models``
+ for pre-trained vocabulary files.
+
+.. option:: --ir2vec-opc-weight=<weight>
+
+ Specify the weight for opcode embeddings (default: 1.0). This controls
+ the relative importance of instruction opcodes in the final embedding.
+
+.. option:: --ir2vec-type-weight=<weight>
+
+ Specify the weight for type embeddings (default: 0.5). This controls
+ the relative importance of type information in the final embedding.
+
+.. option:: --ir2vec-arg-weight=<weight>
+
+ Specify the weight for argument embeddings (default: 0.2). This controls
+ the relative importance of operand information in the final embedding.
+
+.. option:: -o <filename>
+
+ Specify the output filename. Use ``-`` to write to standard output (default).
+
+.. option:: --help
+
+ Print a summary of command line options.
+
+.. note::
+
+   ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
+   ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` apply only to embedding
+   mode and are ignored in triplet mode.
+
+INPUT FILE FORMAT
+-----------------
+
+:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files
+(``.ll``) as input. The input file should contain valid LLVM IR.
+
+OUTPUT FORMAT
+-------------
+
+Triplet Mode Output
+~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, the output consists of lines containing space-separated triplets:
+
+.. code-block:: text
+
+   <opcode> <type> <operand1> <operand2> ...
+
+Each line describes one instruction: its opcode, followed by its type and
+its operands.
+
+Embedding Mode Output
+~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, the output format depends on the specified level:
+
+* **Function Level**: One embedding vector per function
+* **Basic Block Level**: One embedding vector per basic block, grouped by function
+* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function
+
+Each embedding is represented as a floating-point vector.
+
+EXIT STATUS
+-----------
+
+:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.
+
+Common failure cases include:
+
+* Invalid or missing input file
+* Missing or invalid vocabulary file (in embedding mode)
+* Specified function not found in the module
+* Invalid command line options
+
+SEE ALSO
+--------
+
+:doc:`../MLGO`
+
+For more information about the IR2Vec algorithm and approach, see:
+`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
diff --git a/llvm/docs/Extensions.rst b/llvm/docs/Extensions.rst
index bad72c6c..d8fb87b 100644
--- a/llvm/docs/Extensions.rst
+++ b/llvm/docs/Extensions.rst
@@ -581,6 +581,26 @@ This section stores pairs of (jump table address, number of entries).
 This information is useful for tools that need to statically reconstruct
 the control flow of executables.
 
+``SHT_LLVM_CFI_JUMP_TABLE`` Section (CFI jump table)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+This section contains the instructions that make up a `CFI jump table`_.
+It is expected to be ``SHF_ALLOC`` and may be laid out like a normal
+section. The ``SHT_LLVM_CFI_JUMP_TABLE`` section type gives the linker
+permission to modify the section in ways that would not normally be
+permitted, in order to optimize calls via the jump table.
+
+Each ``sh_entsize``-sized slice of a section of this type containing
+exactly one relocation may be considered to be a jump table entry
+that branches to the target of the relocation. This allows the linker
+to replace the jump table entry with the function body if it is small
+enough, or if the function is the last function in the jump table.
+
+A section of this type does not have to be placed according to its
+name. The linker may place the section in whichever output section it
+sees fit (generally the section that would provide the best locality).
+
+.. _CFI jump table: https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#forward-edge-cfi-for-indirect-function-calls
+
 CodeView-Dependent
 ------------------
 
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 2759e18..371f356 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -4867,7 +4867,7 @@ to be eliminated. This is because '``poison``' is stronger than '``undef``'.
 
       %D = undef
       %E = icmp slt %D, 4
-      %F = icmp gte %D, 4
+      %F = icmp sge %D, 4
 
     Safe:
       %A = undef
diff --git a/llvm/docs/MLGO.rst b/llvm/docs/MLGO.rst
index ed0769b..965a21b 100644
--- a/llvm/docs/MLGO.rst
+++ b/llvm/docs/MLGO.rst
@@ -468,6 +468,13 @@ The core components are:
 Using IR2Vec
 ------------
 
+.. note::
+
+   This section describes how to use IR2Vec within LLVM passes. A standalone
+   tool, :doc:`CommandGuide/llvm-ir2vec`, is available for generating
+   embeddings and triplets from LLVM IR files, which can be useful for
+   training vocabularies and generating embeddings outside of compiler passes.
+
 For generating embeddings, first the vocabulary should be obtained. Then, the 
 embeddings can be computed and accessed via an ``ir2vec::Embedder`` instance.
 
@@ -524,6 +531,10 @@ Further Details
 For more detailed information about the IR2Vec algorithm, its parameters, and
 advanced usage, please refer to the original paper:
 `IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.
+
+For information about using the ``llvm-ir2vec`` tool to generate embeddings
+and triplets from LLVM IR, see :doc:`CommandGuide/llvm-ir2vec`.
+
 The LLVM source code for ``IR2Vec`` can also be explored to understand the 
 implementation details.
 
@@ -595,4 +606,3 @@ optimizations that are currently MLGO-enabled, it may be used as follows:
 where the ``name`` is a path fragment. We will expect to find 2 files,
 ``<name>.in`` (readable, data incoming from the managing process) and
 ``<name>.out`` (writable, the model runner sends data to the managing process)
-