Diffstat (limited to 'llvm/docs/CommandGuide/llvm-ir2vec.rst')
-rw-r--r-- | llvm/docs/CommandGuide/llvm-ir2vec.rst | 170 |
1 files changed, 170 insertions, 0 deletions
diff --git a/llvm/docs/CommandGuide/llvm-ir2vec.rst b/llvm/docs/CommandGuide/llvm-ir2vec.rst
new file mode 100644
index 0000000..13fe4996
--- /dev/null
+++ b/llvm/docs/CommandGuide/llvm-ir2vec.rst
@@ -0,0 +1,170 @@
+llvm-ir2vec - IR2Vec Embedding Generation Tool
+==============================================
+
+.. program:: llvm-ir2vec
+
+SYNOPSIS
+--------
+
+:program:`llvm-ir2vec` [*options*] *input-file*
+
+DESCRIPTION
+-----------
+
+:program:`llvm-ir2vec` is a standalone command-line tool for IR2Vec. It
+generates IR2Vec embeddings for LLVM IR and supports triplet generation
+for vocabulary training. It provides two main operation modes:
+
+1. **Triplet Mode**: Generates triplets (opcode, type, operands) for vocabulary
+   training from LLVM IR.
+
+2. **Embedding Mode**: Generates IR2Vec embeddings using a trained vocabulary
+   at different granularity levels (instruction, basic block, or function).
+
+The tool is designed to facilitate machine learning applications that work with
+LLVM IR by converting the IR into numerical representations that can be used by
+ML models.
+
+.. note::
+
+   For information about using IR2Vec programmatically within LLVM passes and
+   the C++ API, see the `IR2Vec Embeddings <https://llvm.org/docs/MLGO.html#ir2vec-embeddings>`_
+   section in the MLGO documentation.
+
+OPERATION MODES
+---------------
+
+Triplet Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, :program:`llvm-ir2vec` analyzes LLVM IR and extracts triplets
+consisting of opcodes, types, and operands. These triplets can be used to train
+vocabularies for embedding generation.
+
+Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=triplets input.bc -o triplets.txt
+
+Embedding Generation Mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, :program:`llvm-ir2vec` uses a pre-trained vocabulary to
+generate numerical embeddings for LLVM IR at different levels of granularity.
+
+Example Usage:
+
+.. code-block:: bash
+
+   llvm-ir2vec --mode=embeddings --ir2vec-vocab-path=vocab.json --level=func input.bc -o embeddings.txt
+
+OPTIONS
+-------
+
+.. option:: --mode=<mode>
+
+ Specify the operation mode. Valid values are:
+
+ * ``triplets`` - Generate triplets for vocabulary training
+ * ``embeddings`` - Generate embeddings using trained vocabulary (default)
+
+.. option:: --level=<level>
+
+ Specify the embedding generation level. Valid values are:
+
+ * ``inst`` - Generate instruction-level embeddings
+ * ``bb`` - Generate basic block-level embeddings
+ * ``func`` - Generate function-level embeddings (default)
+
+.. option:: --function=<name>
+
+ Process only the specified function instead of all functions in the module.
+
+.. option:: --ir2vec-vocab-path=<path>
+
+ Specify the path to the vocabulary file (required for embedding mode).
+ The vocabulary file should be in JSON format and contain the trained
+ vocabulary for embedding generation. See `llvm/lib/Analysis/models`
+ for pre-trained vocabulary files.
+
+.. option:: --ir2vec-opc-weight=<weight>
+
+ Specify the weight for opcode embeddings (default: 1.0). This controls
+ the relative importance of instruction opcodes in the final embedding.
+
+.. option:: --ir2vec-type-weight=<weight>
+
+ Specify the weight for type embeddings (default: 0.5). This controls
+ the relative importance of type information in the final embedding.
+
+.. option:: --ir2vec-arg-weight=<weight>
+
+ Specify the weight for argument embeddings (default: 0.2). This controls
+ the relative importance of operand information in the final embedding.
+
+.. option:: -o <filename>
+
+ Specify the output filename. Use ``-`` to write to standard output (default).
+
+.. option:: --help
+
+ Print a summary of command line options.
+
+.. note::
+
+   ``--level``, ``--function``, ``--ir2vec-vocab-path``, ``--ir2vec-opc-weight``,
+   ``--ir2vec-type-weight``, and ``--ir2vec-arg-weight`` are only used in embedding
+   mode. These options are ignored in triplet mode.
+
+INPUT FILE FORMAT
+-----------------
+
+:program:`llvm-ir2vec` accepts LLVM bitcode files (``.bc``) and LLVM IR files
+(``.ll``) as input. The input file should contain valid LLVM IR.
+
+OUTPUT FORMAT
+-------------
+
+Triplet Mode Output
+~~~~~~~~~~~~~~~~~~~
+
+In triplet mode, the output consists of lines containing space-separated triplets:
+
+.. code-block:: text
+
+   <opcode> <type> <operand1> <operand2> ...
+
+Each line represents the information of one instruction, with the opcode, type,
+and operands.
+
+Embedding Mode Output
+~~~~~~~~~~~~~~~~~~~~~
+
+In embedding mode, the output format depends on the specified level:
+
+* **Function Level**: One embedding vector per function
+* **Basic Block Level**: One embedding vector per basic block, grouped by function
+* **Instruction Level**: One embedding vector per instruction, grouped by basic block and function
+
+Each embedding is represented as a floating point vector.
+
+EXIT STATUS
+-----------
+
+:program:`llvm-ir2vec` returns 0 on success, and a non-zero value on failure.
+
+Common failure cases include:
+
+* Invalid or missing input file
+* Missing or invalid vocabulary file (in embedding mode)
+* Specified function not found in the module
+* Invalid command line options
+
+SEE ALSO
+--------
+
+:doc:`../MLGO`
+
+For more information about the IR2Vec algorithm and approach, see:
+`IR2Vec: LLVM IR Based Scalable Program Embeddings <https://doi.org/10.1145/3418463>`_.