==========================
Clang Transformer Tutorial
==========================

A tutorial on how to write a source-to-source translation tool using Clang
Transformer.

.. contents::
   :local:

What is Clang Transformer?
--------------------------

Clang Transformer is a framework for writing C++ diagnostics and program
transformations. It is built on the clang toolchain and the LibTooling library,
but aims to hide much of the complexity of clang's native, low-level libraries.

The core abstraction of Transformer is the *rewrite rule*, which specifies how
to change a given program pattern into a new form. Here are some examples of
tasks you can achieve with Transformer:

* warn against using the name ``MkX`` for a declared function,
* change ``MkX`` to ``MakeX``, where ``MkX`` is the name of a declared function,
* change ``s.size()`` to ``Size(s)``, where ``s`` is a ``string``,
* collapse ``e.child().m()`` to ``e.m()``, for any expression ``e`` and method
  named ``m``.

All of the examples have a common form: they identify a pattern that is the
target of the transformation, they specify an *edit* to the code identified by
the pattern, and their pattern and edit refer to common variables, like ``s``,
``e``, and ``m``, that range over code fragments. Several of the examples also
specify constraints on the pattern that aren't apparent from the syntax alone,
like "``s`` is a ``string``" in the third example. Even the first example
("warn ...") shares this form, even though it doesn't change any of the code --
its "edit" is simply a no-op.

Transformer helps users succinctly specify rules of this sort and easily execute
them locally over a collection of files, apply them to selected portions of
a codebase, or even bundle them as a clang-tidy check for ongoing application.

Who is Clang Transformer for?
-----------------------------

Clang Transformer is for developers who want to write clang-tidy checks or write
tools to modify a large number of C++ files in (roughly) the same way.
What qualifies as "large" really depends on the nature of the change and your
patience for repetitive editing. In our experience, automated solutions become
worthwhile somewhere between 100 and 500 files.

Getting Started
---------------

Patterns in Transformer are expressed with :doc:`clang's AST matchers `.
Matchers are a language of combinators for describing portions of a clang
Abstract Syntax Tree (AST). Since clang's AST includes complete type information
(within the limits of a single `Translation Unit (TU)`_), these patterns can
even encode rich constraints on the type properties of AST nodes.

.. _`Translation Unit (TU)`: https://en.wikipedia.org/wiki/Translation_unit_\(programming\)

We assume a familiarity with the clang AST and the corresponding AST matchers
for the purpose of this tutorial. Users who are unfamiliar with either are
encouraged to start with the recommended references in `Related Reading`_.

Example: style-checking names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Assume you have a style-guide rule which forbids functions from being named
"MkX" and you want to write a check that catches any violations of this rule. We
can express this as a Transformer rewrite rule:

.. code-block:: c++

   makeRule(functionDecl(hasName("MkX")).bind("fun"),
            noopEdit(node("fun")),
            cat("The name ``MkX`` is not allowed for functions; please rename"));

``makeRule`` is our go-to function for generating rewrite rules. It takes three
arguments: the pattern, the edit, and (optionally) an explanatory note. In our
example, the pattern (``functionDecl(...)``) identifies the declaration of the
function ``MkX``. Since we're just diagnosing the problem, but not suggesting a
fix, our edit is a no-op. But, it contains an *anchor* for the diagnostic
message: ``node("fun")`` says to associate the message with the source range of
the AST node bound to "fun"; in this case, the ill-named function declaration.
Finally, we use ``cat`` to build a message that explains the problem.
Regarding the name ``cat`` -- we'll discuss it in more detail below, but suffice
it to say that it can also take multiple arguments and concatenate their results.

Note that the result of ``makeRule`` is a value of type
``clang::transformer::RewriteRule``, but most users don't need to care about the
details of this type.

Example: renaming a function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now, let's extend this example to a *transformation*; specifically, the second
example above:

.. code-block:: c++

   makeRule(declRefExpr(to(functionDecl(hasName("MkX")))),
            changeTo(cat("MakeX")),
            cat("MkX has been renamed MakeX"));

In this example, the pattern (``declRefExpr(...)``) identifies any *reference*
to the function ``MkX``, rather than the declaration itself, as in our previous
example. Our edit (``changeTo(...)``) says to *change* the code matched by the
pattern *to* the text "MakeX". Finally, we use ``cat`` again to build a message
that explains the change.

Here are some example changes that this rule would make:

+--------------------------+----------------------------+
| Original                 | Result                     |
+==========================+============================+
| ``X x = MkX(3);``        | ``X x = MakeX(3);``        |
+--------------------------+----------------------------+
| ``CallFactory(MkX, 3);`` | ``CallFactory(MakeX, 3);`` |
+--------------------------+----------------------------+
| ``auto f = MkX;``        | ``auto f = MakeX;``        |
+--------------------------+----------------------------+

Example: method to function
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, let's write a rule to replace a method call with a (free) function call,
applied to the original method call's target object. Specifically, "change
``s.size()`` to ``Size(s)``, where ``s`` is a ``string``." We start with a
simpler change that ignores the type of ``s``. That is, it will modify *any*
method call where the method is named "size":

.. code-block:: c++

   llvm::StringRef s = "str";
   makeRule(
     cxxMemberCallExpr(
       on(expr().bind(s)),
       callee(cxxMethodDecl(hasName("size")))),
     changeTo(cat("Size(", node(s), ")")),
     cat("Method ``size`` is deprecated in favor of free function ``Size``"));

We express the pattern with the given AST matcher, which binds the method call's
target to ``s`` [#f1]_. For the edit, we again use ``changeTo``, but this
time we construct the term from multiple parts, which we compose with ``cat``.
The second part of our term is ``node(s)``, which selects the source code
corresponding to the AST node ``s`` that was bound when a match was found in the
AST for our rule's pattern. ``node(s)`` constructs a ``RangeSelector``, which,
when used in ``cat``, indicates that the selected source should be inserted in
the output at that point.

Now, we probably don't want to rewrite *all* invocations of "size" methods, just
those on ``std::string``\ s. We can achieve this change simply by refining our
matcher. The rest of the rule remains unchanged:

.. code-block:: c++

   llvm::StringRef s = "str";
   makeRule(
     cxxMemberCallExpr(
       on(expr(hasType(namedDecl(hasName("std::string"))))
         .bind(s)),
       callee(cxxMethodDecl(hasName("size")))),
     changeTo(cat("Size(", node(s), ")")),
     cat("Method ``size`` is deprecated in favor of free function ``Size``"));

Example: rewriting method calls
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this example, we delete an "intermediary" method call in a string of
invocations. This scenario can arise, for example, if you want to collapse a
substructure into its parent.

.. code-block:: c++

   llvm::StringRef e = "expr", m = "member";
   auto child_call = cxxMemberCallExpr(on(expr().bind(e)),
                                       callee(cxxMethodDecl(hasName("child"))));
   makeRule(cxxMemberCallExpr(on(child_call), callee(memberExpr().bind(m))),
            changeTo(cat(node(e), ".", member(m), "()")),
            cat("``child`` accessor is being removed; call ",
                member(m), " directly on parent"));

This rule isn't quite what we want: it will rewrite ``my_object.child().foo()``
to ``my_object.foo()``, but it will also rewrite ``my_ptr->child().foo()`` to
``my_ptr.foo()``, which is not what we intend. We could fix this by restricting
the pattern with ``not(isArrow())`` in the definition of ``child_call``. Yet, we
*want* to rewrite calls through pointers.

To capture this idiom, we provide the ``access`` combinator to intelligently
construct a field/method access. In our example, the member access is expressed
as:

.. code-block:: c++

   access(e, cat(member(m)))

The first argument specifies the object being accessed and the second, a
description of the field/method name. In this case, we specify that the method
name should be copied from the source -- specifically, the source range of
``m``'s member. To construct the method call, we would use this expression in
``cat``:

.. code-block:: c++

   cat(access(e, cat(member(m))), "()")

Reference: ranges, stencils, edits, rules
-----------------------------------------

The above examples demonstrate just the basics of rewrite rules. Every element
we touched on has more available constructors: range selectors, stencils, edits
and rules. In this section, we'll briefly review each in turn, with references
to the source headers for up-to-date information. First, though, we clarify what
rewrite rules are actually rewriting.

Rewriting ASTs to... Text?
^^^^^^^^^^^^^^^^^^^^^^^^^^

The astute reader may have noticed that we've been somewhat vague in our
explanation of what the rewrite rules are actually rewriting.
We've referred to "code", but code can be represented both as raw source text
and as an abstract syntax tree. So, which one is it?

Ideally, we'd be rewriting the input AST to a new AST, but clang's AST is not
terribly amenable to this kind of transformation. So, we compromise: we express
our patterns and the names that they bind in terms of the AST, but our changes
in terms of source code text. We've designed Transformer's language to bridge
the gap between the two representations, in an attempt to minimize the user's
need to reason about source code locations and other, low-level syntactic
details.

Range Selectors
^^^^^^^^^^^^^^^

Transformer provides a small API for describing source ranges: the
``RangeSelector`` combinators. These ranges are most commonly used to specify
the source code affected by an edit and to extract source code in constructing
new text.

Roughly, there are two kinds of range combinators: ones that select a source
range based on the AST, and others that combine existing ranges into new ranges.
For example, ``node`` selects the range of source spanned by a particular AST
node, as we've seen, while ``after`` selects the (empty) range located
immediately after its argument range. So, ``after(node("id"))`` is the empty
range immediately following the AST node bound to ``id``.

For the full collection of ``RangeSelector``\ s, see the header,
`clang/Tooling/Transformer/RangeSelector.h `_

Stencils
^^^^^^^^

Transformer offers a large and growing collection of combinators for
constructing output. Above, we demonstrated ``cat``, the core function for
constructing stencils. It takes a series of arguments, of three possible kinds:

#. Raw text, to be copied directly to the output.
#. Selector: specified with a ``RangeSelector``, indicates a range of source
   text to copy to the output.
#. Builder: an operation that constructs a code snippet from its arguments. For
   example, the ``access`` function we saw above.
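
As a rough mental model only, the three kinds of ``cat`` arguments can be
pictured with a toy version written in standard C++. Nothing below is part of
the Transformer API: ``ToyNode``, ``ToyMatch``, ``ToyStencil``, ``toyText``,
``toyNode``, ``toyAccess`` and ``toyCat`` are hypothetical names invented for
this sketch, and the "match result" is just a map from bound ids to source
text, whereas the real library works on AST nodes and source ranges.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a bound AST node: its source text plus one
// type property (whether the expression has pointer type).
struct ToyNode {
  std::string Source;
  bool IsPointer;
};

// Hypothetical stand-in for a match result: bound id -> node.
using ToyMatch = std::map<std::string, ToyNode>;

// In this toy model, a stencil is anything that produces text from a match.
using ToyStencil = std::function<std::string(const ToyMatch &)>;

// Kind 1: raw text, copied directly to the output.
ToyStencil toyText(std::string Text) {
  return [Text](const ToyMatch &) { return Text; };
}

// Kind 2: selector, copies the source text bound to an id.
ToyStencil toyNode(std::string Id) {
  return [Id](const ToyMatch &M) { return M.at(Id).Source; };
}

// Kind 3: builder, constructs a snippet from its arguments. This one picks
// "->" or "." depending on whether the base expression is a pointer.
ToyStencil toyAccess(std::string BaseId, ToyStencil Member) {
  return [BaseId, Member](const ToyMatch &M) {
    const ToyNode &Base = M.at(BaseId);
    return Base.Source + (Base.IsPointer ? "->" : ".") + Member(M);
  };
}

// Concatenates the text produced by each part, in order.
ToyStencil toyCat(std::vector<ToyStencil> Parts) {
  return [Parts](const ToyMatch &M) {
    std::string Out;
    for (const ToyStencil &Part : Parts)
      Out += Part(M);
    return Out;
  };
}
```

In this model, ``toyCat({toyAccess("e", toyNode("m")), toyText("()")})`` plays
the role of ``cat(access(e, cat(member(m))), "()")``: evaluated against a match
in which ``e`` is a pointer, it keeps the arrow rather than emitting a dot.
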

Data of these different types are all represented (generically) by a
``Stencil``. ``cat`` takes text and ``RangeSelector``\ s directly as arguments,
rather than requiring that they be constructed with a builder; other builders
are constructed explicitly.

In general, ``Stencil``\ s produce text from a match result. So, they are not
limited to generating source code, but can also be used to generate diagnostic
messages that reference (named) elements of the matched code, like we saw in the
example of rewriting method calls.

Further details of the ``Stencil`` type are documented in the header file
`clang/Tooling/Transformer/Stencil.h `_.

Edits
^^^^^

Transformer supports additional forms of edits. First, in a ``changeTo``, we can
specify the particular portion of code to be replaced, using the same
``RangeSelector`` we saw earlier. For example, we could change the function name
in a function declaration with:

.. code-block:: c++

   makeRule(functionDecl(hasName("bad")).bind(f),
            changeTo(name(f), cat("good")),
            cat("bad is now good"));

We also provide simpler editing primitives for insertion and deletion:
``insertBefore``, ``insertAfter`` and ``remove``. These can all be found in the
header file `clang/Tooling/Transformer/RewriteRule.h `_.

We are not limited to one edit per match found. Some situations require making
multiple edits for each match. For example, suppose we wanted to swap two
arguments of a function call.

For this, we provide an overload of ``makeRule`` that takes a list of edits,
rather than just a single one. Our example might look like:

.. code-block:: c++

   makeRule(callExpr(...),
            {changeTo(node(arg0), cat(node(arg2))),
             changeTo(node(arg2), cat(node(arg0)))},
            cat("swap the first and third arguments of the call"));

``EditGenerator``\ s (Advanced)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The particular edits we've seen so far are all instances of the ``ASTEdit``
class, or a list of such. But, not all edits can be expressed as ``ASTEdit``\ s.
So, we also support a very general signature for edit generators:

.. code-block:: c++

   using EditGenerator = MatchConsumer<llvm::SmallVector<Edit, 1>>;

That is, an ``EditGenerator`` is a function that maps a ``MatchResult`` to a set
of edits, or fails. This signature supports a very general form of computation
over match results. Transformer provides a number of functions for working with
``EditGenerator``\ s, most notably to `flatten `_ ``EditGenerator``\ s, much
like list flattening. For the full list, see the header file
`clang/Tooling/Transformer/RewriteRule.h `_.

Rules
^^^^^

We can also compose multiple *rules*, rather than just edits within a rule,
using ``applyFirst``: it composes a list of rules as an ordered choice, where
Transformer applies the first rule whose pattern matches, ignoring others in the
list that follow. If the matchers are independent then order doesn't matter. In
that case, ``applyFirst`` is simply joining the set of rules into one.

The benefit of ``applyFirst`` is that, for some problems, it allows the user to
more concisely formulate later rules in the list, since their patterns need not
explicitly exclude the earlier patterns of the list. For example, consider a set
of rules that rewrite compound statements, where one rule handles the case of an
empty compound statement and the other handles non-empty compound statements.
With ``applyFirst``, these rules can be expressed compactly as:

.. code-block:: c++

   applyFirst({
     makeRule(compoundStmt(statementCountIs(0)).bind("empty"), ...),
     makeRule(compoundStmt().bind("non-empty"),...)
   })

The second rule does not need to explicitly specify that the compound statement
is non-empty -- it follows from the rule's position in ``applyFirst``. For more
complicated examples, this can lead to substantially more readable code.

Sometimes, a modification to the code might require the inclusion of a
particular header file. To this end, users can modify rules to specify include
directives with ``addInclude``.
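
To make the ordered-choice behavior of ``applyFirst`` described earlier in this
section more concrete, here is a toy model in standard C++. It is only a sketch
of the idea, not the Transformer implementation: ``ToyRule``,
``toyApplyFirst``, ``toyEmptyBlockRule`` and ``toyAnyBlockRule`` are
hypothetical names, and "matching" is a plain string test standing in for an
AST matcher.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// A toy rule: yields replacement text when its "pattern" matches the input,
// and std::nullopt when it does not.
using ToyRule =
    std::function<std::optional<std::string>(const std::string &)>;

// Ordered choice: try each rule in turn and stop at the first that matches.
ToyRule toyApplyFirst(std::vector<ToyRule> Rules) {
  return [Rules](const std::string &Input) -> std::optional<std::string> {
    for (const ToyRule &Rule : Rules)
      if (std::optional<std::string> Result = Rule(Input))
        return Result;
    return std::nullopt;
  };
}

// Matches only the empty block "{}".
ToyRule toyEmptyBlockRule() {
  return [](const std::string &Input) -> std::optional<std::string> {
    if (Input == "{}")
      return std::string("/* empty block */");
    return std::nullopt;
  };
}

// Matches any block, empty or not. It never has to exclude "{}" itself when
// it is listed after toyEmptyBlockRule().
ToyRule toyAnyBlockRule() {
  return [](const std::string &Input) -> std::optional<std::string> {
    if (!Input.empty() && Input.front() == '{')
      return std::string("/* non-empty block */");
    return std::nullopt;
  };
}
```

With the rules listed as ``{toyEmptyBlockRule(), toyAnyBlockRule()}``, the
input ``{}`` is handled by the first rule and every other block by the second.
Reversing the list would let the general rule shadow the specific one, which is
why order matters whenever the patterns overlap.
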
For additional documentation on these functions, see the header file
`clang/Tooling/Transformer/RewriteRule.h `_.

Using a RewriteRule as a clang-tidy check
-----------------------------------------

Transformer supports executing a rewrite rule as a `clang-tidy `_ check, with
the class ``clang::tidy::utils::TransformerClangTidyCheck``. It is designed to
require minimal code in the definition. For example, given a rule
``MyCheckAsRewriteRule``, one can define a tidy check as follows:

.. code-block:: c++

   class MyCheck : public TransformerClangTidyCheck {
    public:
     MyCheck(StringRef Name, ClangTidyContext *Context)
         : TransformerClangTidyCheck(MyCheckAsRewriteRule, Name, Context) {}
   };

``TransformerClangTidyCheck`` implements the virtual ``registerMatchers`` and
``check`` methods based on your rule specification, so you don't need to
implement them yourself. If the rule needs to be configured based on the
language options and/or the clang-tidy configuration, it can be expressed as a
function taking these as parameters and (optionally) returning a
``RewriteRule``. This would be useful, for example, for our method-renaming
rule, which is parameterized by the original name and the target. For details,
see `clang-tools-extra/clang-tidy/utils/TransformerClangTidyCheck.h `_

Related Reading
---------------

A good place to start understanding the clang AST and its matchers is with the
introductions on clang's site:

* :doc:`Introduction to the Clang AST `
* :doc:`Matching the Clang AST `
* `AST Matcher Reference `_

.. rubric:: Footnotes

.. [#f1] Technically, it binds it to the string "str", to which our
    variable ``s`` is bound. But, the choice of that id string is
    irrelevant, so we elide the difference.