Diffstat (limited to 'gcc/doc/cppinternals/cppinternals.rst')
-rw-r--r-- | gcc/doc/cppinternals/cppinternals.rst | 284 |
1 files changed, 284 insertions, 0 deletions
diff --git a/gcc/doc/cppinternals/cppinternals.rst b/gcc/doc/cppinternals/cppinternals.rst
new file mode 100644
index 0000000..66d54ae
--- /dev/null
+++ b/gcc/doc/cppinternals/cppinternals.rst
@@ -0,0 +1,284 @@
+..
+  Copyright 1988-2022 Free Software Foundation, Inc.
+  This is part of the GCC manual.
+  For copying conditions, see the copyright.rst file.
+
+.. @smallbook
+   @cropmarks
+   @finalout
+
+.. index:: interface, header files
+
+.. _conventions:
+
+Conventions
+===========
+
+cpplib has two interfaces---one is exposed internally only, and the
+other is for both internal and external use.
+
+The convention is that functions and types that are exposed to multiple
+files internally are prefixed with :samp:`_cpp_`, and are to be found in
+the file :samp:`internal.h`. Functions and types exposed to external
+clients are in :samp:`cpplib.h`, and prefixed with :samp:`cpp_`. For
+historical reasons this is no longer quite true, but we should strive to
+stick to it.
+
+We are striving to reduce the information exposed in :samp:`cpplib.h` to the
+bare minimum necessary, and then to keep it there. This makes clear
+exactly what external clients are entitled to assume, and allows us to
+change internals in the future without worrying whether library clients
+are perhaps relying on some kind of undocumented implementation-specific
+behavior.
+
+.. index:: lexer, newlines, escaped newlines
+
+.. _lexer:
+
+The Lexer
+=========
+
+.. toctree::
+  :maxdepth: 2
+
+  overview
+  lexing-a-token
+  lexing-a-line
+
+.. index:: hash table, identifiers, macros, assertions, named operators
+
+.. _hash-nodes:
+
+Hash Nodes
+==========
+
+When cpplib encounters an 'identifier', it generates a hash code for
+it and stores it in the hash table. By 'identifier' we mean tokens
+with type ``CPP_NAME``; this includes identifiers in the usual C
+sense, as well as keywords, directive names, macro names and so on. For
+example, all of ``pragma``, ``int``, ``foo`` and
+``__GNUC__`` are identifiers and hashed when lexed.
+
+Each node in the hash table contains various information about the
+identifier it represents, for example its length and type. At any one
+time, each identifier falls into exactly one of three categories:
+
+* Macros
+
+  These have been declared to be macros, either on the command line or
+  with ``#define``. A few, such as ``__TIME__``, are built-ins
+  entered in the hash table during initialization. The hash node for a
+  normal macro points to a structure with more information about the
+  macro, such as whether it is function-like, how many arguments it takes,
+  and its expansion. Built-in macros are flagged as special, and instead
+  contain an enum indicating which of the various built-in macros it is.
+
+* Assertions
+
+  Assertions are in a separate namespace to macros. To enforce this, cpp
+  actually prepends a ``#`` character before hashing and entering it in
+  the hash table. An assertion's node points to a chain of answers to
+  that assertion.
+
+* Void
+
+  Everything else falls into this category---an identifier that is not
+  currently a macro, or a macro that has since been undefined with
+  ``#undef``.
+
+  When preprocessing C++, this category also includes the named operators,
+  such as ``xor``. In expressions these behave like the operators they
+  represent, but in contexts where the spelling of a token matters they
+  are spelt differently. This spelling distinction is relevant when they
+  are operands of the stringizing and pasting macro operators ``#`` and
+  ``##``. Named operator hash nodes are flagged, both to catch the
+  spelling distinction and to prevent them from being defined as macros.
+
+The same identifiers share the same hash node. Since each identifier
+token, after lexing, contains a pointer to its hash node, this is used
+to provide rapid lookup of various information. For example, when
+parsing a ``#define`` statement, CPP flags each argument's identifier
+hash node with the index of that argument. This makes duplicated
+argument checking an O(1) operation for each argument. Similarly, for
+each identifier in the macro's expansion, lookup to see if it is an
+argument, and which argument it is, is also an O(1) operation. Further,
+each directive name, such as ``endif``, has an associated directive
+enum stored in its hash node, so that directive lookup is also O(1).
+
+.. index:: macro expansion
+
+.. _macro-expansion:
+
+Macro Expansion Algorithm
+=========================
+
+Macro expansion is a tricky operation, fraught with nasty corner cases
+and situations that render what you thought was a nifty way to
+optimize the preprocessor's expansion algorithm wrong in quite subtle
+ways.
+
+I strongly recommend you have a good grasp of how the C and C++
+standards require macros to be expanded before diving into this
+section, let alone the code! If you don't have a clear mental
+picture of how things like nested macro expansion, stringizing and
+token pasting are supposed to work, damage to your sanity can quickly
+result.
+
+.. toctree::
+  :maxdepth: 2
+
+  internal-representation-of-macros
+  macro-expansion-overview
+  scanning-the-replacement-list-for-macros-to-expand
+  looking-for-a-function-like-macros-opening-parenthesis
+  marking-tokens-ineligible-for-future-expansion
+
+.. index:: paste avoidance, spacing, token spacing
+
+.. _token-spacing:
+
+Token Spacing
+=============
+
+First, consider an issue that only concerns the stand-alone
+preprocessor: there needs to be a guarantee that re-reading its preprocessed
+output results in an identical token stream. Without taking special
+measures, this might not be the case because of macro substitution.
+For example:
+
+.. code-block::
+
+  #define PLUS +
+  #define EMPTY
+  #define f(x) =x=
+
+  PLUS -EMPTY- PLUS+ f(=)
+      → + + - - + + = = =
+  not
+      → ++ -- ++ ===
+
+One solution would be to simply insert a space between all adjacent
+tokens. However, we would like to keep space insertion to a minimum,
+both for aesthetic reasons and because it causes problems for people who
+still try to abuse the preprocessor for things like Fortran source and
+Makefiles.
+
+For now, just notice that when tokens are added (or removed, as shown by
+the ``EMPTY`` example) from the original lexed token stream, we need
+to check for accidental token pasting. We call this :dfn:`paste
+avoidance`. Token addition and removal can only occur because of macro
+expansion, but accidental pasting can occur in many places: both before
+and after each macro replacement, each argument replacement, and
+additionally each token created by the :samp:`#` and :samp:`##` operators.
+
+Look at how the preprocessor gets whitespace output correct
+normally. The ``cpp_token`` structure contains a flags byte, and one
+of those flags is ``PREV_WHITE``. This is flagged by the lexer, and
+indicates that the token was preceded by whitespace of some form other
+than a new line. The stand-alone preprocessor can use this flag to
+decide whether to insert a space between tokens in the output.
+
+Now consider the result of the following macro expansion:
+
+.. code-block::
+
+  #define add(x, y, z) x + y +z;
+  sum = add (1,2, 3);
+      → sum = 1 + 2 +3;
+
+The interesting thing here is that the tokens :samp:`1` and :samp:`2` are
+output with a preceding space, and :samp:`3` is output without a
+preceding space, but when lexed none of these tokens had that property.
+Careful consideration reveals that :samp:`1` gets its preceding
+whitespace from the space preceding :samp:`add` in the macro invocation,
+*not* from the replacement list. :samp:`2` gets its whitespace from the
+space preceding the parameter :samp:`y` in the macro replacement list,
+and :samp:`3` has no preceding space because parameter :samp:`z` has none
+in the replacement list.
+
+Once lexed, tokens are effectively fixed and cannot be altered, since
+pointers to them might be held in many places, in particular by
+in-progress macro expansions. So instead of modifying the two tokens
+above, the preprocessor inserts a special token, which I call a
+:dfn:`padding token`, into the token stream to indicate that spacing of
+the subsequent token is special. The preprocessor inserts padding
+tokens in front of every macro expansion and expanded macro argument.
+These point to a :dfn:`source token` from which the subsequent real token
+should inherit its spacing. In the above example, the source tokens are
+:samp:`add` in the macro invocation, and :samp:`y` and :samp:`z` in the
+macro replacement list, respectively.
+
+It is quite easy to get multiple padding tokens in a row, for example if
+a macro's first replacement token expands straight into another macro.
+
+.. code-block::
+
+  #define foo bar
+  #define bar baz
+  [foo]
+      → [baz]
+
+Here, two padding tokens are generated, whose sources are the :samp:`foo` token
+between the brackets and the :samp:`bar` token from foo's replacement
+list, respectively. Clearly the first padding token is the one to
+use, so the output code should contain a rule that the first
+padding token in a sequence is the one that matters.
+
+But what if a macro expansion is left? Adjusting the above
+example slightly:
+
+.. code-block::
+
+  #define foo bar
+  #define bar EMPTY baz
+  #define EMPTY
+  [foo] EMPTY;
+      → [ baz] ;
+
+As shown, now there should be a space before :samp:`baz` and the
+semicolon in the output.
+
+The rules we decided above fail for :samp:`baz`: we generate three
+padding tokens, one per macro invocation, before the token :samp:`baz`.
+We would then have it take its spacing from the first of these, which
+carries source token :samp:`foo` with no leading space.
+
+It is vital that cpplib get spacing correct in these examples since any
+of these macro expansions could be stringized, where spacing matters.
+
+So, this demonstrates that not just entering macro and argument
+expansions, but leaving them requires special handling too. I made
+cpplib insert a padding token with a ``NULL`` source token when
+leaving macro expansions, as well as after each replaced argument in a
+macro's replacement list. It also inserts appropriate padding tokens on
+either side of tokens created by the :samp:`#` and :samp:`##` operators.
+I expanded the rule so that, if we see a padding token with a
+``NULL`` source token, *and* that source token has no leading
+space, then we behave as if we have seen no padding tokens at all. A
+quick check shows this rule will then get the above example correct as
+well.
+
+Now a relationship with paste avoidance is apparent: we have to be
+careful about paste avoidance in exactly the same locations we have
+padding tokens in order to get white space correct. This makes
+implementation of paste avoidance easy: wherever the stand-alone
+preprocessor is fixing up spacing because of padding tokens, and it
+turns out that no space is needed, it has to take the extra step to
+check that a space is not needed after all to avoid an accidental paste.
+The function ``cpp_avoid_paste`` advises whether a space is required
+between two consecutive tokens. To avoid excessive spacing, it tries
+hard to only require a space if one is likely to be necessary, but for
+reasons of efficiency it is slightly conservative and might recommend a
+space where one is not strictly needed.
+
+.. index:: line numbers
+
+.. _line-numbering:
+
+Line numbering
+==============
+
+.. toctree::
+  :maxdepth: 2
+
+  just-which-line-number-anyway
+  representation-of-line-numbers
\ No newline at end of file