Diffstat (limited to 'gcc/doc/cppinternals/cppinternals.rst')
-rw-r--r-- gcc/doc/cppinternals/cppinternals.rst 284
1 files changed, 284 insertions, 0 deletions
diff --git a/gcc/doc/cppinternals/cppinternals.rst b/gcc/doc/cppinternals/cppinternals.rst
new file mode 100644
index 0000000..66d54ae
--- /dev/null
+++ b/gcc/doc/cppinternals/cppinternals.rst
@@ -0,0 +1,284 @@
+..
+ Copyright 1988-2022 Free Software Foundation, Inc.
+ This is part of the GCC manual.
+ For copying conditions, see the copyright.rst file.
+
+.. @smallbook
+ @cropmarks
+ @finalout
+
+.. index:: interface, header files
+
+.. _conventions:
+
+Conventions
+===========
+
+cpplib has two interfaces---one is exposed internally only, and the
+other is for both internal and external use.
+
+The convention is that functions and types that are exposed to multiple
+files internally are prefixed with :samp:`_cpp_`, and are to be found in
+the file :samp:`internal.h`. Functions and types exposed to external
+clients are in :samp:`cpplib.h`, and prefixed with :samp:`cpp_`. For
+historical reasons this is no longer quite true, but we should strive to
+stick to it.
+
+We are striving to reduce the information exposed in :samp:`cpplib.h` to the
+bare minimum necessary, and then to keep it there. This makes clear
+exactly what external clients are entitled to assume, and allows us to
+change internals in the future without worrying whether library clients
+are perhaps relying on some kind of undocumented implementation-specific
+behavior.
+
+.. index:: lexer, newlines, escaped newlines
+
+.. _lexer:
+
+The Lexer
+=========
+
+.. toctree::
+ :maxdepth: 2
+
+ overview
+ lexing-a-token
+ lexing-a-line
+
+.. index:: hash table, identifiers, macros, assertions, named operators
+
+.. _hash-nodes:
+
+Hash Nodes
+==========
+
+When cpplib encounters an 'identifier', it generates a hash code for
+it and stores it in the hash table. By 'identifier' we mean tokens
+with type ``CPP_NAME``; this includes identifiers in the usual C
+sense, as well as keywords, directive names, macro names and so on. For
+example, all of ``pragma``, ``int``, ``foo`` and
+``__GNUC__`` are identifiers and hashed when lexed.
+
+Each node in the hash table contains various information about the
+identifier it represents, such as its length and type. At any one
+time, each identifier falls into exactly one of three categories:
+
+* Macros
+
+ These have been declared to be macros, either on the command line or
+ with ``#define``. A few, such as ``__TIME__``, are built-ins
+ entered in the hash table during initialization. The hash node for a
+ normal macro points to a structure with more information about the
+ macro, such as whether it is function-like, how many arguments it takes,
+ and its expansion. A built-in macro is flagged as special, and its node
+ instead contains an enum indicating which of the built-in macros it is.
+
+* Assertions
+
+ Assertions are in a separate namespace from macros. To enforce this, cpp
+ prepends a ``#`` character to the assertion's name before hashing it and
+ entering it in the hash table. An assertion's node points to a chain of
+ answers to that assertion.
+
+* Void
+
+ Everything else falls into this category---an identifier that is not
+ currently a macro, or a macro that has since been undefined with
+ ``#undef``.
+
+ When preprocessing C++, this category also includes the named operators,
+ such as ``xor``. In expressions these behave like the operators they
+ represent, but in contexts where the spelling of a token matters they
+ are spelt differently. This spelling distinction is relevant when they
+ are operands of the stringizing and pasting macro operators ``#`` and
+ ``##``. Named operator hash nodes are flagged, both to catch the
+ spelling distinction and to prevent them from being defined as macros.
+
+All occurrences of the same identifier share a single hash node. Since
+each identifier token, after lexing, contains a pointer to its hash node,
+information stored in the node can be looked up rapidly. For example, when
+parsing a ``#define`` directive, CPP flags each argument's identifier
+hash node with the index of that argument. This makes duplicated
+argument checking an O(1) operation for each argument. Similarly, for
+each identifier in the macro's expansion, lookup to see if it is an
+argument, and which argument it is, is also an O(1) operation. Further,
+each directive name, such as ``endif``, has an associated directive
+enum stored in its hash node, so that directive lookup is also O(1).
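+
+The per-node flagging described above can be sketched as follows. This is
+an illustration only: ``struct node``, ``intern``, ``mark_params`` and
+``unmark_params`` are hypothetical names invented for this sketch, and the
+layout is an assumption, not cpplib's actual types or functions.
+

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

/* Hypothetical node layout, for illustration only; not cpplib's struct. */
struct node {
    struct node *next;   /* bucket chain */
    char *name;          /* spelling of the identifier */
    int arg_index;       /* 0 = not a parameter, otherwise 1-based position */
};

static struct node *buckets[NBUCKETS];

/* Return the unique node for NAME, creating it the first time it is seen. */
struct node *intern(const char *name)
{
    unsigned h = 5381;
    for (const char *p = name; *p; p++)
        h = h * 33 + (unsigned char)*p;
    struct node **slot = &buckets[h % NBUCKETS];
    for (struct node *n = *slot; n; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n;
    struct node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->name = malloc(strlen(name) + 1);
    assert(n->name != NULL);
    strcpy(n->name, name);
    n->arg_index = 0;
    n->next = *slot;
    *slot = n;
    return n;
}

/* Flag each parameter's node with its position.  Returns 0 on success, or
   the 1-based position of the first duplicate.  Each duplicate check is a
   single field test on the node, with no search of the earlier parameters. */
int mark_params(struct node **params, int count)
{
    int dup = 0;
    for (int i = 0; i < count && !dup; i++) {
        if (params[i]->arg_index != 0)
            dup = i + 1;
        else
            params[i]->arg_index = i + 1;
    }
    return dup;
}

/* Clear the flags again once the definition has been parsed. */
void unmark_params(struct node **params, int count)
{
    for (int i = 0; i < count; i++)
        params[i]->arg_index = 0;
}
```
+
+Because every occurrence of a name resolves to the same node, checking
+whether an identifier in the replacement list is a parameter is likewise a
+single test of ``arg_index`` in this sketch.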
+
+.. index:: macro expansion
+
+.. _macro-expansion:
+
+Macro Expansion Algorithm
+=========================
+
+Macro expansion is a tricky operation, fraught with nasty corner cases
+and situations that render what you thought was a nifty way to
+optimize the preprocessor's expansion algorithm wrong in quite subtle
+ways.
+
+I strongly recommend you have a good grasp of how the C and C++
+standards require macros to be expanded before diving into this
+section, let alone the code! If you don't have a clear mental
+picture of how things like nested macro expansion, stringizing and
+token pasting are supposed to work, damage to your sanity can quickly
+result.
+
+.. toctree::
+ :maxdepth: 2
+
+ internal-representation-of-macros
+ macro-expansion-overview
+ scanning-the-replacement-list-for-macros-to-expand
+ looking-for-a-function-like-macros-opening-parenthesis
+ marking-tokens-ineligible-for-future-expansion
+
+.. index:: paste avoidance, spacing, token spacing
+
+.. _token-spacing:
+
+Token Spacing
+=============
+
+First, consider an issue that only concerns the stand-alone
+preprocessor: there needs to be a guarantee that re-reading its preprocessed
+output results in an identical token stream. Without taking special
+measures, this might not be the case because of macro substitution.
+For example:
+
+.. code-block::
+
+ #define PLUS +
+ #define EMPTY
+ #define f(x) =x=
+ +PLUS -EMPTY- PLUS+ f(=)
+ → + + - - + + = = =
+ not
+ → ++ -- ++ ===
+
+One solution would be to simply insert a space between all adjacent
+tokens. However, we would like to keep space insertion to a minimum,
+both for aesthetic reasons and because it causes problems for people who
+still try to abuse the preprocessor for things like Fortran source and
+Makefiles.
+
+For now, just notice that when tokens are added to (or removed from, as
+shown by the ``EMPTY`` example) the original lexed token stream, we need
+to check for accidental token pasting. We call this :dfn:`paste
+avoidance`. Token addition and removal can only occur because of macro
+expansion, but accidental pasting can occur in many places: both before
+and after each macro replacement, each argument replacement, and
+additionally each token created by the :samp:`#` and :samp:`##` operators.
+
+First, look at how the preprocessor normally gets whitespace right in
+its output. The ``cpp_token`` structure contains a flags byte, and one
+of those flags is ``PREV_WHITE``. The lexer sets this flag to
+indicate that the token was preceded by whitespace of some form other
+than a newline. The stand-alone preprocessor can use this flag to
+decide whether to insert a space between tokens in the output.
+
+Now consider the result of the following macro expansion:
+
+.. code-block::
+
+ #define add(x, y, z) x + y +z;
+ sum = add (1,2, 3);
+ → sum = 1 + 2 +3;
+
+The interesting thing here is that the tokens :samp:`1` and :samp:`2` are
+output with a preceding space, and :samp:`3` is output without a
+preceding space, but when lexed none of these tokens had that property.
+Careful consideration reveals that :samp:`1` gets its preceding
+whitespace from the space preceding :samp:`add` in the macro invocation,
+*not* from the replacement list. :samp:`2` gets its whitespace from the
+space preceding the parameter :samp:`y` in the macro replacement list,
+and :samp:`3` has no preceding space because parameter :samp:`z` has none
+in the replacement list.
+
+Once lexed, tokens are effectively fixed and cannot be altered, since
+pointers to them might be held in many places, in particular by
+in-progress macro expansions. So instead of modifying the two tokens
+above, the preprocessor inserts a special token, which I call a
+:dfn:`padding token`, into the token stream to indicate that spacing of
+the subsequent token is special. The preprocessor inserts padding
+tokens in front of every macro expansion and expanded macro argument.
+These point to a :dfn:`source token` from which the subsequent real token
+should inherit its spacing. In the above example, the source tokens are
+:samp:`add` in the macro invocation, and :samp:`y` and :samp:`z` in the
+macro replacement list, respectively.
+
+It is quite easy to get multiple padding tokens in a row, for example if
+a macro's first replacement token expands straight into another macro.
+
+.. code-block::
+
+ #define foo bar
+ #define bar baz
+ [foo]
+ → [baz]
+
+Here, two padding tokens are generated, whose sources are the :samp:`foo`
+token between the brackets and the :samp:`bar` token from foo's
+replacement list, respectively. Clearly the first padding token is the
+one to use, so the output code should contain a rule that the first
+padding token in a sequence is the one that matters.
+
+But what happens when we leave a macro expansion? Adjusting the above
+example slightly:
+
+.. code-block::
+
+ #define foo bar
+ #define bar EMPTY baz
+ #define EMPTY
+ [foo] EMPTY;
+ → [ baz] ;
+
+As shown, now there should be a space before :samp:`baz` and the
+semicolon in the output.
+
+The rules we decided above fail for :samp:`baz`: we generate three
+padding tokens, one per macro invocation, before the token :samp:`baz`.
+We would then have it take its spacing from the first of these, which
+carries source token :samp:`foo` with no leading space.
+
+It is vital that cpplib get spacing correct in these examples since any
+of these macro expansions could be stringized, where spacing matters.
+
+So, this demonstrates that not just entering macro and argument
+expansions, but leaving them requires special handling too. I made
+cpplib insert a padding token with a ``NULL`` source token when
+leaving macro expansions, as well as after each replaced argument in a
+macro's replacement list. It also inserts appropriate padding tokens on
+either side of tokens created by the :samp:`#` and :samp:`##` operators.
+I expanded the rule so that, if we see a padding token with a
+``NULL`` source token, *and* the source token recorded so far has no
+leading space, then we behave as if we have seen no padding tokens at
+all. A quick check shows this rule will then get the above example
+correct as well.
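+
+One way to read the combined padding rules is sketched below. It is a
+simplified model under stated assumptions, not cpplib's output code:
+``struct tok`` and ``render`` are hypothetical, paste avoidance is ignored,
+and the first token of each replacement list is assumed to carry no
+leading-space flag.
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified token model; these are not cpplib's types. */
struct tok {
    int is_padding;            /* nonzero for a padding token */
    int prev_white;            /* real tokens: whitespace preceded it */
    const char *spelling;      /* real tokens only */
    const struct tok *source;  /* padding tokens only; may be NULL */
};

/* Write the spelling of toks[0..n) into OUT and return OUT. */
char *render(const struct tok *toks, size_t n, char *out, size_t outsz)
{
    const struct tok *source = NULL;  /* spacing source recorded so far */
    size_t len = 0;

    assert(outsz > 0);
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        const struct tok *t = &toks[i];
        if (t->is_padding) {
            /* A padding token sets the source when none is recorded.  A
               padding token with a NULL source clears a recorded source
               that has no leading space.  Otherwise the earlier one wins. */
            if (source == NULL
                || (!source->prev_white && t->source == NULL))
                source = t->source;
            continue;
        }
        /* With no recorded source, a real token uses its own flag. */
        const struct tok *s = source ? source : t;
        size_t need = strlen(t->spelling) + (s->prev_white ? 1 : 0);
        assert(len + need < outsz);
        if (s->prev_white)
            out[len++] = ' ';
        strcpy(out + len, t->spelling);
        len += strlen(t->spelling);
        source = NULL;  /* a run of padding tokens ends at a real token */
    }
    return out;
}
```
+
+In this model, the token stream for :samp:`[foo] EMPTY;` carries padding
+tokens with sources :samp:`foo`, :samp:`bar`, :samp:`EMPTY` and ``NULL``
+before :samp:`baz`. The ``NULL`` padding clears the recorded :samp:`foo`
+source, so :samp:`baz` falls back on its own leading space.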
+
+Now a relationship with paste avoidance is apparent: we have to be
+careful about paste avoidance in exactly the same locations we have
+padding tokens in order to get white space correct. This makes
+implementation of paste avoidance easy: wherever the stand-alone
+preprocessor is fixing up spacing because of padding tokens, and it
+turns out that no space is needed, it has to take the extra step to
+check that a space is not needed after all to avoid an accidental paste.
+The function ``cpp_avoid_paste`` advises whether a space is required
+between two consecutive tokens. To avoid excessive spacing, it tries
+hard to only require a space if one is likely to be necessary, but for
+reasons of efficiency it is slightly conservative and might recommend a
+space where one is not strictly needed.
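+
+To make the idea of a conservative check concrete, here is a small
+sketch. It is *not* the logic of ``cpp_avoid_paste``: ``needs_space`` is a
+hypothetical helper that works on spellings rather than tokens, and its
+table of punctuator pairs is deliberately incomplete.
+

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper, not cpplib's cpp_avoid_paste.  Returns nonzero when
   writing LEFT immediately followed by RIGHT might lex differently from
   the two separate tokens.  It looks only at the two adjoining characters,
   so it can ask for a space that is not strictly needed. */
int needs_space(const char *left, const char *right)
{
    static const char *const pairs[] = {
        "++", "--", "==", "+=", "-=", "->", "<<", ">>", "<=", ">=",
        "&&", "||", "!=", "##", "//", "/*"
    };

    assert(left[0] != '\0' && right[0] != '\0');
    char a = left[strlen(left) - 1];
    char b = right[0];

    /* Two identifier-like or numeric spellings would fuse into one. */
    if ((isalnum((unsigned char)a) || a == '_')
        && (isalnum((unsigned char)b) || b == '_'))
        return 1;

    /* Adjoining characters that together spell a longer punctuator. */
    for (size_t i = 0; i < sizeof pairs / sizeof pairs[0]; i++)
        if (pairs[i][0] == a && pairs[i][1] == b)
            return 1;
    return 0;
}
```
+
+Applied to the first example in this section, the sketch asks for a space
+between :samp:`+` and :samp:`+`, between :samp:`-` and :samp:`-`, and
+between :samp:`=` and :samp:`=`, but not between :samp:`+` and :samp:`-`.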
+
+.. index:: line numbers
+
+.. _line-numbering:
+
+Line Numbering
+==============
+
+.. toctree::
+ :maxdepth: 2
+
+ just-which-line-number-anyway
+ representation-of-line-numbers
\ No newline at end of file