author | Lewis Hyatt <lhyatt@gmail.com> | 2019-09-19 19:56:11 +0000 |
---|---|---|
committer | Joseph Myers <jsm28@gcc.gnu.org> | 2019-09-19 20:56:11 +0100 |
commit | 7d112d6670a0e0e662f8a7e64c33686e475832c8 (patch) | |
tree | 983eb23217b2572ff4fe5a7f7fe0e5c0c0b9a48d /gcc/doc | |
parent | e0710fcf7dc70054a9a20ab1b8d77f4fef26ef2c (diff) | |
Support extended characters in C/C++ identifiers (PR c/67224)
libcpp/ChangeLog
2019-09-19 Lewis Hyatt <lhyatt@gmail.com>
PR c/67224
* charset.c (_cpp_valid_utf8): New function to help lex UTF-8 tokens.
* internal.h (_cpp_valid_utf8): Declare.
* lex.c (forms_identifier_p): Use it to recognize UTF-8 identifiers.
(_cpp_lex_direct): Handle UTF-8 in identifiers and CPP_OTHER tokens.
Do all work in "default" case to avoid slowing down typical code paths.
Also handle $ and UCN in the default case for consistency.
gcc/ChangeLog
2019-09-19 Lewis Hyatt <lhyatt@gmail.com>
PR c/67224
* doc/cpp.texi: Document support for extended characters in
identifiers.
* doc/cppopts.texi: Likewise.
gcc/testsuite/ChangeLog
2019-09-19 Lewis Hyatt <lhyatt@gmail.com>
PR c/67224
* c-c++-common/cpp/ucnid-2011-1-utf8.c: New test.
* g++.dg/cpp/ucnid-1-utf8.C: New test.
* g++.dg/cpp/ucnid-2-utf8.C: New test.
* g++.dg/cpp/ucnid-3-utf8.C: New test.
* g++.dg/cpp/ucnid-4-utf8.C: New test.
* g++.dg/other/ucnid-1-utf8.C: New test.
* gcc.dg/cpp/ucnid-1-utf8.c: New test.
* gcc.dg/cpp/ucnid-10-utf8.c: New test.
* gcc.dg/cpp/ucnid-11-utf8.c: New test.
* gcc.dg/cpp/ucnid-12-utf8.c: New test.
* gcc.dg/cpp/ucnid-13-utf8.c: New test.
* gcc.dg/cpp/ucnid-14-utf8.c: New test.
* gcc.dg/cpp/ucnid-15-utf8.c: New test.
* gcc.dg/cpp/ucnid-2-utf8.c: New test.
* gcc.dg/cpp/ucnid-3-utf8.c: New test.
* gcc.dg/cpp/ucnid-4-utf8.c: New test.
* gcc.dg/cpp/ucnid-6-utf8.c: New test.
* gcc.dg/cpp/ucnid-7-utf8.c: New test.
* gcc.dg/cpp/ucnid-9-utf8.c: New test.
* gcc.dg/ucnid-1-utf8.c: New test.
* gcc.dg/ucnid-10-utf8.c: New test.
* gcc.dg/ucnid-11-utf8.c: New test.
* gcc.dg/ucnid-12-utf8.c: New test.
* gcc.dg/ucnid-13-utf8.c: New test.
* gcc.dg/ucnid-14-utf8.c: New test.
* gcc.dg/ucnid-15-utf8.c: New test.
* gcc.dg/ucnid-16-utf8.c: New test.
* gcc.dg/ucnid-2-utf8.c: New test.
* gcc.dg/ucnid-3-utf8.c: New test.
* gcc.dg/ucnid-4-utf8.c: New test.
* gcc.dg/ucnid-5-utf8.c: New test.
* gcc.dg/ucnid-6-utf8.c: New test.
* gcc.dg/ucnid-7-utf8.c: New test.
* gcc.dg/ucnid-8-utf8.c: New test.
* gcc.dg/ucnid-9-utf8.c: New test.
From-SVN: r275979
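As a rough illustration of the behaviour this commit describes, the sketch below declares a variable with an extended character written directly in UTF-8 and reads it back through the equivalent universal character name. The names `café_count` and `read_via_ucn` are made up for this example, and it assumes a compiler with this change (or equivalent support), a UTF-8 source file, and a C99 or later mode with `-fextended-identifiers` in effect.

```c
#include <assert.h>

/* Hypothetical example names; not taken from the GCC testsuite. */

/* Declared with the character U+00E9 written directly in UTF-8. */
static int café_count = 3;

/* Reads the same variable, spelled with a universal character name.
   Both spellings are expected to denote one identifier. */
static int read_via_ucn(void)
{
    return caf\u00e9_count;
}
```

Calling `read_via_ucn()` should return the value stored in `café_count`. With `-std=c90` or `-fno-extended-identifiers` the file is expected to be rejected.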
Diffstat (limited to 'gcc/doc')
-rw-r--r-- | gcc/doc/cpp.texi | 32 | ||||
-rw-r--r-- | gcc/doc/cppopts.texi | 5 |
2 files changed, 20 insertions, 17 deletions
diff --git a/gcc/doc/cpp.texi b/gcc/doc/cpp.texi
index e271f51..f2de39a 100644
--- a/gcc/doc/cpp.texi
+++ b/gcc/doc/cpp.texi
@@ -274,11 +274,11 @@ the character in the source character set that they represent, then
 converted to the execution character set, just like unescaped
 characters.
 
-In identifiers, characters outside the ASCII range can only be
-specified with the @samp{\u} and @samp{\U} escapes, not used
-directly. If strict ISO C90 conformance is specified with an option
+In identifiers, characters outside the ASCII range can be specified
+with the @samp{\u} and @samp{\U} escapes or used directly in the input
+encoding. If strict ISO C90 conformance is specified with an option
 such as @option{-std=c90}, or @option{-fno-extended-identifiers} is
-used, then those escapes are not permitted in identifiers.
+used, then those constructs are not permitted in identifiers.
 
 @node Initial processing
 @section Initial processing
@@ -503,8 +503,7 @@ In the 1999 C standard, identifiers may contain letters which are not
 part of the ``basic source character set'', at the implementation's
 discretion (such as accented Latin letters, Greek letters, or Chinese
 ideograms). This may be done with an extended character set, or the
-@samp{\u} and @samp{\U} escape sequences. GCC only accepts such
-characters in the @samp{\u} and @samp{\U} forms.
+@samp{\u} and @samp{\U} escape sequences.
 
 As an extension, GCC treats @samp{$} as a letter. This is for
 compatibility with some systems, such as VMS, where @samp{$} is commonly
@@ -584,15 +583,15 @@ Punctuator: @{ @} [ ] # ##
 @end smallexample
 
 @cindex other tokens
-Any other single character is considered ``other''. It is passed on to
-the preprocessor's output unmolested. The C compiler will almost
-certainly reject source code containing ``other'' tokens. In ASCII, the
-only other characters are @samp{@@}, @samp{$}, @samp{`}, and control
+Any other single byte is considered ``other'' and passed on to the
+preprocessor's output unchanged. The C compiler will almost certainly
+reject source code containing ``other'' tokens. In ASCII, the only
+``other'' characters are @samp{@@}, @samp{$}, @samp{`}, and control
 characters other than NUL (all bits zero). (Note that @samp{$} is
-normally considered a letter.) All characters with the high bit set
-(numeric range 0x7F--0xFF) are also ``other'' in the present
-implementation. This will change when proper support for international
-character sets is added to GCC@.
+normally considered a letter.) All bytes with the high bit set
+(numeric range 0x7F--0xFF) that were not succesfully interpreted as
+part of an extended character in the input encoding are also ``other''
+in the present implementation.
 
 NUL is a special case because of the high probability that its
 appearance is accidental, and because it may be invisible to the user
@@ -4179,7 +4178,10 @@ be controlled using the @option{-fexec-charset} and
 The C and C++ standards allow identifiers to be composed of @samp{_}
 and the alphanumeric characters. C++ also allows universal character
 names. C99 and later C standards permit both universal character
-names and implementation-defined characters.
+names and implementation-defined characters. In both C and C++ modes,
+GCC accepts in identifiers exactly those extended characters that
+correspond to universal character names permitted by the chosen
+standard.
 
 GCC allows the @samp{$} character in identifiers as an extension for
 most targets. This is true regardless of the @option{std=} switch,
diff --git a/gcc/doc/cppopts.texi b/gcc/doc/cppopts.texi
index 61e22cd..f4bc3f5 100644
--- a/gcc/doc/cppopts.texi
+++ b/gcc/doc/cppopts.texi
@@ -254,8 +254,9 @@ Accept @samp{$} in identifiers.
 
 @item -fextended-identifiers
 @opindex fextended-identifiers
-Accept universal character names in identifiers. This option is
-enabled by default for C99 (and later C standard versions) and C++.
+Accept universal character names and extended characters in
+identifiers. This option is enabled by default for C99 (and later C
+standard versions) and C++.
 
 @item -fno-canonical-system-headers
 @opindex fno-canonical-system-headers