Support extended characters in C/C++ identifiers (PR c/67224)

libcpp/ChangeLog
2019-09-19  Lewis Hyatt  <lhyatt@gmail.com>

	PR c/67224
	* charset.c (_cpp_valid_utf8): New function to help lex UTF-8 tokens.
	* internal.h (_cpp_valid_utf8): Declare.
	* lex.c (forms_identifier_p): Use it to recognize UTF-8 identifiers.
	(_cpp_lex_direct): Handle UTF-8 in identifiers and CPP_OTHER tokens.
	Do all work in "default" case to avoid slowing down typical code paths.
	Also handle $ and UCN in the default case for consistency.

gcc/ChangeLog
2019-09-19  Lewis Hyatt  <lhyatt@gmail.com>

	PR c/67224
	* doc/cpp.texi: Document support for extended characters in
	identifiers.
	* doc/cppopts.texi: Likewise.

gcc/testsuite/ChangeLog
2019-09-19  Lewis Hyatt  <lhyatt@gmail.com>

	PR c/67224
	* c-c++-common/cpp/ucnid-2011-1-utf8.c: New test.
	* g++.dg/cpp/ucnid-1-utf8.C: New test.
	* g++.dg/cpp/ucnid-2-utf8.C: New test.
	* g++.dg/cpp/ucnid-3-utf8.C: New test.
	* g++.dg/cpp/ucnid-4-utf8.C: New test.
	* g++.dg/other/ucnid-1-utf8.C: New test.
	* gcc.dg/cpp/ucnid-1-utf8.c: New test.
	* gcc.dg/cpp/ucnid-10-utf8.c: New test.
	* gcc.dg/cpp/ucnid-11-utf8.c: New test.
	* gcc.dg/cpp/ucnid-12-utf8.c: New test.
	* gcc.dg/cpp/ucnid-13-utf8.c: New test.
	* gcc.dg/cpp/ucnid-14-utf8.c: New test.
	* gcc.dg/cpp/ucnid-15-utf8.c: New test.
	* gcc.dg/cpp/ucnid-2-utf8.c: New test.
	* gcc.dg/cpp/ucnid-3-utf8.c: New test.
	* gcc.dg/cpp/ucnid-4-utf8.c: New test.
	* gcc.dg/cpp/ucnid-6-utf8.c: New test.
	* gcc.dg/cpp/ucnid-7-utf8.c: New test.
	* gcc.dg/cpp/ucnid-9-utf8.c: New test.
	* gcc.dg/ucnid-1-utf8.c: New test.
	* gcc.dg/ucnid-10-utf8.c: New test.
	* gcc.dg/ucnid-11-utf8.c: New test.
	* gcc.dg/ucnid-12-utf8.c: New test.
	* gcc.dg/ucnid-13-utf8.c: New test.
	* gcc.dg/ucnid-14-utf8.c: New test.
	* gcc.dg/ucnid-15-utf8.c: New test.
	* gcc.dg/ucnid-16-utf8.c: New test.
	* gcc.dg/ucnid-2-utf8.c: New test.
	* gcc.dg/ucnid-3-utf8.c: New test.
	* gcc.dg/ucnid-4-utf8.c: New test.
	* gcc.dg/ucnid-5-utf8.c: New test.
	* gcc.dg/ucnid-6-utf8.c: New test.
	* gcc.dg/ucnid-7-utf8.c: New test.
	* gcc.dg/ucnid-8-utf8.c: New test.
	* gcc.dg/ucnid-9-utf8.c: New test.

From-SVN: r275979
author: Lewis Hyatt <lhyatt@gmail.com> 2019-09-19 19:56:11 +0000
committer: Joseph Myers <jsm28@gcc.gnu.org> 2019-09-19 20:56:11 +0100
commit: 7d112d6670a0e0e662f8a7e64c33686e475832c8 (patch)
tree: 983eb23217b2572ff4fe5a7f7fe0e5c0c0b9a48d /gcc/doc
parent: e0710fcf7dc70054a9a20ab1b8d77f4fef26ef2c (diff)
2 files changed, 20 insertions, 17 deletions
diff --git a/gcc/doc/cpp.texi b/gcc/doc/cpp.texi
index e271f51..f2de39a 100644
--- a/gcc/doc/cpp.texi
+++ b/gcc/doc/cpp.texi
@@ -274,11 +274,11 @@ the character in the source character set that they represent, then
 converted to the execution character set, just like unescaped
 characters.
 
-In identifiers, characters outside the ASCII range can only be
-specified with the @samp{\u} and @samp{\U} escapes, not used
-directly.  If strict ISO C90 conformance is specified with an option
+In identifiers, characters outside the ASCII range can be specified
+with the @samp{\u} and @samp{\U} escapes or used directly in the input
+encoding.  If strict ISO C90 conformance is specified with an option
 such as @option{-std=c90}, or @option{-fno-extended-identifiers} is
-used, then those escapes are not permitted in identifiers.
+used, then those constructs are not permitted in identifiers.
 
 @node Initial processing
 @section Initial processing
@@ -503,8 +503,7 @@ In the 1999 C standard, identifiers may contain letters which are not
 part of the ``basic source character set'', at the implementation's
 discretion (such as accented Latin letters, Greek letters, or Chinese
 ideograms).  This may be done with an extended character set, or the
-@samp{\u} and @samp{\U} escape sequences.  GCC only accepts such
-characters in the @samp{\u} and @samp{\U} forms.
+@samp{\u} and @samp{\U} escape sequences.
 
 As an extension, GCC treats @samp{$} as a letter.  This is for
 compatibility with some systems, such as VMS, where @samp{$} is commonly
@@ -584,15 +583,15 @@ Punctuator:      @{   @}   [   ]   #    ##
 @end smallexample
 
 @cindex other tokens
-Any other single character is considered ``other''.  It is passed on to
-the preprocessor's output unmolested.  The C compiler will almost
-certainly reject source code containing ``other'' tokens.  In ASCII, the
-only other characters are @samp{@@}, @samp{$}, @samp{`}, and control
+Any other single byte is considered ``other'' and passed on to the
+preprocessor's output unchanged.  The C compiler will almost certainly
+reject source code containing ``other'' tokens.  In ASCII, the only
+``other'' characters are @samp{@@}, @samp{$}, @samp{`}, and control
 characters other than NUL (all bits zero).  (Note that @samp{$} is
-normally considered a letter.)  All characters with the high bit set
-(numeric range 0x7F--0xFF) are also ``other'' in the present
-implementation.  This will change when proper support for international
-character sets is added to GCC@.
+normally considered a letter.)  All bytes with the high bit set
+(numeric range 0x7F--0xFF) that were not successfully interpreted as
+part of an extended character in the input encoding are also ``other''
+in the present implementation.
 
 NUL is a special case because of the high probability that its
 appearance is accidental, and because it may be invisible to the user
@@ -4179,7 +4178,10 @@ be controlled using the @option{-fexec-charset} and
 The C and C++ standards allow identifiers to be composed of @samp{_}
 and the alphanumeric characters.  C++ also allows universal character
 names.  C99 and later C standards permit both universal character
-names and implementation-defined characters.
+names and implementation-defined characters.  In both C and C++ modes,
+GCC accepts in identifiers exactly those extended characters that
+correspond to universal character names permitted by the chosen
+standard.
 
 GCC allows the @samp{$} character in identifiers as an extension for
 most targets.  This is true regardless of the @option{std=} switch,
diff --git a/gcc/doc/cppopts.texi b/gcc/doc/cppopts.texi
index 61e22cd..f4bc3f5 100644
--- a/gcc/doc/cppopts.texi
+++ b/gcc/doc/cppopts.texi
@@ -254,8 +254,9 @@ Accept @samp{$} in identifiers.
 
 @item -fextended-identifiers
 @opindex fextended-identifiers
-Accept universal character names in identifiers.  This option is
-enabled by default for C99 (and later C standard versions) and C++.
+Accept universal character names and extended characters in
+identifiers.  This option is enabled by default for C99 (and later C
+standard versions) and C++.
 
 @item -fno-canonical-system-headers
 @opindex fno-canonical-system-headers
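
The documentation change above says that an extended character may now be written in an identifier either directly in the input encoding or as a @samp{\u}/@samp{\U} escape. The following is a minimal sketch of what that permits, not one of the tests added by this commit. It assumes the file is saved as UTF-8, that the compiler includes this change (GCC 10 is the first release series that does), and that a C99 or later standard is selected so that -fextended-identifiers is on by default. The variable and function names are made up for illustration.

```c
#include <assert.h>

/* Declared with the extended characters written directly in UTF-8. */
static int größe = 40;

/* Names the same variable with universal character names:
   U+00F6 is 'ö' and U+00DF is 'ß'. */
static int read_via_ucn(void)
{
    return gr\u00f6\u00dfe;
}

/* Names the variable with the UTF-8 spelling again. */
static int read_via_utf8(void)
{
    return größe;
}

/* Both spellings denote one identifier, so a write through the UTF-8
   spelling is visible through the UCN spelling. */
static void check_same_identifier(void)
{
    größe += 2;
    assert(read_via_ucn() == read_via_utf8());
}
```

Compiling the same file with -std=c90 or -fno-extended-identifiers should reject both spellings, since the documentation states that neither construct is permitted in identifiers in that case.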