libcpp: Named universal character escapes and delimited escape sequence tweaks

On Tue, Aug 30, 2022 at 09:10:37PM +0000, Joseph Myers wrote: > I'm seeing build failures of glibc for powerpc64, as illustrated by the > following C code: > > #if 0 > \NARG > #endif > > (the actual sysdeps/powerpc/powerpc64/sysdep.h code is inside #ifdef > __ASSEMBLER__). > > This shows some problems with this feature - and with delimited escape > sequences - as it affects C. It's fine to accept it as an extension > inside string and character literals, because \N or \u{...} would be > invalid in the absence of the feature (i.e. the syntax for such literals > fails to match, meaning that the rule about undefined behavior for a > single ' or " as a pp-token applies). But outside string and character > literals, the usual lexing rules apply, the \ is a pp-token on its own and > the code is valid at the preprocessing level, and with expansion of macros > appearing before or after the \ (e.g. u defined as a macro in the \u{...} > case) it may be valid code at the language level as well. I don't know > what older C++ versions say about this, but for C this means e.g. > > #define z(x) 0 > #define a z( > int x = a\NARG); > > needs to be accepted as expanding to "int x = 0;", not interpreted as > using the \N feature in an identifier and produce an error. The following patch changes this, so that: 1) outside of string/character literals, \N without following { is never treated as an error nor warning, it is silently treated as \ separate token followed by whatever is after it 2) \u{123} and \N{LATIN SMALL LETTER A WITH ACUTE} are not handled as extension at all outside of string/character literals in the strict standard modes (-std=c*) except for -std=c++{23,2b}, only in the -std=gnu* modes, because it changes behavior on valid sources, e.g. #define z(x) 0 #define a z( int x = a\u{123}); int y = a\N{LATIN SMALL LETTER A WITH ACUTE}); 3) introduces -Wunicode warning (on by default) and warns for cases of what looks like invalid delimited escape sequence or named universal character escape outside of string/character literals and is treated as separate tokens 2022-09-07 Jakub Jelinek <jakub@redhat.com> libcpp/ * include/cpplib.h (struct cpp_options): Add cpp_warn_unicode member. (enum cpp_warning_reason): Add CPP_W_UNICODE. * init.cc (cpp_create_reader): Initialize cpp_warn_unicode. * charset.cc (_cpp_valid_ucn): In possible identifier contexts, don't handle \u{ or \N{ specially in -std=c* modes except -std=c++2{3,b}. In possible identifier contexts, don't emit an error and punt if \N isn't followed by {, or if \N{} surrounds some lower case letters or _. In possible identifier contexts when not C++23, don't emit an error but warning about unknown character names and treat as separate tokens. When treating as separate tokens \u{ or \N{, emit warnings. gcc/ * doc/invoke.texi (-Wno-unicode): Document. gcc/c-family/ * c.opt (Winvalid-utf8): Use ObjC instead of objC. Remove " in comments" from description. (Wunicode): New option. gcc/testsuite/ * c-c++-common/cpp/delimited-escape-seq-4.c: New test. * c-c++-common/cpp/delimited-escape-seq-5.c: New test. * c-c++-common/cpp/delimited-escape-seq-6.c: New test. * c-c++-common/cpp/delimited-escape-seq-7.c: New test. * c-c++-common/cpp/named-universal-char-escape-5.c: New test. * c-c++-common/cpp/named-universal-char-escape-6.c: New test. * c-c++-common/cpp/named-universal-char-escape-7.c: New test. * g++.dg/cpp23/named-universal-char-escape1.C: New test. * g++.dg/cpp23/named-universal-char-escape2.C: New test.
author: Jakub Jelinek <jakub@redhat.com> 2022-09-07 08:44:38 +0200
committer: Jakub Jelinek <jakub@redhat.com> 2022-09-07 08:44:38 +0200
commit: 572f5e1bc68e131b25cd2d5ba231e932f5038904 (patch)
tree: 3bb768b7f06160f88ca0aaa8f0e2f5df0408ab7e /gcc/c-family/c.opt
parent: ea6e89e07f4223c8ac7877508c62bba368084999 (diff)
download: gcc-572f5e1bc68e131b25cd2d5ba231e932f5038904.zip
gcc-572f5e1bc68e131b25cd2d5ba231e932f5038904.tar.gz
gcc-572f5e1bc68e131b25cd2d5ba231e932f5038904.tar.bz2
1 files changed, 6 insertions, 2 deletions
diff --git a/gcc/c-family/c.opt b/gcc/c-family/c.opt
index 4515664..1c7f89e 100644
--- a/gcc/c-family/c.opt
+++ b/gcc/c-family/c.opt
@@ -822,8 +822,8 @@ C ObjC C++ ObjC++ CPP(warn_invalid_pch) CppReason(CPP_W_INVALID_PCH) Var(cpp_war
 Warn about PCH files that are found but not used.
 
 Winvalid-utf8
-C objC C++ ObjC++ CPP(cpp_warn_invalid_utf8) CppReason(CPP_W_INVALID_UTF8) Var(warn_invalid_utf8) Init(0) Warning
-Warn about invalid UTF-8 characters in comments.
+C ObjC C++ ObjC++ CPP(cpp_warn_invalid_utf8) CppReason(CPP_W_INVALID_UTF8) Var(warn_invalid_utf8) Init(0) Warning
+Warn about invalid UTF-8 characters.
 
 Wjump-misses-init
 C ObjC Var(warn_jump_misses_init) Warning LangEnabledby(C ObjC,Wc++-compat)
@@ -1345,6 +1345,10 @@ Wundef
 C ObjC C++ ObjC++ CPP(warn_undef) CppReason(CPP_W_UNDEF) Var(cpp_warn_undef) Init(0) Warning
 Warn if an undefined macro is used in an #if directive.
 
+Wunicode
+C ObjC C++ ObjC++ CPP(cpp_warn_unicode) CppReason(CPP_W_UNICODE) Var(warn_unicode) Init(1) Warning
+Warn about invalid forms of delimited or named escape sequences.
+
 Wuninitialized
 C ObjC C++ ObjC++ LTO LangEnabledBy(C ObjC C++ ObjC++ LTO,Wall)
 ;
author	Jakub Jelinek <jakub@redhat.com>	2022-09-07 08:44:38 +0200
committer	Jakub Jelinek <jakub@redhat.com>	2022-09-07 08:44:38 +0200
commit	572f5e1bc68e131b25cd2d5ba231e932f5038904 (patch)
tree	3bb768b7f06160f88ca0aaa8f0e2f5df0408ab7e /gcc/c-family/c.opt
parent	ea6e89e07f4223c8ac7877508c62bba368084999 (diff)
download	gcc-572f5e1bc68e131b25cd2d5ba231e932f5038904.zip gcc-572f5e1bc68e131b25cd2d5ba231e932f5038904.tar.gz gcc-572f5e1bc68e131b25cd2d5ba231e932f5038904.tar.bz2