libcpp: Handle extended characters in user-defined literal suffix [PR103902]

The PR complains that we do not handle UTF-8 in the suffix for a user-defined literal, such as: bool operator ""_π (unsigned long long); In fact we don't handle any extended identifier characters there, whether UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after the "" tokens is included, since then the identifier is lexed in the "normal" way as its own token. But when it is lexed as part of the string token, this is handled in lex_string() with a one-off loop that is not aware of extended characters. This patch fixes it by adding a new function scan_cur_identifier() that can be used to lex an identifier while in the middle of lexing another token. BTW, the other place that has been mis-lexing identifiers is lex_identifier_intern(), which is used to implement #pragma push_macro and #pragma pop_macro. This does not support extended characters either. I will add that in a subsequent patch, because it can't directly reuse the new function, but rather needs to lex from a string instead of a cpp_buffer. With scan_cur_identifier(), we do also correctly warn about bidi and normalization issues in the extended identifiers comprising the suffix. libcpp/ChangeLog: PR preprocessor/103902 * lex.cc (identifier_diagnostics_on_lex): New function refactoring some common code. (lex_identifier_intern): Use the new function. (lex_identifier): Don't run identifier diagnostics here, rather let the call site do it when needed. (_cpp_lex_direct): Adjust the call sites of lex_identifier () acccordingly. (struct scan_id_result): New struct. (scan_cur_identifier): New function. (create_literal2): New function. (lit_accum::create_literal2): New function. (is_macro): Folded into new function... (maybe_ignore_udl_macro_suffix): ...here. (is_macro_not_literal_suffix): Folded likewise. (lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_identifier (). (lex_string): Likewise. gcc/testsuite/ChangeLog: PR preprocessor/103902 * g++.dg/cpp0x/udlit-extended-id-1.C: New test. * g++.dg/cpp0x/udlit-extended-id-2.C: New test. * g++.dg/cpp0x/udlit-extended-id-3.C: New test. * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
author: Lewis Hyatt <lhyatt@gmail.com> 2023-07-18 17:16:08 -0400
committer: Lewis Hyatt <lhyatt@gmail.com> 2023-07-18 22:37:55 -0400
commit: 1d3e4f4e2d19c3394dc018118a78c1f4b59cb5c2 (patch)
tree: 8215d493a81199760c6316e083cdb2ed042526cc /gcc
parent: 9a19fa8b616f83474c35cc5b34a3865073ced829 (diff)
download: gcc-1d3e4f4e2d19c3394dc018118a78c1f4b59cb5c2.zip
gcc-1d3e4f4e2d19c3394dc018118a78c1f4b59cb5c2.tar.gz
gcc-1d3e4f4e2d19c3394dc018118a78c1f4b59cb5c2.tar.bz2
4 files changed, 106 insertions, 0 deletions
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
new file mode 100644
index 0000000..5ea5ef0
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
@@ -0,0 +1,71 @@
+// { dg-do run { target c++11 } }
+// { dg-additional-options "-Wno-error=normalized" }
+#include <cstring>
+using namespace std;
+
+constexpr unsigned long long operator "" _π (unsigned long long x)
+{
+  return 3 * x;
+}
+
+/* Historically we didn't parse properly as part of the "" token, so check that
+   as well.  */
+constexpr unsigned long long operator ""_Π2 (unsigned long long x)
+{
+  return 4 * x;
+}
+
+char x1[1_π];
+char x2[2_Π2];
+
+static_assert (sizeof x1 == 3, "test1");
+static_assert (sizeof x2 == 8, "test2");
+
+const char * operator "" _1σ (const char *s, unsigned long)
+{
+  return s + 1;
+}
+
+const char * operator ""_Σ2 (const char *s, unsigned long)
+{
+  return s + 2;
+}
+
+const char * operator "" _\U000000e61 (const char *s, unsigned long)
+{
+  return "ae";
+}
+
+const char* operator ""_\u01532 (const char *s, unsigned long)
+{
+  return "oe";
+}
+
+bool operator "" _\u0BC7\u0BBE (unsigned long long); // { dg-warning "not in NFC" }
+bool operator ""_\u0B47\U00000B3E (unsigned long long); // { dg-warning "not in NFC" }
+
+#define xτy
+const char * str = ""xτy; // { dg-warning "invalid suffix on literal" }
+
+int main()
+{
+  if (3_π != 9)
+    __builtin_abort ();
+  if (4_Π2 != 16)
+    __builtin_abort ();
+  if (strcmp ("abc"_1σ, "bc"))
+    __builtin_abort ();
+  if (strcmp ("abcd"_Σ2, "cd"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_1σ, "bcdef"))
+    __builtin_abort ();
+  if (strcmp (R"(abc
+def)"_1σ, "bc\ndef"))
+    __builtin_abort ();
+  if (strcmp (R"(abcdef)"_Σ2, "cdef"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_æ1, "ae"))
+    __builtin_abort ();
+  if (strcmp ("xyz"_œ2, "oe"))
+    __builtin_abort ();
+}
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
new file mode 100644
index 0000000..05a2804
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
@@ -0,0 +1,6 @@
+// { dg-do compile { target c++11 } }
+// { dg-additional-options "-Wbidi-chars=any,ucn" }
+bool operator ""_d\u202ae\u202cf (unsigned long long); // { dg-line line1 }
+// { dg-error "universal character \\\\u202a is not valid in an identifier" "test1" { target *-*-* } line1 }
+// { dg-error "universal character \\\\u202c is not valid in an identifier" "test2" { target *-*-* } line1 }
+// { dg-warning "found problematic Unicode character" "test3" { target *-*-* } line1 }
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
new file mode 100644
index 0000000..11292e4
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
@@ -0,0 +1,15 @@
+// { dg-do compile { target c++11 } }
+
+// Check that we do not look for poisoned identifier when it is a suffix.
+int _ħ;
+#pragma GCC poison _ħ
+const char * operator ""_ħ (const char *, unsigned long); // { dg-bogus "poisoned" }
+bool operator ""_ħ (unsigned long long x); // { dg-bogus "poisoned" }
+bool b = 1_ħ; // { dg-bogus "poisoned" }
+const char *x = "hbar"_ħ; // { dg-bogus "poisoned" }
+
+/* Ideally, we should not warn here either, but this is not implemented yet.  This
+   syntax has been deprecated for C++23.  */
+#pragma GCC poison _ħ2
+const char * operator "" _ħ2 (const char *, unsigned long); // { dg-bogus "poisoned" "" { xfail *-*-*} }
+const char *x2 = "hbar2"_ħ2; // { dg-bogus "poisoned" }
diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
new file mode 100644
index 0000000..d1683c4
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
@@ -0,0 +1,14 @@
+// { dg-options "-std=c++98 -Wc++11-compat" }
+#define END ;
+#define εND ;
+#define EηD ;
+#define EN\u0394 ;
+
+const char *s1 = "s1"END // { dg-warning "requires a space between string literal and macro" }
+const char *s2 = "s2"εND // { dg-warning "requires a space between string literal and macro" }
+const char *s3 = "s3"EηD // { dg-warning "requires a space between string literal and macro" }
+const char *s4 = "s4"ENΔ // { dg-warning "requires a space between string literal and macro" }
+
+/* Make sure we did not skip the token also in the case that it wasn't found to
+   be a macro; compilation should fail here.  */
+const char *s5 = "s5"NØT_A_MACRO; // { dg-error "expected ',' or ';' before" }
author	Lewis Hyatt <lhyatt@gmail.com>	2023-07-18 17:16:08 -0400
committer	Lewis Hyatt <lhyatt@gmail.com>	2023-07-18 22:37:55 -0400
commit	1d3e4f4e2d19c3394dc018118a78c1f4b59cb5c2 (patch)
tree	8215d493a81199760c6316e083cdb2ed042526cc /gcc
parent	9a19fa8b616f83474c35cc5b34a3865073ced829 (diff)
download	gcc-1d3e4f4e2d19c3394dc018118a78c1f4b59cb5c2.zip gcc-1d3e4f4e2d19c3394dc018118a78c1f4b59cb5c2.tar.gz gcc-1d3e4f4e2d19c3394dc018118a78c1f4b59cb5c2.tar.bz2