path: root/libcpp/charset.cc
author Jakub Jelinek <jakub@redhat.com> 2022-09-01 09:48:01 +0200
committer Jakub Jelinek <jakub@redhat.com> 2022-09-01 09:56:44 +0200
commit 0b8c57ed40f19086e30ce54faec3222ac21cc0df
tree 1ce3aa0f19ef45a7d2c03e272d1d8f835bb7f0b6 /libcpp/charset.cc
parent bdfe0d1ce0aebdb68b77e2c04a0f45956c56b449
libcpp: Add -Winvalid-utf8 warning [PR106655]
The following patch introduces a new warning - -Winvalid-utf8 similarly
to what clang now has - to diagnose invalid UTF-8 byte sequences in
comments, but not just in those, but also in string/character literals
and outside of them.  The warning is on by default when explicit
-finput-charset=UTF-8 is used and C++23 compilation is requested and
if -{,W}pedantic or -pedantic-errors it is actually a pedwarn.

The reason it is on by default only for -finput-charset=UTF-8 is
that the sources often are UTF-8, but sometimes could be some ASCII
compatible single byte encoding where non-ASCII characters only
appear in comments.  So having the warning off by default
is IMO desirable.  The C++23 pedantic mode for when the source code
is UTF-8 is -std=c++23 -pedantic-errors -finput-charset=UTF-8.

2022-09-01  Jakub Jelinek  <jakub@redhat.com>

	PR c++/106655
libcpp/
	* include/cpplib.h (struct cpp_options): Implement C++23
	P2295R6 - Support for UTF-8 as a portable source file encoding.
	Add cpp_warn_invalid_utf8 and cpp_input_charset_explicit fields.
	(enum cpp_warning_reason): Add CPP_W_INVALID_UTF8 enumerator.
	* init.cc (cpp_create_reader): Initialize cpp_warn_invalid_utf8
	and cpp_input_charset_explicit.
	* charset.cc (_cpp_valid_utf8): Adjust function comment.
	* lex.cc (UCS_LIMIT): Define.
	(utf8_continuation): New const variable.
	(utf8_signifier): Move earlier in the file.
	(_cpp_warn_invalid_utf8, _cpp_handle_multibyte_utf8): New functions.
	(_cpp_skip_block_comment): Handle -Winvalid-utf8 warning.
	(skip_line_comment): Likewise.
	(lex_raw_string, lex_string): Likewise.
	(_cpp_lex_direct): Likewise.
gcc/
	* doc/invoke.texi (-Winvalid-utf8): Document it.
gcc/c-family/
	* c.opt (-Winvalid-utf8): New warning.
	* c-opts.cc (c_common_handle_option) <case OPT_finput_charset_>:
	Set cpp_opts->cpp_input_charset_explicit.
	(c_common_post_options): If -finput-charset=UTF-8 is explicit
	in C++23, enable -Winvalid-utf8 by default and if -pedantic
	or -pedantic-errors, make it a pedwarn.
gcc/testsuite/
	* c-c++-common/cpp/Winvalid-utf8-1.c: New test.
	* c-c++-common/cpp/Winvalid-utf8-2.c: New test.
	* c-c++-common/cpp/Winvalid-utf8-3.c: New test.
	* g++.dg/cpp23/Winvalid-utf8-1.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-2.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-3.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-4.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-5.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-6.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-7.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-8.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-9.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-10.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-11.C: New test.
	* g++.dg/cpp23/Winvalid-utf8-12.C: New test.
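For readers unfamiliar with what counts as an "invalid UTF-8 byte sequence", the following is a minimal standalone sketch, not the libcpp code. The function name first_invalid_utf8 and the sample arrays are hypothetical and appear nowhere in the patch. It assumes the well-formedness rules of the Unicode Standard: no overlong encodings, no surrogates, nothing above U+10FFFF, and every continuation byte in the range 0x80..0xBF.

```cpp
#include <cstddef>

// Hypothetical helper (NOT GCC's _cpp_valid_utf8): returns the offset of the
// first byte that does not begin a well-formed UTF-8 sequence, or LEN if the
// whole buffer is well-formed.
constexpr std::size_t
first_invalid_utf8 (const unsigned char *s, std::size_t len)
{
  std::size_t i = 0;
  while (i < len)
    {
      unsigned char b = s[i];
      if (b < 0x80)
	{
	  ++i;			// plain ASCII
	  continue;
	}
      std::size_t extra = 0;	// continuation bytes expected
      unsigned char lo = 0x80;	// allowed range of the first
      unsigned char hi = 0xBF;	// continuation byte
      if (b >= 0xC2 && b <= 0xDF)
	extra = 1;
      else if (b == 0xE0)
	{ extra = 2; lo = 0xA0; }	// rejects overlong 3-byte forms
      else if (b == 0xED)
	{ extra = 2; hi = 0x9F; }	// rejects surrogates U+D800..U+DFFF
      else if (b >= 0xE1 && b <= 0xEF)
	extra = 2;
      else if (b == 0xF0)
	{ extra = 3; lo = 0x90; }	// rejects overlong 4-byte forms
      else if (b == 0xF4)
	{ extra = 3; hi = 0x8F; }	// rejects values above U+10FFFF
      else if (b >= 0xF1 && b <= 0xF3)
	extra = 3;
      else
	return i;		// 0x80..0xC1 or 0xF5..0xFF can never lead
      if (len - i <= extra)
	return i;		// sequence truncated by end of buffer
      if (s[i + 1] < lo || s[i + 1] > hi)
	return i;
      for (std::size_t k = 2; k <= extra; ++k)
	if (s[i + k] < 0x80 || s[i + k] > 0xBF)
	  return i;
      i += extra + 1;
    }
  return len;
}

// Compile-time usage: "a" followed by U+00E9 is well-formed, while the
// overlong pair C0 80 is rejected at offset 1.
constexpr unsigned char sample_ok[] = { 'a', 0xC3, 0xA9 };
constexpr unsigned char sample_bad[] = { 'a', 0xC0, 0x80 };
static_assert (first_invalid_utf8 (sample_ok, sizeof sample_ok) == 3, "valid");
static_assert (first_invalid_utf8 (sample_bad, sizeof sample_bad) == 1, "overlong");
```

A source file containing a sequence that such a check rejects is the kind of input the commit message says is diagnosed under -std=c++23 -pedantic-errors -finput-charset=UTF-8.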
Diffstat (limited to 'libcpp/charset.cc')
-rw-r--r-- libcpp/charset.cc 6
1 files changed, 3 insertions, 3 deletions
diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index d3c07d6..c9656db 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -1742,9 +1742,9 @@ convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit,
case, no diagnostic is emitted, and the return value of FALSE should cause
a new token to be formed.
- Unlike _cpp_valid_ucn, this will never be called when lexing a string; only
- a potential identifier, or a CPP_OTHER token. NST is unused in the latter
- case.
+ _cpp_valid_utf8 can be called when lexing a potential identifier, or a
+ CPP_OTHER token or for the purposes of -Winvalid-utf8 warning in string or
+ character literals. NST is unused when not in a potential identifier.
As in _cpp_valid_ucn, IDENTIFIER_POS is 0 when not in an identifier, 1 for
the start of an identifier, or 2 otherwise. */