diagnostics: escape non-ASCII source bytes for certain diagnostics

This patch adds support to GCC's diagnostic subsystem for escaping certain bytes and Unicode characters when quoting source code. Specifically, this patch adds a new flag rich_location::m_escape_on_output which is a hint from a diagnostic that non-ASCII bytes in the pertinent lines of the user's source code should be escaped when printed. The patch sets this for the following diagnostics: - when complaining about stray bytes in the program (when these are non-printable) - when complaining about "null character(s) ignored"); - for -Wnormalized= (and generate source ranges for such warnings) The escaping is controlled by a new option: -fdiagnostics-escape-format=[unicode|bytes] For example, consider a diagnostic involing a source line containing the string "before" followed by the Unicode character U+03C0 ("GREEK SMALL LETTER PI", with UTF-8 encoding 0xCF 0x80) followed by the byte 0xBF (a stray UTF-8 trailing byte), followed by the string "after", where the diagnostic highlights the U+03C0 character. By default, this line will be printed verbatim to the user when reporting a diagnostic at it, as: beforeπXafter ^ (using X for the stray byte to avoid putting invalid UTF-8 in this commit message) If the diagnostic sets the "escape" flag, it will be printed as: before<U+03C0><BF>after ^~~~~~~~ with -fdiagnostics-escape-format=unicode (the default), or as: before<CF><80><BF>after ^~~~~~~~ if the user supplies -fdiagnostics-escape-format=bytes. This only affects how the source is printed; it does not affect how column numbers that are printed (as per -fdiagnostics-column-unit= and -fdiagnostics-column-origin=). gcc/c-family/ChangeLog: * c-lex.c (c_lex_with_flags): When complaining about non-printable CPP_OTHER tokens, set the "escape on output" flag. gcc/ChangeLog: * common.opt (fdiagnostics-escape-format=): New. (diagnostics_escape_format): New enum. (DIAGNOSTICS_ESCAPE_FORMAT_UNICODE): New enum value. (DIAGNOSTICS_ESCAPE_FORMAT_BYTES): Likewise. * diagnostic-format-json.cc (json_end_diagnostic): Add "escape-source" attribute. * diagnostic-show-locus.c (exploc_with_display_col::exploc_with_display_col): Replace "tabstop" param with a cpp_char_column_policy and add an "aspect" param. Use these to compute m_display_col accordingly. (struct char_display_policy): New struct. (layout::m_policy): New field. (layout::m_escape_on_output): New field. (def_policy): New function. (make_range): Update for changes to exploc_with_display_col ctor. (default_print_decoded_ch): New. (width_per_escaped_byte): New. (escape_as_bytes_width): New. (escape_as_bytes_print): New. (escape_as_unicode_width): New. (escape_as_unicode_print): New. (make_policy): New. (layout::layout): Initialize new fields. Update m_exploc ctor call for above change to ctor. (layout::maybe_add_location_range): Update for changes to exploc_with_display_col ctor. (layout::calculate_x_offset_display): Update for change to cpp_display_width. (layout::print_source_line): Pass policy to cpp_display_width_computation. Capture cpp_decoded_char when calling process_next_codepoint. Move printing of source code to m_policy.m_print_cb. (line_label::line_label): Pass in policy rather than context. (layout::print_any_labels): Update for change to line_label ctor. (get_affected_range): Pass in policy rather than context, updating calls to location_compute_display_column accordingly. (get_printed_columns): Likewise, also for cpp_display_width. (correction::correction): Pass in policy rather than tabstop. (correction::compute_display_cols): Pass m_policy rather than m_tabstop to cpp_display_width. (correction::m_tabstop): Replace with... (correction::m_policy): ...this. (line_corrections::line_corrections): Pass in policy rather than context. (line_corrections::m_context): Replace with... (line_corrections::m_policy): ...this. (line_corrections::add_hint): Update to use m_policy rather than m_context. (line_corrections::add_hint): Likewise. (layout::print_trailing_fixits): Likewise. (selftest::test_display_widths): New. (selftest::test_layout_x_offset_display_utf8): Update to use policy rather than tabstop. (selftest::test_one_liner_labels_utf8): Add test of escaping source lines. (selftest::test_diagnostic_show_locus_one_liner_utf8): Update to use policy rather than tabstop. (selftest::test_overlapped_fixit_printing): Likewise. (selftest::test_overlapped_fixit_printing_utf8): Likewise. (selftest::test_overlapped_fixit_printing_2): Likewise. (selftest::test_tab_expansion): Likewise. (selftest::test_escaping_bytes_1): New. (selftest::test_escaping_bytes_2): New. (selftest::diagnostic_show_locus_c_tests): Call the new tests. * diagnostic.c (diagnostic_initialize): Initialize context->escape_format. (convert_column_unit): Update to use default character width policy. (selftest::test_diagnostic_get_location_text): Likewise. * diagnostic.h (enum diagnostics_escape_format): New enum. (diagnostic_context::escape_format): New field. * doc/invoke.texi (-fdiagnostics-escape-format=): New option. (-fdiagnostics-format=): Add "escape-source" attribute to examples of JSON output, and document it. * input.c (location_compute_display_column): Pass in "policy" rather than "tabstop", passing to cpp_byte_column_to_display_column. (selftest::test_cpp_utf8): Update to use cpp_char_column_policy. * input.h (class cpp_char_column_policy): New forward decl. (location_compute_display_column): Pass in "policy" rather than "tabstop". * opts.c (common_handle_option): Handle OPT_fdiagnostics_escape_format_. * selftest.c (temp_source_file::temp_source_file): New ctor overload taking a size_t. * selftest.h (temp_source_file::temp_source_file): Likewise. gcc/testsuite/ChangeLog: * c-c++-common/diagnostic-format-json-1.c: Add regexp to consume "escape-source" attribute. * c-c++-common/diagnostic-format-json-2.c: Likewise. * c-c++-common/diagnostic-format-json-3.c: Likewise. * c-c++-common/diagnostic-format-json-4.c: Likewise, twice. * c-c++-common/diagnostic-format-json-5.c: Likewise. * gcc.dg/cpp/warn-normalized-4-bytes.c: New test. * gcc.dg/cpp/warn-normalized-4-unicode.c: New test. * gcc.dg/encoding-issues-bytes.c: New test. * gcc.dg/encoding-issues-unicode.c: New test. * gfortran.dg/diagnostic-format-json-1.F90: Add regexp to consume "escape-source" attribute. * gfortran.dg/diagnostic-format-json-2.F90: Likewise. * gfortran.dg/diagnostic-format-json-3.F90: Likewise. libcpp/ChangeLog: * charset.c (convert_escape): Use encoding_rich_location when complaining about nonprintable unknown escape sequences. (cpp_display_width_computation::::cpp_display_width_computation): Pass in policy rather than tabstop. (cpp_display_width_computation::process_next_codepoint): Add "out" param and populate *out if non-NULL. (cpp_display_width_computation::advance_display_cols): Pass NULL to process_next_codepoint. (cpp_byte_column_to_display_column): Pass in policy rather than tabstop. Pass NULL to process_next_codepoint. (cpp_display_column_to_byte_column): Pass in policy rather than tabstop. * errors.c (cpp_diagnostic_get_current_location): New function, splitting out the logic from... (cpp_diagnostic): ...here. (cpp_warning_at): New function. (cpp_pedwarning_at): New function. * include/cpplib.h (cpp_warning_at): New decl for rich_location. (cpp_pedwarning_at): Likewise. (struct cpp_decoded_char): New. (struct cpp_char_column_policy): New. (cpp_display_width_computation::cpp_display_width_computation): Replace "tabstop" param with "policy". (cpp_display_width_computation::process_next_codepoint): Add "out" param. (cpp_display_width_computation::m_tabstop): Replace with... (cpp_display_width_computation::m_policy): ...this. (cpp_byte_column_to_display_column): Replace "tabstop" param with "policy". (cpp_display_width): Likewise. (cpp_display_column_to_byte_column): Likewise. * include/line-map.h (rich_location::escape_on_output_p): New. (rich_location::set_escape_on_output): New. (rich_location::m_escape_on_output): New. * internal.h (cpp_diagnostic_get_current_location): New decl. (class encoding_rich_location): New. * lex.c (skip_whitespace): Use encoding_rich_location when complaining about null characters. (warn_about_normalization): Generate a source range when complaining about improperly normalized tokens, rather than just a point, and use encoding_rich_location so that the source code is escaped on printing. * line-map.c (rich_location::rich_location): Initialize m_escape_on_output. Signed-off-by: David Malcolm <dmalcolm@redhat.com>
author: David Malcolm <dmalcolm@redhat.com> 2021-10-18 18:55:31 -0400
committer: David Malcolm <dmalcolm@redhat.com> 2021-11-01 09:35:46 -0400
commit: bd5e882cf6e0def3dd1bc106075d59a303fe0d1e (patch)
tree: 858681a7eee72064d0b6977f0b68c5fdd0cbb9b7 /gcc/input.c
parent: 91bac9fed5d082f0b180834110ebc0f46f97599a (diff)
download: gcc-bd5e882cf6e0def3dd1bc106075d59a303fe0d1e.zip
gcc-bd5e882cf6e0def3dd1bc106075d59a303fe0d1e.tar.gz
gcc-bd5e882cf6e0def3dd1bc106075d59a303fe0d1e.tar.bz2
1 files changed, 35 insertions, 27 deletions
diff --git a/gcc/input.c b/gcc/input.c
index dd753de..4650547 100644
--- a/gcc/input.c
+++ b/gcc/input.c
@@ -1060,7 +1060,8 @@ make_location (location_t caret, source_range src_range)
    source line in order to calculate the display width.  If that cannot be done
    for any reason, then returns the byte column as a fallback.  */
 int
-location_compute_display_column (expanded_location exploc, int tabstop)
+location_compute_display_column (expanded_location exploc,
+				 const cpp_char_column_policy &policy)
 {
   if (!(exploc.file && *exploc.file && exploc.line && exploc.column))
     return exploc.column;
@@ -1068,7 +1069,7 @@ location_compute_display_column (expanded_location exploc, int tabstop)
   /* If line is NULL, this function returns exploc.column which is the
      desired fallback.  */
   return cpp_byte_column_to_display_column (line.get_buffer (), line.length (),
-					    exploc.column, tabstop);
+					    exploc.column, policy);
 }
 
 /* Dump statistics to stderr about the memory usage of the line_table
@@ -3767,43 +3768,50 @@ test_line_offset_overflow ()
 void test_cpp_utf8 ()
 {
   const int def_tabstop = 8;
+  cpp_char_column_policy policy (def_tabstop, cpp_wcwidth);
+
   /* Verify that wcwidth of invalid UTF-8 or control bytes is 1.  */
   {
-    int w_bad = cpp_display_width ("\xf0!\x9f!\x98!\x82!", 8, def_tabstop);
+    int w_bad = cpp_display_width ("\xf0!\x9f!\x98!\x82!", 8, policy);
     ASSERT_EQ (8, w_bad);
-    int w_ctrl = cpp_display_width ("\r\n\v\0\1", 5, def_tabstop);
+    int w_ctrl = cpp_display_width ("\r\n\v\0\1", 5, policy);
     ASSERT_EQ (5, w_ctrl);
   }
 
   /* Verify that wcwidth of valid UTF-8 is as expected.  */
   {
-    const int w_pi = cpp_display_width ("\xcf\x80", 2, def_tabstop);
+    const int w_pi = cpp_display_width ("\xcf\x80", 2, policy);
     ASSERT_EQ (1, w_pi);
-    const int w_emoji = cpp_display_width ("\xf0\x9f\x98\x82", 4, def_tabstop);
+    const int w_emoji = cpp_display_width ("\xf0\x9f\x98\x82", 4, policy);
     ASSERT_EQ (2, w_emoji);
     const int w_umlaut_precomposed = cpp_display_width ("\xc3\xbf", 2,
-							def_tabstop);
+							policy);
     ASSERT_EQ (1, w_umlaut_precomposed);
     const int w_umlaut_combining = cpp_display_width ("y\xcc\x88", 3,
-						      def_tabstop);
+						      policy);
     ASSERT_EQ (1, w_umlaut_combining);
-    const int w_han = cpp_display_width ("\xe4\xb8\xba", 3, def_tabstop);
+    const int w_han = cpp_display_width ("\xe4\xb8\xba", 3, policy);
     ASSERT_EQ (2, w_han);
-    const int w_ascii = cpp_display_width ("GCC", 3, def_tabstop);
+    const int w_ascii = cpp_display_width ("GCC", 3, policy);
     ASSERT_EQ (3, w_ascii);
     const int w_mixed = cpp_display_width ("\xcf\x80 = 3.14 \xf0\x9f\x98\x82"
 					   "\x9f! \xe4\xb8\xba y\xcc\x88",
-					   24, def_tabstop);
+					   24, policy);
     ASSERT_EQ (18, w_mixed);
   }
 
   /* Verify that display width properly expands tabs.  */
   {
     const char *tstr = "\tabc\td";
-    ASSERT_EQ (6, cpp_display_width (tstr, 6, 1));
-    ASSERT_EQ (10, cpp_display_width (tstr, 6, 3));
-    ASSERT_EQ (17, cpp_display_width (tstr, 6, 8));
-    ASSERT_EQ (1, cpp_display_column_to_byte_column (tstr, 6, 7, 8));
+    ASSERT_EQ (6, cpp_display_width (tstr, 6,
+				     cpp_char_column_policy (1, cpp_wcwidth)));
+    ASSERT_EQ (10, cpp_display_width (tstr, 6,
+				      cpp_char_column_policy (3, cpp_wcwidth)));
+    ASSERT_EQ (17, cpp_display_width (tstr, 6,
+				      cpp_char_column_policy (8, cpp_wcwidth)));
+    ASSERT_EQ (1,
+	       cpp_display_column_to_byte_column
+		 (tstr, 6, 7, cpp_char_column_policy (8, cpp_wcwidth)));
   }
 
   /* Verify that cpp_byte_column_to_display_column can go past the end,
@@ -3816,13 +3824,13 @@ void test_cpp_utf8 ()
       /* 111122223456
 	 Byte columns.  */
 
-    ASSERT_EQ (5, cpp_display_width (str, 6, def_tabstop));
+    ASSERT_EQ (5, cpp_display_width (str, 6, policy));
     ASSERT_EQ (105,
-	       cpp_byte_column_to_display_column (str, 6, 106, def_tabstop));
+	       cpp_byte_column_to_display_column (str, 6, 106, policy));
     ASSERT_EQ (10000,
-	       cpp_byte_column_to_display_column (NULL, 0, 10000, def_tabstop));
+	       cpp_byte_column_to_display_column (NULL, 0, 10000, policy));
     ASSERT_EQ (0,
-	       cpp_byte_column_to_display_column (NULL, 10000, 0, def_tabstop));
+	       cpp_byte_column_to_display_column (NULL, 10000, 0, policy));
   }
 
   /* Verify that cpp_display_column_to_byte_column can go past the end,
@@ -3836,25 +3844,25 @@ void test_cpp_utf8 ()
       /* 000000000000000000000000000000000111111
 	 111122223333444456666777788889999012345
 	 Byte columns.  */
-    ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 2, def_tabstop));
+    ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 2, policy));
     ASSERT_EQ (15,
-	       cpp_display_column_to_byte_column (str, 15, 11, def_tabstop));
+	       cpp_display_column_to_byte_column (str, 15, 11, policy));
     ASSERT_EQ (115,
-	       cpp_display_column_to_byte_column (str, 15, 111, def_tabstop));
+	       cpp_display_column_to_byte_column (str, 15, 111, policy));
     ASSERT_EQ (10000,
-	       cpp_display_column_to_byte_column (NULL, 0, 10000, def_tabstop));
+	       cpp_display_column_to_byte_column (NULL, 0, 10000, policy));
     ASSERT_EQ (0,
-	       cpp_display_column_to_byte_column (NULL, 10000, 0, def_tabstop));
+	       cpp_display_column_to_byte_column (NULL, 10000, 0, policy));
 
     /* Verify that we do not interrupt a UTF-8 sequence.  */
-    ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 1, def_tabstop));
+    ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 1, policy));
 
     for (int byte_col = 1; byte_col <= 15; ++byte_col)
       {
 	const int disp_col
-	  = cpp_byte_column_to_display_column (str, 15, byte_col, def_tabstop);
+	  = cpp_byte_column_to_display_column (str, 15, byte_col, policy);
 	const int byte_col2
-	  = cpp_display_column_to_byte_column (str, 15, disp_col, def_tabstop);
+	  = cpp_display_column_to_byte_column (str, 15, disp_col, policy);
 
 	/* If we ask for the display column in the middle of a UTF-8
 	   sequence, it will return the length of the partial sequence,
author	David Malcolm <dmalcolm@redhat.com>	2021-10-18 18:55:31 -0400
committer	David Malcolm <dmalcolm@redhat.com>	2021-11-01 09:35:46 -0400
commit	bd5e882cf6e0def3dd1bc106075d59a303fe0d1e (patch)
tree	858681a7eee72064d0b6977f0b68c5fdd0cbb9b7 /gcc/input.c
parent	91bac9fed5d082f0b180834110ebc0f46f97599a (diff)
download	gcc-bd5e882cf6e0def3dd1bc106075d59a303fe0d1e.zip gcc-bd5e882cf6e0def3dd1bc106075d59a303fe0d1e.tar.gz gcc-bd5e882cf6e0def3dd1bc106075d59a303fe0d1e.tar.bz2