0xff chars in name components table; cp-name-parser lex UTF-8 identifiers

The find-upper-bound-for-completion algorithm in the name components accelerator table in dwarf2read.c increments a char in a string, and asserts that it's not incrementing a 0xff char, but that's incorrect. First, we shouldn't be calling gdb_assert on input. Then, if "char" is signed, comparing a caracther with "0xff" will never yield true, which is caught by Clang with: error: comparison of constant 255 with expression of type '....' (aka 'char') is always true [-Werror,-Wtautological-constant-out-of-range-compare] gdb_assert (after.back () != 0xff); ~~~~~~~~~~~~~ ^ ~~~~ And then, 0xff is a valid character on non-UTF-8/ASCII character sets. E.g., it's 'ÿ' in Latin1. While GCC nor Clang support !ASCII && !UTF-8 characters in identifiers (GCC supports UTF-8 characters only via UCNs, see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html), but other compilers might (Visual Studio?), so it doesn't hurt to handle it correctly. Testing is covered by extending the dw2_expand_symtabs_matching unit tests with relevant cases. However, without further changes, the unit tests still fail... The problem is that cp-name-parser.y assumes that identifiers are ASCII (via ISALPHA/ISALNUM). This commit fixes that too, so that we can unit test the dwarf2read.c changes. (The regular C/C++ lexer in c-lang.y needs a similar treatment, but I'm leaving that for another patch.) While doing this, I noticed a thinko in the computation of the upper bound for completion in dw2_expand_symtabs_matching_symbol. We're using std::upper_bound but we should use std::lower_bound. I extended the unit test with a case that I thought would expose it, this one: + /* These are used to check that the increment-last-char in the + matching algorithm for completion doesn't match "t1_fund" when + completing "t1_func". */ + "t1_func", + "t1_func1", + "t1_fund", + "t1_fund1", The algorithm actually returns "t1_fund1" as lower bound, so "t1_fund" matches incorrectly. But turns out the problem is masked because later here: for (;lower != upper; ++lower) { const char *qualified = index.symbol_name_at (lower->idx); if (!lookup_name_matcher.matches (qualified) the lookup_name_matcher.matches check above filters out "t1_fund" because that doesn't start with "t1_func". I'll fix the latent bug in follow up patches, after factoring things out a bit in a way that allows unit testing the relevant code more directly. gdb/ChangeLog: 2017-11-21 Pedro Alves <palves@redhat.com> * cp-name-parser.y (cp_ident_is_alpha, cp_ident_is_alnum): New. (symbol_end): Use cp_ident_is_alnum. (yylex): Use cp_ident_is_alpha and cp_ident_is_alnum. * dwarf2read.c (make_sort_after_prefix_name): New function. (dw2_expand_symtabs_matching_symbol): Use it. (test_symbols): Add more symbols. (run_test): Add tests.
author: Pedro Alves <palves@redhat.com> 2017-11-21 00:02:46 +0000
committer: Pedro Alves <palves@redhat.com> 2017-11-21 00:02:46 +0000
commit: e1ef7d7a5166f846b14bea5a77acb0dec76661a8 (patch)
tree: 7213176a99da2ac71ba20319a2e4721c62b385e4 /gdb/cp-name-parser.y
parent: 9752982420bffb6cace1a25830a92db7fcd1dbd0 (diff)
download: gdb-e1ef7d7a5166f846b14bea5a77acb0dec76661a8.zip
gdb-e1ef7d7a5166f846b14bea5a77acb0dec76661a8.tar.gz
gdb-e1ef7d7a5166f846b14bea5a77acb0dec76661a8.tar.bz2
1 files changed, 25 insertions, 3 deletions
diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y
index 33ecf13..fdfbf15 100644
--- a/gdb/cp-name-parser.y
+++ b/gdb/cp-name-parser.y
@@ -1304,6 +1304,28 @@ d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp
 		      fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs));
 }
 
+/* Like ISALPHA, but also returns true for the union of all UTF-8
+   multi-byte sequence bytes and non-ASCII characters in
+   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
+   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
+   identifiers, but we don't need to be pedantic so for simplicity we
+   ignore that here.  Plus this avoids the complication of actually
+   knowing what was the right encoding.  */
+
+static inline bool
+cp_ident_is_alpha (unsigned char ch)
+{
+  return ISALPHA (ch) || ch >= 0x80;
+}
+
+/* Similarly, but Like ISALNUM.  */
+
+static inline bool
+cp_ident_is_alnum (unsigned char ch)
+{
+  return ISALNUM (ch) || ch >= 0x80;
+}
+
 /* Find the end of a symbol name starting at LEXPTR.  */
 
 static const char *
@@ -1311,7 +1333,7 @@ symbol_end (const char *lexptr)
 {
   const char *p = lexptr;
 
-  while (*p && (ISALNUM (*p) || *p == '_' || *p == '$' || *p == '.'))
+  while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
     p++;
 
   return p;
@@ -1791,7 +1813,7 @@ yylex (void)
       return ERROR;
     }
 
-  if (!(c == '_' || c == '$' || ISALPHA (c)))
+  if (!(c == '_' || c == '$' || cp_ident_is_alpha (c)))
     {
       /* We must have come across a bad character (e.g. ';').  */
       yyerror (_("invalid character"));
@@ -1802,7 +1824,7 @@ yylex (void)
   namelen = 0;
   do
     c = tokstart[++namelen];
-  while (ISALNUM (c) || c == '_' || c == '$');
+  while (cp_ident_is_alnum (c) || c == '_' || c == '$');
 
   lexptr += namelen;
author	Pedro Alves <palves@redhat.com>	2017-11-21 00:02:46 +0000
committer	Pedro Alves <palves@redhat.com>	2017-11-21 00:02:46 +0000
commit	e1ef7d7a5166f846b14bea5a77acb0dec76661a8 (patch)
tree	7213176a99da2ac71ba20319a2e4721c62b385e4 /gdb/cp-name-parser.y
parent	9752982420bffb6cace1a25830a92db7fcd1dbd0 (diff)
download	gdb-e1ef7d7a5166f846b14bea5a77acb0dec76661a8.zip gdb-e1ef7d7a5166f846b14bea5a77acb0dec76661a8.tar.gz gdb-e1ef7d7a5166f846b14bea5a77acb0dec76661a8.tar.bz2