diff options
author | Pedro Alves <palves@redhat.com> | 2017-11-21 00:02:46 +0000 |
---|---|---|
committer | Pedro Alves <palves@redhat.com> | 2017-11-21 00:02:46 +0000 |
commit | e1ef7d7a5166f846b14bea5a77acb0dec76661a8 (patch) | |
tree | 7213176a99da2ac71ba20319a2e4721c62b385e4 /gdb | |
parent | 9752982420bffb6cace1a25830a92db7fcd1dbd0 (diff) | |
download | gdb-e1ef7d7a5166f846b14bea5a77acb0dec76661a8.zip gdb-e1ef7d7a5166f846b14bea5a77acb0dec76661a8.tar.gz gdb-e1ef7d7a5166f846b14bea5a77acb0dec76661a8.tar.bz2 |
0xff chars in name components table; cp-name-parser lex UTF-8 identifiers
The find-upper-bound-for-completion algorithm in the name components
accelerator table in dwarf2read.c increments a char in a string, and
asserts that it's not incrementing a 0xff char, but that's incorrect.
First, we shouldn't be calling gdb_assert on input.
Then, if "char" is signed, comparing a caracther with "0xff" will
never yield true, which is caught by Clang with:
error: comparison of constant 255 with expression of type '....' (aka 'char') is always true [-Werror,-Wtautological-constant-out-of-range-compare]
gdb_assert (after.back () != 0xff);
~~~~~~~~~~~~~ ^ ~~~~
And then, 0xff is a valid character on non-UTF-8/ASCII character sets.
E.g., it's 'ÿ' in Latin1. While GCC nor Clang support !ASCII &&
!UTF-8 characters in identifiers (GCC supports UTF-8 characters only
via UCNs, see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html),
but other compilers might (Visual Studio?), so it doesn't hurt to
handle it correctly. Testing is covered by extending the
dw2_expand_symtabs_matching unit tests with relevant cases.
However, without further changes, the unit tests still fail... The
problem is that cp-name-parser.y assumes that identifiers are ASCII
(via ISALPHA/ISALNUM). This commit fixes that too, so that we can
unit test the dwarf2read.c changes. (The regular C/C++ lexer in
c-lang.y needs a similar treatment, but I'm leaving that for another
patch.)
While doing this, I noticed a thinko in the computation of the upper
bound for completion in dw2_expand_symtabs_matching_symbol. We're
using std::upper_bound but we should use std::lower_bound. I extended
the unit test with a case that I thought would expose it, this one:
+ /* These are used to check that the increment-last-char in the
+ matching algorithm for completion doesn't match "t1_fund" when
+ completing "t1_func". */
+ "t1_func",
+ "t1_func1",
+ "t1_fund",
+ "t1_fund1",
The algorithm actually returns "t1_fund1" as lower bound, so "t1_fund"
matches incorrectly. But turns out the problem is masked because
later here:
for (;lower != upper; ++lower)
{
const char *qualified = index.symbol_name_at (lower->idx);
if (!lookup_name_matcher.matches (qualified)
the lookup_name_matcher.matches check above filters out "t1_fund"
because that doesn't start with "t1_func".
I'll fix the latent bug in follow up patches, after factoring things
out a bit in a way that allows unit testing the relevant code more
directly.
gdb/ChangeLog:
2017-11-21 Pedro Alves <palves@redhat.com>
* cp-name-parser.y (cp_ident_is_alpha, cp_ident_is_alnum): New.
(symbol_end): Use cp_ident_is_alnum.
(yylex): Use cp_ident_is_alpha and cp_ident_is_alnum.
* dwarf2read.c (make_sort_after_prefix_name): New function.
(dw2_expand_symtabs_matching_symbol): Use it.
(test_symbols): Add more symbols.
(run_test): Add tests.
Diffstat (limited to 'gdb')
-rw-r--r-- | gdb/ChangeLog | 10 | ||||
-rw-r--r-- | gdb/cp-name-parser.y | 28 | ||||
-rw-r--r-- | gdb/dwarf2read.c | 136 |
3 files changed, 157 insertions, 17 deletions
diff --git a/gdb/ChangeLog b/gdb/ChangeLog index 31e447a..1a38573 100644 --- a/gdb/ChangeLog +++ b/gdb/ChangeLog @@ -1,3 +1,13 @@ +2017-11-21 Pedro Alves <palves@redhat.com> + + * cp-name-parser.y (cp_ident_is_alpha, cp_ident_is_alnum): New. + (symbol_end): Use cp_ident_is_alnum. + (yylex): Use cp_ident_is_alpha and cp_ident_is_alnum. + * dwarf2read.c (make_sort_after_prefix_name): New function. + (dw2_expand_symtabs_matching_symbol): Use it. + (test_symbols): Add more symbols. + (run_test): Add tests. + 2017-11-17 Tom Tromey <tom@tromey.com> * symtab.h (enum symbol_subclass_kind): New. diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y index 33ecf13..fdfbf15 100644 --- a/gdb/cp-name-parser.y +++ b/gdb/cp-name-parser.y @@ -1304,6 +1304,28 @@ d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs)); } +/* Like ISALPHA, but also returns true for the union of all UTF-8 + multi-byte sequence bytes and non-ASCII characters in + extended-ASCII charsets (e.g., Latin1). I.e., returns true if the + high bit is set. Note that not all UTF-8 ranges are allowed in C++ + identifiers, but we don't need to be pedantic so for simplicity we + ignore that here. Plus this avoids the complication of actually + knowing what was the right encoding. */ + +static inline bool +cp_ident_is_alpha (unsigned char ch) +{ + return ISALPHA (ch) || ch >= 0x80; +} + +/* Similarly, but Like ISALNUM. */ + +static inline bool +cp_ident_is_alnum (unsigned char ch) +{ + return ISALNUM (ch) || ch >= 0x80; +} + /* Find the end of a symbol name starting at LEXPTR. */ static const char * @@ -1311,7 +1333,7 @@ symbol_end (const char *lexptr) { const char *p = lexptr; - while (*p && (ISALNUM (*p) || *p == '_' || *p == '$' || *p == '.')) + while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.')) p++; return p; @@ -1791,7 +1813,7 @@ yylex (void) return ERROR; } - if (!(c == '_' || c == '$' || ISALPHA (c))) + if (!(c == '_' || c == '$' || cp_ident_is_alpha (c))) { /* We must have come across a bad character (e.g. ';'). */ yyerror (_("invalid character")); @@ -1802,7 +1824,7 @@ yylex (void) namelen = 0; do c = tokstart[++namelen]; - while (ISALNUM (c) || c == '_' || c == '$'); + while (cp_ident_is_alnum (c) || c == '_' || c == '$'); lexptr += namelen; diff --git a/gdb/dwarf2read.c b/gdb/dwarf2read.c index 5437d21..96c1393 100644 --- a/gdb/dwarf2read.c +++ b/gdb/dwarf2read.c @@ -4188,6 +4188,79 @@ gdb_index_symbol_name_matcher::matches (const char *symbol_name) return false; } +/* Starting from a search name, return the string that finds the upper + bound of all strings that start with SEARCH_NAME in a sorted name + list. Returns the empty string to indicate that the upper bound is + the end of the list. */ + +static std::string +make_sort_after_prefix_name (const char *search_name) +{ + /* When looking to complete "func", we find the upper bound of all + symbols that start with "func" by looking for where we'd insert + the closest string that would follow "func" in lexicographical + order. Usually, that's "func"-with-last-character-incremented, + i.e. "fund". Mind non-ASCII characters, though. Usually those + will be UTF-8 multi-byte sequences, but we can't be certain. + Especially mind the 0xff character, which is a valid character in + non-UTF-8 source character sets (e.g. Latin1 'ÿ'), and we can't + rule out compilers allowing it in identifiers. Note that + conveniently, strcmp/strcasecmp are specified to compare + characters interpreted as unsigned char. So what we do is treat + the whole string as a base 256 number composed of a sequence of + base 256 "digits" and add 1 to it. I.e., adding 1 to 0xff wraps + to 0, and carries 1 to the following more-significant position. + If the very first character in SEARCH_NAME ends up incremented + and carries/overflows, then the upper bound is the end of the + list. The string after the empty string is also the empty + string. + + Some examples of this operation: + + SEARCH_NAME => "+1" RESULT + + "abc" => "abd" + "ab\xff" => "ac" + "\xff" "a" "\xff" => "\xff" "b" + "\xff" => "" + "\xff\xff" => "" + "" => "" + + Then, with these symbols for example: + + func + func1 + fund + + completing "func" looks for symbols between "func" and + "func"-with-last-character-incremented, i.e. "fund" (exclusive), + which finds "func" and "func1", but not "fund". + + And with: + + funcÿ (Latin1 'ÿ' [0xff]) + funcÿ1 + fund + + completing "funcÿ" looks for symbols between "funcÿ" and "fund" + (exclusive), which finds "funcÿ" and "funcÿ1", but not "fund". + + And with: + + ÿÿ (Latin1 'ÿ' [0xff]) + ÿÿ1 + + completing "ÿ" or "ÿÿ" looks for symbols between between "ÿÿ" and + the end of the list. + */ + std::string after = search_name; + while (!after.empty () && (unsigned char) after.back () == 0xff) + after.pop_back (); + if (!after.empty ()) + after.back () = (unsigned char) after.back () + 1; + return after; +} + /* Helper for dw2_expand_symtabs_matching that works with a mapped_index instead of the containing objfile. This is split to a separate function in order to be able to unit test the @@ -4303,21 +4376,20 @@ dw2_expand_symtabs_matching_symbol { if (lookup_name_in.completion_mode ()) { - /* The string frobbing below won't work if the string is - empty. We don't need it then, anyway -- if we're - completing an empty string, then we want to iterate over - the whole range. */ - if (cplus[0] == '\0') + /* In completion mode, we want UPPER to point past all + symbols names that have the same prefix. I.e., with + these symbols, and completing "func": + + function << lower bound + function1 + other_function << upper bound + + We find the upper bound by looking for the insertion + point of "func"-with-last-character-incremented, + i.e. "fund". */ + std::string after = make_sort_after_prefix_name (cplus); + if (after.empty ()) return end; - - /* In completion mode, increment the last character because - we want UPPER to point past all symbols names that have - the same prefix. */ - std::string after = cplus; - - gdb_assert (after.back () != 0xff); - after.back ()++; - return std::upper_bound (lower, end, after.c_str (), lookup_compare_upper); } @@ -4493,6 +4565,26 @@ static const char *test_symbols[] = { "ns::foo<int>", "ns::foo<long>", + /* These are used to check that the increment-last-char in the + matching algorithm for completion doesn't match "t1_fund" when + completing "t1_func". */ + "t1_func", + "t1_func1", + "t1_fund", + "t1_fund1", + + /* A UTF-8 name with multi-byte sequences to make sure that + cp-name-parser understands this as a single identifier ("função" + is "function" in PT). */ + u8"u8função", + + /* \377 (0xff) is Latin1 'ÿ'. */ + "yfunc\377", + + /* \377 (0xff) is Latin1 'ÿ'. */ + "\377", + "\377\377123", + /* A name with all sorts of complications. Starts with "z" to make it easier for the completion tests below. */ #define Z_SYM_NAME \ @@ -4551,6 +4643,22 @@ run_test () {}); } + /* Check that the name matching algorithm for completion doesn't get + confused with Latin1 'ÿ' / 0xff. */ + { + static const char str[] = "\377"; + CHECK_MATCH (str, symbol_name_match_type::FULL, true, + EXPECT ("\377", "\377\377123")); + } + + /* Check that the increment-last-char in the matching algorithm for + completion doesn't match "t1_fund" when completing "t1_func". */ + { + static const char str[] = "t1_func"; + CHECK_MATCH (str, symbol_name_match_type::FULL, true, + EXPECT ("t1_func", "t1_func1")); + } + /* Check that completion mode works at each prefix of the expected symbol name. */ { |