Handle non-ASCII identifiers in Ada

Ada allows non-ASCII identifiers, and GNAT supports several such encodings. This patch adds the corresponding support to gdb.

GNAT encodes non-ASCII characters using special symbol names. For character sets like Latin-1, where all characters are a single byte, it uses a "U" followed by the hex for the character. So, for example, thorn would be encoded as "Ufe" (0xFE being lower case thorn).

For wider characters, despite what the manual says (it claims Shift-JIS and EUC can be used), in practice recent versions only support Unicode. Here, characters in the base plane are represented using "Wxxxx" and characters outside the base plane using "WWxxxxxxxx".

GNAT has some further quirks here. Ada is case-insensitive, and GNAT emits symbols that have been case-folded. For characters in ASCII, and for all characters in non-Unicode character sets, lower case is used. For Unicode, however, characters that fit in a single byte are converted to lower case, but all others are converted to upper case.

Furthermore, there is a bug in GNAT where two symbols that differ only in the case of "Y WITH DIAERESIS" (and potentially others, I did not check exhaustively) can be used in one program. I chose to omit handling this case from gdb, on the theory that it is hard to figure out the logic, and anyway if the bug is ever fixed, we'll regret having a heuristic.

This patch introduces a new "ada source-charset" setting. It defaults to Latin-1, as that is GNAT's default. This setting controls how "U" characters are decoded -- W/WW are always handled as UTF-32.

The ada_tag_name_from_tsd change is needed because this function will read memory from the inferior and interpret it -- and this caused an encoding failure on PPC when running a test that tries to read uninitialized memory.

This patch implements its own UTF-32-based case folder. This avoids host platform quirks, and is relatively simple. A short Python program to generate the case-folding table is included. It simply relies on whatever version of Unicode is used by the host Python, which seems basically acceptable.

Test cases for UTF-8, Latin-1, and Latin-3 are included. This exercises most of the new code paths, aside from Y WITH DIAERESIS as noted above.
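
As an illustration only, the folding and encoding rules described in this message can be sketched in Python. This is not the code in the patch: the names gnat_encode and _fold are hypothetical, the lower-case hex digits in the "W" and "WW" forms are an assumption extrapolated from the "Ufe" example, and so is the use of the "U" form for single-byte characters in Unicode mode. Corner cases such as Y WITH DIAERESIS are not modelled.

```python
from typing import Optional


def _fold(ch: str, to_upper: bool) -> str:
    """Fold one character; keep it unchanged if the mapping is not 1:1."""
    folded = ch.upper() if to_upper else ch.lower()
    return folded if len(folded) == 1 else ch


def gnat_encode(name: str, charset: Optional[str] = None) -> str:
    """Hypothetical sketch of the symbol-name encoding described above.

    charset names a single-byte source charset such as "latin-1";
    None means the source is treated as Unicode.
    """
    out = []
    for ch in name:
        cp = ord(ch)
        if cp < 0x80:
            # ASCII is always folded to lower case and emitted as-is.
            out.append(ch.lower())
        elif charset is not None:
            # Non-Unicode charset: lower case, then "U" plus the byte in hex.
            raw = _fold(ch, to_upper=False).encode(charset)
            if len(raw) != 1:
                raise ValueError("not a single-byte charset: %s" % charset)
            out.append("U%02x" % raw[0])
        elif cp <= 0xFF:
            # Unicode, fits in one byte: lower case (assumed "U" form).
            out.append("U%02x" % ord(_fold(ch, to_upper=False)))
        else:
            # Unicode, wider than one byte: upper case, then W or WW.
            folded = ord(_fold(ch, to_upper=True))
            if folded <= 0xFFFF:
                out.append("W%04x" % folded)
            else:
                out.append("WW%08x" % folded)
    return "".join(out)


if __name__ == "__main__":
    print(gnat_encode("thorn_\u00fe", "latin-1"))
    print(gnat_encode("var_\u01b9"))
    print(gnat_encode("var_\U00010429"))
```

Under these assumptions, both spellings of each identifier used in the test below map to the same encoded name, which is why the test expects the upper-case and lower-case forms to print the same value.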
author: Tom Tromey <tromey@adacore.com> 2022-02-03 10:42:07 -0700
committer: Tom Tromey <tromey@adacore.com> 2022-03-07 07:52:59 -0700
commit: 315e4ebb4b7ef01da2f5c419edc74f39a0122d20 (patch)
tree: ed8a010b58b1f7cb532b83d602b39adfc07397f8 /gdb/testsuite/gdb.ada/non-ascii-utf-8.exp
parent: ee3d46491537e343c276a7fc455dd94812fd3f72 (diff)
1 files changed, 57 insertions, 0 deletions
diff --git a/gdb/testsuite/gdb.ada/non-ascii-utf-8.exp b/gdb/testsuite/gdb.ada/non-ascii-utf-8.exp
new file mode 100644
index 0000000..4ab0ca5
--- /dev/null
+++ b/gdb/testsuite/gdb.ada/non-ascii-utf-8.exp
@@ -0,0 +1,57 @@
+# Copyright 2022 Free Software Foundation, Inc.
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# Test UTF-8 identifiers.
+
+load_lib "ada.exp"
+
+if { [skip_ada_tests] } { return -1 }
+
+# Enable basic use of UTF-8.  LC_ALL gets reset for each testfile.
+setenv LC_ALL C.UTF-8
+
+standard_ada_testfile prog
+
+set flags [list debug additional_flags=-gnatW8]
+if {[gdb_compile_ada "${srcfile}" "${binfile}" executable $flags] != ""} {
+    return -1
+}
+
+# Restart without an executable so that we can set the encoding early.
+clean_restart
+
+gdb_test_no_output "set ada source-charset UTF-8"
+
+gdb_load ${binfile}
+
+set bp_location [gdb_get_line_number "BREAK" ${testdir}/prog.adb]
+runto "prog.adb:$bp_location"
+
+gdb_test "print VAR_Ü" " = 23"
+gdb_test "print var_ü" " = 23"
+gdb_test "print VAR_Ƹ" " = 24"
+gdb_test "print var_ƹ" " = 24"
+gdb_test "print VAR_𐐁" " = 25"
+gdb_test "print var_𐐩" " = 25"
+gdb_test "print VAR_Ż" " = 26"
+gdb_test "print var_ż" " = 26"
+
+gdb_breakpoint "FUNC_Ü" message
+gdb_breakpoint "func_ü" message
+gdb_breakpoint "FUNC_Ƹ" message
+gdb_breakpoint "func_ƹ" message
+gdb_breakpoint "FUNC_Ż" message
+gdb_breakpoint "func_ż" message
+gdb_breakpoint "FUNC_𐐁" message