riscv-gnu-toolchain/llvm.git

author	Mark de Wever <koraq@xs4all.nl>	2023-02-09 21:38:42 +0100
committer	Mark de Wever <koraq@xs4all.nl>	2023-03-08 22:01:49 +0100
commit	c866855b42eb3e8aa7578aadb26e4431d1d71efd (patch)
tree	505aa6f1dbb08dac48a8dede3ac2257b05219d0b /lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.cpp
parent	95bc01dbecda0e3c4fe95ed4f515cd64a4ed1555 (diff)

[libc++][format] Improves Unicode decoders.

During the implementation of P2286 a second Unicode decoder was added. The original decoder was only used for the width estimation. Changing an ill-formed Unicode sequence to the replacement character works properly for this use case. For P2286 an ill-formed Unicode sequence needs to be formatted as a sequence of code units. The exact wording in the Standard was a bit unclear and there was an odd example in the WP. This made it hard to use the same decoder. SG16 determined the odd example in the WP was a bug and this has been fixed in the WP. This made it possible to combine the two decoders.

The P2286 decoder kept track of the size of the ill-formed sequence. However, this was not needed since the output algorithm already needs to keep track of the size of both well-formed and ill-formed sequences. So this feature has been removed. The error status remains since it's needed for P2286; the grapheme clustering does not need this value and can ignore it. (In general, grapheme clustering only has specified behaviour for Unicode. When the string is in a non-Unicode encoding there are no requirements. Ill-formed Unicode is a non-Unicode encoding. Still, libc++ does a best-effort estimation.)

The UTF-8 decoder accepted several ill-formed sequences:
- Values in the surrogate range U+D800..U+DFFF.
- Values encoded in more code units than required. For example, U+0020 can in theory be encoded using 1, 2, 3, or 4 code units, and all of these encodings were accepted. This is not allowed by the Unicode Standard.
- Values larger than U+10FFFF were not always rejected.

Reviewed By: #libc, ldionne, tahonermann, Mordante

Differential Revision: https://reviews.llvm.org/D144346
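The three classes of ill-formed UTF-8 listed above can be illustrated with a small standalone decoder. This is only a sketch and not the libc++ implementation: the names `DecodeResult`, `decode_utf8` and `kReplacement` are hypothetical, and consuming a single code unit on error is a simplifying assumption. It assumes C++17 and a non-empty input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical result type for this sketch (not a libc++ type).
struct DecodeResult {
    char32_t code_point;  // U+FFFD when the sequence is ill-formed
    std::size_t length;   // number of code units consumed, always >= 1
    bool ill_formed;      // error status, which a caller may ignore
};

constexpr char32_t kReplacement = char32_t{0xFFFD};

// Decodes the first code point of a non-empty UTF-8 string.
// Assumption of this sketch: an ill-formed sequence consumes one code unit.
inline DecodeResult decode_utf8(std::string_view s) {
    const DecodeResult error{kReplacement, 1, true};
    const auto b0 = static_cast<unsigned char>(s[0]);

    if (b0 < 0x80)  // ASCII: one code unit
        return {b0, 1, false};

    std::size_t n = 0;    // total code units announced by the lead byte
    char32_t cp = 0;      // payload bits collected so far
    char32_t minimum = 0; // smallest value that needs n code units
    if ((b0 & 0xE0) == 0xC0)      { n = 2; cp = b0 & 0x1F; minimum = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { n = 3; cp = b0 & 0x0F; minimum = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { n = 4; cp = b0 & 0x07; minimum = 0x10000; }
    else return error;  // stray continuation byte or invalid lead byte

    for (std::size_t i = 1; i < n; ++i) {
        if (i >= s.size()) return error;  // truncated sequence
        const auto b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return error;  // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }

    if (cp < minimum) return error;                  // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return error;  // surrogate range
    if (cp > 0x10FFFF) return error;                 // beyond Unicode range
    return {cp, n, false};
}
```

With this sketch, `decode_utf8("\x20")` yields U+0020, while the two-unit overlong form `"\xC0\xA0"`, the surrogate `"\xED\xA0\x80"` and the out-of-range `"\xF4\x90\x80\x80"` are all reported as ill-formed with the replacement character.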

Diffstat (limited to 'lldb/source/Plugins/ObjectFile/JSON/ObjectFileJSON.cpp')

0 files changed, 0 insertions, 0 deletions

