From 84ae3392d8b001acb9731be6d95821f32704e3e6 Mon Sep 17 00:00:00 2001
From: Steve Bennett
Date: Tue, 2 Nov 2010 21:20:36 +1000
Subject: Updates to the UTF-8 documentation

Signed-off-by: Steve Bennett
---
 README.utf-8     | 17 ++++++++++++-----
 Tcl_shipped.html | 18 +++++++++---------
 jim_tcl.txt      | 16 ++++++++--------
 3 files changed, 29 insertions(+), 22 deletions(-)

diff --git a/README.utf-8 b/README.utf-8
index ad6c7b5..eca528c 100644
--- a/README.utf-8
+++ b/README.utf-8
@@ -98,11 +98,18 @@ unicode data table at http://unicode.org/Public/UNIDATA/UnicodeData.txt

 Working with Binary Data and non-UTF-8 encodings
 ------------------------------------------------
-If it is necessary to work with both UTF-8 and binary data (bytes
->= 0x80), or non-UTF-8 encodings you will need to arrange for the
-data to be converted between UTF-8 on input and output. Individual
-characters can be converted from Unicode to UTF-8 with the
-utf8_fromunicode() function and the reverse with utf8_tounicode().
+Almost all Jim commands will work identically with binary data and
+UTF-8 encoded data, including read, gets, puts and 'string eq'. It
+is only certain string manipulation commands which will operated
+differently. For example, 'string index' will return UTF-8 characters,
+not bytes.
+
+If it is necessary to manipulate strings containing binary, non-ASCII
+data (bytes >= 0x80), there are two options.
+
+1. Build Jim without UTF-8 support
+2. Arrange to encode and decode binary data or data in other encodings
+ to UTF-8 before manipulation.

 Internal Details
 ----------------
diff --git a/Tcl_shipped.html b/Tcl_shipped.html
index 7cc23e5..ba1d32c 100644
--- a/Tcl_shipped.html
+++ b/Tcl_shipped.html
@@ -2372,18 +2372,18 @@ characters can take a variable number of bytes. Thus the addition of
while string length returns the number of characters.

If UTF-8 support is not enabled, all commands treat bytes as characters and string bytelength returns the same value as string length.

-Note that even if UTF-8 support is not enabled, the \uNNNN syntax
+Note that even if UTF-8 support is not enabled, the \uNNNN syntax
is still available to embed UTF-8 sequences.

String Matching

Commands such as string match, lsearch -glob, array names and others use string pattern matching rules. These commands support UTF-8. For example:

-string match a\[\ua0-\ubf\]b "a\a3b"
+string match a\[\ua0-\ubf\]b "a\u00a3b"

format and scan

-format %c allows a unicode codepoint to be be encoded. For example, the following will return
-a string with two bytes and one character. The same as \ub5
+format %c allows a unicode codepoint to be be encoded. For example, the following will return
+a string with two bytes and one character. The same as \ub5

format %c 0xb5
@@ -2394,11 +2394,11 @@ return a string with three characters, not three bytes.

format %.3s \ub5\ub6\ub7\ub8
-Similarly, scan … %c allows a UTF-8 to be decoded to a unicode codepoint. The following will set
-a to 181 (0xb5) and b to 181 and b to 65.
+Similarly, scan … %c allows a UTF-8 to be decoded to a unicode codepoint. The following will set
+a to 181 (0xb5) and b to 65 (0x41).

-scan \00b5A %c%c a b
+scan \u00b5A %c%c a b

scan %s will also accept a character class, including unicode ranges.

String Classes

@@ -2406,7 +2406,7 @@ return a string with three characters, not three bytes.

will return 0, even though the string may be considered to be alphabetic.

-string is \b5Test
+string is alpha \ub5Test

This does not affect the string classes ascii, control, digit, double, integer or xdigit.

Case Mapping and Conversion

@@ -6408,7 +6408,7 @@ official policies, either expressed or implied, of the Jim Tcl Project.
diff --git a/jim_tcl.txt b/jim_tcl.txt
index 4984655..443f96b 100644
--- a/jim_tcl.txt
+++ b/jim_tcl.txt
@@ -1372,7 +1372,7 @@ while 'string length' returns the number of characters.
 If UTF-8 support is not enabled, all commands treat bytes as characters and
 'string bytelength' returns the same value as 'string length'.

-Note that even if UTF-8 support is not enabled, the {backslash}uNNNN syntax
+Note that even if UTF-8 support is not enabled, the +{backslash}uNNNN+ syntax
 is still available to embed UTF-8 sequences.

 String Matching
@@ -1380,12 +1380,12 @@ String Matching
 Commands such as 'string match', 'lsearch -glob', 'array names' and others use string
 pattern matching rules. These commands support UTF-8. For example:

- string match a\[\ua0-\ubf\]b "a\a3b"
+ string match a\[\ua0-\ubf\]b "a\u00a3b"

 format and scan
 ~~~~~~~~~~~~~~~
-'format %c' allows a unicode codepoint to be be encoded. For example, the following will return
-a string with two bytes and one character. The same as {backslash}ub5
++format %c+ allows a unicode codepoint to be be encoded. For example, the following will return
+a string with two bytes and one character. The same as +{backslash}ub5+

 format %c 0xb5

@@ -1394,10 +1394,10 @@ return a string with three characters, not three bytes.

 format %.3s \ub5\ub6\ub7\ub8

-Similarly, 'scan ... %c' allows a UTF-8 to be decoded to a unicode codepoint. The following will set
-*a* to 181 (0xb5) and *b* to '181' and 'b' to 65.
+Similarly, +scan ... %c+ allows a UTF-8 to be decoded to a unicode codepoint. The following will set
+*a* to 181 (0xb5) and *b* to 65 (0x41).

- scan \00b5A %c%c a b
+ scan \u00b5A %c%c a b

 'scan %s' will also accept a character class, including unicode ranges.

@@ -1406,7 +1406,7 @@ String Classes
 'string is' has *not* been extended to classify UTF-8 characters. Therefore, the following
 will return 0, even though the string may be considered to be alphabetic.

- string is \b5Test
+ string is alpha \ub5Test

 This does not affect the string classes 'ascii', 'control', 'digit', 'double', 'integer' or 'xdigit'.

--
cgit v1.1