From 84ae3392d8b001acb9731be6d95821f32704e3e6 Mon Sep 17 00:00:00 2001
From: Steve Bennett <steveb@workware.net.au>
Date: Tue, 2 Nov 2010 21:20:36 +1000
Subject: Updates to the UTF-8 documentation

Signed-off-by: Steve Bennett <steveb@workware.net.au>
---
 README.utf-8     | 17 ++++++++++++-----
 Tcl_shipped.html | 18 +++++++++---------
 jim_tcl.txt      | 16 ++++++++--------
 3 files changed, 29 insertions(+), 22 deletions(-)
diff --git a/README.utf-8 b/README.utf-8
index ad6c7b5..eca528c 100644
--- a/README.utf-8
+++ b/README.utf-8
@@ -98,11 +98,18 @@ unicode data table at http://unicode.org/Public/UNIDATA/UnicodeData.txt
 
 Working with Binary Data and non-UTF-8 encodings
 ------------------------------------------------
-If it is necessary to work with both UTF-8 and binary data (bytes
->= 0x80), or non-UTF-8 encodings you will need to arrange for the
-data to be converted between UTF-8 on input and output.  Individual
-characters can be converted from Unicode to UTF-8 with the
-utf8_fromunicode() function and the reverse with utf8_tounicode().
+Almost all Jim commands will work identically with binary data and
+UTF-8 encoded data, including read, gets, puts and 'string eq'.  It
+is only certain string manipulation commands which will operate
+differently.  For example, 'string index' will return UTF-8 characters,
+not bytes.
+
+If it is necessary to manipulate strings containing binary, non-ASCII
+data (bytes >= 0x80), there are two options:
+
+1. Build Jim without UTF-8 support
+2. Arrange to encode and decode binary data or data in other encodings
+   to UTF-8 before manipulation.
 
 Internal Details
 ----------------
diff --git a/Tcl_shipped.html b/Tcl_shipped.html
index 7cc23e5..ba1d32c 100644
--- a/Tcl_shipped.html
+++ b/Tcl_shipped.html
@@ -2372,18 +2372,18 @@ characters can take a variable number of bytes. Thus the addition of
 while <em>string length</em> returns the number of characters.</p></div>
 <div class="paragraph"><p>If UTF-8 support is not enabled, all commands treat bytes as characters
 and <em>string bytelength</em> returns the same value as <em>string length</em>.</p></div>
-<div class="paragraph"><p>Note that even if UTF-8 support is not enabled, the \uNNNN syntax
+<div class="paragraph"><p>Note that even if UTF-8 support is not enabled, the <tt>\uNNNN</tt> syntax
 is still available to embed UTF-8 sequences.</p></div>
 <h3 id="_string_matching">String Matching</h3><div style="clear:left"></div>
 <div class="paragraph"><p>Commands such as <em>string match</em>, <em>lsearch -glob</em>, <em>array names</em> and others use string
 pattern matching rules. These commands support UTF-8. For example:</p></div>
 <div class="literalblock">
 <div class="content">
-<pre><tt>string match a\[\ua0-\ubf\]b "a\a3b"</tt></pre>
+<pre><tt>string match a\[\ua0-\ubf\]b "a\u00a3b"</tt></pre>
 </div></div>
 <h3 id="_format_and_scan">format and scan</h3><div style="clear:left"></div>
-<div class="paragraph"><p><em>format %c</em> allows a unicode codepoint to be be encoded. For example, the following will return
-a string with two bytes and one character. The same as \ub5</p></div>
+<div class="paragraph"><p><tt>format %c</tt> allows a Unicode codepoint to be encoded. For example, the following will return
+a string with two bytes and one character. This is the same as <tt>\ub5</tt>.</p></div>
 <div class="literalblock">
 <div class="content">
 <pre><tt>format %c 0xb5</tt></pre>
@@ -2394,11 +2394,11 @@ return a string with three characters, not three bytes.</p></div>
 <div class="content">
 <pre><tt>format %.3s \ub5\ub6\ub7\ub8</tt></pre>
 </div></div>
-<div class="paragraph"><p>Similarly, <em>scan &#8230; %c</em> allows a UTF-8 to be decoded to a unicode codepoint. The following will set
-<strong>a</strong> to 181 (0xb5) and <strong>b</strong> to <em>181</em> and <em>b</em> to 65.</p></div>
+<div class="paragraph"><p>Similarly, <tt>scan &#8230; %c</tt> allows a UTF-8 sequence to be decoded to a Unicode codepoint. The following will set
+<strong>a</strong> to 181 (0xb5) and <strong>b</strong> to 65 (0x41).</p></div>
 <div class="literalblock">
 <div class="content">
-<pre><tt>scan \00b5A %c%c a b</tt></pre>
+<pre><tt>scan \u00b5A %c%c a b</tt></pre>
 </div></div>
 <div class="paragraph"><p><em>scan %s</em> will also accept a character class, including unicode ranges.</p></div>
 <h3 id="_string_classes">String Classes</h3><div style="clear:left"></div>
@@ -2406,7 +2406,7 @@ return a string with three characters, not three bytes.</p></div>
 will return 0, even though the string may be considered to be alphabetic.</p></div>
 <div class="literalblock">
 <div class="content">
-<pre><tt>string is \b5Test</tt></pre>
+<pre><tt>string is alpha \ub5Test</tt></pre>
 </div></div>
 <div class="paragraph"><p>This does not affect the string classes <em>ascii</em>, <em>control</em>, <em>digit</em>, <em>double</em>, <em>integer</em> or <em>xdigit</em>.</p></div>
 <h3 id="_case_mapping_and_conversion">Case Mapping and Conversion</h3><div style="clear:left"></div>
@@ -6408,7 +6408,7 @@ official policies, either expressed or implied, of the Jim Tcl Project.</tt></pr
 </div>
 <div id="footer">
 <div id="footer-text">
-Last updated 2010-11-11 10:57:51 EST
+Last updated 2010-11-11 10:58:17 EST
 </div>
 </div>
 </body>
diff --git a/jim_tcl.txt b/jim_tcl.txt
index 4984655..443f96b 100644
--- a/jim_tcl.txt
+++ b/jim_tcl.txt
@@ -1372,7 +1372,7 @@ while 'string length' returns the number of characters.
 If UTF-8 support is not enabled, all commands treat bytes as characters
 and 'string bytelength' returns the same value as 'string length'.
 
-Note that even if UTF-8 support is not enabled, the {backslash}uNNNN syntax
+Note that even if UTF-8 support is not enabled, the +{backslash}uNNNN+ syntax
 is still available to embed UTF-8 sequences.
 
 String Matching
@@ -1380,12 +1380,12 @@ String Matching
 Commands such as 'string match', 'lsearch -glob', 'array names' and others use string
 pattern matching rules. These commands support UTF-8. For example:
 
-  string match a\[\ua0-\ubf\]b "a\a3b"
+  string match a\[\ua0-\ubf\]b "a\u00a3b"
 
 format and scan
 ~~~~~~~~~~~~~~~
-'format %c' allows a unicode codepoint to be be encoded. For example, the following will return
-a string with two bytes and one character. The same as {backslash}ub5
++format %c+ allows a Unicode codepoint to be encoded. For example, the following will return
+a string with two bytes and one character. This is the same as +{backslash}ub5+.
 
   format %c 0xb5
 
@@ -1394,10 +1394,10 @@ return a string with three characters, not three bytes.
 
   format %.3s \ub5\ub6\ub7\ub8
 
-Similarly, 'scan ... %c' allows a UTF-8 to be decoded to a unicode codepoint. The following will set
-*a* to 181 (0xb5) and *b* to '181' and 'b' to 65.
+Similarly, +scan ... %c+ allows a UTF-8 sequence to be decoded to a Unicode codepoint. The following will set
+*a* to 181 (0xb5) and *b* to 65 (0x41).
 
-  scan \00b5A %c%c a b
+  scan \u00b5A %c%c a b
 
 'scan %s' will also accept a character class, including unicode ranges.
 
@@ -1406,7 +1406,7 @@ String Classes
 'string is' has *not* been extended to classify UTF-8 characters. Therefore, the following
 will return 0, even though the string may be considered to be alphabetic.
 
-  string is \b5Test
+  string is alpha \ub5Test
 
 This does not affect the string classes 'ascii', 'control', 'digit', 'double', 'integer' or 'xdigit'.
 
-- 
cgit v1.1