Update documentation to cover UTF-8 support for regexp

Also create README.utf-8

Signed-off-by: Steve Bennett <steveb@workware.net.au>
author: Steve Bennett <steveb@workware.net.au> 2010-10-22 11:34:27 +1000
committer: Steve Bennett <steveb@workware.net.au> 2010-11-17 07:57:38 +1000
commit: b45e63f2b5ca11d102690b19d8c5f7685754e75c (patch)
tree: 62ccd72adc33974d3aad2dd69621968d4684faab
parent: f86ed51e9b0f38954519ca21a623d27bc7c80a88 (diff)
download: jimtcl-b45e63f2b5ca11d102690b19d8c5f7685754e75c.zip
jimtcl-b45e63f2b5ca11d102690b19d8c5f7685754e75c.tar.gz
jimtcl-b45e63f2b5ca11d102690b19d8c5f7685754e75c.tar.bz2
7 files changed, 251 insertions, 35 deletions
diff --git a/README.utf-8 b/README.utf-8
new file mode 100644
index 0000000..ad6c7b5
--- /dev/null
+++ b/README.utf-8
@@ -0,0 +1,116 @@
+UTF-8 Support for Jim Tcl
+=========================
+
+Author: Steve Bennett <steveb@workware.net.au>
+Date: 2 Nov 2010 10:55:52 EST
+
+OVERVIEW
+--------
+Traditionally Jim Tcl has supported strings, including binary strings containing
+nulls, but it has had no support for multi-byte character encodings.
+
+In some fields, such as when dealing with the web, or other user-generated content,
+support for multi-byte character encodings is necessary.
+In these cases it would be very useful for Jim Tcl to be able to process strings
+as multi-byte character strings rather than simply binary bytes. 
+
+Supporting multiple character encodings and translation between those encodings
+is beyond the scope of Jim Tcl. Therefore, Jim has been enhanced to add support
+for UTF-8, as probably the most popular general purpose multi-byte encoding.
+
+UTF-8 support is optional. It can be enabled at compile time with:
+
+  ./configure --enable-utf8
+
+The Jim Tcl documentation fully documents the UTF-8 support. This README includes
+additional background information.
+
+Unicode vs UTF-8
+----------------
+It is important to understand that Unicode is an abstract representation
+of the concept of a "character", while UTF-8 is an encoding of
+Unicode into bytes.  Thus the Unicode codepoint U+00B5 is encoded
+in UTF-8 with the byte sequence: 0xc2, 0xb5. This is different from
+ASCII, where the same name is used interchangeably for a character
+set and an encoding.
+
+Unicode Escapes
+---------------
+Even without UTF-8 enabled, it is useful to be able to encode UTF-8 characters
+in strings. This can be done with the \uNNNN Unicode escape. This syntax
+is compatible with Tcl and is enabled even if UTF-8 is disabled.
+
+Like Tcl, currently only 16-bit Unicode characters can be encoded.
+
+UTF-8 Properties
+----------------
+Due to the design of the UTF-8 encoding, many (most) commands continue
+to work with UTF-8 strings. This is due to the following properties of UTF-8:
+
+* ASCII characters in strings have the same representation in UTF-8
+* An ASCII string will never match the middle of a multi-byte UTF-8 sequence
+* UTF-8 strings can be sorted as bytes and produce the same result as sorting
+  by characters
+* UTF-8 strings in Jim continue to be null terminated
+
+Commands Supporting UTF-8
+-------------------------
+The following commands have been enhanced to support UTF-8 strings.
+
+* array {get,names,unset}
+* case
+* glob
+* lsearch -glob, -regexp
+* switch -glob, -regexp
+* regexp, regsub
+* format
+* scan
+* split
+* string index, range, length, compare, equal, first, last, map, match, reverse, tolower, toupper
+* string bytelength (new)
+* info procs, commands, vars, globals, locals
+
+Character Classes
+-----------------
+Jim Tcl has no support for UTF-8 character classes.  Thus [:alpha:]
+will match [a-zA-Z], but not non-ASCII alphabetic characters.  The
+same is true for 'string is'.
+
+Regular Expressions
+-------------------
+Normally, Jim Tcl uses the system-supplied POSIX-compatible regex
+implementation.
+
+Systems typically do not provide a UTF-8 capable regex implementation.
+Therefore, when UTF-8 support is enabled, the built-in regex
+implementation, which includes UTF-8 support, is used instead.
+
+Case Insensitivity
+------------------
+Case folding is much more complex under Unicode than under ASCII.
+For example it is possible for a character to change the number of
+bytes required for representation when converting from one case to
+another. Jim Tcl supports only "simple" case folding, where case
+is folded only where the number of bytes does not change.
+
+Case folding tables are automatically generated from the official
+unicode data table at http://unicode.org/Public/UNIDATA/UnicodeData.txt
+
+Working with Binary Data and non-UTF-8 encodings
+------------------------------------------------
+If it is necessary to work with both UTF-8 and binary data (bytes
+>= 0x80), or non-UTF-8 encodings, you will need to arrange for the
+data to be converted to and from UTF-8 on input and output.  Individual
+characters can be converted from Unicode to UTF-8 with the
+utf8_fromunicode() function and the reverse with utf8_tounicode().
+
+Internal Details
+----------------
+Jim_Utf8Length() will calculate the character length of the string and cache
+it for later access. It uses utf8_strlen() which relies on the string being null
+terminated (which it always will be).
+
+It is possible to tell if a string is ascii-only because then length == bytelength.
+
+It is possible to provide optimised versions of various routines for
+the ascii-only case. Currently this is done only for 'string index' and 'string range'.
diff --git a/Tcl_shipped.html b/Tcl_shipped.html
index c862b72..7cc23e5 100644
--- a/Tcl_shipped.html
+++ b/Tcl_shipped.html
@@ -1250,7 +1250,7 @@ sequence is replaced by the given character:</p></div>
 </p>
 </dd>
 <dt class="hdlist1">
-<tt>\u*nnnn*</tt>
+<tt>\<strong>unnnn</strong></tt>
 </dt>
 <dd>
 <p>
@@ -1836,12 +1836,64 @@ for backward compatibility with experimental versions of this feature.</p></div>
 </div>
 <h2 id="_regular_expressions">REGULAR EXPRESSIONS</h2>
 <div class="sectionbody">
-<div class="paragraph"><p>Tcl provides two commands that support string matching using
-<em>egrep</em>-style regular expressions: <em>regexp</em> and <em>regsub</em>.</p></div>
-<div class="paragraph"><p>Regular expressions are implemented using the system&#8217;s C library as
-Extended Regular Expressions (EREs) rather than Basic Regular Expressions (BREs).</p></div>
+<div class="paragraph"><p>Tcl provides two commands that support string matching using regular
+expressions, <em>regexp</em> and <em>regsub</em>, as well as <em>switch -regexp</em> and
+<em>lsearch -regexp</em>.</p></div>
+<div class="paragraph"><p>Regular expressions may be implemented in one of two ways: either using the system&#8217;s C library
+POSIX regular expression support, or using the built-in regular expression engine.
+The differences between these are described below.</p></div>
+<div class="paragraph"><p><strong>NOTE</strong> Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (<tt>ARE</tt>).</p></div>
+<h3 id="_posix_regular_expressions">POSIX Regular Expressions</h3><div style="clear:left"></div>
+<div class="paragraph"><p>If the system supports POSIX regular expressions, and UTF-8 support is not enabled,
+this support will be used by default. The type of regular expression supported is
+Extended Regular Expressions (<tt>ERE</tt>) rather than Basic Regular Expressions (<tt>BRE</tt>).
+See REG_EXTENDED in the documentation.</p></div>
+<div class="paragraph"><p>Using the system-supported POSIX regular expressions will typically
+make for the smallest code size, but some features such as UTF-8
+and <tt>\w</tt>, <tt>\d</tt>, <tt>\s</tt> are not supported.</p></div>
 <div class="paragraph"><p>See regex(3) and regex(7) for full details.</p></div>
-<div class="paragraph"><p><strong>NOTE</strong> Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (AREs).</p></div>
+<h3 id="_jim_built_in_regular_expressions">Jim built-in Regular Expressions</h3><div style="clear:left"></div>
+<div class="paragraph"><p>The Jim built-in regular expression engine may be selected with <tt>./configure --with-jim-regexp</tt>
+or it will be selected automatically if UTF-8 support is enabled.</p></div>
+<div class="paragraph"><p>This engine supports UTF-8 as well as some <tt>ARE</tt> features. The differences with both Tcl 7.x/8.x
+and POSIX are highlighted below.</p></div>
+<div class="olist arabic"><ol class="arabic">
+<li>
+<p>
+UTF-8 strings and patterns are both supported
+</p>
+</li>
+<li>
+<p>
+Supported character classes: <tt>[:alnum:]</tt>, <tt>[:digit:]</tt> and <tt>[:space:]</tt>
+</p>
+</li>
+<li>
+<p>
+Supported shorthand character classes: <tt>\w</tt> = <tt>[:alnum:]</tt>, <tt>\d</tt> = <tt>[:digit:]</tt>, <tt>\s</tt> = <tt>[:space:]</tt>
+</p>
+</li>
+<li>
+<p>
+Character classes apply to ASCII characters only
+</p>
+</li>
+<li>
+<p>
+Supported constraint escapes: <tt>\m</tt> = <tt>\&lt;</tt> = start of word, <tt>\M</tt> = <tt>\&gt;</tt> = end of word
+</p>
+</li>
+<li>
+<p>
+Backslash escapes may be used within regular expressions, such as <tt>\n</tt> = newline, <tt>\uNNNN</tt> = unicode
+</p>
+</li>
+<li>
+<p>
+No support for the <tt>?</tt> non-greedy quantifier. e.g. <tt>*?</tt>
+</p>
+</li>
+</ol></div>
 </div>
 <h2 id="_command_results">COMMAND RESULTS</h2>
 <div class="sectionbody">
@@ -2327,7 +2379,7 @@ is still available to embed UTF-8 sequences.</p></div>
 pattern matching rules. These commands support UTF-8. For example:</p></div>
 <div class="literalblock">
 <div class="content">
-<pre><tt>string match a{backslash}[{backslash}ua0-{backslash}ubf{backslash}]b "a{backslash}a3b"</tt></pre>
+<pre><tt>string match a\[\ua0-\ubf\]b "a\a3b"</tt></pre>
 </div></div>
 <h3 id="_format_and_scan">format and scan</h3><div style="clear:left"></div>
 <div class="paragraph"><p><em>format %c</em> allows a unicode codepoint to be be encoded. For example, the following will return
@@ -2340,13 +2392,13 @@ a string with two bytes and one character. The same as \ub5</p></div>
 return a string with three characters, not three bytes.</p></div>
 <div class="literalblock">
 <div class="content">
-<pre><tt>format %.3s {backslash}ub5{backslash}ub6{backslash}ub7{backslash}ub8</tt></pre>
+<pre><tt>format %.3s \ub5\ub6\ub7\ub8</tt></pre>
 </div></div>
 <div class="paragraph"><p>Similarly, <em>scan &#8230; %c</em> allows a UTF-8 to be decoded to a unicode codepoint. The following will set
 <strong>a</strong> to 181 (0xb5) and <strong>b</strong> to <em>181</em> and <em>b</em> to 65.</p></div>
 <div class="literalblock">
 <div class="content">
-<pre><tt>scan {backslash}00b5A %c%c a b</tt></pre>
+<pre><tt>scan \u00b5A %c%c a b</tt></pre>
 </div></div>
 <div class="paragraph"><p><em>scan %s</em> will also accept a character class, including unicode ranges.</p></div>
 <h3 id="_string_classes">String Classes</h3><div style="clear:left"></div>
@@ -2354,7 +2406,7 @@ return a string with three characters, not three bytes.</p></div>
 will return 0, even though the string may be considered to be alphabetic.</p></div>
 <div class="literalblock">
 <div class="content">
-<pre><tt>string is {backslash}b5Test</tt></pre>
+<pre><tt>string is \b5Test</tt></pre>
 </div></div>
 <div class="paragraph"><p>This does not affect the string classes <em>ascii</em>, <em>control</em>, <em>digit</em>, <em>double</em>, <em>integer</em> or <em>xdigit</em>.</p></div>
 <h3 id="_case_mapping_and_conversion">Case Mapping and Conversion</h3><div style="clear:left"></div>
@@ -2376,9 +2428,9 @@ the following returns 2.</p></div>
 <pre><tt>string bytelength \xff\xff</tt></pre>
 </div></div>
 <h3 id="_regular_expressions_2">Regular Expressions</h3><div style="clear:left"></div>
-<div class="paragraph"><p>At this time, regular expressions do <strong>not</strong> support UTF-8 strings. This included
-<em>regexp</em>, <em>regsub</em>, <em>switch -regexp</em> and <em>lsearch -regexp</em>.</p></div>
-<div class="paragraph"><p>This means that regular expresion operations operate on bytes, not characters.</p></div>
+<div class="paragraph"><p>If UTF-8 support is enabled, the built-in regular expression engine will be
+selected, which supports UTF-8 strings and patterns.</p></div>
+<div class="paragraph"><p>See REGULAR EXPRESSIONS</p></div>
 </div>
 <h2 id="_built_in_commands">BUILT-IN COMMANDS</h2>
 <div class="sectionbody">
@@ -6356,7 +6408,7 @@ official policies, either expressed or implied, of the Jim Tcl Project.</tt></pr
 </div>
 <div id="footer">
 <div id="footer-text">
-Last updated 2010-11-11 10:56:50 EST
+Last updated 2010-11-11 10:57:51 EST
 </div>
 </div>
 </body>
diff --git a/jim_tcl.txt b/jim_tcl.txt
index f19db0f..4984655 100644
--- a/jim_tcl.txt
+++ b/jim_tcl.txt
@@ -492,7 +492,7 @@ sequence is replaced by the given character:
     The digits *ddd* (one, two, or three of them) give the octal value of
     the character.  Note that Jim supports null characters in strings.
 
-+{backslash}u*nnnn*+::
++{backslash}*unnnn*+::
     The hex digits *nnnn* (between one and four of them) give a unicode codepoint.
 	The UTF-8 encoding of the codepoint is inserted.
 
@@ -918,15 +918,44 @@ for backward compatibility with experimental versions of this feature.
 
 REGULAR EXPRESSIONS
 -------------------
-Tcl provides two commands that support string matching using
-'egrep'-style regular expressions: 'regexp' and 'regsub'.
+Tcl provides two commands that support string matching using regular
+expressions, 'regexp' and 'regsub', as well as 'switch -regexp' and
+'lsearch -regexp'.
 
-Regular expressions are implemented using the system's C library as
-Extended Regular Expressions (EREs) rather than Basic Regular Expressions (BREs).
+Regular expressions may be implemented in one of two ways: either using the system's C library
+POSIX regular expression support, or using the built-in regular expression engine.
+The differences between these are described below.
+
+*NOTE* Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (+ARE+).
+
+POSIX Regular Expressions
+~~~~~~~~~~~~~~~~~~~~~~~~~
+If the system supports POSIX regular expressions, and UTF-8 support is not enabled,
+this support will be used by default. The type of regular expression supported is
+Extended Regular Expressions (+ERE+) rather than Basic Regular Expressions (+BRE+).
+See REG_EXTENDED in the documentation.
+
+Using the system-supported POSIX regular expressions will typically
+make for the smallest code size, but some features such as UTF-8
+and +{backslash}w+, +{backslash}d+, +{backslash}s+ are not supported.
 
 See regex(3) and regex(7) for full details.
 
-*NOTE* Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (AREs).
+Jim built-in Regular Expressions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The Jim built-in regular expression engine may be selected with +./configure --with-jim-regexp+
+or it will be selected automatically if UTF-8 support is enabled.
+
+This engine supports UTF-8 as well as some +ARE+ features. The differences with both Tcl 7.x/8.x
+and POSIX are highlighted below.
+
+1. UTF-8 strings and patterns are both supported
+2. Supported character classes: +[:alnum:]+, +[:digit:]+ and +[:space:]+
+3. Supported shorthand character classes: +{backslash}w+ = +[:alnum:]+, +{backslash}d+ = +[:digit:]+, +{backslash}s+ = +[:space:]+
+4. Character classes apply to ASCII characters only
+5. Supported constraint escapes: +{backslash}m+ = +{backslash}<+ = start of word, +{backslash}M+ = +{backslash}>+ = end of word
+6. Backslash escapes may be used within regular expressions, such as +{backslash}n+ = newline, +{backslash}uNNNN+ = unicode
+7. No support for the +?+ non-greedy quantifier. e.g. +*?+
 
 COMMAND RESULTS
 ---------------
@@ -1351,7 +1380,7 @@ String Matching
 Commands such as 'string match', 'lsearch -glob', 'array names' and others use string
 pattern matching rules. These commands support UTF-8. For example:
 
-  string match a{backslash}[{backslash}ua0-{backslash}ubf{backslash}]b "a{backslash}a3b"
+  string match a\[\ua0-\ubf\]b "a\a3b"
 
 format and scan
 ~~~~~~~~~~~~~~~
@@ -1363,12 +1392,12 @@ a string with two bytes and one character. The same as {backslash}ub5
 'format' respects widths as character widths, not byte widths. For example, the following will
 return a string with three characters, not three bytes.
 
-  format %.3s {backslash}ub5{backslash}ub6{backslash}ub7{backslash}ub8
+  format %.3s \ub5\ub6\ub7\ub8
 
 Similarly, 'scan ... %c' allows a UTF-8 to be decoded to a unicode codepoint. The following will set
 *a* to 181 (0xb5) and *b* to '181' and 'b' to 65.
 
-  scan {backslash}00b5A %c%c a b
+  scan \u00b5A %c%c a b
 
 'scan %s' will also accept a character class, including unicode ranges.
 
@@ -1377,7 +1406,7 @@ String Classes
 'string is' has *not* been extended to classify UTF-8 characters. Therefore, the following
 will return 0, even though the string may be considered to be alphabetic.
 
-  string is {backslash}b5Test
+  string is \b5Test
 
 This does not affect the string classes 'ascii', 'control', 'digit', 'double', 'integer' or 'xdigit'.
 
@@ -1406,10 +1435,10 @@ the following returns 2.
 
 Regular Expressions
 ~~~~~~~~~~~~~~~~~~~
-At this time, regular expressions do *not* support UTF-8 strings. This included
-'regexp', 'regsub', 'switch -regexp' and 'lsearch -regexp'.
+If UTF-8 support is enabled, the built-in regular expression engine will be
+selected, which supports UTF-8 strings and patterns.
 
-This means that regular expresion operations operate on bytes, not characters.
+See REGULAR EXPRESSIONS
 
 BUILT-IN COMMANDS
 -----------------
diff --git a/jimregexp.c b/jimregexp.c
index d6a8723..7a0adf2 100644
--- a/jimregexp.c
+++ b/jimregexp.c
@@ -38,11 +38,12 @@
  *** seiwald@perforce.com, on 20 January 2000, to use function prototypes.
  *** THIS IS AN ALTERED VERSION.  It was altered by Christopher Seiwald
  *** seiwald@perforce.com, on 05 November 2002, to const string literals.
- *** THIS IS AN ALTERED VERSION.  It was altered by Steve Bennett <steveb@workware.net.au>
- *** on 16 October 2010, to remove static state and add better Tcl ARE compatibility.
- *** This includes counted repetitions, UTF-8 support, character classes,
- *** shorthand character classes, increased number of parentheses to 100,
- *** backslash escape sequences.
+ *
+ *   THIS IS AN ALTERED VERSION.  It was altered by Steve Bennett <steveb@workware.net.au>
+ *   on 16 October 2010, to remove static state and add better Tcl ARE compatibility.
+ *   This includes counted repetitions, UTF-8 support, character classes,
+ *   shorthand character classes, increased number of parentheses to 100,
+ *   backslash escape sequences.
  *
  * Beware that some of this code is subtly aware of the way operator
  * precedence is structured in regular expressions.  Serious changes in
diff --git a/parse-unidata.tcl b/parse-unidata.tcl
index 9e41e1f..4b5ec3a 100644
--- a/parse-unidata.tcl
+++ b/parse-unidata.tcl
@@ -1,5 +1,12 @@
 #!/usr/bin/env tclsh
 
+# Generate UTF-8 case mapping tables
+#
+# (c) 2010 Steve Bennett <steveb@workware.net.au>
+#
+# See LICENCE for licence details.
+#/
+
 # Parse the unicode data from: http://unicode.org/Public/UNIDATA/UnicodeData.txt
 # to generate case mapping tables
 
diff --git a/utf8.c b/utf8.c
index 13c5fe6..3be9899 100644
--- a/utf8.c
+++ b/utf8.c
@@ -1,6 +1,10 @@
-/* -----------------------------------------------------------------------------
- * Utility functions
- * ---------------------------------------------------------------------------*/
+/**
+ * UTF-8 utility functions
+ *
+ * (c) 2010 Steve Bennett <steveb@workware.net.au>
+ *
+ * See LICENCE for licence details.
+ */
 
 #include <ctype.h>
 #include <stdlib.h>
diff --git a/utf8.h b/utf8.h
index 9e03059..5df2e45 100644
--- a/utf8.h
+++ b/utf8.h
@@ -1,5 +1,12 @@
 #ifndef UTF8_UTIL_H
 #define UTF8_UTIL_H
+/**
+ * UTF-8 utility functions
+ *
+ * (c) 2010 Steve Bennett <steveb@workware.net.au>
+ *
+ * See LICENCE for licence details.
+ */
 
 /**
  * Converts the given unicode codepoint (0 - 0xffff) to utf-8
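
The README added by this commit states that the codepoint U+00B5 is encoded in UTF-8 as the byte sequence 0xc2, 0xb5, and utf8.h describes converting a codepoint in the range 0 - 0xffff to UTF-8. The following is a minimal sketch of that conversion, not the code in utf8.c: the name sketch_utf8_encode() and its signature are hypothetical, and it assumes the caller supplies a buffer of at least three bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration only; this is NOT Jim Tcl's
 * utf8_fromunicode(). Encodes a codepoint in the range 0..0xffff into
 * buf (which must hold at least 3 bytes) and returns the number of
 * bytes written. Surrogate codepoints are not rejected in this sketch. */
int sketch_utf8_encode(unsigned cp, unsigned char *buf)
{
    assert(buf != NULL);
    assert(cp <= 0xffff);

    if (cp < 0x80) {
        /* ASCII: one byte, identical to the codepoint */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* Two bytes: 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xc0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    /* Three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xe0 | (cp >> 12));
    buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
    buf[2] = (unsigned char)(0x80 | (cp & 0x3f));
    return 3;
}
```

Called with 0xb5, this sketch writes the two bytes 0xc2, 0xb5, which matches the example in the README. ASCII codepoints pass through as a single unchanged byte, which is the first of the UTF-8 properties the README lists.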
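
The "Internal Details" section of the new README notes that a string is ascii-only when its character length equals its byte length. A rough sketch of that check follows, assuming valid, null-terminated UTF-8 input. The function names are hypothetical and this is not the implementation of Jim_Utf8Length() or utf8_strlen().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts characters in a null-terminated string that
 * is assumed to be valid UTF-8, by counting every byte that is not a
 * continuation byte (10xxxxxx). */
size_t sketch_utf8_charlen(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++) {
        if (((unsigned char)*s & 0xc0) != 0x80) {
            n++;
        }
    }
    return n;
}

/* Hypothetical helper: a string is ascii-only exactly when its character
 * length equals its byte length, since every non-ASCII character occupies
 * more than one byte. */
int sketch_is_ascii_only(const char *s)
{
    return sketch_utf8_charlen(s) == strlen(s);
}
```

For the two-byte string 0xc2, 0xb5 the character length is 1 while the byte length is 2, so the check reports that the string is not ascii-only. This is the same distinction the documentation draws between 'string length' and 'string bytelength'.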