From b45e63f2b5ca11d102690b19d8c5f7685754e75c Mon Sep 17 00:00:00 2001
From: Steve Bennett
Date: Fri, 22 Oct 2010 11:34:27 +1000
Subject: Update documentation to cover UTF-8 support for regexp
Also create README.utf-8
Signed-off-by: Steve Bennett
---
README.utf-8 | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
Tcl_shipped.html | 80 ++++++++++++++++++++++++++++++-------
jim_tcl.txt | 55 ++++++++++++++++++++------
jimregexp.c | 11 +++---
parse-unidata.tcl | 7 ++++
utf8.c | 10 +++--
utf8.h | 7 ++++
7 files changed, 251 insertions(+), 35 deletions(-)
create mode 100644 README.utf-8
diff --git a/README.utf-8 b/README.utf-8
new file mode 100644
index 0000000..ad6c7b5
--- /dev/null
+++ b/README.utf-8
@@ -0,0 +1,116 @@
+UTF-8 Support for Jim Tcl
+=========================
+
+Author: Steve Bennett
+Date: 2 Nov 2010 10:55:52 EST
+
+OVERVIEW
+--------
+Traditionally Jim Tcl has supported strings, including binary strings containing
+nulls; however, it has had no support for multi-byte character encodings.
+
+In some fields, such as when dealing with the web, or other user-generated content,
+support for multi-byte character encodings is necessary.
+In these cases it would be very useful for Jim Tcl to be able to process strings
+as multi-byte character strings rather than simply binary bytes.
+
+Supporting multiple character encodings and translation between those encodings
+is beyond the scope of Jim Tcl. Therefore, Jim has been enhanced to add support
+for UTF-8, as probably the most popular general purpose multi-byte encoding.
+
+UTF-8 support is optional. It can be enabled at compile time with:
+
+ ./configure --enable-utf8
+
+The Jim Tcl documentation fully documents the UTF-8 support. This README includes
+additional background information.
+
+Unicode vs UTF-8
+----------------
+It is important to understand that Unicode is an abstract representation
+of the concept of a "character", while UTF-8 is an encoding of
+Unicode into bytes. Thus the Unicode codepoint U+00B5 is encoded
+in UTF-8 with the byte sequence: 0xc2, 0xb5. This is different from
+ASCII, where the same name is used interchangeably for both a character
+set and an encoding.
+
+Unicode Escapes
+---------------
+Even without UTF-8 enabled, it is useful to be able to embed Unicode characters
+in strings. This can be done with the \uNNNN Unicode escape. This syntax
+is compatible with Tcl and is enabled even if UTF-8 is disabled.
+
+Like Tcl, currently only 16-bit Unicode codepoints (U+0000 to U+FFFF) can be encoded.
+
+UTF-8 Properties
+----------------
+Due to the design of the UTF-8 encoding, many (most) commands continue
+to work with UTF-8 strings. This is due to the following properties of UTF-8:
+
+* ASCII characters in strings have the same representation in UTF-8
+* An ASCII string will never match the middle of a multi-byte UTF-8 sequence
+* UTF-8 strings can be sorted as bytes and produce the same result as sorting
+ by Unicode codepoint
+* UTF-8 strings in Jim continue to be null terminated
+
+Commands Supporting UTF-8
+-------------------------
+The following commands have been enhanced to support UTF-8 strings.
+
+* array {get,names,unset}
+* case
+* glob
+* lsearch -glob, -regexp
+* switch -glob, -regexp
+* regexp, regsub
+* format
+* scan
+* split
+* string index, range, length, compare, equal, first, last, map, match, reverse, tolower, toupper
+* string bytelength (new)
+* info procs, commands, vars, globals, locals
+
+Character Classes
+-----------------
+Jim Tcl has no support for UTF-8 character classes. Thus [:alpha:]
+will match [a-zA-Z], but not non-ASCII alphabetic characters. The
+same is true for 'string is'.
+
+Regular Expressions
+-------------------
+Normally, Jim Tcl uses the system-supplied POSIX-compatible regex
+implementation.
+
+Systems typically do not provide a UTF-8 capable regex implementation.
+Therefore, when UTF-8 support is enabled, the built-in regex
+implementation, which includes UTF-8 support, is used instead.
+
+Case Insensitivity
+------------------
+Case folding is much more complex under Unicode than under ASCII.
+For example it is possible for a character to change the number of
+bytes required for representation when converting from one case to
+another. Jim Tcl supports only "simple" case folding, where case
+is folded only where the number of bytes does not change.
+
+Case folding tables are automatically generated from the official
+Unicode data table at http://unicode.org/Public/UNIDATA/UnicodeData.txt
+
+Working with Binary Data and non-UTF-8 encodings
+------------------------------------------------
+If it is necessary to work with both UTF-8 and binary data (bytes
+>= 0x80), or non-UTF-8 encodings, you will need to arrange for the
+data to be converted to UTF-8 on input and from UTF-8 on output. Individual
+characters can be converted from Unicode to UTF-8 with the
+utf8_fromunicode() function and the reverse with utf8_tounicode().
+
+Internal Details
+----------------
+Jim_Utf8Length() will calculate the character length of the string and cache
+it for later access. It uses utf8_strlen(), which relies on the string being null
+terminated (which it always will be).
+
+It is possible to tell if a string is ASCII-only because length == bytelength.
+
+It is possible to provide optimised versions of various routines for
+the ASCII-only case. Currently this is done only for 'string index' and 'string range'.
diff --git a/Tcl_shipped.html b/Tcl_shipped.html
index c862b72..7cc23e5 100644
--- a/Tcl_shipped.html
+++ b/Tcl_shipped.html
@@ -1250,7 +1250,7 @@ sequence is replaced by the given character:
-\u*nnnn*
+\unnnn
@@ -1836,12 +1836,64 @@ for backward compatibility with experimental versions of this feature.
REGULAR EXPRESSIONS
-
Tcl provides two commands that support string matching using
-egrep-style regular expressions: regexp and regsub.
-
Regular expressions are implemented using the system’s C library as
-Extended Regular Expressions (EREs) rather than Basic Regular Expressions (BREs).
+
Tcl provides two commands that support string matching using regular
+expressions, regexp and regsub, as well as switch -regexp and
+lsearch -regexp.
+
Regular expressions may be implemented in one of two ways: either using the system’s C library
+POSIX regular expression support, or using the built-in regular expression engine.
+The differences between these are described below.
+
NOTE Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (ARE).
+
POSIX Regular Expressions
+
If the system supports POSIX regular expressions, and UTF-8 support is not enabled,
+this support will be used by default. The regular expressions supported are
+Extended Regular Expressions (ERE) rather than Basic Regular Expressions (BRE).
+See REG_EXTENDED in the documentation.
+
Using the system-supported POSIX regular expressions will typically
+make for the smallest code size, but some features such as UTF-8
+and \w, \d, \s are not supported.
See regex(3) and regex(7) for full details.
-
NOTE Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (AREs).
+
Jim built-in Regular Expressions
+
The Jim built-in regular expression engine may be selected with ./configure --with-jim-regexp
+or it will be selected automatically if UTF-8 support is enabled.
+
This engine supports UTF-8 as well as some ARE features. The differences with both Tcl 7.x/8.x
+and POSIX are highlighted below.
+
+-
+
+UTF-8 strings and patterns are both supported
+
+
+-
+
+Supported character classes: [:alnum:], [:digit:] and [:space:]
+
+
+-
+
+Supported shorthand character classes: \w = [:alnum:], \d = [:digit:], \s = [:space:]
+
+
+-
+
+Character classes apply to ASCII characters only
+
+
+-
+
+Supported constraint escapes: \m = \< = start of word, \M = \> = end of word
+
+
+-
+
+Backslash escapes may be used within regular expressions, such as \n = newline, \uNNNN = unicode
+
+
+-
+
+No support for the ? non-greedy quantifier. e.g. *?
+
+
+
COMMAND RESULTS
@@ -2327,7 +2379,7 @@ is still available to embed UTF-8 sequences.
pattern matching rules. These commands support UTF-8. For example:
-
string match a{backslash}[{backslash}ua0-{backslash}ubf{backslash}]b "a{backslash}a3b"
+
string match a\[\ua0-\ubf\]b "a\a3b"
format %c allows a unicode codepoint to be be encoded. For example, the following will return
@@ -2340,13 +2392,13 @@ a string with two bytes and one character. The same as \ub5
return a string with three characters, not three bytes.
-
format %.3s {backslash}ub5{backslash}ub6{backslash}ub7{backslash}ub8
+
format %.3s \ub5\ub6\ub7\ub8
Similarly, scan … %c allows a UTF-8 to be decoded to a unicode codepoint. The following will set
a to 181 (0xb5) and b to 181 and b to 65.
-
scan {backslash}00b5A %c%c a b
+
scan \00b5A %c%c a b
scan %s will also accept a character class, including unicode ranges.
String Classes
@@ -2354,7 +2406,7 @@ return a string with three characters, not three bytes.
will return 0, even though the string may be considered to be alphabetic.
-
string is {backslash}b5Test
+
string is \b5Test
This does not affect the string classes ascii, control, digit, double, integer or xdigit.
Case Mapping and Conversion
@@ -2376,9 +2428,9 @@ the following returns 2.
string bytelength \xff\xff
Regular Expressions
-At this time, regular expressions do not support UTF-8 strings. This included
-regexp, regsub, switch -regexp and lsearch -regexp.
-This means that regular expresion operations operate on bytes, not characters.
+If UTF-8 support is enabled, the built-in regular expression engine, which
+supports UTF-8 strings and patterns, will be selected.
+
BUILT-IN COMMANDS
@@ -6356,7 +6408,7 @@ official policies, either expressed or implied, of the Jim Tcl Project.