Update documentation to cover UTF-8 support for regexp

Also create README.utf-8

Signed-off-by: Steve Bennett <steveb@workware.net.au>
author: Steve Bennett <steveb@workware.net.au> 2010-10-22 11:34:27 +1000
committer: Steve Bennett <steveb@workware.net.au> 2010-11-17 07:57:38 +1000
commit: b45e63f2b5ca11d102690b19d8c5f7685754e75c
tree: 62ccd72adc33974d3aad2dd69621968d4684faab /jim_tcl.txt
parent: f86ed51e9b0f38954519ca21a623d27bc7c80a88
1 files changed, 42 insertions, 13 deletions
diff --git a/jim_tcl.txt b/jim_tcl.txt
index f19db0f..4984655 100644
--- a/jim_tcl.txt
+++ b/jim_tcl.txt
@@ -492,7 +492,7 @@ sequence is replaced by the given character:
     The digits *ddd* (one, two, or three of them) give the octal value of
     the character.  Note that Jim supports null characters in strings.
 
-+{backslash}u*nnnn*+::
++{backslash}*unnnn*+::
     The hex digits *nnnn* (between one and four of them) give a unicode codepoint.
 	The UTF-8 encoding of the codepoint is inserted.
 
@@ -918,15 +918,44 @@ for backward compatibility with experimental versions of this feature.
 
 REGULAR EXPRESSIONS
 -------------------
-Tcl provides two commands that support string matching using
-'egrep'-style regular expressions: 'regexp' and 'regsub'.
+Tcl provides two commands that support string matching using regular
+expressions, 'regexp' and 'regsub', as well as 'switch -regexp' and
+'lsearch -regexp'.
 
-Regular expressions are implemented using the system's C library as
-Extended Regular Expressions (EREs) rather than Basic Regular Expressions (BREs).
+Regular expressions may be implemented in one of two ways: either using the system's C library
+POSIX regular expression support, or using the built-in regular expression engine.
+The differences between these are described below.
+
+*NOTE* Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (+ARE+).
+
+POSIX Regular Expressions
+~~~~~~~~~~~~~~~~~~~~~~~~~
+If the system supports POSIX regular expressions, and UTF-8 support is not enabled,
+this support will be used by default. The type of regular expressions supported is
+Extended Regular Expressions (+ERE+) rather than Basic Regular Expressions (+BRE+).
+See REG_EXTENDED in the documentation.
+
+Using the system-supported POSIX regular expressions will typically
+make for the smallest code size, but some features such as UTF-8
+and +{backslash}w+, +{backslash}d+, +{backslash}s+ are not supported.
 
 See regex(3) and regex(7) for full details.
 
-*NOTE* Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (AREs).
+Jim built-in Regular Expressions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The Jim built-in regular expression engine may be selected with +./configure --with-jim-regexp+
+or it will be selected automatically if UTF-8 support is enabled.
+
+This engine supports UTF-8 as well as some +ARE+ features. The differences with both Tcl 7.x/8.x
+and POSIX are highlighted below.
+
+1. UTF-8 strings and patterns are both supported
+2. Supported character classes: +[:alnum:]+, +[:digit:]+ and +[:space:]+
+3. Supported shorthand character classes: +{backslash}w+ = +[:alnum:]+, +{backslash}d+ = +[:digit:]+, +{backslash}s+ = +[:space:]+
+4. Character classes apply to ASCII characters only
+5. Supported constraint escapes: +{backslash}m+ = +{backslash}<+ = start of word, +{backslash}M+ = +{backslash}>+ = end of word
+6. Backslash escapes may be used within regular expressions, such as +{backslash}n+ = newline, +{backslash}uNNNN+ = unicode
+7. No support for the +?+ non-greedy quantifier, e.g. +*?+
 
 COMMAND RESULTS
 ---------------
@@ -1351,7 +1380,7 @@ String Matching
 Commands such as 'string match', 'lsearch -glob', 'array names' and others use string
 pattern matching rules. These commands support UTF-8. For example:
 
-  string match a{backslash}[{backslash}ua0-{backslash}ubf{backslash}]b "a{backslash}a3b"
+  string match a\[\ua0-\ubf\]b "a\a3b"
 
 format and scan
 ~~~~~~~~~~~~~~~
@@ -1363,12 +1392,12 @@ a string with two bytes and one character. The same as {backslash}ub5
 'format' respects widths as character widths, not byte widths. For example, the following will
 return a string with three characters, not three bytes.
 
-  format %.3s {backslash}ub5{backslash}ub6{backslash}ub7{backslash}ub8
+  format %.3s \ub5\ub6\ub7\ub8
 
 Similarly, 'scan ... %c' allows a UTF-8 to be decoded to a unicode codepoint. The following will set
 *a* to 181 (0xb5) and *b* to '181' and 'b' to 65.
 
-  scan {backslash}00b5A %c%c a b
+  scan \00b5A %c%c a b
 
 'scan %s' will also accept a character class, including unicode ranges.
 
@@ -1377,7 +1406,7 @@ String Classes
 'string is' has *not* been extended to classify UTF-8 characters. Therefore, the following
 will return 0, even though the string may be considered to be alphabetic.
 
-  string is {backslash}b5Test
+  string is \b5Test
 
 This does not affect the string classes 'ascii', 'control', 'digit', 'double', 'integer' or 'xdigit'.
 
@@ -1406,10 +1435,10 @@ the following returns 2.
 
 Regular Expressions
 ~~~~~~~~~~~~~~~~~~~
-At this time, regular expressions do *not* support UTF-8 strings. This included
-'regexp', 'regsub', 'switch -regexp' and 'lsearch -regexp'.
+If UTF-8 support is enabled, the built-in regular expression engine will be
+selected, which supports UTF-8 strings and patterns.
 
-This means that regular expresion operations operate on bytes, not characters.
+See REGULAR EXPRESSIONS
 
 BUILT-IN COMMANDS
 -----------------
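
The features this change documents for the built-in engine can be exercised with a short Jim Tcl script. The following is a minimal sketch, assuming a jimsh configured with UTF-8 support so that the built-in regular expression engine is selected. The sample strings are made up for illustration, and no results are shown because they depend on how the interpreter was built.

```tcl
# Sketch only: assumes jimsh was configured with UTF-8 support, which
# per the documentation above selects the built-in regexp engine.

# Shorthand character class: \d is documented as equivalent to [:digit:].
puts [regexp {^\d+$} 12345]

# Constraint escapes: \m and \M are documented as start and end of word.
puts [regexp -inline {\mcat\M} "a cat sat"]

# UTF-8: with the built-in engine, '.' is expected to consume the
# two-byte character U+00B5 as a single character.
puts [regexp {^a.b$} "a\u00b5b"]

# Non-greedy quantifiers such as *? are documented as unsupported, so a
# negated bracket expression is used to stop at the first '>' instead.
puts [regexp -inline {<[^>]*>} "<b>bold</b>"]
```

With the POSIX engine the first two calls may behave differently, since the documentation above lists +{backslash}d+ and the other shorthand classes as unsupported there.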