author | Steve Bennett <steveb@workware.net.au> | 2010-10-22 11:34:27 +1000 |
---|---|---|
committer | Steve Bennett <steveb@workware.net.au> | 2010-11-17 07:57:38 +1000 |
commit | b45e63f2b5ca11d102690b19d8c5f7685754e75c (patch) | |
tree | 62ccd72adc33974d3aad2dd69621968d4684faab /jim_tcl.txt | |
parent | f86ed51e9b0f38954519ca21a623d27bc7c80a88 (diff) | |
Update documentation to cover UTF-8 support for regexp
Also create README.utf-8
Signed-off-by: Steve Bennett <steveb@workware.net.au>
Diffstat (limited to 'jim_tcl.txt')
-rw-r--r-- | jim_tcl.txt | 55 |
1 files changed, 42 insertions, 13 deletions
diff --git a/jim_tcl.txt b/jim_tcl.txt
index f19db0f..4984655 100644
--- a/jim_tcl.txt
+++ b/jim_tcl.txt
@@ -492,7 +492,7 @@ sequence is replaced by the given character:
 The digits *ddd* (one, two, or three of them) give the octal value of
 the character. Note that Jim supports null characters in strings.
 
-+{backslash}u*nnnn*+::
++{backslash}*unnnn*+::
 The hex digits *nnnn* (between one and four of them) give a unicode codepoint.
 The UTF-8 encoding of the codepoint is inserted.
 
@@ -918,15 +918,44 @@ for backward compatibility with experimental versions of this feature.
 
 REGULAR EXPRESSIONS
 -------------------
-Tcl provides two commands that support string matching using
-'egrep'-style regular expressions: 'regexp' and 'regsub'.
+Tcl provides two commands that support string matching using regular
+expressions, 'regexp' and 'regsub', as well as 'switch -regexp' and
+'lsearch -regexp'.
 
-Regular expressions are implemented using the system's C library as
-Extended Regular Expressions (EREs) rather than Basic Regular Expressions (BREs).
+Regular expressions may be implemented one of two ways. Either using the system's C library
+POSIX regular expression support, or using the built-in regular expression engine.
+The differences between these are described below.
+
+*NOTE* Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (+ARE+).
+
+POSIX Regular Expressions
+~~~~~~~~~~~~~~~~~~~~~~~~~
+If the system supports POSIX regular expressions, and UTF-8 support is not enabled,
+this support will be used by default. The type of regular expressions supported are
+Extended Regular Expressions (+ERE+) rather than Basic Regular Expressions (+BRE+).
+See REG_EXTENDED in the documentation.
+
+Using the system-supported POSIX regular expressions will typically
+make for the smallest code size, but some features such as UTF-8
+and +{backslash}w+, +{backslash}d+, +{backslash}s+ are not supported.
 
 See regex(3) and regex(7) for full details.
 
-*NOTE* Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (AREs).
+Jim built-in Regular Expressions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The Jim built-in regulare expression engine may be selected with +./configure --with-jim-regexp+
+or it will be selected automatically if UTF-8 support is enabled.
+
+This engine supports UTF-8 as well as some +ARE+ features. The differences with both Tcl 7.x/8.x
+and POSIX are highlighted below.
+
+1. UTF-8 strings and patterns are both supported
+2. Supported character classes: +[:alnum:]+, +[:digit:]+ and +[:space:]+
+3. Supported shorthand character classes: +{backslash}w = +[:alnum:]+, +{backslash}d+ = +[:digit:],+ +{backslash}s+ = +[:space:]+
+4. Character classes apply to ASCII characters only
+5. Supported constraint escapes: +{backslash}m+ = +{backslash}<+ = start of word, +{backslash}M+ = +{backslash}>+ = end of word
+6. Backslash escapes may be used within regular expressions, such as +{backslash}n+ = newline, +{backslash}uNNNN+ = unicode
+7. No support for the +?+ non-greedy quantifier. e.g. +*?+
 
 COMMAND RESULTS
 ---------------
@@ -1351,7 +1380,7 @@ String Matching
 Commands such as 'string match', 'lsearch -glob', 'array names' and others use string pattern matching rules.
 These commands support UTF-8. For example:
 
- string match a{backslash}[{backslash}ua0-{backslash}ubf{backslash}]b "a{backslash}a3b"
+ string match a\[\ua0-\ubf\]b "a\a3b"
 
 format and scan
 ~~~~~~~~~~~~~~~
@@ -1363,12 +1392,12 @@ a string with two bytes and one character. The same as {backslash}ub5
 'format' respects widths as character widths, not byte widths. For example, the
 following will return a string with three characters, not three bytes.
 
- format %.3s {backslash}ub5{backslash}ub6{backslash}ub7{backslash}ub8
+ format %.3s \ub5\ub6\ub7\ub8
 
 Similarly, 'scan ... %c' allows a UTF-8 to be decoded to a unicode codepoint. The
 following will set *a* to 181 (0xb5) and *b* to '181' and 'b' to 65.
 
- scan {backslash}00b5A %c%c a b
+ scan \00b5A %c%c a b
 
 'scan %s' will also accept a character class, including unicode ranges.
 
@@ -1377,7 +1406,7 @@ String Classes
 'string is' has *not* been extended to classify UTF-8 characters. Therefore, the following
 will return 0, even though the string may be considered to be alphabetic.
 
- string is {backslash}b5Test
+ string is \b5Test
 
 This does not affect the string classes 'ascii', 'control', 'digit', 'double', 'integer' or 'xdigit'.
 
@@ -1406,10 +1435,10 @@ the following returns 2.
 
 Regular Expressions
 ~~~~~~~~~~~~~~~~~~~
-At this time, regular expressions do *not* support UTF-8 strings. This included
-'regexp', 'regsub', 'switch -regexp' and 'lsearch -regexp'.
+If UTF-8 support is enabled, the built-in regular expression engine will be
+selected which supports UTF-8 strings and patterns.
 
-This means that regular expresion operations operate on bytes, not characters.
+See REGULAR EXPRESSIONS
 
 BUILT-IN COMMANDS
 -----------------