From b45e63f2b5ca11d102690b19d8c5f7685754e75c Mon Sep 17 00:00:00 2001
From: Steve Bennett
Date: Fri, 22 Oct 2010 11:34:27 +1000
Subject: Update documentation to cover UTF-8 support for regexp

Also create README.utf-8

Signed-off-by: Steve Bennett
---
 README.utf-8      | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 Tcl_shipped.html  |  80 ++++++++++++++++++++++++++++++-------
 jim_tcl.txt       |  55 ++++++++++++++++++++------
 jimregexp.c       |  11 +++---
 parse-unidata.tcl |   7 ++++
 utf8.c            |  10 +++--
 utf8.h            |   7 ++++
 7 files changed, 251 insertions(+), 35 deletions(-)
 create mode 100644 README.utf-8

diff --git a/README.utf-8 b/README.utf-8
new file mode 100644
index 0000000..ad6c7b5
--- /dev/null
+++ b/README.utf-8
@@ -0,0 +1,116 @@
+UTF-8 Support for Jim Tcl
+=========================
+
+Author: Steve Bennett
+Date: 2 Nov 2010 10:55:52 EST
+
+OVERVIEW
+--------
+Traditionally Jim Tcl has supported strings, including binary strings containing
+nulls; however, it has had no support for multi-byte character encodings.
+
+In some fields, such as when dealing with the web, or other user-generated content,
+support for multi-byte character encodings is necessary.
+In these cases it would be very useful for Jim Tcl to be able to process strings
+as multi-byte character strings rather than simply binary bytes.
+
+Supporting multiple character encodings and translation between those encodings
+is beyond the scope of Jim Tcl. Therefore, Jim has been enhanced to add support
+for UTF-8, as probably the most popular general purpose multi-byte encoding.
+
+UTF-8 support is optional. It can be enabled at compile time with:
+
+  ./configure --enable-utf8
+
+The Jim Tcl documentation fully documents the UTF-8 support. This README includes
+additional background information.
+
+Unicode vs UTF-8
+----------------
+It is important to understand that Unicode is an abstract representation
+of the concept of a "character", while UTF-8 is an encoding of
+Unicode into bytes. Thus the Unicode codepoint U+00B5 is encoded
+in UTF-8 with the byte sequence: 0xc2, 0xb5. This is different from
+ASCII, where the same name is used interchangeably for a character
+set and an encoding.
+
+Unicode Escapes
+---------------
+Even without UTF-8 enabled, it is useful to be able to encode UTF-8 characters
+in strings. This can be done with the \uNNNN Unicode escape. This syntax
+is compatible with Tcl and is enabled even if UTF-8 is disabled.
+
+Like Tcl, currently only 16-bit Unicode characters can be encoded.
+
+UTF-8 Properties
+----------------
+Due to the design of the UTF-8 encoding, most commands continue
+to work with UTF-8 strings. This is due to the following properties of UTF-8:
+
+* ASCII characters in strings have the same representation in UTF-8
+* An ASCII string will never match the middle of a multi-byte UTF-8 sequence
+* UTF-8 strings can be sorted as bytes and produce the same result as sorting
+  by characters
+* UTF-8 strings in Jim continue to be null terminated
+
+Commands Supporting UTF-8
+-------------------------
+The following commands have been enhanced to support UTF-8 strings.
+
+* array {get,names,unset}
+* case
+* glob
+* lsearch -glob, -regexp
+* switch -glob, -regexp
+* regexp, regsub
+* format
+* scan
+* split
+* string index, range, length, compare, equal, first, last, map, match, reverse, tolower, toupper
+* string bytelength (new)
+* info procs, commands, vars, globals, locals
+
+Character Classes
+-----------------
+Jim Tcl has no support for UTF-8 character classes. Thus [:alpha:]
+will match [a-zA-Z], but not non-ASCII alphabetic characters. The
+same is true for 'string is'.
+
+Regular Expressions
+-------------------
+Normally, Jim Tcl uses the system-supplied POSIX-compatible regex
+implementation.
+
+Typically, systems do not provide a UTF-8 capable regex implementation.
+Therefore, when UTF-8 support is enabled, the built-in regex
+implementation, which includes UTF-8 support, is used instead.
+
+Case Insensitivity
+------------------
+Case folding is much more complex under Unicode than under ASCII.
+For example, it is possible for a character to change the number of
+bytes required for representation when converting from one case to
+another. Jim Tcl supports only "simple" case folding, where case
+is folded only where the number of bytes does not change.
+
+Case folding tables are automatically generated from the official
+Unicode data table at http://unicode.org/Public/UNIDATA/UnicodeData.txt
+
+Working with Binary Data and non-UTF-8 encodings
+------------------------------------------------
+If it is necessary to work with both UTF-8 and binary data (bytes
+>= 0x80), or non-UTF-8 encodings, you will need to arrange for the
+data to be converted to and from UTF-8 on input and output. Individual
+characters can be converted from Unicode to UTF-8 with the
+utf8_fromunicode() function and the reverse with utf8_tounicode().
+
+Internal Details
+----------------
+Jim_Utf8Length() will calculate the character length of the string and cache
+it for later access. It uses utf8_strlen(), which relies on the string being null
+terminated (which it always will be).
+
+It is possible to tell if a string is ASCII-only because length == bytelength.
+
+It is possible to provide optimised versions of various routines for
+the ASCII-only case. Currently this is done only for 'string index' and 'string range'.
diff --git a/Tcl_shipped.html b/Tcl_shipped.html
index c862b72..7cc23e5 100644
--- a/Tcl_shipped.html
+++ b/Tcl_shipped.html
@@ -1250,7 +1250,7 @@ sequence is replaced by the given character:

-\u*nnnn* +\unnnn

@@ -1836,12 +1836,64 @@ for backward compatibility with experimental versions of this feature.

REGULAR EXPRESSIONS

-

Tcl provides two commands that support string matching using -egrep-style regular expressions: regexp and regsub.

-

Regular expressions are implemented using the system’s C library as -Extended Regular Expressions (EREs) rather than Basic Regular Expressions (BREs).

+

Tcl provides two commands that support string matching using regular +expressions, regexp and regsub, as well as switch -regexp and +lsearch -regexp.

+

Regular expressions may be implemented in one of two ways: either using the system’s C library +POSIX regular expression support, or using the built-in regular expression engine. +The differences between these are described below.

+

NOTE Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (ARE).

+

POSIX Regular Expressions

+

If the system supports POSIX regular expressions, and UTF-8 support is not enabled, +this support will be used by default. The regular expressions supported are +Extended Regular Expressions (ERE) rather than Basic Regular Expressions (BRE). +See REG_EXTENDED in the documentation.

+

Using the system-supported POSIX regular expressions will typically +make for the smallest code size, but some features such as UTF-8 +and \w, \d, \s are not supported.

See regex(3) and regex(7) for full details.

-

NOTE Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (AREs).

+

Jim built-in Regular Expressions

+

The Jim built-in regular expression engine may be selected with ./configure --with-jim-regexp +or it will be selected automatically if UTF-8 support is enabled.

+

This engine supports UTF-8 as well as some ARE features. The differences with both Tcl 7.x/8.x +and POSIX are highlighted below.

+1. UTF-8 strings and patterns are both supported
+2. Supported character classes: [:alnum:], [:digit:] and [:space:]
+3. Supported shorthand character classes: \w = [:alnum:], \d = [:digit:], \s = [:space:]
+4. Character classes apply to ASCII characters only
+5. Supported constraint escapes: \m = \< = start of word, \M = \> = end of word
+6. Backslash escapes may be used within regular expressions, such as \n = newline, \uNNNN = unicode
+7. No support for the ? non-greedy quantifier. e.g. *?

COMMAND RESULTS

@@ -2327,7 +2379,7 @@ is still available to embed UTF-8 sequences.

pattern matching rules. These commands support UTF-8. For example:

-
string match a{backslash}[{backslash}ua0-{backslash}ubf{backslash}]b "a{backslash}a3b"
+
string match a\[\ua0-\ubf\]b "a\a3b"

format and scan

format %c allows a unicode codepoint to be be encoded. For example, the following will return @@ -2340,13 +2392,13 @@ a string with two bytes and one character. The same as \ub5

return a string with three characters, not three bytes.

-
format %.3s {backslash}ub5{backslash}ub6{backslash}ub7{backslash}ub8
+
format %.3s \ub5\ub6\ub7\ub8

Similarly, scan … %c allows a UTF-8 to be decoded to a unicode codepoint. The following will set a to 181 (0xb5) and b to 181 and b to 65.

-
scan {backslash}00b5A %c%c a b
+
scan \00b5A %c%c a b

scan %s will also accept a character class, including unicode ranges.

String Classes

@@ -2354,7 +2406,7 @@ return a string with three characters, not three bytes.

will return 0, even though the string may be considered to be alphabetic.

-
string is {backslash}b5Test
+
string is \b5Test

This does not affect the string classes ascii, control, digit, double, integer or xdigit.

Case Mapping and Conversion

@@ -2376,9 +2428,9 @@ the following returns 2.

string bytelength \xff\xff

Regular Expressions

-

At this time, regular expressions do not support UTF-8 strings. This included -regexp, regsub, switch -regexp and lsearch -regexp.

-

This means that regular expresion operations operate on bytes, not characters.

+

If UTF-8 support is enabled, the built-in regular expression engine will be +selected which supports UTF-8 strings and patterns.

+

See REGULAR EXPRESSIONS

BUILT-IN COMMANDS

@@ -6356,7 +6408,7 @@ official policies, either expressed or implied, of the Jim Tcl Project.
diff --git a/jim_tcl.txt b/jim_tcl.txt
index f19db0f..4984655 100644
--- a/jim_tcl.txt
+++ b/jim_tcl.txt
@@ -492,7 +492,7 @@ sequence is replaced by the given character:
     The digits *ddd* (one, two, or three of them) give the octal value of
     the character. Note that Jim supports null characters in strings.

-+{backslash}u*nnnn*+::
++{backslash}*unnnn*+::
     The hex digits *nnnn* (between one and four of them) give a unicode codepoint.
     The UTF-8 encoding of the codepoint is inserted.

@@ -918,15 +918,44 @@ for backward compatibility with experimental versions of this feature.

 REGULAR EXPRESSIONS
 -------------------
-Tcl provides two commands that support string matching using
-'egrep'-style regular expressions: 'regexp' and 'regsub'.
+Tcl provides two commands that support string matching using regular
+expressions, 'regexp' and 'regsub', as well as 'switch -regexp' and
+'lsearch -regexp'.

-Regular expressions are implemented using the system's C library as
-Extended Regular Expressions (EREs) rather than Basic Regular Expressions (BREs).
+Regular expressions may be implemented in one of two ways: either using the system's C library
+POSIX regular expression support, or using the built-in regular expression engine.
+The differences between these are described below.
+
+*NOTE* Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (+ARE+).
+
+POSIX Regular Expressions
+~~~~~~~~~~~~~~~~~~~~~~~~~
+If the system supports POSIX regular expressions, and UTF-8 support is not enabled,
+this support will be used by default. The regular expressions supported are
+Extended Regular Expressions (+ERE+) rather than Basic Regular Expressions (+BRE+).
+See REG_EXTENDED in the documentation.
+
+Using the system-supported POSIX regular expressions will typically
+make for the smallest code size, but some features such as UTF-8
+and +{backslash}w+, +{backslash}d+, +{backslash}s+ are not supported.

 See regex(3) and regex(7) for full details.

-*NOTE* Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (AREs).
+Jim built-in Regular Expressions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The Jim built-in regular expression engine may be selected with +./configure --with-jim-regexp+
+or it will be selected automatically if UTF-8 support is enabled.
+
+This engine supports UTF-8 as well as some +ARE+ features. The differences with both Tcl 7.x/8.x
+and POSIX are highlighted below.
+
+1. UTF-8 strings and patterns are both supported
+2. Supported character classes: +[:alnum:]+, +[:digit:]+ and +[:space:]+
+3. Supported shorthand character classes: +{backslash}w+ = +[:alnum:]+, +{backslash}d+ = +[:digit:]+, +{backslash}s+ = +[:space:]+
+4. Character classes apply to ASCII characters only
+5. Supported constraint escapes: +{backslash}m+ = +{backslash}<+ = start of word, +{backslash}M+ = +{backslash}>+ = end of word
+6. Backslash escapes may be used within regular expressions, such as +{backslash}n+ = newline, +{backslash}uNNNN+ = unicode
+7. No support for the +?+ non-greedy quantifier. e.g. +*?+

 COMMAND RESULTS
 ---------------
@@ -1351,7 +1380,7 @@ String Matching
 Commands such as 'string match', 'lsearch -glob', 'array names' and others use string
 pattern matching rules. These commands support UTF-8. For example:

-    string match a{backslash}[{backslash}ua0-{backslash}ubf{backslash}]b "a{backslash}a3b"
+    string match a\[\ua0-\ubf\]b "a\a3b"

 format and scan
 ~~~~~~~~~~~~~~~
@@ -1363,12 +1392,12 @@ a string with two bytes and one character. The same as {backslash}ub5
 'format' respects widths as character widths, not byte widths. For example, the following will
 return a string with three characters, not three bytes.

-    format %.3s {backslash}ub5{backslash}ub6{backslash}ub7{backslash}ub8
+    format %.3s \ub5\ub6\ub7\ub8

 Similarly, 'scan ... %c' allows a UTF-8 to be decoded to a unicode codepoint. The following will set
 *a* to 181 (0xb5) and *b* to '181' and 'b' to 65.

-    scan {backslash}00b5A %c%c a b
+    scan \00b5A %c%c a b

 'scan %s' will also accept a character class, including unicode ranges.

@@ -1377,7 +1406,7 @@ String Classes
 'string is' has *not* been extended to classify UTF-8 characters. Therefore, the following
 will return 0, even though the string may be considered to be alphabetic.

-    string is {backslash}b5Test
+    string is \b5Test

 This does not affect the string classes 'ascii', 'control', 'digit', 'double', 'integer' or 'xdigit'.

@@ -1406,10 +1435,10 @@ the following returns 2.

 Regular Expressions
 ~~~~~~~~~~~~~~~~~~~
-At this time, regular expressions do *not* support UTF-8 strings. This included
-'regexp', 'regsub', 'switch -regexp' and 'lsearch -regexp'.
+If UTF-8 support is enabled, the built-in regular expression engine will be
+selected which supports UTF-8 strings and patterns.

-This means that regular expresion operations operate on bytes, not characters.
+See REGULAR EXPRESSIONS

 BUILT-IN COMMANDS
 -----------------
diff --git a/jimregexp.c b/jimregexp.c
index d6a8723..7a0adf2 100644
--- a/jimregexp.c
+++ b/jimregexp.c
@@ -38,11 +38,12 @@
  *** seiwald@perforce.com, on 20 January 2000, to use function prototypes.
  *** THIS IS AN ALTERED VERSION. It was altered by Christopher Seiwald
  *** seiwald@perforce.com, on 05 November 2002, to const string literals.
- *** THIS IS AN ALTERED VERSION. It was altered by Steve Bennett
- *** on 16 October 2010, to remove static state and add better Tcl ARE compatibility.
- *** This includes counted repetitions, UTF-8 support, character classes,
- *** shorthand character classes, increased number of parentheses to 100,
- *** backslash escape sequences.
+ *
+ * THIS IS AN ALTERED VERSION. It was altered by Steve Bennett
+ * on 16 October 2010, to remove static state and add better Tcl ARE compatibility.
+ * This includes counted repetitions, UTF-8 support, character classes,
+ * shorthand character classes, increased number of parentheses to 100,
+ * backslash escape sequences.
  *
  * Beware that some of this code is subtly aware of the way operator
  * precedence is structured in regular expressions. Serious changes in
diff --git a/parse-unidata.tcl b/parse-unidata.tcl
index 9e41e1f..4b5ec3a 100644
--- a/parse-unidata.tcl
+++ b/parse-unidata.tcl
@@ -1,5 +1,12 @@
 #!/usr/bin/env tclsh
+# Generate UTF-8 case mapping tables
+#
+# (c) 2010 Steve Bennett
+#
+# See LICENCE for licence details.
+#/
+
 # Parse the unicode data from: http://unicode.org/Public/UNIDATA/UnicodeData.txt
 # to generate case mapping tables
diff --git a/utf8.c b/utf8.c
index 13c5fe6..3be9899 100644
--- a/utf8.c
+++ b/utf8.c
@@ -1,6 +1,10 @@
-/* -----------------------------------------------------------------------------
- * Utility functions
- * ---------------------------------------------------------------------------*/
+/**
+ * UTF-8 utility functions
+ *
+ * (c) 2010 Steve Bennett
+ *
+ * See LICENCE for licence details.
+ */
 #include
 #include
diff --git a/utf8.h b/utf8.h
index 9e03059..5df2e45 100644
--- a/utf8.h
+++ b/utf8.h
@@ -1,5 +1,12 @@
 #ifndef UTF8_UTIL_H
 #define UTF8_UTIL_H
+/**
+ * UTF-8 utility functions
+ *
+ * (c) 2010 Steve Bennett
+ *
+ * See LICENCE for licence details.
+ */

 /**
  * Converts the given unicode codepoint (0 - 0xffff) to utf-8
--
cgit v1.1
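
Reader's note (not part of the patch): the README's "Unicode vs UTF-8" and "Internal Details" sections can be illustrated with a minimal C sketch. It is not taken from utf8.c. The names sketch_utf8_encode() and sketch_utf8_strlen() are hypothetical, and Jim's own utf8_fromunicode() and utf8_strlen() may differ in signature and behaviour. The sketch assumes codepoints in the range 0 - 0xffff and well-formed, null-terminated input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not Jim's utf8_fromunicode(): encode a codepoint in
 * the range 0 - 0xffff as UTF-8. Writes 1 to 3 bytes to out and returns the
 * number of bytes written. Surrogate values are not rejected here. */
int sketch_utf8_encode(unsigned cp, unsigned char *out)
{
    assert(out != NULL);
    assert(cp <= 0xffff);

    if (cp < 0x80) {
        /* ASCII: same single byte in UTF-8 */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xe0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
    out[2] = (unsigned char)(0x80 | (cp & 0x3f));
    return 3;
}

/* Hypothetical helper, not Jim's utf8_strlen(): count characters in a
 * null-terminated UTF-8 string by skipping continuation bytes (10xxxxxx).
 * Assumes the input is well-formed UTF-8. */
size_t sketch_utf8_strlen(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xc0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

Under these assumptions, encoding 0xb5 produces the two bytes 0xc2, 0xb5 quoted in the README, and a string whose character count equals its byte count contains only ASCII characters.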