-rw-r--r-- | README.utf-8 | 116
-rw-r--r-- | Tcl_shipped.html | 80
-rw-r--r-- | jim_tcl.txt | 55
-rw-r--r-- | jimregexp.c | 11
-rw-r--r-- | parse-unidata.tcl | 7
-rw-r--r-- | utf8.c | 10
-rw-r--r-- | utf8.h | 7
7 files changed, 251 insertions, 35 deletions
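For reference when reading the README.utf-8 hunk, the following C sketch illustrates two points it relies on: how a 16-bit codepoint such as U+00B5 becomes the byte sequence 0xc2, 0xb5, and why character length and byte length differ for non-ASCII strings. The names encode_bmp_utf8() and count_utf8_chars() are hypothetical helpers written for this illustration. They are not the utf8_fromunicode() or utf8_strlen() functions from utf8.c, whose implementations are not shown in this diff.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration only; NOT utf8_fromunicode() from utf8.c.
 * Encodes a codepoint in the range 0..0xffff as UTF-8 into out (which must
 * hold at least 3 bytes) and returns the number of bytes written.
 * Surrogate codepoints are not rejected in this sketch. */
static int encode_bmp_utf8(unsigned cp, unsigned char *out)
{
    assert(cp <= 0xffff);
    if (cp < 0x80) {
        /* ASCII is represented unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xe0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
    out[2] = (unsigned char)(0x80 | (cp & 0x3f));
    return 3;
}

/* Hypothetical helper for illustration only; NOT utf8_strlen() from utf8.c.
 * Counts characters in a null-terminated UTF-8 string by counting every
 * byte that is not a continuation byte (10xxxxxx). */
static size_t count_utf8_chars(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xc0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

With these helpers, encode_bmp_utf8(0xb5, buf) writes the two bytes 0xc2, 0xb5, and count_utf8_chars() returns 2 for the three-byte string "\xc2\xb5" "A". For an ASCII-only string the character count equals the byte count, which is the property the README's "Internal Details" section uses to detect ASCII-only strings.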
diff --git a/README.utf-8 b/README.utf-8
new file mode 100644
index 0000000..ad6c7b5
--- /dev/null
+++ b/README.utf-8
@@ -0,0 +1,116 @@
+UTF-8 Support for Jim Tcl
+=========================
+
+Author: Steve Bennett <steveb@workware.net.au>
+Date: 2 Nov 2010 10:55:52 EST
+
+OVERVIEW
+--------
+Traditionally Jim Tcl has supported strings, including binary strings containing
+nulls, however it has had no support for multi-byte character encodings.
+
+In some fields, such as when dealing with the web, or other user-generated content,
+support for multi-byte character encodings is necessary.
+In these cases it would be very useful for Jim Tcl to be able to process strings
+as multi-byte character strings rather than simply binary bytes.
+
+Supporting multiple character encodings and translation between those encodings
+is beyond the scope of Jim Tcl. Therefore, Jim has been enhanced to add support
+for UTF-8, as probably the most popular general purpose multi-byte encoding.
+
+UTF-8 support is optional. It can be enabled at compile time with:
+
+ ./configure --enable-utf8
+
+The Jim Tcl documentation fully documents the UTF-8 support. This README includes
+additional background information.
+
+Unicode vs UTF-8
+----------------
+It is important to understand that Unicode is an abstract representation
+of the concept of a "character", while UTF-8 is an encoding of
+Unicode into bytes. Thus the Unicode codepoint U+00B5 is encoded
+in UTF-8 with the byte sequence: 0xc2, 0xb5. This is different from
+ASCII, where the same name is used interchangeably for a character
+set and an encoding.
+
+Unicode Escapes
+---------------
+Even without UTF-8 enabled, it is useful to be able to encode UTF-8 characters
+in strings. This can be done with the \uNNNN Unicode escape. This syntax
+is compatible with Tcl and is enabled even if UTF-8 is disabled.
+
+Like Tcl, currently only 16-bit Unicode characters can be encoded.
+
+UTF-8 Properties
+----------------
+Due to the design of the UTF-8 encoding, many (most) commands continue
+to work with UTF-8 strings. This is due to the following properties of UTF-8:
+
+* ASCII characters in strings have the same representation in UTF-8
+* An ASCII string will never match the middle of a multi-byte UTF-8 sequence
+* UTF-8 strings can be sorted as bytes and produce the same result as sorting
+ by characters
+* UTF-8 strings in Jim continue to be null terminated
+
+Commands Supporting UTF-8
+-------------------------
+The following commands have been enhanced to support UTF-8 strings.
+
+* array {get,names,unset}
+* case
+* glob
+* lsearch -glob, -regexp
+* switch -glob, -regexp
+* regexp, regsub
+* format
+* scan
+* split
+* string index, range, length, compare, equal, first, last, map, match, reverse, tolower, toupper
+* string bytelength (new)
+* info procs, commands, vars, globals, locals
+
+Character Classes
+-----------------
+Jim Tcl has no support for UTF-8 character classes. Thus [:alpha:]
+will match [a-zA-Z], but not non-ASCII alphabetic characters. The
+same is true for 'string is'.
+
+Regular Expressions
+-------------------
+Normally, Jim Tcl uses the system-supplied POSIX-compatible regex
+implementation.
+
+Typically systems do not provide a UTF-8 capable regex implementation,
+therefore when UTF-8 support is enabled, the built-in regex
+implementation is used which includes UTF-8 support.
+
+Case Insensitivity
+------------------
+Case folding is much more complex under Unicode than under ASCII.
+For example it is possible for a character to change the number of
+bytes required for representation when converting from one case to
+another. Jim Tcl supports only "simple" case folding, where case
+is folded only where the number of bytes does not change.
+
+Case folding tables are automatically generated from the official
+unicode data table at http://unicode.org/Public/UNIDATA/UnicodeData.txt
+
+Working with Binary Data and non-UTF-8 encodings
+------------------------------------------------
+If it is necessary to work with both UTF-8 and binary data (bytes
+>= 0x80), or non-UTF-8 encodings, you will need to arrange for the
+data to be converted to and from UTF-8 on input and output. Individual
+characters can be converted from Unicode to UTF-8 with the
+utf8_fromunicode() function and the reverse with utf8_tounicode().
+
+Internal Details
+----------------
+Jim_Utf8Length() will calculate the character length of the string and cache
+it for later access. It uses utf8_strlen() which relies on the string being null
+terminated (which it always will be).
+
+It is possible to tell if a string is ascii-only because length == bytelength
+
+It is possible to provide optimised versions of various routines for
+the ascii-only case. Currently this is done only for 'string index' and 'string range'.
diff --git a/Tcl_shipped.html b/Tcl_shipped.html
index c862b72..7cc23e5 100644
--- a/Tcl_shipped.html
+++ b/Tcl_shipped.html
@@ -1250,7 +1250,7 @@ sequence is replaced by the given character:</p></div>
</p>
</dd>
<dt class="hdlist1">
-<tt>\u*nnnn*</tt>
+<tt>\<strong>unnnn</strong></tt>
</dt>
<dd>
<p>
@@ -1836,12 +1836,64 @@ for backward compatibility with experimental versions of this feature.</p></div>
</div>
<h2 id="_regular_expressions">REGULAR EXPRESSIONS</h2>
<div class="sectionbody">
-<div class="paragraph"><p>Tcl provides two commands that support string matching using
-<em>egrep</em>-style regular expressions: <em>regexp</em> and <em>regsub</em>.</p></div>
-<div class="paragraph"><p>Regular expressions are implemented using the system’s C library as
-Extended Regular Expressions (EREs) rather than Basic Regular Expressions (BREs).</p></div>
+<div class="paragraph"><p>Tcl provides two commands that support string matching using regular
+expressions, <em>regexp</em> and <em>regsub</em>, as well as <em>switch -regexp</em> and
+<em>lsearch -regexp</em>.</p></div>
+<div class="paragraph"><p>Regular expressions may be implemented in one of two ways: either using the system’s C library
+POSIX regular expression support, or using the built-in regular expression engine.
+The differences between these are described below.</p></div>
+<div class="paragraph"><p><strong>NOTE</strong> Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (<tt>ARE</tt>).</p></div>
+<h3 id="_posix_regular_expressions">POSIX Regular Expressions</h3><div style="clear:left"></div>
+<div class="paragraph"><p>If the system supports POSIX regular expressions, and UTF-8 support is not enabled,
+this support will be used by default. The regular expressions supported are
+Extended Regular Expressions (<tt>ERE</tt>) rather than Basic Regular Expressions (<tt>BRE</tt>).
+See REG_EXTENDED in the documentation.</p></div>
+<div class="paragraph"><p>Using the system-supported POSIX regular expressions will typically
+make for the smallest code size, but some features such as UTF-8
+and <tt>\w</tt>, <tt>\d</tt>, <tt>\s</tt> are not supported.</p></div>
<div class="paragraph"><p>See regex(3) and regex(7) for full details.</p></div>
-<div class="paragraph"><p><strong>NOTE</strong> Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (AREs).</p></div>
+<h3 id="_jim_built_in_regular_expressions">Jim built-in Regular Expressions</h3><div style="clear:left"></div>
+<div class="paragraph"><p>The Jim built-in regular expression engine may be selected with <tt>./configure --with-jim-regexp</tt>
+or it will be selected automatically if UTF-8 support is enabled.</p></div>
+<div class="paragraph"><p>This engine supports UTF-8 as well as some <tt>ARE</tt> features. The differences with both Tcl 7.x/8.x
+and POSIX are highlighted below.</p></div>
+<div class="olist arabic"><ol class="arabic">
+<li>
+<p>
+UTF-8 strings and patterns are both supported
+</p>
+</li>
+<li>
+<p>
+Supported character classes: <tt>[:alnum:]</tt>, <tt>[:digit:]</tt> and <tt>[:space:]</tt>
+</p>
+</li>
+<li>
+<p>
+Supported shorthand character classes: <tt>\w</tt> = <tt>[:alnum:]</tt>, <tt>\d</tt> = <tt>[:digit:]</tt>, <tt>\s</tt> = <tt>[:space:]</tt>
+</p>
+</li>
+<li>
+<p>
+Character classes apply to ASCII characters only
+</p>
+</li>
+<li>
+<p>
+Supported constraint escapes: <tt>\m</tt> = <tt>\<</tt> = start of word, <tt>\M</tt> = <tt>\></tt> = end of word
+</p>
+</li>
+<li>
+<p>
+Backslash escapes may be used within regular expressions, such as <tt>\n</tt> = newline, <tt>\uNNNN</tt> = unicode
+</p>
+</li>
+<li>
+<p>
+No support for the <tt>?</tt> non-greedy quantifier, e.g. <tt>*?</tt>
+</p>
+</li>
+</ol></div>
</div>
<h2 id="_command_results">COMMAND RESULTS</h2>
<div class="sectionbody">
@@ -2327,7 +2379,7 @@ is still available to embed UTF-8 sequences.</p></div>
pattern matching rules. These commands support UTF-8. For example:</p></div>
<div class="literalblock">
<div class="content">
-<pre><tt>string match a{backslash}[{backslash}ua0-{backslash}ubf{backslash}]b "a{backslash}a3b"</tt></pre>
+<pre><tt>string match a\[\ua0-\ubf\]b "a\a3b"</tt></pre>
</div></div>
<h3 id="_format_and_scan">format and scan</h3><div style="clear:left"></div>
<div class="paragraph"><p><em>format %c</em> allows a unicode codepoint to be encoded. For example, the following will return
@@ -2340,13 +2392,13 @@ a string with two bytes and one character. The same as \ub5</p></div>
return a string with three characters, not three bytes.</p></div>
<div class="literalblock">
<div class="content">
-<pre><tt>format %.3s {backslash}ub5{backslash}ub6{backslash}ub7{backslash}ub8</tt></pre>
+<pre><tt>format %.3s \ub5\ub6\ub7\ub8</tt></pre>
</div></div>
<div class="paragraph"><p>Similarly, <em>scan … %c</em> allows a UTF-8 sequence to be decoded to a unicode codepoint. The following will set
<strong>a</strong> to 181 (0xb5) and <strong>b</strong> to 65.</p></div>
<div class="literalblock">
<div class="content">
-<pre><tt>scan {backslash}00b5A %c%c a b</tt></pre>
+<pre><tt>scan \00b5A %c%c a b</tt></pre>
</div></div>
<div class="paragraph"><p><em>scan %s</em> will also accept a character class, including unicode ranges.</p></div>
<h3 id="_string_classes">String Classes</h3><div style="clear:left"></div>
@@ -2354,7 +2406,7 @@ return a string with three characters, not three bytes.</p></div>
will return 0, even though the string may be considered to be alphabetic.</p></div>
<div class="literalblock">
<div class="content">
-<pre><tt>string is {backslash}b5Test</tt></pre>
+<pre><tt>string is \b5Test</tt></pre>
</div></div>
<div class="paragraph"><p>This does not affect the string classes <em>ascii</em>, <em>control</em>, <em>digit</em>, <em>double</em>, <em>integer</em> or <em>xdigit</em>.</p></div>
<h3 id="_case_mapping_and_conversion">Case Mapping and Conversion</h3><div style="clear:left"></div>
@@ -2376,9 +2428,9 @@ the following returns 2.</p></div>
<pre><tt>string bytelength \xff\xff</tt></pre>
</div></div>
<h3 id="_regular_expressions_2">Regular Expressions</h3><div style="clear:left"></div>
-<div class="paragraph"><p>At this time, regular expressions do <strong>not</strong> support UTF-8 strings. This included
-<em>regexp</em>, <em>regsub</em>, <em>switch -regexp</em> and <em>lsearch -regexp</em>.</p></div>
-<div class="paragraph"><p>This means that regular expresion operations operate on bytes, not characters.</p></div>
+<div class="paragraph"><p>If UTF-8 support is enabled, the built-in regular expression engine will be
+selected which supports UTF-8 strings and patterns.</p></div>
+<div class="paragraph"><p>See REGULAR EXPRESSIONS</p></div>
</div>
<h2 id="_built_in_commands">BUILT-IN COMMANDS</h2>
<div class="sectionbody">
@@ -6356,7 +6408,7 @@ official policies, either expressed or implied, of the Jim Tcl Project.</tt></pr
</div>
<div id="footer">
<div id="footer-text">
-Last updated 2010-11-11 10:56:50 EST
+Last updated 2010-11-11 10:57:51 EST
</div>
</div>
</body>
diff --git a/jim_tcl.txt b/jim_tcl.txt
index f19db0f..4984655 100644
--- a/jim_tcl.txt
+++ b/jim_tcl.txt
@@ -492,7 +492,7 @@ sequence is replaced by the given character:
The digits *ddd* (one, two, or three of them) give the octal value of
the character. Note that Jim supports null characters in strings.

-+{backslash}u*nnnn*+::
++{backslash}*unnnn*+::
The hex digits *nnnn* (between one and four of them) give a unicode codepoint.
The UTF-8 encoding of the codepoint is inserted.

@@ -918,15 +918,44 @@ for backward compatibility with experimental versions of this feature.

REGULAR EXPRESSIONS
-------------------
-Tcl provides two commands that support string matching using
-'egrep'-style regular expressions: 'regexp' and 'regsub'.
+Tcl provides two commands that support string matching using regular
+expressions, 'regexp' and 'regsub', as well as 'switch -regexp' and
+'lsearch -regexp'.

-Regular expressions are implemented using the system's C library as
-Extended Regular Expressions (EREs) rather than Basic Regular Expressions (BREs).
+Regular expressions may be implemented in one of two ways: either using the system's C library
+POSIX regular expression support, or using the built-in regular expression engine.
+The differences between these are described below.
+
+*NOTE* Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (+ARE+).
+
+POSIX Regular Expressions
+~~~~~~~~~~~~~~~~~~~~~~~~~
+If the system supports POSIX regular expressions, and UTF-8 support is not enabled,
+this support will be used by default. The regular expressions supported are
+Extended Regular Expressions (+ERE+) rather than Basic Regular Expressions (+BRE+).
+See REG_EXTENDED in the documentation.
+
+Using the system-supported POSIX regular expressions will typically
+make for the smallest code size, but some features such as UTF-8
+and +{backslash}w+, +{backslash}d+, +{backslash}s+ are not supported.

See regex(3) and regex(7) for full details.

-*NOTE* Tcl 7.x and 8.x use perl-style Advanced Regular Expressions (AREs).
+Jim built-in Regular Expressions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+The Jim built-in regular expression engine may be selected with +./configure --with-jim-regexp+
+or it will be selected automatically if UTF-8 support is enabled.
+
+This engine supports UTF-8 as well as some +ARE+ features. The differences with both Tcl 7.x/8.x
+and POSIX are highlighted below.
+
+1. UTF-8 strings and patterns are both supported
+2. Supported character classes: +[:alnum:]+, +[:digit:]+ and +[:space:]+
+3. Supported shorthand character classes: +{backslash}w+ = +[:alnum:]+, +{backslash}d+ = +[:digit:]+, +{backslash}s+ = +[:space:]+
+4. Character classes apply to ASCII characters only
+5. Supported constraint escapes: +{backslash}m+ = +{backslash}<+ = start of word, +{backslash}M+ = +{backslash}>+ = end of word
+6. Backslash escapes may be used within regular expressions, such as +{backslash}n+ = newline, +{backslash}uNNNN+ = unicode
+7. No support for the +?+ non-greedy quantifier, e.g. +*?+

COMMAND RESULTS
---------------
@@ -1351,7 +1380,7 @@ String Matching
Commands such as 'string match', 'lsearch -glob', 'array names' and others use string
pattern matching rules. These commands support UTF-8. For example:

- string match a{backslash}[{backslash}ua0-{backslash}ubf{backslash}]b "a{backslash}a3b"
+ string match a\[\ua0-\ubf\]b "a\a3b"

format and scan
~~~~~~~~~~~~~~~
@@ -1363,12 +1392,12 @@ a string with two bytes and one character. The same as {backslash}ub5
'format' respects widths as character widths, not byte widths. For example, the following will
return a string with three characters, not three bytes.

- format %.3s {backslash}ub5{backslash}ub6{backslash}ub7{backslash}ub8
+ format %.3s \ub5\ub6\ub7\ub8

Similarly, 'scan ... %c' allows a UTF-8 sequence to be decoded to a unicode codepoint. The following will set
*a* to 181 (0xb5) and *b* to 65.

- scan {backslash}00b5A %c%c a b
+ scan \00b5A %c%c a b

'scan %s' will also accept a character class, including unicode ranges.

@@ -1377,7 +1406,7 @@ String Classes
'string is' has *not* been extended to classify UTF-8 characters. Therefore, the following
will return 0, even though the string may be considered to be alphabetic.

- string is {backslash}b5Test
+ string is \b5Test

This does not affect the string classes 'ascii', 'control', 'digit', 'double', 'integer' or 'xdigit'.

@@ -1406,10 +1435,10 @@ the following returns 2.

Regular Expressions
~~~~~~~~~~~~~~~~~~~
-At this time, regular expressions do *not* support UTF-8 strings. This included
-'regexp', 'regsub', 'switch -regexp' and 'lsearch -regexp'.
+If UTF-8 support is enabled, the built-in regular expression engine will be
+selected which supports UTF-8 strings and patterns.

-This means that regular expresion operations operate on bytes, not characters.
+See REGULAR EXPRESSIONS

BUILT-IN COMMANDS
-----------------
diff --git a/jimregexp.c b/jimregexp.c
index d6a8723..7a0adf2 100644
--- a/jimregexp.c
+++ b/jimregexp.c
@@ -38,11 +38,12 @@
 *** seiwald@perforce.com, on 20 January 2000, to use function prototypes.
 *** THIS IS AN ALTERED VERSION. It was altered by Christopher Seiwald
 *** seiwald@perforce.com, on 05 November 2002, to const string literals.
- *** THIS IS AN ALTERED VERSION. It was altered by Steve Bennett <steveb@workware.net.au>
- *** on 16 October 2010, to remove static state and add better Tcl ARE compatibility.
- *** This includes counted repetitions, UTF-8 support, character classes,
- *** shorthand character classes, increased number of parentheses to 100,
- *** backslash escape sequences.
+ *
+ * THIS IS AN ALTERED VERSION. It was altered by Steve Bennett <steveb@workware.net.au>
+ * on 16 October 2010, to remove static state and add better Tcl ARE compatibility.
+ * This includes counted repetitions, UTF-8 support, character classes,
+ * shorthand character classes, increased number of parentheses to 100,
+ * backslash escape sequences.
 *
 * Beware that some of this code is subtly aware of the way operator
 * precedence is structured in regular expressions. Serious changes in
diff --git a/parse-unidata.tcl b/parse-unidata.tcl
index 9e41e1f..4b5ec3a 100644
--- a/parse-unidata.tcl
+++ b/parse-unidata.tcl
@@ -1,5 +1,12 @@
#!/usr/bin/env tclsh
+# Generate UTF-8 case mapping tables
+#
+# (c) 2010 Steve Bennett <steveb@workware.net.au>
+#
+# See LICENCE for licence details.
+#/
+
# Parse the unicode data from: http://unicode.org/Public/UNIDATA/UnicodeData.txt
# to generate case mapping tables
@@ -1,6 +1,10 @@
-/* -----------------------------------------------------------------------------
- * Utility functions
- * ---------------------------------------------------------------------------*/
+/**
+ * UTF-8 utility functions
+ *
+ * (c) 2010 Steve Bennett <steveb@workware.net.au>
+ *
+ * See LICENCE for licence details.
+ */
#include <ctype.h>
#include <stdlib.h>
@@ -1,5 +1,12 @@
#ifndef UTF8_UTIL_H
#define UTF8_UTIL_H
+/**
+ * UTF-8 utility functions
+ *
+ * (c) 2010 Steve Bennett <steveb@workware.net.au>
+ *
+ * See LICENCE for licence details.
+ */
/**
 * Converts the given unicode codepoint (0 - 0xffff) to utf-8