From 9f6ad73686d6dc1fc8628be60a0d42a6ee20817c Mon Sep 17 00:00:00 2001 From: Steve Bennett Date: Wed, 20 Oct 2010 16:01:17 +1000 Subject: Add UTF-8 support to Jim Signed-off-by: Steve Bennett --- jim_tcl.txt | 119 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 119 insertions(+) diff --git a/jim_tcl.txt b/jim_tcl.txt index 7cff785..f19db0f 100644 --- a/jim_tcl.txt +++ b/jim_tcl.txt @@ -492,6 +492,10 @@ sequence is replaced by the given character: The digits *ddd* (one, two, or three of them) give the octal value of the character. Note that Jim supports null characters in strings. ++{backslash}u*nnnn*+:: + The hex digits *nnnn* (between one and four of them) give a unicode codepoint. + The UTF-8 encoding of the codepoint is inserted. + For example, in the command set a \{x\[\ yz\141 @@ -1324,6 +1328,89 @@ The procedure may also be delete immediately by renaming it "". e.g. jim> rename $f "" +UTF-8 AND UNICODE +----------------- +If Jim is built with UTF-8 support enabled (configure --enable-utf), +then most string-related commands become UTF-8 aware. These include, +but are not limited to, 'string match', 'split', 'glob', 'scan' and +'format'. + +UTF-8 encoding has many advantages, but one of the complications is that +characters can take a variable number of bytes. Hence the addition of +'string bytelength', which returns the number of bytes in a string, +while 'string length' returns the number of characters. + +If UTF-8 support is not enabled, all commands treat bytes as characters +and 'string bytelength' returns the same value as 'string length'. + +Note that even if UTF-8 support is not enabled, the {backslash}uNNNN syntax +is still available to embed UTF-8 sequences. + +String Matching +~~~~~~~~~~~~~~~ +Commands such as 'string match', 'lsearch -glob', 'array names' and others use string +pattern matching rules. These commands support UTF-8. 
For example: + + string match a{backslash}[{backslash}ua0-{backslash}ubf{backslash}]b "a{backslash}ua3b" + +format and scan +~~~~~~~~~~~~~~~ +'format %c' allows a unicode codepoint to be encoded. For example, the following will return +a string with two bytes and one character, the same as {backslash}ub5. + + format %c 0xb5 + +'format' respects widths as character widths, not byte widths. For example, the following will +return a string with three characters, not three bytes. + + format %.3s {backslash}ub5{backslash}ub6{backslash}ub7{backslash}ub8 + +Similarly, 'scan ... %c' allows a UTF-8 sequence to be decoded to a unicode codepoint. The following will set +*a* to 181 (0xb5) and *b* to 65. + + scan {backslash}u00b5A %c%c a b + +'scan %s' will also accept a character class, including unicode ranges. + +String Classes +~~~~~~~~~~~~~~ +'string is' has *not* been extended to classify UTF-8 characters. Therefore, the following +will return 0, even though the string may be considered to be alphabetic. + + string is alpha {backslash}ub5Test + +This does not affect the string classes 'ascii', 'control', 'digit', 'double', 'integer' or 'xdigit'. + +Case Mapping and Conversion +~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Jim provides a simplified unicode case mapping. This means that case conversion +and comparison will not increase or decrease the number of characters in a string. + +'string toupper' will convert any lowercase letters to their uppercase equivalent. +Any character which is not a letter or has no uppercase equivalent is left unchanged. +Similarly for 'string tolower'. + +Commands which perform case insensitive matches, such as 'string compare -nocase' +and 'lsearch -nocase' fold both strings to uppercase before comparison. 
+ +Invalid UTF-8 Sequences +~~~~~~~~~~~~~~~~~~~~~~~ +Some UTF-8 character sequences are invalid, such as those beginning with '0xff', +those which represent character sequences longer than 3 bytes (greater than U+FFFF), +and those which end prematurely, such as a lone '0xc2'. + +In these situations, the offending bytes are treated as single characters. For example, +the following returns 2. + + string bytelength \xff\xff + +Regular Expressions +~~~~~~~~~~~~~~~~~~~ +At this time, regular expressions do *not* support UTF-8 strings. This includes +'regexp', 'regsub', 'switch -regexp' and 'lsearch -regexp'. + +This means that regular expression operations operate on bytes, not characters. + BUILT-IN COMMANDS ----------------- The Tcl library provides the following built-in commands, which will @@ -1707,6 +1794,12 @@ command. The legal *options* are: names match *pattern* (using the matching rules of string match) are included. +*dict keys* 'dictionary ?pattern?'+:: + Returns a list of the keys in the dictionary. + If pattern is specified, then only those keys whose + names match *pattern* (using the matching rules of string + match) are included. + *dict set* 'dictionaryName key ?key ...? value'+:: This operation takes the *name* of a variable containing a dictionary value and places an updated dictionary value in that variable @@ -2343,6 +2436,9 @@ The legal *option*'s (which may be abbreviated) are: +*info channels*+:: Returns a list of all open 'aio' channels. ++*info channels*+:: + Returns a list of all open file handles from 'open' or 'socket' + +*info commands* ?'pattern'?+:: If *pattern* isn't specified, returns a list of names of all the Tcl commands, including both the built-in commands written in C and @@ -2845,6 +2941,10 @@ If *-index listindex* is specified, each element of the list is treated as a lis the given index is extracted from the list for comparison. The list index may be any valid list index, such as '1', 'end' or 'end-2'. 
+If *-index listindex* is specified, each element of the list is treated as a list and +the given index is extracted from the list for comparison. The list index may +be any valid list index, such as '1', 'end' or 'end-2'. + open ~~~~ +*open* 'fileName ?access?'+ @@ -3474,6 +3574,12 @@ string Perform one of several string operations, depending on *option*. The legal options (which may be abbreviated) are: ++*string bytelength* 'string'+:: + Returns the length of the string in bytes. This will return + the same value as 'string length' if UTF-8 support is not enabled, + or if the string is composed entirely of ASCII characters. + See UTF-8 AND UNICODE. + +*string compare ?-nocase?* 'string1 string2'+:: Perform a character-by-character comparison of strings *string1* and *string2* in the same way as the C 'strcmp' procedure. Return @@ -3525,6 +3631,8 @@ The legal options (which may be abbreviated) are: +space+;; Any space character. +upper+;; Any upper case alphabet character. +xdigit+;; Any hexadecimal digit character ([0-9A-Fa-f]). + :: + Note that string classification does *not* respect UTF-8. See UTF-8 AND UNICODE. +*string last* 'string1 string2 ?lastIndex?'+:: Search *string2* for a sequence of characters that exactly match @@ -3537,6 +3645,8 @@ The legal options (which may be abbreviated) are: +*string length* 'string'+:: Returns a decimal string giving the number of characters in *string*. + If UTF-8 support is enabled, this may be different than the number of bytes. + See UTF-8 AND UNICODE. +*string match ?-nocase?* 'pattern string'+:: See if *pattern* matches *string*; return 1 if it does, 0 @@ -4082,6 +4192,11 @@ aio +$handle *tell*+:: Returns the current seek position ++$handle *filename*+:: + Returns the original filename used when opening the file. + If the handle was returned from 'socket', the type of the + handle is returned instead. + +$handle *ndelay ?0|1?*+:: Set O_NDELAY (if arg). Returns current/new setting. 
Note that in general ANSI I/O interacts badly with non-blocking I/O. @@ -4327,6 +4442,10 @@ The following global variables are set by jimsh. This variable is set to 1 if jimsh is started in interactive mode or 0 otherwise. ++*tcl_platform*+:: + This variable is set by Jim as an array containing information + about the platform upon which Jim was built. + +*argv0*+:: If jimsh is invoked to run a script, this variable contains the name of the script. -- cgit v1.1