author | Ulrich Drepper <drepper@redhat.com> | 1999-01-11 20:13:43 +0000 |
---|---|---|
committer | Ulrich Drepper <drepper@redhat.com> | 1999-01-11 20:13:43 +0000 |
commit | 390955cbdeb674bead490fc3f74a8a0893ea83cf (patch) | |
tree | 2900fdc697f52133f633c09edbbe712882736bf0 /manual/charset.texi | |
parent | 68ef28edc2f1bafa417da1ac8d35a3bf2a1b565b (diff) | |
Update.
1999-01-11 Ulrich Drepper <drepper@cygnus.com>
* ctype/Versions [GLIBC_2.0]: Export __ctype32_b.
* include/wctype.h: Declare __iswctype.
* stdio-common/vfscanf.c (__vfscanf): Use __iswspace instead of
iswspace.
* wctype/Makefile (routines): Add wcextra_l.
* wctype/wcextra.c (iswblank): Implement function here and don't use
__iswctype.
(__iswblank_l): Move definition to...
* wctype/wcextra_l.c: ...here. New file.
* wctype/wcfuncs.c: Really implement functions and don't call
__iswctype or __towctrans.
* wctype/wctype.h: Change isw* and tow* macros. Don't call
__iswctype or __towctrans. Instead optimize constant argument case.
* iconv/gconv.h: Fix typos.
* iconv/skeleton.c: Fix typos. Optimize init function a bit.
Correctly emit escape sequence to return to initial state in
conversion function.
* iconvdata/iso-2022-jp.c (gconv_init): Correctly initialize
max_needed_to element.
* manual/mbyte.texi: Removed. This is now described in charset.texi.
* manual/charset.texi: New file.
* manual/Makefile (chapters): Replace mbyte by charset.
* manual/ctype.texi: Document wide character functions.
* manual/intro.texi: Fix reference to mbyte chapter.
* manual/lang.texi: Likewise.
* manual/locale.texi: Likewise.
* manual/stdio.texi: Likewise.
* manual/string.texi: Fix @node line for new charset chapter.
* manual/libc.texinfo (UPDATED): Updated. Also update copyright years.
* manual/memory.texi (savestring): Optimize code to give a good
example.
* manual/filesys.texi: Fix wording. Patches by Jim Meyering.
* nscd/nscd_getgr_r.c: Include stdint.h to get uintptr_t definition.
* nscd/nscd_getpw_r.c: Likewise.
* nscd/nscd_gethst_r.c: Likewise.
* stdlib/stdtold_l.c: Always include xlocale.h.
1999-01-11 Geoffrey Keating <geoffk@ozemail.com.au>
* stdlib/fpioconst.h (LDBL_MAX_10_EXP_LOG): Define to be same as
DBL_MAX_10_EXP_LOG if there is no long double.
(_fpioconst_pow10): Always use size as LDBL_MAX_10_EXP_LOG to match
printf_fp.c.
1999-01-10 Andreas Jaeger <aj@arthur.rhein-neckar.de>
* timezone/Makefile ($(testdata)/GB): Changed to ...
($(testdata)/Europe/London): ... for tst-timezone test.
($(objpfx)tst-timezone.out): Change GB to Europe/London.
* timezone/tst-timezone.c (main): Enable DST switching test,
change GB to Europe/London.
1999-01-10 Philip Blundell <philb@gnu.org>
* socket/Makefile (headers): Remove bits/sockunion.h.
1999-01-09 Philip Blundell <philb@gnu.org>
* socket/sys/socket.h: Don't include <bits/sockunion.h>.
* sysdeps/generic/bits/sockunion.h: Deleted.
* sysdeps/unix/sysv/linux/bits/sockunion.h: Likewise.
1999-01-08 H.J. Lu <hjl@gnu.org>
* io/fts.c (fts_close): Don't access memory after having it freed.
Diffstat (limited to 'manual/charset.texi')
-rw-r--r-- | manual/charset.texi | 2846 |
1 files changed, 2846 insertions, 0 deletions
diff --git a/manual/charset.texi b/manual/charset.texi
new file mode 100644
index 0000000..6179128
--- /dev/null
+++ b/manual/charset.texi
@@ -0,0 +1,2846 @@
+@node Character Set Handling, Locales, String and Array Utilities, Top
+@c %MENU% Support for extended character sets
+@chapter Character Set Handling
+
+@ifnottex
+@macro cal{text}
+\text\
+@end macro
+@end ifnottex
+
+Character sets used in the early days of computers had only six, seven,
+or eight bits for each character. In no case were more bits used than would fit
+into one byte, which nowadays is almost exclusively @w{8 bits} wide.
+This of course leads to several problems once not all characters needed
+at one time can be represented by the up to 256 available characters.
+This chapter shows the functionality which was added to the C library to
+overcome this problem.
+
+@menu
+* Extended Char Intro:: Introduction to Extended Characters.
+* Charset Function Overview:: Overview about Character Handling
+                                     Functions.
+* Restartable multibyte conversion:: Restartable multibyte conversion
+                                     Functions.
+* Non-reentrant Conversion:: Non-reentrant Conversion Function.
+* Generic Charset Conversion:: Generic Charset Conversion.
+@end menu
+
+
+@node Extended Char Intro
+@section Introduction to Extended Characters
+
+To overcome the limitations of character sets with a 1:1 relation
+between bytes and characters people came up with a variety of solutions.
+The remainder of this section gives a few examples to help in understanding
+the design decisions made while developing the functionality of the @w{C
+library} to support them.
+
+@cindex internal representation
+A distinction we have to make right away is between internal and
+external representation. @dfn{Internal representation} means the
+representation used by a program while keeping the text in memory.
+External representations are used when text is stored or transmitted
+through whatever communication channel.
+
+Traditionally there was no difference between the two representations.
+It was equally comfortable and useful to use the same one-byte
+representation internally and externally. This changes with more and
+larger character sets.
+
+One of the problems to overcome with the internal representation is
+handling texts which were externally encoded using different character
+sets. Assume a program which reads two texts and compares them using
+some metric. The comparison can be usefully done only if the texts are
+internally kept in a common format.
+
+@cindex wide character
+For such a common format (@math{=} character set) eight bits are certainly
+not enough anymore. So the smallest entity will have to grow: @dfn{wide
+characters} will be used. Here instead of one byte one uses two or four
+(three are not good to address in memory and more than four bytes seem
+not to be necessary).
+
+@cindex Unicode
+@cindex ISO 10646
+As shown in some other part of this manual
+@c !!! Ahem, wide char string functions are not yet covered -- drepper
+there exists a completely new family of functions which can handle texts
+of this kind in memory. The most commonly used character sets for such
+internal wide character representations are Unicode and @w{ISO 10646}.
+The former is a subset of the latter and used when wide characters are
+chosen to be 2 bytes (@math{= 16} bits) wide. The standard names of the
+@cindex UCS2
+@cindex UCS4
+encodings used in these cases are UCS2 (@math{= 16} bits) and UCS4
+(@math{= 32} bits).
+
+To represent wide characters the @code{char} type is certainly not
+suitable. For this reason the @w{ISO C} standard introduces a new type
+which is designed to keep one character of a wide character string. To
+maintain the similarity there is also a type corresponding to @code{int}
+for those functions which take a single wide character.
+
+@comment stddef.h
+@comment ISO
+@deftp {Data type} wchar_t
+This data type is used as the base type for wide character strings.
+I.e., arrays of objects of this type are the equivalent of @code{char[]}
+for multibyte character strings. The type is defined in @file{stddef.h}.
+
+The @w{ISO C89} standard, where this type was introduced, does not say
+anything specific about the representation. It only requires that this
+type is capable of storing all elements of the basic character set.
+Therefore it would be legitimate to define @code{wchar_t} as
+@code{char}. This might make sense for embedded systems.
+
+But for GNU systems this type is always 32 bits wide. It is therefore
+capable of representing all UCS4 values, therefore covering all of @w{ISO
+10646}. Some Unix systems define @code{wchar_t} as a 16 bit type and
+thereby follow Unicode very strictly. This is perfectly fine with the
+standard but it also means that to represent all characters from Unicode
+and @w{ISO 10646} one has to use surrogate characters, which is in fact a
+multi-wide-character encoding. But this contradicts the purpose of the
+@code{wchar_t} type.
+@end deftp
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} wint_t
+@code{wint_t} is a data type used for parameters and variables which
+contain a single wide character. As the name already suggests it is the
+equivalent to @code{int} when using the normal @code{char} strings. The
+types @code{wchar_t} and @code{wint_t} often have the same
+representation if their size is 32 bits but if @code{wchar_t} is
+defined as @code{char} the type @code{wint_t} must be defined as
+@code{int} due to the parameter promotion.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h} and was introduced in
+Amendment 1 to @w{ISO C90}.
+@end deftp
+
+As for the @code{char} data type, there also exist macros
+specifying the minimum and maximum value representable in an object of
+type @code{wchar_t}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wchar_t WCHAR_MIN
+The macro @code{WCHAR_MIN} evaluates to the minimum value representable
+by an object of type @code{wchar_t}.
+
+This macro was introduced in Amendment 1 to @w{ISO C90}.
+@end deftypevr
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wchar_t WCHAR_MAX
+The macro @code{WCHAR_MAX} evaluates to the maximum value representable
+by an object of type @code{wchar_t}.
+
+This macro was introduced in Amendment 1 to @w{ISO C90}.
+@end deftypevr
+
+Another special wide character value is the equivalent to @code{EOF}.
+
+@comment wchar.h
+@comment ISO
+@deftypevr Macro wint_t WEOF
+The macro @code{WEOF} evaluates to a constant expression of type
+@code{wint_t} whose value is different from any member of the extended
+character set.
+
+@code{WEOF} need not be the same value as @code{EOF} and unlike
+@code{EOF} it also need @emph{not} be negative. I.e., sloppy code like
+
+@smallexample
+@{
+  int c;
+  ...
+  while ((c = getc (fp)) >= 0)
+    ...
+@}
+@end smallexample
+
+@noindent
+has to be rewritten to explicitly use @code{WEOF} when wide characters
+are used.
+
+@smallexample
+@{
+  wint_t c;
+  ...
+  while ((c = getwc (fp)) != WEOF)
+    ...
+@}
+@end smallexample
+
+@pindex wchar.h
+This macro was introduced in Amendment 1 to @w{ISO C90} and is
+defined in @file{wchar.h}.
+@end deftypevr
+
+
+These internal representations present problems when it comes to storing
+and transmitting them. Since a single wide character consists of more
+than one byte they are affected by byte-ordering. I.e., machines with
+different endiannesses would see different values when accessing the same data.
+This also applies for communication protocols which are all byte-based
+and therefore the sender has to decide about splitting the wide
+character into bytes. A last but not least important point is that wide
+characters often require more storage space than a customized byte
+oriented character set.
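The byte-ordering problem described above can be made concrete with a small sketch. It assumes a 32-bit UCS4 value and picks big-endian order for the external form; the helper names @code{ucs4_to_be} and @code{ucs4_from_be} are hypothetical and are not part of the C library.

```c
#include <stdint.h>

/* Hypothetical helper: split one UCS4 value into four bytes, most
   significant byte first, independent of the host byte order.  */
void
ucs4_to_be (uint32_t wc, unsigned char out[4])
{
  out[0] = (unsigned char) (wc >> 24);
  out[1] = (unsigned char) (wc >> 16);
  out[2] = (unsigned char) (wc >> 8);
  out[3] = (unsigned char) wc;
}

/* Hypothetical helper: reassemble the value from the same byte order.  */
uint32_t
ucs4_from_be (const unsigned char in[4])
{
  return ((uint32_t) in[0] << 24) | ((uint32_t) in[1] << 16)
         | ((uint32_t) in[2] << 8) | (uint32_t) in[3];
}
```

Because both sides shift explicitly instead of copying the in-memory object, sender and receiver agree on the bytes regardless of their own endianness. The cost is four bytes per character even for plain ASCII text.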
+
+@cindex multibyte character
+This is why most of the time an external encoding which is different
+from the internal encoding is used if the latter is UCS2 or UCS4. The
+external encoding is byte-based and can be chosen appropriately for the
+environment and for the texts to be handled. There exists a variety of
+different character sets which can be used, too many to be
+handled completely here. We restrict ourselves here to a description of
+the major groups. All of the ASCII-based character sets fulfill one
+requirement: they are ``filesystem safe''. This means that the
+character @code{'/'} is used in the encoding @emph{only} to represent
+itself. Things are a bit different for character sets like EBCDIC but if the
+operating system does not understand EBCDIC directly the parameters to
+system calls have to be converted first anyhow.
+
+@itemize @bullet
+@item
+The simplest character sets are one-byte character sets. There can be
+only up to 256 characters (for @w{8 bit} character sets) which is not
+sufficient to cover all languages but might be sufficient to handle a
+specific text. Another reason to choose this is because of constraints
+from interaction with other programs.
+
+@cindex ISO 2022
+@item
+The @w{ISO 2022} standard defines a mechanism for extended character
+sets where one character @emph{can} be represented by more than one
+byte. This is achieved by associating a state with the text. Embedded
+in the text can be characters which can be used to change the state.
+Each byte in the text might have a different interpretation in each
+state. The state might even influence whether a given byte stands for a
+character on its own or whether it has to be combined with some more
+bytes.
+
+@cindex EUC
+@cindex SJIS
+In most uses of @w{ISO 2022} the defined character sets do not allow
+state changes which cover more than the next character.
This has the
+big advantage that whenever one can identify the beginning of the byte
+sequence of a character one can interpret a text correctly. Examples of
+character sets using this policy are the various EUC character sets
+(used by Sun's operating systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
+or SJIS (Shift JIS, a Japanese encoding).
+
+But there are also character sets using a state which is valid for more
+than one character and has to be changed by another byte sequence.
+Examples for this are ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
+
+@item
+@cindex ISO 6937
+Early attempts to fix 8 bit character sets for other languages using the
+Roman alphabet led to character sets like @w{ISO 6937}. Here bytes
+representing characters like the acute accent do not produce output on
+their own. One has to combine them with other characters. E.g., one writes the
+byte sequence @code{0xc2 0x61} (non-spacing acute accent, followed by
+lower-case `a') to get the ``small a with acute'' character. To get the
+acute accent character on its own one has to write @code{0xc2 0x20} (the
+non-spacing acute followed by a space).
+
+This type of character set is quite frequently used in embedded
+systems such as video text.
+
+@item
+@cindex UTF-8
+Instead of converting the Unicode or @w{ISO 10646} text used internally
+it is often also sufficient to simply use an encoding different than
+UCS2/UCS4. The Unicode and @w{ISO 10646} standards even specify such an
+encoding: UTF-8. This encoding is able to represent all of @w{ISO
+10646} 31 bits in a byte string of length one to six.
+
+@cindex UTF-7
+There were a few other attempts to encode @w{ISO 10646} such as UTF-7
+but UTF-8 is today the only encoding which should be used. In fact,
+UTF-8 will hopefully soon be the only external encoding which has to be
+supported.
It proves to be universally usable and the only disadvantage
+is that it favors Latin scripts very much by making the byte string
+representation of other scripts (Cyrillic, Greek, Asian scripts) longer
+than it would be in a character set specific to these scripts. But
+with methods like the Unicode compression scheme one can overcome these
+problems and the ever growing memory and storage capacities do the rest.
+@end itemize
+
+The question remaining now is: how to select the character set or
+encoding to use. The answer is mostly: you cannot decide about it
+yourself, it is decided by the developers of the system or the majority
+of the users. Since the goal is interoperability one has to use
+whatever the other people one works with use. If there are no
+constraints the selection is based on the requirements the expected
+circle of users will have. I.e., if a project is expected to only be
+used in, say, Russia it is fine to use KOI8-R or a similar character
+set. But if at the same time people from, say, Greece are participating
+one should use a character set which allows all people to collaborate.
+
+General advice here could be: go with the most general character set,
+namely @w{ISO 10646}. Use UTF-8 as the external encoding and problems
+about users not being able to use their own language adequately are a
+thing of the past.
+
+One final comment about the choice of the wide character representation
+is necessary at this point. We have said above that the natural choice
+is using Unicode or @w{ISO 10646}. This is not specified in any
+standard, though. The @w{ISO C} standard does not say anything
+specific about the @code{wchar_t} type. There might be systems where
+the developers decided differently. Therefore one should as much as
+possible avoid making assumptions about the wide character representation
+although GNU systems will always work as described above.
If the
+programmer uses only the functions provided by the C library to handle
+wide character strings there should not be any compatibility problems
+with other systems.
+
+@node Charset Function Overview
+@section Overview about Character Handling Functions
+
+A Unix @w{C library} contains three different sets of functions in two
+families to handle character set conversion. One function family
+is specified in the @w{ISO C} standard and therefore is portable even
+beyond the Unix world.
+
+The most commonly known set of functions, coming from the @w{ISO C89}
+standard, is unfortunately the least useful one. In fact, these
+functions should be avoided whenever possible, especially when
+developing libraries (as opposed to applications).
+
+The second family of functions was introduced in the early Unix standards
+(XPG2) and is still part of the latest and greatest Unix standard:
+@w{Unix 98}. It is also the most powerful and useful set of functions.
+But we will start with the functions defined in Amendment 1 to
+@w{ISO C90}.
+
+@node Restartable multibyte conversion
+@section Restartable Multibyte Conversion Functions
+
+The @w{ISO C} standard defines functions to convert strings from a
+multibyte representation to wide character strings. There are a number
+of peculiarities:
+
+@itemize @bullet
+@item
+The character set assumed for the multibyte encoding is not specified
+as an argument to the functions. Instead the character set specified by
+the @code{LC_CTYPE} category of the current locale is used; see
+@ref{Locale Categories}.
+
+@item
+The functions handling more than one character at a time require NUL
+terminated strings as the argument. I.e., converting blocks of text
+does not work unless one can add a NUL byte at an appropriate place.
+The GNU C library contains some extensions to the standard which allow
+specifying a size but basically they also expect terminated strings.
+@end itemize
+
+Despite these limitations the @w{ISO C} functions can very well be used
+in many contexts. In graphical user interfaces, for instance, it is not
+uncommon to have functions which require text to be displayed in a wide
+character string if it is not simple ASCII. The text itself might come
+from a file with translations and of course the user should decide about
+the current locale which determines the translation and therefore also
+the external encoding used. In such a situation (and many others) the
+functions described here are perfect. If more freedom while performing
+the conversion is necessary take a look at the @code{iconv} functions
+(@pxref{Generic Charset Conversion}).
+
+@menu
+* Selecting the Conversion:: Selecting the conversion and its properties.
+* Keeping the state:: Representing the state of the conversion.
+* Converting a Character:: Converting Single Characters.
+* Converting Strings:: Converting Multibyte and Wide Character
+                                Strings.
+* Multibyte Conversion Example:: A Complete Multibyte Conversion Example.
+@end menu
+
+@node Selecting the Conversion
+@subsection Selecting the conversion and its properties
+
+We already said above that the currently selected locale for the
+@code{LC_CTYPE} category decides about the conversion which is performed
+by the functions we are about to describe. Each locale uses its own
+character set (given as an argument to @code{localedef}) and this is the
+one assumed as the external multibyte encoding. The wide character
+set is always UCS4. So we can see here already where the
+limitations of these conversion functions are.
+
+A characteristic of each multibyte character set is the maximum number
+of bytes which can be necessary to represent one character. This
+information is quite important when writing code which uses the
+conversion functions. The examples below will show this.
+The @w{ISO C} standard defines two macros which provide this information.
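As a minimal sketch of how the two macros, @code{MB_LEN_MAX} from @file{limits.h} and @code{MB_CUR_MAX} from @file{stdlib.h}, are typically combined: the compile-time constant sizes a static buffer, while the locale-dependent value bounds what a conversion can actually produce. The helper name @code{encode_one} is hypothetical; @code{wcrtomb} is the standard wide-character-to-multibyte function and is used here only for illustration.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* MB_CUR_MAX */
#include <string.h>   /* memset */
#include <wchar.h>    /* wcrtomb, mbstate_t */

/* Hypothetical helper: convert the wide character WC to its multibyte
   form in the current locale, starting from the initial shift state.
   OUT is sized with the compile-time bound MB_LEN_MAX.  Returns the
   number of bytes written, or (size_t) -1 if WC is not representable.  */
size_t
encode_one (wchar_t wc, char out[MB_LEN_MAX])
{
  mbstate_t state;
  size_t n;

  memset (&state, '\0', sizeof (state));
  n = wcrtomb (out, wc, &state);
  /* Starting from the initial state, a successful conversion never
     needs more than MB_CUR_MAX bytes.  */
  assert (n == (size_t) -1 || n <= (size_t) MB_CUR_MAX);
  return n;
}
```

A caller declares @code{char buf[MB_LEN_MAX];} and passes it to the helper. No dynamic allocation is needed even though @code{MB_CUR_MAX} is only known at run time.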
+
+
+@comment limits.h
+@comment ISO
+@deftypevr Macro int MB_LEN_MAX
+This macro specifies the maximum number of bytes in the multibyte
+sequence for a single character in any of the supported locales. It is
+a compile-time constant and it is defined in @file{limits.h}.
+@pindex limits.h
+@end deftypevr
+
+@comment stdlib.h
+@comment ISO
+@deftypevr Macro int MB_CUR_MAX
+@code{MB_CUR_MAX} expands into a positive integer expression that is the
+maximum number of bytes in a multibyte character in the current locale.
+The value is never greater than @code{MB_LEN_MAX}. Unlike
+@code{MB_LEN_MAX} this macro need not be a compile-time constant and in
+fact, in the GNU C library it is not.
+
+@pindex stdlib.h
+@code{MB_CUR_MAX} is defined in @file{stdlib.h}.
+@end deftypevr
+
+Two different macros are necessary since strict @w{ISO C89} compilers
+do not allow variable length array definitions but still it is desirable
+to avoid dynamic allocation. This incomplete piece of code shows the
+problem:
+
+@smallexample
+@{
+  char buf[MB_LEN_MAX];
+  ssize_t len = 0;
+
+  while (! feof (fp))
+    @{
+      len += fread (&buf[len], 1, MB_CUR_MAX - len, fp);
+      /* @r{... process} buf@r{, consuming} used @r{bytes ...} */
+      len -= used;
+    @}
+@}
+@end smallexample
+
+The code in the inner loop is expected always to have enough bytes in
+the array @var{buf} to convert one multibyte character. The array
+@var{buf} has to be sized statically since many compilers do not allow a
+variable size. The @code{fread} call makes sure that always
+@code{MB_CUR_MAX} bytes are available in @var{buf}. Note that it is no
+problem if @code{MB_CUR_MAX} is not a compile-time constant.
+
+
+@node Keeping the state
+@subsection Representing the state of the conversion
+
+@cindex stateful
+In the introduction of this chapter it was said that certain character
+sets use a @dfn{stateful} encoding. I.e., the encoded values depend in
+some way on the previous bytes in the text.
+
+Since the conversion functions allow converting a text in more than one
+step we must have a way to pass this information from one call of the
+functions to another.
+
+@comment wchar.h
+@comment ISO
+@deftp {Data type} mbstate_t
+@cindex shift state
+A variable of type @code{mbstate_t} can contain all the information
+about the @dfn{shift state} needed from one call to a conversion
+function to another.
+
+@pindex wchar.h
+This type is defined in @file{wchar.h}. It was introduced in
+Amendment 1 to @w{ISO C90}.
+@end deftp
+
+To use objects of this type the programmer has to define such objects
+(normally as local variables on the stack) and pass a pointer to the
+object to the conversion functions. This way the conversion function
+can update the object if the current multibyte character set is
+stateful.
+
+There is no specific function or initializer to put the state object in
+any specific state. The rules are that the object should always
+represent the initial state before the first use and this is achieved by
+clearing the whole variable with code such as the following:
+
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{from now on @var{state} can be used.} */
+  ...
+@}
+@end smallexample
+
+When using the conversion functions to generate output it is often
+necessary to test whether the current state corresponds to the initial
+state. This is necessary, for example, to decide whether or not to emit
+escape sequences to set the state to the initial state at certain
+sequence points. Communication protocols often require this.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int mbsinit (const mbstate_t *@var{ps})
+This function determines whether the state object pointed to by @var{ps}
+is in the initial state or not. If @var{ps} is a null pointer or the
+object is in the initial state the return value is nonzero. Otherwise
+it is zero.
+
+@pindex wchar.h
+This function was introduced in Amendment 1 to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Code using this function often looks similar to this:
+
+@smallexample
+@{
+  mbstate_t state;
+  memset (&state, '\0', sizeof (state));
+  /* @r{Use @var{state}.} */
+  ...
+  if (! mbsinit (&state))
+    @{
+      /* @r{Emit code to return to initial state.} */
+      fputs ("@r{whatever needed}", fp);
+    @}
+  ...
+@}
+@end smallexample
+
+@node Converting a Character
+@subsection Converting Single Characters
+
+The most fundamental of the conversion functions are those dealing with
+single characters. Please note that this does not always mean single
+bytes. But since there is very often a subset of the multibyte
+character set which consists of single byte sequences there are
+functions to help with converting bytes. One very important and often
+applicable scenario is where ASCII is a subset of the multibyte
+character set. I.e., all ASCII characters stand for themselves and all
+other characters have at least a first byte which is beyond the range
+@math{0} to @math{127}.
+
+@comment wchar.h
+@comment ISO
+@deftypefun wint_t btowc (int @var{c})
+The @code{btowc} function (``byte to wide character'') converts a valid
+single byte character in the initial shift state into the wide character
+equivalent using the conversion rules from the currently selected locale
+of the @code{LC_CTYPE} category.
+
+If @code{(unsigned char) @var{c}} is not a valid single byte multibyte
+character or if @var{c} is @code{EOF} the function returns @code{WEOF}.
+
+Please note the restriction of @var{c} being tested for validity only in
+the initial shift state. There is no @code{mbstate_t} object used from
+which the state information is taken and the function also does not use
+any static state.
+
+@pindex wchar.h
+This function was introduced in Amendment 1 to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Despite the limitation that the single byte value always is interpreted
+in the initial state this function is actually useful most of the time.
+Most character sets are either entirely single-byte character sets or they
+are extensions to ASCII. But then it is possible to write code like this
+(not that this specific example is useful):
+
+@smallexample
+wchar_t *
+itow (unsigned long int val)
+@{
+  static wchar_t buf[30];
+  wchar_t *wcp = &buf[29];
+  *wcp = L'\0';
+  while (val != 0)
+    @{
+      *--wcp = btowc ('0' + val % 10);
+      val /= 10;
+    @}
+  if (wcp == &buf[29])
+    *--wcp = btowc ('0');
+  return wcp;
+@}
+@end smallexample
+
+The question is why it is necessary to use such a complicated
+implementation and not simply cast @code{'0' + val % 10} to a wide character. The answer
+is that there is no guarantee that the compiler knows about the wide
+character set used at runtime. Even if the wide character equivalent of
+a given single-byte character is, on one system, simply the result of casting the
+single-byte character to @code{wchar_t}, there is no guarantee that this
+is the case everywhere.
+
+There also is a function for the conversion in the other direction.
+
+@comment wchar.h
+@comment ISO
+@deftypefun int wctob (wint_t @var{c})
+The @code{wctob} function (``wide character to byte'') takes as the
+parameter a valid wide character. If the multibyte representation for
+this character in the initial state is exactly one byte long the return
+value of this function is this character. Otherwise the return value is
+@code{EOF}.
+
+@pindex wchar.h
+This function was introduced in Amendment 1 to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+There are more general functions to convert a single character from
+multibyte representation to wide characters and vice versa. These
+functions pose no limit on the length of the multibyte representation
+and they also do not require it to be in the initial state.
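Before moving on to those general functions, the two single-byte functions @code{btowc} and @code{wctob} can be exercised together. The following is only a sketch: the helper name @code{byte_round_trips} is hypothetical, and the result depends on the locale selected for @code{LC_CTYPE} (the default @code{"C"} locale if @code{setlocale} was never called).

```c
#include <stdio.h>   /* EOF */
#include <wchar.h>   /* btowc, wctob, WEOF */

/* Hypothetical helper: return 1 if the byte C is a complete character
   on its own in the initial shift state and converts back to the same
   byte, 0 otherwise.  */
int
byte_round_trips (unsigned char c)
{
  wint_t wc = btowc (c);

  if (wc == WEOF)
    /* C is not a valid single-byte character in this locale.  */
    return 0;
  return wctob (wc) == (int) c;
}
```

For a character set that extends ASCII, every byte in the range 0 to 127 is expected to round-trip. A lead byte of a longer sequence makes @code{btowc} return @code{WEOF}, so the helper reports 0 for it.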
+
+@comment wchar.h
+@comment ISO
+@deftypefun size_t mbrtowc (wchar_t *restrict @var{pwc}, const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps})
+@cindex stateful
+The @code{mbrtowc} function (``multibyte restartable to wide
+character'') converts the next multibyte character in the string pointed
+to by @var{s} into a wide character and stores it in the wide character
+string pointed to by @var{pwc}. The conversion is performed according
+to the locale currently selected for the @code{LC_CTYPE} category. If
+the character set for the locale is stateful the multibyte string is
+interpreted in the state represented by the object pointed to by
+@var{ps}. If @var{ps} is a null pointer a static, internal state
+variable used only by the @code{mbrtowc} function is used.
+
+If the next multibyte character corresponds to the NUL wide character
+the return value of the function is @math{0} and the state object is
+afterwards in the initial state. If the next @var{n} or fewer bytes
+form a correct multibyte character the return value is the number of
+bytes starting from @var{s} which form the multibyte character. The
+conversion state is updated according to the bytes consumed in the
+conversion. In both cases the wide character (either the @code{L'\0'}
+or the one found in the conversion) is stored in the string pointed to
+by @var{pwc} if and only if @var{pwc} is not null.
+
+If the first @var{n} bytes of the multibyte string possibly form a valid
+multibyte character but there are more than @var{n} bytes needed to
+complete it the return value of the function is @code{(size_t) -2} and
+no value is stored. Please note that this can happen even if @var{n}
+has a value greater than or equal to @code{MB_CUR_MAX} since the input might
+contain redundant shift sequences.
+
+If the first @var{n} bytes of the multibyte string cannot possibly
+form a valid multibyte character also no value is stored, the global
+variable @code{errno} is set to the value @code{EILSEQ} and the function returns
+@code{(size_t) -1}. The conversion state is afterwards undefined.
+
+@pindex wchar.h
+This function was introduced in Amendment 1 to @w{ISO C90} and
+is declared in @file{wchar.h}.
+@end deftypefun
+
+Using this function is straightforward. A function which copies a
+multibyte string into a wide character string while at the same time
+converting all lowercase characters into uppercase could look like this
+(this is not the final version, just an example; it has no error
+checking and sometimes leaks memory):
+
+@smallexample
+wchar_t *
+mbstouwcs (const char *s)
+@{
+  /* @r{Count the terminating NUL byte so that @code{mbrtowc} can see it.} */
+  size_t len = strlen (s) + 1;
+  wchar_t *result = malloc (len * sizeof (wchar_t));
+  wchar_t *wcp = result;
+  wchar_t tmp[1];
+  mbstate_t state;
+  size_t nbytes;
+  memset (&state, '\0', sizeof (state));
+  while ((nbytes = mbrtowc (tmp, s, len, &state)) > 0)
+    @{
+      if (nbytes >= (size_t) -2)
+        /* Invalid input string. */
+        return NULL;
+      *wcp++ = towupper (tmp[0]);
+      len -= nbytes;
+      s += nbytes;
+    @}
+  *wcp = L'\0';
+  return result;
+@}
+@end smallexample
+
+The use of @code{mbrtowc} should be clear. A single wide character is
+stored in @code{@var{tmp}[0]} and the number of consumed bytes is stored
+in the variable @var{nbytes}. In case the conversion was successful
+the uppercase variant of the wide character is stored in the
+@var{result} array and the pointer to the input string and the number of
+available bytes are adjusted.
+
+The only non-obvious thing about the function might be the way memory is
+allocated for the result. The above code uses the fact that there can
+never be more wide characters in the converted results than there are
+bytes in the multibyte input string.
This method yields a +pessimistic guess about the size of the result and if many wide +character strings have to be constructed this way or the strings are +long, the extra memory required to store the wide character strings +might be significant. It would of course be possible to resize the +allocated memory block to the correct size before returning it. A +better solution might be to allocate just the right amount of space for +the result right away. Unfortunately there is no function to compute +the length of the wide character string directly from the multibyte +string. But there is a function which does part of the work. + +@comment wchar.h +@comment ISO +@deftypefun size_t mbrlen (const char *restrict @var{s}, size_t @var{n}, mbstate_t *restrict @var{ps}) +The @code{mbrlen} function (``multibyte restartable length'') computes +the number of bytes, at most @var{n}, starting at @var{s} which form the +next valid and complete multibyte character. + +If the next multibyte character corresponds to the NUL wide character, +the return value is @math{0}. If the next @var{n} bytes form a valid +multibyte character, the number of bytes belonging to this multibyte +character byte sequence is returned. + +If the first @var{n} bytes possibly form a valid multibyte +character but it is incomplete, the return value is @code{(size_t) -2}. +Otherwise the multibyte character sequence is invalid and the return +value is @code{(size_t) -1}. + +The multibyte sequence is interpreted in the state represented by the +object pointed to by @var{ps}. If @var{ps} is a null pointer, a state +object local to @code{mbrlen} is used. + +@pindex wchar.h +This function was introduced in the second amendment to @w{ISO C89} and +is declared in @file{wchar.h}. +@end deftypefun + +The attentive reader now will of course note that @code{mbrlen} can be +implemented as + +@smallexample +mbrtowc (NULL, s, n, ps != NULL ?
ps : &internal) +@end smallexample + +This is true and in fact is mentioned in the official specification. +Now, how can this function be used to determine the length of the wide +character string created from a multibyte character string? It is not +directly usable but we can define a function @code{mbslen} using it: + +@smallexample +size_t +mbslen (const char *s) +@{ + mbstate_t state; + size_t result = 0; + size_t nbytes; + memset (&state, '\0', sizeof (state)); + while ((nbytes = mbrlen (s, MB_LEN_MAX, &state)) > 0) + @{ + if (nbytes >= (size_t) -2) + /* @r{Something is wrong.} */ + return (size_t) -1; + s += nbytes; + ++result; + @} + return result; +@} +@end smallexample + +This function simply calls @code{mbrlen} for each multibyte character +in the string and counts the number of function calls. Please note that +we here use @code{MB_LEN_MAX} as the size argument in the @code{mbrlen} +call. This is OK since a) this value is at least as large as the length of the +longest multibyte character sequence and b) we know that the +string @var{s} ends with a NUL byte which cannot be part of any other +multibyte character sequence but the one representing the NUL wide +character. Therefore the @code{mbrlen} function will never read invalid +memory. + +Now that this function is available (just to make this clear, this +function is @emph{not} part of the GNU C library) we can compute the +number of bytes required to store the wide character string converted from the multibyte +character string @var{s} using + +@smallexample +wcs_bytes = (mbslen (s) + 1) * sizeof (wchar_t); +@end smallexample + +Please note that the @code{mbslen} function is quite inefficient. An +implementation of @code{mbstouwcs} using @code{mbslen} would +have to perform the conversion of the multibyte character input string +twice and this conversion might be quite expensive. So it is necessary +to think about the consequences of using the easier but imprecise method +before doing the work twice.
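The resizing alternative mentioned earlier (allocate pessimistically, then shrink the block before returning it) avoids the second conversion pass. The following is only a sketch of that idea; the name @code{mbstouwcs_checked} is hypothetical and not part of the GNU C library, and it assumes that returning @code{NULL} is an acceptable way to report both allocation failures and invalid input.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical variant of mbstouwcs: convert the multibyte string S
   into a newly allocated uppercase wide character string.  Returns
   NULL on allocation failure or invalid input.  */
wchar_t *
mbstouwcs_checked (const char *s)
{
  size_t len;
  wchar_t *result;
  wchar_t *wcp;
  wchar_t *shrunk;
  mbstate_t state;
  size_t nbytes;

  assert (s != NULL);
  /* Count the terminating NUL byte so that it is converted, too.  */
  len = strlen (s) + 1;
  /* Pessimistic guess: at most one wide character per input byte.  */
  result = malloc (len * sizeof (wchar_t));
  if (result == NULL)
    return NULL;
  wcp = result;
  memset (&state, '\0', sizeof (state));
  while ((nbytes = mbrtowc (wcp, s, len, &state)) > 0)
    {
      if (nbytes >= (size_t) -2)
        {
          /* Invalid or incomplete input; do not leak the buffer.  */
          free (result);
          return NULL;
        }
      *wcp = towupper (*wcp);
      ++wcp;
      len -= nbytes;
      s += nbytes;
    }
  /* mbrtowc returned 0, so the NUL wide character is being stored.  */
  *wcp++ = L'\0';
  /* Give back the unused part of the pessimistic allocation.  */
  shrunk = realloc (result, (size_t) (wcp - result) * sizeof (wchar_t));
  return shrunk != NULL ? shrunk : result;
}
```

The input is converted only once; the price is a possibly large temporary allocation and the call to @code{realloc}. The caller releases the result with @code{free}.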
+ +@comment wchar.h +@comment ISO +@deftypefun size_t wcrtomb (char *restrict @var{s}, wchar_t @var{wc}, mbstate_t *restrict @var{ps}) +The @code{wcrtomb} function (``wide character restartable to +multibyte'') converts a single wide character into a multibyte string +corresponding to that wide character. + +If @var{s} is a null pointer, the function resets the state stored in the +object pointed to by @var{ps} to the initial state. This can also be +achieved by a call like this: + +@smallexample +wcrtomb (temp_buf, L'\0', ps) +@end smallexample + +@noindent +since when @var{s} is a null pointer @code{wcrtomb} performs as if it +writes into an internal buffer which is guaranteed to be large enough. + +If @var{wc} is the NUL wide character, @code{wcrtomb} emits, if +necessary, a shift sequence to get the state @var{ps} into the initial +state, followed by a single NUL byte; both are stored in the string @var{s}. + +Otherwise a byte sequence (possibly including shift sequences) is +written into the string @var{s}. This of course only happens if +@var{wc} is a valid wide character, i.e., it has a multibyte +representation in the character set selected by the locale of the +@code{LC_CTYPE} category. If @var{wc} is not a valid wide character, +nothing is stored in the string @var{s}, @code{errno} is set to +@code{EILSEQ}, the conversion state in @var{ps} is undefined and the +return value is @code{(size_t) -1}. + +If no error occurred, the function returns the number of bytes stored in +the string @var{s}. This includes all bytes representing shift +sequences. + +One word about the interface of the function: there is no parameter +specifying the length of the array @var{s}. Instead the function +assumes that there are at least @code{MB_CUR_MAX} bytes available since +this is the maximum length of any byte sequence representing a single +character. So the caller has to make sure that there is enough space +available, otherwise buffer overruns can occur.
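One way to meet the space requirement just described is to convert into a scratch array of @code{MB_LEN_MAX} bytes, a compile-time constant which is never smaller than @code{MB_CUR_MAX}, and to copy the bytes only if they fit. The helper below is a sketch under that assumption; the name @code{wc_to_bytes} is hypothetical and the helper always starts from the initial shift state.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: store the multibyte form of WC in BUF, which
   has room for BUFLEN bytes.  The conversion starts in the initial
   shift state.  Returns the number of bytes written, or (size_t) -1
   if WC is invalid or the bytes do not fit.  */
size_t
wc_to_bytes (char *buf, size_t buflen, wchar_t wc)
{
  char scratch[MB_LEN_MAX];
  mbstate_t state;
  size_t nbytes;

  assert (buf != NULL);
  memset (&state, '\0', sizeof (state));
  /* SCRATCH is always large enough, so this call cannot overrun.  */
  nbytes = wcrtomb (scratch, wc, &state);
  if (nbytes == (size_t) -1 || nbytes > buflen)
    return (size_t) -1;
  memcpy (buf, scratch, nbytes);
  return nbytes;
}
```

The extra copy costs a little time but moves the length check in front of the write into the caller's array.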
+ +@pindex wchar.h +This function was introduced in the second amendment to @w{ISO C} and is +declared in @file{wchar.h}. +@end deftypefun + +Using this function is as easy as using @code{mbrtowc}. The following +example appends a wide character string to a multibyte character string. +Again, the code is not really useful, it is simply here to demonstrate +the use and some problems. + +@smallexample +char * +mbscatwc (char *s, size_t len, const wchar_t *ws) +@{ + mbstate_t state; + char *wp = strchr (s, '\0'); + len -= wp - s; + memset (&state, '\0', sizeof (state)); + do + @{ + size_t nbytes; + if (len < MB_CUR_MAX) + @{ + /* @r{We cannot guarantee that the next} + @r{character fits into the buffer, so} + @r{return an error.} */ + errno = E2BIG; + return NULL; + @} + nbytes = wcrtomb (wp, *ws, &state); + if (nbytes == (size_t) -1) + /* @r{Error in the conversion.} */ + return NULL; + len -= nbytes; + wp += nbytes; + @} + while (*ws++ != L'\0'); + return s; +@} +@end smallexample + +First the function has to find the end of the string currently in the +array @var{s}. The @code{strchr} call does this very efficiently since a +requirement for multibyte character representations is that the NUL byte +is never used except to represent itself (and in this context, the end +of the string). + +After initializing the state object the loop is entered where the first +task is to make sure there is enough room in the array @var{s}. We +abort if there are not at least @code{MB_CUR_MAX} bytes available. This +is not always optimal but we have no other choice. We might have fewer +than @code{MB_CUR_MAX} bytes available but the next multibyte character +might also be only one byte long. At the time the @code{wcrtomb} call +returns it is too late to decide whether the buffer was large enough or +not. If this solution is really unsuitable there is a very slow but +more accurate solution. + +@smallexample + ...
+ if (len < MB_CUR_MAX) + @{ + char scratch[MB_LEN_MAX]; + mbstate_t temp_state; + memcpy (&temp_state, &state, sizeof (state)); + if (wcrtomb (scratch, *ws, &temp_state) > len) + @{ + /* @r{We cannot guarantee that the next} + @r{character fits into the buffer, so} + @r{return an error.} */ + errno = E2BIG; + return NULL; + @} + @} + ... +@end smallexample + +Here we perform the conversion into a scratch buffer which cannot +overflow so that we are afterwards in the position to make an exact +decision about the buffer size. Please note that the destination of the +new @code{wcrtomb} call must not be a null pointer: as described above, +@code{wcrtomb} then ignores the wide character and behaves as if +@code{L'\0'} were converted, so the returned length would be wrong. +The most unusual thing about this piece of code certainly is the +duplication of the conversion state object. But think about it: if a +change of the state is necessary to emit the next multibyte character we +want to have the same shift state change performed in the real +conversion. Therefore we have to preserve the initial shift state +information. + +There are certainly many more and even better solutions to this problem. +This example is only meant for educational purposes. + +@node Converting Strings +@subsection Converting Multibyte and Wide Character Strings + +The functions described in the previous section only convert a single +character at a time. Most operations to be performed in real-world +programs involve strings and therefore the @w{ISO C} standard also +defines conversions on entire strings. The defined set of functions is +quite limited, though. Therefore the GNU C library contains a few +extensions which are necessary in some important situations.
+ +@comment wchar.h +@comment ISO +@deftypefun size_t mbsrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) +The @code{mbsrtowcs} function (``multibyte string restartable to wide +character string'') converts a NUL terminated multibyte character +string at @code{*@var{src}} into an equivalent wide character string, +including the NUL wide character at the end. The conversion is started +using the state information from the object pointed to by @var{ps} or +from an internal object of @code{mbsrtowcs} if @var{ps} is a null +pointer. Before returning, the state object is updated to match the state after the +last converted character. The state is the initial state if the +terminating NUL byte is reached and converted. + +If @var{dst} is not a null pointer the result is stored in the array +pointed to by @var{dst}, otherwise the conversion result is not +available since it is stored in an internal buffer. + +If @var{len} wide characters are stored in the array @var{dst} before +reaching the end of the input string the conversion stops and @var{len} +is returned. If @var{dst} is a null pointer @var{len} is never checked. + +Another reason for a premature return from the function call is if the +input string contains an invalid multibyte sequence. In this case the +global variable @code{errno} is set to @code{EILSEQ} and the function +returns @code{(size_t) -1}. + +@c XXX The ISO C9x draft seems to have a problem here. It says that PS +@c is not updated if DST is NULL. This is not said straight forward and +@c none of the other functions is described like this. It would make sense +@c to define the function this way but I don't think it is meant like this. + +In all other cases the function returns the number of wide characters +converted during this call.
If @var{dst} is not null @code{mbsrtowcs} +stores in the pointer pointed to by @var{src} a null pointer (if the NUL +byte in the input string was reached) or the address of the byte +following the last converted multibyte character. + +@pindex wchar.h +This function was introduced in the second amendment to @w{ISO C} and is +declared in @file{wchar.h}. +@end deftypefun + +The definition of this function has one limitation which has to be +understood. The requirement that the input at @code{*@var{src}} has to be a NUL terminated +string causes problems if one wants to convert buffers with text. A +buffer is normally not a collection of NUL terminated strings but instead a +continuous collection of lines, separated by newline characters. Now +assume a function to convert one line from a buffer is needed. Since +the line is not NUL terminated the source pointer cannot directly point +into the unmodified text buffer. This means that either one inserts the NUL +byte at the appropriate place for the time of the @code{mbsrtowcs} +function call (which is not doable for a read-only buffer or in a +multi-threaded application) or one copies the line into an extra buffer +where it can be terminated by a NUL byte. Note that it is not in +general possible to limit the number of characters to convert by setting +the parameter @var{len} to any specific value. Since it is not known +how many bytes each multibyte character sequence is in length one +can only guess. + +@cindex stateful +There is still a problem with the method of NUL-terminating a line right +after the newline character which could lead to very strange results. +As said in the description of the @code{mbsrtowcs} function above, the +conversion state is guaranteed to be in the initial shift state after +processing the NUL byte at the end of the input string. But this NUL +byte is not really part of the text.
I.e., the conversion state after +the newline in the original text could be something different than the +initial shift state and therefore the first character of the next line +is encoded using this state. But the state in question is never +accessible to the user since the conversion stops after the NUL byte. +Fortunately most stateful character sets in use today require that the +shift state after a newline is the initial state but this is no +guarantee. Therefore simply NUL terminating a piece of a running text +is not always the adequate solution. + +The generic conversion +@comment XXX reference to iconv +interface does not have this limitation (it simply works on buffers, not +strings) but there is another way. The GNU C library contains a set of +functions which take an additional parameter specifying the maximal number of +bytes which are consumed from the input string. This way the problem of +the example above could be solved by determining the line length and +passing this length to the function. + +@comment wchar.h +@comment ISO +@deftypefun size_t wcsrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{len}, mbstate_t *restrict @var{ps}) +The @code{wcsrtombs} function (``wide character string restartable to +multibyte string'') converts the NUL terminated wide character string at +@code{*@var{src}} into an equivalent multibyte character string and +stores the result in the array pointed to by @var{dst}. The NUL wide +character is also converted. The conversion starts in the state +described in the object pointed to by @var{ps} or by a state object +local to @code{wcsrtombs} in case @var{ps} is a null pointer. If +@var{dst} is a null pointer the conversion is performed as usual but the +result is not available. If all characters of the input string were +successfully converted and if @var{dst} is not a null pointer the +pointer pointed to by @var{src} gets assigned a null pointer.
+ +If one of the wide characters in the input string has no valid multibyte +character equivalent the conversion stops early, sets the global +variable @code{errno} to @code{EILSEQ}, and returns @code{(size_t) -1}. + +Another reason for a premature stop is if @var{dst} is not a null +pointer and the next converted character would require more than +@var{len} bytes in total to be stored in the array @var{dst}. In this case (and if +@var{dst} is not a null pointer) the pointer pointed to by @var{src} is +assigned a value pointing to the wide character right after the last one +successfully converted. + +Except in the case of an encoding error the return value of the function +is the number of bytes in all the multibyte character sequences stored +in @var{dst}. Before returning the state in the object pointed to by +@var{ps} (or the internal object in case @var{ps} is a null pointer) is +updated to reflect the state after the last conversion. The state is +the initial shift state in case the terminating NUL wide character was +converted. + +@pindex wchar.h +This function was introduced in the second amendment to @w{ISO C} and is +declared in @file{wchar.h}. +@end deftypefun + +The restriction mentioned above for the @code{mbsrtowcs} function also applies +here. There is no possibility to directly control the number of +input characters. One has to place the NUL wide character at the +correct place or control the consumed input indirectly via the available +output array size (the @var{len} parameter). + +@comment wchar.h +@comment GNU +@deftypefun size_t mbsnrtowcs (wchar_t *restrict @var{dst}, const char **restrict @var{src}, size_t @var{nmc}, size_t @var{len}, mbstate_t *restrict @var{ps}) +The @code{mbsnrtowcs} function is very similar to the @code{mbsrtowcs} +function. All the parameters are the same except for @var{nmc} which is +new. The return value is the same as for @code{mbsrtowcs}.
+ +This new parameter specifies how many bytes at most can be used from the +multibyte character string. I.e., the multibyte character string +@code{*@var{src}} need not be NUL terminated. But if a NUL byte is +found within the first @var{nmc} bytes of the string the conversion +stops here. + +This function is a GNU extension. It is meant to work around the +problems mentioned above. Now it is possible to convert a buffer with +multibyte character text piece for piece without having to care about +inserting NUL bytes and the effect of NUL bytes on the conversion state. +@end deftypefun + +A function to convert a multibyte string into a wide character string +and display it could be written like this (this is not a really useful +example): + +@smallexample +void +showmbs (const char *src, FILE *fp) +@{ + mbstate_t state; + int cnt = 0; + memset (&state, '\0', sizeof (state)); + while (1) + @{ + wchar_t linebuf[100]; + const char *endp = strchr (src, '\n'); + size_t n; + + /* @r{Exit if there is no more line.} */ + if (endp == NULL) + break; + + n = mbsnrtowcs (linebuf, &src, endp - src, 99, &state); + /* @r{Exit on an invalid multibyte sequence.} */ + if (n == (size_t) -1) + break; + linebuf[n] = L'\0'; + fprintf (fp, "line %d: \"%ls\"\n", ++cnt, linebuf); + /* @r{Skip the newline once the whole line is converted.} */ + if (src == endp) + ++src; + @} +@} +@end smallexample + +There is no more problem with the state after a call to +@code{mbsnrtowcs}. Since we don't insert characters in the strings +which were not in there right from the beginning and we use @var{state} +only for the conversion of the given buffer there is no problem with +mixing the state up. + +@comment wchar.h +@comment GNU +@deftypefun size_t wcsnrtombs (char *restrict @var{dst}, const wchar_t **restrict @var{src}, size_t @var{nwc}, size_t @var{len}, mbstate_t *restrict @var{ps}) +The @code{wcsnrtombs} function implements the conversion from wide +character strings to multibyte character strings. It is similar to +@code{wcsrtombs} but it takes, just like @code{mbsnrtowcs}, an extra +parameter which specifies the length of the input string.
+ +No more than @var{nwc} wide characters from the input string +@code{*@var{src}} are converted. If the input string contains a NUL +wide character in the first @var{nwc} characters the conversion stops at +this place. + +This function is a GNU extension and just like @code{mbsnrtowcs} it +helps in situations where no NUL terminated input strings are available. +@end deftypefun + + +@node Multibyte Conversion Example +@subsection A Complete Multibyte Conversion Example + +The example programs given in the last sections are only brief and do +not contain all the error checking etc. Therefore here comes a complete +and documented example. It features the @code{mbrtowc} function but it +should be easy to derive versions using the other functions. + +@smallexample +int +file_mbsrtowcs (int input, int output) +@{ + /* @r{Note the use of @code{MB_LEN_MAX}.} + @r{@code{MB_CUR_MAX} cannot portably be used here.} */ + char buffer[BUFSIZ + MB_LEN_MAX]; + mbstate_t state; + int filled = 0; + int eof = 0; + + /* @r{Initialize the state.} */ + memset (&state, '\0', sizeof (state)); + + while (!eof) + @{ + ssize_t nread; + ssize_t nwrite; + char *inp = buffer; + /* @r{Every byte in @code{buffer} may become one wide character.} */ + wchar_t outbuf[BUFSIZ + MB_LEN_MAX]; + wchar_t *outp = outbuf; + + /* @r{Fill up the buffer from the input file.} */ + nread = read (input, buffer + filled, BUFSIZ); + if (nread < 0) + @{ + perror ("read"); + return 0; + @} + /* @r{If we reach end of file, make a note to read no more.} */ + if (nread == 0) + eof = 1; + + /* @r{@code{filled} is now the number of bytes in @code{buffer}.} */ + filled += nread; + + /* @r{Convert those bytes to wide characters--as many as we can.} */ + while (filled > 0) + @{ + size_t thislen = mbrtowc (outp, inp, filled, &state); + /* @r{Stop converting at invalid character.} */ + if (thislen == (size_t) -1) + break; + /* @r{An incomplete character: its bytes are now recorded} + @r{in @code{state}, so they need not be kept.} */ + if (thislen == (size_t) -2) + @{ + filled = 0; + break; + @} + /* @r{We want to handle embedded NUL bytes} + @r{but the return value is 0.
Correct this.} */ + if (thislen == 0) + thislen = 1; + /* @r{Advance past this character.} */ + inp += thislen; + filled -= thislen; + ++outp; + @} + + /* @r{Write the wide characters we just made.} */ + nwrite = write (output, outbuf, + (outp - outbuf) * sizeof (wchar_t)); + if (nwrite < 0) + @{ + perror ("write"); + return 0; + @} + + /* @r{See if we have a @emph{real} invalid character.} */ + if ((eof && filled > 0) || filled >= MB_CUR_MAX) + @{ + error (0, 0, "invalid multibyte character"); + return 0; + @} + + /* @r{If any characters must be carried forward,} + @r{put them at the beginning of @code{buffer}.} */ + if (filled > 0) + memmove (buffer, inp, filled); + @} + + return 1; +@} +@end smallexample + + +@node Non-reentrant Conversion +@section Non-reentrant Conversion Function + +The functions described in the last section are defined in the second +amendment to @w{ISO C89}. But the original @w{ISO C89} standard also +contained functions for character set conversion. The reason that they +are not described in the first place is that they are almost entirely +useless. + +The problem is that all the functions for conversion defined in @w{ISO +C89} use a local state. This does not only mean that multiple +conversions at the same time (not only when using threads) cannot be +done. It also means that you cannot first convert single characters and +then strings since you cannot tell the conversion functions which state to +use. + +These functions are therefore usable only in a very limited set of +situations. One must complete converting the entire string before +starting a new one and each string/text must be converted with the same +function (there is no problem with the library itself; it is guaranteed +that no library function changes the state of any of these functions). +For these reasons it is @emph{highly} recommended to use the functions +from the last section. + +@menu +* Non-reentrant Character Conversion:: Non-reentrant Conversion of Single + Characters.
+* Non-reentrant String Conversion:: Non-reentrant Conversion of Strings. +* Shift State:: States in Non-reentrant Functions. +@end menu + +@node Non-reentrant Character Conversion +@subsection Non-reentrant Conversion of Single Characters + +@comment stdlib.h +@comment ISO +@deftypefun int mbtowc (wchar_t *@var{result}, const char *@var{string}, size_t @var{size}) +The @code{mbtowc} (``multibyte to wide character'') function when called +with non-null @var{string} converts the first multibyte character +beginning at @var{string} to its corresponding wide character code. It +stores the result in @code{*@var{result}}. + +@code{mbtowc} never examines more than @var{size} bytes. (The idea is +to supply for @var{size} the number of bytes of data you have in hand.) + +@code{mbtowc} with non-null @var{string} distinguishes three +possibilities: the first @var{size} bytes at @var{string} start with +a valid multibyte character, they start with an invalid byte sequence or +just part of a character, or @var{string} points to an empty string (a +null character). + +For a valid multibyte character, @code{mbtowc} converts it to a wide +character and stores that in @code{*@var{result}}, and returns the +number of bytes in that character (always at least @code{1}, and never +more than @var{size}). + +For an invalid byte sequence, @code{mbtowc} returns @code{-1}. For an +empty string, it returns @code{0}, also storing @code{0} in +@code{*@var{result}}. + +If the multibyte character code uses shift characters, then +@code{mbtowc} maintains and updates a shift state as it scans. If you +call @code{mbtowc} with a null pointer for @var{string}, that +initializes the shift state to its standard initial value. It also +returns nonzero if the multibyte character code in use actually has a +shift state. @xref{Shift State}.
+@end deftypefun + +@comment stdlib.h +@comment ISO +@deftypefun int wctomb (char *@var{string}, wchar_t @var{wchar}) +The @code{wctomb} (``wide character to multibyte'') function converts +the wide character code @var{wchar} to its corresponding multibyte +character sequence, and stores the result in bytes starting at +@var{string}. At most @code{MB_CUR_MAX} bytes are stored. + +@code{wctomb} with non-null @var{string} distinguishes three +possibilities for @var{wchar}: a valid wide character code (one that can +be translated to a multibyte character), an invalid code, and @code{0}. + +Given a valid code, @code{wctomb} converts it to a multibyte character, +storing the bytes starting at @var{string}. Then it returns the number +of bytes in that character (always at least @code{1}, and never more +than @code{MB_CUR_MAX}). + +If @var{wchar} is an invalid wide character code, @code{wctomb} returns +@code{-1}. If @var{wchar} is @code{0}, it returns @code{0}, also +storing @code{0} in @code{*@var{string}}. + +If the multibyte character code uses shift characters, then +@code{wctomb} maintains and updates a shift state as it scans. If you +call @code{wctomb} with a null pointer for @var{string}, that +initializes the shift state to its standard initial value. It also +returns nonzero if the multibyte character code in use actually has a +shift state. @xref{Shift State}. + +Calling this function with a @var{wchar} argument of zero when +@var{string} is not null has the side-effect of reinitializing the +stored shift state @emph{as well as} storing the multibyte character +@code{0} and returning @code{0}. +@end deftypefun + +Similar to @code{mbrlen} there is also a non-reentrant function which +computes the length of a multibyte character. It can be defined in +terms of @code{mbtowc}.
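As a minimal sketch of that definition, the hypothetical function @code{mblen_sketch} below simply discards the wide character result. It is an illustration only: it shares the internal shift state of @code{mbtowc}, whereas the real @code{mblen} is not supposed to disturb that state.

```c
#include <stdlib.h>

/* Hypothetical stand-in for mblen, written in terms of mbtowc.
   Passing a null pointer as the first argument means the converted
   wide character is not stored anywhere.  */
int
mblen_sketch (const char *string, size_t size)
{
  return mbtowc (NULL, string, size);
}
```

A call such as @code{mblen_sketch (NULL, 0)} resets the shift state in the same way as described for @code{mbtowc} above.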
+ +@comment stdlib.h +@comment ISO +@deftypefun int mblen (const char *@var{string}, size_t @var{size}) +The @code{mblen} function with a non-null @var{string} argument returns +the number of bytes that make up the multibyte character beginning at +@var{string}, never examining more than @var{size} bytes. (The idea is +to supply for @var{size} the number of bytes of data you have in hand.) + +The return value of @code{mblen} distinguishes three possibilities: the +first @var{size} bytes at @var{string} start with a valid multibyte +character, they start with an invalid byte sequence or just part of a +character, or @var{string} points to an empty string (a null character). + +For a valid multibyte character, @code{mblen} returns the number of +bytes in that character (always at least @code{1}, and never more than +@var{size}). For an invalid byte sequence, @code{mblen} returns +@code{-1}. For an empty string, it returns @code{0}. + +If the multibyte character code uses shift characters, then @code{mblen} +maintains and updates a shift state as it scans. If you call +@code{mblen} with a null pointer for @var{string}, that initializes the +shift state to its standard initial value. It also returns nonzero if +the multibyte character code in use actually has a shift state. +@xref{Shift State}. + +@pindex stdlib.h +The function @code{mblen} is declared in @file{stdlib.h}. +@end deftypefun + + +@node Non-reentrant String Conversion +@subsection Non-reentrant Conversion of Strings + +For convenience the @w{ISO C89} standard also defines functions +to convert entire strings instead of single characters. These functions +suffer from the same problems as their reentrant counterparts from the +second amendment to @w{ISO C89}. @xref{Converting Strings}.
+ +@comment stdlib.h +@comment ISO +@deftypefun size_t mbstowcs (wchar_t *@var{wstring}, const char *@var{string}, size_t @var{size}) +The @code{mbstowcs} (``multibyte string to wide character string'') +function converts the null-terminated string of multibyte characters +@var{string} to an array of wide character codes, storing not more than +@var{size} wide characters into the array beginning at @var{wstring}. +The terminating null character counts towards the size, so if @var{size} +is less than the actual number of wide characters resulting from +@var{string}, no terminating null character is stored. + +The conversion of characters from @var{string} begins in the initial +shift state. + +If an invalid multibyte character sequence is found, this function +returns a value of @code{(size_t) -1}. Otherwise, it returns the number of wide +characters stored in the array @var{wstring}. This number does not +include the terminating null character, which is present if the number +is less than @var{size}. + +Here is an example showing how to convert a string of multibyte +characters, allocating enough space for the result. + +@smallexample +wchar_t * +mbstowcs_alloc (const char *string) +@{ + size_t size = strlen (string) + 1; + wchar_t *buf = xmalloc (size * sizeof (wchar_t)); + + size = mbstowcs (buf, string, size); + if (size == (size_t) -1) + @{ + /* @r{Invalid input; release the buffer again.} */ + free (buf); + return NULL; + @} + buf = xrealloc (buf, (size + 1) * sizeof (wchar_t)); + return buf; +@} +@end smallexample + +@end deftypefun + +@comment stdlib.h +@comment ISO +@deftypefun size_t wcstombs (char *@var{string}, const wchar_t *@var{wstring}, size_t @var{size}) +The @code{wcstombs} (``wide character string to multibyte string'') +function converts the null-terminated wide character array @var{wstring} +into a string containing multibyte characters, storing not more than +@var{size} bytes starting at @var{string}, followed by a terminating +null character if there is room. The conversion of characters begins in +the initial shift state.
+ +The terminating null character counts towards the size, so if @var{size} +is less than or equal to the number of bytes needed in @var{wstring}, no +terminating null character is stored. + +If a code that does not correspond to a valid multibyte character is +found, this function returns a value of @code{-1}. Otherwise, the +return value is the number of bytes stored in the array @var{string}. +This number does not include the terminating null character, which is +present if the number is less than @var{size}. +@end deftypefun + +@node Shift State +@subsection States in Non-reentrant Functions + +In some multibyte character codes, the @emph{meaning} of any particular +byte sequence is not fixed; it depends on what other sequences have come +earlier in the same string. Typically there are just a few sequences +that can change the meaning of other sequences; these few are called +@dfn{shift sequences} and we say that they set the @dfn{shift state} for +other sequences that follow. + +To illustrate shift state and shift sequences, suppose we decide that +the sequence @code{0200} (just one byte) enters Japanese mode, in which +pairs of bytes in the range from @code{0240} to @code{0377} are single +characters, while @code{0201} enters Latin-1 mode, in which single bytes +in the range from @code{0240} to @code{0377} are characters, and +interpreted according to the ISO Latin-1 character set. This is a +multibyte code which has two alternative shift states (``Japanese mode'' +and ``Latin-1 mode''), and two shift sequences that specify particular +shift states. + +When the multibyte character code in use has shift states, then +@code{mblen}, @code{mbtowc} and @code{wctomb} must maintain and update +the current shift state as they scan the string. 
To make this work +properly, you must follow these rules: + +@itemize @bullet +@item +Before starting to scan a string, call the function with a null pointer +for the multibyte character address---for example, @code{mblen (NULL, +0)}. This initializes the shift state to its standard initial value. + +@item +Scan the string one character at a time, in order. Do not ``back up'' +and rescan characters already scanned, and do not intersperse the +processing of different strings. +@end itemize + +Here is an example of using @code{mblen} following these rules: + +@smallexample +void +scan_string (char *s) +@{ + int length = strlen (s); + + /* @r{Initialize shift state.} */ + mblen (NULL, 0); + + while (1) + @{ + int thischar = mblen (s, length); + /* @r{Deal with end of string and invalid characters.} */ + if (thischar == 0) + break; + if (thischar == -1) + @{ + error ("invalid multibyte character"); + break; + @} + /* @r{Advance past this character.} */ + s += thischar; + length -= thischar; + @} +@} +@end smallexample + +The functions @code{mblen}, @code{mbtowc} and @code{wctomb} are not +reentrant when using a multibyte code that uses a shift state. However, +no other library functions call these functions, so you don't have to +worry that the shift state will be changed mysteriously. + + +@node Generic Charset Conversion +@section Generic Charset Conversion + +The conversion functions mentioned so far in this chapter all have in +common that they operate on character sets which are not directly +specified by the functions. The multibyte encoding used is specified by +the currently selected locale for the @code{LC_CTYPE} category. The +wide character set is fixed by the implementation (in the case of the GNU C +library it is always @w{ISO 10646}). 
+ +This of course has several problems when it comes to general character +conversion: + +@itemize @bullet +@item +For every conversion where neither the source nor the destination character +set is the character set of the locale for the @code{LC_CTYPE} category, +one has to change the @code{LC_CTYPE} locale using @code{setlocale}. + +This introduces major problems for the rest of the program since +several more functions (e.g., the character classification functions, +@pxref{Classification of Characters}) use the @code{LC_CTYPE} category. + +@item +Parallel conversions to and from different character sets are not +possible since the @code{LC_CTYPE} selection is global and shared by all +threads. + +@item +If neither the source nor the destination character set is the character +set used for @code{wchar_t} representation there is at least a two-step +process necessary to convert a text using the functions above. One +would have to select the source character set as the multibyte encoding, +convert the text into a @code{wchar_t} text, select the destination +character set as the multibyte encoding and convert the wide character +text to the multibyte (=destination) character set. + +Even if this is possible (which is not guaranteed) it is very tedious +work. It also suffers even more from the other two problems raised +above because the locale has to be changed constantly. +@end itemize + + +The XPG2 standard defines a completely new set of functions which has +none of these limitations. They are not at all coupled to the selected +locales and they put no constraints on the character sets selected for +source and destination. Only the set of available conversions +limits them. The standard does not specify that any conversion at all +must be available. This is a measure of the quality of the implementation. + +In the following text the interface is described first. It is called +the @code{iconv} interface for short, after the name of the conversion +function. 
Then the implementation is described as far as it is interesting to +the advanced user who wants to extend the conversion capabilities. +Comparisons with other implementations will show what pitfalls lie in +the way of portable applications. + +@menu +* Generic Conversion Interface:: Generic Character Set Conversion Interface. +* iconv Examples:: A complete @code{iconv} example. +* Other iconv Implementations:: Some Details about other @code{iconv} + Implementations. +* glibc iconv Implementation:: The @code{iconv} Implementation in the GNU C + library. +@end menu + +@node Generic Conversion Interface +@subsection Generic Character Set Conversion Interface + +This set of functions follows the traditional cycle of using a resource: +open--use--close. The interface consists of three functions, each of +which implements one step. + +Before the interfaces are described it is necessary to introduce a +datatype. Just like other open--use--close interfaces the functions +introduced here work using handles and the @file{iconv.h} header +defines a special type for the handles used. + +@comment iconv.h +@comment XPG2 +@deftp {Data Type} iconv_t +This data type is an abstract type defined in @file{iconv.h}. The user +must not assume anything about the definition of this type; it must be +treated as completely opaque. + +Objects of this type can get assigned handles for the conversions using +the @code{iconv} functions. The objects themselves need not be freed but +the conversions the handles stand for have to be. +@end deftp + +@noindent +The first step is the function to create a handle. + +@comment iconv.h +@comment XPG2 +@deftypefun iconv_t iconv_open (const char *@var{tocode}, const char *@var{fromcode}) +The @code{iconv_open} function has to be used before starting a +conversion. 
The two parameters this function takes determine the +source and destination character sets for the conversion and if the +implementation is able to perform such a conversion the +function returns a handle. + +If the wanted conversion is not available the function returns +@code{(iconv_t) -1}. In this case the global variable @code{errno} can +have the following values: + +@table @code +@item EMFILE +The process already has @code{OPEN_MAX} file descriptors open. +@item ENFILE +The system limit of open files is reached. +@item ENOMEM +Not enough memory to carry out the operation. +@item EINVAL +The conversion from @var{fromcode} to @var{tocode} is not supported. +@end table + +It is not possible to use the same descriptor in different threads to +perform independent conversions. Within the data structures associated +with the descriptor there is information about the conversion state. +This must of course not be messed up by using it in different +conversions. + +An @code{iconv} descriptor is like a file descriptor in that for every use a +new descriptor must be created. The descriptor does not stand for all +of the conversions from @var{fromcode} to @var{tocode}. + +The GNU C library implementation of @code{iconv_open} has one +significant extension to other implementations. To ease the extension +of the set of available conversions the implementation allows storing +the necessary files with data and code in arbitrarily many directories. +How these extensions have to be written will be explained below +(@pxref{glibc iconv Implementation}). Here it is only important to say +that all directories mentioned in the @code{GCONV_PATH} environment +variable are considered if they contain a file @file{gconv-modules}. +These directories need not necessarily be created by the system +administrator. In fact, this extension is introduced to help users +writing and using their own, new conversions. 
Of course this does not work +for security reasons in SUID binaries; in this case only the system +directory is considered and this normally is +@file{@var{prefix}/lib/gconv}. The @code{GCONV_PATH} environment +variable is examined exactly once at the first call of the +@code{iconv_open} function. Later modifications of the variable have no +effect. + +@pindex iconv.h +This function was introduced early, in the X/Open Portability Guide, +@w{version 2}. It is supported by all commercial Unices as it is +required for the Unix branding. The quality and completeness of the +implementation varies widely, though. The function is declared in +@file{iconv.h}. +@end deftypefun + +The @code{iconv} implementation can associate large data structures with +the handle returned by @code{iconv_open}. Therefore it is crucial to +free all the resources once all conversions are carried out and the +conversion is not needed anymore. + +@comment iconv.h +@comment XPG2 +@deftypefun int iconv_close (iconv_t @var{cd}) +The @code{iconv_close} function frees all resources associated with the +handle @var{cd} which must have been returned by a successful call to +the @code{iconv_open} function. + +If the function call was successful the return value is @math{0}. +Otherwise it is @math{-1} and @code{errno} is set appropriately. +The defined errors are: + +@table @code +@item EBADF +The conversion descriptor is invalid. +@end table + +@pindex iconv.h +This function was introduced together with the rest of the @code{iconv} +functions in XPG2 and it is declared in @file{iconv.h}. +@end deftypefun + +The standard defines only one actual conversion function. It +therefore has the most general interface: it allows conversion from one +buffer to another. Conversion from a file to a buffer, vice versa, or +even file to file can be implemented on top of it. 
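The open and close steps described so far can be sketched as follows. This is only a minimal illustration and not part of the library: the helper name @code{conversion_available} is hypothetical, and the sketch assumes a system that provides @file{iconv.h}.

```c
#include <errno.h>
#include <iconv.h>
#include <stdio.h>

/* Hypothetical helper: return 1 if a conversion from FROMCODE to
   TOCODE can be opened, 0 if the implementation does not support it,
   and -1 on any other error.  */
int
conversion_available (const char *tocode, const char *fromcode)
{
  iconv_t cd = iconv_open (tocode, fromcode);

  if (cd == (iconv_t) -1)
    {
      /* EINVAL means the conversion is not supported; everything
         else (ENOMEM, EMFILE, ...) is a resource problem.  */
      if (errno == EINVAL)
        return 0;
      perror ("iconv_open");
      return -1;
    }

  /* The handle is not needed any further, so free its resources.  */
  if (iconv_close (cd) != 0)
    {
      perror ("iconv_close");
      return -1;
    }
  return 1;
}
```

A caller could use such a helper to probe for a conversion before committing to it; which character set names are accepted depends entirely on the implementation.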
+ +@comment iconv.h +@comment XPG2 +@deftypefun size_t iconv (iconv_t @var{cd}, const char **@var{inbuf}, size_t *@var{inbytesleft}, char **@var{outbuf}, size_t *@var{outbytesleft}) +@cindex stateful +The @code{iconv} function converts the text in the input buffer +according to the rules associated with the descriptor @var{cd} and +stores the result in the output buffer. It is possible to call the +function for the same text several times in a row since for stateful +character sets the necessary state information is kept in the data +structures associated with the descriptor. + +The input buffer is specified by @code{*@var{inbuf}} and it contains +@code{*@var{inbytesleft}} bytes. The extra indirection is necessary for +communicating the used input back to the caller (see below). It is +important to note that the buffer pointer is of type @code{char} and the +length is measured in bytes even if the input text is encoded in wide +characters. + +The output buffer is specified in a similar way. @code{*@var{outbuf}} +points to the beginning of the buffer with at least +@code{*@var{outbytesleft}} bytes room for the result. The buffer +pointer again is of type @code{char} and the length is measured in +bytes. If @var{outbuf} or @code{*@var{outbuf}} is a null pointer the +conversion is performed but no output is available. + +If @var{inbuf} is a null pointer the @code{iconv} function performs the +necessary action to put the state of the conversion into the initial +state. This is obviously a no-op for non-stateful encodings, but if the +encoding has a state such a function call might put some byte sequences +in the output buffer which perform the necessary state changes. The +next call with @var{inbuf} not being a null pointer then simply goes on +from the initial state. It is important that the programmer never makes +any assumption on whether the conversion has to deal with states or not. 
+Even if the input and output character sets are not stateful the +implementation might still have to keep states. This is due to the +implementation chosen for the GNU C library as it is described below. +Therefore an @code{iconv} call to reset the state should always be +performed if some protocol requires this for the output text. + +The conversion stops for one of three reasons. The first is that all +characters from the input buffer are converted. This actually can mean +two things: either all bytes from the input buffer are really consumed or +there are some bytes at the end of the buffer which possibly can form a +complete character but the input is incomplete. The second reason for a +stop is when the output buffer is full. And the third reason is that +the input contains invalid characters. + +In all these cases the buffer pointers after the last successful +conversion, for input and output buffer, are stored in @code{*@var{inbuf}} and +@code{*@var{outbuf}} and the available room in each buffer is stored in +@code{*@var{inbytesleft}} and @code{*@var{outbytesleft}}. + +Since the character sets selected in the @code{iconv_open} call can be +almost arbitrary there can be situations where the input buffer contains +valid characters which have no identical representation in the output +character set. The behavior in this situation is undefined. The +@emph{current} behavior of the GNU C library in this situation is to +return with an error immediately. This certainly is not the most +desirable solution. Therefore future versions will provide better ones +but they are not yet finished. + +If all input from the input buffer is successfully converted and stored +in the output buffer the function returns the number of non-reversible +conversions performed. In all other cases the return value is @code{(size_t) -1} +and @code{errno} is set appropriately. In this case the value pointed +to by @var{inbytesleft} is nonzero. + +@table @code +@item EILSEQ +The conversion stopped because of an invalid byte sequence in the input. 
+After the call @code{*@var{inbuf}} points at the first byte of the +invalid byte sequence. + +@item E2BIG +The conversion stopped because it ran out of space in the output buffer. + +@item EINVAL +The conversion stopped because of an incomplete byte sequence at the end +of the input buffer. + +@item EBADF +The @var{cd} argument is invalid. +@end table + +@pindex iconv.h +This function was introduced in the XPG2 standard and is declared in the +@file{iconv.h} header. +@end deftypefun + +The definition of the @code{iconv} function is quite good overall. It +provides quite flexible functionality. The only problems lie in the +boundary cases, which are incomplete byte sequences at the end of the +input buffer and invalid input. A third problem, which is not really a +design problem, is the way conversions are selected. The standard does +not say anything about the legitimate names or about a minimal set of available +conversions. We will see how this has negative impacts in the +discussion of other implementations further down. + + +@node iconv Examples +@subsection A complete @code{iconv} example + +The example below features a solution for a common problem. Given that +one knows the internal encoding used by the system for @code{wchar_t} +strings one often needs to read text from a file and store +it in wide character buffers. One can do this using @code{mbsrtowcs} +but then we run into the problems discussed above. 
+ +@smallexample +int +file2wcs (int fd, const char *charset, wchar_t *outbuf, size_t avail) +@{ + char inbuf[BUFSIZ]; + size_t insize = 0; + char *wrptr = (char *) outbuf; + int result = 0; + iconv_t cd; + + cd = iconv_open ("UCS4", charset); + if (cd == (iconv_t) -1) + @{ + /* @r{Something went wrong.} */ + if (errno == EINVAL) + error (0, 0, "conversion from `%s' to `UCS4' not available", + charset); + else + perror ("iconv_open"); + + /* @r{Terminate the output string.} */ + *outbuf = L'\0'; + + return -1; + @} + + while (avail > 0) + @{ + ssize_t nread; + size_t nconv; + char *inptr = inbuf; + + /* @r{Read more input.} */ + nread = read (fd, inbuf + insize, sizeof (inbuf) - insize); + if (nread < 0) + @{ + /* @r{The read itself failed.} */ + result = -1; + break; + @} + if (nread == 0) + @{ + /* @r{When we come here the file is completely read.} + @r{This still could mean there are some unused} + @r{characters in the @code{inbuf}. Put them back.} */ + if (lseek (fd, -(off_t) insize, SEEK_CUR) == -1) + result = -1; + break; + @} + insize += nread; + + /* @r{Do the conversion.} */ + nconv = iconv (cd, &inptr, &insize, &wrptr, &avail); + if (nconv == (size_t) -1) + @{ + /* @r{Not everything went right. It might only be} + @r{an unfinished byte sequence at the end of the} + @r{buffer. Or it is a real problem.} */ + if (errno == EINVAL) + /* @r{This is harmless. Simply move the unused} + @r{bytes to the beginning of the buffer so that} + @r{they can be used in the next round.} */ + memmove (inbuf, inptr, insize); + else + @{ + /* @r{It is a real problem. Maybe we ran out of} + @r{space in the output buffer or we have invalid} + @r{input. 
In any case move the file pointer back to} + @r{the position of the last processed byte.} */ + lseek (fd, -(off_t) insize, SEEK_CUR); + result = -1; + break; + @} + @} + @} + + /* @r{Terminate the output string.} */ + *((wchar_t *) wrptr) = L'\0'; + + if (iconv_close (cd) != 0) + perror ("iconv_close"); + + return (wchar_t *) wrptr - outbuf; +@} +@end smallexample + +@cindex stateful +This example shows the most important aspects of using the @code{iconv} +functions. It shows how successive calls to @code{iconv} can be used to +convert large amounts of text. The user does not have to care about +stateful encodings as the functions take care of everything. + +An interesting point is the case where @code{iconv} returns an error and +@code{errno} is set to @code{EINVAL}. This is not really an error in +the transformation. It can happen whenever the input character set +contains byte sequences of more than one byte for some character and +texts are not processed in one piece. In this case there is a chance +that a multibyte sequence is cut. The caller can then simply read the +remainder of the text and feed the offending bytes together with new +characters from the input to @code{iconv} and continue the work. The +internal state kept in the descriptor is @emph{not} unspecified after +such an event, as is the case with the conversion functions from the +@w{ISO C} standard. + +The example also shows the problem of using wide character strings with +@code{iconv}. As explained in the description of the @code{iconv} +function above the function always takes a pointer to a @code{char} +array and the available space is measured in bytes. In the example the +output buffer is a wide character buffer. Therefore we use a local +variable @var{wrptr} of type @code{char *} which is used in the +@code{iconv} calls. + +This looks rather innocent but can lead to problems on platforms which +have tight restrictions on alignment. 
Therefore the caller of +@code{iconv} has to make sure that the pointers passed are suitable for +access of characters from the appropriate character set. Since in the +above case the parameter passed to the function is a @code{wchar_t} +pointer this is the case (unless the user violates alignment when +computing the parameter). But in other situations, especially when +writing generic functions where one does not know what type of character +set one uses and therefore treats text as a sequence of bytes, it might +become tricky. + + +@node Other iconv Implementations +@subsection Some Details about other @code{iconv} Implementations + +This is not really the place to discuss the @code{iconv} implementation +of other systems but it is necessary to know a bit about them to write +portable programs. The above mentioned problems with the specification +of the @code{iconv} functions can lead to portability issues. + +The first thing to notice is that due to the large number of character +sets in use it is certainly not practical to encode the conversions +directly in the C library. Therefore the conversion information must +come from files outside the C library. This is usually done in one or both +of the following ways: + +@itemize @bullet +@item +The C library contains a set of generic conversion functions which can +read the needed conversion tables and other information from data files. +These files get loaded when necessary. + +This solution is problematic as it can be applied to all character +sets only with very much effort (maybe it is even impossible). The +differences in structure of the different character sets are so large +that many different variants of the table processing functions must be +developed. On top of this the generic nature of these functions makes +them slower than specifically implemented functions. + +@item +The C library only contains a framework which can dynamically load +object files and execute the conversion functions contained therein. 
+ +This solution provides much more flexibility. The C library itself +contains only very little code and therefore reduces the general memory +footprint. Also, with a documented interface between the C library and +the loadable modules it is possible for third parties to extend the set +of available conversion modules. A drawback of this solution is that +dynamic loading must be available. +@end itemize + +Some implementations in commercial Unices implement a mixture of these +possibilities, the majority only the second solution. This often leads +to problems, though. Since the modules with the conversion functions must +be dynamically loaded the system must have this possibility for all +programs. But this is not the case. At least some platforms (if not +all) are not able to dynamically load objects if the program is linked +statically. This is often solved by outlawing static linking entirely +but this surely is a weak solution. The GNU C library does not have this +restriction though it also uses dynamic loading. The danger is that one +gets used to this and forgets about the restriction on other +systems. + +A second thing to know about other @code{iconv} implementations is that +the number of available conversions is often very limited. Some +implementations provide in the standard release (not the special +international release, if such a thing exists) at most 100 to 200 +conversion possibilities. This does not mean 200 different character +sets are supported. E.g., conversions from one character set to a set +of, say, 10 others count as 10 conversions. Together with the other +direction this already makes 20. One can imagine the thin coverage +these platforms provide. Some Unix vendors even provide only a handful +of conversions which renders them useless for almost all uses. + +This directly leads to a third and probably the most problematic point. 
+The way the @code{iconv} conversion functions are implemented on all +known Unix systems, the availability of the conversion functions from +character set @math{@cal{A}} to @math{@cal{B}} and the conversion from +@math{@cal{B}} to @math{@cal{C}} does @emph{not} imply that the +conversion from @math{@cal{A}} to @math{@cal{C}} is available. + +This might not seem unreasonable or problematic at first but it is +quite a big problem as one will notice shortly after hitting it. To show +the problem we assume that we write a program which has to convert from +@math{@cal{A}} to @math{@cal{C}}. A call like + +@smallexample +cd = iconv_open ("@math{@cal{C}}", "@math{@cal{A}}"); +@end smallexample + +@noindent +fails according to the assumption above. But what does the program +do now? The conversion is really necessary and therefore simply giving +up is not an option. + +First this is of course a nuisance. The @code{iconv} function should +take care of this. But second, how should the program proceed from here +on? If it tried to convert to character set @math{@cal{B}} first +the two @code{iconv_open} calls + +@smallexample +cd1 = iconv_open ("@math{@cal{B}}", "@math{@cal{A}}"); +@end smallexample + +@noindent +and + +@smallexample +cd2 = iconv_open ("@math{@cal{C}}", "@math{@cal{B}}"); +@end smallexample + +@noindent +will succeed but how to find @math{@cal{B}}? + +The answer is unfortunately: there is no general solution. On some +systems guessing might help. On those systems most character sets can +convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text. Besides +this only some very system-specific methods can help. Since the +conversion functions come from loadable modules and these modules must +be stored somewhere in the filesystem, one @emph{could} try to find them +and determine from the available files which conversions are available +and whether there is an indirect route from @math{@cal{A}} to +@math{@cal{C}}. 
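The guessing strategy described above can be sketched as follows: try the direct conversion first and, if it is not available, try a two-step route through a pivot character set. The helper name @code{open_route} and the idea of passing the pivot as a parameter are assumptions made for this illustration; whether a usable pivot such as UTF-8 exists depends on the system.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: open a route from FROMCODE to TOCODE.  Returns
   1 if the direct conversion is available (handle in STEPS[0]), 2 if
   only the route through PIVOT works (handles in STEPS[0] and
   STEPS[1]), and 0 if neither can be opened.  */
int
open_route (const char *tocode, const char *fromcode, const char *pivot,
            iconv_t steps[2])
{
  steps[0] = iconv_open (tocode, fromcode);
  if (steps[0] != (iconv_t) -1)
    return 1;

  /* No direct conversion: try FROMCODE -> PIVOT -> TOCODE.  */
  steps[0] = iconv_open (pivot, fromcode);
  if (steps[0] == (iconv_t) -1)
    return 0;

  steps[1] = iconv_open (tocode, pivot);
  if (steps[1] == (iconv_t) -1)
    {
      /* Do not leak the first handle when the second step fails.  */
      iconv_close (steps[0]);
      return 0;
    }
  return 2;
}
```

When the result is 2 the caller has to convert every buffer twice, first with @code{steps[0]} and then with @code{steps[1]}, and has to close both handles when done.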
+ +This shows one of the design errors of @code{iconv} mentioned above. It +should at least be possible to determine the list of available +conversions programmatically so that if @code{iconv_open} says there is +no such conversion, one could make sure this also is true for indirect +routes. + + +@node glibc iconv Implementation +@subsection The @code{iconv} Implementation in the GNU C library + +After reading about the problems of @code{iconv} implementations in the +last section it is certainly good to read here that the implementation +in the GNU C library has none of the problems mentioned above. But let +us go step by step and address the points raised above. The +evaluation is based on the current state of the development (as of +January 1999). The development of the @code{iconv} functions is not +entirely finished by now but things can only get better. + +The GNU C library's @code{iconv} implementation uses shared loadable +modules to implement the conversions. A very small number of +conversions are built into the library itself but these are only rather +trivial conversions. + +All the benefits of loadable modules are available in the GNU C library +implementation. This is especially interesting since the interface is +well documented (see below) and it therefore is easy to write new +conversion modules. The drawback of using loadable objects is not a +problem in the GNU C library, at least on ELF systems. Since the +library is able to load shared objects even in statically linked +binaries this means that static linking need not be forbidden in case +one wants to use @code{iconv}. + +The second problem mentioned is the number of supported conversions. +First, the GNU C library supports more than 150 character sets. And the way +the implementation is designed the number of supported conversions is +greater than 22350 (@math{150} times @math{149}). If any conversion +from or to a character set is missing it can easily be added. 
+ +This high number is due to the fact that the GNU C library +implementation of @code{iconv} does not have the third problem mentioned +above. I.e., whenever there is a conversion from a character set +@math{@cal{A}} to @math{@cal{B}} and from @math{@cal{B}} to +@math{@cal{C}} it always is possible to convert from @math{@cal{A}} to +@math{@cal{C}} directly. If @code{iconv_open} returns an error and +sets @code{errno} to @code{EINVAL} this really means there is no known +way, directly or indirectly, to perform the wanted conversion. + +@cindex triangulation +This is achieved by providing for each character set a conversion from +and to UCS4 encoded @w{ISO 10646}. Using @w{ISO 10646} as an +intermediate representation it is possible to ``triangulate''. + +There is no inherent requirement to provide a conversion to @w{ISO +10646} for a new character set and it is also possible to provide other +conversions where neither source nor destination character set is @w{ISO +10646}. The currently existing set of conversions is simply meant to +cover all conversions which might be of interest. What could be done +in future is improving the speed of certain conversions. + +@cindex ISO-2022-JP +@cindex EUC-JP +Since all currently available conversions use the triangulation method, +frequently used conversions run unnecessarily slowly. If, e.g., somebody often +needs the conversion from ISO-2022-JP to EUC-JP it is not the best way +to convert the input to @w{ISO 10646} first. The two character sets of +interest are much more similar to each other than to @w{ISO 10646}. + +In such a situation one can easily write a new conversion and provide it +as a better alternative. The GNU C library @code{iconv} implementation +would automatically use the module implementing the conversion if it is +specified to be more efficient. 
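How a more efficient module can win over triangulation can be pictured as a cheapest-path search, assuming every conversion carries a numeric cost. The following sketch is not the library's code: the edge table, the node numbering and the function @code{cheapest_cost} are hypothetical, and a simple repeated-relaxation (Bellman-Ford) search stands in for whatever algorithm the library really uses.

```c
#include <limits.h>
#include <stddef.h>

/* One available conversion: from node FROM to node TO at cost COST.  */
struct edge
{
  int from;
  int to;
  int cost;
};

#define MAX_NODES 16

/* Return the lowest total cost of a conversion path from SRC to DST,
   or -1 if no path exists.  NNODES must not exceed MAX_NODES.  */
int
cheapest_cost (const struct edge *edges, size_t nedges, int nnodes,
               int src, int dst)
{
  int dist[MAX_NODES];
  int i;
  size_t e;

  if (nnodes > MAX_NODES)
    return -1;
  for (i = 0; i < nnodes; ++i)
    dist[i] = INT_MAX;
  dist[src] = 0;

  /* Relax all edges NNODES - 1 times (Bellman-Ford).  */
  for (i = 1; i < nnodes; ++i)
    for (e = 0; e < nedges; ++e)
      if (dist[edges[e].from] != INT_MAX
          && dist[edges[e].from] + edges[e].cost < dist[edges[e].to])
        dist[edges[e].to] = dist[edges[e].from] + edges[e].cost;

  return dist[dst] == INT_MAX ? -1 : dist[dst];
}
```

If node 0 stands for ISO-2022-JP, node 1 for @w{ISO 10646} and node 2 for EUC-JP, the two triangulation edges of cost 1 each give a total of 2 for the route from node 0 to node 2; adding a direct edge of cost 1 lowers the result to 1, so the direct module would be preferred.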
+ +@subsubsection Format of @file{gconv-modules} files + +All information about the available conversions comes from a file named +@file{gconv-modules} which can be found in any of the directories along +the @code{GCONV_PATH}. The @file{gconv-modules} files are line-oriented +text files, where each of the lines has one of the following formats: + +@itemize @bullet +@item +If the first non-whitespace character is a @kbd{#} the line contains +only comments and is ignored. + +@item +Lines starting with @code{alias} define an alias name for a character +set. There are two more words expected on the line. The first one +defines the alias name and the second defines the original name of the +character set. The effect is that it is possible to use the alias name +in the @var{fromcode} or @var{tocode} parameters of @code{iconv_open} and +achieve the same result as when using the real character set name. + +This is quite important as a character set often has many different +names. There is normally an official name but this need not +correspond to the most popular name. Besides this many character sets +have special names which are somehow constructed. E.g., all character +sets specified by the ISO have an alias of the form +@code{ISO-IR-@var{nnn}} where @var{nnn} is the registration number. +This allows programs which know about the registration number to +construct character set names and use them in @code{iconv_open} calls. +More on the available names and aliases follows below. + +@item +Lines starting with @code{module} introduce an available conversion +module. These lines must contain three or four more words. + +The first word specifies the source character set, the second word the +destination character set of the conversion implemented in this module. The +third word is the name of the loadable module. 
The filename is +constructed by appending the usual shared object suffix (normally +@file{.so}) and this file is then supposed to be found in the same +directory the @file{gconv-modules} file is in. The last word on the +line, which is optional, is a numeric value representing the cost of the +conversion. If this word is missing a cost of @math{1} is assumed. The +numeric value itself does not matter that much; what counts are the +relative values of the sums of costs for all possible conversion paths. +Below is a more precise description of the use of the cost value. +@end itemize + +Let us come back to the example where one has written a module to directly +convert from ISO-2022-JP to EUC-JP and back. All that has to be done is +to put the new module, say with the name @file{ISO2022JP-EUCJP.so}, in a directory +and add a file @file{gconv-modules} with the following content in the +same directory: + +@smallexample +module ISO-2022-JP// EUC-JP// ISO2022JP-EUCJP 1 +module EUC-JP// ISO-2022-JP// ISO2022JP-EUCJP 1 +@end smallexample + +To see why this is enough it is necessary to understand how the +conversion used by @code{iconv} and described in the descriptor is +selected. The approach to this problem is quite simple. + +At the first call of the @code{iconv_open} function the program reads +all available @file{gconv-modules} files and builds up two tables: one +containing all the known aliases and another which contains the +information about the conversions and which shared object implements +them. + +@subsubsection Finding the conversion path in @code{iconv} + +The set of available conversions forms a directed graph with weighted +edges. The weights on the edges are of course the costs specified in +the @file{gconv-modules} files. 
The @code{iconv_open} function +therefore uses an algorithm suitable to search for the best path in such +a graph and so constructs a list of conversions which must be performed +in succession to get the transformation from the source to the +destination character set. + +Now it can be easily seen why the above @file{gconv-modules} file +allows the @code{iconv} implementation to pick up the specific +ISO-2022-JP to EUC-JP conversion module instead of the conversion coming +with the library itself. Since the latter conversion takes two steps +(from ISO-2022-JP to @w{ISO 10646} and then from @w{ISO 10646} to +EUC-JP) the cost is @math{1+1 = 2}. But the above @file{gconv-modules} +file specifies that the new conversion module can perform this +conversion with a cost of only @math{1}. + +A bit mysterious about the @file{gconv-modules} file above (and also the +file coming with the GNU C library) are the names of the character sets +specified in the @code{module} lines. Why do almost all the names end +in @code{//}? And this is not all: the names can actually be regular +expressions. At this point this mystery will not be revealed. +Sorry! @strong{The part of the implementation where this is used is not +yet finished. For now please simply follow the existing examples. +It'll become clearer once it is. --drepper} + +A last remark about the @file{gconv-modules} file concerns the names not +ending with @code{//}. There often is a character set named +@code{INTERNAL} mentioned. From the discussion above and the chosen +name it should have become clear that this is the name for the +representation used in the intermediate step of the triangulation. We +have said that this is UCS4 but actually this is not quite right. The +UCS4 specification also includes the specification of the byte ordering +used. Since a UCS4 value consists of four bytes a stored value is +affected by byte ordering. 
+The internal representation is @emph{not}
+the same as UCS4 in case the byte ordering of the processor (or at least
+the running process) is not the same as the one required for UCS4. This
+is done for performance reasons as one does not want to perform
+unnecessary byte-swapping operations if one is not interested in actually
+seeing the result in UCS4. To avoid trouble with endianness the internal
+representation is consistently named @code{INTERNAL} even on big-endian
+systems where the representations are identical.
+
+@subsubsection @code{iconv} module data structures
+
+So far this section described how modules are located and considered to
+be used. What remains to be described is the interface of the modules
+so that one can write new ones. This section describes the interface as
+it is in use in January 1999. The interface will change a bit in the
+future but hopefully only in an upward compatible way.
+
+The definitions necessary to write new modules are publicly available
+in the non-standard header @file{gconv.h}. The following text will
+therefore describe the definitions from this header file. But first it
+is necessary to get an overview.
+
+From the perspective of the user of @code{iconv} the interface is quite
+simple: the @code{iconv_open} function returns a handle which can be
+used in calls to @code{iconv} and finally the handle is freed with a call
+to @code{iconv_close}. The problem is: the handle has to be able to
+represent the possibly long sequences of conversion steps and also the
+state of each conversion since the handle is all that is passed to the
+@code{iconv} function. Therefore the data structures are really the
+key to understanding the implementation.
+
+We need two different kinds of data structures. The first describes the
+conversion and the second describes the state etc. There are really two
+type definitions like this in @file{gconv.h}.
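The user-level view described above (open a handle, convert, close) can be sketched in a few lines. This is only an illustration built on the standard @code{iconv_open}, @code{iconv} and @code{iconv_close} functions; the helper name @code{convert_buffer} is hypothetical and the character set names in the usage note are assumed to be known to the installed library.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert INLEN bytes at IN from charset FROM to
   charset TO, writing at most OUTSIZE bytes to OUT.  Returns the number
   of bytes written, or (size_t) -1 on failure with errno set.  */
size_t
convert_buffer (const char *from, const char *to,
                const char *in, size_t inlen,
                char *out, size_t outsize)
{
  /* Note the argument order: the destination character set comes first.  */
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  /* iconv advances the pointers and decrements the byte counters.  */
  char *inptr = (char *) in;
  char *outptr = out;
  size_t inleft = inlen;
  size_t outleft = outsize;
  size_t result = iconv (cd, &inptr, &inleft, &outptr, &outleft);

  /* A call with a null input pointer flushes: for a stateful character
     set it emits the sequence which returns to the initial state.  */
  if (result != (size_t) -1)
    result = iconv (cd, NULL, NULL, &outptr, &outleft);

  /* Keep the error code of the conversion across the close call.  */
  int saved_errno = errno;
  iconv_close (cd);
  errno = saved_errno;

  return result == (size_t) -1 ? (size_t) -1 : (size_t) (outptr - out);
}
```

Called as @code{convert_buffer ("UTF-8", "UCS-4BE", text, len, buf, sizeof buf)} the helper hides the handle completely. All the conversion steps and their states live behind the single @code{iconv_t} value, which is why the handle has to carry so much information.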
+@pindex gconv.h
+
+@comment gconv.h
+@comment GNU
+@deftp {Data type} {struct gconv_step}
+This data structure describes one conversion a module can perform. For
+each function in a loaded module with conversion functions there is
+exactly one object of this type. This object is shared by all users of
+the conversion. I.e., this object does not contain any information
+corresponding to an actual conversion. It only describes the conversion
+itself.
+
+@table @code
+@item struct gconv_loaded_object *shlib_handle
+@itemx const char *modname
+@itemx int counter
+All these elements of the structure are used internally in the C library
+to coordinate loading and unloading the shared object. One must not
+expect any of the other elements to be available or initialized.
+
+@item const char *from_name
+@itemx const char *to_name
+@code{from_name} and @code{to_name} contain the names of the source and
+destination character sets. They can be used to identify the actual
+conversion to be carried out since one module might implement
+conversions for more than one character set and/or direction.
+
+@item gconv_fct fct
+@itemx gconv_init_fct init_fct
+@itemx gconv_end_fct end_fct
+These elements contain pointers to the functions in the loadable module.
+The interface will be explained below.
+
+@item int min_needed_from
+@itemx int max_needed_from
+@itemx int min_needed_to
+@itemx int max_needed_to
+These values have to be filled in by the init function of the module.
+The @code{min_needed_from} value specifies how many bytes a character of
+the source character set at least needs. The @code{max_needed_from}
+specifies the maximum value which also includes possible shift
+sequences.
+
+The @code{min_needed_to} and @code{max_needed_to} values serve the same
+purpose but this time for the destination character set.
+
+It is crucial that these values are accurate since otherwise the
+conversion functions will have problems or not work at all.
+
+@item int stateful
+This element must also be initialized by the init function. It is
+nonzero if the source character set is stateful. Otherwise it is zero.
+
+@item void *data
+This element can be used freely by the conversion functions in the
+module. It can be used to communicate extra information from one call
+to another. It need not be initialized if not needed at all. If this
+element gets assigned a pointer to dynamically allocated memory
+(presumably in the init function) it has to be made sure that the end
+function deallocates the memory. Otherwise the application will leak
+memory.
+
+It is important to be aware that this data structure is shared by all
+users of this specific conversion and therefore the @code{data}
+element must not contain data specific to one particular use of the
+conversion function.
+@end table
+@end deftp
+
+@comment gconv.h
+@comment GNU
+@deftp {Data type} {struct gconv_step_data}
+This is the data structure which contains the information specific to
+each use of the conversion functions.
+
+@table @code
+@item char *outbuf
+@itemx char *outbufend
+These elements specify the output buffer for the conversion step. The
+@code{outbuf} element points to the beginning of the buffer and
+@code{outbufend} points to the byte following the last byte in the
+buffer. The conversion function must not assume anything about the size
+of the buffer but it can be safely assumed that there is room for at
+least one complete character in the output buffer.
+
+Once the conversion is finished and the conversion is the last step the
+@code{outbuf} element must be modified to point after the last byte
+written into the buffer to signal how much output is available. If this
+conversion step is not the last one the element must not be modified.
+The @code{outbufend} element must not be modified.
+
+@item int is_last
+This element is nonzero if this conversion step is the last one. This
+information is necessary for the recursion.
+See the description of the
+conversion function internals below. This element must never be
+modified.
+
+@item int invocation_counter
+The conversion function can use this element to see how many calls of
+the conversion function already happened. Some character sets require
+a certain prolog when generating output and by comparing this value with
+zero one can find out whether it is the first call and therefore the
+prolog should be emitted or not. This element must never be modified.
+
+@item int internal_use
+This element is another one rarely used but needed in certain
+situations. It is assigned a nonzero value in case the conversion
+functions are used to implement @code{mbsrtowcs} et al. I.e., the
+function is not used directly through the @code{iconv} interface.
+
+This sometimes makes a difference as it is expected that the
+@code{iconv} functions are used to translate entire texts while the
+@code{mbsrtowcs} functions are normally only used to convert single
+strings and might be used multiple times to convert entire texts.
+
+But in this situation we would have a problem complying with some rules of
+the character set specification. Some character sets require a prolog
+which must appear exactly once for an entire text. If a number of
+@code{mbsrtowcs} calls are used to convert the text only the first call
+must add the prolog. But since there is no communication between the
+different calls of @code{mbsrtowcs} the conversion functions have no
+possibility to find this out. The situation is different for sequences
+of @code{iconv} calls since the handle allows access to the needed
+information.
+
+This element is mostly used together with @code{invocation_counter} in a
+way like this:
+
+@smallexample
+if (!data->internal_use && data->invocation_counter == 0)
+  /* @r{Emit prolog.} */
+  ...
+@end smallexample
+
+This element must never be modified.
+
+@item mbstate_t *statep
+The @code{statep} element points to an object of type @code{mbstate_t}
+(@pxref{Keeping the state}). The conversion of a stateful character
+set must use the object pointed to by this element to store information
+about the conversion state. The @code{statep} element itself must never
+be modified.
+
+@item mbstate_t __state
+This element must @emph{never} be used directly. It is only part of
+this structure to have the needed space allocated.
+@end table
+@end deftp
+
+@subsubsection @code{iconv} module interfaces
+
+With the knowledge about the data structures we now can describe the
+conversion functions themselves. To understand the interface a bit of
+knowledge about the functionality in the C library which loads the
+objects with the conversions is necessary.
+
+It is often the case that one conversion is used more than once. I.e.,
+there are several @code{iconv_open} calls for the same set of character
+sets during one program run. The @code{mbsrtowcs} et al.@: functions in
+the GNU C library also use the @code{iconv} functionality which
+increases the number of uses of the same functions even more.
+
+For this reason the modules do not get loaded exclusively for one
+conversion. Instead a module once loaded can be used by arbitrarily many
+@code{iconv} or @code{mbsrtowcs} calls at the same time. The splitting
+of the information between conversion function specific information and
+conversion data makes this possible. The last section showed the two
+data structures used to do this.
+
+This is of course also reflected in the interface and semantics of the
+functions the modules must provide. There are three functions which
+must have the following names:
+
+@table @code
+@item gconv_init
+The @code{gconv_init} function initializes the conversion function
+specific data structure.
+This very same object is shared by all
+uses of this conversion and therefore no state information
+about the conversion itself must be stored in here. If a module
+implements more than one conversion the @code{gconv_init} function will be
+called multiple times.
+
+@item gconv_end
+The @code{gconv_end} function is responsible for freeing all resources
+allocated by the @code{gconv_init} function. If there is nothing to do
+this function can be missing. Special care must be taken if the module
+implements more than one conversion and the @code{gconv_init} function
+does not allocate the same resources for all conversions.
+
+@item gconv
+This is the actual conversion function. It is called to convert one
+block of text. It gets passed the conversion step information
+initialized by @code{gconv_init} and the conversion data, specific to
+this use of the conversion functions.
+@end table
+
+There are three data types defined for the three module interface
+functions and these define the interface.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} int (*gconv_init_fct) (struct gconv_step *)
+This specifies the interface of the initialization function of the
+module. It is called exactly once for each conversion the module
+implements.
+
+As explained in the description of the @code{struct gconv_step} data
+structure above the initialization function has to initialize parts of
+it.
+
+@table @code
+@item min_needed_from
+@itemx max_needed_from
+@itemx min_needed_to
+@itemx max_needed_to
+These elements must be initialized to the exact numbers of the minimum
+and maximum number of bytes used by one character in the source and
+destination character set respectively. If the characters all have the
+same size the minimum and maximum values are the same.
+
+@item stateful
+This element must be initialized to a nonzero value if the source
+character set is stateful. Otherwise it must be zero.
+@end table
+
+If the initialization function needs to communicate some information
+to the conversion function this can happen using the @code{data} element
+of the @code{gconv_step} structure. But since this data is shared by
+all the conversions it must not be modified by the conversion function.
+How this can be used is shown in the example below.
+
+@smallexample
+#define MIN_NEEDED_FROM         1
+#define MAX_NEEDED_FROM         4
+#define MIN_NEEDED_TO           4
+#define MAX_NEEDED_TO           4
+
+int
+gconv_init (struct gconv_step *step)
+@{
+  /* @r{Determine which direction.} */
+  struct iso2022jp_data *new_data;
+  enum direction dir = illegal_dir;
+  enum variant var = illegal_var;
+  int result;
+
+  if (__strcasecmp (step->from_name, "ISO-2022-JP//") == 0)
+    @{
+      dir = from_iso2022jp;
+      var = iso2022jp;
+    @}
+  else if (__strcasecmp (step->to_name, "ISO-2022-JP//") == 0)
+    @{
+      dir = to_iso2022jp;
+      var = iso2022jp;
+    @}
+  else if (__strcasecmp (step->from_name, "ISO-2022-JP-2//") == 0)
+    @{
+      dir = from_iso2022jp;
+      var = iso2022jp2;
+    @}
+  else if (__strcasecmp (step->to_name, "ISO-2022-JP-2//") == 0)
+    @{
+      dir = to_iso2022jp;
+      var = iso2022jp2;
+    @}
+
+  result = GCONV_NOCONV;
+  if (dir != illegal_dir)
+    @{
+      new_data = (struct iso2022jp_data *)
+        malloc (sizeof (struct iso2022jp_data));
+
+      result = GCONV_NOMEM;
+      if (new_data != NULL)
+        @{
+          new_data->dir = dir;
+          new_data->var = var;
+          step->data = new_data;
+
+          if (dir == from_iso2022jp)
+            @{
+              step->min_needed_from = MIN_NEEDED_FROM;
+              step->max_needed_from = MAX_NEEDED_FROM;
+              step->min_needed_to = MIN_NEEDED_TO;
+              step->max_needed_to = MAX_NEEDED_TO;
+            @}
+          else
+            @{
+              step->min_needed_from = MIN_NEEDED_TO;
+              step->max_needed_from = MAX_NEEDED_TO;
+              step->min_needed_to = MIN_NEEDED_FROM;
+              step->max_needed_to = MAX_NEEDED_FROM + 2;
+            @}
+
+          /* @r{Yes, this is a stateful encoding.} */
+          step->stateful = 1;
+
+          result = GCONV_OK;
+        @}
+    @}
+
+  return result;
+@}
+@end smallexample
+
+The function first checks which conversion is
+wanted. The module from
+which this function is taken implements four different conversions and
+which one is selected can be determined by comparing the names. The
+comparison should always be done without paying attention to the case.
+
+Then a data structure is allocated which contains the necessary
+information about which conversion is selected. The data structure
+@code{struct iso2022jp_data} is locally defined since outside the module
+this data is not used at all. Please note that if all four conversions
+this module supports are requested there are four data blocks.
+
+One interesting thing is the initialization of the @code{min_} and
+@code{max_} elements of the step data object. A single ISO-2022-JP
+character can consist of one to four bytes. Therefore the
+@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
+this way. The output is always the @code{INTERNAL} character set (aka
+UCS4) and therefore each character consists of exactly four bytes. For
+the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
+account that escape sequences might be necessary to switch the character
+sets. Therefore the @code{max_needed_to} element for this direction
+gets assigned @code{MAX_NEEDED_FROM + 2}. This takes into account the
+two bytes needed for the escape sequences to signal the switching. The
+asymmetry in the maximum values for the two directions can be explained
+easily: when reading ISO-2022-JP text escape sequences can be handled
+alone. I.e., it is not necessary to process a real character since the
+effect of the escape sequence can be recorded in the state information.
+The situation is different for the other direction. Since it is in
+general not known which character comes next one cannot emit escape
+sequences to change the state in advance. This means the escape
+sequences have to be emitted together with the next character.
+Therefore one needs more room than only for the character itself.
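The initialization example relies on a few module-local definitions which are not shown. The following is only a sketch of what they could look like: the enumerator and member names are the ones used in the example, but their order, the layout of @code{struct iso2022jp_data} and the helper @code{max_output_bytes} are assumptions made for illustration and are not taken from the library sources.

```c
#define MIN_NEEDED_FROM 1
#define MAX_NEEDED_FROM 4
#define MIN_NEEDED_TO   4
#define MAX_NEEDED_TO   4

/* Assumed definitions; only the names appear in the example above.  */
enum direction { illegal_dir, to_iso2022jp, from_iso2022jp };
enum variant { illegal_var, iso2022jp, iso2022jp2 };

struct iso2022jp_data
{
  enum direction dir;   /* Which way this step converts.  */
  enum variant var;     /* Which ISO-2022-JP variant is handled.  */
};

/* Hypothetical helper: the largest number of bytes one converted
   character can need in the output buffer.  Writing ISO-2022-JP may
   require room for an escape sequence in addition to the character.  */
int
max_output_bytes (const struct iso2022jp_data *data)
{
  return data->dir == from_iso2022jp
         ? MAX_NEEDED_TO            /* INTERNAL is always four bytes.  */
         : MAX_NEEDED_FROM + 2;     /* Character plus escape sequence.  */
}
```

The helper returns the same numbers the example stores in @code{max_needed_to} for the two directions, which makes the asymmetry between reading and writing ISO-2022-JP explicit.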
+
+The possible return values of the initialization function are:
+
+@table @code
+@item GCONV_OK
+The initialization succeeded.
+@item GCONV_NOCONV
+The requested conversion is not supported in the module. This can
+happen if the @file{gconv-modules} file has errors.
+@item GCONV_NOMEM
+Memory required to store additional information could not be allocated.
+@end table
+@end deftypevr
+
+The function called before the module is unloaded is significantly
+simpler. It often has nothing at all to do in which case it can be left
+out completely.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} void (*gconv_end_fct) (struct gconv_step *)
+The task of this function is to free all resources allocated in the
+initialization function. Therefore only the @code{data} element of the
+object pointed to by the argument is of interest. Continuing the
+example from the initialization function, the finalization function
+looks like this:
+
+@smallexample
+void
+gconv_end (struct gconv_step *data)
+@{
+  free (data->data);
+@}
+@end smallexample
+@end deftypevr
+
+The most important function of course is the conversion function itself.
+It can get quite complicated for complex character sets. But since this
+is not of interest here we will only describe a possible skeleton for
+the conversion function.
+
+@comment gconv.h
+@comment GNU
+@deftypevr {Data type} int (*gconv_fct) (struct gconv_step *, struct gconv_step_data *, const char **, const char *, size_t *, int)
+The conversion function can be called for two basic reasons: to convert
+text or to reset the state. From the description of the @code{iconv}
+function it can be seen why the flushing mode is necessary. What mode
+is selected is determined by the sixth argument, an integer. If it is
+nonzero it means that flushing is selected.
+
+Common to both modes is where the output buffer can be found. The
+information about this buffer is stored in the conversion step data.
+A pointer to this is passed as the second argument to this function. The
+description of the @code{struct gconv_step_data} structure has more
+information on this.
+
+@cindex stateful
+What has to be done for flushing depends on the source character set.
+If it is not stateful nothing has to be done. Otherwise the function
+has to emit a byte sequence to bring the state object into the initial
+state. Once this all happened the other conversion modules in the chain
+of conversions have to get the same chance. Whether another step
+follows can be determined from the @code{is_last} element of the step
+data structure to which the second parameter points.
+
+The more interesting mode is when text actually has to be converted.
+The first step in this case is to convert as much text as possible from
+the input buffer and store the result in the output buffer. The start
+of the input buffer is determined by the third argument which is a
+pointer to a pointer variable referencing the beginning of the buffer.
+The fourth argument is a pointer to the byte right after the last byte
+in the buffer.
+
+The conversion has to be performed according to the current state if the
+character set is stateful. The state is stored in an object pointed to
+by the @code{statep} element of the step data (second argument). Once
+either the input buffer is empty or the output buffer is full the
+conversion stops. At this point the pointer variable referenced by the
+third parameter must point to the byte following the last processed
+byte. I.e., if all of the input is consumed this pointer and the fourth
+parameter have the same value.
+
+What now happens depends on whether this step is the last one or not.
+If it is the last step the only thing which has to be done is to update
+the @code{outbuf} element of the step data structure to point after the
+last written byte. This gives the caller the information on how much
+text is available in the output buffer.
+Besides this the variable
+pointed to by the fifth parameter, which is of type @code{size_t}, must
+be incremented by the number of characters (@emph{not bytes}) which were
+written in the output buffer. Then the function can return.
+
+In case the step is not the last one the later conversion functions have
+to get a chance to do their work. Therefore the appropriate conversion
+function has to be called. The information about the functions is
+stored in the conversion data structures, passed as the first parameter.
+This information and the step data are stored in arrays so the next
+element in both cases can be found by simple pointer arithmetic:
+
+@smallexample
+int
+gconv (struct gconv_step *step, struct gconv_step_data *data,
+       const char **inbuf, const char *inbufend, size_t *written,
+       int do_flush)
+@{
+  struct gconv_step *next_step = step + 1;
+  struct gconv_step_data *next_data = data + 1;
+  ...
+@end smallexample
+
+The @code{next_step} pointer references the next step information and
+@code{next_data} the next data record. The call of the next function
+therefore will look similar to this:
+
+@smallexample
+  next_step->fct (next_step, next_data, &outerr, outbuf, written, 0)
+@end smallexample
+
+But this is not yet all. Once the function call returns the conversion
+function might have some more to do. If the return value of the
+function is @code{GCONV_EMPTY_INPUT} this means there is more room in
+the output buffer. Unless the input buffer is empty the conversion
+function starts all over again and processes the rest of the input
+buffer. If the return value is not @code{GCONV_EMPTY_INPUT} something
+went wrong and we have to recover from this.
+
+A requirement for the conversion function is that the input buffer
+pointer (the third argument) always points to the last character which
+was put in the converted form in the output buffer. This is trivially
+true after the conversion performed in the current step.
+But if the
+conversion functions deeper down the stream stop prematurely not all
+characters from the output buffer are consumed and therefore the input
+buffer pointers must be backed off to the right position.
+
+This is easy to do if the input and output character sets have a fixed
+width for all characters. In this situation we can compute how many
+characters are left in the output buffer and therefore can correct the
+input buffer pointer appropriately with a similar computation. Things are
+getting tricky if either character set has characters represented with
+variable length byte sequences and it gets even more complicated if the
+conversion has to take care of the state. In these cases the conversion
+has to be performed once again, from the known state before the initial
+conversion. I.e., if necessary the state of the conversion has to be
+reset and the conversion loop has to be executed again. The difference
+now is that it is known how much input must be created and the
+conversion can stop before converting the first unused character. Once
+this is done the input buffer pointers must be updated again and the
+function can return.
+
+One final thing should be mentioned. If it is necessary for the
+conversion to know whether it is the first invocation (in case a prolog
+has to be emitted) the conversion function should just before returning
+to the caller increment the @code{invocation_counter} element of the
+step data structure. See the description of the @code{struct
+gconv_step_data} structure above for more information on how this can be
+used.
+
+The return value must be one of the following values:
+
+@table @code
+@item GCONV_EMPTY_INPUT
+All input was consumed and there is room left in the output buffer.
+@item GCONV_FULL_OUTPUT
+No more room in the output buffer. In case this is not the last step
+this value is propagated down from the call of the next conversion
+function in the chain.
+@item GCONV_INCOMPLETE_INPUT
+The input buffer is not entirely empty since it contains an incomplete
+character sequence.
+@end table
+
+The following example provides a framework for a conversion function.
+In case a new conversion has to be written the holes in this
+implementation have to be filled and that is it.
+
+@smallexample
+int
+gconv (struct gconv_step *step, struct gconv_step_data *data,
+       const char **inbuf, const char *inbufend, size_t *written,
+       int do_flush)
+@{
+  struct gconv_step *next_step = step + 1;
+  struct gconv_step_data *next_data = data + 1;
+  gconv_fct fct = next_step->fct;
+  int status;
+
+  /* @r{If the function is called with no input this means we have}
+     @r{to reset to the initial state. The possibly partly}
+     @r{converted input is dropped.} */
+  if (do_flush)
+    @{
+      status = GCONV_OK;
+
+      /* @r{Possibly emit a byte sequence which puts the state object}
+         @r{into the initial state.} */
+
+      /* @r{Call the steps down the chain if there are any but only}
+         @r{if we successfully emitted the escape sequence.} */
+      if (status == GCONV_OK && ! data->is_last)
+        status = fct (next_step, next_data, NULL, NULL,
+                      written, 1);
+    @}
+  else
+    @{
+      /* @r{We preserve the initial values of the pointer variables.} */
+      const char *inptr = *inbuf;
+      char *outbuf = data->outbuf;
+      char *outend = data->outbufend;
+      char *outptr;
+
+      /* @r{This variable is used to count the number of characters}
+         @r{we actually converted.} */
+      size_t converted = 0;
+
+      do
+        @{
+          /* @r{Remember the start value for this round.} */
+          inptr = *inbuf;
+          /* @r{The outbuf buffer is empty.} */
+          outptr = outbuf;
+
+          /* @r{For stateful encodings the state must be saved here.} */
+
+          /* @r{Run the conversion loop.
+             @code{status} is set}
+             @r{appropriately afterwards.} */
+
+          /* @r{If this is the last step leave the loop, there is}
+             @r{nothing we can do.} */
+          if (data->is_last)
+            @{
+              /* @r{Store information about how many bytes are}
+                 @r{available.} */
+              data->outbuf = outbuf;
+
+              /* @r{Remember how many characters we converted.} */
+              *written += converted;
+
+              break;
+            @}
+
+          /* @r{Write out all output which was produced.} */
+          if (outbuf > outptr)
+            @{
+              const char *outerr = data->outbuf;
+              int result;
+
+              result = fct (next_step, next_data, &outerr,
+                            outbuf, written, 0);
+
+              if (result != GCONV_EMPTY_INPUT)
+                @{
+                  if (outerr != outbuf)
+                    @{
+                      /* @r{Reset the input buffer pointer. We}
+                         @r{document here the complex case.} */
+                      size_t nstatus;
+
+                      /* @r{Reload the pointers.} */
+                      *inbuf = inptr;
+                      outbuf = outptr;
+
+                      /* @r{Possibly reset the state.} */
+
+                      /* @r{Redo the conversion, but this time}
+                         @r{the end of the output buffer is at}
+                         @r{@code{outerr}.} */
+                    @}
+
+                  /* @r{Change the status.} */
+                  status = result;
+                @}
+              else
+                /* @r{All the output is consumed, we can make}
+                   @r{another run if everything was ok.} */
+                if (status == GCONV_FULL_OUTPUT)
+                  status = GCONV_OK;
+            @}
+        @}
+      while (status == GCONV_OK);
+
+      /* @r{We finished one use of this step.} */
+      ++data->invocation_counter;
+    @}
+
+  return status;
+@}
+@end smallexample
+@end deftypevr
+
+This information should be sufficient to write new modules. Anybody
+doing so should also take a look at the available source code in the GNU
+C library sources. It contains many examples of working and optimized
+modules. |
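For the simple case mentioned in the description of the conversion function, where both character sets use a fixed number of bytes per character, the correction of the input pointer can be sketched as follows. The helper @code{back_off_input} and its parameter names are hypothetical and not part of @file{gconv.h}; the sketch assumes that exactly one output character is produced for each input character.

```c
#include <stddef.h>

/* Hypothetical helper for fixed-width character sets.

   INPTR    points after the last input byte converted in this step.
   IN_WIDTH is the number of bytes per input character.
   OUTERR   points to the first output byte the next step did not consume.
   OUTEND   points after the last output byte produced in this step.
   OUT_WIDTH is the number of bytes per output character.

   Returns the position the input buffer pointer has to be backed off to.  */
const char *
back_off_input (const char *inptr, size_t in_width,
                const char *outerr, const char *outend, size_t out_width)
{
  /* Number of characters left unconsumed in the output buffer.  */
  size_t unused = (size_t) (outend - outerr) / out_width;

  /* Each of them corresponds to IN_WIDTH bytes of input.  */
  return inptr - unused * in_width;
}
```

If, for instance, five two-byte characters were converted to four-byte characters and the next step consumed only the first three, the input pointer has to move back by two characters, i.e., four bytes. For variable-width or stateful character sets no such shortcut exists and the conversion has to be redone as described above.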