Diffstat (limited to 'newlib/libc')
-rw-r--r-- | newlib/libc/iconv/iconv.tex | 424 |
1 files changed, 421 insertions, 3 deletions
diff --git a/newlib/libc/iconv/iconv.tex b/newlib/libc/iconv/iconv.tex
index 3d3621e..9d524ce 100644
--- a/newlib/libc/iconv/iconv.tex
+++ b/newlib/libc/iconv/iconv.tex
@@ -1,14 +1,432 @@
 @node Iconv
 @chapter Character-set conversions (@file{iconv.h})
 
-This chapter describes the iconv character-set conversion functions.
-The corresponding declarations are in
+This chapter describes the Newlib iconv library.
+The iconv function declarations are in
 @file{iconv.h}.
 
 @menu
-* iconv:: Character set conversion routines
+* iconv:: Character set conversion routines
+* iconv architecture:: Architecture of Newlib iconv library
+* iconv configuration:: Newlib iconv-specific configure options
+* Generating CCS tables:: How to generate CCS tables
+* Adding new converter:: Steps on adding a new converter
 @end menu
 
 @page
 @include iconv/iconv.def
 
+@page
+@node iconv architecture
+@section iconv architecture
+@findex iconv architecture
+@findex encoding
+@findex CCS
+@findex CES
+@findex iconv converter
+@*
+@itemize @bullet
+@item
+Encoding - a rule to represent computer text by means of bits and bytes.
+@item
+CCS (Coded Character Set) - a mapping from an abstract character set
+to a set of non-negative integers (character codes).
+@item
+CES (Character Encoding Scheme) - a mapping from a set of character code
+units to a sequence of bytes.
+@end itemize
+
+@*
+Examples of CCS: ASCII, ISO-8859-x, KOI8-R, KSX-1001, GB-2312.@*
+Examples of CES: UTF-8, UTF-16, EUC-JP, ISO-2022-JP.
+
+@*
+The iconv library is used to convert an array of characters in one encoding
+to an array in another encoding.
+
+@*
+From a user's point of view, the iconv library is a set of converters. Each converter
+corresponds to one encoding (e.g., KOI8-R converter, UTF-8 converter).
+Internally, the term converter has a different meaning.
+
+@*
+The iconv library always performs conversions through UCS-32: i.e., to convert
+from A to B, the iconv library first converts A to UCS-32, and then UCS-32 to B.
+
+@*
+Each encoding consists of CES and CCS. CCS may be represented as data tables
+but CES always implies some code (algorithm). Iconv uses CCS tables
+to map from some encoding to UCS-32. CCS tables are placed into
+the iconv/ccs subdirectory of newlib. The iconv code also uses CES
+modules which can convert some CCS to and from UCS-32. CES modules are placed
+in the iconv/ces subdirectory.
+
+@*
+Some encodings have CES = CCS (e.g., KOI8-R). For such encodings iconv uses
+special subroutines which perform simple table conversions (ccs_table.c).
+
+@*
+Among specialized CES modules, the iconv library has
+generic support for EUC and ISO-2022-family encodings (ces_euc.c and
+ces_iso2022.c).
+
+@*
+To enable iconv to work with CCS or CES-based encodings, the corresponding
+CCS table or CES module should be linked with Newlib. The iconv support
+can also load CCS tables dynamically from external files (.cct files from
+iconv/ccs/binary subdirectory). CES modules, on the other hand, can't
+be dynamically loaded.
+
+@*
+Each iconv converter has one name and a set of aliases. The list of
+aliases for each converter's name is in the iconv/charset.aliases file.
+Note: iconv always normalizes converter names and aliases before using them.
+
+@page
+@node iconv configuration
+@section iconv configuration
+@findex iconv configuration
+@findex iconv converter
+@*
+To enable iconv, the --enable-newlib-iconv configuration option should be
+used when configuring newlib.
+
+@*
+To link a specific converter (CCS table or CES module) into Newlib, the
+---enable-newlib-builtin-converters option should be used. A
+comma-separated list of converters can be passed with this option
+(e.g., ---enable-newlib-builtin-converters=koi8-r,euc-jp to link KOI8-R
+and EUC-JP converters). Either converter names or aliases may be used.
+
+@*
+If the target system has a file system accessible by Newlib, table-based
+converters may be loaded dynamically from external files. The iconv
+code tries to load files from the iconv_data subdirectory of the directory
+specified by the NLSPATH environment variable.
+
+@*
+Since Newlib has no generic dynamic module load support, CES-based converters
+can't be dynamically loaded and should be linked in.
+
+@page
+@node Generating CCS tables
+@section Generating CCS tables
+@*
+CCS tables are placed in the ccs subdirectory of the iconv directory.
+This subdirectory contains .cct and .c files. The .cct files are for
+dynamic loading whereas the .c files are for static linking with Newlib.
+Both .c and .cct files are generated by the 'iconv_mktbl' Perl script
+from special source files (call them
+.txt files). The 'iconv_mktbl' script can be found in the iconv/ccs
+subdirectory. Input .txt files can be found at the Unicode.org site or
+elsewhere on the web.
+
+@*
+The .c files are linked with Newlib if the corresponding 'configure' script
+option was given. This is needed to use iconv on targets without file system
+support. If a CCS table isn't configured to be linked, the iconv library
+tries to load it dynamically from a corresponding .cct file.
+
+@*
+The following are commands to build .c and .cct CCS table files from .txt
+files for several supported encodings.
+
+@*
+@itemize
+@item
+cp775:@*
+iconv_mktbl -Co cp775.c cp775.txt@*
+iconv_mktbl -o cp775.cct cp775.txt
+@end itemize
+
+@itemize
+@item
+cp850:@*
+iconv_mktbl -Co cp850.c cp850.txt@*
+iconv_mktbl -o cp850.cct cp850.txt
+@end itemize
+
+@itemize
+@item
+cp852:@*
+iconv_mktbl -Co cp852.c cp852.txt@*
+iconv_mktbl -o cp852.cct cp852.txt
+@end itemize
+
+@itemize
+@item
+cp855:@*
+iconv_mktbl -Co cp855.c cp855.txt@*
+iconv_mktbl -o cp855.cct cp855.txt
+@end itemize
+
+@itemize
+@item
+cp866@*
+iconv_mktbl -Co cp866.c cp866.txt@*
+iconv_mktbl -o cp866.cct cp866.txt
+@end itemize
+
+@itemize
+@item
+iso-8859-1@*
+iconv_mktbl -Co iso-8859-1.c iso-8859-1.txt@*
+iconv_mktbl -o iso-8859-1.cct iso-8859-1.txt
+@end itemize
+
+@itemize
+@item
+iso-8859-4@*
+iconv_mktbl -Co iso-8859-4.c iso-8859-4.txt@*
+iconv_mktbl -o iso-8859-4.cct iso-8859-4.txt
+@end itemize
+
+@itemize
+@item
+iso-8859-5@*
+iconv_mktbl -Co iso-8859-5.c iso-8859-5.txt@*
+iconv_mktbl -o iso-8859-5.cct iso-8859-5.txt
+@end itemize
+
+@itemize
+@item
+iso-8859-2@*
+iconv_mktbl -Co iso-8859-2.c iso-8859-2.txt@*
+iconv_mktbl -o iso-8859-2.cct iso-8859-2.txt
+@end itemize
+
+@itemize
+@item
+iso-8859-15@*
+iconv_mktbl -Co iso-8859-15.c iso-8859-15.txt@*
+iconv_mktbl -o iso-8859-15.cct iso-8859-15.txt
+@end itemize
+
+@itemize
+@item
+big5@*
+iconv_mktbl -Co big5.c big5.txt@*
+iconv_mktbl -o big5.cct big5.txt
+@end itemize
+
+@itemize
+@item
+ksx1001@*
+iconv_mktbl -Co ksx1001.c ksx1001.txt@*
+iconv_mktbl -o ksx1001.cct ksx1001.txt
+@end itemize
+
+@itemize
+@item
+gb_2312@*
+iconv_mktbl -Co gb_2312-80.c gb_2312-80.txt@*
+iconv_mktbl -o gb_2312-80.cct gb_2312-80.txt
+@end itemize
+
+@itemize
+@item
+jis_x0201@*
+iconv_mktbl -Co jis_x0201.c jis_x0201.txt@*
+iconv_mktbl -o jis_x0201.cct jis_x0201.txt
+@end itemize
+
+@itemize
+@item
+iconv_mktbl -Co shift_jis.c shift_jis.txt@*
+iconv_mktbl -o shift_jis.cct shift_jis.txt
+@end itemize
+
+@itemize
+@item
+jis_x0208@*
+iconv_mktbl -C -c 1 -u 2 -o jis_x0208-1983.c jis_x0208-1983.txt@*
+iconv_mktbl -c 1 -u 2 -o jis_x0208-1983.cct jis_x0208-1983.txt
+@end itemize
+
+@itemize
+@item
+jis_x0212@*
+iconv_mktbl -Co jis_x0212-1990.c jis_x0212-1990.txt@*
+iconv_mktbl -o jis_x0212-1990.cct jis_x0212-1990.txt
+@end itemize
+
+@itemize
+@item
+cns11643-plane1@*
+iconv_mktbl -C -p 0x1 -o cns11643-plane1.c cns11643.txt@*
+iconv_mktbl -p 0x1 -o cns11643-plane1.cct cns11643.txt
+@end itemize
+
+@itemize
+@item
+cns11643-plane2@*
+iconv_mktbl -C -p 0x2 -o cns11643-plane2.c cns11643.txt@*
+iconv_mktbl -p 0x2 -o cns11643-plane2.cct cns11643.txt
+@end itemize
+
+@itemize
+@item
+cns11643-plane14@*
+iconv_mktbl -C -p 0xE -o cns11643-plane14.c cns11643.txt@*
+iconv_mktbl -p 0xE -o cns11643-plane14.cct cns11643.txt
+@end itemize
+
+@itemize
+@item
+koi8-r@*
+iconv_mktbl -Co koi8-r.c koi8-r.txt@*
+iconv_mktbl -o koi8-r.cct koi8-r.txt
+@end itemize
+
+@itemize
+@item
+koi8-u@*
+iconv_mktbl -Co koi8-u.c koi8-u.txt@*
+iconv_mktbl -o koi8-u.cct koi8-u.txt
+@end itemize
+
+@itemize
+@item
+us-ascii@*
+iconv_mktbl -Cao us-ascii.c iso-8859-1.txt@*
+iconv_mktbl -ao us-ascii.cct iso-8859-1.txt
+@end itemize
+
+@*
+Source files for CCS tables can be taken from at least two places:
+
+@*
+@enumerate
+@item
+http://www.unicode.org/Public/MAPPINGS/ contains a lot of encoding
+map files.
+@item
+http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains original
+iconv sources and encoding map files.
+@end enumerate
+
+@*
+The following are URLs where source files for some of the CCS tables
+can be found:
+
+@itemize
+@item
+big5:@*
+http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/BIG5.TXT
+@end itemize
+
+@itemize
+@item
+cns11643_plane14, cns11643_plane1 and cns11643_plane2:@*
+http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/OTHER/CNS11643.TXT
+@end itemize
+
+@itemize
+@item
+cp775, cp850, cp852, cp855, cp866:@*
+http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/
+@end itemize
+
+@itemize
+@item
+gb_2312_80:@*
+http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/GB/GB2312.TXT
+@end itemize
+
+@itemize
+@item
+iso_8859_15, iso_8859_1, iso_8859_2, iso_8859_4, iso_8859_5:@*
+http://www.unicode.org/Public/MAPPINGS/ISO8859/
+@end itemize
+
+@itemize
+@item
+jis_x0201, jis_x0208_1983, jis_x0212_1990, shift_jis@*
+http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/JIS0201.TXT
+@end itemize
+
+@itemize
+@item
+koi8_r@*
+http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/KOI8-R.TXT
+@end itemize
+
+@itemize
+@item
+ksx1001@*
+http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
+@end itemize
+
+@itemize
+@item
+koi8-u can be obtained from the original FreeBSD iconv library distribution
+http://www.dante.net/staff/konstantin/FreeBSD/iconv/
+@end itemize
+
+@*
+Moreover, http://www.dante.net/staff/konstantin/FreeBSD/iconv/ contains a
+lot of additional CCS tables that you can use with Newlib (iso-2022 and
+RFC1345 encodings).
+
+@page
+@node Adding new converter
+@section Adding a new iconv converter
+@*
+The following steps should be taken to add a new iconv converter:
+
+@*
+@enumerate
+@item
+The converter's name and list of aliases should be added to
+the iconv/charset.aliases file.
+@item
+All iconv converters are protected by a _ICONV_CONVERTER_XXX
+macro, where XXX is the converter name. This protection macro should be added to
+the newlib/newlib.hin file.
+@item
+The converter's name and aliases should also be registered in the _iconv_builtin_aliases
+table in iconv/lib/bialiasesi.c. The list should be protected by
+the corresponding macro mentioned above.
+@item
+If a new converter is just a CCS table, the corresponding .cct and .c files
+should be added to the iconv/ccs/ subdirectory. The file names
+should be equivalent to the normalized encoding name. The 'iconv_mktbl'
+Perl script (found in iconv/ccs) may
+be used to generate such files. The file names should be added to the
+iconv/ccs/Makefile.am and iconv/ccs/binary/Makefile.am files and then
+automake should be used to regenerate the Makefile.in files.
+@item
+If a new converter has a CES algorithm, the appropriate file should be
+added to the
+iconv/ces/ subdirectory. The name of the file again should be equivalent
+to the normalized
+encoding name.
+@item
+If a converter is EUC or ISO-2022-family CES, then the converter
+is just an array with a list of used CCS (see ccs/euc-jp.c for example). This
+is because iconv already has EUC and ISO-2022 support. Used CCS tables should
+be provided in iconv/ccs/.
+@item
+If a converter isn't EUC or ISO-2022-based CES, the following functions
+should be provided (see utf-8.c for example):
+@enumerate @minus
+@item A function to convert from new CES to UCS-32;
+@item A function to convert from UCS-32 to new CES;
+@item An 'init' function;
+@item A 'close' function;
+@item A 'reset' function to reset shift state for stateful CES.
+@end enumerate
+
+@*
+All these functions are registered into a 'struct iconv_ces_desc' object.
+The name of the object should be _iconv_ces_module_XXX, where XXX is the
+name of the converter.
+@item
+For CES converters the corresponding 'struct iconv_ces_desc' reference should
+be added to the iconv/lib/bices.c file.
+
+@*
+For CCS converters, the corresponding table reference should be added into
+the iconv/lib/biccs.c file.
+@end enumerate
+
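Reviewer note: the patch describes two ideas that are easier to follow with a concrete sketch. Every conversion goes through UCS-32 as a pivot, and a CES converter boils down to a pair of functions (to and from UCS-32) collected in a descriptor object. The C fragment below is a minimal illustration of that shape only. The type `toy_ces_desc`, the function `toy_convert`, and the two toy modules are hypothetical names invented for this note. They are not Newlib's `struct iconv_ces_desc` or its `_iconv_ces_module_XXX` objects, whose real layout is not shown in this patch. The 'init', 'close' and 'reset' hooks are omitted because both toy encodings are stateless, and the UTF-8 code handles only 1- and 2-byte sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor for illustration; NOT Newlib's struct iconv_ces_desc. */
typedef struct {
    const char *name;
    /* Decode one character; returns bytes consumed, or 0 on error. */
    size_t (*to_ucs)(const unsigned char *in, size_t inlen, uint32_t *ucs);
    /* Encode one character; returns bytes written, or 0 on error. */
    size_t (*from_ucs)(uint32_t ucs, unsigned char *out, size_t outlen);
} toy_ces_desc;

/* ISO-8859-1: every byte value equals its Unicode code point. */
static size_t latin1_to_ucs(const unsigned char *in, size_t inlen, uint32_t *ucs)
{
    if (inlen < 1) return 0;
    *ucs = in[0];
    return 1;
}

static size_t latin1_from_ucs(uint32_t ucs, unsigned char *out, size_t outlen)
{
    if (ucs > 0xFF || outlen < 1) return 0;   /* not representable */
    out[0] = (unsigned char)ucs;
    return 1;
}

/* Simplified UTF-8: 1- and 2-byte sequences only, no overlong checks. */
static size_t utf8_to_ucs(const unsigned char *in, size_t inlen, uint32_t *ucs)
{
    if (inlen >= 1 && in[0] < 0x80) { *ucs = in[0]; return 1; }
    if (inlen >= 2 && (in[0] & 0xE0) == 0xC0 && (in[1] & 0xC0) == 0x80) {
        *ucs = ((uint32_t)(in[0] & 0x1F) << 6) | (uint32_t)(in[1] & 0x3F);
        return 2;
    }
    return 0;
}

static size_t utf8_from_ucs(uint32_t ucs, unsigned char *out, size_t outlen)
{
    if (ucs < 0x80) {
        if (outlen < 1) return 0;
        out[0] = (unsigned char)ucs;
        return 1;
    }
    if (ucs < 0x800) {
        if (outlen < 2) return 0;
        out[0] = (unsigned char)(0xC0 | (ucs >> 6));
        out[1] = (unsigned char)(0x80 | (ucs & 0x3F));
        return 2;
    }
    return 0;   /* longer sequences left out of this sketch */
}

const toy_ces_desc toy_latin1 = { "iso_8859_1", latin1_to_ucs, latin1_from_ucs };
const toy_ces_desc toy_utf8   = { "utf_8", utf8_to_ucs, utf8_from_ucs };

/* Pivot conversion A -> UCS-32 -> B. Returns bytes written, or (size_t)-1. */
size_t toy_convert(const toy_ces_desc *from, const toy_ces_desc *to,
                   const unsigned char *in, size_t inlen,
                   unsigned char *out, size_t outlen)
{
    size_t written = 0;
    while (inlen > 0) {
        uint32_t ucs;
        size_t n = from->to_ucs(in, inlen, &ucs);       /* A -> UCS-32 */
        if (n == 0) return (size_t)-1;
        size_t m = to->from_ucs(ucs, out + written, outlen - written); /* UCS-32 -> B */
        if (m == 0) return (size_t)-1;
        in += n;
        inlen -= n;
        written += m;
    }
    return written;
}
```

With this sketch, `toy_convert(&toy_latin1, &toy_utf8, ...)` applied to the four ISO-8859-1 bytes `c a f 0xE9` writes the five UTF-8 bytes `c a f 0xC3 0xA9`. Because each module only knows how to reach the pivot, supporting N encodings takes N modules rather than one per pair, which is presumably why the patch routes all conversions through UCS-32.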