Diffstat (limited to 'manual/message.texi')
-rw-r--r-- | manual/message.texi | 1185 |
1 files changed, 1185 insertions, 0 deletions
diff --git a/manual/message.texi b/manual/message.texi
new file mode 100644
index 0000000..7640e21
--- /dev/null
+++ b/manual/message.texi
@@ -0,0 +1,1185 @@
+@node Message Translation
+@chapter Message Translation
+
+The program's interface with the human should be designed in a way that
+eases the user's task.  One of the possibilities is to use messages in
+whatever language the user prefers.
+
+Printing messages in different languages can be implemented in different
+ways.  One could add all the different languages in the source code and
+choose among the variants every time a message has to be printed.  This
+is certainly not a good solution since extending the set of languages is
+difficult (the code must be changed) and the code itself can become
+really big with dozens of message sets.
+
+A better solution is to keep the message sets for each language in
+separate files which are loaded at runtime depending on the language
+selection of the user.
+
+The GNU C Library provides two different sets of functions to support
+message translation.  The problem is that neither of the interfaces is
+officially defined by the POSIX standard.  The @code{catgets} family of
+functions is defined in the X/Open standard but this is derived from
+industry decisions and therefore is not necessarily based on reasonable
+decisions.
+
+As mentioned above the message catalog handling provides easy
+extensibility by using external data files which contain the message
+translations.  I.e., these files contain for each of the messages used
+in the program a translation for the appropriate language.  So the tasks
+of the message handling functions are
+
+@itemize @bullet
+@item
+locate the external data file with the appropriate translations
+@item
+load the data and make it possible to address the messages
+@item
+map a given key to the translated message
+@end itemize
+
+The two approaches mainly differ in the implementation of this last
+step.
+The design decisions made for this influence everything else.
+
+@menu
+* Message catalogs a la X/Open::  The @code{catgets} family of functions.
+* The Uniforum approach::         The @code{gettext} family of functions.
+@end menu
+
+
+@node Message catalogs a la X/Open
+@section X/Open Message Catalog Handling
+
+The @code{catgets} functions are based on the simple scheme:
+
+@quotation
+Associate every message to translate in the source code with a unique
+identifier.  To retrieve a message from a catalog file solely the
+identifier is used.
+@end quotation
+
+This means for the author of the program that s/he will have to make
+sure the meaning of the identifier in the program code and in the
+message catalogs is always the same.
+
+Before a message can be translated the catalog file must be located.
+The user of the program must be able to guide the responsible function
+to find whatever catalog the user wants.  This is independent of what
+the programmer had in mind.
+
+All the types, constants and functions for the @code{catgets} functions
+are defined/declared in the @file{nl_types.h} header file.
+
+@menu
+* The catgets Functions::      The @code{catgets} function family.
+* The message catalog files::  Format of the message catalog files.
+* The gencat program::         How to generate message catalog files which
+                                can be used by the functions.
+* Common Usage::               How to use the @code{catgets} interface.
+@end menu
+
+
+@node The catgets Functions
+@subsection The @code{catgets} function family
+
+@comment nl_types.h
+@comment X/Open
+@deftypefun nl_catd catopen (const char *@var{cat_name}, int @var{flag})
+The @code{catopen} function tries to locate the message data file named
+@var{cat_name} and loads it when found.  The return value is of an
+opaque type and can be used in calls to the other functions to refer to
+this loaded catalog.
+
+The return value is @code{(nl_catd) -1} in case the function failed and
+no catalog was loaded.
+The global variable @var{errno} contains a code
+for the error causing the failure.  But even if the function call
+succeeded this does not mean that all messages can be translated.
+
+Locating the catalog file must happen in a way which lets the user of
+the program influence the decision.  It is up to the user to decide
+about the language to use and sometimes it is useful to use alternate
+catalog files.  All this can be specified by the user by setting some
+environment variables.
+
+The first problem is to find out where all the message catalogs are
+stored.  Every program could have its own place to keep all the
+different files but usually the catalog files are grouped by languages
+and the catalogs for all programs are kept in the same place.
+
+@cindex NLSPATH environment variable
+To tell the @code{catopen} function where the catalog for the program
+can be found the user can set the environment variable @code{NLSPATH} to
+a value which describes her/his choice.  Since this value must be usable
+for different languages and locales it cannot be a simple string.
+Instead it is a format string (similar to @code{printf}'s).  An example
+is
+
+@smallexample
+/usr/share/locale/%L/%N:/usr/share/locale/%L/LC_MESSAGES/%N
+@end smallexample
+
+First one can see that more than one directory can be specified (with
+the usual syntax of separating them by colons).  The next things to
+observe are the format elements, @code{%L} and @code{%N} in this case.
+The @code{catopen} function knows about several of them and the
+replacement for each of them is of course different.
+
+@table @code
+@item %N
+This format element is substituted with the name of the catalog file.
+This is the value of the @var{cat_name} argument given to
+@code{catopen}.
+
+@item %L
+This format element is substituted with the name of the currently
+selected locale for translating messages.  How this is determined is
+explained below.
+
+@item %l
+(This is the lowercase ell.)
+This format element is substituted with the
+language element of the locale name.  The string describing the selected
+locale is expected to have the form
+@code{@var{lang}[_@var{terr}[.@var{codeset}]]} and this format uses the
+first part @var{lang}.
+
+@item %t
+This format element is substituted by the territory part @var{terr} of
+the name of the currently selected locale.  See the explanation of the
+format above.
+
+@item %c
+This format element is substituted by the codeset part @var{codeset} of
+the name of the currently selected locale.  See the explanation of the
+format above.
+
+@item %%
+Since @code{%} is used as a meta character there must be a way to
+express the @code{%} character in the result itself.  Using @code{%%}
+does this just like it works for @code{printf}.
+@end table
+
+
+Using @code{NLSPATH} allows arbitrary directories to be searched for
+message catalogs while still allowing different languages to be used.
+If the @code{NLSPATH} environment variable is not set the default value
+is
+
+@smallexample
+@var{prefix}/share/locale/%L/%N:@var{prefix}/share/locale/%L/LC_MESSAGES/%N
+@end smallexample
+
+@noindent
+where @var{prefix} is given to @code{configure} while installing the GNU
+C Library (this value is in many cases @code{/usr} or the empty string).
+
+The remaining problem is to decide which locale name must be used.  This
+value determines the substitution of the format elements mentioned above.
+First of all the user can specify a path in the message catalog name
+(i.e., the name contains a slash character).  In this situation the
+@code{NLSPATH} environment variable is not used.  The catalog must exist
+as specified in the program, perhaps relative to the current working
+directory.  This situation is not desirable and catalog names should
+never be written this way.  Besides, this behaviour is not portable
+to all other platforms providing the @code{catgets} interface.
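+
+The substitution of the format elements described above can be sketched
+as follows.  This is only an illustration under stated assumptions, not
+the code used by @code{catopen}: the function name
+@code{expand_nlspath_template} and its buffer interface are made up for
+this example, and the locale name is assumed to have the form
+@code{@var{lang}[_@var{terr}[.@var{codeset}]]}.

```c
#include <stddef.h>
#include <string.h>

/* Append LEN bytes of SRC to OUT, which has room for OUTSZ bytes and
   currently holds *POS bytes.  Return 0 on success, -1 on overflow.  */
static int
append (char *out, size_t outsz, size_t *pos, const char *src, size_t len)
{
  if (*pos + len >= outsz)
    return -1;
  memcpy (out + *pos, src, len);
  *pos += len;
  out[*pos] = '\0';
  return 0;
}

/* Hypothetical helper: expand one NLSPATH template TMPL for the locale
   name LOCALE and the catalog name NAME into OUT.  Return 0 on success,
   -1 on overflow or on an unknown format element.  */
int
expand_nlspath_template (const char *tmpl, const char *locale,
                         const char *name, char *out, size_t outsz)
{
  /* Split LOCALE into its lang, terr and codeset parts.  */
  size_t lang_len = strcspn (locale, "_.");
  const char *rest = locale + lang_len;
  const char *terr = "";
  const char *codeset = "";
  size_t terr_len = 0;
  size_t pos = 0;

  if (*rest == '_')
    {
      terr = rest + 1;
      terr_len = strcspn (terr, ".");
      rest = terr + terr_len;
    }
  if (*rest == '.')
    codeset = rest + 1;

  if (outsz == 0)
    return -1;
  out[0] = '\0';

  for (; *tmpl != '\0'; ++tmpl)
    {
      int rc;

      if (*tmpl != '%')
        rc = append (out, outsz, &pos, tmpl, 1);
      else
        switch (*++tmpl)
          {
          case 'N':
            rc = append (out, outsz, &pos, name, strlen (name));
            break;
          case 'L':
            rc = append (out, outsz, &pos, locale, strlen (locale));
            break;
          case 'l':
            rc = append (out, outsz, &pos, locale, lang_len);
            break;
          case 't':
            rc = append (out, outsz, &pos, terr, terr_len);
            break;
          case 'c':
            rc = append (out, outsz, &pos, codeset, strlen (codeset));
            break;
          case '%':
            rc = append (out, outsz, &pos, "%", 1);
            break;
          default:
            /* Unknown element or a lone '%' at the end.  */
            return -1;
          }
      if (rc != 0)
        return -1;
    }
  return 0;
}
```

+Expanding @code{/usr/share/locale/%L/%N} for the locale name
+@code{de_DE.UTF-8} and the catalog name @code{hello.cat} with this sketch
+yields @code{/usr/share/locale/de_DE.UTF-8/hello.cat}.  A real
+implementation would additionally split @code{NLSPATH} at the colons and
+try each expanded name in turn; a catalog name containing a slash
+bypasses the expansion entirely.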
+
+@cindex LC_ALL environment variable
+@cindex LC_MESSAGES environment variable
+@cindex LANG environment variable
+Otherwise the values of environment variables from the standard
+environment are examined (@pxref{Standard Environment}).  Which
+variables are examined is decided by the @var{flag} parameter of
+@code{catopen}.  If the value is @code{NL_CAT_LOCALE} (which is defined
+in @file{nl_types.h}) then the @code{catopen} function examines the
+environment variables @code{LC_ALL}, @code{LC_MESSAGES}, and @code{LANG}
+in this order.  The first variable which is set in the current
+environment will be used.
+
+If @var{flag} is zero only the @code{LANG} environment variable is
+examined.  This is a left-over from the early days of this function
+when the other environment variables were not known.
+
+In any case the environment variable should have a value of the form
+@code{@var{lang}[_@var{terr}[.@var{codeset}]]} as explained above.  If
+no environment variable is set the @code{"C"} locale is used which
+prevents any translation.
+
+The return value of the @code{catgets} function is in any case a valid
+string.  Either it is a translation from a message catalog or it is the
+same as the @var{string} parameter.  So a piece of code to decide
+whether a translation actually happened must look like this:
+
+@smallexample
+@{
+  char *trans = catgets (desc, set, msg, input_string);
+  if (trans == input_string)
+    @{
+      /* Something went wrong.  */
+    @}
+@}
+@end smallexample
+
+@noindent
+When an error occurred the global variable @var{errno} is set to
+
+@table @var
+@item EBADF
+The catalog does not exist.
+@item ENOMSG
+The set/message tuple does not name an existing element in the
+message catalog.
+@end table
+
+While it sometimes can be useful to test for errors programs normally
+will avoid any test.  If the translation is not available it is no big
+problem if the original, untranslated message is printed.
+Either the
+user understands this as well or s/he will look for the reason why the
+messages are not translated.
+@end deftypefun
+
+Please note that the currently selected locale does not depend on a call
+to the @code{setlocale} function.  It is not necessary that the locale
+data files for this locale exist and that calling @code{setlocale}
+succeeds.  The @code{catopen} function directly reads the values of the
+environment variables.
+
+
+@deftypefun {char *} catgets (nl_catd @var{catalog_desc}, int @var{set}, int @var{message}, const char *@var{string})
+The function @code{catgets} has to be used to access the message catalog
+previously opened using the @code{catopen} function.  The
+@var{catalog_desc} parameter must be a value previously returned by
+@code{catopen}.
+
+The next two parameters, @var{set} and @var{message}, reflect the
+internal organization of the message catalog files.  This will be
+explained in detail below.  For now it is interesting to know that a
+catalog can consist of several sets and the messages in each set are
+individually numbered.  Neither the set number nor the message number
+need be consecutive.  They can be arbitrarily chosen.
+But each message (unless equal to another one) must have its own unique
+pair of set and message number.
+
+Since it is not guaranteed that the message catalog for the language
+selected by the user exists the last parameter @var{string} helps to
+handle this case gracefully.  If no matching string can be found
+@var{string} is returned.  This means for the programmer that
+
+@itemize @bullet
+@item
+the @var{string} parameters should contain reasonable text (this also
+helps to understand the program since otherwise there would be no hint
+on the string which is expected to be returned).
+@item
+all @var{string} arguments should be written in the same language.
+@end itemize
+@end deftypefun
+
+It is somewhat uncomfortable to write a program using the @code{catgets}
+functions if no supporting functionality is available.  Since each
+set/message number tuple must be unique the programmer must keep lists
+of the messages at the same time the code is written.  And the work
+between several people working on the same project must be coordinated.
+In @ref{Common Usage} we will see how these problems can be relaxed
+a bit.
+
+@deftypefun int catclose (nl_catd @var{catalog_desc})
+The @code{catclose} function can be used to free the resources
+associated with a message catalog which previously was opened by a call
+to @code{catopen}.  If the resources can be successfully freed the
+function returns @code{0}.  Otherwise it returns @code{@minus{}1} and the
+global variable @var{errno} is set.  Errors can occur if the catalog
+descriptor @var{catalog_desc} is not valid in which case @var{errno} is
+set to @code{EBADF}.
+@end deftypefun
+
+
+@node The message catalog files
+@subsection Format of the message catalog files
+
+The only reasonable way to translate all the messages of a program and
+store the result in a message catalog file which can be read by the
+@code{catopen} function is to give all the message texts to the
+translator and let her/him translate them all.  I.e., we must have a
+file with entries which associate the set/message tuple with a specific
+translation.  This file format is specified in the X/Open standard and
+is as follows:
+
+@itemize @bullet
+@item
+Lines containing only whitespace characters or empty lines are ignored.
+
+@item
+Lines which contain as the first non-whitespace character a @code{$}
+followed by a whitespace character are comments and are also ignored.
+
+@item
+If a line contains as the first non-whitespace characters the sequence
+@code{$set} followed by a whitespace character an additional argument
+is required to follow.
+This argument can either be:
+
+@itemize @minus
+@item
+a number.  In this case the value of this number determines the set
+to which the following messages are added.
+
+@item
+an identifier consisting of alphanumeric characters plus the underscore
+character.  In this case the set automatically gets a number assigned.
+This value is one more than the largest set number which has appeared
+so far.
+
+How to use the symbolic names is explained in section @ref{Common Usage}.
+
+It is an error if a symbol name appears more than once.  All following
+messages are placed in a set with this number.
+@end itemize
+
+@item
+If a line contains as the first non-whitespace characters the sequence
+@code{$delset} followed by a whitespace character an additional argument
+is required to follow.  This argument can either be:
+
+@itemize @minus
+@item
+a number.  In this case the value of this number determines the set
+which will be deleted.
+
+@item
+an identifier consisting of alphanumeric characters plus the underscore
+character.  This symbolic identifier must match a name for a set which
+previously was defined.  It is an error if the name is unknown.
+@end itemize
+
+In both cases all messages in the specified set will be removed.  They
+will not appear in the output.  But if this set is later selected again
+with a @code{$set} command, messages can be added again and these
+messages will appear in the output.
+
+@item
+If a line contains after leading whitespace the sequence
+@code{$quote}, the quoting character used for this input file is
+changed to the first non-whitespace character following the
+@code{$quote}.  If no non-whitespace character is present before the
+line ends quoting is disabled.
+
+By default no quoting character is used.  In this mode strings are
+terminated with the first unescaped line break.  If there is a
+@code{$quote} sequence present newlines need not be escaped.  Instead a
+string is terminated with the first unescaped appearance of the quote
+character.
+
+A common usage of this feature would be to set the quote character to
+@code{"}.  Then any appearance of the @code{"} in the strings must
+be escaped using the backslash (i.e., @code{\"} must be written).
+
+@item
+Any other line must start with a number or an alphanumeric identifier
+(with the underscore character included).  The following characters
+(starting at the first non-whitespace character) will form the string
+which gets associated with the currently selected set and the message
+number represented by the number or identifier.
+
+If the start of the line is a number the message number is obvious.  It
+is an error if the same message number already appeared for this set.
+
+If the leading token was an identifier the message number gets
+automatically assigned.  The value is the current maximum message
+number for this set plus one.  It is an error if the identifier was
+already used for a message in this set.  It is ok to reuse the
+identifier for a message in another set.  How to use the symbolic
+identifiers will be explained below (@pxref{Common Usage}).  There is
+one limitation with the identifier: it must not be @code{Set}.  The
+reason will be explained below.
+
+Please note that you must use a quoting character if a message contains
+leading whitespace.  Since one cannot guarantee this never happens it is
+probably a good idea to always use quoting.
+
+The text of the messages can contain escape characters.  The usual bunch
+of characters known from the @w{ISO C} language are recognized
+(@code{\n}, @code{\t}, @code{\v}, @code{\b}, @code{\r}, @code{\f},
+@code{\\}, and @code{\@var{nnn}}, where @var{nnn} is the octal coding of
+a character code).
+@end itemize
+
+@strong{Important:} The handling of identifiers instead of numbers for
+the sets and messages is a GNU extension.  Systems strictly following the
+X/Open specification do not have this feature.
+An example for a message
+catalog file is this:
+
+@smallexample
+$ This is a leading comment.
+$quote "
+
+$set SetOne
+1 Message with ID 1.
+two " Message with ID \"two\", which gets the value 2 assigned"
+
+$set SetTwo
+$ Since the last set got the number 1 assigned this set has number 2.
+4000 "The numbers can be arbitrary, they need not start at one."
+@end smallexample
+
+This small example shows various aspects:
+@itemize @bullet
+@item
+Lines 1 and 9 are comments since they start with @code{$} followed by
+a whitespace.
+@item
+The quoting character is set to @code{"}.  Otherwise the quotes in the
+message definition would have to be left out and in this case the
+message with the identifier @code{two} would lose its leading whitespace.
+@item
+Mixing numbered messages with messages having symbolic names is no
+problem and the numbering happens automatically.
+@end itemize
+
+
+While this file format is pretty easy it is not the best possible for
+use in a running program.  The @code{catopen} function would have to
+parse the file and handle syntactic errors gracefully.  This is not so
+easy and the whole process is pretty slow.  Therefore the @code{catgets}
+functions expect the data in another more compact and ready-to-use file
+format.  There is a special program @code{gencat} which is explained in
+detail in the next section.
+
+Files in this other format are not human readable.  To be easy to use by
+programs it is a binary file.  But the format is byte order independent
+so translation files can be shared by systems of arbitrary architecture
+(as long as they use the GNU C Library).
+
+Details about the binary file format are not important to know since
+these files are always created by the @code{gencat} program.  The
+sources of the GNU C Library also provide the sources for the
+@code{gencat} program and so the interested reader can look through
+these source files to learn about the file format.
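+
+The line-oriented rules of the source format described in this section
+can be summarized in a small classifier.  This is a rough sketch, not the
+parser used by @code{gencat}: the names @code{catline_kind} and
+@code{classify_catalog_line} are invented for this example, arguments
+are not validated, and quoting and escape sequences are not handled.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical classification of one line of a message catalog source.  */
enum catline_kind
{
  CATLINE_BLANK,      /* empty or whitespace only */
  CATLINE_COMMENT,    /* '$' followed by whitespace */
  CATLINE_SET,        /* $set directive */
  CATLINE_DELSET,     /* $delset directive */
  CATLINE_QUOTE,      /* $quote directive */
  CATLINE_MESSAGE,    /* starts with a number or an identifier */
  CATLINE_INVALID     /* anything else */
};

/* Return nonzero if LINE starts with the directive DIR followed by
   whitespace or the end of the line.  */
static int
has_directive (const char *line, const char *dir)
{
  size_t n = strlen (dir);
  return (strncmp (line, dir, n) == 0
          && (line[n] == '\0' || isspace ((unsigned char) line[n])));
}

enum catline_kind
classify_catalog_line (const char *line)
{
  /* Leading whitespace is never significant for the classification.  */
  while (isspace ((unsigned char) *line))
    ++line;

  if (*line == '\0')
    return CATLINE_BLANK;

  if (*line == '$')
    {
      if (line[1] == '\0' || isspace ((unsigned char) line[1]))
        return CATLINE_COMMENT;
      if (has_directive (line, "$set"))
        return CATLINE_SET;
      if (has_directive (line, "$delset"))
        return CATLINE_DELSET;
      if (has_directive (line, "$quote"))
        return CATLINE_QUOTE;
      return CATLINE_INVALID;
    }

  /* A message line starts with a number or with an identifier made of
     alphanumeric characters and underscores.  */
  if (isalnum ((unsigned char) *line) || *line == '_')
    return CATLINE_MESSAGE;

  return CATLINE_INVALID;
}
```

+Applied to the example file above, the sketch reports lines 1 and 9 as
+comments, line 2 as a @code{$quote} directive, lines 4 and 8 as
+@code{$set} directives, and lines 5, 6 and 10 as messages.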
+
+
+@node The gencat program
+@subsection Generate Message Catalog files
+
+@cindex gencat
+The @code{gencat} program is specified in the X/Open standard and the
+GNU implementation follows this specification and so can process
+all correctly formed input files.  Additionally some extensions are
+implemented which help to work in a more reasonable way with the
+@code{catgets} functions.
+
+The @code{gencat} program can be invoked in two ways:
+
+@example
+`gencat [@var{Option}]@dots{} [@var{Output-File} [@var{Input-File}]@dots{}]`
+@end example
+
+This is the interface defined in the X/Open standard.  If no
+@var{Input-File} parameter is given input will be read from standard
+input.  Multiple input files will be read as if they are concatenated.
+If @var{Output-File} is also missing, the output will be written to
+standard output.  To provide the interface one is used to from other
+programs a second interface is provided.
+
+@smallexample
+`gencat [@var{Option}]@dots{} -o @var{Output-File} [@var{Input-File}]@dots{}`
+@end smallexample
+
+The option @samp{-o} is used to specify the output file and all file
+arguments are used as input files.
+
+Besides this one can use @file{-} or @file{/dev/stdin} for
+@var{Input-File} to denote the standard input.  Correspondingly one can
+use @file{-} and @file{/dev/stdout} for @var{Output-File} to denote
+standard output.  Using @file{-} as a file name is allowed in X/Open
+while using the device names is a GNU extension.
+
+The @code{gencat} program works by concatenating all input files and
+then @strong{merging} the resulting collection of message sets with a
+possibly existing output file.  This is done by removing all messages
+with set/message number tuples matching any of the generated messages
+from the output file and then adding all the new messages.  Regenerating
+a catalog file while ignoring the old contents therefore requires
+removing the output file if it exists.
+If the output is
+written to standard output no merging takes place.
+
+@noindent
+The following table shows the options understood by the @code{gencat}
+program.  The X/Open standard does not specify any option for the
+program so all of these are GNU extensions.
+
+@table @samp
+@item -V
+@itemx --version
+Print the version information and exit.
+@item -h
+@itemx --help
+Print a usage message listing all available options, then exit successfully.
+@item --new
+Never merge the new messages from the input files with the old content
+of the output file.  The old content of the output file is discarded.
+@item -H
+@itemx --header=name
+This option is used to emit the symbolic names given to sets and
+messages in the input files for use in the program.  Details about how
+to use this are given in the next section.  The @var{name} parameter to
+this option specifies the name of the output file.  It will contain a
+number of C preprocessor @code{#define}s to associate a name with a
+number.
+
+Please note that the generated file only contains the symbols from the
+input files.  If the output is merged with the previous content of the
+output file the possibly existing symbols from the file(s) which
+generated the old output files are not in the generated header file.
+@end table
+
+
+@node Common Usage
+@subsection How to use the @code{catgets} interface
+
+The @code{catgets} functions can be used in two different ways: by
+slavishly following the X/Open specification without relying on the
+extensions, or by using the GNU extensions.  We will take a look at the
+former method first to understand the benefits of the extensions.
+
+@subsubsection Not using symbolic names
+
+Since the X/Open format of the message catalog files does not allow
+symbol names we have to work with numbers all the time.
+When we start
+writing a program we have to replace all appearances of translatable
+strings with something like
+
+@smallexample
+catgets (catdesc, set, msg, "string")
+@end smallexample
+
+@noindent
+@var{catdesc} is retrieved from a call to @code{catopen} which is
+normally done once at the program start.  The @code{"string"} is the
+string we want to translate.  The problems start with the set and
+message numbers.
+
+In a bigger program several programmers usually work at the same time on
+the program and so coordinating the number allocation is crucial.
+Though no two different strings must be indexed by the same tuple of
+numbers it is highly desirable to reuse the numbers for equal strings
+with equal translations (please note that there might be strings which
+are equal in one language but have different translations due to
+different contexts).
+
+The allocation process can be relaxed a bit by using different set
+numbers for different parts of the program.  So the number of developers
+who have to coordinate the allocation can be reduced.  But still lists
+must be kept to track the allocation and errors can easily happen.  These
+errors cannot be discovered by the compiler or the @code{catgets}
+functions.  Only the user of the program might see wrong messages
+printed.  In the worst cases the messages are so misleading that they
+cannot be recognized as wrong.  Think about the translations for
+@code{"true"} and @code{"false"} being exchanged.  This could result in
+a disaster.
+
+
+@subsubsection Using symbolic names
+
+The problems mentioned in the last section derive from the fact that:
+
+@enumerate
+@item
+the numbers are allocated once and due to the possibly frequent use of
+them it is difficult to change a number later.
+@item
+the numbers do not allow one to guess anything about the string and
+therefore collisions can easily happen.
+@end enumerate
+
+By consistently using symbolic names and by providing a method which maps
+the string content to a symbolic name (however this will happen) one can
+prevent both problems above.  The cost of this is that the programmer
+has to write a complete message catalog file while s/he is writing the
+program itself.
+
+This is necessary since the symbolic names must be mapped to numbers
+before the program sources can be compiled.  In the last section it was
+described how to generate a header containing the mapping of the names.
+E.g., for the example message file given in the last section we could
+call the @code{gencat} program as follows (assume @file{ex.msg} contains
+the sources).
+
+@smallexample
+gencat -H ex.h -o ex.cat ex.msg
+@end smallexample
+
+@noindent
+This generates a header file with the following content:
+
+@smallexample
+#define SetTwoSet 0x2   /* ex.msg:8 */
+
+#define SetOneSet 0x1   /* ex.msg:4 */
+#define SetOnetwo 0x2   /* ex.msg:6 */
+@end smallexample
+
+As can be seen the various symbols given in the source file are mangled
+to generate unique identifiers and these identifiers get numbers
+assigned.  Reading the source file and knowing about the rules will
+allow one to predict the content of the header file (it is deterministic)
+but this is not necessary.  The @code{gencat} program can take care of
+everything.  All the programmer has to do is to put the generated header
+file in the dependency list of the source files of her/his project and
+to add a rule to regenerate the header if any of the input files
+change.
+
+One word about the symbol mangling.  Every symbol consists of two parts:
+the name of the message set plus the name of the message or the special
+string @code{Set}.  So @code{SetOnetwo} means this macro can be used to
+access the translation with identifier @code{two} in the message set
+@code{SetOne}.
+
+The other names denote the names of the message sets.
+The special
+string @code{Set} is used in the place of the message identifier.
+
+If in the code the second string of the set @code{SetOne} is used the C
+code should look like this:
+
+@smallexample
+catgets (catdesc, SetOneSet, SetOnetwo,
+         " Message with ID \"two\", which gets the value 2 assigned")
+@end smallexample
+
+Writing the function call this way allows the message number and even
+the set number to be changed without requiring any change in the C
+source code.  (The text of the string is normally not the same; this is
+only for this example.)
+
+
+@subsubsection How does this allow one to develop
+
+To illustrate the usual way to work with the symbolic names
+here is a little example.  Assume we want to write the very complex and
+famous greeting program.  We start by writing the code as usual:
+
+@smallexample
+#include <stdio.h>
+int
+main (void)
+@{
+  printf ("Hello, world!\n");
+  return 0;
+@}
+@end smallexample
+
+Now we want to internationalize the message and therefore replace the
+message with whatever the user wants.
+
+@smallexample
+#include <nl_types.h>
+#include <stdio.h>
+#include "msgnrs.h"
+int
+main (void)
+@{
+  /* Open the catalog; the locale is taken from the environment.  */
+  nl_catd catdesc = catopen ("hello.cat", NL_CAT_LOCALE);
+  /* Print via "%s" so the translation is never used as a format.  */
+  printf ("%s", catgets (catdesc, SetMainSet, SetMainHello,
+                         "Hello, world!\n"));
+  catclose (catdesc);
+  return 0;
+@}
+@end smallexample
+
+We see how the catalog object is opened and the returned descriptor used
+in the other function calls.  It is not really necessary to check for
+failure of any of the functions since even in these situations the
+functions will behave reasonably.  They will simply return the
+untranslated default string.
+
+What remains unspecified here are the constants @code{SetMainSet} and
+@code{SetMainHello}.  These are the symbolic names describing the
+message.  To get the actual definitions which match the information in
+the catalog file we have to create the message catalog source file and
+process it using the @code{gencat} program.
+
+@smallexample
+$ Messages for the famous greeting program.
+$quote "
+
+$set SetMain
+Hello "Hallo, Welt!\n"
+@end smallexample
+
+Now we can start building the program (assume the message catalog source
+file is named @file{hello.msg} and the program source file @file{hello.c}):
+
+@smallexample
+@cartouche
+% gencat -H msgnrs.h -o hello.cat hello.msg
+% cat msgnrs.h
+#define SetMainSet 0x1     /* hello.msg:4 */
+#define SetMainHello 0x1   /* hello.msg:5 */
+% gcc -o hello hello.c -I.
+% cp hello.cat /usr/share/locale/de/LC_MESSAGES
+% echo $LC_ALL
+de
+% ./hello
+Hallo, Welt!
+%
+@end cartouche
+@end smallexample
+
+The call of the @code{gencat} program creates the missing header file
+@file{msgnrs.h} as well as the message catalog binary.  The former is
+used in the compilation of @file{hello.c} while the latter is placed in a
+directory in which the @code{catopen} function will try to locate it.
+Please check the @code{LC_ALL} environment variable and the default path
+for @code{catopen} presented in the description above.
+
+
+@node The Uniforum approach
+@section The Uniforum approach to Message Translation
+
+Sun Microsystems tried to standardize a different approach to message
+translation in the Uniforum group.  There never was a real standard
+defined but still the interface was used in Sun's operating systems.
+Since this approach fits better in the development process of free
+software it is also used throughout the GNU packages and the GNU
+@file{gettext} package provides support for this outside the GNU C
+Library.
+
+The code of the @file{libintl} from GNU @file{gettext} is the same as
+the code in the GNU C Library.  So the documentation in the GNU
+@file{gettext} manual is also valid for the functionality here.  The
+following text will describe the library functions in detail.  But the
+numerous helper programs are not described in this manual.
+Instead
+people should read the GNU @file{gettext} manual
+(@pxref{Top,,GNU gettext utilities,gettext,Native Language Support Library and Tools}).
+We will only give a short overview.
+
+Though the @code{catgets} functions are available by default on more
+systems the @code{gettext} interface is at least as portable as the
+former.  The GNU @file{gettext} package can be used wherever the
+functions are not available.
+
+
+@menu
+* Message catalogs with gettext::  The @code{gettext} family of functions.
+* Helper programs for gettext::    Programs to handle message catalogs
+                                    for @code{gettext}.
+@end menu
+
+
+@node Message catalogs with gettext
+@subsection The @code{gettext} family of functions
+
+The paradigm underlying the @code{gettext} approach to message
+translation is different from that of the @code{catgets} functions, but
+the basic functionality is equivalent.  There are functions of the
+following categories:
+
+@menu
+* Translation with gettext::       What has to be done to translate a message.
+* Locating gettext catalog::       How to determine which catalog to be used.
+* Using gettextized software::     The possibilities of the user to influence
+                                    the way @code{gettext} works.
+@end menu
+
+@node Translation with gettext
+@subsubsection What has to be done to translate a message?
+
+The @code{gettext} functions have a very simple interface.  The most
+basic function just takes the string which shall be translated as the
+argument and it returns the translation.  This is fundamentally
+different from the @code{catgets} approach where an extra key is
+necessary and the original string is only used for the error case.
+
+If the string which has to be translated is the only argument this of
+course means the string itself is the key.  I.e., the translation will
+be selected based on the original string.  The message catalogs must
+therefore contain the original strings plus one translation for any such
+string.
The task of the @code{gettext} function is to compare the +argument string with the available strings in the catalog and return the +appropriate translation. Of course this process is optimized so that +it is not more expensive than an access using an atomic key +like in @code{catgets}. + +The @code{gettext} approach has some advantages but also some +disadvantages. Please see the GNU @file{gettext} manual for a detailed +discussion of the pros and cons. + +All the definitions and declarations for @code{gettext} can be found in +the @file{libintl.h} header file. On systems where these functions are +not part of the C library they can be found in a separate library named +@file{libintl.a} (or accordingly different for shared libraries). + +@deftypefun {char *} gettext (const char *@var{msgid}) +The @code{gettext} function searches the currently selected message +catalogs for a string which is equal to @var{msgid}. If such a +string is available its translation is returned. Otherwise the argument +string @var{msgid} is returned. + +Please note that although the return value is @code{char *} the +returned string must not be changed. This broken type results from the +history of the function and does not reflect the way the function should +be used. + +Please note that above we wrote ``message catalogs'' (plural). This is +a speciality of the GNU implementation of these functions and we will +say more about this in @ref{Locating gettext catalog} when we +talk about the ways message catalogs are selected. + +The @code{gettext} function does not modify the value of the global +@var{errno} variable.
This is necessary to make it possible to write +something like + +@smallexample + printf (gettext ("Operation failed: %m\n")); +@end smallexample + +Here the @var{errno} value is used in the @code{printf} function while +processing the @code{%m} format element and if the @code{gettext} +function changed this value (it is called before @code{printf} is +called) we would get a wrong message. + +So there is no easy way to detect a missing message catalog besides +comparing the argument string with the result. But it is normally the +task of the user to react to missing catalogs. The program cannot guess +when a message catalog is really necessary since a user who speaks +the language the program was developed in does not need any translation. +@end deftypefun + +The remaining two functions to access the message catalog add some +functionality to select a message catalog which is not the default one. +This is important if parts of the program are developed independently. +Every part can have its own message catalog and all of them can be used +at the same time. The C library itself is an example: internally it +uses the @code{gettext} functions but since it must not depend on a +currently selected default message catalog it must specify all ambiguous +information. + +@deftypefun {char *} dgettext (const char *@var{domainname}, const char *@var{msgid}) +The @code{dgettext} function acts just like the @code{gettext} +function. It only takes an additional first argument @var{domainname} +which guides the selection of the message catalogs which are searched +for the translation. If the @var{domainname} parameter is the null +pointer the @code{dgettext} function is exactly equivalent to +@code{gettext} since the default value for the domain name is used. + +As for @code{gettext} the return value type is @code{char *} which is an +anachronism. The returned string must never be modified.
+@end deftypefun + +@deftypefun {char *} dcgettext (const char *@var{domainname}, const char *@var{msgid}, int @var{category}) +The @code{dcgettext} function adds another argument to those which +@code{dgettext} takes. This argument @var{category} specifies the last +piece of information needed to locate the message catalog. I.e., the +domain name and the locale category exactly specify which message +catalog has to be used (relative to a given directory, see below). + +The @code{dgettext} function can be expressed in terms of +@code{dcgettext} by using + +@smallexample +dcgettext (domain, string, LC_MESSAGES) +@end smallexample + +@noindent +instead of + +@smallexample +dgettext (domain, string) +@end smallexample + +This also shows which values are expected for the third parameter. One +has to use the selectors for the categories available in +@file{locale.h}. Normally the available values are @code{LC_CTYPE}, +@code{LC_COLLATE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, +@code{LC_NUMERIC}, and @code{LC_TIME}. Please note that @code{LC_ALL} +must not be used and even though the names might suggest this, there is +no relation to the environment variables of this name. + +The @code{dcgettext} function is only implemented for compatibility with +other systems which have @code{gettext} functions. There is not really +any situation where it is necessary (or useful) to use a value other +than @code{LC_MESSAGES} for the @var{category} parameter. We are +dealing with messages here and any other choice can only be irritating. + +As for @code{gettext} the return value type is @code{char *} which is an +anachronism. The returned string must never be modified. +@end deftypefun + +When using the three functions above in a program it is a frequent case +that the @var{msgid} argument is a constant string. So it is worth +optimizing this case.
Thinking briefly about this one will realize that +as long as no new message catalog is loaded the translation of a message +will not change. I.e., the algorithm to determine the translation is +deterministic. + +Exactly this is what the optimizations implemented in the +@file{libintl.h} header will use. Whenever a program is compiled with +the GNU C compiler, optimization is selected and the @var{msgid} +argument to @code{gettext}, @code{dgettext} or @code{dcgettext} is a +constant string the actual function call will only be done the first +time the message is used and after that only if a new message catalog +was loaded and so the result of the translation lookup might be +different. See the @file{libintl.h} header file for details. For the +user it is only important to know that the result is always the same, +independent of the compiler or compiler options in use. + + +@node Locating gettext catalog +@subsubsection How to determine which catalog to be used + +The functions to retrieve the translations for a given message have a +remarkably simple interface. But to still give the user of the program +the opportunity to select exactly the translation s/he wants and +also to give the programmer the possibility to influence the way +catalog files are located there is a quite complicated +underlying mechanism which controls all this. The code is complicated, +but its use is easy. + +Basically we have two different tasks to perform which can also be +performed by the @code{catgets} functions: + +@enumerate +@item +Locate the set of message catalogs. There are a number of files for +different languages which all belong to the package. Usually they +are all stored in the filesystem below a certain directory. + +There can be arbitrarily many packages installed and they can follow +different guidelines for the placement of their files.
+ +@item +Relative to the location specified by the package the actual translation +files must be searched, based on the wishes of the user. I.e., for each +language the user selects the program should be able to locate the +appropriate file. +@end enumerate + +This is the functionality required by the specifications for +@code{gettext} and this is also what the @code{catgets} functions are +able to do. But there are some problems unresolved: + +@itemize @bullet +@item +The language to be used can be specified in several different ways. +There is no generally accepted standard for this and the user always +expects the program to understand what s/he means. E.g., to select the +German translation one could write @code{de}, @code{german}, or +@code{deutsch} and the program should always react the same. + +@item +Sometimes the specification of the user is too detailed. If s/he, e.g., +specifies @code{de_DE.ISO-8859-1} which means German, spoken in Germany, +coded using the @w{ISO 8859-1} character set there is the possibility +that a message catalog matching this exactly is not available. But +there could be a catalog matching @code{de} and if the character set +used on the machine is always @w{ISO 8859-1} there is no reason why this +latter message catalog should not be used. (We call this @dfn{message +inheritance}.) + +@item +If a catalog for a wanted language is not available it is not always the +second best choice to fall back on the language of the developer and +simply not translate any message. Instead a user might be better able +to read the messages in another language and so the user of the program +should be able to define a precedence order of languages. +@end itemize + +We can divide the configuration actions into two parts: one is +performed by the programmer, the other by the user. We will start with +the functions the programmer can use since the user configuration will +be based on this.
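The idea of message inheritance mentioned in the list above can be
illustrated with a minimal sketch. The helper
@code{generalize_locale_name} is hypothetical and not part of
@file{libintl}; it only assumes names of the form
@code{language_TERRITORY.codeset} and is not meant to mirror the lookup
the library really performs.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of libintl.  Cut the most specific
   component off NAME in place: first a ".codeset" suffix, then a
   "_TERRITORY" suffix.  Return 1 if something was removed, 0 if NAME
   cannot be generalized any further.  */
int
generalize_locale_name (char *name)
{
  char *p;

  assert (name != NULL);

  /* Prefer dropping the codeset; only then drop the territory.  */
  p = strchr (name, '.');
  if (p == NULL)
    p = strchr (name, '_');
  if (p == NULL)
    return 0;

  *p = '\0';
  return 1;
}
```

Starting from @code{de_DE.ISO-8859-1}, successive calls leave
@code{de_DE} and then @code{de} in the buffer. A lookup could try a
catalog for each of these names in turn.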
+ +As the descriptions of the functions in the last sections already +mentioned, separate sets of messages can be selected by a @dfn{domain +name}. This is a +simple string which should be unique for each program part which uses a +separate domain. It is possible to use arbitrarily many domains in one +program at the same time. E.g., the GNU C Library itself uses a domain +named @code{libc} while the program using the C Library could use a +domain named @code{foo}. The important point is that at any time +exactly one domain is active. This is controlled with the following +function. + +@deftypefun {char *} textdomain (const char *@var{domainname}) +The @code{textdomain} function sets the default domain, which is used in +all future @code{gettext} calls, to @var{domainname}. Please note that +@code{dgettext} and @code{dcgettext} calls are not influenced if the +@var{domainname} parameter of these functions is not the null pointer. + +Before the first call to @code{textdomain} the default domain is +@code{messages}. This is the name specified in the specification of +the @code{gettext} API. This name is as good as any other name. No +program should ever really use a domain with this name since this can +only lead to problems. + +The function returns the value which is from now on taken as the default +domain. If the system ran out of memory the returned value is +@code{NULL} and the global variable @var{errno} is set to @code{ENOMEM}. +Despite the return value type being @code{char *} the returned string must +not be changed. It is allocated internally by the @code{textdomain} +function. + +If the @var{domainname} parameter is the null pointer no new default +domain is set. Instead the currently selected default domain is +returned. + +If the @var{domainname} parameter is the empty string the default domain +is reset to its initial value, the domain with the name @code{messages}. +Using this possibility is questionable since the domain @code{messages} +really never should be used.
+@end deftypefun + +@deftypefun {char *} bindtextdomain (const char *@var{domainname}, const char *@var{dirname}) +The @code{bindtextdomain} function can be used to specify the directory +which contains the message catalogs for domain @var{domainname} for the +different languages. To be correct, this is the directory where the +hierarchy of directories is expected. Details are explained below. + +For the programmer it is important to note that the translations which +come with the program have to be placed in a directory hierarchy starting +at, say, @file{/foo/bar}. Then the program should make a +@code{bindtextdomain} call to bind the domain for the current program to +this directory. This makes sure the catalogs are found. A correctly +running program does not depend on the user setting an environment +variable. + +The @code{bindtextdomain} function can be used several times and if the +@var{domainname} argument is different the previously bound domains +will not be overwritten. + +If the @var{dirname} parameter is the null pointer @code{bindtextdomain} +returns the currently selected directory for the domain with the name +@var{domainname}. + +The @code{bindtextdomain} function returns a pointer to a string +containing the name of the selected directory. The string is +allocated internally in the function and must not be changed by the +user. If the system ran out of memory during the execution of +@code{bindtextdomain} the return value is @code{NULL} and the global +variable @var{errno} is set accordingly. +@end deftypefun + + +@node Using gettextized software +@subsubsection User influence on @code{gettext} + +The last sections described what the programmer can do to +internationalize the messages of the program. But it is finally up to +the user to select the messages s/he wants to see. S/He must understand +them.
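Before turning to the user's side, the programmer's part described in
the last sections can be summarized in a small sketch. The wrapper name
@code{setup_translations} is an assumption made for this illustration;
only @code{textdomain} and @code{bindtextdomain} are real library
functions, and both are assumed to be declared in @file{libintl.h}.

```c
#include <assert.h>
#include <libintl.h>
#include <string.h>

/* Hypothetical wrapper: select DOMAIN as the default domain and bind it
   to the directory hierarchy starting at DIR.  Return 0 on success and
   -1 if either call reported a failure by returning NULL.  */
int
setup_translations (const char *domain, const char *dir)
{
  assert (domain != NULL && dir != NULL);

  if (textdomain (domain) == NULL)
    return -1;
  if (bindtextdomain (domain, dir) == NULL)
    return -1;
  return 0;
}
```

After a successful call, @code{textdomain (NULL)} reports the selected
domain and @code{bindtextdomain (domain, NULL)} reports the bound
directory, since a null pointer argument only queries the current
setting.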
+ +The POSIX locale model uses the environment variables @code{LC_COLLATE}, +@code{LC_CTYPE}, @code{LC_MESSAGES}, @code{LC_MONETARY}, @code{LC_NUMERIC}, +and @code{LC_TIME} to select the locale which is to be used. This way +the user can influence lots of functions. As we mentioned above the +@code{gettext} functions also take advantage of this. + +To understand how this happens it is necessary to take a look at the +various components of the filename which gets computed to locate a +message catalog. It is composed as follows: + +@smallexample +@var{dir_name}/@var{locale}/LC_@var{category}/@var{domain_name}.mo +@end smallexample + +The default value for @var{dir_name} is system specific. It is computed +from the value given as the prefix while configuring the C library. +This value normally is @file{/usr} or @file{/}. For the former the +complete @var{dir_name} is: + +@smallexample +/usr/share/locale +@end smallexample + +We can use @file{/usr/share} since the @file{.mo} files containing the +message catalogs are system independent, so all systems can use the same +files. If the program executed the @code{bindtextdomain} function for +the message domain that is currently handled the @var{dir_name} +component is exactly the value which was given to the function as +the second parameter. I.e., @code{bindtextdomain} allows overriding +the only system dependent and fixed value to make it possible to +address files anywhere in the filesystem. + +The @var{category} is the name of the locale category which was selected +in the program code. For @code{gettext} and @code{dgettext} this is +always @code{LC_MESSAGES}, for @code{dcgettext} this is selected by the +value of the third parameter. As said above one should avoid +ever using a category other than @code{LC_MESSAGES}. + +The @var{locale} component is computed based on the category used. Just +like for the @code{setlocale} function this is where the user's +selection comes into play.
Some environment variables are examined in a fixed order +and the first environment variable set determines the result of +the lookup process. In detail, for the category @code{LC_xxx} the +following variables are examined in this order: + +@table @code +@item LANGUAGE +@item LC_ALL +@item LC_xxx +@item LANG +@end table + +This looks very familiar. With the exception of the @code{LANGUAGE} +environment variable this is exactly the lookup order the +@code{setlocale} function uses. But why introduce the @code{LANGUAGE} +variable? + +The reason is that the syntax of the values these variables can have is +different from what is expected by the @code{setlocale} function. If we +set @code{LC_ALL} to a value following the extended syntax that +would mean the @code{setlocale} function would never be able to use the +value of this variable as well. An additional variable removes this +problem plus we can select the language independently of the locale +setting which sometimes is useful. + +While for the @code{LC_xxx} variables the value should consist of +exactly one specification of a locale the @code{LANGUAGE} variable's +value can consist of a colon separated list of locale names. The +attentive reader will realize that this is the way we manage to +implement one of our additional demands above: we want to be able to +specify an ordered list of languages. + +Back to the constructed filename we have only one component missing. +The @var{domain_name} part is the name which was either registered using +the @code{textdomain} function or which was given to @code{dgettext} or +@code{dcgettext} as the first parameter. Now it becomes obvious that a +good choice for the domain name in the program code is a string which is +closely related to the program/package name. E.g., for the GNU C +Library the domain name is @code{libc}.
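How the components fit together can be shown with a short sketch. The
function @code{catalog_path} is hypothetical and only illustrates the
composition described above; it is not the code the library uses. It
assumes that the caller passes the complete category directory name,
e.g. @code{LC_MESSAGES}, and a colon separated list as found in
@code{LANGUAGE}.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of libintl.  Build in BUF the catalog
   file name for entry number INDEX (counting from 0) of the colon
   separated list LANGUAGES.  Return 1 on success, 0 if the list has too
   few entries or BUF is too small.  */
int
catalog_path (char *buf, size_t size, const char *dir_name,
              const char *languages, size_t index,
              const char *category, const char *domain_name)
{
  const char *start = languages;
  const char *end;
  int n;

  assert (buf != NULL && dir_name != NULL && languages != NULL);
  assert (category != NULL && domain_name != NULL);

  /* Skip INDEX entries of the list.  */
  while (index-- > 0)
    {
      start = strchr (start, ':');
      if (start == NULL)
        return 0;
      ++start;
    }

  /* The entry ends at the next colon or at the end of the string.  */
  end = strchr (start, ':');
  if (end == NULL)
    end = start + strlen (start);

  n = snprintf (buf, size, "%s/%.*s/%s/%s.mo", dir_name,
                (int) (end - start), start, category, domain_name);
  return n >= 0 && (size_t) n < size;
}
```

With the list @code{de:fr}, index 0 selects @code{de} and index 1
selects @code{fr}, while any larger index makes the function return 0.
A caller could walk the list this way and use the first file name that
exists.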
+ +@noindent +A little piece of example code should show how the programmer is supposed +to work: + +@smallexample +@{ + textdomain ("test-package"); + bindtextdomain ("test-package", "/usr/local/share/locale"); + puts (gettext ("Hello, world!")); +@} +@end smallexample + +At the program start the default domain is @code{messages}. The +@code{textdomain} call changes this to @code{test-package}. The +@code{bindtextdomain} call specifies that the message catalogs for the +domain @code{test-package} can be found below the directory +@file{/usr/local/share/locale}. + +If the user now sets the variable @code{LANGUAGE} in her/his environment +to @code{de} the @code{gettext} function will try to use the +translations from the file + +@smallexample +/usr/local/share/locale/de/LC_MESSAGES/test-package.mo +@end smallexample + +From the above descriptions it should be clear which component of this +filename is determined by which source. + +@c Describe: +@c * message inheritance +@c * locale aliasing +@c * character set dependence + + +@node Helper programs for gettext +@subsection Programs to handle message catalogs for @code{gettext} + +@c Describe: +@c * msgfmt +@c * xgettext +@c Mention: +@c * other programs from GNU gettext