author | Zack Weinberg <zack@gcc.gnu.org> | 2003-07-05 00:24:00 +0000 |
---|---|---|
committer | Zack Weinberg <zack@gcc.gnu.org> | 2003-07-05 00:24:00 +0000 |
commit | e6cc3a24c237713413070f4a5dc35b55dc2715b8 (patch) | |
tree | 34c7734f7acee49beff2b3d99cbdf53576456697 /gcc/cppcharset.c | |
parent | 61aeb06fe596bd822b665d65a271804efdaf0053 (diff) | |
* cpplib.h (CPP_AT_NAME, CPP_OBJC_STRING): New token types.
(struct cpp_options): Add narrow_charset, wide_charset,
bytes_big_endian fields. Remove EBCDIC field.
(cpp_init_iconv, cpp_interpret_string): New external interfaces.
* cpphash.h: Include <iconv.h> if we have it, otherwise
provide a dummy definition of iconv_t.
(struct cpp_reader): Add narrow_cset_desc and wide_cset_desc fields.
(_cpp_valid_ucn): Update prototype.
(_cpp_destroy_iconv): New prototype.
* doc/cpp.texi: Document character set handling.
* doc/cppopts.texi: Document -fexec-charset= and -fwide-exec-charset=.
* doc/extend.texi: Delete entire section on multiline strings.
Rewrite section on __FUNCTION__ etc now that these are
variables in C.
* cppucnid.tab, cppucnid.pl: New files.
* cppucnid.h: New generated file.
* cppcharset.c: Include cppucnid.h. Lots of commentary added.
(iconv_open, iconv, iconv_close): Provide dummy definitions
if !HAVE_ICONV.
(SOURCE_CHARSET, struct strbuf, init_iconv_desc, cpp_init_iconv,
_cpp_destroy_iconv, convert_cset, width_to_mask, convert_ucn,
emit_numeric_escape, convert_hex, convert_oct, convert_escape,
cpp_interpret_string, narrow_str_to_charconst,
wide_str_to_charconst): New.
(ucn_valid_in_identifier): Use a binary search through the
ucnranges table defined in cppucnid.h, not a long chain of if
statements.
(_cpp_valid_ucn): Add a limit pointer. Downgrade "universal
character names are only valid in C++ and C99" to a warning.
Issue the "meaning of \[uU] is different in traditional C"
warning here. Take care not to let iconv see an invalid UCS
value if we get a malformed UCN. Issue an error if we don't
have iconv.
(cpp_interpret_charconst): Moved here from cpplex.c. Use
cpp_interpret_string to do the heavy lifting.
* cppinit.c (cpp_create_reader): Initialize bytes_big_endian,
narrow_charset, wide_charset fields of options structure.
(cpp_destroy): Call _cpp_destroy_iconv.
* cpplex.c (forms_identifier_p): Adjust call to _cpp_valid_ucn.
(maybe_read_ucn, hex_digit_value, cpp_parse_escape): Delete.
(cpp_interpret_charconst): Moved to cppcharset.c.
* cpplib.c (dequote_string): Delete.
(interpret_string_notranslate): New.
(do_line, do_linemarker): Use interpret_string_notranslate.
* Makefile.in (cppcharset.o): Depend on cppucnid.h.
* c-common.c (fname_string, combine_strings): Delete.
* c-common.h (fname_string, combine_strings): Delete prototypes.
* c-lex.c (ignore_escape_flag): Delete.
(cb_ident): Use cpp_interpret_string, not lex_string.
(get_nonpadding_token): New function.
(c_lex): Handle Objective-C @-prefixed identifiers and strings here.
Adjust calls to lex_string. Don't write *value twice.
(lex_string): Now handles string constant concatenation.
Most of the work handed off to cpp_interpret_string.
Call fix_string_type here.
* c-parse.in (STRING_FUNC_NAME, VAR_FUNC_NAME): Replace with
FUNC_NAME, throughout.
(OBJC_STRING): New token type.
(primary:STRING): No need to call fix_string_type here.
(primary:objc_string): Make that OBJC_STRING.
(objc_string nonterminal): Delete.
(yylexname): Delete code to handle fake string constants.
(yylexstring): Delete entirely.
(_yylex): Handle CPP_AT_NAME and CPP_OBJC_STRING. No need
to handle CPP_ATSIGN.
* c.opt (-fexec-charset=, -fwide-exec-charset=): New options.
* c-opts.c (missing_arg, c_common_handle_option): Handle
OPT_fexec_charset_ and OPT_fwide_exec_charset_.
(c_common_init): Set cpp_opts->bytes_big_endian, not
cpp_opts->EBCDIC. Call cpp_init_iconv.
(print_help): Document -fexec-charset= and -fwide-exec-charset=.
(TARGET_EBCDIC): Delete default definition.
* objc/objc-act.c (build_objc_string_object): No need to
handle string constant concatenation.
cp:
* parser.c (cp_lexer_read_token): No need to handle string
constant concatenation.
testsuite:
* gcc.c-torture/execute/wchar_t-1.x: New file; XFAIL wchar_t-1.c
everywhere.
* gcc.dg/concat.c: Concatenation of string constants with
__FUNCTION__ / __PRETTY_FUNCTION__ is now a hard error.
* gcc.dg/wtr-strcat-1.c: Loosen dg-warning regexp.
* gcc.dg/cpp/escape-2.c: Use wide character constants where
necessary to avoid multi-character character constant warning.
* gcc.dg/cpp/escape.c: Likewise.
* gcc.dg/cpp/ucs.c: Likewise.
Remove backslashes from dg-bogus comments, as they confuse Tcl.
Fix a typo.
libstdc++-v3:
* testsuite/22_locale/collate/compare/wchar_t/2.cc
* testsuite/22_locale/collate/compare/wchar_t/wrapped_env.cc
* testsuite/22_locale/collate/compare/wchar_t/wrapped_locale.cc
* testsuite/22_locale/collate/hash/wchar_t/2.cc
* testsuite/22_locale/collate/hash/wchar_t/wrapped_env.cc
* testsuite/22_locale/collate/hash/wchar_t/wrapped_locale.cc
* testsuite/22_locale/collate/transform/wchar_t/2.cc
* testsuite/22_locale/collate/transform/wchar_t/wrapped_env.cc
* testsuite/22_locale/collate/transform/wchar_t/wrapped_locale.cc:
XFAIL on all targets.
From-SVN: r68952
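
The ucn_valid_in_identifier entry above replaces a long chain of if statements with a binary search over a sorted range table. The idea can be sketched roughly as follows. This is a standalone illustration, not the code from the patch: the `ranges` table holds only a few sample ranges (the real table is generated into cppucnid.h), `find_range` and `classify` are hypothetical names, and the -pedantic and per-language checks are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Flag bits for a range; the names mirror the patch, the values are assumed.  */
enum { C99 = 1, CXX = 2, DIG = 4 };

struct ucnrange
{
  unsigned int lo, hi;    /* Inclusive bounds of the range.  */
  unsigned short flags;
};

/* Sample table only: sorted by LO, ranges do not overlap.  */
static const struct ucnrange ranges[] = {
  { 0x00c0, 0x00d6, C99 | CXX },
  { 0x00d8, 0x00f6, C99 | CXX },
  { 0x0660, 0x0669, C99 | DIG },
  { 0x4e00, 0x9fa5, C99 | CXX },
};

/* Return the range containing C, or NULL if there is none.  MN and MX
   bracket the candidate indices exclusively, so the loop stops once no
   index lies strictly between them.  */
static const struct ucnrange *
find_range (unsigned int c)
{
  int mn = -1;
  int mx = (int) (sizeof ranges / sizeof ranges[0]);

  while (mx - mn > 1)
    {
      int md = (mn + mx) / 2;
      if (c < ranges[md].lo)
        mx = md;
      else if (c > ranges[md].hi)
        mn = md;
      else
        return &ranges[md];
    }
  return NULL;
}

/* 0: not valid in an identifier; 1: valid; 2: valid except at the start.
   Simplification: digits always yield 2 here, whatever the language.  */
static int
classify (unsigned int c)
{
  const struct ucnrange *r = find_range (c);

  if (r == NULL)
    return 0;
  assert (r->lo <= c && c <= r->hi);    /* Sanity check on the search.  */
  return (r->flags & DIG) ? 2 : 1;
}
```

With this sample table, `classify (0x00c5)` finds the first range, `classify (0x00d7)` falls in the gap between the first two ranges and is rejected, and `classify (0x0661)` lands in a range flagged DIG.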
Diffstat (limited to 'gcc/cppcharset.c')
-rw-r--r-- | gcc/cppcharset.c | 1238 |
1 files changed, 771 insertions, 467 deletions
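
The comment on convert_ucn in the diff below tabulates the original six-row UTF-8 scheme and notes that only the shortest encoding of a value is valid. A rough standalone sketch of shortest-form encoding under that table follows. It is not a transcription of convert_ucn: `ucs_to_utf8` is a hypothetical name, the byte count is chosen directly from the value ranges in the table, and no surrogate check or iconv conversion is performed.

```c
#include <assert.h>
#include <stddef.h>

/* Encode UCS (at most 0x7FFFFFFF) into BUF using the shortest form
   allowed by the six-row table.  Returns the number of bytes written.  */
static size_t
ucs_to_utf8 (unsigned long ucs, unsigned char buf[6])
{
  /* Lead-byte prefix for each encoded length; index 0 is unused.  */
  static const unsigned char lead[7] =
    { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
  size_t nbytes, i;

  assert (ucs <= 0x7FFFFFFFUL);

  /* Pick the first table row whose value range contains UCS.  */
  if (ucs < 0x80)
    nbytes = 1;
  else if (ucs < 0x800)
    nbytes = 2;
  else if (ucs < 0x10000)
    nbytes = 3;
  else if (ucs < 0x200000)
    nbytes = 4;
  else if (ucs < 0x4000000)
    nbytes = 5;
  else
    nbytes = 6;

  /* Fill continuation bytes from the end: six payload bits each.  */
  for (i = nbytes - 1; i > 0; i--)
    {
      buf[i] = (unsigned char) (0x80 | (ucs & 0x3F));
      ucs >>= 6;
    }

  /* Whatever remains fits in the lead byte's payload bits.  */
  buf[0] = (unsigned char) (lead[nbytes] | ucs);
  return nbytes;
}
```

Under this sketch, 0x07C0 encodes as DF 80, which is the one valid form named in the comment's example, and 0x0800 encodes as E0 A0 80.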
diff --git a/gcc/cppcharset.c b/gcc/cppcharset.c index f506ba2..0ba7e93 100644 --- a/gcc/cppcharset.c +++ b/gcc/cppcharset.c @@ -24,8 +24,278 @@ Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. */ #include "tm.h" #include "cpplib.h" #include "cpphash.h" +#include "cppucnid.h" + +/* Character set handling for C-family languages. + + Terminological note: In what follows, "charset" or "character set" + will be taken to mean both an abstract set of characters and an + encoding for that set. + + The C99 standard discusses two character sets: source and execution. + The source character set is used for internal processing in translation + phases 1 through 4; the execution character set is used thereafter. + Both are required by 5.2.1.2p1 to be multibyte encodings, not wide + character encodings (see 3.7.2, 3.7.3 for the standardese meanings + of these terms). Furthermore, the "basic character set" (listed in + 5.2.1p3) is to be encoded in each with values one byte wide, and is + to appear in the initial shift state. + + It is not explicitly mentioned, but there is also a "wide execution + character set" used to encode wide character constants and wide + string literals; this is supposed to be the result of applying the + standard library function mbstowcs() to an equivalent narrow string + (6.4.5p5). However, the behavior of hexadecimal and octal + \-escapes is at odds with this; they are supposed to be translated + directly to wchar_t values (6.4.4.4p5,6). + + The source character set is not necessarily the character set used + to encode physical source files on disk; translation phase 1 converts + from whatever that encoding is to the source character set. + + The presence of universal character names in C99 (6.4.3 et seq.) + forces the source character set to be isomorphic to ISO 10646, + that is, Unicode. 
There is no such constraint on the execution + character set; note also that the conversion from source to + execution character set does not occur for identifiers (5.1.1.2p1#5). + + For convenience of implementation, the source character set's + encoding of the basic character set should be identical to the + execution character set OF THE HOST SYSTEM's encoding of the basic + character set, and it should not be a state-dependent encoding. + + cpplib uses UTF-8 or UTF-EBCDIC for the source character set, + depending on whether the host is based on ASCII or EBCDIC (see + respectively Unicode section 2.3/ISO10646 Amendment 2, and Unicode + Technical Report #16). It relies on the system library's iconv() + primitive to do charset conversion (specified in SUSv2). If this + primitive is not present, the source and execution character sets + must be identical and are limited to the basic ASCII or EBCDIC + range, and wide characters are implemented by padding narrow + characters to the size of wchar_t. */ + +#if !HAVE_ICONV +/* Make certain that the uses of iconv(), iconv_open(), iconv_close() + below, which are guarded only by if statements with compile-time + constant conditions, do not cause link errors. */ +#define iconv_open(x, y) (errno = EINVAL, (iconv_t)-1) +#define iconv(a,b,c,d,e) (errno = EINVAL, (iconv_t)-1) +#define iconv_close(x) 0 +#endif + +#if HOST_CHARSET == HOST_CHARSET_ASCII +#define SOURCE_CHARSET "UTF-8" +#elif HOST_CHARSET == HOST_CHARSET_EBCDIC +#define SOURCE_CHARSET "UTF-EBCDIC" +#else +#error "Unrecognized basic host character set" +#endif + +/* This structure is used for a resizable string buffer, mostly by + convert_cset and cpp_interpret_string. */ +struct strbuf +{ + uchar *text; + size_t asize; + size_t len; +}; + +/* This is enough to hold any string that fits on a single 80-column + line, even if iconv quadruples its size (e.g. conversion from + ASCII to UCS-4) rounded up to a power of two. 
*/ +#define OUTBUF_BLOCK_SIZE 256 + +/* Subroutine of cpp_init_iconv: initialize and return an iconv + descriptor for conversion from FROM to TO. If iconv_open() fails, + issue an error and return (iconv_t) -1. Silently return + (iconv_t) -1 if FROM and TO are identical. */ +static iconv_t +init_iconv_desc (cpp_reader *pfile, const char *to, const char *from) +{ + iconv_t dsc; + + if (!strcmp (to, from)) + return (iconv_t) -1; + + dsc = iconv_open (to, from); + if (dsc == (iconv_t) -1) + { + if (errno == EINVAL) + cpp_error (pfile, DL_ERROR, /* XXX should be DL_SORRY */ + "conversion from %s to %s not supported by iconv", + from, to); + else + cpp_errno (pfile, DL_ERROR, "iconv_open"); + } + return dsc; +} + +/* If charset conversion is requested, initialize iconv(3) descriptors + for conversion from the source character set to the execution + character sets. If iconv is not present in the C library, and + conversion is requested, issue an error. */ + +void +cpp_init_iconv (cpp_reader *pfile) +{ + const char *ncset = CPP_OPTION (pfile, narrow_charset); + const char *wcset = CPP_OPTION (pfile, wide_charset); + const char *default_wcset; + + bool be = CPP_OPTION (pfile, bytes_big_endian); + + if (CPP_OPTION (pfile, wchar_precision) >= 32) + default_wcset = be ? "UCS-4BE" : "UCS-4LE"; + else if (CPP_OPTION (pfile, wchar_precision) >= 16) + default_wcset = be ? "UCS-2BE" : "UCS-2LE"; + else + /* This effectively means that wide strings are not supported, + so don't do any conversion at all. 
*/ + default_wcset = SOURCE_CHARSET; + + if (!HAVE_ICONV) + { + if (ncset && strcmp (ncset, SOURCE_CHARSET)) + cpp_error (pfile, DL_ERROR, /* XXX should be DL_SORRY */ + "no iconv implementation, cannot convert to %s", ncset); + + if (wcset && strcmp (wcset, default_wcset)) + cpp_error (pfile, DL_ERROR, /* XXX should be DL_SORRY */ + "no iconv implementation, cannot convert to %s", wcset); + } + else + { + if (!ncset) + ncset = SOURCE_CHARSET; + if (!wcset) + wcset = default_wcset; + + pfile->narrow_cset_desc = init_iconv_desc (pfile, ncset, SOURCE_CHARSET); + pfile->wide_cset_desc = init_iconv_desc (pfile, wcset, SOURCE_CHARSET); + } +} + +void +_cpp_destroy_iconv (cpp_reader *pfile) +{ + if (HAVE_ICONV) + { + if (pfile->narrow_cset_desc != (iconv_t) -1) + iconv_close (pfile->narrow_cset_desc); + if (pfile->wide_cset_desc != (iconv_t) -1) + iconv_close (pfile->wide_cset_desc); + } +} + +/* iconv(3) utility wrapper. Convert the string FROM, of length FLEN, + according to the iconv descriptor CD. The result is appended to + the string buffer TO. If DESC is (iconv_t)-1 or iconv is not + available, the string is simply copied into TO. + + Returns true on success, false on error. */ + +static bool +convert_cset (iconv_t cd, const uchar *from, size_t flen, struct strbuf *to) +{ + if (!HAVE_ICONV || cd == (iconv_t)-1) + { + if (to->len + flen > to->asize) + { + to->asize = to->len + flen; + to->text = xrealloc (to->text, to->asize); + } + memcpy (to->text + to->len, from, flen); + to->len += flen; + return true; + } + else + { + char *inbuf, *outbuf; + size_t inbytesleft, outbytesleft; + + /* Reset conversion descriptor and check that it is valid. 
*/ + if (iconv (cd, 0, 0, 0, 0) == (size_t)-1) + return false; + + inbuf = (char *)from; + inbytesleft = flen; + outbuf = (char *)to->text + to->len; + outbytesleft = to->asize - to->len; + + for (;;) + { + iconv (cd, &inbuf, &inbytesleft, &outbuf, &outbytesleft); + if (__builtin_expect (inbytesleft == 0, 1)) + { + to->len = to->asize - outbytesleft; + return true; + } + if (errno != E2BIG) + return false; + + outbytesleft += OUTBUF_BLOCK_SIZE; + to->asize += OUTBUF_BLOCK_SIZE; + to->text = xrealloc (to->text, to->asize); + outbuf = (char *)to->text + to->asize - outbytesleft; + } + } +} + +/* Utility routine that computes a mask of the form 0000...111... with + WIDTH 1-bits. */ +static inline size_t +width_to_mask (size_t width) +{ + width = MIN (width, BITS_PER_CPPCHAR_T); + if (width >= CHAR_BIT * sizeof (size_t)) + return ~(size_t) 0; + else + return ((size_t) 1 << width) - 1; +} + + + +/* Returns 1 if C is valid in an identifier, 2 if C is valid except at + the start of an identifier, and 0 if C is not valid in an + identifier. We assume C has already gone through the checks of + _cpp_valid_ucn. The algorithm is a simple binary search on the + table defined in cppucnid.h. */ + +static int +ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c) +{ + int mn, mx, md; + + mn = -1; + mx = ARRAY_SIZE (ucnranges); + while (mx - mn > 1) + { + md = (mn + mx) / 2; + if (c < ucnranges[md].lo) + mx = md; + else if (c > ucnranges[md].hi) + mn = md; + else + goto found; + } + return 0; -static int ucn_valid_in_identifier (cpp_reader *, cppchar_t); + found: + /* When -pedantic, we require the character to have been listed by + the standard for the current language. Otherwise, we accept the + union of the acceptable sets for C++98 and C99. */ + if (CPP_PEDANTIC (pfile) + && ((CPP_OPTION (pfile, c99) && !(ucnranges[md].flags & C99)) + || (CPP_OPTION (pfile, cplusplus) + && !(ucnranges[md].flags & CXX)))) + return 0; + + /* In C99, UCN digits may not begin identifiers. 
*/ + if (CPP_OPTION (pfile, c99) && (ucnranges[md].flags & DIG)) + return 2; + + return 1; +} /* [lex.charset]: The character designated by the universal character name \UNNNNNNNN is that character whose character short name in @@ -52,20 +322,21 @@ static int ucn_valid_in_identifier (cpp_reader *, cppchar_t); */ cppchar_t -_cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr, int identifier_pos) +_cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr, + const uchar *limit, int identifier_pos) { cppchar_t result, c; unsigned int length; const uchar *str = *pstr; const uchar *base = str - 2; - /* Only attempt to interpret a UCS for C++ and C99. */ if (!CPP_OPTION (pfile, cplusplus) && !CPP_OPTION (pfile, c99)) - return 0; - - /* We don't accept UCNs for an EBCDIC target. */ - if (CPP_OPTION (pfile, EBCDIC)) - return 0; + cpp_error (pfile, DL_WARNING, + "universal character names are only valid in C++ and C99"); + else if (CPP_WTRADITIONAL (pfile) && identifier_pos == 0) + cpp_error (pfile, DL_WARNING, + "the meaning of '\\%c' is different in traditional C", + (int) str[-1]); if (str[-1] == 'u') length = 4; @@ -83,13 +354,16 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr, int identifier_pos) str++; result = (result << 4) + hex_value (c); } - while (--length); + while (--length && str < limit); *pstr = str; if (length) - /* We'll error when we try it out as the start of an identifier. */ - cpp_error (pfile, DL_ERROR, "incomplete universal character name %.*s", - (int) (str - base), base); + { + /* We'll error when we try it out as the start of an identifier. */ + cpp_error (pfile, DL_ERROR, "incomplete universal character name %.*s", + (int) (str - base), base); + result = 1; + } /* The standard permits $, @ and ` to be specified as UCNs. We use hex escapes so that this also works with EBCDIC hosts. 
*/ else if ((result < 0xa0 @@ -99,6 +373,7 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr, int identifier_pos) { cpp_error (pfile, DL_ERROR, "%.*s is not a valid universal character", (int) (str - base), base); + result = 1; } else if (identifier_pos) { @@ -113,6 +388,15 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr, int identifier_pos) "universal character %.*s is not valid at the start of an identifier", (int) (str - base), base); } + /* We don't accept UCNs if iconv is not available or will not + convert to the target wide character set. */ + else if (!HAVE_ICONV || pfile->wide_cset_desc == (iconv_t) -1) + { + /* XXX should be DL_SORRY */ + cpp_error (pfile, DL_ERROR, + "universal character names are not supported in this configuration"); + } + if (result == 0) result = 1; @@ -120,467 +404,487 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr, int identifier_pos) return result; } -/* Returns 1 if C is valid in an identifier, 2 if C is valid except at - the start of an identifier, and 0 if C is not valid in an - identifier. We assume C has already gone through the checks of - _cpp_valid_ucn. */ -static int -ucn_valid_in_identifier (cpp_reader *pfile, cppchar_t c) +/* Convert an UCN, pointed to by FROM, to UTF-8 encoding, then translate + it to the execution character set and write the result into TBUF. + An advanced pointer is returned. Issues all relevant diagnostics. + + UTF-8 encoding looks like this: + + value range encoded as + 00000000-0000007F 0xxxxxxx + 00000080-000007FF 110xxxxx 10xxxxxx + 00000800-0000FFFF 1110xxxx 10xxxxxx 10xxxxxx + 00010000-001FFFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx + 00200000-03FFFFFF 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + 04000000-7FFFFFFF 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx + + Values in the 0000D800 ... 0000DFFF range (surrogates) are invalid, + which means that three-byte sequences ED xx yy, with A0 <= xx <= BF, + never occur. 
Note also that any value that can be encoded by a + given row of the table can also be encoded by all successive rows, + but this is not done; only the shortest possible encoding for any + given value is valid. For instance, the character 07C0 could be + encoded as any of DF 80, E0 9F 80, F0 80 9F 80, F8 80 80 9F 80, or + FC 80 80 80 9F 80. Only the first is valid. */ + +static const uchar * +convert_ucn (cpp_reader *pfile, const uchar *from, const uchar *limit, + struct strbuf *tbuf, bool wide) { - /* None of the valid chars are outside the Basic Multilingual Plane (the - low 16 bits). */ - if (c > 0xffff) - return 0; + int nbytes; + uchar buf[6], *p = &buf[6]; + static const uchar masks[6] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC }; + cppchar_t ucn; + + from++; /* skip u/U */ + ucn = _cpp_valid_ucn (pfile, &from, limit, 0); + if (!ucn) + return from; + + nbytes = 1; + if (ucn < 0x80) + *--p = ucn; + else + { + do + { + *--p = ((ucn & 0x3F) | 0x80); + ucn >>= 6; + nbytes++; + } + while (ucn >= 0x3F || (ucn & masks[nbytes-1])); + *--p = (ucn | masks[nbytes-1]); + } + + if (!convert_cset (wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc, + p, nbytes, tbuf)) + cpp_errno (pfile, DL_ERROR, "converting UCN to execution character set"); + + return from; +} - if (CPP_OPTION (pfile, c99) || !CPP_PEDANTIC (pfile)) +static void +emit_numeric_escape (cpp_reader *pfile, cppchar_t n, + struct strbuf *tbuf, bool wide) +{ + if (wide) { - /* Latin. */ - if (c == 0x0aa || c == 0x00ba || c == 0x207f || c == 0x1e9b) - return 1; - - /* Greek. */ - if (c == 0x0386) - return 1; - - /* Cyrillic. */ - if (c == 0x040c) - return 1; - - /* Hebrew. */ - if ((c >= 0x05b0 && c <= 0x05b9) - || (c >= 0x05bb && c <= 0x005bd) - || c == 0x05bf - || (c >= 0x05c1 && c <= 0x05c2)) - return 1; - - /* Arabic. 
*/ - if ((c >= 0x06d0 && c <= 0x06dc) - || c == 0x06e8 - || (c >= 0x06ea && c <= 0x06ed)) - return 1; - - /* Devanagari */ - if ((c >= 0x0901 && c <= 0x0903) - || (c >= 0x093e && c <= 0x094d) - || (c >= 0x0950 && c <= 0x0952) - || c == 0x0963) - return 1; - - /* Bengali */ - if ((c >= 0x0981 && c <= 0x0983) - || (c >= 0x09be && c <= 0x09c4) - || (c >= 0x09c7 && c <= 0x09c8) - || (c >= 0x09cb && c <= 0x09cd) - || (c >= 0x09e2 && c <= 0x09e3)) - return 1; - - /* Gurmukhi */ - if (c == 0x0a02 - || (c >= 0x0a3e && c <= 0x0a42) - || (c >= 0x0a47 && c <= 0x0a48) - || (c >= 0x0a4b && c <= 0x0a4d) - || (c == 0x0a74)) - return 1; - - /* Gujarati */ - if ((c >= 0x0a81 && c <= 0x0a83) - || (c >= 0x0abd && c <= 0x0ac5) - || (c >= 0x0ac7 && c <= 0x0ac9) - || (c >= 0x0acb && c <= 0x0acd) - || (c == 0x0ad0)) - return 1; - - /* Oriya */ - if ((c >= 0x0b01 && c <= 0x0b03) - || (c >= 0x0b3e && c <= 0x0b43) - || (c >= 0x0b47 && c <= 0x0b48) - || (c >= 0x0b4b && c <= 0x0b4d)) - return 1; - - /* Tamil */ - if ((c >= 0x0b82 && c <= 0x0b83) - || (c >= 0x0bbe && c <= 0x0bc2) - || (c >= 0x0bc6 && c <= 0x0bc8) - || (c >= 0x0bc8 && c <= 0x0bcd)) - return 1; - - /* Telugu */ - if ((c >= 0x0c01 && c <= 0x0c03) - || (c >= 0x0c3e && c <= 0x0c44) - || (c >= 0x0c46 && c <= 0x0c48) - || (c >= 0x0c4a && c <= 0x0c4d)) - return 1; - - /* Kannada */ - if ((c >= 0x0c82 && c <= 0x0c83) - || (c >= 0x0cbe && c <= 0x0cc4) - || (c >= 0x0cc6 && c <= 0x0cc8) - || (c >= 0x0cca && c <= 0x0ccd) - || c == 0x0cde) - return 1; - - /* Malayalam */ - if ((c >= 0x0d02 && c <= 0x0d03) - || (c >= 0x0d3e && c <= 0x0d43) - || (c >= 0x0d46 && c <= 0x0d48) - || (c >= 0x0d4a && c <= 0x0d4d)) - return 1; - - /* Thai */ - if ((c >= 0x0e01 && c <= 0x0e3a) - || (c >= 0x0e40 && c <= 0x0e5b)) - return 1; - - /* Lao */ - if ((c >= 0x0ead && c <= 0x0eae) - || (c >= 0x0eb0 && c <= 0x0eb9) - || (c >= 0x0ebb && c <= 0x0ebd) - || (c >= 0x0ec0 && c <= 0x0ec4) - || c == 0x0ec6 - || (c >= 0x0ec8 && c <= 0x0ecd) - || (c >= 0x0edc && c <= 
0x0ed)) - return 1; - - /* Tibetan. */ - if (c == 0x0f00 - || (c >= 0x0f18 && c <= 0x0f19) - || c == 0x0f35 - || c == 0x0f37 - || c == 0x0f39 - || (c >= 0x0f3e && c <= 0x0f47) - || (c >= 0x0f49 && c <= 0x0f69) - || (c >= 0x0f71 && c <= 0x0f84) - || (c >= 0x0f86 && c <= 0x0f8b) - || (c >= 0x0f90 && c <= 0x0f95) - || c == 0x0f97 - || (c >= 0x0f99 && c <= 0x0fad) - || (c >= 0x0fb1 && c <= 0x0fb7) - || c == 0x0fb9) - return 1; - - /* Katakana */ - if ((c >= 0x30a1 && c <= 0x30f6) - || (c >= 0x30fb && c <= 0x30fc)) - return 1; - - /* CJK Unified Ideographs. */ - if (c >= 0x4e00 && c <= 0x9fa5) - return 1; - - /* Hangul. */ - if (c >= 0xac00 && c <= 0xd7a3) - return 1; - - /* Digits. */ - if ((c >= 0x0660 && c <= 0x0669) - || (c >= 0x06f0 && c <= 0x06f9) - || (c >= 0x0966 && c <= 0x096f) - || (c >= 0x09e6 && c <= 0x09ef) - || (c >= 0x0a66 && c <= 0x0a6f) - || (c >= 0x0ae6 && c <= 0x0aef) - || (c >= 0x0b66 && c <= 0x0b6f) - || (c >= 0x0be7 && c <= 0x0bef) - || (c >= 0x0c66 && c <= 0x0c6f) - || (c >= 0x0ce6 && c <= 0x0cef) - || (c >= 0x0d66 && c <= 0x0d6f) - || (c >= 0x0e50 && c <= 0x0e59) - || (c >= 0x0ed0 && c <= 0x0ed9) - || (c >= 0x0f20 && c <= 0x0f33)) - return 2; - - /* Special characters. */ - if (c == 0x00b5 - || c == 0x00b7 - || (c >= 0x02b0 && c <= 0x02b8) - || c == 0x02bb - || (c >= 0x02bd && c <= 0x02c1) - || (c >= 0x02d0 && c <= 0x02d1) - || (c >= 0x02e0 && c <= 0x02e4) - || c == 0x037a - || c == 0x0559 - || c == 0x093d - || c == 0x0b3d - || c == 0x1fbe - || (c >= 0x203f && c <= 0x2040) - || c == 0x2102 - || c == 0x2107 - || (c >= 0x210a && c <= 0x2113) - || c == 0x2115 - || (c >= 0x2118 && c <= 0x211d) - || c == 0x2124 - || c == 0x2126 - || c == 0x2128 - || (c >= 0x212a && c <= 0x2131) - || (c >= 0x2133 && c <= 0x2138) - || (c >= 0x2160 && c <= 0x2182) - || (c >= 0x3005 && c <= 0x3007) - || (c >= 0x3021 && c <= 0x3029)) - return 1; + /* We have to render this into the target byte order, which may not + be our byte order. 
*/ + bool bigend = CPP_OPTION (pfile, bytes_big_endian); + size_t width = CPP_OPTION (pfile, wchar_precision); + size_t cwidth = CPP_OPTION (pfile, char_precision); + size_t cmask = width_to_mask (cwidth); + size_t nbwc = width / cwidth; + size_t i; + size_t off = tbuf->len; + cppchar_t c; + + if (tbuf->len + nbwc > tbuf->asize) + { + tbuf->asize += OUTBUF_BLOCK_SIZE; + tbuf->text = xrealloc (tbuf->text, tbuf->asize); + } + + for (i = 0; i < nbwc; i++) + { + c = n & cmask; + n >>= cwidth; + tbuf->text[off + (bigend ? nbwc - i - 1 : i)] = c; + } + tbuf->len += nbwc; } - - if (CPP_OPTION (pfile, cplusplus) || !CPP_PEDANTIC (pfile)) + else { - /* Greek. */ - if (c == 0x0384) - return 1; - - /* Cyrillic. */ - if (c == 0x040d) - return 1; - - /* Hebrew. */ - if (c >= 0x05f3 && c <= 0x05f4) - return 1; - - /* Lao. */ - if ((c >= 0x0ead && c <= 0x0eb0) - || (c == 0x0eb2) - || (c == 0x0eb3) - || (c == 0x0ebd) - || (c >= 0x0ec0 && c <= 0x0ec4) - || (c == 0x0ec6)) - return 1; - - /* Hiragana */ - if (c == 0x3094 - || (c >= 0x309d && c <= 0x309e)) - return 1; - - /* Katakana */ - if ((c >= 0x30a1 && c <= 0x30fe)) - return 1; - - /* Hangul */ - if ((c >= 0x1100 && c <= 0x1159) - || (c >= 0x1161 && c <= 0x11a2) - || (c >= 0x11a8 && c <= 0x11f9)) - return 1; - - /* CJK Unified Ideographs */ - if ((c >= 0xf900 && c <= 0xfa2d) - || (c >= 0xfb1f && c <= 0xfb36) - || (c >= 0xfb38 && c <= 0xfb3c) - || (c == 0xfb3e) - || (c >= 0xfb40 && c <= 0xfb41) - || (c >= 0xfb42 && c <= 0xfb44) - || (c >= 0xfb46 && c <= 0xfbb1) - || (c >= 0xfbd3 && c <= 0xfd3f) - || (c >= 0xfd50 && c <= 0xfd8f) - || (c >= 0xfd92 && c <= 0xfdc7) - || (c >= 0xfdf0 && c <= 0xfdfb) - || (c >= 0xfe70 && c <= 0xfe72) - || (c == 0xfe74) - || (c >= 0xfe76 && c <= 0xfefc) - || (c >= 0xff21 && c <= 0xff3a) - || (c >= 0xff41 && c <= 0xff5a) - || (c >= 0xff66 && c <= 0xffbe) - || (c >= 0xffc2 && c <= 0xffc7) - || (c >= 0xffca && c <= 0xffcf) - || (c >= 0xffd2 && c <= 0xffd7) - || (c >= 0xffda && c <= 0xffdc) - || (c >= 
0x4e00 && c <= 0x9fa5)) - return 1; + if (tbuf->len + 1 > tbuf->asize) + { + tbuf->asize += OUTBUF_BLOCK_SIZE; + tbuf->text = xrealloc (tbuf->text, tbuf->asize); + } + tbuf->text[tbuf->len++] = n; } +} - /* Latin */ - if ((c >= 0x00c0 && c <= 0x00d6) - || (c >= 0x00d8 && c <= 0x00f6) - || (c >= 0x00f8 && c <= 0x01f5) - || (c >= 0x01fa && c <= 0x0217) - || (c >= 0x0250 && c <= 0x02a8) - || (c >= 0x1e00 && c <= 0x1e9a) - || (c >= 0x1ea0 && c <= 0x1ef9)) - return 1; - - /* Greek */ - if ((c >= 0x0388 && c <= 0x038a) - || (c == 0x038c) - || (c >= 0x038e && c <= 0x03a1) - || (c >= 0x03a3 && c <= 0x03ce) - || (c >= 0x03d0 && c <= 0x03d6) - || (c == 0x03da) - || (c == 0x03dc) - || (c == 0x03de) - || (c == 0x03e0) - || (c >= 0x03e2 && c <= 0x03f3) - || (c >= 0x1f00 && c <= 0x1f15) - || (c >= 0x1f18 && c <= 0x1f1d) - || (c >= 0x1f20 && c <= 0x1f45) - || (c >= 0x1f48 && c <= 0x1f4d) - || (c >= 0x1f50 && c <= 0x1f57) - || (c == 0x1f59) - || (c == 0x1f5b) - || (c == 0x1f5d) - || (c >= 0x1f5f && c <= 0x1f7d) - || (c >= 0x1f80 && c <= 0x1fb4) - || (c >= 0x1fb6 && c <= 0x1fbc) - || (c >= 0x1fc2 && c <= 0x1fc4) - || (c >= 0x1fc6 && c <= 0x1fcc) - || (c >= 0x1fd0 && c <= 0x1fd3) - || (c >= 0x1fd6 && c <= 0x1fdb) - || (c >= 0x1fe0 && c <= 0x1fec) - || (c >= 0x1ff2 && c <= 0x1ff4) - || (c >= 0x1ff6 && c <= 0x1ffc)) - return 1; - - /* Cyrillic */ - if ((c >= 0x0401 && c <= 0x040c) - || (c >= 0x040f && c <= 0x044f) - || (c >= 0x0451 && c <= 0x045c) - || (c >= 0x045e && c <= 0x0481) - || (c >= 0x0490 && c <= 0x04c4) - || (c >= 0x04c7 && c <= 0x04c8) - || (c >= 0x04cb && c <= 0x04cc) - || (c >= 0x04d0 && c <= 0x04eb) - || (c >= 0x04ee && c <= 0x04f5) - || (c >= 0x04f8 && c <= 0x04f9)) - return 1; - - /* Armenian */ - if ((c >= 0x0531 && c <= 0x0556) - || (c >= 0x0561 && c <= 0x0587)) - return 1; - - /* Hebrew */ - if ((c >= 0x05d0 && c <= 0x05ea) - || (c >= 0x05f0 && c <= 0x05f2)) - return 1; - - /* Arabic */ - if ((c >= 0x0621 && c <= 0x063a) - || (c >= 0x0640 && c <= 0x0652) - || (c >= 
0x0670 && c <= 0x06b7) - || (c >= 0x06ba && c <= 0x06be) - || (c >= 0x06c0 && c <= 0x06ce) - || (c >= 0x06e5 && c <= 0x06e7)) - return 1; - - /* Devanagari */ - if ((c >= 0x0905 && c <= 0x0939) - || (c >= 0x0958 && c <= 0x0962)) - return 1; - - /* Bengali */ - if ((c >= 0x0985 && c <= 0x098c) - || (c >= 0x098f && c <= 0x0990) - || (c >= 0x0993 && c <= 0x09a8) - || (c >= 0x09aa && c <= 0x09b0) - || (c == 0x09b2) - || (c >= 0x09b6 && c <= 0x09b9) - || (c >= 0x09dc && c <= 0x09dd) - || (c >= 0x09df && c <= 0x09e1) - || (c >= 0x09f0 && c <= 0x09f1)) - return 1; - - /* Gurmukhi */ - if ((c >= 0x0a05 && c <= 0x0a0a) - || (c >= 0x0a0f && c <= 0x0a10) - || (c >= 0x0a13 && c <= 0x0a28) - || (c >= 0x0a2a && c <= 0x0a30) - || (c >= 0x0a32 && c <= 0x0a33) - || (c >= 0x0a35 && c <= 0x0a36) - || (c >= 0x0a38 && c <= 0x0a39) - || (c >= 0x0a59 && c <= 0x0a5c) - || (c == 0x0a5e)) - return 1; - - /* Gujarati */ - if ((c >= 0x0a85 && c <= 0x0a8b) - || (c == 0x0a8d) - || (c >= 0x0a8f && c <= 0x0a91) - || (c >= 0x0a93 && c <= 0x0aa8) - || (c >= 0x0aaa && c <= 0x0ab0) - || (c >= 0x0ab2 && c <= 0x0ab3) - || (c >= 0x0ab5 && c <= 0x0ab9) - || (c == 0x0ae0)) - return 1; - - /* Oriya */ - if ((c >= 0x0b05 && c <= 0x0b0c) - || (c >= 0x0b0f && c <= 0x0b10) - || (c >= 0x0b13 && c <= 0x0b28) - || (c >= 0x0b2a && c <= 0x0b30) - || (c >= 0x0b32 && c <= 0x0b33) - || (c >= 0x0b36 && c <= 0x0b39) - || (c >= 0x0b5c && c <= 0x0b5d) - || (c >= 0x0b5f && c <= 0x0b61)) - return 1; - - /* Tamil */ - if ((c >= 0x0b85 && c <= 0x0b8a) - || (c >= 0x0b8e && c <= 0x0b90) - || (c >= 0x0b92 && c <= 0x0b95) - || (c >= 0x0b99 && c <= 0x0b9a) - || (c == 0x0b9c) - || (c >= 0x0b9e && c <= 0x0b9f) - || (c >= 0x0ba3 && c <= 0x0ba4) - || (c >= 0x0ba8 && c <= 0x0baa) - || (c >= 0x0bae && c <= 0x0bb5) - || (c >= 0x0bb7 && c <= 0x0bb9)) - return 1; - - /* Telugu */ - if ((c >= 0x0c05 && c <= 0x0c0c) - || (c >= 0x0c0e && c <= 0x0c10) - || (c >= 0x0c12 && c <= 0x0c28) - || (c >= 0x0c2a && c <= 0x0c33) - || (c >= 0x0c35 && c <= 
0x0c39)
-      || (c >= 0x0c60 && c <= 0x0c61))
-    return 1;
-
-  /* Kannada */
-  if ((c >= 0x0c85 && c <= 0x0c8c)
-      || (c >= 0x0c8e && c <= 0x0c90)
-      || (c >= 0x0c92 && c <= 0x0ca8)
-      || (c >= 0x0caa && c <= 0x0cb3)
-      || (c >= 0x0cb5 && c <= 0x0cb9)
-      || (c >= 0x0ce0 && c <= 0x0ce1))
-    return 1;
-
-  /* Malayalam */
-  if ((c >= 0x0d05 && c <= 0x0d0c)
-      || (c >= 0x0d0e && c <= 0x0d10)
-      || (c >= 0x0d12 && c <= 0x0d28)
-      || (c >= 0x0d2a && c <= 0x0d39)
-      || (c >= 0x0d60 && c <= 0x0d61))
-    return 1;
-
-  /* Thai */
-  if ((c >= 0x0e01 && c <= 0x0e30)
-      || (c >= 0x0e32 && c <= 0x0e33)
-      || (c >= 0x0e40 && c <= 0x0e46)
-      || (c >= 0x0e4f && c <= 0x0e5b))
-    return 1;
-
-  /* Lao */
-  if ((c >= 0x0e81 && c <= 0x0e82)
-      || (c == 0x0e84)
-      || (c == 0x0e87)
-      || (c == 0x0e88)
-      || (c == 0x0e8a)
-      || (c == 0x0e8d)
-      || (c >= 0x0e94 && c <= 0x0e97)
-      || (c >= 0x0e99 && c <= 0x0e9f)
-      || (c >= 0x0ea1 && c <= 0x0ea3)
-      || (c == 0x0ea5)
-      || (c == 0x0ea7)
-      || (c == 0x0eaa)
-      || (c == 0x0eab))
-    return 1;
-
-  /* Georgian */
-  if ((c >= 0x10a0 && c <= 0x10c5)
-      || (c >= 0x10d0 && c <= 0x10f6))
-    return 1;
-
-  /* Hiragana */
-  if ((c >= 0x3041 && c <= 0x3093)
-      || (c >= 0x309b && c <= 0x309c))
-    return 1;
-
-  /* Bopmofo */
-  if ((c >= 0x3105 && c <= 0x312c))
-    return 1;
+/* Convert a hexadecimal escape, pointed to by FROM, to the execution
+   character set and write it into the string buffer TBUF. Returns an
+   advanced pointer, and issues diagnostics as necessary.
+   No character set translation occurs; this routine always produces the
+   execution-set character with numeric value equal to the given hex
+   number. You can, e.g. generate surrogate pairs this way. */
+static const uchar *
+convert_hex (cpp_reader *pfile, const uchar *from, const uchar *limit,
+             struct strbuf *tbuf, bool wide)
+{
+  cppchar_t c, n = 0, overflow = 0;
+  int digits_found = 0;
+  size_t width = (wide ? CPP_OPTION (pfile, wchar_precision)
+                  : CPP_OPTION (pfile, char_precision));
+  size_t mask = width_to_mask (width);
+
+  if (CPP_WTRADITIONAL (pfile))
+    cpp_error (pfile, DL_WARNING,
+               "the meaning of '\\x' is different in traditional C");
+
+  from++; /* skip 'x' */
+  while (from < limit)
+    {
+      c = *from;
+      if (! hex_p (c))
+        break;
+      from++;
+      overflow |= n ^ (n << 4 >> 4);
+      n = (n << 4) + hex_value (c);
+      digits_found = 1;
+    }

-  return 0;
+  if (!digits_found)
+    {
+      cpp_error (pfile, DL_ERROR,
+                 "\\x used with no following hex digits");
+      return from;
+    }
+
+  if (overflow | (n != (n & mask)))
+    {
+      cpp_error (pfile, DL_PEDWARN,
+                 "hex escape sequence out of range");
+      n &= mask;
+    }
+
+  emit_numeric_escape (pfile, n, tbuf, wide);
+
+  return from;
+}
+
+/* Convert an octal escape, pointed to by FROM, to the execution
+   character set and write it into the string buffer TBUF. Returns an
+   advanced pointer, and issues diagnostics as necessary.
+   No character set translation occurs; this routine always produces the
+   execution-set character with numeric value equal to the given octal
+   number. */
+static const uchar *
+convert_oct (cpp_reader *pfile, const uchar *from, const uchar *limit,
+             struct strbuf *tbuf, bool wide)
+{
+  size_t count = 0;
+  cppchar_t c, n = 0;
+  size_t width = (wide ? CPP_OPTION (pfile, wchar_precision)
+                  : CPP_OPTION (pfile, char_precision));
+  size_t mask = width_to_mask (width);
+  bool overflow = false;
+
+  while (from < limit && count++ < 3)
+    {
+      c = *from;
+      if (c < '0' || c > '7')
+        break;
+      from++;
+      overflow |= n ^ (n << 3 >> 3);
+      n = (n << 3) + c - '0';
+    }
+
+  if (n != (n & mask))
+    {
+      cpp_error (pfile, DL_PEDWARN,
+                 "octal escape sequence out of range");
+      n &= mask;
+    }
+
+  emit_numeric_escape (pfile, n, tbuf, wide);
+
+  return from;
+}
+
+/* Convert an escape sequence (pointed to by FROM) to its value on
+   the target, and to the execution character set. Do not scan past
+   LIMIT. Write the converted value into TBUF. Returns an advanced
+   pointer. Handles all relevant diagnostics. */
+static const uchar *
+convert_escape (cpp_reader *pfile, const uchar *from, const uchar *limit,
+                struct strbuf *tbuf, bool wide)
+{
+  /* Values of \a \b \e \f \n \r \t \v respectively. */
+#if HOST_CHARSET == HOST_CHARSET_ASCII
+  static const uchar charconsts[] = { 7, 8, 27, 12, 10, 13, 9, 11 };
+#elif HOST_CHARSET == HOST_CHARSET_EBCDIC
+  static const uchar charconsts[] = { 47, 22, 39, 12, 21, 13, 5, 11 };
+#else
+#error "unknown host character set"
+#endif
+
+  uchar c;
+
+  c = *from;
+  switch (c)
+    {
+      /* UCNs, hex escapes, and octal escapes are processed separately. */
+    case 'u': case 'U':
+      return convert_ucn (pfile, from, limit, tbuf, wide);
+
+    case 'x':
+      return convert_hex (pfile, from, limit, tbuf, wide);
+      break;
+
+    case '0': case '1': case '2': case '3':
+    case '4': case '5': case '6': case '7':
+      return convert_oct (pfile, from, limit, tbuf, wide);
+
+      /* Various letter escapes. Get the appropriate host-charset
+         value into C. */
+    case '\\': case '\'': case '"': case '?': break;
+
+    case '(': case '{': case '[': case '%':
+      /* '\(', etc, can be used at the beginning of a line in a long
+         string split onto multiple lines with \-newline, to prevent
+         Emacs or other text editors from getting confused. '\%' can
+         be used to prevent SCCS from mangling printf format strings. */
+      if (CPP_PEDANTIC (pfile))
+        goto unknown;
+      break;
+
+    case 'b': c = charconsts[1]; break;
+    case 'f': c = charconsts[3]; break;
+    case 'n': c = charconsts[4]; break;
+    case 'r': c = charconsts[5]; break;
+    case 't': c = charconsts[6]; break;
+    case 'v': c = charconsts[7]; break;
+
+    case 'a':
+      if (CPP_WTRADITIONAL (pfile))
+        cpp_error (pfile, DL_WARNING,
+                   "the meaning of '\\a' is different in traditional C");
+      c = charconsts[0];
+      break;
+
+    case 'e': case 'E':
+      if (CPP_PEDANTIC (pfile))
+        cpp_error (pfile, DL_PEDWARN,
+                   "non-ISO-standard escape sequence, '\\%c'", (int) c);
+      c = charconsts[2];
+      break;
+
+    default:
+    unknown:
+      if (ISGRAPH (c))
+        cpp_error (pfile, DL_PEDWARN,
+                   "unknown escape sequence '\\%c'", (int) c);
+      else
+        cpp_error (pfile, DL_PEDWARN,
+                   "unknown escape sequence: '\\%03o'", (int) c);
+    }
+
+  /* Now convert what we have to the execution character set. */
+  if (!convert_cset (wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc,
+                     &c, 1, tbuf))
+    cpp_errno (pfile, DL_ERROR,
+               "converting escape sequence to execution character set");
+
+  return from + 1;
+}
+
+/* FROM is an array of cpp_string structures of length COUNT. These
+   are to be converted from the source to the execution character set,
+   escape sequences translated, and finally all are to be
+   concatenated. WIDE indicates whether or not to produce a wide
+   string. The result is written into TO. Returns true for success,
+   false for failure. */
+bool
+cpp_interpret_string (cpp_reader *pfile, const cpp_string *from, size_t count,
+                      cpp_string *to, bool wide)
+{
+  struct strbuf tbuf;
+  const uchar *p, *base, *limit;
+  size_t i;
+  iconv_t cd = wide ? pfile->wide_cset_desc : pfile->narrow_cset_desc;
+
+  tbuf.asize = MAX (OUTBUF_BLOCK_SIZE, from->len);
+  tbuf.text = xmalloc (tbuf.asize);
+  tbuf.len = 0;
+
+  for (i = 0; i < count; i++)
+    {
+      p = from[i].text;
+      if (*p == 'L') p++;
+      p++; /* skip leading quote */
+      limit = from[i].text + from[i].len - 1; /* skip trailing quote */
+
+      for (;;)
+        {
+          base = p;
+          while (p < limit && *p != '\\')
+            p++;
+          if (p > base)
+            {
+              /* We have a run of normal characters; these can be fed
+                 directly to convert_cset. */
+              if (!convert_cset (cd, base, p - base, &tbuf))
+                goto fail;
+            }
+          if (p == limit)
+            break;
+
+          p = convert_escape (pfile, p + 1, limit, &tbuf, wide);
+        }
+    }
+  /* NUL-terminate the 'to' buffer and translate it to a cpp_string
+     structure. */
+  emit_numeric_escape (pfile, 0, &tbuf, wide);
+  tbuf.text = xrealloc (tbuf.text, tbuf.len);
+  to->text = tbuf.text;
+  to->len = tbuf.len;
+  return true;
+
+ fail:
+  cpp_errno (pfile, DL_ERROR, "converting to execution character set");
+  free (tbuf.text);
+  return false;
+}
+
+/* Subroutine of cpp_interpret_charconst which performs the conversion
+   to a number, for narrow strings. STR is the string structure returned
+   by cpp_interpret_string. PCHARS_SEEN and UNSIGNEDP are as for
+   cpp_interpret_charconst. */
+static cppchar_t
+narrow_str_to_charconst (cpp_reader *pfile, cpp_string str,
+                         unsigned int *pchars_seen, int *unsignedp)
+{
+  size_t width = CPP_OPTION (pfile, char_precision);
+  size_t max_chars = CPP_OPTION (pfile, int_precision) / width;
+  size_t mask = width_to_mask (width);
+  size_t i;
+  cppchar_t result, c;
+  bool unsigned_p;
+
+  /* The value of a multi-character character constant, or a
+     single-character character constant whose representation in the
+     execution character set is more than one byte long, is
+     implementation defined. This implementation defines it to be the
+     number formed by interpreting the byte sequence in memory as a
+     big-endian binary number. If overflow occurs, the high bytes are
+     lost, and a warning is issued.
+
+     We don't want to process the NUL terminator handed back by
+     cpp_interpret_string. */
+  result = 0;
+  for (i = 0; i < str.len - 1; i++)
+    {
+      c = str.text[i] & mask;
+      if (width < BITS_PER_CPPCHAR_T)
+        result = (result << width) | c;
+      else
+        result = c;
+    }
+
+  if (i > max_chars)
+    {
+      i = max_chars;
+      cpp_error (pfile, DL_WARNING, "character constant too long for its type");
+    }
+  else if (i > 1 && CPP_OPTION (pfile, warn_multichar))
+    cpp_error (pfile, DL_WARNING, "multi-character character constant");
+
+  /* Multichar constants are of type int and therefore signed. */
+  if (i > 1)
+    unsigned_p = 0;
+  else
+    unsigned_p = CPP_OPTION (pfile, unsigned_char);
+
+  /* Truncate the constant to its natural width, and simultaneously
+     sign- or zero-extend to the full width of cppchar_t.
+     For single-character constants, the value is WIDTH bits wide.
+     For multi-character constants, the value is INT_PRECISION bits wide. */
+  if (i > 1)
+    width = CPP_OPTION (pfile, int_precision);
+  if (width < BITS_PER_CPPCHAR_T)
+    {
+      mask = ((cppchar_t) 1 << width) - 1;
+      if (unsigned_p || !(result & (1 << (width - 1))))
+        result &= mask;
+      else
+        result |= ~mask;
+    }
+  *pchars_seen = i;
+  *unsignedp = unsigned_p;
+  return result;
+}
+
+/* Subroutine of cpp_interpret_charconst which performs the conversion
+   to a number, for wide strings. STR is the string structure returned
+   by cpp_interpret_string. PCHARS_SEEN and UNSIGNEDP are as for
+   cpp_interpret_charconst. */
+static cppchar_t
+wide_str_to_charconst (cpp_reader *pfile, cpp_string str,
+                       unsigned int *pchars_seen, int *unsignedp)
+{
+  bool bigend = CPP_OPTION (pfile, bytes_big_endian);
+  size_t width = CPP_OPTION (pfile, wchar_precision);
+  size_t cwidth = CPP_OPTION (pfile, char_precision);
+  size_t mask = width_to_mask (width);
+  size_t cmask = width_to_mask (cwidth);
+  size_t nbwc = width / cwidth;
+  size_t off, i;
+  cppchar_t result = 0, c;
+
+  /* This is finicky because the string is in the target's byte order,
+     which may not be our byte order. Only the last character, ignoring
+     the NUL terminator, is relevant. */
+  off = str.len - (nbwc * 2);
+  result = 0;
+  for (i = 0; i < nbwc; i++)
+    {
+      c = bigend ? str.text[off + i] : str.text[off + nbwc - i - 1];
+      result = (result << cwidth) | (c & cmask);
+    }
+
+  /* Wide character constants have type wchar_t, and a single
+     character exactly fills a wchar_t, so a multi-character wide
+     character constant is guaranteed to overflow. */
+  if (off > 0)
+    cpp_error (pfile, DL_WARNING, "character constant too long for its type");
+
+  /* Truncate the constant to its natural width, and simultaneously
+     sign- or zero-extend to the full width of cppchar_t. */
+  if (width < BITS_PER_CPPCHAR_T)
+    {
+      if (CPP_OPTION (pfile, unsigned_wchar) || !(result & (1 << (width - 1))))
+        result &= mask;
+      else
+        result |= ~mask;
+    }
+
+  *unsignedp = CPP_OPTION (pfile, unsigned_wchar);
+  *pchars_seen = 1;
+  return result;
+}
+
+/* Interpret a (possibly wide) character constant in TOKEN.
+   PCHARS_SEEN points to a variable that is filled in with the number
+   of characters seen, and UNSIGNEDP to a variable that indicates
+   whether the result has signed type. */
+cppchar_t
+cpp_interpret_charconst (cpp_reader *pfile, const cpp_token *token,
+                         unsigned int *pchars_seen, int *unsignedp)
+{
+  cpp_string str = { 0, 0 };
+  bool wide = (token->type == CPP_WCHAR);
+  cppchar_t result;
+
+  /* an empty constant will appear as L'' or '' */
+  if (token->val.str.len == (size_t) (2 + wide))
+    {
+      cpp_error (pfile, DL_ERROR, "empty character constant");
+      return 0;
+    }
+  else if (!cpp_interpret_string (pfile, &token->val.str, 1, &str, wide))
+    return 0;
+
+  if (wide)
+    result = wide_str_to_charconst (pfile, str, pchars_seen, unsignedp);
+  else
+    result = narrow_str_to_charconst (pfile, str, pchars_seen, unsignedp);
+
+  if (str.text != token->val.str.text)
+    free ((void *)str.text);
+
+  return result;
 } |
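
The comment in narrow_str_to_charconst defines the value of a multi-character constant as the execution-charset bytes read as a big-endian number, then truncated and sign- or zero-extended. The following is a minimal standalone sketch of that arithmetic, not the cpplib code itself. The names `charval_t` and `pack_charconst` are hypothetical, the precisions are passed as plain arguments instead of being read through CPP_OPTION, and the "too long for its type" and multichar warnings are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for cppchar_t, fixed at 64 bits for this sketch.  */
typedef uint64_t charval_t;

/* Pack LEN already-converted bytes into one value, first byte most
   significant, then truncate to the constant's natural width and sign-
   or zero-extend to the full width of charval_t.
   Assumes 0 < char_bits < 64 and 0 < int_bits <= 64.  */
static charval_t
pack_charconst (const unsigned char *bytes, size_t len,
                unsigned char_bits, unsigned int_bits, int char_is_unsigned)
{
  assert (char_bits > 0 && char_bits < 64);
  assert (int_bits > 0 && int_bits <= 64);

  charval_t cmask = ((charval_t) 1 << char_bits) - 1;
  charval_t result = 0;
  size_t i;

  /* Big-endian accumulation; high bytes fall off the top on overflow.  */
  for (i = 0; i < len; i++)
    result = (result << char_bits) | (bytes[i] & cmask);

  /* A multi-character constant has type int and is therefore signed;
     a single character follows the signedness of plain char.  */
  unsigned width = len > 1 ? int_bits : char_bits;
  int is_unsigned = len > 1 ? 0 : char_is_unsigned;

  if (width < 64)
    {
      charval_t mask = ((charval_t) 1 << width) - 1;
      if (is_unsigned || !(result & ((charval_t) 1 << (width - 1))))
        result &= mask;   /* zero-extend */
      else
        result |= ~mask;  /* sign-extend */
    }
  return result;
}
```

With 8-bit chars and a 32-bit int, the bytes 0x61 0x62 (ASCII 'ab') pack to 0x6162. A single byte 0xff yields 0xff when char is unsigned and an all-ones value (that is, -1) when char is signed.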