path: root/clang/lib/Lex/Lexer.cpp
Age | Commit message | Author | Files | Lines
2022-12-16 | [clang] silence unused variable warning | Krasimir Georgiev | 1 | -0/+2
No functional changes intended.
2022-12-16 | [Clang] Allow additional mathematical symbols in identifiers. | Corentin Jabot | 1 | -27/+84
Implement the proposed UAX profile "Mathematical notation profile for default identifiers". This implements a not-yet-approved Unicode proposal for a vetted UAX31 identifier profile: https://www.unicode.org/L2/L2022/22230-math-profile.pdf This change mitigates the reported disruption caused by the implementation of UAX31 in C++ and C2x, as these mathematical symbols are commonly used in the scientific community. Fixes #54732 Reviewed By: tahonermann, #clang-language-wg Differential Revision: https://reviews.llvm.org/D137051
2022-12-16 | [Support] llvm::Optional => std::optional | Fangrui Song | 1 | -2/+2
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2022-12-15 | [clang/Lexer] Enhance `Lexer::getImmediateMacroNameForDiagnostics` to return a result from non-file buffers | Argyrios Kyrtzidis | 1 | -3/+5
Use `SourceManager::isWrittenInScratchSpace()` to specifically check for token paste or stringization, instead of excluding all non-file buffers. This allows diagnostics to mention macro names that were defined from the command-line. Differential Revision: https://reviews.llvm.org/D140164
2022-12-13 | [Clang] Implement CWG2640 Allow more characters in an n-char sequence | Corentin Jabot | 1 | -33/+42
Reviewed By: #clang-language-wg, aaron.ballman, tahonermann Differential Revision: https://reviews.llvm.org/D138861
2022-12-10 | Don't include None.h (NFC) | Kazu Hirata | 1 | -1/+0
I've converted all known uses of None to std::nullopt, so we no longer need to include None.h. This is part of an effort to migrate from llvm::Optional to std::optional: https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
2022-12-03 | [clang] Use std::nullopt instead of None (NFC) | Kazu Hirata | 1 | -11/+11
This patch mechanically replaces None with std::nullopt where the compiler would warn if None were deprecated. The intent is to reduce the amount of manual work required in migrating from Optional to std::optional. This is part of an effort to migrate from llvm::Optional to std::optional: https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
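To make the migration concrete, here is a minimal sketch of what the `None` to `std::nullopt` replacement and the member renames named in the Discourse thread look like in ordinary C++17 code. The functions `findMacroName` and `describe` are hypothetical helpers invented for illustration; they are not part of Lexer.cpp.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper (not from Lexer.cpp): returns a macro name if one is
// known, or an empty optional otherwise.
std::optional<std::string> findMacroName(bool Known) {
  if (!Known)
    return std::nullopt; // previously spelled `return None;`
  return std::string("MY_MACRO");
}

// Shows the std::optional spellings of the deprecated llvm::Optional members.
std::string describe(const std::optional<std::string> &Name) {
  // llvm::Optional::hasValue()   -> std::optional::has_value()
  // llvm::Optional::getValue()   -> std::optional::value()
  // llvm::Optional::getValueOr() -> std::optional::value_or()
  if (Name.has_value())
    return "macro " + Name.value();
  return Name.value_or("<no macro>");
}
```

Calling `describe(findMacroName(false))` takes the `value_or` path, which is the usual replacement for `getValueOr` when a default is wanted.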
2022-11-18 | [clang] Fix assert message | serge-sans-paille | 1 | -2/+2
2022-11-16 | [Lexer] Speedup LexTokenInternal | serge-sans-paille | 1 | -4/+7
Only reset "NeedsCleaning" flag in case of re-entrant call. Do not needlessly blank IdentifierInfo. This information will be set once the token type is picked. This yields a nice 1% speedup when pre-processing sqlite amalgamation through: valgrind --tool=callgrind ./bin/clang -E sqlite3.c -o/dev/null Differential Revision: https://reviews.llvm.org/D137960
2022-09-07 | [Lex/DependencyDirectivesScanner] Keep track of the presence of tokens between the last scanned directive and EOF | Argyrios Kyrtzidis | 1 | -0/+3
Directive `dependency_directives_scan::tokens_present_before_eof` is introduced to indicate that there were tokens present between the last scanned dependency directive and EOF. This is useful to ensure we correctly identify the macro guards when lexing using the dependency directives. Differential Revision: https://reviews.llvm.org/D133357
2022-08-08 | [clang] LLVM_FALLTHROUGH => [[fallthrough]]. NFC | Fangrui Song | 1 | -3/+3
With C++17 there is no Clang pedantic warning or MSVC C5051. Reviewed By: aaron.ballman Differential Revision: https://reviews.llvm.org/D131346
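As an illustration of the attribute this commit switches to, the following sketch shows a deliberate fall-through annotated with the standard C++17 `[[fallthrough]]`. The function `stepsFor` is a hypothetical example and does not come from Lexer.cpp.

```cpp
#include <cassert>

// Hypothetical example: each level also performs the work of the levels
// below it, so the fall-through between cases is intentional.
int stepsFor(int Level) {
  int Steps = 0;
  switch (Level) {
  case 2:
    ++Steps;
    [[fallthrough]]; // previously spelled LLVM_FALLTHROUGH;
  case 1:
    ++Steps;
    [[fallthrough]];
  case 0:
    ++Steps;
    break;
  default:
    break;
  }
  return Steps;
}
```

The attribute must appear as its own statement immediately before the next `case` label; it documents intent and silences implicit fall-through warnings without changing behavior.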
2022-08-01 | Fixed a number of typos | Gabriel Ravier | 1 | -1/+1
I went over the output of the following mess of a command: (ulimit -m 2000000; ulimit -v 2000000; git ls-files -z | parallel --xargs -0 cat | aspell list --mode=none --ignore-case | grep -E '^[A-Za-z][a-z]*$' | sort | uniq -c | sort -n | grep -vE '.{25}' | aspell pipe -W3 | grep : | cut -d' ' -f2 | less) and proceeded to spend a few days looking at it to find probable typos and fixed a few hundred of them in all of the llvm project (note, the ones I found are not anywhere near all of them, but it seems like a good start). Differential Revision: https://reviews.llvm.org/D130827
2022-07-29 | [Clang] Do not check for underscores in isAllowedInitiallyIDChar | Corentin Jabot | 1 | -2/+2
isAllowedInitiallyIDChar is only used with non-ASCII code points; ASCII characters are handled by isAsciiIdentifierStart. To make that clearer, remove the check for _ from isAllowedInitiallyIDChar and assert on ASCII, to ensure neither _ nor $ is passed to this function. Reviewed By: tahonermann, aaron.ballman Differential Revision: https://reviews.llvm.org/D130750
2022-07-23 | [Clang] Add support for Unicode identifiers (UAX31) in C2x mode. | Corentin Jabot | 1 | -3/+3
This implements N2836 Identifier Syntax using Unicode Standard Annex 31. The feature was already implemented for C++, and the semantics are the same. Unlike C++, there was, afaict, no decision to backport the feature to older language modes, so C17 and earlier are not modified and the code point tables for these language modes are preserved. Reviewed By: aaron.ballman Differential Revision: https://reviews.llvm.org/D130416
2022-07-14 | [Clang] Adjust extension warnings for delimited sequences | Corentin Jabot | 1 | -2/+8
WG21 approved delimited escape sequences and named escape sequences. Adjust the extension warnings accordingly, and update the release notes. Reviewed By: aaron.ballman Differential Revision: https://reviews.llvm.org/D129664
2022-07-13 | [Clang] Add a warning on invalid UTF-8 in comments. | Corentin Jabot | 1 | -16/+91
Introduce an off-by-default `-Winvalid-utf8` warning that detects invalid UTF-8 code unit sequences in comments. Invalid UTF-8 in other places is already diagnosed, as it cannot appear in identifiers and other grammar constructs. The warning is off by default as it's likely to be somewhat disruptive otherwise. This warning allows clang to conform to the yet-to-be-approved WG21 "P2295R5 Support for UTF-8 as a portable source file encoding" paper. Reviewed By: aaron.ballman, #clang-language-wg Differential Revision: https://reviews.llvm.org/D128059
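To show what "invalid UTF-8 code unit sequences" means in practice, here is a self-contained sketch that checks a comment's bytes against the well-formed UTF-8 byte ranges from the Unicode Standard. It is not the code from this patch: the function names `validUTF8SequenceLength` and `countInvalidUTF8Bytes` are invented for illustration, and counting one error per offending byte is an assumption of this sketch rather than a description of how clang groups its diagnostics.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns the length (1-4) of the well-formed UTF-8 sequence starting at
// Text[Pos], or 0 if the bytes there do not form one. Overlong encodings,
// surrogates (ED A0..BF) and values above U+10FFFF are rejected.
std::size_t validUTF8SequenceLength(std::string_view Text, std::size_t Pos) {
  auto Byte = [&](std::size_t I) -> unsigned {
    return static_cast<unsigned char>(Text[I]);
  };
  auto InRange = [&](std::size_t I, unsigned Lo, unsigned Hi) {
    return I < Text.size() && Byte(I) >= Lo && Byte(I) <= Hi;
  };
  if (Pos >= Text.size())
    return 0;
  unsigned B0 = Byte(Pos);
  if (B0 <= 0x7F)
    return 1;
  if (B0 >= 0xC2 && B0 <= 0xDF)
    return InRange(Pos + 1, 0x80, 0xBF) ? 2 : 0;
  if (B0 >= 0xE0 && B0 <= 0xEF) {
    unsigned Lo = (B0 == 0xE0) ? 0xA0 : 0x80; // E0 must not be overlong
    unsigned Hi = (B0 == 0xED) ? 0x9F : 0xBF; // ED must not encode surrogates
    return (InRange(Pos + 1, Lo, Hi) && InRange(Pos + 2, 0x80, 0xBF)) ? 3 : 0;
  }
  if (B0 >= 0xF0 && B0 <= 0xF4) {
    unsigned Lo = (B0 == 0xF0) ? 0x90 : 0x80; // F0 must not be overlong
    unsigned Hi = (B0 == 0xF4) ? 0x8F : 0xBF; // F4 must stay <= U+10FFFF
    return (InRange(Pos + 1, Lo, Hi) && InRange(Pos + 2, 0x80, 0xBF) &&
            InRange(Pos + 3, 0x80, 0xBF))
               ? 4
               : 0;
  }
  return 0; // lone continuation byte, C0/C1, or F5..FF
}

// Counts the bytes of Comment that are not part of a well-formed sequence.
std::size_t countInvalidUTF8Bytes(std::string_view Comment) {
  std::size_t Invalid = 0;
  for (std::size_t Pos = 0; Pos < Comment.size();) {
    std::size_t Len = validUTF8SequenceLength(Comment, Pos);
    if (Len == 0) {
      ++Invalid;
      ++Pos; // resynchronize on the next byte
    } else {
      Pos += Len;
    }
  }
  return Invalid;
}
```

A caller could run `countInvalidUTF8Bytes` over the text of each comment and warn when the result is non-zero.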
2022-07-12 | Revert "[Clang] Add a warning on invalid UTF-8 in comments." | Jonas Devlieghere | 1 | -92/+16
This reverts commit cc309721d20c8e544ae7a10a66735ccf4981a11c because it breaks the following tests on GreenDragon: TestDataFormatterObjCCF.py TestDataFormatterObjCExpr.py TestDataFormatterObjCKVO.py TestDataFormatterObjCNSBundle.py TestDataFormatterObjCNSData.py TestDataFormatterObjCNSError.py TestDataFormatterObjCNSNumber.py TestDataFormatterObjCNSURL.py TestDataFormatterObjCPlain.py TestDataFormatterObjNSException.py https://green.lab.llvm.org/green/view/LLDB/job/lldb-cmake/45288/
2022-07-12 | [Clang] Add a warning on invalid UTF-8 in comments. | Corentin Jabot | 1 | -16/+92
Introduce an off-by-default `-Winvalid-utf8` warning that detects invalid UTF-8 code unit sequences in comments. Invalid UTF-8 in other places is already diagnosed, as it cannot appear in identifiers and other grammar constructs. The warning is off by default as it's likely to be somewhat disruptive otherwise. This warning allows clang to conform to the yet-to-be-approved WG21 "P2295R5 Support for UTF-8 as a portable source file encoding" paper. Reviewed By: aaron.ballman, #clang-language-wg Differential Revision: https://reviews.llvm.org/D128059
2022-07-09 | Revert "[Clang] Add a warning on invalid UTF-8 in comments." | Corentin Jabot | 1 | -94/+16
It is probable that this change crashes on the powerpc bots. This reverts commit 355532a1499aa9b13a89fb5b5caaba2344d57cd7.
2022-07-09 | [Clang] Add a warning on invalid UTF-8 in comments. | Corentin Jabot | 1 | -16/+94
Introduce an off-by-default `-Winvalid-utf8` warning that detects invalid UTF-8 code unit sequences in comments. Invalid UTF-8 in other places is already diagnosed, as it cannot appear in identifiers and other grammar constructs. The warning is off by default as it's likely to be somewhat disruptive otherwise. This warning allows clang to conform to the yet-to-be-approved WG21 "P2295R5 Support for UTF-8 as a portable source file encoding" paper. Reviewed By: aaron.ballman, #clang-language-wg Differential Revision: https://reviews.llvm.org/D128059
2022-07-06 | Revert "[Clang] Add a warning on invalid UTF-8 in comments." | Nico Weber | 1 | -96/+17
This reverts commit 4174f0ca618b467571b43cff12cbe4c4239670f8. Also revert follow-up "[Clang] Fix invalid utf-8 detection" This reverts commit bf45e27a676d87944f1f13d5f0d0f39935fc4010. The second commit broke tests, see comments on https://reviews.llvm.org/D129223, and it sounds like the first commit isn't valid without the second one. So reverting both for now.
2022-07-06 | [Clang] Add a warning on invalid UTF-8 in comments. | Corentin Jabot | 1 | -17/+96
Introduce an off-by-default `-Winvalid-utf8` warning that detects invalid UTF-8 code unit sequences in comments. Invalid UTF-8 in other places is already diagnosed, as it cannot appear in identifiers and other grammar constructs. The warning is off by default as it's likely to be somewhat disruptive otherwise. This warning allows clang to conform to the yet-to-be-approved WG21 "P2295R5 Support for UTF-8 as a portable source file encoding" paper. Reviewed By: aaron.ballman, #clang-language-wg Differential Revision: https://reviews.llvm.org/D128059
2022-07-06 | Revert "[Clang] Add a warning on invalid UTF-8 in comments." | Corentin Jabot | 1 | -96/+17
Reverting while I investigate build failures. This reverts commit e3dc56805f1029dd5959e4c69196a287961afb8d.
2022-07-06 | [Clang] Add a warning on invalid UTF-8 in comments. | Corentin Jabot | 1 | -17/+96
Introduce an off-by-default `-Winvalid-utf8` warning that detects invalid UTF-8 code unit sequences in comments. Invalid UTF-8 in other places is already diagnosed, as it cannot appear in identifiers and other grammar constructs. The warning is off by default as it's likely to be somewhat disruptive otherwise. This warning allows clang to conform to the yet-to-be-approved WG21 "P2295R5 Support for UTF-8 as a portable source file encoding" paper. Reviewed By: aaron.ballman, #clang-language-wg Differential Revision: https://reviews.llvm.org/D128059
2022-06-29 | [Lex] Make sure to notify `MultipleIncludeOpt` for "read tokens" during fast dependency directive lexing | Argyrios Kyrtzidis | 1 | -0/+4
Otherwise a header may be erroneously marked as having a header macro guard and won't get re-included. Differential Revision: https://reviews.llvm.org/D128772
2022-06-25 | [Clang][C++23] P2071 Named universal character escapes | Corentin Jabot | 1 | -15/+127
Implements [[ https://wg21.link/p2071r1 | P2071 Named Universal Character Escapes ]] as an extension in all language modes. The patch to not warn in C++23 mode will be done later, once this paper is plenary approved (in July).
We add
* A code generator that transforms `UnicodeData.txt` and `NameAliases.txt` to a space efficient data structure that can be queried in `O(NameLength)`
* A set of functions in `Unicode.h` to query that data, including
  * A function to find an exact match of a given Unicode character name
  * A function to perform a loose (ignoring case, space, underscore, medial hyphen) matching
  * A function returning the best matching codepoint for a given string per edit distance
* Support of `\N{}` escape sequences in string and character literals, with loose and typos diagnostics/fixits
* Support of `\N{}` as UCN with loose matching diagnostics/fixits.
Loose matching is considered an error to match closely the semantics of P2071.
The generated data contributes 280kB of data to the binaries.
`UnicodeData.txt` and `NameAliases.txt` are not committed to the repository in this patch, and regenerating the data is a manual process.
Reviewed By: tahonermann
Differential Revision: https://reviews.llvm.org/D123064
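The loose matching mentioned above (ignoring case, space, underscore and medial hyphen) can be sketched as a normalization step followed by a plain string comparison. This is a simplified illustration, not the function added to `Unicode.h`: the names `looseKey` and `looselyEqual` are hypothetical, only ASCII input is considered, and any special cases the real matching rules may have are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Builds a comparison key: upper-cases the name and drops spaces,
// underscores and medial hyphens (a hyphen with a letter or digit directly
// on both sides). Hyphens next to a space are kept, since they are
// significant in names such as "TIBETAN LETTER -A".
std::string looseKey(std::string_view Name) {
  auto IsAlnum = [](char C) {
    return std::isalnum(static_cast<unsigned char>(C)) != 0;
  };
  std::string Key;
  for (std::size_t I = 0; I < Name.size(); ++I) {
    char C = Name[I];
    if (C == ' ' || C == '_')
      continue;
    bool Medial = C == '-' && I > 0 && I + 1 < Name.size() &&
                  IsAlnum(Name[I - 1]) && IsAlnum(Name[I + 1]);
    if (Medial)
      continue;
    Key.push_back(
        static_cast<char>(std::toupper(static_cast<unsigned char>(C))));
  }
  return Key;
}

// Two names match loosely when their keys are identical.
bool looselyEqual(std::string_view A, std::string_view B) {
  return looseKey(A) == looseKey(B);
}
```

With such a key, a diagnostic could recognize `\N{latin_small letter a}` as a loose spelling of `LATIN SMALL LETTER A` and offer the exact name as a fix-it, which fits the commit's statement that loose matches are errors rather than accepted spellings.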
2022-05-27 | [Lex] Fix crash during dependency scanning while skipping an unmatched `#if` | Argyrios Kyrtzidis | 1 | -0/+1
2022-05-26 | [Tooling/DependencyScanning & Preprocessor] Refactor dependency scanning to produce pre-lexed preprocessor directive tokens, instead of minimized sources | Argyrios Kyrtzidis | 1 | -7/+140
This is a commit with the following changes:
* Remove `ExcludedPreprocessorDirectiveSkipMapping` and related functionality
  Removes `ExcludedPreprocessorDirectiveSkipMapping`; its intended benefit for fast skipping of excluded directive blocks will be superseded by a follow-up patch in the series that will use dependency scanning lexing for the same purpose.
* Refactor dependency scanning to produce pre-lexed preprocessor directive tokens, instead of minimized sources
  Replaces the "source minimization" mechanism with a mechanism that produces lexed dependency directive tokens.
* Make the special lexing for dependency scanning a first-class feature of the `Preprocessor` and `Lexer`
This brings the following benefits:
* Full access to the preprocessor state during dependency scanning. E.g. a component can see what includes were taken and where they were located in the actual sources.
* Improved performance for dependency scanning. Measurements with a release+thin-LTO build show a ~11% reduction in wall time.
* Opportunity to use dependency scanning lexing to speed up skipping of excluded conditional blocks during normal preprocessing (as follow-up, not part of this patch).
For normal preprocessing, measurements show differences are below the noise level.
Since, after this change, we don't minimize sources and pass them in place of the real sources, `DependencyScanningFilesystem` is not technically necessary, but it has valuable performance benefits for caching file `stat`s along with the results of scanning the sources. So the setup of using the `DependencyScanningFilesystem` during a dependency scan remains.
Differential Revision: https://reviews.llvm.org/D125486
Differential Revision: https://reviews.llvm.org/D125487
Differential Revision: https://reviews.llvm.org/D125488
2022-04-22 | Revert "Revert "Revert "[clang][pp] adds '#pragma include_instead'""" | Christopher Di Bella | 1 | -2/+2
> Includes regression test for problem noted by @hans.
> This reverts commit 973de71.
>
> Differential Revision: https://reviews.llvm.org/D106898
The feature as implemented is fairly expensive and hasn't been used by libc++. A potential reimplementation is possible if libc++ becomes interested in this feature again.
Differential Revision: https://reviews.llvm.org/D123885
2022-04-19 | [clang][lexer] Allow u8 character literal prefixes in C2x | Timm Bäder | 1 | -3/+6
Implement N2418 for C2x. Differential Revision: https://reviews.llvm.org/D119221
2022-03-02 | [NFC][Lexer] Remove getLangOpts function from Lexer | Dawid Jurczak | 1 | -7/+9
Given that there is only one external user of Lexer::getLangOpts, we can remove the getter entirely without much pain. Differential Revision: https://reviews.llvm.org/D120404
2022-02-28 | [NFC][Lexer] Make Lexer::LangOpts const reference | Dawid Jurczak | 1 | -8/+8
This change can be seen as code cleanup, but the motivation is more performance related. While browsing perf reports captured during a Linux build, we can notice an unusual portion of instructions executed in the std::vector<std::string> copy constructor, like:

    0.59% 0.58% clang-14 clang-14 [.] std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::vector

or even:

    1.42% 0.26% clang clang-14 [.] clang::LangOptions::LangOptions
    |
    --1.16%--clang::LangOptions::LangOptions
    |
    --0.74%--std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::vector

After more digging we can see that the relevant LangOptions std::vector members (*Files, ModuleFeatures and NoBuiltinFuncs) are constructed when the Lexer::LangOpts field is initialized in the member initializer list:

    Lexer::Lexer(..., const LangOptions &langOpts, ...)
        : ..., LangOpts(langOpts),

Since the LangOptions copy constructor is called by Lexer(..., const LangOptions &LangOpts,...) and local Lexer objects are created thousands of times (in Lexer::getRawToken, Preprocessor::EnterSourceFile and more) during single module processing in the frontend, it makes the std::vector copy constructors surprisingly hot.
Unfortunately, even though in the current Lexer implementation the mentioned std::vector members are unused and most of the time empty, no compiler is smart enough to optimize their std::vector copy constructors out, even with LTO enabled (take a look at the test assembly): https://godbolt.org/z/hdoxPfMYY
However, there is a simple way to fix this. Since Lexer doesn't access *Files, ModuleFeatures, NoBuiltinFuncs or any other LangOptions fields (but only LangOptionsBase), we can simply get rid of the redundant copy constructor assembly by changing the LangOpts type to a more appropriate const LangOptions reference: https://godbolt.org/z/fP7de9176
Additionally, we need to store LineComment outside LangOpts because it's written in the SkipLineComment function. Also, FormatTokenLexer needs to be adjusted a bit to avoid lifetime issues related to passing a local LangOpts reference to Lexer.
After this change I can see more than 1% speedup in some of my microbenchmarks when using a Clang release binary built with LTO. For the Linux build the gains are not so significant, but still nice at the level of a -0.4%/-0.5% instruction drop.
Differential Revision: https://reviews.llvm.org/D120334
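The effect described in this commit message can be reproduced in miniature. The sketch below uses a hypothetical `Options` struct as a stand-in for `LangOptions` (with an instrumented copy constructor) and two hypothetical lexer-like types, one storing the options by value and one by const reference. None of these names come from clang; the copy counter exists only to make the difference observable.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for LangOptions: it owns a vector member, so every
// copy runs a std::vector copy constructor. CopyCount instruments that.
struct Options {
  std::vector<std::string> ModuleFeatures;
  bool LineComment = true;
  static inline int CopyCount = 0;

  Options() = default;
  Options(const Options &Other)
      : ModuleFeatures(Other.ModuleFeatures), LineComment(Other.LineComment) {
    ++CopyCount;
  }
};

// Before: the lexer stores its own copy of the options.
struct LexerByValue {
  Options LangOpts;
  explicit LexerByValue(const Options &O) : LangOpts(O) {}
};

// After: the lexer only refers to the caller's options. State that the
// lexer itself writes (here LineComment) is kept in a separate member.
struct LexerByRef {
  const Options &LangOpts;
  bool LineComment;
  explicit LexerByRef(const Options &O)
      : LangOpts(O), LineComment(O.LineComment) {}
};

// Creates NumLexers short-lived lexers and reports how many Options copies
// that caused.
template <typename LexerT> int copiesFor(int NumLexers) {
  Options Opts;
  Options::CopyCount = 0;
  for (int I = 0; I < NumLexers; ++I) {
    LexerT L(Opts);
    (void)L;
  }
  return Options::CopyCount;
}
```

The by-reference variant trades the copies for a lifetime requirement: the `Options` object must outlive every lexer that refers to it, which is the kind of issue the message mentions for FormatTokenLexer.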
2022-02-23 | [NFC][Lexer] Make access to LangOpts more consistent | Dawid Jurczak | 1 | -22/+20
Before this change, Lexer::LangOpts was, for no good reason, sometimes accessed through the getter and sometimes read directly in Lexer functions. Since getLangOpts is a bit more verbose, prefer direct access to the LangOpts member when possible. Differential Revision: https://reviews.llvm.org/D120333
2022-01-31 | [clang][Lexer] Make raw and normal lexer behave the same for line comments | Kadir Cetinkaya | 1 | -2/+3
Normally there are heuristics in the lexer to treat `//*` specially in language modes that don't have line comments (to emit `/`). Unfortunately this only applied to the first occurrence of a line comment inside the file, as the subsequent line comments were treated as if the language had support for them. This unfortunately only holds in normal lexing mode, as in raw mode all occurrences of line comments received this treatment, which created discrepancies when comparing expanded and spelled tokens. The proper fix would be to just make sure we treat all the line comments with a subsequent `*` the same way, but it would imply breaking some code that's accepted by clang today. So instead we introduce the same bug into raw lexing mode. Fixes https://github.com/clangd/clangd/issues/1003. Differential Revision: https://reviews.llvm.org/D118471
2021-12-29 | [clang] Use nullptr instead of 0 or NULL (NFC) | Kazu Hirata | 1 | -2/+2
Identified with modernize-use-nullptr.
2021-11-18 | [clang][lex] Refactor check for the first file include | Jan Svoboda | 1 | -6/+10
This patch refactors the code that checks whether a file has just been included for the first time. The `HeaderSearch::FirstTimeLexingFile` function is removed and the information is threaded to the original call site from `HeaderSearch::ShouldEnterIncludeFile`. This will make it possible to avoid tracking the number of includes in a follow up patch. Depends on D114092. Reviewed By: dexonsmith Differential Revision: https://reviews.llvm.org/D114093
2021-09-15 | Implement delimited escape sequences. | Corentin Jabot | 1 | -20/+55
\x{XXXX}, \u{XXXX} and \o{OOOO} are accepted in all language modes in character and string literals. This is a feature proposed for both C++ (P2290R1) and C (N2785). The papers have been seen by both committees but are not yet adopted into either standard. However, they do have support from both committees.
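To illustrate the three delimited forms, here is a small stand-alone parser sketch. It is not the lexer code from this commit: `parseDelimitedEscape` is a hypothetical name, it only accepts a string that consists of exactly one escape, and it simplifies validation to "digits must fit the base and the value must fit in 32 bits", which is an assumption of this sketch rather than the rule set of P2290R1 or N2785.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Parses one delimited escape such as \x{41}, \u{1F600} or \o{101} and
// returns its numeric value, or std::nullopt if the text is malformed
// (unknown introducer, missing braces, no digits, or a bad digit).
std::optional<std::uint32_t> parseDelimitedEscape(std::string_view Text) {
  // Shortest valid form is five characters, e.g. \x{0}.
  if (Text.size() < 5 || Text[0] != '\\' || Text[2] != '{' ||
      Text.back() != '}')
    return std::nullopt;

  unsigned Base = 0;
  switch (Text[1]) {
  case 'x':
  case 'u':
    Base = 16; // hexadecimal digits between the braces
    break;
  case 'o':
    Base = 8; // octal digits between the braces
    break;
  default:
    return std::nullopt;
  }

  std::uint64_t Value = 0;
  for (char C : Text.substr(3, Text.size() - 4)) {
    unsigned Digit = 0;
    if (C >= '0' && C <= '9')
      Digit = static_cast<unsigned>(C - '0');
    else if (C >= 'a' && C <= 'f')
      Digit = static_cast<unsigned>(C - 'a') + 10;
    else if (C >= 'A' && C <= 'F')
      Digit = static_cast<unsigned>(C - 'A') + 10;
    else
      return std::nullopt;
    if (Digit >= Base)
      return std::nullopt;
    Value = Value * Base + Digit;
    if (Value > 0xFFFFFFFFu)
      return std::nullopt; // does not fit in 32 bits
  }
  return static_cast<std::uint32_t>(Value);
}
```

The braces make the end of the escape explicit, so `"\x{41}BC"` in source code is unambiguous where the undelimited `"\x41BC"` would keep consuming hexadecimal digits.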
2021-09-14 | Cleanup identifier parsing; NFC | Corentin Jabot | 1 | -134/+120
Rename methods to clearly signal when they only deal with ASCII, simplify the parsing of identifiers, and use start/continue instead of head/body for consistency with Unicode terminology.
2021-08-18 | Do not emit diagnostics for invalid unicode characters in preprocessing mode | Corentin Jabot | 1 | -2/+4
This amends 4e80636db71a1b6123d15ed1f9eda3979b4292de with a fix for https://lab.llvm.org/buildbot/#/builders/139/builds/8943
2021-08-18 | Implement P1949 | Corentin Jabot | 1 | -37/+91
This adds the Unicode 13 data for XID_Start and XID_Continue. The definition of a valid identifier is changed in all C++ modes, as P1949 (https://wg21.link/p1949) was accepted by WG21 as a defect report.
2021-07-29 | Revert "Revert "[clang][pp] adds '#pragma include_instead'"" | Christopher Di Bella | 1 | -2/+2
Includes regression test for problem noted by @hans. This reverts commit 973de7185606a21fd5e9d5e8c014fbf898c0e72f. Differential Revision: https://reviews.llvm.org/D106898
2021-07-27 | Revert "[clang][pp] adds '#pragma include_instead'" | Hans Wennborg | 1 | -2/+2
> `#pragma clang include_instead(<header>)` is a pragma that can be used > by system headers (and only system headers) to indicate to a tool that > the file containing said pragma is an implementation-detail header and > should not be directly included by user code. > > The library alternative is very messy code that can be seen in the first > diff of D106124, and we'd rather avoid that with something more > universal. > > This patch takes the first step by warning a user when they include a > detail header in their code, and suggests alternative headers that the > user should include instead. Future work will involve adding a fixit to > automate the process, as well as cleaning up modules diagnostics to not > suggest said detail headers. Other tools, such as clangd can also take > advantage of this pragma to add the correct user headers. > > Differential Revision: https://reviews.llvm.org/D106394 This caused compiler crashes in Chromium builds involving PCH and an include directive with macro expansion, when Token::getLiteralData() returned null. See the code review for details. This reverts commit e8a64e5491260714c79dab65d1aa73245931d314.
2021-07-26 | [clang][pp] adds '#pragma include_instead' | Christopher Di Bella | 1 | -2/+2
`#pragma clang include_instead(<header>)` is a pragma that can be used by system headers (and only system headers) to indicate to a tool that the file containing said pragma is an implementation-detail header and should not be directly included by user code. The library alternative is very messy code that can be seen in the first diff of D106124, and we'd rather avoid that with something more universal. This patch takes the first step by warning a user when they include a detail header in their code, and suggests alternative headers that the user should include instead. Future work will involve adding a fixit to automate the process, as well as cleaning up modules diagnostics to not suggest said detail headers. Other tools, such as clangd can also take advantage of this pragma to add the correct user headers. Differential Revision: https://reviews.llvm.org/D106394
2021-07-21 | [clang] Introduce SourceLocation::[U]IntTy typedefs. | Simon Tatham | 1 | -1/+1
This is part of a patch series working towards the ability to make SourceLocation into a 64-bit type to handle larger translation units. NFC: this patch introduces typedefs for the integer type used by SourceLocation and makes all the boring changes to use the typedefs everywhere, but for the moment, they are unconditionally defined to uint32_t. Patch originally by Mikhail Maltsev. Reviewed By: tmatheson Differential Revision: https://reviews.llvm.org/D105492
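The idea of routing every use of the underlying integer type through a pair of typedefs can be sketched as follows. `Location` is a hypothetical, heavily reduced class written for illustration; only the typedef names `UIntTy` and `IntTy` and the current choice of `uint32_t` are taken from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, reduced location type. All code that needs the raw integer
// uses Location::UIntTy / Location::IntTy instead of naming uint32_t, so a
// later switch to 64 bits only has to touch these two lines.
class Location {
public:
  using UIntTy = std::uint32_t; // unsigned type of the raw encoding
  using IntTy = std::int32_t;   // signed counterpart, used for offsets

  static Location getFromRawEncoding(UIntTy Raw) {
    Location L;
    L.ID = Raw;
    return L;
  }

  UIntTy getRawEncoding() const { return ID; }

  // Unsigned arithmetic wraps, so negative offsets move backwards.
  Location getWithOffset(IntTy Offset) const {
    return getFromRawEncoding(ID + static_cast<UIntTy>(Offset));
  }

private:
  UIntTy ID = 0;
};
```

Client code that stores a raw encoding would then declare it as `Location::UIntTy Raw = Loc.getRawEncoding();` rather than `unsigned` or `uint32_t`.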
2021-07-20 | [Lex] Consider a PCH header-guarded even with #endif truncated | Sam McCall | 1 | -0/+5
This seems to be a more useful behavior for tools that use preambles. I believe it doesn't affect real compiles: the PCH is only included once when used, and recursive inclusion of the main-file *within* the PCH isn't supported in any case. Differential Revision: https://reviews.llvm.org/D106204
2021-07-14 | [Lexer] Fix bug in `makeFileCharRange` called on split tokens. | Yitzhak Mandelbaum | 1 | -4/+17
When the end loc of the specified range is a split token, `makeFileCharRange` does not process it correctly. This patch adds proper support for split tokens. Differential Revision: https://reviews.llvm.org/D105365
2021-05-27 | Add support for #elifdef and #elifndef | Aaron Ballman | 1 | -0/+2
WG14 adopted N2645 and WG21 EWG has accepted P2334 in principle (still subject to full EWG vote + CWG review + plenary vote), which add support for #elifdef as shorthand for #elif defined and #elifndef as shorthand for #elif !defined. This patch adds support for the new preprocessor directives.
2021-05-24 | PR50456: Properly handle multiple escaped newlines in a '*/'. | Richard Smith | 1 | -30/+44
2021-03-15 | [NFC] Use higher level constructs to check for whitespace/newlines in the lexer | serge-sans-paille | 1 | -4/+4
It turns out that according to valgrind and perf, it's also slightly faster. Differential Revision: https://reviews.llvm.org/D98637
2021-03-12 | Add support for digit separators in C2x. | Aaron Ballman | 1 | -2/+4
WG14 adopted N2626 at the meetings this week. This commit adds support for using ' as a digit separator in a numeric literal which is compatible with the C++ feature.
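For reference, this is what the C++ feature that the C2x syntax is compatible with looks like. The snippet is written in C++14 rather than C2x, and the constant names are made up for the example.

```cpp
#include <cassert>

// The apostrophe only groups digits for readability; it does not change
// the value or the type of the literal.
constexpr long kMillion = 1'000'000;
constexpr unsigned kHighHalfMask = 0xFFFF'0000u;
constexpr unsigned kAlternatingBits = 0b1010'1010u;

static_assert(kMillion == 1000000, "separators do not affect the value");
static_assert(kHighHalfMask == 0xFFFF0000u, "works in hexadecimal literals");
static_assert(kAlternatingBits == 170u, "works in binary literals");
```

Separators may be placed between any two digits, so grouping by thousands, bytes or nibbles is purely a style choice.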