You'd replace the JUMP_TARGET macro:
#define JUMP_TARGET goto *jump_table[(int32_t)l->input.p[l->pos]]
With:
#ifdef __clang__
#define musttail [[clang::musttail]]
#elif __GNUC__
#define musttail [[gnu::musttail]]
#else
#define musttail
#endif
#define JUMP_TARGET musttail return jump_table[(int32_t)l->input.p[l->pos]](l, a, out)
Then move the jump table out to the top level and replace each `&&` with `&`. See diff (untested): https://www.diffchecker.com/V4yH3EyF/
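For concreteness, here's a compilable sketch of the resulting shape, with hypothetical names and a simplified one-argument signature (the article's handlers take three):

    /* Sketch only: every rule shares one signature, the table lives at file
       scope, and each rule ends by tail-calling the rule for the next char. */
    #include <stddef.h>

    #ifdef __clang__
    #define musttail [[clang::musttail]]
    #elif __GNUC__
    #define musttail [[gnu::musttail]]
    #else
    #define musttail
    #endif

    typedef struct {
        const char *p;   /* NUL-terminated input */
        size_t pos;
    } lexer;

    typedef int (*rule_fn)(lexer *l);

    static int lex_eof(lexer *l);
    static int lex_digit(lexer *l);
    static int lex_other(lexer *l);

    /* What was `&&label` in the computed-goto version is now a plain
       function address, so the table can live at the top level. */
    static rule_fn jump_table[256] = {
        [1 ... 255]   = lex_other,   /* GCC/Clang range initializer */
        ['0' ... '9'] = lex_digit,
        [0]           = lex_eof,
    };

    /* Plays the role of JUMP_TARGET above, minus the extra arguments. */
    #define DISPATCH() musttail return jump_table[(unsigned char)l->p[l->pos]](l)

    static int lex_eof(lexer *l)   { (void)l; return 0; }
    static int lex_other(lexer *l) { l->pos++; DISPATCH(); }
    static int lex_digit(lexer *l) {
        while (l->p[l->pos] >= '0' && l->p[l->pos] <= '9')
            l->pos++;                /* ...emit an integer token here... */
        DISPATCH();
    }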
This approach has the advantage that it will work everywhere, not only on compilers that support computed gotos - it just won't be optimized on compilers that don't support `musttail`. (Though it has been proposed to standardize it in a future version of C.)
It might also work better with code navigation tools that show functions but not labels, and it enables modularity, since we can split rules over multiple translation units.
Performance-wise it should be basically the same - though it's been argued that it may do better in some cases, because the compiler's register allocator doesn't do a great job in large functions with computed gotos, whereas in the musttail approach each handler is a smaller function that gets optimized separately.
Why don't they just use `tailcall`? That would make it obvious what it's doing, because we've been using the term for nearly half a century, and the entire literature on the subject uses the term "tail call".
Even better would be to just automatically insert a tail call - like every other language that has supported tail calls for decades - provided the callee has the same signature as the caller. If it's undesirable because we want a stack trace, then instead have some keyword or attribute to suppress the tail call - such as `no_tail`, `nontail` or `donttail`.
Requiring tail calls to be marked will basically mean the optimization will be underutilized. Other than having a stack trace for debugging, there's basically no reason not to have the optimization on by default.
As for why it's not trivial for Rust to do this by default, consider the question of what should happen in the case of local destructors, which in an ordinary function would be called after `return myfunc()` returns, but in a tail-recursive function would need to be called beforehand. The proposals for `become` tend to handle this by making it a compiler error to have any locals with destructors in scope at the point of the tail-call, further motivating the explicit syntax.
I've used Scala for many years and concluded this was a mistake. It makes it too easy to accidentally rely on an optimisation and then break your code much later with a seemingly innocuous change. Better to make it explicit.
I don't see any way around having to state your intent (as a programmer) for this. It's just a semantic difference in behavior unless your Abstract Machine allows for just ignoring the possibility of stack overflow entirely.
I want a way to state that my intent is to not rely on the TCO. Even a second explicit annotation for "do not TCO this function" would be better than what Scala currently does (but frankly TCO is surprising enough to debug that I think it should be opt-in rather than opt-out).
How would that work? You want a nontail keyword to indicate that intent explicitly?
I guess I could imagine a scenario where you want to de-optimize a TCO into a non-TC... but I mean... that's got to be rare enough to just not bother with?
EDIT: Also, this is the exact opposite of your comment which I replied to. Make up your mind on what you want and we can maybe find a way to achieve it
That, in a quite C-like language.
https://blog.nelhage.com/post/cpython-tail-call/ has made the rounds a lot recently, and explores this for Python's bytecode interpreter.
A switch only selects on one character. To continue lexing you need the switch inside a loop. The compiler might optimize the switch itself to a jump table - but what does each case do? It jumps back to the start of the loop, after which it enters the jump table again. There are two branches involved.
The point of direct threading is that there is no loop - you simply jump directly to the handler for the next character at the end of each handler.
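For anyone following along, a rough sketch of the two shapes being compared (my own illustration; the threaded version relies on the GCC/Clang labels-as-values extension):

    /* 1) switch in a loop: every case branches back to the loop head,
          which then takes the switch's indirect branch again. */
    void lex_switch(const char *p) {
        for (;;) {
            switch (*p) {
            case 0:   return;
            case ' ': p++; break;      /* back to the loop head, then re-dispatch */
            default:  p++; break;
            }
        }
    }

    /* 2) direct threading (computed goto): each handler jumps straight to the
          handler for the next character; there is no loop back-edge. */
    void lex_threaded(const char *p) {
        static void *tbl[256] = { [0] = &&done, [1 ... 255] = &&other };
        goto *tbl[(unsigned char)*p];
    other:
        p++;
        goto *tbl[(unsigned char)*p];  /* dispatch at the end of the handler */
    done:
        return;
    }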
If you read the URL I linked to, you will see that it is.
> The point of direct threading is that there is no loop - you simply jump directly to the handler for the next character at the end of each handler.
No, the point of direct threading is that you give the branch predictor more context to work with (effectively, the previous opcode), which was relevant with the branch predictors in typical CPUs 10+ years ago. (Modern ones, ITTAGE-style, have per-branch history also for indirect jumps.)
I was excited when it was introduced but quickly ran into issues.
Here is a silent miscompilation involving `[[musttail]]` that I reported in 2022 and is still open: https://github.com/llvm/llvm-project/issues/56435
Everything else I stole from Bob Nystrom: I keep a local copy of the token's string in the token, aka `char word[64]`. I try to minimize "decision making" during lexing. Really, at the consumption point we're only interested in an extremely small number of things: (1) does the lexeme start with a letter or a number? (2) is it whitespace, and is that whitespace a newline? or (3) does it look like an operator?
The only place where I've ever considered goto-threading was in keyword identification. However, if your language keeps keywords to ≤ 8 bytes, you can just bake the keywords into `uint64_t`'s and compare against those values. You can do a crapload of 64b compares/ns.
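A sketch of that trick, in case it's not obvious (my example, not the parent's code):

    /* Pad the lexeme to 8 bytes with zeros and compare whole words.  Endianness
       doesn't matter as long as both sides are packed the same way.  Assumes
       the caller already knows the lexeme length. */
    #include <stdint.h>
    #include <string.h>

    static uint64_t pack8(const char *s, size_t len) {
        uint64_t v = 0;
        memcpy(&v, s, len < 8 ? len : 8);
        return v;
    }

    static int is_return_kw(const char *lexeme, size_t len) {
        static const char kw[] = "return";
        return len == sizeof kw - 1 && pack8(lexeme, len) == pack8(kw, len);
    }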
The next level up (parsing) is slow enough to eat & memoize the decision making of the lexer; and, materially, it doesn't complicate the parser. (In fact: there's a lot of decision making that happens in the parser that'd have to be replicated in the lexer, otherwise.)
The result, overall, is you can have a pretty general-purpose lexer that you can reuse for any old C-ish language, and tune to your heart's content, without needing a custom rewrite each time.
This does mean you have to worry about partial tokens ... but if you limit yourself to feeding full lines that mostly goes away.
Besides, for reasonable-size workloads, "read the whole file ahead of time" is usually a win. The only time it's tempting not to do so is for REPLs.
While I understand the desire to support one input interface for composability, reuse, etc., I can't help wondering why 'FILE*'. Isn't reading from a string more "universal"?
> If the user has a c-string, the string can be easily wrapped by `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer.
And if the user has a file, it's easy to read it into memory in advance.
What's the benefit of FILE* over a string?
In C, I like just passing a const char * around as input; this also gives me the ability to return progress and unget chars as an added bonus.
https://github.com/codr7/shi-c/blob/b1d5cb718b7eb166a0a93c77...
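Roughly the shape I mean (hypothetical helpers, not copied from the linked repo):

    #include <stddef.h>

    /* The cursor is just a const char **: "unget" is a decrement and
       "progress" is a pointer difference. */
    static int  lx_peek(const char **in)  { return (unsigned char)**in; }
    static int  lx_next(const char **in)  { return (unsigned char)*(*in)++; }
    static void lx_unget(const char **in) { --*in; }
    static ptrdiff_t lx_progress(const char *start, const char *cur) { return cur - start; }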
A truly performant lexer needs to jump ahead as far as possible. This likely involves SIMD (or SWAR) since unfortunately the C library fails to provide most of the important interfaces.
As an example that the C library can handle tolerably, while lexing a string, you should repeatedly call `strcspn(input, "\"\\\n")` to skip over chunks of ordinary characters, then only special-case the quote, backslash, newline and (implicit!) NUL after each jump. Be sure to correctly distinguish between an embedded NUL and the one you probably append to represent EOF (or, if streaming [which requires quite a bit more logic], end of current chunk).
Unfortunately, there's a decent chance your implementation of `strcspn` doesn't optimize for the possibility of small sets, and instead constructs a full 256-bit bitset. And even if it does, this strategy won't work for larger sets such as "all characters in an identifier" (you'd actually use `strspn` since this is positive), for which you'll want to take advantage of the characters being adjacent.
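To make the strcspn idea concrete, scanning the body of a string literal might look like this (a sketch under the assumptions above: NUL-terminated input, opening quote already consumed):

    #include <string.h>

    const char *lex_string_body(const char *p) {
        for (;;) {
            p += strcspn(p, "\"\\\n");          /* skip a chunk of ordinary chars */
            switch (*p) {
            case '"':  return p + 1;            /* closing quote */
            case '\\': p += (p[1] != '\0') ? 2 : 1; break;  /* skip the escape */
            case '\n': return p;                /* unterminated string: report an error */
            default:   return p;                /* '\0': EOF (or end of chunk if streaming) */
            }
        }
    }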
Edit: yikes, is this using a hash without checking for collisions?!?
Eventually you land on recreating the modern CPU.
https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-...
[1]: https://github.com/xNaCly/purple-garden/blob/master/cc.c#L76...
https://github.com/ClickHouse/ClickHouse/blob/master/src/Par...
https://github.com/ClickHouse/ClickHouse/blob/master/src/Par...
It supports SIMD for accelerated character matching, it does not do any allocations, and it is very small (compiles to a few KB of WASM code).
One trick that Postgres uses [1][2] is perfect hashing [3]. Since you know in advance what your keywords are, you can construct a hash function h such that for each keyword w_i in the keyword list W, h(w_i) = i. That essentially means no collisions, and the memory requirement is O(n) for n keywords (toy sketch of the lookup side below, after the links).
[1] https://github.com/postgres/postgres/blob/master/src/tools/P...
[2] https://github.com/postgres/postgres/blob/master/src/tools/g...
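To illustrate just the lookup side (a toy with a hand-picked hash that happens to be collision-free for these four keywords; Postgres generates its hash function with a tool):

    #include <string.h>

    /* 'i'+'f'=207, 'e'+'e'=202, 'w'+'e'=220, 'r'+'n'=224; mod 7 gives the
       distinct slots 4, 6, 3, 0.  One memcmp rejects non-keyword identifiers
       that happen to land in an occupied slot. */
    static const char *const kw_table[7] = {
        [0] = "return", [3] = "while", [4] = "if", [6] = "else",
    };

    /* Returns the keyword's slot index, or -1 if `s` is not a keyword. */
    int keyword_lookup(const char *s, size_t len) {
        if (len < 2 || len > 6) return -1;
        unsigned h = ((unsigned char)s[0] + (unsigned char)s[len - 1]) % 7;
        const char *kw = kw_table[h];
        return (kw && strlen(kw) == len && memcmp(kw, s, len) == 0) ? (int)h : -1;
    }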
That said, there's no reason not to squeeze every bit of performance out of it!
[1]: In this talk about the Carbon language, Chandler Carruth shows and explains some goals/challenges regarding performance: https://youtu.be/ZI198eFghJk?t=1462
Years back I worked at a C++ shop with a big codebase (hundreds of millions of LOC when you included vendored dependencies). Compile times there were sometimes dominated by parsing speed! Now, I don't remember the exact breakdown of lexing vs parsing, but I did look at it under a profiler.
It's very easy in C++ projects to structure your code such that you inadvertently cause hundreds of megabytes of sources to be parsed by each single #include. In such a case, lexing and parsing costs can dominate build times. Precompiled headers help, but not enough...
Lexing, parsing and even type checking are interleaved in most C++ compilers due to the ambiguous nature of many constructs in the language.
It is very hard to profile only one of these in isolation. And even with a compiler's built-in instrumentation, the results are not very representative of the work done behind the scenes.
C++ compilers are amazing machines. They are blazing fast at parsing a language which is a nightmare of ambiguities. And they are like that mainly because of how stupidly verbose and inefficient the C++ include system is.
Indeed, precise cost attribution is difficult or impossible due to how the nature of the language imposes structure on industrial computers. But that aside, you still end up easily with hundreds of megabytes of source to deal with in each translation unit. I have so many scars from dealing with that...
Good luck naively differentiating whether '>' is effectively a closing chevron or part of a shift operator '>>' in a random nested template expression like 'set<vector<int>>'.
For a statically typed language, it's very unlikely that the lexer shows up as a bottleneck. Compilation time will likely be dominated by semantic analysis, type checking, and code generation.
For a dynamically typed language where there isn't as much for the compiler to do, then the lexer might be a more noticeable chunk of compile times. As one of the V8 folks pointed out to me years ago, the lexer is the only part of the compiler that has to operate on every single individual byte of input. Everything else gets the luxury of greater granularity, so the lexer can be worth optimizing.
P.S. I absolutely loved "Crafting Interpreters" — thank you so much for writing it!
Example: in the blog post a single token uses 32 bytes, plus 8 bytes for the pointer indirection in the AST node. That's a lot of memory, cache-line misses and indirections.
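For comparison, a token can be packed much tighter; one possible layout (mine, not the article's):

    #include <stdint.h>

    /* 8 bytes per token: store an offset+length into the source buffer instead
       of pointers, and keep tokens in a flat array instead of per-node heap
       allocations. */
    typedef struct {
        uint32_t offset;   /* byte offset of the lexeme in the source */
        uint16_t length;   /* lexeme length (very long tokens can use an overflow table) */
        uint8_t  kind;     /* token type */
        uint8_t  flags;
    } token;               /* sizeof(token) == 8 */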
PS: We really need GPSIMD compiler phases for GCC and Clang.
Compilers will compile switches to a branch tree, two-level jump table, or single-level jump table depending on density and optimisation options. If manually using a jump table is faster, you either have a terrible compiler or just haven't explored its optimisation settings enough.
I did a similar thing (for fun) for the tokenizer of a Swift-derivative language, written in C++.
My approach was however very different from yours:
- No macros, no ASM, just explicit vectorization using std::simd
- No hand-rolled allocator. Just std::vector and SOA.
- No hashing for keywords. They are short. A single SIMD load / compare is often enough for a comparison (rough sketch of the idea below)
- All the lookup tables are compile time generated from the token list using constexpr to keep the code small and maintainable.
I was able to reach around 8 Mloc/s on server grade hardware, single core.
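The keyword trick in C terms, for those not using std::simd (a rough SSE2 sketch of the same idea; it assumes the lexeme buffer is padded so a 16-byte load past the end is safe, and that the caller already checked the lexeme length):

    #include <emmintrin.h>

    /* Compare the first `len` bytes of `lexeme` against a keyword constant
       that has been padded to 16 bytes, with one load and one compare. */
    static int equals16(const char *lexeme, const char *padded_kw, size_t len) {
        __m128i a = _mm_loadu_si128((const __m128i *)lexeme);
        __m128i b = _mm_loadu_si128((const __m128i *)padded_kw);
        unsigned eq   = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
        unsigned mask = (len >= 16) ? 0xFFFFu : ((1u << len) - 1);
        return (eq & mask) == mask;
    }

    /* usage: static const char kw_return[16] = "return";
              equals16(lexeme, kw_return, 6); */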
For integer parsing, this is a pretty cheap operation, so I wonder if it is worthwhile to try to avoid doing it many times. For double parsing, it is expensive if you require the bottom couple bits to be correct, so the approach in the blog post should create savings.
I found it quite tricky to apply its ideas to the more general syntax for a programming language, but with a bunch of hacking and few subtle changes to the language itself, the performance difference over one-character-at-a-time was quite substantial (about 6-10x).
The only reason this raises an eyebrow is that I've seen conflicting anec/data on this, depending pretty hard on target microarchitecture and the program itself.
A few months ago I built a toy boolean expression parser as a weekend project. The main goal was simple: evaluate an expression and return true or false. It supported basic types like int, float, string, arrays, variables, and even custom operators.
The syntax and grammar were intentionally kept simple. I wanted the whole implementation to be self-contained and compact, something that could live in just a .h and .cc file. Single pass for lexing, parsing, and evaluation.
Once the first version was functional, I challenged myself to optimize it for speed. Here are some of the performance-related tricks I remember using:
- No string allocations: used the input *str directly, relying on pointer manipulation instead of allocating memory for substrings.
- Stateful parsing: maintained a parsing state structure passed by reference to avoid unnecessary copies or allocations.
- Minimized allocations: tried to avoid heap allocations wherever possible. Some were unavoidable during evaluation, but I kept them to a minimum.
- Branch prediction-friendly design: used lookup tables to assist with token identification (mapping the first character to a token type and validating identifier characters); see the sketch after this list.
- Inline literal parsing: converted integer and float literals to their native values directly during lexing instead of deferring conversion to a later phase.
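The first-character lookup table mentioned above looks roughly like this (illustrative classes, not the project's actual table):

    /* One 256-entry table classifies the byte that starts a token, so the hot
       path is a single indexed load instead of a chain of comparisons. */
    enum tok_class {
        TC_INVALID, TC_SPACE, TC_DIGIT, TC_IDENT, TC_STRING, TC_OPERATOR
    };

    static const unsigned char tok_class_of[256] = {
        [' '] = TC_SPACE, ['\t'] = TC_SPACE, ['\n'] = TC_SPACE, ['\r'] = TC_SPACE,
        ['0' ... '9'] = TC_DIGIT,                 /* GCC/Clang range initializer */
        ['a' ... 'z'] = TC_IDENT, ['A' ... 'Z'] = TC_IDENT, ['_'] = TC_IDENT,
        ['"'] = TC_STRING,
        ['('] = TC_OPERATOR, [')'] = TC_OPERATOR, ['.'] = TC_OPERATOR,
        ['|'] = TC_OPERATOR, ['&'] = TC_OPERATOR, ['!'] = TC_OPERATOR,
    };

    /* usage: switch (tok_class_of[(unsigned char)*p]) { ... } */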
I think all the tricks are mentioned in the article already. For what it's worth, here is the project:
https://github.com/pausan/tinyrulechecker
I used this expression to assess the performance on an Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz (launched Q3 2018): myfloat.eq(1.9999999) || myint.eq(32)
I know it is a simple expression, and likely a larger expression would perform worse due to variable lookups, ... I could get a speed of 287MB/s, or 142ns per evaluation (7M evaluations per second). I was pleasantly surprised to reach those speeds given that 1 evaluation is a full cycle of lexing, parsing and evaluating the expression itself.

The next step I considered was to use SIMD for tokenizing, but I'm not sure it would have helped much with the overall evaluation times; I seem to recall most of the time was spent in the parser or evaluation phases anyway, not the lexer.
It was a fun project.
Not sure how many of these translate to other languages.