[0]: https://sillycross.github.io/2023/05/12/2023-05-12/
[1]: https://sillycross.github.io/2022/11/22/2022-11-22/
The major difference is that LuaJIT Remake's Copy-and-Patch requires "manually" copying blocks of assembly code and patching values, while this post relies on the Go compiler's closures to create copies of the functions with runtime-known values.
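Concretely, the technique looks something like this minimal sketch (my illustration, not the post's actual code; expr/lit/add are made-up names): each AST node is "compiled" once into a closure that has its runtime-known values and pre-compiled children baked in, so evaluation never walks the tree again:

    package main

    import "fmt"

    type expr interface{ compile() func() int }

    type lit struct{ n int }
    type add struct{ l, r expr }

    func (e lit) compile() func() int {
        n := e.n // runtime-known value captured once, at "compile" time
        return func() int { return n }
    }

    func (e add) compile() func() int {
        l, r := e.l.compile(), e.r.compile() // children compiled once
        return func() int { return l() + r() }
    }

    func main() {
        // (1 + 2) + 3, turned into a tree of closures
        f := add{add{lit{1}, lit{2}}, lit{3}}.compile()
        fmt.Println(f()) // 6, with no AST walking at call time
    }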
I think there's fascinating progress being made in this area. In the future, I think this technique (in some form) will be the go-to way to create new interpreted languages, and AST interpreters, switch-based bytecode VMs, and JIT compilation will be things of the past.
> JIT will be things of the past
Sorry, no. JIT is not going anywhere. The article itself mentions that a JIT would give better performance, just with more effort than they wanted to put in (a good tradeoff!). JITs power the Java, Wasm, and JavaScript VMs, and they are certainly the way to get the fastest code, because you can hand the CPU code where it can do a much better job of predicting the next instruction. With interpreters you're often limited by the indirect loads involved in looking up the next function to call, and generating code for the control flow around calls to your "builtins" is precisely what Sparkplug does.
At the end of the day, like most of engineering, it comes down to choosing the right tool for the job: in this case simplicity, which is often the right choice, but not always. For example, if browsers did this, JavaScript performance would tank compared to what we get today.
Template interpreters are not a substitute for real JIT — JIT compilation isn’t going anywhere.
It was also my understanding that this is the point of "superoptimizers" [1], which look for these common patterns in something like LLVM IR to generate optimization targets for the mainline optimizer.
> this post relies on the Go compiler's closures to create copies of the functions with runtime-known values
To be clear, the technique of using closures like this is ancient in the world of Lisp. You can see it in Paul Graham's books on Lisp from the 90s, in Lisp in Small Pieces, and in many interpreters of 80s/90s vintage. I would say that it is quite standard.
I am finding a switch-based bytecode interpreter to be very expedient on my computer. It seems that if the number of cases is kept small enough, your chances of getting a good branch prediction go up substantially. Something like a brainfuck interpreter runs extremely fast. In the worst case of randomly guessing, you are still going to time travel with a 12.5% success rate.
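For reference, a minimal sketch of the kind of small-switch dispatch loop meant here (a brainfuck-like subset in Go; loops and input are omitted for brevity, and the names are illustrative):

    package main

    import "fmt"

    func run(code string) byte {
        tape := make([]byte, 30000)
        ptr := 0
        for pc := 0; pc < len(code); pc++ {
            switch code[pc] { // only a handful of cases: prediction-friendly
            case '+':
                tape[ptr]++
            case '-':
                tape[ptr]--
            case '>':
                ptr++
            case '<':
                ptr--
            }
        }
        return tape[0]
    }

    func main() {
        fmt.Println(run("++++++++++")) // prints 10
    }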
Random guessing clearly isn't the worst case then; a bad prediction can miss every single time.
There are more tricks available there, such as unrolling the loop that calls the list of closures, and having a `nop` closure that is executed when there's nothing left to run but execution is not yet at the end of the unrolled loop.
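A minimal sketch of that unrolling trick in Go (my illustration; it assumes the program has already been compiled to a []func() slice and ignores jumps):

    package main

    import "fmt"

    func main() {
        counter := 0
        prog := []func(){
            func() { counter++ },
            func() { counter += 2 },
            func() { fmt.Println("counter:", counter) },
        }

        nop := func() {} // runs when there's nothing left to do in this round

        // Pad the program so its length is a multiple of the unroll factor.
        const unroll = 4
        for len(prog)%unroll != 0 {
            prog = append(prog, nop)
        }

        // One bounds check and one loop branch per four closure calls,
        // instead of one per call.
        for i := 0; i < len(prog); i += unroll {
            prog[i]()
            prog[i+1]()
            prog[i+2]()
            prog[i+3]()
        }
    }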
I'd love to see if it's possible to create a libc-free, dependency-free executable without Nim (https://nim-lang.org/).
https://github.com/skx/simple-vm
Simpler than the full gomacro codebase, but perhaps helpful.
In particular, you should do the `switch op` at https://github.com/skx/simple-vm/blob/b3917aef0bd6c4178eed0c... outside the closure, and create a different, specialised closure for each case. Otherwise the "fast interpreter" may be almost as slow as a vanilla AST walker.
Here is a pseudocode example, adapted from gomacro sources:
https://gist.github.com/cosmos72/f971c172e71d08030f92a1fc5fa...
This works best for "compiling" statically typed languages, and while much faster than an AST interpreter, the "tree of closures" above is still ~10 times slower than natively compiled code. And it's usually also slower than JIT-compiled code.
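To illustrate the "switch outside the closure" point with something self-contained (my sketch, not gomacro's actual code; the gist above has the real thing): the switch on the opcode runs once while compiling, each case returns a closure specialised for it, and the hot loop is nothing but indirect calls:

    package main

    import "fmt"

    type vm struct{ stack []int }

    type instr struct {
        op  byte
        arg int
    }

    const (
        opPush byte = iota
        opAdd
    )

    // compile switches on the opcode once, ahead of time; the returned
    // closure contains no switch at all.
    func compile(ins instr) func(*vm) {
        switch ins.op {
        case opPush:
            n := ins.arg // runtime-known value baked into the closure
            return func(m *vm) { m.stack = append(m.stack, n) }
        case opAdd:
            return func(m *vm) {
                l := len(m.stack)
                m.stack[l-2] += m.stack[l-1]
                m.stack = m.stack[:l-1]
            }
        default:
            panic("unknown opcode")
        }
    }

    func main() {
        prog := []instr{{opPush, 2}, {opPush, 3}, {opAdd, 0}}
        code := make([]func(*vm), len(prog))
        for i, ins := range prog {
            code[i] = compile(ins)
        }
        m := &vm{}
        for _, f := range code { // the dispatch loop: plain indirect calls
            f(m)
        }
        fmt.Println(m.stack[0]) // 5
    }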
Because bind variables and prepared statements go hand in hand, you want to do everything you can to coax your users into doing the right thing. If each unique query has to be compiled before first use, that's an extra inducement to use prepared statements properly.
It's a good pattern!
The Algorand VM was focused on low overhead running thousands of tiny programs per second. Version 1 had no loops and a 1000 instruction limit.
The JVM was focused on low memory, towards possible embedded microcontroller use.
So, 'array of function pointers' is nothing new, but it is a good pattern.
https://blog.cloudflare.com/building-fast-interpreters-in-ru...
The most efficient control flow is one that describes only what your algorithm needs, coupled with an optimiser that can exploit the particular flow you're describing.
Among the many things discovered by the author of https://blog.nelhage.com/post/cpython-tail-call/ is that Clang/LLVM was able to optimise the standard switch-based interpreter loop as if it had been written with computed gotos.
Yes. I’m not saying goto is just faster in general. But some algorithms are difficult to describe with while and if (bunch of examples in Knuth).
> Clang/LLVM was able to optimise
Because it’s implicit, this is the kind of optimization that’s easy to silently regress a few weeks before ship by accidentally violating a rule.
I think unrestricted goto is not good. But an alternative is to make the semantics and scope stricter, something like a Common Lisp prog/tagbody block.
'Computed goto' is used in GCC to mean the same as assigned goto in Fortran. The Fortran variant appears to have more restrictions than the GNU C one.
https://eli.thegreenplace.net/2012/07/12/computed-goto-for-e...
The reduced branching used to bring speed improvements due to better branch prediction, though the gains have become smaller and smaller on newer CPUs as branch prediction has improved in general.
The fact that they had become increasingly minuscule was acknowledged even ten years ago:
> Nehalem shows a few outstanding speedups (in the 30%–40% range), as well as Sandy Bridge to a lesser extent, but the average speedups (geomean of individual speedups) for Nehalem, Sandy Bridge, and Haswell are respectively 10.1%, 4.2%, and 2.8% with a few outstanding values for each microarchitecture. The benefits of threaded code decreases with each new generation of microarchitecture.
goto *errOrSuccess;
is a pretty normal one. This basically allows you to emulate a "finally".

State machines are another big one; they look much nicer with computed goto than with a switch statement.
bool function_with_cleanup(void) {
    int *buffer1 = NULL;
    int *buffer2 = NULL;
    FILE *file = NULL;
    bool success = false;

    // Allocate first resource
    buffer1 = malloc(sizeof(int) * 100);
    if (!buffer1) {
        goto cleanup; // Error, jump to cleanup
    }
    // Allocate second resource
    buffer2 = malloc(sizeof(int) * 200);
    if (!buffer2) {
        goto cleanup; // Error, jump to cleanup
    }
    // Open a file
    file = fopen("data.txt", "r");
    if (!file) {
        goto cleanup; // Error, jump to cleanup
    }
    // Do work with all resources...
    success = true; // Only set to true if everything succeeded

cleanup:
    // Free resources in reverse order of acquisition
    if (file) fclose(file);
    free(buffer2); // free() is safe on NULL pointers
    free(buffer1);
    return success;
}
// The cleanup handler receives a POINTER to the variable, so free/fclose
// cannot be passed directly; small wrappers are needed.
static void cleanup_free(int **p) { free(*p); }
static void cleanup_fclose(FILE **f) { if (*f) fclose(*f); }

bool function_with_cleanup(void) {
    // Allocate first resource; freed automatically when it goes out of scope
    int *buffer1 __attribute__((__cleanup__(cleanup_free))) = malloc(sizeof(int) * 100);
    if (!buffer1) { return false; }
    // Allocate second resource
    int *buffer2 __attribute__((__cleanup__(cleanup_free))) = malloc(sizeof(int) * 200);
    if (!buffer2) { return false; }
    // Open a file
    FILE *file __attribute__((__cleanup__(cleanup_fclose))) = fopen("data.txt", "r");
    if (!file) { return false; }
    return true;
}
[0] https://thephd.dev/c2y-the-defer-technical-specification-its...

bool function_with_cleanup(void) {
    int *buffer1 = NULL;
    int *buffer2 = NULL;
    FILE *file = NULL;
    bool success = false;

    // Allocate first resource
    buffer1 = malloc(sizeof(int) * 100);
    if (buffer1) {
        // Allocate second resource
        buffer2 = malloc(sizeof(int) * 200);
        if (buffer2) {
            // Open a file
            file = fopen("data.txt", "r");
            if (file) {
                // Do work with all resources...
                fclose(file);
                success = true; // Only set to true if everything succeeded
            }
            free(buffer2);
        }
        free(buffer1);
    }
    return success;
}
Much shorter and more straightforward.

One-time loops with break also work if you're not doing the resource allocation in another loop:
bool function_with_cleanup(void) {
    int *buffer1 = NULL;
    int *buffer2 = NULL;
    FILE *file = NULL;
    bool success = false;

    do { // One-time loop to break out of on error
        // Allocate first resource
        buffer1 = malloc(sizeof(int) * 100);
        if (!buffer1) {
            break; // Error, jump to cleanup
        }
        // Allocate second resource
        buffer2 = malloc(sizeof(int) * 200);
        if (!buffer2) {
            break; // Error, jump to cleanup
        }
        // Open a file
        file = fopen("data.txt", "r");
        if (!file) {
            break; // Error, jump to cleanup
        }
        // Do work with all resources...
        success = true; // Only set to true if everything succeeded
    } while (false);

    // Free resources in reverse order of acquisition
    if (file) fclose(file);
    free(buffer2); // free() is safe on NULL pointers
    free(buffer1);
    return success;
}
Still simpler to follow than goto IMHO. Both these patterns work in other languages without goto too, e.g. Python.

And break is extremely thin sugar on top of goto.
It can get quite hilly, which doesn't look great. It does have the advantage that each resource is explicitly only valid in a visible scope, and there's a marker at the end to denote the valid region of the resource is ending.
EDIT: you mentioned early return; this style forbids early return (at least, any early return after the first resource acquisition).
And in such cases, nothing in C prevents you from splitting the code into more functions to better modularize the implementation.
bool function_with_cleanup(void) {
    int *buffer1 = NULL;
    int *buffer2 = NULL;
    FILE *file = NULL;

    // Allocate first resource
    buffer1 = malloc(sizeof(int) * 100);
    if (!buffer1) {
        return false;
    }
    // Allocate second resource
    buffer2 = malloc(sizeof(int) * 200);
    if (!buffer2) {
        free(buffer1);
        return false;
    }
    // Open a file
    file = fopen("data.txt", "r");
    if (!file) {
        free(buffer2);
        free(buffer1);
        return false;
    }
    // Do work with all resources...
    fclose(file);
    free(buffer2);
    free(buffer1);
    return true;
}
Ah, but all those free() calls get tedious; they can be forgotten or mistyped.

bool function_with_cleanup(void) {
    int *buffer1 = NULL;
    int *buffer2 = NULL;
    FILE *file = NULL;

    // Allocate first resource
    buffer1 = arena_alloc(&current_arena, sizeof(int) * 100);
    if (!buffer1) {
        return false;
    }
    // Allocate second resource
    buffer2 = arena_alloc(&current_arena, sizeof(int) * 200);
    if (!buffer2) {
        arena_reset(&current_arena);
        return false;
    }
    // Open a file
    file = fopen("data.txt", "r");
    if (!file) {
        arena_reset(&current_arena);
        return false;
    }
    // Do work with all resources...
    fclose(file);
    arena_reset(&current_arena);
    return true;
}
Can still be improved with a mix of macros and varargs functions.

Or if using language extensions is a thing, the various ways to do defer in C.
You clearly can in Go. How would you deal with exceptions otherwise? You can even get silly with it:
func main() {
    try(func() {
        throw("Oops")
    }).catch(func(e any) {
        fmt.Println("Caught:", e)
    })
}
https://go.dev/play/p/Y1-w9xUXIcO

It is a tool, not a religion.
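The try/throw/catch helpers aren't shown in the snippet; here is one plausible way to build them on panic/recover (a sketch matching the shape used above, not necessarily what the playground contains):

    package main

    import "fmt"

    type caught struct{ err any }

    // try runs f and recovers any panic into the returned value.
    func try(f func()) (c *caught) {
        c = &caught{}
        defer func() {
            if e := recover(); e != nil {
                c.err = e
            }
        }()
        f()
        return c
    }

    // throw raises an "exception" via panic.
    func throw(v any) { panic(v) }

    // catch invokes handler only if try recovered a panic.
    func (c *caught) catch(handler func(e any)) {
        if c.err != nil {
            handler(c.err)
        }
    }

    func main() {
        try(func() {
            throw("Oops")
        }).catch(func(e any) {
            fmt.Println("Caught:", e)
        })
    }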
- https://expr-lang.org/blog/exploring-virtual-machines-for-em...
It’s quite nice and only a small bit more code than an AST interpreter, but not as fast as a LuaJIT-style tail-calling bytecode VM, nor a baseline JIT like copy-and-patch. It works wonders in languages that support closures though.
https://www-labs.iro.umontreal.ca/~feeley/papers/FeeleyLapal...