252 points by ksec 10 days ago | 14 comments
  • MatthiasPortzel 9 days ago
    It's crazy that this post seems to have stumbled across an equivalent to the Copy-and-Patch technique[0] used to create a Lua interpreter faster than LuaJIT[1].

    [0]: https://sillycross.github.io/2023/05/12/2023-05-12/
    [1]: https://sillycross.github.io/2022/11/22/2022-11-22/

    The major difference is that LuaJIT Remake's Copy-and-Patch requires "manually" copying blocks of assembly code and patching values, while this post relies on the Go compiler's closures to create copies of the functions with runtime-known values.

    I think there's fascinating progress being made in this area—I think in the future this technique (in some form) will be the go-to way to create new interpreted languages, and AST interpreters, switch-based bytecode VMs, and JIT compilation will be things of the past.
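A minimal Go sketch of the closure technique described above (names are illustrative, not from the article): each "compile" step bakes runtime-known values into a closure, so evaluation is a chain of direct calls instead of an AST walk.

```go
package main

import "fmt"

// Expr is a "compiled" expression: a closure that evaluates
// against a row of values.
type Expr func(row []int64) int64

// compileConst bakes a runtime-known constant into the closure.
func compileConst(c int64) Expr {
	return func(row []int64) int64 { return c }
}

// compileCol bakes a column index into the closure.
func compileCol(i int) Expr {
	return func(row []int64) int64 { return row[i] }
}

// compileAdd composes two already-compiled sub-expressions.
func compileAdd(a, b Expr) Expr {
	return func(row []int64) int64 { return a(row) + b(row) }
}

func main() {
	// col0 + 10, compiled once, then evaluated per row with no tree walk.
	e := compileAdd(compileCol(0), compileConst(10))
	fmt.Println(e([]int64{32})) // 42
}
```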

    • rockwotj 9 days ago
      It’s not really copy-and-patch. The whole point of copy-and-patch is that you can inline it in your compilation output: it’s a fast baseline interpreter because the individual builtin functions are optimized (via the compiler output you’re copying from) and inlined (which is why you need to patch, to update which registers are being used). In this model you JIT only the control flow, really, then inline the implementation of each bytecode operation (in contrast to Sparkplug [https://v8.dev/blog/sparkplug], which just calls a builtin instead of copying and patching). It’s still a JIT, which is vastly different from an interpreter.

      > JIT will be things of the past

      Sorry, no. JIT is not going anywhere. They mentioned in the article that a JIT would give better performance, just with more effort than they wanted to put in (a good tradeoff!). JITs power the Java, Wasm, and JavaScript VMs, and they are certainly the way to get the fastest code, because you can give the CPU code for which it can do a much better job of predicting the next instruction. With interpreters you’re often limited by the indirection of the loads when looking up the next function to call, and generating code for the control flow outside of calls to your “builtins” is precisely what Sparkplug does.

      At the end of the day, like most of engineering, choose the right tool for the job, which in this case is simplicity (which is often the right choice!), but that doesn’t mean it’s always the right choice. For example if browsers did this then Javascript performance would tank compared to what we get today.

    • frumplestlatz 9 days ago
      The JVM has had a template interpreter since the mid-90s, so it’s not anything new, and template interpreters are only performant enough to provide acceptable execution speed until you JIT.

      Template interpreters are not a substitute for real JIT — JIT compilation isn’t going anywhere.

      • cogman10 9 days ago
        My understanding of most optimizing compilers is that this is an extremely common "last step" sort of optimization. A lot of the optimizing work is beating the code into a canonical form where these sorts of templates can be readily applied.

        It was also my understanding that this is the point of "superoptimizers" [1], which look for these common patterns in something like LLVM IR to generate optimization targets for the mainline optimizer.

        [1] https://en.wikipedia.org/wiki/Superoptimization

    • deagle50 9 days ago
      Faster than LuaJIT's interpreter, not the compiled code.
    • cryptonector 9 days ago
      > It's crazy that this post seems to have stumbled across an equivalent to the Copy-and-Patch technique[0] used to create a Lua interpreter faster than LuaJIT[1]

      > this post relies on the Go compiler's closures to create copies of the functions with runtime-known values

      To be clear, the technique of using closures like this is ancient in the world of LISP. You can see it in Paul Graham's books on LISP from the 90s, in LiSP in Small Pieces, and in many interpreters of 80s/90s vintage. I would say that it is quite standard.

    • bob1029 9 days ago
      > switch-based bytecode VMs

      I am finding a switch-based bytecode interpreter to be very expedient on my computer. It seems that if the # of cases is kept small enough, your chances of getting a good branch prediction go up substantially. Something like a brainfuck interpreter runs extremely fast. In the worst case of randomly guessing, you are still going to time travel with a 12.5% success rate.
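A toy illustration of the point above, with a deliberately tiny (made-up) opcode set so the dispatch switch stays small and predictable:

```go
package main

import "fmt"

// A tiny switch-dispatched stack VM. The opcodes are invented
// for illustration; a brainfuck interpreter would be similar.
const (
	opPush byte = iota // push the next byte as a value
	opAdd              // pop two values, push their sum
	opHalt             // stop and return the top of the stack
)

func run(code []byte) int {
	var stack []int
	for pc := 0; pc < len(code); {
		switch code[pc] {
		case opPush:
			stack = append(stack, int(code[pc+1]))
			pc += 2
		case opAdd:
			n := len(stack)
			stack[n-2] += stack[n-1]
			stack = stack[:n-1]
			pc++
		case opHalt:
			return stack[len(stack)-1]
		}
	}
	return 0
}

func main() {
	fmt.Println(run([]byte{opPush, 40, opPush, 2, opAdd, opHalt})) // 42
}
```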

      • josefx 9 days ago
        > In the worst case of randomly guessing, you are still going to time travel with a 12.5% success rate.

        Random guessing clearly isn't the worst case then, a bad prediction can miss every single time.

  • cosmos0072 9 days ago
    Compiling an expression to a tree of closures, and a list of statements to a slice of closures, is exactly how I optimized [gomacro](https://github.com/cosmos72/gomacro), my Go interpreter written in Go.

    There are more tricks available there, as for example unrolling the loop that calls the list of closures, and having a `nop` closure that is executed when there's nothing to run but execution is not yet at the end of the unrolled loop.
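The unrolling trick described above might look roughly like this (assumed details; gomacro's actual code differs): the slice of statement closures is padded with `nop`s to a multiple of the unroll factor, so the dispatch loop can call a fixed number of closures per iteration.

```go
package main

import "fmt"

// A statement compiled to a closure that mutates an environment.
type stmt func(env *int)

// nop runs when there is nothing left to execute but the
// unrolled loop body has slots remaining.
func nop(*int) {}

func runUnrolled(stmts []stmt, env *int) {
	const unroll = 4
	// Pad with nops so the length is a multiple of the unroll factor.
	for len(stmts)%unroll != 0 {
		stmts = append(stmts, nop)
	}
	// One loop-condition check per four statement calls.
	for i := 0; i < len(stmts); i += unroll {
		stmts[i](env)
		stmts[i+1](env)
		stmts[i+2](env)
		stmts[i+3](env)
	}
}

func main() {
	add := func(n int) stmt { return func(e *int) { *e += n } }
	env := 0
	runUnrolled([]stmt{add(1), add(2), add(3)}, &env)
	fmt.Println(env) // 6
}
```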

    • ALLTaken 9 days ago
      Impressive! I would love to learn how to implement that in Julia. Could you help me understand how you did that?

      I'd love to see, if it's possible to create a libc-free, dependency-free executable without Nim (https://nim-lang.org/).

      • stevekemp 8 days ago
        I put together this example after reading the article:

        https://github.com/skx/simple-vm

        Simpler than the full gomacro codebase, but perhaps helpful.

        • cosmos0072 8 days ago
          For optimal speed, you should move as much code as possible outside the closures.

          In particular, you should do the `switch op` at https://github.com/skx/simple-vm/blob/b3917aef0bd6c4178eed0c... outside the closure, and create a different, specialised closure for each case. Otherwise the "fast interpreter" may be almost as slow as a vanilla AST walker.

          • stevekemp 8 days ago
            That's fair, thanks for taking the time to look and giving the feedback.
      • cosmos0072 9 days ago
        The core idea is simple: do a type analysis on each expression you want to "compile" to a closure, and instantiate the correct closure for each type combination.

        Here is a pseudocode example, adapted from gomacro sources:

        https://gist.github.com/cosmos72/f971c172e71d08030f92a1fc5fa...

        This works best for "compiling" statically typed languages, and while much faster than an AST interpreter, the "tree of closures" above is still ~10 times slower than natively compiled code. And it's usually also slower than JIT-compiled code.
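A minimal sketch of that idea (assumed shape, not actual gomacro code): the type switch runs once during "compilation", and the returned closure is specialized to the operand types it discovered.

```go
package main

import "fmt"

// eval is a compiled expression specialized at compile time.
type eval func() interface{}

// compileTypedAdd analyzes operand types once and returns a
// closure with no type switching on the evaluation path.
func compileTypedAdd(x, y interface{}) eval {
	switch xv := x.(type) {
	case int:
		yv := y.(int)
		return func() interface{} { return xv + yv } // int+int specialization
	case string:
		yv := y.(string)
		return func() interface{} { return xv + yv } // string concatenation specialization
	}
	return func() interface{} { return nil }
}

func main() {
	fmt.Println(compileTypedAdd(40, 2)())    // 42
	fmt.Println(compileTypedAdd("a", "b")()) // ab
}
```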

  • hinkley 9 days ago
    Compiling queries is one of those things that is both Make it Right and Make it Fast.

    Because bind variables and Prepared Statements go hand in hand, you want to do everything you can to coax your users into doing the right thing. If each unique query has to be compiled before first use, that’s an extra inducement to using prepared statements properly.
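That inducement can be sketched as a plan cache keyed by query text (a hypothetical API, not any particular database's): only the first use of each unique query text pays the compile cost, so prepared statements with bind variables hit the same cached entry every time.

```go
package main

import "fmt"

// compiled stands in for a compiled query plan.
type compiled func(args []int) int

var (
	cache    = map[string]compiled{}
	compiles int // counts how many times we paid the compile cost
)

// prepare returns the cached plan for sql, compiling on first use.
func prepare(sql string) compiled {
	if c, ok := cache[sql]; ok {
		return c
	}
	compiles++ // the expensive step happens once per unique query text
	c := func(args []int) int { return len(args) } // stand-in for real execution
	cache[sql] = c
	return c
}

func main() {
	prepare("SELECT * FROM t WHERE id = ?")([]int{1})
	prepare("SELECT * FROM t WHERE id = ?")([]int{2}) // cache hit: no recompile
	fmt.Println(compiles)                             // 1
}
```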

  • brianolson 9 days ago
    I built a 'slice of function pointers' bytecode interpreter in Go in 2019 for the Algorand VM (Blockchain smart contract stuff) and before that the same pattern in C for a toy JVM around 2005.

    It's a good pattern!

    The Algorand VM was focused on low overhead running thousands of tiny programs per second. Version 1 had no loops and a 1000 instruction limit.

    The JVM was focused on low memory, towards possible embedded microcontroller use.

    So, 'array of function pointers' is nothing new, but it is a good pattern.

  • rockwotj 9 days ago
    The resulting VM implementation reminds me of this post from Cloudflare, where they similarly use closures to build their interpreter:

    https://blog.cloudflare.com/building-fast-interpreters-in-ru...

  • giovannibonetti 9 days ago
    The article mentions the struggle to speed up expression computation on MySQL due to its dynamic typing. I wonder if Postgres' more static type system works better in that regard.
  • returningfory2 8 days ago
    Really interesting technique. I guess one of the (theoretical?) drawbacks is the memory usage is not ideal? In a traditional bytecode compiler the result of compilation is a very-cache-friendly binary array, whereas here it looks like a vector of pointers to heap allocated closures.
  • politician 9 days ago
    It would be ideal if Go would add support for computed goto, so that we could build direct threaded interpreters.
    • nemo1618 9 days ago
      Is computed goto used for anything other than interpreter loops? Because if not, I would rather have a special "it looks like you're trying to implement an interpreter loop" case in the compiler than add new syntax.
      • grandempire 9 days ago
        Yes. Goto is useful for efficient control flow in many stateful algorithms.
        • dwattttt 9 days ago
          I don't know that _efficient_ is the best word. If you use goto to force a particular control flow rather than a more constrained form of control flow (e.g. if), you make it harder for the optimiser to work; it can only make "as-if" changes, ones that mean the code that executes looks as if it's what you wrote.

          The most efficient control flow is one that describes only what your algorithm needs, coupled with an optimiser that can exploit the particular flow you're describing.

          Among the many things discovered by the author of https://blog.nelhage.com/post/cpython-tail-call/, Clang/LLVM was able to optimise the standard switch based interpreter loop as if it had been written with computed gotos.

          • grandempire 9 days ago
            > The most efficient control flow is one that describes only what your algorithm needs,

            Yes. I’m not saying goto is just faster in general. But some algorithms are difficult to describe with while and if (bunch of examples in Knuth).

            > Clang/LLVM was able to optimise

            Because it’s implicit, this is the kind of optimization that’s easy to silently regress a few weeks before ship by accidentally violating a rule.

            I think unrestricted goto is not good. But an alternative is to make the semantics and scope stricter, something like a Common Lisp prog/tagbody block.

    • pklausler 9 days ago
      Did you mean "assigned GOTO", not computed GOTO? Because that's just a switch list.
      • knome 9 days ago
        https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html

        'computed goto' is used in GCC to mean the same thing as assigned goto in Fortran. The Fortran variant appears to have more restrictions than the GNU C one.

        • johnisgood 9 days ago
          I may be dumb, but what is "goto *ptr;" useful for? Or &&foo.
          • trealira 8 days ago
            It's used in VMs, emulators, and other interpreters for making a dispatch loop with less branching than a standard loop with a switch inside it.

            https://eli.thegreenplace.net/2012/07/12/computed-goto-for-e...

            • rurban 6 days ago
              No, it has the same number of branches as a switch inside. The only difference is that computed goto switches miss the initial range check. And proper switch implementations by a better compiler would be more optimizable. But we are waiting more than 20 years for that already. Sparse switch dispatch tables are also compiled horribly.
              • trealira 6 days ago
                No, it doesn't, because in a switch with a loop, at the end of each case (after the break), the instructions have an unconditional jump to the top of the switch again, where it performs the indirect jump. With the code using computed goto, the indirect jump is at the end of every single case of the switch. You can see this in the assembly the author provides. I'm ignoring cases where there's fallthrough to another case of the switch, which is sometimes done in interpreters like these. And though I wasn't thinking of it, the lack of a bounds check also reduces the number of branches.

                The reduced branching used to bring speed improvements due to better branch prediction, though those improvements have become smaller and smaller on newer CPUs as branch prediction has improved in general.

                The fact that they had become increasingly minuscule was acknowledged even ten years ago:

                Nehalem shows a few outstanding speedups (in the 30%–40% range), as well as Sandy Bridge to a lesser extent, but the average speedups (geomean of individual speedups) for Nehalem, Sandy Bridge, and Haswell are respectively 10.1%, 4.2%, and 2.8% with a few outstanding values for each microarchitecture. The benefits of threaded code decreases with each new generation of microarchitecture.

                https://inria.hal.science/hal-01100647/document

          • withinboredom 9 days ago
            error conditions when you don't have exceptions:

                goto *errOrSuccess;
            
            is a pretty normal one. This basically allows you to emulate a "finally".

            State machines are another big one, which look much nicer with computed goto than a switch statement.

    • zabzonk 9 days ago
      I don't think "ideal" would be the exact word - I used to shudder at it in FORTRAN.
    • 9dev 9 days ago
      Go to goto, go!
    • kiitos 9 days ago
      hard no - goto is a feature for code generators, not anything meant for human use
      • greenavocado 9 days ago
        I never understood this argument. Without RAII, you can easily get screwed by resource leaks when returning early, and goto cleanup avoids that. In this regard, using goto is expedient. How do C programmers avoid this problem without goto?

            bool function_with_cleanup(void) {
                int *buffer1 = NULL;
                int *buffer2 = NULL;
                FILE *file = NULL;
                bool success = false;
                
                // Allocate first resource
                buffer1 = malloc(sizeof(int) * 100);
                if (!buffer1) {
                    goto cleanup;  // Error, jump to cleanup
                }
                
                // Allocate second resource
                buffer2 = malloc(sizeof(int) * 200);
                if (!buffer2) {
                    goto cleanup;  // Error, jump to cleanup
                }
                
                // Open a file
                file = fopen("data.txt", "r");
                if (!file) {
                    goto cleanup;  // Error, jump to cleanup
                }
                
                // Do work with all resources...
                
                success = true;  // Only set to true if everything succeeded
                
            cleanup:
                // Free resources in reverse order of acquisition
                if (file) fclose(file);
                free(buffer2);  // free() is safe on NULL pointers
                free(buffer1);
                
                return success;
            }
        • shakna 9 days ago
          Attributes, mostly, which have become so common that defer is very likely to be in the next C standard. [0]

              // __cleanup__ calls its handler with a pointer to the variable
              // (int ** / FILE **), so free/fclose need small wrappers:
              static void free_intp(int **p) { free(*p); }
              static void close_filep(FILE **f) { if (*f) fclose(*f); }

              bool function_with_cleanup(void) {
                  // Allocate first resource
                  int* buffer1 __attribute__((__cleanup__(free_intp))) = malloc(sizeof(int) * 100);
                  if(!buffer1) { return false; }

                  // Allocate second resource
                  int* buffer2 __attribute__((__cleanup__(free_intp))) = malloc(sizeof(int) * 200);
                  if(!buffer2) { return false; }

                  // Open a file
                  FILE* file __attribute__((__cleanup__(close_filep))) = fopen("data.txt", "r");
                  if (!file) { return false; }

                  return true;
              }
          
          
          [0] https://thephd.dev/c2y-the-defer-technical-specification-its...
        • mkl 9 days ago
          Nested ifs are my preference:

            bool function_with_cleanup(void) {
                int *buffer1 = NULL;
                int *buffer2 = NULL;
                FILE *file = NULL;
                bool success = false;
            
                // Allocate first resource
                buffer1 = malloc(sizeof(int) * 100);
                if (buffer1) {
                    // Allocate second resource
                    buffer2 = malloc(sizeof(int) * 200);
                    if (buffer2) {
                        // Open a file
                        file = fopen("data.txt", "r");
                        if (file) {
                            // Do work with all resources...
                            fclose(file);
                            success = true;  // Only set to true if everything succeeded
                        }
                        free(buffer2);
                    }
                    free(buffer1);
                }
            
                return success;
            }
          
          Much shorter and more straightforward.

          One-time loops with break also work if you're not doing the resource allocation in another loop:

            bool function_with_cleanup(void) {
                int *buffer1 = NULL;
                int *buffer2 = NULL;
                FILE *file = NULL;
                bool success = false;
            
                do { // One-time loop to break out of on error
                    // Allocate first resource
                    buffer1 = malloc(sizeof(int) * 100);
                    if (!buffer1) {
                        break;  // Error, jump to cleanup
                    }
            
                    // Allocate second resource
                    buffer2 = malloc(sizeof(int) * 200);
                    if (!buffer2) {
                        break;  // Error, jump to cleanup
                    }
            
                    // Open a file
                    file = fopen("data.txt", "r");
                    if (!file) {
                        break;  // Error, jump to cleanup
                    }
            
                    // Do work with all resources...
            
                    success = true;  // Only set to true if everything succeeded
                } while(false);
            
                // Free resources in reverse order of acquisition
                if (file) fclose(file);
                free(buffer2);  // free() is safe on NULL pointers
                free(buffer1);
            
                return success;
            }
          
          Still simpler to follow than goto IMHO. Both these patterns work in other languages without goto too, e.g. Python.
          • gpderetta 9 days ago
            Nested if, aside from being awful, doesn't scale.

            And break is extremely thin sugar on top of goto.

        • dwattttt 9 days ago
          Open a scope when you check that resource acquisition succeeded, rather than the opposite (jumping to the end of the function if it failed).

          It can get quite hilly, which doesn't look great. It does have the advantage that each resource is explicitly only valid in a visible scope, and there's a marker at the end to denote the valid region of the resource is ending.

          EDIT: you mentioned early return, this style forbids early return (at least, any early return after the first resource acquisition)

        • gigatexal 9 days ago
          In this example, couldn’t the `goto cleanup` instead be a `return cleanup_func(...)`, where the same cleanup code is executed?
        • pjmlp 9 days ago
          Maybe that is exactly the problem: stop using a language designed in the 1970s that purposely ignored the ecosystem outside Bell Labs, except where it is unavoidable.

          And in such cases, the C compiler doesn't put a limit on writing more functions to better modularize the implementation.

              bool function_with_cleanup(void) {
                  int *buffer1 = NULL;
                  int *buffer2 = NULL;
                  FILE *file = NULL;        
                  // Allocate first resource
                  buffer1 = malloc(sizeof(int) * 100);
                  if (!buffer1) {
                      return false;
                  }
                  
                  // Allocate second resource
                  buffer2 = malloc(sizeof(int) * 200);
                  if (!buffer2) {
                      free(buffer1);
                      return false;
                  }
                  
                  // Open a file
                  file = fopen("data.txt", "r");
                  if (!file) {
                      free(buffer1);
                      free(buffer1);
                      return false;
                  }
                  
                  // Do work with all resources...
                  fclose(file);
                  free(buffer1);
                  free(buffer1);
          
                  return true;
              }
          
          Ah, but all those free() calls get tedious, can be forgotten, and can be mistyped (like the free(buffer1) duplicated above where free(buffer2) was meant).

              bool function_with_cleanup(void) {
                  int *buffer1 = NULL;
                  int *buffer2 = NULL;
                  FILE *file = NULL;        
                  // Allocate first resource
                  buffer1 = arena_alloc(&current_arena, sizeof(int) * 100);
                  if (!buffer1) {
                      return false;
                  }
                  
                  // Allocate second resource
                  buffer2 = arena_alloc(&current_arena, sizeof(int) * 200);
                  if (!buffer2) {
                      arena_reset(&current_arena); 
                      return false;
                  }
                  
                  // Open a file
                  file = fopen("data.txt", "r");
                  if (!file) {
                      arena_reset(&current_arena);
                      return false;
                  }
                  
                  // Do work with all resources...
                  fclose(file);
                  arena_reset(&current_arena);
                  return true;
               }
          
          Can still be improved with a mix of macros and varargs functions.

          Or if using language extensions is a thing, the various ways to do defer in C.

          • LtWorf 9 days ago
            You understand Go doesn't have exceptions, right?
            • pjmlp 9 days ago
              You understand that I wrote C code? And as far as Go is concerned, panic/recover are exceptions that don't want to admit to being exceptions.
              • LtWorf 8 days ago
                You can't handle a panic, it's the whole point.
                • 9rx 8 days ago
                  In C, you mean?

                  You clearly can in Go. How would you deal with exceptions otherwise? You can even get silly with it:

                      func main() {
                          try(func() {
                              throw("Oops")
                          }).catch(func(e any) {
                              fmt.Println("Caught:", e)
                          })
                      }
                  
                  https://go.dev/play/p/Y1-w9xUXIcO
      • jandrewrogers 9 days ago
        There are situations in practical code where goto is the only reasonable choice if you want to avoid spaghetti code. Absolutes have no place in software, you can find scenarios where almost every typically bad idea is actually a good idea.

        It is a tool, not a religion.

        • 9rx 9 days ago
          Even Dijkstra was clear that he was only referring to unbridled gotos.
  • medv 8 days ago
    This is a very interesting article! At https://expr-lang.org we did our own research and wrote a post comparing different VM implementations in Go:

    - https://expr-lang.org/blog/exploring-virtual-machines-for-em...

  • djwatson24 8 days ago
    Marc Feeley wrote "Using Closures for Code Generation" way back in 1987. Everything old is new again!

    It’s quite nice, and only a small bit more code than an AST interpreter - but not as fast as a LuaJIT-style tail-calling bytecode VM, nor a baseline JIT like copy-and-patch. It works wonders in languages that support closures, though.

    https://www-labs.iro.umontreal.ca/~feeley/papers/FeeleyLapal...

  • Thaxll 9 days ago
    This looks very similar to how "basic" emulators work.
  • rajandatta 9 days ago
    Nice article. Instructive to see the latest trends and direction for high performance interpreters. Thanks for sharing.
  • kiitos 9 days ago
    vitess is not the answer
    • ksec 9 days ago
      Maybe you want to explain yourself rather than just drop a statement? Vitess is battle tested and used by a lot of large enterprises.
      • gigatexal 9 days ago
        Yeah, it’s literally the monetization of the DB setup that powered YouTube — pretty much the best battle testing imo.
        • ksec 8 days ago
          It is also used by GitHub.
          • gigatexal 5 days ago
            Interesting. I had no idea GitHub was a PlanetScale shop.