So there is not much to gain from JITing the query plan execution only.
JITing begins to make more sense, when the individual query plan steps (join, filter, ...) themselves be specialized/recompiled/improved/merged by knowing the context of the query plan.
- not all databases need migrations (or migrations without downtime)
- alternatively, ship the migrations as part of the binary
Adhoc modifications would still be more difficult but tbh that’s not necessarily a bug
The problems related to PostgreSQL are pretty much all described here. It's very difficult to do low-latency queries if you cannot cache the compiled code and do it over and over again. And once your JIT is slow you need a logic to decide whether to interpret or compile.
I think it would be the best to start interpreting the query and start compilation in another thread, and once the compilation is finished and interpreter still running, stop the interpreter and run the JIT compiled code. This would give you the best latency, because there would be no waiting for JIT compiler.
This is not too difficult, it just requires a different execution style. Salesforce's Hyper for example very heavily relies on JIT compilation, as does Umbra [1], which some people regard as one of the fastest databases right now. Umbra doesn't cache any IR or compiled code and still has an extremely low start-up latency; an interpreter exists but is practically never used.
Postgres is very robust and very powerful, but simply not designed for fast execution of queries.
Disclosure: I work in the group that develops Umbra.
This technique is known as a "tiered JIT". It's how production virtual machines operate for high-level languages like JavaScript.
There can be many tiers, like an interpreter, baseline compiler, optimizing compiler, etc. The runtime switches into the faster tier once it becomes ready.
More info for the interested:
Most other DB's cache query plans including jitted code so they are basically precompiled from one request to the next with the same statement.
"This obvious drawback of the current software architecture motivates our work: sharing JIT code caches across applications. During the exploration of this idea, we have encountered several challenges. First of all, most JIT compilers leverage both runtime context and profile information to generate optimized code. The compiled code may be embedded with runtime-specific pointers, simplified through unique class-hierarchy analysis, or inlined recursively. Each of these "improve- ments" can decrease the shareability of JIT compiled code."
Anythings doable here with enough dev time. Would be nice if PG could just serialize the query plan itself maybe just as an SO along with non-process specific executable code that then has to be dynamically linked again in other processes.
I was actually confused by this submission as it puts so much of an emphasis on initial compilation time, when every DB (apparently except for pgsql) caches that result and shares it/reuses it until invalidation. Invalidation can occur for a wide variety of reasons (data composition changing, age, etc), but still the idea of redoing it on every query, where most DBs see the same queries endlessly, is insane.
To make code shareable between processes takes effort and will have tradeoff in performance since it is not specialized to the process.
If the query plan where at least serializable which is more like a AST then at least that part could be reused and then maybe have jitted code in each processes cached in memory that the plan can reference by some key.
DB's like MSSQL avoid the problem because they run a single OS process with multiple threads instead. This is also why it can handle more connections easily since each connection is not a whole process.
https://www.postgresql.org/docs/current/parallel-query.html
"PostgreSQL can devise query plans that can leverage multiple CPUs in order to answer queries faster."
If process based then they can send small parts of plan across processes.
Plans for prepared statements are cached though.
Other databases like MSSQL have prepared statements but they are rarely used now days since plan caching based on query text was introduced decades ago.
I don't know anything here, but this seems like a good case for ahead of time compilation? Or at least caching your JIT results? I can image much of the time, you are getting more or less the same query again and again?
Some years ago we ported some code from querying out the data and tallying in Python (how many are in each bucket) to using SQL to do that. It didn't speed up the execution. I was surprised by that, but I guess the Postgres interpreter is roughly the same speed as Python, which when you think about it perhaps isn't that surprising.
But Python is truly general purpose while the core query stuff in SQL is really specialized (we were not using stored procedures). So if Pypy can get 5x speedup, it seems to me that it should be possible to get the same kind of speed up in Postgres. I guess it needs funding and someone as smart as the Pypy people.
And then the issue is not dissimilar to Postgres’s planner issues.
MS SQL still has prepared statements and they really haven't been used in 20 years since it gained the ability to cache plans based on statement text.
Was common guidance back in the day to use stored procedures for all application access code because they where cached in MSSQL (which PG doesn't even do). Then around 2000 it started caching based on statement text and that became much less important.
You would only used prepared statements if doing a bunch of inserts in a loop or something and it has a very small benefit now days only because its not sending the same text over the network over and over and hashing to lookup plan.
Ad-hoc one off queries usually can accept higher initial up-front compile cost because the main results usually take much longer anyway, vs worrying about an extra 100ms of compile.
Maybe it was too strong to say its not a concern at all, but nothing like PG where every single request needs to replan and potentially jit unless the client manually prepares and keeps the connection open.
Really amazed to see not one but several generic JIT frameworks though, no idea that was a thing.
* Latency. LLM responses are measured in order of 1000s of milliseconds, where this project targets 10s of milliseconds, that's off by almost two orders of magnitute.
* Determinism. LLMs are inherently non-deterministic. Even with temperature=0, slight variations of the input lead to major changes in output. You really don't want your DB to be non-deterministic, ever.
This isn't true, and certainly not inherently so.
Changes to input leading to changes in output does not violate determinism.
From what I understand, in practice it often is true[1]:
Matrix multiplication should be “independent” along every element in the batch — neither the other elements in the batch nor how large the batch is should affect the computation results of a specific element in the batch. However, as we can observe empirically, this isn’t true.
In other words, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism.
[1]: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
"But why aren’t LLM inference engines deterministic? One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first."
From https://thinkingmachines.ai/blog/defeating-nondeterminism-in...