Edit: I looked over some of the code.
It's not good. It's certainly not anywhere near SQLite's quality, performance, or codebase size. Many elements are the most basic thing that could possibly work, or else missing entirely. To name some examples:
- Absolutely no concurrency.
- The B-tree implementation has a line "// TODO: Free old overflow pages if any."
- When the pager adds a page to the free list, it does a linear search through the entire free list (which can get arbitrarily large) just to make sure the page isn't in the list already; a set kept alongside the list would make that check O(1), as in the sketch after this list.
- "//! The current planner scope is intentionally small: - recognize single-table `WHERE` predicates that can use an index - choose between full table scan and index-driven lookup."
- The pager calls clone() on large buffers, which is needlessly inefficient, kind of a newbie Rust mistake.
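On the free-list point, here is a minimal sketch of the O(1) duplicate check. The names (PageId, FreeList) are hypothetical, purely illustrative, and not from the actual codebase:

    use std::collections::HashSet;

    type PageId = u32;

    // Keep a HashSet beside the ordered list so membership checks don't
    // require scanning the whole free list.
    struct FreeList {
        order: Vec<PageId>,       // allocation order, as the on-disk list would store it
        members: HashSet<PageId>, // fast membership check
    }

    impl FreeList {
        fn new() -> Self {
            Self { order: Vec::new(), members: HashSet::new() }
        }

        /// Returns false if the page was already free (a double-free upstream).
        fn push(&mut self, page: PageId) -> bool {
            if !self.members.insert(page) {
                return false; // already present; no O(n) scan needed
            }
            self.order.push(page);
            true
        }

        fn pop(&mut self) -> Option<PageId> {
            let page = self.order.pop()?;
            self.members.remove(&page);
            Some(page)
        }
    }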
However…
It does seem like a codebase that would basically work. At a high level, it has the necessary components and the architecture isn't insane. I'm sure there are bugs, but I think the AI could iron out the bugs, given some more time spent working on testing. And at that point, I think it could be perfectly suitable as an embedded database for some application, as long as you don't have complex needs.
In practice, there is little reason not to just reach for actual SQLite, which is much more sophisticated. But I can think of one possible reason: SQLite has been known to have memory safety vulnerabilities, whereas this codebase is written in Rust with no unsafe code. It might eat your data, but it won't corrupt memory.
That is impressive enough for now, I think.
Lucky that SQLite is also robust against random process death.
- if you're not passing SQLite's open test suite, you didn't build SQLite
- this is a "draw the rest of the owl" scenario; in order to transform this into something passing the suite, you'd need an expert in writing databases
These projects are misnamed. People didn't build Counter-Strike, a browser, a C compiler, or SQLite solely with coding agents. You can't use them for that purpose: you can't drop this in for virtually any use case of SQLite. They're simulacra (slopulacra?), and their true use is as a prop in a huge grift: tricking people (including, and most especially, the creators) into thinking this will be an economical way to build complex software products in the future.
Just start the MCP server in the SQLite repo. We have clear SOTA on re-creating existing projects starting from their test suite.
> "2. Safe languages insert additional machine branches to do things like verify that array accesses are in-bounds. In correct code, those branches are never taken. That means that the machine code cannot be 100% branch tested, which is an important component of SQLite's quality strategy."
'Safe' languages don't need to do that: if they can verify at compile time that the array access is always in bounds, they don't need to emit any code to check it. That aside, it seems like they are saying:
    for (int i = 0; i < 10; i++) {
        foo(array[i]);
    }

in C might become the equivalent of:

    for (int i = 0; i < 10; i++) {
        if (i >= array_lower && i < array_higher) {
            foo(array[i]);
        } else {
            abort(); // out of bounds, should never happen
        }
    }
in a 'safe' language, where i will always be inside the array bounds, so there is no way to test the 'else' branch? But that can't be among SQLite's branch tests as you claim, because the C code does not have a branch there to test.

Either way, it seems hard to argue that a bounds check which can never fail makes the code less reliable and less trustworthy than the same code without one. The argument is "you can't test the code path where the bounds check, which can never fail, fails" - but you could equally argue "what if the C code for the array access, which is correct, sometimes doesn't run correctly? You can't test for that either."
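For what it's worth, here is a minimal Rust sketch of both cases (the function names are hypothetical, purely illustrative): in the indexed loop the optimizer can usually prove the access is in bounds and drop the check, and the iterator form never emits an index check at all.

    fn sum_indexed(xs: &[u64]) -> u64 {
        let mut total = 0;
        for i in 0..xs.len() {
            // xs[i] would bounds-check in general, but the optimizer can see
            // that i < xs.len() here and usually drops the check entirely.
            total += xs[i];
        }
        total
    }

    fn sum_iter(xs: &[u64]) -> u64 {
        // Iterator form: there is no index at all, so no check is ever emitted.
        xs.iter().sum()
    }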
    do_thing().expect(...);
This branch is required, even if it can't be reached, because the type system requires the error case to be handled. It's not possible to test this branch, therefore 100% branch coverage is impossible in those cases.

SQLite apparently has 2 million tests! If you started only with that and set your agentic swarm against it, and the stars aligned and you ended up with a pristine, clean-room replica that passes everything, then other than proof that it could be done, what did you achieve? You stood on the shoulders of giants to build a Bizarro World giant that gets you exactly back to where you began?
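To make the unreachable-branch point concrete, a minimal Rust sketch (hypothetical code, not from SQLite or the clone):

    fn read_byte(buf: &[u8], i: usize) -> u8 {
        // get() returns Option<&u8>; expect() compiles to a branch that
        // panics on None. If callers always pass a valid index, that branch
        // is dead code no test can reach, yet it must exist in the machine code.
        *buf.get(i).expect("index verified by caller")
    }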
I'd be more interested in forking SQLite as-is, setting a swarm of agents against it with the looping task to create novel things on top of what already exists, and see what comes out.
I agree that this current implementation is not very useful. I would not trust it where I trust SQLite.
Regardless, the potential for having agents build clean room implementations of existing systems from existing tests has value.
Why? The combinatorics of “just try things until you get it right” makes this impractical.
I believe it's an ad. Everything about it is trying so hard to seem legit and it's the most pointless thing I have ever seen.
I've lost every single shred of confidence I had in the comment's more optimistic claims the moment I read this.
If you read through SQLite's CVE history, you'll notice most of those are spurious at best.
Some more context here: https://sqlite.org/cves.html
> All historical vulnerabilities reported against SQLite require at least one of these preconditions:
> 1. ...
> 2. The attacker can submit a maliciously crafted database file to the application that the application will then open and query.
> Few real-world applications meet either of these preconditions, and hence few real-world applications are vulnerable, even if they use older and unpatched versions of SQLite.
This second precondition is literally one of the idiomatic usages of SQLite that they suggest on their own site: https://sqlite.org/appfileformat.html
It is true that the half-million lines of test code found in the public source tree are not the entirety of the SQLite test suite. There are other parts that are not open-source. But the part that is public is a big chunk of the total.
Edit: also this:
> TH3 Testing Support. The TH3 test harness is an aviation-grade test suite for SQLite. SQLite developers can run TH3 on specialized hardware and/or using specialized compile-time options, according to customer specification, either remotely or on customer premises. Pricing for this service is on a case-by-case basis depending on requirements.
The roots of SQLite are in defence-industry projects for the US Navy and General Dynamics. Seems like TH3 might be of interest to these sorts of users.
That isn't true, not by a long shot. Improvements happen because someone is inspired to do something differently.
How will that ever happen if we're obsessed with proving we can reimplement shit that's already great?
In reality, LLMs can (currently) build worse versions of things that already exist: a worse database than SQLite, a worse C compiler than GCC, a worse website than one done by a human. I'd really like to see some agent create a better version of something that already exists, or, at least, something relatively novel.
But it enables people who can't do these things at all to appear to be able to do these things and claim reputation and acclaim that they don't deserve for skills they don't have.
While I'm generally sympathetic to the idea that human and LLM creativity are broadly similar (combining ideas absorbed elsewhere in new ways), when we ask for something that already exists it's basically just laundering open-source code.
That's the real unlock in my opinion. It's effectively an automated reverse engineering of how SQLite behaves, which is something agents are really good at.
I did a similar but smaller project a couple of weeks ago to build a Python library that could parse a SQLite SELECT query into an AST - same trick, I ran the SQLite C code as an oracle for how those ASTs should work: https://github.com/simonw/sqlite-ast
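A sketch of the general oracle trick in Rust (not the actual harness from that project; it assumes a sqlite3 binary on PATH, and my_engine_eval is a hypothetical stand-in for the code under test):

    use std::process::Command;

    // Differential testing against SQLite as an oracle: feed the same SQL
    // to the real sqlite3 CLI and to the reimplementation, then diff.
    fn oracle_eval(sql: &str) -> String {
        let out = Command::new("sqlite3")
            .arg(":memory:")
            .arg(sql)
            .output()
            .expect("failed to run sqlite3");
        String::from_utf8_lossy(&out.stdout).into_owned()
    }

    fn my_engine_eval(_sql: &str) -> String {
        unimplemented!("the reimplementation under test")
    }

    fn check(sql: &str) {
        assert_eq!(oracle_eval(sql), my_engine_eval(sql), "divergence on: {sql}");
    }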
Question: you mention the OpenAI and Anthropic Pro plans, was the total cost of this project in the order of $40 ($20 for OpenAI and $20 for Anthropic)? What did you pay for Gemini?
Parallelism over one code base is clearly not very useful.
I don't understand why going as fast as possible is the goal. We should be trying to be as correct as possible. The whole point is that these agents can run while we sleep. Convergence is nonlinear: you want every step to be in the right direction. Think of it more as a series of crystalline database transactions that must unroll in perfect order than as a big pile of rocks that needs to be moved from A to B.
I wrote a rant about this a while back to try and encourage people to be more responsible: https://sibylline.dev/articles/2026-01-27-stop-orchestrating...
Agreed, a flat set of workers configured like this is probably not the best configuration.
Can you imagine what an all human team configured like this would produce?
- the memory, thread safety, and build system of Rust
- the elegant syntax of OCaml and Haskell
- the expressive type system of Haskell and TypeScript
- the directness and simplicity of JavaScript
Think coding agents can help here?
But seriously though: have you tried to see how far you can get with the design right now? You can start iterating on it already, even if the implementation will lag.
Expressive power is the ratio of how strongly/clearly you can encode invariants to how complex and ceremonious the syntax needs to be.
See how JS, a language usually seen as a middling/mediocre language, can distill the basic good parts of OOP into very direct and clear idioms? I can just create an object literal and embed simple methods on it that receive the "this" pointer and use it. The constructor is just a regular function. None of the cruft of standard OOP.
See how you define an enumerable union in TypeScript? Very simple. And yet I can think of many major languages that do not have this, certainly not without a lot of ceremony and complexity.
And I can go on.
except '- the directness and simplicity of JavaScript'
https://github.com/artpar/guage
But somehow the language feels so foreign. It can obviously do hello world, but I don't have a real use case.
PS: the "Pure symbols only" claim is no longer true; most symbols have been converted to English names.
And the "days" you see there in the markdowns are Claude Code sessions, not actual days.
And yes, LLMs/agents can help you do it for sure. I'm currently building the lisp of my dreams in my free time, and already have a compiler, interpreter, UI framework, and some other things done in a way I'm happy with.
And trust me, such a language that captures enough mindshare is absolutely needed. People thought Rust was going to be it, but it got taken over by the idea of it being the next C++.
IF LLMs are what you make them out to be, we should already have seen serious attempts at such languages, but I suspect LLMs are of barely any help here beyond some basic implementation tasks.
Looking a bit further out, F# and Swift also come close.
Which aims to match SQLite quality and provide new features (free encryption, multiple simultaneous writers, and bitflip resistance).
How well does the resulting code perform? What are the trade-offs/limitations/benefits compared to SQLite? What problems does it solve?
Why did you use this process? This mixture of models? Why is this a good setup?
If it's SQLite's suite then it's great that the models managed to get there, but one issue (without trying to be too pessimistic) is that the models had the test suite right there to validate against. SQLite's devs famously spend more of their time making the tests than building the functionality. If we can get AI that reliably defines the functionality of such programs by building the test suite over years of trial and error, then we'll have what people are saying we do.
590x the application code
A small, highly experienced team steering Claude might be able to replicate the architecture and test suite reasonably quickly.
1-shotting something that looks this good means that with a few helping hands, small teams can likely accomplish decades of work in mere months.
Small teams of senior engineers can probably begin to replicate entire companies worth of product surface area.
It can even do that in a lossless way, instead of burning a bunch of tokens to get a bad, barely working half-copy.
Don't get me wrong, I'm no AI hater, they are an impressive technology. But both AI-deniers and hypers need a reality check.
I provided a repo (mine) that already implemented double-double arithmetic, trigonometry, and logarithms/exponentials, with plenty of tests.
It produced something that looked this good. It had tests, it followed the style of the existing code base, etc. But it was full of shit and outright lies.
After I reviewed it to fix deficiencies, I don't think there was anything left of the original.
I had much more success the previous week using an AI to rubber duck the algorithms to implement trig.
I am incredibly sceptical that just adding more loops (and less critical thinking/review) to brute-force a solution is a good idea.
* After a long vibe-coding session, I have to spend an inordinate amount of time cleaning up what Cursor generated. Any given page of code will be just fine on its own, but the overall design (unless I'm extremely specific in what I tell Cursor to do) will invariably be a mess of scattered control, grafted-on logic, and just overall poor design. This is despite me using Plan mode extensively, and instructing it to not create duplicate code, etc.
* I keep seeing metrics of 10s and 100s of thousands of LOC (sometimes even millions), without the authors ever recognizing that a gigantic LOC count is probably indicative of terrible, heisenbuggy code. I'd find it much more convincing if this post said it generated a 3K-LOC SQLite implementation, and not 19K.
Wondering if I'm just lagging in my prompting skills or what. To be clear, I'm very bullish on AI coding, but I do feel people are getting just a bit ahead of themselves in how they report success.
And for the most part I use either Opus or Sonnet, but for planning I sometimes switch to ChatGPT, since I think Claude is too blunt and does not ask enough questions. I also have local setups with Ollama and have tried some Kimi models for personal projects. The results are the same for all, though again the Claude models are slightly better.
What model? Cursor doesn't generate anything itself, and there's a huge difference between gpt5.3-codex and composer 1 for example.
lol