- Opening/closing connections is fast, but not free—overhead adds up under load
- SQLite writes are globally locked
aiosqlitepool is a tiny library that adds connection pooling for any asyncio SQLite driver (like aiosqlite):
- It avoids repeated database connection setup (syscalls, memory allocation) and teardown (syscalls, deallocation) by reusing long-lived connections
- Long-lived connections keep SQLite's in-memory page cache "hot." This serves frequently requested data directly from memory, speeding up repetitive queries and reducing I/O operations
- Allows your application to process significantly more database queries per second under heavy load
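The gist of the idea, as a rough sketch built on an asyncio.Queue (not the library's actual API):

```python
# Sketch only: hand out long-lived aiosqlite connections from a queue
# instead of opening and closing one per query.
import asyncio
import aiosqlite

class TinyPool:
    def __init__(self, path, size=4):
        self._path = path
        self._size = size
        self._queue = asyncio.Queue()

    async def open(self):
        for _ in range(self._size):
            conn = await aiosqlite.connect(self._path)
            await conn.execute("PRAGMA journal_mode=WAL")  # readers don't block the writer
            await self._queue.put(conn)

    async def fetchall(self, sql, params=()):
        conn = await self._queue.get()        # borrow a warm connection
        try:
            async with conn.execute(sql, params) as cur:
                return await cur.fetchall()
        finally:
            await self._queue.put(conn)       # return it instead of closing

    async def close(self):
        while not self._queue.empty():
            await (await self._queue.get()).close()
```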
Enjoy!
> Python
Your repo and the readme.md don't say "python." The title of this post doesn't say "python."
It took me a while to realize that this is for python, as opposed to a general-purpose cache for, say, libsqlite.
There are tags showing which Python versions are supported.
The root dir of the repo contains a 'pyproject.toml' file.
The readme contains installation instructions for pip, poetry, and uv, all of which are Python package managers.
The readme contains example code, all of which is in Python.
The readme references asyncio, a Python module that is included in the standard library for Python 3.
The 'Languages' widget on the page shows 99.2% of the repo's code is in Python.
Every file not in the root dir has a .py extension.
Yeah, I can see why it was so hard to figure out...
Then writers queue up, while readers are unimpeded.
(in general you _really_ should use WAL mode if you're using sqlite concurrently, though you should also read the documentation about WAL mode)
This only gets “worse” as computers get faster: think about how many write transactions a single serialized writer (WAL mode, normal synchronous mode) could have completed in the time your writers spend sleeping after the previous one finished, just because they didn't line up.
And, if you have a single limited pool, your readers will now be stuck waiting for an available connection too (because they're all taken by sleeping writers).
It's much fairer and more efficient for writers to line up with blocking application locks.
It's fixable by periodically forcing the WAL to be truncated, but it took me a lot of time and pain to figure it out.
They do point out the risks here: https://sqlite.org/wal.html#avoiding_excessively_large_wal_f...
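For reference, the periodic forcing can be as small as a TRUNCATE checkpoint run from a cron job or side process; a minimal sketch:

```python
# Sketch: force the WAL to be checkpointed and truncated.
import sqlite3
import sys

def truncate_wal(path):
    conn = sqlite3.connect(path)
    try:
        # Returns (busy, wal_frames, checkpointed_frames); busy == 1 means a
        # reader or writer prevented a full checkpoint this time around.
        busy, wal_frames, ckpt_frames = conn.execute(
            "PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        print(f"busy={busy} wal={wal_frames} checkpointed={ckpt_frames}")
    finally:
        conn.close()

if __name__ == "__main__":
    truncate_wal(sys.argv[1])
```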
sqlite's design makes a lot of SQL concurrency synchronization edge cases much simpler, since you can rely on the single-writer-at-a-time limitation. And it has some great hidden features for using it as client application state storage. But there are use cases it's just not very good at, and moving from sqlite to other DBs can be tricky (if you ever relied on the exclusive write transaction, or on the way cells are blobs which can mix data types, even if it was by accident).
In the end I wrote an external process that forced a checkpoint a few times a day, which worked. I came across other exasperated people in various dark corners of the Internet with the same symptoms.
If I had a blog, I'd be writing about it.
The two main issues with opening a connection are:
1. There is fixed cost O(database schema) time spent building the connection stuff. Ideally SQLite could use a “zygote” connection that can refresh itself and then get cloned to create a new one, instead of doing this work from scratch every time.
2. There is O(number of connections) time spent looking at a list of file descriptors in global state under a global lock. This one is REALLY BAD if you have >10,000 connections so it was a major motivator for us to do connection pooling at Notion. Ideally SQLite could use a hash table instead of a O(n) linear search for this, or disable it entirely.
Both of these issues are reasons I’m excited about Turso’s SQLite rewrite in Rust - it’s so easy to fix both of these issues in Rust (like a good hash table is 2 LoC to adopt in Rust) whereas in the original C it’s much more involved to safely and correctly fix the issue in a fork.
Furthermore, it would be great to share more of the cache between connections as a kind of “L2 cache”; again tractable and safe to build in Rust but complicated to build in a fork of the original C.
Notion uses a SQLite-backed server for our “Database” product concept that I helped write, and we ran into a lot of these kinds of issues scaling reads. We implemented connection pooling over the better-sqlite3 Node module to mitigate these issues. We also use Turso’s existing SQLite C fork “libsql” for some connections since it offers a true async option backed by a thread pool under the hood in the Node driver, which helps in cases where you have a bottleneck serializing or deserializing data from “node” layout to “SQLite C” layout, or many concurrent writes to different DBs from a single NodeJS process.
SQLite does not even do network I/O.
How does sharing a connection (and transaction scope) in an asyncio environment even work? Won’t you still need a connection per asyncio context?
Does sqlite_open really take long compared to the inevitable contention for the write lock you’ll see when you have many concurrent contexts?
Does sqlite_open even register in comparison with the overhead of the python interpreter?
What is an asyncio SQLite connection anyways? Isn’t it just a regular one that gets hucked into a separate thread?
If you're querying a multi-GB SQLite database there are things like per-connection caches that may benefit from a connection pool.
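For example, each connection keeps its own page cache that you can enlarge, which only pays off if the connection lives long enough to stay warm; a sketch (sizes are arbitrary):

```python
import sqlite3

conn = sqlite3.connect("big.db")
conn.execute("PRAGMA cache_size = -262144")   # negative value is in KiB, so ~256 MiB page cache
conn.execute("PRAGMA mmap_size = 268435456")  # optionally serve reads via mmap too
```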
> What is an asyncio SQLite connection anyways? Isn’t it just a regular one that gets hucked into a separate thread?
Basically yes - aiosqlite works by opening each connection in a dedicated thread and then sending async queries to it and waiting for a response that gets sent to a Future.
https://github.com/omnilib/aiosqlite/blob/895fd9183b43cecce8...
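The general shape of that pattern, as a sketch (not aiosqlite's actual code):

```python
# One worker thread owns the sqlite3 connection; async callers enqueue
# work and await a Future. Construct inside a running event loop.
import asyncio
import queue
import sqlite3
import threading

class ThreadedSQLite:
    def __init__(self, path):
        self._work = queue.Queue()
        self._loop = asyncio.get_running_loop()
        threading.Thread(target=self._worker, args=(path,), daemon=True).start()

    def _worker(self, path):
        conn = sqlite3.connect(path)  # only this thread ever touches conn
        while True:
            sql, params, fut = self._work.get()
            try:
                rows = conn.execute(sql, params).fetchall()
                self._loop.call_soon_threadsafe(fut.set_result, rows)
            except Exception as exc:
                self._loop.call_soon_threadsafe(fut.set_exception, exc)

    async def execute(self, sql, params=()):
        fut = self._loop.create_future()
        self._work.put((sql, params, fut))
        return await fut
```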
For really large data sets, you can query and wait a few minutes before getting a result. Do you really want to await that?
(I don't know anything about that project and this isn't meant as a criticism of its design or a challenge - cos I'd probably lose :-) )
One way to look at is that with a client-server database and an async client library, you have a thread pool in the database server process to do the heavy lifting, and async clients talk to it via TCP. With SQLite, you have that "server" thread pool in the same process instead, and async "clients" talk to it via in-process communication.
In which case SQLite is probably the wrong tool for the job, and you should be using Postgres or MySQL that is actually designed from the ground up for lots of concurrent connections.
SQLite is amazing. I love SQLite. But I love it for single-user single-machine scenarios. Not multi-user. Not over a network.
I didn't say anything about concurrent access. SQLite with WAL mode is fine for that these days for dozens of concurrent readers/writers (OK only one writer gets to write at a time, but if your writes queue for 1-2ms who cares?) - if you're dealing with hundreds or thousands over a network then yeah, use a server-based database engine.
That is DuckDB's selling point. You want data analysis, you go DuckDB. You want OLTP, you go SQLite. Or combine both if you need both.
Even postgres struggles with OLAP workloads, which is why we have solutions like TimescaleDB, a postgres plugin. Ironically, it uses postgres rows but then packs column-oriented data into row fields.
That does not mean that postgres is flawed at working with big data. Same with SQLite... Different data has different needs, and that has nothing to do with database sizes.
- sqlite is a bit like a RW-locked database: either any number of readers, xor exactly one writer and no readers
- but with WAL mode enabled readers and writers (mostly) don't block each other, i.e. you can have any number of readers and up to one writer (so normally you want WAL mode if there is any concurrent access)
- if a transaction (including an implicit one from a single command without "begin", or e.g. one upgrading from a read to a write transaction) is blocked too long by a different process's write transaction, SQLITE_BUSY might be returned.
- in addition, file locks might be used by SQL bindings or similar to prevent multi-application access; normally you wouldn't expect that, but given that sqlite had an OPEN_EXCLUSIVE option in the past (which should be ignored by halfway modern implementations) I wouldn't be surprised to find it.
- your file system might also prevent concurrent access to sqlite db files; this is a super obscure niche case but I have seen it once (in a shared server, network filesystem(??) context, probably because sqlite really doesn't like network filesystems, which often have unreliable implementations of some of the primitives sqlite needs for proper synchronization)
as other comments pointed out enabling WAL mode will (probably) fix your issues
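With Python's built-in sqlite3 that's roughly (values are just examples):

```python
import sqlite3

conn = sqlite3.connect("app.db", timeout=5.0)  # wait up to 5s on locks
conn.execute("PRAGMA journal_mode=WAL")        # persists in the DB file
conn.execute("PRAGMA synchronous=NORMAL")      # common pairing with WAL
conn.execute("PRAGMA busy_timeout=5000")       # same knob as timeout=, as a pragma
```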
Your throughput will be much worse than a single process, but it's possible, and sometimes convenient. Maybe something in your stack is trying to hold open a writable connection in both processes?
Those are a nasty trap. The solution is non-obvious: you have to use BEGIN IMMEDIATE on any transaction that performs at least one write: https://simonwillison.net/tags/sqlite-busy/
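With Python's sqlite3 in autocommit mode that looks roughly like this (the accounts table is just a placeholder):

```python
import sqlite3

conn = sqlite3.connect("app.db", isolation_level=None)  # autocommit; we manage transactions
conn.execute("PRAGMA busy_timeout=5000")

def transfer(conn, src, dst, amount):
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front, or wait/fail right here
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```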
Saying multi-gigabyte databases for single-user usage are the norm feels insane to me.
Why? If all it's missing is an async connection pool to make it a good tool for more jobs, what's the problem with just creating one?
asyncpg is a nice python library for postgres.
I think postgres releasing a nice linkable, "serverless" library would be pretty amazing, to make the need for abusing sqlite like this (I do it too) go away.
Check the throughput graphs from this blog post from 2020 (for improvements I made to connection scalability):
https://techcommunity.microsoft.com/blog/adforpostgresql/imp...
That's for read-mostly work. If you do write very intensely, you're going to see more contention earlier. But that's way way worse with sqlite, due to its single writer model.
EDIT: Corrected year.
The way our SQLite workload works is that we have a pool of hundreds of read connections per DB file, and a single writer thread per DB file that keeps the DB up to date via CDC from Postgres; basically using SQLite as a secondary index "scale out" over data primarily written to Postgres. Because we're piping Postgres replication slot -> SQLite, we don't suffer any writer concurrency and throughput is fine to keep up with the change rate so far. Our biggest bottleneck is reading the replication slot on the Postgres side into Kafka with Debezium.
It really depends on what your workload looks like, but I think synchronous will win most of the time.
Even better is to have separate pools for the writer connection and readers in WAL mode. Then you can cache write relevant statements only once. I am skeptical about a dedicated thread per call because that seems like it would add a bunch of latency.
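One way that split can look, as a sketch (assumes WAL mode; not any particular library's API):

```python
import asyncio
import aiosqlite

class SplitPool:
    def __init__(self, path, readers=4):
        self._path = path
        self._n_readers = readers
        self._readers = asyncio.Queue()
        self._writer = None
        self._write_lock = asyncio.Lock()  # single writer, serialized in-process

    async def open(self):
        self._writer = await aiosqlite.connect(self._path)
        await self._writer.execute("PRAGMA journal_mode=WAL")
        for _ in range(self._n_readers):
            conn = await aiosqlite.connect(self._path)
            await conn.execute("PRAGMA query_only=1")  # readers can't write by accident
            await self._readers.put(conn)

    async def read(self, sql, params=()):
        conn = await self._readers.get()
        try:
            async with conn.execute(sql, params) as cur:
                return await cur.fetchall()
        finally:
            await self._readers.put(conn)

    async def write(self, sql, params=()):
        async with self._write_lock:
            await self._writer.execute(sql, params)
            await self._writer.commit()
```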
The trouble is that nobody took a step back and asked: "Can we simply use the backing DBMS?" Instead, they trudged forward with "Let's get rid of the backing DBMS and embed the database engine into our own DBMS!" And since SQLite is a convenient database engine...
Wrong on so many levels it's frightening.
You must set it to a value higher than the size of your DB. I use:
PRAGMA mmap_size = 1099511627776;
(1 TB)
https://sqlite.org/compile.html#max_mmap_size
Ubuntu system pragma compile_options:
MAX_MMAP_SIZE=0x7fff0000
Changing this default requires knowing it is a 64 bit platform when the C preprocessor runs, and would surprise anyone who was ok with the 2GB value.
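From Python you can check both the compiled-in cap and what a connection actually got (sketch):

```python
import sqlite3

conn = sqlite3.connect("app.db")
# Compile-time options, including MAX_MMAP_SIZE if it was set at build time.
for (opt,) in conn.execute("PRAGMA compile_options"):
    if "MMAP" in opt:
        print(opt)

conn.execute("PRAGMA mmap_size = 1099511627776")       # ask for 1 TB
print(conn.execute("PRAGMA mmap_size").fetchone()[0])  # what you actually got (capped)
```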
There are two downsides of mmap - I/O errors can't be caught and handled by SQLite code, and buggy stray writes by other code in the process could corrupt the database.
It is best practice to include the SQLite amalgamation directly in your own projects, which allows you to control version updates and configuration.
True. https://www.sqlite.org/mmap.html lists 3 other issues as well.
> and buggy stray writes by other code in the process could corrupt the database.
Not true: "SQLite uses a read-only memory map to prevent stray pointers in the application from overwriting and corrupting the database file."
I really don't get it. How would this help?
The benchmarks don't mention which journal mode sqlite is configured with, which is very suspicious, as that makes a huge difference under concurrent load.
Basically, it adds its own write lock outside of SQLite, because the pool can implement the lock in a less annoying way.
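Presumably something along these lines: serialize writers with an in-process asyncio.Lock, so they wait on an awaitable instead of spinning in SQLite's busy handler (sketch; pool.acquire/release are hypothetical names):

```python
import asyncio

write_lock = asyncio.Lock()

async def write(pool, sql, params=()):
    # Writers line up here, in-process and awaitable, instead of each one
    # holding a connection while retrying inside SQLite's busy handler.
    async with write_lock:
        conn = await pool.acquire()      # hypothetical pool API
        try:
            await conn.execute("BEGIN IMMEDIATE")
            await conn.execute(sql, params)
            await conn.commit()
        finally:
            await pool.release(conn)     # hypothetical pool API
```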
> Basically, it adds its own write lock outside of SQLite, because the pool can implement the lock in a less annoying way.
Less annoying how? What is the difference?
After all, as far as I understand, the busy timeout only kicks in at the beginning of a write transaction, so it's not like you have to redo a bunch of queries.
This situation can be managed with a small pool (5 connections or fewer) to prevent spawning too many connections. This will reduce racing between them and allow write operations to succeed.