> Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).
This is surely true for certain use cases, say financial applications which must guarantee 100% uptime, but I'd argue the vast, vast majority of applications are perfectly ok with local commit and rapid recovery from remote logs and replicas. The point is, the cloud won't give you that distributed consistency for free, you will pay for it both in money and complexity that in practice will lock you in to a specific cloud vendor.
I.e, make cloud and hosting services impossible to commoditize by the database vendors, which is exactly the point.
- A modern high end SSD commits faster than the one way time to anywhere much farther than a few miles away. (Do the math. A few tens of microseconds specified write latency is pretty common. NVDIMMs (a sadly dying technology) can do even better. The speed of light is only so fast.
- Unfortunate local correlated failures happen. IMO it’s quite nice to be able to boot up your machine / rack / datacenters and have your data there.
- Not everyone runs something on the scale of S3 or EBS. Those systems are awesome, but they are (a) exceedingly complex and (b) really very slow compared to SSDs. If I’m going to run an active/standby or active/active system with, say, two locations, I will flush to disk in both locations.
It is. Coordinated failures shouldn't be a surprise these days. It's kind of sad to here that from an AWS engineer. Same data pattern fills the buffers and crashes multiple servers, while they were all "hoping" that others fsynced the data, but it turns out they all filled up and crashed. That's just one case there are others.
It is great as long as your actual workload fits, but misleading if a microbenchmark doesn't inform you of the knee in the curve where you exhaust the buffer and start observing the storage controller as it retires things from this buffer zone to the other long-term storage areas. There can also be far more variance in this state as it includes not just slower storage layers, but more bookkeeping or even garbage-collection functions.
You must have a single flushed write to disk to be durable, but it doesn't need the second write.
Overall speed is irrelevant, what mattered was the relative speed difference between sequential and random access.
And since there's still a massive difference between sequential and random access with SSDs, I doubt the overall approach of using buffers needs to be reconsidered.
Edit: thank you for all the answers -- very educational, TIL!
https://i.imgur.com/t5scCa3.png
https://ssd.userbenchmark.com/ (click on the orange double arrow to view additional columns)
That is a latency of about 50 µs for a random read, compared to 4-5 ms latency for HDDs.
With SSDs, the write pattern is very important to read performance.
Datacenter and enterprise class drives tend to have a maximum transfer size of 128k, which is seemingly the NAND block size. A block is the thing that needs to be erased before rewriting.
Most drives seem to have an indirection unit size of 4k. If a write is not a multiple of the IU size or not aligned, the drive will have to do a read-modify-write. It is the IU size that is most relevant to filesystem block size.
If a small write happens atop a block that was fully written with one write, a read of that LBA range will lead to at least two NAND reads until garbage collection fixes it.
If all writes are done such that they are 128k aligned, sequential reads will be optimal and with sufficient queue depth random 128k reads may match sequential read speed. Depending on the drive, sequential reads may retain an edge due to the drive’s read ahead. My own benchmarks of gen4 U.2 drives generally backs up these statements.
At these speeds, the OS or app performing buffered reads may lead to reduced speed because cache management becomes relatively expensive. Testing should be done with direct IO using libaio or similar.
To me a 4k read seems anachronistic from a modern application perspective. But I gather 4kb pages are still common in many file systems. But that doesn’t mean the majority of reads are 4kb random in a real world scenario.
> Our extensive experiments discover that, unlike HDDs, the performance degradation of modern storage devices incurred by fragmentation mainly stems from request splitting, where a single I/O request is split into multiple ones.
- The access block size (LBA size). Either 512 bytes or 4096 bytes modulo DIF. Purely a logical abstraction.
- The programming page size. Something in the 4K-64K range. This is the granularity at which an erased block may be programmed with new data.
- The erase block size. Something in the 1-128 MiB range. This is the granularity at which data is erased from the flash chips.
SSDs always use some kind of journaled mapping to cope with the actual block size being roughly five orders of magnitude larger than the write API suggests. The FTL probably looks something like an LSM with some constant background compaction going on. If your writes are larger chunks, and your reads match those chunks, you would expect the FTL to perform better, because it can allocate writes contiguously and reads within the data structure have good locality as well. You can also expect for drives to further optimize sequential operations, just like the OS does.
(N.b. things are likely more complex, because controllers will likely stripe data with the FEC across NAND planes and chips for reliability, so the actual logical write size from the controller is probably not a single NAND page)
Most filesystems read in 4K chunks (or sometimes even worse, 512 byes), and internally the actual block is often multiple MB in size, so this internal read multiplication is a big factor in performance in those cases.
Note the only real difference between a random read and a sequential one is the size of the read in one sequence before it switches location - is it 4K? 16mb? 2G?
anything like this, but for postgres?
actually, is it even possible to write a new db engine for postgres? like mysql has innodb, myisam, etc
That said, there are a few alternative storage engines for Postgres, such as OrioleDB. However due to limitations in Postgres's storage engine API, you need to patch Postgres to be able to use OrioleDB.
MySQL instead focused on pluggable storage engines from the get-go. That has had major pros and cons over the years. On the one hand, MyISAM is awful, so pluggable engines (specifically InnoDB) are the only thing that "saved" MySQL as the web ecosystem matured. It also nicely forced logical replication to be an early design requirement, since with a multi-engine design you need a logical abstraction instead of a physical one.
But on the other hand, pluggable storage introduces a lot of extra internal complexity, which has arguably been quite detrimental to the software's evolution. For example: which layer implements transactions, foreign keys, partitioning, internal state (data dictionary, users/grants, replication state tracking, etc). Often the answer is that both the server layer and the storage engine layer would ideally need to care about these concerns, meaning a fully separated abstraction between layers isn't possible. Or think of things like transactional DDL, which is prohibitively complex in MySQL's design so it probably won't ever happen.
> Companies are global, businesses are 24/7
Only a few companies are global, so only a few of them should optimize for those kind of workload. However maybe every startup in SV must aim to becoming global, so probably that's what most of them must optimize for, even the ones that eventually fail to get traction.
24/7 is different because even the customers of local companies, even B2B ones, mighty feel like doing some work at midnight once in a while. They'll be disappointed to find the server down.
A massive number of companies have global customers, regardless of where the company itself has employees.
For example my b2b business is relatively tiny, yet my customer base spans four continents. Or six continents if you count free users!
Due to the interface between SSD and host OS being block based, you are forced to write a full 4k page. Which means you really still benefit from a write ahead log to batch together all those changes, at least up to page size, if not larger.
If you want to get some sort of sub-block batching, you need a structure that isn't random in the first place, for instance an LSM (where you write all of your changes sequentially to a log and then do compaction later)—and then solve your durability in some other way.
¿Por qué no los dos?
Postgres allows a group commit to try to combine multiple transactions to avoid the multiple fsyncs, but it adds delay and is off by default. And even so, it reduces fsyncs, not writes.
If you believe this, then what you want already exists. For example: MySQL has in memory tables, but also this design pretty much sounds like NDB.
I don’t think I’d build a database the way they are describing for anything serious. Maybe a social network or other unimportant app where the consequences of losing data aren’t really a big deal.
This made sense for product catalogs, employee dept and e-commerce type of use cases.
But it's an extremely poor fit for storing a world model that LLMs are building in an opaque and probabilistic way.
Prediction: a new data model will take over in the next 5 years. It might use some principles from many decades of relational DBs, but will also be different in fundamental ways.
Design Database for SSD would still go a very very long way before what I think the author is suggesting which is designing for cloud or datacenter.
Ultimately, it's a trade-off: larger pages mean faster I/O, while smaller pages mean better CPU utilisation.
And then a bug crashes your database cluster all at once and now instead of missing seconds, you miss minutes, because some smartass thought "surely if I send request to 5 nodes some of that will land on disk in reasonably near future?".
I love how this industry invents best practices that are actually good then people just invent badly researched reasons to just... not do them.
That would be asynchronous replication. But IIUC the author is instead advocating for a distributed log with synchronous quorum writes.
> In Aurora, we have chosen a design point of tolerating (a) losing an entire AZ and one additional node (AZ+1) without losing data, and (b) losing an entire AZ without impacting the ability to write data. [..] With such a model, we can (a) lose a single AZ and one additional node (a failure of 3 nodes) without losing read availability, and (b) lose any two nodes, including a single AZ failure and maintain write availability.
As for why this can be considered durable enough, section 2.2 gives an argument based on their MTTR (mean time to repair) of storage segments
> We would need to see two such failures in the same 10 second window plus a failure of an AZ not containing either of these two independent failures to lose quorum. At our observed failure rates, that’s sufficiently unlikely, even for the number of databases we manage for our customers.
[0] https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...
Arbitrating differences in relative ordering across different observer clocks is what N-temporal databases are about. In databases we usually call the basic 2-temporal case “bitemporal”. The trivial 1-temporal case (which is a quasi-global clock) is what we call “time-series”.
The complexity is that N-temporality turns time into a true N-dimensional data type. These have different behavior than the N-dimensional spatial data types that everyone is familiar with, so you can’t use e.g. quadtrees as you would in the 2-spatial case and expect it to perform well.
There are no algorithms in literature for indexing N-temporal types at scale. It is a known open problem. That’s why we don’t do it in databases except at trivial scales where you can just brute-force the problem. (The theory problem is really interesting but once you start poking at it you quickly see why no one has made any progress on it. It hurts the brain just to think about it.)
Also even if not required makes reasoning about how systems work a hell lot easier. So for vast majority that doesn't need massive throughtputs sacrificing some speed for easier to understand consistency model is worthy tradeoff
Prety much all financial transactions are settled with a given date, not instantly. Go sell some stocks, it takes 2 days to actually settle. (May be hidden by your provider, but that how it works).
For that matter, the ultimate in BASE for financial transactions is the humble check.
That is a great example of "money out" that will only be settled at some time in the future.
There is a reason there is this notion of a "business day" and re-processing transactions that arrived out of order.
Frankly, it’s shocking anything works at all.
So, essentially just CQRS, which is usually handled in the application level with event sourcing and similar techniques.
[1] https://www.dr-josiah.com/2010/08/databases-on-ssds-initial-...
And CedarDB https://cedardb.com/ the more commercialized product that is following up on some of this research, including employing many of the key researchers.
I was familiar with Solarflare and Mellanox zero copy setups in a previous fintech role, but at that time it all relied on black boxes (specifically out of tree kernel modules, delivered as blobs without DKMS or equivalent support, a real headache to live with) that didn't always work perfectly, it was pretty frustrating overall because the customer paying the bill (rightfully) had less than zero tolerance for performance fluctuations. And fluctuations were annoyingly common, despite my best efforts (dedicating a core to IRQ handling, bringing up the kernel masked to another core, then pinning the user space workloads to specific cores and stuff like that) It was quite an extreme setup, GPS disciplined oscillator with millimetre perfect antenna wiring for the NTP setup etc we built two identical setups one in Hong Kong and one in new york. Ah very good fun overall but frustrating because of stack immaturity at that time.
The amount of performance you can extract from a modern CPU if you really start optimising cache access patterns is astounding
High performance networking is another area like this. High performance NICs still go to great lengths to provide a BSD socket experience to devs. You can still get 80-90% of the performance advantages of kernel bypass without abandoning that model.
I think this was one, and I want to emphasise this, of the main points behind Odin programming language.
It turns out that btrees are still efficient for this work. At least until the hardware vendors deign to give us an interface to SSD that looks more like RAM.
Reading over https://www.cs.cit.tum.de/dis/research/leanstore/ and associated papers and follow up work is recommended.
In the meantime with RAM prices sky rocketing, work and research in buffer & page management for greater-than-main-memory-sized DBs is set to be Hot Stuff again.
I like working in this area.