> a Linux kernel facility that lets userspace code detect whether it was preempted or migrated during a critical section and restart it if so. PostgreSQL's spinlock paths would use rseq to detect preemption and retry, avoiding the scenario where a preempted lock holder stalls all waiting backends.
The real proposal is about time-slice extension, which is a feature that uses the ABI for rseq but otherwise has nothing to do with retrying critical sections. While a process holds an s_lock, it would set a request bit. If the kernel tries to preempt that thread while the request bit is set, it instead extends the time slice once and returns control to the thread. It's further explained here: https://docs.kernel.org/userspace-api/rseq.html
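A minimal sketch of that request-bit protocol as I understand it — the field names (slice_request, slice_granted) and the registration details are placeholders, not the final ABI, which is still being discussed upstream:

    #include <sched.h>

    /* Stand-in for the extended rseq area shared with the kernel;
       in reality it would be registered via rseq(2). */
    struct rseq_slice {
        volatile unsigned int slice_request;  /* set: "don't preempt me yet" */
        volatile unsigned int slice_granted;  /* kernel: "deferred one preemption" */
    };

    static struct rseq_slice *my_rseq;  /* assume set up at thread start */

    void run_under_s_lock(void (*body)(void))
    {
        my_rseq->slice_request = 1;    /* request the one-shot extension */
        body();                        /* the short critical section */
        my_rseq->slice_request = 0;
        if (my_rseq->slice_granted) {  /* we got extra time... */
            my_rseq->slice_granted = 0;
            sched_yield();             /* ...so give the CPU back promptly */
        }
    }

The key property is that the extension is granted at most once and the thread is expected to yield as soon as it leaves the critical section, so a misbehaving process can't hog the CPU.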
This seems confused. These are options for preemptibility of the kernel, which is a relatively modern feature. Userspace could always be preempted, and these options don't change anything there. The kernel must in any case frequently interrupt threads and processes to implement preemptive multitasking, which Linux has of course had since the beginning.
Read more, e.g., at https://lwn.net/Articles/944686/ or in the help texts at https://github.com/torvalds/linux/blob/master/kernel/Kconfig...
Also a crime that people are still running databases with 4KB pages.
To put it in perspective, this means you will have more than 30 million pages on a server with 128GB of RAM. As an example, if there are 16 bytes of metadata per memory page, the metadata itself takes more than half a gigabyte.
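The arithmetic, spelled out:

    128 GiB / 4 KiB per page            = 33,554,432 pages (~33.5 million)
    33,554,432 pages * 16 B of metadata = 512 MiB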
It is able to? Configure huge_page_size=1GB?
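For reference, something like this is what I mean — if I remember right the huge_page_size GUC arrived around PostgreSQL 14, and it needs a 1GB page pool reserved at boot (the pool size below is illustrative):

    # postgresql.conf
    huge_pages = on
    huge_page_size = 1GB

    # kernel boot parameters (pool size illustrative)
    default_hugepagesz=1G hugepagesz=1G hugepages=16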
The headline implies it broke PG everywhere. It didn’t.
Since we will never know, it might be a good idea to feature-gate the change: flip the default and let users decide to change it back. That might produce some feedback on the LKML or elsewhere to decide whether the change is worthwhile?
It's very close to a real-world simulation of a production workload.
For example, this issue aside, I'd rather split such a workload into multiple smaller instances, naturally, because the impact of a crash in this single-node, heavy-load, many-cores, many-clients scenario would be huge.
AWS engineer reports PostgreSQL perf halved by Linux 7.0, fix may not be easy (phoronix.com) ~24 days ago, 165+ comments
It only took a few decades for Linux to get a good CPU scheduler and good I/O schedulers, too. I don't get how such an important area can be so bad for so long. But then, bad scheduling is everywhere. I find it a pretty fun area to work in, but, judging by how it gets less than half-assed treatment in much existing software, most developers seem to hate dealing with it?
At first I thought that maybe Linux doesn't have ways to give priority to the desktop environment (a.k.a. "graphical shell") which is why running out of RAM means your cursor starts lagging, clicking on things stops working, etc.
But maybe Linux is just bad at that in general and a single process eating too much RAM can simply bring the whole system to a halt as it tries to move and compress RAM to a pagefile on an HDD (not SSD).
Every time it happens to me I just find it so incredible. Here I am with a PC with multiple cores, multiple processors, and a single process eating all the RAM can bottleneck ALL of them at once? Am I misunderstanding something? Shouldn't it, ideally, work in such a way that so long as one processor is free, the system can process mouse input, render the cursor, and do all the desktop stuff no matter what I/O is happening in the background?
Since it's Linux maybe it's just my DE/distro (Cinnamon/Mint). Maybe it does allocations under the assumption there will always be a few free bytes in RAM available, so it halts if RAM runs out while some other DE wouldn't. But even then you'd think there would be a way to just reserve "premium" memory for critical processes so they never become unresponsive.
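For what it's worth, one existing mechanism along those "premium memory" lines is page pinning — a critical process can lock its pages so they're never swapped out. A minimal sketch, not something I know any DE to actually do:

    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        /* Pin current and future pages in RAM; needs CAP_IPC_LOCK
           or a sufficiently high RLIMIT_MEMLOCK. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        /* ... run the compositor / shell event loop here ... */
        return 0;
    }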
I wonder if other people have the same experience as me. This part of Linux has just always felt fundamentally poor to me.
Dealing with low-memory situations elegantly is pretty hard: firstly, Linux uses memory overcommit by default, in part because the semantics of fork imply very large memory commitments which are almost never realised, and in part because a lot of software does the same because it's the default. Secondly, managing allocation failures is often tricky and ill-tested, and often requires co-ordination between different systems. The DE could, though, in principle, put running applications in a container which would prevent them from using above a certain amount of memory, but the results are similar to earlyoom in that reaching the limit almost certainly means terminating the process using the most memory.
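The container approach is already doable today with a transient cgroup scope; the command below caps a hypothetical heavy-app at 8 GiB (values illustrative):

    systemd-run --user --scope -p MemoryMax=8G -p MemorySwapMax=0 heavy-app

When the scope hits MemoryMax it gets reclaim pressure and eventually a cgroup-scoped OOM kill — which is exactly the "kill the biggest consumer" outcome described above.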
You could split the processes into 2 categories:
1: applications that are doing tasks the user wants.
2: OS processes that the user needs to interact with in order to terminate applications.
There is an argument for applications taking priority: the user wants to do a task, and if you move an application out of RAM, the task is going to take longer.
But to me OS processes, including the graphical shell (taskbar, windowing system, etc.), should have priority: if an application hangs on I/O, the user NEEDS to be able to use the taskbar in order to terminate the application, otherwise they're going to have to wait who knows how long for the application to finish its task (or just hard reset the computer).
I don't know anything about how Linux handles memory, but the impression I have is that it has its priorities wrong, or it may not even have a way to configure priorities (unlikely), or maybe there is a way to prioritize what is kept in memory but it only splits kernel/userspace memory, so DEs that sit in userspace don't get priority (i.e. it's inadequate for a graphical operating system).
To be frank, as a desktop Linux user my biggest fear is that the Linux kernel is perfectly capable of prioritizing kernel/userspace memory, but has no way to prioritize DEs. In other words, that the "graphical OS" use case of Linux is a second-class citizen, a feature bolted on top of GNU/Linux/systemd. Because that would mean a lot of things are considered only from the perspective of a Linux server. This is only my imagination talking, since I'm not really involved with how Linux works. But to be fair I was never involved with how Windows worked either, and I never doubted it considered desktop a primary use case.
Your consternation is seconded.
Edit: systemd-oomd is what I was thinking of
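If anyone wants to try it, the knobs live in oomd.conf(5); the values here are illustrative, not recommendations:

    # /etc/systemd/oomd.conf
    [OOM]
    SwapUsedLimit=90%
    DefaultMemoryPressureLimit=60%
    DefaultMemoryPressureDurationSec=20s

    # per-slice opt-in, e.g. a drop-in for the user slice
    [Slice]
    ManagedOOMMemoryPressure=kill

It kills based on memory pressure before the system grinds to a complete halt, which addresses exactly the unresponsive-desktop scenario upthread.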
What if it was on a VM and the core holding the lock got descheduled from the hypervisor?
https://lore.kernel.org/all/20260126204745.GP171111@noisy.pr...
I thought the warnings were only generated when you turned on a kernel config "that no one uses in practice"
Though I actually don't know how large shared_buffers has to be for huge pages to make a noticeable difference.
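Back-of-the-envelope for why size matters — a typical x86 dTLB holds on the order of a couple thousand entries (figure assumed, varies by CPU):

    8 GB shared_buffers / 4 KB pages = ~2.1 million translations
    8 GB shared_buffers / 2 MB pages = 4,096 translations
    8 GB shared_buffers / 1 GB pages = 8 translations

So with 4KB pages a buffer-cache-heavy workload thrashes the TLB almost by construction, while 2MB pages get within striking distance and 1GB pages fit trivially.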
Edit: It may not be the optimal or recommended config, but I was under the impression it's very close to the default config. As far as I know, most popular distros ship with no hugepage pool reservation and shared-memory transparent hugepages disabled.
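Easy to check on a given box — the paths are standard, though the defaults vary by distro:

    cat /sys/kernel/mm/transparent_hugepage/enabled        # often [madvise] or [always]
    cat /sys/kernel/mm/transparent_hugepage/shmem_enabled  # often [never]
    grep -i hugepages_total /proc/meminfo                  # HugePages_Total: 0 = no pool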
Doing some research, though, a spinlock actually doesn't seem as unusual a hack as it would first appear. Do drivers and the like not have similar issues because they don't trigger a page fault, I guess?
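For context, the userspace variant has to be spin-then-sleep precisely because the holder can be preempted — roughly the shape of PostgreSQL's s_lock() slow path, though the spin count and sleep interval here are illustrative, not PostgreSQL's:

    #include <stdatomic.h>
    #include <unistd.h>

    #define SPINS_BEFORE_SLEEP 1000

    void spin_lock_acquire(atomic_flag *lock)
    {
        int spins = 0;
        while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
            if (++spins >= SPINS_BEFORE_SLEEP) {
                usleep(1000);  /* back off: the holder may have been preempted */
                spins = 0;
            }
        }
    }

    void spin_lock_release(atomic_flag *lock)
    {
        atomic_flag_clear_explicit(lock, memory_order_release);
    }

In-kernel spinlocks can skip the sleep because the kernel disables preemption while holding them; that option doesn't exist in userspace, which is what the time-slice-extension proposal is trying to approximate.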
That was, idk, 2008-9-ish? I don't know what spotty history you're talking about; if you have multi-gigabyte address spaces floating around on a machine, it's stupid not to use huge pages.
Do you consider huge pages disabled a discouraged config? If the data doesn't fit into memory, a single lookup will read multiple NVMe pages instead of one, which could lead to a significant regression.
You might have transparent huge pages on by default depending on the distro
Especially with containers around, you might very well hit the case of running a new kernel but an older version of PostgreSQL with no code mitigation for the problem.
I can defend someone who is unwilling to yield on quality. After all, this truly is his baby. But issuing scathing rebukes to well-intentioned contributors is like slapping my kid when he brings me the wrong type of screwdriver.
Would you be able to point one out?
> to well-intentioned contributors
This is a system used and relied upon by billions of people around the world. Your intentions, while good, are not material to the problem. Put another way: we have an endless supply of people with "good intentions", but we don't enjoy the same largesse of people with "good skills."
https://lore.kernel.org/all/CAHk-=wiLdmz92CCfu2+-9_UrGSn6Pu6...
https://lkml.org/lkml/2009/7/28/373
It also didn't just happen out of the blue. It's also true that Alan had already been working on the kernel for 15 years, was an employee of Red Hat at the time, and his wife's health was starting to fail.
If you follow the thread, it goes back and forth across quite a few messages, with frustration building on both sides and Alan ultimately deciding to step away from a single (and very hairy) subsystem.
You don't talk like this to junior or even senior engineers, but you do reach a level at which being told gently isn't necessary.
If you don't like it go fork Linux and try being the nice benevolent dictator and we'll applaud your success.