Well, that does at least answer my immediate question about why I would ever swap from expensive RAM to really expensive RAM:) Feels niche, but when you want it it's a good idea.
Edit: Although, this is predicated on the system being able to release VRAM that is acting as swap when it's time to start a game. Can it do that?
>Sequential throughput: ~1.3 GB/s
[on a RTX 3070 Laptop]
This RTX 3070 chip is on PCIe 4.0 x16 which should give 64GB/s. The 8GB of GDDR6 is 448GB/s.
Swapping to an NVMe drive would be twice as fast, but with higher latency.
Edit: Their benchmarks are also run using ZRAM, which compresses pages before writing to swap. Not sure what the performance overhead of that is, but it's probably quite a bit.
First of all, it's a userspace program hooking the nbd driver, which is known for being slow. It also uses a bounce buffer in userspace before transferring to the GPU. So when the kernel needs to swap a page, it has to first copy it into a userspace facing buffer. The userspace program that has to wake back up and issue the cuda operation to copy the page into device memory.
nbd also doesn't really do a good job of supporting high queue depth or merging adjacent accesses. So if the kernel is issuing a bunch of 4K page swaps without any coalescing, you're going to end up with at least million kernel/userspace context switches per second just to handle 4 GB/s (4 GB / 4K page), let alone 64 GB/s. And that's just the NBD portion, forget the mess that is the NVIDIA driver. PCIe can move a lot of data, but in order to get anything even resembling the full bandwidth, you have to have use DMA engines with long page lists. Having to set up a transfer for every 4K page over PCIe will not reach full saturation of the bus.
Swapping to NVMe is a very optimized path -> the swapper can submit lists of pages directly to the NVMe driver and the controller can DMA them directly out of RAM, no copies or context switches CPU side at all.
This could probably be improved by migrating to the ublk driver as it might let you avoid the userspace bounce buffer. It'd also be able to have multiple write queues to at least set up CUDA copies in parallel.
With X11 it's not that bad (buffers are pre-allocated), but with Wayland allocations are a lot more dynamic, so running low on VRAM can easily crash the whole desktop. I just had a few of such crashes with Hyprland+llama-server+KVM switching between computers without freeing VRAM.
Well, GPUs also have stupid amounts of compute on them. I have to imagine that there is some kind of database format that's useful with GPU compute attached.
Since the data is already in VRAM, the GPU can sort, join, or otherwise manipulate data as needed.
I believe within 2-3 years databases and data warehouses on GPU will be common. The widespread use of agents to query data will be a part of this, as there will be a need to run far more queries at lower latency than needed for the ETL and BI workloads of the past.
It must have failed because I never heard of an update to this GPU. But AMD definitely made a GPU with 4x NVMe SSDs attached to the GPU.
In the end I just had to bite the bullet and take a gamble on finding ECC DDR4 RAM that would work with the ancient AMD chipset...
This particular implementation seems to be running over too many layers to be particularly performant. Why not a custom block driver instead?
The problem with putting (system) RAM on a PCIe card is that PCIe is not a cache-coherent interconnect. If you have a cache line that resides on your GPU sitting inside your processor's cache a remote modification to that memory by either the GPU, another CPU core or some other PCIe device with NOT invalidate the CPU cache line. You also have the fun situation that if it's modified on both ends simultaneously the resulting state will be non-deterministic.
Device drivers have to be very careful about synchronization when accessing memory-like areas on PCIe. CXL adds a cache coherency protocol among other things, so that invalidations and snoops can be exchanged over the interconnect.
sounds VERY low, also, wouldn't random read/write speed be MUCH more relevant here?
This HN comment and the linked post brought up a lot of good points. The main takeaway is that swap should primarily be considered a mechanism for equality of reclamation, not for emergency extra memory, where equality of reclamation means file-backed pages and anonymous pages are subject to similar criteria for being evicted from physical memory.
I used to have zero swap on my Linux desktop and this convinced me to add at least a small swap partition.
[0]: https://knowyourmeme.com/memes/its-one-banana-michael-what-c...
Now if it could be dynamically used and vacated on other GPU workloads?
Hard drives that huge scare me as it would take days to backup all the data off them.