On the positive side, you can scale out memory quite a lot: fill up PCIe slots, even hang memory off a box external to your chassis. Memory tiering has a lot of potential.
On the negative side, you've got latency costs to swallow. Distance from the CPU isn't free (there's a reason the memory on your motherboard sits as close to the CPU as practical): https://www.nextplatform.com/2022/12/05/just-how-bad-is-cxl-.... The CXL 2.0 spec adds roughly 200 ns of latency to every access to memory behind it, so you've got to think carefully about how you use it, or you'll cripple yourself.
There's been work on the OS side around data locality, but CXL hardware hasn't been widely available, so there's an element of "well, we'll have to see".
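On Linux, the early plumbing mostly exposes a CXL expander as a CPU-less NUMA node, so from user space the locality work looks something like this sketch (assuming libnuma, and assuming the expander shows up as node 1 - the node number and sizes are made up):

    /* Sketch: keep hot state local, push a big cold buffer to the far
       (CXL-backed) node. Build with: gcc far_alloc.c -lnuma */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }
        const int far_node = 1;          /* hypothetical CXL node */
        const size_t sz = 1UL << 30;     /* 1 GiB cold buffer */

        void *hot  = numa_alloc_local(4096);          /* latency-sensitive */
        void *cold = numa_alloc_onnode(sz, far_node); /* capacity tier */
        if (!hot || !cold) return 1;

        memset(cold, 0, sz);             /* fault the far pages in */
        numa_free(cold, sz);
        numa_free(hot, 4096);
        return 0;
    }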
Azure has some interesting whitepapers out as they've been investigating ways to use CXL with VMs, https://www.microsoft.com/en-us/research/wp-content/uploads/....
But CXL-backed memory still goes through your CPU caches as usual, and PCIe 5.0 lane throughput is still good, assuming the CXL controller/DRAM side doesn't become a bottleneck. So you can design your engines and data structures around these tradeoffs: scanning columnar data structures, prefetching to hide latency (a toy sketch just below), and so on. You probably don't want global shared locks or frequent atomic operations on CXL-backed shared memory (once that becomes possible, in theory, with CXL 3.0).
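Nothing CXL-specific in the code itself - it's the standard trick of issuing prefetches far enough ahead that a few hundred ns of latency overlaps with work already in flight. The gather-through-rowids pattern is where it pays off, since hardware prefetchers already handle plain sequential scans; the distance of 16 is a made-up number you'd tune:

    /* Toy selective column scan: prefetch the rows we'll need ~16
       iterations from now so far-memory latency hides behind the adds.
       __builtin_prefetch is the GCC/Clang builtin. */
    #include <stddef.h>
    #include <stdint.h>

    uint64_t sum_selected(const uint64_t *col,      /* CXL-resident column */
                          const uint32_t *rowids,   /* qualifying row ids  */
                          size_t n) {
        const size_t dist = 16;   /* prefetch distance; tune per medium */
        uint64_t acc = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&col[rowids[i + dist]], 0, 0);
            acc += col[rowids[i]];
        }
        return acc;
    }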
Edit: I'll plug my own article here - if you've wondered whether there were actual large-scale commercial products that used Intel's Optane as intended, Oracle Database took good advantage of it (both the Exadata and plain database engines). One use was low-latency durable (local) commits on Optane:
https://tanelpoder.com/posts/testing-oracles-use-of-optane-p...
VMware supports it as well, but using it as a simpler layer for tiered memory.
I'd bet contested locks spend more time in cache than most other lines of memory, so in practice a global lock might not be too bad.
I would be more worried about memory bandwidth. You can now add so much memory to your servers that a full in-memory table scan might take minutes: e.g., 32 TB at ~500 GB/s of aggregate socket bandwidth is already over a minute per pass, and far memory behind a ~64 GB/s PCIe 5.0 x16 link is an order of magnitude slower.
:-/
But, because I'm a good sport, I actually chased a couple of those links figuring I could convert Egyptian Pounds into USD, but <https://www.sigma-computer.com/en/search?q=CXL%20R5X4> returns "No results", and similarly for the others that I could even get to load.
I think the main bridge chipsets come from Microchip (this one) and Montage.
This Gigabyte product is interesting since it's a little lower end than most CXL solutions - so far CXL memory expansion has mostly appeared in esoteric racked designs like the particularly wild https://www.servethehome.com/cxl-paradigm-shift-asus-rs520qa... .
"CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies" https://arxiv.org/abs/2506.15601
Do like the card though; I was waiting for someone to make an affordable version (or rather: this looks affordable, I hope it will be both that and actually obtainable. CXL has been kinda locked away so far…)
A lot of the initial use cases for CXL seem to be about soaking up piles of older DDR4 RDIMMs in newer systems to expand memory - e.g., cloud providers have a lot of them.
I guess there are some use cases for this for local users, but I think the biggest wins could come from CXL shared memory arrays in smaller clusters. You could, for example, cache the entire build side of a big hash join in shared CXL memory and let all the other nodes performing the join see a single shared dataset (a rough sketch follows after the links below). Or build a "coherent global buffer cache" out of CPU+PCIe+CXL hardware, like Oracle Real Application Clusters has been doing with software+NICs for the last 30 years.
Edit: One example of a CXL shared memory pool device is Samsung CMM-B. Still just an announcement; I haven't seen it in the wild. So CXL arrays might become something like the SAN arrays of the future, but byte-addressable and loading directly into CPU caches (with cache coherence).
https://semiconductor.samsung.com/news-events/tech-blog/cxl-...
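To make the hash-join idea a bit more concrete, a rough sketch of the build side. Big caveats: true multi-host sharing needs the CXL 3.0 coherence mentioned above, so this stand-in just binds the table to a made-up NUMA node 1 as today's single-host approximation, and the layout and hash are arbitrary. The point is that after the build phase the table is probed strictly read-only, so no cross-node locks or atomics ever touch the far memory:

    /* Sketch: build-side hash table placed on the "pool" node, then
       probed read-only by everyone. Assumes nonzero keys (0 = empty).
       Build with: gcc hashjoin.c -lnuma */
    #include <numa.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { uint64_t key, payload; } slot_t;
    #define NSLOTS (1UL << 20)           /* power of two */
    #define POOL_NODE 1                  /* hypothetical CXL pool node */

    static uint64_t h(uint64_t k) {
        return (k * 0x9E3779B97F4A7C15UL) & (NSLOTS - 1);
    }

    slot_t *build(const uint64_t *keys, const uint64_t *vals, size_t n) {
        slot_t *tab = numa_alloc_onnode(NSLOTS * sizeof *tab, POOL_NODE);
        if (!tab) return NULL;
        memset(tab, 0, NSLOTS * sizeof *tab);
        for (size_t i = 0; i < n; i++) {  /* single-writer build phase */
            uint64_t s = h(keys[i]);
            while (tab[s].key) s = (s + 1) & (NSLOTS - 1); /* linear probe */
            tab[s].key = keys[i];
            tab[s].payload = vals[i];
        }
        return tab;   /* from here on: read-only, shared by all probers */
    }

    int probe(const slot_t *tab, uint64_t key, uint64_t *out) {
        for (uint64_t s = h(key); tab[s].key; s = (s + 1) & (NSLOTS - 1))
            if (tab[s].key == key) { *out = tab[s].payload; return 1; }
        return 0;
    }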
For example:
https://en.wikipedia.org/wiki/I-RAM
(Not a unique thing, merely the first one I found).
And then there are the more exotic options, like the stuff these folk used to make: https://en.wikipedia.org/wiki/Texas_Memory_Systems - IIRC Eve Online used the RamSan product line (apparently starting in 2005: https://www.eveonline.com/news/view/a-history-of-eve-databas... )
The technical explanations for why you can't have extra DRAM controllers on PCIe are increasingly sounding like market segmentation rather than purely technical reasons. x86 is a memory-mapped I/O platform; why can't we just have RAM sticks at RAM addresses?
The reverse of this works, btw: NVMe drives can use Host Memory Buffer to cache reads and writes in system RAM - the feature implicated in the recently rumored bad-ntfs.sys incident in Windows 11.
If you need a lot of memory bandwidth, workstation boards have DDR5 interfaces 256-512 bits wide. Apple Silicon covers that range on Pro and Max, and Ultra is 1024 bits. (512 bits of LPDDR5-6400 works out to ~410 GB/s.)
(I'm using bits instead of channels because channels/subchannels can be 16, 32, or 64 bits wide.)
Apple FBOW, based on a quick and sloppy count of a reballing jig [1], has something on the order of 2500-2700 balls on an M2 CPU.
I think AMD's FP11 'socket' (it's really just a standard ball grid array) pinout is something on the order of 2000-2100 balls, and that gets you four 64-bit DDR channels (I think Apple works a bit differently and uses 16-bit channels, hence the higher 'channel count' on an M2).
Which is a roundabout way of saying: AMD and Intel could probably match the bandwidth, but doing so would likely require moving to soldered CPUs, which would be a huge paradigm shift for all the existing board makers/etc.
[0] - They do have other tradeoffs; namely, 1151 has built-in PCIe, but on the other hand the link to the PCH is AFAIR a good bit thinner than the QPI link on 1366.
[1] - https://www.masterliuonline.com/products/a2179-a1932-cpu-reb... . I counted ~55 rows along the top and ~48 rows on the side...
I think part of it might be that Apple recognized that integrated GPUs require a lot of bulk memory bandwidth. I noticed this with their tablet-derivative cores, whose memory bandwidth tended to scale with screen size, while Samsung and Qualcomm didn't bother for ages - and it sucked doing high-speed vision systems on their chips because of it.
For years Intel had been slowly beefing up the L2/L3/L4 caches.
The M1 Max is somewhere between an Nvidia 1080 and 1080 Ti in bulk bandwidth. The lowest-end M chips aren't competitive, but nearly everything above that overlaps even current-gen Nvidia 4050+ offerings.
Yeah, Apple definitely realized they should do something, and for as much as I don't care for their ecosystem, I think they were very smart about how they handled the need for memory bandwidth. E.g., having many 16-bit channels instead of fewer 64-bit ones probably allows better power management: you can relocate data on 'sleep'/'wake' and leave more of the RAM powered off.
That plus the good UMA impl has left the rest of the industry 'not playing catchup', i.e.:
- Intel failing to capitalize on the opportunity of a 'VRAM heavy' low end card to gain market share,
- AMD failing to bite the bullet and meaningfully try to fight Nvidia on memory/bandwidth margin...
- Nvidia just raking that margin in...
- By this point you'd think Qualcomm would just do an 'AI Accelerator' reference platform just to try....
- I'm guessing whatever efforts are happening in China, they are too busy trying to fill internal needs to bother boasting and tipping their hat; better to let outside companies continue to overspend on the current paradigm.
IIRC Intel even made a DRAM card that was drum-memory compatible.
For a memory controller, that thing looks hot!
[1] PDF Data sheet: https://ww1.microchip.com/downloads/aemDocuments/documents/D...