One of the original proposals for in-DRAM compute: https://users.ece.cmu.edu/~omutlu/pub/in-DRAM-bulk-AND-OR-ie...
First demonstration with off-the-shelf parts: https://parallel.princeton.edu/papers/micro19-gao.pdf
DRAM Bender, the tool they are using to implement this: https://github.com/CMU-SAFARI/DRAM-Bender
Memory-Centric Computing: Recent Advances in Processing-in-DRAM: https://arxiv.org/abs/2412.19275
Edit: Oh and cpldcpu linked the ComputeDRAM paper that explains how to do it with off the shelf parts.
I was expecting to find this 2016 article in there: https://news.ycombinator.com/item?id=12469270
This 2019 one does show up: https://news.ycombinator.com/item?id=22712811
Of course, this "out of spec" behaviour of DRAM, more specifically the ability to do copying, is also implicated in this infamous bug: https://news.ycombinator.com/item?id=5314959
It seems more than one person independently observed such a thing, and thought "this might be a useful behaviour".
Take that, binary blobs for DRAM training!
Processing-Using-DRAM (PUD) leverages the inherent analog operational characteristics of DRAM to enable highly parallel bit-serial computations directly within memory arrays. Prior research has demonstrated that commercial off-the-shelf DRAM can achieve PUD functionality without hardware modifications by intentionally violating the timing parameters.
These studies have established two fundamental PUD operations: RowCopy and majority-of-X (MAJX) (Fig. 1).

The RowCopy operation facilitates data movement between different rows within a subarray by issuing a PRE command followed immediately by an ACT command, before bitline precharging completes, enabling data transfer through the bitlines. This operation affects all cells along a row simultaneously, making it approximately 100 times faster than processor-mediated data movement.

The MAJX operation performs a majority vote among X cells sharing the same bitline that are activated simultaneously, implemented in commercial DRAM by issuing ACT, PRE, and ACT commands in rapid succession without delays; this allows concurrent activation of 2∼32 rows. MAJX enables bit-serial computations that leverage the parallelism of subarrays with 65,536 columns, serving as the fundamental computational unit for PUD.
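As a toy illustration of why MAJX is enough to build logic on (plain Python over bit vectors, none of the real DRAM timing tricks involved): a triple-row activation computes a bitwise majority, and pinning one operand row to all-0s or all-1s reduces it to the bulk AND/OR from the original in-DRAM compute proposal.

```python
def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three equal-width bit vectors packed into ints."""
    return (a & b) | (b & c) | (a & c)

ROW_WIDTH = 8                      # a real subarray row would be tens of thousands of bits wide
ZEROS = 0                          # control row pre-initialized to all 0s
ONES = (1 << ROW_WIDTH) - 1        # control row pre-initialized to all 1s

a, b = 0b11001010, 0b10100110
assert maj3(a, b, ZEROS) == a & b  # majority with a zero row acts as bulk AND
assert maj3(a, b, ONES) == a | b   # majority with a ones row acts as bulk OR
```

Every bit position is computed in the same "instruction", which is where the claimed parallelism across 65,536 columns comes from.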
* https://www.servethehome.com/sk-hynix-ai-memory-at-hot-chips...
* https://www.servethehome.com/samsung-processing-in-memory-te...
Not sure anyone has started using it in production.
We'll see that before anything built around HBM or GDDR.
edit: seems like there was a recent discussion about something similar... undefined behavior in some C function iirc
Certain network cards have either a bug or combination of features that work in an interesting way to the benefit of the trading firm.
These bugs (and features too) sometimes get removed, either because the vendor fixes the bug or decides the features aren't needed by the larger market. Therefore, firms will sometimes attempt to buy up all available supply of certain models.
For communication equipment, this is super important, with all sorts of "quirks" put in for vendors that didn't follow the spec. And, that includes keeping quirks in your firmware, so you don't break anyone else's. Imagine entire walls of legacy and long-gone and current competitor equipment, with robot arms to plug things in, and you have an idea of what some hardware validation labs look like.
Motherboard manufacturer firmware is also filled with quirks for specific CPUs, chipsets, etc.
As for this paper, it's not about relying on a bug but rather presenting what might be possible with DRAM in the hopes of standardizing capabilities.
Ok, so my math isn't great.
When I was studying quaternions in my 3D math class (that I failed the first time; like I said, not a math guy), they briefly covered the history of matrix calculation in graphics development.
My understanding is that quaternions became popular because they are almost as accurate as matrices but much less complex computationally.
Has anyone tried building an LLM using Quats instead of matrices?
Or are the optimisations with Quaternions more useful in realtime?
Does this mean all functions that can be described by quaternions are non-linear, or does it mean that quaternions can describe some linear functions such as the ones associated with rotations in 3D space but there are linear function they cannot describe?
Matrices and quaternions take different approaches to describing rotations: a matrix sees a rotation as a linear function, and quaternions see rotations as a group (confusingly represented with matrices, this field is called representation theory if you want to know more).
So the answer to your question: there are linear functions that quaternions cannot describe. And quaternions can only describe a very specific class of linear functions (with some rather complicated maths behind them).
And beyond that, for those rotations, a quaternion doesn't scale nearly as well as you add dimensions. Complex numbers are a complex representation of two space, quaternions are a complex representation of three space, and to go to four space you have octonions, which have eight elements.
Of course, that the 3d thing you end up with represents rotations in 3d space is extremely neat; and not something all 3d things do.
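To make the "quaternions describe a very specific class of linear functions" point concrete, here is a hedged sketch (plain Python, a naive Hamilton product, no quaternion library assumed): rotating a vector v by a unit quaternion q via q v q* is a linear map on 3-space, and that map is exactly a rotation.

```python
import math

def qmul(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z) tuples."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (pw*qw - px*qx - py*qy - pz*qz,
            pw*qx + px*qw + py*qz - pz*qy,
            pw*qy - px*qz + py*qw + pz*qx,
            pw*qz + px*qy - py*qx + pz*qw)

def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q, computing q * (0, v) * q-conjugate."""
    qconj = (q[0], -q[1], -q[2], -q[3])
    w, x, y, z = qmul(qmul(q, (0.0, *v)), qconj)
    return (x, y, z)

# A 90-degree rotation about the z-axis: half-angle goes into the quaternion.
theta = math.pi / 2
q = (math.cos(theta / 2), 0.0, 0.0, math.sin(theta / 2))
v = rotate(q, (1.0, 0.0, 0.0))
assert all(abs(a - b) < 1e-9 for a, b in zip(v, (0.0, 1.0, 0.0)))  # x-axis maps to y-axis
```

The map v -> q v q* is linear in v, so it could always be written as a 3x3 matrix; the quaternion is just a 4-number encoding of that rotation matrix, which is why quaternions cover rotations but not arbitrary linear functions (no scaling, shearing, or projection).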
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=iram...
However, IRAM looks like compute-near-memory, where they add an ALU to the memory chip. Compute-in-memory is about using the memory array itself.
To be fair, CIM looked much less appealing before the advent of deep learning with its crazy vector lengths. So people instead tried to build something that allows more fine-grained control of the operations.
You are right, I remember 1972-ish papers where they did compute in memory. I just couldn't locate links to these papers in a few minutes.
And does such a processing shift give an advantage to Samsung and the other memory makers? Where does this leave NVIDIA?
https://www.patentlyapple.com/2024/12/apple-plans-to-transit...