> But matrix multiplication, to which our civilization is now devoting so many of its marginal resources, has all the elegance of a man hammering a nail into a board.
is the most interesting one.
A man hammering a nail into a board can be both beautiful and elegant! If you've ever seen someone effortlessly hammer nail after nail into wood, hardly having to think about what they're doing, you've seen a master craftsman at work. Speaking as a numerical analyst, I'd say a well-multiplied matrix is much the same. There is a lot that goes into how deftly a matrix might be multiplied. And just as someone can hammer a nail poorly, so too can a matrix be multiplied poorly. I would say the matrices being multiplied in service of training LLMs are not a particularly beautiful example of what matrix multiplication has to offer. The fast Fourier transform, viewed as a sparse matrix factorization of the DFT, with its concomitant numerical stability, might be a better candidate.
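To make that concrete, here's a minimal sketch (my own illustration, not anything from the article) of the 4-point DFT factored into a permutation, a block-diagonal pair of 2-point DFTs, and a sparse "butterfly" stage, which is exactly the radix-2 Cooley-Tukey idea:

    import numpy as np

    # 4-point DFT matrix, F4[j, k] = w**(j*k) with w = exp(-2*pi*i/4)
    w = np.exp(-2j * np.pi / 4)
    F4 = np.array([[w**(j * k) for k in range(4)] for j in range(4)])

    # Even/odd permutation: (x0, x1, x2, x3) -> (x0, x2, x1, x3)
    P = np.eye(4)[[0, 2, 1, 3]]

    # Block-diagonal stage: two independent 2-point DFTs
    F2 = np.array([[1, 1], [1, -1]])
    Z = np.zeros((2, 2))
    blockdiag = np.block([[F2, Z], [Z, F2]])

    # Sparse "butterfly" stage recombining the halves with the twiddle factor w
    butterfly = np.array([[1, 0, 1, 0],
                          [0, 1, 0, w],
                          [1, 0, -1, 0],
                          [0, 1, 0, -w]])

    # One dense matmul becomes a product of three sparse factors
    assert np.allclose(F4, butterfly @ blockdiag @ P)

Each factor has O(N) nonzeros, and applying the same split recursively is what takes you from N^2 down to N log N multiplications.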
Generally, low-rank and block-diagonal matrices are both great strategies for producing expressive matmuls with fewer parameters. We can view the FFT as a particularly deft example of factorizing one big matmul into a number of block-diagonal matmuls, greatly reducing the overall number of multiplications by minimizing the block size. However, on a GPU or TPU we have a lot more parallelism available, so the sweet spot for block size may be larger than 2x2...
We can also mix low-rank, block diagonal, and residual connections to get the best of both worlds:
x' = (L@x + B@x + x)
The block-diagonal matrix does 'local' work, and the low-rank matrix does 'broadcast' work. I find it pretty typical to be able to replace a single dense matmul with this kind of structure and save ~90% of the params with no quality cost... (and sometimes the regularization actually helps!)
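A rough numpy sketch of what I mean (the sizes are made up for illustration):

    import numpy as np

    d, r, b = 512, 8, 32                      # hidden size, low-rank width, block size (illustrative)
    rng = np.random.default_rng(0)

    U = rng.normal(size=(d, r))               # low-rank factor: L = U @ V
    V = rng.normal(size=(r, d))
    blocks = rng.normal(size=(d // b, b, b))  # block-diagonal B, stored as d/b small blocks

    def structured_layer(x):
        broadcast = U @ (V @ x)                                 # low-rank 'broadcast' work
        local = (blocks @ x.reshape(d // b, b, 1)).reshape(d)   # block-diagonal 'local' work
        return broadcast + local + x                            # residual connection

    dense_params = d * d
    structured_params = 2 * d * r + (d // b) * b * b
    print(structured_params / dense_params)   # ~0.09, i.e. roughly 90% fewer params

    y = structured_layer(rng.normal(size=d))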
Riffing further on the Fourier connection: are you planning to explore the link between matmul and differentiation?
Using the "Pauli-Z" matrix that you introduced without a straightforward motivation, eg.
(I took it that you intended it to be a "backyard instance" of "dual numbers")
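For what it's worth, the way I'd make the matmul/differentiation link concrete is the nilpotent 2x2 representation of dual numbers; this is my reading of the connection, not necessarily what the article meant by the Pauli-Z matrix:

    import numpy as np

    # Represent the dual number a + b*eps as the 2x2 matrix [[a, b], [0, a]].
    # The generator N = [[0, 1], [0, 0]] squares to zero, so matrix products
    # automatically apply the product rule to the top-right entry.
    def dual(a, b=1.0):
        return np.array([[a, b], [0.0, a]])

    x0 = 3.0
    M = dual(x0)          # plays the role of "x0 + eps"
    I = np.eye(2)

    # Evaluate p(x) = x**3 - 2x + 1 using nothing but matmuls and additions
    P = M @ M @ M - 2 * M + I

    print(P[0, 0])   # p(x0)  = 22.0
    print(P[0, 1])   # p'(x0) = 3*x0**2 - 2 = 25.0

That's forward-mode automatic differentiation falling out of matrix multiplication.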
Elegance is a silly critique. Imagine instead we were spending trillions on floral bouquets, calligraphy, and porcelain tea sets. I would argue that would be a bad allocation of resources.
What matters to me is whether it solves the problems we have. Not how elegant we are in doing so. And to the extent AI fails to do that, I think those are valid critiques. Not how elegant it is.
But elegant can mean minimal, restrained, parsimonious, sparing. That's different from a bunch of paraphernalia and flowery nonsense.
Maybe we need a word that, when applied to mathematical concepts, describes how simple, easy to understand and generally useful a solution or idea is.
I wonder what that word could be.
Lol, did you think this was clever? You just literally reiterated exactly what I said. See, if you had said "there are many pianists that find beauty in math" - you know, like how many mathematicians find beauty in piano concertos - then you'd have me.
But I'll admit it's barely beautiful. Within which context, I guess the article's lawyering for the relative beauty of a matrix was a success, but I always liked matrices better than calculus or group theory anyway.
That's because we don't fully understand what a number is and what multiplication is. We defined -x and 1/x as inverses (additive and multiplicative), but what is -1/x? Consider them as operations: apply any one of them to either of the others and you get the third one. Thus they occupy peer status. Yet we hardly ever talk about -1/x.
The mathematical inquisition is in its infancy.
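A quick sanity check of that closure claim; together with the identity, these maps form a Klein four-group under composition:

    from fractions import Fraction

    neg = lambda x: -x            # additive inverse
    inv = lambda x: 1 / x         # multiplicative inverse
    neg_inv = lambda x: -1 / x    # the neglected third operation

    x = Fraction(7, 3)            # any nonzero value works

    # Composing any two distinct operations yields the third
    assert neg(inv(x)) == neg_inv(x)
    assert neg(neg_inv(x)) == inv(x)
    assert inv(neg_inv(x)) == neg(x)

    # And each one is an involution (its own inverse)
    assert neg(neg(x)) == x and inv(inv(x)) == x and neg_inv(neg_inv(x)) == x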
It feels like linear algebra tries to get at the heart of this generality, but the structure and operators are more constrained than they ultimately could be. It's a small, oddball computational device that can be tersely written into papers and widely understood. I always find pseudocode easier to follow and reason about, but that's my particular bias.
I've mentioned it before, but I'd love for sparse operations to be more widespread in HPC hardware and software.
Linear transformations are a beautiful thing, but matrices are an ugly representation that nevertheless is a convenient one when we actually want to compute.
Elegant territory. Inelegant, brute-force, number crunching map.
All such things are like this.
For me, this is fascinating, mind-boggling, nonsensical, and unsurprising, all at once.
But I wouldn’t call it inelegant.
I seriously doubt that was ever true, except perhaps for a very brief time in the 1950s or 60s.
Linear programming is an incredibly niche application of computing, used so infrequently that I've never seen it utilised anywhere, despite being a consultant who has visited hundreds of varied customers, including big business.
It's like Wolfram Mathematica. I learned to use it in university, became proficient at it, and I've used it about once a decade "in industry", because most jobs are targeted at the median worker. The median worker is, practically speaking, innumerate: unable to read a graph or understand a curve fit, and even when they can, their knowledge won't extend to confidence intervals or non-linear fits such as log-log graphs.
Teachers who are exposed to the same curriculum year after year, seeing the same topics over and over, assume that industry must be the same as their lived experience. I've lost count of the number of papers I've seen about Voronoi diagrams or Delaunay triangulations, neither of which I've ever seen applied anywhere outside of a tertiary education setting. I mean, seriously, who uses this stuff!?
In the networking course in my computer science degree I had to use matrix exponentiation to calculate the maximum throughput of an arbitrary network topology. If I were to even suggest something like this at any customer, even those spending millions on their core network infrastructure, I would be either laughed at openly, or their staff would gape at me in wide-eyed horror and back away slowly.
I’m not saying these things have zero utility, it’s just that they’re used far less frequently in industry than academics imagine.
Of course that doesn't really make any sense at the matrix level. And (from what I understand) techniques like MoE move in that direction. So the criticism doesn't really make sense anymore, except in that brains are still much, much more efficient than LLMs, so we know that we could do better.
You can have very beautiful algorithms when you assume the matrices involved have a certain structure. You can even have that A*B == B*A, if A and B have a certain structure.
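Circulant matrices are one concrete example (my illustration, not from the article): they are all polynomials in the same cyclic-shift matrix, so any two of them commute:

    import numpy as np
    from scipy.linalg import circulant

    # Each column of a circulant matrix is a cyclic shift of the first column
    A = circulant([1.0, 2.0, 3.0, 4.0])
    B = circulant([5.0, -1.0, 0.0, 2.0])

    # Both are diagonalized by the same Fourier basis, so A*B == B*A
    assert np.allclose(A @ B, B @ A)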
I know linear algebra, but this part seems profoundly unclear. What does "send" mean? Following with different examples in 2 by 2 notation only makes it worse. It seems like you're changing referents constantly.
In US schools during K-12, we generally learn functions in two ways:
1. A 2-d line chart with an x-axis and y-axis, like temperature over time, the history of a stock price, etc. The classic setup: independent variable on the horizontal axis, dependent variable on the vertical axis. Even people who have forgotten almost all math can instantly understand the graphics displayed when they're watching CNBC or a TV weather report.
2. We also think of functions like little machines that do things for us. E.g., y = f(x) means that f() is like a black box. We give the black box input 'x'; then the black box f() returns output 'y'. (Obviously very relevant to the life of programmers.)
But one of 3Blue1Brown's excellent videos finally showed me at least a few more ways of thinking about functions. This is where a function acts as a map from one "thing" to another (technically from Domain X to Co-Domain Y).
So if we think of NVIDIA's stock price over time (Interpretation 1) as a graph, it's not just a picture that goes up and to the right. It's mapping each point in time on the x-axis to a price on the y-axis, sure! Take the example: x = November 21, 2025 maps to y = $178/share. Of course, Interpretation 2 might say that the black box of the function takes in "November 21, 2025" as input and returns "$178" as output.
But what I call Interpretation 3 maps from the domain of Time to the co-domain of NVDA Stock Price.
This is a 1-d to 1-d mapping, i.e., both x and y are scalar values. In the language that jamespropp used, we "send" the value "November 21, 2025" to the value "$178".
But we need not restrict ourselves to a 1-dimensional input domain (time) and a 1-dimensional output domain (price).
We could map from a 2-d Domain X to another 2-d Co-Domain Y. For example, X could be 2-d geographical coordinates, and Y could be a 2-d wind vector.
So we would feed in, say, the location (5,4), and our 2-d-to-2-d function would output the wind vector (North by 2 mph, East by 7 mph).
So we are "sending" the input (5,4) in the first 2-d plane to the output (+2,+7) in the second 2-d plane.
There are many complications arising from such a thing existing, and from what was needed to bring it into existence (and at the cost of whom), I'll never deny that. I just can't comprehend how someone can find the technical aspects repulsive in isolation.
It feels a lot like trying to convince someone that nuclear weapons are bad by arguing that splitting an atom is akin to banging a rock against a coconut to split it in two.
The way I've always explained this to non-algebra people is to imagine driving in a city downtown. If you're at an intersection and you turn right, then left at the next intersection, you'll end up at a completely different spot than if you were to instead turn left and then right.
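The matrix version of the same story (a quick illustration): a quarter-turn followed by a reflection lands you somewhere different than the reflection followed by the quarter-turn:

    import numpy as np

    turn = np.array([[0, -1],     # rotate 90 degrees counterclockwise
                     [1,  0]])
    flip = np.array([[1,  0],     # reflect across the x-axis
                     [0, -1]])

    p = np.array([1, 0])
    print(flip @ (turn @ p))   # turn first, then flip: [ 0 -1]
    print(turn @ (flip @ p))   # flip first, then turn: [0 1]

    assert not np.allclose(flip @ turn, turn @ flip)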
Using matrix multiplication is also ugly when it's literally millions of times less efficient than a proper solution.