> But matrix multiplication, to which our civilization is now devoting so many of its marginal resources, has all the elegance of a man hammering a nail into a board.
is the most interesting one.
A man hammering a nail into a board can be both beautiful and elegant! If you've ever seen someone effortlessly hammer nail after nail into wood, hardly having to think about what they're doing, you've seen a master craftsman at work. Speaking as a numerical analyst, I'd say a well-multiplied matrix is much the same. There is much that goes into how deftly a matrix might be multiplied. And just as someone can hammer a nail poorly, so too can a matrix be multiplied poorly. I would say the matrices being multiplied in service of training LLMs are not a particularly beautiful example of what matrix multiplication has to offer. The fast Fourier transform viewed as a sparse matrix factorization of the DFT and its concomitant properties of numerical stability might be a better candidate.
Generally, low-rank and block-diagonal matrices are both great strategies for producing expressive matmuls with fewer parameters. We can view the FFT as a particularly deft example of factorizing one big matmul into a number of block-diagonal matmuls, greatly reducing the overall number of multiplications by minimizing the block size. However, on a G/TPU, we have a lot more parallelism available, so the sweet spot for size of the blocks may be larger than 2x2...
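To make the "one big matmul factored into small blocks" view concrete, here is a minimal sketch in plain Python. It compares a dense DFT (multiplying by the full n x n matrix) with a textbook radix-2 Cooley-Tukey FFT, where each inner-loop step is one 2x2 "butterfly" block. The function names and the test signal are made up for illustration, and the multiply counts are the usual back-of-envelope figures rather than measurements.

```python
import cmath

def dft_matmul(x):
    """Dense DFT: multiply by the full n x n matrix (about n*n complex multiplies)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft_radix2(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_radix2(x[0::2])   # half-size transform of even-indexed samples
    odd = fft_radix2(x[1::2])    # half-size transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        # One 2x2 butterfly: [[1, 1], [1, -1]] applied to (even[k], t).
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

signal = [complex(v) for v in [1, 2, 3, 4, 0, -1, -2, -3]]  # illustrative input
n = len(signal)
max_err = max(abs(a - b) for a, b in zip(dft_matmul(signal), fft_radix2(signal)))

dense_multiplies = n * n                               # 64 for n = 8
fft_multiplies = (n // 2) * (n.bit_length() - 1)       # (n/2) * log2(n) = 12 for n = 8
```

Both routes give the same transform up to rounding, but the factored version needs (n/2)·log2(n) twiddle multiplies instead of n². On wide parallel hardware one might stop the recursion earlier and use larger dense blocks, which is the trade-off the comment alludes to.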
We can also mix low-rank, block diagonal, and residual connections to get the best of both worlds:
x' = (L@x + B@x + x)
The block-diagonal matrix does 'local' work, and the low-rank matrix does 'broadcast' work. I find it pretty typical to be able to replace a single dense matmul with this kind of structure and save ~90% of the params with no quality cost... (and sometimes the regularization actually helps!)
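A minimal sketch of that layer in plain Python, under some assumptions: the low-rank matrix is stored as two factors `U` (d x r) and `V` (r x d), the block-diagonal matrix as a list of b x b blocks, and the sizes (d=1024, r=32, b=32) are picked only to show how a roughly 90% parameter saving can arise. None of these names or numbers come from the comment above.

```python
import random

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def structured_layer(x, U, V, blocks):
    """Compute x' = L@x + B@x + x with L = U@V (low rank) and B block-diagonal."""
    broadcast = matvec(U, matvec(V, x))        # rank-r path: d -> r -> d
    b = len(blocks[0])                         # block size
    local = []
    for i, blk in enumerate(blocks):           # each block only sees its own slice
        local.extend(matvec(blk, x[i * b:(i + 1) * b]))
    return [g + l + xi for g, l, xi in zip(broadcast, local, x)]

def param_counts(d, r, b):
    """Parameters of a dense d x d layer vs. low-rank factors plus blocks."""
    return d * d, 2 * d * r + d * b

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Tiny numerical check against the equivalent explicit dense matrix.
random.seed(0)
d, r, b = 8, 2, 4
U, V = rand_matrix(d, r), rand_matrix(r, d)
blocks = [rand_matrix(b, b) for _ in range(d // b)]
x = [random.uniform(-1, 1) for _ in range(d)]

dense = [[sum(U[i][k] * V[k][j] for k in range(r))                       # L
          + (blocks[i // b][i % b][j % b] if i // b == j // b else 0.0)  # B
          + (1.0 if i == j else 0.0)                                     # residual
          for j in range(d)] for i in range(d)]
max_err = max(abs(p - q)
              for p, q in zip(structured_layer(x, U, V, blocks), matvec(dense, x)))

dense_params, structured_params = param_counts(1024, 32, 32)
savings = 1 - structured_params / dense_params   # 0.90625 for these sizes
```

With these illustrative sizes the structured layer has 98,304 parameters against 1,048,576 for the dense one. Whether quality holds up at that ratio is an empirical question the sketch does not address.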
There's a lot of opportunity here. Just because matrix multiplication makes for a beautiful mathematical building block, and a very reasonable one to build high-level ML logic on, doesn't mean it needs to be computed the same way, and in the same order, that we learned in linear algebra courses.
I'm quite curious if this is being used in practice at scale, or whether it's still in the lab at the moment!
I think this touches on something fundamental. As a stand-alone operation matmul is ugly because it's arbitrary. In other words, if the goal was just to entangle values, there are a bunch of ways to do it, so why this particular way, landing on ae+bg etc.? You kind of need algebra/geometry to justify matmul this way, which makes it obviously useful, but now it's still ugly, exactly because you had to invoke this other stuff.
Compare that situation to algebra and geometry themselves, which in a real sense don't need each other. Or to things like logic, sets, categories, processes, numbers, knots, games, etc where you can build up piles of stuff based on it in a whole rich universe before you need to appeal to much that is "outside". And in those universes operations would be defined mostly in ways that were more like "natural" or "necessary" without anything feeling arbitrary.
Traditional matmul is beautiful in the sense of "connections across and between", where all the particulars do become necessary. For those that prefer a certain amount of abstract perfection / platonism / etc or those with a taste for foundations though, it's understandable if it's not that appealing. This is related to, but not the same as the pure vs applied split.
I haven't seen banded matrices as much, though (with weight sharing) they're just convolutions. One nice feature of block diagonality is that you can express it as batched matrix multiplication, reusing all the existing matmul kernels.
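The "banded plus weight sharing equals convolution" claim can be checked with a small sketch: build an n x n banded matrix whose rows all reuse the same kernel, and compare its matrix-vector product with a zero-padded sliding window. The helper names, kernel, and input are illustrative assumptions.

```python
def banded_toeplitz(kernel, n):
    """n x n banded matrix in which every row holds the same shared weights."""
    half = len(kernel) // 2
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for t, w in enumerate(kernel):
            j = i + t - half
            if 0 <= j < n:
                M[i][j] = w
    return M

def sliding_window(kernel, x):
    """Zero-padded 'same'-length sliding window (cross-correlation form)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for t, w in enumerate(kernel):
            j = i + t - half
            if 0 <= j < len(x):
                acc += w * x[j]
        out.append(acc)
    return out

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

kernel = [1.0, -2.0, 1.0]            # second-difference stencil
x = [0, 1, 4, 9, 16, 25]             # squares, so interior outputs are all 2
via_matrix = matvec(banded_toeplitz(kernel, len(x)), x)
via_window = sliding_window(kernel, x)
```

Both paths give `[1.0, 2.0, 2.0, 2.0, 2.0, -34.0]`. The matrix form stores n x n entries for what is really three shared weights, which is why the convolution view is the practical one.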
Thanks for the link; that is absolutely masterful work.
Riffing further on the Fourier connection: are you planning to explore the link between matmul and differentiation?
Using the "Pauli-Z" matrix that you introduced without a straightforward motivation, eg.
(I took it that you intended it to be a "backyard instance" of "dual numbers")
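One way to read the "dual numbers" remark, sketched here as an assumption rather than a reconstruction of the article's construction: represent a + bε as the 2x2 matrix [[a, b], [0, a]]. Ordinary matrix multiplication then carries the product rule, so evaluating a polynomial at [[x, 1], [0, x]] yields both f(x) and f'(x). The polynomial f(x) = x³ + 2x is chosen only for illustration.

```python
def matmul2(A, B):
    """2x2 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def add2(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def scale2(c, A):
    return [[c * A[i][j] for j in range(2)] for i in range(2)]

def dual(a, b=1):
    """Matrix form of the dual number a + b*eps, where eps*eps = 0."""
    return [[a, b], [0, a]]

# f(x) = x^3 + 2x evaluated on the matrix standing in for 3 + eps.
X = dual(3)
X3 = matmul2(matmul2(X, X), X)
F = add2(X3, scale2(2, X))

value, derivative = F[0][0], F[0][1]   # f(3) = 33, f'(3) = 3*9 + 2 = 29
```

The top-left entry is the function value and the top-right entry is the derivative, which is the sense in which matmul "knows" how to differentiate here.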
Elegance is a silly critique. Imagine instead we were spending trillions on floral bouquets, calligraphy, and porcelain tea sets. I would argue that would be a bad allocation of resources.
What matters to me is whether it solves the problems we have. Not how elegant we are in doing so. And to the extent AI fails to do that, I think those are valid critiques. Not how elegant it is.
But elegant can mean minimal, restrained, parsimonious, sparing. That's different from a bunch of paraphernalia and flowery nonsense.
Maybe we need a word that, when applied to mathematical concepts, describes how simple, easy to understand and generally useful a solution or idea is.
I wonder what that word could be.
And I would argue it wouldn't. So? It's a value call.
> What matters to me is whether it solves the problems we have.
Again, what is and is not a problem is a value call. "Lacking tools to surveil and control the population" and "having a population that demands its share of economic output" arguably are problems for someone, which AI probably could solve. "The planet is literally on fire" is another problem (for, arguably, a much bigger number of someones), and pouring terawatts of energy into chips that, coincidentally, do AI-related matrix multiplications won't solve that problem.
Much like sex. Sex has reproductive utility but that's not why most people engage in it. Those who do are missing much.
The notion of beauty for a mathematician is quite specialized. It's the difference between spaghetti code that works and elegant, efficient code that is correct. Beautiful results, like elegant code, are easy to build upon efficiently.
My guy you know lots of people in here have read Feynman right? You should cite him instead of pretending you were clever enough to come up with the analogy yourself.
Channeling Good Will Hunting much huh? Most HN'ers would have watched that too.
For all of your "forceful" comments on math, I think probably you don't actually know much about it.
That's not what I contested. The point was what fraction of people who use differentials in their published work still cite Newton or Leibniz. You can count the number of such citations in the last 10 years of, say, the neural nets literature or the applied maths literature and report back. That's plenty of use.
Citations to their differential calculus that are still made are mostly in the context of history of math.
Seems numeracy or comprehension is not your strong point. LOL.
> I have no idea what you're trying to say
Now I don't doubt it. LOL
Those papers were written in the 1600s. "The character of physical law", the essay you're ripping off, was written in 1964. Papers from the 1960s are 100% cited every single time the techniques are used.
You are as tedious as the original refrain I was complaining about (which is not at all ironic). What's most tedious is you're not actually a mathematician but presume to speak for them.
Honestly, quite often an idea did originate in my own thoughts, but the work of putting it into well-formed words, which I then use to tell others about it, was done by someone else whose formulation of the same idea I read later.
> it's pathetic to pass off someone else's insights as your own.
To which my point was that citations are made when there is an expectation of originality. By now Feynman's anecdotes are folklore and folk wisdom.
OK, let's go by your standards. The Cooley-Tukey FFT algorithm was "discovered" by them around 1965. How often do they get a citation when the FFT is used, especially in comments on a social site such as HN?
LOL even 10 years old results do not get cited because they are considered common knowledge.
That said, Witt's notion of beauty that Propp is critiquing in the posted article is just plain idiotic. Lack of commutativity is not lack of beauty. What a stupid idea.
Mathematical beauty and imagination is different. One of Hilbert's grad students dropped out to become a poet. Hilbert is reported to have said: 'I never thought he had enough imagination to be a mathematician.'
A little unsolicited advice: if you are an aspiring mathematician (I am very happy for you if you are) but do not have a sense of good taste or mathematical beauty, you will probably not have a good time.
Lol I have a PhD from a T10 and 15 published papers. I'm pretty sure I don't need your advice on "taste" or "beauty".
My condolences though, for being in a line of work where you don't perceive beauty.
The 15 is on the lower side. When I used to be there, 15 would be on the uncomfortable side :) Good luck upping the numbers. Oh! Do get back to me on the ratio of Cooley-Tukey citations to FFT mentions.
Lol did you think this was clever? You just literally reiterated exactly what I said. See, if you had said "there are many pianists that find beauty in math" - you know like how many mathematicians find beauty in piano concertos - then you'd have me.
Incidentally, in the parts of maths where the concepts can be visualized, such as fractal theory, non-mathematicians seem to love what they see.
Absolutely no one when they're navel gazing on this topic is discussing the aesthetics of notation.
> seem to love what they see.
Nor visualizations
People in general perceive music as "what is being played" vs. mathematics "what is being written on a page". This is the common concept, but it is incomplete. Music has its boring parts (notation), so does maths, but the general public is prone to confuse maths as a whole with its "sheet music".
"when they're navel gazing"
Maybe they're just thinking?
I asked him, “is this not the commuter rail?”
There's a wooden weaving machine at a heritage museum near me that gives me the same 'taste' in my brain as thinking about 'matrix' processing in a TPU or whatever.
But I'll admit it's barely beautiful. Within which context, I guess the article's lawyering for the relative beauty of a matrix was a success, but I always liked them better than calculus or group theory anyway.
The way I've always explained this to non-algebra people is to imagine driving in a city downtown. If you're at an intersection and you turn right, then left at the next intersection, you'll end up at a completely different spot than if you were to instead turn left and then right.
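The driving picture can be written down with 3x3 homogeneous matrices, as a rough sketch. The conventions are assumptions made for the example: the car starts at the origin facing north (+y), a "move" means turn 90 degrees and then drive one block forward, and successive moves compose by multiplying on the right because each is expressed in the car's own frame.

```python
def matmul(A, B):
    """General matrix product on nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# 90-degree turns (rotation part) and a one-block drive along the car's forward axis.
TURN_RIGHT = [[0, 1, 0], [-1, 0, 0], [0, 0, 1]]
TURN_LEFT = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]
DRIVE = [[1, 0, 0], [0, 1, 1], [0, 0, 1]]

RIGHT_MOVE = matmul(TURN_RIGHT, DRIVE)   # turn right, then drive one block
LEFT_MOVE = matmul(TURN_LEFT, DRIVE)     # turn left, then drive one block

def position(pose):
    """The translation column of a pose matrix is where the car ends up."""
    return (pose[0][2], pose[1][2])

right_then_left = position(matmul(RIGHT_MOVE, LEFT_MOVE))
left_then_right = position(matmul(LEFT_MOVE, RIGHT_MOVE))
```

Right-then-left ends at (1, 1), one block east and one north, while left-then-right ends at (-1, 1). The two products differ, which is exactly the statement that these matrices do not commute.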
Hadamard product/elementwise multiplication is also commutative.
I've mentioned it before, but I'd love for sparse operations to be more widespread in HPC hardware and software.
I'm intrigued. How would a white Malcolm Gladwell's quotes differ from the IRL Malcolm Gladwell?
Of course that doesn't really make any sense at the matrix level. And (from what I understand) techniques like MoE move in that direction. So the criticism doesn't really make sense anymore, except in that brains are still much much more efficient than LLMs so we know that we could do better.
All such things are like this.
For me, this is fascinating, mind-boggling, non-sensical, and unsurprising, all at once.
But I wouldn’t call it inelegant.
I seriously doubt that was ever true, except perhaps for a very brief time in the 1950s or 60s.
Linear programming is an incredibly niche application of computing used so infrequently that I've never seen it utilised anywhere despite being a consultant that has visited hundreds of varied customers including big business.
It's like Wolfram Mathematica. I learned to use it in University, I became proficient at it, and I've used it about once a decade "in industry" because most jobs are targeted at the median worker. The median worker is practically speaking innumerate, unable to read a graph, understand a curve fit, or if they do, their knowledge won't extend to confidence intervals or non-linear fits such as log-log graphs.
Teachers that are exposed to the same curriculum year after year, seeing the same topic over and over assume that industry must be the same as their lived experience. I've lost count of the number of papers I've seen about Voronoi diagrams or Delaunay triangulations, neither of which I've ever seen applied anywhere outside of a tertiary education setting. I mean, seriously, who uses this stuff!?
In the networking course in my computer science degree I had to use matrix exponentiation to calculate the maximum throughput of an arbitrary network topology. If I were to even suggest something like this at any customer, even those spending millions on their core network infrastructure, I would be either laughed at openly, or their staff would gape at me in wide-eyed horror and back away slowly.
I’m not saying these things have zero utility, it’s just that they’re used far less frequently in industry than academics imagine.
The triangulations you mention are important in the current group I'm working in.
PS: My point is not that these things are never used, they clearly are, I'm saying that the majority of CPU cycles globally goes towards "idle", then pushing pixels around with simple bitblt-like algorithms for 2D graphics, then whatever it is that browsers do on the inside, then operating system internals, and then specialised and more interesting algorithms like Linear Programming are a vanishingly small slice of whatever is left of that pie chart.
Part of the reason why linear programming does not get used as often is that there is no industry-standard software implementation that is not atrociously priced. Same deal with Mathematica.
Linear transformations are a beautiful thing, but matrices are an ugly representation that nevertheless is a convenient one when we actually want to compute.
Elegant territory. Inelegant, brute-force, number crunching map.
You can have very beautiful algorithms when you assume the matrices involved have a certain structure. You can even have that A*B == B*A, if A and B have a certain structure.
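Two standard examples of such structure are diagonal matrices and circulant matrices: any two of the same size commute. The sketch below checks this on small integer matrices and contrasts it with a generic pair that does not commute. The specific entries are arbitrary illustrations.

```python
def matmul(A, B):
    """General matrix product on nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def circulant(c):
    """Circulant matrix: each row is the previous one rotated by one step."""
    n = len(c)
    return [[c[(i - j) % n] for j in range(n)] for i in range(n)]

# Circulant matrices are polynomials in the same cyclic shift, so they commute.
C1 = circulant([1, 2, 3, 0])
C2 = circulant([0, 1, 0, 5])
circulants_commute = matmul(C1, C2) == matmul(C2, C1)

# Diagonal matrices commute as well.
D1 = [[2, 0], [0, 3]]
D2 = [[5, 0], [0, 7]]
diagonals_commute = matmul(D1, D2) == matmul(D2, D1)

# A generic pair usually does not.
P = [[1, 2], [3, 4]]
Q = [[0, 1], [1, 0]]
generic_commute = matmul(P, Q) == matmul(Q, P)
```

The circulant case connects back to the Fourier discussion: circulant matrices are all diagonalized by the DFT, which is one reason they multiply so cheaply.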
I know linear algebra, but this part seems profoundly unclear. What does "send" mean? Following with different examples in 2 by 2 notation only makes it worse. It seems like you're changing referents constantly.
In US schools during K-12, we generally learn functions in two ways:
1. 2-d line chart with an x-axis and y-axis, like temperature over time, history of stock price, etc. Classic independent variable is on the horizontal axis, dependent variable is on the vertical axis. And even people who have forgotten almost all math can instantly understand the graphics displayed when they're watching CNBC or a TV weather report.
2. We also think of functions like little machines that do things for us. E.g., y = f(x) means that f() is like a black box. We give the black box input 'x'; then the black box f() returns output 'y'. (Obviously very relevant to the life of programmers.)
But one of 3Blue1Brown's excellent videos finally showed me at least a few more ways of thinking of functions. This is where a function acts as a map from one "thing" to another thing (technically from Domain X to Co-Domain Y).
So if we think of NVIDIA stock price over time (Interpretation 1) as a graph, it's not just a picture that goes up and to the right. It's mapping each point in time on the x-axis to a price on the y-axis, sure! Let's use the example, x=November 21, 2025 maps to y=$178/share. Of course, interpretation 2 might say that the black box of the function takes in "November 21, 2025" as input and returns "$178" as output.
But what I call Interpretation 3 says that the function maps from the domain of Time to the co-domain of NVDA Stock Price.
3. This is a 1D to 1D mapping. aka, both x and y are scalar values. In the language that jamespropp used, we send the value "November 21, 2025" to the value "$178".
But we need not restrict ourselves to a 1-dimensional input domain (time) and a 1-dimensional output domain (price).
We could map from a 2-d Domain X to another 2-d Co-Domain Y. For example X could be 2-d geographical coordinates. And Y could be 2-d wind vector.
So we would feed in, say, location (5,4) as input, and our 2D-to-2D function would output the wind vector (North by 2mph, East by 7mph).
So we are "sending" input (5,4) in the first 2d plane to output (+2,+7) in the second 2d plane.
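For a linear 2D-to-2D map, "sending" is just a matrix-vector product. In the sketch below the matrix `M` is hypothetical, picked only so that it sends (5, 4) to (2, 7) as in the example above. A real wind field would not generally be linear, so this shows the idea of a matrix as a map rather than a model of wind.

```python
def send(M, point):
    """Apply the 2x2 matrix M to a 2D point: where does M 'send' it?"""
    x, y = point
    return (M[0][0] * x + M[0][1] * y,
            M[1][0] * x + M[1][1] * y)

# Hypothetical matrix chosen so that (5, 4) lands on (2, 7).
M = [[2, -2],
     [-1, 3]]

wind = send(M, (5, 4))      # (2*5 - 2*4, -1*5 + 3*4) = (2, 7)

# The columns of M record where the two basis vectors are sent.
e1_image = send(M, (1, 0))  # first column: (2, -1)
e2_image = send(M, (0, 1))  # second column: (-2, 3)
```

Knowing where the two basis vectors go determines where every other point goes, since (5, 4) is 5 copies of the first basis vector plus 4 of the second.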
Perfect! Overall, it's much better (though I really just meant that "send" was vague, which "change" improves)!
There are many complications arising from such a thing existing, and from what was needed to bring it into existence (and at the cost of whom), I'll never deny that. I just can't comprehend how someone can find the technical aspects repulsive in isolation.
It feels a lot like trying to convince someone that nuclear weapons are bad by defending that splitting an atom is akin to banging a rock against a coconut to split it in two.
Using matrix multiplication is also ugly when it's literally millions of times less efficient than a proper solution.
I think a lot of issues arise from using analogies. Another one is complex numbers as 2D vectors. It's an OK analogy, except complex numbers can be multiplied whereas 2D coordinates cannot. Your weird new non-vectors are now spinning and people are left confused.
That's because we don't fully understand what a number is and what a multiplication is. We defined -x and 1/x as inverses (additive and multiplicative), but what is -1/x ? Let's consider them as operations. Apply any one of them on any other of them, you get the third one. Thus they occupy peer status. But we hardly ever talked about -1/x.
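The "apply any one to any other and you get the third" claim is easy to spot-check with exact rational arithmetic. The sketch below tests it on a handful of nonzero sample values, which is a check rather than a proof. Together with the identity map, these three operations form what is known as the Klein four-group.

```python
from fractions import Fraction

def neg(x):
    return -x

def inv(x):
    return 1 / x

def neg_inv(x):
    return -1 / x

ops = {"-x": neg, "1/x": inv, "-1/x": neg_inv}
samples = [Fraction(3, 7), Fraction(-5, 2), Fraction(9), Fraction(-1, 11)]

def same(f, g):
    """True if f and g agree on every sample value."""
    return all(f(x) == g(x) for x in samples)

def third(a, b):
    """Name of the one operation that is neither a nor b."""
    (name,) = set(ops) - {a, b}
    return name

# Composing any two distinct operations, in either order, gives the third.
pairs_ok = all(
    same(lambda x, a=a, b=b: ops[a](ops[b](x)), ops[third(a, b)])
    for a in ops for b in ops if a != b
)

# Each operation undoes itself when applied twice.
involutions_ok = all(same(lambda x, f=f: f(f(x)), lambda x: x) for f in ops.values())
```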
The mathematical inquisition is in its infancy.
It feels like Linear Algebra tries to get at the heart of this generality but the structure and operator is more constrained than it ultimately could be. It's a small oddball computational device that can be tersely written into papers and widely understood. I always find pseudocode easier to follow and reason about but that's my particular bias.
A matrix is just one way to organize data. When linear operators are organized this way, composition of linear operators maps to matrix multiplication.
But that is just one of the ways that multiplication may be defined on matrices; Hadamard products, tensor products, and Khatri-Rao products are some other examples. They all correspond to different mathematical structures one wants to explore or use. If linear algebraic structures are what one wants to explore or use, then one gets matrix multiplication.
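A small sketch of the distinction, with arbitrary example matrices: the ordinary matrix product is the one that matches "apply B, then apply A", the Hadamard product is elementwise and commutative but is not composition, and the Kronecker (tensor) product builds a larger matrix. Helper names are illustrative.

```python
def matmul(A, B):
    """Ordinary matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def hadamard(A, B):
    """Elementwise product of two same-shaped matrices."""
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def kronecker(A, B):
    """Kronecker product: every entry of A scales a full copy of B."""
    return [[a * b for a in ra for b in rb] for ra in A for rb in B]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
v = [5, 7]

composed = matvec(A, matvec(B, v))       # apply B, then A
via_matmul = matvec(matmul(A, B), v)     # apply the single matrix A@B
via_hadamard = matvec(hadamard(A, B), v)

hadamard_commutes = hadamard(A, B) == hadamard(B, A)
kron_shape = (len(kronecker(A, B)), len(kronecker(A, B)[0]))
```

Here `composed` and `via_matmul` agree at `[17, 41]`, while `via_hadamard` is `[14, 15]`: only the ordinary product tracks composition of the underlying linear maps.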