> But matrix multiplication, to which our civilization is now devoting so many of its marginal resources, has all the elegance of a man hammering a nail into a board.
is the most interesting one.
A man hammering a nail into a board can be both beautiful and elegant! If you've ever seen someone effortlessly hammer nail after nail into wood, hardly having to think about what they're doing, you've seen a master craftsman at work. Speaking as a numerical analyst, I'd say a well-multiplied matrix is much the same. There is a lot that goes into how deftly a matrix might be multiplied. And just as someone can hammer a nail poorly, so too can a matrix be multiplied poorly. I would say the matrices being multiplied in service of training LLMs are not a particularly beautiful example of what matrix multiplication has to offer. The fast Fourier transform, viewed as a sparse matrix factorization of the DFT, with its concomitant numerical stability, might be a better candidate.
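To make that concrete, here's a minimal sketch (my own illustration, not anything from the article) of the 4-point DFT factored into a permutation, a block-diagonal pair of 2-point DFTs, and a sparse "butterfly" stage, which is exactly the radix-2 Cooley-Tukey idea:

    import numpy as np

    # 4-point DFT matrix, F4[j, k] = w**(j*k) with w = exp(-2*pi*i/4)
    w = np.exp(-2j * np.pi / 4)
    F4 = np.array([[w**(j * k) for k in range(4)] for j in range(4)])

    # Even/odd permutation: (x0, x1, x2, x3) -> (x0, x2, x1, x3)
    P = np.eye(4)[[0, 2, 1, 3]]

    # Block-diagonal stage: two independent 2-point DFTs
    F2 = np.array([[1, 1], [1, -1]])
    Z = np.zeros((2, 2))
    blockdiag = np.block([[F2, Z], [Z, F2]])

    # Sparse "butterfly" stage recombining the halves with the twiddle factor w
    butterfly = np.array([[1, 0, 1, 0],
                          [0, 1, 0, w],
                          [1, 0, -1, 0],
                          [0, 1, 0, -w]])

    # One dense matmul becomes a product of three sparse factors
    assert np.allclose(F4, butterfly @ blockdiag @ P)

Each factor has O(N) nonzeros, and applying the same split recursively is what takes you from N^2 down to N log N multiplications.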
Generally, low-rank and block-diagonal matrices are both great strategies for producing expressive matmuls with fewer parameters. We can view the FFT as a particularly deft example of factorizing one big matmul into a number of block-diagonal matmuls, greatly reducing the overall number of multiplications by minimizing the block size. However, on a GPU or TPU we have a lot more parallelism available, so the sweet spot for block size may be larger than 2x2...
We can also mix low-rank, block diagonal, and residual connections to get the best of both worlds:
x' = (L@x + B@x + x)
The block-diagonal matrix does 'local' work, and the low-rank matrix does 'broadcast' work. I find it pretty typical to be able to replace a single dense matmul with this kind of structure and save ~90% of the params with no quality cost... (and sometimes the regularization actually helps!)
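A rough numpy sketch of what I mean (the sizes are made up for illustration):

    import numpy as np

    d, r, b = 512, 8, 32                      # hidden size, low-rank width, block size (illustrative)
    rng = np.random.default_rng(0)

    U = rng.normal(size=(d, r))               # low-rank factor: L = U @ V
    V = rng.normal(size=(r, d))
    blocks = rng.normal(size=(d // b, b, b))  # block-diagonal B, stored as d/b small blocks

    def structured_layer(x):
        broadcast = U @ (V @ x)                                 # low-rank 'broadcast' work
        local = (blocks @ x.reshape(d // b, b, 1)).reshape(d)   # block-diagonal 'local' work
        return broadcast + local + x                            # residual connection

    dense_params = d * d
    structured_params = 2 * d * r + (d // b) * b * b
    print(structured_params / dense_params)   # ~0.09, i.e. roughly 90% fewer params

    y = structured_layer(rng.normal(size=d))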
Riffing further on the Fourier connection: are you planning to explore the link between matmul and differentiation?
Using the "Pauli-Z" matrix that you introduced without a straightforward motivation, eg.
(I took it that you intended it to be a "backyard instance" of "dual numbers")
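For what it's worth, the way I'd make the matmul/differentiation link concrete is the nilpotent 2x2 representation of dual numbers; this is my reading of the connection, not necessarily what the article meant by the Pauli-Z matrix:

    import numpy as np

    # Represent the dual number a + b*eps as the 2x2 matrix [[a, b], [0, a]].
    # The generator N = [[0, 1], [0, 0]] squares to zero, so matrix products
    # automatically apply the product rule to the top-right entry.
    def dual(a, b=1.0):
        return np.array([[a, b], [0.0, a]])

    x0 = 3.0
    M = dual(x0)          # plays the role of "x0 + eps"
    I = np.eye(2)

    # Evaluate p(x) = x**3 - 2x + 1 using nothing but matmuls and additions
    P = M @ M @ M - 2 * M + I

    print(P[0, 0])   # p(x0)  = 22.0
    print(P[0, 1])   # p'(x0) = 3*x0**2 - 2 = 25.0

That's forward-mode automatic differentiation falling out of matrix multiplication.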
Elegance is a silly critique. Imagine instead we were spending trillions on floral bouquets, calligraphy, and porcelain tea sets. I would argue that would be a bad allocation of resources.
What matters to me is whether it solves the problems we have. Not how elegant we are in doing so. And to the extent AI fails to do that, I think those are valid critiques. Not how elegant it is.
But elegant can mean minimal, restrained, parsimonious, sparing. That's different from a bunch of paraphernalia and flowery nonsense.
Maybe we need a word that, when applied to mathematical concepts, describes how simple, easy to understand and generally useful a solution or idea is.
I wonder what that word could be.
Lol, did you think this was clever? You just literally reiterated exactly what I said. See, if you had said "there are many pianists that find beauty in math" - you know, like how many mathematicians find beauty in piano concertos - then you'd have me.
But I'll admit it's barely beautiful. Within which context, I guess the article's lawyering for the relative beauty of a matrix was a success, but I always liked matrices better than calculus or group theory anyway.
That's because we don't fully understand what a number is and what multiplication is. We defined -x and 1/x as inverses (additive and multiplicative), but what is -1/x? Consider them as operations: apply any one of them to either of the others and you get the third one. Thus they occupy peer status. Yet we hardly ever talk about -1/x.
The mathematical inquisition is in its infancy.
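A quick sanity check of that closure claim; together with the identity, these maps form a Klein four-group under composition:

    from fractions import Fraction

    neg = lambda x: -x            # additive inverse
    inv = lambda x: 1 / x         # multiplicative inverse
    neg_inv = lambda x: -1 / x    # the neglected third operation

    x = Fraction(7, 3)            # any nonzero value works

    # Composing any two distinct operations yields the third
    assert neg(inv(x)) == neg_inv(x)
    assert neg(neg_inv(x)) == inv(x)
    assert inv(neg_inv(x)) == neg(x)

    # And each one is an involution (its own inverse)
    assert neg(neg(x)) == x and inv(inv(x)) == x and neg_inv(neg_inv(x)) == x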
It feels like linear algebra tries to get at the heart of this generality, but the structure and operators are more constrained than they ultimately could be. It's a small, oddball computational device that can be tersely written into papers and widely understood. I always find pseudocode easier to follow and reason about, but that's my particular bias.
I've mentioned it before, but I'd love for sparse operations to be more widespread in HPC hardware and software.
Linear transformations are a beautiful thing, but matrices are an ugly representation that nevertheless is a convenient one when we actually want to compute.
Elegant territory. Inelegant, brute-force, number crunching map.
All such things are like this.
For me, this is fascinating, mind-boggling, nonsensical, and unsurprising, all at once.
But I wouldn’t call it inelegant.
I seriously doubt that was ever true, except perhaps for a very brief time in the 1950s or 60s.
Linear programming is an incredibly niche application of computing, used so infrequently that I've never seen it utilised anywhere, despite being a consultant who has visited hundreds of varied customers, including big business.
It's like Wolfram Mathematica. I learned to use it in university, became proficient at it, and I've used it about once a decade "in industry", because most jobs are targeted at the median worker. The median worker is, practically speaking, innumerate: unable to read a graph or understand a curve fit, and even when they can, their knowledge won't extend to confidence intervals or non-linear fits such as log-log graphs.
Teachers who are exposed to the same curriculum year after year, seeing the same topics over and over, assume that industry must be the same as their lived experience. I've lost count of the number of papers I've seen about Voronoi diagrams or Delaunay triangulations, neither of which I've ever seen applied anywhere outside of a tertiary education setting. I mean, seriously, who uses this stuff!?
In the networking course in my computer science degree I had to use matrix exponentiation to calculate the maximum throughput of an arbitrary network topology. If I were to even suggest something like this at any customer, even those spending millions on their core network infrastructure, I would be either laughed at openly, or their staff would gape at me in wide-eyed horror and back away slowly.
I’m not saying these things have zero utility, it’s just that they’re used far less frequently in industry than academics imagine.
Of course that doesn't really make any sense at the matrix level. And (from what I understand) techniques like MoE move in that direction. So the criticism doesn't really make sense anymore, except in that brains are still much, much more efficient than LLMs, so we know that we could do better.
You can have very beautiful algorithms when you assume the matrices involved have a certain structure. You can even have that A*B == B*A, if A and B have a certain structure.
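Circulant matrices are one concrete example (my illustration, not from the article): they are all polynomials in the same cyclic-shift matrix, so any two of them commute:

    import numpy as np
    from scipy.linalg import circulant

    # Each column of a circulant matrix is a cyclic shift of the first column
    A = circulant([1.0, 2.0, 3.0, 4.0])
    B = circulant([5.0, -1.0, 0.0, 2.0])

    # Both are diagonalized by the same Fourier basis, so A*B == B*A
    assert np.allclose(A @ B, B @ A)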
I know linear algebra, but this part seems profoundly unclear. What does "send" mean? Following with different examples in 2 by 2 notation only makes it worse. It seems like you're changing referents constantly.
In US schools during K-12, we generally learn functions in two ways:
1. A 2-d line chart with an x-axis and y-axis, like temperature over time, the history of a stock price, etc. The classic setup: independent variable on the horizontal axis, dependent variable on the vertical axis. Even people who have forgotten almost all math can instantly understand the graphics displayed when they're watching CNBC or a TV weather report.
2. We also think of functions like little machines that do things for us. E.g., y = f(x) means that f() is like a black box. We give the black box input 'x'; then the black box f() returns output 'y'. (Obviously very relevant to the life of programmers.)
But one of 3Blue1Brown's excellent videos finally showed me at least a few more ways of thinking about functions. This is where a function acts as a map from one "thing" to another (technically from Domain X to Co-Domain Y).
So if we think of NVIDIA's stock price over time (Interpretation 1) as a graph, it's not just a picture that goes up and to the right. It's mapping each point in time on the x-axis to a price on the y-axis, sure! Take the example: x = November 21, 2025 maps to y = $178/share. Of course, Interpretation 2 might say that the black box of the function takes in "November 21, 2025" as input and returns "$178" as output.
But what I call Interpretation 3 maps from the domain of Time to the co-domain of NVDA Stock Price.
This is a 1-d to 1-d mapping, i.e., both x and y are scalar values. In the language that jamespropp used, we "send" the value "November 21, 2025" to the value "$178".
But we need not restrict ourselves to a 1-dimensional input domain (time) and a 1-dimensional output domain (price).
We could map from a 2-d Domain X to another 2-d Co-Domain Y. For example, X could be 2-d geographical coordinates, and Y could be a 2-d wind vector.
So we would feed in, say, the location (5,4), and our 2-d-to-2-d function would output the wind vector (North by 2 mph, East by 7 mph).
So we are "sending" the input (5,4) in the first 2-d plane to the output (+2,+7) in the second 2-d plane.
There are many complications arising from such a thing existing, and from what was needed to bring it into existence (and at the cost of whom), I'll never deny that. I just can't comprehend how someone can find the technical aspects repulsive in isolation.
It feels a lot like trying to convince someone that nuclear weapons are bad by arguing that splitting an atom is akin to banging a rock against a coconut to split it in two.
The way I've always explained this to non-algebra people is to imagine driving in a city downtown. If you're at an intersection and you turn right, then left at the next intersection, you'll end up at a completely different spot than if you were to instead turn left and then right.
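The matrix version of the same story (a quick illustration): a quarter-turn followed by a reflection lands you somewhere different than the reflection followed by the quarter-turn:

    import numpy as np

    turn = np.array([[0, -1],     # rotate 90 degrees counterclockwise
                     [1,  0]])
    flip = np.array([[1,  0],     # reflect across the x-axis
                     [0, -1]])

    p = np.array([1, 0])
    print(flip @ (turn @ p))   # turn first, then flip: [ 0 -1]
    print(turn @ (flip @ p))   # flip first, then turn: [0 1]

    assert not np.allclose(flip @ turn, turn @ flip)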
Using matrix multiplication is also ugly when it's literally millions of times less efficient than a proper solution.