In my runs, Muon and AdamW looked mostly similar in the shallow 2-layer setting. The more interesting case was moderate depth: around 8 layers, Muon was noticeably more stable and reached better final results. I also saw a fairly large robustness gap in Muon's favor under feature noise and edge dropout.
The writeup focuses on the spectral side of the story: singular values, conditioning, and why the effect seems to show up more in deeper message-passing stacks than in the standard shallow benchmark regime.
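As a concrete starting point, here is a minimal sketch of the kind of spectral diagnostic the writeup is about: computing the singular values and condition number of each layer's weight matrix. The layer shapes and the random matrices standing in for trained weights are hypothetical, purely for illustration.

```python
import numpy as np

def spectral_summary(W):
    """Singular-value summary for one weight matrix: largest and
    smallest singular values, plus the condition number."""
    s = np.linalg.svd(W, compute_uv=False)  # sorted descending
    return {"sigma_max": s[0], "sigma_min": s[-1], "cond": s[0] / s[-1]}

# Hypothetical 8-layer stack of random matrices standing in for
# trained message-passing weights (not real checkpoints).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)) / np.sqrt(64) for _ in range(8)]

for i, W in enumerate(layers):
    stats = spectral_summary(W)
    print(f"layer {i}: cond = {stats['cond']:.1f}")
```

Running this on checkpoints trained with each optimizer, rather than random matrices, is how the conditioning comparisons later in the writeup were produced.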
I included the negative results too: Muon is slower per epoch, it doesn’t win everywhere, and at very large depth the optimizer alone is not enough.