In my runs, Muon and AdamW looked mostly similar in the shallow 2-layer setting. The more interesting case was moderate depth: around 8 layers, Muon was noticeably more stable and reached better final results. I also saw a fairly large robustness gap in Muon's favor under feature noise and edge dropout.
The writeup focuses on the spectral side of the story: singular values, conditioning, and why the effect seems to show up more in deeper message-passing stacks than in the standard shallow benchmark regime.
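As a concrete starting point, here is a minimal sketch of the kind of spectral diagnostic the writeup is about: computing the singular values and condition number of each layer's weight matrix. The layer shapes and the random matrices standing in for trained weights are hypothetical, purely for illustration.

```python
import numpy as np

def spectral_summary(W):
    """Singular-value summary for one weight matrix: largest and
    smallest singular values, plus the condition number."""
    s = np.linalg.svd(W, compute_uv=False)  # sorted descending
    return {"sigma_max": s[0], "sigma_min": s[-1], "cond": s[0] / s[-1]}

# Hypothetical 8-layer stack of random matrices standing in for
# trained message-passing weights (not real checkpoints).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)) / np.sqrt(64) for _ in range(8)]

for i, W in enumerate(layers):
    stats = spectral_summary(W)
    print(f"layer {i}: cond = {stats['cond']:.1f}")
```

Running this on checkpoints trained with each optimizer, rather than random matrices, is how the conditioning comparisons later in the writeup were produced.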
I included the negative results too: Muon is slower per epoch, it doesn’t win everywhere, and at very large depth the optimizer alone is not enough.