  • brunoalano 5 hours ago
    I've been experimenting with Muon on GNNs to see whether orthogonalizing updates helps with the usual depth problems.
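
    For context, the core of Muon is replacing each 2-D weight's momentum update with an approximation of U V^T from its SVD, computed via a few quintic Newton-Schulz iterations. A minimal PyTorch sketch (function name is mine; the coefficients are from Keller Jordan's reference implementation, which also runs this in bfloat16 for speed):

      import torch

      def orthogonalize(M, steps=5, eps=1e-7):
          # Approximate U @ V.T from the SVD of M with a quintic
          # Newton-Schulz iteration (coefficients from the Muon
          # reference implementation).
          a, b, c = 3.4445, -4.7750, 2.0315
          X = M / (M.norm() + eps)      # scale so singular values are <= 1
          transposed = M.size(0) > M.size(1)
          if transposed:
              X = X.T                   # iterate on the wide orientation
          for _ in range(steps):
              A = X @ X.T
              X = a * X + (b * A + c * A @ A) @ X
          return X.T if transposed else X

    Muon applies this only to matrix-shaped parameters; embeddings, biases, and other 1-D params typically stay on AdamW.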

    In my runs, Muon looked mostly on par with AdamW in the shallow 2-layer setting. The more interesting case was moderate depth: around 8 layers, Muon was noticeably more stable and gave better final results. I also saw a fairly large robustness gap, in Muon's favor, under feature noise and edge dropout.
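
    To be concrete about those perturbations, the evaluation is along these lines (helper name and default values are illustrative, not my exact settings):

      import torch

      def perturb(x, edge_index, noise_std=0.1, drop_p=0.2):
          # Gaussian feature noise plus random edge dropout; the
          # names and defaults here are illustrative only.
          x_noisy = x + noise_std * torch.randn_like(x)
          keep = torch.rand(edge_index.size(1),
                            device=edge_index.device) >= drop_p
          return x_noisy, edge_index[:, keep]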

    The writeup focuses on the spectral side of the story: singular values, conditioning, and why the effect seems to show up more in deeper message-passing stacks than in the standard shallow benchmark regime.
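
    For anyone curious what "spectral side" means concretely: the probe is basically the singular-value spectrum of each layer's weight matrix over training, along the lines of (helper name is mine):

      import torch

      @torch.no_grad()
      def weight_spectra(model):
          # Singular values and condition number of every 2-D weight;
          # crude, but enough to watch conditioning drift over training.
          out = {}
          for name, p in model.named_parameters():
              if p.ndim == 2:
                  s = torch.linalg.svdvals(p.float())  # descending order
                  out[name] = {
                      "sigma_max": s[0].item(),
                      "sigma_min": s[-1].item(),
                      "cond": (s[0] / s[-1].clamp_min(1e-12)).item(),
                  }
          return out

    The hypothesis spelled out in the writeup is that orthogonalized updates keep these per-layer spectra better conditioned, which matters more as the message-passing stack gets deeper.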

    I included the negative results too: Muon is slower per epoch, it doesn't win everywhere, and at very large depths the optimizer alone isn't enough.