[0] https://www.toolmesh.ai/news/deepseek-mhc-architecture-ai-pe...
So far they have tested this on training 27B models, where it adds a tiny overhead and produces fewer "exploding" signals than the other approaches and the baseline. It would be interesting to see results from models with more than 100B parameters.
This should be recommended reading for those interested in micro-design changes, from the days of residual networks (ResNet) to Manifold-Constrained Hyper Connections (mHC), instead of just throwing more GPUs, money, parameters, and data at the problem.