MHC: Manifold-Constrained Hyper-Connections (arxiv.org)

32 points by ipnon a month ago | 1 comment

Alifatisk a month ago
So if I get this right, all transformers until today have had the same residual design: one stream carrying information between layers. DeepSeek figured out how to widen it without training collapsing. Wow, incredible work, DeepSeek!
- rvz a month ago
  Yes. This is a general improvement to the residual design of deep neural networks, the first in a long time, and it also improves large-scale LLM training with hyper-connections (HC) compared with the standard HC architecture.
  So far they have tested this by training 27B models, with a tiny overhead and fewer "exploding" signals than the other approaches and the baseline. It would be interesting to see results from models with more than 100B parameters.
  This should be recommended reading for those interested in micro-design changes, from the days of residual networks (ResNet) to Manifold-Constrained Hyper-Connections (mHC), instead of just throwing more GPUs + money + parameters + data at the problem.
- karmakaze a month ago
  I saw this topic in my YouTube feed (YTers are fast). Looking for a bit more info for laypeople, I found this [0].
  [0] https://www.toolmesh.ai/news/deepseek-mhc-architecture-ai-pe...
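
For readers who want a feel for the "exploding signals" point raised in the thread, here is a toy sketch in plain Python. It is not the paper's implementation. It assumes, as an illustration only, that widening the residual stream means mixing several parallel streams with a matrix at every layer, and that the constraint keeps that matrix doubly stochastic (rows and columns each sum to 1), here approximated with Sinkhorn-style normalization. The names `sinkhorn`, `mix`, and `run_depth` are hypothetical helpers, and the streams are single numbers rather than vectors.

```python
import random


def sinkhorn(matrix, iters=50):
    """Alternately normalize rows and columns of a positive square matrix.

    The result is approximately doubly stochastic (an assumption about the
    kind of constraint involved, not taken from the thread).
    """
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iters):
        # Make every row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Make every column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def mix(matrix, streams):
    """One layer of stream mixing: new_i = sum_j matrix[i][j] * streams[j]."""
    return [sum(w * s for w, s in zip(row, streams)) for row in matrix]


def run_depth(matrix, streams, depth):
    """Apply the same mixing matrix `depth` times, like stacking layers."""
    for _ in range(depth):
        streams = mix(matrix, streams)
    return streams


def max_abs(values):
    return max(abs(v) for v in values)


rng = random.Random(0)
n = 4  # number of parallel residual streams (hypothetical choice)
raw = [[rng.uniform(0.5, 1.5) for _ in range(n)] for _ in range(n)]
constrained = sinkhorn(raw)
start = [1.0, 2.0, 0.5, 3.0]

unconstrained_out = run_depth(raw, start, depth=30)
constrained_out = run_depth(constrained, start, depth=30)

print("unconstrained max |signal|:", max_abs(unconstrained_out))
print("constrained   max |signal|:", max_abs(constrained_out))
print("constrained stream sum:", sum(constrained_out), "vs start:", sum(start))
```

With the unconstrained matrix the signal grows at every layer, while the normalized matrix only averages the streams, so the largest value never exceeds the largest input and the total across streams is preserved. Real hyper-connections also include the layer function and learned, input-dependent weights, which this sketch leaves out.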