24 points by ipnon 14 hours ago | 1 comment
  • Alifatisk 11 hours ago
    So if I get this right, all transformers to date have had the same residual design: one stream carrying information between layers. DeepSeek figured out how to widen it without training collapsing. Wow, incredible work, DeepSeek!
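    Roughly the picture, if it helps: a minimal PyTorch sketch of the general hyper-connections idea, where the single residual stream is replaced by several parallel streams that the network learns to read from, write to, and mix. The read/write/mix parameters and n_streams=4 are my own illustrative names and defaults, not DeepSeek's exact mHC formulation.

      import torch
      import torch.nn as nn

      # Standard pre-norm block: one residual stream, h_{l+1} = h_l + F(norm(h_l)).
      class ResidualBlock(nn.Module):
          def __init__(self, d_model):
              super().__init__()
              self.norm = nn.LayerNorm(d_model)
              self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                      nn.Linear(4 * d_model, d_model))

          def forward(self, h):
              return h + self.ff(self.norm(h))

      # "Widened" residual in the hyper-connections spirit: n parallel streams,
      # plus small learnable weights that read a block input out of them, write
      # the block output back into them, and mix the streams between layers.
      class WidenedResidualBlock(nn.Module):
          def __init__(self, d_model, n_streams=4):
              super().__init__()
              self.norm = nn.LayerNorm(d_model)
              self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                      nn.Linear(4 * d_model, d_model))
              self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
              self.write = nn.Parameter(torch.ones(n_streams))
              self.mix = nn.Parameter(torch.eye(n_streams))

          def forward(self, streams):  # streams: (n_streams, batch, seq, d_model)
              block_in = torch.einsum('n,nbsd->bsd', self.read, streams)
              block_out = self.ff(self.norm(block_in))
              mixed = torch.einsum('nm,mbsd->nbsd', self.mix, streams)
              return mixed + self.write.view(-1, 1, 1, 1) * block_out

    The extra parameters are only a handful of scalars per layer per stream, which is presumably why the overhead stays small.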
    • karmakaze 4 hours ago
      I saw this topic in my YouTube feed (YTers are fast). Looking for a bit more info for laypeople, I found this[0].

      [0] https://www.toolmesh.ai/news/deepseek-mhc-architecture-ai-pe...

    • rvz 9 hours ago
      Yes. This is the first general improvement to the residual design in deep neural networks in a long time, and it also improves on training LLMs with hyper-connections (HC) at large scale compared with the standard HC architecture.

      So far they have tested this on training 27B models, with only a tiny overhead, and it shows fewer "exploding" signals than the other approaches and the baseline (see the toy sketch at the end of this comment). Would be interesting to see results from 100B+ parameter models.

      This should be recommended reading for anyone interested in the micro-design changes that took us from the days of residual networks (ResNet) to Manifold-Constrained Hyper-Connections (mHC).

      Instead of just throwing more GPUs + money + parameters + data at the problem.
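
      To make the "exploding" signals point concrete, here is a toy numerical sketch: repeatedly mixing the widened residual streams with unconstrained matrices lets their norm compound across depth, whereas constraining each mixing matrix to something better behaved keeps it bounded. I use a row-stochastic (softmax-normalized) matrix purely as the simplest norm-friendly constraint I could write down; it illustrates the general manifold-constraint idea, not necessarily the exact projection mHC uses.

        import torch

        # Toy experiment: norm of 4 residual streams after repeated mixing.
        torch.manual_seed(0)
        n_streams, depth = 4, 32
        x = torch.randn(n_streams, 512)      # 4 streams, toy width 512

        unconstrained = x.clone()
        constrained = x.clone()
        for _ in range(depth):
            m = torch.eye(n_streams) + 0.5 * torch.randn(n_streams, n_streams)
            unconstrained = m @ unconstrained                      # free mixing: norms compound
            constrained = torch.softmax(m, dim=-1) @ constrained   # convex mixing: norms stay bounded

        print(f"unconstrained norm after {depth} layers: {unconstrained.norm():.3e}")
        print(f"constrained norm after {depth} layers:   {constrained.norm():.3e}")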