So the first major paper of 2026, DeepSeek mHC: Manifold-Constrained Hyper-Connections.
This is actually an engineering paper, taking as a starting point ideas already laid out in the original Hyper-Connections (HC) paper from ByteDance, which is consequently a prerequisite for reading. So initial notes on this first.
As a preamble, HC surprisingly intersects with two major open questions that have been on my mind ever since SYNTH:
1) Reasoning capacities seem to emerge from depth, so indirectly from better layer combinations. This is especially striking for math, and Circuit Transformers already suggest that models perform formal operations at this sub-token level. Drafts just wrap this process through another time. But then, how can we build more optimal layer combinations/assignments? This becomes even more critical as we scale depth (or nest it through MoE): it’s known through interpretability studies that layers are largely redundant.
2) Synthetic data has become the most efficient way to train models, mostly as we delegate “training” to the data shape. Paraphrasing is literally a way to extrapolate the memorization process in the transformer world, as we create endless variations of the same knowledge components. If training were really optimized, this should be mostly internalized. So how can we build efficient training?
It’s not surprising that hyper-connections are immediately associated with Muon. The general idea is similar: make better training updates. Yet, there is a major difference: hyper-connections are a low-level change, transforming a decade-old piece of deep learning infra, the residual connection around the layer function F, and making it trainable.
Current normalization approaches scale well and yet result in "representation collapse", "where hidden features in deeper layers become highly similar, diminishing the contribution of additional layers as their number increases." To address this, hyper-connections introduce entirely new learnable connection weights for "depth-connections and width-connections". In theory "learning the hyper-connection matrix in various forms can create layer arrangements that surpass traditional sequential and parallel configurations, resulting in a soft-mixture or even dynamic arrangement".
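To make the mechanism concrete, here is a minimal sketch of one static hyper-connection step next to a classic residual step. It is my own simplification, not code from either paper: the names `layer_fn`, `h_pre`, `h_post` and `h_res` are hypothetical, the weights are fixed rather than learned or input-dependent, and the layer is a toy stand-in.

```python
# Illustrative sketch only: names and shapes are assumptions, not the papers' code.

def layer_fn(x):
    """Toy stand-in for a transformer block F (attention or MLP)."""
    return [0.5 * v for v in x]

def residual_step(x):
    """Classic residual connection: x + F(x)."""
    return [a + b for a, b in zip(x, layer_fn(x))]

def hyper_connection_step(streams, h_pre, h_post, h_res):
    """One simplified static hyper-connection step over n residual streams.

    streams: n vectors of width d (the widened residual stream)
    h_pre:   n weights that mix the streams into the single layer input
    h_post:  n weights that write the layer output back into each stream
    h_res:   n x n matrix that mixes the streams with each other
    """
    n, d = len(streams), len(streams[0])
    # Read: a weighted sum of the streams becomes the layer input.
    layer_in = [sum(h_pre[i] * streams[i][k] for i in range(n)) for k in range(d)]
    layer_out = layer_fn(layer_in)
    new_streams = []
    for i in range(n):
        # Mix the streams with each other, then add the weighted layer output.
        mixed = [sum(h_res[i][j] * streams[j][k] for j in range(n)) for k in range(d)]
        new_streams.append([mixed[k] + h_post[i] * layer_out[k] for k in range(d)])
    return new_streams
```

With a single stream and all weights set to 1, the step reduces to the usual `x + F(x)`, which is the sense in which the residual connection becomes a trainable generalization of itself.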
The original HC paper does manage to retrain a small Olmo-MoE and demonstrate it "converges 1.8 times faster and shows an improvement of 6 points on ARC-Challenge compared to the baseline trained with 500 B tokens". Layer interpretability suggests that "the baseline tends toward representation collapse", while the HC variant "exhibits significantly lower similarity between features".
The DeepSeek paper starts almost in medias res and first underlines a major success of the original HC approach: the increase in math/topological complexity did not result in computational overhead. Yet, does it scale?
Moving beyond small models, there are two major issues: "as the training scale increases, HC introduces potential risks of instability" and "the hardware efficiency concerning memory access costs for the widened residual stream remains unaddressed in the original design". Concretely, naively scaling up HC results in an "unexpected loss surge around the 12k step, which is highly correlated with the instability in the gradient norm".
Consequently DeepSeek proposes their own variant, Manifold-Constrained Hyper-Connections (mHC). As the name implies, it restricts the learnable connections, preventing deviations from the identity mapping, and "effectively constrains the residual connection matrices within the manifold that is constituted by doubly stochastic matrices".
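For intuition on what "doubly stochastic" means in practice: every entry is non-negative and every row and column sums to 1. One standard way to push an unconstrained matrix toward that set is Sinkhorn-Knopp normalization, sketched below. The function name, the exponential parametrization and the iteration count are my assumptions for illustration, not a reproduction of the paper's kernels.

```python
import math

def sinkhorn_knopp(logits, n_iters=20):
    """Map an unconstrained square matrix to an approximately doubly stochastic one."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(n_iters):
        # Normalize each row to sum to 1.
        row_sums = [sum(row) for row in m]
        m = [[m[i][j] / row_sums[i] for j in range(n)] for i in range(n)]
        # Normalize each column to sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m
```

The identity matrix is itself doubly stochastic, and the product of two doubly stochastic matrices is again doubly stochastic, so stacking many such stream-mixing matrices across depth keeps the composite mapping inside the same set instead of letting it drift or blow up.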
The math part (4.1 & 4.2) is very elegant, but clearly not the hardest part. The actual core of the paper is “4.3 efficient training design”, where they simply:
1) write three new mHC kernels that "employ mixed-precision strategies to maximize numerical accuracy without compromising speed, and fuse multiple operations with shared memory access into unified compute kernels to reduce memory bandwidth bottlenecks"
2) manage the substantial memory overhead by discarding "the intermediate activations of the mHC kernels after the forward pass and recompute them on-the-fly in the backward pass"
3) adapt pipeline parallelism as "mHC incurs substantial communication latency across pipeline stages". So "to prevent blocking the communication stream, we execute the Fpost,res kernels of MLP (i.e. FFN) layers on a dedicated high-priority compute stream"
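The recomputation trick in point 2 is the generic activation-checkpointing idea, which can be sketched on a scalar toy function. This is only an illustration of the principle under my own assumptions (the functions `forward_store`, `forward_discard` and `backward` are hypothetical), not DeepSeek's kernels.

```python
import math

def forward_store(x, w):
    """Forward pass that keeps the intermediate activation z for backward."""
    z = w * x
    y = math.tanh(z)
    return y, {"x": x, "w": w, "z": z}

def forward_discard(x, w):
    """Forward pass that keeps only the inputs and drops z to save memory."""
    y = math.tanh(w * x)
    return y, {"x": x, "w": w}

def backward(saved, grad_y):
    """Return (grad_x, grad_w); recompute z on the fly if it was not saved."""
    x, w = saved["x"], saved["w"]
    z = saved["z"] if "z" in saved else w * x
    grad_z = grad_y * (1.0 - math.tanh(z) ** 2)
    return grad_z * w, grad_z * x
```

Both paths give the same gradients; the second trades a little extra compute in the backward pass for not holding `z` in memory between the two passes.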
Overall the actual flex of the paper is not so much proving Hyper-Connections can work at scale. It’s: we have the internal capacity to re-engineer the complete training environment at all dimensions (kernels, memory management, inter-node communication) around highly experimental research ideas.
That’s what makes you a frontier lab.