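Hot take: this hangs up a "manifold" sheep's head and sells operations-research dog meat. The paper's name is reverse-engineered: an engineering fix went shopping for a theoretical father.
Don't let the word "manifold" fool you. It is presented as derived from manifold theory, but I'd bet it was worked out backwards from operations research. Trying to understand it through differential geometry sends you in exactly the wrong direction.
My industrial engineering DNA just kicked in. No wonder so many people "don't get it". If you called it an assignment problem, wouldn't my IE classmates get it right away?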
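For anyone who hasn't sat through an OR course, here is a rough sketch of why "assignment problem" is the first thing that comes to mind. A one-to-one assignment of workers to jobs is a permutation matrix, every permutation matrix is doubly stochastic (nonnegative, rows and columns sum to 1), and any mix of assignments stays doubly stochastic (that is the Birkhoff-von Neumann picture). The cost table and helper names below are made up purely for illustration, and the brute-force search is only sensible for tiny n.

```python
from itertools import permutations

# Hypothetical cost table: cost[i][j] = cost of giving job j to worker i.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
n = len(cost)

def perm_matrix(p):
    """0/1 matrix with a 1 at (i, p[i]): worker i does job p[i]."""
    return [[1.0 if p[i] == j else 0.0 for j in range(n)] for i in range(n)]

def is_doubly_stochastic(m, tol=1e-9):
    """Nonnegative entries, every row and every column sums to 1."""
    nonneg = all(x >= -tol for row in m for x in row)
    rows_ok = all(abs(sum(row) - 1.0) < tol for row in m)
    cols_ok = all(abs(sum(m[i][j] for i in range(n)) - 1.0) < tol for j in range(n))
    return nonneg and rows_ok and cols_ok

def total_cost(p):
    return sum(cost[i][p[i]] for i in range(n))

# Brute-force the assignment problem over all n! one-to-one assignments.
best = min(permutations(range(n)), key=total_cost)
best_cost = total_cost(best)

# A 50/50 blend of two different assignments: no longer a permutation,
# but still doubly stochastic.
a, b = perm_matrix((0, 1, 2)), perm_matrix((1, 2, 0))
mix = [[0.5 * a[i][j] + 0.5 * b[i][j] for j in range(n)] for i in range(n)]

print(best, best_cost)
print(is_doubly_stochastic(perm_matrix(best)), is_doubly_stochastic(mix))
```

The point of the sketch is only the shape of the feasible set: hard assignments sit at the corners, soft "shuffles" sit in between. It says nothing about how the paper itself is derived.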
DeepSeek just dropped a banger paper to wrap up 2025
"mHC: Manifold-Constrained Hyper-Connections"
Hyper-Connections turn the single residual “highway” in transformers into n parallel lanes, and each layer learns how to shuffle and share signal between lanes.
But if each layer can arbitrarily amplify or shrink lanes, the product of those shuffles across depth makes signals/gradients blow up or fade out.
So they force each shuffle to be mass-conserving: a doubly stochastic matrix (nonnegative, every row/column sums to 1). Each layer can only redistribute signal across lanes, not create or destroy it, so the deep skip-path stays stable while features still mix!
With n=4 it adds ~6.7% training time, but cuts final loss by ~0.02 and keeps worst-case backward gain at ~1.6 (vs ~3000 without the constraint), with consistent benchmark wins across the board.
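A toy sketch of the "mass-conserving" point in the summary above, under loud assumptions: the quoted post does not say how the constraint is enforced, so the Sinkhorn-style normalization here (alternately rescaling rows and columns) is just one standard way to get a doubly stochastic matrix, not necessarily the paper's method. The "gain" is a crude proxy of my own (largest row sum of the product of all layer matrices), and the 10% per-layer amplification in the unconstrained case is an arbitrary illustrative number.

```python
import math
import random

def sinkhorn(logits, iters=50):
    """Push exp(logits) toward doubly stochastic by alternating row/column normalization."""
    n = len(logits)
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iters):
        m = [[x / sum(row) for x in row] for row in m]            # rows sum to 1
        col = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col[j] for j in range(n)] for i in range(n)]  # columns sum to 1
    return m

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def max_row_sum(m):
    """Crude gain proxy (an assumption, not the paper's metric)."""
    return max(sum(abs(x) for x in row) for row in m)

random.seed(0)
n, depth = 4, 60  # 4 lanes as in the post; depth is arbitrary

identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
constrained = [row[:] for row in identity]
free = [row[:] for row in identity]

for _ in range(depth):
    logits = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    shuffle = sinkhorn(logits)                          # redistributes, never creates mass
    leaky = [[1.1 * x for x in row] for row in shuffle]  # same shuffle, 10% amplification
    constrained = matmul(shuffle, constrained)
    free = matmul(leaky, free)

print("constrained gain:", max_row_sum(constrained))  # stays at about 1
print("unconstrained gain:", max_row_sum(free))       # about 1.1**depth, in the hundreds
```

The product of doubly stochastic matrices is again doubly stochastic, which is why the constrained path cannot blow up or fade out however deep the stack gets, while even a small per-layer amplification compounds geometrically.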