Software Engineer

Joined December 2008
346 Photos and videos
Pinned Tweet
16 Apr 2014
How did increased regulation of childhood affect social and geographical mobility?
Apr 24
Muon(Clip) as used by GLM & Kimi has an RMS matching factor of 0.2 in order to reuse AdamW's optimal LR. My theory is that it actually makes Muon's momentum-dependent effective LR roughly match AdamW's (momentum-independent) LR: with mu = 0.95, ((1 + mu)/(1 - mu)) ** .5 * 0.2 = (1.95/0.05) ** .5 * 0.2 = 1.25 1/3
Apr 24
DeepSeek opted to train DeepSeek V4 with Nesterov Muon. Nesterov Muon also has a momentum-dependent eff. LR albeit more complicated. Their "RMS matching factor" is 0.18: ((1 + mu)/(1 - mu)) ** .5 * (1 + 2 * mu - 2 * mu ** 3) ** -.5 * 0.18 = 1.03 if mu = 0.95 2/3
Apr 24
So their eff. LR matching is in fact more accurate. I don't know if they got it thru trial & error or they had their own theory internally. DS V4 tech report: huggingface.co/deepseek-ai/D… My preprint (that started about WD): arxiv.org/abs/2512.08217 3/3
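The arithmetic in this thread is easy to check. A minimal sketch, taking both expressions as functions of the momentum mu and assuming mu = 0.95 for both optimizers (the function names are made up for illustration):

```python
import math

def muon_factor(mu: float) -> float:
    """sqrt((1 + mu) / (1 - mu)): the momentum-dependent factor from 1/3."""
    return math.sqrt((1 + mu) / (1 - mu))

def nesterov_muon_factor(mu: float) -> float:
    """Nesterov variant from 2/3: the factor above times (1 + 2*mu - 2*mu**3) ** -.5."""
    return muon_factor(mu) / math.sqrt(1 + 2 * mu - 2 * mu ** 3)

mu = 0.95  # assumed momentum for both cases
print(f"Muon(Clip), RMS matching factor 0.2:     {muon_factor(mu) * 0.2:.2f}")
print(f"Nesterov Muon, RMS matching factor 0.18: {nesterov_muon_factor(mu) * 0.18:.2f}")
```

With mu = 0.95 this gives 1.25 and 1.03, the two numbers quoted in the thread.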
Apr 6
So I finally got around to testing this thoroughly with Gaussian random vectors / matrices: github.com/EIFY/normalized_u… The prediction works great for normalized vector updates and Muon / Scion spectral ortho. updates with an n=4m matrix, but not so much for the steady-state norm when n=m. Why? 🤔
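For a feel of what such a test looks like, here is a minimal sketch of the simplest case only: a momentum-free, unit-norm random vector update with decoupled weight decay. This is a stand-in, not the experiment in the repo. The update rule w <- (1 - eta * lam) * w - eta * u, the closed form below (which follows from E[w . u] = 0 for an independent random direction u), and all names and hyperparameters are assumptions made for illustration.

```python
import numpy as np

def predicted_norm(eta: float, lam: float) -> float:
    # Fixed point of E|w|^2 -> (1 - eta*lam)^2 * E|w|^2 + eta^2.
    return float(np.sqrt(eta / (lam * (2 - eta * lam))))

def simulated_norm(eta: float, lam: float, dim: int = 4096,
                   steps: int = 2000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    norms = []
    for _ in range(steps):
        g = rng.standard_normal(dim)
        u = g / np.linalg.norm(g)          # normalized (unit-norm) update
        w = (1 - eta * lam) * w - eta * u  # decoupled weight decay + step
        norms.append(np.linalg.norm(w))
    # Average over the last quarter of the run, well past the transient.
    return float(np.mean(norms[-(steps // 4):]))

eta, lam = 0.1, 0.1
print(predicted_norm(eta, lam), simulated_norm(eta, lam))
```

The matrix case swaps the unit vector for a spectrally orthogonalized Gaussian matrix; the n=m discrepancy asked about above is not addressed by this sketch.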
Apr 13
Numerical simulations for n=2m matrix (left) and n=1.5m matrix (right)
Apr 6
(This also stumped DeepSeek-V3.2)
Mar 28
I happened to be running this ablation. Preliminary result based on training a ViT-S on ImageNet-1k for 90 epochs says it's better to leave the biases out 🤔 Background: A modified DeiT base has been the CV workhorse for Scion papers... 1/5 x.com/Ji_Ha_Kim/status/20372…
Replying to @Ji_Ha_Kim
Nowadays biases are omitted in transformers for simplicity/same quality but it might matter more to keep affine layers for more expressivity, since bias doesn't affect Lipschitz
Mar 28
2. For my ViT-S I made sure that the QKV grads are separately orthogonalized and the input dims of the patchifier are flattened. In the process I already left out the bias of QKV. 3. I incorporated @tmpethick's 1.0 init. mo for the unbiased exp. 4/5
Mar 28
4. These are ScionC experiments, designed to keep the weight norm stable. (I don't expect 2-4 to change the direction of the result) 5. Without biases somehow the avg. spectral norm is smaller and the L2 grad norm is higher. It's possible that the optimal WD may change... 5/5
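Point 2 above (orthogonalizing the Q, K and V gradients separately rather than as one fused matrix) can be sketched as follows. This is an illustration, not the code used in these experiments: it assumes a fused gradient of shape (3d, d) with Q, K, V stacked along the first axis, and it uses an exact SVD in place of the Newton-Schulz-style iteration that Muon / Scion implementations typically run. The function names are hypothetical.

```python
import numpy as np

def msign(g: np.ndarray) -> np.ndarray:
    """Matrix sign via SVD: replace every singular value of g by 1."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def orthogonalize_qkv(grad_qkv: np.ndarray) -> np.ndarray:
    """Orthogonalize the Q, K, V blocks of a fused (3d, d) gradient independently."""
    q, k, v = np.split(grad_qkv, 3, axis=0)
    return np.concatenate([msign(q), msign(k), msign(v)], axis=0)

rng = np.random.default_rng(0)
grad = rng.standard_normal((3 * 8, 8))   # toy fused QKV gradient, d = 8
update = orthogonalize_qkv(grad)
print(np.linalg.svd(update[:8], compute_uv=False))  # singular values of the Q block
```

Each block of the result has all singular values equal to 1, whereas orthogonalizing the fused (3d, d) matrix as a whole would give three blocks that are generally not orthogonal on their own.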
Mar 13
If the hidden dimension D is a major bottleneck then we can expand it cheaply & exponentially in the last few layers by replacing residual connections with concatenation, tweaking the MLP shape, or activating more experts. Has anyone tried that? x.com/nthngdy/status/2032172…

🧵New paper: "Lost in Backpropagation: The LM Head is a Gradient Bottleneck" The output layer of LLMs destroys 95-99% of your training signal during backpropagation, and this significantly slows down pretraining 👇
Jan 6
TIL scipy has every possible inverse of the regularized incomplete beta function: scipy.special.{btdtria, btdtrib, betaincinv} Convenient for working with Beta distribution!
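A quick round trip through the three inverses, with arbitrary example values for a, b and x:

```python
from scipy.special import betainc, betaincinv, btdtria, btdtrib

a, b, x = 2.0, 5.0, 0.3
p = betainc(a, b, x)          # regularized incomplete beta I_x(a, b) = Beta(a, b) CDF at x

x_back = betaincinv(a, b, p)  # solve for x given a, b, p
a_back = btdtria(p, b, x)     # solve for a given p, b, x
b_back = btdtrib(a, p, x)     # solve for b given a, p, x
print(x_back, a_back, b_back)
```

Note the argument order: the probability p takes the place of whichever parameter is being solved for.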
27 Aug 2025
Cloning the PyTorch repo takes 1.21 GB of disk space. You can feel the weight of history
9 Aug 2025
Quite a late bloomer, (constrained) Scion. Diff. from the DeiT exp. in the paper: It's based on "better baseline" ViT-S/16 (global avg. pooling, sincos2d etc.) and trained for 90 ep. Parameters are all reinitialized and it uses Polar Express for msign. github.com/EIFY/mup-vit/tree…
29 Jul 2025
(4/4) So, spectral chisel can be as accurate as other methods except msign-based hardcapping. It may depend on how many singular values we need to cap but # of steps needed should be ~O(n), so the total cost is at most about that of a few matmuls. Repo: github.com/EIFY/spectral-chi…

29 Jul 2025
(1/4) Comparing such "spectral chisel" (1 iter/step) against the ground truth, direct hardcap (from @LakerNewhouse's repo), and msign-based hardcap (Newhouse et al., 2025) for capping an n=4096 square unit Gaussian random matrix at svdval <= 120: x.com/EIFY/status/1947062171…
20 Jul 2025
Replying to @LakerNewhouse
Approximating σ1, u1, and v1 with power iteration is O(n^2) per iter. while matmul alone is O(n^3). How about just repeating power iteration spectral weight decay / hammer a fixed number of times or till σ1 < σ_{max}?
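A minimal NumPy sketch of that idea, not the implementation in the repo: estimate (σ1, u1, v1) by power iteration, shrink only that singular direction down to σ_max with a rank-1 update, and repeat until the estimate drops to σ_max or a step budget runs out. The function names, the fresh random start at every step and the default of 100 power iterations are assumptions chosen to keep the example easy to verify; the "1 iter/step" variant in the thread is cheaper per step.

```python
import numpy as np

def top_singular_triple(a, iters, rng):
    """Estimate (sigma1, u1, v1) of `a`; each iteration is two matvecs, O(n^2)."""
    v = rng.standard_normal(a.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = a @ v
        u /= np.linalg.norm(u)
        v = a.T @ u
        v /= np.linalg.norm(v)
    u = a @ v
    sigma = np.linalg.norm(u)
    return sigma, u / sigma, v

def cap_spectral_norm(a, sigma_max, max_steps, iters=100, seed=0):
    """Repeatedly shrink the top singular value to sigma_max via rank-1 updates."""
    rng = np.random.default_rng(seed)
    a = np.array(a, dtype=float)  # work on a float copy
    for _ in range(max_steps):
        sigma, u, v = top_singular_triple(a, iters, rng)
        if sigma <= sigma_max:
            break
        a -= (sigma - sigma_max) * np.outer(u, v)
    return a

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
capped = cap_spectral_norm(w, sigma_max=10.0, max_steps=64)
print(np.linalg.norm(w, 2), np.linalg.norm(capped, 2))
```

Because power iteration approaches σ1 from below, an under-converged estimate can stop the loop while the true σ1 is still slightly above σ_max; more iterations per step trade cost for accuracy.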
29 Jul 2025
(3/4) 🤨 Strangely the lipschitz-transformers repo doesn't implement the msign-based spectral hardcapping described in Appendix B. Instead it does this "direct hardcapping" attributed to @leloykun, presumably optimized with procedures similar to x.com/YouJiacheng/status/189…
Improved stability against noise, idea of the noise-stability loss is borrowed from @leloykun 's code.
29 Jul 2025
(2/4) Here is the implementation. For the test above it's run for n steps, so with 1 iter/step it costs 2n^3 + O(n^2) scalar multiplications: