apaz

apaz

135 Photos and videos

Tweets

Pinned Tweet

apaz

@apaz_cli

Mar 13

Releasing mlsweep, a sweep scheduler and visualizer for distributed ML training. It aims to make launching runs across groups GPUs frictionless and achieve near feature-parity with wandb. But you can use it with whatever frameworks or loggers you like, wandb included.

4,073

apaz

apaz

@apaz_cli

I'm slamming my head into kernel autoresearch and... it feels so close and so far simultaneously. I think we're cooked until someone cracks it, at which point things will rapidly improve. And then new hardware will be released and we'll be cooked again. DSV4 doesn't know tcgen05.

apaz

apaz

@apaz_cli

Jun 12

I thought my experiment failed completely, but sometimes there is signal in noise. Even in truly insane amounts of noise.

apaz

apaz

@apaz_cli

Jun 9

Suppose I have some fancy new way of doing RL. What's the most convincing smallish experiment that I can use to derive scaling laws? Some sort of math benchmark on qwen3 1.5b or something? But I hear that's qwenslop. It might be qwenslop. What's in the same ballpark but better?

1,950

stochasm

apaz retweeted

stochasm

@stochasticchasm

19 Jul 2025

thought he only had 2 moments

160

8,802

apaz

apaz

@apaz_cli

Jun 4

I love how any the normalized length of any sufficiently large vector of normally distributed random variables converges to 1 by the law of large numbers. This lets you clip ES gradient projections for cheaper because you don't have to do a reduction to compute the length.

175

apaz

apaz

@apaz_cli

Jun 2

PSA that despite the Deepseek V4 paper clearly stating that you must set a specific system prompt, @opencode does not set this prompt and so you cannot use "max" mode in opencode, even if you think you have selected max mode.

659

apaz

apaz

@apaz_cli

Jun 2

I understand not wanting to do SFT. There are real gains from not doing it. But the choice to pretrain without pretraining on things that look like rollouts is confusing to me. There is a big distribution mismatch there, as big as base model vs instruction tune. IMO it should be a part of midtraining. There are ways to produce data that looks like a rollout without actually sampling a rollout.

Rosinality @rosinality

Jun 2

Many choices here are only possible when your objective is not an immediate or short-term performance. Pretraining without synthetic data, posttraining without SFT with data from other LLMs. (And other good choices like scaling ladder with NLL instead of benchmark scores).

217

apaz

apaz

@apaz_cli

Jun 2

I'm actually so thankful that RWKV exists. I can't name anyone else that has carried RNNs forward (mamba doesn't count)

apaz

apaz

@apaz_cli

May 31

Suppose in deep learning that we didn't do gradient accumulation/batching and we had full knowledge of what components of the grad came from what item. In that case would it make sense to apply grad clipping per-item? One preserves information, the other bounds step magnitude.

154

apaz

apaz

@apaz_cli

May 31

You can tell Opus 4.8 is honest because it keeps telling you it's so honest.

195

apaz

apaz

@apaz_cli

May 30

I will hand it to Anthropic, the model released, I switched to it, and it instantly found a bug. It also instantly started annoyed me. An adjustment for sure.

Taelin

@VictorTaelin

May 29

300k tokens trying to teach 4.8 how Bend's termination checker works 🫠 maybe not so bright, but somehow a pleasure to talk to and definitely my favorite model of all time

473

apaz

apaz

@apaz_cli

May 27

I'm writing a blog post that requires doing a lot of gradient estimation noise math. I'll doublecheck this paper quick too. The way this tends to work is that if it does better in theory than existing techniques it's promising. Otherwise it's garbage and not worth pursuing.

hardmaru

@hardmaru

May 27

For over a decade, we’ve accepted that end-to-end backprop is the only way to train deep networks. But holding the entire network in memory all at once is why AI training is hitting a resource wall. We found a new way to break the network into blocks and train them independently. The trick? Treating the network’s forward pass like a diffusion model denoising a signal. This reinterpretation slashes the memory needed to train deep models. In our #ICLR2026 paper (arxiv.org/abs/2506.14202), we matched end-to-end performance across ViTs, DiTs, and LLMs. We did this while training just one isolated block at a time.

1,831

apaz

apaz

@apaz_cli

May 27

Update here x.com/apaz_cli/status/205968…

apaz

@apaz_cli

May 27

Okay never mind. The gradient noise math is correct, and it scales fine. But it optimizes a proxy objective, not the real objective. Stepping in a direction that is not the real objective is noise. So the question is, what is the difference between that objective and this one for your target distribution? This is not really measurable in theory, only in practice. In any case I think this noise is even less fixable than MeZO. Yes this proxy objective appears to be fine for general language modeling, but lots of things are fine for general language modeling. It's an easy problem. If you want to do something like reasoning RL with this proxy objective you will probably run into problems.

210

apaz

apaz

@apaz_cli

May 27

apaz

@apaz_cli

May 27

640

apaz

apaz

@apaz_cli

May 27

Just realized I'm writing CUDA at 4 AM with all the lights off. This was the CUDA_MODE they warned me about, wasn't it...

145

apaz

apaz

@apaz_cli

May 22

I got so nerdsniped by hessian approximation noise distribution math, dear claude please deliver me from this hell

158

apaz

apaz

@apaz_cli

May 25

Okay so the hessian approximation math was all for nothing and my time was wasted but at least I learned what not to try to do.

apaz

apaz

@apaz_cli

May 25

They're reduce-scattering me tomorrow

snow

@snowclipsed

May 25

they're orthogonalizing me tomorrow

310

apaz

apaz

@apaz_cli

May 23

I just learned that Yurii Nesterov invented Zeroth-Order optimization as well. I only knew him as the Nesterov Momentum guy. But no I guess he just invented absolutely everything.

2,025