Okay never mind. The gradient noise math is correct, and it scales fine. But it optimizes a proxy objective, not the real objective.
Stepping along the gradient of something other than the real objective is noise. So the question is: how far apart are the proxy objective and the real one on your target distribution? That isn't really measurable in theory, only in practice.
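To make "measurable in practice" concrete, here is a rough sketch of one way to check it: compare the proxy gradient against the real gradient by cosine similarity at the points you care about. Everything in it is a made-up toy (two quadratics with different minima, hypothetical names like `real_grad` and `proxy_grad`), not the paper's objective.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two gradient vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy objectives: two quadratics whose minima differ.
real_optimum = np.array([1.0, 0.0])
proxy_optimum = np.array([1.0, 1.0])

def real_grad(theta):
    # Gradient of 0.5 * ||theta - real_optimum||^2
    return theta - real_optimum

def proxy_grad(theta):
    # Gradient of 0.5 * ||theta - proxy_optimum||^2
    return theta - proxy_optimum

theta_far = np.array([-10.0, -10.0])   # far from both minima
theta_near = np.array([0.9, 0.1])      # close to the real minimum

sim_far = cosine_similarity(real_grad(theta_far), proxy_grad(theta_far))
sim_near = cosine_similarity(real_grad(theta_near), proxy_grad(theta_near))

print(f"far from optimum:  cos = {sim_far:.3f}")
print(f"near real optimum: cos = {sim_near:.3f}")
```

In this toy the two gradients nearly agree far from the minima and point in opposing directions close to the real one. Whether a real proxy behaves anything like that on your distribution is exactly the thing you would have to measure.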
In any case, I think this noise is even less fixable than MeZO's.
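For context on the MeZO comparison: MeZO estimates gradients with a two-point, SPSA-style finite difference along a random direction. Below is a minimal sketch of that kind of estimator, assuming a toy quadratic loss and hypothetical names (`spsa_grad`, `mean_estimate_error`); it is not MeZO's actual code. One reading of "less fixable" is that this kind of noise is variance, so it averages out with more samples, while a proxy/real mismatch would not.

```python
import numpy as np

def loss(theta):
    # Hypothetical toy loss; its true gradient is simply theta.
    return 0.5 * float(np.dot(theta, theta))

def spsa_grad(loss_fn, theta, rng, eps=1e-3):
    """Two-point zeroth-order gradient estimate along a random Gaussian direction."""
    z = rng.standard_normal(theta.shape)
    projected = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return projected * z

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 3.0])
true_grad = theta

def mean_estimate_error(n_samples):
    # Average n_samples independent estimates and measure distance to the true gradient.
    estimates = [spsa_grad(loss, theta, rng) for _ in range(n_samples)]
    return float(np.linalg.norm(np.mean(estimates, axis=0) - true_grad))

err_few = mean_estimate_error(4)
err_many = mean_estimate_error(20000)

print(f"error with 4 samples:     {err_few:.3f}")
print(f"error with 20000 samples: {err_many:.3f}")
```

The error shrinks as samples are added because the estimator is unbiased for this loss. A gradient taken from the wrong objective has no such knob to turn.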
Yes, this proxy objective appears to be fine for general language modeling, but lots of things are fine for general language modeling. It's an easy problem. If you want to do something like reasoning RL with this proxy objective, you will probably run into problems.
I'm writing a blog post that requires a lot of gradient estimation noise math. I'll double-check this paper quickly too.
The way this tends to work is that if it does better in theory than existing techniques it's promising. Otherwise it's garbage and not worth pursuing.