The GPT-4o moment for humanoids might finally be here.
And yeah, sorry in advance for the rickroll.
OMG runs a Unitree G1 off one brain that natively takes language, audio, and human motion.
That "Never Gonna Give You Up" dance isn't the flex; one model fluent in every modality is.
Here's the shift.
Most humanoid policies are one-trick. Train per skill, hand-tune the rewards, repeat.
The rest just replay a fixed motion you feed them.
OMG instead works like a biological motor system. A "brain" that turns intent into future motion.
A "cerebellum" that reactively runs it on the robot.
The brain is one diffusion model. Language, audio, a reference pose, or any blend goes in.
A robot-ready G1 trajectory comes out, live.
New inputs attach through zero-init adapters. They start at zero, so the pretrained motion prior carries over intact instead of getting scrambled.
That's how they bolted on VR teleoperation as a brand-new modality, reusing the same brain.
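For the curious, here's roughly why zero-init keeps the prior intact. This is a toy sketch in plain Python, not OMG's actual code: the tiny `backbone`, the `ZeroInitAdapter` class, the shapes, and the `vr_signal` input are all made-up stand-ins. The only idea taken from the post is that the adapter's output projection starts at zero, so on day one the new input adds exactly nothing to what the pretrained model already computes.

```python
import math
import random

random.seed(0)

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, scale=1.0):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

HIDDEN, COND = 6, 3  # hypothetical sizes, chosen only for the demo

# Stand-in for one frozen, pretrained backbone layer (hypothetical).
W_backbone = rand_matrix(HIDDEN, HIDDEN)

def backbone(h):
    return [math.tanh(v) for v in matvec(W_backbone, h)]

class ZeroInitAdapter:
    """Maps a new conditioning signal into the backbone's hidden space."""

    def __init__(self, cond_dim, hidden_dim):
        # Input projection: ordinary random init.
        self.W_in = rand_matrix(hidden_dim, cond_dim, scale=0.1)
        # Output projection: all zeros, so the adapter emits zeros at init.
        self.W_out = [[0.0] * hidden_dim for _ in range(hidden_dim)]

    def __call__(self, cond):
        z = [math.tanh(v) for v in matvec(self.W_in, cond)]
        return matvec(self.W_out, z)

def backbone_with_adapter(h, cond, adapter):
    # The adapter output is added to the hidden state before the backbone layer.
    delta = adapter(cond)
    return backbone([hi + di for hi, di in zip(h, delta)])

h = [random.gauss(0.0, 1.0) for _ in range(HIDDEN)]
vr_signal = [random.gauss(0.0, 1.0) for _ in range(COND)]  # pretend new modality
adapter = ZeroInitAdapter(COND, HIDDEN)

# At init the adapter contributes zeros, so the pretrained output is untouched.
unchanged_at_init = backbone(h) == backbone_with_adapter(h, vr_signal, adapter)

# Fake one "training step": once W_out moves off zero, the new input matters.
for i in range(HIDDEN):
    adapter.W_out[i][i] = 0.5
changed_after_step = backbone(h) != backbone_with_adapter(h, vr_signal, adapter)

print(unchanged_at_init, changed_after_step)
```

The point of the design: training starts from the pretrained model's exact behavior and only drifts as far as the new data pushes the adapter weights.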
And it behaves like a foundation model. Bigger backbone, cleaner motion.
Finetune on 1% of new data and it nearly matches a model trained from scratch on 100%. Compose language and audio at inference for combos never seen in training.
1000 hours of motion, all retargeted into one G1 body.
One brain that scales.
We keep racing to build stronger low-level controllers.
OMG's bet is that the real bottleneck is the brain mapping human intent to motion.
Congrats to @KnightNemo_, @li_yitang, @ShaotingZ38103 and the whole team!