Congrats on the new MolmoAct 2 release by @allen_ai!
A few features that stood out for those considering this for real-world deployments:
1. YAM embodiment unlock: a 720h teleoperated dataset
720 hours of bimanual YAM data is a meaningful contribution. The YAM embodiment is a simple bimanual arm setup for dexterous tasks, very similar to the dual PiperX and Trossen WidowX arms. Anyone building on @physical_int's Pi05 or similar models with a YAM-type robot now has significantly more data to fine-tune from. That should reduce the number of fine-tuning samples needed for a custom task, assuming the target task and environment fall within the dataset's distribution.
The dataset spans household, factory, and coffee-shop settings with high object and scene variation.
@cortexairobot was the data vendor. Hoping the appendix detailing the quality control protocol gets released.
2. Depth reasoning as a reproducible recipe, but only with layer-level access
MolmoAct2-Think shows one way to inject depth information into the action model. Before producing an action, the model predicts a compact discrete depth representation that conditions the action expert through per-layer KV conditioning.
The mechanism requires surgical access to the VLM's intermediate attention states at every layer, something only possible with fully open architectures.
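To make the layer-level access point concrete, here is a minimal sketch of per-layer KV conditioning in plain Python. It is not the MolmoAct2-Think implementation: the function names, the single attention head, and the identity projections (action tokens reused as their own keys and values) are all assumptions made for illustration. What it does show is that layer i of the action expert consumes keys and values cached from layer i of the VLM, so a model that only exposes final outputs cannot feed it.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


def action_expert_forward(action_tokens, vlm_kv_per_layer):
    """Hypothetical action expert with one attention layer per VLM layer.

    vlm_kv_per_layer[i] is the (keys, values) pair cached from VLM layer i,
    covering image, text, and depth tokens. Learned projections are omitted
    (assumption): action tokens act as their own queries, keys, and values.
    """
    h = action_tokens
    for vlm_keys, vlm_values in vlm_kv_per_layer:
        keys = vlm_keys + h      # prepend the cached VLM keys for this layer
        values = vlm_values + h  # and the matching cached values
        attn = attend(h, keys, values)
        h = [[a + b for a, b in zip(row, upd)] for row, upd in zip(h, attn)]  # residual
    return h


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


NUM_LAYERS, DIM = 3, 4
vlm_kv = [(rand_matrix(5, DIM), rand_matrix(5, DIM)) for _ in range(NUM_LAYERS)]
actions = rand_matrix(2, DIM)
conditioned = action_expert_forward(actions, vlm_kv)
```

Swapping in a different per-layer cache changes the action tokens at every depth, which is the whole point of the conditioning, and also why an API that hides intermediate attention states rules this design out.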
3. Swappable VLM backbones for converting VLM -> VLM-ER
The released training recipe effectively decouples the perception backbone from the action head. You can pick a VLM optimized for your task domain rather than accepting a generic vision encoder.
Hypothetical example: for warehouse sorting where success hinges on reading tiny, cluttered, blurry SKU labels, start from a VLM fine-tuned for OCR (e.g., a custom Qwen-VL or InternVL variant) instead of a generalist web-scale VLM.
Apply the MolmoAct2-ER training recipe to that backbone to produce an "OCR-VL-ER" variant, then attach a flow-matching Action Expert. The result is a bespoke VLA that inherits your perception fine-tuning, optimized for label-reading manipulation rather than generic open-world scenes.
This assumes catastrophic forgetting is minimized and the backbone retains most of the baseline capabilities it had before fine-tuning.
With this recipe, you can swap in domain-specific backbones (medical imaging, industrial inspection, high-res OCR) and convert them into action models entirely from open components.
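A rough sketch of what that decoupling looks like in code, under heavy assumptions: the `Backbone` protocol, both toy backbones, and the `FlowActionExpert` class are hypothetical names, not part of the MolmoAct 2 release. The velocity field is a closed-form stand-in for a learned network, and the sampler is plain Euler integration from noise at t=0 to an action at t=1, which is one common way to sample from a flow-matching model.

```python
from typing import List, Protocol


class Backbone(Protocol):
    """Anything that maps (image, instruction) to a feature vector."""

    def encode(self, image: List[float], instruction: str) -> List[float]: ...


class GeneralistBackbone:
    """Toy stand-in for a generalist VLM: averages the pixels."""

    def encode(self, image: List[float], instruction: str) -> List[float]:
        return [sum(image) / len(image), float(len(instruction))]


class OcrBackbone:
    """Toy stand-in for an OCR-tuned VLM: keys on the sharpest pixel."""

    def encode(self, image: List[float], instruction: str) -> List[float]:
        return [max(image), float(len(instruction))]


class FlowActionExpert:
    """Hypothetical flow-matching action head, independent of the backbone."""

    def __init__(self, action_dim: int = 2, steps: int = 10) -> None:
        self.action_dim = action_dim
        self.steps = steps

    def target(self, features: List[float]) -> List[float]:
        # Stand-in for what a trained network would encode (assumption).
        return [0.1 * features[i % len(features)] for i in range(self.action_dim)]

    def velocity(self, x: List[float], t: float, features: List[float]) -> List[float]:
        # Velocity of the straight-line path from x toward the target at time t.
        mu = self.target(features)
        return [(m - xi) / (1.0 - t) for m, xi in zip(mu, x)]

    def sample(self, features: List[float], noise: List[float]) -> List[float]:
        # Euler integration of dx/dt = v(x, t) over t in [0, 1).
        x = list(noise)
        dt = 1.0 / self.steps
        for k in range(self.steps):
            v = self.velocity(x, k * dt, features)
            x = [xi + dt * vi for xi, vi in zip(x, v)]
        return x


class VLA:
    """Composes any Backbone with the action expert."""

    def __init__(self, backbone: Backbone, expert: FlowActionExpert) -> None:
        self.backbone = backbone
        self.expert = expert

    def act(self, image: List[float], instruction: str, noise: List[float]) -> List[float]:
        return self.expert.sample(self.backbone.encode(image, instruction), noise)


expert = FlowActionExpert()
image, noise = [0.1, 0.9, 0.5], [1.0, -1.0]
generalist_action = VLA(GeneralistBackbone(), expert).act(image, "pick", noise)
ocr_action = VLA(OcrBackbone(), expert).act(image, "pick", noise)
```

The same action head runs unchanged on both backbones and only the features differ, which is the property the recipe relies on. In a real system the head would still need training against the new backbone's features.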
Robotics models often struggle outside controlled environments. Ours is built to work in real ones.
Today we're launching MolmoAct 2, which can assist with a host of chores & lab tasks, plus the MolmoAct 2-Bimanual YAM dataset—the largest open robotics dataset of its kind. 🧵