Technical highlights:
CogViT Vision Encoder
- Built with dual-teacher distillation: SigLIP2 supplies semantic supervision and DINOv3 supplies texture supervision. Training follows a two-stage recipe (masked modeling, then contrastive pretraining) and uses QK-Norm to keep attention stable at scale.
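As a rough illustration of why QK-Norm helps stability, the sketch below normalizes queries and keys before the dot product so that every attention logit is bounded by a fixed scale, whatever the magnitude of the activations. The highlights do not say which QK-Norm variant CogViT uses; L2 normalization with a fixed `scale` is an assumption here (other variants apply LayerNorm or RMSNorm with learnable parameters), and all function names are hypothetical.

```python
import math

def l2_normalize(vec, eps=1e-6):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def qk_norm_logits(queries, keys, scale=10.0):
    """Attention logits computed from L2-normalized queries and keys.

    Each logit is scale * cos(q, k), so |logit| <= scale regardless of
    how large the raw query/key activations are.
    """
    q_n = [l2_normalize(q) for q in queries]
    k_n = [l2_normalize(k) for k in keys]
    return [[scale * sum(a * b for a, b in zip(q, k)) for k in k_n] for q in q_n]

def qk_norm_attention(queries, keys, values, scale=10.0):
    """Single-head attention using the bounded logits above (assumed variant)."""
    outputs = []
    for row in qk_norm_logits(queries, keys, scale):
        weights = softmax(row)
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs
```

Without the normalization step, logits grow with the product of the query and key norms, which is one common source of softmax saturation in large vision transformers.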
Multimodal Multi-Token Prediction (MMTP)
- Three ways to pass image tokens into the MTP head were compared. The chosen approach uses a shared <image> token, removing the need to propagate visual embeddings across pipeline stages and improving training stability.
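The shared-token idea can be sketched as follows: when building the embedding inputs for the MTP head, every image position looks up one shared `<image>` row instead of its own visual embedding, so the head needs only token ids and a mask. This is a minimal sketch under stated assumptions, not the actual MMTP implementation; `IMAGE_TOKEN_ID`, `build_mtp_inputs`, and the list-based embedding table are hypothetical.

```python
IMAGE_TOKEN_ID = 0  # hypothetical id of the shared <image> placeholder row

def build_mtp_inputs(token_ids, is_image, embedding_table,
                     image_token_id=IMAGE_TOKEN_ID):
    """Return one embedding row per position for the MTP head.

    Text positions look up their own token id. Image positions ignore
    whatever id they carry and all reuse the single shared <image> row,
    so no per-patch visual embedding has to reach this stage.
    """
    rows = []
    for tok, img in zip(token_ids, is_image):
        idx = image_token_id if img else tok
        rows.append(list(embedding_table[idx]))
    return rows

# Usage: two text tokens around two image positions.
table = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]  # row 0 is the shared <image> row
mtp_inputs = build_mtp_inputs([1, 0, 0, 2], [False, True, True, False], table)
```

Because the MTP-side input is a plain lookup, a pipeline stage that hosts the MTP head would need only integer ids and the mask rather than the vision encoder's outputs, which is the saving the highlight describes.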
Broad Training Across Perception, Reasoning, and Agent Capability
- Vision and language are fused from pre-training onward, with emphasis on multimodal code. Joint RL across 30 task categories yields consistent gains with weaker cross-domain interference than SFT.
Multimodal RL at Scale
- The RL infrastructure was rebuilt along four axes: unified task and reward abstraction, full-pipeline asynchrony, fine-grained memory management for vision modules, and topology-aware partitioning for variable-length visual inputs.
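The highlights do not describe the partitioning algorithm. To illustrate only the load-balancing side of the problem, the sketch below assigns samples with very different visual token counts to workers using a greedy longest-first heuristic. This is a swapped-in, generic technique and not the method used here: it ignores device topology entirely, and `balance_by_length` and its arguments are hypothetical.

```python
import heapq

def balance_by_length(sample_lengths, num_workers):
    """Greedy longest-first assignment of samples to workers.

    Samples are visited from longest to shortest and each one goes to the
    worker with the smallest load so far. Returns (assignment, loads), where
    assignment[w] lists the sample indices given to worker w.
    """
    heap = [(0, w) for w in range(num_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_workers)]
    order = sorted(range(len(sample_lengths)),
                   key=lambda i: sample_lengths[i], reverse=True)
    for i in order:
        load, w = heapq.heappop(heap)
        assignment[w].append(i)
        heapq.heappush(heap, (load + sample_lengths[i], w))
    loads = [sum(sample_lengths[i] for i in idxs) for idxs in assignment]
    return assignment, loads

# Usage: visual token counts for six samples, split across two workers.
lengths = [4096, 256, 1024, 3072, 512, 2048]
assignment, loads = balance_by_length(lengths, num_workers=2)
```

A topology-aware scheme would additionally account for where workers sit relative to each other (for example, which ones share a node), which this sketch leaves out.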