Presents MIO, a foundation model built on multimodal tokens using causal multimodal modeling.
Shows strong potential through its any-to-any understanding and generation. Capabilities include interleaved video-text generation, chain-of-visual-thought reasoning, and visual guideline generation.