LLMs are the de facto building block in recent multimodal models. Yet, it is still unclear why text-only models can generalize to multimodal inputs!
We investigate why, and provide insights with practical implications on performance, efficiency and safety problems. (1/10)