Contrastive multimodal representation learning works well for 2 modalities, but what if you're working with 3 or more, as in healthcare, robotics, or video?
📢 Meet Symile: a model-agnostic contrastive loss for any number of modalities, with the simplicity of CLIP but superior performance✨
1/n