All vision-language models should have hyperbolic embeddings. Vision and language are incredibly hierarchical in nature!
See our latest work below on hyperbolic vision-language models that exploit visual compositions through entailment:
(1/6) 🥳 Excited to share my latest research, done as part of my MSc AI thesis! We introduce Hyperbolic Compositional CLIP (HyCoCLIP), a novel framework that leverages the hierarchical nature of hyperbolic space to learn vision-language representations using scene compositions.