The multi-vector embeddings are created using our state-of-the-art retrieval model, mxbai-wholembed. It supports text, audio, and video in over 300 languages. A key innovation is dynamic vector allocation, which lets the model decide how many vectors it needs to represent a given input. For example, a simple cat image may produce only a few vectors, whereas a complex slide deck may generate thousands. We wrote a custom inference engine to serve mxbai-wholembed with low latency.
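
To make the idea of variable-length multi-vector representations concrete, here is a minimal sketch of how two inputs with different vector counts could be scored against the same query. It assumes a ColBERT-style late-interaction (MaxSim) score, which is a common choice for multi-vector retrieval but is not confirmed here as the method mxbai-wholembed uses. The function name `maxsim_score` and the toy 2-dimensional vectors are hypothetical and purely illustrative.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Hypothetical late-interaction score (assumed, not the model's confirmed method).

    For each query vector, take its best dot product against any document
    vector, then sum those maxima. The document may have any number of vectors.
    """
    if not query_vecs or not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy data: a "simple" input with one vector and a "complex" one with three.
query = [[1.0, 0.0], [0.0, 1.0]]
cat_image = [[1.0, 0.0]]
slide_deck = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

scores = {
    "cat_image": maxsim_score(query, cat_image),
    "slide_deck": maxsim_score(query, slide_deck),
}
print(scores)
```

Because the score only takes a maximum over the document's vectors, it works unchanged whether an input is represented by a handful of vectors or by thousands.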