What makes CLIP work?
The contrast with negatives via softmax?
The more negatives, the better -> large batch size?
We'll answer "no" to both in our ICCV oral🤓
By introducing SigLIP, a simpler CLIP that also works better and is more scalable, we can study the extremes.
Hop in🧶