Vision Transformers (ViTs) are a powerful deep learning architecture, but what’s the difference between a ViT and a text-based transformer like BERT? Despite being applied to completely different domains, these models have only one major difference… 🧵[1/7]