Text-to-image (T2I) evaluation metrics often miss fine-grained semantic alignment between prompt and image. TokenFocus-VQA addresses this by concentrating on key tokens through VQA-style probing and a position-aware loss.
- Targets semantically important tokens rather than global image-text similarity
- Uses large vision-language models (LVLMs) for precise image-text checks
- Ranked 2nd (score 0.8445) at NTIRE 2025, only 0.0001 behind 1st place
Token-level grounding > fuzzy matching.
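
To make the two ideas concrete, here is a minimal sketch of what VQA-style probing and a position-focused loss could look like. It is not the TokenFocus-VQA implementation: the function names (`vqa_yes_score`, `token_focused_loss`), the yes/no question framing, and the toy three-word vocabulary are all assumptions made for illustration, and the logits are hard-coded where a real LVLM would supply them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vqa_yes_score(answer_logits, yes_id, no_id):
    """Alignment score from a yes/no probe (hypothetical helper).

    Assumes the model was asked something like "Does the image show X?"
    and that answer_logits are its vocabulary logits at the answer position.
    Returns P(yes) renormalized over {yes, no}.
    """
    probs = softmax(answer_logits)
    return probs[yes_id] / (probs[yes_id] + probs[no_id])


def token_focused_loss(logits_per_pos, target_ids, focus_positions):
    """Mean cross-entropy over the selected positions only (hypothetical helper).

    Positions outside focus_positions contribute nothing, so training
    signal is concentrated on the tokens that carry the judgment.
    """
    if not focus_positions:
        raise ValueError("focus_positions must not be empty")
    losses = []
    for pos in focus_positions:
        probs = softmax(logits_per_pos[pos])
        losses.append(-math.log(probs[target_ids[pos]]))
    return sum(losses) / len(losses)


# Toy vocabulary: 0 = "yes", 1 = "no", 2 = "maybe" (illustrative only).
answer_logits = [2.0, 0.0, -1.0]
score = vqa_yes_score(answer_logits, yes_id=0, no_id=1)

# Two-position sequence; only position 1 (the answer token) is in focus.
sequence_logits = [[0.0, 0.0, 0.0], answer_logits]
targets = [2, 0]
loss = token_focused_loss(sequence_logits, targets, focus_positions=[1])
```

In this sketch, changing the logits at position 0 leaves `loss` unchanged, which is the point of restricting the loss to chosen positions: the model is graded on the answer token, not on the surrounding text.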