Can AI Navigate Maps Like Humans Do? Introducing MapBench! 🗺️
Reading maps, such as Google Maps or theme park maps, is second nature for humans. Yet it is a demanding task that requires visual understanding, spatial reasoning, and long-horizon planning. We're curious: can Large Vision-Language Models (LVLMs) do it too?
We're excited to share MapBench, the first dataset and benchmark specifically designed to evaluate how well LVLMs perform on pixel-based map navigation tasks!
Why MapBench is a Game-Changer:
• 1,600 complex pathfinding queries drawn from 100 challenging map scenarios (urban areas, theme parks, universities, malls, and more).
• Introduces the Map Space Scene Graph (MSSG): a novel data structure that maps visual landmarks and their spatial relationships to structured navigation tasks.
• Evaluates state-of-the-art LVLMs such as GPT-4o, Llama-3.2, and Qwen2-VL under zero-shot and Chain-of-Thought (CoT) prompting, revealing key insights into their spatial reasoning and navigation abilities.
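To make the scene-graph idea concrete, here is a minimal Python sketch of a landmark graph with a route query. It is an illustration only and not the MSSG implementation: the `LandmarkGraph` class, the landmark names, the compass directions, and the edge costs are all hypothetical, and Dijkstra's algorithm is simply one convenient way to produce a reference route.

```python
import heapq
from collections import defaultdict

# Hypothetical toy structure, NOT the MSSG from the paper.
OPPOSITE = {"north": "south", "south": "north", "east": "west", "west": "east"}


class LandmarkGraph:
    """Nodes are named landmarks; edges are walkable links with a direction and a cost."""

    def __init__(self):
        # landmark -> list of (neighbor, direction of travel, cost)
        self.edges = defaultdict(list)

    def connect(self, a, b, direction, cost=1.0):
        """Link a -> b in `direction`, and b -> a in the opposite direction."""
        self.edges[a].append((b, direction, cost))
        self.edges[b].append((a, OPPOSITE[direction], cost))

    def route(self, start, goal):
        """Return the cheapest route as [(direction, landmark), ...], or None if unreachable."""
        best = {start: 0.0}
        prev = {}
        heap = [(0.0, start)]
        while heap:
            dist, node = heapq.heappop(heap)
            if node == goal:
                break
            if dist > best.get(node, float("inf")):
                continue  # stale heap entry
            for nxt, direction, cost in self.edges[node]:
                new_dist = dist + cost
                if new_dist < best.get(nxt, float("inf")):
                    best[nxt] = new_dist
                    prev[nxt] = (node, direction)
                    heapq.heappush(heap, (new_dist, nxt))
        if goal not in best:
            return None
        # Walk back from the goal to rebuild the step list.
        steps = []
        node = goal
        while node != start:
            parent, direction = prev[node]
            steps.append((direction, node))
            node = parent
        return steps[::-1]


# Made-up theme park layout for illustration.
park = LandmarkGraph()
park.connect("Entrance", "Fountain", "north", 2.0)
park.connect("Fountain", "Carousel", "east", 3.0)
park.connect("Entrance", "Cafe", "east", 4.0)
park.connect("Cafe", "Carousel", "north", 2.0)

print(park.route("Entrance", "Carousel"))
# [('north', 'Fountain'), ('east', 'Carousel')]
```

A route produced this way could serve as a ground-truth answer to compare against a model's free-text directions, though how MapBench actually scores answers is described in the paper, not here.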
Key Insights:
• Despite their impressive capabilities, current LVLMs struggle significantly with spatial reasoning and structured decision-making.
• CoT prompting boosts spatial reasoning performance but sometimes introduces redundant details.
Check out our findings, dataset, and code here:
arXiv:
lnkd.in/gBv-sFJ3
Huge thanks to our incredible collaborators for making this happen, from
@TAMU,
@UCBerkeley,
@mbzuai,
@UMich, and
@UCRiverside!
Let's continue to bridge the gap between human intuition and AI navigation! 🗺️