Nvidia's LocateAnything-3B is the #1 trending vision model on HF: a 3B that does detection, grounding and GUI pointing with a parallel box decoder.
~~~ the model ships with a cap of 25,600 vision patches per screenshot. without flash-attn (unavailable on consumer Blackwell), its attention fallback materializes a single 40.6GB tensor on a 4K screenshot.
an H100 80GB shrugs; a 32GB card has to cap patches at 12,288 and downscale big images.
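a rough sketch of the arithmetic behind that cap, not the model's actual preprocessing. assumptions: the fallback materializes a full (heads, N, N) score tensor, so memory grows with the square of the patch count; `patch=14` and the helper names are hypothetical, and the real processor may merge or pad patches differently.

```python
import math

def attn_scores_bytes(n_tokens: int, n_heads: int, bytes_per_elem: int) -> int:
    """Bytes for one fully materialized (heads, N, N) attention-score tensor."""
    return n_heads * n_tokens * n_tokens * bytes_per_elem

def patch_count(width: int, height: int, patch: int) -> int:
    """Number of patch-sized tiles needed to cover the image."""
    return math.ceil(width / patch) * math.ceil(height / patch)

def fit_to_patch_cap(width: int, height: int, max_patches: int, patch: int = 14):
    """Shrink (width, height), roughly keeping aspect ratio, until it fits the cap.

    patch=14 is an assumed patch size, used only for illustration.
    """
    n = patch_count(width, height, patch)
    if n <= max_patches:
        return width, height
    # Patch count scales with area, so start from the square root of the ratio.
    scale = math.sqrt(max_patches / n)
    while True:
        new_w = max(patch, int(width * scale))
        new_h = max(patch, int(height * scale))
        if patch_count(new_w, new_h, patch) <= max_patches:
            return new_w, new_h
        scale *= 0.99  # rounding pushed us over the cap; shrink a little more

# Memory ratio when the cap drops from 25,600 to 12,288 patches.
# Heads and dtype cancel out, so the values passed here do not matter.
ratio = attn_scores_bytes(12_288, 1, 2) / attn_scores_bytes(25_600, 1, 2)
print(f"score tensor shrinks to {ratio:.4f} of its size")

new_w, new_h = fit_to_patch_cap(3840, 2160, max_patches=12_288)
print(new_w, new_h, patch_count(new_w, new_h, 14))
```

the point of the ratio: going from 25,600 to 12,288 patches is a bit under half the tokens, but the score tensor drops to under a quarter of its size because of the N² term.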
~~~ ScreenSpot-Pro: 55.3% measured vs 60.3% claimed. my forced downscaling can only push the score down, and accuracy falls with screenshot size exactly as you'd predict.
the honest consumer-card number is ~55%.
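for context, a minimal sketch of how a downscaled run can still be scored against full-resolution labels. assumptions: the usual ScreenSpot-style rule that a prediction counts when the predicted point lands inside the ground-truth box, and that the predicted box center is used as the click point. the function names are hypothetical, not taken from any benchmark harness.

```python
def box_center(box):
    """Center (x, y) of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def to_original(point, scaled_size, orig_size):
    """Map a point predicted on the downscaled image back to original pixels."""
    x, y = point
    sw, sh = scaled_size
    ow, oh = orig_size
    return (x * ow / sw, y * oh / sh)

def is_hit(point, target_box):
    """True if the point falls inside the ground-truth box (edges inclusive)."""
    x, y = point
    x1, y1, x2, y2 = target_box
    return x1 <= x <= x2 and y1 <= y <= y2

# Toy example with made-up numbers: prediction on a half-size copy of a 4K shot.
pred_box_scaled = (100, 50, 120, 60)
click = to_original(box_center(pred_box_scaled), (1920, 1080), (3840, 2160))
print(click, is_hit(click, (200, 100, 240, 120)))
```

small targets are where the downscale hurts: a few pixels of error on the shrunken image doubles once mapped back to the original.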
~~~ the real fault line is text vs icon: 63.2% on text targets, 42.7% on icons.
per app: Word hits 82%. it reads UIs, but it doesn't really see abstract iconography.
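the split itself is just a group-by over per-sample results. a small sketch, assuming each record carries a target type, an app name and a hit flag; the field names and toy records below are made up for illustration and are not the benchmark's schema or my actual results.

```python
from collections import defaultdict

def accuracy_by(records, key):
    """Hit rate per distinct value of records[i][key]."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["hit"])
    return {group: hits[group] / totals[group] for group in totals}

# Made-up toy records, only to show the shape of the data.
records = [
    {"target": "text", "app": "word", "hit": True},
    {"target": "text", "app": "word", "hit": True},
    {"target": "icon", "app": "word", "hit": False},
    {"target": "icon", "app": "cad", "hit": True},
]

by_target = accuracy_by(records, "target")
by_app = accuracy_by(records, "app")
print(by_target)
print(by_app)
```

the same helper works for any other field, e.g. a screenshot-size bucket, if you want to check the accuracy-vs-resolution trend.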
~~~ the novel bit verifies: parallel box decoding gives 100 to 207 tok/s over autoregressive, even on the SDPA fallback without Nvidia's custom attention.
a very good model overall.