We often hear that "computer vision has been solved." But is it really so?
🚀 Excited to share our new work: 𝗖𝗩-𝗔𝗿𝗲𝗻𝗮: 𝗔𝗻 𝗢𝗽𝗲𝗻 𝗕𝗲𝗻𝗰𝗵𝗺𝗮𝗿𝗸 𝗳𝗼𝗿 𝗜𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝗶𝗼𝗻𝗮𝗹 𝗖𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝗩𝗶𝘀𝗶𝗼𝗻 𝗣𝗿𝗼𝗯𝗹𝗲𝗺 𝗦𝗼𝗹𝘃𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗛𝘂𝗺𝗮𝗻-𝗔𝗜 𝗖𝗼𝗹𝗹𝗮𝗯𝗼𝗿𝗮𝘁𝗶𝘃𝗲 𝗣𝗿𝗲𝗳𝗲𝗿𝗲𝗻𝗰𝗲𝘀.
In this paper, we define 𝗶𝗻𝘀𝘁𝗿𝘂𝗰𝘁𝗶𝗼𝗻𝗮𝗹 𝗰𝗼𝗺𝗽𝘂𝘁𝗲𝗿 𝘃𝗶𝘀𝗶𝗼𝗻 𝗽𝗿𝗼𝗯𝗹𝗲𝗺 𝘀𝗼𝗹𝘃𝗶𝗻𝗴 (𝗶𝗖𝗩𝗣𝗦) as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints.
🧩 To support this direction, we introduce 𝗖𝗩-𝗔𝗿𝗲𝗻𝗮, an open benchmark designed for professional-grade visual editing and problem solving.
𝗖𝗩-𝗔𝗿𝗲𝗻𝗮 contains:
✅ 12K high-resolution real-image instruction pairs
✅ 16 instruction-based visual task types
✅ Tasks spanning restoration, enhancement, computational photography, physically grounded object insertion, semantic manipulation, geometry-driven structural editing, and typography recovery
✅ Real-world images with native aspect ratios and high-resolution details
🔍 We also introduce 𝗖𝗼𝗴𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗲𝗿, a dual-track retrieval and curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability to construct diverse and legally traceable benchmark data.
⚖️ For evaluation, we propose 𝗔𝗰𝘁𝗶𝘃𝗲 𝗘𝗹𝗼, a human-AI collaborative preference protocol. Instead of relying purely on automatic metrics or fully human annotation, Active Elo combines:
1. 𝗖𝗩-𝗝𝘂𝗱𝗴𝗲, a logic-gated, multi-dimensional VLM evaluator
2. selective routing of ambiguous high-quality comparisons to expert human raters
3. reliability-weighted Elo updates to aggregate mixed human and AI supervision
This allows us to evaluate models at scale while preserving alignment with expert human preferences.
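For readers who want a feel for the mechanics, here is a minimal Python sketch of how routing and a reliability-weighted Elo update could fit together. It is an illustration under assumptions, not the paper's implementation: the function names (`expected_score`, `weighted_elo_update`, `route_to_human`), the K-factor of 32, the reliability weight in [0, 1], and the routing thresholds are all hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (logistic, base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def weighted_elo_update(rating_a: float, rating_b: float, outcome: float,
                        reliability: float, k: float = 32.0) -> tuple:
    """Return updated (rating_a, rating_b).

    outcome: 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    reliability: assumed weight in [0, 1] for the rater (human or VLM judge);
                 it simply scales the size of the Elo step.
    """
    delta = k * reliability * (outcome - expected_score(rating_a, rating_b))
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def route_to_human(judge_margin: float, quality_a: float, quality_b: float,
                   margin_tol: float = 0.1, quality_min: float = 0.8) -> bool:
    """Assumed routing rule: escalate only comparisons that are both
    ambiguous (small judge margin) and high quality (both outputs score well)."""
    ambiguous = abs(judge_margin) < margin_tol
    both_good = min(quality_a, quality_b) >= quality_min
    return ambiguous and both_good
```

With equal ratings and K = 32, a win reported by a fully trusted rater (reliability 1.0) moves each rating by 16 points, while the same win at reliability 0.5 moves them by only 8.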
📊 We benchmark 21 systems, including proprietary, open-source, and agentic models. Our results reveal persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation.
🤖 Finally, we develop 𝗖𝗩-𝗔𝗴𝗲𝗻𝘁, a lightweight agentic baseline that combines planning, editing, and verification. The results suggest that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.
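As a rough picture of what a plan, edit, verify loop can look like, here is a minimal Python sketch. It is not CV-Agent's actual code: the function `closed_loop_edit`, the three callables it takes, the retry limit, and the way feedback is folded back into the prompt are all assumptions made for illustration.

```python
from typing import Any, Callable, List, Tuple


def closed_loop_edit(
    image: Any,
    instruction: str,
    plan: Callable[[Any, str], List[str]],           # hypothetical planner: returns edit steps
    edit: Callable[[Any, str], Any],                 # hypothetical editor: applies one step
    verify: Callable[[Any, Any, str], Tuple[bool, str]],  # hypothetical verifier: (ok, feedback)
    max_rounds: int = 3,
) -> Tuple[Any, int]:
    """Plan, edit, and verify until the verifier accepts or rounds run out.

    Returns the last candidate and the number of rounds used.
    """
    prompt = instruction
    candidate = image
    for round_idx in range(1, max_rounds + 1):
        # Restart from the original image each round so artifacts do not compound.
        candidate = image
        for step in plan(image, prompt):
            candidate = edit(candidate, step)
        # Always verify against the original instruction, not the revised prompt.
        ok, feedback = verify(image, candidate, instruction)
        if ok:
            return candidate, round_idx
        # Feed the verifier's critique back into the next planning round.
        prompt = f"{instruction} (revise: {feedback})"
    return candidate, max_rounds
```

The point of the loop is the last step: the verifier's feedback changes the next plan, so a failed edit is retried with more information instead of being returned as-is.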
💡 The main takeaway: as visual AI moves toward real workflows, the challenge is no longer only to generate visually plausible images. Models must also understand intent, preserve constraints, reason about structure and physics, and verify whether the edit actually solves the requested visual problem.
𝗣𝗿𝗼𝗷𝗲𝗰𝘁:
ark1234.github.io/cv-arena
𝗖𝗼𝗱𝗲:
github.com/taco-group/CV-Are…
#ComputerVision #GenerativeAI #MultimodalAI #ImageEditing #AIAgents #Benchmarking #CVArena #TAMU