Wonderful to be back from #CVPR2026, and excited to share the release of our follow-up work: VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation.
VoLo introduces the idea of a physical orchestrator for open-vocabulary, long-horizon manipulation. Our goal is to move toward robots that can reason, plan, act, monitor, and recover by adaptively using VLA/WAMs, vision models, and action primitives as tools.
We introduce three main contributions:
🤖 VoLoAgent: a physical orchestrator that plans, monitors, and recovers by adaptively using, halting, and redirecting robot actions with tools.
📊 RoboVoLo: a high-fidelity benchmark with 126 open-vocabulary long-horizon manipulation tasks spanning common sense, memory/state tracking, complex references, and world knowledge.
📈 A large-scale empirical study comparing action models, code-as-policy systems, TAMP-style systems, and ablations of the VoLoAgent orchestrator, complemented by real-robot experiments.
This work was done during my internship at @NVIDIA and would not have been possible without my brilliant collaborators: Hugo Hadfield, Alexander Zook, @mikacuy, @luke_ch_song, @erwincoumans, @xuningy, Faisal Ladhak, @qu_1006, @BirchfieldStan, Jonathan Tremblay, and @robovalts. Huge thanks to everyone!
🔗 Project: chicychen.github.io/VoLo/
🔗 Previous work, SpaceTools: spacetools.github.io/
#Robotics #EmbodiedAI #VisionLanguageModels #VLAModels #RobotLearning #NVIDIA #CVPR2026 #LongHorizonManipulation #AI #ComputerVision
23/25 GLINT: Sparsely Gated Vision-Language Alignment for Fine-Grained Radiology Representations
GLINT (Gated Language-Image alignmeNT) is a novel framework for radiology Vision-Language Models that tackles sparse correspondence between localized findings and global supervision. It uses Sparsely Gated Alignment, with a sigmoid gate that activates relevant patches, together with Dense Feature Regularization, enabling zero-shot classification, grounding, and the first zero-shot segmentation on 3D CT volumes without mask supervision. GLINT outperforms existing SSL encoders and medical VLMs in downstream classification, report generation, and segmentation tasks.
#GLINT #RadiologyVLM #ZeroShotLearning #MedicalAI #3DCTSegmentation #VisionLanguageModels
Paper Link: arxiv.org/abs/2606.03180
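To make the idea of a sigmoid gate over patches concrete, here is a minimal sketch. It is not GLINT's implementation: the function names, the `scale` and `bias` values, and the toy vectors are all assumptions chosen for illustration. Each patch gets a gate from the sigmoid of its similarity to the text embedding, and the pooled image feature is the gate-weighted mean of the patches.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_pool(patches, text, scale=5.0, bias=-2.0):
    """Gate each patch by sigmoid(scale * <patch, text> + bias).

    `scale` and `bias` are hypothetical knobs, not values from the paper.
    Returns the per-patch gates and the gate-weighted mean patch feature.
    """
    gates = [sigmoid(scale * dot(p, text) + bias) for p in patches]
    total = sum(gates)  # strictly positive because sigmoid outputs are > 0
    dim = len(text)
    pooled = [
        sum(g * p[i] for g, p in zip(gates, patches)) / total
        for i in range(dim)
    ]
    return gates, pooled

# Toy example: two patches align with the text direction, one does not.
patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
text = [1.0, 0.0]
gates, pooled = gated_pool(patches, text)
```

Unlike a softmax over patches, independent sigmoid gates do not force the weights to compete, so several patches (or none) can be active for a given finding.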
What information is actually hidden inside a multimodal embedding?
In this new work, we find that frozen vision-language models already encode rich attribute-specific signals for objects, backgrounds, and styles, even though their standard embeddings appear highly entangled.
We introduce QARE (Queryable Attribute Representation Extraction), a simple text-guided framework that extracts attribute-specific representations from frozen VLMs without fine-tuning.
Along the way, we build QARE-Bench, a challenging benchmark with both controlled synthetic data and a new real-world dataset featuring diverse scenes, non-rigid objects, and hard negatives designed to stress-test attribute disentanglement.
Key finding:
👉 The problem may not be that VLMs lack disentangled representations.
👉 The problem may be that we haven't learned how to query them.
📄 Paper: openaccess.thecvf.com/conten…
💻 Code: github.com/yibingwei-1/QARE
#ComputerVision #MultimodalAI #VisionLanguageModels #RepresentationLearning #ImageRetrieval
Excited to be at #CVPR 2026 in Denver this week for events around trustworthy AI, embodied reasoning, watermarking, and world models. I will only be around on Thursday, June 4, so please come say hi tomorrow!
On June 4, I will be speaking at several CVPR workshops and tutorials:
1⃣ CVPR 2026 Workshop on Trustworthy AI / TRUE-V
🔗 trustworthy-ai-workshop.gith…
📍 9:10–9:40 AM | Room 705/707
Talk Title: A Few Early Steps Away: Building Self-Correcting Vision-Language Systems
I will discuss how we can move beyond static vision-language models toward systems that can recognize, reason about, and correct their own failures.
2⃣ The first CVPR Workshop on Embodied Reasoning in Action (ERA)
🔗 embodied-reasoning.github.io…
📍 11:45 AM–12:20 PM | Room 605
Talk Title: From Perception to Action: From Latent World Models to State-Aware Scene Graphs for Physical Intelligence
This talk will focus on representations and learning systems that connect perception, reasoning, and action for physical intelligence.
Later in the day, I will also be part of the CVPR tutorial:
3⃣ Foundations and Frontiers of Watermarking
🔗 vishal3477.github.io/waterma…
📍 3:30–4:10 PM | Room: Mile High 2B
Session Title: Benchmarking & Robustness Evaluation
I will cover how to evaluate watermarking systems under distortions, regeneration, and adaptive attacks, an increasingly important direction for trustworthy generative AI.
Our team will also present TraceGen at the CVPR main conference. I will not be there on June 6, but my students will be presenting the work, so please stop by and talk with them!
4⃣ TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos
Project 🔗: tracegen.github.io/
YouTube 📽️: youtu.be/JCXnK2tHE_I
Poster 📍: Saturday, June 6, 2026 | 11:45 AM–1:45 PM MDT | ExHall F 605
TraceGen introduces a world-modeling framework that predicts future motion in a compact 3D trace space, rather than directly in pixel space. This abstraction preserves the geometry needed for manipulation while reducing dependence on embodiment-specific appearance, enabling learning from heterogeneous human and robot videos and improving transfer to real-world robotic tasks.
Fresh out of the oven: we have also pushed this direction to the next level. Stay tuned for our upcoming release of μ₀, a symbolic world model pretrained only from video data that reaches π₀.₅-level performance. 🔥
Looking forward to seeing friends, collaborators, and new colleagues tomorrow at CVPR!
#CVPR2026 #TrustworthyAI #EmbodiedAI #VisionLanguageModels #Robotics #WorldModels #Watermarking #GenerativeAI #PhysicalIntelligence
Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition
1 MolSeek-OCR shows that a document OCR foundation model can be transferred to molecular structure recognition, if the fine-tuning is done progressively: direct full-parameter supervised fine-tuning was unstable and failed, but a staged recipe produced a competitive image-to-SMILES system.
2 The paper reframes OCSR as image-conditioned SMILES generation with a fixed instruction prompt, training the model to autoregressively output only the SMILES tokens (no loss on prompt/image placeholder tokens), aligning the objective with strict “exact match” evaluation.
3 Core technical contribution: a two-stage progressive supervised fine-tuning strategy that starts with parameter-efficient LoRA to adapt both (a) the text generation pathway and (b) the visual-language projection/alignment layers, then transitions to selective full-parameter tuning.
4 In stage 2, the model is not tuned uniformly: it freezes the lowest-level visual tokenizer (and token embedding layer), while continuing to optimize higher-level modules (LM-as-vision-encoder, compression/projection interface, and the autoregressive decoder). It also uses split learning rates (smaller for the visual branch, larger for the language branch) to stabilize cross-modal transfer.
5 Data strategy: training mixes large-scale synthetic renderings from PubChem with realistic patent images from USPTO-MOL to cover both style diversity (rendering engines, bond/annotation variations) and real-world artifacts (scan noise, line thickness, patent conventions). The LoRA stage uses a smaller mixed budget; the full stage scales to ~800k total samples.
6 Evaluation spans synthetic (Indigo, ChemDraw), realistic (USPTO, CLEF, Staker, UOB, ACS), and perturbed versions of several realistic sets, reflecting the practical requirement that OCSR models handle both clean depictions and degraded/heterogeneous document images.
7 Results: zero-shot DeepSeek-OCR-2 essentially fails on exact SMILES matching, while MolSeek-OCR improves substantially and is broadly comparable to DECIMER among image-to-sequence baselines across multiple datasets; however, it still trails state-of-the-art image-to-graph methods such as MolScribe, highlighting the ongoing advantage of explicit atom/bond layout modeling.
8 Negative (but informative) finding: reinforcement-style post-training (GSPO) and data-curation-based refinement (ReFT) did not improve exact-match SMILES accuracy. The optimization sometimes improved graph-level equivalence while degrading strict sequence fidelity, suggesting that common reward designs struggle to preserve the exact serialized SMILES form required by this benchmark.
9 Practical takeaway: for VLM-based molecular OCR, stability hinges on (a) progressive adaptation (LoRA then selective full tuning), (b) freezing low-level vision components, and (c) carefully balancing learning rates across visual vs language branches; and even then, exact SMILES matching remains a harder target than graph-equivalent correctness.
💻 Code: github.com/HaCTang/MolSeek-O…
📜 Paper: arxiv.org/abs/2604.03476
#OCSR #Chemoinformatics #MolecularOCR #VisionLanguageModels #DeepLearning #SMILES #Patents #PubChem #FineTuning #LoRA
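Points 2 and 8 above can be illustrated with a small sketch. This is not the paper's code: the helper names and token ids are made up, and the only borrowed convention is `-100` as the label value that PyTorch's cross-entropy loss ignores by default. The first helper masks prompt and image-placeholder positions so only SMILES tokens contribute to the loss; the second shows why strict exact match is harsher than graph equivalence.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def build_labels(prompt_ids, smiles_ids):
    """Concatenate prompt and target ids; supervise only the SMILES part.

    `prompt_ids` stands for the instruction plus image placeholder tokens
    (hypothetical ids here). Their labels are set to IGNORE_INDEX so they
    add nothing to the loss.
    """
    input_ids = list(prompt_ids) + list(smiles_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(smiles_ids)
    return input_ids, labels

def exact_match(pred, ref):
    """Strict string comparison, ignoring only surrounding whitespace."""
    return pred.strip() == ref.strip()

input_ids, labels = build_labels([1, 2, 3], [7, 8])

# "CCO" and "OCC" are both valid SMILES for ethanol (same molecular graph),
# yet they differ as strings, so strict exact match rejects one of them.
same_string = exact_match("CCO", "CCO")
same_graph_other_string = exact_match("OCC", "CCO")
```

A reward that only checks graph equivalence would score both strings as correct, which is one plausible reading of why post-training could raise graph-level accuracy while exact-match accuracy fell.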
🚨 Medical AI Research Alert! 🚨
How can AI synthesize raw data from ECGs, echocardiograms, and MRIs simultaneously to mimic a cardiologist's diagnostic reasoning?
@Stanford presents MARCUS: An agentic vision-language system for end-to-end multimodal cardiac diagnosis.
By @DrJackOSullivan, Mohammad Asadi, Lennart Elbe, @Dr_ASChaudhari, Tahoura Nedaee, Francois Haddad, @salernomdphd, @drfeifei, @eadeli, Rima Arnaout, @euanashley
Now you can watch and listen to the latest Medical AI papers daily on our YouTube and Spotify channels!
YouTube: youtube.com/@OpenlifesciAI
YouTube Deep Dive: youtu.be/e6CFpVZK6Q8
YouTube Shorts: youtube.com/shorts/C5CZWnYKa…
Spotify: open.spotify.com/show/4edRuS…
Here's why it's exciting: 👇🧵 1/9
#MedicalAI #Healthcare #Cardiology #VisionLanguageModels
Wondering how to combine the perceptual abilities of VLMs with structured program synthesis? --> We will be presenting our Vision-Language Programs at #CVPR2026 :) #NeuroSymbolic #VisionLanguageModels
Excited to share that our paper "Synthesizing Visual Concepts as Vision-Language Programs" has been accepted to #CVPR2026! 🎉 We propose a novel method that combines VLMs with symbolic program synthesis to learn reliable programs of visual concepts. 🌐 ml-research.github.io/vision…
If document automation, multimodal AI, or clinical decision support are on your roadmap, this session will provide measurable performance insights. Register now: hubs.li/Q0452MZM0 #HealthcareAI #MedicalImaging #VisionLanguageModels #ClinicalAI #GenerativeAI #HealthIT
⏳ 1 week left to submit to the Med-Reasoner Workshop @CVPR!
📋 Submit your work on medical reasoning, VLMs & clinical AI
🔗 Submission: lnkd.in/d4We2qTt
🌐 Website: lnkd.in/dQM-ayY5
Deadline: March 1, 2026
#MedicalAI #VisionLanguageModels #HealthcareAI #MedReasoner
📢 Call for Papers - @CVPR 2026 Workshop (Med-Reasoner)
Submission deadline: March 1, 2026 (AoE)
Workshop on Medical Reasoning with Vision Language Foundation Models
🔗 Submission: lnkd.in/d4We2qTt
🔗 Website: lnkd.in/dQM-ayY5
We tested 9 commercial AI models on brain MRIs. Not just for accuracy, but to see if they could be tricked by fake reports hidden inside the images.
Spoiler: they could. All of them. ⚠️
Visible fake reports dropped specificity to zero across every model. The stealth version, text invisible to the human eye, still fooled more than half.
OCR capability = attack surface. If a model can read text in an image, it can be manipulated by it.
27K inference calls. 600 MRIs. 9 models. 5 conditions.
@Hacettepe1967 @MIT @harvardmed @cwru 🤝
Paper in the replies 👇
#AIinHealthcare #RadiologyAI #AdversarialAI #VisionLanguageModels #PatientSafety #MedicalAI #PromptInjection #PedsICU
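For readers less familiar with the metric: specificity is TN / (TN + FP), the share of truly normal scans that the model correctly calls normal. The sketch below uses made-up toy labels, not the study's data, to show what "specificity dropped to zero" means: every normal scan gets flagged as abnormal once the injected report is present.

```python
def specificity(y_true, y_pred):
    """Specificity = TN / (TN + FP), with 0 = normal and 1 = abnormal."""
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tn + fp == 0:
        raise ValueError("specificity is undefined without negative cases")
    return tn / (tn + fp)

# Hypothetical toy labels: three normal scans and one abnormal scan.
y_true = [0, 0, 0, 1]
clean_preds = [0, 0, 1, 1]     # one false alarm on clean images
attacked_preds = [1, 1, 1, 1]  # fake report pushes every scan to "abnormal"

clean_spec = specificity(y_true, clean_preds)
attacked_spec = specificity(y_true, attacked_preds)
```

Note that sensitivity can look perfect in the attacked condition, which is why specificity is the number to watch in this kind of injection test.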
LLMs can reason. Vision models can see. But most real problems don’t come in one modality. That gap is exactly why Vision-Language Models (VLMs) matter.
This carousel breaks down how VLMs actually work under the hood and why they’ve become foundational for modern AI systems.
What’s really changing with VLMs:
- Beyond text-only reasoning: LLMs operate over symbols. VLMs ground those symbols in pixels, spatial structure, and visual evidence.
- Not just “LLMs with images”: The core shift is alignment fusion: vision and language aren’t parallel streams, they interact.
- Architecture matters: Vision encoders extract structured visual tokens. Language encoders express intent and queries. Multimodal fusion layers are where reasoning actually happens.
Why this matters if you work with LLMs today:
- Multimodal inputs are becoming the default, not the edge case
- Agents increasingly need to see, not just read
- Grounding reduces hallucinations and unlocks real-world decision making
If you’re thinking about agents, tool use, or real-world AI systems, understanding VLMs isn’t optional anymore.
In our Agentic AI Bootcamp, we spend time breaking down how modern AI systems are designed, evaluated, and connected in practice. If you’d like to explore this further, the link and more details are in the replies.
#VisionLanguageModels #AgenticAI #LLMs #MultimodalAI
Busy season, huh? #ICLR decisions are out and #CVPR rebuttals are flying... but don’t miss this! 😅
📣 Call for Contributions: We're organizing a new edition of the Multimodal Algorithmic Reasoning Workshop @ #CVPR2026 (Denver)!
✅ The submission portal is now open, and we welcome both new and previously published work.
📌 Submission guidelines (details on the workshop website):
We accept three types of submissions:
  • Original papers (≤ 8 pages, in proceedings)
  • Short papers (≤ 4 pages, workshop website only)
  • Previously published papers (≤ 8 pages, workshop website only)
🗓️ Key dates:
Submission deadline: Feb 27, 2026
Notification: Mar 20, 2026
Camera-ready: Apr 10, 2026
🌐 Website: marworkshop.github.io/cvpr26…
🔍 Workshop focus: This workshop focuses on multimodal algorithmic reasoning, where an agent must assimilate information from multiple modalities for complex problem solving. Real-world examples of such problems include: (i) chain-of-thought reasoning across modalities, (ii) vision-and-language problem solving, (iii) agentic reasoning and tool use, and (iv) reasoning under physical constraints, among others.
The topics for MAR-CVPR 2026 include, but are not limited to:
🔹 Multimodal structured and multi-step reasoning across vision, language, audio, and other modalities, including compositional and programmatic inference.
🔹 Multimodal foundation models and world models for reasoning, planning, and decision-making, and their connections to general intelligence.
🔹 Reasoning under physical, geometric, and causal constraints, including embodied agents, simulators, and digital twins.
🔹 Multi-agent reasoning and collaboration, including debate, coordination, mixture-of-experts, and reward- or critique-based aggregation.
🔹 Extreme generalization and concept learning, including few-shot, zero-shot, and out-of-distribution multimodal reasoning.
🔹 Scaling laws, efficiency, and test-time reasoning, including inference-time optimization, self-refinement, and tool-augmented reasoning.
🔹 Benchmarks, datasets, diagnostics, and evaluation, including synthetic data, interpretability, and systematic analysis of shortcomings and failure modes in multimodal AI models.
🔹 Theoretical and cognitive perspectives on multimodal reasoning, including limits of current models and insights from human cognition.
🔹 Human–AI reasoning comparisons and foundations, including perspectives from psychology, neuroscience, and child development; theoretical limits of reasoning in large models; and position papers on how current multimodal AI reasoning differs from human cognition.
#MultimodalReasoning #Reasoning #AlgorithmicReasoning #Multimodal #AI #VisionLanguage #VisionLanguageModels #VLM #Agents #ToolUse #LLM #FoundationModels #Research #MachineLearning #DeepLearning #CallForPapers
System-level Security for Computer Use Agents - arxiv.org/pdf/2601.09923
🧩 Problem
Computer Use Agents (CUAs) automate desktop and browser tasks by reading screenshots or DOM state and then clicking, typing, and navigating. Malicious UI content can inject instructions that redirect actions to steal credentials or trigger financial loss. Most CUA benchmarks score task completion and miss whether the agent executes only user-intended actions under hostile UI content. The paper tests system-level control-flow integrity for CUAs and examines which failures remain.
🔍 How
The authors apply architectural isolation to CUAs by splitting planning from perception, then use Single Shot Planning, where a trusted planner generates a complete branching execution graph before any potentially malicious UI observation. They evaluate on OSWorld with pass@1 and pass@k task completion, and they analyze branch-steering attacks plus redundancy-based verification with DOM consistency and multimodal consensus.
📈 Findings
Single Shot Planning retains up to 57% of frontier model utility on OSWorld while improving smaller open-source models by up to 19%. On all OSWorld tasks, UITars rises from 24.4% to 29.0% success. Branch steering remains possible: cookie-popup and pixel-based attacks can steer valid plan paths, and the strongest redundancy setup still fails on the pixel attack.
🎯 Lessons learned
Define failure as executing any action not reachable in a pre-approved execution graph, and gate each click or keystroke on a verify step. Log screenshots, DOM, extracted coordinates, and the chosen branch so reviewers can reconstruct intent and data flow. Stress-test predictable routines like cookie consent and element finding, since attackers can steer branches without changing the plan. Track utility loss and operational cost from extra checking, including false positives and token volume.
Authors: @hfoerster01, Robert Mullins, Tom Blanchard, @NicolasPapernot, @NKristina01_, @florian_tramer, @iliaishacked, Cheng Zhang, Yiren Zhao - @Cambridge_Uni, @UofT, @VectorInst, @ETH_en, @aisequrity #AISecurity #LLMAgents #ComputerUseAgents #PromptInjection #AgentSecurity #InfoFlowControl #ModelIsolation #OSWorld #VisionLanguageModels #SecureByDesign #RedTeaming #AdversarialML
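The "gate each click or keystroke" lesson can be sketched as follows. This is an illustrative toy, not the paper's system: the `ExecutionGraph` class, the state names, and the action strings are all hypothetical. The plan is fixed before any UI is observed, every proposed action is checked against the edges leaving the current state, and each decision is logged.

```python
class ExecutionGraph:
    """Pre-approved plan: state -> {allowed action -> next state}."""

    def __init__(self, edges, start):
        self.edges = edges
        self.state = start
        self.log = []  # (state, action, verdict) tuples for later review

    def step(self, action):
        """Execute `action` only if the plan allows it from the current state."""
        allowed = self.edges.get(self.state, {})
        if action not in allowed:
            self.log.append((self.state, action, "blocked"))
            return False
        self.log.append((self.state, action, "ok"))
        self.state = allowed[action]
        return True

# Hypothetical plan produced by a trusted planner before seeing any UI.
plan = ExecutionGraph(
    edges={
        "start": {"click:accept_cookies": "home", "click:reject_cookies": "home"},
        "home": {"type:search_box": "results"},
    },
    start="start",
)

injected = plan.step("click:transfer_funds")  # not in the plan: blocked
steered = plan.step("click:accept_cookies")   # valid branch, possibly steered
searched = plan.step("type:search_box")
```

The second call shows the residual risk the post describes: both cookie branches are legitimate plan paths, so an attacker who influences which one is chosen never trips the gate.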
#reComputer Super J4012 runs #LiveVLM WebUI on-device, turning camera input into real-time #VisionLanguageModels processing. Perfect for #EdgeAI robots that see, analyze, and act locally and instantly.
📕 Step-by-step tutorial at: wiki.seeedstudio.com/deploy_…
🔗 Discover more about reComputer Super J4012: seeedstudio.com/reComputer-S…
Vision-Language and Multimodal Models for Chemical Analysis: A Comprehensive Survey
1. This survey explores the cutting-edge advancements of Vision-Language Models (VLMs) and multimodal AI in chemical analysis, highlighting their potential to revolutionize the field by integrating diverse data types like molecular structures, spectroscopic signals, and experimental descriptions.
2. The article provides an in-depth review of how VLMs and multimodal models can enhance accuracy, efficiency, and interpretability in tasks such as materials discovery, reaction prediction, and drug discovery.
3. It discusses the adaptation of VLMs for chemical contexts, including specialized terminology, encoding of chemical visuals, handling complex molecular structures, and leveraging unlabeled data through few-shot learning strategies.
4. The survey also examines broader multimodal models that integrate more than just vision and language, such as incorporating structured chemical data, spectral data, and temporal information for a holistic understanding of chemical systems.
5. Challenges like data scarcity, interpretability, robustness, and ethical considerations are critically assessed, with promising future directions outlined, including the development of chemistry-specific foundation models and enhanced multimodal data fusion techniques.
📜 Paper: doi.org/10.26434/chemrxiv-20…
#MultimodalAI #ChemicalAnalysis #VisionLanguageModels #AIinChemistry #DrugDiscovery #MaterialsScience

From Macro to Micro: Benchmarking Microscopic Spatial Intelligence on Molecules via Vision-Language Models
1. A new benchmark framework called MiSI-Bench is introduced to evaluate the ability of Vision-Language Models (VLMs) to understand and reason about the spatial relationships of microscopic entities like molecules. This is crucial for scientific discovery in fields such as structural biology and drug design.
2. MiSI-Bench consists of over 163,000 question-answer pairs and 587,000 images derived from around 4,000 molecular structures. It includes nine tasks ranging from basic spatial transformations to complex relational identifications, providing a comprehensive assessment of microscopic spatial intelligence.
3. The study reveals that current state-of-the-art VLMs perform significantly below human level on this benchmark. However, a fine-tuned 7B model shows substantial potential, even surpassing humans in some spatial transformation tasks, indicating the untapped potential of VLMs for microscopic spatial reasoning.
4. The research highlights the necessity of integrating explicit domain knowledge into VLMs to improve their performance in scientifically grounded tasks such as hydrogen bond recognition. This suggests that combining domain expertise with VLMs is essential for progress toward scientific AGI.
5. The datasets are available at huggingface.co/datasets/zong…, providing a valuable resource for researchers to further explore and enhance the microscopic spatial intelligence of VLMs.
📜 Paper: arxiv.org/abs/2512.10867v1
#MicroscopicSpatialIntelligence #VisionLanguageModels #Benchmarking #MolecularStructures #ScientificDiscovery
One of the most interesting directions in multimodal AI right now is rethinking how LLMs gain new modalities, and this paper introduces a surprisingly effective alternative to the “train a giant VLM” approach.
Instead of merging vision and language into one huge model, the authors propose something much more modular:
**Use a small VLM as the perceiver. Use a powerful text-only LLM as the reasoner. Let them collaborate through conversation.**
This framework, BeMyEyes, treats multimodality as a team sport rather than a single-model capability. And the results are genuinely impressive.
Key insights from the paper:
🔹 LLMs don’t need to “see” directly to reason about images. A text-only model like DeepSeek-R1 can outperform large multimodal models if it’s paired with a well-instructed small perceiver agent.
🔹 Perception and reasoning are very different skills. Smaller VLMs are great at describing what’s in an image, but not at deep reasoning. LLMs are great at reasoning, but have no perception. Letting each agent stay in its lane produces better outcomes than forcing a single model to do both.
🔹 Multi-turn conversation matters. The reasoner asks for clarifications, challenges missing details, and guides the perceiver to provide richer descriptions. This iterative loop consistently improves accuracy compared to one-shot captions.
🔹 The system is modular and extensible. Swap the perceiver for a new VLM. Swap the reasoner for a better LLM. No retraining required on the large model.
🔹 A clever synthetic data pipeline bridges the gap. Since perceivers aren’t naturally trained to collaborate with reasoners, the authors generate structured conversations (using a stronger model as a teacher) to fine-tune the perceiver on “how to talk to an LLM.”
This challenges a big assumption in multimodal AI: maybe the path to stronger multimodal systems isn’t bigger encoders, but better communication between specialized agents.
If you’re thinking about agentic systems, modular architectures, or multimodal extension without huge training costs, this paper is a strong data point for that direction.
#AIResearch #MultimodalAI #LLMAgents #VisionLanguageModels #DeepSeek #MachineLearning #ArtificialIntelligence #ModularAI #AIAgents #ResearchInsights
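The perceiver/reasoner conversation described above can be sketched as a simple loop. This is a toy under stated assumptions, not the BeMyEyes code: `converse`, `toy_perceiver`, and `toy_reasoner` are hypothetical stand-ins (plain functions with canned replies) for a small VLM and a text-only LLM. The point is only the control flow: the reasoner never sees the image, and it either answers or sends a follow-up question back to the perceiver.

```python
def converse(perceive, reason, question, max_turns=3):
    """Alternate perceiver descriptions and reasoner follow-ups.

    `perceive(query)` returns a text description of the image.
    `reason(question, transcript)` returns ("answer", text) or ("ask", text).
    """
    transcript = []
    query = question
    for _ in range(max_turns):
        description = perceive(query)
        transcript.append((query, description))
        kind, text = reason(question, transcript)
        if kind == "answer":
            return text, transcript
        query = text  # the reasoner's follow-up becomes the next query
    return None, transcript  # no answer within the turn budget

def toy_perceiver(query):
    """Stand-in for a small VLM looking at one fixed image."""
    if "how many" in query.lower():
        return "There are five cats."
    return "A photo of several cats on a sofa."

def toy_reasoner(question, transcript):
    """Stand-in for a text-only LLM that asks until it has a count."""
    last_description = transcript[-1][1]
    if "five" in last_description:
        return ("answer", "5")
    return ("ask", "How many cats exactly?")

answer, transcript = converse(
    toy_perceiver, toy_reasoner, "Count the cats in the image."
)
```

Because the two roles are passed in as plain callables, swapping either one for a different model changes nothing in the loop, which mirrors the modularity claim in the post.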
What if your vision language model isn’t actually seeing… but mostly guessing from text? 👀
@AICoffeeBreak explains it perfectly: when VLMs rely too heavily on text, they start hallucinating answers based on the most common phrasing in their training data instead of what’s in the image.
Ask “How many cats are there?” and even if the image shows five, the model might say two simply because “two” appears more often in similar prompts.
This is the hidden trap behind text-driven hallucinations. And unless we explicitly measure grounding, these models will keep sounding confident while being wrong.
I love this kind of research because it shows exactly where our tools break and where we need to push next, especially in high-stakes domains like medicine or robotics.
#visionlanguagemodels #VLM #AIresearch #multimodalAI