Hugging Face Daily Papers, 2026-06-12: 44 papers worth scanning today. Full list with arXiv links:
1. EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments
Highlights large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments.
arxiv.org/abs/2606.13681
2. SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
Highlights spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision.
arxiv.org/abs/2606.13673
3. MiniMax Sparse Attention
Highlights ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memor.
arxiv.org/abs/2606.13392
4. InterleaveThinker: Reinforcing Agentic Interleaved Generation
Highlights recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. Ho.
arxiv.org/abs/2606.13679
5. Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?
Highlights multimodal large language models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly.
arxiv.org/abs/2606.08063
6. FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents
Highlights training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through sear.
arxiv.org/abs/2606.12087
7. MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
Highlights we present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first tra.
arxiv.org/abs/2606.13473
8. WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces
Highlights computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, an.
arxiv.org/abs/2606.09426
9. LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories
Highlights scientific laboratories increasingly rely on AI systems to reason about experiments, but the physical act of doing science remains largely outside.
arxiv.org/abs/2606.13578
10. HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
Highlights holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation spac.
arxiv.org/abs/2606.13289
11. N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization
Highlights the success of Large Language Models in mathematical reasoning relies heavily on the generation of diverse and valid solution paths during the roll.
arxiv.org/abs/2606.10768
12. EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery
Highlights LLM-based agents have shown increasing potential in automating scientific discovery. Given an optimizable metric and an execution environment, they.
arxiv.org/abs/2606.13662
13. Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning
Highlights latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulatio.
arxiv.org/abs/2606.13106
14. VideoMDM: Towards 3D Human Motion Generation From 2D Supervision
Highlights we introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular vide.
arxiv.org/abs/2606.13364
15. Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback
Highlights despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failure.
arxiv.org/abs/2606.06113
16. VIA-SD: Verification via Intra-Model Routing for Speculative Decoding
Highlights speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to vali.
arxiv.org/abs/2606.12243
17. MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold
Highlights we present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This sett.
arxiv.org/abs/2606.13376
18. From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion
Highlights multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details.
arxiv.org/abs/2606.12303
19. TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search
Highlights deep search requires agents to answer complex questions through multi-step web search, browsing, evidence comparison, and synthesis. A central chal.
arxiv.org/abs/2606.11662
20. High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation
Highlights few-step diffusion distillation has become increasingly mature for 4-8-step generation, yet pushing further to 2 steps remains challenging. In this.
arxiv.org/abs/2606.12575
21. Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models
Highlights adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly.
arxiv.org/abs/2606.11409
22. HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness
Highlights LLMs are increasingly deployed as agents for long-horizon tasks, yet their performance is shaped not only by model capability and.
arxiv.org/abs/2606.12882
23. SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling
Highlights on-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperfor.
arxiv.org/abs/2606.09304
24. Visual Para-Thinker: A Single-Policy Multi-Agent Framework for Visual Reasoning
Highlights visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early.
arxiv.org/abs/2606.09290
25. EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge
Highlights search agents -- large language models augmented with search tools -- have intensified the need for future-proof evaluation benchmarks. Existing be.
arxiv.org/abs/2606.13120
26. MaskAlign: Token-Subset Representation Alignment for Efficient Diffusion Training
Highlights representation alignment with pretrained vision models has recently shown strong potential for accelerating diffusion transformer training. By alig.
arxiv.org/abs/2606.08788
27. ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages
Highlights multimodal large language models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in s.
arxiv.org/abs/2606.13572
28. Surflo: Consistent 3D Surface Flow Model with Global State
Highlights geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstru.
arxiv.org/abs/2606.13644
29. Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
Highlights anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably pr.
arxiv.org/abs/2606.12730
30. Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents
Highlights compact language models (LMs) reduce cost, latency, and deployment risk for tool agents. Yet MCP-style tool use requires more than isolated functio.
arxiv.org/abs/2606.12674
31. MuJoCo-Drones-Gym: A GPU-Accelerated Multi-Drone Simulator for Control and Reinforcement Learning
Highlights robotic simulators are a cornerstone of modern research in aerial robotics, serving both as a vehicle for the development of new control algorithms.
arxiv.org/abs/2606.08039
32. See What I See, Know What I Think: Dense Latent Communication Across Heterogeneous Agents
Highlights multi-agent systems communicate mostly through text, paying a lossy and expensive decode and re-encode cost. KV-cache communication is a promising.
arxiv.org/abs/2606.13594
33. Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents
Highlights interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in o.
arxiv.org/abs/2606.13174
34. WEAVER, Better, Faster, Longer: An Effective World Model for Robotic Manipulation
Highlights the potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and te.
arxiv.org/abs/2606.13672
35. PianoKontext: Expressive Performance Rendering from Deadpan Context
Highlights expressive performance rendering (EPR) aims to generate realistic performances constrained on sequences of notes. However, flow matching audio edit.
arxiv.org/abs/2606.12282
36. IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder
Highlights built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for construct.
arxiv.org/abs/2606.11096
37. The Cold-Start Safety Gap in LLM Agents
Highlights are tool-calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a ses.
arxiv.org/abs/2606.07867
38. ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs
Highlights LLMs deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approache.
arxiv.org/abs/2606.12451
39. A Stationary (and Therefore Compatible) Representation is All You Need
Highlights learning compatible representations aims to learn feature representations that can be used interchangeably over time whenever a model undergoes upd.
arxiv.org/abs/2606.12488
40. WebChallenger: A Reliable and Efficient Generalist Web Agent
Highlights autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose infer.
arxiv.org/abs/2606.10423
41. Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering
Highlights we present Flash-GMM, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single.
arxiv.org/abs/2606.10896
42. Leveraging Morphology for Historical Script Metrological Analysis
Highlights advances in handwritten text recognition have enabled large-scale transcription of historical documents, but still provide limited access to interp.
arxiv.org/abs/2606.09446
43. Revisiting Articulated Parts Perception in Robot Manipulation
Highlights we are surrounded by various objects with movable, articulated parts, e.g., box, handle, door. An accurate and generalizable perception of articula.
arxiv.org/abs/2606.08103
44. On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance
Highlights large language models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-int.
arxiv.org/abs/2606.00467
Trend summary (themes overlap, so a paper can count toward more than one):
- Agents / search / tool use: 19 papers
- Reasoning / proof / RL: 19 papers
- Vision / multimodal generation: 15 papers
- Robotics / world models: 9 papers
- Memory / retrieval / efficient kernels: 7 papers
- Safety / evaluation / robustness: 6 papers
Overall: agentic systems dominate this batch, with strong secondary clusters in multimodal/3D generation, reasoning/RL, robotics/world models, and evaluation/robustness.
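The six theme counts add up to more than the 44 papers in the list, which suggests each paper was tagged with more than one theme. A minimal sketch of that arithmetic follows, assuming the counts are multi-label tags; the names `theme_counts`, `TOTAL_PAPERS`, and `theme_stats` are illustrative and not part of the digest.

```python
# Theme counts copied from the trend summary; the multi-label reading is an assumption.
theme_counts = {
    "Agents / search / tool use": 19,
    "Reasoning / proof / RL": 19,
    "Vision / multimodal generation": 15,
    "Robotics / world models": 9,
    "Memory / retrieval / efficient kernels": 7,
    "Safety / evaluation / robustness": 6,
}
TOTAL_PAPERS = 44  # from the digest header


def theme_stats(counts, total_papers):
    """Return total tags, average tags per paper, and each theme's share of papers."""
    total_tags = sum(counts.values())
    shares = {theme: n / total_papers for theme, n in counts.items()}
    return total_tags, total_tags / total_papers, shares


total_tags, tags_per_paper, shares = theme_stats(theme_counts, TOTAL_PAPERS)
print(f"{total_tags} tags over {TOTAL_PAPERS} papers ({tags_per_paper:.2f} per paper)")
for theme, share in shares.items():
    print(f"{theme}: {share:.0%} of papers")
```

Under this reading, the shares are fractions of the 44 papers rather than of the 75 tags, so they are not expected to sum to 100%.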