DailyPapers

@HuggingPapers

TRL-Bench: finally comparing tabular encoders fairly. A unified benchmark that turns 20 heterogeneous encoders into directly comparable embedding models through one shared interface. 16 tasks. 87 datasets. No universal winner.

Paper: huggingface.co/papers/2606.0…
Datasets: huggingface.co/collections/l…
Code: github.com/LOGO-CUHKSZ/TRL-B…

Paper page - TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular...

LabVLA: Grounding VLA models in scientific laboratories. RoboGenesis builds 10K lab scenes across 16 robot types. LabVLA pairs a Qwen3-VL backbone with a DiT flow-matching expert, reaching 71.1% success on LabUtopia and transferring to real Franka robots.

Paper: huggingface.co/papers/2606.1…
Model: huggingface.co/zjunlp/LabVLA
Project page: zjunlp.github.io/LabVLA/

Paper page - LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

11h

FlashMemory-DeepSeek-V4: Lookahead Sparse Attention cuts the KV cache by over 90% at 500K context, compressing it to just 13.5% of full size while maintaining or improving accuracy on RULER, LongBench-v2, and LongMemEval.

11h

This lightweight retriever is on Hugging Face, sparsifying DeepSeek-V4's CSA KV-cache on the fly to keep only ~10–15% resident on GPU.
Paper: huggingface.co/papers/2606.0…
Model: huggingface.co/libertywing/F…

Paper page - FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse...

15h

FORT-Searcher: A new framework for training deep search agents that resist shortcuts. By controlling four key risks during data synthesis, it forces models to search longer before answering. SFT-only training yields top performance among comparable open-source agents.

15h

Paper: paperswithcode.co/paper/2606…
Code: github.com/RUCAIBox/FORT-Sea…
Datasets and checkpoints coming soon to Hugging Face.

FORT-Searcher: Synthesizing Shortcut-Resistant Search Tasks for Training Deep Search Agents

Training deep search agents requires verifiable questions whose answers remain unavailable until sufficient evidence has been acquired through search. Existing synthesis methods often increase...

19h

WeaveBench: Microsoft Research Asia introduces 114 long-horizon tasks that force agents to interleave GUI and CLI in one trajectory. The same frontier models that score over 78% on OSWorld-Verified collapse to 41.2% on WeaveBench.

19h

Outcome-only grading overestimates agent performance by 10-20 percentage points. WeaveBench uses a trajectory-aware judge that audits every step.
Project: weavebench.github.io
Paper: paperswithcode.co/paper/2606…
Dataset: huggingface.co/datasets/wanl…
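
To make the gap between the two grading modes concrete, here is a toy sketch. It is not WeaveBench's judge, whose details the post does not give. The `final_ok` and `steps_ok` fields and the "every step must pass" rule are hypothetical stand-ins chosen only for illustration, and the sample runs are made up.

```python
def pass_rates(trajectories):
    """Return (outcome_only, trajectory_aware) pass rates for a list of runs.

    Each run is a dict with hypothetical fields:
      final_ok: bool        - did the end state match the goal?
      steps_ok: list[bool]  - did each individual step survive an audit?
    """
    n = len(trajectories)
    # Outcome-only grading looks at the final state and nothing else.
    outcome = sum(1 for t in trajectories if t["final_ok"])
    # Trajectory-aware grading (assumed rule) also requires every step to pass.
    audited = sum(1 for t in trajectories if t["final_ok"] and all(t["steps_ok"]))
    return outcome / n, audited / n


# Made-up runs: two reach the goal, but one of them took an invalid step.
runs = [
    {"final_ok": True, "steps_ok": [True, True]},
    {"final_ok": True, "steps_ok": [True, False]},
    {"final_ok": False, "steps_ok": [True]},
    {"final_ok": False, "steps_ok": [False]},
]
outcome_rate, audited_rate = pass_rates(runs)
```

Under this rule the audited rate can never exceed the outcome-only rate, which is the direction of the overestimate the post describes.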

WeaveBench

A long-horizon hybrid GUI CLI benchmark for computer-use agents. Best frontier pairing: 41.2% PassRate.

23h

MiniMax MaxProof exceeds human gold-medal threshold on math olympiads. A population-level test-time scaling framework that searches over candidate proofs through tournament selection. Scores 35/42 on IMO 2025 and 36/42 on USAMO 2026.
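
For readers unfamiliar with tournament selection, a minimal sketch of the general idea follows. The post does not describe MaxProof's actual procedure, so this is a generic single-elimination bracket. The `judge` callback is a hypothetical stand-in for whatever verifier compares two candidate proofs, and the scores in the usage lines are made up.

```python
def tournament_select(candidates, judge):
    """Pick one candidate by single-elimination pairwise comparison.

    judge(a, b) is a hypothetical comparator that returns the preferred
    candidate of the pair (for proofs, this would be a verifier).
    """
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one candidate")
    while len(pool) > 1:
        next_round = []
        # Compare neighbours pairwise; winners advance.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(judge(pool[i], pool[i + 1]))
        # With an odd pool size, the last candidate advances unopposed.
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Toy usage: candidates carry a made-up score and the judge prefers the higher one.
proofs = [("p1", 3), ("p2", 7), ("p3", 5)]
best = tournament_select(proofs, lambda a, b: a if a[1] >= b[1] else b)
```

A bracket like this needs only pairwise judgments, which suits settings where a verifier can compare two proofs more reliably than it can assign absolute scores.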

23h

Paper: huggingface.co/papers/2606.1…

Paper page - MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level...

Jun 13

ResearchClawBench: A benchmark for end-to-end autonomous research. 40 real tasks across 10 domains test if AI agents can rediscover published science from raw data alone. Top agents average just 21.5 out of 100. The frontier for automated discovery is wide open.

Jun 13

Paper: paperswithcode.co/paper/2606…
Dataset: huggingface.co/datasets/Inte…
Community: huggingface.co/spaces/Intern…
Can your agent beat the 21.5 frontier?

Jun 13

Imaginative Perception Tokens: UW, OpenAI, Microsoft, and AI2 teach VLMs to imagine unseen visual perspectives. These tokens boost spatial reasoning over text chain-of-thought across perspective taking, path tracing, and multiview counting. No images are generated at inference time.

Jun 13

Explore the paper, data, and benchmarks:
paperswithcode.co/paper/2606…
huggingface.co/collections/w…
huggingface.co/collections/w…

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception:...

Jun 12

Robust-U1 equips multimodal LLMs with visual self-recovery. Corrupted images break understanding. This ICML work trains models to self-restore pixels. Recovery uses supervised training, RL with pixel and semantic rewards, and joint reasoning over both views.

Jun 12

Discuss: huggingface.co/papers/2606.0…
Demo: huggingface.co/spaces/Jiaqi-…
Models: huggingface.co/Jiaqi-hkust/R…

Paper page - Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Jun 12

MiniMax released MSA for million-token contexts. Blockwise sparse attention with an Index Branch that scores and selects Top-k KV blocks per GQA group, and a Main Branch that attends only to those blocks. At 1M tokens on a 109B model, it cuts per-token attention compute by 28x and delivers 14x prefill speedups on H800 GPUs.
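
The two-branch idea can be sketched for a single query and a single head in plain Python. This is an illustration under stated assumptions, not MSA itself: the post does not say how the Index Branch scores blocks, so the mean-key dot product below is an assumed scoring rule, and `block_sparse_attention`, `block_size`, and `top_k` are hypothetical names. GQA grouping and all kernel-level details are omitted.

```python
import math


def block_sparse_attention(q, keys, values, block_size, top_k):
    """Attend from one query to only the top_k highest-scoring KV blocks."""
    n = len(keys)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Index branch (assumed rule): score a block by q . mean(keys in block).
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(q))]
        return dot(q, mean_key)

    ranked = sorted(range(len(blocks)),
                    key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Main branch: ordinary softmax attention, restricted to selected blocks.
    idx = [i for b in selected for i in blocks[b]]
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, keys[i]) * scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
           for d in range(len(values[0]))]
    return out, selected


# Toy usage: 8 keys in 4 blocks of 2; only block 1 aligns with the query.
q = [1.0, 0.0]
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
values = [[float(i // 2)] for i in range(8)]
out, selected = block_sparse_attention(q, keys, values, block_size=2, top_k=1)
```

The saving comes from the main branch touching only `top_k * block_size` keys rather than all of them, while the index branch does one cheap score per block.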

Jun 12

Get the 109B MiniMax-M3 model powered by MSA on Hugging Face: huggingface.co/MiniMaxAI/Min…
The open-source CUDA kernels for dense and sparse attention on NVIDIA SM100 are also available: paperswithcode.co/paper/2606…

MiniMaxAI/MiniMax-M3 · Hugging Face
