elvis

@omarsar0

Feb 25

New research from Intuit AI Research.

Agent performance depends on more than just the agent. It also depends on the quality of the tool descriptions it reads. However, tool interfaces are still written for humans, not LLMs. As the number of candidate tools grows, poor descriptions become a real bottleneck for tool selection and parameter generation.

As Karpathy has suggested, let's build for AI Agents.

This new research introduces Trace-Free, a curriculum learning framework that teaches models to rewrite tool descriptions into versions that are more effective for LLM agents.

The key idea: during training, the model learns from execution traces showing which tool descriptions lead to successful usage. Then, through curriculum learning, it progressively reduces reliance on traces, so at inference time, it can improve tool descriptions for completely unseen tools without any execution history.

On StableToolBench and RestBench, the approach shows consistent gains on unseen tools, strong cross-domain generalization, and robustness as candidate tool sets scale beyond 100.

Instead of only fine-tuning the agent, optimizing the tool interface itself is a practical and underexplored lever for improving agent reliability.

Paper: arxiv.org/abs/2602.20426

Learn to build effective AI agents in our academy: academy.dair.ai/

Edoardo Ponti @PontiEdoardo

Feb 9

Results across visual and textual environments: unsupervised SWIRL ꩜ outperforms SFT-only baselines.
16% on AURORA-BENCH (visual dynamics)
28% on ByteMorph (visual dynamics)
16% on WorldPredictionBench (longer-horizon prediction)
14% on StableToolBench (API calling)

AI EdTalks

@AIEdTalks

3 Oct 2025

References:
[1] $τ$-bench: A Benchmark for Tool-Agent-User Interaction. arXiv, 2024-06-17. arxiv.org/abs/2406.12045. Shows a sharp performance drop as tool choice scales; defines pass^k.
[2] StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning. arXiv (v5), 2025-03-05. arxiv.org/abs/2403.07714. Introduces a virtual API server; notes only 44.4% success in real ToolBench APIs.
[3] Dynamic tool calling in LangGraph agents. LangChain Changelog, 2025-08-06. changelog.langchain.com/anno… Adds state-scoped tool exposure.
[4] Introducing the Model Context Protocol (MCP). Anthropic Blog, 2024-11-25. anthropic.com/news/model-con… Standardizes tool discovery and interoperability.
[5] Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks. arXiv, 2024-11-07. arxiv.org/abs/2411.04468. Orchestrator model for dynamic coordination.
[6] CrewAI Docs: Tools. CrewAI, accessed 2025-09-30. docs.crewai.com/concepts/too… Defines crew-level tool usage patterns and telemetry.
[7] Gorilla: LLM Connected with Massive APIs. arXiv, 2023-05-24. arxiv.org/abs/2305.15334. Retrieval-aware API calling; foundational for large toolsets.
[8] MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use. arxiv.org/html/2508.16260v1?…

AISecHub

@AISecHub

17 Jul 2025

Evaluating LLM-based Agents - arxiv.org/pdf/2503.16416v1 A comprehensive list of methods for evaluating AI Agents. #AQUARAT #HotpotQA #StrategyQA #GSM8K #MATH #Gameof24 #MiniWAT #PlanBench #FlowBench #FOLIO #PFOLIO #MULtIrc #MUSR #BeeT #BoolQ #AutoPlanBench #APCBench #NarrativeQA #QMSum #QUALITY #MemGPT #LoCoMo #AMEM #StreamBench #LLMEvolve #ReflectionBench #ToolBench #ToolAlpaca #APIBench #NexusRaven #SealTools #ComplexFuncBench #RestBench #APIgen #StableToolBench #WebShop #MindWeb #WebShopV2 #WebArena #MMH #AssistInBench #Camas #WorkArena #HumanEval #SWEbench #SWEbenchLite #SWEbenchMultimodal #ProofBench #SWFBench

Vincent

@vansinhu

29 Mar 2025

paperscope.ai/2503.20527 Summary: The rapid advancement of large language models (LLMs) has spurred significant interest in tool learning, where LLMs are augmented with external tools to tackle complex tasks. However, existing tool environments face challenges in balancing stability, scalability, and realness, particularly for benchmarking purposes. To address this problem, we propose MirrorAPI, a novel framework that trains specialized LLMs to accurately simulate real API responses, effectively acting as "mirrors" to tool environments. Using a comprehensive dataset of request-response pairs from 7,000 APIs, we employ supervised fine-tuning and chain-of-thought reasoning to enhance simulation fidelity. MirrorAPI achieves superior accuracy and stability compared to state-of-the-art methods, as demonstrated by its performance on the newly constructed MirrorAPI-Bench and its integration into StableToolBench.