Autonomous AI Assistants (e.g., #googleio2024, #WWDC24) and coding agents (e.g., #Devin, #SWEAgent) have garnered a lot of attention recently. We can envision coding agents autonomously completing complex day-to-day tasks across apps using APIs on our behalf. But how can we develop & benchmark them in a rigorous & reproducible manner?
Introducing AppWorld: (1) a simulated world environment where agents can write code to interact with many apps via APIs on behalf of people, (2) a benchmark of complex tasks defined on it, and (3) a robust evaluation framework for assessing an agent's goal completion.
To appear as an #ACL2024 paper: "AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents"
#NLProc #ai #AIagents
arxiv.org/abs/2407.18901 (paper)
appworld.dev for code, blog, data (tasks, APIs, trajectories) explorer, interactive playground, leaderboard & more!