AISecHub
@AISecHub
17 Jul 2025
Evaluating LLM-based Agents: arxiv.org/pdf/2503.16416v1
A comprehensive list of methods for evaluating AI Agents.
#AQUARAT
#HotpotQA
#StrategyQA
#GSM8K
#MATH
#Gameof24
#MiniWAT
#PlanBench
#FlowBench
#FOLIO
#PFOLIO
#MultiRC
#MUSR
#BeeT
#BoolQ
#AutoPlanBench
#APCBench
#NarrativeQA
#QMSum
#QUALITY
#MemGPT
#LoCoMo
#AMEM
#StreamBench
#LLMEvolve
#ReflectionBench
#ToolBench
#ToolAlpaca
#APIBench
#NexusRaven
#SealTools
#ComplexFuncBench
#RestBench
#APIgen
#StableToolBench
#WebShop
#MindWeb
#WebShopV2
#WebArena
#MMH
#AssistInBench
#Camas
#WorkArena
#HumanEval
#SWEbench
#SWEbenchLite
#SWEbenchMultimodal
#ProofBench
#SWFBench