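Artificial Analysis has released AA-AgentPerf, the industry's first hardware benchmark for agentic workloads. Traditional evaluations resemble a single question-and-answer “sprint” that measures only response speed; agentic tasks are more like a “relay race”, in which the AI breaks down a goal on its own and cycles repeatedly through reading and writing files, editing code, and running tests. This frequent interaction places heavy demands on server memory capacity and scheduling efficiency. The benchmark replays real coding trajectories and uses “concurrent agents supported per megawatt of power” as its core efficiency metric, targeting the power and capital constraints that data centers face.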
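The first round of tests ran DeepSeek V4 Pro, an open-source model with 1.6 trillion parameters. The results show that NVIDIA's liquid-cooled Blackwell rack-scale system, the GB300 NVL72, sustains about 61,400 concurrent agents per megawatt, while the previous-generation Hopper HGX H200 supports only about 2,600, an efficiency gain of more than 20×. Concurrent capacity per GPU also rose 41×. On the same power budget, a data center can therefore host more than 20 times as many concurrent agents, substantially lowering the cost of deploying applications such as automated coding and customer service.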
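In this first batch of results, the AMD Instinct MI355X trails for now. The benchmark's authors note that the AMD and H200 configurations were both built on the general-purpose open-source vLLM framework without deep optimization; as serving frameworks and kernel support catch up, AMD's performance has room to improve. Inference providers such as Together AI have already deployed DeepSeek V4 Pro on Blackwell to provide real-time inference for the agentic coding tool Cursor.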
Today we're releasing the first results for AA-AgentPerf, our new agentic inference benchmark: initially covering DeepSeek V4 Pro across NVIDIA Blackwell, Hopper, and AMD.
AA-AgentPerf is the first benchmark built for agentic inference. We use real, long-context agentic coding trajectory data as the workload and run inference with real production optimizations such as KV cache reuse and speculative decoding, which yields the most realistic evaluation of inference performance available today.
AA-AgentPerf’s lead metric is Agents per Megawatt. In a power-constrained world, this answers the most relevant question for AI infrastructure providers - “how many real agents can I deploy per unit of power available?”
First results for DeepSeek V4 Pro (at the easiest defined service level of 20 tokens/s and 10s TTFT):
➤ GB300 (rack-scale, disaggregated): 61,354 Agents/MW
➤ B300 (single node, disaggregated): 21,053 Agents/MW
➤ MI355X: 3,551 Agents/MW
➤ H200: 2,594 Agents/MW
Further AA-AgentPerf details:
➤ Real agent workloads, beyond synthetic queries: AA-AgentPerf replays real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens - the workloads that matter in 2026
➤ Production optimizations allowed: KV cache reuse, speculative decoding, and prefill/decode disaggregation are all permitted, with accuracy verification to control for quality loss - we want results to reflect what real deployments actually look like
➤ Lead metric is Agents per Megawatt: simultaneous agents supported at production performance targets (e.g. 20 tokens/s per user, ≤10s TTFT) per megawatt consumed. Agents per TCO and $/hr will be supported soon
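To make the metric definition concrete, here is a minimal Python sketch of how an Agents per Megawatt figure could be derived from a concurrency sweep. It is not the benchmark's actual implementation: the `LoadPoint` schema, the function names, the rule of picking the highest passing concurrency level, and the sample numbers are all assumptions made up for illustration. Only the service-level targets (20 tokens/s per user, ≤10s TTFT) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class LoadPoint:
    """One measured concurrency level for a system (hypothetical schema)."""
    concurrent_agents: int         # agents running simultaneously
    tokens_per_s_per_user: float   # output speed seen by each agent
    ttft_s: float                  # time to first token, in seconds
    power_kw: float                # system power draw at this load, in kW


def meets_service_level(point, min_tokens_per_s=20.0, max_ttft_s=10.0):
    """True if the point satisfies both per-user speed and TTFT targets."""
    return (point.tokens_per_s_per_user >= min_tokens_per_s
            and point.ttft_s <= max_ttft_s)


def agents_per_megawatt(points, min_tokens_per_s=20.0, max_ttft_s=10.0):
    """Agents/MW at the highest concurrency that still meets the targets.

    Assumption: the score is taken at the largest passing concurrency
    level; the real benchmark may select or interpolate differently.
    """
    passing = [p for p in points
               if meets_service_level(p, min_tokens_per_s, max_ttft_s)]
    if not passing:
        return 0.0
    best = max(passing, key=lambda p: p.concurrent_agents)
    return best.concurrent_agents / (best.power_kw / 1000.0)  # kW -> MW


# Made-up sweep for illustration only; these are NOT benchmark measurements.
sweep = [
    LoadPoint(100, 35.0, 2.0, 40.0),
    LoadPoint(200, 24.0, 6.0, 44.0),
    LoadPoint(400, 14.0, 15.0, 46.0),  # fails both targets
]
score = agents_per_megawatt(sweep)
print(f"{score:.0f} agents/MW")
```

In this sketch the 400-agent point is discarded because it misses both targets, so the score is taken at 200 agents and 44 kW. Tightening `min_tokens_per_s` or `max_ttft_s` models a stricter service level and can only lower or preserve the score.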
Key findings:
➤ Rack-scale disaggregated inference (GB300) is ~3× more power-efficient than single-node Blackwell (B300), and similarly ahead in raw agents per GPU
➤ Blackwell represents a large generational step over Hopper in both power efficiency and raw compute per GPU
➤ In this test, NVIDIA's Blackwell systems currently lead AMD MI355X by a clear margin. Important context: our MI355X configs are approximately two weeks older than our Blackwell configs and couldn’t stably use speculative decoding. MI355X power draw under heavy load is also well below TDP, indicating there is much room to improve on DeepSeek V4 Pro, which we will measure and publish in the coming weeks
➤ Config and inference framework version matter enormously - we've seen meaningful improvements daily since the DeepSeek V4 Pro release and look forward to tracking performance over time
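The ratios behind these findings can be checked directly against the first results listed earlier. The short Python sketch below uses only the published Agents/MW figures; the dictionary keys and the `ratio` helper are illustrative names, not part of the benchmark.

```python
# Published Agents/MW for DeepSeek V4 Pro at 20 tokens/s and 10s TTFT.
AGENTS_PER_MW = {
    "GB300": 61_354,   # rack-scale, disaggregated
    "B300": 21_053,    # single node, disaggregated
    "MI355X": 3_551,
    "H200": 2_594,
}


def ratio(system, baseline):
    """How many times more agents per megawatt `system` supports than `baseline`."""
    return AGENTS_PER_MW[system] / AGENTS_PER_MW[baseline]


# Compare every system against the H200 baseline, then GB300 against B300.
for name in AGENTS_PER_MW:
    print(f"{name} vs H200: {ratio(name, 'H200'):.1f}x")
print(f"GB300 vs B300: {ratio('GB300', 'B300'):.1f}x")
```

GB300 versus B300 comes out at roughly 2.9×, consistent with the ~3× stated above, and GB300 versus H200 comes out at more than 20×.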
AA-AgentPerf is a live benchmark and we publish results on a rolling basis as submissions come in. Some of the new features coming in v1.1: more models (gpt-oss-120b), more hardware (GB200, B200, H100, MI300X), better AMD configurations, $/hr and cost-per-task normalization, Agents per TCO, and performance tracking over time.