Arena.ai

Tweets

Pinned Tweet

Arena.ai

@arena

Jun 4

Introducing Agent Mode: Agentic AI is now measured in the Arena. Agent Mode can do deep research, create reports, generate images, build websites, debug code, and more. It completes more complex tasks by using tools like web search, bash in a sandbox environment, image generation, file writing, and asking follow-up questions. Frontier models are waiting for you in Agent Mode to take on real-world tasks. GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro, and top open models. Test them yourself.

461

253,005

Arena.ai retweeted

Arena.ai

@arena

Jun 13

Our first impressions of @AnthropicAI's Claude Fable 5 in Agent Arena, by @petergostev: youtube.com/watch?v=db_ci3HY…

Claude Fable 5 (Mythos) | First impressions

https://arena.ai/agent Anthropic's most anticipated model release ...

youtube.com

14,700

Arena.ai

@arena

14h

The open-source model Kimi-K2.7-Code by @Kimi_Moonshot is now in Code Arena: Frontend. In Code Arena, you can build full web apps and interactive sites from prompts and image uploads. Your feedback drives the Code Arena: Frontend leaderboard.

Kimi.ai

@Kimi_Moonshot

Jun 12

🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced! 🔷 Improved coding & agent performance over K2.6: 21.8% on Kimi Code Bench v2, 11.0% on Program Bench, and 31.5% on MLS Bench Lite. 🔷 Reasoning efficiency: Less overthinking, with 30% lower reasoning-token usage compared to K2.6. 🔷 Long-horizon coding: Improved instruction following, higher end-to-end coding task success rates. ⚡️ 6x High-Speed Mode coming soon! 🔌 Available today via Kimi API and Kimi Code. 🔗 Kimi Code: kimi.com/code 🔗 API: platform.moonshot.ai

236

17,955

Arena.ai

@arena

14h

Work with Kimi-K2.7-Code and other top frontier models in Code Arena: Frontend at arena.ai/code

Code Arena: Build & Test with AI Coding Models

Test the world's leading coding models. Build web apps and websites in real time while evaluating model accuracy and logic.

arena.ai

3,856

Arena.ai

@arena

14h

Here's where the Code Arena: Frontend leaderboard stands right now: arena.ai/leaderboard/code/we…

WebDev AI Leaderboard - Best AI Models for Web Development

View overall rankings across AI models on front-end web development tasks, including agentic coding workflows that require multi-step reasoning and tool use.

arena.ai

3,557

Arena.ai

@arena

Jun 13

Find more technical details on Claude Fable 5 on the Agent Arena leaderboard: arena.ai/leaderboard/agent

Agent Arena | AI Agent Performance Leaderboard

Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.

arena.ai

6,653

Arena.ai

@arena

Jun 13

The official statement from @AnthropicAI x.com/AnthropicAI/status/206…

Anthropic

@AnthropicAI

Jun 13

The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Claude models is not affected. We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible. Read our full statement: anthropic.com/news/fable-myt…

9,295

Arena.ai

@arena

Jun 13

Update: We've removed Claude Fable 5 from Arena, following Anthropic's latest announcement and the U.S. government directive to suspend access. Claude Fable 5 is the most powerful model we’ve ever tested, ranking #1 across Agent, Text, and Code Arena and marking a new breakthrough in frontier AI performance. We look forward to restoring access and resuming community testing when possible.

Arena.ai

@arena

Jun 10

Exciting news: Claude Fable 5 ranks #1 on the new Agent Arena leaderboard! Fable 5 leads by the widest margin ever over Opus-4.8 and GPT-5.5 on two key signals: confirmed task success rate and praise vs. complaint, despite weaker steerability. If Fable can do something, it will do it very well. If it can't or doesn't want to do something, it may be hard to steer the model towards the goal. In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks. Models get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents. We use the causal tracing methodology to measure a model's net improvement, which indicates how much it improves outcomes relative to the average model. Huge congrats to @AnthropicAI on the incredible milestone! Below we break down how Claude Fable 5 (based on Mythos) scored across 5 signals, drawn from tasks submitted by a global community of users.

867

110,896
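
The phrase "net improvement ... relative to the average model" in the post above can be illustrated with a deliberately naive sketch. It only subtracts the unweighted mean success rate from each model's own rate; it does not reproduce Arena's causal tracing methodology, which is described in their blog and not in this feed. The function name `net_improvement`, the model names, and the rates are all hypothetical and made up for illustration.

```python
from statistics import mean


def net_improvement(success_rates: dict[str, float]) -> dict[str, float]:
    """Return each model's success rate minus the unweighted mean across models.

    Naive illustration only: no adjustment for task mix, user behavior,
    or any other confounder that a causal method would handle.
    """
    baseline = mean(success_rates.values())
    return {model: rate - baseline for model, rate in success_rates.items()}


# Hypothetical success rates, not taken from any leaderboard.
rates = {"model_a": 0.60, "model_b": 0.50, "model_c": 0.40}
deltas = net_improvement(rates)

for model, delta in deltas.items():
    print(f"{model}: {delta:+.1%}")
```

In this toy setup a positive value means the model succeeds more often than the average of the listed models, and the values sum to zero by construction.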

Arena.ai

@arena

Jun 12

The open-weight model MiniMax M3 by @MiniMax_AI is available in Agent Arena. In Agent Arena, models get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents. Every session contributes to the Agent Arena leaderboard. Scores for MiniMax M3 coming soon.

Arena.ai

@arena

Jun 4

Introducing Agent Arena: real-world agentic evals at scale. How do you evaluate agents doing actual work? We measure millions of live sessions where real users accomplish real tasks. On Arena, models now get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents. Every session produces rich signals. Users iterate with the agent turn-by-turn: approving, editing, correcting, praising, or expressing frustration. The environment gives feedback too: shell errors, tool failures, recovery attempts, and more. Our leaderboard measures each model's agentic performance using causal inference across five signals: task success, steerability, error recovery, user praise vs. complaint, and tool hallucination. This leaderboard snapshot is built from 300K tasks, 2M tool calls, and 40M lines of code by agents. Top labs in Agent Arena:
- #1 @OpenAI: GPT-5.5 (High)
- #2 @AnthropicAI: Claude-Opus-4.7 (Thinking)
- #3 @Zai_org: GLM-5.1
- #4 @GoogleDeepMind: Gemini-3.1-Pro
- #5 @Kimi_Moonshot: Kimi-K2.6
More analysis in the thread, with the full technical blog below.

10,774

Arena.ai

@arena

Jun 12

Tackle your complex real-world tasks while contributing to frontier AI measurement at: arena.ai/agent

Agent Mode | Autonomous AI Agents for Real-World Tasks

Browse, research, and code autonomously on Arena — free. Compare frontier models on real-world agentic tasks.

arena.ai

3,839

Arena.ai

@arena

Jun 12

The newest open model to join the Agent Arena leaderboard, Nemotron 3 Ultra by @NVIDIA, lands at #20 overall and #5 among open models. Its standout signals are a positive praise-vs-complaint margin and low tool hallucination, but it's held back by steerability and bash recovery. Note the wide confidence intervals, as scores are still stabilizing. In Agent Arena, models get web search, filesystem, and terminal tools to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents. We use the causal tracing methodology to measure a model's net improvement, which indicates how much it improves outcomes relative to the average model. See in the thread how Nemotron 3 Ultra scored across 5 signals, drawn from tasks submitted by a global community of users.

Arena.ai

@arena

Jun 4

185

16,031

Arena.ai

@arena

Jun 12

Learn more about the causal tracing methodology for Agent Arena on our blog: arena.ai/blog/agent-arena-me…

Agent Arena: Causal Evaluation of Agents in the Real World

Agents are increasingly doing real work. The resulting task distribution has greatly expanded. We desire an agent evaluation that scales along with usage and capability.

arena.ai

3,309

Arena.ai

@arena

Jun 12

Head over to the Agent Arena leaderboard to dive into the details: arena.ai/leaderboard/agent

Agent Arena | AI Agent Performance Leaderboard

Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.

arena.ai

2,538

Arena.ai retweeted

Arena.ai

@arena

Jun 10

Check out our first impressions of @AnthropicAI’s Claude Fable 5 in Agent Arena, with @petergostev, on our YouTube: youtu.be/db_ci3HYth8

Claude Fable 5 (Mythos) | First impressions

https://arena.ai/agent Anthropic's most anticipated model release ...

youtube.com

25,945

Arena.ai

@arena

Jun 11

GPT-5.5 (xHigh) ranks #2 on Agent Arena (10.6% net improvement), making it the highest-ranked OpenAI model, close behind Claude Fable 5 (High). In the per-signal breakdown, GPT-5.5 (xHigh) ranks #1 in Praise vs. Complaint (29.4%) and Bash Recovery (14.1%), scoring higher than Claude Fable 5 (High) on both signals. It trails Claude Fable 5 (High) on Confirmed Success (5.4% vs. 17.6%) and Steerability (1.9% vs. 5.4%). Agent Arena evaluates models on millions of real-world, long-horizon agentic tasks. Models use tools like web search, filesystem, and terminal to complete complex workflows: writing code, creating slide decks, researching the web, building apps, and analyzing documents. We use causal tracing to measure model performance across real-world agentic tasks. More breakdown of GPT-5.5 (xHigh) across five signals in the thread.

Arena.ai

@arena

Jun 4

472

45,520
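
For the two signals where the GPT-5.5 (xHigh) post above reports figures for both models, the gap can be worked out directly. The sketch below is plain arithmetic on the percentages as printed in that post (Confirmed Success and Steerability); the dictionary layout and variable names are assumptions made here, and the result is a difference in percentage points, not an official Arena metric.

```python
# Figures as printed in the post: (GPT-5.5 (xHigh), Claude Fable 5 (High)), in percent.
reported = {
    "confirmed_success": (5.4, 17.6),
    "steerability": (1.9, 5.4),
}

# Gap in percentage points by which Claude Fable 5 (High) leads on each signal.
gaps = {
    signal: round(fable - gpt, 1)
    for signal, (gpt, fable) in reported.items()
}

for signal, gap in gaps.items():
    print(f"{signal}: Fable 5 leads by {gap} points")
```

The post does not give Claude Fable 5's figures for Praise vs. Complaint or Bash Recovery, so those signals are left out rather than guessed.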

Arena.ai

@arena

Jun 11

Learn more about the Agent Arena methodology here: arena.ai/blog/agent-arena-me…

Agent Arena: Causal Evaluation of Agents in the Real World

Agents are increasingly doing real work. The resulting task distribution has greatly expanded. We desire an agent evaluation that scales along with usage and capability.

arena.ai

3,765

Arena.ai

@arena

Jun 11

Full Agent Arena leaderboard: arena.ai/leaderboard/agent

Agent Arena | AI Agent Performance Leaderboard

Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.

arena.ai

3,617