evaluating llms and code. download now on VSCode! | maintained by @iamwaynechi @valeriechen_

Joined September 2024
Pinned Tweet
Check out our findings in our latest preprint! A big thank you to everyone who's been using and voting on Copilot Arena. We couldn't have done it without you all♥️!
4 Mar 2025
What do developers 𝘳𝘦𝘢𝘭𝘭𝘺 think of AI coding assistants? In October, we launched @CopilotArena to collect user preferences on real dev workflows. After months of live service, we’re here to share our findings in our recent preprint. Here's what we have learned /🧵
Copilot Arena retweeted
We are excited to launch the ⚔️PR Arena⚔️ leaderboard! Full results will be revealed after a certain milestone of community votes. Fix your GitHub issues for free and vote for the better fix! 👉Leaderboard & Setup Guide: prarena.web.app
Copilot Arena retweeted
Here are some tips for using ⚔️PR Arena⚔️ 1⃣ pr-arena🏷️ option is added automatically to Issue Labels for ease of use! 2⃣ You can use PR Arena in forked repositories. 3⃣ Don't like either fix? Select “neither” and no PR will be created. 👉Install here: github.com/apps/openhands-pr…

Introducing ⚔️PR Arena⚔️ - free AI coding agents to fix real GitHub issues. Claude Sonnet 4 vs Gemini 2.5 Pro… Who writes better pull requests? 👉 Install here: github.com/apps/openhands-pr… Powered by @allhands_ai
📢Calling all developers who contributed votes in Copilot Arena, we need your help building the PR Arena leaderboard 🗳️. You will no longer be restricted to VSCode IDE--any GitHub repo with an open issue is fair game! Check out the thread below for details:
Introducing ⚔️PR Arena⚔️ - free AI coding agents to fix real GitHub issues. Claude Sonnet 4 vs Gemini 2.5 Pro… Who writes better pull requests? 👉 Install here: github.com/apps/openhands-pr… Powered by @allhands_ai
Come meet our amazing little brother, Music Arena!
Excited to share our beta release of Music Arena, a live evaluation platform for state-of-the-art AI music generation models! 🎧 Listen to the latest models and 🗳️ vote for your favorite ⚔️ music-arena.org ⭐️ github.com/gclef-cmu/music-a… 📜 arxiv.org/abs/2507.20900
We’re featured in the new tech report on Mercury models! Check it out👇
Since our launch earlier this year, we are thrilled to witness the growing community around dLLMs. The Mercury tech report from @InceptionAILabs is now on @arxiv with more extensive evaluations: arxiv.org/abs/2506.17298 New model updates dropping later this week!
New result: Qwen-2.5-Coder jumps from 13th to joint 1st place with fill-in-the-middle (FiM)! Congrats to @Alibaba_Qwen 🥳 Also check out @lmarena_ai's new UI 🖥️✨
Copilot Arena retweeted
Who is winning the race to claim the LLMs for SWE market? We share our thoughts based on our @CopilotArena work. See article below for current sentiments and what lies ahead 👇
OpenAI is making a big push into one of the most popular AI domains: software engineering on.wsj.com/3SCvoW2
Copilot Arena retweeted
We are launching our API in open beta! Visit the Inception Platform to create your account and get started using the first commercial-scale diffusion large language models (dLLMs). platform.inceptionlabs.ai/

Copilot Arena retweeted
With so many AI coding assistants out there, it can be hard to keep track of ones that perform well on real-world tasks. CMU researchers developed Copilot Arena to do just that by crowdsourcing user ratings of LLM-written code. bit.ly/3YLeDvh

Copilot Arena retweeted
Replying to @CopilotArena
@CopilotArena was featured in @SCSatCMU news! Featuring quotes from me, @iamwaynechi, @atalwalkar and @chrisdonahuey 🥳 📖Check out the article here: cs.cmu.edu/news/2025/copilot…

A post about me?
9 Apr 2025
blog.ml.cmu.edu/2025/04/09/c… How do real-world developer preferences compare to existing evaluations? A CMU and UC Berkeley team led by @iamwaynechi and @valeriechen_ created @CopilotArena to collect user preferences on in-the-wild workflows. This blog post gives an overview of the design and deployment of Copilot Arena and new insights into developer code preferences.
Copilot Arena retweeted
14 Mar 2025
Check out @CopilotArena’s new Code Edit Leaderboard!
New #1 Leaders of Code Edit Leaderboard: Strong performance from both Claude 3.7 Sonnet and Gemini-2.0-Pro! Congratulations to @AnthropicAI and @GoogleDeepMind 🥇 We have also released a new live leaderboard interface✨. You can now easily toggle between code completion and code edit.
Curious about how code edits work in Copilot Arena? Check out this post: x.com/lmarena_ai/status/1882…

24 Jan 2025
News from @CopilotArena: Code Editing Leaderboard is now LIVE! We have collected over 3.7k votes on 6 models. Congrats @AnthropicAI Claude 3.5 Sonnet on a 1st place rank!🥇 Blog analysis below👇
Try Copilot Arena for free here: lmarena.ai/copilot Leaderboard at: lmarena.ai Paper at: arxiv.org/abs/2502.09328 Open-source at: github.com/lmarena/copilot-a…

Copilot Arena is now on Open VSX! Download here: open-vsx.org/extension/copil…

Copilot Arena retweeted
4 Mar 2025
Interested in trying out Copilot Arena for yourself? Download at lmarena.ai/copilot. Follow us at @CopilotArena for upcoming updates!

Copilot Arena retweeted
🏆 Mercury Coder’s performance: It’s tied for 2nd place on Copilot Arena, a platform for evaluating coding assistants in real-world settings. This is impressive for a new model based on emerging tech, competing with leaders like DeepSeek V2.5 and Claude Sonnet 3.5. #Coding #AI