SWE-bench

Pinned Tweet

SWE-bench @SWEbench

20 Dec 2025

CLS

@ChengleiSi

19 Dec 2025

Replying to @jyangballin @KLieret @_carlosejimenez @OfirPress

How do I join the SWE-bench Slack, John?

SWE-bench @SWEbench

Jan 26

More SWE-bench environments, tasks, trajectories, and training recipes for all!

Kevin Li

@kevin_x_li

Jan 26

SWE-smith is going multilingual! We have expanded our task synthesis pipeline to JavaScript! This release includes:
• 6,099 new JS tasks
• Coverage across 34 popular repos
• End-to-end Modal pipeline for fast task synthesis
Scaling agentic training data just got easier.

SWE-bench @SWEbench

Jan 26

Join us in SWE-bench slack if you're interested in contributing and using these new datasets! (bottom left of swebench.com) Expect a lot more to come in the following weeks :)

SWE-bench @SWEbench

Jan 22

🚀🚀🚀

John Yang

@jyangballin

Jan 22

PyPI downloads last month:
- swebench: 3.1M (10M total)
- swesmith: 1.9M (2.8M total)
- mini-swe-agent: 164k (636k total)
We're incredibly grateful ❤️ to the worldwide SWE-* community who continue to build on our work! New releases on all fronts coming soon

SWE-bench @SWEbench

19 Dec 2025

SWE-bench blog site launched! Check out our content, and expect more SWE-bench/agent/smith content soon!

SWE-bench retweeted

John Yang

@jyangballin

5 Nov 2025

New eval! Code duels for LMs ⚔️
Current evals test LMs on *tasks*: "fix this bug," "write a test"
But we code to achieve *goals*: maximize revenue, cut costs, win users
Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals

SWE-bench retweeted

Chunyang Chen @chun_yang_chen

6 Nov 2025

🏆 Glad to know that our #ASE25 paper about automated bug repair using MLLMs just got the ACM SIGSOFT Distinguished Paper Award 🎉 And it is still ranked #1 on the @SWEbench Multimodal track! Thanks to Kai, Xiaofei @xfxie312, and Jian for the great work!

Chunyang Chen @chun_yang_chen

8 Sep 2025

Excited to announce Kai's latest ASE'25 work, which lets LLMs not only see bugs but also fix them:
📄 “Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Repair”
🔗 arxiv.org/abs/2506.16136
Ranked #1 on @SWEbench Multimodal!

SWE-bench retweeted

Ofir Press

@OfirPress

10 Sep 2025

Congrats to @Zai_org GLM-4.5 on getting the 7th spot on our SWE-bench Verified [Bash Only] leaderboard! w/ @KLieret @_carlosejimenez @jyangballin

SWE-bench retweeted

Ofir Press

@OfirPress

10 Sep 2025

Super excited to have @anyscalecompute use mini-swe-agent for their large scale runs! w/ @KLieret @_carlosejimenez @jyangballin

SWE-bench retweeted

Ofir Press

@OfirPress

6 Sep 2025

3 out of the top 6 most downloaded datasets on @huggingface are SWE-bench related. Thanks!!! ♥️

SWE-bench retweeted

carlos @_carlosejimenez

12 Aug 2025

Recent open model scores on SWE-bench Bash Only:
🥇 Qwen3-Coder 480B/A35B Instruct - 55.40%
🥈 Kimi-K2-Instruct - 43.80%
🥉 gpt-oss-120b - 26.00%
See the full leaderboard below! 👇

SWE-bench retweeted

Kilian Lieret @KLieret

20 Aug 2025

What if your agent uses a different LM at every turn? We let mini-SWE-agent randomly switch between GPT-5 and Sonnet 4 and it scored higher on SWE-bench than with either model separately. Read more in the SWE-bench blog 🧵

Image alt text: GPT-5 or Sonnet 4. Flipping a coin at every step of your agent beats both
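
The per-turn switching described in this tweet can be sketched roughly as follows. This is only an illustration under stated assumptions, not the mini-SWE-agent implementation: the `ModelFn` interface, the `run_with_random_models` helper, and the `"DONE"` stop signal are all hypothetical names introduced here, and the stub models stand in for real LM calls.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in for an LM call: takes the history so far and
# returns the next action as text. Real model clients are not shown.
ModelFn = Callable[[List[str]], str]


def run_with_random_models(
    models: Dict[str, ModelFn],
    task: str,
    max_steps: int,
    rng: random.Random,
) -> List[str]:
    """Pick one model uniformly at random at every step.

    Returns the list of model names that acted, one per step.
    """
    history = [task]
    chosen = []
    names = sorted(models)  # fixed order so a seeded rng is reproducible
    for _ in range(max_steps):
        name = rng.choice(names)         # the "coin flip" for this turn
        action = models[name](history)   # both models see the same shared history
        history.append(action)
        chosen.append(name)
        if action == "DONE":             # assumed stop signal, not from the source
            break
    return chosen


if __name__ == "__main__":
    stubs = {
        "model_a": lambda history: "step from A",
        "model_b": lambda history: "step from B",
    }
    print(run_with_random_models(stubs, "fix the bug", 6, random.Random(0)))
```

The point of the sketch is that both models append to one shared history, so each turn continues from whatever the other model did previously.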

SWE-bench retweeted

Kilian Lieret @KLieret

21 Aug 2025

DeepSeek v3.1 chat scores 53.8% on SWE-bench Verified with mini-SWE-agent. It tends to take more steps to solve problems than other models (its score flattens out after some 125 steps). As a result, the effective cost is somewhere near GPT-5 mini. Details in 🧵

SWE-bench retweeted

Ofir Press

@OfirPress

7 Aug 2025

GPT-5 gets 74.9 on SWE-bench. Wonder what the budget per task is.

SWE-bench retweeted

carlos @_carlosejimenez

31 Jul 2025

What happens if you compare LMs on SWE-bench without the fancy scaffolds? Our new leaderboard “SWE-bench (bash only)” shows you which LMs are the best at getting the job done with just bash. More on why this is important 👇

SWE-bench retweeted

Ofir Press

@OfirPress

28 Jul 2025

Super exciting to have 3 new open-weight models that all obtain more than 60 on SWE-bench Verified! Looking forward to the results on SWE-bench Multimodal when these models obtain vision capabilities :)

SWE-bench retweeted

Kilian Lieret @KLieret

24 Jul 2025

Releasing mini, a radically simple SWE-agent: 100 lines of code, 0 special tools, and gets 65% on SWE-bench Verified! Made for benchmarking, fine-tuning, RL, or just for use from your terminal. It’s open source, simple to hack, and compatible with any LM! Link in 🧵

Image alt text: mini-SWE-agent: 100 lines is all you need. 65% on SWE-bench Verified with Claude Sonnet 4. No special tools, no tool calls, no shell session.
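
To make "no special tools, no tool calls, no shell session" concrete, here is a minimal sketch of a bash-only agent loop. It is not the actual mini-SWE-agent code: the `ModelFn` interface, the `run_bash_only_agent` helper, and the convention that an empty string means "stop" are assumptions made for illustration, and the scripted model replaces a real LM.

```python
import subprocess
from typing import Callable, List, Tuple

Transcript = List[Tuple[str, str]]

# Hypothetical LM interface: given the task and the (command, output)
# transcript so far, return the next bash command, or "" to stop.
ModelFn = Callable[[str, Transcript], str]


def run_bash_only_agent(
    model: ModelFn,
    task: str,
    max_steps: int = 10,
    timeout: float = 30.0,
) -> Transcript:
    """Run model-proposed bash commands one at a time and collect their output."""
    transcript: Transcript = []
    for _ in range(max_steps):
        command = model(task, transcript)
        if not command:  # assumed stop convention, not from the source
            break
        # Every command runs in a fresh subprocess, so nothing like the
        # working directory or exported variables carries over between steps.
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        transcript.append((command, result.stdout + result.stderr))
    return transcript


if __name__ == "__main__":
    # Scripted stand-in for an LM: the export in step 1 is not visible in step 2.
    script = iter(["export MINI_DEMO_VAR=1", "echo ${MINI_DEMO_VAR:-unset}", ""])
    for command, output in run_bash_only_agent(lambda task, t: next(script), "demo"):
        print(f"$ {command}\n{output}", end="")
```

Running each command in its own subprocess keeps the loop stateless, which is one plausible reading of "no shell session"; the trade-off is that the model has to repeat any setup, such as a `cd`, inside each command.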

SWE-bench @SWEbench

22 Jul 2025

🎉 Congrats @Alibaba_Qwen @huybery @JustinLin610 and the Qwen team! Incredible progress in the last year, love to see Qwen continue championing open models for SWE-bench!

Qwen

@Alibaba_Qwen

22 Jul 2025

>>> Qwen3-Coder is here! ✅

We’re releasing Qwen3-Coder-480B-A35B-Instruct, our most powerful open agentic code model to date. This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation. It achieves top-tier performance across multiple agentic coding benchmarks among open models, including SWE-bench-Verified!!! 🚀

Alongside the model, we're also open-sourcing a command-line tool for agentic coding: Qwen Code. Forked from Gemini Code, it includes custom prompts and function call protocols to fully unlock Qwen3-Coder’s capabilities. Qwen3-Coder works seamlessly with the community’s best developer tools. As a foundation model, we hope it can be used anywhere across the digital world — Agentic Coding in the World!

💬 Chat: chat.qwen.ai/
📚 Blog: qwenlm.github.io/blog/qwen3-…
🤗 Model: hf.co/Qwen/Qwen3-Coder-480B-…
🤖 Qwen Code: github.com/QwenLM/qwen-code

SWE-bench @SWEbench

11 Jul 2025

SWE-agent is now Multimodal! 😎 We're releasing SWE-agent Multimodal, with image-viewing abilities and a full web browser for debugging front-ends. Evaluate your LMs on SWE-bench Multimodal or use it yourself for front-end dev. 🔗➡️

SWE-bench @SWEbench

11 Jul 2025

Doc: swe-agent.com/latest/usage/m…
Code: github.com/SWE-agent/SWE-age…
