Updated DevQualityEval v1.0 results are in 👀 Check out how our new king of cost-effectiveness (Google’s Gemini 2.0 Flash Lite) performed, and find out if Claude 3.7 Sonnet (Thinking) is worth the additional costs 👇
Insights from analyzing >100 LLMs for DevQualityEval v1.0 (generating quality code) in our latest deep dive:
- 👑 Google’s Gemini 2.0 Flash Lite is the king of cost-effectiveness (our previous king OpenAI’s o1-preview is 1124x more expensive, and worse in score)
- 🥇 Anthropic’s Claude 3.7 Sonnet is functionally the best model (with help) … by far
- 🏡 Qwen’s Qwen 2.5 Coder is the best model for local use
- Models are on average getting better at code generation, especially in Go
- Only one model is on par with static tooling for migrating JUnit 4 code to JUnit 5
- Surprise! Providers are unreliable for days when serving new popular models
- Let’s STOP the model naming MADNESS together: we proposed a convention for naming models
- We counted all the votes: v1.1 will bring JS, Python, Rust, …
- Our hunch that static analysis can improve scoring continues to hold true
All the other models, details and how we continue to solve the "ceiling problem" in the deep dive: 👇🧵
(now with interactive graphs 🌈)
Looking forward to your feedback :-)