Joined October 2010
GH Copilot (VS Code Insiders preview) has finally added context window stats. And what a discovery: GPT-5.2 gets just a 128K context window (out of the 272K allowed by the model).
Maxim Saplin retweeted
♟️ Excited to share our work, LLM Chess! It’s a clean, scalable benchmark showing that even today’s top LLMs still struggle with strategic reasoning and instruction-following in dynamic environments.
📄 Paper: arxiv.org/abs/2512.01992
🏆 Leaderboard: maxim-saplin.github.io/llm_c…
💻 Code: github.com/maxim-saplin/llm_…
🎯 Why Chess? Chess is the original AI challenge: strategic, long-horizon, and grounded. It’s also a clean test for LLMs: no contamination, no memorization, and difficulty scales with progress.
🔑
• 50 models including GPT-o3 @OpenAI, Gemini @Google, Claude @AnthropicAI, DeepSeek @deepseek_ai, Llama @Meta, @Alibaba_Qwen evaluated via agentic gameplay.
• Reasoning models do much better than non-reasoning, yet many still can’t beat random play.
• Top models reach ~758 Elo: good, but nowhere near strong humans.
🧑‍🤝‍🧑 Thank you to our amazing collaborators @msmxm, @SaiKolasani1, @nrcrispino, @kylepmont, @matei_zaharia, @jaredq_, @Chi_Wang_!
📍 The work will also be presented at the NeurIPS FoRLM Workshop on Sun, Dec 7, 3:00–4:15pm PT in Upper Level Room 33ABC. Come chat with us and check out the live leaderboard!
Maxim Saplin retweeted
Excited to share our latest work, now on arXiv and at FoRLM @ NeurIPS'25! 🎉
Introducing **LLM Chess**: a benchmark for evaluating reasoning and instruction-following in LLMs through chess.
LLMs now reach experts in math & coding, but can they *reason* in dynamic, multi-step strategic environments? We tested 50 models. The results? Many models struggle to beat an opponent making *random* moves, and even powerful reasoning models cannot beat a *weak skilled opponent*.
Why chess? It's been the "drosophila of AI" since the 1950s, used as a measuring stick for AI progress and a testbed for planning, strategy, and long-horizon decision-making. Unlike static benchmarks that get contaminated or saturated, chess offers:
✅ Dynamic, stochastic gameplay
✅ Adjustable difficulty via engine skill
✅ Resistance to memorization
Our setup: LLMs play in an agentic environment, making moves through tool calls.
**Phase 1:** 50 models play 30 games each vs a random agent, a simple test that many models *fail* due to instruction-following failures or poor performance.
**Phase 2:** Top reasoning models face the Komodo Dragon engine at various Elo scores from 250 to 1375 for performance estimation grounded in the real world (tied to chess.com Elo).
Key findings for Phase 1:
♟️ Reasoning models crush non-reasoning: **45.4% vs 0.7%** win rate, with many models struggling to reach even 50% Win/Loss vs a random player
♟️ Instruction failures **3× higher** in non-reasoning models (71.9% vs 24.4%)
♟️ Test-time scaling for reasoning effort boosts performance up to **20%**
Key findings for Phase 2:
📉 The best LLM we tested (o3-low) peaks at only **~758 Elo**. While LLMs match experts in math & coding, they play chess around the average online player (~611 Elo on chess.com) and far below human masters (~2800 Elo).
🔄 LLM Chess is extensible. As models improve, we scale difficulty. No saturation, no contamination.
Check it out and let us know what you think! We are continually evaluating more models on the benchmark. Come and see us at the FoRLM workshop at 3:00-4:15pm on Sunday, December 7th, 2025 @ Upper Level Room 33ABC at NeurIPS!
📄 Paper: arxiv.org/abs/2512.01992
🏆 Leaderboard: maxim-saplin.github.io/llm_c…
💻 Code: github.com/maxim-saplin/llm_…
Huge thanks to @msmxm, @SaiKolasani1, @nrcrispino, @kylepmont, @matei_zaharia, @jaredq, @Chi_Wang_, @ChenguangWang 🙏
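To put the Elo figures quoted above (~758, ~611, ~2800) in context, here is a minimal sketch of the standard Elo expected-score formula. It is not the paper's estimation procedure, which is not described in the post; the function name `expected_score` is a hypothetical helper chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score (win = 1, draw = 0.5) of player A vs player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Ratings taken from the post; treating them as directly comparable is an assumption.
    best_llm, average_online, human_master = 758, 611, 2800
    print(f"LLM vs average online player: {expected_score(best_llm, average_online):.3f}")
    print(f"LLM vs human master: {expected_score(best_llm, human_master):.6f}")
```

A gap of 400 points corresponds to roughly 10:1 odds, so a gap of about 2000 points leaves the weaker side with an expected score very close to zero.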
7 Apr 2025
Putting it into perspective: Llama 3, released on April 18, 2024, was pre-trained on 15 trillion tokens. Llama 4 had 40T, almost a 3x increase.
29 Mar 2025
AI saves hours nailing down these sorts of typos in your code:
26 Mar 2025
A quick speed test of a devbox, measuring TS build time:

git clone github.com/microsoft/TypeScr…
cd TypeScript
npm install
time npm run build

A few results:
i5 13600KF OC, Desktop, WSL2 - 16s
M4 Pro MBP 16 - 19.2s
i7-8850H, Win, WSL2 - 40.7s
i7-8850H, Win - 53.5s
i7-8850H, MBP - 52.9s
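On shells without a `time` builtin, the same wall-clock measurement can be scripted. The sketch below is an assumption-laden illustration, not part of the original test: `time_command` is a hypothetical helper, and the command to time is passed on the command line (for example `npm run build`, run from inside the cloned repository after `npm install`).

```python
import subprocess
import sys
import time
from typing import Optional, Sequence


def time_command(cmd: Sequence[str], cwd: Optional[str] = None) -> float:
    """Run cmd to completion and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    # check=True raises CalledProcessError if the command exits with a non-zero status.
    subprocess.run(list(cmd), cwd=cwd, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Example invocation (script name is hypothetical): python time_build.py npm run build
        # On native Windows the executable may need to be given as npm.cmd.
        elapsed = time_command(sys.argv[1:])
        print(f"Elapsed: {elapsed:.1f}s")
    else:
        print("usage: pass the command to time as arguments")
```

This reports only wall-clock time, which is what the results listed above compare; it does not separate user and system CPU time the way the shell's `time` does.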
20 Dec 2024
Do you mind trying o3 in this eval: maxim-saplin.github.io/llm_c… - @gdb?

16 Nov 2024
When prompted to play chess, LLMs can't score a single win against a random player, @karpathy maxim-saplin.github.io/llm_c…

4 Oct 2024
Oneteen onety one (HuggingFaceTB/SmolLM-135M)
3 Oct 2024
Two studies published in September 2024, both investigating the impact of GitHub Copilot on dev productivity, draw opposite conclusions:
👍 papers.ssrn.com/sol3/papers.… - 26% more completed tasks
👎 cio.com/article/3540579/devs… - no change in cycle time, 46% more bugs

14 Apr 2024
HuggingFace's dataset collection is a treasure...
11 Apr 2024
Sundman's general solution to the three-body problem would involve at least [10 to the power of 8 million] iterations to calculate the coordinates of moving planets. For comparison, there are [10 to the power of 80] atoms in the known universe.
14 Mar 2024
49
1 Jan 2024
Interestingly, those people who recently held my cassette player began their inspection of the device by trying to open the lid to look inside, pulling out the cassette. Only after that did they start pushing buttons or listening to the sound.
1 Oct 2023
"BCG consultants solving business problems with OpenAI’s GPT-4 performed 23% worse than those without it, new study finds," says the Fortune headline. Yet "using GPT-4 for creative product innovation outperformed the control group (those completed the task without using GPT-4) by 40%".
15 Jul 2012
The next generation of Recorded Future is coming soon. Get invited: recorded-future.kickofflabs.…

18 Jan 2012
To avoid #Wikipediablackout, simply block the JavaScript URL which hinders the home page. Details are here: saplin.blogspot.com/2012/01/…