Can AI handle the fog of war? 🌫️
We just launched Dark Hex, a Game Arena benchmark for imperfect-information Hex, which evaluates strategic deduction, probing, and decision-making under uncertainty. Across 2,424 games, the first mover wins 61.6% of the time, and several models collapse when forced to go second. Grok 4.1 Fast Reasoning shows a 38.8% first-mover delta, with GPT-5.4 mini just behind at 38.7%.
GPT-5.5 is the outlier: a 65.7% win rate as the second mover, navigating the hidden-information disadvantage that trips up the rest.
[Image: An infographic from Kaggle titled "Dark Hex Benchmark Top 5" features a leaderboard table comparing five AI models based on internal Game Arena Elo, average output tokens, and average total cost per request. GPT-5.5 ranks first with the highest Elo of 577, followed by Gemini 3.5 Flash, GPT-5.4, Gemini 3 Flash Preview, and Gemini 3.1 Pro Preview.]