All three leading open weights models were released last week. Open weights models continue to progress alongside proprietary ones: the best of them now sit 6 points behind GPT-5.5, the leading proprietary model, on the Artificial Analysis Intelligence Index.
@Kimi_Moonshot's Kimi K2.6 (Reasoning) and @Xiaomi's MiMo V2.5 Pro (Reasoning) tie as the leading open weights models on the Artificial Analysis Intelligence Index at 54, with @deepseek_ai's DeepSeek V4 Pro (Reasoning, Max Effort) at 52. This places the best open weights models within 3-6 points of the leading proprietary models: @OpenAI's GPT-5.5 (xhigh) at 60, and @Google's Gemini 3.1 Pro Preview and @AnthropicAI's Claude Opus 4.7 (Adaptive Reasoning, Max Effort) at 57.
For context: just one year ago, the highest-scoring open weights model was DeepSeek V3 0324, which scored 22 on the Intelligence Index, ~13 points below the highest-scoring proprietary model, Claude 3.7 Sonnet (Reasoning), at 35.
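The gap figures above can be reproduced with simple arithmetic. The sketch below is illustrative only: it uses the rounded scores quoted in this post, and the function name `gap_to_best_open` is a hypothetical helper, not part of any Artificial Analysis tooling.

```python
# Intelligence Index scores as quoted in this post (rounded values).
open_weights = {
    "Kimi K2.6 (Reasoning)": 54,
    "MiMo V2.5 Pro (Reasoning)": 54,
    "DeepSeek V4 Pro (Reasoning, Max Effort)": 52,
}
proprietary = {
    "GPT-5.5 (xhigh)": 60,
    "Gemini 3.1 Pro Preview": 57,
    "Claude Opus 4.7 (Adaptive Reasoning, Max Effort)": 57,
}


def gap_to_best_open(open_scores, proprietary_scores):
    """Map each proprietary model to its lead over the best open weights score."""
    best_open = max(open_scores.values())
    return {name: score - best_open for name, score in proprietary_scores.items()}


current_gaps = gap_to_best_open(open_weights, proprietary)

# One year ago: DeepSeek V3 0324 at 22 vs Claude 3.7 Sonnet (Reasoning) at 35.
year_ago_gap = 35 - 22

for name, gap in current_gaps.items():
    print(f"{name}: {gap} points ahead of the best open weights model")
print(f"Gap one year ago: {year_ago_gap} points")
```

Because the inputs are rounded, the computed gaps are approximate, and the index's scale may not be directly comparable across index versions.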
Key takeaways:
➤ The three most intelligent open weights models are trillion-plus-parameter MoE architectures with permissive licenses. Kimi K2.6 (Reasoning) has 1T total / 32B active parameters with a 256K context window, MiMo V2.5 Pro (Reasoning) has 1T total / 42B active with a 1M context window, and DeepSeek V4 Pro (Reasoning, Max Effort) has 1.6T total / 49B active with a 1M context window.
➤ The gap to proprietary models remains wide on the hardest reasoning and agentic coding evaluations. On HLE (Humanity's Last Exam), the three top open weights models score 34-36%, vs 44% for GPT-5.5 (xhigh) and 45% for Gemini 3.1 Pro Preview. On CritPt (Research-level Physics), they score 4-12%, vs 27% for GPT-5.5 (xhigh). On TerminalBench Hard (Agentic Coding & Terminal Use), they score 43-46%, vs 61% for GPT-5.5 (xhigh) and 54% for Gemini 3.1 Pro Preview.
➤ Omniscience (knowledge hallucination) shows a large gap to proprietary models, with DeepSeek V4 Pro (Reasoning, Max Effort) hallucinating significantly more than its open weights peers. DeepSeek V4 Pro (Reasoning, Max Effort) scores -10, MiMo V2.5 Pro (Reasoning) 4, and Kimi K2.6 (Reasoning) 6. By comparison, GPT-5.5 (xhigh) scores 20, Claude Opus 4.7 (Adaptive Reasoning, Max Effort) 26, and Gemini 3.1 Pro Preview 33.