Alibaba has released four new Qwen3.5 models from 0.8B to 9B. The 9B (Reasoning, 32 on the Intelligence Index) is the most intelligent model under 10B parameters, and the 4B (Reasoning, 27) the most intelligent under 5B, but both use more than 200M output tokens to run the Intelligence Index
@Alibaba_Qwen has expanded the Qwen3.5 family with four smaller dense models: the 9B (Reasoning, 32 on the Intelligence Index), 4B (Reasoning, 27), 2B (Reasoning, 16), and 0.8B (Reasoning, 9). These complement the larger 397B-A17B, 122B-A10B, 35B-A3B, and 27B models released earlier this month. All models are Apache 2.0 licensed, support 262K context, include native vision support, and use the same unified thinking/non-thinking hybrid approach as the rest of the Qwen3.5 family
Key benchmarking results for the reasoning variants:
➤ The 9B and 4B are the most intelligent models in their respective size classes, and both score ahead of every other model under 10B parameters. Qwen3.5 9B (32) scores roughly double the next closest models under 10B: Falcon-H1R-7B (16) and NVIDIA Nemotron Nano 9B V2 (Reasoning, 15). Qwen3.5 4B (27) also outscores both of these despite having roughly half their parameters. All four of the small Qwen3.5 models are on the Pareto frontier of the Intelligence vs. Total Parameters chart
➤ The Qwen3.5 generation represents a material intelligence uplift over Qwen3 across all sub-10B model sizes, with larger gains at higher total parameter counts. Comparing reasoning variants: Qwen3.5 9B (32) is 15 points ahead of Qwen3 VL 8B (17), the 4B (27) gains 9 points over Qwen3 4B 2507 (18), the 2B (16) is 3 points ahead of Qwen3 1.7B (estimated 13), and the 0.8B (9) gains 2.5 points over Qwen3 0.6B (6.5).
➤ All four models use 230-390M output tokens to run the Intelligence Index, significantly more than either their larger Qwen3.5 siblings or their Qwen3 predecessors. Qwen3.5 0.8B used ~230M output tokens, 2B used ~390M, 4B used ~240M, and 9B used ~260M. For context, the much larger Qwen3.5 27B used 98M and the 397B flagship used 86M. These token counts also exceed those of most frontier models: Gemini 3.1 Pro Preview (57M), GPT-5.2 (xhigh, 130M), and GLM-5 Reasoning (109M)
➤ AA-Omniscience is a relative weakness, with hallucination rates of 80-82% for the 4B and 9B. Qwen3.5 4B scores -57 on AA-Omniscience with a hallucination rate of 80% and accuracy of 12.8%. Qwen3.5 9B scores -56 with 82% hallucination and 14.7% accuracy. These are marginally better than their Qwen3 predecessors (Qwen3 4B 2507: -61, 84% hallucination, 12.7% accuracy), with the improvement driven primarily by lower hallucination rates rather than higher accuracy.
➤ The Qwen3.5 sub-10B models combine high intelligence with native vision at a scale previously unavailable. On MMMU-Pro (multimodal reasoning), Qwen3.5 9B scores 69.2% and 4B scores 65.4%, ahead of Qwen3 VL 8B (56.6%), Qwen3 VL 4B (52.0%), and Ministral 3 8B (46.0%). The Qwen3.5 0.8B scores 25.8%, which is notable for a sub-1B model
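The generation-over-generation gains and token-usage comparisons in the bullets above are simple arithmetic on the quoted figures. A minimal Python sketch, using only numbers stated in this post (the names PAIRS, TOKENS_M, gains, and token_ratio are illustrative and not part of any published tooling):

```python
# Intelligence Index scores quoted in this post (reasoning variants).
# Each entry: (Qwen3.5 model, score, Qwen3 predecessor, predecessor score).
PAIRS = [
    ("Qwen3.5 9B", 32.0, "Qwen3 VL 8B", 17.0),
    ("Qwen3.5 4B", 27.0, "Qwen3 4B 2507", 18.0),
    ("Qwen3.5 2B", 16.0, "Qwen3 1.7B (estimated)", 13.0),
    ("Qwen3.5 0.8B", 9.0, "Qwen3 0.6B", 6.5),
]

# Approximate output tokens (millions) used to run the Intelligence Index,
# as quoted in this post.
TOKENS_M = {
    "Qwen3.5 0.8B": 230,
    "Qwen3.5 2B": 390,
    "Qwen3.5 4B": 240,
    "Qwen3.5 9B": 260,
    "Qwen3.5 27B": 98,
    "Qwen3.5 397B": 86,
}


def gains(pairs):
    """Return {Qwen3.5 model: point gain over its Qwen3 predecessor}."""
    return {new: new_score - old_score for new, new_score, _old, old_score in pairs}


def token_ratio(model, baseline):
    """Return how many times more output tokens `model` used than `baseline`."""
    return TOKENS_M[model] / TOKENS_M[baseline]


if __name__ == "__main__":
    for model, gain in gains(PAIRS).items():
        print(f"{model}: +{gain:g} points over predecessor")
    for small in ("Qwen3.5 0.8B", "Qwen3.5 2B", "Qwen3.5 4B", "Qwen3.5 9B"):
        ratio = token_ratio(small, "Qwen3.5 27B")
        print(f"{small} used {ratio:.1f}x the output tokens of Qwen3.5 27B")
```

On the quoted figures, the 9B used roughly 2.7x the output tokens of the 27B, which is the trade-off to weigh against its small parameter count.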
Other information:
➤ Context window: 262K tokens
➤ License: Apache 2.0
➤ Quantization: Native weights are BF16. Alibaba has not released first-party GPTQ-Int4 quantizations for these small models, though it has for the larger models in the Qwen3.5 family released earlier (27B, 35B-A3B, 122B-A10B, 397B-A17B). With 4-bit quantization, all four models are accessible on consumer hardware
➤ Availability: At time of publishing, there are no first-party or third-party serverless APIs hosting these models
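To put the quantization note in rough numbers, weight storage scales as parameters times bits per weight divided by 8. The sketch below is a back-of-envelope estimate only. It assumes the nominal sizes in the model names (0.8B, 2B, 4B, 9B) are the true parameter counts, and it ignores the KV cache, activations, quantization metadata such as scales, and any runtime overhead, so real memory use will be higher. The names NOMINAL_PARAMS_B and weight_gb are illustrative.

```python
# Nominal parameter counts in billions, taken from the model names.
# ASSUMPTION: actual parameter counts may differ from these nominal sizes.
NOMINAL_PARAMS_B = {
    "Qwen3.5 0.8B": 0.8,
    "Qwen3.5 2B": 2.0,
    "Qwen3.5 4B": 4.0,
    "Qwen3.5 9B": 9.0,
}


def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Billions of parameters times bits per weight, divided by 8 bits per byte.
    Excludes KV cache, activations, and quantization metadata.
    """
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    for name, params in NOMINAL_PARAMS_B.items():
        bf16 = weight_gb(params, 16)  # native BF16 weights
        int4 = weight_gb(params, 4)   # hypothetical 4-bit quantization
        print(f"{name}: ~{bf16:.1f} GB in BF16, ~{int4:.1f} GB at 4 bits")
```

Under these assumptions the 9B needs about 18 GB for BF16 weights and about 4.5 GB at 4 bits, which is why 4-bit variants are plausible on consumer GPUs and laptops.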