🏆 Huggingface releases SmolLM3, SoTA 3B model, 128k context, dual mode reasoning (think/no_think)
🤖
@huggingface released SmolLM3, a 3B parameter multilingual reasoner that matches bigger 4B models, handles 128k tokens, and ships with an open-sourced training blueprint in this blog post.
🌍 They pre-trained on 11.2T tokens then stretched context with YARN up-sampling, finishing the run on 384 H100 GPUs in 24 days.
🧠 A built-in dual think / no_think switch lets users decide between fast answers or slower chain-of-thought traces.
🛠️ How they pulled it off
Grouped Query Attention trades multi-head attention for 4 compact query groups, shrinking memory without hurting accuracy.
NoPE removes rotary position math from every 4th layer, so the model remembers long passages yet stays snappy with short ones. NoPE is a twist on the usual rotary position embeddings. The SmolLM3 crew borrowed it from the 2025 study “RoPE to NoRoPE and Back Again”. They turn off rotary position math in every 4th transformer layer, so 1 out of 4 blocks handles tokens without any positional stamp.
That small skip keeps numerical noise from piling up as the text gets longer, boosts efficiency, and still keeps short-prompt quality steady. In SmolLM3, the trick helps a 3B model train cleanly on 64k-token sequences and stretch to 128k at inference time without extra hacks.
Intra-document masking keeps sentences from different web pages isolated during training, stopping weird cross-talk.
They mix web, code, and math across 3 stages, bumping code to 24% and math to 13% near the end because those domains sharpen reasoning.
After the main run they add 100B extra tokens only to extend context, raising RoPE theta to 5M so the model natively learns sequences up to 64k before YARN doubles it at inference.
A short “mid-training” on 35B reasoning traces teaches the model to explain its steps, while supervised fine-tuning balances 0.8B reasoning tokens against 1B direct-answer tokens.
They align responses with Anchored Preference Optimization, a stabler cousin of DPO, then merge checkpoints so long-context skill rebounds without losing fresh logic boosts.
Benchmarks show the base model tops every other 3B system on HellaSwag, ARC, and GSM8K, and the instruct variant edges close to Qwen3-4B while staying lighter.
Everything, from datasets to evaluation code, sits on GitHub, and Huggingface.