We taught a 1.3M parameter model to play DOOM. It outperforms LLMs up to 92,000x its size.
Happy Easter Monday! Here's our Easter egg release: SauerkrautLM-Doom-MultiVec-1.3M.
17.8 average points per episode.
We benchmarked our tiny model against GPT-4o-mini (via OpenAI API), Nemotron-120B, Qwen3.5-27B, and Gemini Flash Lite (via OpenRouter API) on VizDoom's defend_the_center:
- Our model: 17.8 avg points/episode, 31ms per decision, runs on CPU
- Gemini Flash Lite: 0.8 avg points/episode (920ms latency)
- Qwen3.5-27B: 0.67 avg points/episode (13.3s latency)
- Nemotron-120B: 0.6 avg points/episode (8.9s latency)
- GPT-4o-mini: 0.0 avg points/episode (just dodges, never engages)
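The headline ratios can be re-derived from the numbers in the list above. This is a minimal sketch: the scores and latencies are copied from the post, while the 120e9 parameter count is an assumption taken from the name "Nemotron-120B", and all variable names are made up for illustration.

```python
# Figures copied from the benchmark list above; "ours" is the 1.3M model.
avg_points = {
    "ours": 17.8,
    "Gemini Flash Lite": 0.8,
    "Qwen3.5-27B": 0.67,
    "Nemotron-120B": 0.6,
    "GPT-4o-mini": 0.0,
}
# No latency was reported for GPT-4o-mini, so it is left out here.
latency_ms = {
    "ours": 31,
    "Gemini Flash Lite": 920,
    "Qwen3.5-27B": 13_300,
    "Nemotron-120B": 8_900,
}

PARAMS_OURS = 1.3e6          # nominal size from the headline
PARAMS_LARGEST_LLM = 120e9   # assumption: nominal size implied by the model name

# Score ratio against the best-scoring LLM.
llms = [name for name in avg_points if name != "ours"]
best_llm = max(llms, key=avg_points.get)
score_ratio = avg_points["ours"] / avg_points[best_llm]

# Latency ratio against every LLM with a reported latency.
latency_ratio = {
    name: ms / latency_ms["ours"]
    for name, ms in latency_ms.items()
    if name != "ours"
}

size_ratio = PARAMS_LARGEST_LLM / PARAMS_OURS

print(f"best LLM: {best_llm}, score ratio: {score_ratio:.1f}x")
for name, ratio in latency_ratio.items():
    print(f"{name}: {ratio:.0f}x slower per decision")
print(f"size ratio: {size_ratio:,.0f}x")
```

With these inputs the score ratio is about 22x (against Gemini Flash Lite), the latency ratios range from about 30x to about 430x, and the size ratio is about 92,000x.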
The architecture: ModernBERT-Hash
We took hash embeddings (Svenstrup et al. 2017), previously only applied to the original BERT architecture (see @neumll's BERT-Hash models), and brought them to ModernBERT, adding rotary position embeddings, alternating local/global attention, Flash Attention 2 support, and learned depth embeddings from VizDoom's depth buffer.
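For readers new to hash embeddings, here is a toy sketch of the general idea from Svenstrup et al. (2017): each token is hashed by several hash functions into a shared pool of component vectors, and the token's embedding is a weighted sum of the selected components. This is plain Python for illustration, not the layer used in the released model. The class name, pool size, dimension, and hash family are all assumptions, and in a real model the pool and importance weights would be trained rather than left at random initialization.

```python
import random

PRIME = 2_147_483_647  # large prime for a simple universal hash family


class HashEmbedding:
    """Toy hash embedding: k hashed component vectors mixed by per-token weights."""

    def __init__(self, num_buckets, dim, num_hashes=2, num_tokens=75, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        self.dim = dim
        # Shared pool of component vectors (trainable in a real model).
        self.pool = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                     for _ in range(num_buckets)]
        # One importance weight per (token, hash function), also trainable.
        self.importance = [[rng.gauss(0.0, 1.0) for _ in range(num_hashes)]
                           for _ in range(num_tokens)]
        # Each hash function is h(x) = ((a * x + b) mod PRIME) mod num_buckets.
        self.hashes = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                       for _ in range(num_hashes)]

    def buckets(self, token_id):
        """Return the pool index chosen by each hash function for this token."""
        return [((a * token_id + b) % PRIME) % self.num_buckets
                for a, b in self.hashes]

    def lookup(self, token_id):
        """Weighted sum of the component vectors selected for this token."""
        out = [0.0] * self.dim
        weights = self.importance[token_id]
        for weight, bucket in zip(weights, self.buckets(token_id)):
            for j, value in enumerate(self.pool[bucket]):
                out[j] += weight * value
        return out


emb = HashEmbedding(num_buckets=16, dim=8)
vector = emb.lookup(42)
print(emb.buckets(42), len(vector))
```

The parameter saving comes from the pool being much smaller than a full vocabulary-by-dimension table. Two tokens may share a component vector, but they rarely share all of them, and the importance weights let training tell them apart.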
The result is a 5-layer encoder with a 75-token character-level tokenizer (no BPE, every ASCII character is one token, preserving spatial structure), attention pooling, and a 4-action classification head. Total: 1,319,300 parameters, ~5MB on disk, 31ms inference on CPU.
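To make "every ASCII character is one token, preserving spatial structure" concrete, here is a minimal character-level tokenizer sketch. The vocabulary below is hypothetical and much smaller than the model's 75 tokens. The special tokens, the character set, and the toy frame are assumptions for illustration only.

```python
# Hypothetical vocabulary: a few special tokens plus a handful of ASCII characters.
SPECIALS = ["[PAD]", "[UNK]", "[CLS]"]
CHARS = " .:-=+*#%@\n"
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(CHARS))}


def encode(text, max_len):
    """Map each character to one token id, then truncate or pad to max_len."""
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(ch, VOCAB["[UNK]"]) for ch in text]
    ids = ids[:max_len]
    return ids + [VOCAB["[PAD]"]] * (max_len - len(ids))


# A toy 3x5 ASCII "frame"; rows are separated by newlines.
WIDTH = 5
frame = "..#..\n.###.\n..@.."
ids = encode(frame, max_len=20)


def token_index(row, col):
    """Position of grid cell (row, col) in the token sequence.

    +1 skips [CLS]; each row occupies WIDTH characters plus one newline.
    """
    return 1 + row * (WIDTH + 1) + col


print(ids)
print(ids[token_index(2, 2)] == VOCAB["@"])
```

Because there is no subword merging, a grid cell always lands at the same sequence position, so position embeddings can line up with screen coordinates. A BPE tokenizer would merge runs of characters and shift those positions from frame to frame.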
Trained on 31K frames of a human playing DOOM for about 2 hours. That's it.
Fully open source. Everything you need to reproduce this:
Model weights:
huggingface.co/VAGOsolutions…
Training data (31K frames):
huggingface.co/datasets/VAGO…
Code, training scripts, benchmark framework:
github.com/VAGOsolutions/Sau…
Full paper with methodology included in the repo.
Why does this matter beyond the fun factor?
Small specialized models can decisively beat general-purpose LLMs at real-time control tasks. Not by a small margin: 22x the average points per episode of the best-scoring LLM we tested. At roughly 1/30th to 1/400th the latency, depending on the LLM. On a CPU. For free.
This has real implications for robotics, autonomous systems, game AI, and any domain where you need sub-100ms decisions on edge hardware. The future of AI isn't exclusively large. It's appropriately sized.
Thank you to my co-authors Daryoush Vaziri (University of Applied Sciences Bonn-Rhein-Sieg) and Alexander Marquardt (Nara Institute of Science and Technology, CARE Laboratory) for their contributions to this work.
Built with VizDoom, PyTorch, HuggingFace Transformers, and the ModernBERT architecture by @benjamin_warner, @antoine_chaffin, @ClavierBenjamin et al. Hash embedding approach inspired by NeuML's BERT-Hash models.
#AI #DOOM #GameAI #SmallModels #OpenSource #ModernBERT #SauerkrautLM #VAGOSolutions #Easter #TinyML