Overfitting. He was likely overfitting the model to one specific benchmark.
PewDiePie didn’t “train his own LLM.” He fine-tuned an existing open-source model on coding benchmarks. His model started at 8%, crawled to 16% after format fixes, and then a single run hit 19.6%, briefly passing GPT-4o on one benchmark. He couldn’t consistently reproduce that result.
The tweet makes it sound like a YouTuber casually built a frontier lab in his bedroom. What actually happened is more interesting: a guy with a $41,000 home rig of 10 GPUs and 424GB of VRAM spent months failing, retraining, and iterating on dataset quality until he squeezed marginal gains out of a fine-tune.
This is the part worth paying attention to. The entire arc from October 2025 to now tells you where AI tooling has actually landed. PewDiePie went from building his first PC to running Qwen 235B locally, vibe-coding a custom chat UI, orchestrating multi-agent voting systems, and now fine-tuning models on custom datasets. He did most of this through AI-assisted coding itself.
The video is literally called “I wish I never did this project.” He’s documenting how painful and tedious the process was. That honesty is the signal. The hype accounts strip that away and replace it with “what the f*ck, YouTuber beats DeepSeek.”
The real takeaway: fine-tuning on specific benchmarks with curated data can let anyone temporarily spike a score past models that cost hundreds of millions to train. That tells you everything about how narrow benchmark gaming has become, and nothing about general capability. PewDiePie knows this. The people quote-tweeting him with shock emojis do not.
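If you want a feel for why one 19.6% run next to a 16% baseline is shaky, here is a rough sketch. It treats every benchmark problem as an independent pass/fail coin flip, which is a simplification, and the benchmark size of 200 problems is a made-up number for illustration, not a figure from the video. The function name is also just mine.

```python
import math


def pass_rate_interval(pass_rate, num_problems, z=1.96):
    """Approximate 95% interval for a pass rate (normal approximation to the binomial)."""
    std_err = math.sqrt(pass_rate * (1.0 - pass_rate) / num_problems)
    return pass_rate - z * std_err, pass_rate + z * std_err


# ASSUMPTION: 200 problems is a hypothetical benchmark size, not taken from the video.
NUM_PROBLEMS = 200

low, high = pass_rate_interval(0.16, NUM_PROBLEMS)
print(f"16% on {NUM_PROBLEMS} problems: roughly {low:.1%} to {high:.1%}")
print("19.6% falls inside that range:", low <= 0.196 <= high)
```

Under those assumptions, a score a few points above 16% is within ordinary noise for a benchmark of that size, which fits a result that shows up once and then won't reproduce. A bigger benchmark narrows the range, so the real picture depends on the actual problem count.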