We are living in a timeline where a non-US company is keeping the original mission of OpenAI alive - truly open, frontier research that empowers all. It makes no sense. The most entertaining outcome is the most likely.
DeepSeek-R1 not only open-sources a barrage of models but also spills all the training secrets. They are perhaps the first OSS project that shows major, sustained growth of an RL flywheel.
Impact can be made by "ASI achieved internally" or mythical names like "Project Strawberry".
Impact can also be made by simply dumping the raw algorithms and matplotlib learning curves.
I'm reading the paper:
> Purely driven by RL, with no SFT at all, in the R1-Zero variant (the paper's "cold start" data is the small SFT set used to seed the full R1). Reminiscent of AlphaZero - mastering Go, Shogi, and Chess from scratch, without imitating human grandmaster moves first. This is the most significant takeaway from the paper.
> Use ground-truth rewards computed by hardcoded rules. Avoid any learned reward models, which RL can easily hack.
> Thinking time of the model steadily increases as training proceeds - this is not pre-programmed, but an emergent property!
> Emergence of self-reflection and exploration behaviors.
> GRPO instead of PPO: it removes the critic net from PPO and instead uses the average reward of multiple samples for the same prompt as the baseline. A simple way to reduce memory use. Note that GRPO was also invented by DeepSeek in Feb 2024 ... what a cracked team.
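To make the reward and GRPO points concrete, here is a minimal sketch in plain Python. It is not DeepSeek's code. The `\boxed{...}` answer format, the function names, and the `eps` term are my own assumptions for illustration. Dividing by the group's standard deviation follows the GRPO formulation as I understand it; the bullets above only mention the average.

```python
import re
import statistics


def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Hypothetical hardcoded rule: 1.0 if the last boxed answer matches, else 0.0.

    No learned reward model is involved, so there is nothing for RL to hack
    beyond the rule itself.
    """
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not matches:
        return 0.0  # no final answer in the assumed format
    return 1.0 if matches[-1].strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Score each sample against its own group instead of a critic network.

    The baseline is the mean reward of all samples drawn for one prompt.
    eps (an assumption) avoids division by zero when every reward is equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one prompt whose reference answer is "42".
samples = [
    "Let me check... \\boxed{42}",
    "I think it is \\boxed{41}",
    "Not sure.",
    "Wait, recompute: \\boxed{42}",
]
rewards = [rule_based_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Samples that beat the group average get a positive advantage and the rest get a negative one, with no value network to train or hold in memory. That is where the memory saving comes from.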