The bitter lesson in 26 words:
Don’t be distracted by human knowledge, as AI has been historically.
Instead focus on methods for creating knowledge that scale with computation, like search and learning.
IMHO, this verbatim C-to-Rust translation is a mistake; it should be approached from the perspective of the program's topology, which, unfortunately, is often incomplete or even incorrect.
Follow-up on non-English token-inefficiency with more model-language pairs:
- Chinese is cheaper than English on major Chinese models
- Gemini and Qwen impose the lowest non-English tax
- Anthropic has the highest tax by far; Kimi is next
- Hindi is the worst-covered language here, despite its massive speaker base
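A minimal sketch of how such a "tax" could be computed, assuming it means the token count of a translated text divided by the token count of the English original. The post does not state its exact formula, so that definition is an assumption. `byte_tokens` is a hypothetical stand-in that counts UTF-8 bytes; a real measurement would plug in each model's own tokenizer.

```python
from typing import Callable


def non_english_tax(count_tokens: Callable[[str], int],
                    english: str,
                    translated: str) -> float:
    """Tokens for the translation divided by tokens for the English text.

    A value above 1.0 means the translation costs more tokens than English.
    """
    base = count_tokens(english)
    if base == 0:
        raise ValueError("English reference text produced zero tokens")
    return count_tokens(translated) / base


def byte_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return len(text.encode("utf-8"))


# Illustrative sentence pair, not data from the post.
english = "Hello, world"
hindi = "नमस्ते दुनिया"
tax = non_english_tax(byte_tokens, english, hindi)
print(f"Hindi tax under the byte-level stand-in: {tax:.2f}x")
```

Swapping `byte_tokens` for a real tokenizer's counting function is the only change needed to compare model-language pairs under this definition.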
Dario is wrong.
He knows absolutely nothing about the effects of technological revolutions on the labor market.
Don't listen to him, Sam, Yoshua, Geoff, or me on this topic.
Listen to economists who have spent their careers studying this, like @Ph_Aghion, @erikbryn, @DAcemogluMIT, @amcafee, @davidautor
Anthropic CEO Dario Amodei: “50% of all tech jobs, entry-level lawyers, consultants, and finance professionals will be completely wiped out within 1–5 years.”
For currying, there is no need to use a number; for a functional, there is no need to use a function. Currying (with only one variable) is equivalent to a functional.
🤯BREAKING: Alibaba just proved that AI Coding isn't taking your job, it's just writing the legacy code that will keep you employed fixing it for the next decade. 🤣
Passing a coding test once is easy. Maintaining that code for 8 months without it exploding? Apparently, it’s nearly impossible for AI.
Alibaba tested 18 AI agents on 100 real codebases over 233-day cycles. They didn't just look for "quick fixes"—they looked for long-term survival.
The results were a bloodbath:
75% of models broke previously working code during maintenance.
Only Claude Opus 4.5/4.6 maintained a >50% zero-regression rate.
Every other model accumulated technical debt that compounded until the codebase collapsed.
We’ve been using "snapshot" benchmarks like HumanEval that only ask "Does it work right now?"
The new SWE-CI benchmark asks: "Does it still work after 8 months of evolution?"
Most AI agents are "Quick-Fix Artists." They write brittle code that passes tests today but becomes a maintenance nightmare tomorrow. They aren't building software; they're building a house of cards.
The narrative just got honest: Most models can write code. Almost none can maintain it.
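A rough sketch of the snapshot-versus-maintenance distinction, assuming a "regression" is a test that passed before a change and no longer passes after it, and that the "zero-regression rate" is the share of codebases with no regression at any step. The thread does not spell out the benchmark's scoring, so both definitions and every name below are hypothetical.

```python
from typing import Dict, List, Set

Snapshot = Dict[str, bool]  # test name -> passed?


def regressions(before: Snapshot, after: Snapshot) -> Set[str]:
    """Tests that passed before the change and fail (or are missing) after it."""
    return {name for name, passed in before.items()
            if passed and not after.get(name, False)}


def zero_regression_rate(histories: List[List[Snapshot]]) -> float:
    """Share of codebases whose test history never shows a regression.

    Each history is a chronological list of test-result snapshots.
    """
    if not histories:
        raise ValueError("need at least one codebase history")
    clean = sum(
        1 for snapshots in histories
        if all(not regressions(a, b) for a, b in zip(snapshots, snapshots[1:]))
    )
    return clean / len(histories)


# Made-up histories for illustration only.
stable = [{"t1": True}, {"t1": True, "t2": True}]
brittle = [{"t1": True}, {"t1": False, "t2": True}]
rate = zero_regression_rate([stable, brittle])
```

A snapshot benchmark would only look at the last entry of each history; this check looks at every transition between consecutive snapshots.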
A small experiment with #autoresearch: in 30 hours it ran nearly 80 trials and found 17 improvements, and the loss dropped by nearly 77%, roughly equal to 5 days of work by a mid-level researcher.
CONS: 1. only works on a single GPU 2. only supports one metric
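For intuition, a generic keep-if-better loop over a single metric might look like the sketch below. This is not the autoresearch implementation; the function names, the toy loss, and the proposal rule are all assumptions for illustration.

```python
import random
from typing import Callable, Tuple


def run_trials(propose: Callable[[float, random.Random], float],
               evaluate: Callable[[float], float],
               baseline: float,
               n_trials: int,
               seed: int = 0) -> Tuple[float, float, float, int]:
    """Greedy search: keep a candidate only if it lowers the single metric."""
    rng = random.Random(seed)
    best = baseline
    start_loss = best_loss = evaluate(best)
    improvements = 0
    for _ in range(n_trials):
        candidate = propose(best, rng)
        loss = evaluate(candidate)
        if loss < best_loss:  # one scalar metric decides everything
            best, best_loss = candidate, loss
            improvements += 1
    return best, start_loss, best_loss, improvements


# Toy stand-ins (hypothetical): a 1-D "config" and a quadratic loss.
def toy_loss(x: float) -> float:
    return (x - 3.0) ** 2


def toy_propose(x: float, rng: random.Random) -> float:
    return x + rng.gauss(0.0, 0.5)


best, start_loss, best_loss, improvements = run_trials(
    toy_propose, toy_loss, baseline=0.0, n_trials=80)
drop = 1.0 - best_loss / start_loss
print(f"{improvements} of 80 trials kept, loss down {drop:.0%}")
```

Because acceptance hinges on one scalar comparison, a loop of this shape cannot trade off several metrics without first collapsing them into a single number, which matches the second con above.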