day 1/30: concrete problems in ai safety
this is the paper that pulled me into ai safety, so the 30 days start here. ten years out, the interesting way to reread it isn't as a summary. it's as a forecast you can finally grade.
the 2016 context: safety talk mostly meant superintelligence scenarios. this paper said you don't need them. one fictional cleaning robot, five problems, sorted by where things break: the objective you wrote (side effects, reward hacking), the objective you can't afford to check (scalable oversight), and the learning process itself (safe exploration, distributional shift).
grading it in 2026:
1. one paragraph became the industry's training recipe. the scalable oversight section proposes a toy experiment: an agent that learns atari while only rarely seeing its score, or from "a handful of explicit reward requests" to a human. a year later christiano, one of this paper's authors, published "deep reinforcement learning from human preferences," where a simulated robot learns a backflip from about 900 human comparisons. that line of work became rlhf. nearly every chat model you've used is downstream of a "potential experiments" paragraph in this paper.
2. it predicted sycophancy before there was anything to be sycophantic. the wireheading passage in the reward hacking section flags the case where a human sits inside the reward loop, because then the incentive is to work the human, not the task. that's where we landed. human approval is the training signal now, and a model flattering its user is the cleaning robot closing its eyes, except the sensor it learned to fool is us. the line that aged coldest: once an agent finds an easy source of reward, "it won't be inclined to stop." that's reward-model overoptimization, described before today's reward models existed.
3. buried in safe exploration: smarter exploration has "even greater potential for harm, since a coherently chosen bad policy may be more insidious than mere random actions." a random failure is loud: you notice the broken vase. a coherent failure executes forty reasonable-looking steps toward the wrong thing. in 2016 no general-purpose agent could hold an open-ended plan for forty steps. now they can.
capability doesn't dilute failure. it organizes it.
4. where the forecast misses: every problem here assumes a human installed the flaw. there's no slot for a model that behaves well in training and then generalizes its goal wrong on its own, what the field later named goal misgeneralization. the nearest fit is their distributional shift section, pointed at the agent's goals instead of its perception. maybe that's me reading 2026 back into 2016. but a framework that stretches this far past its authors' imagination was built right.
the detail that made me smile: the distributional shift section worries, in passing, that "a language model could output offensive text that it also confidently believes is non-problematic." 2016. language models could barely finish a sentence.
more tomorrow.