My (pure) speculation about what OpenAI o1 might be doing
[Caveat: I don't know anything more about the internal workings of o1 than the handful of lines about what they are actually doing in that blog post--and on the face of it, it is not more informative than "It uses Python er.. RL".. But here is what I told my students as one possible way it might be working]
There are two things--RL and "Private CoT" that are mentioned in the writeup. So imagine you are trying to transplant a "generalized AlphaGo"--let's call it GPTGo--onto the underlying LLM token prediction substate.
To do this, you need to know
(1) What are the GPTGo moves? For AlphaGo, we had GO moves). What would be the right moves when the task is just "expand the prompt".. ?
(2) Where is it getting its external success/failure signal from? for AlphaGo, we had simulators/verifiers giving the success/failure signal. The most interesting question in glomming the Self-play idea for general AI agent is where is it getting this signal? (See e.g.
x.com/rao2z/status/171625758… )
My guess is that the moves are auto-generated CoTs (thus the moves have very high branching factor). Let's assume--for simplification--that we have a CoT-generating LLM, that generates these CoTs conditioned on the prompt.
The success signal is from training data with correct answers. When the expanded prompt seems to contain the correct answer (presumably LLM-judged?), then it is success. If not failure.
The RL task is: Given the original problem prompt, generate and select a CoT, and use it to continue to extend the prompt (possibly generating subgoal CoTs after every few stages). Get the final success/failure signal for the example (for which you do have answer).
Loop on a gazillion training examples with answers, and multiple times per example. [The training examples with answers can either be coming from benchmarks, or from synthetic data with problems and their solutions--using external solvers; see
x.com/rao2z/status/171625758…]
Let RL do its thing to figure out credit-blame assignment for the CoTs that were used in that example. Incorporate this RL backup signal into the CoT genertor's weights (?).
<At this point, you now have a CoT generator that is better than before the RL stage>
During inference, stage, you can basically do rollouts (a la the original AlphaGo) to further improve the effectiveness of the moves ("internal CoT's"). The higher the roll out, the longer the time.
My guess is that what O1 is printing as a summary is just a summary of the "winning path" (according to it)--rather than the full roll out tree.
===
Assuming I am on the right path here in guessing what o1 is doing, a couple corollaries:
1. This can at least be better than just fine tuning on the synthetic data (again see
x.com/rao2z/status/171625758…)--we are getting more leverage out of the data by learning move (auto CoT) generators. [Think behavior cloning vs. RL..]
2. There will not still be any guarantees that the answers provided are "correct"--they may be probabilistically a little more correct (subject to the training data). If you want guarantees, you still will need some sort of LLM-Modulo approach even on top of this (c.f.
arxiv.org/abs/2402.01817).
3. It is certainly not clear that anyone will be willing to really wait for long periods of time during inference (it is already painful to wait for 10 sec for a 10 word last letter concatenation!). See
x.com/rao2z/status/183431495…
The kind of people who will wait for longer periods would certainly want guarantees--and there are deep and narrow System 2's a plenty that can be used for many such cases.
4. There is a bit of a Ship of Theseus feel to calling o1 an LLM--considering how far it is from the other LLM models (all of which essentially have teacher-forced training and sub-real-time next token prediction. That said, this is certainly an interesting way to build a generalized system 2'ish component on top of LLM substrates--but without guarantees. I think we will need to understand how this would combine with other efforts to get System 2 behavior--including LLM-Modulo (
arxiv.org/abs/2402.01817) that give guarantees for specific classes.
to be contd..
One thing about o1 model is to what extent end users are actually at all interested in "waiting" (see the death of online computation thread below). And if they are actually have the patience, there do exist guaranteed deep/narrow System 2 solvers for specific problems!