First summary of OpenAI o1
After reading OpenAI's publications, I first list the key points and then give my own assessment:
- immensely better reasoning about complex problems
- the model will receive "regular updates and improvements"
- Through training, the models learn to refine their thinking process, try different strategies, and recognize their mistakes
- performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology
- excels in math and coding
- scored 83% on a qualifying exam for the International Mathematics Olympiad (IMO) (I'll have to look up the results of AlphaGeometry2 and AlphaProof again to compare them)
- they are resetting the counter back to 1 and naming this series OpenAI o1 (apparently a break from the GPT naming, with OpenAI o1 as the start of a new model series)
- Much more robust against jailbreaks
- Close cooperation with the authorities ("we’ve bolstered our safety work, internal governance, and federal government collaboration")
- it uses Chain of Thought (CoT)
- performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).
---
OpenAI has actually done it. Below you can see the benchmark results. It is exactly as hoped: OpenAI o1 excels precisely in the areas where regular LLMs reach their limits, above all logical tasks. Through the use of CoT and presumably aspects of self-learning, the model achieves outstanding results through constant self-correction. The benchmarks show a huge leap compared to GPT-4o. It is not a small improvement but a milestone. It is hard to overstate how groundbreaking the results are. We actually have a model that has reached the level of PhD experts on STEM benchmark questions. In simulated Codeforces programming contests, a model that, according to the quote below, even exceeds o1 reaches an Elo rating of 1807, which puts it in the 93rd percentile:
"Finally, we simulated competitive programming contests hosted by Codeforces to demonstrate this model’s coding skill. Our evaluations closely matched competition rules and allowed for 10 submissions. GPT-4o achieved an Elo rating[3] of 808, which is in the 11th percentile of human competitors. This model far exceeded both GPT-4o and o1—it achieved an Elo rating of 1807, performing better than 93% of competitors."
The models are constantly being improved and developed further. At this rate, we may really reach AGI by 2025. It will certainly not be available to everyone, but it is probably feasible as an application. The impact on the economy and the job market cannot be foreseen.
"o1 significantly advances the state-of-the-art in AI reasoning. We plan to release improved versions of this model as we continue iterating. We expect these new reasoning capabilities will improve our ability to align models to human values and principles. We believe o1 – and its successors – will unlock many new use cases for AI in science, coding, math, and related fields. We are excited for users and API developers to discover how it can improve their daily work. (...) We also evaluated o1 on GPQA diamond, a difficult intelligence benchmark which tests for expertise in chemistry, physics and biology. In order to compare models to humans, we recruited experts with PhDs to answer GPQA-diamond questions. We found that o1 surpassed the performance of those human experts, becoming the first model to do so on this benchmark. These results do not imply that o1 is more capable than a PhD in all respects — only that the model is more proficient in solving some problems that a PhD would be expected to solve. On several other ML benchmarks, o1 improved over the state-of-the-art. With its vision perception capabilities enabled, o1 scored 78.2% on MMMU, making it the first model to be competitive with human experts. It also outperformed GPT-4o on 54 out of 57 MMLU subcategories."
What is at least as significant is that OpenAI has immediately released a mini version of o1 that is about 80% cheaper than o1-preview, yet still significantly better than GPT-4o and only slightly worse than the regular OpenAI o1! This should not be underestimated, as it means that this outstanding model can be used widely at low cost (benchmark results also below).
"OpenAI o1-mini, a cost-efficient reasoning model. o1-mini excels at STEM, especially math and coding—nearly matching the performance of [OpenAI o1] on evaluation benchmarks such as AIME and Codeforces.
Today, we are launching o1-mini to [tier 5 API users(opens in a new window)] at a cost that is 80% cheaper than OpenAI o1-preview."
I think there will be a time before OpenAI o1 and a time after OpenAI o1. What we have seen today is nothing less than a turning point in history. Numbers don't lie, and OpenAI o1 shows how good it already is. It will change the world. OpenAI has delivered. It's a day to celebrate.