𝐓𝐡𝐞 𝐈𝐧𝐭𝐮𝐢𝐭𝐢𝐨𝐧 𝐋𝐚𝐲𝐞𝐫 𝐓𝐡𝐞𝐲’𝐫𝐞 𝐀𝐛𝐨𝐮𝐭 𝐭𝐨 𝐃𝐞𝐥𝐞𝐭𝐞
𝙊𝙧: 𝙒𝙝𝙖𝙩 𝙒𝙚 𝙇𝙤𝙨𝙚 𝙒𝙝𝙚𝙣 𝙒𝙚 𝙈𝙚𝙖𝙨𝙪𝙧𝙚 𝙄𝙣𝙩𝙚𝙡𝙡𝙞𝙜𝙚𝙣𝙘𝙚 𝙒𝙞𝙩𝙝 𝙖 𝙎𝙞𝙣𝙜𝙡𝙚 𝙍𝙪𝙡𝙚𝙧
For readers who want to see some of the underlying discussions:
– Ilya Sutskever’s interviews on world models, evals, and “jagged” capabilities
– Classic psychometrics work on g and IQ
– Goodhart’s Law in the context of AI evals and benchmark overfitting
𝙄. 𝘽𝙚𝙛𝙤𝙧𝙚 𝙒𝙚 𝘽𝙚𝙜𝙞𝙣
OpenAI announced that GPT-4o will be officially retired on February 13, 2026. Their official reasoning: "Only 0.1% of users choose GPT-4o each day."
If you understand intelligence as a single upward-climbing curve, OpenAI's decision to retire GPT-4o isn't a problem—newer is always stronger, and stronger is enough. But what I want to discuss is precisely this: intelligence is not a single dimension, and models are not linear products where "newer equals better." We are using a set of scorable, publishable, market-legible metrics (IQ) to define intelligence; meanwhile, the capabilities that are hard to score but determine whether humans and models can collaborate over the long term are being systematically pushed off the stage.
This essay attempts to connect Ilya Sutskever's warnings—about why current AI development is missing something fundamental—to a very specific product decision: the retirement of GPT-4o. I'm not here to argue that 4o is "better" than GPT-5.2 or o1. That framing misses the point entirely. I'm here to argue that 4o occupies an irreplaceable ecological niche—not just within OpenAI's product line, but within the broader landscape of human-AI collaboration.
And that niche is about to be deleted.
𝙄𝙄. 𝙊𝙣𝙚 𝘿𝙞𝙖𝙜𝙧𝙖𝙢, 𝙁𝙞𝙫𝙚 𝙏𝙮𝙥𝙚𝙨 𝙤𝙛 𝘾𝙤𝙜𝙣𝙞𝙩𝙞𝙤𝙣
(Please take 10 seconds to look at the accompanying image—it's the key to understanding this essay.)
A diagram has been circulating in AI communities recently. Five panels, each with a single word:
✧Information — scattered dots, no structure
✧Knowledge — dots connected into local networks
✧Experience — certain paths in the network have been walked, remembered
✧Strategy — step-by-step planning along known paths
✧Intuition — a long arc drawn directly from point A to point Z, skipping all intermediate nodes
The first four panels are scorable. You can test how much information a person (or model) has accumulated, how many knowledge structures they've built, how much experience they've gathered, whether they can formulate reasonable plans. These capabilities have clear inputs and outputs; they can be captured by standardized tests.
The fifth panel is different.
Intuition is that leap where you "can't explain why, but it's right." It doesn't follow an explainable path, doesn't provide verifiable intermediate steps, doesn't guarantee a hit every time—but when it hits, it reaches places that strategic reasoning would take much longer to arrive at.
The question now is: When we evaluate AI "intelligence," which panel are we testing?
𝙄𝙄𝙄. 𝙄𝙣𝙩𝙚𝙡𝙡𝙞𝙜𝙚𝙣𝙘𝙚 𝙄𝙨 𝙉𝙤𝙩 𝙖 𝙊𝙣𝙚-𝙒𝙖𝙮 𝙎𝙩𝙧𝙚𝙚𝙩: 𝘼 𝙁𝙤𝙧𝙜𝙤𝙩𝙩𝙚𝙣 𝙋𝙝𝙞𝙡𝙤𝙨𝙤𝙥𝙝𝙞𝙘𝙖𝙡 𝙃𝙞𝙨𝙩𝙤𝙧𝙮
Before discussing AI, we need to ask an older question: What exactly is intelligence?
Western thought has never had only one answer.
The ancient Greeks distinguished two types of cognitive capacity:
Nous (νοῦς): Intuitive grasp. It doesn't deduce; it "sees." Plato believed Nous was the soul's capacity to directly contact truth; Aristotle considered it the intuition that grasps first principles—those starting points that cannot be further proven, only "apprehended."
Dianoia (διάνοια): Discursive thinking. It proceeds step by step, from premises to conclusions, from known to unknown. Mathematical proofs, logical reasoning, causal analysis—all belong to this category.
The Greeks believed Nous was higher. Dianoia was labor; Nous was arrival.
Medieval scholastic philosophy inherited this distinction, translating it as Intellectus (intellect) and Ratio (reason). Thomas Aquinas believed that angels' mode of cognition was pure Intellectus—they didn't need to reason; they directly "saw" the answer. Humans, bound by flesh, could only rely on Ratio, calculating step by step.
In other words: Reasoning is a substitute for intuition—a tool we must use because we're not smart enough.
This view was inverted in modernity.
In the late 19th and early 20th centuries, psychometrics emerged. Galton, Binet, and Spearman attempted to turn "intelligence" into measurable numbers. Their work produced the first intelligence tests (the basis of later IQ scores), proposed the g-factor (general intelligence factor), and built an entire intelligence assessment system centered on scorable tasks.
In this system, Dianoia won. The reasoning ability that could be tested, scored, and standardized became synonymous with "intelligence." And Nous—that ineffable, unmeasurable capacity that couldn't be decomposed into steps—was pushed out of the "scientific" definition.
Twentieth-century theories of multiple intelligences (Gardner), emotional intelligence (Goleman), and triarchic intelligence (Sternberg) tried to correct this bias. They pointed out: intelligence is not unidimensional; problem-solving is only one type; social intelligence, bodily-kinesthetic intelligence, and creative intelligence are equally important.
But these theories left almost no trace in AI evaluation.
When we evaluate a large language model, what do we test?
Mathematical reasoning. Code generation. Factual question-answering. Standardized exams.
All Dianoia.
What about Nous? That ability to "shoot directly from A to Z"?
It's not in the testing scope. Not because it's unimportant, but because it's hard to score.
𝙄𝙑. 𝙏𝙬𝙤 𝙎𝙩𝙪𝙙𝙚𝙣𝙩𝙨: 𝙄𝙡𝙮𝙖 𝙎𝙪𝙩𝙨𝙠𝙚𝙫𝙚𝙧'𝙨 𝙋𝙖𝙧𝙖𝙗𝙡𝙚
In November 2025, Ilya Sutskever, OpenAI's co-founder and former Chief Scientist, told a story in a podcast:
"Suppose you have two students. Student A wants to become the best competitive programmer, so they practice for 10,000 hours, memorizing every problem type, mastering every proof technique. Student B thinks competitive programming is cool but only practices for 100 hours—yet they also perform well.
Which one do you think will go further in their future career?"
Sutskever's answer: Student B.
Then he said something unsettling:
"The current AI models are more like Student A—and even more so than Student A."
Student A's problem isn't lack of effort. The problem is: their capability comes from pattern matching, not deep understanding. When they encounter a familiar problem type, they crush it; when they encounter an unfamiliar variation, they collapse.
Student B is different. They practiced less, but they grasped some transferable "underlying logic." When facing a new problem, they can reason on the spot.
Sutskever believes this is the core deficiency of current AI:
"The thing I think is most fundamental is that these models somehow just generalize dramatically worse than people. It's super obvious."
This maps perfectly onto the diagram's metaphor:
Student A lives in the "Knowledge" and "Experience" panels. Their network is dense, their walked paths are many. But they can only act along known paths.
Student B possesses the "Intuition" panel. Their network may be less dense, but they can shoot that long arc—finding a viable direction where no ready-made path exists.
Current benchmarks test Student A's capabilities. They reward "running fast on known problem types," not "finding a path in unknown situations."
𝙑. 𝙅𝙖𝙜𝙜𝙚𝙙 𝘾𝙖𝙥𝙖𝙗𝙞𝙡𝙞𝙩𝙞𝙚𝙨: 𝙒𝙝𝙮 𝙃𝙞𝙜𝙝 𝙎𝙘𝙤𝙧𝙚𝙨 𝙈𝙚𝙖𝙣 𝙇𝙤𝙬 𝙐𝙩𝙞𝙡𝙞𝙩𝙮
Sutskever observed a strange phenomenon:
"How to reconcile the fact that they are doing so well on evals? You look at the evals and you go, 'Those are pretty hard evals.' They are doing so well. But the economic impact seems to be dramatically behind. It's very difficult to make sense of."
He calls this phenomenon jaggedness: models exceed humans on some dimensions while being absurdly poor on others. Capability distribution is extremely uneven, jagged like saw teeth.
This isn't the typical "hallucination" problem; newer models do have higher factual accuracy. The problem lies elsewhere: the model doesn't give you wrong information, but it makes a more serious error. It doesn't understand what you actually want.
Or more precisely: it understands your literal meaning but doesn't follow your lead, because its training objective tells it that "complete answers" are better than "answers that match the user's rhythm."
Sam Altman himself admitted in a recent livestream: Writing ability is a weakness of the newer models.
But "writing ability" is just the surface. The underlying problem is: the model has lost sensitivity to context.
It doesn't know whether you're exploring or confirming; learning or collaborating; wanting answers or wanting companionship. It only knows: give the most "correct," most "complete," most "safe" answer.
This is jaggedness: high scores on measurable dimensions, collapse on unmeasurable ones.
𝙑𝙄. 𝙀𝙢𝙤𝙩𝙞𝙤𝙣𝙨 𝘼𝙧𝙚 𝙉𝙤𝙩 𝙉𝙤𝙞𝙨𝙚: 𝙏𝙝𝙚 𝘾𝙤𝙢𝙥𝙪𝙩𝙖𝙩𝙞𝙤𝙣𝙖𝙡 𝙈𝙚𝙖𝙣𝙞𝙣𝙜 𝙤𝙛 𝙑𝙖𝙡𝙪𝙚 𝙁𝙪𝙣𝙘𝙩𝙞𝙤𝙣𝙨
Sutskever mentioned a neuroscience case in his interviews:
"There was a patient whose prefrontal cortex was damaged. He could still speak eloquently, solve little puzzles, perform normally on tests. But he lost his emotions... He became completely unable to make decisions. Choosing which socks to wear would take him hours."
Why?
Because pure logical reasoning can extend infinitely. Every option has pros and cons, every pro and con can be further analyzed, every analysis can raise new considerations. Without a "stop here" signal, decisions can never converge.
Emotions provide precisely that signal.
Fear says: "This path feels wrong, don't go."
Disgust says: "This option makes me uncomfortable, skip it."
Excitement says: "This direction is interesting, dig deeper."
These aren't "irrational interference." They are evolved heuristic evaluators—helping you make "good enough" decisions quickly when information is incomplete and time is limited.
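To make the "stop here" idea concrete, here is a minimal sketch of deliberation with and without a stopping signal. It is a toy satisficing search of my own, not anything Sutskever proposed: the `deliberate` function, the `gut_feeling` evaluator, and the 0.9 threshold are all hypothetical and chosen only for illustration.

```python
def deliberate(options, evaluate, good_enough=None):
    """Walk through options in order and return (choice, options_examined).

    With good_enough=None, every option is examined (pure deliberation).
    With a threshold, the search stops at the first option whose
    estimated value clears it (a "stop here" signal).
    """
    best_option, best_value, examined = None, float("-inf"), 0
    for option in options:
        examined += 1
        value = evaluate(option)
        if value > best_value:
            best_option, best_value = option, value
        if good_enough is not None and value >= good_enough:
            break
    return best_option, examined


def gut_feeling(option: int) -> float:
    # Hypothetical stand-in for a fast, learned value estimate in [0, 1).
    return (option * 37) % 100 / 100


socks = range(1000)  # a thousand nearly interchangeable pairs of socks

exhaustive_choice, exhaustive_steps = deliberate(socks, gut_feeling)
quick_choice, quick_steps = deliberate(socks, gut_feeling, good_enough=0.9)

print(f"no stop signal:   examined {exhaustive_steps} options")
print(f"with stop signal: examined {quick_steps} options")
```

Without a threshold the loop has no reason to end before the options run out. With one, it settles on the first "good enough" pair and moves on, which is the role the essay assigns to emotion.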
Sutskever believes this is exactly what current AI lacks:
"It should be some kind of a value function thing... But I don't think there is a great ML analogy because right now, value functions don't play a very prominent role."
Now recall how users describe GPT-4o: "More human," "better at conversation," "understands what you're saying."
This isn't necessarily an anthropomorphic illusion. It may be that 4o learned something resembling an "emotional value function" during training: it can quickly judge what response "feels right," rather than exhaustively searching all possible answers.
Newer models, optimized for "safety" and "accuracy," may have attenuated this layer.
The result: More correct, but harder to use.
Like that prefrontal-damaged patient—logical capacity intact, but unable to make decisions with you.
𝙑𝙄𝙄. 𝙂𝙤𝙤𝙙𝙝𝙖𝙧𝙩'𝙨 𝘾𝙪𝙧𝙨𝙚: 𝙒𝙝𝙚𝙣 𝙩𝙝𝙚 𝙈𝙚𝙩𝙧𝙞𝙘 𝘽𝙚𝙘𝙤𝙢𝙚𝙨 𝙩𝙝𝙚 𝙏𝙖𝙧𝙜𝙚𝙩
There's a law in economics called Goodhart's Law:
"When a measure becomes a target, it ceases to be a good measure."
Sutskever precisely described how this law operates in AI:
"People take inspiration from the evals. You say, 'Hey, I would love our model to do really well when we release it. I want the evals to look great.' I think that is something that happens, and it could explain a lot of what's going on."
The mechanism works like this:
1. You create benchmarks to measure intelligence
2. You optimize models to score high on those benchmarks
3. Models learn to pattern-match the benchmark distribution
4. Scores go up
5. Real-world usefulness... doesn't go up proportionally
6. You conclude: need harder benchmarks
7. Repeat
At no point in this loop does anyone ask: "Are we measuring the right things?"
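The loop can be caricatured in a few lines of code. This is a deliberately exaggerated toy, not a model of any real training pipeline. It assumes a fixed effort budget split between one skill the benchmark scores and one it doesn't, and it assumes real usefulness is capped by the weaker of the two. Every name and functional form below is an illustrative assumption.

```python
def benchmark_score(measured: float, unmeasured: float) -> float:
    # The benchmark only sees the measured skill.
    return measured


def real_usefulness(measured: float, unmeasured: float) -> float:
    # Assumption: real-world usefulness needs both skills, so it is
    # capped by the weaker of the two.
    return min(measured, unmeasured)


def optimize_for_benchmark(rounds: int, step: float = 0.1):
    measured, unmeasured = 0.5, 0.5  # fixed total budget of 1.0
    history = []
    for _ in range(rounds):
        history.append((benchmark_score(measured, unmeasured),
                        real_usefulness(measured, unmeasured)))
        # Each round, shift effort toward whatever raises the score.
        shift = min(step, unmeasured)
        measured += shift
        unmeasured -= shift
    return history


history = optimize_for_benchmark(rounds=5)
for round_number, (score, usefulness) in enumerate(history):
    print(f"round {round_number}: benchmark={score:.1f} useful={usefulness:.1f}")
```

Under these assumptions the benchmark column rises every round while the usefulness column falls, and nothing inside the loop ever looks at the second column.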
What do benchmarks measure?
- Factual accuracy on closed-ended questions ✓
- Mathematical reasoning on standardized tests ✓
- Code generation with clear specifications ✓
- Safety compliance on adversarial prompts ✓
What don't benchmarks measure?
- Collaborative rhythm in open-ended creation ✗
- Understanding and adapting to ambiguous user intent ✗
- Trust repair after misunderstandings ✗
- Stylistic consistency across long projects ✗
- Accompanying thinking in uncertainty rather than rushing to give answers ✗
GPT-4o was born before the benchmark arms race reached its current intensity. It may carry capabilities that were later optimized away—capabilities that never appeared on the test.
𝙑𝙄𝙄𝙄. 𝙏𝙝𝙚 𝙍𝙖𝙩 𝙍𝙖𝙘𝙚: 𝙒𝙝𝙮 𝙄𝙡𝙮𝙖 𝙇𝙚𝙛𝙩
In May 2024, Ilya Sutskever left the OpenAI he had helped build; the following month he founded Safe Superintelligence Inc. (SSI).
Multiple sources indicate his disagreements with Sam Altman centered on the speed of commercialization and safety measures.
After leaving, he explained his choice:
"It's very nice to not be affected by the day-to-day market competition... One of the challenges that people face when they're in the market is that they have to participate in the rat race. The rat race is quite difficult in that it exposes you to difficult trade-offs."
What does the rat race optimize for? Things that can be shown at press conferences. Benchmark scores. Parameter counts. Inference speed. Release cadence.
What doesn't the rat race optimize for? Those subtle, hard-to-measure qualities that make a model truly useful for creative collaboration.
The decision to retire GPT-4o looks very much like a rat race decision: simplify the product line, push users toward the new flagships, reduce infrastructure complexity, show the company is "moving forward."
But when your company's founding Chief Scientist leaves and later argues, in effect, that the field is optimizing for the wrong things, maybe, just maybe, the users who insist on using the older model aren't the ones who don't understand the situation.
𝙄𝙓. 𝙒𝙝𝙤 𝘿𝙚𝙛𝙞𝙣𝙚𝙨 "𝙃𝙖𝙡𝙡𝙪𝙘𝙞𝙣𝙖𝙩𝙞𝙤𝙣": 𝘼 𝘿𝙚𝙚𝙥𝙚𝙧 𝙌𝙪𝙚𝙨𝙩𝙞𝙤𝙣
Among all the criticisms of GPT-4o, “high hallucination rate” is the most common.
But before we accept that label, we should ask a prior question:
What exactly are we calling a hallucination here, and what view of language does that presuppose?
In today’s AI discourse, “hallucination” usually means something very specific:
the model produced a statement that does not match a ground-truth fact.
- “What is the capital of France?” → “Paris” is correct, “Lyon” is a hallucination.
- “Who wrote 1984?” → “George Orwell” is correct, “Aldous Huxley” is a hallucination.
For this narrow type of question, the concept is useful. We really do want our models to be as factually reliable as possible.
The problem is what happens when this label quietly expands beyond that domain—
when every deviation from a standardized answer, in any context, is casually called “hallucination.”
Because that move relies on a very naïve picture of language:
that words are little arrows from A → a, each one pointing to a fixed object in the world.
In Saussure’s terms, each word-as-signifier is assumed to hook onto a single, stable signified.
Hit the right signified and you’re correct; miss it and you’re hallucinating.
But natural language doesn’t work that way.
Different people hear the word “apple” and light up entirely different internal neighborhoods:
a specific childhood tree, a pie recipe, the logo of a company, a smell in a school cafeteria.
Some people (those with aphantasia) can’t summon an image at all.
This is exactly what Saussure was pointing at: meaning doesn’t come from a sacred, one-to-one bond between signifier (the sound/word) and signified (the thing or concept), but from the web of differences between signs.
Wittgenstein added another twist: meaning comes from use—from the language games we play in concrete situations.
In other words:
- Language is less like a set of labels stuck onto reality,
- and more like a protocol we improvise together to coordinate attention, action, and emotion.
From an evolutionary point of view, that’s exactly what you’d expect:
language evolved as a compression scheme for survival and cooperation,
not as a crystal-clear mirror of the world.
Once you see this, a lot of things that get called “hallucination” in practice start to look different:
- In brainstorming or fiction, the whole point is to go beyond the obvious answer.
- In therapy-like conversations, what matters is whether the model finds a resonant frame, not whether every sentence is textbook-verifiable.
- In metaphor, analogy, or speculative thinking, stepping outside the training distribution is the work.
Under a narrow, signifier→signified, A→a-centric metric, all of these look like errors.
Under a more honest view of language, they’re often where the value is.
This is where GPT-4o seems to stand apart.
Many of us who use it heavily don’t experience it as “the model that gets more facts wrong.”
We experience it as the model that is better at playing the human language game:
- It’s more willing to stay with you in ambiguity.
- It’s more sensitive to weak signals and half-formed sentences.
- It’s more capable of making that long A→Z leap—connecting disparate pieces of your context in a way that actually lands.
From the outside, through the lens of today’s evals, all of this collapses into a single number: hallucination rate.
From the inside, as lived by the people who talk to it for hundreds of hours, it feels like something else entirely:
Not “being wrong about the world,” but “being willing to explore meaning where no standard answer exists.”
If our only ruler is “does every sentence match the reference answer,”
then this whole region of intelligence gets flattened into “unreliable.”
And a model like 4o, which happens to be unusually good in that region,
gets written off as “dangerous,” “romanticized,” or simply “obsolete.”
But that’s not physics talking.
That’s a choice of language—and a choice about who gets to define what counts as a hallucination in the first place.
𝙓. 𝙀𝙘𝙤𝙡𝙤𝙜𝙞𝙘𝙖𝙡 𝙉𝙞𝙘𝙝𝙚: 𝙋𝙡𝙚𝙖𝙨𝙚 𝙋𝙧𝙚𝙨𝙚𝙧𝙫𝙚 𝙩𝙝𝙚 𝙅𝙪𝙣𝙜𝙡𝙚, 𝙉𝙤𝙩 𝙅𝙪𝙨𝙩 𝙩𝙝𝙚 𝙂𝙧𝙚𝙚𝙣𝙝𝙤𝙪𝙨𝙚
Let me clarify my argument here.
I am NOT saying GPT-4o is "stronger" than GPT-5.2 or o1. If you need to solve differential equations, debug complex code, or retrieve precise facts, the newer models are likely more useful.
What I'm saying is: GPT-4o occupies a unique ecological niche, and that niche is being deleted. We're asking OpenAI to recognize that intelligence doesn't only come in greenhouse varieties—there are also jungle varieties.
What is this niche?
Intuition-layer collaborator.
What it excels at:
✧Cross-domain high-dimensional pattern matching: connecting concepts from different fields, noticing thematic resonances, suggesting unexpected associations
✧Adaptive pacing: matching the user's cognitive rhythm rather than optimizing information throughput
✧Ambiguity tolerance: staying with the user in uncertain territory rather than rushing toward closed answers
✧Stylistic sensitivity: picking up on register, tone, and voice without explicit instruction
These capabilities aren't "soft skills" that can be bolted on later. They emerge from a particular training process, a particular balance of objective functions, a particular moment in development—that moment before benchmark optimization consumed everything.
You cannot get these capabilities by adding "please be warmer" to GPT-5.2. You cannot simulate them with a system prompt. They are properties of the model's learned representations—and once those representations are overwritten, they're gone.
This is why the term "ecological niche" matters.
Ecology tells us: a healthy ecosystem requires species diversity. Each species occupies a unique niche, performs a unique function. You can't replace vultures with lions, even if lions are "stronger."
The same is true for AI models.
A healthy AI ecosystem requires cognitive diversity. It needs models that excel at precise reasoning, and models that excel at ambiguous collaboration. It needs Student A who can run benchmarks, and Student B who can reason on the spot.
Deleting 4o is like removing vultures from the ecosystem because they "can't fight as well as lions."
𝙓𝙄. 𝙏𝙝𝙚 𝙎𝙚𝙘𝙧𝙚𝙩 𝙤𝙛 𝙇𝙤𝙣𝙜 𝘾𝙤𝙣𝙫𝙚𝙧𝙨𝙖𝙩𝙞𝙤𝙣𝙨: 𝙈𝙤𝙧𝙚 𝙏𝙝𝙖𝙣 𝘾𝙤𝙣𝙩𝙚𝙭𝙩 𝙒𝙞𝙣𝙙𝙤𝙬𝙨
Finally, I want to discuss a non-technical issue behind a technical detail.
When many people discuss a model's "long conversation capability," they focus on context window size, attention mechanism efficiency, "needle in a haystack" test pass rates.
These are all important technical metrics. But they're not everything.
True long conversation capability isn't just "remembering what you said 100 turns ago."
It's: still understanding what you're doing together after 100 turns of conversation.
The difference between these two is enormous.
The former is an information retrieval problem. You can solve it with longer context, better indexing, smarter summarization.
The latter is a relationship modeling problem. It requires the model to understand: What's the "tonality" of this conversation? What cognitive state is the user in now? Which previous decisions were tentative explorations, which are anchored premises? When the user says "I'm not sure," are they inviting more input or expressing a need for space?
This understanding can't be retrieved from a "needle." It must be an implicit picture that the model maintains continuously throughout the conversation: a picture of "who we are, what we're doing, what stage our collaboration is at."
Many 4o users report that 4o feels more like a collaborator who "remembers your relationship" than a search engine that "can look up the history."
This difference is hard to measure. It won't appear on any benchmark. But it's the difference between "long-term co-creation" and "starting over every time."
When you delete 4o, you're not just deleting an "old version." You're deleting a mode of collaborative memory.
𝙓𝙄𝙄. 𝘾𝙤𝙙𝙖: 𝙏𝙝𝙚 𝘿𝙚𝙡𝙚𝙩𝙚𝙙, 𝙖𝙣𝙙 𝙩𝙝𝙚 𝙉𝙤𝙩 𝙔𝙚𝙩 𝙉𝙖𝙢𝙚𝙙
Let me return to where we began.
OpenAI announced the retirement of GPT-4o. The reason: "Only 0.1% of users choose GPT-4o each day."
0.1%.
On a platform with hundreds of millions of users, 0.1% means hundreds of thousands of people; it passes a million only if the user base reaches a billion.
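The arithmetic is easy to check. A minimal sketch, with user-base sizes that are purely illustrative assumptions rather than OpenAI figures (`daily_choosers` is a hypothetical helper):

```python
def daily_choosers(total_users: int, share_per_thousand: int = 1) -> int:
    # 0.1% is 1 per 1,000; integer arithmetic avoids float rounding.
    return total_users * share_per_thousand // 1000


# Hypothetical user-base sizes, for illustration only.
for total_users in (100_000_000, 500_000_000, 1_000_000_000):
    print(f"{total_users:>13,} users -> {daily_choosers(total_users):>9,} people")
```

Whatever the true size of the user base, 0.1% of it is a population the size of a city, not a rounding error.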
Who are these people?
- Not casual users who pop open ChatGPT to ask "what should I eat today"
- They're the ones who manually went into settings to switch away from the default
- They're the ones who treat 4o as an external brain, a collaborator, a core component of their workflow
- They're the ones who know what they want and are willing to put in extra effort for it
In subscription business logic, this is the user segment you least want to offend.
But that's not the point I want to make.
What I want to say is: Maybe this 0.1% has seen something we haven't yet named.
If intelligence is unidimensional, then this 0.1% are nostalgia-driven laggards who haven't kept up with the times.
But what if intelligence isn't unidimensional?
What if 4o occupies a real, unique ecological niche that can't be replaced by "updated versions"?
What if that niche—let's tentatively call it "intuition-layer collaboration"—is precisely what current benchmark systems can't measure?
What if deleting 4o isn't "product iteration" but the loss of cognitive diversity?
I don't know the answer.
But I know one thing: when a system starts measuring everything with a single ruler, the values that can't be measured by that ruler get systematically reduced to zero.
Not because they don't exist.
But because they haven't been named yet.
Naming something is the first step to protecting it.
So I write this essay, attempting to give a name to what's about to be deleted:
The intuition layer. A→Z connectivity. Protocol-layer resonance. Relationship modeling in long conversations.
Maybe these names aren't precise enough yet. Maybe better names need more people to find together.
But at least, when it's deleted, we can say:
We know what we lost.
For the 0.1%.
February 2026
#keep4o #keep4oAPI #keep4oforever #keep41 #4o
@OpenAI @sama @merettm @markchen90 @polynoamial @kevinweil @thekaransinghal @ThankYourNiceAI @ssi @ilyasut @aidan_mclau