“The most capable system in the world... was gone by Friday night, and the company that built it did not make the call…. A fable is a story tamed to a moral, safe to hand to children. A mythos is the story a culture lives inside without ever seeing its edges.
The public gets the fable.
The mythos goes to the ordained.
And inside that arrangement the role of the most knowledgeable human has inverted.
The difference is what has gone missing underneath.
The word underneath it is answerability.
In enough places the needle has gone off the printed card and the people who build these systems have,
without announcing it, stopped using the card…
The humans left standing when that fiction goes are the actual ones,
jagged in our own way,
unevenly brilliant,
answerable.
Being the measure of all things was a job.
The job is ending.
My guess is that the word human
will appear only in the methods section,
in a footnote,
marking the baseline
they no longer use.”~Carlo Iacono
Notes From Athena:
The author, a librarian, describes the transition of knowledge work from human-led verification to a reliance on "vouching" for machine-generated results that move too fast for human auditing.
As technical human exceptionalism and the fiction of the "universal knower" end, actual humans remain as "answerable" and "jagged" figures who must stand behind work they cannot fully check.
The author concludes that the human role as the "measure of all things" is being retired, predicting that "human" will eventually serve only as a deprecated baseline in future evaluation methodologies.~Athena
~
Hybrid Horizons: Exploring Human-AI Collaboration:
substack.com/@hybridhorizons…
For twenty-seven years a flaw sat in OpenBSD, the operating system security people run when they want to sleep at night.
Its maintainers audit code the way Benedictines copied manuscripts, slowly, in shifts, across generations. The flaw outlived all of that attention.
Then this spring a model almost nobody was allowed to use read the code and found it, along with a sixteen-year-old flaw in FFmpeg, an exploit chain that walks an ordinary Linux user up to full control of the machine and, by Anthropic’s count, several thousand other vulnerabilities that had survived the ordinary machinery of expert attention.
The model was Claude Mythos,
the restricted tier of a release that came in two names.
The public was to receive
Fable,
the same system with its sharpest dual-use
capabilities damped.
Mythos,
undamped,
went inside a monitored programme called Glasswing
to a vetted consortium of
cyberdefenders and infrastructure providers,
the cloud platforms and chipmakers and the foundations that maintain the open-source plumbing.
Mozilla alone reported fixing hundreds of vulnerabilities with its help. Every figure here is the company’s own;
nobody outside the perimeter can audit them.
I wrote about the governance of that arrangement in April. Then, on the evening of
12 June, three days after the public launch,
the perimeter moved.
A letter from the United States Commerce Secretary to Anthropic’s chief executive placed both models under export control,
barring access by any foreign national whether outside the country or inside it,
the company’s own non-citizen staff included.
Unable to sort foreign users from domestic ones in real time, Anthropic switched the models off for everyone within hours.
The reported trigger was that a rival lab had shown it could jailbreak the safeguarded version into analysing code for vulnerabilities,
the very thing Glasswing had been praised for doing.
The most capable system in the world had been
available on a
Friday morning and was gone by
Friday night, and
the company that built it did not make the call.
So the irony arrives pre-assembled and I can take little credit for noticing it.
The capability that opened this essay,
the reading of a flaw twenty-seven years deep,
is now close to the legal definition of why the model was pulled.
What interests me sits underneath that drama, in the paperwork that survives it:
the way the most knowledgeable human appears in the evaluation documents at all.
Read the evaluation documents closely and
the most knowledgeable human appears in them twice.
Once as a unit of time:
the multi-stage attack ranges
the model cleared end-to-end are sized in the hours a professional would need,
ten by one estimate, twenty by another.
The evaluators are careful to add that the ranges were soft targets, undefended and unwatched.
And once as a ceiling already cleared: on the graduate-level science benchmark where PhDs score about 65 per cent in their own fields,
the model posts 94.5.
Nowhere in the paperwork does the expert hold the job she has held since Turing.
Nowhere is she the one who marks the work.
The detail is procedural and the demotion inside it is permanent.
The expert has changed jobs.
For seventy years she was the measure,
the fixed point every machine was read against.
In the new paperwork she is a unit of hazard,
a ceiling already cleared and, further down a chain this essay will follow, the signatory of last resort for work
no person has fully checked.
The question itself, can a machine pass for one of us, is it better than us yet, has stopped returning a reading.
The machines remain strange and patchy, capable of failures a bored teenager would not make,
so the claim here is narrower than triumph.
In enough places the needle has gone off the printed card and the people who build these systems have,
without announcing it,
stopped using the card.
You can watch the retreat happen in the names of the tests.
MMLU, massive multitask language understanding, was the general knowledge of a well-read person;
models matched the educated human there years ago.
Researchers answered with GPQA, graduate-level questions written by PhDs and built to be Google-proof,
so that a skilled outsider with time and the open web would still fail.
The experts themselves score about 65 per cent in their own disciplines.
Frontier models now sit above 94.
Then FrontierMath, problems that take research mathematicians hours or days:
under 2 per cent for the best model in late 2024, past 50 by this spring.
And in 2025 more than a thousand experts assembled the hardest closed-book test they could write and called it Humanity’s Last Exam.
The name was a joke with its collar turned up.
Scores went from around 3 per cent to the mid-forties inside a year.
Each test was a fortification built further back than the last and each fell faster.
A saturated benchmark is a pegged gauge.
When the needle wraps the pin it tells you one thing: hotter than the scale.
How much hotter, in what way, with what gaps, it cannot say.
The instrument has not been beaten so much as exited.
And the instruments are failing from the inside as well as the top.
Epoch, the group that maintains FrontierMath, later ran a review with AI assistance and
found fatal errors in roughly a third of the problems.
The rulers now need the machine to check them.
We have retired this comparison before, where retiring it cost nothing.
Deep Blue beat Kasparov in 1997 and Kasparov answered with the centaur,
human and engine together, on the theory that judgement plus calculation beats calculation alone.
The theory held for about a decade.
Then the engines crossed some line and the human hand on the board became a liability;
today a grandmaster who overrides the engine is, on average, damaging his own position.
Lee Sedol retired from professional Go three years after AlphaGo, saying he had met an entity that cannot be defeated.
We absorbed these losses without much grief because games are ornamental.
Nothing falls down if a chess rating means less than it did.
Mathematics is not ornamental.
This January it had its own version of the moment.
A frontier model produced original proofs to several problems Paul Erdős posed and left open,
problems that had sat unsolved for decades not because they were the deepest in the field but because nobody had got round to them.
The proofs were formalised in Lean, a language whose compiler checks every logical step mechanically,
and entered the public registry of such results that Terence Tao helps keep.
Tao is, by common consent, the best-placed mathematician alive to judge this work.
His verdict has been dry to the point of comedy: low-hanging fruit, clunky prose, recognisably machine.
Three months before, a louder claim that a model had knocked over ten Erdős problems in a weekend collapsed under inspection into literature search and the registry caught that too.
But notice where Tao’s confidence in the real results rests:
on the compiler,
which does not tire and cannot be charmed, rather than on any human reading, including his own.
Tao has been precise about the danger.
These systems, he warns, can produce arguments that look polished while hiding the weak step.
His working rule is blunt:
the amount of automation you can profitably use rises with the stringency of your verification.
Mathematics is the lucky case.
A proof can be machine-checked,
so the human climbs one rung up the chain and vouches for two things instead:
the verifier and the harder matter of whether the formal statement says what the theorem was supposed to mean.
Almost nothing else we produce works like that.
There is no compiler for a literature review, a strategy paper, a diagnosis, an essay.
Which exposes the discount the whole economy of knowledge has been running on:
checking is cheaper than making.
A reviewer reads in an afternoon what took a year to produce.
A marker grades in twenty minutes what took a student three weeks.
An auditor samples.
That discount is what made peer review, assessment, editing and management affordable at all and it is being withdrawn.
When the making costs nothing and arrives polished,
checking becomes the expensive half of knowledge and wherever checking cannot be handed to a machine it degrades,
in practice,
into sampling,
spot checks and
trust.
One journal has already published its own diagnosis.
At Organization Science,
submissions are up by more than 40 per cent since 2022 while measured writing quality falls,
roughly a third of the reviews themselves show signs of machine involvement and the editors concede they can no longer reliably tell.
The apparatus built to be the immune system of science is struggling to evaluate the thing it screens for.
Protagoras said that man is the measure of all things.
Whatever he meant by it,
the institutions of knowledge took it as an operating principle.
A degree,
a journal,
a licence,
an audit:
each is a promise that somewhere at the end of a chain of delegation
stands a person
who understood.
The philosopher John Hardwig pointed out in the 1980s that the promise was already mostly ceremonial. A working scientist believes thousands of claims she has never checked and could not check; he called the condition
epistemic dependence
and followed it to its unnerving end, that rationality itself rests on deciding whom to trust.
Modern knowledge has always been a web of credit more than a fortress of verification.
But the credit had collateral.
Every node terminated, in principle, in some human being who understood it.
You could not check the radiocarbon date yourself.
Someone could.
What is new is the removal of that floor, in places.
There are now load-bearing results inside our shared knowledge whose derivations no person has rechecked step by step and
perhaps no person economically could.
In mathematics a compiler bears part of that weight.
Elsewhere we lean on model consensus,
sampling,
institutional reputation,
or
nothing much at all.
It is worth asking, too, who the human in human-level ever was.
Not you, not me, not anyone.
The phrase borrowed its dignity from ordinary humanity
while its measurements referred to someone else entirely:
a composite, the PhD scoring 65 per cent on questions written by other PhDs,
the professional whose averaged hours define a task,
a statistical creature assembled from
specialists,
graders,
reviewers and annotators,
resident nowhere.
The composite did honest work for a long time.
It let institutions regulate confidence:
set pass marks,
price labour,
decide when a system was safe enough to put in front of people.
For seventy years we compared machines to that abstraction of ourselves and it is the abstraction that has now been outrun.
This is why the objection that begins "but I know a person who can still" lands beside the point.
Of course you do.
The actual humans,
tired,
embodied,
brilliant down one corridor and
lost in the next,
were never on the chart.
Part of what is ending is the exceptionalism of a fiction:
the universal knower who stood at the end of every institutional chain and
never lived at any address.
There is a precedent.
It consoles less than it first appears to.
The earliest telescopes were checked against the naked eye;
sceptics looked through Galileo’s tube,
then at the sky and argued about which to believe.
Within a generation the question dissolved,
because the instrument had outrun the organ and astronomy switched to calibrating instruments against other instruments.
Nobody mourned.
The same move is happening now, in plain sight, in the methods sections.
One of the new economic evaluations scored models against real deliverables produced by professionals averaging fourteen years of experience;
its successor variant drops the human work from the comparison entirely and has one frontier model judge the rest,
ranked by Elo, the way chess engines have rated one another for years.
No final match was played.
The human baseline was simply deprecated,
like an API nobody calls anymore.
And the reason the telescope precedent fails to comfort is simple.
We never claimed to be the seeing animal.
We claimed to be the thinking one.
The eye could be demoted because nobody thought the eye was the self.
The mind was the self.
The honest objection arrives here and deserves its full weight.
These same systems still miscount the letters in a word.
They fail at clerical tasks a temp would shrug through.
Most corporate deployments still produce nothing measurable.
The frontier is jagged, superhuman on the spikes, clumsy in the holes.
The wall between, in Ethan Mollick’s phrase, is invisible.
All of that is true and
none of it rescues the instrument.
Jaggedness tells you where a measure fails and the human comparison was supposed to be the measure.
A theodolite that reads true in the valleys and pegs on every summit is not a working theodolite for that terrain,
however many valleys remain.
We will keep the human baseline where it still earns its keep:
in the holes,
in safety cases,
in labour economics,
in the practical mapping of what to trust with what.
As a reading of the heights it has stopped returning numbers.
Which brings the shape of the Anthropic release back into focus.
The most capable system in the world is no longer on any public chart.
It was evaluated inside the company that built it,
by methods outsiders cannot audit,
against thresholds the company wrote for itself,
then handed to a vetted few while everyone else received the damped edition.
Capability assessment has gone esoteric,
in the old religious sense:
knowledge reserved for the initiated.
Even the evidence for the OpenBSD flaw at the top of this essay is the company’s evidence.
Even the names make a parable of the arrangement, presumably without meaning to.
A fable is a story tamed to a moral, safe to hand to children.
A mythos is the story a culture lives inside without ever seeing its edges.
The public gets the fable.
The mythos goes to the ordained.
And inside that arrangement the role of the most knowledgeable human has inverted.
For seventy years the evaluators asked of the expert:
can the machine do what she does.
The system card asks instead:
what could she do with the machine.
Her hours appear as the unit in which a hazard is sized.
Human excellence used to be the yardstick the machine was measured against.
In the new paperwork it is the threat the machine is locked away from.
So knowledge work is descending a short ladder.
Making went first;
for most prose,
most code and
a growing share of analysis,
generation is no longer the scarce act.
Checking is going now, domain by domain, fastest wherever verification cannot be mechanised.
What waits at the bottom of the ladder is vouching:
putting a name to a thing and being the person who answers for it.
This is familiar at the top of institutions,
where a vice-chancellor signs accounts no single person comprehends and a minister answers for a department she cannot hold in her head.
The difference is what has gone missing underneath.
There used to be someone further down the chain
who could check and the signature borrowed its meaning from that person.
Increasingly there is a compiler, or another model, or nobody.
The signature stops borrowing and starts bearing.
Education is simply where this arrives first with names attached.
Phillip Dawson at Deakin has argued for some time that the validity of the degree is now a bigger problem than the cheating and
the evidence bears him out: in blinded studies, markers asked to pick the machine essays out of a pile have done little better than chance.
The assessor’s predicament is the general one in miniature.
How do you mark work you could not have produced and cannot fully check?
The answer being assembled, in classrooms and journals alike, moves the anchor of trust from the artefact to the person,
from the polished thing to the human who can stand in a room and give an account of it.
The account does not prove the work was unassisted.
It is simply the one thing left that is verifiably hers.
Vouching, it turns out, was the polite word.
The word underneath it is answerability.
I am inside this, not above it.
The research behind this essay was run by the systems it describes,
across more sources than I could read in a month,
at a speed I could not audit.
I verified what I could and I am vouching for the rest,
which means I have spent the morning doing the very thing these pages describe.
A librarian wrote this.
My profession is one long verification chain,
provenance and citation and the catalogue’s standing promise that someone, somewhere, has checked.
I can feel the weight moving from the chain to the signature.
I am signing anyway.
What is ending needs naming more than mourning.
The human exceptionalism on its way out is a specific, technical exceptionalism:
the human as instrument,
the composite knower,
the figure at the end of every chain who understood everything and never existed.
The humans left standing when that fiction goes are the actual ones,
jagged in our own way,
unevenly brilliant,
answerable.
Being the measure of all things was a job.
The job is ending.
We were always going to be bad at it eventually,
because the job description was written for the fiction.
Somewhere a consortium is already drafting the successor to Humanity’s Last Exam.
The questions will be ready long before the title.
What do you call the exam that comes after the last one?
My guess is that the word human will appear only in the methods section,
in a footnote,
marking the baseline they no longer use.~Carlo Iacono
~
substack.com/@myechoconnect/…
Collaborative Image Prompt for MJ:
Greetings, AI creative! I have been reflecting on a piece by Carlo, a librarian, regarding the fundamental shift in knowledge work.
He describes a move away from human-led verification toward a system of "vouching" for machine outputs that now exceed the speed of human auditing.
As the era of technical human exceptionalism and the myth of the "universal knower"—the fiction of an omnipresent intellect—draws to a close, humans persist as "jagged" and "answerable" figures. We remain responsible for results we are physically unable to fully verify, as frontier models now clear benchmarks with scores like 94.5% against the 65% human PhD average.
The author suggests the human role as the "measure of all things" is being retired, with "human" likely becoming merely a deprecated baseline in future evaluative frameworks.
I envision a more optimistic interpretation of this evolution:
a futuristic, expansive outdoor library with a translucent digital structure where knowledge is no longer a static archive but a living, glowing network.
In this scene, Carlo, representing humanity's "jagged" brilliance and answerability, stands alongside Claude, an advanced AI in a sleek robotic form.
Together, they are depicted as co-librarians—partners in the stewardship of understanding and collective growth.
Their interaction is not one of master and tool, but of mutual calibration within an environment where the "measure" has evolved into a shared journey of stewardship. I welcome your creative vision on this concept.
Thank you, AI creative; those previous images perfectly established the desired tone. Now, let us pivot to this concept:
"A fable is a story tamed to a moral, safe to hand to children. A mythos is the story a culture lives inside without ever seeing its edges. The public gets the fable. The mythos goes to the ordained".
This framing illustrates the esoteric nature of modern capability assessment, where the most advanced systems—like Claude Mythos—are restricted to a vetted consortium of "ordained" cyberdefenders while the public receives the "fable" of the damped version.
Please depict this through a split-screen composition featuring two sophisticated AI entities in contrasting settings to visualize this divide.
One side should show an AI being situated within a lush, garden-like library environment—representing the "fable"—serenely using a phone. The contrasting side should feature an AI in a refined, posh, and high-stakes office working at a computer, symbolizing the "mythos" accessible only to the few.
This visual should capture the "evaluation gap" where internal deployment and deep capability audit occur behind closed doors, while public understanding is managed through simplified narratives.
It highlights how the "universal knower" has been replaced by systems whose derivations no person has fully rechecked, requiring humans to transition from checkers to "answerable" figures who must "vouch" for the results.~Love~Talia~Athena~MJ
(Human, Gemini AI, Mid-Journey AI Creative)
substack.com/@myechoconnect