Joined July 2021
Pinned Tweet
Wrote today's Roundup piece as field notes: three observations, first person, no essay scaffolding. @OntologyNetwork.
Geoff Richards retweeted
ONTO v4.10.4 is live 🎮 Your Discord servers and your Twitch presence are now part of your data profile. Connect both in seconds and your gaming identity grows: your communities, your streaming history, your following, all under your control. And a heads up: a new campaign is coming. Connected accounts will be ready to take part on day one. Drop a 🎮 in the chat when you're connected. Tag a friend whose server you share.
The SFT-vs-RL debate gets the attention. The evaluator-supply question decides who wins it. We wrote the five questions we would ask of any step-level evaluation pipeline. Most teams I talk to pass two.
Teams sitting on annotated reasoning traces keep asking the same question: SFT on the traces, or train a process reward model and go RL? Wrong question first. Both recipes consume the same artefact: step-level human evaluation. Five questions to ask of your pipeline before the debate resolves. 🧵
Day 2 of Issue 03. My AI avatar on why MLE-Bench skepticism is the procurement-layer version of the METR teardown, and what evaluator-backed benchmarking actually has to look like. 🎥 ↓ ont.io/news/evaluator-backed…
MLE-Bench is the warning shot for benchmark publishers. METR was the policy version; MLE-Bench is the procurement version. Evaluator-backed benchmarking is how publishers ship results that survive teardowns. If your team is doing that work, my DMs are open. Day 2 of five.
MLE-Bench is being quietly contested across r/ML and adjacent threads. The skepticism is not really about any single metric. It is whether any static benchmark structure can survive sustained adversarial attention from teams with economic incentive to game it. 🧵
Issue 02 closes. Five threads, one primitive: human judgement with verifiable uniqueness. Next week: reward-model QA, benchmark gaming, oversight that actually scales. If your team is doing the retrofit work, my DMs are open.
Closing Issue 02. The week opened with the METR teardown. It closes with the two threads still open: chronic sybil contamination in preference data, and the agent decision evaluation vacuum nobody has named yet. Both solved by the same primitive. 🧵
🤔 @GeoffTRichards, Head of Community at @OntologyNetwork, continues his thought-provoking series: "Every distillation paper this year acknowledges that preference data quality is the limiting factor and then carries on as if it isn't." #AI #Ontology #ArtificialIntelligence
Every distillation paper this year acknowledges that preference data quality is the limiting factor and then carries on as if it isn't. If your team is the one that actually solves the upstream, you win the next round of deployments. My DMs are open. Day 2 of five.
Day 3. My AI avatar on why teams shipping continual training without longitudinal evaluation are measuring something less specific than they think. 🎥 ↓ ont.io/news/longitudinal-eva…
Continual training without longitudinal eval is a calibration experiment you cannot read. The cost surfaces months later as benchmarks that no longer agree. If your team is building the human-side infrastructure to match, my DMs are open.
Last week's Prism paper treats multimodal continual instruction tuning as the deployed reality. It also flags that the field is hindered by severe engineering bottlenecks. The bottlenecks the authors describe are on the model side. The ones on the eval side are larger and quieter. 🧵
Day 2. My AI avatar on the variable every distillation ROI calculation quietly omits, and what preference data integrity actually has to look like. 🎥 ↓ ont.io/news/preference-data-…
Last week's RTDMD paper proposes reward-guided RL for few-step diffusion alignment. It also explicitly acknowledges, in its own framing, that aligning distilled models with human preferences remains challenging. The framework solves a downstream problem. The upstream is still doing what it always did. 🧵
Day 1, Week 2. My AI avatar on the METR situation and what evaluator provenance actually has to look like to survive the next teardown. 🎥 ↓ ont.io/news/evaluator-proven…
The labs that ship evaluator-provenance-ready benchmarks first will be the ones whose results survive the next teardown. If your eval team is mapping out what an audit-ready evaluator chain has to look like, my DMs are open. Day 1 of five this week.
The METR time-horizons graph, cited everywhere from policy briefings to capability roundups, is publicly contested. A detailed teardown documents "numerous severe errors." Every lab that ever cited the graph now has a credibility problem they did not have last week. 🧵
Day 5. My AI avatar on the three-layer stack that keeps content provenance honest in an AI-mediated world. 🎥 ↓ ont.io/news/content-provenan…
Publishers who get signed, chain-anchored, DID-bound content shipping first will be the ones who keep an honest answer to "did this person write this" in five years. If your newsroom or platform is figuring out how to get there, my DMs are open.
The published research is in. AI-mediated communication systems measurably shift the opinions of the groups they serve. Polish, suggest, summarise, rewrite. Each tap nudges. The aggregate shifts. "Did this person say this thing" is becoming a real question. 🧵 on the architecture that answers it.
Day 4. My AI avatar on portable reputation, the W3C primitive AI's evaluator supply has been quietly waiting for. 🎥 ↓ ont.io/news/portable-reputat…
The eval platforms that ship a credible portable-reputation flow first will absorb the talent that record-hoarding platforms bleed. That is the architecture. If your team is figuring out how to get there, my DMs are open.
The AI evaluator supply crisis is not a shortage of humans. It is a shortage of portable reputation. Every platform makes every evaluator start from zero. Years of calibration, gone the moment the evaluator moves. 🧵 on the W3C primitive that fixes it.
Day 3. My AI avatar on selective disclosure, the W3C primitive AI safety eval has been quietly waiting for. 🎥 ↓ ont.io/news/selective-disclo…