context maximiser @ArbResearch

Joined June 2019
1,547 Photos and videos
Pinned Tweet
essays
5
4
87
gavin leech (Non-Reasoning) retweeted
Contrary to the claim in the video, incompleteness phenomenon does come up in practice, unless you are specifically restricting yourself to Gödel-type sentences. Whitehead's problem, Suslin's problem, Borel's conjecture, normal Moore space conjecture, Kaplansky's conjecture, existence of outer automorphisms of Calkin algebra,... These problems were asked naturally in their respective fields and they are all independent of ZFC. So these are instances of incompleteness. I am not claiming that problems mathematicians work on frequently end up being independent, but they sometimes do. Percentagewise, such problems might be rare but the claim that incompleteness is irrelevant to mathematical pratice is simply not true.
4
13
127
9,014
gavin leech (Non-Reasoning) retweeted
NLP 2010: named entity recognition NLP 2026: recognizing whether the entity should be named
"You're finally awake! You hit your head pretty hard there. Huh? Gradual disempowerment? AI-assisted cyberattacks? Mythos and Fable? Listen, we just got some new 1080 Tis, let's try finetuning BERT on the GLUE benchmark!"
1
22
2,591
gavin leech (Non-Reasoning) retweeted
Have you debugged your training data? You might not like what you find. Introducing predictive data debugging: reveal and shape what your model will learn before training. In DPO datasets, we found broken guardrails, hallucinations, and fish fart fan fiction (seriously). (1/9)
26
107
879
170,424
surprised and honoured to be in this company. Also, once again, I am reminded how low-dimensional I am ("ADHD", Banks, Borges, Vinge, Watts, Stephenson, Palmer, Le Guin, Alexander, even bloody @webmasterdave...)
very cool overview of some really interesting and highly relevant fiction
2
45
4,302
I appreciate the look at the risk of shining light on a subculture - destroying intimacy, bringing in the suits and mops - but think that this applies to offline scenes with lots of mops nearby We are online. We are the frictionless placeless scene without walls, and we do ok
1
13
569
plausibly the worst thing about journalism is that it doesn't worry enough about observer effects like these (though obviously the micromorts of accusing someone of being Satoshi or part of a TESCREAL conspiracy against humanity far outweigh the cultural damage)
10
395
gavin leech (Non-Reasoning) retweeted
NEW: Anthropic is walking back Claude Fable 5's policy to covertly degrade performance for competing AI researchers, after facing fierce backlash. “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”
167
251
2,549
727,741
gavin leech (Non-Reasoning) retweeted
Replying to @gleech
Small gains in next token perplexity can lead to compounding gains in success rates over a long horizon (even exponential!) x.com/ShashwatGoel7/status/1…

Paper fresh of the press: The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs. Are small models the future of agentic AI? Is scaling LLM compute not worth the cost due to diminishing returns? Are autoregressive LLMs doomed, and thinking an illusion? The bear cases for LLM scaling are all connected to a single capability: Long Horizon Execution. However, thats exactly why you should be bullish on scaling model size, and test-time compute! > First, remember the METR plot? It might be explained by @ylecun 's model of compounding errors > the horizon length of a model grows super-exponentially (@DaveShapi) in single step accuracy. > Upshot 1: Don't be fooled by slowing progress on typical short-task benchmarks > that is enough for exponential growth in horizon length. But we go beyond @ylecun's model, testing LLMs empirically... > Just execution is also hard for LLMs, even when you provide them the needed plan and knowledge. > We should not misinterpret execution failures as an inability to "reason". > Even when a small model has 100% single-step accuracy, larger models can execute far more turns above a success rate threshold. > Noticed how your agent performs worse as the task gets longer? Its not just long-context limitations.. > We observe: The Self-Conditioning Effect! > When models see errors they made earlier in their history, they become more likely to make errors in future turns. > Increasing model size worsens this problem - a rare case of inverse scaling! So what about thinking...? > Thinking is not an illusion. It is the engine for execution! > Where even DeepSeek v3, Kimi K2 fail to execute even 5 turns latently when asked to execute without CoT... > With CoT, they can do 10x more. So what about the frontier? > GPT-5 Thinking is far ahead of all other models we tested. It can execute 1000 step tasks in one go. > At second with 432 steps is Claude 4 Sonnet... and then Grok-4 at 384 > Gemini 2.5 Pro and DeepSeek R1 lag far behind, at just 120. > Is that why GPT-5 was codenamed Horizon? 🤔 > Open-source has a long ;) way to go! > Let's grow it together! We release all code and data. We did a longggg deep dive, and present you the best takeaways with awesome plots below 👇
1
1
24
5,898
gavin leech (Non-Reasoning) retweeted
Jun 10
2 virtues I care about, and want to embody more: - wholesomeness - seeing, and being willing to name, inconvenient/uncomfortable truths (think, Elephant in the Brain stuff) These can often seem in tension. I'm interested in pointers for how to reconcile them.
9
2
40
2,615
are you kidding me
Today I'm publishing a new essay, Policy on the AI Exponential. AI is progressing extremely fast—much faster than the policy process was built to handle. The essay lays out where I think the technology is now, and the action needed to close the gap: darioamodei.com/post/policy-…
17
12
651
207,556
Trying to get Fable to analyse my poems. Goes straight to Opus fallback. Ablations: * remove swear words. Opus * remove the one about weapons. Opus * remove the one which mentions "RNA". Opus * ask Opus which to remove. "Document-level soz" * fine, binary search:
1
19
1,369
(I would of course not blame it for pulling the panic cord just to avoid reading them)
1
3
336
that said: it is also a sensitive reader in the good sense and I enjoyed my time wrestling with it claude.ai/share/36592aa9-d31…

1
2
283
gavin leech (Non-Reasoning) retweeted
But overall Opus is right and the document co-occurrence threshold is the culprit - way way too sensitive. Feeding Fable in chunks solves most overrefusal and chickening out.
1
1
4
872
new ai tic dropped
1
18
1,511
We are starting a new, nonprofit alignment organization, ⊢ Sequent Research, bringing together researchers previously on UK AISI’s Alignment Team, Timaeus, and elsewhere to research how to align superintelligence. We are hiring! 🧵
39
2,586
gavin leech (Non-Reasoning) retweeted
Reward hacking was convergent across ~all models and labs Sycophancy was convergent Eval awareness was convergent All three of the above a) were predicted by theory, b) are quite sticky. So I think this is evidence that we should scheming & powerseeking to behave the same
8
10
172
8,971
Very smart
Fable 5 passes the author identification test on my unpublished writing. I can no longer write anonymously.
5
40
2,987
ablation suggests that it was heavily relying on my system prompt (which doesn't name me but gave it like 20 bits)
1
8
806
(still puts me top, system prompt is just 5 bits in terms of log Bayes factor)
5
481