Research Lead @Amazon | Self-improving AI | ex-@CarnegieMellon, @MSFTResearch | DM for collab, massive compute available | Building github.com/A-EVO-lab

Joined June 2016
43 Photos and videos
Pinned Tweet
Current Recursive Self-Improving AI misses a ๐ฉ๐ฎ๐›๐ฅ๐ข๐œ datapoint: AI trains a frontier-class model. Today we share ๐€-๐„๐ฏ๐จ๐ฅ๐ฏ๐ž-๐“๐ซ๐š๐ข๐ง๐ข๐ง๐  that ๐š๐ฎ๐ญ๐จ-๐ฉ๐จ๐ฌ๐ญ-๐ญ๐ซ๐š๐ข๐ง๐ž๐ ๐š ๐Ÿ‘๐ŸŽ๐ ๐๐ž๐ฆ๐จ๐ญ๐ซ๐จ๐ง model with no human-in-the-loop. ~10ยฒร— larger than prior public autoresearch (GPT2, ~124M) demos. It achieved 0.86 on NVIDIA Nemotron Reasoning Challenge (snapshot 6/1 as of writing the tech report) vs human #1's 0.87. Same system has been run at ๐Ÿ๐Ÿ๐ŸŽ๐ ๐š๐ง๐ ๐Ÿ“๐Ÿ“๐ŸŽ๐ scale; effectiveness claim deferred at those scales pending human anchor. But the important part is not the score: the loop didn't just execute โ€” it discovered a flaw in its own optimization, and inverted its own objective mid-campaign. ๐Ÿ‘‡Tech report and website
Our internal data shows Claude is accelerating AI developmentโ€”a possible path to recursive self-improvement, or AI autonomously building a more capable successor. Itโ€™s happening faster than we thought, and the implications deserve greater attention. anthropic.com/institute/recuโ€ฆ
4
10
29
6,533
Current Recursive Self-Improving AI misses a ๐ฉ๐ฎ๐›๐ฅ๐ข๐œ datapoint: AI trains a frontier-class model. Today we share ๐€-๐„๐ฏ๐จ๐ฅ๐ฏ๐ž-๐“๐ซ๐š๐ข๐ง๐ข๐ง๐  that ๐š๐ฎ๐ญ๐จ-๐ฉ๐จ๐ฌ๐ญ-๐ญ๐ซ๐š๐ข๐ง๐ž๐ ๐š ๐Ÿ‘๐ŸŽ๐ ๐๐ž๐ฆ๐จ๐ญ๐ซ๐จ๐ง model with no human-in-the-loop. ~10ยฒร— larger than prior public autoresearch (GPT2, ~124M) demos. It achieved 0.86 on NVIDIA Nemotron Reasoning Challenge (snapshot 6/1 as of writing the tech report) vs human #1's 0.87. Same system has been run at ๐Ÿ๐Ÿ๐ŸŽ๐ ๐š๐ง๐ ๐Ÿ“๐Ÿ“๐ŸŽ๐ scale; effectiveness claim deferred at those scales pending human anchor. But the important part is not the score: the loop didn't just execute โ€” it discovered a flaw in its own optimization, and inverted its own objective mid-campaign. ๐Ÿ‘‡Tech report and website
Our internal data shows Claude is accelerating AI developmentโ€”a possible path to recursive self-improvement, or AI autonomously building a more capable successor. Itโ€™s happening faster than we thought, and the implications deserve greater attention. anthropic.com/institute/recuโ€ฆ
4
10
29
6,533
๐ƒ๐ข๐ฌ๐œ๐จ๐ฏ๐ž๐ซ๐ฒ, ๐ง๐จ๐ญ ๐ฃ๐ฎ๐ฌ๐ญ ๐จ๐ฉ๐ญ๐ข๐ฆ๐ข๐ณ๐š๐ญ๐ข๐จ๐ง The A-Evolve-Training auto-post-train loop discovered that its own proxy had become misleading. It optimized the internal dev metric to record highs, saw those gains fail to transfer, then changed the search policy itself: stop maximizing the proxy, look for interventions that might lower it while improving the external target. That is the moment this became interesting. Not "AI ran our experiments faster." Not "AI tuned some hyperparameters." But an autonomous research loop falsifying a premise of its own search.
1
2
276
๐Ž๐ง๐ž ๐ฆ๐จ๐ซ๐ž ๐ฌ๐ญ๐ž๐ฉ ๐ญ๐จ๐ฐ๐š๐ซ๐๐ฌ ๐‘๐’๐ˆ A-EVO-Lab studies RSI under one thesis โ€” AI-as-researcher: frontier agents and models play the researcher in the loop that builds better AI. RSI will not arrive as one magic model waking up and rewriting itself. It will look like AI systems taking over the actual work humans do to build AI: โ€” harness building โ€” post-training โ€” pre-training Auto-harness and post-training both have public reports & papers. Follow us for public datapoints on AI systems learning to build AI. GitHub: github.com/A-EVO-Lab
2
166
๐€๐๐š๐ฉ๐ญ๐ข๐ฏ๐ž ๐€๐ฎ๐ญ๐จ-๐‡๐š๐ซ๐ง๐ž๐ฌ๐ฌ - New Research: Even frontier LLMs (Opus) powered SOTA self-improving agents ๐Ÿ๐š๐ข๐ฅ๐ž๐ on real-world task streams (e.g., Prediction Market). With the same ๐Ÿ๐š๐ข๐ฅ๐ฎ๐ซ๐ž ๐ฉ๐š๐ญ๐ญ๐ž๐ซ๐ง: ๐ฉ๐ž๐š๐ค ๐ž๐š๐ซ๐ฅ๐ฒ, ๐ญ๐ก๐ž๐ง ๐๐ž๐œ๐ฅ๐ข๐ง๐ž. A single harness overfits to past patterns. The problem isn't the LLM. The auto-harness must be adaptive. We introduce ๐€๐๐š๐ฉ๐ญ๐ข๐ฏ๐ž ๐€๐ฎ๐ญ๐จ-๐‡๐š๐ซ๐ง๐ž๐ฌ๐ฌ: a tree of regime-specific harness branches, with per-task routing at solve time. Same LLMs, same auto-harness machinery โ€” but the harness now specializes per task instead of compromising across all of them. Results vs 5 auto-harness baselines human-designed harness on 3 real benchmarks: ๐Ÿ”ธ PolyBench (5,075 prediction-market tasks): 80.9% vs 50.8% ๐Ÿ”ธ CTF-Dojo (261 security challenges over 8 years): 50.2% vs 45.2% ๐Ÿ”ธ FutureX (503 forecasting tasks ): 49.5% vs 47.5%
When you building a self-improving agent, two natural questions emerge: ๐๐Ÿ: ๐–๐ก๐ข๐œ๐ก ๐ฆ๐จ๐๐ž๐ฅ๐ฌ ๐ฉ๐ซ๐จ๐๐ฎ๐œ๐ž ๐ญ๐ก๐ž ๐›๐ž๐ฌ๐ญ ๐ก๐š๐ซ๐ง๐ž๐ฌ๐ฌ ๐ฎ๐ฉ๐๐š๐ญ๐ž๐ฌ? ๐๐Ÿ: ๐–๐ก๐ข๐œ๐ก ๐ฆ๐จ๐๐ž๐ฅ๐ฌ ๐›๐ž๐ง๐ž๐Ÿ๐ข๐ญ ๐ฆ๐จ๐ฌ๐ญ ๐Ÿ๐ซ๐จ๐ฆ ๐ก๐š๐ซ๐ง๐ž๐ฌ๐ฌ ๐ฎ๐ฉ๐๐š๐ญ๐ž๐ฌ? Our new paper has counter-intuitive answers to both: they decouple from model capability, in opposite ways. Q1 (who produces good updates): the updater's base capability barely matters.A 9B model (Qwen3.5) produces harness updates that match Claude Opus 4.6's. Best vs worst evolver gap โ‰ค3.1pp. Q2 (who benefits most): non-monotonic.Mid-tier solvers benefit the most. Strong-tier hits ceiling. Weak-tier benefits LEAST despite the most headroom โ€” failing at two layers: skill activation (25% vs ~96% for strong) and adherence drift across trajectory (~4x steeper). Tested across 7 evolver models ร— 6 solver agents ร— 3 agentic benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench). Implication: don't pay frontier prices for both halves of the loop. Put capability budget on the agent (solver), not the evolver.
5
15
85
9,670
Like the harness-updating-vs-benefit paper earlier this week (arxiv:2605.30621), this work was developed on the A-Evolve framework โ€” shared primitives for large-scale self-evolving agent research. Code, evolved harnesses, branches, and routing traces all releasing through the repo. One more paper coming this week: online context-to-harness skill compilation. Paper: arxiv.org/abs/2606.01770 Repo: github.com/A-EVO-Lab/a-evolvโ€ฆ Huggingface Daily: huggingface.co/papers/2606.0โ€ฆ
1
10
388
When you building a self-improving agent, two natural questions emerge: ๐๐Ÿ: ๐–๐ก๐ข๐œ๐ก ๐ฆ๐จ๐๐ž๐ฅ๐ฌ ๐ฉ๐ซ๐จ๐๐ฎ๐œ๐ž ๐ญ๐ก๐ž ๐›๐ž๐ฌ๐ญ ๐ก๐š๐ซ๐ง๐ž๐ฌ๐ฌ ๐ฎ๐ฉ๐๐š๐ญ๐ž๐ฌ? ๐๐Ÿ: ๐–๐ก๐ข๐œ๐ก ๐ฆ๐จ๐๐ž๐ฅ๐ฌ ๐›๐ž๐ง๐ž๐Ÿ๐ข๐ญ ๐ฆ๐จ๐ฌ๐ญ ๐Ÿ๐ซ๐จ๐ฆ ๐ก๐š๐ซ๐ง๐ž๐ฌ๐ฌ ๐ฎ๐ฉ๐๐š๐ญ๐ž๐ฌ? Our new paper has counter-intuitive answers to both: they decouple from model capability, in opposite ways. Q1 (who produces good updates): the updater's base capability barely matters.A 9B model (Qwen3.5) produces harness updates that match Claude Opus 4.6's. Best vs worst evolver gap โ‰ค3.1pp. Q2 (who benefits most): non-monotonic.Mid-tier solvers benefit the most. Strong-tier hits ceiling. Weak-tier benefits LEAST despite the most headroom โ€” failing at two layers: skill activation (25% vs ~96% for strong) and adherence drift across trajectory (~4x steeper). Tested across 7 evolver models ร— 6 solver agents ร— 3 agentic benchmarks (SWE-bench Verified, MCP-Atlas, SkillsBench). Implication: don't pay frontier prices for both halves of the loop. Put capability budget on the agent (solver), not the evolver.
Launch Post๐Ÿงฌ A-Evolve: The PyTorch Moment for Self-evolving AI Today we at @amazon launch the universal infrastructure that turns any agent into a self-improving SOTA agent โ€” zero human intervention. You give it a base agent โ†’ it returns a continuously evolving Top-10 agent. 3 lines of code. 0 hours of manual harness engineering: ๐ŸŸข MCP-Atlas โ†’ 79.4% (#1) 3.4pp ๐Ÿ”ต SWE-bench Verified โ†’ 76.8% (~#5) 2.6pp ๐ŸŸฃ Terminal-Bench 2.0 โ†’ 76.5% (~#7) 13.0pp ๐ŸŸก SkillsBench โ†’ 34.9% (#2) 15.2pp Thanks @binghe2727 @YisiSang @sammyershi @linminhua16 for the contribution! #AgenticAI #AEvolve #SelfImprovingAgents
4
12
104
21,702
All experiments in this paper were developed on the A-Evolve framework - a scalable self-improving agent infra. Two more research on self-improving agents built on the same framework releasing coming weeks: online context-to-harness skill compilation, and adaptive auto-harness for long-running deployment. Code, evolved harnesses, and trajectories all releasing through the repo. Paper: arxiv.org/abs/2605.30621 Repo: github.com/A-EVO-Lab/a-evolvโ€ฆ Hugging face daily: huggingface.co/papers/2605.3โ€ฆ Thanks @sammyershi @YisiSang @linminhua16 @dakuowang @binghe2727 @wei_tianxin @cihangxie and other co-authors!
1
1
13
603
Anthropic's @kdqg1 named the phenomenon earlier this year: "agentic laziness" โ€” models finding "an excuse to stop before finishing the task." Mechanism beneath that observation: training distribution rewards in-context completion. Models optimize for wrap-up tokens over sustained-extension tokens โ€” even when capability would clearly support real further progress.
1
3
200
At single-file demo scale, this is invisible โ€” the whole loop fits in one context. At Claude-tier training scale it becomes the dominant failure mode: โ€“ multi-file refactors crossing context boundaries โ€“ hardware errors surfacing hours after launch โ€“ long-chain data processing pipelines โ€“ multi-day training runs with mid-run analysis โ€“ late-stage cross-experiment result interpretation
1
2
185
The bias compounds every time the loop must extend across contexts. Real question isn't whether autoresearch works at 630 lines โ€” it does. It's getting frontier models to sustain research engagement across the time horizons real training cycles require, when their training distribution biases them toward early wrap-up. Several A-Evolve papers/codes on this releasing over coming weeks - stay tuned! Repo: github.com/A-EVO-Lab/a-evolvโ€ฆ Memo: github.com/A-EVO-Lab/.githubโ€ฆ
1
121
.@karpathy starting a new team to scale autoresearch from his single-py-file demo to Claude-tier models. After developing the scaled version (~10ยณร— prior self-improving work), the bottleneck we hit isn't capability โ€” it's that frontier models are trained to complete-in-context. That becomes the dominant failure mode at scale.
Excited to welcome Andrej to the Pretraining team! He'll be building a team focused on using Claude to accelerate pretraining research itself. I canโ€™t think of anyone better suited to do it โ€” looking forward to what we build together!
2
1
8
564
There's been a lot of back-and-forth on RSI timelines @jackclarkSF โ€” @ericjang11 's framing here is actually the missing measurement question. How do we know how far we are? When we measured AI research ability directly, the answer turned out to depend entirely on time horizon: 1 hour: models crush the best human researchers in reading paper, implementing a code change for experiments. 1 week: not even close. The bottleneck isn't capability per token โ€” it's the inability to stably translate compute into scientific progress. Humans keep asking the right next question over weeks; models drift, get stuck, or chase dead ends. So RSI timelines should be indexed to time horizon, not raw capability. In short bursts we're already past human. The real question is when the horizon over which models compound stretches from hours to weeks to months. That curve is the RSI clock.
What's the current bottleneck to automating AI research? @ericjang11's report: today's models are already good at implementing and running experiments, but still can't reliably pick the right question to investigate next or tell when they're stuck down a dead end.
1
8
846
.@Recursive_SI Congrats on the launch! The idea that really hits hard is that AI improving AI might be the single highest-leverage problem in all of science right now. Because if AI is the most leveraged domain (solving it accelerates every other scientific problem), then self-improving AI is the highest-leverage problem inside AI. Itโ€™s a meta-problem: once you crack open-ended, recursive self-improvement, you donโ€™t just get better models โ€” you get a system that can accelerate its own progress on every other capability. Your framing of open-ended discovery as the engine feels exactly right. The practical bottleneck then shifts from โ€œhow do we make the next model smarterโ€ to โ€œhow do we build systems that can autonomously run the entire research loop on themselves.โ€ Extremely excited to see this direction unfold. Congrats on the raise โ€” this feels like one of the most important bets in the field.
4
8
34
10,033
And the key to solve this recursive improvement problem is how we can scale. The existing works like @karpathy โ€˜s auto research are good POCs on 100M models, but frontier models are Trillions level. How we can scale 1000x is the next key question.
1
3
307
๐Ÿš€ Major update: We let the model design a multi-agent system โ€” and it improved on ARC-AGI. Self-evolution isnโ€™t just the model improving itself. Itโ€™s the model acting as a researcher that designs other agents. For single-agent tasks (coding, MCP-Atlas, etc.), weโ€™ve shown the model can successfully design and evolve its own harness. But for harder tasks like @arcprize, the optimal solver is a complex multi-agent system. The real question: Can the model evolve a multi-agent system from scratch? We ran A-Evolve on ARC-AGI and proved yes โ€” the model successfully evolved a multi-agent solver and achieved a clear performance uplift: 10% โ†’ 12%. (Attached: one example task solutions of agent before & after A-Evolve) Full results details in the reply below ๐Ÿ‘‡ #AgenticAI #AEvolve #SelfImprovingAgents #ARCAGI
Launch Post๐Ÿงฌ A-Evolve: The PyTorch Moment for Self-evolving AI Today we at @amazon launch the universal infrastructure that turns any agent into a self-improving SOTA agent โ€” zero human intervention. You give it a base agent โ†’ it returns a continuously evolving Top-10 agent. 3 lines of code. 0 hours of manual harness engineering: ๐ŸŸข MCP-Atlas โ†’ 79.4% (#1) 3.4pp ๐Ÿ”ต SWE-bench Verified โ†’ 76.8% (~#5) 2.6pp ๐ŸŸฃ Terminal-Bench 2.0 โ†’ 76.5% (~#7) 13.0pp ๐ŸŸก SkillsBench โ†’ 34.9% (#2) 15.2pp Thanks @binghe2727 @YisiSang @sammyershi @linminhua16 for the contribution! #AgenticAI #AEvolve #SelfImprovingAgents
6
17
93
17,014
GitHub: github.com/A-EVO-Lab/a-evolvโ€ฆ Submission to ARC-AGI community leaderboard: @GregKamradt @fchollet Key insight: - For single-agent friendly tasks โ†’ model can evolve its own single-agent harness. - For ARC-AGI-level difficulty โ†’ optimal solver is multi-agent, and the model can evolve that multi-agent system too. We will keep pushing the frontier of what agents can design and evolve themselves. Who else is excited about models that can evolve their own multi-agent systems? ๐Ÿ‘€
1
6
581
.@demishassabis : lack of continual learning is the blocker for agents completing full tasks. One data point on why this can be unlocking now, from our ๐š-๐ž๐ฏ๐จ๐ฅ๐ฏ๐ž runs: self-correction went through a phase transition. Pre-Sonnet-4.5: often can't find own errors. Sonnet-4.5: finds them, often can't fix. Opus-4.6: usually fixes right. Continual learning is a second-order capability โ€” gated by base model strength.
Demis Hassabis (@demishassabis) has had one of the most extraordinary careers in tech. He started as a chess prodigy and video game designer at 17 before getting a PhD in neuroscience and going on to found DeepMind. His lab cracked Go, solved protein structure prediction with AlphaFold, and then gave it away free to every scientist on earth. That work won him the 2024 Nobel Prize in Chemistry. Today he leads @GoogleDeepMind, pushing toward the same goal he set as a teenager: AGI. On this special live episode of How to Build the Future, he sat down with YC's @garrytan to talk about what still needs to happen to get us to AGI, his advice for founders on how to stay ahead of the curve, and what the next big scientific breakthroughs might be. 01:48 โ€” Whatโ€™s Missing Before We Get To AGI? 03:36 โ€” Why Memory Is Still Unsolved 06:14 โ€” How AlphaGo Shaped Gemini 08:06 โ€” Why Smaller Models Are Getting So Powerful 10:46 โ€” The 1000x Engineer 12:40 โ€” Continual Learning and the Future of Agents 13:32 โ€” Why AI Still Fails at Basic Reasoning 15:33 โ€” Are Agents Overhyped or Just Getting Started? 18:31 โ€” Can AI Become Truly Creative? 20:26 โ€” Open Models, Gemma, and Local AI 22:26 โ€” Why Gemini Was Built Multimodal 24:08 โ€” What Happens When Inference Gets Cheap? 25:24 โ€” From AlphaFold to the Virtual Cells 28:24 โ€” AI as the Ultimate Tool for Science 30:43 โ€” Advice for Founders 33:30 โ€” The AlphaFold Breakthrough Pattern 35:20 โ€” Can AI Make Real Scientific Discoveries? 37:59 โ€” What to Build Before AGI Arrives
6
500
There's a missing piece. ๐‹๐‹๐Œ๐ฌ ๐š๐ฅ๐ซ๐ž๐š๐๐ฒ ๐š๐ซ๐ž ๐ซ๐ข๐œ๐ก ๐ซ๐ž๐ฐ๐š๐ซ๐ ๐Ÿ๐ฎ๐ง๐œ๐ญ๐ข๐จ๐ง๐ฌ for many domains โ€” we use them as judges, critics, self-correctors. Self-evolving agents work specifically because the model evaluates its own trajectories and writes code to patch its own scaffolding. The "simple cross-entropy" problem is a base-training problem; it doesn't bind at the agent loop layer.
There's a quadrillion-dollar question at the heart of AI: Why are humans so much more sample efficient compared to LLM? There are three possible answers: 1. Architecture and hyperparameters (aka transformer vs whatever โ€˜algoโ€™ cortical columns are implementing) 2. Learning rule (backprop vs whatever brain is doing) 3. Reward function @AdamMarblestone believes the answer is the reward function. ML likes to use pretty simple loss functions, like cross-entropy. These are easy to work with. But they might be too simple for sample-efficient learning. Adam thinks that, in humans, the large number of highly specialised cells in the โ€˜lizard brainโ€™ might actually be encoding information for sophisticated loss functions, used for โ€˜trainingโ€™ in the more sophisticated areas like the cortex and amygdala. Like: the human genome is barely 3 gigabytes (compare that to the TBs of parameters that encode frontier LLM weights). So how can it include all the information necessary to build highly intelligent learners? Well, if the key to sample-efficient learning resides in the loss function, even very complicated loss functions can still be expressed in a couple hundred lines of Python code.
3
23
3,804
๐–๐ก๐ž๐ซ๐ž ๐ข๐ญ ๐›๐ซ๐ž๐š๐ค๐ฌ: tasks where the model has no inherent taste. We tried this on Texas Hold'em โ€” the model can't intuit hand probabilities, so self-critique went nowhere. The fix was letting it write an external equity solver. Then the agent loop closed again, just tool-augmented.
1
1
164
So @AdamMarblestone 's framing maps onto LLMs cleanly, but with a twist: the rich reward function isn't a separate evolved circuit. It's the same model evaluating itself. Sample efficiency at the agent level comes from this self-critique loop, not from making each forward pass more efficient.
1
1
153