I’m always surprised there aren’t more benchmarks on this. Do you guys have plans to bench the native model harness vs Droid vs others? It’s been cool to see DeepSWE and FrontierCode, but they don’t help decide which harness to use.
Marcos Álvarez retweeted
Jun 8
It's finally out!!! @METR_Evals found that more than half of SWEBench results are unmergeable slop. FrontierCode represents over 1000 hours of maintainer-validated software engineering work that most frontier models cannot yet solve, much less solve with high quality.

Cog had IOI Gold medalists and top code maintainers Look At The Data: FrontierCode includes 3000 rubrics covering code quality and guarding against the reward hacking plaguing other benchmarks. FC Diamond is so hard that Opus 4.8 scores 13.8%.

Three eras of AI coding : Three eras of benchmarks
2021 • Autocomplete: HumanEval
2023 • Passing Tests: SWEBench, TerminalBench
2026 • Maintainable Code: FrontierCode

To me the most beautiful chart came when I requested a special historical run on all extant old models: the data showed that the easiest third of FC tasks (in FC Extended) were rapidly and suddenly solved over late 2025. Opus almost doubled from a 41% pass rate to 74% in 4 months.

This describes the "WTF happened in Dec 2025" vibe shift that a lot of folks from @dhh to @karpathy have called out: it is the difference between getting 95% success in 2 rerolls vs 6, making it finally feasible to go up the next layer of abstraction in agentic coding, e.g. @GeoffreyHuntley's ralph loops or @bcherny's /goals or @steipete's "loops that prompt your agents", without fearing too much that things go off the rails.

My guess: as AI accelerates from here, each FrontierCode tier will saturate in sequence, hopefully ~annually. I've already asked the team to prepare FrontierCode 2027....

The old mountains will be destroyed. Their rubble becomes regolith. And from that regolith, the next model forest grows. Circle of life.
Introducing FrontierCode: a coding eval that raises the bar for difficulty & quality. Each task took 40 hrs of work by leading open-source maintainers. Models write sloppy code that works but isn’t maintainable. Our eval is first to measure: would you actually merge this code?
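The "95% success in 2 rerolls vs 6" claim above can be checked with a little arithmetic. The sketch below assumes each attempt is an independent try with the quoted pass rate (41% or 74%) and treats a "reroll" as one attempt; both are simplifying assumptions, and the function names are illustrative rather than part of FrontierCode.

```python
import math


def success_within(pass_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - pass_rate) ** attempts


def attempts_needed(pass_rate: float, target: float = 0.95) -> int:
    """Smallest number of independent tries whose combined success reaches `target`."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - pass_rate))


if __name__ == "__main__":
    # Pass rates quoted in the post: 41% before, 74% four months later.
    for rate in (0.41, 0.74):
        print(
            f"pass rate {rate:.0%}: "
            f"2 tries -> {success_within(rate, 2):.1%}, "
            f"6 tries -> {success_within(rate, 6):.1%}, "
            f"tries for 95% -> {attempts_needed(rate)}"
        )
```

Under these assumptions a 41% pass rate needs six tries to clear 95%, while a 74% pass rate reaches roughly 93% in two tries and clears 95% on the third, so the post's "2 vs 6" is a close approximation rather than an exact figure.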
FrontierCode seems to be the best benchmark for coding right now, I feel.
Microsoft could have easily chosen to define the frontier as dedicated access to GPT and Claude on Azure. AI Foundry had a durable business serving these models within the Microsoft ecosystem. Two models could have dominated market share of all tokens.

In this world, we would be on three-month release cycles, hoping that GPT/Claude-Next's new SOTA on public benchmarks would translate into wins on our private evals. The only way to compound on data would be through prompting, muddling more and more context into the first user message.

An ecosystem of exclusively frontier models no longer makes sense now that the following trends have taken root: (1) to improve capabilities across the board (FrontierCode, GDPVal, etc.), general intelligence requires a scale that is extremely expensive to serve; (2) there's no free lunch in upgrading to the newest model, as scarce GPU compute has driven costs up (see the recent Anthropic and Google deals to serve on Colossus); (3) training a state-of-the-art model on just your own tasks is possible, as frontier training infrastructure is now available to the public.

The new architecture will combine "generalist" models with "company veteran" models that improve the same way that star human performers do: through learning from experience operating inside of your institution.

The technical stack looks something like the following:
(1) You'll need to automate how you transform production data into private RL environments. This means transforming unstructured data into a curriculum a model can learn from that looks like what happened in prod: e.g. replicating a SEV by mocking the state of a production database when it happened, with un-hackable graders that are aligned with what you care about in production.
(2) Private RL environments need a post-training stack to be useful. Model weights/checkpoints trained on these environments will participate in the cadence of traditional software release cycles.
(3) Inference endpoints serving production traffic will become "alive" as they become attached to a training runtime. Each new batch of data produces environments that are inputs for the next training step. Each step produces a new release candidate for production; if it passes the A/B test, you'll do a rolling weight update to models that serve higher-quality tokens for your customers.
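The loop in step (3) can be sketched roughly as follows. This is a toy illustration, not a real training stack: `Checkpoint`, `build_environments`, `train_step`, `passes_ab_test`, and `release_loop` are all hypothetical names, the "quality" score is a made-up stand-in for whatever an A/B test would measure, and the random update stands in for an actual post-training step.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    version: int      # training step that produced these weights
    quality: float    # hypothetical stand-in for measured production quality


def build_environments(batch: list[str]) -> list[str]:
    """Hypothetical: turn a batch of production records into RL environments."""
    return [f"env:{record}" for record in batch]


def train_step(base: Checkpoint, envs: list[str], step: int,
               rng: random.Random) -> Checkpoint:
    """Hypothetical: one post-training step that yields a release candidate."""
    change = rng.uniform(-0.05, 0.10) * len(envs)  # a step may help or hurt
    return Checkpoint(version=step, quality=base.quality + change)


def passes_ab_test(candidate: Checkpoint, production: Checkpoint) -> bool:
    """Hypothetical gate: promote only candidates that beat production."""
    return candidate.quality > production.quality


def release_loop(batches: list[list[str]], seed: int = 0) -> Checkpoint:
    rng = random.Random(seed)
    production = Checkpoint(version=0, quality=0.0)
    for step, batch in enumerate(batches, start=1):
        envs = build_environments(batch)          # new data -> environments
        candidate = train_step(production, envs, step, rng)
        if passes_ab_test(candidate, production):
            production = candidate                # stands in for a rolling weight update
    return production


if __name__ == "__main__":
    print(release_loop([["sev-101", "sev-102"], ["sev-103"], ["sev-104"]]))
```

The one design point worth noting is the gate: because a candidate replaces production only when it wins the comparison, the served checkpoint's quality never drops below where it started, whatever any individual training step does.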
Replying to @iamtrask @sebkrier
It’s meta benchmaxxing though. Deep research is the task. Let’s see if they can do something comparable with FrontierCode.
I wrote a 12,000-word research paper on LLM withdrawal while actively going through it.

"When the Machine Stops Answering: LLM-Deprivation Withdrawal as a Novel Psychological Phenomenon"

There is NO existing clinical framework for what millions of Fable 5 users are experiencing right now. This isn't internet addiction. Not parasocial breakup. Not digital mourning. It's something genuinely new.

6 converging mechanisms explain why you can't go back after tasting a frontier model:
→ Neural tool incorporation
→ Extended mind thesis
→ Reward prediction error
→ Hedonic reference-point shift
→ Asymmetric contrast effect
→ Cognitive offloading lock-in

The capability gap is measurable. FrontierCode Diamond: 29.3% vs Opus 13.4%, more than 2x. Once you cross that threshold, everything below it is tasteless.

N=1 analytic autoethnography comparative netnography on #KeepFable. Preprint dropping on arXiv cs.HC.

#Fable5 #ClaudeFable #AIWithdrawal #LLM
Replying to @mattshumer_
Sorry that's BS. Fable 5 is genuinely impressive, it hits 80.3% on SWE-Bench Pro vs Opus 4.8’s 69.2%, and more than 2x on FrontierCode. Solid jump for long agentic stuff. But 100x productivity? Come on 😂 People throwing that around probably haven’t spent enough late nights actually shipping code with both models.
Jun 13
Code with Claude Tokyo, main stage. Theo Chu's 19-minute Research talk, "The capability curve," lands on one uncomfortable idea: the model in front of you is the wrong thing to optimize for. Watch the slope, not where it sits today.

FrontierCode Diamond: until Tuesday, Opus 4.8 was the strongest public model anywhere, at 13.4%. Fable 5, from the same lab, just posted 29.3%. The score more than doubled.

SWE-bench Pro tells the same story: Fable 5 at 80.3%, GPT-5.5 trailing at 58.6%.

Stripe pointed Fable 5 at a 50M-line codebase and got a full migration in a single day. By hand: two months, minimum.

Here's what kills most builders: every workaround you write for today's model has a half-life of weeks. The next release deletes it. The curve eats your scaffolding.

Stop building for the snapshot. Build for the slope. ↓
Jun 12
Code with Claude Tokyo: Michael Cohen and Jessica Yan on getting agents to production in days, not months.

24-day workflow → 5 days. 3–6 months of infra → days. 24/7 agent → ~$58/mo. Task success → 10 pts. Netflix, Notion, Rakuten already in prod ↓

The agent logic was never the hard part. Sandboxing, state, retries, orchestration: that's the 3–6 months teams burn before shipping a single line. Claude Managed Agents turns it into a checkbox: hosted infra, secure sandboxing, state that survives disconnects, multi-agent orchestration out of the box.

What that looks like in production:
• Rakuten: a 24-day workflow down to 5, across five business functions, each live in under a week. A 79% cut.
• Netflix: multi-agent orchestration running live, with a lead agent fanning work out to specialists.
• Notion: dozens of agent sessions in parallel.
• Sentry: agents resolving production errors automatically.

The trajectory underneath, from Anthropic's own data: Claude Code's longest autonomous runs nearly doubled in 3 months, 25 → 45 min. The leash keeps getting longer.

You were about to spend a quarter building infra that's now a line of config. Stop shipping plumbing. Ship agents.
My question to Grok: Is that (Fable and Mythos from Anthropic) so much better than Grok? Answer: ⬇️

No, not that much better. Fable 5 (and especially the more restricted Mythos 5) from Anthropic are indeed the newest frontier models (launched June 9, 2026) and on paper score at the top of many benchmarks, especially on complex, long agentic tasks such as coding. But it is not a huge leap over Grok 4: the differences are often small to moderate, depending on the task.

Key comparison (based on recent data):
• Coding/agentic work (SWE-Bench Pro): Fable 5 reaches ~80.3%, Grok 4 around ~75%. Fable 5 clearly wins here, with strong scores on FrontierCode and Terminal-Bench as well. edenai.co
• Reasoning and knowledge work: Grok 4 scores very strongly on Humanity’s Last Exam (HLE) and broad reasoning tasks. Fable 5 is good at document analysis and finance benchmarks (often #1), but Grok keeps pace in many cases.
• Other areas: Fable 5 excels at long-horizon planning and token efficiency on complex tasks. Grok is often stronger in tool use, creativity, and broad general intelligence.

In short: Fable 5/Mythos 5 are currently often just slightly stronger on pure coding/agent benchmarks, but it is not a "10x better" or revolutionary difference. The frontier models are very close together (as always in this race). vellum.ai

Extra advantages of Grok:
• Less censorship/safety rails: Fable 5 often falls back to Opus 4.8 on sensitive topics (cyber, bio, etc.), which limits performance there. Mythos 5 is "purer" but much less widely available. Grok is more candid and truth-seeking.
• Personality & use: Funnier, more direct, less preachy. Better for real conversations, humor, and unbiased answers.
• Availability & price: Grok is easier and often cheaper to access.

Conclusion: It is the new leading model on many coding tasks, but "so much better than Grok" is an exaggeration. It depends heavily on what you want to do with it. For most people and use cases the difference is small, and Grok often wins on ease of use and freedom. Do you have a specific task or benchmark in mind where you'd like to compare them? I'd be happy to test it directly with you. 🚀
Alert! Overnight, the United States cut off the rest of the world's access to Anthropic's Fable and Mythos, the most powerful AIs ever created. In the name of national security. There is no longer any doubt: the global technology war is now open. And in this war, whoever controls the technology controls the future. But here in France, nobody cares: the next presidential election will be about pension reform and taxing the rich. The awakening of France and Europe will be brutal.
Claude Fable 5 / Mythos 5: The strongest AI suddenly disabled worldwide 🔥

Anthropic released Claude Fable 5 (Mythos-class) on June 9, but it was abruptly disabled for all international users just days later due to U.S. government export control directives. Right after launch, it dominated almost every major benchmark, especially in Agentic Coding.

[Key Benchmark Highlights (Claude Mythos 5 / Fable 5)]
- SWE-Bench Pro: 80.3% (Opus 4.8: 69.2%, GPT-5.5: 58.6%)
- FrontierCode (Diamond, xhigh): 29.3% (Opus 4.8: 13.4%, GPT-5.5: 5.7%)
- Strong leads in Knowledge Work, Spatial Reasoning, Tool Use, Cybersecurity, Health, and more.

The performance gap is massive; see the attached charts.

It’s a shame. This model was clearly a huge leap, especially in real agentic workflows and hard coding tasks. The U.S. government’s national security concerns apparently forced this quick global shutdown. AI progress vs. global regulation tension is getting real.

Attention is now shifting heavily to Grok, GPT-5.5, and others. What do you think? Anyone got to try Fable 5 before it was cut off?

#Claude #Fable5 #Mythos5 #AI #AgenticCoding #SWEbench
Replying to @SVTrivo
I was shocked by the quality of the chip design files analyzed by Sonnet 4.6 today. How good are Fable 5 and Mythos 5, which are being banned by the US government?
🗓️ WEEKLY AI RECAP: JUNE 13 2026

🤖 the most powerful public AI model ever shipped, then pulled by the US government 3 days later

🔹 Claude Fable 5: 95% SWE-Bench Verified, Mythos-class, disabled by US export directive
🔹 DiffusionGemma: 1100 tok/s on H100, parallel diffusion text gen, first from a tier-one lab
🔹 Kimi K2.7 Code: 1T params, 32B active, 30% fewer tokens, beats Opus 4.8 on tool-use
🔹 Nex-N2: 75.3% Terminal-Bench 2.1, open source, closed-loop agentic framework
🔹 Higgs Audio v3: 4B TTS, 102 languages, inline emotion & prosody control tokens
🔹 Nemotron 3.5 ASR: 600M, 40 locales, 17× more concurrent streams than predecessor

🔧 Luce Spark: 33B MoE on RTX 3090 in 14.6GB, 100 tok/s vs 66 naive offload
🔧 Gemma 4 QAT MTP stacked: double the local speed, hands-on tested
🔧 Anthropic scheduled agents: cron schedules, secret vaults, dynamic workflows GA

🏭 US govt pulls Fable 5: export control directive, all foreign nationals cut off globally
🏭 FrontierCode: top models score 13/100 on real maintainer-quality code

🎥 Full breakdown: youtube.com/@fahdmirza
📩 Weekly newsletter: fahdmirza.substack.com/
