AI Native Dev retweeted
> The most expensive model in the benchmark wasn't the best value
I found similar results in my Snyk VulnBench benchmark for finding vulnerabilities (comparing Opus 4.7 and others). To be released soon, stay tuned folks 👀😉
Simon Maple retweeted
The most expensive model in the benchmark wasn't the best value. Rob Willoughby and Simon Maple ( @sjmaple ) evaluated 19 model configurations on real agentic tasks and found that DeepSeek V4 Flash scored 82.3 while costing just $0.0236 per task. Claude Haiku 4.5 scored 82.9 at roughly four times the cost, while DeepSeek V4 Pro scored 85.3 at nearly eight times the cost. The interesting part isn't that Flash beat stronger models. It didn't. The interesting part is how little quality was gained for how much additional spend. That becomes a very different conversation once you're running agents at scale. A model that looks marginally better on a benchmark can end up costing dramatically more over the course of a year, especially when agent workloads start growing. The benchmark also surfaced something that many teams probably aren't measuring closely enough. The biggest performance jump didn't come from switching models. It came from adding the right skill. DeepSeek V4 Flash moved from 64.1 to 82.3 with skill context applied, which raises an uncomfortable question about how much of agent performance is actually model selection versus everything built around the model. The full breakdown is worth reading, particularly the sections on points-per-dollar, turn counts, and why the cheapest model in the benchmark ended up being one of the most interesting. Read the full blog here: tessl.io/blog/same-quality-a…
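The points-per-dollar argument in the post can be checked with a few lines of arithmetic. This is a rough sketch, not the benchmark's own method: the scores and the Flash cost come from the post, but the Haiku and Pro costs are approximated as exactly 4x and 8x the Flash cost (the post only says "roughly four times" and "nearly eight times"), and the 10,000 tasks per day workload is purely hypothetical.

```python
# Scores and the Flash cost per task are quoted in the post.
# ASSUMPTION: the other two costs are taken as exactly 4x and 8x Flash,
# standing in for "roughly four times" and "nearly eight times".
FLASH_COST = 0.0236

models = {
    "DeepSeek V4 Flash": {"score": 82.3, "cost": FLASH_COST},
    "Claude Haiku 4.5": {"score": 82.9, "cost": FLASH_COST * 4},
    "DeepSeek V4 Pro": {"score": 85.3, "cost": FLASH_COST * 8},
}


def points_per_dollar(score, cost):
    """Benchmark points obtained per dollar spent on one task."""
    return score / cost


def marginal_cost_per_point(base, other):
    """Extra dollars per task paid for each extra point over the base model."""
    return (other["cost"] - base["cost"]) / (other["score"] - base["score"])


def annual_cost(cost_per_task, tasks_per_day, days=365):
    """Yearly spend for a given daily task volume."""
    return cost_per_task * tasks_per_day * days


# ASSUMPTION: hypothetical workload, chosen only to illustrate scale.
TASKS_PER_DAY = 10_000

flash = models["DeepSeek V4 Flash"]
for name, m in models.items():
    line = (
        f"{name}: {points_per_dollar(m['score'], m['cost']):.0f} points/$, "
        f"${annual_cost(m['cost'], TASKS_PER_DAY):,.0f} per year"
    )
    if m is not flash:
        extra = marginal_cost_per_point(flash, m)
        line += f", ${extra:.3f} extra per task for each added point"
    print(line)

# Gain from adding skill context to Flash, as reported in the post.
skill_gain = 82.3 - 64.1
print(f"Skill context gain for Flash: {skill_gain:.1f} points")
```

Under these assumptions, the ranking by points per dollar is the reverse of the ranking by raw score, which is the trade-off the post describes.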
Replying to @ainativedev
Oh that's lovely indeed. Looks like mixed results, especially across the tool-calls divide. Good that you were able to run it before the fable takedown 😉
Replying to @ainativedev
I like these benchmarks. Nice work Tessl team
Hamza Oza retweeted
The biggest AI challenge inside organisations might have nothing to do with AI. Hamza Oza (@hamzaoza) connects two very different events, AI Native DevCon and Muslim Tech Fest, and arrives at the same conclusion from both. Most conversations about AI start with tools. The more interesting question is where value is created before the tool arrives. At AI DevCon, the discussion was about why individual AI productivity gains often fail to scale across teams. A developer can build faster with Claude Code, Cursor, or Copilot, but organisations still need shared context, standards, governance, and workflows before those gains become repeatable. At Muslim Tech Fest, the conversation surfaced the same idea at a personal level. Before AI can amplify someone's work, they need clarity on their strengths, weaknesses, judgement, and the areas where they create the most value. That parallel feels increasingly important. If an individual doesn't understand where they contribute value, AI won't solve that problem. If a team lacks shared standards, better tools won't create them. If an organisation doesn't understand what makes it effective, adding agents is unlikely to provide the answer. Technology can accelerate direction. It cannot provide direction. A thoughtful perspective on why reflection may be a more important starting point than augmentation. Read the full blog here: tessl.io/blog/reflection-bef…
Rohan Sharma retweeted
Ryan Lopopolo put out a claim that it's "borderline negligent" not to use a billion tokens a day — and in this clip he explains exactly why. Intelligence extraction scales linearly with token consumption. That's why test-time compute exists. And getting to a billion tokens means thinking well beyond pair programming. Watch the full episode at youtu.be/MFQIKbr1IEo or listen wherever you get your podcasts. #AI #agenticcoding #claudecode #codex #AIskills
Oleg Šelajev 🇪🇪🧊🐳 retweeted
A skill from @shelajev, shown in his talk at @ainativedev, that runs a security audit to find credentials scattered across the file system. Claude Code refuses to run the skill, Oleg asks it to rewrite the skill in Python, and so on and so on. It's a fun experiment in behavior analysis.
Der.dev 🔥🛠️ retweeted
Had a great conversation with the Tessl folks at @ainativedev London on all things Codex, agents, and harness engineering. Hope y’all give it a listen!
Ryan Lopopolo tracked PR throughput on his OpenAI team from 3.5 per engineer per week up to 70 — not through adding headcount, but through iterating on the model and the harness together. Every revision of GPT-5 from 5.2 onward compounded on the last, and this clip shows exactly what that felt like from inside the team. Watch the full episode at youtu.be/MFQIKbr1IEo or listen wherever you get your podcasts. #AI #agenticcoding #claudecode #codex #AIskills
It was fun! And you were great, Ryan :-)
Michael Wall retweeted
Ryan Lopopolo built a product at OpenAI with zero human-written code, and by the time his team reached its seventh engineer, new hires were making the team faster within two weeks. The secret isn't just better agents — it's Harness Engineering: the systems, constraints, and feedback loops that make agents trustworthy enough to let go. This conversation was recorded live at AI Native DevCon London 2026, and it's one of the most concrete breakdowns of production-grade agent development we've had on the show. Watch the full episode at youtu.be/MFQIKbr1IEo or listen wherever you get your podcasts. #AI #agenticcoding #claudecode #codex #AIskills
Simon Maple retweeted
Developers using AI tools are creating and merging twice as many pull requests — but AI-generated PRs have a 60/40 merge rate compared to 80/20 for humans. That gap reveals something important about how agents are actually being used in the wild: probing, experimenting, spawning throwaway work. Jellyfish's Nick Arcolano breaks down what the data actually says. Watch the full episode at youtu.be/GbHfzFcIa0o or listen wherever you get your podcasts #AI #agenticcoding #claudecode #codex #AIskills
There's always next time
@SammyHep where's mine? 😆
Thank you for inviting me to speak at @ainativedev and meet some of the coolest AI builders, talent, and minds in London 🚀 Appreciate @tessl_io for building this, and @SammyHep, @sjmaple, and the team for all the effort to organize it and make it a stellar AI event.