Recently we built OTelBench – a benchmark to test how well LLMs handle OpenTelemetry instrumentation.
We tested 14 models. The best (Claude Opus 4.5) hit only 29%.
These weren't trick questions, just a small subset of typical SRE tasks.
Link here:
quesma.com/blog/introducing-…
AI + Ghidra by NSA = reverse-engineering fun
I am speaking at @AITinkerers Warsaw, 4th Mar 2026.
One of my favorite event series - by and for the creators community.
Vibe-resurrecting an old game from binaries 👾 and vibe-hardware-ing a LED backpack 🎒🌈.
Claude can code, but can it read machine code?
We gave AI agents access to Ghidra (a decompiler by the NSA) and tasked them with finding hidden backdoors in servers - working solely from binaries, without any access to source code.
See our BinaryAudit: quesma.com/blog/introducing-…
Great to see the community releasing benchmarks in @harborframework now. These are invaluable resources for collectively building the most useful agents.
Finally, an AI that can draw a map without getting lost. Nano Banana Pro uses tools to create factually correct infographics - and it's a game-changer.
quesma.com/blog/nano-banana-…
Interesting use case for AWS Lambda that we explored: sandboxing AI-generated code.
We tried WebAssembly first but hit a wall, so we scrapped that experiment in favor of AWS Lambda with Docker containers in an isolated VPC.
Full writeup from @pmigdal:
awsfundamentals.com/blog/san…
Lambda has tons of use cases, but one I've missed: using it as some kind of sandbox for running AI-generated code.
Lambda's isolation and scaling are a solid fit for this problem.
Can AI compile 22-year-old code? We built CompileBench to find out.
We know that LLMs can vibe-code or even win IOI, but what about dependency hell or legacy build systems?
(image based on XKCD 2347)
Image alt text: Cartoon about dependency hell; tangled ‘dependencies’ making simple tasks complex.
Cost-efficiency crown: @OpenAI.
Across difficulties, OpenAI models dominate the Pareto frontier of cost.
GPT-5-mini (high reasoning) is a great price/perf pick; GPT-4.1 is the fastest with solid wins.
Image alt text: Scatter plot of success vs cost, highlighting OpenAI models.
At #IcebergSummit 2025, Ryan Blue unveiled Iceberg beyond Java, plus the path to Table Spec V3 & forward to V4. Przemysław Delewski’s new blog covers Fokko Driesprong on PyIceberg, Matt Topol on Go, Julien Le Dem on modular DBs. Essential read for next-gen data platforms. Link👇