nilenso

Tweets

Pinned Tweet

nilenso @nilenso

3 Aug 2023

In 2013, a group of makers got together to find new ways to work together. A lot has happened since. We recently celebrated our 10th birthday :) Over the years, we've had the privilege of working with some exceptional organizations and doing work we're proud of. 1/2

4,742

nilenso retweeted

Srihari Sriraman

@SrihariSriraman

Apr 23

So, if you recognize these patterns in your agents, and they don't fit your task at hand, you could steer them accordingly. Full write-up: blog.nilenso.com/blog/2026/0… Analysis code and data: github.com/nilenso/swe-bench….

1,219

nilenso retweeted

Srihari Sriraman

@SrihariSriraman

Apr 23

I wanted to check this on newer models, but no SWE-bench Pro trajectories exist for Opus 4.6 / GPT-5.4. So I pulled @badlogicgames' issue-fixing trajectories and ran the same analysis. Thanks for putting those out in public, they make this kind of analysis possible. Opus's first edit sits at 47% in your pi sessions, vs 35% for Sonnet 4.5 on SWE-bench. Harness and model differ too, so I can't isolate the prompt's effect, but the shape shifts in the direction you'd expect from the explicit analyze-dont-edit prompt. I think we can see the effect of the human-steering through explicit analysis / go-ahead / wrap-up cues in this comparison.

Comparison of trajectory-phase shapes between SWE-Bench Pro runs and Pi/SWE-agent-style traces. I think it’s likely that the explicit analysis prompt is pushing the understand phase to be longer. The first edit is pushed from 35% to 47%.

Comparison of trajectory-phase shapes between SWE-Bench Pro runs and Pi/SWE-agent-style traces. The edit duration has reduced significantly from 40% to 23%. I suspect this is a combination of the model upgrade and Mario’s steering towards “minimal” and “concise” solutions.

26,569

nilenso retweeted

Mario Zechner

@badlogicgames

Apr 23

super interesting work! glad my open traces on @huggingface allow this! my workflow is mostly:
- prompt template with injected gh issue url and instructions on how to analyze and present results and a concise impl plan to me
- i confirm analysis/plan either by knowing or double checking manually. may steer to adjust plan a few times until model knows what to do
- tell model to implement
- check results, steer if necessary
- if all good (type checking, linting, tests, manual code review, manual tests), another prompt template is used to wrap up, i.e. changelog, docs, commit, push, comment on issue, close issue
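The first step of that workflow, a prompt template with the issue URL injected, could be sketched roughly as below. This is only an illustration: the template wording, the names `ANALYZE_TEMPLATE` and `render_analyze_prompt`, and the example URL are all hypothetical and are not taken from Mario's actual templates.

```python
from string import Template

# Hypothetical wording: the real template text is not shown in the thread.
ANALYZE_TEMPLATE = Template(
    "Read the GitHub issue at $issue_url.\n"
    "Analyze the problem and present your findings with a concise implementation plan.\n"
    "Do not edit any files until the plan has been confirmed.\n"
)


def render_analyze_prompt(issue_url: str) -> str:
    """Inject the issue URL into the (hypothetical) analysis prompt template."""
    return ANALYZE_TEMPLATE.substitute(issue_url=issue_url)


if __name__ == "__main__":
    # Placeholder URL, used only to show the substitution.
    print(render_analyze_prompt("https://github.com/example/repo/issues/1"))
```

The later "implement" and "wrap up" steps would use separate prompts in the same way, sent only after a human has confirmed the plan.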

Srihari Sriraman

@SrihariSriraman

Apr 23

Replying to @SrihariSriraman

21,768

nilenso retweeted

Srihari Sriraman

@SrihariSriraman

Apr 23

I analyzed 730 SWE-Bench Pro trajectories each for Sonnet 4.5 and GPT-5 and turned them into “trajectory shapes”: when they start editing, when they stop, how much they verify, how many steps they spend understanding vs doing. They have very different work habits.

Two stacked area charts comparing coding-agent trajectory phases over normalized progress from 0% to 100%. Sonnet starts editing earlier, at 35%, and finishes implementation much sooner, by 62%, then spends a long tail verifying after its last source edit. GPT-5 front-loads a lot more reading before it starts editing at 50%, and does very little verification afterward. Sonnet also has to clean up temporary files, while GPT-5 doesn’t.
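A minimal sketch of how numbers like "first edit at 35%" could be derived from one trajectory, assuming each step has already been tagged with a phase label. The labels (`read`, `edit`, `verify`) and the function names are assumptions made for illustration; the actual classification lives in the linked analysis code and may differ.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def first_edit_position(steps: Sequence[str], edit_label: str = "edit") -> Optional[float]:
    """Fraction of the trajectory elapsed before the first edit step (0.0 to 1.0)."""
    for index, label in enumerate(steps):
        if label == edit_label:
            return index / len(steps)
    return None  # the trajectory never edits


def last_edit_position(steps: Sequence[str], edit_label: str = "edit") -> Optional[float]:
    """Fraction of the trajectory elapsed once the last edit step has finished."""
    for index in range(len(steps) - 1, -1, -1):
        if steps[index] == edit_label:
            return (index + 1) / len(steps)
    return None  # the trajectory never edits


def phase_shares(steps: Sequence[str]) -> Dict[str, float]:
    """Share of all steps spent in each phase."""
    counts = Counter(steps)
    return {label: count / len(steps) for label, count in counts.items()}


if __name__ == "__main__":
    # Made-up 20-step trajectory, not real SWE-Bench Pro data.
    steps = ["read"] * 7 + ["edit"] * 5 + ["verify"] * 8
    print(first_edit_position(steps))
    print(last_edit_position(steps))
    print(phase_shares(steps))
```

Normalizing by trajectory length is what lets runs of different lengths be compared on the same 0% to 100% axis.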

4,888

nilenso retweeted

Srihari Sriraman

@SrihariSriraman

Apr 22

This was a great opportunity to bridge our industry work at @nilenso with academic research at CMU, and I am very grateful to the co-authors @heathermiller (CMU), Michael Isaac (CMU), and @AtharvaRaykar (@nilenso). We will be in the Bay Area for the conference soon. If you are building AI tooling or want to talk shop about context engineering, we would love to connect.

265

nilenso retweeted

Srihari Sriraman

@SrihariSriraman

Apr 22

We’ll be presenting context-viewer at ACM @CAISconf! context-viewer is an observability tool for context engineering. It gives structure to LLM contexts using classification by topics, and allows you to compare runs side-by-side. Useful for things like analyzing agent failures, token spends, and evaluating context compaction.
- github.com/nilenso/context-v…
- caisconf.org/program/2026/de…

3,629

nilenso retweeted

Drew Breunig

@dbreunig

Apr 1

My takeaways from scanning the Claude Code code for ~45 min this evening:
1️⃣ Harness engineering is hard. There's a lot of hard-won knowledge in here and plenty of diagnostics to keep the feedback flowing.
2️⃣ Harnesses and prompts smooth out model quirks. @SrihariSriraman and I covered this last month, but good to see it verified here. So many conditionals based on model types and specific contexts to deploy to mitigate model weirdness.
3️⃣ So much of this is CLI app boilerplate. Fully expect a tool like @badlogicgames's pi to be the foundation for any CLI agent being built today.
I talk about the last point, the opportunity for shared foundations, in a post today: dbreunig.com/2026/03/26/winc…

The Cathedral, the Bazaar, and the Winchester Mystery House

Welcome to the era of sprawling, idiosyncratic tooling.

dbreunig.com

3,620

nilenso retweeted

Srihari Sriraman

@SrihariSriraman

Apr 1

Did you know claude code has "model counterweights"? These are patches in the system prompt that exist to balance model biases. These weren't visible earlier, but the leaked code has @[MODEL LAUNCH] annotations that call them out explicitly.

287

nilenso retweeted

Srihari Sriraman

@SrihariSriraman

Mar 11

Replying to @swyx @dbreunig

I did a compaction analysis on a couple of recent claude code sessions, and thought I'd share here too. You can use the link to explore further if you're interested. These are good compaction examples. I wish I could do the same analysis with some bad examples. nilenso.github.io/context-vi…

128

nilenso retweeted

Drew Breunig

@dbreunig

Feb 18

Somehow I didn't fully appreciate how strongly Claude Code's prompt has to fight against the weights to make parallel tool calls. blog.nilenso.com/blog/2026/0…

255

19,121

nilenso retweeted

Govind Krishna Joshi

@govindkrjoshi

Feb 17

Something I've been thinking for a while, but finally got to writing it down. The core thesis is that building reliable AI applications requires a harness to be able to tinker, experiment and iterate, without which the project gets stuck in the prototyping phase. blog.nilenso.com/blog/2026/0…

Engineering Maturity is all you need

8:PM in the evening: it’s demo day tomorrow.

blog.nilenso.com

460

nilenso retweeted

Drew Breunig

@dbreunig

Feb 13

Really excited for this one: @SrihariSriraman and I took a deep dive into coding agent system prompts to understand their structure, similarities, and differences. dbreunig.com/2026/02/10/syst…

How System Prompts Define Agent Behavior

System prompts matter far more than most assume. A given model sets the theoretical ceiling of an agent’s performance, but the system prompt determines whether this peak is reached.

dbreunig.com

315

23,181

nilenso retweeted

Srihari Sriraman

@SrihariSriraman

Jan 15

Replying to @badlogicgames

I've been studying the effect of system prompts in the model, tools, system-prompts, harness stack. So, I ran the same SWE-Bench-Pro task with Opus Claude Code, but with different system prompts. One run used Codex's system prompt, and another run used Claude's system prompt. The workflows on the runs are different, and mirror these kinds of sentiments. You can see the corresponding differences in the system prompts too. We may be misattributing some of these behaviours to the model, when they're attributable to the system prompt.

386

nilenso retweeted

Yash Gandhi @yashgandhi_

Jan 8

Atharva Raykar from @nilenso will tell us how you're not a programmer anymore: you're coordinating a complex system. Systems thinking, feedback loops, scientific reasoning. The skills that actually matter when building AI. Unlearning and relearning the new rules of the game.

255

nilenso retweeted

atharva

@AtharvaRaykar

16 Dec 2025

Link to full article: blog.nilenso.com/blog/2025/1…

Minimum Viable Benchmark

A few months ago, I was co-facilitating a “Birds of a Feather” session on keeping up with AI progress. This was a group of engineering leaders and ...

blog.nilenso.com

166

nilenso retweeted

atharva

@AtharvaRaykar

16 Dec 2025

I have collected some thoughts on how to look at benchmarks that are rarely expressed elsewhere. I believe it's useful and tenable for people and organisations to build their own "minimum viable benchmark" to really make sense of LLM capabilities.

1,046

nilenso retweeted

Srihari Sriraman

@SrihariSriraman

27 Nov 2025

I just published the next article in the "How to work with Product" series. This one is called: "Taste and Adjust", and it's about finding ways to "taste" your product at every stage, by consciously building a product development flywheel. Link: blog.nilenso.com/blog/2025/1…

177

nilenso retweeted

atharva

@AtharvaRaykar

7 Nov 2025

I let Codex CLI rip over the @nilenso website code to optimise performance. It scripted a benchmark, applied some changes and reran the bench to confirm that its changes sped things up by ~5x. Our website sends ~10x less data as well. We had been putting off the website optimisation work due to other priorities, but these days the friction to take up this kind of work is really low.

256

nilenso retweeted

atharva

@AtharvaRaykar

5 Nov 2025

another win for bitter lesson driven development: specialised tool interfaces -> code execution blog.nilenso.com/blog/2025/1…

Anthropic

@AnthropicAI

4 Nov 2025

New on the Anthropic Engineering blog: tips on how to build more efficient agents that handle more tools while using fewer tokens. Code execution with the Model Context Protocol (MCP): anthropic.com/engineering/co…

513

nilenso retweeted

Srihari Sriraman

@SrihariSriraman

4 Nov 2025

Sometimes I just want to give a GitHub URL and a prompt to semantically search. Similar to web search tools, but for GitHub / GitLab. I made a tool that does this, following @thorstenball's "How to Build an Agent", and @nickbaumann_'s "What Makes a Coding Agent?" blog posts. I just use GitHub/GitLab's APIs instead of using the filesystem. I use this now in storymachine because product managers or business folks don't have a repo cloned or an agentic-cli running on their machines.

9,417