Defused bombs. Tech banker. 3x founder. Now deploying enterprise AI. GM @tribe_ai.

Joined May 2023
Pinned Tweet
I went from defusing bombs in the Navy to helping enterprises not blow up their AI deployments. Now: GM at Tribe AI (@tribe_ai). Writing about AI security, enterprise risk, and why most "AI strategies" are just slide decks. Builders and operators, my DMs are open.
Claude Code's TodoWrite tool pinged early models every 5 turns to stay on track. As models got better, the reminders made Claude rigidly stick to lists instead of adapting. They removed it. A harness constraint that was essential 6 months ago may be limiting your agent now.
Ask Claude "should I launch this?" and you get 5 yes reasons. Ask "is this a bad idea?" and you get 5 no reasons. Same product. Karpathy's LLM Council fix: 5 advisors independently, responses anonymized, peer-review each other. The gaps reveal what none of them caught.
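A rough sketch of the anonymize-then-review step, not Karpathy's actual implementation. The advisors here are hypothetical callables standing in for model calls, and run_council and review_prompt are illustrative names.

```python
import random
from typing import Callable, Dict

Advisor = Callable[[str], str]  # hypothetical stand-in for one model call


def run_council(question: str, advisors: Dict[str, Advisor], seed: int = 0) -> Dict[str, str]:
    """Collect independent answers and return them under anonymous labels."""
    # Each advisor sees only the question, never another advisor's answer.
    answers = [advisor(question) for advisor in advisors.values()]
    # Shuffle so a label cannot be traced back to the advisor that wrote it.
    random.Random(seed).shuffle(answers)
    return {f"Response {chr(ord('A') + i)}": text for i, text in enumerate(answers)}


def review_prompt(question: str, anonymized: Dict[str, str]) -> str:
    """Build the prompt each advisor receives for the peer-review round."""
    body = "\n\n".join(f"{label}:\n{text}" for label, text in anonymized.items())
    return (
        f"Question: {question}\n\n{body}\n\n"
        "Rank these responses and note what each one missed."
    )
```

Sending review_prompt back to every advisor gives the peer-review round. Disagreements between the rankings are where the gaps show up.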
Anthropic found agents marking features complete after unit tests passed, then those features failed in the browser. Fix: add Puppeteer so agents can navigate the app themselves. Quality improved. The bottleneck wasn't model capability. It was the quality of the feedback loop.
Anthropic's eval guidance: start with 20-50 cases drawn from real production failures, not hand-crafted scenarios. Hand-crafted tests reflect what you imagine goes wrong. Real failures show what actually does. 20 verified examples beat 200 synthetic ones you haven't checked.
The SWE-agent paper found capping search output at 50 matches was one of the highest-leverage changes in the build. More results flooded working memory and caused agents to thrash. The fix: every tool should manage its own output volume. Not the agent.
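A minimal sketch of a tool that manages its own output volume, assuming a simple substring search over lines. The function name and the refusal message are illustrative and not taken from SWE-agent. Only the cap of 50 comes from the post.

```python
from typing import List

MAX_MATCHES = 50  # the cap cited above; tune per tool


def capped_search(lines: List[str], query: str, max_matches: int = MAX_MATCHES) -> str:
    """Return matching lines, or a short hint to narrow the query when there are too many."""
    matches = [f"{i}: {line}" for i, line in enumerate(lines, start=1) if query in line]
    if len(matches) > max_matches:
        # The tool reports the count and withholds the results,
        # so the agent's context is not flooded.
        return (
            f"{len(matches)} matches for {query!r}; showing none. "
            f"Narrow the query to get at most {max_matches} results."
        )
    return "\n".join(matches) if matches else f"No matches for {query!r}."
```

The agent never has to decide how much output is too much. The tool makes that call and tells the agent what to do next.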
Two agents can both solve a task correctly and still differ dramatically in production. Deep Agents measures efficiency against an ideal trajectory: latency ratio, cost ratio, tool call count. Correctness gets you on the shortlist. Efficiency determines what ships.
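One way the comparison could be computed, as a sketch only. The Trajectory fields and the efficiency function are assumptions for illustration, not the Deep Agents API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Trajectory:
    latency_s: float  # wall-clock seconds for the run
    cost_usd: float   # total model and tool spend
    tool_calls: int   # number of tool invocations


def efficiency(actual: Trajectory, ideal: Trajectory) -> Dict[str, float]:
    """Compare a real run against a hand-written ideal run for the same task."""
    return {
        "latency_ratio": actual.latency_s / ideal.latency_s,
        "cost_ratio": actual.cost_usd / ideal.cost_usd,
        "extra_tool_calls": actual.tool_calls - ideal.tool_calls,
    }
```

A ratio of 1.0 means the agent matched the ideal run. Two agents that both pass the task can then be ranked by how far above 1.0 they land.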
Most enterprise AI teams take the bolt-on path: add a model to an existing workflow. Gokul Rajaram argues that approach has a real ceiling. The teams seeing durable product differentiation are rebuilding end-to-end with new UX primitives, not adding a layer on top.
Claude Code shipped with RAG for context. They replaced it with a Grep tool. Why: agents that search for their own context outperform agents given pre-built context. Claude went from requiring RAG to doing nested multi-layer file searches autonomously in about a year.
MCP tool schemas eat context before agents start thinking. Anthropic cut one deployment from 150,000 tokens to 2,000 by switching to CLI tools. 98.7% overhead reduction. CLI commands cost ~15 tokens at invocation. The 50x gap compounds with tool count.
Goldman and Barclays each run 30,000 offshore BPO staff. That's where AI labor displacement starts. Per Gokul Rajaram: cut BPO spend first, then stop backfilling attrition. Layoffs come last. Enterprise workforce planning keeps getting the sequence wrong.
The Claude Code team called their auto-memory 'barely net positive.' The root cause isn't failed recall. It's occasional wrong recall, which forces you to verify everything. Below the trust threshold, an agent with patchy memory is less useful than one with none.
Most agents in production run under shared service accounts. When one is compromised, you can't audit which agent did what, and you can't shut it down without killing all of them. Unique machine identities and kill switches per agent are table stakes for production deployment.
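A toy sketch of per-agent identity with an individual kill switch, assuming an in-memory registry. AgentRegistry and its methods are hypothetical names. A real deployment would back this with an identity provider and credential revocation.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class AgentRegistry:
    _active: Dict[str, str] = field(default_factory=dict)
    audit_log: List[Tuple[str, str, bool]] = field(default_factory=list)

    def register(self, name: str) -> str:
        """Issue a unique identity per agent instead of a shared service account."""
        agent_id = str(uuid.uuid4())
        self._active[agent_id] = name
        return agent_id

    def authorize(self, agent_id: str, action: str) -> bool:
        """Check the agent is still active and record who attempted what."""
        allowed = agent_id in self._active
        self.audit_log.append((agent_id, action, allowed))
        return allowed

    def kill(self, agent_id: str) -> None:
        """Revoke one agent without touching the others."""
        self._active.pop(agent_id, None)
```

Because every action is logged against a specific agent_id, a compromised agent can be traced and shut down alone.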
LangChain's Deep Agents went from 52.8% to 66.5% on Terminal Bench 2.0 without changing the model. The gain came entirely from the harness: self-verification loops and loop-detection middleware. The architecture around your model matters as much as the model itself.
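A minimal sketch of what loop detection around tool calls can look like, assuming tools are plain Python callables invoked with keyword arguments. This is not LangChain's middleware. LoopDetector and LoopDetected are hypothetical names.

```python
from collections import Counter
from typing import Any, Callable


class LoopDetected(RuntimeError):
    """Raised when the same tool call repeats too many times."""


class LoopDetector:
    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._counts: Counter = Counter()

    def wrap(self, name: str, tool: Callable[..., Any]) -> Callable[..., Any]:
        """Return a guarded version of the tool that counts identical calls."""

        def guarded(**kwargs: Any) -> Any:
            # Identical tool name plus identical arguments counts as a repeat.
            key = (name, repr(sorted(kwargs.items())))
            self._counts[key] += 1
            if self._counts[key] > self.max_repeats:
                raise LoopDetected(
                    f"{name} called {self._counts[key]} times with identical arguments"
                )
            return tool(**kwargs)

        return guarded
```

The harness can catch LoopDetected and inject a message telling the agent to try a different approach instead of burning more turns.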
Human code review in AI-assisted teams should shift its priorities. Style and naming: linters handle those better. The signal is in types (documenting contracts, catching bugs automatically) and comments explaining intent. 'Handle timezone for HIPAA compliance' beats 'convert timezone.'
Claude Code's TodoWrite tool kept early models on track with reminders every 5 turns. As models improved, those reminders became a constraint. Better models treated the list as rigid rather than adaptive. They replaced it with a Task Tool supporting subagent coordination.
OpenAI's Codex team found that one big AGENTS.md breaks at scale in four ways: it crowds out context, makes everything seem equally important, rots as code evolves, and can't be verified for coverage. Their fix: ~100 lines that serve as a map pointing to structured docs.
Tokenizing PII protects identity fields but not surrounding text. Consider: 'TOKEN_A was diagnosed with terminal cancer and has 3 months to live.' The person is shielded. The medical detail is fully exposed. Session metadata or process of elimination can close the gap.
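A small sketch of the gap, assuming name-only tokenization by exact string match. The tokenize_names helper and the name "Jane Doe" are made up for illustration.

```python
import re
from typing import Dict, List, Tuple


def tokenize_names(text: str, names: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each known name with TOKEN_A, TOKEN_B, ... and return the mapping."""
    mapping: Dict[str, str] = {}
    for i, name in enumerate(names):
        # Letter suffixes only cover 26 names; enough for a sketch.
        token = f"TOKEN_{chr(ord('A') + i)}"
        mapping[token] = name
        text = re.sub(re.escape(name), token, text)
    return text, mapping


# "Jane Doe" is a hypothetical patient name.
redacted, mapping = tokenize_names(
    "Jane Doe was diagnosed with terminal cancer and has 3 months to live.",
    ["Jane Doe"],
)
print(redacted)
```

The identity field is gone, but the diagnosis and prognosis pass through untouched. Anything that narrows down who TOKEN_A is makes the whole sentence sensitive again.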
Each eval you add to an agent system applies directional pressure on behavior. A huge eval suite can create an illusion of improvement. LangChain's Deep Agents (Varun Trivedy): catalog only the behaviors that matter in production, then write evals that measure exactly those.
ChatGPT referral traffic converts at up to 16x Google organic. Microsoft analyzed 1,200 websites: AI platforms convert at 3-17x traditional channels. Users complete discovery inside the AI and arrive already decided. It's the lowest-funnel traffic that exists.
The Claude Code team called their auto-memory 'barely net positive.' The problem isn't failure to remember. It's occasional wrong recalls. A few incorrect memories force you to verify everything before trusting the agent's output. Reliability comes before capability.