SDE | Building AI Agents

Joined November 2024
80 Photos and videos
Pinned Tweet
May 7
Hey y'all I'm starting a new series on how production ai agents actually work under the hood In this series I'll cover some topics in 4 phases - Runtime internals - memory and states - multi agent orchestration - production systems Breaking down how systems like Claude, Cursor, etc. are actually architected. Starting from tomorrow
1
312
Day 17/20 of AI Agent Systems Series ARC 4: Production Systems - Human in the Loop as Architecture an agent that deletes the wrong S3 bucket, commits a breaking change to main, or triggers a large unexpected API bill does not pause and ask. it just acts. fully autonomous agents are useful in controlled environments. in production, most teams find they need deliberate points where a human reviews, approves, or edits what the agent is about to do. this is not a lack of trust in the model. it is a recognition that some actions carry consequences that are hard or impossible to reverse, and the cost of being wrong once outweighs the cost of adding a gate. the way most teams first implement this: a simple prompt. "should i proceed? y/n" the way it needs to work in production is architecturally different from that. HITL is not a pause button. it is an async state management problem. here is why.
2
1
49
the core architecture: serialize, wait, resume. when an agent hits an interrupt, it should not hold an open connection waiting for a human. humans take minutes, hours, sometimes days. the correct pattern: > agent reaches interrupt condition > serializes full graph state to a checkpoint store > returns control to the caller > caller routes the interrupt to whatever surface the human will see (UI, Slack, email) > human responds > system loads the checkpoint by thread_id > execution resumes from the exact node that interrupted > no re-running from the top LangGraph gives you two interrupt mechanisms to reach that checkpoint: static breakpoints: declared at compile time. always interrupts at a specific node, regardless of what the agent has done or decided. useful for mandatory review gates. dynamic interrupts: raised from inside a node at runtime, based on the current state. the node inspects what it is about to do and decides whether this specific action warrants a pause. more flexible, more powerful. once interrupted, a human has four options: > approve the action as-is > edit the action before it runs > reject it with feedback the agent can use > respond directly to an "ask user" style query the thread_id is what makes resumption work. it is a persistent cursor, the primary key for the conversation state. every checkpoint is stored against it. the system loads that checkpoint, the human's decision gets written into state, and Command resumes execution. LangGraph also allows "time travel": invoking the graph with a specific checkpoint_id to fork execution from a prior state, useful for exploring what would have happened if a human had approved differently.
1
1
11
two production gotchas most writeups skip. TOKEN EXPIRY DURING APPROVAL WAITS OAuth access tokens expire on fixed timers. HubSpot around 30 minutes, Google around an hour, Salesforce around two hours. if a human takes longer than that to approve an action involving one of these services, the token has expired by the time execution resumes. the agent tries to act on the approved action and fails, not because the approval was wrong, but because the credential the agent held is no longer valid. the fix: re-validate credentials and re-fetch any time-sensitive data immediately before executing an approved action, not immediately before requesting approval. the world the agent reasoned about may have changed during the wait. ABANDONED THREAD ACCUMULATION a thread is interrupted and nobody resumes it. this happens more than teams expect. the checkpointer holds the frozen state indefinitely. at scale this becomes a storage and visibility problem. production teams implement a TTL-based expiry job that scans for threads not resumed within a threshold, 7 days is a common default, and marks them as abandoned. abandoned threads get surfaced for human review rather than sitting silently in the checkpoint store. the rule of thumb from practitioners: interrupt on irreversible, high-blast-radius actions only. not on every step. a fully autonomous graph completes in seconds. a human-gated one can sit frozen for hours. every interrupt is a latency decision as much as a safety decision. the two have to be balanced for the system to be usable. Sources: langchain.com/blog/making-it… docs.langchain.com/oss/pytho… digitalapplied.com/blog/huma… abstractalgorithms.dev/langg…
1
21
Jun 13
Day 16/20 of AI Agent Systems Series ARC 4: Production Systems - Permission and Security Architecture an AI agent with full filesystem access, shell access, and network access is not a feature. it is a liability waiting for the wrong prompt. Claude Code's permission system is one of the more thought-through examples of how to handle this at the architecture level. a recent source-level analysis (arXiv, 2026) found the permission system follows four design principles: > deny-first with human escalation > a graduated trust spectrum > defense in depth through layered mechanisms > reversibility-weighted risk assessment every tool call passes through this system before execution. the default behavior is to deny or ask, never to allow silently. here is how that actually plays out in practice.
1
1
32
Jun 13
by default, Claude Code runs with strict read-only permissions. reading files, grep, git status, basic navigation, these run without a prompt. the moment an action could modify something, write a file, run a command that changes state, make a network call, the permission system steps in and asks before proceeding. as of mid-2026, the official docs describe the evaluation order as deny, then ask, then allow. deny rules are checked first and cannot be overridden by anything downstream, including hooks. this is the graduated trust spectrum in practice. not every action gets the same scrutiny read operations are cheap to allow. write and execute operations get a prompt. operations that are hard to reverse, deleting files, pushing to git, running destructive database commands, get the strongest scrutiny regardless of what mode the session is running in. there are several permission modes available,each trading oversight for speed differently. default mode keeps every action visible. acceptEdits assumes file edits are low risk once you know the codebase, but still prompts for bash commands and anything touching the system outside the editor. then there is bypassPermissions, sometimes called "YOLO mode." it disables permission checks entirely. the documentation is honest about this: it exists for automated pipelines with no human in the loop, and using it on a main developmentmachine is generally discouraged. permissions are one layer. sandboxing is the other. permissions control which tools and files an agent can touch. sandboxing enforces that at the OS level, restricting what a bash command and its child processes can actually reach on disk and over the network. the two layers are complementary, not redundant. permissions can be reasoned about and configured. sandboxing holds even if the permission layer is somehow bypassed.
2
1
44
Jun 13
the part of this architecture worth internalizing is the reversibility-weighted risk assessment. not all actions carry the same cost if something goes wrong. reading a file is fully reversible, nothing changed. editing a file is mostly reversible, version control exists. running rm or git push or a database migration may not be reversible at all. the permission system treats these differently not because one is "more dangerous" in the abstract, but because the cost of being wrong scales with how hard the action is to undo. this is a useful lens for designing permission systems generally, not just for Claude Code. ask: if the agent gets this wrong, how hard is it to recover? the answer should determine how much friction that action gets, not a flat risk category assigned once and never revisited. one more layer worth knowing: PreToolUse hooks can inspect tool calls and enforce additional organization-specific policies before execution, complementing the built-in permission system. some teams use a small model here to auto-approve genuinely safe operations while flagging anything unusual, getting uninterrupted flow without reaching for bypassPermissions as a shortcut. the broader takeaway for production agent design: permission architecture is not a single setting. it is layered: a default posture, graduated trust based on action type, reversibility-aware escalation, and OS level sandboxing underneath all of it. remove any one layer and the others have to work harder to compensate. removing all of them is what bypassPermissions does, and the documentation is honest about when that tradeoff is acceptable. Sources: code.claude.com/docs/en/secu… arxiv.org/html/2604.14228v1 claudecode-lab.com/en/blog/c… generalanalysis.com/guides/h…
1
13
Jun 10
Tried implementing elastic search from scratch in go will post about it Good night
6
Null retweeted
Just launched VengenceUI v2 50 new components Much faster performance Cleaner animations Better DX Built for devs who want landing pages and interfaces that actually look premium Try it: vengenceui.com/ Github: github.com/Ashutoshx7/Vengen… Show some love if you like it ⚡
25
16
79
6,110
Jun 7
Day 15/20 of AI Agent Systems Series ARC 3: Multi-Agent Orchestration - Handoff Object Design Recent industry analyses suggest orchestration design is a major source of production failures. not the models. not the tools. not the prompts. how agents coordinated with each other. the handoff boundary is one of the highest-leverage failure points in multi-agent systems. agent A completes its work and passes to agent B. what exactly gets passed, in what format, with what context, is usually left vague. that vagueness is the failure mode. the handoff object is a design artifact. it needs to be treated like one. what goes in, what stays out, and what gets transformed at the boundary determines whether the next agent starts informed or starts confused.
1
2
47
Jun 7
here is how major frameworks approach the handoff object differently. OPENAI AGENTS SDK made a breaking change to handoff behavior. prior versions passed the raw message history between agents. v0.6.0 collapsed that history into a single context message with the header: "For context, here is the conversation so far between the user and the previous agent." why: raw history passed between agents created context pollution at scale. agent B received everything agent A saw, regardless of relevance. the compression trades fidelity for focus. GOOGLE ADK takes a more explicit approach with two modes: agents as tools: the callee sees only a focused prompt and the necessary artifacts. no history. clean context. call it, get a result. agent transfer: the sub-agent inherits a view over the full session. the include_contents knob on the callee controls exactly how much flows through. the architectural choice is explicit: teams decide per handoff how much context the receiving agent gets. no implicit defaults. this design forces a useful question before every handoff is wired up: does this sub-agent genuinely need the full history, or just the relevant slice? Manus reported roughly a 10x cost difference between cached and uncached tokens in parts of its production system. passing too much context at handoffs is not just an accuracy risk. it is a unit economics problem.
1
2
32
Jun 7
here is a practical framework for designing handoff objects, regardless of which framework you use. THREE CATEGORIES: WHAT TO PASS > the current active goal > relevant results produced so far > constraints the receiving agent must respect > expected output format WHAT TO DROP > full conversation transcript > tool call internals from prior agents > intermediate reasoning chains > retry logs and error details already resolved WHAT TO TRANSFORM > output format from agent A likely does not match input schema expected by agent B > a transformation step at the boundary converts rather than passing raw output and hoping the receiving agent figures it out the transformation step is the one most teams skip. they pass agent A's output directly and let agent B parse it however it can. schema mismatches become reasoning errors that look like model failures. one honest limitation worth knowing: there is no widely adopted cross-framework standard for agent handoffs today. porting a multi-agent workflow from CrewAI to LangGraph means rewriting the handoff logic entirely. Google's A2A protocol is the strongest candidate for future interoperability but adoption is still early. for now, the practical advice: standardize on one framework early. design agent interfaces so the underlying handoff protocol can be swapped later without rewriting agent logic. that closes Arc 3: Multi-Agent Orchestration. Arc 4 next: Production Systems. permission architecture, observability, failure recovery, and cost design. Sources: peppereffect.com/blog/agent-… developers.googleblog.com/ar… tao-hpu.medium.com/ai-agent-… arxiv.org/pdf/2603.09619
1
57
Jun 6
Day 14/20 of AI Agent Systems Series ARC 3: Multi-Agent Orchestration - Agent as Tool most multi-agent systems are wired through message passing. agent A finishes, sends a message to agent B, agent B reads it, runs, sends to agent C. it works. it also means every agent in the system is tightly coupled to the message format of every other agent. change how agent A formats its output and agent B breaks. test agent B in isolation and you need to simulate agent A's messages exactly. LangGraph's subgraph pattern takes a different approach. wrap each agent as a node. give it a typed input, an isolated state, and a declared typed output. the parent graph sees it as a function call,not a conversation participant. that one architectural decision changes what you can do with the agent later. here is why it matters in practice.
1
2
76
Jun 6
a subgraph in LangGraph is a complete StateGraph used as a node inside another graph. it has its own TypedDict state, its own nodes, its own edges, its own internal reasoning loop. none of that is visible to the parent graph. the parent graph only sees: > what goes in (the typed input) > what comes out (the declared output keys) two modes depending on your architecture: SHARED STATE SCHEMA parent and subgraph share the same state keys. the subgraph reads directly from parent state and writes back to it. simpler to wire, tighter coupling between parent and subgraph. DIFFERENT STATE SCHEMAS parent and subgraph have different state structures. a transformation node sits between them, converts parent state into subgraph input before entry, converts subgraph output back into parent state on exit. more code upfront, but the subgraph becomes fully independent of the parent's state shape. the second mode is what enables real composability. when a subgraph does not depend on the parent's state structure, it can be: > tested completely in isolation, no parent graph needed > versioned and updated without touching the parent > deployed as a separate service on LangGraph Platform > swapped for a different implementation without changing a single line of parent graph code LangGraph Platform (GA May 2025) supports deploying subgraphs as independent services. the parent graph calls them over the network the same way it would call a local node.the interface contract is the same either way.
1
1
35
Jun 6
the scatter-gather pattern is where this gets particularly useful for parallel workflows. the parent graph distributes sub-tasks to multiple subgraph agents simultaneously via the Send API. each subgraph runs with its own isolated state slice. when all return, the parent graph consolidates the outputs. a research agent example: > send 5 search queries to 5 subgraph agents in parallel > each subgraph retrieves, processes, returns a result > parent graph synthesizes across all 5 results when subgraphs run in parallel, latency approaches the slowest subgraph rather than the sum of all subgraphs. and each search subgraph can be tested with a single query without running the whole pipeline. compare this to a message-passing architecture doing the same thing: in message passing, to test one search agent you need the full message history from the orchestrator, formatted exactly right. a schema change anywhere breaks the chain. parallel execution requires careful coordination of shared mutable state. the agent-as-tool pattern removes those problems by making the interface explicit and typed rather than implicit and conversational. the tradeoff is real though. setting up transformation nodes between different state schemas adds boilerplate. for simple linear workflows where agents are naturally sequential and tightly related, message passing may be simpler to reason about. the pattern earns its complexity on workflows that: > need parallel execution across independent sub-tasks > have sub-agents that might be reused in other systems > require sub-agents to be testable and deployable independently of the parent workflow most production multi-agent systems eventually land here once the message-passing approach starts creating maintenance problems at scale. Sources: docs.langchain.com/oss/pytho… latenode.com/blog/langgraph-… medium.com/@vin4tech/convers…
1
29
Jun 5
Day 13/20 of AI Agent Systems Series ARC 3: Multi-Agent Orchestration - The Swarm Pattern the Supervisor Pattern introduces a central coordination point: the supervisor. if the supervisor makes a bad routing decision,every downstream agent acts on it . if it becomes a bottleneck at scale, every worker waits behind it. the Swarm Pattern removes the supervisor entirely.agents hand off to each other based on context,with no central coordinator deciding who goes next. OpenAI popularized this pattern through the Swarm framework in October 2024. Swarm was later superseded by the OpenAI Agents SDK, which kept the same core handoff model while adding production features. same conceptual model, adds guardrails, tracing, and TypeScript support on top. the pattern survived the deprecation. the primitives are in production at scale. but most people have a fundamental misconception about how it works.
2
1
104
Jun 5
the misconception first: Swarm is not parallel. this catches most people. in the Swarm pattern, only one agent is active at any given time. it is sequential control transfer, not concurrent execution. agent A runs, decides it needs to hand off, passes control to agent B, agent B runs. one active agent throughout. fan-out parallelism multiple agents running simultaneously on independent sub-tasks requires a coordinator. Swarm explicitly removes the coordinator, so it does not give you parallelism. it gives you decentralized sequential routing. here is how the pattern actually works. two primitives only: AGENTS an agent is a system prompt plus a list of functions. that is it. the functions define what the agent can do and who it can hand off to. HANDOFFS a handoff is a function that returns a different agent object. when the current agent calls that function,the framework switches the active agent and continues. the entire API surface of the original Swarm: > define agents with instructions and tools > define handoff functions that return other agents > call run() with a message > the framework routes through agents until done the framework is stateless no persistent state between calls. every handoff must carry all the context the next agent needs in the conversation history. no hidden state. no memory between runs. OpenAI Agents SDK (March 2025) added what original Swarm deliberately left out: > guardrails (input/output validation) > built-in tracing per handoff > persistent state management > TypeScript support the pattern is identical. the production runtime is the Agents SDK, not the original Swarm repo.
1
1
24
Jun 5
here is where Swarm breaks and it breaks in specific, predictable ways. TERMINATION PROBLEM without a supervisor to decide "we are done," the system needs explicit exit conditions. max iterations. quality thresholds. timeout-based convergence. if none of these are defined carefully, agents hand off in a loop indefinitely. too aggressive a termination condition produces incomplete results. too conservative burns tokens until the budget runs out. this is the most common production failure in Swarm based systems. teams define the agents. they forget to define when to stop. DEBUGGING PROBLEM with a Supervisor Pattern, the execution history is centralized. one node made every routing decision. one place to look. with Swarm, tracing a failed task means reconstructing the handoff chain from distributed logs.which agent had control at step 7. what context it received. what it decided to hand off. one article compared it to debugging an eventually consistent distributed database you need distributed tracing tooling from day one,not as an afterthought. WHEN SWARM ACTUALLY WINS > exploration tasks where the optimal path is unknown and no supervisor could predetermine the routing > customer service triage where query type determines routing and no global state is needed > tasks where agents are genuine peers with equal authority no natural hierarchy exists when tasks have strict ordering requirements, need transactional guarantees, or require a global view of progress use Supervisor. the pattern is not better or worse than Supervisor. it is the right tool for a different class of problem. Sources: augmentcode.com/guides/swarm… morphllm.com/openai-swarm gurusup.com/blog/agent-orche… galileo.ai/blog/openai-swarm…
1
10
Jun 3
Windsurf is now Devin Desktop Same Windsurf ide just unified under Devin > Manage all your local and cloud agents from one Kanban view > Introduces Spaces group sessions, PRs and files so agents share context > Supports ACP run Codex, Claude Code, OpenCode or your own agents inside it > Plan locally, hand off to cloud Devin keeps working after you close your laptop
Introducing Devin Desktop. Manage fleets of local and cloud agents from one surface. Plan, delegate, review, and ship without leaving your editor.
1
191
Jun 1
Yooo finally got the @runyourloop merch 😁 The tee and tote bag are sick af 🔥🤌 Thank you so much @nush_1320 and @topmateHQ for this 🙌
1
58
May 30
The hardest part of deploying AI agents to production isn’t building the agent. It’s handling everything that happens when the agent breaks. Most people still think about the happy path: The agent receives a task, calls a tool, gets the result, and completes the workflow. But production systems don't live on the happy path. APIs time out. Tool calls fail. Providers return errors. Workers crash halfway through execution. Suddenly, your agent is stuck in the middle of a workflow and your customer is still waiting for a response. That's when the real engineering questions show up: • Should the workflow start over from scratch? • Should it resume from the last successful step? • Can you inspect exactly what failed? • Can the system recover without losing progress? These problems have nothing to do with making the agent smarter. They're about making the system reliable. Production-grade agents need durability, observability, and recovery mechanisms built into the runtime from day one. Because the difference between a demo and a production system isn't what happens when everything works. It's what happens when things don't.
1
1
52