Prasenjit Sarkar

@stretchcloud

A lot of teams do not fail at AI media because they cannot call a model. They fail because production media workflows are bundles of taste, prompts, settings, retries, post-processing and review decisions pretending to be one API call. That is why Runway API Recipes are worth paying attention to. The docs frame a Recipe as the option for a known use case where you want a polished result from a single call instead of building and maintaining the workflow yourself. Runway packages model selection, prompt engineering and post-processing into one endpoint so teams can generate consistent assets at scale. This is a subtle but important shift. The early API market was mostly raw capability: text-to-video, image-to-video, generation, edit, upscale. But real product teams want repeatable outcomes. They need a social ad variant, a product visual, a cinematic clip, a branded asset, or a workflow that behaves the same way tomorrow as it did today. That mirrors what happened in cloud. Teams started by renting primitives: compute, storage, queues. Then the valuable layers became managed databases, auth, search, observability and deployment workflows. In AI media, the equivalent layer is not just the model. It is packaged production judgment. The hidden bottleneck is operational taste. If the creative workflow is fragile, every API call becomes a tiny art-director job. Recipes are an attempt to move that judgment into a repeatable product surface. That is where AI media gets closer to infrastructure and further away from one-off prompt demos. x.com/runwayml/status/206733…

Runway

@runwayml

New on the Runway API: Recipes. Drop production-ready generative media features into your platform, with one API call. Recipes are Runway-built endpoints with our prompting and workflow expertise packaged in. Polished results, without building or maintaining the workflow yourself. Create product ads from an image, swap a product in an existing video and more. At scale.

Prasenjit Sarkar

@stretchcloud

15m

The interesting part of Unreal Engine adding MCP is not that Claude or Codex can "talk to a game engine." It is that the engine is becoming an agent-addressable workspace. For years, creative tools have had scripting layers, plugins, command palettes and automation APIs. The difference now is that an LLM can sit above those surfaces, reason over project state, and ask the engine to perform bounded actions. Unreal Engine 5.8 makes this explicit with an experimental MCP plugin for the Unreal Editor. Epic's release notes say the plugin enables agentic AI systems to connect to the editor; the release summary describes automated asset creation, testing, optimization and project interaction across core engine systems. That matters because game and 3D production is full of work that is too contextual for a generic chatbot and too tedious for manual repetition: placing assets, checking materials, validating levels, optimizing scenes, creating test variations, wiring blueprints, searching project structure, and explaining why something broke. This is the same pattern we saw with coding agents. The leap was not autocomplete. It was giving the model safe access to the repo, terminal, tests and review loop. In Unreal, the equivalent is scene graph, assets, materials, levels, build settings, performance budgets and artist/developer approval. The hidden bottleneck is not model creativity. It is tool-grounded control. If MCP becomes the standard bridge into creative environments, the winning products will not be the ones that generate the prettiest one-off demo. They will be the ones that make the agent's edits inspectable, reversible and compatible with a team's actual production pipeline. x.com/ziwenxu_/status/206729…

Ziwen

@ziwenxu_

Unreal Engine just added MCP. Now our Claude and Codex can use it natively!

Prasenjit Sarkar

@stretchcloud

25m

Every team using coding agents eventually has the same quiet meeting: the demos look useful, then the token bill arrives. That is why Headroom is a more interesting project than it looks. The GitHub repo describes a proxy/library/MCP server that compresses tool outputs, logs, files and RAG chunks before they reach the LLM, with claimed 60-95% fewer tokens while preserving answers. Coverage of the project frames it as a drop-in layer created by Netflix senior engineer Tejas Chopra, not an official Netflix product, with several Netflix teams and external projects reportedly using it. The important mechanism is where it sits. Most teams try to reduce cost at the application layer: shorter prompts, smaller models, fewer turns, manual summarization. Headroom attacks the infrastructure layer by sitting between the app and the model. That means the developer can keep the agent workflow while removing redundant context before it compounds across tool calls. This is the same pattern as CDN compression, log sampling and query optimization. At small scale, waste is invisible. At production scale, repeated structure becomes a tax. Build logs, JSON arrays, traces, files and search results are full of redundancy, and agents are unusually good at generating more of it. The hidden bottleneck is not just model price. It is context hygiene. The teams that win with agents will not simply buy the biggest context window. They will build systems that decide what context deserves to exist, what can be compressed, what must be preserved exactly, and what should never enter the model at all. x.com/tonysimons_/status/206…

Tony Simons

@tonysimons_

20h

🚨 A Netflix engineer built an open-source proxy that cuts AI token usage by 60-95%. Zero code changes. Benchmarks show ±0.000 accuracy regression. ✨ 29.9k stars on GitHub. It sits between your app and the LLM, so every tool output, code block, and conversation history gets compressed in-flight. 🚫 No summarization, no loss. 😎 Just 60-95% fewer tokens with the same answers. Works with Claude Code, Cursor, Copilot, and any OpenAI-compatible client. One pip install, one env var, done. Netflix uses it internally. Apache 2.0. Built by Tejas Chopra. github.com/chopratejas/headr…
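To make the "repeated structure becomes a tax" point concrete, here is a minimal sketch of one lossless trick for uniform JSON arrays: send the keys once and the values as rows. This is not Headroom's algorithm, which the posts above do not describe; the function names and the sample log entries are hypothetical, and character count is used only as a rough stand-in for token count.

```python
import json

def compact_records(records):
    """Losslessly re-encode a list of same-keyed dicts as one header plus rows."""
    if not records:
        return {"columns": [], "rows": []}
    columns = list(records[0].keys())
    # Fall back to the original list if the records are not uniform.
    if any(list(r.keys()) != columns for r in records):
        return records
    return {"columns": columns, "rows": [[r[c] for c in columns] for r in records]}

def expand_records(compact):
    """Inverse of compact_records, used here to check nothing was lost."""
    if isinstance(compact, list):
        return compact
    return [dict(zip(compact["columns"], row)) for row in compact["rows"]]

# Hypothetical tool output: 200 near-identical log entries.
tool_output = [
    {"timestamp": 1700000000 + i, "level": "INFO", "service": "billing", "message": "retry ok"}
    for i in range(200)
]

# Same separators on both sides, so the saving comes only from dropping repeated keys.
before = json.dumps(tool_output, separators=(",", ":"))
after = json.dumps(compact_records(tool_output), separators=(",", ":"))
saving = 1 - len(after) / len(before)
```

The round trip through `expand_records` is the part that matters: a layer like this is only safe to put in front of a model if the exact original can be recovered when it has to be.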

Prasenjit Sarkar

@stretchcloud

Agent interoperability has a boring problem hiding underneath it: discovery. Today, if an agent needs a capability, the developer usually hardcodes an MCP server, pastes a tool URL, installs a plugin, or relies on one platform's private catalog. That works for demos. It breaks when agents need to operate across companies, clouds, identity systems and trust boundaries. Google's Agentic Resource Discovery announcement is interesting because it targets the layer before invocation. ARD is an open specification for publishing, discovering and verifying agentic capabilities across the web. The companion spec describes resources broadly: agents, MCP servers, Skills, APIs, workflows, catalog entries, or similar callable capabilities. Google says it was developed with partners across the ecosystem; GoDaddy's announcement names Cisco, Databricks, GitHub, Google, Hugging Face, Microsoft, Nvidia, Salesforce, SAP, ServiceNow and Snowflake. This is the same platform pattern we saw before. The web needed search and DNS. Cloud needed service discovery and identity. Kubernetes needed registries, labels and control planes. Agents will need a way to answer: what capability exists, who owns it, what policy governs it, and why should this agent trust it? People will read ARD as another protocol. I think the deeper read is that agent ecosystems are leaving the single-app phase. Once agents can discover resources dynamically, the scarce asset becomes trust metadata: provenance, permissions, freshness, evaluations, compliance posture and failure history. The hidden bottleneck is not calling a tool. It is deciding which tool deserves to be called. x.com/googledevs/status/2067…

Google for Developers

@googledevs

Agents are part of a massive, interconnected ecosystem. But how do they find and trust each other across different platforms? Today, we’re proud to announce the Agentic Resource Discovery (ARD), an open specification alongside industry partners (including Cisco, Databricks, GitHub, GoDaddy, Hugging Face, Microsoft, NVIDIA, Salesforce, ServiceNow, and Snowflake). ARD gives any agent a secure, decentralized way to discover and verify capabilities (like tools, skills, MCP servers, and other agents) anywhere on the web. Read the full announcement and get started: goo.gle/4a2sTWf

This architectural diagram titled "Agentic Resource Discovery" illustrates two parallel paths for an AI agent to find resources: a direct "Catalog" layer where companies self-host discovery files and a centralized "Registry" discovery service. The AI agent searches these sources, then verifies and connects to resources via A2A, MCP, or API protocols. A five-step process flow at the bottom summarizes the workflow: 1) Publish, 2) Crawl, 3) Search, 4) Verify, and 5) Connect.

Prasenjit Sarkar

@stretchcloud

The useful detail in Block's Builderbot story is not that engineers can tag an AI in Slack. It is that the interface is ordinary while the orchestration behind it is not. A lot of agent products still make the user come to the agent: open a special IDE, paste context, pick a model, manage a run. Block appears to be doing the opposite. Engineers tag Builderbot where work already starts, then the system researches, plans, changes code and routes the result back into the existing software delivery path. The reported scale is the part worth studying: 200,000 operations per day, about 1,500 merged pull requests per week, and 15% of production code changes. Even if you discount every vanity interpretation of those numbers, this is still one of the more concrete signals that agentic coding is crossing from individual productivity into organizational throughput. The mechanism is familiar from DevOps. CI/CD did not win because a pipeline was magical. It won because the workflow became repeatable, observable, permissioned and close to where engineers already worked. AI agents need the same treatment: source boundaries, repo context, review checks, rollout controls, audit trails and clear ownership when the bot is wrong. The hidden bottleneck is not whether the model can write a diff. It is whether the company can absorb automated changes without losing architectural control. That is the real enterprise agent question: not "can one engineer go faster?" but "can the org create many safe units of progress without turning the codebase into a pile of locally rational patches?" x.com/blocks/status/20672845…

Block

@blocks

We built an internal AI system called Builderbot. It coordinates agents across our entire codebase. Engineers tag it in Slack, and it researches, plans, and ships. The story so far: - 200,000 operations per day. - 1,500 pull requests merged per week. - 15% of all production code changes across Block. What used to take months now takes days. How we built it: block.xyz/inside/block-rolls…

Prasenjit Sarkar

@stretchcloud

The uncomfortable truth about AI coding environments is that the editor has become a privileged runtime. Developers now put source code, terminals, cloud credentials, package tokens, MCP servers, local files and AI agents inside one workspace. That makes a malicious editor extension much worse than a nuisance. It is not just UI chrome. It can sit next to the repo and the command line. That is why Socket Firewall blocking malicious VS Code and Open VSX extensions is a meaningful shift. Socket's own research has been tracking the GlassWorm campaign, including 72 malicious Open VSX extensions identified between January and March 2026, and another cluster of 73 sleeper/impersonation extensions later. The pattern is familiar from npm: attackers do not need to break the whole platform if they can get one trusted install path into a developer's machine. AI makes this more important, not less. The more teams lean on agentic coding, the more the editor becomes an execution surface. Extensions can influence prompts, read files, watch terminals, exfiltrate tokens, or poison the tooling around the agent. A model can be aligned and the workflow can still be compromised by the environment around it. This mirrors the move from package scanning to package firewalls. Static detection after the fact is useful, but install-time and update-time policy is where teams actually reduce blast radius. The hidden bottleneck is trust in the developer workstation. Agent security is not only prompt injection and model behavior. It is also the boring supply chain around the editor, where one bad extension can quietly inherit the keys to the build system. x.com/SocketSecurity/status/…

Socket

@SocketSecurity

🚀 Launch Week Day 3: Socket Firewall now blocks malicious code editor extensions. VS Code and Open VSX extensions run inside developer environments with access to source code, terminals, credentials, and tokens. Now teams can block bad extensions before install or update.

Prasenjit Sarkar

@stretchcloud

A lot of people still think local models are mainly about privacy or offline chat. This demo points at a more operational use case: cheap parallelism. Google is showing Gemma 4 26B running locally while orchestrating 10 parallel sub-agents to build an SVG art gallery, with the demo claiming 100 tokens/sec. The exact app is a toy, but the mechanism is not. Once a local model is fast enough, the architecture changes from "one expensive frontier call" to "many small workers doing bounded subtasks near the developer." Gemma 4's own docs make the direction clearer. The family includes E2B, E4B, 12B, 31B and 26B A4B variants, and Google calls out multi-token prediction with a dedicated draft model for speculative decoding. That is the kind of inference detail that matters for agent workloads because latency compounds. Ten agents that each pause awkwardly are worse than one good model. Ten agents that can stream quickly, share state and stay cheap become a different tool. This is similar to the cloud-to-edge shift in devtools. Not every job should run on the biggest remote machine. Some work belongs close to the repo, close to the filesystem, close to the user's feedback loop, and cheap enough to retry. People will read this as "local open models are catching up." I think the more useful read is that orchestration pressure is moving down the stack. The hidden bottleneck becomes scheduling, memory, context partitioning and verification across workers. Local agents will not win by pretending to be one frontier model. They win when they make parallel work boring, fast and inspectable. x.com/googlegemma/status/206…

Google Gemma

@googlegemma

Teamwork makes the dream work. Now running locally. Watch Gemma 4 26B orchestrate 10 parallel sub-agents to code an SVG art gallery in seconds. Hitting 100 tokens/sec, imagine how you can scale this for complex tasks or local chatbots for entire teams!!
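The "many small workers doing bounded subtasks" shape can be sketched with nothing more than asyncio. This is an illustration under stated assumptions, not the demo's code: `sub_agent` is a hypothetical stand-in for a local model call, and `verify` is a deliberately trivial check.

```python
import asyncio

async def sub_agent(task):
    # Stand-in for a local model call; a real worker would stream tokens here.
    await asyncio.sleep(0.01)
    return f'<svg data-title="{task}"></svg>'

def verify(artifact):
    # Cheap structural check before a result is accepted.
    return artifact.startswith("<svg") and artifact.endswith("</svg>")

async def orchestrate(tasks, max_parallel=10):
    # The semaphore caps how many workers run at once.
    limit = asyncio.Semaphore(max_parallel)

    async def bounded(task):
        async with limit:
            return task, await sub_agent(task)

    results = await asyncio.gather(*(bounded(t) for t in tasks))
    # Keep only artifacts that pass verification.
    return {task: artifact for task, artifact in results if verify(artifact)}

tasks = [f"tile-{i}" for i in range(10)]
gallery = asyncio.run(orchestrate(tasks))
```

Scheduling (the semaphore) and verification (the filter) take up as much of the sketch as the worker itself, which is the point about orchestration pressure moving down the stack.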

Prasenjit Sarkar

@stretchcloud

The interesting part of this example is not that a model suggested a chemistry idea. Models have been useful literature assistants for a while. The interesting part is the workflow boundary: literature review, hypothesis generation, specialized chemistry tooling, and lab validation start to look like one loop. That is a different product surface from "chat with a PDF." OpenAI is positioning GPT-5.4 as a professional-work model with 1M context, tool search, coding and computer-use strengths. In this case, the claim is narrower and more useful: GPT-5.4 worked with Molecule.one's Maria AI and a specialized lab to move a medicinal chemistry project from literature review to a validated experimental result. Builders should pay attention to the architecture, not the headline. The model did not replace the lab. It sat inside a system where domain tools, reaction planning, experimental constraints and validation could close the loop. That is much closer to how high-value scientific work actually happens: not one prompt, but a chain of scoped decisions where the cost of being wrong is measured in failed experiments, not ugly UI. This mirrors what happened in software agents. The value moved from raw completion to harnesses: repo context, tool permissions, tests, sandboxes, evals, rollback. In chemistry, the equivalent is literature grounding, synthesis feasibility, assay design, supplier/lab constraints and wet-lab feedback. The hidden bottleneck is not ideation. It is validated action. A useful science agent is not the one with the most confident suggestion. It is the one that can survive contact with instruments, protocols, costs and failed results, then update the next step. x.com/gdb/status/20673446061…

Greg Brockman

@gdb

GPT-5.4 for improving a challenging reaction in medicinal chemistry:

Prasenjit Sarkar

@stretchcloud

The most useful coding benchmark right now may not be another bug-fix leaderboard. It may be migration. Most enterprises do not have a clean “build new app from scratch” problem. They have a 14-year-old service, a 600k-line dependency graph, a rules engine nobody wants to touch, or a COBOL/Java/.NET boundary that quietly runs revenue every day. The question is not “can the model write code?” It is “can it preserve behavior while changing the implementation?” That is why Vals AI’s Code Migration benchmark is worth watching. It scores migrations by hidden behavior tests, not surface similarity. It also zeroes out wrappers, copied reference artifacts and wrong-language submissions. That changes the incentive: the model has to infer the contract, rebuild the system, and pass the behavior checks. The early numbers show how unsolved this still is. Claude Fable 5 leads overall at 55.1%, ahead of Opus 4.8 at 47.2% and GPT 5.5 at 45.2%. Fable 5 is much stronger on CLI migration at 60.1%, but COBOL-to-Java is tighter, with GPT 5.5 and Opus 4.7 at 70.0% on that split. That pattern matches how migration work feels in the field. The hard part is rarely translation alone. It is missing specs, hidden side effects, batch jobs, error semantics, data formats and decades of “business logic” encoded as accidents. People will market this as AI replacing modernization teams. I think the better read is narrower and more useful: AI is becoming a behavioral diff engine for legacy estates. The hidden bottleneck is testable intent. Once you can express the old system’s behavior well enough, migration stops being archaeology and becomes an engineering loop. x.com/ValsAI/status/20669975…

Vals AI

@ValsAI

Jun 16

We’re releasing our Code Migration benchmark, and we managed to get Fable tested in time. Code migration carries real economic weight. COBOL powers banks, payrolls, government services, and underpins nearly 95% of US ATM transactions. The danger with any migration is that a model ships code that looks right but quietly drops essential behaviors. We evaluated models in three ways: modern to modern migrations, legacy to modern migrations and on their overall code quality. Each model rebuilds the program in an offline sandbox, then is scored on a hidden behavior test with anti-cheat checks that catch anything wrapping the original, copying reference files, or staying in the source language. Fable 5 leads overall at 55%, but costs $115.43 per test, while Opus 4.8 (47%) and GPT 5.5 (45%) cost $30.51 and $6.44, respectively, making GPT 5.5 the most cost efficient model. Kimi K2.6 is the #1 open-weight model (28%) priced at $5.12, ranking above some frontier models.
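Scoring a migration by behavior rather than surface similarity can be sketched as a differential test: run the old and new implementations on the same inputs and compare what a caller would observe, including errors. This is a toy under stated assumptions, not the Vals AI harness; `legacy_split` and `migrated_split` are hypothetical functions invented for the example.

```python
def observe(fn, args):
    """Capture what a caller can see: the return value or the exception type."""
    try:
        return ("ok", fn(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)

def behavior_diff(legacy, migrated, cases):
    """Return the inputs on which the two implementations disagree."""
    return [args for args in cases if observe(legacy, args) != observe(migrated, args)]

# Hypothetical legacy rule: integer cents, truncating division, error on zero.
def legacy_split(total_cents, people):
    return total_cents // people

# Hypothetical migration that looks right but changes rounding and error behavior.
def migrated_split(total_cents, people):
    if people == 0:
        return 0
    return round(total_cents / people)

cases = [(100, 4), (200, 3), (100, 0)]
mismatches = behavior_diff(legacy_split, migrated_split, cases)
# mismatches == [(200, 3), (100, 0)]
```

The two mismatches are the kind of thing a code review tends to miss: a rounding rule and an error path that the old system encoded without ever documenting them.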

Prasenjit Sarkar

@stretchcloud

Every serious agent demo eventually hits the boring failure mode: the agent is halfway through a task, a process restarts, a tool call hangs, a websocket drops, and the “AI” problem quietly becomes a distributed systems problem. That is why this Cloudflare move is more interesting than another framework launch. They are trying to make the Agents SDK the runtime layer under different harnesses, with Flue as the first framework building on it. Their framing is useful: framework at the top, harness in the middle, runtime/platform underneath. The runtime owns durable execution, code execution, filesystem, workflows, state, storage and recovery. That sounds infrastructure-heavy because production agents are infrastructure-heavy. Cloudflare’s docs describe each agent session as having durable identity, local SQL storage, realtime connections, scheduled work and recoverable execution. The GitHub repo frames agents as persistent stateful execution environments, powered by Durable Objects, with built-in scheduling, model calls, MCP and workflows. Builders are starting to discover that “agent quality” is not just model intelligence. It is whether the agent can remember what it was doing, resume without burning tokens, hold the right websocket, run untrusted code safely, and expose enough observability to debug the loop. This mirrors the early serverless/cloud shift. At first everyone argued about functions. Then the real product surface became state, queues, retries, logs, permissions and deployment. The hidden bottleneck for agents is not the loop. It is operational continuity. A good harness thinks. A production runtime survives. x.com/Cloudflare/status/2067…

Cloudflare

@Cloudflare

The Agents SDK is now a runtime any agent framework can build on. Today we're opening up the Agents SDK primitives, with Flue as a first framework targeting Agents SDK, and rolling out agents in the dashboard. cfl.re/3Qw74Yz
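"Resume without burning tokens" comes down to checkpointing each step before moving to the next. The sketch below uses Python's built-in sqlite3 to show the idea. It is not the Agents SDK or Durable Objects API; the table layout, `run_task`, and the step names are assumptions made for illustration.

```python
import sqlite3

def run_task(conn, task_id, steps):
    """Run named steps in order, skipping any already recorded as done."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "task_id TEXT, step TEXT, result TEXT, PRIMARY KEY (task_id, step))"
    )
    executed = []
    for name, fn in steps:
        row = conn.execute(
            "SELECT result FROM checkpoints WHERE task_id = ? AND step = ?",
            (task_id, name),
        ).fetchone()
        if row is not None:
            continue  # Finished before the restart; do not redo the work.
        result = fn()
        conn.execute(
            "INSERT INTO checkpoints (task_id, step, result) VALUES (?, ?, ?)",
            (task_id, name, result),
        )
        conn.commit()  # Persist the checkpoint before moving on.
        executed.append(name)
    return executed

class Crash(Exception):
    pass

def flaky():
    raise Crash("process restarted mid-task")

conn = sqlite3.connect(":memory:")  # A file path would survive a real restart.

# First attempt dies on the second step.
try:
    run_task(conn, "task-1", [("plan", lambda: "ok"), ("edit", flaky)])
except Crash:
    pass

# Second attempt resumes: "plan" is skipped, only "edit" and "test" run.
resumed = run_task(
    conn, "task-1",
    [("plan", lambda: "ok"), ("edit", lambda: "ok"), ("test", lambda: "ok")],
)
```

If "plan" were an expensive model call, the second attempt would not pay for it again, which is the practical difference between a loop and a runtime.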

Prasenjit Sarkar

@stretchcloud

Small workflow detail, big platform signal: model choice in coding tools is moving from settings pages into extension distribution. A developer no longer has to leave the editor, paste API keys into a separate tool, then re-learn a new agent UX. VS Code is turning the model picker into a marketplace surface. The mechanism matters more than the menu item. Microsoft had already pushed BYOK into VS Code so users could attach providers like OpenRouter, Ollama, Google, OpenAI, etc. Then the Language Model Chat Provider API let providers ship through extensions instead of waiting for VS Code to hard-code every option. The latest release takes the next step: browse/install extra providers from the Marketplace and have them appear in the picker. This is the same pattern we saw in cloud and DevOps. The winning control plane is not the one with one perfect backend; it is the one where teams can route workload, policy and budget without leaving their daily operating surface. People will read this as “more models in VS Code.” I think the deeper read is “agent IDEs are becoming model marketplaces.” The hidden bottleneck shifts from access to governance: which model is allowed for repo data, which one has tool-use quality for this task, which one is cost-capped, and how the team evaluates regressions. Model choice is becoming table stakes. The durable product value is the harness, policy layer and feedback loop around it. x.com/code/status/2067336917…

Visual Studio Code

@code

🚀 The latest @code release just expanded your AI model options. Discover and install extra providers from the Marketplace right inside the editor. Just browse, install, and your new models show up in the picker!

Prasenjit Sarkar

@stretchcloud

Anyone who has built a research agent knows the awkward truth: the model is rarely the only problem. The web is. A good answer depends on search quality, page fetching, deduping, source selection, freshness, synthesis, citation discipline, and knowing when to keep digging. Most teams start by gluing a search API to an LLM, then discover that “research” is really a multi-step retrieval workflow with failure modes everywhere. That is why Exa Agent is interesting. Exa describes it as a single API for frontier web research, combining top language models with Exa’s web search tools. Their LangChain case study says Exa built a production multi-agent research system that processes hundreds of research queries daily, autonomously exploring the web until it finds structured information users need. This follows the same pattern we saw in observability and payments: teams first compose primitives themselves, then a product appears that packages the common workflow into one reliable surface. The value is not that developers could not build it. The value is that they do not want to maintain every edge case while trying to ship their own product. People will perceive this as “deep research via API,” but the real wedge is operational reliability: fewer brittle browser hacks, better source selection, cheaper repeated research, and a cleaner contract for agents that need live web knowledge. The hidden bottleneck is retrieval operations. Research agents are not just reasoning systems. They are search systems with taste, memory, and accountability. x.com/ExaDevelopers/status/2…

Exa Developers

@ExaDevelopers

Introducing Exa Agent: frontier web research built for developers! Now available in the API.

Prasenjit Sarkar

@stretchcloud

A quiet lesson from production AI: the expensive model is often doing the wrong job. LangChain’s example is useful because it is not abstract benchmark theater. LangSmith processes billions of tokens a day across production traces. They needed to classify perceived errors in those traces, so they partnered with Fireworks and fine-tuned a Qwen model with LoRA/SFT. The result: a trace judge that matched or exceeded frontier model performance for that narrow job and ran up to 100x cheaper. This is how mature AI stacks will look. You do not send every workflow to the largest general model forever. You use a frontier model to explore the task, collect labels, understand failure modes, then distill the repeatable part into a smaller specialized model. We have seen this before in cloud infrastructure. Early systems overused the most flexible primitive because it was easy. Then cost pressure created queues, caches, indexes, tiered storage, autoscaling, and specialized workers. AI inference is going through the same normalization. People still perceive “open model vs frontier model” as a single leaderboard race. In production, the more important question is: which task is stable enough to specialize? The hidden bottleneck is workflow economics. If a system runs billions of tokens through the same judgment path, prompt quality is not enough. The architecture has to learn where general intelligence should stop and specialized inference should take over. Fine-tuning is becoming less about model bragging rights and more about cost-shaped product design. x.com/LangChain/status/20673…

LangChain

@LangChain

Fine-tuning open models can exceed or match frontier models. 📦Base @Alibaba_Qwen out of the box w/ good prompting: Strong for perceived error classification, trailed frontier model performance. 🔧With a LoRA SFT job: Both models came close to or above frontier performance.

Prasenjit Sarkar

@stretchcloud

The interesting behavior here is not “generate a UI from a prompt.” Builders have been doing that all year. The shift is the handoff. A founder sketches a product in Claude Design, gets something visually coherent, then sends it into Replit to turn it into a working app. Replit’s own write-up frames it as design in Claude, build in Replit: design with natural language, then continue building, refining, and shipping without copy-pasting the artifact into another tool. That sounds small until you remember where most AI app builders break. The prototype looks good in the chat window, but then the user has to move it into a real project, wire state, add auth, fix deploy errors, and preserve context across tools. Every context switch is a chance for the agent to lose intent. This is the same pattern cloud platforms learned years ago: the winning product is not just the editor, the runtime, or the deploy button. It is the path between them. People are already treating Claude, Replit, Lovable, Cursor, and similar tools as a loose assembly line: ideate in one place, generate UI in another, harden code elsewhere, deploy from a platform. The platforms now want to own more of that chain. The hidden bottleneck is continuity. If the design intent, code context, deployment target, and feedback loop travel together, AI app building becomes less like copy-paste prototyping and more like a real product workflow. x.com/Replit/status/20673285…

Replit ⠕

@Replit

Design in Claude. Build in Replit. You can now send your design from Claude Design to Replit to turn it into a working app.

Prasenjit Sarkar

@stretchcloud

The useful thing about this leaderboard is not the rank itself. It is the kind of eval pressure it creates. Most coding model launches still lead with static benchmarks. Kimi K2.7 Code has those too: Moonshot says it improved over K2.6 by 21.8% on Kimi Code Bench v2, 11.0% on Program Bench, 31.5% on MLS Bench Lite, with about 30% lower reasoning-token usage. Useful numbers, but they still mostly describe controlled tasks. Agent Arena is trying to measure a messier thing: models doing long-horizon work for real users with tools, filesystems, web search, iteration, failures, and recovery. In that world, a model can be good at code generation and still lose points on steerability, bash recovery, tool hallucination, or complaint rate. That is how builders actually experience agents. Nobody says “the benchmark was elegant” when the agent edits the wrong file, ignores a constraint, or needs three nudges to recover from a broken command. The pattern is the same shift we saw in cloud and DevOps tooling: synthetic benchmarks matter early, but production trust comes from operational behavior under messy workloads. The hidden bottleneck is not intelligence. It is reliability under tool use. For agentic coding, the best evals will look less like exams and more like telemetry from real work. x.com/arena/status/206732435…

Arena.ai

@arena

Kimi K2.7 Code by @Kimi_Moonshot ranks #19 overall on the new Agent Arena leaderboard, and #6 among open models. In Agent Arena, we measure models on millions of real-world, long-horizon agentic tasks from a global community of users. Models can access web search, filesystem, and terminal tools to complete complex workflows. The leaderboard measures model performance on outcomes relative to the average model using a causal tracing methodology. Kimi K2.7 Code's strongest signal is confirmed task success, while bash capabilities and tool hallucination hold stable. The tradeoff is steerability, which regresses sharply compared to K2.6 (-12.25% vs. -2.82%). Note the wide confidence intervals, since the scores are still stabilizing. See thread for details on how Kimi K2.7 Code performs across 5 different signals.

Prasenjit Sarkar

@stretchcloud

A lot of developers are still using coding agents like an autocomplete with a bigger mouth: prompt, wait, inspect, prompt again. The more interesting shift is what Cline is pointing at here: move the human out of the “please check this” loop and put checks into the system. A pre-commit hook that asks an agent to look for leaked keys, P0 bugs, or obvious regressions is a small example, but it changes the job. You stop being the person who remembers to ask. You become the person who designs the loop. Addy Osmani framed this as loop engineering: replace yourself as the prompt operator, define the recursive goal, and let the agent iterate with guardrails. Cline’s own SDK/docs point in the same direction with checkpoints, MCP, cron jobs, subagents, and CLI/IDE surfaces. That is not just “better prompting.” It is workflow design. We have seen this movie in DevOps. Manual deploy checklists became CI pipelines. Human code style comments became linters. Runbooks became automation. The useful parts survived because they were boring, repeatable, and reviewable. The hidden bottleneck is verification debt. If the loop can act but cannot prove what it checked, it just creates faster uncertainty. The next durable agent products will not be chat windows. They will be loops with logs, gates, rollback, and a clear owner. x.com/cline/status/206732147…

Cline

@cline

Here's a practical way to start "loop engineering" (a fancy way to say something other than a human prompting an agent to do some work): use a git hook script to automatically review your code for leaked keys, P0 bugs, etc. before committing.
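A minimal sketch of the hook idea above, in Python. It is not Cline's script: a few illustrative regexes stand in for the agent review, and the names `LEAK_PATTERNS`, `find_leaks`, `staged_diff`, and `main` are hypothetical. It only inspects lines added in the staged diff.

```python
import re
import subprocess

# Illustrative patterns only (assumption): a real hook would call an agent
# or a maintained secret scanner instead of three hand-written regexes.
LEAK_PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key_block": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "hardcoded_secret": re.compile(
        r"(?i)(?:api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{16,}['\"]"
    ),
}


def find_leaks(diff_text):
    """Return (pattern_name, line) pairs for added lines that look like secrets."""
    findings = []
    for line in diff_text.splitlines():
        # Only added lines matter; skip the "+++ b/file" header lines.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        for name, pattern in LEAK_PATTERNS.items():
            if pattern.search(line):
                findings.append((name, line[1:].strip()))
    return findings


def staged_diff():
    """Read the staged changes that the commit is about to record."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--unified=0"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def main():
    """Print findings and return a non-zero status so git aborts the commit."""
    findings = find_leaks(staged_diff())
    for name, line in findings:
        print(f"possible {name}: {line}")
    return 1 if findings else 0
```

To try it, save the file as `.git/hooks/pre-commit`, make it executable, and end it with `raise SystemExit(main())`. The printed findings double as a log of what the check looked at, which is the part a human reviewer can audit later.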

Prasenjit Sarkar

@stretchcloud

Every team that builds an agent over a real codebase eventually has the same humbling moment: the model sounds smart, then misses the one exact line that matters. That is why the “vector DB vs grep” debate is the wrong frame. Dense semantic search is excellent for the first pass: “find the part of the system related to billing retries.” But when the agent needs the exact env var name, the test that failed last week, or the callsite hidden behind a wrapper, old-fashioned grep and file reads often beat embeddings. This pattern has shown up before. Web search did not replace keywords with semantics; it blended lexical matching, link signals, ranking, freshness, and later neural rerankers. Enterprise RAG is moving the same way. Qdrant and LlamaIndex both support hybrid retrieval patterns because dense vectors and sparse/keyword search fail in different ways. The practical agent architecture is becoming: semantic search to orient, grep to verify, file navigation to preserve structure, and reranking to avoid drowning the model in plausible-but-wrong chunks. The hidden bottleneck is retrieval fidelity. Agents do not need more confident summaries of the wrong context. They need the right evidence, in the right granularity, at the right moment. For agentic engineering, hybrid retrieval is not a compromise. It is the production shape. x.com/llama_index/status/206…

LlamaIndex 🦙

@llama_index

Vector databases or pure grep? Teams are split on the right retrieval architecture for agents. The reality? You need both. Semantic search for a fast first pass; grep and file reads for surgical precision when top-k chunks cut off mid-answer. On June 29, our Head of Engineering George He goes under the hood on the architecture decisions and dead ends behind building this harness into LlamaParse Index. Register here: landing.llamaindex.ai/retrie…
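The "semantic search to orient, grep to verify" pattern can be sketched in a few lines of Python. Everything here is an assumption for illustration: a bag-of-words cosine score stands in for a real embedding model, the corpus is an in-memory dict, and `orient`, `verify`, and `FILES` are hypothetical names, not LlamaIndex or Qdrant APIs.

```python
import math
import re
from collections import Counter


def _vec(text):
    """Toy stand-in for an embedding: lowercase alphanumeric token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def orient(query, files, top_k=2):
    """First pass: rank files by fuzzy similarity to the query."""
    q = _vec(query)
    ranked = sorted(files, key=lambda p: _cosine(q, _vec(files[p])), reverse=True)
    return ranked[:top_k]


def verify(pattern, files, paths):
    """Second pass: exact regex match, returning (path, line number, line)."""
    rx = re.compile(pattern)
    hits = []
    for path in paths:
        for lineno, line in enumerate(files[path].splitlines(), start=1):
            if rx.search(line):
                hits.append((path, lineno, line.strip()))
    return hits


# Hypothetical three-file codebase.
FILES = {
    "billing/retry.py": (
        "def retry_charge(invoice):\n"
        "    limit = os.environ['BILLING_RETRY_MAX']\n"
        "    return limit\n"
    ),
    "billing/invoice.py": "def build_invoice(customer):\n    return {'customer': customer}\n",
    "auth/login.py": "def login(user, password):\n    return check(user, password)\n",
}

candidates = orient("billing retry charge limit", FILES)
hits = verify(r"BILLING_RETRY_\w+", FILES, candidates)
```

The first pass narrows the search to the files about billing retries; the second returns the exact env var name with its file and line number, which is the evidence an agent can cite instead of a paraphrase.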

Prasenjit Sarkar

@stretchcloud

The important part of LCLMs is not just “long context got faster.” It is that context may need a compute-native representation, not only more tokens. The paper introduces Latent Context Language Models: an encoder compresses long input into latent representations, and a decoder reads those latents instead of the original token stream. The reported setup uses a small 0.6B encoder with a 4B decoder, trained on 350B tokens, and claims up to 8.8x faster long-context inference without losing accuracy. That targets the real bottleneck in long-context systems: KV cache and attention cost. We keep treating context windows like infinite storage, then wonder why latency and memory explode when agents drag around logs, docs, diffs, browser traces, and stale conversation history. The hidden bottleneck is deciding what should remain expanded and what can become compressed working memory. This is where agent infrastructure is heading: not simply larger windows, but layered memory where raw text, summaries, embeddings, and latent context all serve different jobs. x.com/artemg314/status/20672…

Artem Gazizov

@artemg314

🧵 We made long-context LLMs up to 8.8× faster without losing accuracy. Meet LCLMs: Latent Context Language Models. A small 0.6B encoder compresses a long context into a latent vector representation that a 4B decoder reads in place of the original tokens. Trained on 350B tokens, at compression up to 16×. New paper below 👇 👇 arxiv.org/pdf/2606.09659
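A rough back-of-the-envelope for why compressing context attacks the KV cache, sketched in Python. The 16× compression factor comes from the post above; the decoder shape (`LAYERS`, `KV_HEADS`, `HEAD_DIM`), the 128K-token context, and the idea that each latent occupies one decoder position are all assumptions for illustration, not figures from the paper.

```python
def kv_cache_bytes(positions, layers, kv_heads, head_dim, bytes_per_value=2):
    """Standard KV cache estimate: keys and values for every layer and position."""
    return 2 * layers * kv_heads * head_dim * positions * bytes_per_value


# Hypothetical decoder shape (assumption; the paper's config is not given here).
LAYERS, KV_HEADS, HEAD_DIM = 36, 8, 128

context_tokens = 131_072                           # assumed 128K-token input
compression = 16                                   # "up to 16x" from the post
latent_positions = context_tokens // compression   # assumes one position per latent

raw = kv_cache_bytes(context_tokens, LAYERS, KV_HEADS, HEAD_DIM)
compressed = kv_cache_bytes(latent_positions, LAYERS, KV_HEADS, HEAD_DIM)

print(f"raw tokens:   {raw / 2**30:.3f} GiB")
print(f"latents only: {compressed / 2**30:.3f} GiB")
```

Under these assumptions the cache shrinks by exactly the compression factor, since it is linear in the number of positions the decoder attends over. The reported 8.8× speedup is the paper's end-to-end figure and is not derived from this arithmetic.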

Prasenjit Sarkar

@stretchcloud

The interesting part of AWS Transform is the shift from “modernization project” to “modernization loop.” AWS says Transform now offers continuous code modernization in preview, aimed at outdated dependencies, security vulnerabilities, and AI-readiness gaps that accumulate while teams ship product work. That framing matters. Most enterprise tech debt does not become painful because nobody can write the migration. It becomes painful because the work is episodic, politically expensive, and disconnected from the release path. Putting modernization into CI/CD changes the operating model. The agent is not a one-off consultant generating a giant migration branch. It becomes a background system that can inspect repos, prioritize changes, create pull requests, and keep debt from re-forming immediately after the big cleanup. The hidden bottleneck is review trust. If the codebase is large and business-specific, the value is not “AI changed files.” The value is traceable, small, policy-aware changes that reviewers can accept without redoing the whole analysis. Modernization agents will win when they make maintenance continuous and reviewable. x.com/awscloud/status/206727…

Amazon Web Services

@awscloud

AWS Transform now offers continuous code modernization in preview. Code bases accumulate outdated dependencies, security vulnerabilities, and AI readiness gaps over time, especially when modernization only happens episodically. AWS Transform moves this work into your CI/CD pipeline, so remediation happens autonomously at every commit, reducing operational maintenance costs by up to 30%. go.aws/4a51x1E

Prasenjit Sarkar

@stretchcloud

The useful read on ARD is that agent ecosystems are hitting the same problem APIs hit years ago: discovery becomes infrastructure. Microsoft describes Agentic Resource Discovery as an open spec for publishing, indexing, and discovering AI capabilities. That sounds dry, but it is the layer agents need once the world has thousands of tools, MCP servers, skills, APIs, and internal workflows that all claim to help with a task. Right now, capability selection is still too manual. Developers wire tools into one app. Vendors build their own registries. Enterprises hide useful actions behind permission systems and tribal knowledge. The model can plan, but it often cannot reliably know what is available, who owns it, what permissions are needed, or whether a capability is safe to invoke. The hidden bottleneck is not tool calling. It is trusted tool discovery. If ARD works, the agent does not just ask “what should I do next?” It can ask “what verified capability exists for this job, in this environment, under this identity?” That is a much more production-shaped question. x.com/msdev/status/206728615…

Microsoft Developer

@msdev

Today's challenge is not just creating AI capabilities, it's finding them. We're introducing the Agentic Resource Discovery (ARD) specification, an open spec that establishes a secure common layer for publishing, indexing and discovering AI capabilities. Created by Microsoft, Google, Hugging Face and many more industry collaborators, it's available today to everyone.

Logos of various companies listed as contributors to the Agentic Resource Discovery specification: Cisco, Databricks, GitHub, GoDaddy, Google, Hugging Face, Microsoft, NVIDIA, Salesforce, ServiceNow, Snowflake.
