How about orchestrating a codebase on 5GB VRAM using a local Qwen3.5-35B-A3B (~25-30 tokens/sec through llama.cpp, 65k context, remaining layers offloaded to system RAM)?
Even better, how about two simultaneous agent instances running comfortably at ~15-20 t/s, in a tool that natively supports both thinking and non-thinking models (including Gemma 4) and can also be pointed at heavy-compute cloud endpoints for complex architectural tasks?
If this sounds too good to be true, please keep reading.
Running local LLMs often feels like a downgrade from premium cloud subscriptions, but the real constraint is not just model quality; it is also systems design.
Context windows are finite, and simply increasing token capacity does not eliminate the need to control what the model sees.
In practice, larger contexts frequently introduce more noise, more drift, and weaker reasoning when that context is not actively curated.
What local coding agents need is not a bigger monolithic chat loop, but a better execution architecture: a lightweight terminal environment that separates planning from implementation.
The primary orchestrator should operate like a lead architect. It should inspect the codebase, build a concrete implementation plan, decompose work into atomic tasks, and dispatch those tasks to short-lived subagents with tightly scoped, isolated contexts.
Each coding subagent should execute one bounded change, return a compact summary of the result, and terminate.
That keeps the planner’s context clean, prevents edit history from ballooning, and avoids the gradual degradation you get when every action is forced through one ever-expanding conversation.
The result is a system that behaves less like a confused chatbot and more like a disciplined engineering team with clear task boundaries and fast feedback loops.
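To make the planner/subagent split concrete, here is a minimal Python sketch of the pattern described above. It is not Late's actual code: the names `Task`, `Orchestrator`, `run_subagent`, and `ModelFn` are hypothetical, the model is assumed to be any function from a prompt string to a completion string, and the 200-character summary cap is an arbitrary illustrative choice.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumption: the model is any callable from prompt text to completion text.
ModelFn = Callable[[str], str]


@dataclass
class Task:
    task_id: int
    instruction: str   # one bounded change
    files: list[str]   # the only files this subagent is allowed to see


@dataclass
class Orchestrator:
    model: ModelFn
    # The planner keeps only compact summaries, never full subagent transcripts.
    summaries: list[str] = field(default_factory=list)

    def run_subagent(self, task: Task, repo: dict[str, str]) -> str:
        # Build a fresh, isolated context: the instruction plus the scoped files.
        context = "\n\n".join(f"# {path}\n{repo[path]}" for path in task.files)
        prompt = f"Task: {task.instruction}\n\n{context}"
        result = self.model(prompt)
        # Return a short summary; the subagent's prompt and output are discarded.
        return f"[task {task.task_id}] {result[:200]}"

    def execute(self, plan: list[Task], repo: dict[str, str]) -> list[str]:
        # Dispatch tasks one at a time; each gets its own context.
        for task in plan:
            self.summaries.append(self.run_subagent(task, repo))
        return self.summaries
```

The point of the design is that the planner's state grows by one short line per task, regardless of how much code each subagent had to read.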
That is the idea behind Late.
Late is a deterministic coding-agent orchestrator built to make local LLMs viable for serious agentic software development.
Instead of dumping an entire repository into a single context window and hoping the model stays coherent, it maps the codebase, maintains a high-level control plane, and spawns ephemeral execution agents to perform precise, exact-match code edits.
By mirroring the structure of a real engineering organization, Late reduces token bloat, limits context pollution, and improves reliability in long-running coding workflows.
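One common way to read "exact-match code edits" is a search-and-replace that is rejected unless the target text appears exactly once. The sketch below illustrates that idea only; the function name `apply_exact_edit` is hypothetical and the post does not say how Late implements its edits.

```python
def apply_exact_edit(source: str, old: str, new: str) -> str:
    """Replace `old` with `new` only if `old` occurs exactly once in `source`."""
    if not old:
        raise ValueError("old text must be non-empty")
    count = source.count(old)  # counts non-overlapping occurrences
    if count == 0:
        raise ValueError("old text not found; edit rejected")
    if count > 1:
        raise ValueError(f"old text is ambiguous ({count} matches); edit rejected")
    return source.replace(old, new, 1)
```

Rejecting missing or ambiguous matches means a subagent that misremembers the code fails loudly instead of silently editing the wrong place.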
agentnativedev.medium.com/ou…