I benchmarked a new extraction harness on a private eval dataset for lerim-cli (the new version, v0.1.83, is out now), and the main lesson was very clear: if you want smaller models to work well, stop asking the model to do everything and do more of the engineering work yourself.
Before, the agent was closer to a single-pass PydanticAI setup, where the model had to read a large trace, understand what matters, decide what is durable memory, call tools correctly, stay inside the context window, and output clean structured records.
That puts too much burden on the model, especially when you want to use smaller or cheaper models.
The new harness is BAML (@boundaryML) + LangGraph (@LangChain).
The graph now does more of the deterministic work:
- read the trace in windows
- ask the model to scan one window at a time
- keep compact findings instead of the whole trace
- synthesize memory records only at the end
- validate/retry typed BAML outputs
- persist with normal code, not model improvisation
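A minimal sketch of the windowed flow in the list above, in plain Python rather than the actual LangGraph graph. The names `iter_windows`, `extract`, `scan_window`, and `synthesize`, plus the window size and overlap, are hypothetical and chosen only for illustration; in the real harness the two callables would be model calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass
class Finding:
    """A compact note kept in place of the raw window text."""
    window_index: int
    note: str


def iter_windows(lines: list[str], size: int, overlap: int = 0) -> Iterator[tuple[int, list[str]]]:
    """Yield (index, window) pairs covering the trace in fixed-size slices."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    start, index = 0, 0
    while start < len(lines):
        yield index, lines[start:start + size]
        if start + size >= len(lines):
            break  # this window already reached the end of the trace
        start += size - overlap
        index += 1


def extract(
    lines: list[str],
    scan_window: Callable[[list[str]], list[str]],  # stand-in for the per-window model call
    synthesize: Callable[[list[Finding]], list[str]],  # stand-in for the final model call
    size: int = 200,
    overlap: int = 20,
) -> list[str]:
    """Scan one window at a time, keep only findings, synthesize once at the end."""
    findings: list[Finding] = []
    for index, window in iter_windows(lines, size, overlap):
        findings.extend(Finding(index, note) for note in scan_window(window))
    return synthesize(findings)
```

The point of the shape is that the model never sees more than one window plus the compact findings, so context size is bounded by code, not by the model's discipline.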
So the model is not the whole agent anymore. It is one reasoning component inside a more engineered system.
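To make the "one component" idea concrete, here is a rough sketch of a validate-and-retry wrapper around a model call. This is a generic stdlib stand-in, not BAML's API: BAML handles typed output parsing itself, and `MemoryRecord`, `parse_record`, and `call_with_retry` are hypothetical names with an assumed two-field schema.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class MemoryRecord:
    title: str
    body: str


def parse_record(raw: str) -> MemoryRecord:
    """Parse and validate one record; raise ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    title, body = data.get("title"), data.get("body")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("missing or empty 'title'")
    if not isinstance(body, str) or not body.strip():
        raise ValueError("missing or empty 'body'")
    return MemoryRecord(title.strip(), body.strip())


def call_with_retry(model: Callable[[str], str], prompt: str, max_attempts: int = 3) -> MemoryRecord:
    """Call the model, validate its output in code, and retry with the error as feedback."""
    last_error: Exception | None = None
    for _ in range(max_attempts):
        full_prompt = prompt if last_error is None else (
            f"{prompt}\n\nPrevious output was invalid: {last_error}"
        )
        try:
            return parse_record(model(full_prompt))
        except ValueError as error:
            last_error = error
    raise RuntimeError(f"no valid record after {max_attempts} attempts: {last_error}")
```

The model only produces text; deciding whether that text is acceptable, and what to do when it is not, stays in ordinary code.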
On the private benchmark, using the same MiniMax M2.7 model, the new harness completed all cases while the old harness had multiple failures from tool retries and context window issues.
- Task completion: BAML LangGraph completed 100.0% vs PydanticAI at 72.73%, a 27.27 point lead.
- Case failures: BAML LangGraph had 0 failures vs PydanticAI with 6, meaning 6 fewer failures.
- Episode count rate: BAML LangGraph reached 100.0% vs PydanticAI at 81.25%, an 18.75 point lead.
- Record budget rate: BAML LangGraph reached 46.88% vs PydanticAI at 28.12%, an 18.76 point lead.
- Concept recall average: BAML LangGraph scored 0.428 vs PydanticAI at 0.2598, a 0.1682 improvement.
- Quality average: BAML LangGraph scored 0.3352 vs PydanticAI at 0.318, a 0.0172 improvement.
- Tool call errors average: BAML LangGraph had 0.0625 vs PydanticAI at 1.9688, a 1.9063 reduction.
Quality is not solved yet. It is only slightly better overall and still needs better pruning before persistence. But robustness improved a lot.
This is the direction I think specialized agents should go: smaller models, more deterministic scaffolding, less magical thinking about one giant prompt doing the whole job.
Next step is to make this work well with models people can run locally.
A new version of lerim-cli is now released with the extract agent refactored to use LangGraph + BAML. The other agents will be refactored in upcoming releases.
github.com/lerim-dev/lerim-c…