slime v0.3.0: Built for the Agent Era
🌟 Insights from Zhihu contributor @朱小霖
@zhuzilinallen
There's little doubt that OpenClaw and Opus have kicked open the door to the Agent era.
slime's architecture, which pairs a server-based inference engine with custom rollouts, was built with this direction in mind. But as Agents become real-world workloads, it's clear that an RL framework needs more than basic Agent support: it needs better inference orchestration, long-horizon training, environment integration, and maintainable engineering practices.
That's exactly what slime v0.3.0 is about:
🔗
github.com/THUDM/slime/relea…
🚀 Agent-Native Infrastructure
The biggest risk for an RL framework isn't lacking features—it's chasing a new trend by piling temporary fixes onto an existing design. Agent training makes this especially tempting.
Instead of treating Agent support as one giant feature, we break it down into a series of infrastructure problems and solve them one by one.
🎱 As in snooker, you don't clear the table with a single shot; you gradually build a better position. Many of the updates in slime v0.3.0 follow this philosophy.
At its core, an RL framework is still about two things: inference and training. Let's look at them separately.
Faster & More Flexible Inference
Agent workloads dramatically increase token consumption and put much higher demands on serving systems. Two requirements stand out:
• Fast rollouts for long-horizon, multi-turn, tool-heavy tasks
• Production-like inference configurations so models can transition naturally from pretraining/SFT into deployment
To support this, slime expanded SGLang deployment with YAML-based multi-server configurations, allowing users to build composable server/router topologies instead of relying on a single inference setup.
📖 Docs:
thudm.github.io/slime/advanc…
Many users now use slime as a launcher for complex SGLang clusters, which suggests people need more than an RL framework—they need a reliable infrastructure entry point.
We also improved --debug-rollout-only, making rollout-only and serving-only deployments much closer to production environments by cleanly separating inference and training resources.
Another trend we've observed: multi-turn interactions and tool usage significantly increase prefill pressure. Cache hit rates and memory capacity now directly impact rollout throughput.
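To make the prefill pressure concrete, here is a minimal back-of-the-envelope sketch. It assumes an idealized prefix cache that never evicts, and the episode shape (prompt size, tool-result size, turn count) is hypothetical rather than measured on slime or SGLang.

```python
def prefill_tokens(turn_lengths, prefix_cache=True):
    """Count tokens that must be prefilled over a multi-turn episode.

    turn_lengths: list of (new_input_tokens, generated_tokens) per turn.
    Without a prefix cache, every turn re-prefills the whole history.
    With an ideal prefix cache, only the newly appended input is prefilled.
    """
    total = 0
    history = 0  # tokens already in the conversation before this turn
    for new_input, generated in turn_lengths:
        if prefix_cache:
            total += new_input
        else:
            total += history + new_input
        history += new_input + generated
    return total


# Hypothetical episode: a 2,000-token initial prompt, then 9 tool turns that
# each append a 500-token tool result; every turn generates 200 tokens.
episode = [(2000, 200)] + [(500, 200)] * 9

cold = prefill_tokens(episode, prefix_cache=False)
warm = prefill_tokens(episode, prefix_cache=True)
print(cold, warm, round(1 - warm / cold, 3))  # prints: 51500 6500 0.874
```

In this toy episode an always-hit cache removes about 87% of the prefill work, which is why cache hit rate and the memory available for cached prefixes show up directly in rollout throughput.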
Inspired by optimizations from the Miles team:
🔗
github.com/radixark/miles/pu…
slime no longer offloads fp32 gradients and bf16 parameters in integrated training-serving workloads. This saves memory equal to roughly 6× the parameter count in bytes (4 bytes per fp32 gradient plus 2 bytes per bf16 parameter) and improves rollout speed for Agent tasks.
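One way to read the "6×" figure is as bytes per parameter: an fp32 gradient takes 4 bytes and a bf16 parameter takes 2. The sketch below works that arithmetic through under this assumption. The model sizes are hypothetical, and the totals are for the whole model before any sharding across GPUs.

```python
BYTES_FP32 = 4  # one fp32 gradient element
BYTES_BF16 = 2  # one bf16 parameter element


def grad_param_bytes(num_params):
    """Bytes held by one fp32 gradient copy plus one bf16 parameter copy."""
    return num_params * (BYTES_FP32 + BYTES_BF16)


def to_gib(num_bytes):
    return num_bytes / 2**30


# Hypothetical model sizes, not taken from the release notes.
for billions in (8, 32):
    n = billions * 10**9
    print(f"{billions}B params -> {to_gib(grad_param_bytes(n)):.1f} GiB")
# prints:
# 8B params -> 44.7 GiB
# 32B params -> 178.8 GiB
```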
🧠 Training for Long-Horizon Agents
On the training side, the focus shifts from infrastructure to algorithm design.
slime v0.3.0 adds support for compact and subagent workflows, where one prompt can generate multiple training samples.
Previously, frameworks often had to either:
• discard samples, wasting rollout data; or
• pad batches, increasing compute and memory costs.
Now, batch sizes can adapt dynamically to rollout results, eliminating both compromises while preserving proper normalization across related samples.
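A minimal sketch of what "proper normalization across related samples" could look like when one prompt fans out into a variable number of samples. The weighting scheme and the function names are assumptions for illustration, not slime's actual implementation.

```python
from collections import defaultdict


def per_sample_weights(prompt_ids):
    """Weight each sample so every prompt contributes equally to the loss.

    prompt_ids[i] is the id of the prompt that produced sample i. A prompt
    that fanned out into k samples gives each of them weight 1 / (k * P),
    where P is the number of distinct prompts, so the weights sum to 1.
    """
    counts = defaultdict(int)
    for pid in prompt_ids:
        counts[pid] += 1
    num_prompts = len(counts)
    return [1.0 / (counts[pid] * num_prompts) for pid in prompt_ids]


def batch_loss(sample_losses, prompt_ids):
    """Weighted sum of per-sample losses; the batch may have any length."""
    weights = per_sample_weights(prompt_ids)
    return sum(w * l for w, l in zip(weights, sample_losses))


# Prompt "a" produced 3 samples (e.g. via subagents); prompt "b" produced 1.
prompt_ids = ["a", "a", "a", "b"]
losses = [0.3, 0.6, 0.9, 2.0]
print(batch_loss(losses, prompt_ids))
```

With this weighting, no sample is dropped and no padding is added, yet a prompt that happens to spawn many samples does not dominate the gradient.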
Long-horizon tasks are also driving renewed interest in reward shaping, value functions, and PPO-style algorithms.
To support this, slime rebuilt its PPO implementation so that actor and critic always share GPU resources, allowing users to move from GRPO to PPO without allocating an entirely separate GPU cluster.
It also supports independent Megatron configurations for actor and critic.
📖 Docs:
thudm.github.io/slime/advanc…
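For readers weighing GRPO against PPO, the sketch below contrasts the two advantage estimates in their textbook form: GRPO normalizes rewards within a group of samples for the same prompt, while PPO uses a critic's per-step values through Generalized Advantage Estimation (GAE). These are generic formulations written for illustration, not slime's code, and the default `gamma` and `lam` values are arbitrary.

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / std over one prompt's samples."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] is the reward at step t and values[t] is the critic's
    estimate at step t. The value after the final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# GRPO needs only final rewards for a group of samples from one prompt.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))

# GAE needs a critic value at every step, which is why PPO adds a critic model.
print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5]))
```

The per-step credit assignment in GAE is what makes a critic attractive for long-horizon tasks with sparse rewards, at the cost of training and hosting a second model.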
Meanwhile, as Agent rollouts grow longer, more teams are adopting asynchronous training. In v0.3.0, fully async training has been promoted from an experimental example to a first-class feature, sharing the same interface as partial-rollout async workflows.
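The core idea of fully asynchronous training can be sketched with a bounded queue between a rollout producer and a trainer consumer. This is a toy illustration built on Python's standard `asyncio`; the function names are hypothetical and it does not reflect slime's actual interface.

```python
import asyncio


async def rollout_worker(queue, num_samples):
    """Produce samples one by one; None marks the end of the stream."""
    for i in range(num_samples):
        await asyncio.sleep(0)  # stand-in for a long agent rollout
        await queue.put(i)      # blocks when the trainer falls behind
    await queue.put(None)


async def trainer(queue, batch_size):
    """Consume samples as they arrive and group them into training batches."""
    batches, batch = [], []
    while True:
        item = await queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            batches.append(batch)  # a real trainer would run a step here
            batch = []
    if batch:
        batches.append(batch)
    return batches


async def main():
    queue = asyncio.Queue(maxsize=4)  # bounds how far rollouts can run ahead
    producer = asyncio.create_task(rollout_worker(queue, 10))
    batches = await trainer(queue, batch_size=4)
    await producer
    return batches


batches = asyncio.run(main())
print(batches)  # prints: [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The queue's `maxsize` is the knob that trades throughput against staleness: a larger bound keeps inference busy but lets samples drift further from the current policy.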
🤖 slime/agent: Solidifying Common Agent Components
While slime still encourages users to build their own custom harnesses, we've found that some Agent components are common enough to standardize.
That's why v0.3.0 introduces slime/agent/, including utilities like:
• trajectory merging
• OpenAI/Anthropic request interception
• reusable Agent tooling
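As an illustration of what trajectory merging involves, the sketch below folds a sequence of intercepted requests into one token sequence with a loss mask. It assumes each request's prompt extends the previous context exactly. The function name and data layout are hypothetical, not the actual slime/agent API.

```python
def merge_turns(turns):
    """Merge per-request (prompt, response) token lists into one sequence.

    Assumes each turn's prompt starts with everything seen so far (previous
    prompt plus previous response); anything beyond that prefix is new
    environment or tool input. Returns (tokens, loss_mask), where the mask
    is 1 only on tokens the model generated.
    """
    tokens, mask = [], []
    for prompt, response in turns:
        if prompt[:len(tokens)] != tokens:
            raise ValueError("turn does not extend the previous context")
        new_input = prompt[len(tokens):]
        tokens += new_input
        mask += [0] * len(new_input)
        tokens += response
        mask += [1] * len(response)
    return tokens, mask


# Two intercepted requests: the second prompt repeats the first exchange
# and appends a tool result (tokens 6 and 7).
turns = [
    ([1, 2, 3], [4, 5]),
    ([1, 2, 3, 4, 5, 6, 7], [8]),
]
tokens, mask = merge_turns(turns)
print(tokens)  # prints: [1, 2, 3, 4, 5, 6, 7, 8]
print(mask)    # prints: [0, 0, 0, 1, 1, 0, 0, 1]
```

Merging this way yields one training sample per trajectory instead of one per request, and the mask keeps tool output and user input out of the policy loss.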
We also released a complete Coding Agent RL example:
🔗
github.com/THUDM/slime/tree/…
The example demonstrates an end-to-end pipeline where Claude Code operates inside a real environment, interacts through SGLang endpoints, logs requests via an Anthropic adapter, generates rewards automatically, and converts trajectories into trainable RL data.
🛠️ Maintaining Open Source in the Agent Era
As coding agents improve, software projects may split into two categories:
• Projects that can be rewritten every time a stronger model arrives.
• Projects whose value comes from years of accumulated design decisions, testing, edge cases, and user trust.
Training frameworks belong to the second category.
That creates two major risks when relying heavily on coding agents:
• Attention DDoS — code volume grows faster than maintainers can review and understand it.
• Loss of ownership — developers stop understanding why systems are designed the way they are, and architecture quality gradually degrades.
Because of this, slime remains conservative in core development. AI is used as a collaborator, reviewer, and coding assistant—not as the primary architect.
On the other hand, we've aggressively used AI for testing and visualization. Over the past few months, this approach has helped us build extensive CPU-only test coverage and improve framework stability.
The goal is simple: make slime not only battle-tested at scale, but also one of the most rigorously tested open-source RL frameworks available.
slime is approaching its first open-source anniversary. What started as a project maintained by one or two people has grown into a team effort.
We hope v0.3.0 makes Agent RL easier to build—and helps slime remain clear, lightweight, and reliable as the Agent era unfolds.
⭐ If slime has been useful to you, consider giving it a star:
github.com/THUDM/slime
🔗 Original article:
zhuanlan.zhihu.com/p/2044533…
#AI #Agents #RLHF #ReinforcementLearning #OpenSource #LLM #AgenticAI #SGLang #DeepLearning