We are pleased to see that the latest MAI-Thinking-1 model relies heavily on synthetic pipelines for building RL environments, primarily for agentic MCP tool-use scenarios. Notably, they single out the FunReason-MT pipeline by Ant Group, which contains a few interesting ideas:
(1) To scale up multi-turn trajectory generation, they build a directed graph over all tools and sample shortest paths to a target tool call, which yields a trajectory that is both correct and efficient.
(2) They generate a realistic query that matches the trajectory, so the agent receives a task whose ground-truth answer is that trajectory.
(3) They collect the actual trajectory produced by an agent solving that task as the final data, but also use a correction agent to check the agent against the ground truth at each step, so that the agent produces a correct, optimal trajectory that is highly usable.
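To make idea (1) concrete, here is a minimal sketch of sampling a shortest tool-call path on a directed tool graph. It is not the FunReason-MT implementation: the tool names, the meaning of an edge (one tool's output can feed another tool's input), and the helper names `shortest_path` and `sample_trajectory` are all assumptions made for illustration.

```python
import random
from collections import deque

# Hypothetical tool graph (assumption): an edge A -> B means
# that B can consume something A returns.
TOOL_GRAPH = {
    "search_flights": ["get_flight_details"],
    "get_flight_details": ["book_flight"],
    "get_user_profile": ["book_flight", "get_payment_methods"],
    "get_payment_methods": ["book_flight"],
    "book_flight": ["send_confirmation"],
    "send_confirmation": [],
}


def shortest_path(graph, start, target):
    """Breadth-first search: return a fewest-hop path from start to target, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


def sample_trajectory(graph, target, rng):
    """Pick a random start tool that can reach the target and return its shortest path."""
    starts = [tool for tool in graph if tool != target]
    rng.shuffle(starts)
    for start in starts:
        path = shortest_path(graph, start, target)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_trajectory(TOOL_GRAPH, "send_confirmation", rng))
```

Each sampled path is a candidate ground-truth trajectory; in the pipeline described above, a query would then be written to match it, and a correction agent would compare the solving agent's steps against it.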
In practice, MAI was able to draw on multiple open-source synthetic data research efforts to generate 150 environments with 130K tasks, which ultimately leads to its strong performance.
Glad to see a new player in the agentic RL landscape, but more importantly one that is willing to implement, evaluate, and share research on synthetic data generation, which will be increasingly important with the growing number of agentic scenarios.