We just built and released the largest dataset for supervised fine-tuning of agentic LMs: 1.27M trajectories (~36B tokens)!
Until now, large-scale SFT for agents has been rare - not for lack of data, but because that data is fragmented across heterogeneous formats, tools, and interfaces.
To solve this, we introduce the Agent Data Protocol (ADP), a new “interlingua” between a broad variety of agent datasets - coding, browsing, API/tool use - and unified agent training pipelines downstream.
We unified 13 datasets into ADP, converted them to be compatible with multiple agent frameworks, and observed ~20% average gains, reaching SOTA/near-SOTA without domain-specific tuning.
📄 Read our paper:
arxiv.org/abs/2510.24702
🌐 Check our project website:
agentdataprotocol.com/
And we're just getting started: we can add more datasets, further expand the resources, and make training agent LMs easy for all. We’d love to have you join the shared effort and help make ADP the open standard for the community 🚀