We're launching Macrodata Labs.
Me and
@gui_penedo have spent the past three years in the trenches working on data for training LLMs. This gave us a unique perspective on how the field has progressed - from GPT-3-era models capable of little more than simple completions to today, where agents are writing a substantial share of the code being shipped.
This progress was enabled by just two components: scaling data and compute while being extremely deliberate about what data to use and what not to use. Look at failed training runs and ask researchers what caused them - poor data quality is almost always at the top of the list.
While LLMs have undergone this Cambrian explosion, robotics today feels exactly like LLMs did back then. There is still no clear recipe for what will work. Every team has its own opinions on embodiment and architecture, yet they all agree on one thing: the most important problem to solve is data and how to scale it.
Nobody knows yet whether the answer lies in simulation data, egocentric data, IMU data, or something completely different.
Whatever the answer turns out to be, every team still has to go through the same process: acquiring the data, filtering problematic episodes, synchronizing sensor values, annotating episodes using VLMs, splitting episodes into subtasks, or, in the case of egocentric data, extracting 21-DOF hand annotations. Finally, all of this has to be converted into training-ready datasets before training starts as choosing a bad format for training will waste GPU cycles.
These pipelines need to run continuously. Every day, new episodes arrive from in-house data collection efforts and external vendors. Teams not only have to deal with the peculiarities of working with video data, ensuring sensor streams are error-free and avoiding unnecessary video decoding, but also need to support ingestion from whatever formats their data vendors provide.
Many teams are solving these problems today, yet you'll quickly discover that 99% of the solutions are collections of one-off scripts, which everyone hates the moment something goes wrong. Researchers end up digging through repositories trying to find the script that performed a particular operation three months ago, not even knowing whether they're looking at the version that was actually run.
What people want is something as scalable as Spark and as trackable as Weights & Biases.
That is what we created Macrodata Labs to build.
Our first step is Refiner, an open-source framework for processing robotics datasets.
We designed Refiner to help robotics teams turn raw demonstrations into training-ready datasets. Instead of maintaining collections of one-off scripts, teams can use Refiner to ingest heterogeneous robotics data, synchronize sensors, run annotation workflows, extract signals like hand tracking, split trajectories into subtasks, and continuously process new data as it arrives.
Alongside Refiner, we're also launching Refiner Cloud. With a one-line code change, the same pipeline can scale on our platform, with sharding, checkpointing, failure recovery, lineage tracking, and observability built in—so teams can focus on what matters most: data, not infrastructure plumbing.
We're incredibly fortunate to have the support of
@airstreet ,
@DrysdaleVC , OPRTRS Club,
@kimaventures , YG (Alex Yazdi), >commit,
@Thom_Wolf , and an amazing group of angels from leading AI labs and technology companies who share our belief that data will be one of the defining challenges in robotics.
If this resonates with you, give Refiner a try, and don't hesitate to shoot me a message. We'd love to chat.