Macrodata Labs

Tweets

Pinned Tweet

Macrodata Labs

@macrodata_labs

Jun 11

Macrodata Labs is launching today to build infrastructure for the robotics data loop.

Robotics is starting to scale. Progress in LLMs and VLMs is making robots more capable, but the data layer behind robotics is still underbuilt. Physical-world data is messy and fragmented. Every robot, sensor setup, and lab has its own assumptions, and teams still spend too much time writing brittle scripts just to make their data usable.

The hard part is not only collecting more demonstrations. It is turning those demonstrations into datasets teams can train on, inspect, improve, and reuse as their policies and data collection setups change.

We built Refiner as our first step toward better infrastructure for robotics data. It is an open-source framework for turning messy robotics data into scalable, inspectable, training-ready datasets. Refiner helps teams process demonstrations, add annotations, run reward model scoring, and scale robotics data pipelines from local execution to managed cloud compute on the Macrodata Labs platform.

Starting today, you can use Refiner and the Macrodata Labs platform to make the most out of your robotics data.

We are fortunate to be backed by Air Street Capital, Drysdale Ventures, OPRTRS club, Kima Ventures, YG (Alex Yazdi), >commit, @Thom_Wolf, and business angels from leading AI labs and technology companies to make this mission possible. @gui_penedo @HKydlicek

9,035

Macrodata Labs retweeted

elie

@eliebakouch

Jun 11

not many new labs working at the data layer! guilherme and hynek new company @macrodata_labs is focusing on data collection for multimodal/robotics where there is a ton of messy data to process/filter i've worked with them at huggingface (they were behind fineweb, finepdf, datatrove etc..) and they were cracked, so very excited to see what they will ship! they also open sourced a new lib for this if you're interested github.com/macrodata-labs/re…

Guilherme Penedo @gui_penedo

Jun 11

Today we’re announcing Macrodata Labs.

Over the last few years, @HKydlicek and I have been turning a large part of the internet into some of the largest open LLM pre-training datasets. Through FineWeb, FineWeb2, FinePDFs, FineTranslations, and related work, we got a front-row seat to how scaling compute and data drove progress in LLMs.

We are starting to see a similar takeoff in robotics. Building on advances in LLMs and VLMs, robotics is finally starting to scale. But physical data is messy in ways text isn’t: large video files, multi-rate sensors, many different formats, and open questions around what signals to record, which annotations matter, and how to turn all that context into better policies.

That makes data work in robotics especially important. Teams need to extract as much signal as possible from every demonstration, trajectory, video frame, and sensor stream, without rebuilding their whole data stack every time they change robot, sensors, format, or labeling method.

We think the right tooling for this is still missing. That is what we created Macrodata Labs to build.

Our first step is Refiner, an open-source framework for processing robotics datasets. We designed Refiner to handle a variety of robotics formats and help teams extract more signal from each demonstration. It is shipping today with support for hand-tracking, subtask annotation, and reward model scoring.

We are also launching a cloud version of Refiner, so teams can focus on their data instead of infrastructure. With a one-line code change, the same pipeline can scale on our platform, with sharding, checkpointing, model deployments, failure recovery, and detailed observability built in.

We’re fortunate to be backed by Air Street Capital, Drysdale Ventures, OPRTRS club, Kima Ventures, YG (Alex Yazdi), >commit, Thomas Wolf, and many incredible angels from top AI labs and technology companies.

I’m excited to keep exploring how better data work can push the frontier of AI, now in the physical world. If @macrodata_labs sounds interesting to you, or if you are building in the space, I would love to hear from you.

7,006

Macrodata Labs retweeted

Nathan Benaich

@nathanbenaich

Jun 11

The FineWeb team turned the open web into the datasets that trained many of the field's LLMs. Now @gui_penedo and @HKydlicek are doing the same for robotics, where data is the bottleneck everyone agrees on and nobody has solved. @airstreet led the round and we're now live!

Macrodata Labs

@macrodata_labs

Jun 11

3,027

Macrodata Labs retweeted

Remi Cadene

@RemiCadene

Jun 11

This open-source tool is super efficient to schedule thousands of jobs with SlurmArray. It allows for insightful visualization and logging. Very unique!

Macrodata Labs

@macrodata_labs

Jun 11

4,129

Macrodata Labs retweeted

Iacopo Poli @iacopo_poli

Jun 11

Robotics is having a moment but data processing is still brittle. @guipenedo just launched Macrodata to fix that: an open-source framework for robotics data, and a cloud offering managing the infrastructure. I worked with him, everything he builds is a joy to use 🚀

Macrodata Labs

@macrodata_labs

Jun 11

147

Macrodata Labs retweeted

Lewis Tunstall

@_lewtun

Jun 11

Guilherme and Hynek have a long track record of turning messy, unstructured data into gold for model training (FineWeb, FineTranslations, FinePDFs, etc.). It’s very exciting to see them come out of stealth to target robotics, which is the next frontier in AI and arguably the hardest one to acquire good data for!

Guilherme Penedo @gui_penedo

Jun 11

1,887

Macrodata Labs retweeted

m_ric

@AymericRoucher

Jun 11

Data is the most underappreciated topic in training models. These guys are cracks => watch out for what they'll do!

Guilherme Penedo @gui_penedo

Jun 11

1,393

Macrodata Labs retweeted

Antoine Chaffin

@antoine_chaffin

Jun 11

Data is important, robotics is important (and will continue growing in the future). Very glad to see @HKydlicek and @gui_penedo pushing this space, especially in the open!

Hynek Kydlíček @HKydlicek

Jun 11

We're launching Macrodata Labs.

@gui_penedo and I have spent the past three years in the trenches working on data for training LLMs. This gave us a unique perspective on how the field has progressed, from GPT-3-era models capable of little more than simple completions to today, where agents are writing a substantial share of the code being shipped.

This progress was enabled by just two components: scaling data and compute while being extremely deliberate about what data to use and what not to use. Look at failed training runs and ask researchers what caused them: poor data quality is almost always at the top of the list.

While LLMs have undergone this Cambrian explosion, robotics today feels exactly like LLMs did back then. There is still no clear recipe for what will work. Every team has its own opinions on embodiment and architecture, yet they all agree on one thing: the most important problem to solve is data and how to scale it.

Nobody knows yet whether the answer lies in simulation data, egocentric data, IMU data, or something completely different. Whatever the answer turns out to be, every team still has to go through the same process: acquiring the data, filtering problematic episodes, synchronizing sensor values, annotating episodes using VLMs, splitting episodes into subtasks, or, in the case of egocentric data, extracting 21-DOF hand annotations. Finally, all of this has to be converted into training-ready datasets before training starts, as choosing a bad format for training will waste GPU cycles.

These pipelines need to run continuously. Every day, new episodes arrive from in-house data collection efforts and external vendors. Teams not only have to deal with the peculiarities of working with video data, ensuring sensor streams are error-free and avoiding unnecessary video decoding, but also need to support ingestion from whatever formats their data vendors provide.

Many teams are solving these problems today, yet you'll quickly discover that 99% of the solutions are collections of one-off scripts, which everyone hates the moment something goes wrong. Researchers end up digging through repositories trying to find the script that performed a particular operation three months ago, not even knowing whether they're looking at the version that was actually run.

What people want is something as scalable as Spark and as trackable as Weights & Biases. That is what we created Macrodata Labs to build.

Our first step is Refiner, an open-source framework for processing robotics datasets. We designed Refiner to help robotics teams turn raw demonstrations into training-ready datasets. Instead of maintaining collections of one-off scripts, teams can use Refiner to ingest heterogeneous robotics data, synchronize sensors, run annotation workflows, extract signals like hand tracking, split trajectories into subtasks, and continuously process new data as it arrives.

Alongside Refiner, we're also launching Refiner Cloud. With a one-line code change, the same pipeline can scale on our platform, with sharding, checkpointing, failure recovery, lineage tracking, and observability built in, so teams can focus on what matters most: data, not infrastructure plumbing.

We're incredibly fortunate to have the support of @airstreet, @DrysdaleVC, OPRTRS Club, @kimaventures, YG (Alex Yazdi), >commit, @Thom_Wolf, and an amazing group of angels from leading AI labs and technology companies who share our belief that data will be one of the defining challenges in robotics.

If this resonates with you, give Refiner a try, and don't hesitate to shoot me a message. We'd love to chat.

1,540

Macrodata Labs retweeted

Leandro von Werra

@lvwerra

Jun 11

Only a few people are as data-pilled as Guilherme and Hynek! Among the dozens of neo-labs, they are the ones building oil pipelines.

Guilherme Penedo @gui_penedo

Jun 11

3,832

Macrodata Labs retweeted

Cody Blakeney

@code_star

Jun 11

Super excited about what @gui_penedo and @HKydlicek and @macrodata_labs are building. The quality of their track record in LLM data speaks for itself (refinedweb, fineweb, fineweb-edu, finepdfs, finephrase).

Every model is only as good as its data. Your data is only as good as your tooling. While existing solutions to processing large training sets work, they feel incredibly clunky and unintuitive to the level of abstraction you naturally want to work at as a practitioner. (Anyone who has tried to inspect text from a Spark dataframe knows what I mean.)

I’m really excited to see these masters of their craft bringing their expertise to the world.

Guilherme Penedo @gui_penedo

Jun 11

3,126