Big Distilabel release! Distilabel is an open-source framework for engineers who create synthetic datasets and generate AI feedback, built to provide fast, reliable, and scalable pipelines based on verified research papers. 👀
It just got its 1.4 release, with:
🧩 New Steps for better dataset sampling, deduplication (embeddings and minhash), truncation of inputs, and better combining of outputs
💰 50% Cost Savings by pausing pipelines and using the OpenAI Batch API
⚡️ Caching of step outputs for maximum reusability, even if the pipeline changes.
📝 Steps can now generate and save artifacts, which are automatically uploaded to the Hugging Face Hub.
🆕 New Tasks with CLAIR, APIGen, URIAL, TextClassification, TextClustering, and an updated TextGeneration task.
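
Curious what minhash deduplication does under the hood? Here is a minimal, standard-library-only sketch of the general technique. It is not Distilabel's implementation or API: the function names (`shingles`, `minhash_signature`, `estimated_jaccard`, `deduplicate`), the word 3-gram shingling, the 64 hash seeds, and the 0.7 threshold are all assumptions picked for illustration.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Split text into overlapping word n-grams (shingles)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """For each seed, keep the smallest salted hash over all shingles."""
    items = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")  # a different salt acts as a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for s in items
            )
        )
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching positions approximates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


def deduplicate(texts: list[str], threshold: float = 0.7) -> list[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept, kept_signatures = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, other) < threshold for other in kept_signatures):
            kept.append(text)
            kept_signatures.append(sig)
    return kept


rows = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "synthetic data generation with language models is fun",
]
unique_rows = deduplicate(rows)
```

This sketch compares every new row against every kept row, which is fine for a demo but quadratic in the worst case. Real pipelines usually pair the signatures with locality-sensitive hashing so only likely duplicates are compared.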