1. I spent a lot of time at Scale labeling data myself and never thought it was beneath me. It's how we developed quality criteria and instructions, and how we built partnerships with our customers. My co-founder @flubtitle and I built out new labeling products for LLMs in 2023 (well, it's old now) because we did the labeling and ran the queues ourselves (meaning we were running real projects and needed to deliver data), not because we got a PRD from anyone.
2. Among the first things we built at Santori Labs were our voice-first eval/label flow and a roleplay system. I spent hours every week going through data, thinking about what is good vs. what is not.
3. Imagine an engineer who thinks they are too good to do that and is just here to execute a PRD that is handed to them.
4. I don't think data is all you should do, but it's still one of the most important things you can do. Labeling is one form; looking at agent traces is another. If you don't see why that's important, you are stuck in the past.
5. It's painful looking at data. You think you can just look at it and know if it's good. It's never that. It's always the messy middle of "meh". That's why the design principle for our own data flow is this: data is a focused act, and the product needs to encourage focus.
Just learned:
Software engineers used to do manual data labeling at Scale AI while Alex Wang was CEO. After he left, new leadership joined and were HORRIFIED to learn this. They stopped it ASAP.
Now at Meta, software engineers are assigned manual data labeling... see the pattern?