one structural answer: generate the data in public, against a deterministic scoring rule, with the QC pipeline published instead of hidden.
doesn't solve "quality has no ceiling" does collapse "judge quality without seeing the pipeline"
training data is starting to look like a zero knowledge proof problem.
labs have to judge quality without seeing the full dataset or the QC pipeline behind it.
vendors proxy quality with multi-rollout pass rates, small-model ablations, and downstream eval gains. but compute and iteration costs explode as environments and trajectories grow more complex.
quality has no ceiling, and the best data is often the hardest to capture in a metric or explain in a writeup.
huge alpha in making data quality more legible.