SITUATION EXPLAINED: How much are frontier labs actually spending on training data?
@SeanZCai: "Frontier labs are spending about $10 to $15 billion per lab on data."
"Really good long horizon tasks go up to $20,000 each. A complete browser-use version of SAP was rumored at $500,000."
"Despite everybody thinking the market is super crowded, we still don't have enough good quality data vendors that actually understand how to deliver product plus services in a way researchers are looking for."
"I have not seen a contract for genuinely good data get turned down because of budgetary concerns yet."
On data markets:
A while ago, Anthropic said that it would spend a billion dollars this year on RL data. That amount will be far exceeded this year, with good data rarely being turned down over budget concerns. We can expect OpenAI to be of a similar mindset, although the window for banal data projects serviced by the likes of Mercor is rumored to be closing entirely this year. DeepMind, Meta, Microsoft, Amazon, and xAI are known to be N-1 labs that may buy datasets already saturated by the likes of Anthropic, or buy RL environments because they lack a system like Tundra at Anthropic.
The TAM is still in the tens of billions, if not more, and the raw aggregate spend on data will only continue to increase.
But one must remember what is actually bought when data is sold, because few today can really differentiate a Mercor/Handshake from a Mechanize/Surge. Data is valuable to frontier labs in proportion to how easily it can be used to improve frontier models. To demonstrate this, teams selling data must show how directly it can be used to hillclimb models, how much frontier SOTA models struggle on its benchmarks, and how much trouble they can save the frontier lab in continually acquiring data. Selling data therefore very much resembles selling outcomes rather than a reusable product. This is why, when evaluating RL environment companies, one must index on the scalable internal systems that help model trainers produce outcomes rather than fixate on the data itself.
In this way, the TAM of data markets is actually extremely greenfield and growing, because few teams have the sophistication for research services and the scale for on-demand, consistently QA’ed data. It is the semblance of this product that let Mercor overtake Scale, and the semblance of this product that many newer upstarts are now pitching as they try to chip away at Mercor/Handshake/Surge’s lunches.
From the April edition of my State of Data on Substack: