Yesterday I interviewed @SeanZCai about AI data.
This is essentially a guide for founders on how to sell data and RL envs to AI labs.
"I've never seen a data contract get turned down by a top lab, if it's good quality data, for budget reasons."
00:00 What areas of data are underserved?
02:10 For bio data, is it real-world or purely digital?
04:21 For cyber data, which subsets are most underserved?
05:50 What is the sales process like?
07:04 Why would a lab not renew or increase their purchase volume?
10:13 When a researcher is exploring a new direction, what's the first step?
11:35 In robotics data, what do you view as underserved?
13:12 What does the initial data delivery look like, what format?
13:53 Do labs have more sophisticated internal setups for running environments?
14:32 Are the non-frontier labs buying off-the-shelf data from Anthropic / OpenAI vendors?
16:11 Do Anthropic data vendors put expiry timeframes on the exclusivity?
16:42 Are purchase decisions researcher-led?
17:41 Decagon, Sierra, Ramp: what kinds of data are they buying?
19:06 Long-term, when do labs still need to buy external data vs train on user traces?
21:15 Will end-vendor benchmarks shift to performance per dollar?
22:04 How many labs are spending at the $1B/yr data level?
23:53 Delta between Anthropic's stated $1B and your $10-20B/lab number?
26:05 What makes inference providers / neoclouds a good fit to acquire RL env cos?