More broadly: are there better ways to run these expensive, low-sample evaluations to get more insight efficiently?
One idea is to run an episode end-to-end once, then return to an intermediate progress state, branch, and sample more heavily from that point.
Could designs like this help us estimate time-horizons, inference-scaling efficiency, robustness, and harness effects?