Do single-cell foundation models obey scaling laws?
A somewhat thought-provoking new Nature Methods study by the Crawford lab suggests that, for current single-cell foundation models, the answer may be “not really.” Across a broad range of architectures and downstream tasks, increasing pretraining data from hundreds of thousands to tens of millions of cells yielded surprisingly limited gains, with performance often saturating much earlier than expected.
This is interesting and provides exactly the kind of rigorous benchmarking our field needs. As Felix Fischer and I commented in the accompanying Research Briefing, such studies help move the discussion beyond model size and computational budgets toward actual scientific utility.
At the same time, I am not convinced the key conclusion is that scaling does not work in biology. Rather, it may be that current objectives are not extracting enough information from additional data.
Interestingly, in our recent scConcept work, we observe a markedly different scaling behavior, with continued gains as training data grows toward hundreds of millions of cells. The key difference may be the training objective itself: instead of reconstruction-based masked modeling, scConcept uses a contrastive objective that directly optimizes biologically meaningful cell representations.
biorxiv.org/content/10.1101/…
This raises an interesting question for the field: Have we reached the limits of data scaling, or only the limits of current objectives?
-> My guess is that the next generation of biological foundation models will depend less on simply collecting more cells and more on finding the right representation learning principles for biology.
Nature Methods paper:
nature.com/articles/s41592-0…
Research Briefing:
nature.com/articles/s41592-0…
#SingleCell #FoundationModels #AIforBiology