ScProtoTransformer: Scalable Reference Mapping Across Molecules, Cells and Donors
1. The rapid accumulation of single-cell data has enabled comprehensive characterization of biological systems, but scalable reference mapping across different resolutions remains a major challenge. ScProtoTransformer addresses this with a prototype-based Transformer architecture that maps queries to references at the molecular, cellular, and donor levels.
2. A key innovation is the knowledge-guided prototype tokenizer, which projects gene expression onto biologically interpretable pathway prototypes. This reduces numerical batch effects while preserving biologically meaningful patterns, which makes the tokenizer well suited to cross-scale reference mapping.
3. ScProtoTransformer combines knowledge distilled from foundation models with a dynamic supervised fine-tuning (SFT) strategy. This lets it inherit the knowledge of large-scale pretrained models without requiring extensive pretraining itself, significantly reducing computational cost.
4. Benchmark experiments show that ScProtoTransformer delivers competitive or superior performance compared to state-of-the-art methods on molecular-, cell-, and donor-level reference mapping tasks. It also provides interpretability through biologically meaningful prototypes.
5. The method supports multi-level reference mapping: gene embeddings enable molecular-level mapping, cell embeddings support cell-level mapping, and donor-level mapping is achieved by aggregating embeddings from the same donor sample. This lays the foundation for integrative analysis across different biological scales.
6. ScProtoTransformer shows strong performance in cross-modal and cross-batch integration tasks, outperforming specialized integration methods. It also adapts to spatial data without relying on non-molecular features such as spatial coordinates.
7. The study includes comprehensive ablation experiments, validating the necessity of the prototype tokenizer, knowledge distillation loss, and dynamic SFT loss in achieving robust performance across different levels of biological analysis.
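To make points 2 and 5 concrete, here is a minimal sketch of the two mechanical ideas: turning gene expression into pathway-level tokens, and pooling cell embeddings per donor. It rests on loud assumptions: the function names `pathway_tokens` and `donor_embeddings` are hypothetical, a plain average over each pathway's genes stands in for the learned, knowledge-guided projection, and mean pooling stands in for whatever aggregation the paper actually uses. It is not the authors' implementation.

```python
from collections import defaultdict


def pathway_tokens(expression, pathways):
    """Average one cell's expression over the genes of each pathway.

    ASSUMPTION: a simple mean replaces the learned prototype projection.
    expression: dict mapping gene name -> expression value for one cell
    pathways:   dict mapping pathway name -> list of gene names
    """
    tokens = {}
    for name, genes in pathways.items():
        # Ignore pathway genes that were not measured in this cell.
        values = [expression[g] for g in genes if g in expression]
        tokens[name] = sum(values) / len(values) if values else 0.0
    return tokens


def donor_embeddings(cell_embeddings, donor_ids):
    """Mean-pool cell embeddings that share a donor ID.

    ASSUMPTION: mean pooling is an illustrative choice of aggregation.
    cell_embeddings: list of equal-length lists of floats, one per cell
    donor_ids:       list of donor labels, aligned with cell_embeddings
    """
    grouped = defaultdict(list)
    for embedding, donor in zip(cell_embeddings, donor_ids):
        grouped[donor].append(embedding)
    return {
        donor: [sum(column) / len(embeddings) for column in zip(*embeddings)]
        for donor, embeddings in grouped.items()
    }


# Toy data, invented purely for illustration.
toy_expression = {"A": 2.0, "B": 4.0, "C": 9.0}
toy_pathways = {"p1": ["A", "B"], "p2": ["C", "D"], "p3": ["Z"]}
toy_tokens = pathway_tokens(toy_expression, toy_pathways)

toy_cells = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_donors = ["d1", "d1", "d2"]
toy_donor_embeddings = donor_embeddings(toy_cells, toy_donors)
```

In this toy run, each pathway becomes one token per cell regardless of how many genes it contains, and each donor ends up with a single vector of the same width as the cell embeddings, so donor-level comparison can reuse the same similarity measure as cell-level comparison.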
📜Paper:
biorxiv.org/content/10.64898…
#ComputationalBiology #SingleCellData #TransformerArchitecture #ReferenceMapping #Bioinformatics