A large-scale foundation model for bulk transcriptomes
1. BulkFormer is the first large-scale foundation model designed specifically for bulk RNA-seq data, filling a critical gap left by models trained solely on sparse single-cell data.
2. It is pretrained on over 520,000 human bulk transcriptomes and models approximately 20,000 protein-coding genes using a hybrid encoder that combines graph neural networks with Performer-based self-attention.
3. Despite using significantly fewer computational resources than single-cell models, BulkFormer achieves superior performance in six key tasks: transcriptome imputation, disease annotation, prognosis modeling, drug response prediction, compound perturbation simulation, and gene essentiality prediction.
4. In imputation tasks, BulkFormer reaches a Pearson correlation of 0.954 on masked gene expression recovery, outperforming variational autoencoders and single-cell foundation models, which suffer from modality mismatch.
5. Applied to pancreatic cancer data, BulkFormer recovers missing values, which leads to the identification of 408 new differentially expressed genes (DEGs) enriched in pathways such as oxidative phosphorylation, highlighting its power in uncovering latent biology.
6. The model facilitates the discovery of novel prognostic biomarkers across eight cancer types. For example, H4C1 is identified as a risk gene in kidney cancer (5.2x higher mortality), and PDE6H as protective in pancreatic cancer (HR = 0.26).
7. For disease classification, BulkFormer achieves a weighted F1 score of 0.939 across 23 diseases, substantially outperforming single-cell pretrained models like scGPT and Geneformer.
8. In cancer subtype classification across 33 cancer types, it again leads with a weighted F1 of 0.833 and yields high-quality, disease-separated UMAP projections, indicating strong representational power.
9. By generating context-aware embeddings, BulkFormer improves prognosis modeling and rescues weak prognostic signals. For example, RPS27 becomes a strong risk gene in lower-grade glioma after embedding (HR = 4.77).
10. It also enables fine-grained prediction of transcriptomic changes upon drug treatment. In compound perturbation tasks, it outperforms PRnet and scLLMs, achieving PCC = 0.493 on unseen drugs like Dovitinib.
11. For drug response prediction (IC50) across 255 compounds and 700 cell lines, BulkFormer attains top-tier performance (PCC = 0.910), showing promise for precision oncology and drug screening.
12. Finally, BulkFormer predicts gene essentiality scores with high accuracy (PCC = 0.931) from expression alone, highlighting cancer-specific vulnerabilities and informing therapeutic strategies.
13. Its rotary expression embedding method captures expression magnitude and continuity more effectively than traditional rank-based methods, improving stability and interpretability.
14. BulkFormer trains quickly, requiring just 1–10% of the GPU time of scLLMs, which makes it both cost-effective and scalable for large-scale biomedical applications.
15. Limitations include reduced applicability to single-cell tasks and no modeling of non-coding RNAs, but its focus on bulk-level data makes it well suited to clinical and tissue-scale analyses.
16. Future directions include multimodal foundation models for joint bulk and single-cell data, and integration of patient metadata (age, sex, tissue type) to enhance context-awareness.
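For readers less familiar with the two metrics quoted throughout the thread, here is a minimal sketch of how a Pearson correlation on masked expression values and a support-weighted F1 score are typically computed. It is not the authors' evaluation code: the function names `masked_pearson` and `weighted_f1`, the 15% mask rate, and the synthetic data are all assumptions made for illustration, and only NumPy is used.

```python
import numpy as np


def masked_pearson(true_expr, pred_expr, mask):
    """Pearson correlation computed only over the masked (held-out) entries."""
    t = true_expr[mask].astype(float)
    p = pred_expr[mask].astype(float)
    t = t - t.mean()
    p = p - p.mean()
    return float((t @ p) / np.sqrt((t @ t) * (p @ p)))


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true support."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    labels, support = np.unique(y_true, return_counts=True)
    total = 0.0
    for label, n in zip(labels, support):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = n - tp
        # tp + fn equals the class support (> 0), so the denominator is never zero.
        total += n * (2 * tp / (2 * tp + fp + fn))
    return float(total / y_true.size)


# Synthetic stand-in for a samples x genes expression matrix (hypothetical data).
rng = np.random.default_rng(0)
expr = rng.normal(size=(8, 50))
mask = rng.random(expr.shape) < 0.15                      # assumed 15% mask rate
noisy_pred = expr + rng.normal(scale=0.3, size=expr.shape)  # stand-in for model output

pcc = masked_pearson(expr, noisy_pred, mask)
f1 = weighted_f1([0, 0, 1, 1], [0, 1, 1, 1])
```

The weighted F1 weights each class by its number of true samples, so frequent diseases or cancer types contribute more to the score than rare ones.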
💻 Code:
github.com/KangBoming/BulkFo…
📜 Paper:
biorxiv.org/content/10.1101/…
#BulkRNAseq #FoundationModel #Transcriptomics #Bioinformatics #CancerBiomarkers #AI4Biomedicine