A Gene Set Foundation Model Pre-Trained on a Massive Collection of Diverse Gene Sets
1. A new gene set foundation model (GSFM) is introduced, trained on over one million gene sets mined from two large resources, Rummagene and RummaGEO, which together offer a broad, unlabeled view of biological knowledge without relying on single-cell data.
2. GSFM outperforms state-of-the-art methods such as PrismEXP and GenePT at gene function prediction, across both curated gene set libraries (e.g., KEGG and GO BP) and data-driven ones (e.g., ChEA and GWAS).
3. Rummagene, a key data source, extracts gene sets from the supplementary tables of PMC papers, capturing diverse biological contexts. RummaGEO contributes gene sets derived from differential expression analysis of thousands of RNA-seq studies in GEO.
4. The top-performing GSFM architecture is a simple denoising autoencoder (DAE) over multi-hot encoded gene sets, trained with self-supervised learning to predict held-out genes in each set. It generalizes robustly and remains interpretable.
5. Extensive benchmarking shows that training on Rummagene alone yields better results than training on RummaGEO or on the combined dataset, likely because Rummagene captures a more diverse and sparser gene set space.
6. GSFM enables zero-shot predictions (no retraining required) by completing partially known gene sets or predicting gene membership for arbitrary sets across a wide variety of libraries.
7. The model supports several downstream applications: gene function annotation, gene set enrichment augmentation, and even inference of protein-protein interactions or kinase-substrate relationships.
8. GSFM predictions are hosted at gsfm.maayanlab.cloud, offering precomputed gene pages with functional predictions, AUROC confidence scores, and optional LLM-generated gene summaries based on literature.
9. The model weights, training code, and inference tools are publicly released on HuggingFace and GitHub, so they can be integrated into custom pipelines or used for embedding-based transfer learning.
10. GSFM is currently trained on coding genes; future directions include incorporating non-coding genes, additional omics layers, and vector-valued gene set signatures (e.g., logFC or p-values).
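
To make point 4 concrete, here is a minimal Python sketch of how a gene set could be turned into a multi-hot vector and split into a corrupted input and a reconstruction target for denoising-style training. It is an illustration only, not the released GSFM code: the helper names (`multi_hot`, `make_denoising_pair`), the six-gene toy vocabulary, the 50% drop fraction, and the choice of the full set as the target are all assumptions made for this example.

```python
import random


def multi_hot(gene_set, vocab_index):
    """Return a 0/1 list with a 1 at the index of every gene in gene_set."""
    vec = [0] * len(vocab_index)
    for gene in gene_set:
        idx = vocab_index.get(gene)
        if idx is not None:  # genes outside the vocabulary are ignored
            vec[idx] = 1
    return vec


def make_denoising_pair(gene_set, vocab_index, drop_fraction=0.5, rng=None):
    """Hide a random fraction of a gene set (hypothetical helper).

    Returns (corrupted_input, target, held_out_genes). The corrupted input
    encodes only the visible genes; the target encodes the full set, so a
    model trained on such pairs has to recover the held-out genes.
    """
    if rng is None:
        rng = random.Random()
    genes = sorted(g for g in gene_set if g in vocab_index)
    n_drop = int(len(genes) * drop_fraction)
    held_out = set(rng.sample(genes, n_drop))
    visible = [g for g in genes if g not in held_out]
    return multi_hot(visible, vocab_index), multi_hot(genes, vocab_index), held_out


# Toy vocabulary and gene set, for illustration only.
vocab = ["TP53", "MDM2", "CDKN1A", "BAX", "MYC", "EGFR"]
vocab_index = {gene: i for i, gene in enumerate(vocab)}
gene_set = {"TP53", "MDM2", "CDKN1A", "BAX"}

corrupted, target, held_out = make_denoising_pair(
    gene_set, vocab_index, drop_fraction=0.5, rng=random.Random(0)
)
```

At inference time the same encoding would apply to the zero-shot use in point 6: a partially known gene set is encoded as the input, and the genes with the highest reconstructed scores that are not already in the input become the predicted members.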
💻 Code: huggingface.co/maayanlab/gsf…
📜 Paper: biorxiv.org/content/10.1101/…
#Bioinformatics #GeneFunction #FoundationModels #Transcriptomics #MachineLearning #SystemsBiology