Bimodal masked language modeling for RNA-seq and DNA methylation representation learning
1. This work introduces MOJO, a bimodal masked language model that learns joint representations of bulk RNA-seq and DNA methylation data using self-supervised learning. The model achieves state-of-the-art performance across cancer-type classification, survival prediction, and zero-shot clustering.
2. Unlike late integration approaches that combine separately pretrained models, MOJO is trained end-to-end on both modalities, capturing interactions between RNA-seq and methylation signals from the start. This joint modeling leads to improved generalization and representation quality.
3. MOJO combines convolutional layers with transformers to handle the long-range dependencies and high dimensionality of omics data efficiently. This hybrid architecture makes training steps 100× faster than in purely transformer-based bimodal models.
4. The model is pre-trained on over 9,000 samples from the TCGA dataset using a bimodal masked language modeling objective, which corrupts 15% of tokens across both modalities to encourage joint representation learning.
5. In pan-cancer classification (33 cancer types), MOJO outperforms strong baselines including BulkRNABert, MethFormer, and CustOmics. It achieves a test macro-F1 of 0.935 and weighted-F1 of 0.952, showing notable gains even over sophisticated late integration methods using cross-attention.
6. For survival analysis, MOJO again outperforms unimodal and late integration baselines, achieving a C-index of 0.771 and weighted C-index of 0.670. These results indicate that the joint omics embeddings support robust time-to-event prediction.
7. The learned embeddings show strong zero-shot performance in breast cancer subtyping (PAM50) and pan-cancer cohort clustering, with MOJO achieving 0.777 accuracy for PAM50 and 0.928 for pan-cancer, both better than late integration.
8. To tackle the real-world challenge of missing modalities, the authors propose two mechanisms: (i) re-training with incomplete modality samples (MOJO-MMO), and (ii) a mutual information auxiliary loss during fine-tuning, which regularizes the model to produce consistent outputs across partial inputs.
9. When either RNA-seq or methylation is missing at test time, the mutual information strategy allows MOJO to recover performance close to that of modality-specific models (e.g., from 0.538 to 0.916 weighted-F1 when RNA-seq is absent).
10. Overall, MOJO offers a scalable, performant, and robust approach to multi-omics integration, making it well-suited for clinical applications involving heterogeneous and incomplete datasets.
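To make the masking objective in point 4 concrete, here is a minimal sketch of corrupting 15% of tokens sampled jointly across both modalities. It is not the authors' implementation: the token ids, the `MASK_ID` value, the function name `mask_bimodal`, and the choice to always substitute a mask token (rather than any other corruption scheme) are all assumptions made for illustration.

```python
import random

MASK_ID = -1  # hypothetical mask token id, not taken from the paper


def mask_bimodal(rna_tokens, meth_tokens, mask_rate=0.15, seed=0):
    """Mask about `mask_rate` of positions, sampled jointly over both modalities."""
    rng = random.Random(seed)
    n_rna = len(rna_tokens)
    total = n_rna + len(meth_tokens)
    n_mask = max(1, round(mask_rate * total))
    # Positions 0..n_rna-1 index RNA-seq tokens, the rest index methylation tokens.
    picked = set(rng.sample(range(total), n_mask))

    rna_in = [MASK_ID if i in picked else t for i, t in enumerate(rna_tokens)]
    meth_in = [MASK_ID if (n_rna + j) in picked else t
               for j, t in enumerate(meth_tokens)]
    # Targets hold the original token at masked positions and None elsewhere,
    # so a loss would only be computed where the input was corrupted.
    rna_tgt = [t if i in picked else None for i, t in enumerate(rna_tokens)]
    meth_tgt = [t if (n_rna + j) in picked else None
                for j, t in enumerate(meth_tokens)]
    return rna_in, meth_in, rna_tgt, meth_tgt


# Toy binned expression and methylation tokens (made-up values).
rna = [3, 7, 1, 0, 5, 2, 6, 4, 1, 3]
meth = [0, 2, 2, 1, 3, 0, 1, 2, 3, 1]
rna_in, meth_in, rna_tgt, meth_tgt = mask_bimodal(rna, meth)
n_masked = rna_in.count(MASK_ID) + meth_in.count(MASK_ID)
print(n_masked, rna_in, meth_in)
```

Because the positions are drawn from the concatenated index range, a given sample may have more masked tokens in one modality than the other, so the model has to use whichever modality is still visible to reconstruct the masked one.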
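Point 5 reports both macro-F1 and weighted-F1, and the two differ because cancer types are not equally represented. The sketch below computes both from scratch on a made-up two-class example. The labels and predictions are invented, and taking the class set from the true labels only is a simplifying assumption.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro-F1, weighted-F1) for single-label classification."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    per_class = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
    # Macro: every class counts equally, regardless of how many samples it has.
    macro = sum(per_class.values()) / len(labels)
    # Weighted: each class counts in proportion to its number of true samples.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return macro, weighted


# Imbalanced toy data: 8 samples of a common class, 2 of a rare one.
y_true = ["BRCA"] * 8 + ["ACC"] * 2
y_pred = ["BRCA"] * 8 + ["BRCA", "ACC"]
macro, weighted = f1_scores(y_true, y_pred)
print(round(macro, 4), round(weighted, 4))
```

A single mistake on the rare class pulls macro-F1 down much more than weighted-F1, which is why a macro score is the stricter of the two on imbalanced cohorts.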
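For the survival results in point 6, the C-index measures how often a model ranks two patients in the right order of risk. Below is a minimal sketch of the standard Harrell's concordance index on invented data. It does not cover the weighted C-index mentioned in the thread, whose definition is not given here, and the function name and toy values are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times  : observed follow-up time per patient
    events : 1 if the event was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: the third patient is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported 0.771.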
💻Code:
github.com/instadeepai/multi…
📜Paper:
biorxiv.org/content/10.1101/…
#RNAseq #DNAmethylation #MultiOmics #MaskedLanguageModel #Cancer #RepresentationLearning #Bioinformatics #SelfSupervised #TCGA #SurvivalAnalysis