Zygosity-Aware DNA Language Modeling Improves Ancestry and Gene Expression Prediction
1. A new study examines how adding zygosity information to DNA language models (DNA-LMs) affects downstream tasks, reporting improved ancestry classification and, for some architectures, improved gene expression prediction. The approach uses diploid genome representations to capture biologically meaningful signals that single-sequence models often miss.
2. For ancestry prediction, researchers used HyenaDNA embeddings on the highly polymorphic MHC region and found that concatenating maternal and paternal haplotype embeddings consistently enhanced predictive performance across five superpopulations. This highlights the value of explicit diploid modeling in capturing population-specific genetic variation.
3. In gene expression prediction, convolutional neural networks (CNNs) became more accurate when zygosity was incorporated via additive genotype encoding, while pretrained Nucleotide Transformer models showed mixed results. The mixed results suggest a mismatch between current pretraining objectives and variation-sensitive tasks, and point to the need for diploid-aware pretraining strategies.
4. The study underscores the importance of modeling both parental copies of the genome, especially in regions like the MHC where genetic diversity is high. It also highlights the potential for future DNA-LMs to integrate population-level variation and diploid structure to improve variant interpretation and precision medicine applications.
5. The data and code supporting this research are openly available, enabling full reproducibility and further exploration of diploid-aware DNA representations in genomics.
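The two diploid input strategies mentioned in points 2 and 3 can be sketched in a few lines. This is a minimal illustration and not the paper's code: the function names are hypothetical, the embeddings are stand-in lists of floats (not real HyenaDNA outputs), and "additive genotype encoding" is assumed here to mean summing the one-hot vectors of the two haplotypes at each position, which is one common reading of the term.

```python
from typing import List, Sequence

BASES = "ACGT"


def concat_haplotype_embeddings(
    maternal: Sequence[float], paternal: Sequence[float]
) -> List[float]:
    """Join two haplotype embeddings into one diploid feature vector.

    Hypothetical helper: assumes each haplotype was already embedded
    separately by some DNA language model.
    """
    if len(maternal) != len(paternal):
        raise ValueError("haplotype embeddings must have the same length")
    return list(maternal) + list(paternal)


def one_hot(seq: str) -> List[List[int]]:
    """One-hot encode a sequence; unknown bases such as N become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def additive_genotype_encoding(
    maternal_seq: str, paternal_seq: str
) -> List[List[int]]:
    """Sum the per-position one-hot vectors of both haplotypes.

    Assumed definition: a homozygous site yields a single 2,
    a heterozygous site yields two 1s.
    """
    if len(maternal_seq) != len(paternal_seq):
        raise ValueError("haplotype sequences must be aligned to the same length")
    return [
        [m + p for m, p in zip(m_row, p_row)]
        for m_row, p_row in zip(one_hot(maternal_seq), one_hot(paternal_seq))
    ]


if __name__ == "__main__":
    # Toy embeddings: the diploid vector is twice the haplotype dimension.
    print(concat_haplotype_embeddings([0.1, 0.2], [0.3, 0.4]))
    # Position 0 is homozygous A/A, position 1 is heterozygous C/T.
    print(additive_genotype_encoding("AC", "AT"))
```

Concatenation keeps the two parental copies distinguishable at the cost of doubling the feature dimension, while the additive encoding keeps the input shape of a single sequence but discards which parent carried which allele.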
📜Paper:
biorxiv.org/content/10.1101/…
#Genomics #DNALanguageModels #Zygosity #AncestryPrediction #GeneExpression #DiploidModeling