Learning The Native-Like Codons With A 5'UTR And Secondary RNA Structure Aided Species-Informed Transformer Model
A new Transformer-based deep learning model, TransCodon, has been developed to address the challenge of efficient protein expression in heterologous hosts. It tackles the difficulty of reconstructing native-like codon landscapes by integrating 5' untranslated regions (5'UTRs), coding sequences (CDS), explicit species identifiers, and RNA secondary structure information.
TransCodon learns nuanced codon usage patterns across diverse organisms by incorporating multisource genomic data and modeling sequence dependencies via a masked language modeling paradigm. This allows it to effectively capture both local and global determinants of codon preference.
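For readers new to masked language modeling, here is a minimal sketch of the masking step: hide a fraction of tokens and keep the originals as prediction targets. The `<mask>` token, the 15% rate, and the `mask_tokens` helper are illustrative assumptions, not details taken from the paper.

```python
import random

MASK = "<mask>"  # hypothetical mask token, not TransCodon's actual symbol


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens; return the masked list and the targets.

    mask_rate=0.15 is an assumed default, not a value reported for TransCodon.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Always mask at least one position so there is something to predict.
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]  # the model is trained to recover this token
        masked[pos] = MASK
    return masked, labels


tokens = list("ATGGCTAAAGGTTAA")  # single characters, purely for illustration
masked, labels = mask_tokens(tokens, mask_rate=0.2, seed=1)
```

During training, a model sees `masked` and is scored on how well it recovers the entries in `labels` from the surrounding context.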
A key innovation is TransCodon's use of a finer-grained vocabulary based solely on nucleotides, which enables partial decoding and preserves richer sequence-level information compared to previous approaches. The model was trained on a large dataset of 5.5 million gene sequences from 1,436 species, ensuring robust cross-species generalization.
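To make the vocabulary point concrete, the sketch below contrasts codon-level tokens (64 symbols) with nucleotide-level tokens (4 symbols). It also shows one possible reading of "partial decoding": fixing some positions of a codon and listing the codons still compatible with them. The function names and the `?` placeholder are hypothetical, and the paper's actual tokenizer may differ.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# All 64 possible codons, i.e. the vocabulary of a codon-level tokenizer.
CODON_VOCAB = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]


def codon_tokens(cds):
    """Split a coding sequence into 3-nucleotide codon tokens."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def nucleotide_tokens(seq):
    """Split a sequence into single-nucleotide tokens."""
    return list(seq)


def candidates_for_partial_codon(partial):
    """List codons consistent with a partly decoded codon such as 'GC?'.

    '?' marks a position that has not been decoded yet (assumed notation).
    """
    if len(partial) != 3:
        raise ValueError("a codon has exactly three positions")
    return [
        codon for codon in CODON_VOCAB
        if all(p in ("?", n) for p, n in zip(partial, codon))
    ]


cds = "ATGGCTAAA"
print(codon_tokens(cds))       # three codon-level tokens
print(nucleotide_tokens(cds))  # nine nucleotide-level tokens
print(candidates_for_partial_codon("GC?"))
```

With single-nucleotide tokens, a state like `GC?` is directly representable, whereas a codon-level vocabulary can only mask or emit the whole codon.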
Experimental results demonstrate that TransCodon consistently outperforms existing codon optimization tools across multiple evaluation metrics. It identifies native-like codons with less divergence from natural sequences and can capture low-frequency codons often missed by other deep learning methods, especially for highly abundant proteins.
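One simple way to quantify "divergence from natural sequences" is the fraction of codon positions where a designed sequence matches the native one. The sketch below is an assumed, simplified metric for illustration only. The paper's own evaluation metrics are not specified in this summary, and `native_codon_identity` is a hypothetical name.

```python
def split_codons(cds):
    """Split a coding sequence into codons, validating its length."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def native_codon_identity(designed, native):
    """Fraction of codon positions where the designed CDS uses the native codon."""
    designed_codons = split_codons(designed)
    native_codons = split_codons(native)
    if len(designed_codons) != len(native_codons):
        raise ValueError("sequences must encode the same number of codons")
    if not native_codons:
        raise ValueError("sequences must not be empty")
    matches = sum(d == n for d, n in zip(designed_codons, native_codons))
    return matches / len(native_codons)


# Both sequences encode Met-Ala-Lys; only the alanine codon differs (GCC vs GCT).
score = native_codon_identity("ATGGCCAAA", "ATGGCTAAA")
```

A higher score means the design stays closer to the codons the organism itself uses at each position.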
Beyond codon optimization, TransCodon also predicts protein abundance, achieving high correlation with experimentally determined values in zero-shot settings. It excels at 5'UTR-related downstream tasks as well, such as predicting mean ribosome load (MRL), where it surpasses other state-of-the-art models.
These findings indicate that TransCodon is a capable codon language model with significant potential for designing genes that achieve high translational efficiency in target host organisms, a notable advance for computational synthetic biology.
📜Paper:
biorxiv.org/content/10.1101/…
#ComputationalBiology #SyntheticBiology #DeepLearning #ProteinExpression #CodonOptimization #Bioinformatics #Genomics #TransformerModels