Design of overlapping genes using deep generative models of protein sequences
🚀 New preprint from David Baker! 🚀
1. This study demonstrates that deep generative models can design synthetic overlapping genes (OLGs)—two protein-coding sequences embedded in different reading frames of the same DNA—without compromising protein structure or function.
2. The authors developed an iterative sampling algorithm that jointly generates two protein sequences while enforcing codon compatibility constraints between reading frames, using models like EvoDiff-MSA and ProteinMPNN.
3. Contrary to longstanding assumptions, overlapping sequences can encode well-folded 3D structures: AlphaFold2 predictions gave nearly equal pLDDT and TM-scores for overlapping and non-overlapping sequences across 15 structural classes.
4. Over 56,000 overlapping designs were created targeting combinations of de novo protein backbones, revealing that all but the -2 frame arrangement are readily compatible with high-quality folding, consistent with codon degeneracy analysis.
5. Experimental validation confirmed 54% expression success for individual proteins and 31% for full overlapping pairs, with thermostable structures observed up to 95°C and circular dichroism matching predicted folds.
6. Even under extreme constraints—e.g., preserving the exact sequence of one protein—overlapping designs were still achievable in ~1–3% of attempts, indicating that natural protein domains often tolerate overlap with minimal or no mutation.
7. Targeted designs of functional homologs (e.g., chorismate mutase and initiation factor 1) confirmed that synthetic OLGs diverge from natural sequences yet occupy similar embedding space in PLMs and retain catalytic motifs and active site residue patterns.
8. Codon usage and frame configuration influence designability: 0, 1, -0, and -1 frames benefit from the block structure of the genetic code, while the -2 frame is significantly less permissive due to overlapping third-base positions.
9. The standard genetic code appears nearly optimal for enabling OLGs in most frames, supporting the hypothesis that evolutionary constraints on the code favor dual-encodability and may facilitate de novo gene emergence.
10. This work not only expands synthetic biology’s toolkit for genome compaction and biocontainment, but also deepens our understanding of protein sequence space and the potential pervasiveness of hidden genes in natural genomes.
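To make the dual-coding idea in points 1, 4, and 8 concrete, here is a minimal Python sketch. It is not the authors' algorithm or code: the helper names (`translate`, `reverse_complement`, `compatible_plus1`) and the toy DNA string are assumptions made up for illustration. It reads one DNA string in two frames and on the antisense strand, then checks whether a pair of adjacent residues in one frame can share nucleotides with a chosen residue in the frame shifted by one base.

```python
# Illustrative sketch only; helper names and the toy sequence are hypothetical.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Standard genetic code, built in the conventional TCAG ordering.
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

# Reverse lookup: amino acid -> list of synonymous codons.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def translate(dna, offset=0):
    """Translate complete codons of `dna`, starting at `offset` bases in."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def reverse_complement(dna):
    """Return the antisense strand, written 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def compatible_plus1(a1, a2, b):
    """True if some synonymous codon choice for the adjacent residues
    (a1, a2) puts a codon for `b` in the frame shifted by one base."""
    return any(
        CODON_TABLE[c1[1:] + c2[0]] == b
        for c1 in CODONS_FOR[a1]
        for c2 in CODONS_FOR[a2]
    )


dna = "ATGGCCAAA"
print(translate(dna))                      # MAK
print(translate(dna, offset=1))            # WP
print(translate(reverse_complement(dna)))  # FGH
print(compatible_plus1("M", "A", "W"))     # True:  ATG + GCN gives TGG
print(compatible_plus1("M", "A", "K"))     # False: TGG is the only option
```

Because Met has a single codon and every Ala codon starts with G, the shifted frame is forced to TGG (Trp) here. Residues with more synonymous codons leave more room, which is the degeneracy a joint sampler can exploit.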
💻Code:
github.com/gwbyeon/OLG-desig…
📜Paper:
biorxiv.org/content/10.1101/…
#SyntheticBiology #OverlappingGenes #ProteinDesign #GenerativeModels #ProteinMPNN #AlphaFold2 #DeepLearning #GenomeEngineering #Bioinformatics #OLG #DualCoding #EvolutionaryBiology