HalluCodon enables species-specific codon optimization using multimodal language models
1 HalluCodon is a plant-focused codon-optimization framework that fine-tunes pre-trained protein and RNA language models to generate species-specific coding sequences, aiming to improve heterologous protein expression beyond simple codon-frequency heuristics.
2 The core idea is a two-module scoring system: CodonNAT quantifies “codon-context naturalness” (how compatible a CDS looks relative to endogenous host genes), while CodonEXP, trained on experimental protein-abundance labels, predicts the probability that a CDS will yield high protein abundance.
3 CodonNAT is built via joint fine-tuning of ESM2-650M (protein LM) and mRNA-FM (codon-token RNA LM) under a masked-language-modeling objective, learning host-specific codon context signatures rather than only per-codon frequencies.
4 Across 15 plant species (including maize, rice, tobacco, wheat, tomato, potato, and grape), CodonNAT achieved higher masked-codon prediction accuracy than a “pick the most frequent codon” baseline (average 66.5% vs 56.6%), with especially strong gains for amino acids with higher synonymous-codon diversity.
5 CodonNAT also showed biologically meaningful signal in a non-plant benchmark: on E. coli ccdA synonymous-mutation fitness data, its scores correlated better with measured fitness (Spearman 0.41) than frequency-based scoring (0.32) and slightly better than CodonTransformer (0.39), supporting the view that it captures context effects relevant to cellular fitness.
6 CodonEXP integrates nucleotide-level and protein-level information by learning from both CDS and amino acid sequence features, supervised with protein abundance data (PaxDb-derived labels: top 33% vs bottom 33%). It reached ~79.3% average accuracy and 86.1% average AUC across the 15 plant species, and outperformed RNA-only language model baselines in maize/rice/tobacco comparisons.
7 For sequence generation, HalluCodon offers (a) a genetic algorithm (CodonGa) and (b) a hallucination-style, gradient-guided optimizer (CodonHa). Both maximize a Fitness score defined as Naturalness (CodonNAT) × Expression probability (CodonEXP), but CodonHa converges with far less compute.
8 In a tobacco DsRed2 optimization example, CodonHa reached near-maximal predicted expression probability in only a few iterations and ran ~46.8× faster than the genetic algorithm on the reported GPU setup, while maintaining codon-context compatibility.
9 Experimental transient expression in tobacco leaves tested five proteins (DsRed2, mCry2Ab, GAT, infliximab-A, infliximab-B). For DsRed2, CodonHa produced the strongest fluorescence and higher protein levels by Western blot (reported 1.57× vs CodonTransformer, 4.32× vs Genewiz, 13.58× vs a frequency baseline), suggesting the combined NAT × EXP objective can translate to wet-lab gains.
10 The study highlights GC3 as a learned and actionable plant expression feature: HalluCodon optimization tends to increase GC3 toward host-like levels, and a GC3-rewarding variant (Ha-GC3) enabled expression of larger proteins that were difficult under the default CodonHa, while warning that extreme GC3/GC increases can complicate synthesis and increase methylation-site density.
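To make the “pick the most frequent codon” baseline from point 4 concrete, here is a minimal sketch in Python. It is not the paper's evaluation code: the codon table is a small subset of the standard genetic code, the sequences are toy data, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Subset of the standard genetic code, enough for the toy sequences below.
CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "AAA": "K", "AAG": "K",
    "ATG": "M",
}

def split_codons(cds):
    """Split a CDS into codons, dropping any trailing partial codon."""
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

def most_frequent_codon_table(host_cds_list):
    """For each amino acid, find the codon used most often in the host set."""
    counts = defaultdict(Counter)
    for cds in host_cds_list:
        for codon in split_codons(cds):
            counts[CODON_TO_AA[codon]][codon] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}

def baseline_accuracy(table, test_cds_list):
    """Fraction of codons recovered when each position is treated as masked
    and predicted as the host's most frequent codon for that amino acid."""
    hits = total = 0
    for cds in test_cds_list:
        for codon in split_codons(cds):
            hits += table[CODON_TO_AA[codon]] == codon
            total += 1
    return hits / total

# Toy "host genes" and a held-out toy sequence (made-up data).
host = ["ATGGCCAAGGCC", "ATGGCCAAGGCT"]
table = most_frequent_codon_table(host)
accuracy = baseline_accuracy(table, ["ATGGCTAAGGCC"])
```

A context-aware model can beat this baseline only where the surrounding sequence carries information about which synonymous codon appears, which fits the reported larger gains for amino acids with many synonymous codons.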
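The top 33% vs bottom 33% labeling in point 6 can be sketched as follows. This assumes a simple rank-based split with the middle third discarded; the thread does not say exactly how the thresholds were computed, and the gene names and abundance values are invented.

```python
def tercile_labels(abundance):
    """Label the bottom third of genes 0 and the top third 1 by abundance.

    Genes in the middle third are left out, so the classifier only sees
    clearly low and clearly high examples. (Assumed scheme, rank-based.)
    """
    ranked = sorted(abundance, key=abundance.get)  # lowest abundance first
    k = len(ranked) // 3
    labels = {gene: 0 for gene in ranked[:k]}
    labels.update({gene: 1 for gene in ranked[len(ranked) - k:]})
    return labels

# Made-up abundance values for six hypothetical genes.
toy_abundance = {"g1": 5.0, "g2": 120.0, "g3": 0.4, "g4": 33.0, "g5": 900.0, "g6": 12.0}
labels = tercile_labels(toy_abundance)
```

Dropping the middle third trades training-set size for cleaner labels, since abundance measurements near the median are the most likely to be mislabeled by noise.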
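The Fitness score from point 7 (Naturalness × Expression probability) and a search that maximizes it can be illustrated with a toy example. Everything below is a stand-in: `toy_naturalness` and `toy_expression` are made-up placeholders for CodonNAT and CodonEXP, and the loop is a mutation-only evolutionary search without crossover, a simplification of a genetic algorithm rather than CodonGa itself.

```python
import random

# Toy synonymous-codon table (subset of the standard genetic code).
SYNONYMS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "K": ["AAA", "AAG"],
    "M": ["ATG"],
}
CODON_TO_AA = {c: aa for aa, cs in SYNONYMS.items() for c in cs}

def toy_naturalness(codons):
    """Placeholder for CodonNAT: fraction of codons on a made-up preferred list."""
    preferred = {"ATG", "GCC", "AAG"}
    return sum(c in preferred for c in codons) / len(codons)

def toy_expression(codons):
    """Placeholder for CodonEXP: fraction of codons ending in G or C."""
    return sum(c[2] in "GC" for c in codons) / len(codons)

def fitness(codons):
    """Fitness = naturalness x expression probability, as described in the thread."""
    return toy_naturalness(codons) * toy_expression(codons)

def mutate(codons, rng):
    """Swap one randomly chosen codon for a synonymous alternative."""
    child = list(codons)
    i = rng.randrange(len(child))
    child[i] = rng.choice(SYNONYMS[CODON_TO_AA[child[i]]])
    return child

def evolve(start, generations=30, pop_size=20, seed=0):
    """Elitist mutation-selection loop; returns the fittest codon list found."""
    rng = random.Random(seed)
    population = [list(start)] * pop_size
    for _ in range(generations):
        children = [mutate(p, rng) for p in population]
        # Keep the best pop_size individuals from parents and children combined.
        population = sorted(population + children, key=fitness, reverse=True)[:pop_size]
    return population[0]

start = ["ATG", "GCT", "AAA", "GCA", "AAA"]
best = evolve(start)
```

Because the score is a product, a sequence has to do well on both terms: a high predicted expression probability cannot compensate for a near-zero naturalness score. Mutations are restricted to synonymous codons, so the encoded protein never changes.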
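GC3, the feature highlighted in point 10, is the fraction of codons whose third base is G or C. A minimal sketch for computing it, using invented toy sequences and a hypothetical function name:

```python
def gc3(cds):
    """Return the fraction of complete codons whose third base is G or C."""
    # Indices 2, 5, 8, ... are third positions; a trailing partial codon
    # has no third base, so slicing skips it automatically.
    third_bases = cds.upper()[2::3]
    if not third_bases:
        return 0.0
    return sum(base in "GC" for base in third_bases) / len(third_bases)

# Two toy CDSs encoding the same peptide (Met-Ala-Lys-Ala) with different codons.
before = gc3("ATGGCAAAAGCT")
after = gc3("ATGGCCAAGGCT")
```

Comparing GC3 before and after optimization, and against the host's endogenous genes, is a quick way to check whether a design has moved toward host-like levels or overshot into the range where synthesis and methylation-site density become concerns.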
💻Code:
github.com/YuxuanLou/HalluCo…
📜Paper:
biorxiv.org/content/10.64898…
#CodonOptimization #PlantSyntheticBiology #ComputationalBiology #Bioinformatics #LanguageModels #DeepLearning #ProteinExpression #Transgenic #MolecularFarming #ESM2 #RNAFM