Biology AI Daily


May 5

Towards a Generative Protein Evolution Machine with DPLM-Evo

1. DPLM-Evo reframes protein diffusion generation around explicit evolutionary edits: substitutions, insertions, and deletions (indels). This addresses a mismatch in prior DPLMs, where proteins “emerge from masks” even though real evolution proceeds via accumulated edits.

2. Core idea: decouple a fixed-size latent alignment space from the variable-length observed sequence. Indels become gap ↔ residue transitions in the latent space, making variable-length diffusion tractable while keeping compute close to that of fixed-length models.

3. Architecture: the model denoises in observed sequence space but predicts three edit signals per position via separate heads: (i) an amino-acid distribution for substitution, (ii) a deletion probability, and (iii) an insertion probability (insertions go to the right; residue identity comes from the substitution head).

4. A key innovation is the contextualized evolutionary noising kernel for substitutions. Instead of uniform random corruption, substitutions are corrupted using a context-dependent distribution derived from the model’s own predictions (after a warmup), producing more biologically plausible mutation patterns during training.

5. This contextualized corruption matters materially: an ablation replacing it with uniform corruption drops ProteinGym average Spearman from 0.42 to 0.295; a static BLOSUM-based kernel lands in between (~0.35), supporting the claim that context-aware mutation noise better matches evolutionary constraints.

6. Understanding task highlight: DPLM-Evo achieves state-of-the-art mutation effect prediction on ProteinGym among single-sequence foundation models (217 DMS assays). Scoring is “substitution-native”: it directly reads substitution probabilities at mutated sites without masking them, avoiding an artificial mask-token scoring mismatch.

7. Indel effect prediction: on the ProteinGym indel benchmark, DPLM-Evo reaches 0.495 Spearman, outperforming strong single-sequence baselines (e.g., ProGen2 M 0.464) and approaching MSA-based methods (PoET 0.517, ProFam ensemble 0.530), suggesting explicit indel modeling transfers to indel fitness estimation.

8. Generation: DPLM-Evo enables variable-length unconditional protein generation via evolutionary denoising (sub/ins/del), starting from a learned prior rather than an all-mask state. It maintains strong foldability (ESMFold pLDDT ~83.6, comparable to DPLM) while improving diversity and reducing repetition/mode collapse.

9. Conditional design: in motif scaffolding, DPLM-Evo can dynamically adjust scaffold length during sampling (via the insertion/deletion heads) while keeping motif residues fixed, avoiding the manual enumeration of scaffold lengths that fixed-length generators require. It improves solved-motif counts and success rate zero-shot, and improves further with continued finetuning.

10. Edit-trajectory applications: the model supports post-editing and optimization as explicit evolutionary trajectories. Case studies include in-silico “family expansion” (large sequence divergence while preserving fold) and directed evolution of GFP, where enabling indels raised structural scores faster and to higher values than substitution-only editing or an ESM-2 baseline under the same search/filtering protocol.

📜Paper: arxiv.org/abs/2605.00182

#ComputationalBiology #ProteinDesign #ProteinLanguageModels #DiffusionModels #GenerativeAI #MachineLearning #Bioinformatics #DirectedEvolution #ProteinEngineering #VariantEffectPrediction
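
The latent-alignment idea in point 2 can be illustrated with a toy sketch. This is not the paper's implementation: the gap symbol `-`, the helper names `collapse` and `apply_edit`, and the example sequence are all assumptions made for illustration. It only shows how a fixed-size latent with gap slots can represent a variable-length sequence, with insertions and deletions expressed as gap ↔ residue transitions.

```python
GAP = "-"  # assumed gap symbol for the latent alignment (illustrative)


def collapse(latent):
    """Observed sequence = latent alignment with gap slots removed."""
    return "".join(tok for tok in latent if tok != GAP)


def apply_edit(latent, pos, new_token):
    """Rewrite one latent slot and report which kind of edit that was.

    gap -> residue      : insertion
    residue -> gap      : deletion
    residue -> residue  : substitution
    The latent length never changes; only the observed length does.
    """
    old = latent[pos]
    if old == new_token:
        kind = "none"
    elif old == GAP:
        kind = "insertion"
    elif new_token == GAP:
        kind = "deletion"
    else:
        kind = "substitution"
    updated = latent[:pos] + [new_token] + latent[pos + 1:]
    return updated, kind


# Hypothetical 6-slot latent holding a 4-residue protein fragment.
latent0 = list("MK-LV-")
latent1, kind1 = apply_edit(latent0, 2, "A")   # fill a gap slot
latent2, kind2 = apply_edit(latent1, 0, GAP)   # blank out a residue
latent3, kind3 = apply_edit(latent2, 3, "I")   # swap one residue

for lat, kind in [(latent0, "start"), (latent1, kind1),
                  (latent2, kind2), (latent3, kind3)]:
    print(kind, "".join(lat), collapse(lat))
```

Every step keeps six latent slots, so a model operating on the latent could reuse fixed-length machinery even though the observed sequence grows and shrinks.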
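
The “substitution-native” scoring in point 6 could look roughly like the sketch below. The thread only says the model reads substitution probabilities at the mutated site without masking; the log-ratio of mutant to wild-type probability used here is a common convention and an assumption, as are the function name `substitution_score` and the toy probabilities.

```python
import math


def substitution_score(site_probs, wild_type, mutant):
    """Log-ratio of mutant vs wild-type probability at one unmasked site.

    `site_probs` stands in for a substitution-head output at that
    position (assumed format: dict of amino acid -> probability).
    """
    return math.log(site_probs[mutant]) - math.log(site_probs[wild_type])


# Toy distribution at a single position whose wild-type residue is "A".
site_probs = {"A": 0.60, "G": 0.25, "V": 0.10, "P": 0.05}

score_conservative = substitution_score(site_probs, "A", "G")
score_disruptive = substitution_score(site_probs, "A", "P")
print(score_conservative, score_disruptive)
```

Under this convention a more negative score means the model considers the mutation less plausible in context, and ranking variants by score is what a Spearman correlation against DMS measurements would evaluate.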
