Topological Machine Learning for Protein-Nucleic Acid Binding Affinity Changes Upon Mutation
1.This study introduces TopoML, a new topological machine learning model that predicts how single-point mutations affect protein–nucleic acid binding affinities. It is the first to apply persistent Laplacian-based features to this domain and achieves state-of-the-art performance on both protein-DNA and protein-RNA datasets.
2.TopoML integrates three complementary feature types: persistent Laplacian-based topological features, physicochemical descriptors at both atom and residue levels, and protein sequence embeddings from a pretrained Transformer model (ESM-2). This multi-view representation captures both structural and sequence-level nuances of the binding interface.
3.For protein-RNA interactions, TopoML achieves a Pearson correlation coefficient (PCC) of 0.72 and a mean absolute error (MAE) of 0.77 kcal/mol, outperforming prior leading models such as PRA-MutPred (PCC 0.67). It also surpasses the energy-based baselines PEMPNI and PNBACE in benchmark settings.
4.On protein-DNA interactions, TopoML also leads: 10-fold cross-validation yields a PCC of 0.681 and an MAE of 0.612 kcal/mol, outperforming SAMPDI-3Dv2 (PCC 0.65). It also delivers improved prediction accuracy when trained and evaluated on standard benchmark splits.
5.Persistent Laplacians encode both topological (harmonic) and geometric (non-harmonic) properties of simplicial complexes formed at the mutation and binding regions. This richer representation outperforms classic persistent homology across many tasks and is central to the model’s predictive power.
6.The topological features alone provide strong performance: on protein-DNA interactions, using only topological features yields a PCC of 0.648, higher than using physicochemical or sequence features alone. This underscores the value of persistent Laplacian features for modeling mutation-induced changes.
7.The model uses gradient boosting trees for regression, combining the diverse features effectively. While this choice balances performance and interpretability, the authors suggest exploring ensemble and deep learning methods in future work to further improve accuracy.
8.Evaluation across mutation types (hydrophobic, polar, charged, alanine) and structural regions (core, surface, rim) shows that TopoML maintains robust predictive performance. However, prediction accuracy drops slightly for underrepresented mutation types, such as those involving positively charged residues, pointing to where the dataset could be expanded.
9.Interestingly, the model captures biophysically meaningful trends. For example, alanine substitutions tend to increase ∆∆G, reflecting the loss of favorable interactions. This agreement with known molecular principles supports the model's reliability.
10.TopoML's architecture and feature extraction pipeline are transparent and reproducible. The authors provide detailed procedures, from dataset curation and mutation modeling to topological complex construction and embedding computation.
11.The study opens promising directions for future research, including the use of persistent sheaf Laplacians or Dirac operators, and ensembling multiple learners. The current results demonstrate the untapped potential of topological approaches in understanding mutation impacts beyond protein-protein systems.
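To make point 5 concrete, here is a minimal sketch of how spectral features can be read off a Laplacian along a filtration. It is a deliberate simplification and not the authors' pipeline: it covers only dimension 0, where the Laplacian at each distance cutoff is the ordinary graph Laplacian, and it leaves out higher dimensions and persistence between two distinct scales. The function names, the cutoffs, and the toy coordinates are all hypothetical. The count of zero eigenvalues is the harmonic part (the number of connected components), and the statistics of the nonzero eigenvalues are the non-harmonic part.

```python
import numpy as np

def graph_laplacian(points, cutoff):
    """Dimension-0 combinatorial Laplacian: vertices are points, edges join pairs within `cutoff`."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adjacency = (dist <= cutoff) & ~np.eye(len(points), dtype=bool)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency.astype(float)

def spectral_features(points, cutoffs, tol=1e-8):
    """Per cutoff: [number of zero eigenvalues, min, max, mean of nonzero eigenvalues]."""
    rows = []
    for cutoff in cutoffs:
        eigvals = np.linalg.eigvalsh(graph_laplacian(points, cutoff))
        harmonic = int(np.sum(eigvals <= tol))   # harmonic spectrum: connected components
        nonzero = eigvals[eigvals > tol]         # non-harmonic spectrum: geometric information
        if nonzero.size:
            stats = [nonzero.min(), nonzero.max(), nonzero.mean()]
        else:
            stats = [0.0, 0.0, 0.0]              # no edges yet, so no nonzero eigenvalues
        rows.append([harmonic, *stats])
    return np.array(rows)

# Toy coordinates standing in for atoms near a mutation site (hypothetical values).
points = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5]], dtype=float)
features = spectral_features(points, cutoffs=[0.5, 1.2, 10.0])
print(features)
```

Each row is a fixed-length descriptor for one cutoff, so stacking rows across cutoffs gives a vector that a tree-based regressor can consume directly.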
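Points 2, 4, and 7 describe concatenating three feature views, fitting gradient boosting trees, and reporting PCC and MAE under 10-fold cross-validation. The sketch below illustrates that workflow with scikit-learn on synthetic data. Everything specific in it is an assumption for illustration: the feature dimensions, the hyperparameters, the random target, and the choice to pool out-of-fold predictions before computing the metrics. None of these are taken from the paper.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_mutations = 200

# Synthetic stand-ins for the three feature views (dimensions are arbitrary placeholders).
topo = rng.normal(size=(n_mutations, 12))       # topological features
physchem = rng.normal(size=(n_mutations, 8))    # physicochemical descriptors
seq_embed = rng.normal(size=(n_mutations, 16))  # sequence embeddings
X = np.hstack([topo, physchem, seq_embed])      # multi-view representation by concatenation

# Synthetic target playing the role of the binding affinity change, in kcal/mol.
y = (topo[:, 0] + 0.5 * physchem[:, 0] - 0.5 * seq_embed[:, 0]
     + rng.normal(scale=0.3, size=n_mutations))

def cross_validated_predictions(X, y, n_splits=10, seed=0):
    """Collect out-of-fold predictions so every sample is predicted exactly once."""
    preds = np.empty_like(y)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        model = GradientBoostingRegressor(
            n_estimators=100, learning_rate=0.1, max_depth=3, random_state=seed
        )
        model.fit(X[train_idx], y[train_idx])
        preds[test_idx] = model.predict(X[test_idx])
    return preds

preds = cross_validated_predictions(X, y)
pcc = np.corrcoef(y, preds)[0, 1]    # Pearson correlation coefficient
mae = np.mean(np.abs(y - preds))     # mean absolute error
print(f"PCC = {pcc:.3f}, MAE = {mae:.3f} kcal/mol")
```

Dropping one of the three arrays from the `np.hstack` call gives a simple way to run the kind of single-view comparison mentioned in point 6.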
💻Code:
github.com/LiuXiangMath/Topo…
📜Paper:
arxiv.org/abs/2505.22786v1
#ProteinDesign #MutationEffects #BindingAffinity #MachineLearning #TopologicalDataAnalysis #Bioinformatics #StructuralBiology #PersistentLaplacian #ProteinDNA #ProteinRNA