ProSiteHunter: A Unified Framework for Sequence-Based Prediction of Protein-Nucleic Acid and Protein-Protein Binding Sites
1. ProSiteHunter is presented as a unified, sequence-only framework that predicts residue-level binding sites across four interaction types: protein-DNA, protein-RNA, protein-protein, and antibody-antigen. The aim is to end the usual “one model per interface type” fragmentation among sequence-based predictors.
2. The core idea is multi-source sequence representation. It combines (i) SiteT5, a task-specific fine-tuned protein language model, for evolutionary/functional-site signals; (ii) ProstT5 embeddings for structure-related priors; (iii) geometric descriptors derived from sequence-predicted properties (secondary structure, relative solvent accessibility, symmetric position encoding); and (iv) statistical descriptors (BLOSUM62, physicochemical properties, amino-acid propensity).
3. A key architectural contribution is the Multi-Source Feature Fusion (MSFF) module with “three-track semantic parsing”: Scale-Aware Encoder (multi-kernel 1D CNNs for local patterns), Context-Aware Encoder (BiLSTM for bidirectional semantics), and Importance-Aware Encoder (gated self-attention for long-range dependencies). These tracks are mapped into Q/K/V and fused via cross-attention for dynamic alignment across feature spaces.
4. A second stage, Multi-Level Interaction Learning (MIL), stacks gated multi-head self-attention blocks plus position-wise feed-forward networks to iteratively refine interface signals, producing per-residue binding probabilities (thresholded at 0.5 for site calls).
5. SiteT5 is introduced as a task-adapted PLM derived from ProtT5-XL-UniRef50, fine-tuned with evolutionary information from sub-MSAs (generated with HHblits on UniRef30). Fine-tuning uses LoRA and updates only the last four decoder layers, yielding a relatively small number of trainable parameters while specializing to binding-site patterns.
6. On GraphBind-style temporal splits for nucleic-acid interfaces, ProSiteHunter reports strong gains over prior sequence methods (e.g., CLAPE variants, iDRNA-ITF, DRNApred), emphasizing PRAUC improvements under heavy class imbalance (site:non-site ≈ 1:10), alongside higher ROCAUC/F1/MCC.
7. On protein-protein binding sites (Seq-InSite dataset) and antibody-antigen epitopes (SEMA conformational epitope dataset), the same unified design remains competitive, reporting improvements over methods such as Seq-InSite/ISPRED-SEQ for PPI and CALIBER for antibody-antigen, with particularly notable PRAUC gains on the epitope task.
8. The paper positions ProSiteHunter as complementary to structure predictors: it highlights cases where structure-based approaches (including AlphaFold3) can mis-localize interfaces when structures are imperfect or when binding involves flexible regions, while sequence-driven predictions remain stable and can flag “local flexible sites.”
9. Ablations support the design rationale: removing SiteT5 or ProstT5 embeddings causes the largest drops (SiteT5 removal being most damaging), while removing geometric/statistical features yields smaller but consistent degradations; removing MSFF or MIL leads to substantial performance loss, with MSFF identified as the larger contributor.
10. A downstream demonstration integrates ProSiteHunter-predicted epitopes into an in-house antibody-antigen interaction predictor (Multi-sAAI), reporting improved interaction classification metrics (ROCAUC/F1/precision/recall) and case studies where predicted epitope features sharply increase predicted interaction probabilities for known therapeutic or broadly neutralizing antibody scenarios.
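As a rough illustration of points 3 and 4, here is a toy sketch of cross-attention fusion over three per-residue feature tracks, followed by the 0.5 threshold for site calls. It is not the paper's implementation. The function names, dimensions, and random weights are hypothetical, and the choice of which track supplies queries, keys, or values is an assumption, since the summary only says the tracks are mapped into Q/K/V. Plain Python is used so it runs without dependencies.

```python
import math
import random


def matmul(a, b):
    # (n, k) x (k, m) -> (n, m) on nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_tracks(scale_feats, context_feats, importance_feats, w_q, w_k, w_v):
    # Assumption: scale-aware track -> queries, context-aware track -> keys,
    # importance-aware track -> values. The source does not state this mapping.
    q = matmul(scale_feats, w_q)
    k = matmul(context_feats, w_k)
    v = matmul(importance_feats, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys to (d_model, L)
    scores = matmul(q, k_t)               # (L, L) residue-to-residue scores
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights    # fused features are (L, d_model)


def call_sites(fused, w_out, threshold=0.5):
    # Hypothetical linear head + sigmoid, then the 0.5 cutoff for site calls.
    probs = [
        1.0 / (1.0 + math.exp(-sum(f * w for f, w in zip(row, w_out))))
        for row in fused
    ]
    return probs, [p >= threshold for p in probs]


def rand_matrix(rng, rows, cols, scale=1.0):
    # Random stand-in for learned weights or encoder outputs.
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_in, d_model = 12, 16, 8  # toy sizes, not from the paper
scale_feats, context_feats, importance_feats = (
    rand_matrix(rng, seq_len, d_in) for _ in range(3)
)
w_q, w_k, w_v = (rand_matrix(rng, d_in, d_model, 0.1) for _ in range(3))
w_out = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

fused, attn = fuse_tracks(scale_feats, context_feats, importance_feats, w_q, w_k, w_v)
probs, is_site = call_sites(fused, w_out)
print(len(fused), len(fused[0]), sum(is_site))
```

Each row of `attn` sums to 1, so every residue's fused vector is a weighted mix of the value track across the whole sequence. In the framework as described, the gated self-attention blocks of the MIL stage would sit between the fusion step and the final per-residue probabilities; they are left out here to keep the sketch short.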
📜Paper:
doi.org/10.1002/advs.75931
#ComputationalBiology #Bioinformatics #ProteinScience #ProteinLanguageModels #DeepLearning #ProteinInteractions #EpitopePrediction #PPI #ProteinDNA #ProteinRNA #AntibodyEngineering #DrugDiscovery