Structure-aware geometric graph learning for modeling protease–substrate specificity at scale
1. The paper introduces OmniCleave, a unified structure-aware geometric graph learning framework that predicts protease cleavage sites across 103 proteases (6 superfamilies) in a single model, aiming to move beyond motif-centric predictors by explicitly modeling 3D context and inter-protease dependencies.
2. Key idea: represent each candidate cleavage site by a cleavage-centric hierarchical structural graph built from predicted substrate structures (AlphaFold DB/ESMFold). A residue-level subgraph captures residues within 10 Å of P1; an atom-level subgraph refines local chemistry/geometry. Edges encode distances (RBF) and directions (vectors), enabling geometric reasoning.
3. OmniCleave couples local substrate structure with a protease–protease interaction (PPI) module (STRING-derived). Proteases are nodes in a PPI graph; message passing provides relational priors so the model can learn cooperative or correlated cleavage behavior rather than treating proteases as isolated predictors.
4. Architecture: a hierarchical equivariant graph encoder (GET) updates atom- and residue-level representations, then a heterogeneous graph transformer links proteases to cleavage-site nodes, turning cleavage prediction into a protease–site link prediction problem that naturally supports many-to-one settings.
5. Scale of training data: 57,278 structure-informed protease–substrate cleavage events from MEROPS/UniProt, covering 9,651 substrates; negatives are balanced 1:1 with randomly sampled non-cleavage sites. Substrates are filtered for redundancy (CD-HIT; >70% identity removed), and training/test split is 7:3.
6. Benchmarking against six tools (Procleave, PROSPERous, DeepCleave, SitePrediction, PeptideCutter, ProsperousPlus) shows consistent gains. Reported results include AUC >0.9 for 48 proteases and >0.8 for 75 proteases; under a stricter similarity threshold (<30%), AUC >0.9 for 58 proteases and >0.8 for 74 proteases.
7. Many-to-one cleavage is treated as a first-class problem: MEROPS indicates some sites are cleaved by up to 20 proteases. On a many-to-one test subset, OmniCleave maintains high coverage even when ≥5 proteases target the same site, outperforming alternatives whose performance drops substantially—consistent with the benefit of PPI-informed relational learning.
8. Comparison with AlphaFold3 complex prediction (used as an interface-based proxy heuristic) suggests interface proximity alone misses many annotated sites. In case studies (Cathepsin L/E with P01317; MMP7 with P02671), OmniCleave recovers all annotated cleavage sites while AlphaFold3 identifies only a small subset.
9. Mechanistic interpretability: predictions mirror observed secondary-structure preferences of P1 residues across 54 human proteases (cleavages enriched in loops, α-helices, β-sheets, turns, bends). Feature perturbation highlights strong contributions from Rosetta energy terms (e.g., van der Waals/backbone constraints) and secondary-structure descriptors, supporting a geometry/energetics-driven view of specificity.
10. Experimental validation: in vitro Caspase-3 assays confirm 3 novel substrates (CUL7, THOC5, RPIA) and 21 cleavage sites detected by LC-MS/MS. OmniCleave correctly predicts 8/12 (THOC5), 5/6 (RPIA), and 8/12 (CUL7) sites, while Procleave predicts 0/12, 2/6, and 2/12, respectively; docking analyses provide plausible interaction rationales at example sites.
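To make the graph construction in point 2 concrete, here is a minimal NumPy sketch of a residue-level subgraph around P1 with RBF distance features and unit direction vectors on the edges. It is an illustration only, not the authors' code: the function name `build_residue_subgraph`, the use of Cα coordinates, the fully connected edge set, and the RBF settings (16 Gaussian centers spanning 0 to 20 Å) are all assumptions. Only the 10 Å radius around P1 comes from the summary above.

```python
import numpy as np

def build_residue_subgraph(ca_coords, p1_index, radius=10.0, n_rbf=16, d_max=20.0):
    """Hypothetical helper: residue subgraph centered on the P1 residue.

    ca_coords : (N, 3) array of C-alpha coordinates in angstroms (assumed input).
    p1_index  : index of the P1 residue in ca_coords.
    Returns (nodes, edge_index, rbf, direction).
    """
    ca_coords = np.asarray(ca_coords, dtype=float)

    # Keep residues whose C-alpha lies within `radius` angstroms of P1.
    dist_to_p1 = np.linalg.norm(ca_coords - ca_coords[p1_index], axis=1)
    nodes = np.flatnonzero(dist_to_p1 <= radius)
    sub = ca_coords[nodes]

    # Pairwise displacements inside the subgraph: diff[i, j] = x_j - x_i.
    diff = sub[None, :, :] - sub[:, None, :]
    dist = np.linalg.norm(diff, axis=-1)

    # Assumption: connect every ordered pair of distinct residues.
    src, dst = np.nonzero(~np.eye(len(nodes), dtype=bool))
    d = dist[src, dst]

    # Gaussian radial basis expansion of each edge length.
    centers = np.linspace(0.0, d_max, n_rbf)
    width = d_max / n_rbf
    rbf = np.exp(-((d[:, None] - centers[None, :]) / width) ** 2)

    # Unit vector pointing from the source residue to the destination residue.
    direction = diff[src, dst] / np.maximum(d[:, None], 1e-8)

    return nodes, np.stack([src, dst]), rbf, direction


# Toy usage: four residues on a line, with P1 at index 1.
coords = [[0, 0, 0], [3, 0, 0], [8, 0, 0], [15, 0, 0]]
nodes, edge_index, rbf, direction = build_residue_subgraph(coords, p1_index=1)
```

In the toy input the last residue sits 12 Å from P1, so it falls outside the subgraph. An equivariant encoder would treat `rbf` as invariant edge features and `direction` as the vector-valued part that rotates with the structure.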
💻Code:
github.com/ABILiLab/OmniClea…
📜Paper:
biorxiv.org/content/10.64898…
#ComputationalBiology #Bioinformatics #Proteomics #Protease #GraphNeuralNetworks #GeometricDeepLearning #StructuralBiology #AlphaFold #ProteinDesign #MachineLearning