From Atoms to Fragments: A Coarse Representation for Efficient and Functional Protein Design
1. The paper proposes a sparse, interpretable protein representation built from a curated alphabet of 40 evolutionarily conserved “ancient” structural fragments, aiming to replace scaling-heavy sequence or full-atom structure encodings for search and design.
2. Two complementary encodings are introduced: Fragment Sets (presence/absence of fragment types, ignoring arrangement) for speed-critical tasks, and Fragment Graphs (fragments as nodes; peptide-bond and spatial-proximity edges) to retain structural context needed for clustering and design.
3. Fragments are detected directly from backbone geometry with a sliding-window scan against a fragment library, comparing several distance metrics; combining two torsion-angle metrics (LogPr and RamRMSD) gives strong detection performance (F1 ≈ 0.85, AUROC ≈ 0.87) at an empirically selected classification threshold of 3.65%.
4. On the fold-balanced PDBench benchmark, fragments cover ~40% of residues on average and exhibit distinct biophysical patterns: more intra-fragment hydrogen bonding (notably in mainly-β folds, ~+15%), fewer inter-fragment hydrogen bonds (notably in mainly-α folds, ~-47%), and slightly reduced solvent accessibility (~-5%), consistent with fragments behaving as more “self-contained” structural units.
5. To test functional signal retention, the authors curate a Protein Function Dataset (PFD) of 215 monomeric proteins spanning 12 binding-function categories (DNA/RNA/ATP/GTP/metal and combinations) filtered to ≤30% sequence identity, making functional grouping challenging for standard similarity measures.
6. Fragment-based distances produce more information-dense embeddings than sequence (BLOSUM) or global shape alignment (RMSD): after PCoA, BagOfNodes (Fragment Sets) preserves >95% variance within 20 dimensions and GraphEditDistance (Fragment Graphs) >80%, vs <60% (BLOSUM) and <40% (RMSD).
7. Functional clustering improves with fragments in multiple ways: BagOfNodes yields very strong cluster compactness/separation (Silhouette ≈ 0.82), while GraphEditDistance best aligns clusters with functional labels (ARI ≈ 0.046; F1 ≈ 0.20), suggesting a practical tradeoff between ultra-compact “bag” features and context-aware graph structure.
8. For functional database search, fragment representations dramatically reduce “tokens per protein” (memory/data points): ~99% fewer than atom/backbone representations and ~94–98% fewer than residue-level sequence representations, while achieving retrieval quality comparable to RMSD/BLOSUM across functions (AUROC/NDCG broadly similar, with some function-specific wins per method).
9. Speed benchmarks (100 queries vs a 100-protein database, 35 cores) show the practical payoff: Fragment Sets (BagOfNodes) answer in ~0.07 s, compared with ~36.6 s for BLOSUM and ~1717 s for RMSD; Fragment Graph edit distance is slower than sequence but still far faster than RMSD (~573 s vs ~1717 s), with a one-time preprocessing cost to build fragment representations.
10. Fragments also serve as functional “blueprints” for generative design: detected fragment backbones are held fixed as templates and RFDiffusion fills in the missing regions. Functional recovery, assessed by FoldSeek hits and GO-code agreement, often exceeds 40% and is near-perfect for some classes (e.g., metal-binding), while random “naive fragments” largely fail, supporting the view that evolutionary fragment choices, not arbitrary geometry, drive the functional signal.
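The sliding-window detection in point 3 can be illustrated with a minimal sketch. This is not the paper's implementation: the wrap-aware torsion-angle RMSD below is a simplified stand-in for the LogPr and RamRMSD metrics, and the function names, the toy helix-like fragment, and the 10-degree threshold are all hypothetical choices made for illustration.

```python
import math
from typing import List, Tuple

# (phi, psi) backbone torsion angles in degrees, one pair per residue.
Torsions = List[Tuple[float, float]]


def angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, with wrap-around."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def torsion_rmsd(window: Torsions, fragment: Torsions) -> float:
    """RMSD over phi/psi angles of two equal-length stretches (assumed metric)."""
    total = 0.0
    for (phi_w, psi_w), (phi_f, psi_f) in zip(window, fragment):
        total += angle_diff(phi_w, phi_f) ** 2 + angle_diff(psi_w, psi_f) ** 2
    return math.sqrt(total / (2 * len(fragment)))


def scan(backbone: Torsions, fragment: Torsions,
         threshold: float) -> List[Tuple[int, float]]:
    """Slide the fragment along the backbone; keep windows scoring <= threshold."""
    n = len(fragment)
    hits = []
    for start in range(len(backbone) - n + 1):
        score = torsion_rmsd(backbone[start:start + n], fragment)
        if score <= threshold:
            hits.append((start, score))
    return hits


# Toy data: a helix-like fragment embedded in an extended (strand-like) chain.
fragment = [(-60.0, -45.0)] * 4
backbone = [(-120.0, 130.0)] * 3 + [(-62.0, -43.0)] * 4 + [(-120.0, 130.0)] * 3
hits = scan(backbone, fragment, threshold=10.0)  # hypothetical threshold
print(hits)
```

In a real pipeline the scan would run once per library fragment, and the threshold would be tuned against labelled data, as the thread describes for the paper's own metrics.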
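To make the Fragment Sets idea from points 2 and 9 concrete, here is a small sketch of a presence/absence representation and a set-based database search. It assumes Jaccard distance as the comparison, which may differ from the paper's BagOfNodes distance; the fragment IDs, protein names, and the `search` helper are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Each protein is reduced to the set of fragment-type IDs detected in it
# (IDs drawn from a 40-letter fragment alphabet, arrangement ignored).
FragmentSet = FrozenSet[int]


def jaccard_distance(a: FragmentSet, b: FragmentSet) -> float:
    """1 - |intersection| / |union|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def search(query: FragmentSet, database: Dict[str, FragmentSet],
           top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank database proteins by distance to the query (ties broken by name)."""
    scored = [(name, jaccard_distance(query, frags))
              for name, frags in database.items()]
    scored.sort(key=lambda item: (item[1], item[0]))
    return scored[:top_k]


# Toy database with made-up fragment assignments.
database = {
    "protein_a": frozenset({1, 5, 12}),
    "protein_b": frozenset({1, 5, 30}),
    "protein_c": frozenset({7, 22}),
}
query = frozenset({1, 5, 12, 30})
hits = search(query, database, top_k=2)
print(hits)
```

Because each comparison touches only a handful of integers per protein, this style of search is cheap relative to sequence or structural alignment, which is consistent with the speed gap reported in point 9.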
💻Code:
github.com/wells-wood-resear…
📜Paper:
doi.org/10.1093/bioinformati…
#ProteinDesign #ComputationalBiology #Bioinformatics #ProteinStructure #MachineLearning #DiffusionModels #ProteinSearch #GraphLearning #StructuralBiology #RepresentationLearning