Modeling Bias Toward Binding Sites in PDB Structural Models
1. Protein structural models from the PDB, central to biology and machine learning, show significant modeling biases: binding sites are better modeled and fit the experimental data more accurately than non-binding regions.
2. Measured by local data-fit metrics such as RSCC, RSR, and EDIAm, binding site residues consistently fit the experimental data better, with higher RSCC (0.96 vs. 0.94) and lower RSR (0.058 vs. 0.076), pointing to a focus on "important" regions during manual modeling.
3. These trends persist regardless of resolution or Rfree values, showing that good global quality metrics do not rule out local biases in how structures are refined.
4. Binding site residues are more likely to have alternative conformations (5.0% vs. 1.9% elsewhere), indicating that modelers pay greater attention to these areas, manually improving their fit to experimental data.
5. Non-ideal side-chain rotamers at binding sites are better supported by electron density, indicating that unusual conformations in binding regions are more likely biologically meaningful than artifacts of poor modeling.
6. Pocket residues identified in structures without ligands exhibit similar, though less pronounced, trends, suggesting these biases stem from modeling decisions rather than biological differences alone.
7. This bias toward binding sites has profound implications: structural models are used as "truth" in simulations, docking studies, and machine learning algorithms. Overlooking non-binding regions may propagate errors in downstream analyses.
8. Recognizing these biases highlights the need for improved automated modeling techniques and local metrics to validate entire protein structures, ensuring unbiased biological interpretations and reliable machine learning inputs.
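For readers unfamiliar with the local fit metrics in point 2, here is a minimal sketch of the textbook definitions of RSCC (real-space correlation coefficient) and RSR (real-space R-value) for one residue. It assumes the observed and model-calculated electron density have already been sampled at the same grid points around that residue; the function names and toy numbers are illustrative only and are not taken from the paper or from any validation package.

```python
import math


def rscc(rho_obs, rho_calc):
    """Pearson correlation between observed and calculated density values."""
    n = len(rho_obs)
    mean_o = sum(rho_obs) / n
    mean_c = sum(rho_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(rho_obs, rho_calc))
    var_o = sum((o - mean_o) ** 2 for o in rho_obs)
    var_c = sum((c - mean_c) ** 2 for c in rho_calc)
    return cov / math.sqrt(var_o * var_c)


def rsr(rho_obs, rho_calc):
    """Real-space R-value: sum|obs - calc| / sum|obs + calc| (lower is better)."""
    num = sum(abs(o - c) for o, c in zip(rho_obs, rho_calc))
    den = sum(abs(o + c) for o, c in zip(rho_obs, rho_calc))
    return num / den


# Hypothetical density samples around a single residue (made-up numbers).
observed = [0.9, 1.4, 2.1, 1.2, 0.4]
calculated = [1.0, 1.5, 2.0, 1.0, 0.5]

print(f"RSCC = {rscc(observed, calculated):.3f}")
print(f"RSR  = {rsr(observed, calculated):.3f}")
```

A higher RSCC and a lower RSR both mean the modeled residue agrees more closely with the density, which is the direction of the binding site vs. non-binding difference reported above. Real validation tools also handle map scaling and atom-radius masking, which this sketch leaves out.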
@stephanie_mul
📜Paper:
biorxiv.org/content/10.1101/…
#ProteinStructures #PDB #StructuralBiology #MachineLearning #BiasInData #ComputationalBiology