Protein Structure Alignment Significance Is Often Exaggerated
Computational biology researchers have identified a significant issue in protein structure alignment: the statistical significance of alignments is frequently overestimated. The problem is exacerbated by the vast number of high-quality predicted protein structures now available from machine learning. Unrelated proteins often share secondary and tertiary motifs through convergent evolution, which produces an excess of high-scoring false positive alignments.
Previous methods for estimating significance, often relying on Gumbel or Gaussian distribution fits, fail to accurately model the high-scoring tail of false positive distributions. This leads to routine overestimation of significance, in some cases by up to six orders of magnitude, making it challenging to distinguish true biological relationships from random similarities.
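To make the tail problem concrete, here is a minimal Python sketch. It is not taken from the paper or from Reseek. It fits a Gumbel distribution by the method of moments to synthetic scores with a small heavy-tailed component, then compares the fitted tail probability with the observed one. The score distribution, the mixture proportions, the threshold, and all function names are assumptions chosen only for illustration.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Method-of-moments Gumbel fit; returns (location mu, scale beta)."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.fmean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_tail(s, mu, beta):
    """P(score >= s) under a Gumbel(mu, beta) distribution."""
    return -math.expm1(-math.exp(-(s - mu) / beta))


def empirical_tail(scores, s):
    """Fraction of observed scores that are >= s."""
    return sum(x >= s for x in scores) / len(scores)


# Synthetic stand-in for false-positive scores (NOT real alignment data):
# a Gaussian bulk plus a 1% heavy-tailed component.
rng = random.Random(0)
scores = [rng.gauss(0.10, 0.03) for _ in range(99_000)]
scores += [0.20 + rng.expovariate(1 / 0.15) for _ in range(1_000)]

mu, beta = fit_gumbel_moments(scores)
threshold = 0.60
fitted = gumbel_tail(threshold, mu, beta)
observed = empirical_tail(scores, threshold)
print(f"Gumbel fit tail:  {fitted:.3g}")
print(f"Empirical tail:   {observed:.3g}")
print(f"Underestimate by: {observed / fitted:.3g}x")
```

In this toy setting the fit is dominated by the bulk of the scores, so it assigns the rare high scores a far smaller probability than they actually have. The size of the gap here reflects the synthetic mixture, not any measurement on real structures.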
To address this, the authors introduce a novel method for robust statistical significance estimation. Their approach, implemented in the software Reseek, provides E-values that remain accurate as database size increases and that are robust to the unknown diversity of protein folds within a database. This matters at the current scale of protein structure data.
A key aspect of this new framework is its ability to correctly account for both intrinsic and data-dependent filters used by modern fast structure search algorithms like Foldseek and Reseek. These filters, which optimize speed by reducing the number of alignments, significantly impact the distribution of scores and must be considered for reliable significance measures.
The study also investigates existing tools, noting that Foldseek E-values can be substantially underestimated as database size grows, and proposes a correction formula. Furthermore, the probability distribution of false positive scores, P(s|FP), is shown to be largely universal, meaning it is independent of the database's size or alpha diversity, which simplifies accurate E-value calculation.
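If the false positive score distribution really is independent of the database, an E-value reduces to a database size multiplied by a fixed tail probability. The Python sketch below illustrates that idea under stated assumptions: the tail P(score >= s | FP) is estimated empirically from a hypothetical calibration set of false positive scores, and the expected number of false positives is taken to be `db_size` times that tail. This is not Reseek's implementation and not the correction formula proposed for Foldseek, and it ignores the search filters discussed above. The names `make_tail` and `e_value` are invented for the example.

```python
from bisect import bisect_left


def make_tail(fp_scores):
    """Build an empirical estimate of P(score >= s | FP).

    fp_scores: calibration scores from known false-positive alignments
    (hypothetical input; any database-independent estimate would do).
    """
    ordered = sorted(fp_scores)
    n = len(ordered)

    def tail(s):
        # Count calibration scores >= s and divide by the total.
        return (n - bisect_left(ordered, s)) / n

    return tail


def e_value(score, db_size, tail):
    """Expected number of false positives scoring >= score in db_size comparisons."""
    return db_size * tail(score)


# Toy calibration set, for illustration only.
tail = make_tail([0.1, 0.2, 0.3, 0.4])
for db_size in (1_000, 1_000_000):
    print(db_size, e_value(0.3, db_size, tail))
```

Because the tail estimate is fixed, the E-value for a given score grows linearly with database size. An empirical tail returns zero above the largest calibration score, so a practical method needs some model of the far tail, which is exactly where the fits described earlier go wrong.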
This work provides a more reliable foundation for protein structure analysis in the era of large-scale structural genomics. The insights and the robust E-value estimation method in Reseek are crucial for accurate homolog detection and functional inference from protein structures.
📜Paper:
biorxiv.org/content/10.1101/…
#ComputationalBiology #ProteinStructure #Bioinformatics #MachineLearning #StructuralBiology #Reseek #FalsePositives #EValues #BioinformaticsResearch