GOBoost: Leveraging Long-Tail Gene Ontology Terms for Accurate Protein Function Prediction
1. The paper introduces GOBoost, a protein function prediction method that tackles the long-tail distribution of Gene Ontology (GO) terms with an ensemble strategy.
2. GOBoost employs three specialized base models (Head, Tail, and All) to focus on high-frequency, medium, and low-frequency labels, ensuring balanced prediction across all GO terms.
3. A novel global-local label graph module dynamically captures the co-occurrence relationships among GO terms, particularly enhancing predictions for rare, low-frequency functions.
4. The multi-granularity focal loss function in GOBoost assigns higher weights to underrepresented GO terms, improving model focus and performance on specific functions.
5. Experimental evaluations show that GOBoost outperforms state-of-the-art methods like HEAL by substantial margins in AUPR, Fmax, and Smin metrics on both PDB and AlphaFold2 datasets.
6. On the PDB dataset, GOBoost improves AUPR for biological process (BP) terms by 35.91% over HEAL, which points to its effectiveness on complex protein functions.
7. On the challenging AlphaFold2 (AF2) dataset, where protein sequence similarity is low, GOBoost reduces reliance on sequence-based annotations by leveraging structural and GO co-occurrence information.
8. The ablation studies confirm the importance of the ensemble strategy and long-tail optimization, revealing that each component significantly enhances the overall prediction accuracy and robustness.
9. GOBoost’s framework is adaptable and scalable, making it a promising tool for addressing the imbalanced distribution in protein function prediction tasks.
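To make point 4 concrete, here is a minimal sketch of a frequency-weighted focal loss for multi-label GO prediction. This is not the paper's multi-granularity loss, whose exact form is not given in this thread. It is a generic focal loss with per-label weights, and the function names, the inverse-frequency weighting, and `gamma=2.0` are all assumptions chosen for illustration.

```python
import numpy as np


def inverse_frequency_weights(targets):
    """Per-label weights that grow as a label gets rarer (illustrative choice).

    targets: (n_proteins, n_labels) binary matrix of GO annotations.
    Weights are normalized so their mean is 1.
    """
    freq = targets.mean(axis=0)
    # Floor the frequency so labels with no positives do not divide by zero.
    weights = 1.0 / np.maximum(freq, 1.0 / targets.shape[0])
    return weights / weights.mean()


def weighted_focal_loss(probs, targets, label_weights, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over proteins and labels.

    probs: predicted probabilities, same shape as targets.
    label_weights: (n_labels,) vector, e.g. from inverse_frequency_weights.
    gamma: focusing parameter; gamma=0 recovers weighted cross-entropy.
    """
    probs = np.clip(probs, eps, 1.0 - eps)
    # p_t is the probability the model assigns to the true class of each entry.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t)
    return float((loss * label_weights).mean())
```

With two labels annotated on 75% and 25% of proteins, `inverse_frequency_weights` gives the rarer label three times the weight of the common one, so its errors count more in the average.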
@cao_renzhi
💻Code:
github.com/Cao-Labs/GOBoost
📜Paper:
biorxiv.org/content/10.1101/…
#ProteinFunction #Bioinformatics #GOBoost #DeepLearning #GraphModels #GeneOntology