Biology AI Daily

May 5

MZSGO: Multimodal zero-shot protein function annotation via evolutionary signals and textual semantics

1. MZSGO reframes protein function prediction as a Protein–GO matching task, enabling true zero-shot inference for GO terms never seen during training by comparing protein evidence to GO term definitions in a shared embedding space.
2. The key multimodal idea: combine evolutionary sequence signals (ESM2-650M embeddings) with textual semantics from (a) protein domain descriptions and (b) GO term definitions, both encoded by a unified text embedding model (Qwen3-Embedding-4B).
3. Instead of treating domains and labels as categorical IDs, MZSGO uses their natural-language descriptions/definitions as semantic anchors, so "what a GO term means" becomes directly learnable and transferable to new/rare labels.
4. Architecture highlights: (i) modality-specific projection MLPs to a shared latent space, (ii) asymmetric modality dropout that randomly removes sequence or domain features during training (but keeps GO text), and (iii) label-aware adaptive gated fusion to weight sequence/domain/label signals per protein–GO pair.
5. Why asymmetric dropout matters: it simulates real annotation settings where protein-derived evidence (domains) can be missing or incomplete, while GO term text is always available at inference. This improves robustness and reduces reliance on any single protein modality.
6. Why gated fusion matters: naive concatenation can inject noise when modalities contribute unevenly; the learned gate assigns context-dependent weights (sequence vs domain vs GO definition), improving both supervised accuracy and especially generalization to unseen labels.
7. Dataset design targets realistic zero-shot evaluation: training uses CAFA5 Swiss-Prot with a GO version cutoff (Jan 2023), while zero-shot labels are defined by a later GO release (Oct 2025). Test proteins are homology-filtered (Diamond, removing those with >30% identity to training proteins) to prevent leakage.
8. Performance summary (standard test set): MZSGO is competitive on supervised metrics (Fmax/AUPR) and stands out on unseen-label generalization. Example: BP Unseen AUPR 0.2393 vs ProtNote 0.0303; MF Unseen AUPR 0.4806; CC Unseen AUPR 0.5862. Harmonic mean (seen/unseen) also improves strongly (BP 0.3241, MF 0.5821, CC 0.6669).
9. True temporal zero-shot benchmark (new GO terms added after training cutoff): MZSGO improves both precision and overall reliability vs ProtNote, e.g., CC Fmax 0.6610 vs 0.2983, with much higher precision (0.5521 vs 0.1853), suggesting fewer text-driven false positives.
10. Ablations indicate zero-shot transfer depends on semantic domain text (not one-hot domains), and on adaptive fusion and modality dropout. Removing dropout can slightly help seen-label Fmax but hurts zero-shot balance; replacing gated fusion with concatenation worsens the seen/unseen trade-off.

💻Code: github.com/toxic-byte/MZSGO
📜Paper: doi.org/10.1093/bioinformati…

#ProteinFunctionPrediction #GeneOntology #ZeroShotLearning #ProteinLanguageModels #LLM #MultimodalAI #ComputationalBiology #Bioinformatics #CAFA #SwissProt
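
To make points 4 to 6 concrete, here is a minimal pure-Python sketch of asymmetric modality dropout followed by a label-aware gated fusion. It is not taken from the MZSGO repository: the function names, the softmax form of the gate, zeroing a vector as the "removal" operation, dropping at most one protein modality per call, and the linear scorer are all assumptions made for illustration. Inputs are assumed to be already projected into the shared latent space.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def asymmetric_modality_dropout(seq, dom, p_drop, rng):
    """With probability p_drop, zero out ONE protein-side vector.

    The GO text embedding is never passed in, so it can never be dropped;
    that is the "asymmetric" part. Zeroing is an assumed removal scheme.
    """
    if rng.random() < p_drop:
        if rng.random() < 0.5:
            seq = [0.0] * len(seq)
        else:
            dom = [0.0] * len(dom)
    return list(seq), list(dom)


def gated_fusion(seq, dom, go, gate_weights):
    """Fuse three same-sized vectors with a softmax gate.

    gate_weights holds three hypothetical weight vectors (one per modality),
    each applied to the concatenation [seq; dom; go]. Because the GO vector
    feeds the gate, the weights differ per protein-GO pair.
    """
    concat = seq + dom + go
    gates = softmax([dot(w, concat) for w in gate_weights])
    fused = [
        gates[0] * s + gates[1] * d + gates[2] * g
        for s, d, g in zip(seq, dom, go)
    ]
    return fused, gates


def match_score(fused, scorer_weights, bias=0.0):
    """Sigmoid of a linear score: an assumed protein-GO match probability."""
    return 1.0 / (1.0 + math.exp(-(dot(fused, scorer_weights) + bias)))
```

In training, `asymmetric_modality_dropout` would be called before `gated_fusion` on every protein–GO pair; at inference it is skipped, and a protein with no domain annotation would simply pass a zero domain vector, which is the situation the dropout is meant to rehearse.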
